Transforming Natural Language Text to SQL
Comprehending natural language text, with its inherent challenges of ambiguity and co-reference, has been a longstanding problem in Natural Language Processing (NLP). One manifestation of this problem is enabling question answering on a dataset by converting natural, free-flowing query text into a structured query format. This remains one of the open challenges in NLP, and it has garnered a lot of interest lately due to the availability of efficient language models.
The ability to structure natural language questions into Structured Query Language (SQL) has many potential use-cases:
- Making data-driven insights accessible to users without coding skills
- Reducing the time to insights in a domain
- Improving the value of accumulated data
In this blog, we explore techniques of using semantic parsing in combination with syntactic mapping to create a domain-independent methodology of transforming natural language queries to structured query language.
As the volume of digital data has increased, a large proportion of it remains unanalyzed for want of:
- infrastructure to manage it
- efficient methods to process it
- enough people with the technical skills to work with it
NLP researchers have approached text-understanding challenges with methods such as rule-based mapping, and bi-directional encoding combined with reinforcement learning. Though these models can be accurate, the major barrier to implementing any deep-learning-based model is:
- the availability of large, clean, tagged datasets that can be used for training a natural language model
There is a large body of literature on building natural language frontends to databases; (Androutsopoulos, Ritchie, & Thanisch, 1995) provide a comprehensive overview of these techniques. (Blackburn & Bos, 2005) discussed computational semantics, while (Lappin, 1996) and (Benthem & Meulen, 1997) covered more advanced approaches to semantic parsing, including how to handle tenses and generalized quantifiers in grammar annotations.
However, most supervised and semi-supervised systems we came across were domain-specific (Vadim Sheinin et al., 2018) and thus, these efforts would be re-done for other data types and industries. Some challenges of that approach are:
(i) the need for a human-tagged corpus of questions and their respective SQL queries for each industry
(ii) word embeddings mapped to context words to capture contextual information
(iii) the time to build a working prototype, which is unfavorable for business return on investment (ROI)
A lot of research on semantic parsing over structured databases has shown that semi-supervised training on question-answer labeled data leads to accurate translations. But in real-world applications, collecting such domain-specific datasets is sub-optimal.
In our method, the NLQ-to-SQL transformation consists of three steps:
- (i) rule-based parsing
- (ii) aggregation of logics
- (iii) combining parsed info with syntactical operators
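As a minimal sketch of how these three steps could be wired together (the rule patterns, function names, and the tiny column lookup are illustrative assumptions, not the production implementation):

```python
import re

# Step (i): rule-based parsing -- match hypothetical keyword patterns
# in the question to SQL SELECT fragments.
KEYWORD_RULES = {
    r"\baverage of (\w+)\b": lambda m: f"AVG({m.group(1)})",
    r"\bshow all\b": lambda m: "*",
}

def parse_question(question):
    """Return the SELECT fragment produced by the first matching rule."""
    for pattern, build in KEYWORD_RULES.items():
        m = re.search(pattern, question, flags=re.IGNORECASE)
        if m:
            return build(m)
    return "*"

# Step (ii): aggregation of logics -- collect WHERE conditions by
# matching known column values mentioned in the question.
def build_conditions(question, columns):
    conditions = []
    for col, values in columns.items():
        for value in values:
            if value.lower() in question.lower():
                conditions.append(f"{col} LIKE '%{value}%'")
    return conditions

# Step (iii): combine the parsed info with syntactic operators (AND).
def to_sql(question, columns, table="data"):
    select = parse_question(question)
    where = " AND ".join(build_conditions(question, columns))
    sql = f"SELECT {select} FROM {table}"
    return f"{sql} WHERE {where};" if where else f"{sql};"

columns = {"Province_State": ["Beijing", "Tibet"]}
print(to_sql("Show all cases in Beijing", columns))
# SELECT * FROM data WHERE Province_State LIKE '%Beijing%';
```

A real system would drive the column lookup from the SQL schema itself rather than a hard-coded dictionary, which is what keeps the approach domain-independent.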
Offline training and real-time tokenization and mapping
The main idea is to create patterns and matches, both full and partial, and feed them into SQL schema-based rules to transform the natural-text question. An important step here is question annotation: we need to determine how the algorithm will detect whether a question contains numeric annotators, date annotators, or even timeline-based annotators.
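One simple way to detect such annotators is with regular-expression rules, one per annotator type. The patterns below are illustrative assumptions; a production rule set would be larger and schema-driven:

```python
import re

# Illustrative annotator patterns, one per annotation type.
ANNOTATORS = {
    "numeric": r"\b\d+(\.\d+)?\b",
    "date": r"\b\d{4}-\d{2}(-\d{2})?\b",
    "timeline": r"\b(\d+\s+(day|week|month|year)s?\s+ago|last\s+(day|week|month|year))\b",
}

def annotate(question):
    """Return the set of annotator types detected in the question."""
    return {name for name, pattern in ANNOTATORS.items()
            if re.search(pattern, question, flags=re.IGNORECASE)}

print(annotate("Show all cases in Beijing from 3 months ago"))
# {'numeric', 'timeline'} (set order may vary)
```

A timeline annotator like "3 months ago" would then be resolved against the current date before being turned into a WHERE condition.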
Since the idea is to keep it as general as possible, we performed contraction mapping to ensure the algorithm is able to identify key terms and map them to respective SQL functions.
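A sketch of that normalization step, assuming small hypothetical lookup tables (the real tables would be much larger):

```python
# Hypothetical lookup tables: contractions/abbreviations to expand,
# and key terms mapped to their SQL aggregate functions.
CONTRACTIONS = {"what's": "what is", "don't": "do not", "avg": "average"}
SQL_FUNCTIONS = {"average": "AVG", "total": "SUM", "count": "COUNT",
                 "highest": "MAX", "lowest": "MIN"}

def normalize(question):
    """Expand contractions so key terms are detected consistently."""
    tokens = [CONTRACTIONS.get(t.lower(), t.lower()) for t in question.split()]
    return " ".join(tokens)

def detect_functions(question):
    """Map normalized key terms to their SQL aggregate functions."""
    normalized = normalize(question)
    return [fn for term, fn in SQL_FUNCTIONS.items() if term in normalized]

print(detect_functions("What's the avg of Recovered?"))
# ['AVG']
```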
The graph below demonstrates the general process of how we built the NLQ-to-SQL application. There are two main components of the application:
- Offline NLP Model Training and Storage
- Real-Time User Question Tokenization and Mapping
Here are some examples of what text-to-SQL mapping looks like on a Covid-19 dataset:
Q1: “Show all cases in Beijing from 3 months ago” would translate to:
SELECT * FROM data WHERE ((LastUpdate LIKE '%2021-02%' OR ObservationDate LIKE '%2021-02%') AND Province_State LIKE '%Beijing%');
Q2: "Show average of Recovered in %Mainland China%" would translate to:
SELECT AVG(Recovered) AS AVG_OF_COLUMN FROM data WHERE Country_Region = 'Mainland China';
Q3: "Tell me the details about Tibet and %Mainland China%" would translate to:
SELECT * FROM data WHERE Province_State = 'Tibet' AND (Country_Region = 'Mainland China');
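A condensed sketch of how a Q2-style aggregate query might be assembled once the function, column, and filter have been detected (the helper name is hypothetical; the column and table names follow the examples above):

```python
def aggregate_query(function, column, filter_col, filter_val, table="data"):
    """Assemble an aggregate query in the shape of the Q2 example above."""
    return (f"SELECT {function}({column}) AS {function}_OF_COLUMN "
            f"FROM {table} WHERE {filter_col} = '{filter_val}';")

print(aggregate_query("AVG", "Recovered", "Country_Region", "Mainland China"))
# SELECT AVG(Recovered) AS AVG_OF_COLUMN FROM data WHERE Country_Region = 'Mainland China';
```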
There’s a lot of ongoing research in the NLP domain to address the additional challenges of handling multiple tables using various JOIN and sub-query methods. This is particularly useful when we have different data sources to merge to reach a solution.
In cases of custom solutions, the process is comparatively easier since we can train the models to recognize custom vocabulary as keywords. The bigger challenge is when we try to create a platform that is domain agnostic. In such cases, iterative testing and training is the best feasible method.
Similar open-source references: