The task of NL2SQL is to convert natural language questions into SQL queries that retrieve answers from a database. Existing methods that use Large Language Models (LLMs) to guide SQL generation struggle with large databases and complex multi-table queries, especially when it comes to filtering out redundant schema information and constructing prompts efficiently.

Figure 1: Overview of the RB-SQL framework.

The RB-SQL Framework

To address the above issues, the RB-SQL framework is proposed, which contains three modules:

Table-Retriever

Retrieves the tables most relevant to the question.

Column-Retriever

Further retrieves relevant columns from the retrieved tables.

SQL-Skeleton-Retriever

Searches for a small number of examples with similar SQL skeletons and introduces the SQL skeletons into the example organization to enhance the in-context learning process.

The RB-SQL framework uses the Dense Passage Retrieval (DPR) model to retrieve the relevant tables, columns, and examples needed to construct an effective prompt. In addition, the framework introduces SQL skeletons as an intermediate step in the example organization to guide the model toward correct SQL generation.
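As a rough illustration of how the retrieved pieces could be assembled into a prompt, a minimal sketch follows. The layout, function name, and field names are assumptions for illustration, not the paper's actual template:

```python
def build_prompt(question, tables, columns, examples):
    """Assemble an NL2SQL prompt from retrieved context.

    NOTE: illustrative layout only -- not the paper's exact template.
    `tables` is a list of retrieved table names, `columns` maps a table
    name to its retrieved columns, and each example carries a question,
    its SQL skeleton, and the gold SQL.
    """
    # Schema section: only the retrieved tables, restricted to the
    # retrieved columns, so redundant schema text never enters the prompt.
    schema = "\n".join(
        f"Table {t}({', '.join(columns.get(t, []))})" for t in tables
    )
    # Few-shot section: the SQL skeleton sits between the question and
    # the gold SQL, acting as an intermediate reasoning step.
    shots = "\n\n".join(
        f"Question: {ex['question']}\n"
        f"SQL skeleton: {ex['skeleton']}\n"
        f"SQL: {ex['sql']}"
        for ex in examples
    )
    return f"{schema}\n\n{shots}\n\nQuestion: {question}\nSQL:"
```

The prompt ends at `SQL:`, so the LLM's completion is the predicted query for the new question.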


Workflow of the Modules

Table-Retriever

  1. Uses BERT to separately encode the question and each table
  2. Computes question–table similarity scores via MaxSim-based late interaction
  3. Retrieves the tables most relevant to the question
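The scoring step above can be sketched in NumPy. This is a minimal sketch of MaxSim-style late interaction over precomputed token embeddings; the function names and top-k cutoff are illustrative, not from the paper:

```python
import numpy as np

def maxsim_score(q_emb: np.ndarray, t_emb: np.ndarray) -> float:
    """MaxSim late interaction between question and table token embeddings.

    q_emb: (num_question_tokens, dim), t_emb: (num_table_tokens, dim).
    Each question token is matched to its most similar table token by
    cosine similarity, and the per-token maxima are summed.
    """
    q = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)  # unit vectors
    t = t_emb / np.linalg.norm(t_emb, axis=1, keepdims=True)
    sim = q @ t.T                        # pairwise cosine similarities
    return float(sim.max(axis=1).sum())  # best table token per question token

def retrieve_tables(q_emb, table_embs, k=4):
    """Rank candidate tables by MaxSim score and keep the top-k names."""
    scores = {name: maxsim_score(q_emb, emb) for name, emb in table_embs.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

In practice `q_emb` and each table's embedding matrix would come from the BERT encoder; here they are plain arrays so the scoring logic stands alone.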

Column-Retriever

  1. Retrieves, from the selected tables, the columns most relevant to the question

SQL-Skeleton-Retriever

  1. Searches the training set for a small number of examples with similar SQL skeletons
  2. Introduces the retrieved SQL skeletons into the example organization to strengthen in-context learning
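One plausible way to derive such a skeleton is to mask literals and schema-specific identifiers while keeping SQL keywords and operators. This is a sketch under that assumption; the paper's exact masking rules may differ:

```python
import re

def sql_skeleton(sql: str) -> str:
    """Reduce a SQL query to a structural skeleton.

    NOTE: one plausible skeleton definition (mask values and schema names,
    keep keywords); the paper's exact rules are not reproduced here.
    """
    s = re.sub(r"'[^']*'", "_", sql)        # mask string literals
    s = re.sub(r"\b\d+(\.\d+)?\b", "_", s)  # mask numeric literals
    keywords = {
        "select", "from", "where", "group", "by", "order", "having",
        "join", "on", "and", "or", "not", "in", "as", "distinct",
        "count", "avg", "sum", "min", "max", "limit", "desc", "asc",
    }
    tokens = re.findall(r"\w+|\S", s)
    # keep keywords and punctuation; mask everything else (tables, columns)
    masked = [t if t.lower() in keywords or not t[0].isalpha() else "_"
              for t in tokens]
    return " ".join(masked)
```

Two questions whose queries share a skeleton (e.g. `SELECT _ FROM _ WHERE _ > _`) then become natural candidates for few-shot examples of each other.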

Experimental Results

Experiments on the public BIRD and Spider benchmarks show that RB-SQL outperforms several competitive baselines, including GPT-4, DIN-SQL, and DAIL-SQL.

| Model | EX | VES |
|---|---|---|
| ChatGPT + CoT | 36.64 | 42.30 |
| GPT-4 | 46.35 | 49.77 |
| DIN-SQL + GPT-4 | 50.72 | 58.79 |
| DAIL-SQL + GPT-4 | 54.76 | 56.08 |
| RB-SQL + GPT-4 | 58.07 | 59.72 |

Table 2: EX and VES on the dev set of the BIRD dataset.

| Model | EX (dev) | EX (test) |
|---|---|---|
| C3 + ChatGPT | 81.80 | 82.30 |
| DIN-SQL + GPT-4 | 82.80 | 85.30 |
| DAIL-SQL + GPT-4 | 84.40 | 86.60 |
| RB-SQL + GPT-4 | 84.91 | 85.68 |
| + Generated Evidence | 85.89 | 86.73 |

Table 3: EX on both dev and test sets of Spider.

Ablation studies were also conducted, demonstrating that every module in the RB-SQL framework contributes to the performance improvement.

| Method | EX | VES |
|---|---|---|
| (1) RB-SQL + GPT-4 | 58.07 | 59.72 |
| (2) GPT-4 | 46.35 (↓11.72) | 49.77 (↓9.95) |
| (3) + Table-Retriever & Column-Retriever | 54.06 (↓4.01) | 56.11 (↓3.61) |
| (4) + SQL skeleton (example organization) | 54.48 (↓3.59) | 56.38 (↓3.34) |
| (5) + SQL-Skeleton-Retriever (example selection) | 55.19 (↓2.88) | 56.81 (↓2.91) |
| (6) + Error correction | 58.07 (↓0.0) | 59.72 (↓0.0) |

Table 4: Results of ablation study on BIRD. "+" means adding a module on the basis of the previous row; arrows give the drop relative to the full model in row (1).

Conclusion

The RB-SQL framework provides an effective approach for handling large databases and complex multi-table queries in the NL2SQL task. By leveraging retrieval-based modules and introducing SQL skeletons, RB-SQL enhances the in-context learning process and guides correct SQL generation. Experimental results validate the superior performance of RB-SQL compared to existing baselines.
