title: Bridging Textual and Tabular Data for Cross-Domain Text-to-SQL Semantic Parsing authors: Lin, Xi Victoria; Socher, Richard; Xiong, Caiming date: 2020-12-23

We present BRIDGE, a powerful sequential architecture for modeling dependencies between natural language questions and relational databases in cross-DB semantic parsing. BRIDGE represents the question and DB schema in a tagged sequence where a subset of the fields are augmented with cell values mentioned in the question. The hybrid sequence is encoded by BERT with minimal subsequent layers and the text-DB contextualization is realized via the fine-tuned deep attention in BERT. Combined with a pointer-generator decoder with schema-consistency driven search space pruning, BRIDGE attained state-of-the-art performance on popular cross-DB text-to-SQL benchmarks, Spider (71.1% dev, 67.5% test with ensemble model) and WikiSQL (92.6% dev, 91.9% test). Our analysis shows that BRIDGE effectively captures the desired cross-modal dependencies and has the potential to generalize to more text-DB related tasks. Our implementation is available at https://github.com/salesforce/TabularSemanticParsing.

Text-to-SQL semantic parsing addresses the problem of mapping natural language utterances to executable relational DB queries. Early work in this area focused on training and testing the semantic parser on a single DB (Hemphill et al., 1990; Dahl et al., 1994; Zelle and Mooney, 1996; Zettlemoyer and Collins, 2005; Dong and Lapata, 2016). However, DBs are widely used in many domains and developing a semantic parser for each individual DB is unlikely to scale in practice.

Figure 1: Two questions from the Spider dataset with similar intent result in completely different SQL logical forms on two DBs. In cross-DB text-to-SQL semantic parsing, the interpretation of a natural language question is strictly grounded in the underlying relational DB schema.

More recently, large-scale datasets consisting of hundreds of DBs and the corresponding question-SQL pairs have been released (Zhong et al., 2017; Yu et al., 2019b,a) to encourage the development of semantic parsers that can work well across different DBs (Bogin et al., 2019b; Wang et al., 2019; Suhr et al., 2020; Choi et al., 2020). The setup is challenging as it requires the model to interpret a question conditioned on a relational DB unseen during training and accurately express the question intent via SQL logic. Consider the two examples shown in Figure 1: both questions have the intent to count, but the corresponding SQL queries are drastically different due to differences in the target DB schema. As a result, cross-DB text-to-SQL semantic parsers cannot trivially memorize seen SQL patterns, but instead have to accurately model the natural language question, the target DB structure, and the contextualization of both.

State-of-the-art cross-DB text-to-SQL semantic parsers adopt the following design principles to address the aforementioned challenges. First, the question and schema representations are contextualized with each other (Hwang et al., 2019; Wang et al., 2019; Yin et al., 2020). Second, pre-trained language models (LMs) such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019c) can significantly boost parsing accuracy by enhancing generalization over natural language variations and capturing long-term dependencies.
Third, as much as data privacy allows, leveraging available DB content improves understanding of the DB schema (Bogin et al., 2019b; Wang et al., 2019; Yin et al., 2020). Consider the second example in Figure 1: knowing that "PLVDB" is a value of the field Journal.Name helps the model generate the WHERE condition.

We introduce BRIDGE, a powerful sequential text-DB encoding framework assembling the three design principles mentioned above. BRIDGE represents the relational DB schema as a tagged sequence concatenated to the question. In contrast to previous work which proposed task-specific layers for modeling the DB schema (Bogin et al., 2019a,b; Choi et al., 2020) and joint text-DB linking (Wang et al., 2019), BRIDGE encodes the tagged sequence with BERT and lightweight subsequent layers: two single-layer bi-directional LSTMs (Hochreiter and Schmidhuber, 1997). Each schema component (table or field) is simply represented using the hidden state corresponding to its special token in the hybrid sequence. To better align the schema components with the question, BRIDGE augments the hybrid sequence with anchor texts, which are automatically extracted DB cell values mentioned in the question. Anchor texts are appended to their corresponding fields in the hybrid sequence (Figure 2). The text-DB alignment is then implicitly achieved via fine-tuned BERT attention between overlapping lexical tokens. Combined with a pointer-generator decoder (See et al., 2017) and schema-consistency driven search space pruning, BRIDGE achieves performance comparable to or better than the state of the art on the Spider (71.1% dev, 67.5% test with ensemble model) and WikiSQL (92.6% dev, 91.9% test) benchmarks, outperforming most recently proposed models with task-specific architectures.

Through in-depth model comparison and error analysis, we show the proposed architecture is effective for generalizing over natural language variations and memorizing structural patterns, but struggles with compositional generalization and suffers from a lack of explainability. This leads us to conclude that cross-domain text-to-SQL still poses many unsolved challenges, requiring models to demonstrate generalization over both natural language variation and structure composition while training data is often sparse.

In this section, we present the BRIDGE model, which combines a BERT-based encoder with a sequential pointer-generator to perform end-to-end cross-DB text-to-SQL semantic parsing. We formally define the cross-DB text-to-SQL task as follows. Given a natural language question Q and the schema S = ⟨T, C⟩ for a relational database, the parser needs to generate the corresponding SQL query Y. The schema consists of tables T = {t_1, ..., t_N} and fields C = {c_11, ..., c_1|T_1|, ..., c_N1, ..., c_N|T_N|}. Each table t_i and each field c_ij has a textual name. Some fields are primary keys, used for uniquely indexing each data record, and some are foreign keys, used to reference a primary key in a different table. In addition, each field has a data type, τ ∈ {number, text, time, boolean, etc.}.

Most existing solutions for this task do not consider DB content (Zhong et al., 2017). Recent approaches show accessing DB content can significantly improve system performance (Liang et al., 2018; Wang et al., 2019; Yin et al., 2020). We consider the setting adopted by Wang et al. (2019), where the model has access to the value set of each field instead of the full DB content.
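As a concrete illustration of this task input, below is a minimal sketch (hypothetical class names, not the authors' code) of a schema with typed fields, key flags, and the per-field value sets available in this setting.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Field:
    name: str                       # textual field name, e.g. "Property_Type_Code"
    data_type: str                  # one of {"number", "text", "time", "boolean", ...}
    is_primary_key: bool = False
    foreign_key_ref: Optional[Tuple[str, str]] = None  # (table, field) it references, if any
    picklist: List[str] = field(default_factory=list)  # value set exposed to the model

@dataclass
class Table:
    name: str
    fields: List[Field]

@dataclass
class Schema:
    tables: List[Table]

# Example in the spirit of Figure 2 (values are illustrative).
properties = Table(
    name="Properties",
    fields=[
        Field("Property_ID", "number", is_primary_key=True),
        Field("Property_Type_Code", "text",
              picklist=["Apartment", "Field", "House", "Shop", "Other"]),
    ],
)
schema = Schema(tables=[properties])
```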
For example, the field Property_Type_Code in Figure 2 can take one of the following values: {"Apartment", "Field", "House", "Shop", "Other"}. We call such value sets picklists. This setting protects individual data records, and sensitive fields such as user IDs or credit numbers can be hidden.

As shown in Figure 2, we represent each table with its table name followed by its fields. Each table name is preceded by the special token [T] and each field name is preceded by [C]. The representations of multiple tables are concatenated to form a serialization of the schema, which is surrounded by two [SEP] tokens and concatenated to the question. Finally, following the input format of BERT, the question is preceded by [CLS] to form the hybrid question-schema serialization

X = [CLS], Q, [SEP], [T], t_1, [C], c_11, ..., c_1|T_1|, [T], t_2, [C], c_21, ..., [SEP].

X is encoded with BERT, followed by a bi-directional LSTM, to form the base encoding h_X ∈ R^{|X|×n}. The question segment of h_X is passed through another bi-LSTM to obtain the question encoding h_Q ∈ R^{|Q|×n}. Each table/field is represented using the slice of h_X corresponding to its special token [T]/[C].

We train dense look-up features to represent meta-data of the schema. These include whether a field is a primary key (f_pri ∈ R^{2×n}), whether the field appears in a foreign key pair (f_for ∈ R^{2×n}), and the data type of the field (f_type ∈ R^{|τ|×n}). These meta-data features are fused with the base encoding of the schema component via a feed-forward layer g: R^{4n} → R^n to obtain the following encoding output:

h_S^{t_i} = g([h_X^p; 0; 0; 0]),  h_S^{c_ij} = g([h_X^q; f_pri^u; f_for^v; f_type^w]),

where p is the index of [T] associated with table t_i in X and q is the index of [C] associated with field c_ij in X. u, v and w are feature indices indicating the properties of c_ij, and [h_X^q; f_pri^u; f_for^v; f_type^w] ∈ R^{4n} is the concatenation of the four vectors. The meta-data features are specific to fields, and the table representations are fused with place-holder 0 vectors.

Modeling only the table/field names and their relations is not always enough to capture the semantics of the schema and its dependencies with the question. Consider the example in Figure 2: Property_Type_Code is a general expression not explicitly mentioned in the question, and without access to the set of possible field values, it is difficult to associate "houses" and "apartments" with it. To resolve this problem, we make use of anchor texts to link value mentions in the question with the corresponding DB fields. We perform fuzzy string matching between Q and the picklist of each field in the DB. The matched field values (anchor texts) are inserted into the question-schema representation X, succeeding the corresponding field names and separated by the special token [V]. If multiple values are matched for one field, we concatenate all of them in matching order (Figure 2). If a question mention is matched with values in multiple fields, we add all matches and let the model learn to resolve the ambiguity. The anchor texts provide additional lexical clues for BERT to identify the corresponding mention in Q. We name this mechanism "bridging".

We use an LSTM-based pointer-generator (See et al., 2017) with multi-head attention (Vaswani et al., 2017) as the decoder. The decoder is initialized with the final state of the question encoder. At each step, the decoder performs one of the following actions: generating a token from the vocabulary V, copying a token from the question Q, or copying a schema component from S.
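Before turning to the decoder, here is a minimal sketch of the tagged question-schema serialization described above; it builds on the hypothetical Schema classes sketched earlier, and the tag strings and helper name are illustrative rather than the released implementation.

```python
from typing import Dict, List

def serialize_question_schema(question: str,
                              schema: "Schema",
                              anchor_texts: Dict[str, List[str]]) -> str:
    """Build the tagged BRIDGE input:
    [CLS] question [SEP] [T] table [C] field ([V] value)* ... [SEP]
    `anchor_texts` maps "table.field" to the cell values matched in the question.
    """
    parts = ["[CLS]", question, "[SEP]"]
    for table in schema.tables:
        parts += ["[T]", table.name]
        for f in table.fields:
            parts += ["[C]", f.name]
            # Append matched cell values (anchor texts) right after their field.
            for value in anchor_texts.get(f"{table.name}.{f.name}", []):
                parts += ["[V]", value]
    parts.append("[SEP]")
    return " ".join(parts)

# Example usage:
# serialize_question_schema("Show the houses and apartments.", schema,
#                           {"Properties.Property_Type_Code": ["House", "Apartment"]})
```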
Mathematically, at each step t, given the decoder state s_t and the encoder representation [h_Q; h_S] ∈ R^{(|Q|+|S|)×n}, we compute the multi-head attention as defined in Vaswani et al. (2017):

e^{(h)}_{t,j} = (s_t U^{(h)}) (h_j V^{(h)})^T / sqrt(n/H),  α^{(h)}_{t,j} = softmax_j(e^{(h)}_{t,j}),  z^{(h)}_t = Σ_j α^{(h)}_{t,j} (h_j V^{(h)}),  z_t = [z^{(1)}_t; ...; z^{(H)}_t],

where U^{(h)} and V^{(h)} are head-specific projections, h_j is the j-th row of [h_Q; h_S], h ∈ [1, ..., H] is the head number and H is the total number of heads.

Figure 2: The BRIDGE encoder. The two phrases "houses" and "apartments" in the input question both matched to two DB fields. The matched values are appended to the corresponding field names in the hybrid sequence.

The probability of generating from V and the output distribution are defined as

p^gen_t = sigmoid([s_t; z_t] W_gen + b_gen),  P(y_t) = p^gen_t P_V(y_t) + (1 − p^gen_t) Σ_{j: X̃_j = y_t} α^{(H)}_{t,j},

where P_V(y_t) is the softmax LSTM output distribution and X̃ is the length-(|Q| + |S|) sequence that consists of only the question words and the special tokens [T] and [C] from X. We use the attention weights of the last head to compute the pointing distribution (in practice we find this approach better than using just one head or the average of multiple head weights; §A.4). We extend the input state to the LSTM decoder using the selective read proposed by Gu et al. (2016). The technical details of this extension can be found in §A.2.

We propose simple heuristics for pruning the search space of the sequence decoder, based on SQL syntax constraints and the fact that the DB fields appearing in each SQL clause must come only from the tables in the FROM clause. We rearrange the clauses of each SQL query in the training set into the standard DB execution order (Rob and Coronel, 1995) shown in Table 1.

Table 1: Execution order of SQL clauses: FROM, WHERE, GROUP BY, HAVING, SELECT, ORDER BY, LIMIT.

We can show that a SQL query with clauses in execution order satisfies the following lemma.

Lemma 1. Let Y_exec be a SQL query with clauses arranged in execution order; then any table field in Y_exec must appear after its table.

As a result, we adopt a binary attention mask ξ which initially has the entries corresponding to all fields set to 0. Once a table t_i is decoded, we set all entries in ξ corresponding to {c_i1, ..., c_i|T_i|} to 1. This allows the decoder to search only in the space specified by the condition in Lemma 1, with little overhead in decoding speed. In addition, we observe that a valid SQL query satisfies the following token transition lemma.

Lemma 2 (Token Transition). Let Y be a valid SQL query; then any table/field token in Y can only appear after a SQL reserved token, and any value token in Y can only appear after a SQL reserved token or a value token.

We use this heuristic to prune the set of candidate tokens at each decoding step. It is implemented via vocabulary masking.

Text-to-SQL Semantic Parsing Recently the field has witnessed a resurgence of interest in text-to-SQL semantic parsing (Androutsopoulos et al., 1995), by virtue of newly released large-scale datasets (Zhong et al., 2017) and mature neural network modeling tools (Vaswani et al., 2017; Shaw et al., 2018; Devlin et al., 2019). While existing models have surpassed human performance on benchmarks consisting of single-table DBs and simple SQL queries (Hwang et al., 2019; He et al., 2019a), ample room for improvement remains on the Spider benchmark (https://yale-lily.github.io/spider), which consists of relational DBs and complex SQL queries. Recent architectures proposed for this problem show increasing complexity in both the encoder and the decoder (Wang et al., 2019; Choi et al., 2020; Furrer et al., 2020). Bogin et al. (2019a,b) proposed to encode the relational DB schema as a graph and also use the graph structure to guide decoding.
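Returning to the schema-consistency pruning above, the sketch below illustrates (with illustrative names, not the released implementation) how Lemma 1 can be realized as a binary mask over schema items that is opened table by table, and how Lemma 2 restricts the allowed token types at each step.

```python
import torch

def init_field_mask(num_schema_items: int, field_to_table: dict) -> torch.Tensor:
    """Binary mask over schema items: fields start masked (0), tables start open (1)."""
    mask = torch.ones(num_schema_items)
    for field_idx in field_to_table:
        mask[field_idx] = 0.0
    return mask

def unmask_fields_of_table(mask: torch.Tensor, table_idx: int, field_to_table: dict) -> torch.Tensor:
    """Lemma 1: once a table has been decoded, its fields become reachable."""
    for field_idx, t_idx in field_to_table.items():
        if t_idx == table_idx:
            mask[field_idx] = 1.0
    return mask

def allowed_next(prev_token_kind: str) -> set:
    """Lemma 2 (token transition): table/field tokens may only follow a SQL reserved
    token; value tokens may follow a reserved token or another value token."""
    if prev_token_kind == "reserved":
        return {"reserved", "table", "field", "value"}
    if prev_token_kind == "value":
        return {"reserved", "value"}
    # after a table or field token, only reserved tokens may follow
    return {"reserved"}
```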
Guo et al. (2019) propose schema linking and SemQL, an intermediate SQL representation customized for questions in the Spider dataset, which is synthesized via a tree-based decoder. Wang et al. (2019) propose RAT-SQL, a unified graph encoding mechanism which effectively covers relations in the schema graph and its linking with the question. The overall architecture of RAT-SQL is deep, consisting of 8 relational self-attention layers (Shaw et al., 2018) on top of BERT-large. In comparison, BRIDGE uses BERT combined with minimal subsequent layers. It uses a simple sequence decoder with search-space pruning heuristics and applies little abstraction to the SQL surface form.

Seq2Seq Models for Text-to-SQL Semantic Parsing Much prior work has applied sequence-to-sequence models to semantic parsing, treating it as a translation problem (Dong and Lapata, 2016; Lin et al., 2018). Text-to-SQL models take both the natural language question and the DB as input, and a commonly used input representation in existing work is to concatenate the question with a sequential version of the DB schema (or the table header if there is only a single table). Zhong et al. (2017) proposed the Seq2SQL model, which first adopted this representation and tested it on WikiSQL. Hwang et al. (2019) first demonstrated that encoding such a representation with BERT can achieve upper-bound performance on the WikiSQL benchmark. Our work shows that such a sequence representation encoded with BERT is also effective for synthesizing complex SQL queries issued to multi-table databases. Concurrently, Suhr et al. (2020) adopted a transformer model with BERT as the encoder on Spider, and concurrent work shows that the T5 model (Raffel et al., 2020) with 3 billion parameters achieves state-of-the-art performance on Spider. However, neither of these models uses DB content. In addition, BRIDGE achieves comparable performance with a significantly smaller model. In particular, the BRIDGE decoder is a single-layer LSTM compared to the 12-layer transformer in T5.

Text-to-SQL Semantic Parsing with DB Content Yavuz et al. (2018) use question-value matches to achieve high-precision condition predictions on WikiSQL. Shaw et al. (2019) also show that value information is critical to cross-DB semantic parsing tasks, yet the paper reported negative results from augmenting a GNN encoder with BERT, and the overall model performance is well below the state of the art. In addition, instead of directly taking DB content as input, some models leverage the content by training the model with SQL query execution (Zhong et al., 2017) or performing execution-guided decoding during inference. To the best of our knowledge, such methods have been tested exclusively on the WikiSQL benchmark.

Joint Representation of Textual-Tabular Data and Pre-training BRIDGE is a general framework for jointly representing the question, the DB schema and the relevant DB cells. It has the potential to be applied to a wider range of problems that require joint textual-tabular data understanding. Recently, Yin et al. (2020) propose TaBERT, an LM for jointly representing textual and tabular data, pre-trained over millions of web tables. Similarly, concurrent work proposes TaPas, a pre-trained LM for weakly supervised table parsing.

Table 2: Dataset statistics. Spider: 8,695 train / 1,034 dev / 2,147 test questions, 4,730 / 564 unique SQL queries (train/dev), and 140 / 20 / 40 DBs. WikiSQL: 56,355 / 8,421 / 15,878 questions and 17,984 / 1,621 / 2,787 tables in the train/dev/test splits.

We evaluate BRIDGE using two well-studied cross-database text-to-SQL benchmark datasets: Spider and WikiSQL (Zhong et al., 2017). Table 2 shows the statistics of the train/dev/test splits of the datasets.
In the Spider benchmark, the train/dev/test databases do not overlap, and the test set is hidden from the public. For WikiSQL, 49.6% of its dev tables and 45.1% of its test tables are not found in the train set. Therefore, both datasets necessitate the ability of models to generalize to unseen schemas. We run hyperparameter search and analysis on the dev set and report the test set performance only using our best approach.

We report the official evaluation metrics proposed by the Spider and WikiSQL authors.

Exact Match (EM) This metric checks whether the predicted SQL exactly matches the ground truth SQL. It is a performance lower bound, as a semantically correct SQL query may differ from the ground truth SQL query in surface form.

Execution Accuracy (EA) This metric checks whether the predicted SQL is executable on the target DB and whether its execution results match those of the ground truth. It is a performance upper bound, as two SQL queries with different semantics can execute to the same results on a DB.

Anchor Text Selection Given a DB, we compute the picklist of each field using the official DB files. We developed a fuzzy matching algorithm to match a question to possible value mentions in the DB (described in detail in §A.3). We include up to k matches per field, and break ties by taking the longer match. We exclude all number matches, as a number mention in the question may not correspond to a DB cell (e.g., it could be a hypothetical threshold, as in "shoes lower than $50") or may not effectively discriminate between different fields. Figure 3 shows the distribution of non-numeric values in the ground truth SQL queries from the Spider and WikiSQL dev sets. For Spider, 31% of the examples contain one or more non-numeric values in the ground truth queries and can potentially benefit from the bridging mechanism. For WikiSQL the ratio is significantly higher, with 76.8% of the ground truth SQL queries containing one or more non-numeric values. On both datasets, the proportion of ground truth SQL queries containing more than 2 non-numeric values is negligible (0.8% for Spider and 1.1% for WikiSQL). Based on this analysis, we set k = 2 in all our experiments.

The original Spider dataset contains errors in both the example files and database files. We manually corrected some errors in the train and dev examples. For comparison with other models in §5.1, we report metrics using the official dev/test sets. For our own ablation study and analysis, we report metrics using the corrected dev files. We also use a high-precision heuristic to identify missing foreign key pairs in the databases and combine them with the released ones during training and inference: if two fields of different tables have identical names and one of them is a primary key, we count them as a foreign key pair.

Training We train our model using cross-entropy loss. We use Adam-SGD (Kingma and Ba, 2015) with default parameters and a mini-batch size of 32. We use the uncased BERT-large model from HuggingFace's transformers library (Wolf et al., 2019). We set all LSTMs to 1 layer and use 8-head attention between the encoder and decoder.

• Spider: We set the LSTM hidden layer dimension to 400. We train for a maximum of 100k steps. We set the learning rate to 5e-4 in the first 5,000 iterations and shrink it to 0 with the L-inv function (§A.5). We fine-tune BERT with a fine-tuning rate linearly increasing from 3e-5 to 6e-5 in the first 4,000 iterations and decaying to 0 according to the L-inv function.
We randomly permute the table order in a DB schema and, with probability 0.3, drop one table which does not appear in the ground truth at every training step. The training time of our model on an NVIDIA A100 GPU is approximately 51.5h (including intermediate results verification time).

• WikiSQL: We set the LSTM hidden layer dimension to 512. We train for a maximum of 50k steps and set the learning rate to 5e-4 in the first 4,000 iterations and shrink it to 0 with the L-inv function. We fine-tune BERT with a fine-tuning rate linearly increasing from 3e-5 to 6e-5 in the first 4,000 iterations and decaying to 0 according to the L-inv function. The training time of our model on an NVIDIA A100 GPU is approximately 6h (including intermediate results verification time).

Decoding The decoder uses a generation vocabulary consisting of 70 SQL keywords and reserved tokens, plus the 10 digits to generate numbers not explicitly mentioned in the question (e.g., "first", "second", "youngest", etc.). We use a beam size of 64 for leaderboard evaluation; all other experiments use a beam size of 16. We use schema-consistency guided decoding during inference only. It cannot guarantee schema consistency (consider the example SQL query shown in Table A2, which satisfies the condition of Lemma 1 but where the table VOTING_RECORD only appears in the first sub-query and the field VOTING_RECORD.PRESIDENT_Vote in the second sub-query is out of scope), so we run a static SQL correctness check on the beam search output to eliminate predictions that are either syntactically incorrect or violate schema consistency. Prior work performs the more aggressive execution-guided decoding; however, it is difficult to apply this approach to complex SQL queries (Zhong et al., 2017). We build a static SQL analyzer on top of the Mozilla SQL Parser (https://github.com/mozilla/moz-sql-parser). Our static checking approach handles complex SQL queries and avoids DB execution overhead. For WikiSQL, the static check also makes sure that the output query conforms to the SQL sketch used to create the dataset (Zhong et al., 2017). If no predictions in the beam satisfy the two criteria, we output a default SQL query which counts the number of entries in the first table.

Table 3 shows the performance of BRIDGE compared to other approaches ranking at the top of the Spider leaderboard. BRIDGE v1 is our model described in the original version of the paper. Compared to BRIDGE v1, the current model is trained with BERT-large and an improved anchor text matching algorithm (§A.3). BRIDGE_L performs very competitively, significantly outperforming most recently proposed architectures with more complicated, task-specific layers (Global-GNN, EditSQL+BERT, IRNet+BERT, RAT-SQL v2, RYANSQL+BERT_L). It also performs better than or comparably to models that explicitly model compositionality in the decoder (SmBoP, RAT-SQL v3L+BERT_L, RYANSQL). Simply comparing the leaderboard performances does not allow precise gauging of different modeling trade-offs, as all leaderboard entries adopt some customized pre- and post-processing of the data. For example, the schema-consistency guided decoding adopted by BRIDGE is complementary to other models, and BRIDGE synthesizes a complete SQL query while several other models do not synthesize values and synthesize the FROM clause in a post-processing step (Wang et al., 2019).
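A minimal sketch of the post-beam static check described above: it uses the moz-sql-parser Python package only to test whether a candidate parses, and the schema-consistency walk below is a simplification (it collects table-qualified column names and simple FROM entries), not the full analyzer used by BRIDGE.

```python
# pip install moz-sql-parser
from moz_sql_parser import parse

def passes_static_check(sql: str) -> bool:
    """Reject candidates that fail to parse or that reference fields of tables
    absent from the FROM clause (a simplified Lemma-1-style consistency check)."""
    try:
        tree = parse(sql)
    except Exception:
        return False  # syntactically invalid

    # Collect table names mentioned in FROM (handles plain names and simple aliases).
    from_part = tree.get("from", [])
    if not isinstance(from_part, list):
        from_part = [from_part]
    from_tables = set()
    for item in from_part:
        if isinstance(item, str):
            from_tables.add(item.lower())
        elif isinstance(item, dict) and isinstance(item.get("value"), str):
            from_tables.add(item["value"].lower())

    # Collect "table.column" references anywhere in the parse tree.
    # (String literals containing "." would be misdetected; acceptable for a sketch.)
    def qualified_tables(node):
        if isinstance(node, str) and "." in node:
            yield node.split(".", 1)[0].lower()
        elif isinstance(node, dict):
            for v in node.values():
                yield from qualified_tables(v)
        elif isinstance(node, list):
            for v in node:
                yield from qualified_tables(v)

    return all(t in from_tables for t in qualified_tables(tree))

def filter_beam(candidates):
    """Keep only beam outputs that survive the static check."""
    return [sql for sql in candidates if passes_static_check(sql)]
```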
Table 3: Top of the Spider leaderboard (dev / test exact set match). Global-GNN (Bogin et al., 2019b) ♠: 52.7 / 47.4; EditSQL + BERT: 57.6 / 53.4; GNN + Bertrand-DR (Kelkar et al., 2020): 57.9 / 54.6; IRNet + BERT: 61.9 / 54.7; RAT-SQL v2 ♠ (Wang et al., 2019): 62.7 / 57.2; RYANSQL + BERT_L (Choi et al., 2020): 66.6 / 58.2; SmBoP + BART (Rubin and Berant, 2020): 66.0 / 60.5; RYANSQL v2 + BERT_L: 70.6 / 60.6; RAT-SQL v3 + BERT_L ♠ (Wang et al., 2019).

In addition, BRIDGE generates executable SQL queries by copying values from the input question, while most existing models only predict the SQL syntax skeleton (we believe the execution accuracy can be further improved by having the model copy the anchor texts, and we plan to explore this in future work). As of Dec 20, 2020, BRIDGE ranks top-1 on the Spider leaderboard by execution accuracy. RAT-SQL v3+BERT_L outperforms BRIDGE in terms of exact set match by a small margin. We further look at the performance comparison between the two models across different SQL query hardness levels (Table 4). Overall, BRIDGE outperforms RAT-SQL v3+BERT_L in the easy category but underperforms it in the other three categories, with considerable gaps in medium and hard. We hypothesize that differences in both the encoders and decoders of the two models have contributed to the performance differences. The RAT-SQL encoder and decoder are designed with compositional inductive bias. It models the relational DB schema as a graph encoded with relational self-attention, and the decoder uses SQL-syntax guided generation (Yin and Neubig, 2017). BRIDGE, on the other hand, adopts a Seq2Seq architecture. In addition, RAT-SQL v3 models the lexical mapping between question and schema and between question and value via a graph with edges labeled by the matching condition (full-word match, partial match, etc.). BRIDGE represents the same information in a tagged sequence and uses fine-tuned BERT to implicitly obtain such a mapping. While the anchor text selection algorithm (§4.3) takes string variations into account, BERT may not be able to capture the linking when string variations exist, as it has not seen tabular input during pre-training. The tokenization scheme adopted by BERT and other pre-trained LMs (e.g., GPT-2) cannot effectively capture partial string matches in a novel input (e.g., "cats" and "cat" are two different words in the vocabularies of BERT and GPT-2). Pre-training the architecture using more tables and heuristically aligned text may alleviate this problem (Yin et al., 2020). Finally, we notice that ensembling three models (averaging the output distributions at each decoding step) trained with different random seeds improves the performance at all SQL hardness levels, especially in medium, hard and extra-hard.

Table 5: Comparison between BRIDGE and other top-performing models on the WikiSQL leaderboard as of Dec 20, 2020 (dev EM / dev EA / test EM / test EA). ♠ denotes approaches using DB content; +EG denotes approaches using execution-guided decoding. IE-SQL+EG (Ma et al., 2020): 87.9 / 92.6 / 87.8 / 92.5.

Table 5 shows BRIDGE_L to be the best-performing model without execution-guided (EG) decoding. However, compared to SQLova, X-SQL and HydraNet, BRIDGE benefits noticeably less from EG. A probable reason for this is that the schema-consistency guided decoding already rules out a significant number of SQL queries that would raise errors during execution. In addition, all models leveraging DB content during training (BRIDGE and NL2SQL) benefit less from EG.
The decoding search space pruning strategies we introduced (including schema-consistency guided decoding and static SQL correctness check) are effective, with absolute E-SM improvements 0.3% on average. On the other hand, encoding techniques for jointly representing textual and tabular input contribute more. Especially, the bridging mechanism results in an absolute E-SM improvement of 1.6%. A further comparison between BRIDGE with and without bridging at different SQL hardness levels (Table 6) shows that the technique is especially effective at improving the model performance in the extra-hard category. We also did a fine-grained ablation study on the bridging mechanism, by inserting only the special token [V] into the hybrid sequence without the anchor texts. The average model performance is not hurt and the variance decreased. This indicates that the [V] tokens act as markers for columns whose value matched with the input question and contribute to a significant proportion of the performance improvement by bridging. 12 However, since the full model attained the best performance on the dev set, we keep the anchor texts in our representation. We also observe that the dense meta data feature encoding ( § 2.2) is helpful, resulting in 1% absolute improvement on average. Shuffling and randomly dropping non-ground-truth tables during training also mildly helps our approach, as it increases the variation of DB schema seen by the model and reduces overfitting to a particular table arrangement. Furthermore, BERT is critical to the performance of BRIDGE, magnifying performance of the base model by more than three folds. This is considerably larger than the improvement prior approaches have obtained from adding BERT. Consider the performances of RAT-SQL v2 and RAT-SQL v2+BERT L in Table 3 , the improvement using BERT L is 7%. This shows that simply adding BERT to existing approaches results in significant redundancy in the model architecture. We perform a qualitative attention analysis in §A.8 to show that after fine-tuning, the BERT layers effectively capture the linking between question mentions and the anchor texts, as well as the relational DB structures. WikiSQL The model variance on WikiSQL is much smaller than that on Spider, hence we report the ablation study results using the best model in each category. As shown in Table 7 , the bridging mechanism significantly enhances the model performance, especially when execution-guided decoding is not applied. As shown in Figure 3 , 76.8% of the ground truth SQL queries in the WikiSQL dev set contain at least one non-numeric values. The dataset contains simple queries and the main challenge comes from interpreting filtering conditions in the WHERE clause (Yavuz et al., 2018) . And bridging is very effective for solving this challenge. We randomly sampled 50 Spider dev set examples for which the best BRIDGE model failed to produce a prediction that matches the ground truth and manually categorized the errors. Each example is assigned to only the category it fits most. Figure 4 shows the number of examples in each category. 18% of the examined predictions are false negatives. Among them, 5 are semantically equivalent to the ground truths; 3 use GROUP BY keys different but equivalent to those of the ground truth (e.g. GROUY BY car_models.name vs. GROUP BY car_models.id); 1 has the wrong ground truth annotation. 
Among the true negatives, the dominant type of error is logical mistakes (18), where the output SQL query fails to represent the core logic expressed in the question. 17 have errors that can be pinpointed to specific clauses: WHERE (7), SELECT (5), FROM (2), ORDER BY (2), GROUP BY (1). 2 have errors in the operators: 1 in the aggregation operator and 1 in the DISTINCT operator. 1 has errors in compounding SQL clauses. 2 were due to a lack of lexical and commonsense knowledge when interpreting the question, e.g., "predominantly spoken language", "all address lines". 1 example has an ungrammatical natural language question. Table 8 shows examples of errors from each major error type mentioned previously.

Table 8: Error cases of BRIDGE on the Spider dev set. The samples were randomly selected from the medium hardness level; the table contrasts the wrong predictions made by BRIDGE with the ground truths. Logic example (DB: concert_singer): "Find the number of concerts happened in the stadium with the highest capacity." Another example: "What is the total population and average area of countries in the continent of North America whose area is bigger than 3000?" Predicted: SELECT SUM(country.Population), AVG(country.Population) FROM country WHERE country.Continent = "North America" AND country.SurfaceArea > 3000. Ground truth: SELECT SUM(country.population), AVG(country.surfacearea) FROM country WHERE country.Continent = "north america" AND country.SurfaceArea > 3000.

Logic Errors Logic error is a diverse category. Frequently in this case we see the model memorizing patterns seen in the training set but failing at compositional generalization. Consider the first example in this category. A superlative relation such as "highest" is often represented in the training set by sorting the retrieved records in descending order and taking the top 1. The model memorizes this pattern and outputs the correct logic for finding the stadium with the highest capacity. It also outputs the correct pattern for counting the number of concerts. Yet the correct way of combining these two logical fragments to realize the meaning of the question is to use a nested SQL query in the WHERE condition. BRIDGE joined them flatly, and the resulting query has completely different semantics. The second example illustrates an even more interesting case. The target database is in second normal form (https://en.wikipedia.org/wiki/Second_normal_form), which triggers self-join relations (the friend of a high schooler is another high schooler). Self-joins do not appear frequently in the dataset and we hypothesize it is very challenging for a Seq2Seq based model like BRIDGE to grasp such relations. Introducing compositional inductive bias in both the encoder and decoder could be a promising direction for solving these extra-hard cases.

Lexical Understanding Another category of errors occurs when the input utterance contains unseen words or phrasal expressions. While BRIDGE builds on top of pre-trained language models such as BERT, it is especially challenging for the model to interpret these text units grounded to the DB schema. In the first example in this category, "predominantly" means spoken by the largest percentage of the population. It is almost impossible for the model to see such diverse natural language during supervised learning. Infusing such knowledge via pre-training is also non-trivial, but worth investigating.
Continuous learning is a promising direction for this type of challenge, where the model is trained to ask clarification questions and learns new knowledge from user interaction (Yao et al., 2020).

Commonsense As shown by the example, the fact that a US address contains two lines is commonsense knowledge, but the model has difficulty inferring that "all lines" maps to "line_1 and line_2". Again, we think continuous learning could be an effective solution for this case.

The final major category of error has to do with the model blatantly ignoring information in the utterance, even when the underlying logic is not complicated, indicating that spurious correlations were captured during training (Tu et al., 2020). In the first example, the model places the Horsepower field in the SELECT clause, while the question asks for "the model of the car". In the second example, the model predicts SELECT SUM(Population), AVG(Population) while the question asks for the total population and average area of countries. We think better modeling of compositionality in the natural language may reduce this type of error, for example by modeling its span structure (Joshi et al., 2019; Herzig and Berant, 2020) and constructing interpretable grounding with the DB schema.

We present BRIDGE, a powerful sequential architecture for modeling dependencies between natural language questions and relational DBs in cross-DB semantic parsing. BRIDGE serializes the question and DB schema into a tagged sequence and maximally utilizes pre-trained LMs such as BERT to capture the linking between text mentions and the DB schema components. It uses anchor texts to further improve the alignment between the two cross-modal inputs. Combined with a simple sequential pointer-generator decoder with schema-consistency driven search space pruning, BRIDGE attained state-of-the-art performance on the widely used Spider and WikiSQL text-to-SQL benchmarks. Our analysis shows that BRIDGE is effective at generalizing over natural language variations and memorizing structural patterns. It achieves the upper-bound score on WikiSQL and significantly outperforms previous work in the easy category of Spider. However, it struggles with compositional generalization and sometimes makes unexplainable mistakes. This indicates that when data is ample and the target logical form is shallow, sequence-to-sequence models are good choices for cross-DB semantic parsing, especially given that their implementation is easier and decoding is efficient. For solving the general text-to-SQL problem and moving towards production, we plan to further improve compositional generalization and interpretability of the model. We also plan to study the application of BRIDGE and its extensions to other tasks that require joint textual and tabular understanding, such as weakly supervised semantic parsing and fact checking.

Table A1: Examples of complex SQL queries with clauses in the written order and the DB execution order.

Table A2: An example sequence that satisfies the condition of Lemma 1 but violates schema consistency. Here the field VOTING_RECORD.PRESIDENT_Vote in the second sub-query is out of scope.

Denoting the matched whole-word phrase in the question as s_q, we define the question match score and cell match score as β_q = |s_m|/|s_q| and β_c = |s_m|/|s_c|. We define a coarse accuracy measurement to tune the question match score threshold θ_q and the cell match threshold θ_c.
Namely, given the list of matched anchor texts P obtained using the aforementioned procedure and the list of textual values G extracted from the ground truth SQL query, we compute the percentage of anchor texts that appear in G and the percentage of values in G that appear in P as approximate precision (p) and recall (r). Note that this metric does not evaluate whether the matched anchor texts are associated with the correct field. For k = 2, we set θ_q = 0.5 and θ_c = 0.8. On the training set, the resulting p = 73.7 and r = 74.0. To quantify the effect of anchor text matching accuracy on the end-to-end performance, we ran a set of experiments comparing BRIDGE performance w.r.t. different anchor text matching F1s. Our preliminary results show that with the same anchor text matching recall, varying the precision does not significantly change the end-to-end model performance.

While multi-head attention between the encoder and decoder is typically used in transformers (Vaswani et al., 2017), our experiments show it is effective for the BRIDGE model as well. Table A3 shows the performance of BRIDGE_L w.r.t. different numbers of attention heads, where the attention probability computed by the last head is used as the copy probability. We observe that using more than one head in general significantly improves over using only one head, with both the 2-head and 4-head attention giving the best performance.

Table A3: End-to-end performance of BRIDGE_L w.r.t. different numbers of attention heads between encoder and decoder. "-last" indicates that the last attention head is used as the copy probability; "-mean" indicates that the mean of all attention heads is used. We report the E-SM of each model averaged over 5 runs.

A.5 The Linear-inverse-square-root (L-inv) learning rate decay function The linear (γ_0 − αn) and inverse-square-root (γ_0/√n) learning rate schedulers are commonly used for learning rate decay in neural network training (see, e.g., https://fairseq.readthedocs.io/en/latest/lr_scheduler.html). The linear one decays slower in the beginning but faster in the end; the inverse-square-root one decays faster in the beginning but only approaches 0 as n → ∞. We hence combine the two functions and propose a new learning rate scheduler that both decays fast in the beginning and reaches 0 at a finite n. The L-inv learning rate scheduler is defined as γ_n = γ_0 − β√n, where β = γ_0/√n_max and n_max is the total number of back-propagation steps.

As shown in §5.2, the performance of BRIDGE on Spider is sensitive to the random seed. We train 10 different BRIDGE models with only differences in the random seeds. Figure A1 shows the performance of each individual model (sorted in decreasing exact set match), and of the top-k models ensembled using average step probabilities.

Figure A1: Performance of ensemble models w.r.t. different numbers of models in the ensemble.

The individual model performance variation is indeed large. The best and the worst models differ by 3.4 absolute points in E-SM and 2.2 absolute points in execution accuracy (in general the execution accuracy of our model is lower than the E-SM; we believe the execution accuracy can be further improved by copying the anchor texts during SQL generation). We hypothesize that this is a result of both intrinsic model variance and error in the evaluation metrics. Considering the false negatives, the true model performance could have less variance. Combining models in general leads to better performance.
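A minimal sketch of the step-wise model combination used for these ensembles: at every decoding step the models' output distributions are averaged before the next token is selected. The model interface (init_state / step) is an assumption for illustration, and greedy decoding is shown for brevity, whereas the experiments use beam search.

```python
import torch

@torch.no_grad()
def ensemble_greedy_decode(models, encoded_inputs, max_steps: int, eos_id: int):
    """Average the per-step output distributions of several models.

    Each model is assumed to provide:
      init_state(encoded_input) -> decoder state
      step(state, prev_token)   -> (probs over the output space, new state)
    """
    states = [m.init_state(x) for m, x in zip(models, encoded_inputs)]
    prev = None
    output = []
    for _ in range(max_steps):
        step_probs, new_states = [], []
        for m, s in zip(models, states):
            probs, s = m.step(s, prev)
            step_probs.append(probs)
            new_states.append(s)
        avg = torch.stack(step_probs, dim=0).mean(dim=0)  # average step distributions
        prev = int(avg.argmax())
        if prev == eos_id:
            break
        output.append(prev)
        states = new_states
    return output
```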
In particular, combining the best model with the second-best model improves the E-SM by 1.3 absolute points. Further combining with the weaker models still shows improvements, but the returns are diminishing. The top-7 model ensemble achieves the best E-SM (72.2%) and the top-10 model ensemble achieves the best execution accuracy (72.1%).

Table A4 shows the comparison between the best (70.2%) and worst (66.7%) models on the Spider dev set in terms of error overlap.

Table A4: Error overlap between the best and worst single models on the Spider dev set. Both correct: 61.2%; only the best model correct: 8.9%; only the worst model correct: 5.5%; both wrong: 24.4%.

For 61.2% of the dev set, both models predicted the correct answer, and for 24.4% of the dev set both models made a mistake. For 8.9% of the examples, only the best model is correct, while for 5.5% of the examples only the worst model is correct. Manual examination shows that most of the examples on which the two models' predictions are evaluated differently are indeed semantically different.

We further compute the E-SM accuracy of BRIDGE over different DBs in the Spider dev set. Figure A2 shows drastic performance differences across DBs.

Figure A2: E-SM accuracy of BRIDGE by DB in the Spider dev set. From top to bottom, the DBs are sorted by the number of tables in the schema in ascending order.

While BRIDGE achieves a near-perfect score on some DBs, the performance is only 30%-40% on others. Performance does not always negatively correlate with the schema size. We hypothesize that the model scores better on DB schemas similar to those seen during training, and that better characterization of the "similarity" between DB schemas could help transfer learning.

We visualize attention in the fine-tuned BERT layers of BRIDGE (with BERT-base-uncased) to qualitatively evaluate whether the model functions as an effective text-DB encoder as we expect. We use the BERTViz library (https://github.com/jessevig/bertviz) developed by Vig (2019). We perform the analysis on the smallest DB in the Spider dev set to ensure the attention graphs are readable. The DB consists of two tables, Poker_Player and People, that store information about poker players and their match results. While the BERT attention is a computation graph consisting of 12 layers and 12 heads, we were able to identify prominent patterns in a subset of the layers.

First, we examine whether anchor texts indeed have the effect of bridging information across the textual and tabular segments. The example question we use is "show names of people whose nationality is not Russia", and "Russia" in the field People.Nationality is identified as the anchor text. As shown in Figure A3 and Figure A4, we find strong connections between the anchor text and its corresponding question mention in layers 2, 4, 5, 10 and 11. We further notice that the layers effectively capture the relational DB structure. As shown in Figure A5 and Figure A6, we found attention patterns in layer 5 that connect tables with their primary keys and foreign key pairs.

We notice that all interpretable attention connections are between lexical items in the input sequence, not including the special tokens ([T], [C], [V]). This is somewhat counter-intuitive, as the subsequent layers of BRIDGE use the special tokens to represent each schema component. We hence examined attention over the special tokens (Figure A7) and found that they function as bindings of the tokens in the table names and field names. The pattern is especially visible in layer 1. As shown in Figure A7, each token in the table name "poker player" has high attention to the corresponding [T]. Similarly, each token in the field name "poker player ID" has high attention to the corresponding [C].
We hypothesize that in this way the special tokens function similarly to the cell pooling layers proposed in TaBERT (Yin et al., 2020).

We discuss a few aspects of BRIDGE that can be improved in future work.

Anchor Selection BRIDGE adopts simple string matching for anchor text selection. In our experiments, improving the anchor text selection accuracy significantly improves the end-to-end accuracy. Extending anchor text matching to cases beyond simple string matching (e.g., "LA" → "Los Angeles") is a future direction. Furthermore, this step can be learned either independently or jointly with the text-to-SQL objective. Currently BRIDGE ignores number mentions. We may introduce features indicating that a specific number in the question falls within the value range of a specific column.

Input Size As BRIDGE serializes all inputs into a sequence with special tags, a fair concern is that the input would be too long for large relational DBs. We believe this can be addressed with recent architectural advances in transformers (Beltagy et al., 2020), which have scaled the attention mechanism to model very long sequences.

Relation Encoding BRIDGE fuses DB schema meta-data features into each individual field representation. This mechanism loses some information from the original graph structure. It works well on Spider, where the foreign key pairs often have exactly the same names. We consider regularizing a subset of the attention heads (Strubell et al., 2018) to capture DB connections to be a promising way to model the graph structure of relational DBs.

References
Natural language interfaces to databases: an introduction
Longformer: The long-document transformer
Representing schema structure with graph neural networks for text-to-SQL parsing
Global reasoning over database structures for text-to-SQL parsing
RYANSQL: Recursively applying sketch-based slot fillings for complex text-to-SQL in cross-domain databases
Expanding the scope of the ATIS task: The ATIS-3 corpus
BERT: Pre-training of deep bidirectional transformers for language understanding
Language to logical form with neural attention
Compositional generalization in semantic parsing: Pre-training vs. specialized architectures
Incorporating copying mechanism in sequence-to-sequence learning
Towards complex text-to-SQL in cross-domain database with intermediate representation
Content enhanced BERT-based text-to-SQL generation
X-SQL: Reinforce schema representation with context
The ATIS spoken language systems pilot corpus
Span-based semantic parsing for compositional generalization
TaPas: Weakly supervised table parsing via pre-training
Long short-term memory
A comprehensive exploration on WikiSQL with table-aware word contextualization
SpanBERT: Improving pre-training by representing and predicting spans
Saurabh Vaichal, and Peter Relan. 2020. Bertrand-DR: Improving text-to-SQL using a discriminative reranker
Adam: A method for stochastic optimization
Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019
Memory augmented policy optimization for program synthesis and semantic parsing
Bridging textual and tabular data for cross-domain text-to-SQL semantic parsing
NL2Bash: A corpus and semantic parser for natural language interface to the Linux operating system
Towards comprehensive description generation from factual attribute-value tables
Multi-task deep neural networks for natural language understanding
RoBERTa: A robustly optimized BERT pretraining approach
Hybrid ranking network for text-to-SQL
Mention extraction and linking for SQL query generation
Exploring the limits of transfer learning with a unified text-to-text transformer
Database systems: Design, implementation, and management
SmBoP: Semi-autoregressive bottom-up semantic parsing
Get to the point: Summarization with pointer-generator networks
Compositional generalization and natural language variation: Can a semantic parsing approach handle both? CoRR
Generating logical forms from graph representations of text and entities
Self-attention with relative position representations
Linguistically-informed self-attention for semantic role labeling
Exploring unexplored generalization challenges for cross-database semantic parsing
Spandana Gella, and He He. 2020. An empirical study on robustness to spurious correlations using pre-trained language models
Attention is all you need
A multiscale visualization of attention in the transformer model
RAT-SQL: Relation-aware schema encoding and linking for text-to-SQL parsers
Execution-guided neural program decoding
HuggingFace's Transformers: State-of-the-art natural language processing
An imitation game for learning semantic parsers from user interaction
What it takes to achieve 100 percent condition accuracy on WikiSQL
A syntactic neural model for general-purpose code generation
TaBERT: Pretraining for joint understanding of textual and tabular data
SyntaxSQLNet: Syntax tree networks for complex and cross-domain text-to-SQL task
SParC: Cross-domain semantic parsing in context
Learning to parse database queries using inductive logic programming
Photon: A robust cross-domain text-to-SQL system
Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars
Editing-based SQL query generation for cross-domain context-dependent questions
Seq2SQL: Generating structured queries from natural language using reinforcement learning

We thank Yingbo Zhou for helpful discussions. We thank the anonymous reviewers and members of Salesforce Research for their thoughtful feedback. A significant part of the experiments was completed during the California Bay Area shelter-in-place order for COVID-19. Our heartfelt thanks go to all who worked hard to keep others safe and enjoying a well-functioning life during this challenging time.

We show more examples of complex SQL queries with their clauses arranged in written order vs. execution order in Table A1.

The selective read operation was introduced by Gu et al. (2016). It extends the input state to the decoder LSTM with the corresponding encoder hidden states of the tokens being copied. This way the decoder is provided with information about which part of the input has been copied. Specifically, we modify the input state of our decoder LSTM to include the selective read vector. In the following, p_t^gen is the scalar probability that a token is copied at step t,
e_{t−1} ∈ R^n is either the embedding of a generated vocabulary token or a learned vector indicating whether a table, field or question token was copied in step t−1, and ζ_{t−1} ∈ R^n is the selective read vector, which is a weighted sum of the encoder hidden states corresponding to the tokens copied in step t−1:

ζ_{t−1} = Σ_{j: X̃_j = y_{t−1}} (α^{(H)}_{t−1,j} / K) [h_Q; h_S]_j.

Here K = Σ_{j: X̃_j = y_{t−1}} α^{(H)}_{t−1,j} is a normalization term, accounting for the fact that there may be multiple positions equal to y_{t−1} in X̃.

We convert the question and field values into lower-cased character sequences and compute the longest sub-sequence match with heuristically determined matching boundaries. For example, the sentence "how many students keep cats as pets?" matches the cell value "cat" (s_c), and the matched substring is "cat" (s_m). We further search the question starting from the start and end character indices i, j of s_m to make sure that word boundaries can be detected within i − 2 to j + 2; otherwise the match is invalidated. This excludes matches that are sub-strings of longer question words, e.g., "cat" vs. "category".
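The following is a minimal, self-contained sketch of this matching procedure under the thresholds θ_q = 0.5, θ_c = 0.8 and the per-field cap k = 2 stated earlier. It is an approximation: difflib stands in for the longest-substring search and the word-boundary check is simplified, so it is not the released matcher.

```python
import difflib

THETA_Q, THETA_C, K = 0.5, 0.8, 2   # thresholds and per-field cap used in the paper

def longest_match(q: str, cell: str):
    """Longest common substring between the lower-cased question and cell value."""
    m = difflib.SequenceMatcher(None, q, cell).find_longest_match(0, len(q), 0, len(cell))
    return q[m.a:m.a + m.size], m.a, m.a + m.size

def expand_to_words(q: str, i: int, j: int):
    """Expand a character span to full word boundaries (the question phrase s_q)."""
    while i > 0 and q[i - 1].isalnum():
        i -= 1
    while j < len(q) and q[j].isalnum():
        j += 1
    return q[i:j], i, j

def match_field(question: str, picklist):
    """Return up to K anchor texts for one field, preferring longer matches."""
    q = question.lower()
    candidates = []
    for cell in picklist:
        c = cell.lower()
        s_m, i, j = longest_match(q, c)
        if not s_m.strip():
            continue
        s_q, wi, wj = expand_to_words(q, i, j)
        # word boundaries must lie within 2 characters of the raw match span
        if i - wi > 2 or wj - j > 2:
            continue
        beta_q = len(s_m) / len(s_q)   # question match score
        beta_c = len(s_m) / len(c)     # cell match score
        if beta_q >= THETA_Q and beta_c >= THETA_C:
            candidates.append((len(s_m), cell))
    candidates.sort(reverse=True)      # break ties by taking the longer match
    return [cell for _, cell in candidates[:K]]

# Example: rejects "category" but accepts "cat" for the mention "cats".
# match_field("how many students keep cats as pets?", ["cat", "dog", "category"])
```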