key: cord-0059570-njshnejq
authors: Ferreira, Margarida; Terra-Neves, Miguel; Ventura, Miguel; Lynce, Inês; Martins, Ruben
title: FOREST: An Interactive Multi-tree Synthesizer for Regular Expressions
date: 2021-03-01
journal: Tools and Algorithms for the Construction and Analysis of Systems
DOI: 10.1007/978-3-030-72016-2_9
sha: d80cc68f2eae6152405a2b1d8ce0f0150de94a98
doc_id: 59570
cord_uid: njshnejq

Form validators based on regular expressions are often used on digital forms to prevent users from inserting data in the wrong format. However, writing these validators can pose a challenge to some users. We present Forest, a regular expression synthesizer for digital form validations. Forest produces a regular expression that matches the desired pattern for the input values and a set of conditions over capturing groups that ensure the validity of integer values in the input. Our synthesis procedure is based on enumerative search and uses a Satisfiability Modulo Theories (SMT) solver to explore and prune the search space. We propose a novel representation for regular expressions synthesis, multi-tree, which induces patterns in the examples and uses them to split the problem through a divide-and-conquer approach. We also present a new SMT encoding to synthesize capture conditions for a given regular expression. To increase confidence in the synthesized regular expression, we implement user interaction based on distinguishing inputs. We evaluated Forest on real-world form-validation instances using regular expressions. Experimental results show that Forest successfully returns the desired regular expression in 70% of the instances and outperforms Regel, a state-of-the-art regular expression synthesizer.

Regular expressions (also known as regexes) are powerful mechanisms for describing patterns in text with numerous applications. One notable use of regexes is to perform real-time validations on the input fields of digital forms. Regexes help filter invalid values, such as typographical mistakes ('typos') and format inconsistencies. Aside from validating the format of form input strings, regular expressions can be coupled with capturing groups. A capturing group is a sub-regex within a regex that is indicated with parentheses and captures the text matched by the sub-regex inside them. Capturing groups are used to extract information from text and, in the domain of form validation, they can be used to enforce conditions over values in the input string. In this paper, we focus on the capture of integer values in input strings, and we use the notation $i, i ∈ {0, 1, ...} to refer to the integer value of the text captured by the (i + 1)-th group.

This work was supported by NSF award CCF-1762363 and through FCT under project UIDB/50021/2020, and project ANI 045917 funded by FEDER and FCT.

Form validations often rely on complex regexes which require programming skills that not all users possess. To help users write regexes, prior work has proposed to synthesize regular expressions from natural language [1, 9, 12, 27] or from positive and negative examples [1, 7, 10, 26] . Even though these techniques assist users in writing regexes for search and replace operations, they do not specifically target digital form validation and do not take advantage of the structured format of the data.

In this paper, we propose Forest, a new program synthesizer for regular expressions that targets digital form validations. Forest takes as input a set of examples and returns a regex validation.

Motivating Example. Suppose a user is writing a form where one of the fields is a date that must respect the format DD/MM/YYYY. The user wants to accept: 19

As we can see in the motivating example, data inserted into digital forms is usually structured and shares a common pattern among the valid examples. In this example, the data has the shape dd/dd/dddd where d represents a digit. This contrasts with general regexes for search and replace operations that are often performed over unstructured text. Forest takes advantage of this structure by automatically detecting these patterns and using a divide-and-conquer approach to split the expression into simpler sub-expressions, solving them independently, and then merging their information to obtain the final regular expression. Additionally, Forest computes a set of capturing groups over the regular expression, which it then uses to synthesize integer conditions that further constrain the accepted values for that form field.

Input-output examples do not require specialized knowledge and are accessible to users. However, there is one downside to using examples as a specification: they are ambiguous. There can be solutions that, despite matching the examples, do not produce the desired behavior in situations not covered in them. The ambiguity of input-output examples raises the necessity of selecting one among multiple candidate solutions. To this end, we incorporate a user interaction model based on distinguishing inputs for both the synthesis of the regular expressions and the synthesis of the capture conditions. In summary, this paper makes the following contributions:

- We propose a multi-tree SMT representation for regular expressions that leverages the structure of the input to apply a divide-and-conquer approach.
- We propose a new method to synthesize capturing groups for a given regular expression and integer conditions over the resulting captures.
- We implemented a tool, Forest, that interacts with the user to disambiguate the provided specification. Forest is evaluated on real-world instances and its performance is compared with a state-of-the-art synthesizer.

The task of automatically generating a program that satisfies some desired behavior expressed as a high-level specification is known as Program Synthesis. Programming by Example (PBE) is a branch of Program Synthesis where the desired behavior is specified using input-output examples. Our synthesis procedure is split into two stages, each relative to an output component. First, Forest synthesizes the regular expression, which is the basis for the synthesis of capturing groups. Secondly, Forest synthesizes the capture conditions, by first computing a set of capturing groups and then the conditions to be applied to the resulting captures. The synthesis stages are detailed in sections 3 and 4. Figure 1 shows the regex validation synthesis pipeline. Both stages of our synthesis algorithm employ enumerative search, a common approach to solve the problem of program synthesis [4, 5, 10, 17, 21] . The enumerative search cycle is depicted in Figure 2 .

There are two key components for program enumeration: the enumerator and the verifier. The enumerator successively enumerates programs from a predefined Domain Specific Language (DSL). Following Occam's razor, programs are enumerated in increasing order of complexity. The DSL defines the set of operators that can be used to build the desired program. Forest dynamically constructs its DSL to fit the problem at hand: it is as restricted as possible, without losing the necessary expressiveness. The regular expression DSL construction procedure is detailed in section 3.1.

For each enumerated program, the verifier subsequently checks whether it satisfies the provided examples. Program synthesis applications generate very large search spaces; nevertheless, the search space can be significantly reduced by pruning several infeasible expressions along with each incorrect expression found. In the first stage of the regex validation synthesis, the enumerated programs are regular expressions. The enumeration and pruning of regular expressions is described in section 3.2. In the second stage of regex validation synthesis, we deal with the enumeration of capturing groups over a pre-existing regular expression. This process is described in section 4.1.
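The enumerate-and-verify cycle can be illustrated with a minimal Python sketch. It is not Forest's implementation: the finite `candidates` pool is a hypothetical stand-in for the SMT-based enumerator, string length is used as a crude proxy for complexity, and the function names are our own.

```python
import re

def verify(regex, valid, invalid):
    """Verifier: the regex must fully match every valid example and no invalid one."""
    pattern = re.compile(regex)
    matches_all_valid = all(pattern.fullmatch(s) for s in valid)
    matches_some_invalid = any(pattern.fullmatch(s) for s in invalid)
    return matches_all_valid and not matches_some_invalid

def enumerative_search(candidates, valid, invalid):
    """Try candidates in increasing order of size (assumed complexity measure)."""
    for regex in sorted(candidates, key=len):
        if verify(regex, valid, invalid):
            return regex
    return None  # no candidate in the pool is consistent with the examples

pool = ["[0-9]+", "[0-9]{2}", "[0-9]{2}/[0-9]{2}"]
print(enumerative_search(pool, ["19/08"], ["1908", "1/8"]))
```

In this toy run the two shorter candidates fail on the valid example, so the search returns the third one.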

To circumvent the ambiguity of input-output examples, Forest implements an interaction model. A new component, the distinguisher, ascertains, for any two given programs, whether they are equivalent. When 

In this section we describe the enumerative synthesis procedure that generates a regular expression that matches all valid examples and none of the invalid.

Before the synthesis procedure starts, we define which operators can be used to build the desired regular expression and the values each operator can take as argument. Forest's regular expression DSL includes the regex union and concatenation operators, as well as several regular expression quantifiers:

- Kleene closure: r* matches r zero or more times,
- positive closure: r+ matches r one or more times,
- option: r? matches r zero or one times,
- ranges: r{m} matches r exactly m times, and r{m, n} matches r at least m times and at most n times.

The possible values for the range operators are limited depending on the valid examples provided by the user. For the single-valued range operator, r{m}, we consider only the integer values such that 2 ≤ m ≤ l, where l is the length of the longest valid example string. In the two-valued range operator, r{m, n}, the values of m and n are limited to integers such that 0 ≤ m < n ≤ l. The tuple (0,1) is not considered, since it is equivalent to the option quantifier: r{0, 1} = r?. All operators can be applied to regex literals or composed with each other to form more complex expressions. The regex literals considered in the synthesis procedure include the individual letters, digits or symbols present in the examples and all character classes that contain them. The character classes contemplated in the DSL are 
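The limits on the range operators described above can be sketched in a few lines of Python. The function name `range_arguments` is hypothetical and the code only mirrors the bounds stated in the text (2 ≤ m ≤ l for r{m}; 0 ≤ m < n ≤ l for r{m, n}, excluding (0, 1)); it is not taken from Forest's source.

```python
def range_arguments(valid_examples):
    """Enumerate admissible arguments for r{m} and r{m,n} from the valid examples."""
    longest = max(len(s) for s in valid_examples)  # l: length of the longest valid example

    # Single-valued range r{m}: 2 <= m <= l.
    single = list(range(2, longest + 1))

    # Two-valued range r{m,n}: 0 <= m < n <= l, skipping (0, 1), which equals r?.
    pairs = [
        (m, n)
        for m in range(longest)
        for n in range(m + 1, longest + 1)
        if (m, n) != (0, 1)
    ]
    return single, pairs

single, pairs = range_arguments(["ab", "abc"])
print(single)  # [2, 3]
print(pairs)   # [(0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
```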

To enumerate regexes, the synthesizer requires a structure capable of representing every feasible expression. We use a tree-based representation of the search To explore the search space in order of increasing complexity, we enumerate k-trees of lower depths first and progressively increase the depth of the trees as previous depths are exhausted. The enumerator encodes the k-tree as an SMT formula that ensures the program is well-typed. A model that satisfies the formula represents a valid regex. Due to space constraints we omit the k-tree encoding but further details can be found in the literature [2, 17] .

Multi-tree representation. We considered several validators for digital forms and observed that many regexes in this domain are the concatenation of relatively simple regexes. However, the successive concatenation of simple regexes quickly becomes complex in its k-tree representation. Recall the regex for date validation presented in the motivating example:

Even though this is the concatenation of 5 simple sub-expressions, each of depth at most 2, its representation as a k-tree has depth 5, as shown in Figure 3 .

The main idea behind the multi-tree constructs is to allow the number of concatenated sub-expressions to grow without it reflecting exponentially on the encoding. The multi-tree structure consists of n k-trees, whose roots are connected by an artificial root node, interpreted as an n-ary concatenation operator. This way, we are able to represent regexes using fewer nodes. Figure 4 is the multi-tree representation of the same regex as Figure 3 , and shows that the multi-tree construct can represent this expression using half the nodes.

The k-tree enumerator successively explores k-trees of increasing depth. However, multi-tree has two measures of complexity: the depth of the trees, d, and the number of trees, n. Forest employs two different methods for increasing these values: static multi-tree and dynamic multi-tree.

Static multi-tree. In the static multi-tree method, the synthesizer fixes n and progressively increases d. To find the value of n, there is a preprocessing step, in which Forest identifies patterns in the valid examples. This is done by first identifying substrings common to all examples. A substring is considered a dividing substring if it occurs exactly the same number of times and in the same order in all examples. Then, we split each example before and after the dividing substrings. Each example becomes an array of n strings. Then, we apply the multi-tree method with n trees. For every i ∈ {1, ..., n}, the i-th sub-tree represents a regex that matches all strings in the i-th position of the split example tuples and the concatenation of the n regexes will match the original example strings. Since each tree is only synthesizing a part of the original input strings, a reduced DSL is recomputed for each tree.
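The splitting step can be approximated with the following Python sketch. It makes a simplifying assumption that is ours, not the paper's: only single non-alphanumeric characters are considered as dividing substrings. The function name `split_examples` is likewise hypothetical.

```python
import re
from collections import Counter

def split_examples(examples):
    """Split every example before and after its dividing characters.

    Returns one list of n strings per example, or None when no dividing
    character exists (the case where static multi-tree cannot be applied).
    """
    counts = [Counter(e) for e in examples]
    # Candidate dividers: non-alphanumeric characters with the same count everywhere.
    dividers = {
        ch for ch in counts[0]
        if not ch.isalnum() and all(c[ch] == counts[0][ch] for c in counts)
    }
    # The dividers must also appear in the same order in all examples.
    orders = {tuple(ch for ch in e if ch in dividers) for e in examples}
    if not dividers or len(orders) != 1:
        return None
    # A capturing group in re.split keeps the dividers as separate pieces.
    pattern = "([" + "".join(re.escape(ch) for ch in sorted(dividers)) + "])"
    return [re.split(pattern, e) for e in examples]

print(split_examples(["19/08/1996", "26/10/1998", "22/09/2000"]))
```

For date-shaped examples each string is split into five pieces, so n = 5 trees would be used, one per piece.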

Dynamic multi-tree. The dynamic multi-tree method is employed when the examples cannot be split because there are no dividing substrings. In this scenario, the enumerator will still use a multi-tree construct to represent the regex. However, the number of trees is not fixed and all trees use the original, complete DSL. A multi-tree structure with n k-trees of depth d has n × (k^d − 1) nodes. Forest enumerates trees with different values of (n, d) in increasing order of number of nodes, starting with n = 1 and d = 2, a simple k-tree of depth 2.
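To make the enumeration order concrete, the sketch below lists (n, d) pairs by increasing node count, using the node count stated above, `n * (k**d - 1)`. The function name, the `max_nodes` cut-off, and the tie-breaking rule (fewer trees first) are assumptions made for illustration.

```python
def multitree_schedule(k, max_nodes):
    """Return (n, d) pairs ordered by the node count n * (k**d - 1)."""
    if k < 2:
        raise ValueError("k must be at least 2")
    schedule = []
    d = 2  # the text starts from trees of depth 2
    while k ** d - 1 <= max_nodes:
        n = 1
        while n * (k ** d - 1) <= max_nodes:
            schedule.append((n * (k ** d - 1), n, d))
            n += 1
        d += 1
    # Sort by node count; ties are broken by n, then d (an arbitrary choice).
    return [(n, d) for _, n, d in sorted(schedule)]

print(multitree_schedule(k=2, max_nodes=9))  # [(1, 2), (2, 2), (1, 3), (3, 2)]
```

With k = 2, the configurations have 3, 6, 7 and 9 nodes respectively, so two shallow trees are tried before one deeper tree.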

Pruning. We prune regexes which are provably equivalent to others in the search space by using algebraic rules of regular expressions like the following:

To prevent the enumeration of equivalent regular expressions, we add SMT constraints that block all but one possible representation of each regex. Take, for example, the equivalence (r?)+ ≡ r*. We want to consider only one way to represent this regex, so we add a constraint to block the construction (r?)+ for any regex r. Another such equivalence results from the idempotence of union: r|r = r. To prevent the enumeration of expressions of the type r|r, every time the union operator is assigned to a node i, we force the sub-tree underneath i's left child to be different from the sub-tree underneath i's right child by at least one node. When we enumerate a regex that is not consistent with the examples, it is eliminated from the search space. Along with the incorrect regex, we want to eliminate regexes that are equivalent to it. The union operator in the regular expressions DSL is commutative: r|s = s|r, for any regexes r and s. Thus, whenever an expression containing r|s is discarded, we eliminate the expression that contains s|r in its place as well.
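Forest expresses these rules as SMT constraints over the tree encoding. As a rough stand-in, the same three rules can be shown on a tiny tuple-based syntax tree in Python. The tree format, operator names and function names below are hypothetical and chosen only for illustration.

```python
def canonical(node):
    """Canonical form of a regex tree: operands of union are sorted (r|s = s|r)."""
    if isinstance(node, str):  # a literal or character class
        return node
    op, *args = node
    args = [canonical(a) for a in args]
    if op == "union":
        args.sort(key=repr)
    return (op, *args)

def is_blocked(node):
    """True if the tree contains a construction with a simpler equivalent."""
    if isinstance(node, str):
        return False
    op, *args = node
    # (r?)+ is equivalent to r*, so only the r* form is kept.
    if op == "plus" and isinstance(args[0], tuple) and args[0][0] == "option":
        return True
    # r|r is equivalent to r (idempotence of union).
    if op == "union" and canonical(args[0]) == canonical(args[1]):
        return True
    return any(is_blocked(a) for a in args)

# Discarding a failed regex also discards its commutative variant.
failed = {canonical(("union", "a", "b"))}
print(canonical(("union", "b", "a")) in failed)  # True
print(is_blocked(("plus", ("option", "a"))))     # True
```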

To increase confidence in the synthesizer's solution, Forest disambiguates the specification by interacting with the user. We employ an interaction model based on distinguishing inputs, which has been successfully used in several synthesizers [11, 24, 25, 14]. To produce a distinguishing input, we require an SMT solver with a regex theory, such as Z3 [15, 23]. Upon finding two regexes that satisfy the user-provided examples, r_1 and r_2, we use the SMT solver to solve the formula:

\[ \exists s : r_1(s) \neq r_2(s) \qquad (1) \]

where r_1(s) (resp. r_2(s)) is True if and only if r_1 (resp. r_2) matches the string s. A string s that satisfies (1) is a distinguishing input. Forest asks the user to classify this input as valid or invalid, and s is added to the respective set of examples, thus eliminating either r_1 or r_2 from the search space. After the first interaction, the synthesis procedure continues only until the end of the current depth and number of trees.
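The notion of a distinguishing input can be illustrated without an SMT solver. The sketch below swaps Z3's regex theory for a plain bounded brute-force search over strings, which is only practical for tiny alphabets and lengths; the function name and parameters are hypothetical.

```python
import re
from itertools import product

def distinguishing_input(r1, r2, alphabet, max_len):
    """Return a shortest string (up to max_len) matched by exactly one regex."""
    p1, p2 = re.compile(r1), re.compile(r2)
    for length in range(max_len + 1):
        for chars in product(alphabet, repeat=length):
            s = "".join(chars)
            # The two regexes disagree on s: one matches it and the other does not.
            if (p1.fullmatch(s) is None) != (p2.fullmatch(s) is None):
                return s
    return None  # no disagreement found within the bound

print(distinguishing_input("[0-9]{2}", "[0-9]+", alphabet="09", max_len=3))
```

Here the one-character string "0" separates the two candidates: the user's answer on it decides which of them survives.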

In this section we describe the synthesis procedure of the second component of a regex validation: a set of integer conditions over captured values that are satisfied by all valid examples but none of the conditional invalid examples.

To enumerate capturing groups, Forest starts by identifying the regular expression's atomic sub-regexes: the smallest sub-regexes whose concatenation results in the original complete regex. For example, 

To compute capture conditions, we need all conditional invalid examples to be matched by the regular expression. Afterwards, capturing groups are enumerated as described in section 4.1. The number of necessary capturing groups is not known beforehand, so we enumerate capturing groups in increasing number.

A capture condition is a 3-tuple: it contains the captured text, an integer comparison operator and an integer argument. Forest considers only two integer comparison operators, ≤ and ≥. However, the algorithm can be easily expanded to include other operators. Let C be a set of capturing groups and C(x) the integer captures that result from applying C to example string x. Let D_C be the set of all possible capture conditions over capturing groups C. D_C results from combining each capturing group with each integer operator. Finally, let V be the set of all valid examples, I the set of all conditional invalid examples, and X = V ∪ I the union of these two sets.

Given capturing groups C, Forest uses Maximum Satisfiability Modulo Theories (MaxSMT) to select from D_C the minimum set of conditions that are satisfied by all valid examples and none of the conditional invalid ones. To encode the problem, we define two sets of Boolean variables. First, we define s_{cap,x} for every cap ∈ C(x) and x ∈ X. s_{cap,x} = True if capture cap in example x satisfies all used conditions that refer to it. We also define u_cond for all cond ∈ D_C. u_cond = True means condition cond is used in the solution. Additionally, we define a set of integer variables b_cond, for all conditions cond ∈ D_C, that represent the integer argument present in each condition.

Let SMT(cond, x) be the SMT representation of condition cond for example x: the capture is an integer value, and the integer argument is the corresponding b_cond variable. Let D_cap ⊆ D_C be the set of capture conditions that refer to capture cap. Constraint (2) states that a capture cap in example x satisfies all conditions if and only if for every condition that refers to cap either it is not used in the solution or it is satisfied for the value of that capture in that example:

\[ s_{cap,x} \leftrightarrow \bigwedge_{cond \in D_{cap}} \left( \neg u_{cond} \lor \mathrm{SMT}(cond, x) \right) \qquad (2) \]

Since we are looking for the minimum set of capture conditions, we add soft clauses to penalize the usage of capture conditions in the solution:

We consider part of the solution only the capture conditions whose u_cond is True in the resulting SMT model. We also extract the values of the integer arguments in each condition from the model values of the b_cond variables.
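The selection of a minimum set of capture conditions can be mimicked on small inputs without MaxSMT. The Python sketch below replaces the solver with an exhaustive search over subsets of increasing size, and fixes each integer argument to the tightest bound allowed by the valid examples. Both simplifications are ours; the function name and the tuple representation of conditions are hypothetical.

```python
from itertools import combinations

def synthesize_conditions(valid, invalid):
    """valid / invalid: lists of tuples holding the integer captures of each example.

    Returns a smallest list of (capture index, operator, argument) conditions that
    all valid examples satisfy and every conditional invalid example violates.
    """
    n_captures = len(valid[0])
    # One candidate per capture and operator, with the tightest consistent argument.
    candidates = []
    for i in range(n_captures):
        candidates.append((i, "<=", max(v[i] for v in valid)))
        candidates.append((i, ">=", min(v[i] for v in valid)))

    def holds(cond, captures):
        i, op, bound = cond
        return captures[i] <= bound if op == "<=" else captures[i] >= bound

    # Smaller subsets first, so the first hit uses the fewest conditions.
    for size in range(len(candidates) + 1):
        for subset in combinations(candidates, size):
            if all(any(not holds(c, x) for c in subset) for x in invalid):
                return list(subset)
    return None

valid = [(19, 8, 1996), (26, 10, 1998), (1, 12, 2001)]
invalid = [(33, 10, 1997), (12, 13, 1997)]
print(synthesize_conditions(valid, invalid))  # [(0, '<=', 26), (1, '<=', 12)]
```

The bound 26 on the day reflects only what these examples show, which is exactly the kind of ambiguity that later interaction with the user is meant to resolve.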

To ensure the solution meets the user's intent, Forest disambiguates the specification using, once again, a procedure based on distinguishing inputs. Once Forest finds two different sets of capture conditions S_1 and S_2 that satisfy the specification, we look for a distinguishing input: a string c which satisfies all capture conditions in S_1, but not those in S_2, or vice-versa. First, to simplify the problem, Forest eliminates from S_1 and S_2 conditions which are present in both: these are not relevant to compute a distinguishing input. Let S*_1 (resp. S*_2) be the subset of S_1 (resp. S_2) containing only the distinguishing conditions, i.e., the conditions that differ from those in S_2 (resp. S_1).

We do not compute the distinguishing string c directly. Instead, we compute the integer value of the distinguishing captures in c, i.e., the captures that result from applying the regular expression and its capturing groups to the distinguishing input string. We define |C| integer variables, c_i, which correspond to the values of the distinguishing captures: c_0, c_1, ..., c_{|C|−1} = C(c).

As before, let SMT(cond, c) be the SMT representation of each condition cond. Each capture in C(c) is represented by its respective c_i, the operator maintains its usual semantics and the integer argument is its value in the solution to which the condition belongs. Constraint (5) states that c satisfies the conditions in one solution but not the other:

\[ \bigwedge_{cond \in S^*_1} \mathrm{SMT}(cond, c) \neq \bigwedge_{cond \in S^*_2} \mathrm{SMT}(cond, c) \qquad (5) \]

In the end, to produce the distinguishing string c, Forest picks an example from the valid set, applies the regular expression with the capturing groups to it, and replaces its captures with the model values for c i .
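This last step can be sketched with Python's `re` module. The pattern with capturing groups shown here is a hypothetical example, and zero-padding the new value to the width of the original capture is our assumption rather than documented behavior of Forest.

```python
import re

def replace_captures(pattern, example, new_values):
    """Rebuild `example` with selected captures replaced by new integer values.

    new_values maps a capture index i (as in the $i notation) to its new value.
    """
    match = re.fullmatch(pattern, example)
    if match is None:
        raise ValueError("the example must be matched by the regular expression")
    pieces, last = [], 0
    for i in sorted(new_values):
        start, end = match.span(i + 1)  # group numbers in re start at 1
        pieces.append(example[last:start])
        # Keep the original width, e.g. 8 replaces "08" as "08" (assumption).
        pieces.append(str(new_values[i]).zfill(end - start))
        last = end
    pieces.append(example[last:])
    return "".join(pieces)

date_regex = r"([0-9]{2})/([0-9]{2})/[0-9]{4}"  # hypothetical regex with two captures
print(replace_captures(date_regex, "19/08/1996", {0: 32}))  # 32/08/1996
```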

Forest asks the user to classify c as valid or invalid. Depending on the user's answer, c is added as a valid or conditional invalid example, effectively eliminating either S_1 or S_2 from the search space. and get the value c_0 = 32 which satisfies S*_2 (and S_2), but not S*_1 (or S_1). If we pick the first valid example, "19/08/1996" as basis for c, the respective distinguishing input is c = "32/08/1996". Once the user classifies c as invalid, c is added as a conditional invalid example and S_2 is removed from consideration.

Program synthesis has been successfully used in many domains such as string processing [8, 19, 7, 26] , query synthesis [11, 25, 17] , data wrangling [2, 5] , and functional synthesis [3, 6] . In this section, we discuss prior work on the synthesis of regular expressions [10, 1] that is most closely related to our approach.

Previous approaches that perform general string processing [7, 26] restrict the form of the regular expressions that can be synthesized. In contrast, we support a wide range of regular expression operators, including the Kleene closure, positive closure, option, and range. More recent work that targets the synthesis of regexes is done by AlphaRegex [10] and Regel [1]. AlphaRegex performs an enumerative search and uses under- and over-approximations of regexes to prune the search space. However, AlphaRegex is limited to the binary alphabet and does not support the kind of regexes that we need to synthesize for form validations. Regel [1] is a state-of-the-art synthesizer of regular expressions based on a multi-modal approach that combines input-output examples with a natural language description of user intent. They use natural language to build hierarchical sketches that capture the high-level structure of the regex to be synthesized. In addition, they prune the search space by using under- and over-approximations and symbolic regexes combined with SMT-based reasoning. Regel's evaluation [1] has shown that their PBE engine is an order of magnitude faster than AlphaRegex. While Regel targets more general regexes that are suitable for search and replace operations, we target regexes for form validation which usually have more structure. In our approach, we take advantage of this structure to split the problem into independent subproblems. This can be seen as a special case of sketching [22] where each hole is independent. Our pruning techniques are orthogonal to the ones used by Regel and are based on removing equivalent regexes prior to the search and on removing equivalent failed regexes during search. To the best of our knowledge, no previous work focused on the synthesis of conditions over capturing groups.

Instead of using input-output examples, there are other approaches that synthesize regexes solely from natural language [9, 12, 27] . We see these approaches as orthogonal to ours and expect that Forest can be improved by hints provided by a natural language component such as was done in Regel.

Forest is open-source and publicly available at https://github.com/Marghrid/FOREST. Forest is implemented in Python 3.8 on top of Trinity, a general-purpose synthesis framework [13]. All SMT formulas are solved using the Z3 SMT solver, version 4.8.9 [15]. To find distinguishing inputs in regular expression synthesis, Forest uses Z3's theory of regular expressions [23].

To check the enumerated regexes against the examples, we use Python's regex library [18] . The results presented herein were obtained using an Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz, with 64GB of RAM, running Debian GNU/Linux 10. All processes were run with a time limit of one hour.

Benchmarks. To evaluate Forest, we used 64 benchmarks based on real-world form-validation regular expressions. These were collected from regular expression validators in validation frameworks and from regexlib [20] , where users can upload their own regexes. Among these 64 benchmarks there are different formats: national IDs, identifiers of products, date and time, vehicle registration numbers, postal codes, email and phone numbers. For each benchmark, we generated a set of string examples. All 64 benchmarks require a regular expression to validate the examples, but only 7 require capture conditions. On average, each instance is composed of 13.2 valid examples (ranging from 4 to 33) and 9.3 invalid (ranging from 2 to 38). The 7 instances that target capture conditions have on average 6.3 conditional invalid examples (ranging from 4 to 8).

The goal of this experimental evaluation is to answer the following questions: Q1: How does Forest compare against Regel? (section 6.1) Q2: How does pruning affect multi-tree's time performance? (section 6.2) Q3: How does static multi-tree improve on dynamic multi-tree? (section 6.2) Q4: How does multi-tree compare against other encodings? (section 6.3) Q5: How many examples are required to return a correct solution? (section 6.4)

Forest, by default, uses static multi-tree (when possible) with pruning. It correctly solves 31 benchmarks (48%) in under 10 seconds. In one hour, Forest solves 47 benchmarks (73%), with 96% accuracy: only two solutions did not correspond to the desired regex validation. Forest disambiguates only among programs at the same depth, and so if the first solution is not at the same depth as the correct one, the correct solution is never found. After 1 hour of running time, Forest is interrupted, but it prints its current best validation before terminating. After the timeout, Forest returned 3 more regexes, 2 of which were the correct solution for the benchmark. In all benchmarks to which Forest returns a solution, the first matching regular expression is found in under 10 minutes. In 40 benchmarks, the first regex is found in under 10 seconds. The rest of the time is spent disambiguating the input examples. Forest interacts with the user to disambiguate the examples in 27 benchmarks. Overall, it asks 1.8 questions and spends 38.6 seconds computing distinguishing inputs, on average.

Figure 5: Instances solved using different methods

Regarding the synthesis of capture conditions, in 5 of the benchmarks, we need only 2 capturing groups and at most 4 conditions. In these instances, the conditions' synthesis takes under 2 seconds. The remaining 2 benchmarks need 4 capturing groups and take longer: 99 seconds to synthesize 4 conditions and 1068 seconds for 6 conditions. During capture conditions synthesis, Forest interacts 7.14 times and takes 0.1 seconds to compute distinguishing inputs, on average. Table 1 shows the number of instances solved in under 10, 60 and 3600 seconds using Forest, as well as using the different variations of the synthesizer which will be described in the following sections. The cactus plot in Figure 5 shows the cumulative synthesis time on the y-axis plotted against the number of benchmarks solved by each variation of Forest (on the x-axis). The synthesis methods that correspond to lines more to the right of the plot are able to solve more benchmarks in less time. We also compare solving times with Regel [1]. Regel takes as input examples and a natural language description of user intent. We consider not only the complete Regel synthesizer, but also the PBE engine of Regel by itself, which we denote by Regel PBE.

As mentioned in section 5, Regel's synthesis procedure is split into two steps: sketch generation (using a natural language description of desired behavior) and sketch completion (using input-output examples). To compare Regel and Forest, we extended our 64 form validation benchmarks with a natural language description. To assess the importance of the natural language description, we also ran Regel using only its PBE engine. Sketch generation took on average 60 seconds per instance, and successfully generated a sketch for 63 instances. The remaining instance was run without a sketch. We considered only the highest ranked sketch for each instance. In Table 1 we show how many instances can be solved with different time limits for sketch completion; note that these values do not include the sketch generation time. Regel returned a regular expression for 47 instances within the time limit. Since Regel does not implement a disambiguation procedure, the returned regular expression does not always exhibit the desired behavior, even though it correctly classifies all examples. Of the 47 synthesized expressions, 31 exhibit the desired intent. This is a 66% accuracy, which is the same as Forest without disambiguation (Forest's 1st regex) but it is much lower than Forest with disambiguation at 96%. We also observe that Regel's performance is severely impaired when using only its PBE engine. 51 out of the 63 generated sketches are of the form {S_1, ..., S_n}, where each S_i is a concrete sub-regex, i.e., has no holes. This construct indicates the desired regex must contain at least one of S_1, ..., S_n, and contains no information about the top-level operators that are used to connect them. 22 of the 47 synthesized regexes are based on sketches of that form, and they result from the direct concatenation of all components in the sketch. No new components are generated during sketch completion. Thus, most of Regel's sketches could be integrated into Forest, whose multi-tree structure holds precisely those top-level operators that were missing from Regel's sketches.

To evaluate the impact of pruning the search space as described in section 3.2, we ran Forest with all pruning techniques disabled. In the scatter plot in Figure 6a, we can compare the solving time on each benchmark with and without pruning. Each mark in the plot represents an instance. The value on the y-axis shows the synthesis time of multi-tree with pruning disabled and the value on the x-axis the synthesis time with pruning enabled. The marks above the y = x line (also represented in the plot) represent problems that took longer to synthesize without pruning than with pruning. On average, with pruning, Forest can synthesize regexes in 42% of the time and enumerates about 15% of the regexes before returning. There is no significant change in the number of interactions before returning the desired solution.

Forest is able to split the examples and use static multi-tree as described in section 3.2 in 52 benchmarks (81%). The remaining 12 are solved using dynamic multi-tree. To assess the impact of using static multi-tree we ran Forest with a version of the multi-tree enumerator that does not split the examples, and jumps directly to dynamic multi-tree solving. In the scatter plot in Figure 6b , we compare the solving times of each benchmark. Using static multi-tree when possible, Forest requires, on average, less than two thirds of the time (59.1%) to return the desired regex for benchmarks solved by both methods. Furthermore, with static multi-tree Forest can synthesize more complex regexes: the maximum number of nodes in a solution returned by dynamic multi-tree is 12 (avg. 6.7), while complete multi-tree synthesizes regexes of up to 24 nodes (avg. 10.3).

To evaluate the performance of multi-tree enumeration, we ran Forest with two other enumeration encodings: k-tree and line-based. The latter is a state of the art encoding for the synthesis of SQL queries [17] . k-tree is the default enumerator in Trinity [13] , and the line-based enumerator is available in Squares [16] . The k-tree encoding has a very similar structure to that of multi-tree, so our pruning techniques were easily applied to this encoding. On the other hand, line-based encoding is intrinsically different, so the pruning techniques were not implemented. We compare the line-based encoding to multi-tree without pruning. In every other aspect, the three encodings were run in the same conditions, using Forest's regex DSL. k-tree is able to synthesize programs with up to 10 nodes, while the line-based encoding synthesizes programs of up to 9 nodes. Neither encoding outperforms multi-tree.

As seen in Table 1, line-based encoding does not outperform the tree-based encodings for the domain of regexes while it was much better for the domain of SQL queries [17]. We conjecture this disparity arises from the different nature of DSLs. Most SQL queries, when represented as a tree, leave many branches of the tree unused, which results in a much larger tree and SMT encoding.

Afterwards, we reduced the number of examples even further: only 5 valid and 5 invalid. The accuracy of Forest in this setting was reduced to 71%. On average, it interacted 4.3 times per benchmark, which is over two times more than before.

Regexes are commonly used to enforce patterns and validate the input fields of digital forms. However, writing regex validations requires specialized knowledge that not all users possess. We have presented a new algorithm for synthesis of regex validations from examples that leverages the common structure shared between valid examples. Our experimental evaluation shows that the multi-tree representation synthesizes three times more regexes than previous representations in the same amount of time and, together with the user interaction model, Forest solves 70% of the benchmarks with the correct user intent. We verified that Forest maintains a very high accuracy with as few as 10 examples of each kind. We also observed that our approach outperforms Regel, a state-of-the-art synthesizer, in the domain of form validations.

As future work, we would like to explore the synthesis of more complex capture conditions, such as conditions depending on more than one capture. This would allow more restrictive validations; for example, in a date, the possible values for the day could depend on the month. Another possible extension to Forest is to automatically separate invalid from conditional invalid examples, making this distinction imperceptible to the user.
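The date example can be made concrete with a small hand-written validator. This is only a sketch of the kind of check such an extension would target, not something Forest synthesizes; the regex, the `dd/mm/yyyy` format, and the function name are assumptions. In the paper's notation, the condition relates $0 (day) to $1 (month) and $2 (year).

```python
import calendar
import re

# Three capturing groups: $0 = day, $1 = month, $2 = year (dd/mm/yyyy assumed).
DATE_REGEX = re.compile(r"([0-9]{2})/([0-9]{2})/([0-9]{4})")


def is_valid_date(text):
    """Check the format, then a condition that depends on several captures."""
    match = DATE_REGEX.fullmatch(text)
    if match is None:
        return False
    day, month, year = (int(group) for group in match.groups())
    # Single-capture conditions: each bound involves one group only.
    if year < 1 or not 1 <= month <= 12:
        return False
    # Multi-capture condition: the upper bound on the day depends on month and year.
    # calendar.monthrange returns (weekday of first day, number of days in month).
    days_in_month = calendar.monthrange(year, month)[1]
    return 1 <= day <= days_in_month


print(is_valid_date("31/01/2021"), is_valid_date("31/04/2021"))
```

A condition over a single capture, such as $0 <= 31, would accept `31/04/2021`; only a bound on the day that varies with the month rejects it.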

Multi-modal synthesis of regular expressions

Maximal multi-layer specification synthesis

Functional synthesis with examples

Program synthesis using conflict-driven learning

Component-based synthesis of table consolidation and transformation tasks from examples

Manthan: A data driven approach for boolean function synthesis

Automating string processing in spreadsheets using input-output examples

FlashNormalize: Programming by examples for text normalization

Using semantic unification to generate regular expressions from natural language

Synthesizing regular expressions from examples for introductory automata assignments

Query from examples: An iterative, data-driven approach to query construction

Neural generation of regular expressions from natural language with minimal domain knowledge

Trinity: An Extensible Synthesis Framework for Data Science

User interaction models for disambiguation in programming by example

Z3: an efficient SMT solver

Encodings for enumeration-based program synthesis

Python Software Foundation: Python3's regular expression module re

Automated data extraction using predictive program synthesis

Regular Expression Library: www.regexlib.com, accessed on

cvc4sy: Smart and fast term enumeration for syntax-guided synthesis

Program sketching

Symbolic boolean derivatives for efficiently solving extended regular expression constraints

Interactive query synthesis from input-output examples

Synthesizing highly expressive SQL queries from input-output examples

FIDEX: filtering spreadsheet data using examples

SemRegex: A semantics-based approach for generating regular expressions from natural language specifications

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made. The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.