Repositorio Institucional de la Universidad Autónoma de Madrid (https://repositorio.uam.es)

This is an author-produced version of a paper published in:
Expert Systems with Applications: An International Journal 39.3 (2012): 3061-3070
DOI: http://dx.doi.org/10.1016/j.eswa.2011.08.168
Copyright: © 2012 Elsevier B.V. All rights reserved
Access to the published version may require subscription.

Adapting Searchy to Extract Data using Evolved Wrappers

David F. Barrero (a), María D. R-Moreno (a), David Camacho (b)

(a) Universidad de Alcalá. Computer Engineering Department. Escuela Politécnica. Ctra. Madrid-Barcelona km 31,600. 28871 Alcalá de Henares, Madrid, Spain. Phone: (+34) 91-885-69-20, fax: (+34) 885-66-41
(b) Universidad Autónoma de Madrid. Escuela Politécnica Superior. C/ Francisco Tomás y Valiente 11. Ciudad Universitaria de Cantoblanco. 28049 Madrid, Spain. Phone: (+34) 91-497-22-88, fax: (+34) 91-497-22-35

Email addresses: david@aut.uah.es (David F. Barrero), mdolores@aut.uah.es (María D. R-Moreno), david.camacho@uam.es (David Camacho)

Abstract

Organisations need diverse information systems to deal with their increasing requirements in information storage and processing, which yields the creation of information islands and therefore an intrinsic difficulty in obtaining a global view. Being able to provide such a unified view of the (likely heterogeneous) information available in an organisation is a goal that adds value to the information systems and has been the subject of intense research. In this paper we present an extension of a solution named Searchy, an agent-based mediator system specialized in data extraction and integration. Through the use of a set of wrappers, it integrates information from arbitrary sources and semantically translates it according to a mediated schema. Searchy is actually a domain-independent wrapper container that eases wrapper development, providing, for example, semantic mapping. The extension of Searchy proposed in this paper introduces an evolutionary wrapper that is able to evolve wrappers based on regular expressions. To achieve this, a Genetic Algorithm (GA) is used to learn a regex able to extract a set of positive samples while rejecting a set of negative samples.

Keywords: Wrappers, Genetic Algorithms, Information Extraction

1. Introduction

Organisations have to deal with increasing needs for process automation, yielding a growth in the number and size of software applications. As a result the information becomes fragmented: it is placed in different databases, in documents of different formats, or in applications that hide valuable data. This originates the creation of information islands within the organisation, which has a negative impact when users need a global view of the information, increasing the complexity and development costs of applications. Usually ad-hoc applications are developed despite their lack of generality and their maintenance costs. Information Integration [1] is a research area that addresses the several problems that emerge when dealing with such a scenario.

When several organisations are involved in an integration process, the problems associated with the integration increase. Some traditional integration problems, such as information heterogeneity, are amplified, and new problems arise, such as the lack of centralised control over the information systems.
One of the most interesting problems in such a context is how to ensure administrative autonomy, i.e., to limit as much as possible the constraints that the integration might impose on data sources. We have developed a data integration solution called Searchy with the intention of addressing those constraints.

Searchy [2] is a distributed mediator system that provides a virtual unified view of heterogeneous sources. It receives a query and maps it into one or more local queries, then translates the responses from the local schema to a mediated one defined by an ontology and integrates them. It separates the integration issues from the data extraction mechanism, and thus it can be seen as a wrapper container that eases wrapper development. It is based on Web standards like RDF (Resource Description Framework) or OWL (Web Ontology Language). Therefore, Searchy can be easily integrated in other platforms and systems based on the Semantic Web or SOA (Service Oriented Architecture).

Experience using Searchy in production environments has revealed aspects that can be enhanced. One of the most successful wrappers in Searchy was the regex wrapper, a wrapper that extracts data from unstructured documents using a regular expression (or simply regex). Regex is a powerful tool able to extract strings that match a given pattern. Two problems were found related to the use of regex-based wrappers: the need for an engineer (or a specialised user, which we usually denote as wrapper engineer) with specific skills in regex programming, and the lack of an automatic way to handle errors in the extraction process. These problems led us to adapt the Searchy architecture to support evolved wrappers, that is, wrappers based on regex that have been previously generated using Genetic Algorithms (GAs). Such a wrapper uses supervised learning to generate a regex able to extract records automatically from a set of positive and negative samples.

Wrappers in Searchy may implement arbitrarily complex extraction algorithms. The wrapper may, for instance, rely on a Multiagent System (MAS) to generate a composed regular expression able to extract records that match a training set, as described in detail in [3]. This paper describes a wrapper able to evolve a simple regex through a variable-length GA (VLGA) with an automatically generated alphabet and to extract records matching the regex. It also provides an empirical evaluation of these evolved wrappers. A second contribution of this paper is the description of a complex wrapper able to extract data by means of evolutionary regular expressions, that is, regular expressions generated using a GA with supervised learning, without human supervision, from a set of positive and negative samples.

This article is structured as follows. Section 2 provides a general overview of the system architecture. The information retrieval and information integration mechanisms used in Searchy are briefly described in Section 3. The evolved regex wrapper is presented in Section 4, followed by a description of the alphabet construction algorithm in Section 5. Some experiments carried out with the regex wrapper are shown in Section 6. Section 7 describes related work.
Finally, some future steps are outlined and conclusions are summarized.

2. Searchy Architecture

Many properties of Searchy are consequences of two design decisions: the MAS approach [4, 5] and compliance with Web standards. Using a MAS gives Searchy a distributed and decentralised nature well suited for the integration scenario described in the introduction. Web Services are used by Searchy agents as an interface to access their functionalities, while the Semantic Web standards are used to provide an information model for semantic and structural integration [6]. From an architectural point of view, agents were designed to maximise modularity by decoupling integration from extraction issues, easing the implementation of extraction algorithms.

In our architecture each agent is composed of four components, as can be seen in Figure 1. Some of the key properties of Searchy are directly derived from this architecture. These elements are the communication layer, the core, the wrappers and the information source. The next lines describe these components in relation to the FIPA Agent Management Reference Model.

Figure 1: Searchy platform architecture.

Communication layer: It provides features related to communications, such as SOAP message processing, access control and message transport. The communication layer is equivalent to the Message Transport System (MTS) in the FIPA model.

Core: It contains the basic skills used by all the agents, including configuration management, mapping facilities or agent identification. Any feature shared by all the agents is contained in the core. It presents some of the features defined by FIPA for the Agent Management System (AMS); however, they are not equivalent. The AMS is supposed to control the access of the agents to the Agent Platform (AP) and their life cycle, whereas the agent core supports the operation of the wrappers.

Wrapper: A wrapper is the interface between the agent core and a data source, extracting information from the mediated data source. Wrappers are a key point in order to achieve generality and extensibility. Agents in the FIPA model have some similarities with Searchy wrappers from an architectural point of view. An AP in the FIPA model may contain several agents, while each agent in Searchy may contain several wrappers. Both of them are containers for some software asset: agents in the case of FIPA, or wrappers in the case of Searchy.

Data source: It is where the information that is the object of the integration process is stored. Almost any digital information source might be used as a data source. Due to the nature of Searchy, data sources are usually some kind of information system, such as a web server or an index; however, any source of digital information is a potential Searchy data source. There is no equivalent to data sources in the FIPA model.

Figure 1 shows the architecture of a Searchy agent with its four components. Agent interfaces are published through the HTTP server, one of the subsystems of the communication layer. It receives the HTTP request that has been sent by the Searchy client and extracts the SOAP message. In order to provide a first layer of security, the HTTP subsystem filters the request using the Access Control Module. This module is an IP-based filter that enables basic access control. The HTTP server is responsible for SOAP message transport, but the processing of these messages is done by its own module, the SOAP Processing Module.
It processes the SOAP messages and then transfers the operation to the Control Module or returns an error message. Once the message has been successfully processed, the Control Module begins to operate.

The Control Module sets the flow of operations that the different elements involved in the integration must perform, including the wrappers, the Mapping Module, and the Integration Module. The Mapping Module is composed of three subsystems with different responsibilities in the mapping process. The Query Mapping subsystem performs the query rewriting, translating the query from the mediated schema into the local schema, for example, SQL. Meanwhile, the Response Mapping subsystem translates the response from a local schema, like SQL, into RDF following a mediated schema defined by an ontology. Both the Query and Response Mapping subsystems use the Mapping subsystem, which provides them with common services related to mappings and rules management. The way in which the integration and mapping processes operate is described in Section 3. Responsibility for information extraction, as well as for communication among the agents, falls on the wrappers.

In our architecture the coordination among agents is based on an organizational structuring model with two different discovery mechanisms. In the first mechanism each agent has static knowledge about which agents it must query, where it can find them and how to access them. The result is a static hierarchical structure. It is useful in order to adapt a Searchy deployment to the hierarchy of an organisation; however, it cannot take full benefit of a MAS, such as parallelism, the reliability of the whole system is reduced, and it is difficult to integrate in dynamic environments.

To overcome some of these disadvantages a second coordination mechanism has been implemented. Using our previous organizational structuring model, relationships among the agents are not stored within the agents, but externally in a WSDL document that can be fetched by any agent from an HTTP or FTP server. This agent discovery mechanism is simpler than using a UDDI (Universal Description, Discovery, and Integration) directory or a Directory Facilitator (DF) in a FIPA platform. Agents are accessed as another data source, and thus this is done by a set of wrappers responsible for the discovery and communication between Searchy agents: the Searchy and WSDL wrappers. These wrappers implement the coordination mechanism in Searchy; however, the wrappers' main purpose is to extract data from data sources.

At present Searchy includes four ordinary wrappers: SQL, LDAP, Harvest and regex. By means of the SQL and LDAP wrappers, structured data in databases and LDAP directories may be accessed. Using the Harvest wrapper, Searchy can integrate resources available in an intranet, such as HTML, LaTeX, Word, PDF documents and other formats. The support of new data sources is provided by the development of new wrappers. In Section 4 we explain the regex wrapper. There is no restriction on the algorithm and data source that a wrapper might implement: it may be a direct access to a database, a data mining algorithm or data obtained from a sensor. Mapping and integration issues are managed by the agent's core, and thus the wrapper does not have to be concerned with these issues. The next section describes how these tasks are performed.

3. Mapping and Integration in Searchy

Integrating information means dealing with heterogeneity in several dimensions [6].
Technical heterogeneity can be overcome by selecting the proper implementation technology; in our work this has been done using Web Services (WS) as the interface to access the service. Addressing information heterogeneity requires the definition of a global information model, the mediated schema, shared among all the entities involved in the integration process, as well as a mapping mechanism to perform a mapping between the different local information models and the global information model. Defining this model is a critical subject in an information integration system.

Searchy uses the semantic technologies standardized by the W3C (RDF, RDFS and OWL) to represent the integrated information. RDF is basically an abstract data model that can be represented using several syntaxes. Searchy uses RDF serialized with XML to represent the information. This combination of RDF and XML grants interoperability at a structural level. Semantic integration requires an agreement about the meaning of the information in order to deal with semantic heterogeneity. This agreement is achieved by using shared ontologies expressed in RDFS or OWL. Therefore, there must be an explicit agreement among all the actors involved in a Searchy deployment to establish at least one global ontology.

A set of mapping rules is needed in order to map entities expressed according to a local schema into the global schema. Rules are used to map queries to a local schema and responses to the mediated schema. The query format is a tuple of strings, and the Query Mapping subsystem rewrites the query to obtain a query valid for the local data source. The first element of the tuple is a URI that represents the concept to which the query refers, while the second is a string with the content of the concept that is being queried. The query model is simple but sufficient to fulfil the requirements of the application. The translation of the query to the local schema is performed using the Mapping Module (see Figure 1). Mappings are done by means of a string substitution mechanism very similar to the traditional printf() function in C. This mechanism is enough to satisfy the needs in almost all cases. Once a query has been translated, the response of the local information source must be extracted, mapped to a shared ontology and integrated, respectively, by the Response Mapping and Integration subsystems. Response mappings are done in two stages:

1. The response is mapped semantically, conforming to a shared ontology. This is done using the same mechanism as in the Query Mapping subsystem. A critical aspect is to provide a URI identifier for each resource, since RDF requires any resource to be identified. There is no unified way to do this task: each type of wrapper and user policy defines a way to name resources.

2. Each response of each wrapper is integrated in the Integration Module. Integration is based on the URI of the resource returned by the wrappers. When two wrappers return two resources identified by the same URI, the agent interprets that they refer to the same object and thus they are merged.

Figure 2 shows a simple example of an integration process within Searchy. There are two data sources: a relational database and an LDAP directory service. In a first stage the wrappers retrieve the information from the local data source and it is mapped into an RDF model. The mapping is done using the terms defined by an ontology and according to some rules given by the system administrator; a minimal sketch of how such rules can be applied is given below.
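To make this mechanism more concrete, the following minimal Python sketch applies mapping rules written in the "property IS expression" style used in Examples 1 and 2 below. It only illustrates the printf()-like substitution described above; the function name, the simple rule parser and the sample record are hypothetical and are not part of the actual Searchy implementation.

    # Illustrative sketch of the rule-based string substitution; not Searchy's API.
    def apply_rules(rules, record):
        """Map a local record (dict of field -> value) to mediated-schema properties."""
        mapped = {}
        for line in rules.strip().splitlines():
            prop, expr = [part.strip() for part in line.split(" IS ", 1)]
            value = ""
            for token in [t.strip() for t in expr.split("+")]:
                if token.startswith('"') and token.endswith('"'):
                    value += token[1:-1]                   # fixed text
                else:
                    value += str(record.get(token, ""))    # local field
            mapped[prop] = value
        return mapped

    rules = '''
    rdf:about IS "http://www.example.org/" + name
    dc:title IS name + " " + surname
    foaf:family_name IS surname
    '''

    print(apply_rules(rules, {"name": "John", "surname": "Smith"}))
    # {'rdf:about': 'http://www.example.org/John',
    #  'dc:title': 'John Smith', 'foaf:family_name': 'Smith'}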
The ontologies used within the integration process must be shared among all the agents. In general, a one-to-one correspondence between a data field and an ontology term will be defined. Several local fields or fixed texts may compose one value in RDF; this feature helps the administrator define more accurate mappings. The mapping rules defined for the database wrapper in the example shown in Figure 2 are depicted in Example 1.

Example 1: Response mapping rules for the database wrapper

rdf:about IS "http://www.example.org/" + name
dc:title IS name + " " + surname
foaf:family_name IS surname

The first rule defines that the RDF attribute rdf:about is built as the concatenation of the string "http://www.example.org/" and the attribute name as defined in the local schema. The rest of the rules are defined in a similar way. The mapping rules for the directory wrapper can be seen in Example 2.

Example 2: Response mapping rules for the directory wrapper

rdf:about IS "http://www.example.org/" + uid
rdf:type IS foaf:Person
foaf:mbox IS email
foaf:homepage IS web

The wrappers in the example use two vocabularies: Dublin Core and FOAF. Each object retrieved from the data source must be identified by a URI, which in this case is built from local data and a fixed text. The second stage integrates the entities returned by the wrappers. The agent core identifies the two objects as the same object by comparing their URIs and merges the attributes, providing an RDF object with attributes retrieved from two different sources.

Figure 2: Example of the integration process in Searchy, with two data sources, one relational database and a directory.

The Mapping and Integration Modules decouple data integration and mapping from the extraction, and thus it is possible to develop wrappers in Searchy without any concern about these issues. The next section shows an example of how a complex wrapper may be developed using the infrastructure provided by Searchy.

The original architecture of Searchy [3] provided an easy-to-use extraction and integration platform. However, it required human supervision in some parts of the process. One of the most useful wrappers supported by Searchy is the regex wrapper, which is able to extract data from unstructured documents. One problem associated with this wrapper is the need for a wrapper engineer skilled in regex programming. Another problem is error detection, that is, detecting when the wrapper is not extracting data correctly and correcting it.

This led us to complement the original Searchy regex wrapper, which extracts data using a regex created by the wrapper engineer, with an evolved regex agent able to generate a regex from a set of positive and negative examples using a GA. Figure 3 depicts the extended architecture, where the original architecture is extended with control and evolutive agents. The MAS contains three kinds of agents: control, extractor and evolutive agents. The three types of agents share the same agent architecture depicted in Figure 1; they differ, from an architectural point of view, in the wrappers they use. Figure 3 uses solid lines to represent the interaction among the agents and resources, with the exception of interactions that involve a regex, which are represented with dotted lines.

Figure 3: Searchy evolutive agents.

There must be one control agent that receives queries from the user and forwards them to the extractors, which are agents with a regex wrapper. Regex wrappers in the original Searchy architecture obtained the regex from the wrapper engineer, who generated it manually.
When the wrapper detected a failure in the data extraction, i.e., when it was unable to extract data from a source, it notified the wrapper engineer, who had to identify the problem and, in case the regex was incorrectly constructed, generate a new one.

The new architecture aims to automate this procedure, using an evolutive agent that fulfils some of the roles of the wrapper engineer. Extraction agents obtain the regex from the evolutive agents at start-up time, but also when they identify an extraction error. In this case, instead of requesting a new regex from the wrapper engineer, they request it from the evolutive agent. When an evolutive agent is required to generate a new regex, it executes a GA that is described in the next section.

4. Wrapper based on evolved regular expressions

The evolved regex was implemented as a Searchy wrapper using the Searchy wrapper API. When an agent with the evolved regex wrapper is run, the wrapper generates a valid regex by executing the described VLGA with a given training set. Once a suitable regex is generated, the wrapper can begin to extract records from any text file accessible through HTTP or FTP. It does not have to manage any issue related to mapping, since the Mapping Module performs this task.

4.1. Codification

Any GA has to set a way to codify the solution into a chromosome. The VLGA implemented in the wrapper uses a binary genome divided into several genes of fixed length. Each gene codes a symbol σ from an alphabet Σ composed of a set of valid regular expression constructions, as described in Section 5. Some words should be dedicated to how genes code a regex. The alphabet is not composed of single characters but of any valid regex; in this way the search space is restricted, leading to an easier search. These simple regular expressions are the building blocks of all the evolved regex and cannot be divided; thus, we will call them atomic regex. The position (or locus) of a gene determines the position of the atomic regex: the gene in position i is mapped, in the chromosome-to-regex transformation, to an atomic regex in position i. Figure 4 represents a simple example of how the regex ca[tr] could be coded in the GA.

Figure 4: Example of chromosome encoding.

4.2. Evolution strategy

The genetic operators used in the evolution of regular expressions are mutation and crossover. Since the codification relies on a binary representation, the mutation operator is the common bit-flip operation, while the recombination is performed with a cut and splice crossover. Given two chromosomes, this operator selects a random point in each chromosome and uses it to divide the chromosome in two parts; then the parts are interchanged. Obviously, the resulting chromosomes will likely be of different lengths. Selective pressure is introduced by tournament selection, where n individuals are randomly taken from the population and the one that scores the highest fitness is selected for reproduction. A minimal sketch of these operators is given below.
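The following minimal Python sketch illustrates the codification and operators described in Sections 4.1 and 4.2: a binary chromosome of fixed-length genes decoded into a regex over an alphabet of atomic regex, bit-flip mutation, cut and splice crossover and tournament selection. The alphabet, gene length and function names are illustrative assumptions, not the parameters or API of the actual wrapper.

    import random

    # Hypothetical alphabet of atomic regex, indexed by gene value; illustrative only.
    ALPHABET = ["c", "a", "[tr]", r"\d+", r"\w+"]
    GENE_LEN = 3    # bits per gene

    def decode(chromosome):
        """Map a binary chromosome to a regex: gene i -> atomic regex at position i."""
        genes = [chromosome[i:i + GENE_LEN] for i in range(0, len(chromosome), GENE_LEN)]
        symbols = [ALPHABET[int("".join(map(str, g)), 2) % len(ALPHABET)]
                   for g in genes if len(g) == GENE_LEN]
        return "".join(symbols)

    def mutate(chromosome, p=0.003):
        """Bit-flip mutation applied gene-wise over the binary genome."""
        return [1 - b if random.random() < p else b for b in chromosome]

    def cut_and_splice(parent1, parent2):
        """Cut each parent at its own random point and interchange the tails."""
        c1 = random.randint(1, len(parent1) - 1)
        c2 = random.randint(1, len(parent2) - 1)
        return parent1[:c1] + parent2[c2:], parent2[:c2] + parent1[c1:]

    def tournament(population, fitness, n=2):
        """Pick the fittest of n randomly chosen individuals."""
        return max(random.sample(population, n), key=fitness)

    # Example: a chromosome of three genes decoding to the regex "ca[tr]"
    print(decode([0, 0, 0, 0, 0, 1, 0, 1, 0]))    # -> "ca[tr]"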
An elitist strategy has also been used, where some of the best individuals in the population are transferred without any modification to the new generation. In this way it is assured that the best genetic information is not lost.

4.3. Fitness

How the goodness of a solution is measured is a key subject in the construction of a GA. In our case, for each positive example, the proportion of extracted characters is calculated. Then the fitness is calculated by subtracting the average proportion of false positives in the negative example set from the average proportion of characters correctly extracted. In this way, the maximum fitness that a chromosome can achieve is one. This happens when the evolved regex has correctly extracted all the elements of the positive examples while no element of the negative examples has been matched. An individual with a fitness value of one is called an ideal individual.

From a formal point of view, the fitness function that has been adopted in the wrapper uses a training set composed of a positive and a negative subset of examples. Let P be the set of positive samples and Q the set of negative samples, such that P = {p_1, p_2, ..., p_M} and Q = {q_1, q_2, ..., q_N}. Both P and Q are subsets of the set of all strings S and they have no common elements, so P ∩ Q = ∅.

Chromosomes are evaluated as follows. Given a chromosome, it is transformed into the corresponding regex r ∈ R, which is then matched against the elements of P and Q. The set of strings that r extracts from a string p is given by the function φ(p, r) : (S × R) → S, while the number of characters retrieved is represented by |φ(p, r)|. The proportion of extracted characters of p_i, with i = 1, ..., M, is averaged, and finally the fitness is calculated by subtracting the average proportion of false positives in the negative example set from the average proportion of characters correctly extracted, as expressed by equation (1):

F(r) = \frac{1}{|P|} \sum_{p_i \in P} \frac{|\varphi(p_i, r)|}{|p_i|} - \frac{1}{|Q|} \sum_{q_i \in Q} M_r(q_i)    (1)

where |p_i| is the number of characters of p_i, |P| the number of elements of P, |Q| the number of elements of Q, and M_r(q_i) is defined as

M_r(q_i) = \begin{cases} 1 & \text{if } |\varphi(q_i, r)| > 0 \\ 0 & \text{if } |\varphi(q_i, r)| = 0 \end{cases}    (2)

A minimal sketch of this fitness computation is given below.
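As an illustration, the following Python sketch computes the fitness of equations (1) and (2) using the standard re module. The helper names are hypothetical, and details such as the handling of overlapping matches or invalid regex may differ from the actual wrapper.

    import re

    # Illustrative sketch of equations (1) and (2); not the actual implementation.
    def extracted_chars(regex, text):
        """|phi(text, regex)|: number of characters covered by the matches of regex in text."""
        return sum(len(m) for m in re.findall(regex, text))

    def fitness(regex, positives, negatives):
        """F(r): average fraction of characters extracted from the positive samples
        minus the fraction of negative samples that produce any match."""
        try:
            re.compile(regex)
        except re.error:
            return 0.0    # an invalid regex gets the minimum score
        pos = sum(extracted_chars(regex, p) / len(p) for p in positives) / len(positives)
        neg = sum(1 for q in negatives if extracted_chars(regex, q) > 0) / len(negatives)
        return pos - neg

    print(fitness(r"\(\d+\)\d+-\d+", ["(555)123-4567"], ["12-34"]))    # -> 1.0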
5. Zipf's law based alphabet construction

5.1. Preliminary considerations

Section 4.1 has shown how a classical binary codification is used to select one symbol σ from a predefined set Σ of symbols or atomic regex. The construction of Σ is a critical task since it determines the search space, its size and its capacity to express a correct solution. Of course, the simplest approach is to manually select the alphabet; however, this approach would devalue the added value of evolved regex: the automatic generation of the regex.

We can state that the construction of Σ must satisfy three constraints.

1. Σ must be sufficient, i.e., there must exist at least one element r ∈ Σ* such that r is an ideal individual. In other words, it must be possible to construct at least one valid solution using the elements of Σ.

2. Σ must contain the minimum number of elements able to satisfy the sufficiency constraint. Of course, satisfying this condition is a challenging task with deep theoretical implications. From a practical point of view, this constraint can be reformulated as keeping |Σ| as low as possible.

3. Symbol selection must be automatic, with minimal human interaction and a minimal number of parameters.

5.2. Alphabet construction algorithm

To reduce the number of elements of Σ, and keep the search space as small as possible, we aim to identify patterns in the positive samples and use them as building blocks. In order to satisfy the previous constraints we propose the following algorithm. Σ is built as the union of F, D and T, where F is the set of fixed symbols, D the set of delimiters and T the set of tokens:

\Sigma = \{\sigma_i \mid \sigma_i \in F \cup D \cup T\}    (3)

F contains hand-created reusable symbols that are meant to be common cross-domain regex, and thus, once they have been defined, they can be used to evolve different regex. It should be noticed that F may contain any valid regex; nevertheless, it is supposed to contain generic-use regex such as \d+ or [1-9]+. Since F is supposed to include commonly used complex regex, it contributes to reducing the search space and increasing individual fitness by introducing high-fitness building blocks.

The sets D and T are constructed using a more complex mechanism based on Zipf's Law [7]. It states that the occurrences of words in a text are not uniformly distributed; rather, only a very limited number of words concentrates a high number of occurrences. This fact can be used to identify patterns in P and use them to construct a part of Σ. Since the tokens do not contain delimiters, the sufficiency constraint cannot be satisfied with tokens alone, so each delimiter that appears in the examples is included in the set D. The overall process is described in Algorithm 1.

Algorithm 1 Selection of alphabet tokens.
1.  P := set of positive examples
2.  S := set of candidate delimiters
3.  D := T := { }
4.
5.  for each p in P
6.      for each s in S
7.          tokens := split p using s
8.          numberTokens := number of tokens
9.
10.         for each token in tokens
11.             occurrence(token) := occurrence(token) + 1
12.         endfor
13.
14.         if (numberTokens > 0) add s to D
15.     endfor
16. endfor
17.
18. sort occurrence
19. add the n first elements of occurrence to T

Of course, |Σ| must be equal to the number of elements of the union of F, D and T, as expressed in equation (4):

|\Sigma| = |F \cup D \cup T|    (4)

given that

|F \cap D \cap T| = |F \cap D| = |F \cap T| = |D \cap T| = 0    (5)

A minimal sketch of the token selection of Algorithm 1 is given below.
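The following Python sketch implements the token selection of Algorithm 1 under some simplifying assumptions: a hypothetical set of candidate delimiters, plain string splitting, and a parameter n for the number of most frequent tokens kept in T. It is illustrative only and may differ from the actual wrapper.

    from collections import Counter

    def build_alphabet_tokens(positives, candidate_delimiters, n):
        """Zipf-based selection of delimiters (D) and the n most frequent tokens (T),
        following Algorithm 1; splitting and the delimiter test are simplifications."""
        occurrence = Counter()
        delimiters = set()
        for p in positives:                              # line 5
            for s in candidate_delimiters:               # line 6
                tokens = [t for t in p.split(s) if t]    # line 7, empty pieces dropped
                for token in tokens:                     # lines 10-12
                    occurrence[token] += 1
                if s in p:                               # line 14, read as "s actually delimits p"
                    delimiters.add(s)
        top_tokens = [tok for tok, _ in occurrence.most_common(n)]    # lines 18-19
        return delimiters, top_tokens

    # Hypothetical example with phone-like positive samples
    D, T = build_alphabet_tokens(["(555)123-4567", "(555)987-6543"], ["(", ")", "-"], n=3)
    print(sorted(D), T)    # delimiters found and the three most frequent tokens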
5.3. Complexity analysis

A better understanding of the algorithm can be achieved by a time complexity analysis. As can be seen in Algorithm 1, there are two main loops (lines 5 and 6) that depend on the number of examples |P| and the number of potential delimiters |S|. The complexity of the algorithm is given by these loops and the operations performed inside them. Splitting a string p_i ∈ P (line 7) is proportional to the length of the string |p_i|, so the mean time required to perform this operation is proportional to the mean string length |p|. Lines 10 to 12 contain a loop that is repeated as many times as there are tokens in the string. A hash table is accessed inside the loop (line 11), so it makes sense to suppose that its complexity is given by the computation of the key, a string; therefore its time complexity is n|p|, where n is the number of tokens. Finally, sorting occurrence can be performed in n_tot log(n_tot), where n_tot is the number of tokens stored in occurrence. The rest of the operations in the algorithm can be performed in negligible time. We can express these considerations in equation (6):

t \propto |P| \cdot |S| \cdot [\,|p| + n|p|\,] + n_{tot} \log(n_{tot})    (6)

Both n and n_tot are unknown and we have to estimate them for the average case. A string p ∈ P of length |p| can contain approximately a maximum of |p|/2 tokens, since we have supposed there is one delimiter for each token. The maximum number of tokens that can be stored in occurrence is |P| · |S| · |p| / 2. Then

n = \frac{|p|}{2}    (7)

n_{tot} = \frac{|P| \cdot |S| \cdot |p|}{2}    (8)

and equation (6) can be expressed as

t \propto |P| \cdot |S| \cdot |p| \left[ 1 + \frac{|p|}{2} \right] + \frac{|P| \cdot |S| \cdot |p|}{2} \log\left( \frac{|P| \cdot |S| \cdot |p|}{2} \right)    (9)

Some terms can be removed:

t \propto \frac{|P||S||p|}{2} \log\left( \frac{|P| \cdot |S| \cdot |p|}{2} \right)    (10)

Using Big-Oh notation, it follows that the time complexity is given by

O(k \log(k))    (11)

where k = |P||S||p|, and therefore we can conclude that the time complexity is linearithmic.

6. Evaluation

Two phases have been used in the evaluation: a first phase where the basic behaviour of the GA is analysed, and a second phase that uses the knowledge acquired in the first phase to measure the extraction capabilities of the evolved regex wrapper. The measures that have been used are the well-known precision, recall and F-measure. The sets of experiments described in this section are focused on the extraction of three types of data: URLs, phone numbers and email addresses.

6.1. Parameter tuning

Some initial experiments were carried out to acquire knowledge about the behaviour of the regex evolution and to select the GA parameters to be used within the wrapper. The experiments showed that, despite the differences between phones, URLs and emails, all the case studies have similar behaviours. In this way it is possible to extrapolate the experimental results and thus to use the same GA parameters. Setup experiments showed that the best performance is achieved with a mutation probability of 0.003 and a tournament size of 2 individuals. A population composed of 50 individuals is a good trade-off between computational resources and convergence speed. The initial population has been randomly generated with chromosome lengths ranging from 4 to 40 bits, and elitism of size one has been applied. Table 1 summarizes the parameter values used in the experiments.

Table 1: GA parameters summary.
Parameter                   Value
Population                  50
Mutation probability        0.003
Crossover probability       1
Tournament size             2
Elitism                     1
Initial chromosome length   4 - 40

6.2. Regex evolution

Once the main GA parameters have been set, the wrapper can evolve the regex. The experiments have used three datasets to evolve regex able to extract records in the three case studies under study. Figure 5 depicts the mean best fitness (MBF) and mean average fitness (MAF) of 100 runs.

Figure 5: Best and average fitness of phone, URL and email regex.

The fitness evolution of the case studies follows a similar path. The best MBF and MAF are achieved by the email regex, while the poorest performance is given by the URL regex, with lower fitness values.

The dynamics of the chromosome length can be observed in Figure 6. It is clear that there is a convergence of the chromosome length, and thus chromosome bloating does not appear. This can be explained by the lack of non-coding and overlapping regions in the chromosome, i.e., if the chromosome has achieved a maximum it can hardly increase its size without a penalty in its fitness. The longer the chromosome, the more restrictive the phenotype, and this is closely related to the associated fitness. The URL regex has a stronger tendency towards local maxima; this fact is reflected in Figure 5, where a lower MBF and MAF are achieved.
This fact also explains why the URL chromosome length depicted in Figure 6 is shorter than that of the phone regex: the local maxima of the URL regex tend to generate populations with an insufficient chromosome length. Those results are not surprising, since URLs follow a far more complex pattern than phone numbers or emails. The same can be said about emails in comparison to phone numbers.

Figure 6 shows another interesting behaviour. As the GA begins to run, the average chromosome length is reduced until a point where it begins to increase; then the chromosome length converges to a fixed value. In early generations individuals have not yet undergone evolution and thus their genetic code has a strongly random nature. Individuals with longer genotypes have longer phenotypes and thus more restrictive regex that will likely have smaller fitness values. So long chromosomes are discarded in the early stages of the evolutive process until the population is composed of individuals representing basic phenotypes; then recombination increases the complexity of individuals until they reach a length associated with a local or global maximum.

Figure 6: Evolution of regular expressions.

Figure 7: Probability of finding an ideal regex able to accept all the positive examples while rejecting the negative ones.

Some of the facts found previously are confirmed by Figure 7, where the success rate (SR) [8] is depicted versus the generation. SR is defined as the probability of finding an ideal individual in a given generation. It should be noted that Figure 7 depicts the average success rate of 100 runs of the experiment. It can be seen that email achieves a SR of 91%, phone numbers 54% and URLs 63% in generation 70. These results are consistent with those in Figure 6 and show that the hardest case studies are URLs, phone numbers, and emails, in that order. Here the term "hard" should not be understood in a strict absolute way, since the hardness of the search space is influenced by several factors, such as the training set, the selection of negative samples or the alphabet chosen.

6.3. Data extraction

Three regex with an ideal fitness of one have been selected by the wrapper and their extraction capabilities have been evaluated by means of precision, recall and F-measure. The experiments used a dataset composed of eight sets of documents from different origins containing URLs, emails and/or phone numbers. Table 2 shows basic information about the datasets and their records, and Table 3 contains some evolved regex with their fitness values.

Table 2: Extraction capacity of the evolved regex. The table shows the number of records of each type in each set and the F-measure (F), precision (P) and recall (R) achieved by the phone, URL and email regex.

                Records                 Phone regex         URL regex           Email regex
         Phone   URL    Email     F     P     R       F     P     R       F     P     R
Set 1      99      0       0      1     1     1       -     -     -       -     -     -
Set 2       0     51       0      -     -     -       0.24  0.14  0.84    -     -     -
Set 3       0      0     862      -     -     -       -     -     -       0.79  0.51  0.62
Set 4      20     77       0      1     1     1       0.27  0.16  1       -     -     -
Set 5      37    686       0      1     1     1       0.20  0.11  0.97    -     -     -
Set 6      24    241       0      1     1     1       0.02  0.01  0.37    -     -     -
Set 7      83      0      88      0.92  1     0.96    -     -     -       0.92  1     0.96
Set 8       0     51       0      -     -     -       0.63  0.47  0.96    -     -     -
Avg.        -      -       -      0.98  1     0.99    0.27  0.18  0.83    0.85  0.79  0.79

Sets one, two and three are composed of examples extracted from the training set.
The rest of the sets are web pages retrieved from the Web and classified by their contents. An extracted string has been evaluated as correctly extracted if and only if it matches the record exactly; otherwise it has been counted as a false positive.

The results, as can be seen in Table 2, are quite satisfactory for phone numbers and for the testing sets, but the measures get worse for real raw documents, especially the ones containing URL records. The phone regex achieves nearly perfect extraction, with an F-measure value close to 1. The training set used to evolve the regex contains phone numbers in a simple format, (000)000-0000, the same that can be found in the testing set; the reduction of recall in set 7 is due to the presence of phone extensions that are not extracted.

On the contrary, the measures achieved for URL extraction from raw documents are much lower. This can be explained by looking at the regex used in the extraction, http://\w+\.\w+\.com. Documents used in the test contain many URLs with paths, so the regex is only able to partially extract them, increasing the count of false positives. The result is a poor precision. An explanation for the poor recall measures in URL extraction is found in the fact that the evolved regex is only able to extract URLs whose first-level domain is .com, so its recall in documents with a high presence of first-level domains in other forms is worse.

Finally, the email regex achieves an average F-measure of 0.85. Some of the factors that limit the URL regex extraction capabilities also limit the email regex. However, in this case the effects are not so severe, for several reasons, for instance the lower percentage of addresses with more than two levels.

Table 3: Some examples of evolved regular expressions with their fitness values.

Evolved regex (Phone)      Fitness
\w+                        0
\(\d+\)                    0.33
\(\d+\)\d+                 0.58
\(\d+\)\d+-\d+             1

Evolved regex (URL)        Fitness
http://-http://http://     0
/\w+\.                     0.55
http://\w+\.\w+            0.8
http://\w+\.\w+\.com       1

Evolved regex (Email)      Fitness
\w+\.                      0.31
\w+\.\w+                   0.49
\w+@\w+\.com               1

7. Related Work

The use of ontologies [9] has attracted the attention of the data integration community in recent years. Ontologies have provided a tool to define mediated schemas focused on knowledge sharing and interoperability, in contrast with traditional database-centric schemas, whose goal is to query single databases [10]. The adoption of ontologies has led to reusing results achieved by the database and the AI communities to solve similar problems, such as schema mapping or entity resolution. A deep discussion about the role of ontologies in data integration can be found in [11].

A collection of semantic solutions based on ontology technologies predates the development of the Semantic Web. An introduction to this group of solutions can be found in [6]. We can remark on classical literature examples such as InfoSleuth [12] or SIMS [13]. Among these systems, we have to stress InfoSleuth, a solution that uses a MAS.

In recent years semantic integration tools have adopted WS standards and technologies. One of the first ones can be found in [14]. Vdovjak proposes a semantic mediator for querying heterogeneous information sources, but limited to XML documents; furthermore, this solution relies on a wrapper layer that translates the local entities into XML, and then the RDF is generated.
A step forward is taken by Michalowski with Building Finder [15], a domain-specific mediator system aimed at retrieving and integrating information about streets and buildings from heterogeneous sources, presented to the user within satellite images. [16] describes an information integration tool that covers all the phases of integration, such as assisted mapping definition and query rewriting.
Some techniques such as ranking or information filtering with a case based reasoning or collaborative filtering are considered to provide some intelligence to the system that will produce better user satisfaction. Along this paper we have briefly presented the problem of information extraction and integration and we have proposed an extension of a partial solution called Searchy. This exten- sion aims to automatice some of the tasks asigned originally to the wrapper engineer thought some agents able to use Machine Learning to generate regex. When an extractor agent requires a regex (it can be in initialization time or because it cannot ex- tract data with a given regex) it request one to a evolutive agent that using a set of positive and negative examples and a Genetic Algorithm is able to generate the regex. 9. Acknowledgements The authors gratefully acknowledge Martı́n Knoblauch for his useful suggestions and valuable comments. This work has been partially supported by the Spanish Ministry of Science and Innovation under the projects ABANT (TIN 2010-19872), COMPUBIODIVE (TIN2007-65989) and by Castilla-La Man- cha project PEII09-0266-6640. References [1] L. Haas, Beauty and the beast: The theory and practice of information integration, ICDT 2007 (2007) 28–43. [2] D. F. Barrero, M. D. R-Moreno, D. R. López, Information integration in searchy: an ontology and web services approach, International Journal of Computer Science and Applications (IJCSA) 7 (2010) 14–29. [3] D. F. Barrero, D. Camacho, M. D. R-Moreno, In Data Mining and Mul- tiagent Integration. Chapter 9: Automatic Web Data Extraction Based on Genetic Algorithms and Regular Expressions (2009), Springer, pp. 143– 154. [4] F. Wan, M. P. Singh, Commitments and causality for multiagent design, in: A. Pres (Ed.), 2nd International Joint Conference on Autonomous Agents and multiagent Systems (AAMAS), Melbourne, Australia. [5] R. Aler, J. M. Valls, D. Camacho, A. Lopez, Programming robosoccer agents by modeling human behavior, Expert Systems with Applications 36 (2009) 1850–1859. [6] H. Wache, T. Vögele, U. Visser, H. Stuckenschmidt, G. Schuster, H. Neu- mann, S. Hübner, Ontology-based integration of information — a survey of existing approaches, in: IJCAI–01 Workshop: Ontologies and Infor- mation Sharing, Seattle, Washington, USA, pp. 108–117. [7] G. Zipf, The Psycho-Biology of Language, Houghton Mifflin, Boston, MA, 1935. [8] D. F. Barrero, D. Camacho, M. D. R-Moreno, Confidence intervals of success rates in evolutionary computation, in: Proceedings of Genetic and Evolutionary Computation Conference (GECCO-2010)., ACM, Portland, Oregon,USA, 2010, pp. 975–976. [9] T. Grubber, A translation approach to portable ontology specifications, Knowledge Acquisition 5 (1993) 199–220. [10] M. Uschold, M. Gruninger, Ontologies and semantics for seamless con- nectivity, SIGMOD Rec. 33 (2004) 58–64. [11] N. F. Noy, Semantic integration: a survey of ontology-based approaches, SIGMOD Rec. 33 (2004) 65–70. [12] M. H. Nodine, J. Fowler, T. Ksiezyk, T. Perry, M. Taylor, A. Unruh, Ac- tive information gathering in infosleuth, International Journal of Cooper- ative Information Systems 9 (2000) 3–28. [13] C. A. Knoblock, J.-L. Ambite, Agents for information gathering, in: J. M. Bradshaw (Ed.), Software Agents, AAAI Press / The MIT Press, 1997, pp. 347–374. [14] R. Vdovjak, G. Houben, Rdf based architecture for semantic integration of heterogeneous information sources, in: E. Simon, A. 
[15] M. Michalowski, J. Ambite, S. Thakkar, R. Tuchinda, C. Knoblock, S. Minton, Retrieving and semantically integrating heterogeneous data from the web, IEEE Intelligent Systems 19 (2004) 72–79.
[16] J. Yuan, A. Bahrami, C. Wang, M. Murray, A. Hunt, A semantic information integration tool suite, in: VLDB '06: Proceedings of the 32nd International Conference on Very Large Data Bases, VLDB Endowment, 2006, pp. 1171–1174.
[17] P. Deepti, B. Majumdar, Semantic web services in action - enterprise information integration, in: B. J. Kramer, K.-J. Lin, P. Narasimhan (Eds.), ICSOC, volume 4749 of Lecture Notes in Computer Science, Springer, 2007, pp. 485–496.
[18] F. Zhu, M. Turner, I. A. Kotsiopoulos, K. H. Bennett, M. Russell, D. Budgen, P. Brereton, J. Keane, M. Rigby, J. Xu, Dynamic data integration using web services, in: IEEE International Conference on Web Services (ICWS'04), San Diego, California, USA, pp. 262–269.
[19] L. Kerschberg, M. Chowdhury, A. Damiano, H. Jeong, S. Mitchell, J. Si, S. Smith, Knowledge Sifter: Ontology-driven search over heterogeneous databases, in: 16th International Conference on Scientific and Statistical Database Management (SSDBM 2004), IEEE Computer Society, Santorini Island, Greece, 2004, pp. 431–432.
[20] A. D. Sarma, X. Dong, A. Halevy, Bootstrapping pay-as-you-go data integration systems, in: SIGMOD '08: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, ACM, New York, NY, USA, 2008, pp. 861–874.
[21] D. Caragea, J. Pathak, J. Bao, A. Silvescu, C. Andorf, D. Dobbs, V. Honavar, Information integration and knowledge acquisition from semantically heterogeneous biological data sources, in: Proceedings of the 16th International Workshop on Database and Expert Systems Applications, Springer-Verlag, 2005, pp. 175–190.
[22] D. Camacho, M. D. R-Moreno, D. F. Barrero, R. Akerkar, Semantic wrappers for semi-structured data extraction, Computing Letters (COLE) 4 (2008) 1–14.
[23] Y.-H. Hu, L. Ge, Learning ranking functions for geographic information retrieval using genetic programming, Journal of Research and Practice in Information Technology 41 (2009) 39–52.