key: cord-0230588-kyx4xnh6 authors: Lith, John W. van; Vanschoren, Joaquin title: From Strings to Data Science: a Practical Framework for Automated String Handling date: 2021-11-02 journal: nan DOI: nan sha: 70de85e73a9aa43ccb234f8e3f50efdfab2bde0c doc_id: 230588 cord_uid: kyx4xnh6

Many machine learning libraries require that string features be converted to a numerical representation for the models to work as intended. Categorical string features can represent a wide variety of data (e.g., zip codes, names, marital status), and are notoriously difficult to preprocess automatically. In this paper, we propose a framework to do so based on best practices, domain knowledge, and novel techniques. It automatically identifies different types of string features, processes them accordingly, and encodes them into numerical representations. We also provide an open source Python implementation to automatically preprocess categorical string data in tabular datasets and demonstrate promising results on a wide range of datasets.

Datasets acquired from the real world often contain categorical string data, such as zip codes, names, or occupations. Many machine learning algorithms require that such string features be converted to a numerical representation to work as intended. Depending on the type of data, specific processing is required. For example, geographical string data (e.g., addresses) may be best expressed by latitudes and longitudes. Data scientists are required to manually preprocess such unrefined data, which takes a significant amount of time, up to 60% of their day [10]. Automated data cleaning tools exist but often fail to robustly address the wide variety of categorical string data. This paper presents a framework that systematically identifies various types of categorical string features in tabular datasets and encodes them appropriately. We also present an open-source Python implementation that we evaluate on a wide range of datasets.

Our framework addresses a range of challenges. First, type detection aims to identify predefined 'types' of string data (e.g., dates) that require special preprocessing. Probabilistic Finite-State Machines (PFSMs) [9] are a practical solution based on regular expressions and can produce type probabilities. They can also detect missing or anomalous values, such as numeric values in a string column. Statistical type inference predicts a feature's statistical type (e.g., ordinal or categorical) based on the intrinsic data distribution. Valera et al. [29] use a Bayesian approach to discover whether features are ordinal, categorical, or real-valued, although some manual assessment is still needed. Other techniques predict classes from properties extracted from the data using random forest and gradient boosting classifiers [11], which could be leveraged to predict the statistical type. Encoding techniques convert categorical string data to numeric values, which is challenging because there may be small errors (e.g., typos) and intrinsic meaning (e.g., a time or location). Cerda et al. [7, 8] use string similarity metrics and min-hashing to tackle morphological variants of the same string. Geocoding APIs (e.g., pgeocode and geopy) can convert geographical strings to coordinates [3, 5]. For ordinal string data, heuristic approaches exist that can determine order based on antonyms, superlatives, and quantifiers, e.g., using WordNet [22] or sentiment intensity analyzers (e.g., VADER, TextBlob, and FlairNLP) [6, 13, 20].
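As a small illustration of this sentiment-based ordering idea, the sketch below ranks a few made-up ordinal categories with the vaderSentiment package (one of the analyzers cited above); it only illustrates the heuristic and is not the framework's own implementation.

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Order candidate ordinal categories by VADER's compound sentiment score.
analyzer = SentimentIntensityAnalyzer()
categories = ["very bad", "good", "very good", "bad"]
ordered = sorted(categories, key=lambda c: analyzer.polarity_scores(c)["compound"])
print(ordered)  # expected: ['very bad', 'bad', 'good', 'very good']
```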
Methods have been proposed that recognize, categorize, and process different string entities based on regular expressions [28] or domain knowledge, and these can outperform human experts [12, 14]. Data cleaning tools exist that are manually operated [1, 2, 4], semi-automated [23, 30], or fully automated [15, 16, 17, 21, 26]. At present, however, the automated tools either do not focus on string handling [15, 16, 26] or focus on specific steps such as error correction [17, 21].

Our framework, shown in Fig. 1, is designed to detect and appropriately encode different types of string data in tabular datasets. First, we use PFSMs to infer whether a column is numerical, a known type of string feature (e.g., a date), or any other type of 'standard' string data. Based on this first categorization, appropriate missing value and outlier handling methods are applied to the entire dataset to repair inconsistencies. Next, columns with recognized string types go through intermediate type-specific processing, while the remaining columns are classified based on their statistical type (e.g., nominal or ordinal). Finally, the data is encoded by applying the most fitting encoding for each feature.

In the first step, we build on PFSMs and the ptype library [9]. We created PFSMs based on regular expressions for nine types of string features: coordinates, days, e-mail addresses, filepaths, months, numerical strings, sentences, URLs, and zip codes. A detailed description of each of these can be found in Appendix A. Next, missing values are imputed based on the missingness of the data according to Little's test [19, 27] (missing at random, missing not at random, missing completely at random), using mean/mode imputation or a multivariate imputation technique [24]. Minor typos are corrected using string metrics [18], and data type outliers are corrected if applicable, to ensure robustness in the remaining steps.

Next, we perform intermediate processing of all the string types identified by the PFSMs. First, we simplify the strings, for instance by only keeping the nouns in sentences. Second, we assign or perform specific encoding techniques, such as replacing a date with year-month-day values. Third, we include additional information, such as fetching latitude and longitude values for zip codes.

String features not identified by the PFSMs are marked as 'standard' strings. For these, we infer their statistical type, i.e., whether they contain ordered (ordinal) data or unordered (nominal) data. The prediction is based on eight properties extracted from the feature, shown in Fig. 2.

Finally, an appropriate encoding is applied for each categorical string feature. For the string types identified by the PFSMs, a predefined encoding is applied, per Appendix A. For nominal data, we use category encoders and target encoding due to their robustness to morphological variants and high-cardinality features [7, 8]. We apply the similarity, Gamma-Poisson, and min-hash encoders when the cardinality is below 30, below 100, and at least 100, respectively. For ordinal data, the ordering is defined by a text sentiment intensity analyzer (FlairNLP [6]) that converts string values (e.g., very bad, very good) to an ordered encoding.
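A minimal sketch of this cardinality-based choice is shown below, assuming a pandas Series `column` and the dirty_cat package, whose SimilarityEncoder, GapEncoder (Gamma-Poisson factorization), and MinHashEncoder implement the encoders of Cerda et al.; the thresholds follow the text, the component counts are illustrative, and this is not the framework's own code.

```python
from dirty_cat import GapEncoder, MinHashEncoder, SimilarityEncoder


def pick_nominal_encoder(column):
    """Pick an encoder for a nominal string column based on its cardinality."""
    cardinality = column.nunique()
    if cardinality < 30:
        return SimilarityEncoder()
    if cardinality < 100:
        return GapEncoder(n_components=10)   # Gamma-Poisson factorization
    return MinHashEncoder(n_components=30)   # high-cardinality columns
```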
We evaluate our framework and its individual components on a range of real-world datasets with categorical string features, listed in Appendix C. We evaluate the downstream performance of gradient boosting models trained on the encoded data. Boosting models are intrinsically robust against high-dimensional data, thus allowing a fairer comparison. We use stratified 5-fold cross-validation, and measure accuracy for classification tasks and MAE for regression tasks (a minimal sketch of this protocol is given at the end of this section).

Global framework evaluation. First, we evaluate the framework as a whole and compare it to a baseline where the data is manually preprocessed using mean/mode imputation and ordinal or target encoding for the categorical string features. Fig. 3 shows the relative performance differences for each of the five folds and their mean. These results indicate that our framework performs well on real-world data. On the winemag-130k dataset, the automated encoding proves suboptimal, which is likely tied to the specific heuristics used.

Feature type inference. In Table 1, we compare the predictions of our PFSMs against the ground truth feature types. Most PFSMs report perfect accuracy. Filepaths and sentences are detected with 70-80% accuracy, since the exact format of such data can be unexpected. Outliers are also detected correctly, except for the sentence PFSM, where 42158 out of 130217 entries were false negatives, which is certainly a point for improvement.

Processing inferred string features. Fig. 4 compares the performance of our framework with and without intermediate processing for the string feature types identified by the PFSMs. For some features, this processing yields a 10% performance improvement, while for others performance remains about the same. This processing does require extra time, caused by API latency and text processing, yet it seems worth it for zip codes and sentences. Moreover, the reduced string complexity (removing redundant words) and the conversion of numerical strings (e.g., '> 10') into numerical representations reduce the training time by at least ten percent on half of the datasets.

Statistical type prediction. Tables 2a and 2b evaluate the gradient boosting classifier that predicts whether standard string features are nominal or ordinal, by comparison against the ground truth using leave-one-out cross-validation. These predictions are highly accurate, with very few misclassifications. Hence, our eight extracted features are highly indicative of ordinality in the feature values.

Ordinal encoding. Finally, we compare the ordinal encoding based on sentiment intensity (FlairNLP) against the baseline ordinal encoding by comparing both to an oracle with the ground-truth ordering.
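The sketch below illustrates the downstream evaluation protocol for a classification task; it assumes a numerically encoded feature table `X` and a target `y` (both hypothetical), and regression tasks would use a gradient boosting regressor scored with MAE instead.

```python
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score


def downstream_accuracy(X, y):
    """Mean accuracy of a gradient boosting model under stratified 5-fold CV."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    scores = cross_val_score(HistGradientBoostingClassifier(), X, y,
                             cv=cv, scoring="accuracy")
    return scores.mean()
```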
The automation of data cleaning is a fledgling open research field. We presented a framework that combines state-of-the-art techniques and additional novel components to enable automated string data cleaning. This framework shows promising results, and some of its novel components (e.g., string feature type inference, ordinality detection, and encoding using FlairNLP) perform very well, especially in terms of identifying special types of categorical string data and adequately processing and encoding them. However, several challenges remain. First, string feature type inference using (regular expression-based) PFSMs is sensitive to the exact formatting of strings. More robust techniques are needed, such as subpattern matching and better use of type probabilities in subsequent processing (e.g., if it is only 60% certain that a string feature represents a date, a more robust encoding is needed, or a human should be brought into the loop). Second, the string type-specific processing was suboptimal or no better than the baseline on some datasets in some aspects, providing interesting cases for further study. Finally, the encoding of ordinal string data still leaves room for improvement. The sentiment intensity-based encoding has been shown to perform well in some respects and poorly in others. We believe that more sophisticated approaches are possible, e.g., paying special attention to numbers appearing in the string data. Overall, we hope that this framework and open-source implementation will speed up research in these areas.

Coordinate This is a string feature that represents GPS or Degree-Minute-Second (DMS) coordinates such as N29.10.56 W90.00.00, N29:10:56, and 29°10'56.22"N. Coordinates can be distinguished from other string features based on the following characteristics:
-Two sequences of at most two digits and a third sequence which is a float with at most two digits before the decimal point.
-A character that separates the three sequences of digits (e.g., . or :). In the context of DMS coordinates, these characters are °, ', and ", respectively.
This string feature is processed as follows. First, the string feature is split up into two separate parts for each coordinate in the entry, in which one part represents the cardinal direction of the coordinate and the other part represents the numerical information. Second, the current format of the coordinate string feature is converted into the corresponding decimal latitude-longitude value. The string feature is formatted using degrees, minutes, and seconds. This format can be converted to representative decimal values using the standard conversion decimal degrees = degrees + minutes/60 + seconds/3600, where the result is negated for southern and western cardinal directions.

Day This string feature represents the days of the week, such as Monday and Thursday. Days can be distinguished from other string features based on characteristics including:
-A prefix of at least two characters that identifies a unique day of the week (e.g., Th).
-A distinct set of characters that comes after the prefix and before the suffix (e.g., if the string is Thursday, then Th should be followed by urs).
As this string feature is the least complex of all inferred string features, it is also the simplest to process. Considering only the first two characters for days is the most reduction that can be done while still being able to make a distinction between each unique day of the week. If the user decides to encode the data, this string feature will receive a nominal encoding in the final step of the framework.

E-mail This string feature represents all valid e-mail addresses from any domain, such as Jane@tue.nl and john.doe@hotmail.co.uk. This feature can be distinguished from others based on the following characteristics:
-The character @ which is between two sets of characters.
-A substring in front of the @ (i.e., the name of the e-mail) which is composed of valid characters (e.g., the e-mail address #@*%#$@hotmail.com is invalid as the characters before the final @ cannot be included in an e-mail name).
-A substring that comes after the @ which is composed of valid characters and at least one dot in between those characters (e.g., name@hotmail is not a valid e-mail address as the domain name is incomplete).
This string feature is processed as follows. We first remove the longest common suffix of all entries. Then, additional special characters are removed to simplify the values. If the user decides to encode the data, this string feature will receive a nominal encoding in the final step of the framework.
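A minimal sketch of this suffix-stripping simplification is given below (the filepath entry that follows applies the same idea to common prefixes and suffixes); the helper name and example addresses are illustrative, not the library's actual API.

```python
import os


def strip_common_suffix(values):
    """Remove the longest suffix shared by every string in `values`."""
    # Reverse the strings, take their longest common prefix, reverse it back.
    common = os.path.commonprefix([v[::-1] for v in values])[::-1]
    if not common:
        return list(values)
    return [v[: len(v) - len(common)] for v in values]


emails = ["jane.doe@tue.nl", "j.smith@tue.nl", "admin@tue.nl"]
print(strip_common_suffix(emails))  # ['jane.doe', 'j.smith', 'admin']
```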
Filepath This string feature represents paths within a local system, such as C:/Windows/ and C:/Users/Documents. Filepaths can be distinguished from other string features by the following characteristics:
-A series of substrings which are separated from each other using either / or \ (e.g., home/users).
-Each substring cannot contain any of the following characters: \/:*?"<>|
-If present, a prefix that represents the root disk or a sequence of dots followed by a slash or a backslash (e.g., C:/, ../, etc.).
Processing this string feature is similar to how e-mail addresses are processed and is aimed at reducing string complexity. For this feature, the longest common prefix and suffix are removed from all entries and all special characters are removed. If the user decides to encode the data, this string feature will receive a nominal encoding in the final step of the framework.

Month This string feature represents the non-numerical representation of a month, with or without year and day, and can be distinguished based on the following criteria:
-A prefix of at least three characters, representing a unique month (e.g., Apr).
-If present, the remaining substring that comes after the prefix (e.g., il comes after the prefix Apr).
-If present, a sequence of at most two digits before or after the month which represents a day in the month (e.g., 1 January or January 1).
-If present, a sequence of at most four digits, or a sequence with prefix ' followed by two digits, that comes after the month and represents the year (e.g., January 2000 or January '00). Both day and year can be present at the same time.
Processing this string feature is based on the format that is presented. Each format is split up into individual components that represent either a day, month, or year. The key step in this procedure is to ensure that the string representation of the month is turned into the corresponding numerical representation. After this key step is performed, all values are concatenated according to the format yyyymmdd. The overall workflow of this processing step is depicted in Fig. 6. As the string feature has already been transformed into a numerical representation, no further encoding is required if the user requests one.

Numerical There are a variety of entries that are relatively easy for users to recognize as numerical values but are usually inferred as strings by type detection or inference techniques due to the presence of certain non-numerical characters. It is therefore important to categorize such entries as a string feature to properly handle and process them. We identify numerical string features using any of the following characteristics:
-Between two sequences of digits, one of the following characters: -+ /:;&'. A space or the substring 'to' is also applicable (e.g., 100 to 200).
-Before a single sequence of digits, any of the following words: Less than, Lower than, Under, Below, Greater than, Higher than, Over, Above.
-Before or after a single sequence of digits, special characters such as > (e.g., > 10).
Numerical string features are processed based on what they represent. If the string feature represents a range of values, the mean of the range is calculated for each entry. After that, the range entries are encoded according to the numerical order of the ranges. If the string feature does not represent a range, we remove all special characters and consider all resulting numbers as separate numerical features. As the string feature has already been transformed into a numerical representation, no further encoding is required if the user requests one.
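The sketch below shows one way to carry out this numeric extraction; the helper name is illustrative, and the ordering of range entries described above is not included.

```python
import re


def numeric_from_string(value):
    """Extract a numeric value from a numerical string entry.

    Entries with two numbers (e.g. '100 to 200') are treated as ranges and
    replaced by the mean of their bounds; other entries (e.g. '> 10') keep
    their number once special characters are stripped.
    """
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", value)]
    if not numbers:
        return None
    if len(numbers) == 2:
        return sum(numbers) / 2.0
    return numbers[0]


print(numeric_from_string("100 to 200"))  # 150.0
print(numeric_from_string("> 10"))        # 10.0
```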
Sentence This string feature is composed of a sequence of words, typically found in datasets that contain reviews or descriptions. It is slightly more difficult to express this string feature as a regular expression compared to the others because of its overlapping characteristics with regular string entries that consist of a couple of words. However, it is still possible to perform string feature inference for sentences based on the following characteristics:
-A substring of characters followed by a space, occurring at least five times (i.e., the entry is at least six words long).
The goal of processing sentence string features is to remove redundancy in the entry and to make the entries more relevant for use in tabular data. The technique used to achieve this goal is the NLTK word tokenizer, which takes a sentence and divides it into tuples containing each word and its associated part of speech. Then, every word that is tagged as a noun is joined together with a space into a single string, which is then passed on. The result is a group of nouns that is supposed to represent the essence of the sentence and is ready to be encoded in the next step of the library. If the user decides to encode the data, this string feature will receive a nominal encoding in the final step of the framework.

URL This string feature represents any link to a website or domain, such as https://www.tue.nl/ and http://canvas.tue.nl/login. The characteristics of this string feature are similar to those of filepaths, with a few exceptions:
-An optional prefix which represents a certain protocol (e.g., http://).
-A series of at most four character sequences, separated from each other by a dot (e.g., www.google.com or google.com). Note that the last (pair of) sequence(s) contain(s) at most three characters.
Processing this string feature follows the same procedure as filepaths. If the user decides to encode the data, this string feature will receive a nominal encoding in the final step of the framework.

Zip code This string feature represents zip or postal codes from a handful of countries. Note that we are only able to infer zip codes that contain non-numerical characters, as numerical-only zip codes are much more difficult to infer using PFSMs without overlapping with actual numerical features. Processing this string feature is mainly done to extract additional information from each entry. In our case, we make use of the geopy library to fetch the latitude, longitude, and country code of the zip code. Furthermore, we also calculate the ECEF (Earth-Centered, Earth-Fixed) coordinate using the latitude and longitude values. If the user decides to encode the data, the zip code string feature will receive a nominal encoding in the final step of the framework.
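A minimal sketch of the latitude/longitude to ECEF conversion is given below, assuming the WGS84 ellipsoid and, by default, zero altitude; the function name is illustrative rather than the library's actual API.

```python
import math


def latlong_to_ecef(lat_deg, lon_deg, alt_m=0.0):
    """Convert geodetic latitude/longitude (degrees) to ECEF coordinates (metres)."""
    a = 6378137.0             # WGS84 semi-major axis
    e2 = 6.69437999014e-3     # WGS84 first eccentricity squared
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    n = a / math.sqrt(1.0 - e2 * math.sin(lat) ** 2)  # prime vertical radius
    x = (n + alt_m) * math.cos(lat) * math.cos(lon)
    y = (n + alt_m) * math.cos(lat) * math.sin(lon)
    z = ((1.0 - e2) * n + alt_m) * math.sin(lat)
    return x, y, z


print(latlong_to_ecef(51.4416, 5.4697))  # approximate location of Eindhoven
```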
To predict whether a 'standard' string feature is nominal or ordinal, the following properties are extracted from each column:
-The number of unique values in the column: Nominal data tends to vary more in cardinality as opposed to ordinal data. Furthermore, some ordinal data columns tend to adhere to Likert-scale characteristics regarding the possible number of unique entries, which also limits their cardinality. The value is obtained by counting all unique entries in a column.
-The ratio between the number of unique values and the total number of rows: As a rule of thumb, some domain experts tend to classify data as ordinal when the ratio between the unique values and the total number of rows is at most 0.05. The ratio for nominal data tends to be at most 0.2. Following this rule of thumb, we extract the ratio for use in the classifier by dividing the number of unique values by the total number of rows.
-The mean of the variance of the distance between the word embeddings of unique entries: The idea behind extracting this feature is that the word embeddings of certain entries showcase interesting linear substructures in the word vector space. By taking a pre-trained word vector space, the classifier may be able to make a distinction between ordered and unordered data based on differences in the substructure. The first step is to split each entry into a set of words, which are then embedded using a pre-trained word vector space. This work makes use of the Wikipedia word vector space by GloVe, which consists of over 400 000 words in the vocabulary embedded into a 50-dimensional vector space [25], to assign each word in an entry to a vector. A random point in the vector space is assigned to a word in case it does not appear in the pre-trained corpus. Next, the mean of the word vectors in the entry is calculated per dimension, such that each entry is represented as a single point in the 50-dimensional vector space. After this is done for all entries in the column, the variance of each dimension across entries is calculated. Finally, the mean of these variances over all dimensions is taken, and the resulting value is a single float that can be used to potentially distinguish ordered data from unordered data (a minimal sketch of this computation is given after this list).
-Whether the column name is commonly used in ordinal data: There is a set of keywords that can commonly be found in column names for ordinal data. Examples of such keywords include grade, stage, and opinion. By checking whether a column name is contained within one of those keywords and vice versa, we are able to tell whether the data in the column is more likely to be ordinal or not. Note that the keywords for column names used in this work are based on domain expertise, which means that results may vary when other keywords are used.
-Whether the column name is commonly used in nominal data: The approach for extracting this feature is similar to that of checking ordinal traits in the column name, except that the names obtained via domain expertise are now those commonly used in nominal data. Typical nominal column names include address, city, name, and type. Again, since the names used are based on domain expertise, performance may vary when other names are used.
-Whether the unique entries contain keywords that are commonly found in ordinal data: Extracting this feature is similar to the two aforementioned techniques, except that ordinality is now implied based on keywords that are commonly found in ordinal data. The keywords used in this work are adjectives and nouns that are typically found in Likert-scale questionnaires.
-Whether the unique entries share a number of common substrings: Ordinal data tends to contain entries with overlapping substrings. For example, the strings disagree, agree, and wholeheartedly agree all contain the substring agree. By checking whether there are common substrings of sufficient length in the data, it is possible that the classifier associates the occurrence of such substrings with the implication that the data is ordinal.
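The sketch below illustrates the word-embedding property described in the third item above; it assumes `glove` is a dict mapping words to 50-dimensional numpy vectors (e.g., loaded from the GloVe Wikipedia vectors), and the function name is illustrative.

```python
import numpy as np


def embedding_variance_feature(unique_entries, glove, dim=50):
    """Mean (over dimensions) of the per-dimension variance of entry embeddings."""
    entry_points = []
    for entry in unique_entries:
        # Embed each word; unknown words get a random point in the vector space.
        vectors = [glove.get(word, np.random.uniform(-1.0, 1.0, dim))
                   for word in entry.lower().split()]
        # Represent the whole entry as the mean of its word vectors.
        entry_points.append(np.mean(vectors, axis=0))
    entry_points = np.asarray(entry_points)
    variances = np.var(entry_points, axis=0)   # variance of each dimension
    return float(variances.mean())             # single float per column
```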
A list of all datasets that were used during the evaluation is provided here, including additional information on what was used from the data. Some of the datasets listed below were used to evaluate both string feature inference and string feature processing. The datasets without model, task, and target were only used to evaluate string feature inference based on the manually assigned ground truth of the columns. The datasets with the previously mentioned components were used for both string feature inference (with manual labeling of string features) and string feature processing (evaluation by running the associated task).

data-scientists?select=aug train.csv
-Students' Academic Performance Dataset. Columns: StudentAbsenceDays
america-excluding-us?select=bermuda.csv
-House Price Data
Columns: class, cap-shape, cap-surface, cap-color, bruises, odor, gill-attachment, gill-spacing, gill-size, gill-color, stalk-shape, stalk-root, stalk-surface-above-ring, stalk-surface-below-ring, stalk-color-above-ring, stalk-color-below-ring, veil-type, veil-color, ring-number, ring-type, spore-print-color, population

Acknowledgements The authors would like to thank Marcos de Paula Bueno for his valuable feedback on this paper. Furthermore, this work was partly funded by the European Research Council under the EU Horizon 2020 programme, project 592215 (TAILOR).

References
FLAIR: An easy-to-use framework for state-of-the-art NLP
Encoding high-cardinality string categorical variables
Similarity encoding for learning with dirty categorical variables
ptype: probabilistic type inference
Feature selection with decision tree criterion
Automating big-data analysis
VADER: A parsimonious rule-based model for sentiment analysis of social media text
Deep Feature Synthesis: Towards automating data science endeavors
SampleClean: Fast and Reliable Analytics on Dirty Data
ActiveClean: Interactive data cleaning for statistical modeling
AlphaClean: Automatic Generation of Data Cleaning Pipelines
Binary codes capable of correcting deletions, insertions, and reversals