title: Accessible Data Curation and Analytics for International-Scale Citizen Science Datasets
authors: Murray, Benjamin; Kerfoot, Eric; Graham, Mark S.; Sudre, Carole H.; Molteni, Erika; Canas, Liane S.; Antonelli, Michela; Visconti, Alessia; Chan, Andrew T.; Franks, Paul W.; Davies, Richard; Wolf, Jonathan; Spector, Tim; Steves, Claire J.; Modat, Marc; Ourselin, Sebastien
date: 2020-11-02

The Covid Symptom Study, a smartphone-based surveillance study of COVID-19 symptoms in the population, is an exemplar of big data citizen science. Since its introduction in March 2020, over 4.7 million participants have logged more than 189 million unique assessments. The success of the Covid Symptom Study creates technical challenges around effective data curation for two reasons. Firstly, the scale of the dataset means that it can no longer be easily processed using standard software on commodity hardware. Secondly, the size of the research group means that replicability and consistency of the key analytics used across multiple publications become an issue. We present ExeTera, open-source data curation software designed to address these scalability challenges and to enable reproducible research across an international research group for datasets such as the Covid Symptom Study dataset.

The join operation is the most memory-intensive operation typically carried out on tabular datasets. ExeTera implements highly scalable versions of left, right and inner joins; the left join is the most commonly performed, so we focus on its performance here. A left join maps data from one table onto another based on the key relationship between the two tables: values in the right table are mapped onto rows in the left table. For example, patient data may be joined to assessment data so that assessments can be processed with patient-level features such as age or BMI (body mass index). An illustrated example of a left join can be seen in Figure 1. To compare the performance of the join operator in ExeTera and Pandas, we generate datasets composed of a left primary key (int64), a right foreign key (int64), and 1, 2, 4, 8, 16 or 32 fields of random numbers (int32) corresponding to entries in the right table. The resulting join performance for both is shown in Table 4. The left and right tables have the same number of rows for each test, listed in the first column.

It should be noted that ExeTera organises data into groups, which may be considered logical tables, but the user typically loads the data on a field-by-field basis when working with it. ExeTera's merging API allows the user to think in a tabular fashion by accepting tuples of ExeTera fields, which only load their contents when specifically requested. ExeTera achieves its scalability through two techniques: firstly, by calculating the map and then applying it to each field in turn, and secondly, by exploiting the natural sorted order to convert the mapping into a streaming merge. The result is a merge operation that is fast and scales to billions of rows.

ExeTera can combine snapshots of datasets to create a journaled dataset, keeping multiple timestamped copies of records whose values change between snapshots and whose earlier states would otherwise be destroyed. Table 5 shows the results of journaling together snapshots of the Covid Symptom Study from August 1st 2020 and September 1st 2020, and the time taken to do so.
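To make the map-then-apply strategy concrete, the following is a minimal Numpy sketch of how a left-join map might be computed once and then applied to each right-hand field in turn. It is an illustration under the assumption of a sorted, unique right-hand primary key, not ExeTera's implementation, and the variable names are hypothetical.

import numpy as np

def build_left_join_map(left_foreign_key, right_primary_key):
    # For each foreign key in the left table, find the position of the matching
    # primary key in the (sorted, unique) right table, or -1 if there is no match.
    pos = np.searchsorted(right_primary_key, left_foreign_key)
    pos = np.clip(pos, 0, len(right_primary_key) - 1)
    matched = right_primary_key[pos] == left_foreign_key
    return np.where(matched, pos, -1)

def apply_join_map(join_map, right_field, fill=np.nan):
    # Apply the precomputed map to a single right-hand field; this step is repeated
    # for each field to be joined, so peak memory is governed by one field at a time.
    out = right_field[np.clip(join_map, 0, None)].astype(float)
    out[join_map < 0] = fill
    return out

# Example: joining patient ages onto assessments via patient id.
patient_id = np.array([10, 20, 30])                   # right table primary key
patient_age = np.array([35, 52, 41])                  # right table field
assessment_patient_id = np.array([20, 20, 10, 40])    # left table foreign key
join_map = build_left_join_map(assessment_patient_id, patient_id)
assessment_age = apply_join_map(join_map, patient_age)  # [52., 52., 35., nan]

When both key fields are sorted, the same mapping can be produced as a streaming merge over chunks of the two keys, which is what allows the operation to scale to billions of rows.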
ExeTera provides the ability to load data very efficiently, as seen in Tables 2 and 3. Once loaded, analytics can be performed through libraries such as Numpy and Matplotlib [11], using tools that researchers are familiar with, such as Jupyter Notebook [12]. Figure 2 shows a histogram of healthy and unhealthy assessment logs bucketed into seven-day periods; generating it requires parsing all 189 million assessments.

In this work, we present ExeTera, a data curation and analytics tool designed to provide users with a low-complexity solution for working on datasets approaching terabyte scale, such as national- and international-scale citizen science datasets like the Covid Symptom Study. ExeTera makes this possible without the additional complexity of server-based datastores, and thus simplifies access to such datasets for both academic and independent researchers who are versed in the Python scientific programming ecosystem. ExeTera provides features for cleaning, journaling, and generation of reproducible processing and analytics, enabling large research teams to work with consistent measures and analyses that can be reliably recreated from the base data snapshots. Its ability to store multiple snapshots in a journaled format enables researchers to perform full longitudinal analysis on otherwise unjournaled datasets, and makes it possible to move between snapshots while properly exploring the impact of doing so on analyses. ExeTera has been key to enabling analysis of the Covid Symptom Study dataset, and has been used for the analyses in several manuscripts [13-16].

Although ExeTera was developed to provide data curation for researchers working on the Zoe Symptom Study, the software is being developed to be generally applicable to large-scale relational datasets for researchers who work in Python. However, this requires further work to separate core ExeTera functionality from Covid Symptom Study-specific functionality. While our goal is to closely emulate a Pandas-style API, additional work is still required to make this available to the user across the whole API. The sorting and journaling APIs, for example, are accessed through the low-level operations API rather than through abstractions that more closely emulate Pandas, and work is ongoing to make the API more consistent in its presentation and manipulation of the data as logical tables.

ExeTera is currently built on top of Numpy and Pandas functionality. Its ability to scale is provided through implementations of key operations that can stream large collections from drive. This enables processing of large datasets but, at present, ExeTera does not take advantage of multiple cores or processors, nor is it able to process across a cluster. Future development will make use of Dask [17], a library designed to convert operations on Numpy and Pandas into a directed acyclic graph of sub-operations that can be distributed across multiple cores and nodes, and will focus on providing the elements of scalable processing that Dask lacks, such as multi-key argsorts and table-less merges. Dask is also integrated with more specialised back-ends such as NVIDIA's RAPIDS [18], which enables execution of distributed graph processing across GPU clusters. The integration of Dask should allow ExeTera to execute over datasets well into the multi-terabyte range.

Reproducibility relies on the immutability of the algorithms deployed during analysis.
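As a simple illustration of what such immutability can look like in practice, a corrected algorithm is added alongside, rather than in place of, earlier versions, so that existing analysis scripts keep producing the results they originally produced. The sketch below uses hypothetical function names and is not ExeTera's API.

# Hypothetical sketch of versioned cleaning algorithms kept side by side.
# Once v1 has been used in an analysis it is never edited; corrections are
# introduced as v2, and scripts pin the version they were written against.

def clean_weight_kg_v1(weights_kg):
    # Original rule: discard values outside a broad plausible range.
    return [w if 20.0 <= w <= 300.0 else None for w in weights_kg]

def clean_weight_kg_v2(weights_kg):
    # Corrected rule: first rescale values that were plausibly entered in grams.
    rescaled = [w / 1000.0 if w > 1000.0 else w for w in weights_kg]
    return clean_weight_kg_v1(rescaled)

CLEANERS = {
    ("clean_weight_kg", 1): clean_weight_kg_v1,
    ("clean_weight_kg", 2): clean_weight_kg_v2,
}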
At present, ExeTera achieves this immutability through a convention: algorithms are treated as immutable once implemented and deployed. A more robust system is being designed to provide algorithmic immutability without this constraint.

HDF5 has proven to be fragile to interruptions while data is being written to it; an interrupted write is capable of rendering the entire dataset unreadable, and so all writes must be protected from user interrupts and other exceptions. Additionally, HDF5 does not allow the space from deleted fields to be reclaimed from the dataset; this must be done separately using the 'h5repack' tool. The primary benefit of HDF5 to ExeTera is its ability to flexibly store contiguous fields of data for rapid reading. This can also be achieved through alternative columnar data formats such as ORC [19] or Parquet [19, 20], or through use of the file system itself as a datastore. Storing individual fields as serialized Numpy arrays, with field metadata in JSON, allows for transparent, robust dataset serialization that can be explored on the file system.

As of 14th October 2020, the Covid Symptom Study dataset is composed of four tables:

Patients: 4.701 million patients with 224 data fields. Patient records store data such as the patient's physiological statistics, long-term illnesses, lifestyle factors, location, and other data that changes only occasionally, at the patient level.

Assessments: 189.2 million assessments with 66 fields. Patients are asked to give regular assessments through the app covering their current health status and symptoms, aspects of their lifestyle such as potential exposure to COVID-19, and, in early versions of the schema, any COVID-19 tests that they have had.

Tests: 1.455 million tests with 26 fields. Test records are kept for each COVID-19 test that a patient has had, along with the evolving status of that test (typically from 'waiting' to some result).

Diet: 1.563 million diet study questionnaires with 104 fields. These ask people at several time points about their dietary and lifestyle habits.

Assessments, tests and diet study questionnaires are mapped to patients via ids that serve as foreign keys. The dataset is delivered as daily snapshots in CSV format. As of 14th October 2020, the daily snapshot is 42.2 GB in size, and the accumulated daily snapshots are over 3 TB in size. The dataset, excepting fine-grained geolocation data, is publicly available at https://healthdatagateway.org.

In order to successfully curate the Covid Symptom Study data, it is necessary to handle data that cannot fit into RAM. Data size and structure, and the set of operations needed to handle the dataset, all have to be addressed. We can define three scale domains, each of which necessitates a change of approach.

Memory scale is the scale at which the dataset fits entirely in the computer's RAM. Commodity laptops and desktops used by researchers typically have between 16 and 32 GB of RAM. Loading the data can inflate its memory footprint depending on the datatypes used, and operations can multiply memory requirements by a small constant factor, but provided peak memory usage does not dramatically exceed RAM, researchers can make use of programming languages with numerical / scientific libraries such as Numpy or Pandas to effectively analyse the data.

At drive scale, only a portion of the dataset can fit into RAM at a given time, so specific solutions are required to effectively stream the dataset from drive to memory.
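As a generic illustration of drive-scale processing (a sketch under assumed file layout and names, not ExeTera code), a statistic can be computed over an on-disk array far larger than RAM by streaming it in fixed-size chunks, for example from a memory-mapped Numpy array:

import numpy as np

def chunked_mean(path, dtype=np.float64, chunk_rows=1_000_000):
    # Memory-map the on-disk array so that only the chunk currently being
    # processed needs to be resident in RAM.
    values = np.memmap(path, dtype=dtype, mode="r")
    total, count = 0.0, 0
    for start in range(0, len(values), chunk_rows):
        chunk = values[start:start + chunk_rows]
        total += float(chunk.sum())
        count += chunk.shape[0]
    return total / count if count else float("nan")

The same process-a-chunk-and-discard-it pattern underlies the streaming implementations of sorting, joining and aggregation discussed later.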
Server-based datastores become a more compelling option at drive scale, as they already provide memory-efficient, streaming versions of the operations that they support, but their use may not be desirable due to the need to learn a new language or API and the installation and maintenance burden they represent. This is the scale of dataset that ExeTera currently targets.

At distributed scale, the use of server-based datastores is typically mandatory. It becomes necessary to redesign operations to exploit distributed computing across many nodes. Selection of appropriate datastore technology becomes critical, with specific datastore technologies addressing different roles within the overall system. This scale will be targeted by ExeTera in future development.

The ExeTera software provides functionality that enables a data curation pipeline incorporating data curation best practice. The pipeline has the following steps:
• Import / preliminary data cleaning and filtering
• Journaling of snapshots into a consolidated dataset
• Generation of derived data and analytics

The first two stages are generic operations that apply to any tabular dataset being imported into ExeTera. The third stage is specific to a given dataset, such as the Covid Symptom Study.

The import process converts CSV data to a binary, columnar format, discussed further in the 'Implementation' subsection, that is in most cases many orders of magnitude more time-efficient for querying. As part of this process, the data is converted from strings to data types defined by a JSON schema file. Each CSV field is mapped to one or more strongly typed fields; how this is done is determined by the type specified for the field in the JSON schema.

Fixed string fields: Fixed string fields contain string data where each entry is guaranteed to be no longer than the length specified for the field. Fixed string fields can handle UTF-8 unicode data, but this is encoded into bytes, and so the specified length must take into account the encoding of the string to a byte array.

Indexed string fields: Indexed string fields are used for string data where the strings may be of arbitrary length. The data is stored as two arrays: a byte string of all of the strings concatenated together, and an array of indices indicating the offset of each entry.

Numeric / logical fields: Fields that contain a combination of strings to be converted to strongly typed values and empty values (for example "", "False", "True") are converted to the appropriate numeric / logical data type, plus a corresponding filter field indicating whether a value is present for a given row. Fields containing string values can instead be processed as categorical fields if specified as such.

Categorical fields: Categorical fields map a limited set of string values to corresponding numeric values. A key is stored along with the field, providing the mapping between string and number. A categorical field can also be specified as leaky, in the case that the field is a mixture of categorical values and free text. In this case, a value is reserved to indicate that a given row does not correspond to a category, and an indexed string field is created for the free-text entries.

Datetime fields: Datetime fields store date times as POSIX timestamps in double-precision floating point format.
The schema can also specify the generation of a 'day' field quantising the timestamp to the nearest day, and can specify whether the field contains empty values, in which case a filter field is also generated, as with numeric fields.

Data for the Covid Symptom Study project is delivered as a series of timestamped snapshots. The unanonymised data generated by the Covid Symptom Study app is stored in a relational database or similar datastore that is not accessible for querying by the broader research community. Instead, the data must first be anonymised and bulk exported to CSV format. The database is a live view of the dataset, however; users can update data through the app and, unless the database is explicitly journaled and each entry made immutable, the prior states are erased. As such, a row corresponding to a given entity in two different snapshots can contain conflicting values. When each snapshot is large, the scaling problem is exacerbated by having to reconcile multiple snapshots. The Covid Symptom Study dataset does not have a field that reliably indicates whether the contents of a given row have changed, and so determining whether a row at time t has changed relative to a row at time t + 1 requires a full comparison of all common fields. An example of this can be seen in Figure 3.

In addition to the initial data cleaning performed during import, it is useful to perform application-specific cleaning and to generate ancillary fields that are widely used for downstream analyses. This helps to ensure consistency across analytics and reduces scripting complexity for new users.

The Covid Symptom Study data schema has seen rapid iteration since its inception, due to a number of factors. Firstly, the initial app was released rapidly so that users could contribute as soon as possible after the pandemic was declared. Secondly, the evolving nature of the pandemic, particularly around prevalence in the population and the availability and type of tests, has necessitated structural changes to the schema. Thirdly, this dataset is novel in terms of its scale and deployment for epidemiological analysis, and the initial wave of papers published by the research group has fed back into the schema. Public health surveillance campaigns such as the Covid Symptom Study impose time constraints on software development, with frequent changes in database structure and intense versioning to accommodate iterative refinements. The evolving epidemiology of COVID-19, the response of governments and populations to the pandemic, and academic responses to papers based on the dataset all shape the questions that are added to or removed from the app over time.

The dataset is only minimally validated at source. The fields often contain data of mixed type, and different fields can be in mutual contradiction. Numeric values are validated only for type rather than for sensible value ranges. Furthermore, the dataset contains multiple competing schemas for the same underlying data, and the app version is tied to the schema version, so users who are using older versions of the app are still contributing to otherwise retired schema elements. As such, a considerable amount of data cleaning and processing is required in order to extract data suitable for analysis.

The handling of COVID-19 tests in the dataset is an example of the complexity created by changes to the schema. Testing was initially reported as an assessment logging activity, but this came with a number of problems.
Firstly, a test needed to be logged on the day it was taken for the assessment date to be treatable as the test date. Secondly, some users interpreted the test field as something to be logged only when they took a test or received the result, whilst other users filled in intermediate assessments with the pending status. Thirdly, this system did not allow users to enter multiple tests unambiguously. Whilst this was not a problem in the initial months of the pandemic, the ramping up of test availability necessitated a solution. A new test table was introduced in June 2020, giving each test a unique id and allowing multiple tests per patient. However, existing tests recorded in the old schema were not connected with new test entries, although many users re-entered old test results in the new test format. Furthermore, new tests continued to be added in the old, assessment-based schema format by users logging on previous versions of the app. As such, there is no unambiguous way of determining whether tests in the old format are replicated by tests in the new format. This is an example of a postprocessing activity with no unambiguously correct output, which therefore requires at least a single, agreed-upon algorithm to be deployed consistently to avoid inconsistencies between related analyses.

In the Covid Symptom Study app, as of the time of writing, user-entered numeric values are validated only to ensure that they are numeric. There is no validation of sensible ranges given the user-selected units of measurement. Some users enter incorrect values, and some users enter values that appear sensible but only in some other unit (1.8 is a plausible height if the user is entering height in metres, for example).

The COVID-19 test table has a 'mechanism' field where the user is free either to select a categorical value indicating the test mechanism, or to enter free text relating to it. Some free text clearly indicates the test type, whereas other free-text entries imply the test type only weakly (for example, 'home test kit'). As such, a set of graded flags is generated to indicate the quality of the categorisation.

When users log multiple entries in a day, these assessments can optionally be quantised into a single daily assessment that, for symptoms, takes the maximum value the user reported for each symptom that day. This considerably simplifies many downstream analyses. Analysis often involves filtering patients categorised by aspects of the assessments and tests that they have logged, including metrics such as whether the patient initially logged as healthy, or whether they have ever logged a positive test result.

Reproducibility depends on the ability to reproduce a given analysis from a version of the dataset and a set of algorithms run on that dataset. For this to be possible, algorithms must be considered immutable once implemented. This allows any subsequent version of the software to generate results consistent with those of the software version in which the algorithm was introduced. ExeTera does this by requiring that any version of a given algorithm, once created, is treated as immutable in the code base. This means that any target script is guaranteed to exhibit the same behaviour, provided that the following conditions hold. Firstly, any algorithms written for ExeTera are explicitly versioned.
Secondly, any randomness introduced must be given consistent random seeds and, ideally, multiple sources of randomness should be given different random number generators. Once an algorithm has been used in analysis, it may no longer be altered in the codebase, even if it is subsequently shown to contain errors. This enables researchers to run multiple versions of the same algorithm as part of their analytics and understand how sensitive their results are to changes and corrections. An example of this is the multiple versions of height / weight / body mass index (BMI) cleaning that have been devised over the course of the project; each is available as a separate version of the algorithm for reproducibility.

ExeTera is implemented in the Python programming language. Python has two aspects that make it suitable for writing software that performs data analytics and numerical analysis. Firstly, it is dynamically typed, which reduces code complexity and verbosity [24]. Secondly, it has a strong ecosystem of scientific libraries and tools to mitigate the performance and memory penalties that come with using a dynamically typed, byte-code interpreted language and runtime. Code that is compiled and run directly in CPython (the reference Python implementation) executes in the Python interpreter. The Python interpreter is extremely slow relative to optimised, compiled code such as that produced by C/C++ compilers; in many cases it is orders of magnitude slower. Python is dynamically typed, but its type system does not provide lightweight objects to represent primitive types. Even numeric values such as integers and floats are stored as full objects, typically taking up 28 bytes for what would be a 4-byte integer value. This overhead precludes efficient memory usage when iterating over large numbers of values.

Numpy [6, 7] is the Python community's main tool for circumventing such time and space inefficiencies. Amongst other features, it provides space-efficient representations of multi-dimensional arrays and a large library of time-efficient operations that can be carried out on them. The performance of such operations can be orders of magnitude faster than native CPython, but this is conditional on minimising the number of transitions between Python code and the internal compiled code in which the operations are implemented. Not all code can be easily phrased to avoid transitions between CPython and Numpy internals. Where this is not possible, Numba [25] is used to compile away the dynamic typing and object overhead, resulting in functions that execute at near optimised-C performance levels.

CSV is a very common way to portably represent large datasets, but it comes with many drawbacks, including a lack of strong typing and an inability to rapidly index to a given location in the dataset. These issues become severely problematic at scale, and so an alternative serialised data representation is required.

Data representation: row-local vs. column-local data formats. Data storage formats can be classified as primarily row-local or primarily column-local. This choice has key implications for analytics software. Row-local data formats store the group of related fields for a given data entry together in memory. Column-local data formats store the values of a given field for a group of data entries together in memory.

Row-local data formats: The CSV format is nearly always used in a row-local fashion, i.e. rows are data entries and columns are fields.
This typically makes CSV very slow to parse when only a subset of the data is needed; with CSV this is exacerbated because escape-sequenced line breaks mean that the entirety of each line must be parsed to determine the start of the next row. Even without this issue, row-local data formats suffer from locality-of-reference issues [26].

Column-local data formats: Column-local data storage enables very efficient access to a given field. All the entries for a given field are (effectively) contiguous in memory, and so loading a single field can be done in an asymptotically optimal fashion. When a dataset has many fields and a given operation touches only a small number of them, scaling the operation is, technically, a far simpler proposition [9, 27]. The ability to load specific subsets of the data with maximal efficiency benefits both the ability to scale and the latency with which operations can be performed on a small subset of the dataset. As such, a column-local data format is preferable.

HDF5 as the ExeTera serialised data format: HDF5 [10] is a data format for storing keys and their associated values in a hierarchically organised, nested collection. In contrast with CSV, HDF5 stores data in a column-local format. It also allows data to be stored as binary, concrete data types. HDF5 permits a user to explore the overall structure of the data without loading fields; fields are loaded at the point that a user specifically requests the contents of a given field, either as a direct fetch of the entire field or as an iterator over it. This makes it a suitable initial data format for ExeTera, although alternative columnar data storage formats are being considered to replace HDF5 in future development, due primarily to issues of dataset fragility and shortcomings relating to concurrent reading / writing.

Most analysis of tabular data is performed through a combination of joins, sorts, filters and aggregations. ExeTera operates on arrays of effectively unlimited length, particularly when certain preconditions are met, using the following techniques. Sorting is one of the key operations that must scale in order to process large datasets, as imposing a sorted order enables operations such as joins to scale. ExeTera uses several techniques to provide highly scalable sorting.

Generation of a sorted index: Rather than sorting data directly, ExeTera generates a sorted index that is a permutation of the original order. This is used to scale related sort operations and to implement a soft sort, in which fields are kept in their natural order and the permuted index is applied when a field is read.

Scaling multi-key sorts on long arrays: Multi-key sorts are memory-intensive when keys are large, and expensive due to the internal creation of tuples in the inner loops of sorts. Multi-key sorts in ExeTera are therefore rephrased as a series of stable sorts on individual keys from right to left, where the output of each sort step is a sorted index that is the input to the next sort step. Figure 4 shows pseudocode for this operation.

Scaling sorts on very long arrays: ExeTera has a second sorting algorithm that can be selected if an array is too large to fit into memory in its entirety. Such arrays are sorted via a two-phase approach in which the array is divided into subsets; each subset is sorted, and the sorted subsets are then merged together by maintaining a heap of views onto them. A separate index is generated and maintained with the sorted chunks, so that the merge phase is stable.
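This two-phase approach can be illustrated with a minimal in-memory sketch; in practice the sorted runs would reside on drive, and the code below is an illustration of the technique rather than ExeTera's implementation.

import heapq
import numpy as np

def two_phase_sorted_index(values, chunk_size):
    # Phase 1: argsort each chunk independently, keeping (value, original index)
    # pairs so that the original position travels with each value.
    runs = []
    for start in range(0, len(values), chunk_size):
        chunk = values[start:start + chunk_size]
        order = np.argsort(chunk, kind="stable")
        runs.append([(chunk[i], start + int(i)) for i in order])
    # Phase 2: merge the sorted runs with a heap of iterators; because ties on
    # value are broken by original index, the merge is stable.
    merged = heapq.merge(*runs)
    return np.fromiter((idx for _, idx in merged), dtype=np.int64, count=len(values))

# The resulting permutation can be applied to any number of fields, e.g.:
# sorted_field = field[two_phase_sorted_index(key_field, chunk_size=1_000_000)]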
Pseudocode for this operation is shown in Figure 5.

Sorting multiple fields: The sorts described above, which produce a permutation of the original order, can be used to sort multiple fields in a space-efficient fashion. For large arrays, each array can be permuted in turn and written back to disk, or the permuted order can be maintained and reapplied when needed. ExeTera scales to provide this functionality even for very large arrays.

Operations on sorted fields: A number of operations become merges with various predicates when performed on fields that have been sorted by the key field, and can then be performed in O(m + n) time, where m and n are the lengths of the fields to be merged. This includes joins and aggregations. ExeTera performs these operations as merges when the key field is sorted. Importantly, arbitrarily large fields can be operated on in this way.

Figure 1. A left join of a simplified dummy patient and assessment dataset. The left join is from patients on the right to assessments on the left, generating a patient_age field in assessment space.

Figure 5. Pseudocode for a streaming sort that outputs a sorted index for subsequent application to many fields.

References
[1] A new dawn for citizen science.
[2] The future of citizen science: emerging technologies and shifting paradigms.
[3] An analysis of citizen science based research: usage and publication patterns.
[4] Opinion: Toward an international definition of citizen science.
[5] Rapid implementation of mobile technology for real-time epidemiology of COVID-19.
[6] The NumPy array: a structure for efficient numerical computation.
[7] Array programming with NumPy.
[8] Data structures for statistical computing in Python.
[9] SQL databases v. NoSQL databases.
[10] Hierarchical Data Format, version 5 (HDF5). The HDF Group.
[11] Matplotlib: a 2D graphics environment.
[12] Jupyter Notebooks - a publishing format for reproducible computational workflows.
[13] Key predictors of attending hospital with COVID-19: an association study from the COVID Symptom Tracker app in 2,618,948 individuals.
[14] Estrogen and COVID-19 symptoms: associations in women from the COVID Symptom Study.
[15] Geo-social gradients in predicted COVID-19 prevalence and severity in Great Britain: results from 2,266,235 users of the COVID-19 Symptoms Tracker app.
[16] Diagnostic value of skin manifestation of SARS-CoV-2 infection.
[17] Dask Development Team. Dask: library for dynamic task scheduling.
[18] RAPIDS: GPU-accelerated data science libraries. NVIDIA.
[19] Columnar storage formats.
[20] Apache Parquet.
[21] Real-time tracking of self-reported symptoms to predict potential COVID-19.
[22] Risk of COVID-19 among front-line health-care workers and the general community: a prospective cohort study. The Lancet Public Health.
[23] Delirium is a presenting symptom of COVID-19 in frail, older adults: a cohort study of 322 hospitalised and 535 community-based older adults.
[24] Do programming languages affect productivity? A case study using data from open source projects.
[25] A LLVM-based Python JIT compiler.
[26] What every programmer should know about memory.
[27] C-Store: a column-oriented DBMS.