PAPER IN PHILOSOPHY OF SCIENCE IN PRACTICE

What distinguishes data from models?

Sabina Leonelli (ORCID: 0000-0002-7815-6609, s.leonelli@exeter.ac.uk)
Exeter Centre for the Study of the Life Sciences (Egenis) & Department of Sociology, Philosophy and Anthropology, University of Exeter, Byrne House, St Germans Road, Exeter EX4 4PJ, UK

Received: 21 June 2018 / Accepted: 21 December 2018 / Published online: 15 January 2019
© The Author(s) 2019
European Journal for Philosophy of Science (2019) 9: 22. https://doi.org/10.1007/s13194-018-0246-0

Abstract  I propose a framework that explicates and distinguishes the epistemic roles of data and models within empirical inquiry through consideration of their use in scientific practice. After arguing that Suppes’ characterization of data models falls short in this respect, I discuss a case of data processing within exploratory research in plant phenotyping and use it to highlight the difference between practices aimed to make data usable as evidence and practices aimed to use data to represent a specific phenomenon. I then argue that whether a set of objects functions as data or models does not depend on intrinsic differences in their physical properties, level of abstraction or the degree of human intervention involved in generating them, but rather on their distinctive roles towards identifying and characterizing the targets of investigation. The paper thus proposes a characterization of data models that builds on Suppes’ attention to data practices, without however needing to posit a fixed hierarchy of data and models or a highly exclusionary definition of data models as statistical constructs.

Keywords  Data model · Experimentation · Empiricism · Research practice · Plant science · Data processing · Statistics · Phenomics · Big data · Inference

1 Introduction

This paper investigates the relation between data and models, their respective roles as research components within empirical inquiry, and the reasons why these roles should be kept distinct within scientific epistemology. I focus on the epistemic function of data models and the circumstances under which they should be distinguished from data. The account is developed through a detailed reconstruction of the stages of data processing involved in contemporary plant phenotyping, and specifically the use of high-throughput imaging data to acquire insight into plant development and growth patterns – a case which is representative of exploratory research practices within the life sciences (and beyond), and yet has received next to no attention from philosophers of science.1 This enables me to highlight philosophically significant aspects of the activities of data production, processing and interpretation, and argue that whether a set of objects functions as data or models does not depend on intrinsic differences in their physical properties, level of abstraction or the degree of human intervention involved in generating them, but rather on their distinctive roles towards identifying and characterizing the targets of investigation. I thus use the analysis of data practices as an entry point into the study of data modelling and inferential reasoning whose applicability extends well beyond the case under consideration.
This is not a completely new approach to the study of modelling, as exemplified by Patrick Suppes’ seminal account of the hierarchy of models, which was itself grounded on an analysis of the processes through which researchers go from data collection to the formulation of theories and the crucial role played by models in enabling that shift (Suppes 1962). In what follows, I take inspiration from Suppes’ approach and scrutinize the ways in which data and data models are generated and used within contemporary science. In contrast with Suppes, however, I consider research practices where some of the data being modelled come in forms other than numerical (namely, as images); and where statistical analysis is coupled with qualitative judgements around what data to consider for further analysis, with shifts in what objects are actually considered to be data, and with the implementation of computational tools to extract measurable traits from images in an automated fashion.2 In such a case, which is often encountered especially within the biological, social, environmental, historical and health sciences, Suppes’ hierarchy of models proves difficult to apply and does not help to resolve questions around the nature and epistemic function of data and data models.

To make better sense of the variety of data practices found across the sciences, I propose to move away from a structural characterization of data models altogether and instead to distinguish data from models by virtue of the circumstances and purpose of their use within situations of inquiry. I argue that many of the operations through which researchers process data are primarily aimed towards making them useable as evidence for claims, whether or not the specific targets of the claims in question have been clearly defined – and thus, in Bogen and Woodward’s terms (1988), whether or not data are endowed with the power of representing one or more phenomena. In deciding what counts as useable data, researchers define the evidential scope of their investigation, that is the range of phenomena that they will be able to consider once they start clustering and ordering data in ways that may help to interpret them as evidence. The clusters of data thus obtained (which may take various forms, depending on which visualisations researchers find most tractable as evidence) are what I call data models: that is, arrangements of data that are evaluated, manipulated and modified with the explicit goal of representing a phenomenon, which is often (though not always) meant to capture specific aspects of the world.3 Hence data models are defined by the representational power that researchers impute to them, and play an essential role in specifying the target of the claims for which data can be used as evidence – in other words, the phenomenon being investigated.

1 On the characteristics of exploratory research and its contrast to hypothesis-driven modes of inquiry, see Steinle (1997), Burian (1997), Waters (2007) and O’Malley (2008).
2 The extent to which computational tools intersect with qualitative judgements on what counts as data and how they should be used is particularly significant at a time where the development of automated methods for large-scale data production and analysis are hailed by some analysts and many funding bodies as the next frontier of artificial intelligence (e.g. Mayer-Schönberger and Cukier 2013).

The argument is structured as follows.
In the second section, I introduce what I call the representational view of data and models, and highlight the difficulties generated by this widely held view when attempting to analyse the distinctive epistemic roles of these research components. In section three, I discuss existing scholarship on data models, where questions about the relation between data and models have been most closely addressed. I trace the motivations underpinning Suppes’ seminal work and note his commitment to highlighting and defending the significance of statistical reasoning within knowledge production. While this commitment has proved generative to philosophers and researchers concerned with formal methods of inquiry, I note that it takes attention away from aspects of data processing and inference that are not informed by statistical techniques. In section four, I delve into my case study and reconstruct data practices involved in the experiments carried out at the National Centre for Plant Phenotyping (NCPP) in Aberystwyth in Wales. In particular, I focus on the SureRoot project, a collaboration between the NCPP and the North Wyke Farm Platform in Devon, England, that was carried out between 2014 and 2017 to improve understanding of root systems.4 My analysis focuses largely on the part of SureRoot that was developed at the NCPP, within which I identify seven distinct stages of data processing, each of which involves distinctive research skills, interests, assumptions and decisions.5 The case exemplifies the diversity of expertise involved in data processing activities and the specific challenges linked to exploratory research. Section five examines the role that representational assumptions play within each stage and problematises the idea of representation as sole or even primary epistemic goal for the researchers involved. In closing, I consider the implications of this analysis for understanding the relationship between data and models, the crucial role that data models can play in identifying the targets of scientific investigation, and the epistemology of empirical inquiry more generally.

3 My analysis is compatible with a wide spectrum of views around what representation involves in relation to data and modelling, including accounts within which models represent non-existing (e.g. abstract or fictional) entities and the target of representation is not mind-independent. I am particularly sympathetic to the liberal but precise account of how material models represent through denotation, exemplification, key and identification by Frigg and Nguyen (2018).
4 The project was funded by the UK Biotechnology and Biological Sciences Research Council LINK grant BB/L009889/1 (which is match-funded by the UK grassland industry sector).
5 This empirical analysis results from an extensive investigation of data processing activities across biological research sites in the UK which I conducted between 2015 and 2018.
This involved visits to the sites in which data were produced, stored, disseminated, visualized and analysed, where these research activities were documented via ethnographic observation, photographs and videos; a review of the resulting scientific publications and the ways in which authors describe their methods and techniques therein; the study of websites and digital databases used as reference points or tools throughout the inquiry; and in-depth, semi-structured interviews with the researchers involved, including technicians as well as software and computer experts in charge of processing and storing data. The point of gathering such diverse and rich information on data processing was to document the nuances and sophistication of related activities, resulting in thick descriptions of the stages through which data of various types are processed for the purposes of scientific discovery, such as in the case analysed in this paper. More information on this method and overarching project can be found on the Exeter Data Studies website (URL: www.datastudies.eu, accessed October 2018). Selected interview transcripts and photographs pertaining to the NCPP case are available as open data on Zenodo (URL: https://zenodo.org/communities/datastudies, accessed October 2018).

2 Data and models as representations

The nature and epistemic role of data remain under-researched topics in philosophy of science, especially when compared to the extensive scholarship on models and modelling activities. Philosophers tend to assume that data have some sort of representational content, in the sense of instantiating some of the properties of a given target of investigation in ways that are mind-independent. This representational conceptualization of data epistemology is often viewed as playing an important role in understanding the empirical basis of scientific knowledge, since the properties instantiated by the data are the medium through which the world, in its unpredictable complexity, becomes amenable to scientific study. Data are taken to capture and convey the same information about the world regardless of the circumstances of inquiry, and particularly of the assumptions and background knowledge of the researchers who are using them as evidence; such assumptions may colour the extent to which researchers are able to extract information from data, but do not affect the content of data as documents of specific aspects of reality. Hence, the informational content of data is regarded as fixed and context-independent. In such a framework, statistical analysis plays a crucial role in guaranteeing the reliability of data and the validity of the inferences extracted from them.6 What data suggest about the world can of course be misunderstood and misinterpreted whenever researchers use the wrong inferential techniques or modelling approach, but data themselves are viewed as reliable information sources – a mere “input” into processes of modelling. Finding the right interpretation involves finding the right tools to extract truth from data.7

6 As pointed out by Woodward (2000), the evidential relationship between data and claims is conceptualized as “a purely formal, logical, or a priori matter” – a point highlighted and critiqued also by John Norton (2010), whose theory of material induction emphasizes the role of local facts in warranting inductive inferences.
7 I provided an extended critique of the representational approach to data in Leonelli (2016).

A key problem of the representational view of data is that it makes it hard to differentiate data from models, given that models are themselves typically conceptualised as representations – though what they represent can vary from (parts of) the material world to highly abstract concepts.
Miles MacLeod and Nancy Nersessian (2013), for instance, describe model-based reasoning as “a process of generating novel representations through the abstraction and integration of constraints from many different contexts (literature, target, analogical source, modeling platforms and so forth)” – a view they broadly share with Daniela Bailer-Jones (2009), Mauricio Suarez (2004), Michael Weisberg (2013) and Alex Gelfert (2016), among many others. Even authors who emphasise the use of models as tools – “mediators” or “artefacts” – whose chief research function is to enable interventions, such as Tarja Knuuttila (2011) and Mary Morgan (2012), note the value of models in “standing for” specific phenomena. Data are not given much prominence within these accounts, with most authors treating them as empirical input for modelling. This can be interpreted as implicitly accepting a view of data as intrinsically reliable representations of the world. And indeed, while many philosophers have no trouble recognizing models as representations that may well be fictitious or false and yet yield “true theories” (e.g. Wimsatt 2007; Toon 2012), there is strong resistance against treating data in the same way and some confusion around what it is about data that gives them the epistemic power to provide empirical warrant and even direct support for claims about phenomena.8 It could be argued that within these accounts, data and models exist on a representational continuum between theory and the world, with data typically taken to operate closer to the “world” end of the spectrum and models to the “theory” end (see Fig. 1).

Fig. 1 A graphic rendition of the representational view of data and models, with the spectrum between world and theory standing for what is being represented, and data and models indicating which representations are associated with which parts of the spectrum

8 Reiss (2015) provides a useful analysis of this situation.

This is an uncomfortable position for philosophers to be in. Ronald Giere (1999) signals that discomfort when conceptualizing scientific research as “models almost all the way down.” On the one hand, that position stresses the constructed and theory-laden nature of the objects used by scientists to investigate and represent the world, whether they be concepts, diagrams, maps, equations or material objects such as a scale model. On the other hand, Giere is at pains to point out that data are somewhat different from the various representations created by scientists to make sense of reality: data may be theory-laden, but are still the closest one gets to an objective document of scientists’ interactions with the parts of the world that they study, and thus need to retain properties that would make them adequate and credible empirical grounding for claims about phenomena, irrespective of the representational value that researchers choose to bestow upon them. Similarly, in their seminal paper on data and phenomena Bogen and Woodward (1988) rightly emphasize the situated and deeply embedded nature of experimental data, thus following the long philosophical tradition arguing against the existence of “raw data” providing unmediated access to reality.
However, this leaves them struggling when pointing to the importance of data as ultimate arbiters of empiricism.9 Just like Giere, they provide an argument for why we should avoid a view of science as “models all the way down”, but do not offer a view on how this can be achieved.

9 I have previously noted how, despite providing a revolutionary analysis of data processing that countered the view of statistics as sole arbiter of inference from data and inspired my own views, Bogen and Woodward (1988) do not explicitly take issue with the representational account of data (Leonelli 2016).

3 What are data models? Problems with Suppes’ account

The closest that philosophers have come to explicitly discussing the relation between data and models is through consideration of what Suppes called “data models”. Within the representational account described in the previous section, these models consist of an intermediate step between data and models (Fig. 2). This makes data models into an excellent starting point towards investigating how data relate to models, and yet does not by itself resolve fundamental questions around the status of data in the representational spectrum, nor does it help to offer an account of how data may operate differently from models and hence be reliably used as sources of empirical evidence for the models themselves.

Fig. 2 The place of data models in the representational view of data and models depicted in Fig. 1, with data models sitting between data and models on the spectrum from world to theory

The contemporary characterisation of data models as “corrected, rectified, regimented and in many instances idealized version of the data we gain from immediate observation, the so-called raw data” (Frigg and Hartmann 2016) demonstrates how the very distinction between data models and other types of models is predicated upon presupposing the existence of “raw data” resulting from “immediate observation” of the world, and thus arguably providing direct and unmediated access to it. A key motivation for Suppes’ examination of how modelling practices relate to data production activities was precisely the recognition of this vicious circle and its troubling implications for philosophical accounts of what it means for scientific research to be empirically grounded. Suppes was deeply concerned with the complexity of data processing activities within experiments, and it was the study of the means and motivations used for procedures such as data reduction and curve fitting that inspired him to differentiate between models of theory, models of experiment and models of data. In his words, “the exact analysis of the relation between empirical theories and relevant data calls for a hierarchy of models” (Suppes 1962, 33). This was not, however, the only motivation behind Suppes’ account. As he made clear when first presenting the notion of data models, an equally powerful goal was to “show that in moving from the level of theory to the level of experiment we do not need to abandon formal methods of analysis” (ibid., 260; see also Suppes 2007). Indeed, Suppes was so concerned by what he called the “bewildering complexity” of experimental situations, that he worried that philosophers would not appreciate the ways in which statistics can and does help scientists to abstract data away from such complexity.
This concern motivated his choice to further distinguish between models of data and a large group of related research components and activities used to prepare data for modelling, which include models of experiment (which describe choices of parameters and setup) and practical issues such as sampling, measurement conditions, and data cleaning. Suppes describes these “pragmatic aspects” as encompassing “every intuitive consideration of experimental design that involved no formal statistics” (1962, 258), and depicts them as the lowest steps of his hierarchy – at the opposite end from its pinnacle, which is occupied by models of theory.10 My worries with Suppes’ characterisation stem not from these distinctions per se, but rather from his conclusion that “once the empirical data are put in canonical form, every question of systematic evaluation that arises is a formal one” (ibid, 261). In other words, Suppes concluded that once data are adequately prepared for statistical modelling, all the concerns and choices that motivated data processing become irrelevant to their analysis and interpretation. Thus, Suppes argued that data models are necessarily statistical models, that is objects “designed to incorporate all the information about the experiment which can be used in statistical tests of the adequacy of the theory” (Suppes 1962, 258). His formal definition of data models reflects this decision, with statistical requirements identified as the ultimate criteria to identify a data model and evaluate its adequacy: “Z is an N-fold model of the data for experiment Y if and only if there is a set Y and a probability measure P on subsets of Y such that Y = ⟨Y, P⟩ is a model of the theory of the experiment, Z is an N-tuple of elements of Y, and Z satisfies the statistical tests of homogeneity, stationarity and order” (1962, my emphasis).

10 I will not linger on the problematic nature of Suppes’ hierarchy in this paper, which has already been critiqued by Koray Karaca (2018) with reference to data processing procedures in particle physics.

Many philosophers have accepted and further promoted Suppes’ decision to define data models as statistical models. Prominent examples range from Deborah Mayo, who in her seminal book Error and the Growth of Experimental Knowledge asked: “What should be included in data models? The overriding constraint is the need for data models that permit the statistical assessment of fit (between prediction and actual data)” (Mayo 1996, 136)11; to Bas van Fraassen, who, despite holding different views on the nature of science from Mayo, also embraced the idea of data models as “summarizing relative frequencies found in data” (Van Fraassen 2008, 167).
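Suppes’ formal definition quoted above packs three conditions into a single sentence. Restated compactly (my gloss rather than Suppes’ own notation, with ⟨Y, P⟩ abbreviating the set Y of possible outcomes together with a probability measure P on its subsets), it requires:

```latex
% Compact gloss (mine, not Suppes' notation) of the definition above:
% Z is an N-fold model of the data for an experiment iff there exist
% a set Y and a probability measure P such that:
\langle Y, P \rangle \text{ is a model of the theory of the experiment}, \qquad
Z \in Y^{N}, \qquad
Z \text{ satisfies the statistical tests of homogeneity, stationarity and order.}
```

The third clause is what makes the definition distinctively statistical: membership in the class of data models is decided by formal tests, not by the judgements that produced the data.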
Through works such as these, Suppes’ legacy has come to be identified with the focus on statistics as an essential component of data modelling, thus underestimating his broader concerns with the epistemology of data and his curiosity about experimental practices where such formal approaches to inferential procedures from data are not readily applicable or even relevant.12

11 Mayo clearly acknowledges that “modeled data, not raw data, are linked to the experimental models. Accordingly, two questions arise [..] how to generate and model raw data so as to put them in the canonical form needed to address the questions in the experimental model [..] how to check whether actual data satisfy various assumptions of experimental models” (Mayo 1996, 129). Her work masterfully shows how statistics can be employed to help with these questions, but does not explore whether and how researchers may address these questions beyond the application of statistical techniques – which is my main concern here.
12 See for instance Suppes (1997, 2003).

I want to argue that the insistence on formal methods as an entry point into the analysis of data processing which characterizes Suppes’ work and much of contemporary philosophy of science fails to tackle critical questions around the source of the epistemic value of data, and the relation between data and models. This is, first, because this analysis deals only with a subset of the objects that scientists working across different fields identify as “data”: that is, those objects – typically numbers or symbols – that can be subjected to statistical manipulation. This precludes Suppes’ approach from being applied to research situations where data are not quantities that are amenable to statistical treatment, and/or where statistical methods of analysis are not used as a means of validating data models, but rather as a way to visualize data (for instance by helping to arrange data into graphs, as illustrated below) – not to speak of cases where statistical methods are not used for data analysis at all. Second, it is hard to see how Suppes’ views can apply to cases where what research questions are being investigated, which conditions are ceteris paribus, and what constitutes the target phenomenon, are not given at the start of the inquiry – as is typically the case within exploratory research. Third and perhaps most important, Suppes’ approach makes uncritical assumptions about the ease with which researchers can identify “raw data” and dismisses the tight intertwinement between activities of data acquisition and data manipulation. As Todd Harris has shown in relation to data models in physics, “in many cases the data that has traditionally been referred to as raw is in fact a data model”, an observation from which Harris concludes that “the process of data acquisition cannot be separated from the process of data manipulation” (Harris 2003).

In what follows, I build on Harris’ analysis and expand on its significance by considering a case of data processing where the differentiation between data models and “simple datasets” is indeed problematic, particularly when it is approached as a difference in the physical characteristics of these research components. I show that researchers can and do change what they consider to be “raw data” to suit different investigative purposes, resulting in changes to the informational content attributed to data and thus their value as evidence for claims. More broadly, I aim to provide an alternative to Suppes’ account of
data models that (1) does not rely on problematic and fixed assumptions about what “raw data” need to be; (2) can be applied to cases of exploratory research and situations where statistics is not central to data analysis; and (3) addresses and resolves the problem of distinguishing data from models, by defining both research components through their relation to inquirers and their role within specific epistemic activities.

4 Stages of data processing: A case from plant phenotyping

Phenotyping is the area of the life sciences devoted to the study of morphology at all levels of biological organization, ranging from the molecular to the whole organism, under varying environmental conditions. A long-term component of botany, phenotyping is currently undergoing a revival within plant science, where it is recognized as crucial to the analysis of gene-environment interactions.13 For instance, phenotyping is indispensable to understanding how shoots and roots respond to drought or flooding – which in turn informs estimates of the impact of weather conditions associated with climate change on agricultural yields, thus facilitating the development of what researchers call “precision agriculture” to tackle the urgent social challenges associated with food security. A recent review in Plant Methods describes phenotyping as a “quantitative description of the plant’s anatomical, ontogenetical, physiological and biochemical properties” (Walter et al. 2015). One of the key challenges in this field – and the reason for choosing it as a case study to illustrate my argument – is precisely the transformation of complex qualitative objects such as free-text descriptions and images into machine-readable data that can be subjected to computational analysis. Contemporary phenotyping relies heavily on the analysis of large sets of imaging data, which are produced at a fast rate and high volume through automated systems comprising several cameras, each geared to capture different signals (ranging from the visible to the infrared spectrum of light; e.g. Fahlgren et al. 2015). As this section illustrates, efforts to find ways of analyzing these data are deeply intertwined with efforts to develop tractable specimens, instruments and computational tools, a complex set of iterations and expertise that defines not only how plants are described, but the type of questions and phenomena that researchers end up focusing on.

13 For a rich discussion of the history of phenotyping experiments, see Taylor and Lewontin (2017). For a review of current developments in phenotyping and the significance of data and data analytic tools within them, see e.g. Bolger et al. (2017) and Coppens et al. (2017).

The “roots for the future” (SureRoot) project provides a good instance of the challenges involved in processing phenotypic imaging data. The goal of the project was to understand grass-soil interactions in order to improve root strength, depth and ability to efficiently use water.14 This was achieved in two steps. Part A, which is what I will focus on here and was carried out at the NCPP, generated and analyzed a vast set of root imaging data in order to assess how root structures linked to specific genetic traits fit different soil conditions. This involved relying on the tightly controlled climatic and experimental conditions characterizing the “smart glasshouse” of the NCPP, within
which plants are carefully monitored and regularly photographed through the use of conveyor belts set up to bring the plants to five different imaging chambers as often as required (sometimes multiple times per day, to capture fine-grained patterns of plant development).15 Part B, which was carried out at an experimental farm where plant specimens were grown in full exposure to the natural environment, aimed to generate comparable field data, with the purpose of checking the external validity of results obtained on plants grown under more controlled conditions. During a research visit to the NCPP in 2015, I identified seven stages involved in the production and processing of data within part A of the SureRoot project, which I briefly discuss below.

14 In the words of Principal Investigator Mike Humphreys, “the study represented a real breakthrough in high throughput root analysis necessary for future grass breeding.. helpful to new holistic approaches to crop improvement that take into account not only measures of crop production but the impact these have on their surrounding habitat” (pers. comm., October 2018; see Humphreys et al. 2018).
15 The following video gives a good overview of what the NCPP smart glasshouse looks like and how it works in practice: https://www.youtube.com/watch?v=8qBsVP0j70k. Smart glasshouses are increasingly popular experimental spaces for phenotyping around the world, though their characteristics and the extent to which procedures are automated and insulated from the external environment varies considerably (George et al. 2014). It should be noted that the majority of plant specimens used for SureRoot was housed outside the glasshouse prior to screening, so as to maximize the plants’ exposure to the natural environment while still carefully monitoring and measuring their exposure to light, water and humidity. As the majority of experiments processed in facilities such as NCPP rely on specimens grown exclusively in the glasshouse, it remains important to stress the significance that environmental controls inside the glasshouse have for subsequent imaging and the interpretation of the data.

4.1 Stage 1: Preparing specimens

The first requirement towards the production of useable imaging data is to grow plant specimens that are amenable to the transport and imaging technologies employed in the smart glasshouse. While the initial parameters for the experiment are provided by plant scientists, including the choice of which species to use (in this case, the tall grass Festulolium), it falls mostly to the technicians that run the glasshouse, the adjoining fields and the imaging facilities to ensure that specimens satisfy the requirements of experimental design. A considerable amount of care and know-how is required to plant seeds so that they are equally numbered and spaced in every pot, and to maintain the growing plants so that their size and growth rate stay within a range that makes imaging results comparable across plants. The health of plants is also carefully monitored, with plants that manifest unusual traits marked out as unusable and/or potentially interesting for other investigative purposes (such as understanding whether the trait is the result of a mutation or environmental exposure – a point to which I shall come back below). This kind of standardisation, which I have elsewhere discussed as a form of material abstraction used to create material models (Leonelli 2008), is particularly complex to achieve in this case since the experiment requires growing plants on real soil, which is itself a source of variability given its highly diverse microstructure and mineral and microbial composition.16 Further elements to keep under control are the conditions under which plants travel to the imaging chambers. Conveyor belts are not fully reliable, with sudden jerks resulting in plants being thrown off (and thus a gap in imaging data) or dirty pots (that damage the extent to which images can be compared). Plants themselves also play tricks on the technology by shedding leaves which can jam the conveyor belt (problematic especially overnight, when humans are not
at hand to check) and/or compromise the comparability of the images.17 Finally, environmental controls can also fail in ways that researchers had not predicted or accounted for. Six months after the inauguration of the facility, for example, NCPP technicians realised that in some of the experiments being carried out, the temperature difference between the glasshouse and the imaging chambers was giving plants a thermal shock, a factor which may affect the measurement of plant temperature responses. Such issues are amplified when imaging plants that are grown outside the glasshouse.

16 Many studies of root growth avoid this source of variability by resorting to soil-free growth media (which have the further advantage of being transparent, thus considerably facilitating the imaging).
17 NCPP has an alarm system to signal overnight blockages, but somebody comes in to check only if the whole system is compromised. Experiments typically run from midnight to 11 am and from 3 pm to 10 pm, so there is only a small window for maintenance, which is sometimes taken up with other experiments. Technicians are concerned about the “lack of redundancy in the system”, which is likely to worsen as the apparatus ages.

4.2 Stage 2: Preparing and performing imaging

Another key condition for the experiment is identifying appropriate conditions, techniques and tools for generating digital images. Technicians consider the desired background, resolution, focus, lighting conditions and angle of the pictures, as well as the number and interval of repeats per plant – which is constrained by which imaging tools are employed and how they are calibrated, as well as the number of experiments to be carried out at any given time18 – and what counts as ‘dirt’ and ‘debris’.19 Technicians also develop techniques and tools such as glass pots to make the roots visible to imaging, and imaging specialists are consulted on how to adapt available imaging techniques and instruments to the experimental conditions of the glasshouse. The result is images such as Fig. 3 below.

Fig. 3 Example of Festulolium plant image produced for SureRoot

Generating such an image involves a vast amount of know-how that affects the extent to which the resulting data are viewed as “usable” by researchers, and yet is not typically recorded systematically. In the words of a technician, “quite often that sort of stuff is lost and it stays in somebody’s lab book, or in their computer, or on their server, but, in essence, that person moves on and the group goes to do other things. It could be recreated through an immense amount of work. It’s almost as bad as [..] stuff being lost in breeders’ notebooks. It’s the modern equivalent of that” (PI_1_C).

18 The SureRoot experiment involved only 28 plants, with each round of imaging taking 1 h and 40 min. For experiments involving a larger group of plants, imaging can take as long as 12 h.
19 For a conceptualisation of the epistemic function of “dirt” in the production, cleaning and visualization of data, see Boumans and Leonelli (under review).

4.3 Stage 3: Data storage and dissemination

The third stage of data processing involves storing and labelling the images being produced, so that they can be searched and retrieved as required for analysis.
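To give a concrete sense of what such labelling involves, the sketch below parses an image file name into searchable metadata fields. The underscore-separated pattern and field names are hypothetical illustrations of the kind of scheme at stake, not the actual NCPP naming system (which, as described in footnote 20 below, was local and not written down):

```matlab
% Hypothetical example (not the NCPP scheme): parse a file name such as
% 'Festulolium_m07_exp03_rep2.png' into metadata fields that can be
% indexed for search and retrieval.
fname = 'Festulolium_m07_exp03_rep2.png';
[~, base, ~] = fileparts(fname);             % strip path and extension
parts = strsplit(base, '_');                 % {'Festulolium','m07','exp03','rep2'}
meta.plantType  = parts{1};                  % plant type
meta.mutant     = sscanf(parts{2}, 'm%d');   % mutant number (7)
meta.experiment = sscanf(parts{3}, 'exp%d'); % experiment number (3)
meta.repeat     = sscanf(parts{4}, 'rep%d'); % repeat number (2)
```

Even a toy scheme like this one shows how labelling decisions fix in advance which queries (by mutant, experiment or repeat) the stored data can answer, and hence which phenomena they can readily be used to document.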
I have shown in previous work the epistemic importance of this stage and the challenges and tools involved in organizing data collections so that they can be easily searched and used for specific investigative purposes (Leonelli 2016). This is where data managers step in, bringing expertise on available systems for the curation and classification of data and related contextual information (“meta-data”). This work involves interpretative decisions around which types of inquiry the imaging data could help to address. In the case of SureRoot, it is clear that the images can serve as evidence for claims about roots, and they can therefore be classified under that term. However, depending on how they are subsequently analyzed, the images could provide a host of other information about the plants, for instance about stem and leaf growth. By labelling data with reference to which phenomena they may be used to document, data managers contribute to identifying and circumscribing their evidential value in ways that shape their usability for analysis. The same holds for decisions around how to label meta-data (documenting for instance plant provenance and growth conditions), which determine how researchers evaluate the potential significance of data and what they can be taken to represent.20 Last but not least, and typically in consultation with biologists and technicians, data managers control access to the data, by deciding whether and how the data are shared among NCPP staff, external collaborators and other stakeholders – a decision that can have significant implications for which data formats, labels, software and visualization tools are ultimately chosen to carry the data.21

20 When I visited, the name of each file would indicate the plant type, mutant number, experiment number and repeat number, thus including a lot of information about the circumstances of data production. The labelling system was, however, far from standardized, and the keys to the system were not written down anywhere, which made the files difficult to handle computationally and created problems of scale for very large experiments. Such circumstances vividly illustrate the significance of a data manager’s know-how towards the re-use of data, though such expertise remains equally important in situations where commonly used standards need to be applied to locally generated data (e.g. Boumans and Leonelli under review).
21 During my visit at NCPP, such decisions were constrained by the lack of a systematic approach to data storage (which was all done via local servers) and of a data sharing policy for the whole institute, which were in turn due to a shift in personnel: the previous data manager had just left, bringing with him a wealth of local knowledge that the new data manager had yet to reconstruct.

4.4 Stage 4: Coding for image analysis

The fourth stage involves developing software that can support the analysis of imaging data.
This is where computer scientists enter the fray, initially consulting with biologists about the aims of the experiment, but then working largely on their own to develop a programme through which images could be mined. This process includes evaluating which measurements could be effectively extracted from the imaging data through computational means, so as to make it possible to accurately and consistently compare root systems. In this case, the task was overseen by a senior, highly skilled computer scientist with decades of experience, who discussed with me the difference between his approach and the biologists’ in the following terms: while the latter look for ways to use data to answer biological questions, often resulting in chasing methods to capture information that was hard to extract from the available files, the former focus on identifying information that could be easily harnessed with available computational tools – whether or not such information had immediate biological significance. Rather than focusing on the biological questions at hand, computer scientists thus approached the question of how to analyse the plant images by considering which properties of the images at hand would be most easily and reliably amenable to analysis through existing computational tools. As a result of this research, computer scientists zoomed in on measurements of root width, number and positions within the pot, which could be analysed by tweaking an existing programme available within the widely used software MatLab so that it would capture plant-relevant parameters. Tweaks included determining the range of expected minimal and maximal width of roots, their geometrical angles and location, the required spacing of the pot image, the relevant time intervals and the number of measurements per day (Fig. 4).

4.5 Stage 5: Image filtering

Once satisfied with the code, the computer scientists used it to analyze thousands of plant images, resulting in a series of “filtered images” that look much like the original data to the untrained eye (see Fig. 5), and yet have been modified in order to make the parameters of the analysis more prominent and easier for the computer to pick up, while features that are considered to be less significant (such as very small roots positioned at awkward angles) disappear from the image.

4.6 Stage 6: Image analysis

Now it was possible for the computer scientists to automate the extraction of parameters from each filtered image, so as to produce plot charts that quantified and analyzed root distribution relative to soil structure and depth of roots in the pot (e.g. Fig. 6). Again, there are computational constraints on what kinds of analysis and graphical representation can be selected to visualize the data at this stage, and the computer scientists played around with the design and color of visualizations and with the ways in which the coding and filtering affect the charts. Computer scientists emphasize the idea of “play” as crucial to their work, especially when compared with the work of biologists and technicians.
For one thing, playing around with images is cheaper than running experiments, with more trial and error allowed; and “data games” can be interrupted, picked up again, and interspersed with other tasks without significant disruption. More importantly, “sometimes the problems phenotyping poses are not amenable to AI/vision attacks. Sometimes the fun we can have writing programs is of no relevance to phenotyping. But we think the interesting solutions are often to problems that were [..]”

[Fig. 4: MatLab code used to extract and aggregate root measurements from the SureRoot images; the excerpt below breaks off mid-line in the source.]

```matlab
PIXELSPERCM = 20;
data = csvread(csvfile);
plants = unique(data(:,1));    plantsn = length(plants);
days = unique(data(:,2));      daysn = length(days);
angles = unique(data(:,3));    anglesn = length(angles);
instances = unique(floor(plants/10)); instancesn = length(instances);
reps = unique(mod(plants,10)); repsn = length(reps);
if (length(i2plot) == 0)
    i2plot = instances;
end

% First, for each plant accumulate per-day means
for i = 1:plantsn
    for j = 1:anglesn
        % interpolate over each signal (we may not have a full set on a day)
        idx = find((data(:,1) == plants(i)) & (data(:,3) == angles(j)));
        tmp1 = data(idx,4:5);
        tmp2(j,:,:) = interp1(data(idx,2), tmp1, mday:Mday);
    end
    % mean over angles
    pdata(i,:,:) = mean(tmp2(:,:,:));
end
%%% We could plot per-plant data at this point

% Now average over instances
% Keep std for later
for i = 1:instancesn
    plantmin = 10*instances(i);
    plantmax = 10*(instances(i)+1);
    idx = find((plants >= plantmin) & (plants
```
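To make the move from filtered images to plot charts more concrete, the following minimal sketch shows the kind of operation at stake in Stages 5 and 6: filter an image so that root pixels stand out, then bin the detected pixels by depth in the pot. The file name, brightness threshold and bin width are my own illustrative assumptions, not part of the NCPP code:

```matlab
% Illustrative sketch (not the NCPP code) of Stages 5 and 6: filter a
% root image and chart root distribution by depth in the pot.
PIXELSPERCM = 20;                       % image scale, as in the Fig. 4 listing
img = imread('root_image.png');         % hypothetical image file
gray = double(rgb2gray(img));
rootmask = gray > 180;                  % crude filter: keep bright root pixels (assumed threshold)
[rows, ~] = find(rootmask);             % vertical positions of detected root pixels
depthcm = rows / PIXELSPERCM;           % convert pixel rows to cm below the soil surface
edges = 0:2:40;                         % 2 cm depth bins (assumed pot depth of 40 cm)
counts = histc(depthcm, edges);         % root pixels per depth bin
bar(edges, counts, 'histc');            % chart of root distribution by depth
xlabel('Depth in pot (cm)');
ylabel('Root pixels detected');
```

Even in this toy version, qualitative choices (which threshold separates root from soil, which features count as negligible) shape the resulting chart as much as any statistical operation does.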