This is a preprint of an article whose final and definitive form is published in Philosophy of Science. Please quote only the published version of the paper.

How to do digital philosophy of science

Charles H. Pence
Department of Philosophy and Religious Studies
Louisiana State University
Baton Rouge, LA, USA
charles@charlespence.net
https://charlespence.net

Grant Ramsey
Institute of Philosophy
KU Leuven
Leuven, Belgium
grant@theramseylab.org
http://www.theramseylab.org

Abstract

Philosophy of science is beginning to be expanded via the introduction of new digital resources—both data and tools for its analysis. The data comprise digitized published books and journal articles, as well as heretofore unpublished and recently digitized material, such as images, archival text, notebooks, meeting notes, and programs. This growing bounty of data would be of little use, however, without quality tools with which to analyze it. Fortunately, the growth in available data is matched by the extensive development of automated analysis tools. For the beginner, this wide variety of data sources and tools can be overwhelming. In this essay, we survey the state of digital work in the philosophy of science, showing what kinds of questions can be answered and how one can go about answering them.

1. Introduction. Our understanding of science is being broadened by the digitization and automated analysis of the various outputs of the scientific process, such as scientific literature, archival data, and networks of collaboration and correspondence. These technological changes are laying the foundation for new types of problems and solutions in the philosophy of science. The purpose of this article is to provide an overview and guide to some of the novel capabilities of digital philosophy of science.

To best understand the reasons why digital philosophy of science lets us ask a new class of questions, let's consider how it differs from more traditional approaches. For example, consider how we might draw conclusions about articles in the journal Nature. It has published over 360,000 articles since its founding in 1869, meaning that one would have to read ten articles a day for one hundred years to work through the complete archives of this journal alone. Of course, the standard response in the philosophy of science is to favor depth over breadth, and closely read a much smaller number of articles. While there is certainly much we can learn about science in this way, some broad questions about the nature and history of science—questions, for example, about how theories arise and become established in the literature as a whole—would remain unanswerable without a way to glean information from hundreds of thousands or even millions of journal articles. Much the same argument holds for scientific images, or information about the collaboration, communication, training, or citation connections between researchers.

The question, then, is to what degree we can learn from the vast scientific literature without having to read every article closely—to instead do what is called distant reading (Moretti 2013). With distant reading, we input a large body of literature into a computer, and use it to do the “reading” for us, extracting large-scale patterns that would be invisible or impractical to find otherwise.
In the philosophy of science in particular, this process has been aided by a number of large digitization efforts targeted at the outputs of the scientific process. One crowning achievement of these efforts is the nearly complete digitization of the academic journal literature. This content is thus now accessible in ways that it has never been before.

Digital approaches to the philosophy of science contrast with traditional methods involving close reading—intently reading a narrow body of literature within a focal area. With close reading, a philosopher will have an impressive command over a limited domain. He or she closely reads a select set of documents from the scientific literature, or analyzes the experimental, training, or collaborative records of a small group of researchers to attempt to extract the structure of a scientific theory, or to understand the meaning of its terms.

We should stress that the close reading-based traditional philosophy of science and distant reading-based digital philosophy of science are not in competition. Instead, they are complementary. If, for example, a researcher wants to know how the meaning of a particular term has changed over time, he or she could use automated textual analysis tools to locate instances of the term, find hot spots in which the term is used frequently, quickly see which words it is associated with, and how these word associations have changed over time. In conjunction with digital analysis, performing close reading of key texts will be invaluable. The close reading may then spur further digital inquiries, and so on. Thus, traditional and digital philosophy of science work in tandem, each supporting the other.

The remainder of this article will canvass a number of significant issues that must be dealt with in order to develop a digital philosophy of science research program. We hope that this overview will be helpful to researchers who are interested in moving forward with digital tools but are not certain where or how to begin.

2. Getting started. Because digital philosophy of science is a relatively new field, not only is there no set of standard tools, but it is often unclear what sorts of questions can be answered by the extant tools. Thus, let's begin by considering some of the new kinds of questions one can address.

One of the most significant advantages of distant reading comes from the ability to engage with corpora significantly larger than those usually treated by philosophers and historians of science. For example, Murdock, Allen, and DeDeo (2017) were able to analyze large-scale patterns in Darwin's reading by accessing the full text of every book that we know him to have read over a period of decades. These kinds of analyses simply would not be possible without the aid of technology. Answering research questions that leverage broad (yet still circumscribed; see section 4) sets of data is thus likely to be a fruitful use of digital tools. For example, one could track concepts over the entire print run of a journal, the collections of books published in the Biodiversity Heritage Library (Gwinn and Rinaldo 2009), or the PubMed Open Access Subset of contemporary biomedical journal articles (Roberts 2001).
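To make the idea of tracking a concept concrete, the following minimal sketch counts, decade by decade, how often a target term appears in a corpus and which words occur near it. It is written in Python using only the standard library; the toy documents, the target term, and the five-word context window are our own illustrative assumptions rather than features of any of the resources just mentioned.

```python
# A minimal sketch of tracking a term across a corpus over time: for each
# decade, count how often the term appears and which words occur near it.
# The toy documents and the five-word window are illustrative assumptions.
import re
from collections import Counter, defaultdict

def track_term(documents, target, window=5):
    """Return per-decade frequency and co-occurrence counts for `target`."""
    freq = Counter()
    neighbours = defaultdict(Counter)
    for year, text in documents:
        decade = (year // 10) * 10
        tokens = re.findall(r"[a-z']+", text.lower())
        for i, token in enumerate(tokens):
            if token == target:
                freq[decade] += 1
                context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
                neighbours[decade].update(context)
    return freq, neighbours

# Hypothetical usage with two toy "articles":
corpus = [
    (1871, "the theory of evolution by means of natural selection ..."),
    (1971, "the evolution of allele frequencies in finite populations ..."),
]
frequencies, associations = track_term(corpus, "evolution")
for decade in sorted(frequencies):
    print(decade, frequencies[decade], associations[decade].most_common(3))
```

Scaled up to the full print run of a journal, output of this kind is exactly the sort of signpost that can then direct the close reading described above.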
These kinds of investigations allow us to explore the conceptual landscape of a field through distant reading, by offering (at least in some cases) an exhaustive analysis of an area.

Another advantage comes from the ability of analytical algorithms to parse texts in ways that even well-trained close readers cannot. For example, fine-grained patterns of language usage, such as the shift in a term from a noun use to a verb use, or a shift from referring to science as a one-person activity to a group activity, could be traced in the literature with a level of exhaustiveness, objectivity, and care that would simply be impossible for a single reader. Automated tools can analyze sentence structure, word order, or part-of-speech usage in a way that would try the patience of any scholar (Manning et al. 2014).

The ability of digital tools to increase the breadth of a research question is also important. If one has a hypothesis drawn from a particular domain (maximization or optimality inferences in biology, for example), this hypothesis could be tested in other, separate domains (economics, psychology, sociology) with only a modest further investment of resources.

While digital tools can aid in answering existing research questions, these tools also open the possibility of framing new questions without a clear analogue in the pre-digital world. For instance, work by Manfred Laubichler and colleagues applies dynamic network analysis to our understanding of scientific conceptual development (Miller et al. 2015). The questions they ask arise in conjunction with the digital tools, and in dialogue with digital humanities researchers in other disciplines.

3. Choosing the right tools. Now that we have a sense of the advantages of digital analysis, let's consider the currently available tools and corpora of data relevant to the philosophy of science. To begin, we should draw attention to the central repository of digital humanities tools, known as the DiRT Directory (for more on its construction and predecessors, see Dombrowski 2014). There are nearly as many digital humanities tools as there are digital humanities researchers, and the landscape of contemporary software changes rapidly. For nearly any kind of analysis, the directory will include some tool that performs it—the most important question will be whether the data available can efficiently be converted into the format required by that tool.

3.1. Basic tools. There are a number of tools that may be used immediately by researchers, as they do not require that one collate a set of documents of interest in advance. Perhaps the most famous of these is the Google Ngram corpus (Brants and Franz 2006; Michel et al. 2011). This corpus contains the entirety of the scanned Google Books project, current as of 2012, with frequency data for single words as well as pairs and longer sequences (so-called bigrams, trigrams, and, more generally, n-grams). Obviously, the Ngrams project does not exclusively contain scientific or philosophical content, and hence a number of queries that might interest philosophers of science will simply not return meaningful results from the Ngram Viewer. For example, the scientific usage of the term “evolution” will be completely masked by the broader cultural use of the term, and hence philosophers interested in the use of this term are unlikely to be able to uncover interesting data.
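For terms like this, one partial remedy is to compute the same kind of n-gram frequencies over a corpus restricted to the relevant scientific literature, where domain-specific usage is not swamped by general cultural usage. The sketch below shows what such counts amount to; it assumes documents already reduced to plain text, and the two toy sentences are invented for illustration.

```python
# A minimal sketch of what n-gram frequencies amount to: count runs of n
# consecutive tokens in a corpus one controls (here, a toy list of strings),
# so that domain-specific usage is not swamped by general cultural usage.
import re
from collections import Counter

def ngram_counts(texts, n=2):
    """Count n-grams (as tuples of tokens) across a list of plain-text documents."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts

documents = [
    "natural selection acts on heritable variation",
    "variation is the raw material of natural selection",
]
print(ngram_counts(documents, n=2).most_common(3))   # bigrams
print(ngram_counts(documents, n=3).most_common(3))   # trigrams
```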
There are also a number of worries about the statistical representativeness of the Google Ngram corpus, even when judged as a measure of broader cultural usage or popularity (Morse-Gagné 2011; Pechenick, Danforth, and Dodds 2015).

Much more precise search and analysis may be performed using JSTOR's Data for Research project (Burns et al. 2009). This tool allows users to perform searches and analyses against the entire corpus of JSTOR journals. Researchers may search for articles by journal, publication date, author, subject, and more, allowing for careful control over the set of articles to be analyzed. These articles may then be queried for word frequencies (and n-gram frequencies), as well as automatically extracted “key terms,” which are words common in the selected articles but uncommon in the corpus as a whole (computed using the tf-idf score). The frequency scores from JSTOR DFR may also be used as an input to a variety of the tools described below.
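The idea behind such key terms is easy to illustrate. The following sketch computes a simple tf-idf score by hand, weighting a word's frequency in a document by the (log) inverse of the number of documents that contain it, so that words frequent in one document but rare across the corpus score highly. It is meant only to show the shape of the calculation, not to reproduce JSTOR's own implementation, and the three toy documents are invented.

```python
# A hand-rolled tf-idf sketch: words frequent in one document but rare across
# the corpus score highly, which is the sense of "key terms" described above.
# This illustrates the calculation only; it does not reproduce JSTOR DFR's code.
import math
import re
from collections import Counter

def tfidf(documents):
    """Return, for each document, a dict mapping each word to its tf-idf score."""
    tokenized = [re.findall(r"[a-z']+", doc.lower()) for doc in documents]
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    n_docs = len(documents)
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in tf.items()
        })
    return scores

corpus = [
    "selection acts on variation in populations",
    "drift and selection both change populations",
    "museums curate collections of specimens",
]
for doc_scores in tfidf(corpus):
    top = sorted(doc_scores.items(), key=lambda kv: kv[1], reverse=True)[:3]
    print(top)
```

JSTOR DFR's own key terms are computed over a vastly larger corpus, but the intuition is the same.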
3.2. Gathering a corpus. The more advanced tools are set apart primarily by not coming with a pre-loaded corpus of material to study. This means that the challenge of obtaining data falls to individual researchers. As mentioned above, we find ourselves in a particularly fertile period for data availability in the philosophy of science. Much of the journal literature, in some cases back into the nineteenth century, is available online in PDF or HTML form. Comprehensive online projects are available that focus on the works, life, and correspondence of figures like Darwin (Secord 1974; van Wyhe 2002), Newton (Iliffe and Mandelbrote 1998), Poincaré (Walter, Nabonnand, and Rollet 2002), Einstein (Mendelsson 2003), and others (Pouyllau et al. 2005; Beccaloni 2008; Mills 2011). A number of discipline-specific archives have also been constructed, such as the Embryo Project Encyclopedia, an open access, digital repository covering the history of embryology and developmental biology (Maienschein et al. 2007). To this may be added the digital collections now increasingly available from a wide variety of museums and libraries. With an appropriate collection of data obtained for a researcher's private use, it becomes possible to leverage a much wider variety of analytical tools. (These data must also be carefully curated and safely preserved; we will return to the question of data management in the next section.)

A researcher gathering a corpus must consider how and to what extent the data should be annotated. Minimal annotation—for example, leaving content as plain text with only bibliographic data for tracking—allows for the rapid creation of a large corpus, and lowers the future burden of maintaining and updating the annotations. But more significant annotation—such as marking up textual data in a format like that described by the Text Encoding Initiative (Ide and Véronis 1995)—allows for more complex, fine-grained, and accurate analyses. This annotation can take a variety of forms. For textual data, TEI allows users to indicate the locations of various parts of the document (pages, paragraphs, chapters, indexes, figures, or tables), or the various kinds of references made by a piece of text (dates, citations, abbreviations, names of persons or institutions, etc.).

This process of cross-referencing documents may be aided by the use of external ontologies—in the sense (not the one common in philosophy) of collections of standardized terms and concepts that allow for the same term to refer unambiguously across multiple documents. In philosophy, the Indiana Philosophy Ontology project, or InPhO (Buckner, Niepert, and Allen 2007), allows standardized reference to concepts such as “sociobiology,” or to particular philosophers. A number of such ontologies also appear in other areas of the sciences, and a document may be marked up with multiple ontologies to add further semantic richness.
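To see why such markup pays off, consider the following sketch, which mechanically pulls person names and dates out of a small TEI-style fragment. The fragment is invented and deliberately simplified (a real TEI document would declare the TEI namespace and carry far more structure); the point is only that tagged references can be extracted, counted, and cross-referenced without any further interpretation of the text.

```python
# A sketch of why markup pays off: once names and dates are tagged, they can
# be extracted mechanically. The fragment below is a schematic, simplified
# TEI-style snippet (not a complete, valid TEI document), invented here to
# illustrate the idea.
import xml.etree.ElementTree as ET

snippet = """
<text>
  <p>On <date when="1859-11-24">24 November 1859</date>,
     <persName ref="#darwin">Darwin</persName> published his
     <title>Origin of Species</title>.</p>
</text>
"""

root = ET.fromstring(snippet)
people = [el.text for el in root.iter("persName")]
dates = [el.get("when") for el in root.iter("date")]
print(people, dates)   # -> ['Darwin'] ['1859-11-24']
```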
With a heavily annotated document, significantly more complex analysis may be applied, as the computer now “knows” where particular concepts are mentioned, how they are used, and how they relate to other ideas. While the use of such methods is relatively untested in philosophy, the biomedical field has made significant strides in this direction in recent years—for example, analysis of the usage of gene and chemical concepts in the scientific literature has actually enabled the extraction of novel relationships (previously unpublished by researchers, but discernible from the body of literature as a whole), and even the generation of novel hypotheses about future drug development (A. M. Cohen and Hersh 2005).

The question of the representativeness of one's sample of data is also a significant one with which researchers must engage. As we noted above, even in the largest corpora, such as Google's Ngram collection, there are still problems with the statistical representativeness of the sample (Morse-Gagné 2011), with biases in the temporal availability of data (more data tends to be available closer to the present, as the relevant outputs were “born digital”; Michel et al. 2011) and systematic sources of error such as that introduced by optical character recognition (Hoover 2012). These concerns are somewhat alleviated when using a curated corpus known to be complete (such as databases of historical correspondence), but even in these instances, researchers must remain constantly vigilant against statistical bias.

3.3. Advanced tools. With a corpus in place, there is a variety of options for users interested in performing analyses impossible with the basic tools described above. First, there are a number of tools designed to aid researchers in presenting their material as an easily navigable, searchable, categorized public resource—a public digital archive or museum. The most popular of these is Omeka (D. Cohen 2008). Omeka is a free, open-source software product that allows users to construct online archives and museum exhibitions, to add catalog information and metadata to digital items, and to attractively present all of this material to the public at large. Deploying a website such as this is a nice way to garner some immediate, public-facing payoff from the difficult work of obtaining and curating a digital collection.

One alluring feature of large digital data sets is the possibility of analyzing the networks found within them—whether these are networks of collaboration drawn from experimental archives or lab notebooks, networks of correspondence drawn from digitized letters, or citation networks extracted from the journal literature. Such network analysis can often allow us to see patterns in the overall structure of a field that would be otherwise difficult to discern. One of the most user-friendly network analysis tools available is Gephi (Bastian, Heymann, and Jacomy 2009). Gephi allows users to import graphs in a number of formats (including ones as simple as CSV spreadsheet data), and to perform a variety of analyses and visualizations. The network may be broken into clusters (using a standard measure known as modularity; Blondel et al. 2008), the degree of connectivity of individual nodes may be easily explored, and the results can then be rendered graphically for presentation.
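The same kind of analysis can also be scripted rather than performed in a graphical tool, which helps with the reproducibility concerns discussed in section 4. The sketch below assumes the Python networkx library and an invented co-authorship edge list; it computes node degrees, clusters the network with a greedy modularity method (a different algorithm from the Blondel et al. procedure mentioned above), and writes a file that Gephi can open.

```python
# A sketch of simple network analysis on a toy co-authorship edge list,
# using the networkx library as a scriptable complement to Gephi. The names
# and edges are invented; write_gexf saves a file that Gephi can open.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

edges = [("Pearson", "Weldon"), ("Weldon", "Darwin"),
         ("Bateson", "Punnett"), ("Pearson", "Galton"),
         ("Galton", "Darwin")]

G = nx.Graph()
G.add_edges_from(edges)

# Degree: how many collaborators each node has.
print(sorted(G.degree, key=lambda nd: nd[1], reverse=True))

# Modularity-based clustering, in the spirit of the community detection
# described above (networkx implements a greedy variant, not Blondel et al.'s).
for community in greedy_modularity_communities(G):
    print(sorted(community))

# Export for visual exploration in Gephi.
nx.write_gexf(G, "collaboration.gexf")
```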
If the data to be analyzed is text, a popular choice is Voyant Tools (Sinclair and Rockwell 2016), available at http://voyant-tools.org/. Once a corpus of text is uploaded to Voyant, the user is immediately presented with a wide variety of options: a word cloud, a cross-corpus reader, a tool for tracking word trends through the text, and a short snippet concordance are among the immediately available tools, and a variety of other, more complex analyses and attractive visualizations may be performed using plugins. Voyant may also be used to save online corpora for future use, which facilitates classroom usage of textual analysis.

Another challenging problem likely to be faced by philosophers of science interested in the scientific literature is the analysis of a large number of journal articles, a kind of analysis not often performed in traditional digital humanities, which often focuses on book-length source material. To solve these problems, one of us has created a software package, RLetters (Pence 2016). (One public installation of this software, evoText, containing a corpus of journals in evolutionary biology, is described in Ramsey and Pence 2016.) This is a web application, backed by a search engine and database, which may be deployed by anyone wishing to analyze a corpus of academic journal articles. It includes a variety of analysis methods (sharing many of those described for Voyant), including an especially powerful word frequency analyzer.

Finally, should all of these tools fall short, the statistical computing language R (R Core Team 2017, available at https://www.R-project.org) has become a very popular base for constructing novel analyses in the digital humanities (Jockers 2014). R combines a comprehensive set of standard statistical analyses (such as principal component analysis and dendrogram or tree clustering) with an extensive collection of user-contributed packages which may be utilized to perform complex tasks such as querying Google Scholar or Web of Science. This power comes at the cost of significant complexity, however, as R operates like a programming language rather than a graphical application.

3.4. Copyright issues. One of the most common pitfalls that users are likely to encounter when building corpora of digital data involves copyright and licensing issues. While much material pertaining to figures like Newton or Darwin is available in the public domain, a confusing legal landscape besets all work created after 1923 (the cutoff for public-domain status of published works in the United States). A number of recent court decisions (most significantly Authors Guild v. HathiTrust; Baer 2012) have begun to clear the legal landscape in the United States, indicating that scholarly textual analysis and other sorts of digital-humanities work are likely to fall under the U.S. “fair use” provision. This, however, does nothing to simplify obtaining copyrighted materials, nor does it help scholars in other countries, many of which lack an analogue to fair use. It also may well be cold comfort to litigation-sensitive universities.

Increasingly, however, publishers are recognizing the demand for digital analyses of their materials. Elsevier has deployed a text and data mining policy that applies to all of its journals, and will allow researchers to access and analyze articles as part of any institutional subscription (Elsevier 2014). Under the auspices of JSTOR's DFR project, researchers may request access to full-text articles if their university subscribes to the appropriate JSTOR collections. We also have had some degree of personal success negotiating access contracts for closed-access journal articles with their publishers, including with Nature Publishing Group, who were very receptive to the possibilities opened by digital analyses. We anticipate that this trend toward increased ease of access will only continue.

4. Data. The academic process relies on the ability of other researchers to access, verify, and reproduce the results of analyses such as these. We will next consider how to publish and archive data, and how to make public the tools and techniques used to achieve the results.

4.1. Data management. Philosophers are not, as a rule, accustomed to producing large amounts of data as part of our research. When using digital tools, we find ourselves faced with many of the same questions our scientific colleagues have dealt with for some time—how do we document, store, and preserve the data that our research generates? We cannot offer comprehensive answers to these questions here; we raise them only to emphasize that problems of metadata, documentation, and archiving have been discussed extensively in other contexts and should not be neglected. Early engagement with these resources will prevent significant problems from arising in the long term (York 2009; Michener 2015).
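There is no single standard that philosophers must adopt here, but one very simple habit is to keep a machine-readable manifest alongside a corpus, recording where each file came from, when it was retrieved, and under what terms. The field names and file names in the sketch below are illustrative assumptions, not a recognized metadata schema.

```python
# One very simple data-management habit: keep a machine-readable manifest
# alongside the corpus recording where each file came from and when.
# The fields and file names are illustrative, not a recognized metadata standard.
import csv

records = [
    {"file": "nature_1900_042.txt", "source": "Nature archive",
     "retrieved": "2017-06-01", "license": "publisher agreement"},
    {"file": "weldon_letter_14.txt", "source": "university archive scan",
     "retrieved": "2017-06-03", "license": "public domain"},
]

with open("manifest.csv", "w", newline="", encoding="utf-8") as fh:
    writer = csv.DictWriter(fh, fieldnames=["file", "source", "retrieved", "license"])
    writer.writeheader()
    writer.writerows(records)

print("Manifest written for", len(records), "files")
```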
4.2. Reproducibility. If digital analyses are to serve as elements of the permanent research record along with journal articles, then we must take care to make those analyses reproducible in the future. This is a multifaceted problem that has, in recent years, received significant attention from the scientific community (Munafò et al. 2017). For most digital philosophy projects, there are three key components to reproducibility.

First, software must be reproducible—that is, easily installed and run by those with the relevant technical expertise. To that end, the development and use of open source software is laudable, as is using a readily accessible distribution platform such as GitHub.

Second, corpora must be reproducible. This can be a difficult challenge, particularly if one has negotiated access to a body of copyrighted materials for analysis. It is often possible to negotiate access not just for an individual researcher or research team, but also for any researchers accessing a public resource (Ramsey and Pence 2016 successfully negotiated such contracts for evoText). We encourage researchers to think very seriously about this challenge as they develop corpora.

Finally, the original forms of data must be—and remain—available. Open data repositories such as figshare (figshare Team 2012; Kraker et al. 2015) or Zenodo (CERN 2013) will accept raw data and make it citable. Researchers should also take care to upload data into these repositories in formats that are likely to remain readable indefinitely into the future, such as comma-separated value (CSV) format for spreadsheets, or plain Unicode text or XML for textual data.

5. Integrating digital results into philosophy of science. Digital tools are powerful, and they have great potential for the philosophy of science. But digital results do not automatically translate into philosophical results. We therefore must consider how to integrate them with broader answers to philosophical questions.

5.1. Justifying digital results. A recurring problem with digital humanities results is how we can be certain that we have obtained genuine information supporting the conclusions we hope to draw. We can in part resolve this by proceeding in a “hypothesis-first” manner—forming clear hypotheses prior to performing analyses. All datasets are apt to contain chance patterns, and we should not be led astray into basing our conclusions upon them. And when we formulate hypotheses, we should attempt to be open to a range of possible conclusions, since approaching a statistical analysis with an answer to one's question already in mind tends to result in the cherry-picking of tools and methods to produce the desired result (Ioannidis 2005).

That said, it can be difficult, even having carefully formulated and tested a hypothesis, to be certain that one has in fact demonstrated it conclusively. Many analyses in the digital humanities lack statistical validation, and have only a history of successful use as evidence in their favor (see, e.g., the discussion of validation in Koppel, Schler, and Argamon 2009). Others require collaboration between experts in philosophy and statistics, computer science, or even electrical engineering (Miller et al. 2015). An important step in developing a digital research program, therefore, is to consider how to assess whether a project has succeeded or failed. This may involve validating the methods, producing standard kinds of analysis outputs, or, as we now consider, using digital research methods only as a first step in a broader program of philosophical research.
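As an illustration of what a modest validation step might look like, the following sketch runs a permutation test asking whether a term's per-document frequency really differs between two periods, or whether a difference of the observed size could easily arise by chance. The per-document rates are invented, and the test is offered as one example of a sanity check, not as a prescribed method.

```python
# A sketch of one simple statistical sanity check: a permutation test asking
# whether a term's per-document frequency really differs between two periods.
# The per-document rates are invented; this illustrates the kind of validation
# step discussed above rather than any prescribed method.
import random

early = [0.0, 1.2, 0.8, 0.4, 0.9, 0.3]   # term rate per 1,000 words, pre-1900
late  = [1.6, 2.1, 1.9, 0.7, 2.4, 1.8]   # term rate per 1,000 words, post-1900

observed = abs(sum(late) / len(late) - sum(early) / len(early))

random.seed(0)
pooled = early + late
extreme = 0
trials = 10_000
for _ in range(trials):
    random.shuffle(pooled)
    a, b = pooled[:len(early)], pooled[len(early):]
    if abs(sum(a) / len(a) - sum(b) / len(b)) >= observed:
        extreme += 1

print(f"observed difference = {observed:.2f}, p = {extreme / trials:.3f}")
```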
5.2. Digital humanities as research generator. Because digital tools give us significantly increased breadth and depth, we have found that they are useful not just as research tools in and of themselves, but as a compass, directing us toward questions that can then be answered by traditional methods in the philosophy of science. For example, Pence has recently combined existing work on an episode in the history of biology (Pence 2011) with digital tools (Ramsey and Pence 2016) to produce a more general hypothesis about debates over paradigm change, which is now ripe for a non-digital analysis (Pence in preparation). We anticipate that this workflow will, in fact, be quite common. As a digital tool shows us a provocative but not fully theorized result, this can provide us with an excellent working hypothesis, case study, or set of sample data for developing a philosophical thesis.

6. Conclusion. As scholars interested in studying the natural sciences, we cannot ignore the availability of digital data that might assist us in our research. It was once the case that the body of scientific literature was modest in size and represented only a narrow distillation of and reflection upon the world. Now the literature has become so massive, complex, and diverse that it constitutes a world unto itself, one poised for scientific and philosophical analysis. Adding to this all of the digital traces of work not heretofore published—archival images, notebooks, and so on—we are confronted with an overwhelming, but incredibly rich, world of information. Philosophers are beginning to see how this information can bear on questions in the philosophy of science, and can inspire new ones. But the profusion of sources and formats of data, on top of the assortment of available tools, some of which require considerable technical savvy, presents a barrier to the philosopher. In this essay, we have attempted to provide a window into digital philosophy of science, with both an overview of what is possible and some guidance in seeking data and analysis tools. We are excited about the prospects for future work in this field, and hope that this article will help to spread our excitement.

References

Baer, Harold, Jr. 2012. The Authors Guild, Inc., et al., v. HathiTrust, et al., 11 CV 6351 (HB). United States District Court, Southern District of New York.

Bastian, Mathieu, Sebastian Heymann, and Mathieu Jacomy. 2009. “Gephi: An Open Source Software for Exploring and Manipulating Networks.” In Third International AAAI Conference on Weblogs and Social Media, 361–62. AAAI Publications.

Beccaloni, George. 2008. “The Alfred Russel Wallace Correspondence Project.” http://wallaceletters.info.

Blondel, Vincent D., Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. 2008. “Fast Unfolding of Communities in Large Networks.” Journal of Statistical Mechanics: Theory and Experiment 2008 (10): P10008. doi:10.1088/1742-5468/2008/10/P10008.

Brants, Thorsten, and Alex Franz. 2006. The Google Web 1T 5-Gram Corpus Version 1.1 (LDC2006T13). Philadelphia, PA: Linguistic Data Consortium.

Buckner, Cameron, Mathias Niepert, and Colin Allen. 2007. “InPhO: The Indiana Philosophy Ontology.” APA Newsletter 7 (1): 26–28.
Burns, John, Alan Brenner, Keith Kiser, Michael Krot, Clare Llewellyn, and Ronald Snyder. 2009. “JSTOR - Data for Research.” In Research and Advanced Technology for Digital Libraries, 416–19. Lecture Notes in Computer Science 5714. Berlin: Springer.

CERN. 2013. Zenodo. Geneva. https://zenodo.org/.

Cohen, Aaron M., and William R. Hersh. 2005. “A Survey of Current Work in Biomedical Text Mining.” Briefings in Bioinformatics 6 (1): 57–71.

Cohen, Dan. 2008. “Introducing Omeka.” http://hdl.handle.net/1920/6089.

Dombrowski, Quinn. 2014. “What Ever Happened to Project Bamboo?” Literary and Linguistic Computing 29 (3): 326–39. doi:10.1093/llc/fqu026.

Elsevier. 2014. “Text and Data Mining.” https://www.elsevier.com/about/our-business/policies/text-and-data-mining.

figshare Team. 2012. Figshare. London. https://figshare.com/.

Gwinn, Nancy E., and Constance Rinaldo. 2009. “The Biodiversity Heritage Library: Sharing Biodiversity Literature with the World.” IFLA Journal 35 (1): 25–34. doi:10.1177/0340035208102032.

Hoover, David L. 2012. “Textual Analysis.” In Literary Studies in the Digital Age: An Evolving Anthology. Modern Language Association. http://dlsanthology.commons.mla.org/textual-analysis/.

Ide, Nancy, and Jean Véronis, eds. 1995. Text Encoding Initiative: Background and Context. Dordrecht: Kluwer.

Iliffe, Rob, and Scott Mandelbrote. 1998. “The Newton Project.” http://www.newtonproject.sussex.ac.uk/.

Ioannidis, John P. A. 2005. “Why Most Published Research Findings Are False.” PLoS Medicine 2 (8): e124. doi:10.1371/journal.pmed.0020124.

Jockers, Matthew. 2014. Text Analysis with R for Students of Literature. Cham, Switzerland: Springer.

Koppel, Moshe, Jonathan Schler, and Shlomo Argamon. 2009. “Computational Methods in Authorship Attribution.” Journal of the American Society for Information Science and Technology 60 (1): 9–26. doi:10.1002/asi.20961.

Kraker, Peter, Elisabeth Lex, Juan Gorraiz, Christian Gumpenberger, and Isabella Peters. 2015. “Research Data Explored II: The Anatomy and Reception of Figshare.” http://arxiv.org/abs/1503.01298.

Maienschein, Jane, Manfred D. Laubichler, Jessica Ranney, Kate MacCord, Steve Elliott, and Federica Turriziani Colonna. 2007. “The Embryo Project Encyclopedia.” https://embryo.asu.edu.

Manning, Christopher D., Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. “The Stanford CoreNLP Natural Language Processing Toolkit.” In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 55–60. Baltimore, MD: Association for Computational Linguistics.

Mendelsson, Dalia. 2003. “Einstein Archives Online.” http://www.alberteinstein.info.

Michel, Jean-Baptiste, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, Joseph P. Pickett, Dale Hoiberg, et al. 2011. “Quantitative Analysis of Culture Using Millions of Digitized Books.” Science 331 (6014): 176–82. doi:10.1126/science.1199644.

Michener, William K. 2015. “Ten Simple Rules for Creating a Good Data Management Plan.” PLoS Computational Biology 11 (10): e1004525. doi:10.1371/journal.pcbi.1004525.
Miller, B. A., M. S. Beard, M. D. Laubichler, and N. T. Bliss. 2015. “Temporal and Multi-Source Fusion for Detection of Innovation in Collaboration Networks.” In 2015 18th International Conference on Information Fusion (Fusion), 659–65.

Mills, Virginia. 2011. “The Joseph Dalton Hooker Project.” http://www.sussex.ac.uk/cweh/research/josephhooker.

Moretti, Franco. 2013. Distant Reading. London: Verso.

Morse-Gagné, Elise E. 2011. “Culturomics: Statistical Traps Muddy the Data.” Science 332: 35–36.

Munafò, Marcus R., Brian A. Nosek, Dorothy V. M. Bishop, Katherine S. Button, Christopher D. Chambers, Nathalie Percie du Sert, Uri Simonsohn, Eric-Jan Wagenmakers, Jennifer J. Ware, and John P. A. Ioannidis. 2017. “A Manifesto for Reproducible Science.” Nature Human Behaviour 1: 21. doi:10.1038/s41562-016-0021.

Murdock, Jaimie, Colin Allen, and Simon DeDeo. 2017. “Exploration and Exploitation of Victorian Science in Darwin's Reading Notebooks.” Cognition 159: 117–26. doi:10.1016/j.cognition.2016.11.012.

Pechenick, Eitan Adam, Christopher M. Danforth, and Peter Sheridan Dodds. 2015. “Characterizing the Google Books Corpus: Strong Limits to Inferences of Socio-Cultural and Linguistic Evolution.” PLOS ONE 10 (10): e0137041. doi:10.1371/journal.pone.0137041.

Pence, Charles H. In preparation. “How Not to Fight about Theory: The Debate between Biometry and Mendelism in Nature, 1890–1915.” In The Evolution of Science, edited by Andreas De Block and Grant Ramsey.

———. 2011. “‘Describing Our Whole Experience’: The Statistical Philosophies of W. F. R. Weldon and Karl Pearson.” Studies in History and Philosophy of Biological and Biomedical Sciences 42 (4): 475–85. doi:10.1016/j.shpsc.2011.07.011.

———. 2016. “RLetters: A Web-Based Application for Text Analysis of Journal Articles.” PLoS ONE 11 (1): e0146004. doi:10.1371/journal.pone.0146004.

Pouyllau, Stephane, Christine Blondel, Marie-Helene Wronecki, Bertrand Wolff, and Delphine Usal. 2005. “Ampère et l'histoire de l'électricité.” http://www.ampere.cnrs.fr.

R Core Team. 2017. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org.

Ramsey, Grant, and Charles H. Pence. 2016. “evoText: A New Tool for Analyzing the Biological Sciences.” Studies in History and Philosophy of Biological and Biomedical Sciences 57: 83–87. doi:10.1016/j.shpsc.2016.04.003.

Roberts, Richard J. 2001. “PubMed Central: The GenBank of the Published Literature.” Proceedings of the National Academy of Sciences 98 (2): 381–82. doi:10.1073/pnas.98.2.381.

Secord, James. 1974. “The Darwin Correspondence Project.” http://www.darwinproject.ac.uk.

Sinclair, Stéfan, and Geoffrey Rockwell. 2016. Voyant Tools. http://voyant-tools.org/.

Walter, S. A., Ph. Nabonnand, and L. Rollet. 2002. “Henri Poincaré Papers.” http://henripoincarepapers.univ-nantes.fr.

Wyhe, John van. 2002. “The Complete Work of Charles Darwin Online.” http://darwin-online.org.uk/.

York, Jeremy. 2009. “This Library Never Forgets: Preservation, Cooperation, and the Making of HathiTrust Digital Library.” Archiving Conference 2009 (1): 5–10.