Building a Machine Learning Pipeline

As a new machine learning (ML) practitioner, it is important to develop a mindful approach to the craft. By mindful, I mean possessing the ability to think clearly about each individual piece of the process and to understand how each piece fits into the larger whole. In my experience, there are many good tutorials available that will help you work with an individual tool, deploy a specific algorithm, or complete a single task. It is more difficult to find guidelines for building a holistic system that supports the entire ML workflow. My aim is to help you build just such a system, so that you are free to focus on inquiry and discovery rather than struggling with infrastructure and process. I write this as a software developer who has, at one time or another, been on the wrong end of every recommendation presented here, and who hopes to save you from similar headaches. Many of the examples and design choices are drawn from my experiences at the Digital Public Library of America, where I have worked alongside a very talented team of developers. This is by no means an exhaustive text, but rather a bit of pragmatic advice and a jumping-off point for further research, designed to give you a clearer idea of which questions to ask throughout your practice.

This article reviews the basic machine learning workflow, discussing design considerations along the way. It offers recommendations for data storage, guidelines on selecting and working with ML algorithms, and questions for tool selection. Finally, it describes some challenges with scaling up. My hope is that the insight presented here, combined with your good judgment, will empower you to get started with the actual practice of designing and executing a machine learning project.

The Machine Learning Pipeline

The metaphor of a pipeline is often used for a machine learning workflow. It captures the idea of data channeled through a series of sequential transformations. It is important to note, however, that each stage in the process will need to be repeated and honed throughout the course of your project. Therefore, don't think of yourself as building a single intelligent model, such as a decision tree or clustering algorithm. Instead, build a pipeline with pieces that can be swapped in and out as needed. Data flows through the pipeline and outputs a version of a decision tree, clustering algorithm, or other intelligent model. Throughout your process, you will tweak your pipeline, making many intelligent models, and eventually you will select the best model for your use case. To use another metaphor: don't build a car; build an assembly line for making cars.

While the final output of a machine learning workflow is some sort of intelligent model, many factors make repetition and iteration necessary. ML processes often involve subjective decisions, such as which data points to ignore or which configurations to select for your algorithm, and you will want to test different possibilities to see what works best. As you learn more about your dataset over the course of the project, you will go back and tweak parts of your process. You may discover biases in your data or algorithms that need to be addressed. If you are working collaboratively, you will be incorporating asynchronous feedback from members of your team. At some point, you may need to introduce new or revised data, or try a new tool or algorithm. It is also prudent to expect and plan for errors.
Human errors are inevitable, and hardware errors, such as network timeouts or memory overloads, are common. For all of these reasons, you will be well served by a pipeline composed of modular, repeatable steps, each with discrete and stable output.

A modular pipeline supports a batch processing workflow, in which whole datasets undergo a series of transformations. During each step of the process, a large amount of data (possibly the entire dataset) is transformed all at once and then incrementally stored. This can be contrasted with a real-time workflow, in which individual records are transformed instantaneously (e.g., a librarian updates a single record in a library catalog), or a streaming workflow, in which a continuous flow of data is pushed through an entire pipeline, often without incremental storage along the way (e.g., performing analysis on a continuous stream of new tweets). Batch processing is common in the research and development phase of an ML project, and may also be a good choice for a production system.

When designing any step in the batch processing pipeline, assume that at some point you will need to repeat it, either exactly as is or with modifications. Documenting your process lets you compare the outputs of different variations and communicate the ways in which your choices impact the final results. If you're writing code, version control software can help. If you're doing more manual data manipulations, such as editing data in spreadsheets, you will need an intentional system for documenting exactly which transformations you are applying to your data. It is generally preferable to automate processes wherever possible so that you can repeat them with ease and consistency.

A concrete example from my own experience demonstrates the importance of a pipeline that supports repetition. In my first ever ML project, I worked with a set of XML library data converted to CSV. I did most of my data cleanup by hand using spreadsheet software, and was not careful about preserving the formulas for each step of the process; instead, I deleted and wrote over many important intermediate computations, saving only the final results. This whole process took me countless hours, and when an updated dataset became available, there was no way to reproduce my painstaking cleanup process. I was stuck with outdated data, and my final output was doomed to grow more and more irrelevant as time wore on. Since then, I have always written repeatable scripts for all my data cleanup tasks.

Each decision you make will have an impact on the final results, so it is important to keep clear documentation and to verify your assumptions and hypotheses wherever possible. Sometimes there will be explicit tests to perform; at other times, you may just need to look at the data: make a quick visualization, perform a simple calculation, or glance through a sample of records. Be cognizant of the potential to introduce error or bias. For example, you could remove a field that you don't think is important, but that would, in fact, have a meaningful impact on the final result. All of these precautions will strengthen confidence in your final outcomes and make them intelligible to your collaborators and other audiences.

The pipeline for a machine learning project generally comprises five stages: data acquisition, data preparation, model training and testing, evaluation and analysis, and application of results.
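To make the shape of such a pipeline concrete, here is a minimal sketch in Python of a batch workflow built from modular, repeatable steps, each writing a discrete, timestamped output. The step names and file layout are my own invention for illustration:

import shutil
import time
from pathlib import Path

def run_step(step_name, transform, input_path):
    """Run one pipeline step, saving its output as a new, timestamped file."""
    stamp = time.strftime("%Y%m%d_%H%M%S")
    output_path = Path("myproject") / step_name / f"{step_name}_{stamp}.csv"
    output_path.parent.mkdir(parents=True, exist_ok=True)
    transform(input_path, output_path)
    return output_path  # each step's output becomes the next step's input

def clean(input_path, output_path):
    # placeholder transformation; real cleanup logic would go here
    shutil.copy(input_path, output_path)

# Because every step writes a discrete, stable output, any step can be
# rerun (or swapped out) without repeating the whole workflow.
raw_path = Path("myproject/acquisitions/raw.csv")  # hypothetical acquired data
clean_path = run_step("clean_datasets", clean, raw_path)

Each function can be replaced or rerun independently, which is exactly the modularity the pipeline metaphor calls for.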
Data Acquisition

The first step is to acquire the data that you will be using for your machine learning project. You may need to combine data from several different sources. There are many ways to acquire data, including downloading files, querying a database or API, or scraping web pages. Depending on the size of the source data and how it is made available, this can be a quick and simple step or the most challenging bottleneck in your pipeline.

However you get your initial data, it is generally a good idea to save a copy in the rawest possible form and treat that copy as immutable, at least during the initial phase of testing different algorithms or configurations. Having a raw, immutable copy of your initial dataset (or datasets) ensures that you can always go back to the beginning of your ML process and start over with exactly the same input. It will also save you from the possibility that the source data will change from beneath you, thereby compromising your ability to compare the outputs of different operations (for more on this, see the section on data storage). If possible, it's often worthwhile to learn how the original data was created, especially if you are getting data from multiple sources that differ in subtle ways.

Data Preparation

Data preparation involves cleaning data and transforming it into an appropriate format for subsequent machine learning tasks. This is often the part of the process that requires the most work, and you should expect to iterate over your data preparations many times, even after you've started training and testing models.

The first step of data preparation is to parse your acquired data and transform it into a common, usable schema. Acquired data often comes in file formats that are good for data sharing, such as XML, JSON, or CSV. You can parse these files into whatever schema makes sense for managing the various transformations you want to perform, but it helps to have a sense of where you are headed. Your eventual choice of data format will likely be dictated by your ML algorithms; likely candidates include multidimensional arrays, tensors, matrices, and dataframes. Look ahead to the specific functions in the specific libraries you plan to use, and see what type of input data is required. You don't have to use these same formats during your data preparations, though doing so can simplify the process.

Data cleanup and transformation is an art. Data is messy, and the messier the data, the harder it is to analyze and uncover underlying patterns. Yet we are only human, and perfect data is far beyond our reach. To strike a workable balance, focus on the cleanup tasks that you know (or strongly suspect) will have a significant impact on the final product. Cleanup and transformation operations include removing punctuation or stopwords from textual data, standardizing date and number formats, replacing missing or dummy values with a meaningful default, and excluding data that is known to be erroneous or atypical. You will select relevant data points, and you may need to represent them in a new way: a birth date becomes an age range; a place name becomes geo-coordinates; a text document becomes a word density vector. There are many possible normalizations to perform, depending on your dataset and which algorithm(s) you plan to use. It is also wise to ensure that there is a genuinely unique identifier for each record, even if you don't see an immediate need for one.
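As a small illustration of such cleanup operations, here is a sketch using pandas; the input file and column names are hypothetical:

import pandas as pd

df = pd.read_csv("myproject/acquisitions/records.csv")  # hypothetical input

# standardize date formats; unparseable values become NaT rather than errors
df["published"] = pd.to_datetime(df["published"], errors="coerce")

# replace missing page counts with a meaningful default
df["pages"] = df["pages"].fillna(0).astype(int)

# exclude records known to be erroneous (e.g., placeholder titles)
df = df[df["title"] != "UNKNOWN"]

# ensure a genuinely unique identifier for each record
df = df.reset_index(drop=True)
df["record_id"] = df.index

# write the result to a new file, leaving the acquired data untouched
df.to_csv("myproject/clean_datasets/records_clean.csv", index=False)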
The data preparation phase is also a good time to reflect on any biases that might be inherent in your data, and whether or not you can adjust for them. Even if you cannot, understanding how they might impact the ML process will help you conduct a more nuanced analysis and frame your final results. At the very least, you can record biases in the documentation so that future researchers will be aware of them and can react accordingly.

As you become more familiar with the data, you will likely hone your cleanup process and iterate through the steps multiple times. The more you can learn about the data, the better your preparations will be. During the data preparation phase, practitioners often make use of visualizations and query frameworks to picture their data holistically, identify patterns, and find errors or outliers. Some ML tools support these features out of the box, or are intentionally interoperable with external query and visualization tools. For a lightweight option, consider spreadsheet or notebook software. Depending on your use case, it may be worthwhile to put your data into a temporary database or search index so that you can make use of a more sophisticated query interface.

Model Training and Testing

During the training and testing phase, you will build multiple models and determine which one gives you the best results. One of the main ways you will tune your model is by trying multiple combinations of hyperparameters. A hyperparameter is a value that you set before you run the learning process and that affects how the learning process works. Hyperparameters control things like the number of learning cycles an algorithm will iterate through, the number of layers in a neural net, the characteristics of a cluster, or the number of decision trees in a forest. Often, you will also want to circle back to your data preparation steps to try different configurations, apply new enhancements, or address new problems and particularities that you've uncovered.

The process is deceptively simple: try out different configurations until you get a good result. The challenge comes when you try to define what constitutes a good (or good-enough) result. Measuring the quality of a machine learning model takes finesse. Start by asking: what would you expect to see if the model learned perfectly? Equally important, what would you expect to see if the model didn't learn anything at all? You can often use randomness as a stand-in for no learning, e.g., "if a result was selected at random, the probability of the desired outcome would be x." These two questions will help you set benchmarks at both extremes of the realm of possible outcomes. Perfection is elusive, and the return on investment dwindles after a while, so be prepared to stop training once you've arrived at an acceptably good model.

In a supervised learning problem, the dataset is split into training and testing datasets. The algorithm uses the training data to "learn" a set of rules that it can subsequently apply to new, unseen data to predict the outcome. The testing dataset (also called a validation dataset) is used to test how well the model performs. Often, a third dataset is held out as well, reserved for final testing after the model has been trained; this third dataset provides an additional bulwark against bias and overfitting. Results are typically evaluated based on some statistical measurement that is directly relevant to your research question. In a classification problem, you might optimize for recall or precision.
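As a minimal sketch of this train/test pattern, assuming scikit-learn and one of its bundled toy datasets:

from sklearn.datasets import load_breast_cancer
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# hold out a portion of the labeled data for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = DecisionTreeClassifier(max_depth=5)  # max_depth is a hyperparameter
model.fit(X_train, y_train)

# evaluate on data the model has never seen
predictions = model.predict(X_test)
print("precision:", precision_score(y_test, predictions))
print("recall:", recall_score(y_test, predictions))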
In a regression problem, you can use a formula such as the root-mean-square deviation (the square root of the average squared difference between predicted and actual values) to measure how well the regression line matches the actual data points. How you choose to optimize your model will depend on your specific context and priorities.

Testing an unsupervised model is not as straightforward, since there is no preconceived notion of correct and incorrect categorization. You can sometimes rely on a known pattern in the underlying dataset that you would reasonably expect to be reflected in a successful model. There may also be characteristics of the final model that indicate success. For example, if you are working with a clustering algorithm, models with dense, well-defined clusters are probably better than sparse clusters with vague boundaries. In unsupervised learning, you may want to hold back some portion of your data to perform an independent validation of your results, or you may use the entire dataset to build the model; it depends on what type of testing you want to perform.

Application of Results

As the final step of your workflow, you will use your intelligent model to perform some task. Perhaps you will use it for scholarly analysis of a dataset, or perhaps you will integrate it into a software project. If it is the former, consider how to export any final data and preserve the artifacts of your project. If it is the latter, consider how the model, its outputs, and its continued maintenance will fit into existing systems and workflows. Planning for interoperability may influence decisions from tool selection to data formats and storage.

Immutable Data Storage

Immutable data storage can benefit the batch-processing ML pipeline, especially during the initial research and development phase. This type of data storage supports iteration and allows you to compare the results of many different experiments. Treating data as immutable means that after each significant change or set of changes to your data, you save a new snapshot of the dataset that is never edited or changed. It also allows you to be flexible and adaptive with your data model. Immutable data storage has become a popular choice for data-intensive or "big data" applications as a way to easily assemble large quantities of data, often from multiple sources, without having to spend time upfront crafting a strict data model. You may have heard the term "data lake" used to refer to such large, unstructured collections of data. This can be contrasted with a "data warehouse," which usually indicates a highly structured, centralized repository such as a relational database.

To demonstrate how immutability supports iteration and experimentation, consider the following scenario: you start with an input file my_data.csv, and then perform some cleanup operation over the data, such as converting all measurements in miles to kilometers, rounded to the nearest whole number. If you were treating your data as mutable, you might overwrite the original contents of my_data.csv with the transformed values. The problem with this approach arises when you want to test some alteration of your cleanup operation. Say, for example, you wanted to round all your conversions to the nearest tenth instead. Since you no longer have your original data, you would have to start the entire ML process from the top. If you instead treated your data as immutable, you would keep my_data.csv in its original state and save the output of your cleanup operation in a new file, say my_clean_data.csv. That way, you could return to my_data.csv as many times as you wished, try different operations on this data, and easily compare the results of those operations, knowing the source data was exactly the same for each one.
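A sketch of this pattern in Python (the column name "distance" is hypothetical); the rounding precision is a parameter, so each variation writes its own output file and the source file is never touched:

import csv

def convert_miles_to_km(input_path, output_path, digits=0):
    """Read the original file, convert miles to kilometers, write a NEW file."""
    with open(input_path, newline="") as src, open(output_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            row["distance"] = round(float(row["distance"]) * 1.60934, digits)
            writer.writerow(row)

# my_data.csv is never modified; each variation gets its own output
convert_miles_to_km("my_data.csv", "my_clean_data.csv", digits=0)
convert_miles_to_km("my_data.csv", "my_clean_data_tenths.csv", digits=1)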
Think of each immutable dataset as a place in your process that you can safely rewind to any time you want to try something new or correct for some error or failure.

To illustrate the benefits of a flexible data model, consider a mutable data store, such as a relational database. Before you put any data into the database, you would first need to design a system of tables with set fields and datatypes, and the relationships between those tables. This can feel like putting the cart before the horse, especially if you are starting with a dataset with which you are not yet intimately familiar and you want the ability to experiment with different algorithms, all of which might require slightly different transformations of the original dataset. Revisiting the example in the previous paragraph, you might initially have defined your distance datatype as an integer (when you were rounding to the nearest whole number), and would later have to change it to a floating point number (when you were rounding to the nearest tenth). Making this change would mean altering the database schema and migrating all of the existing data to the new type, which is a nontrivial task, especially if you later decide to revert to the original type. By contrast, if you were working with immutable CSV files, it would be much easier to write out two files, one with each datatype, and keep whichever one ultimately proved most effective.

Throughout your ML process, you can create several incremental datasets that are essentially read-only. There is no one correct data storage format, but ideally you would use something simple and space-efficient with the capacity for interoperability with different tools, such as plain text flat files without extraneous markup (TXT or CSV) or a compact columnar format like Parquet. Even if your data is ultimately destined for a different kind of datastore, such as a relational database or triplestore, consider using simple, immutable storage as an intermediary to facilitate iteration and experimentation. If you're concerned about overwhelming your local drive, cloud storage is a good option, especially if you can read and write directly from your programs or software services.

One final benefit of immutable storage relates to scale. Batch processing workflows over immutable data are also design principles of distributed data processing frameworks, such as MapReduce and Spark. Therefore, if you need to scale your ML project using distributed processing, the integration will be more seamless (for more, see the section on scaling up).

Organizing Immutable Data

One of the challenges in working with immutable data stores is keeping everything organized, especially with multiple users. A little planning can save you from losing track of your experiments and results. A well-ordered directory structure, informative and consistent file names, liberal use of timestamps, and disciplined note-taking are simple but effective strategies. For example, say you were acquiring MARCXML records from the university library's API feed, parsing out subject terms, and building a clustering algorithm around these terms. Let us explore one possible way that you could organize your data outputs through each step of the machine learning pipeline.
To enforce a naming convention, create a helper method that generates the output path for each run of a particular data process. This output path includes the date and timestamp of the run; that way, you won't have to think about naming each individual file, and you can avoid ending up with a mess of files called my_clean_data.csv, my_cleaner_data.csv, my_final_cleanest_data.csv, and so on. Your file path for the acquired data might follow the format:

myproject/acquisitions/marc_yyyymmdd_hhmmss.xml

In this case, yyyymmdd represents the date and hhmmss represents the timestamp. Your file path for prepared and cleaned data might be:

myproject/clean_datasets/subjects_yyyymmdd_hhmmss.csv

Finally, each clustering model you build could be saved using the file path pattern:

myproject/models/cluster_yyyymmdd_hhmmss

Following this general pattern, you can organize all of the outputs for your entire project. Using the date and timestamp in the file name also enables easy sorting and retrieval of the most recent output.

For each data output, you will want to maintain a record of the exact input, any special attributes of the process (e.g., "this time I rounded decimals to the nearest hundredth"), and metrics that will help you determine the success or failure of the process. If you can generate this information automatically for each process, all the better for ensuring an accurate record. One strategy is to include a second helper method in your program that generates and writes out a companion file for each data output. The companion file contains information that will help you evaluate results, detect errors, perform optimizations, and differentiate between any two data outputs. In the example project, you could accompany the acquisition output with a text file detailing the exact API call used to fetch the data, the number of records acquired, and the runtime for the process. Keeping companion files as close as possible to their outputs helps prevent accidental separation, so save the file to:

myproject/acquisitions/marc_yyyymmdd_hhmmss.txt

In this case, the date and timestamp should exactly match those of the companion XML file. When running processes that train and test models, you can include information in your companion file about hyperparameters and whatever metrics you are using to evaluate the quality of the model. In our example, the companion file for each cluster model might contain the file path of the cleaned input data, the number of clusters, and a measure of cluster variance. A sketch of both helper methods appears below.
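Here is one way those two helpers might look in Python. This is a hypothetical sketch; the directory layout and companion-file contents follow the example above:

import json
import time
from pathlib import Path

def output_path(step, name, extension=""):
    """Generate a timestamped path, e.g. myproject/models/cluster_20240131_154500."""
    stamp = time.strftime("%Y%m%d_%H%M%S")
    path = Path("myproject") / step / f"{name}_{stamp}{extension}"
    path.parent.mkdir(parents=True, exist_ok=True)
    return path

def write_companion(data_path, info):
    """Write a companion file next to a data output, sharing its name and timestamp."""
    companion = data_path.with_suffix(".txt")
    companion.write_text(json.dumps(info, indent=2))

# hypothetical usage for one clustering run
model_path = output_path("models", "cluster")
write_companion(model_path, {
    "input": "myproject/clean_datasets/subjects_20240130_090000.csv",
    "n_clusters": 25,            # a hyperparameter for this run
    "cluster_variance": 0.42,    # a metric used to judge the model
})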
Algorithm Selection

As you begin ingesting and preparing data, you'll want to explore possible machine learning algorithms to apply to your dataset. Choose an algorithm that fits your research question and your data. If you're not sure which algorithm to choose and are not constrained by time, experiment with several different options and see which one yields the best results. Start by determining what general type of learning algorithm you need, and proceed from there to research and select one that specifically addresses your research question.

In supervised learning, you train a model to predict an output condition based on given input conditions; for example, predicting whether or not a patient has some disease based on their symptoms, or the topic of a news article based on keywords in the text. In order for supervised learning to work, you need labeled training data, meaning data in which the outcome is already known. Examples include records of symptoms in patients who were known to have the disease (or not), or news articles that have already been assigned topics. Classification and regression are both types of supervised learning. In a classification problem, you are predicting among a discrete number of possible outcomes. For example, "based on what I know about this book, will it make the New York Times Best Seller list?" is a classification problem because there are two discrete outcomes: yes or no. Classification algorithms include naive Bayes, decision trees, and k-nearest neighbors. Regression problems try to predict an outcome from a continuum of possibilities, e.g., "based on what I know about this book, what will its retail price be?" Regression algorithms include linear regression and regression trees.

In unsupervised learning, the ML algorithm discovers a new pattern. The training data is unlabeled, meaning there is no indication at the outset of how the data should be organized. A common example is clustering, in which the algorithm groups items together based on features it finds mathematically significant. Perhaps you have a collection of news articles (with no existing topic labels), and you want to discover common themes or topics that appear throughout the collection. The algorithm will not tell you what the themes or topics are, but it will show which articles group together; it is then up to the researcher to work out the common thread.

In addition to serving your research question, your algorithm should also be a good fit for your data. Specific considerations will vary for each dataset and algorithm, so make sure you know the strengths and weaknesses of your algorithm and how they relate to the unique qualities of your dataset. For example, algorithms differ in their abilities to handle datasets with a very large number of features, handle datasets with high variance, efficiently process very large datasets, and glean meaningful intelligence from very small datasets. Is it important that your algorithm be easy to explain? Some algorithms, such as neural nets, function as black boxes, and it is difficult to decipher how they arrive at their decisions; others, such as decision trees, are easy to understand. Can you prepare your data with a reasonable amount of pre-processing? Can you find examples of success (or failure) from people using similar datasets with the same algorithm? Asking these sorts of questions will help you choose an algorithm that works well for your data, and will also inform how you prepare your data for optimal use. Finally, consider whether you are constrained by time, hardware, or available toolsets. Different algorithms require different amounts of time and memory to train and/or execute, and different ML tools offer implementations of different algorithms.

Working with Machine Learning Algorithms

New technologies and software advances have made machine learning more accessible to "lay" users, by which I mean those of us without advanced degrees in mathematics or data science. Yet the algorithms are complex, and you need at least an intuitive understanding of how they work if you hope to implement them correctly. I use the following three questions as a guide for understanding an algorithm. Keep in mind that any one project will likely make use of several complex algorithms along the way. These questions help ensure that I have the information I truly need, without getting bogged down in details best left to mathematicians.
· What do the inputs and outputs of the algorithm mean? There are two parts to answering this question. The first is the data structure, e.g., "this is a vector of integers." The second is knowing what this data describes, e.g., "each vector represents a document, and each integer specifies the number of times a particular word appears in that document." You also need to be aware of specific implementation details: perhaps the input needs to be normalized in some way, or perhaps the output has been smoothed (a technique that compensates for noisy data or outliers). This may seem straightforward, but it can be a lot to keep track of once you've gone through several layers of processing and abstraction.

· What effect do different hyperparameters have on the algorithm? Part of the machine learning process is tuning hyperparameters, or trying out multiple configurations until you get satisfying results. Part of the frustration is that you can't try every possible configuration, so you have to do some intelligent guesswork. Twiddling hyperparameters can feel enigmatic and unintuitive, since it can be difficult to predict their impact on the final outcome. The better you understand hyperparameters and their roles in the ML process, the more likely you are to make reasonable guesses and adjustments, though you should always be prepared for a surprise.

· Can you explain how this algorithm works to a lay person, and why it's beneficial to the project? There are two benefits to articulating a response to this question. First, it ensures that you really understand the algorithm yourself. Second, you will likely be called on to give this explanation to co-collaborators and other stakeholders. A good explanation will build excitement around the project, while a befuddling one could sow doubt or disinterest. It can be difficult to strike a balance between general summary and technical equations, since your stakeholders will likely include people with diverse backgrounds, so do your best and look for opportunities for people with different expertise to help refine your team's understanding of the algorithm.

Learning more about the underlying math can help you make better, more nuanced decisions about how to deploy an algorithm, and is fascinating in its own right, but in most cases I have found that the above three questions provide a solid foundation for machine learning research.

Tool Selection

Tool selection is an important part of your process and should be approached thoughtfully. A good approach is to articulate and prioritize the needs of your team, and make selections that meet those needs. I've listed some possible questions for consideration below, many of which you will recognize as general concerns for any tool selection process.

· What sorts of features and interfaces do the tools offer? If you require a specific algorithm, the ability to make data visualizations, or query interfaces, you can find tools to meet these specific needs.

· How well do the tools interoperate with one another, or with other parts of your existing systems? One of the advantages of a well-designed pipeline is that it enables you to swap out software components if the need arises. For example, if your data is in a format that is interoperable with many systems, you are not tied down to any specific tool.

· How do the tools align with the skill sets and comfort levels of your team? For example, consider what coding languages your collaborators know, and whether or not they have the capacity to learn a new one. If you have someone who is already a whiz with a preferred spreadsheet program, see if you can export data into a compatible file format.
· Are the tools stable, well documented, and well supported? Machine learning is a fast-changing field, with new algorithms, services, and software features being developed all the time. Something new and exciting that hasn't yet been road-tested may not be worth the risk if there is a more dependable alternative. Furthermore, there tends to be more scholarship, more documented use cases, and more tutorials for older, widely adopted tools.

· Are you concerned about speed and scale? Don't get bogged down with these considerations if you're just trying to get a working pilot off the ground, but it can help to at least be aware of how problems are likely to manifest as your volume of data increases, or as you integrate into time-sensitive workflows.

You and your team can work through these questions and articulate additional requirements relevant to your specific context.

Scaling Up

Scaling up in machine learning generally means that you need to work with a larger volume of data, or that you need processes to execute faster. Recent advances in hardware and software have made the execution of complex computations magnitudes faster and more efficient than they were even a decade ago, and you can often achieve quite a bit working on a personal computer. Yet time is valuable, and it can be difficult to iterate and experiment effectively when individual processes take too long to execute.

There are many ML software packages that can help you make efficient use of whatever hardware you have, including your personal computer. Some examples at the time of writing are Apache Spark, TensorFlow, scikit-learn, and Microsoft Cognitive Toolkit, each with its own strengths and applications. In addition to providing libraries for building and testing models, these software packages optimize algorithmic performance, memory resources, data throughput, and/or parallel computation. They can make a remarkable difference in both processing speed and the amount of data you can comfortably handle. There are also services that allow you to submit executable code and data to the cloud for processing, such as Google AI Platform.

Managing your own hardware upgrades is not without challenges. You may be lucky enough to have access to a high-powered computer capable of accelerated processing. A common example is a computer with GPUs (graphics processing units), which break complex processes into many small tasks and run them in parallel. However, these powerful machines can be prohibitively expensive. Another scaling technique is distributed or cluster computing, in which complex processes are spread across multiple computers, often in the cloud. A cloud cluster can bring significant cost savings, but managing one requires specialized knowledge, and the learning curve can be rather steep. It is also important to note that different algorithms require different scaling techniques; some clustering algorithms, for example, scale well with GPUs but not with distributed computing.

Even with the right hardware and software, scaling up can be a tricky business. ML processes tend to have dramatic spikes in memory or network use, which can tax your systems. Not all ML algorithms scale well, and memory use or execution time can grow exponentially as more data is added. Sometimes you have to add additional, complexity-reducing steps to your pipeline to handle data at scale.
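One common complexity-reducing step is dimensionality reduction, which shrinks the number of features each record carries before the data reaches a memory-hungry algorithm. A minimal sketch with scikit-learn (the tiny document list and the choice of two output dimensions are purely illustrative):

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["first example text", "second example text", "third example text"]

# a bag-of-words representation can easily have tens of thousands of features
tfidf = TfidfVectorizer().fit_transform(documents)

# project onto far fewer dimensions before clustering or training
svd = TruncatedSVD(n_components=2)
reduced = svd.fit_transform(tfidf)
print(reduced.shape)  # (3, 2): three documents, two features each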
Some of the more common machine learning languages, such as Python and R, execute relatively slowly, putting the onus on developers to optimize operations for efficiency. In anticipation of these and other challenges, it is often a good idea to start with a scaled-down pilot or proof of concept, and not to underestimate the time and resources necessary to scale up from there.

Conclusion

New technologies make it possible for more researchers and developers to leverage the power of machine learning. Building an effective machine learning system means supporting the entire workflow, from data acquisition to final analysis. Practitioners must be mindful of how each implementation decision and subjective choice, from the way you structure and store your data, to the algorithms you use, to the ways you validate your results, will impact the efficiency of operations and the quality of learned intelligence. This article has offered some practical guidelines for building ML systems with modular, repeatable processes and intelligible, verifiable results. There are many resources available for further research, both online and in your libraries, and I encourage you to consult with subject specialists, data scientists, mathematicians, programmers, and data engineers. May your data be clean, your computations efficient, and your results profound.

Further Reading

I include here a few suggestions for further reading on key topics. I have also found that in the fast-changing world of machine learning technologies, blogs, internet communities, and online classes can be a great source of information that is current, introductory, and/or geared toward practitioners.

Tan, Pang-Ning, Michael Steinbach, and Vipin Kumar. 2005. Introduction to Data Mining. Boston: Pearson Addison Wesley. See the early chapter on data preparation strategies; later chapters introduce common classification and clustering algorithms.

Marz, Nathan, and James Warren. 2015. Big Data: Principles and Best Practices of Scalable Real-Time Data Systems. Shelter Island: Manning. The chapters on the batch layer discuss immutable storage in depth.

Kleppmann, Martin. 2017. Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems. Boston: O'Reilly. The chapter on batch processing is especially relevant if you are interested in scaling up.

Taking a Leap Forward: Machine Learning for New Limits

Introduction

Today, machines can analyze vast amounts of data and increasingly produce accurate results through the repetition of mathematical or computational procedures. With the increasing computing capabilities available to us, artificial intelligence (AI) and machine learning applications have made a leap forward. These rapid technological changes are inevitably influencing our interpretation of what AI can do and how it can affect people's lives. Machine learning models that are developed on the basis of statistical patterns in observed data provide new opportunities to augment our knowledge of texts, photographs, and other types of data in support of research and education. However, "the viability of machine learning and artificial intelligence is predicated on the representativeness and quality of the data that they are trained on," as Thomas Padilla, Interim Head, Knowledge Production at the University of Nevada, Las Vegas, asserts (2019).
With that in mind, these technologies and methodologies could help augment the capacity of archives and libraries to leverage their creation-value and minimize their institutional memory loss while enhancing the interdisciplinary approach to research and scholarship. In this essay, I begin by placing artificial intelligence and machine learning in context, then proceed by discussing why AI matters for archives and libraries and describing the techniques used in a pilot automation project from the perspective of digital curation at Oklahoma State University Archives. Lastly, I end by challenging other areas in the library and adjacent fields to join in the dialogue, to develop a machine learning solution more broadly, and to explore opportunities that we can reap by reaching out to others who share a similar interest in connecting people to build knowledge.

Artificial Intelligence and Machine Learning: Why Do They Matter?

Artificial intelligence has seen a resurgence of interest in the recent past: in the news, in the literature, in academic libraries and archives, and in other fields, such as medical imaging, inspection of steel corrosion, and more. John McCarthy, the American computer scientist, defined artificial intelligence as "the science and engineering of making intelligent machines, especially intelligent computer programs. It is related to the similar task of using computers to understand human intelligence, but AI does not have to confine itself to methods that are biologically observable" (2007). This definition has since been extended to reflect a deeper understanding of AI today and what systems run by computers are now able to do. Dr. Carmel Kent notes that "AI feels like a moving target," as we still need to learn how it affects our lives. Within the last few decades, the remarkable jump in computing capabilities has been quite transformative, in that machines are increasingly able to ingest and analyze large amounts of ever more complex data to automatically produce models that can deliver faster and more accurate results (see SAS n.d. and Brennan 2019). Their "power lies in the fact that machines can recognize patterns efficiently and routinely, at a scale and speed that humans cannot approach," writes Catherine Nicole Coleman, digital research architect for Stanford University (2017).

A Paradigm Shift for Archives and Libraries

Within the context of university archives, this paradigm shift has been transforming the way we interpret archival data. Artificial intelligence, and specifically machine learning as a subfield of AI, has direct applications through pattern recognition techniques that predict labeling values for unlabeled data. As the software analytics company SAS argues, it is "the iterative aspect of machine learning [that] is important because as models are exposed to new data, they are able to independently adapt. They learn from previous computations to produce reliable, repeatable decisions and results" (n.d.). Case in point: how can we use machine learning to train machines and apply facial and text recognition techniques to interpret the sheer number of photographs and texts, in either analog or born-digital formats, held in archives and libraries?
Combining automatic processes to assist in supporting inventory management with a focus on descriptive metadata, a machine learning solution could help alleviate time-consuming and relatively expensive metadata tagging tasks, and thus scale the process more effectively using relatively small amounts of data. However, the traditional approach to machine learning would still require a significant time commitment from archivists and curators to identify the essential features that make patterns usable for data training. By contrast, deep learning algorithms are able "to learn high-level features from data in an incremental manner. This eliminates the need of domain expertise and hard core feature extraction" (Mahapatra 2018). Deep learning has regained popularity since the mid-2000s due to the "fast development of high-performance parallel computing systems, such as GPU clusters" (Zhao et al. 2019). Deep learning neural networks are more effective at feature detection, as they are able to solve complex problems such as image classification with greater accuracy when trained with large datasets. The challenge is whether archives and libraries can afford to take advantage of greater computing capabilities to develop sophisticated techniques and extract complex patterns from thousands of digital works. The sheer size of library and archive datasets, such as university photograph collections, presents challenges to properly using these new, sophisticated techniques. As Jason Griffey writes, "AI is only as good as its training data and the weighting that is given to the system as it learns to make decisions. If that data is biased, contains bad examples of decision-making, or is simply collected in such a way that it isn't representative of the entirety of the problem set[, …] that system is going to produce broken, biased, and bad outputs" (2019). How can cultural heritage institutions ensure that their machine learning algorithms avoid such bad outputs?

Implications of Machine Learning

Machine learning has the potential to enrich the value of digital collections by building upon experts' knowledge. It can also help identify resources that archivists and curators may never have time for, and at the same time correct assumptions about heritage materials. It can generate the added value necessary to support the mission of archives and libraries in providing a public good. Annie Schweikert states that "artificial intelligence and machine learning tools are considered by many to be the next step in streamlining workflows and easing workloads." For images, how can archives build a data-labeling pipeline into their digital curation workflow that enables machine learning across collections? With the objective being to augment knowledge and create value, how can archives and libraries "bring the skills and knowledge of library staff, scholars, and students together to design an intelligent information system" (Coleman 2017)? Despite the opportunities to augment knowledge through facial recognition, models generated by machine learning algorithms should be scrutinized so long as it is unclear how choices are made in feature selection. Machine learning "has the potential to reveal things ... that we did not know and did not want to know," as Charlie Harper asserts (2018).
It can also have direct ethical implications, leading to biased interpretations or serving nefarious motives.

Machine Learning and Deep Learning on the Grounds of Generating Value

In the fall of , Oklahoma State University Archives began to look more closely at a machine learning solution to facilitate metadata creation in support of curation, preservation, and discovery. Conceptually, we envisioned boosting the curation of digital assets, setting up policies to prioritize digital preservation and access for education and research, and enhancing the long-term value of those data. In this section, I describe the parameters of the automation and machine learning used to support inventory work, and our experiments with face recognition models to add contextualization to digital objects. From a digital curation perspective, the objective is to explore ways to add value to digital objects about which little information is known, if any, in order to increase the visibility of archival collections.

What Started This Pilot Project?

Before proceeding, we needed to gain a deeper understanding of the large quantity of files held in the archives, both the data and the metadata. The challenge was that, with so many files in so many formats, files had become duplicated, renamed, doctored, and scattered throughout directories to accommodate different types of projects over time, making them hard to sift through due to sparse metadata tags that may have differed from one system to another. In short, how could we justify the value of these digital assets for curatorial purposes? How much could we rely on the established institutional memory within the archives? Lastly, could machine learning or deep learning applications help us build a greater capacity to augment knowledge? In order to optimize resources and systematically make sense of the data, we needed to determine whether machine learning could generate value, which in turn could help us more tightly integrate our digital initiatives with machine learning applications. Such applications would only be as effective as the data used to train them, and the value we could derive from them.

Methodology and Plan of Action

First, we recruited two student interns to create a series of processes that would automatically populate a comprehensive inventory of all digital collections, including finding duplicate files by hashing (a sketch of this step follows below). We generated the inventory by developing a process that could be universally adapted to all library digital collections, setting up a universal list of works and their associated metadata, with a focus on descriptive metadata, which in turn could support digital curation and discovery of archival materials, both digitized analog materials and born-digital materials. We developed a universal policy for digital archival collections that would allow us to incorporate all forms of metadata into a single format, to remedy inconsistencies in existing metadata. This first phase was critical in the sense that it would condition the cleansing and organizing of the data. We could then proceed with the design of a face recognition database, with the intent of tracing individuals featured in the inventoried works of the archives, to the extent that our data were accurate. We utilized the Oklahoma State University yearbook collections and other digital collections as authoritative references for other works, for the purpose of contextualization, to augment our data capacity.
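As an illustration of the duplicate-detection step, here is a minimal sketch that flags files with identical contents by hashing; the directory path is hypothetical:

import hashlib
from pathlib import Path

def file_hash(path):
    """Hash a file's bytes; identical files produce identical digests."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

seen = {}
for path in Path("digital_collections").rglob("*"):  # hypothetical root folder
    if path.is_file():
        digest = file_hash(path)
        if digest in seen:
            # same digest means the file contents are byte-for-byte identical,
            # even if the file has been renamed or moved
            print(f"duplicate: {path} matches {seen[digest]}")
        else:
            seen[digest] = path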
Second, we implemented our plan: we worked closely with the library systems team within a Windows-based environment; decided on graphics processing unit (GPU) performance and cost, taking into consideration that training neural networks requires computing power; determined storage needs; and fulfilled other logistical requirements to begin the step-by-step process of establishing a pattern recognition database. We designed the database on known objects before introducing and comparing new data to contextualize each entry. With this framework, we would be able to add general metadata tags to a uniform storage system using deep learning technology.

Third, we applied Tesseract OCR to a series of archival image-text combinations from the archives to extract printed text from those images and photographs. "Tesseract adds a new neural net (LSTM) [long short-term memory] based OCR engine which is focused on line recognition," while also recognizing character patterns ("Tesseract" n.d.). We were able to obtain successful output for the most part, with the exception of a few characters that were hard to detect due to pixelation and font types.

Fourth, we looked into object identifiers, keeping in mind that "when there are scarce or insufficient labeled data, pre-training is usually conducted" (Zhao et al. 2019). Working through the inventory process, we knew that we would also need to label more data to grow our capacity. We chose to use a smaller ResNet backbone for keras-retinanet, frequently used as a starting point for transfer learning. Keras is a deep learning network API (application programming interface) that supports multiple back-end neural network computation engines (Heller 2019), and RetinaNet is a single, unified network consisting of a backbone network and two task-specific subnetworks used for object detection (Karaka). We proceeded by first loading a large amount of pre-tagged information from pre-existing datasets into this neural network. We experimented with three open source datasets: PASCAL VOC, a set of labeled object categories; Open Images Database (OID), a very large dataset annotated with image-level labels and object bounding boxes; and Microsoft COCO, a large-scale object detection, segmentation, and captioning dataset. With a few faces from the OID dataset, we could compare and see whether a face had previously been recognized. Expanding our process to data known from the archives collection, we determined facial areas and, more specifically, assigned bounding box regressions to feed into the face recognition API, based on Keras code written in Python. The face recognition API is available via GitHub (see https://github.com/ageitgey/face_recognition). It uses a method called histogram of oriented gradients (HOG) encoding, which makes the actual face recognition process much easier to implement for individuals, because the encodings are fairly unique for every person, as opposed to encoding images and trying to blindly figure out which parts are faces based on our label boxes. The figure below illustrates our test, confirming from two very different photographs the presence of Jessie Thatcher Bost, the first female graduate of Oklahoma A&M College.

Figure: Face recognition test

Ren et al. state that it is important to construct a deep and convolutional per-region object classifier to obtain good accuracy using ResNets (2017).
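For reference, the heart of that face-matching step with the face_recognition library looks roughly like the following sketch; the image file names are hypothetical, and it assumes at least one face is detected in the known portrait:

import face_recognition

# encode a face from a photograph of a known, identified person
known_image = face_recognition.load_image_file("yearbook_portrait.jpg")
known_encoding = face_recognition.face_encodings(known_image)[0]

# encode every face detected in an unidentified archival photograph
unknown_image = face_recognition.load_image_file("group_photo.jpg")
unknown_encodings = face_recognition.face_encodings(unknown_image)

# compare each detected face against the known encoding
for encoding in unknown_encodings:
    match = face_recognition.compare_faces([known_encoding], encoding)[0]
    print("match found" if match else "no match")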
Going forward, we could use the tool "as is" despite its limited accuracy, or instead try to establish large datasets of faces by training on our own collections in hopes of improving accuracy. We proceeded by utilizing the Oklahoma State University yearbook collections, comparing image sets with other photographs that may include these faces. We look forward to automating more of these processes.

A Conclusive First Experiment

We can say that our first experiment developing a machine learning solution on a known set of archival data resulted in positive output, while recognizing that it is still a work in progress. For example, the model we ran for the pilot is not natively supported on Windows, which hindered team collaboration. In light of these challenges, we think that our experiment was a step in the right direction of adding value to collections by bringing in a new layer of discovery for hidden or unidentified content. Above all, this type of work relies greatly on transparency. As Schweikert notes, "transparency is not a perk, but a key to the responsible adoption of machine learning solutions." More broadly, issues of transparency and ethics in machine learning are important concerns in the collecting and handling of data. In order to boost adoption and get more buy-in for this new type of discovery layer, our team intentionally shared information about the process to help add credibility to the work and foster a more collaborative environment within the library. The team also developed a graphical user interface (GUI) to search the inventory within the archives and ultimately grow the solution beyond the department.

Challenges and Opportunities of Machine Learning

Challenges

In a National Library of Medicine blog post, Patti Brennan points out "that AI applications are only as good as the data upon which they are trained and built" (2019), and having these data ready for analysis is a must in order to yield accurate results. The scaling of input and output variables also plays an important role in improving performance when using neural network models. Jerome Pesenti, head of AI at Facebook, states that "when you scale deep learning, it tends to behave better and to be able to solve a broader task in a better way" (2019). Clifford Lynch affirms that "machine learning applications could substantially help archives make their collections more discoverable to the public, to the extent that memory organizations can develop the skills and workflows to apply them." This raises the question of whether archives can also afford to create the large amount of data from print heritage materials, or refine their born-digital collections, in order to build the capacity to sustain the use of deep learning applications. Granted, while the increasing volume of born-digital materials could help leverage this data capacity somewhat, it does not change the fact that all data will need to be ready prior to using deep learning. Since machine learning is only as good as the value it adds, archives and libraries will need to think in terms of optimization as well, deciding when value-generated output is justified compared to the cost of computing infrastructure and skilled labor. Besides value, operations such as storing and ensuring access to these data are just as important considerations in making machine learning a feasible endeavor.
Opportunities

Investment in resources is also needed for interpreting results, in that "results of an AI-powered analysis should only factor into the final decision; they should not be the final arbiter of that decision" (Brennan 2019). While this could be a challenge in itself, it can also be an opportunity when machine learning helps minimize institutional memory loss in archives and libraries (e.g., when long-time archivists and librarians leave the institution). Machine learning could supplement practices that are already in place, not necessarily replacing people, and at the same time generate metadata for the access and discovery of collections that people might never have the time to get to otherwise. But we will still need to determine the accuracy of the results. As deep learning applications will only be as effective as the data, archives and libraries should expand their capacity by working with academic departments and partnering with university supercomputing centers or other high-performance computing environments across consortium aggregating networks. Such networks provide a computing environment with greater data capacity and more GPUs. Along similar lines, there are opportunities to build upon Carpentries workshops and the communities of practice that surround this type of interest. These growing opportunities could help boost the use of machine learning and deep learning applications to minimize our knowledge gaps about local history and the surrounding community, bringing together different types of data scattered across organizations. This increased capacity for knowledge could grow through collaborative partnerships, connecting people (scholars, computer scientists, archivists, and librarians) to share their expertise through different types of projects. Such projects could emphasize the multi- and interdisciplinary academic approach to research, including digital humanities and other forms or models of digital scholarship.

Conclusion

Along with greater computing capabilities, artificial intelligence could be an opportunity for libraries and archives to boost the discovery of their digital collections by pushing text and image recognition machine learning techniques to new limits. Machine learning applications could help increase our knowledge of texts, photographs, and more, and determine their relevance within the context of research and education. They could minimize institutional memory loss, especially as long-time professionals leave the profession. However, these applications will only be as effective as the data used to train them and the added value they generate. At Oklahoma State University, we took a leap forward in developing a machine learning solution to facilitate metadata creation in support of curation, preservation, and discovery. Our experiment with text extraction and face recognition models generated conclusive results within one academic year with two student interns. The team was satisfied with the final output, and so was the library as we reported on our work. Again, it is still a work in progress, and we look forward to taking another leap forward. In sum, it will be each organization's responsibility to build its data capacity to sustain deep learning applications and justify its commitment of resources. Nonetheless, as Oklahoma State University's face recognition initiative suggests, these applications can augment archives' and libraries' support for multi- and interdisciplinary research and scholarship.

References
“ai is coming. are data ready?” nlm musings from the mezzanine (blog). march , . https://nlmdirector.nlm.nih.gov/ / / /ai-is-coming-are-the-data-ready/. carmel, kent. . “evidence summary: artificial intelligence in education.” european edtech network. https://eetn.eu/knowledge/detail/evidence-summary--artificial-intelligence-in-education. coleman, catherine nicole. . “artificial intelligence and the library of the future, revisited.” stanford libraries (blog). november , . https://library.stanford.edu/blogs/digital-library-blog/ / /artificial-intelligence-and-library-future-revisited. “face recognition.” n.d. accessed november , . https://github.com/ageitgey/face_recognition. griffey, jason, ed. . “artificial intelligence and machine learning in libraries.” special issue, library technology reports , no. (january). https://journals.ala.org/index.php/ltr/issue/viewissue/ / . harper, charlie. . “machine learning and the library or: how i learned to stop worrying and love my robot overlords.” code4lib journal, no. (august). https://journal.code4lib.org/articles/ . heller, martin. . “what is keras? the deep neural network api explained.” infoworld (website). january , . https://www.infoworld.com/article/ /what-is-keras-the-deep-neural-network-api-explained.html. karaka, anil. . “object detection with retinanet.” weights & biases (website). july , . https://www.wandb.com/articles/object-detection-with-retinanet. lynch, clifford. . “machine learning, archives and special collections: a high level view.” international council on archives blog. october , . https://blog-ica.org/ / / /machine-learning-archives-and-special-collections-a-high-level-view/. mahapatra, sambit. “why deep learning over traditional machine learning?” towards data science (website). march , . https://towardsdatascience.com/why-deep-learning-is-needed-over-traditional-machine-learning- b a . mccarthy, john. “what is artificial intelligence?” professor john mccarthy (website). revised november , . http://jmc.stanford.edu/articles/whatisai/whatisai.pdf. padilla, thomas. . responsible operations: data science, machine learning, and ai in libraries. dublin, oh: oclc research. https://doi.org/ . /xk z- g . pesenti, jerome. . “facebook’s head of ai says the field will soon ‘hit the wall.’” interview by will knight. wired (website). december , . https://www.wired.com/story/facebooks-ai-says-field-hit-wall/. ren, shaoqing, kaiming he, ross girshick, xiangyu zhang, and jian sun. . “object detection networks on convolutional feature maps.” ieee transactions on pattern analysis and machine intelligence , no. (april). sas. n.d. “machine learning: what it is and why it matters.” accessed december , . https://www.sas.com/en_us/insights/analytics/machine-learning.html. schweikert, annie. . “audiovisual algorithms, new techniques for digital processing.” master’s thesis, new york university. https://www.nyu.edu/tisch/preservation/program/student_work/ spring/ s_thesis_schweikert.pdf. “tesseract ocr.” n.d. accessed december , . https://github.com/tesseract-ocr/tesseract. zhao, zhong-qiu, peng zheng, shou-tao xu, and xindong wu. “object detection with deep learning: a review.” ieee transactions on neural networks and learning systems , no. ( ): - .
jason cohen and mario nakazawa berea college april machine learning + data creation in a community partnership for archival research chapter abstract: this chapter explores the relationship between the techniques and ethical implications of machine learning deployed in the context of local archival research, and particularly, in partnership with a cultural heritage institution. the authors have been engaged for four years in an archives management and migration project in partnership with our home institution, berea college, and the pine mountain settlement school (pmss) in eastern kentucky. this culturally rich but resource-poor cultural heritage institution carries a legacy of the progressive era with its commitment to educating the rural poor. following a data migration from a blog platform to a modern content management system, the , objects pmss had digitized using a blogging platform contain no consistent metadata to govern how the archives handle document preservation and conservation status, item-level indexing and retrieval, object or collections descriptions, or naming conventions. consequently, our chapter describes the process we used to (1) generate technical and descriptive metadata for historical photographs as we pulled material from an extant blog website into a digital archives platform; (2) identify recurring faces in individual pictures as well as in photographs of groups of sometimes unidentified people in order to generate social networks as metadata; and (3) help develop a controlled vocabulary for the institution’s future needs for object management and description. the collaborative nature of our public outreach is singular: not only is the represented archive a unique and endangered collection, it is also housed in a uniquely challenging location. its remote physical location and historical neglect mean that, moving forward, any system put into place must be manageable by unpaid volunteers and part-time non-expert caretakers, across shifting institutional leadership. the chapter describes our workflow and its goals; further, it explores the tension between the reparative functions machine learning can offer to archival projects and the constraints, biases, and strategies that enter into community-based public history research projects that must carry on long after computationally intensive processes cease to be supported. introduction: cultural heritage and archival preservation in eastern kentucky in this chapter, two researchers, jason cohen and mario nakazawa, describe the contexts for an archivally focused project that emerged from a partnership between the pine mountain settlement school (pmss) in harlan county, kentucky, and scholars and students at berea college. in this process, we have entered into the kind of critical dialogue with our sources and knowledge production that roopika risam calls for in “self-reflexive” investigations in the digital humanities ( , para. ). risam’s intervention, nevertheless, does not explicitly distinguish questions of class and the concomitant geographic constraints that often accompany the economic and social disadvantages of poverty (ahmed et al. ). our work demonstrates how class and geography are tied, even in digital archives, to the need for reflexive and diverse approaches to humanist materials.
for instance, a recent invited contribution to proceedings of the ieee articulates a need for diversity in computing and technology without mentioning class or region as factors shaping these related issues of diversity (stephan et al. , - ). given these constraints, perhaps it is also pertinent to acknowledge that the machine learning application we describe in this chapter is itself not particularly novel in scope or method – we describe our data acquisition and preparation, and two parallel implementations of commercially available tools for facial recognition. what stands out as unique are the ethical and practical concerns tied to bringing unique archival materials out of their local contexts into a larger conversation about computer vision as a tool that helps liberate, and at the same time possibly endanger, a subaltern cultural heritage. in that light, we situate our archival investigation within what bruno latour has productively named “actor-network theory” ( , - ) because, as we suggest below, our actions were highly conditioned not only by the physical and social spaces our research occupies and where its events occur, but also by the historical artifacts themselves, which act powerfully to shape our work in these contexts. moreover, the partnership model of curation and archiving that we pursued in this project complicates the very concept of agency because the actions forming the project emerged from a continuing dialogue rather than any one decision or hierarchy. as we suggest later, a distributed model for decisions (sabharwal , - ) also revealed the limitations of using a participatory and identity-based model for archival development and management. indeed, those historical artifacts will exert influence on this network of relations long after any one of us involved in the current project has ceased to pursue them. when we came to this project, we asked a version of a classic question that has arisen in a variety of forms beginning with very early efforts by bell laboratories, among others, to translate data structures to suit the often flexible needs of humanist data: “what aspects of life are formalizable?” (weizenbaum , ). we discovered that while an ontology may represent a formalized relationship of an archive to a database or finding aid, it also raises questions about the ethical implications of what information and embedded relationships can be adequately formalized by an abstract schema. the promises and realities of technology after coal in eastern kentucky despite the longstanding threats of having to adapt to a post-coal economy, harlan county, kentucky continues to rely on coal and the mountains from which that coal is extracted as two of the cornerstones that shape the identity of the territory as well as the people who call it home. the mountains of eastern kentucky, like much of appalachia, are by turns beautiful and devastated, and both authors of this essay have found conversations with eastern kentucky’s citizens about the role the mountains play and the traditions that emerge from them both insightful and, at times, heartbreaking. this dramatic landscape, with its drastic challenges, may not sound like a place likely to find uses for machine learning. you would not be alone in that assumption.
standing far from urban centers of technology and mobility, eastern kentucky combines deeply structural problems of generational poverty with a hard-won understanding that, since the moment of the region’s colonization, outsiders have taken resources and made uninformed decisions about what the region needs, or where it should turn in order to gain a better purchase on the narrative of american progress, self-improvement, and the unavoidable allures of development-driven capitalism. suspicion of outsiders is endemic here. and unfortunately, economic and social conditions, such as the high workplace injury rates associated with mining and extraction-related industries, the effects of the pharmaceutical industry’s abuse of prescription opioids to treat a wide array of medical pain symptoms without treating the underlying causal conditions, and the systematic dismantling of federal- and state-level social support programs, have become increasingly acute concerns today. but this trajectory is not new: when president lyndon b. johnson announced the beginning of the war on poverty in 1964, he landed an hour away in martin county, and subsequently drove through harlan on a regional tour to inaugurate the initiative. successive generations have sought to leave a mark, and all the while, the residents have been collecting their own local histories of their place. our project, centered on recovering a latent social network of historical families represented by the images held in one local archive, mobilizes this tension between insiders’ persistence and outsiders’ interventions to think about how, as bruno latour puts it, we can “reassemble the social” while still respecting the local ( , - ). pmss occupies a unique position in this social and physical landscape: both local in its emplacement and attention, and a site of philanthropic work that attracted outside money as well as human and cultural capital, pmss is at once of harlan county and beyond it. as we suggest in the later sections of this essay, pmss’s position, at once within local boundaries and straddling regional ones, complicates the network we identified. more than that, however, its split position complicates the relationships of power and filiation embedded in its historical social network. while an economy centered on coal continues to define the eastern kentucky regional identity, a second history can be told about this place and its people, one centered on resilience, independence, simplicity, and beauty, both of the land and its people. this second history has made outsiders’ recent appeals for the region to court technology as a potential solution for what comes “after coal” particularly attractive to a region that prides itself on its capacity to sustain, outlast, and overcome obstacles. while that techno-utopian vision offers another version of the self-aggrandizing silicon valley bootstraps success story j.d. vance narrates in hillbilly elegy ( ), like vance’s story itself, those narratives most often get told by outsiders to outsiders using regional stereotypes as the grounds for a sales pitch. in reality, however, those efforts have largely proven difficult to sustain, and at times, they have become the sources of potentially explosive accusations of fraud and malfeasance. recently, for instance, organizations including mined minds have been accused by residents aiming to prepare for a post-coal economy of misleading students at the least, and of fraud at the worst.
as with the timber, coal, and gas extraction industries that preceded these software development firms’ aspirations, the promises of technology have not been kind to eastern kentucky, and in particular, as with those extraction industries that preceded them, the technological-industrial complex making its pitch in kentucky’s mountains has not returned resources to the region’s residents whom the work was intended, at least nominally, to support (hochschild ; campbell ; bailey ). in this context of technology, culture, and the often controversial position machine learning occupies in generating obscure metrics for its classifiers that may embed bias, our project aims to activate the school’s archival holdings and bring critical awareness to the question of how to actively engage with a paper archive of a local place as we venture further into our pervasively digital moment. the school operates today as a regional cultural heritage institution; it opened in as a residential school and operated as an educational institution until , at which point it transformed itself into an environmental and cultural outreach institution focused on developing its local community and maintaining the richness of the region’s cultural resources and heritage. every year since , pmss has brought hundreds of students and citizens onto its campus to learn about nature and the landscape, traditional crafts and artistic practices, and musical and dance forms, among many other programs. similarly, it has created a space for locals to come together for social events, community celebrations, and festival days, and at the same time, has become a destination for national-level events that create community from shared interests including foodways, wildflowers, traditional dance forms, and other wide-ranging attractions. project background: preserving cultural heritage in harlan county the archives of the pine mountain settlement school emerge from its shifting history. the majority of its papers relate to its time as a traditional institution of education, including student records (which continue to be restricted for several reasons, including ferpa constraints, and personal and community interests in privacy), minutes of its board meetings (again, partially restricted), and financial and narrative accounts of its many activities across a year. the school’s records are unique because they provide a snapshot, year by year and month by month, of the region’s interests and challenges during key years of the 20th century, spanning the first world war to vietnam. in addition, they detail the relations the school maintained with a philanthropic base of donors who helped to support and shape it, and, beyond its local relations, they placed the school into contact with a larger set of cultural interactions than a boarding school relying on tuition or other profit-driven means to sustain its operations would have encountered. while the archival holdings continued to be informally developed by its directors and staff, who kept the official papers organized roughly by year, the archive itself sat largely neglected after . beginning around the turn of the millennium, a volunteer archivist named helen wykle began digitizing items one by one, and soon hosted a curated selection of those digital surrogates along with interpretive and descriptive narration on a wordpress installation, the pine mountain settlement school collections.
the pmss collections wordpress site has been continuously running and frequently updated by wykle and the volunteer community members she has organized since .[footnoteref: ] together with her collaborators and volunteers, wykle has grown the wordpress site to over pages, including over , embedded images comprising photographs and newspapers; scanned memos, meeting minutes, and other textual material (in jpg and pdf formats); html transcriptions and bibliographies hard-coded into the pages; scanned images of 3-d collections objects like textile looms or wood carving tools; partially scanned runs of serial publications; and other composite visual material. none of those objects was hosted within a regular and complete metadata hierarchy or ontology: no regular scheme of fields or file-naming convention was followed, no controlled vocabulary was maintained, no object-types were defined, no specific fields were required prior to posting, and perhaps unsurprisingly as a result, the search and retrieval functions of the site had deteriorated noticeably. [ : jason cohen and mario nakazawa wish to extend a note of appreciation to helen hays wykle, geoff marietta, the former director of pmss, and preston jones, its current director, for welcoming us and enabling us to access the physical archives at pmss from - .] in , jason cohen approached pmss with the idea of using its archives as the basis for curricular development at berea college.[footnoteref: ] working in collaboration beginning in , mario nakazawa and cohen developed two courses in digital and computational humanities, led a team-directed study in augmented reality in coordination with pine mountain, contributed materials and methods for a new course in appalachian studies, and promoted the use of pmss archival materials in several other extant courses in history and art history, among others. these new college courses each make use of pmss historical documents as a shared core of visual and textual material in a digital and computational humanities concentration that clusters around critical archival and textual studies.[footnoteref: ] [ : jason cohen would like to recognize the support this project received from the national endowment for the humanities’ “humanities connections” grant. see grant number ak- - , description online here: https://securegrants.neh.gov/publicquery/main.aspx?f= &gn=ak- - .] [ : in the original version of the collaboration, we had planned also to teach basic computer programming to high school students during a summer program that would have used that same set of materials, but with the paired departures of my original co-pi as well as the former director, that plan has thus far remained unfulfilled.] the success of that initial collaboration and course development seeded the potential in - for a whiting public engagement fellowship focused on developing middle and high school curricula for use in kentucky public schools with pmss archival materials. that whiting-funded project has generated over lessons keyed to kentucky state standards; these lessons are currently in use at nine schools across eight school districts, and each school is using pmss materials to highlight its own regional and local interests. the work we have done with these archives has thus far reached the classrooms of at least eleven different middle and high school teachers, and as a result, touched over students in eastern and central kentucky public schools.
we mention these numbers in order to demonstrate that our collaboration has been neither shallow nor fleeting. we have come to know these archives quite well, and because they are not adequately cataloged, the only way to get to know them is to spend time reading through the materials one page at a time. an ancillary consequence of this durable collaboration and partnership across the public-academic divide is the shared recognition early in that the pmss archival database and its underlying data structure (a flat sql database generated by the wordpress interface) would provide inadequate stability for records management and quality control in future development. in addition, we discovered that the interpretive materials and metadata associated with the wordpress installation were also insufficient for linked metadata across the objects in this expanding digital archive, for reasons discussed below. as partners, we decided together to migrate to a contentdm instance hosted by the kentucky virtual library, a consortium to which berea college belongs, and which is open to future membership from pmss. that decision led a team of berea college undergraduate and faculty researchers to scrape the data from the pmss archive site and supplement the images and transcriptions it contains with available textual metadata drawn from the site.[footnoteref: ] alongside the wordpress instance as our reference, we were also granted access to a dropbox account that hosted higher resolution versions of the images featured on the blog. the scraper pulled over , unique images (and located over , duplicate images in the process), document transcriptions for scanned texts on the site, and subject and person bibliographies, including library of congress subject headings that had been hard-coded into the site’s html. we also extracted the unique object identifiers and labels associated with each image, which in wordpress are not associated with the image objects themselves. we used that data to populate the contentdm instance and returned a sparse but stable skeleton for future archival development. in the process, we also learned a great deal about how a future implementation of a controlled vocabulary, an image acquisition and processing pipeline, and object documentation standards should work in the next stages of our collaborative pmss archival development. [ : jason cohen wishes to thank mario nakazawa, bethanie williams, and tradd schmidt for undertaking this project with him. the github repo for the pmss scraper is hosted here: https://github.com/tradd-schmidt/pmss_scraper.] as we developed and refined this new point of entry to the digital archives using the contentdm hosting and framework, some of the ethical issues surrounding this local archive came more clearly into focus. a parallel set of questions, arising in the first instance in response to j.d. vance’s work, and in the second, in response to outsiders’ claims for technological solutions to the deterioration of local and cultural heritage, became more immediate for us. because we were creating virtual archival surrogates for materials housed at pine mountain, for instance, questions arose from the pmss board members related to privacy and the use of historical materials. further, the board was concerned that even historical materials could bear on families present in the community today.
we found that while profession-wide responses to archival constraints are shaped predominantly by discussions of copyright and fair use, issues of personal privacy are often left tacit. this gap between legal use and public interests in privacy reveals how tasks executed using machine learning techniques may impinge upon broader ethical constraints of public trust and civic obligation.[footnoteref: ] [ : the professional conversation in archive and collections management has not been as rich as the one emerging in ai contexts more broadly. for a recent discussion of the conflict in the roles of public trust and civic service that emerge from the context of the powers artificial intelligence holds for image recognition in policing applications, see elizabeth joh, “artificial intelligence and policing: first questions,” seattle university law review : - .] similarly, as the ownership of historical images suddenly extended to include present-day community members, and as these questions of access and serving a local public were inextricably bound up with interactions with members of that shared public whose family names and faces appear in the images we were making available, we began to consider the ways in which our archival work was tied to what ryan calo calls the “historical validation” of primary source materials ( , - ). when an ai system recognizes an object, calo remarks, that object is validated. but how should one handle the lack of a specific vocabulary within a given training set? one answer, of course, would be to train a new set – but that response is becoming increasingly prohibitive for smaller cultural heritage projects like ours: the time and computational power required to execute the training are non-negligible. in addition, training resources (such as data sets, algorithms, and platforms) are increasingly becoming monetized, and we do not have the margins to buy access to new data for training. as a consequence, questions stemming from how one labels material in a controlled vocabulary were also at issue. we encountered a failure in historical validation when, for instance, our ai system labeled a “spinning wheel” as a wheel, but did not detect its historical relationship to weaving and textiles. that validation was further obscured when the system also failed to categorize a second form of “spinning wheel,” which refers locally to a home-made merry-go-round.[footnoteref: ] in other words, not only did the system flatten a spinning wheel into a generic wheel, it also missed the regional homology between textile production and play, a cultural crux that reveals how this place envisions an intersection between work and recreation. by breaking the associations between two forms of “spinning wheel,” our system erased a small but significant site of cultural inheritance. how, we asked, should one handle such instances of effacement? at one level, one would expect an archival system to be able to identify the primitive machine for spinning wool, flax, or other raw materials into usable thread for textiles, but what about the merry-go-round? and what should one do when a system neglects both of these meanings and reduces the object to the same status as a wheel on a tractor, car, or carriage? [ : see “spinning wheel” in cassidy - .]
similarly, when competing naming conventions arose for landmarks, we took care to consider which name should be granted priority as the default designation, and we asked how one should designate a local or historical name, whether for a road, waterway, knob, or other feature, in relation to a more widely accepted nomenclature such as state route designations or standardized toponyms. as we attempted to address the challenge of multiple naming conventions, we encountered some of the same challenges that archivists find in dealing with indigenous peoples and their textual, material, and physical artifacts.[footnoteref: ] following an example derived from the passamaquoddy people, we implemented a small set of “traditional knowledge labels”[footnoteref: ] to describe several forms of information, including (a) restrictions on images that should not be shown to strangers (to protect family privacy), (b) places that should remain undisclosed (for instance, wild ginseng, ramp, orchid, or morel mushroom patches), and (c) educational materials focused on “how it was done” as related to local skills and crafts that have more modern implementations, but for which the traditional practices have remained meaningful. this included cases such as maypole dancing and festivals, which remain endowed with ritual significance. in the final analysis, neither the framework supplied by copyright and fair use nor the one supplied by data validation proved singularly adequate to our purposes, but they did provide guidelines from which our facial recognition project could proceed, as we discuss below. [ : one well-documented digital approach to handling indigenous archival materials is the mukurtu platform for indigenous cultural heritage: https://mukurtu.org/. ] [ : for the original traditional knowledge labels, see: https://passamaquoddypeople.com/passamaquoddy-traditional-knowledge-labels.] machine learning in a local archive these preliminary discussions of ethics and convention may seem unrelated to the focus this collection adopts toward machine learning and artificial intelligence in the archive. however, as we have begun to suggest, the data migration to contentdm opened the door to machine learning for this project, and those initial steps framed the pitfalls that we continue to navigate as we move forward. as we suggested at the outset, the technical machine-learning task that we set for ourselves is not cutting-edge research so much as an application of existing technologies to a new aspect of archival investigation. we proposed (and succeeded with) an application of commercial facial recognition software to identify the persons in historic photographs in the pmss archives. we subsequently proposed, and are currently working, to identify the photographs sharing common but unnamed faces, and in coordination with photographs of known people, to re-create the social network of this historic institution across slices of its history. we describe the next steps briefly below, but let us tarry for a moment with the question of how the ethical concerns we navigated up to this point also influenced our approach to facial recognition. the first of those concerns has to do with commercial and public access to archival materials that, as we suggested above, include materials that are designated as restricted use in some way.
we demonstrated to the local members at pine mountain how our use case and its constraints for digital archives fit with the current standards for the fair use of copyrighted materials based on the “substantive transformation” of reproduced objects (levendowski , - ). since we are not making available large bodies of materials still protected by copyright, and since our use of select materials shifts the context within which they are presented, we were able to negotiate with pmss to allow us to design a system for facial recognition using the contentdm instance as our image source. what that negotiation did not consider, however, is what happens when fair use does not provide a sufficiently high standard of control for the institution involved in the application of algorithms to institutional memory or its technological dependencies. first, to test the facial recognition processes, we reached back to the most primitive and local facial recognition software that we could find: google’s retired picasa platform and its picasa web albums api, which was retired in may and fully deprecated as of march (sabharwal ). we chose picasa because it is a self-contained software application that operates using a locally hosted script and locally hosted images. given its deprecated status and its location on a local machine, we were confident that no cloud services would be ingesting the images we fed into the system for our trial. this meant that we could test small data examples without fear of having to upload an entire corpus of material that could subsequently be incorporated into commercial facial recognition engines or pop up unexpectedly in search results. we thus began by upholding a high threshold for privacy and insisting on finding ways for pmss to maintain control over these images within the grasp of its local directories. the picasa system created surprisingly good results within the scope we allowed it. it was highly successful at matching the small group of known faces we supplied as test materials. while it would be difficult to supply a numerical match rate, first because of this limited test set, and second because we have not expanded the test to a broad sample using another platform, we were anecdotally surprised at how robust picasa’s matching was in practice. for instance, picasa matched the images of a single person’s face, celia cathcart, from pictures of her as a teenager to images of her as a grandmother. it recognized cathcart in a group of basketball players, and it also identified her face from side-view and off-center angles, as in a photograph of her looking down at her newborn child. the most immediate limitation of picasa lies in its tagging, which required manual entry of every name and did not allow any automation. following the success of that hand-tagging and cross-image identification process, we discussed with our partners whether the next step, using amazon web services’ computer vision and facial recognition platform, rekognition, would be acceptable. they agreed, and we ran the images through the aws application, testing our results against samples pulled from our picasa run to verify the results. perhaps unsurprisingly, aws rekognition fared even better with those test cases. using one photograph, the aws application identified all of the picasa matches as well as three new images that had not previously been tagged with cathcart’s name.
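to give a concrete sense of what this kind of rekognition step involves, the following minimal python sketch indexes a verified face into a rekognition collection and then searches an untagged photograph against it. this is an illustration only, not our project’s actual pipeline: the collection id, file paths, and name tag are hypothetical, while the operations shown (create_collection, index_faces, search_faces_by_image) are standard boto3 rekognition calls.

```python
# illustrative only: a minimal face-matching pass with aws rekognition via boto3.
# the collection id, file paths, and name tag below are hypothetical.
import boto3

client = boto3.client("rekognition", region_name="us-east-1")

# one-time setup: a collection to hold indexed face vectors
client.create_collection(CollectionId="pmss-faces")

# index a verified, hand-tagged face of a named person
with open("known/cathcart_portrait.jpg", "rb") as f:
    client.index_faces(
        CollectionId="pmss-faces",
        Image={"Bytes": f.read()},
        ExternalImageId="celia_cathcart",  # the verified name tag
    )

# search an untagged archival photograph against the collection;
# rekognition matches on the largest detected face in the input image
with open("untagged/group_photo.jpg", "rb") as f:
    response = client.search_faces_by_image(
        CollectionId="pmss-faces",
        Image={"Bytes": f.read()},
        FaceMatchThreshold=90,  # only suggest high-similarity matches
    )

# each candidate match still warrants human verification before tagging
for match in response["FaceMatches"]:
    print(match["Face"]["ExternalImageId"], round(match["Similarity"], 1))
```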
the same pattern held for other images in our sample group: katherine pettit was positively identified across more likenesses than had been previously tagged, and alice cobb was also positively tracked across images. this positive attribution also reveals a limitation of the metadata: while these three women we have named are important historical figures at pmss, and while they are widely acknowledged in the archive and well-represented in the photographic record, not all of the photographs have been well-tagged or fully documented in the archive. the newly tagged images that we found would enrich the metadata available to the archive not because these images include surprising faces, but rather because the tagging has been inconsistent, and over time, previously known faces have become less easy to discern. like other recent discussions of private materials disclosed within systems trained for matching and similarity, we found that the ethics of private materials for this non-private purpose provoked strong reactions. while some of the reaction was positive, with community members happy to have more images of the school’s founding director, katherine pettit, identified, those same community members were not comfortable with our role as researchers identifying people in the photographs in their community’s archive, unsupervised. they wanted instead to verify each positive identification, a point that we agreed with, but which also hindered the process of moving through , images. they wanted to maintain authority, and while we saw our efforts as contributions to their goals of better describing their archival holdings, it turned out that the larger scope of automation we brought to the project was intimidating. while its legal status and direct ethics seemed settled before the beginning of the project, ultimately, this project contributed to a sense among some individuals at pmss that they were losing control of their own archive.[footnoteref: ] that fear of a loss of control led to another reckoning with the project, as we discuss in the next section. [ : see, for another example of the ethical quandaries that may be associated with legal applications of machine learning techniques, ema et al. .] what machine learning cannot learn: an ethics of the archive it became clear, at the same moment we validated our test case, that our research goals and those of our partners had quickly diverged. we had discussed the scope and use of pmss materials with our partners at pmss and, in a formally drafted “memorandum of understanding” (mou) adapted from the us department of justice ( ; ), laid out our shared goals for the project. as we described in the mou, both partners considered it mutually beneficial for the archive and its metadata to be able to identify faces of named as well as unnamed people. we aimed to capture single-person images as well as groups in order to enrich the archive with cross-links to other photographs or archival materials with a shared subject heading, and we hoped to increase the number of names included in object attributes. despite those conversations and multiple revisions of the mou draft, what we discovered was ultimately different from the path our planning had indicated. instead of creating an historical social network using the five decades of photographs we had prepared, we found that the history of the social network and the family and kinship relationships detailed through those images was deeply personal for the community living in the region today.
we found out the hard way that those kinships reflected economic changes in status and power, realignments among families and their communities, and new patterns in the social fabric formed by the warp of personal relationships and the weft of local institutions (schools, hospitals, and local governance). revealing those changes was not always something that our partners wanted us to do, and these were not patterns we had sought to discover: they are simply there, embedded in the images and the relations among images. these social changes in local alignments – tied in complex ways to marriages and separations, legal conflicts and resolutions, changes in ownership of residential and commercial interests, and other material reflections of that social fabric – remain highly charged and, for those continuing to live in the area, they revealed potentially unexpected parts of the lived realities and values of the place. as a result, even though we had an mou that worked for the technical details of the project, we could not find common ground for how to handle the competing social and ethical values of the project. as we problem-solved, we tried to describe new forms of restriction and to generate appropriately sensitive guidelines to handle future use and access, but it turned out that all of these approaches were threatening to the values of a tightly knit community. they, rightly, want to tell their story, and so many people have told it so poorly for so long that they wish to have sole access to the materials from which the narratives are assembled. as researchers interested in open access and stable platform management, we have disagreements with the scholarly and archival implications of this decision, but we ultimately respect the resolve and underlying values that accompany the difficult choices pmss makes about its public audiences and the corresponding goals it maintains for its collections. interestingly, wykle has come to view our work with pmss collections as another form of the material and cultural extraction that has dominated the region for generations. while we see our work in light of preservation and access as well as our lasting commitment to pmss and the region, we have also come to recognize the powerful explanatory force that the idea of “extraction” holds for communities in a region that has suffered the negative effects of many extraction industries. in acknowledging the limitations of our own efforts, we would posit that our case study offers a counter-example to works that suggest how ai systems can be designed automatically to meet the needs of their constituents (winfield et al. ). we tried to use a design approach to address our research goals and our partners’ needs, and it turned out that the dynamically constructed and evolving nature of those needs outstripped the capacity we could build into our available system of machine learning. the divergence of our goals has led the collaboration to an impasse. given that we had already outlined further steps in our initial documents that could not be satisfied after the partners identified their divergent intentions, the collaborative scope the partners initially described was not completely fulfilled. the divergence of goals became stark: as researchers interested in the relevance and sustainability of these archives, we were moving the collections toward a more accessible and comprehensive platform with open documentation and protocols for future development.
by contrast, the pmss staff were moving toward more stringent and local controls over access to the archives in order to limit dissemination. at this juncture, we had some negotiating to do. first, we made the contentdm instance a password-protected, private sandbox rather than a publicly accessible instance of a virtual digital collection. as pmss owns the material, they decided shortly thereafter to issue a take-down order for the contentdm instance, and we complied. as the contentdm materials were ultimately accessible in the public domain on their live site, this decision revealed how personal the challenges had become. nothing included in the take-down order was unique or new material – rather, the contentdm site simply provided a more accessible format for existing primary material on the wordpress site, stripped of its interpretive and secondary contexts. if there is a silver lining, it lies in this context for use: the “academic divorce” we underwent by discontinuing our collaboration has made it possible for us to continue conducting research on the publicly available archival materials without being obligated to host a live and dynamic repository for further materials. as a result, we can test best approaches without having to worry about pushing them to a live production site. within this constraint, we aim to continue re-creating the historical social network without compromising our partners’ needs for privacy and control of their production site. the mutual decision to terminate further partnership activities based in archival development arose because of these differing paths forward. that decision meant that any further enrichment of the archival materials would not become publicly available, which we saw as a penalty against using the archive at a moment when archives need as much advocacy and visible support as possible. under these constraints of private accessibility, we have continued to work on the aws rekognition pipeline and have successfully identified all of the faces of named people featured in the archive, with face and name labels now associated with over unique images. our next step, delayed to spring as a result of the covid-19 pandemic, includes the creation of an associative network that first identifies unnamed faces in each image using unique identifiers. the second element of that process will be to generate an historical social network using the co-occurrence of those faces, as well as the faces of named people, in the available images. given that our metadata enrichment has already included date associations for most of the images, we are confident that we will be able to reconstruct historically specific networks for a given year or range of years, and moreover, that the association between dates and named people will help us to identify further members of the community who are not currently named in the photographs, because of the small groups involved in activities and clubs, as well as the generally limited student and teacher populations during any given year. we are now far more sensitive to how the local concerns of this community shape our research methods and outcomes. the longer-term hope, one that it is not at all clear we will be allowed to pursue, would be to use natural language processing tools on the archive’s textual materials, particularly named entity recognition and word vectors, to search and match images where known names occur proximate to the names of unmatched faces. a minimal sketch of the kind of co-occurrence network we have in mind appears below.
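the sketch below, in python with networkx, is purely illustrative of the approach rather than our working code: hypothetical anonymized face identifiers are linked whenever they appear in the same photograph, the graph is sliced by year, and only aggregate measures are computed.

```python
# illustrative sketch: a face co-occurrence network built with networkx,
# reporting only aggregate, anonymized metrics. the input records and
# face identifiers below are hypothetical.
import itertools
import networkx as nx

# hypothetical input: each photograph maps to a year and the set of
# anonymized face ids detected in it (e.g., by a face-matching pipeline)
photos = [
    {"year": 1921, "faces": {"f001", "f002", "f003"}},
    {"year": 1921, "faces": {"f002", "f004"}},
    {"year": 1935, "faces": {"f001", "f004", "f005"}},
]

def network_for_years(photos, start, end):
    """build a weighted co-occurrence graph for photos within a year range."""
    g = nx.Graph()
    for photo in photos:
        if start <= photo["year"] <= end:
            # every pair of faces appearing in the same photo is linked
            for a, b in itertools.combinations(sorted(photo["faces"]), 2):
                if g.has_edge(a, b):
                    g[a][b]["weight"] += 1
                else:
                    g.add_edge(a, b, weight=1)
    return g

g = network_for_years(photos, 1913, 1930)

# aggregate metrics only -- no individuals are reported
if g.number_of_nodes() > 0:
    print("density:", nx.density(g))
    betweenness = nx.betweenness_centrality(g)
    print("mean betweenness:", sum(betweenness.values()) / len(betweenness))
```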
the present goal, however, remains to create a more replete and densely connected network of faces and the places they occupied when they were living in the gentle shadows of pine mountain. in order to abide by pmss community wishes for privacy, we will be using anonymized aggregate results without identifying individuals in the photographs. while this method has the drawback of not being able to reveal the complexity of the historical relations at the granular level of individuals, it will allow us to report on the persistence or variation in network metrics, such as network density, centrality, path length, and betweenness measures, among others. in this way, we aim to be able to measure and report on the network and its changes over time without reporting on individuals. we arrived at this anonymizing method as a solution to the dissolved partnership by asking about the constraints of ferpa, as well as by looking back at federal and commercial facial recognition practices. in each case, the dark side of these technological tools remains one associated with surveillance and, in the language of eastern kentucky, extraction. we mention this not only to be transparent about our recognition of these limitations, but also in the hope of opening a new dialogue with our partners that might stem from generating interesting discoveries without compromising their sense of local ownership of their archival materials. nonetheless, in order to report on the most interesting aspects – the actual people and their local histories of place – the work to be done would remain more at a human level than at a technical one. conclusion our project describes a success that remains imbricated with a shortcoming in machine learning. the machine learning tasks and algorithms our project implemented serve a mimetic function in the distilled picture of the community they reflect. by matching historical faces to names, the project embraces a form of digital surrogacy: we have aimed to produce a meta-historical account of the present institution’s social and cultural function as a site of social networking and local knowledge transmission. as robyn caplan and danah boyd have recently suggested, the “bureaucratic functions” these algorithms promote can be understood through the ways in which they structure users’ behaviors ( , ). we would like to supplement caplan and boyd’s insight regarding the potential coercions involved in how data structures implicitly shape their contents as well as their users’ behaviors. not only do algorithms promote a kind of bureaucracy, to ends that may be positive and negative, and sometimes both at once, but further, those same structures may reflect or shape public behaviors and interactions beyond a single platform. as we move between digital and public spheres, our work similarly shifts its scope. the research that we intended to have positive community effects was instead read by that very same set of people as an attempt to displace a community from the center of its own history. in other words, the bureaucratic functions embedded in pmss as an institution cast our new approach to their storytelling as an unwanted and external intervention. as their response suggests, the internal and extant structures for governing their community, its stories, and the people who tell them, saw our contribution as an effort to co-opt their control.
where we thought we were offering new tools for capturing, discovering, and telling stories, they saw what safiya noble has recently characterized, in a specifically racialized context, as “algorithms of oppression” ( ). here the oppression would be geographic, socio-economic, and cultural, rather than racial; nevertheless, the perception that one is being oppressed by systems set into place by agents working beyond one’s own community remains a shared foundation in noble’s argument and in the unexpected reception of our project. as we move forward with our own project into unknown territories, in which our work-products may never see the light of day because of the value conflicts bound up in making archival objects public and accessible, we have found a real and lasting respect for the institutional dependencies and emplacements within which we all do our work. we hope to channel some of those functions of emplacement to create new forms of accountability and restraint that will allow us to move forward, but at least for now, we have found with our project one limitation of machine learning, and it is not the machine. references ahmed, manan, maira e. Álvarez, sylvia a. fernández, alex gil, rachel hendery, moacir p. de sá pereira, and roopika risam. . “torn apart / separados.” group for experimental methods in humanistic research. https://xpmethod.plaintext.in/torn-apart/volume/ /. bailey, ronald. . “the noble, misguided plan to turn coal miners into coders.” reason, november , . https://reason.com/ / / /the-noble-misguided-plan-to-tu/. calo, ryan. . “artificial intelligence policy: a primer and roadmap.” university of california, davis law review : - . caplan, robyn, and danah boyd. . “isomorphism through algorithm: institutional dependencies in the case of facebook.” big data & society (january-june): - . https://doi.org/ . / . cassidy, frederic g., et al., eds. - . dictionary of american regional english. cambridge, ma: belknap press. https://www.daredictionary.com/. ema, arisa, et al. . “clarifying privacy, property, and power: case study on value conflict between communities.” proceedings of the ieee , no. (march): - . https://doi.org/ . /jproc. . . harkins, anthony, and meredith mccarroll, eds. . appalachian reckoning: a region responds to hillbilly elegy. morgantown, wv: west virginia university press. hochschild, arlie. . “the coders of kentucky.” the new york times, september , . https://www.nytimes.com/ / / /opinion/sunday/silicon-valley-tech.html. joh, elizabeth. . “artificial intelligence and policing: first questions.” seattle university law review ( ): - . latour, bruno. . reassembling the social: an introduction to actor-network-theory. new york: oxford university press. levendowski, amanda. . “how copyright law can fix artificial intelligence’s implicit bias problem.” washington law review ( ): - . mukurtu cms. https://mukurtu.org/. accessed december , . noble, safiya. . algorithms of oppression: how search engines reinforce racism. new york: nyu press. passamaquoddy people. “passamaquoddy traditional knowledge labels.” accessed december , . https://passamaquoddypeople.com/passamaquoddy-traditional-knowledge-labels. risam, roopika. . “beyond the margins: intersectionality and the digital humanities.” dhq: digital humanities quarterly ( ). http://digitalhumanities.org/dhq/vol/ / / / .html. robertson, campbell. . “they were promised coding jobs in appalachia. now they say it was a fraud.” the new york times, may , .
https://www.nytimes.com/ / / /us/mined-minds-west-virginia-coding.html. sabharwal, anil. . “moving on from picasa.” google photos blog. last modified march , . https://googlephotos.blogspot.com/ / /moving-on-from-picasa.html. sabharwal, arjun. . digital curation in the digital humanities: preserving and promoting archival and special collections. boston: chandos. stephan, karl d., katina michael, m.g. michael, laura jacob, and emily p. anesta. . “social implications of technology: the past, the present, and the future.” proceedings of the ieee , special centennial issue (may): - . https://doi.org/ . /jproc. . . united states department of justice. . “guidelines for a memorandum of understanding.” https://www.justice.gov/sites/default/files/ovw/legacy/ / / /sample-mou.pdf. ———. . “sample memorandum of understanding.” http://www.doj.state.or.us/wp-content/uploads/ / /mou_sample_guidelines.pdf. vance, j.d. . hillbilly elegy: a memoir of a family and culture in crisis. new york: harper. weizenbaum, joseph. . computer power and human reason: from judgment to calculation. new york: w.h. freeman and co. winfield, alan f., katina michael, jeremy pitt, and vanessa evers. . “machine ethics: the design and governance of ethical ai and autonomous systems.” proceedings of the ieee , no. (march): - . https://doi.org/ . /jproc. . . generative machine learning charlie harper, phd digital scholarship specialist freedman center for digital scholarship kelvin smith library case western reserve university introduction generative machine learning is a hot topic. with the election approaching, facebook and reddit have each issued their own bans on the category of machine-generated or -altered content that is commonly termed “deep fakes” (cohen ; romm, harwell, and stanley-becker ). calls for regulation of the broader, and very nebulous, category of fake news are now part of us political debates, too. although well known and often discussed in newspapers and on tv because of their dystopian implications, deep fakes are just one application of generative machine learning. there is a remarkable need for others, especially humanists and social scientists, to become involved in discussions about the future uses of this technology, but this first requires a broader awareness of generative machine learning’s functioning and power. many articles on the subject of generative machine learning exist in specialized, highly technical literature, but there is little that covers this topic for a broader audience while retaining important high-level information on how the technology actually operates. this chapter presents an overview of generative machine learning with particular focus on generative adversarial networks (gans). gans are largely responsible for the revolution in machine-generated content that has occurred in the past few years, and their impact on our future extends well beyond that of producing purposefully deceptive fakes. after covering generative learning and the workings of gans, this chapter touches on some interesting and significant applications of gans that are not likely to be familiar to the reader. the hope is that this will serve as the start of a larger discussion on generative learning outside of the confines of technical literature and sensational news stories. what is generative machine learning?
machine learning, which is a subdomain of artificial intelligence, is roughly divided into three paradigms that rely on different methods of learning: supervised, unsupervised, and reinforcement learning (murphy , - ; burkov , - ). these differ in the types of datasets used for learning and the desired applications. supervised and unsupervised machine learning use labeled and unlabeled datasets, respectively, to assign unseen data to human-generated labels or statistically-constructed groups. both supervised and unsupervised approaches are commonly used for classification and regression problems, where we wish to predict categorical or quantitative information about new data. a combined form of these two paradigms, called semi-supervised learning, which mixes labeled and unlabeled data, also exists. reinforcement learning, on the other hand, is a paradigm in which an agent learns how to function in a specific environment by being rewarded or penalized for its behavior. for example, reinforcement learning can be used to train a robot to successfully navigate around obstacles in a physical space. generative machine learning, rather than being a specific learning paradigm, encompasses an ever-growing variety of techniques that are capable of generating new data based on learned patterns. the process of learning these patterns can engage both supervised and unsupervised learning. a simple, statistical example of one type of generative learning is a markov chain. from a given set of data, a markov chain calculates and stores the probabilities of a following state based on a current state. for example, a markov chain can be trained on a list of english words to store the probabilities of any one letter occurring after another letter. these probabilities chain together to represent the chance of moving from the current letter state (e.g. the letter q) to a succeeding letter state (e.g. the letter u) based on the data from which the chain has learned. figure 1. the three most-common letters following “f” in two markov chains trained on an english and italian dictionary. three examples of generated words are given for each markov chain that show how the markov chain captures high-level information about letter arrangements in the different languages (english: “woreackeat,” “bridy,” “sotale”; italian: “laburoponls,” “ravafa,” “ammute”). if another markov chain were trained on italian words instead of english, the probabilities would change, and for this reason, markov chains can capture important high-level information about datasets (figure 1). they can then be sampled to generate new data by starting from a random state and probabilistically moving to succeeding states. in figure 1, you can see the probability that the letter “f” transitions to the three most common succeeding letters in english and italian. a few examples of “words” generated by two markov chains trained on an english and italian dictionary are also given. the example words are generated by sampling the probability distributions of the markov chain, letter by letter, so that the generated words are statistically random, but guided by the learned probability of one letter following another.
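to make the mechanics concrete, here is a minimal python sketch of the kind of first-order, letter-level markov chain described above. the toy word list stands in for a full english or italian dictionary; everything else follows the procedure in the text: count letter-to-letter transitions, then sample them one letter at a time.

```python
# a minimal sketch of a first-order, letter-level markov chain:
# train on a word list, then sample the chain to generate new "words."
import random
from collections import defaultdict

def train(words):
    """count letter-to-letter transitions, with '^' and '$' as start/end states."""
    transitions = defaultdict(list)
    for word in words:
        letters = ["^"] + list(word.lower()) + ["$"]
        for current, nxt in zip(letters, letters[1:]):
            transitions[current].append(nxt)
    return transitions

def generate(transitions):
    """walk the chain from the start state, sampling each succeeding letter."""
    state, out = "^", []
    while True:
        # random.choice over observed successors samples in proportion to counts
        state = random.choice(transitions[state])
        if state == "$":
            return "".join(out)
        out.append(state)

# toy training data; a real run would use a full dictionary for each language
english_like = train(["letter", "little", "battle", "bottle", "settle"])
print([generate(english_like) for _ in range(3)])
```

training the same code on an italian word list would yield different transition counts, and therefore differently structured nonsense words, which is exactly the contrast the figure illustrates.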
the different probabilities of letter combinations in english and italian result in distinctly different generated words. this exemplifies how a generative model can capture specific aspects of a dataset to create new data. the letter combinations are nonsense, but they still reflect the high-level structure of italian and english words in the way letters join together, such as the different utilization of vowels in each language. these basic markov chains demonstrate the essence of generative learning: a generative approach learns a distribution over a dataset, or in other words, a mathematical representation of a dataset, which can then be sampled to generate new data that exists within the learned structure of that dataset. how convincing the generated data appears to a human observer depends on the type and tuning of the machine learning model chosen and the data upon which the model has been trained. so, what happens if we build a comparable markov chain with image data[footnoteref: ] instead of words, and then sample, pixel by pixel, from it to generate new images? the results are just noise, and the generated images reveal no hint of a wine bottle or circle to the human eye (figure ). [ : in many examples, i have used the google quickdraw dataset to highlight features of generative machine learning. the dataset is freely available (https://github.com/googlecreativelab/quickdraw-dataset) and licensed under cc by . .]

[figure: images generated with a simple statistical model appear as noise, as the model is insufficient to capture the structure of the real data (markov chains trained using wine bottles and circles from google’s quickdraw dataset). panels: real data; generated data.]

the very simple generative statistical model we have chosen is incapable of capturing the distribution of the underlying images well enough to produce realistic new images. other types of generative statistical models, like naive bayes or a higher-order markov chain,[footnoteref: ] could perhaps capture a bit more information about the training data, but they would still be insufficient for real-world applications like this.[footnoteref: ] image, video, and audio are complicated; it is hard to reduce them to their essence with basic statistical rules in the way we were able to with the ordering of letters in english and italian. capturing the intricate and often-inscrutable distributions that underlie real-world media, like full-sized photographs of people, is where deep (i.e. using neural networks) generative learning shines and where generative adversarial networks have revolutionized machine-generated content. [ : the order of a markov chain reflects how many preceding states are taken into account. for example, a second-order markov chain would look at the preceding two letters to calculate the probability of a succeeding letter. rudimentary autocomplete is a good example of markov chains in application.] [ : this is not to imply that these models do not have immense practical applications in other areas of machine learning.]

generative adversarial networks

the problem of capturing the complexity of an image so that a computer can generate new images leads directly to the emergence of generative adversarial networks, which are a neural-network-based model architecture within the broader sphere of generative machine learning.
although prior deep learning approaches to generating data, particularly variational autoencoders, already existed, it was a breakthrough in 2014 that changed the fabric and power of generative machine learning. like every big development, it has an origin story that has moved into legend with its many retellings. according to the handed-down tale (giles ), in 2014 doctoral student ian goodfellow was at a bar with friends when the topic of generating photos arose. his friends were working out a method to create realistic images by using complex statistical analyses of existing images. goodfellow countered that it would not work; there were too many variables at play within such data. instead, he put forth the idea of pairing two neural networks against each other in a type of zero-sum game where the goal was to generate believable fake images. according to the story, he developed this idea into working code that night, and his paired neural network architecture produced results the very first time. this was the birth of generative adversarial networks, or gans. goodfellow’s work was quickly disseminated in what is one of the most influential papers in the recent history of machine learning (goodfellow et al. ).

[figure: at the heart of a gan are two neural networks, the generator and the discriminator. the generator maps a random latent vector z to fake data; the discriminator receives both real and fake data and judges “real or fake?”, with feedback flowing back to both networks. as the generator learns to produce fake data, the discriminator learns to separate it out; pairing the two in an adversarial structure forces each to improve at its given task.]

gans have progressed in almost miraculous ways since 2014, but the crux of their architecture remains the coupling of two neural networks. each neural network has a specific function in the pairing. the first network, called the generator, is tasked with generating fake examples of some dataset. to produce this data it randomly samples from an n-dimensional latent space often labeled z. in simple terms, the generator takes random noise (really a random list of n numbers, where n is the dimensionality of the latent space) as its input and outputs its attempt at a fake piece of data, such as an image, clip of audio, or row of tabular information. the second neural network, called the discriminator, takes both fake and real data as input. its role is to correctly discriminate between fake and real examples.[footnoteref: ] [ : its function is exactly that of any other binary classifier found in machine learning.] the generator and discriminator networks are then coupled together as adversaries, hence “adversarial” in the name. the output from the generator flows into the discriminator, and information on the success or failure of the discriminator to identify fakes (i.e. the discriminator’s loss) flows back through the network so that the generator and discriminator each knows how well it is performing compared to the other. all of this happens automatically, without any need for human supervision. when the generator finds it is doing poorly, it learns to produce better examples by updating its weights and biases through traditional backpropagation (see especially langr and bok , - for a more detailed summary of this). as backpropagation updates the generator network’s weights and biases, the generator inherently begins to map regions of the randomly sampled z space to characteristics found in the real dataset. conversely, as the discriminator finds that it is failing to catch the improving fakes, it learns to separate them out in new ways.
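the adversarial loop is easier to see in code. the following is a minimal sketch in python with pytorch, not the architecture from any publication discussed here; the layer sizes, learning rates, and the stand-in “real” data are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

z_dim, img_dim = 64, 28 * 28  # latent-space dimensionality; flattened image size

# the generator maps a random latent vector z to a fake sample
generator = nn.Sequential(
    nn.Linear(z_dim, 256), nn.ReLU(),
    nn.Linear(256, img_dim), nn.Tanh())

# the discriminator is an ordinary binary classifier: real (1) or fake (0)
discriminator = nn.Sequential(
    nn.Linear(img_dim, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
loss_fn = nn.BCELoss()

# stand-in for batches of real, flattened images scaled to [-1, 1]
dataloader = [torch.rand(32, img_dim) * 2 - 1 for _ in range(100)]

for real_batch in dataloader:
    batch = real_batch.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

    # 1. train the discriminator to separate real from fake
    fakes = generator(torch.randn(batch, z_dim)).detach()  # detach: leave the generator alone here
    d_loss = loss_fn(discriminator(real_batch), ones) + loss_fn(discriminator(fakes), zeros)
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # 2. train the generator to trick the discriminator: the discriminator's
    #    judgment backpropagates into the generator's weights and biases
    g_loss = loss_fn(discriminator(generator(torch.randn(batch, z_dim))), ones)
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```

the key line is the final loss: the generator is rewarded when the discriminator labels its output as real, which is the “feedback” arrow in the figure above.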
[figure: a gan being trained on wine bottle sketches from google’s quickdraw dataset (https://github.com/googlecreativelab/quickdraw-dataset) shows the generator learning how to produce better sketches over time. moving from left to right, the generator begins by outputting random noise and progressively generates better sketches as it tries to trick the discriminator.]

at first, the generator outputs random data and the discriminator easily catches these fakes (figure ). as the results of the discriminator feed back into the generator, however, the generator learns to trick its foe by creating more convincing fakes. the discriminator, in turn, learns to better separate out these more convincing fakes. turn after turn, the two networks drive one another to become better at their specialized tasks, and the generated data becomes increasingly like the real data.[footnoteref: ] at the end of training, ideally, it will not be possible to distinguish between real and fake (figure ). [ : https://poloclub.github.io/ganlab/ (accessed jan , ) (kahng et al. )]

[figure: the fully trained generator from figure produces examples that are not readily distinguishable from real-world data. the top row of sketches was produced by the gan; the bottom row was drawn by humans.]

in the original publication, the first gans were trained on sets of small images, like the toronto face dataset, which contains x pixel grayscale photos of faces and facial expressions (goodfellow et al. ). although the generator’s results were convincing when compared to the originals, the fake images were still small, colorless, and pixelated. since then, an explosion of research into gans and increases in computational power have led to strikingly realistic images. the most recent milestone was reached in 2019 by researchers with nvidia, who built a gan that generates high-quality photo-realistic images of people (karras, laine, and aila ). when contrasted with the results of 2014 (figure ), the stunning progression of gans is self-evident, and it is difficult to believe that the person on the right does not exist.

[figure: an image of a generated face from the original gan publication (left; goodfellow et al. , fig. b) and the 2019 milestone (right; karras et al. , fig. ) shows how the ability of gans to produce photo-realistic images has evolved since 2014.]

some applications of generative adversarial networks

over the past five years, many papers on implementations of gans have been released by researchers (alqahtani, kavakli-thorne, and kumar ; wang, she, and ward ). the list of applications is extensive and ever growing, but it is worth pointing out some of the major examples as of and why they are significant. these examples highlight the vast power of gans and underscore the importance of understanding and carefully scrutinizing this type of machine learning.

data augmentation

one major problem in machine learning has always been the lack of labeled datasets, which are required by supervised learning approaches. labeling data is time-consuming and expensive. without good labeled data, trained models are limited in their power to learn and in their ability to generalize to real-world problems.
services, such as amazon’s mechanical turk, have attempted to crowdsource the tedious process of manually assigning labels to data, but labeling has remained a bottleneck in machine learning. gans are helping to alleviate this bottleneck by generating new labeled data that is indistinguishable from the real data. this process can grow a small labeled dataset into one that is larger and more useful for training purposes. in the area of medical imaging and diagnostics this may have profound effects (yi, walia, and babyn ). for example, gans can produce photorealistic images of skin lesions that expert dermatologists are able to separate from real images only slightly over % of the time (baur, albarqouni, and navab ), and they can synthesize high-resolution mammograms for training better cancer detection algorithms (korkinof et al. ). a corollary effect of these developments in medical imaging is the potential to publicly release large medical datasets and thereby expand researchers’ access to important data. whereas the dissemination of traditional medical images is constrained by strict health privacy laws, generated images may not be governed by such rules. i qualify this statement with “may,” because any restrictions or ethical guidelines for the use of medical data that is generated from real patient data require extensive discussion and legal review that have not yet happened. under certain conditions, it may also be possible to infer original data from a gan (mukherjee et al. ). how institutional review boards, professional medical organizations, and courts weigh in on this topic will be seen in the coming years.

[figure: the images on the left are originals, and the images on the right have been modified by a gan with the ability to translate images between the domains of “dirty lens” and “clean lens” on a vehicle (from uřičář et al. , fig. ).]

in addition to generating entirely new data, a gan can augment datasets by expanding their coverage to new domains. for example, autonomous vehicles must cope with an array of road and weather conditions that are unpredictable. training a model to identify pedestrians, street signs, road lines, and so on with images taken on a sunny day will not translate well to variable real-world conditions. in a process known as style transfer, a gan trained on one dataset can translate images into other domains (figure ). this can include creating night road scenes from day scenes (romera et al. ) and producing images of street signs under varying lighting conditions (chowdhury et al. ). this added data permits models to account for greater variability under operating conditions without the high cost of photographing all possible conditions and manually labeling them. beyond medicine and autonomous vehicles, generative data augmentation will progressively impact other imaging-heavy fields (shorten and khoshgoftaar ) like remote sensing (l. ma et al. ; d. ma, tang, and zhao ).

creativity and design

the question of whether machines can possess creativity or artistic ability is philosophically difficult to answer (mazzone and elgammal ; mccormack, gifford, and hutchings ). still, in 2018, christie’s auctioned off its first piece of gan art for $ , (cohn ), and gans are increasingly assisting humans in the creative process for all forms of media. simple models, like cyclegan, are already able to stylize images in the manner of van gogh or monet (zhu et al. ), and more varied stylistic gans are emerging.
gaugan, a beta tool released by nvidia, is a great example of gan-assisted creativity in action. gaugan allows you to rough out a scene using a paint brush for different categories, like clouds, flowers, and houses (figure ). it then converts this into a photo reflecting what you have drawn. the online demo[footnoteref: ] remains limited, but the underlying model is powerful and has massive potential (park et al. ). recently, martin scorsese’s the irishman made headlines for its digital de-aging of robert de niro and other actors. although this process did not involve gans, it is highly likely that in the future gans will become a major part of cinematic post-production (giardina ) through assistive tools like gaugan. [ : http://nvidia-research-mingyuliu.com/gaugan/ (last accessed january , ).]

[figure: this example of gaugan in action shows a sketched-out scene on the left turned into a photo-realistic landscape on the right. *if any representatives of christie’s are reading, the author would be happy to auction this piece.]

fashion and product design are also being impacted by the use of gans. text-to-image synthesis, which can take free text or categories as input to generate a photo-realistic image, has promising potential (rostamzadeh et al. ). by accepting text as input, gans can let designers rapidly generate new ideas or visualize concepts for products at the start of the design process. for example, a recently published gan for clothing design accepts basic text and outputs modeled images of the described clothing (banerjee et al. ; figure ). in an example of automotive design, a single sketch can be used to generate realistic photos of multiple perspectives of a vehicle (radhakrishnan et al. ). the many fields that rely on quick sketching or visual prototyping, such as architecture or web design, are likely to be influenced by the use of gan-assisted design software in coming years.

[figure: text-to-image synthesis can generate images of new fashions based on a description. from the input “maroon round neck mini print a-line bodycon short sleeves” a gan has produced these three photos (from banerjee et al. , fig. ).]

in a similar vein, gans have an emerging role in the creation of new medicines, chemicals, and materials (zhavoronkov ). by training a gan on existing chemical and material structures, research is showing that novel chemicals and materials can be designed with particular properties (gómez-bombarelli et al. ; sanchez-lengeling and aspuru-guzik ). this is facilitated by how information is encoded in the gan’s latent space (the n-dimensional space from which the generator samples; see “z” in figure ). as the generator learns to produce realistic examples, certain aspects of the original data become encoded in regions of the latent space. by moving through this latent space or sampling particular areas, new data with desired properties can then be generated. this can be seen by periodically sampling the latent space and generating an image as one moves between two generated images (figure ). in the same way, by moving in certain directions or sampling from particular areas of the latent space, new chemicals or medicines with specific properties can be generated.[footnoteref: ] [ : this is also relevant to facial manipulation discussed below.]

[figure: two examples of linearly spaced mappings across the latent space between generated images a and b. note that by taking one image and moving closer to another, you can alter properties in the image, such as adding steam, removing a cup handle, or changing the angle of view. these characteristics of the dataset are learned by the generator during training and encoded in the latent space. (gan built on coffee cup sketches from google’s quickdraw dataset.)]
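in code, this kind of latent-space walk is only a few lines. the sketch below reuses the toy `generator` and `z_dim` from the earlier pytorch sketch; the endpoints and the number of steps are arbitrary.

```python
import torch

# pick two random points in the latent space; each decodes to one generated image
z_a, z_b = torch.randn(1, z_dim), torch.randn(1, z_dim)

# walk linearly from z_a to z_b, decoding an image at every step; properties
# encoded along the path (steam, handles, viewing angle) fade in and out
steps = 8
with torch.no_grad():
    frames = [generator(z_a + (z_b - z_a) * i / (steps - 1)) for i in range(steps)]
```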
impersonation and the invisible

i have reserved some of the more dystopian, and likely better-known, applications of gans for last. this is the area where gans’ ability to generate convincing media is challenging our perceptions of reality and raising serious ethical questions (harper ). deep fakes are, of course, the most well known of these. the term can cover the creation of fake images, videos, and audio of an individual, or the modification of any media to alter what someone appears to be doing or saying. in images and video in particular, gans make it possible to swap the identity of an individual and manipulate facial attributes or expressions (tolosana et al. ). a large portion of technical literature is, in fact, now devoted to detecting faked and altered media (see tolosana et al. , tables iv and v). it remains to be seen how successful any approaches will be. from a theoretical perspective, anything that can detect fakes can also be used to train a better generator, since the training process of a gan is founded on outsmarting a detector (i.e. the discriminator network). one shocking extension of deep fakes that has emerged is transcript-to-video creation, which generates a video of someone speaking from a written text. if you want to see this at work, you can view clips of nixon giving the speech written in case of an apollo disaster.[footnoteref: ] as of now, deep fakes like this remain choppy and are largely limited to politicians and celebrities because they require large datasets and additional manipulation, but this limitation is not likely to last. if the evolution of gans for images is any predictor, the entire emerging field of video generation is likely to progress rapidly. one can imagine the incorporation of text-to-image and deep fakes enabling someone to produce an image of, say, “politician x doing action y,” simply by typing it. [ : http://news.mit.edu/ /mit-apollo-deepfake-art-installation-aims-to-empower-more-discerning-public- ]

[figure: gans are providing a method to reconstruct hidden images of people and objects. the images show reconstructions as compared to an input occluded image (occ) and a ground truth image (gt) (from fulgeri et al. , fig. ).]

an application of gans that parallels deep fakes, and is likely more menacing in the short term, is the infilling or adding of hidden, invisible, or predicted information to existing media. one nascent use is video prediction from an image. for example, in , researchers were able to build a gan that produced -second video clips from a single starting frame (vondrick and torralba ). this may not seem impressive, but video is notoriously difficult to work with because the content of a succeeding frame can vary so drastically from the preceding frame (for other examples of ongoing research into video prediction, see cai et al. ; wen et al. ).
for still images, occluded object reconstruction, in which a gan is trained to produce a full image of a person or object that is partially hidden behind something else, is progressing (fulgeri et al. ; see figure ). for some applications, like autonomous driving, this could save lives, as it would help to pick out when a partially occluded pedestrian is about to emerge from behind a parked car. on the other hand, for surveillance technology, it can further undermine anonymity. indeed, such gans are already being explicitly studied for surveillance purposes (fabbri, calderara, and cucchiara ). lastly, i would be remiss if i did not mention that researchers have designed a gan that can generate an image of what you are thinking about, using eeg signals as input (tirupattur et al. ).

gans and the future

the tension between the creation of more realistic generated data and the technology to detect maliciously generated information is only beginning to unfold. the machine learning and data science platform kaggle is replete with publicly accessible python code for building gans and detecting fake data. money, too, is freely flowing in this domain of research; the deepfake detection challenge sponsored by facebook, aws, and microsoft boasted one million dollars in prizes (https://www.kaggle.com/c/deepfake-detection-challenge accessed april , ). meanwhile, industry leaders, such as nvidia, continue to fund the training of better and more convincing gans. the structure of a gan, with its generator and detector paired adversarially, is now being mirrored in society as groups of researchers competitively work to create and discern generated data. the path that this machine-learning arms race will take is unpredictable, and, therefore, it is all the more important to scrutinize it and make it comprehensible to the broader public whom it will affect.

references

alqahtani, hamed, manolya kavakli-thorne, and gulshan kumar. . “applications of generative adversarial networks (gans): an updated review.” archives of computational methods in engineering, december. https://doi.org/ . /s - - -y.
banerjee, rajdeep h., anoop rajagopal, nilpa jha, arun patro, and aruna rajan. . “let ai clothe you: diversified fashion generation.” in computer vision – accv workshops, edited by gustavo carneiro and shaodi you, - . cham: springer international publishing.
baur, christoph, shadi albarqouni, and nassir navab. . “generating highly realistic images of skin lesions with gans.” preprint, submitted september . https://arxiv.org/abs/ . .
burkov, andriy. . the hundred-page machine learning book. self-published, amazon.
cai et al. please insert reference here. (cited on your p. .)
chowdhury, sohini roy, lars tornberg, robin halvfordsson, jonatan nordh, adam suhren gustafsson, joel wall, mattias westerberg, et al. . “automated augmentation with reinforcement learning and gans for robust identification of traffic signs using front camera images.” in rd asilomar conference on signals, systems & computers, - . n.p.: ieee. https://doi.org/ . /ieeeconf . . .
cohen, libby. . “reddit bans deepfakes with ‘malicious’ intent.” the daily dot. january , . https://www.dailydot.com/layer /reddit-deepfakes-ban/.
cohn, gabe. .
“ai art at christie’s sells for $ , .” the new york times, october , . https://www.nytimes.com/ / / /arts/design/ai-art-sold-christies.html.
fabbri, matteo, simone calderara, and rita cucchiara. . “generative adversarial models for people attribute recognition in surveillance.” paper presented at the ieee international conference on advanced video and signal based surveillance, lecce, italy, august-september. https://arxiv.org/abs/ . .
fulgeri, federico, matteo fabbri, stefano alletto, simone calderara, and rita cucchiara. . “can adversarial networks hallucinate occluded people with a plausible aspect?” computer vision and image understanding (may): - .
giardina, carolyn. . “will smith, robert de niro and the rise of the all-digital actor.” the hollywood reporter, august , . https://www.hollywoodreporter.com/behind-screen/rise-all-digital-actor- .
giles, martin. . “the ganfather: the man who’s given machines the gift of imagination.” mit technology review , no. (march/april): - .
gómez-bombarelli, rafael, jennifer n. wei, david duvenaud, josé miguel hernández-lobato, benjamín sánchez-lengeling, dennis sheberla, jorge aguilera-iparraguirre, timothy d. hirzel, ryan p. adams, and alán aspuru-guzik. . “automatic chemical design using a data-driven continuous representation of molecules.” acs central science , no. (february): - . https://doi.org/ . /acscentsci. b .
goodfellow, ian j., jean pouget-abadie, mehdi mirza, bing xu, david warde-farley, sherjil ozair, aaron courville, and yoshua bengio. . “generative adversarial nets.” in advances in neural information processing systems (nips ). https://papers.nips.cc/book/advances-in-neural-information-processing-systems- - .
harper, charlie. . “machine learning and the library or: how i learned to stop worrying and love my robot overlords.” code lib journal, no. (august). https://journal.code lib.org/articles/ .
karras, tero, samuli laine, and timo aila. . “a style-based generator architecture for generative adversarial networks.” in ieee/cvf conference on computer vision and pattern recognition (cvpr), - . n.p.: ieee. https://doi.org/ . /cvpr. . .
korkinof, dimitrios, tobias rijken, michael o’neill, joseph yearsley, hugh harvey, and ben glocker. . “high-resolution mammogram synthesis using progressive generative adversarial networks.” preprint, submitted july , . https://arxiv.org/abs/ . .
langr, jakub, and vladimir bok. . gans in action: deep learning with generative adversarial networks. shelter island, ny: manning publications.
ma, dongao, ping tang, and lijun zhao. . “siftinggan: generating and sifting labeled samples to improve the remote sensing image scene classification baseline in vitro.” ieee geoscience and remote sensing letters , no. (july): - . https://doi.org/ . /lgrs. . .
ma, lei, yu liu, xueliang zhang, yuanxin ye, gaofei yin, and brian alan johnson. . “deep learning in remote sensing applications: a meta-analysis and review.” isprs journal of photogrammetry and remote sensing (june): - . https://doi.org/ . /j.isprsjprs. . . .
mazzone, marian, and ahmed elgammal. . “art, creativity, and the potential of artificial intelligence.” arts , no. (march): - . https://doi.org/ . /arts .
mccormack, jon, toby gifford, and patrick hutchings. .
“autonomy, authenticity, authorship and intention in computer generated art.” in computational intelligence in music, sound, art and design, edited by anikó ekárt, antonios liapis, and maría luz castro pena, - . cham: springer international publishing.
mukherjee, sumit, yixi xu, anusua trivedi, and juan lavista ferres. . “protecting gans against privacy attacks by preventing overfitting.” preprint, submitted december , . https://arxiv.org/abs/ . v .
murphy, kevin p. . machine learning: a probabilistic perspective. adaptive computation and machine learning series. cambridge, mass.: mit press.
park, taesung, ming-yu liu, ting-chun wang, and jun-yan zhu. . “semantic image synthesis with spatially-adaptive normalization.” in ieee/cvf conference on computer vision and pattern recognition (cvpr), - . n.p.: ieee. https://doi.org/ . /cvpr. .
radhakrishnan, sreedhar, varun bharadwaj, varun manjunath, and ramamoorthy srinath. . “creative intelligence – automating car design studio with generative adversarial networks (gan).” in machine learning and knowledge extraction, edited by andreas holzinger, peter kieseberg, a min tjoa, and edgar weippl, - . cham: springer international publishing.
romera, eduardo, luis m. bergasa, kailun yang, jose m. alvarez, and rafael barea. . “bridging the day and night domain gap for semantic segmentation.” in ieee intelligent vehicles symposium (iv), - . n.p.: ieee. https://doi.org/ . /ivs. . .
romm, tony, drew harwell, and isaac stanley-becker. . “facebook bans deepfakes, but new policy may not cover controversial pelosi video.” the washington post. january , . https://www.washingtonpost.com/technology/ / / /facebook-ban-deepfakes-sources-say-new-policy-may-not-cover-controversial-pelosi-video/.
rostamzadeh, negar, seyedarian hosseini, thomas boquet, wojciech stokowiec, ying zhang, christian jauvin, and chris pal. . “fashion-gen: the generative fashion dataset and challenge.” preprint, submitted june , . https://arxiv.org/abs/ . .
sanchez-lengeling, benjamin, and alán aspuru-guzik. . “inverse molecular design using machine learning: generative models for matter engineering.” science , no. (july): - . https://doi.org/ . /science.aat .
shorten, connor, and taghi m. khoshgoftaar. . “a survey on image data augmentation for deep learning.” journal of big data ( ): - . https://doi.org/ . /s - - - .
tirupattur, praveen, yogesh singh rawat, concetto spampinato, and mubarak shah. . “thoughtviz: visualizing human thoughts using generative adversarial network.” in proceedings of the th acm international conference on multimedia, - . new york: association for computing machinery. https://doi.org/ . / .
tolosana, ruben, ruben vera-rodriguez, julian fierrez, aythami morales, and javier ortega-garcia. . “deepfakes and beyond: a survey of face manipulation and fake detection.” preprint, submitted january , . https://arxiv.org/abs/ . .
uřičář, michal, pavel křížek, david hurych, ibrahim sobh, senthil yogamani, and patrick denny. . “yes, we gan: applying adversarial techniques for autonomous driving.” in is&t international symposium on electronic imaging, - . springfield, va: society for imaging science and technology. https://doi.org/ . /issn. - . . .avm- .
vondrick, carl, and antonio torralba. . “generating the future with adversarial transformers.” in ieee conference on computer vision and pattern recognition (cvpr), - . n.p.: ieee. https://doi.org/ . /cvpr. . .
wang, zhengwei, qi she, and tomas e. ward. . “generative adversarial networks: a survey and taxonomy.” preprint, submitted june , . https://arxiv.org/abs/ . .
wen et al. please insert reference here. (cited on your p. .)
yi, xin, ekta walia, and paul babyn. . “generative adversarial network in medical imaging: a review.” medical image analysis (december): - . https://doi.org/ . /j.media. . .
zhavoronkov, alex. . “artificial intelligence for drug discovery, biomarker development, and generation of novel chemistry.” molecular pharmaceutics , no. (october): - . https://doi.org/ . /acs.molpharmaceut. b .
zhu, jun-yan, taesung park, phillip isola, and alexei a. efros. . “unpaired image-to-image translation using cycle-consistent adversarial networks.” in ieee international conference on computer vision (iccv), - . n.p.: ieee. https://doi.org/ . /iccv. . .

can a hammer categorize highly technical articles?

samuel hansen
university of michigan
hansensm@umich.edu

when everything looks like a nail...

i was sure i had the most brilliant research project idea for my course in digital scholarship techniques. i would use the mathematical subject classification (msc) values assigned to the publications in mathscinet[footnoteref: ] to create a temporal citation network which would allow me to visualize how new mathematical subfields were created, and perhaps even predict them while they were still in their infancy. i thought it would be an easy enough project. i already knew how to analyze network data, and the data i needed already existed; i just had to get my hands on it. i even sold a couple of my fellow coursemates on the idea and they agreed to work with me. of course nothing is as easy as that, and numerous requests for data went without response. even after i reached out to personal contacts at mathscinet, we came to understand we would not be getting the msc data the entire project relied upon. not that we were going to let a little setback like not having the necessary data stop us. [ : https://mathscinet.ams.org/ ]

after all, this was early , and there had already been years of stories about how artificial intelligence, machine learning in particular, was going to revolutionize every aspect of our world (kelly ; clark ; parloff ; sangwani ; tank ). all the coverage made it seem like ai was not only a tool with as many applications as a hammer, but that it also magically turned all problems into nails. while none of us were ai experts, we knew that machine learning was supposed to be good at classification and categorization. the promise seemed to be that if you had stacks of data, a machine learning algorithm could dive in, find the needles, and arrange them into neatly divided piles of similar sharpness and length. not only that, but there were pre-built tools that made it so almost anyone could do it. for a group of people whose project was on life support because we could not get the categorization data we needed, machine learning began to look like our only potential savior. so, machine learning is what we used.
i will not go too deep into the actual process, but i will give a brief outline of the techniques we employed. machine-learning-based categorization needs data to classify, which in our case meant mathematics publications. while this can be done with titles and abstracts, we wanted to provide the machine with as much data as we could, so we decided to work with full-text articles. since we were at the university of wisconsin at the time, we were able to connect with the team behind geodeepdive,[footnoteref: ] who have agreements with many publishers to provide the full text of articles for text and data mining research (“geodeepdive: project overview” n.d.). geodeepdive provided us with the full text of , mathematics articles, which we used as our corpus. in order to classify these articles, which were already pre-processed by geodeepdive with corenlp,[footnoteref: ] we first used the python package gensim[footnoteref: ] to process the articles into a python-friendly format and to remove stopwords. then we randomly sampled ⅓ of the corpus to create a topic model using the mallet[footnoteref: ] topic modeling tool. finally, we applied the model to the remaining articles in our corpus. we then coded the words within the generated topics to subfields within mathematics and used those codes to assign articles a subfield category (the sketch below outlines the shape of this pipeline). in order to make sure our results were not just a one-off, we repeated this process multiple times and checked for variance in the results. there was none; the results were uniformly poor. [ : https://geodeepdive.org/ ] [ : https://stanfordnlp.github.io/corenlp/ ] [ : https://radimrehurek.com/gensim/ ] [ : http://mallet.cs.umass.edu/topics.php ]
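for readers who want the shape of that pipeline in code, here is a rough sketch. it assumes gensim 3.x (whose wrappers module exposed mallet’s lda implementation) and a local mallet install; the stand-in corpus, path, and topic count are illustrative, not our actual configuration.

```python
import random
from gensim import corpora
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from gensim.models.wrappers import LdaMallet  # gensim 3.x

# stand-in for the full-text corpus provided by geodeepdive
full_text_articles = ["the chromatic number of a planar graph ...",
                      "brownian motion and harmonic functions ...",
                      "spectra of random regular graphs ..."]

def tokenize(doc):
    # lowercase, strip punctuation, and drop stopwords
    return [t for t in simple_preprocess(doc) if t not in STOPWORDS]

texts = [tokenize(doc) for doc in full_text_articles]
dictionary = corpora.Dictionary(texts)
bows = [dictionary.doc2bow(t) for t in texts]

# train a topic model on a random third of the corpus...
model = LdaMallet("/path/to/mallet", corpus=random.sample(bows, len(bows) // 3),
                  num_topics=5, id2word=dictionary)

# ...inspect the topics (these are what we hand-coded to mathematical
# subfields), then take each article's top-weighted topic as its category
print(model.show_topics(num_topics=5, num_words=10))
for bow in bows:
    dominant_topic = max(model[bow], key=lambda pair: pair[1])[0]
```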
that might not be entirely fair. there were interesting aspects to the results of the topic modeling, but when it came to categorization they were useless. of the subfield codes assigned to articles, only two were ever the dominant result for any given article: graph theory and undefined. even that does not tell the whole story, as undefined was the runaway winner in the article classification race, with more than % of articles classified as undefined in each run, including one for which it hit %. the topics generated by mallet were often plagued by gibberish caused by equations in the mathematics articles, and there was at least one topic in each run that was filled with the names of months and locations. add to this the fact that the technical language of mathematics is filled with words that have non-technical definitions (for example, map or space) or words which have their own subfield-specific meanings (such as homomorphism or degree), both of which frustrate attempts to code a subfield. these issues help make it clear why so many articles ended up as “undefined.” even for graph theory, the one subfield with a vocabulary unique enough for our topic model to partially identify, the results were marginally positive at best. we were able to obtain mathematical subject classification (msc) values for around % of our corpus. when we compared the articles we categorized as graph theory to the articles which had been assigned the msc value for graph theory (05cxx), we found we had a textbook recall-versus-precision problem. we could either correctly categorize nearly all of the graph theory articles with a very high rate of false positives (high recall and low precision), or we could almost never incorrectly categorize an article as graph theory but miss over % that we should have categorized as graph theory (high precision and low recall).

needless to say, we were not able to create the temporal subfield network i had imagined. while we could reasonably claim that we learned very interesting things about the language of mathematics and its subfields, we could not claim we even came close to automatically categorizing mathematics articles. when we had to report back on our work at the end of the course, our main result was that basic, off-the-shelf topic modeling does not work well when it comes to highly technical articles from subjects like mathematics. it was also a welcome lesson in not believing the hype of machine learning, even when a project looks exactly like the kind of problem machine learning is supposed to excel at solving. while we had a hammer and our problem looked like a nail, it seemed that the former was a ball peen and the latter a railroad tie. in the end, even in the land of hammers and nails, the tool has to match the task.

then again, when we failed to accomplish the automated categorization of mathematics, we were dilettantes in the world of machine learning. i believe our project is a good example of how machine learning is still a long way from being the magic tool that some, though not all (rahimi and recht ), have portrayed it to be. let us look at what happens when smarter and more capable minds tackle the problem of classifying mathematics and other highly technical subjects using advanced machine learning techniques.

finding the right hammer

to illustrate the quest to find the right hammer, i am going to focus on three different projects that tackled the automated categorization of highly technical content: two that also attempted to categorize mathematical content, and one that looked to categorize scholarly works in general. these three projects provide examples of many of the approaches and practices employed by experts in automated classification, and they demonstrate the two main paths that these types of projects follow to accomplish their goals. since we have been discussing mathematics, let us start with those two projects. both projects began because the participants were struggling to categorize mathematics publications so they would be properly indexed and searchable in digital mathematics databases: the czech digital mathematics library (dml-cz)[footnoteref: ] and numdam[footnoteref: ] in the case of radim řehůřek and petr sojka (Řehůřek and sojka ), and zentralblatt math (zbmath)[footnoteref: ] in the case of simon barthel, sascha tönnies, and wolf-tilo balke (barthel, tönnies, and balke ). all of these databases rely on the aforementioned msc[footnoteref: ] to aid in indexing and retrieval, and so their goal was to automate the assignment of msc values to lower the time and labor cost of requiring humans to do this task.
the main differences between their tasks related to the number of documents they were working with (thousands for Řehůřek and sojka; millions for barthel, tönnies, and balke), the portion of each work available (full text for Řehůřek and sojka; titles, authors, and abstracts for barthel, tönnies, and balke), and the quality of the data (mostly ocr scans for Řehůřek and sojka; mostly tex for barthel, tönnies, and balke). even with these differences, both projects took a similar approach, and it is the first of the two main pathways toward classification i spoke of earlier: using a predetermined taxonomy and a set of pre-categorized data to build a machine learning categorizer. [ : https://dml.cz/ ] [ : http://www.numdam.org/ ] [ : https://zbmath.org/ ] [ : mathematical subject classification (msc) values in mathscinet and zbmath are a particularly interesting categorization set to work with, as they are assigned and reviewed by a subject area expert editor and an active researcher in the same, or closely related, subfield as the article’s content before they are published. this multi-step process of review yields a built-in accuracy check for the categorization.]

in the end, while both projects determined that the use of support vector machines (gandhi )[footnoteref: ] provided the best categorization results, their implementations were different. the Řehůřek and sojka svms were trained with terms weighted using augmented term frequency[footnoteref: ] and dynamic decision threshold[footnoteref: ] selection using s-cut[footnoteref: ] (Řehůřek and sojka , ); barthel, tönnies, and balke’s were trained with term weighting using term frequency–inverse document frequency[footnoteref: ] and euclidean normalization[footnoteref: ] (barthel, tönnies, and balke , ). the main difference, though, was how they handled formulae. in particular, the barthel, tönnies, and balke group split their corpus into words and formulae and mapped them to separate vectors, which were then merged into a combined vector used for categorization. Řehůřek and sojka did not differentiate between words and formulae in their corpus, though they did note that their ocr scans’ poor handling of formulae could have hindered their results (Řehůřek and sojka , ). in the end, not having the ability to handle formulae separately did not seem to matter, as Řehůřek and sojka claimed microaveraged f₁ scores of . % (Řehůřek and sojka , ) when classifying the top-level msc category with their best-performing svm. when this is compared to the microaveraged f₁ of . % obtained by barthel, tönnies, and balke (barthel, tönnies, and balke , ), it would seem that either Řehůřek and sojka’s implementation of svms or their access to full text led to a clear advantage. this advantage becomes less clear when one takes into account that Řehůřek and sojka were only working with top-level mscs where they had at least ( in the case of their best result) articles, and their limited corpus meant that many top-level msc categories would not have been included. the work done by barthel, tönnies, and balke makes it clear that these less-common msc categories, such as k-theory or potential theory, for which they achieved microaveraged f₁ measures of . % and % respectively, have a large impact on the overall effectiveness of the automated categorization.

remember, this is only for the top level of msc codes, and the work of barthel, tönnies, and balke suggests the results would get worse when trying to apply the second and third levels of full msc categorization to these less-common categories. this leads me to believe that, in the case of categorizing highly technical mathematical works to an existing taxonomy, people have come close to identifying the overall size of the machine learning hammer, but are still a long way away from finding the right match for the categorization nail. [ : support vector machines (svms) are machine learning models which are trained using a pre-classified corpus to split a vector space into a set of differentiated areas (or categories) and then attempt to classify new items through where in the vector space the trained model places them. for a more in-depth, technical explanation, see: https://towardsdatascience.com/support-vector-machine-introduction-to-machine-learning-algorithms- a fca ] [ : augmented term frequency refers to the number of times a term occurs in the document divided by the number of times the most frequently occurring term appears in the document.] [ : the decision threshold is the cut-off for how close to a category the svm must determine an item to be in order for it to be assigned that category. Řehůřek and sojka’s work varied this threshold dynamically.] [ : score-based local optimization, or s-cut, allows a machine-learning model to set different thresholds for each category, with an emphasis on local (per-category) rather than global performance.] [ : term frequency–inverse document frequency provides a weight for terms depending on how frequently they occur across the corpus. a term which occurs rarely across the corpus but with a high frequency within a single document will have a higher weight when classifying the document in question.] [ : a euclidean norm provides the distance from the origin to a point in an n-dimensional space. it is calculated by taking the square root of the sum of the squares of all coordinate values.]
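to ground those terms, here is a minimal sketch of the general tf-idf-plus-linear-svm approach in python with scikit-learn. it is not either team’s implementation; the toy titles, msc-style labels, and parameters are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.metrics import f1_score

train_texts = ["the chromatic number of planar graphs",
               "colorings and matchings in sparse graphs",
               "harmonic functions and brownian motion",
               "capacities and martingales in potential theory"]
train_labels = ["05 combinatorics", "05 combinatorics",
                "31 potential theory", "31 potential theory"]

# tf-idf term weighting with euclidean (l2) normalization feeding
# a linear support vector machine
classifier = make_pipeline(TfidfVectorizer(norm="l2"), LinearSVC())
classifier.fit(train_texts, train_labels)

predicted = classifier.predict(["list colorings of planar graphs"])

# a microaveraged f1 score pools true and false positives across all
# categories before computing precision and recall, so rare categories
# with poor scores drag the overall number down
print(f1_score(train_labels, classifier.predict(train_texts), average="micro"))
```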
now let us shift from mathematics-specific categorization to subject categorization in general, and look at the work microsoft has done assigning fields of study (fos) in the microsoft academic graph (mag), which is used to create their microsoft academic article search product.[footnoteref: ] while the mag fos project is also attempting to categorize articles for proper indexing and search, it represents the second path taken by automated categorization projects: using machine learning techniques both to create the taxonomy and to classify. [ : https://academic.microsoft.com/ ]

microsoft took a unique approach in the development of their taxonomy. instead of relying on the corpus of articles in the mag to develop it, they relied primarily on wikipedia for its creation. they generated an initial seed by referencing the science metrix classification scheme[footnoteref: ] and a couple thousand fos wikipedia articles they identified internally. they then used an iterative process to identify more fos in wikipedia, based on whether they were linked to wikipedia articles that were already identified as fos and whether the new articles represented valid entity types (e.g. an entity type of protein would be added and an entity type of person would be excluded) (shen, ma, and wang , ). this work allowed microsoft to develop a list of more than , fields of study for use as categories in the mag.
[ : http://science-metrix.com/?q=en/classification ]

microsoft then used machine learning techniques to apply these fos to their corpus of over million academic articles. the specific techniques are not as clear as in the previous examples, likely because microsoft is protecting its methods from competitors, but the article its researchers published to the arxiv (shen, ma, and wang ) and the write-up on the mag website do make it clear they used vector-based convolutional neural networks which relied on skip-gram (mikolov et al. ) embeddings and bag-of-words/entities features to create their vectors (“microsoft academic increases power of semantic search by adding more fields of study - microsoft research” ). one really interesting part of the machine learning method used by microsoft was that it did not rely only on information from the article being categorized. it also utilized the mag’s information about citations to and references from the article, using the fos assigned to those citations and references to influence the fos of the original article.

the identification of potential fos and their assignment to articles was only a part of microsoft’s purpose. in order to fully index the mag and make it searchable, they also wished to determine the relationships between the fos; in other words, they wanted to build a hierarchical taxonomy. to achieve this they used the article categorizations and defined a field of study a as the parent of b if the articles categorized as b were close to a subset of the articles categorized as a (a more formal definition can be found in shen, ma, and wang , ; a toy version of the idea is sketched below). this work, which created a six-level hierarchy, was mostly automated, but microsoft did inspect and manually adjust the relationships between fos on the highest two levels.

to evaluate the quality of their fos taxonomy and categorization work, microsoft randomly sampled data at each of the three steps of the project and used human judges to assess their accuracy. the accuracy assessments of the three steps were not as complete as the evaluations in the mathematics projects, which measured performance across the whole of their datasets, but the projects are of very different scales, so different methods are appropriate. in the end microsoft estimates the accuracy of the fos at . %, the article categorization at . %, and the hierarchy at % (shen, ma, and wang , ). since msc was created by humans, there is no meaningful way to compare the fos accuracy measurements, but the categorization accuracy falls somewhere between that of the two mathematics projects. this is a very impressive result, especially when the aforementioned scale is taken into account. instead of trying to replace the work of humans categorizing mathematics articles indexed in a database, which for was , items in mathscinet[footnoteref: ] and , in zbmath[footnoteref: ], it is trying to replace the human categorization of all items indexed in mag, which was , , in .[footnoteref: ] both zbmath and mathscinet were capable of providing the human labor to do the work of assigning msc values to the mathematics articles they indexed in . therefore, using an automated categorization, which at best could only get the top level right with % accuracy, was not the right path. on the other hand, it seems clear that no one could feasibly provide the human labor to categorize all articles indexed by mag in , so an % accurate categorization is a significant accomplishment.
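to make that parent-child heuristic concrete, here is a toy sketch in python; the overlap threshold and the article sets are my own invented stand-ins, not microsoft’s published definition.

```python
def is_parent(articles_a, articles_b, threshold=0.9):
    # a is b's parent if nearly all articles tagged b are also tagged a,
    # and a is the broader (larger) field of the two
    overlap = len(articles_b & articles_a) / len(articles_b)
    return overlap >= threshold and len(articles_a) > len(articles_b)

machine_learning = {"p1", "p2", "p3", "p4", "p5", "p6"}
neural_networks = {"p2", "p3", "p4"}
print(is_parent(machine_learning, neural_networks))  # True: a near-subset relation
```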
to go back to the nail and hammer analogy, microsoft may have used a sledgehammer, but they were hammering a rather giant nail. [ : https://mathscinet.ams.org/mathscinet/search/publications.html?dr=pubyear&yrop=eq&arg = ] [ : https://zbmath.org/?q=py% a ] [ : https://academic.microsoft.com/publications/ ]

are you sure it’s a nail?

i started this chapter talking about how we have all been told that ai and machine learning were going to revolutionize everything in the world: that they were the hammers and all the world’s problems were nails. i found that this was not the case when we tried to employ machine learning, in an admittedly rather naive fashion, to automatically categorize mathematical articles. from the other examples i included, it is also clear computational experts find the automatic categorization of highly technical content a hard problem to tackle, one where success is very much dependent on what it is being measured against. in the case of classifying mathematics, machine learning can do a decent job, but not well enough to compete with humans. in the case of classifying everything, scale gives machines an edge, as long as you have the computational power and knowledge wielded by a company like microsoft.

this collection is about the intersection of ai, machine learning, deep learning, and libraries. while there are definitely problems in libraries where these techniques will be the answer, i think it is important to pause, think, and be sure artificial intelligence techniques are the most effective approach before trying to use them. libraries, even those like the one i work in, which are lucky enough to boast of incredibly talented it departments, do not tend to have access to a large amount of unused computational power or numerous experts in bleeding-edge ai. they are also rather notoriously limited budget-wise and would likely have to decide between existing budget items and developing an in-house machine learning program.
those realities, combined with the legitimate questions that can be raised about the efficacy of machine learning and ai with respect to the types of problems a library may encounter, such as categorizing the contents of highly technical articles, make me worry. while there will be many cases where using ai makes sense, i want to be sure libraries are asking themselves a lot of questions before starting to use it. questions like: is this problem large enough in scale to substitute machines for human labor, given that machines will likely be less accurate? or: will using machines to solve this problem cost us more in equipment and highly technical staff than our current solution, and does that estimate factor in the people and services a library may need to cut to afford them? not to mention: is this really a problem, or are we just looking for a way to employ machine learning so we can say that we did? in the cases where the answers to these questions are yes, it will make sense for libraries to employ machine learning. i just want libraries to think carefully about how they approach problems and solutions, to make sure that their problem is, in fact, a nail, and then to look even closer and make sure it is the type of nail a machine-learning hammer can hit.

references

barthel, simon, sascha tönnies, and wolf-tilo balke. . “large-scale experiments for mathematical document classification.” in digital libraries: social media and community networks, edited by shalini r. urs, jin-cheon na, and george buchanan, – . cham: springer international publishing.
clark, jack. . “why was a breakthrough year in artificial intelligence.” bloomberg, december , . https://www.bloomberg.com/news/articles/ - - /why- -was-a-breakthrough-year-in-artificial-intelligence.
gandhi, rohith. . “support vector machine — introduction to machine learning algorithms.” medium. july , . https://towardsdatascience.com/support-vector-machine-introduction-to-machine-learning-algorithms- a fca .
“geodeepdive: project overview.” n.d. accessed may , . https://geodeepdive.org/about.html.
kelly, kevin. . “the three breakthroughs that have finally unleashed ai on the world.” wired, october , . https://www.wired.com/ / /future-of-artificial-intelligence/.
“microsoft academic increases power of semantic search by adding more fields of study.” . microsoft academic (blog). february , . https://www.microsoft.com/en-us/research/project/academic/articles/microsoft-academic-increases-power-semantic-search-adding-fields-study/.
mikolov, tomas, ilya sutskever, kai chen, greg s. corrado, and jeff dean. . “distributed representations of words and phrases and their compositionality.” in advances in neural information processing systems , edited by c. j. c. burges, l. bottou, m. welling, z. ghahramani, and k. q. weinberger, – . curran associates, inc. http://papers.nips.cc/paper/ -distributed-representations-of-words-and-phrases-and-their-compositionality.pdf.
parloff, roger. . “from : why deep learning is suddenly changing your life.” fortune. september , . https://fortune.com/longform/ai-artificial-intelligence-deep-machine-learning/.
rahimi, ali, and benjamin recht. . “back when we were kids.” presentation at the nips conference. https://www.youtube.com/watch?v=qi yry tqe.
Řehůřek, radim, and petr sojka. .
Řehůřek, radim, and petr sojka. . "automated classification and categorization of mathematical knowledge." in intelligent computer mathematics, edited by serge autexier, john campbell, julio rubio, volker sorge, masakazu suzuki, and freek wiedijk. berlin: springer-verlag.
sangwani, gaurav. . " is the year of machine learning. here's why." business insider, january . https://www.businessinsider.in/ -is-the-year-of-machine-learning-heres-why/articleshow/ .cms.
shen, zhihong, hao ma, and kuansan wang. 2018. "a web-scale system for scientific knowledge exploration." paper presented at the 56th annual meeting of the association for computational linguistics, melbourne, july 2018. http://arxiv.org/abs/ . .
tank, aytekin. . "this is the year of the machine learning revolution." entrepreneur, january . https://www.entrepreneur.com/article/ .

bringing algorithms and machine learning into library collections and services

eric lease morgan
june

abstract

many aspects of librarianship have been automated to one degree or another. now, in a time of "big data," it is possible to go beyond mere automation and towards the more intelligent use of computers; the use of algorithms and machine learning is an integral part of future library collection building and service provision. to make the point, this chapter first highlights a number of changes in librarianship that were deemed revolutionary in their time but are now taken for granted. second, this essay compares and contrasts library automation with the exploitation of computer functionality. finally, this chapter outlines both a number of possible machine learning applications for libraries as well as a few real-world use cases. librarianship is evolutionary, and the use of machine learning is a part of librarianship's evolution. this chapter outlines how and why.

seemingly revolutionary changes

at the time of their implementation, some changes in the practice of librarianship were deemed revolutionary, but nowadays some of these same changes are deemed matter of fact. take, for example, the catalog. during much of the middle ages, a catalog was more akin to a simple acquisitions list. eventually the first author, title, and subject catalog was created (loc , ). these catalogs morphed into books, books which could be mass produced and distributed. but the books were difficult to keep up to date, and they were expensive to print. as a consequence, in the nineteenth century, the card catalog was invented by ezra abbot, and the catalog eventually became a massive set of drawers. unfortunately, because of the way catalog cards are produced, it is not feasible to assign more than three or four subject headings to any given book. if one does, then the number of catalog cards quickly gets out of hand. in the early twentieth century, the idea of sharing catalog cards between libraries became common, and the library of congress facilitated much of the distribution (loc , ). with the advent of computers, the idea of sharing cataloging data as marc (machine readable cataloging) became prevalent (crawford , ). the data structure of a marc record is indicative of the time.
intended to be distributed on reel-to-reel tape, the marc record is a sequential data structure designed to be read from beginning to end, complete with checks and balances ensuring the record's integrity. despite the apparent flexibility of a digital data structure, the tradition of three or four subject headings per book still holds true. nowadays, the data from marc records is used to fill databases, the databases' content is indexed, and items from the library collection are located by searching the index. the evolution of the venerable library catalog has spanned centuries, each evolutionary change solving some problems but creating new ones. with the advent of the internet, a host of other changes are (still) happening in libraries. some of them are seen as revolutionary, and only time will tell whether or not these changes will persevere. examples include but are not limited to:

· the advocacy of alt-metrics and open access publications
· the continuing dichotomy of the virtual library and library as place
· the creation and maintenance of institutional repositories
· the existence of digital scholarship centers
· the increasing tendency to license instead of own content

many of the traditional roles of libraries are not as important as they used to be. that does not mean the roles are unimportant, just not as important. like many other professions, librarianship is exploring new ways to remain relevant when many of its core functions are needed by fewer people.

working smarter, not harder

beyond automation, librarianship has not exploited computer technology. despite the fact that libraries have the world of knowledge at their fingertips, libraries do not operate very intelligently, where "intelligently" is an allusion to artificial intelligence. let's enumerate the core functionalities of computers. first of all, computers... compute. they are given some sort of input, assign the input to a variable, apply any number of functions to the variable, and output the result. this process — computing — is akin to solving simple algebraic equations such as the area of a circle or a distance traveled. there are two factors of particular interest here. first, the input can be as simple as a number or a string (read: "a word"), or the input can be arbitrarily large combinations of both. examples include:

· xyzzy
· george washington
· a marc record
· the circulation history and academic characteristics of an individual
· the full text and bibliographic descriptions of all early american authors

what is really important is the possible scale of a computer's input. libraries have not taken advantage of that scale. imagine how librarianship would change if the profession actively used the full text of its collections to enhance bibliographic description and resulting public service. imagine how collection policies and patron needs could be better articulated if: 1) students, researchers, or scholars first opted in to have their records analyzed, and 2) the totality of circulation histories and journal usage histories were thoroughly investigated in combination with patron characteristics and data from other libraries. a second core functionality of computers is their ability to save, organize, and retrieve vast amounts of data. more specifically, computers save "data" — mere numbers and strings. but when the data is given context, such as a number denoted as a date or a string denoted as a name, then the data is transformed into information. an example might include the birth year and the name of my pet, blake. given additional information, which may be compared and contrasted with other information, knowledge can be created -- information put to use and understood. for example, comparing the birth year of mary, my sister, with blake's birth year tells us how many years older than blake she is.
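a hedged sketch in python makes this progression concrete; the birth years below are invented stand-ins, not facts from this chapter.

# knowledge.py - data, information, and knowledge in miniature; the
# values here are invented stand-ins

# data: a mere number and a mere string
print( ( 2010, 'blake' ) )

# information: the same kinds of values, now given context
blake = { 'name': 'blake', 'born': 2010 }
mary = { 'name': 'mary', 'born': 1985 }
print( blake )

# knowledge: information compared, contrasted, and put to use
difference = blake[ 'born' ] - mary[ 'born' ]
print( '%s is %s years older than %s' % ( mary[ 'name' ], difference, blake[ 'name' ] ) )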
computers excel at saving, organizing, and retrieving data, which leads to information and knowledge. the possibilities of computers dispensing wisdom — knowledge of a timeless nature — are left for another essay. like the scale of computer input, the library profession has not really exploited computers' ability to save, organize, and retrieve data; on the whole, the library profession does not understand the concept of a "data structure." for example, tab-delimited files, csv (comma-separated value) files, relational database schema, xml files, json files, and the content of email messages or http server responses are all examples of different types of data structures. each has its own set of inherent strengths and weaknesses; there is no such thing as "one size fits all." through the use of data structures, computers store and retrieve information. librarianship is about these same kinds of things, yet few librarians would be able to outline the differences between different data structures. again, data becomes information when it is given context. in the world of marc, when a string (one or more "words") is inserted into the title field of a marc bibliographic record, then the string is denoted as a title. in this case, marc is a "data structure" because different fields denote different contexts. there are fields for authors, subjects, notes, added entries, etc. this is all very well and good, especially considering that marc was designed more than fifty years ago. but since then, many more scalable, flexible, and efficient data structures have been designed. relational databases are a good example. relational databases build on a classic data structure known as the "table" -- a matrix of rows and columns where each row is a record and each column is a field. think "spreadsheet." for example, each row may represent a book, with columns for authors, titles, dates, publishers, etc. the problem comes when a column needs to be repeatable. for example, a book may have multiple authors or, more commonly, multiple subjects. in this case the idea of a table breaks down because it doesn't make sense to have a column named subject-1, subject-2, and subject-3. as soon as you do that, you will want subject-4. relational databases solve this problem. the solution is to first add a "key" — a unique value — to each row. next, for fields with multiple values, create a new table where one of the columns is the key from the first table and the other column is a value, in this case, a subject heading. there are now two tables, and they can be "joined" through the use of the key. given such a data structure it is possible to add as many subjects as desired to any bibliographic item; a short sketch of this two-table design appears below. but you say, "marc can handle multiple subjects." true, marc can handle multiple subjects, but underneath, marc is a data structure designed for a time when information was disseminated on tape. as such, it is a sequential data structure intended to be read from beginning to end. it is not a random access structure.
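to make the two-table design concrete, here is a minimal sketch using python's built-in sqlite3 module; the schema, table names, and sample data are all invented for illustration, not drawn from any particular catalog.

# join.py - a minimal sketch of the two-table solution for repeatable
# fields; the schema and the sample data are invented for illustration
import sqlite3

# create a throw-away, in-memory database
connection = sqlite3.connect( ':memory:' )
cursor = connection.cursor()

# the first table gives each bibliographic item a unique key
cursor.execute( 'create table books ( id integer primary key, title text )' )

# the second table repeats that key once per subject heading
cursor.execute( 'create table subjects ( book_id integer, subject text )' )

# add a book and as many subjects as desired
cursor.execute( "insert into books values ( 1, 'walden' )" )
cursor.executemany( 'insert into subjects values ( ?, ? )',
	[ ( 1, 'natural history' ), ( 1, 'simple living' ), ( 1, 'solitude' ) ] )

# "join" the two tables through the key, listing the book with each of its subjects
query = 'select books.title, subjects.subject from books join subjects on books.id = subjects.book_id'
for row in cursor.execute( query ) : print( row )

# clean up, and done
connection.close()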
what's more, the marc data structure is really divided into three substructures: 1) the leader, which is always twenty-four characters long, 2) the directory, which denotes where each bibliographic field exists, and 3) the bibliographic section, where the bibliographic information is actually stored. it gets more complicated. the first five characters of the leader are expected to be a left-hand, zero-padded integer denoting the length of the record measured in bytes. a typical value may be something like 00714, meaning the record is 714 bytes long. now, ask yourself, "what is the maximum size of a marc record?" despite the fact that librarianship embraces the idea of marc, very few librarians really understand the structure of marc data. marc is a format for transmitting data from one place to another, not for organization. moreover, libraries offer more than bibliographic information. there is information about people and organizations. information about resource usage. information about licensing. information about resources that are not bibliographic, such as images or data sets. etc. when these types of information present themselves, libraries fall back to the use of simple tables, which are usually not amenable to turning data into information. there are many different data structures. xml became popular about twenty years ago. since then json has become prevalent. more than twenty years ago the idea of linked data was presented. all of these data structures have various strengths and weaknesses. none of them is perfect, and each addresses different needs, but they are all better than marc when it comes to organizing data. libraries understand the concept of manifesting data as information, but as a whole, libraries do not manifest the concept using computer technology. finally, another core functionality of computers is networking and communication. the advent of the internet is a relatively recent phenomenon, and the ubiquitous nature of computers combined with other "smart" devices has facilitated literally billions of connections between computers (and people). consequently, the data computed upon and stored in one place can be transmitted almost instantly to another place, and the transmission is an exact copy. again, like the process of computing and the process of storage, efficient computer communication builds upon itself with unforeseen consequences. for example, who predicted the demise of many centralized information authorities? with the advent of the internet there is less of a need/desire for travel agents, movie reviewers, or, dare i say it, libraries. yet again, libraries use the internet, but do they actually exploit it? how many librarians are able to create a file, put it on the web, and share the resulting url? granted, centralized computing departments and networking administrators put up roadblocks to doing such things, but the sharing of data and information is at the core of librarianship. putting a file on the 'net, even temporarily, is something every librarian ought to know how (and be authorized) to do. despite the functionality of computers and their place in libraries over the past fifty to sixty years, computers have mostly been used to automate library tasks. marc automated the process of printing catalog cards and eventually the creation of "discovery systems." libraries have used computers to automate the process of lending materials between themselves as well as to local learners, teachers, and scholars.
libraries use computers to store, organize, preserve, and disseminate the gray literature of our time, and we call these systems "institutional repositories." in all of these cases, the automation has been a good thing because efficiencies were gained, but the use of computers has not gone far enough nor really evolved. lending and usage statistics are not routinely harvested nor organized for the purposes of monitoring and predicting library patron needs/desires. the content of institutional repositories is usually born digital, but libraries have not exploited its full-text nature nor created services going beyond rudimentary catalogs. computers can do so much more for libraries than mere automation. while i will never say computers are "smart," their fundamental characteristics do appear intelligent, especially when used at scale. the scale of computing has significantly changed in the past ten years, and with this change the concept of "machine learning" has become more feasible. the following sections outline how libraries can go beyond automation, embrace machine learning, and truly evolve their ideas of collections and services.

machine learning: what it is, possibilities, and use cases

machine learning is a computing process used to make decisions and predictions. in the past, computer-aided decision-making and predictions were accomplished by articulating large sets of if-then statements and navigating down decision trees. the applications were extremely domain specific, and they weren't very scalable. machine learning turns this process on its head. instead of navigating down a tree, machine learning takes sets of previously made observations (think "decisions"), identifies patterns and anomalies in the observations, and saves the result as a mathematical model, which is really an n-dimensional array of vectors. outside observations are then compared to the model, and depending on the resulting similarities or differences, decisions or predictions are drawn. using such a process, there are really only four different types of machine learning: classification, clustering, regression, and dimension reduction. classification is a supervised machine learning process used to subdivide a set of observations into smaller sets which have been previously articulated. for example, suppose you had a few categories of restaurants such as american, french, italian, or chinese. given a set of previously classified menus, one could create a model defining each category and then classify new, unseen menus. the classic classification example is the filtering of email: "is this message 'spam' or 'ham'?" this chapter's appendix walks a person through the creation of a simplified classification system. it classifies texts based on authorship. clustering is almost always an unsupervised machine learning process which also creates smaller sets from a larger one, but clustering is not given a set of previously articulated categories. that is what makes it "unsupervised." instead, the categories are created as an end result. topic modeling is a popular example of clustering. regression predicts a numeric value from a set of independent variables. for example, given independent variables like annual income, education level, size of family, age, gender, religion, and employment status, one might predict the value of a dependent variable, such as how much money a person may spend on charity. sometimes the number of characteristics of each observation is very large.
many times some of these characteristics do not play a significant role in decision-making or prediction. dimension reduction is another machine learning process, and it is used to eliminate these less-than-useful characteristics from the observations. this process simplifies classification, clustering, or regression.

some possible use cases

there are many possible ways to enhance library collections and services through the use of machine learning. i'm not necessarily advocating the implementation of any of the following ideas, but they are possibilities. each is grouped into the broadest of library functional departments:

· reference and public services
· given a set of grant proposals, suggest library resources be used in support of the grants
· given a set of licensed library resources and their usage, suggest other resources for use
· given a set of previously checked out materials, suggest other materials to be checked out
· given a set of reference interviews, create a chatbot to supplement reference services
· given the full text of a set of desirable journal articles, create a search strategy to be applied against any number of bibliographic indexes; answer the proverbial question, "can you help me find more like this one?"
· given the full text of articles as well as their bibliographic descriptions, predict and describe the sorts of things a specific journal title accepts, or whether a given draft is good enough for publication
· given the full text of reading materials assigned in a class, suggest library resources to support them
· technical services
· given a set of multimedia, enumerate characteristics of the media (number of faces, direction of angles, number and types of colors, etc.), and use the results to supplement bibliographic description
· given a set of previously cataloged items, determine whether or not the cataloging can be improved
· given full-text content harvested from just about anywhere, analyze the content in terms of natural language processing, and supplement bibliographic description
· collections
· given circulation histories, articulate more refined circulation patterns, and use the results to refine collection development policies
· given the full text of sets of theses and dissertations, predict where scholarship at your institution is growing, and use the results to more intelligently build your just-in-case collection; do the same thing with faculty publications

implementing any of these possible use cases would necessarily be a collaborative effort. implementation requires an array of expertise. enumerated in no priority order, this expertise includes: subject/domain expertise (such as cataloging trends, circulation services, collection strategies, etc.), computer programming and data management skills (such as python, r, relational databases, json, etc.), and statistical modeling (an understanding of the strengths and weaknesses of different machine learning algorithms). the team would then need to:

1. articulate and share a common goal for the work
2. amass the data to model
3. employ a feature extraction process (lower-case words, extract a value from a database, etc.)
4. vectorize the features
5. create and evaluate the resulting model
6. return to an earlier step and repeat until satisfied
7. put the model into practice
8. return to the beginning; this work is never done

for example, to bibliographically connect grant proposals to library resources, try this: 1) use classification to subdivide each of your bibliographic index descriptions;
2) apply the resulting model to the full text of the grants; 3) return a percentage score denoting the strength of each resulting classification; and 4) recommend the use of zero or more bibliographic indexes. to predict scholarship, try this: 1) amass the full text and bibliographic descriptions of all theses and dissertations; 2) topic model the full text; 3) evaluate the resulting topics, repeating the modeling until satisfied; 4) augment the model's matrix of vectors with bibliographic description; 5) pivot the matrix on any of the given bibliographic fields; 6) plot the results to see possible trends over time, trends within disciplines, etc.; and 7) use the results to make decisions. the content of the github repository reproduced in this chapter's appendix describes how to do something very similar in method to the previous example.[footnoteref: ] [ : see https://github.com/ericleasemorgan/bringing-algorithms.]

some real-world use cases

here at the university of notre dame's navari center for digital scholarship, we use machine learning in a number of ways. we cut our teeth on a system called convocate.[footnoteref: ] in this case we obtained a set of literature on the theme of human rights. half of the set was written by researchers in non-governmental organizations. the other half was written by theologians. while both sets were on the same theme, the language of each was different. an excellent example is the use of the word "child." in the former set, children were included in documents about fathers and mothers. in the latter set, children often referred to the "children of god." consequently, queries referring to children were often misleading. to rectify this problem, a set of broad themes was articulated, such as actors, harms and violations, rights and freedoms, and principles and values. we then used topic modeling to subdivide all of the paragraphs of all of the documents into smaller and smaller sets of paragraphs. we compared the resulting topics to the broad themes, and when we found correlations between the two, we classified the paragraphs accordingly. because the process required a great deal of human intervention, and thus impeded subsequent updates, this process was not ideal, but we were learning, and the resulting index is useful. [ : see https://convocate.nd.edu.] on a regular basis we find ourselves using a program called topic modeling tool, which is a gui/desktop application heavily based on the venerable mallet suite of software.[footnoteref: ] given a set of plain text files and an integer, topic modeling tool will create a weighted list of latent themes found in a corpus. each theme is really a list of words which tend to cluster around each other, and these clusters are generated through the use of an algorithm called lda (latent dirichlet allocation). when it comes to topic modeling, there is no such thing as the correct number of topics. just as in the traditional process of denoting what a corpus is about, there can be many distinct topics or there can be a few. moreover, some of the topics may be large and others may be small. when using a topic modeler, it is important to iteratively configure and re-configure the input until the results seem to make sense. [ : see https://github.com/senderle/topic-modeling-tool for the topic modeling tool. see http://mallet.cs.umass.edu for mallet.] just like every other machine learning application, topic modeling tool bases its "reasoning" on a matrix of vectors. each row represents a document, and each column is a topic. at the intersection of a document row and a topic column is a score denoting how much the given document is "about" the calculated topic. it is then possible to sum each topic column and output a pie chart illustrating not only what the topics are, but how much of the corpus is about each topic. such can be very insightful.
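this document-topic matrix is easy to see with a few lines of code. the following is a minimal sketch with an invented four-document corpus, and with scikit-learn's implementation of lda standing in for mallet's; nothing here is drawn from topic modeling tool itself.

# topics.py - a minimal sketch of a document-topic matrix; the corpus is
# invented, and scikit-learn's lda stands in for mallet's
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# a tiny, made-up corpus; real work would read .txt files from disk
documents = [ 'cats and dogs and cats', 'dogs chase cats',
	'stars and planets and stars', 'planets orbit stars' ]

# count/tabulate ("vectorize") the corpus
vectorizer = CountVectorizer( stop_words='english' )
counts = vectorizer.fit_transform( documents )

# model two latent topics
lda = LatentDirichletAllocation( n_components=2, random_state=42 )
matrix = lda.fit_transform( counts )

# each row is a document, each column is a topic, and each cell scores
# how much the given document is "about" the given topic
for scores in matrix : print( scores )

# summing a topic column measures how much of the whole corpus is about
# that topic -- the pie-chart view described above
print( matrix.sum( axis=0 ) )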
by adding metadata to the matrix of vectors, even more insights can be garnered. suppose you have a set of plain text files. suppose also you know the names of the authors of each file. you can then do topic modeling against your corpus, and when the modeling is complete you can add a new column to the matrix and call it authors. next, you update the values in the authors column with author names. finally, you "pivot" the matrix on the authors column to calculate the degree each author's works are "about" the calculated topics. this too can be quite insightful. suppose you have works by authors a, b, c, and d. suppose you have calculated topics i, ii, iii, and iv. by updating the matrix and pivoting the results, you might discover that author a discusses topic i almost exclusively, whereas author b discusses topics i, ii, iii, and iv in equal parts. this process works for just about any type of metadata: gender, genre, extent, dates, language, etc. what's more, topic modeling tool makes this process almost trivial. to learn how, see the github repository accompanying this chapter.[footnoteref: ] [ : https://github.com/ericleasemorgan/bringing-algorithms.] we have used classification techniques in at least a couple of ways. one project required the classification of press releases. some press releases are deemed mandatory — declared necessary to publish. other press releases are considered discretionary — published at the will of a company. the domain expert needed a large set of press releases classified into either mandatory or discretionary piles. we used a process very similar to the process outlined in this chapter's appendix. in the end, the domain expert deemed the classification process accurate often enough, and this was good enough for them. in another project, we tried to identify articles about a particular yeast (cryptococcus neoformans), despite the fact that the articles never mentioned the given yeast. this project failed because we were unable to generate a sufficiently high accuracy score, and this was deemed not good enough. we are developing a high-performance computing system called the distant reader, which uses machine learning to do natural language processing against an arbitrarily large volume of text. given one or more documents of just about any number or type, the distant reader will:

1. amass the documents
2. convert the documents into plain text
3. do rudimentary counts and tabulations against the plain text
4. calculate statistically significant keywords against the plain text
5. extract narrative summaries against the plain text
6. use spacy (a natural language processing library) to classify each and every feature of each and every sentence into parts-of-speech and/or named entities (a short sketch of this step follows the list)[footnoteref: ] [ : see https://spacy.io.]
7. save the results of the preceding steps as plain text and tab-delimited files
8. distill the tab-delimited files into an sqlite database
9. create both narrative as well as tabular reports against the database
10. create an archive (.zip file) of everything
11. return the archive to the student, researcher, or scholar
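for step 6 above, here is a hedged sketch of what the spacy calls might look like; the model name (en_core_web_sm) and the sample sentence are assumptions of mine, not details taken from the distant reader, and the model must be downloaded before use (python -m spacy download en_core_web_sm).

# nlp.py - a minimal sketch of parts-of-speech and named-entity
# classification with spacy; the model name and sentence are assumptions
import spacy

# load a small english model and process a sentence
nlp = spacy.load( 'en_core_web_sm' )
document = nlp( 'The Library of Congress facilitated the distribution of catalog cards.' )

# enumerate each feature (token) and its part-of-speech
for token in document : print( token.text, token.pos_ )

# enumerate the named entities
for entity in document.ents : print( entity.text, entity.label_ )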
the student, researcher, or scholar can then analyze the contents of the .zip file to get a better understanding of its contents. this analysis ("reading") ranges from perusing the narrative reports, to using desktop tools to visualize the data, to exploiting command-line tools to investigate the data, to writing software which uses the data as input. the distant reader scales from a single scholarly report, to hundreds of book-length documents, to thousands of journal articles. its purpose is to supplement the traditional reading process, and it uses machine learning techniques at its core.

summary and conclusion

computers and libraries are a natural fit. they both excel at the collection, organization, and dissemination of data, information, and knowledge. compared to most professions, the practice of librarianship has used computers for a very long time. but, for the most part, the functionality of computers in libraries has not been fully exploited. advances in machine learning coupled with the data/information found in libraries present an opportunity for both librarianship and the people whom libraries serve. machine learning can be used to enhance library collections and services, and with a modest investment of time as well as resources, the profession can make it a reality.

references

crawford, walt. . marc for library use: understanding integrated usmarc. 2nd ed. boston: g.k. hall.
loc (library of congress). . the card catalog: books, cards, and literary treasures. san francisco: chronicle books.

appendix: train and classify

this appendix lists two python programs. the first (train.py) creates a model for the classification of plain text files. the second (classify.py) uses the output of the first to classify other plain text files. for your convenience, the scripts and some sample data ought to be available in a github repository.[footnoteref: ] the purpose of including these two scripts is to help demystify the process of machine learning. [ : https://github.com/ericleasemorgan/bringing-algorithms.]

train

the following python script is a simple classification training application. given a file name and a list of directories containing .txt files, this script first reads all of the files' contents and the names of their directories into sets of data and labels (think "categories"). it then divides the data and labels into training and testing sets. such is a best practice for these types of programs so the models can be evaluated for accuracy. next, the script counts and tabulates ("vectorizes") the training data and creates a model using a variation of the naive bayes algorithm. the script then vectorizes the test data, uses the model to classify the test data, and compares the resulting classifications to the originally supplied labels. the result is an accuracy score; generally speaking, a score well above chance is on the road to success, while a score of 50% is no better than flipping a coin. finally, the model is saved to a file for later use.
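assuming directories of .txt files named after their labels, the script might be invoked something like this (the file and directory names here are hypothetical):

python train.py model.pkl ./american ./french ./italian ./chinese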
# train.py - given a file name and a list of directories containing
# .txt files, create a model for classifying similar items

# require the libraries/modules that will do the work
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
import glob, os, pickle, sys

# sanity check; make sure the program has been given input
if len( sys.argv ) < 3 :
	sys.stderr.write( 'usage: ' + sys.argv[ 0 ] + " <model> <directory> [<directory> ...]\n" )
	quit()

# get the name of the file where the model will be saved
model = sys.argv[ 1 ]

# get the rest of the input, the names of directories to process
directories = []
for i in range( 2, len( sys.argv ) ) : directories.append( sys.argv[ i ] )

# initialize the data to analyze and its associated labels
data = []
labels = []

# loop through each given directory
for directory in directories :

	# find all the text files and get the directory's name
	files = glob.glob( directory + "/*.txt" )
	label = os.path.basename( directory )

	# process each file
	for file in files :

		# open the file
		with open( file, 'r' ) as handle :

			# add the contents of the file to the data
			data.append( handle.read() )

			# update the list of labels
			labels.append( label )

# divide the data/labels into training sets and testing sets; a best practice
data_train, data_test, labels_train, labels_test = train_test_split( data, labels )

# initialize a vectorizer, and then count/tabulate the training data
vectorizer = CountVectorizer( stop_words='english' )
data_train = vectorizer.fit_transform( data_train )

# initialize a classification model, and then use naive bayes to create a model
classifier = MultinomialNB()
classifier.fit( data_train, labels_train )

# count/tabulate the test data, and use the model to classify it
data_test = vectorizer.transform( data_test )
classifications = classifier.predict( data_test )

# begin to test for accuracy
count = 0

# loop through each test classification
for i in range( len( classifications ) ) :

	# increment, conditionally
	if classifications[ i ] == labels_test[ i ] : count += 1

# calculate and output the accuracy score; scores well above chance
# begin to achieve success
print( "accuracy: %s%%" % ( int( ( count * 1.0 ) / len( classifications ) * 100 ) ) )

# save the vectorizer and the classifier (the model) for future use, and done
with open( model, 'wb' ) as handle : pickle.dump( ( vectorizer, classifier ), handle )
exit()

classify

the following python script is a simple classification program. given the model created by the previous script (train.py) and a directory containing a set of .txt files, this script will output a suggested label ("classification") and a file name for each file in the given directory. this script automatically classifies a set of plain text files.
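again with hypothetical names, the script might be invoked something like this, where the directory contains the unclassified .txt files:

python classify.py model.pkl ./unclassified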
# classify.py - given a previously saved classification model and a
# directory of .txt files, classify a set of documents

# require the libraries/modules that will do the work
import glob, os, pickle, sys

# sanity check; make sure the program has been given input
if len( sys.argv ) != 3 :
	sys.stderr.write( 'usage: ' + sys.argv[ 0 ] + " <model> <directory>\n" )
	quit()

# get input; get the model to read and the directory containing the .txt files
model = sys.argv[ 1 ]
directory = sys.argv[ 2 ]

# read the model
with open( model, 'rb' ) as handle : ( vectorizer, classifier ) = pickle.load( handle )

# process each .txt file
for file in glob.glob( directory + "/*.txt" ) :

	# open, read, and classify the file
	with open( file, 'r' ) as handle :
		classification = classifier.predict( vectorizer.transform( [ handle.read() ] ) )

	# output the classification and the file's name
	print( "\t".join( ( classification[ 0 ], os.path.basename( file ) ) ) )

# done
exit()

artificial intelligence in the humanities: wolf in disguise, or digital revolution?

introduction

artificial intelligence, with its capacity for machine learning coupled with an almost humanlike understanding, sounds like the ideal tool for the humanities. instead of using primitive quantitative methods to count words or catalogue books, current advancements promise to reveal insights that otherwise could only be obtained by years of dedicated scholarship. but are these technologies imbued with intuition or understanding, and do they learn like humans? are they capable of developing their own perspective, and can they aid in qualitative research? in the 1980s and 1990s, as home computers were becoming more common, hollywood was sensationalizing the idea of smart, human-like artificially intelligent machines (ai) through movies such as terminator, blade runner, short circuit, and bicentennial man. at the same time, the home experience of personal computing highlighted the difference between hollywood's intelligent machines and the reality of how "dumb" machines really were. home machines, and even industry machines, could not answer natural language questions of anything but the simplest complexity. instead, users or programmers needed to painstakingly implement an algorithm to address their question. then, the user was required to wait for the machine to slavishly follow each programmed instruction while hoping that whoever entered the instructions did not make a mistake. despite the hollywood sensation of intelligent machines, people understood that computers did not and could not think like humans, but that they do excel at performing repetitive tasks with extreme speed and fidelity. this shaped the expectations for interacting with computers. computers became efficient tools that required specific instruction in order to achieve a desired outcome. computational technology and user experience drastically changed over the decades that followed. technology became much more intuitive to use while it also became much more powerful at handling large data sets. for instance, google can return search results for websites as a response to even the silliest or sparsest request, with a decent chance that the results are relevant to the question asked. did you read a manual before you used your smartphone, or did you, like everyone else, just "figure it out"? or, as a consequence of modern-day media and its on-demand services, children now ask to skip a song playing over a radio broadcast.
the older technologies quickly feel archaic. these technological advancements go hand in hand with developments in the field of machine learning and artificial intelligence. the automotive industry is on the cusp of fully self-driving cars. electronic assistants are not only keeping track of our dates and responding to spoken language, they will also soon start making our appointments by speaking to other humans on our behalf. databases are getting new voice-controlled intuitive interfaces, changing a typical incomprehensible "select avg(salary) from employeelist where yearhired > ;" to a spoken "average salary of our employees hired after ?" another phenomenon is the trend in many disciplines to go from "qualitative" to "quantitative" research, or to think about the "system" rather than the "components." the field that probably experienced this trend first was biology. while obviously descriptive about species of organisms, biologists have also always wanted to understand the mechanisms that drive life on earth, spanning micro to macro scales. consequently, a lot is known about the individual chemical components that constitute our metabolism, the components that drive cell division and dna replication, and which genes are involved in, for example, developmental processes. however, in many cases, our scientific knowledge only covers single functions of single components. in the context of the cell, the state of the organism and how other components interact matter a lot. cancer, for example, cannot be explained by a single mutation on a single gene but involves many complex interactions (hanahan and weinberg ). ecosystems don't collapse because a single insect dies, but because indirect changes in the food chain interact in complex ways (for a review of the different theories, see tilman ). as a result, systems biology emerged. systems biologists use large data sets and are often dependent on computer models to understand phenomena on the systems level. the field of bioinformatics is one example of an entire field that emerged as a result of using computers to study entire systems that were otherwise humanly intractable. the human genome project to sequence the complete human genome finished in 2003, a time when consumer data storage was limited by the amount of data that fit on a dvd (4.7 gb). while the human genome fits on a dvd, the data that came from the sequencing machines was much larger. short repetitive sequences first needed assembly, which at that time was a high-performance computing task. other fields have since undergone their own computational revolutions, and now the humanities are beginning theirs. computers have been a part of core library infrastructure and experience for some time now, cataloging entries in a database and allowing intuitive user exploration of that database. however, the digital humanities go beyond this (fitzpatrick ). the ability to analyze (crawl) extremely large corpora from different sources, to monitor the internet using the internet of things as large sensor arrays, and to detect patterns by using sophisticated algorithms can each produce a treasure trove of quantitative data. until this point these tasks could only be described or analyzed qualitatively. additionally, artificial intelligence promises models of the human mind (yampolskiy and fox ).
machine learning allows us to learn from these data sets in ways that exceed human capabilities, while an artificial brain would eventually allow us to objectively describe a subjective experience (through quantifying neural activations or positively and negatively associated memories). this would ultimately close the gap between quantitative and qualitative approaches by allowing an inspection of experience. however, this bridging between quantitative and qualitative methods creates a possible tension for the humanities, which historically define themselves by qualitative methodologies. when qualitative experiences or responses can be finely quantified, such as the sadness caused by reading a particular passage, or the curiosity caused by viewing certain works of art, then the field will undergo a revolution. when this happens, we will be able to quantify and discuss how sadness was learned by reading, or how much surprise was generated by viewing an artwork. this is exactly the point where the metaphors break down. current computational models of the mind are not sophisticated enough to allow these kinds of inferences. machine learning algorithms work well for what they do but have nothing to do with what a person would call learning. artificial intelligence is a broad, encompassing field. it includes methods that might have appeared to be magic only a couple of years ago (such as generative adversarial networks). algorithmic finesse resulting from these advances is capable of beating humans in chess (campbell, hoane jr, and hsu ), but it is only a very specialized algorithm that has nothing to do with the way humans play or learn chess. this means we are back to the problem we had decades ago. instead of being disappointed by the difference between modern technology and hollywood technology, we are disappointed by the difference between modern technology and the experience implied by the labels given to those technologies. applying misnomer terminology, such as "smart," "intelligent," "search," and "learning," to modern technologies that have little to do with those terms is misleading. it is possible that such technology was deliberately branded with these terms for improved marketing and sales, effectively redefining them and obscuring their original meaning. consequently, we again are disappointed by the mismatch between the expectations of our computing infrastructure and the reality of our experiences. the following paragraphs will explore current machine learning and artificial intelligence technologies, explain how quantitative or qualitative they really are, and explore the possible implications for the future digital humanities.

learning: phenomenon versus mechanism

learning is an electrochemical process that involves cells, their genetic makeup, and how they are interconnected. some interplay between external stimuli and receptor proteins in specialized sensor neurons leads to electrochemical signals propagating over a network of interconnected cells, which themselves respond with physical and genetic changes to said stimuli, probably also dependent on previous stimuli (kandel, schwartz, and jessell ). this concoction of elaborate terms might suggest that we know in principle which parts are involved and where they are, but we are far from an understanding of the learning mechanism. the description above is as generic as saying that a city functions because cars drive on streets.
even though we might know a lot about long-term potentiation or the mechanism of neurons that fire together wiring together (aka hebbian learning), neither of these processes actually mechanistically explains how learning works. neuroscience, neurophysiology, and cognitive science have not been able to discover this complete process in such a way that we can replicate it, though some inroads are being made (el-boustani et al. ). similarly, we find promising new interdisciplinary efforts like "cognitive computational neuroscience" that try to bridge the gap between neuro- and cognitive science and computation (kriegeskorte and douglas ). so, unfortunately, while the components involved can be identified, the question of "how learning works" cannot be answered mechanistically. however, a lot is known about the phenomenon of learning. it happens during the lifetime of an organism. what happens between the lifetimes of related organisms is an adaptive process called evolution: inheritance, variation, and natural selection over many generations, spanning billions of years here on earth, enabled populations of organisms to succeed in their environments in any way they could. evolutionary forces found ways for organisms to adapt to their environment during their own lifetimes. while this can take many forms, such as storing energy, seeking shelter, or mounting a fight-or-flight response, it has led to the phenomenon we now call learning. instead of discussing the diversity of learning in the animal kingdom, we will discuss the richest example: human learning. here, learning is defined as the cognitive adaptation to external stimulus. the phenomenon of learning can be observed as an increase in performance over time. learning makes the organism better at doing something. in humans, because we have language and a much higher degree of abstract thinking, an improvement in performance can be facilitated very quickly. while it takes time to learn how to juggle, the ability to find the mean of a series of samples can be quickly communicated by reading wikipedia. both types of lifetime adaptations are called learning. however, these lifetime adaptations are facilitated by two different cognitive processes: explicit and implicit learning.[footnoteref: ] explicit learning — or episodic memory — is fact-based memory. what you did yesterday, what happened in your childhood, or the list of things you should buy when you go shopping are all memories. currently, the engram theory best explains this mechanism (poo et al. elaborates on the origins of the term). explicit memory can be retrieved relatively easily and then used to inform future decisions: "press the green button if the capital of italy is paris; otherwise press the red one." the rate of learning for explicit memory can be much higher than for implicit memory, and it can also be communicated more quickly. abstract communication, such as "i saw a wolf," allows us to transfer the experience of seeing a wolf quickly to other individuals, even though their evoked explicit memory might not be identical to ours. [ : there are more than these two mechanisms, but these are the two major ones.] learning by using implicit memory — sometimes called procedural memory — is facilitated by much slower processes (schacter, chiu, and ochsner ). it is generally based on the idea that learning is a combination of expectation, observation or action, and internal model changes. for example, a recovering hospital patient who has suffered a stroke is handed an apple.
in this exchange, the patient forms an expectation of where his hand will be to accept the apple. he engages his muscles to move his forearm and hand to accept the apple, which is his action. then the patient observes that his arm did not arrive at the correct position (due to neurological damage). this discrepancy between expectation and action-outcome drives internal changes so that the patient's brain learns how to adequately control his arm. presumably, everything considered a skill is based on this process. while very flexible, this form of memory is not easily communicated nor fast to acquire. for instance, while juggling can be described, it cannot be communicated in such a way that it enables the recipient to juggle without additional training. this description of explicit and implicit learning is an amalgamation of many different hypotheses and observations. also, these processes are not as well segregated in practice as outlined here. what is important is what these two learning mechanisms are based on: observations lead to memory, and internal predictions together with exploration lead to improved models about the world. lastly, these learning processes only exist in organisms because they previously conferred an evolutionary advantage: organisms that could memorize and then act on those memories had more offspring than those that did not. this interaction of learning and evolution is called the baldwin effect (weber and depew ). organisms that could explore the environment, make predictions about it, and use observations to optimize their internal models were similarly more capable than organisms that could not.

machines do not learn; they are trained

now prepared with a proper intuition about learning, we can turn our attention to machine learning. after all, our intuitions should be meaningful in the computational domain as well, because learning always follows the same pattern. one might be disappointed when looking over the table of contents of a machine learning book and finding only methods for creating static transformation functions (see russell and norvig , one of the putative foundations of machine learning and ai). there will typically be a distinction between supervised and unsupervised learning, between categorical and continuous data, and maybe a section about other "smart" algorithms. you will not find a discussion about implicit and explicit memory, let alone methods for implementing these concepts. so, if these important sections in our imaginary machine learning book do not discuss the mechanisms of learning, then what are they discussing? unsupervised learning describes algorithms that report information based on associations within the data. clustering algorithms are a popular example of unsupervised learning. these use similarity between data points to form and report on distinct groups of data. clustering is a very important method, but it is only a well-designed algorithm that is not adaptive. supervised learning describes algorithms that refine a transformation function to convert a certain input into a certain output. the idea is to balance specificity and generality while refining, such that the transformation function correctly transforms all known examples but generalizes enough to work well on new variations. for example, we would like the machine to transform image data into textual labels, such as "house" or "car." the input is an image and the output is a label. the input image data are provided to the machine, and small adjustments to the machine's function are made depending on how well it provided the correct output. many iterations later, the ideal result is a machine that can transform all image data to correct labels, and even operate correctly on new variations of images not provided before. supervised learning is extremely powerful and is yet to be fully explored.
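the following is a minimal sketch of the two cases; the data points and labels are invented, and the particular algorithms (k-means for clustering, logistic regression for the supervised case) are my own choices for illustration, not ones named in this chapter.

# learners.py - unsupervised versus supervised in miniature; the data
# and the choice of algorithms are invented for illustration
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
import numpy

# a handful of two-dimensional observations forming two loose groups
points = numpy.array( [ [ 1, 2 ], [ 2, 3 ], [ 1, 4 ],
	[ 8, 8 ], [ 9, 10 ], [ 8, 9 ] ] )

# unsupervised: report distinct groups based on similarity alone;
# no previously articulated categories are given
print( KMeans( n_clusters=2, n_init=10 ).fit( points ).labels_ )

# supervised: refine a transformation function from inputs to
# previously articulated labels...
labels = [ 'small', 'small', 'small', 'large', 'large', 'large' ]
classifier = LogisticRegression().fit( points, labels )

# ...and hope the function generalizes to new, never-before-seen variations
print( classifier.predict( [ [ 2, 2 ], [ 9, 9 ] ] ) )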
however, supervised learning is quite dissimilar to actual learning. a common argument is that supervised learning uses feedback in a "student-teacher" paradigm of making changes with feedback until proper behavior is achieved, so it could be considered learning. but this feedback is external, objective, and not at all similar to our prediction and comparison model, which, for instance, operates without an all-knowing oracle whispering "good" or "bad" into our ears. humans and other organisms instead compare predictions with outcomes, and the choices are driven by an intersection of desire and prediction. what seems astonishing is the diverse and specialized capabilities that these two rather simple types of computation, clustering and classification, can produce. their economic impact is enormous, and we are still finding new ways to combine neural networks and exploit deep learning techniques to create amazing data transformations, such as deepfake videos. but so far, each astounding example of ai, through machine learning or some other method, is not showcasing all these capabilities as one machine, but instead each as an independently achieved computational marvel. each of these examples does only exactly what it was trained to do in a narrow domain and no more. siri, or any other voice assistant for that matter, does not drive a car (lópez, quesada, and guerrero ), watson does not play chess (ferrucci et al. ), and google's alphago cannot understand spoken language (gibney ). even hybrid approaches, such as combining speech recognition, chess playing, and autonomous driving, would only be a combination of specialty strategies, not a trained entity from the ground up. modern machine learning gives us an amazing collection of very applicable, but extremely specialized, computational tools that may be customized to particular data sets, but the resulting machines do not learn autonomously as you or i do. there are cutting-edge technologies, such as so-called neuromorphic chips (nawrocki, voyles, and shaheen ) and other computational brain models that more closely mimic brain function, but they are not what has been sensationalized in the media as machine learning or ai, and they have yet to showcase competence on difficult problems competitive with standard supervised learning. curiously, many people in the machine learning community defend the term "learning," arguing there is no difference between learning and training. in traditional machine learning, the trained algorithm is deployed as a service, after which it no longer improves. if the data set ever changes, then a new training set including correct labels needs to be generated and a new training phase initiated. however, if the teacher can be forever bundled with the learner and training continued during the deployment phase, even on new, never-before-seen data, then indeed the delineation between learning and training is far less clear.
approaches to such lifelong learning exist, and they are the objective of continuous delivery for machine learning, but they struggle with what is called catastrophic forgetting: the phenomenon in which only the most recent experiences are learned, at the expense of older ones (french ). unfortunately, creating a new training set is typically the most expensive endeavor in standard supervised machine learning development. adequate training then becomes difficult or impossible without involving thousands or millions of human inputs to keep up with training and using the online machine on an ever-evolving data set. some have tried to use such "human-in-the-loop" methods, but the resulting machine then becomes only a slight extension of the humans who are forever caught in the loop. is it an intelligent machine, or a human trapped in a machine? to combat this problem of generating the training set, researchers altered the standard supervised learning paradigm of flexible learner and rigid teacher, making the teacher likewise flexible so that it generates new data, continually probing the bounds of the student machine. this is the method of generative adversarial networks, or gans (goodfellow et al. ). the teacher generates training examples, and the student discerns between those generated examples and the original labeled training data. after many iterations, the teacher is improved to better fool the student, and the student is improved to better discern generated training data. as amazing as they are, gans only partially mitigate the problematic requirement for human-labeled training data, because gans can only mimic a known labeled distribution. if that distribution ever changes, then new labeled data must be generated, and again we have the same problem as before. unfortunately, gans have been sensationalized as magic, and the public and hobbyist expectation is that gans are a way toward much better artificial intelligence. disappointment is inevitable, because gans only allow us to explore what it would be like to have more training data from the same data sets we were using before. these expectations are important for machine learning and ai. we are very familiar with learning, to the point where our whole identity as humans could be generously defined as the result of being a monkey with an exceptional proclivity for learning. if we now approach ai and machine learning with expectations that these technologies learn as we do, or are an equally general-purpose intelligence, then we will be bitterly disappointed. the best example of such discrepancy is how easily neural networks trained by deep learning can be fooled. images that are seemingly identical and differ only by a few pixels are grossly misclassified, a mistake no human would make (nguyen, yosinski, and clune ). fortunately, we know about these biases and the possible shortcomings of these methods. as long as we have the right expectations, we can take their flaws into account and still enjoy the prospects they provide.

trained machines: tool or provocation?

on one side we have the natural sciences, characterized by hypothesis-driven experimentation that reduces reality to an abstract model of causal interactions. this approach can inform us about the consequences of our possible actions, but only as far into the future as the model can adequately predict. with machine learning and ai, we can move this temporal horizon of prediction farther into the future.
while weather models might still struggle to predict precipitation days in advance, global climate models predict in detail the effects of global warming in years. but these models are nihilistic, devoid of values, and cannot themselves answer the question of whether humans would prefer to live in one possible future or another. is sunshine better than rain? the humanities, on the other hand, are home to exactly these problems. what are our values? how do we understand what is essential? now that we know the facts, how should we choose? do we speak for everyone? the questions seem to be endless, but they are what makes our human experience so special, and what separates the humanities from the sciences.

labels – such as learning or intelligence – are too easily anthropomorphized. a technology branded in this way suggests human-like properties: intelligence, common sense, or even subjective opinion. from a name like “deep learning” we expect a system that develops a deep and intuitive understanding, with insights more profound than our own. however, these systems do not provide an alternative perspective; as explained above, they are only as good or as biased as the scientist selecting their training data. just because humans and machine learning are both black boxes, in the sense that their inner workings are opaque, does not mean they share other qualities. for instance, having labeled the ml training process as “learning” does not imply that ml algorithms are curious and learn from observations. while these new computerized quantitative measures might be welcomed by some scholars, others will view them as an existential threat to the very nature of the humanities. are these quantitative methods sneaking into the humanities disguised by anthropomorphic terms, like a wolf shrouded in a sheep’s fleece? from this viewpoint, having the wrong expectations not only invites disappointment but also floods the humanities with sophisticated technologies that dilute and muddy the nature of the qualitative research that makes the humanities special.

however, this imminent clash between quantitative and qualitative research also provides a unique opportunity. suppose there is a question that can only be answered subjectively and qualitatively. if so, it would define a hard boundary against the aforementioned reductionism of the purely causal quantitative approach. at the same time, such a boundary presents the perfect target for an artificially intelligent system to prove its utility. if a computational human analog can be created, then it must be capable of performing the same tasks as a humanities researcher. in other words, it must be able to answer subjective and qualitative questions, regardless of its computational and quantitative construction. failing at such a task would be equivalent to failing the famous turing test, thereby proving the ai is not yet human-like enough. in this way, the qualitative nature of the humanities poses a challenge – and maybe a threat – to artificially intelligent systems. while some might say the threat is mutual, past successes of interdisciplinary research suggest otherwise: the digital humanities could become the forefront of ai research.

beyond machine training, towards general purpose intelligence

currently, machines do not learn but must be trained, typically with human-labeled data. ml algorithms are not smart as we are, but they can solve specific tasks in sophisticated ways.
perhaps sentience will only be a product of enough time and training data, but the path to sentience probably requires more than time and data. the process that gave rise to human intelligence was evolution. this opportunistic process optimized brains over endless generations to perform ever-changing tasks, and it is the only known example of a process that resulted in such complex intelligence. none of the computational methods described earlier even remotely follows this paradigm: researchers designed ad hoc algorithms that solved well-defined problems. the next iteration of these methods is either an incremental improvement of existing code, a new methodological invention, or an application to a new data set. these improvements do not compound to make ai tools better generalists, but instead contribute to the diversity of the existing tools. one approach that does not suffer from these shortcomings is neuroevolution (floreano, dürr, and mattiussi ). currently, the field of neuroevolution is in its infancy, but finding new and creative solutions to otherwise unsolved problems, such as controlling robots and driving cars, is a popular area of focus (lehman et al. ). at the same time, memory formation (marstaller, hintze, and adami ), information integration in the brain (tononi ), and how systems evolve the ability to learn (sheneman, schossau, and hintze ) are also being researched, as they are building blocks of general purpose intelligence. while it is not clear how thinking machines will ultimately emerge, they are on the horizon. the dualism of a quantitative system that can be subjective and understand the qualitative nature of existence makes it a strange artifact that cannot be ignored.

references

campbell, murray, a joseph hoane jr, and feng-hsiung hsu. . “deep blue.” artificial intelligence ( - ): - .
el-boustani, sami, jacque p k ip, vincent breton-provencher, graham w knott, hiroyuki okuno, haruhiko bito, and mriganka sur. . “locally coordinated synaptic plasticity of visual cortex neurons in vivo.” science ( ): - .
ferrucci, david, anthony levas, sugato bagchi, david gondek, and erik t mueller. . “watson: beyond jeopardy!” artificial intelligence : - .
fitzpatrick, kathleen. . “the humanities, done digitally.” in debates in the digital humanities, edited by matthew k. gold, - . minneapolis: university of minnesota press.
floreano, dario, peter dürr, and claudio mattiussi. . “neuroevolution: from architectures to learning.” evolutionary intelligence ( ): - .
french, robert m. . “catastrophic forgetting in connectionist networks.” trends in cognitive sciences ( ): - .
gibney, elizabeth. . “google ai algorithm masters ancient game of go.” nature news ( ): .
goodfellow, ian, jean pouget-abadie, mehdi mirza, bing xu, david warde-farley, sherjil ozair, aaron courville, and yoshua bengio. . “generative adversarial nets.” in advances in neural information processing systems (nips ), edited by z. ghahramani, m. welling, c. cortes, n.d. lawrence, and k.q. weinberger, - . n.p.: neural information processing systems foundation.
hanahan, douglas, and robert a weinberg. . “hallmarks of cancer: the next generation.” cell ( ): - .
kandel, eric r, james h schwartz, and thomas m jessell. . principles of neural science. th ed. new york: mcgraw-hill.
kriegeskorte, nikolaus, and pamela k douglas. . “cognitive computational neuroscience.” nature neuroscience : - .
lehman, joel et al. . “the surprising creativity of digital evolution: a collection of anecdotes from the evolutionary computation and artificial life research communities.” artificial life ( ): - .
lópez, gustavo, luis quesada, and luis a guerrero. . “alexa vs. siri vs. cortana vs. google assistant: a comparison of speech-based natural user interfaces.” in international conference on applied human factors and ergonomics, edited by isabel l. nunes, - . cham: springer.
marstaller, lars, arend hintze, and christoph adami. . “the evolution of representation in simple cognitive networks.” neural computation ( ): - .
nawrocki, robert a, richard m voyles, and sean e shaheen. . “a mini review of neuromorphic architectures and implementations.” ieee transactions on electron devices ( ): - .
nguyen, anh, jason yosinski, and jeff clune. . “deep neural networks are easily fooled: high confidence predictions for unrecognizable images.” in proceedings of the ieee conference on computer vision and pattern recognition (cvpr), - . n.p.: ieee.
poo, mu-ming et al. . “what is memory? the present state of the engram.” bmc biology : - .
russell, stuart j, and peter norvig. . artificial intelligence: a modern approach. malaysia: pearson education limited.
schacter, daniel l, c-y peter chiu, and kevin n ochsner. . “implicit memory: a selective review.” annual review of neuroscience ( ): - .
sheneman, leigh, jory schossau, and arend hintze. . “the evolution of neuroplasticity and the effect on integrated information.” entropy ( ): - .
tilman, david. . “biodiversity: population versus ecosystem stability.” ecology ( ): - .
tononi, giulio. . “an information integration theory of consciousness.” bmc neuroscience : - .
weber, bruce h, and david j depew. . evolution and learning: the baldwin effect reconsidered. cambridge, ma: mit press.
yampolskiy, roman v, and joshua fox. . “artificial general intelligence and the human mental model.” in singularity hypotheses: a scientific and philosophical assessment, edited by ammon h. eden, james h. moor, johnny h. søraker, and erik steinhart, - . heidelberg: springer.

cultures of innovation: machine learning as a library service

sue wiegand

abstract

why should libraries innovate to provide machine learning instruction and technological tools as a library service? because libraries and librarians have a distinctive role to play in the crucial shaping of understanding that is made possible by new technologies. in the implementation of new ways to preserve, discover, and create knowledge, the inclusive library, where all disciplines come together, can be one place that provides this vital function, in collaboration with other cultural institutions.

introduction

libraries and librarians have always been concerned with the preservation of knowledge. to this traditional role, librarians later added a new function—discovery—teaching people to find and use the library’s collected scholarship. information literacy, now considered the signature pedagogy in library instruction, evolved from earlier bibliographic instruction. as digital literacy, the next stage, develops, students can come to the library to learn how to leverage the greatest strengths of machine learning. machines excel at recognizing patterns; researchers at all levels can experiment with innovative digital tools and strategies, and build 21st-century skillsets. librarian expertise in preservation, metadata, and sustainability through standards can be leveraged as a value-added service.
leading-edge librarians now invite all the curious to benefit from the knowledge contained in the scholarly canon, accessible through libraries as curated living collections in multiple formats at distributed locations, transformed into new knowledge using new ways to visualize and analyze scholarship. library collections themselves, including digitized, unique local collections, can provide the data for new insights and ways of knowing produced by machine learning. the library could also be viewed as a technology sandbox, a place to create knowledge, connect researchers, and bring together people, ideas, and new technologies. many libraries are already rising to this challenge, working with other cultural institutions in creating a culture of innovation as a new learning paradigm, exemplified by machine learning instruction and technology tool exploration.

library practice

the role of the library in preserving, discovering, and creating knowledge continues to evolve. originally, libraries came into being as collections to be preserved and disseminated, a central repository of knowledge, possibly for political reasons (ryholt and barjamovic , - ). libraries founded by scholars and devoted to learning came later, during the middle ages (casson , ). in more recent times, librarians began “[c]ollecting, organizing, and making information accessible to scholars and to citizens of a democratic republic” based on values developed during the enlightenment (bivens-tatum , ). bibliographic instruction in libraries, later information literacy, embodied the idea of learning in the library as the next step. now, librarians are contributing to and participating in the learning enterprise by partnering with the disciplines to produce new knowledge, completing the scholarly communications cycle of building on previous scholarship—“standing on the shoulders of giants.” one way to cultivate innovation in libraries is to implement machine learning in the library’s array of tools and services. machine learning is a method that can be applied in library practice by developing new tools, both behind the scenes and at the front end, developing standards, preserving the scholarly record, research data, and protocols, and refining metadata to enhance discovery. citation analysis of prospective collections for the library to collect, and of the institution’s research outputs, would provide valuable information both for further collection development and for developing researchers’ innovative tools. machine learning, with its predilection for finding patterns, would reveal gaps in the literature and open up new questions to be answered, solving problems and leading to innovation. for example, yewno, a multi-disciplinary platform that uses machine learning to help combat “information overload,” advertises that it “helps researchers, students, and educators to deeply explore knowledge across interdisciplinary fields, sparking new ideas along the way…” and “makes [government] information accessible by breaking open silos and comprehending the complicated interconnections across agencies and organizations,” among other applications to improve discovery (yewno n.d.).
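to give a flavor of how pattern finding might surface collection gaps, here is one naive sketch in python using scikit-learn: cluster the text of catalog records or abstracts and look for sparsely populated clusters. this is my own illustration, not yewno’s method or a technique proposed in the sources cited here; the documents and cluster count are hypothetical.

from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# hypothetical catalog descriptions or abstracts
docs = [
    "deep learning for image classification",
    "neural networks and computer vision",
    "metadata standards for digital archives",
    "cataloging practices and subject headings",
    "machine translation of historical texts",
]

vectors = TfidfVectorizer(stop_words="english").fit_transform(docs)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(vectors)

# unusually small clusters hint at thinly collected topics
for cluster, size in sorted(Counter(labels).items()):
    print("cluster", cluster, "holds", size, "records")

a real gap analysis would need far richer data (circulation, citations, holdings elsewhere) and human judgment about whether a sparse topic is a gap or simply out of scope.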
the library of congress hosted a summit as “part of a larger effort to learn about machine learning and the role it could play in helping the library of congress reach its strategic goals, such as enhancing discoverability of the library’s collections, building connections between users and the library’s digital holdings, and leveraging technology to serve creative communities and the general public” (jakeway ). integration of machine learning is already starting at high levels in the library world.

new services

a focus on machine learning can inspire new library services to enhance teaching and learning in the library. connecting people with ideas and with technology enables library spaces to be used as a learning service by networking researchers at all levels in the enterprise of knowledge creation. finding gaps in the literature would be a helpful first step in new library discovery tools. one way to do this is through a “researchers’ workstation,” a webpage toolkit that might start by using machine learning tools to automate alerts of new content in a narrow area of interest and help researchers at all levels find and focus on problem-solving. a researchers’ workstation could contain a collection of tools and learning modules to guide users through the phases of discovery. then, managing citations would be an important step in the process—storing, annotating, and sorting out the most relevant. starting research reports, keeping lab notebooks, finding datasets, and preserving the researcher’s own data are all relevant to the final results. a collaboration tool would enable researchers to find others with similar interests and share data or work collaboratively from anywhere, asynchronously. some of this functionality exists already, both in open source software and tools such as zotero for citation management, and in proprietary tools that combine multiple functions, such as mendeley from elsevier.[footnoteref: ] [ : see https://www.zotero.org and https://www.mendeley.com.] other commercial publishers are developing end-to-end tools to enable researchers to work within their proprietary platforms, from the point of searching for ideas and finding research gaps through the process of writing and submitting finished papers for publication. the confederation of open access repositories (coar) is similarly developing “next generation repositories” software integrating end-to-end tools for the open access literature archived in repositories, to “facilitate the development of new services on top of the collective network, including social networking, peer review, notifications, and usage assessment” (rodrigues et al. , ). what else might a researcher want to do that the library could include in a researchers’ workstation? finding, writing, and keeping track of grants could be incorporated at some level. generating a timeline might be helpful, and infographics and data visualizations could improve communication and help make the case for the importance of the study with others, especially the public and funders. project management tools might be welcomed by some researchers, too. finally, when it’s time to submit the idea (whether at the preliminary or preprint stage) to an arxiv-like repository or an institutional repository, as well as to journals of interest (also identified through machine learning tools), the process of submission, peer review, revision, and re-submission could be done seamlessly.
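as a small illustration of one workstation module, the automated alerts mentioned above could start as little more than a script that polls a public api for new papers on a narrow topic. the sketch below uses the real arxiv query api, but the search terms are placeholders and error handling is omitted; a production alert service would need scheduling, deduplication, and per-user preferences.

import urllib.request
import xml.etree.ElementTree as ET

# poll arxiv for the newest papers matching a (hypothetical) topic of interest
url = ("http://export.arxiv.org/api/query?"
       "search_query=all:%22machine+learning%22+AND+all:%22libraries%22"
       "&sortBy=submittedDate&sortOrder=descending&max_results=5")

with urllib.request.urlopen(url) as response:
    feed = response.read()

ns = {"atom": "http://www.w3.org/2005/Atom"}
for entry in ET.fromstring(feed).findall("atom:entry", ns):
    title = " ".join(entry.find("atom:title", ns).text.split())
    link = entry.find("atom:id", ns).text
    print(title, "-", link)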
the tools in a researchers’ workstation should be modular, flexible, interoperable, continuously updated, and easy to learn and use. the workstation would be a complete ecosystem for the research cycle – saving time in the scholarly communications process by providing one place to go for discovery, literature review, data management, collaboration, preprint posting, peer review, publication, and post-print commenting. it would gather in one place all the tools a researcher might want to employ, using machine learning and other innovations to augment human skills in the most effective ways, to solve intractable problems and create new knowledge.[footnoteref: ] [ : i wrote a blog post that mentions the idea (wiegand).]

collections as data, collections as resources

exemplified by the literature search, collections are the area that provides the greatest scope for library machine learning innovation to date, both applied and basic/theoretical. especially if the pathway to using the collections is clear and coherent, and the library provides instruction on why and how to use the various tools to save time and increase the impact of research, researchers will benefit from partnering with librarians. the always already computational: collections as data final report and project deliverables and the collections as data: part to whole project were designed to “develop models that support collections as data implementation and holistic reconceptualization of services and roles that support scholarly use…” (cite this). the latter specifically seeks “to create a framework and set of resources that guide libraries and other cultural heritage organizations in the development, description, and dissemination of collections that are readily amenable to computational analysis.” (cite this). as a more holistic approach to data-driven scholarship, these projects aim to provide access to large collections to enable computational use on the national level. some current library databases have already built this kind of functionality. jstor, for example, will provide up to , documents (or more at special request) in a dataset for analysis.[footnoteref: ] clarivate’s content as a service provides web of science data to accommodate multiple purposes.[footnoteref: ] besides the many freely available bibliodata sources, researchers can sign up for developer accounts in databases such as scopus to work with datasets for text mining and computational analysis.[footnoteref: ] using library-licensed collections as data could allow researchers to save time in reading a large corpus, stay updated on a topic of interest, analyze the most important topics in a given time period, confirm gaps in the research literature for investigation, and increase the efficiency of sifting through massive amounts of research in, for instance, the race to develop a vaccine (ong ; vamathevan ). [ : see https://www.jstor.org/dfr/about/dataset-services.] [ : see https://clarivate.com/search/?search=computational% datasets.] [ : see https://dev.elsevier.com/ and https://guides.lib.berkeley.edu/text-mining.]

learning spaces

machine learning is a concept that calls out for educating library users through all avenues, including library spaces.
taking a cue from other glam (galleries, libraries, archives, and museums) cultural institutions, especially galleries and museums, libraries and archives could mount exhibits and incorporate learning into library spaces as a form of outreach, teaching how and why using innovative tools will save time and improve efficiency. inspirational, continuously updating dashboards and exhibits could show progress and possibilities, while physical and virtual tutorials might provide a game-like interface to spark creativity. showcasing scholarship and incorporating events and speakers help create a new culture of ideas and exploration. events bring people together in library spaces to network for collaborative endeavors. as an example, the cleveland museum of art is analyzing visitor experiences using an artlens app to promote its collections.[footnoteref: ] [ : see https://www.clevelandart.org/art-museums-and-technology-developing-new-metrics-measure-visitor-engagement and https://www.clevelandart.org/artlens-gallery/artlens-app.] the library of congress, as mentioned, hosted a summit that explored such topics as building machine learning literacy, attracting interest in glam datasets, operationalizing machine learning, crowdsourcing, and copyright implications for the use of content. as another example, the united kingdom’s national archives attempted to demystify machine learning and explore ethics and applications such as topic modeling, which “was used to find key phrases in discovery record descriptions and enable innovative exploration of the catalogue; and it was also deployed to identify the subjects being discussed across cabinet papers. other projects included the development of a system that found the most important sentence in a news article to generate automated tweeting, while another team built a system to recognise computer code written in different programming languages – this is a major challenge for digital preservation.” (source please) finally, the hg contemporary gallery in chelsea mounted an exhibit that utilized a “machine-learning algorithm that did most of the work” (bogost ).

sustainable innovation

equity, diversity, and inclusion (edi) concerns with the scholarly record, and increasingly with recognized biases implicit in algorithms, can be addressed by a very intentional focus on the value of differing perspectives in solving problems. kat holmes, an inclusive design expert previously at microsoft and now a leading user experience designer at google, urges a framework for inclusivity that counteracts bias with different points of view: recognizing exclusion, learning from human diversity, and bringing in new perspectives. making more data available, and more diverse data, will significantly improve the imbalance perpetuated by a traditional-only corpus. in sustainability terms, machine learning tools must be designed to continuously incorporate diverse perspectives that go beyond the traditional definitions of the scholarly canon if they are to be useful in combating bias. collections used as data in machine learning might undergo analysis to determine the balance of content through text analysis, and library subject headings should be improved to better reflect the diversity of human thought, cultures, and global perspectives.
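as one concrete (and deliberately crude) example of the kind of text analysis that could audit a collection’s balance, the sketch below counts the vocabulary of record descriptions; comparing dominant terms against the communities a library serves can be a first, rough signal of imbalance. the records here are hypothetical, and a serious audit would need richer metadata and carefully chosen categories.

import re
from collections import Counter

# hypothetical descriptions from a digitized local collection
records = [
    "letters of a new england minister, 1802-1840",
    "diary of a boston merchant family",
    "oral histories of great migration families in chicago",
    "correspondence of a philadelphia abolitionist society",
]

tokens = Counter()
for text in records:
    tokens.update(re.findall(r"[a-z]+", text.lower()))

# which terms dominate the collection's self-description?
for term, count in tokens.most_common(10):
    print(term, count)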
streamlining procedures is to everyone’s benefit, and saving time is universally desired. efficiency won’t fix the time crunch everyone faces, but with too much to do and too much to read, information overload is a very real threat to advancing the research agenda and confronting a multitude of escalating global problems. machine learning techniques, applied at scale to large corpora of textual data, could help researchers eliminate irrelevant sources, pinpoint areas where the human researcher should delve more deeply, and home in on possible solutions to problems. one instance: a new service, scite.ai, “can automatically tell readers whether papers have been supported or contradicted by later academic work” (khamsi ). the world health organization (who) is providing a global research database that can be searched or downloaded.[footnoteref: ] in research on self-driving vehicles, a systematic literature review found more than , articles, an estimated year’s worth of reading for an individual. a tool called iris.ai allowed groupings of this archive by topic, and is one of several “targeted navigation” tools in development (extance ). working together as efficiently as possible is the only way to move ahead, and machine learning concepts, tools, and techniques, along with training, can be applied at scale to accelerate discovery. [ : https://www.who.int/emergencies/diseases/novel-coronavirus- /global-research-on-novel-coronavirus- -ncov]

machine learning, like any other technology, augments human capacities; it does not replace them. if a set percentage of library resources (measured in whatever way works for each particular library), including the time of expert librarians and staff as well as financial resources, were utilized for innovation, libraries would develop a virtuous, self-sustaining cycle. technologies that are not as useful can be assessed and dropped in an agile library, the useful can be incorporated into existing services, and the resources (people and money) repurposed. in the same way, that percentage of library resources invested into innovations such as machine learning, whether in library practice or in instruction and other services, will keep the program and the library fresh. creativity is key and will be the hallmark of successful libraries in the future. stewardship of resources such as people’s skills and expertise, and strategic use of the collections budget, are already library strengths. by building out new services and tools, and instructing at all levels, libraries can reinvent themselves continuously by investing in creative and sustainable innovation, from digital and data literacy to assembling modules for a library-based researchers’ workstation that uses machine learning to enhance the efficiency of the scholars’ research cycle.

results and more questions

a library that adopted machine learning as an innovation technology would improve its practices; add new services; choose, use, and license collections differently; utilize all spaces for learning; and model innovative leadership. what is a library in rapidly changing times? how can librarians reconcile past identity, add value, and leverage hard-won expertise in a new environment? change management is a topic that all institutions will have to confront as the digital age continues, as we reinvent ourselves and our institutions in a fast-paced technological world. value-added, distinctive, unique – these are all words that will be part of the conversation.
not only does the library add value, but librarians will have to demonstrate and quantify that value. distinctive library resources and services that speak to the institution’s academic mission and purpose will be a key feature. what does the library do that no other entity on campus can do? how best to communicate with stakeholders about the value of the distinctive library mission? can the library work with other cultural heritage institutions to highlight the unique contributions of all? one possible approach: develop a library science pedagogy, as well as outreach, that encompasses the scholarship of teaching and learning (sotl) and pervades everything the library does in providing resources, services, and spaces. emphasize that library resources solve problems, and then work on new ideas to save the time of researchers, improve discovery systems, and advocate for and facilitate open access and open source alternatives. from the library users’ point of view, think like the audience we are trying to reach to answer the question: why come into the library or use the library website instead of more familiar alternatives? in an era of increasing surveillance, library tools could be better known for an emphasis on privacy and confidentiality, for instance. this may require thinking more deeply about how we use our metrics, and finding other ways to show how use of the library contributes to student success. it is also important to gather quantitative and qualitative evidence from library users themselves, and apply the feedback in an agile improvement loop. in the case of open access vs. proprietary information, librarians should make the case for open access (oa) by advocating, explaining, and instructing library users from the first time they do literature searches to the time they are graduate students, post-docs, and faculty. librarians should produce open educational resources (oer) as well as encourage classroom faculty to adopt these tools of affordable education. libraries also need to facilitate open access content from discovery to preservation by developing search tools that privilege oa, using open source software whenever possible. librarians could lead the way to changing the scholarly communications system by emphasizing change at the citations level – encouraging researchers to insist on being able to obtain author-archived citations in a seamless way, and facilitating that through the development of new discovery tools using machine learning. the concept of the “inside-out library” (cite—new directions) provides a way of thinking about opening local collections to discovery and use in order to create new knowledge through digitization and linking, with cross-disciplinary technologies to augment traditional research and scholarship. because these ideas are new and fast-moving, librarians need to spread the word on the possibilities. making local collections accessible for computational research helps to diversify findings and focuses attention on larger patterns and new ideas.
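making local collections computationally accessible is often less exotic than it sounds: most repository platforms already expose their metadata through the oai-pmh protocol, which any researcher can harvest with a short script. the endpoint below is hypothetical (the verb and metadata prefix are standard oai-pmh), and a real harvester would also handle resumption tokens for large collections.

import urllib.request
import xml.etree.ElementTree as ET

# harvest dublin core records from a (hypothetical) repository endpoint
url = ("https://repository.example.edu/oai"
       "?verb=ListRecords&metadataPrefix=oai_dc")

with urllib.request.urlopen(url) as response:
    tree = ET.fromstring(response.read())

# print every dublin core title in the first batch of records
dc = "{http://purl.org/dc/elements/1.1/}"
for title in tree.iter(dc + "title"):
    print(title.text)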
the library of congress, for instance, sought to “maximize the use of its digital collection” by launching a program “to understand the technical capabilities and tools that are required to support the discovery and use of digital collections material,” developing ethical and technological standards for automation to support emerging research techniques and “to preprocess text material in a way that would make that content more discoverable” (price ). scholarly communication, dissemination, and discovery of research results will continue to be an important function of the library if trusted research results are to be available to all, not just the privileged. the so-called digital divide isolates and marginalizes some groups and regions. an important librarian role might be to identify gaps, in research or in dissemination, and work to overcome barriers to improving access to knowledge. libraries specialize in connecting disparate groups. here’s what libraries can do: instruct new researchers (including undergraduate researchers and up) in theories, skills, and techniques to find, use, populate, preserve, and cite datasets; provide server space and/or data management services; introduce machine learning and text analysis tools and techniques; and provide machine learning and text analysis tools and/or services to researchers at all levels. researchers are now expected, or even required, to provide public scholarship, i.e., to bring their research into the public realm beyond obscure research journals, and to explain and illuminate their work, connecting it to the public good, especially in the case of publicly funded research. librarians can and should partner in the public dissemination of research findings through explaining, promoting, and providing innovative new tools across siloed departments to catalyze cross-disciplinary research. the flow of research should be smooth and seamless to the researcher, whether in a researchers’ workstation or other library tools. the research cycle should be both clearly explained and embedded in systems and tools. the library, as a central place that cuts across narrowly defined research areas, could provide a systemic place of collaboration. librarians, seeing the bigger picture, could facilitate research as well as disseminate and preserve the resulting data in journals and datasets. further investigations into how researchers work, how students learn, the best pedagogy, and life-long learning in the library could mark a new era in librarianship, one that involves teaching, learning, and research as a self-reinforcing cycle. how libraries can facilitate all aspects of learning should be the focus of academic library activities. beyond being a purchaser of journals and books, libraries can expand their role in the learning process itself into a cycle of continuous change and exploration, augmented by machine learning.

library science, research, and pedagogy

in library and information science (lis), graduate library schools should teach about machine learning as a way of innovating and emphasize pervasive innovation as the new normal. creating a culture of innovation and creativity in lis classes and in libraries will pay off for society as a whole, if librarians promote the advantages of a culture of innovation in themselves and in library users.
subverting the stereotypes of tradition-bound libraries and librarians will revitalize the profession and our workplaces, replacing fear of change and an existential identity crisis with a spirit of creative, agile reinvention. academic libraries must transition from a space of transactional (one-time) actions into a learning-centered user space, both physical and virtual, that offers an enhanced experience with teaching, learning, and research – a way to re-center the library as the place to get answers that go beyond the internet, a true library learning community. do faculty, students, and other patrons know that when they find the perfect book on a library shelf through browsing (or on the library website with virtual browsing), it is because a librarian somewhere assigned it a call number to group similar books together? the next step in that process is to use machine learning to generate subject headings, and to show the librarians accomplishing that. this process is being investigated for different types of works, from fiction to scientific literature (golub , joorabchi , wang , short ). cataloging, metadata, and arranging access are all things librarians do that add value for library users overwhelmed with google hits, and they are worthy of further development. preservation is another traditional library function, and now includes born-digital items and the digitization of special collections and archives. discovery will be enhanced by artificial intelligence and machine learning techniques. all of this should be taught in library schools, to build a new library culture of innovation and problem-solving beyond just providing collections and information literacy instruction. the new learning paradigm is immersive in all senses, and the future – as reflected in library transformation and partnerships with researchers, galleries, archives, museums, citizen scientists, hobbyists, and life-long learners re-tooling their careers and lives – is bright. lis programs need to reflect that. to promote learning in libraries, librarians could design a “you belong in the library” campaign to highlight our diverse resources and new ways of working with technology, inviting participation in innovative technologies such as machine learning in an increasingly rare public, non-commercial space – telling why, showing how. in many ways, libraries could model ways to achieve academic success and life success, updating a traditional role in educating, instructing, preparing for the future, explaining, promoting understanding, and inspiring.

discussion

the larger questions now are: who is heard, and who contributes? how are gaps, identified in needs analysis, reduced? what are the sources of funding for libraries to develop this important work and not leave it to commercial services? library leadership and innovative thinking must converge to devise ways for libraries to bring people together, producing more diverse, ethical, innovative, inclusive, practical, transformative, and novel library services and physical and virtual spaces for the public good. libraries could start with analyses of needs: what problems could be solved with more effective literature searches? what research could fill gaps and inform solutions to those needs? what kind of teaching could help build citizens and critical thinkers, rather than simply expanding consumers of content? another need is to diversify the collections used in machine learning, gathering cultural perspectives that reflect true diversity of thought through inclusion.
all voices should be heard and empowered. a researchers’ workstation could bring together an array of tools and content not only to allow the organization, discovery, and preservation of knowledge, but also to facilitate the creation of new knowledge through the library, beyond the literature search. the world is converging toward networking and collaborative research all in one place. i would like the library to be the free platform that brings all the others together. coming full circle, my vision is that when researchers want to work on their research, they will log on to the library and find all they need...the library is the one place...to get your scholarly work done. (wiegand )

suppose, for example, scholars wish to analyze the timeline of the beginning of the coronavirus crisis. logging on to the library’s researchers’ workstation, they start with the discovery module to generate a corpus of research papers from december to june. using the machine learning function, they search for articles and books, looking for gaps and ideas that have not yet been examined in the literature. they access and download full text, save citations, annotate and take notes, and prepare a draft outline of their research using a word processing function, writing and citing seamlessly. a methods section could help determine the most effective path for the prospective research. then, they might search for the authors of the preprints and articles they find interesting, check the authors’ profiles, and contact some of them through the platform to discern interest in collaborating. the profile system would list areas of interest, current projects, availability for new projects, and so on. using the project management function, scholars might then open a new workspace where preliminary thoughts could be shared, with attribution and acknowledgement as appropriate, and a peer review timeline chosen, to invite comments while the authors can still claim the idea as their own. if the preprint is successful, and the investigation shows promise after the results are in, the scholars could search for an appropriate journal for publication of the version of record. the author, with a researcher id (also contained in the profile), has the article added to the final published section of the profile, with a doi. the journal showcases the article and sends out table-of-contents alerts and press releases, where it can be picked up by news services and the authors invited to comment publicly. each institution would celebrate its authors’ accomplishments, use the scholars’ workstation to determine impact and metrics, and promote the institution’s research progress. finally, the article would be preserved through the library repository, as well as through initiatives such as lockss. future scholars would find it still available and continue to discover and build on the findings presented. all done through the library.

conclusion

machine learning as a library service can inspire new stages of innovating, energizing, and providing a blueprint for the library future – teaching, learning, and scholarship for all. the teaching part of the equation invokes the faculty audience perspective: how can librarians help classroom faculty to integrate both library instruction and library research resources (collections, expertise, spaces) into the educational enterprise (wiegand and kominkiewicz )? how can librarians best teach skills, foster engagement, and create knowledge to make a distinctive contribution to the institution?
our answers will determine the library’s future at each academic institution. machine learning skills, engagement, and knowledge should fit well within the library’s array of services. learning is another traditional aspect of library services, this time from the student point of view. the library provides collections – multimedia or print on paper, digital and digitized, proprietary and open, local, redundant, rare, unique. the use of collections is taught by both librarians and disciplinary faculty in the service of learning, including life-long learning for non-academic, everyday knowledge. students need to know more about machine learning, from data literacy to digital literacy, including concerns about privacy, security, and fake news across the curriculum, while learning the skills associated with machine learning. then, as libraries, like all digitally inflected institutions, develop “change management” strategies, they need to double down on these unique affordances and communicate them to stakeholders. the most critical strategy is embedding the scholarship of teaching and learning (sotl) in all aspects of the library workflow. instead of simply advertising new electronic resources or describing open access versus proprietary resources, libraries should broadly embed the lessons of copyright, surveillance, and reproducibility into patron interactions, from the first undergraduate literature search to the faculty research consultation. then, reinforce those lessons by emphasizing open access and data mining permissions in their discovery tools. these are aspects of the scholarly research cycle over which libraries have some control. by exerting that control, libraries will promote a culture that positions machine learning and other creative digital uses of library data as normal, achievable parts of the scholarly process. to complete the scholarly communications lifecycle, support for research, scholarship, and creative works is increasingly provided by libraries as a springboard to the creation of knowledge, the library’s newest role. this is where machine learning as a new paradigm fits in most compellingly as an innovative practice. libraries can provide not only associated services, such as data management of the datasets resulting from analyzing huge corpora, but also licensed databases of proprietary and locally produced content from libraries on a global scale. researchers – faculty, students, and citizens (including alumni) – will benefit from crowdsourcing and citizen science while gaining knowledge and contributing to scholarship. but perhaps the largest benefit will be learning by doing, escaping the “black box” of blind consumerism to see how algorithms work, and thus developing a more nuanced view of reality in the machine age.

references

bivens-tatum, wayne. . libraries and the enlightenment. los angeles: library juice press. accessed january , . proquest ebook central.
bogost, ian. . https://www.theatlantic.com/technology/archive/ / /ai-created-art-invades-chelsea-gallery-scene/ /?utm_source=share&utm_campaign=share.
casson, lionel. . libraries in the ancient world. new haven: yale university press. accessed january , . proquest ebook central.
extance, andy. . “how ai technology can tame the scientific literature.” nature , - ( ). doi: . /d - - - .
golub, k. . “automated subject classification of textual web documents.” journal of documentation ( ): - . http://dx.doi.org/ . / .
jakeway, eileen. . “machine learning + libraries summit: event summary now live!” the signal (blog), library of congress. february , . https://blogs.loc.gov/thesignal/ / /machine-learning-libraries-summit-event-summary-now-live/.
joorabchi . information here.
khamsi, rozanne. . “coronavirus in context: scite.ai tracks positive and negative citations for covid- literature.” nature. doi: . /d - - - .
price, gary. . “the library of congress posts solicitation for a machine learning/deep learning pilot program to ‘maximize the use of its digital collection’.” library journal. june , .
rodrigues, eloy et al. . “next generation repositories: behaviours and technical recommendations of the coar next generation repositories working group.” november , . https://doi.org/ . /zenodo. .
ryholt, k. s. b, and gojko barjamovic, eds. . libraries before alexandria: ancient near eastern traditions. oxford: oxford university press.
short, matthew. . “text mining and subject analysis for fiction; or, using classification, keyword extraction, and named entity recognition to assign subject headings to dime novels.”
vamathevan, j., clark, d., czodrowski, p. et al. . “applications of machine learning in drug discovery and development.” nat rev drug discov , – . https://doi.org/ . /s - - - .
wang, jun. . “an extensive study on automated dewey decimal classification.” journal of the american society for information science & technology ( ): – . doi: . /asi. .
wiegand, sue. . “acs solutions: the sturm und drang.” acrlog (blog), march. https://acrlog.org/ / / /acs-solutions-the-sturm-und-drang/.
wiegand, sue, and frances kominkiewisz. . unpublished manuscript. “integration of student learning through library and classroom instruction.”
yewno. n.d. “yewno - transforming information into knowledge.” accessed january , . https://www.yewno.com/.

further reading

abbattista, fabio, luciana bordoni, and giovanni semeraro. . “artificial intelligence for cultural heritage and digital libraries.” applied artificial intelligence ( / ): . doi: . / .
ard, constance. . “advanced analytics meets information services.” online searcher ( ): – . http://smcproxy .saintmarys.edu: /login?url=https://search.ebscohost.com/login.aspx?direct=true&db=bth&an= &site=ehost-live.
“artificial intelligence and machine learning in libraries.” . library technology reports ( ): – . http://smcproxy .saintmarys.edu: /login?url=https://search.ebscohost.com/login.aspx?direct=true&db=lxh&an= &site=ehost-live.
badke, william. . “infolit land. the effect of artificial intelligence on the future of information literacy.” online searcher ( ): – . http://smcproxy .saintmarys.edu: /login?url=https://search.ebscohost.com/login.aspx?direct=true&db=bth&an= &site=ehost-live.
boman, craig. . “chapter : an exploration of machine learning in libraries.” library technology reports ( ): – . http://smcproxy .saintmarys.edu: /login?url=https://search.ebscohost.com/login.aspx?direct=true&db=lxh&an= &site=ehost-live.
breeding, marshall. . “chapter : possible future trends.” library technology reports ( ): – . http://smcproxy .saintmarys.edu: /login?url=https://search.ebscohost.com/login.aspx?direct=true&db=lxh&an= &site=ehost-live.
dempsey, lorcan, constance malpas, and brian lavoie. . “collection directions: the evolution of library collections and collecting.” portal: libraries and the academy , (july): - . https://doi.org/ . /pla. . .
enis, matt. . “labs in the library.” library journal ( ): – . http://smcproxy .saintmarys.edu: /login?url=https://search.ebscohost.com/login.aspx?direct=true&db=lxh&an= &site=ehost-live.
finley, thomas. . “the democratization of artificial intelligence: one library’s approach.” information technology & libraries ( ): – . doi: . /ital.v i . .
frank, eibe, and gordon w. paynter. . “predicting library of congress classifications from library of congress subject headings.” journal of the american society for information science and technology.
geary, daniel. . “how to bring ai into your library.” computers in libraries ( ): – . http://smcproxy .saintmarys.edu: /login?url=https://search.ebscohost.com/login.aspx?direct=true&db=bth&an= &site=ehost-live.
griffey, jason. . “chapter : conclusion.” library technology reports ( ): – . http://smcproxy .saintmarys.edu: /login?url=https://search.ebscohost.com/login.aspx?direct=true&db=lxh&an= &site=ehost-live.
inayatullah, sohail. . “library futures: from knowledge keepers to creators.” futurist ( ): – . http://smcproxy .saintmarys.edu: /login?url=https://search.ebscohost.com/login.aspx?direct=true&db=bth&an= &site=ehost-live.
johnson, ben. . “libraries in the age of artificial intelligence.” computers in libraries ( ): – . http://smcproxy .saintmarys.edu: /login?url=https://search.ebscohost.com/login.aspx?direct=true&db=bth&an= &site=ehost-live.
liu, xiaozhong, chun guo, and lin zhang. . “scholar metadata and knowledge generation with human and artificial intelligence.” journal of the association for information science & technology ( ): – . doi: . /asi. .
mitchell, steve. . “machine assistance in collection building: new tools, research, issues, and reflections.” information technology & libraries ( ): – . doi: . /ital.v i . .
ojala, marydee. . “proquest’s new approach to streamlining selection and acquisitions.” information today ( ): – . http://smcproxy .saintmarys.edu: /login?url=https://search.ebscohost.com/login.aspx?direct=true&db=bth&an= &site=ehost-live.
ong, edison, mei u wong, anthony huffman, and yongqun he. . biorxiv. doi: https://doi.org/ . / . . .
padilla, thomas. . responsible operations: data science, machine learning, and ai in libraries. dublin, oh: oclc research. https://doi.org/ . /xk z- g .
plosker, george. . “artificial intelligence tools for information discovery.” online searcher ( ): – . http://smcproxy .saintmarys.edu: /login?url=https://search.ebscohost.com/login.aspx?direct=true&db=bth&an= &site=ehost-live.
white, philip. . “using data mining for citation analysis.” college & research libraries [online] ( january ). https://scholar.colorado.edu/concern/parent/cr n /file_sets/ s.
witbrock, michael j., and alexander g. hauptmann. . “speech recognition for a digital video library.” journal of the american society for information science ( ): – . doi: . /(sici) - ( ) : < ::aid-asi > . .co; - .
zuccala, alesia, maarten someren, and maurits bellen. . “a machine-learning approach to coding book reviews as quality indicators: toward a theory of megacitation.” journal of the association for information science & technology ( ): – . doi: . /asi. .
holmes, kat. mismatch: how inclusion shapes design. cambridge, ma: mit press. https://mitpress.mit.edu/books/mismatch.
“recognizing exclusion is the key to inclusive design: in conversation with kat holmes.” campaign. https://www.campaignlive.com/article/recognizing-exclusion-key-inclusive-design-conversation-kat-holmes/.
caliskan, aylin, joanna j. bryson, and arvind narayanan. “semantics derived automatically from language corpora necessarily contain human biases.” http://www.cs.bath.ac.uk/~jjb/ftp/caliskansemantics-arxiv.pdf.
“no computation without representation: avoiding data and algorithm biases through diversity.” https://arxiv.org/pdf/ .pdf.
fragility and intelligibility of deep learning for libraries

michael lesk, rutgers university

abstract: we are reading more and more about practical uses of artificial intelligence, ranging from self-driving cars to bank loan approvals. increasingly, we are asked to trust systems that we cannot understand. can we rely on such software? this article discusses the risk that fragile ai software will be unreliable, and that we won’t be able to know when or why. traditional machine learning involved feature detection followed by some variety of clustering or discrimination rules. it was generally possible to explain what features were being used and how they were related to the task. these methods are now being supplanted by “deep learning.” enormous quantities of data are sometimes available, which seems to substitute for analyzing the problem. in a famous paper, peter norvig said that more data beats better algorithms.[footnoteref: ] [ : see halevy, norvig, and pereira , but the exact quote is in a talk, not this article.] but if software has been trained on a small and specific subset, will it generalize? will face recognition software trained on adult white male faces work with other races, with women, and with children? even without generalizing, if we don’t know what is driving the decision algorithm, we may find that the conclusions are very fragile. if image recognition systems can be fooled by slight changes in an image, what happens if self-driving cars misread road signs?

1. introduction
on february , , mounir mahjoubi, then the “digital minister” of france (le secrétariat d’état chargé du numérique), told the civil service to use only computer methods that could be understood (mahjoubi ). to be precise, what he actually said to l’assemblée nationale was:

aucun algorithme non explicable ne pourra être utilisé.

i gave this to google translate and asked for it in english. what i got (on october , ) was:

no algorithm that can not be explained can not be used.

that’s a long way from fluent english. as i count the “not” words, it’s actually reversed in meaning. but what if i leave off the final period when i enter it in google translate? then i get:

no non-explainable algorithm can be used

quite different, and although only barely fluent, now the meaning is right. the difference was only the final punctuation of the sentence. this is an example of the fragility of an ai algorithm. the point is not that both translations are of doubtful quality. the point is that a seemingly insignificant change in the input produced such a difference in the output. in this case, the fragility was detected by accident.

machine learning systems have a set of data for training. for example, if you are interested in translation, and you have a large collection of text in both french and english, you might notice that the word “truck” in english appears where the word “camion” appears in french. and the system might “learn” this translation. it would then apply this in other examples; this is called generalization. of course, if you wish to translate french into british english, a preferred translation of camion is “lorry.” and if the context of your english “truck” is a us discussion of the wheels and axles underneath railway vehicles, the better french word is “le bogie.” deep learning enthusiasts believe that with enough examples, machine learning systems will be able to generalize correctly.

there can be various kinds of failures: we can discuss both (a) problems in the scope of the training data and (b) problems in the kind of modeling done. if the system has sufficiently general input data so that it learns well enough to produce reliably correct results on examples it has not seen, we call it robust; robustness is the opposite of fragility. fragility errors can arise from many sources - for example, the training data may not be representative of the real problem (if you train a machine translation program solely on engineering documents, do not expect it to do well on theater reviews). or the data may not have the scope of the real problem: if you train for “boat” based on ocean liners, don’t be surprised if the program fails on canoes. in addition, there are also modeling issues. suppose you use a very simple model, such as a linear model, for data that is actually quadratic or exponential. this is called “underfitting,” and it may arise when there is not enough training data. the reverse is also possible: there may be a lot of training data, including many noisy points, and the program may decide on a very complex model to cover all the noise in the training data. this is called “overfitting,” and it gives you an answer too dependent on noise and outliers in your data. for example, an unusually warm year followed by a decline in world temperature over the next few years suggests noise in the data, not a change in the development of climate. fragility is also a problem in image recognition (“ai recognition” ).
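fragility of this kind can be probed directly: perturb an input slightly and watch whether the output changes. the sketch below, in the spirit of the single-pixel experiments discussed next, randomly changes one pixel at a time in a tiny image and checks whether a toy classifier flips its answer. the classify() function is a hypothetical stand-in for any trained model, so this shows the probing procedure, not a real attack.

import random

def classify(image):
    # toy stand-in for a trained classifier: "bright" vs. "dark"
    flat = [p for row in image for p in row]
    return "bright" if sum(flat) / len(flat) > 0.5 else "dark"

image = [[0.60, 0.40], [0.50, 0.55]]     # tiny hypothetical image
original = classify(image)

for attempt in range(100):               # random search for a fragile pixel
    r, c = random.randrange(2), random.randrange(2)
    saved = image[r][c]
    image[r][c] = random.random()        # change exactly one pixel
    if classify(image) != original:
        print("one-pixel change flipped the label at", (r, c))
        break
    image[r][c] = saved                  # restore and keep probing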
fragility is also a problem in image recognition ("ai recognition"). currently the most common technique for image recognition research projects is the use of convolutional neural nets, and several recent papers have looked at how trivial modifications to images may impact image classification. su, vargas, and sakurai show a set of images in which the original image class is printed in black and the classifier's choice (and confidence) after adding a single unusual pixel is shown in blue, with the extraneous pixel in white. the images were deliberately processed at low resolution, hence the pixelation, to match the input requirement of a popular image classification program. the authors experimented with algorithms to find the quickest single-pixel change that would deceive an image classifier, and they were routinely able to fool the recognition software. in this example the deception was deliberate; the researchers searched for the best place to change the image.

2. bias and mistakes

we have seen a major change in the way we do machine learning, and there are real dangers involved. the current enthusiasm for neural nets risks the use of processes which cannot be understood, as mahjoubi warned, and which can thus conceal methods we would not approve of, such as discrimination in lending or hiring. cathy o'neil has described this in her book weapons of math destruction (2016).

there is much research today that seeks methods to explain what neural nets are doing; see guidotti et al. for a survey. there is also a darpa program on "explainable ai." techniques used can include looking at the results over a range of input data and seeing if the neural net can be modeled by a decision tree, or modifying the input data to see which input elements have the greatest effect on the results, and then showing that to the user. for example, mariusz bojarski et al. describe a self-driving system that highlights what it thinks is important in what it is seeing. however, this is generally research in progress, and it raises the question of whether we can trust the explanation generator.

many popular magazines have discussed this problem; forbes, for example, had an explanation of how the choice of datasets can produce a biased result without any deliberate attempt to do so (taulli). similarly, the new york times discussed the way groups of primarily young white men will build systems that focus on their own data, and give wrong or discriminatory answers in more general situations (tugend). the mit media lab hosts the algorithmic justice league, which tries to stop organizations from building socially slanted systems. similar thoughts come from groups like the data & society research institute or the ai now institute.

again, the problems may be accidental or deliberate. the phrase "data poisoning" has been used to suggest malicious creation of training data, or examples of data designed to deceive machine learning systems. there is now a darpa research program, "guaranteeing ai robustness against deception (gard)," supporting research to learn how to stop trickery such as a demonstration of converting a traffic stop sign into a speed limit sign with a few stickers (eykholt et al.). more generally, bias in systems deciding whether to grant loans may be discriminatory but nevertheless profitable.

even if you want to detect ai mistakes, recognizing such problems is difficult. often things will be wrong and we won't know why. and even hypothetical (but perhaps erroneous) explanations can be very convincing; people easily believe plausible stories.
i routinely give my students a paper that concludes that prior ownership of a cat prevents fatal myocardial infarctions; its result implies that cats are more protective than statin drugs (qureshi et al.). the students are very quick to come up with possibilities like "petting a cat is relaxing, relaxation reduces your blood pressure, and lower blood pressure decreases the risk of heart attacks." then i have to explain that the paper evaluates a whole grid of possibilities (prior/current ownership x cats/dogs x medical conditions x fatal/nonfatal outcomes), and that if you evaluate twenty chances, you shouldn't be surprised when one of them is significant at the .05 level, which is only 1 in 20. in this example, there is also the question of reverse causality: perhaps someone in ill health will decide he is too sick to take care of a pet, so that the poor health is not caused by the lack of a cat, but rather the poor health causes the absence of a cat.

sometimes explanations can help, as with a machine learning program that was trained to distinguish images of wolves and dogs, but was deliberately trained using pictures of wolves that always contained snow and pictures of dogs that never did (ribeiro, singh, and guestrin). without an explanation, many subjects thought the classifier was trustworthy; after the snow was pointed out, only a few still believed the system. usually you don't get such a clear presentation of a mis-trained system.

3. recognition of problems

can we tell when something is wrong? consider the result of asking google photos to merge three photos: two landscapes and a picture of somebody's friend. the software was told to make a panorama and stitched the images together (peng). the output looks like a joke, and even made it into a list of top jokes on reddit. the author's point was that the panorama system didn't understand basic composition: people are not the same scale as mountains.

often, machine learning results are overstated. google flu trends was acclaimed for several years and then turned out to be undependable (lazer et al.). a study that attempted to compare the performance of machine learning systems for medical diagnosis with actual doctors found that of the thousands of papers analyzed, only a few dozen had data suitable for an evaluation (liu et al.). the results claimed comparable accuracy, but virtually none of the papers presented adequate data to support that conclusion. unusually promising results are sometimes the result of overfitting (brownlee); this is what was wrong with google flu trends. a machine learning program can learn a large number of special cases and then find that the results do not generalize. in other cases problems can result from using "clean" data for training and then encountering messier data in applications. ideally, training and testing data should be from the same dataset and divided at random, but it can be tempting to start off with examples that are the result of initial and higher-quality data collection.

sometimes in the past we had a choice between modeling and data for predictions. consider, for example, the problem of guessing what the weather will be tomorrow. we now do this based on a model of the atmosphere that uses the navier-stokes equations; we use supercomputers and derive tomorrow's atmosphere from today's (christensen). what did we do before we had supercomputers? solving those equations by hand is impractical. one of the methods was "prediction by analogy": find some day in the past whose weather was most similar to today's; suppose it was a particular october day some years ago.
then use the following day's weather as tomorrow's prediction. prediction by analogy doesn't require you to have a model or use advanced mathematics. in this case, however, it doesn't work as well, partly because we don't have enough past days to choose from, and we only get new days at the rate of one per day. in fact, huug van den dool estimated the amount of data needed to make accurate predictions this way as a span of years far greater than the age of the universe (wilks). the underlying problem is that the weather is very random. if your state lottery is properly run, it should be completely pointless to look at past winning numbers and try to guess the next one. the weather is not that random, but it has too much variation to be solved easily by analogy. if your problem is very simple (tic-tac-toe), you could indeed write down each position and what the best next move is; there are only about 255,000 possible games.

to deal with more realistic problems, much of machine learning research is now focused on obtaining larger training sets. instead of trying to learn more about the characteristics of a system that is being modeled, the effort is driven by the dictum "more data beats better algorithms." in a review of the history of speech recognition, xuedong huang, james baker, and raj reddy write, "the power of these systems arises mainly from their ability to collect, process, and learn from very large datasets," and note that the basic learning and decoding algorithms have not changed substantially in decades. nevertheless, speech recognition has gone from frustration to useful products such as dictation software and home appliances.

lacking a model, however, means that we won't know the limits of the calculations being done. for example, if you have some data that looks quadratic but you fit a linear model, any attempt at extrapolation is fraught with error. if you are using a "black box" system, you don't know when this is happening. and, regrettably, many ai software systems are sold as black boxes where the purchasers and users do not have access to the process, even if they are imagined to be able to understand it.

4. what's changing

many ai researchers are sensitive to the risks, especially given the publicity over self-driving cars. as the hype over "deep learning" built up, writers discussed examples such as a pittsburgh medical system that proposed to send patients with both pneumonia and asthma home, because the computer had not understood that patients with both problems were actually being sent to the icu (bornstein; caruana et al.). many people work on ways of explaining or presenting neural net software (harley). most important, perhaps, are new eu regulations that prohibit automated decision-making that affects eu citizens and provide a "right of explanation" (metz).

we recognize that systems which don't rely on a mathematical model may be cheaper to build than ones where the coders understand what is going on. more serious is that they may be more accurate. if there really is a tradeoff between what will solve the problem and what can be explained (bornstein), we know that many system builders will choose to solve the problem. and yet even having explanations may not be an answer; a key paper on interpretability discusses the complexities of meaning related to explanation, causality, and modeling (lipton). arend hintze has noted that we do not always impose a demand for explanation on people.
i can write that the new york public library main building is well proportioned and attractive without anyone expecting that i will recite its dimensions or the source of the marble used to construct it. and for some problems that's fine: i don't care how my camera decides on the focus distance to the subject. where it matters, however, we often want explanations; the hard ethical problem, as noted before, is if better performance can be achieved in an inexplicable way.

5. recommendations

2017 saw the publication of the "asilomar ai principles." two of these principles are:

* safety: ai systems should be safe and secure throughout their operational lifetime, and verifiably so where applicable and feasible.
* failure transparency: if an ai system causes harm, it should be possible to ascertain why.

the problem is that the technology used to build many systems does not enable verifiability and explanation. similarly, the world economic forum calls for protection against discrimination, but notes many ways in which technology can have unanticipated and undesirable effects as a result of machine learning ("how").

historically there has been, and continues to be, too much hype. an important image recognition task is distinguishing malignant and benign spots on mammograms, and there have been promises for decades that computers would do this better than radiologists: "computer-aided diagnosis can improve radiologists' observational performance" (schmidt and nishikawa); "the bayesian network significantly exceeded the performance of interpreting radiologists" (burnside et al.). a typical recent ai paper doing this with convolutional neural nets reports high accuracy (singh et al.). to put this in perspective, the problem is complex, but some examples are more straightforward, and even pigeons can be trained to respectable accuracy (levenson et al.). a serious recent review is "diagnostic accuracy of digital screening mammography with and without computer-aided detection" (lehman et al.). very recently there was another claim that computers have surpassed radiologists (walsh); we will have to await evaluation. as with many claims of medical progress, replicability and evaluation are needed before doctors will be willing to believe them.

what should we do? software testing generally is a decades-old discipline, and many basic principles of regression testing apply here also (a sketch of what they might look like in practice follows this list):

· test data should cover the full range of expected input.
· test data should also cover unexpected and even illegal input.
· test data should include known past failures believed cleared up.
· test data should exercise all parts of the program, and all important paths (coverage).
· test data should include a set of data which is representative of the distribution of actual data, to be used for timing purposes.

it is difficult to apply these ideas in parts of the ai world. if the allowed input is speech, there is no exhaustive list of utterances which can be sampled. if a black-box commercial machine learning package is being used, there is no way to ask about coverage of any number of test cases. if a program is constantly learning from new data, there is no list of previously fixed failures to be collected that reflects the constantly changing program. and obviously the circumstances of use matter.
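where you do control the model, the list above translates into ordinary test code. here is a minimal sketch of what regression tests for a trained classifier might look like, assuming pytest and a model with a scikit-learn-style predict method; the helper load_model, the file names, and the thresholds are all invented for the illustration, not taken from any particular system.

```python
# a minimal sketch of regression tests for a trained classifier, assuming pytest.
# load_model, the data files, and the 0.90 threshold are illustrative inventions.
import json
import numpy as np
import pytest

from my_project import load_model  # hypothetical helper returning a fitted model

model = load_model("model.pkl")

def test_accuracy_on_representative_sample():
    # data drawn from the same distribution as real input (the last principle)
    data = np.load("representative_sample.npz")
    accuracy = float((model.predict(data["x"]) == data["y"]).mean())
    assert accuracy >= 0.90  # an agreed floor; revisit when the data changes

def test_known_past_failures_stay_fixed():
    # inputs that once produced wrong answers, kept permanently (third principle)
    with open("past_failures.json") as f:
        cases = json.load(f)
    for case in cases:
        assert model.predict([case["input"]])[0] == case["expected"]

def test_illegal_input_fails_loudly():
    # unexpected input should raise an error, not silently misclassify
    with pytest.raises(ValueError):
        model.predict([[float("nan")] * 4])
```

the data files and thresholds would be maintained alongside the model, so that every retraining run has to pass the same gauntlet; none of this is writable against a black-box package.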
we may well, as a society, decide that forcing banks evaluating loan applications to use decision trees instead of deep learning is appropriate, so that we know whether illegal discrimination is going on, even if this raises costs for the banks. we might also believe that the safest possible railway operation is important, even if the automated train doesn't routinely explain how it balanced its choices of acceleration to achieve high punctuality and low risk.

what would i suggest? organizationally:

· have teams including both the computer scientists and the users.
· collaborate with a statistician: they've seen a lot of these problems before.
· work on easier problems.

as an example, i watched a group of zoologists and a group of computer scientists discussing how to improve accuracy at identifying animals in photographs. the discussion indicated that you needed hundreds of training examples at a minimum, if not thousands, since the animals do not typically walk up to the camera and pose for a full-frame shot. it was important to have both the people who understood the learning systems and the people who knew what the pictures were realistically like. the most amusing contribution by a statistician came when a computer scientist offered a program that tried to recognize individual giraffes, and a zoologist complained that it only worked if you had a view of the right-hand side of the giraffe. somebody who knew statistics perked up and said "it's a 50% chance of recognizing the animal? i can do the math for that." and it is simpler to ask "is there any animal in the picture?" before asking "which animal is it?" and so create two easier problems.

technically:

· try to interpolate rather than extrapolate: use the algorithm on points "inside" the training set (thinking in multiple dimensions).
· lean towards feature detection and modeling rather than completely unsupervised learning.
· emphasize continuous rather than discrete variables.

i suggest using methods that involve feature detection, since that tells you what the algorithm is relying on. consider the google flu trends failure: the public was not told what search terms were used. as david lazer noted, some of them were just "winter" terms (like "basketball"). if you had known that, you might have been skeptical. more significant are decisions like jail sentences or college admissions; verifying that racial or religious discrimination played no role requires knowing that the program did not use those attributes. knowing what features were used can sometimes help the user: if you know that your loan application was downrated because of your credit score, it may be possible for you to pay off some bill to raise the score. sometimes you have to use categorical variables (what county do you live in?), but if you have a choice of how you phrase a variable, asking something like "how many minutes a day do you spend reading?" is likely to produce a better fit than asking people to choose among "how much do you read: never, sometimes, a lot?" a machine learning algorithm may tell you how much of the variance each input variable explains; you can use that information to focus on the variables that are most important to your problem, and decide whether you think you are measuring them well enough.
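most mainstream toolkits make this kind of inspection cheap. below is a minimal sketch using scikit-learn's permutation importance on a stock dataset; the dataset and model are stand-ins chosen for the illustration, not anything discussed above. shuffling one feature at a time and watching how far held-out accuracy drops shows how much the model leans on each variable.

```python
# a minimal sketch of checking which features a fitted model relies on,
# using permutation importance; the dataset and model are illustrative stand-ins.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0)

model = RandomForestClassifier(random_state=0).fit(x_train, y_train)
result = permutation_importance(model, x_test, y_test,
                                n_repeats=10, random_state=0)

# rank features by how much shuffling them degrades held-out accuracy
for i in result.importances_mean.argsort()[::-1][:5]:
    print(f"{data.feature_names[i]}: {result.importances_mean[i]:.3f}")
```

had the users of google flu trends been able to run this kind of report and seen a "winter" term near the top, the skepticism lazer recommends would have come much sooner.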
why not extrapolate? sadly, as i write this in early april 2020, we are seeing all sorts of extrapolations of the covid-19 epidemic, with expected us deaths ranging from the tens of thousands to the millions, as people try to fit various functions (gaussians, logistic regression, or whatever) to inadequately precise data and uncertain models. a simpler example is mark twain's: "in the space of one hundred and seventy-six years the lower mississippi has shortened itself two hundred and forty-two miles. that is an average of a trifle over one mile and a third per year. therefore, any calm person, who is not blind or idiotic, can see that in the 'old oolitic silurian period,' just a million years ago next november, the lower mississippi river was upwards of one million three hundred thousand miles long, and stuck out over the gulf of mexico like a fishing-rod. and by the same token any person can see that seven hundred and forty-two years from now the lower mississippi will be only a mile and three-quarters long, and cairo and new orleans will have joined their streets together, and be plodding comfortably along under a single mayor and a mutual board of aldermen" (1883).

finally, note the advice of edgar allan poe: "believe nothing you hear, and only one half that you see."

references

"ai recognition fooled by single pixel change." bbc news, november , . https://www.bbc.com/news/technology- .

"asilomar ai principles." . https://futureoflife.org/ai-principles/.

bojarski, mariusz, larry jackel, ben firner, and urs muller. . "explaining how end-to-end deep learning steers a self-driving car." nvidia developer blog. https://devblogs.nvidia.com/explaining-deep-learning-self-driving-car/.

bornstein, aaron. . "is artificial intelligence permanently inscrutable?" nautilus ( ). http://nautil.us/issue/ /learning/is-artificial-intelligence-permanently-inscrutable.

brownlee, jason. "the model performance mismatch problem (and what to do about it)." machine learning mastery. https://machinelearningmastery.com/the-model-performance-mismatch-problem/.

burnside, elizabeth s., jessie davis, jagpreet chhatwal, oguzhan alagoz, mary j. lindstrom, berta m. geller, benjamin littenberg, katherine a. shaffer, charles e. kahn, and c. david page. . "probabilistic computer model developed from clinical data in national mammography database format to classify mammographic findings." radiology ( ): - .

caruana, rich, yin lou, johannes gehrke, paul koch, marc sturm, and noemie elhadad. . "intelligible models for healthcare: predicting pneumonia risk and hospital 30-day readmission." in proceedings of the acm sigkdd international conference on knowledge discovery and data mining (kdd), - . new york: acm press. https://doi.org/ . / . .

christensen, hannah. . "banking on better forecasts: the new maths of weather prediction." the guardian, jan . https://www.theguardian.com/science/alexs-adventures-in-numberland/ /jan/ /banking-forecasts-maths-weather-prediction-stochastic-processes.

eykholt, kevin, ivan evtimov, earlence fernandes, bo li, amir rahmati, florian tramèr, atul prakash, tadayoshi kohno, and dawn song. . "physical adversarial examples for object detectors." usenix workshop on offensive technologies (woot).

guidotti, riccardo, anna monreale, salvatore ruggieri, franco turini, fosca giannotti, and dino pedreschi. . "a survey of methods for explaining black box models." acm computing surveys ( ): - .

halevy, alon, peter norvig, and fernando pereira. . "the unreasonable effectiveness of data." ieee intelligent systems ( ).

harley, adam w. .
“an interactive node-link visualization of convolutional neural networks.” in advances in visual computing, edited by george bebis et al., - . lecture notes in computer science. cham: springer international publishing. “how to prevent discriminatory outcomes in machine learning.” . white paper from the global future council on human rights - , world economic forum. https://www.weforum.org/whitepapers/how-to-prevent-discriminatory-outcomes-in-machine-learning. huang, xuedong, james baker, and raj reddy. . “a historical perspective of speech recognition.” communications of the acm ( ): - . lazer, david, ryan kennedy, gary king, and alessandro vespignani. . “the parable of google flu: traps in big data analysis.” science ( ): – . lehman, constance, robert wellman, diana buist, karl kerlikowske, anna tosteson, and diana miglioretti. . “diagnostic accuracy of digital screening mammography with and without computer-aided detection.” jama intern med ( ): - . levenson, richard m., elizabeth a. krupinski, victor m. navarro, and edward a. wasserman. . “pigeons (columba livia) as trainable observers of pathology and radiology breast cancer images.” plos one, november , . https://doi.org/ . /journal.pone. . lipton, zachary. . “the mythos of model interpretability.” acm queue ( ): - . liu, xiaoxuan et al. . “a comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: a systematic review and meta-analysis.” lancet digital health ( ): e - . https://www.sciencedirect.com/science/article/pii/s mahjoubi, mounir. . “assemblée nationale, xve législature. session ordinaire de - .” compte rendu intégral, deuxième séance du mercredi février . http://www.assemblee-nationale.fr/ /cri/ - / .asp. metz, cade. . “artificial intelligence is setting up the internet for a huge clash with europe.” wired, july , . https://www.wired.com/ / /artificial-intelligence-setting-internet-huge-clash-europe/. o’neil, cathy. . weapons of math destruction. new york: crown. peng, tony. . “ in review: ai failures.” medium, december , . https://medium.com/syncedreview/ -in-review- -ai-failures-c faadf . qureshi, a. i., m. z. memon, g. vazquez, and m. f. suri. . “cat ownership and the risk of fatal cardiovascular diseases. results from the second national health and nutrition examination study mortality follow-up study.” journal of vascular and interventional neurology ( ): - . https://www.ncbi.nlm.nih.gov/pmc/articles/pmc . ribeiro, marco tulio, sameer singh, and carlos guestrin. . “‘why should i trust you?’: explaining the predictions of any classifier.” in proceedings of the nd acm sigkdd international conference on knowledge discovery and data mining (kdd ' ), - . new york: acm press. schmidt, r. a. and r. m. nishikawa. . “clinical use of digital mammography: the present and the prospects.” journal of digital imaging ( suppl ): - . singh, vivek kumar et al. . “breast tumor segmentation and shape classification in mammograms using generative adversarial and convolutional neural network.” expert systems with applications . su, jiawei, danilo vasconcellos vargas, and kouichi sakurai. . “one pixel attack for fooling deep neural networks.” ieee transactions on evolutionary computation ( ): - . taulli, tom. . “how bias distorts ai (artificial intelligence).” forbes, august , . https://www.forbes.com/sites/tomtaulli/ / / /bias-the-silent-killer-of-ai-artificial-intelligence/# cc f d d . twain, mark. . life on the mississippi. boston: j. r. osgood & co. tugend, alina. . 
"the bias embedded in tech." the new york times, june , , section f, .

walsh, fergus. . "ai 'outperforms' doctors diagnosing breast cancer." bbc news, january , . https://www.bbc.com/news/health- .

wilks, daniel s. . review of empirical methods in short-term climate prediction, by huug van den dool. bulletin of the american meteorological society ( ): - .

ai and its moral concerns

bohyun kim

1. automating decisions and actions

the goal of artificial intelligence (ai) as a discipline is to create an artificial system – whether it be a piece of software or a machine with a physical body – that is as intelligent as a human in its performance, either broadly in all areas of human activities or narrowly in a specific activity, such as playing chess or driving.[footnoteref: ] the actual capability of most ai systems remained far below this ambitious goal for a long time, but with recent successes in machine learning and deep learning, the performance of some ai programs has started surpassing that of humans. in 2016, alphago, an ai program developed with the deep learning method, astonished even its creators by winning four out of five go matches against the eighteen-time world champion sedol lee.[footnoteref: ] more recently, google's deepmind unveiled agent57, a deep reinforcement learning algorithm that reached superhuman levels of play in all 57 classic atari games.[footnoteref: ]

[ : note that by 'as intelligent as a human,' i simply mean human-level ai, or more accurately, ai at human-level performance. by 'as intelligent as a human,' i do not mean general (/strong) ai. general ai – also known as 'artificial general intelligence (agi)' and 'strong ai' – refers to the ability to adapt to achieve any goals; it does not refer to human-level performance in achieving one particular goal. an ai system developed to perform one specific type of activity, or activities in one specific domain, is called a 'narrow (/weak) ai' system.]

[ : alphago can be said to be "as intelligent as humans," but only in playing go, where it exceeds human capability. so it does not qualify as general/strong ai in spite of its human-level intelligence in go-playing. it is to be noted that general (/strong) ai and narrow (/weak) ai signify a difference in the scope of ai capability. general (/strong) ai is a broader concept than human-like intelligence, whether with a carbon-based substrate or with human-like understanding that relies on what we regard as uniquely human cognitive states such as consciousness, qualia, emotions, and so on. for more helpful descriptions of common terms in ai, see tegmark. for more on the match between alphago and sedol lee, see koch.]

[ : deep reinforcement learning is a type of deep learning that is goal-oriented and reward-based. see heaven.]

early symbolic ai systems determined their outputs based upon given rules and logical inference. ai algorithms in these rule-based systems, also known as good old-fashioned ai (gofai), were pre-determined, predictable, and transparent. machine learning, another approach in ai, instead enables an ai algorithm to evolve: it learns to adjust itself to identify certain patterns through the so-called 'training' process, which relies on a large amount of data and statistics.
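the contrast between the two approaches is visible even in toy code. the sketch below is entirely illustrative (the loan-flagging task, the features, and the data are invented): the first function encodes a human-written rule that anyone can read and audit, while the second learns its decision boundary from examples, leaving its "reasons" buried in fitted coefficients.

```python
# an illustrative contrast between a gofai-style rule and a learned model;
# the loan-flagging task and all numbers are invented for this sketch.
from sklearn.linear_model import LogisticRegression

def flag_application_by_rule(income, debt):
    # symbolic approach: the rule is explicit and inspectable
    return debt / income > 0.5

# learned approach: the "rule" is whatever the training examples imply
train_x = [[40, 30], [90, 10], [35, 5], [60, 45]]  # (income, debt) in thousands
train_y = [1, 0, 0, 1]                             # 1 = flagged, in this invented data
model = LogisticRegression().fit(train_x, train_y)

print(flag_application_by_rule(50, 30))   # True, and you can say exactly why
print(model.predict([[50, 30]])[0])       # a prediction whose rationale is implicit
```

the transparency of the first style and the adaptability of the second are precisely the trade-off that runs through the rest of this chapter.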
deep learning, one of the most widely used techniques in machine learning, further refines this training process using a 'neural network.'[footnoteref: ] machine learning and deep learning have brought significant improvements to the performance of ai systems in areas such as translation, speech recognition, and detecting objects and predicting their movements. some people assume that machine learning has completely replaced gofai, but this is a misunderstanding. symbolic reasoning and machine learning are two distinct but not mutually exclusive approaches, and they can be used together (knight a).

[ : machine learning and deep learning have gained momentum as the cost of high-performance computing has gone down and large data sets have become available. for example, imagenet contains more than 14 million hand-annotated images. the imagenet data have been used for the well-known annual ai competition for object detection and image classification at large scale. see http://www.image-net.org/challenges/lsvrc/.]

with their limited intelligence and fully deterministic nature, early rule-based symbolic ai systems raised few ethical concerns.[footnoteref: ] ai systems that near or surpass human capability, on the other hand, are likely to be given the autonomy to make their own decisions without humans, even when their workings are not entirely transparent, and some of those decisions are distinctively moral in character. as humans, we are trained to recognize situations that demand moral decision-making. but how would an ai system be able to do so, and should it be allowed to? with self-driving cars and autonomous weapons systems under active development and testing, these are no longer idle questions.

[ : for an excellent history of ai research, see the chapter "what is artificial intelligence?" in boden.]

2. the trolley problem

recent advances in ai, such as autonomous cars, have brought new interest to the trolley problem, a thought experiment introduced by philippa foot in 1967. in the standard version of this problem, a runaway trolley barrels down a track where five unsuspecting people are standing. you happen to be standing next to a lever that switches the trolley onto a different track, where there is only one person. those who are on either track will be killed if the trolley heads their way. should you pull the lever, so that the runaway trolley kills one person instead of five?

unlike people, a machine does not panic or freeze; it simply follows and executes the given instruction. this opens up the possibility of building an ai-powered trolley that can act morally.[footnoteref: ] the question itself remains, however: should one program the ai-powered trolley to swerve or stay on course?

[ : programming here does not exclusively mean a deep learning or machine learning approach.]

different moral theories, such as virtue ethics, contractarianism, and moral relativism, take different positions. here i will consider utilitarianism and deontology, since their tenets are relatively straightforward and most ai developers are likely to look towards those two moral theories for guidance and insight. utilitarianism argues that the utility of an action is what makes an action moral. in this view, what generates the greatest amount of good is the most moral thing to do. if one regards five human lives as a greater good than one, then one acts morally by pulling the lever and diverting the trolley to the other track.
by contrast, deontology claims that what determines whether an action is morally right or wrong is not its utility but moral rules. if an action is in accordance with those rules, then the action is morally right; otherwise, it is morally wrong. if not killing another human being is one of those moral rules, then killing someone is morally wrong even if it is to save more lives.

note that these are highly simplified accounts of utilitarianism and deontology. the good in utilitarianism can be interpreted in many different ways, and the issue of conflicting moral rules is a perennial problem that deontological ethics grapples with.[footnoteref: ] for our purpose, however, these simplified accounts are sufficient to highlight the ways in which the utilitarian and the deontological positions both appeal to and go against our moral intuition.

[ : for an overview, see sinnott-armstrong and alexander and moore.]

if a trolley cannot be stopped, saving five lives over one seems to be the right thing to do. utilitarianism appears to get things right in this respect. however, it is hard to dispute that killing people is wrong, and if killing is morally wrong no matter what, deontology seems to make more sense. with moral theories, things seem to get more confusing.

consider the case in which one freezes and fails to pull the lever. according to utilitarianism, this would be morally wrong because it fails to maximize the greatest good, i.e. human lives. but how far should one go to maximize the good? suppose there is a very large person on a footbridge over the trolley track, and one pushes that person off the footbridge onto the track, thus stopping the trolley and saving the five people. would this count as a right thing to do? utilitarianism may say so. but in real life, many might consider throwing a person off a bridge morally wrong but pulling the lever morally permissible.[footnoteref: ]

[ : for an empirical study on this, see cushman, young, and hauser. for the results of a similar survey that involves an autonomous car instead of a trolley, see bonnefon, shariff, and rahwan.]

the problem with utilitarianism is that it treats the good as something inherently quantifiable, comparable, calculable, and additive. but not all considerations that we have to factor into moral decision-making are measurable in numbers. what if the five people on the track are helpless babies, or murderers who just escaped from prison? would or should that affect our decision? some of us would surely hesitate to save the lives of five murderers by sacrificing one innocent baby. but what if things were different, and we were comparing five school children versus one baby, or five babies versus one schoolchild? no one can say for sure what the morally right action is in those cases.[footnoteref: ]

[ : for an attempt to identify moral principles behind our moral intuition in different versions of the trolley problem and other similar cases, see thomson.]

while the utilitarian position appears less persuasive in light of these considerations, deontology doesn't fare too well, either. deontology emphasizes one's duty to observe moral rules. but what if those moral rules conflict with one another? between the two moral rules "do not kill a person" and "save lives," which one should trump the other?
the conflict among values is common in life, and deontology faces difficulty in guiding how an intelligent agent is to act in a tricky situation like the trolley problem.[footnoteref: ]

[ : some moral philosophers doubt the value of our moral intuition in constructing a moral theory. see singer, for example. but a moral theory that clashes with common moral intuition is unlikely to be sought out as a guide to making an ethical decision.]

3. understanding what ethics has to offer

due to the high stakes involved, ai-powered military robots and autonomous weapons systems can present the moral dilemma in the trolley problem more convincingly. suppose that some engineers, following utilitarianism and interpreting victory as the ultimate good/utility, wish to program an unmanned aerial vehicle (uav) to autonomously drop bombs in order to maximize the chances of victory. that may result in sacrificing a greater number of civilians than necessary, and many will consider this to be morally wrong. now imagine different engineers who, adopting deontology and following the moral principle of not killing people, program a uav to autonomously act in a manner that minimizes casualties. this may lead to defeat on the battlefield, because minimizing casualties may not always be advantageous to winning a war. these considerations suggest that philosophical insights from utilitarianism and deontology may provide little practical guidance on how to program autonomous ai systems to act morally.

ethicists seek abstract principles that can be generalized. for this reason, they are interested in borderline cases that reveal subtle differences in our moral intuition and varying moral theories. their goal is to define what is moral and to investigate how moral reasoning works or should work. by contrast, engineers and programmers pursue practical solutions to real-life problems and look for guidelines that will help with implementing those solutions. their focus is on creating a set of constraints and if-then statements which will allow a machine to identify and process morally relevant considerations, so that it can determine and execute an action that is not only rational but also ethical in the given situation (a toy sketch of this framing appears below).[footnoteref: ]

[ : note that this moral decision-making process can be modeled as rule-based, as in a symbolic ai system, although it does not have to be. it can be based upon a machine learning approach. it is also possible to combine both. see conitzer et al.]

on the other hand, military commanders and soldiers might aim to end a conflict, bring peace, and facilitate restoring and establishing universally recognized human values such as freedom, equality, justice, and self-determination. in order to achieve this goal, they must make the best strategic decisions and take the most appropriate actions. in deciding on those actions, they are also responsible for abiding by the principles of jus in bello and for not abdicating their moral responsibility, protecting civilians and minimizing harm, violence, and destruction as much as possible.[footnoteref: ] the goals of military commanders and soldiers therefore differ from those of moral philosophers or of the engineers of autonomous weapons. they are obligated to make quick decisions in life-or-death situations while working with ai-powered military systems.

[ : for the principles of jus in bello, see international committee of the red cross.]
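to see how much the engineering framing leaves out, here is a deliberately naive sketch of the "constraints and if-then statements" approach applied to the trolley case; everything in it, from the scenario encoding to the rules, is invented for illustration.

```python
# an invented, deliberately naive encoding of two moral theories as if-then rules.
# the point is how much of a real situation this representation cannot capture.

def utilitarian_choice(lives_on_main: int, lives_on_side: int) -> str:
    # maximize lives saved: divert whenever the side track kills fewer people
    return "divert" if lives_on_side < lives_on_main else "stay"

def deontological_choice(lives_on_main: int, lives_on_side: int) -> str:
    # "do not kill": diverting is an act of killing, so never divert
    return "stay"

print(utilitarian_choice(5, 1))     # divert
print(deontological_choice(5, 1))   # stay
```

neither function can represent who the people on the tracks are, how certain the counts are, or the difference between killing and letting die; the gap between what such code expresses and what the situation demands is precisely where the philosophical difficulty reasserts itself.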
these different goals and interests explain why moral philosophers' discussion of the trolley problem may be disappointing to ai programmers or to military commanders and soldiers. ethics does not provide an easy answer to the question of how one should program moral decision-making into intelligent machines, nor does it prescribe the right moral decision on a battlefield. but taking this as a shortcoming of ethics is missing the point. the role of moral philosophy is not to make decision-making easier but to highlight and articulate the difficulty and complexity involved in it.

4. ethical challenges from autonomous ai systems

the complexity of ethical questions means that dealing with the morality of an action by an autonomous ai system will require more than a clever engineering or programming solution. the fact that ethics does not eliminate the inherent ambiguity in many moral decisions should not lead to the dismissal of ethical challenges from autonomous ai systems. by injecting the capacity for autonomous decision-making into major systems and tools, ai can fundamentally transform any given field. for example, ai-powered military robots are not just another kind of weapon; when widely deployed, they can change the nature of war itself. described below are some of the significant ethical challenges that autonomous ai systems such as military robots present.

(a) moral desensitization

ai-powered military robots go further than any remotely operated weapons: they can identify a target and initiate an attack on their own. due to their autonomy, military robots can significantly increase the distance between the party that kills and the party that gets killed (sharkey). this increase, however, may lead people to surrender their own moral responsibility to a machine, thereby resulting in the loss of humanity, which is a serious moral risk (davis). the more autonomous military robots become, the less responsibility for life-or-death decisions may come to rest with humans.

(b) unintended outcomes

as military robots make killing easier, small conflicts may more quickly escalate into war. the side that deploys ai-powered military robots is likely to suffer fewer casualties itself while inflicting more casualties on the enemy side. this may make the military more inclined to start a war. ironically, when everyone thinks and acts this way, the number of wars and the overall amount of violence and destruction in the world will only increase.[footnoteref: ]

[ : kahn also argues that the resulting increase in the number of wars brought about by the use of military robots will be morally bad.]

(c) surrender of moral agency

military robots may fail to distinguish innocents from combatants and kill the former. in such a case, can we be justified in letting robots take the lives of other human beings? some may argue that only humans should decide to kill other humans, not machines (davis). is it permissible for people to delegate such a decision to robots?

(d) opacity in decision-making

machine learning is used to build many ai systems today. instead of prescribing a pre-determined algorithm, a machine learning system goes through a so-called 'training' process in order to produce the final algorithm from a large amount of data.
for example, a machine learning system may generate an algorithm that successfully recognizes cats in a photo after going through millions of photos that show cats in many different postures and from various angles.[footnoteref: ] but the resulting algorithm is a complex mathematical formula, not something that humans can easily decipher. this means that the inner workings of a machine learning ai system and its decision-making process are opaque to human understanding, even to those who built the system (knight). in cases where the actions of an ai system can have grave consequences, as with a military robot, such opacity becomes a serious problem.[footnoteref: ] in spite of these ethical concerns, autonomous ai systems are likely to continue to be developed and adopted in many areas as a way to increase efficiency and lower cost.

[ : google's research team created an ai algorithm that learned how to recognize a cat in 2012. the neural network behind this algorithm had an array of 16,000 processors and more than one billion connections, and unlabeled random thumbnail images from 10 million youtube videos allowed this algorithm to learn to identify cats by itself. see markoff and clark.]

[ : this black-box nature of ai systems powered by machine learning has raised great concern among many ai researchers in recent years. it is problematic in all areas where these ai systems are used for decision-making, not just in military operations; the gravity of decisions made in a military operation makes the problem even more troublesome. fortunately, some ai researchers, including those in the us department of defense, are actively working to make ai systems explainable. but until such research bears fruit and ai systems become fully explainable, their military use means accepting many unknown variables and unforeseeable consequences. see turek n.d.]

5. ai applications for libraries

do the ethical concerns outlined above apply to libraries? to answer that, let us first take a look at how ai, particularly machine learning, may apply to library services and operations. ai-powered digital assistants are likely to mediate a library user's information search, discovery, and retrieval activities in the near future. in recent years, machine learning and deep learning have brought significant improvement to natural language processing (nlp), which deals with analyzing large amounts of natural language data to make interaction between people and machines in natural languages possible. for instance, google assistant's 'duplex' feature was shown successfully making a phone reservation with restaurant staff in 2018 (welch). google's real-time translation capability is planned to become part of the feature set for all headphones equipped with google assistant (le and schuster; porter). as digital assistants become capable of handling more sophisticated language tasks, their use as a flexible voice user interface will only increase. such digital assistants will directly interact with library systems and applications, automatically interpret a query, and return the results that they deem most relevant. those digital assistants can also be equipped to handle the library's traditional reference or readers' advisory services. integrated into a humanoid robot body, they may even greet library patrons at the entrance and answer directional questions about the library building.

cataloging, abstracting, and indexing are other areas where ai will be actively utilized.
currently, those tasks are performed by skilled professionals, but as ai applications become more sophisticated, we may see many of those tasks partially or fully automated and handed over to ai systems. machine learning and deep learning can be used to extract key information from a large number of documents, or from information-rich visual materials such as maps and video recordings, and to generate metadata or a summary.

since machine learning is new to libraries, relatively few machine learning applications have been developed for libraries' use, but they are likely to grow in number. yewno, quartolio, and iris.ai are examples of vendor products developed with machine learning and deep learning techniques. yewno discover displays the connections between different concepts or works in library materials. quartolio targets researchers looking to discover untapped research opportunities based upon a large amount of data that includes articles, clinical trials, patents, and notes. similarly, iris.ai helps researchers identify and review large numbers of research papers and patents and extracts key information from them.[footnoteref: ] kira identifies, extracts, and analyzes text in contracts and other legal documents.[footnoteref: ] none of these applications performs fully automated decision-making or incorporates a digital assistant feature, but this is an area on which information systems vendors are increasingly focusing.

[ : see https://www.yewno.com/education, https://quartolio.com/, and https://iris.ai/.]

[ : see https://kirasystems.com/. law firms are adopting similar products to automate and expedite their legal work, and law librarians are discussing how the use of ai may change their work. see marr and talley.]

libraries themselves are also experimenting with ai to test its potential for library services and operations. some are focusing on using ai, particularly the voice user interface aspect of the digital assistant, to improve existing services. the university of oklahoma libraries have been building an alexa application to provide basic reference service to their students.[footnoteref: ] at the university of pretoria library in south africa, a robot named 'libby' already interacts with patrons by providing guidance, answering questions, conducting surveys, and displaying marketing videos (mahlangu).

[ : the university of oklahoma libraries' pair registry also attempts to compile ai-related projects at libraries. see https://pair.libraries.ou.edu.]

other libraries are applying ai to extracting information and automating metadata generation for digital materials to enhance their discovery and use. the library of congress has worked on detecting features, such as railroads, in maps using a convolutional neural network model, and issued a solicitation for a machine learning and deep learning pilot program that will maximize the use of its digital collections.[footnoteref: ] indiana university libraries, avp, the university of texas at austin school of information, and the new york public library are jointly developing the audiovisual metadata platform (amp), using many ai tools to automatically generate metadata for audiovisual materials, which collection managers can use to supplement their archival description and processing workflows.[footnoteref: ]

[ : see blewer, kim, and phetteplace and price.]
[ : the amp wiki is https://wiki.dlib.indiana.edu/pages/viewpage.action?pageid= . the audiovisual metadata platform pilot development (amppd) project was presented at code lib (averkamp and hardesty).]

some libraries are also testing ai as a tool for evaluating services and operations. the university of rochester libraries applied deep learning to the library's space assessment to determine the optimal staffing level and building hours. the university of illinois urbana-champaign libraries used machine learning to conduct sentiment analysis on their reference chat log (blewer, kim, and phetteplace).

6. ethical challenges from the personalized and automated information environment

do these current and future ai applications for libraries pose ethical challenges similar to those discussed earlier? since information query, discovery, and retrieval rarely involve life-or-death situations, the stakes are certainly lower. but an ai-driven automated information environment does raise its own distinct ethical challenges.

(i) intellectual isolation and bigotry hampering civic discourse

many ai applications that assist with information-seeking activities promise a higher level of personalization. but a highly personalized information environment often traps people in their own so-called 'filter bubble,' as we have increasingly been seeing in today's social media channels, news websites, and commercial search engines, where such personalization is provided by machine learning and deep learning.[footnoteref: ] sophisticated ai algorithms already curate and push information feeds based upon a person's past search and click behavior. the result is that information seekers are provided with information that conforms to and reinforces their existing beliefs and interests, while views that are novel or that contrast with their existing beliefs are increasingly suppressed and become invisible, without their even realizing it.

[ : see pariser.]

such lack of exposure to opposing views leads information users to intellectual isolation and even bigotry. highly personalized information environments powered by ai can actively restrict the ways in which people develop balanced and informed opinions, thereby intensifying and perpetuating social discord and disrupting civic discourse. such social discord and disruption of civil discourse is likely to disproportionately affect those with the least privilege. in this sense, intellectual isolation and bigotry have a distinctly moral impact on society.

(ii) weakening of cognitive agency and autonomy

we have seen that ai-powered digital assistants are likely to mediate people's information search, discovery, and retrieval activities in the near future. as those digital assistants become more capable, they will go beyond listing available information: they will further choose what they deem most relevant to their users and proceed to recommend or autonomously execute the best course of action.[footnoteref: ] other ai-driven features, such as extracting key information or generating a summary of a large amount of information, are also likely to be included in future information systems, and they may deliver key information or summaries even before a request is made, based upon constant monitoring of the user's activities.

[ : please note that this is a simplified scenario. the features can also be built into the information system itself rather than the agent.]

in such a scenario, an information seeker's cognitive agency is likely to suffer a loss.
crucial to cognitive agency is the mental capacity to critically review a variety of information, judge what is and is not relevant, and interpret how it relates to one's other existing beliefs and opinions. if ai assumes those tasks, the opportunities for information seekers to exercise their own cognitive agency will surely decrease. cognitive deskilling and the subsequent weakening of people's agency in the ai-powered automated information environment presents an ethical challenge, because such agency is necessary for a person to be a fully functioning moral agent in society.[footnoteref: ]

[ : outside of the automated information environment, ai has a strong potential to engender moral deskilling. vallor points out that automated weapons will lead to soldiers' moral deskilling in the use of military force; that new media practices of multitasking may result in deskilling in moral attention; and that social robots can cause moral deskilling in practices of human caregiving.]

(iii) social impact of scholarship and research from flawed ai algorithms

previously, we have seen that deep learning applications are opaque to human understanding. this lack of transparency and explainability raises the question of whether it is moral to rely on ai-powered military robots for life-or-death decisions. does the ai-powered information environment have a similar problem?

as explained at the beginning of this chapter, machine learning applications base their recommendations and predictions upon the patterns in past data. their predictions and recommendations are in this sense inherently conservative, and they become outdated when they fail to reflect new social views and material conditions that no longer fit the past patterns. furthermore, each data set is a social construct that reflects particular values and choices: who decided to collect the data and for what purpose, who labeled the data, what criteria or beliefs guided such labeling, what taxonomies were used and why (davis). no data set can capture all the variables and elements of the phenomenon that it describes. data sets used for training machine learning and deep learning algorithms may also not be representative samples of all relevant subgroups, in which case an algorithm will not perform equally well across subgroups. creating a large data set is also costly. consequently, developers often simply take the data sets that are available to them, and those data sets are likely to come with inherent limitations such as omissions, inaccuracies, errors, and hidden biases.

ai algorithms trained with these flawed data sets can fail unexpectedly, revealing those limitations. for example, it has been reported that the success rate of a facial recognition algorithm plunges from 99% to 65% when the group of subjects changes from white men to dark-skinned women, because it was trained mostly on photographs of white men (lohr). adopting such a faulty algorithm for any real-life use at a large scale would be entirely unethical. in the context of libraries, imagine using such a face-recognition algorithm to generate metadata for digitized historical photographs, or a similarly flawed audio transcription algorithm to transcribe archival audio recordings. just like those faulty algorithms, an ai-powered automated information environment can provide information, recommendations, and predictions affected by the similar limitations existing in many data sets.
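one practical response to the facial-recognition failure just described is to audit accuracy per subgroup before deployment rather than report a single aggregate number. below is a minimal sketch of such an audit; the model object, the arrays, and the group labels are assumed to exist in an evaluation harness and are named here purely for illustration.

```python
# a minimal sketch of a subgroup accuracy audit; `model`, the arrays, and the
# group attribute are illustrative assumptions, not a specific library's api.
import numpy as np

def accuracy_by_group(model, x, y, groups):
    """report accuracy separately for each subgroup instead of one aggregate."""
    predictions = model.predict(x)
    return {
        group: float((predictions[groups == group] == y[groups == group]).mean())
        for group in np.unique(groups)
    }

# an aggregate accuracy of 0.95 can hide a subgroup at 0.65;
# a per-group report like this one cannot.
```

an audit like this only surfaces the disparities someone thought to measure; whatever was not anticipated stays hidden.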
the more seamless such an information environment is, the more invisible those limitations become. automated information systems from libraries may not be involved in decisions that have a direct and immediate impact on people's lives, such as setting a bail amount or determining a medicaid payment,[footnoteref: ] but automated information systems that are widely adopted and used for research and scholarship will impact real-life policies and regulations in areas such as healthcare and the economy. undiscovered flaws will undermine the validity of the scholarly output that utilized those automated information systems, and can further inflict serious harm on certain groups of people through those policies and regulations.

[ : see tashea and stanley.]

7. moral intelligence and rethinking the role of ai

in this chapter, i discussed four significant ethical challenges that automating decisions and actions with ai presents: (a) moral desensitization; (b) unintended outcomes; (c) surrender of moral agency; and (d) opacity in decision-making.[footnoteref: ] i also examined somewhat different but equally significant ethical challenges in relation to the ai-powered automated information environment, which is likely to surround us in the future: (i) intellectual isolation and bigotry hampering civic discourse; (ii) weakening of cognitive agency and autonomy; and (iii) the social impact of scholarship and research based upon flawed ai algorithms.

[ : this is by no means an exhaustive list. user privacy and potential surveillance are examples of other important ethical challenges, which i do not discuss here.]

the challenges related to the ai-powered automated information environment are relevant to libraries, which will soon be acquiring, building, customizing, and implementing many personalized and automated information systems for their users. at present, libraries are at an early stage in developing ai applications and applying machine learning and deep learning techniques to improve library services, systems, and operations. but the general issues of hidden biases and the lack of explainability in machine learning and deep learning are quickly gaining awareness in the library community.

as we have seen in the trolley problem, whether a certain action is moral or not is a line that cannot be drawn with absolute clarity. it is entirely possible for fully functioning moral agents to make different judgements in a particular situation. additionally, there is the matter of the morality that our tools and systems themselves display; in relation to ai systems this is called machine morality, since ai systems are intelligent machines. wallach and allen argue that there are three distinct levels of machine morality: operational morality, functional morality, and full moral agency.

operational morality is found in systems that are low in both autonomy and ethical sensitivity. at this level of machine morality, a machine or a tool is given a mechanism that prevents its immoral use, but the mechanism is within the full control of the machine's or the tool's designers and users. such operational morality exists in a gun with a childproof safety mechanism, for example: a gun with a safety mechanism is neither autonomous nor sensitive to any ethical concerns related to its use. by contrast, machines with functional morality do possess a certain level of autonomy and ethical sensitivity.
they may include an ai system with significant autonomy but little ethical sensitivity, or an ai system with little autonomy but high ethical sensitivity. an autonomous drone would fall under the former category, while medethex, an ethical decision-support ai recommendation system for clinicians, would fall under the latter. lastly, systems with high autonomy and high ethical sensitivity can be regarded as having full moral agency, as much as humans do. this means that those systems would have a mental representation of values and the capacity for moral reasoning. such machines can be held morally responsible for their actions. we do not know whether ai research will be able to produce such a machine with full moral agency. if the current direction to automate more and more human tasks for cost savings and efficiency at scale continues, however, most of the more sophisticated ai applications to come will be of the kind with functional morality, particularly the kind that combines a relatively high level of autonomy with a lower level of ethical sensitivity. in the beginning of this chapter, i mentioned that the goal of ai is to create an artificial system – whether it be a piece of software or a machine with a physical body – that is as intelligent as a human in its performance, either broadly in all areas of human activities or narrowly in a specific activity, such as playing chess or driving. but what exactly does “as intelligent as a human” mean? if morality is an integral component of human-level intelligence, ai research needs to pay more attention to intelligence, not just in accomplishing a goal but also in doing so ethically.[footnoteref: ] in that light, it is meaningful to ask what level of autonomy and ethical sensitivity a given ai system is equipped with, and what level of machine morality is appropriate for its purpose. [ : here, i regard intelligence as the ability to accomplish complex goals following tegmark . for more discussion on intelligence and goals, see chapter and chapter .] in designing an ai system, it would also be helpful to determine what level of autonomy and ethical sensitivity would be best suited for the purpose of the system and whether it is feasible to provide that level of machine morality for the system in question. the narrower the function or the domain of an ai system is, the easier it will be to equip it with an appropriate level of autonomy and ethical sensitivity. in evaluating and designing an ai system, it will be important to test the actual outcome against the anticipated outcome in different types of cases in order to identify potential problems. system-wide audits to detect well-known biases, such as gender discrimination or racism, can serve as an effective strategy.[footnoteref: ] other undetected problems may surface only after the ai system is deployed. having a mechanism to continually test an ai algorithm to identify those unnoticed problems and feeding the test results back into the algorithm for retraining will be another way to deal with algorithmic biases. those who build ai systems will also benefit from consulting existing principles and guidelines such as fat/ml’s “principles for accountable algorithms and a social impact statement for algorithms.”[footnoteref: ] [ : these audits are far from foolproof, but the detection of hidden biases will be crucial in making ai algorithms more accountable and their decisions more ethical. a debiasing algorithm can also be used during the training stage of an ai algorithm to reduce hidden biases in training data. see amini et al. , knight b, and courtland . ] [ : see https://www.fatml.org/resources/principles-for-accountable-algorithms. other principles and guidelines include “ethics guidelines for trustworthy ai” (https://ec.europa.eu/digital-single-market/en/news/ethics-guidelines-trustworthy-ai) and “algorithmic impact assessments: a practical framework for public agency accountability” (https://ainowinstitute.org/aiareport .pdf). ]
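to make the idea of a system-wide audit concrete, here is a minimal python sketch with entirely hypothetical groups and evaluation records; it simply compares a system’s error rate across subgroups to flag gaps like the facial recognition disparity discussed earlier in this chapter.

from collections import defaultdict

def audit_by_group(records):
    # records: iterable of (subgroup, predicted, actual) tuples
    totals, errors = defaultdict(int), defaultdict(int)
    for group, predicted, actual in records:
        totals[group] += 1
        if predicted != actual:
            errors[group] += 1
    return {g: errors[g] / totals[g] for g in totals}

# hypothetical evaluation data
results = [
    ("group_a", "match", "match"),
    ("group_a", "match", "match"),
    ("group_b", "match", "no_match"),
    ("group_b", "no_match", "no_match"),
]

rates = audit_by_group(results)
baseline = min(rates.values())
for group, rate in sorted(rates.items()):
    flag = "  <-- investigate" if rate > baseline + 0.1 else ""
    print(f"{group}: error rate {rate:.2f}{flag}")

an audit like this only surfaces gaps for the groups one thinks to encode, which is one reason the footnote above stresses that such audits are far from foolproof.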
we may also want to rethink how and where we apply ai. we and our society do not have to use ai to equip all our systems and machines with human- or superhuman-level performance. this is particularly so if the pursuit of such human- or superhuman-level performance is likely to increase the possibility of the ai algorithms making unethical decisions that negatively impact a significant number of people. we do not have to task ai with automating away human work and decisions as much as possible. what if we reframe ai’s role in the context of assisting people rather than replacing them, so that people can become more intelligent and more capable in areas where they struggle, such as critical thinking, civic participation, healthy living, and financial literacy, or where they experience disadvantages, such as dyslexia or hearing loss? what kind of ai-driven information systems and environments would be created if libraries approached ai with such intention from the beginning?

references

alexander, larry, and michael moore. . “deontological ethics.” in the stanford encyclopedia of philosophy, edited by edward n. zalta, winter . metaphysics research lab, stanford university. https://plato.stanford.edu/archives/win /entries/ethics-deontological/.
amini, alexander, ava p. soleimany, wilko schwarting, sangeeta n. bhatia, and daniela rus. . “uncovering and mitigating algorithmic bias through learned latent structure.” in proceedings of the aaai/acm conference on ai, ethics, and society, – . aies ’ . new york, ny, usa: association for computing machinery. https://doi.org/ . / . .
averkamp, shawn, and julie hardesty. . “ai is such a tool: keeping your machine learning outputs in check.” presented at the code lib conference, pittsburgh, pa, march . https:// .code lib.org/talks/ai-is-such-a-tool-keeping-your-machine-learning-outputs-in-check.
blewer, ashley, bohyun kim, and eric phetteplace. . “reflections on code lib .” acrl techconnect (blog). march , . https://acrl.ala.org/techconnect/post/reflections-on-code lib- /.
boden, margaret a. . ai: its nature and future. oxford: oxford university press.
bonnefon, jean-françois, azim shariff, and iyad rahwan. . “the social dilemma of autonomous vehicles.” science ( ): – . https://doi.org/ . /science.aaf .
clark, liat. . “google’s artificial brain learns to find cat videos.” wired, june , . https://www.wired.com/ / /google-x-neural-network/.
conitzer, vincent, walter sinnott-armstrong, jana schaich borg, yuan deng, and max kramer. . “moral decision making frameworks for artificial intelligence.” in proceedings of the thirty-first aaai conference on artificial intelligence, – . aaai’ . san francisco, california, usa: aaai press.
courtland, rachel. . “bias detectives: the researchers striving to make algorithms fair.” nature ( ): – . https://doi.org/ . /d - - - .
cushman, fiery, liane young, and marc hauser. . “the role of conscious reasoning and intuition in moral judgment: testing three principles of harm.” psychological science ( ): – .
davis, daniel l. . “who decides: man or machine?” armed forces journal, november. http://armedforcesjournal.com/who-decides-man-or-machine/.
davis, hannah. . “a dataset is a worldview.” towards data science. march , . https://towardsdatascience.com/a-dataset-is-a-worldview- dd d.
foot, philippa. . “the problem of abortion and the doctrine of double effect.” oxford review : – .
heaven, will douglas. . “deepmind’s ai can now play all atari games—but it’s still not versatile enough.” mit technology review, april , . https://www.technologyreview.com/ / / / /deepminds-ai- -atari-games-but-its-still-not-versatile-enough/.
international committee of the red cross. . “what are jus ad bellum and jus in bello?,” september. https://www.icrc.org/en/document/what-are-jus-ad-bellum-and-jus-bello- .
kahn, leonard. . “military robots and the likelihood of armed combat.” in robot ethics: the ethical and social implications of robotics, edited by patrick lin, keith abney, and george a. bekey, – . intelligent robotics and autonomous agents. cambridge, mass.: mit press.
knight, will. . “the dark secret at the heart of ai.” mit technology review, april , . https://www.technologyreview.com/ / / / /the-dark-secret-at-the-heart-of-ai/.
———. a. “two rival ai approaches combine to let machines learn about the world like a child.” mit technology review, april , . https://www.technologyreview.com/ / / / /two-rival-ai-approaches-combine-to-let-machines-learn-about-the-world-like-a-child/.
———. b. “ai is biased. here’s how scientists are trying to fix it.” wired, december , . https://www.wired.com/story/ai-biased-how-scientists-trying-fix/.
koch, christof. . “how the computer beat the go master.” scientific american. march , . https://www.scientificamerican.com/article/how-the-computer-beat-the-go-master/.
le, quoc v., and mike schuster. . “a neural network for machine translation, at production scale.” google ai blog (blog). september , . http://ai.googleblog.com/ / /a-neural-network-for-machine.html.
lohr, steve. . “facial recognition is accurate, if you’re a white guy.” new york times, february , . https://www.nytimes.com/ / / /technology/facial-recognition-race-artificial-intelligence.html.
mahlangu, isaac. . “meet libby - the new robot library assistant at the university of pretoria’s hatfield campus.” sowetanlive. june , . https://www.sowetanlive.co.za/news/south-africa/ - - -meet-libby-the-new-robot-library-assistant-at-the-university-of-pretorias-hatfield-campus/.
markoff, john. . “how many computers to identify a cat? , .” new york times, june , .
marr, bernard. . “how ai and machine learning are transforming law firms and the legal sector.” forbes, may , . https://www.forbes.com/sites/bernardmarr/ / / /how-ai-and-machine-learning-are-transforming-law-firms-and-the-legal-sector/.
pariser, eli. . the filter bubble: how the new personalized web is changing what we read and how we think. new york: penguin press.
porter, jon. . “the pixel buds’ translation feature is coming to all headphones with google assistant.” the verge. october , . https://www.theverge.com/circuitbreaker/ / / / /pixel-buds-google-translate-google-assistant-headphones.
price, gary. . “the library of congress posts solicitation for a machine learning/deep learning pilot program to ‘maximize the use of its digital collection.’” lj infodocket. june , . https://www.infodocket.com/ / / /library-of-congress-posts-solicitation-for-a-machine-learning-deep-learning-pilot-program-to-maximize-the-use-of-its-digital-collection-library-is-looking-for-r/.
sharkey, noel. . “killing made easy: from joysticks to politics.” in robot ethics: the ethical and social implications of robotics, edited by patrick lin, keith abney, and george a. bekey, – . intelligent robotics and autonomous agents. cambridge, mass.: mit press.
singer, peter. . “ethics and intuitions.” the journal of ethics ( / ): – .
sinnott-armstrong, walter. . “consequentialism.” in the stanford encyclopedia of philosophy, edited by edward n. zalta, summer . metaphysics research lab, stanford university. https://plato.stanford.edu/archives/sum /entries/consequentialism/.
stanley, jay. . “pitfalls of artificial intelligence decisionmaking highlighted in idaho aclu case.” american civil liberties union (blog). june , . https://www.aclu.org/blog/privacy-technology/pitfalls-artificial-intelligence-decisionmaking-highlighted-idaho-aclu-case.
talley, nancy b. . “imagining the use of intelligent agents and artificial intelligence in academic law libraries.” law library journal ( ): – .
tashea, jason. . “courts are using ai to sentence criminals. that must stop now.” wired, april , . https://www.wired.com/ / /courts-using-ai-sentence-criminals-must-stop-now/.
tegmark, max. . life . : being human in the age of artificial intelligence. new york: alfred knopf.
thomson, judith jarvis. . “killing, letting die, and the trolley problem.” the monist ( ): – .
turek, matt. n.d. “explainable artificial intelligence.” defense advanced research projects agency. https://www.darpa.mil/program/explainable-artificial-intelligence.
vallor, shannon. . “moral deskilling and upskilling in a new machine age: reflections on the ambiguous future of character.” philosophy & technology ( ): – . https://doi.org/ . /s - - - .
wallach, wendell, and colin allen. . moral machines: teaching robots right from wrong. oxford: oxford university press.
welch, chris. . “google just gave a stunning demo of assistant making an actual phone call.” the verge. may , . https://www.theverge.com/ / / / /google-assistant-makes-phone-call-demo-duplex-io- .

towards a chicago place name dataset: from back-of-the-book index to a labeled dataset

ana lucic and john shanahan, university of illinois and depaul university

may ,

introduction

reading chicago reading[footnoteref: ] is a grant-supported digital humanities project that takes as its object the “one book one chicago” (oboc) program[footnoteref: ] of the chicago public library. since fall , one book one chicago has fostered community through reading and discussion. on its “big read” website, the library of congress includes information about one book programs around the united states,[footnoteref: ] and the american library association (ala) also provides materials with which a library can build its own one book program and, in this way, bring members of its community together in conversation.[footnoteref: ] while community reading programs are not a new phenomenon and exist in various formats and sizes, the one book one chicago program is notable because of its size (the chicago public library has local branches) as well as its history (the program has been in existence for nearly years). although relatively common, book clubs and community-based reading programs have not been subjects of sustained study with the quantitative methods of data science. [ : the reading chicago reading project (https://dh.depaul.press/reading-chicago/) gratefully acknowledges the support of the national endowment for the humanities office of digital humanities, hathitrust, and lyrasis.]
[ : see https://www.chipublib.org/one-book-one-chicago/.] [ : see http://read.gov/resources/.] [ : see http://www.ala.org/tools/programming/onebook.]

the following research questions have been guiding the reading chicago reading project so far: can we predict the future circulation of a book using a predictive model based on circulation, demographics, and text characteristics? how did different neighborhoods in a diverse but also segregated city respond to particular book choices? have certain books been more popular than others around the city as measured by branch-level circulation, and can these changes in checkout totals be correlated with cpl outreach work? a related question is the focus of this paper: the association of place names with sentiment in chicago-themed books; namely, what trends emerge from spatial analysis that adds sentiment to geographic locations? exploration of these questions, and our attempt to find a solution for some of them, has made us reflect on innovative services that libraries can offer. we will discuss this possibility in the last section of this paper.

chicago as a place name

thus far, the reading chicago reading project has focused its analysis on seven recent oboc book selections and their respective “seasons” of public outreach programming:

· fall of : saul bellow’s the adventures of augie march
· spring of : yiyun li’s gold boy, emerald girl
· fall of : markus zusak’s the book thief
· – : isabel wilkerson’s the warmth of other suns
· – : michael chabon’s the amazing adventures of kavalier and clay
· – : thomas dyja’s the third coast
· – : barbara kingsolver’s animal vegetable miracle: a year of food life

all of the works listed above, spanning categories of fiction and non-fiction, are still in copyright. of the seven works, three were categorized as chicago-themed works because they take place in the chicago area in whole or in substantial part: saul bellow’s the adventures of augie march, isabel wilkerson’s the warmth of other suns, and thomas dyja’s the third coast. as part of the ongoing work of the “reading chicago reading” project, we used the secure data portal of the hathitrust research consortium to access and pre-process the in-copyright novels in our set. the hathitrust research portal permits the extraction of non-consumptive features of the works included in the digital library, even those that are still under copyright. non-consumptive features do not violate copyright restrictions as they do not allow the regular reading (“consumption”) or digital reconstruction of the full work in question. an example of a non-consumptive feature is part-of-speech information extracted either in aggregate or without the connection to its source word. the locations in the text are another example of a non-consumptive feature as long as we do not aim to extract locations with their surrounding context: while the extraction of a location from a work under copyright will not violate copyright law, the extraction of a location with its surrounding context will do so. similarly, the sentiment of a sentence also falls under the category of a “non-consumptive” feature as long as we do not extract the entire sentence along with its sentiment score. thus it was possible to utilize the hathitrust research portal to access the copyrighted works and extract both the locations and the sentiment of individual sentences.
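as an illustration of what extracting only non-consumptive features might look like, here is a minimal python sketch. spacy’s named entity recognizer and nltk’s vader sentiment scorer stand in for the stanford tools discussed below, and the sample sentence is invented; only (place name, sentiment score) pairs are kept, while the sentences themselves are discarded.

import spacy  # requires: pip install spacy && python -m spacy download en_core_web_sm
from nltk.sentiment.vader import SentimentIntensityAnalyzer  # requires nltk.download("vader_lexicon")

nlp = spacy.load("en_core_web_sm")
scorer = SentimentIntensityAnalyzer()

def non_consumptive_features(text):
    # yield (place name, sentence sentiment score) pairs, never full sentences
    for sent in nlp(text).sents:
        score = scorer.polarity_scores(sent.text)["compound"]
        for ent in sent.ents:
            if ent.label_ in ("GPE", "LOC", "FAC"):  # geo-political entity, location, facility
                yield (ent.text, score)

sample = "Augie walked down State Street. Hyde Park felt bleak that winter."
for place, score in non_consumptive_features(sample):
    print(place, round(score, 2))

in a real workflow, code like this would have to run inside the secure research portal, exporting only the derived feature pairs rather than any sentence text.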
as later paragraphs will reveal, however, we also needed to verify a number of these extractions manually, by checking the extracted references against the actual text of the work. this paper arises from the finding that the three oboc books that are set largely in or are about chicago circulated differently than the oboc books that are not, such as markus zusak’s the book thief, yiyun li’s gold boy, emerald girl, and michael chabon’s the amazing adventures of kavalier and clay. since we discovered that some cpl branches had higher circulation for “chicago” books than others, we wanted (1) to determine which place names were featured in the three books, and (2) to quantify and then examine the sentiment associated with these places. although recognizing a well-defined place in a text is no longer a difficult task thanks to the development of named entity recognizers such as the stanford named entity recognizer,[footnoteref: ] opennlp,[footnoteref: ] spacy,[footnoteref: ] and nltk,[footnoteref: ] recognizing whether a place name is a reference to a chicago location is a harder task. if chicago is the setting or one of the main topics of the book, then we can assume that a number of the locations mentioned will be chicago places. however, if information about the topicality or locality of the book is not known in advance, or if the plot of the book moves from location to location, then the task of verifying through automated methods whether a place name is a chicago location is much harder. [ : see https://nlp.stanford.edu/software/crf-ner.html.] [ : see https://opennlp.apache.org/.] [ : see https://spacy.io/.] [ : see https://www.nltk.org/book/ch .html.] with the help of linkedgeodata[footnoteref: ] we were able to obtain all the chicago place names that were identified by volunteers through the openstreetmap project[footnoteref: ] and download a listing that included chicago buildings, theaters, restaurants, streets, and other prominent places. while this is very useful, we realized that we were missing historical place names. at the same time, the way a place name is represented in a text will not always correspond to the way it is represented in a dictionary or a knowledge graph; for example, a sentence might simply note “that building” or “her home” instead of repeating the named entity of the previous sentence. moreover, there were many examples of generic place names: how many cities in the united states have a state street, a madison street, or a st avenue? a further hindrance was determining the type of place names we wanted to identify and collect from the text: it soon became obvious that, for the purposes of visualizing place names on a map, general references to chicago went beyond the scope of the maps we wanted to create. we became more interested in tracking references to specific chicago place names, including buildings (historical and present), named areas of the city, monuments, streets, theatres, restaurants, and the like. given that our dataset comprised just three books, we were able to manually sift through the automatically identified place names and indicate whether or not each was indeed a chicago place name. prior to this, we established the sentiment of each sentence in the three books using the stanford sentiment analyzer.[footnoteref: ] our heuristic was that the sentiment score assigned to an entire sentence could be attributed to the specific place(s) mentioned in that sentence.
this assumption may not always be true, but our manual inspection of sentences and the sentiment assigned to them established that this method was fairly accurate. it should be mentioned that while we did examine some examples, we did not conduct a systematic analysis of the accuracy of the sentiment scores assigned to the corpus. [ : see http://linkedgeodata.org/about.] [ : see https://www.openstreetmap.org/.] [ : see https://nlp.stanford.edu/sentiment/. ]

figure indicates the result of our effort to integrate place names with the sentiment of the sentence.

fig. – visualizing place names associated with positive versus very negative sentiment extracted from three oboc books.

particularly telling here is the third coast quadrant, which shows a concentration of positively-associated chicago place names in the northern parts of the city along the shore of lake michigan. negative sentiment appears to be more concentrated in the central part of chicago and also in the southern parts of the city. the place names extracted from our three chicago-setting oboc books allowed us to focus on particular areas of the city such as hyde park, which is mentioned in each of them. larger circles correspond to a greater number of sentences that mention hyde park and are associated with a negative sentiment in both the adventures of augie march and the warmth of other suns. judging by fig. , on the other hand, the third coast features sentences in which hyde park is mentioned in both positive and negative contexts.

fig. – visualization of the sentences that feature “hyde park” and their sentiment from three books.

these results prompt us to continue with this line of research and to procure a larger dataset that contains chicago place names extracted from other works of literature. this would allow us to focus on specific places such as “wrigley field” or the once-famous but no longer existing “mecca” apartment building (which stood at the intersection of th and state street on the south side and was immortalized in a poetry collection by gwendolyn brooks). with a robust place name set, we could analyze the context in which these place names were mentioned in other literature, in contemporary or historical newspapers (chicago tribune, chicago sun-times, chicago defender), or in library and archival materials. one such contextual element would be the sentiment associated with the place name. our interest in creating a dataset of chicago place names extracted from literature led us to the chicago of fiction, a vast annotated bibliography by james a. kaser. published in , this work contains entries on more than , works published between and that feature chicago. kaser’s book contains several indexes that can serve as sources of labeled data, or instances in which chicago locations are mentioned. although we are still determining how many of the titles included in the annotated bibliography already exist in digital format or are accessible through the hathitrust digital library, it is likely that a subset of the total can be accessed electronically. even if the books do not exist in electronic format presently, it is still possible to use the index as a source of labeled data for chicago place names. we anticipate that such a dataset would be of interest to researchers in urban studies, literature, history, and geography.
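a minimal sketch of the matching-and-aggregation workflow described above might look as follows. the gazetteer file name is hypothetical (standing in for a listing downloaded via linkedgeodata/openstreetmap), and the sentiment thresholds are illustrative rather than the project’s actual cutoffs.

from collections import Counter, defaultdict

def load_gazetteer(path="chicago_places.txt"):
    # hypothetical one-place-per-line file of chicago place names
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def aggregate(place_sentiments, gazetteer):
    # place_sentiments: iterable of (place name, sentence sentiment score)
    tallies = defaultdict(Counter)
    for place, score in place_sentiments:
        if place.lower() not in gazetteer:
            continue  # not verifiably a chicago place; route to manual review instead
        bucket = "positive" if score > 0.05 else "negative" if score < -0.05 else "neutral"
        tallies[place.lower()][bucket] += 1
    return tallies

# usage with the (place, score) pairs from the earlier sketch:
# tallies = aggregate(non_consumptive_features(text), load_gazetteer())
# tallies["hyde park"] would then hold counts like Counter({"negative": 12, "positive": 5})

per-place tallies of this shape are exactly what the maps in the figures above visualize, with circle size proportional to the counts.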
a sufficiently large number of sentences featuring chicago place names would enable us to proceed in the direction of a chicago place name recognizer that can “learn” chicago context, or to examine how much context is sufficient to establish whether a “madison street” in a text happens to be located in chicago or elsewhere.

how do libraries innovate? from print index to labeled data

over the last decade, libraries have taken on additional services related to the development and preservation of digital scholarship projects. librarians frequently assist faculty and students with the development of digital humanities or digital scholarship projects. they point patrons to portals where data can be found and help with licensing questions. librarians also procure datasets, and some perform data cleaning and pre-processing tasks; yet it is not so common for librarians to participate in the creation of a dataset. a relatively recent initiative, however, collections as data,[footnoteref: ] directly tackles the issue of treating research, library, and cultural heritage collections as data and providing access to them. this ongoing initiative aims to create projects that can serve as a model for other libraries for making collections accessible as data. [ : see https://collectionsasdata.github.io/part whole/. ] the data that undergird the mechanisms of library workings – circulation records for physical and digital objects, metadata records, and the like – are not commonly available as datasets open to machine learning tasks. if they were, not only could libraries refer others to the already created and annotated physical and digital objects, but they could also participate in creating datasets that are local to their settings. creation and curation of such datasets could in turn help establish new relationships between area libraries and local communities. one can imagine a new “data challenge,” for instance, in which libraries assemble a community by building a dataset relevant to that community. such an effort would need to be preceded by an assessment of the data needs and interests of that particular community. in the case of a chicago place name dataset challenge, efforts could revolve around local communities adding sentences to the dataset from literary sources. the second step might involve organizing a challenge of building a place name recognizer model based on the sentences gathered. one can also imagine turning metadata records into curated datasets that are shared with local communities and with teachers and university lecturers for use in the classroom. once a dataset is built, scenarios can be invented for using it. this kind of work invites conversations with faculty members about their needs and about potential datasets that would be of particular interest. creation of datasets based on the unique materials at their disposal will enrich the palette of services already offered by libraries. one of the main goals of the reading chicago reading project was the creation of a model that can predict the circulation of a one book one chicago selection given parameters such as prior circulation for the book, text characteristics, and locality of the work. we are not aware of other predictive models that integrate circulation records with text features extracted from the books in this way.
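purely as an illustration of the shape such a predictive model could take, and not the project’s actual implementation, here is a sketch using scikit-learn; every feature name and value below is hypothetical.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# each row describes one (book, branch) pair with made-up features:
# [prior annual circulation, mean sentence sentiment,
#  proportion of chicago place names, scaled branch-area indicator]
X = np.array([
    [1200, 0.10, 0.30, 0.8],
    [400, -0.05, 0.02, 0.5],
    [2500, 0.20, 0.45, 1.1],
    [800, 0.00, 0.05, 0.9],
])
y = np.array([1500, 350, 3100, 700])  # checkouts during the oboc season

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print(model.predict(np.array([[900, 0.15, 0.40, 0.7]])))  # predicted circulation

with real data, the interesting work lies in constructing and validating features from circulation records and text, not in the choice of model class.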
given that circulation records are not commonly integrated with other data sources when they are analyzed, linking different data sources with circulation records is another challenging opportunity that this paper envisions. while organizing a data challenge takes a lot of effort, it can be an effective way of reaching out to one’s local community and identifying its data needs. ultimately, libraries can play a dynamic role in both managing and creating data and datasets that can be shared with the members of local communities. reading chicago reading continues to make steps in this direction. we are prototyping back-of-the-book indexes as a source of labeled place name data, and we aim to make available a curated list of sentences that mention chicago place names in the three oboc selections that feature chicago. we will invite the public and scholars to add sentences extracted from other literature, perhaps issuing a data challenge to build local energy. the end goal, the creation of a labeled training dataset for a chicago place name recognizer, will, we hope, enable new avenues of research that we haven’t even foreseen.

references

american library association. n.d. “one book one community.” programming & exhibitions (website). accessed may , . http://www.ala.org/tools/programming/onebook.
bird, steven, edward loper, and ewan klein. . natural language processing with python. sebastopol, ca: o’reilly media inc.
chicago public library. n.d. “one book one chicago.” accessed may , . https://www.chipublib.org/one-book-one-chicago/.
“collections as data: part to whole.” n.d. accessed may , . https://collectionsasdata.github.io/part whole/.
finkel, jenny rose, trond grenager, and christopher manning. . “incorporating non-local information into information extraction systems by gibbs sampling.” in proceedings of the nd annual meeting of the association for computational linguistics (acl ), - . https://www.aclweb.org/anthology/p - /.
hathitrust digital library. n.d. accessed may , . https://www.hathitrust.org/.
kaser, james a. . the chicago of fiction: a resource guide. lanham: scarecrow press.
library of congress. “local/community resources.” n.d. read.gov. accessed may , . http://read.gov/resources/.
linkedgeodata. “about / news.” n.d. accessed may , . http://linkedgeodata.org/about.
manning, christopher d., mihai surdeanu, john bauer, jenny finkel, steven j. bethard, and david mcclosky. . “the stanford corenlp natural language processing toolkit.” in proceedings of the nd annual meeting of the association for computational linguistics: system demonstrations, - . https://www.aclweb.org/anthology/p - /.
openstreetmap. n.d. accessed may , . https://www.openstreetmap.org/.
reading chicago reading. “about reading chicago reading.” n.d. accessed may , . https://dh.depaul.press/reading-chicago/about/.
cross-disciplinary ml research is like happy marriages: five strengths and two examples

meng jiang (mjiang @nd.edu), computer science and engineering, university of notre dame

abstract

in this essay, i use a metaphor to describe the strengths and challenges of cross-disciplinary machine learning research: successful cross-disciplinary ml research is like happy marriages. among the top strengths of happy marriages, at least five can be reflected in cross-disciplinary ml research, including “discuss problems well,” “handle differences creatively,” and “maintain a good balance of time alone and together.” i use two examples from my personal experience (as a computer scientist) of collaborating with researchers from multiple disciplines (e.g., historians, psychologists, it technicians) to illustrate.

top strengths in ml+x collaboration

cross-disciplinary research refers to research and creative practices that involve two or more academic disciplines (jeffrey ; karniouchina, victorino, and verma ). these activities may range from those that simply place disciplinary insights side by side to much more integrative or transformative approaches (aagaard‐hansen ; muratovski ). cross-disciplinary research matters because (1) it provides an understanding of complex problems that require a multifaceted approach to solve; (2) it combines disciplinary breadth with the ability to collaborate and synthesize varying expertise; (3) it enables researchers to reach a wider audience and communicate diverse viewpoints; (4) it encourages researchers to confront questions that traditional disciplines do not ask while opening up new areas of research; and (5) it promotes disciplinary self-awareness about methods and creative practices (urquhart et al. ; o'rourke, crowley, and gonnerman ; miller and leffert ). one of the most popular cross-disciplinary research topics/programs is machine learning + x (or data science + x). machine learning (ml) is a method of data analysis that automates analytical model building. it is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns, and make decisions with minimal human intervention. ml has been used in a variety of applications (murthy ), such as email filtering and computer vision; however, most applications still fall in the domain of computer science and engineering. recently, the power of ml+x, where x can be any other discipline (such as physics, chemistry, biology, sociology, and psychology), has become well recognized. ml tools can reveal profound insights hiding in ballooning datasets (kohavi et al. ; pedregosa et al. ; kotsiantis ; mullainathan and spiess ). however, cross-disciplinary research, of which ml+x is a part, is challenging. collaborating with investigators outside one’s own field requires more than just adding a co-author to a paper or proposal. true collaborations will not always be without conflict – lack of information leads to misunderstandings.
for example, ml experts would have little domain knowledge in the field of x, and researchers in x might not understand ml either. the knowledge gap limits the progress of collaborative research. so how can we start and manage successful cross-disciplinary research? what can we do to facilitate collaborative behaviors? in this essay, i will compare cross-disciplinary ml research to “happy marriages,” discussing some characteristics they share. specifically, i will present the top strengths of conducting cross-disciplinary ml research and give two examples based on my experience of collaborating with historians and psychologists. marriage is one of the most common “collaborative” behaviors. couples expect to have happy marriages, just like collaborators expect to have successful project outcomes (robinson and blanton ; pettigrew ; xu et al. ). extensive studies have revealed the top strengths of happy marriages (defrain and asay ; gordon and baucom ; prepare/enrich, n.d.), which can be reflected in cross-disciplinary ml research. here i focus on five of them:

1. collaborators (“partners” in the language of marriage) are satisfied with communication.
2. collaborators feel very close to each other.
3. collaborators discuss their problems well.
4. collaborators handle their differences creatively.
5. there is a good balance of time alone (i.e., individual research work) and together (meetings, discussions, etc.).

first of all, communication is the exchange of information to achieve a better understanding, and collaboration is defined as the process of working together with another person to achieve an end goal. effective collaboration is about sharing information, knowledge, and resources to work together through satisfactory communication. ineffectiveness or lack of communication is one of the biggest challenges in ml+x collaboration. second, researchers in different disciplines can collaborate only when they recognize mutual interest and feel that the research topics they have studied in depth are very close to each other. collaborators must be interested in solving the same, big problem. third, researchers in different disciplines meet different challenges through the process of collaboration. making the challenges clear to understand and finding solutions together is the core of effective collaboration. fourth, collaborators must embrace their differences in concepts and methods and take advantage of them. for example, one researcher can introduce a complementary method to the mix of methods that the collaborator has been using for a long time; or one can have a new, impactful dataset and evaluation method to test the techniques proposed by the other. fifth, in strong collaboration, there is a balance between separateness and togetherness. meetings are an excellent use of time for integrating perspectives and holding productive discourse around difficult decisions. however, excessive collaboration happens when researchers are depleted by too many meetings and emails. it can lead to inefficient, unproductive meetings. so it is important to find a balance. next, i, as a computer scientist and ml expert, will discuss two ml+x collaborative projects. ml experts bring mathematical modeling and computational methods for mining knowledge from data. the solutions usually have good generalizability; however, they still need to be tailored for specialized domains or disciplines.
example 1: ml + history

the history professor liang cai and i have collaborated on an international research project titled “digital empires: structured biographical and social network analysis of early chinese empires.” dr. cai is well known for her contributions to the fields of early chinese empires, classical chinese thought (in particular, confucianism and daoism), digital humanities, and the material culture and archaeological texts of early china (cai ). our collaboration explores how digital humanities expand the horizon of historical research and help visualize the research landscape of chinese history. historical research is often constrained by sources and the human cognitive capacity for processing them. ml techniques may enhance historians’ abilities to organize and access sources as they like. ml techniques can even create new kinds of sources at scale for historians to interpret. “the historians pose the research questions and visualize the project,” said cai. “the computer scientists can help provide new tools to process primary sources and expand the research horizon.” we conducted a structured biographical analysis to leverage the development of machine learning techniques, such as neural sequence labeling and textual pattern mining, which allowed classical sources of chinese empires to be represented in an encoded way. the project aims to build a digital biographical database that sorts out different attributes of all recorded historical actors in available sources. breaking with traditional formats, ml+history creates new opportunities and augments our way of understanding history. first, it helps scholars, especially historians, change their research paradigm, allowing them to generalize their arguments with sufficient examples. ml techniques can find all examples in the data, whereas manual investigation may miss some. also, abnormal cases can indicate a new discovery. as far as early chinese empires are concerned, ml promises to automate mining and encoding all available biographical data, which allows scholars to change the perspective from one person to a group of persons with shared characteristics, and to shift from analyzing examples to relating a comprehensive history. therefore, scholars can identify general trends efficiently and present an information-rich picture of historical reality using ml techniques. second, the structured data produced by ml techniques revolutionize the questions researchers ask, thereby changing the research landscape. because of the lack of efficient tools, there are numerous interesting questions scholars would like to ask but cannot. for example, the geographical mobility of historical actors is an intriguing question for early china, the answer to which would show how diversified regions were integrated into a unified empire. nevertheless, an individual historian cannot efficiently process the massive amount of information preserved in the sources. with ml techniques, we can generate fact tuples that sort out the geographical origins of all available historical actors and provide comprehensive data for historians to analyze.

[figure: the social network of officials who served in the government about , years ago in china, describing their relationships and personal attributes.]
the following table shows examples of patterns mined by ml and the relations they extract:

pattern mined by ml tech | extracted relation | example tuples
$per_x …從 $per_y 受 $klg | $per_x was taught by $per_y on $klg (knowledge) | (張禹, 施讎, 易), (施讎, 田王孫, 易), (眭弘, 嬴公, 春秋)
$per_x … 事 $per_y | $per_x was taught/mentored by $per_y | (司馬相如, 孝景帝), (尹齊, 張湯)
$per_x … 授 $per_y | $per_x taught $per_y | (孟喜, 后蒼、疏廣), (王式, 龔舍)
$per … $loc 人也 | $per place_of_birth $loc | (張敞, 河東平陽), (彭越, 昌邑)
$per 遷 $tit | $per job_title $tit | (朱邑, 北海太守), (甘延壽, 遼東太守)
$per 至 $tit | $per job_title $tit | (歐陽生, 御史大夫), (孟卿, 中山中尉)
$per 為 $tit | $per job_title $tit | (伏生, 秦博士), (司馬相如, 武騎常侍)

third, the project revolutionizes our reading habits. large datasets mined from primary sources will allow scholars to combine distant reading with original texts. the macro picture generated from data will aid in-depth analysis of an event against its immediate context. furthermore, graphics of social networks and common attributes of historical figures will change our reading habits, transforming linear storytelling to accommodate multiple narratives (see the above figure).
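as a toy illustration of how mined patterns of this kind can be applied, here is a short python sketch. the regex is a deliberately simplified stand-in for the “$per 為 $tit” pattern in the table (the actual project mines patterns from data and finds entity boundaries with neural sequence labeling, since classical chinese marks no word boundaries), and the example string is constructed to match a tuple from the table.

import re

# capture groups play the $per / $tit slots of the mined pattern "$per 為 $tit"
job_title = re.compile(r"(?P<per>.{2,4})為(?P<tit>.{2,5})")

text = "司馬相如為武騎常侍"  # constructed example behind the tuple (司馬相如, 武騎常侍)

for m in job_title.finditer(text):
    print("job_title", m.group("per"), m.group("tit"))
# prints: job_title 司馬相如 武騎常侍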
researchers from the two sides develop collaboration through the project step by step, just like developing a relationship for marriage. ours started at a faculty gathering, from some random chat about our research. as the historian was open-minded about ml technologies and the ml expert was willing to create broader impact, we brainstormed ideas that would not have developed without taking care of the five important points:

(1) communication: with our research groups, we started to meet frequently at the beginning. we set up clear goals at the early stage, including expected outcomes, publication venues, and joint proposals for funding agencies, such as the national endowment for the humanities (neh) and notre dame seed grant funding. our research groups met almost twice a week for as long as three weeks.

(2) feel very close to each other: besides holding meetings, we exchanged our instant messenger accounts so we could communicate faster than by email. we created a google drive space to share readings, documents, and presentation slides. we found many tools to create “tight relationships” between the groups at the beginning.

(3) discuss their problems well: whenever we had misunderstandings, we discussed our problems. historians learned about what a machine does, what a machine can do, and generally how a machine works toward the task. ml people learned what is interesting to historians and what kind of information is valuable. we held the principle that if a problem exists, it makes sense: any problem that anyone encounters is worth a discussion. we needed to solve problems together from the moment they became our problems.

(4) handle their differences creatively: historians are among the few who can read and write in classical chinese. classical chinese was used as the written language from over , years ago to the early th century. since then, mainland china has used either mandarin (simplified chinese) or cantonese, while taiwan has used traditional chinese. none is similar to classical chinese at all. in other words, historians work on a language that no ml experts here, even those who speak modern chinese, can understand. so we handled our language differences “creatively” by using a translated version as the intermediate medium. historians have translated history books in classical chinese into simplified chinese so we can read the simplified version. here, the idea is to let the machine learning algorithms read both versions. we find that information extraction (i.e., finding relations from text) and machine translation (i.e., from classical chinese to modern chinese) can mutually enhance each other, which turns out to be one of our novel technical contributions to the field of natural language processing.

(5) good balance of time alone and together: after the first month, since the project goal, datasets, background knowledge, and many other aspects were clear in both sides’ minds, we had regular meetings in a less intensive manner. we met two or three times a month so that computer science students could focus on developing machine learning algorithms, and only when significant progress was made or expert evaluation was needed would we schedule a quick appointment with prof. liang cai.

so far, we have published peer-reviewed papers on the topic of information extraction and entity retrieval in classical chinese history books using ml (ma et al. ; zeng et al. ). we have also submitted joint proposals to neh with the above work as preliminary results.

example 2: ml + psychology

i am working with drs. ross jacobucci and brooke ammerman in psychology to apply ml to understand mental health problems and suicidal intentions. suicide is a serious public health problem; however, suicides are preventable with timely, evidence-based interventions. social media platforms have been serving users who, experiencing real-time suicidal crises, hope to receive peer support. to better understand the helpfulness of peer support occurring online, we characterize the content of both a user’s post and the corresponding peer comments occurring on a social media platform and present an empirical example for comparison. we have designed a new topic-model-based approach to finding the topics of user and peer posts in social media forum data. the key advantages include: (i) modeling both the generative process of each type of corpus (i.e., user posts and peer comments) and the associations between them, and (ii) using phrases, which are more informative and less ambiguous than words alone, to represent social media posts and topics. we evaluated the method using data from reddit’s r/suicidewatch community. we examined how the topics of user and peer posts were associated and how this information influenced the perceived helpfulness of peer support. then, we applied structural topic modeling to data collected from individuals with a history of suicidal crisis as a means to validate findings. our observations suggest that effective modeling of the association between the two lines of topics can uncover helpful peer responses to online suicidal crises, notably the suggestion to pursue professional help. our technology can be applied to “paired” corpora in many applications such as tech support forums and question-answering sites.
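to give a rough sense of the paired-corpora idea, though not of our actual phrase-based model, here is a simplified python sketch using gensim’s lda; the toy token lists are invented, and a real analysis would use far larger corpora and phrases rather than single words.

from collections import Counter
from gensim import corpora, models

def fit_lda(docs, k):
    # fit one topic model per corpus (user posts or peer comments)
    dictionary = corpora.Dictionary(docs)
    bow = [dictionary.doc2bow(d) for d in docs]
    return models.LdaModel(bow, num_topics=k, id2word=dictionary, random_state=0), dictionary

def top_topic(lda, dictionary, doc):
    # most probable topic for one tokenized document
    return max(lda.get_document_topics(dictionary.doc2bow(doc)), key=lambda t: t[1])[0]

posts = [["feel", "hopeless", "alone"], ["lost", "job", "cannot", "cope"]]
comments = [["please", "call", "professional", "help"], ["been", "there", "it", "gets", "better"]]

post_lda, post_dict = fit_lda(posts, k=2)
com_lda, com_dict = fit_lda(comments, k=2)

# count which (post topic, comment topic) pairs co-occur across paired documents
pairs = Counter(
    (top_topic(post_lda, post_dict, p), top_topic(com_lda, com_dict, c))
    for p, c in zip(posts, comments)
)
print(pairs)

the actual model ties the two sides together generatively rather than cross-tabulating after the fact, which is advantage (i) described above.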
this project started with a talk i gave at the psychology graduate seminar. the fun thing is that dr. jacobucci was not able to attend the talk. another psychology professor who attended my talk asked constructive questions and mentioned my research to dr. jacobucci when they met later. so dr. jacobucci dropped me an email, and we had coffee together. cross-disciplinary research often starts from something that sounds like developing a relationship. because, again, the psychologists were open-minded about ml technologies and the ml expert was willing to create broader impact, we successfully brainstormed ideas when we had coffee, but this would not have developed into a long-term collaboration without the following efforts:

(1) communicate intensively between research groups at the early stage. we had multiple meetings a week to make the goals clear.

(2) get students involved in the process. when my graduate student received more and more advice from the psychology professors and students, the connections between the two groups became stronger.

(3) discuss the challenges in our fields very well. we analyzed together whether machine learning would be capable of addressing the challenges in mental health. we also analyzed whether domain experts could be involved in the loop of machine learning algorithms.

(4) handle our differences. we separately presented our research and then found time to work together to merge our sets of slides based on one common vision and goal.

(5) after the first month, only hold meetings when discussion is needed or there is an approaching deadline for either a paper or a proposal.

we have enjoyed our collaboration and the power of cross-disciplinary research. our joint work is under review at nature palgrave communications. we have also submitted joint proposals to nih with this work as preliminary results (jiang et al. ).

conclusions

in this essay, i used a metaphor comparing cross-disciplinary ml research to “happy marriages.” i discussed five characteristics they share. specifically, i presented the top strengths of producing successful cross-disciplinary ml research: (1) partners are satisfied with communication. (2) partners feel very close to each other. (3) partners discuss their problems well. (4) partners handle their differences creatively. (5) there is a good balance of time alone (i.e., individual research work) and together (meetings, discussions, etc.). while every project is different and will produce its own challenges, my experience of collaborating with historians and psychologists according to the happy marriage paradigm suggests that it is a simple and strong paradigm that could help other interdisciplinary projects develop into successful, long-term collaborations.

references

aagaard‐hansen, jens. . “the challenges of cross‐disciplinary research.” social epistemology , no. (october-december): - . https://doi.org/ . / .
cai, liang. . witchcraft and the rise of the first confucian empire. albany: suny press.
defrain, john, and sylvia m. asay. . “strong families around the world: an introduction to the family strengths perspective.” marriage & family review , no. - (august): - . https://doi.org/ . /j v n _ .
gordon, cameron l., and donald h. baucom. . “examining the individual within marriage: personal strengths and relationship satisfaction.” personal relationships , no. (september): - . https://doi.org/ . /j. - . . .x.
jeffrey, paul. . “smoothing the waters: observations on the process of cross-disciplinary research collaboration.” social studies of science , no. (august): - .
jiang, meng, brooke a. ammerman, qingkai zeng, ross jacobucci, and alex brodersen. . “phrase-level pairwise topic modeling to uncover helpful peer responses to online suicidal crises.” humanities and social sciences communications : - .
karniouchina, ekaterina v., liana victorino, and rohit verma. . “product and service innovation: ideas for future cross-disciplinary research.” the journal of product innovation management , no. (may): - .
kohavi, ron, george john, richard long, david manley, and karl pfleger. . “mlc++: a machine learning library in c++.” in proceedings of the sixth international conference on tools with artificial intelligence, - . n.p.: ieee. https://doi.org/ . /tai. . .
kotsiantis, s.b. . “use of machine learning techniques for educational proposes [sic]: a decision support system for forecasting students’ grades.” artificial intelligence review , no. (may): - . https://doi.org/ . /s - - -x.
ma, yihong, qingkai zeng, tianwen jiang, liang cai, and meng jiang. “a study of person entity extraction and profiling from classical chinese historiography.” in proceedings of the nd international workshop on entity retrieval, edited by gong cheng, kalpa gunaratna, and jun wang, - . n.p.: international workshop on entity retrieval. http://ceur-ws.org/vol- /.
miller, eliza c., and lisa leffert. . “building cross-disciplinary research collaborations.” stroke , no. (march): e -e . https://doi.org/ . /strokeaha. . .
mullainathan, sendhil, and jann spiess. . “machine learning: an applied econometric approach.” journal of economic perspectives , no. (spring): - . https://doi.org/ . /jep. . . .
muratovski, gjoko. . “challenges and opportunities of cross-disciplinary design education and research.” in proceedings from the australian council of university art and design schools (acuads) conference: creativity: brain - mind - body, edited by gordon bull. canberra, australia: acuads conference. https://acuads.com.au/conference/article/challenges-and-opportunities-of-cross-disciplinary-design-education-and-research/.
murthy, sreerama k. . “automatic construction of decision trees from data: a multi-disciplinary survey.” data mining and knowledge discovery , no. (december): - . https://doi.org/ . /a: .
o'rourke, michael, stephen crowley, and chad gonnerman. . “on the nature of cross-disciplinary integration: a philosophical framework.” studies in history and philosophy of science part c: studies in history and philosophy of biological and biomedical sciences (april): - . https://doi.org/ . /j.shpsc. . . .
pedregosa, fabian et al. . “scikit-learn: machine learning in python.” the journal of machine learning research : - . http://www.jmlr.org/papers/v /pedregosa a.html.
pettigrew, simone f. . “ethnography and grounded theory: a happy marriage?” in association for consumer research conference proceedings, edited by stephen j. hoch and robert j. meyer, - . provo, ut: association for consumer research. https://www.acrwebsite.org/volumes/ /volumes/v /.
prepare/enrich. n.d. “national survey of marital strengths.” prepare/enrich (website). accessed january , . https://www.prepare-enrich.com/pe_main_site_content/pdf/research/national_survey.pdf.
robinson, linda c., and priscilla w. blanton. . “marital strengths in enduring marriages.” family relations: an interdisciplinary journal of applied family studies , no. (january): - . https://doi.org/ . / .
urquhart, r., e. grunfeld, l. jackson, j. sargeant, and g. a. porter. . “cross-disciplinary research in cancer: an opportunity to narrow the knowledge–practice gap.” current oncology , no. (december): e -e . https://doi.org/ . /co. . .
xu, anqi, xiaolin xie, wenli liu, yan xia, and dalin liu. . “chinese family strengths and resiliency.” marriage & family review , no. - (august): - . https://doi.org/ . /j v n _ .
zeng, qingkai, mengxia yu, wenhao yu, jinjun xiong, yiyu shi, and meng jiang. “faceted hierarchy: a new graph type to organize scientific concepts and a construction method.” in proceedings of the thirteenth workshop on graph-based methods for natural language processing (textgraphs- ), edited by dmitry ustalov, swapna somasundaran, peter jansen, goran glavaš, martin riedl, mihai surdeanu, and michalis vazirgiannis, - . hong kong: association for computational linguistics. https://doi.org/ . /v /d - .