Chapter 8
Building a Machine Learning Pipeline

Audrey Altman, Digital Public Library of America

As a new machine learning (ML) practitioner, it is important to develop a mindful approach to the craft. By mindful, I mean possessing the ability to think clearly about each individual piece of the process, and understanding how each piece fits into the larger whole. In my experience, there are many good tutorials available that will help you work with an individual tool, deploy a specific algorithm, or complete a single task. It is more difficult to find guidelines for building a holistic system that supports the entire ML workflow. My aim is to help you build just such a system, so that you are free to focus on inquiry and discovery rather than struggling with infrastructure and process. I write this as a software developer who has, at one time or another, been on the wrong end of all the recommendations presented here, and hopes to save you from similar headaches. Many of the examples and design choices are drawn from my experiences at the Digital Public Library of America, where I have worked alongside a very talented team of developers. This is by no means an exhaustive text, but rather a bit of pragmatic advice and a jumping-off point for further research, designed to give you a clearer idea of which questions to ask throughout your practice.

This article reviews the basic machine learning workflow, discussing design considerations along the way. It offers recommendations for data storage, guidelines on selecting and working with ML algorithms, and questions to guide tool selection. Finally, it describes some challenges with scaling up. My hope is that the insight presented here, combined with your good judgement, will empower you to get started with the actual practice of designing and executing a machine learning project.

Algorithm selection

As you begin ingesting and preparing data, you'll want to explore possible machine learning algorithms to perform on your dataset. Choose an algorithm that fits your research question and data. If you're not sure which algorithm to choose and not constrained by time, experiment with several different options and see which one yields the best results. Start by determining what general type of learning algorithm you need, and proceed from there to research and select one that specifically addresses your research question.

In supervised learning, you train a model to predict an output condition based on given input conditions; for example, predicting whether or not a patient has some disease based on their symptoms, or the topic of a news article based on keywords in the text. In order for supervised learning to work, you need labeled training data, meaning data in which the outcome is already known. Examples include records of symptoms in patients who were known to have the disease (or not), or news articles that have already been assigned topics.

Classification and regression are both types of supervised learning. In a classification problem, you are predicting one of a discrete number of possible outcomes. For example, "based on what I know about this book, will it make the New York Times Best Seller list?" is a classification problem because there are two discrete outcomes: yes or no. Classification algorithms include naive Bayes, decision trees, and k-nearest neighbor.
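As a minimal sketch of a classification problem, consider k-nearest neighbor as implemented in scikit-learn; the features, labels, and values below are invented purely for illustration.

    from sklearn.neighbors import KNeighborsClassifier

    # Toy training data: each row describes a book with two invented
    # features (author's prior sales and marketing budget, both in
    # thousands); labels are 1 = made the best seller list, 0 = did not.
    X_train = [[120, 50], [3, 2], [450, 200], [10, 5], [80, 30], [5, 1]]
    y_train = [1, 0, 1, 0, 1, 0]

    model = KNeighborsClassifier(n_neighbors=3)
    model.fit(X_train, y_train)

    # Predict the discrete outcome (yes or no) for a new, unseen book.
    print(model.predict([[95, 40]]))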
Regression problems try to predict an outcome from a continuum of possibilities, e.g., "based on what I know about this book, what will its retail price be?" Regression algorithms include linear regression and regression trees.

In unsupervised learning, the ML algorithm discovers a new pattern. The training data is unlabeled, meaning there is no indication of how the data should be organized at the outset. A common example is clustering, in which the algorithm groups items together based on features it finds mathematically significant. Perhaps you have a collection of news articles (with no existing topic labels), and you want to discover common themes or topics that appear throughout the collection. The algorithm will not tell you what the themes or topics are, but will show which articles group together. It is then up to the researcher to work out the common thread.
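As a minimal sketch of this idea, the following clusters a handful of toy "articles" with scikit-learn's k-means implementation; the corpus, the TF-IDF vectorization, and the choice of two clusters are invented for illustration, not recommendations.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    # A toy corpus standing in for a collection of unlabeled news articles.
    articles = [
        "city council votes on new budget",
        "mayor proposes budget increase for schools",
        "local team wins championship game",
        "star player injured before playoff game",
    ]

    # Represent each article as a vector of weighted word counts, then cluster.
    vectors = TfidfVectorizer().fit_transform(articles)
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)

    # The output shows which articles group together (e.g. [0 0 1 1]),
    # but interpreting what each cluster means is left to the researcher.
    print(kmeans.labels_)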
In addition to serving your research question, your algorithm should also be a good fit for your data. Specific considerations will vary for each dataset and algorithm, so make sure you know the strengths and weaknesses of your algorithm and how they relate to the unique qualities of your dataset. For example, algorithms differ in their abilities to handle datasets with a very large number of features, handle datasets with high variance, efficiently process very large datasets, and glean meaningful intelligence from very small datasets. Is it important that your algorithm be easy to explain? Some algorithms, such as neural nets, function as black boxes, and it is difficult to decipher how they arrive at their decisions. Other algorithms, such as decision trees, are easy to understand. Can you prepare your data for the algorithm with a reasonable amount of preprocessing? Can you find examples of success (or failure) from people using similar datasets with the same algorithm? Asking these sorts of questions will help you to choose an algorithm that works well for your data, and will also inform how you prepare your data for optimal use.

Finally, consider whether or not you are constrained by time, hardware, or available toolsets. Different algorithms require different amounts of time and memory to train and/or execute. Different ML tools offer implementations of different algorithms.

The machine learning pipeline

The metaphor of a pipeline is often used for a machine learning workflow. This metaphor captures the idea of data channeled through a series of sequential transformations. However, it is important to note that each stage in the process will need to be repeated and honed throughout the course of your project. Therefore, don't think of yourself as building a single intelligent model, such as a decision tree or clustering algorithm. Instead, build a pipeline with pieces that can be swapped in and out as needed. Data flows through the pipeline and outputs a version of a decision tree, clustering algorithm, or other intelligent model. Throughout your process, you will tweak your pipeline, making many intelligent models. Eventually you will select the best model for your use case. To use another metaphor: don't build a car, build an assembly line for making cars.

While the final output of a machine learning workflow is some sort of intelligent model, there are many factors that make repetition and iteration necessary. ML processes often involve subjective decisions, such as which data points to ignore, or which configurations to select for your algorithm. You will want to test different possibilities to see what works best. As you learn more about your dataset throughout the course of the project, you will go back and tweak parts of your process. You may discover biases in your data or algorithms that need to be addressed. If you are working collaboratively, you will be incorporating asynchronous feedback from members of your team. At some point, you may need to introduce new or revised data, or try a new tool or algorithm. It is also prudent to expect and plan for errors. Human errors are inevitable, and hardware errors, such as network timeouts or memory overloads, are common. For all of these reasons, you will be well-served by a pipeline composed of modular, repeatable steps, each with discrete and stable output.

A modular pipeline supports a batch processing workflow, in which whole datasets undergo a series of transformations. During each step of the process, a large amount of data (possibly the entire dataset) is transformed all at once and then incrementally stored. This can be contrasted with a real-time workflow, in which individual records are transformed instantaneously (e.g. a librarian updates a single record in a library catalog); or a streaming workflow, in which a continuous flow of data is pushed through an entire pipeline, often without incremental storage along the way (e.g. performing analysis on a continuous stream of new tweets). Batch processing is common in the research and development phase of an ML project, and may also be a good choice for a production system.

When designing any step in the batch processing pipeline, assume that at some point you will need to repeat it, either exactly as is or with modifications. Documenting your process lets you compare the outputs of different variations and communicate the ways in which your choices impact the final results. If you're writing code, version control software can help. If you're doing more manual data manipulations, such as editing data in spreadsheets, you will need an intentional system of documenting exactly which transformations you are applying to your data. It is generally preferable to automate processes wherever possible so that you can repeat them with ease and consistency.

A concrete example from my own experience demonstrates the importance of a pipeline that supports repetition. In my first ever ML project, I worked with a set of XML library data converted to CSV. I did most of my data cleanup by hand using spreadsheet software, and was not careful about preserving the formulas for each step of the process; instead, I deleted and wrote over many important intermediate computations, saving only the final results. This whole process took me countless hours, and when an updated dataset became available, there was no way to reproduce my painstaking cleanup process. I was stuck with outdated data, and my final output was doomed to grow more and more irrelevant as time wore on. Since then, I have always written repeatable scripts for all my data cleanup tasks.

Each decision you make will have an impact on the final results, so it is important to keep clear documentation and to verify your assumptions and hypotheses wherever possible. Sometimes there will be explicit tests to perform; at other times, you may just need to look at the data: make a quick visualization, perform a simple calculation, or glance through a sample of records.
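For instance, a first look at a dataset with pandas can be as lightweight as the following (the file name is hypothetical):

    import pandas as pd

    # A hypothetical cleaned dataset produced by an earlier pipeline step.
    df = pd.read_csv("my_clean_data.csv")

    print(df.sample(10))    # glance through a random sample of records
    print(df.describe())    # simple summary statistics for each column
    print(df.isna().sum())  # quick check for unexpected missing values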
Be cognizant of the potential to introduce error or bias. For example, you could remove a field that you don't think is important, but that would, in fact, have a meaningful impact on the final result. All of these precautions will strengthen confidence in your final outcomes and make them intelligible to your collaborators and other audiences.

The pipeline for a machine learning project generally comprises five stages: data acquisition, data preparation, model training and testing, evaluation and analysis, and application of results.

Data acquisition

The first step is to acquire the data that you will be using for your machine learning project. You may need to combine data from several different sources. There are many ways to acquire data, including downloading files, querying a database or API, or scraping web pages. Depending on the size of the source data and how it is made available, this can be a quick and simple step or the most challenging bottleneck in your pipeline. However you get your initial data, it is generally a good idea to save a copy in the rawest possible form and treat that copy as immutable, at least during the initial phase of testing different algorithms or configurations. Having a raw, immutable copy of your initial dataset (or datasets) ensures that you can always go back to the beginning of your ML process and start over with exactly the same input. It will also save you from the possibility that the source data will change from beneath you, thereby compromising your ability to compare the outputs of different operations (for more on this, see the section on data storage). If possible, it's often worthwhile to learn about how the original data was created, especially if you are getting data from multiple sources that differ in subtle ways.

Data preparation

Data preparation involves cleaning data and transforming it into an appropriate format for subsequent machine learning tasks. This is often the part of the process that requires the most work, and you should expect to iterate over your data preparations many times, even after you've started training and testing models.

The first step of data preparation is to parse your acquired data and transform it into a common, usable schema. Acquired data often comes in file formats that are good for data sharing, such as XML, JSON, or CSV. You can parse these files into whatever schema makes sense to manage the various transformations you want to perform, but it can help to have a sense of where you are headed. Your eventual choice of data format will likely be dictated by your ML algorithms; likely candidates include multidimensional arrays, tensors, matrices, and DataFrames. Look ahead to specific functions in the specific libraries you plan to use, and see what type of input data is required. You don't have to use these same formats during your data preparations, though doing so can simplify the process.

Data cleanup and transformation is an art. Data is messy, and the messier the data, the harder it is to analyze and uncover underlying patterns. Yet, we are only human, and perfect data is far beyond our reach. To strike a workable balance, focus on those cleanup tasks that you know (or strongly suspect) will have a significant impact on the final product. Cleanup and transformation operations include removing punctuation or stopwords from textual data, standardizing date and number formats, replacing missing or dummy values with a meaningful default, and excluding data that is known to be erroneous or atypical. You will select relevant data points, and you may need to represent them in a new way: a birth date becomes an age range; a place name becomes geo-coordinates; a text document becomes a word density vector. There are many possible normalizations to perform, depending on your dataset and which algorithm(s) you plan to use.
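A minimal pandas sketch of a few such operations; the file, the column names, the defaults, and the age brackets are all hypothetical:

    import pandas as pd

    df = pd.read_csv("my_data.csv")  # hypothetical acquired dataset

    # Standardize date formats; unparseable values become missing values.
    df["published"] = pd.to_datetime(df["published"], errors="coerce")

    # Replace missing values with a meaningful default.
    df["subject"] = df["subject"].fillna("unknown")

    # Exclude data known to be erroneous (e.g. negative page counts).
    df = df[df["page_count"] >= 0]

    # Represent a data point in a new way: a birth date becomes an age range.
    birth_year = pd.to_datetime(df["birth_date"], errors="coerce").dt.year
    age = pd.Timestamp.now().year - birth_year
    df["age_range"] = pd.cut(age, bins=[0, 18, 40, 65, 120])

    # Save the result as a new snapshot (see the section on immutable storage).
    df.to_csv("my_clean_data.csv", index=False)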
It's not a bad idea to ensure that there's a genuinely unique identifier for each record (even if you don't see an immediate need for one). This is also a good time to reflect on any biases that might be inherent in your data, and whether or not you can adjust for them; even if you cannot, understanding how they might impact the ML process will help you conduct a more nuanced analysis and frame your final results. At the very least, you can record biases in the documentation so that future researchers will be aware of them and react accordingly.

As you become more familiar with the data, you will likely hone your cleanup process and iterate through the steps multiple times. The more you can learn about the data, the better your preparations will be. During the data preparation phase, practitioners often make use of visualizations and query frameworks to picture their data holistically, identify patterns, and find errors or outliers. Some ML tools support these features out-of-the-box, or are intentionally interoperable with external query and visualization tools. For a lightweight tool, consider spreadsheet or notebook software. Depending on your use case, it may be worthwhile to put your data into a temporary database or search index so that you can make use of a more sophisticated query interface.

Model testing and training

During the testing and training phase, you will build multiple models and determine which one gives you the best results. One of the main ways you will tune your model is by trying multiple combinations of hyperparameters. A hyperparameter is a value that you set before you run the learning process, and which impacts how the learning process works. Hyperparameters control things like the number of learning cycles an algorithm will iterate through, the number of layers in a neural net, the characteristics of a cluster, or the number of decision trees in a forest. Often, you will also want to circle back to your data preparation steps to try different configurations, apply new enhancements, or address new problems and particularities that you've uncovered.

The process is deceptively simple: try out different configurations until you get a good result. The challenge comes when you try to define what constitutes a good (or good-enough) result. Measuring the quality of a machine learning model takes finesse. Start by asking: What would you expect to see if the model learned perfectly? Equally important, what would you expect to see if the model didn't learn anything at all? You can often utilize randomness as a stand-in for no learning, e.g. "if a result was selected at random, the probability of the desired outcome would be X." These two questions will help you to set benchmarks at both extremes of the realm of possible outcomes. Perfection is elusive, and the return on investment dwindles after a while, so be prepared to stop training once you've arrived at an acceptably good model.
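For instance, scikit-learn's DummyClassifier can stand in for "no learning," providing a floor against which to judge a trained model. The dataset (scikit-learn's built-in iris data) and the decision tree here are placeholders, and the train/test split is explained in the next paragraph; the point is the comparison.

    from sklearn.datasets import load_iris
    from sklearn.dummy import DummyClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Benchmark at the low extreme: always predict the most common class.
    baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

    # An actual learned model should comfortably beat that baseline.
    model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

    print("baseline accuracy:", baseline.score(X_test, y_test))
    print("model accuracy:", model.score(X_test, y_test))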
In a supervised learning problem, the dataset is split into training and testing datasets. The algorithm uses the training data to "learn" a set of rules that it can subsequently apply to new, unseen data to predict the outcome. The testing dataset (also called a validation dataset) is used to test how well the model performs. Often, a third dataset is held out as well, reserved for final testing after the model has been trained. This third dataset provides an additional bulwark against bias and overfitting. Results are typically evaluated based on some statistical measurement that is directly relevant to your research question. In a classification problem, you might optimize for recall or precision. In a regression problem, you can use formulas such as the root-mean-square deviation to measure how well the regression line matches the actual data points. How you choose to optimize your model will depend on your specific context and priorities.

Testing an unsupervised model is not as straightforward, since there is no preconceived notion of correct and incorrect categorization. You can sometimes rely on a known pattern in the underlying dataset that you would reasonably expect to be reflected in a successful model. There may also be characteristics of the final model that indicate success. For example, if you are working with a clustering algorithm, models with dense, well-defined clusters are probably better than sparse clusters with vague boundaries. In unsupervised learning, you may want to hold back some portion of your data to perform an independent validation of your results, or you may use the entire dataset to build the model; it depends on what type of testing you want to perform.

Application of results

As the final step of your workflow, you will use your intelligent model to perform some task. Perhaps you will use it for scholarly analysis of a dataset, or perhaps you will integrate it into a software product. If it is the former, consider how to export any final data and preserve the artifacts of your project. If it is the latter, consider how the model, its outputs, and its continued maintenance will fit into existing systems and workflows. Planning for interoperability may influence decisions from tool selection to data formats and storage.

Immutable data storage

Immutable data storage can benefit the batch-processing ML pipeline, especially during the initial research and development phase. This type of data storage supports iteration and allows you to compare the results of many different experiments. Treating data as immutable means that after each significant change or set of changes to your data, you save a new snapshot of the dataset that is never edited or changed. It also allows you to be flexible and adaptive with your data model. Immutable data storage has become a popular choice for data-intensive or "big data" applications as a way to easily assemble large quantities of data, often from multiple sources, without having to spend time upfront crafting a strict data model. You may have heard the term "data lake" used to refer to such large, unstructured collections of data. This can be contrasted with a "data warehouse," which usually indicates a highly structured, centralized repository such as a relational database.
To demonstrate how immutable data storage supports iteration and experimentation, consider the following scenario: You start with an input file my_data.csv, and then perform some cleanup operation over the data, such as converting all measurements in miles to kilometers, rounded to the nearest whole number. If you were treating your data as mutable, you might overwrite the original contents of my_data.csv with the transformed values. The problem with this approach comes if you want to test some alteration of your cleanup operation. Say, for example, you wanted to round all your conversions to the nearest tenth instead. Since you no longer have your original data, you would have to start the entire ML process from the top. If you instead treated your data as immutable, you would keep my_data.csv in its original state, and save the output of your cleanup operation in a new file, say my_clean_data.csv. That way, you could return to my_data.csv as many times as you wished, try different operations on this data, and easily compare the results of these operations knowing the source data was exactly the same for each one. Think of each immutable dataset as a place in your process that you can safely reset to anytime you want to try something new or correct for some bias or failure.

To illustrate the benefits of a flexible data model, consider a mutable data store, such as a relational database. Before you put any data into the database, you would first need to design a system of tables with set fields and datatypes, and the relationships between those tables. This can feel like putting the cart before the horse, especially if you are starting with a dataset with which you are not yet intimately familiar, and you want the ability to experiment with different algorithms, all of which might require slightly different transformations on the original dataset. Revisiting the example in the previous paragraph, you might initially have defined your distance datatype as an integer (when you were rounding to the nearest whole number), and would later have to change it to a floating point number (when you were rounding to the nearest tenth). Making this change would mean altering the database schema and migrating all of the existing data to the new type, which is a nontrivial task, especially if you later decide to revert back to the original type. By contrast, if you were working with immutable CSV files, it would be much easier to write out two files, one with each data type, and keep whichever one ultimately proved most effective.

Throughout your ML process, you can create several incremental datasets that are essentially read-only. There's no one correct data storage format, but ideally you would use something simple and space-efficient with the capacity to interoperate with different tools, such as flat files without extraneous markup (TXT, CSV, or Parquet, for example). Even if your data is ultimately destined for a different kind of datastore, such as a relational database or triplestore, consider using simple, immutable storage as an intermediary to facilitate iteration and experimentation. If you're concerned about overwhelming your local drive, cloud storage is a good option, especially if you can read and write directly from your programs or software services.

One final benefit of immutable storage relates to scale. Batch processing workflows and immutable data storage work well with distributed data processing frameworks, such as MapReduce and Spark. If you need to scale your ML project using distributed processing, the integration will be more seamless (for more, see the section on scaling up).
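Returning to the mileage scenario above, a minimal sketch of an immutable cleanup step might look like the following (the column name is hypothetical):

    import pandas as pd

    # Read the original snapshot; it is never modified.
    df = pd.read_csv("my_data.csv")

    # Convert miles to kilometers, rounded to the nearest whole number.
    df["distance"] = (df["distance"] * 1.60934).round()

    # Write the result to a new file, leaving my_data.csv untouched, so the
    # rounding rule can be revisited (e.g. nearest tenth) at any time.
    df.to_csv("my_clean_data.csv", index=False)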
Organizing Immutable Data

Organizing immutable data stores can be a challenge, especially with multiple users. A little planning can save you from losing track of your experiments and results. A well-ordered directory structure, informative and consistent file names, liberal use of timestamps, and disciplined note-taking are simple but effective strategies.

For example, say you were acquiring MARCXML records from an API feed, parsing out subject terms, and building a clustering algorithm around these terms. Let us explore one possible way that you could organize your data outputs through each step of the machine learning pipeline.

To enforce a naming convention, create a helper method that generates the output path for each run of a particular data process. This output path includes the date and timestamp of the run; that way you won't have to think about naming each individual file, and can avoid the phenomenon of a mess of files called my_clean_data.csv, my_cleaner_data.csv, my_final_cleanest_data.csv, etc. Your file path for the acquired data might be in the format:

    myProject/acquisitions/marc_YYYYMMDD_HHMMSS.xml

In this case, "YYYYMMDD" represents the date and "HHMMSS" represents the timestamp. Your file path for prepared and cleaned data might be:

    myProject/clean_datasets/subjects_YYYYMMDD_HHMMSS.csv

Finally, each clustering model you build could be saved using the file path pattern:

    myProject/models/cluster_YYYYMMDD_HHMMSS

Following this general pattern, you can organize all of the outputs for your entire project. Using date and timestamps in the file name also enables easy sorting and retrieval of the most recent output.

For each data output, you will want to maintain a record of the exact input, any special attributes of the process (e.g. "this time I rounded decimals to the nearest hundredth"), and metrics that will help you determine success or failure of the process. If you can generate this information automatically for each process, all the better for ensuring an accurate record. One strategy is to include a second helper method in your program that will generate and write out a companion file to each data output. The companion file contains information that will help evaluate results, detect errors, perform optimizations, and differentiate between any two data outputs. In the example project, you could accompany the acquisition output with a text file detailing the exact API call used to fetch the data, the number of records acquired, and the runtime for the process. Keeping companion files as close as possible to their outputs helps prevent accidental separation, so save the companion file to:

    myProject/acquisitions/marc_YYYYMMDD_HHMMSS.txt

In this case, the date and timestamp should exactly match those of the companion XML file. When running processes that test and train models, you can include information in your companion file about hyperparameters and whatever metrics you are using to evaluate the quality of the model. In our example, the companion file to each cluster model may contain the file path for the cleaned input data, the number of clusters, and a measure of cluster variance.
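A minimal sketch of these two helper methods; the function names and directory layout are illustrative, following the patterns above:

    from datetime import datetime
    from pathlib import Path

    def output_path(stage: str, prefix: str, extension: str) -> Path:
        """Generate a timestamped output path, e.g.
        myProject/clean_datasets/subjects_20240301_141503.csv"""
        stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        path = Path("myProject") / stage / f"{prefix}_{stamp}.{extension}"
        path.parent.mkdir(parents=True, exist_ok=True)
        return path

    def write_companion(data_path: Path, notes: dict) -> None:
        """Record inputs, parameters, and metrics for a run in a companion
        file saved alongside the data output itself."""
        lines = [f"{key}: {value}" for key, value in notes.items()]
        data_path.with_suffix(".txt").write_text("\n".join(lines))

Each data process can then call write_companion immediately after writing its output, passing whatever attributes and metrics describe that run, so the record is generated automatically rather than reconstructed by hand.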
Working with machine learning algorithms

New technologies and software advances make machine learning more accessible to "lay" users, by which I mean those of us without advanced degrees in mathematics or data science. Yet, the algorithms are complex, and you need at least an intuitive understanding of how they work if you hope to implement them correctly. I use the following three questions as a guide for understanding an algorithm. Keep in mind that any one project will likely make use of several complex algorithms along the way. These questions help ensure that I have the information I truly need, and avoid getting bogged down with details best left to mathematicians.

• What do the inputs and outputs of the algorithm mean? There are two parts to answering this question. First is the data structure, e.g. "this is a vector with 300 integers." Second is knowing what this data describes, e.g. "each vector represents a document, and each integer specifies the number of times a particular word appears in that document." You also need to be aware of specific implementation details: perhaps the input needs to be normalized in some way, perhaps the output has been smoothed (a technique that compensates for noisy data or outliers). This may seem straightforward, but it can be a lot to keep track of once you've gone through several layers of processing and abstraction. (A concrete example of such a word-count input appears in the sketch at the end of this section.)

• What effect do different hyperparameters have on the algorithm? Part of the machine learning process is tuning hyperparameters, or trying out multiple configurations until you get satisfying results. Part of the frustration is that you can't try every possible configuration, so you have to do some intelligent guesswork. Twiddling hyperparameters can feel enigmatic and unintuitive, since it can be difficult to predict their impact on the final outcome. The better you understand hyperparameters and their roles in the ML process, the more likely you are to make reasonable guesses and adjustments, though you should always be prepared for a surprise.

• Can you explain how this algorithm works to a layperson, and why it's beneficial to the project? There are two benefits to articulating a response to this question. First, it ensures that you really understand the algorithm yourself. And second, you will likely be called on to give this explanation to co-collaborators and other stakeholders. A good explanation will build excitement around the project, while a befuddling one could sow doubt or disinterest. It can be difficult to strike a balance between general summary and technical equations, since your stakeholders will likely include people with diverse backgrounds, so do your best and look for opportunities for people with different expertises to help refine your team's understanding of the algorithm.

Learning more about the underlying math can help you make better, more nuanced decisions about how to deploy the algorithm, and is fascinating in its own right, but in most cases I have found that the above three questions provide a solid foundation for machine learning research.
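As promised above, here is a small sketch of what a word-count input actually contains, using scikit-learn's CountVectorizer on an invented two-document corpus:

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["the cat sat on the mat", "the dog chased the cat"]

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(docs)

    # Each row is a document; each integer counts how many times a
    # particular word appears in that document.
    print(vectorizer.get_feature_names_out())
    # ['cat' 'chased' 'dog' 'mat' 'on' 'sat' 'the']
    print(X.toarray())
    # [[1 0 0 1 1 1 2]
    #  [1 1 1 0 0 0 2]]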
Tool selection

Tool selection is an important part of your process and should be approached thoughtfully. A good approach is to articulate and prioritize the needs of your team, and make selections that meet these needs. I've listed some possible questions for consideration below, many of which you will recognize as general concerns for any tool selection process.

• What sorts of features and interfaces do the tools offer? If you require a specific algorithm, the ability to make data visualizations, or query interfaces, you can find tools to meet these specific needs.

• How well do tools interoperate with one another, or with other parts of your existing systems? One of the advantages of a well-designed pipeline is that it will enable you to swap out software components if the need arises. For example, if your data is in a format that is interoperable with many systems, it frees you from being tied down to any specific tool.

• How do the tools align with the skill sets and comfort levels of your team? For example, consider what coding languages your collaborators know, and whether or not they have the capacity to learn a new one. If you have someone who is already a wiz with a preferred spreadsheet program, see if you can export data into a compatible file format.

• Are the tools stable, well-documented, and well-supported? Machine learning is a fast-changing field, with new algorithms, services, and software features being developed all the time. Something new and exciting that hasn't yet been road-tested may not be worth the risk if there is a more dependable alternative. Furthermore, there tends to be more scholarship, documented use cases, and tutorials for older, more widely-adopted tools.

• Are you concerned about speed and scale? Don't get bogged down with these considerations if you're just trying to get a working pilot off the ground, but it can help to at least be aware of how problems are likely to manifest as your volume of data increases, or as you integrate into time-sensitive workflows.

You and your team can work through these questions and articulate additional requirements relevant to your specific context.

Scaling up

Scaling up in machine learning generally means that you need to work with a larger volume of data, or that you need processes to execute faster. Recent advances in hardware and software make the execution of complex computations magnitudes faster and more efficient than they were even a decade ago, and you can often achieve quite a bit by working on a personal computer. Yet, time is valuable, and it can be difficult to iterate and experiment effectively when individual processes take too long to execute.

There are many ML software packages that can help you make efficient use of whatever hardware you have, including your personal computer. Some examples at the time of writing are Apache Spark, TensorFlow, Scikit-learn, and Microsoft Cognitive Toolkit, each with their own strengths and applications. In addition to providing libraries for building and testing models, these software packages optimize algorithmic performance, memory resources, data throughputs, and/or parallel computations. They can make a remarkable difference in both processing speed and the amount of data you can comfortably handle. There are also services that allow you to submit executable code and data to the cloud for processing, such as Google AI Platform.

Managing your own hardware upgrades is not without challenge. You may be lucky enough to have access to a high-powered computer capable of accelerated processing. A common example is a computer with GPUs (graphics processing units), which break complex processes into many small tasks and run them in parallel. However, these powerful machines can be prohibitively expensive.
Another scaling technique is distributed or cluster computing, in which complex processes are distributed across multiple computers, often in the cloud. A cloud cluster can bring significant cost savings, but managing one requires specialized knowledge, and the learning curve can be rather steep. It is also important to note that different algorithms require different scaling techniques. Some clustering algorithms, for example, scale well with GPUs but not with distributed computing.

Even with the right hardware and software, scaling up can be a tricky business. ML processes tend to have dramatic spikes in memory or network use, which can tax your systems. Not all ML algorithms scale well; memory use or execution time can grow exponentially as more data is added. Sometimes you have to add additional, complexity-reducing steps to your pipeline to handle data at scale. Some of the more common machine learning languages, such as Python and R, execute relatively slowly, putting the onus on developers to optimize operations for efficiency. In anticipation of these and other challenges, it is often a good idea to start with a scaled-down pilot or proof of concept, and not to underestimate the time and resources necessary to scale up from there.

Conclusion

New technologies make it possible for more researchers and developers to leverage the power of machine learning. Building an effective machine learning system means supporting the entire workflow, from data acquisition to final analysis. Practitioners must be mindful of how each implementation decision and subjective choice, from the way you structure and store your data to the algorithms you use to the ways you validate your results, will impact the efficiency of operations and the quality of learned intelligence. This article has offered some practical guidelines for building ML systems with modular, repeatable processes and intelligible, verifiable results. There are many resources available for further research, both online and in your libraries, and I encourage you to consult with subject specialists, data scientists, mathematicians, programmers, and data engineers. May your data be clean, your computations efficient, and your results profound.

Further Reading

I include here a few suggestions for further reading on key topics. I have also found that in the fast-changing world of machine learning technologies, blogs, internet communities, and online classes can be a great source of information that is current, introductory, and/or geared toward practitioners.

Tan, Pang-Ning, Michael Steinbach, and Vipin Kumar. 2005. Introduction to Data Mining. Boston: Pearson Addison Wesley. See chapter 2 for data preparation strategies. Later chapters introduce common classification and clustering algorithms.

Marz, Nathan, and James Warren. 2015. Big Data: Principles and Best Practices of Scalable Real-Time Data Systems. Shelter Island: Manning. "Part 1: Batch Layer" discusses immutable storage in depth.

Kleppmann, Martin. 2017. Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems. Boston: O'Reilly. "Chapter 10: Batch Processing" is especially relevant if you are interested in scaling up.