key: cord-266626-9vn6yt8m authors: Lei, Howard; O'Connell, Ryan; Ehwerhemuepha, Louis; Taraman, Sharief; Feaster, William; Chang, Anthony title: Agile Clinical Research: A Data Science Approach to Scrumban in Clinical Medicine date: 2020-10-22 journal: Intell Based Med DOI: 10.1016/j.ibmed.2020.100009 sha: doc_id: 266626 cord_uid: 9vn6yt8m

The COVID-19 pandemic has required greater minute-to-minute urgency of patient treatment in Intensive Care Units (ICUs), rendering the use of Randomized Controlled Trials (RCTs) too slow to be effective for treatment discovery. There is a need for agility in clinical research, and the use of data science to develop predictive models for patient treatment is a potential solution. However, rapidly developing predictive models in healthcare is challenging given the complexity of healthcare problems and the lack of regular interaction between data scientists and physicians. Data scientists can spend significant time working in isolation to build predictive models that may not be useful in clinical environments. We propose the use of an agile data science framework based on the Scrumban framework used in software development. Scrumban is an iterative framework, where in each iteration larger problems are broken down into simple do-able tasks for data scientists and physicians. The two sides collaborate closely in formulating clinical questions and developing and deploying predictive models into clinical settings. Physicians can provide feedback or new hypotheses given the performance of the model, and refinement of the model or clinical questions can take place in the next iteration. The rapid development of predictive models can now be achieved with increasing numbers of publicly available healthcare datasets and easily accessible cloud-based data science tools. What is truly needed are data scientist and physician partnerships ensuring close collaboration between the two sides in using these tools to develop clinically useful predictive models to meet the demands of the COVID-19 healthcare landscape.

The COVID-19 pandemic has greatly altered the recent healthcare landscape and has brought about greater minute-to-minute urgency of patient treatment, especially in Intensive Care Units (ICUs). This greater urgency for treatment implies a greater need for agility in clinical research, rendering traditional approaches such as Randomized Controlled Trials (RCTs) [1] too slow to be effective.
One approach to meeting this need for agility is the use of data science to develop predictive models that assist in patient treatment. Predictive models can be developed rapidly and non-invasively by leveraging existing data and computational tools, and various efforts have been undertaken [2] [3] [4]. If successful, predictive models can rapidly process large volumes of patient information to assist physicians in making clinical decisions. However, developing and deploying predictive models that are useful in clinical environments within short timeframes is challenging. Traditionally, the development and deployment of models follows a sequential process that resembles the Waterfall methodology used in software development [5]. Only after the data have been collected and processed and the model trained and tested would the model be deployed into a real-world setting for the domain experts to evaluate and provide feedback. One main disadvantage of this approach is that it allows for little collaboration between the day-to-day operations of data scientists and domain experts such as physicians, so data scientists may work in isolation for long periods of time. Figure 1 illustrates this process.

The tasks data scientists typically perform in isolation include data collection, data pre-processing and augmentation, model selection, model hyper-parameter tuning, model training, and model testing. Data pre-processing converts the data into a format suitable for use by the predictive model. Data augmentation is used to artificially increase the size of the data. For example, if the input data consist of images, augmentation can include translating, scaling, rotating, and adjusting the brightness of images to present more example images for the predictive model to learn from. One popular technique for compensating for limited data is the Synthetic Minority Oversampling Technique (SMOTE) [6], which addresses class imbalance in datasets by artificially increasing the amount of data in the minority class. Class imbalance is commonly encountered when working with Electronic Medical Record (EMR) data in healthcare: the class representing patients with a target condition is typically smaller (i.e. has fewer samples) than the class representing patients without the target condition, and this can adversely affect the accuracy of predictive models developed on such data.

Hyper-parameter tuning involves adjusting the parameters that govern the model training process [7], during which the model is taught how to make predictions given the training data. One example of a hyper-parameter is the number of times, or iterations, that the training data is presented to the model to learn; each iteration is known as an epoch. After each epoch the model improves its fit to the training data, and after many epochs the learning is complete. A second example of a hyper-parameter is the percentage of the training data used by the model in each epoch. The more epochs and the more data presented in each epoch, the better the model learns from the training data. A final example of a hyper-parameter is the learning rate, which inversely correlates with the amount of time the model takes to reach its "learned state". Models trained with higher learning rates reach their final state faster and complete training sooner; however, they may not learn as well as models trained with lower learning rates.
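To make these steps concrete, the following is a minimal Python sketch of how SMOTE and the hyper-parameters discussed above (epochs, batch size, learning rate) might appear in practice. The synthetic dataset, the network architecture, and all parameter values are illustrative assumptions for this sketch, not the authors' implementation.

```python
# A minimal sketch: oversampling an imbalanced dataset with SMOTE, then training a
# small model while setting the hyper-parameters named in the text.
# All data and parameter values below are hypothetical.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from tensorflow import keras

# Hypothetical EMR-style feature matrix: 1,000 patients, 20 features,
# with roughly 10% of patients in the positive (target-condition) class.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = (rng.random(1000) < 0.10).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# SMOTE synthesizes new minority-class samples so both classes are balanced in training.
X_train_bal, y_train_bal = SMOTE(random_state=0).fit_resample(X_train, y_train)

# A small feed-forward model; the hyper-parameters below are the ones discussed above.
model = keras.Sequential([
    keras.layers.Dense(32, activation="relu", input_shape=(20,)),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-3),  # learning rate
    loss="binary_crossentropy",
    metrics=["AUC"],
)
model.fit(
    X_train_bal, y_train_bal,
    epochs=20,       # number of passes over the training data
    batch_size=64,   # number of training samples presented per update step
    validation_split=0.2,
    verbose=0,
)
```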
Depending on the amount of training data, the complexity of the data, the number of parameters in the predictive model, and the available computing resources, model training can take days to complete. The model is then evaluated against a separate test dataset to verify that performance meets requirements; if not, some or all of the previous steps must be repeated until performance becomes acceptable. Once performance is deemed acceptable, the model is deployed into a real-world environment. In the end, the process from the conception of the problem to model deployment can take months, and the opportunity for domain experts to evaluate the model comes only after deployment. One risk is that, after deployment, the model is no longer relevant because the goals have shifted; another is that the model may not meet performance requirements in a real-world setting. In either situation, the time and resources allocated to model development would have been wasted. This can be particularly damaging for data science efforts addressing the COVID-19 pandemic, where rapid development of approaches for detection and diagnosis of symptoms is critical.

In healthcare, the ability to rapidly define goals (i.e. clinically relevant questions) and deploy predictive models that have real-world impact faces even more challenges. One challenge is that healthcare data, such as patients' Electronic Medical Records (EMR), are inherently complex [8], consisting of a mix of different data types and structures, missing data, and mislabeled data. The development of predictive models often requires well-structured and well-labeled data; hence, there is a greater need for data exploration, pre-processing, and/or filtering when processing EMR data. Furthermore, it may be discovered upon exploration of the available training data that the initial clinical questions and goals cannot be achieved by predictive models developed using that data; those questions and goals would then need to be refined before model development can proceed.

Moreover, for predictive models to be usable in a clinical setting, physicians must have confidence that their performance is reliable. A model that performs well on common metrics used by data scientists, such as the Area Under the Curve (AUC), offers no guarantee that important clinical decisions can be made based on that model [9]. This is because the AUC measures model performance across a broad range of sensitivities and specificities. When making important clinical decisions for patients in the ICU, such as proning versus ventilation, which drugs to use, or whether to administer anti-coagulants, knowing that the model has an excellent AUC of 0.95 out of 1.0 is not as helpful as knowing that a decision based on the model has a 95% chance of being correct (i.e. the model's specificity). Some clinical decisions also need to be made within minutes, implying that the model must meet real-time performance standards in order to become the "partner" that can assist physicians in on-the-spot decision making. The fact that predictive models may fall short in performance after being deployed into a clinical setting implies an even greater need for a framework that allows physicians to collaborate with data scientists to continuously monitor model development and performance.
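As an illustration of this distinction, the short Python sketch below computes both the threshold-free AUC and the sensitivity and specificity at a single operating threshold. The labels, scores, and threshold are hypothetical values chosen for the example, not results from any model in this work.

```python
# Hypothetical model outputs: why an overall AUC differs from the sensitivity and
# specificity at the one operating point actually used for a clinical decision.
import numpy as np
from sklearn.metrics import roc_auc_score, confusion_matrix

y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])                                  # hypothetical ground truth
y_score = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.70, 0.60, 0.85, 0.95])   # hypothetical model scores

# AUC summarizes discrimination across all possible thresholds.
print("AUC:", roc_auc_score(y_true, y_score))

# A clinical decision is made at one specific threshold.
threshold = 0.5
y_pred = (y_score >= threshold).astype(int)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("Sensitivity at threshold:", tp / (tp + fn))
print("Specificity at threshold:", tn / (tn + fp))
```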
Furthermore, the minute-to-minute urgency of treatment during the COVID-19 pandemic implies that the lengthy process prescribed by the traditional Waterfall approach, with little communication between data scientists and physicians, is inadequate.

The agile framework has traditionally been used in software development and has recently been introduced to data science [10]. It is an expedient approach that encourages greater velocity towards accomplishing goals, and it includes the Scrum and Kanban frameworks as well as a hybrid framework called Scrumban [11]. The Scrum framework prescribes consecutive "sprint cycles", with each cycle spanning a few weeks. Within each cycle, team members set and refine goals, produce implementations, and perform a retrospective with stakeholders; new goals and refinements are then established for the next sprint cycle. One of the team members also acts as the Scrum Master, who facilitates daily team meetings (called standups) and ensures that the team is working towards its goals and requirements [12]. The Kanban framework involves breaking larger tasks down into simple, do-able tasks. Each task proceeds through a sequence of well-defined steps from start to finish. Tasks are displayed as cards on a Kanban board, and their positions on the board indicate how much progress has been made [13]. Certain tasks may be "blocked", meaning that something must be resolved before progress on the task can continue. Figure 2 shows an example of a Kanban board. One advantage of a Kanban board is that the set of all necessary tasks, along with the progress on each, is transparent to members of the development team and anyone else who is interested. Overall, the Kanban framework brings clarity to tackling larger problems: domain experts can visualize how the team is approaching the problems, what has been accomplished, what is in progress, what still needs to be done, and what must be resolved before progress can be made.

The proposed agile framework is shown in Figure 3. Unlike the Waterfall approach, the tasks in the agile approach are done collaboratively between data scientists and physicians, and the use of cloud-based storage and computing helps by providing a common platform for accessing the data and model(s). Complex problems can be broken down into tasks that can be visualized by both data scientists and physicians, enabling physicians to better understand the work that data scientists must do within each sprint cycle. The framework encourages continuous deployment of predictive models in clinical settings such as the ICU, during which time data scientists can round with physicians and receive feedback on the model's performance. The physician's insight, or gestalt, can be leveraged to determine whether the results of the model are believable [7]. It may be that the predictive model performs well only in certain settings, such as with certain patient populations or across certain periods of time; if so, the clinical questions can be refined or new hypotheses developed at the beginning of the next sprint cycle. The point at which the sprint cycles should end, either because the model has become clinically useful or because the team needs to pivot in a completely different direction, is determined by the physicians.
While the traditional Waterfall approach could take many months to produce clinically useful models, the agile approach could take just a fraction of that time, depending on the level of collaboration between data scientists and physicians. For agile data science to work in the healthcare domain, certain infrastructure must be in place to ensure that sprint cycles can be completed within shorter timeframes. This includes the ability to: 1. Rapidly acquire large datasets. 2. Parse and query data in real time. 3. Use established platforms and libraries rather than develop tools de novo. These platforms and libraries should reside in a cloud framework that allows collaborative efforts to take place.

The availability of publicly accessible health information databases for research is increasing despite a multitude of regulatory and financial roadblocks. One such database is the Medical Information Mart for Intensive Care III (MIMIC-III), which contains de-identified data generated by over fifty thousand patients who received care in the ICU at Beth Israel Deaconess Medical Center [15]. The hope is that as researchers adopt MIMIC, new insights, knowledge, and tools can be generated from around the world [16]. Another publicly available database is the eICU Collaborative Research Database, a multi-center collaborative database containing intensive care unit (ICU) data from many hospitals across the United States [17]. Both the MIMIC-III and the eICU databases can be obtained immediately upon registration and completion of training modules. The popularity of these two databases illustrates the potential for large amounts of data to be gathered from hospitals and ICUs around the world and made immediately accessible to researchers. Clinical datasets assembled specifically for COVID-19 research are also becoming available [18]. The Cerner Real-World Data is another COVID-19 research database that contains de-identified data and is freely offered to health systems [19]. Finally, databases for medical imaging studies also exist, such as the Chest X-ray dataset released by the NIH, which contains over 100,000 chest X-ray images [20].

Once datasets are obtained, storage and compute power are easily purchased from an ever-increasing number of vendors. The compute power needed for analyzing large datasets can often be met using cloud computing resources, with Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure being providers of popular cloud services [21] [22] [23]. The case for cloud computing tools rests mainly on the availability of specialized elastic compute instances; this elasticity means that computing resources can be accessed in real time and scaled up or down as needed to balance computing power and cost. Another advantage of a cloud framework is that it allows multiple data scientists and physicians to conveniently collaborate and access the work. This shift to elastic cloud resources has seen one of the major Electronic Medical Records (EMR) providers, Cerner Corporation [24], develop tools for agile data science that use cloud computing resources as the underlying computing engine. These tools often use Jupyter Notebook as the front-end programming interface. Jupyter is an open source computational environment that supports the programming frameworks and languages, such as Apache Spark [25], Python, and R, required for processing data and developing predictive models [26].
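As a small illustration of the kind of shared, notebook-based exploration described here, the Python sketch below loads one table from a local copy of MIMIC-III with pandas and summarizes ICU length of stay by care unit. The file path is an assumption, the table and column names follow the publicly documented MIMIC-III schema, and the sketch is illustrative rather than part of the authors' workflow.

```python
# A quick exploratory query that a data scientist and physician might run together
# in a shared Jupyter notebook during a sprint cycle.
import pandas as pd

# Path to a local or cloud-mounted copy of the MIMIC-III ICU-stays table (assumed).
icustays = pd.read_csv("mimic-iii/ICUSTAYS.csv")

# Summarize length of stay (LOS, in days) for each first care unit.
los_summary = (
    icustays.groupby("FIRST_CAREUNIT")["LOS"]
            .agg(["count", "mean", "median"])
            .sort_values("count", ascending=False)
)
print(los_summary)
```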
Open source machine learning libraries such as Keras [27], which enables the rapid development of advanced predictive models such as Convolutional Neural Networks (CNNs) [28], can also be integrated. Finally, the Jupyter Notebook framework supports collaboration among multiple individuals, where data scientists and physicians can query data, add and modify code, and/or visualize results in real time [26]. The availability of these development tools and the accessibility of data allow data scientists to rapidly acquire data, query the parts of the data relevant to the clinical questions, and develop predictive models. The outcomes of the model can then lead to refinement of the clinical questions, the data, or the model itself. The combination of data scientists, physicians, and agile data science tools will help revolutionize the entire data science process and accelerate discoveries in healthcare and other application domains.

Agile data science is quickly becoming a necessity in healthcare, and it is especially critical given the COVID-19 pandemic. The agile framework prescribes a rapid, continuous-improvement process that enables physicians to understand the work of data scientists and regularly evaluate predictive model performance in clinical settings. Physicians can provide feedback or form new hypotheses for data scientists to implement in the next cycle of the process. This is a departure from the traditional Waterfall approach, in which data scientists tackle a sequence of tasks in isolation, without regularly deploying the models in real-world settings or engaging domain experts such as physicians. Given the rapidly shifting healthcare landscape, the goals and requirements for predictive models may change by the time a model is deployed; this renders the slower, traditional model development approaches unsuitable.

As the agile framework encourages rapid development and deployment of predictive models, it requires data scientists to have easy access to data and to the infrastructure needed for model development, deployment, and communication of outcomes. Fortunately, there are now publicly available datasets such as MIMIC-III, and cloud-based infrastructure such as Amazon Web Services (AWS), to achieve this. AWS supports a suite of popular tools such as Jupyter Notebook, Python, and R, allowing data scientists to rapidly upload data and develop and deploy models with short turn-around times. Given the increasing amounts of healthcare data, the plethora of clinical questions to address, and the minute-to-minute urgency of treating ICU patients during the COVID-19 pandemic, the rapid development of predictive models to address these challenges is more important than ever. We hope that the agile framework will be embraced by increasing numbers of physician and data scientist partnerships in the process of developing clinically useful models to address these challenges.
O'Reilly Media, Inc MVM -Minimal Viable Model MIMIC-III, a Freely Accessible Critical Care Database Making Big Data Useful for Health Care: A Summary of the Inaugural MIT Critical Data Conference The eICU Collaborative Research Database, a Freely Available Multi-Center Database for COVID-19 Clinical Data Sets for Research FAQ: COVID-19 de-identified data cohort access offer ChestX-ray8: Hospitalscale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases Apache Spark -Unified Analytics Engine for Big Data Toward Collaborative Open Data Science in Metabolomics using Jupyter Notebooks and Cloud Computing Reading checks with multilayer graph transformer networks No external funding is provided for this work.J o u r n a l P r e -p r o o f J o u r n a l P r e -p r o o f HIGHLIGHTS • Agile data science in healthcare is becoming a necessity, given the COVID-19 pandemic and the minute-to-minute urgency of patient treatment.• The proposed agile data science framework is based on Scrumban, used in software development.• Publicly available healthcare datasets and cloud-based infrastructure enable the agile framework to be widely adopted.• Collaboration between physicians and data scientists needed in order to implement the agile framework. ☒ The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.☐The authors declare the following financial interests/personal relationships which may be considered as potential competing interests:J o u r n a l P r e -p r o o f