key: cord-0439672-93sh90cb
authors: Merelo-Guervós, Juan Julián; García-Valdez, Mario
title: Agile (data) science: a (draft) manifesto
date: 2021-04-09

Science has a data management as well as a project management problem. While industrial-grade data science teams have embraced the *agile* mindset, and adopted or created all kinds of tools to manage reproducible workflows, academia-based science is still (mostly) mired in a mindset that is focused on a single final product (a paper), without focusing on incremental improvement and, above all, reproducibility. In this report we argue for the adoption of the agile mindset and agile data science tools in academia, to make a more responsible, sustainable, and above all reproducible science.

By agile, we usually mean a mindset applied to the whole software development lifecycle that is customer-centered and focused on continuous improvement of increasingly complex minimum viable products. The name comes from the Agile Manifesto (Beck et al. 2001), literally called "Manifesto for agile software development." This manifesto has certainly changed the way software at large is developed and has become mainstream, spawning many different methodologies, tools, and best-practice guidelines. It has proved to be an efficient way of carrying out all kinds of projects, from small to large-scale ones, mitigating or reducing the presence of bugs and proving to be more efficient (Abrahamsson et al. 2017) than the methodology that prevailed previously (and still does in many sectors), generally called waterfall (Andrei et al. 2019), which separated (or siloed) the teams working on each phase, from specification to testing, with every team acting on a different part of the lifecycle.

Despite being prevalent in software development (and, in general, project development) environments, the agile mindset has certainly not reached science at large, which arguably follows a process that closely resembles the waterfall methodology, and in any case differs in methodology and objectives from mainstream software development teams (Naguib and Li 2010). Since data science and engineering have become an integral part of the workflow in many companies, agile data science (Engström 2020) is, mostly, the way it is done there. Again, this is mainly because data science is mostly done in industry, not academia, which does not have the same kind of workflows to deal with its own data.

Our intention is to try and put the science back in data science. We will critically examine how science is done, the main reasons why the agile mindset is not being used in science, how agile concepts would translate to science, and eventually what agile data science, and science at large, would look like. We will first present what attempts have been made to translate agile concepts to the (academic) world of science.

Despite being an age-old pursuit and probably, as such, ripe for disruption, science has been done in pretty much the same way for years. In very rough brushstrokes, it starts with a funding application, which includes a work plan, and generally works hierarchically from principal investigators to senior and then junior researchers, producing a series of artifacts that always include papers (which are snapshots of the state of the art) and, in some cases, software, protocols, or even, in some limited areas (mainly astronomy and medicine), workflows.
This situation has been challenged repeatedly, mainly after the introduction of the aforementioned Agile Manifesto. In (Amatriain and Hornos 2009), which is essentially a presentation and not a formal paper, several proposals are made to apply agile "methods" in research; something that has been proposed repeatedly in later years, for instance in a blog post (Carattino, n.d.) and even in a paper (Baijens, Helms, and Iren 2020) that specifies an agile methodology, Scrum, and how it can be applied specifically to data science projects. As a matter of fact, there have been several attempts to raise the issue and bring it to the attention of the research community: a blog post introduced agile research (Amatriain 2008) and even drafted an Agile Research Manifesto (Amatriain 2009). This was almost totally forgotten until two years ago, when a blog called "Agile Science" (Bergman 2018) brought it up again. Independently, some researchers proposed an (almost) ultimatum for agile research (Way, Chandrasekhar, and Murthy 2009), and the idea eventually became fruitful in a restricted environment, mHealth (Wilson et al. 2018). Our own work (Merelo 2013) proposed including very common software development practices, such as testing, in scientific software development (in this case, in the context of evolutionary computation). This goes to show that agile research is still a fringe pursuit, and has not been incorporated either into funding agencies' guidelines or into common science and research practice.

These efforts are certainly related to Open Science, which adapts the main ideas of open source software development to the publication of scientific results and artifacts; the push for Open Science (Robson et al. 2021) has provided new venues and new ways of understanding and producing science. As a matter of fact, scientists who publish their work as open source software follow these best practices (Groen et al. 2015) more closely than those who don't. In other cases, the uptake of new methodologies is still very slow. While most companies have created pipelines for data management (Rodríguez 2019), there are neither clear guidelines nor best practices nor resources for doing scientific data management at scale and, more importantly, in a way that can have a (positive) outcome for your career.

We certainly need to acknowledge first that science, the way it is done now, has many problems. Many of them start and end with funding (or the lack thereof), but we should also realize that using nineteenth-century formats to publish twenty-first-century research leaves a lot to be desired. However, the fundamental problem is not in what is produced (mainly papers); it is in practitioners' workflows themselves, created and maintained in order to produce that single product. The anomalies and problems caused by these hierarchical, paper-oriented workflows end in frustration and, when major crises like COVID-19 strike, in major difficulties carrying out the much-needed (reproducible and replicable) research that can serve as a foundation for the next, equally necessary, step. So we will try to present, and defend, a series of hypotheses that would be the foundation of agile science, and that would contribute to solving the data management and productivity problems science currently has.

Science arguably can't be science if it's not reproducible. However, there are many practical hurdles for it to be so.
First, the "paper" format, even if it's paper-with-embedded-links, is not reproducible per se (even if tools such as the one proposed by Kardas et al. (Kardas et al. 2020) are going to be able to extract at least the results in papers, though not the code and experimental setup). Efforts like Papers with Code ("Papers with Code," n.d.) help associate something that has already been published with its corresponding code, but even if this is a step towards reproducibility, an effort similar to the one invested in deploying applications to production should be made: configuration as code, provisioning of all needed services, inclusion of all data inflows, as well as extensive testing.

Testing, an essential part of the agile mindset, is severely lacking in scientific workflows. Essentially, all papers try to prove something: they answer a research question or test a hypothesis over data characterized in a certain way. Datasets are static, and hypotheses are proved over provided datasets, which are increasingly available, although that is not necessarily a given. Data science practitioners know that working with raw data is always challenging, since it needs a series of steps to be suitable for use. Extracting the data is hard, and checking that everything is correct is also hard. Small changes could invalidate results. But this is also true for science in general, because much of it is based on data obtained from an external source that needs to be validated and processed. This is why testing is essential. If "code that's not tested is broken," we could equally say that dataflows and workflows that are not tested are broken. Besides basic testing (for duplication, invalid data, and the like), we should formulate tests on data to check that it keeps the format, range, and general characteristics that allow our hypothesis to be valid. So agile science would need to test data to start with, before using it as input for workflows; but it should also unit-test all the software used, perform integration tests on data plus software, and eventually transform the hypothesis itself into actual software tests that continuously check whether it still holds.

Increasingly, now that the Internet itself, as well as myriad sensors and devices, is a continuous source of data, publishing a paper drawing conclusions over a small piece of data is valuable and helpful; but creating a tested, continuously deployed workflow that tests that hypothesis over and over again, and that has been released as free software to be integrated as input or middleware in other workflows, is immensely more valuable.
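As a minimal sketch of what such tests could look like (the file name, column names, groups, and thresholds are all hypothetical, and pytest, pandas, and SciPy stand in for whatever stack a given lab actually uses), both basic data checks and the hypothesis itself can be expressed as ordinary unit tests:

```python
# test_workflow.py -- run with `pytest`; all names and thresholds here are
# hypothetical placeholders for a concrete scientific workflow.
import pandas as pd
import pytest
from scipy.stats import ttest_ind

@pytest.fixture(scope="module")
def data():
    # The data inflow the hypothesis depends on, re-read on every test run.
    return pd.read_csv("data/measurements.csv")

def test_schema_is_stable(data):
    # Guard against upstream renaming or dropping columns.
    assert {"subject_id", "group", "temperature"} <= set(data.columns)

def test_no_duplicate_subjects(data):
    assert not data["subject_id"].duplicated().any()

def test_values_in_plausible_range(data):
    # Out-of-range values signal a broken extraction step, not new science.
    assert data["temperature"].between(30.0, 45.0).all()

def test_hypothesis_still_holds(data):
    # The paper's claim recast as a test: the two groups differ.
    a = data.loc[data["group"] == "A", "temperature"]
    b = data.loc[data["group"] == "B", "temperature"]
    _, p_value = ttest_ind(a, b, equal_var=False)
    assert p_value < 0.05
```

Run on every change to the data or the code (for instance, in continuous integration), tests like these turn "the hypothesis holds" from a one-shot claim in a paper into a property that is checked continuously.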
But this needs another hypothesis: science should be reproducible, and this implies that the software used or produced for it should be open too. However, it makes sense to emphasize the pliability and flexibility of free-software-as-science: if you want to configure increasingly complicated workflows that are going to be deployed continuously, a single non-open component would break them and make them impossible. We would like to think this is the least controversial hypothesis supporting agile science. After all, it has been repeatedly shown (Vandewalle 2012; Vandewalle 2019) that papers with code have a higher impact than those that hide it or simply don't publish it alongside the paper itself.

As a matter of fact, it's quite usual right now in many scientific conferences to accompany the paper with pointers to a GitHub repository; searching GitHub for similar code might help scientists/coders as much as reading new papers. Some important conferences, like NeurIPS, now have people responsible for reproducibility in their program committee, and they have shown that 3 out of 4 papers in that conference already post their GitHub repository (Gibney 2020). This is mainly prevalent, however, in the data science/machine learning community (Wattanakriengkrai et al. 2020). There's a more important aspect to this: science in open repositories and with free licenses leaves research open to all stakeholders, who can have a say in its outcomes as well as in its direction. Which is why we prefer stakeholder participation over hierarchical decision-making.

Again, it was the COVID-19 crisis that made the world realize it needed science: it needed a lot of it, and it needed it now. The whole world's health and even lives were at stake, and government officials as well as civil society needed to know everything from what to do to avoid contagion to how the pandemic was evolving and what it would mean vis-à-vis their return to a less restrictive situation. During this crisis, many open data repositories emerged with partial, localized solutions that allowed people in general, and perhaps also government officials, to have a bit more control over their lives.

As a matter of fact, the scientific chain of command and general career environment causes quite a few problems for participants (Levecque et al. 2017). That study explicitly mentions "the supervisor's leadership style" as one of the factors linked to health problems, with the general work and organizational context cited as one of the highlights of the study. The scenario shown in a recent Nature survey (Woolston 2019) would be unsustainable in any kind of company, be it in software development or any other sector. Agile software development offers a series of principles and best practices that enable a sustainable pace of development, with regular deployments; but since it values "individuals and interactions over processes and tools," it also creates a humane, self-organizing work environment that results in better worker satisfaction, as well as sustainable, career-long learning. That should also be an objective in scientific research. Techniques such as Scrum (Baijens, Helms, and Iren 2020) have helped in data science (and only recently); again, this has not extended to computer science or, in general, to science at large.

Agile fixed software development by proposing a series of principles, attached to the Agile Manifesto (Beck et al. 2001), that eventually spawned a series of tools on the one hand and best practices on the other: tools that can be grouped into generic CI and CD toolchains, including MLOps tools (Mäkinen et al. 2021); teamwork tools (usually attached to source repositories) such as Jira or GitHub itself; and different methodologies (Kanban, Scrum), practices (reviews, retrospectives), and roles (product owner, stakeholder) that streamline software production, bring value to stakeholders, and provide a sane, stable, nurturing, and eventually productive working environment. We have been advocating the use of these techniques for quite a long time, at least since 2011 (the last version of the talk on "the art of evolutionary algorithm programming" is in (Merelo 2013)). In this report we try to put everything together under the same framework, which we will call agile science.
Science should not be different, and a (roughly) direct translation of all these practices, however they are interpreted, would be beneficial. We'll try, anyway, to delve a bit further into those concepts to see how they translate and how they could be applied, in practice, to science.

We need to shift focus from publishing a paper as a one-shot product to deploying working workflows, which will have as side products papers that can be continuously updated, or else simply remain as snapshots of the state of the art at a particular point in time, but that can, in any case, be re-produced (see one of the hypotheses above: reproducibility above all else) by anyone, but especially by the people who produced them in the first place. As we see this shift in action, we will see how the science system changes to accommodate it and to take it into account in scientific careers. At this point in time, the peer-review and article-publishing system has been gamed in so many ways (Arney, n.d.) that it's almost meaningless to rely on it for scientific promotion, not to mention for science itself. To a certain point, it has become an obstacle to science. But replicability can fix it (Moonesinghe, Khoury, and Janssens 2007), and the best way to make results fully replicable is to publish them openly (third hypothesis above). There are clearly lots of hurdles to overcome here, including the fact that scientific publishing powerhouses will have to become workflow hosting players. That, however, is an externality to this proposal and can no doubt be solved as soon as the economic scenario takes shape.

The fourth hypothesis above advocates stakeholder participation over hierarchical decision-making. However, in agile teams there must be a person who calls the shots. A successful product owner should be able (Oomen et al. 2017) to "define the product vision." Since in this case the product is a scientific workflow, the person who had the original idea should be the one who owns the product, and should then prioritize different paths of development, define minimum viable products and their consequent milestones, and in general not only take responsibility for the finished product but also work with the team to achieve success. That sense of ownership will be one of the factors decreasing anxiety and increasing sustainable productivity in science. It dodges hierarchical organization by making the principal investigator the product owner only if he or she has, effectively, had the original idea and the vision to take it forward. At the same time, a product owner is part of an agile team, and owns the product, not the team. This small semantic change also simplifies roles and clarifies what everyone will be doing in the product development lifecycle.

Most scientific development includes, or even simply is, software development. We should use common software development tools and best practices, and integrate them into the development of the scientific workflow that is ultimately the objective. In many cases, and especially in data science/machine learning, there will be specialized tools such as MLflow (Zaharia et al. 2018), with frontends such as Snapper ML (Domenech and Guillén 2020), to simplify the creation of workflows. No doubt these high-level workflow tools will be extended to other fields and integrated with mainstream deployment tools such as Docker or Kubernetes.
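As an illustration of the kind of support a tool like MLflow provides (a minimal sketch; the experiment name, parameter values, and metric below are invented placeholders, and the computation itself is elided), its tracking API ties every result to the exact configuration that produced it:

```python
# A minimal MLflow tracking sketch; experiment name, parameter values,
# and the metric are hypothetical placeholders.
import mlflow

mlflow.set_experiment("agile-science-demo")

with mlflow.start_run():
    # Log the configuration the result depends on...
    mlflow.log_param("population_size", 128)
    mlflow.log_param("mutation_rate", 0.05)

    # ... the actual experiment would run here ...
    best_fitness = 0.93  # placeholder for a real computed result

    # ... and the outcome, so every number in a paper maps to a logged run.
    mlflow.log_metric("best_fitness", best_fitness)
```

With runs recorded this way, a paper becomes literally a snapshot: a query over the runs logged up to a particular point in time.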
Integrating these practices seamlessly merges product (workflow) development with software development, and also leverages existing tools such as GitHub or GitLab, with their accompanying workflow design tools (GitHub Actions, pipelines), as well as other cloud environments and their accompanying tools. This also decouples the production of a workflow from its actual deployment. As long as deployment is clearly expressed, the workflow can be deployed by the scientific team that produced it, on premises or in the cloud, or by anyone else, anywhere else. One could even think about global free infrastructure for doing this kind of thing, or even a model similar to pay-to-publish: pay-to-deploy, letting the hosting venue take care of long-term maintenance. This also makes science, and the scientific effort, much more sustainable, and satisfies stakeholders (fourth hypothesis above) by keeping the products of science funding available and working well beyond the lifetime of the grant, or even of the group itself.
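What "deployment clearly expressed" could mean in practice is sketched below; everything in it (the script name, data URL, and output file) is a hypothetical example, not a prescribed layout. The point is that a workflow whose single entry point declares its inputs and outputs can be run, and therefore redeployed, by anyone, anywhere:

```python
# run_workflow.py -- a self-contained, redeployable workflow entry point.
# Data URL, defaults, and output names are hypothetical placeholders.
import argparse
import json

import pandas as pd

def main() -> None:
    parser = argparse.ArgumentParser(description="Reproducible analysis workflow")
    parser.add_argument("--data-url", default="https://example.org/data.csv",
                        help="Where the input data inflow comes from")
    parser.add_argument("--output", default="results.json",
                        help="Where the computed results are written")
    args = parser.parse_args()

    # Ingest: the data source is a parameter, never a hard-coded local path.
    df = pd.read_csv(args.data_url)

    # Analyze: placeholder for the actual scientific computation.
    results = {"n_rows": len(df),
               "grand_mean": float(df.select_dtypes("number").mean().mean())}

    # Publish: machine-readable output a downstream workflow can consume.
    with open(args.output, "w") as f:
        json.dump(results, f, indent=2)

if __name__ == "__main__":
    main()
```

Anything invocable like this can be run equally well from a GitHub Actions job, inside a Docker container, or on someone else's cluster, which is exactly the decoupling between production and deployment described above.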
In this report we have tried to propose a set of best practices that we think would benefit science at large, but especially those disciplines that rely heavily on data and software to produce results. Essentially, this proposal interprets, translates, and codifies the agile (software development) manifesto for the scientific arena, converting what that manifesto values into a series of four hypotheses that will guide agile science. Hypotheses need to be proved, however, and science prides itself on being able to establish fact over anecdotal, or even counter-intuitive, evidence. This is why we also provide a path forward in the shape of several best-practice suggestions that will help prove those hypotheses beyond any doubt. There is strong evidence supporting them, and our own experience applying them for some time via open-repository product development, especially in papers such as (García-Ortega, Sánchez, and Merelo-Guervós 2021) and, in general, most papers we have published lately, shows that this way of working helps stakeholder participation in the production of workflows, makes it easier to evolve software related to science, streamlines product development, and also makes it easier to respond to new requirements. Proving the positive effects of these preferences is, however, left as future work.

References

- Agile Software Development Methods: Review and Analysis
- A Manifesto for Agile Research
- Agile Methods in Research
- A Study on Using Waterfall and Agile Methods in Software Project Management
- Science Is Broken. Here's How to Fix It
- Applying Scrum in Data Science Projects
- Manifesto for Agile Software Development
- Agile Development for Science: Scientific Work Can Also Benefit from Principles Derived from Software Development
- ml-experiment: A Python Framework for Reproducible Data Science
- Towards Agile Data Engineering for Small Scale Teams
- Tropes in Films: An Initial Analysis
- This AI Researcher Is Trying to Ward Off a Reproducibility Crisis
- Software Development Practices in Academia: A Case Study Comparison
- AxCell: Automatic Extraction of Results from Machine Learning Papers
- Work Organization and Mental Health Problems in PhD Students
- Who Needs MLOps: What Data Scientists Seek to Accomplish and How Can MLOps Help
- The Art of Evolutionary Algorithm Programming
- Most Published Research Findings Are False -- But a Little Replication Goes a Long Way
- (Position Paper) Applying Software Engineering Methods and Tools to CSE Research Projects
- How Can Scrum Be Successful? Competences of the Scrum Product Owner
- Nudging Open Science
- How LinkedIn, Uber, Lyft, Airbnb and Netflix Are Solving Data Management and Discovery for Machine Learning Solutions
- Code Sharing Is Associated with Research Impact in Image Processing
- Code Availability for Image Processing Papers: A Status Update
- GitHub Repositories with Links to Academic Papers: Open Access, Traceability, and Evolution
- The Agile Research Penultimatum
- Agile Research to Complement Agile Development: A Proposal for an mHealth Research Lifecycle
- PhDs: The Tortuous Truth
- Accelerating the Machine Learning Lifecycle with MLflow