key: cord-0274673-l8necwns
authors: Grealey, Jason; Lannelongue, Loïc; Saw, Woei-Yuh; Marten, Jonathan; Meric, Guillaume; Ruiz-Carmona, Sergio; Inouye, Michael
title: The carbon footprint of bioinformatics
date: 2021-03-09
journal: bioRxiv
DOI: 10.1101/2021.03.08.434372
sha: 6c8f2305f4b54d981612536326aee43dbde8641c
doc_id: 274673
cord_uid: l8necwns

Abstract

Bioinformatic research relies on large-scale computational infrastructures which have a non-zero carbon footprint. So far, no study has quantified the environmental costs of bioinformatic tools and commonly run analyses. In this study, we estimate the bioinformatic carbon footprint (in kilograms of CO2-equivalent units, kgCO2e) using the freely available Green Algorithms calculator (www.green-algorithms.org). We assess (i) bioinformatic approaches in genome-wide association studies (GWAS), RNA sequencing, genome assembly, metagenomics, phylogenetics and molecular simulations, as well as (ii) computation strategies, such as parallelisation, CPU (central processing unit) vs GPU (graphics processing unit), cloud vs. local computing infrastructure, and geography. In particular, for GWAS, we found that biobank-scale analyses emitted substantial kgCO2e and that simple software upgrades could make GWAS greener, e.g. upgrading from BOLT-LMM v1 to v2.3 reduced the carbon footprint by 73%. Switching from the average data centre to a more efficient data centre can reduce the carbon footprint by ~34%. Memory over-allocation can be a substantial contributor to an algorithm's carbon footprint. The use of faster processors or greater parallelisation reduces run time but can lead to a greater, sometimes substantially greater, carbon footprint. Finally, we provide guidance on how researchers can reduce power consumption and minimise kgCO2e. Overall, this work elucidates the carbon footprint of common analyses in bioinformatics and provides solutions which empower a move toward greener research.

[…] (203 tree-months), corresponding to 0.14 to 1.9 kgCO2e (0.2 to 2 tree-months) per sample. metaSPAdes had the greatest carbon footprint but also the best performance, followed by MetaVelvet and MEGAHIT, respectively.
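Tree-months contextualise emissions as the time a mature tree needs to sequester an equivalent amount of CO2. Below is a minimal sketch of the conversion; the sequestration rate of roughly 11 kgCO2e per tree-year is an assumption on our part (it is the value used by the Green Algorithms calculator and is consistent with the figures above, e.g. 1.9 kgCO2e ≈ 2 tree-months), not a number taken from this section:

```python
# Hypothetical helper: converts a carbon footprint into tree-months,
# assuming a mature tree sequesters ~11 kgCO2e per year.
KG_CO2E_PER_TREE_YEAR = 11.0  # assumed rate, not stated in this section

def tree_months(kg_co2e: float) -> float:
    """Months a mature tree would need to sequester `kg_co2e`."""
    return kg_co2e / (KG_CO2E_PER_TREE_YEAR / 12)

print(f"{tree_months(1.9):.1f} tree-months")  # ~2.1, matching the ~2 quoted above
```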
The use of cloud facilities may also enable further reductions of carbon footprint by allowing the choice of a geographic location with relatively low carbon intensity. While the kgCO2e of specific analyses utilising cloud or local data centre platforms is best estimated with the Green Algorithms calculator (www.green-algorithms.org), we found that a typical GWAS of UK Biobank considering 100 traits, using the aforementioned GWAS framework (see Genome-wide association analysis) together with BOLT-LMM v2.3 on a Google Cloud server in the UK, would lower the carbon footprint by 81% when compared with the average local data centre in Australia (Figure 1).

Memory

Unlike when working with a desktop computer or a laptop, most computational servers and cloud platforms give the option, or require, the user to choose the memory allocated. Given that it is common practice to over-allocate memory out of caution, we investigated the impact of memory allocation on carbon footprint in bioinformatics (Figure 3, Supplementary Table 1). We showed that, while increasing the allocated memory always increases the carbon footprint, the effect is particularly significant for tasks with large memory requirements (Figure 3, Supplementary Table 1). For example, in de novo human genome assembly, MEGAHIT had higher memory requirements than ABySS (6% vs 1% of total energy consumption); as a result, a five-fold over-allocation of memory increases the carbon footprint by 30% for MEGAHIT and 6% for ABySS. Similarly, in human RNA read alignment (Figure 3), Novoalign had the highest memory requirements (37% of its total energy vs less than 7% for STAR, HISAT2 and TopHat2), and a five-fold over-allocation of memory would increase its footprint by 186%, compared with 32% for STAR, 2% for HISAT2 and 10% for TopHat2.
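In the energy model used here (see Methods), only the memory term scales with the memory allocated rather than used, so the effect of over-allocation follows directly from the share of a job's energy drawn by memory. A minimal sketch of that relationship follows, assuming a job whose memory accounts for a fraction f of its energy at the required allocation; how an "x-fold over-allocation" maps onto the allocation ratio is our interpretation, not spelled out above:

```python
def footprint_increase(mem_energy_fraction: float, alloc_ratio: float) -> float:
    """Relative increase in carbon footprint when `alloc_ratio` times the
    required memory is allocated, given the fraction of total energy that
    memory draws at the required allocation. Only the memory term of
    E = t*(n_c*u_c*P_c + n_m*P_m)*PUE grows with the allocation."""
    return (alloc_ratio - 1) * mem_energy_fraction

# Assumed example: a MEGAHIT-like job where memory draws 6% of total energy.
for ratio in (2, 6):
    print(f"{ratio}x allocation -> +{footprint_increase(0.06, ratio):.0%}")
# 2x allocation -> +6%
# 6x allocation -> +30%  (consistent with the 30% quoted above if a
#                         "five-fold over-allocation" means 5x extra memory)
```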
Processors

We estimated the carbon footprint of a number of algorithms executed on both GPUs and CPUs. For cis-eQTL mapping (see Genome-wide association analysis), we estimated that, compared with the CPU-based FastQTL and LIMIX, using GPU-based software such as TensorQTL can reduce the carbon footprint by 96% and 99%, and the running time by 99.63% and 99.99%, respectively.

Limitations

There are a number of assumptions made when estimating the energy and carbon footprint of a given computational algorithm. These assumptions, and the associated limitations, are discussed in detail in Lannelongue et al. [70]. A particularly important limitation of our study is that many of the carbon footprints estimated are from a single run of a given tool; however, many analyses have parameters that must be fine-tuned through trial and error, frequently extensively so. For example, in machine learning, thousands of optimisation runs may be required. We stress that the total carbon footprint of a given project will likely scale linearly with the number of times each analysis is tuned or repeated, so a caveat to our estimates and the underlying published benchmarks is that the real carbon footprints could be orders of magnitude greater than those reported here.

Finally, the parameters needed to estimate the carbon footprint, such as running time, hardware information and software versions, are often missing from published articles. If we are to fully understand the carbon footprint of bioinformatics, or of computational research as a whole, this information needs to be reported and, ideally, authors should estimate their carbon footprint using freely available tools.

Conclusions

This study is, to the best of our knowledge, the first to estimate the carbon footprint of common bioinformatics tools. We further investigated how parallelisation, memory over-allocation and hardware choices affect carbon footprints. We also showed that carbon footprints can be reduced by utilising efficient computing facilities. Finally, we outline a number of ways bioinformaticians may reduce their carbon footprint.

Methods

Selection of bioinformatic tools

We estimated the carbon footprint of a range of tasks across the field of bioinformatics: genome and metagenome assembly, long- and short-read metagenomic classification, RNA-seq and phylogenetic analyses, GWAS, eQTL mapping, molecular simulations and molecular docking (Table 1). For each task, we curated the published literature to identify peer-reviewed studies that computationally benchmarked popular tools; 10 publications were selected for this study (Table 1). To be selected, publications had to report at least the running time, and preferably also the memory usage and the hardware used for the experiments, in particular the model and number of processing cores. In addition, as we could not find suitable benchmarks to estimate the carbon footprint of cohort-scale eQTL mapping and RNA-seq quality control pipelines, we estimated the carbon footprint of these tasks using in-house computations. These computations were run on the Baker Heart and Diabetes Institute computing cluster (Intel Xeon E5-2683 v4 CPUs and a Tesla T4 GPU) and the University of Cambridge's CSD3 computing cluster (Tesla P100 PCIe GPUs and Xeon Gold 6142 CPUs).

Estimating the carbon footprint

The carbon footprint of a given tool was calculated using the framework described in Lannelongue et al. [70] and the corresponding online calculator, www.green-algorithms.org. We present here an overview of the methodology.

Electricity production emits a variety of greenhouse gases (GHGs), each with a different impact on climate change. To summarise this, the carbon footprint is measured in kilograms of CO2-equivalent (CO2e), which is the amount of carbon dioxide with an equivalent global warming impact as a given mix of GHGs. This indicator depends on two factors: the energy needed to run the algorithm, and the global warming impact of producing that energy, called the carbon intensity. This can be summarised by:

C = E × CI

where C is the carbon footprint (in kilograms of CO2e, kgCO2e), E is the energy needed (in kWh) and CI is the carbon intensity (in kgCO2e/kWh).

The energy needs of an algorithm are estimated from the running time, the processing cores used, the memory deployed and the efficiency of the data centre:

E = t × (n_c × u_c × P_c + n_m × P_m) × PUE × 0.001

where t is the running time (h); n_c is the number of computing cores, each drawing a power P_c (W) and used at u_c, the core usage factor (between 0 and 1); n_m is the size of the memory available (GB), drawing a power P_m (W/GB); PUE is the Power Usage Effectiveness of the data centre; and the factor of 0.001 converts watt-hours to kilowatt-hours.
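To make the two formulas concrete, below is a minimal Python sketch of this model. The function names are ours, and the example values (TDP per core, memory power draw, PUE, carbon intensity) are placeholder assumptions for illustration, not inputs from this study:

```python
def energy_kwh(runtime_h: float, n_cores: int, core_usage: float,
               power_per_core_w: float, memory_gb: float,
               power_per_gb_w: float, pue: float) -> float:
    """E = t * (n_c * u_c * P_c + n_m * P_m) * PUE * 0.001 (Wh -> kWh)."""
    return (runtime_h
            * (n_cores * core_usage * power_per_core_w
               + memory_gb * power_per_gb_w)
            * pue * 0.001)

def carbon_footprint_kgco2e(energy: float, carbon_intensity: float) -> float:
    """C = E * CI, with E in kWh and CI in kgCO2e/kWh."""
    return energy * carbon_intensity

# Assumed example: a 10 h job on 8 cores at 100% usage (12 W TDP per core),
# 64 GB of memory at 0.3725 W/GB, in a data centre with a PUE of 1.67,
# powered at a carbon intensity of 0.475 kgCO2e/kWh.
e = energy_kwh(10, 8, 1.0, 12.0, 64, 0.3725, 1.67)
print(f"{e:.2f} kWh -> {carbon_footprint_kgco2e(e, 0.475):.2f} kgCO2e")
```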
The power drawn by a processor (CPU or GPU) is estimated by its Thermal Design Power (TDP) per core, which is provided by the manufacturer, and is then scaled by the core usage factor u_c. The power draw from memory was estimated at 0.3725 W/GB.

Figure 3: Each plot details the percentage increase in carbon footprint as a function of memory overestimation for a variety of bioinformatic tools and tasks. The numerical data are available in Supplementary Table 1.

References

Pan-cancer analysis demonstrates that integrating polygenic risk scores with modifiable risk factors improves risk prediction
Pan-cancer analysis of whole genomes
Patterns of somatic structural variation in human cancer genomes
Genomewide Association Study of Severe Covid-19 with Respiratory Failure
Data centres are chewing up vast amounts of energy
On Global Electricity Usage of Communication Technology: Trends to 2030
Air pollution
The 2019 report of The Lancet Countdown on health and climate change: ensuring that the health of a child born today is not defined by a changing climate
The UK Biobank resource with deep phenotyping and genomic data
Accelerating Detection of Disease – UK Research and Innovation
Is PUE actually going UP?
A comprehensive evaluation of assembly scaffolding tools
Scaffolding pre-assembled contigs using SSPACE
Efficient de novo assembly of large genomes using compressed data structures
SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler
Velvet: algorithms for de novo short read assembly using de Bruijn graphs
GAGE: A critical evaluation of genome assemblies and assembly algorithms
Choice of assembly software has a critical impact on virome characterisation
ABySS 2.0: resource-efficient assembly of large genomes using a Bloom filter
MEGAHIT v1.0: A fast and scalable metagenome assembler driven by advanced methodologies and community practices
Assembly Tools from a Microbiologist's Perspective – Not Only Size Matters!
metaSPAdes: a new versatile de novo metagenomics assembler
MetaVelvet: an extension of Velvet assembler to de novo metagenome assembly from short sequence reads
Strain-level metagenomic assignment and compositional estimation for long reads with MetaMaps
Improved metagenomic analysis with Kraken 2
Kraken: ultrafast metagenomic sequence classification using exact alignments
Bracken: estimating species abundance in metagenomics data
Centrifuge: rapid and sensitive classification of metagenomic sequences
High-Performance Computing in Bayesian Phylogenetics and Phylodynamics Using BEAGLE
Virus genomes reveal factors that spread and sustained the Ebola epidemic
BEAGLE: An Application Programming Interface and High-Performance Computing Library for Statistical Phylogenetics
Computational Performance and Statistical Accuracy of *BEAST and Comparisons with Other Methods
Simulation-based comprehensive benchmarking of RNA-seq aligners
STAR: ultrafast universal RNA-seq aligner
Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype
TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions
Role of Respiratory Viruses in Acute Upper and Lower Respiratory Tract Illness in the First Year of Life: A Birth Cohort Study
Early-life respiratory viral infections, atopic sensitization, and risk of subsequent development of persistent asthma
BBMap Guide
Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms
RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome
Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation
Comparative assessment of methods for the computational inference of transcript isoform abundance from RNA-seq data
Modelling and simulating generic RNA-Seq experiments with the flux simulator
GENCODE: the reference human genome annotation for The ENCODE Project
Mixed-model association for biobank-scale datasets
User Manual
Genetic effects on gene expression across human tissues
Fast and efficient QTL mapper for thousands of molecular phenotypes
Scaling computational genomics to millions of individuals with GPUs
LIMIX: genetic analysis of multiple traits
The pmemd.cuda GPU Implementation
The Amber biomolecular simulation programs
Scalable Molecular Dynamics with NAMD
rDock: A Fast, Versatile and Open Source Program for Docking Ligands to Proteins and Nucleic Acids
AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization and multithreading
Glide: A New Approach for Rapid, Accurate Docking and Scoring. 1. Method and Assessment of Docking Accuracy
Benchmarking Sets for Molecular Docking
Efficiency – Data Centers – Google
Microsoft's Cloud Infrastructure, Datacenters and Network Fact Sheet
[70] Green Algorithms: Quantifying the carbon footprint of computation
An analysis of memory power consumption in database systems
[74] carbonfootprint.com – International Electricity Factors
Greenhouse gas reporting: conversion factors 2019
CO2-equivalent emissions from European passenger vehicles in the years 1995–2015 based on real-world use: Assessing the climate benefit of the European 'diesel boom'

Supplementary Note 1: Estimating the running time at which a GPU has a lower carbon footprint

From rearranging the Green Algorithms carbon footprint formula (the PUE and carbon intensity terms cancel when both jobs run in the same data centre), it can be shown that the GPU implementation of a task has the lower carbon footprint whenever its running time satisfies:

t_GPU < t_CPU × (n_c × u_c × P_c + n_m × P_m) / (n_g × u_g × P_g + n_m × P_m)

where the subscripts c and g denote the CPU and GPU cores, respectively, and the same memory n_m is assumed for both jobs; a numerical sketch is given after the additional-file descriptions below.

Acknowledgements

We thank Kim […]

Descriptions of additional files:

Additional file 1: Hardware details for each analysis presented in this manuscript.
Additional file 2: The ratio of RNA reads per million and ratio of CPU time of 10 random in-house PBMC samples, from the RNA sequencing quality control pipeline task.
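Below is a minimal numerical sketch of the inequality in Supplementary Note 1; the helper name and the hardware numbers are assumptions for illustration, not values from this study:

```python
def max_gpu_runtime_h(cpu_runtime_h: float, cpu_core_power_w: float,
                      gpu_core_power_w: float, memory_gb: float,
                      mem_power_w_per_gb: float = 0.3725) -> float:
    """Longest GPU running time for which the GPU job still has a lower
    carbon footprint than the CPU job. `cpu_core_power_w` and
    `gpu_core_power_w` are the total scaled core draws n*u*P; PUE and
    carbon intensity cancel because both jobs share the same data centre."""
    p_mem = memory_gb * mem_power_w_per_gb
    return cpu_runtime_h * (cpu_core_power_w + p_mem) / (gpu_core_power_w + p_mem)

# Assumed example: a 10 h CPU job drawing 96 W of core power vs a 250 W GPU,
# both with 16 GB of memory allocated.
print(f"GPU is greener if it finishes in under "
      f"{max_gpu_runtime_h(10, 96, 250, 16):.1f} h")  # ~4.0 h
```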