key: cord-0274673-l8necwns
authors: Grealey, Jason; Lannelongue, Loïc; Saw, Woei-Yuh; Marten, Jonathan; Meric, Guillaume; Ruiz-Carmona, Sergio; Inouye, Michael
title: The carbon footprint of bioinformatics
date: 2021-03-09
journal: bioRxiv
DOI: 10.1101/2021.03.08.434372
sha: 6c8f2305f4b54d981612536326aee43dbde8641c
doc_id: 274673
cord_uid: l8necwns

Abstract

Bioinformatic research relies on large-scale computational infrastructures which have a non-zero carbon footprint. So far, no study has quantified the environmental costs of bioinformatic tools and commonly run analyses. In this study, we estimate the bioinformatic carbon footprint (in kilograms of CO2-equivalent units, kgCO2e) using the freely available Green Algorithms calculator (www.green-algorithms.org). We assess (i) bioinformatic approaches in genome-wide association studies (GWAS), RNA sequencing, genome assembly, metagenomics, phylogenetics and molecular simulations, as well as (ii) computation strategies, such as parallelisation, CPU (central processing unit) vs GPU (graphics processing unit), cloud vs. local computing infrastructure, and geography. In particular, for GWAS, we found that biobank-scale analyses emitted substantial kgCO2e and that simple software upgrades could make GWAS greener, e.g. upgrading from BOLT-LMM v1 to v2.3 reduced the carbon footprint by 73%. Switching from the average data centre to a more efficient data centre can reduce the carbon footprint by ~34%. Memory over-allocation can be a substantial contributor to an algorithm's carbon footprint. The use of faster processors or greater parallelisation reduces run time but can lead to a greater, sometimes substantially greater, carbon footprint. Finally, we provide guidance on how researchers can reduce power consumption and minimise kgCO2e. Overall, this work elucidates the carbon footprint of common analyses in bioinformatics and provides solutions which empower a move toward greener research.

[…] (203 tree-months), corresponding to 0.14 to 1.9 kgCO2e (0.2 to 2 tree-months) per sample. metaSPAdes had the greatest carbon footprint but also the best performance, followed by MetaVelvet and MEGAHIT, respectively.
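Tree-months contextualise emissions as the time a mature tree needs to sequester an equivalent amount of CO2. Below is a minimal sketch of the conversion; the sequestration rate of roughly 11 kgCO2e per tree-year is an assumption on our part (it is the value used by the Green Algorithms calculator and is consistent with the figures above, e.g. 1.9 kgCO2e ≈ 2 tree-months), not a number taken from this section:

```python
# Hypothetical helper: converts a carbon footprint into tree-months,
# assuming a mature tree sequesters ~11 kgCO2e per year.
KG_CO2E_PER_TREE_YEAR = 11.0  # assumed rate, not stated in this section

def tree_months(kg_co2e: float) -> float:
    """Months a mature tree would need to sequester `kg_co2e`."""
    return kg_co2e / (KG_CO2E_PER_TREE_YEAR / 12)

print(f"{tree_months(1.9):.1f} tree-months")  # ~2.1, matching the ~2 quoted above
```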
The use of cloud facilities may also enable further reductions of carbon footprint by allowing the choice of a geographic location with relatively low carbon intensity. While the kgCO2e of specific analyses utilising cloud or local data centre platforms is best estimated with the Green Algorithms calculator (www.green-algorithms.org), we found that a typical GWAS of UK Biobank considering 100 traits, using the aforementioned GWAS framework (see Genome-wide association analysis) together with BOLT-LMM v2.3 on a Google Cloud server in the UK, would lower the carbon footprint by 81% when compared with the average local data centre in Australia (Figure 1).

Memory

Unlike when working with a desktop computer or a laptop, most computational servers and cloud platforms give the option, or require, the user to choose the memory allocated. Given that it is common practice to over-allocate memory out of caution, we investigated the impact of memory allocation on carbon footprint in bioinformatics (Figure 3, Supplementary Table 1). We showed that, while increasing the allocated memory always increases the carbon footprint, the effect is particularly significant for tasks with large memory requirements (Figure 3, Supplementary Table 1). For example, in de novo human genome assembly, MEGAHIT had higher memory requirements than ABySS (6% vs 1% of total energy consumption); as a result, a five-fold over-allocation of memory increases the carbon footprint by 30% for MEGAHIT and 6% for ABySS. Similarly, in human RNA read alignment (Figure 3), Novoalign had the highest memory requirements (37% of its total energy vs less than 7% for STAR, HISAT2 and TopHat2), and a five-fold over-allocation of memory would increase its footprint by 186%, compared with 32% for STAR, 2% for HISAT2 and 10% for TopHat2.
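In the energy model used here (see Methods), only the memory term scales with the memory allocated rather than used, so the effect of over-allocation follows directly from the share of a job's energy drawn by memory. A minimal sketch of that relationship follows, assuming a job whose memory accounts for a fraction f of its energy at the required allocation; how an "x-fold over-allocation" maps onto the allocation ratio is our interpretation, not spelled out above:

```python
def footprint_increase(mem_energy_fraction: float, alloc_ratio: float) -> float:
    """Relative increase in carbon footprint when `alloc_ratio` times the
    required memory is allocated, given the fraction of total energy that
    memory draws at the required allocation. Only the memory term of
    E = t*(n_c*u_c*P_c + n_m*P_m)*PUE grows with the allocation."""
    return (alloc_ratio - 1) * mem_energy_fraction

# Assumed example: a MEGAHIT-like job where memory draws 6% of total energy.
for ratio in (2, 6):
    print(f"{ratio}x allocation -> +{footprint_increase(0.06, ratio):.0%}")
# 2x allocation -> +6%
# 6x allocation -> +30%  (consistent with the 30% quoted above if a
#                         "five-fold over-allocation" means 5x extra memory)
```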
Processors

We estimated the carbon footprint of a number of algorithms executed on both GPUs and CPUs. For cis-eQTL mapping (see Genome-wide association analysis), we estimated that, compared with the CPU-based FastQTL and LIMIX, using GPU-based software such as TensorQTL can reduce the carbon footprint by 96% and 99%, and the running time by 99.63% and 99.99%, respectively.

Limitations

There are a number of assumptions made when estimating the energy and carbon footprint of a given computational algorithm. These assumptions, and the associated limitations, are discussed in detail in Lannelongue et al. [70]. A particularly important limitation of our study is that many of the carbon footprints estimated are from a single run of a given tool; however, many analyses have parameters that must be fine-tuned through trial and error, frequently extensively so. For example, in machine learning, thousands of optimisation runs may be required. We stress that the total carbon footprint of a given project will likely scale linearly with the number of times each analysis is tuned or repeated, so a caveat to our estimates and the underlying published benchmarks is that the real carbon footprints could be orders of magnitude greater than those reported here.

Finally, the parameters needed to estimate the carbon footprint, such as running time, hardware information and software versions, are often missing from published articles. If we are to fully understand the carbon footprint of bioinformatics, or of computational research as a whole, this information needs to be reported and, ideally, authors should estimate their carbon footprint using freely available tools.

Conclusions

This study is, to the best of our knowledge, the first to estimate the carbon footprint of common bioinformatics tools. We further investigated how parallelisation, memory over-allocation and hardware choices affect carbon footprints. We also showed that carbon footprints can be reduced by utilising efficient computing facilities. Finally, we outline a number of ways bioinformaticians may reduce their carbon footprint.

Methods

Selection of bioinformatic tools

We estimated the carbon footprint of a range of tasks across the field of bioinformatics: genome and metagenome assembly, long- and short-read metagenomic classification, RNA-seq and phylogenetic analyses, GWAS, eQTL mapping, molecular simulations and molecular docking (Table 1). For each task, we curated the published literature to identify peer-reviewed studies that computationally benchmarked popular tools; 10 publications were selected for this study (Table 1). To be selected, publications had to report at least the running time, and preferably also the memory usage and the hardware used for the experiments, in particular the model and number of processing cores. In addition, as we could not find suitable benchmarks to estimate the carbon footprint of cohort-scale eQTL mapping and RNA-seq quality control pipelines, we estimated the carbon footprint of these tasks using in-house computations. These computations were run on the Baker Heart and Diabetes Institute computing cluster (Intel Xeon E5-2683 v4 CPUs and a Tesla T4 GPU) and the University of Cambridge's CSD3 computing cluster (Tesla P100 PCIe GPUs and Xeon Gold 6142 CPUs).

Estimating the carbon footprint

The carbon footprint of a given tool was calculated using the framework described in Lannelongue et al. [70] and the corresponding online calculator, www.green-algorithms.org. We present here an overview of the methodology.

Electricity production emits a variety of greenhouse gases (GHGs), each with a different impact on climate change. To summarise this, the carbon footprint is measured in kilograms of CO2-equivalent (CO2e), which is the amount of carbon dioxide with an equivalent global warming impact as a given mix of GHGs. This indicator depends on two factors: the energy needed to run the algorithm, and the global warming impact of producing that energy, called the carbon intensity. This can be summarised by:

C = E × CI

where C is the carbon footprint (in kilograms of CO2e, kgCO2e), E is the energy needed (in kWh) and CI is the carbon intensity (in kgCO2e/kWh).

The energy needs of an algorithm are estimated from the running time, the processing cores used, the memory deployed and the efficiency of the data centre:

E = t × (n_c × u_c × P_c + n_m × P_m) × PUE × 0.001

where t is the running time (h); n_c is the number of computing cores, each drawing a power P_c (W) and used at u_c, the core usage factor (between 0 and 1); n_m is the size of the memory available (GB), drawing a power P_m (W/GB); PUE is the Power Usage Effectiveness of the data centre; and the factor of 0.001 converts watt-hours to kilowatt-hours.
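To make the two formulas concrete, below is a minimal Python sketch of this model. The function names are ours, and the example values (TDP per core, memory power draw, PUE, carbon intensity) are placeholder assumptions for illustration, not inputs from this study:

```python
def energy_kwh(runtime_h: float, n_cores: int, core_usage: float,
               power_per_core_w: float, memory_gb: float,
               power_per_gb_w: float, pue: float) -> float:
    """E = t * (n_c * u_c * P_c + n_m * P_m) * PUE * 0.001 (Wh -> kWh)."""
    return (runtime_h
            * (n_cores * core_usage * power_per_core_w
               + memory_gb * power_per_gb_w)
            * pue * 0.001)

def carbon_footprint_kgco2e(energy: float, carbon_intensity: float) -> float:
    """C = E * CI, with E in kWh and CI in kgCO2e/kWh."""
    return energy * carbon_intensity

# Assumed example: a 10 h job on 8 cores at 100% usage (12 W TDP per core),
# 64 GB of memory at 0.3725 W/GB, in a data centre with a PUE of 1.67,
# powered at a carbon intensity of 0.475 kgCO2e/kWh.
e = energy_kwh(10, 8, 1.0, 12.0, 64, 0.3725, 1.67)
print(f"{e:.2f} kWh -> {carbon_footprint_kgco2e(e, 0.475):.2f} kgCO2e")
```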
The power drawn by a processor (CPU or GPU) is estimated by its Thermal Design Power (TDP) per core, which is provided by the manufacturer, and is then scaled by the core usage factor u_c. The power draw from memory was estimated at 0.3725 W/GB.

Figure 3: Each plot details the percentage increase in carbon footprint as a function of memory overestimation for a variety of bioinformatic tools and tasks. The numerical data are available in Supplementary Table 1.

References

Pan-cancer analysis demonstrates that integrating polygenic risk scores with modifiable risk factors improves risk prediction
Pan-cancer analysis of whole genomes
Patterns of somatic structural variation in human cancer genomes
Genomewide Association Study of Severe Covid-19 with Respiratory Failure
Data centres are chewing up vast amounts of energy
On Global Electricity Usage of Communication Technology: Trends to 2030
Air pollution
The 2019 report of The Lancet Countdown on health and climate change: ensuring that the health of a child born today is not defined by a changing climate
The UK Biobank resource with deep phenotyping and genomic data
Accelerating Detection of Disease – UK Research and Innovation
Is PUE actually going UP?
A comprehensive evaluation of assembly scaffolding tools
Scaffolding pre-assembled contigs using SSPACE
Efficient de novo assembly of large genomes using compressed data structures
SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler
Velvet: algorithms for de novo short read assembly using de Bruijn graphs
GAGE: A critical evaluation of genome assemblies and assembly algorithms
Choice of assembly software has a critical impact on virome characterisation
ABySS 2.0: resource-efficient assembly of large genomes using a Bloom filter
MEGAHIT v1.0: A fast and scalable metagenome assembler driven by advanced methodologies and community practices
Assembly Tools from a Microbiologist's Perspective – Not Only Size Matters!
metaSPAdes: a new versatile de novo metagenomics assembler
MetaVelvet: an extension of Velvet assembler to de novo metagenome assembly from short sequence reads
Strain-level metagenomic assignment and compositional estimation for long reads with MetaMaps
Improved metagenomic analysis with Kraken 2
Kraken: ultrafast metagenomic sequence classification using exact alignments
Bracken: estimating species abundance in metagenomics data
Centrifuge: rapid and sensitive classification of metagenomic sequences
High-Performance Computing in Bayesian Phylogenetics and Phylodynamics Using BEAGLE
Virus genomes reveal factors that spread and sustained the Ebola epidemic
BEAGLE: An Application Programming Interface and High-Performance Computing Library for Statistical Phylogenetics
Computational Performance and Statistical Accuracy of *BEAST and Comparisons with Other Methods
Simulation-based comprehensive benchmarking of RNA-seq aligners
STAR: ultrafast universal RNA-seq aligner
Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype
TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions
Role of Respiratory Viruses in Acute Upper and Lower Respiratory Tract Illness in the First Year of Life: A Birth Cohort Study
Early-life respiratory viral infections, atopic sensitization, and risk of subsequent development of persistent asthma
BBMap Guide
Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms
RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome
Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation
Comparative assessment of methods for the computational inference of transcript isoform abundance from RNA-seq data
Modelling and simulating generic RNA-Seq experiments with the flux simulator
GENCODE: the reference human genome annotation for The ENCODE Project
Mixed-model association for biobank-scale datasets
User Manual
Genetic effects on gene expression across human tissues
Fast and efficient QTL mapper for thousands of molecular phenotypes
Scaling computational genomics to millions of individuals with GPUs
LIMIX: genetic analysis of multiple traits
The pmemd.cuda GPU Implementation
The Amber biomolecular simulation programs
Scalable Molecular Dynamics with NAMD
rDock: A Fast, Versatile and Open Source Program for Docking Ligands to Proteins and Nucleic Acids
AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization and multithreading
Glide: A New Approach for Rapid, Accurate Docking and Scoring. 1. Method and Assessment of Docking Accuracy
Benchmarking Sets for Molecular Docking
Efficiency – Data Centers – Google
Microsoft's Cloud Infrastructure, Datacenters and Network Fact Sheet
[70] Green Algorithms: Quantifying the carbon footprint of computation
An analysis of memory power consumption in database systems
[74] carbonfootprint.com – International Electricity Factors
Greenhouse gas reporting: conversion factors 2019
CO2-equivalent emissions from European passenger vehicles in the years 1995–2015 based on real-world use: Assessing the climate benefit of the European 'diesel boom'

Supplementary Note 1: Estimating the running time at which a GPU has a lower carbon footprint

From rearranging the Green Algorithms carbon footprint formula (the PUE and carbon intensity terms cancel when both jobs run in the same data centre), it can be shown that the GPU implementation of a task has the lower carbon footprint whenever its running time satisfies:

t_GPU < t_CPU × (n_c × u_c × P_c + n_m × P_m) / (n_g × u_g × P_g + n_m × P_m)

where the subscripts c and g denote the CPU and GPU cores, respectively, and the same memory n_m is assumed for both jobs; a numerical sketch is given after the additional-file descriptions below.

Acknowledgements

We thank Kim […]

Descriptions of additional files:

Additional file 1: Hardware details for each analysis presented in this manuscript.
Additional file 2: The ratio of RNA reads per million and ratio of CPU time of 10 random in-house PBMC samples, from the RNA sequencing quality control pipeline task.
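Below is a minimal numerical sketch of the inequality in Supplementary Note 1; the helper name and the hardware numbers are assumptions for illustration, not values from this study:

```python
def max_gpu_runtime_h(cpu_runtime_h: float, cpu_core_power_w: float,
                      gpu_core_power_w: float, memory_gb: float,
                      mem_power_w_per_gb: float = 0.3725) -> float:
    """Longest GPU running time for which the GPU job still has a lower
    carbon footprint than the CPU job. `cpu_core_power_w` and
    `gpu_core_power_w` are the total scaled core draws n*u*P; PUE and
    carbon intensity cancel because both jobs share the same data centre."""
    p_mem = memory_gb * mem_power_w_per_gb
    return cpu_runtime_h * (cpu_core_power_w + p_mem) / (gpu_core_power_w + p_mem)

# Assumed example: a 10 h CPU job drawing 96 W of core power vs a 250 W GPU,
# both with 16 GB of memory allocated.
print(f"GPU is greener if it finishes in under "
      f"{max_gpu_runtime_h(10, 96, 250, 16):.1f} h")  # ~4.0 h
```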