Genome annotation across species using deep convolutional neural networks

Ghazaleh Khodabandelou (1,2), Etienne Routhier (1) and Julien Mozziconacci (1,3,4)

1 Laboratoire de Physique Théorique de la Matière Condensée (LPTMC), Sorbonne Université, Paris, France
2 Laboratoire Images, Signaux et Systèmes Intelligents (LISSI), Université Val-de-Marne (Paris XII), Paris, France
3 CNRS UMR 7196 / INSERM U1154 - Sorbonne Université, Museum national d'Histoire naturelle (MNHN), Paris, France
4 Institut Universitaire de France, Paris, France

Submitted 15 October 2019. Accepted 18 May 2020. Published 15 June 2020.
Corresponding authors: Ghazaleh Khodabandelou (ghazaleh.khodabandelou@u-pec.fr, ghazaleh.khodabandeh@gmail.com), Julien Mozziconacci (julien.mozziconacci@mnhn.fr).
Academic editor: James Procter.
DOI 10.7717/peerj-cs.278. Copyright 2020 Khodabandelou et al. Distributed under Creative Commons CC-BY 4.0.

ABSTRACT
The application of deep neural networks is a rapidly expanding field now reaching many disciplines, including genomics. In particular, convolutional neural networks have been exploited for identifying the functional role of short genomic sequences. These approaches rely on gathering large sets of sequences with known functional roles, extracting those sequences from whole-genome annotations. These sets are then split into learning, test and validation sets in order to train the networks. While the obtained networks perform well on validation sets, they often perform poorly when applied on whole genomes, in which the ratio of positive over negative examples can be very different from that in the training set. We here address this issue by assessing the genome-wide performance of networks trained with sets exhibiting different ratios of positive to negative examples. As a case study, we use sequences encompassing gene starts from the RefGene database as positive examples and random genomic sequences as negative examples. We then demonstrate that models trained using data from one organism can be used to predict gene-start sites in a related species, when using training sets providing good genome-wide performance. This cross-species application of convolutional neural networks provides a new way to annotate any genome from existing high-quality annotations in a related reference species. It also provides a way to determine whether the sequence motifs recognised by chromatin-associated proteins in different species are conserved or not.

Subjects: Bioinformatics, Computational Biology, Data Mining and Machine Learning
Keywords: Transcription start sites, Promoters, Genome annotation, Deep learning, DNA motifs, Sequence evolution, Unbalanced datasets

INTRODUCTION
The improvement of DNA sequencing techniques has led to an explosion in the number and completeness of fully sequenced genomes. One of the major goals in the field is to annotate these DNA sequences, that is, to associate a biological function with sequence motifs located at different positions along the genome (Stein, 2001). In the human genome for instance, while some DNA sequences encode proteins, most sequences do not code for any protein. Many of these non-coding sequences are nevertheless conserved in related species and are necessary for the correct regulation of gene expression. Deciphering the function of these non-coding sequences has been increasingly achieved through improvements in
the throughput of next generation sequencing (Rivera & Ren, 2013). The 3.2 billion base pair (bp) long human genome is now annotated with many functional and bio-chemical cues (Kundaje et al., 2015; ENCODE Project Consortium, 2012), among which are the initiation sites of gene transcription (Carninci et al., 2006; Georgakilas, Perdikopanis & Hatzigeorgiou, 2020). While these annotations are becoming more numerous and precise, they cannot be determined experimentally for every organism and every cell type, as the experiments needed to produce these annotations are often costly and difficult to carry out. Computational methods are therefore widely used to extract sequence information from known annotations and extrapolate the results to different genomes or conditions, e.g., Kundaje et al. (2015) and Durham et al. (2018).

A related question is to understand the link between these annotations and the underlying DNA sequence. To this end, supervised machine learning algorithms (Goodfellow, Bengio & Courville, 2016) have been particularly successful (Zou et al., 2019; Angermueller et al., 2016). Among those, deep Convolutional Neural Networks (CNNs) are very efficient at detecting sequence features since they rely on the optimisation of convolution filters that can be directly matched to DNA motifs (Ching et al., 2018). Stacking several of these convolution layers together can lead to the detection of nested motifs at larger scales. Pioneering studies illustrated this ability of CNNs to reliably grasp complex combinations of DNA motifs and their relationship with functional regions of the genome (Min et al., 2016; Umarov & Solovyev, 2017; Alipanahi et al., 2015; Zhou & Troyanskaya, 2015; Kelley, Snoek & Rinn, 2016; Pachganov et al., 2019). Min et al. (2016) used a CNN to predict enhancers, which are specific sequences that regulate gene expression at a distance. This method performed very well and ranked above state-of-the-art support vector machine-based methods. Similar tools were used in different contexts, aiming at identifying promoters (Umarov & Solovyev, 2017; Pachganov et al., 2019) or detecting splice sites (Leung et al., 2014; Jaganathan et al., 2019). In these approaches, a sample set is first created by taking all positive class sequences (e.g., enhancers) and adding the same amount of randomly picked negative class examples (e.g., non-enhancers). This sample set is then divided into training, validation and test sets. Balancing the data ensures that the model will be trained on the same number of positive and negative examples, thus giving the same importance to both classes. While these approaches are very successful when assessed on test sets derived from the sample set, we show here that they tend to perform poorly when applied on entire chromosome sequences, as required for the task of complete genome annotation.
This is due to the fact that the networks are optimised on a similar number of positive and negative examples during training, but that they will usually face very different ratios of negative over positive classes when used on a full chromosome sequence (He & Garcia, 2009). Alternative approaches (Alipanahi et al., 2015; Kelley, Snoek & Rinn, 2016) used unbalanced datasets for training (i.e., with more negative than positive examples) to predict DNA-binding sites for proteins and genome accessibility. In these two studies, however, the prediction performance of the model is also assessed on test sets derived from training sets, not on full genomic sequences. The task of genome-wide prediction has been assessed in a more recent study aiming at identifying cell type specific regulatory elements (Kelley et al., 2018). In order to infer long range relationships between these elements, Kelley et al. used very long (131 kb) non-overlapping windows covering the whole genome. This approach has proven efficient but requires a lot of computational memory. As our goal is to provide genome-wide predictions, the methodology we used is inspired from this last study. Since we do not aim here at predicting cell type specific features, we could use shorter sequences as input and a simpler network architecture.

We also present two novelties for the development and for the performance assessment of genome-wide predictions. Firstly, we do not use as a quality measure the classical prediction scores computed on test sets obtained by dividing the sample data into training, validation and test sets, as is commonly done in machine learning. Rather, we compute prediction scores that assess the ability of our model to annotate a full chromosome sequence by designing a specific metric (described in Methods). Secondly, we change the ratio between positive and negative examples in order to obtain the highest prediction scores and show that this tuning has an important effect on the outcome.

As a proof of principle, we use in this work gene start sites (GSS) as features. DNA motifs around GSS are recognised by the transcription machinery and indicate the location of the initiation of transcription (Kugel & Goodrich, 2017). The DNA sequence surrounding GSS therefore contains the information that could in principle be used by an algorithm to identify in silico the GSS locations. These DNA sequence motifs are different for different classes of genes. For instance, protein coding genes can have either CG di-nucleotide (CpG) rich or poor sequences upstream of their GSS (Deaton & Bird, 2011). We show that using training sets with a higher ratio of negative over positive examples, we can faithfully retrieve GSS positions, with performances varying for different classes of genes such as coding or non-coding genes.

We then propose a new application of CNNs in genomics that leverages the fact that similar organisms tend to have similar regulatory mechanisms, i.e., rely on homologous molecular machinery and on homologous DNA regulatory motifs. Exploiting these homologies, we first train a model on a dataset corresponding to a given organism and use it to predict the annotation on the genome of a related organism, opening new opportunities for the task of de-novo genome annotation.
We show that a CNN trained on GSS-containing regions in human is able to recover regions containing GSS in the mouse genome and vice versa. We also assess the generalisation of the approach to more distant species, taking as examples Gallus gallus and Danio rerio.

METHODS

Input generation
Genomic sequences were downloaded for the reference genomes of human (hg38), mouse (mm10), chicken (galGal4) and zebrafish (danRer10) via the URLs shown in Table 1. Similarly, GSS positions for each genome were extracted from their respective NCBI RefSeq Reference Gene annotations (RefGene).

Table 1. URLs of the data used in the present work.

Genomes
  human      https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
  mouse      https://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/chromFa.tar.gz
  chicken    https://hgdownload.soe.ucsc.edu/goldenPath/galGal4/bigZips/galGal4.fa.gz
  zebrafish  https://egg.wustl.edu/d/danRer10/refGene.gz

Reference Gene
  human      https://egg.wustl.edu/d/hg38/refGene.gz
  mouse      https://egg.wustl.edu/d/mm10/refGene.gz
  chicken    https://egg.wustl.edu/d/galGal4/refGene.gz
  zebrafish  https://egg.wustl.edu/d/danRer10/refGene.gz

As a positive input class, we use regions of 299 bp flanking GSS (i.e., ±149 bp around the GSS), which are supposed to contain multiple sequence signals indicating the presence of a GSS to the transcription machinery of the cell. For instance, in the human genome, 31,037 GSS positions are extracted on both DNA strands (15,798 for the positive strand and 15,239 for the negative strand). In order to generate the negative class, we select 31,037 × Q sequences of 299 bp at random positions on a random strand, rejecting regions that do contain a GSS. The odds of getting at random a genomic region containing a GSS are close to 0.28%. For Q = 1, there is an equal number of negative and positive class examples. Unbalanced datasets are produced using different values of Q ranging from 1 to 100. For Q = 100, the negative class encompasses 100 × 299 bp × 31,037 ≈ 1 Gb, which represents one third of the human genome. For the other genomes a similar procedure was implemented. The total number of GSS used was 25,698 for the mouse, 6,876 for the chicken and 14,805 for the zebrafish.

Convolutional Neural Network (CNN)
A CNN (see Fig. 1) is trained in order to predict the presence of a GSS in a DNA sequence of size 299 bp. The shape of the input layer is c × b, in which c = 4 is the number of different nucleotides and b = 299 is the length of the input sequence. The nucleotide sequences are one hot encoded so that A = (1,0,0,0), T = (0,1,0,0), C = (0,0,1,0), and G = (0,0,0,1). The training set contains N samples of labelled pairs (X^{(n)}, y^{(n)}), for n ∈ {1,...,N}, where X^{(n)} are matrices of size c × b and y^{(n)} ∈ {0,1}. Each X^{(n)} is associated with y^{(n)} = 1 when it corresponds to a region containing a GSS and y^{(n)} = 0 otherwise.

The first convolution layer consists of k kernels of length s, which are applied on the b − s + 1 successive subsequences starting at positions p ∈ {1,...,(b − s + 1)} to recognise relevant DNA motifs of size s. This operation generates an output feature map of size k × (b − s + 1) for an input X^{(n)} of size c × b. The feature map M resulting from the convolution operation is computed as follows:

    M_{p,i} = \sum_{j=1}^{c} \sum_{r=1}^{s} W_{i,j,r} \, X_{p+r-1,\,j} + B_i, \qquad i \in \{1,\dots,k\}    (1)

where W denotes the network weights, with size (k × c × s), and B denotes the biases, with size (k × 1) (see e.g., Goodfellow, Bengio & Courville, 2016).
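To make the one hot encoding and the convolution of Eq. (1) concrete, the following minimal NumPy sketch (not part of the published code; function names, the position-first array layout and the toy values are ours) encodes a DNA string and computes the feature map M for a single 299 bp window:

    import numpy as np

    NUC_INDEX = {'A': 0, 'T': 1, 'C': 2, 'G': 3}   # same channel order as in the text

    def one_hot(seq):
        # Encode a DNA string into a (b x c) binary matrix (positions x nucleotides).
        X = np.zeros((len(seq), 4))
        for p, nuc in enumerate(seq.upper()):
            if nuc in NUC_INDEX:                   # ambiguous bases (e.g., N) stay all-zero
                X[p, NUC_INDEX[nuc]] = 1.0
        return X

    def feature_map(X, W, B):
        # Eq. (1): X is (b x c), W is (k x s x c), B is (k,); returns M of shape (b-s+1, k).
        b, c = X.shape
        k, s, _ = W.shape
        M = np.zeros((b - s + 1, k))
        for p in range(b - s + 1):
            for i in range(k):
                M[p, i] = np.sum(W[i] * X[p:p + s, :]) + B[i]
        return M

    # Toy example: one 299 bp window, k = 32 kernels of length s = 4 (as in Table 2).
    rng = np.random.default_rng(0)
    X = one_hot(''.join(rng.choice(list('ACGT'), size=299)))
    M = feature_map(X, rng.normal(size=(32, 4, 4)) * 0.01, np.zeros(32))
    print(M.shape)                                 # (296, 32), i.e., (b - s + 1) x k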
After the convolution layer, a non-linear function is applied to the output, here a Rectified Linear Unit (ReLU). This activation function computes f_ReLU(M) = max(0, M), incorporating non-linearity by transforming all negative values to zero. In order to reduce the input dimension, we apply a max-pooling process with a pool size m over the output of f_ReLU(M). Similar convolution layers followed by ReLU and max-pooling are added sequentially after the first layer to grasp higher-order motifs. The output of the last max-pooling layer is then fed into a fully connected layer, whose output x is transformed by a softmax layer, i.e., a sigmoid function φ(x) = 1/(1 + e^{-x}), in order to give the final output of the CNN. This final score of the input sequence is ideally 0 for non-GSS and 1 for GSS-containing sequences. When we need to perform a classification, we use a threshold of 0.5 to discriminate between the two classes.

Figure 1. Overview of the CNN model. The 299 bp-long sequences are one hot encoded into a 4×299 input matrix. The first CNN layer performs a convolution on each input matrix to recognise relevant motifs. The next convolutional layers model the interplay among these motifs to grasp higher-level features. Max-pooling layers reduce the dimensions of the layers. The model is trained to correctly label input sequences as GSS or non-GSS. The output layer of the trained network then gives a probability for any 299 bp region to contain a GSS. It can be applied along a full chromosome, i.e., on all 299 bp-long sequences with a 1 bp shift. Full-size DOI: 10.7717/peerjcs.278/fig-1

In the training phase, the weights and biases of the convolution layers and the fully connected layer are updated via back-propagation (Rumelhart, Hinton & Williams, 1986) in a way that decreases the loss, which measures the discrepancy between the network predictions and the true labels, averaged over individual examples. We use here the binary cross-entropy, computed as:

    L = -\frac{1}{N} \sum_{n=1}^{N} \left[ y^{(n)} \log \hat{y}^{(n)} + (1 - y^{(n)}) \log (1 - \hat{y}^{(n)}) \right]    (2)

where ŷ^{(n)} is the estimated score for the input sample X^{(n)}. As the data are imbalanced for Q > 1, the model may reach a local optimum in which it predicts the non-GSS class for all input sequences. In order to deal with this issue, we attribute different weights to the positive and negative classes. We assign a greater importance to the less represented GSS class by multiplying the associated term in the loss by a weight C_W = (number of non-GSS) / (number of GSS) = Q.

One of the important issues of any learning algorithm is overfitting. Overfitting occurs when the model fits the training and validation data well but does not generalise to new, unseen data.
To deal with this issue, a regularisation procedure called dropout is usually used (Srivastava et al., 2014). In the training step, some outputs of the pooling layers are randomly masked while the remaining information is fed as input to the next layer.

Implementation
We implement the CNN using the Keras library (Chollet, 2015) with TensorFlow (Abadi, Agarwal & Barham, 2015) as back-end. Training on a GPU is typically faster than on a CPU; we use here a GTX 1070 Ti GPU. We use Adaptive Moment Estimation (Adam) to compute adaptive learning rates for each parameter (Kingma & Ba, 2014). The Adam optimiser is an algorithm for first-order stochastic gradient-based optimisation of functions, based on adaptive estimates of lower-order moments. The network architecture (see Fig. 1) is detailed in Table 2. The models are trained for 150 epochs and mostly converge rapidly (around 30–35 epochs); we use early stopping to prevent overfitting. Hyper-parameter tuning is detailed in the supplementary materials. Source code is available at https://github.com/StudyTSS/DeepTSS/.

Genome-wide performance measure
Different measures have been developed in order to assess the performance of different models on conventional test sets, i.e., test sets derived from a subset of the initial data. Such measures are described in detail in the corresponding supplementary materials section. In our case, we want to apply our model on all the 299 bp windows spanning a full chromosome and possibly chromosomes from other species. Specifically, the model was tested on chromosome 21, which was withdrawn from the training set. We therefore developed a measure to evaluate the performance of the trained models in this case. This metric, called λ, measures the enhancement of the predicted signal specifically in the regions surrounding the known GSS. We use in the present paper regions of length r = 400 bp.

To compute λ, we first compute the genome-wide Z-score (Kreyszig, 2009) Z_g = (y_g − μ)/σ from the predictions y_g, where g denotes positions on the genome, and μ and σ stand for the prediction mean and standard deviation, respectively. We extract Z_GSS, the Z_g signal over 10 kb windows centred on each GSS of the test region, e.g., a full chromosome. Z_GSS is a 2D array whose rows correspond to different genes and columns to different distances to the GSS. We then average Z_GSS element-wise over all GSS, i.e., along all rows. This gives us S, the average of the Z-transformed prediction score in a 10 kb window around all GSS. In order to measure the signal increase close to the GSS, which we call λ, we compute the average of the curve S over a region of r bp centred on the GSS. A higher value of λ corresponds to a higher signal-to-noise ratio around the GSS.
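As an illustration, a minimal NumPy sketch of this metric (assuming a vector of per-position prediction scores and the list of GSS coordinates for the test chromosome; variable and function names are ours) could be:

    import numpy as np

    def lambda_score(pred, gss_positions, window=10000, r=400):
        # pred: 1D array of prediction scores along the chromosome (one score per bp).
        # gss_positions: coordinates of the annotated GSS on that chromosome.
        z = (pred - pred.mean()) / pred.std()             # genome-wide Z-score Z_g
        half = window // 2
        rows = [z[g - half:g + half] for g in gss_positions
                if g - half >= 0 and g + half <= len(z)]  # keep fully covered windows only
        z_gss = np.vstack(rows)                           # Z_GSS: one row per GSS
        s = z_gss.mean(axis=0)                            # S: average profile around the GSS
        centre = slice(half - r // 2, half + r // 2)      # central r bp of the profile
        return s[centre].mean()                           # lambda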
Table 2. Network architecture of the CNN model. The first column lists the different layers used consecutively in the network. The "Layer shape" column reports the shape of the convolutional kernels, the max-pooling windows and the fully connected layers. The "Output shape" column reports the variation of layer shapes at each step.

Layer name       Layer shape      Output shape
Input            –                4×299×1
Conv2D           32×4×(4×1)       32×296×1
Max-pooling      2×1              32×148×1
Dropout          –                32×148×1
Conv2D           64×32×(4×1)      64×145×1
Max-pooling      2×1              64×72×1
Dropout          –                64×72×1
Conv2D           128×64×(4×1)     128×69×1
Max-pooling      2×1              128×34×1
Dropout          –                128×34×1
Dense            128              128
Dropout          –                128
Dense (sigmoid)  1                1

RESULTS

Training models for genome annotation of GSS
The problem of detecting human GSS using deep neural networks has been tackled by Umarov & Solovyev (2017). We first follow a similar approach and use a balanced dataset (see Methods for details). The model is trained/validated on an equal number of 299 bp-long positive and negative examples and is evaluated on a test set composed of 15% of the original data that was left aside prior to training. The specificity (Sp), the sensitivity (Sn) and the Matthews Correlation Coefficient (MCC; Chicco & Jurman, 2020) (see Supplemental Information 1 for definitions) were found to be similar to the ones found by Umarov & Solovyev (2017), who used a similar approach albeit separating the sample data into TATA-containing GSS and non-TATA GSS (Sp = 0.94, Sn = 0.92 and MCC = 0.86).

In order to assess how this model would perform as a practical tool for detecting GSS on a genome-wide scale, we apply it on all the sequences along chromosome 21 (which was withdrawn from the training phase, i.e., from the training and validation sets), obtained using a 299 bp-long window sliding with an offset of 1 bp. Figure 2A illustrates the predictions of the CNN model over a typical region of 300 kbp containing 7 out of the 480 GSS of chromosome 21. Although the predictions yield higher scores over GSS positions, they also yield high scores over many non-GSS positions, reflecting a low signal-to-noise ratio. This is due to the fact that the class balance seen during training is biased with respect to reality: the CNN model learns from an equal number of examples of the positive and the negative classes (He & Garcia, 2009). Applied over all the 299 bp sequences of chromosome 21, the model encounters many more examples of the negative class and fails to generalise to the new examples.

Figure 2. CNN predictions for two regions of chromosome 21. (A) Prediction scores for the balanced 1* model (Q = 1) and the unbalanced 100* model (Q = 100), respectively in blue and red, on a 300 kb region. The position of genes is indicated below. The annotation track was done using the UWash Epigenome browser (https://epigenomegateway.wustl.edu/). Both models detect 7 GSS positions, but the 1* model returns a higher background signal at non-GSS positions. Adding negative examples using the 100* model mitigates the noise while preserving the high scores over GSS. (B) Application of 30 1* models, trained on different datasets, over a 3.2 kb region of chromosome 21. At each site, the maximum and minimum prediction scores are respectively displayed in black and red. Other prediction scores are plotted in grey. Full-size DOI: 10.7717/peerjcs.278/fig-2

To address this issue and train a network for genome annotation, we propose a heuristic where more negative examples are added into the balanced dataset to reduce the importance of the positive class during training and to allocate more weight to the negative class. We call these augmented datasets limited unbalanced datasets.
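For reference, a minimal Keras sketch of a model following Table 2, trained on such a limited unbalanced dataset with the class weighting described in Methods, could look as follows. This is an approximation, not the released implementation: it is written with Conv1D layers on a 299×4 input, which is equivalent in effect to the 2D formulation of Table 2, and the dropout rate, batch size, dense-layer activation and other hyper-parameters are assumed values (the ones actually used are given in the Supplemental Information).

    from tensorflow.keras import layers, models

    def build_model(dropout=0.2):                       # dropout rate is an assumed value
        model = models.Sequential([
            layers.Input(shape=(299, 4)),               # one hot encoded 299 bp window
            layers.Conv1D(32, 4, activation='relu'),    # 32 kernels of length 4 -> 296 positions
            layers.MaxPooling1D(2),
            layers.Dropout(dropout),
            layers.Conv1D(64, 4, activation='relu'),
            layers.MaxPooling1D(2),
            layers.Dropout(dropout),
            layers.Conv1D(128, 4, activation='relu'),
            layers.MaxPooling1D(2),
            layers.Dropout(dropout),
            layers.Flatten(),
            layers.Dense(128, activation='relu'),
            layers.Dropout(dropout),
            layers.Dense(1, activation='sigmoid'),      # probability that the window contains a GSS
        ])
        model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
        return model

    # Training on a limited unbalanced dataset with negative/positive ratio Q:
    # X_train has shape (N, 299, 4), y_train is 0/1, and the rare GSS class is weighted by Q.
    # model = build_model()
    # model.fit(X_train, y_train, epochs=150, batch_size=512,
    #           validation_split=0.1, class_weight={0: 1.0, 1: Q})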
The parameter Q is the ratio between negative and positive training examples, and we denote as Q* the models trained with the corresponding ratio. For instance, on Fig. 2A the model trained on the balanced data, yielding the blue prediction signal, is denoted as 1*. We train our CNN model on a 100* dataset (Q = 100) and assess the efficiency of the trained model. As depicted on Fig. 2A by a red signal, the predictions for this model display a much higher signal-to-noise ratio, with significant peaks over each of the 7 GSS (C21orf54, IFNAR2, IL10RB, IFNAR1, IFNGR2, TMEM50B, DNAJC28) and a much weaker signal between these sites. Predicting GSS using the 100* model is thus expected to generate fewer false positives than the 1* model, regardless of the value of the threshold used to identify GSS-containing regions. In order to assess how changing the value of Q affects GSS classification, we apply a threshold on the prediction and compute the precision and the recall obtained for both models (i.e., 1* and 100*) at 600 bp resolution on a full chromosome. The precision-recall curves confirmed the detrimental effect of a lower signal-to-noise ratio on the accuracy of the classification (Fig. S1). For the sake of completeness, the performance of more models (1*, 10*, 20*, 30*, 50*, 100*) evaluated using conventional metrics on test sets derived from the initial sample sets can be found in Supplemental Information 1.

Investigating the effect of random selection of the negative examples on predictions
While positive examples are always the same in different sample sets, the negative examples are randomly picked out of the genome. The performance of the model in different regions of chromosome 21 can thus vary for different training sets (Wesolowska-Andersen et al., 2020). To investigate this variation, we set up 30 balanced 1* datasets and train 30 CNNs separately. The 30 models are then applied over human chromosome 21 to study the fluctuations of the predictions. The variation of the 30 predictions is depicted in Fig. 2B. The first observation is that almost all predictions present a peak over the DIP2A GSS. However, the large gap between the minimum and maximum predictions underlines the variability of predictions obtained with different training datasets. This variability illustrates the uncertainty of the predictions obtained from a single CNN trained on a balanced dataset and highlights the need to use limited unbalanced datasets for the task of genome annotation.

Comparing 1* and 100* models over a full chromosome
Models trained on 1* and 100* sets are applied to the full chromosome 21 and the Z-normalized prediction scores around GSS are presented as heat maps. While the 1* model (Fig. 3A) presents a noisy signal around GSS positions, the 100* model (Fig. 3B) presents a higher signal-to-noise ratio. To investigate the performance of different models on a genome-wide scale, we devised a custom metric λ which measures the average signal-to-noise ratio around GSS (see Methods for the definition of λ). Figures 3C, 3D illustrate the average of the Z-score over all the GSS of chromosome 21 for the models 1* and 100*, respectively, and λ denotes the average of this average over an r = 400 bp region centred on the GSS. A larger λ score corresponds to a higher signal-to-noise ratio.
In this particular case, we find a λ score of 2.21 and 5.81 for the 1* and 100* model, respectively. To illustrate the variability of the prediction scores achieved around different GSS, we randomly selected four GSS within the chromosome. The first GSS corresponds to the gene CXADR, shown in Fig. 3E. While the prediction of the 1* model results in low averaged Z-scores over all positions, the averaged Z-score of the 100* model strongly peaks around the GSS position and shows low variations over non-GSS positions. Figure 3F depicts the second selected GSS, corresponding to the KRTAP19-2 gene. This gene is part of a cluster of similar genes belonging to the family of Keratin Associated Proteins (highlighted by a yellow rectangle on Figs. 3A, 3B). For this particular cluster, the predictions are poor for both 1* and 100*, probably reflecting a specific GSS signature that has not been grasped by the model. Another example of a gene cluster with a poor prediction score for GSS is the t-RNA cluster, highlighted in green in Figs. 3A, 3B. Figures 3G and 3H display the predictions around the GSS of the SCAF4 gene and of the PCNT and C21ORF58 genes, respectively. On these more typical GSS, the 100* model shows a higher signal-to-noise ratio than the 1* model and regions containing GSS are detected. These regions often stretch over 1 kb while our training sequence centred on each GSS is only 299 bp long. This could indicate the presence either of alternative GSS close to the annotated GSS or of similar sequence patterns in broader regions surrounding the GSS (Carninci et al., 2006; Sandelin et al., 2007).

Figure 3. Comparison of the 1* and 100* model predictions over chromosome 21. (A) and (B) Heat maps depict the Z-score of the prediction for the 1* and 100* models respectively on 5,000 bp flanking each GSS of chromosome 21. (C) and (D) Averaged Z-score of the predictions over each GSS of chromosome 21. (E–H) Zoom on regions around randomly selected GSS. Genes are indicated at the bottom of each plot. (I–K) Averaged Z-score of the predictions over each GSS of mouse chromosome X (I) and for networks trained on mouse/human chromosomes (except X) and applied on human/mouse chromosome X (J, K). Full-size DOI: 10.7717/peerjcs.278/fig-3

Learning and predicting in human and mouse
To show the potential of our annotation method in a different context, we replicate a similar GSS analysis in mouse. Models with values of Q ranging from 1 to 100 trained on mouse chromosomes (except X) are applied over the mouse chromosome X to assess the model performance (see Fig. 3I, and Figs. S2A, S2D, S2G). The λ score reaches values of 2.24 and 4.90 respectively for the 1* and 100* models, in quantitative agreement with the model performance in human.

Mammals show a substantial degree of homology in the DNA sequence found at GSS (Waterston et al., 2002), and earlier computational models were trained to recognise transcription start sites in any mammalian species (Down & Hubbard, 2002).
This study focused on 313 sequences, of which 50 were kept aside for test purposes; we want here to extend the validity of this initial study to the genome-wide level. Following this line, we determine the possibility of predicting GSS in one organism with a network trained on a related organism. This possibility has previously been shown to be effective for sequence variant calling (Poplin et al., 2018). To this end, the mouse-trained model is applied on the human chromosome X and the human-trained model is applied on the mouse chromosome X. While the two chromosomes carry homologous genes (Waterston et al., 2002), the number of annotated GSS varies, with a total of 4,968 GSS in human and 2,005 GSS in mouse. While the model trained and applied on mouse shows a better signal-to-noise ratio, the same model applied to the human chromosome X still captures most of the GSS and gives a λ score of 5.18 for the 100* model (see Fig. 3J and Figs. S2B, S2E, S2H). Similarly, the models trained on human capture most of the GSS on the mouse X chromosome, as shown in Fig. 3K and Figs. S2C, S2F, S2I, and reach a λ score of 4.32 for the 100* model. In all cases, the signal-to-noise ratio is improved in the 100* models with respect to the 1* models. The human model applied on human provides the highest scores for both 1* and 100* models, probably a signature of an overall better GSS annotation.

Evaluation of the prediction for different GSS classes
The potential of our trained networks to recover GSS-containing regions along the human and mouse genomes is assessed in the previous parts without any distinction between different GSS classes. Since we find that some GSS are better predicted than others (Fig. 3), we compute the λ score independently for the two main classes of GSS: mRNA-GSS and ncRNA-GSS. While λ is higher for the mRNA-GSS class, the model is versatile and is also able to predict the ncRNA-GSS (Fig. 4B). In human and mouse, mRNA-GSS are found in different classes that can be derived from the CpG content of the region flanking the GSS. High-CpG regions, also called "CpG islands", can be methylated and play an important role in gene regulation (Deaton & Bird, 2011). Figure 4A displays the distribution of the CpG number in 299 bp regions surrounding all mRNA-GSS for the mouse and human X chromosomes. From this distribution, we identify three classes of mRNA-GSS with respectively a high, medium and low CpG content. High-CpG GSS correspond to genes regulated by DNA methylation and have been shown to exhibit a different pattern of chromatin modifications (Vavouri & Lehner, 2012). Assessing the performance of the model for the three different classes, we find that better scores are obtained for CpG-richer GSS (Fig. 4B). The worst-performing GSS are those with a low CpG content, which are hardly recovered by our model. In order to test whether CpG content alone could be used to predict GSS, we computed the λ score over all GSS using the Z-normalized CpG content as a predictor. We get values of 1.30 and 0.92 respectively for the human and mouse GSS, indicating that the CpG content is a strong indicator of the presence of GSS but that our models use other features as well, which allows them to reach much higher scores.
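As a sketch of this baseline (the 299 bp window follows the text; the sliding-window implementation and function name are ours), the CpG-content predictor simply counts CG di-nucleotides in a window centred on each position and Z-normalises the resulting track, which can then be fed to the same λ computation as the CNN predictions:

    import numpy as np

    def cpg_content_track(chromosome_seq, window=299):
        # Count CG di-nucleotides in a sliding window centred on each genomic position.
        seq = chromosome_seq.upper()
        is_cpg = np.array([seq[i:i + 2] == 'CG' for i in range(len(seq))], dtype=float)
        half = window // 2
        csum = np.concatenate(([0.0], np.cumsum(is_cpg)))
        counts = np.array([csum[min(i + half + 1, len(seq))] - csum[max(i - half, 0)]
                           for i in range(len(seq))])
        return (counts - counts.mean()) / counts.std()   # Z-normalised CpG content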
Figure 4. Evaluation of the model performance for different classes of genes. (A) and (B) CpG number in 299 bp regions centred on mRNA-GSS in X chromosomes for human (A) and mouse (B). These regions were divided into three groups of similar size according to their CpG number: low, medium and high (the bounds are 35% and 60% for human and 30% and 60% for mouse). The proportion of genes in each class is similar on the X chromosome (test set) and on the other chromosomes (training and validation sets). (C) Lambda values computed for networks trained on each species' non-X-chromosome GSS (t) and predicted on either species' X-chromosome GSS (p). Lambda values for each mRNA-CpG sub-group and for ncRNA genes are also shown to highlight different levels of performance. Full-size DOI: 10.7717/peerjcs.278/fig-4

Application of the approach to other vertebrates
The ability of a CNN trained on human GSS to recover mouse GSS is not surprising given the similarity between their genomes (Waterston et al., 2002). We next set out to apply the same methodology to more diverse species, including chicken and zebrafish (Fig. 5). Four CNNs were trained on all the GSS from the genomes of Homo sapiens (human), Mus musculus (mouse), Gallus gallus (chicken) and Danio rerio (zebrafish). G. gallus and D. rerio are model organisms and, together with H. sapiens and M. musculus, provide the most comprehensive GSS annotations for mammals, birds and fishes.

Figure 5. Lambda scores obtained with CNNs trained on four different species: human, mouse, chicken and zebrafish. Lambda scores are computed from predictions done on GSS of (A) human, (B) mouse, (C) chicken and (D) zebrafish chromosomes. Full-size DOI: 10.7717/peerjcs.278/fig-5

These four CNNs were then applied genome-wide on each of the four species and the λ metric is computed for each chromosome independently, using an r value of 400 bp (see Methods). The results for the human and mouse genomes are very similar, with only a slightly better performance when the model trained on a species is applied on the same species. The model trained on the chicken genome performs less well when applied on the mammalian genomes, and the model trained on the zebrafish genome is not able to recover the mammalian GSS, as shown by a λ value of 0. When applied on the chicken genome, the mouse and human models surprisingly outperform the chicken model, probably because the GSS annotation is better in the two mammals, so that the training phase is more efficient. This result highlights the potential of the method when used across different species when the genome of one species is more precisely annotated. When applied on the zebrafish genome, on the other hand, the human, mouse and chicken models all show poor performances while the zebrafish model performs well. This is in line with the fact that the CpG composition of zebrafish regions around GSS is very different from that in chicken and mammals.
CpG islands, which are high-density CpG regions, are found upstream of many GSS of coding genes in chicken and mammals, while they are less abundant in the zebrafish genome, which has a low GC content (Han et al., 2008). Altogether, these results suggest that the molecular machinery that interprets the genome sequence in order to find the start sites of genes has a similar specificity in human, mouse and chicken, but a different specificity in zebrafish.

CONCLUSIONS
With the surge of DNA sequencing technologies, over a million genome datasets are now available and petabases of transcripts are sequenced every year to annotate these datasets with functional marks (Wainberg et al., 2018). It has not escaped the notice of many computational biologists that deep neural networks are a key tool to deal with this exponentially increasing amount of data (Wainberg et al., 2018). One possible application is to leverage datasets with good annotations in order to train neural networks and to predict annotations on other datasets. One of the practical issues when applying neural networks on genomic sequences is unbalanced data, a well-known issue in the machine learning literature (He & Garcia, 2009; Chawla, Japkowicz & Kotcz, 2004; Batista, Prati & Monard, 2004). In the present paper, we address this problem using GSS as a case study. Indeed, GSS occupy only a few locations on the genome (31,037 GSS for human), leading to extreme imbalances in datasets (i.e., the ratio of GSS-containing 299 bp windows to non-GSS windows in the human genome is 1/400). In this case, the lack of examples of the minority class (i.e., true GSS) impacts the learning process, as conventional machine learning algorithms usually measure the model performance on the majority class (i.e., non-GSS), leading to biased or inaccurate prediction of the minority class. To deal with this disparity, we adopt a weighting strategy to decrease the importance of the majority class samples (non-GSS) during the learning process, thereby improving the identification of the rare minority class samples (GSS). Using this approach, which we call "limited unbalanced datasets", we show that learning on imbalanced datasets can be performed effectively, and that for GSS recognition, a ratio of 1 to 100 positive over negative examples is usually sufficient to achieve a good signal-to-noise ratio in the prediction. This approach can be easily extended to identify other functional regions in any annotated genome.

We also show that our method can be efficiently used across genomes of different species, i.e., training the model on one genome and applying it to another genome. We use the X chromosomes of human and mouse as a case study, and apply models trained on each species' other chromosomes to its own and to the other species' X chromosome. While the sequence of this chromosome has evolved differently in both species, many genes are homologous (Sinha & Meller, 2007). The fact that we are able to recover GSS in mouse/human with a model trained on the other organism suggests that the machinery capable of recognising GSS in each organism is overall conserved. We also show that this methodology can be applied to more distant species, and use as examples chicken and zebrafish.
Our results point toward a higher similarity between mammals and chicken, while zebrafish GSS cannot be reliably predicted with models trained on mammal and chicken sequences. While genome sequence conservation can be computed directly from DNA sequences, further developments of our method may provide a new tool to quantify more complex patterns of similarity between the nuclear machineries of different organisms that interpret DNA sequences in vivo.

ACKNOWLEDGEMENTS
We would like to thank Léopold Carron for helping us with datasets, Hugues Roest Croeluis for discussions, Michel Quaggetto for technical support and Annick Lesne for comments on the manuscript. We also wish to thank our editor James Procter and the two anonymous referees for their invaluable work.

ADDITIONAL INFORMATION AND DECLARATIONS

Funding
This work was supported by the Agence Nationale pour la Recherche [HiResBac ANR-15-CE11-0023-03]. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Grant Disclosures
The following grant information was disclosed by the authors:
Agence Nationale pour la Recherche [HiResBac ANR-15-CE11-0023-03].

Competing Interests
The authors declare there are no competing interests.

Author Contributions
• Ghazaleh Khodabandelou conceived and designed the experiments, performed the experiments, analyzed the data, performed the computation work, prepared figures and/or tables, authored or reviewed drafts of the paper, and approved the final draft.
• Etienne Routhier performed the experiments, performed the computation work, prepared figures and/or tables, and approved the final draft.
• Julien Mozziconacci conceived and designed the experiments, analyzed the data, authored or reviewed drafts of the paper, and approved the final draft.

Data Availability
The following information was supplied regarding data availability:
Code is available at GitHub: https://github.com/StudyTSS/DeepTSS.

Supplemental Information
Supplemental information for this article can be found online at http://dx.doi.org/10.7717/peerj-cs.278#supplemental-information.

REFERENCES
Abadi M, Agarwal A, Barham P. 2015. TensorFlow: large-scale machine learning on heterogeneous systems. Available at https://www.tensorflow.org/.
Alipanahi B, Delong A, Weirauch MT, Frey BJ. 2015. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nature Biotechnology 33(8):831–838 DOI 10.1038/nbt.3300.
Angermueller C, Pärnamaa T, Parts L, Stegle O. 2016. Deep learning for computational biology. Molecular Systems Biology 12(7):878–884 DOI 10.15252/msb.20156651.
Batista GE, Prati RC, Monard MC. 2004. A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explorations Newsletter 6(1):20–29 DOI 10.1145/1007730.1007735.
Carninci P, Sandelin A, Lenhard B, Katayama S, Shimokawa K, Ponjavic J, Semple CA, Taylor MS, Engström PG, Frith MC. 2006. Genome-wide analysis of mammalian promoter architecture and evolution. Nature Genetics 38(6):626–635 DOI 10.1038/ng1789.
Chawla NV, Japkowicz N, Kotcz A. 2004. Special issue on learning from imbalanced data sets. ACM SIGKDD Explorations Newsletter 6(1):1–6.
Chicco D, Jurman G. 2020. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics 21(1):6.
Ching T, Himmelstein DS, Beaulieu-Jones BK, Kalinin AA, Do BT, Way GP, Ferrero E, Agapow PM, Zietz M, Hoffman MM. 2018. Opportunities and obstacles for deep learning in biology and medicine. Journal of the Royal Society Interface 15(141):1–47 DOI 10.1098/rsif.2017.0387.
Chollet F. 2015. Keras. Available at https://keras.io.
Deaton AM, Bird A. 2011. CpG islands and the regulation of transcription. Genes & Development 25(10):1010–1022 DOI 10.1101/gad.2037511.
Down TA, Hubbard TJ. 2002. Computational detection and location of transcription start sites in mammalian genomic DNA. Genome Research 12(3):458–461 DOI 10.1101/gr.216102.
Durham TJ, Libbrecht MW, Howbert JJ, Bilmes J, Noble WS. 2018. PREDICTD parallel epigenomics data imputation with cloud-based tensor decomposition. Nature Communications 9(1):1–15 DOI 10.1038/s41467-017-02088-w.
ENCODE Project Consortium. 2012. An integrated encyclopedia of DNA elements in the human genome. Nature 489(7414):57–74 DOI 10.1038/nature11247.
Georgakilas GK, Perdikopanis N, Hatzigeorgiou A. 2020. Solving the transcription start site identification problem with ADAPT-CAGE: a machine learning algorithm for the analysis of CAGE data. Scientific Reports 10(1):1–12 DOI 10.1038/s41598-019-56847-4.
Goodfellow I, Bengio Y, Courville A. 2016. Deep learning. Cambridge: MIT Press.
Han L, Su B, Li WH, Zhao Z. 2008. CpG island density and its correlations with genomic features in mammalian genomes. Genome Biology 9(5):1–12 DOI 10.1186/gb-2008-9-5-r79.
He H, Garcia EA. 2009. Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering 21(9):1263–1284 DOI 10.1109/TKDE.2008.239.
Jaganathan K, Panagiotopoulou SK, McRae JF, Darbandi SF, Knowles D, Li YI, Kosmicki JA, Arbelaez J, Cui W, Schwartz GB. 2019. Predicting splicing from primary sequence with deep learning. Cell 176(3):535–548.
Kelley DR, Reshef Y, Bileschi M, Belanger D, McLean CY, Snoek J. 2018. Sequential regulatory activity prediction across chromosomes with convolutional neural networks. Genome Research 28(5):739–750.
Kelley DR, Snoek J, Rinn JL. 2016. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Research 26(7):990–999 DOI 10.1101/gr.200535.115.
Kingma DP, Ba J. 2014. Adam: a method for stochastic optimization. ArXiv preprint. arXiv:1412.6980.
Kreyszig E. 2009. Advanced engineering mathematics. 10th edition. Wiley.
Kugel JF, Goodrich JA. 2017. Finding the start site: redefining the human initiator element. Genes & Development 31(1):1–2 DOI 10.1101/gad.295980.117.
Kundaje A, Meuleman W, Ernst J, Bilenky M, Yen A, Heravi-Moussavi A, Kheradpour P, Zhang Z, Wang J, Ziller MJ. 2015. Integrative analysis of 111 reference human epigenomes. Nature 518(7539):317–330 DOI 10.1038/nature14248.
Leung MKK, Xiong HY, Lee LJ, Frey BJ. 2014. Deep learning of the tissue-regulated splicing code. Bioinformatics 30(12):i121–i129 DOI 10.1093/bioinformatics/btu277.
Min X, Chen N, Chen T, Jiang R. 2016. DeepEnhancer: predicting enhancers by convolutional neural networks. In: 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE, 637–644.
Pachganov S, Murtazalieva K, Zarubin A, Sokolov D, Chartier DR, Tatarinova TV. 2019. TransPrise: a novel machine learning approach for eukaryotic promoter prediction. PeerJ 7:e7990 DOI 10.7717/peerj.7990.
Poplin R, Chang PC, Alexander D, Schwartz S, Colthurst T, Ku A, Newburger D, Dijamco J, Nguyen N, Afshar PT. 2018. A universal SNP and small-indel variant caller using deep neural networks. Nature Biotechnology 36(10):983–987 DOI 10.1038/nbt.4235.
Rivera CM, Ren B. 2013. Mapping human epigenomes. Cell 155(1):39–55 DOI 10.1016/j.cell.2013.09.011.
Rumelhart DE, Hinton GE, Williams RJ. 1986. Learning representations by back-propagating errors. Nature 323(6088):533–536 DOI 10.1038/323533a0.
Sandelin A, Carninci P, Lenhard B, Ponjavic J, Hayashizaki Y, Hume DA. 2007. Mammalian RNA polymerase II core promoters: insights from genome-wide studies. Nature Reviews Genetics 8(6):424–436.
Sinha AU, Meller J. 2007. Cinteny: flexible analysis and visualization of synteny and genome rearrangements in multiple organisms. BMC Bioinformatics 8(1):82 DOI 10.1186/1471-2105-8-82.
Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R. 2014. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15(1):1929–1958.
Stein L. 2001. Genome annotation: from sequence to biology. Nature Reviews Genetics 2(7):493–503.
Umarov RK, Solovyev VV. 2017. Recognition of prokaryotic and eukaryotic promoters using convolutional deep learning neural networks. PLOS ONE 12(2):e0171410 DOI 10.1371/journal.pone.0171410.
Vavouri T, Lehner B. 2012. Human genes with CpG island promoters have a distinct transcription-associated chromatin organization. Genome Biology 13(11):1–12 DOI 10.1186/gb-2012-13-11-r110.
Wainberg M, Merico D, Delong A, Frey BJ. 2018. Deep learning in biomedicine. Nature Biotechnology 36(9):829 DOI 10.1038/nbt.4233.
Waterston RH, Lindblad-Toh K, Birney E, Rogers J, Abril JF, Agarwal P, Agarwala R, Ainscough R, Alexandersson M, An P. 2002. Initial sequencing and comparative analysis of the mouse genome. Nature 420(6915):520–562 DOI 10.1038/nature01262.
Wesolowska-Andersen A, Yu GZ, Nylander V, Abaitua F, Thurner M, Torres JM, Mahajan A, Gloyn AL, McCarthy MI. 2020. Deep learning models predict regulatory variants in pancreatic islets and refine type 2 diabetes association signals. eLife 9:e51503 DOI 10.7554/eLife.51503.
Zhou J, Troyanskaya OG. 2015. Predicting effects of noncoding variants with deep learning–based sequence model. Nature Methods 12(10):931–934 DOI 10.1038/nmeth.3547.
Zou J, Huss M, Abid A, Mohammadi P, Torkamani A, Telenti A. 2019. A primer on deep learning in genomics. Nature Genetics 51(1):12–18.