key: cord-0257241-5si31cwb
authors: Katsman, Efrat; Orlanski, Shari; Martignano, Filippo; Fox-Fisher, Ilana; Shemer, Ruth; Dor, Yuval; Zick, Aviad; Eden, Amir; Petrini, Iacopo; Conticello, Silvestro G.; Berman, Benjamin P.
title: Detecting cell-of-origin and cancer-specific methylation features of cell-free DNA from Nanopore sequencing
date: 2022-04-12
journal: bioRxiv
DOI: 10.1101/2021.10.18.464684
sha: 22500bbfba87caf6ade2cf24263af0fc5de1add7
doc_id: 257241
cord_uid: 5si31cwb

The Oxford Nanopore (ONT) platform provides portable and rapid genome sequencing, and its ability to natively profile DNA methylation without complex sample processing is attractive for clinical sequencing. We recently demonstrated ONT shallow whole-genome sequencing to detect copy number alterations (CNA) from the circulating tumor DNA (ctDNA) of cancer patients. Here, we show that cell-type and cancer-specific methylation changes can also be detected, as well as cancer-associated fragmentation signatures. This feasibility study suggests that ONT shallow WGS could be a powerful tool for liquid biopsy, especially real-time medical applications.

Circulating cell-free DNA (cfDNA) can reveal informative features of its tissue of origin including 35 somatic genome alterations, DNA modifications, and cell-type specific fragmentation patterns 36 (1) . DNA methylation is a promising cfDNA biomarker and is in widespread testing as a cancer 37 screening tool (2). DNA methylation can also be used to detect turnover of damaged cells in 38

time-sensitive conditions such as myocardial infarction, sepsis, and COVID-19 (3-6). These 39 studies used bisulfite-based approaches to profile methylation, but alternative approaches 40 include immunoprecipitation (7) and enzymatic conversion (8) techniques. 41

Accurate calling of DNA methylation from native DNA Oxford Nanopore (ONT) sequencing has 42 matured and now produces single base-pair resolution results highly similar to bisulfite 43 sequencing (9, 10). The ONT platform is portable and has a low cost of setup, and a rapid 44 sequencing workflow that can enable real-time medicine (11, 12) . Native ONT sequencing 45 requires no complex sample processing steps and no PCR amplification, making it attractive for 46 clinical tests. Bisulfite approaches in particular involve significant degradation and loss of input 47 material. For these reasons, whole-genome sequencing (WGS) using the ONT platform is 48 appealing relative to other whole-genome approaches. 49 Due to their low cost, targeted bisulfite PCR (including multiplexed NGS versions (13)) are also 50 popular for clinical methylation sequencing, and these would not require the native modification 51 calling capability of Nanopore. However, shallow whole-genome approaches that capture 52 multiple genomic features could potentially be more informative, especially with regard to the 53 course of the disease (8, 14, 15) . 54 ONT has primarily been used for long-read sequencing, but recent work by our group and 55 others has shown that it can be adapted for short fragments without additional processing steps 56 (16-18). As an added benefit, the ability to capture much longer cfDNA fragments than short-57 read sequencing may lead to new discoveries or biomarkers, as was demonstrated recently in 58 the case of longer fragments during pregnancy (19) . 59

Here, we perform a feasibility study of ONT sequencing for circulating tumor DNA (ctDNA) 60 detection by comparing methylation and several fragmentation features to matched Illumina 61 samples and comparable Illumina-based datasets. While our sample size is too small to 62 determine the limits of sensitivity and specificity, we find that both cell-type specific and cancer-63 specific features can be reliably detected in most of our samples. The results provide 64 confidence for pursuing this approach in larger studies. 65 66

Estimating cell type fractions from cfNano 68

We first performed cell type deconvolution of healthy plasma cfDNA using DNA methylation 69 data from either published WGBS datasets or our cfNano samples. For the external WGBS 70 datasets, we used the methylation fractions (beta values) that were provided in the published 71 data files. For our cfNano, we performed direct modification calling using the Megalodon 72 software provided by ONT (https://github.com/nanoporetech/megalodon). To perform 73 deconvolution, we used 1,000-2,000 marker CpGs per cell type based on a previously published 74 atlas of purified cell types ("MethAtlas", (5, 13)), and estimated cell type fractions using Non-75

Negative Least Squares (NNLS) regression as described in (5). In order to better understand 76 the impact of the relatively low sequencing depth of our cfNano samples (~0.2x genome 77 coverage), we first performed deconvolution of all samples using downsampling experiments 78 starting with full sequence depth down to 0.0001x genome coverage ( Figure 1A and 79 Supplementary Figures 1-3) . Healthy plasma WGBS samples were taken from a recent study of 80 50-100x genomic coverage ((13), Figure 1A left "Fox-Fisher" samples), and another WGBS 81 study with 0.5-1x coverage ((20) Figure 1A middle "Nguyen" samples). Finally, healthy cfNano 82 samples were analyzed ( Figure 1A right "this study"). From full depth down to 0.2x (about 2.5M 83 aligned fragments), all samples were dominated by the expected cell types: monocytes, 84 lymphocytes, megakaryocytes, neutrophils/granulocytes, and sometimes hepatocytes (5). Cell 85 type proportions became significantly degraded at 0.05x coverage and below (corresponding to 86 less than 700,000 aligned fragments). The common cell types were consistently found across 87 the 23 healthy individuals in the Fox-Fisher dataset, the 3 healthy individuals in the Nguyen 88 dataset, and the 7 healthy individuals in the cfNano dataset, both at full depth ( Figure 1B ) and 89 when downsampled to 0.2x depth ( Figure 1C ). The same was found when cell type groups such 90 as lymphocytes were broken down into the 25 individual types (Supplementary Figure 4A -C). 91

Notably, a slight epithelial fraction was identified in some of the Fox-Fisher samples at 0.2x 92 which did not appear at full 80x depth, suggesting a small but measurable amount of noise at 93 the 0.2x coverage level. 94

The healthy cfNano individuals were divided into two groups based on source site, with one 95 being collected and sequenced in Italy ("BC" samples) and one in Israel ("HU" samples). 96

Despite the HU samples being lower coverage (two were between 0.10-0.15x depth), they 97 displayed relatively similar cell type proportions ( Figure 1B -C and Supplementary Figure 3) . 98

In addition to healthy individuals, the Nguyen WGBS dataset and our cfNano dataset also 99 contained individuals being treated for lung adenocarcinoma (LuAd), marked as "LuAd" in 100 Figure 1B -C. In the Nguyen WGBS study, samples were collected at the time of acquired 101 resistance to EGFR-inhibitors, and were divided into those that acquired resistance mutations in 102 EGFR itself (labeled "on" for on-target) vs. those that acquired amplifications in alternative 103 oncogenes MET/ERBB2 (labeled "off" for off-target). The epithelial cell fraction was much higher 104 in the on-target patients, while the off-target patients had very low or no epithelial fraction 105 ( Figure 1B) , consistent with the absence of CNAs in the off-target samples in the original study 106

(20). The 6 LuAd samples in our cfNano study had similarly high epithelial fraction ( Figure 1B ), 107

which was significantly higher than in the healthy patients (p=0.004). In all WGBS and cfNano 108 samples, full depth results were highly similar to 0.2x downsampled results ( Figure 1C and 109 Supplementary Figures 1-3) . Interestingly, while the Nguyen et al. study interpreted the normal-110  like methylation levels of the "off-target" tumors as a difference in cancer methylation patterns,  111 our deconvolution results strongly suggest that it is due to the relative amount of cancer DNA 112 circulating in the blood. 113

The fraction of cancer cells in cfDNA ("tumor fraction") can be estimated from somatic copy 114 number alterations (CNAs) using the ichorCNA tool (21), for cancers that contain a sufficient 115 degree of aneuploidy. We estimated tumor fraction for our cfNano samples and four matched 116

Illumina WGS samples from LuAd patients ( Figure 1D , Supplementary Table 1, and  117 Supplementary Data File 1). While the Illumina samples were sequenced at significantly higher 118 depth (median 1.3x), the tumor fraction estimates were highly similar between cfNano and 119

Illumina sequencing ( Figure 1E ). Interestingly, the ichorCNA tumor fractions were more similar 120

to the high-depth Illumina samples than the Illumina samples themselves were when 121 downsampled to the same depth as the cfNano samples (Supplementary Figure 5A) . 122

To compare ichorCNA tumor fraction estimates to methylation-based estimates, we designed a 123 "two-component" deconvolution method based on NNLS regression that used 2,253 CpGs with 124 differential methylation between sorted lung epithelia and healthy plasma. This was based on 125 the same array-based MethAtlas samples from (5) as the full deconvolution ( Figure 1F ). 330-126

1,526 of these CpGs were covered by each cfNano sample (Supplementary Table 1), which  127 were the CpGs used for NNLS deconvolution. These DNA methylation based estimates of lung 128 fraction and the ichorCNA estimates of cancer cell fraction were largely in agreement ( Figure  129 1F, bottom), with all the six LuAd samples having significantly higher lung fraction compared to 130 the seven healthy plasma samples (p=0.003). Two LuAd cases were markedly higher in the 131 methylation-based than the CNA-based estimate (BC09 and BC10). While we have no 132 independent data to determine which was the more accurate estimate, we hypothesize that the 133 discrepancy may be due to either whole-genome doubling (WGD) events that are not detected 134 by ichorCNA (WGD occurs in 297/503 or 59% of LuAd tumors from the TCGA project (22)), or 135 damage to normal lung cells surrounding the tumor which die and shed their DNA into 136 circulation (23). 137

To verify the robustness of methylation-based deconvolution, we used a mutually exclusive set 138 of 13,770 CpGs that could distinguish TCGA LuAd tumors from healthy plasma, but were not 139 found in the normal lung epithelia set ( Figure 1H ). Before applying the NNLS regression, we 140 corrected the methylation levels of the TCGA LuAd samples to correct for their degree of non-141 cancer contamination ("purity correction"), since most TCGA LuAd samples contain a significant 142 fraction of leukocytes. After this correction, the tumor fraction estimates of our cfNano samples 143

were highly similar to those based on normal-lung specific CpGs ( Figure 1H , bottom), despite 144 the fact that the two CpG sets were completely non-overlapping. One HU healthy sample 145 (HU005.10) had a higher lung fraction estimate than one of the cancer samples, possibly 146 because this was the cfNano sample with the lowest sequencing coverage (0.11x). However, 147 the methylation-based tumor fraction was still significantly higher in the LuAd samples than 148 healthy controls (p=0.004). 149

We performed all deconvolution analysis using a second, and older, base modification caller 150 (DeepSignal, (24)). While Megalodon called 10-20% more CpGs, the majority of CpGs called 151 were in common between the two methods and had identical methylation states (Supplementary 152 Figure 6A ). Both the full cell-type deconvolution (Supplementary Figure 6B ) and the two-153 component deconvolution (Supplementary Figure 6C) were highly similar between the two 154 callers. 155 156 Genomic context of DNA methylation changes detected using cfNano 157

The deconvolution analysis above was based on unannotated differentially methylated regions. 158

In order to investigate the genomic context of lung cancer specific DNA methylation, we 159 analyzed one hypomethylation feature associated with cell of origin (lineage-specific 160 transcription factor binding sites) and one associated specifically with transformation (global 161 hypomethylation). For the TFBS analysis, we identified 5,974 predicted TFBS that were specific 162 to lung epithelia based on a single-cell ATAC-seq atlas of open chromatin within lung and other 163 primary human tissues (25). In that study, adult lung tissues from multiple donors contained a 164 strong cluster of lung pneumocyte-specific open chromatin regions ("Pal" cluster). This cluster 165 was most strongly enriched for the binding motif for the transcription factor NKX2-1, which is a 166 master regulatory transcription factor in this cell type (26). NKX2-1 activity is also known to be 167

highly restricted to this cell type (27), and NKX2-1 binding sites were also the most enriched 168 within lung adenocarcinoma ATAC-seq sites in an independent study (28) . Because open 169 chromatin regions are almost universally hypomethylated, we hypothesized that the 5,974 170

predicted NKX2-1 TFBS in lung pneumocytes would be specifically hypomethylated in healthy 171 lung tissues and in lung tumors. We confirmed this using WGBS data from TCGA (29) 172

(Supplementary Figure 7A) . 173

We next plotted plasma cfDNA methylation levels at these same predicted NKX2-1 sites from 174 the published Illumina WGBS studies and our cfNano study ( Figure 2A ). In healthy samples 175 from three WGBS studies and our own cfNano samples, NKX2-1 sites were fully methylated. In 176 contrast, the LuAd samples from both the Nguyen et al. WGBS study (Figure 2A , middle) and 177 our cfNano study (Figure 2A , right) had substantial demethylation. In both studies demethylation 178 could only be observed in the higher tumor fraction samples ("on-target" samples in the WGBS 179 study, and samples with ichorCNA TF>0.15 in the cfNano study). As a negative control, we 180 selected predicted TFBS from a cell type not expected to be found either in healthy plasma or 181

LuAd. We used the adrenal cortical cluster ("Adc" cluster) from (25), which was highly enriched 182

for the KLF5 binding motif. These sites were fully methylated in plasma samples from both 183 healthy and LuAd individuals (Supplementary Figure 7B ). cfNano profiles were nearly identical 184

using DeepSignal methylation calling (Supplementary Figure 7C) . 185

Global DNA hypomethylation is one of the hallmarks of the cancer epigenome. It has long been 186

proposed as a general marker for circulating tumor DNA (30), and this was recently verified for 187 lung cancer using shallow plasma cfDNA WGBS (20). We have shown that this "global" 188 hypomethylation is not completely global, and occurs preferentially within large domains called 189

Partially Methylated Domains (PMDs) and specifically at CpGs with a local sequence context 190 termed "solo-WCGW" (29). We replotted WGBS methylation data from TCGA normal lung and 191 lung tumor tissues, showing a typical chromosome arm where strong hypomethylation occurs 192 within the PMD regions identified in (29) in the cancer samples ( Figure 2B , top). In our cfNano 193 samples, strong hypomethylation was also found exclusively in the cancer samples in the same 194 PMD regions ( Figure 2B , bottom). To quantify this genome-wide, we plotted the methylation 195 change (relative to the sample-specific whole-genome average) of PMD solo-WCGW CpGs 196 within each 10 Mbp genomic bin that overlapped a common PMD region from (29) ( Figure 2C ). 197

As expected, five of the six cancer samples were significantly hypomethylated relative to the 198 healthy controls (p<0.0001). Overall, there was significantly more hypomethylation across all 199 cancer sample bins (mean=-0.10,SD=0.07) than across all healthy sample bins (mean=-200 0.05,SD=0.04), corresponding to a p-value<2.2e-16 by one-sided Wilcoxon test. In the final 201

LuAd case (BC11), no PMD hypomethylation could be detected ( Figure 2C ). This is not 202 surprising given the high degree of variability associated with global hypomethylation in cancer 203

(29), a process that is not entirely understood but is affected both passively through mitotic 204 divisions as well as actively by dysregulation of several chromatin modifiers (31, 32). 205

Reasoning that copy number altered regions would have skewed proportions of tumor-derived 206 DNA and thus different levels of PMD hypomethylation, we divided the PMD bins based on the 207 copy number status of each sample from ichorCNA. In four of the five cases with significant 208 hypomethylation overall, the amplified bins had significantly more hypomethylation than diploid 209 regions ( Figure 2D ). In the one remaining case (BC09), there were not enough PMDs with 210

CNAs for an accurate measurement. Conversely, deleted showed significantly less 211 hypomethylation than diploid regions, although this trend only reached statistical significance in 212 two cases. In the future, the combined analysis of CNAs and global hypomethylation may 213 provide a stronger cancer-specific signal than each feature alone. PMD hypomethylation profiles 214

were nearly identical using DeepSignal methylation calling (Supplementary Figure 7D Cell-free DNA circulates primarily as mononucleosomal fragments, and mapping the positions of 218 these mononucleosomes can be used to identify cell-type specific positioning (reviewed in (1)).

CTCF binding sites provide a good test of whether these signals are detectable, since they eject 220 a central nucleosome and position 10 phased nucleosomes on either side of their binding site 221 (33) ( Figure 3A ). Around a set of 9,780 CTCF binding sites, cfNano mononucleosome locations 222 recapitulated this expected pattern ( Figure 3B , top), which was identical to the pattern based on 223

matched Illumina WGS of greater sequence depth ( Figure 3B , bottom). These were also 224 identical when both cfNano and Illumina libraries were downsampled to an equal number of 2M 225 fragments ( Figure 3C ). The CTCF binding site also demethylates CpGs located approximately 226

200bp of on either side (33), and this DNA methylation pattern was also recapitulated in our 227 cfNano samples (Supplementary Figure 8) . 228

We tried the same mononucleosome mapping approach for the 5,974 lung-specific NKX2-1 229 TFBS from Figure 2A . While the demethylation signal was detectable (Figure 2A ), we could not 230 detect any mononucleosome positioning signal (data not shown). Lung-specific nucleosome 231 positions would only be present on a fraction of the fragments, so the signal from these 232 fragments may be masked by those from non-lung cell types. But given that the inherent 233 nucleosome positioning information is present (as shown by the CTCF example), more 234 advanced normalization and quantification techniques may reveal these cell-type specific 235 fragments more sensitively in the future. 236 237

Specific fragment lengths have been associated with cancer-derived cfDNA fragments 239 (reviewed in (1)) and these have been used as accurate cancer classifiers (15). Specifically, 240

shorter mononucleosome fragments (<150bp) tend to be enriched for cancer-derived fragments 241

(34). Density plots of fragment length showed that our cfNano cancer samples were enriched in 242 these short mononucleosome fragments relative to healthy controls ( Figure 4A ). We used the 243 definition from (34) and (15) to calculate the ratio of short mononucleosomes (100-150bp) to all 244 mononucleosomes (100-220bp). The short mononucleosome ratio was significantly higher in 245 the high tumor fraction cancer cases (mean=0.24, SD=0.03) than in the healthy cases 246 (mean=0.16, SD=0.03), which corresponded to a t-test p-value of p=0.038 ( Figure 4B ). We 247 compared these ratios calculated from our cfNano libraries with those calculated from the 248 matched Illumina WGS libraries which were available for four of the six cancer samples, and the 249 two library types were strongly correlated ( Figure 4C ). We hypothesized that improvements to 250

Nanopore basecalling could improve alignment and adapter trimming, so we also compared 251

base and all dinucleosomes (275-400bp). The short dinucleosome ratio showed even more 260 separation between cancer vs. healthy cfNano samples than the mononucleosome ratio, with 261 the high tumor fraction cancer cases (mean=0.62, SD=0.01) than in the healthy cases 262 (mean=0.30, SD=0.04), corresponding to a t-test p-value of p=2E-7 ( Figure 4D ). We compared 263 dinucleosome ratios calculated from our cfNano libraries with those calculated from the matched 264

Illumina WGS libraries, and they were nearly perfectly correlated ( Figure 4E ). When we looked 265 across all samples, the short mononucleosome ratio was highly correlated with the short 266 dinucleosome ratio (Supplementary Figure 9D ). Interestingly, this correlation held across the 267 healthy samples as well as the cancer samples, indicating that the same underlying mechanism 268 affects circulating cfDNA from both cancer and non-cancer cell types. 

The four bases immediately flanking cfDNA fragmentation sites have biased sequence 279 composition that differs between cancer-derived and non-cancer-derived fragments (35, 36). To 280 study this in our cfNano samples, we first plotted the 25 most abundant 4-mer end motifs that 281

were identified in an Illumina-based study of healthy plasma cfDNA (36), using a heatmap to 282 indicate motif frequencies in each of our cfNano and matched Illumina samples ( Figure 4F ). 283

There was broad agreement across all samples, although some differences between cfNano 284

and Illumina libraries were clearly noticeable. When we plotted average Nanopore vs. Illumina 285 frequencies for all 256 possible 4-mers, it appeared that the less abundant motifs had slightly 286 higher frequencies in Nanopore, while the more abundant motifs had slightly lower frequencies 287

in Nanopore ( Figure 4G ). Nevertheless, the relative frequencies were highly concordant overall 288 (PCC=0.97). These were slightly more concordant when we used the 2022 "high accuracy" 289

( Of particular interest is the CCCA end motif, which is typically the most abundant 4-mer in 294 healthy plasma and its reduction was shown to be a cancer marker in several cancer types 295

including lung cancer (35, 36). CCCA indeed has the highest frequency across all our cfNano 296

and Illumina WGS samples ( Figure 4F -H), and was significantly lower in our three high tumor 297 fraction cancer samples than the healthy samples ( Figure 4I ). However, there was a clear 298 difference between the healthy samples generated in the "HU" and "ISPRO" batches, which we 299 presume to be technical since these two batches behaved similarly with respect to fragment 300 length and methylation features. We therefore only did a direct statistical comparison within the 301 ISPRO batch, and indeed CCCA frequency in high TF tumors (mean=1.6,SD=0.06) was 302 significantly lower than in ISPRO healthy samples (mean=1.9,SD=0.13), leading to a t-test p-303 value=0.007 ( Figure 4I ). 304

We found additional 16 motifs that were as significant as CCCA, although none survived FDR 305 adjustment and so will have to be validated in larger studies (Supplementary Table 3 ). Like 306 fragment lengths, end motif frequencies were not sensitive to downsampling to 2M fragments 307 (Supplementary Figure 10A-D) . Additionally, the relative frequencies of four cfNano cancer 308 samples were not concordant with their matched Illumina WGS libraries ( Figure 4J ). We 309 conclude from this that end motifs are particularly sensitive to changes in library strategy and 310 sequencing platform, and caution must be taken when comparing across multiple batches. This 311

is not surprising, given that fragment representation can be skewed by a number of variables 312 during library construction and amplification, as well as sequencing errors and downstream 313 bioinformatic steps such as adapter trimming (in our cfNano processing, we also exclude 314 fragment ends that are soft-clipped While most features agreed between cfNano and Illumina-based datasets, we identified 332 fragment end motifs as one that was especially sensitive to sequencing platform differences. 333

With the small sample size here, it is not possible to determine which of the two platforms 334

provides the truer results. It is tempting to hypothesize that Nanopore provides a more accurate 335

representation of actual fragment frequencies, since it does not include any PCR amplification. 336

On the other hand, Nanopore error rates are higher and may lead to less accurate adapter 337 trimming or a different spectrum of sequencing errors within the end motifs (although we did try 338 to control for this by using the reference genome sequence rather than the read sequence, and 339

by filtering out reads with soft-clipped ends). Despite these differences, we were able to detect 340 the best known cancer-specific end motif (CCCA) when we constrained the analysis to samples 341 from our main cfNano batch ("ISPRO"). We also detected slight differences in end motif 342 frequencies between our own Illumina WGS and earlier published WGS data, and between our 343 two cfNano batches which were sequenced two years apart. This indicates that end motif 344 frequencies are susceptible to batch effects in general and should be analyzed cautiously 345 between batches. In the future, proper controls should be developed to compare this feature 346 across datasets sequenced at different times. 347

With the small number of metastatic cancer samples used in our cfNano study, it was not 348 possible to define the lower limit of methylation sensitivity. However, the fact that some 349 epithelial/lung content was found when downsampling higher coverage WGBS samples and in 350

our cfNano healthy samples that had the lowest coverage, suggests that specificity needs to be 351 improved at this very low coverage. We believe this could be improved significantly if whole-352 genome DNA methylation atlases (WGBS or similar) can be generated for individual cell types 353

purified from human tissues. The purified cell type atlas we used here was based on the 354

Illumina HM450k platform, which covers only about 10-15% of cell type specific methylation 355 markers. Alternatively, cancer markers could come directly from discovery WGBS studies on 356 cancer plasma cfDNA. Such datasets do exist in the private domain, but remain proprietary (e.g. 357

(2)). 358

Even if deconvolution can be improved bioinformatically, we will almost certainly need higher 359 read coverage for applications that require more sensitive detection. The cfNano samples 360 analyzed here were all from metastatic disease and had relatively high tumor fraction (>=10%), 361

and this does not represent the situation for cancer screening or detection of minimal residual 362 disease. Thus, the Nanopore platform will need to be able to consistently produce tens of 363 millions of reads to enable those applications. minus the average methylation genome-wide). Each cancer sample was compared to the group 475 of healthy samples using a one-tailed t-test, and statistical significance is shown using asterisks. 476 (D) 10Mbp PMD bins were stratified by copy number status for each cancer sample using 477

ichorCNA, and statistically significant differences were calculated by performing one-tailed 478

Wilcoxon tests within each sample.*p<0.05, **p<0.01, ***p<0.001, ****p<0.0001. 479 

ISPRO Plasma cfDNA samples, library construction, and sequencing. ISPRO Samples, library 581 construction and sequencing were described in our initial publication of these sequences (17). 582

The original sample names from that study are listed in Supplementary Table 1 . Notably, one 583 sample (S1/19_326) was produced using a different library kit (SQK-LSK109 vs. NBD-584

EXP104+SQK-LSK109 for all other samples). This is the singleplex library kit, which results in 585

shorter adapter-ligated templates overall (due to the lack of barcodes) and thus responds 586 differently to the equivalent clean up bead concentration. Also adapter trimming is performed 587 differently in 19_326 due to the library kit differences. For these reasons, fragmentomic 588

properties are not directly comparable between 19_326 and other samples. We thus analyzed 589

19_326 separately for all fragmentomic analyses (included in Supplementary Figures 9-10 "--discard_middle --extra_end_trim 0". For both HU and ISPRO samples, alignments were 603 performed to GCF_000001405.39_GRCh38.p13 with minimap2 (Version 2.13-r850), using the 604 parameters "-ax map-ont --MD". 605 Megalodon filters out multi-mapping (supplementary) reads and uses the minimap2 "map-ont" 614 mode to filter low quality mappings. 615

Individual tile Fast5 files were run individually, and mod_mapping.bam files were merged using 616 samtools merge (v1.14). Samtools/HTSlib versions before v. DeepSignal call_mods (modification_call) output tsv file, extracting the (strand-specific) 630 methylation calls for each CpG from column 9 (called_label field), and calculated a methylation 631 beta value by taking the number of methylated reads (value 1) divided by the total number of 632 reads (value 0 or value 1). These were collapsed into a bedgraph file with a value between 0-1 633 for every CpG covered. These are available as file "grouped-beta-value_bedgraph.zip" in GEO 634

accession GSE185307 and at Zenodo DOI: 10.5281/zenodo.6448476. All genomic coordinates 635 are in GRCh38 and are zero-based. 636

Megalodon methylation calling and processing. Basic running of Megalodon is described above. 637

To extract (stranded) methylation information from the mod_mapping.bam files, we used 638 modbam2bed (https://github.com/epi2me-labs/modbam2bed) v.0.4.5, specifying a minimum 639 probability threshold of 0.667, and filtering out positions with 0 confident reads using awk. The 640 full command line was "modbam2bed --cpg -t 4 -a 0.333 -b 0.667 | awk '($5>0){print} > out.bed". 641

All coordinates are in GRCh38 and are 0-based. These files are named 642 "*.5mC.cut0.667.hg38.bed.gz". Column 11 corresponds to the percent of reads methylated. 643

Modbam2bed does not provide a column for the actual number of reads that this percentage is 644 based on, but it can be calculated from the other columns. readCount=(col5*col10)/1000. 645 We also profide a simple bedgrpah with just the methylation fraction (beta) values in files named 646 "*cut0.667.hg38.sorted.bedgraph.gz". These can be loaded into any genome browser. Both file 647 types are available in GEO accession GSE185307 and at Zenodo DOI: 648

10.5281/zenodo.6448476 649

Mapping of cfNano methylation data to HM450k probes. Using the zero-based stranded bed 650 files from modbam2bed ("5mC.cut0.667.hg38.bed.gz" files), we mapped each CpG covering 651 either the forward or reverse strand of each CpG on the Infinium 450k array. For each 652 modbam2bed stranded column, we first got the readCount as (col5*col10)/1000. We then 653 multiplied the methylation percentage by the read count for each strand to get the number of 654 methylated reads. Then we divided the sum of the methylated read counts by the sum of the 655 total read counts to get the unstranded percent methylation (beta value). This script treats each read at each CpG as an independent observation, and then randomly 672 samples from these until it has enough observations to reach the average genomic coverage 673 requested. To obtain the coverage levels shown in Figure 1 , it was run with the command line 674 "downsampleMethylBed.pl --coverageLevels 1E-3,2E-3,5E-3,1E-2,2E-2,5E-2,1E-1,2E-1,5E-675 1,1E0,2E0,5E0,1E1,2E1,5E1,8E1 --fracTotalFieldsFrom0 -3,4 --ncpgsGenome 28217005". 676

Full cell type methylation deconvolution. For the full cell type deconvolution in Figures 1A-C Supplementary Figure 4A -B. 685

For Figure 1A -C, we collapsed cell types into 8 groups using the file 686 https://github.com/methylgrammarlab/cfdna-687 ont/blob/main/deconvolution_code/deconvolution_moss/group_file_for_plot_green_epithilial.csv 688 (shown visually in Supplementary Figure 4C) . We plotted results using code in 689 https://github.com/methylgrammarlab/cfdna-690 ont/blob/main/deconvolution_code/deconvolution_moss/deconvolution_plot.R. For DeepSignal 691 methylation data, the procedure was the same except we used the top 2,000 hypermethylated 692 and top 2,000 hypomethylated CpGs, to account for the significantly smaller number of CpGs 693 called in the DeepSignal data (shown in Supplemental Figure 6A ). 694 ichorCNA analysis. Samtools (Version 1.9) was used to filter BAM alignments, unmapped 695 reads, secondary and supplementary reads, reads with mapping quality less than 20 as in (44) Supplemental Figure 6A ). We removed any probe that did not have valid (non-NA) values for 2 719 or more of the Lung_cell samples and 2 or more of the healthy plasma samples. 720

For each probe, the 450k beta values were averaged to produce a single Lung-specific beta 721 value ܺ ଵ . Gain and Loss respectively. For the healthy samples, 10Mb bins were generated from the whole 757 genome. GRCh38 Methylation files were converted to GRCh37 using liftover R package. We 758 selected only the bins overlapping one or more common Partially Methylated Domains (PMDs) 759 from (29). Within these PMD bins, we took the average of all "solo-WCGW" CpGs overlapping a 760

PMDs, with "solo-WCGW" annotation also from (29). We calculated the bin average from these 761

CpGs as sum(frac_methylation_each_CpG)/CpG_count. We then subtracted this bin average 762 and subtracted it from the average of all CpGs in the genome, to get the Methylation Delta 763

shown in Figure 2C -D. Common PMD and solo-WCGW annotations were taken from file 764 https://zwdzwd.s3.amazonaws.com/pmd/solo_WCGW_inCommonPMDs_hg38.bed.gz. 765

Statistical significance between each cancer sample's bins and all pooled healthy sample bins 766

in Figure 2D was calculated by one-sided Wilcoxon test (because we decided a priori to look 767 only for hypomethylation in the cancer samples). For copy number analysis in Figure 2D , from -1000 to -800 and +800 to +1000 across all NKX2-1 sites. For Supplementary Figure 7A , 786 we used all WGBS cancer types that were represented by normal tissues in the scATAC-seq 787 atlas, as this was the atlas used to define pneumocyte specific (PAL) peaks. For TCGA lung 788

and non-lung samples in Supplementary Figure 7A , we downloaded TCGA WGBS bedgraph 789 files from https://zwdzwd.github.io/pmd (29). We used all WGBS cancer types that were 790

represented by normal tissues in the scATAC-seq atlas, as this was the atlas used to define 791 pneumocyte specific (PAL) peaks. These TGCA types included LUAD and LUSC (Lung tissue  792 from atlas), CRC (Transverse colon tissue from atlas), BRCA (Breast tissue from atlas), STAD 793 (Stomach tissue from atlas), and UCEC (Uterus tissue from atlas). 794 Figure 7B) . As with 795 NKX.2 above, we used HOMER to identify predicted KLF5 binding sites (using the HOMER built 796 in matrix "klf5.motif") across the GRCh38 genome, and removed any site within the ENCODE 797

blacklist. As a control we intersected this list with 9,274 ATAC-seq peaks identified in the cluster 798 43 CREs from (46) (downloaded from supplemental table 6 of that paper 799 "Table_S6_Union_set_of_cCREs.xlsx"). We then selected 1,762 peaks that overlapped a 800 predicted KLF5 TFBS, and centered each on the predicted KLF5 TFBS. If multiple TFBS were 801 present in the peak, we took the motif with the highest HOMER log-odds match score. This 802 TFBS set is available as file "klf5.incluster43Distal.txt.highestScoreMotifs.bed" in GEO 803 accession GSE185307. 804 CTCF nucleosome positioning analysis. We used 9,780 evolutionarily conserved CTCF motifs 805 occurring in distal ChIP-seq peaks, which were taken from (33). Nanopore or Illumina fragments 806 within the size range of 130-155bp were used for fragment coverage analysis, with reads being 807 extracted from BAMs as described above. These shorter mononucleosomal fragments showed 808 similar nucleosomal patterns but gave higher spatial resolution than 156-180 bp fragments. 809 Deeptools (Version 3.5.0) bamCoverage was used with the parameters "--ignoreDuplicates --810 binSize -bl ENCODE_blacklist -of bedgraph --effectiveGenomeSize 2913022398 --811 normalizeUsing RPGC". For Illumina WGS, we used the additional parameter "--extendedReads 812 145". The bedgraph was converted to a bigwig file using bigWigToBedGraph downloaded from 813 UCSC Genome Browser. This bigwig file was passed to Deeptools computeMatrix with the 814 command line parameters "reference-point --referencePoint center -out (this was determined visually from Figure 2D of publication (34)). 830

End motif analysis. Fragments and reads were processed and filtered as in fragment length 831 analysis. We only used end 1 of both cfNano and Illumina reads, to avoid problems with adapter 832 trimming at end 2. To avoid biases that would affect end motif analysis, we also removed reads 833 with any soft clipping at end 1. The first 4 bases of each fragment were extracted and used for 834 4-mer analysis. To avoid errors in Nanopore base calling, these 4 bases were extracted from 835 the reference genome. Motif frequency was calculated as 

Epigenetics, fragmentomics, and 843 topology of cell-free DNA in liquid biopsies

Clinical validation of a targeted methylation-based multi-cancer early detection 848 test using an independent validation set

Principles of DNA methylation and their implications for biology and 850 medicine

Non-invasive detection of human cardiomyocyte death 853 using methylation patterns of circulating DNA

Comprehensive human cell-type methylation atlas reveals origins of 858 circulating cell-free DNA in health and disease

Cell-free 861 DNA maps COVID-19 tissue injury and risk of death and can cause tissue injury

Sensitive tumour detection and classification using plasma cell-869 free DNA methylomes

Cell-free DNA TAPS provides 873 multimodal information for early cancer detection

Systematic 875 benchmarking of tools for CpG methylation detection from nanopore sequencing

DNA methylation-calling tools for Oxford Nanopore sequencing: a 879 survey and human epigenome-wide evaluation

Rapid-CNS2: Rapid comprehensive adaptive nanopore-882 sequencing of CNS tumors, a proof of concept study

Intraoperative DNA methylation classification of brain tumors impacts neurosurgical 886 strategy

Remote immune processes revealed by immune-derived circulating cell-free DNA. 891 Elife

Detection 898 and characterization of lung cancer using cell-free DNA fragmentomes

Genome-wide cell-free DNA 906 fragmentation in patients with cancer

Noninvasive prenatal testing by nanopore sequencing of maternal plasma 909 DNA: feasibility assessment

Nanopore sequencing from liquid biopsy: analysis of copy number 912 variations from cell-free DNA of lung cancer patients

High resolution copy number inference in cancer using short-915 molecule nanopore sequencing

Single-molecule 918 sequencing reveals a large population of long cell-free DNA molecules in maternal plasma

Liquid biopsy uncovers distinct patterns of DNA methylation and copy number 925 changes in NSCLC patients with different EGFR-TKI resistant mutations

Scalable whole-exome sequencing of cell-free DNA reveals high 936 concordance with metastatic tumors

Genomic and Functional 1040 Approaches to Understanding Cancer Aneuploidy

Liquid 1046 biopsy reveals collateral tissue damage in cancer

DeepSignal: detecting DNA methylation state from Nanopore sequencing reads using 1050 deep-learning

A single-cell atlas of chromatin accessibility in the 1053 human genome

Transcriptional control of lung morphogenesis

Single-nucleus cross-tissue molecular 1060 reference maps to decipher disease gene function

The 1065 chromatin accessibility landscape of primary human cancers

DNA methylation loss in late-replicating domains is linked to mitotic cell 1069 division

Noninvasive detection of cancer-1073 associated genome-wide hypomethylation and copy number aberrations by plasma DNA 1074 bisulfite sequencing

Paradoxical association of TET loss of function with genome-wide DNA 1077 hypomethylation

Chromatin dysregulation associated with NSD1 mutation in head and neck 1080 squamous cell carcinoma

Genome-wide mapping 1082 of nucleosome positioning and DNA methylation within individual DNA molecules

Enhanced detection of circulating 1090 tumor DNA by fragment size analysis

Plasma DNA End-Motif Profiling as a Fragmentomic Marker in Cancer, Pregnancy, and 1096 Transplantation

Plasma DNA Profile Associated with DNASE1L3 Gene 1101 Mutations: Clinical Observations, Relationships to Nuclease Substrate Preference, and In 1102 Vivo Correction

Performance 1107 assessment of DNA sequencing platforms in the ABRF Next-Generation Sequencing 1108 Study

Genome-1115 wide cell-free DNA mutational integration enables ultra-sensitive cancer monitoring

Machine learning guided signal enrichment for ultrasensitive plasma tumor 1124 burden monitoring. bioRxiv (2022)

cfNOMe -A single 1127 assay for comprehensive epigenetic analyses of cell-free DNA

Cardiovascular disease biomarkers derived from circulating cell-1132 free DNA methylation. medRxiv

Plasma DNA 1136 tissue mapping by genome-wide methylation sequencing for noninvasive prenatal, cancer, 1137 and transplantation assessments

SAMBLASTER: fast duplicate marking and structural variant read 1139 extraction

High resolution copy number inference in cancer using short-molecule nanopore 1142 sequencing. bioRxiv (2020)

2: an R/Bioconductor package to reconstruct gene regulatory networks 1145 from DNA methylation and transcriptome profiles

A cell atlas of chromatin accessibility across 25 adult 1148 human tissues