ARTICLE

Multicomponent molecular memory

Christopher E. Arcadia1, Eamonn Kennedy1, Joseph Geiser1, Amanda Dombroski1, Kady Oakley1, Shui-Ling Chen1, Leonard Sprague1, Mustafa Ozmen1, Jason Sello1, Peter M. Weber1, Sherief Reda1, Christopher Rose1, Eunsuk Kim1, Brenda M. Rubenstein1 & Jacob K. Rosenstein1*

Multicomponent reactions enable the synthesis of large molecular libraries from relatively few inputs. This scalability has led to the broad adoption of these reactions by the pharmaceutical industry. Here, we employ the four-component Ugi reaction to demonstrate that multicomponent reactions can provide a basis for large-scale molecular data storage. Using this combinatorial chemistry we encode more than 1.8 million bits of art historical images, including a Cubist drawing by Picasso. Digital data is written using robotically synthesized libraries of Ugi products, and the files are read back using mass spectrometry. We combine sparse mixture mapping with supervised learning to achieve bit error rates as low as 0.11% for single reads, without library purification. In addition to improved scaling of non-biological molecular data storage, these demonstrations offer an information-centric perspective on the high-throughput synthesis and screening of small-molecule libraries.

https://doi.org/10.1038/s41467-020-14455-1 OPEN

1 Brown University, Providence, RI, USA.
*email: jacob_rosenstein@brown.edu

NATURE COMMUNICATIONS | (2020) 11:691 | https://doi.org/10.1038/s41467-020-14455-1 | www.nature.com/naturecommunications

http://orcid.org/0000-0003-0935-3323
http://orcid.org/0000-0003-1926-8827
http://orcid.org/0000-0003-3017-336X
http://orcid.org/0000-0003-1643-0358
http://orcid.org/0000-0001-9791-704X

Significant advances toward useful molecular-scale data systems have been made by exploiting DNA1–8 and other sequence-defined polymers9–11. However, linearly ordered macromolecules represent only a tiny fraction of the near-limitless variety of chemistries, which could be used to represent information. To find alternative examples of molecular information systems, one need look no further than the simplest single-celled organisms, which have evolved to make use of many complementary forms of chemical information, including loosely ordered mixtures of small molecules, such as metabolites and dissolved ions. Similarly, recent demonstrations have stored digital data not only in DNA4,5,12, but also using short peptides13 and metabolites14.
While macromolecules will continue to be important for information systems, complementary small-molecule approaches can offer a number of potential advantages13,15. They do not require polymerization or enzymatic steps; they can be designed to resist cellular digestion16 and extreme environmental conditions; and they can be economical to produce. However, previous demonstrations of non-polymeric molecular data have faced capacity scaling challenges, which limited their scope to small files, such as encryption keys17,18.

In this work, we encode millions of bits of data, in the form of digital images, using mixtures of small molecules. Rather than representing information in linear molecular sequences, we store data in locally disordered mixtures of small molecules, which can be identified by their molecular structures. This approach may appear comparatively difficult to scale to large amounts of data since we cannot simply add more subunits, as in the case of a polymer. We overcome this hurdle by creating large libraries of unique compounds through automated multicomponent reactions.

We introduce a process to perform the nanoliter-scale synthesis and validation of thousands of unique Ugi products per day, without requiring purification or the use of solid supports. Some of these compounds are likely novel and have not been experimentally characterized before. To use these Ugi libraries, we have developed tools that can identify information-bearing molecules in complex chemical mixtures. By combining high-resolution mass spectrometry with supervised learning, we show how to use isotopes, adducts, impurities, and chemical interactions to improve the identification of information-carrying compounds. Additionally, we improve on previous demonstrations of non-genomic data storage by implementing a sparse data encoding scheme which dramatically reduces error rates. Furthermore, the techniques used here can be applied to other scalable chemical libraries.
An overview of the data storage process is provided in Fig. 1, which depicts the storage of a 0.88 megapixel digital image derived from a Cubist charcoal drawing of a violin by Pablo Picasso19. Other datasets are shown in Fig. 6. These images represent the largest amount of digital data stored in a non-polymeric molecular form (Supplementary Fig. 9).

Even in these early demonstrations, encoding between 16 and 575 bits of data per position compares favorably to some aspects of conventional memory devices, in which information is typically encoded using a single scalar parameter (e.g. charge) per location, and where electronic noise sources make it impractical to store more than a few bits per cell20,21. In order to further improve density, semiconductor memory is increasingly structured in three dimensions22. While the physical dimensions of our chemical memory spots are currently much larger than electronic memory, the concept of storing information in diverse small-molecule mixtures is valid down to the nanoscale.

In addition to the potential for dense data storage, working with large numbers of complex chemical mixtures provides opportunities to learn from information-rich annotated experimental datasets. Just as DNA memory has inspired improvements in synthesis and sequence alignment23,24, advances in non-polymeric molecular data systems can lead to insights that may prove useful for navigating broad small-molecule spaces for drug discovery, metabolomics, and synthetic biology.

Results

Combinatorial library synthesis. The automated generation of diverse non-polymeric chemical libraries is challenging because of the wide variety and complexity of synthetic protocols. To create scalable small-molecule libraries appropriate for information storage, we use the multicomponent Ugi reaction25, which combines four reagents: an amine, an aldehyde or a ketone, a carboxylic acid, and an isocyanide, into a single product plus water.
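Because the reaction joins four reagents and releases one water molecule, the monoisotopic mass of a product follows from simple bookkeeping. The sketch below is illustrative only and is not taken from the authors' code. It uses four reagents named in the Methods, and the elemental formulas are our own reading of those reagent names rather than values given in the paper. The function names are hypothetical. The sodiated ion is included because adduct peaks are among the features used for readout.

```python
# Monoisotopic masses (Da) of the most abundant isotopes.
ISOTOPE_MASS = {"C": 12.0, "H": 1.00782503, "N": 14.00307401, "O": 15.99491462}
NA_MASS = 22.98976928        # sodium-23
ELECTRON_MASS = 0.00054858


def monoisotopic_mass(formula):
    """Mass of a neutral molecule given as {element: atom count}."""
    return sum(ISOTOPE_MASS[el] * n for el, n in formula.items())


def ugi_product_formula(amine, aldehyde, acid, isocyanide):
    """Add the four reagent formulas, then remove one H2O."""
    total = {}
    for reagent in (amine, aldehyde, acid, isocyanide):
        for el, n in reagent.items():
            total[el] = total.get(el, 0) + n
    total["H"] -= 2
    total["O"] -= 1
    return total


# Assumed formulas (derived from the reagent names, not quoted from the paper).
benzylamine = {"C": 7, "H": 9, "N": 1}
cyclohexanecarboxaldehyde = {"C": 7, "H": 12, "O": 1}
boc_glycine = {"C": 7, "H": 13, "N": 1, "O": 4}
cyclohexyl_isocyanide = {"C": 7, "H": 11, "N": 1}

product_formula = ugi_product_formula(
    benzylamine, cyclohexanecarboxaldehyde, boc_glycine, cyclohexyl_isocyanide
)
M = monoisotopic_mass(product_formula)
mz_sodiated = M + NA_MASS - ELECTRON_MASS   # singly charged [M + Na]+ ion

print(product_formula)          # {'C': 28, 'H': 43, 'N': 3, 'O': 4}
print(round(M, 4))              # 485.3254
print(round(mz_sodiated, 4))    # 508.3146
```

The same bookkeeping applies to any combination of reagents, so a table of expected masses for a whole library can be generated before any spectra are acquired.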
The number of unique Ugi products that can be formed scales with the number of available reagents (Fig. 2a). For instance, with 10 variations of each reagent, up to 10,000 unique multicomponent products are possible. The Ugi reaction is particularly attractive as a one-pot, single-step, room-temperature reaction26 that has known catalysts27,28, solid supports29–31, accelerated conditions32,33, and multi-step extensions34. Ugi reactions have been previously used as secret molecular encryption keys17, and to create sequence-defined macromolecules35. Here, however, our goal is to encode millions of bits of information in mixtures of small molecules, requiring that we find efficient strategies to synthesize as many unique products as possible.

To this end, we have developed automated protocols for the high throughput synthesis of 1500 Ugi products at a time (Fig. 2b). We begin with a well plate containing five amines, five aldehydes, 12 carboxylic acids, and five isocyanides, and use an acoustic fluid handler (Echo 550, Labcyte) to enumerate all 1500 possible combinations of the four components into a 1536-well plate. After reacting, the wells are diluted to a final volume of 4 μL. Since the minimum transfer volume of the fluid handler is 2.5 nL, each library well can be dispensed more than a thousand times before it is depleted.

We initially assumed that it would be necessary to purify each library component, perhaps using solid supports, but were pleasantly surprised to find that with appropriate analysis strategies our molecular datasets could be accurately read using raw reaction solutions. Forgoing purification allowed us to streamline the experimental protocols and minimize labor and material cost, such that over the course of this work we were able to synthesize more than 10,000 compounds.
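The combinatorial arithmetic above is easy to check. The following sketch is ours and is not the authors' plate-generation code. It enumerates every four-component combination for a 5 × 5 × 12 × 5 reagent set, assigns each to a well, and counts how many minimum-volume transfers one library well can supply. The reagent labels are placeholders, and the row-major filling of a 32 × 48 plate is an assumption made for illustration.

```python
from itertools import product

# Placeholder labels (hypothetical); the actual reagents are listed in the Methods.
amines = [f"amine_{i}" for i in range(1, 6)]
aldehydes = [f"aldehyde_{i}" for i in range(1, 6)]
acids = [f"acid_{i}" for i in range(1, 13)]
isocyanides = [f"isocyanide_{i}" for i in range(1, 6)]

# One four-component combination per library well.
library = list(product(amines, aldehydes, acids, isocyanides))

# Assumed layout: row-major filling of a 32 x 48 (1536-well) plate.
PLATE_ROWS, PLATE_COLS = 32, 48
well_of = {combo: divmod(i, PLATE_COLS) for i, combo in enumerate(library)}

# Transfers available per well: 4 uL final volume / 2.5 nL minimum transfer.
FINAL_VOLUME_NL = 4000.0
MIN_TRANSFER_NL = 2.5
dispenses_per_well = int(FINAL_VOLUME_NL / MIN_TRANSFER_NL)

print(len(library))            # 1500
print(dispenses_per_well)      # 1600
print(well_of[library[-1]])    # (31, 11)
```

The 1500 products leave 36 wells of the 1536-well plate unused, and the 1600 available transfers per well are consistent with the statement that each well can be dispensed more than a thousand times.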
To validate the library, 20 nL from each well was analyzed with mass spectrometry (SolariX 7T, Bruker)36,37, in matrix-assisted laser desorption ionization (MALDI) mode38. The Ugi product monoisotopic masses (M) are mostly between 500 and 700 Da, but we frequently observe sodiated (M + Na) and potassiated (M + K) adducts (Fig. 3b). We analyzed the library spectra for sodiated product peaks, and found that about 90% (1346/1500) of the products had significant signal (SNR > 31.44). Additional details about library synthesis and validation are provided in the Methods and Supplementary Figs. 1–4.

Writing data as chemical mixtures. The composition of a chemical sample can represent abstract information, whether the sample consists of a single compound selected from a defined chemical space17, a pool of sequence-controlled polymers4,13, or a mixture of unique compounds15. With small molecule libraries, the most direct way to encode information is to use the presence or absence of each library element in a sample to represent one bit of data14,39. Thus, our 1500 compound Ugi libraries could encode up to 1500 bits of information per mixture. However, using the full library in this way would strain our current experimental system, requiring large mixture volumes and low analyte concentrations. Thus when using this simple encoding scheme we often prefer to use a reduced subset of library components.

Figure 4 shows a 48,841-pixel binary image of the Egyptian god Anubis, which we encoded across 1527 independent mixtures using a 32-product library subset.
In this case, the binary image data is first rearranged as a 1527 × 32 matrix, where each row of the matrix corresponds to one location on a data plate and each column corresponds to one library component. At each position, if a library element is meant to be included, we instruct our acoustic liquid handler to dispense a 2.5 nL droplet from its library well to the data plate. If it is meant to be excluded, no transfer is performed. The data is assembled on a standard MALDI target plate, forming a unique data mixture at each position. Finally, MALDI matrix is added to the mixtures, which are then dried, leaving behind crystalline spots, which can be stored and later read back using mass spectrometry.

We often create up to 1536 unique mixtures per data plate, and the storage capacity scales with the number of library compounds used. To encode the 0.88 megapixel image of a Picasso drawing shown in Fig. 1, we used 575 unique compounds. Additional storage experiments are summarized in Fig. 6 and Supplementary Fig. 7. These chemical datasets take several hours to write and read, are stable for at least 9 months, and can be read more than 100 times (Supplementary Figs. 8 and 12).

Reading data from chemical mixtures. To recover a chemical dataset, we analyze each mixture with mass spectrometry and train supervised learning algorithms to identify which library elements they contain. In the simplest version of this analysis, we can consider only the sodiated peaks of each Ugi product. If a peak’s intensity exceeds a threshold, we record the compound as present (“1”) and otherwise declare it as absent (“0”). For example, the spectrum shown in Fig. 4d contains the 1st, 2nd, 5th, 12th, 14th, and 17th compounds from a 32-compound sub-library, and thus this mixture encodes the following four bytes: 11001000 00010100 10000000 00000000. Using only sodiated product peaks, the Anubis dataset was recovered with 97.9% accuracy.
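The direct mapping and single-peak readout described above can be sketched in a few lines. This is an illustration under stated assumptions and is not the authors' implementation. The row-major reshape and the zero-padding of the final mixture are our assumptions, since the text does not say how leftover positions are filled, and all function names are hypothetical.

```python
import math


def bits_to_presence_matrix(bits, n_compounds):
    """Reshape a bit list into rows (mixtures) x columns (compounds).

    Assumption: row-major order, with the last row zero-padded.
    """
    n_mixtures = math.ceil(len(bits) / n_compounds)
    padded = bits + [0] * (n_mixtures * n_compounds - len(bits))
    return [padded[r * n_compounds:(r + 1) * n_compounds] for r in range(n_mixtures)]


def threshold_row(peak_intensities, threshold):
    """Single-peak readout: a compound is present if its peak exceeds the threshold."""
    return [1 if x > threshold else 0 for x in peak_intensities]


def row_from_compounds(present, n_compounds):
    """Presence row built from 1-based compound numbers."""
    return [1 if i + 1 in present else 0 for i in range(n_compounds)]


def row_to_bytes(row):
    """Format a presence row as space-separated 8-bit groups."""
    bits = "".join(str(b) for b in row)
    return " ".join(bits[i:i + 8] for i in range(0, len(bits), 8))


# The mixture of Fig. 4d: compounds 1, 2, 5, 12, 14 and 17 of a 32-compound sub-library.
row = row_from_compounds({1, 2, 5, 12, 14, 17}, 32)
print(row_to_bytes(row))   # 11001000 00010100 10000000 00000000

# Number of mixtures needed for the two directly mapped images.
anubis_mixtures = math.ceil(48841 / 32)     # 1527
violin_mixtures = math.ceil(879625 / 575)   # 1530
print(anubis_mixtures, violin_mixtures)     # 1527 1530
```

Both mixture counts fit within the 1536 positions of a single data plate, which is consistent with storing each image on one plate.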
At least 30/32 compounds were correctly assigned in over 95% of mixtures and the residual errors displayed similar rates of false positives and false negatives (Fig. 4c). While the majority of compounds achieved <5% error, we observed an order of magnitude variation in compound performance (Fig. 4c).

[Figure 1 graphic: (a) the Ugi reaction space spanned by amines, aldehydes, carboxylic acids, and isocyanides, with example products Ugi 1, Ugi 2, ..., Ugi M; (b) mixtures 1 to M encoding the Violin image (Picasso, 1912) as vectors of 1s and 0s; (c) readout steps: Mix, Input, Train, Infer, Recovered.]

Fig. 1 Process overview. a Multicomponent reactions such as the four-component Ugi reaction can generate diverse libraries of small molecules. b A file can be stored by mapping its digital information onto a set of vectors which determine the presence (“1”) or absence (“0”) of multiple Ugi products at each location in an array. Here, we encoded a Cubist charcoal drawing of a violin by Pablo Picasso19 (©Estate of Pablo Picasso/Artists Rights Society (ARS), New York). c The molecular data is stored on a thin metal plate in small crystalline spots. The original file can be recovered by interrogating each spot with mass spectrometry and interpreting the spectra with models trained to identify library elements. The spots on the data plate displayed here have an average diameter of 820 μm and a thickness of several micrometers.

[Figure 2 graphic: (a) reaction scheme: amine + aldehyde + carboxylic acid + isocyanide → Ugi product − H2O; (b) synthesis workflow: 200 nL (×4) of 500 mM reagents, giving 800 nL at 125 mM, >24 h reaction, then +3.2 μL DMSO, giving 4 μL at 25 mM.]

Fig. 2 Library synthesis. a The multicomponent Ugi reaction incorporates an amine, aldehyde, carboxylic acid, and isocyanide into a single peptide-like bis-amide. b We use automated acoustic liquid handling to synthesize combinatorial Ugi libraries of up to 1500 compounds at a time.

We can improve the readout accuracy by using multiple spectral features to determine the presence of each
library element. Although it is difficult to assign the precise origin of every peak in a spectrum (e.g., isotopes, adducts), knowing their physical origin is not strictly required for data recovery. In fact, any feature which reliably correlates with the presence of a compound can be used to improve detection. In order to exploit these correlated background peaks, we use supervised learning to train classifiers on library elements using logistic or random forest regression. In both cases, we produce a set of regression models, one per library element, which can interpret the contents of a data mixture from its mass spectrum. By applying these machine learning approaches, we have achieved up to a five-fold reduction in error rates (Fig. 5).

In addition to improving accuracy, treating the readout as a learning problem has also revealed some interesting and non-intuitive chemical identifiers (Supplementary Fig. 13). For example, some classifiers used information about other library elements, perhaps as a result of competitive ionization. In other cases, compounds were found to form complexes with residual starting reagents, which could have arisen during ionization or during synthesis. Moreover, by learning these difficult-to-anticipate interactions, multi-peak detection can allow us to identify multiple library elements with the same monoisotopic mass.

Improving accuracy with sparse data-mixture maps. For the Picasso drawing (Fig. 1) and Anubis image (Fig. 4), each bit of data was independently mapped onto the presence or absence of a single compound in a single mixture.
While conceptually straightforward, this mapping limits the design parameters available to optimize experimental throughput and accuracy. For example, it is vulnerable to errors if single chemical components are improperly identified, and it implies that increasing storage capacity per spot requires both larger mixtures and larger libraries.

In Fig. 5, we explore an alternate encoding scheme, in which a 16-bit block of data is mapped to an entire mixture, instead of mapping each bit independently. This approach allows us to tune the complexity of the mixtures separately from the library size. Here we use a library subset of 512 compounds, but constrain exactly 32 compounds to be present in each mixture. In theory, there are $\binom{512}{32} \approx 2^{169}$ such combinations. However, only $2^{16}$ states are needed to represent all possible values of the 16-bit data. This sparse mapping implies that the vast majority of possible mixtures should never be observed. As such, when errors do occur, data can be rounded to the nearest valid mixture, providing some degree of fault tolerance.

In this example, the minimum Hamming distance between any two valid mixtures is 36, meaning that perfect recovery of the encoded data can be guaranteed even when up to 17 of the 512 compounds (3.3%) are incorrectly classified. To test this, we performed a series of simulations where a 1600-bit data vector was encoded into 100 chemical mixtures. The virtual mixtures were symmetrically corrupted, at various error rates, and then decoded (Fig. 5b). Even with raw error rates several times larger than the guaranteed threshold, the vast majority of errors could still be corrected.

Using this sparse encoding, we wrote a 24,336-pixel digital image derived from a 16th century German illustration of angels seated at a table40.
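The numbers quoted for the sparse mapping can be reproduced, and the rounding step illustrated, with a short sketch. It is not the authors' codebook or decoder. The first part checks the combination count and the guaranteed correction radius, using the standard result that a code with minimum Hamming distance d corrects up to ⌊(d − 1)/2⌋ errors. The second part uses a deliberately tiny, hypothetical codebook (16 compounds, 4 present per mixture, 2 data bits) so that the result can be verified by hand.

```python
import math


def hamming(a, b):
    """Number of positions at which two presence vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def min_distance(codebook):
    """Smallest Hamming distance between any two valid mixtures."""
    words = list(codebook.values())
    return min(hamming(a, b) for i, a in enumerate(words) for b in words[i + 1:])


def nearest_valid(observed, codebook):
    """Round an observed presence vector to the closest valid mixture."""
    return min(codebook, key=lambda value: hamming(observed, codebook[value]))


# Figures quoted in the text.
log2_combinations = math.log2(math.comb(512, 32))   # close to 169
guaranteed_errors = (36 - 1) // 2                   # 17
print(round(log2_combinations), guaranteed_errors)  # 169 17

# Toy codebook (hypothetical): data value v uses compounds 4v .. 4v+3.
toy = {v: [1 if 4 * v <= i < 4 * v + 4 else 0 for i in range(16)] for v in range(4)}
d = min_distance(toy)       # 8
t = (d - 1) // 2            # up to 3 misclassified compounds are always corrected

# Corrupt the mixture for value 2 with three misclassifications, then decode.
received = list(toy[2])
for i in (0, 9, 15):
    received[i] ^= 1
decoded = nearest_valid(received, toy)
print(d, t, decoded)        # 8 3 2
```

A full-size decoder would search 65,536 entries of 512 bits each per mixture, which is still a small computation. A brute-force nearest-neighbor search is used here only for clarity.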
Looking only for the sodiated product peaks, we correctly classified 389 of the 512 library components, and after rounding to the nearest valid mixture, the original data was recovered with 96.67% accuracy (Fig. 5c). Training a logistic regression model to perform multi-peak detection resulted in a 5-fold reduction in raw compound errors (Fig. 5d) and a 30-fold reduction in decoded data errors, yielding a final accuracy of 99.89% (Fig. 5e).

Several mapping and detection schemes were tested in this work, and the results summarized in Fig. 6 highlight key trade-offs between the different approaches. The largest file was stored using direct mapping, since it provides a direct scaling of data capacity with library size. In contrast, the lowest error rates were achieved with sparse mapping and multi-peak detection.

Discussion

In this study, we have introduced chemical information representations based on mixtures of multicomponent molecules. We can view this as an effort to store information in a superset of the molecular space available to biological systems, where synthetic chemistry is not held to the same environmental and energetic constraints as living cells. The demonstrations presented here are already six times larger than the information capacity of the smallest known genome41, and although it is difficult to quantify exactly how much information is represented in living systems, it is interesting to think about how engineered chemical

[Figure 3 graphic: (a) plate map (wells 1–48 by 1–32) of the SNR of expected masses, color scale $10^1$ to $10^4$; (b) observation count versus m/z − M, with peaks labeled M, M + 1, M + Na, M + Na + 1, M + Na + 2, M + K, and M + K + 1, and an inset of count versus M (500–700); (c) fallout (FPR) [%] and sensitivity (TPR) [%] versus SNR threshold ($10^0$ to $10^4$).]

Fig. 3 Library analysis. a A color-coded map of the signal-to-noise ratio (SNR) of sodiated product peaks across a 1500-Ugi library.
b A histogram of the 50 strongest peaks in each product spectrum, offset by their expected Ugi product masses (M). Several common salt adducts and isotopes are labeled. Inset: A histogram of the monoisotopic masses of the 1500-compound library. c The true positive rate (TPR) and false positive rate (FPR) as a function of the SNR threshold for the M+Na peaks. The dashed line (SNR ≈ 31) jointly optimizes the true and false positive rates (TPR = 89.73%, FPR = 8.71%).

[Figure 4 graphic: (a) the Anubis image and its 0/1 data; (b) peak intensity [arb.] versus mixture number (0–1500), with the detection threshold and histograms of present and absent counts; (c) false rate [%] (FPR, FNR) and accuracy [%] versus compound number (sorted); (d) intensity [arb.] versus m/z (540–660), measured and expected, with two insets whose m/z axes are offset by 594.331 and 560.310.]

Fig. 4 Encoding data into small-molecule mixtures. a This molecular dataset is a 48,841 pixel (221 × 221) binary image of the Egyptian god Anubis52. b Sodiated peak intensities for one library element across the 1527 mixtures. c False negative and false positive rates for the 32 compounds, ordered by overall accuracy. d The mass spectrum of a data mixture containing six molecules from the 32-compound subset of the Ugi library. Left inset: the 8th molecule is absent (“0”). Right inset: the 17th molecule is present (“1”).
[Figure 5 graphic: (a) lookup table mapping 16-bit data blocks to sparse mixtures of 512 compounds ($2^{16}$ = 65,536 entries, 32 compounds present in each); (b) decoded error rate [%] versus raw error rate [%], showing the equality line, simulated results, and single peak and multi-peak experiments; (d) mixture count versus mixture error rate [%], annotated 24.1% (single peak) and 4.8% (multi-peak).]

Fig. 5 Using sparse mixture mapping and multi-peak readout to improve data recovery. a Every 16 bits of data is mapped onto a sparse mixture, based on a 512-Ugi library subset, in which only 32 of the 512 compounds are present. There are ~$2^{169}$ such mixtures, but only $2^{16}$ are mapped to data. b Simulated read error rates, before and after decoding. Experimental results from single peak (c) and multi-peak (e) detection are also shown. c A 24,336 pixel binary image of angels at a dinner table from a 16th century print40, recovered with single (sodiated) peak readout (96.67% accurate). d Histograms of the raw compound errors per mixture for each recovery scheme. e The digital image recovered with multi-peak detection (99.89% accurate).

information systems could similarly take advantage of the interplay between macromolecules and small molecules.

By introducing automated synthesis and analysis approaches for multicomponent Ugi products, we produced the largest small-molecule digital information representations described to date. We showed that by using sparse data-to-mixture mapping and applying supervised learning to mass spectrometry data, we can tolerate impurities, improve accuracy, and produce a workflow that readily generalizes to other classes of small molecules.
We previously demonstrated data storage using a library of common metabolites14, which could also benefit from the improved encoding and analysis strategies developed here. However, metabolite memory is more challenging to scale and perhaps more suitable for transient memory or chemical computation. Ugi products, on the other hand, offer combinatorial scaling, excellent stability, and comparatively uniform chemical properties.

In total, we stored 1.8 million bits of data in Ugi molecules, including more than 0.8 million bits on a single plate (Fig. 6). This is still very far from theoretical capacity limits, and there are many parameters that can be further optimized, including library size, plate spot density, and mixture complexity. With incremental improvements in these areas, we could reach several megabytes per plate or more, using largely the same experimental setup. Introducing photolithography or advanced printing could further improve the spatial density and data capacity of a plate by several orders of magnitude. We selected steel plates for their compatibility with available instruments, but future implementations could utilize flexible substrates or reels, or embed chemical information onto the surfaces of three-dimensional objects. MALDI imaging can achieve resolutions finer than 10 microns42, and although reading larger chemical datasets may require different coding strategies, we have not yet approached the information capacity of our readout.

The sensitivity of MALDI mass spectrometry is limited by the presence of background ions, such as matrix adducts and impurities43. With adjustments to ionization, trapping, and excitation, this chemical noise can be mitigated, enabling attomole limits of detection44,45. In our current demonstrations, we estimate the amount of Ugi product ionized per read to be on the order of 10 femtomoles (Supplementary Fig. 12), offering room for future improvement.
There are also other interesting avenues to explore beyond simple capacity improvements. Our sparse mixture mapping (Fig. 5) can be considered a coarse version of block coding, and there would be benefits to exploring more efficient coding schemes for digital error correction. Alternatively, one could leverage sparsity for enhanced information density, and represent many bits per small molecule present. The experimental workflow used here has similarities with early-stage pharmaceutical pipelines46, and it would be exciting to consider how error correction and the correlated statistics of encoded mixtures could be applied to drug discovery and medical applications.

By automating the Ugi reaction, we found that we could synthesize thousands of multicomponent compounds per day using low cost reagents and with minimal manual sample preparation. The yield and quality of the Ugi reactions supported decoding hundreds of bits of data per mixture without any purification, which is partly a result of the fact that here we are interested in the information carried by the unique fingerprint of each library element rather than individual chemical structures. One minor reaction adjustment that we found helpful was to limit the amount of isocyanide to 80% of the other reagents, which seemed to reduce side product formation.

Our ability to apply supervised learning to chemical information recovery stems from the availability of labeled training data, and a willingness to tackle complex mixtures. To utilize the potential of even larger molecular libraries, other approaches may be required. Recent studies have explored the use of autonomous systems for the exploration of chemical spaces47, which pairs well with the idea of mapping chemical mixtures to abstract information. Screening combinatorial libraries in bulk rather than one at a time is already established in some areas of molecular biology, such as aptamer design48.
Extending these information-centric philosophies to more subtle molecular properties and emergent chemical reaction networks may prove particularly fruitful.

Methods

Materials and reagents. The solvent dimethyl sulfoxide (DMSO, anhydrous, ≥99.9%, MilliporeSigma) was used to prepare all solutions in the library and data plates. Analytical grade α-cyano-4-hydroxycinnamic acid (HCCA, ≥99.0%, MilliporeSigma) was used as the matrix material for all MALDI samples. The library of 1500 Ugi products was constructed with the following five amines: benzylamine, 4-methylbenzylamine, p-methoxybenzylamine, 4-chlorobenzylamine, 4-tert-butylbenzylamine; five aldehydes: cyclohexanecarboxaldehyde, 3-cyclohexylpropanal, valeraldehyde, isovaleraldehyde, cyclopentanecarboxaldehyde; 12 carboxylic acids: Boc-glycine, Boc-proline, Boc-N-methyl-L-valine, Boc-L-asparagine, Boc-L-beta-homoleucine, Boc-L-methionine, Boc-L-beta-homoglutamine, Boc-L-beta-homo-methionine, Boc-L-phenylalanine, Boc-N-alpha-N-epsilon-formyl-L-lysine, Boc-N-methyl-L-phenylalanine, Boc-O-methyl-L-tyrosine; and five isocyanides: cyclohexyl isocyanide, ethyl isocyanoacetate, benzyl isocyanide, 2-naphthyl isocyanide, methyl isocyanoacetate. These compounds were obtained at synthesis grade or higher and used as received from their vendors (Chem-Impex for the carboxylic acids and MilliporeSigma for the others). Further details about the reagents can be found in Supplementary Fig. 1.

[Figure 6 graphic: recovered images arranged by digital-to-chemical mapping (direct or sparse) and detection method (single-peak or multi-peak).
Direct mapping, single-peak detection: Anubis, 48,841 bits, 32 Ugis, 97.90%.
Sparse mapping, single-peak detection: Dimna, 12,100 bits, 256 Ugis, 99.02%.
Direct mapping, multi-peak detection: Violin, 879,625 bits, 575 Ugis, 97.57%.
Sparse mapping, multi-peak detection: Angels, 24,336 bits, 512 Ugis, 99.89%.]

Fig. 6 A gallery of digital images written into mixtures of Ugi products. The file sizes, number of compounds used, and readout accuracies are shown below each recovered image. Each dataset was represented using ~1500 mixtures. The Violin experiment was performed twice with similar results.
The images were adapted from artwork from The Metropolitan Museum of Art (Anubis52, Dimna53, Angels40) and the ©Estate of Pablo Picasso/Artists Rights Society (ARS), New York (Violin19).

Library preparation. Each reagent was dissolved in DMSO to a concentration of 500 mM and placed into a 384-well plate. Using an acoustic fluid handler, we dispensed the reagents, 200 nL per inclusion, into a 1536-well plate to enumerate all possible four-component Ugi reactions. The array of reagent mixtures was left to react at room temperature for 1–2 days. After reacting, DMSO was added to each library well to reach a final volume of 4 μL.

Mass spectrometry. Mass spectra were acquired with a Fourier transform ion cyclotron resonance (FT-ICR) mass spectrometer in positive ion mode. Samples were crystallized in matrix (Supplementary Fig. 17), using an ~100:1 ratio of matrix to Ugi product. Samples were ionized using matrix-assisted laser desorption ionization (MALDI). Spectra produced by FT-ICR are particularly high resolution, often reaching peak widths of 0.001 Da or smaller. To ensure the accuracy of peak assignment, a mass calibration is performed before each run using sodium trifluoroacetate as a reference49 (Supplementary Fig. 10). We typically acquire spectra for 1.5 s, which results in a resolving power of $1.3 \times 10^5$ at 600 Da (Supplementary Fig. 15). The instrument serially addresses each crystallized spot (Supplementary Fig. 11), and takes about 4 h to record all 1536 spots on a plate. Each measurement is made by ionizing a portion of a sample with a laser configured to take 500 shots at 1000 Hz, over a scan area of 500–900 μm, with medium focus, and ×4 averaging.
We convert the raw data files from the instrument into a custom HDF5 file, for more efficient querying and ease of access. To normalize signals across measurements, we often convert the raw intensity values of a spectrum to signal-to-noise ratios (SNR) according to the following shift-and-scale relation: $\mathrm{SNR} = (I - \mu)/\sigma$, where $I$ is an intensity and $\mu$ and $\sigma$ are the mean and standard deviation of the spectrum’s background (see Supplement).

Library validation. To identify successful reactions, a small volume (20 nL) from each library well was spotted to a unique location on a stainless steel plate (78 mm × 120 mm) along with matrix (20 nL of 176.2 mM HCCA in DMSO). The plated samples were allowed to dry overnight (≈10 h) into round crystals (≈800 μm in diameter), before analysis via mass spectrometry. In the resulting mass spectra, we looked for peaks corresponding to expected Ugi product masses, and used peak height as a coarse measure of reaction yield.

Since the Ugi products have similar ionization profiles, we performed a global statistical analysis of the library spectra, using the SNR of their sodiated peaks. A common threshold (τ) was found using receiver operator characteristic (ROC) curve analysis50. To construct the ROC curve, we look for the sodiated product peaks across all reaction wells, apply a given SNR threshold to assess the presence or absence of these peaks, tally detected library peaks to estimate the true positive (TPR) and false positive (FPR) rates, and repeat this process for all candidate thresholds. Since there should be exactly one product per well, if the expected product is detected, it is counted as a true positive (TP), and if not, then it is marked as a false negative (FN). Similarly, if other products are detected in the well, they are counted as false positives (FP) and otherwise as true negatives (TN). The products with masses that overlap with that of the expected product are counted as TPs or FNs.
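The normalization and threshold sweep described above can be sketched as follows. This is a minimal illustration and not the authors' analysis code. It assumes that the background is supplied as a separate list of intensities and that the population standard deviation is used, neither of which is specified in the text. The threshold is chosen as the candidate whose ROC point lies closest to the ideal corner (FPR = 0, TPR = 1), which is the criterion the Methods use. All function names and the toy values are hypothetical.

```python
import math
from statistics import mean, pstdev


def to_snr(intensities, background):
    """Shift-and-scale intensities by the background mean and standard deviation."""
    mu, sigma = mean(background), pstdev(background)
    return [(i - mu) / sigma for i in intensities]


def roc_point(snr_present, snr_absent, threshold):
    """Return (TPR, FPR) when peaks with SNR >= threshold are called present."""
    tp = sum(s >= threshold for s in snr_present)
    fn = len(snr_present) - tp
    fp = sum(s >= threshold for s in snr_absent)
    tn = len(snr_absent) - fp
    return tp / (tp + fn), fp / (fp + tn)


def best_threshold(snr_present, snr_absent, candidates):
    """Pick the candidate threshold closest to the (FPR = 0, TPR = 1) corner."""
    def corner_distance(threshold):
        tpr, fpr = roc_point(snr_present, snr_absent, threshold)
        return math.hypot(0 - fpr, 1 - tpr)
    return min(candidates, key=corner_distance)


# Toy SNR values (made up for illustration).
expected_products = [50.0, 40.0, 35.0, 5.0]   # peaks that should be present
other_products = [1.0, 2.0, 3.0, 30.0]        # peaks that should be absent

tau = best_threshold(expected_products, other_products, candidates=[1, 10, 100])
print(tau)                                                  # 10
print(roc_point(expected_products, other_products, tau))    # (0.75, 0.25)
```

With real library spectra the candidate list would span the full SNR range on a logarithmic grid, as in the threshold axis of Fig. 3c.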
The rates can be calculated as $\mathrm{TPR} = \mathrm{TP}/(\mathrm{TP} + \mathrm{FN})$ and $\mathrm{FPR} = \mathrm{FP}/(\mathrm{FP} + \mathrm{TN})$, and used to find an optimal SNR threshold by minimizing the distance to the (0,1)-corner: $\left[(0 - \mathrm{FPR}(\mathrm{SNR}))^{2} + (1 - \mathrm{TPR}(\mathrm{SNR}))^{2}\right]^{1/2}$. The Ugi products whose SNR meets or exceeds this threshold ($\mathrm{SNR} \geq \tau$) are declared present.

Data plate preparation. First, a digital file is converted into a one-dimensional binary vector. This vector is then encoded, either with a direct or sparse mapping, into an $M \times N$ compound-presence matrix, where $M$ is the number of compounds to be used, and $N$ is the number of independent mixtures to be made. The value of element $p_{mn}$ in this matrix indicates the presence (“1”) or absence (“0”) of the $m$th compound in the $n$th mixture. To physically generate the mixtures, 2.5 nL droplets are transferred from the 1536-well library plate to their appropriate locations on a MALDI plate. Finally, 30 nL of matrix solution (176.2 mM HCCA in DMSO) is added to each data mixture spot. The overall time to write a data plate ranged from 0.3 to 7.9 h, varying with the encoding scheme and file size (Supplementary Figs. 6 and 7). Once all transfers are complete, the data plate is left to dry in a fume hood overnight or in a vacuum chamber for about 2 h. The resulting dried mixture spots are typically 1 mm in diameter. Currently, the number of compounds that can be included in each mixture is limited by the layout of samples on a MALDI plate. For a 1536-well grid, spots can contain up to 200 nL of solution before they begin to merge with adjacent samples (Supplementary Fig. 14). For more complex samples, mixing would have to be done in an intermediate well plate.

Data plate analysis. During plate preparation, the matrix solution is spiked with a reference Ugi molecule (Supplementary Fig. 5), which is used to calibrate for small offsets in the recorded masses.
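As an illustration of the direct mapping described under Data plate preparation, the following Python sketch packs a bit vector into an $M \times N$ presence matrix and lists the droplet transfers it implies. The function names are hypothetical, and the bit order (most significant bit first), the column-by-column fill order, and the zero padding of the last mixture are assumptions made for this sketch; the paper does not specify them here.

```python
def bytes_to_bits(data):
    """Flatten a bytes object into a list of bits (assumed MSB first)."""
    return [(byte >> k) & 1 for byte in data for k in range(7, -1, -1)]


def encode_direct(bits, n_compounds):
    """Pack bits into an M x N presence matrix, one mixture per column.

    Row m is a compound and column n is a mixture. The final mixture is
    padded with zeros (assumption) when the bits do not fill it.
    """
    n_mixtures = -(-len(bits) // n_compounds)  # ceiling division
    padded = list(bits) + [0] * (n_mixtures * n_compounds - len(bits))
    return [[padded[n * n_compounds + m] for n in range(n_mixtures)]
            for m in range(n_compounds)]


def transfers(presence):
    """List the (compound, mixture) pairs that need a droplet transfer."""
    return [(m, n)
            for m, row in enumerate(presence)
            for n, p in enumerate(row) if p == 1]


def decode_direct(presence, n_bits):
    """Invert encode_direct by reading the matrix column by column."""
    n_mixtures = len(presence[0])
    flat = [row[n] for n in range(n_mixtures) for row in presence]
    return flat[:n_bits]
```

For example, `encode_direct([1, 0, 1, 1, 0], 2)` gives the rows `[1, 1, 0]` and `[0, 1, 0]`, so three droplets are dispensed, and `decode_direct` recovers the original five bits.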
After offset calibration, raw mass spectra are resampled to a common m/z grid in order to construct a single analysis-ready matrix containing the mass spectra of all spots on a plate. For single peak detection, the sodiated adduct intensities for a product are simply one row in the spectral matrix, and this vector can be thresholded to determine which mixtures contain the compound. The detection threshold for each compound was found using ROC analysis of labeled training data, as previously described for library validation. Recovering the data file from the presence matrix depends on the encoding method. For direct mapping, the matrix is simply reshaped to obtain the stored data. For sparse mappings, each matrix row was matched to the nearest valid key and converted to the corresponding binary data value.

For multi-peak detection, a similar procedure was followed, except that the presence matrix was found by applying a regression model trained to identify each compound based on multiple spectral features. To reduce computational overhead, instead of building the models on the entire mass spectra matrix, masses whose average intensities were close to the noise floor were discarded, reducing the feature space to <1% of its original size, from four million initial points to at most 20,000 candidate masses. For logistic regression, these features were further refined based on AUROC scores. This additional filtering was not needed for random forest regression, since it automatically performs feature selection. The Python library Scikit-learn51 was used to construct a regression model for each compound. Logistic regressions were configured to use 64 spectral peaks, while random forest regressions were configured to use 300 trees of unlimited depth and at most 20,000 spectral features. The regression models used a 30/70 train/test split.

Data availability
The datasets from this study are available from the authors on reasonable request.
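The single-peak decoding path described under Data plate analysis (thresholding, then matching to the nearest valid key for sparse mappings) can be sketched as follows. This is an illustrative sketch only: the codebook shown in the usage note is hypothetical, and the use of Hamming distance as the notion of "nearest" is an assumption, since the paper does not name the metric here.

```python
def threshold_row(snr_row, tau):
    """Convert a row of SNR values into presence bits using threshold tau."""
    return [1 if snr >= tau else 0 for snr in snr_row]


def hamming(a, b):
    """Number of positions at which two equal-length bit sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def nearest_key(observed, codebook):
    """Return the data bits of the valid key closest to the observed row.

    codebook maps each valid key (a tuple of presence bits) to the data
    bits it encodes. Ties go to the first key in the codebook's
    iteration order (assumption).
    """
    best = min(codebook, key=lambda key: hamming(key, observed))
    return codebook[best]
```

With a hypothetical codebook such as `{(1, 1, 0, 0, 0, 0, 0, 0): (0, 0), (0, 0, 1, 1, 0, 0, 0, 0): (0, 1), (0, 0, 0, 0, 1, 1, 0, 0): (1, 0), (0, 0, 0, 0, 0, 0, 1, 1): (1, 1)}`, an observed row with one missed compound, `(0, 0, 1, 0, 0, 0, 0, 0)`, still decodes to `(0, 1)`. This tolerance to isolated detection errors is the reason sparse keys are spaced apart.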
Code availability
The software used in this study is based on code available from the Metabolomics Workbench data repository (study ST001173), and is available from the authors on reasonable request.

Received: 28 August 2019; Accepted: 8 January 2020.

References
1. Church, G. M., Gao, Y. & Kosuri, S. Next-generation digital information storage in DNA. Science 337, 1628–1628 (2012).
2. Zhirnov, V., Zadegan, R. M., Sandhu, G. S., Church, G. M. & Hughes, W. L. Nucleic acid memory. Nat. Mater. 15, 366–370 (2016).
3. Goldman, N. et al. Towards practical, high-capacity, low-maintenance information storage in synthesized DNA. Nature 494, 77 (2013).
4. Ceze, L., Nivala, J. & Strauss, K. Molecular digital data storage using DNA. Nat. Rev. Genet. 20, 456–466 (2019).
5. Davis, J. Microvenus. Art. J. 55, 70–74 (1996).
6. Erlich, Y. & Zielinski, D. DNA Fountain enables a robust and efficient storage architecture. Science 355, 950–954 (2017).
7. Colquhoun, H. & Lutz, J.-F. Information-containing macromolecules. Nat. Chem. 6, 455–456 (2014).
8. Rutten, M. G. T. A., Vaandrager, F. W., Elemans, J. A. A. W. & Nolte, R. J. M. Encoding information into polymers. Nat. Rev. Chem. 2, 1 (2018).
9. Roy, R. K. et al. Design and synthesis of digitally encoded polymers that can be decoded and erased. Nat. Commun. 6, 7237 (2015).
10. Martens, S. et al. Multifunctional sequence-defined macromolecules for chemical data storage. Nat. Commun. 9, 4451 (2018).
11. König, N. F. et al. Photo-editable macromolecular information. Nat. Commun. 10, 3774 (2019).
12. Tabatabaei, S. K. et al. DNA punch cards: encoding data on native DNA sequences via nicking. Preprint at https://doi.org/10.1101/672394v5, 672394 (2019).
13. Cafferty, B. J. et al. Storage of information using small organic molecules. ACS Cent. Sci. 5, 911–916 (2019).
14. Kennedy, E. & Arcadia, C. E. et al. Encoding information in synthetic metabolomes. PLOS ONE 14, 1–12 (2019).
15. Rosenstein, J. K. et al.
Principles of information storage in small-molecule mixtures. Preprint at https://arxiv.org/abs/1905.02187 (2019).
16. Finkel, S. E. & Kolter, R. DNA as a nutrient: novel role for bacterial competence gene homologs. J. Bacteriol. 183, 6288–6293 (2001).
17. Boukis, A. C., Reiter, K., Frölich, M., Hofheinz, D. & Meier, M. A. R. Multicomponent reactions provide key molecules for secret communication. Nat. Commun. 9, 1439 (2018).
18. Sarkar, T., Selvakumar, K., Motiei, L. & Margulies, D. Message in a molecule. Nat. Commun. 7, 11374 (2016).
19. Picasso, P. Violin (©Estate of Pablo Picasso/Artists Rights Society (ARS), New York, 1912).
20. Li, Q., Jiang, A. & Haratsch, E. F. Noise modeling and capacity analysis for NAND flash memories. In Proc. IEEE International Symposium on Information Theory, 2262–2266 (IEEE, 2014).
21. Hong, S. Memory technology trend and future challenges. In International Electron Devices Meeting, 12–4 (IEEE, 2010).
22. Kim, W. et al. Multi-layered vertical gate NAND flash overcoming stacking limit for terabit density storage. IEEE Symposium on VLSI Technology, 188–189 (2009).
23. Lee, H. H., Kalhor, R., Goela, N., Bolot, J. & Church, G. M. Terminator-free template-independent enzymatic DNA synthesis for digital information storage. Nat. Commun. 10, 2383 (2019).
24. Lopez, R. et al. DNA assembly for nanopore data storage readout. Nat. Commun. 10, 2933 (2019).
25. Ugi, I. The α-addition of immonium ions and anions to isonitriles accompanied by secondary reactions. Angew. Chem. Int. Ed. Engl. 1, 8–21 (1962).
26. Marcaccini, S. & Torroba, T. The use of the Ugi four-component condensation. Nat. Protoc.
2, 632 (2007).
27. Pirrung, M. C. & Sarma, K. D. Multicomponent reactions are accelerated in water. J. Am. Chem. Soc. 126, 444–445 (2004).
28. Zhang, J. et al. Asymmetric phosphoric acid-catalyzed four-component Ugi reaction. Science 361, 8707 (2018).
29. Strocker, A. M., Keating, T. A., Tempest, P. A. & Armstrong, R. W. Use of a convertible isocyanide for generation of Ugi reaction derivatives on solid support: Synthesis of α-acylaminoesters and pyrroles. Tetrahedron Lett. 37, 1149–1152 (1996).
30. Short, K. M., Ching, B. W. & Mjalli, A. M. M. Exploitation of the Ugi 4CC reaction: Preparation of small molecule combinatorial libraries via solid phase. Tetrahedron 53, 6653–6679 (1997).
31. Lin, Q., O’Neil, J. C. & Blackwell, H. E. Small molecule macroarray construction via Ugi four-component reactions. Org. Lett. 7, 4455–4458 (2005).
32. Hoel, A. M. L. & Nielsen, J. Microwave-assisted solid-phase Ugi four-component condensations. Tetrahedron Lett. 40, 3941–3944 (1999).
33. Tye, H. & Whittaker, M. Use of a Design of Experiments approach for the optimisation of a microwave assisted Ugi reaction. Org. Biomol. Chem. 2, 813–815 (2004).
34. Brauch, S., van Berkel, S. S. & Westermann, B. Higher-order multicomponent reactions: beyond four reactants. Chem. Soc. Rev. 42, 4948–4962 (2013).
35. Boukis, A. C. & Meier, M. A. R. Data storage in sequence-defined macromolecules via multicomponent reactions. Eur. Polym. J. 104, 32–38 (2018).
36. Nikolaev, E. N., Kostyukevich, Y. I. & Vladimirov, G. N. Fourier transform ion cyclotron resonance (FT ICR) mass spectrometry: Theory and simulations. Mass Spectrom. Rev. 35, 219–258 (2016).
37. Amster, I. J. Fourier Transform Mass Spectrometry. J. Mass Spectrom. 31, 1325–1337 (1996).
38. Karas, M., Bahr, U. & Gießmann, U. Matrix-assisted laser desorption ionization mass spectrometry. Mass Spectrom. Rev. 10, 335–357 (1991).
39. Arcadia, C. E. et al. Parallelized Linear Classification with Volumetric Chemical Perceptrons.
IEEE International Conference on Rebooting Computing, 1–9 (2018).
40. Baldung, H. Angels Served at a Table. Accession 17.3.3034. Metropolitan Museum of Art, NY, USA. (1507).
41. Nakabachi, A. et al. The 160-Kilobase Genome of the Bacterial Endosymbiont Carsonella. Science 314, 267–267 (2006).
42. Römpp, A. & Spengler, B. Mass spectrometry imaging with high resolution in mass and space. Histochem. Cell Biol. 139, 759–783 (2013).
43. Krutchinsky, A. N. & Chait, B. T. On the nature of the chemical noise in MALDI mass spectra. J. Am. Soc. Mass Spectrom. 13, 129–134 (2002).
44. Moyer, S. C., Budnik, B. A., Pittman, J. L., Costello, C. E. & O’Connor, P. B. Attomole Peptide Analysis by High-Pressure Matrix-Assisted Laser Desorption/Ionization Fourier Transform Mass Spectrometry. Anal. Chem. 75, 6449–6454 (2003).
45. Solouki, T., Marto, J. A., White, F. M., Guan, S. & Marshall, A. G. Attomole Biomolecule Mass Analysis by Matrix-Assisted Laser Desorption/Ionization Fourier Transform Ion Cyclotron Resonance. Anal. Chem. 67, 4139–4144 (1995).
46. Hughes, J. P., Rees, S., Kalindjian, S. B. & Philpott, K. L. Principles of early drug discovery. Br. J. Pharmacol. 162, 1239–1249 (2011).
47. Kitson, P. J. et al. Digitization of multistep organic synthesis in reactionware for on-demand pharmaceuticals. Science 359, 314–319 (2018).
48. Han, K., Liang, Z. & Zhou, N. Design Strategies for Aptamer-Based Biosensors. Sensors 10, 4541–4557 (2010).
49. Moini, M., Jones, B. L., Rogers, R. M. & Jiang, L. Sodium trifluoroacetate as a tune/calibration compound for positive- and negative-ion electrospray ionization mass spectrometry in the mass range of 100–4000 Da. J. Am. Soc. Mass Spectrom. 9, 977–980 (1998).
50. Unal, I. Defining an Optimal Cut-Point Value in ROC Analysis: An Alternative Approach. Comput. Math. Methods Med. 2017, 1–14 (2017).
51. Pedregosa, F. et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
52. Unknown artist.
Book of the Dead for the Singer of Amun, Nany. Accession 30.3.31. Metropolitan Museum of Art, NY, USA. (ca. 1050 B.C.).
53. Unknown artist. Leopard Bearing Lion’s Order to Fellow Judges. Accession 1981.373.51. Metropolitan Museum of Art, NY, USA. (18th century).

Acknowledgements
This research was supported by funding from the Defense Advanced Research Projects Agency (DARPA W911NF-18-2-0031). The views, opinions, and/or findings expressed are those of the authors and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government. This work was also made possible by support from the Office of the Vice President for Research at Brown University, and by the National Science Foundation under Grant No. 1941344.

Author contributions
C.E.A., E.K. and J.G. performed experiments. C.E.A., E.K., J.G. and J.K.R. analyzed data. C.E.A., A.D., K.O., S.-L.C. and L.S. synthesized the library. J.S., P.M.W., S.R., C.R., M.O., E.K., B.M.R. and J.K.R. provided direction and oversight. C.E.A., E.K. and J.K.R. drafted the paper. All authors provided notes and edits to the paper.

Competing interests
A pending patent application (PCT/US2019/038301) concerning data storage and computation using small molecules, including Ugi reaction products, has been filed by Brown University, with the following authors included on it as inventors: C.E.A., S.L.C., A.D., J.G., E.K., E.K., K.O., S.R., C.R., J.S., P.M.W., B.M.R., and J.K.R.

Additional information
Supplementary information is available for this paper at https://doi.org/10.1038/s41467-020-14455-1.

Correspondence and requests for materials should be addressed to J.K.R.

Peer review information Nature Communications thanks the anonymous reviewers for their contribution to the peer review of this work.
Peer reviewer reports are available.

Reprints and permission information is available at http://www.nature.com/reprints

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
© The Author(s) 2020