title: Deep Molecular Representation Learning via Fusing Physical and Chemical Information
authors: Yang, Shuwen; Li, Ziyao; Song, Guojie; Cai, Lingsheng
date: 2021-11-28

Molecular representation learning is the first yet vital step in combining deep learning and molecular science. To push the boundaries of molecular representation learning, we present PhysChem, a novel neural architecture that learns molecular representations via fusing the physical and chemical information of molecules. PhysChem is composed of a physicist network (PhysNet) and a chemist network (ChemNet). PhysNet is a neural physical engine that learns molecular conformations through simulating molecular dynamics with parameterized forces; ChemNet implements geometry-aware deep message-passing to learn the chemical / biomedical properties of molecules. The two networks specialize in their own tasks and cooperate by providing expertise to each other. By fusing physical and chemical information, PhysChem achieved state-of-the-art performances on MoleculeNet, a standard molecular machine learning benchmark. The effectiveness of PhysChem was further corroborated on cutting-edge datasets of SARS-CoV-2.

The intersection of deep learning and molecular science has recently caught the eye of researchers in both areas. Remarkable progress has been made in applications including molecular property prediction [37, 16], molecular graph generation [38, 12, 29], and virtual screening for drug discovery [8, 32], yet learning representations for molecules remains the first and vital step. Molecular representation learning, or learning molecular fingerprints, aims to encode input notations of molecules into numerical vectors, which later serve as features for downstream tasks. Earlier deep molecular representation methods generally used off-the-shelf network architectures, including message-passing neural networks [9], graph attention networks [37] and Transformers [11, 25]. These methods took either line notations (e.g. SMILES) or graph notations (i.e. structural formulas) of molecules as inputs, whereas the physical and chemical essence of molecules was largely neglected. Notably, a trend of integrating 3D conformations (i.e. the 3D Cartesian coordinates of atoms) into molecular representations has recently emerged [4, 15, 30], while most of these methods assume the availability of labeled conformations of the target molecules. In order to push the boundaries of molecular representation learning, we revisited molecules from both physical and chemical perspectives. Modern physicists generally regard molecules as particle systems that continuously move following the laws of (quantum) mechanics. The dominant conformations of molecules reflect the equilibria of these micro-mechanical systems, and are thus of wide interest. Chemists, on the other hand, focus more on the chemical bonds and functional groups of molecules, which denote the interactions of electrons and determine chemical / biomedical properties such as solubility and toxicity. Nevertheless, the physical and chemical information of molecules is not orthogonal. For example, torsions of bond lengths and angles greatly influence the dynamics of particle systems. Therefore, an ideal molecular representation is not only expected to capture both physical and chemical information, but also to appropriately fuse the two types of information.
Based on the observations above, we propose PhysChem, a novel neural architecture that captures and fuses the physical and chemical information of molecules. PhysChem is composed of two specialist networks, namely a physicist network (PhysNet) and a chemist network (ChemNet), which understand molecules physically and chemically. PhysNet is a neural physical engine that learns the dominant conformations of molecules via simulating molecular dynamics in a generalized space. In PhysNet, implicit positions and momenta of atoms are initialized by encoding input features. Forces between pairs of atoms are learned with neural networks, according to which the system moves following the laws of classical mechanics. The final positions of atoms are supervised with labeled conformations under spatially invariant losses. ChemNet utilizes a message-passing framework [9] to capture the chemical characteristics of atoms and bonds. ChemNet generates messages from atom states and local geometries, and then updates the states of both atoms and bonds. The output molecular representations are merged from the atomic states and supervised with labeled chemical / biomedical properties. Besides focusing on their own specialties, the two networks also cooperate by sharing expertise: PhysNet consults the hidden representations of chemical bonds in ChemNet to generate torsion forces, whereas ChemNet leverages the local geometries of the intermediate conformations in PhysNet. Compared with existing methods, PhysChem adopts a more elaborate and more interpretable architecture that embodies physical and chemical understandings of molecules. Moreover, as PhysNet learns molecular conformations from scratch, PhysChem does not require labeled conformations of test molecules. This extends the applicability of PhysChem to situations where such labels are unavailable, for example, with neural-network-generated drug candidates. We evaluated PhysChem on several datasets in the MoleculeNet [36] benchmark, where PhysChem displayed state-of-the-art performances on both conformation learning and property prediction tasks. Results on cutting-edge datasets of SARS-CoV-2 further proved the effectiveness of PhysChem.

Molecular Representation Learning Early molecular fingerprints commonly encoded line or graph notations of molecules with rule-based algorithms [20, 23, 26]. Along with the spurt of deep learning, deep molecular representations gradually prevailed [7, 9, 11, 37]. More recently, researchers have started to focus on incorporating 3D conformations of molecules into their representations [1, 4, 30, 15]. Models that leverage the 3D geometries of molecules generally perform better than those that simply use graph notations, whereas most 3D models require labeled conformations of the target molecules, which limits their applicability. Among previous studies, message-passing neural networks (MPNNs) [9] proposed a universal framework for encoding molecular graphs, which assumes that nodes in graphs (atoms in molecules) pass messages to their neighbors and then aggregate the received messages to update their states. A general message-passing layer calculates
$$m_i = \sum_{j \in \mathcal{N}(i)} M\big(h_i, h_j, e_{i,j}\big), \qquad h_i' = U\big(h_i, m_i\big),$$
where $i, j \in V$ are graph nodes, $\mathcal{N}(i)$ denotes the neighbors of node $i$, the $h$s and $e$s are the states of nodes and edges, and $M(\cdot)$, $U(\cdot)$ are the message and update functions. For graph-level tasks, MPNNs further define a readout function which merges the states of all nodes into graph representations.
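To make this concrete, the sketch below implements one generic message-passing layer of the above form in PyTorch. Choosing an MLP for $M(\cdot)$ and a GRU cell for $U(\cdot)$ is our illustrative assumption here, not the instantiation used by any specific model cited above.

```python
import torch
import torch.nn as nn

class MessagePassingLayer(nn.Module):
    """A generic MPNN layer: nodes aggregate messages from their neighbors.

    Illustrative sketch; using an MLP for M(.) and a GRU cell for U(.)
    are assumptions, not the exact functions of any cited model.
    """
    def __init__(self, node_dim: int, edge_dim: int):
        super().__init__()
        self.message_fn = nn.Sequential(               # M(h_i, h_j, e_ij)
            nn.Linear(2 * node_dim + edge_dim, node_dim), nn.ReLU(),
            nn.Linear(node_dim, node_dim),
        )
        self.update_fn = nn.GRUCell(node_dim, node_dim)  # U(h_i, m_i)

    def forward(self, h, edge_index, e):
        # h: (n, node_dim) node states; e: (m, edge_dim) edge states
        # edge_index: (2, m) with rows (source j, target i)
        src, dst = edge_index
        msg = self.message_fn(torch.cat([h[dst], h[src], e], dim=-1))
        agg = torch.zeros_like(h).index_add_(0, dst, msg)  # sum over neighbors
        return self.update_fn(agg, h)                      # updated node states
```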
In this work, ChemNet modifies the message-passing scheme of MPNNs: messages are extended to incorporate the geometries of local structures, and the states of both atoms and bonds are updated. See Section 3.4 for more details.

Neural Physical Engines Recent studies showed that neural networks are capable of learning annotated (or pseudo) potentials and forces in particle systems, which made fast molecular simulations [39, 17] and protein-folding tasks [27] possible. Notably, it was further shown in [16, 28] that neural networks alone can simulate molecular dynamics for conformation prediction. As an instance, HamNet [16] proposed a neural physical engine that operated in a generalized space, where the positions and momenta of atoms were defined as high-dimensional vectors. In the engine, atoms moved following the Hamiltonian equations with parameterized kinetic, potential and dissipation functions. PhysNet in our work is a similar engine. Nevertheless, instead of learning parameterized energies and calculating their negative derivatives as forces, we directly parameterize the forces between each pair of atoms. In addition, HamNet considered gravitations and repulsions between atoms based on implicit positions, while it ignored the effects of chemical interactions: for example, the types of chemical bonds were ignored in the energy functions. PhysChem fixes this issue via the cooperation mechanism between the two specialist networks. Specifically, PhysNet takes chemical expertise (the bond states) from ChemNet and introduces torsion forces, i.e. forces that originate from torsions of chemical bonds, into the dynamics. See Section 3.3 for more details.

The cooperation mechanism in PhysChem shares similar motivations with multi-task learning and model fusion. Multi-task learning [40, 6] is now almost universally used in deep learning models: representations are shared among a collection of related tasks in order to learn common knowledge. Model fusion, on the other hand, merges different models on identical tasks to improve performances [24, 13]. Notably, these techniques have previously been applied to molecular tasks [31, 13, 32]. In PhysChem, the conformation learning and property prediction tasks are jointly trained, with the two specialist networks fused to achieve better performances. Nevertheless, the cooperation mechanism in PhysChem is rooted in observations from physics and chemistry, and enjoys better interpretability than straightforward ensemble strategies. The advantages of the cooperation mechanism over multi-task strategies are also shown empirically in Section 4.

Notations In the statements and equations below, we use italic letters for scalars and indices, bold lower-case letters for (column) vectors, bold upper-case letters for matrices, calligraphic letters for sets, and normal letters for annotations. Common neural network layers are directly referred to: FC(x) denotes a fully-connected layer with input x; GCN(A, X) denotes a (vanilla) graph convolutional network (GCN) [14] with adjacency matrix A and feature matrix X; GRUcell(s, x) denotes a cell of the Gated Recurrent Units (GRU) [5] with recurrent state s and input signal x; LSTM({x_t}) denotes a Long Short-Term Memory network [10] with input signals {x_t}; MaxPool({x_t}) denotes a max-pooling layer with inputs {x_t}. Exact formulas are available in the appendix. ⊕ is the concatenation operator. ∥·∥ denotes the ℓ2 norm of the input vector.
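As a rough illustration of this notation, the following PyTorch sketch shows plausible counterparts of the layers named above. This is an illustrative correspondence under our own assumptions (e.g. the row-normalized GCN layer), not the exact formulations given in the appendix.

```python
import torch
import torch.nn as nn

# Rough PyTorch counterparts of the notation (illustrative only):
fc = nn.Sequential(nn.Linear(64, 64), nn.ReLU())       # FC(x)
gru_cell = nn.GRUCell(input_size=64, hidden_size=64)   # GRUcell(s, x)
lstm = nn.LSTM(input_size=64, hidden_size=64)          # LSTM({x_t})

def gcn_layer(A, X, W):
    """One (vanilla) GCN layer with row-normalized neighborhood averaging.
    A: (n, n) adjacency matrix; X: (n, d) features; W: (d, d') weights."""
    A_hat = A + torch.eye(A.shape[0])                  # add self-loops
    deg = A_hat.sum(dim=1, keepdim=True)
    return torch.relu((A_hat / deg) @ X @ W)           # GCN(A, X)

def max_pool(xs):
    """MaxPool({x_t}): element-wise max over a set of vectors."""
    return torch.stack(xs).max(dim=0).values
```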
Problem Definition PhysChem considers molecular representation learning as a supervised learning task. It takes notations of molecules as inputs, and conformations and chemical / biomedical properties as supervisions. PhysChem assumes a pre-conducted featurization process, after which a molecule can be denoted as an attributed graph $\mathcal{M} = (V, E, n, m, X^v, X^e)$. Here, $V$ is the set of $n$ atoms, $E \subset V \times V$ is the set of $m$ chemical bonds, $X^v \in \mathbb{R}^{n \times d_v} = (x^v_1, \cdots, x^v_n)^\top$ is the matrix of atomic features, and $X^e \in \mathbb{R}^{m \times d_e} = (x^e_1, \cdots, x^e_m)^\top$ that of bond features. Based on the above inputs, PhysNet outputs the dominant conformations of molecules, and ChemNet outputs the representations as well as the predicted chemical / biomedical properties of molecules.

Overview Figure 1 is a sketch of PhysChem. An initializer first encodes the inputs into initial atom and bond states ($v^{(0)}$ and $e^{(0)}$) for ChemNet, along with the initial atomic positions and momenta ($q^{(0)}$ and $p^{(0)}$) in PhysNet. Subsequently, $L$ PhysNet blocks simulate neural molecular dynamics in the generalized space; $L$ ChemNet blocks conduct geometry-aware message-passing for atoms and bonds. Between each couple of blocks, implicit conformations ($q$s) and states of chemical bonds ($e$s) are shared as expertise. At the top of PhysNet, a conformation decoder transforms implicit atomic positions into the 3D Euclidean space; at the top of ChemNet, a sequence of readout layers aggregates atom states into molecular representations, based on which molecular properties are predicted with task-specific layers. Notably, the two specialist networks are jointly optimized in PhysChem.

Initializer Given a molecular graph $\mathcal{M} = (V, E, n, m, X^v, X^e)$, the initializer generates the initial values for both PhysNet and ChemNet variables. We first encode the input features into initial atom and bond states with fully connected layers, i.e.
$$v^{(0)}_i = \mathrm{FC}(x^v_i), \qquad e^{(0)}_{i,j} = \mathrm{FC}(x^e_{i,j}).$$
We then adopt the initialization method in [16] to generate initial positions ($q^{(0)}$) and momenta ($p^{(0)}$) for atoms: a bond-strength adjacency matrix $A \in \mathbb{R}^{n \times n}$ is estimated with sigmoid-activated FC layers on bond features, according to which a GCN captures the chemical environments of atoms (as $\tilde v$); an LSTM then determines unique positions for atoms, especially for those with identical chemical environments (carbons in benzene, for example). Denoted in formula, the initialization follows
$$A_{i,j} = \sigma\big(\mathrm{FC}(x^e_{i,j})\big), \qquad \tilde V = \mathrm{GCN}\big(A, V^{(0)}\big), \qquad \big(q^{(0)}_i, p^{(0)}_i\big) = \mathrm{LSTM}\big(\{\tilde v_i\}\big),$$
where $V^{(0)} = (v^{(0)}_1, \cdots, v^{(0)}_n)$ and $\tilde V = (\tilde v_1, \cdots, \tilde v_n)$ are the atom states. The order of atoms in the LSTM is specified by the canonical SMILES of the molecule.

PhysNet Overall, PhysNet simulates the dynamics of atoms in a generalized space. In a molecule viewed as a classical dynamical system, atoms move following Newton's second law:
$$\dot q = \frac{p}{m}, \qquad \dot p = f,$$
where $q$, $p$ and $m$ are the position, momentum and mass of an atom, and $f$ is the force that the atom experiences. With uniform temporal discretization, the above equations may be approximated with
$$q^{(s+1)} = q^{(s)} + \tau \cdot \frac{p^{(s)}}{m}, \qquad p^{(s+1)} = p^{(s)} + \tau \cdot f^{(s)}, \qquad (6)$$
where $s$ is the index of timestamps and $\tau$ is the temporal interval. The calculations in PhysNet simulate such a process. Parameters of PhysNet blocks learn to derive the forces $f$ from intermediate molecular conformations and states of chemical bonds. Correspondingly, two types of forces are modeled in PhysNet, namely the positional forces and the torsion forces.
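As an illustration of the discretized dynamics in Equation (6), here is a minimal sketch of one Euler update in the generalized space. The `force_fn` argument is a hypothetical stand-in for PhysNet's parameterized positional and torsion forces, which are defined below.

```python
import torch

def euler_step(q, p, f, mass, tau=0.25):
    """One step of the discretized Newtonian dynamics (Equation (6)):
        q <- q + tau * p / m,   p <- p + tau * f.
    q, p, f: (n, d_f) positions / momenta / forces in the generalized space;
    mass: (n, 1) atomic masses; tau: temporal interval (0.25 in the paper).
    """
    q_next = q + tau * p / mass
    p_next = p + tau * f
    return q_next, p_next

def simulate_block(q, p, mass, force_fn, steps=4, tau=0.25):
    """Simulate S steps within one block, recomputing forces from the
    intermediate conformation; `force_fn` is a hypothetical stand-in."""
    for _ in range(steps):
        q, p = euler_step(q, p, force_fn(q, p), mass, tau)
    return q, p
```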
Positional Forces The positional forces in PhysNet model the gravitations and repulsions between pairs of atoms. In conventional molecular force fields, these forces are generally modeled with (negative) derivatives of Lennard-Jones potentials. We therefore propose a similar form for the positional forces, taking atomic distances as determinants:
$$f^{\mathrm{pos}}_{j,i} = \big(r_{j,i}^{-2} - r_{j,i}^{-1}\big)\, d_{j,i},$$
where $d_{j,i} = \frac{q_i - q_j}{\| q_i - q_j \|}$ is the unitary directional vector from atom $j$ to atom $i$. Instead of using $\ell_2$ distances in the generalized space, we use parameterized interaction distances $r_{j,i}$ to estimate the forces, which increases the capability of the network. Here, $r^{-2} - r^{-1}$ approximates the landscape of the derivatives of Lennard-Jones potentials with lower-degree polynomials. The approximation is made to avoid the numerical issues of using high-degree polynomials, most typically gradient explosion.

Torsion Forces The torsion forces model the mechanical effects of torsions in local structures. They are defined between pairs of atoms that are directly connected by chemical bonds. Estimating local torsions only with the positions and momenta of atoms is suboptimal, as the network lacks prior knowledge of the chemical characteristics of the bonds (default bond lengths, for example). Alternatively, we leverage the bond states ($e_{i,j}$) in ChemNet as chemical expertise: the torsion forces $f^{\mathrm{tor}}_{j,i}$ are calculated from the bond states with parameterized layers. As the torsion forces model chemical interactions, we do not explicitly incorporate atomic mass into the calculation. Notably, atomic information is implicitly considered in the torsion forces, as the bond states integrate the atom states at both ends in ChemNet (see the next subsection).

Dynamics After estimating the positional and torsion forces, PhysNet simulates the dynamics of atoms following Equation (6). In the $l$-th block of PhysNet, $S$ steps of dynamics are simulated, with the total force on atom $i$ taken as the sum of the positional and torsion forces it experiences, i.e. $f_i = \sum_{j \neq i} f^{\mathrm{pos}}_{j,i} + \sum_{j \in \mathcal{N}(i)} f^{\mathrm{tor}}_{j,i}$. Here, $q^{(l,s)}$ and $p^{(l,s)}$ denote the atomic positions and momenta after $s$ simulation steps in the $l$-th block.

Conformation Decoder and Loss We use a simple linear transformation to decode the implicit atomic positions (and momenta) from the generalized space into the real 3D space:
$$\hat R = Q W,$$
where $Q = (q_1, \cdots, q_n) \in \mathbb{R}^{n \times d_f}$ is the position matrix in the generalized space, $W \in \mathbb{R}^{d_f \times 3}$ is learnable, and $\hat R = (\hat r_1, \cdots, \hat r_n) \in \mathbb{R}^{n \times 3}$ that of the predicted 3D conformation. We further propose Conn-$k$ (the $k$-hop connectivity loss), a spatially invariant loss that supervises PhysNet based on local distance errors. Let $C^{(k)} \in \{0, 1\}^{n \times n}$ be the $k$-hop connectivity matrix ($C^{(k)}_{i,j} = 1$ if and only if atoms $i$ and $j$ are $k$-or-less-hop connected on the molecular graph) and $D$ the distance matrix ($D_{i,j} = \|r_i - r_j\|$, where $r \in \mathbb{R}^3$ are the 3D coordinates of atoms). The $k$-hop connectivity loss is then defined as
$$\mathcal{L}_{\text{Conn-}k}\big(D, \hat D\big) = \big\| \hat C^{(k)} \odot \big(D - \hat D\big) \big\|_F^2, \qquad (12)$$
where $\odot$ is the element-wise product, $\|\cdot\|_F$ is the Frobenius norm, $(D, \hat D)$ are the distance matrices of the real and predicted conformations $(R, \hat R)$, and $\hat C^{(k)}$ is the normalized $k$-hop connectivity matrix. The total loss of PhysNet is defined as the weighted sum of Conn-3 losses on all timestamps:
$$\mathcal{L}_{\mathrm{phys}} = \frac{1}{Z} \sum_{l=1}^{L} \sum_{s=1}^{S} \eta^{\,LS - ((l-1)S + s)}\, \mathcal{L}_{\text{Conn-}3}\big(D, \hat D^{(l,s)}\big),$$
where $\eta < 1$ is a decay factor and $Z = 0.01$ is an empirical normalization factor.
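A minimal PyTorch sketch of the Conn-k loss as reconstructed in Equation (12) follows. Normalizing the connectivity mask by its number of nonzero entries is our assumption, as the text only states that $\hat C^{(k)}$ is a normalized version of $C^{(k)}$.

```python
import torch

def conn_k_loss(D_true, D_pred, C_k):
    """k-hop connectivity loss (a sketch of Equation (12)).

    D_true, D_pred: (n, n) pairwise-distance matrices of the labeled and
    predicted conformations; C_k: (n, n) binary k-hop connectivity matrix.
    Normalizing the mask by its number of nonzeros is an assumption here.
    """
    C_hat = C_k / C_k.sum().clamp(min=1)                 # assumed normalization
    return torch.linalg.norm(C_hat * (D_true - D_pred)) ** 2  # squared Frobenius
```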
ChemNet ChemNet conducts message-passing for both atoms and chemical bonds. Besides generating messages with atom states, ChemNet also considers local geometries, including bond lengths and angles, to adequately characterize local chemical environments. Specifically, in each ChemNet block, triplet descriptors are established in the atomic neighborhoods. Atoms merge relevant triplet descriptors to generate messages, and then aggregate the received messages to update their states. Chemical bonds also update their states by aggregating the states of the atoms at both ends.

Triplet Descriptor The triplet descriptors $t_{i,j,k}$ are descriptive vectors defined on all atom triplets $(j, i, k)$ with $j, k \in \mathcal{N}(i)$, where $\mathcal{N}(i) = \{j \mid (i, j) \in E\}$ denotes the atomic neighborhood of $i$. Each descriptor combines the atom states $v$ with the representations of bond lengths and bond angles, $\boldsymbol{l}_{i,j} = \mathrm{FC}(l_{i,j})$ and $a_{j,i,k} = \mathrm{FC}(\cos(\angle_{j,i,k}))$. The motivation of constructing triplet descriptors is that these features are geometrically deterministic and invariant: i) all bond lengths and angles in an atomic neighborhood together compose a minimum set of variables that uniquely determine the geometry of the neighborhood (deterministic); ii) these features all enjoy translational and rotational invariance (invariant). When PhysNet and ChemNet are jointly optimized, the bond lengths $l_{i,j}$ and angle cosines $\cos(\angle_{j,i,k})$ are computed from the intermediate conformations in PhysNet.

Message-passing After establishing the triplet descriptors, ChemNet generates atomic messages by merging the relevant descriptors: the message from atom $j$ to atom $i$ in the $l$-th block, $m^{(l)}_{j,i}$, is obtained by pooling the descriptors $t_{i,j,k}$ over $k \in \mathcal{N}(i)$. Subsequently, ChemNet utilizes a similar architecture to [37] to conduct message-passing. Centric atoms aggregate the received messages with attention scores determined by the bond states, and the atom states are then updated with GRU cells that take the previous states as recurrent states. A similar process is then conducted for all chemical bonds, where the messages are generated by the states of the atoms at both ends and the bond states are likewise updated with GRU cells.

Representation Readout The molecular representation is finally read out with $T$ global attentive layers [37], in which a virtual meta-atom continuously collects messages from all atoms in the molecule and updates its state. The state of the meta-atom ($v_{\mathrm{meta}}$) is initialized from the atom states and updated by attending over all atoms in each layer. The final state of the meta-atom is used as the molecular representation. Labeled chemical and / or biomedical properties are used to supervise the representations, and the loss of ChemNet, $\mathcal{L}_{\mathrm{chem}}$, is determined task-specifically. The total loss of PhysChem is the weighted sum of the losses of the two networks, controlled by a hyperparameter $\lambda$:
$$\mathcal{L}_{\mathrm{total}} = \lambda\, \mathcal{L}_{\mathrm{phys}} + \mathcal{L}_{\mathrm{chem}}. \qquad (19)$$
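The geometric inputs to the triplet descriptors can be computed directly from the intermediate PhysNet positions. The sketch below follows the angle-cosine formula given in the appendix; the flat index-tensor batching is our own choice.

```python
import torch

def triplet_geometry(q, i, j, k):
    """Bond lengths and angle cosines for triplets (j, i, k) centered at i.

    q: (n, d) atomic positions (e.g. intermediate PhysNet conformations);
    i, j, k: index tensors of equal length, one entry per triplet.
    Returns l_ij, l_ik (bond lengths) and the cosine of the angle at atom i.
    """
    r_ji = q[i] - q[j]                       # vector from atom j to atom i
    r_ki = q[i] - q[k]                       # vector from atom k to atom i
    l_ij = r_ji.norm(dim=-1)                 # bond length ||q_i - q_j||
    l_ik = r_ki.norm(dim=-1)
    cos_a = (r_ji * r_ki).sum(-1) / (l_ij * l_ik).clamp(min=1e-8)
    return l_ij, l_ik, cos_a
```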
Datasets We evaluated PhysChem on several datasets in the MoleculeNet [36] benchmark (publicly available at http://moleculenet.ai/datasets-1). COVID19 [3] is a collection of datasets generated by screening a panel of SARS-CoV-2-related assays against approved drugs; it is available at https://opendata.ncats.nih.gov/covid19/ (CC BY 4.0 license) and is continuously extended, and the data used in our experiments were downloaded on February 16th, 2021. 13 assays of 14,332 drugs were used in our experiments. We split all datasets into 8:1:1 as training, validation and test sets. For datasets with fewer than 100,000 molecules, we trained models for 5 replicas with randomly split data and reported the means and standard deviations of performances; for QM9, we used splits with the same random seed across models. We used the identical featurization process to [37] to derive the feature matrices ($X^v$, $X^e$) for all models and on all datasets.

Baselines For conformation learning tasks, we compared PhysChem with i) a Distance Geometry [2] method tuned with the Universal Force Field (UFF) [22], which is implemented in the RDKit package (version 2020.03.1.0, http://www.rdkit.org/) and thus referred to as RDKit; ii) CVGAE and CVGAE+UFF [18], which learned to generate low-energy molecular conformations with deep generative graph neural networks (with or without UFF tuning); and iii) HamEng [16], which learned stable conformations via simulating Hamiltonian mechanics with neural physical engines.

For property prediction tasks, we compared PhysChem with i) MoleculeNet, for which we report the best performances achieved by the methods collected in [36] (before 2017); ii) 3DGCN [4], which augmented conventional GCNs with input bond geometries; iii) DimeNet [15], which conducted directional message-passing by representing pairs of atoms; iv) Attentive FP [37], which used local and global attentive layers to derive molecular representations; and v) CMPNN [30], which used communicative kernels to conduct deep message-passing. We conducted experiments with the official implementations of HamEng, CVGAE, Attentive FP and CMPNN; for the other baselines, as identical evaluation schemes were adopted, we referred to the performances reported in the corresponding citations and left unreported entries blank. We also compared PhysChem with several variants: i) PhysNet (s.a.), a stand-alone PhysNet that ignored all chemical expertise by setting $e_{j,i} \equiv 0$; ii) ChemNet (s.a.), a stand-alone ChemNet that ignored all physical expertise by setting all bond lengths and angles equal ($l_{i,j} \equiv l$, $a_{j,i,k} \equiv a$); iii) ChemNet (real conf.) and ChemNet (rdkit conf.), two ChemNet variants that leveraged $l_{i,j}$ and $a_{j,i,k}$ from real conformations or RDKit-generated conformations; and iv) ChemNet (m.t.), a multi-task ChemNet variant that used a straightforward multi-task strategy for the conformation learning and property prediction tasks: we applied the conformation decoder and loss on the atom states $v$ and optimized ChemNet with the weighted sum of the losses of both tasks (i.e. $\mathcal{L}_{\mathrm{total}}$ in Equation (19)).

Unless otherwise specified, we used $L = 2$ pairs of blocks for PhysChem. In the initializer, we used a 2-layer GCN and a 2-layer LSTM. In each PhysNet block, we set $d_f = 3$, $S = 4$ and $\tau = 0.25$. In each ChemNet block, we set the dimensions of atom and bond states to 128 and 64, respectively. In the representation readout block, we used $T = 1$ global attentive layer with 256-dimensional meta-atom states. For property prediction tasks with multiple targets, the targets were first standardized and then fitted simultaneously. We used the mean-squared-error (MSE) loss for all regression tasks and the cross-entropy loss for all classification tasks. Other implementation details, such as the hyperparameters of the baselines, are provided in the appendix.

Quantum Mechanical Datasets As real conformations are available in the QM datasets, we evaluated PhysChem on both conformation learning and property prediction tasks. On conformation learning tasks, a Distance MSE metric is reported, which sums the squared errors of all pairwise distances in each molecule, normalizes the sum by the number of atoms, and then takes the average across molecules. Note that this metric is equivalent to the Conn-$k$ loss with $k = \infty$ in Equation (12). On property prediction tasks, the mean absolute errors (MAE) of the regression targets are reported. When multiple targets are requested (on QM8 and QM9), we report the Multi-MAE metric in [16], which sums the MAEs of all targets (standardized for QM9). Tables 2 and 3 show the results.
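For clarity, here is a sketch of the Distance MSE metric as described above: per molecule, the squared errors of all pairwise distances are summed and divided by the number of atoms, and the results are averaged across molecules. Summing over the full n x n distance matrices (each pair counted twice) is our reading of the text.

```python
import torch

def distance_mse(D_true_list, D_pred_list):
    """Distance MSE across molecules (a sketch of the described metric).

    D_true_list, D_pred_list: lists of (n, n) pairwise-distance matrices,
    one pair per molecule; molecules may differ in their number of atoms n.
    """
    per_mol = []
    for D, D_hat in zip(D_true_list, D_pred_list):
        n = D.shape[0]
        per_mol.append(((D - D_hat) ** 2).sum() / n)  # sum of errors / n atoms
    return torch.stack(per_mol).mean()                # average over molecules
```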
On conformation learning tasks, PhysNet (s.a.) displays significant advantages in learning the conformations of small molecules. Specifically, the comparison between PhysNet (s.a.) and HamEng indicates that directly learning forces in neural physical engines may be superior to learning energies and their derivatives. With massive data samples (QM9), the specialist, PhysNet (s.a.), obtains better results than PhysChem; on datasets with limited samples (QM7), chemical expertise demonstrates its effectiveness. On property prediction tasks, PhysChem obtains state-of-the-art performances on the QM datasets. The comparison between PhysChem, ChemNet (s.a.) and ChemNet (m.t.) shows that the elaborate cooperation mechanism in PhysChem is necessary, as the straightforward multi-task strategy leads to severe negative transfer. In addition, the results of ChemNet (real conf.) and ChemNet (rdkit conf.) show that leveraging real (or generated, geometrically correct) conformations of the test molecules indeed helps on some datasets, though the improvements are somewhat insignificant.

On LIPOP, FREESOLV, ESOL and COVID19, which have no labeled conformations, we used RDKit-generated conformations for the models that request conformations of training and / or test molecules. Although these generated conformations are less accurate, the distance geometries of local structures are generally correctly displayed. We report the root-mean-squared errors (RMSE) for the regression tasks and the multi-class ROC-AUC (Multi-AUC) metric on COVID19. Table 4 shows the results. PhysChem again displays state-of-the-art performances. Notably, as the numbers of samples in the physical chemical datasets (LIPOP, FREESOLV and ESOL) are small, the conformation learning task for PhysNet is comparatively tough. This leads to a larger performance gap between PhysChem and ChemNet (rdkit conf.), yet the former still outperforms most baselines. The largest improvement is observed on COVID19, which indicates that PhysChem is especially competent on more complicated, structure-dependent tasks.

Figure 2(a) shows the effect of the loss hyperparameter λ on QM9. When λ ≤ 0.1, increasing λ benefits both tasks, which indicates that aiding the learning of conformations also helps to predict chemical properties. With larger λ, the property prediction task is then compromised. Figure 2(b) visualizes the conformations predicted by the baselines and the PhysNet blocks. Local structures such as bond lengths, angles and the planarity of aromatic groups are better preserved by PhysChem.

In this paper, we propose a novel neural architecture, PhysChem, that learns molecular representations via fusing physical and chemical information. Conformation learning and property prediction tasks are jointly trained in this architecture. Beyond straightforward multi-task strategies, PhysChem adopts an elaborate cooperation mechanism between two specialist networks. State-of-the-art performances were achieved on MoleculeNet and SARS-CoV-2 datasets. Nevertheless, there is still much room for improving PhysChem. For future work, a straightforward direction is to enlarge the capacity of the model by further simplifying as well as deepening the architecture. In addition, proposing strategies to train PhysChem with massive unlabeled molecules is yet another promising direction.

Broader Impact For the machine learning community, our work proposes a more interpretable architecture for molecular machine learning tasks and demonstrates its effectiveness.
We hope that the specialist networks and the domain-related cooperation mechanism in PhysChem will inspire researchers in a wider area of deep learning to develop novel architectures under the same motivation. For the drug discovery community, PhysChem leads to direct applications in ligand-related tasks, including conformation and property prediction, protein-ligand binding affinity prediction, etc. With limited possibility, the abuse of PhysChem in drug discovery may violate some ethics of life science.

We hereby formalize the algorithms of the initializer, the PhysNet blocks and the ChemNet blocks: in the initializer, the atom and bond states are encoded from the input features, atomic chemical environments are encoded with GCNs, and the initial atomic positions / momenta are derived.

For COVID19, we preprocessed the regression targets into classification ones: we partitioned the potency indices into bins as {inactive, [10, ∞), [1, 10), (0, 1)}, i.e. {inactive, low potency, medium potency, high potency} (a smaller index indicates stronger potency), and predicted which bin each index fell into. Correspondingly, predictions on all assays were transformed into 4-class classification tasks. The distribution over the classes is shown in Figure S3. We used the identical featurization process to [37]; details are included in Table S5.

For property prediction tasks, the italic results of the baselines (3DGCN [4], CMPNN [30], DimeNet [15], and HamNet [16]) in the tables in Section 4 were directly taken from the corresponding references. As MoleculeNet [36] suggested standard training and evaluation schemes for all collected datasets (including QM7, QM8, QM9, LIPOP, FREESOLV and ESOL), to which all baselines conformed, the performances are comparable. For these baselines, we left untested entries in the tables blank. For Attentive FP [37], as the paper did not report the standard errors of its performances, we conducted experiments with its official implementation (https://github.com//OpenDrugAI/AttentiveFP) using the exact hyperparameters included in its appendix. For the COVID19 datasets, we tested the official implementations of Attentive FP and CMPNN (https://github.com//SY575/CMPNN), with the suggested parameters. We also reported conformation learning results of RDKit, HamEng [16] and CVGAE [18]. We referred to the official implementations of RDKit (version 2020.03.1.0 at http://www.rdkit.org/) and HamEng (https://github.com//PKUterran/HamNet). As there were some errors in the official code of CVGAE, we reimplemented it strictly following the details in the paper. We used the default parameters suggested in the papers.

Table S6 shows the hyperparameters of PhysChem on different datasets. As similar architectures were used, most parameters of ChemNet and of training in PhysChem were borrowed from the corresponding hyperparameters of Attentive FP [37] on the corresponding datasets, for example, the dimensionalities of atom states and the learning rates. Despite the modification of structures, we found that these setups worked, so we did not tune these hyperparameters. Notably, we used $d_f = 3$ in our experiments (we apologize for the confusion caused by stating in the main paper that $d_f \gg 3$ and that $d_f = 128$ was used by default; this will be corrected in future versions of the paper). Empirically, setting $d_f = 3$ already leads to promising results. Using $d_f \gg 3$ may slightly improve the performances of conformation learning at the cost of model efficiency. We show the effect of $d_f$ in Section K.

Compared with distance-based metrics, the traditional RMSD is subject to an optimal alignment between the predicted and reference conformations, so it cannot smoothly measure their discrepancy when there are poorly predicted regions in the conformation [19]. However, to further justify the results of our model, we still evaluated the RMSD of several baselines; the results are listed in Table H.
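To illustrate the alignment dependence of RMSD, here is a standard Kabsch-superposition RMSD routine. This is textbook code for the metric under discussion, not an implementation from the paper.

```python
import numpy as np

def aligned_rmsd(R_pred, R_ref):
    """RMSD after optimal superposition (Kabsch algorithm).

    R_pred, R_ref: (n, 3) coordinates of the same atoms. Shown only to
    illustrate the alignment step that distance-based losses avoid.
    """
    P = R_pred - R_pred.mean(axis=0)           # center both conformations
    Q = R_ref - R_ref.mean(axis=0)
    H = P.T @ Q                                # 3x3 covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # avoid improper rotations
    D = np.diag([1.0, 1.0, d])
    Rot = Vt.T @ D @ U.T                       # optimal rotation matrix
    diff = (Rot @ P.T).T - Q
    return np.sqrt((diff ** 2).sum() / len(P))
```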
In Table H, Mixed Loss indicates that we used a combination of the RMSD loss and the Conn-3 loss, following [16]. We also evaluated PhysChem, ChemNet (s.a.) and ChemNet (real conf.) on the 12 separate tasks of QM9; the results are listed in Table I.

The training and inference of PhysChem and the baselines were conducted on a total of 8 NVIDIA Tesla P100 GPUs. We recorded the running time of PhysChem and of baselines including CVGAE [18], Attentive FP [37] and HamNet [16] on the conformation learning and property prediction tasks on QM9. The results are shown in Table S9. For HamNet, we separately report the running time of its two modules, namely the Hamiltonian Engine (Ham. Eng.) and the Fingerprint Generator (FP Gen.). Our model displayed comparable efficiency to the introduced baselines on both tasks.

Figure S4 shows the effect of the hyperparameter $d_f$ in PhysChem. PhysChem is robust to the selection of $d_f$. Empirically, using a larger $d_f$ leads to equivalent performances on property prediction, and slightly improved ones on conformation learning. Notably, using $d_f = 3$ already leads to promising results. This differs from the observation in HamNet [16].

Figure S4: The effect of $d_f$ on QM7.

References
[1] A simple representation of three-dimensional molecular structure.
[2] Distance geometry in molecular modeling.
[3] An OpenData portal to share COVID-19 drug repurposing data in real time. bioRxiv.
[4] Enhanced deep-learning prediction of molecular properties via augmentation of bond topology.
[5] Learning phrase representations using RNN encoder-decoder for statistical machine translation.
[6] Multi-task learning with deep neural networks: A survey.
[7] Convolutional networks on graphs for learning molecular fingerprints.
[8] Deep docking: A deep learning platform for augmentation of structure based drug discovery.
[9] Neural message passing for quantum chemistry.
[10] Long short-term memory.
[11] SMILES transformer: Pre-trained molecular fingerprint for low data drug discovery.
[12] Junction tree variational autoencoder for molecular graph generation.
[13] Improved protein-ligand binding affinity prediction with structure-based deep fusion inference.
[14] Semi-supervised classification with graph convolutional networks.
[15] Directional message passing for molecular graphs.
[16] Conformation-guided molecular representation with Hamiltonian neural networks.
[17] 86 PFLOPS deep potential molecular dynamics simulation of 100 million atoms with ab initio accuracy.
[18] Molecular geometry prediction using a deep generative graph neural network.
[19] lDDT: A local superposition-free score for comparing protein structures and models using distance difference tests.
[20] The generation of a unique machine description for chemical structures: a technique developed at Chemical Abstracts Service.
[21] Fréchet ChemNet Distance: A metric for generative models for molecules in drug discovery.
[22] UFF, a full periodic table force field for molecular mechanics and molecular dynamics simulations.
[23] Extended-connectivity fingerprints.
[24] Analysis of deep fusion strategies for multi-modal gesture recognition.
[25] Self-supervised graph transformer on large-scale molecular data.
[26] SMIfp (SMILES fingerprint) chemical space for virtual screening and visualization of large databases of organic molecules.
[27] Improved protein structure prediction using potentials from deep learning.
[28] Learning gradient fields for molecular conformation generation.
[29] GraphAF: A flow-based autoregressive model for molecular graph generation.
[30] Communicative representation learning on attributed molecular graphs.
[31] A survey of multi-task learning methods in chemoinformatics.
[32] High-throughput virtual screening of small molecule inhibitors for SARS-CoV-2 protein targets with deep fusion models.
[33] PhysNet: A neural network for predicting energies, forces, dipole moments, and partial charges.
[34] 3D-PhysNet: Learning the intuitive physics of non-rigid object deformations.
[35] SMILES, 1. Introduction and encoding rules.
[36] MoleculeNet: A benchmark for molecular machine learning.
[37] Pushing the boundaries of molecular representation for drug discovery with the graph attention mechanism.
[38] Graph convolutional policy network for goal-directed molecular graph generation.
[39] Deep potential molecular dynamics: A scalable model with the accuracy of quantum mechanics.
[40] A survey on multi-task learning. CoRR, abs/1707.08114, 2017.

• The SARS-CoV-2 datasets (COVID19) are licensed under a CC BY 4.0 license. 13 assays of 14,332 drugs were used as targets. Note that not all assays on all drugs were available, so we applied masks on unavailable entries in both the training and evaluation stages. The assays include:
• (b) ACE2 enzymatic activity
• (c) HEK293 cell line toxicity
• (d) Human fibroblast toxicity
• (e) MERS Pseudotyped particle entry
• (f) MERS Pseudotyped particle entry
• (g) SARS-CoV Pseudotyped particle entry
• (h) SARS-CoV Pseudotyped particle entry
• (i) SARS-CoV-2 cytopathic effect (CPE)
• (j) SARS-CoV-2 cytopathic effect
• (k) Spike-ACE2 protein-protein interaction
• (l) Spike-ACE2 protein-protein interaction
• (m) TMPRSS2 enzymatic activity

Acknowledgments This work was supported by the National Natural Science Foundation of China (Grant No. 61876006).

We hereby specify the detailed formulas of the neural network layers used in this paper:
• Fully-Connected Layers
• Graph Convolutional Networks (with two layers)
• Cells of Gated Recurrent Units
• Long Short-Term Memory (LSTM) (with one layer), with h_0 and c_0 randomly initialized

In the ChemNet blocks, the bond-angle representations are computed for each centric atom $i$ and each pair of neighbors $j, k \in \mathcal{N}(i)$, $j \neq k$, as
$$\cos(\angle_{j,i,k}) = \frac{r_{j,i} \cdot r_{k,i}}{l_{i,j}\, l_{i,k}}, \qquad a_{j,i,k} = \mathrm{FC}\big(\cos(\angle_{j,i,k})\big).$$