key: cord-0222347-ky0d4du1 authors: Fuchs, Fabian B.; Worrall, Daniel E.; Fischer, Volker; Welling, Max title: SE(3)-Transformers: 3D Roto-Translation Equivariant Attention Networks date: 2020-06-18 sha: 6075091294a0fe0fe5c6c7b7a0df9029b6a965cb doc_id: 222347 cord_uid: ky0d4du1
We introduce the SE(3)-Transformer, a variant of the self-attention module for 3D point clouds, which is equivariant under continuous 3D roto-translations. Equivariance is important to ensure stable and predictable performance in the presence of nuisance transformations of the data input. A positive corollary of equivariance is increased weight-tying within the model, leading to fewer trainable parameters and thus decreased sample complexity (i.e. we need less training data). The SE(3)-Transformer leverages the benefits of self-attention to operate on large point clouds with varying numbers of points, while guaranteeing SE(3)-equivariance for robustness. We evaluate our model on a toy $N$-body particle simulation dataset, showcasing the robustness of the predictions under rotations of the input. We further achieve competitive performance on two real-world datasets, ScanObjectNN and QM9. In all cases, our model outperforms a strong, non-equivariant attention baseline and an equivariant model without attention.
Self-attention mechanisms [28] have enjoyed a sharp rise in popularity in the last few years. Their relative implementational simplicity coupled with high efficacy on a wide range of tasks such as language modeling [28], image recognition [16], or graph-based problems [29] makes them an attractive component to use. However, their generality of application means that for specific tasks, knowledge of existing underlying structure is unused. In this paper, we propose the SE(3)-Transformer shown in Fig. 1, a self-attention mechanism specifically for 3D point cloud data, which adheres to equivariance constraints, improving robustness to nuisance transformations and general performance. Point cloud data is ubiquitous across many fields, presenting itself in diverse forms such as 3D object scans [26], 3D molecular structures [19], or N-body particle simulations [12]. Finding neural structures which can adapt to a varying number of points in an input, while respecting the irregular sampling of point positions, is challenging. Furthermore, an important property is that these structures should be invariant to global changes in overall input pose; that is, 3D translations and rotations of the input point cloud should not affect the output. In this paper, we find that the explicit imposition of equivariance constraints on the self-attention mechanism addresses these challenges. The SE(3)-Transformer uses the self-attention mechanism as a data-dependent filter particularly suited for sparse, non-voxelised point cloud data, while respecting and leveraging the symmetries of the task at hand. Self-attention itself is a pseudo-linear map between sets of points. It can be seen to consist of two components: input-dependent attention weights and an embedding of the input, called a value embedding. In Fig. 1, we show an example of a molecular graph, where attached to every atom we see a value embedding vector and where the attention weights are represented as edges, with width corresponding to the attention weight magnitude. In the SE(3)-Transformer, we explicitly design the attention weights to be invariant to global pose. Furthermore, we design the value embedding to be equivariant to global pose.
Equivariance generalises the translational weight-tying of convolutions. It ensures that transformations of a layer's input manifest as equivalent transformations of the output. SE(3)-equivariance in particular is the generalisation of the translational weight-tying of conventional 2D convolutions to roto-translations in 3D. This restricts the space of learnable functions to a subspace which adheres to the symmetries of the task and thus reduces the number of learnable parameters. Meanwhile, it provides us with a richer form of invariance, since relative positional information between features in the input is preserved. Our contributions are the following:
• We introduce a novel self-attention mechanism, guaranteeably invariant to global rotations and translations of its input. It is also equivariant to permutations of the input point labels.
• We show that the SE(3)-Transformer resolves an issue with concurrent SE(3)-equivariant neural networks, which suffer from angularly constrained filters.
• We introduce a Pytorch implementation of spherical harmonics, which is 10× faster than Scipy on CPU and 100-1000× faster on GPU.
In this section we introduce the relevant background materials on self-attention, graph neural networks, and equivariance. We are concerned with point cloud based machine learning tasks, such as object classification or segmentation. In such a task, we are given a point cloud as input, represented as a collection of n coordinate vectors $\mathbf{x}_i \in \mathbb{R}^3$ with optional per-point features $\mathbf{f}_i \in \mathbb{R}^d$. The standard attention mechanism [28] can be thought of as consisting of three terms: a set of query vectors $\mathbf{q}_i \in \mathbb{R}^p$ for $i = 1, \dots, m$, a set of key vectors $\mathbf{k}_j \in \mathbb{R}^p$ for $j = 1, \dots, n$, and a set of value vectors $\mathbf{v}_j \in \mathbb{R}^r$ for $j = 1, \dots, n$, where r and p are the dimensions of the low dimensional embeddings. We commonly interpret the key $\mathbf{k}_j$ and the value $\mathbf{v}_j$ as being 'attached' to the same point j. For a given query $\mathbf{q}_i$, the attention mechanism can be written as
$$\mathrm{Attn}\big(\mathbf{q}_i, \{\mathbf{k}_j\}, \{\mathbf{v}_j\}\big) = \sum_{j=1}^{n} \mathrm{softmax}_j\big(\mathbf{q}_i^\top \mathbf{k}_j\big)\, \mathbf{v}_j ,$$
where we used a softmax as a nonlinearity acting on the weights. In general, the number of query vectors does not have to equal the number of input points [14]. In the case of self-attention the query, key, and value vectors are embeddings of the input features, so
$$\mathbf{q}_i = h_Q(\mathbf{f}_i), \qquad \mathbf{k}_j = h_K(\mathbf{f}_j), \qquad \mathbf{v}_j = h_V(\mathbf{f}_j),$$
where $\{h_Q, h_K, h_V\}$ are, in the most general case, neural networks [27]. For us, query $\mathbf{q}_i$ is associated with a point i in the input, which has a geometric location $\mathbf{x}_i$. Thus if we have n points, we have n possible queries. For query $\mathbf{q}_i$, we say that node i attends to all other nodes $j \neq i$. Motivated by successes across a wide range of tasks in deep learning such as language modeling [28], image recognition [16], graph-based problems [29], and relational reasoning [27, 7], a recent stream of work has applied forms of self-attention algorithms to point cloud data [39, 37, 14]. One such example is the Set Transformer [14]. When applied to object classification on ModelNet40 [36], the inputs to the Set Transformer are the Cartesian coordinates of the points. Each layer embeds this positional information further while dynamically querying information from other points. The final per-point embeddings are downsampled and used for object classification. Permutation equivariance: A key property of self-attention is permutation equivariance. Permutations of point labels 1, ..., n lead to permutations of the self-attention output. This guarantees the attention output does not depend arbitrarily on input point ordering.
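As a concrete reference point, here is a minimal PyTorch sketch of standard (non-equivariant) dot-product self-attention over per-point features, together with a check of the permutation equivariance property described above; the module and variable names are our own illustrative choices, not the paper's implementation.

```python
import torch
import torch.nn as nn

class DotProductSelfAttention(nn.Module):
    """Plain dot-product self-attention over a set of n points."""

    def __init__(self, d_in: int, d_embed: int):
        super().__init__()
        # h_Q, h_K, h_V: here simple linear maps of the per-point features
        self.h_q = nn.Linear(d_in, d_embed)
        self.h_k = nn.Linear(d_in, d_embed)
        self.h_v = nn.Linear(d_in, d_embed)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # f: [n, d_in] per-point features
        q, k, v = self.h_q(f), self.h_k(f), self.h_v(f)   # [n, d_embed] each
        alpha = torch.softmax(q @ k.t(), dim=-1)          # [n, n] attention weights, rows sum to 1
        return alpha @ v                                  # weighted sum of values per query point


# Permuting the input points permutes the output rows in the same way (permutation equivariance).
attn = DotProductSelfAttention(d_in=6, d_embed=16)
f = torch.randn(5, 6)
perm = torch.randperm(5)
assert torch.allclose(attn(f)[perm], attn(f[perm]), atol=1e-5)
```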
Wagstaff et al. [30] recently showed that this mechanism can theoretically approximate all permutation equivariant functions. The SE(3)-Transformer is a special case of this attention mechanism, inheriting permutation equivariance. However, it limits the space of learnable functions to rotation and translation equivariant ones. Attention scales quadratically with point cloud size, so it is useful to introduce neighbourhoods: instead of each point attending to all other points, it only attends to its nearest neighbours. Sets with neighbourhoods are naturally represented as graphs. Attention has previously been introduced on graphs under the names of intra-, self-, vertex-, or graph-attention [15, 28, 29, 10, 23]. These methods were unified by Wang et al. [31] with the non-local neural network. This has the simple form
$$\mathbf{y}_i = \frac{1}{C\big(\{\mathbf{f}_j\}_{j \in \mathcal{N}_i}\big)} \sum_{j \in \mathcal{N}_i} w(\mathbf{f}_i, \mathbf{f}_j)\, h(\mathbf{f}_j), \qquad (3)$$
where w and h are neural networks and C normalises the sum as a function of all features in the neighbourhood $\mathcal{N}_i$. This has a similar structure to attention, and indeed we can see it as performing attention per neighbourhood. While non-local modules do not explicitly incorporate edge-features, it is possible to add them, as done in Veličković et al. [29] and Hoshen [10]. Given a set of transformations $T_g : \mathcal{V} \to \mathcal{V}$ for $g \in G$, where G is an abstract group, a function $\phi : \mathcal{V} \to \mathcal{Y}$ is called equivariant if for every g there exists a transformation $S_g : \mathcal{Y} \to \mathcal{Y}$ such that
$$\phi(T_g[\mathbf{v}]) = S_g[\phi(\mathbf{v})] \quad \text{for all } g \in G,\ \mathbf{v} \in \mathcal{V}. \qquad (4)$$
The indices g can be considered as parameters describing the transformation. Given a pair $(T_g, S_g)$, we can solve for the family of equivariant functions $\phi$ satisfying Equation 4. Furthermore, if $(T_g, S_g)$ are linear and the map $\phi$ is also linear, then a very rich and developed theory already exists for finding $\phi$ [5]. In the equivariance literature, deep networks are built from interleaved linear maps $\phi$ and equivariant nonlinearities. In the case of 3D roto-translations it has already been shown that a suitable structure for $\phi$ is a tensor field network [25], explained below. Note that Romero et al. [21] recently introduced a 2D roto-translationally equivariant attention module for pixel-based image data. Group Representations: In general, the transformations $(T_g, S_g)$ are called group representations. Formally, a group representation $\rho : G \to GL(N)$ is a map from a group G to the set of $N \times N$ invertible matrices $GL(N)$. Critically, $\rho$ is a group homomorphism; that is, it satisfies $\rho(g_1 g_2) = \rho(g_1)\rho(g_2)$ for all $g_1, g_2 \in G$. Specifically for 3D rotations $G = SO(3)$, we have a few interesting properties: 1) its representations are orthogonal matrices, 2) all representations can be decomposed as
$$\rho(g) = Q^\top \Big[ \bigoplus_{\ell} D_\ell(g) \Big] Q, \qquad (5)$$
where Q is an orthogonal, $N \times N$, change-of-basis matrix [4]; each $D_\ell$ for $\ell = 0, 1, 2, \dots$ is a $(2\ell+1) \times (2\ell+1)$ matrix known as a Wigner-D matrix; and $\bigoplus$ is the direct sum or concatenation of matrices along the diagonal. The Wigner-D matrices are irreducible representations of SO(3); think of them as the 'smallest' representations possible. Vectors transforming according to $D_\ell$ (i.e. we set $Q = I$) are called type-$\ell$ vectors. Type-0 vectors are invariant under rotations and type-1 vectors rotate according to 3D rotation matrices. Note, type-$\ell$ vectors have length $2\ell + 1$. They can be stacked, forming a feature vector $\mathbf{f}$ transforming according to Eq. (5). Tensor Field Networks: Tensor field networks (TFN) [25] are neural networks which map point clouds to point clouds under the constraint of SE(3)-equivariance, where SE(3) is the group of 3D rotations and translations.
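As an illustration of the neighbourhood construction discussed earlier in this section (our own sketch, not the paper's code), a k-nearest-neighbour graph can be built directly from the point coordinates; attention as in Eq. (3) is then restricted to these edges.

```python
import torch

def knn_neighbourhoods(x: torch.Tensor, k: int) -> torch.Tensor:
    """Return indices [n, k] of the k nearest neighbours of each point (self excluded).

    x: [n, 3] point coordinates.
    """
    dist = torch.cdist(x, x)                          # [n, n] pairwise Euclidean distances
    dist.fill_diagonal_(float("inf"))                 # a point never counts as its own neighbour
    return dist.topk(k, largest=False).indices        # [n, k] neighbour indices per point

x = torch.randn(100, 3)
neighbours = knn_neighbourhoods(x, k=10)              # node i attends only to neighbours[i]
```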
For point clouds, the input is a vector field $f : \mathbb{R}^3 \to \mathbb{R}^d$ of the form
$$f(\mathbf{x}) = \sum_{j=1}^{n} \mathbf{f}_j\, \delta(\mathbf{x} - \mathbf{x}_j), \qquad (6)$$
where $\delta$ is the Dirac delta, $\{\mathbf{x}_j\}$ are the 3D point coordinates and $\{\mathbf{f}_j\}$ are the point features. A TFN layer maps type-k input features to type-$\ell$ output features by convolving with a learnable kernel $W^{\ell k} : \mathbb{R}^3 \to \mathbb{R}^{(2\ell+1)\times(2k+1)}$; evaluated at point i, this is
$$\mathbf{f}^{\ell}_{\mathrm{out},i} = \sum_{k \geq 0} \sum_{j} W^{\ell k}(\mathbf{x}_j - \mathbf{x}_i)\, \mathbf{f}^{k}_{\mathrm{in},j}. \qquad (7)$$
We can also include a sum over input channels, but we omit it here. Weiler et al. [33], Thomas et al. [25] and Kondor [13] showed that the kernel $W^{\ell k}$ lies in the span of an equivariant basis $\{W^{\ell k}_J\}_{J=|k-\ell|}^{k+\ell}$. The kernel is a linear combination of these basis kernels, where the Jth coefficient is a learnable function $\varphi^{\ell k}_J : \mathbb{R}_{\geq 0} \to \mathbb{R}$ of the radius $\|\mathbf{x}\|$. Mathematically this is
$$W^{\ell k}(\mathbf{x}) = \sum_{J=|k-\ell|}^{k+\ell} \varphi^{\ell k}_J(\|\mathbf{x}\|)\, W^{\ell k}_J(\mathbf{x}), \quad \text{where} \quad W^{\ell k}_J(\mathbf{x}) = \sum_{m=-J}^{J} Y_{Jm}\big(\mathbf{x}/\|\mathbf{x}\|\big)\, Q^{\ell k}_{Jm}, \qquad (8)$$
with $Y_{Jm}$ the mth component of the Jth spherical harmonic and $Q^{\ell k}_{Jm}$ a matrix of Clebsch-Gordan coefficients (see the Appendix for the derivation). Eq. (7) and Eq. (9) present the convolution in message-passing form, where messages are aggregated from all nodes and feature types. They are also a form of nonlocal graph operation as in Eq. (3), where the weights are functions on edges and the features $\{\mathbf{f}_i\}$ are node features. We will later see how our proposed attention layer unifies aspects of convolutions and graph neural networks. Here, we present the SE(3)-Transformer. The layer can be broken down into a procedure of steps as shown in Fig. 2, which we describe in the following section. These are the construction of a graph from a point cloud, the construction of equivariant edge functions on the graph, how to propagate SE(3)-equivariant messages on the graph, and how to aggregate them. We also introduce an alternative for the self-interaction layer, which we call attentive self-interaction. The SE(3)-Transformer itself consists of three components. These are 1) edge-wise attention weights $\alpha_{ij}$, constructed to be SE(3)-invariant on each edge ij, 2) edge-wise SE(3)-equivariant value messages, propagating information between nodes, as found in the TFN convolution of Eq. (7), and 3) a linear/attentive self-interaction layer. Attention is performed on a per-neighbourhood basis as follows:
$$\mathbf{f}^{\ell}_{\mathrm{out},i} = \underbrace{W^{\ell\ell}_V\, \mathbf{f}^{\ell}_{\mathrm{in},i}}_{\text{③ self-interaction}} \;+\; \sum_{k \geq 0} \;\sum_{j \in \mathcal{N}_i \setminus \{i\}} \underbrace{\alpha_{ij}}_{\text{①}}\, \underbrace{W^{\ell k}_V(\mathbf{x}_j - \mathbf{x}_i)\, \mathbf{f}^{k}_{\mathrm{in},j}}_{\text{② value message}}. \qquad (10)$$
These components are visualised in Fig. 2. If we remove the attention weights then we have a tensor field convolution, and if we instead remove the dependence of $W_V$ on $(\mathbf{x}_j - \mathbf{x}_i)$, we have a conventional attention mechanism. Provided that the attention weights $\alpha_{ij}$ are invariant, Eq. (10) is equivariant to SE(3)-transformations. This is because it is just a linear combination of equivariant value messages. Invariant attention weights can be achieved with the dot-product attention structure shown in Eq. (11). This mechanism consists of a normalised inner product between a query vector $\mathbf{q}_i$ at node i and a set of key vectors $\{\mathbf{k}_{ij}\}_{j \in \mathcal{N}_i}$ along each edge ij in the neighbourhood $\mathcal{N}_i$:
$$\alpha_{ij} = \frac{\exp\big(\mathbf{q}_i^\top \mathbf{k}_{ij}\big)}{\sum_{j' \in \mathcal{N}_i \setminus \{i\}} \exp\big(\mathbf{q}_i^\top \mathbf{k}_{ij'}\big)}, \qquad \mathbf{q}_i = \bigoplus_{\ell \geq 0} \sum_{k \geq 0} W^{\ell k}_Q\, \mathbf{f}^{k}_{\mathrm{in},i}, \qquad \mathbf{k}_{ij} = \bigoplus_{\ell \geq 0} \sum_{k \geq 0} W^{\ell k}_K(\mathbf{x}_j - \mathbf{x}_i)\, \mathbf{f}^{k}_{\mathrm{in},j}, \qquad (11)$$
where $\bigoplus$ is the direct sum, i.e. vector concatenation in this instance. The linear embedding matrices $W^{\ell k}_Q$ and $W^{\ell k}_K(\mathbf{x}_j - \mathbf{x}_i)$ are of TFN type (c.f. Eq. (8)). The attention weights $\alpha_{ij}$ are invariant for the following reason. If the input features $\{\mathbf{f}_{\mathrm{in},j}\}$ are SO(3)-equivariant, then the query $\mathbf{q}_i$ and key vectors $\{\mathbf{k}_{ij}\}$ are also SO(3)-equivariant, since the linear embedding matrices are of TFN type. The inner product of SO(3)-equivariant vectors transforming under the same representation $S_g$ is invariant, since if $\mathbf{q} \mapsto S_g \mathbf{q}$ and $\mathbf{k} \mapsto S_g \mathbf{k}$, then $\mathbf{q}^\top S_g^\top S_g \mathbf{k} = \mathbf{q}^\top \mathbf{k}$, because of the orthonormality of representations of SO(3), mentioned in the background section. We follow the common practice from the self-attention literature [28, 14] and choose a softmax nonlinearity to normalise the attention weights to unity, but in general any nonlinear function could be used. The attention weights add extra degrees of freedom to the TFN kernel in the angular direction. This is seen when Eq. (10) is viewed as a convolution with a data-dependent kernel $\alpha_{ij} W^{\ell k}_V(\mathbf{x})$.
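The following is our own minimal sketch of the aggregation in Eqs. (10) and (11), restricted to a single feature type for brevity and assuming the equivariant query, key, and value embeddings have already been computed by TFN-type layers; tensor shapes and names are illustrative assumptions, not the paper's interface.

```python
import torch

def se3_attention_aggregate(q, k, v, self_msg):
    """Per-neighbourhood attention with invariant weights and equivariant messages.

    q:        [n, d]         concatenated query features per node (all types stacked)
    k:        [n, K, d]      key features per edge (i, j), j over the K neighbours of i
    v:        [n, K, 2l+1]   type-l value messages W_V(x_j - x_i) f_j per edge
    self_msg: [n, 2l+1]      self-interaction term W_V^{ll} f_i
    Assumes q and k come from TFN-type embeddings, so q_i^T k_ij is SE(3)-invariant.
    """
    logits = torch.einsum("nd,nkd->nk", q, k)      # invariant scalar per edge
    alpha = torch.softmax(logits, dim=-1)          # normalise over each neighbourhood
    out = torch.einsum("nk,nkc->nc", alpha, v)     # weighted sum of equivariant messages
    return out + self_msg                          # add the self-interaction term
```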
In the literature, SO(3) equivariant kernels are decomposed as a sum of products of learnable radial functions $\varphi^{\ell k}_J(\|\mathbf{x}\|)$ and non-learnable angular kernels $W^{\ell k}_J(\mathbf{x}/\|\mathbf{x}\|)$ (c.f. Eq. (8)). The fixed angular dependence of $W^{\ell k}_J(\mathbf{x}/\|\mathbf{x}\|)$ is a strange artifact of the equivariance condition in noncommutative algebras and, while necessary to guarantee equivariance, it is seen as overconstraining the expressiveness of the kernels. Interestingly, the attention weights $\alpha_{ij}$ introduce a means to modulate the angular profile of $W^{\ell k}_J(\mathbf{x}/\|\mathbf{x}\|)$, while maintaining equivariance. Channels, Self-interaction Layers, and Non-Linearities: Analogous to conventional neural networks, the SE(3)-Transformer can straightforwardly be extended to multiple channels per representation degree $\ell$, so far omitted for brevity. This sets the stage for self-interaction layers. The attention layer (c.f. Fig. 2 and circles 1 and 2 of Eq. (10)) aggregates information over nodes and input representation degrees k. In contrast, the self-interaction layer (c.f. circle 3 of Eq. (10)) exchanges information solely between features of the same degree and within one node, much akin to 1x1 convolutions in CNNs. Self-interaction is an elegant form of learnable skip connection, transporting information from query point i in layer L to query point i in layer L + 1. This is crucial since, in the SE(3)-Transformer, points do not attend to themselves. In our experiments, we use two different types of self-interaction layer, (1) linear and (2) attentive, both of the form
$$\mathbf{f}^{\ell}_{\mathrm{out},i,c'} = \sum_{c} w^{\ell}_{i,c'c}\, \mathbf{f}^{\ell}_{\mathrm{in},i,c}.$$
Linear: Following Schütt et al. [22], output channels are a learned linear combination of input channels using one set of weights $w^{\ell}_{i,c'c} = w^{\ell}_{c'c}$ per representation degree $\ell$, shared across all points. As proposed in Thomas et al. [25], this is followed by a norm-based non-linearity. Attentive: We propose an extension of linear self-interaction, attentive self-interaction, combining self-interaction and nonlinearity. We replace the learned scalar weights $w^{\ell}_{c'c}$ with attention weights output from an MLP, shown in Eq. (13) ($\oplus$ means concatenation). These weights are SE(3)-invariant due to the invariance of inner products of features transforming under the same representation. Point cloud data often has information attached to points (node features) and connections between points (edge features), both of which we would like to pass as inputs into the first layer of the network. Node information can directly be incorporated via the tensors $\mathbf{f}_j$ in Eqs. (6) and (10). For incorporating edge information, note that $\mathbf{f}_j$ is part of multiple neighbourhoods. One can replace $\mathbf{f}_j$ with $\mathbf{f}_{ij}$ in Eq. (10). Now, $\mathbf{f}_{ij}$ can carry different information depending on which neighbourhood $\mathcal{N}_i$ we are currently performing attention over. In other words, $\mathbf{f}_{ij}$ can carry information both about node j and about edge ij. Alternatively, if the edge information is scalar, it can be incorporated into the weight matrices $W_V$ and $W_K$ as an input to the radial network (see step 2 in Fig. 2). We test the efficacy of the SE(3)-Transformer on three datasets, each testing different aspects of the model. The N-body problem is an equivariant task: rotation of the input should result in rotated predictions of locations and velocities of the particles. Next, we evaluate on a real-world object classification task. Here, the network is confronted with large point clouds of noisy data with symmetry only around the gravitational axis.
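For concreteness, a linear self-interaction layer is just a per-degree channel mixing, much like a 1x1 convolution. Below is our own minimal sketch; the dict-of-tensors feature layout ({degree: [points, channels, 2l+1]}) is an assumption made for illustration, not the paper's data structure.

```python
import torch
import torch.nn as nn

class LinearSelfInteraction(nn.Module):
    """Per-degree channel mixing: f_out[l][:, c'] = sum_c w[l][c', c] * f_in[l][:, c]."""

    def __init__(self, channels_in: int, channels_out: int, max_degree: int):
        super().__init__()
        # one weight matrix per representation degree l, shared across all points
        self.weights = nn.ParameterList(
            nn.Parameter(torch.randn(channels_out, channels_in) / channels_in ** 0.5)
            for _ in range(max_degree + 1)
        )

    def forward(self, feats):
        # feats: dict {degree l: tensor [n_points, channels_in, 2l+1]}
        # Mixing only the channel axis (never the 2l+1 components) preserves equivariance.
        return {l: torch.einsum("oc,ncd->nod", self.weights[l], f) for l, f in feats.items()}

layer = LinearSelfInteraction(channels_in=5, channels_out=5, max_degree=2)
feats = {l: torch.randn(100, 5, 2 * l + 1) for l in range(3)}
out = layer(feats)   # same structure, channels remixed independently per degree
```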
Finally, we test the SE(3)-Transformer on a molecular property regression task, which shines light on its ability to incorporate rich graph structures. We compare to publicly available, state-of-the-art results as well as a set of our own baselines. Specifically, we compare to the Set Transformer [14], a non-equivariant attention model, and Tensor Field Networks [25], which is similar to the SE(3)-Transformer but does not leverage attention. Similar to [24, 34], we measure the exactness of equivariance by applying uniformly sampled SO(3) transformations to input and output. The distance between the two, averaged over samples, yields the equivariance error. Note that, unlike in Sosnovik et al. [24], the error is not squared:
$$\Delta_{\mathrm{EQ}} = \frac{\big\| L_s \Phi(f) - \Phi(L_s f) \big\|_2}{\big\| L_s \Phi(f) \big\|_2},$$
where $L_s$ denotes the sampled transformation and $\Phi$ the network. In this experiment, we use an adaptation of the dataset from Kipf et al. [12]. Five particles each carry either a positive or a negative charge and exert repulsive or attractive forces on each other. The input to the network is the position of a particle in a specific time step, its velocity, and its charge. The task of the algorithm is then to predict the relative location and velocity 500 time steps into the future. We deliberately formulated this as a regression problem to avoid the need to predict multiple time steps iteratively. Even though it certainly is an interesting direction for future research to combine equivariant attention with, e.g., an LSTM, our goal here was to test our core contribution and compare it to related models. This task sets itself apart from the other two experiments by not being invariant but equivariant: when the input is rotated or translated, the output changes respectively (see Fig. 3). We trained an SE(3)-Transformer with 4 equivariant layers, each followed by an attentive self-interaction layer (details are provided in the Appendix). Table 1 shows quantitative results. Our model outperforms both an attention-based but not rotation-equivariant approach (Set Transformer) and an equivariant approach which does not leverage attention (Tensor Field). The equivariance error shows that our approach is indeed fully rotation equivariant up to the precision of the computations. ScanObjectNN is a recently introduced dataset for real-world object classification. The benchmark provides point clouds of 2902 objects across 15 different categories. We only use the coordinates of the points as input and object categories as training labels. We train an SE(3)-Transformer with equivariant layers using linear self-interaction, followed by max-pooling and an MLP. Interestingly, the task is not fully rotation invariant, in a statistical sense, as the objects are aligned with respect to the gravity axis. This results in a performance loss when deploying a fully SO(3) invariant model (see Fig. 4a). In other words: when looking at a new object, it helps to know where 'up' is. We create an SO(2) invariant version of our algorithm by additionally feeding the z-component as a type-0 field and the x-y position as an additional type-1 field (see Appendix). We dub this model SE(3)-Transformer +z. This way, the model can 'learn' which symmetries to adhere to by suppressing and promoting different inputs (compare Fig. 4a and Fig. 4b). In Table 2, we compare our model to the current state-of-the-art in object classification. Despite the dataset not playing to the strengths of our model (full SE(3)-invariance) and a much lower number of input points, the performance is competitive with models specifically designed for object classification.
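A sketch of this equivariance-error measurement (our own code, not the paper's; it assumes a model mapping [n, 3] type-1 inputs to [n, 3] type-1 outputs, as in the N-body task):

```python
import torch

def random_rotation() -> torch.Tensor:
    """Sample a random 3D rotation matrix (QR of a Gaussian matrix, determinant fixed to +1)."""
    q, r = torch.linalg.qr(torch.randn(3, 3))
    q = q * torch.sign(torch.diagonal(r))     # make the factorisation unique
    if torch.det(q) < 0:                      # reflections are not rotations
        q[:, 0] = -q[:, 0]
    return q

def equivariance_error(model, x: torch.Tensor, n_samples: int = 32) -> float:
    """Relative, non-squared distance between rotating the output and feeding the rotated input."""
    errs = []
    for _ in range(n_samples):
        R = random_rotation()
        rotated_output = model(x) @ R.t()     # rotate the model's type-1 output
        output_of_rotated = model(x @ R.t())  # run the model on the rotated input
        errs.append(torch.norm(rotated_output - output_of_rotated) / torch.norm(rotated_output))
    return torch.stack(errs).mean().item()
```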
We have presented an attention-based neural architecture designed specifically for point cloud data. This architecture is guaranteed to be robust to rotations and translations of the input, obviating the need for training-time data augmentation and ensuring stability to arbitrary choices of coordinate frame. The use of self-attention allows for anisotropic, data-adaptive filters, while the use of neighbourhoods enables scalability to large point clouds. We have also introduced the interpretation of the attention mechanism as a data-dependent nonlinearity, adding to the list of equivariant nonlinearities which we can use in equivariant networks. Furthermore, we provide pseudocode in the Appendix for a speed-up of the spherical harmonics computation of up to 3 orders of magnitude. This speed-up allowed us to train significantly larger versions of both the SE(3)-Transformer and the Tensor Field network [25] and to apply these models to real-world datasets. Our experiments showed that adding attention to a roto-translation-equivariant model consistently led to higher accuracy and increased training stability. Specifically for large neighbourhoods, attention proved essential for model convergence. On the other hand, compared to conventional attention, adding the equivariance constraints also increased performance in all of our experiments, while at the same time providing a mathematical guarantee for robustness with respect to rotations of the input data. The main contribution of the paper is a mathematically motivated attention mechanism which can be used for deep learning on point cloud based problems. We do not see a direct potential of negative impact to society. However, we would like to stress that this type of algorithm is inherently suited for classification and regression problems on molecules. The SE(3)-Transformer therefore lends itself to application in drug research. One concrete application we are currently investigating is to use the algorithm for early-stage suitability classification of molecules for inhibiting the reproductive cycle of the coronavirus. While research of this sort always requires intensive testing in wet labs, computer algorithms can be and are being used to filter out particularly promising compounds from large databases of millions of molecules. Groups: A group is an abstract mathematical concept. Formally, a group (G, •) consists of a set G and a binary composition operator • : G × G → G (typically we just use the symbol G to refer to the group). All groups must adhere to the following 4 axioms:
• Closure: g • h ∈ G for all g, h ∈ G
• Associativity: (g • h) • k = g • (h • k) for all g, h, k ∈ G
• Identity: There exists an element e ∈ G such that e • g = g • e = g for all g ∈ G
• Inverses: For each g ∈ G there exists a $g^{-1}$ ∈ G such that $g^{-1}$ • g = g • $g^{-1}$ = e
In practice, we omit writing the binary composition operator •, so we would write gh instead of g • h. Groups can be finite or infinite, countable or uncountable, compact or non-compact. Note that they are not necessarily commutative; that is, in general gh ≠ hg. Actions/Transformations: Groups are useful concepts because they allow us to describe the structure of transformations, also sometimes called actions. A transformation (operator) $T_g : \mathcal{X} \to \mathcal{X}$ is an injective map from a space into itself. It is parameterised by an element g of a group G. Transformations obey two laws:
• Closure: $T_g \circ T_h$ is a valid transformation for all g, h ∈ G
• Identity: There exists at least one element e ∈ G such that $T_e[x] = x$ for all $x \in \mathcal{X}$,
where $\circ$ denotes composition of transformations.
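As a small numerical illustration of these axioms (our own example, using SciPy, not material from the paper), 3D rotation matrices can be checked to satisfy them under matrix multiplication:

```python
import numpy as np
from scipy.spatial.transform import Rotation

# Two rotations about fixed axes.
g = Rotation.from_euler("z", 30, degrees=True).as_matrix()
h = Rotation.from_euler("x", 45, degrees=True).as_matrix()

assert np.allclose(np.linalg.det(g @ h), 1.0)     # closure: the product is again a rotation
assert np.allclose((g @ h) @ g, g @ (h @ g))      # associativity of composition
assert np.allclose(g @ np.eye(3), g)              # identity element
assert np.allclose(g @ g.T, np.eye(3))            # inverse = transpose for rotation matrices
```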
For the expression $T_g[x]$, we say that $T_g$ acts on x. It can also be shown that transformations are associative under composition. To codify the structure of a transformation, we note that due to closure we can always write
$$T_g \circ T_h = T_{gh}.$$
If for any $x, y \in \mathcal{X}$ we can always find a group element g such that $T_g[x] = y$, then we call $\mathcal{X}$ a homogeneous space. Homogeneous spaces are important concepts, because to each pair of points x, y we can always associate at least one group element. As written in the main body of the text, equivariance is a property of functions $f : \mathcal{X} \to \mathcal{Y}$. Just to recap, given a set of transformations $T_g : \mathcal{X} \to \mathcal{X}$ for $g \in G$, where G is an abstract group, a function $f : \mathcal{X} \to \mathcal{Y}$ is called equivariant if for every g there exists a transformation $S_g : \mathcal{Y} \to \mathcal{Y}$ such that
$$f(T_g[x]) = S_g[f(x)] \quad \text{for all } x \in \mathcal{X}.$$
If f is linear and equivariant, then it is called an intertwiner. Two important questions arise: 1) How do we choose $S_g$? 2) Once we have $(T_g, S_g)$, how do we solve for f? To answer these questions, we need to understand what kinds of $S_g$ are possible. For this, we review representations. Representations: A group representation $\rho : G \to GL(N)$ is a map from a group G to the set of $N \times N$ invertible matrices $GL(N)$. Critically, $\rho$ is a group homomorphism; that is, it satisfies $\rho(g_1 g_2) = \rho(g_1)\rho(g_2)$ for all $g_1, g_2 \in G$. Representations can be used as transformation operators, acting on N-dimensional vectors $\mathbf{x} \in \mathbb{R}^N$. For instance, for the group of 3D rotations, known as SO(3), we have that 3D rotation matrices, $\rho(g) = R_g$, act on (i.e., rotate) 3D vectors as
$$\rho(g)\,\mathbf{x} = R_g\,\mathbf{x}.$$
However, there are many more representations of SO(3) than just the 3D rotation matrices. Among representations, two representations $\rho$ and $\rho'$ of the same dimensionality are said to be equivalent if they can be connected by a similarity transformation $\rho'(g) = Q^{-1} \rho(g) Q$ for all $g \in G$. We also say that a representation is reducible if it can be written as
$$\rho(g) = Q^{-1} \big[ \rho_1(g) \oplus \rho_2(g) \big] Q.$$
If the representations $\rho_1$ and $\rho_2$ are not themselves reducible, then they are called irreducible representations of G, or irreps. In a sense, they are the atoms among representations, out of which all other representations can be constructed. Note that each irrep acts on a separate subspace, mapping vectors from that space back into it. We say that a subspace $\mathcal{X}' \subseteq \mathcal{X}$ is invariant under irrep $\rho'$ if $\{\rho'(g)\mathbf{x} \mid \mathbf{x} \in \mathcal{X}',\, g \in G\} \subseteq \mathcal{X}'$. Representation theory of SO(3): As it turns out, all linear representations of compact groups (over a field of characteristic zero), such as SO(3), can be decomposed into a direct sum of irreps, as
$$\rho(g) = Q^\top \Big[ \bigoplus_{J} D_J(g) \Big] Q, \qquad (20)$$
where Q is an orthogonal, $N \times N$, change-of-basis matrix [4]; and each $D_J$ for $J = 0, 1, 2, \dots$ is a $(2J+1) \times (2J+1)$ Wigner-D matrix, the irreps of SO(3). The Spherical Harmonics: The spherical harmonics $Y_J : S^2 \to \mathbb{C}^{2J+1}$ for $J \geq 0$ are square-integrable complex-valued functions on the sphere $S^2$. They have the satisfying property that they are rotated directly by the Wigner-D matrices as
$$Y_J(R_g\,\mathbf{x}) = D_J^*(g)\, Y_J(\mathbf{x}), \qquad (21)$$
where $D_J$ is the Jth Wigner-D matrix and $D_J^*$ is its complex conjugate. They form an orthonormal basis for (the Hilbert space of) square-integrable functions on the sphere $L^2(S^2)$, with inner product given as
$$\langle f, h \rangle_{S^2} = \int_{S^2} f(\mathbf{x})^*\, h(\mathbf{x})\, \mathrm{d}\mathbf{x}.$$
So $\langle Y_{Jm}, Y_{J'm'} \rangle_{S^2} = \delta_{JJ'}\delta_{mm'}$, where $Y_{Jm}$ is the mth element of $Y_J$. We can express any function in $L^2(S^2)$ as a linear combination of spherical harmonics,
$$f(\mathbf{x}) = \sum_{J \geq 0} \mathbf{f}_J^\top Y_J(\mathbf{x}),$$
where each $\mathbf{f}_J$ is a vector of coefficients of length $2J + 1$. And in the opposite direction, we can retrieve the coefficients as
$$f_{Jm} = \langle Y_{Jm}, f \rangle_{S^2},$$
following from the orthonormality of the spherical harmonics. This is in fact a Fourier transform on the sphere, and the vectors $\mathbf{f}_J$ can be considered Fourier coefficients.
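For illustration (our own snippet, not from the paper), the orthonormality of the complex spherical harmonics can be checked numerically with SciPy's scipy.special.sph_harm by simple quadrature over the sphere; the approximation quality depends only on the grid resolution chosen here.

```python
import numpy as np
from scipy.special import sph_harm

# Riemann-sum quadrature over the sphere; dA = sin(polar) * dpolar * dazim.
azim = np.linspace(0, 2 * np.pi, 200, endpoint=False)   # scipy's `theta` (azimuthal angle)
polar = np.linspace(0, np.pi, 200)                      # scipy's `phi` (polar angle)
A, P = np.meshgrid(azim, polar, indexing="ij")
dA = (azim[1] - azim[0]) * (polar[1] - polar[0]) * np.sin(P)

def inner(J1, m1, J2, m2):
    """Approximate <Y_{J1 m1}, Y_{J2 m2}> on the sphere."""
    Y1 = sph_harm(m1, J1, A, P)
    Y2 = sph_harm(m2, J2, A, P)
    return np.sum(np.conj(Y1) * Y2 * dA)

print(abs(inner(2, 1, 2, 1)))    # close to 1: the basis is orthonormal
print(abs(inner(2, 1, 3, 1)))    # close to 0: different J
print(abs(inner(2, 1, 2, -1)))   # close to 0: different m
```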
Critically, we can represent rotated functions by transforming their Fourier coefficients with the corresponding Wigner-D matrices:
$$f(R_g^{-1}\mathbf{x}) = \sum_{J \geq 0} \mathbf{f}_J^\top\, Y_J(R_g^{-1}\mathbf{x}) = \sum_{J \geq 0} \big[D_J(g)\,\mathbf{f}_J\big]^\top Y_J(\mathbf{x}),$$
where we used Eq. (21) and the unitarity of the Wigner-D matrices. The Clebsch-Gordan Decomposition: In the main text we introduced the Clebsch-Gordan coefficients. These are used in the construction of the equivariant kernels. They arise in the situation where we have a tensor product of Wigner-D matrices, which as we will see is part of the equivariance constraint on the form of the equivariant kernels. In representation theory a tensor product of representations is also a representation, but since it is not an easy object to work with, we seek to decompose it into a direct sum of irreps, which are easier. This decomposition is of the form of Eq. (20), written
$$D_k(g) \otimes D_\ell(g) = Q^{\ell k\top} \Big[ \bigoplus_{J=|k-\ell|}^{k+\ell} D_J(g) \Big] Q^{\ell k}.$$
In this specific instance, the change-of-basis matrices $Q^{\ell k}$ are given the special name of the Clebsch-Gordan coefficients. These can be found in many mathematical physics libraries. In Tensor Field Networks [25] and 3D Steerable CNNs [33], the authors solve for the intertwiners between SO(3) equivariant point clouds. Here we run through the derivation again in our own notation. We begin with a point cloud $f(\mathbf{x}) = \sum_j \mathbf{f}_j\, \delta(\mathbf{x} - \mathbf{x}_j)$, where $\mathbf{f}_j$ is an equivariant point feature. Let's say that $\mathbf{f}_j$ is a type-k feature, which we write as $\mathbf{f}^k_j$ to remind ourselves of the fact. Now say we perform a convolution $*$ with kernel $W^{\ell k} : \mathbb{R}^3 \to \mathbb{R}^{(2\ell+1)\times(2k+1)}$, which maps from type-k features to type-$\ell$ features. Then
$$\mathbf{f}^{\ell}_{\mathrm{out},i} = \big[W^{\ell k} * f^k\big](\mathbf{x}_i) = \sum_j W^{\ell k}(\mathbf{x}_j - \mathbf{x}_i)\, \mathbf{f}^{k}_j. \qquad (32)$$
Now let's apply the equivariance condition to this expression: rotating the point cloud maps $\mathbf{x}_j \mapsto R_g\mathbf{x}_j$ and $\mathbf{f}^k_j \mapsto D_k(g)\,\mathbf{f}^k_j$, so the convolution becomes
$$\sum_j W^{\ell k}\big(R_g(\mathbf{x}_j - \mathbf{x}_i)\big)\, D_k(g)\, \mathbf{f}^{k}_j.$$
Now we notice that, after undoing the output rotation with $D_\ell(g)^{-1}$, this expression should also be equal to Eq. (32), which is the convolution with an unrotated point cloud. Since this must hold for arbitrary features, we end up at
$$W^{\ell k}(R_g^{-1}\mathbf{x}) = D_\ell(g)^{-1}\, W^{\ell k}(\mathbf{x})\, D_k(g),$$
which is sometimes referred to as the kernel constraint. To solve the kernel constraint, we notice that it is a linear equation and that we can rearrange it as
$$\mathrm{vec}\big(W^{\ell k}(R_g^{-1}\mathbf{x})\big) = \big[D_k(g) \otimes D_\ell(g)\big]^\top \mathrm{vec}\big(W^{\ell k}(\mathbf{x})\big),$$
where we used the identity $\mathrm{vec}(AXB) = (B^\top \otimes A)\,\mathrm{vec}(X)$ and the fact that the Wigner-D matrices are orthogonal. Using the Clebsch-Gordan decomposition we rewrite this as
$$\mathrm{vec}\big(W^{\ell k}(R_g^{-1}\mathbf{x})\big) = Q^{\ell k\top} \Big[ \bigoplus_J D_J(g)^\top \Big] Q^{\ell k}\, \mathrm{vec}\big(W^{\ell k}(\mathbf{x})\big).$$
Lastly, we can left-multiply both sides by $Q^{\ell k}$ and denote $\eta^{\ell k}(\mathbf{x}) := Q^{\ell k}\,\mathrm{vec}(W^{\ell k}(\mathbf{x}))$, noting that the Clebsch-Gordan matrices are orthogonal. Thus we have that $\eta^{\ell k}_J(R_g^{-1}\mathbf{x})$, the Jth subvector of $\eta^{\ell k}(R_g^{-1}\mathbf{x})$, is subject to the constraint
$$\eta^{\ell k}_J(R_g^{-1}\mathbf{x}) = D_J(g)^\top\, \eta^{\ell k}_J(\mathbf{x}),$$
which is exactly the transformation law for the spherical harmonics from Eq. (21), since $D_J(g)^\top = D_J^*(g^{-1})$ for unitary representations. Thus one way in which $W^{\ell k}(\mathbf{x})$ can be constructed is
$$\mathrm{vec}\big(W^{\ell k}(\mathbf{x})\big) = Q^{\ell k\top} \Big[ \bigoplus_{J=|k-\ell|}^{k+\ell} Y_J\big(\mathbf{x}/\|\mathbf{x}\|\big) \Big].$$
One of the core operations in the SE(3)-Transformer is multiplying a feature vector $\mathbf{f}$, which transforms according to SO(3), with a matrix W while preserving equivariance:
$$W(R_g\mathbf{x})\,\rho_{\mathrm{in}}(g)\,\mathbf{f} = \rho_{\mathrm{out}}(g)\, W(\mathbf{x})\,\mathbf{f}.$$
Here, as in the previous section, we showed how such a matrix W could be constructed when mapping between features of type-k and type-$\ell$, where $\rho_{\mathrm{in}}(g)$ is a block diagonal matrix of type-k Wigner-D matrices and similarly $\rho_{\mathrm{out}}(g)$ is made of type-$\ell$ Wigner-D matrices. W is dependent on the relative position $\mathbf{x}$ and underlies the linear equivariance constraints, but it also has learnable components, which we did not show in the previous section. In this section, we show how such a matrix is constructed in practice. Previously we showed that
$$\mathrm{vec}\big(W^{\ell k}(\mathbf{x})\big) = Q^{\ell k\top} \Big[ \bigoplus_{J=|k-\ell|}^{k+\ell} Y_J\big(\mathbf{x}/\|\mathbf{x}\|\big) \Big]$$
is an equivariant mapping between vectors of types k and $\ell$. In practice, we have multiple input vectors $\{\mathbf{f}^k_c\}_c$ of type-k and multiple output vectors of type-$\ell$. For simplicity, however, we ignore this and pretend we only have a single input and single output. Note that $W^{\ell k}$ as written above has no learnable components.
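The vectorisation identity used in this rearrangement is easy to verify numerically (our own sanity check, with vec taken column-major):

```python
import numpy as np

# vec(A X B) == (B^T kron A) vec(X), with vec() stacking columns (Fortran order).
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
X = rng.standard_normal((4, 5))
B = rng.standard_normal((5, 2))

vec = lambda M: M.flatten(order="F")        # column-stacking vectorisation
lhs = vec(A @ X @ B)
rhs = np.kron(B.T, A) @ vec(X)
assert np.allclose(lhs, rhs)
```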
Note that the kernel constraint only acts in the angular direction, but not in the radial direction, so we can introduce scalar radial functions $\varphi^{\ell k}_J : \mathbb{R}_{\geq 0} \to \mathbb{R}$ (one for each J), such that
$$\mathrm{vec}\big(W^{\ell k}(\mathbf{x})\big) = \sum_{J=|k-\ell|}^{k+\ell} \varphi^{\ell k}_J(\|\mathbf{x}\|)\; Q^{\ell k}_J\, Y_J\big(\mathbf{x}/\|\mathbf{x}\|\big).$$
These radial functions $\varphi^{\ell k}_J(\|\mathbf{x}\|)$ act as an independent, learnable scalar factor for each degree J. The vectorised matrix has dimensionality $(2\ell+1)(2k+1)$. We can unvectorise the above, yielding
$$W^{\ell k}(\mathbf{x}) = \sum_{J=|k-\ell|}^{k+\ell} \varphi^{\ell k}_J(\|\mathbf{x}\|)\; \mathrm{unvec}\Big(Q^{\ell k}_J\, Y_J\big(\mathbf{x}/\|\mathbf{x}\|\big)\Big),$$
where $Q^{\ell k}_J$ is a $(2\ell+1)(2k+1) \times (2J+1)$ slice from $Q^{\ell k}$, corresponding to spherical harmonic $Y_J$. As we showed in the main text, we can also rewrite the unvectorised Clebsch-Gordan-spherical harmonic matrix-vector product as
$$\mathrm{unvec}\big(Q^{\ell k}_J\, Y_J(\mathbf{x})\big) = \sum_{m=-J}^{J} Y_{Jm}(\mathbf{x})\, Q^{\ell k}_{Jm},$$
where $Q^{\ell k}_{Jm}$ is a $(2\ell+1) \times (2k+1)$ matrix of Clebsch-Gordan coefficients (c.f. Eq. (8)). In contrast to Weiler et al. [33], we do not voxelise space and therefore $\mathbf{x}$ will be different for each pair of points in each point cloud. However, the same $Y_J(\mathbf{x})$ will be used multiple times in the network and even multiple times in the same layer. Hence, precomputing them at the beginning of each forward pass for the entire network can significantly speed up the computation. The Clebsch-Gordan coefficients do not depend on the relative positions and can therefore be precomputed once and stored on disk. Multiple libraries exist which approximate those coefficients numerically. (Fig. 5 caption: to compute $P^{-1}_3(x)$, we compute $P^1_3(x)$, for which we need $P^1_2(x)$ and $P^1_1(x)$; we store each intermediate computation, speeding up average computation time by a factor of ∼10 on CPU.) We wrote our own spherical harmonics library in Pytorch, which can generate spherical harmonics on the GPU. We found this critical to being able to run the SE(3)-Transformer and Tensor Field network baselines in a reasonable time. This library is accurate to within machine precision against the scipy counterpart scipy.special.sph_harm and is 10× faster on CPU and 100-1000× on GPU. Here we outline our method to generate them. The tesseral/real spherical harmonics, with polar angle $\theta$ and azimuthal angle $\phi$, are given as
$$Y_{Jm}(\theta, \phi) = \sqrt{\frac{\big(2 - \mathbb{I}[m=0]\big)(2J+1)}{4\pi}\,\frac{(J-|m|)!}{(J+|m|)!}}\; P^{|m|}_J(\cos\theta)\; \begin{cases} \cos(m\phi) & m \geq 0 \\ \sin(|m|\phi) & m < 0 \end{cases},$$
where $P^m_J$ are the associated Legendre polynomials (ALPs). We make use of the following recursion relations in the computation of the ALPs:
$$P^m_m(x) = (-1)^m\, (2m-1)!!\, (1 - x^2)^{m/2}, \qquad (48)$$
$$P^{-m}_J(x) = (-1)^m\, \frac{(J-m)!}{(J+m)!}\, P^m_J(x), \qquad (49)$$
$$(J - m)\, P^m_J(x) = x\,(2J - 1)\, P^m_{J-1}(x) - (J + m - 1)\, P^m_{J-2}(x), \qquad (50)$$
where the semifactorial is defined as $x!! = x(x-2)(x-4)\cdots$, and $\mathbb{I}$ is the indicator function. These relations are helpful because they define a recursion. To understand how we recurse, we consider an example. Fig. 5 shows the space of J and m. The black vertices represent a particular ALP; for instance, we have highlighted $P^{-1}_3(x)$. When m < 0, we can use Eq. (49) to compute $P^{-1}_3(x)$ from $P^1_3(x)$. We can then use Eq. (50) to compute $P^1_3(x)$ from $P^1_2(x)$ and $P^1_1(x)$. $P^1_2(x)$ can also be computed from Eq. (50), and the boundary value $P^1_1(x)$ can be computed directly using Eq. (48). Crucially, all intermediate ALPs are stored for reuse. Say we wanted to compute $P^{-1}_4(x)$; then we could use Eq. (49) to find it from $P^1_4(x)$, which can be recursed from the stored values $P^1_3(x)$ and $P^1_2(x)$, without needing to recurse down to the boundary. We intend to make the code for the spherical harmonics computation publicly available. We found that the attention mechanism (i.e., the difference between the SE(3)-Transformer and Tensor Field Networks) significantly increased training stability. As a result, whenever we swapped out the attention mechanism for a convolution to retrieve the Tensor Field network baseline, we had to decrease the model size to obtain stable training. However, we would like to stress that all the Tensor Field networks we trained were significantly bigger than in the original paper [25], mostly enabled by the faster computation of the spherical harmonics. For the ablation study in Fig. 4, we trained networks with 4 hidden equivariant layers with 5 channels each, and up to representation degree 2.
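The paper provides pseudocode for this recursion in its appendix; below is our own illustrative, memoised Python sketch of Eqs. (48)-(50), not the authors' released implementation.

```python
import math
from functools import lru_cache

@lru_cache(maxsize=None)
def alp(J: int, m: int, x: float) -> float:
    """Associated Legendre polynomial P_J^m(x) via the recursions (48)-(50), memoised."""
    if m < 0:
        # Eq. (49): map negative m to its positive counterpart
        return (-1) ** (-m) * math.factorial(J + m) / math.factorial(J - m) * alp(J, -m, x)
    if m > J:
        return 0.0                           # P_J^m vanishes for m > J
    if J == m:
        # Eq. (48): boundary value, using the semifactorial (2m-1)!!
        semifact = math.prod(range(2 * m - 1, 0, -2)) if m > 0 else 1
        return (-1) ** m * semifact * (1 - x * x) ** (m / 2)
    # Eq. (50): three-term recursion in J; intermediate values are cached by lru_cache
    return (x * (2 * J - 1) * alp(J - 1, m, x) - (J + m - 1) * alp(J - 2, m, x)) / (J - m)

print(alp(3, -1, 0.3))   # reuses the cached P_2^1(0.3) and P_1^1(0.3) on later calls
```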
This results in a hidden feature size per point of 5 · (1 + 3 + 5) = 45. We used 200 points of the point cloud and neighbourhood size 40. For the Tensor Field network baseline, in order to achieve stable training, we used a smaller model with 3 instead of 5 channels, 100 input points and neighbourhood size 10, but with representation degrees up to 3. We used 1 head per attention mechanism, yielding one attention weight for each pair of points but shared across all channels and degrees (for an implementation of multi-head attention, see Appendix D.3). For the query embedding, we used the identity matrix. For the key embedding, we used a square equivariant matrix preserving the number of degrees and channels per degree. For the quantitative comparison to the state-of-the-art in Table 2, we used 128 input points and neighbourhood size 10 for both the Tensor Field network baseline and the SE(3)-Transformer. We used farthest point sampling with a random starting point to retrieve the 128 points from the overall point cloud. We used degrees up to 3 and 5 channels per degree, which we again had to reduce to 3 channels for the Tensor Field network to obtain stable training. We used a norm-based non-linearity for the Tensor Field network (as in [25]) and no extra non-linearity (beyond the softmax in the self-attention algorithm) for the SE(3)-Transformer. For all experiments, the final layer of the equivariant encoder maps to 64 channels of degree 0 representations. This yields a 64-dimensional SE(3) invariant representation per point. Next, we pool over the point dimension, followed by an MLP with one hidden layer of dimension 64, a ReLU, and a 15-dimensional output with a cross entropy loss. We trained for 60000 steps with batch size 10. We used the Adam optimizer [11] with a starting learning rate of 1e-2 and a reduction of the learning rate by 70% every 15000 steps. Training took up to 2 days on a system with 4 CPU cores, 30 GB of RAM, and an NVIDIA GeForce GTX 1080 Ti GPU. The inputs to the Tensor Field network and the SE(3)-Transformer are the relative x-y-z positions of each point w.r.t. their neighbours. To guarantee equivariance, these inputs are provided as fields of degree 1. For the '+z' versions, however, we deliberately break the SE(3) equivariance by providing the relative z-positions as additional scalar fields (i.e. degree 0), as well as the relative x-y positions as a degree-1 field (where the z-component is set to 0). DeepSet baseline: We originally replicated the implementation proposed in [40] for their object classification experiment on ModelNet40 [36]. However, most likely due to the relatively small number of objects in the ScanObjectNN dataset, we found that reducing the model size helped the performance significantly. The reported model had 128 units per hidden layer (instead of 256) and no dropout, but the same number of layers and type of non-linearity as in [40]. Set Transformer baseline: We used the same architecture as [14] in their object classification experiment on ModelNet40 [36], with an ISAB (induced set attention block)-based encoder followed by PMA (pooling by multihead attention) and an MLP. Following Kipf et al. [12], we simulated trajectories for 5 charged, interacting particles. Instead of a 2d simulation setup, we considered a 3d setup. Positive and negative charges were drawn as Bernoulli trials (p = 0.5).
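Farthest point sampling with a random starting point, as used above to subsample the input clouds, can be sketched as follows (our own illustrative implementation):

```python
import torch

def farthest_point_sampling(x: torch.Tensor, n_samples: int) -> torch.Tensor:
    """Greedily pick n_samples point indices from x [n, 3], maximising mutual distance."""
    n = x.shape[0]
    chosen = torch.zeros(n_samples, dtype=torch.long)
    chosen[0] = torch.randint(n, (1,))                                  # random starting point
    min_dist = torch.cdist(x[chosen[0]].unsqueeze(0), x).squeeze(0)     # [n] distance to chosen set
    for i in range(1, n_samples):
        chosen[i] = torch.argmax(min_dist)                              # farthest from all chosen so far
        new_dist = torch.cdist(x[chosen[i]].unsqueeze(0), x).squeeze(0)
        min_dist = torch.minimum(min_dist, new_dist)                    # update distance to the chosen set
    return chosen

x = torch.randn(2048, 3)
idx = farthest_point_sampling(x, 128)    # 128 well-spread input points
```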
We used the provided code base https://github.com/ethanfetaya/nri with the following modifications: while we randomly sampled initial positions inside a $[-5, 5]^3$ box, we removed the bounding box during the simulation. We generated 5k simulation samples for training and 1k for testing. Instead of phrasing it as a time-series task, we posed it as a regression task: the input data is positions and velocities at a random time step as well as the signs of the charges. The labels (which the algorithm is learning to predict) are the positions and velocities 500 simulation time steps into the future. Training Details: We trained each model for 100,000 steps with batch size 128 using an Adam optimizer [11]. We used a fixed learning rate throughout training and conducted a separate hyperparameter search for each model to find a suitable learning rate. We trained an SE(3)-Transformer with 4 equivariant layers, where the hidden layers had representation degrees {0, 1, 2, 3} and 3 channels per degree. The input is handled as two type-1 fields (for positions and velocities) and one type-0 field (for charges). The learning rate was set to 3e-3. Each layer included attentive self-interaction. We used 1 head per attention mechanism, yielding one attention weight for each pair of points but shared across all channels and degrees (for an implementation of multi-head attention, see Appendix D.3). For the query embedding, we used the identity matrix. For the key embedding, we used a square equivariant matrix preserving the number of degrees and channels per degree. Baseline Architectures: All our baselines fulfill permutation invariance (with respect to the ordering of input points), but only the Tensor Field network and the linear baseline are SE(3) equivariant. For the Tensor Field Network [25] baseline, we used the same hyperparameters as for the SE(3)-Transformer but with a linear self-interaction and an additional norm-based nonlinearity in each layer, as in Thomas et al. [25]. For the DeepSet [40] baseline, we used 3 fully connected layers, a pooling layer, and two more fully connected layers with 64 units each. All fully connected layers act pointwise. The pooling layer uses max pooling to aggregate information from all points, but concatenates this with a skip connection for each point. Each hidden layer was followed by a LeakyReLU. The learning rate was set to 1e-3. For the Set Transformer [14], we used 4 self-attention blocks with 64 hidden units and 4 heads each. For each point this was then followed by a fully connected layer (64 units), a LeakyReLU, and another fully connected layer. The learning rate was set to 3e-4. For the linear baseline, we simply propagated the particles linearly according to the simulation hyperparameters. The linear baseline can be seen as removing the interactions between particles from the prediction. Any performance improvement beyond the linear baseline can therefore be interpreted as an indication that relational reasoning is being performed. The QM9 regression dataset [19] is a publicly available chemical property prediction task consisting of 134k small drug-like organic molecules with up to 29 atoms per molecule. There are 5 atomic species (Hydrogen, Carbon, Oxygen, Nitrogen, and Fluorine) in a molecular graph, connected by chemical bonds of 4 types (single, double, triple, and aromatic bonds). 'Positions' of each atom, measured in ångströms, are provided. We used the exact same train/validation/test splits as Anderson et al. [1] of sizes 100k/18k/13k.
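The linear baseline described above amounts to constant-velocity extrapolation with no particle interactions; a minimal sketch (the time step dt below is an illustrative value, not taken from the paper):

```python
import numpy as np

def linear_baseline(pos, vel, n_steps: int, dt: float):
    """Propagate particles without interactions: x(t + n*dt) = x(t) + n*dt*v(t), v unchanged."""
    return pos + n_steps * dt * vel, vel

pos = np.random.uniform(-5, 5, size=(5, 3))    # 5 particles sampled in the [-5, 5]^3 box
vel = np.random.randn(5, 3)
pred_pos, pred_vel = linear_baseline(pos, vel, n_steps=500, dt=1e-3)   # dt is illustrative
```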
The architecture we used is shown in Table 4. It consists of 7 multihead attention layers interspersed with norm nonlinearities, followed by a TFN layer, max pooling, and two linear layers separated by a ReLU. For each attention layer, shown in Fig. 6, we embed the input to half the number of feature channels before applying multiheaded attention [28]. Multiheaded attention is a variation of attention where we partition the queries, keys, and values into H attention heads. So if our embeddings have dimensionality (4, 16) (denoting 4 feature types with 16 channels each) and we use H = 8 attention heads, then we partition the embeddings to shape (4, 2). We then run the attention mechanism on each of the 8 sets of shape (4, 2) queries, keys, and values individually, and then concatenate the results into a single vector of the original shape (4, 16). The keys and queries are edge embeddings, and thus the embedding matrices are of TFN type (c.f. Eq. (8)). For TFN-type layers, the radial functions are learnable maps. For these we used neural networks with the architecture shown in Table 5 (ending in a linear layer with output dimension $d_{\mathrm{out}} \cdot C_{\mathrm{in}} \cdot C_{\mathrm{out}}$). For the norm nonlinearities [35], we apply a nonlinearity to the norm of each feature vector, with layer norm (LN) [2] applied across all features within a feature type, and rescale the feature vector accordingly. For the TFN baseline, we used the exact same architecture, but we replaced each of the multiheaded attention blocks with a TFN layer with the same output shape. The input to the network is a sparse molecular graph, with edges represented by the molecular bonds. The node embeddings are a 6-dimensional vector composed of a 5-dimensional one-hot embedding of the 5 atomic species and a 1-dimensional integer embedding for the number of protons per atom. The edge embeddings are a 5-dimensional vector consisting of a 4-dimensional one-hot embedding of the bond type and a positive scalar for the Euclidean distance between the two atoms at the ends of the bond. For each regression target, we normalised the values by subtracting the mean and dividing by the standard deviation of the training set. We trained for 50 epochs using Adam [11] at initial learning rate 1e-3 and a single-cycle cosine rate-decay to learning rate 1e-4. The batch size was 32, but for the TFN baseline we used batch size 16 to fit the model in memory. We show results on the 6 regression tasks not requiring thermochemical energy subtraction in Table 3. As is common practice, we optimised architectures and hyperparameters on $\varepsilon_{\mathrm{HOMO}}$ and retrained each network on the other tasks. Training took about 2.5 days on an NVIDIA GeForce GTX 1080 Ti GPU with 4 CPU cores and 15 GB of RAM. Across experiments on different datasets with the SE(3)-Transformer, we made the observation that the number of representation degrees has a significant but saturating impact on performance. A big improvement was observed when switching from degrees {0, 1} to {0, 1, 2}. Adding type-3 latent representations gave small improvements; further representation degrees did not change the performance of the model. However, higher representation degrees have a significant impact on memory usage and computation time. We therefore recommend representation degrees up to 2 when computation time and memory usage are a concern, and 3 otherwise.
References:
[1] Cormorant: Covariant molecular neural networks.
[2] Layer normalization.
[3] Three-dimensional point cloud classification in realtime using convolutional neural networks.
[4] Engineering applications of noncommutative harmonic analysis: with emphasis on rotation and motion groups.
[5] Steerable CNNs. International Conference on Learning Representations (ICLR).
[6] Generalizing convolutional neural networks for equivariance to Lie groups on arbitrary continuous data.
[7] End-to-end recurrent multi-object tracking and prediction with relational reasoning.
[8] Neural message passing for quantum chemistry.
[9] Wavelet scattering regression of quantum chemical energies.
[10] VAIN: Attentional multi-agent predictive modeling.
[11] Adam: A method for stochastic optimization.
[12] Neural relational inference for interacting systems.
[13] N-body networks: a covariant hierarchical neural network architecture for learning atomic potentials.
[14] Set Transformer: A framework for attention-based permutation-invariant neural networks.
[15] A structured self-attentive sentence embedding.
[16] Stand-alone self-attention in vision models.
[17] PointNet: Deep learning on point sets for 3D classification and segmentation.
[18] PointNet++: Deep hierarchical feature learning on point sets in a metric space.
[19] Quantum chemistry structures and properties of 134 kilo molecules.
[20] Global-local bidirectional reasoning for unsupervised representation learning of 3D point clouds.
[21] Attentive group equivariant convolutional networks.
[22] SchNet: A continuous-filter convolutional neural network for modeling quantum interactions.
[23] Self-attention with relative position representations.
[24] Scale-equivariant steerable networks.
[25] Tensor field networks: Rotation- and translation-equivariant neural networks for 3D point clouds.
[26] Revisiting point cloud classification: A new benchmark dataset and classification model on real-world data.
[27] Relational neural expectation maximization: Unsupervised discovery of objects and their interactions.
[28] Attention is all you need.
[29] Graph attention networks. International Conference on Learning Representations (ICLR).
[30] On the limitations of representing functions on sets.
[31] Non-local neural networks.
[32] Dynamic graph CNN for learning on point clouds.
[33] 3D steerable CNNs: Learning rotationally equivariant features in volumetric data.
[34] Deep scale-spaces: Equivariance over scale.
[35] Harmonic networks: Deep translation and rotation equivariance.
[36] 3D ShapeNets: A deep representation for volumetric shapes.
[37] Attentional ShapeContextNet for point cloud recognition.
[38] Deep learning on point sets with parameterized convolutional filters. European Conference on Computer Vision (ECCV).
[39] Modeling point clouds with self-attention and Gumbel subset sampling.
[40] Deep Sets.
We would like to express our gratitude to the Bosch Center for Artificial Intelligence and Koninklijke Philips N.V. for their support and contribution to open research in publishing our paper.