key: cord-0116485-ipwp2hic
authors: Murray, Dakota; Yoon, Jisung; Kojaku, Sadamori; Costas, Rodrigo; Jung, Woo-Sung; Milojević, Staša; Ahn, Yong-Yeol
title: Unsupervised embedding of trajectories captures the latent structure of mobility
date: 2020-12-04
journal: nan
DOI: nan
sha: c2db40315fe4c41b7f5d301c4ff57a3b7bc37059
doc_id: 116485
cord_uid: ipwp2hic

Human mobility drives major societal phenomena including epidemics, economies, and innovation. Historically, mobility was constrained by geographic distance; in the globalizing world, however, language, culture, and history are increasingly important. We propose using the neural embedding model word2vec for studying mobility and capturing its complexity. word2vec is shown to be mathematically equivalent to the gravity model of mobility, and using three human trajectory datasets, we demonstrate that it encodes nuanced relationships between locations into a vector space, providing a measure of effective distance that outperforms baselines. Focusing on the case of scientific mobility, we show that embeddings uncover cultural, linguistic, and hierarchical relationships at multiple levels of granularity. Connecting neural embeddings to the gravity model opens up new avenues for the study of mobility.

How far apart are two places? The question is surprisingly hard to answer when it involves human mobility. Although geographic distance has historically constrained human movements, it is becoming less relevant in a world connected by rapid transit and global airline networks. For instance, a person living in Australia is more likely to migrate to the United Kingdom, a far-away country with a similar language and culture, than to a much closer country such as Indonesia. 1 Similarly, a student in South Korea is more likely to attend a university in Canada than one in North Korea.
2 Although geographic distance has been used as the most prominent basis for models of mobility, such as the Gravity 3 and Radiation 4 models, there have been attempts to define alternative notions of distance, or functional distances, 5-7 from real-world data or a priori relationships between geographic entities. Yet, functional distances are often low-resolution, computed at the level of countries rather than regions, cities, or organizations, and have focused on only a single facet of mobility at a time, whereas real-world mobility is multi-faceted, influenced simultaneously by geography, language, culture, history, and economic opportunity. A low-dimensional distance alone cannot represent the multitude of inter-related factors that drive mobility. Networks offer a way to represent many dimensions of mobility, yet edges only encode simple relationships between connected entities. Capturing the complexity of mobility requires moving beyond simple functional distances and networks to learning high-dimensional landscapes of mobility that incorporate its many facets into a single fine-grained and continuous representation.

Here, we apply a neural embedding framework to real-world mobility trajectories and demonstrate that it can encode the complex landscape of human mobility into a dense and continuous vector-space representation, from which we can not only derive a meaningful functional distance between locations but also probe relationships based on culture, language, and even prestige, along with geography. We embed trajectories from three massive datasets: U.S. passenger flight itinerary records, Korean accommodation reservations, and a dataset of scientists' career mobility between organizations captured in bibliometric records (detailed descriptions are available in the Methods).
The flight itinerary data, from the Airline Origin and Destination Survey, consists of records of more than 300 million itineraries between 1993 and 2020 documenting domestic flights between 828 airports in the United States. A trajectory is constructed for each passenger flight itinerary, forming an ordered sequence of unique identifiers of the origin and destination airports. The Korean accommodation reservation data consist of customer reservation histories spanning 2018 through 2020 for 1,038 unique accommodation locations in Seoul, South Korea. A trajectory is constructed for each customer, containing the ordered sequence of accommodations they reserved over time. Finally, we use scientific mobility data that capture the affiliation trajectories of nearly 3 million scientists across ten years.

We focus in more detail on scientific mobility due to its richness and importance. Scientific mobility-which is a central driver of the globalized scientific enterprise 8, 9 and strongly related to innovation, 10, 11 impact, 12, 13 collaboration, 14 and the diffusion of knowledge 10, 15 -is not only an important topic in the Science of Science but also ideal for our study thanks to its well-known structural properties, such as the centrality of scientifically advanced countries and the strong prestige hierarchy. 16, 17 In spite of its importance, our understanding of scientific mobility has been limited by the sheer scope and complexity of the phenomenon, 17, 18 further confounded by the diminishing role of geography in shaping the landscape of scientific mobility. Trajectories of scientific mobility are constructed using more than three million name-disambiguated authors who were mobile-having more than one affiliation-between 2008 and 2019, as evidenced by their publications indexed in the Web of Science database (see Methods).
As a scientist's career progresses, they move between organizations or pick up additional (simultaneous) affiliations, forming affiliation trajectories (Fig. 1). Thus, the trajectories encode both migration and co-affiliation-the holding of multiple simultaneous affiliations involving the sharing of time and capital between locations-that is typical of scientific mobility. 12, 14

A vector-space embedding of locations (airports, accommodations, and organizations) is learned by using trajectories as input to the standard skip-gram with negative sampling, or word2vec, neural-network architecture (see Methods). This neural embedding model, originally designed for learning language models, 19 has been making breakthroughs by revealing novel insights into texts, 20-25 networks 26, 27 and trajectories. 28-33 It works under the notion that a good representation should facilitate prediction, learning a mapping between words that can predict a target word based on its context (surrounding words). The model is also computationally efficient, robust to noise, and can encode relations between entities as geometric relationships in the vector space. 22, 25, 34, 35 As a result, each location is encoded into a single vector representation, and vectors relate to one another based on the likelihood of the locations appearing adjacent to one another in the same trajectory.

To validate our approach, we evaluate the quality of the vector representations based on their performance in predicting real-world mobility flows using the gravity model framework. 3 The gravity model is a widely used mobility model 36-39 that connects the expected flux, $\hat{F}_{ij}$, between locations to their populations and distance:

$$\hat{F}_{ij} = \theta\, m_i\, m_j\, f(d_{ij}),$$

where $m_i$ is the population of location $i$, $f(d_{ij})$ is a decay function with respect to the distance $d_{ij}$ between locations $i$ and $j$, and $\theta$ is a constant estimated from data (see Methods).
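To make the training setup concrete, the following is a minimal, dependency-light sketch of skip-gram with negative sampling trained on synthetic trajectories (the study itself uses the standard word2vec implementation; the toy clusters, dimension, learning rate, and epoch count below are illustrative assumptions, not the paper's settings). Locations that appear in similar contexts should end up with similar in-vectors:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic trajectories: locations A-C mix with each other, D-F likewise,
# so each group shares contexts (a stand-in for real itineraries)
cluster1, cluster2 = ["A", "B", "C"], ["D", "E", "F"]
trajectories = [list(rng.choice(c, size=8))
                for _ in range(100) for c in (cluster1, cluster2)]

vocab = sorted({loc for t in trajectories for loc in t})
idx = {w: i for i, w in enumerate(vocab)}
dim, lr, n_neg = 8, 0.05, 2
U = rng.normal(scale=0.1, size=(len(vocab), dim))  # in-vectors
V = rng.normal(scale=0.1, size=(len(vocab), dim))  # out-vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for _ in range(10):  # epochs of plain SGD
    for traj in trajectories:
        for pos, target in enumerate(traj):
            for ctx_pos in (pos - 1, pos + 1):  # window size 1
                if not 0 <= ctx_pos < len(traj):
                    continue
                i, j = idx[target], idx[traj[ctx_pos]]
                g = 1.0 - sigmoid(U[i] @ V[j])  # positive (observed) pair
                U[i], V[j] = U[i] + lr * g * V[j], V[j] + lr * g * U[i]
                for k in rng.integers(0, len(vocab), n_neg):  # negatives
                    g = -sigmoid(U[i] @ V[k])
                    U[i], V[k] = U[i] + lr * g * V[k], V[k] + lr * g * U[i]

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

sim_within = cos(U[idx["A"]], U[idx["B"]])   # same cluster
sim_across = cos(U[idx["A"]], U[idx["D"]])   # different clusters
print(sim_within > sim_across)
```

After training, locations that share contexts sit closer in the vector space, which is the property the gravity-model evaluation below exploits.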
For the flight itinerary data, we use as the population the total number of unique passengers who passed through each airport; for the Korean accommodation reservation data, we use the total number of unique customers who booked with each accommodation; and for scientific mobility, we use the mean annual number of unique mobile and non-mobile authors who were affiliated with each organization. $\hat{F}_{ij}$, which is often referred to as the "expected flux", 4 is the expected frequency of the co-occurrence of locations $i$ and $j$ in the trajectories under the gravity model. The gravity model dictates that the expected flow is symmetric ($\hat{F}_{ij} = \hat{F}_{ji}$), is proportional to the locations' populations, $\hat{F}_{ij} \propto m_i m_j$, and decays as a function of their distance, $f(d_{ij})$. We define the distance function in terms of either the geographic distance between locations or their functional distance in the vector space, calculated as the cosine distance between their vectors and termed the embedding distance. The decay function $f(d)$ defines the effect of distance, and different decay functions can model fundamentally different mechanisms, 40 such as the cost function for a given distance and the spatial granularity of the observation. For geographic distance, we define $f(d)$ as the standard power-law function, and for the embedding distance, we use the exponential function, selected as the best performing in each case (Fig. S7 and Fig. S8). Though there is no qualitative difference, the embedding distance outperforms geographic distance regardless of the type of gravity model used (see Tables S3, S4, S5). The power-law decay function performs best with the geographic distance, likely owing to the function's suitability for large, complex, and scale-free spatial systems; 41 the performance of the embedding distance is similar across the functional forms, though best for the exponential form, which stems from an underlying connection between that function and word2vec that we discuss later.
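The decay-function comparison can be sketched as follows: generate flux from a known gravity law, then fit each candidate decay form by linear regression in log space and compare the explained variance (all masses, distances, and noise levels here are synthetic assumptions for illustration, not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 60
m = rng.lognormal(3, 1, n)                              # location "masses"
d = rng.uniform(0.05, 2.0, (n, n)); d = (d + d.T) / 2   # symmetric distances

# Synthetic flux generated from a gravity law with exponential decay + noise
F = m[:, None] * m[None, :] * np.exp(-3.0 * d) * rng.lognormal(0, 0.1, (n, n))

i, j = np.triu_indices(n, k=1)
y = np.log(F[i, j]) - np.log(m[i]) - np.log(m[j])  # isolate the decay term

def r_squared(x, y):
    # Ordinary least squares with intercept, returning R^2
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

r2_exp = r_squared(d[i, j], y)          # exponential decay: log F linear in d
r2_pow = r_squared(np.log(d[i, j]), y)  # power-law decay: log F linear in log d
print(round(r2_exp, 3), round(r2_pow, 3))
```

Because the synthetic flux was built with an exponential decay, the exponential fit explains more variance; on real data the same comparison selects the best decay form per distance type.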
We show that the embedding distance better predicts actual mobility flows than the geographic distance across three disparate datasets. In the case of flight itineraries, the embedding distance explains more than twice as much of the expected flux between airports ($R^2 = 0.51$, Fig. 2a) as does geographic distance ($R^2 = 0.22$). The embedding distance also produces better predictions of the actual flux between airports than does the geographic distance (Fig. 2b). In the case of Korean accommodation reservations, embedding distance better explains the expected flux ($R^2 = 0.57$, Fig. 2c) than does geographic distance ($R^2 = 0.25$), and predictions made using the embedding distance outperform those made with geographic distance (Fig. 2d). This performance is consistent in the case of scientific mobility: the embedding distance explains more than twice as much of the expected flux ($R^2 = 0.48$, Fig. 2e) as does the geographic distance ($R^2 = 0.22$), and predictions made using the embedding distance outperform those using the geographic distance (Fig. 2f). These patterns hold for the subsets of only domestic (within-country organization pairs, Fig. S7 and Fig. S9c) and only international mobility flows (across-country organization pairs, Fig. S9d). The embedding distance also outperforms alternative diffusion-based network distance measures, including personalized PageRank scores calculated from the underlying mobility network (Fig. S5, Fig. S11, Fig. S12). The embedding distance derived from neural embedding also explains more of the flux and better predicts mobility flows than simpler embedding baselines, such as distances derived from a singular-value decomposition and a Laplacian Eigenmap embedding 42 of the underlying location co-occurrence matrix, Levy's symmetric word2vec, 34 and even direct optimization of the gravity model (Fig. S5 and Tables S3, S4, S5).
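As a sketch of the simplest of these baselines, a truncated SVD of the (log-scaled) location co-occurrence matrix yields a linear embedding whose cosine distances can be plugged into the same gravity framework (the co-occurrence counts below are a random toy, not real data):

```python
import numpy as np

rng = np.random.default_rng(2)
# Toy symmetric co-occurrence counts between 6 locations
C = rng.integers(0, 20, (6, 6))
C = C + C.T

# Truncated SVD of the log-scaled matrix as a simple embedding baseline
M = np.log1p(C)
U, s, Vt = np.linalg.svd(M)
k = 3
emb = U[:, :k] * np.sqrt(s[:k])  # rank-k embedding, one row per location

# Cosine distance between two locations in the SVD space
a, b = emb[0], emb[1]
cos_dist = 1 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(emb.shape)
```

Unlike word2vec, this baseline is a purely linear factorization of observed co-occurrences, which is one reason it underperforms the neural embedding in the comparisons above.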
In sum, our results demonstrate that, consistently and efficiently, the embedding distance better captures patterns of actual mobility than does the geographic distance.

word2vec and the gravity model

Figure 2: Embedding distance encodes functional distance and better predicts mobility in flights, accommodation reservations, and global scientific mobility. a. Embedding distance (cosine distance between organization vectors, top) better explains the expected flux of passengers between U.S. airports than does geographic distance (bottom). The red line is the line of best fit. Black dots are mean flux across binned distances. 99% confidence intervals are plotted for the mean flux in each bin based on a normal distribution. Correlation is calculated on the data in the log-log scale ($p < 0.0001$ across all fits). The lightness of each hex bin indicates the frequency of organization pairs within it. b. Predictions of flux between airport pairs made using embedding distance (top) outperform those made using geographic distance (bottom). Box plots show the distribution of actual flux for binned values of predicted flux. Box color corresponds to the degree to which the distribution overlaps with $y = x$; a perfect prediction yields all points on the black line. "RMSE" is the root-mean-squared error between the actual and predicted values. For Korean accommodation reservations, embedding distance better explains the expected flux than does geographic distance (c), and produces better predictions (d). Similarly, in the case of global scientific mobility, embedding distance explains the expected flux between organizations (e) and allows for better predictions (f) than geographic distance.

word2vec learns an embedding by estimating the probability that location $i$ has context $j$:

$$P(j \mid i) = \frac{\exp(u_i \cdot v_j)}{Z_i},$$

where the denominator $Z_i = \sum_{\ell \in \mathcal{A}} \exp(u_i \cdot v_\ell)$ is a normalization constant, and $\mathcal{A}$ is the set of all locations.
Although word2vec generates two embedding vectors, $u_i$ and $v_i$, referred to as the in-vector and out-vector, respectively, we follow convention and use the in-vector $u_i$ as the embedding of location $i$. Our experiments show that word2vec best explains real-world mobility flow when the window size is set to 1 (Fig. S4), with the flow predicted by

$$\hat{F}_{ij} \propto p(i)\, P(j \mid i),$$

where $p(i)$ is the fraction of location $i$ in the data. In general, calculating $Z_i$ is computationally expensive, and there are two common approximations: hierarchical softmax 43 and negative sampling. 19 Due to its simplicity and performance, negative sampling is the most widely used strategy, which we also adopt in our study. Although negative sampling is the most common approximation, it is a biased estimator 44, 45 and fits a different probability model. When taking this bias into account, word2vec with skip-gram and negative sampling fits a probability model given by

$$P(j \mid i) = \frac{p(j)^{\gamma} \exp(u_i \cdot v_j)}{Z_i},$$

where we redefine the normalization constant as $Z_i = \sum_{\ell \in \mathcal{A}} p(\ell)^{\gamma} \exp(u_i \cdot v_\ell)$. The parameter value $\gamma = 1$ is a special choice that ensures that, when the embedding dimension is sufficiently large, there exist optimal in-vectors and out-vectors such that $u_i = v_i$. 34 Setting $\gamma = 1$ and substituting $v_i = u_i$ into Eq. 4, the flow predicted by word2vec is given by

$$\hat{F}_{ij} \propto \frac{n(i)\, n(j)\, \exp(u_i \cdot u_j)}{Z_i},$$

where $n(i) = N p(i)$ is the frequency of location $i$ in the data, and $N = \sum_{\ell \in \mathcal{A}} n(\ell)$ is the sum of the frequencies of all locations. The flow is symmetric (i.e., $\hat{F}_{ij} = \hat{F}_{ji}$) because the skip-gram model neglects whether the context appears before or after the target in the trajectory. If we swap $i$ and $j$ in Eq. 5, the numerator remains the same but the denominator can be different. Therefore, to ensure $\hat{F}_{ij} = \hat{F}_{ji}$, the denominator should be a constant, $Z_i = Z$. Taken together, the word2vec model with negative sampling predicts a flow of the same form as the gravity model:

$$\hat{F}_{ij} = \theta\, n(i)\, n(j)\, \exp(u_i \cdot u_j).$$

In other words, word2vec with skip-gram negative sampling is equivalent to the gravity model, with the mass given by the location's frequency $n(i)$, and the distance measured by the dot similarity of the vectors, with an exponential decay function.
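A quick numerical check of the derived form: with shared in/out vectors and frequencies playing the role of mass, the implied flow matrix is symmetric, and its logarithm decomposes into two mass terms plus the dot-product term (the vectors and frequencies below are random placeholders, not fitted values):

```python
import numpy as np

rng = np.random.default_rng(3)
n_loc, dim = 5, 4
u = rng.normal(size=(n_loc, dim))               # shared in/out vectors
n = rng.integers(10, 100, n_loc).astype(float)  # location frequencies ("mass")

# Flow in the gravity form implied by symmetric skip-gram negative sampling:
# F_ij proportional to n(i) * n(j) * exp(u_i . u_j)
F = np.outer(n, n) * np.exp(u @ u.T)

symmetric = np.allclose(F, F.T)
# log F minus both mass terms should recover the dot-product "distance" term
log_decay = np.log(F) - np.log(n)[:, None] - np.log(n)[None, :]
recovers_dot = np.allclose(log_decay, u @ u.T)
print(symmetric, recovers_dot)
```

The check confirms that, in this form, the only distance-like quantity left after removing the mass terms is the dot similarity, exactly as in a gravity model with exponential decay.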
While the gravity model predicts mobility flows from given masses and locations, word2vec finds the locations that best explain a given mobility flow. The embedding distance is therefore inherently tied to the mobility flow and, hence, has greater predictive power than the geographic distance and other baselines. In practice, because we only have limited amounts of noisy data and the optimization may not find the true optimum, the mathematical result may hold only approximately. Indeed, we find that the in- and out-vectors tend to differ and that the cosine similarity tends to better capture real-world mobility than the dot similarity. This result echoes other applications of word embedding, such as word analogy testing, 46 in which cosine distance also outperformed dot similarity. Nevertheless, a model with dot similarity has the second-best performance after cosine similarity (Tables S3, S4, S5), and the embedding distance still outperforms all alternatives we considered.

In the remainder of the paper, we focus on scientific mobility and interrogate the geometric space generated by the neural embedding to shed light on the multi-faceted relationships between organizations. To explore the topological structure of the embedding, we use a topology-based dimensionality reduction method (UMAP 47) to obtain a two-dimensional representation of the embedding space (Fig. 3a). By showing relationships between individual organizations, rather than aggregates such as nations or cities, this projection constitutes the largest and highest-resolution "map" of scientific mobility to date. Globally, the geographic constraints are conspicuous; organizations tend to form clusters based on their national affiliations, and national clusters tend to be near their geographic neighbors.
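The paper projects organization vectors with UMAP; as a dependency-free stand-in, the same kind of two-dimensional map can be sketched with PCA via SVD (the random matrix below plays the role of the learned embedding, and its dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
# Stand-in for learned organization vectors (200 organizations, 300 dims);
# the paper uses UMAP, a nonlinear method, for the actual figures
X = rng.normal(size=(200, 300))
Xc = X - X.mean(axis=0)  # center before projecting

# PCA via SVD: the top-2 principal components give a 2-D "map"
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
coords = Xc @ Vt[:2].T
print(coords.shape)
```

PCA preserves only global linear structure, which is why a topology-based method such as UMAP is preferred for revealing the cluster structure discussed next.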
At the same time, the embedding space also reflects a mix of geographic, historic, cultural, and linguistic relationships between regions much more clearly than the alternative network representations (Fig. S13) that have been common in studies of scientific mobility. 8, 48 The embedding space also allows us to zoom in on subsets and re-project them to reveal local relationships. For example, re-projecting organizations located in Western, Southern, and Southeastern Asia with UMAP (Fig. 3b) reveals a gradient of countries between Egypt and the Philippines that largely corresponds to geography, but with some exceptions seemingly stemming from cultural and religious similarity; Malaysia, with its official religion of Islam, is nearer in the embedding space to Middle Eastern countries than to many geographically closer South Asian countries. We validate this finding quantitatively with the cosine distance between nations (the centroids of the organization vectors belonging to each country). Malaysia is nearer to many Islamic countries such as Iraq ($d = 0.27$), Pakistan ($d = 0.32$), and Saudi Arabia ($d = 0.41$) than to neighboring but predominantly Buddhist Thailand ($d = 0.43$) or neighboring Singapore ($d = 0.48$). Linguistic and historical ties also affect scientific mobility; we observe that Spanish-speaking Latin American nations are positioned near Spain (Fig. 3c). Mirroring the global pattern, organizations in the United States are largely arranged according to geography (Fig. 3d). Re-projecting organizations located in Massachusetts (Fig. 3e) reveals structure based on urban centers (Boston vs. Worcester) and organization type (e.g., hospitals).

(Figure 3 caption, fragment: ...and Western Europe and the Mediterranean (light green). The cluster structure shows that not only geography but also linguistic and cultural ties between countries are related to scientific mobility.)
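The country-level comparison reduces to averaging organization vectors per country and taking the cosine distance between the centroids; a sketch with hypothetical vectors (the country codes, dimensions, and values are illustrative only, not the reported distances):

```python
import numpy as np

rng = np.random.default_rng(5)
# Hypothetical organization vectors grouped by country
org_vecs = {
    "MY": rng.normal(0.5, 1.0, (30, 64)),   # 30 organizations, 64-d vectors
    "TH": rng.normal(-0.5, 1.0, (25, 64)),  # 25 organizations
}

def cosine_distance(a, b):
    return 1 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# A country's vector is the centroid of its organizations' vectors
centroids = {c: v.mean(axis=0) for c, v in org_vecs.items()}
d = cosine_distance(centroids["MY"], centroids["TH"])
print(round(float(d), 2))
```

The same two-step recipe (centroid, then cosine distance) produces the country-pair distances quoted in the text.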
We quantify the relative importance of geography (by region) and language (by the most widely-spoken language of each country) using the element-centric clustering similarity, 49 a method that can compare a hierarchical clustering against disjoint clusterings (geography, language, etc.) at different levels of the hierarchy by explicitly adjusting a scaling parameter that acts like a zooming lens. If the parameter is high, the similarity is based on the lower levels of the dendrogram, whereas when it is low, the similarity is based on the higher levels. Fig. 4b demonstrates that regional relationships play a major role at the higher levels of the clustering (low parameter values), whereas language (family) explains the clustering better at the lower levels (high parameter values). This suggests that the embedding space captures the hierarchical structure of mobility.

Prestige hierarchy is known to underpin the dynamics of scientific mobility, in which researchers tend to move to similarly or less prestigious organizations. 16, 17 Could the embedding space, to which no explicit prestige information is given, encode a prestige hierarchy? We test this question by exploiting the geometric properties of the embedding space with SemAxis. 35 Here, we use SemAxis to operationalize the abstract notion of academic prestige, defining an axis in the embedding space whose poles are defined using known high- and low-ranked universities. As an external proxy of prestige, we use the Times Ranking of World Universities (we also use research impact from the Leiden Ranking; 51 see Supporting Information); the high-rank pole is defined as the average vector of the top five U.S. universities according to the rankings, whereas the low-rank pole is defined using the five bottom-ranked (geographically matched by U.S. census region) universities. We derive an embedding-based ranking of universities based on the geometric spectrum from the high-ranked to the low-ranked pole (see Data and Methods).
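The SemAxis construction described here is simple to sketch: the axis is the difference between the centroids of the high-rank and low-rank pole universities, and each organization is scored by the cosine similarity of its vector with that axis (all vectors below are synthetic placeholders, not real university embeddings):

```python
import numpy as np

rng = np.random.default_rng(6)
dim = 64
# Hypothetical pole vectors: five "high-rank" and five "low-rank" universities
high_pole = rng.normal(1.0, 0.3, (5, dim))
low_pole = rng.normal(-1.0, 0.3, (5, dim))

# SemAxis: the axis is the difference between the pole centroids
axis = high_pole.mean(axis=0) - low_pole.mean(axis=0)

def semaxis_score(v):
    # Score = cosine similarity between an organization vector and the axis
    return v @ axis / (np.linalg.norm(v) * np.linalg.norm(axis))

org_near_high = rng.normal(0.8, 0.3, dim)
org_near_low = rng.normal(-0.8, 0.3, dim)
print(semaxis_score(org_near_high) > semaxis_score(org_near_low))
```

Sorting organizations by this score yields the embedding-based ranking that is compared against the external rankings below.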
The embedding space encodes a prestige hierarchy of U.S. universities that is coherent with real-world university rankings. The embedding-based ranking is strongly correlated with the Times ranking (Spearman's $\rho = 0.73$, Fig. 5a). For reference, the correlation between the Times ranking and the publication impact scores from the Leiden Ranking, 51 a bibliometrically-based university ranking, is 0.87 (Spearman's $\rho$, Fig. 5b). The correlation between the embedding-based ranking and the Times ranking is robust regardless of the number of organizations used to define the axes (Fig. S18), such that even using only the single top-ranked and bottom-ranked universities produces a ranking that is significantly correlated with the Times ranking (Spearman's $\rho = 0.46$, Fig. S18a). The correlation is also comparable to more direct measures such as node strength (sum of edge weights, Spearman's $\rho = 0.73$) and eigenvector centrality (Spearman's $\rho = 0.76$, see Supporting Information) from the mobility network.

Figure 4: Geography, then language, conditions international mobility. a. Hierarchically clustered similarity matrix of country vectors, aggregated as the mean of all organization vectors within countries with at least 25 organizations. The color of matrix cells corresponds to the cosine similarity between country vectors. The color of country names corresponds to their cluster. The colors of the three cell columns separated from the matrix correspond to, from left to right, the region of the country, the language family, 50 and the dominant language. b. Element-centric cluster similarity 49 reveals the factors dictating the hierarchical clustering. Region better explains the grouping of country vectors at higher levels of the clustering. Language family, and then the most widely-spoken language, better explain the fine-grained grouping of countries.
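The validation step is a rank correlation between the embedding-derived ordering and the external ranking; a sketch with synthetic scores (n = 145 is chosen only to echo the number of ranked universities in the comparison, and the noise level is an arbitrary assumption):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 145
times_rank = np.arange(1, n + 1).astype(float)
# Hypothetical embedding-based scores that noisily agree with the ranking
embedding_score = -times_rank + rng.normal(0, 30, n)
embedding_rank = (-embedding_score).argsort().argsort() + 1.0

# Spearman's rho is the Pearson correlation of the rank variables
rho = np.corrcoef(embedding_rank, times_rank)[0, 1]
print(round(rho, 2))
```

With real scores, the same computation yields the reported correlations between the embedding-based ranking and the Times and Leiden rankings.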
The strongest outliers that were ranked more highly in the Times ranking than in the embedding-based ranking tend to be large state universities, such as Arizona State University and the University of Florida. Those ranked higher in the embedding-based ranking tend to be relatively small universities near major urban areas, such as the University of San Francisco and the University of Maryland, Baltimore County, possibly reflecting exchanges of scholars with nearby high-ranked institutions. In sum, our results suggest that the embedding space is capable of capturing information about academic prestige, even though the representation is learned from data without explicit information on prestige or on the direction of mobility (which other formal models rely on 16).

The axes can be visualized to examine the relative position of organizations along the prestige axis and along a geographic axis between California and Massachusetts. Prestigious universities such as Columbia, Stanford, MIT, Harvard, and Rockefeller are positioned towards the top of the axis (Fig. 5c). Universities at the bottom of this axis tend to be regional universities with lower national profiles (yet still ranked by Times Higher Education) and with more emphasis on teaching, such as Barry University and California State University, Long Beach. By projecting other types of organizations onto the prestige axis, SemAxis offers a new way of representing a continuous spectrum of organizational prestige for entities whose rankings are often low-resolution, incomplete, or entirely absent, such as regional and liberal arts universities (Fig. 5d), research institutes (Fig. 5e), and government organizations (Fig. 5f). Their estimated prestige is speculative, though we find that it significantly correlates with their citation impact (Fig. S22). We also find that the size (L2 norm) of the organization embedding vectors provides insights into the characteristics of organizations (Fig. 6).
Up to a point (around 1,000 researchers), the size of U.S. organizations' vectors tends to increase in proportion to the number of researchers (both mobile and non-mobile) with published work; these organizations are primarily teaching-focused institutions, agencies, and hospitals that either are not ranked or have a low ranking. However, beyond around 1,000 researchers, the size of the vector decreases as the number of researchers increases. These organizations are primarily research-intensive and prestigious universities with higher rank, research outputs, R&D funding, and doctoral students (Fig. S23). A similar pattern has been observed in applications of neural embedding to natural language, in which the size of a word vector was found to represent the word's specificity, i.e., how frequently the word co-appears with particular context words. 52 If the word in question is universal, appearing frequently in many different contexts, it will not have a large norm due to the lack of a strong association with any particular context. Likewise, an organization with a small norm, such as Harvard, appears in many contexts alongside many different organizations in affiliation trajectories-it is well-connected. The concavity of the curve emerges in part from the relationship between the size of the vector and the expected connectedness of the organization, given its size.

(Figure 5 caption, fragment: ...low-ranked regional and liberal arts colleges. Un-filled points are the top and bottom five universities used to span the axis. Even when considering only a total of ten organization vectors, the estimate of the Spearman's rank correlation between the embedding and Times rankings is $\rho = 0.73$ ($n = 145$, $p < 0.0001$), which increases when more top- and bottom-ranked universities are included (Fig. S18). b. The Times ranking is correlated with the Leiden Ranking of U.S. universities with Spearman's $\rho = 0.87$ and $p < 0.0001$. c-f.
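The quantity behind this analysis is simply the L2 norm of each organization's vector, compared against its researcher count. A sketch with fabricated data (the vectors are deliberately constructed to rise and then fall around 1,000 researchers, purely to illustrate the computation, not to reproduce the empirical result):

```python
import numpy as np

rng = np.random.default_rng(8)
n_orgs, dim = 100, 64
# Hypothetical researcher counts and embedding vectors (fabricated so that
# vector scale rises up to ~1,000 researchers and falls beyond it)
researchers = rng.integers(10, 5000, n_orgs)
scale = np.where(researchers < 1000, researchers / 1000, 1000 / researchers)
vectors = rng.normal(size=(n_orgs, dim)) * (0.5 + scale)[:, None]

norms = np.linalg.norm(vectors, axis=1)  # the quantity plotted against size
mid_sized = (researchers > 500) & (researchers < 2000)
print(norms[mid_sized].mean() > norms[~mid_sized].mean())
```

Plotting `norms` against `researchers` for real organization vectors yields the concave curve described in the text.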
Illustration of SemAxis projection along two axes: the latent geographic axis, from California to Massachusetts (left to right), and the prestige axis. Shown for U.S. universities (c), regional and liberal arts colleges (d), research institutes (e), and government organizations (f). Full organization names are listed in Table S1.)

We report that this curve is almost universal across many countries. For instance, China's curve closely mirrors that of the United States (Fig. 6b). Smaller but scientifically advanced countries such as Australia and other populous countries such as Brazil also exhibit curves similar to the United States (Fig. 6b, inset). Other nations exhibit different curves that lack the portion with decreasing norm, probably indicating a lack of internationally-prestigious institutions. Similar patterns can be found across many of the 30 countries with the most total researchers (Fig. S24; see Supporting Information for more discussion).

(Figure 6 caption: a. Size (L2 norm) of organization embedding vectors compared to the number of researchers for U.S. universities. Color indicates the rank of the university in the Times ranking, with 1 being the highest-ranked university. Uncolored points are universities not listed in the Times ranking. A concave shape emerges, wherein larger universities tend to be more distant from the origin (large L2 norm); however, the most prestigious universities tend to have smaller L2 norms. b. We find a similar concave-curve pattern across many countries, such as the United States, China, Australia, Brazil, and others (inset, and Fig. S24). Some countries exhibit variants of this pattern, such as Egypt, which is missing the right side of the curve. Loess regression lines are shown for each selected country, and for the aggregate of the remaining countries, with ribbons mapping to 99% confidence intervals based on a normal distribution. Loess lines are also shown for organizations in Australia, Brazil, and Egypt (inset).)
Neural embedding approaches offer a novel, data-driven solution for efficiently learning an effective and robust representation of locations from trajectory data, encoding the complex and multi-faceted nature of mobility. We demonstrated that a functional distance derived from the embedding can be used with the gravity model of mobility to predict real-world mobility better than geographic distance. Embedding distance outperformed geographic distance across distinct and disparate domains, including U.S. flight itineraries, Korean accommodation reservations, and global scientific mobility. We discovered that this performance results from neural embeddings implicitly learning gravity-law relationships between locations, making them particularly well suited to representing mobility. Focusing on scientific mobility, we found that the embedding successfully encodes many aspects of scientific mobility into a single representation, including global and regional geography, shared languages, and prestige hierarchies, even without explicit information on these factors.

Given the correspondence between neural embeddings and the gravity model, the study of human mobility can move beyond geographic and network-based models and instead leverage the high-order structure of individuals' mobility trajectories using these robust and efficient methods. While we focus on three domains of mobility, this approach could be applied to many others, such as general human migration, transit-network mobility, and more. Once learned, functional distances between locations, such as countries, cities, or organizations, can be published to support future research. Moreover, this approach can be used to learn a functional distance even between entities for which no geographic analog exists, such as between occupational categories based on individuals' career trajectories.
In addition to providing a functional distance that supports modeling and predicting mobility patterns, the structure of the neural embedding space is amenable to a range of unique applications 20 for studying mobility. As we have shown, the embedding space allows the visualization of the complex structure of scientific mobility at high resolution and across multiple scales, providing a large and detailed map of the landscape of global scientific mobility. Embedding also allows us to quantitatively explore abstract notions such as academic prestige, and can potentially be generalized to other abstract axes. Investigation of the structure of the embedding space, such as the vector norm, reveals near-universal patterns relating an organization's size to its vector norm that could be leveraged in future studies of mobility.

This approach, and our study, also have several limitations. First, the skip-gram word2vec model does not leverage directionality, meaning that the embedding will be less effective at capturing mobility for which directionality is critical. Future studies may consider bi-directional embeddings, such as BERT, 53 to incorporate directionality, as well as their correspondence to asymmetric mobility models, such as the radiation model. 4 Second, the neural embedding approach is most useful in cases of mobility between discrete geographic units such as countries, cities, and businesses; it is less useful in the case of mobility between locations represented by geographic coordinates, such as in the modeling of animal movements. Third, neural embedding is an inherently stochastic procedure, and so results may change across iterations. However, in this study we observe all results to be robust to stochasticity, likely owing to the limited "vocabulary" of organizations, airports, and accommodations (several thousand) and the relatively massive datasets used to learn the representations (several million trajectories).
Applications of word2vec to problem domains where the ratio of data to vocabulary is smaller, however, should be implemented with caution to ensure that findings are not the result of random fluctuations. Finally, the case of scientific mobility presents domain-specific limitations. Reliance on bibliometric metadata means that we capture only long-term mobility such as migration, rather than the array of more frequent short-term mobility such as conference travel and temporary visits. The kinds of mobility we do capture, migration and co-affiliation, although conceptually different, are treated identically by our model. Also, our data might further suffer from bias based on publication rates: researchers at prestigious organizations tend to have more publications, leading to these organizations appearing more frequently in affiliation trajectories. Mobility and migration are at the core of human nature and history, driving societal phenomena as diverse as epidemics 39, 54 and innovation. 11-15 However, the paradigm of scientific migration may be changing. Traditional hubs of migration have experienced many politically-motivated policy changes that affect scientific mobility, such as travel restrictions in the U.S. and U.K. 55 At the same time, other nations, such as China, are growing into major scientific powers and attractors of talent. 56 Unprecedented health crises such as the COVID-19 pandemic threaten to bring drastic global changes to migration by tightening borders and halting travel. By revealing the correspondence between neural embedding and the gravity model and demonstrating their utility and efficacy, our study opens a new avenue in the study of mobility. Mobility is at the core of many global challenges, and the insights brought forth by our tool can be put to use to inform a better understanding of human mobility, and to inform sensible, effective, sustainable, and humane policies. We source U.S. 
airport itinerary data from the Origin and Destination Survey (DB1B), provided by the Bureau of Transportation Statistics at the United States Department of Transportation. DB1B is a 10 percent sample of domestic airline tickets between 1993 and 2020, comprising 307,760,841 passenger itineraries between 828 U.S. airports. Each itinerary is associated with a trajectory of airports including the origin, destination, and intermediary stops. We source Korean accommodation reservation data through a collaboration with Goodchoice Company LTD. The data contains customer-level reservation trajectories spanning the period of August 2018 through July 2020 and comprising 1,038 unique accommodation locations in Seoul, South Korea. We source co-affiliation trajectories of authors from the Web of Science database. For example, France, Qatar, the USA, Iraq, and Luxembourg had the highest proportions of mobile authors (Fig. S2c). However, in absolute terms, the USA accounted for nearly 40% of all mobile authors worldwide (Fig. S2a), with 10 countries accounting for 80% of all mobility (Fig. S2b). The countries with the lowest proportions of mobile scientists are Jamaica, Serbia, Bosnia & Herzegovina, and North Macedonia (Fig. S2c). In most cases, countries with a high degree of inter-organization mobility also have a high degree of international mobility, indicating that a high proportion of their total mobility is international (Fig. S2d); however, some countries, such as France and the United States, seem to have more domestic mobility than international mobility. While the number of publications has increased year-to-year, the mobility and disciplinary makeup of the dataset has not notably changed across the period of study (Fig. S1). We embed trajectories by treating them analogously to sentences and locations analogously to words. For U.S. 
airport itineraries, trajectories are formed from the flight itineraries of individual passengers, in which airports are represented by unique identifiers. In the case of Korean accommodation reservations, trajectories comprise the sequence of accommodations reserved over a customer's history. For scientific mobility, an "affiliation trajectory" is constructed for each mobile author by concatenating their ordered list of unique organization identifiers, as demonstrated in Fig. 1a. In more complex cases, such as listing multiple affiliations on the same paper or publishing with different affiliations on multiple publications in the same year, the order is randomized within that year, as shown in Fig. 1b. These trajectories are used as input to the standard skip-gram negative sampling word embedding, commonly known as word2vec. 19 word2vec constructs dense and continuous vector representations of words and phrases, in which the distance between words corresponds to a notion of semantic distance. By embedding trajectories, we aim to learn a dense vector for every location, for which the distance between vectors relates to the tendency for two locations to co-occur in trajectories. The model estimates the probability that location $j$ appears in the context of location $i$ as $P(j \mid i) = \exp(u_i \cdot v_j) / Z_i$, where $u_i$ and $v_j$ are the "in-vector" and "out-vector", respectively, $Z_i = \sum_{\ell \in \mathcal{A}} \exp(u_i \cdot v_\ell)$ is a normalization constant, and $\mathcal{A}$ is the set of all locations. We follow the standard practice and only use the in-vector, $u_i$, which is known to be superior to the out-vector in link prediction benchmarks. 20-25, 27 We used the word2vec implementation in the Python package gensim. The skip-gram negative sampling word2vec model has several tunable hyper-parameters, including the embedding dimension $d$, the size of the context window $w$, the minimum frequency threshold $f_{\min}$, the initial learning rate $\alpha$, the exponent of the negative sampling distribution $\gamma$, the number of negative samples to draw $k$, and the number of iterations. 
For the main results regarding scientific mobility, we used $d = 300$ and $w = 1$, which were the parameters that best explained the flux between locations, though results were robust across different settings (Fig. S4). Although the original word2vec paper uses $\gamma = 0.75$, 19 here we set $\gamma = 1.0$, though results are only trivially different at other values of $\gamma$ (Fig. S5). We used $k = 5$, the suggested default of word2vec. We use the same settings for the U.S. airport itinerary and Korean accommodation reservation data. To mitigate the effect of less common locations, we set $f_{\min} = 50$, limiting to locations appearing at least 50 times across the training trajectories; 744 unique airports for the U.S. airport itineraries, 1,004 unique accommodations for the Korean accommodation reservation data, and 6,580 unique organizations for scientific mobility appear in the resulting embedding. We set $\alpha$ to its default value of 0.025 and iterate five times over all training trajectories. For scientific mobility, in each training iteration, the order of organizations within a single year is randomized to remove spurious sequential order. Negative sampling trains word2vec using a binary classification task as follows. For each target word $i$, we sample a context word $j$ from the given data and label it as positive, denoted by $y_j = 1$. Then, we sample $k$ words $\ell$ from a noise distribution $p_0(\ell)$ and label them as negative, denoted by $y_\ell = 0$. In word2vec, the noise distribution is given by $p_0(\ell) \propto p(\ell)^{\gamma}$, where $p(\ell)$ is the fraction of $\ell$ in the data, and $\gamma$ is a hyper-parameter. Then, for the sampled words, we fit a logistic regression model by maximizing the log-likelihood $J = \sum_{j' \in \mathcal{D}} \left[ y_{j'} \ln \sigma(u_i \cdot v_{j'}) + (1 - y_{j'}) \ln\left(1 - \sigma(u_i \cdot v_{j'})\right) \right]$, where $\mathcal{D}$ is the set of all sampled context words and $\sigma$ is the sigmoid function. This procedure does not guarantee that the embedding optimally converges, even when increasing the training samples and iterations. 44, 45 To make this bias explicit, let us consider the unbiased variant of negative sampling, i.e., noise contrastive estimation (NCE). 
44, 45 NCE is an unbiased estimator for a probability model of the form $P(x) = f(x) / \sum_{x' \in \mathcal{X}} f(x')$, where $f$ is a non-negative likelihood function of data $x$, and $\mathcal{X}$ is the set of all data. NCE fits a logistic regression model, $P(y = 1 \mid x) = \sigma\left(\ln f(x) - \ln p_0(x) - c\right)$, where $c = \ln k + \ln \sum_{x \in \mathcal{X}} f(x)$ is a constant, and maximizes the log-likelihood by calculating the gradients for the embedding vectors and iteratively updating them (see Supporting Information for the full derivation). Note that NCE is an unbiased estimator that converges to the optimal embedding in terms of the original word2vec objective function, 45 if we increase the number of sampled words and the training iterations. Let us revisit negative sampling from the perspective of NCE. We rewrite the logistic regression model in negative sampling (Eq. 9) in the form of the posterior probability $P(y = 1 \mid \ell) = \sigma(u_i \cdot v_\ell)$, where we define the likelihood function, up to a constant, by $f(\ell) \propto p_0(\ell) \exp(u_i \cdot v_\ell)$, which is the unbiased estimator for the corresponding probability model. Taken together, the conditional probability that SGNS word2vec actually optimizes is $P(\ell \mid i) = p(\ell)^{\gamma} \exp(u_i \cdot v_\ell) / Z_i$, where $Z_i = \sum_{\ell' \in \mathcal{A}} p(\ell')^{\gamma} \exp(u_i \cdot v_{\ell'})$. We calculate $F_{ij}$ as the total number of co-occurrences between two locations $i$ and $j$ across the dataset. In scientific mobility, $F_{ij} = 10$ indicates that organizations $i$ and $j$ co-occurred ten times between 2008 and 2019, as evidenced by their publications. Here, we treat $F_{ij} = F_{ji}$ for the sake of simplicity and, in the case of scientific mobility, because directionality cannot easily be derived from bibliometric records, or may not be particularly informative (see Supporting Information). We calculate two main forms of distance between locations. The geographic distance, $D_{ij}$, is the pairwise geographic distance between locations, calculated as the great circle distance, in kilometers, between pairs of locations. In the case of the U.S. flight itineraries and scientific mobility, we impute the distance as 1 km when it is less than one kilometer. 
In the case of the Korean accommodation reservation data, because the trajectories capture intra-city mobility at a much smaller scale, we impute the distance as 0.01 km when it is less than 0.01 km. The embedding distance, $d_{ij}$, is calculated as the cosine distance, $d_{ij} = 1 - \frac{u_i \cdot u_j}{\|u_i\| \|u_j\|}$, where $u_i$ and $u_j$ are the embedding vectors for locations $i$ and $j$, respectively. Note that $d_{ij}$ is not a formal metric because it does not satisfy the triangle inequality. Nevertheless, cosine distance is often shown to be useful in practice. 6, 7, 58 We compare the performance of this cosine-based embedding distance against those derived using dot-product similarity and Euclidean distance. We compare the performance of the embedding distance to many baselines. These include distances derived from simpler embedding approaches, such as Singular Value Decomposition (SVD) and a Laplacian Eigenmap embedding performed on the underlying location co-occurrence matrix. We also use network-based distances, calculating vectors using a Personalized PageRank approach and measuring the distance between them using cosine distance and Jensen-Shannon divergence (see Supporting Information). Finally, we compare the embedding distance against embeddings calculated through direct matrix factorization, following the approach that word2vec implicitly approximates. 34 We model co-occurrences between locations $i$ and $j$ (referred to as flux) using the gravity law of mobility. 3 The gravity law of mobility, which was inspired by Newton's law of gravity, postulates that the attraction between two locations is a function of their populations and the distance between them. This formulation and its variants have proven useful for modeling and predicting many kinds of mobility. 36-39 In the gravity law of mobility, the expected flux, $\hat{F}_{ij}$, between two locations $i$ and $j$ is defined as $\hat{F}_{ij} = m_i m_j f(D_{ij})$, where $m_i$ and $m_j$ are the populations of the locations, defined as the total number of passengers who passed through each airport for U.S. 
airport itineraries, the total number of customers who reserved each accommodation for the Korean accommodation reservation data, and the average yearly number of unique authors who published with each organization for scientific mobility. We consider separate variants of the deterrence function $f(\cdot)$ for the geographic distance, $D_{ij}$, and the embedding distance, $d_{ij}$, and report the best-fit model for each distance. For the geographic distance, we use the power-law form of the gravity law, $f(D_{ij}) = D_{ij}^{-\alpha}$ (Eq. 22). For the embedding distance, we use the exponential form, $f(d_{ij}) = e^{-\alpha d_{ij}}$ (Eq. 23). Models are fit by comparing the predicted flux $\hat{F}_{ij}$ to $F_{ij}$, the actual flow from the data. The resulting fits are comparable to other common applications of the gravity law, such as phone calls, commuting, and migration. 4 We follow standard practice and exclude zero flows from our analysis. SemAxis and similar studies 22, 25, 35 demonstrated that "semantic axes" can be found in an embedding space by defining "poles", and that the latent semantic relationship along the axis can be extracted with simple arithmetic. In the case of natural language, the poles of an axis could be "good" and "bad", "surprising" and "unsurprising", or "masculine" and "feminine". We can use SemAxis to leverage the semantic properties of the embedding vectors to operationalize abstract relationships between organizations. The score of organization $i$ along an axis with poles $+$ and $-$ is the cosine similarity between its vector $u_i$ and the axis vector $u_+ - u_-$, where a higher score for organization $i$ indicates that $i$ is more closely aligned with $+$ than with $-$. We define two axes to capture geography and academic prestige, respectively. Code used in this analysis can be found at https://github.com/murrayds/sci-mobility-emb Supporting Information S1 Text Mobility and science. As scholars move, they bring their knowledge, skills, and social connections with them; collectively, the movements of researchers shape the structure and direction of the global scientific enterprise. For example, prestige-driven mobility between doctoral-granting and employing institutions is highly unequal. 16 There are many ways of modeling scientific mobility from bibliographic data, the first consideration being the unit of analysis. 
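The SemAxis scoring described above can be sketched in a few lines (pure NumPy; the pole vectors here are synthetic toy data, not actual organization embeddings):

```python
import numpy as np

def semaxis_score(v, pos_pole, neg_pole):
    """Score vector v along the axis running from the - pole to the + pole.

    pos_pole / neg_pole are lists of embedding vectors defining each pole;
    the axis is the difference of the pole centroids, and the score is the
    cosine similarity between v and that axis (range [-1, 1]).
    """
    axis = np.mean(pos_pole, axis=0) - np.mean(neg_pole, axis=0)
    return float(v @ axis / (np.linalg.norm(v) * np.linalg.norm(axis)))

# Toy 2-d example: the + pole points "east", the - pole points "west".
pos = [np.array([1.0, 0.1]), np.array([0.9, -0.1])]
neg = [np.array([-1.0, 0.1]), np.array([-0.9, -0.1])]
print(semaxis_score(np.array([2.0, 0.0]), pos, neg) > 0.99)  # True
```

With real organization vectors, the poles would be sets of organizations chosen to anchor, e.g., the geographic or prestige axis.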
Most studies of mobility have focused on country-level mobility, the flows of researchers across nations. 12, 69-71 Practically, country-level analyses benefit from higher reliability, such that idiosyncrasies and errors inherent to bibliographic databases are mitigated by the higher level of aggregation. Epistemically, country-level analysis is useful for national science governing bodies that aim to understand the status of their country in the global landscape and make informed policy decisions. Analyses at lower levels of aggregation are far less common. Regional-level scientific mobility, the flow of researchers between regions or cities within or across countries, has been only minimally studied, 72 possibly due to a lack of reliable long-term data and a lack of policy relevance to national-level lawmakers. Organization-level mobility has the potential to inform institutional policy and to illuminate the composition of mobility within a single country or region, especially as it relates to organization performance, prestige, and inequality. 15-17, 73 However, affiliation disambiguation and noise in bibliometric data make large-scale organization-level analysis challenging. Here, we learn neural-network embeddings of scientific mobility at the level of organizations using a curated bibliographic database. These embeddings are robust to noise, and so are capable of representing clear structure even amid issues with organizational disambiguation. In doing so, embeddings also capture a more detailed picture of mobility than has previously been studied. Another consideration when analyzing scientific mobility is what kinds of mobility to study. Typical understandings of mobility are directional: movement is always from one place and to another. However, scientific mobility is more complicated. 
For example, scientists often hold multiple affiliations at a time, 74 listing them as co-affiliations on a single paper, or even choosing a subset of affiliations to use for multiple simultaneous projects. 18 Even clearly-directional migration to another institution is complex: researchers may continue to publish with an old affiliation for projects that began before their move, and they may maintain social and organizational links to their old institution (e.g., collaborators, projects, graduate students) such that there is no clear breakage after migrating. There is also a whole range of short-term scientific mobility, such as visiting scholarships and short-term visits, that is only visible through intensive efforts such as manual extraction from CVs. 75-77 Here, we focus on the more long-term mobility that can be derived from bibliographic data. Due to the complexity of scientific mobility, we make the simplifying assumption that all scientific mobility is symmetric, or without direction, such that any move from organization $i$ to organization $j$ is equivalent to a move from $j$ to $i$. By assuming non-directional mobility, all mobility events are commensurate, meaning that they can be treated identically in our analysis; this allows us to represent the complexity of mobility without making decisions about the direction of researchers' mobility or which is their main affiliation. Moreover, this assumption has the practical advantage of matching the data format expected by the word2vec model, as well as the theoretical advantage of adhering to the symmetry assumption of the gravity model of mobility. 
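A minimal sketch of how such symmetric co-occurrence counts $F_{ij}$ could be tallied from trajectories (the helper and its window handling are illustrative assumptions, not the paper's exact implementation):

```python
from collections import Counter

def cooccurrence_counts(trajectories, window=1):
    """Tally undirected co-occurrence counts F_ij within a sliding window.

    Because mobility is assumed symmetric, each pair is stored under a
    sorted key, so (A, B) and (B, A) accumulate into the same count.
    """
    flux = Counter()
    for traj in trajectories:
        for pos, org in enumerate(traj):
            for other in traj[pos + 1 : pos + 1 + window]:
                if other != org:  # ignore self-pairs from repeated stays
                    flux[tuple(sorted((org, other)))] += 1
    return flux

flux = cooccurrence_counts([["A", "B", "A"], ["B", "A"]])
print(flux[("A", "B")])  # 3: two pairs in the first trajectory, one in the second
```

Storing pairs under a sorted key is what makes every move commensurate regardless of its direction.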
For each mobile researcher who has at least two distinct affiliations, we construct an affiliation trajectory, randomizing the order of organizations within each year. This effectively removes the effect of order within a year, as the order cannot be meaningfully established based on co-affiliations in a single paper, or on different affiliations listed on separate papers, for which the date of publication may not be representative of the actual completion of the project. Other than restricting to only mobile researchers, we do not perform any filtering or reductions of affiliation trajectories. In the case that an author publishes with organization $i$ four times in year $y_0$, and organization $j$ two times in year $y_1$, their trajectory will be $(i, i, i, i, j, j)$. Although mobile authors who publish more papers will have longer trajectories, word2vec will skip duplicate consecutive organization IDs, mitigating the impact of long repetitive trajectories. We examine the gravity model on Personalized PageRank (PPR) 78 vectors as a network-based benchmark. We construct the co-occurrence network of organizations, in which each edge between organizations $i$ and $j$ represents a co-occurrence of $i$ and $j$ in the same affiliation trajectory, with weight given by the sum of the co-occurrences over all researchers. Personalized PageRank is a ranking algorithm for nodes based on a random-walk process on networks. In one step, a walker visiting a node moves to a neighboring node chosen randomly with probability proportional to the weight of the edge. Furthermore, with probability $\tau$, the walker is teleported back to the starting node. The rank of a node is determined by the probability that the walker visits the node in the stationary state. The stationary distribution of the random walker starting from node $i$, denoted by $\pi_i = (\pi_{ij})$, is given by $\pi_i^{\top} = \tau e_i^{\top} + (1 - \tau)\, \pi_i^{\top} P$, where $e_i$ is a column vector of length $N$ with entries that are all zero except the $i$th entry, which equals one, and $P$ is the transition matrix obtained by row-normalizing the weighted adjacency matrix $A = (A_{ij})$. We used $\tau = 0.9$ here. 
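The PPR vectors described above can be computed by power iteration; a NumPy sketch (the row-normalization and iteration count are implementation assumptions):

```python
import numpy as np

def personalized_pagerank(A, seed, teleport=0.9, n_iter=200):
    """Personalized PageRank by power iteration on a weighted adjacency matrix.

    At each step the walker follows an edge with probability proportional to
    its weight; with probability `teleport` it jumps back to the seed node.
    Returns the stationary visit distribution (a probability vector).
    """
    A = np.asarray(A, dtype=float)
    P = A / A.sum(axis=1, keepdims=True)  # row-stochastic transition matrix
    e = np.zeros(A.shape[0])
    e[seed] = 1.0
    pi = e.copy()
    for _ in range(n_iter):
        pi = teleport * e + (1.0 - teleport) * pi @ P
    return pi

# Toy weighted co-occurrence network of three organizations.
A = np.array([[0.0, 2.0, 1.0],
              [2.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])
pi = personalized_pagerank(A, seed=0, teleport=0.9)
print(round(pi.sum(), 6))  # 1.0
```

Each organization's PPR vector is then compared to others using the distance measures described next.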
We can consider $\pi_i$ as a representation vector of organization $i$, and examine the gravity law by measuring the distance between $\pi_i$ and $\pi_j$. We consider two distance measures in this analysis. The first is the cosine distance used for our embedding method, $d_{ij} = 1 - \frac{\pi_i \cdot \pi_j}{\|\pi_i\| \|\pi_j\|}$. Also, if we consider $\pi_i$ as a discrete probability distribution, then we can use the Jensen-Shannon divergence (JSD), which can be written as $\mathrm{JSD}(\pi_i, \pi_j) = \frac{1}{2} D_{\mathrm{KL}}(\pi_i \,\|\, m) + \frac{1}{2} D_{\mathrm{KL}}(\pi_j \,\|\, m)$, where $m = \frac{1}{2}(\pi_i + \pi_j)$ and $D_{\mathrm{KL}}$ is the Kullback-Leibler divergence. We report the results with the cosine distance ($R^2 = 0.14$, Fig. S11) and the Jensen-Shannon divergence ($R^2 = 0.19$, Fig. S12). In both cases, the performance falls below that of the model with geographic distance. Even though the PPR vectors are far longer than our embedding vectors, the embedding distance outperforms both. We use truncated singular value decomposition (SVD) on the underlying mobility co-occurrence matrix as a baseline embedding. In short, truncated SVD performs a linear low-rank approximation of a matrix with a given number of dimensions, $d$. First, we construct the co-occurrence matrix $C$ of organizations, where $C_{ij}$ is given by the co-occurrence of organizations $i$ and $j$ in the same affiliation trajectory. Then, we apply truncated singular value decomposition with $d = 300$ to the flow matrix directly. We calculate the distance between organizations in the SVD embedding space using cosine distance, finding that it explains slightly more of the flux between organizations than does geographic distance ($R^2 = 0.247$, Table S3). When used as an input to the gravity model, this distance produces better predictions than geographic distance using both the exponential (RMSE = 0.859, Table S4) and power-law models (RMSE = 0.839, Table S4), performing slightly better with the power-law formulation. We also consider Laplacian Eigenmap embeddings 42 as a baseline, which is one of the most fundamental approaches to graph embedding. 
First, we construct the co-occurrence matrix $C$ of organizations, where $C_{ij}$ is given by the co-occurrence of organizations $i$ and $j$ in the same affiliation trajectory, and the degree matrix $D$, the diagonal matrix with $D_{ii} = \sum_j C_{ij}$. Then we construct the graph Laplacian matrix $L = D - C$ and apply truncated singular value decomposition to this matrix with $d = 300$. We only report results based on the cosine distance between Laplacian embedding vectors, finding that it explains less of the total flux than geographic distance ($R^2 = 0.212$, Table S3). When used as an input to the gravity model, the Laplacian cosine distance produces marginally better predictions than geographic distance using both the exponential (RMSE = 0.878, Table S4) and power-law models (RMSE = 0.87, Table S5), performing slightly better with the power-law formulation. We also compare the word2vec embedding distance against a direct matrix factorization baseline, using the symmetric SVD word2vec method. 34 Based on the observation that word2vec performs implicit matrix factorization, Levy and Goldberg proposed the symmetric SVD word2vec embedding, which directly computes the factorization that word2vec only attempts to efficiently approximate. First, we construct the matrix $M$ with entries $M_{ij} = \max\left( \ln \frac{\#(i,j)\, |\mathcal{D}|}{\#(i)\, \#(j)} - \ln k,\; 0 \right)$, where $\#(i,j)$ is the number of times the location pair $(i,j)$ appears within the window size $w$ in the total corpus $\mathcal{D}$, $\#(i) = \sum_j \#(i,j)$ is the number of times location $i$ occurs within the window size in $\mathcal{D}$, and $k$ is the number of negative samples. We used $w = 1$ and $k = 5$, the same settings as in our main results. Then, we factorized the matrix $M$ with truncated singular value decomposition with $d = 300$ into $U \Sigma V^{\top}$, and used $U \sqrt{\Sigma}$ as the embedding vectors. For this baseline, we report results using the cosine distance, Euclidean distance, and dot product between the embedding vectors. We find that the dot product performs by far the worst, worse than any other baseline considered ($R^2 = 0.004$, Table S3). 
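A compact sketch of this shifted-PPMI factorization (NumPy; the toy co-occurrence matrix and the use of a dense full SVD in place of a truncated solver are simplifying assumptions):

```python
import numpy as np

def sppmi_svd_embedding(C, k=5, d=2):
    """Embed locations by factorizing the shifted positive PMI matrix.

    C[i, j] counts how often locations i and j co-occur within the window.
    M[i, j] = max(ln(C_ij * total / (C_i * C_j)) - ln k, 0) is factorized
    as M = U Sigma V^T, and W = U_d sqrt(Sigma_d) is used as the embedding.
    """
    C = np.asarray(C, dtype=float)
    total = C.sum()
    row = C.sum(axis=1)
    with np.errstate(divide="ignore"):
        pmi = np.log(C * total / np.outer(row, row))
    M = np.maximum(pmi - np.log(k), 0.0)  # zero counts map to 0, not -inf
    U, S, _ = np.linalg.svd(M)
    return U[:, :d] * np.sqrt(S[:d])      # W = U_d sqrt(Sigma_d)

C = np.array([[0.0, 8.0, 1.0],
              [8.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])
W = sppmi_svd_embedding(C, k=1, d=2)
print(W.shape)  # (3, 2)
```

The max(..., 0) clipping is what keeps zero-flow pairs from producing unbounded negative entries, the sensitivity discussed later in this section.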
The cosine distance performs better, but still worse than geographic distance ($R^2 = 0.212$, Table S3). The Euclidean distance performs best, explaining more of the flux than geographic distance, second only to the embedding distance ($R^2 = 0.341$, Table S3). Focusing on the Euclidean distance, we find that using it as input to the gravity model results in better predictions than geographic distance using both the exponential (RMSE = 0.803, Table S4) and power-law models (RMSE = 0.78, Table S5), though it performs slightly better with the power-law formulation. We then embed the gravity matrix using two approaches: a truncated singular value decomposition (SVD), and multidimensional scaling (MDS), which embeds each location in an N-dimensional space such that pairwise distances are preserved as well as possible. Typically, MDS uses Euclidean distance to measure the distance between vectors. Here, we report results using the cosine distance between embedding vectors for the SVD embedding, and the logged Euclidean distance for the MDS embedding. We observe that the gravity-optimized SVD cosine distance has poor performance, more weakly correlated with actual flux than geographic distance ($R^2 = 0.122$, Table S3), and similarly poor prediction error when used as input to the gravity model. In contrast, the MDS distance has the second-highest correlation with actual flux, after only the embedding distance ($R^2 = 0.355$, Table S3), and is third best for predicting actual flux using the power-law version of the gravity model (RMSE = 0.795, Table S5), though performance is much lower using the exponential form (RMSE = 0.904, Table S4). 
We note that the MDS embedding actually has the lowest prediction error when organizations' populations are defined as the raw frequency during prediction with the power-law model (RMSE = 0.691, Table S5); however, the MDS distance is defined with a population of "All" mobile and non-mobile scholars, whereas the predictions in Table S5 use the raw frequency, which likely confounds this result. MDS is also computationally intensive, requiring upwards of six times more time to compute (on the machine used in this analysis) than the more computationally efficient word2vec model. Why do neural-embedding approaches, which rely on stochastic gradient descent, outperform Levy's direct matrix factorization, 34 especially given that word2vec implicitly approximates this factorization? We speculate that it stems, in part, from the sensitivity of matrix factorization to small flows between locations. Levy's matrix factorization embeds affiliations $i$ and $j$ such that their dot similarity is as close as possible to $\ln \frac{\#(i,j)\, |\mathcal{D}|}{\#(i)\, \#(j)}$. If the flow is very small or zero, the dot similarity goes to $-\infty$, pushing $i$ and $j$ very far from other affiliations in the embedding space. 34, 79 This is particularly problematic when the window size is small, because most affiliation pairs then have no flow, which is indeed the case in our experiments. To circumvent this problem, previous studies 34, 79 added a constant flow between the affiliation pairs with no flow. However, in addition to altering the underlying data, these small flows can still have a strong impact on the embedding. It is also well known that singular value decomposition (SVD) is vulnerable to outliers. 80-84 The stochastic gradient descent algorithm, which is employed in SGNS word2vec, is more robust than SVD and can enhance the generalization and effectiveness of the word2vec model. 85-87 S10 Text Organization disambiguation and metadata. 
Affiliations were mapped to one of 8,661 organizations, disambiguated following the procedure originally designed for the Leiden Rankings of World Universities. 51 Organizational records were associated with a full name, a type indicating the sector (e.g., University, Government, Industry), and an identifier for the country and city of the organization. Sixteen different sector types were included in the analysis, which we aggregated into four high-level codes: University, Hospital, Government, and Other. Each record was also associated with a latitude and longitude. However, for many organizations, these geographic coordinates were missing or incorrect. We manually updated the coordinates of 2,267 organizations by searching the institution name and city on Google Maps; in cases where a precise location of the organization could not be identified, we used the coordinates returned when searching the name of the city. The data was further enriched with country-level information, including region, most widely-spoken language, and language family (e.g., the language family of Spanish is Italic). State/province-level information was added using the reverse geocoding service LocationIQ, with each organization's latitude and longitude as input. Regional census classifications were added for states in the United States. For each organization, we calculated size as the average number of unique authors (mobile and non-mobile) who published with that organization across each year of our dataset; in cases where authors publish with multiple affiliations in a single year, they are counted towards each. As a result of our disambiguation procedure, some affiliations are mapped to two organizations, one specific and one more general. For example, any author affiliated with "Indiana University Bloomington" will also be listed as being affiliated with the "Indiana University System", a more general designation for all public universities in Indiana. 
However, a more general organization may not always occur alongside a more specific one. For example, a researcher affiliated with the smaller regional school "Indiana University South Bend" will be listed as affiliated with only the "Indiana University System". We identify all specific organizations that always co-occur with a more general one. For every career trajectory that includes one of these specific organizations, we remove all occurrences of the more general organization; trajectories containing only a general designation are not altered. Author-name disambiguation, the problem of associating names on papers with individual authors, remains a difficult challenge in the use of bibliometric data. 88 Authors in our dataset have been disambiguated using a rule-based algorithm that uses author and paper metadata, such as physical addresses, co-authors, and journal, to score papers on the likelihood of belonging to an author cluster, a cluster of publications believed to have been authored by the same individual. 57 We limit our analysis to the period of 2008 to 2019, as in 2008 the Web of Science began indexing additional author-level metadata such as full names and email addresses. The disambiguation algorithm is conservative, favoring splitting clusters over merging them. Past studies have validated this data and shown that the disambiguated authors are comparable 46 to ground-truth records such as those from ORCID and useful for a wide range of bibliometric studies. 12, 18, 48, 55 S12 Text Derivation of Eq. 35: Noise Contrastive Estimation. Noise contrastive estimation (NCE) 44, 45 is an unbiased estimator for a probability model of the form $P(x) = f(x) / \sum_{x' \in \mathcal{X}} f(x')$, where $f$ is a non-negative likelihood function of data $x$, and $\mathcal{X}$ is the set of all data. This general form includes the word2vec model (Eq. (8)), where $f(j) = \exp(u_i \cdot v_j)$. 
NCE fits the probability model using a binary classification task in the same way as negative sampling, but using a Bayesian formalism for logistic regression. Specifically, before the training, we know that 1 in $1 + k$ words is sampled from the given data, which can be modeled as the prior probabilities $P(y = 1) = \frac{1}{1 + k}$ and $P(y = 0) = \frac{k}{1 + k}$ (Eq. 30). Using Bayes' rule, the posterior probability for a given word $x$ is $P(y = 1 \mid x) = \frac{P(x \mid y = 1) P(y = 1)}{P(x \mid y = 1) P(y = 1) + P(x \mid y = 0) P(y = 0)}$ (Eq. 31). Bearing in mind that word $x$ is sampled from the given data if $y = 1$ and from the noise distribution if $y = 0$, and assuming that the given data is generated from the probability model to fit, the class-conditional probabilities are $P(x \mid y = 1) = f(x) / \sum_{x' \in \mathcal{X}} f(x')$ and $P(x \mid y = 0) = p_0(x)$ (Eq. 32). Putting Eqs. (30), (31), and (32) together, the posterior probability for $y = 1$ is $P(y = 1 \mid x) = \frac{f(x)}{f(x) + k\, p_0(x) \sum_{x' \in \mathcal{X}} f(x')}$, which can be rewritten in the form of a sigmoid function, $P(y = 1 \mid x) = \sigma\left(\ln f(x) - \ln p_0(x) - c\right)$, where $c = \ln k + \ln \sum_{x \in \mathcal{X}} f(x)$ is a constant. NCE maximizes the log-likelihood by calculating the gradients for the embedding vectors and iteratively updating them. An important consequence of this framework is that NCE is an unbiased estimator that converges to the optimal embedding in terms of the original word2vec objective function $J$, if we increase the number of sampled words and the training iterations. 44, 45 S13 Text Reconstructing the Times ranking with network measures. The performance of the embedding ranking in reconstructing the Times ranking is comparable to that of network-derived measures such as degree strength (Spearman's $\rho = 0.73$, Fig. S20b). The embedding ranking over-ranks large research-intensive universities such as North Carolina State University, the University of Florida, and Texas A&M University, whereas the network-derived ranking over-ranks smaller, more specialized universities such as Brandeis University, Yeshiva University, and the University of San Francisco. This suggests that the embedding encodes information on the prestige hierarchy at least as well as a network representation, with some noticeable qualitative differences. S14 Text Speculation on variations of the convex-curve pattern. The convex-curve pattern observed in Fig. 
6 repeats across many countries, with variations. Table S3 : Correlation between flux and distance across metrics and experimental parameters. Each cell corresponds to the correlation between the real-world flux between scientific organizations (measured with R²) and baseline metrics, shown by subsets of mobility data and by definitions of organization population. The asterisk denotes the top-performing distance metric in each column. Distance metrics are ordered from highest R² to lowest, based on global mobility with organization population defined using all mobile and non-mobile authors. "All" means that population is defined as the average yearly number of unique mobile and non-mobile scholars who published with the organization's affiliation; population is defined in the same way for "Mobile only", except using only unique mobile researchers; "Raw freq" means that organization populations are defined as their frequency across all trajectories, similar to word frequency in language embedding. Embedding distance, measured as the cosine distance between embedding vectors, explains more of the flux than the baselines in nearly every case, except when using raw-frequency populations with domestic-only and international-only mobility, where direct optimization of the gravity model and Levy's factorization 34 work better. Table S4 : Prediction error between actual and predicted mobility with the exponential gravity model, by metrics and experimental parameters. Each cell corresponds to the prediction error (measured with root mean squared error) when using each distance as input to the exponential form of the gravity model of mobility to predict the flux between organizations, shown by subsets of mobility data and by definitions of organization population. The asterisk denotes the top-performing distance metric in each column (lowest prediction error).
Distance metrics are ordered from lowest prediction error to highest, based on global mobility with organization population defined using all mobile and non-mobile authors. "All" means that population is defined as the average yearly number of unique mobile and non-mobile scholars who published with the organization's affiliation; population is defined in the same way for "Mobile only", except using only unique mobile researchers; "Raw freq" means that organization populations are defined as their frequency across all trajectories, similar to word frequency in language embedding. Embedding distance, measured as the cosine distance between embedding vectors, results in better predictions of mobility than the baselines in nearly every case; however, Levy's factorization 34 performs better in the cases of international-only and domestic-only mobility when using raw-frequency populations. Table S5 : Prediction error between actual and predicted mobility with the power-law gravity model, by metrics and experimental parameters. Each cell corresponds to the prediction error (measured with root mean squared error) when using each distance as input to the power-law form of the gravity model of mobility to predict the flux between organizations, shown by subsets of mobility data and by definitions of organization population. The asterisk denotes the top-performing distance metric in each column (lowest prediction error). Distance metrics are ordered from lowest prediction error to highest, based on global mobility with organization population defined using all mobile and non-mobile authors.
"All" means that population is defined as the average yearly number of unique mobile and non-mobile scholars who published with the organizations' affiliation; population is defined in the same way for "Mobile only", except only using unique mobile researchers; "Raw freq" means that organization populations are defined as their frequency across all the trajectories, similar to word frequency in language embedding. Embedding distance, measured as the cosine distance between embedding vectors, results in better predictions for global mobility with "All'; and "Mobile only" definitions of population, though direvt gravitylaw opimization with MDS performs better when using raw frequencies to measure organization population, and Levy's factorization 34 perform better here with domestic and international only mobility. Figure S5 : Neural embeddings outperform baselines for scientific mobility. Cosine distance between embedding vectors generated with word2vec explains more of the flux, and better predicts flux when used in the gravity model of mobility than geographic distance and other baselines. Shown is the correlation between the flux and embedding distances, measured with 2 (left), and the prediction error when using the distance as input to the gravity model of mobility. The asterisk denotes the top-performing metric. For prediction error, we show results based on both the exponential and power-law forms of the gravity model. All embedding-based methods use dimensions of 300. Here, embedding distance is obtained from neural embeddings learned with window size of 1 and = 1. In all cases, organization population is defined as the mean annualized number of unique mobile and non-mobile authors, and flux is calculated for all global mobility. 
Baselines include the top-performing distance metrics calculated between vectors obtained by personalized PageRank (PPR), singular value decomposition (SVD), Laplacian eigenmaps, direct factorization following Levy's approach, 34 and direct optimization of the gravity model using SVD and multidimensional scaling (MDS), as well as the geographic distance between organizations. Embedding distance better explains and predicts flux than any other baseline, though there is some variation by experimental parameters (Table S3, Table S4, and Table S5 ). Figure S6 : Cosine distance is correlated with dot product similarity. We find a relatively high correlation between the embedding distance (one minus the cosine similarity) and the dot product similarity between organization vectors (R² = 0.73). The color of each hex bin indicates the frequency of organization pairs. The red line is the line of best fit. Black dots are the mean flux across binned distances. 99% confidence intervals are plotted for the mean flux in each bin based on a normal distribution. The correlation is calculated on the data in log-log scale (p < 0.0001). Figure S8 : For embedding distance, the exponential-decay gravity model is slightly better. Flux between organization pairs predicted by the gravity model with different distance-decay functions, i.e., an exponential decay function (a) and a power-law decay function (b), using embedding distance. Boxplots show the distribution of actual flux for binned values of predicted flux. Box color corresponds to the degree to which the distribution overlaps the line y = x; a perfect prediction yields all points on the black line. "RMSE" is the root-mean-squared error between the actual and predicted values. Shown for all pairs of organizations (a-b), domestic-only (c-d), and international-only (e-f) mobility.
The gravity model with the exponential decay function slightly outperforms that with the power-law decay function, except in the case of international-only mobility, for which the power-law decay performs slightly better. Figure S9 : Embedding distance explains more variance in global, within-country, and across-country flux than geographic distance. a-b. Embedding distance (a) explains more flux than geographic distance (b). The red line is the line of best fit. Black dots are the mean flux across binned distances. 99% confidence intervals are plotted for the mean flux in each bin based on a normal distribution. Correlations are calculated on the data in log-log scale (p < 0.0001 across all fits). The color of each hex bin indicates the frequency of organization pairs. Results here are identical to those shown in Fig. 2 . c-d. Embedding distance explains more variance when considering only within-country organization pairs. e-f. Embedding distance is more robust than geographic distance when considering only across-country organization pairs. Figure S10 : Examining the gravity model with dot products in the embedding space. Performance of dot product similarities in explaining and predicting mobility. Similarity scores are calculated as the pairwise dot product between organization vectors. Dot product similarity performs better than geographic distance, though worse than cosine similarity, in explaining global (a), domestic (b), or international (c) mobility. The red line is the line of best fit. Black dots are the mean flux across binned distances. 99% confidence intervals are plotted for the mean flux in each bin based on a normal distribution. Correlations are calculated on the data in log-log scale (p < 0.0001 across all fits). Color indicates the frequency of organization pairs within each hex bin. Similarly, dot product similarity performs comparably to geographic distance in predicting global (d), domestic (e), and international (f) scientific mobility.
Boxplots show the distribution of actual flux for binned values of predicted flux. Box color corresponds to the degree to which the distribution overlaps the line y = x; a perfect prediction yields all points on the black line. "RMSE" is the root-mean-squared error between the actual and predicted values. Figure S11 : Personalized page rank with cosine distance. Performance of personalized page rank scores in explaining and predicting mobility. Personalized page rank is calculated on the underlying mobility network, and distance is measured as the cosine distance between PPR probability distribution vectors. PPR cosine distance performs roughly similarly to geographic distance in explaining global (a), domestic (b), or international (c) mobility. The red line is the line of best fit. Black dots are the mean flux across binned distances. 99% confidence intervals are plotted for the mean flux in each bin based on a normal distribution. Correlations are calculated on the data in log-log scale (p < 0.0001 across all fits). The color of each hex bin indicates the frequency of organization pairs. Similarly, PPR distance performs comparably to geographic distance in predicting global (d), domestic (e), and international (f) scientific mobility. Boxplots show the distribution of actual flux for binned values of predicted flux. Box color corresponds to the degree to which the distribution overlaps the line y = x; a perfect prediction yields all points on the black line. "RMSE" is the root-mean-squared error between the actual and predicted values. Figure S12 : Personalized page rank with Jensen-Shannon divergence. Performance of personalized page rank scores in explaining and predicting mobility. Personalized page rank is calculated on the underlying mobility network, and distance is measured as the Jensen-Shannon divergence (JSD) between PPR probability distribution vectors. PPR JSD performs roughly similarly to geographic distance in explaining global (a), domestic (b), or international (c) mobility.
Overall, PPR JSD explains more variance in mobility than cosine distance (Fig. S11) , except for international mobility, for which cosine similarity outperforms JSD. The red line is the line of best fit. Black dots are the mean flux across binned distances. 99% confidence intervals are plotted for the mean flux in each bin based on a normal distribution. Correlations are calculated on the data in log-log scale (p < 0.0001 across all fits). The color of each hex bin indicates the frequency of organization pairs. Similarly, PPR JSD performs comparably to geographic distance in predicting global (d), domestic (e), and international (f) scientific mobility. Boxplots show the distribution of actual flux for binned values of predicted flux. Box color corresponds to the degree to which the distribution overlaps the line y = x; a perfect prediction yields all points on the black line. "RMSE" is the root-mean-squared error between the actual and predicted values. Figure S13 : Visualization of the global mobility network. The network demonstrates country-level structure, but not at the detail or extent of the global UMAP projection (Fig. 3a) . Each node corresponds to an organization, whereas weighted edges (not shown) correspond to the flow of mobile researchers between the two organizations. Nodes are colored by the country of the organization. Nodes are positioned using the Force Atlas layout algorithm. Table S1 . Figure S22 : SemAxis reconstructs publication impact in non-university sectors. Comparison between the ranking of organizations in each non-university sector by their citation impact and the embedding rank. Citation impact is calculated as the mean-normalized citation score using papers published in the Web of Science database between 2008 and 2019.
The embedding rank is derived by first projecting non-university organizations onto the SemAxis axis formed with poles defined using the top five and geographically-matched bottom five universities in the 2018 Times Higher Education ranking of U.S. universities. a. Shows how the correlation between the citation-impact and SemAxis rankings differs while varying the size threshold for including an organization. Size is calculated as the mean annualized number of unique authors publishing with that organization. Annotations show the number of organizations remaining at thresholds of 0, 50, and 100. b. Comparison of organizations using a size threshold of 10 for regional and liberal arts colleges, and 50 for research institutes and government organizations; these thresholds were chosen as points of stability in a. The impact rank is correlated with the embedding rank for regional and liberal arts colleges (Spearman's ρ). Figure S25 : Distribution of organization embedding vector norms by country. Histogram showing the distribution of L2 norm values of organization embedding vectors in each of the 30 countries with the largest number of total unique mobile and non-mobile researchers. Text in each panel shows the number of organizations in the country (n) and the Gini index of inequality of the distribution (g); a small Gini index indicates that the L2 norms of organizations are more balanced, whereas a high Gini value indicates that they are more unequal.
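The per-country Gini index of vector-norm inequality reported in Fig. S25 can be computed directly from the organization embedding vectors. Below is a minimal sketch, assuming random placeholder vectors in place of the actual embeddings; the mean-absolute-difference formula used here is one standard definition of the Gini coefficient:

```python
import numpy as np

def gini(values):
    """Gini coefficient of a set of non-negative values.

    Uses the sorted-values form of the mean-absolute-difference
    definition: G = sum_{i,j} |x_i - x_j| / (2 n^2 mean(x)).
    """
    x = np.sort(np.asarray(values, dtype=float))
    n = len(x)
    index = np.arange(1, n + 1)  # 1-based ranks of sorted values
    return (2.0 * np.sum(index * x) - (n + 1) * np.sum(x)) / (n * np.sum(x))

rng = np.random.default_rng(0)
vectors = rng.normal(size=(100, 300))     # placeholder organization embeddings
norms = np.linalg.norm(vectors, axis=1)   # L2 norm of each organization vector

g = gini(norms)
assert gini(np.ones(100)) == 0.0          # perfectly balanced norms give G = 0
assert 0.0 <= g < 1.0
```

Applied per country to the actual embedding vectors, this yields the g values annotated in each panel of the figure.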
We thank the Center for Science and Technology Studies at Leiden University for managing and making available the dataset of scientific mobility. We also thank the Goodchoice Company LTD. for making available the dataset of Korean accommodation reservations. For their comments, we thank Guillaume Cabanac, Cassidy R. Sugimoto