key: cord-0606455-9wbm0xgf
authors: King, Sinjoni Mukhopadhyay; Nawab, Faisal; Obraczka, Katia
title: A Survey of Open Source User Activity Traces with Applications to User Mobility Characterization and Modeling
date: 2021-10-12
journal: nan
DOI: nan
sha: 202f02e9505a09b486272744aac3a04071684018
doc_id: 606455
cord_uid: 9wbm0xgf

The current state-of-the-art in user mobility research has extensively relied on open-source mobility traces captured from pedestrian and vehicular activity through a variety of communication technologies as users engage in a wide-range of applications, including connected healthcare, localization, social media, e-commerce, etc. Most of these traces are feature-rich and diverse, not only in the information they provide, but also in how they can be used and leveraged. This diversity poses two main challenges for researchers and practitioners who wish to make use of available mobility datasets. First, it is quite difficult to get a bird's eye view of the available traces without spending considerable time looking them up. Second, once they have found the traces, they still need to figure out whether the traces are adequate to their needs. The purpose of this survey is three-fold. It proposes a taxonomy to classify open-source mobility traces including their mobility mode, data source and collection technology. It then uses the proposed taxonomy to classify existing open-source mobility traces and finally, highlights three case studies using popular publicly available datasets to showcase how our taxonomy can tease out feature sets in traces to help determine their applicability to specific use-cases.

Understanding how people move has important applications in a wide range of fields including urban planning, public health and healthcare, efficient transit and transportation, critical infrastructure planning and deployment, commerce, entertainment, to name a few. As such, it has attracted considerable attention from the research and practitioner communities across different disciplines. For example, mobility models have been proposed in the context of mobile networking [72] , traffic prediction [61] , urban planning [46] , cellular routing [98] , energy management systems [43] , healthcare [49] and social media advertising [51] . Due to its versatile applicability, there has been growing interest in mobility modeling research and application.

Mobility models have been extensively used in the study of wireless networks and their protocols [68, 104] . Indeed, most network simulators include synthetic mobility generators, which, given a pre-specified mobility regime, determine the position of network nodes over time during simulation runs. More recent mobility models try to represent user mobility more realistically by using real mobility traces, especially since an increasing number and variety of these real, open sourced traces can be accessed through public websites like CRAWDAD [88] , data.world [89] , Kaggle [91] and GitHub [90] . Most of these datasets are rich in information and have diverse applications. This poses two main challenges to researchers and practitioners who wish to make use of these open source mobility datasets. First, it is quite difficult to get a bird's eye view of open source traces without looking for them, especially when you do not know that they exist. Second, once the traces are identified, determining whether they are adequate for the application at hand is far from trivial.

This survey tries to fill these gaps by: (1) Proposing a taxonomy that classifies publicly available mobility traces based on a number of factors including their mobility mode, data source, technology used for data collection, information type and their current and potential future applications; (2) Categorizing approximately 30 well known public datasets using our taxonomy; and (3) Presenting three case studies, each uniquely categorized according to our taxonomy.

Existing surveys on mobility modeling focus mainly on two areas, namely mobility feature analysis techniques, where models with different feature characteristics are compared [58, 112, 113, 38, 44, 67] , and mobility prediction models [115, 34, 35, 69] . They highlight different application areas like mobile and wireless networking, traffic modeling, smart city planning, etc. Additionally, the few surveys that discuss traces only focus on traces belonging to a single category, for example a survey that compares mobility traces collected via Bluetooth sensors on two different university campuses [40] . To the best of our knowledge, our work is the first to survey the current state-of-the-art in publicly available mobility traces, offer a taxonomy to put them in perspective, and provide a comparative analysis of existing traces according to the proposed taxonomy. Additionally, our survey is also meant to be a guide for researchers and practitioners to navigate the space of currently available mobility traces as well as future ones.

Coming up with a representative taxonomy for mobility traces is not trivial due to their feature and application diversity. To create our main taxonomy we used a bottom up approach, starting from the data source or technology used to collect the traces, all the way up to the mobility mode i.e., pedestrian or vehicular, being represented by these traces. Another challenge we faced was finding a representative collection of traces to define our taxonomy. Most of the state of the art mobility modelling techniques generate their own traces, only a fraction of which are open source. We picked the most heavily used traces with the goal of creating a classification scheme that is broad enough so that any existing or new trace can be categorized using our taxonomy.

In addition to proposing a taxonomy to categorize well-known publicly available mobility traces, and presenting three different case studies; our study also identifies significant gaps in availability of real mobility traces, which motivates the need for realistic mobility trace generators. The rest of the survey is organized as follows. Section 2.1 provides an overview of our multi-layer taxonomy. Then Section 2.2 expands on the taxonomy layers and categorizes 30 well-known public datasets using our taxonomy and by areas of application. In order to illustrate that our taxonomy can be used to classify a broad variety of mobility datasets, Section 3 goes over three case studies, each of which following a unique trajectory in our taxonomy. In Sections 4 and 5 respectively we discuss applications of traces belonging to different categories and current state-of-art in mobility research, including existing surveys. Finally, Section 6 concludes the survey.

Our proposed taxonomy is illustrated in Figure 1 , where layers 1, 2 and 3 are connected using the red lines to show the various types of taxonomy trajectories that the datasets could follow. An extension of the taxonomy is illustrated in Figure 2 , where layers 4 and 5 connect feature groups, represented by the white boxes, to the information types, where each different colored arrow represents a unique information type. In Figure 1 , the top layer (layer one), Mobility Mode, refers to the user's mode of transportation, namely pedestrian or vehicular. Pedestrian Mobility typically represents movement within a limited geographic region, while Vehicular Mobility involves movement using various modes of transportation usually spanning larger geographic regions. Vehicular mobility includes personal vehicles and public modes of transportation like buses, trains, shared scooters/bikes/cabs, ships, airplanes, etc. The next layer down, the Data Source layer (layer two) considers how the dataset was collected. It is further divided into two sublayers, namely Collection Infrastructure and Measurement Medium. Collection infrastructure refers to the systems that host the devices used to collect data, whereas measurement medium considers the actual device/technology that generates the different measurements used to populate the datasets. In the case of pedestrian mobility, data is usually collected through smartphones, laptops, tablets, wearable devices or through network infrastructure gear. In vehicular mobility, data is either gathered through end-user devices like smartphones, laptops, tablets and wearable devices, in which case it is usually generated when these devices are inside a moving vehicle; or through smart-vehicle hosted devices. Examples of collection infrastructure and measurement media are illustrated in Figure 1 .

As illustrated in Figure 2 , the taxonomy extends to organize mobility traces by the type of information they contain which is typically dependent on the features that are generated. Based on the existing set of open-source traces, we have identified four main information categories: (1) Connectivity traces are typically used to optimize network performance, e.g., provisioning, redistributing resources to better manage network traffic, etc [36, 30, 23, 25, 10, 22, 7, 14, 12, 11, 4, 6] ; (2) Location traces can be used for location-based optimizations like improving waiting area around a specific business that has a heavy footfall [1, 36, 24, 29, 13, 23, 25, 7, 12, 11, 9, 6, 17] ; (3) Health-related traces are applied to improve health solutions like adding features for call-for-help services; there aren't any current open source health traces because of HIPAA compliance, but we include this class as a future categorization possibility; (4) Lifestyle traces are used to draw patterns in user behaviour like sleep cycles, downtime etc. Note that the taxonomy also lists, under every information category, some examples of features that may be present in the respective category of traces [24, 29, 13, 19, 5, 2, 28] . Now that we have introduced the taxonomy, in the remaining section, we dig deeper into the criteria used to classify mobility traces based on the taxonomy. Table 1 lists existing publicly available mobility traces classifying them by mobility mode and data source, while Table 2 lists existing publicly available mobility traces classifying them by information category.

The Mobility Mode layer organizes traces into two major categories: pedestrian and vehicular.

Pedestrian mobility datasets can provide an understanding of how people move in certain areas according to various aspects like traveling distance, locales where people tend to congregate, and trends related to places visited, when they are visited, and for how long [5, 2, 16, 8] . Lately, there has been a lot of emphasis on understanding crowd management, e.g., identifying attractors and detractors, determining and optimizing wait times in various situations, localizing congestion and bottlenecks in crowded localities, all of which provide invaluable insights on how people move in different places and situations. Other examples of applications that can benefit from better understanding of pedestrian mobility include reduction of crowd-based carbon-dioxide emissions, population density control in crowded areas, and optimization of city infrastructure. The COVID-19 pandemic has made the importance of being able to model pedestrian mobility patterns even more critical so that it can be used to perform contact tracing and manage public spaces in order to better enact local policies and restrictions.

Vehicular mobility traces can be used to characterize movement of various modes of transportation [1, 17, 9] . Before automobiles, cities were limited in terms of area, population and business prospects. The advent of different motorized transportation modes have fundamentally transformed the way cities are planned and expanded. Vehicular mobility can be broken down into personal and public transport. Public transport can further be differentiated based on mobility medium, namely: ground, sea, or air. Whether publicly shared or personally owned, vehicular mobility captures various aspects of a given region like traffic during different times of the day, road congestion, favorable modes of transport, usage based on infrastructure, etc. Understanding vehicular movement can be insightful not just for city planning but also to understand trends among social communities of people. Other applications of vehicular mobility modeling include managing electrical vehicles (e.g., provisioning of charging stations), supporting autonomous driving, shared modes of transportation, and smart transportation scheduling.

Some example sources of vehicular mobility data includes:

• Personal Vehicles, which are getting smarter by the day with new advances in 5G and IoT technology. With these automobiles being connected to the Internet, there is a treasure chest of uploaded data that can help us model useful mobility patterns. Add to that application integration, where applications like Google and Apple Maps can record common routes and locations frequently visited, we can derive a complete picture of user mobility.

• Shared public transport by companies like Uber, Lyft, and Bay Wheels, which also collect mobility information via data shared by user applications. This includes information like pick up/drop off location, route taken, and stops along the way. • Shared public transport by local governments including buses and taxis which also provides us with information similar to privately-owned shared public transport.

In the second layer, traces are then classified according to their data sources, including the infrastructure where data is being stored/located, and how data was obtained/measured.

Mobility data is typically collected by network infrastructure or end user devices. Network infrastructure includes devices that provide network connectivity to end users such as base stations and cell phone towers, that provide fingerprints which help identify times when a user has been around a certain location and for how long; as well as routers, switches, hubs, and wireless access points that provide network traffic exchange information which can be used to study mobility as a device switches between multiple components. End user devices can either directly be the source of mobility traces or traces can be extracted from applications hosted on these devices. Examples include:

• Smartphones/Tablets/laptops that have become increasingly common as information and communication sources. They can collect/store information such as location, social media check-ins, communication logs like calls/messages being exchanged and sensor based information like motion tracking and health conditions. • Wearable devices, like smart-watches, with their sensing capabilities are ideal to monitor health conditions, and crowd density during ordinary and emergency scenarios. • Smart Vehicles that provide network and location information.

Pedestrian traces can be extracted from public network infrastructure and end user devices like smartphones, tablets, laptops, wearable devices, etc. Vehicular traces can be captured using data from specialized end user devices like transportation smart cards in personal and shared vehicles or, from transportation / transit applications or websites. Vehicular mobility can also be collected using sensors embedded in vehicles.

Here we consider how data was obtained or measured and divide traces into three categories: the Sensor category refers to traces collected via a variety of sensing devices; the Network category includes traces collected using devices that provide network connectivity (e.g., cellular, WiFi, Bluetooth, etc); and the Other system hardware group includes information derived from logs of on-device applications.

Sensor-Based data is usually the output from a device that measures the physical environment. The output of sensors is usually used as raw information or to trigger other sensors or processes. Below, we include examples of sensor data commonly found in mobility datasets.

• Global Positioning Systems or GPS sensors are receivers with antennas that use satellite based navigation to provide time and geolocation information usually in the form of latitude and longitude coordinates. In some cases, GPS sensors can also capture position in the form of velocity and orientation. These features are most commonly used as unique location identifiers. The datasets we have explored include latitude and longitude coordinates only. • Light sensors are devices that convert any form of light energy, visible or infrared, into electrical signal outputs. In the case of mobility datasets, information on when it is day versus night can be useful to monitor patterns in user habits. • Accelerometers measure acceleration using three axes, X, Y and Z. Such sensors mainly provide two kinds of information: first, the static force applied on the sensor due to gravity and orientation; second, the force and acceleration exerted on the sensor in motion. • Gyroscope sensors calculate angular velocity, or change in rotational angle per unit of time, usually measured in degrees per second. In the datasets we consider, gyroscope sensors add an additional dimension to the accelerator data to determine the orientation of a device. • Magnetometers measure the relative change in magnetic field at a given location.

• Pedometers are mechanical devices that use software to detect vertical movement at the hip, to count the number of steps taken by a user. This can indirectly be used to derive information like distance travelled and patterns of other physical activities. • Oximeters or pulse oximeters use LEDs to emit two types of red light through human tissue in order to measure oxygen saturation levels in the blood along with the number of times our heart beats per unit of time. • Temperature sensors are electronic devices that measure surrounding ambient temperature and convert that into electronic data, to measure changes in temperature. • Camera produces records in-terms of images or videos that can be used to derive social patterns in different human communities. This information can be derived based on the contents of the picture/video, location where the record was made, people who were a part of the record, what the people in the record were doing, etc.

Datasets containing sensor information can be collected directly by sensors and can be classified either under pedestrian or vehicular mobility. Crivello is a notable example of a pedestrian dataset that includes sensor information from wearable devices that has been used to compare and evaluate indoor localization solutions [36] ; Microsoft's Geolife dataset [8] consists of approximately eighteen thousand GPS trajectories with a total distance of 1,292,951 kilometers and a total duration of 50,176 hours collected from GPS loggers and GPS-enabled phones.

Commonly used vehicular traces include: the T-drive dataset by Microsoft Research which contains GPS trajectories, i.e., longitude and latitude of approximately ten thousand taxis in Beijing [1] ; the Cabspotting dataset consisting of GPS latitude and longitude information collected from over 500 taxis [3] ; and mobility data collected from taxi cabs in Rome derived from GPS coordinates [9] . Other less known vehicular traces that contain sensor information are the GRID bikeshare dataset, which describes the main attributes of GRID temperature, including the feed, operator, hours, calendar, regions, pricing, alerts, stations, and bike status [28] ; and the mobility dataset from the city of Austin, Texas which includes GPS sensor information from bicycles and other mean of transportation [31, 15, 21] . More recently a taxicab dataset [17] from Chicago collected location data from seven thousand licensed cabs operating within city limits.

Network-Based The ever increasing popularity and availability of mobile communications has made "anywhere, anytime connectivity" a reality. As such, end user mobility information that is collected through access network devices helps manage and provision network resources. Examples of network connectivity information contained in mobility traces include:

• Bluetooth technology targets short-range wireless communication. Data collected from Bluetooth networks provide information about nearby devices and their characteristics [4] .

Other network-based mobility information includes Bluetooth low energy packets generated by BLE beacons from end user devices [22] . • WiFi is one of the most widely used wireless technologies for data communication in local-area networks. It is also widely used as Internet access technology. Information from WiFi networks like access point associations / dissociations and signal strength can be used to determine user location as well as mobility patterns and trajectories.

The Crivello trace, which also contains network connectivity information from smartphones, has been used to compare and evaluate indoor localization solutions [36] . Other notable traces that contain network data include the Global System for Mobile Communications (GSMC) which gathers information from approximately 10 mobile smartphone (iPhones) users via the MySignals iPhone App [23] ; and data collected by Flexran from a platform for software-defined radio access networks [25] . Additionally, several organizations collect mobility data based on records of authenticated user associations to their WiFi networks [10] .

Other traces that contain network connectivity information include: HYCCUPS Tracer, that contains availability and mobile interaction information such as usage statistics, user activity, battery statistics, or sensor data, a device's encounters with other nodes or with wireless access points [7] ; the Cambridge Haggle dataset that contains bluetooth encounters between 12 nodes for approximately 6 days [14] ; Asturias (Spain) Fire Department mobility and connectivity traces generated by GPS devices embedded mainly in cars and trucks, but also in a helicopter and a few personal radios [12]; traces containing Bluetooth encounters, Facebook friendships and interests of a set of users collected through the SocialBlueConn application at the University of Calabria [11] ; traces with Bluetooth encounters, opportunistic messaging, and social profiles of 76 users, using the MobiClique application at the SIGCOMM 2009 [4] ; and fine-grained network mobility data from commercial mobile phones in Seoul, Korea, containing continuous GPS information combined with Wi-Fi fingerprints and user-annotated location information [6] .

Other System Hardware Mobility datasets can also include information generated by Application based data, also known as data collected from other system hardware, is actively triggered by users utilizing an application. Most of this data collected is only restricted to when the application is running and does not include information from the application's rest time. This kind of data usually gives information in the form of timestamps. Some specific datasets combine these timestamps with other information like location, human movement and constraints in the digital space.

The Clock and the Calendar applications provide us with date and time, which can be useful if we want to model time series data, or analyze patterns and trends over specific time/date periods.

The Map application uses requests to derive information about frequency of trips, locations in-terms of latitude and longitude, start and end times, types of transportation used in different locations for different times, pedestrian population in different areas of a city. Information generated by this kind of application can be useful for city planning and vehicle traffic management. Data from maps application can also fall under the sensor category since it consists primarily of location information. When combined with lifestyle information, examples of map based datasets are data from Apple and Google Maps application to analyze daily changes in direction and routing requests before and after the onset of COVID, focusing on different transportation types, countries, regions, sub-regions and cities [26, 24] .

Location based social network (LBSN) follows geosocial networking principles, where a social networking application has geographic capabilities like geotagging and geocoding to collect additional information about human social patterns. Location coordinates like latitude and longitude, added to uploaded pictures or social network check ins to cities, bridges the gap between the physical world and the online services, bringing social networks back to reality. Like the map datasets, this category of traces can also belong to the sensor group of datasets. Examples of LBSN based datasets are Gowalla [5] , Nsense [16] and Brightkite [2] which use social network check-ins as the main source of the mobility data.

Other application based datasets include the US Internal Lifetime Mobility [20], which predicts mobility based on when and where a user is born and where they are currently located. The NYC Citywide Mobility survey of the New York City residents' travel choices, behaviors, and perceptions [18] collects mobility information via online surveys and phone surveys. COVID-based mobility traces provide information based on how a typical member of a population is moving in a day [30] . The Knowledge Center on Migration and Demography, Dynamic Data Hub contains global transnational mobility data that provides us with information on country-to-country cross-border human mobility using global statistics on tourism and air passenger traffic [13] . The Knowledge Center on Migration and Demography, Dynamic Data Hub also highlights information on monthly air passenger flows, which can be synthesized into a set of indicators between countries worldwide; demography and mobility data collected by the Joint Research Centre and the Directorate General for Regional and Urban Policy in European metropolitan regions in 2018 [19] ; and region based mobility data collected via interactive maps publicly available on the Flows to Europe Geoportal, which provides statistical updates on migrant and refugee land and sea arrivals and routes towards Europe [27] .

This section outlines three case studies, all of which showcase popular open-source datasets and their classification using our proposed taxonomy. Our aim here is to highlight how our taxonomy can be used to identify traces that can be used for different applications. The first case study, which is presented in Section 3.1, analyzes two well known pedestrian mobility datasets that have been derived using data from smartphone applications, namely Apple and Google Maps, and compares mobility trends during the COVID-19 pandemic in 2020. The second and third case studies, presented in Sections 3.2 and 3.3, mainly focus on how we can use our taxonomy to derive similarities or differences between different traces. In Section 3.2, we analyze two commonly used GPS-based vehicular mobility traces that have been collected from taxis in San Francisco and Rome. Section 3.3 draws similarities between two Location-Based Social Network (LBSN) traces and highlights how feature-sparse datasets can also be information-rich.

Companies like Microsoft, Google and Apple have extracted data from applications like Google and Apple Maps to analyze changes in mobility trends since the COVID-19 pandemic started in late 2019 [29] . We compare two open-source COVID Mobility datasets, one derived from Apple maps [24] and another from Google maps [26] , to study the difference between their feature sets and information provided. Figure 3 shows how each of the traces are classified according to our taxonomy. Both datasets are pedestrian-based and were collected using data from applications, in this case Google and Apple Maps, running on smartphones.

We classify this under pedestrian and not vehicular even though the datasets provide information about types of transportation, since the information is requested by a pedestrian smartphone. Each data-point in the Apple dataset represents a location around the world and is then associated with a number of driving, walking and transit related requests made each day in that location over a span of four months. The raw dataset consists of: satellite based location represented by a tuple of GPS coordinates, i.e., latitude and longitude, mode of transportation being requested: driving, walking or public transportation, date-wise number of requests for each mode of transportation for each GPS tuple and region information.

Similar to the Apple Maps trace, the Google Maps dataset also provides date-wise GPS locations. Additionally, it provides a percentage increase or decrease in mobility for various categories within each location. The categories include retail/recreation, grocery, parks, residential, workplace, and transit stations and are derived from location tags present in the map settings. Since the main difference between both datasets is in their feature sets as highlighted by the red boxes in Figure 3 , a mobility mode based classification is meaningless for this particular case. A more appropriate classification would be using the information type which then goes into details of the different types of features being used in each dataset. Figure 4 illustrates locations in the United States for which a request to either Apple Maps and Google Maps was made¹. ¹Although our analysis was performed for the entire datasets, for readability, we only show access from the United States Region.

Since the Apple Maps dataset classifies location requests based on modes of transportation, we use it to analyze mobility trends based on the mean number of requests for each type of transportation mode. An interesting trend that Figure 5a shows is that over a span of five months from January to May 2020, there has been a 57% drop in public transportation use and a 16% percent drop in the number of walking requests, whereas requests for driving personal vehicles have increased by 43% percent.

The Google dataset, on the other hand, lays more emphasis on public businesses and properties signaled in the requests. As such, we use it to analyze trends based on the increase or decrease in number of requests for the specific classes of locations/businesses. Figure 5b illustrates a monthly change in requests for different locations. From the figure we observe that, as the months progressed from February to May, with an increase in COVID threat, there has been up to a 38% decrease in visits to workplaces, a 35% decrease in retail and recreation maps requests, a 30% decrease in public transportation usage, and a 25% increase in park visitations.

After a cumulative study, we infer that, although falling in the same general taxonomy path, both datasets are different when we look at the taxonomy extension, and provide different kinds of information to predict mobility patterns during COVID. The mode of transportation used to travel to the location and the increase and decrease in number of visitations to different classes of locations. Reiterating on our aim for this taxonomy, having a taxonomy that is capable of classifying features from both the Apple and the Google datasets for a specific location, will help identify gaps in each dataset. For example we picked Illinois from both the Apple and Google datasets with latitude and longitude tuple < 40,-89 >. For this location, Figure 6 shows the cumulative information we can potentially generate. Some important insights we can infer for this location is that people using the maps application mainly looked for driving requests during the four peak months in COVID cases; also park visitations increased significantly during the month of May. Identification of these gaps will be useful when researchers want to improve technology pertaining to trace collection and trace generation.

In this second use case, we analyze GPS traces from taxis in Rome and San Francisco. Figure 7 shows how the datasets are classified according to our taxonomy: both are public transportation vehicular traces. Their data is collected using end user devices like smart vehicles. They provide GPS locations and time and date information from the system clock. Both datasets contain GPS location information in the form of latitude-longitude tuples. The San Francisco taxi trace associates each taxi with a unique taxi ID and a series of locations it visited over a period of time, while the Rome taxi trace represents a day-by-day set of locations visited by a group of taxis. As we illustrate with our analysis below, this use case highlights the fact that traces that have similar features can still be used to conduct / derive different types of analysis / results.

The overall goal of both the datasets is to provide locations that are commonly visited by the taxis in each region. The taxi traces from Rome focus on temporal patterns using timestamps to categorize locations visited based on dates, while the SF traces focus on spatial patterns categorizing locations based on unique cab identifiers. Such dataset graphs can provide us with various types of information. Figure 8a represents GPS locations visited by a group of taxis on four different days. When projected on the map of the Rome region, we can identify activity hotspots. Figure 8b represents GPS locations visited by four unique taxis during the month of February. We can infer from the difference in mobility patterns of the cabs that there are probably different categories of cabs that follow different routes and are restricted from entering certain areas.

Reiterating our motivation for the taxonomy, we notice that both datasets showcase unique temporal or spatial information despite following the same taxonomy paths and having the same feature sets, lacking specific information in either case. The taxi dataset from Rome provides an overall list of locations visited per day but provides no information to identify different taxis, whereas the dataset from SF provides location information per taxi, but for a cumulative period over a month. The Rome dataset can be structured to provide a monthly analysis of taxi travel like the SF dataset, but the SF dataset cannot provide a daily taxi analysis like the Rome dataset. Identifying these differences using the taxonomy can help fill in the gaps in both datasets to then generate a more comprehensive idea of the public taxi system for a particular region. We can derive information about unique categories of taxis and their routes for different times of the day and across different months.

When combined with location-acquisition technologies like GPS and Wifi, social network information can connect locations with user interest patterns. For this use case we have used data from the Brightkite and Gowalla datasets to show how traces that provide sparse feature sets can still be used to derive useful information. Sparse feature sets refer to datasets with a small number of traces. Figure 9 shows how the Gowalla and Brightkite traces are classified according to our taxonomy. Both datasets are pedestrian based, collected using applications running on end user devices like smartphones, iPads, or laptops. GPS location, region-and check-in information are collected and used to identify social relations between various groups of people. The technology type used by both datasets provide information about visited locations, represented by vertices in a graph, along with their corresponding latitude, longitude, and timestamps.

In this case both datasets only provide a timestamp and 3 features the vertex represented by a number between 0 to n, the latitude and longitude. Each vertex number stands for a user and can be associated with several tuples of latitude and longitude values representing locations visited by that user. As illustrated in Figure 10 , one type of information we can extract from both datasets is worldwide check-in patterns. Additionally, there are two major kinds of information that can be visualized from these datasets. Check-in patterns can help group users into social groups as shown in Figure 11 which presents a set of connections between the first 5 users of each dataset. The blue vertices have bidirectional connections between them while the green vertices have unidirectional connections between themselves and with the blue vertices.

There are a number of limitations associated with LBSN-based datasets. First, we are restricted to only information provided by the users, so if the user forgets to check-in, that information will not be available. Additionally, LBSN users may be uniquely identified even if only to LBSN application developers, which may cause privacy issues. Our taxonomy is useful to these types of datasets in two capacities. One, it will help researchers working with geo-social traces to classify feature differences between the datasets. Second, it will help understand the similarity of LBSN-based traces to other pedestrian-based traces.

Public mobility traces have a multitude of applications including communication, urban planning, vehicular and network traffic analysis, social management and in some cases even healthcare. Examples of social management would be structuring community needs like libraries, shopping complexes based on mobility footprints. This section outlines different application areas for some of the different trace information categories in our taxonomy extension.

Traditional GPS traces provide basic latitude-longitude information and when combined with location specific mobility tags, have several applications in various spheres of mobility analysis. Mobility tags can include any type of information tied to location like behavioural patterns, foot traffic, choices, network usage etc. Vehicular GPS information when combined with timestamps can provide insights on routes taken by specific modes of transportation at different times of the day. Frequency of travel using different modes of transportation at different locations can also be derived, which provide insights into mobility patterns of different communities. For example, if we have a trace that contains information of all buses along with their locations and timestamps for a particular city, we can identify the major hotspots or centrally located spots in the city based on locations that are visited most frequently by buses of different routes. GPS traces when combined with connectivity traces can provide insight into how well-connected some regions are. For example a trace that gives locations and their corresponding RSSI values can help identify placement ofWiFi access points, which can in turn help offloading/managing network traffic within a specific location. GPS traces can also provide useful migration information. For example location information combined with unique person IDs, date and timestamps can help us identify patterns in how a person is moving between two locations.

Geosocial traces, referred as lifestyle traces in the paper, derive human patterns using social network check-ins. The unique style of geosocial mobility traces can provide insightful information for urban planning and retail real estate to property owners and operators. Geosocial data has the potential to reveal the personality of neighborhoods in a city. Building a park near neighborhoods that have a strong healthy living, nature or dog loving segments might be a source of support. Whereas building shopping center in that same space instead might encounter resistance from the same group of people. Retail business success and geosocial segments are also closely related. A low priced, high traffic region may seem like a good place to build a store, but the most important factor contributing to the success of the store would be the social dynamics of the people around the store. Retail property owners can also use geosocial data to determine social segments of people around their property, which will help them lease the property to stores that are more likely to do well in the longer term in a particular area. Geosocial traces can also be used to target online applications to specific communities of people and be applied for marketing, merchandising and consumer goods. We can identify what people are doing and talking about in various locations. This kind of information can help with placement of billboards, local radio spots, or location-targeted mobile advertisements, as it is more effective to advertise in an area which has a social segment that has been predicted to be more receptive to those ideas.As a bonus these traces can be applied in planning healthcare facilities. Such data can be used to identify the age group of people in different locations, and based on the age determine if a particular location has more children who require pediatricians, or have an older population who require elder care physicians. We can also use such data to plan other specialized healthcare solutions, like chiropractors for locations where people have more of a sedentary lifestyle e.g. software engineers.

Connectivity and network traces provide information on signal strengths, packet transfer details, network usage details and timestamps. Mobility performance metrics like user pause probability, user arrival, departure probabilities heavily impact the performance of 5G cellular networks. Optimizations can be performed by analyzing these metrics [56, 92] . Understanding user mobility characteristics, predicting network usage, can also help determine performance of routing protocols and feasibility of running an application over a vehicular ad hoc network [32] . Caching files based on popularity to reduce pressure on backhaul networks relies on user mobility pattern studies to provision storage allocation [76] , model cost optimal device to device networks [47, 60] and improve data offloading [101, 80, 68] . Other applications of connectivity traces include analysis of spatial and temporal properties of pedestrian smart device based mobility datasets to enhance operations of wireless sensor networks [104] .

Movement and health tracking traces include information from sensors like heart rate monitors, oximeters, accelerators and magnetometers. One important application of these traces is in the healthcare field. For example traces with information about a user's orientation and displacement can be used to predict whether the user is about to fall; this kind of information can be useful to enable independent living for older adults. Another important application of these traces is positioning and localization for users. For example navigation traces collected from sensors over time can help build a map for a particular area, complete with obstacles. This map can later be used for several applications like, video gaming using augmented reality, and accessibility applications like creating navigation tools for mobility-challenged users.

In today's modern information-rich world, better understanding of human mobility has become increasingly essential in various areas such as network and communication provisioning and deployment, urban planning, health care delivery, to name a few. Existing efforts to study human mobility can be roughly grouped into 3 categories: mobility feature analysis, mobility prediction and mobility surveys.

Positioning and location sensor technologies like GPS, cellular radio tower geo-positioning, WiFi positioning and other motion tracking devices have enabled additional approaches to collecting human mobility data and mining patterns of interest.

Feature vector studies using mobility datasets were introduced in the late 19900s with the goal of analyzing human interactions in various environments that could provide information about cultural group formation [58] . Between 2010 and 2015, mobility feature studies took off again using GPS traces to mine geo-locations [112] and geocommunities [113] . There are also studies that focus on coarse-versus fine granularity of mobility datasets [38] , location-dependent versus location-independent datasets [44] , periodic transitions between locations affecting human mobility [77] , and city-wide GPS logs from taxis [94] . Other references conduct comparative studies of different GPS-based trace analysis techniques [67] .

Other kinds of mobility trace analysis involve data from location-based social network (LBSN) platforms to: extract and infer the purpose of travel, or the activity at the destination of a trip in daily life scenarios [111] ; or study the impact of location history collection on mobility features [81] ; human movement among point-of-interests (PoIs) [114, 54, 109] ; exploit information on transitions between types of locations, mobility flows between locations, and spatio-temporal characteristics of user check-in patterns [70] . Datasets captured from applications like Twitter are information-rich, e.g., they can indicate diversity of movement scopes among individuals as well as movement within and between cities [64] . Some references also talk about using multiple sources of data from both cell phones and transit [71] , and extracting mobility patterns using tensor decomposition techniques [107] . Others discuss inferring human mobility patterns from anonymized mobile communication usage [93] .

In the last five years with advances in data mining and data analysis techniques several references have talked about the importance of PoIs and temporal distance to understand mobility patterns [74] , using mobile and sensing data to analyze human habits and their living environments [105, 63] , and mining human behavior and patterns from geo-socially tagged data [45] , learning mobility patterns with minimal user intervention [33] . With the rampant usage of deep learning (DL) techniques, DL-based feature extraction approaches have been used to analyze trajectory and transportation based mobility traces [50, 103] . Additionally, recent fog and edge computing technology has paved paths for healthcare [57] and transit [78] based mobility feature extraction.

Mobility modeling in the 900s focused heavily on communication systems applications. Examples of mobility models include: using residence time distributions to analyze channel holding time [115] ; using features of asynchronous point-to-point communication like distribution of processes to locations, routing of messages, failure to reach locations and their detection, to extract mobility patterns of processes [34] ; and supporting mobility in IPv6 without loss of connectivity [35] . Mobility models studying mobility of elderly people and their quality of life was also briefly studied [69] .

Mobility prediction between 2010 and 2015 started branching out into several new areas. Data from location based sensing networks, like a person's GPS trajectory, were used to predict current and future locations visited by users, how frequently they were visited [48, 108, 99, 102] and to find additional points of interests [42] . Communication-based mobility modeling considered opportunistic networks and used data shared by short range devices to predict user communication patterns [75] ; it also targeted mobility-aware personalization and resource allocation for mobile cloud applications [110] . Transportation-based mobility models use bus/taxi travel requests to predict bus travel demand for different routes as well as locations for potential future customers [66] . There has also been some psychology-based human mobility studies on regularity and predictability of human movements [44, 83] , predicting human mobility in response to a large-scale disaster [85, 87] , and predicting long-term mobility associating location information with contextual features like days of the week [82] .

More recently, with mobility prediction riding the machine learning wave, there have been several references to DeepMove [52] that uses recurrent neural networks (RNNs) to predict human trajectory data, hidden Markov models to predict user movement [79] , federated learning as a privacy-preserving mobility prediction framework [53] , Deeptransport to predict user's future movements and transportation mode for a period of time [86] , DeepUrbanMomentum for prediction of short-term urban mobility [62] , variational trajectory convolutional networks to predict point of interests [55] , and Neural Turing machine with Stacked RNNs to predict neighborhood human mobility patterns [96] .

Over the years there have been various surveys outlining the state-of-the-art of mobility research. In 2011, Karamshuk et al analyzed challenges associated with human mobility in Opportunistic Networks research and also reviewed mobility analysis and models [65] ; Palmer et al surveyed various gathering and analyses techniques for spatially-rich demographic data using mobile phones [73] ; Lin et al surveyed data mined from GPS trajectory data focusing on locations significant for prediction of future moves, detecting modes of transport, mining trajectory patterns and recognizing location-based activities [67] . In 2013, Asgari et al surveyed datasets representing population flow in transportation networks along with their data types and various applications [37] ; Becker et al studied mobility characterization with respect to cellular network data [41] . In 2015, Hess et al described steps for creation and validation of mobile networking based mobility models [59] ; Yang et al surveyed wireless indoor localization using inertial sensors [106] . In 2018 Thorton et al surveyed user characteristics and their effect on human mobility, especially reactions to environmental change [95] ; Barbosa et al surveyed geolocation data to study individual versus collective mobility patterns [39] . In 2019, Toch et al analysed large scale mobility datasets using machine learning techniques [97] focusing on the data's positioning characteristics, the scale of the analysis, the properties of the modeling approach, and the class of applications; Solmaz et al discussed commonly used metrics and data collection techniques for various models and also proposed a taxonomy to classify mobility models based on their main characteristics [84] ; Wang et al surveyed mobility prediction models derived using multi-source datasets [100] .

From this brief summary of related work, it is clear that the study of human mobility ranges over an extremely broad scope of topics, each creating models with their unique information base. However, to the best of our knowledge, our work is the first to provide comprehensive documentation and classification of the various open source mobility datasets. Our goal is to provide the mobility research community a bird's eye view of existing collected datasets along with their potential application scope.

Motivated by a wide range of applications from urban planning, efficient transit and transportation, public health and healthcare, commerce, critical infrastructure planning and provisioning, to name a few, it has become increasingly important to better understand human mobility. As a result, mobility characterization and modeling has been attracting considerable attention from researchers and practitioners. Mobility records and traces have played a crucial role in enabling the exploration of how humans move in a variety of environments. In this paper, we survey the current state-of-the-art of publicly available mobility records by (1) Proposing a taxonomy that classifies these traces based on a number of factors including their mobility mode, data source, data collection technology, information type and their current and potential future applications; (2) Categorizing around 30 well-known public datasets using the proposed taxonomy; and (3) Presenting three case studies, each uniquely categorized according to our taxonomy. Our study also identifies significant gaps in the availability of real mobility traces. This has been motivating our ongoing work on developing realistic mobility trace generators.

MobiClique application at SIGCOMM

LifeMap monitoring system

Taxi cabs in Rome

Social Blue Conn Application

NSense dataset

Urban Mobility data Platform

BLEBeacon dataset

Community RF Sensing via iPhones for Source Localization and Coverage Maps

Covid-19 mobility trends in Apple Maps

FLEXRAN data

Flow Monitoring in the Mediterranean and Western Balkans

Microsoft visualization of the COVID pandemic trends by Apple and Google

Mobility changes in response to COVID-19

Vehicle mobility and communication channel models for realistic and efficient highway vanet simulation

Minimizing user involvement for learning human mobility patterns from location traces

On modelling mobility

Modelling ip mobility. Formal Methods in System Design

A multisource and multivariate dataset for indoor localization methods based on wlan and geo-magnetic field fingerprinting

A survey on human mobility and its applications

Towards fine-grained urban traffic knowledge extraction using mobile sensing

Human mobility: Models and applications

Mobility models, traces and impact of mobility on opportunistic routing algorithms: A survey

Human mobility characterization from cellular network data

Human mobility prediction based on individual and collective geographical preferences

Mobfogsim: Simulation of mobility and migration for fog computing. Simulation Modelling Practice and Theory

Evaluating mobility models for temporal prediction with high-granularity mobility data

Mining human mobility patterns from social geo-tagged data

A deep learning approach for identifying user communities based on geographical preferences and its applications to urban and environmental planning

Cost-optimal caching for d2d networks with user mobility: Modeling, analysis, and computational approaches

Contextual conditional models for smartphone-based human mobility prediction

Covid-19 outbreak response, a dataset to assess mobility changes in Italy following national lockdown

Deep feature extraction from trajectories for transportation mode estimation

From mobility patterns to behavioural change: leveraging travel behaviour and personality profiles to nudge for sustainable transportation

Deepmove: Predicting human mobility with attentional recurrent networks

Pmf: A privacy-preserving human mobility prediction framework via federated learning

Mining human mobility in location-based social networks

Predicting human mobility via variational attention

User mobility evaluation for 5g small cell networks based on individual mobility models

technology; ubiquitous computing and communications; dependable, autonomic and secure computing; pervasive intelligence and computing

Evolution, mobility, and ethnic group formation

Data-driven human mobility modeling: a survey and engineering guidance for mobile networking

Impact of user mobility on d2d caching networks

Geohash tag based mobility detection and prediction for traffic management

Deep Urban Momentum: An online deep-learning system for short-term urban mobility prediction

Activity-based human mobility patterns inferred from mobile phone data: A case study of singapore

Understanding human mobility from twitter

Human mobility models for opportunistic networks

Prediction of urban human mobility using large-scale taxi traces and its applications

Mining gps data for mobility patterns: A survey

Congestion-optimal wifi offloading with user mobility management in smart communications

Mobility of older people and their quality of life

Mining user mobility features for next place prediction in location-based services

Energy-efficient real-time human mobility state classification using smartphones

User mobility and quality-of-experience aware placement of virtual network functions in 5g

New approaches to human mobility: Using mobile phones for demographic research

On the properties of human mobility

Human mobility in opportunistic networks: Characteristics, models and prediction methods

Exploiting user mobility for wireless content delivery

Leveraging periodicity in human mobility for next place prediction

A vehicle-based edge computing platform for transit and human mobility analytics

A hybrid markov-based model for human mobility prediction

Exploiting user mobility for d2d assisted wireless caching networks

Impact of location history collection schemes on observed human mobility features

Far out: Predicting long-term human mobility

A refined limit on the predictability of human mobility

A survey of human mobility models

Limits of predictability in human mobility

Deeptransport: Prediction and simulation of human mobility and transportation mode at a citywide level

Prediction of human emergency behavior and their mobility following large-scale disaster

User mobility-aware virtual network function placement for virtual 5g network infrastructure

Inferring human mobility patterns from anonymized mobile communication usage

Uncovering urban human mobility from large scale taxi gps data

Human mobility and environmental change: a survey of perceptions and policy direction

Neural turing machine for sequential learning of human mobility patterns

Analyzing large-scale human mobility data: a survey of machine learning methods and applications

Gabor Soos

User group behavioural pattern in a cellular mobile network for 5g use-cases. IEEE/IFIP Network Operations and Management Symposium

Human mobility, social ties, and link prediction

Urban human mobility: Data-driven modeling and prediction

Mobility-aware caching in d2d networks

Regularity and conformity: Location prediction using heterogeneous mobility data

Mobiseg: Interactive region segmentation using heterogeneous mobility data

Analysis of smartphone user mobility traces for opportunistic data collection in wireless sensor networks

Exploring spatial-temporal patterns of urban human mobility hotspots

Mobility increases localizability: A survey on wireless indoor localization using inertial sensors

Human mobility synthesis using matrix and tensor factorizations

Supporting serendipitous social interaction using human mobility prediction

Discovering regions of different functions in a city using human mobility and pois

Mobility prediction in telecom cloud using mobile calls

Inferring travel purpose from crowd-augmented human mobility data

Extracting human mobility patterns from gps-based traces

Extracting human mobility and social behavior from location-aware traces

On the key features in human mobility: relevance, time and distance

Mobility modelling and channel holding time distribution in cellular mobile communication systems