key: cord-0135186-g4tc9qm0 authors: Georgiev, Martin; Eberz, Simon; Turner, Henry; Lovisotto, Giulio; Martinovic, Ivan title: Common Evaluation Pitfalls in Touch-Based Authentication Systems date: 2022-01-25 journal: nan DOI: nan sha: cbf6956df6ce82d5c3cdd66c143565378d9d62d8 doc_id: 135186 cord_uid: g4tc9qm0 In this paper, we investigate common pitfalls affecting the evaluation of authentication systems based on touch dynamics. We consider different factors that lead to misrepresented performance, are incompatible with stated system and threat models, or impede reproducibility and comparability with previous work. Specifically, we investigate the effects of (i) small sample sizes (both number of users and recording sessions), (ii) using different phone models in training data, (iii) selecting non-contiguous training data, (iv) inserting attacker samples in training data and (v) swipe aggregation. We perform a systematic review of 30 touch dynamics papers showing that all of them overlook at least one of these pitfalls. To quantify each pitfall's effect, we design a set of experiments and collect a new longitudinal dataset of touch dynamics from 470 users over 31 days, comprised of 1,166,092 unique swipes. We make this dataset and our code available online. Our results show significant percentage-point changes in reported mean EER for several pitfalls: including attacker data (2.55%), non-contiguous training data (3.8%), and phone model mixing (3.2%-5.8%). We show that, in a common evaluation setting, the cumulative effects of these evaluation choices result in a combined difference of 8.9% EER. We also largely observe these effects across the entire ROC curve. Furthermore, we validate the pitfalls on four distinct classifiers - SVM, Random Forest, Neural Network, and kNN. Based on these insights, we propose a set of best practices that, if followed, will lead to more realistic and comparable reporting of results in the field. Touch dynamics systems use distinctive touchscreen gestures for authentication. These interactions include both common gestures like swipes and scrolls and more advanced ones like pinch-and-zoom. Touch dynamics have been proposed as a way to improve the security of login-time authentication mechanisms and to enable continuous authentication while a device is being used. The field has been growing rapidly since the first papers were published in 2012, with 30 papers collecting unique swipe-and-scroll datasets published so far. Despite the growth in the field, no standard set of methods has been established to enable comparison between published work and transition to real-world deployment. While authors largely report the Equal Error Rate (EER) as a metric of average system performance, there are vast differences in methodological choices when evaluating systems on a static dataset. The goal of this paper is to identify these methodological choices, investigate how common they are in published work, and quantify their effect on reported system performance. These steps are crucial to enable fair comparisons between papers, ensure reproducibility of results, and obtain results that are compatible with a real-world system and threat model. Through our analysis of the existing work, we identify six pitfalls where design flaws in the experiment, data collection, or analysis impede comparability or lead to unrealistic results. To examine the impact of each of these pitfalls on a touch dynamics system, we collect our own longitudinal large-scale dataset of swipes.
Specifically, we investigate the effects of sample size, mixing different phone models in the analysis, using non-contiguous training data, including attacker data in training, using arbitrary aggregation windows, and the implications of code and data availability. We quantify each pitfall by its effect on the system equal error rate, showing that these pitfalls lead to conspicuous changes in the resulting performance. The dataset and code from our study are openly accessible to advance the field further. In this study we make the following key contributions: • We identified six evaluation pitfalls: small sample size, phone model mixing, selecting non-contiguous training data, including attacker samples, swipe aggregation, and code/dataset availability. We conducted a systematic analysis of the touch-based authentication literature, showing that all published studies overlook at least one of the pitfalls. • We quantified the effects stemming from these pitfalls in terms of resulting EER; to do so we collected a new 470-user touch dynamics dataset comprised of daily interactions over 31 days. The dataset and our code are available online. 1 • We outlined a set of best practices to avoid the identified pitfalls. These practices include both recommendations for experimental design and methods and recommendations to allow for reproducibility and comparability of results in the field. In this section, we present our identified evaluation pitfalls in touch authentication systems. Figure 2: Visualization of the difference between attacker modeling approaches. The "include attacker" model creates a better boundary between legitimate and invalid data, but it does not represent a realistic authentication scenario, as specific attacker data is rarely available at the time of model creation. P1: Small sample size. Sample size can refer both to the number of users in a study and to the number of data collection sessions recorded per user. Due to various experimental limitations, touch authentication methods are often evaluated on limited amounts of data, with a median of ∼40 distinct users and two data collection sessions. Nevertheless, the accuracy of the measured performance may benefit from a larger number of users. In fact, sampling negative training data from larger pools of users can lead to differences in the performance of the recognition model, affecting the mean system performance. On the other hand, collecting longitudinal data is also necessary to estimate the effect of changing user behavior over time, as this may change across different sessions. These sample size effects are non-trivial to measure and hinder a robust generalization of results found on smaller samples. P2: Phone model mixing. Many studies in the field perform data collection on multiple distinct device models. This can be a result of convenience (especially in remote studies) or an attempt to demonstrate system performance on different hardware. While phone models might look similar, slightly different specifications cause fundamental differences when devices are used to collect swipes. These differences are caused by various factors, including the shape of the phone, its resolution, how it is held, touchscreen sampling rate, and the value range of its pressure and area sensors. In general, an attacker would use the same phone model as their victim, as they use the same physical device in an in-person attack.
Mixing phone models in testing violates this requirement, as attackers and victims then use different device models. It is worth noting that this pitfall does not apply in the case of remote authentication, where the attacker can send data from any device model. P3: Non-contiguous training data selection. In practice, a biometric authentication system has an enrollment (training) phase which precedes the use of the system (or its evaluation). However, when using the randomized training data selection method, swipes are randomly sampled from the whole user data, as shown in Figure 1 (right). This does not resemble how a deployed system works, as it essentially evaluates the system by testing on samples from the past. As a consequence, randomized training data selection leads to biased performance estimation. P4: Attacker data in training. While there are several ways to design an authentication method, a common approach is to use a binary classifier that discriminates between legitimate and non-legitimate user samples. In this case, the negative samples (non-legitimate) are generally gathered from the available pool of users, and the same user pool is then used to test the system recognition rates. However, most stated threat models rule out the possibility that the classifier was trained with negative training data belonging to an attacker: attacker samples should be unknown. Figure 2 illustrates this problem: including the attacker samples in the training data provides a significant benefit against attacks compared to what happens when the attacker is excluded from training. This issue was initially addressed in [11], where it is shown that it artificially reduces the zero-effort attack success rates. The inclusion of an attacker in training data is incompatible with a realistic threat model. It is important to clarify that the attacker data we use to delineate the negative class consists of legitimate swipes of other users. While active attacks are interesting to examine, we limit our analysis to zero-effort attackers. P5: Aggregation window size. Intuitively, the use of multiple swipes when evaluating a particular model leads to an increase in performance [17, 19, 20, 32, 44]. While aggregating multiple swipes for an authentication decision is a legitimate approach in general (e.g., it mitigates occasional erratic behavior and improves recognition), it has two important drawbacks. Firstly, it impedes straightforward comparison between different approaches when the aggregation window size is different. Secondly, in a realistic threat scenario, it allows the attacker a non-negligible time to perform their attack, as the anomalous attacker behavior will only be identified after a certain number of swipes (depending on the aggregation window size). P6: Dataset and code availability. Datasets and codebases of touch-based authentication systems are rarely made publicly available. This is a major impediment to reproducibility and progress in the field. Sharing datasets would enable researchers to reliably separate the effects of different models from those of the collected data. Sharing the code used to obtain the results is especially important in light of the pitfalls investigated in this paper: oftentimes unstated assumptions are made which are not trivial to spot. The focus of our work is on mobile continuous authentication systems based on swiping and scrolling behavior.
While our work concentrates on the use of swipes as the most widespread touch method, there are other types of touch gestures used for authentication (e.g., "pinch to zoom" [40], screen taps [45]). In this paper, we consider swipes and scrolls - horizontal and vertical displacements on a touch-capacitive display performed using a single finger. Origin of touch-based authentication. Feng et al. developed one of the earliest systems in touch-based continuous authentication on smartphones [13]. Soon after, other systems solely based on the data provided by the phone were developed [17, 20, 25]. Many hybrid approaches for touch-based authentication have also been proposed. For instance, some research includes sensor data coming from the accelerometer and gyroscope [22, 39]. Deb et al. include 30 different modalities including GPS and magnetometer [10], and Rahul et al. have even taken into account the power usage of the device [26]. Data collection modalities. There are varying approaches for data collection in touch-based authentication. Frank et al. use text reading to collect vertical scrolls and a "spot the difference" game to gather horizontal swipes [17]. Similarly, Antal et al. use text reading and image gallery tasks [5]. Others include social media interactions [26], zooming on pictures [15], and questionnaires [32]. Buschek et al. evaluate the influence of GUI elements and hand postures on the performance of touch dynamics systems [9]. In order to analyze the time stability of the biometric, some recent studies collect data over multiple sessions or days. Watanabe et al. specifically look into the long-term performance of touch-based systems by collecting data over 6 months [38]. They demonstrate promising results for the time-stability of the biometric. While the data from some experiments is collected in a restricted environment during lab sessions, Feng et al. [15] recruited 100 users to use their data collection application over the course of 3 weeks to provide a more realistic environment when performing everyday tasks. Feature extraction and classification modalities. Most feature extraction methods in touch authentication systems focus on describing the geometrical attributes of swipes such as coordinates, duration, acceleration, deviation, and direction [17, 20]. Zhao et al., however, use a method to convert the stroke information into an image that can be used for statistical feature model extraction [44]. There is vast variability in the classification approaches in touch-based authentication. Some studies have focused on systematizing and comparing knowledge within the field. Fierrez et al. [6] analyze and compare recent efforts in the field in terms of datasets, classifiers, and performance. Serwadda et al. compare the most common machine learning algorithms in the context of touch-based authentication [32]. The studies suggest that Support Vector Machine (SVM) and Random Forest perform the best for touch-based tasks. Fierrez et al. provide insights into model and design choice performance by benchmarking open-access datasets [16]. They find that landscape phone orientation and horizontal gestures prove to be more stable and discriminative. Performance and metrics.
The difference in data collection and classification approaches leads to significant variability in the results reported in the field, with authors claiming EERs between 0% [8, 17] and 22.1% [22]. Studies also vary in their evaluation metrics, as results are reported in False Acceptance Rate (FAR), False Rejection Rate (FRR), Equal Error Rate (EER), Receiver Operating Characteristic (ROC) curves, and Accuracy. While it has been argued that EER does not adequately describe systematic errors [12], it is generally accepted as a good measure of average system performance. Furthermore, [34] argues for the importance of considering the ROC curve, as the EER metric could be misleading depending on TPR (True Positive Rate) and FPR (False Positive Rate) system requirements. In this paper, we abstract from the variety of experimental choices outlined in this section and investigate the fundamental effects of evaluation pitfalls on the EER and ROC curve. To check how prevalent the pitfalls are, we analyzed the touch authentication literature. We report an overview of our findings on 30 studies from the last decade, each of which introduces a new touch-based dataset, in Table 1. We only selected studies with experiments containing natural swiping behavior, such as navigating through specific tasks. We did not consider studies that only rely on mobile keystroke dynamics, sensors, tapping, and one-time gestures for authentication. Patterns that emerge are discussed throughout the paper. Table 1 shows that all of the studies included in the table are subject to at least one of the pitfalls described in Section 2. Our set of studies has a close to equal split in study environment, with 15 studies done in a lab and 13 remotely - the collection environment was unclear for the 2 remaining studies. We find that the median number of participants is 40, who complete a median of 2 sessions. This relatively low number of median sessions is concerning, and we analyze the impact of this (P1) in Section 7. Seven of the studies hand out devices to participants for a period of time without specific instructions on how often to use them, meaning that the precise number of sessions is not known. Of our analyzed studies, 28% mix device models in their data collection and do not discuss splitting them in the evaluation, falling into P2. Likewise, 30% of the studies do not clearly explain the way they select their training and testing data, with a further 18% using a randomized approach to select data, and are thus snared by P3. For those that do not explain their selection process, the code is also not shared, making it impossible to know how the selection was performed. In terms of attacker modeling, an overwhelming majority (80%) of the studies investigated use an unrealistic attacker modeling approach and include attacker data in the training set, falling victim to P4. A much smaller number of studies succumb to P5, with 17% reporting their results only on the analysis of an aggregation group of more than one swipe, hindering comparability across studies. Table 1: Data collection and analysis choices in touch dynamics studies. A check mark denotes that the study fulfills the column recommendation (i.e., does not fall into the evaluation pitfall) and a cross denotes that it does not; "?" means that the information was not shared or is unclear from the paper, and "-" means not applicable. A mark in the last column means that the code or dataset is no longer available through the provided URL (accessed on 14 April 2021).
The "Cont. (Period)" Sessions label indicates that the phone was given to the users for a period of time without specific instructions on how often to use it. The "Single Device Model" column marks whether the analysis separates data belonging to distinct phone models (even if the data collection included various phone models). P6 also captures many works, with only 8 studies (27%) sharing their datasets upon publication, two of which no longer have functional web pages. Furthermore, none of the studies we examined share a complete codebase of their work. One study, [17], does share the feature extraction code files but not the rest of the analysis. Recent studies have gathered large amounts of data by making collection apps available on public app stores [2, 29]. This is a step in the right direction in terms of dataset sizes but presents other challenges. For instance, in the case of [29] there is data from 2218 users collected on 2418 different devices, and in [2] there is data from 600 users on 278 distinct devices. There is likely a large variation in the unique device models used as well, especially considering the large fragmentation of the Android ecosystem. Furthermore, multiple people may perform the tasks on the same account (e.g., a parent handing the device to a child to play the game). We designed our data collection experiment to enable us to thoroughly measure the effects of each of the pitfalls described in Section 2. As a consequence, we have a few notable differences from previous datasets. We collected all data remotely on a carefully constrained set of devices. Furthermore, we obtained data from 470 participants (well above the median of 40) and collected up to 31 sessions per participant (compared to the median of 2). In the remainder of this section, we discuss the design of the key parts of our data collection experiment. Remote data collection provides two major benefits. Firstly, it allows for the collection of large amounts of data, which is impractical for a lab study due to the difficulty of recruiting participants with particular qualities at scale. Secondly, external factors such as the COVID-19 pandemic may prevent lab studies altogether, leaving remote collection as the only viable option. For our study, we utilized Amazon Mechanical Turk (MTurk) - a popular crowdsourcing platform, where workers perform Human Intelligence Tasks (HITs) in exchange for payment. The platform gives access to a large population of potential subjects and allows for targeting by age, gender, and other demographic criteria. We created an MTurk HIT, which described the requirements and details of the study and guided the subjects to install the data collection app, distributed through TestFlight - an online service for over-the-air installation on the iOS platform, which does not allow the general public to install the application. The HIT also contained the participant information sheet, as required by our institutional review board. This study received ethical approval. Application Onboarding. Upon opening the application, users were required to complete a consent form and provide demographic information as they would in a lab study. Users were then required to complete their first pair of tasks once. This established a connection between MTurk and the application, providing users with the first payment and allowing payments to be automatically generated for subsequent completions of the task. Study Duration. Within the study, participants were invited to participate for either 7 or 31 days.
Each day, participants were prompted with a notification (if they allowed notifications from the application) to complete the task at 9 am, and again at 7 pm if the task had not yet been completed that day. Not all users, however, completed their tasks consistently, as further discussed in Appendix C. We selected the iOS platform to carry out our data collection efforts in order to ensure the consistency of hardware and software throughout experimentation. The other major mobile operating system, Android, includes a much higher number of device models with varying screen sizes and sensors, making it impractical for our analysis. Moreover, the majority of Android devices approximate their reported touch pressure values by considering the size of the touchpoint, while the iPhone models we have chosen support "3D Touch" - a true pressure sensor built into the screen of the devices. Due to these restrictions, we narrowed down our efforts to the nine devices shown in Appendix D. These design choices left us with a large number of users using a limited number of models, but still let us make comparisons in terms of phone size, resolution, and even hardware differences. To our knowledge, there is only one other paper [43] in the field which focuses on iOS devices for touch-based authentication, and its dataset is not publicly available. While we have placed specific restrictions on our data collection and experimentation, the dataset can be used for developing systems beyond the specifics of this study. To facilitate our study, we developed an iOS application that collects touch and sensor data as users perform common smartphone interactions. We collected coordinate and pressure data for each user interaction with the screen at the maximum rate of 60Hz. Furthermore, we also recorded the accelerometer and gyroscope data at their maximum frequency of 100Hz. The application required users to complete two tasks: a social media style task and an image gallery task. The design and intention of these tasks are described in Section 4.4. We optimized the number of rounds of each task to equalize the completion time and the number of swipes and scrolls collected per task. Both tasks were intended to be completed with the phone in a vertical position, and thus we did not allow a change in the layout when the device was rotated. The application home page included elements such as a completion streak and earning potential in order to increase user retention throughout the study. The order in which the two tasks were presented was randomly determined before each session, and the instructions for completing each task were provided before each task began. The user was required to perform five rounds of each task, with the correctness of answers being validated to ensure the legitimacy of the data and avoid abuse. If the user made a mistake, they were prompted to repeat that round of the task. On completion of both tasks, the touch and sensor data was transmitted to a remote server. Social media task. The goal of the social media game is to gather touch data by simulating how users tend to use their phones on common vertical scrolling tasks such as browsing a social media feed or looking through a list of news articles. In this task, users were required to scroll through a feed in order to find articles or posts which fit a given description.
The articles and corresponding images were gathered from the copyright-free content of NewsUSA [1], and we manually created an unambiguous corresponding description for each one of them. Each description was associated with one unique article or post, and there were 600 such pairs available in the system. The feed was 20 items long, and the correct description-answer pair was randomly chosen and mixed with arbitrary decoys pooled from the rest of the pairs. Image gallery task. The goal of the image gallery game is to gather touch data by simulating how users tend to use their phones on common horizontal swiping tasks such as browsing a list of photos or application screens. In this task, users were presented with a horizontal list of pictures in which only a single image was visible at any given time. Users were required to count the number of occurrences of a specific object while swiping through the gallery. For instance, the objects could be animals such as dogs and cats or food items such as pizza. All the images were gathered from the open computer vision "Common Objects in Context" (COCO) dataset [21]. There were a total of 200 unique images in the system, and each challenge presented 20 images in the gallery while ensuring that between 2 and 6 of them contained the target object. At the end of the round, users were required to enter the number of objects they had counted. The application's source code is available with the rest of the data and code from the project. As with any remote data collection experiment, the lack of direct experimenter involvement poses challenges. The two actions that could compromise the quality of the dataset are participants completing the study twice and participants asking others to complete some of their sessions. The first case is highly problematic since the user would appear twice in the data under different labels. However, to do so the participant would require two MTurk accounts, two Apple accounts, two physical devices, and the capability to accept and complete the HIT twice before it expires. The second case of participants handing their phones to someone else for some of their sessions is harder to rule out entirely. However, we reminded participants not to do so at the start of each session, and the impact of participants disregarding this would be limited to individual sessions. Lastly, data may have been collected under varying uncontrolled conditions that differ both between users and between sessions of the same user. For instance, a user could be sitting or walking, holding the phone, or having it on a table. While this may hinder the overall performance (as it adds variability), it should be considered a more realistic representation of the way a touch-based system will be used in practice. In total, we collected data from 470 users, amounting to 6,017 unique sessions and 1,166,092 unique swipes. On average, users completed 13 sessions, with cumulative distribution function plots for each study duration group shown in Appendix C. The majority of the users that completed the first few sessions continued throughout the whole duration of the experiment. On average, an image gallery task took 1:54 minutes to complete and resulted in the collection of 124 swipes. The social media task took 1:48 minutes to complete and, measured the same way, resulted in 79 scrolls. The average duration of a swipe was 58ms and the average flight time between swipes was 630ms. The demographics of our participants can be found in Appendix C.
Here we present our data and machine learning pipeline, and we describe how we investigate the effect of the pitfalls P2, P3, and P4, which require specific steps. P1 and P5 are analyzed directly by varying the sample size and the aggregation window size, respectively. Our implementation is available online. Division by phone model. As outlined in Section 4.2, our participants used 9 distinct phone models for data collection. While their hardware and sensors are likely to be very similar, there are differences in their screen size, resolution, and shape. In order to control for the effect of P2, we create distinct data subsets by isolating data collected by each phone model (which we refer to with the phone model name, e.g., xs max). We compare the performance on these phone model-specific subsets with the performance computed on the entire dataset containing data from all models, which we refer to as combined. Preprocessing and feature extraction. As the first step, we aggregate individual touch samples (consisting of X/Y coordinates and touch pressure) within a game into horizontal swipes (image gallery task) and vertical scrolls (social media task). In all subsequent steps, scrolls and swipes are classified separately and independently. In order to avoid including taps, we remove swipes shorter than 3 samples and those that do not deviate by more than 5 pixels from the starting point. For each remaining swipe and scroll, we calculate a set of features directly taken from [17]. All positional features are normalized to the screen resolution. We also distinguish between the direction (left/right or up/down) of both swipes and scrolls. Training data selection. In order to control for the effect of P3, we consider four methods of dividing the target user's data into training and testing sets (a minimal code sketch of these strategies follows the list below). In the following, U identifies the set of all users, n_u identifies the number of samples (swipes) belonging to user u, and tr and te refer to the fractions of samples used for training and testing, respectively. • random: we choose tr·n_u training samples for user u at random out of all the available samples, i.e., all sessions are merged; testing uses the remaining samples. This process is repeated independently for each user. • contiguous: we combine all samples of a user and select the first tr·n_u samples (in chronological order) for training and the remainder for testing. • dedicatedSessions: for a user, we select a subset of their sessions for training and test on the remaining sessions. This ensures that each session is used for either training or testing and that training and testing samples are never drawn from the same session. We investigate selecting sessions both contiguously (in chronological order, with the first sessions used for training and later sessions used for testing) and randomly. • intraSession: for a user, we select a specific session and use the first half of its samples for training and the remainder for testing. Only samples from the chosen session are used.
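For illustration, the tap filter and the four selection strategies can be sketched in Python as follows; the function and variable names are ours and the sketch assumes each swipe is an array of touch samples whose first two columns are x and y, with swipes and session identifiers in chronological order, rather than mirroring the released codebase.

import numpy as np

def filter_taps(swipes):
    # drop taps: swipes with fewer than 3 touch samples or less than 5 px of travel from the start
    return [s for s in swipes
            if len(s) >= 3
            and np.max(np.linalg.norm(np.asarray(s)[:, :2] - np.asarray(s)[0, :2], axis=1)) > 5]

def split_random(n_samples, tr, rng):
    # merge all sessions and pick tr*n_u training swipes at random; the rest are used for testing
    idx = rng.permutation(n_samples)
    k = int(tr * n_samples)
    return idx[:k], idx[k:]

def split_contiguous(n_samples, tr):
    # first tr*n_u swipes in chronological order for training, remainder for testing
    k = int(tr * n_samples)
    return np.arange(k), np.arange(k, n_samples)

def split_dedicated_sessions(session_ids, tr, rng=None):
    # whole sessions go to either training or testing, never both
    sessions = list(dict.fromkeys(session_ids))            # unique sessions, chronological order
    if rng is not None:                                     # randomized session assignment
        sessions = list(rng.permutation(sessions))
    train_sessions = set(sessions[:max(1, int(tr * len(sessions)))])
    mask = np.isin(session_ids, list(train_sessions))
    return np.flatnonzero(mask), np.flatnonzero(~mask)

def split_intra_session(session_ids, chosen_session):
    # first half of a single chosen session for training, second half for testing
    idx = np.flatnonzero(np.asarray(session_ids) == chosen_session)
    return idx[:len(idx) // 2], idx[len(idx) // 2:]

For example, split_dedicated_sessions(session_ids, tr=0.8) gives the contiguous-session variant, while passing rng=np.random.default_rng(0) gives the randomized-session variant.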
Attacker modeling. To evaluate the effect of P4, we examine two different scenarios: one where attacker samples are included in the training data and one where they are not. In both cases, we train a binary model where the user's samples are labeled as positive and multiple other users are combined into a single negative class. • excludeAtk: For each user, we randomly divide the remaining users into two equally-sized sets U_1 and U_2. For training, we select positive class data from the available data of the user and negative class data from U_1. We ensure the two classes are balanced. For testing, we treat all users from U_2 as attackers and classify their samples along with the user's testing samples. This ensures that there is no overlap between the attackers used for training and testing. We use this approach over the leave-one-out method proposed in [12] to avoid overfitting when a separate threshold is chosen for each user-attacker pair. • includeAtk: We select a user and split the remaining users into U_1 and U_2. We first train and test the system on U_1. This involves training a model for each user where tr·n_u of the user's samples and tr·n_u/|U_1| of each attacker's samples are used for training and the rest for testing. This ensures that the negative and positive classes are balanced in the training data. This process is then separately repeated with U_2. Scaling. Following the division of data into training and testing batches, along with the inclusion or exclusion of attacker data, we standardize each feature by computing the mean and standard deviation of the training data. The training and testing samples of both the user and the attackers are scaled by subtracting the mean and dividing by the standard deviation of this training data. Classification. Following scaling, we fit a classifier to our training data for each user. We then classify the samples in the testing set, which gives us a probability for each sample. This probability is in turn used for both sample aggregation and threshold selection. Sample aggregation. For this optional step, instead of treating samples independently, we group a set of consecutive samples together and take their mean probability estimate, which we use instead of the individual probability estimates for threshold selection and the final decision. Threshold selection. Taking the distance scores for the testing samples (both user and attacker samples), we compute the EER for each user. This is done by finding the distance score threshold at which the FAR and FRR are equal. The mean EER for a given system is the average EER across all users.
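A minimal sketch of the per-user scaling, classification, and EER computation described above is given below; it assumes per-user genuine and attacker score arrays are available, and the threshold sweep shown is one common way to locate the point where FAR equals FRR, not necessarily the exact procedure of the released code.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_user_model(X_train, y_train):
    # standardize with training-data statistics only, then fit the per-user classifier
    scaler = StandardScaler().fit(X_train)
    clf = SVC(probability=True).fit(scaler.transform(X_train), y_train)
    # test scores: clf.predict_proba(scaler.transform(X_test))[:, 1]
    return scaler, clf

def user_eer(genuine_scores, attacker_scores):
    # FAR(t): fraction of attacker samples accepted; FRR(t): fraction of genuine samples rejected
    thresholds = np.unique(np.concatenate([genuine_scores, attacker_scores]))
    far = np.array([(attacker_scores >= t).mean() for t in thresholds])
    frr = np.array([(genuine_scores < t).mean() for t in thresholds])
    i = np.argmin(np.abs(far - frr))      # operating point where FAR and FRR are (nearly) equal
    return (far[i] + frr[i]) / 2.0

def mean_eer(per_user_scores):
    # per_user_scores: list of (genuine_scores, attacker_scores) pairs, one per user
    return float(np.mean([user_eer(np.asarray(g), np.asarray(a)) for g, a in per_user_scores]))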
To quantify each pitfall's effect on the evaluation performance, we analyze their effects one at a time. Our system implementation is based on one of the seminal papers in the field [17]. We report our results from the SVM classifier, as it is the best-performing method in that study, but we also experiment with other classifiers (Random Forest, Neural Network, and k-Nearest Neighbors (kNN)). We discuss classifier differences at the end of this section. When investigating one pitfall, we fix the remaining experimental choices to estimate a baseline performance as follows: (i) combined, (ii) contiguous, (iii) excludeAtk, and no sample aggregation. We chose this specific configuration as the default in our experiments for the following reasons. For phone model mixing and training data selection, we chose the most common configurations in Table 1, combined and contiguous respectively. However, we chose excludeAtk as previous work on the topic has already shown the negative effects of using the unrealistic includeAtk approach [11]. We do not use aggregation of samples in our default configuration as it adds another dimension to the data and results, thus making comparison within experiments and with previous work more complicated. Unless otherwise specified, we focus on the effect of pitfalls on the mean EER, i.e., for an experiment configuration, we train the system, then use the test set to estimate each user's EER (per-user EER) and report the average of those. We also report the mean ROC curve with 95% confidence intervals where appropriate. The baseline system resulted in a mean EER of 8.4% with a standard deviation of 5.57%. As our goal is to investigate the fundamental effects of evaluation pitfalls, we focus on the most populous swipe type, left swipes, to limit sources of variability. Details about the per-user EER distribution and the effects of swipe direction on performance can be found in Appendix E. Here we investigate non-trivial effects of the user sample size and of the amount of available data per user on the resulting mean EER. 7.0.1 User sample size. Oftentimes it is assumed that the EER of a given authentication method can be reliably estimated by sampling roughly 40 users (the median number of users in Table 1). To investigate this, we randomly sample n < 470 users from our dataset and compute the mean EER of the system fit on those users and the standard deviation of each sample's per-user EER distribution. We focus on the standard deviation of the per-user EER distribution as it is a proxy for the evaluation of systematic errors and EER outliers: certain users with high per-user EER are responsible for a larger proportion of the resulting mean EER [12]. The sampling procedure is repeated 1,000 times for each n (sketched at the end of this subsection). We then use n=40 (the median user sample size in Table 1) as a reference: we test whether the metrics obtained at n=40 reliably predict the behavior for different n. Effect on mean EER. The left-hand side of Figure 4 reports the difference in behavior between the EER measured empirically for various n and the EER extrapolated from the performance of the n=40 subset. The figure shows that increasing the number of users in the model has a non-negligible effect on the EER: while we obtain EER=9.14% for n=40, increasing the number of users has a large benefit, reaching EER=8.41% for n=400. Effect on per-user EER standard deviation. The right-hand side of Figure 4 reports the difference in behavior between the empirical per-user EER standard deviation for various n and the standard deviation extrapolated from the performance of the n=40 subset. Given the effect described in the previous paragraph, to allow for a meaningful comparison we adjust the extrapolated standard deviation to account for the reduction in mean EER (which reduces the per-user EER standard deviation). We do so by adjusting the standard deviation extrapolated at each n with the scaling ratio between the empirical mean EER measured at n and the one measured at n=40, i.e., given the per-user EER standard deviation σ_40 measured at n=40 and the empirical mean EERs μ_n and μ_40, we estimate the adjusted extrapolation as σ̂_n = σ_40 · μ_n / μ_40; this moves the two distributions to the same mean EER. Figure 4 (right) shows that for increasing n there is a notable decrement in the per-user EER standard deviation, which is not solely explained by the EER mean reduction presented above. Overall, we find that increasing the user sample size greatly benefits the machine learning model (at least with our general method and the SVM), thanks to the added variety of negative samples coming from larger pools of users. Larger sample sizes not only lead to lower and more accurate measurements of the underlying EER but also have a regularizing effect on the resulting per-user EER distribution, leading to fewer outliers. This also challenges previous findings regarding the usage of error distribution metrics [12], as user sample sizes will also have an effect on the EER distribution across users.
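A sketch of the subsampling procedure and the adjustment above follows; evaluate_pipeline is a hypothetical helper standing in for the full training and evaluation pipeline, returning the per-user EERs for a given set of users.

import numpy as np

def subsample_stats(all_users, n, evaluate_pipeline, repetitions=1000, seed=0):
    # draw n users at random, rerun the pipeline, and record mean and std of per-user EERs
    rng = np.random.default_rng(seed)
    means, stds = [], []
    for _ in range(repetitions):
        chosen = rng.choice(np.asarray(all_users), size=n, replace=False)
        eers = np.asarray(evaluate_pipeline(chosen))
        means.append(eers.mean())
        stds.append(eers.std())
    return float(np.mean(means)), float(np.mean(stds))

def adjusted_extrapolated_std(std_40, mean_40, mean_n):
    # sigma_hat_n = sigma_40 * mu_n / mu_40: rescale the n=40 standard deviation by the
    # ratio of mean EERs so both distributions sit at the same mean EER before comparison
    return std_40 * mean_n / mean_40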
Increasing the amount of data collected per user may lead to differences in performance: (i) across several data collection sessions users may get acclimatized to the task (leading to better stability of the collected swipes) and (ii) a larger amount of data per user may generally benefit the performance of the machine learning model. In the following paragraphs, we test both factors separately. Effect of user acclimatization. We use data from the 68 users who completed the full 31 sessions; given a number of sessions k, we split the data into the k earliest collected sessions (Early) and the k latest collected sessions (Late). If users gradually get used to the experimental settings (i.e., their behavior exhibits reduced variation), then Early sessions will perform worse than Late sessions, when the user has acclimatized after many repetitions. We apply our authentication pipeline on both the early and late sets, making several splits with k ranging from 3 to 15. We report the results in Figure 5, showing no significant difference between the performance of early and late sessions. Therefore, the data shows no evidence of task acclimatization leading to changes in performance. Effect of amount of data per user. We again use data from the 68 users who completed the full 31 sessions and consider the effect of the increasing amount of data per user by evaluating the system performance as the number of sessions grows. Figure 6 shows the resulting EER for a growing number of sessions. We found that no specific trends emerge as the session count varies. We extend the analysis to the remaining users as well by considering the number of swipes per user rather than the number of sessions. Figure 7 shows the relationship between the number of swipes and the resulting per-user EER; points are labeled Short or Long depending on which study batch the user belonged to (see Section 4). We found that there is no clear distinction or trend based on the number of swipes, reinforcing the previous results of Figure 6. Both figures indeed suggest that the number of swipes or sessions does not necessarily affect the performance of our model, which contradicts hypothesis (ii). While long-term studies are necessary to investigate the stability of the biometric, the availability of long-term data does not affect EER in a significant way. In this section, we compare the system performance on data belonging to individual phone models and when merging together data from various phone models (combined). We then explore this concept further by measuring how accurately we can predict the phone model a swipe originated from. Effect of combining phone models. As evidenced in the previous Section 7, increasing n leads to an EER reduction (see Figure 4). To account for this, we compare each single-phone subset to a combined subsample from all phone models, with an equal number of users as for each respective phone model. Table 2 presents the results for the combined dataset and the single-phone model subsets. The table shows that the combined approach leads to an overestimation of performance. We observed a decrease in EER for each of the phone models when its data was combined with data from the other models.
Furthermore, we performed a t-test and found that the EER difference between a single phone model and a combined subsample is statistically significant (P < .05) except for 6s Plus, 7 Plus and XS Max. Figure 3a shows the complete ROC curves for the iPhone 7 model (which has the largest number of users in our dataset) and its respective combined model. The overestimation of performance is present throughout the whole of the ROC curves, apart from extreme TPR and FPR values. The ROC curves for the other phone models can be found in Appendix B. Phone model identifiability. We create a phone model classifier whose aim is to identify the iPhone model of a given swipe. We merge all the available data and label each swipe with its originating phone model; the data is then divided into 80/20 train-test splits. The data is balanced such that each phone model has an equal number of swipes in the training split. We make sure that users who were used in training are not considered in testing and vice versa (to avoid biasing the prediction with the users' identities). We fit an SVM classifier to the data (a sketch of this experiment is given below). We perform this experiment once using all 9 phone models and again only with the 6s, 7, and 8 models, as these three have equal screen sizes, resolutions, and pixel densities. The classifier achieves 44% accuracy, where a random baseline model would yield 11.1%. When considering only the 6s, 7, and 8, we achieve an accuracy of 49% compared to a baseline of 33.3%. A complete confusion matrix for the classification experiment including all nine phone models can be found in Appendix F. This shows that differences in the properties of the devices are reflected in the identification outcome, i.e., swipes belonging to similar phone models tend to be more similar. These results indicate that it is undesirable to mix different phone models in data collection and analysis for touch-based authentication. Furthermore, it is irrelevant whether the mixed models have similar screen sizes, dimensions, or display pixel densities. The practice of mixing phone models can lead to an artificial increase of performance between 2.5% and 4.5% EER.
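One way to realize the phone-model identification experiment is sketched below, assuming features, phone_models, and user_ids are aligned NumPy arrays; the user-disjoint 80/20 split and class balancing follow the description above, while the helper name and hyperparameters are illustrative.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def phone_model_identification(features, phone_models, user_ids, test_frac=0.2, seed=0):
    rng = np.random.default_rng(seed)
    users = np.unique(user_ids)
    test_users = rng.choice(users, size=int(test_frac * len(users)), replace=False)
    test_mask = np.isin(user_ids, test_users)       # users are disjoint between train and test

    X_tr, y_tr = features[~test_mask], phone_models[~test_mask]
    X_te, y_te = features[test_mask], phone_models[test_mask]

    # balance the training split: every phone model contributes the same number of swipes
    per_class = min(np.sum(y_tr == m) for m in np.unique(y_tr))
    keep = np.concatenate([np.flatnonzero(y_tr == m)[:per_class] for m in np.unique(y_tr)])

    clf = make_pipeline(StandardScaler(), SVC()).fit(X_tr[keep], y_tr[keep])
    return accuracy_score(y_te, clf.predict(X_te))  # compare against a 1/num_models baseline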
We compared the classification performance of our model under the conditions described in Section 6: (i) random, (ii) contiguous, (iii) dedicatedSessions, and (iv) intraSession. For a fair comparison, we only used data from the 409 users who completed 2 or more sessions, as this is a prerequisite for the dedicatedSessions modality. We present our findings in Table 3. As expected, the intraSession method yielded the best performance, as users have a more stable interaction pattern during a single session than through time [17]. The fact that the model performed well in this category is hopeful, but in practice users carry out many sessions over time, and the intraSession result should not be considered an accurate metric for touch-based authentication systems. Mixing and randomizing samples from all sessions (the random approach) provided a similar effect, as the model learns from information about users' interactions throughout all sessions. contiguous training also allows the model to learn from an overlapping session, which yields better performance. The dedicatedSessions scenario is the most realistic one for a touch authentication system as it relies on self-contained training sessions - as they will be performed in a deployed system. We found that results between all of the methods vary considerably and performance seems to be overestimated compared to the realistic dedicatedSessions approach. An unrealistic training data selection can lead to an increase in performance of 3.8% EER when using the random approach compared to the dedicatedSessions approach. The complete ROC curves resulting from this experiment are available in Figure 3c. The ROC curve results are mostly consistent with the EER reported in Table 3, apart from the random and intraSession curves, where random selection has a higher TPR above 0.08% FPR. We compared different attacker modeling choices as described in Section 6: (i) excludeAtk and (ii) includeAtk. To do so, we randomly subsampled users from our dataset at various n; for each n we apply our pipeline and compute the resulting EER for the two approaches. This procedure is repeated 10 times; Figure 8 and Figure 9 illustrate the results. We find that includeAtk results in consistently lower mean EER when compared to excludeAtk, see Figure 8. However, Figure 9 shows how the EER difference between the two approaches decreases exponentially as the number of users (n) increases. This is expected, as the fewer users are considered, the more the presence of attacker data impacts the classifier (e.g., 10% of negative training data for n=11 users, <1% of negative training data for n > 101 users). Figure 9: Absolute EER difference between includeAtk and excludeAtk attacker modeling approaches. For each number of users, the shaded areas report 95% confidence intervals on the mean difference from 10 random subsampling repetitions. This diminishing return also explains why with includeAtk the EER increases when more users are included, despite the expectation that more data might result in better performance. Figure 9 shows that at n=40, the EER difference between the two approaches is 2.55%. As pointed out in Table 1, 80% of our reported studies fall into P4, meaning that these might not present performance metrics appropriate for the specified threat model. Overall, depending on the user sample size considered, includeAtk can lead to an artificial performance gain of between 0.3% and 6.9%. Figure 3c shows the ROC curves of includeAtk and excludeAtk models for 40 users (the median number of users from Table 1). The ROC curves for 20, 100, 200, 300 and 400 users are also available in Appendix A. When reporting their results, many studies [17, 19, 20, 32, 44] consider the performance of a group of consecutive swipes instead of a single one as we have done so far. Figure 10 shows the performance of our pipeline when we use an aggregation of consecutive swipes as described in Section 6. The procedure was repeated 10 times, and shaded areas show the 95% confidence interval across the ten repetitions. As expected, increasing the aggregation window size leads to lower EERs: an EER of 8.2% obtained on single swipes drops by more than a quarter (to 5.9%) when aggregating two swipes, and drops to less than 3% at 12 swipes. Touch-based authentication studies should be clear about when and how they use such aggregations, as they evidently have an impact on performance. It should also be noted that each swipe action takes time to perform, which can leave a system at risk. For instance, our dataset suggests that on the tasks considered, performing 20 swipes would take 14 seconds, during which the system would be vulnerable. Therefore, a balance between usability and security should be sought.
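Aggregation over a window of w consecutive swipes can be sketched as follows; the mean of the per-swipe probabilities replaces the individual scores before the EER is computed, the non-overlapping grouping shown here is an assumption, and user_eer refers to the earlier hypothetical helper.

import numpy as np

def aggregate_scores(scores, window):
    # mean probability over non-overlapping groups of `window` consecutive swipes
    scores = np.asarray(scores)
    n = (len(scores) // window) * window      # drop the incomplete trailing group
    return scores[:n].reshape(-1, window).mean(axis=1)

# example: per-user EER at a window size of w
# eer_w = user_eer(aggregate_scores(genuine_scores, w), aggregate_scores(attacker_scores, w))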
In this subsection, we quantify the difference between realistic (pitfall-free) and unrealistic (with all pitfalls) evaluation choices for touch authentication systems. We repeated the following two procedures 100 times and report the mean of all runs and the confidence interval at 95%. In the unrealistic methods experiment, we combined phone models (combined), included the attacker in the training data (includeAtk), used the random data selection method, and in each round randomly subsampled our dataset to the median of n=40 participants taken from Table 1 (to even out the effect of P1). This resulted in a 4.9% EER with a confidence interval of ±0.09. In the realistic methods experiment, we again selected n=40 users from the most commonly used iPhone 7 phone model, and used excludeAtk and the dedicatedSessions training data selection. In each round, we randomly selected which users acted as attackers. This approach resulted in a much worse EER of 13.8% with a confidence interval of ±0.14. Figure 3d illustrates the overestimation of performance throughout the ROC curves of these experiments. The results clearly illustrate that flawed methods have strong effects on the resulting performance and can lead to an artificial performance boost of 8.9% EER. In this subsection, we quantify the impact of the pitfalls on performance for four of the most widely used machine learning algorithms in the field. Implementation details for each individual classifier can be found in Appendix G. The results of our experiments are presented in Table 4. All of the examined pitfalls introduce an overestimation of performance regardless of the classifier chosen. However, there are differences in individual performance across the chosen classifiers. For instance, the kNN classifier relies heavily on individual swipes similar to the target one, hence the impact of including attacker data in training is much more pronounced. These results suggest that the pitfalls apply to a wide range of touch dynamics system implementations.
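For concreteness, the four classifiers can be instantiated in scikit-learn as sketched below; the hyperparameters are illustrative defaults rather than the tuned values of Appendix G, of which only the Adam optimizer and ReLU activation for the neural network are stated.

from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier

CLASSIFIERS = {
    "SVM": SVC(kernel="rbf", probability=True),
    "Random Forest": RandomForestClassifier(n_estimators=100),
    # MLPClassifier uses ReLU activation and the Adam optimizer, matching the
    # details given for the neural network in Appendix G
    "Neural Network": MLPClassifier(activation="relu", solver="adam"),
    "kNN": KNeighborsClassifier(n_neighbors=5),
}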
In order to facilitate better comparison between future studies and achieve unbiased performance evaluation, we propose a standard set of practices to follow when evaluating touch-based authentication systems, derived from our set of common evaluation pitfalls. P1: Small sample size. While it is hard to advocate for a specific minimum number of users to be required by a study, we recommend that researchers be aware of the effects of user sample sizes in pipelines similar to the one analyzed in this paper. Based on the findings in Section 7, we found that increasing the sample size has two important effects: it reduces the resulting mean EER and smooths the variance of the per-user EER distribution. It is advisable that an analysis of the effect of sample size is included in new studies, and that results for a sample size of n=40 are also reported (when applicable). This best practice must be accounted for at the study design phase, to ensure enough data is initially collected. P2: Phone model mixing. A single phone model should be used to train and test a proposed system. While this might not always be the final use case (e.g., in other scenarios, one might want to test the generalization performance of a device-specific classifier on a different device), this avoids the bias introduced by data collected on a specific phone model. Isolating data belonging to different phone models when training will produce more accurate performance measurements. Care must be taken in data collection to ensure there are enough samples for each phone model that will be studied. P3: Non-contiguous training data selection. Randomized swipe selection should not be used to separate training and testing data. Test data must always have been collected at a time after the training data was collected, to mimic real-world usage and to account for behavior drift. For comparison between works, only an initial training phase (enrollment) should be included, as training updates increase the difficulty of comparing figures. Ideally, at least two sessions should be used to collect training and test data, as the bulk of real-world usage occurs with a time interval between enrollment and authentication. P4: Attacker data in training. Studies should always exclude the attacker from the training set, as one should never assume they have information about the attacker in a deployed system. In particular, care should be taken that any attacker of a model was not included as a negative example when training that model. Excluding the attacker is particularly important in studies with a limited number of users, where the effect of such an attacker modeling approach greatly affects the resulting performance. P5: Aggregation window size. Using aggregation of consecutive swipes is beneficial to performance, particularly when using the mean of their distances to the decision boundary, as shown in Figure 10. However, researchers should report the performance of a single-swipe model in order to ensure comparability with other studies, as well as other reasonable numbers of swipes that similar papers have proposed. Furthermore, information about the flight time between swipes and their duration should also be shared, as these directly relate to the time the system is vulnerable to an attacker. P6: Dataset and code availability. Historically, in this field, it has been rare for authors to share their data - see Table 1 - and none of the studies examined in the related work share their analysis code. This leads to uncertainty when reproducing results; in fact, for some studies, it was unclear from the paper alone whether the study made certain choices regarding the experiments (e.g., we could not clearly determine whether 30% of studies fell into P3). The code and datasets of touch authentication studies should be made freely available. This ensures that results can be reproduced by others and reduces barriers to entry for those wishing to build upon existing work. Generality of results. Although this paper focuses on touch-based authentication, we believe these best practices apply in similar ways to other types of biometric systems such as facial recognition and keystroke authentication. In particular, non-contiguous training data selection (P3) and inclusion of attacker data in training (P4) are fundamentally flawed and should be avoided in all biometric system evaluations. However, the effect of mixing similar devices (P2) may vary across different modalities. Similarly, the sample size implications (P1) might differ in other systems from what we found in our experimentation. Nevertheless, these points should be examined with caution by the relevant literature. Further work is required to examine to what extent these pitfalls are prevalent in the study of other biometric authentication systems. In this work, we explored the impacts of evaluation choices on touch-based authentication methods.
We investigated performance differences in approaches related both to data gathering and to choices in the way classifiers are trained with a certain data split. For the purpose of this study, we collected a large open-source dataset for touch-based mobile authentication consisting of 470 users, which we made publicly available. We confirmed large variations in performance based on phone model mixing (up to 5.8% EER), training data selection (up to 3.8% EER), user sample size (up to 4% EER), and attacker modeling (up to 6.9% EER). Finally, combining all evaluation pitfalls results in an overestimation of performance by 8.9% EER. The results are largely similar regardless of the chosen classifier. We also note that, aside from some extreme threshold settings, these effects are observable throughout the ROC curve. Based on these findings, we proposed a set of good practices to be considered in order to enable accurate reporting of results and to allow comparability across studies. Figure 12 shows the ROC curves for individual phone models compared to mixing them. We found that our results are largely consistent throughout the length of the ROC curve. Figure 12: ROC curves for individual phone models compared to COMBINED models which use the same number of users but merge multiple phone models. The use of remote collection through the MTurk platform organically resulted in a relatively balanced dataset in terms of age, gender, handedness, and iPhone model. The gender distribution of all users was 47% female (229), 51% male (252), and 1% other (5). Only 14% (67) of the participants reported being left-handed, which is roughly comparable to 10% in the general population. The age distribution of participants is shown in Figure 13. Table 5 presents the 9 iPhone models we used for our experiments in order of their release dates. The per-user EER distribution of our baseline model is shown in Figure 15. We repeat our baseline model for each swipe direction and report the results in Table 6, together with the amount of data available for each swipe direction. Down and right swipes are underrepresented, as these interactions are performed rarely in our application, leading to much higher mean EERs of up to 19% and 16.2%, respectively. For the Neural Network classifier (Appendix G), the optimizer is Adam and the activation function is ReLU; as with the other classifiers, we chose these parameter settings based on preliminary experimentation.
References:
[1] NewsUSA: Copyright Free Content
[2] BeCAPTCHA: Behavioral bot detection using touchscreen and mobile sensors benchmarked on HuMIdb. Engineering Applications of Artificial Intelligence
[3] Analysis of interaction trace maps for active authentication on smart devices
[4] Dynamic Authentication of Smartphone Users Based on Touchscreen Gestures
[5] Information revealed from scrolling interactions on mobile devices
[6] Biometric Authentication Based on Touchscreen Swipe Patterns
[7] Biometric Authentication Based on Touchscreen Swipe Patterns
[8] SilentSense: Silent User Identification via Touch and Movement Behavioral Biometrics
[9] Evaluating the Influence of Targets and Hand Postures on Touch-Based Behavioural Biometrics
[10] Actions Speak Louder Than (Pass)words: Passive Authentication of Smartphone Users via Deep Temporal Features
[11] Evaluating Behavioral Biometrics for Continuous Authentication: Challenges and Metrics
[12] Evaluating behavioral biometrics for continuous authentication: Challenges and metrics
[13] Continuous mobile authentication using touchscreen gestures
[14] Continuous mobile authentication using touchscreen gestures
[15] TIPS: Context-Aware Implicit User Identification Using Touch Screen in Uncontrolled Environments
[16] Benchmarking Touchscreen Biometrics for Mobile Authentication
[17] Touchalytics: On the Applicability of Touchscreen Input as a Behavioral Biometric for Continuous Authentication
[18] Towards Application-Centric Implicit Authentication on Smartphones
[19] Continuous authentication of smartphone users by fusing typing, swiping, and phone movement patterns
[20] Unobservable Reauthentication for Smartphones
[21] Microsoft COCO: Common Objects in Context
[22] Active user authentication for smartphones: A challenge data set and benchmark results
[23] Active user authentication for smartphones: A challenge data set and benchmark results
[24] TouchWB: Touch behavioral user authentication based on web browsing on smartphones
[25] Touch Gestures Based Biometric Authentication Scheme for Touchscreen Mobile Phones
[26] Continuous Authentication on Mobile Devices Using Power Consumption, Touch Gestures and Physical Movement of Users
[27] Continuous Authentication on Mobile Devices Using Power Consumption, Touch Gestures and Physical Movement of Users
[28] Designing Touch-Based Hybrid Authentication Method for Smartphones
[29] BrainRun: A Behavioral Biometrics Dataset towards Continuous Implicit Authentication. Data
[30] Continuous authentication with a focus on explainability
[31] LatentGesture: Active User Authentication through Background Touch Analysis
[32] Which verifiers work?: A benchmark evaluation of touch-based authentication algorithms
[33] Performance Analysis of Touch-Interaction Behavior for Active Smartphone Authentication
[34] Robust Performance Metrics for Authentication Systems
[35] Touch gesture-based authentication on mobile devices: The effects of user posture, device size, configuration, and inter-session variability
[36] Touch to Authenticate - Continuous Biometric Authentication on Mobile Devices
[37] Towards Continuous and Passive Authentication across Mobile Devices: An Empirical Study
[38] Long-Term Influence of User Identification Based on Touch Operation on Smart Phone
[39] ICAuth: Implicit and Continuous Authentication When the Screen Is Awake
[40] Towards Continuous and Passive Authentication via Touch Biometrics: An Experimental Study on Smartphones
[41] BehaveSense: Continuous authentication for security-sensitive mobile apps using behavioral biometrics
[42] Passive user identification using sequential analysis of proximity information in touchscreen usage patterns
[43] Touch Gesture-Based Active User Authentication Using Dictionaries
[44] Mobile User Authentication Using Statistical Touch Dynamics Images
[45] You Are How You Touch: User Verification on Smartphones via Tapping Behaviors