BeeHIVE: Behavioral Biometric System based on Object Interactions in Smart Environments
Klaudia Krawiecka, Simon Birnbach, Simon Eberz, Ivan Martinovic
2022-02-08

The lack of standard input interfaces in Internet of Things (IoT) ecosystems presents a challenge in securing such infrastructures. To tackle this challenge, we introduce a novel behavioral biometric system based on naturally occurring interactions with objects in smart environments. This biometric leverages existing sensors to authenticate users without requiring any hardware modifications of existing smart home devices. The system is designed to reduce the need for phone-based authentication mechanisms, on which smart home systems currently rely. It requires the user to approve transactions on their phone only when the user cannot be authenticated with high confidence through their interactions with the smart environment. We conduct a real-world experiment that involves 13 participants in a company environment, using this experiment to also study mimicry attacks on our proposed system. We show that this system can provide seamless and unobtrusive authentication while remaining highly resistant to zero-effort, video, and in-person observation-based mimicry attacks. Even when at most 1% of the strongest type of mimicry attack is successful, our system does not require the user to take out their phone to approve legitimate transactions in more than 80% of cases for a single interaction. This increases to 92% of transactions when interactions with more objects are considered.

Projections indicate that by the end of 2021 smart environments will account for over 35% of all households in North America and over 20% in Europe [26]. The growing number of smart devices incorporated into such environments leads to a wider presence of a variety of sensors. These sensors can be leveraged to improve the security of smart environments by providing essential input about user activities. In many environments, control over specific devices or financial transactions should only be available to an authorized group of users. For example, smart windows in a child's bedroom should not open when the parent is not present, and the child should not be able to order hundreds of their favorite candy bars using a smart refrigerator. Similarly, not all office workers should have access to a smart printer's history, nor should the visitors in a guesthouse be able to change credentials on smart devices that do not belong to them.

Figure 1: An overview of the BeeHIVE system. As the user interacts with the printer, sensors embedded in smart objects surrounding the user and the printer record these interactions. Physical signals generated from the user's movements are picked up by sensors such as accelerometers, pressure sensors, and microphones, and are used to profile them. The system authenticates the user before allowing them to perform certain actions, such as payments.

But while there is a need for authentication, smart devices offer limited interfaces for implementing security measures. This can be mitigated by requiring that the user initiate or approve every transaction through a privileged companion app running on the user's smartphone.
However, this can be very cumbersome, as the user needs to have their phone at hand, and it thus negates many of the advantages that smart environments offer in the first place.

On-device sensors such as microphones, passive infrared (PIR) sensors, and inertial measurement units (IMUs) have been extensively used to recognize different activities performed by users in the area of Human Activity Recognition (HAR) [15]. Prior work has focused on using one type of input data to authenticate users, such as voice, breath, or gait [10, 20, 25, 27]. In order to make attacks more difficult, several systems have been proposed that rely on diverse types of inputs [1, 9, 13, 17, 19, 23]. While these approaches are promising, they often do not utilize the full potential of co-located heterogeneous devices in smart environments.

In this paper, we propose the BeeHIVE system, which uses sensor data collected during day-to-day interactions with physical objects to implicitly authenticate users without requiring users to change smart home hardware or adapt their behavior. This system can be used to complement phone-based authentication mechanisms that require users to explicitly approve transactions through privileged apps. By using BeeHIVE in conjunction with a phone-based authentication mechanism as a fallback, smart environments can become more seamless and unobtrusive for users without sacrificing their security. We conducted a 13-person experiment in a company environment to explore the effectiveness of imitation attacks against our model. The proposed technique is assessed in three modes of operation that use (1) features from sensors placed on the object with which the user interacts, (2) features only from sensors on co-located objects, and (3) features from both on-device and co-located sensors. Overall, the outcome of our analysis shows that the system achieves desirable security properties, regardless of the number of smart office users or the environment configuration.

We make the following contributions in the paper:
• We propose a novel biometric based on interactions with physical objects in smart environments.
• We collect a separate 13-person dataset in a company setting to study video-based and in-person imitation attacks.
• We make all data and code needed to reproduce our results available online.

Existing biometric authentication systems that utilize data collected from mobile and smart devices are generally categorized into single-biometric or multi-biometric approaches [1, 9]. The systems from the first category collect inputs of a specific type (e.g., sounds, images, acceleration readings) and search for unique patterns. Multi-biometric systems, on the other hand, combine the data extracted from multiple sources to create unique signatures based on different sensor types. They provide more flexibility and lift a number of limitations posed by single-biometric systems, including the dependency on certain types of equipment and environmental conditions. Moreover, they are less prone to mimicry attacks due to the complexity of spoofing multiple modalities simultaneously [32]. The vast majority of existing commercial and non-commercial systems used in smart environment contexts [4, 7, 20, 25] primarily rely on voice recognition to authenticate users.
Since these systems are often vulnerable to voice spoofing and hijacking attacks [8, 11, 33, 34], research efforts have shifted towards hardening voice recognition systems by leveraging anti-spoofing mechanisms such as proximity detection or second factors [7, 20].

With the recent development of new types of Internet of Things and wearable devices, the possibility of using unconventional biometric traits has emerged. For instance, Chauhan et al. observed that microphones can be used to extract breathing acoustics when the user is present in a smart environment [10]. Similarly, the built-in accelerometers of mobile and IoT devices have been used to characterize gait or human body movements to facilitate authentication [5, 21, 22, 27]. These approaches were the first steps taken to explore the full potential of smart environments to turn contextual and behavioral data into biometric traits for seamless authentication.

To improve adaptability and reduce the inaccuracy of single-biometric systems, various multi-biometric systems have been proposed [1, 9, 13, 17, 19, 23]. One approach is to combine two biometric traits to fingerprint users [13, 19, 23]. For example, Olazabal et al. [23] proposed a biometric authentication system for smart environments that uses the feature-level fusion of voice and facial features. These solutions, however, still require users to actively participate in the authentication process (e.g., by shaking devices or repeating specific hand wave patterns) and rely on the presence of specific sensors in the smart environment. To address such limitations, the MUBAI system [1] employs multiple smart devices to extract various behavioral and contextual features based on well-known biometric traits such as facial features and voice recognition.

Interaction-based biometric systems have emerged from the observation that physical interactions with devices can uniquely identify users. Such systems have been widely discussed for mobile platforms [29]. Typically, on-device sensors are employed to measure touch dynamics or user gestures [18, 28, 29]. For example, users can be profiled based on how they pick up their phones or how they hold them [3]. Similar techniques have been used in smart environments; however, most of the existing solutions not only require the user to actively participate in the authentication process but also rely on a specific setup. Our goal is to introduce a biometric system that continuously and seamlessly authenticates users while they interact with the devices around them, without restrictions on sensor placement.

2.3.1 SenseTribute. Closest to our work is SenseTribute [14], which performs occupant identification by extracting signals from physical interactions using two on-device sensors: accelerometers and gyroscopes. Its main objective is to attribute physical activities to specific users. To cluster such activities, SenseTribute uses supervised and unsupervised learning techniques, and segments and ensembles multiple activities. There is a palpable risk in real-world smart environments that users will attempt to execute actions that they are not authorized for. This requires means not just for identification, but also for authentication. Therefore, in contrast to SenseTribute, which focuses on user identification, the main objective of our system is user authentication, for which we conduct a more extensive experiment evaluating various types of active attacks.
In office and home environments, it is easy for anyone to observe interactions made by authorized users, and it is natural that, for example, children may seek to imitate their parents. Going beyond previous work, we therefore evaluate the robustness of our system against mimicry attacks based on real-time observation or video recordings. Furthermore, SenseTribute expects all objects to be equipped with sensors. However, this is not always a realistic assumption, as sensors are often deployed only near (but not on) interaction points. Thus, we propose a system that uses nearby sensors present in co-located IoT devices to authenticate user interactions.

The heterogeneous nature of smart devices makes it possible to sample different types of user interactions. The main purpose of this work is to show that such interactions with various objects in smart environments are distinctive and can be used to profile users. The expansion of smart devices, and hence smart environments, will soon make such methods necessary to quickly authorize certain activities, including payments or the management of smart devices. Figure 1 shows an overview of the system design. The proposed BeeHIVE system is meant to complement the existing app-based authentication mechanisms used to secure current smart home platforms. Our system authenticates the user through their interactions with the smart environment and only requires the user to approve transactions through the app as a fallback if it cannot authenticate the user with confidence itself. In this way, BeeHIVE can be used to reduce the reliance on these app-based authentication mechanisms without compromising the security of the smart home platform.

In order to inform the system design and evaluation methodology, we define the following design goals:

Unobtrusiveness. The system should not require users to perform explicit physical actions for the purpose of authentication, nor require them to modify their usual behavior.

Low false accept rate. As the system is designed to be used alongside app-based authentication, it should prioritize low false accept rates to avoid significantly weakening the security of the overall smart environment system.

Low friction. The system should provide a seamless experience to the user wherever possible. This means that false reject rates should be kept low to reduce the need to fall back on the usual app-based authentication of the underlying smart environment platform. However, this should not come at the cost of higher false accept rates.

No restrictions on sensor placement. The system should use data from existing sensors without making restrictions on their placement or orientation. This ensures that the system can be applied to existing deployments purely through software. In addition, the system should not require sensors on each object but instead use sensors on other nearby devices.

Robustness to imitation attacks. Due to the ease of observation in home environments, the system's error rates should not increase significantly even when subjected to imitation attacks.

In this work, we consider smart environments where objects such as fridges or cupboards are augmented by smart devices that monitor their state and provide access to enhanced functionality. People naturally interact with many of these smart objects during their daily activities. Each activity consists of a set of intermediate tasks. For instance, to prepare a meal, a user has to walk to the fridge and open it to collect ingredients.
The user then has to walk to the cupboard to pick up the plates. Behavioral data of these tasks are measured with the different types of sensors with which smart devices are frequently equipped. As some objects might not have any suitable sensors attached to them, we also consider nearby sensors to profile object interactions. This is particularly relevant for physical objects without smart capabilities (e.g., cupboards or drawers). In order to illustrate these different possible deployment settings, we consider three system configurations:
• On-object, where sensors are mounted directly on the object
• Off-object, where only co-located sensor data are considered
• Combined, which uses sensor data both from the device on the object and from co-located devices

We use sequences of interactions to increase confidence in system decisions. This way, the user can be better authenticated if they perform several tasks in succession. As a simplification, we focus on authenticating one user at a time and do not consider multiple users interacting with objects simultaneously. It is important to note that in our system a failed authentication does not mean that the user is barred from making transactions. Instead, they simply do not benefit from the seamless authentication provided by our system and are required to use their phone to approve the requested transaction.

The need to authenticate users arises because physical access to smart objects does not imply authorization to use them. We consider scenarios in which children or visitors may abuse the trust of their parents or hosts to initiate sensitive operations through smart devices that are unwanted by the owners of said devices. These operations can include making payments to other people, ordering goods online, changing the configuration of smart devices, or accessing sensitive information stored on these devices. For example, a child might want to exploit the restocking mechanism of the fridge to order their favorite sweets, while a visitor might unintentionally or intentionally access the viewing history of the smart TV and learn intimate details about their hosts. Other possible settings include offices where smart devices are accessible to staff and visitors alike. Often these smart devices allow users to complete administrative tasks through them, such as reordering supplies or accessing the print job history of smart printers. But access to this functionality should be restricted to authorized personnel. In these cases, an implicit authentication of the person executing these tasks, as done by our system, can avoid cumbersome external authentication methods.

An adversary's (A) main objective is to convince the smart environment that they are a legitimate user (U). Such a misclassification can result in permitting A to execute on-device financial transactions or any other type of sensitive operation on behalf of U. We assume that A has physical access to the environment, but is otherwise an unprivileged user such as a child or a visitor. Moreover, A cannot tamper with the smart devices by, for example, connecting to the debug port to flash the device firmware. We also assume that the smart devices and the user's smartphone are not compromised; thus, they can be considered a reliable data source. Based on these assumptions, we also exclude the possibility of the attacker interfering with the training phase, which could otherwise result in the generation of incorrect biometric signatures of authorized users.
In order to achieve their goal, A may attempt to mimic the behavior of U to generate a matching biometric fingerprint. Successful mimicry attacks on various biometric systems have been demonstrated previously [16]. In our scenarios, we consider three types of such attacks: (1) zero-effort attacks, in which the attacker interacts with the environment naturally without attempting to change their behavior; (2) in-person attacks, in which A can observe legitimate users interacting with the IoT devices in person; and (3) video-based attacks, in which A possesses a video recording of the user interacting with the IoT devices in a smart environment. While in-person attacks offer the possibility to inspect U's interactions more closely and potentially capture more details, recordings provide additional time to learn U's behavior.

In order to evaluate the feasibility of authenticating users seamlessly based on their interactions with smart devices, we conducted an experiment in a smart office environment with thirteen participants. This experiment is further used to study attackers who attempt to copy the behavior of the legitimate user to execute mimicry attacks. For our experiment, we collected data from a wide range of typical smart home interactions using sensors similar to those already present in most smart environments. Since raw sensor data in smart devices are typically inaccessible to developers, we deploy Raspberry Pis equipped with the same types of sensors to simulate such an environment and study object interactions.

We use a total of ten Raspberry Pis equipped with magnetic contact switches, USB microphones (recording sound pressure levels), and ICM20948 IMUs (providing an accelerometer, a gyroscope, and a magnetometer) to collect the data for the experiments. The Raspberry Pis are fitted to typical home appliances (e.g., fridge or coffee machine) and kitchen furniture (e.g., drawers or cupboards). The magnetic contact switches are used in place of a typical type of smart office device (i.e., a door/window contact sensor), and they provide the ground truth for the occurrence of interactions with smart objects (e.g., the opening of a kitchen cupboard augmented with a contact sensor). The IMUs measure the motion sensor data from the interaction (i.e., acceleration, gyroscopic motion, and orientation) and are polled through the I2C interface of the Raspberry Pis. The inputs from the USB microphones are only used to calculate sound pressure levels; no actual audio data is stored. See Figure 2 for an example deployment of one of our measurement devices. The Raspberry Pis are connected to a smartphone running a wireless hotspot. The data is securely streamed to a remote server and is additionally stored locally on the devices. A mobile app running on the smartphone is used for labeling and timestamping each run of the experiment and provides time synchronization.
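To illustrate this microphone pipeline, the following is a minimal sketch of reducing captured audio blocks to sound pressure levels, expressed here in dB relative to digital full scale. The block size and function names are our assumptions for illustration, not the deployed code:

```python
import numpy as np

def spl_dbfs(block: np.ndarray, eps: float = 1e-12) -> float:
    """Reduce one block of normalized audio samples (range [-1, 1]) to a
    single sound pressure level in dB relative to full scale. Only this
    scalar is kept; the raw audio samples are discarded."""
    rms = np.sqrt(np.mean(np.square(block)))
    return 20.0 * np.log10(max(rms, eps))

# Example: 100 ms blocks at a 16 kHz sampling rate yield ten SPL values
# per second, forming the time series later used for feature extraction.
rng = np.random.default_rng(0)
stream = rng.uniform(-0.1, 0.1, size=16000)  # one second of synthetic audio
levels = [spl_dbfs(stream[i:i + 1600]) for i in range(0, 16000, 1600)]
print(levels[:3])
```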
We rented an office space and invited 13 employees of the same company. We conducted this experiment in adherence to local Covid-19 restrictions, and social distancing was observed at all times. All participants in our study were compensated for their time and effort. This project has been reviewed by and received clearance from the responsible research ethics committee at our university, reference number CS_C1A_20_014-1. Following Covid-19 regulations, the experiment was conducted remotely in the office kitchen of a hotel company. In this experiment, we use 8 devices, and the device setup is completed by the participants. They are given a set of our Raspberry Pi sensor boards and have to set up the devices themselves, according to the provided step-by-step user manual. An overview of the deployment and the room layout are shown in Figure 3.

In this experiment, we consider the following object interactions: 4 cupboards, 1 mini oven, 1 pull-out drawer, 1 microwave, and 1 coffee machine. Apart from the coffee machine, all of these interactions involve the opening and closing of the doors of the interaction point. To get the ground truth for the coffee machine interaction, the user first opens a lid on top of the coffee machine which is outfitted with a magnetic contact switch. The user then proceeds to press buttons on the coffee machine, before ending the interaction by closing the lid on top of the machine again.

Each of the participants performs 20 runs of interactions. Then, one of the participants is randomly chosen as the legitimate user and victim of the attack. The rest of the participants are split into two groups of six attackers who can observe the user's interactions with the smart environment. The first group can only observe the victim in person, whereas the second group has access to video recordings of previous object interactions, which they can study in their own time. The participants from both groups of attackers then execute the same interactions as the victim, carefully trying to mimic the victim's behavior. The attackers from the first group (i.e., the in-person group) have to perform this attack on the same day the observation took place. The participants in the second attack group can watch video recordings of the victim from different angles overnight and only have to execute the attack on the following day.

In this paper, we formally define a task $T$ as a physical interaction initiated by the user $U$ with an object $o$. Each task can be modeled as a time series $S = \{s_1, s_2, \dots, s_n\}$, which is constructed from the data collected by on-device sensors, including microphones, accelerometers, gyroscopes, and magnetometers. The variable $s_t$ represents a physical signal generated by the user while they interact with the smart object at time $t$, in the form of a vector of sensor values. Depending on the combination of sensors on the devices, they can collect diverse inputs. For example, a smart refrigerator equipped with inertial measurement units (IMUs) can collect acceleration values as vectors $\langle x, y, z \rangle$ when the user opens or closes its door. A smart coffee machine, however, may be equipped with both an IMU and a microphone, which results in more collected input data.

Figure 4 presents the system overview and explains its processing pipeline. Base-learners are weak classifiers that are combined to form an ensemble to facilitate the decision-making process. When the user performs a sequence of tasks $T_1$ to $T_n$ on several smart objects, the system extracts the features for these tasks from on-object sensors as well as sensors in proximity. Next, the features become an input to the base-learners corresponding to those tasks, resulting in predictions $P_1$ to $P_n$. These predictions can either indicate a probability that an observed sample belongs to a certain class or a concrete label from the set of labels $L = \{l_1, l_2, \dots, l_m\}$, depending on the framework configuration. Finally, the meta-learner gathers the predictions made by all the base-learners and decides on the final prediction in the second-level prediction layer. This way, a smart environment can benefit from the heterogeneous character of smart devices and their built-in sensors by performing decision-level fusion to improve the classification accuracy.
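A minimal sketch of this two-level decision flow in plain Python; the container and function names (`base_learners`, `meta_learner`, `majority_vote`) are ours, and Figure 4 in the paper remains the authoritative description:

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical types: one feature vector per task, one trained base-learner
# per smart object, and a meta-learner that fuses first-level predictions.
FeatureVector = Sequence[float]
BaseLearner = Callable[[FeatureVector], str]   # returns a predicted user label
MetaLearner = Callable[[List[str]], str]       # fuses base-learner outputs

def authenticate(tasks: Dict[str, FeatureVector],
                 base_learners: Dict[str, BaseLearner],
                 meta_learner: MetaLearner) -> str:
    """First level: each object-specific base-learner predicts a label for
    its own task. Second level: the meta-learner fuses these predictions."""
    first_level = [base_learners[obj](features)
                   for obj, features in tasks.items()
                   if obj in base_learners]
    return meta_learner(first_level)

# Example meta-learner: a majority vote over the base-learner labels.
def majority_vote(predictions: List[str]) -> str:
    return max(set(predictions), key=predictions.count)
```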
The multitude of U's interactions in a smart environment translates into physical signals that are received by the sensors of smart devices. Figure 5 presents the sensor readings when user $U_1$ interacts with the narrow cabinet during the experiment. While (a) shows the signal captured by the gyroscope sensor on the cabinet, (b) reveals what has been registered by a co-located sensor. Co-located sensors are all sensors in proximity to an object that can capture physical signals originating from interactions with this object. The microphone on the wide cupboard recorded two events: the opening and the closing of the cabinet door. These movements are part of the task $T$ performed on smart object $o$. The start and end of $T$ are time-stamped by the contact sensors and denoted as $t_0$ and $t_1$, respectively (marked with red dotted lines in Figure 5). The signals from $o$ are segmented between $t_0 - 1$ and $t_1 + 1$ before proceeding to the feature extraction phase.

The time-series signals are converted to values characteristic of the $h$ sensor types. As a result, for each $o$, there exists a set of corresponding matrices $M_1, M_2, \dots, M_h$ that contain vectors of length $n$ of different sensor values between $t_0$ and $t_1$. Thus, with $i \in \{1, 2, \dots, h\}$ and the sensor components $c_j$ where $j \in \{1, 2, \dots, k\}$, we refer to a single matrix as $M_i$ with columns $c_1, c_2, \dots, c_k$. The number of columns of $M_i$ is determined by the sensor's components (e.g., the three axes of a gyroscope sensor $s_1$ give $k_1 = 3$). For a smart object with two corresponding sensors, two such matrices will be generated. Each entry of a column is a component-specific input value generated by a physical signal received by sensor $s_i$. These matrices are then passed as input to the feature extraction function.

For each physical interaction with an object $o$, the system extracts $h$ matrices with $k = \sum_{i=1}^{h} k_i$ columns of time-series data segments of the particular sensor components of this object and co-located objects. Based on these columns, the features are extracted. More formally, for each smart object the system extracts a set of features $F$, where each element of $F$ represents a set of feature values extracted from a component (e.g., the $x$-axis of an accelerometer) in $M_i$ of length $n$, and $r$ denotes the total number of different functions that extract features. The features retrieved from physical signals are listed in Table 1. The statistical functions are computed from each column of $M_i$, which contains the sensor values extracted between $t_0 - 1$ and $t_1 + 1$. We add windows of one second to account for signals that originate from the starting and ending movements. These features are categorized into two groups: time-domain features and frequency-domain features. The majority of the extracted features originate from the time domain, because such features are typically well-suited for systems that process large volumes of data due to their low computational complexity. These features help to analyze the biomechanical effect of a given interaction on physical signals and identify characteristics of movements [24]. For microphone data, we extract sound pressure levels (SPLs) instead of actual audio recordings; the statistical functions are thus applied to SPL values.
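As an illustration, the following is a minimal sketch of this segmentation and of per-column time-domain feature extraction (NumPy/SciPy). The full feature list is given in the paper's Table 1; the statistical functions below are a representative subset of common time-domain features, not the paper's exact set:

```python
import numpy as np
from scipy import stats

def segment(signal: np.ndarray, timestamps: np.ndarray,
            t0: float, t1: float, margin: float = 1.0) -> np.ndarray:
    """Cut one sensor stream to the window [t0 - margin, t1 + margin],
    using the contact-switch timestamps t0/t1 as ground truth."""
    mask = (timestamps >= t0 - margin) & (timestamps <= t1 + margin)
    return signal[mask]

def time_domain_features(column: np.ndarray) -> dict:
    """Statistical features computed per sensor component (matrix column)."""
    return {
        "mean": float(np.mean(column)),
        "std": float(np.std(column)),
        "min": float(np.min(column)),
        "max": float(np.max(column)),
        "rms": float(np.sqrt(np.mean(column ** 2))),
        "skew": float(stats.skew(column)),
        "kurtosis": float(stats.kurtosis(column)),
    }

def extract_features(matrix: np.ndarray) -> dict:
    """Apply every feature function to every column c_1..c_k of a matrix M_i."""
    feats = {}
    for j in range(matrix.shape[1]):
        for name, value in time_domain_features(matrix[:, j]).items():
            feats[f"c{j}_{name}"] = value
    return feats
```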
For each smart object, the system selects a subset of the extracted features using a filtering method. This method focuses on verifying whether features are relevant by analyzing their association with the target variable. The univariate feature selection method used in this work relies on statistical tests to investigate the relationship between variables. The features selected from the various sensors are aggregated and become the input to an object-specific base-classifier. This is a necessary step in our framework to improve the model performance as well as to reduce the computational complexity.

Mutual information (MI) is used to examine the distinctiveness of a set of features and to test the null hypothesis ($H_0$) that negates the existence of a relationship between a feature and an associated target variable. This method can capture statistical dependencies between variables, explaining whether one variable can provide relevant information about the other [6]. By accepting $H_0$, we assume that the extracted feature is not relevant, indicating that it is independent of the target variable. On the other hand, rejecting $H_0$ suggests that the variables can be dependent, so the feature should be considered relevant. In practice, the null hypothesis is rejected or accepted after examining the resulting non-negative MI scores. The higher the score, the more significant the feature may be; a score of zero indicates independence between the variables. As there can be many relevant features, the system is configured to choose only the top 20 such features, ranked by the highest scores, for each object.
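A minimal sketch of this univariate, mutual-information-based filter using scikit-learn; the per-object top-20 cut-off is taken from the text, while the data shapes and variable names are illustrative:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# X: (samples x features) matrix of aggregated features for one smart object;
# y: user labels serving as the target variable (illustrative random data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 60))
y = rng.integers(0, 2, size=200)

# Rank features by their estimated mutual information with the target and
# keep the 20 highest-scoring ones; a score of zero indicates independence.
selector = SelectKBest(score_func=mutual_info_classif, k=20)
X_selected = selector.fit_transform(X, y)
print(selector.scores_[:5])  # non-negative MI estimates per feature
```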
As stated in Section 3.2, we explore and evaluate the potential of heterogeneous smart environments to authenticate users while they carry out their daily activities. Every node in a smart environment extracts different sets of characteristics from user interactions due to its placement, purpose, and composition of built-in sensors. Various fusion approaches exist that can boost the detection accuracy and system effectiveness in multi-sensor environments [2]. Among these fusion techniques, we focus on decision-level methods, which allow the introduction of multiple classifiers, the base-learners, that independently undertake a classification task. This gives a certain degree of autonomy to the individual base-learners trained on specific smart object interactions. Moreover, such an approach allows us to select the most effective feature sets, classifiers, and hyper-parameters for each of the base-learners. As shown in Figure 4, after each first-level base-classifier makes a prediction, the second-level meta-classifier determines the final outcome. The efficiency and effectiveness of various fusion techniques at the decision level have been extensively studied in the area of Human Activity Recognition (HAR) [2]. While our focus is on user authentication, we hypothesize that similar approaches can be just as effective in our case. As such, we compare two ensemble learning techniques that use fundamentally different classification methods but show promise for good performance in our multi-user smart environment scenarios.

A meta-learner is trained using the labels obtained from the first-layer base-learners as its features [31]. We chose stacking as the method of linking heterogeneous classifiers, since it typically achieves high accuracy and introduces less variance than other approaches. Combining a multitude of smart objects and their classifiers can be helpful because some of these interactions can classify certain users better than others. The optimal parameters of the meta-classifier are determined during the training phase by using cross-validation on the training dataset to avoid overfitting. Based on the combinations of predictions that the meta-learner receives from the base-classifiers, it computes the most accurate label. Stacking allows combining various classifiers (e.g., k-Nearest Neighbours, Random Forests, Decision Trees) using different sets of features for each. In our scenario, the biggest advantage of this approach is that the meta-classifier learns which object interactions predict labels more accurately. For instance, a meta-classifier may learn during the training phase that tasks performed on $o_1$ or $o_3$ are more effective in recognizing U than those performed on $o_2$, and it will account for this in future predictions. The varying classification effectiveness of the individual object interactions and their sensors was a major factor in deciding to include ensemble learning methods in our system.

Voting is another ensemble learning method discussed in this paper. In comparison to stacking, this technique does not require a separate machine learning model to make the final prediction. Instead, it uses the deterministic hard-voting algorithm to compute the result. In our scenario, voting simply means that the most frequently reported class label from the set of predictions is selected as the outcome. For example, assume that there is a set of predictions $P = \{p_1, p_2, p_3\}$ computed for three smart object interactions. For our first object, a smart fridge, the system computed $p_1$, which resulted in label $l_1$ belonging to $L$. The smart microwave did not recognize U, but for the coffee machine the system calculated $p_3$, which again pointed to $l_1$. Thus, the final prediction would be $l_1$. We decided to assign uniform weights to all base-classifiers and to test how different combinations of such classifiers affect the performance of the ensemble system.

In this paper, we discuss a supervised learning task that focuses on binary classification to evaluate the authentication performance of our system. We extract features for each base classifier in three different system configurations, namely On-object, Off-object, and Combined. These base classifiers create the first-level predictions $P_1$ to $P_n$, on whose basis the meta-classifier generates the second-level prediction. Since both Support Vector Machine (SVM) and Random Forest (RF) models tend to outperform other classifiers in HAR tasks [12], we chose them as the base classifiers in our task. For each object interaction classifier, a grid search is performed to find the optimal set of hyper-parameters; more details can be found in Tables 2 and 3. For all SVM-based base-learners, the selected features are first standardized. Models are trained and tested using 10-fold cross-validation to avoid information leakage. The resulting classification accuracy is averaged over the different folds and used to select the best models. For stacking, a separate classifier trained on the base-learners' predictions forms the second layer.
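A minimal sketch of the two ensembles under the stated setup (SVM and RF base-learners, standardization for the SVM, grid search, 10-fold cross-validation, hard voting with uniform weights). The hyper-parameter grids are illustrative rather than those of Tables 2 and 3, and the logistic-regression final estimator for stacking is our assumption, not the paper's stated choice:

```python
import numpy as np
from sklearn.ensemble import (RandomForestClassifier, StackingClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))    # selected features for one object
y = rng.integers(0, 2, size=200)  # 1 = legitimate user, 0 = other users

# Per-object base-learners: a standardized SVM and a Random Forest,
# each tuned by an (illustrative) grid search.
svm = GridSearchCV(make_pipeline(StandardScaler(), SVC(probability=True)),
                   {"svc__C": [0.1, 1, 10]}, cv=10)
rf = GridSearchCV(RandomForestClassifier(random_state=0),
                  {"n_estimators": [50, 100]}, cv=10)
base = [("svm", svm), ("rf", rf)]

# Hard voting with uniform weights: the most frequent label wins.
voting = VotingClassifier(estimators=base, voting="hard")

# Stacking: a second-level meta-classifier trained on base predictions.
stacking = StackingClassifier(estimators=base,
                              final_estimator=LogisticRegression(), cv=10)

for name, model in [("voting", voting), ("stacking", stacking)]:
    acc = cross_val_score(model, X, y, cv=10).mean()
    print(name, round(acc, 3))
```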
In order to judge the distinctiveness of the features provided by different types of sensors, we use relative mutual information (RMI), defined as

$$\mathrm{RMI}(U, F) = \frac{H(U) - H(U \mid F)}{H(U)}.$$

Here, $U$ denotes the ground truth of the user performing the object interaction, whereas $F$ is the vector of extracted features. Tables 4 and 5 show the RMI for the individual sensors that were placed on household objects as part of our experiment. These scores represent the aggregated maximum values of RMI for a particular sensor on a specific household object, given the different configurations of the system. Each of these objects introduces a different way for a user to interact with the smart environment. Analyzing the distinctiveness of the features extracted from these sensors allows us to understand which ones contribute to better classification performance for a specific type of interaction. Each device has been equipped with an accelerometer (ACC), a magnetometer (MAG), a gyroscope (GYRO), and a microphone (MIC).

Generally, we observe that the features extracted from GYRO and ACC exhibit high distinctiveness for most of the interaction types. For On-object, the most distinctive features originate from GYRO, whereas for Off-object, ACC appears to supply the most distinctive features. We observe that, in many cases, the inputs from co-located objects generate higher RMI scores. On the other hand, the features extracted from MIC appear to have relatively low distinctiveness in comparison to the other attributes for the majority of interactions. Despite its generally low distinctiveness, MIC achieves higher RMI values for interactions with the pull-out drawer and is the second most distinctive sensor for the coffee machine when we consider features extracted only from its on-device sensors. This can be explained by the drawer's contents making sounds continuously, changing based on how far the drawer is extended, whereas for most other events the main sounds were caused by the closing of doors, with little difference between users. Pressing the buttons of the coffee machine, on the other hand, makes faint sounds that differ between users with regard to the timing of the button presses.

GYRO shows particularly high distinctiveness for most interactions in the On-object configuration, with the exceptions of the narrow cabinet and the pull-out drawer. The cabinet used in the experiment has a very stiff door that leads to abrupt openings with little variation between users. While this reduces the effectiveness of recognizing users with sensors placed directly on the cabinet, such abrupt openings allow co-located sensors to capture stronger vibrations and hence provide a more accurate distinction. The lower RMI values of GYRO for the pull-out drawer can be explained by a lack of rotational movement. Instead, the most distinctive movement characteristics are the sounds and the acceleration, which is why MIC and ACC are the most distinctive sensor types for this interaction. ACC appears to provide the most distinctive features captured by co-located sensors. Interestingly, the vibration signals picked up by the co-located sensors exhibit the highest feature distinctiveness during interactions with the coffee maker. Overall, we notice that Off-object features provide better distinctiveness than the features gathered only by On-object sensors. This suggests that the system can accurately authenticate users by their interactions with objects that do not have sensors directly placed on them.
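A minimal sketch of a per-feature RMI estimate, using scikit-learn's MI estimator together with the identity $I(U;F) = H(U) - H(U \mid F)$; how the paper itself estimates $H(U \mid F)$ is not specified here, so this is one plausible realization with illustrative names:

```python
import numpy as np
from scipy.stats import entropy
from sklearn.feature_selection import mutual_info_classif

def relative_mutual_information(X: np.ndarray, users: np.ndarray) -> np.ndarray:
    """RMI = (H(U) - H(U|F)) / H(U) = I(U;F) / H(U), estimated per feature."""
    _, counts = np.unique(users, return_counts=True)
    h_u = entropy(counts)               # label entropy H(U), natural log
    mi = mutual_info_classif(X, users)  # I(U; F_j) per feature, in nats
    return np.clip(mi / h_u, 0.0, 1.0)

# Example: features of one sensor on one object; the sensor's reported
# score would then be the maximum RMI over its features.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 12))
users = rng.integers(0, 13, size=300)
print(relative_mutual_information(X, users).max())
```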
In our experiment, we focus on analyzing the system's performance against three types of attacks. The first part of the dataset contains the samples from the victim as well as zero-effort attack samples from each of the remaining 12 participants. This dataset is split using 10-fold cross-validation. Each test fold is used to evaluate a group of zero-effort attacks, since it contains the samples of the attackers' regular interactions with objects. The remaining attack samples are supplied to the classifier trained on the zero-effort attack data.

To compare and evaluate the effectiveness of the different types of attacks on the environment, we report False Reject Rates (FRRs) at different thresholds of the False Acceptance Rate (FAR). The FAR metric allows us to determine in how many attempts the attacker was successful. FRR, on the other hand, specifies how many legitimate samples from a victim have been misclassified as an attack. Note that, rather than completely preventing the user from executing a transaction, a false reject merely means that the user will have to approve the transaction explicitly through their phone.
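A minimal sketch of reading an FRR at a fixed FAR threshold off a classifier's ROC curve (scikit-learn; scores and names are illustrative):

```python
import numpy as np
from sklearn.metrics import roc_curve

def frr_at_far(y_true: np.ndarray, scores: np.ndarray,
               target_far: float) -> float:
    """FAR is the false positive rate on attacker samples; FRR = 1 - TPR,
    i.e., the fraction of victim samples pushed to phone-based approval."""
    far, tpr, _ = roc_curve(y_true, scores)  # y_true: 1 = victim, 0 = attacker
    frr = 1.0 - tpr
    # Interpolate the FRR at the requested operating point on the ROC curve.
    return float(np.interp(target_far, far, frr))

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)
scores = rng.normal(loc=y_true, scale=1.0)  # higher score = more victim-like
print(frr_at_far(y_true, scores, 0.01), frr_at_far(y_true, scores, 0.10))
```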
First, we examine the FRRs for the individual smart objects that the user interacts with. Next, we inspect the performance of ensembles of base-classifiers that are responsible for interpreting the different interactions with objects. Finally, we compare the performance of the voting and stacking meta-classifiers by examining the receiver operating characteristic (ROC) curve for an ensemble of all available object interactions. In Table 6, we present the FRRs at the 1% and 10% FAR thresholds, averaged across all objects, for the three types of attacks targeting a dedicated user. Figure 6 shows their averaged ROC curves. Table 7 presents the FRRs for individual smart objects with respect to zero-effort attacks without a dedicated victim, i.e., the results are averaged across all users being considered as the victim. For each attack, we calculate FRRs and FARs using the different system configurations, including On-object, Off-object, and Combined. For Off-object, only the top-performing features are selected; for each smart object, Table 8 shows which other objects these features were extracted from.

In the training phase, we only use samples collected during participants' regular interactions with the smart environment. This is because we consider an attacker who has access to the facilities, for example, a malicious co-worker whose typical interaction samples would be known to the system. A zero-effort attack, in which the attacker does not attempt to mimic the behavior of a legitimate user, indicates the baseline performance of the system. The other types of attacks involve attackers who either watched a video of the victim interacting with objects or observed the victim in person.

We observe that for authentication using Off-object sensors, we achieve an average false reject rate of less than 3% at a 1% false acceptance rate for zero-effort attacks. The FRRs increase to 19% for video-based attacks and to 15% for in-person observation-based attacks at the same false acceptance rate. This means that even when defending against strong video-based attacks, the system does not require the user to explicitly approve transactions in more than 80% of cases, as the system can instead authenticate the user through their interactions with the smart environment. At a FAR of 10%, the FRR for zero-effort attacks drops to less than 1%. Similarly, the FRRs for video-based and in-person attacks decrease to 18% and 13%, respectively. The On-object configuration exhibits the worst performance among all configuration types, resulting in false reject rates of 25% for zero-effort attacks, and 58% and 50% for video-based and in-person attacks, respectively. The Combined configuration guarantees better performance than On-object; however, it performs worse than Off-object due to the inclusion of features extracted from on-device sensors.

It is noteworthy that the microwave door and narrow cabinet classifiers perform significantly worse than the others, which impacts the average scores. Since this effect is universal across users, it suggests that poorly-performing objects should be excluded by the meta-classifier. Table 7 compares the performance of the On-object and Off-object configurations across all smart objects. The narrow cabinet and the microwave door exhibit the worst FRRs in the On-object configuration, with false reject rates of 26% and 30% at a 1% false acceptance rate for zero-effort attacks. These FRRs drop to 0.4% and 1% when the model includes features extracted from co-located sensors. Since the Off-object configuration exhibits the best performance, we focus on it for the remainder of this section.

The attackers from the video group could watch the video of the victim performing interactions with objects as often as desired for 24 hours. The attackers who observed the victim in person, on the other hand, could follow them closely and look at the exact body and hand movements. To understand this phenomenon, we asked the participants to describe their strategies. The participants from the video-based attack group watched the video three times on average before attempting to mimic the victim. When viewing the video, participants report that they paid attention to the strength with which the victim interacted with the objects, the use of the hands (left or right), the speed of the interaction, and the body position. The participants in the second group, on the other hand, focused mainly on the pace, strength, and rhythm of the interaction. All attackers focused their strategy on mimicking the power and speed with which the victim interacted with objects. Additionally, most of them attempted to spend a similar amount of time per interaction as the victim did. One of the attackers even counted the seconds spent on interactions with each object.

Considering multiple interactions with various objects can further improve the system's performance. Figure 7 shows the averaged FRRs at the two FAR thresholds of 10% and 1% for different ensembles of objects for the voting and stacking meta-classifiers in the Off-object configuration. We focus on the Off-object configuration here, as it exhibits the best performance of the three considered configurations and thus best demonstrates the potential performance gains that can be achieved. This could be further improved by adjusting the weights, i.e., assigning smaller weights to interactions that exhibit worse performance. Generally, allowing the system to consider more interactions before authenticating the user results in better performance. Overall, the voting method outperforms the stacking meta-classifier in our scenario. This method is also computationally less complex, since it does not involve training another classifier on the predictions of the base-classifiers. The voting meta-classifier achieves a false reject rate of less than 1% at an FAR of 1%, whereas the stacking classifier obtains an FRR of 2% for zero-effort attacks. For video-based attacks, the stacking classifier achieves an FRR of 32% when considering an ensemble of two unique objects at an FAR of 1%, whereas the voting classifier obtains an FRR of 8% at the same FAR threshold. This means that with the voting classifier, the system can spare the user an explicit phone-based authentication in 92% of cases.
We included only four smart objects in this analysis, but considering more unique smart objects results in further improvements of the system performance.

No concurrent device use. In our experiment, we limit interactions with any device to a single user at a time. This was necessary to obtain accurate identity labels to establish the distinctiveness of device interactions. This limitation may lead to two potential problems in practice. If two users are interacting with different devices in the same room simultaneously or in short sequence, decisions made using multiple device interactions may be wrong. This can be avoided by only using interactions with the target device (the device requiring authentication) to make the decision. In addition, simultaneous interactions may affect the sensor signatures and make it harder to match fingerprints for either of them. However, simultaneous interactions are easily detected and can either be accounted for or ignored entirely.

Limited number of users and interactions. Due to time considerations and the unique requirements of the ongoing Covid-19 pandemic, we could only capture device interactions in a single session. This limits our analysis for different levels of FAR and FRR, as the total number of samples and attacker/victim pairs is too low to allow a statistically robust analysis of extremely low FAR levels. Given the promising results shown by our current analysis, we plan to collect an additional large-scale dataset in the future.

Consecutive user sessions. In our experiment, the sessions for different users were conducted one after the other. In theory, it would be possible for environmental effects to be present during one user's session but not for others, thereby leading to classifiers learning these effects as a proxy for user identity. For example, a sound pressure sensor may pick up increased ambient noise during a user's session. However, the fairly strong increase in FAR caused by imitation attacks (video and in-person) suggests that the classifiers capture (somewhat imitable) true user behavior, as it is unlikely that users would attempt to match the original environmental conditions during their attack.

In this paper, we have introduced a system to authenticate users in smart environments based on naturally occurring interactions with the objects around them. Notably, our system does not require any sensors on the object itself but makes use of sensors placed arbitrarily in the room. We have conducted an experiment in real-world settings with a total of 13 participants, which shows that using these kinds of smart object interactions for authentication is feasible. This is a crucial finding because there is a need for stronger authorization controls in such environments, but many smart devices offer only limited interfaces to implement security features. Therefore, current systems often rely on cumbersome app-based authentication methods that require the user to always have their phone at hand. Our system can complement such phone-based authentication methods and reduce how often a user has to explicitly approve a transaction in the smart home companion app. We show that our system demonstrates good authentication performance against zero-effort attacks, with less than 1% of transactions requiring external approval at an FAR of 1% when considering a single object interaction.
When attackers attempt to imitate the victim's behavior after observing them in person or through video footage, the user has to approve more transactions explicitly to maintain a 1% FAR. However, the system can still authenticate more than 80% of transactions unobtrusively when considering video-based attackers, rising to 85% of transactions for in-person attacks. We also show that the system's confidence in the authentication decision can be significantly improved if more than one object interaction is considered. Including more interactions with objects can further increase the authentication success rate to 92%, even when considering the strongest attacker. These promising results and the potential for easy deployment make this behavioral biometric system a good candidate to improve the security of smart environments in a seamless and unobtrusive manner. We make our entire dataset and the code needed to reproduce our results available online to allow researchers to build on our work.

References
[1] MUBAI: Multiagent biometrics for ambient intelligence.
[2] Multi-sensor fusion for activity recognition: A survey.
[3] Hold & sign: A novel behavioral biometrics for smartphone user authentication.
[4] Voice authentication and command.
[5] Internet of things data analytics for user authentication and activity recognition.
[6] Feature selection via mutual information: New theoretical insights.
[7] Verifying voice commands via two microphone authentication.
[8] Hidden voice commands.
[9] Context aware ubiquitous biometrics in edge of military things.
[10] Breathing-based authentication on resource-constrained IoT devices using recurrent neural networks.
[11] Your voice assistant is mine: How to abuse speakers to steal information and control your phone.
[12] Multi-view stacking for activity recognition with sound and accelerometer data.
[13] Multimodal biometrics via discriminant correlation analysis on mobile devices.
[14] Smart home occupant identification via sensor fusion across on-object devices.
[15] Neural network ensembles for sensor-based human activity recognition within smart environments.
[16] Augmented reality-based mimicry attacks on behaviour-based smartphone authentication.
[17] Multimodal biometric authentication using teeth image and voice in mobile environment.
[18] Secure pick up: Implicit authentication when you start using the smartphone.
[19] Multimodal biometric authentication in IoT: Single camera case study.
[20] Securing consumer IoT in the smart home: Architecture, challenges, and countermeasures.
[21] Lightweight gait based authentication technique for IoT using subconscious level activities.
[22] You walk, we authenticate: Lightweight seamless authentication based on gait in wearable IoT systems.
[23] Multimodal biometrics for enhanced IoT security.
[24] Comparison of different sets of features for human activity recognition by wearable sensors.
[25] Voice biometrics: The promising future of authentication in the internet of things.
[26] Smart home technologies in Europe: A critical review of concepts, benefits, risks and policies.
[27] Accelerometer-based speed-adaptive gait authentication method for wearable IoT devices.
[28] TiltPass: Using device tilts as an authentication method.
[29] A survey on touch dynamics authentication in mobile devices.
[30] Stacked penalized logistic regression for selecting views in multi-view learning. Information Fusion.
[31] Stacked generalization.
[32] Mimicry attack on strategy-based behavioral biometric.
[33] DolphinAttack: Inaudible voice commands.
[34] Understanding and mitigating the security risks of voice-controlled third-party skills on Amazon Alexa and Google Home.