1 Introduction

Usability and User eXperience (UX) evaluation have become central to most user-centered design methodologies [4], helping practitioners understand and later improve the way users interact with a system [18]. One well-known technique is “user testing”, which involves evaluating a system with real users to see how easy it is to use and how well it meets their needs [3].

The versatility of the technique enables the use of different configurations based on the specific goals of the research [37]. Tests can be either moderated or unmoderated [19], and can be applied in a physical environment such as a usability laboratory [37] for in-person tests, or in a remote environment [34] using multiple testing tools [46, 51]. They can also follow a one-on-one procedure, where one moderator works with one user at a time, or a group procedure, where a moderator addresses multiple users simultaneously. To capture information across these configurations, multiple tools can be used, such as eye gaze tracker systems [11], which are the object of study of this paper.

Eye gaze trackers and their remote extensions, also known as “remote eye gaze trackers” (REGT), are devices capable of identifying a person’s gaze fixation. These devices are widely used in usability testing. Generally, they are accompanied by native software that generates various metrics, such as fixation time, areas of interest, saccade speed, and gaze duration. However, these tools are expensive and require highly controlled spaces, typically found in usability facilities. Consequently, it is practically impossible for regular users to perform remote tests with these tools.

The general objective of this study is to investigate, validate, and develop software that can be used in usability studies to capture a person’s eye gaze and save their fixation points for generating usability reports. The proposed system requires only a computer and a conventional webcam, creating an open-source and low-cost project that can be used by individuals or groups, both in person and remotely.

Finding open-source contributions to this topic is a challenge, as highlighted by Shehu et al. [47]. Although the authors introduced several organizations and software options, we could not find software that was still in active development, incorporated the Open API specification, and made use of web technologies that would allow easy integration. For this reason, we developed our own system, to be published under an open-source license. Moreover, we prepared the system to work both standalone and as a plugin for an open source organization called Remote User Experience Lab (RUXLAB)Footnote 1, with the objective of integrating it as a tool in their user testing system.

The development workflow to be implemented has to be capable of identifying the direction of the eye gaze based on the position of the iris and pupil in relation to the screen. Several machine learning methods have been proposed in the literature to accomplish this task, such as convolutional neural networks [25], recurrent neural networks [22], generative adversarial networks [28], and Siamese neural networks [2]. However, although these models have been used successfully in gaze estimation and related learning problems, they require a large amount of training data and are computationally expensive. For an open-source and low-cost system like the one being proposed, it is important to use a predictive algorithm that is simple and fast. Therefore, this study used a regression algorithm to predict the direction of the eye. The performance of this method, or any other employed in eye tracking, may be impacted by the calibration system. Therefore, we also investigated three different calibration techniques to analyze their effect on the system’s performance.

The calibration system is used to obtain personalized information from each user, according to the position of their head and the level of brightness at the time the calibration process is performed. This information feeds the model to improve its performance, giving better insight into the user’s environment and features. We also developed a graphical interface as a tool for users to easily apply the model when analyzing and conducting their studies on their desired web pages.

The remainder of this study is organized as follows. Section 2 presents the context about user testing and usability alongside with the eye tracker systems. In Sect. 3, we explore existing research in the field and examine various eye tracker systems that were tagged as open source. Our proposed prototype is described in Sect. 4, where we segmented it into the gaze tracker system, the web implementation, and the API architecture. Finally, the conclusions and future work are presented in Sect. 5.

2 Context

The following section describes the topics related to conducting a user testing evaluation, focusing on eye tracker tools. We describe the methods and necessary steps to conduct such evaluations, as well as the main features that these systems must contain.

2.1 User Testing and Usability

The term usability was created to define the ease with which people interact with a tool or interface while performing a task [24]. It was defined by ISO/CD 9241-11 as “the ability of software to be understood, learned, used under specific conditions”Footnote 2. To evaluate usability, different techniques can be applied [36], which can be grouped into three categories: inspection methods, inquiry methods, and test evaluations. User testing belongs to the test evaluations category, and it can be performed during product development to assess the effectiveness, speed, user error tolerance, ease of learning, and user engagement of the product. The goal is to ensure that users experience satisfaction instead of frustration when using the solution [3].

User testing can be divided into three different steps or sections in which users have to complete different tasks based on information provided by the team of observers or evaluators. These sections are the pre-test, test phase, and post-test. In the pre-test section, participants are typically presented with a consent form and a pre-test form to gather insights about the user who will participate in the study, such as demographic information. The test phase involves different tasks that the user must accomplish, which can be highly customized by the evaluators to gather as much qualitative or quantitative data as possible. Each task can have additional forms attached, known as pre-task or post-task forms, which are typically completed before or after the task is finished. Examples of these forms are the Single Ease Question (SEQ) [45] and the NASA Task Load Index (NASA TLX) [17]. Finally, in the post-test section, a post-test form such as the System Usability Scale (SUS) [7] can be shared.

Recruiting participants is another key challenge of user testing. It involves recruiting representative members of the target user group for the product or service being tested. When a test is performed in physical facilities, the sample of the target group may not be heterogeneous enough, as highlighted by several authors [10, 34, 51]. After the participants are selected, a set of tasks is presented to them to test the product or service, while the evaluator observes and takes notes.

When the tasks of a user test are created, the evaluation team must decide what type of data needs to be collected. These decisions can influence the nature of the results, for example, whether they are qualitative or quantitative. They can also determine whether it is possible to measure or calculate the effectiveness, efficiency, or user satisfaction of the system [15]. To achieve this, a variety of configurations can be used, such as recording tools for capturing time, voice, screen activities, and video [10]. In the last case, we can distinguish between face recognition, which enables the application of Artificial Intelligence (AI) for emotion recognition [33], and gaze tracking, which allows for the implementation of eye-tracking procedures combined with heat map technologies [21].

2.2 Eye Tracker

The detection and recognition of human faces are extensively studied tasks in the area of AI, and their applications can be expanded in several use cases, such as security, entertainment, marketing, identification of people in social networks, forensic investigations, participant control in events, and many more [5, 9, 11, 23, 33].

A face has very rich features that are essential for social interaction. While humans can effortlessly detect and interpret these characteristics, the task is not as simple for a computer [57]. First, the computer must learn which features are related to a human face and differentiate them from other objects with similar visual characteristics [23]. In most face detection algorithms, the first features to be identified are the eyes [57], as they are usually the simplest to detect. Other parts follow, such as the mouth, nose, eyebrows, and contours, to increase confidence that the detected area is a face [31]. For better chances of success, and to address a broad spectrum of human physiology and morphology [23], these models are usually trained with very large datasets of positive and negative images [47], since different factors can affect face detection, such as pose, the presence or absence of structures (glasses and beards, for example), facial expression, occlusion, orientation, and image quality [31].

Eye tracking systems aim to discover where the eyes are focused or study how they move when interacting with a graphical interface or environment [5]. This technology enables people with physical or cognitive disabilities [9] to perform activities using only their eyes and opens up a wide range of studies related to the usability of interfaces [44], where eyes can be analyzed to discover how users interact with websites, desktop, and mobile applications [35].

Eye tracking systems can be divided into multiple categories depending on the technology they are based on, their final application, or their method of operation [47]. Some are more intrusive than others, and in this regard they can be divided into two categories: (i) head-mounted and (ii) remote-based eye trackers. The first category consists of wearable devices that track eye movements through a system mounted on a headset, such as glasses. While these systems are more intrusive than the remote option, they offer the user more freedom of movement. Remote-based trackers, on the other hand, are non-intrusive, as they capture eye movements from a distance; cameras or other sensors that track the movements are normally mounted on or near the screen. In such cases, the main problems that can arise are lower accuracy, since these systems can be affected by lighting conditions, and a more time-consuming calibration process [9, 29].

Some of the most used analysis tools in these types of studies are heatmaps, gaze replays, and gazeplots. However, many other options exist, and a combination of them may help to find more accurate information [43].

Fig. 1. Examples of different output solutions from eye tracker systems.

  • Heatmaps (Fig. 1a) show the areas where users fixed their gaze the most. They can be generated using information from a single person, or from all participants if the objective is to find the consensus fixation points among them. In addition, different metrics can be chosen to generate hot and cold zones, such as the fixation time in an area or the number of fixations.

  • Gaze Replays (Fig. 1b) enable researchers to watch the test session and see exactly where the user was looking. Usually, in this method, a cursor is placed on the screen to simulate the position of the user’s eyes, while the recorded screen video plays in the background. This helps to evaluate task flow and determine exactly the order of each fixation.

  • Gazeplots (Fig. 1c) show where the user fixed their attention, in which order, and for how long each fixation lasted. In most representations, the size of each circle represents the fixation time (larger circles indicate longer durations), and its position indicates the location of the fixation on the canvas. The lines between circles represent the path to the next fixation point, and the circles are typically numbered in sequential order.
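As a rough illustration of how a heatmap of the kind in Fig. 1a can be built, the sketch below (an illustrative example, not the tooling used in the studies cited) bins fixation durations into a coarse grid over the screen:

```python
import numpy as np

def fixation_heatmap(fixations, screen_w, screen_h, grid=(10, 10)):
    """Accumulate fixation durations into a coarse 2D grid.

    fixations: list of (x, y, duration_ms) tuples in screen coordinates.
    Returns a grid[1] x grid[0] array; hotter cells mean longer total gaze.
    """
    heat = np.zeros((grid[1], grid[0]))
    for x, y, dur in fixations:
        col = min(int(x / screen_w * grid[0]), grid[0] - 1)
        row = min(int(y / screen_h * grid[1]), grid[1] - 1)
        heat[row, col] += dur
    return heat

# Two fixations fall in the top-left cell, one in the centre of the screen.
fx = [(10, 10, 300), (50, 40, 200), (960, 540, 500)]
heat = fixation_heatmap(fx, 1920, 1080)
print(heat[0, 0])  # 500.0 (300 + 200 ms accumulated in the top-left cell)
```

The same grid can then be colour-mapped (hot and cold zones) per participant or summed across all participants, matching the two usage modes described above.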

There are different eye tracking devices on the market, such as TobiiProFootnote 3 and PupilLabs InvisibleFootnote 4. These tools have multiple light sensors and cameras to record the user's gaze, and they come with native software that generates various metrics such as fixation time, areas of interest, saccade speed, and gaze duration. They also incorporate functionality for locating the eyes in the image and performing calibration processes.

3 Related Work

Usability tests with eye trackers have been studied since the early 2000s. At the same time, AI started to become more affordable, and in the field of REGT it contributed to the development of systems that do not need special hardware and are released as open source [38]. In this study, we investigated an open source organization called RUXLAB [8]Footnote 5, which is dedicated to carrying out different usability tests, including user tests. This organization allows implementing different plugins to extend the functionalities available when a user test is performed. For this reason, we studied different open-source REGT systems that could be reused instead of developing a new one.

We used the study by Shehu et al. [47] as a starting point, as it presents a general overview of the different configurations into which REGT can be grouped. One of those configurations is a mapping of open-source REGT. We conducted in-depth research to analyze which platforms are still available and to find new options that may have appeared. Table 1 summarizes our findings. The first block represents the organizations found by Shehu et al. [47], while the second block presents organizations found in other studies. We tried to identify critical aspects that correspond with our objectives. Specifically, we assessed whether the organization provided an open repository, whether the code was available, which open-source license was used, whether an Open API specification was provided, and whether the final result was an executable. Moreover, we identified the technologies used during development and the latest updates we were able to find in the code.

Table 1. List of Gaze Tracking Software.

4 Prototype Results

We developed our prototype taking into account the main features encountered in the previous section. One of our objectives was to create a system that does not rely on specific hardware, thereby avoiding dependencies. This means that the system should be compatible with conventional, low-cost webcams available on the market. Additionally, since we intended the system to run in web browsers, we aimed to develop it using web technologies. Moreover, the system must be released under an open-source license and be able to be integrated with the RUXLAB system. Finally, the communication layer must be created using the Open APIFootnote 6 specification [6]. For this reason, we created three different parts to keep the system as modular as possible: (i) the gaze tracker with the calibration system, using an AI model (linear regression); (ii) the web application that allows running tests and serves as the user interface; and (iii) the API that handles all communication between the parties and manages the created data using the Open API specification.

4.1 Gaze Tracker System

In order for the proposed model to accurately predict the direction in which the eyes focus, it is necessary to carry out a point calibration process [5, 9]. Before starting a usability test on the platform, users need to undergo small exercises using their eyes. This step is essential before each session, as each user possesses unique physical characteristics, making it possible to have a specialized analysis for each face instead of a generic model pre-trained on other people’s faces. Environmental conditions, including factors such as luminosity, can also interfere with accuracy during eye tracking sessions [11]. Therefore, each session is trained considering the specific environmental conditions in which the participant is situated.

The calibration process starts with the user positioning themselves correctly within the camera coverage area, so that the face and eyes can be captured without obstructions or poor lighting. Once positioned, the user goes through the exercises, which consist of following points displayed on the screen with the eyes. The user has to do their best to follow the moving circle that travels around the whole screen. While the user performs the exercises, the X and Y position of the user’s pupil and the X and Y position of the moving object are saved. Pupil points are captured using the Face Landmarks Detection modelFootnote 7 from the Tensorflow.js packageFootnote 8, which provides a map of the entire face, including the center of the pupils. This pre-trained model is based on Media Pipe’s FaceMesh packageFootnote 9. All points captured during this process are saved and stored along with the session. The calibration points are stored in a CSV file containing the full relationship between the position of the on-screen points and the corresponding pupil position. This data is later used to train the machine-learning model.
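The recording step described above can be sketched as follows; the CSV column names are hypothetical, since the paper does not specify the file layout:

```python
import csv
import io

# Each sample pairs the moving target's screen position with the detected
# pupil centre at that instant (field names are assumptions of this sketch).
FIELDS = ["target_x", "target_y", "pupil_x", "pupil_y"]

def save_calibration(samples, fh):
    """Write (target_x, target_y, pupil_x, pupil_y) samples as CSV."""
    writer = csv.DictWriter(fh, fieldnames=FIELDS)
    writer.writeheader()
    for tx, ty, px, py in samples:
        writer.writerow({"target_x": tx, "target_y": ty,
                         "pupil_x": px, "pupil_y": py})

buf = io.StringIO()
save_calibration([(100, 200, 0.41, 0.37), (300, 200, 0.46, 0.37)], buf)
print(buf.getvalue().splitlines()[0])  # target_x,target_y,pupil_x,pupil_y
```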

The method chosen to perform this task was linear regression [49], as it is ideal for cases where a value needs to be predicted based on the relationship between two or more variables. Moreover, it is a light and fast model, which suits the scenario considered in this work, involving low-cost cameras and multiple browsers. To implement the linear regression method, we used the LinearRegression function from the scikit-learn library [42] with default parameters. We performed the experiments using holdout validation with 80% of the data in the training set and 20% in the test set.
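A minimal sketch of this training setup on synthetic data follows; the pairing of one regressor per screen axis and the value ranges are assumptions of this example, not details from the paper:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic calibration data: normalised pupil (x, y) -> screen x coordinate.
rng = np.random.default_rng(0)
pupil = rng.uniform(0, 1, size=(200, 2))
screen_x = 1500 * pupil[:, 0] + 80 * pupil[:, 1] + rng.normal(0, 5, 200)

# 80/20 holdout split, as described in the paper.
X_tr, X_te, y_tr, y_te = train_test_split(
    pupil, screen_x, test_size=0.2, random_state=42)

model = LinearRegression()  # default parameters, as in the paper
model.fit(X_tr, y_tr)
print(round(r2_score(y_te, model.predict(X_te)), 3))
```

In a full system, a second regressor fitted against the screen Y coordinate would complete the gaze estimate.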

For the calibration process, we compared the three techniques presented in Fig. 2. The 9-point calibration (Fig. 2a) was implemented with a double forward and backward movement that increases the number of points to 36; the N-point calibration (Fig. 2b) determines the number of points from the width-to-height ratio of the screen; and, finally, a system based on four concentric circles of varying sizes, presented in descending order from largest to smallest, uses 160 points (Fig. 2c) [9, 11].

Fig. 2. Calibration strategies.
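For illustration, the point layouts of strategies (a) and (c) can be sketched in normalised screen coordinates. The per-ring point count for the concentric circles, and the exact traversal that yields 36 samples, are assumptions; the paper only gives the 36 and 160 point totals:

```python
import math

def nine_point_grid():
    """Classic 3x3 calibration grid in normalised [0, 1] coordinates."""
    coords = [0.1, 0.5, 0.9]
    return [(x, y) for y in coords for x in coords]

def concentric_circles(radii=(0.45, 0.35, 0.25, 0.15), per_circle=40):
    """Four concentric rings, largest first: 4 * 40 = 160 points total."""
    pts = []
    for r in radii:
        for k in range(per_circle):
            a = 2 * math.pi * k / per_circle
            pts.append((0.5 + r * math.cos(a), 0.5 + r * math.sin(a)))
    return pts

grid = nine_point_grid()
# One interpretation of the "double forward and backward movement":
# traversing the 9 points forward then backward, twice, gives 36 samples.
samples = (grid + grid[::-1]) * 2
print(len(grid), len(samples), len(concentric_circles()))  # 9 36 160
```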

We used the coefficient of determination and also the mean squared error to analyze the performance of the calibration strategies adopted. The formula for the coefficient of determination (R-squared) is given by:

$$\begin{aligned} R^2 = 1 - \frac{SS_{res}}{SS_{tot}} \end{aligned}$$
(1)

where \(SS_{res}\) is the residual sum of squares and \(SS_{tot}\) is the total sum of squares. On the other hand, the formula for the mean squared error (MSE) is given by:

$$\begin{aligned} MSE = \frac{1}{n} \sum _{i=1}^{n}(y_i - \hat{y_i})^2 \end{aligned}$$
(2)

where \(y_i\) is the actual value of the dependent variable, \(\hat{y_i}\) is the predicted value of the dependent variable, and n is the total number of observations.
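To make the two metrics concrete, the following sketch computes Eqs. (1) and (2) directly from their definitions on a small synthetic example:

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 = 1 - SS_res / SS_tot, as in Eq. (1)."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

def mse(y, y_hat):
    """MSE = (1/n) * sum_i (y_i - y_hat_i)^2, as in Eq. (2)."""
    return np.mean((y - y_hat) ** 2)

# Predicted screen coordinates off by 10 px at each of four points.
y = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 310.0, 390.0])
print(round(r_squared(y, y_hat), 3), mse(y, y_hat))  # 0.992 100.0
```

In the experiments, these metrics are computed separately for the X and Y screen axes, which is why Table 2 reports X/Y pairs.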

Table 2 shows the different results for the three different techniques that were used. It can be observed that the (c) 160 points calibration technique obtained higher values for both X \(R^2\) and Y \(R^2\). However, regarding MSE, we obtained different results. For X MSE, the lowest value was achieved with the (c) 160 points calibration technique. However, for Y MSE, the lowest value was obtained with the (a) 36 points calibration.

The number of calibration points generated by the 160-point technique is higher than in the other two evaluated techniques. Therefore, this technique provided more data for training the linear regression, which may have helped the model capture more of the variability in the relationship between eye movements and screen coordinates, resulting in better performance in terms of \(R^2\).

Table 2. Comparative table of Coefficient of Determination (R-squared) and Mean Squared Error (MSE) coefficients.

4.2 Web System

The proposed project must first run standalone, so it must be easily deployable in any environment. We used Docker to package our build and to simplify dependency management for users. Regarding dependencies, any browser can run the project, as it only asks for webcam and screen-recording permissions.

Functionalities implemented for the prototype include a landing page with a social authentication system and a home page to perform create, read, update, and delete operations, also known as CRUD. Additionally, it incorporates the calibration system, screen and webcam recording, and integration of the iris capture model.

Fig. 3. Screenshots from the Gaze module implemented.

For the website development, the Vuejs frameworkFootnote 10 was used along with the CSS framework VuetifyFootnote 11, which provides UI components following Material Design guidelines, as well as their iconsFootnote 12. Screen recording uses the native Javascript Media DevicesFootnote 13 and Media RecorderFootnote 14 APIs. For the login feature, Google Sign-in with FirebaseFootnote 15 is used as the authentication provider.

Another important part of the proposed Gaze Tracker system is the generation of heatmapsFootnote 16 that we use to represent the focus of interest of a user on the studied page. This system needs to be integrated with an API so that the calibration process, the usability session, and the delivery of the results can be shown in a web application.

4.3 API

The API has a set of CRUD endpoints to manage sessions for each logged-in user. Table 3 lists all routes defined in the API together with their descriptions and corresponding HTTP methods.

Table 3. API endpoints and their descriptions.

These endpoints are used for the integration into the RUXLAB system, but they may also be useful for others using the present work, as they present a decoupled interconnectivity layer between a user testing evaluation and an eye tracker system. The final version of our code can be found under the same GitHub organization umbrella as RUXLABFootnote 17.
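The CRUD semantics behind these session endpoints can be illustrated with a minimal in-memory store; the route comments and field names are assumptions for illustration only, the authoritative contract being the Open API specification in the repository:

```python
import uuid

class SessionStore:
    """In-memory CRUD store mirroring the API's session endpoints."""

    def __init__(self):
        self._sessions = {}

    def create(self, owner, data):   # POST   /sessions
        sid = str(uuid.uuid4())
        self._sessions[sid] = {"owner": owner, **data}
        return sid

    def read(self, sid):             # GET    /sessions/{id}
        return self._sessions.get(sid)

    def update(self, sid, data):     # PUT    /sessions/{id}
        if sid in self._sessions:
            self._sessions[sid].update(data)
            return True
        return False

    def delete(self, sid):           # DELETE /sessions/{id}
        return self._sessions.pop(sid, None) is not None

store = SessionStore()
sid = store.create("alice", {"url": "https://example.org"})
store.update(sid, {"status": "finished"})
print(store.read(sid)["status"])  # finished
```

Keeping this layer decoupled is what allows the same endpoints to serve both the standalone web application and the RUXLAB integration.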

5 Conclusions and Future Work

In this paper, we have presented a complete architecture and system that enables standalone eye tracking sessions. We used linear regression along with three different calibration procedures, and developed a graphical user interface to display data using heatmaps, with an API specification to facilitate communication between the systems.

One of the problems we encountered while analyzing previous solutions was that approximately 70% of the software applications were standalone executable programs with their own interfaces, whereas only 20% had a library or API that could allow integration. This limited our research in terms of reusability, especially when we examined in detail some of the identified options, such as PupilLabs, which was identified as having open source code [27] but requires specific hardware to run. A similar situation occurred with EyeTribe, as the project was acquired by Facebook in 2016 and later discontinued.

Moreover, Table 1 shows that around 70% of the software identified as open source has not received updates in the past three years. We were also unable to find some repositories, partly because the links cited in the encountered bibliography were broken [1, 14, 30, 55, 56, 58], and we could not find references or keywords to conduct deeper research. Regarding technologies, we found that most are based on Java, C, or Python, and only around 10% use pure web technologies such as Javascript [41]. This correlates directly with whether the software is an executable, as most executable applications were developed in Java or C#. As we could not identify a solution that met our requirements, we implemented our own, taking into consideration different aspects found in the other systems, such as the AI model, the calibration process, and the data visualization technique.

We used different calibration systems based on Girshick et al. [16] for their trade-off between simplicity and performance, similar to our choice of linear regression as the core algorithm, since this kind of appearance-based method has demonstrated high effectiveness in other works [47]. The results obtained from the calibration method suggest that there is room for improvement in achieving a high-accuracy and high-precision system. Therefore, in future work, we intend to explore other techniques for the calibration process and alternative machine-learning models.

Still regarding the calibration method, different authors have cited the importance of this step, as it can greatly improve the final results [9]. We implemented only three different strategies, but we found gamification-based calibration processes, also using linear regression, that achieve better results than ours [16]. In future work, we plan to study how different strategies perform across different calibration methods to better analyze which configuration best suits conventional webcam users. Moreover, as the system was built as a standalone module, we intend to integrate it with RUXLAB to be used as a complementary tool for performing user testing evaluations.

Finally, regarding the visualization of the results, a heatmap library was used to display the information. However, other kinds of maps can be created, such as the gaze replay (Fig. 1b) and the gazeplot (Fig. 1c), which might be implemented to improve the understanding of the results. We have also started integrating the plugin into RUXLAB and are working on the release of a new version that will allow the automatic generation of reports.