key: cord-0907776-i3uectd1
authors: Švábenský, Valdemar; Vykopal, Jan; Seda, Pavel; Čeleda, Pavel
title: Dataset of shell commands used by participants of hands-on cybersecurity training
date: 2021-09-22
journal: Data Brief
DOI: 10.1016/j.dib.2021.107398
sha: dcf567192b5a99fd6dfb805fdc8a75eeebcf81ac
doc_id: 907776
cord_uid: i3uectd1

We present a dataset of 13446 shell commands from 175 participants who attended cybersecurity training and solved assignments in the Linux terminal. Each acquired data record contains a command with its arguments and metadata, such as a timestamp, working directory, and host identification in the emulated training infrastructure. The commands were captured in Bash, ZSH, and Metasploit shells. The data are stored as JSON records, enabling vast possibilities for their further use in research and development. These include educational data mining, learning analytics, student modeling, and evaluating machine learning models for intrusion detection. The data were collected from 27 cybersecurity training sessions using an open-source logging toolset and two open-source interactive learning environments. Researchers and developers may use the dataset or deploy the learning environments with the logging toolset to generate their own data in the same format. Moreover, we provide a set of common analytical queries to facilitate the exploratory analysis of the dataset.

Specifications table:
• Subject: Computer science applications
• Specific subject area: Cybersecurity training with assignments solved via a Linux command line
• Type of data: Command-line histories from Bash, ZSH, and the Metasploit shell with associated metadata in JSON format; analytical software for processing the data in Elasticsearch, Logstash, and Kibana
• How data were acquired: We used an open-source logging toolset based on the Syslog protocol [1]. The toolset was deployed in a virtual environment (sandbox) consisting of several emulated computer systems and networks [2]. As the trainees solved the training assignments in the sandbox, the logging toolset transparently captured their command-line histories.
• Data format: Raw
• Parameters for data collection: All commands with their arguments that the trainees submitted in the command line were automatically captured and formatted as JSON records. The logs were captured exclusively in the training sandbox; therefore, they do not contain any sensitive information about the trainees. The data are completely anonymous.
• Description of data collection: Trainees at various proficiency levels (high school, university, and professional learners) attended cybersecurity training sessions hosted at our university or by our collaborators. They solved cybersecurity assignments to practice their skills with command-line tools in Kali, a Linux distribution for penetration testing and digital forensics.
The training occurred in virtual sandboxes that emulated realistic computer systems. During the training, the commands submitted by the trainees were collected along with associated metadata.
• Data source location: Masaryk University, Brno, Czech Republic
• Data accessibility: The material associated with this article can be found at https://zenodo.org/record/5517479 (DOI: https://doi.org/10.5281/zenodo.5517479). It includes the dataset itself, as well as software to facilitate its analysis. Finally, we share a public GitLab repository at https://gitlab.ics.muni.cz/muni-kypo-trainings/datasets/commands that we aim to gradually update with new data in the future.

Value of the data:
• Educational data mining and learning analytics are emerging research fields that analyze data from educational contexts. Such research makes it possible to improve the methods for educating cybersecurity experts. However, it relies on high-quality primary data, and few cybersecurity education datasets exist. We believe this is the first human-generated dataset of shell commands and corresponding metadata from authentic cybersecurity training, which features realistic tools for penetration testing and digital forensics.
• Researchers in computing or education may benefit from these data. The possible use cases include, but are not limited to: training and testing machine learning models (for example, classifiers for skill assessment [3]), evaluating data mining methods, correlating actions from multiple sandboxes, prototyping student models, or detecting security threats.
• The data are normalized and formatted as JSON (JavaScript Object Notation) records. This standard, semi-structured, and easily reusable format enables researchers to directly process the data in a way that suits their needs. The possibilities range from employing analytical tools, such as ELK (Elasticsearch, Logstash, Kibana), to writing dedicated analytical scripts.
• Preparing and developing cybersecurity training sessions requires substantial resources, time, and effort. As a result, instructors and educational researchers often have little time to set up an infrastructure for rigorous data collection and analysis. We contribute to the research community by sharing these original, raw data from cybersecurity training.
• The dataset is freely available and may be used without restrictions, since it does not contain any sensitive or personally identifiable information. Ethical requirements on data collection were met, and the privacy of the trainees who submitted the commands was preserved.

The dataset features 13446 commands originating from the Bash [4], ZSH [5], and Metasploit [6] shells. The commands were submitted by 175 trainees, distributed among 27 training sessions with 6–7 trainees per session on average. The commands submitted in the training sandbox (see Section 3.2) are stored in files titled sandbox-id-useractions.json, where id is an arbitrary numerical identifier. The command history files contain JSON entries in the format shown in Listing 1. Each such entry corresponds to a single command submitted by the trainee. In total, the dataset comprises 13446 such records. The meaning of the individual data fields follows.
• timestamp_str represents the time of the command's submission in the ISO 8601 format (up to millisecond or microsecond precision).
• cmd is the full command (the tool name and its arguments) submitted by the trainee.
• cmd_type is the application used to execute the command: either bash-command for the tools executed from Bash/ZSH, or msf-command for the Metasploit shell.
• username is the account name on the sandboxed machine under which the command was executed. The account names are set by the training author, and they never store personal information of the trainee. username is stored only for the bash-command type.
• hostname is the name of the machine in the sandbox on which the command was executed.
• ip is the IPv4 address of the corresponding virtual host in the sandbox. The IP addresses do not represent any real machine on the Internet.
• wd is the working directory in which the command was executed. Combined with the data fields above, the trainees' command-line prompt looks like this: username@hostname:wd$, e.g., root@attacker:/home$ for Listing 1. wd is stored only for the bash-command type.
• sandbox_id is an arbitrary numerical identifier of the trainee's sandbox. It duplicates the id in the filename sandbox-id-useractions.json for sanity checks.
• pool_id is an arbitrary numerical identifier that associates the sandboxes from one training session into a so-called pool. All sandboxes from the same session belong to the same pool.

The data were collected from multiple different cybersecurity training sessions. In each session, the trainees practiced using Linux command-line tools in a training sandbox of emulated computer systems. To understand the origin of the data, we briefly explain the training background. Each training comprises two components, shown in Fig. 1:
• A sandbox definition is a text file that describes the training network topology and the host configuration. It defines, for example, which software is installed on the machines in the training sandbox. After the definition is instantiated, a training sandbox is created.
• A training definition is a text file that specifies the wording of the assignments that the trainee completes in the training sandbox. Based on the selected training definition, the trainees use different tools to solve the tasks.

When these two components are deployed in an interactive learning environment (see Section 4.2), a training instance is created. Each training instance is associated with a specific date and time and is attended by a certain number of trainees.
Thus, it corresponds to a single training session. The typical training session lasted up to two hours (the median difference between the first and last command was one hour and 13 minutes).

Table 1 presents the descriptive statistics of the collected dataset. For each of the 175 command history files, we counted the number of Bash/ZSH and Metasploit commands. Then, we computed their properties both separately and jointly. The trainees submitted approximately 77 commands on average. Given the training duration, this seems appropriate because the trainees had to read the documentation and contemplate their approach during the training. Overall, the dataset features 586 unique Bash/ZSH tools and 41 unique Metasploit tools. However, only 107 and 18, respectively, were used at least five times, suggesting that most of the rarely used ones were typos. Table 2 shows the top ten most frequently used tools across the whole dataset (the command sudo was excluded from this count).

• Attribute values are not guaranteed to be unique. The data fields described in Listing 1 might not be unique across the whole dataset. For example, there may be two different JSON files with the same sandbox_id, but this does not mean that the trainee was the same. Nevertheless, the dataset archived on Zenodo [7] features unique sandbox_ids.
• When processing the data line by line, timing is not guaranteed to be preserved. The lines within a single JSON file might not be ordered chronologically when commands gathered from different machines (hosts) are interleaved. Even though the machines have synchronized time, they may send the commands to the central storage at different times. Consider the example in Listing 2, where the two commands have been stored in the given order, but the second one was submitted 30 seconds before the first one. The data are not sorted or reordered upon their arrival to the storage.
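Chronological reordering is straightforward because timestamp_str follows ISO 8601. A minimal sketch in Python (the two records below are hypothetical, constructed to mirror the out-of-order scenario described above; only the relevant fields are shown):

```python
import json
from datetime import datetime

# Two hypothetical records in the dataset's format: the second line
# arrived later in the file but was submitted 30 seconds earlier.
raw_lines = [
    '{"timestamp_str": "2021-03-15T14:10:30.123+00:00", "cmd": "nmap -sn 10.1.26.9", "cmd_type": "bash-command"}',
    '{"timestamp_str": "2021-03-15T14:10:00.456+00:00", "cmd": "ls -la", "cmd_type": "bash-command"}',
]

records = [json.loads(line) for line in raw_lines]

# Sort by the actual timestamp (including the time zone offset),
# not by the order of the lines in the file.
records.sort(key=lambda r: datetime.fromisoformat(r["timestamp_str"]))

print([r["cmd"] for r in records])  # the earlier command now comes first
```

Parsing the timestamp (rather than sorting the raw strings) also handles records that mix time zone offsets or millisecond and microsecond precision.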
Therefore, analysts should always consider the actual value of the timestamp_str attribute (including the time zone) instead of relying only on the order of the lines.
• Some log entries are sequenced in rapid succession (for example, 20 records within one second). These are valid entries, often indicating that the trainee copied and pasted multi-line strings into the terminal. They can also indicate a brute-force approach.

The Zenodo data repository [7] is structured into seven folders. Each folder corresponds to one training described in Table 3 and contains JSON files with the raw command-line data captured from that training. To provide detailed context for the data, each training includes its sandbox definition and, in some cases, its training definition as well. This way, the training can be further used to generate new data.

This section explains the format and content of the cybersecurity training, the participants' background, and the data collection. We also discuss related datasets. Privacy and ethical considerations are featured in a separate section.

In each training session, the trainee controls a virtual machine that runs Kali Linux: a penetration testing distribution that provides the necessary command-line tools. The trainee completes a sequence of assignments that mostly involve attacking one or more vulnerable networked hosts, though some assignments feature defensive or analytical actions as well. The hosts in the training sandbox are emulated and isolated from the outside network. Almost all the assignments are solved using command-line tools in Bash, ZSH, or the Metasploit shell. The virtual machines for the training sessions from which we collected the data were hosted in one of two interactive learning environments: KYPO Cyber Range Platform (CRP) [2,8] or Cyber Sandbox Creator (CSC) [2,9].
KYPO CRP is a cloud-based infrastructure for emulating complex networks, while CSC is a tool for creating lightweight virtual labs hosted locally on the trainees' computers.

From August 2019 to July 2021, we hosted 27 cybersecurity training sessions for a total of 175 trainees. Each training session usually took two hours to complete. Some of the sessions were held on-site at our university premises; others were remote due to COVID-19 restrictions. The participants included (sorted from the most to the least represented):
• undergraduate and graduate students of computer science from various universities,
• selected high school students, finalists of the national cybersecurity competition, and
• cybersecurity professionals.
All of them attended the training sessions voluntarily because of their interest in cybersecurity and were not incentivized. Although self-selection bias may be present, the sample represents a broad range of cybersecurity students, experts, and enthusiasts.

To acquire the data, we developed an open-source toolset for collecting shell commands [1]. As the trainees solve the training assignments, the toolset automatically acquires their submitted commands and the associated metadata. Then, the data are formatted as JSON records and stored in dedicated storage.

To facilitate analyzing the data, Table 4 provides a set of analytical queries: standard operations that can be executed on the data.

Table 4: Abstract operations that can be implemented and executed on the dataset.
• Compute descriptive statistics of commands cmds. Example: the average number of submitted commands per trainee is 76.8 (see also Table 1).
• Show the n most used tools among cmds, sorted by their usage count. Example: the most used tool is ls, with 2291 occurrences (see also Table 2).
• For a given tool, show all combinations of its used arguments. Example: for the nmap tool, the following arguments were used: -sn (10×), -p 20 (8×), --help (7×).
• time_gap(cmds, x, y)
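The first two operations can be sketched in plain Python; the records below are hypothetical stand-ins for lines read from the sandbox-id-useractions.json files, and the function names are illustrative, not taken from the article's toolset:

```python
import json
from collections import Counter
from statistics import mean

# Hypothetical records in the dataset's format (only the fields
# needed for these two queries are shown).
raw_lines = [
    '{"cmd": "ls -la", "cmd_type": "bash-command", "sandbox_id": 1}',
    '{"cmd": "nmap -sn 10.1.26.9", "cmd_type": "bash-command", "sandbox_id": 1}',
    '{"cmd": "ls", "cmd_type": "bash-command", "sandbox_id": 2}',
]
cmds = [json.loads(line) for line in raw_lines]

def cmd_stats(cmds):
    """Average number of submitted commands per trainee (per sandbox)."""
    per_sandbox = Counter(c["sandbox_id"] for c in cmds)
    return mean(per_sandbox.values())

def top_tools(cmds, n):
    """The n most used tools (first token of cmd), sorted by usage count."""
    tools = Counter(c["cmd"].split()[0] for c in cmds)
    return tools.most_common(n)

print(cmd_stats(cmds))     # 1.5
print(top_tools(cmds, 1))  # [('ls', 2)]
```

The same counting logic maps directly onto Elasticsearch terms aggregations when using the ELK implementation shipped with the dataset.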
The queries result from our educational data mining and learning analytics research. They can be implemented in a way that suits the analysts' needs; as an example, we provide a basic implementation in ELK along with the data in Zenodo [7]. It enables importing, processing, and analyzing the dataset.

Linux commands have been collected for research purposes for decades [10], but not in the cybersecurity context. At the same time, various datasets (not only shell commands) from cybersecurity exercises are a crucial source of evidence for educational research [11,12]. However, mostly packet captures and system logs have previously been collected from cybersecurity training sessions, and few datasets remain available today. To review related work, we searched for publications indexed on Google Scholar using the query in Listing 3. Table 5 lists the few examples we discovered. In comparison with the related work, our dataset intersects and bridges the two domains by collecting shell commands from hands-on cybersecurity training. We incrementally collected the commands over several training sessions with human participants. Therefore, the dataset fills the discovered gap and represents an original contribution for the community of computing and educational researchers. We believe that others will find value in it and use it to foster further research and development.

Before conducting the training sessions, we discussed the data collection issues with the institutional review board of our university. We obtained a waiver from the ethical committee since we intentionally do not collect any personally identifiable information that could reveal the trainees' identity. The data are anonymous and cannot be linked to specific individuals. As a result, even if one person attended multiple training sessions, it is impossible to track them across those sessions or compare their past performance.
The trainees agreed to the anonymized data collection for research purposes via informed consent before starting the training. We ensured they would not be harmed or negatively affected by the research, and they had the right to stop participating at any time without any restrictions. After collecting the data, we manually checked them for personal information to avoid any privacy issues. If the trainees typed any personal information in the command line, we anonymized it, though such occurrences were sporadic. Currently, we have no fully automated solution for this, since trainees can type anything in the command line. Other than that, no changes were made to the raw data.

The authors declare that they have no known competing financial interests or personal relationships which have, or could be perceived to have, influenced the work reported in this article.

References:
• Toolset for Collecting Shell Commands and Its Application in Hands-on Cybersecurity Training
• Scalable Learning Environments for Teaching Cybersecurity Hands-on
• Predicting student success in cybersecurity exercises with a support vector classifier
• Dataset: Shell Commands Used by Participants of Hands-on Cybersecurity Training
• Cyber Sandbox Creator
• Command Use and Interface Design
• Learning Analytics Perspective: Evidencing Learning from Digital Datasets in Cybersecurity Exercises
• Visualizing the New Zealand Cyber Security Challenge for Attack Behaviors
• NL2Bash: A Corpus and Semantic Parser for Natural Language Interface to the Linux Operating System
• Computer intrusion: detecting masquerades
• Using Unix: Collected traces of 168 users
• Traffic and log data captured during a cyber defense exercise
• A Cybersecurity Dataset Derived from the National Collegiate Penetration Testing Competition