key: cord-0467526-bt9e5ddw authors: Feng, Xiaotao; Sun, Ruoxi; Zhu, Xiaogang; Xue, Minhui; Wen, Sheng; Liu, Dongxi; Nepal, Surya; Technology, Yang Xiang Swinburne University of; Adelaide, The University of; Data61, CSIRO title: Snipuzz: Black-box Fuzzing of IoT Firmware via Message Snippet Inference date: 2021-05-12 journal: nan DOI: nan sha: e23f0eeb8282286dfab005d49b9f53239a5f9fab doc_id: 467526 cord_uid: bt9e5ddw

The proliferation of Internet of Things (IoT) devices has made people's lives more convenient, but it has also raised many security concerns. Due to the difficulty of obtaining and emulating IoT firmware, the black-box fuzzing of IoT devices has become a viable option. However, existing black-box fuzzers cannot form effective mutation optimization mechanisms to guide their testing processes, mainly due to the lack of feedback. It is difficult or even impossible to apply existing grammar-based fuzzing strategies. Therefore, an efficient fuzzing approach with syntax inference is required in the IoT fuzzing domain. To address these critical problems, we propose a novel automatic black-box fuzzer for IoT firmware, termed Snipuzz. Snipuzz runs as a client communicating with the devices and infers message snippets for mutation based on the responses. Each snippet refers to a block of consecutive bytes that reflect the approximate code coverage in fuzzing. This mutation strategy based on message snippets considerably narrows down the search space to change the probing messages. We compared Snipuzz with four state-of-the-art IoT fuzzing approaches, i.e., IoTFuzzer, BooFuzz, Doona, and Nemesys. Snipuzz not only inherits the advantages of app-based fuzzing (e.g., IoTFuzzer) but also utilizes communication responses to perform efficient mutation. Furthermore, Snipuzz is lightweight as its execution does not rely on any prerequisite operations, such as reverse engineering of apps. We also evaluated Snipuzz on 20 popular real-world IoT devices.
Our results show that Snipuzz could identify 5 zero-day vulnerabilities, and 3 of them could be exposed only by Snipuzz. All the newly discovered vulnerabilities have been confirmed by their vendors. The Internet of Things (IoT) refers to the billions of physical devices around the world which are now connected to the Internet, all collecting and sharing data. As early as 2017, IoT devices have outnumbered the world's population [39] , and by 2020, every person on this planet has four IoT devices on average [23] . While these devices enrich our lives and industries, unfortunately, they also introduce blind spots and security risks in the form of vulnerabilities. We take Mirai [25] as an example. Mirai is one of the most prominent types of IoT botnet malware. In 2016, Mirai took down widely-used websites in a distributed denial of service (DDoS) campaign consisting of thousands of compromised household IoT devices. In the case of Mirai, attackers exploited vulnerabilities to target IoT devices themselves and then weaponized the devices for larger campaigns or spreading malware to the network. In fact, attackers can also use vulnerable devices for lateral movement, allowing them to reach critical targets. For example, in the work-from-home scenarios during COVID-19, Trend Micro has reported that, introducing vulnerable IoT devices to the household will expose employees to malware and attacks that could slip into a company's network [26] . Considering the ubiquity of IoT devices, we believe that these known security incidents and risky scenarios are nothing but a tip of the iceberg. IoT vulnerabilities are normally about the implementation flaws within a device's firmware. To launch new products as soon as possible, developers always tend to use open-source components in firmware development without good update plans [1] . This sacrifices the security of IoT devices and exposes them to vulnerabilities that security teams cannot remedy quickly. 
Even if vendors plan to fix the vulnerabilities in their products, the over-the-air patching is usually infeasible because IoT devices do not have reliable network connectivity [16] . As a result, half of the IoT devices in the market were reported to have vulnerabilities [28] . It is hence crucial to discover such vulnerabilities and fix them before an attacker does. However, most IoT software security tests heavily rely on the assumption of device firmware availability. In many cases, manufacturers tend not to release their product firmware and that makes various dynamic analysis methods based on code analysis [7, 13, 15, 18, 32, 46] (or emulation [8, 10, 20, 50, 51] ) difficult. Among the existing defense techniques, fuzz testing has shown promises to overcome these issues and has been widely used as an efficient approach in finding vulnerabilities. Moreover, the ability of IoT devices to communicate with the outside world offers us a new option, and that is to test device firmware through exchanging network messages. Therefore, an IoT fuzzer could be designed to send random communication messages to the target device in order to detect if it shows any symptoms of malfunctioning. Potential vulnerabilities could be exposed if crashes are triggered during execution or the device is pushed to send back abnormal messages. However, using network communication to fuzz the firmware of IoT devices is very challenging. Since obtaining internal execution information from the device is not possible, most existing network IoT fuzzers [9, 31, 44] work in a black-box manner. This makes optimizing the mutation strategies very difficult. Because the selection of mutated seeds is entirely random, existing black-box IoT fuzzing approaches could become very hard to handle, and sometimes, even become more like brute force crack testing. In addition, IoT devices have strict grammatical specifications for inputs in communication. 
Most of the messages that are generated by random mutation will break the syntax rules of the input, and will be quickly rejected during syntax validation in the firmware before being executed. A grammar-based mutation strategy [2, 40] can effectively generate messages that meet the input requirements though. This can be done by learning the syntax via documented grammatical specifications or from a labeled training set. However, as shown in Table 1 , many non-standard IoT device communication formats are being used in practice. Therefore, preparing enough learning materials for grammar-based mutation strategies is a huge workload, which makes the deployment of grammar-based IoT fuzzing difficult. Challenges. In this paper, we focus on detecting vulnerabilities in IoT firmware by sending messages to IoT devices. To design an effective and efficient fuzzing method, several challenges have to be overcome. • Challenge 1: Lack of a feedback mechanism. Without access to firmware, it is nearly impossible to obtain the internal execution information from IoT device to guide the fuzzing process (as is done in most typical fuzzers). Therefore, we need a lightweight solution to obtain feedback from device, and optimize the generation process. • Challenge 2: Diverse message formats. Table 1 shows some message formats that are used in IoT communication, including JSON, SOAP, Key-value pairs, string, or even customized formats. In order to be applied to various devices, a solution should be able to infer the format from a raw message. • Challenge 3: Randomness in responses. The response messages of an IoT device may contain random elements, such as timestamps or tokens. Such randomness results in different responses for the same message, and diminishes the effectiveness of fuzzing because the input generation of Snipuzz relies on responses. Our approach. 
In this paper, we propose a novel and automatic black-box IoT fuzzing, named Snipuzz, to detect vulnerabilities in IoT firmware. Different from other existing IoT fuzzing approaches, Snipuzz implements a snippet-based mutation strategy which utilizes feedback from IoT devices to guide the fuzzing. Specifically, Snipuzz uses a novel heuristic algorithm to detect the role of each byte in the message. It will first mutate bytes in a message one by one to generate probe messages, and categorize the corresponding responses collected from device. Adjacent bytes that have the same role in the message form the initial message snippets, which is the basic unit of mutation. Moreover, Snipuzz utilizes a hierarchical clustering strategy to optimize mutation strategies and reduce the misclassification of categories caused by randomness in the response messages and the firmware's internal mechanism. Therefore, Snipuzz, as a black-box fuzzer, can still effectively test the firmware of IoT devices without the support of grammatical rules and internal execution information of the device. Snipuzz resolves Challenge 1 by using responses as the guidance to optimize the fuzzing process. Based on the responses, Snipuzz designs a novel heuristic algorithm to initially infer the role of each byte in the message, which resolves Challenge 2. Snipuzz utilizes edit distance [42] and agglomerative hierarchical clustering [43] to resolve Challenge 3. We summarize our main contributions as follows: • Message snippet inference mechanism. The responses from IoT devices are related to code execution path in firmware. Based on responses, we infer the relationship between message snippets and code execution path in firmware. This novel mutation mechanism enables that Snipuzz does not need any syntax rules to infer the hidden grammatical structure of the input through the device responses. 
Compared with the actual syntax rules that determine the input string format, the result of snippet determination proposed by Snipuzz has a similarity of 87.1%.
• More effective IoT fuzzing. When testing IoT devices, the number of response categories is positively correlated with the number of code execution paths in the firmware. In the experiment, the number of response categories explored by Snipuzz far exceeded other methods on most devices, regardless of the analysis duration (10 minutes or 24 hours).
• Implementation and vulnerability findings. We implemented the prototype of Snipuzz. We used it to test 20 real-world consumer-grade IoT devices while comparing with the state-of-the-art fuzzing tools, i.e., IoTFuzzer, Doona, BooFuzz, and Nemesys. In 5 out of 20 devices, Snipuzz successfully found 5 zero-day vulnerabilities, including null pointer exceptions, denial of service, and unknown crashes, and 3 of them could be exposed only by Snipuzz.

Fuzzing is a powerful automatic testing tool to detect software vulnerabilities. After decades of development, fuzzing has been widely used as a base in several security testing domains, such as the OS kernel [12, 36], servers [33], and the blockchain [3]. In general, fuzzing feeds the target programs with numerous mutated inputs and monitors exceptions (e.g., crashes). If an execution reveals undesired behavior, a vulnerability could be detected. To discover vulnerabilities more effectively, fuzzing algorithms optimize the mutation process based on feedback of executions (e.g., coverage knowledge), instead of using a purely random mutation strategy. Moreover, fuzzers can judge from the feedback mechanism whether each test case generated by seed mutation is "interesting" (i.e., whether the test case has explored unseen execution states). If a test case is interesting, it will be reserved as a new seed to participate in future mutation.
With the feedback, many fuzzers [4, 5, 29, 41, 49] steer the computing resources towards the interesting test cases and achieve higher possibility to discover vulnerabilities. To react with external inputs, most IoT devices implement a similar high-level communication architecture. As per the pseudo code example presented in Figure 1 , a typical implementation of the communication architecture may consist of four parts: 1) Sanitizer, 2) Function Switch, 3) Function Definitions, and 4) Replier. When an IoT device receives an external input, Sanitizer starts parsing the input and performs regular matching. If the input format breaches the syntactic requirements, or an exception occurs during the parsing process, Sanitizer will directly notify Replier by sending a response message describing the input error and terminate the processing of input. If the input is syntactically correct, Function Switch transfers control to the corresponding Functions according to the attribute, Key, and corresponding value, val, extracted from the input. If Key cannot be matched, the processing of this input will be terminated, similarly as done by Replier. When Functions completes the processing, such as setFlow(), with the parameter val, it notifies Replier to generate the response message. Note that, the implementation of Functions is specific to IoT devices. As described above, Replier is responsible for sending responses to the client (such as the user's APP). Based on the calling situation (indicated by the parameter code in the example), Replier determines the content of response message to be sent. The interactive capabilities of IoT devices make it possible to test security of device firmware through the network. However, there are also some challenges when testing IoT devices using network-based fuzzers. Since most network fuzzing methods cannot directly obtain execution status of the device, it is hard to establish an effective feedback mechanism to guide the fuzzing process. 
Without feedback mechanism, the fuzzing tests could be blind in the selection of mutation targets, and may lean to a brute force random test. As discussed previously, due to the lack of open-sourced firmware, it is difficult or even impossible to instrument the IoT devices. Therefore, the response messages returned by the firmware can be regarded as a valuable source of device status information at run-time. The Replier in Figure 1 will use the value of the variable code to determine the content of the response messages. The value of code comes from many different function blocks in the firmware. Parameters are passed when Sanitizer fails to parse the input or some exceptions are triggered; or when the Function Switch cannot match the key command characters in the input; or after each input is executed in the Functions. Therefore, through the content of the response message, the code block that has been executed in the firmware can be inferred. When the firmware source code is not available, the correspondence between the firmware execution and the response messages cannot be directly extracted. Moreover, the firmware may return the same response messages even executing different functions. Although the response message cannot be equated to the execution path of the device, it can still play an important role in the black-box fuzz testing for IoT devices. Although it is hard to link the code execution path corresponding to each response message, if the two inputs get different response messages, we can deduce that the two inputs go to different firmware code execution paths. Our approach. Snipuzz uses the response message to establish a new feedback mechanism. Snipuzz will collect every response, and when a new response is found, the input corresponding to the response will be queued as a seed for subsequent mutation testing. The firmware of the IoT device can be regarded as a software program with strict syntax requirements for input. 
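A toy model can make the link between responses and firmware paths concrete. The sketch below mimics the four-part architecture described for Figure 1 (Sanitizer, Function Switch, Functions, Replier) and then applies the response-driven seed selection described above. It is only an illustration under stated assumptions: the key `flow`, the numeric values of `code`, and every response string are invented here, and exact string comparison stands in for a proper response categorization.

```python
import json


def set_flow(val):
    """Hypothetical device function, named after setFlow() in the text."""
    if type(val) is not int or val < 0:
        return 2                      # code 2: value rejected inside the function
    return 0                          # code 0: function executed normally


# Function Switch table: Key -> function (the key name is an assumption).
FUNCTIONS = {"flow": set_flow}

# Replier table: code -> response text (all strings are invented).
REPLIES = {
    0: '{"status":"ok"}',
    1: '{"error":"invalid syntax"}',
    2: '{"error":"invalid value"}',
    3: '{"error":"unknown key"}',
}


def replier(code):
    """Replier: choose the response content from the calling situation."""
    return REPLIES[code]


def handle(raw):
    """Process one input the way the four-part architecture describes."""
    # Sanitizer: reject anything that is not a one-entry JSON object.
    try:
        parsed = json.loads(raw)
    except ValueError:
        return replier(1)
    if not isinstance(parsed, dict) or len(parsed) != 1:
        return replier(1)
    key, val = next(iter(parsed.items()))
    # Function Switch: dispatch on the key, or stop if it cannot be matched.
    func = FUNCTIONS.get(key)
    if func is None:
        return replier(3)
    # Functions, then Replier.
    return replier(func(val))


def collect_seeds(candidates):
    """Keep every input whose response has not been seen before."""
    seen, seeds = set(), []
    for message in candidates:
        response = handle(message)
        if response not in seen:
            seen.add(response)
            seeds.append(message)
    return seeds
```

In this model, two inputs that produce different responses necessarily took different branches of `handle`, while the converse does not hold: `{"flow":3}` and `{"flow":7}` share a response, so only the first is kept as a seed.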
If the byte-based mutation strategies (such as mutating each byte in the input one by one or randomly selecting bytes for mutation testing) are used in the fuzz testing, the generated test cases could be rare to meet the input syntax requirements. The grammar-based fuzzers utilize detailed documents or a large training data set to learn the grammatical rules and use it to guide the generation of mutation [34, 40] . In many cases, the input syntax in IoT devices is diverse or nonstandard. Table 1 shows the communication format requirements used in 20 IoT devices from different vendors. Some of them are using well-known formats such as JSON and SOAP, but some use Key-value pairs or even custom strings as communication format. Therefore, it is difficult to provide grammar specifications or establish training data sets that cover communication formats on a large scale for the grammar-based mutation strategy. The best grammar guidance originates from the firmware itself. Responses from IoT devices suggest the execution results of messages. If we mutate a valid message byte by byte (i.e., breaching the format), we will get many different responses. If mutation of two different positions in the valid message receives the same response, these two positions have a high possibility that they are related to the same functionality in firmware. Therefore, those consecutive bytes with the same response can be merged into one snippet. This method of inferring message snippets can clearly reflect the utility of each byte after entering the firmware. In addition, mutation based on message snippets can largely reduce the search space and improve the efficiency of fuzzing. Our approach. Snipuzz merges consecutive bytes with the same response into one snippet. We also propose different mutation operators performing on snippets. In order to clearly present our approach, we first introduce some notations while explaining the fuzzing process of Snipuzz. 
At a high level, Snipuzz performs as a client which sends a message sequence to request certain actions from IoT devices. Any message $m \in M$ requests the IoT device to perform a certain functionality, and all the messages $M = \{m_1, m_2, \ldots, m_n\}$ work together to request an action (or actions). Similarly to typical fuzzers, we initialize a seed $S$ with an initial message sequence, and a seed corpus $C$ with all the seeds (Section 4.1). Meanwhile, restoring message sequences are collected for resetting the IoT device to a predefined status. To establish effective fuzzing, as depicted in Figure 2, Snipuzz first conducts a snippet determination process. Concretely, Snipuzz selects a message $m$ in a seed $S \subset C$, from which a probe message $p$ and a corresponding message sequence will be generated. Each message in the sequence will trigger a response message $r$ (response for short) containing the information about the execution output. Snipuzz assigns each message $m$ a response pool $R$, which is utilized to determine if a new response is unique. The uniqueness of a response indicates that it does not belong to any category of responses existing in the response pool. If $r$ is unique, Snipuzz will add $r$ into the pool $R$, and reserve the corresponding message sequence as a new seed. Snipuzz then divides the message into different snippets based on the responses (Section 4.2). Once the snippets are obtained, Snipuzz performs mutation according to various strategies, e.g., empty, byte flip, data boundary, or havoc (detailed in Section 4.3). Throughout the fuzzing process, Snipuzz sets up a network monitor to detect crashes which may indicate vulnerabilities (Section 4.4).

The quality of initial seeds could influence the fuzzing campaigns significantly. Therefore, we aim to obtain high-quality initial seeds conforming to the highly-structured formats required by IoT devices, as such inputs may exercise complex execution paths and increase the chance of exposing vulnerabilities deep in the code.
Generating seeds based on companion app reverse-engineering [9] or accessible specifications (as mentioned in Section 3.2) could be intuitive solutions. However, they either require heavy engineering efforts or could be error-prone (e.g., seeds may violate the required formats or have the wrong order of messages). Initial seed acquisition. Snipuzz proposes a lightweight solution to obtain initial valid seeds. Considering that many IoT devices have first-or third-party API documents as well as the test suites, the testing programs provided by both parties can effectively act as a client, sending control commands to IoT devices or remote servers. Most structural information (e.g., header, message content) and protocols (e.g., HTTP, HNAP, MQTT) of communication packets are defined in the API programs as message payloads. Therefore, Snipuzz leverages these test suites to communicate with the target devices, while at the same time, extracting the message sequences as initial seeds. For example, when using an API program to turn on a light bulb, the program first sends login information to the server or to the IoT device, then sends a message to locate a specific light bulb device, and finally sends a message to control the device to turn on the light. Snipuzz captures such a message sequence that triggers a functionality of IoT device as an initial seed. Restoring message sequence acquisition. In order to replay a test case for the crash triage, Snipuzz ensures that the device under test has the same initial state in each round of testing. After sending any message sequence to the device, Snipuzz will send a restoring message sequence to reset the device to a predefined status. Manual efforts. Although we try our best efforts to provide a lightweight fuzzer, Snipuzz still requires some manual efforts to obtain valid and usable initial seeds. First, we manually configure the programs from the test suites, such as setting up the IP address and the login information. 
Note that we only need to configure these programs once per device. Second, to capture the message sequences dynamically, we need to manually define the specific format and protocol in the network traffic monitor. Finally, we filter out some message sequences that will mislead the fuzzing process. For instance, some API programs provide operations that can automatically update or restart the device. These operations will halt the device and thus no response will be sent back. This leads to false-positive crashes because we consider a no-response execution as a crash. The manual work costs roughly 5 man-hours per device and is only required during the message sequence acquisition phase of Snipuzz.

The key idea of Snipuzz is to optimize the fuzzing process based on snippets determined by responses. Put differently, Snipuzz leverages snippet mutation to reduce the search space of inputs, while the snippets are automatically clustered via categorizing responses from IoT devices. The major challenge is to correctly understand the semantics of responses. For instance, due to the presence of a timestamp, two semantically identical responses will be classified into different categories if a simple string comparison is used. Therefore, Snipuzz utilizes a heuristic algorithm and a hierarchical clustering approach to determine the snippets in each message.

The essence of a message snippet is the consecutive bytes in a message that enable the firmware to execute a specific code segment. For experienced experts, it is not difficult to segment message snippets according to the semantic definition in official documents. However, for algorithms that lack such knowledge, it is essential to apply some automatic approaches to identify the meaning of each byte in the message. Snipuzz first uses a heuristic algorithm to roughly divide each message into initial snippets. The core idea of the heuristic algorithm is to generate probe messages by deleting a certain byte in the message $m$ ($m \in M$).
By categorizing the responses of each probe message, Snipuzz preliminarily determines the snippets in the message $m$. For example, as shown in Table 2, to determine snippets in the message $m$ = {"on":true}, Snipuzz generates probe messages by removing the bytes in $m$ one by one. When the first byte '{' in $m$ is deleted, the corresponding probe message $p_1$ is "on":true}. Similarly, when the second byte is deleted, the corresponding probe message $p_2$ is {on":true}. Therefore, the message $m$ with 11 bytes can generate 11 different probe messages ($p_1$ to $p_{11}$). Snipuzz will send the 11 corresponding message sequences ($M_1$ to $M_{11}$) containing the probe messages to the device and collect responses. Snipuzz then distinguishes the snippets in the message $m$ by categorizing the responses. Specifically, the consecutive bytes with the same corresponding response type are merged into the same snippet. According to the examples illustrated in Table 2, the Responses $r_1$, $r_2$, and $r_5$ are merged into one category that indicates an error in JSON syntax, while Responses $r_3$ and $r_4$ are merged into another category which indicates an error of an invalid input parameter. Therefore, the consecutive bytes whose corresponding responses belong to the same category can form a message snippet. Through this heuristic approach, Snipuzz can determine all initial snippets in the message $m$.

A naive method to categorize responses is to utilize a string comparison, i.e., comparing the content of responses byte by byte. However, due to the existence of randomness in responses (e.g., timestamp and token), a simple string comparison may incorrectly distinguish the responses with the same semantic meaning into different categories. Therefore, a more advanced solution, Edit Distance [42], is introduced to determine the category of responses. As shown in Equation (1), a similarity score, $s_{ij}$, between two responses $r_i$ and $r_j$ is calculated.
$$ s_{ij} = 1 - \frac{\mathit{edit\_distance}(r_i, r_j)}{\mathit{max\_length}(r_i, r_j)} \tag{1} $$

where $\mathit{max\_length}()$ selects the longer string between the two responses and $\mathit{edit\_distance}()$ counts the minimum number of operations, including insertion, deletion, and substitution, required to transform one string into the other. Therefore, the more similar two responses are, the larger the value of $s_{ij}$ is.

Table 2: Probe messages generated from {"on":true}, the responses they trigger, and the category of each response.

| Probe message | Response | Category |
|---|---|---|
| Probe message 1: "on":true} | Response 1: {"error":{"type":2, "address":"/lights/1/state", "description":"body contains invalid json"}} | 1 |
| Probe message 2: {on":true} | Response 2: {"error":{"type":2, "address":"/lights/1/state", "description":"body contains invalid json"}} | 1 |
| Probe message 3: {"n":true} | Response 3: {"error":{"type":6, "address":"/lights/1/state/n", "description":"parameter, n, not available"}} | 2 |
| Probe message 4: {"o":true} | Response 4: {"error":{"type":6, "address":"/lights/1/state/o", "description":"parameter, o, not available"}} | 3 |
| Probe message 5: {"on:true} | Response 5: {"error":{"type":2, "address":"/lights/1/state", "description":"body contains invalid json"}} | 1 |
| Probe message 11: {"on":true | Response 11: {"error":{"type":2, "address":"/lights/1/state", "description":"body contains invalid json"}} | 1 |

Figure 3: An example of snippet determination.

Snipuzz first calculates a self-similarity score $s_{ii}$ for each probe message $p_i$. Note that $p_i$ is generated by mutating the $i$-th byte in the message $m$. Concretely, Snipuzz sends the same probe message $p_i$ twice within an interval of one second. Two responses $r_i$, $r_i'$ will be collected from the IoT device, correspondingly. The self-similarity score $s_{ii}$ is then calculated based on the two responses $r_i$, $r_i'$ according to Equation (1). Note that, due to the randomness in the responses, there could be differences between the two responses $r_i$, $r_i'$, even though they are from the same probe message. Therefore, the self-similarity score $s_{ii}$ could be smaller than 1. To determine whether two responses belong to the same category, Snipuzz computes the similarity score of two responses and compares it with the self-similarity score.
For example, for two responses $r_i$ and $r_j$, Snipuzz uses Equation (1) to compute the similarity score $s_{ij}$. After that, $s_{ij}$ will be compared with the self-similarity scores. If $s_{ij} \geq s_{ii}$ or $s_{ij} \geq s_{jj}$ is satisfied, responses $r_i$ and $r_j$ will be considered as belonging to the same category; otherwise, responses $r_i$ and $r_j$ are assigned to different categories. For a newly received response $r$, Snipuzz will compare it with all the responses in the corresponding response pool $R$ based on the similarity score. If the new response does not belong to any existing category, the response as well as the corresponding probe message will be added into the response pool. With the response pool $R$, Snipuzz categorizes each byte in the message $m$. Specifically, the category of the $i$-th byte in message $m$ is assigned according to the category of response $r_i$. Then the consecutive bytes with the same category will be merged into one snippet. Figure 3 shows an example of the initial snippet determination on the message $m$ = {"on":true} according to the response categories in Table 2.

Clustering. Although Snipuzz utilizes similarity comparison to mitigate the mis-categorization caused by randomness in responses, two semantically identical responses may still be mis-categorized into different categories. This could occur when the responses contain contents extracted or copied from probe messages. For example, due to the quotation of specific error contents from probe messages, the heuristic algorithm will not assign them to one category. Specifically, the similarity score $s_{34}$ of $m$ = {"on":true} in Table 2 is 0.979, which is smaller than the self-similarity scores $s_{33}$ = 1.000 and $s_{44}$ = 1.000 (as there is no randomness in the responses). However, these two responses are semantically identical and should be identified as one category, i.e., they are both error messages, indicating that parameter syntax errors are located in the probe messages and the device is executing the same code block.
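The initial snippet determination described above can be sketched in a few functions. This is a minimal illustration, not the authors' implementation: it assumes Equation (1) has the form "one minus edit distance divided by the length of the longer response", which is the form that reproduces the 0.979 score quoted for Responses 3 and 4, and all function names are chosen here for illustration.

```python
def make_probes(message):
    """One probe per byte: probe i is the message with its i-th byte deleted."""
    return [message[:i] + message[i + 1:] for i in range(len(message))]


def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute ca by cb
        prev = cur
    return prev[-1]


def similarity(r1, r2):
    """Assumed form of Equation (1): 1 - edit distance / longer length."""
    longest = max(len(r1), len(r2))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(r1, r2) / longest


def categorize(responses, self_scores):
    """Give each response a category id using the self-similarity test.

    A response joins the first pooled response whose similarity to it reaches
    either of the two self-similarity scores; otherwise it opens a new category.
    """
    pool = []          # (representative response, its self-similarity score)
    categories = []
    for resp, s_self in zip(responses, self_scores):
        for cat, (rep, rep_self) in enumerate(pool):
            s = similarity(resp, rep)
            if s >= s_self or s >= rep_self:
                categories.append(cat)
                break
        else:
            pool.append((resp, s_self))
            categories.append(len(pool) - 1)
    return categories


def merge_snippets(categories):
    """Merge consecutive byte positions of one category into (start, end) ranges."""
    snippets, start = [], 0
    for i in range(1, len(categories) + 1):
        if i == len(categories) or categories[i] != categories[start]:
            snippets.append((start, i))
            start = i
    return snippets
```

Applied to the first five probes of {"on":true} with the responses listed in Table 2 and self-similarity 1.0 throughout, `categorize` yields three categories and `merge_snippets` yields the byte ranges `{"`, `o`, `n`, and `"`, which shows why the two "parameter not available" responses still end up in separate initial snippets.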
In order to solve the aforementioned problem, Snipuzz uses agglomerative hierarchical clustering to refine message snippets. The core idea of hierarchical clustering is to continuously merge the two most similar clusters until only one cluster remains. As shown in Algorithm 1, Snipuzz will initialize the snippets according to the Initial Snippets determined in Section 4.2.1 (line 1). After that, each response category in the response pool will be initialized as a cluster (line 2). Snipuzz will convert the responses into feature vectors (line 3, detailed in the later paragraph) which will be used to compute the distance between each pair of clusters (lines 5-7). Then the two closest clusters will be merged and the cluster center will be updated accordingly (lines 8-10). After performing the clustering process, Snipuzz will generate new snippets according to the current cluster result and add the new snippets into the snippet segmentation result (line 11), which will be further used for mutation.

Concretely, Snipuzz first extracts features from responses, which vectorize responses into tuples of the self-similarity score, the length of the response, the number of alphabetic segments, the number of numeric segments, and the number of symbol segments. Each segment consists of consecutive bytes that have the same type. For instance, "123" is 1 numeric segment, and there are 2 alphabetic segments and 1 numeric segment in a string such as "a1b". More specifically, the response $r_1$ in Table 2 will be vectorized to $f_1$ = (1, 91, 10, 2, 10). Similarly, responses $r_3$ and $r_4$ will be converted to $f_2$ = (1, 94, 11, 2, 13) and $f_3$ = (1, 94, 11, 2, 13).

Figure 4 shows an example of clustering according to the message $m$ = {"on":true} in Table 2. According to Algorithm 1, in the preparation round (0th round) of clustering, each category in the response pool will be initialized as a single cluster.
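The feature extraction and the merging loop described above can be sketched as follows. This is a hedged illustration rather than Algorithm 1 itself: the Euclidean distance, the mean as cluster center, and the treatment of whitespace as a separator that belongs to no segment are assumptions made here. The responses are written as compact JSON (no space after the separating commas), since that is the form whose lengths match the 91 and 94 quoted in the text.

```python
import math
import re


def features(response, self_similarity):
    """Vectorize a response: (self-similarity, length, #alpha, #numeric, #symbol)."""
    alpha = len(re.findall(r"[A-Za-z]+", response))
    numeric = len(re.findall(r"[0-9]+", response))
    # Assumption: whitespace separates segments but is not itself a symbol.
    symbol = len(re.findall(r"[^A-Za-z0-9\s]+", response))
    return (self_similarity, len(response), alpha, numeric, symbol)


def agglomerate(vectors):
    """Repeatedly merge the two closest clusters until one remains.

    Returns the member indices of each newly formed cluster, in merge order.
    """
    dims = len(vectors[0])
    clusters = [({i}, list(v)) for i, v in enumerate(vectors)]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = math.dist(clusters[a][1], clusters[b][1])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        members = clusters[a][0] | clusters[b][0]
        # Update the center as the mean of all member vectors.
        center = [sum(vectors[i][k] for i in members) / len(members)
                  for k in range(dims)]
        merges.append(sorted(members))
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append((members, center))
    return merges


R1 = '{"error":{"type":2,"address":"/lights/1/state","description":"body contains invalid json"}}'
R3 = '{"error":{"type":6,"address":"/lights/1/state/n","description":"parameter, n, not available"}}'
R4 = '{"error":{"type":6,"address":"/lights/1/state/o","description":"parameter, o, not available"}}'

VECTORS = [features(r, 1.0) for r in (R1, R3, R4)]
MERGES = agglomerate(VECTORS)
```

Under these assumptions, `features` reproduces the three vectors quoted in the text, and the two identical vectors are merged first because their distance is zero. Each entry of `MERGES` corresponds to one newly generated snippet.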
In the 1st round, as clusters 2 and 3 are the two clusters with minimum distance ($\lVert f_2 - f_3 \rVert = 0$), the two clusters are merged into a new cluster. Correspondingly, the message snippets 'o' and 'n' are merged into a new snippet, marked with index #4. Similarly, in the next round, the two closest clusters, cluster 1 and the new cluster, are merged, and a new snippet will also be generated. Finally, all snippets in the message are merged into one new snippet, i.e., the message itself. All the newly generated snippets together with the initial snippets will be used in message mutation in the next stage.

Snippet Mutation. In order to conduct efficient fuzzing, Snipuzz mutates the snippets obtained in the stage of Snippet Determination. Note that the mutation schemes are performed on the entire snippet instead of a single byte in a message.
• Empty. An empty data domain may crash the firmware if the data domain is not properly checked. Therefore, Snipuzz deletes an entire snippet to empty the data domain.
• Dictionary. For the scheme of Dictionary, Snipuzz replaces a snippet with a pre-defined string such as "true" and "false", which may directly explore more code coverage.
• Repeat. In order to detect bugs in syntax parsers, Snipuzz repeats a snippet multiple times. Meanwhile, the repetition of a data domain can detect defects caused by out-of-boundary problems.

Havoc. The conditions for triggering bugs may be complicated. For example, it may require modifying different data domains in the same message to trigger a bug. The aforementioned snippet mutation schemes only mutate one snippet at a time. However, the havoc mutation randomly selects several snippets in a message, and performs the aforementioned mutation schemes on each of the selected snippets. Havoc mutation will not stop until a new response category is found or the target IoT device crashes.
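The snippet-level operators listed above can be sketched as plain string operations on byte ranges. This is an illustrative sketch under stated assumptions, not the prototype's code: snippets are represented as non-overlapping `(start, end)` ranges, the dictionary contents and the repeat count are placeholders, and `havoc_once` performs a single havoc round, whereas the text describes repeating it until a new response category or a crash is observed.

```python
import random


def replace_snippet(message, snippet, replacement):
    """Return the message with the bytes of one snippet replaced."""
    start, end = snippet
    return message[:start] + replacement + message[end:]


def mutate_empty(message, snippet):
    """Empty: delete the whole snippet."""
    return replace_snippet(message, snippet, "")


def mutate_dictionary(message, snippet, word):
    """Dictionary: replace the snippet with a pre-defined string."""
    return replace_snippet(message, snippet, word)


def mutate_repeat(message, snippet, times):
    """Repeat: write the snippet `times` times in place of the original."""
    start, end = snippet
    return replace_snippet(message, snippet, message[start:end] * times)


def havoc_once(message, snippets, rng, dictionary=("true", "false")):
    """One havoc round: mutate a random subset of non-overlapping snippets."""
    chosen = rng.sample(snippets, k=rng.randint(1, len(snippets)))
    # Work from right to left so earlier offsets stay valid after each edit.
    for snippet in sorted(chosen, reverse=True):
        operator = rng.choice(("empty", "dictionary", "repeat"))
        if operator == "empty":
            message = mutate_empty(message, snippet)
        elif operator == "dictionary":
            message = mutate_dictionary(message, snippet, rng.choice(dictionary))
        else:
            message = mutate_repeat(message, snippet, rng.randint(2, 8))
    return message
```

For the message {"on":true} with the value snippet at bytes 6 to 10, the three single-snippet operators give {"on":}, {"on":false}, and {"on":truetruetrue} (with three repetitions). Passing a seeded `random.Random` to `havoc_once` keeps a fuzzing run reproducible, which helps when a crash has to be replayed.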
The network communication of the device is monitored, and a timeout is set to determine whether the device has crashed. The monitoring of device network communication is not a single step; it runs during the entire fuzzing process. In case of a timeout, Snipuzz resends the same message sequence up to three times, as the cause of the timeout could be network fluctuations instead of a device crash. If the timeout occurs all three times, Snipuzz uses the control command to physically restart the device and sends the same sequence of messages to the device again. If the device still does not return a message on time, Snipuzz records the crash and the corresponding message sequence.

The design of Snipuzz consists of four steps: Message Sequence Acquisition, Snippet Determination, Mutation, and Network Communication Monitoring. In the Message Sequence Acquisition step, we use Wireshark [45] to detect and record the communication packets between the API and the IoT device, and manually clean these message sequences. The remaining core functional steps are packaged in a prototype implemented with 4,000 lines of C# code. The network monitor records every message sent to the device, and resends the message when the device does not reply. A smart plug is used to implement the physical restart of the target device. When Snipuzz needs to physically restart the device under test, it sends control messages to the smart plug, which switches off and then on again. In this way, the device under test is powered off briefly and restarted.

IoT Devices under test. We have selected 20 popular consumer IoT devices from both online and offline markets worldwide, covering various well-known brands, such as Philips, Xiaomi, TP-Link, and Netgear. The types of selected IoT devices include smart plugs, smart bulbs, routers, home bridges, IP cameras, fingerprint terminals, etc.
These devices are either recommended items on Amazon or best-selling products that can be bought in supermarkets. Table 1 details the information of the IoT devices under test.

Benchmark tools. To verify Snipuzz's performance in finding crashes and message segmentation, we used seven different fuzzing schemes as benchmarks.
• IoTFuzzer [9]. The core idea of IoTFuzzer is to find the functions that send control commands to the IoT device by static analysis of companion apps, and to mutate the values of specific variables to perform fuzz testing without breaking the message format. Note that our implementation of IoTFuzzer is a best-effort replication, since their code is not publicly available, and we acknowledge that it could produce slightly different results from the original version. We implement IoTFuzzer by replacing the mutation algorithm in the Snipuzz framework with the mutation strategies of IoTFuzzer. The purpose of companion app analysis in IoTFuzzer is to ensure that only the data domain in the communication message is mutated. To make the benchmark as fair as possible, we therefore use the same seeds as the ones used in Snipuzz and manually segment the data domain of each seed message before feeding it to IoTFuzzer. We believe that such manual segmentation is sufficient to provide an upper bound on the performance of IoTFuzzer. Note that we remove the methods related to the feedback mechanism and snippet segmentation because these methods are not used in IoTFuzzer.
• Nemesys [22]. Nemesys is a protocol reverse engineering tool for network message analysis. It utilizes the distribution of value changes in a single message to infer the boundaries of each data domain.
Considering that Nemesys is a protocol inference method instead of an off-the-shelf fuzzing tool, we implement the method of Nemesys on top of the Snipuzz framework to infer the snippet boundaries, replacing the corresponding snippet determination method (Section 4.2).
• BooFuzz [31]. As a successor of Sulley [19], BooFuzz is an excellent network protocol fuzzer that has been used in several recent fuzzing studies [9, 37, 48]. Different from other automatic fuzzers, BooFuzz requires human-guided message segmentation strategies as inputs. In our research, we leverage this property and manually define several fuzzing strategies to enrich the benchmark evaluation.
- BooFuzz-Default. In this strategy, we set each message in the input as a complete string, that is, BooFuzz uses the whole message as a string for mutation testing.
- BooFuzz-Byte. Each byte of the message in the input is used for a mutation test individually.
- BooFuzz-Reversal. Contrary to the idea of IoTFuzzer, in this strategy we focus on mutating the non-data domain in the message, while keeping the data domain unchanged.
• Doona [44]. Doona is a fork of the Bruteforce Exploit Detector (BED) [6], which is designed to detect potential vulnerabilities related to buffers and formats in network protocols. Different from other tools, Doona does not take network communication packets as seeds. The test cases of Doona must be pre-defined for each device or protocol under test.
• Snipuzz-NoSnippet. Snipuzz uses the segmentation of message snippets to enhance the efficiency of fuzzing and the ability to find crashes. To verify whether snippet determination indeed benefits fuzzing, we implement Snipuzz-NoSnippet based on Snipuzz. Snipuzz-NoSnippet does not have the snippet determination component, and blindly mutates bytes in messages without knowledge of the responses.

Except for Doona, whose test cases are preset, all benchmark tools and Snipuzz are tested on the same input sets.
These input sets may be in different formats (e.g., BooFuzz requires the input to be set manually, and Nemesys requires the input to be in the pcap file format), but the content is the same. There are many other popular fuzzing tools that can test IoT devices via network communication, such as Peach [30] and AFLNet [33]. However, since they are grey-box fuzzers that require firmware instrumentation, it is infeasible and unfair to regard those tools as baselines for black-box schemes.

After performing fuzz testing using Snipuzz on each of the 20 IoT devices for 24 hours, we detected 13 crashes in 5 devices. As shown in Table 3, the detected crashes include 7 null pointer dereferences, 1 denial of service, and 5 unknown crashes that we further manually verified. The 13 crashes found by Snipuzz are triggered by malformed inputs. These malformed inputs break the message format in different ways, for example, by deleting placeholders, emptying the data domain, or fortuitously changing the type of a data value. Note that all the crashes identified by Snipuzz are in JSON-based devices, although we successfully conducted experiments on the 20 IoT devices with various communication formats, such as JSON, SOAP, and K-V pairs. The experiments also show that Snipuzz observes a higher number of response categories than the other fuzzers (as detailed in Section 5.3).

Null pointer dereferences. As shown in Table 3, the 7 crashes triggered by Snipuzz in TP-Link HS110 and HS100 are all caused by null pointer dereferences. After the test cases were sent to HS110 and HS100, the devices crashed and were unable to reply to any interaction. However, after a few minutes, the devices automatically restarted and recovered to the initial state. Based on the analysis of the test cases, we found that the vulnerabilities are all triggered by messages whose JSON syntax was mutated.
Put differently, when some important placeholders, such as curly braces and colons, or a part of the test message are mutated, the syntax structure and the semantic meaning of the message are broken. If the device cannot handle the mutated input message properly, the device will crash. We reported the vulnerabilities to the device vendor, TP-Link, via email on June 13, 2020. They have confirmed the vulnerability and promised to fix it through a firmware update.

Table 4: Contents of mutated messages, listed by what generated them.
Original Message: {"{"id": 0, "method": "start_cf", "params": ["4, 4, "1000, 2, 2700,100,500 ,1,255,10,5000,7,0,0,500,2,5000,1"]}"
Snipuzz: {"{"id": 0, "method": "start_cf", "params": ["4, , "1000, 2, 2700,100,500 ,1,255,10,5000,7,0,0,500,2,5000,1"]}"
IoTFuzzer: {"{"id": 0, "method": "start_cf", "params": [", 4, "1000, 2, 270000,100,500 ,1,255,10,5000,7,0,0,500,2,5000,1"]}"

Denial of service. Another interesting finding is the denial of service vulnerability detected in the Philips A60 smart bulb. After the device was tested by Snipuzz for 24 hours, Philips' official companion app could not manage the device normally. Specifically, the device cannot be found in the app, and if any further messages are sent through the app, the app keeps asking to bind the device to a device group and no further interaction is available. However, we observe that if the message packet is sent directly to the device, the device works normally. This indicates that the device does not completely crash but its service via the companion app is denied.

Unknown crashes. Snipuzz found 5 crashes on two Yeelight bulbs, YLDP05YL and YLDP13YL. The devices crashed and restarted by themselves within roughly one minute. By analyzing the test cases, we found that the crashes are due to the deletion of certain data domains, such as the nullification of parameters, marked in red in Table 4.
As the firmware of the 2 devices is not publicly available, the root cause of the vulnerability cannot be determined; however, we can still deduce that the vulnerability is due to the device reading in null values during the parsing process, causing a crash during the assignment. We also find that communication over a local network does not require any authentication, which means that the device can be crashed by any attacker in the local network. Therefore, we consider the vulnerabilities 'remotely exploitable'.

Benchmark with state-of-the-art tools. As shown in Table 3, in 24 hours of fuzz testing on each device, none of the benchmark tools found a crash except for IoTFuzzer. The other tools failed to find crashes for various reasons. Doona focuses more on the mutation of communication protocols. Further, Doona cannot be applied to all devices, which also limits its capacity. Since BooFuzz directly replaces the specified positions in the message with a preset string, it can only trigger limited types of vulnerabilities. Nemesys offers a new idea for determining message snippets. However, since it determines message snippets by the distribution of values in messages, it is difficult for Nemesys to accurately decide the boundary between data and non-data domains. Therefore, Nemesys can hardly detect vulnerabilities that can only be triggered by mutating the data or non-data domains. Snipuzz-NoSnippet, which does not apply the snippet-based mutation method used in Snipuzz, is similar to the classic fuzzer AFL [24]. Since Snipuzz-NoSnippet does not infer the structure of the message but directly uses single or multiple consecutive bytes as the unit of mutation, most of the test cases generated by Snipuzz-NoSnippet destroy the structure of the messages. Such a method rarely works on devices that require highly structured inputs. IoTFuzzer detected 2 crashes in 2 smart bulb devices, i.e., the YLDP05YL and YLDP13YL.
Due to the mutation strategy of IoTFuzzer, the malformed input provided by IoTFuzzer is obtained by emptying the data domain. According to the mutated messages listed in Table 4, the messages mutated by IoTFuzzer resemble the ones generated by Snipuzz. The mutated parts of the messages from Snipuzz and IoTFuzzer in Table 4 are all in the data domain. In terms of the effect of the mutation test, Snipuzz and IoTFuzzer achieve the same goal on these two messages. However, Snipuzz can cover the mutation space of IoTFuzzer, because IoTFuzzer only focuses on data domain mutation while Snipuzz can mutate both the data and non-data domains.

To further determine the root cause of the crash, we obtained the firmware source code of HS100 and HS110, two typical consumer-grade smart plugs manufactured by TP-Link, and conducted a case study that reflects the differences between Snipuzz and IoTFuzzer. We found that one of the crashes triggered by Snipuzz on the two devices is caused by breaking the syntax structure and mutating both data and non-data domains. More specifically, the mutated messages successfully bypassed the sanitizer and triggered the crash during function execution. We deduce that this could be caused by an error-prone third-party sanitizer (more details can be found in Appendix B). On the other hand, IoTFuzzer by design follows the grammatical rules: it gives first priority to satisfying the grammar requirements, so that test cases are not rejected by the sanitizer and each test case can reach the functional execution part of the firmware. Such a strategy constrains the test range of fuzzing and its capacity to cover the sanitization part in comparison to Snipuzz. Therefore, we argue that, considering the complexity of IoT firmware testing, a lightweight and effective black-box vulnerability detection tool, such as Snipuzz, is a pressing need.
Figure 5 shows how Snipuzz and the other seven fuzzers explored the device firmware during the first 10 minutes. Due to space limits, we only present the results of 5 devices here and plot the results of all 20 devices in Appendix A. We repeated the fuzz testing 10 times and recorded the median number of response categories discovered by each method, which indicates the coverage explored. We manually reviewed the presented response categories to remove mis-categorization caused by randomness in responses or by the response mechanism of the devices.

As shown in Figure 5, Doona can only detect a small number of response categories. Doona is a protocol-based fuzzing method, and its tests are biased towards protocol content. A mutation test on the communication protocol has a high probability of being directly rejected or ignored by the device, resulting in few categories of responses being received. We implemented three fuzzing strategies based on BooFuzz, i.e., mutating the whole message as a string, mutating each byte of the message, and mutating the non-data domain. However, the testing results indicate that all of them explored very limited categories of responses on each device. The limitation of category discovery is due to the mutation strategy of BooFuzz, which replaces the target contents with a specific pre-defined string. For example, using strings such as "/./././././././." to replace the content of messages in the different strategies (e.g., replacement of the entire string, a single byte, or a non-data domain) violates the message format, and the result can easily be rejected by the sanitizer. Therefore, most of the responses obtained by BooFuzz fall into the category of "error responses". The number of response categories explored by IoTFuzzer grows rapidly within a short period of time and then stagnates.
In the mutation stage, IoTFuzzer randomly selects a set of inputs from the original candidate inputs and randomly mutates the data domain of one or more messages. It repeats this method until the device crashes or the time limit is reached. Such a randomness-based method helps IoTFuzzer mutate and test a large number of message data domains in the original input and collect response message categories quickly in the beginning. However, the number of response categories found by IoTFuzzer soon reaches its limit because only the data domain is mutated.

On most devices, Snipuzz maintains a steady upward trend, and after a period of time the number of response categories found by Snipuzz exceeds that of IoTFuzzer. Unlike IoTFuzzer, Snipuzz mainly searches for response categories through the Snippet Determination stage. As per the message snippet exploration strategy, Snipuzz first explores as many response categories of a given message as possible. After the snippets of a message are obtained and tested by Snippet Mutation, the next message is processed in the same way until all messages in the initial message sequence have been tested. Following this method, Snipuzz may not obtain a large number of response categories in a short time. However, when Snipuzz determines message snippets, every byte in the message content is included in the test. Therefore, as shown by the bold numbers in Table 3, for 15 out of 20 devices, Snipuzz covers the largest number of response categories after 24-hour fuzz testing, compared to the other state-of-the-art IoT fuzzing tools.

On 5 devices, Snipuzz-NoSnippet collected more response categories than Snipuzz within 24 hours. The mutation method used by Snipuzz-NoSnippet is similar to the classic fuzzer AFL [24]. It directly performs mutation on a single byte or several consecutive bytes.
However, Snipuzz-NoSnippet can hardly cover response categories that are not obtained by breaking the grammatical format (e.g., data out of bounds in the data domain). Theoretically, although the Snipuzz-NoSnippet mutation method is not so efficient, it still has the capability to explore the most categories of responses.

Nemesys explores more categories of responses than BooFuzz and Doona, but does not exceed Snipuzz. The Nemesys strategy performs deterministic mutations on each data domain of the messages in turn, which makes the trend of its run-time performance similar to that of Snipuzz. However, the data domain determination strategy of Nemesys is not based on the responses from the IoT device. Thus, the distribution of byte values in messages does not help cover more response categories. Therefore, the number of response categories collected by Nemesys is limited.

It is interesting to observe that, in the case of R6400, Snipuzz also enters a stagnation after finding only a few response categories. We carefully checked the initial input message sequences and found that the average length of the messages exceeds 400 bytes, forcing Snipuzz to generate and send a large number of probe messages to determine message snippets. Therefore, in the first 10 minutes, Snipuzz was still exploring the response categories of the first few messages, so it did not exceed IoTFuzzer.

Table 5: Inference results of Snipuzz and Nemesys (average similarity, followed by an example segmentation).
Snipuzz: 87.1%; {" on ":true, " sta ":140, " bri ":254}
Nemesys: 64.5%; {"on": true , "sta": 140 , "bri" 254}
Grammar: 100.0%; {" on ": true , " sta ": 140 , " bri ": 254 }

Among all strategies, only Snipuzz and Nemesys perform semantic segmentation. To assess their performance in message snippet inference, we compare the snippets they produce during the fuzzing process with the grammar rules defined in API documents.
Specifically, for mature and popular formats, such as JSON, we establish the grammar rules as per their standard syntax; for custom formats, such as strings or custom bytes, we refer to the official API documents and define the grammar rules based on the instructions. Equation (2) quantifies the quality of snippet inference: Similarity indicates the percentage of correctly categorized bytes in a snippet-determined message, $m_s$, compared with the ground truth, $m_g$, manually extracted from the grammar rules.

$$\mathrm{Similarity} = 1 - \frac{\mathrm{Count}\left(B(m_s) \oplus B(m_g)\right)}{\mathrm{Len}(m_g)} \qquad (2)$$

where $B(\cdot)$ returns the category of each message byte as a series of "0" and "1" bits, $\mathrm{Count}(\cdot)$ counts the number of mis-categorized bytes, and $\mathrm{Len}(\cdot)$ represents the length of a message. Note that in a ground truth message, "0" indicates the non-data domain (marked blue in Table 5), while "1" indicates the data domain (marked red in Table 5). Therefore, ⊕ is the bitwise XOR operation, which yields "1" exactly where the two categorizations disagree. Following Equation (2), we compute the average similarity of the snippets (or data domains) determined by Snipuzz and Nemesys over all 235 messages obtained from the experiments. Note that during the calculation of the average similarity, if multiple snippet sets are determined for a message, we select the snippet inference with the highest similarity value; in this way, the selected snippets reflect the grammatical rules as closely as possible and maximize the performance of message semantic segmentation.

The average similarity result of Snipuzz, 87.1%, indicates that, by applying snippet inference based on the hierarchical clustering approach, Snipuzz can effectively find the grammatical rules hidden in the message. Ideally, in Snipuzz, the merging of clusters removes the influence caused by randomness in responses and by the replying message mechanism itself. Therefore, the message snippets gradually conform to the grammatical rules, which leads Snipuzz to a higher similarity result.
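The similarity measure described for Equation (2) can be illustrated with a short sketch. The helper names and the example spans below are hypothetical: the ground-truth labelling of `{"on":true}` is assumed for illustration only and is not taken from the paper's grammar rules.

```python
def byte_categories(message, data_spans):
    """Label each byte 1 if it lies inside a data-domain span, else 0."""
    labels = [0] * len(message)
    for start, end in data_spans:
        for i in range(start, end):
            labels[i] = 1
    return labels


def similarity(inferred, truth):
    """Fraction of bytes whose inferred category matches the ground truth."""
    if len(inferred) != len(truth):
        raise ValueError("label sequences must have the same length")
    # XOR is 1 exactly where the two labellings disagree.
    mismatches = sum(a ^ b for a, b in zip(inferred, truth))
    return 1 - mismatches / len(truth)


msg = '{"on":true}'
truth = byte_categories(msg, [(6, 10)])     # assumed ground truth: only `true` is data
inferred = byte_categories(msg, [(6, 11)])  # inferred snippet also swallows the `}`
print(round(similarity(inferred, truth), 3))  # 0.909, i.e. 10 of 11 bytes agree
```

A single misplaced boundary byte therefore costs one unit in the numerator, which is why snippets that absorb neighbouring placeholders lower the average similarity only moderately.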
However, we also found some differences between the results of the snippet inference method and the grammatical rules. For example, in the example shown in Table 5, the snippet inference method combines the strings belonging to the data domain in the grammatical rules (i.e., 'true', '140' and '254') with some placeholders (such as double quotes and curly brackets). After analyzing the response messages, we found that the responses obtained after destroying these data domains and after destroying the placeholders all report an invalid format. This may be because, when a format parsing error occurs in the firmware, the response does not give a detailed description of the error but instead returns a general format error.

On the other hand, Nemesys uses the distribution of value changes in the protocol to determine the boundaries of different data domains and thereby achieve the semantic segmentation of a message. The advantage of this method is that it does not require any additional information beyond the message itself, such as grammar rules or a large training data set. The average similarity result of Nemesys, 64.5%, is lower than the Snipuzz result. As the example in Table 5 shows, when segmenting messages in a format with strict syntax, such as JSON and XML, Nemesys can achieve good semantic segmentation performance, because the placeholders are usually symbols that are rarely used in data domains. This distribution of byte values enables Nemesys to effectively find the boundaries between data domains. However, in IoT devices, customized formats are prevalent. For example, the smart bulb BR30 uses custom bytes as a means of communication, where each byte corresponds to a special meaning (e.g., "0x61" represents "CHANGE_MODE" and "0x0f" represents "TRUE").
In such cases, the value distribution of characters can no longer be used as guidance for data domain determination, and thus the message segmentation determined by Nemesys is error-prone.

Snipuzz has successfully examined 20 different devices and exposed security vulnerabilities on five of them. However, there are still some limitations regarding the efficiency and scalability of Snipuzz. We discuss the limitations in this section and propose solutions as future work.

Scalability and manual effort. IoT devices can be tested by Snipuzz if valid network packets are known. In our prototype, we capture communication packets by running API programs and monitoring network communication (note that packets can also be obtained by statically analyzing API programs without running them). In the absence of API programs or documents, we can recover the message formats from the official apps of IoT devices through decompilation and taint analysis. Alternatively, we can intercept the communication between apps and IoT devices, and then recover message formats from the captured packets. This second approach is feasible: we have experimented with it on KASA, TP-Link's IoT control app, and it can be further developed for more IoT devices. However, both methods could introduce overhead and involve manual effort.

Recall from Section 4.1 that Snipuzz requires manual effort: it takes 5 man-hours per device to collect the initial seeds during the message sequence acquisition phase. The manual effort mainly consists of cleaning the packets from the API programs, which are obtained from publicly available first- and third-party resources. To mitigate this limitation when applying Snipuzz to IoT devices, techniques such as crawlers could be used in future work to automatically gather API programs associated with the IoT devices.
Moreover, the process of cleaning the packets could also be improved by pre-processing keywords through scripts to achieve automatic collection of communication packets.

Threats to validity. As Snipuzz collects initial message sequences via API programs and network sniffers, the first threat comes from the absence of API programs. In this case, we can recover message formats based on the companion apps of IoT devices (similar to IoTFuzzer) but may need more manual effort. Second, encryption in messages decreases the effectiveness of snippet determination because the semantic information could be corrupted. A potential solution to the encryption issue is to integrate decryption modules into Snipuzz. Finally, the code coverage of firmware could be subject to the accessibility of API programs, since Snipuzz can only examine the functionalities that are covered by API programs. Recombining the message snippets from different seeds to generate new valid inputs could mitigate this limitation.

Encryption. During Message Sequence Acquisition, we noticed that encryption is used to protect communication in some API programs. Encryption has no effect on the message sequence mutation process, but the snippet determination process basically fails. This is because snippet segmentation is sensitive to the position of each character, and the encryption algorithm disrupts the original format of the message. Moreover, because the response messages from the device are also encrypted, Snipuzz cannot get useful feedback from them. To address this limitation, the encryption and decryption algorithms in the API program can be integrated into Snipuzz, or the difficulties caused by encryption can be addressed from the perspective of mutation strategy design.

Coverage. The code coverage of firmware explored by Snipuzz depends on the API programs.
For example, if the API programs of a bulb only support the functionality of turning on the power, it is almost impossible to explore the functionality of adjusting the brightness by mutating the messages captured while the power is turned on. In future work, without the support of grammar, we will consider recombining the message snippets to try to generate new valid inputs. This method can help explore more firmware execution coverage beyond that of the original inputs provided.

Requirements on detailed responses. The detection effectiveness of Snipuzz depends on the quality of message snippets, which is contingent on how much information can be obtained from the responses of IoT devices. Put differently, if the IoT device does not provide responses that are detailed enough, for example by reporting all errors with a uniform message, it could be hard for Snipuzz to determine the message snippets. Fortunately, in many IoT devices, advanced error descriptions can be obtained in debug mode, which significantly improves the determination of message snippets in Snipuzz.

Snipuzz detects vulnerabilities in IoT devices in a black-box manner. Unlike existing black-box fuzzers for IoT devices, which blindly mutate messages, Snipuzz optimizes the mutation process of black-box fuzzing by utilizing responses. This feedback mechanism improves the effectiveness of bug discovery. For instance, IoTFuzzer [9] obtains the data domain, on which it performs blind mutation. Thus, IoTFuzzer lacks knowledge of the quality of the generated inputs, resulting in a waste of resources on low-quality inputs. There are also several dynamic analysis approaches focusing on the networking modules of IoT devices. For example, SPFuzz defines a new language for describing protocol specifications, protocol state transitions, and their correlations [37].
SPFuzz can ensure the correctness of the message format in the conversation state and the dependence of the protocol. IoTHunter is a grey-box approach to fuzzing the stateful protocols of IoT firmware [47]. IoTHunter can constantly switch the protocol state to perform a feedback-based exploration of IoT devices. In a recent example, AFLNet acts as a client and continuously replays variations of the original message sequence sent to the target (i.e., a server or device) [33]. AFLNet uses response codes, which are numbers indicating the execution states, to identify the execution status of targets and explore more regions of their networking modules.

Another research line for dynamic analysis of IoT devices is the use of emulators. The disadvantages of emulation are the heavy engineering effort and the need for the firmware, although the emulation of IoT firmware allows a more thorough analysis than black-box fuzzing. Two major challenges for the emulation of IoT firmware are scalability and throughput. Therefore, the efforts to improve the performance of emulation include full-system emulation [8, 27], improvement of emulation success rates [21], hardware-independent emulation [17, 38], and the combination of user- and system-mode emulation [51]. Fuzzing can be integrated into those emulation frameworks to hunt for defects in firmware [38, 51].

Static analysis of firmware is complementary to dynamic analysis. Semantic similarity is one of the major techniques that make static analysis successful. Researchers analyze semantic similarity via comparison of files and modules [13], Control Flow Graphs (CFGs) [14], parsers and complex processing logic [11], and multi-binary interactions [35]. There are also many similarity-based approaches that can detect vulnerabilities across different firmware architectures.
They usually extract various architecture-independent features from firmware for each node in a CFG to represent a function, and then check whether two functions' CFG representations are similar [15, 32].

In this paper, we have presented Snipuzz, a black-box fuzzing framework designed for detecting vulnerabilities hidden in IoT devices. Different from other black-box network fuzzers, Snipuzz uses the response messages returned by the device to establish a feedback mechanism for guiding the fuzzing mutation process. In addition, Snipuzz infers the grammatical role of each byte in the messages based on the responses from the device, so that Snipuzz can generate test cases that meet the device's grammar without the guidance of grammatical rules. We have used 20 consumer-grade IoT devices from the market to test Snipuzz, and it has successfully found 5 zero-day vulnerabilities on 5 different devices.

time on snippet determination, it discovers fewer categories than IoTFuzzer in the beginning. However, IoTFuzzer quickly reaches its peak and cannot discover new categories. On the contrary, after the stage of snippet determination, Snipuzz gradually discovers more categories than IoTFuzzer and the other baselines. A more detailed analysis can be found in Section 5.3.

The HS100 and HS110 manufactured by TP-Link are 2 classic consumer-grade smart plugs. Chen et al. [9] used the HS110 with firmware version 1.3.1 to test IoTFuzzer. The results of their experiment show that IoTFuzzer triggered a vulnerability in the device by mutating the data domain in a message (changing "light" to 0). However, in the updated version of the firmware (1.5.2), IoTFuzzer did not find any vulnerabilities but Snipuzz did. Figure 7 shows an example of the original input message and the mutated snippets (inside the red frame) in the mutated message that can trigger the vulnerability.
In this case, Snipuzz triggered a vulnerability in the firmware's input handling by breaking the JSON syntax structure of the message. The original message is intended to change some attributes (e.g., 'stime_opt' and 'wday') of a rule (indicated by 'edit_rule'). In the mutated message, Snipuzz randomly deleted some content (inside the blue frame), which breaks the JSON syntax. This may cause message-parsing or parameter-passing errors that the firmware handles incorrectly and that, consequently, crash the device.

To further determine the root cause of the crash, we obtained the firmware source code. Figure 8 shows a code snippet from the firmware that uses cJSON, a popular open-source lightweight JSON parser (5.4k stars on GitHub), to interpret input message fragments. The jalr instruction will save the result of cJson_GetObjectItem in $t9 and jump to this address unconditionally (see line 3 in Figure 8), which means the firmware will pick the value corresponding to 'schedule'. In the original message, the value corresponding to 'schedule' is a JSON object headed by 'edit_rule' (from line 4 to line 16). Note that the snippet-based mutation strategy implemented in Snipuzz can break the syntax structure and mutate both data and non-data domains. Interestingly, although removing two left curly braces breaks the JSON syntax, the cJSON parser does not detect the error, so the mutated message bypasses syntax validation and enters the functional code in the firmware. When the firmware tries to access the JSON object nested under 'schedule', i.e., the object that starts with 'edit_rule', a null pointer exception is triggered because the corresponding value is no longer a JSON object but an array.

By design, IoTFuzzer's grammar-rule-based fuzzing prioritizes satisfying the grammar requirements during mutation so that test cases are not rejected by the firmware's grammar checker.
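The contrast between structure-breaking mutation and unchecked field access can be illustrated with a minimal Python sketch. It is not the Snipuzz implementation or the firmware code: the message below is a hypothetical, simplified stand-in that only reuses the field names mentioned in the text, and all function names are assumptions made for illustration. Python's `json` module is strict and rejects the mutated message, so the value a lenient parser might hand to the firmware is modelled by the hand-written `type_confused` dictionary.

```python
import json

# Hypothetical message reusing the field names from the text
# ('schedule', 'edit_rule', 'stime_opt', 'wday'); the values are made up.
ORIGINAL = '{"schedule":{"edit_rule":{"stime_opt":0,"wday":[1,0,0,1,1,0,0]}}}'


def delete_snippet(message: str, start: int, end: int) -> str:
    """Return the message with the characters in [start, end) removed."""
    return message[:start] + message[end:]


def strictly_valid(message: str) -> bool:
    """True if a strict parser (Python's json module) accepts the message."""
    try:
        json.loads(message)
        return True
    except json.JSONDecodeError:
        return False


def unguarded_stime_opt(parsed):
    """Assumes 'schedule' and 'edit_rule' are objects, with no type checks."""
    return parsed["schedule"]["edit_rule"]["stime_opt"]


def guarded_stime_opt(parsed):
    """Checks the type at each level; returns None on any mismatch."""
    node = parsed
    for key in ("schedule", "edit_rule", "stime_opt"):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


# Structure-breaking mutation: drop the brace that opens the 'schedule' object.
brace = ORIGINAL.index("{", 1)
mutated = delete_snippet(ORIGINAL, brace, brace + 1)

# Assumed stand-in for what a lenient parser could produce:
# 'schedule' holds an array instead of an object.
type_confused = {"schedule": ["edit_rule", {"stime_opt": 0}]}
```

In this sketch, calling `unguarded_stime_opt(type_confused)` raises a `TypeError`, the Python analogue of dereferencing a value of the wrong type, whereas `guarded_stime_opt(type_confused)` returns `None`. A mutator that only changes data values would never produce `mutated`, which is why it cannot exercise this path.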
The advantage of this grammar-first approach is that each test case can reach the functional execution part of the firmware. However, in this case, grammar-rule-based fuzzing cannot cover the firmware's input-sanitizing code. To conclude, the crash has two root causes: 1) the validation of message syntax relies heavily on a third-party library; 2) the firmware does not correctly handle the null pointer exception caused by the data type mismatch. Although it is not reasonable to require a vendor to develop products purely from scratch, we argue that thorough testing and validation of open-source libraries are essential. Considering the complexity of IoT firmware testing, a lightweight and effective black-box vulnerability detection tool, such as Snipuzz, is a pressing need.

The Three Software Stacks Required for IoT Architectures
NAUTILUS: Fishing for deep bugs with grammars
GasFuzzer: Fuzzing ethereum smart contract binaries to expose gas-oriented exception security vulnerabilities
Manh-Dung Nguyen, and Abhik Roychoudhury
Coverage-based greybox fuzzing as Markov chain
Soteria: Automated IoT safety and security analysis
Towards automated dynamic analysis for Linux-based embedded firmware
IOTFUZZER: Discovering memory corruptions in IoT through app-based fuzzing
HALucinator: Firmware re-hosting through abstraction layer emulation
PIE: Parser identification in embedded systems
Difuze: Interface aware fuzzing for kernel drivers
A Large-Scale Analysis of the Security of Embedded Firmwares
Graph-based comparison of executable objects (english version)
discovRE: Efficient cross-architecture identification of bugs in binary code
Pwnie Express. 2020. What makes IoT so vulnerable to attack?
P2IM: Scalable and hardware-independent firmware testing via automatic peripheral interface modeling
Scalable graph-based bug search for firmware images
Toward the analysis of embedded firmware through automated re-hosting
FirmAE: Towards large-scale emulation of IoT firmware for dynamic analysis
NEMESYS: Network message syntax reverse engineering by analysis of the intrinsic structure of individual messages
By 2020, there will be 4 devices for every human on earth
Mirai botnet exploit weaponized to attack IoT devices via CVE-2020-5902
Smart yet flawed: IoT device vulnerabilities explained
What you corrupt is not what you crash: Challenges in fuzzing embedded devices
More than half of IoT devices vulnerable to severe attacks
ParmeSan: Sanitizer-guided greybox fuzzing
PEACH: The PEACH fuzzer platform
boofuzz: Network protocol fuzzing for humans
Cross-architecture bug search in binary executables
AFLNET: A greybox fuzzer for network protocols
Alexandru Razvan Caciulescu, and Abhik Roychoudhury
Karonte: Detecting insecure multi-binary interactions in embedded firmware
kAFL: Hardware-assisted feedback fuzzing for OS Kernels
SPFuzz: a hierarchical scheduling framework for stateful network protocol fuzzing
FirmFuzz: automated IoT firmware introspection and analysis
IoT devices will outnumber the world's population this year for the first time
Skyfire: Data-driven seed generation for fuzzing
Not all coverage measurements are equal: Fuzzing by coverage accounting for input prioritization
Wikipedia. 2021. Hierarchical clustering
Neural network-based graph embedding for cross-platform binary code similarity detection
Poster: Fuzzing iot firmware via multi-stage message generation
SGPFuzzer: A state-driven smart graybox protocol fuzzer for network protocol implementations
EcoFuzz: Adaptive energy-saving greybox fuzzing as a variant of the adversarial multi-armed bandit
AVATAR: A framework to support dynamic security analysis of embedded systems' firmwares
FIRM-AFL: High-throughput greybox fuzzing of IoT firmware via augmented process emulation

A RUNTIME PERFORMANCE