key: cord-200354-t20v00tk authors: Miya, Taichi; Ohshima, Kohta; Kitaguchi, Yoshiaki; Yamaoka, Katsunori title: Experimental Analysis of Communication Relaying Delay in Low-Energy Ad-hoc Networks date: 2020-10-29 journal: nan DOI: nan sha: doc_id: 200354 cord_uid: t20v00tk

In recent years, more and more applications use ad-hoc networks for local M2M communications, but in some cases, such as WSNs, the software processing delay induced by packet relaying may not be negligible. In this paper, we planned and carried out a delay measurement experiment using Raspberry Pi Zero W. The results demonstrated that, in low-energy ad-hoc networks, the processing delay of the application is always too large to ignore; it is at least ten times greater than that of kernel routing and corresponds to 30% of the transmission delay. Furthermore, if the task is CPU-intensive, such as packet encryption, the processing delay can be greater than the transmission delay, and its behavior is represented by a simple linear model. Our findings indicate that the key factor for achieving QoS in ad-hoc networks is appropriate node-to-node load balancing that takes into account the CPU performance and the amount of traffic passing through each node.

An ad-hoc network is a self-organizing network that operates independently of pre-existing infrastructures, such as wired backbone networks or wireless base stations, by having each node inside the network act as a repeater. It is a kind of temporary network that is not intended for long-term operation. Every node of an ad-hoc network needs to tolerate dynamic topology changes and be able to organize the network autonomously and cooperatively. Because of these characteristics, since the 1990s, ad-hoc networks have played an important role as a means of instant communication in environments where network infrastructure is weak or nonexistent, such as developing countries, disaster areas, and battlefields.
However, in recent years, ad-hoc networks have also become a hot topic in urban areas where broadband mobile communication systems are well developed and always available. More and more applications use ad-hoc networks for local M2M communications, especially in key technologies that are expected to play a vital role in future society, such as intelligent transportation systems (ITS) supporting autonomous driving, cyber-physical systems (CPS) like smart grids, wireless sensor networks (WSN), and IoT platforms. As communication entities shift from humans to things, network infrastructures tend to require stricter delay guarantees, and ad-hoc networks are no exception.

There have been many prior studies on delay-aware communication in ad-hoc networks [1]-[4]. Most of these focus on the link delay, and only a few consider both node and link delays [1], [2]. However, in some situations where power consumption is severely limited (e.g., WSNs), the communication relaying cost of small devices with low-power processors may not be negligible relative to the end-to-end delay of each communication. It is therefore necessary to determine, on the basis of actual data measured on wireless ad-hoc networks, how much the link and node delays each account for in the end-to-end delay.

In the field of wired networks, there have been many studies reporting measurements of packet processing delay as well as various proposals for performance improvement [5]-[10]. In addition, best practices for QoS measurement have been discussed in the IETF [11]. In the past, measurement experiments on ASIC routers were carried out to benchmark routers working on ISP backbones [5]-[7]; in contrast, since software routers have emerged as a hot topic in the last few years, recent studies mainly concentrate on bottleneck analysis of the Linux kernel's network stack [8]-[10].
There has also been a study focusing on the processing delay caused by low-power processors, assuming interconnection among small robots [12]. However, as far as we know, no similar measurement exists in the field of wireless ad-hoc networks. Consequently, although many processing delay models have been considered so far, e.g., simple linear approximation [13] or queueing model-based nonlinear approximation [14], it is hard to determine which one is the most reasonable for wireless ad-hoc networks.

In this work, we analyze the communication delay in an ad-hoc network through a practical experiment using Raspberry Pi Zero W. We assume an energy-limited ad-hoc network composed of small devices with low-power processors. Our goal is to support the design of QoS algorithms for ad-hoc networks by clarifying the impact of software packet processing on the end-to-end delay and by presenting a general delay model to which the measured delay can be fitted. This is an essential task for future ad-hoc networks and their related technologies. First, we briefly describe the structure of the Linux kernel network stack in Sect. II. We explain the details of our measurement experiment in Sects. III and IV and report the results in Sect. V. We conclude in Sect. VI with a brief summary and mention of future work.

In this section, we present a brief description of the Linux kernel's standard network stack from the viewpoints of the packet receiving and sending sequences. Figure 1 shows the flow of packets in the network stack from the perspective of packet queueing. First, in preparation for receiving packets, the NIC driver allocates memory in RAM that can store a few packets and stores the buffer addresses in packet descriptors (Rx descriptors). The Rx ring buffer is a descriptor ring located in RAM, and the driver notifies the NIC of the head and tail addresses of the ring.
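The descriptor ring described above behaves like a fixed-size circular FIFO. The following minimal Python sketch models it under simplifying assumptions: the class name `DescriptorRing` and its methods are hypothetical, each slot holds the frame bytes directly rather than a DMA address, and a frame that arrives when no free descriptor is left is simply counted as dropped.

```python
from typing import Iterator, List, Optional


class DescriptorRing:
    """Toy model of a fixed-size descriptor ring used as a FIFO (illustrative only)."""

    def __init__(self, size: int) -> None:
        self.slots: List[Optional[bytes]] = [None] * size
        self.head = 0     # next descriptor the driver will consume
        self.tail = 0     # next descriptor the NIC will fill
        self.count = 0    # number of filled descriptors
        self.dropped = 0  # frames that found no free descriptor

    def nic_write(self, frame: bytes) -> bool:
        """Producer side: store a frame at the tail, or drop it if the ring is full."""
        if self.count == len(self.slots):
            self.dropped += 1
            return False
        self.slots[self.tail] = frame
        self.tail = (self.tail + 1) % len(self.slots)  # wrap around at the end
        self.count += 1
        return True

    def driver_poll(self) -> Iterator[bytes]:
        """Consumer side: yield frames in arrival order until the ring is empty."""
        while self.count:
            frame = self.slots[self.head]
            self.slots[self.head] = None  # hand the descriptor back as free
            self.head = (self.head + 1) % len(self.slots)
            self.count -= 1
            yield frame
```

Because head and tail only ever advance modulo the ring size, every frame is delivered in exactly the order it was written, which is the "treats all arriving packets equally" property of a plain FIFO.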
The NIC then fetches some unused descriptors by direct memory access (DMA) and waits for packets to arrive. The workflow after a packet arrives is as follows. Note that this sequence describes the receive mechanism called New API (NAPI), which is supported in Linux kernel 2.6 and later.
i) Once a packet arrives, the NIC writes it to RAM as an sk_buff structure via DMA, referring to the previously cached Rx descriptors, and issues a HardIRQ upon completion.
ii) The IRQ handler that receives the HardIRQ adds the device's NAPI context to the poll list of a specific CPU core via napi_schedule() and then raises a SoftIRQ so that the remaining work is done outside the interrupt context.
iii) The SoftIRQ scheduler calls the handler net_rx_action() at an appropriate time.
iv) net_rx_action() calls poll(), which is implemented by the driver rather than the kernel, for each entry in the poll list.
v) poll() fetches the sk_buff indirectly via the ring and passes it up to the application on the upper layer. At this point, packet data is transferred from RAM to RAM; that is, the data is copied by memcpy() from kernel-space memory to the receiving socket buffer in user space. This memory copy is repeated until the poll list becomes empty.
vi) The application takes the payload from the socket buffer by calling recv(). This operation is asynchronous with the above kernel-space workflow. The packet receiving sequence is completed when all the payloads have been retrieved.

In the packet sending sequence, packets basically follow the reverse path of the receiving sequence, but they are stored in a buffer called QDisc before being written to the Tx ring buffer (Fig. 1). The ring buffer is a simple FIFO queue that treats all arriving packets equally. This design simplifies the implementation of the NIC driver and allows it to process packets quickly.
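The receive steps i) to v) can be illustrated with a toy Python model. It is a sketch, not kernel code: the class and method names are hypothetical, and it assumes the usual NAPI behavior that, once polling has been scheduled, later arrivals are picked up by poll() without raising another HardIRQ until the ring has been drained. Step vi), the application's asynchronous recv(), is not modeled.

```python
from collections import deque
from typing import Deque, List


class ToyNapiDevice:
    """Toy model of NAPI-style receive handling (hypothetical, not kernel code)."""

    def __init__(self) -> None:
        self.rx_ring: Deque[bytes] = deque()
        self.poll_scheduled = False  # True while the device sits on the poll list
        self.hard_irqs = 0           # number of HardIRQs raised so far
        self.delivered: List[bytes] = []

    def packet_arrives(self, frame: bytes) -> None:
        # Step i): the frame lands in the ring; a HardIRQ fires only if no poll is pending.
        self.rx_ring.append(frame)
        if not self.poll_scheduled:
            self.hard_irqs += 1
            # Step ii): the IRQ handler schedules polling (napi_schedule() in the kernel).
            self.poll_scheduled = True

    def softirq(self) -> None:
        # Steps iii)-v): net_rx_action() invokes the driver's poll(), which drains the ring.
        if not self.poll_scheduled:
            return
        while self.rx_ring:
            self.delivered.append(self.rx_ring.popleft())
        # Polling is complete, so the next arrival will raise a new HardIRQ.
        self.poll_scheduled = False
```

Feeding three frames before the SoftIRQ runs yields a single HardIRQ, which is the point of the interrupt-then-poll design: under load, one interrupt is amortized over a burst of packets.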
QDisc is the Linux kernel's abstraction of a traffic queue; it makes it possible to implement queueing strategies more complex than FIFO without modifying the existing code of the kernel network stack or the drivers. QDisc supports many queueing strategies; by default, it runs in pfifo_fast mode. If a packet cannot be added because QDisc has no free space, the packet is pushed back to the upper-layer socket buffer.

As discussed in Sect. I, the goal of this study is to evaluate, on the basis of actual measurements, the impact of the software packet processing induced by packet relaying on the end-to-end delay, assuming an ad-hoc network of small devices with low-power processors. Figure 2 shows our experimental environment, whose details are described in Sect. IV. We classify communication delays as follows. Both processing delay and queueing delay correspond to the application delay in a broad sense.
• End-to-end delay: Total of node delays and link delays
• Node delay: Sum of processing delay, queueing delays, and any processing delays occurring in the network stack

The proxy node (Fig. 2) relays packets using the three methods below, and we evaluate the effect of each on the end-to-end delay. By comparing the results of OLSR and AT, we can isolate the delay caused by packets passing through the network stack.
• Kernel routing (OLSR): The proxy relays packets by kernel routing based on the OLSR routing table. In this case, the relaying process is completed in kernel space because packets are handled entirely within L3 of the network stack. Accordingly, both processing delay and queueing delay defined above become zero, and node delay is purely equal to the processing delay on the network stack in the kernel space.
• Address translation (AT): The proxy works as a TCP/UDP proxy, and all packets are raised to the application running in the user space.
The application simply relays packets by switching sockets, which is equivalent to a fixed-length header translation.
• Encryption (Enc): The proxy works as a TCP/UDP proxy. In addition to the AT processing, the application encrypts payloads using 128-bit AES in CTR mode, so that the relaying load depends on the payload size.

For each relaying method, we conduct measurements while varying the following conditions:
• Payload size
• Packets per second (pps)
• Additional CPU load (stress)

We express all the results as multiple percentile values in order to remove delay spikes. Because the experiment takes several days, we also record the RSSI of the ad-hoc network, including the five surrounding channels.

In this section, we explain the technical details of the experimental environment and measurement programs. We use three Raspberry Pi Zero Ws (see Table I for the hardware specs). The Raspberry Pis run Raspbian with Linux kernel 4.19.97+. We use OLSR (RFC 3626), a proactive routing protocol, with olsrd as its implementation. Since all three nodes are at fixed locations, using a reactive routing protocol such as AODV instead of OLSR would only replace OLSR's periodic Hello messages with the periodic RREQs triggered by route cache expiry; that is, in this experiment, whether the protocol is proactive or reactive does not have a significant impact on the final results. The ad-hoc network uses channel 9 (2.452 GHz) of IEEE 802.11n, with transmission power fixed to -31 dBm and a bandwidth of 20 MHz. As WPA (TKIP) and WPA2 (CCMP) do not support ad-hoc mode, the network is not encrypted. Although the three nodes could form an OLSR mesh because they are physically close to each other, we use Netfilter on the sender and receiver to drop OLSR Hello messages and ARP responses coming from each other, so that the network topology becomes a logically inline single-hop network, as shown in Fig. 2.
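With the logically inline topology of Fig. 2, the delay classification given earlier reduces to two link delays plus the node delay of the proxy. The following Python sketch spells out that bookkeeping. The class and function names are hypothetical, all values are in microseconds, and the ratio function assumes that the node-to-link delay ratio (NLR) discussed with Fig. 8 divides the total node delay by the summed link delay.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class NodeDelay:
    """Delay components of one relaying node, in microseconds."""
    processing_us: float  # application processing (zero for kernel routing)
    queueing_us: float    # waiting in socket buffers (zero for kernel routing)
    stack_us: float       # processing inside the kernel network stack

    def total(self) -> float:
        return self.processing_us + self.queueing_us + self.stack_us


def end_to_end_us(link_delays_us: Iterable[float], nodes: Iterable[NodeDelay]) -> float:
    """End-to-end delay: sum of all link delays plus sum of all node delays."""
    return sum(link_delays_us) + sum(n.total() for n in nodes)


def nlr(link_delays_us: Iterable[float], nodes: Iterable[NodeDelay]) -> float:
    """Node-to-link delay ratio: total node delay divided by total link delay."""
    return sum(n.total() for n in nodes) / sum(link_delays_us)
```

For OLSR, processing_us and queueing_us are both zero, so the proxy's node delay reduces to stack_us, matching the definition of kernel routing above. Any numbers passed to these functions in practice would have to come from the measurements, not from this sketch.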
We use iperf as a traffic generator and measure UDP performance while it transmits packets from the sender to the receiver via the proxy. iperf embeds two timestamps and a packet ID in the first 12 bytes of the UDP data section (Fig. 3), and the measurement programs we implement, described below, use this ID to identify each packet. The rest of the payload is random data that iperf generates from /dev/urandom when it starts, and the same sequence is embedded in all packets.

We create a loadable kernel module using Netfilter to measure the queueing delay in the receiving and sending UDP socket buffers. Its workflow is as follows: the module hooks received packets with NF_INET_PRE_ROUTING and sent packets with NF_INET_POST_ROUTING (Fig. 1), retrieves the packet IDs marked by iperf by indirectly referencing the sk_buff structure, and writes them to the kernel ring buffer via printk() together with a timestamp obtained by ktime_get().

The proxy program is the application running in the user space. It creates AF_INET sockets between the sender and proxy as well as between the proxy and receiver, and translates IP addresses and port numbers by switching sockets. Furthermore, it records timestamps obtained by clock_gettime() immediately after calling recv() and sendto(), and, in the Enc method, encrypts each payload while leaving the first 12 bytes of iperf metadata unmodified. The above applies to the UDP proxy; for TCP, we simply use socat as the proxy. As controlled user-space noise, we run a dummy process whose CPU utilization is capped by cpulimit, in order to investigate its impact on the node delay.

We performed the delay measurement experiments under the conditions shown in Table II using the methods described in the previous section. Due to space constraints, we omit the results of the preliminary experiment.
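The packet-ID matching that the Netfilter module's log makes possible can be sketched in Python as follows. This assumes iperf2's UDP header layout (a signed 32-bit packet ID followed by 32-bit seconds and microseconds, in network byte order) and assumes the hook timestamps have already been parsed from the kernel log into dictionaries keyed by packet ID; the printk() log format and the function names are hypothetical.

```python
import struct
from typing import Dict, List, Tuple

# Assumed iperf2 UDP header: int32 packet ID, uint32 tv_sec, uint32 tv_usec (big-endian).
IPERF_HEADER = struct.Struct("!iII")


def iperf_packet_id(udp_payload: bytes) -> int:
    """Extract the packet ID from the first 12 bytes of an iperf UDP payload."""
    packet_id, _tv_sec, _tv_usec = IPERF_HEADER.unpack_from(udp_payload, 0)
    return packet_id


def hook_delays_ns(pre_routing: Dict[int, int],
                   post_routing: Dict[int, int]) -> List[Tuple[int, int]]:
    """Pair hook timestamps (ns) by packet ID and return (id, post - pre) in ID order.

    Packets seen at only one hook (e.g. lost inside the proxy) are skipped.
    """
    common = pre_routing.keys() & post_routing.keys()
    return [(pid, post_routing[pid] - pre_routing[pid]) for pid in sorted(common)]
```

Each returned difference is the time a packet spent between the two hooks on the proxy; the user-space timestamps taken after recv() and sendto() could be joined on the same ID to split that interval further.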
Note that all experiments were carried out at the author's home because, following the Japanese government's declaration of a COVID-19 state of emergency, we had to observe the "Stay home" initiative unless absolutely necessary.

The experiment was divided into nine measurements. Figure 4a shows the time variation of RSSI during a measurement. We were unable to obtain SNRs owing to the specifications of the Wi-Fi driver, so the noise floors are unknown, but the RSSIs of the ESSIDs observed in the five surrounding channels were all below -80 dBm. The RSSI variations were also within a range that did not affect the modulation and coding scheme (MCS) [15]; therefore, the link quality appears to have been sufficiently high throughout all measurements.

Figures 4b, 4c, and 4d show the average time variations of node delay under the condition of 1000 bytes, 200 pps, and 0% stress. The blue highlighted bars indicate upper outliers (delay spikes) detected with a Hampel filter (σ = 3). There were 53 outliers in OLSR, 115 in AT, and 9 in Enc. In general, when the CPU receives periodic interrupts (e.g., routing updates, SNMP requests, or RAM garbage collection), packet forwarding is paused temporarily, so periodic delay spikes can be observed in the end-to-end delay. This phenomenon is called the "coffee-break effect" [7] and has been mentioned in several references [5], [8], [9]. As the AT results (Fig. 4c) show, in low-energy ad-hoc networks, CPU robbing by other processes, as in the coffee-break effect, evidently had a significant impact on the communication delay. Incidentally, there were fewer spikes under both 1) OLSR and 2) Enc than under AT, for the following reasons. 1) Since the packet forwarding was completed within the kernel space, node delay was less susceptible to applications running in the user space.
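Only the threshold (σ = 3) of the Hampel filter is given above. The following Python sketch shows one common formulation under stated assumptions: a sliding window of 5 samples on each side (our choice, not the paper's), the median absolute deviation (MAD) scaled by 1.4826 as a robust estimate of the standard deviation, and only upward deviations reported, since the interest is in delay spikes. The function name is hypothetical.

```python
from statistics import median
from typing import List, Sequence

MAD_SCALE = 1.4826  # makes the MAD a consistent estimate of sigma for Gaussian data


def hampel_upper_outliers(x: Sequence[float], half_window: int = 5,
                          n_sigma: float = 3.0) -> List[int]:
    """Return indices whose value exceeds the local median by more than n_sigma scaled MADs."""
    out = []
    for i, v in enumerate(x):
        lo, hi = max(0, i - half_window), min(len(x), i + half_window + 1)
        window = x[lo:hi]  # neighbourhood around sample i (truncated at the edges)
        med = median(window)
        mad = MAD_SCALE * median(abs(w - med) for w in window)
        # Windows with zero MAD are skipped so that tiny deviations are not flagged.
        if mad > 0 and v - med > n_sigma * mad:
            out.append(i)
    return out
```

Because both the centre and the scale are medians, a single large spike barely moves them, so the spike itself stands far outside the threshold; a mean/standard-deviation rule would be partly masked by the spike it is trying to detect.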
2) Since payload encryption was overwhelmingly CPU-intensive, the influence of other applications was hidden and difficult to observe in the node delay.

Figures 5a and 5b show the jitter of the one-way communication delay. Lines represent the average values, and the shaded areas span from the minimum to the maximum. There were no significant differences between OLSR and AT, which suggests that lifting packets to the application layer does not affect jitter. Jitter increased in proportion to the payload size only in the case of Enc. Similarly, only in the case of Enc with 200 pps or more did the packet loss rate tend to increase with payload size, following a logarithmic curve as seen in Fig. 5c; in all other cases, no packet loss occurred regardless of the conditions.

Figure 6 shows how the node delay varies under several conditions, and Fig. 7 shows its empirical CDF. According to these figures, in the cases of OLSR and AT, the delay was nearly constant irrespective of pps and stress. There was a correlation between the variation and pps in OLSR but not in AT; this suggests that application-level packet forwarding is less stable than kernel routing from the perspective of node delay. In the case of Enc, the processing delay rose to the millisecond order, grew approximately linearly with the payload size, and showed a large overall variance. In addition, the curves tended to become smoother as the pps increased; this is because packet encryption takes up more CPU time, which makes the influence of other processes less conspicuous. It also appears that the higher the pps, the lower the average delay (Figs. 6a and 6b), and that the delay variance decreases around 1200 bytes (Fig. 6c), but the causes of these remain unknown and require further investigation.
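The approximately linear growth of the Enc delay with payload size can be summarized by an ordinary least-squares fit of d = a + b·s, where s is the payload size and d the node delay. The following Python sketch computes the intercept a and slope b directly from the closed-form solution; the function names are hypothetical, and the sample data used to exercise it are made up, not measured values from this study.

```python
from typing import Sequence, Tuple


def fit_linear(sizes: Sequence[float], delays: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for delay = a + b * size; returns (a, b)."""
    if len(sizes) != len(delays) or len(sizes) < 2:
        raise ValueError("need at least two paired samples")
    n = len(sizes)
    mean_x = sum(sizes) / n
    mean_y = sum(delays) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    if sxx == 0:
        raise ValueError("payload sizes must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, delays))
    slope = sxy / sxx                  # per-byte cost
    return mean_y - slope * mean_x, slope  # (fixed per-packet cost, per-byte cost)


def predict(a: float, b: float, size: float) -> float:
    """Predicted node delay for a given payload size under the fitted model."""
    return a + b * size
```

The intercept a plays the role of the fixed relaying cost that AT also pays, while the slope b captures the payload-dependent encryption work; fitting the percentile values rather than raw samples would keep delay spikes from biasing the slope.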
One thing is certain: on the Raspberry Pi, pulling packets up to the application through the network stack adds a delay of more than 100 microseconds.

Figure 8 shows the breakdown of the end-to-end delay and the node-delay-to-link-delay ratio (NLR). As shown in Fig. 2, in this experimental environment the end-to-end delay includes two link delays, and the link delay shown in Fig. 8 is their sum. The link delay was calculated from the effective throughput reported by iperf. As iperf does not support pps as an option, we achieved the target pps by adjusting the amount of transmitted traffic accordingly. The results showed that, in the cases of OLSR and AT, the NLR was almost constant with respect to the payload size, while in Enc it increased approximately linearly. The NLR was less than 5% in OLSR, whereas in AT it was around 30%, which cannot be considered negligible. Furthermore, in Enc, node delay exceeded link delay when the payload size was over 1200 bytes.

In this work, we have designed and conducted an experiment to measure the software processing delay caused by packet relaying. The experimental environment is based on an OLSR ad-hoc network composed of Raspberry Pi Zero Ws. The results were qualitatively explainable and suggest that, in low-energy ad-hoc networks, there are situations where the processing delay cannot be ignored.
• The relaying delay of kernel routing is usually negligible, but when relaying is handled by an application, the delay can be more than ten times greater, however simple the task.
• If an application performs CPU-intensive tasks such as encryption or full translation of protocol stacks, the delay increases according to the linear model and can exceed the link's transmission delay.

For this reason, node-to-node load balancing that considers CPU performance or the amount of passing traffic could be extremely useful for achieving delay-guaranteed routing in ad-hoc networks.
Particularly in heterogeneous ad-hoc networks (HANETs), where nodes have different hardware specifications, the accuracy of relay node selection would have a significant impact on the end-to-end delay. As we did not take any noise countermeasures in this experiment, our future work will involve similar measurements in an anechoic chamber to reduce noise from external radio waves, and an investigation of the differences in results.

References
[1] The upper limit of flow accommodation under allowable delay constraint in HANETs
[2] A unified solution for gateway and in-network traffic load balancing in multihop data collection scenarios
[3] QoS-aware routing based on bandwidth estimation for mobile ad hoc networks
[4] QoS based multipath routing in MANET: A cross layer approach
[5] Measurement and analysis of single-hop delay on an IP backbone network
[6] DX: Latency-based congestion control for datacenters
[7] Experimental assessment of end-to-end behavior on Internet
[8] Measurement of processing and queuing delays introduced by an open-source router in a single-hop network
[9] A study of networking software induced latency
[10] Scheme to measure packet processing time of a remote host through estimation of end-link capacity
[11] A One-Way Delay Metric for IP Performance Metrics (IPPM)
[12] Real-time Linux communications: an evaluation of the Linux communication stack for real-time robotic applications
[13] Characterizing network processing delay
[14] Processor-sharing queues: Some progress in analysis
[15] IEEE 802.11n/ac data rates under power constraints