key: cord-0600065-a7qdleam
authors: Kist, Andreas M
title: Deep Learning on Edge TPUs
date: 2021-08-31
journal: nan
DOI: nan
sha: 0c26148bf87572728d4ac717f506a2dff078d694
doc_id: 600065
cord_uid: a7qdleam

Computing at the edge is important in remote settings; however, conventional hardware is not optimized for utilizing deep neural networks. The Google Edge TPU is an emerging hardware accelerator that is cost, power and speed efficient, and is available for prototyping and production purposes. Here, I review the Edge TPU platform, the tasks that have been accomplished using the Edge TPU, and which steps are necessary to deploy a model to the Edge TPU hardware. The Edge TPU is not only capable of tackling common computer vision tasks, but also surpasses other hardware accelerators, especially when the entire model can be deployed to the Edge TPU. Co-embedding the Edge TPU in cameras allows a seamless analysis of primary data. In summary, the Edge TPU is a maturing system that has proven its usability across multiple tasks.

Deep neural networks (DNNs) have revolutionized image processing and computer vision, including object recognition, image classification and semantic segmentation [1] [2] [3]. These advances have resulted in major paradigm changes, such as in healthcare [4] and self-driving cars [5]. Deep learning libraries, such as Caffe, TensorFlow and Torch, are important milestones in making deep neural networks accessible to the machine learning community. In particular, high-level packages, such as Keras and PyTorch, have lowered the threshold for utilizing DNNs. With just a few lines of code, even beginners can create, train and evaluate, for example, a multilayer perceptron without exact knowledge of the mathematical foundations. However, even with small DNNs, the advantage of dedicated hardware, commonly graphics processing units (GPUs), is clear: faster training and inference, sometimes by multiple orders of magnitude [6, 7].
This becomes especially important if DNNs are used in constrained environments, such as embedded solutions. Here, dedicated integrated circuits (ICs) would provide more efficient inference, that is, more computations using less power. One idea is to utilize field-programmable gate arrays (FPGAs) that can be flexibly programmed [8] [9] [10], and the other is to rely on dedicated ICs developed solely for DNNs. Multiple platforms are already available, such as the NVIDIA Jetson family [11], Intel Movidius VPUs [12] and Google's Edge TPUs [13], using a variety of frameworks [14]. In this overview, I highlight the usage of Edge TPUs across deep neural network architectures and emphasize the efforts of the respective authors to utilize Edge TPUs for their specific tasks.

The Edge TPU itself is an integrated circuit with a small footprint of 5 × 5 mm. The set of operations available on the Edge TPU is constantly growing [15] and is updated regularly. Figure 1 shows the number of operations currently available (July 2021). Interestingly, the relative number of operations with known limitations has remained constant over time at approximately 50 %. Recently, the introduction of long short-term memory (LSTM) cells has enabled the use of recurrent neural networks (RNNs).

With the Coral platform, Google provides a comprehensive model zoo that can be directly utilized. Figure 2 shows the tasks that were successfully accomplished using the Edge TPU. In the following, I take a closer look at each of the tasks.

In general, recent works focused on comparing the Edge TPU with other hardware accelerators in terms of accuracy, inference time and power consumption. Specifically, they focus on the compatibility of the Edge TPU with existing architectures. A comprehensive approach was provided in [17].
The authors propose a new benchmark (EDLAB) to compare hardware accelerators and show that the Edge TPU leads in terms of power consumption and is largely on par in accuracy with other hardware accelerators. Interestingly, in their study the Edge TPU is slower in inference than its competitors, which is contrary to most of the observations below. Another study performed a bottleneck analysis of the Edge TPU, tested 24 neural networks, and proposed a new framework, Mensa, that improves energy efficiency 3-fold and throughput up to 4-fold [18].

In [19], the authors tested Edge TPUs with multi-view convolutional neural networks [20], which classify objects from various views, and with specialized convolutional neural networks. They found that the Edge TPU outperformed the CPU, ARM CPU and the Intel Movidius NCS by a large margin. The Edge TPU had a latency of less than 5 ms and was highly power efficient, providing 451.8 forward passes per Watt per second [19]. However, the authors did not compare the accuracy of the Edge TPU model (uint8) to the classification accuracy of the competing platforms (float32). In [21], the authors compared different neural architectures for object classification deployed to the Edge TPU. Their results suggest that the combination of a Raspberry Pi 3 or 4 with the Edge TPU is feasible across different architectures, allowing an increase in the number of processed frames per second without a significant loss in accuracy.

The use of Siamese networks on Edge TPUs has been shown in [22], where the authors demonstrated that Edge TPUs are capable of providing inferences at 60 frames per second and that quantization does not hinder the performance. Notably, the authors even observed a small increase in performance. The analysis of audio spectrograms on Edge TPUs was assessed in [23], where the authors showed that the Edge TPU is on par in accuracy with CPU-based deep neural networks.
However, they also found that the power efficiency of Edge TPUs is superior to that of CPUs only when the complete set of model parameters is deployed to the Edge TPU [23]. A direct comparison of the NVIDIA Jetson Nano and the Edge TPU revealed that architectures that cache their complete parameter set in the Edge TPU SRAM are approximately five times faster (peak inference 417 frames per second) than their Jetson Nano counterparts [24]. Consistent with this, the inference speed of architectures with uncached parameters was of the same order as on the Jetson Nano.

In the COVID-19 pandemic, the automatic detection of face masks is important for the prevention of infections. The authors of [25] showed that a single-shot detector (SSD) [26] with a MobileNetV2 backbone [27] can be ported to the Edge TPU to classify detected faces as mask or no-mask in 6.4 ms per frame. A similar configuration was reported in [28]. Here, the authors tested the MobileNetV2-SSD architecture on the MS COCO dataset and found that the mean average precision (mAP) of the quantized Edge TPU variant was 0.2248, only slightly lower than that of the float32 competitors (Raspberry Pi 3: mAP 0.2530; with Intel Movidius NCS: 0.2459). However, the Edge TPU surpassed the competitors in inference time by providing 55 frames per second. An SSD has additionally been used with MobileNetV1 and MobileNetV2 to analyze vine trunks, where the Edge TPU served as the acceleration module [29]. The average inference time across MobileNets was approximately 20-24 ms, which the authors compared to Tiny YOLO-V3 on an NVIDIA Jetson platform (54 ms). The Edge TPU SSDs also outperformed Tiny YOLO-V3 in terms of average precision.

Real-time pose estimation is a feasible approach for providing immediate feedback to trainees in pose-relevant sports. In a recent study, the Edge TPU was used to extract pose information from golfing footage [30].
Using a combination of the Edge TPU prediction and a Savitzky-Golay filter, the authors were able to obtain a pose estimation accuracy of up to 81.2 %. Edge TPUs have also recently been used for 3D pose estimation [31]. The authors utilized the Edge TPU to allow real-time 3D pose estimation of three persons in a single camera frame at 30 Hz. In detail, each image crop took approximately 4.5 ms, and the Edge TPU-based object detection took another 20 ms in each second.

The authors of [32] ported TomoGAN [33], a generative adversarial network, to the Edge TPU to denoise X-ray images at the edge. The authors highlighted that quantizing the model decreases the structural similarity index (SSIM) [34]. However, the SSIM can be rescued by fine-tuning the model [32].

The use of Edge TPUs for semantic segmentation was first described in [35]. The authors explored different segmentation architectures: SegNet [36], DeepLabV3+ [37] and U-Net [38]. Specifically, by dynamically scaling the U-Net architecture, smaller U-Net derivatives can be completely deployed to the Edge TPU while maintaining high segmentation accuracy [35]. When using large images, the resizing operations are mapped to the CPU instead of the Edge TPU to avoid precision loss [15]. However, this means that not all operations are mapped to the Edge TPU, resulting in slow inference speeds. By using a custom upsampling algorithm consisting of tiling, upsampling and merging, the authors showed that they were able to fully deploy the network to the Edge TPU, dramatically increasing the throughput without a significant drop in segmentation accuracy [35].

As described in the previous sections, many works have ported existing architectures to the Edge TPU. However, due to the lack of adequate operations, one-to-one porting is sometimes neither possible [35] nor desired, as shown in [39].
The authors showed that an Edge TPU-specific version of the EfficientNet family that uses ordinary convolutions and the ReLU activation function is faster and more accurate than the separable convolutions and the swish activation function typically used in EfficientNets [40]. The authors of [41] developed a model that estimates the latency of an architecture deployed to the Edge TPU with a high accuracy of 97 %. For this model, the authors utilized more than 423,000 unique convolutional neural networks (CNNs), leading to a reliable estimate that allows rapid evaluation of architectural design choices. Accelerator-aware optimization is therefore key for future DNN architectures that are efficient on the Edge TPU.

The deployment of DNNs to the Edge TPU is a multistep process (Figure 3). First, a TensorFlow or Keras model is converted to the TFLITE format. As the model will be quantized (int8 or uint8), the model should either be trained in quantization-aware mode or be quantized after training, where the former is preferred for best performance. In this step, it may be necessary to provide a representative dataset for effective quantization. Next, the Edge TPU compiler compiles the TFLITE file into a special Edge TPU TFLITE format, deciding which operations are mapped to the Edge TPU and which to the CPU. Even if all operations can be mapped to the Edge TPU, weights and parameters may have to be transferred to the Edge TPU during inference because of its limited SRAM (see also Table 2). Although deployment and inference are platform independent, the Edge TPU compiler needs a UNIX environment. However, one can utilize the Google Colab platform to compile TFLITE files.

Typically, the purpose of the Edge TPU is inference. Nevertheless, there are two options to retrain the DNN on the Edge TPU to allow transfer learning [42]. One method is the retraining of the last layer using backpropagation and cross entropy.
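As a rough illustration of this first option, the following sketch retrains only a final softmax layer on fixed embeddings by gradient descent on the cross-entropy loss. It is plain Python and not the Coral on-device API; the toy embeddings, labels, learning rate and function names (`softmax`, `forward`, `train_last_layer`) are assumptions made for this example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward(weights, biases, embedding):
    """Class probabilities of the final dense layer for one embedding."""
    logits = [
        sum(w * x for w, x in zip(row, embedding)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(logits)


def train_last_layer(embeddings, labels, num_classes, lr=0.5, epochs=200):
    """Fit only the last layer; the embedding extractor stays frozen."""
    dim = len(embeddings[0])
    weights = [[0.0] * dim for _ in range(num_classes)]
    biases = [0.0] * num_classes
    losses = []
    for _ in range(epochs):
        epoch_loss = 0.0
        for x, label in zip(embeddings, labels):
            probs = forward(weights, biases, x)
            epoch_loss += -math.log(probs[label])  # cross entropy
            for c in range(num_classes):
                # Gradient of cross entropy w.r.t. logit c: p_c - y_c
                grad = probs[c] - (1.0 if c == label else 0.0)
                biases[c] -= lr * grad
                for i in range(dim):
                    weights[c][i] -= lr * grad * x[i]
        losses.append(epoch_loss / len(embeddings))
    return weights, biases, losses


# Hypothetical 2-D embeddings of two classes (illustrative values only).
EMBEDDINGS = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]
LABELS = [0, 0, 1, 1]

weights, biases, losses = train_last_layer(EMBEDDINGS, LABELS, num_classes=2)
print(f"mean loss: first epoch {losses[0]:.3f}, last epoch {losses[-1]:.3f}")
```

In this sketch only the last layer's weights and biases change, which is why the feature-extracting part of the network could remain a fixed, quantized model.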
The other method uses weight imprinting. Here, the embedding of the penultimate layer is used to compute the new weights in the final model layer.

Edge TPUs can be easily incorporated on PCBs and interfaced via PCIe and USB 2.0. Multiple companies have already included Edge TPUs in network-attached storage (NAS) or telematic solutions such as network switches. A very promising application is the integration of Edge TPUs in cameras. These so-called smart cameras allow the direct processing of acquired images and can provide not only the raw image but also direct information about what is in the image. Table 3 provides an overview of commercially available cameras that contain an Edge TPU. The only system that has an industry-standard lens mount (C-mount) is the Vision AI platform. The MP Cam and JeVois cameras are prototyping platforms, whereas the Darcy camera comes in a production-friendly package.

The Edge TPU is a powerful platform, especially for accelerating inference on the edge and in remote settings. I envision that an ideal application for the Edge TPU is the pure inference of large datasets, for example for data categorization by providing embeddings. Furthermore, embedding the Edge TPU together with adjacent hardware, such as cameras, is highly promising, and the first approaches are already commercially available. This close proximity to the camera hardware enables a variety of image preprocessing steps, such as image denoising and restoration, super-resolution applications, and camera setting adjustments. Ideally, the Edge TPU allows the direct analysis of primary data and only transmits the inference results to downstream receivers.
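Both weight imprinting and the embedding-based data categorization envisioned above rest on comparing embeddings. The following minimal sketch shows one common formulation of weight imprinting, assumed here because the text does not spell out the details: the L2-normalized embeddings of a new class are averaged and renormalized to form that class's weight vector, and a query is assigned to the class with the highest cosine similarity. The embedding values, class names and function names are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [v / norm for v in vec]


def imprint_class_weights(embeddings):
    """Average the normalized embeddings of one class and renormalize."""
    normalized = [l2_normalize(e) for e in embeddings]
    dim = len(normalized[0])
    mean = [sum(e[i] for e in normalized) / len(normalized) for i in range(dim)]
    return l2_normalize(mean)


def classify(embedding, class_weights):
    """Return the label whose imprinted weights are most similar."""
    query = l2_normalize(embedding)
    scores = {
        label: sum(q * w for q, w in zip(query, weights))
        for label, weights in class_weights.items()
    }
    return max(scores, key=scores.get)


# Hypothetical penultimate-layer embeddings for two new classes.
class_weights = {
    "mask": imprint_class_weights([[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]),
    "no_mask": imprint_class_weights([[0.1, 0.9, 0.2], [0.0, 0.7, 0.3]]),
}

print(classify([0.7, 0.2, 0.05], class_weights))  # prints "mask"
```

No gradient computation is involved in this formulation, so a new class can be added from a handful of examples.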
[1] Deep learning
[2] The deep learning revolution
[3] A review on deep learning techniques applied to semantic segmentation
[4] A guide to deep learning in healthcare
[5] Deep learning for self-driving cars: Chances and challenges
[6] Theano: Deep learning on gpus with python
[7] Benchmarking TPU, GPU, and CPU Platforms for Deep Learning
[8] Dlau: A scalable deep learning accelerator unit on fpga
[9] Optimizing fpga-based accelerator design for deep convolutional neural networks
[10] Fpga-based accelerators of deep learning networks for learning and classification: A review
[11] A survey on optimized implementation of deep learning models on the nvidia jetson platform
[12] An accelerated prototype with movidius neural compute stick for real-time object detection
[13] Taking ai to the edge: Google's tpu now comes in a maker-friendly package
[14] Characterizing the deployment of deep neural networks on commercial edge devices
[15] TensorFlow models on the Edge TPU
[16] Architectural Analysis of Deep Learning on Edge Accelerators
[17] EDLAB: A Benchmark for Edge Deep Learning Accelerators
[18] Mitigating Edge Machine Learning Inference Bottlenecks: An Empirical Study on Accelerating Google Edge Models
[19] Artificial Vision on Edge IoT Devices: A Practical Case for 3D Data Classification
[20] Multi-view Convolutional Neural Networks for 3D Shape Recognition
[21] Performance Analysis of Deep Neural Networks for Object Classification with Edge TPU
[22] Siamese Networks for Few-Shot Learning on Edge Embedded Devices
[23] Scaling Spectrogram Data Representation for Deep Learning on Edge TPU
[24] Benchmarking Modern Edge Devices for AI Applications
[25] Real-time Mask Detection on Google Edge TPU
[26] Ssd: Single shot multibox detector
[27] Proceedings of the IEEE conference on computer vision and pattern recognition
[28] Evaluation of Deep Learning Accelerators for Object Detection at the Edge
[29] Visual Trunk Detection Using Transfer Learning and a Deep Learning-Based Coprocessor
[30] Applying Pose Estimation to Predict Amateur Golf Swing Performance Using Edge Processing
[31] Real-Time Multi-View 3D Human Pose Estimation using Semantic Feedback to Smart Edge Sensors
[32] Scientific Image Restoration Anywhere
[33] TomoGAN: low-dose synchrotron x-ray tomography with generative adversarial networks: discussion
[34] Image quality assessment: from error visibility to structural similarity
[35] Efficient biomedical image segmentation on edgetpus at point of care
[36] Segnet: A deep convolutional encoder-decoder architecture for image segmentation
[37] Rethinking atrous convolution for semantic image segmentation
[38] U-net: Convolutional networks for biomedical image segmentation
[39] Accelerator-aware neural network design using automl
[40] EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
[41] An evaluation of edge tpu accelerators for convolutional neural networks
[42] TensorFlow models on the Edge TPU - transfer learning on device

AMK thanks Michael Döllinger, Tobias Schraut and René Groh for their critical comments on the manuscript.