title: Robotic Autonomous Trolley Collection with Progressive Perception and Nonlinear Model Predictive Control
authors: Xiao, Anxing; Luan, Hao; Zhao, Ziqi; Hong, Yue; Zhao, Jieting; Chen, Weinan; Wang, Jiankun; Meng, Max Q.-H.
date: 2021-10-13

Autonomous mobile manipulation robots that can collect trolleys are widely used to liberate human resources and fight epidemics. Most prior robotic trolley collection solutions only detect trolleys with 2D poses or rely on specific markers, and they lack a formal design of planning algorithms. In this paper, we present a novel mobile manipulation system with applications in luggage trolley collection. The proposed system integrates a compact hardware design with a progressive perception and planning framework, enabling the system to collect trolleys efficiently and robustly in dynamic and complex environments. For perception, we first develop a 3D trolley detection method that combines object detection and keypoint estimation. Then, docking at short distances is achieved with an accurate point cloud plane detection method and a novel manipulator design. On the planning side, we formulate the robot's motion planning under a nonlinear model predictive control framework with control barrier functions, which improves obstacle avoidance while keeping the target in the sensors' field of view at close distances. We demonstrate our design and framework by deploying the system on actual trolley collection tasks, and their effectiveness and robustness are experimentally validated.

Robots have become popular in people's lives because they can complete tedious and complex tasks autonomously or collaboratively. In this paper, we discuss a robotic system designed for autonomous trolley collection at airports. At airports, passengers usually use luggage trolleys to carry luggage between gates and the arrival/departure halls. For example, Hong Kong International Airport (HKG) handles more than 72.9 million passengers annually and maintains approximately 13,000 luggage trolleys [1]. Naturally, for the convenience of passengers, collecting and redistributing these trolleys has become a vital but laborious task. Most airports, including HKG, still rely on considerable human labor to collect trolleys and return them to designated locations for continued use. However, the labor costs incurred by such human-driven operations are huge and continue to grow, especially in developed regions. A robotic autonomous trolley collection system therefore provides a promising and cost-effective solution to this tedious and expensive task.

In addition, with the outbreak of COVID-19, large international airports such as HKG have become high-risk areas for the spread of the virus. Airport staff who handle trolleys are at risk of infection because the coronavirus can survive for several days on inanimate objects, including trolleys. By removing human contact and intervention, robotic autonomous trolley collection offers another means of breaking the chain of virus transmission as the pandemic escalates. In this paper, we focus on developing a robotic autonomous trolley collection system that integrates mechanical design, perception, and planning, with the ability to navigate and reliably collect trolleys in complex environments.
Building an autonomous mobile manipulation robot that can find and manipulate objects, as shown in Fig. 1, is an engineering challenge that requires the sophisticated integration of multiple modules, including mechanical design, perception, planning, and control. The design of mobile manipulation platforms has been an active research area. Several researchers take the approach of equipping mobile robots with robot arms, represented by the DLR-HIT-Hand [2], PR2 [3], and TIAGo [4]. However, these platforms are designed for universal purposes around their sophisticated manipulators, so they lack reliability when performing repetitive tasks, e.g., collecting trolleys at airports. The first robotic trolley collection solution is introduced in [5]. The developed prototype has several sensors and a fork manipulator to lift a trolley. Nonetheless, the lifting process is open-loop since there is no feedback sensor in the manipulator. In our new design, we add sensors that provide feedback for detecting sudden impacts encountered by the manipulator.

For visual perception, learning-based 2D object detection methods such as Faster R-CNN [6] and YOLOv5 [7] have been well investigated in recent years. These real-time object detection models endow mobile robots with the ability to localize specific targets in complex environments. In [5], the authors use a trained R-CNN model to detect a trolley, and the robot moves towards the trolley while keeping the target's bounding box in the middle of the image. For accurate manipulation tasks, however, merely perceiving 2D information about the target is not enough. The method in [8] relies on markers pasted on the trolley despite its exploration of monocular 3D trolley detection. In autonomous driving, many researchers attempt to fully exploit the potential of RGB images for 3D detection by recovering 3D objects from keypoints [9], [10]. A key drawback of such methods is that when the target is too close to the camera, the limited field of view causes detection failures. To address these shortcomings, our method combines 2D detection and keypoint detection to estimate the 3D pose of a trolley at long distances and leverages LiDARs to detect the backplane of the trolley at short distances.

For planning and control, there are some previous efforts on the trolley collection task. In [11], [5], the adopted method is visual servoing. The main disadvantage of such methods is that they do not consider obstacle avoidance or safety, leaving a critical problem unresolved for field deployment. The work of [12], [1] focuses on task assignment and smooth-path generation for multiple robots with nonholonomic constraints, but safety is not a general consideration in the planning framework, and the final docking error is not taken into account. Recently, optimization-based planning strategies such as model predictive control (MPC) have gained prevalence in mobile robot planning and control [13]. There have also been breakthroughs in providing real-time safety guarantees for MPC with control barrier functions (CBFs) [14], [15], [16]. CBFs are useful tools for integrating safety considerations as constraints into the general MPC framework. In our work, we formulate the motion planning problem under an MPC framework with obstacle avoidance constraints and field-of-view constraints, and we validate that it is more efficient and robust than the state of the art in robotic autonomous trolley collection.
This work offers the following contributions:

1) A novel robotic autonomous trolley collection system integrating a mechanical system and an efficient autonomy framework.
2) A progressive perception strategy involving long-distance keypoint-based monocular 3D detection and short-distance accurate pose estimation using LiDARs.
3) A safety-critical motion planner formulated under a nonlinear model predictive control framework with CBFs, considering both obstacle avoidance and field-of-view constraints.
4) Experimental demonstrations in complex and dynamic environments of our system detecting target trolleys and safely collecting them.

Robots for collecting luggage trolleys at airports must take over many tasks from human trolley collectors. Most of these tasks are complex and challenging for a robotic system, so we specifically designed a highly integrated hardware and software system suitable for trolley collection in crowded environments. The robot consists of three modules: the chassis module (Fig. 2(c)), the sensors module (Fig. 2(a)), and the manipulator module (Fig. 2(b)). In addition, the robot is equipped with a high-performance computer and a large-capacity battery to support stable operation (see Fig. 2(a)). The developed robot for luggage trolley collection, shown in Fig. 2, is 1.2 m high with 0.07 m ground clearance, 0.45 m long, and 0.416 m wide. As an integrated robotic system with mobile manipulation functions, its design must account for movement, loading, and manipulation performance.

1) Chassis: The two front wheels are driving wheels, as shown in Fig. 2(c). Each driving wheel comprises a DC servo motor, a reducer, and a wheel, producing 31.75 N·m of torque. The rear wheels are two universal wheels connected to the body by a suspension system.

2) Sensors: For perception and localization, the robot is equipped with a 3D LiDAR, two 2D LiDARs, a solid-state LiDAR, and an RGB-D camera, as shown in Fig. 2(a). Thanks to adequate battery capacity and sufficient onboard computing power, additional sensors and other equipment can be installed as required.

3) Manipulator: The manipulator, shown in Fig. 2(b), is designed specifically for catching a luggage trolley at airports. It is installed at the front of our robot and consists of a support base, a fork, a draw-wire encoder, and a DC motor with a reducer. The motor lifts the fork by rotating it around a pivot, and the draw-wire encoder serves as a feedback source for calculating the fork's position from the length of the wire. The length variation Δl of the wire reflects the speed and state of the fork's movement: the wire length l indicates whether the fork is close to the designated position, and Δl indicates whether the fork is blocked or has grasped the trolley. Hence, by periodically reading l from the draw-wire encoder and differencing it, we can construct a feedback control system for the manipulator, as sketched below.
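The following minimal Python sketch illustrates this feedback logic. The encoder and motor interfaces (`read_length`, `set_speed`) and all thresholds are hypothetical placeholders, not the authors' controller implementation.

```python
# Minimal sketch of the fork feedback logic described above; interfaces and
# thresholds are hypothetical, for illustration only.
import time

L_TARGET = 0.35   # wire length (m) at the designated fork position (assumed)
TOL = 0.005       # tolerance (m) for "close to the designated position"
DL_MIN = 0.002    # minimum expected length change (m) per control period
PERIOD = 0.05     # polling period (s)

def fork_state(l: float, dl: float) -> str:
    """Classify the fork state from the wire length l and its variation dl."""
    if abs(l - L_TARGET) < TOL:
        return "at_target"           # fork has reached the designated position
    if abs(dl) < DL_MIN:
        return "blocked_or_grasped"  # wire not moving: impact, stall, or grasp
    return "moving"

def lift_fork(encoder, motor) -> str:
    """Closed-loop lifting: poll l, difference it, stop on arrival or stall."""
    l_prev = encoder.read_length()
    motor.set_speed(1.0)             # start lifting (normalized command)
    while True:
        time.sleep(PERIOD)
        l = encoder.read_length()
        state = fork_state(l, l - l_prev)
        if state != "moving":
            motor.set_speed(0.0)     # stop on target or sudden impact
            return state
        l_prev = l
```

A real controller would additionally debounce the start-up transient and distinguish a successful grasp from a jam, e.g., by where along the stroke the stall occurs.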
In the approaching stage, shown in Fig. 4(a), the robot moves through a crowded environment, finally reaching the back of the target trolley with the same orientation as the trolley. An RGB-D camera is used at this stage to perceive the trolley's position and orientation at a fairly long distance. The planner generates a motion trajectory to approach the trolley while avoiding obstacles in the dynamic environment. In the docking stage (see Fig. 4(b)), the robot continues to move precisely towards an ideal docking position and then catches the trolley with its manipulator. At this stage, the robot should reach the docking position accurately so that the final alignment error is small enough for a successful catch. When the robot arrives at the exact docking location, the planner issues an action command to the low-level manipulator controller to perform the final catch. After successfully capturing the target trolley, the robot carries it to a designated returning spot, as Fig. 4(c) shows. During all stages, an occupancy grid map is built with the Gmapping package [17], and AMCL [18] localization is used to estimate the robot's states in the world frame.

In this section, we characterize our trolley collection strategy as a two-stage process based on the distance between the trolley and the robot. The collection task is simplified to a planar model, since we assume that the trolley and the robot are in the same plane. At long distances, we utilize a monocular camera to detect the trolley and roughly estimate its pose $\mathbf{q}_{tar} = [x_{tar}, y_{tar}, \theta_{tar}]^T$, where $\mathbf{p}_{tar} = [x_{tar}, y_{tar}]^T$ denotes the trolley's position and $\theta_{tar}$ its orientation. At short distances, the trolley's pose is precisely estimated using a LiDAR.

The monocular 3D detection framework consists of three parts, as shown in Fig. 5. First, we use an object detection network to find a 2D bounding box of the target trolley in a downsampled RGB image $I_s \in \mathbb{R}^{W_s \times H_s \times 3}$; then, from the original high-resolution image $I \in \mathbb{R}^{W \times H \times 3}$, we crop the trolley image with the bounding box and obtain a new image $I_c \in \mathbb{R}^{W_c \times H_c \times 3}$. Second, instead of predicting the 3D pose of the trolley directly, we use a deep neural network to predict six 2D keypoints (red points in Fig. 5) in the cropped image $I_c$. Finally, with the a priori known 3D coordinates of the keypoints, we calculate the corresponding 6D pose by minimizing the reprojection error of the keypoints in the original image $I$.

Specifically, in the first part, we choose YOLOv5 [7] as our network model for real-time trolley detection due to its efficiency. In the second part, inspired by human pose estimation [19], we adopt the stacked hourglass network structure to estimate heatmaps of the six 2D keypoints $\hat{\mathbf{p}}^c_i = [\hat{x}^c_i, \hat{y}^c_i]^T$, $i = 0, 1, \dots, 5$, in the cropped image $I_c$, from which we obtain the corresponding homogeneous coordinates in the original image $I$. In the third part, we solve a perspective-n-point (PnP) problem to obtain the pose of the trolley. According to the perspective projection model for cameras, we have the relationship

$$s_i \mathbf{p}_i = K \mathbf{X}^c_i = K (R \mathbf{X}^t_i + T), \qquad (1)$$

where $\mathbf{X}^c_i$ and $\mathbf{X}^t_i$ represent the 3D coordinates of the $i$-th keypoint in the camera's frame and the trolley's frame, respectively; $K$ is the intrinsic camera matrix that projects $\mathbf{X}^c_i$ to the image point $\mathbf{p}_i = [u_i, v_i, 1]^T$ in homogeneous image coordinates; and $s_i$ is a scale factor. We then solve the PnP problem for the 3D rotation $R$ and 3D translation $T$ from the trolley's frame to the camera's frame using the EPnP algorithm [20] (see the sketch below). Eventually, with the localization information of the robot base and the relative pose of the trolley in the camera frame, we calculate the global state of the trolley $\mathbf{q}_{tar} = [x_{tar}, y_{tar}, \theta_{tar}]^T$. Moreover, a filter is applied to suppress sudden changes in the detection results, since the trolley should be static most of the time. If the trolley is not in the camera's field of view (FoV), its state is kept at the value from the last time it was within the FoV.
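As an illustration of this last step, the sketch below recovers the trolley pose from the six detected keypoints with OpenCV's EPnP solver. The keypoint layout `X_T` is an assumed placeholder, not the trolley's actual CAD coordinates.

```python
# Sketch of pose recovery from the six 2D keypoints via EPnP [20].
# X_T below is an assumed keypoint layout, not the authors' CAD model.
import cv2
import numpy as np

# A-priori 3D keypoint coordinates in the trolley frame (meters, illustrative)
X_T = np.array([
    [-0.25, 0.90, 0.0], [0.25, 0.90, 0.0],    # upper backplane corners
    [-0.25, 0.15, 0.0], [0.25, 0.15, 0.0],    # lower backplane corners
    [-0.25, 0.15, 0.55], [0.25, 0.15, 0.55],  # points on the basket front
], dtype=np.float64)

def trolley_pose_from_keypoints(kpts: np.ndarray, K: np.ndarray):
    """Solve s_i p_i = K (R X_i^t + T) for (R, T) with EPnP.

    kpts: (6, 2) keypoint pixel coordinates in the original image I.
    K:    (3, 3) intrinsic camera matrix.
    Returns the rotation R and translation T from trolley frame to camera frame.
    """
    ok, rvec, tvec = cv2.solvePnP(X_T, kpts.astype(np.float64), K, None,
                                  flags=cv2.SOLVEPNP_EPNP)
    if not ok:
        raise RuntimeError("EPnP failed")
    R, _ = cv2.Rodrigues(rvec)  # axis-angle to 3x3 rotation matrix
    return R, tvec.reshape(3)
```

Composing the resulting (R, T) with the robot's AMCL pose in the world frame then yields the global trolley state $\mathbf{q}_{tar}$.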
During docking, accurate perception of the trolley's pose is crucial. Fig. 6 illustrates our perception strategy at this stage. We use a solid-state LiDAR to obtain a point cloud and then perform plane detection and fitting on the cloud to estimate the precise pose of the trolley. The point cloud is denoted by $P = \{p_1, p_2, \dots, p_n\}$, where $p_i \in \mathbb{R}^3$ for $i = 1, \dots, n$, with $n$ the total number of points. To obtain an ideal point cloud characterizing the backplane of a trolley, the robot should be at a suitable pose; empirically, we place the robot facing the backplane, 0.3 m to 2 m behind the trolley. Since the trolley's pose obtained at the approaching stage has decimeter-level precision and the robot has a capable motion planner (Section III-C), the LiDAR's FoV is large enough to ensure that the backplane appears entirely in the point cloud. To estimate the trolley's pose $\mathbf{q}_{tar}$ with centimeter-level precision, we first filter out interference points by thresholding. After that, we perform plane fitting with the RANSAC algorithm [21] and obtain a set of plane points $P_{plane} = \{p_1, p_2, \dots, p_M\} \subseteq P$ together with the four parameters $a$, $b$, $c$, and $d$ of the plane equation $aX + bY + cZ + d = 0$. With the plane parameters and the plane points, we estimate the center point and the yaw angle of the backplane by computing the centroid of the filtered point cloud and the normal vector of the plane, as sketched below.
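The following sketch illustrates this step using Open3D's RANSAC plane segmentation; the onboard implementation uses the Point Cloud Library [26] instead, and the range gate, thresholds, and sensor frame convention here are assumptions.

```python
# Sketch of docking-stage pose estimation: range-gate the cloud, RANSAC-fit the
# backplane [21], then read off its center and yaw. Open3D is used here for
# brevity; thresholds and the sensor frame (x forward, y left) are assumed.
import numpy as np
import open3d as o3d

def trolley_pose_from_cloud(points: np.ndarray):
    """points: (n, 3) LiDAR returns in the sensor frame. Returns (center, yaw)."""
    # Filter out interference points: keep returns in the expected 0.3-2 m band
    rng = np.linalg.norm(points[:, :2], axis=1)
    points = points[(rng > 0.3) & (rng < 2.0)]

    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(points)
    # RANSAC plane fit: plane = [a, b, c, d] with aX + bY + cZ + d = 0
    plane, inlier_idx = pcd.segment_plane(distance_threshold=0.01,
                                          ransac_n=3, num_iterations=1000)
    a, b, c, d = plane
    plane_pts = points[inlier_idx]    # P_plane, points on the backplane
    center = plane_pts.mean(axis=0)   # centroid of the backplane
    yaw = np.arctan2(b, a)            # heading of the (horizontal) plane normal;
                                      # normal sign disambiguation is omitted
    return center, yaw
```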
After obtaining the trolley's 3D pose $\mathbf{q}_{tar}$, we can then determine a manipulation pose for the robot to collect the trolley. The planning part considers two main problems, namely, generating a feasible state and control trajectory and avoiding unsafe actions in crowded environments.

In both stages of the collection process, the robot needs to move to a given goal state $\mathbf{x}_{goal} = [x_{goal}, y_{goal}, \theta_{goal}]^T$. Concretely, the goal state at the approaching stage is a position behind the trolley with the same orientation as the trolley, while at the docking stage the goal state is an ideal pose for the robot to operate the manipulator.

In this work, we characterize the safe set $\mathcal{C}$ of states as the zero-superlevel set of a continuously differentiable function $h: \mathbb{R}^n \to \mathbb{R}$:

$$\mathcal{C} = \{\mathbf{x} \in \mathbb{R}^n : h(\mathbf{x}) \ge 0\}. \qquad (2)$$

Safety, in our case, has the physical meaning of avoiding all static and dynamic obstacles. To do so, we keep the distance between the robot and any obstacle beyond a specific range. Therefore, it is natural to define the following function to construct our safe set $\mathcal{C}$:

$$h_{ob}(\mathbf{x}) = (x - x_{ob})^2 + (y - y_{ob})^2 - d_{safe}^2, \qquad (3)$$

where $\mathbf{x}_{ob} = [x_{ob}, y_{ob}]^T$ denotes the position of any obstacle and $d_{safe}$ is a predefined safety distance.

At the docking stage, it is preferable to keep the trolley within the FoV of the solid-state LiDAR. As shown in Fig. 7, $\mathbf{e}_t$ is a unit vector starting from the robot and pointing at the trolley, $\mathbf{e}_r$ is a unit vector in the direction of the robot's orientation, and $\theta_{max}$ is the maximal angle between these two vectors at which the trolley stays within the LiDAR's FoV. To meet this maintaining-observation requirement, we define the following function, used later in our planning formulation:

$$h_{fov}(\mathbf{x}) = \mathbf{e}_r^T \mathbf{e}_t - \cos\theta_{max}. \qquad (4)$$

Then, we introduce CBF constraints [14], [15]:

$$\Delta h(\mathbf{x}_k, \mathbf{u}_k) \ge -\lambda_k h(\mathbf{x}_k), \qquad (5)$$

where $\Delta h(\mathbf{x}_k, \mathbf{u}_k) := h(\mathbf{x}_{k+1}) - h(\mathbf{x}_k)$ and $\lambda_k \in (0, 1]$. This kind of constraint ensures that $h$ is a discrete-time CBF, which means the safe set $\mathcal{C}$ defined in (2) is invariant along the trajectories of the discrete-time dynamic system. Also, one can see that (5) guarantees that the lower bound of $h$ decreases exponentially at time $k$ with rate $1 - \lambda_k$.

We formulate the planning task as a nonlinear model predictive control (NMPC) problem. At the approaching stage, the formulation has the following form:

$$\begin{aligned}
\min_{\mathbf{u}_{0:N-1}} \quad & \|\mathbf{x}_N - \mathbf{x}_{goal}\|^2_{P_f} + \sum_{k=0}^{N-1} \|\mathbf{u}_k\|^2_{Q_u} && (6a)\\
\text{s.t.} \quad & \mathbf{x}_{k+1} = f(\mathbf{x}_k, \mathbf{u}_k), && (6b)\\
& \mathbf{x}_0 = \mathbf{x}(t_0), && (6c)\\
& \mathbf{x}_k \in \mathcal{X},\ \mathbf{u}_k \in \mathcal{U}, && (6d)\\
& \Delta h_{ob}(\mathbf{x}_k, \mathbf{u}_k) \ge -\lambda_k h_{ob}(\mathbf{x}_k), && (6e)
\end{aligned}$$

where $\|\mathbf{x}\|^2_A := \frac{1}{2}\mathbf{x}^T A \mathbf{x}$, and the two positive definite matrices $P_f$ and $Q_u$ are coefficients weighting the terminal cost and the running control cost, respectively. (6a) minimizes the quadratic cost function over a horizon of $N$ steps. In (6b), we use the differential-drive model as the robot's system model. (6d) constrains the states and control inputs to a reachable state set and an admissible control set, respectively. Constraint (6e) is for obstacle avoidance.

At the docking stage, the formulation is similar:

$$\begin{aligned}
\min_{\mathbf{u}_{0:N-1},\ \delta} \quad & \|\mathbf{x}_N - \mathbf{x}_{goal}\|^2_{P_f} + \sum_{k=0}^{N-1} \|\mathbf{u}_k\|^2_{Q_u} + w\delta^2 && (7a)\\
\text{s.t.} \quad & \text{(6b)-(6e)}, && (7b)\text{-}(7e)\\
& \Delta h_{fov}(\mathbf{x}_k, \mathbf{u}_k) \ge -\lambda_k h_{fov}(\mathbf{x}_k) - \delta. && (7f)
\end{aligned}$$

At this stage, we describe the states at which the trolley remains in the robot's FoV as a safe set defined by joining (2) and (4). Similar to the obstacle avoidance constraint (7e), the maintaining-observation requirement is formulated as constraint (7f). To avoid infeasibility, we introduce a slack variable $\delta$ and minimize it via the cost term $w\delta^2$. Upon implementation, these optimization problems are formulated in CasADi [22] and solved with IPOPT [23]; a minimal sketch of the approaching-stage problem follows.
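Below is a minimal CasADi sketch of problem (6) with a single static obstacle. The horizon, time step, bounds, weights, and λ are illustrative values, not the parameters tuned on the robot.

```python
# Minimal sketch of NMPC problem (6) in CasADi [22], solved by IPOPT [23].
# One static obstacle; weights, bounds, dt, and lambda are illustrative only.
import casadi as ca

N, dt = 20, 0.1                 # horizon steps and step size (assumed)
d_safe, lam = 0.6, 0.2          # safety distance and CBF rate (assumed)

opti = ca.Opti()
X = opti.variable(3, N + 1)     # states [x, y, theta] over the horizon
U = opti.variable(2, N)         # inputs [v, omega]
x0 = opti.parameter(3)          # current state from AMCL
x_goal = opti.parameter(3)      # goal pose behind the trolley
x_ob = opti.parameter(2)        # obstacle position

def h_ob(p):                    # barrier (3): squared clearance to the obstacle
    return ca.sumsqr(p - x_ob) - d_safe**2

opti.subject_to(X[:, 0] == x0)  # (6c) initial condition
cost = 0
for k in range(N):
    v, w = U[0, k], U[1, k]
    # (6b) differential-drive model, forward-Euler discretization
    opti.subject_to(X[:, k + 1] == X[:, k] + dt * ca.vertcat(
        v * ca.cos(X[2, k]), v * ca.sin(X[2, k]), w))
    # (6d) admissible control set
    opti.subject_to(opti.bounded(-1.0, v, 1.0))
    opti.subject_to(opti.bounded(-1.5, w, 1.5))
    # (6e) discrete-time CBF constraint for obstacle avoidance
    opti.subject_to(h_ob(X[:2, k + 1]) - h_ob(X[:2, k]) >= -lam * h_ob(X[:2, k]))
    cost += ca.sumsqr(U[:, k])          # running cost ||u_k||^2_{Q_u}, Q_u = I

cost += 10 * ca.sumsqr(X[:, N] - x_goal)  # terminal cost with P_f = 10 I
opti.minimize(cost)
opti.solver("ipopt")

# Usage: set parameter values, solve, and apply the first input
opti.set_value(x0, [0.0, 0.0, 0.0])
opti.set_value(x_goal, [3.0, 1.0, 0.0])
opti.set_value(x_ob, [1.5, 0.5])
sol = opti.solve()
v_cmd, w_cmd = sol.value(U[:, 0])
```

The docking-stage problem (7) is assembled the same way, adding the FoV barrier (4) as a CBF constraint softened by the slack variable δ.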
A. Perception System Evaluation

1) Data Set: To ensure the robustness of the detection task at long distances, we built our own data set for network training. For object detection, we collected 1,200 pictures of trolleys with different illumination, backgrounds, angles, etc. These pictures were all labeled with 2D bounding boxes around the trolleys. The data set was divided into three parts: 800 images for training, 200 for validation, and 200 for testing. For keypoint detection, we prepared 800 pictures of cropped trolley images with accurate keypoint labels. Similarly, we arranged 600 images for training, 100 for validation, and another 100 for testing.

2) Implementation Details: We implemented and trained our monocular 3D trolley detection networks offline using PyTorch on an Intel machine with an i7-9750H CPU and an NVIDIA GTX 1660 Ti GPU. Our 2D detection network is adapted from the official code release of YOLOv5 [7]. Training this network on our own data set, we adopted the SGD optimizer [24] for 300 epochs in total with a batch size of 16. The base learning rate was 0.01, and we reduced it to 0.001 from the 150th epoch and to 0.0001 from the 200th epoch. The training stage of the object detection network lasted roughly 14 hours. For keypoint detection, we used a stacked hourglass network implemented in PyTorch. To improve the generalization ability of our model, we leveraged data augmentation techniques such as random scaling, cropping, flipping, and color transformation. During training, we ran the Adam optimizer [25] with a base learning rate of 0.0001 for the first 200 epochs and then decayed it by a factor of 0.95 every 10 epochs. Finally, it took about 4 hours to train our keypoint detection network with a batch size of 8. At short distances, our method based on plane detection does not involve learning, so we implemented it with the Point Cloud Library [26].

To verify the effectiveness of our perception strategy, we conducted experiments in which a trolley was moved along irregular paths, and we compared the 3D poses detected by the robot with ground truth measured by a motion capture system. First, we tested the monocular 3D method used for long-distance perception; comparisons between the perceived poses of a moving trolley and the ground truth are shown in Fig. 8. The average position estimation error is 0.17 m with a variance of 0.0097 m², and the average orientation estimation error is 0.11 rad with a variance of 0.0085 rad². Then the proposed short-distance LiDAR method based on plane detection was also validated; the results are presented in Fig. 10. The average position estimation error is 0.03 m with a variance of 0.0002 m², and the average orientation estimation error is 0.02 rad with a variance of 0.00036 rad². Overall, the perception module provides sufficiently accurate information for subsequent planning and manipulation.

Using the approaches described throughout this paper, we demonstrate our system on an actual autonomous trolley collection task. The robot is supposed to detect and localize a target trolley, safely navigate to the trolley's back, catch the trolley with its manipulator, and finally carry it to a designated returning spot. Our hardware setup is shown in Fig. 2, and all the algorithms above were integrated within the Robot Operating System (ROS) environment and run in real time on the robot's onboard computer with an i7-1165G7 CPU and an NVIDIA RTX 2060 GPU. In the demonstration shown in Fig. 9, we put the target trolley at different locations with different orientations, far from several initial locations of the robot, and let the robot perform the collection autonomously. In the space between the robot and the target trolley, we set up multiple static obstacles to block the robot's direct route to the goal. Fig. 11 shows the robot's position trajectory in the demonstration and the velocity commands produced by our planner over time. In the velocity command plot in Fig. 11, the first big crest, between t = 10 s and t = 20 s, is caused by avoiding the moving human in Fig. 9(a); the second crest, at t = 27 s, indicates that the robot has completed the approaching stage and begun docking (see Fig. 9(c)); and the sudden change at t = 39 s indicates the start of the return stage shown in Fig. 9(d).

In this paper, we propose a mobile manipulation system for robotic autonomous trolley collection in complex and dynamic environments. To detect target trolleys and estimate their poses, the robot uses a learning-based 3D detection method involving object and keypoint detection at long distances and adopts an accurate point cloud plane detection method at short distances. For safe motion planning and control, we model this real-time task as an NMPC problem. With CBFs, the obstacle avoidance and field-of-view maintenance requirements are incorporated into the planning framework as constraints. The novel mechanical design and autonomy framework, together with the progressive perception and planning strategy, form an efficient and robust robotic solution to autonomous trolley collection. We demonstrate our system in hardware on an actual trolley collection task with static obstacles and moving humans. Experimental results show that our solution clearly outperforms previous state-of-the-art approaches to the collection task. Our future work will focus on developing global decision-making strategies and multi-robot collaboration.
References
[1] Real-time decision making and path planning for robotic autonomous luggage trolley collection at airports
[2] The modular multisensory DLR-HIT-Hand: Hardware and software architecture
[3] Towards autonomous robotic butlers: Lessons learned with the PR2
[4] TIAGo: The modular robot that adapts to different research needs
[5] Coarse-to-fine visual object catching strategy applied in autonomous airport baggage trolley collection
[6] Faster R-CNN: Towards real-time object detection with region proposal networks
[7] ultralytics/yolov5: v5.0 - YOLOv5-P6 1280 models, AWS, Supervise.ly and YouTube integrations
[8] A monocular target pose estimation system based on an infrared camera
[9] RTM3D: Real-time monocular 3D detection from object keypoints for autonomous driving
[10] Mono3D++: Monocular 3D vehicle detection with two-scale 3D hypotheses and task priors
[11] A searching space constrained partial to full registration approach with applications in airport trolley deployment robot
[12] Path planning for nonholonomic multiple mobile robot system with applications to robotic autonomous luggage trolley collection at airports
[13] MPC for path following problems of wheeled mobile robots
[14] Safety-critical model predictive control with discrete-time control barrier function
[15] Enhancing feasibility and safety of nonlinear model predictive control with discrete-time control barrier functions
[16] Rule-based safety-critical control design using control barrier functions with application to autonomous lane change
[17] Improved techniques for grid mapping with Rao-Blackwellized particle filters
[18] Monte Carlo localization: Efficient position estimation for mobile robots
[19] Stacked hourglass networks for human pose estimation
[20] EPnP: An accurate O(n) solution to the PnP problem
[21] Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography
[22] CasADi: A software framework for nonlinear optimization and optimal control
[23] Large-scale nonlinear programming using IPOPT: An integrating framework for enterprise-wide dynamic optimization
[24] Large-scale machine learning with stochastic gradient descent
[25] Adam: A method for stochastic optimization
[26] 3D is here: Point cloud library (PCL)