col.dvi 1 Fault Detection and Fault Tolerance in Robotics Monica Visinsky, Ian D. Walker, and Joseph R. Cavallaro Department of Electrical & Computer Engineering Rice University Houston, TX 77251-1892 2 1 INTRODUCTION Abstract Robots are used in inaccessible or hazardous environments in or- der to alleviate some of the time, cost and risk involved in prepar- ing men to endure these conditions. In order to perform their expected tasks, the robots are often quite complex, thus increas- ing their potential for failures. If men must be sent into these environments to repair each component failure in the robot, the advantages of using the robot are quickly lost. Fault tolerant robots are needed which can effectively cope with failures and continue their tasks until repairs can be realistically scheduled. Before fault tolerant capabilities can be created, methods of de- tecting and pinpointing failures must be perfected. This paper develops a basic fault tree analysis of a robot in order to obtain a better understanding of where failures can occur and how they contribute to other failures in the robot. The resulting failure flow chart can also be used to analyze the resiliency of the robot in the presence of specific faults. By simulating robot failures and fault detection schemes, the problems involved in detect- ing failures for robots are explored in more depth. Future work will extend the analyses done in this paper to enhance Trick, a robotic simulation testbed, with fault tolerant capabilities in an expert system package. 1 Introduction In hazardous environments or environments which are not read- ily accessible to man, robots must be able to efficiently adapt to failures in both software and hardware in order to continue working until the problem can be realistically repaired. Before a robot can try to cope with a failure, however, it must first be able to detect and pinpoint the problem. This paper devel- ops a basic fault tree analysis of robots in order to obtain a better understanding of where failures can occur and how they contribute to other failures or limitations in each robot. The resulting flow chart style picture of failures in a robot can also be used to analyze the resiliency of the robot in the presence of specific faults. Once a failure has been detected, the robot can reorganize its view of its internal structure in such a way as to hide or isolate the fault so the robot can continue working. The focus of this research is on finding real-time fault detection and fault tolerance methods which maintain as much of the robot’s functionality as possible while not requiring that redundant or extra parts be added to the robot. 1.1 Previous Work 1.1.1 Redundancy Based Fault Tolerance Previous work on fault tolerance in robotics has concentrated on dealing with faults in one specific part of the robot (mechanical 1.1 Previous Work 3 failure in the motor, kinematic joint failure, etc.) with only to- ken thought going to the more critical, systemwide effect of the failures. Relatively little focus has been given to the question of how to detect failures in robots. Previous research tends to con- centrate on fault tolerance algorithms, especially those schemes which rely on duplicating parts such as joint motors [13, 16] for their fault tolerant abilities. These redundancy based schemes are similar to several computer fault tolerant algorithms which are also based on redundancy of parts. One common computer fault tolerant algorithm is Triple Modular Redundancy (TMR) in which three processors all work on the same problem and com- pare their results. If one of the processors is faulty and its result does not agree with the results of the other two processors, the faulty processor is voted out of the final decision and the correct result is passed on to the rest of the system. This fault toler- ance scheme fails, however, if more than one of the processors is faulty. Duplication of physical parts provides a backup in case the element performing the work fails. Redundancy of parts can also provide a useful means of checking to see if a component is in error. For the equivalent redundancy based robot fault tolerant al- gorithms, two motors have been placed in each robot joint to provide a backup in case one motor stalls, runs away, or be- gins free-spinning. The fault tolerant advantages of redundancy has also led to adding extra parallel structures, such as seven legs when only six are needed, in order to allow many differ- ent configurations in the presence of a failure. Previous work by Tesar, et al at UT, Austin [13] and independently by Wu [16] with Lockheed at Johnson Space Center have explored the aforementioned method of duplicating motors. Two motors in a joint work together so as to provide one output velocity for the joint. When one of the motors breaks, the other one takes over the faulty motor’s functions. The faulty motor must be isolated from the system or the second motor must be able to adjust its output to account for any transients introduced to the system by the failed motor. If the robot is performing a time- critical or delicate task, fault tolerance must allow the robot to get a run-away motor (which could cause drastic changes in the robot’s motion) under control quickly before any damage to the environment occurs. 1.1.2 Structure-Independent Fault Tolerance Many useful robots have already been created. In order to pro- vide fault tolerance for these robots without redesigning them, algorithms need to be developed that will utilize the advantages of the existing structure and not require the addition of extra motors, sensors, or other components to the robot. These algo- rithms should be easily adaptable to most robots regardless of the robot structure. 4 1 INTRODUCTION To avoid adding redundant parts for fault tolerance in comput- ers, algorithms have been developed which reconfigure the data or code in a computer system among the working parts after one component has failed. Some computer fault tolerant systems handle a fault by allowing a graceful degradation in function- ality or speed. The literature speaks of time redundancy [4] in which a computational cycle is lengthened so a fault-free part (or parts) will have enough time to handle the tasks of a faulty com- ponent. Other systems use set-switching or processor-switching schemes [4] for reconfiguration. In processor-switching, fault- free components can be collected to form a basic subpart of the configuration, such as a row of an array, until the full config- uration is achieved. This method may, however, require many extra interconnections between components. In software, check bits and error correction codes help insure that data is success- fully transmitted in the system and allow a reconstruction of the original data if a transmission line is faulty. For robotics, however, little work has been done in developing algorithms for accommodating a failure using only the available physical parts. Many robots exist which do not have redundant motors or extensive sensors in the joints. Duplicating motors in- creases the size of the robot, the cost involved in building it, and the weight and inertia which affect the robot controller. It would thus be cost effective to find fault tolerance schemes that do not rely on a specific robotic architecture but reorganize the robotic algorithms in the controller or utilize the self-motion capability of robots with redundant joints in order continue working. In order to develop these schemes, the advantages and capabilities of a general robot architecture must be researched. Maciejewski at Purdue University has quantified the effect of joint failure on the remaining dexterity of a kinematically redun- dant manipulator [9]. He calculates an optimal configuration of redundant arms to maximize the fault tolerance while minimiz- ing the degradation of the system in the event of a failure. His method currently only provides fault tolerance if the robot is near this initial configuration and can try and arrange its joints to mimic the fault safe configuration as close as possible. Robot controllers may further attempt to keep the robot arm in a con- figuration where the joints are arranged to stay away from any possible singularities or uncomfortable positions in case a joint fails during the operation. These studies do not rely on adding extra motors or other components to the robot, but they also do not explore what the robot controller must do in order to utilize the remaining dexterity to continue its tasks. This paper considers and analyzes systemwide failures (elec- tromechanical, computer software/hardware, etc.) and their inter-relationship via fault trees. We focus on developing fault detection and fault tolerance schemes using only the components normally available to the robot. Previous work in fault tolerance forms a subset of our analysis, and our structure has the addi- tional advantage of allowing the best results from more specific fault schemes to be embedded into our tree analysis. 1.2 Riceobot and Robo-MEDIC This paper begins by specifically analyzing the fault trees of the Rice University robot, the Riceobot, but the results apply to most robots. The fault tolerant algorithms developed from 1.3 Fault Detection Simulator and Trick 5 � � � �� � � � � � � � � � ��................ ......... .�� �� � � � � �� �� �� �� ...................................... “Trick” CPU Controller Real-Time Simulation and Control Software Error Detection and Robot Manipulator Operator Interface Intelligent Correction “Robo-MEDIC”Robot Manipulator Operator Figure 1: Robo-MEDIC Robotic Fault Tolerance Hardware/Software Environment. this analysis will be embedded into the CLIPS expert system environment [6]. This NASA-developed public domain software package is commonly used by government agencies and is run- ning on our computer systems. The resulting expert system package, Robo-MEDIC (Robot Ma- nipulator Error Detection and Intelligent Correction) will pro- vide diagnostic assistance to the operator and will interface with the control computer of the robot as shown in Figure 1. Robo- MEDIC will be able to use the fault trees as a flow chart of failures. Nodes in the trees will have some fault tolerant action associated with them that will allow the robot to take advantage of inherent backup or alternate paths charted by the fault tree. By maneuvering around the trees, Robo-MEDIC will perform fault tolerant recovery actions as a sequence of these smaller, simpler actions. 1.3 Fault Detection Simulator and Trick In addition to the fault tree analysis, we are examining fail- ures and testing fault detection schemes using a simulation of a generic four link, planar robot. We will be integrating the concepts derived from the simulator into Trick [1], a robotics software testbed developed at NASA Johnson Space Center by Leslie J. Quiocho and Robert Bailey. The Trick software package already contains the information to model the seven-joint Robotic Research Arms, the Space Shut- tle RMS, and the full Riceobot with base and two arms. Data modules provided by Trick allow the user to build customized robots with many different types of sensors, joints, and links. The software can currently model a joint failure by locking the joint. Our research is expanding the capabilities of the software to model runaway joints, fault detection schemes, and fault toler- ance algorithms. The flexibility of the software allows the failure analysis developed in this paper to be extended to a variety of different robots. 2 Robotic Fault Tree Analysis 2.1 Analysis Technique Fault Tree Analysis (FTA) is a deductive method in which failure paths are identified by using a fault tree drawing or graphical 6 2 ROBOTIC FAULT TREE ANALYSIS Table 1: Fault Tree Analysis Symbols Symbol Function AND gate All inputs required to produce output event. OR gate Any one input event causes the output event. Rectangle A malfunction which results from a combination of fault events through logic gates. Diamond A fault event for which the causes are left undeveloped. Circle A basic fault event. This includes component failures whose frequency and failure mode are known. Triangle A suppressed tree. The tree is detailed in another figure. representation of the flow of fault events [2]. FTA is a well- known analysis technique often used in industry for computer control systems and large industrial plants. Each event in the tree is a component failure, an external disturbance, or a system operation. The top event is the undesired event being analyzed and, in this research, is the failure of the entire robot. The events are connected by logic symbols to create a logical tree of failures. Some of the basic symbols are explained in Table 1. The explanation of the FTA technique in [2] promotes a top down development of the fault tree. The top event is broken down into primary events that can, through some logical com- bination, cause the failure at the top. This process is repeated to deeper levels until a basic event or an undeveloped event is reached. Some conditions or causes may be left undeveloped if the probability that they will occur is small enough to be ig- nored. 2.2 Failure Propagation/Probability Analysis Once the fault trees are built, the information available may be enhanced by a quantitative analysis of the failures. Failure rates are assigned to each input event and propagated up the tree based on the rules of the connecting logic gates. The output of an OR gate is the sum of the inputs. The resulting probability of the combined input events is greater than the probability of an individual input event. The output of an AND gate is the product of the inputs. The resulting probability of all the events occurring is thus less than the probability of any one occurring. The AND gates represent some form of redundant measurement or capability and are more desirable in the fault tree as the probability of a failure decreases through the combination of lower level events. A Markov or semi-Markov model approach to probability anal- ysis can also be developed based on the PAWS/STEM [3] and CARE III [12] reliability analysis packages. These packages can analyze simpler fault trees as well as Markov chains, but they are not necessarily optimized to handle the simpler structures 2.3 Fault Tree Pruning 7 [10]. These analysis tools were also not designed for robotics and thus would not take advantage of some of the commonality within robot structures. We will include some of the advan- tageous aspects of using Markov models in our CLIPS-based expert system, Robo-MEDIC. A quantitative analysis provides a measure of the overall chance of a complete failure for each robot. The structure provided by the fault trees organizes the probabilities appropriately for the robot system and provides a simple map of how the probabilities relate to each other. Using the trees, robots of significantly different origin and structure can be compared for fault tolerant abilities and survivability. The integrated Robo-MEDIC expert system will provide diagnostic capabilities by using the fault trees and will alert the operator of an impending failure. It can be used for off-line comparisons of robots or for suggesting possible corrective actions to an operator and the low-level robot controller during real operations. 2.3 Fault Tree Pruning A suggested drawback of FTA is that there is no way to ensure that all the causes of a failure have been evaluated [2]. The designer tends to pick out the important or most obvious events that would cause a given failure. However, the events that are not modeled normally have a low probability of occuring and can be ignored or treated as a basic event without overly biasing the analysis. Several failures may also be interconnected creating lateral branches or cycles in the fault tree. In some robots, one mo- tion at a joint may be coupled with another motion such that failure of either motion causes the failure of the other. It is also difficult to determine the relationships between some failures. For example, the failure of all the internal feedback sensors at the elbow joint of a robot may make the robot blind to the el- bow position. The elbow has not actually failed, but the robot is unable to detect the results of any commands sent to the elbow. Thus, the sensor malfunction does not contribute to an elbow failure specifically but may cause a failure of the entire robot. Relationships like these make the tree complex and difficult to understand. These problems can be overcome by working to simplify the tree. In the case of the coupled motions, the two failures can be considered as one with twice as likely a probabil- ity of occurring. 2.4 Riceobot Fault Trees To provide a foundation for the analysis of general robots, we have chosen to analyze the arm of the Rice University Riceobot. The arm has eight degrees of freedom: three motions in the shoulder (z translation, pitch, and yaw), two motions in the elbow (roll and pitch), and three motions in the wrist (roll, pitch, and yaw). The results obtained from the Riceobot apply to most general robots especially since the Riceobot has a wide variety of commonly used link, joint, and motor arrangements. Overall Robot Failure Several fault trees have been developed and a few are reproduced in the following pages. The top event is obviously the failure of the robot (Figure 2). The primary causes of a robot failure are 8 2 ROBOTIC FAULT TREE ANALYSIS Failure of Robot Failure of Computer System Power Failure External Failure of Shoulder Failure of Elbow Failure of Wrist Failure of Gripper Figure 2: Top Level Fault Tree for Entire Riceobot. power failure, computer system failure, or a combination of fail- ures of the joints. If the robot is fault tolerant, it can withstand the failure of several joints. By stablizing the faulty motion or joint in some manner (such as locking the joint), it is possible that the other motions can still provide some functional capabil- ity to the robot. This ability results in the AND gate combining the joint failures in Figure 2 and decreases the probability of a failure of the robot. Joint Failures The Riceobot has two directly driven motions: the shoulder z- direction motion and the wrist roll motion (Figure 3). The fault trees for these motions are quite simple since only the failure of External Gear-Train Failure Failure of Wrist Failure of Wrist Pitch/Yaw External Failure of Wrist Roll Link Breaks Chain/ Cable Loose External Chain/ Cable Snap Cable Guide Wheel Loose or Off Connector Failure Jacket Unscrews Cable Comes Free of Clamp Strip Threads Failure of Universal Joint Failure of Motor Failure of Motor Joint Link Becomes Stuck Joint Breaks Joint Clamp to Motor Loosens Figure 3: Sub-Level Fault Tree for Wrist System. 2.4 Riceobot Fault Trees 9 the motor plays an important role in the failure of the motion. The other motions of the Riceobot depend on some form of gear- train assembly to allow the spatially separated motor to drive the joint. Failure of the gear-train can be caused by basic events as simple as a loosening of the chain or cable. Motor and Sensor Failures The probability of a motor failure is dependent on the type of motor used. The Riceobot contains both brushless DC and step- per motors. Each motor also has a gear box which may fail due to gear slippage or wear. A power failure affects all motors as well as any other electrically driven parts in the robot, but each motor could lose power separately if its specific power cables break. A motor failure could conceivably also be the cause of a sensor failure when sensors are mounted on the shafts of the motors. Sensors are also affected by incorrect calibration and external noise or vibrations. (See Figure 4.) Computer System Failures The computer system of the Riceobot consists of three main parts: (1) amplifiers which read from the optical encoders and drive the motors, (2) servo control chips which store informa- tion about the different motors and convert the desired angles into currents for each motor, and (3) an on-board host computer which is programmed in C and computes the desired angles for the desired motions (Figure 5). These three parts each contain at least one board filled with TTL chips, capacitors, power tran- sistors, resistors, and other analog and digital circuit parts. A failure of any one of these parts may not cause a failure of the entire board; but if a board did fail, the robot would be unable to function. The robot cannot withstand the failure of all the servo controllers or all the amplifiers because it would no longer be able to communicate with the joints. External Gear-Train Failure Failure of Wrist Failure of Wrist Pitch/Yaw External Failure of Wrist Roll Link Breaks Chain/ Cable Loose External Chain/ Cable Snap Cable Guide Wheel Loose or Off Connector Failure Jacket Unscrews Cable Comes Free of Clamp Strip Threads Failure of Universal Joint Failure of Motor Failure of Motor Joint Link Becomes Stuck Joint Breaks Joint Clamp to Motor Loosens Figure 3: Sub-Level Fault Tree for Wrist System. 10 2 ROBOTIC FAULT TREE ANALYSIS Sensor Failure Local Power Lines Fail Incorrect Callibration Sensor Logic Board Failure External Damage/ Noise Failure of Motor Local Power Lines Fail Internal Motor Failure Gear Box Failure Gears Slip Wear on Gears Connection to Amp Failure Lose Cable Strip Connector for Cable Strip Faulty Figure 4: Sub-Level Fault Tree for Sensor System. Detecting a non-terminal failure in the computer system requires some form of testing circuitry or the ability to poll components to see if they are still alive. The IEEE standard 1149.1 Test Access Port may be incorporated into any VLSI chip on the boards and could be used for active testing. Radiation hard- ened circuits [15] should also be used in the computer system. Correction code bits can be used to check data transfers and could identify a bus failure if the bits were consistently wrong. 2.5 Derived Riceobot Fault Detection The qualitative analysis of these fault trees has proven useful in pointing out some of the limitations of the Riceobot in regards to fault detection and fault tolerance. With only one sensor at each joint, the Riceobot represents the worst case scenario for detecting sensor and joint failures. The only option available to the fault detection software is to compare the sensed angles with the calculated desired angles. After accounting for a predeter- mined threshold to mask any precision errors in the calculations or sensing equipment and possibly adjust for load effects, any difference between the sensed and desired values must be con- sidered the result of a failure. The computer is, however, unable to differentiate between a sensor malfunction and an actual joint failure due, for example, to a frozen motor. The computer must therefore shut down the joint and proceed with fault tolerance schemes based on a new model of the robot with fewer possible motions. The Riceobot’s fault detection capabilities could be improved by drawing on the vision system to determine the joint angles. With the computer’s calculations and the sensor reading, this third estimate of the angle would help distinguish between a sensor failure and a real joint failure. The Riceobot would still be able to function in the presence of one sensor failure. Using the vision system for this task, however, increases the load on the image processing software and may hinder the system’s ability 11 Sensor Failure Local Power Lines Fail Incorrect Callibration Sensor Logic Board Failure External Damage/ Noise Failure of Motor Local Power Lines Fail Internal Motor Failure Gear Box Failure Gears Slip Wear on Gears Connection to Amp Failure Lose Cable Strip Connector for Cable Strip Faulty Figure 4: Sub-Level Fault Tree for Sensor System. to perform its normal vision tasks. 3 Robot Fault Detection The fault trees give an idea of the interaction between failures in a system. The trees also provide a map of alternate paths for detecting faults or bypassing failures. In order to expand on this information and to show how modeling errors or other un- certainties affect fault detection, we need to simulate the robot and the fault detection algorithm. Because of the Riceobot’s lack of sensors and the complexity needed for its fault detec- tion algorithm, we are initially simulating fault detection using a computer modeled planar, four link robot [7]. The current program will need to be expanded extensively for the Riceobot and will be accomplished by implementing the fault detection routine in the CLIPS expert system as part of Robo-MEDIC. The four link robot is essentially just four cylinders placed end- to-end. All joints are rotational and move in the same plane. A simulated optical encoder and tachometer were added for each joint. The fault tree for this robot is relatively simple (Figure 6). We have not included the possibility of link breakage or global power failure in the simulation. The motors are in essence direct drive with no gear trains and fail only in a locked mode. This conditions pruned each joint subtree down from the complexity of the Riceobot trees to an easily simulated failure situation. It is interesting to note that it is the fault detection software which allows the joint to survive in the presence of a single sensor failure thus creating the AND gate under each motor failure in the tree. If both sensors at a joint fail, the host computer is blind to that joint and the fault detection routine forces a motor failure to prevent the joint from moving too far without computer supervision. Thus, the dual sensor failure subtree is a cause of the motor failure event for each joint. 12 3 ROBOT FAULT DETECTION Failure of Computer System Card Cage Failure External Host Computer Failure Amplifier Board Failure Servo-Control Board Failure Sensor Failures Failure of Components bonding,status indicators, etc., fail Sensor Failure (n) Sensor Failure (2) Sensor Failure (1) Failure of Components transistors, capacitors, or resistors, fail TTL chips, processors, ROM, RAM fail soldering, chip sockets, etc, fail bonding, status indicators, sockets fail Sensor Failures Failure of Components Figure 5: Sub-Level Fault Tree for Computer System. 3.1 Fault Detection Simulator The structure of the simulator is shown in Figure 7. The flow of information is from the simulated host computer through the robot and then the sensors to the fault detection program and finally back to the host computer. The host computer uses the desired angles computed by a planner routine and the estimated present position of the robot derived from the sensors to calcu- late the torque necessary to move each link to its desired desti- nation. The controller is a standard PD computed torque type controller. The robot routine then takes the calculated torques and determines the new position, velocity, and acceleration for each joint. The optical encoders estimate the positions by trun- cating the value of each angle based on each encoder’s precision. The tachometers pass the velocities through a first order filter based on a predetermined motor lag time. The sensors are mod- ules from the Trick simulation package and represent our initial efforts at integration with Trick. These estimates of the an- gles and velocities are passed into the fault detection procedure which checks for failures and either passes to the host what it considers good estimates of the position, velocity, and accelera- tion or signals the motor of a joint with two bad sensors to shut down. 3.2 Host Computer Model The simulated host computer uses the following dynamics equa- tion as a model for the robot: τ = [M (θ)]θ̈ + N (θ, θ̇), (1) where τ is the joint torque vector, [M ] is the inertia matrix, and N is the Coriolis and centrifugal torque vector. The [M ] matrix and N vector are computed based on the estimated angles from the sensors. Since the robot is planar, gravity is orthogonal to 3.2 Host Computer Model 13 Failure of Computer System Card Cage Failure External Host Computer Failure Amplifier Board Failure Servo-Control Board Failure Sensor Failures Failure of Components bonding,status indicators, etc., fail Sensor Failure (n) Sensor Failure (2) Sensor Failure (1) Failure of Components transistors, capacitors, or resistors, fail TTL chips, processors, ROM, RAM fail soldering, chip sockets, etc, fail bonding, status indicators, sockets fail Sensor Failures Failure of Components Figure 5: Sub-Level Fault Tree for Computer System. the plane of motion and there are no resultant gravity torques to consider. Friction is also ignored in this model. The PD controller for this model becomes: τ = [M (θ)]{θ̈d + [KP ](θd − θ) + [KD](θ̇d − θ̇)} + N (θ, θ̇). (2) The matrices [KP ] and [KD] are the position and derivative gains, respectively, and are used to control tracking and steady state errors by feedback control. For critical damping, the gains become: Failure of Joint 0 Failure of Robot Failure of Joint 1 Failure of Joint 2 Failure of Joint 3 Failure of Motor Failure of Motor Parts Failure of Tachometer Failure of Encoder Figure 6: Four Link, Planar Robot Fault Tree 14 3 ROBOT FAULT DETECTION θ e θ t Instigate Failures Fault Detection Encoders Tachs Robot (calc real accelerations, velocities, and angles) Host Computer (calc torques)Planner Report of Failures Failure Information fail motor signal τ θ d θ d θ d θ θ θ θ d θ d θ d θ r θ r Figure 7: Fault Detection Simulator Flow Chart [KD] = 2ω, [KP ] = ω 2, (3) where ω is the natural frequency input by the user. The natural frequency is typically set to 1 for most runs of the simulator. A higher natural frequency would increase the effects of the gains in the feedback control. 3.3 Robot Model The simulated robot is a procedure which takes the computed torque from the simulated host computer and determines the resulting four robot angle accelerations based on the equation: θ̈ = [M̂ ]−1τ − [M ]−1(N̂ ). (4) Here, [M̂ ] and N̂ are the inertia matrix and Coriolis and cen- trifugal torque vector as before but are now based on the real robot angles instead of the sensed angles. The matrices have also been injected with a small constant error to simulate the effects of a load on the robot. The joint angle, θ, and its first derivative are estimated by the equations: θ̇i = θ̇i−1 + (Δt)θ̈i, (5) θi = θi−1 + (Δt)θ̇i. (6) If a motor failure has occurred, θ̈ and θ̇ are set to zero to simulate the effects of a locked motor. The position thus remains con- stant. Only the locked motor is currently simulated, but other failure modes could result in runaway motors or free-spinning motors. The robot’s position, velocity, and acceleration calculated in this procedure are sent to the sensor routines. The robot position is also sent to the graphics simulator which displays the motion on the screen. This is the same graphics program used by the Trick simulation package. 3.4 Fault Detection Capabilities 15 3.4 Fault Detection Capabilities 3.4.1 Failure Modes If a sensor breaks and the failure goes undetected, the host com- puter will be performing its calculations using erroneous infor- mation. In this simulator, the encoders break in a frozen mode continuously reporting the last value read before the failure. The tachometers fail by continuously reporting zero velocities and thus constant positions. With these failure modes, the host sees the error between the sensed angle and the desired angle grow for the joint with the faulty sensor, and the control equa- tions increase the appropriate output torque to the robot to try and compensate for the error. The joint with the faulty sensor swings wildly off course because the host keeps trying harder and harder to get the broken sensor value to match the desired value. Since the calculations for all the joints are based on knowledge of where the other joints are located, all of the other output torques are also computed incorrectly and the joints all stray from their desired paths. When a motor fails, it locks in position and the joint is then unable to move. If a motor failure goes undetected, the sensors are still reading the correct information. In reality, the motor failure would probably result in a sensor failure as well, but the result would still be that both sensors are reporting a constant joint angle. The control equations try to push the broken joint closer to the desired value but the frozen motor does not respond to the torques. Since the sensors are still reporting the actual position of the joint, all the other calculations are based on cor- rect data and the other joints can continue with their normal motions. The plan must be modified, however, to get the end effector to its desired location. 3.4.2 Thresholds These two undetected failures reveal the importance of getting accurate sensor readings and of detecting a sensor failure quickly. A frozen motor is not as critical a failure in most cases and can be dealt with at a more leisurely pace. Since the sensors are not perfectly accurate, an acceptable threshold for the error between sensor reading and desired value must be chosen. Unfortunately, even during normal operation, the error between the actual angle and the desired angle can be relatively large especially at the beginning of a run before the controller has time to bring the error under control. Choosing the maximum error found during a failure-free run results in a threshold that is so large, it may take several time steps to notice the error from a broken sensor. By the time the failed sensor is detected, the robot controller has already been infected with the erroneous information and the robot is either off course or has damaged itself. Fortunately, the error between the angles recorded by the two sensors during normal operation is very small even after integrat- ing the tachometer reading to get the angular position. Fortu- nately, modeling errors and errors induced by unpredicted loads would affect both sensors in a similar manner. A tight threshold can be chosen for a comparison of the two sensed positions. If this threshold is exceeded, the fault detection software assumes that one of the sensors has failed and appropriately chooses one as the working sensor from which to take the recorded data. The larger thresholds from the typical error between the sensed and desired angles are still monitored, however. The large thresholds 16 3 ROBOT FAULT DETECTION provide a means of checking for a motor failure. The psuedocode for these checks is reproduced below. The angle θd and its derivatives are the desired values. The variables θt, θ̇t, and θ̈t are the values derived from the tachometer reading. The results based on the encoder are θe, θ̇e, and θ̈e. Finally, θ and its derivatives are the values sent to the robot controller. If ((encoder working) and (tachometer working)){ θ = θe, θ̇ = θ̇t, θ̈ = θ̈t If ((|θt − θe|) >= threshold){ if (encoder working){ tachometer = failed θ = θe, θ̇ = θ̇e, θ̈ = θ̈e }else{ encoder = failed θ = θt, θ̇ = θ̇t, θ̈ = θ̈t } }else{ if ((|θd − θt|) >= tachometer-threshold) tachometer = failed if ((|θd − θe|) >= encoder-threshold) encoder = failed } } If ((tachometer == failed) and (encoder != failed)){ if ((|θd − θe|) < encoder-threshold){ θ = θe, θ̇ = θ̇e, θ̈ = θ̈e }else{ encoder = failed motor = failed send stop motor signal to robot } } If ((encoder == failed) and (tachometer != failed)){ if ((|θd − θt|) < tachometer-threshold){ θ = θt, θ̇ = θ̇t, θ̈ = θ̈t }else{ tachometer = failed motor = failed send stop motor signal to robot } } Choosing which sensor has failed and which is still working when the tight tolerance is exceeded is the most difficult task. Intu- itively, one would expect the sensor with a reading closer to the desired value to be the working sensor and would switch to obtaining all the information from that sensor. However, ex- perience has shown that the fault detection software choses the correct sensor only when the desired values are increasing. If the desired angles are decreasing in value, it consistently picks the failed sensor as the working one. This problem is a result of the time it takes the controller to bring the errors under control and the failure modes for the sen- sors. Both the encoders and the tachometers fail by reporting a constant angular position either directly or by integration of a zero velocity. First, let us assume the sensors always read less than the desired value. If a sensor fails and gets stuck at a spe- cific value while the desired values are increasing, the error will 3.4 Fault Detection Capabilities 17 Table 2: Sensor Failure Detection Tests Ordering Encoder Fails Tachometer Fails θd < θt, θe largest error largest error (2) (2) θt, θe < θd smallest error smallest error (1,2) (1,2) θt < θd < θe largest error smallest error (1,2) θe < θd < θt smallest error largest error (1,2) grow and the fault detection routine should take the angle in- formation from the sensor that reads closer to the desired value. However, if the sensor fails while the desired values are decreas- ing, the desired values are approaching the failed value. The error starts decreasing and the surviving sensor is often the one whose absolute error is larger. (See Table 2 for listing of results.) The opposite relationships hold if both sensors are reading val- ues greater than the desired angle. A failed sensor would then have the smaller error during an increase in desired angles and the larger error during a decrease in desired angles. The various sensor failure situations that arise in the presence of increasing desired values and the relative size of the error for the surviving sensor are listed in Table 2. The cases in which the initial, naive algorithm is successful in choosing the survivor are marked with an “1”, and the cases in which the second algorithm is successful are marked with a “2”. The table represents half of the possible cases. The table for decreasing desired values would look similar with the equivalent number of successes for each algorithm as in Table 2. By checking whether the desired values are increasing or de- creasing and performing the appropriate comparisons to choose the surviving sensor, the algorithm’s success rate for the cases listed in the table increases from 50% to 75%. The algorithm still has a problem with the cases where the sensors are reading on opposite sides of the desired value. The fault detection pro- gram picks a failed tachometer as the survivor if the tachometer value is trailing a trend of decreasing desired values while the encoder is preceding the desired values. Similarly, the fault de- tection program should pick a failed encoder as the surviving sensor during a trend of increasing desired values if the encoder value is preceding the desired value by a large enough margin while the tachometer is trailing the desired value. We were un- able to demonstrate this second error as the encoder value was close enough to the desired value that when the encoder failed, the desired value passed the failed value in the next time step. The fault detection scheme was able to guess correctly in this situation. In general, our simple fault detection simulator is capable of de- tecting for each joint a single sensor failure, a single sensor failure followed by a motor failure, or a motor failure. The simulator will eventually detect a second sensor failure and will catch the single failures it has missed, but it has allowed enough erroneous sensor readings through to the controller that other joints have been knocked off course and fail as well. In order to improve the fault detection algorithm, we must switch from the hard-coded voting scheme presented above to other forms of analytical re- 18 4 CONCLUSIONS AND FUTURE WORK dundancy which use filters [14, 11], adaptive thresholds [8], or parity relations [5]. Willsky [14, 5] gives a good overview of the various methods of analytical redundancy. Most of the work in this area has been focused on failure detection in aircraft, power generation system and other mechanical systems. Unfor- tunately, the amount of uncertainty and modeling errors present in most robotic control systems makes several of these methods inaccurate. The generation of residuals using parity relations is one example of a method which would be unsuitable for robotic applications. [8]. 4 Conclusions and Future Work In this paper we have presented new results in fault tree analysis and fault detection for robot manipulators. This research sets the stage for significantly enhanced activity in fault tolerance for robotics. Once a failure can be detected and isolated, a fault tolerant expert system like Robo-MEDIC can proceed with the appropriate actions to make use of the existing robot structure, redundancy, and alternate paths. There already exist a variety of computer fault tolerance schemes which can provide a starting point for creating these structurally independent robotic fault tolerance algorithms. The fault tree analysis for the Riceobot has proven useful in pointing out some trouble spots for fault detection and fault tolerance. Even without a quantitative analysis, the importance of certain components and the severity of different failures are revealed in the fault trees. For robots, the good health of the internal sensors is shown to be extremely desirable. Erroneous data from even one sensor at a joint can cause the whole robot to deviate drastically from its course if the failure is not detected quickly. Without the sensors, the robot also loses much of its ca- pability to detect faults. Developing methods for early detection of sensor malfunctions thus has a high priority in this research. By simulating relatively simple fault detection situations, we are gaining a better understanding of how to satisfy this need for early detection. Our simulator has shown that to avoid false alarms due to modeling errors and noise we must implement large thresholds which let some failures go undetected for too long. Other relationships must be developed concerning the in- formation available in order to improve the fault detection algo- rithms. Once these schemes have been perfected, we will be able to embed the algorithms in an expert system and integrate the simulation into the Trick simulation package to create a more flexible fault detection and fault tolerance simulator. The analysis of the fault trees in this paper will be useful in creating fault tolerant algorithms. Through an analysis of the structures, fault tolerance schemes must be developed which will attempt to maintain the health of the internal nodes in the pres- ence of failures in their children. Detection schemes and fault tolerant algorithms will be tested on the Trick robotic software testbed. Acknowledgments This work was supported in part by the National Science Foun- dation under grants MIP-8909498 and MSS-9024391 and in part REFERENCES 19 by a Mitre Corporation Graduate Fellowship and an NSF Grad- uate Fellowship. References 1. Bailey, R. W. and Quiocho, L. J., Trick Simulation En- vironment Developer’s Guide, NASA JSC Automa- tion and Robotics Division, Houston, TX, Beta-release edi- tion, February 1991. 2. Bloch, H. P. and Geitner, F. K., An Introduction to Machinery Reliability Assessment, Van Nostrand Reinhold, New York, NY, 1990. 3. Butler, R. W. and Stevenson, P. H., “The PAWS and STEM Reliability Analysis Programs,” NASA Technical Mem- orandum 100572, NASA Langley Research Center, Hamp- ton, VA, March 1988. 4. Chean, M. and Fortes, J. A., “A Taxonomy of Reconfig- uration Techniques for Fault-Tolerant Processor Arrays,” IEEE Computer, 23(1):55–67, January 1990. 5. Chow, E. Y. and Willsky, A. S., “Analytical Redundancy and the Design of Robust Failure Detection Systems,” IEEE Transactions on Automatic Control, AC- 29(7):603–614, July 1984. 6. Giarratano, J. and Riley, G., Expert Systems: Prin- ciples and Programming, PWS-Kent Publishing Co., Boston, MA, 1989. 7. Hamilton, D. L., “A Simulation of Robot Motion Control,” Advisors: I. D. Walker and J. K. Bennett, Rice Univer- sity, Department of Electrical and Computer Engineering, Houston, Texas, April 1991. 8. Horak, D. T., “Failure Detection in Dynamic Systems with Modeling Errors,” AIAA Journal of Guidance, Control, and Dynamics, 11(6):508–516, November- December 1988. 9. Maciejewski, A. A., “Fault Tolerant Properties of Kinemat- ically Redundant Manipulators,” In 1990 IEEE Confer- ence on Robotics and Automation, pages 638–642, Cincinnati, OH, May 1990. 10. Martensen, A. L. and Butler, R. W., “The Fault-Tree Com- piler,” NASA Technical Memorandum 89098, NASA Langley Research Center, Hampton, VA, January 1987. 11. Merrill, W. C., DeLaat, J. C., and Bruton, W. M., “Ad- vanced Detection, Isolation, and Accommodation of Sensor Failures - Real-Time Evaluation,” Journal of Guidance, Control, and Dynamics, 11(6):517–526, November- December 1988. 12. Stiffler, J. J. and Bryant, L. A., “CARE III Phase II Re- port - Mathematical Description,” NASA Contractor Report 3566, NASA Langley Research Center, Hampton, VA, 1982. 13. Tesar, D., Sreevijayan, D., and Price, C., “Four-Level Fault Tolerance in Manipulator Design for Space Operations,” In NASA First International Symposium on Measure- ment and Control in Robotics, volume 3, page J3.2.1, Houston, TX, June 1990. 14. Willsky, A. S., “A Survey of Design Methods for Failure De- tection in Dynamic Systems,” Automatica, 12:601–611, 1976. 15. Winokur, P. S., “Radiation-Hardened Circuits for Robotics,” Proc. Fourth Topical Meeting on 20 REFERENCES Robotics and Remote Systems, pages 311–315, Febru- ary 1991. 16. Wu, E., Diftler, M., Hwang, J., and Chladek, J., “A Fault Tolerant Joint Drive Systems for the Space Shuttle Remote Manipulator System,” In 1991 IEEE Interna- tional Conference on Robotics and Automation, pages 2504–2509, Sacremento, CA, April 1991.