Hardware as Policy: Mechanical and Computational Co-Optimization using Deep Reinforcement Learning
Tianjian Chen, Zhanpeng He, Matei Ciocarlie (2020-08-11)

Abstract. Deep Reinforcement Learning (RL) has shown great success in learning complex control policies for a variety of applications in robotics. However, in most such cases, the hardware of the robot has been considered immutable, modeled as part of the environment. In this study, we explore the problem of learning hardware and control parameters together in a unified RL framework. To achieve this, we propose to model aspects of the robot's hardware as a "mechanical policy", analogous to and optimized jointly with its computational counterpart. We show that, by modeling such mechanical policies as auto-differentiable computational graphs, the ensuing optimization problem can be solved efficiently by gradient-based algorithms from the Policy Optimization family. We present two such design examples: a toy mass-spring problem, and a real-world problem of designing an underactuated hand. We compare our method against traditional co-optimization approaches, and also demonstrate its effectiveness by building a physical prototype based on the learned hardware parameters.

Human "intelligence" resides in both the brain and the body: we can develop complex motor skills, and the mechanical properties of our bones and muscles are also adapted to our daily tasks. Numerous motor skills exhibit this phenomenon, from running (where the stiffness of the Achilles tendon has been shown to maximize locomotion efficiency [1]) to grasping (where coordination patterns between finger joints emerge from both synergistic muscle control and mechanical coupling of joints [2]). Mechanical adaptation and motor skill improvement can happen simultaneously, both over an individual's lifetime (e.g. [3]) and at evolutionary time scales. For example, it has been suggested that, as early hominids practiced throwing and clubbing, hand morphology also changed accordingly, as the thumb got longer to provide better opposition [4].

In robotics, the idea of jointly designing/optimizing the mechanical and computational aspects has a long track record with remarkable advances, exploiting the fact that the morphology, transmissions, and control policies are tightly connected by the laws of physics and co-determine the robot behavior. If the policy and dynamics can be modeled analytically, traditional optimization can derive the desired values for hardware and policy parameters. When such an approach is not feasible (for example due to complex policies or dynamics), evolutionary computation has been used instead. However, these methods still have difficulty learning sophisticated motor skills in complex environments (e.g. partially observable states, dynamics with transient contacts), or are sample-inefficient in such cases. In contrast, recent advances in Deep Reinforcement Learning (Deep RL) have shown great potential for learning difficult motor skills despite having only partial information of complex, unstructured environments (e.g. [5, 6, 7]). Traditionally, the output of a Deep RL policy in robotics consists of motor commands, and the robot hardware converts these motor commands to effects on the external world (usually through forces and/or torques).
In this conventional RL perspective, robot hardware is considered given and immutable, essentially treated as part of the environment (Fig 1a). Consider the concrete example of an underactuated robot hand. Motor forces are converted into joint torques by a transmission mechanism, consisting of gears, tendons or linkages. Through careful design and optimization of the hardware parameters, such a transmission can provide compliant underactuation, greatly increasing the ability of the hand to grasp a wide range of objects (e.g. [8, 9]). Such a transmission is conceptually akin to a policy, mapping an input (motor forces) to an output (joint torques) with carefully tuned parameters leading to beneficial effects for overall performance.

Can we leverage the power of Deep RL for co-optimization of the computational and mechanical components of a robot? Effective sim-to-real transfer, where a policy is trained on a physics simulator and only then deployed on real hardware [10], provides such an opportunity, since it allows modifications of design parameters during training without incurring the prohibitive cost of re-building hardware. In such a case, a straightforward option is to treat these hardware parameters as hyperparameters of the RL algorithm, and optimize them via hyperparameter tuning. However, this approach carries a prohibitive computational cost.

In this study, we propose an approach that considers hardware as policy, optimized jointly with the traditional computational policy. As is well known, a model-free Policy Optimization (e.g. [11, 12]) or Actor-Critic (e.g. [13]) algorithm can train using an auto-differentiable agent/policy and a non-differentiable black-box environment. The core idea we propose is to move some part of the robot hardware out of the non-differentiable environment and into the auto-differentiable agent/policy (Fig 1b). In this way, the hardware parameters become parameters in the policy graph, analogous to and optimized in the same fashion as the neural network weights and biases. Therefore, the optimization of hardware parameters can be directly incorporated into the existing RL framework, and can use existing learning algorithms with only minor changes in the computational graphs. We summarize our major contributions as follows:

• To the best of our knowledge, we are the first to express hardware aspects as a policy, in a way that allows an optimization algorithm to include gradients of actions w.r.t. hardware parameters and computational parameters.
• Via case studies comprising both a toy problem and a real-world design challenge, we show that such gradient-based methods are superior to hyperparameter tuning as well as gradient-free evolutionary strategies for hardware-software co-optimization.
• To the best of our knowledge, we are the first to build a physical prototype to validate a Deep RL-based co-optimization approach, in the form of a compliant underactuated robot hand.

The first category of related work comprises studies using analytical dynamics and classical control. An early example is from Park and Asada [14]. Paul and Bongard [15], Geijtenbeek et al. [16] and Ha et al. [17] performed optimizations of mechanical and control or planning parameters for legged locomotors. All studies above require an analytical model of the complete mechanical-control system, which is non-trivial in complex problems.
More recent work that uses classical control but evaluates and iterates on real hardware is [18], which optimizes micro robots with Bayesian Optimization. However, the goal of that work is different from ours: it aims to drastically decrease the number of real-world design evaluations, a need our work avoids altogether through simulation and sim-to-real transfer.

Evolutionary computation provides another way to approach this problem. This research path originated from studies on the evolution of artificial creatures [19], where the morphology and the neural systems are both encoded as graphs and generated using genetic algorithms. Lipson and Pollack [20] introduced an automatic lifeform design technique using bars, joints, and actuators as building blocks of the morphology, with neurons attached to them as controllers. A series of works from Cheney et al. [21, 22] studied the morphology-computation co-evolution of cellular automata, in the context of locomotion. Nygaard et al. [23] presented a method that optimizes the morphology and control of a quadruped robot using real-world evaluation of the robot. Evolutionary strategies, which are gradient-free, have significant promise, but also exhibit high computational complexity and data-inefficiency compared to recent gradient-based optimization methods.

The recent resurgence of reinforcement learning provides a new perspective on this co-optimization problem. Ha [24] augmented the REINFORCE algorithm with rewards calculated using the mechanical parameters. Schaff et al. [25] proposed a joint learning method to construct and control the agent, which models both design and control in a stochastic fashion and optimizes them via a variation of Proximal Policy Optimization (PPO). Vermeer et al. [26] presented a study of two-dimensional linkage mechanism synthesis using a Decision-Tree-based mechanism representation fused with Reinforcement Learning. Luck et al. [27] presented a method for data-efficient co-adaptation of morphology and behaviors based on Soft Actor-Critic (SAC), leveraging previously tested morphologies and behaviors to estimate the performance of new candidates. In all the studies above, hardware parameters are still optimized separately and iteratively with the computational policies, whereas we aim to optimize both together in a unified framework. In addition, none of these works show physical prototypes based on the co-optimized agent. Recent work on general-purpose auto-differentiable physics [28, 29, 30, 31] is also very relevant to our approach, which relies on modeling (part of) the robot hardware as an auto-differentiable computational graph. We hope to make use of such recent advances in general differentiable physics simulation in further iterations of our method.

We start from a standard RL formulation, where the problem of optimizing an agent to perform a certain task can be modeled as a Markov Decision Process (MDP), represented by a tuple (S, A, F, R), where S is the state space, A is the action space, R(s, a) is the reward function, and F(s'|s, a) is the state transition model (s, s' ∈ S, where s' denotes the state at the next time step, and a ∈ A). Behavior is determined by a computational control policy π^comp_θ(a|s), where θ represents the parameters of the policy. Usually, π^comp_θ is represented as a deep neural network, with θ consisting of the network's weights and biases. The goal of learning is to find the values of the policy parameters that maximize the expected return $\mathbb{E}\left[\sum_{t=0}^{T} R(s_t, a_t)\right]$, where T is the length of the episode.
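For concreteness, the following is a minimal sketch, not the authors' implementation, of how such a computational policy and its policy-gradient update can look in code; the diagonal-Gaussian parameterization, the network sizes, and the plain gradient step (rather than the trust-region or clipped surrogate updates of TRPO/PPO) are illustrative assumptions.

```python
import math
import tensorflow as tf

# Illustrative diagonal-Gaussian computational policy pi_theta(a|s): a small MLP
# outputs the action mean, and a free variable holds the log standard deviation.
# Together, their weights and biases form the parameters theta.
obs_dim, act_dim = 4, 1
policy_net = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="tanh", input_shape=(obs_dim,)),
    tf.keras.layers.Dense(32, activation="tanh"),
    tf.keras.layers.Dense(act_dim),
])
log_std = tf.Variable(tf.zeros(act_dim))

def log_prob(obs, act):
    """Log-likelihood of actions under the Gaussian policy."""
    mean = policy_net(obs)
    var = tf.exp(2.0 * log_std)
    return tf.reduce_sum(
        -0.5 * ((act - mean) ** 2 / var + 2.0 * log_std + math.log(2.0 * math.pi)),
        axis=-1)

optimizer = tf.keras.optimizers.Adam(3e-4)

def policy_gradient_step(obs, act, advantages):
    """One vanilla policy-gradient update on a batch of (s_t, a_t, A_t) from rollouts."""
    params = policy_net.trainable_variables + [log_std]
    with tf.GradientTape() as tape:
        # Maximizing E[log pi(a|s) * A] is done by minimizing its negative.
        loss = -tf.reduce_mean(log_prob(obs, act) * advantages)
    grads = tape.gradient(loss, params)
    optimizer.apply_gradients(zip(grads, params))
```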
We start from the observation that, in robotics, in addition to the parameters θ of the computational policy, the design parameters of the hardware itself, denoted here by φ, play an equally important role for task outcomes. In particular, hardware parameters φ help determine the output (the effect on the outside world) that is produced by a given input to the hardware (motor commands). This is perfectly analogous to the way computational parameters θ help determine the output of the computational policy (the action a) that is produced by a given input (the state or observations s).

Even though this analogy exists, traditionally, these two classes of parameters have been treated very differently in RL. Computational parameters can be optimized via gradient-based methods: taking Policy Optimization (e.g. Trust Region Policy Optimization (TRPO) [11] and Proximal Policy Optimization (PPO) [12]) as an example family of learning algorithms, the parameters of the computational policy are optimized by computing and following the policy gradient $\hat{g} = \hat{\mathbb{E}}_t\left[\nabla_\theta \log \pi^{comp}_\theta(a_t|s_t)\, A_t\right]$, where A_t is the advantage function. In contrast, hardware is generally considered immutable, and modeled as part of the environment. Formally, this means that hardware parameters φ are considered parameters of the transition function F = F_φ(s'|s, a) instead of the policy. This is the concept illustrated in Fig. 1a. Such a formulation is grounded in the most general RL framework, where F is not modeled analytically, but only observed by execution on real hardware. In such a case, changing φ can only be done by building a new prototype, which is generally impractical.

However, in recent years, the robotics community has made great advances in training via a computational model of the transition function F, often referred to as a physics simulator (e.g. [32]). The main drivers have been the need to train using many more samples than possible with real hardware, and to ensure safety during training. Recent results have indeed shown that it is often possible to train exclusively using an imperfect analytical model of F, and then transfer to the real world [10]. In our context, training with such a physics simulator opens new possibilities for hardware design: we can change the hardware parameters φ and test different hardware configurations on-the-fly inside the simulator, without incurring the cost of re-building a prototype.

The Hardware as Policy method (HWasP) proposed here aims to perform largely the same optimization for hardware parameters as we do for computational policy parameters, i.e. by computing and following the gradient of action probabilities w.r.t. such parameters. The core of the HWasP method is to model the effects of the robot hardware we aim to optimize separately from the rest of the environment. We refer to this component as a hardware policy, and denote it by π^hw_φ(a_new|s, a). The input to the hardware policy consists of the action produced by the computational policy (i.e. a motor command) and other components of the state; the output lies in a redefined action space A_new, further discussed below. In the traditional formulation outlined so far, the "hardware policy" and its parameters φ are included in the transition function F_φ. With HWasP, π^hw_φ becomes part of the agent. The new overall policy π_θ,φ = π^hw_φ(a_new|s, a) π^comp_θ(a|s) comprises the composition of both the computational and the mechanical policy, while the new transition probability F_new = F_new(s'|s, a_new) encapsulates the rest of the environment.
In other words, we have split the simulation of the environment into two: one part consists of the mechanical policy, now considered part of the agent, while the other simulates all other components of the robot, as well as the external environment. The reward function R(s, a) is redefined to be associated with the new action space: R_new(s, a_new). Once this modification is performed, we aim to run the original Policy Optimization algorithm on the new tuple (S, A_new, F_new, R_new) as redefined above. However, in order for this to be feasible, two key conditions have to be met.

Condition 1: The redefined action vector a_new must encapsulate the interactions between the mechanical policy and the rest of the environment. In other words, this new action interface must comprise all the ways in which the hardware we are optimizing effects change on the rest of the environment. Furthermore, the redefined action vector must be low-dimensional enough to allow for efficient optimization. Such an interface is problem-specific. Forces/torques between the robot and the environment make good candidates, as we will exemplify in the following sections.

Condition 2: To use Policy Optimization algorithms, we need to efficiently compute the gradient of the redefined action probability w.r.t. hardware parameters. We further discuss this condition next.

Computational Graph Implementation (HWasP). In order to meet Condition 2 above, we propose to simulate the part of the hardware we care to optimize as a computational graph. In this way, the gradients can be computed by auto-differentiation and can flow, or back-propagate, through the hardware policy. Similar to the computational policy, the gradient of the log-likelihood of actions w.r.t. mechanical parameters φ can be computed as ∇_φ log π^hw_φ(a_new|s, a). Critically, since the computational policy is also generally expressed as a computational graph, the gradient can back-propagate through both the hardware policy and the computational policy, i.e., the hardware and computational parameters are optimized jointly, and in the same fashion. This general idea is illustrated in Fig. 1b. However, this approach is predicated on being able to simulate the effects of the hardware being optimized as a computational graph. Once again, the exact form of this simulation is problem-specific, and can be considered a key part of the algorithm. In the next sections, we illustrate how this can be done both for a toy problem and for a real-world design problem, and regard these implementations as an intrinsic part of the contribution of this work.

Minimal Implementation (HWasP-Minimal). In the general case, where should the split between the (differentiable) hardware policy and the (non-differentiable) rest of the environment simulation be performed? In particular, what if the hardware we care to optimize does not lend itself to a differentiable simulation using existing methods? Even in such a case, we argue that a "minimal" hardware policy is always possible: we can simply put the hardware parameters into the output layer of the original computational policy. In this case, a_new = [a, φ]^T. Here, the policy gradient with respect to the hardware parameters is trivial, but can still be useful to guide the update of the parameters. When this case is implemented in practice, the transition function F(s'|s, a_new) typically operates in two steps: first, it sets the new values of the hardware parameters in the underlying simulator, then it advances the simulation to the next step.
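As a concrete illustration of this two-step transition, here is a minimal sketch of a Gym-style environment wrapper for HWasP-Minimal; the wrapper class, the bounds arguments, and the set_hardware_parameters hook are hypothetical stand-ins for whatever interface the underlying simulator actually exposes.

```python
import numpy as np
import gym

class HWasPMinimalEnv(gym.Wrapper):
    """Exposes the redefined action a_new = [a, phi] on top of a simulator whose
    native action is the motor command a. The hardware parameters phi are written
    into the simulator before every step (the two-step transition)."""

    def __init__(self, env, num_hw_params, hw_low, hw_high):
        super().__init__(env)
        self.num_hw_params = num_hw_params
        low = np.concatenate([env.action_space.low, np.full(num_hw_params, hw_low)])
        high = np.concatenate([env.action_space.high, np.full(num_hw_params, hw_high)])
        self.action_space = gym.spaces.Box(low=low, high=high, dtype=np.float32)

    def step(self, a_new):
        a = a_new[:-self.num_hw_params]      # original motor command
        phi = a_new[-self.num_hw_params:]    # hardware parameters, e.g. spring stiffnesses
        # Step 1: set the sampled hardware parameters in the (non-differentiable) simulator.
        self.env.set_hardware_parameters(phi)   # hypothetical simulator hook
        # Step 2: advance the simulation with the motor command only.
        return self.env.step(a)
```

Because φ is part of the action in this variant, the policy gradient with respect to φ is estimated purely from sampled returns, which is what makes HWasP-Minimal simple to implement but less sample-efficient than full HWasP.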
We illustrate HWasP-Minimal in Fig. 2, which can be directly compared to the general HWasP in Fig. 1b. HWasP-Minimal is simple to implement since it does not require a physics-based auto-differentiable hardware policy. As outlined in the following sections, this version still performs at least as well as or better than our baselines, but below HWasP.

Comparison Baselines. We compare HWasP and HWasP-Minimal against the following baselines:
• CMA-ES with RL inner loop: here, we treat hardware parameters as hyperparameters, using the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) algorithm [33] in an outer loop that optimizes hardware parameters, while learning the policy using RL algorithms (e.g. PPO or TRPO) in an inner loop for each set of sampled hardware parameters.
• CMA-ES: here, we use CMA-ES as a gradient-free evolutionary strategy to directly learn both the computational policy and the hardware parameters, without a separate inner loop.

We present here a simple one-dimensional implementation of our method on the mass-spring system in Fig. 3(a). Two point masses, connected by a massless bar, hang in the standard gravity field under n parallel springs whose stiffnesses are k_1, ..., k_n. A motor can apply a controllable force to the lower mass. The behavior is governed by a computational policy that regulates the motor force, but also by the hardware parameters (spring stiffnesses). We note that, since the springs are all parallel, only the sum of their stiffnesses matters, but we still treat the stiffness of each individual spring as a parameter, as a way to test how our methods scale to higher-dimensional problems. The input to the computational policy consists of y_2 and ẏ_2, and the output of the computational policy is the motor current i. The goal is to optimize both the computational policy that regulates the motor force and the hardware parameters, such that the lower mass goes to the red target line (y_2 = h) and stays there, with minimum motor effort. (The exact formulation of the reward function we use is presented in the Supplementary Materials.)

Hardware as Policy. In this case, we include the effect of the parallel springs in the mechanical policy. Using Hooke's Law, we model the spring effects as a simple computational graph, with k_1, ..., k_n as parameters. The output of this computational graph is the total spring force f_spr applied to the masses. The redefined action a_new consists of the total resultant force f_total = f_str + f_spr, where f_str is the force applied by the motor. The transition function F (the rest of the environment) implements Newton's Law for the two masses, treating f_total as an external force. Additional details of the implementation, including the structure of the computational graph, can be found in the Supplementary Materials.

Hardware as Policy - Minimal. Here, we simply redefine the action vector to also include the spring stiffnesses: a_new = [i, k_1, ..., k_n]^T. The transition function F is responsible for modeling the dynamics of the springs and the two masses.
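The following is a minimal sketch of the Hooke's-law hardware policy described above for this toy problem, written as a differentiable TensorFlow graph. It is our own illustrative reconstruction rather than the released code: the initial stiffness value, the assumed spring rest position Y_REST, the assumption that the springs attach to the upper mass at position y1, and the sign conventions are assumptions, while k_T and r_shaft match the constants listed in the appendix.

```python
import tensorflow as tf

# Trainable hardware parameters: one stiffness per parallel spring.
n_springs = 10
k = tf.Variable(tf.fill([n_springs], 5.0), name="spring_stiffnesses")  # [N/m], initial guess

# Fixed (non-optimized) constants; k_T and r_shaft as in the appendix.
K_T, R_SHAFT = 0.001, 0.001   # motor torque constant [N*m/A], motor shaft radius [m]
Y_REST = 0.0                  # assumed upper-mass position at which the springs are relaxed [m]

def hardware_policy(i_motor, y1):
    """Differentiable mechanical policy for the mass-spring toy problem.

    Takes the computational policy's output (motor current) and the relevant state
    (upper-mass position) and returns the redefined action f_total. Because k is a
    tf.Variable inside this graph, gradients of the action w.r.t. the stiffnesses
    flow back through it during the policy-gradient update."""
    f_str = i_motor * K_T / R_SHAFT                 # motor current -> force via the motor model
    f_spr = -tf.reduce_sum(k) * (y1 - Y_REST)       # Hooke's law for the parallel springs
    return f_str + f_spr                            # a_new = f_total
```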
Results. Fig. 3(b) shows a comparison of the training curves for both implementations of our method, as well as the other baselines, for two cases: one with 10 parallel springs, and one with 50 parallel springs. In both cases, HWasP learns an effective joint policy that moves the lower mass to the target position. HWasP-Minimal works equally well for the smaller problem, but suffers a drop in performance as the number of hardware parameters increases. CMA-ES with RL inner loop also learns a joint policy, but learns more slowly than our method, especially for the larger problem. CMA-ES by itself does not exhibit any learning behavior over the number of samples tested. For the numerical results of the optimized stiffnesses, please refer to the Supplementary Materials.

In this section we show how HWasP can be applied to a real-world design problem: optimizing both the mechanism and the control policy for an underactuated robot hand. The high-level goal of this problem, similar to the one introduced by Chen et al. [34] and illustrated in Fig 4, is to design a robot hand that is simultaneously versatile (able to grasp objects of different shapes) and compact. In order to achieve the stated compactness goal, all joints are driven by a single motor, via an underactuated transmission mechanism: a single tendon connects to the motor, then splits to actuate all joints in the flexion direction (see Fig 4). Finger extension is passive, provided by preloaded torsional springs. The mechanical parameters that govern the behavior of this mechanism consist of the tendon pulley radii of each joint, as well as the stiffness values and preload angles of the restoring springs. Here, we look to simultaneously optimize these hardware parameters along with a computational policy that determines how to position the hand and when to use the motor. The input to the computational policy consists of the position vectors of the palm and the object, the size vector of the object bounding box, and the current hand motor travel and motor torque. Its output contains relative motor travel and palm motion commands. The hardware parameters we aim to optimize consist of all parameters of the underactuated transmission listed above. It is important to note that, in this study, we do not try to optimize the kinematic structure or topology of the hand. Unlike the underactuated transmission, these aspects do not lend themselves to parameterization and implementation as computational graphs, preventing the use of the HWasP method in its current form. While HWasP-Minimal could still be applied, we leave that for future investigations.

We tested our method on two grasping tasks: 1. top-down grasping with only z-axis motion of the palm (Z-Grasp); 2. top-down grasping with 3-dimensional palm motion (3D-Grasp). The former is a simplified version of the latter, and easier to train. Additional details on the problem formulation and training can be found in the Supplementary Materials. Note that, since the hardware parameters can be large in scale compared to the weights and biases of the neural network, a small change in them can lead to a large shift in the joint policy's output distribution during training. This kind of large distribution shift can result in a local optimum in the reward landscape. Hence, we hope to improve policy performance while keeping changes to the joint policy's output distribution small. In this problem, we use TRPO [11] because it allows for hard constraints on the change of the action distribution. We also apply Domain Randomization [10] during training to increase the chance of successful sim-to-real transfer. We randomized object shape, size, weight, friction coefficient and inertia, injected sensor and actuation noise, and applied random disturbance wrenches to the hand-object system.

Hardware as Policy. In this case, we model the complete underactuated transmission as a computational graph and include it in our mechanical policy.
The input to the mechanical policy consists of the commanded motor travel (output by the computational policy), as well as the current joint angles. Its output consists of hand joint torques. To perform this computation, we use a tendon model that computes the elongation of the tendon in response to motor travel and joint positions, then uses that value to compute tendon forces and joint torques. Details of this model, as well as its implementation as an auto-differentiable computational graph, can be found in the Supplementary Materials. The redefined action a_new contains the palm position command output by the computational policy, and the joint torques produced by the mechanical policy. The rest of the environment comprises the hand-object system without the tendon underactuation mechanism, i.e. with independent joints.

Hardware as Policy - Minimal. In this case, all hardware parameters are simply appended to the output of the computational policy. The underactuated transmission model is part of the environment, along with the rest of the hand as well as the object.

Our results are shown in Fig. 5. In the case of the Z-Grasp problem (left), HWasP learns an effective computational/hardware policy, albeit with some measure of instability in the learning curve. HWasP-Minimal also learns, but lags in performance. Neither evolutionary strategy shows any learning behavior over a similar number of training steps. We also tried a version of the same problem with the search range for the hardware parameters reduced by a factor of 8 (middle plot). Here, all methods except CMA-ES obtain similarly effective policies, but HWasP is still the most efficient. Finally, we investigated performance on the more complex 3D-Grasp task. With a large search range, neither method was able to learn. However, with a reduced search range, HWasP was able to learn an effective policy, while neither CMA-ES-based method displayed any learning behavior over a similar timescale. The values of the hardware parameters resulting from the optimizations are shown in the Supplementary Materials.

Validation with Physical Prototype. To validate our results in the real world, we physically built the hand with the parameters resulting from the co-optimization. The hand is 3D printed, and actuated by a single position-controlled servo motor. Fig. 4 shows some grasps obtained by this physical prototype, compared to their simulated counterparts. We note that, by virtue of a large number of simulation samples covering different grasp types with different object shapes, sizes and other physical properties, the hand is versatile and can perform both stable fingertip grasps and enveloping grasps of different objects in reality. We plan to also test the computational policy on a real robot arm after the campus reopens from COVID-19.

Our results show that the HWasP approach is able to learn combined computational and mechanical policies. We attribute this performance to the fact that HWasP connects the different hardware parameters via a computational graph based on the laws of physics, and can provide the physics-based gradient of the action probability w.r.t. the hardware parameters. The HWasP-Minimal implementation does not provide such information, and its policy gradient can only be estimated via sampling, which is usually less efficient, particularly for high-dimensional problems. In consequence, HWasP-Minimal also shows the ability to learn effective policies, but with reduced performance.
Compared to gradient-free evolutionary baselines for joint hardware-software co-optimization, HWasP always learns faster, while HWasP-Minimal is at least as effective as the best baseline algorithm. We note that combining an RL inner loop for the computational policy with a CMA-ES outer loop for the hardware parameters proved more effective than directly using CMA-ES for the complete problem. Still, HWasP outperforms both methods. The biggest advantage of HWasP-Minimal is that, like gradient-free methods, it does not depend on auto-differentiable physics, and is widely applicable with straightforward implementations to various problems using existing non-differentiable physics engines. We believe that our methods represent a step towards a framework where an algorithm designer can "tune the slider" to decide how much physics to include in the computational policy, based on the trade-offs between computational efficiency, ease of development, and the availability of auto-differentiable physics simulations.

In its current stage, our work still presents a number of limitations. In particular, HWasP suffers from stability issues when the parameter search range is large. We suspect that this is due to the relative scale of the hardware parameters (imposed by the laws of physics), which can be large enough to scale the gradient through the hardware computational graph and create instability. Partly due to this problem, the computational aspects of the policies we have explored so far are relatively simple (e.g. limiting hand motion to 1 or 3 DOF). We hope to explore more challenging robotic tasks in future work, for example 6-DOF grasping problems. Finally, we also aim to include additional hardware aspects in the optimization, such as mechanism kinematics, morphology, or link dimensions.

We believe the proposed idea of considering hardware as part of the policy will enable the co-design of hardware and software using existing RL toolkits, with changes in the computational graph structure but no changes in the learning algorithms. We hope this work can open up new opportunities for task-based hardware-software co-design of robots and other intelligent systems, for researchers both in RL and in the hardware domain.

A.1 Details for the Mass-spring Toy Problem

Problem Formulation. We have presented the problem formulation in the paper (except for the exact form of the reward function), and we list these bullet points again here for better readability.
• The observations we can measure are the mass position y_1 and velocity ẏ_1.
• The input of this system is the motor current i, which is the output of the computational policy.
• The variables to optimize are the weights and biases of the neural network, as well as all the spring stiffnesses k_1, k_2, ..., k_n.
• The goal is to make the mass m_2 go to the red target line in Fig 6 and stay there, with minimum input effort.

We designed a two-stage reward function that rewards smaller position and velocity errors when the mass m_2 is far from the goal or moving fast, and in addition rewards lower input current when the mass is close to the goal and almost still. In this function, α, β, and γ are the weighting coefficients, i_max is the upper bound of the motor current, and a hand-tuned threshold determines when the mass is considered close to the goal and almost still.

Shared Implementation Details. We implemented HWasP, HWasP-Minimal, as well as our two baselines: CMA-ES with an RL inner loop, and CMA-ES. In order to have a fair comparison between them, we intentionally made the different cases share common aspects wherever possible.
The physics parameters not being optimized are the same for all cases: m_1 = m_2 = 0.1 kg, l = 0.1 m, h = 0.2 m, g = 9.8 m/s^2, k_T = 0.001 N·m/A, r_shaft = 0.001 m. The initial conditions are random within the feasible range. The initial values of the total spring stiffness are sampled from 0 to 100 N/m. We used the PPO and CMA-ES implementations in the Garage package [35] for all cases. We implemented the computational graphs for HWasP in TensorFlow, and implemented the dynamics of the rest of the environment (non-differentiable) ourselves using mid-point Euler integration. In the computational policies, the neural networks have 2 layers with 32 nodes each. The episode length is 1000 environment steps, and the total number of steps is 4 × 10^6.

Hardware as Policy. We use Hooke's Law for the parallel springs and the current-torque relationship of the motor (ignoring rotor inertia and friction) to model the mechanical part of our agent. We implement the computational graph of this hardware policy and combine it with a neural-network computational policy, as shown in Fig 7a.

Hardware as Policy - Minimal. In this case (Fig. 7b), we only add the hardware parameters (the spring stiffness vector k) to the action. The environment is governed by the physics of the spring-mass system, but takes k in when simulating the next time step.

Numerical Results. If we ignore the transient phase of the task in this toy problem and make a quasi-static assumption, there is a total spring stiffness value k* for which gravity pulls the mass m_2 exactly to the target, so that the steady-state input current i can be zero, which minimizes the current penalty in the return over a long enough horizon. In the real system the dynamics matter, but the optimized total stiffness k should still be close to this value given a long horizon. After training, we indeed find the optimized total stiffness close to k*, as shown in Table 1.

A.2 Details for the Co-Design of an Underactuated Hand

Tendon Underactuation Model. In a tendon-driven underactuated hand, the tendon mechanism acts as the transmission that converts motor states to joint states. The model we constructed assumes an elastic tendon (with stiffness k_tend) routed through multiple revolute joints by wrapping around circular pulleys (radii r_pul); each joint is closed by the tendon and opened by a restoring spring (stiffness k_spr and preload angle θ^pre_spr). We note that we resolve this otherwise statically indeterminate physics in a time-shifted fashion: the model assumes a nominal (finite) tendon stiffness, takes in the commanded relative motor travel Δx_mot, the motor position reading x_mot and the joint angles θ_joint from the previous time step, and computes the joint torques τ_joint for the current time step. These torques are then commanded to the joints in the physics simulation. The tendon elongation can be calculated as $\Delta l_{tend} = (x_{mot} + \Delta x_{mot}) - \sum_{j} r_{pul,j} \left(\theta_{joint,j} - \theta^{ref}_{joint,j}\right)$, where θ^ref_joint is the joint angle when the motor is in its zero position, and usually we define it to be zero. Then the tendon force can be calculated as $f_{tend} = k_{tend}\, \Delta l_{tend}$. Hence, the torques applied to the joints are $\tau_{joint} = f_{tend}\, r_{pul} - k_{spr} * \left(\theta_{joint} + \theta^{pre}_{spr}\right)$, where * means element-wise multiplication. This model is built as an auto-differentiable computational graph in HWasP, and as a non-differentiable model on top of the physics simulation in HWasP-Minimal and the baselines.
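Below is a minimal sketch of this tendon transmission as a differentiable TensorFlow graph, our own illustrative reconstruction rather than the released implementation; the initial parameter values, the nominal tendon stiffness K_TEND, and the clamp that prevents a slack tendon from pushing are assumptions, while the variable names mirror the symbols in the equations above.

```python
import tensorflow as tf

# Trainable underactuation parameters, one entry per joint (proximal and distal
# joints of the thumb and of the two coupled opposing fingers).
n_joints = 4
r_pul = tf.Variable(tf.fill([n_joints], 0.005), name="pulley_radii")        # [m], initial guess
k_spr = tf.Variable(tf.fill([n_joints], 0.10), name="spring_stiffnesses")   # [N*m/rad], initial guess
theta_pre = tf.Variable(tf.fill([n_joints], 0.10), name="spring_preloads")  # [rad], initial guess

K_TEND = 2000.0   # assumed nominal tendon stiffness [N/m], not optimized
THETA_REF = 0.0   # joint angles at the motor zero-position

def hardware_policy(x_mot, dx_mot, theta_joint):
    """Map motor travel (reading plus commanded relative travel) and the previous-step
    joint angles to joint torques, following the reconstructed equations above."""
    # Tendon elongation: motor take-up minus the tendon taken up by joint flexion.
    dl_tend = (x_mot + dx_mot) - tf.reduce_sum(r_pul * (theta_joint - THETA_REF))
    # A tendon can only pull, so the force is clamped at zero when it goes slack.
    f_tend = K_TEND * tf.maximum(dl_tend, 0.0)
    # Tendon flexion torque minus the restoring-spring extension torque, element-wise.
    return f_tend * r_pul - k_spr * (theta_joint + theta_pre)

# The resulting torques, concatenated with the palm motion command from the computational
# policy, form the redefined action a_new sent to the MuJoCo hand-object simulation.
```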
Problem Formulation.
• The observations are the position vector of the palm p_palm, the position vector of the object p_obj, the size vector of the object bounding box l_obj, and the current hand motor travel x_mot and torque τ_mot. We can also send the joint angles θ_joint to the hardware policy (used only at training time), but not to the computational policy, because there are no joint encoders in such an underactuated hand, and the computational policy (which will serve as the controller of the real robot at run time) does not have access to joint angles.
• The input of this system is the relative motor travel Δx_mot and the relative palm motion Δp_palm, which are produced by the computational policy.
• The variables to optimize are the parameters of the neural network, and the hand underactuation parameters: the pulley radii r_pul, the joint restoring spring stiffnesses k_spr and the joint restoring spring preload angles θ^pre_spr, where each vector has a dimension of four, corresponding to the proximal and distal joints of the thumb and of the opposing fingers (the two fingers share the same parameters).
• The goal is to grasp the object and lift it up.

The reward function combines, with weighting coefficients α and β, the distance between the palm position p_palm and the object position p_obj, the number of contacts C between the distal links and the object, and a hand-tuned non-decreasing piecewise-constant function f(z_obj) of the object height z_obj.

Shared Implementation Details. Similar to the toy problem, we used the Garage package [35] and TensorFlow for HWasP, HWasP-Minimal and the two baselines in this design case. We use MuJoCo [32] for the physics simulation of the hand. The simulation time step is 0.001 s, the environment step is 0.01 s, and there are 500 environment steps per episode and 2 × 10^7 steps for the entire training. The initial height of the hand, as well as the type, weight, size, and friction coefficient of the object, are randomly sampled within a reasonable range. We also added random perturbation forces and torques on the hand-object system to encourage more stable grasps. We use TRPO to explicitly limit the shift of the action distribution. The computational policy is a fully-connected neural network with 2 layers and 128 hidden units in each layer.

Hardware as Policy. As shown in Fig. 8a, we implement the underactuation model as the hardware policy in the computational graph, which allows the hardware parameters to be optimized via auto-differentiation and back-propagation. We redefine the action in the RL formulation to be the relative palm position command as well as the joint torques. The environment then becomes the hand-object system without the tendon underactuation mechanism, i.e. with independent joints.

Hardware as Policy - Minimal. We also implemented the HWasP-Minimal method by incorporating all hardware parameters (pulley radii r_pul, joint restoring spring stiffnesses k_spr and preload angles θ^pre_spr) into the original control output (relative hand position Δp_palm and motor command Δx_mot), as shown in Fig. 8b. The environment (containing the underactuated hand and the object) takes in all these actions, sets the hardware parameters and performs the simulation.

Numerical Results. Our results show that we can learn effective hardware parameters. The resulting pulley radii, spring stiffnesses and preload angles, obtained using HWasP and HWasP-Minimal respectively, are shown in Table 2.
We note that the resulting parameters from the two methods do not necessarily need to be identical: the optimal set of underactuation parameters is not unique by nature (for example, scaling them does not change the grasping behavior; likewise, a higher spring stiffness and a higher spring preload have similar effects), the evaluation is noisy since we intentionally injected noise, and the gradient-based training process may also settle in local optima of the optimization landscape.

Validation with Physical Prototype. Here we present some details of the physical prototype we built. The hand is 3D printed using polylactide (PLA). All eight joints of the three fingers are actuated by a single servo motor (DYNAMIXEL XM430-W210-T) equipped with a 12-bit encoder and current sensing. The tendons are made of ultra-high-molecular-weight polyethylene (commercially known as Spectra®). In each finger joint, there is a pulley with the designed radius and two parallel springs whose stiffnesses add up to the designed stiffness. The distal joint pulleys are fixed to the finger structures, and the proximal joint pulleys are free-rotating. The CAD model, tendon routing scheme, joint design, and finger trajectory are shown in Fig. 9.

We built the physical prototype of the optimized underactuated hand in a "work from home" situation, and validated the resulting hardware parameters. Unfortunately, we are not able to mount the hand on a robot to validate the computational policy due to the campus lockdown. We aim to test the computational policy immediately after the lab re-opens. We strongly believe our jointly optimized policy can be effectively transferred to reality for the following reasons:
• We applied Domain Randomization to many physical parameters and processes (a minimal sampling sketch is given below). We randomized object shape (among sphere, box, cylinder, ellipsoid), size (bounding box size uniformly sampled from 40 to 100 mm), weight (uniformly sampled from 100 to 500 g), friction coefficient (uniformly sampled from 0.5 to 1.0) and inertia (each principal component uniformly sampled from 0.0001 to 0.005 kg·m^2). We also injected sensor and actuation noise (Gaussian noise with 1 mm and 0.01 rad standard deviation for translational and rotational joints respectively), and applied random disturbance wrenches (Gaussian disturbances with 0.02 N and 0.002 N·m standard deviation for force and torque respectively) on the hand-object system.
• In simulation, we limited the hand motion to be close to quasi-static, and used position control to drive the palm and hand joints. This control scheme is not sensitive to inaccurate parameters or unmodeled dynamics, and can effectively reject disturbances.

Even though we are not able to test the computational policy, the validation of the hand itself, operated by a person, still has value. We show that the resulting mechanical parameters can effectively create useful joint coordination and finger trajectories, and produce a variety of stable grasps as long as the palm is positioned close to the desired picking location.
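As an illustration of the randomization scheme listed above, here is a minimal sketch of per-episode parameter sampling; the function and field names are hypothetical, while the numeric ranges follow the values given above.

```python
import numpy as np

def sample_randomized_episode(rng: np.random.Generator) -> dict:
    """Sample one set of domain-randomized parameters (illustrative only)."""
    return {
        "object_shape": rng.choice(["sphere", "box", "cylinder", "ellipsoid"]),
        "bounding_box_size_m": rng.uniform(0.040, 0.100),    # 40-100 mm
        "mass_kg": rng.uniform(0.100, 0.500),                # 100-500 g
        "friction_coeff": rng.uniform(0.5, 1.0),
        "inertia_kgm2": rng.uniform(0.0001, 0.005, size=3),  # principal components
        # Noise and disturbance magnitudes, applied at every simulation step:
        "joint_noise_std": {"translational_m": 0.001, "rotational_rad": 0.01},
        "disturbance_std": {"force_N": 0.02, "torque_Nm": 0.002},
    }

# Example: params = sample_randomized_episode(np.random.default_rng(0))
```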
References

[1] Is Achilles tendon compliance optimised for maximum muscle efficiency during locomotion?
[2] Neural bases of hand synergies.
[3] Associations between balance and muscle strength, power performance in male youth athletes of different maturity status.
[4] Evolution of the human hand: the role of throwing and clubbing.
[5] Learning complex dexterous manipulation with deep reinforcement learning and demonstrations.
[6] Learning dexterous in-hand manipulation.
[7] Learning to walk via deep reinforcement learning.
[8] Kinetostatic analysis of underactuated fingers.
[9] A compliant, underactuated hand for robust manipulation.
[10] Domain randomization for transferring deep neural networks from simulation to the real world.
[11] Trust region policy optimization.
[12] Proximal policy optimization algorithms.
[13] Continuous control with deep reinforcement learning.
[14] Concurrent design optimization of mechanical structure and control for high speed robots.
[15] The road less travelled: Morphology in the optimization of biped robot locomotion.
[16] Flexible muscle-based locomotion for bipedal creatures.
[17] Computational co-optimization of design parameters and motion trajectories for robotic systems.
[18] Data-efficient learning of morphology and controller for a microrobot.
[19] Evolving virtual creatures.
[20] Automatic design and manufacture of robotic lifeforms.
[21] Unshackling evolution: evolving soft robots with multiple materials and a powerful generative encoding.
[22] Topological evolution for embodied cellular automata.
[23] Real-world evolution adapts robot morphology and control to hardware limitations.
[24] Reinforcement learning for improving agent design.
[25] Jointly learning to construct and control agents using deep reinforcement learning.
[26] Kinematic synthesis using reinforcement learning.
[27] Data-efficient co-adaptation of morphology and behaviour with deep reinforcement learning.
[28] End-to-end differentiable physics for learning and control.
[29] A differentiable physics engine for deep learning in robotics.
[30] ChainQueen: A real-time differentiable physical simulator for soft robotics.
[31] Differentiable programming for physical simulation.
[32] MuJoCo: A physics engine for model-based control.
[33] Completely derandomized self-adaptation in evolution strategies.
[34] Underactuation design for tendon-driven hands via optimization of mechanically realizable manifolds in posture and torque spaces.
[35] Garage: A toolkit for reproducible reinforcement learning research.