title: Human-in-the-Loop Imitation Learning using Remote Teleoperation
authors: Mandlekar, Ajay; Xu, Danfei; Martín-Martín, Roberto; Zhu, Yuke; Fei-Fei, Li; Savarese, Silvio
* These authors contributed equally.
date: 2020-12-12

Imitation Learning is a promising paradigm for learning complex robot manipulation skills by reproducing behavior from human demonstrations. However, manipulation tasks often contain bottleneck regions that require a sequence of precise actions to make meaningful progress, such as a robot inserting a pod into a coffee machine to make coffee. Trained policies can fail in these regions because small deviations in actions can lead the policy into states not covered by the demonstrations. Intervention-based policy learning is an alternative that can address this issue -- it allows human operators to monitor trained policies and take over control when they encounter failures. In this paper, we build a data collection system tailored to 6-DoF manipulation settings, that enables remote human operators to monitor and intervene on trained policies. We develop a simple and effective algorithm to train the policy iteratively on new data collected by the system that encourages the policy to learn how to traverse bottlenecks through the interventions. We demonstrate that agents trained on data collected by our intervention-based system and algorithm outperform agents trained on an equivalent number of samples collected by non-interventional demonstrators, and further show that our method outperforms multiple state-of-the-art baselines for learning from the human interventions on a challenging robot threading task and a coffee making task. Additional results and videos at https://sites.google.com/stanford.edu/iwr.

Imitation Learning (IL) is a promising paradigm for learning complex manipulation skills by reproducing behaviors from human demonstrations [27, 33, 41]. Unlike interactive learning techniques, such as reinforcement learning, which generate large amounts of training data via autonomous exploration, the efficacy of IL is bounded by the cost of human demonstrations. This cost limits the amount of data available to train IL models. Consequently, models trained by IL can suffer from covariate shift: small errors in actions can bring the learner to unseen states that it has not been trained for.

To address this covariate shift problem, DAGGER-style methods [18, 23, 35] have an expert relabel dataset samples collected by the trained agent with actions that the expert would have taken. This allows training data to include samples that a trained agent is likely to encounter. For real-world robotic tasks, however, DAGGER-style data relabeling is often infeasible. For example, a 30-second manipulation task with 20 Hz robot control would require a human to relabel 600 state samples for every trajectory collected by the robot. Moreover, the human needs to estimate the correct robot action that should have been taken in each state. This kind of offline relabeling requires significant human effort and is prone to incorrect action labels [21].
Instead, it is more natural for humans to annotate actions in the loop, i.e., to monitor policy execution and take over control when intervention is needed [4, 18, 37].

Fig. 1: Manipulation tasks often contain bottleneck regions that require a series of precise actions to traverse successfully. Models trained on an offline set of human demonstrations easily fail in these regions due to action errors that compound. To address this issue, we built a system where a human operator observes a policy attempting to solve a manipulation task and intervenes when necessary to help solve the task. During an intervention, the human operator takes control of the arm from the policy, moves the robot arm into a state where the policy is likely to succeed, and then returns control to the policy. All data is aggregated into a dataset and the policy is re-trained on the newly collected data. This process repeats.

However, intervention-based learning has mostly been limited to 2D driving domains [18, 37] where an agent must learn a policy to stay on the road. Both the data collection and the policy learning are straightforward in this domain: humans can easily provide intervention actions in 2D, and the domain is tolerant to action error, since there is a large set of actions that keep the agent on the road. By contrast, in 6-DoF robot manipulation settings, certain regions of the state space can require precise sequences of actions to make meaningful task progress. These regions are much less tolerant to error, and a small deviation means the difference between success and failure. We call such regions bottlenecks. Consider the coffee making task shown in Fig. 1, where the robot must carefully insert the pod into the machine slot. States where the pod is close to the container form a bottleneck, since only a particular sequence of actions leads to successful insertion and any deviation will cause the pod to collide with the rim. Tasks with such bottlenecks are ideal testbeds for intervention-based learning, because small inaccuracies in the output actions can make IL agents susceptible to making mistakes in these regions.

Fig. 2: System Overview. We extend the RoboTurk system [26, 28] to enable remote users to monitor policy execution and intervene when they would like to take over control, using a web browser and a smartphone. The user monitors policy execution by watching a video stream of the robot in their web browser. They can hold a button on their smartphone to intervene and control the robot by moving their smartphone.

Making human interventions feasible for robot manipulation raises technical challenges on the system side as well as on the algorithmic side. First, we need a robust system that enables human demonstrators to monitor the robot behaviors and gain immediate full control of the robot when observing imminent risks. Second, we need an effective algorithm to learn from human intervention data. To this end, we develop a system suitable for collecting intervention data from remote users with 6-DoF control of robot end-effectors, and a simple yet effective method to leverage the human interventions. The key intuition behind our algorithm is that, in manipulation tasks, humans tend to intervene when the robot has difficulty "entering" a bottleneck and return control to the robot after traversing the bottleneck. Therefore, the human interventions are informative about both where task bottlenecks occur and how to traverse them. The algorithm we propose leverages these two signals for policy learning.
Specifically, we find that treating the intervention signal as an implicit reward function and performing weighted regression is highly effective for leveraging the new data correctly.

Summary of Contributions: 1) We develop a system that enables remote teleoperation for 6-DoF robot control and a natural human intervention mechanism well suited to robot manipulation. 2) We introduce Intervention Weighted Regression (IWR), a simple yet effective method to learn from human interventions that encourages the policy to learn how to traverse bottlenecks through the interventions. 3) We evaluate our system and method on two challenging contact-rich manipulation tasks: a threading task and a coffee machine task. We demonstrate that policies trained on data collected by our system outperform policies trained on an equivalent amount of full human demonstration trajectories, that IWR outperforms alternatives for learning from the intervention data, and that our results hold across data collected from multiple human operators.

Imitation Learning from Offline Demonstrations: Imitation learning can be used to train a policy from a set of offline expert demonstrations. A policy can be learned without any environment interaction, as in Behavioral Cloning (BC) [27, 33, 41], or using additional interaction with the environment, as in Inverse Reinforcement Learning (IRL) [1, 2]. However, policies trained with BC can suffer from covariate shift because they are trained completely offline [35]: it is easy for the policy to encounter unseen states during evaluation. Because of this problem, BC often requires a prohibitive number of expert samples to work well. By contrast, IRL generally requires fewer expert samples, but it relies on a prohibitive amount of agent interaction with the environment [16, 39, 42] due to the need to perform reinforcement learning with a learned reward function. We instead focus on human-in-the-loop policy learning, which can strike a better balance between the number of human samples and agent samples required for learning.

Human-in-the-Loop Policy Learning: Human-in-the-loop policy learning allows a human to provide additional supervision during the policy learning process. One paradigm is Reinforcement Learning (RL) with human feedback [40], where a human provides rewards during agent training [8, 11, 24, 25, 34, 36], but this suffers from the same limitations as IRL due to the need for extensive agent interaction. DAGGER [35] introduced a useful paradigm that can require fewer agent samples than IRL and fewer human samples than BC by asking an expert to relabel data collected by a trained agent with actions that should have been taken. However, DAGGER is not feasible for humans in practical scenarios (e.g., continuous 6D control) due to the relabeling procedure, which can be burdensome and prone to human error, especially in manipulation settings [21]. Prior work has attempted to reduce the number of human annotations needed [10, 23, 30], but relabeling is still required. Noise injection during expert demonstrations has also been proposed in order to correct for covariate shift [22]. Other paradigms for human-in-the-loop policy learning include collaboration [14, 15], teaching the robot through informative sample selection [7, 9, 17], and leveraging physical kinesthetic corrections [5, 6].

Learning from Interventions: In this work, we build a system that allows remote users to monitor policy execution and provide interventions, and a method that learns from the interventions intelligently.
While similar systems have been built recently [4, 13], our system is the only one allowing for remote web-based operation and the only one demonstrated on contact-rich manipulation tasks. Prior intervention-based approaches [4, 18] only leverage the human intervention samples for training the policy and discard the agent samples that lead to those interventions. However, training without these on-policy agent samples can cause the behavior of the agent to change significantly after training on the new dataset, inducing a new distribution of mistakes due to covariate shift. By contrast, including the on-policy agent samples during training can help alleviate this issue by encouraging the agent to visit states where human interventions are available, increasing the likelihood of successful bottleneck traversal. Spencer et al. [37] also found it useful to leverage all samples for learning, but their method only applies to discrete action settings and was only demonstrated in driving domains. Our method encourages successful bottleneck traversal in continuous control settings by re-weighting the dataset distribution to prioritize intervention samples over the on-policy samples, which are included for regularization.

Fig. 3: Intervention Weighted Regression. Every intervention trajectory (blue box) consists of portions where the policy was controlling the robot (green) and portions where the human was intervening (red). These are aggregated into separate datasets. The intervention portions usually start in the neighborhood of bottlenecks that require a structured sequence of actions to traverse successfully, and they demonstrate how to traverse them. During training (orange box), we sample equal-size batches from the non-intervention and intervention datasets and train with a behavioral cloning loss. Sampling equal-size batches from each dataset re-weights the data distribution to reinforce intervention actions that demonstrate bottleneck traversal, while the non-intervention samples provide regularization that keeps the policy close to previous policy iterates.

We formalize the problem of solving a robot manipulation task as an infinite-horizon discrete-time Markov Decision Process M = (S, A, T, R, γ, ρ_0), where S is the state space, A is the action space, T(·|s_t, a_t) is the state transition distribution, R(s_t, a_t, s_{t+1}) is the reward function, γ is the discount factor, and ρ_0(·) is the initial state distribution. At every step, an agent observes s_t, uses a policy π to choose an action, a_t = π(s_t), and observes the next state, s_{t+1} ∼ T(·|s_t, a_t), and reward, r_t = R(s_t, a_t, s_{t+1}). The goal is to learn a policy π that maximizes the expected return E[Σ_{t=0}^∞ γ^t r_t].

We now review some methods for learning from demonstrations. Behavioral Cloning (BC) [33] is a common and simple method for learning from a set of demonstrations D. It trains a policy π_θ(s) to clone the actions in the demonstrations with the objective arg min_θ E_{(s,a)∼D} ||π_θ(s) − a||². Policies trained with BC often suffer from covariate shift, because small action errors can result in a state visitation distribution ρ_π(s) that differs from the one provided in the demonstration data, ρ_D(s). To address this issue, Ross et al. [35] introduced DAGGER, which optimizes the objective arg min_θ E_{s∼ρ_{π_θ}(s)} ||π_θ(s) − π_D(s)||², where π_D is the demonstrator policy. This objective ensures that the policy imitates the demonstrator policy on its own induced distribution of states, instead of the demonstrator's distribution of states. To optimize this objective, DAGGER alternates between collecting state samples using the current policy iterate, (s, π_θ(s)), relabeling the visited states using the demonstrator policy, (s, π_D(s)), and updating the policy with BC.
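Both the BC objective and the relabel-then-retrain DAGGER update reduce to the same regression primitive over (state, action) pairs. The snippet below is a minimal sketch of one such behavioral cloning gradient step for a deterministic policy with continuous actions; the network, dimensions, and stand-in data are illustrative assumptions, not the paper's released code.

```python
import torch
import torch.nn as nn

# Illustrative dimensions: 10-D state, 4-D action (hypothetical).
policy = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 4))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def bc_update(states: torch.Tensor, actions: torch.Tensor) -> float:
    """One gradient step on arg min_theta E_(s,a)~D ||pi_theta(s) - a||^2."""
    optimizer.zero_grad()
    loss = ((policy(states) - actions) ** 2).sum(dim=-1).mean()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example call with random stand-in data in place of a demonstration batch.
demo_states, demo_actions = torch.randn(256, 10), torch.randn(256, 4)
bc_update(demo_states, demo_actions)
```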
However, relabeling all states in an offline manner is not feasible for humans in realistic continuous control tasks. Kelly et al. [18] introduced HG-DAGGER, a simple variant of DAGGER that does not require an explicit offline relabeling process. Instead, the demonstrator is allowed to intervene and control the agent during agent execution when they are unsatisfied with the agent's performance. Thus, state samples are collected under a mixture policy π(s) = G_H(s)π_H(s) + (1 − G_H(s))π_θ(s), where G_H(s) represents a gating function that corresponds to when the human decides to intervene. The intervention samples, {(s, a) | G_H(s) = 1}, are treated as the relabeled samples and are added to the dataset in order to train the agent, while the on-policy agent samples are discarded. However, training the agent for the next round without on-policy agent samples can cause agent behavior to change significantly due to covariate shift, especially in contact-rich manipulation settings such as those explored in this work. To alleviate this issue, we develop a method that includes on-policy agent samples as regularization, but prioritizes intervention samples that demonstrate successful bottleneck traversal.

We first describe our data collection system that allows remote operators to collect intervention data for manipulation tasks. Next, we discuss how learning from intervention data can be viewed as performing reinforcement learning where the human curates the dataset for the policy to learn from. Finally, we present Intervention Weighted Regression, an effective method for learning from intervention data that leverages this insight to re-weight the dataset distribution according to the human intervention timings.

We develop our data collection system on top of RoboTurk [26, 28] to allow remote operators to monitor policy execution and intervene when necessary. RoboTurk is a platform that allows users to collect task demonstrations through low-latency teleoperation. Users log on to a website that streams video from the robot workspace and use a smartphone as a motion controller to control the robot. The platform works for both simulated and real robots, although in this work we focus on simulated domains. The robot simulation runs on remote servers hosted in the cloud to make it simple for users to participate in data collection: they only require a smartphone and a web browser to stream video. We extend RoboTurk to enable remote users to monitor trained policies and intervene when necessary to help them improve (see Fig. 2). The user is able to pause and resume policy execution by tapping a button on their smartphone. The user can intervene by holding another button down and moving their phone in free space to apply relative pose commands to the robot arm. This provides a natural and simple way for users to intervene and apply corrections, since users can control the robot by moving or rotating their phone in a particular direction relative to the current robot pose. For example, the user can simply push their phone forward to make the arm move back, or twist their phone along a particular axis to apply the same relative rotation to the robot arm. Each recorded task demonstration now consists of a mix of human and policy samples.
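To make the structure of the collected data concrete, the following sketch shows how a mixed trajectory with per-step binary intervention labels might be recorded under the gated mixture policy described above. The interfaces env, policy, human_action, and human_is_intervening are hypothetical placeholders standing in for the simulator, the learned policy, and the smartphone teleoperation client; they are not the system's actual API.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List

@dataclass
class Step:
    state: Any
    action: Any
    intervention: bool  # True if the human supplied this action

@dataclass
class Trajectory:
    steps: List[Step] = field(default_factory=list)

def collect_trajectory(env, policy: Callable, human_action: Callable,
                       human_is_intervening: Callable, max_steps: int = 1000):
    """Roll out the mixture policy: the human's action is used whenever the
    operator holds the intervention button; otherwise the policy acts."""
    traj = Trajectory()
    state = env.reset()
    for _ in range(max_steps):
        intervening = human_is_intervening()
        action = human_action(state) if intervening else policy(state)
        traj.steps.append(Step(state, action, intervening))
        state, done = env.step(action)  # assumed (next_state, done) interface
        if done:
            break
    return traj
```

Trajectories recorded this way can be split into intervention and on-policy portions using the per-step flag, which is exactly the signal the learning method below relies on.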
We record each completed task demonstration with binary intervention labels, and leverage the datasets collected by our system for imitation learning. We first show how we can formulate intervention-based policy learning as an alternating optimization problem between the human and the policy. We then show that our algorithm emerges via assumptions on how the human solves this optimization problem.

We start with the RL objective, which is to find a policy π_θ that maximizes the expected return E_{p_{π_θ}(τ)}[R(τ)], where τ is a trajectory of states and actions, p_{π_θ}(τ) is the distribution of trajectories induced by policy π_θ, and R(τ) is the discounted return of the trajectory. We instead choose to maximize the log of the return, and then introduce a variational trajectory distribution q(τ) and a corresponding variational lower bound [3]:

J(θ, q) = E_{q(τ)}[log R(τ) + log p_{π_θ}(τ) − log q(τ)].   (1)

In our setting of intervention-based learning, we view q(τ) as a dataset distribution that is curated by the human. Eq. (1) can be maximized via Expectation-Maximization [12], which alternates between optimizing the trajectory distribution q while holding the policy fixed and optimizing the policy parameters θ while holding q fixed. Each round of intervention data collection and policy training corresponds to an iteration of EM. During each round, the human tries to maximize Eq. (1) by improving the dataset sample distribution q(τ) via interventions. This means that the human optimizes q(τ) = arg max_q J(θ, q), which can be written as

q(τ) = arg max_q E_{q(τ)}[log R(τ)] − KL(q(τ) || p_{π_θ}(τ)),   (2)

where KL denotes the Kullback-Leibler divergence between the variational trajectory distribution and the one induced by the current policy. A human optimizes this objective by choosing to intervene in different regions of the state space to improve the task success rate of the trajectories in the dataset. Notice that the KL penalty in Eq. (2) is implicitly encouraged in intervention data collection because all samples are on-policy, except for the human intervention samples. Then, the base policy for the next iteration is trained by solving θ = arg max_θ J(θ, q) over the dataset curated by the human, which corresponds to the following objective:

θ = arg max_θ E_{q(τ)}[log p_{π_θ}(τ)] = arg max_θ E_{q(τ)}[Σ_t log π_θ(a_t | s_t)].   (3)

If using a deterministic policy, Eq. (3) reduces to the BC loss arg min_θ E_{(s,a)∼q(τ)} ||π_θ(s) − a||², where q(τ) is curated by the human during intervention data collection. Notice that the density q(τ) effectively assigns an importance to each state-action sample, which is equivalent to weighting the BC loss on a per-sample basis. A number of works have leveraged this insight to develop RL algorithms [3, 29, 31, 32]. Thus, we can view intervention-based policy learning as optimizing a lower bound on the reinforcement learning objective via EM, where each iteration consists of a human curating a data distribution, followed by a policy update given by BC over the curated data distribution. In the next section, we discuss how this perspective can be leveraged to make better use of human interventions in manipulation tasks.
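Before moving on, the per-sample weighting interpretation above can be made concrete. The sketch below shows a weighted behavioral cloning loss, i.e., the deterministic-policy form of Eq. (3) where the weights stand in for the density that q assigns to each state-action sample; the policy network and tensors are illustrative PyTorch stand-ins, not the authors' implementation.

```python
import torch

def weighted_bc_loss(policy, states, actions, weights):
    """Mean of w * ||pi_theta(s) - a||^2 over a batch."""
    per_sample = ((policy(states) - actions) ** 2).sum(dim=-1)
    return (weights * per_sample).mean()
```

Uniform weights recover standard BC; IWR, described next, chooses the weights implicitly through how it samples from the intervention and non-intervention datasets.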
Different assumptions for how the human carries out the optimization in Eq. (2) to produce the data distribution q(τ) result in different algorithms. A simple assumption is that the human specifies q(τ) directly via the intervention signal: every sample where the human is intervening and controlling the robot is included in q, and the rest of the on-policy samples are discarded and not used for learning. This is what HG-DAGGER [18] does. However, excluding on-policy samples from the dataset distribution can cause the policy trained in the next round to change substantially from the current policy and induce a significantly different distribution of policy failures due to covariate shift. The optimization in Eq. (2) captures this intuition through the KL penalty, which rewards the q distribution for including samples close to the policy's distribution of states and actions. Thus, we instead assume that the human-curated distribution q(τ) includes both intervention samples D_I and on-policy samples D_R. However, q also needs to assign a density to each sample; as mentioned earlier, samples with higher density correspond to higher importance. Fortunately, the human intervention timings can be indicative of the importance of samples in manipulation domains. This is because robot manipulation tasks often contain bottlenecks, which are regions of the state space that require a structured sequence of actions to traverse successfully (such as the task in Fig. 1). Our key insight is that policies trained on offline demonstrations are most likely to incur errors near bottlenecks, and consequently, the timing of the human interventions identifies the locations of these bottlenecks and how to traverse them. Actions that successfully traverse bottlenecks should be reinforced over others in a trajectory.

The dataset collected via human interventions consists of a set of intervention data samples, D_I, and on-policy data samples collected by the robot, D_R. We assume that the human is specifying q(τ) by re-weighting the distribution of data with a parameter α such that

q(s, a) ∝ α ρ_I(s, a) + ρ_R(s, a),   (4)

where ρ_I(s, a) and ρ_R(s, a) are the state-action distributions of the intervention and non-intervention samples respectively. The weight α corresponds to the amount of prioritization given to the intervention samples. We call our method Intervention Weighted Regression (IWR) to reflect how reinforcement occurs by re-weighting the dataset distribution according to the user interventions. In practice, we find that instead of tuning the value of α per dataset, choosing α = |D_R|/|D_I|, so that q(s, a) samples from ρ_I(s, a) and ρ_R(s, a) in equal proportion, performs well across datasets (see Fig. 3). This is equivalent to behavioral cloning with the objective θ = arg min_θ ||π_θ(s_I) − a_I||² + ||π_θ(s_R) − a_R||², with (s_I, a_I) ∼ D_I and (s_R, a_R) ∼ D_R.
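The following is a minimal sketch of this balanced-batch IWR update in PyTorch. The policy network, dimensions, hyperparameters, and random stand-in tensors are illustrative assumptions rather than the authors' released training code.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

state_dim, action_dim, batch_size = 10, 7, 16
policy = nn.Sequential(nn.Linear(state_dim, 100), nn.ReLU(),
                       nn.Linear(100, action_dim))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

# Separate datasets: human intervention samples D_I and on-policy samples D_R
# (random stand-in tensors here).
d_int = TensorDataset(torch.randn(200, state_dim), torch.randn(200, action_dim))
d_rob = TensorDataset(torch.randn(800, state_dim), torch.randn(800, action_dim))
loader_int = DataLoader(d_int, batch_size=batch_size, shuffle=True)
loader_rob = DataLoader(d_rob, batch_size=batch_size, shuffle=True)

# Drawing equal-size batches from each dataset realizes the re-weighting in
# Eq. (4) with alpha = |D_R| / |D_I|. (In practice the smaller dataset would
# be re-sampled so both streams last a full epoch; zip simply truncates here.)
for (s_i, a_i), (s_r, a_r) in zip(loader_int, loader_rob):
    loss = ((policy(s_i) - a_i) ** 2).sum(dim=-1).mean() + \
           ((policy(s_r) - a_r) ** 2).sum(dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```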
In this section, we evaluate our system and method on challenging contact-rich manipulation tasks that require precise control. We seek to demonstrate that policies trained on data collected by our system can outperform policies trained on an equivalent amount of full human demonstration trajectories, and that IWR outperforms alternatives for learning from the intervention data. All tasks were designed using MuJoCo [38] and the robosuite framework [43] (see Fig. 4). The workspace consists of a Sawyer robotic arm in front of a table. The arm is controlled using an Operational Space Controller [19].

Threading: The robot arm must pick up a wooden rod and insert it into a wooden ring (Fig. 4). The locations of the wooden rod and ring are randomized at the start of each episode. This task contains two bottlenecks: the grasping of the rod and the insertion into the ring. The insertion must be performed carefully, since the wooden ring can move easily if the rod hits it. The observation space consists of the end effector pose, gripper finger positions, and the poses of the wooden rod and ring.

Coffee Machine: The robot arm must pick up a pod, insert it into the holder, and then close the lid of the coffee machine (Fig. 4). The location of the pod is randomized at the start of each episode. This task has three bottlenecks: grasping, insertion, and closing. Both the pod grasping and insertion must be precise, as small errors will cause the pod to slip out of the hand or fail to be inserted into the holder. The observation space consists of the end effector pose, gripper finger positions, the poses of the pod and pod holder, the hinge angle of the lid, and binary contact indicators between the pod, pod holder, and gripper fingers.

We compare against the following baselines. IWR-NB: Same as IWR, but all dataset samples are added to one dataset; no dataset balancing takes place, and uniform sampling is used. HG-DAGGER: As in [4, 18], at each round of intervention data collection, only the samples where the user was intervening are added to the dataset, while the policy samples are discarded, and policies are trained with BC. Full Demos: A human operator collects full task demonstrations instead of interventions; the trajectories are added to the dataset and policies are trained with BC.

We conducted two studies to demonstrate the utility of our system and our method for learning from intervention data. In the first, a single operator collected datasets on the Threading task. In the second, three human operators collected datasets on the Coffee Machine task. Each operator started with an initial dataset that consisted of 30 task demonstrations, and a base policy that was trained on that dataset. For each intervention-based method, the operator performed 3 additional rounds of data collection. During each round, the operator collected trajectories until the number of intervention data samples reached roughly 33% of the initial dataset samples, both to ensure that all methods would receive the same number of human-annotated samples at each round, regardless of base policy quality, and to be consistent with prior work [18]. After each round, the base policy for the next round was obtained by training for a fixed number of epochs. For the Full Demos baseline, each operator collected a single dataset of human demonstration trajectories with the same number of samples as the initial dataset. All policies are 2-layer LSTMs with hidden size 100, trained on a sequence length of 10 using the Adam optimizer [20]. To evaluate each method, policies were saved at a fixed rate and evaluated with 50 rollouts per checkpoint. All success rates presented in each table reflect the maximum average success rate obtained by each run over all model checkpoints, for 3 training runs with different seeds.
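As a concrete reference for the policy class used in these experiments, the sketch below shows a 2-layer LSTM policy with hidden size 100 operating on length-10 state sequences. The state and action dimensions are illustrative assumptions; this is not the authors' exact model definition.

```python
import torch
import torch.nn as nn

class LSTMPolicy(nn.Module):
    def __init__(self, state_dim: int, action_dim: int, hidden_size: int = 100):
        super().__init__()
        self.lstm = nn.LSTM(state_dim, hidden_size, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden_size, action_dim)

    def forward(self, states, hidden=None):
        # states: (batch, seq_len, state_dim) -> actions: (batch, seq_len, action_dim)
        out, hidden = self.lstm(states, hidden)
        return self.head(out), hidden

policy = LSTMPolicy(state_dim=10, action_dim=7)
seq = torch.randn(16, 10, 10)   # batch of length-10 state sequences
actions, _ = policy(seq)        # predicted actions at every step
```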
Does data collected using our intervention-based system improve task performance more than an equivalent amount of full demonstration samples? We present results on the Threading datasets collected by a single operator in Table I. The results show that our method outperforms the Full Demos baseline by a significant margin (87.3% vs. 76.7%). This trend holds even for the intermediate rounds, showing that intervention-based data collection can produce higher quality policies with fewer human-annotated samples. We also see that the other intervention-based baselines do not necessarily improve upon the Full Demos performance: both HG-DAGGER and IWR-NB reach roughly the same level of performance as the Full Demos baseline. Only our method is able to consistently leverage intervention data to outperform the Full Demos baseline.

Does our method outperform baselines that learn from intervention data? As shown in Table I, our method outperforms both the HG-DAGGER baseline and the variant of our method without balancing consistently in each round on the Threading task.

Are results consistent across multiple human operators? We present results averaged across 3 different operators on the Coffee Machine task in Table II. The results show that our method consistently outperforms the HG-DAGGER baseline in each round, leading to an average final task performance of 87.5% (over 35% improvement over the original base policy), while the baseline reaches 69.6% success (about 18% improvement). Furthermore, the average Full Demos performance is 64.9% (about 13% improvement). Together, these results demonstrate the value of intervention-based data collection over collecting full human demonstrations, and of intelligently leveraging both the human intervention and non-intervention samples for learning.

Does our method outperform other methods on datasets that were collected using their intermediate policies? We take the final aggregated dataset from each intervention-based method and train policies on it with all methods. The results in Table III and Table IV demonstrate that our method consistently outperforms the other baselines on their collected datasets and can reach a level of performance close to that on its own collected dataset. This suggests that the other baselines do not fail purely due to lower quality data or worse base policies at each iteration, but due to the way they leverage the data.

Is there value in doing more rounds of intervention data collection compared to collecting the same amount of data in a single round? We had a single operator collect a large Round 1 intervention dataset with an equivalent number of intervention samples as was collected during all 3 rounds of intervention data collection, for both tasks. We found that our method did not exhibit a significant difference in final performance on the Threading task, and reached about 7% lower average success rate on the Coffee Machine task, when trained on the large single-round intervention dataset compared to performing iterative data collection. This suggests that longer tasks containing more bottlenecks might benefit more from iterative data collection than shorter tasks, which makes sense: the number of potential mistakes that the policy can make increases with the number of task bottlenecks, and a single base policy may not cover this space sufficiently well.

We built a data collection system that allows remote operators to monitor trained policies and intervene when necessary to help the policy complete the task. We developed a simple and effective method to leverage such intervention data that reshapes the data distribution to prioritize bottleneck traversal via the timing of the human interventions, which is important in manipulation settings. We demonstrated that training an agent on intervention data with our method substantially outperforms other intervention-based baselines and is more effective than training the agent with an equivalent number of full human demonstration trajectories. We showed that our results hold over multiple human operators and that our method can more effectively learn from intervention data even when other methods' base policies were used to collect and aggregate data.
This makes our method ideal for crowdsourced settings [26, 28], since we anticipate that data will be obtained from a variety of trained base policies and human operators. We plan to explore this in future work, as well as conduct data collection with physical robot arms.

References:
[1] Apprenticeship learning via inverse reinforcement learning.
[2] Inverse reinforcement learning.
[3] Maximum a posteriori policy optimisation.
[4] Fighting Failures with FIRE: Failure Identification to Reduce Expert Burden in Intervention-Based Learning.
[5] Learning from physical human corrections, one feature at a time.
[6] Learning robot objectives from physical human interaction.
[7] Machine teaching for inverse reinforcement learning: Algorithms and applications.
[8] Scaling data-driven robotics with reward sketching and batch reinforcement learning.
[9] Algorithmic and human teaching of sequential decision tasks.
[10] Interactive policy learning through confidence-based autonomy.
[11] Deep reinforcement learning from human preferences.
[12] Using expectation-maximization for reinforcement learning.
[13] Helping Robots Learn: A Human-Robot Master-Apprentice Model Using Demonstrations via Virtual Reality Teleoperation.
[14] Pragmatic-pedagogic value alignment.
[15] Cooperative inverse reinforcement learning.
[16] Generative adversarial imitation learning.
[17] Showing versus doing: Teaching by demonstration.
[18] HG-DAGGER: Interactive imitation learning with human experts.
[19] A unified approach for motion and force control of robot manipulators: The operational space formulation.
[20] Adam: A method for stochastic optimization.
[21] Comparing human-centric and robot-centric sampling for robot deep learning from demonstrations.
[22] DART: Noise injection for robust imitation learning.
[23] SHIV: Reducing supervisor burden in DAgger using support vectors for efficient learning from demonstrations in high dimensional state spaces.
[24] Learning behaviors via human-delivered discrete feedback: modeling implicit feedback strategies to speed up learning.
[25] Interactive learning from policy-dependent human feedback.
[26] Scaling Robot Supervision to Hundreds of Hours with RoboTurk: Robotic Manipulation Dataset through Human Reasoning and Dexterity.
[27] Learning to Generalize Across Long-Horizon Tasks from Human Demonstrations.
[28] RoboTurk: A Crowdsourcing Platform for Robotic Skill Learning through Imitation.
[29] Fitted Q-iteration by advantage weighted regression.
[30] Policies for active learning from demonstration.
[31] Advantage-weighted regression: Simple and scalable off-policy reinforcement learning.
[32] Reinforcement learning by reward-weighted regression for operational space control.
[33] ALVINN: An autonomous land vehicle in a neural network.
[34] Learning human objectives by evaluating hypothetical behavior.
[35] A reduction of imitation learning and structured prediction to no-regret online learning.
[36] End-to-end robotic reinforcement learning without reward engineering.
[37] Learning from Interventions: Human-robot interaction as both explicit and implicit feedback.
[38] MuJoCo: A physics engine for model-based control.
[39] Positive-unlabeled reward learning.
[40] Leveraging human guidance for deep reinforcement learning tasks.
[41] Deep Imitation Learning for Complex Manipulation Tasks from Virtual Reality Teleoperation.
[42] Reinforcement and imitation learning for diverse visuomotor skills.
[43] robosuite: A Modular Simulation Framework and Benchmark for Robot Learning.

We would like to thank Albert Tung and Josiah Wong for helping with data collection. Ajay Mandlekar acknowledges the support of the Department of Defense (DoD) through the NDSEG program.
We acknowledge the support of Toyota Research Institute ("TRI"); this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity.