Grasping small objects surrounded by unstable or
non-rigid material plays a crucial role in applications such as
surgery, harvesting, construction, disaster recovery, and assisted
feeding. This task is especially difficult when fine manipulation is
required in the presence of sensor noise and perception errors;
this inevitably triggers dynamic motion, which is challenging to
Similar challenges arise in everyday interactions: removing shells
from flowing egg whites, grasping noodles from soup, and, for surgeons, removing clots from deformable organs.
Given the ubiquitous nature of these problems, developing robotic
solutions to automate them has immense practical and economic value.
Example tasks: picking up egg yolks, picking up noodles, and removing clots from deformable organs.
This work presents CherryBot, an RL system for fine manipulation
that surpasses human reactivity for some dynamic
grasping tasks. By carefully designing the training paradigm and
algorithm, we study how to make a real-world robot learning
system sample efficient and general while reducing the human
effort required for supervision. Our system shows continual
improvement through only 30 minutes of real-world interaction:
through reactive retries, it achieves an almost 100% success rate
on the demanding task of using chopsticks to grasp small objects
swinging in the air. We demonstrate the reactiveness, robustness
and generalizability of CherryBot to varying object shapes
and dynamics in zero-shot settings (e.g., external disturbances
like wind and human perturbations).
Our system, CherryBot, can handle challenging
dynamic fine manipulation tasks in the real world. CherryBot
operates in three phases: (1) pretraining in simulation on the
proxy task, (2) fine-tuning in the real world on the same proxy
task, and (3) deploying in the real world on test tasks. We then
evaluate the resulting learned policy in a variety of dynamic
scenarios. The image on the right gives a closer look at our
hardware setup. Our robot is an assembled 6-DOF robotic arm
equipped with chopsticks to perform fine manipulation, paired with
either a motion capture cage or an RGB-D camera for perception.
After pretraining in simulation, our system can efficiently fine-tune in the real world.
Aside from time taken up by resets, our system fine-tunes in as little as 30 minutes of interaction!
This fine-tuning procedure is robust and reproducible, as illustrated below in a timelapse of fine-tuning over 3 different seeds.
A typical training run takes around 2.5 hours: 0.5 hours of interaction and 2 hours of resets.
How Reactive is the Agent?
Below are examples of a human, a handwritten 100Hz controller, and our RL agent
attempting to grab a ball that is being bounced in the air by a motor.
How Generalizable is the Agent?
We chose evaluation tasks to effectively test our agent's ability to
generalize from the proxy task to more practical settings.
In particular, these tasks feature real-world complications that
are not present at training time, allowing us to test the agent's
ability to compensate for them at test time.
How Robust is the Agent?
Through the use of a challenging proxy task, our agent can stay robust to unmodeled dynamics.
The following video showcases the robot successfully grabbing a ball that is being shaken around by a human.
Asynchronous updates, a high update-to-data (UTD) ratio, and LayerNorm
regularization yield a moderate improvement in sample efficiency while pretraining in simulation,
but greatly improve the speed of training during the real-world
We found that standard online RL algorithms (such as Soft
Actor-Critic (SAC)) are far more effective for fine-tuning
than targeted offline RL methods.
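To make the update-to-data (UTD) ratio mentioned above concrete, here is a minimal Python sketch. It is not CherryBot's training code: the function name `train_with_utd`, the placeholder transitions, and every parameter value are assumptions chosen for illustration. A real agent would run one SAC gradient step at the point where the counter is incremented.

```python
import random
from collections import deque


def train_with_utd(num_env_steps, utd_ratio, batch_size=4, seed=0):
    """Count gradient updates in an off-policy loop with a given UTD ratio.

    Hypothetical helper: each environment step adds one transition to a
    replay buffer and is followed by `utd_ratio` minibatch updates.
    """
    rng = random.Random(seed)
    replay_buffer = deque(maxlen=1000)
    gradient_updates = 0

    for step in range(num_env_steps):
        # Placeholder transition; a real system would store
        # (state, action, reward, next_state) from the robot or simulator.
        replay_buffer.append((step, rng.random()))

        # Wait until the buffer can supply a full minibatch.
        if len(replay_buffer) < batch_size:
            continue

        # High UTD: many updates per collected transition.
        for _ in range(utd_ratio):
            batch = rng.sample(list(replay_buffer), batch_size)
            assert len(batch) == batch_size
            gradient_updates += 1  # stand-in for one actor/critic update

    return gradient_updates


# Same amount of collected data, very different amounts of learning.
print(train_with_utd(num_env_steps=10, utd_ratio=1))
print(train_with_utd(num_env_steps=10, utd_ratio=20))
```

Raising `utd_ratio` extracts more learning from each real-world transition, which matters most when data collection on hardware is the bottleneck.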
Insight: Pre-training using standard off-policy RL methods
with imperfect prior data from simulation and heuristic
controllers can significantly help with sample efficiency for
The choice of control frequency critically affects
the horizon of the task. A higher control frequency
increases the number of time steps required to complete a task,
which lengthens the horizon and hurts sample efficiency.
The impact of latency becomes negligible at lower control
frequencies, because a longer control step shrinks the ratio of
latency to control-step duration.
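The two trade-offs above come down to simple arithmetic, sketched below in Python. The task duration (2 s), the latency (20 ms), and the candidate frequencies are hypothetical numbers chosen only to illustrate the relationship; they are not measurements from our system.

```python
def horizon_steps(task_duration_s: float, control_hz: float) -> int:
    """Number of control steps needed to cover a task of a given duration."""
    return round(task_duration_s * control_hz)


def latency_in_steps(latency_s: float, control_hz: float) -> float:
    """Latency expressed as a fraction of one control step."""
    return latency_s * control_hz


# Hypothetical values for illustration only.
TASK_DURATION_S = 2.0
LATENCY_S = 0.02

for hz in (10, 30, 100):
    steps = horizon_steps(TASK_DURATION_S, hz)
    ratio = latency_in_steps(LATENCY_S, hz)
    print(f"{hz:>3} Hz: horizon = {steps} steps, latency = {ratio:.2f} control steps")
```

At the lowest frequency the latency is a small fraction of a control step and the horizon is short; at the highest frequency the latency spans multiple steps and the horizon is ten times longer.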
Insight: Learning medium-frequency hybrid controllers can
effectively balance policy reactivity against the tractability
Can we use different perception pipelines?
Our system leverages an external state estimation module, so any perception pipeline can be used at test time.
Experiments in our paper use a simple HSV filtering and contour detection algorithm, but this system easily generalizes
to SOTA perception modules.
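As a rough illustration of color-based state estimation, the sketch below thresholds an image in HSV space and returns the center of the matching pixels. It is a simplified stand-in, not our pipeline: it replaces contour detection with a plain centroid over all matching pixels, uses only Python's standard `colorsys` module instead of an image library, and the function names and threshold values are assumptions.

```python
import colorsys


def hsv_mask(image, hue_range, s_min, v_min):
    """Return a boolean mask of pixels inside the given HSV thresholds.

    `image` is a list of rows of (r, g, b) tuples with values in 0..255.
    Hue, saturation, and value are compared on colorsys's 0..1 scale.
    """
    mask = []
    for row in image:
        mask_row = []
        for r, g, b in row:
            h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
            hit = hue_range[0] <= h <= hue_range[1] and s >= s_min and v >= v_min
            mask_row.append(hit)
        mask.append(mask_row)
    return mask


def mask_center(mask):
    """Centroid (x, y) of all True pixels, or None if the mask is empty."""
    points = [(x, y) for y, row in enumerate(mask) for x, hit in enumerate(row) if hit]
    if not points:
        return None
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    return (cx, cy)


# Tiny synthetic frame: a 2x2 red blob on a black background.
BLACK, RED = (0, 0, 0), (255, 0, 0)
frame = [[BLACK] * 4 for _ in range(4)]
for y in (1, 2):
    for x in (1, 2):
        frame[y][x] = RED

print(mask_center(hsv_mask(frame, hue_range=(0.0, 0.05), s_min=0.5, v_min=0.5)))
```

Because the policy only consumes the estimated object position, any module that produces such a center can be swapped in.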
In these demos, we use an off-the-shelf YOLOv5 model to predict the bounding box of objects on the table,
the center of which is fed into the trained policy.
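Turning a detection into a policy input can be sketched as follows. The detection format `(x_min, y_min, x_max, y_max, confidence)`, the function name `best_detection_center`, and the confidence threshold are assumptions for illustration; the sketch does not call the YOLOv5 library itself.

```python
def best_detection_center(detections, min_confidence=0.5):
    """Center (x, y) of the most confident bounding box, or None.

    `detections` is a list of (x_min, y_min, x_max, y_max, confidence)
    tuples in pixel coordinates (a hypothetical format for this sketch).
    """
    confident = [d for d in detections if d[4] >= min_confidence]
    if not confident:
        return None  # nothing reliable to feed into the policy this frame
    x_min, y_min, x_max, y_max, _ = max(confident, key=lambda d: d[4])
    return ((x_min + x_max) / 2, (y_min + y_max) / 2)


# Example: two candidate boxes; only the second passes the threshold.
detections = [(0, 0, 10, 10, 0.4), (20, 40, 60, 80, 0.9)]
print(best_detection_center(detections))
```

Returning `None` when no box is confident enough lets the caller decide how to handle a missed detection, for example by reusing the previous estimate.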
Segment Anything Model
SAM is a generalizable segmentation system released by Meta AI.
We can use this system to predict the segmentation mask of the object and feed the mask's center into CherryBot.
Pictured below are the predicted masks and calculated object centers taken from SAM.