Cherry-Picking with Reinforcement Learning

RSS 2023

*Equal Contribution
1Carnegie Mellon University, 2University of Washington

The Problem

Grasping small objects surrounded by unstable or non-rigid material plays a crucial role in applications such as surgery, harvesting, construction, disaster recovery, and assisted feeding. This task is especially difficult when fine manipulation is required in the presence of sensor noise and perception errors; this inevitably triggers dynamic motion, which is challenging to model precisely.

Similar challenges arise in everyday interactions: removing shells from flowing egg whites, grasping noodles from soup, and, for surgeons, removing clots from deformable organs. Given the ubiquitous nature of these problems, developing robotic solutions to automate them has immense practical and economic value.

Image of egg shells in egg yolk

Picking up egg yolks

Image of noodles being picked up with chopsticks

Picking up noodles

Image of micro-surgery

Removing clots from deformable organs

This work presents CherryBot, an RL system for fine manipulation that surpasses human reactivity for some dynamic grasping tasks. By carefully designing the training paradigm and algorithm, we study how to make a real-world robot learning system sample efficient and general while reducing the human effort required for supervision. Our system shows continual improvement through only 30 minutes of real-world interaction: through reactive retries, it achieves an almost 100% success rate on the demanding task of using chopsticks to grasp small objects swinging in the air. We demonstrate the reactiveness, robustness and generalizability of CherryBot to varying object shapes and dynamics in zero-shot settings (e.g., external disturbances like wind and human perturbations).

Our System

Our system, CherryBot, can handle challenging dynamic fine manipulation tasks in the real world. CherryBot operates in three phases: (1) pretraining in simulation on the proxy task, (2) fine-tuning in the real world on the same proxy task, and (3) deploying in the real world on test tasks. We then evaluate the resulting learned policy in a variety of dynamic scenarios. The system figure below gives a closer look at our hardware setup. Our robot is an assembled 6-DOF robotic arm equipped with chopsticks to perform fine manipulation, paired with either a motion capture cage or an RGB-D camera for perception.

System figure

Finetuning Reproducibility

After pretraining in simulation, our system can efficiently finetune in the real world. Aside from time taken up by resets, our system finetunes in as little as 30 minutes of interaction! This finetuning procedure is robust and reproducible, as illustrated below in a timelapse of finetuning over 3 different seeds.

Typically, the full training procedure takes around 2.5 hours: 0.5 hours of interaction and 2 hours of resets.


Food Demonstrations

How Reactive is the Agent?

Below are examples of a human, a handwritten 100Hz controller, and our RL agent attempting to grab a ball that is being bounced in the air by a motor.

How Generalizable is the Agent?

We chose evaluation tasks to effectively test our agent's ability to generalize from the proxy task to more practical settings. Particularly, these tasks feature real-world complications that are not present at training time, allowing us to test the agent's ability to compensate for them at test time.

How Robust is the Agent?

Through the use of a challenging proxy task, our agent can stay robust to unmodeled dynamics. The following video showcases the robot successfully grabbing a ball that is being shaken around by a human.

Design Ablations


Plot of UTD ablation

Asynchronous updates, a high Update-To-Data (UTD) ratio, and LayerNorm regularization yield a moderate improvement in sample efficiency while pretraining in simulation, but greatly improve the speed of training during the real-world fine-tuning stage.
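To make the UTD idea concrete, here is a minimal sketch of a training loop that performs several gradient updates per collected transition, with a plain LayerNorm applied to features. It is an illustration under stated assumptions, not the CherryBot implementation: the function names, the UTD value of 8, the batch size, and the placeholder transitions are all hypothetical, the critic update is stubbed out, and the asynchronous part of the update scheme is not modeled.

```python
import math
import random


def layer_norm(x, eps=1e-5):
    """Normalize a feature vector to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def run_high_utd(num_env_steps, utd_ratio, batch_size=4, seed=0):
    """Collect one transition per environment step, then run `utd_ratio` updates.

    Returns the total number of (stubbed) gradient updates performed.
    """
    rng = random.Random(seed)
    replay_buffer = []
    num_updates = 0
    for _ in range(num_env_steps):
        # Hypothetical placeholder transition; a real system would store (s, a, r, s').
        replay_buffer.append([rng.uniform(-1.0, 1.0) for _ in range(3)])
        if len(replay_buffer) < batch_size:
            continue  # wait until a full batch can be sampled
        for _ in range(utd_ratio):
            batch = rng.sample(replay_buffer, batch_size)
            # Stub for a critic update whose hidden features pass through LayerNorm.
            _ = [layer_norm(features) for features in batch]
            num_updates += 1
    return num_updates


if __name__ == "__main__":
    # Same amount of data, eight times as many updates with UTD = 8 as with UTD = 1.
    print(run_high_utd(num_env_steps=10, utd_ratio=1))
    print(run_high_utd(num_env_steps=10, utd_ratio=8))
```

The point of the sketch is the ratio: each real-world transition is reused for many updates, which is why a high UTD ratio matters most when interaction data is expensive.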

Plot of IQL vs SAC

We found that standard online RL algorithms, such as Soft Actor-Critic (SAC), are far more effective for fine-tuning than targeted offline RL methods such as IQL.

Insight: Pre-training using standard off-policy RL methods with imperfect prior data from simulation and heuristic controllers can significantly help with sample efficiency for real-world fine-tuning

Hierarchical Controller

Plot of 20hz vs 100hz

The choice of control frequency can critically affect the horizon of the task. A higher control frequency increases the number of time steps required to complete a task, which negatively impacts sample efficiency.
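As a rough worked example of this effect, the snippet below counts how many policy decisions are needed to cover a task of fixed duration at 20 Hz and at 100 Hz. The 3-second task duration and the function name are assumptions chosen for illustration only.

```python
def steps_per_episode(task_duration_s, control_hz):
    """Number of policy decisions needed to cover a task of the given duration."""
    return round(task_duration_s * control_hz)


if __name__ == "__main__":
    # Hypothetical 3-second grasp attempt at the two control frequencies.
    for hz in (20, 100):
        print(hz, "Hz:", steps_per_episode(3.0, hz), "steps")
```

Under these assumptions the same motion spans five times as many decision steps at 100 Hz, so credit has to be propagated over a horizon five times longer.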

Plot of latency ablation

Latency's impact becomes negligible at lower control frequencies, because a fixed latency then takes up a smaller fraction of each control step.
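The relationship can be sketched by expressing a fixed latency in units of control steps. The 20 ms latency value and the function name below are hypothetical and only serve to show the scaling.

```python
def latency_in_control_steps(latency_s, control_hz):
    """Express a fixed latency as a multiple of the control period (1 / control_hz)."""
    return latency_s * control_hz


if __name__ == "__main__":
    latency_s = 0.02  # hypothetical 20 ms end-to-end latency
    for hz in (20, 100):
        print(hz, "Hz:", latency_in_control_steps(latency_s, hz), "control steps of delay")
```

With these numbers the delay is well under one control step at 20 Hz but spans about two steps at 100 Hz, so the slower controller acts on observations that are far less stale relative to its own step length.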

Insight: Learning medium-frequency hybrid controllers can effectively balance policy reactivity against the tractability of learning

Can we use different perception pipelines?

Our system leverages an external state estimation module, so any perception pipeline can be used at test time. Experiments in our paper use a simple HSV filtering and contour detection algorithm, but the system readily extends to state-of-the-art perception modules.


In these demos, we use an off-the-shelf YOLOv5 model to predict the bounding box of objects on the table, the center of which is fed into the trained policy.
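A minimal sketch of the bounding-box step follows. It assumes the detector returns boxes as pixel corner coordinates (x1, y1, x2, y2); that format and the function name are assumptions, and any conversion from pixel coordinates to the state the policy consumes is not shown.

```python
def bbox_center(x1, y1, x2, y2):
    """Center (x, y) of an axis-aligned box given by two opposite pixel corners."""
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


if __name__ == "__main__":
    # Hypothetical detection: a box spanning x in [100, 140] and y in [50, 90].
    print(bbox_center(100, 50, 140, 90))
```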

Bounding box of yolo detection

Segment Anything Model

SAM is a generalizable segmentation system released by Meta AI. We can use this system to predict the segmentation mask of the object and feed the mask's center into CherryBot. Pictured below are the predicted masks and calculated object centers taken from SAM.
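One simple way to turn a segmentation mask into a single point is to take the centroid of its foreground pixels, sketched below. Whether the center is computed as a centroid or in some other way is an assumption here, as are the function name and the mask representation (rows of 0/1 values).

```python
def mask_center(mask):
    """Centroid (x, y) of the truthy pixels in a 2D mask, or None if the mask is empty."""
    sum_x = 0
    sum_y = 0
    count = 0
    for row_idx, row in enumerate(mask):
        for col_idx, value in enumerate(row):
            if value:
                sum_x += col_idx
                sum_y += row_idx
                count += 1
    if count == 0:
        return None  # no foreground pixels, so no object center
    return (sum_x / count, sum_y / count)


if __name__ == "__main__":
    # Hypothetical 3x4 mask with a 2x2 foreground blob.
    example_mask = [
        [0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 1, 1, 0],
    ]
    print(mask_center(example_mask))
```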

Predicted masks and object centers from SAM