We present a large empirical investigation into the use of pre-trained visual representations (PVRs) for training downstream policies that execute real-world tasks. Our study spans five PVRs, two policy-learning paradigms (imitation and reinforcement learning), and three robots across five distinct manipulation and indoor-navigation tasks. Three insights emerge from this effort: 1) the performance trends of PVRs in simulation are generally indicative of their trends in the real world, 2) PVRs enable a first-of-its-kind result on indoor ImageNav (zero-shot transfer to a held-out scene in the real world), and 3) the benefits of PVR variations, primarily data augmentation and fine-tuning, also transfer to real-world performance. See the project website for additional details and visuals.
Pre-trained visual representations (PVRs) show promise in advancing general-purpose visual perception for sensorimotor control. However, studies on the efficacy of PVRs have mostly been limited to simulations or small-scale hardware experiments. Hence, we pose the research question: Do today's PVRs form a comprehensive visual foundation for a wide variety of robotics tasks? In response, we conducted the largest empirical study to date involving 5 PVRs, 3 different robot platforms, 2 policy-learning paradigms, and 5 distinct tasks, totaling 348 experiments and over 110 hours of robot experimentation.
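To make the basic setup concrete: a PVR-based policy keeps the visual encoder fixed and trains only a small policy head on top of its features (the "frozen" configuration in the tables below). The toy sketch here uses a random projection as a stand-in for the encoder and behavior cloning on synthetic data; the study's actual encoders are ViTs such as VC-1, and its policy heads and training details differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a frozen PVR: a fixed random projection from "pixels" to
# features. (Illustrative only; the study's encoders are pre-trained ViTs.)
OBS_DIM, FEAT_DIM, ACT_DIM = 64, 32, 4
W_enc = rng.normal(0, 1 / np.sqrt(OBS_DIM), (OBS_DIM, FEAT_DIM))  # frozen

def encode(obs):
    # Frozen features: W_enc is never updated, so no gradient flows here.
    return np.tanh(obs @ W_enc)

# Synthetic demonstrations: observations paired with expert actions.
obs = rng.normal(size=(256, OBS_DIM))
true_head = rng.normal(size=(FEAT_DIM, ACT_DIM))
actions = encode(obs) @ true_head + 0.01 * rng.normal(size=(256, ACT_DIM))

# Behavior cloning: gradient descent on the small policy head only.
W_head = np.zeros((FEAT_DIM, ACT_DIM))
lr = 0.2
for _ in range(2000):
    feats = encode(obs)
    grad = feats.T @ (feats @ W_head - actions) / len(obs)  # MSE grad, head only
    W_head -= lr * grad

mse = float(np.mean((encode(obs) @ W_head - actions) ** 2))
print(f"final behavior-cloning MSE: {mse:.4f}")
```

Fine-tuning (the "Frozen = No" rows below) would additionally update the encoder's weights during policy training.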
Our key findings are:

1. The performance trends of PVRs in simulation are generally indicative of their trends in the real world.
2. PVRs enable a first-of-its-kind zero-shot sim2real result on indoor ImageNav, transferring to a held-out scene in the real world.
3. The benefits of PVR design choices, primarily data augmentation and fine-tuning, also transfer to real-world performance.
[Figure 1(a), Figure 1(b), and Figure 2]
| Model | Push-Cube Real | Push-Cube Sim | Reach-Pos Real | Reach-Pos Sim | Pick-Up Real | Pick-Up Sim | Open Drawer Real | Open Drawer Sim | ImageNav Real | ImageNav Sim |
|---|---|---|---|---|---|---|---|---|---|---|
| Scratch | 8 | 21 | 0 | 27 | 0 | 11 | 30 | 37 | 10 | 35 |
| R3M | 11 | 31 | 0 | 97 | 0 | 87 | 10 | 37 | 20 | 25 |
| CLIP | 8 | 31 | 0 | 80 | 0 | 77 | 27 | 40 | 20 | 39 |
| MVP | 5 | 44 | 0 | 90 | 0 | 70 | 13 | 50 | 50 | 60 |
| VC1-Base | 3 | 40 | 0 | 97 | 0 | 80 | 23 | 57 | 60 | 61 |
| VC1-Large | 2 | 41 | 0 | 97 | 0 | 77 | 23 | 50 | 90 | 60 |

Table 1: Zero-shot sim2real evaluations (success rate, %) of a randomly initialized ViT-Base trained with fine-tuning and augmentations (Scratch) and of five pre-trained visual encoders, on all tasks: Push-Cube (Trifinger); Reach-Pos, Pick-Up, and Open Drawer (Franka); and ImageNav (Stretch).
| Model | Frozen | Aug | Push-Cube Sim | Push-Cube Real | Reach-Pos Sim | Reach-Pos Real | Pick-Up Sim | Pick-Up Real | Open Drawer Sim | Open Drawer Real | Average Sim | Average Real |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| VC1-Base | Yes | No | 40 | 37 | 97 | 97 | 80 | 83 | 57 | 67 | 55 | 57 |
| VC1-Base | Yes | Yes | 38 | 35 | 63 | 40 | 90 | 83 | 40 | 70 | 46 | 46 |
| VC1-Base | No | No | 36 | 20 | 89 | 0 | 93 | 0 | 40 | 33 | 61 | 60 |
| VC1-Base | No | Yes | 36 | 11 | 100 | 0 | 100 | 0 | 47 | 30 | 47 | 90 |
| VC1-Large | Yes | No | 41 | 38 | 97 | 87 | 77 | 43 | 50 | 57 | 53 | 45 |
| VC1-Large | Yes | Yes | 35 | 38 | 85 | 37 | 100 | 60 | 43 | 63 | 53 | 40 |
| VC1-Large | No | No | 28 | 34 | 93 | 57 | 97 | 90 | 33 | 43 | 50 | 45 |
| VC1-Large | No | Yes | 34 | 31 | 96 | 87 | 100 | 50 | 47 | 67 | 55 | 47 |

Table 2: Success rates (%) of policies for two model sizes (VC1-Base and VC1-Large), with frozen vs. fine-tuned encoders ("Frozen") and with and without augmentations ("Aug"), on four tasks: Push-Cube (Trifinger) and Reach-Pos, Pick-Up, and Open Drawer (Franka).
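The "Aug" column refers to image augmentations applied to observations during policy training. The study's exact augmentation recipe is not reproduced here; as an illustrative sketch, a common choice in visual policy learning is a DrQ-style random shift (pad, then crop back at a random offset):

```python
import numpy as np

def random_shift(img: np.ndarray, pad: int = 4, rng=None) -> np.ndarray:
    """DrQ-style random shift: pad by `pad` pixels on each side (edge
    replication), then crop back to the original size at a random offset.
    `img` is HxWxC. Illustrative only; the study's augmentations may differ."""
    if rng is None:
        rng = np.random.default_rng()
    h, w, _ = img.shape
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    top = int(rng.integers(0, 2 * pad + 1))   # offset in [0, 2*pad]
    left = int(rng.integers(0, 2 * pad + 1))
    return padded[top:top + h, left:left + w]

# Usage: augment an observation before it is fed to the visual encoder.
rng = np.random.default_rng(0)
frame = rng.integers(0, 256, size=(84, 84, 3), dtype=np.uint8)
aug = random_shift(frame, pad=4, rng=rng)
print(aug.shape)  # spatial size is unchanged
```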
| Model | Frozen | Aug | Push-Cube Sim | Push-Cube Real | Reach-Pos Sim | Reach-Pos Real | Pick-Up Sim | Pick-Up Real | Open Drawer Sim | Open Drawer Real | ImageNav Sim | ImageNav Real |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| VC1-Base | Yes | No | 40 | 3 | 97 | 0 | 80 | 0 | 57 | 23 | 75 | 60 |
| VC1-Base | Yes | Yes | 38 | 6 | 63 | 0 | 90 | 0 | 40 | 27 | 75 | 10 |
| VC1-Base | No | No | 36 | 20 | 89 | 0 | 93 | 0 | 40 | 33 | 61 | 60 |
| VC1-Base | No | Yes | 36 | 11 | 100 | 0 | 100 | 0 | 47 | 30 | 47 | 90 |
| VC1-Large | Yes | No | 41 | 2 | 97 | 0 | 77 | 0 | 50 | 23 | 71 | 90 |
| VC1-Large | Yes | Yes | 35 | 3 | 85 | 0 | 100 | 0 | 43 | 27 | 76 | 80 |
| VC1-Large | No | No | 28 | 23 | 93 | 0 | 97 | 0 | 33 | 37 | 60 | 60 |
| VC1-Large | No | Yes | 34 | 15 | 96 | 0 | 100 | 0 | 47 | 40 | 69 | 90 |

Table 3: Sim2real transfer results (success rate, %). All policies are trained in simulation and evaluated on real robots. Tasks: Push-Cube (Trifinger); Reach-Pos, Pick-Up, and Open Drawer (Franka); and ImageNav (Stretch).
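One way to quantify sim2real predictivity is to correlate sim and real success rates across encoders: if the sim ranking predicts the real ranking, the correlation is high. As an illustrative sketch (a plain Pearson correlation, not necessarily the paper's exact metric), using the ImageNav column of Table 1:

```python
import math

# ImageNav success rates (%) from Table 1, one pair per encoder
# (Scratch, R3M, CLIP, MVP, VC1-Base, VC1-Large).
sim = [35, 25, 39, 60, 61, 60]
real = [10, 20, 20, 50, 60, 90]

def pearson(xs, ys):
    """Pearson correlation coefficient, used here as a simple proxy for
    how well simulation performance predicts real-world performance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

r = pearson(sim, real)
print(f"sim2real Pearson r for ImageNav: {r:.2f}")
```

A strongly positive r here is consistent with the paper's finding that simulation trends are generally indicative of real-world trends.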
Our large-scale empirical study significantly advances the understanding of pre-trained visual representations (PVRs) in robot learning. We found a high degree of sim2real predictivity for PVR-based policies, suggesting that simulation experiments can inform real-world performance. Furthermore, we achieved a landmark result on ImageNav: zero-shot sim2real transfer to a held-out real-world scene, demonstrating the critical role of PVRs in enabling effective transfer. Finally, our study highlights the impact of key design decisions, such as model size, data augmentation, and fine-tuning, when deploying PVRs on real-world robotics tasks. These insights illuminate the potential of PVRs for robot learning and set a strong foundation for future research.
[Rollout videos: Trifinger with and without augmentation; Franka Pick with and without augmentation; Franka Reach with and without augmentation]
@misc{silwal2023learn,
title={What do we learn from a large-scale study of pre-trained visual representations in sim and real environments?},
author={Sneha Silwal and Karmesh Yadav and Tingfan Wu and Jay Vakil and Arjun Majumdar and Sergio Arnaud and Claire Chen and Vincent-Pierre Berges and Dhruv Batra and Aravind Rajeswaran and Mrinal Kalakrishnan and Franziska Meier and Oleksandr Maksymets},
year={2023},
eprint={2310.02219},
archivePrefix={arXiv},
primaryClass={cs.RO}
}