What do we learn from a large-scale study of pre-trained visual representations in sim and real environments?

Accepted at the 2024 IEEE International Conference on Robotics and Automation (ICRA)
*Equal contribution. 1Meta AI, 2Georgia Institute of Technology, 3Stanford University
arXiv Paper
We conducted 304 experiments with PVRs across five tasks (push cube, pick up bottle, open drawer, reach goal position, and image-goal navigation), three robots (Trifinger, Franka, and Stretch), and two learning paradigms (imitation and reinforcement learning), in both simulation and reality.

Abstract

We present a large empirical investigation on the use of pre-trained visual representations (PVRs) for training downstream policies that execute real-world tasks. Our study spans five different PVRs, two different policy-learning paradigms (imitation and reinforcement learning), and three different robots for five distinct manipulation and indoor navigation tasks. From this effort, we arrive at three insights: 1) the performance trends of PVRs in simulation are generally indicative of their trends in the real world, 2) the use of PVRs enables a first-of-its-kind result with indoor ImageNav (zero-shot transfer to a held-out scene in the real world), and 3) the benefits from variations in PVRs, primarily data augmentation and fine-tuning, also transfer to real-world performance. See the project website for additional details and visuals.

Overview

Pre-trained visual representations (PVRs) show promise in advancing general-purpose visual perception for sensorimotor control. However, studies on the efficacy of PVRs have mostly been limited to simulations or small-scale hardware experiments. Hence, we pose the research question: Do today's PVRs form a comprehensive visual foundation for a wide variety of robotics tasks? In response, we conducted the largest empirical study to date involving 5 PVRs, 3 different robot platforms, 2 policy-learning paradigms, and 5 distinct tasks, totaling 348 experiments and over 110 hours of robot experimentation.
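
As a rough, hypothetical illustration of the policy setup studied here: a pre-trained visual encoder maps camera images to embeddings, and a small learned head maps the embedding (plus proprioception) to actions. The sketch below is a minimal PyTorch version with illustrative names and dimensions, not the paper's implementation.

    import torch
    import torch.nn as nn

    class PVRPolicy(nn.Module):
        """Minimal sketch of a PVR-based policy: a pre-trained visual
        encoder feeding a small MLP head. Illustrative only."""

        def __init__(self, encoder: nn.Module, embed_dim: int,
                     proprio_dim: int, action_dim: int, frozen: bool = True):
            super().__init__()
            self.encoder = encoder
            self.frozen = frozen
            if frozen:  # the "Frozen = Yes" setting in the tables below
                for p in self.encoder.parameters():
                    p.requires_grad = False
            self.head = nn.Sequential(
                nn.Linear(embed_dim + proprio_dim, 256),
                nn.ReLU(),
                nn.Linear(256, action_dim),
            )

        def forward(self, image: torch.Tensor, proprio: torch.Tensor) -> torch.Tensor:
            # Embed the image with the PVR; gradients flow only when finetuning.
            with torch.set_grad_enabled(not self.frozen and self.training):
                z = self.encoder(image)  # (B, embed_dim)
            return self.head(torch.cat([z, proprio], dim=-1))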

Our key findings are:

  • Sim2Real Predictivity of PVR-based policies: After basic alignment between the simulation and real-world setups, PVR-based policies demonstrate a sim2real predictivity score (SRCC) of 0.929, indicating a high correlation between performance in simulation and in the real world (see the correlation sketch after this list).
  • First-of-its-kind result on ImageNav: An image-goal navigation agent using a strong PVR (VC-1 Large), trained entirely in simulation, achieved a 90% success rate upon zero-shot transfer to a held-out real-world scene. This result shows that the sim-vs-real domain gap can be overcome with purely RGB-based perception.
  • Impact of Design Choices: Design choices such as model size, whether to finetune the visual backbone, and the use of data augmentations play a significant role in PVR performance. We found finetuning and augmentations to be generally beneficial. Regarding model size, smaller models performed better in few-shot imitation-learning domains, whereas larger models excelled in large-scale RL domains like ImageNav.
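
To make the predictivity score concrete, here is a minimal sketch of computing an SRCC, assuming (as the figure caption below states) that it is the Pearson correlation between paired sim and real success rates. The numbers are placeholders, not the study's data.

    import numpy as np
    from scipy.stats import pearsonr

    # Paired success rates (%) for the same policies evaluated in simulation
    # and on real hardware. Placeholder values, not the paper's data.
    sim_success = np.array([40.0, 97.0, 80.0, 57.0, 61.0])
    real_success = np.array([37.0, 97.0, 83.0, 67.0, 60.0])

    srcc, p_value = pearsonr(sim_success, real_success)
    print(f"SRCC = {srcc:.3f} (p = {p_value:.3g})")  # near 1.0 => sim predicts real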

Figures 1(a), 1(b), and 2

Comparison of sim2real correlation between success on CortexBench and reality (Fig 1(a)) vs. success in our simulation environments and reality (Fig 1(b)). We find that by matching the simulation setting to real-world conditions (Fig 1(b)), the SRCC (i.e., the Pearson correlation) improves substantially. Fig 2 compares the correlation of sim performance with policies trained in the real world (blue) against the correlation of sim performance with policies trained in sim and evaluated on hardware (Sim2Real transfer). Sim2Real transfer is poor across the board for tasks that use few-shot imitation learning, while transfer performance is substantially better on ImageNav, which is trained using large-scale reinforcement learning on simulated scenes.

Comparison of performance in simulation (blue) vs. reality (red) for five PVRs on five tasks. R3M and VC-1 Base performance in sim closely matches reality on all tasks, whereas CLIP, MVP, and VC-1 Large show mismatched performance on multiple tasks.

Results

Model        Trifinger    Franka       Franka      Franka        Stretch
             Push-Cube    Reach-Pos    Pick-Up     Open Drawer   ImageNav
             Real  Sim    Real  Sim    Real  Sim   Real  Sim     Real  Sim
Scratch         8   21       0   27       0   11     30   37       10   35
R3M            11   31       0   97       0   87     10   37       20   25
CLIP            8   31       0   80       0   77     27   40       20   39
MVP             5   44       0   90       0   70     13   50       50   60
VC1-Base        3   40       0   97       0   80     23   57       60   61
VC1-Large       2   41       0   97       0   77     23   50       90   60

Table 1: Zero-shot sim2real evaluations of a randomly initialized ViT-Base model with finetuning & augmentations (row 0, "Scratch") and of pre-trained visual encoders (rows 1-5) on all tasks.


Model       Frozen  Aug   Trifinger    Franka       Franka       Franka        Average
                          Push-Cube    Reach-Pos    Pick-Up      Open Drawer   All tasks
                          Sim   Real   Sim   Real   Sim   Real   Sim   Real    Sim   Real
VC1-Base    Yes     No     40    37     97    97     80    83     57    67      55    57
VC1-Base    Yes     Yes    38    35     63    40     90    83     40    70      46    46
VC1-Base    No      No     36    20     89     0     93     0     40    33      61    60
VC1-Base    No      Yes    36    11    100     0    100     0     47    30      47    90
VC1-Large   Yes     No     41    38     97    87     77    43     50    57      53    45
VC1-Large   Yes     Yes    35    38     85    37    100    60     43    63      53    40
VC1-Large   No      No     28    34     93    57     97    90     33    43      50    45
VC1-Large   No      Yes    34    31     96    87    100    50     47    67      55    47

Table 2: Success rates of policies using two model sizes, with and without fine-tuning and augmentations, on four tasks.
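
To ground the Frozen and Aug columns, the sketch below shows one plausible way these two knobs are wired up in PyTorch: freezing (or not) the backbone parameters, and augmenting observations during training. The backbone (torchvision's vit_b_16 standing in for a PVR such as VC-1) and the specific augmentations are assumptions for illustration; the paper's exact choices are not reproduced here.

    import torch.nn as nn
    from torchvision import transforms
    from torchvision.models import vit_b_16, ViT_B_16_Weights

    # Stand-in backbone; imagine a PVR checkpoint such as VC-1 here.
    backbone = vit_b_16(weights=ViT_B_16_Weights.DEFAULT)
    backbone.heads = nn.Identity()  # expose the 768-d embedding, drop the classifier

    FROZEN = True   # "Frozen = Yes": encoder weights stay fixed during policy learning
    AUGMENT = True  # "Aug = Yes": perturb image observations during training

    if FROZEN:
        for p in backbone.parameters():
            p.requires_grad = False

    # Illustrative augmentations (random shift via pad + crop, color jitter);
    # the study's exact augmentation set is an assumption here.
    augment = transforms.Compose([
        transforms.Pad(8, padding_mode="edge"),
        transforms.RandomCrop(224),
        transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3),
    ]) if AUGMENT else nn.Identity()

    # When finetuning ("Frozen = No"), the optimizer must also receive the
    # encoder parameters; with a frozen encoder this list is empty.
    trainable_encoder_params = [p for p in backbone.parameters() if p.requires_grad]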


Model       Frozen  Aug   Trifinger    Franka       Franka       Franka        Stretch
                          Push-Cube    Reach-Pos    Pick-Up      Open Drawer   ImageNav
                          Sim   Real   Sim   Real   Sim   Real   Sim   Real    Sim   Real
VC1-Base    Yes     No     40     3     97     0     80     0     57    23      75    60
VC1-Base    Yes     Yes    38     6     63     0     90     0     40    27      75    10
VC1-Base    No      No     36    20     89     0     93     0     40    33      61    60
VC1-Base    No      Yes    36    11    100     0    100     0     47    30      47    90
VC1-Large   Yes     No     41     2     97     0     77     0     50    23      71    90
VC1-Large   Yes     Yes    35     3     85     0    100     0     43    27      76    80
VC1-Large   No      No     28    23     93     0     97     0     33    37      60    60
VC1-Large   No      Yes    34    15     96     0    100     0     47    40      69    90

Table 3: Sim2Real transfer results. All policies were trained in simulation and evaluated on real robots.

Conclusion

Our large-scale empirical study advances the understanding of pre-trained visual representations (PVRs) in robot learning. We found a high degree of sim2real predictivity for PVR-based policies, suggesting that simulation experiments can inform real-world performance. Furthermore, we achieved a landmark result on ImageNav, demonstrating the critical role of PVRs in enabling effective sim2real transfer. Finally, our study highlights the impact of key design decisions, such as model size, data augmentation, and fine-tuning, when deploying PVRs in real-world robotics tasks. These insights illuminate the potential of PVRs for robot learning, setting a strong foundation for future research.

Task Videos

Franka reach task
Trifinger task
ImageNav

Attention Visualizations

Trifinger without augmentation

Trifinger with augmentation

Franka Pick task without augmentation

Franka Pick task with augmentation

Franka Reach task without augmentation

Franka Reach task with augmentation

Authors

Sneha Silwal, Karmesh Yadav, Tingfan Wu, Jay Vakil, Arjun Majumdar, Sergio Arnaud, Claire Chen, Vincent-Pierre Berges, Dhruv Batra, Aravind Rajeswaran, Mrinal Kalakrishnan, Franziska Meier, Oleksandr Maksymets
BibTeX


@misc{silwal2023learn,
    title={What do we learn from a large-scale study of pre-trained visual representations in sim and real environments?},
    author={Sneha Silwal and Karmesh Yadav and Tingfan Wu and Jay Vakil and Arjun Majumdar and Sergio Arnaud and Claire Chen and Vincent-Pierre Berges and Dhruv Batra and Aravind Rajeswaran and Mrinal Kalakrishnan and Franziska Meier and Oleksandr Maksymets},
    year={2023},
    eprint={2310.02219},
    archivePrefix={arXiv},
    primaryClass={cs.RO}
}