We are open-sourcing our generalist robotic foundation model, GO-1. Beyond the dataset and model innovations we shared in our <a href="https://opendrivelab.com/AgiBot-World/" class="text-white underline hover:text-o-blue">previous blog</a>, this time we’d like to talk about the bitter lessons we learned along the way.
Even if a team manages to squeeze out a 5% or 10% gain in model accuracy, the entire system can still fail if any part of the pipeline is broken. This is the bucket effect: the system is only as good as its weakest component. A wrong coordinate frame, inconsistencies between data collection and execution, or even a minor hardware failure can cause the robot’s actions to collapse completely. When scaling up data collection to more than a hundred robots, even more factors need to be considered.
Of course, robots aren’t perfect. Because of differences in their built-in controllers and software, the arm might show grasping errors anywhere from 1 mm to 10 mm. You’ll notice this if the robot drops the screw or if the screw ends up off-center in the gripper. These errors highlight the gap between what you demonstrated and what the robot actually executes. And if you train a VLA model with data that has these kinds of errors, it will likely struggle to perform the task, even if the model itself is strong and well trained.
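One simple way to surface this demonstration-versus-execution gap is to log the commanded and actually measured end-effector positions and compare them step by step. The sketch below is a minimal illustration of that check; the function name and data layout are assumptions, not part of any released tooling.

```python
import numpy as np

def tracking_error_mm(commanded_xyz, executed_xyz):
    """Per-step Euclidean error (in mm) between commanded and executed
    end-effector positions, both given in meters as (T, 3) arrays."""
    commanded_xyz = np.asarray(commanded_xyz, dtype=float)
    executed_xyz = np.asarray(executed_xyz, dtype=float)
    return np.linalg.norm(commanded_xyz - executed_xyz, axis=-1) * 1000.0

# Hypothetical example: a controller that consistently lags 3 mm along x.
commanded = np.zeros((5, 3))
executed = commanded + np.array([0.003, 0.0, 0.0])
err = tracking_error_mm(commanded, executed)
print(err.max())  # 3.0 — inside the 1 mm to 10 mm range discussed above
```

If errors like these show up systematically in your logs, it is worth fixing the controller before blaming the model.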
Now, things get even trickier when you scale up. Imagine collecting data with dozens or even hundreds of robots at the same time, like we do in our data collection factory. In this case, it’s not enough for data to work on the robot it was collected from; you also need it to work across different robots. That way, all the data can be treated as one big, unified dataset, rather than being tied to a specific machine. This cross-robot consistency not only boosts scalability but also makes it possible to evaluate models on any robot in the fleet.
The action space defines the coordinates in which robots operate. Traditional robot controllers often use the end-effector (EEF) pose, measured relative to the robot’s chest or base and compared to the previous frame. In the VLA era, however, a more direct approach is to control the arm motors through their joint angles. There are also multiple ways to design the learning objectives for a model: for example, predicting actions relative to the last frame, or relative to the first frame in an action chunk (as in pi0). From our experiments, we’ve seen that strong models can easily adapt to different action spaces, even for dexterous manipulation tasks like cloth folding. In fact, they can learn effectively even when the pre-training and fine-tuning stages use different action spaces. To keep things simple for users, we chose to adopt absolute joint space in our open-source model. The key takeaway here is that the robot must execute the actions predicted by the model correctly. Since some robot controllers operate in different coordinate systems, a coordinate transformation may be needed to ensure everything lines up properly.
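As a concrete illustration of one such transformation, the sketch below converts a chunk of per-step joint deltas (actions predicted relative to the previous frame) into the absolute joint targets that a joint-space controller expects. The function name and the 7-DoF shape are assumptions for illustration, not the released GO-1 interface.

```python
import numpy as np

def deltas_to_absolute(q0, delta_chunk):
    """Convert per-step joint deltas (each relative to the previous frame)
    into absolute joint targets, starting from the current state q0."""
    return q0 + np.cumsum(np.asarray(delta_chunk, dtype=float), axis=0)

q0 = np.zeros(7)                 # current 7-DoF joint state (radians)
deltas = np.full((4, 7), 0.01)   # a chunk of 4 predicted steps, +0.01 rad each
targets = deltas_to_absolute(q0, deltas)
print(targets[-1][0])  # 0.04
```

A model trained on actions relative to the first frame of the chunk would need a different (simpler) conversion, which is exactly why the deployment code must know which convention the checkpoint was trained with.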
Now that we have a trained VLA model, it’s time for evaluation. But don’t rush straight into deploying it on a real robot. A good first step is to run an open-loop test, which validates whether the model has properly fit the fine-tuning data. If the model fails here, it usually points to issues in your pipeline, such as problems with the dataset, dataloader, or training process. Fix them before moving forward.
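An open-loop test of this kind can be as simple as replaying recorded observations through the policy and measuring how far its predictions drift from the actions stored in the fine-tuning data. The sketch below assumes a generic `predict_fn` and an `(observation, action)` dataset layout; both are illustrative placeholders, not a specific API.

```python
import numpy as np

def open_loop_mse(predict_fn, dataset):
    """Feed recorded observations through the policy and compare its
    predicted actions against the ground-truth actions in the dataset."""
    errors = []
    for obs, gt_action in dataset:
        pred = predict_fn(obs)
        errors.append(np.mean((np.asarray(pred) - np.asarray(gt_action)) ** 2))
    return float(np.mean(errors))

# Toy sanity check with an identity "policy" on a tiny synthetic dataset.
data = [(np.ones(7), np.ones(7)), (np.zeros(7), np.zeros(7))]
print(open_loop_mse(lambda obs: obs, data))  # 0.0
```

A large open-loop error on the very data the model was fine-tuned on points to a pipeline bug (dataset, dataloader, or training) rather than a hard task.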
Once the model passes the open-loop test, you can deploy it to the robot and begin real-world testing. As an extra precaution, it’s also a good idea to first replay previously collected data on the robot to confirm that all hardware is functioning as expected.
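Such a hardware replay can be a short loop that streams a recorded joint trajectory back to the robot at the original control rate. The `send_joint_command` callable below is a placeholder for whatever command API your robot exposes; the sketch only illustrates the pattern.

```python
import time
import numpy as np

def replay_episode(send_joint_command, joint_trajectory, hz=30):
    """Stream a previously recorded joint trajectory back to the robot
    at a fixed control rate, as a hardware sanity check before running
    the policy closed-loop."""
    period = 1.0 / hz
    for q in np.asarray(joint_trajectory, dtype=float):
        send_joint_command(q)
        time.sleep(period)

# Dry run: collect the commands in a list instead of driving real hardware.
sent = []
replay_episode(sent.append, np.linspace(0.0, 1.0, 3).reshape(3, 1), hz=1000)
print(len(sent))  # 3
```

If the robot cannot faithfully re-execute its own demonstrations, no model checkpoint will fix that, so this check is worth the few minutes it takes.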