Fetch Objects with Boston Dynamics' Spot

2025-05-31

Teaching Spot to autonomously find and pick up objects using YOLO, ray-based localization, and custom visual servoing

As part of a practical course at the Forschungszentrum Informatik (FZI), our group of four built an autonomous search-and-retrieval system on top of Boston Dynamics' Spot robot. The robot had to explore an unknown environment, find predefined objects (multiple balls, cubes, and bottles), and bring them to a drop-off location, all without human intervention.

The system was built in ROS2 and used a Behavior Tree as the central coordinator, with separate modules for exploration, map coverage, object detection, localization, and grasping. Here I'll focus on the three parts I was responsible for: object detection, object localization, and grasping.

Object Detection

For detection I fine-tuned a YOLO11-n model. The network is small enough to run in real time on the robot's onboard computer, which mattered more than raw accuracy here.

The trickiest part was the dataset. As a group, we manually labeled 2,300 images using Roboflow, all captured directly from Spot's own cameras: the front-left and front-right cameras (grayscale) and the hand camera (RGB). Getting images from the robot itself was important because the lighting conditions, perspectives, and image quality are quite different from what you'd find in a dataset of smartphone photos.

We collected images from two different arm configurations:

  • Search pose: arm in a forward-looking position during exploration
  • Top-down pose: arm pointing straight down, used during the grasping visual servoing

Combining both arm configurations in one dataset meant the same model handled both phases of the pipeline without switching. On the validation set (397 images) the model hit mAP50 of 96.5% and mAP50-95 of 82.6%. In practice, accuracy wasn't a bottleneck: objects were visible from many angles, and each was reliably detected at least a couple of times over the course of exploration.

Due to compute constraints, we restricted real-time inference to the hand camera stream only. We also discarded any detections within 10% of the image edges to cut down on false positives from partial views at the frame borders (shown by the blue box in the video). The video on the right shows the detection running live on the hand camera feed.
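The edge filter is easy to sketch. Here's a minimal version, assuming detections arrive as pixel-space center/size tuples and that a detection counts as "near the edge" when its center falls within the margin (the actual message format and criterion may have differed):

```python
def filter_edge_detections(boxes, img_w, img_h, margin=0.10):
    """Discard bounding boxes whose center lies within `margin`
    (as a fraction of image width/height) of any image edge.

    boxes: iterable of (cx, cy, w, h) in pixels. Returns the kept boxes.
    """
    mx, my = margin * img_w, margin * img_h
    return [
        (cx, cy, w, h)
        for cx, cy, w, h in boxes
        if mx <= cx <= img_w - mx and my <= cy <= img_h - my
    ]


# A centered box survives; one hugging the left edge is dropped.
kept = filter_edge_detections(
    [(320, 240, 50, 50), (10, 240, 20, 20)], img_w=640, img_h=480
)
```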

Object Localization

Knowing an object is in the image isn't enough: the robot needs a 3D world coordinate to navigate toward and grasp. I built a persistent localization module that subscribes to the bounding boxes published by the detection module and accumulates position estimates over time, storing them so the robot can go pick something up even after it has already walked past it.
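In spirit, the persistent store is just a map from object class to accumulated position estimates that outlives any single sighting. A minimal sketch (the names and the "latest estimate wins" lookup are illustrative, not the actual module's API — the real module refines estimates as described below):

```python
class ObjectStore:
    """Accumulates 3D position estimates per detected object class, so
    the robot can return to an object after walking past it."""

    def __init__(self):
        self._estimates = {}  # class label -> list of (x, y, z) tuples

    def add(self, label, position):
        """Record a new world-frame estimate for this object class."""
        self._estimates.setdefault(label, []).append(tuple(position))

    def latest(self, label):
        """Most recent estimate for a label, or None if never seen."""
        pts = self._estimates.get(label)
        return pts[-1] if pts else None

    def discard(self, label):
        """Drop an object entirely (e.g. the operator rejected its
        marker in RViz2)."""
        self._estimates.pop(label, None)
```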

I explored three approaches:

Ray casting via the Boston Dynamics API. Given the center of a detected bounding box, a ray is cast from the camera into the world and intersected with the nearest surface using the robot's depth cameras. This is what we used in the final deployment: simple and accurate once the network issues at the lab were sorted out. Since it relies on the proprietary Boston Dynamics API, however, testing it without access to the robot is not straightforward.

Multi-view ray intersection. My own approach, developed in parallel as a fallback. The core idea: since the robot moves around and sees the same object from many angles, you can treat the accumulated rays as a kind of multi-view stereo system built from a single moving camera and find where they intersect. Critically, this approach does not rely on depth cameras or the Boston Dynamics API at all, which meant I could test the full pipeline remotely against ROS bags without needing the physical robot. I stored rays with a minimum spacing of r = 0.01 m, checked intersections against the last w = 100 rays, and considered two rays as intersecting if their shortest distance was below d = 0.05 m. I then clustered all intersection estimates with HDBSCAN (min cluster size 20) to get a robust final position. One issue I ran into was high variance along the camera's viewing direction, which I fixed by normalizing the data along that axis before clustering. This approach was especially useful early in the project when network problems made the Spot API unreliable.
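The geometric core is the closest approach between two rays: if the rays pass within the distance threshold at positive ray parameters, the midpoint of the closest-approach segment becomes a candidate object position. A sketch of that computation (function and parameter names are mine; the threshold mirrors the d = 0.05 m value above):

```python
import numpy as np

def ray_midpoint(o1, d1, o2, d2, max_dist=0.05):
    """Closest-approach midpoint of two rays, each given by an origin o
    and a unit direction d. Returns the midpoint as a candidate object
    position if the rays pass within `max_dist` of each other in front
    of both cameras, else None."""
    o1, d1, o2, d2 = map(np.asarray, (o1, d1, o2, d2))
    w0 = o1 - o2
    a, b, c = d1 @ d1, d1 @ d2, d2 @ d2
    d_, e_ = d1 @ w0, d2 @ w0
    denom = a * c - b * b
    if denom < 1e-9:           # near-parallel rays: no stable estimate
        return None
    t1 = (b * e_ - c * d_) / denom
    t2 = (a * e_ - b * d_) / denom
    if t1 < 0 or t2 < 0:       # closest approach behind a camera
        return None
    p1, p2 = o1 + t1 * d1, o2 + t2 * d2
    if np.linalg.norm(p1 - p2) > max_dist:
        return None
    return (p1 + p2) / 2
```

Each accepted midpoint is one dot in the visualization below; HDBSCAN then collapses those candidates into per-object positions.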

The video below shows the pipeline running on the same scene as the detection video above. As the robot moves, red arrows are cast out from the camera toward the detected objects. The dots that appear over time are candidate intersection points between ray pairs, colored yellow to red depending on how far apart the rays were when they got close enough to count. At the end you can see the HDBSCAN clustering, collapsing the cloud of candidates into single estimated object positions.

Depth camera. I considered using Spot's depth cameras directly, but the data was too noisy, especially for small objects like the ball. By the time the other two approaches were working well, there was no reason to pursue it further.

All detected positions are visualized in RViz2 as markers. The operator can click on a marker to manually discard it if the localization is wrong, which was useful during testing.

Grasping

This is where things got interesting. I implemented and tested three separate grasping approaches.

PickObjectInImage (Boston Dynamics API, wrapped by FZI). You pass a pixel coordinate and the robot plans a grasp. This worked sometimes, but was unreliable in a specific bad way: if grasp planning got stuck, canceling the request left the robot in a broken state that required a full reboot. Not a situation you want during a live demo.

PickObjectRayInWorld (my own wrapper). Functionally similar but takes a ray (like the ones from object localization) instead of a pixel. I expected this to be easier to debug, but hit exactly the same reliability problems. Testing with Boston Dynamics' own controller, which also calls PickObjectInImage internally, produced the same failures, which strongly suggested there was an issue deeper in the robot's driver.

Custom visual servoing (what we actually used). I built this approach from scratch using the ArmCartesianCommand and MobilityCommand APIs. When a grasp is requested:

  1. The gripper is positioned 0.7m above the estimated object location, pointing straight down with the gripper fully open.
  2. A visual servoing loop runs for five iterations. In each iteration, the object detection checks for a bounding box in the current hand camera frame. If nothing is detected, the grasp is aborted and the object is removed from the localization module. If detected, I use the ray cast API to get a fresh 3D position estimate and reposition the arm.
  3. The gripper descends gradually: 0.6 - (i * 0.1) meters above the object per iteration, landing at 0.2m in the final step. I also apply a fixed -0.07m offset along the robot's x-axis, which I found experimentally to improve grasp success across all three objects.
  4. After the servoing loop, the gripper moves straight down to the last estimated position and closes.
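The descent schedule and offset from steps 2-3 can be sketched as a pure function (0-based iteration index; the frame convention and names are assumptions, and the real loop of course issues ArmCartesianCommand calls rather than returning tuples):

```python
X_OFFSET = -0.07  # experimentally found offset along the robot x-axis (m)

def servo_target(obj_pos, i):
    """Arm target for visual-servoing iteration i (0-based, 5 iterations),
    given the latest 3D object estimate (x, y, z) in the robot frame.

    Height follows 0.6 - i * 0.1 m above the object, so the gripper
    descends from 0.6 m down to 0.2 m before the final straight-down grasp.
    """
    x, y, z = obj_pos
    height = 0.6 - i * 0.1
    return (x + X_OFFSET, y, z + height)
```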

I estimated about an 80% success rate for the bottle when it was upright or aligned horizontally with the gripper. The ball and cube weren't orientation-sensitive so they worked more consistently. One edge case I didn't fully solve: if multiple objects were close together, the visual servoing would sometimes oscillate between them. I mitigated this by increasing the z-step size early in the loop to commit to one object quickly, but you can still see it happening in one of the examples in the video below.

There was also a recurring ROS2 bug: after a successful grasp, the grasping node would occasionally crash and block all further movement commands. Restarting the object localization node cleared it. We never tracked down the root cause.

Takeaways

The parts I found most rewarding were the multi-view localization approach and building the custom grasping routine. The localization problem is genuinely interesting: turning a stream of monocular bounding boxes into reliable 3D coordinates without depth sensors requires some careful geometric thinking, and the HDBSCAN clustering step made it robust to the inevitable outliers.

The grasping work taught me a lot about working with proprietary robot APIs where you don't control the internals. When the built-in methods are unreliable and the failure modes are hard to recover from, building your own loop around lower-level primitives is often the cleaner path, even if it takes more work upfront.

The full system worked end-to-end in the demo. Not without hiccups, but well enough to autonomously find and retrieve objects in an environment the robot had never seen before.