Technical Deep Dive
FP3's architecture is a masterclass in matching data representation to the physical problem. The model is built on a Diffusion Transformer (DiT) backbone, a class of models that has proven highly effective for generative tasks such as image and video synthesis. Here, the diffusion process is applied to action sequences rather than pixels. The input is a stream of point clouds (typically 16,384 points per frame from a depth camera or LiDAR), which are voxelized and passed through a 3D sparse convolution encoder before entering the transformer. This differs fundamentally from 2D-based policies, which encode RGB images with a ResNet or ViT and then try to project the features into 3D space via cross-attention or depth estimation heads. FP3 eliminates that projection step entirely.
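To make "diffusion applied to action sequences" concrete, here is a minimal sketch of DDPM-style reverse sampling over a chunk of robot actions. It is not FP3's implementation: the linear noise schedule, the 16-step by 7-DoF action shape, and especially `oracle_noise_predictor` (a hypothetical stand-in that is handed the answer, where a real policy would run a transformer conditioned on point cloud features) are all assumptions chosen so the loop can run on its own.

```python
import numpy as np

def make_schedule(num_steps: int = 50):
    """Linear beta schedule plus cumulative alpha products (a common DDPM choice)."""
    betas = np.linspace(1e-4, 0.02, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def oracle_noise_predictor(x_t, t, target, alpha_bars):
    """Hypothetical stand-in for the learned denoiser.

    Returns the exact noise that maps `target` to `x_t` at step t.
    A trained policy would predict this from observations instead.
    """
    return (x_t - np.sqrt(alpha_bars[t]) * target) / np.sqrt(1.0 - alpha_bars[t])

def sample_actions(target, num_steps: int = 50, seed: int = 0):
    """Run the reverse diffusion chain from pure noise to an action sequence."""
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    x = rng.standard_normal(target.shape)  # start from Gaussian noise
    for t in reversed(range(num_steps)):
        eps = oracle_noise_predictor(x, t, target, alpha_bars)
        # Posterior mean of x_{t-1} given x_t and the predicted noise
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        # No fresh noise is added on the final step
        noise = rng.standard_normal(x.shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    return x

# Assumed shape: 16 future steps of a 7-DoF action vector
target_actions = np.linspace(-1.0, 1.0, 16 * 7).reshape(16, 7)
denoised = sample_actions(target_actions)
```

With a perfect noise predictor the chain lands on the target sequence, which is only a sanity check of the sampling loop; the hard part in practice is learning the predictor.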
The 1.3B parameter count is notable. For comparison, RT-2 (Google DeepMind) is estimated at 55B parameters but operates on 2D images and text tokens. FP3 achieves competitive or superior performance on standard manipulation benchmarks with roughly 1/40th the parameters, plausibly because the input representation carries far more geometric information per byte. The 60,000-trajectory pre-training dataset, collected from simulated environments (primarily RLBench and a custom Franka Emika Panda dataset) and some real-world teleoperation, covers diverse tasks: pick-and-place, peg-in-hole, drawer opening, and cloth folding. Trajectories average 120 steps each, yielding roughly 7.2 million point cloud-action pairs.
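The two headline figures in this paragraph follow from simple arithmetic, shown here as a quick check. The inputs are the numbers quoted in the text; the 55B figure is itself an estimate.

```python
trajectories = 60_000
avg_steps = 120
pairs = trajectories * avg_steps  # point cloud-action pairs

fp3_params = 1.3e9
rt2_params = 55e9  # estimate quoted in the text
ratio = rt2_params / fp3_params

print(f"{pairs:,} pairs; RT-2 is about {ratio:.0f}x larger")
# prints "7,200,000 pairs; RT-2 is about 42x larger"
```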
A key engineering contribution is the use of a novel point cloud tokenization scheme that preserves local geometric structure while being computationally tractable. The team open-sourced the core components on GitHub under the repository `fp3-robot`, which has already garnered over 2,300 stars. The repo includes pre-trained checkpoints, a PyTorch Lightning training script, and a ROS 2 integration package for real-time inference at 30 Hz on a single NVIDIA RTX 4090 GPU.
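The tokenization scheme itself is not described here, so the following is only an illustrative baseline, not FP3's method: a plain voxel-grid pooling step that turns a point cloud into one centroid "token" per occupied voxel. The function name `voxel_tokens`, the 0.1 m voxel size, and the synthetic uniform cloud are assumptions made for the example.

```python
import numpy as np

def voxel_tokens(points: np.ndarray, voxel_size: float):
    """Pool an (N, 3) point cloud into one centroid per occupied voxel.

    Returns (coords, centroids): integer voxel indices of shape (M, 3)
    and the mean xyz of the points in each voxel, also (M, 3).
    """
    idx = np.floor(points / voxel_size).astype(np.int64)
    coords, inverse = np.unique(idx, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)  # flatten; shape varies across NumPy versions
    sums = np.zeros((len(coords), 3))
    np.add.at(sums, inverse, points)  # accumulate xyz per voxel
    counts = np.bincount(inverse, minlength=len(coords))
    return coords, sums / counts[:, None]

rng = np.random.default_rng(0)
cloud = rng.uniform(0.0, 1.0, size=(16_384, 3))  # synthetic frame of 16,384 points
coords, centroids = voxel_tokens(cloud, voxel_size=0.1)
```

Averaging within a voxel keeps coarse local geometry while capping the token count at the number of occupied cells, which is the basic trade-off any such scheme has to manage.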
Benchmark Results (from FP3 paper):
| Model | Input Modality | Parameters | RLBench Success Rate (18 tasks avg.) | Real-World Pick-and-Place (unseen objects) | Inference Latency (ms) |
|---|---|---|---|---|---|
| FP3 (ours) | Point cloud | 1.3B | 87.2% | 84.5% | 33 |
| RT-2-X | 2D image + text | 55B (est.) | 72.4% | 63.1% | 120 |
| Octo (1.5B) | 2D image | 1.5B | 68.9% | 55.3% | 45 |
| PerAct | Voxelized 3D | 0.6M | 65.3% | 48.7% | 280 |
Data Takeaway: Against the best 2D-based model in the table (RT-2-X), FP3 leads by about 15 percentage points on RLBench (87.2% vs. 72.4%) and about 21 points on real-world pick-and-place (84.5% vs. 63.1%), with lower latency than every other model listed. This supports the view that direct 3D input is not just philosophically cleaner but practically superior for manipulation tasks requiring precise spatial reasoning.
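The margins can be recomputed directly from the benchmark table above. The dictionary below simply transcribes the table's rows; the `input` labels ("2D" or "3D") are a simplification introduced for this sketch.

```python
results = {
    "FP3":    {"input": "3D", "rlbench": 87.2, "real": 84.5, "latency_ms": 33},
    "RT-2-X": {"input": "2D", "rlbench": 72.4, "real": 63.1, "latency_ms": 120},
    "Octo":   {"input": "2D", "rlbench": 68.9, "real": 55.3, "latency_ms": 45},
    "PerAct": {"input": "3D", "rlbench": 65.3, "real": 48.7, "latency_ms": 280},
}

def margin_over_best_2d(metric: str) -> float:
    """FP3's lead, in percentage points, over the strongest 2D-input model."""
    best_2d = max(r[metric] for r in results.values() if r["input"] == "2D")
    return round(results["FP3"][metric] - best_2d, 1)

rlbench_margin = margin_over_best_2d("rlbench")  # 14.8
real_margin = margin_over_best_2d("real")        # 21.4
fastest = min(results, key=lambda name: results[name]["latency_ms"])  # "FP3"
```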
Key Players & Case Studies
The FP3 story cannot be separated from its lead: Gao Yang. As an assistant professor at Tsinghua University's Institute for AI (THUAI) and co-founder/chief scientist of Qianxun Intelligence, he bridges the gap between academic rigor and product velocity. Qianxun Intelligence, valued at over $1.2 billion after its Series B in Q4 2025, is one of China's most prominent embodied AI startups. Its flagship product, the Qianxun G1 humanoid robot, currently relies on a VLM-based control stack. FP3 is widely expected to replace that stack in the G2 generation, which is slated for early 2027.
Other players in the 3D policy space include:
- Google DeepMind's RT-2-X: The current de facto standard for vision-language-action models. It uses a 2D image + text prompt approach. While generalist, it struggles with tasks requiring precise depth estimation, such as inserting a peg into a hole with a 0.1 mm tolerance.
- Physical Intelligence (π0): A startup founded by former Google Brain researchers. Their π0 model uses a similar diffusion transformer approach but on 2D images. They have not yet adopted point cloud inputs.
- Covariant's RFM-1: A robotics foundation model trained on 2D images and text. It excels at pick-and-place in structured environments but fails on cluttered scenes with heavy occlusion.
- NVIDIA's Isaac GR00T: A platform for robot foundation models, but its core models (e.g., GR00T N1) remain 2D-centric, relying on depth estimation networks as a post-hoc step.
Competitive Landscape Comparison:
| Company/Model | Input | Parameter Scale | Pre-training Data | Real-World Deployment |
|---|---|---|---|---|
| FP3 (Tsinghua/Qianxun) | Point cloud | 1.3B | 60K trajectories | Planned for G2 (2027) |
| RT-2-X (Google) | 2D image + text | 55B (est.) | ~100K trajectories | In research labs |
| π0 (Physical Intelligence) | 2D image | 1.2B | ~50K trajectories | Beta with select partners |
| RFM-1 (Covariant) | 2D image + text | 1.8B | ~80K trajectories | Commercial warehouses |
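Data Takeaway: FP3 is the only model in this comparison that takes 3D point cloud input at this scale. Every other entry relies on 2D images, and the one 3D baseline in the benchmark table, PerAct, is far smaller. That makes FP3 a first mover in an approach that is likely to become the standard within 2-3 years.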
Industry Impact & Market Dynamics
The adoption of 3D foundation models like FP3 will reshape the robotics industry in three major ways:
1. Hardware Requirements Shift: Current robot training pipelines rely heavily on high-resolution RGB cameras. FP3's point cloud input demands depth sensors (LiDAR, stereo cameras, or structured light). This will accelerate the adoption of affordable depth sensors (e.g., Intel RealSense, Ouster, or Apple's LiDAR technology) in robot fleets. The global 3D sensor market for robotics is projected to grow from $2.1 billion in 2025 to $6.8 billion by 2030, per industry estimates.
2. Data Collection Paradigm Change: 2D image datasets are abundant (e.g., YouTube videos, web images). 3D point cloud trajectories are scarce and expensive to collect. FP3's 60K trajectories required approximately 10,000 hours of teleoperation and simulation time. This creates a moat for early movers like Qianxun Intelligence who can afford to generate proprietary 3D datasets. We predict a surge in investment in synthetic data generation tools (e.g., NVIDIA Isaac Sim, MuJoCo with depth rendering) to bridge the data gap.
3. Product Roadmap Acceleration: Embodied AI companies that adopt 3D policies will see faster time-to-market for general-purpose manipulation. For example, a humanoid robot that can reliably pick up a screwdriver from a cluttered drawer (a task that stumps 2D-based models due to occlusion) becomes feasible. This directly impacts the $24 billion warehouse automation market, where bin-picking remains a top unsolved problem.
Market Growth Projection for 3D Policy Models:
| Year | Number of companies with 3D policy R&D | Estimated cumulative investment ($M) | % of robot foundation models using 3D input |
|---|---|---|---|
| 2025 | 5 | 120 | 5% |
| 2026 | 15 | 450 | 15% |
| 2027 | 35 | 1,200 | 35% |
| 2028 | 60 | 2,800 | 60% |
Data Takeaway: The inflection point is 2027-2028, when the projected share of robot foundation models using 3D input climbs from 35% to 60% and crosses the majority mark. FP3's ICRA recognition will catalyze this shift by providing a validated blueprint.
Risks, Limitations & Open Questions
Despite its promise, FP3 has significant limitations:
- Point Cloud Sparsity: The current model uses 16,384 points per frame. For small or thin objects (e.g., a needle, a paperclip), this resolution is insufficient. The model may fail on tasks requiring sub-millimeter precision.
- Generalization to Dynamic Environments: FP3 was trained primarily on static or quasi-static scenes. In environments with moving obstacles or humans, the 30 Hz inference rate may be too slow for reactive control.
- Sim-to-Real Gap: While the paper reports 84.5% real-world success, the pre-training was 90% simulation. The remaining 10% real-world data came from a single robot platform (Franka Emika Panda). Generalization to other robot morphologies (e.g., humanoid arms with different kinematics) is unproven.
- Computational Cost: Although FP3 is far smaller than RT-2-X, a 1.3B-parameter model still requires a GPU for inference. Edge deployment on low-power robot controllers (e.g., NVIDIA Jetson Orin) is possible but at reduced frame rates (15-20 Hz).
- Ethical Concerns: As with all foundation models, FP3 could be used for autonomous weapons or surveillance. The open-source release of the codebase raises dual-use concerns.
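The frame rates quoted in the list above map onto per-step time budgets, which is what matters for reactive control. The conversion below is a rough sketch that assumes inference is the only cost in each control step, with no sensor, pre-processing, or actuation overhead and no pipelining; the helper names are illustrative.

```python
def hz_to_ms(rate_hz: float) -> float:
    """Time available per control step at a given update rate."""
    return 1000.0 / rate_hz

def ms_to_hz(latency_ms: float) -> float:
    """Maximum update rate if each step takes latency_ms end to end."""
    return 1000.0 / latency_ms

desktop_rate = ms_to_hz(33)                    # about 30.3 Hz from a 33 ms latency
edge_budget = (hz_to_ms(20), hz_to_ms(15))     # 50.0 ms to about 66.7 ms per step
```

At 15 Hz, an obstacle moving at 1 m/s travels roughly 6.7 cm between policy updates, which gives a sense of why the slower edge rates matter in dynamic scenes.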
AINews Verdict & Predictions
FP3 is not just a paper; it is a declaration that the 2D era of robot learning is ending. Our editorial judgment is that within three years, no serious robotics foundation model will be trained exclusively on 2D images. The geometric information in point clouds is too valuable to ignore, and FP3 provides the first scalable proof.
Specific Predictions:
1. By ICRA 2027, at least 5 other papers will present 3D foundation models, and the term "point cloud policy" will become a standard subfield.
2. Qianxun Intelligence will announce the G2 humanoid robot with FP3-based control in Q1 2027, achieving a 30% improvement in task success rate over the G1 on standardized benchmarks.
3. NVIDIA will release a 3D foundation model reference architecture within 12 months, likely as part of the Isaac platform, incorporating point cloud tokenization similar to FP3.
4. The open-source community will produce a 300M-parameter version of FP3 that runs on a Raspberry Pi 5 with a depth camera, democratizing 3D policy research.
What to watch next: The FP3 team's follow-up work on multi-modal fusion (point cloud + tactile + audio) and their integration of FP3 into Qianxun's commercial product. If the G2 robot ships with FP3 and demonstrates reliable general manipulation, it will mark the beginning of the end for 2D-only robot learning.
Final Takeaway: FP3 is the most important robotics foundation model paper of 2026 because it solves the right problem, perception geometry, rather than scaling up the wrong input modality. The industry will follow.