Figure AI’s Helix Enables Robots to See, Understand, and Act Like Humans

Image Source: ChatGPT-4o
Figure AI has introduced Helix, a first-of-its-kind Vision-Language-Action (VLA) model that integrates perception, language understanding, and real-time control to advance humanoid robotics. Helix represents several industry firsts, including:
Full upper-body control – Directs wrists, torso, head, and individual fingers in high-resolution continuous motion.
Multi-robot collaboration – Enables two robots to coordinate tasks using natural language, even with objects they have never seen before.
Generalized object manipulation – Allows robots to pick up and interact with thousands of household items without prior training, following natural language prompts.
Single neural network architecture – Uses a single set of neural network weights to learn and execute a wide range of behaviors—all without the need for task-specific retraining or manual adjustments, enabling seamless generalization across new tasks and environments.
Commercial readiness – Runs entirely on embedded, low-power GPUs, making it deployable for real-world applications.
Household environments present significant challenges for robots—unlike controlled industrial settings, homes contain varied and unpredictable objects, such as delicate glassware, scattered toys, and crumpled clothing. For robots to be truly useful in everyday life, they must adapt dynamically to new tasks and objects without requiring extensive retraining.
Traditional robotics approaches struggle with scalability. Teaching robots a single new behavior often requires either:
Extensive manual programming by robotics experts, or
Thousands of real-world demonstrations, making the process costly and inefficient.
Helix addresses this problem by leveraging AI models trained on internet-scale vision and language data, allowing robots to learn new skills instantly using natural language prompts.
How Helix Works: A "System 1, System 2" Approach
Helix introduces a two-system AI architecture that balances speed and generalization:
System 2 (S2): A vision-language model (VLM) pretrained on internet-scale data. It processes both speech and visual data at 7-9 Hz, allowing the robot to understand commands and recognize objects in real time.
System 1 (S1): A fast-reacting visuomotor control system running at 200 Hz, translating S2’s high-level intent into precise real-time actions, controlling 35 degrees of freedom (DoF) across the robot’s upper body.
This design allows Helix to operate efficiently:
S2 "thinks slow" – It processes scene context, understands language commands, and determines goals.
S1 "thinks fast" – It rapidly adjusts the robot’s motions in real-time, ensuring smooth execution.
For example, in a collaborative task, S1 can quickly adapt to a partner robot’s movements while maintaining S2’s high-level objectives, such as correctly placing an item in a storage container.
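To make the division of labor concrete, here is a minimal sketch of the two-rate loop in Python. The function and object names (s2_vlm, s1_policy, camera, robot) are illustrative stand-ins rather than Figure AI’s actual API; the only details taken from the article are the rough update rates (7-9 Hz for S2, 200 Hz for S1) and the 35-DoF action.

```python
import numpy as np


def run_helix_style_control(s2_vlm, s1_policy, camera, robot,
                            command, seconds=10, latent_dim=512):
    """Minimal sketch of the "think slow / think fast" split.

    The callables passed in (s2_vlm, s1_policy, camera, robot) are
    hypothetical stand-ins: S2 refreshes a latent intent vector at ~8 Hz
    while S1 issues a 35-DoF action on every 200 Hz control tick.
    """
    s1_hz, s2_hz = 200, 8
    ticks_per_s2_update = s1_hz // s2_hz       # ~25 control steps per latent refresh

    latent = np.zeros(latent_dim)              # high-level intent from S2
    for tick in range(s1_hz * seconds):
        if tick % ticks_per_s2_update == 0:
            # System 2 "thinks slow": re-reads the scene and the language command.
            latent = s2_vlm(camera.read(), command)
        # System 1 "thinks fast": reacts to the newest observation on each tick.
        action = s1_policy(robot.observe(), latent)   # 35-DoF upper-body command
        robot.act(action)
```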
Key Advancements in Humanoid Control
Helix provides several breakthroughs over previous robotic control systems:
Speed & Generalization – Matches the speed of specialized single-task models while generalizing zero-shot to thousands of unseen objects.
Scalability – Outputs continuous, high-dimensional control signals directly, avoiding the limitations of previous VLA models that struggled with humanoid robotics.
Architectural Simplicity – Uses standard AI architectures (open-source VLM for S2, transformer-based visuomotor policy for S1).
Separation of Concerns – Decoupling S1 and S2 allows each system to optimize separately, making it easier to scale and improve performance.
Training and Model Details
Helix was trained on 500 hours of high-quality multi-robot, multi-operator teleoperated behaviors, a dataset far smaller than those typically used to train comparable models. To generate natural language-conditioned training pairs, the researchers used an auto-labeling VLM that analyzed robot actions and generated corresponding text instructions in hindsight.
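The hindsight auto-labeling step can be pictured as follows. This is a minimal sketch that assumes a simple episode format and a hypothetical vlm_caption_fn wrapper around a vision-language model; Figure AI has not published the exact pipeline.

```python
def hindsight_label_episodes(episodes, vlm_caption_fn):
    """Sketch of hindsight auto-labeling for language-conditioned training.

    `episodes` is assumed to be a list of teleoperated demonstrations, each
    holding camera frames, observations, and robot actions; `vlm_caption_fn`
    is a hypothetical wrapper that asks a VLM what instruction would have
    produced the behavior shown in a handful of frames.
    """
    labeled_pairs = []
    for episode in episodes:
        # Summarize the demonstration with a few keyframes rather than the
        # full video, then ask the VLM for an instruction in hindsight.
        keyframes = episode["frames"][:: max(1, len(episode["frames"]) // 8)]
        instruction = vlm_caption_fn(
            frames=keyframes,
            prompt="What instruction would you give the robot to get the "
                   "behavior shown in these frames?",
        )
        # Pair the generated instruction with every (observation, action)
        # step so the policy can be trained as if it had been commanded.
        for obs, action in zip(episode["observations"], episode["actions"]):
            labeled_pairs.append({"instruction": instruction,
                                  "observation": obs,
                                  "action": action})
    return labeled_pairs
```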
Architecture
Helix’s architecture consists of two specialized AI systems working in tandem:
S2 (Vision-Language Model, 7B parameters): A large-scale vision-language model (VLM) pretrained on internet-scale data. It processes monocular camera images, wrist pose, and finger positions, then integrates this sensory information with natural language commands. Using a shared vision-language embedding space, S2 extracts task-relevant context and converts it into a continuous latent vector, which serves as a high-level intent signal for S1.
S1 (Visuomotor Transformer, 80M parameters): A fast, low-latency control model designed for real-time execution. It receives both raw visual input and the task-conditioned latent vector from S2, enabling it to generate continuous control outputs at 200 Hz. S1 uses a cross-attention transformer architecture optimized for high-dimensional humanoid motion, handling everything from fine finger movements to full-body coordination.
To improve task sequencing and autonomy, Helix includes a synthetic "percentage task completion" action in its output space. This allows the system to predict when a task is finished, enabling smooth transitions between multiple learned behaviors without requiring external commands.
By decoupling perception and control, this architecture ensures that S2 can focus on higher-level reasoning and generalization, while S1 optimizes for speed and precision, enabling fluid and adaptive humanoid movement.
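As a rough illustration of how S1 might condition on S2’s latent, here is a sketch of a cross-attention visuomotor policy in PyTorch, including the synthetic "percent task complete" output described above. Layer sizes, token shapes, and structure are assumptions for illustration, not the published 80M-parameter architecture.

```python
import torch
import torch.nn as nn


class VisuomotorPolicyS1(nn.Module):
    """Illustrative stand-in for Helix's S1: the dimensions and layer
    choices here are assumptions. It cross-attends visual/proprioceptive
    tokens to the S2 latent and emits a continuous action vector plus a
    synthetic "percent task complete" scalar.
    """

    def __init__(self, obs_dim=256, latent_dim=512, hidden=512, dof=35):
        super().__init__()
        self.obs_proj = nn.Linear(obs_dim, hidden)        # visual + state tokens
        self.latent_proj = nn.Linear(latent_dim, hidden)  # S2 intent vector
        self.cross_attn = nn.MultiheadAttention(hidden, num_heads=8,
                                                batch_first=True)
        self.action_head = nn.Linear(hidden, dof)         # 35-DoF continuous control
        self.progress_head = nn.Linear(hidden, 1)         # % task completion

    def forward(self, obs_tokens, s2_latent):
        # obs_tokens: (batch, num_tokens, obs_dim); s2_latent: (batch, latent_dim)
        q = self.obs_proj(obs_tokens)
        kv = self.latent_proj(s2_latent).unsqueeze(1)     # condition on intent
        fused, _ = self.cross_attn(q, kv, kv)
        pooled = fused.mean(dim=1)
        action = self.action_head(pooled)                 # wrists, fingers, head, torso
        progress = torch.sigmoid(self.progress_head(pooled))
        return action, progress
```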
Training Approach
Joint optimization – S1 and S2 were trained end-to-end, allowing gradients to flow between both systems, ensuring real-time control performance.
Temporal alignment – A calibrated time offset between S1 and S2 inputs was introduced during training to match the inference latency gap, minimizing delays and improving real-world deployment accuracy.
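A single joint training step with the temporal offset might look roughly like the sketch below. It assumes the S1 and S2 modules are differentiable PyTorch models (like the sketch above) and that demonstrations are batched as time-indexed tensors; the offset value and field names are illustrative, not published details.

```python
def training_step(s2_model, s1_policy, batch, optimizer,
                  latency_offset_steps=5):
    """Sketch of joint end-to-end training with a temporal offset.

    `latency_offset_steps` is an illustrative value: S2's inputs come from
    frames that are `latency_offset_steps` steps older than S1's, so the
    latent S1 conditions on during training is as "stale" as it will be at
    inference, when the slow S2 lags the 200 Hz control loop.
    """
    t = latency_offset_steps
    # The older frame plus the language command go to the slow system...
    latent = s2_model(batch["images"][:, 0], batch["commands"])
    # ...while the fast system sees the current observation.
    pred_action, pred_progress = s1_policy(batch["obs_tokens"][:, t], latent)

    loss = ((pred_action - batch["actions"][:, t]) ** 2).mean() \
         + ((pred_progress.squeeze(-1) - batch["progress"][:, t]) ** 2).mean()

    optimizer.zero_grad()
    loss.backward()          # gradients flow through S1 into S2 (end-to-end)
    optimizer.step()
    return loss.item()
```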
Optimized Deployment for Real-Time Robotics
Helix is designed for efficient, real-time operation, running entirely on embedded low-power GPUs, making it commercially viable without requiring external cloud processing.
S2 (Vision-Language Model): Runs as an asynchronous background process, continuously processing camera feeds, robot state data, and natural language commands to update high-level task objectives. It refreshes a shared memory latent vector, encoding high-level behavioral intent to guide real-time decision-making and task execution.
S1 (Visuomotor Transformer): Runs a 200 Hz control loop, enabling precise, low-latency adjustments to movement—from fine finger dexterity to full-body coordination. It processes both the latest sensory observations and the most recent S2 latent vector. Because S1 runs much faster than S2, it operates at a higher temporal resolution, reacting immediately to real-time changes and creating a tight feedback loop for reactive control.
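The onboard deployment loop described above might be organized roughly like this: S2 runs in a background thread that keeps overwriting a shared latent vector, while S1’s 200 Hz loop always reads the freshest copy. Class and method names here are hypothetical; only the rates and the shared-latent pattern come from the article.

```python
import threading
import time

import numpy as np


class HelixStyleDeployment:
    """Sketch of asynchronous onboard deployment with two control rates.

    All names are illustrative: a slow VLM (s2_model) periodically refreshes
    a shared latent "intent" vector, while a fast visuomotor policy
    (s1_policy) consumes the most recent latent on every 200 Hz tick.
    """

    def __init__(self, s2_model, s1_policy, latent_dim=512):
        self.s2_model = s2_model          # slow VLM: (image, command) -> latent
        self.s1_policy = s1_policy        # fast policy: (obs, latent) -> 35-DoF action
        self.latent = np.zeros(latent_dim)
        self.lock = threading.Lock()

    def s2_loop(self, get_image, command, hz=8):
        """Asynchronous background process running at roughly 7-9 Hz."""
        while True:
            latent = self.s2_model(get_image(), command)
            with self.lock:
                self.latent = latent          # publish latest high-level intent
            time.sleep(1.0 / hz)

    def s1_loop(self, get_obs, send_action, hz=200):
        """200 Hz control loop reacting to the newest observations."""
        while True:
            with self.lock:
                latent = self.latent.copy()   # most recent S2 intent
            action = self.s1_policy(get_obs(), latent)   # 35-DoF command
            send_action(action)
            time.sleep(1.0 / hz)
```

In practice the two loops would run as separate threads (or processes) on the robot’s embedded GPU, which is what keeps the fast loop responsive even while the slower model is still reasoning about the scene.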
To optimize latency and responsiveness, Helix’s deployment strategy mirrors its training setup, aligning S2’s high-level reasoning with S1’s high-speed motor execution. This reduces the train-inference distribution gap, ensuring robots can:
Instantly adapt to environmental changes.
Perform smooth, real-time motion without perceptible delays.
Execute complex multi-step tasks with seamless transitions.
By leveraging onboard processing, Helix eliminates the need for high-bandwidth external computation, making it a scalable and cost-effective solution for real-world robotics applications.
Breakthrough Results with Helix
Whole Upper-Body Coordination at 200 Hz
Helix controls 35 degrees of freedom (DoF), managing:
Fine motor control – Precise coordination of individual fingers, wrists, and hands, allowing for delicate grasping, object manipulation, and dexterous in-hand adjustments. Helix enables robots to securely grip fragile objects like glassware, manipulate varied textures and materials, and adapt grip strength in real time.
Full-body motion – Coordinated movement of the hands, head, and torso, including precise hand trajectories for object manipulation, dynamic head adjustments for visual tracking, and torso positioning for balance and reach optimization.
Adaptive adjustments – Continuous real-time recalibration of movements based on object weight, shifting loads, and environmental feedback. Helix enables smooth transitions between tasks, allowing robots to adjust their approach mid-action, such as repositioning their grip if an object begins to slip or shifting body posture for better stability.
Zero-Shot Multi-Robot Collaboration
In a grocery storage task, two Helix-powered robots:
Worked together to organize food inside a refrigerator, handling items never seen before in training.
Followed verbal prompts, such as:
“Hand the bag of cookies to the robot on your right.”
“Place the cookies in the open drawer.”
Used identical model weights, eliminating the need for role-specific training.
"Pick Up Anything" Capability
Helix enables robots to grasp thousands of objects in cluttered environments without prior demonstrations, handling:
Glassware, toys, tools, and clothing
New shapes, sizes, and textures
Abstract concepts (e.g., picking up a “desert item” and selecting a toy cactus as the closest match).
This versatile language-to-action grasping ability unlocks exciting new possibilities for humanoid robots, enabling them to operate effectively in unstructured and dynamic environments.
Looking Ahead
Figure AI recently ended its collaboration with OpenAI, signaling a shift toward developing its own in-house AI models for high-speed robot control. The company believes that outsourcing AI does not work effectively for real-world embodied robotics, necessitating the creation of a vertically integrated solution where hardware and AI are developed together for optimal performance. While the specifics of the separation have not been publicly detailed, Figure AI remains focused on advancing humanoid robotics independently. Meanwhile, OpenAI appears to be renewing its interest in robotics, actively hiring engineers for a new robotics team—an indication that both companies see humanoid AI as a critical area of development.
What This Means
Helix represents a major breakthrough in humanoid robotics, demonstrating that AI-powered robots can perform complex, real-time manipulation without task-specific training. By integrating vision, language understanding, and dexterous control, Helix allows robots to interact naturally with their environments, handling objects and responding to commands as humans do.
Traditionally, robots have been constrained by rigid, pre-programmed behaviors and required massive datasets of demonstrations to learn new skills. Helix breaks this paradigm by enabling zero-shot generalization—robots can now learn new tasks instantly through natural language, reducing the time and cost associated with training.
Beyond home and industrial settings, this technology could have far-reaching applications in:
Healthcare – Assisting with patient care, medical equipment handling, and physical therapy.
Logistics & Warehousing – Automating dynamic, unpredictable workflows.
Disaster Response – Performing tasks in hazardous environments where human intervention is unsafe.
As AI-driven robotics continues to evolve, Helix lays the foundation for fully autonomous humanoids, capable of adapting to any environment without extensive reprogramming. While challenges remain in scaling, safety, and real-world deployment, this represents a pivotal step toward a future where robots are seamlessly integrated into daily life.
For those interested in pushing the boundaries of Embodied AI, Figure AI is actively recruiting talent to further develop Helix and expand its capabilities.
Editor’s Note: This article was created by Alicia Shapiro, CMO of AiNews.com, with writing, image, and idea-generation support from ChatGPT, an AI assistant. However, the final perspective and editorial choices are solely Alicia Shapiro’s. Special thanks to ChatGPT for assistance with research and editorial support in crafting this article.