UniEgoMotion: A Unified Model for Egocentric Motion Reconstruction, Forecasting, and Generation

Chaitanya Patel¹, Hiroki Nakamura², Yuta Kyuragi^1,3, Kazuki Kozuka², Juan Carlos Niebles¹, Ehsan Adeli ¹
¹Stanford University, ²Panasonic Holdings Corporation, ³Panasonic R&D Company of America

ICCV 2025
Paper | ArXiv | Code & Data

Abstract

Egocentric human motion generation and forecasting with scene-context is crucial for enhancing AR/VR experiences, improving human-robot interaction, advancing assistive technologies, and enabling adaptive healthcare solutions by accurately predicting and simulating movement from a first-person perspective. However, existing methods primarily focus on third-person motion synthesis with structured 3D scene contexts, limiting their effectiveness in real-world egocentric settings where limited field of view, frequent occlusions, and dynamic cameras hinder scene perception. To bridge this gap, we introduce Egocentric Motion Generation and Egocentric Motion Forecasting, two novel tasks that utilize first-person images for scene-aware motion synthesis without relying on explicit 3D scene. We propose UniEgoMotion, a unified conditional motion diffusion model with a novel head-centric motion representation tailored for egocentric devices. UniEgoMotion’s simple yet effective design supports egocentric motion reconstruction, forecasting, and generation from first-person visual inputs in a unified framework. Unlike previous works that overlook scene semantics, our model effectively extracts image-based scene context to infer plausible 3D motion. To facilitate training, we introduce EE4D-Motion, a large-scale dataset derived from EgoExo4D, augmented with pseudo-ground-truth 3D motion annotations. UniEgoMotion achieves state-of-the-art performance in egocentric motion reconstruction and is the first to generate motion from a single egocentric image. Extensive evaluations demonstrate the effectiveness of our unified framework, setting a new benchmark for egocentric motion modeling and unlocking new possibilities for egocentric applications.

Motion Tasks Formulation

Model Architecture

Egocentric Motion Reconstruction

View in full screen for best visualization.

Qualitative comparison of Egocentric Reconstruction. Baseline methods exhibit floating motion, floor penetration, and inaccurate joint localization, whereas UniEgoMotion generates reconstructions that closely align with the ground truth.

Egocentric Motion Forecasting

View in full screen for best visualization.

Qualitative comparison of Egocentric Forecasting for predicting future motion using the first 2 seconds of egocentric video and trajectory input. The LSTM baseline predicts an average future motion and suffers from foot sliding, while the Two-stage baseline produces damped motion. In contrast, our model successfully predicts complex motions, such as squatting down to repair a bike tire, performing a salsa dance, and executing a dribbling drill around a dome cone.

Egocentric Motion Generation

View in full screen for best visualization.

Qualitative comparison of Egocentric Motion Generation from a single egocentric image input. Compared to the LSTM and Two-stage baseline, our model leverages the fine-grained image features for more accurate motion generation, demonstrating soccer juggling, a basketball shooting drill, and interaction with the lower cabinet on the left side of the person.

BibTex

@inproceedings{patel2025uniegomotion,
        title={UniEgoMotion: A Unified Model for Egocentric Motion Reconstruction, Forecasting, and Generation},
        author={Patel, Chaitanya and Nakamura, Hiroki and Kyuragi, Yuta and Kozuka, Kazuki and Niebles, Juan Carlos and Adeli, Ehsan},
        booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
        pages={10318--10329},
        year={2025}
      }