UniEgoMotion: A Unified Model for Egocentric Motion Reconstruction, Forecasting, and Generation
Abstract
Egocentric human motion generation and forecasting with scene context are crucial for enhancing AR/VR experiences, improving human-robot interaction, advancing assistive technologies, and enabling adaptive healthcare solutions by accurately predicting and simulating movement from a first-person perspective. However, existing methods primarily focus on third-person motion synthesis with structured 3D scene contexts, limiting their effectiveness in real-world egocentric settings where limited field of view, frequent occlusions, and dynamic cameras hinder scene perception. To bridge this gap, we introduce Egocentric Motion Generation and Egocentric Motion Forecasting, two novel tasks that utilize first-person images for scene-aware motion synthesis without relying on an explicit 3D scene representation. We propose UniEgoMotion, a unified conditional motion diffusion model with a novel head-centric motion representation tailored for egocentric devices. UniEgoMotion’s simple yet effective design supports egocentric motion reconstruction, forecasting, and generation from first-person visual inputs in a unified framework. Unlike previous works that overlook scene semantics, our model effectively extracts image-based scene context to infer plausible 3D motion. To facilitate training, we introduce EE4D-Motion, a large-scale dataset derived from EgoExo4D, augmented with pseudo-ground-truth 3D motion annotations. UniEgoMotion achieves state-of-the-art performance in egocentric motion reconstruction and is the first model to generate motion from a single egocentric image. Extensive evaluations demonstrate the effectiveness of our unified framework, setting a new benchmark for egocentric motion modeling and unlocking new possibilities for egocentric applications.
Motion Task Formulation
UniEgoMotion is a unified, scene-aware motion model designed for egocentric settings:
(1) It generates plausible future motion from a single egocentric image — for example, predicting how you might take your shot on goal.
(2) It forecasts upcoming motion using past egocentric video and ego-device trajectory, showing how you could complete your run-up to score.
(3) It reconstructs accurate 3D motion from past egocentric observations, showing how you squatted down to reach the lower cabinet.
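Viewed through a single unified model, the three tasks above differ only in which conditioning inputs are observed for each frame. The sketch below is an illustrative summary of that split; the frame counts, the past/future split point, and the exact per-task availability are assumptions made for exposition, not values from the paper.

```python
# Illustrative only: how the three tasks differ in what the model observes.
# N, P, and the dictionary keys are placeholder names for this sketch.

N = 120          # total number of motion frames to produce
P = 60           # number of observed (past) frames for forecasting

tasks = {
    # Generation: only a single egocentric image is available; the model
    # synthesizes plausible future motion without any device trajectory.
    "generation":     {"image_frames": [0],            "trajectory_frames": []},
    # Forecasting: past egocentric video and past ego-device trajectory are
    # observed; motion for the remaining future frames is predicted.
    "forecasting":    {"image_frames": list(range(P)), "trajectory_frames": list(range(P))},
    # Reconstruction: egocentric observations are available over the full
    # window; the model recovers the corresponding 3D motion for all N frames.
    "reconstruction": {"image_frames": list(range(N)), "trajectory_frames": list(range(N))},
}
```

Any frame whose conditioning is unavailable is handled by the learned mask tokens described in the architecture below, so the same network serves all three settings.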
Model Architecture
Overview of a denoising step in UniEgoMotion. The input noisy motion X_t^{1:N} is denoised using a transformer decoder network conditioned on the ego-device trajectory T^{1:N} and egocentric images I^{1:N}. A robust image encoder is used to extract fine-grained scene context from the images. During training, conditioning inputs are randomly replaced with learnable mask tokens to simulate three tasks: egocentric reconstruction, forecasting, and generation. During inference, the learned mask tokens are used in place of any missing conditioning input, allowing a single model to perform all three tasks consistently.
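The sketch below illustrates one such denoising step in PyTorch-style pseudocode. It is a minimal sketch under stated assumptions: layer sizes, feature dimensions, the use of cross-attention to the conditioning tokens, and all names (e.g. `UniEgoDenoiser`, `traj_mask_token`) are illustrative choices, not the released architecture.

```python
import torch
import torch.nn as nn


class UniEgoDenoiser(nn.Module):
    """Minimal sketch of the conditional denoiser described in the caption above.

    Shapes, dimensions, and the conditioning mechanism (cross-attention to a
    memory of trajectory/image tokens) are assumptions for illustration.
    """

    def __init__(self, d_model=512, motion_dim=135, traj_dim=9,
                 img_feat_dim=1024, n_layers=8, n_heads=8):
        super().__init__()
        self.motion_proj = nn.Linear(motion_dim, d_model)
        self.traj_proj = nn.Linear(traj_dim, d_model)
        self.img_proj = nn.Linear(img_feat_dim, d_model)
        self.time_embed = nn.Sequential(
            nn.Linear(1, d_model), nn.SiLU(), nn.Linear(d_model, d_model))
        # Learnable mask tokens stand in for missing trajectory / image conditioning.
        self.traj_mask_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.img_mask_token = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.out = nn.Linear(d_model, motion_dim)

    def forward(self, x_t, t, traj=None, traj_vis=None, img_feats=None, img_vis=None):
        """One denoising step.

        x_t:       (B, N, motion_dim) noisy motion at diffusion step t
        t:         (B,) diffusion timestep
        traj:      (B, N, traj_dim) ego-device trajectory, or None
        traj_vis:  (B, N) bool, True where the trajectory is observed
        img_feats: (B, N, img_feat_dim) per-frame features from a frozen
                   image encoder, or None
        img_vis:   (B, N) bool, True where an egocentric frame is observed
        """
        B, N, _ = x_t.shape
        tokens = self.motion_proj(x_t) + \
            self.time_embed(t.float().view(B, 1, 1)).expand(B, N, -1)

        # Build conditioning tokens, replacing unavailable frames with mask tokens.
        traj_tok = self.traj_mask_token.expand(B, N, -1).clone()
        if traj is not None:
            traj_tok = torch.where(traj_vis.unsqueeze(-1), self.traj_proj(traj), traj_tok)
        img_tok = self.img_mask_token.expand(B, N, -1).clone()
        if img_feats is not None:
            img_tok = torch.where(img_vis.unsqueeze(-1), self.img_proj(img_feats), img_tok)

        cond = torch.cat([traj_tok, img_tok], dim=1)  # conditioning memory
        h = self.decoder(tgt=tokens, memory=cond)     # cross-attend to conditions
        return self.out(h)                            # denoised motion estimate
```

During training, randomly zeroing out `traj_vis` / `img_vis` (or passing `None`) simulates reconstruction, forecasting, and generation from the same batch; at inference, the same mask tokens fill in whatever conditioning is missing for the requested task.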
Egocentric Motion Reconstruction
Egocentric Motion Forecasting
Egocentric Motion Generation
BibTeX
@inproceedings{patel25uniegomotion,
title = {UniEgoMotion: A Unified Model for Egocentric Motion Reconstruction,
Forecasting, and Generation},
author = {Patel, Chaitanya and Nakamura, Hiroki and Kyuragi, Yuta and
Kozuka, Kazuki and Niebles, Juan Carlos and Adeli, Ehsan},
booktitle = {International Conference on Computer Vision (ICCV)},
year = {2025},
}