GDFusion

Rethinking Temporal Fusion with a Unified Gradient Descent View for 3D Semantic Occupancy Prediction

Dubing Chen1     Huan Zheng1     Jin Fang1     Xingping Dong2     Xianfei Li3     Wenlong Liao3     Tao He3     Pai Peng3     Jianbing Shen1
1SKL-IOTSC, CIS, University of Macau     2Wuhan University     3COWAROBOT Co. Ltd.

CVPR 2025

Temporal Fusion Framework

Figure 1: GDFusion in the VisionOcc pipeline

FP32 Inference Memory Comparison

Figure 2: FP32 Inference Memory (MB)

TL;DR: GDFusion explores multi-level temporal fusion for vision-based 3D semantic occupancy prediction, in an efficient streaming manner.

Abstract

We present GDFusion, a temporal fusion method for vision-based 3D semantic occupancy prediction (VisionOcc). GDFusion investigates underexplored aspects of temporal fusion within the VisionOcc framework, focusing on both temporal cues and fusion strategies. It systematically examines the entire VisionOcc pipeline and identifies three fundamental yet previously overlooked temporal cues: scene-level consistency, motion calibration, and geometric complementation. These cues capture different facets of temporal evolution and make distinct contributions to different modules of the VisionOcc framework. To fuse temporal signals across heterogeneous representations, we propose a novel fusion strategy that reinterprets the formulation of vanilla RNNs. This reinterpretation uses gradient descent on features to unify the integration of diverse temporal information, embedding the proposed temporal cues into the network. Extensive experiments on nuScenes demonstrate that GDFusion significantly outperforms established baselines. Notably, on the Occ3D benchmark, it achieves mIoU improvements of 1.4%-4.8% and reduces memory consumption by 27%-72%.
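The idea of treating a recurrent update as gradient descent on features can be illustrated with a toy example. The sketch below is not the paper's implementation: the squared-error loss, the step size, and the names `grad_squared_error`, `fuse_step`, and `fuse_stream` are all assumptions chosen for illustration. It only shows that one gradient step on a per-frame loss yields a streaming update that needs just the previous state and the current observation.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def grad_squared_error(state: Sequence[float], observation: Sequence[float]) -> Vector:
    """Gradient of 0.5 * ||state - observation||^2 with respect to state.

    The loss is an assumption made for this sketch, not the paper's loss.
    """
    return [s - o for s, o in zip(state, observation)]


def fuse_step(
    state: Sequence[float],
    observation: Sequence[float],
    step_size: float,
    grad_fn: Callable[[Sequence[float], Sequence[float]], Vector] = grad_squared_error,
) -> Vector:
    """One streaming fusion step: a single gradient-descent update on the state."""
    grad = grad_fn(state, observation)
    return [s - step_size * g for s, g in zip(state, grad)]


def fuse_stream(
    observations: Sequence[Sequence[float]],
    initial_state: Sequence[float],
    step_size: float,
) -> Vector:
    """Fold a sequence of per-frame observations into one running state."""
    state = list(initial_state)
    for observation in observations:
        # Only the previous state is carried forward, so memory use does not
        # grow with the number of past frames.
        state = fuse_step(state, observation, step_size)
    return state


if __name__ == "__main__":
    # Hypothetical two-dimensional "features" for three consecutive frames.
    frames = [[1.0, 0.0], [1.0, 2.0], [1.0, 2.0]]
    print(fuse_stream(frames, initial_state=[0.0, 0.0], step_size=0.5))
```

With the squared-error loss assumed here, each step reduces to `(1 - step_size) * state + step_size * observation`, an exponential moving average. Passing a different `grad_fn` changes what temporal information is fused without changing the update rule, which is the sense in which a gradient-descent view can unify several fusion signals.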


Benchmark Results

Table 1: 3D semantic occupancy prediction results on Occ3D.

Table 2: 3D semantic occupancy prediction results on SurroundOcc. * denotes versions of the models with temporal fusion removed.

Table 3: 3D semantic occupancy prediction results on OpenOccupancy. * denotes versions of the models with temporal fusion removed.

BibTeX

@inproceedings{chen2025_gdfusion,
  title     = {Rethinking Temporal Fusion with a Unified Gradient Descent View for 3D Semantic Occupancy Prediction},
  author    = {Chen, Dubing and Zheng, Huan and Fang, Jin and Dong, Xingping and Li, Xianfei and Liao, Wenlong and He, Tao and Peng, Pai and Shen, Jianbing},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2025}
}