Pano2IR: Spatial Room Impulse Responses Generation
from Egocentric Panoramic Image

Anonymous Authors

Abstract

Immersive spatial audio for augmented and extended reality (AR/XR) applications requires alignment with visual cues from the surrounding environment. However, inferring realistic room impulse responses (RIRs) from visual information in diverse and unconstrained environments remains a major challenge. We propose Pano2IR, a framework that generates first-order Ambisonics (FOA) spatial room impulse responses (SRIRs) from a single 360° egocentric panoramic image captured at the user's location. This enables user-centric spatial audio rendering in arbitrary environments without the need for extensive RIR measurements or full 3D meshes. To preserve the spatial characteristics of SRIRs, we introduce a diffuseness-weighted intensity vector (Diff-IV) loss, which captures inter-channel relationships by leveraging both the intensity vector and diffuseness inherent in FOA-SRIRs. Pano2IR extracts three acoustic-correlated visual transformations from a single RGB panorama—depth maps, room geometry, and object absorption maps—to effectively bridge the visual-acoustic domain gap. In addition, we encode key acoustic properties of SRIRs into latent acoustic parameter representations. A contrastive learning scheme then aligns these acoustic features with their corresponding visual counterparts, enabling the model to distinguish matching acoustic-visual pairs from non-matching ones.
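As a rough illustration of the quantities the Diff-IV loss builds on, the sketch below computes a frame-wise pseudo-intensity vector and a diffuseness estimate from a time-domain FOA-SRIR, then combines them into a toy loss. Everything specific here is an assumption rather than the formulation used by Pano2IR: the channel order (W, X, Y, Z), SN3D-style gains (so that a single plane wave yields zero diffuseness), the frame length, and in particular the weighting `1 - diffuseness` applied to an L1 intensity error. The function names `intensity_and_diffuseness` and `diff_iv_loss` are hypothetical.

```python
import numpy as np


def intensity_and_diffuseness(foa, frame=256, eps=1e-12):
    """Frame-wise pseudo-intensity vector and diffuseness of an FOA signal.

    foa: array of shape (4, T), rows assumed to be W, X, Y, Z with
         SN3D-style gains (assumption, not taken from the paper).
    Returns (intensity of shape (3, n_frames), diffuseness of shape (n_frames,)).
    """
    w, xyz = foa[0], foa[1:4]
    n_frames = foa.shape[1] // frame
    t = n_frames * frame  # drop the incomplete trailing frame

    # Instantaneous pseudo-intensity W(t) * [X, Y, Z](t), averaged per frame.
    inten = (w[:t] * xyz[:, :t]).reshape(3, n_frames, frame).mean(axis=2)

    # Energy density, averaged per frame.
    energy = 0.5 * (w[:t] ** 2 + (xyz[:, :t] ** 2).sum(axis=0))
    energy = energy.reshape(n_frames, frame).mean(axis=1)

    # Diffuseness: near 0 for one plane wave, near 1 for a diffuse field.
    psi = 1.0 - np.linalg.norm(inten, axis=0) / (energy + eps)
    return inten, np.clip(psi, 0.0, 1.0)


def diff_iv_loss(foa_ref, foa_gen, frame=256, eps=1e-12):
    """Toy diffuseness-weighted intensity-vector loss (hypothetical form).

    Frames where the reference is directional (low diffuseness) get a
    larger weight on the L1 error between intensity vectors.
    """
    i_ref, psi_ref = intensity_and_diffuseness(foa_ref, frame, eps)
    i_gen, _ = intensity_and_diffuseness(foa_gen, frame, eps)
    weight = 1.0 - psi_ref
    err = np.abs(i_ref - i_gen).sum(axis=0)
    return float((weight * err).sum() / (weight.sum() + eps))
```

Under these assumptions, a signal with identical W and X channels and silent Y and Z behaves like a plane wave along the x axis and yields diffuseness close to 0, whereas four independent noise channels yield diffuseness close to 1. Because the loss depends on products between W and the directional channels, it is sensitive to inter-channel relationships that a per-channel waveform loss would miss.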

Experimental Results

We demonstrate Pano2IR with inference results on panoramic images captured in previously unseen scenes. The generated FOA-SRIRs were converted to binaural audio using an off-the-shelf Ambisonics-to-binaural plug-in.
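A common way to audition an SRIR is to convolve a dry (anechoic) mono source with each of its four channels, which gives an FOA signal that a binaural decoder can then render for headphones. The sketch below shows only that convolution step. Whether the clips on this page were produced exactly this way is an assumption, the helper name `render_foa` is hypothetical, and the binaural decoding itself is left to the plug-in since it depends on HRTF data.

```python
import numpy as np


def render_foa(dry, srir):
    """Convolve a mono dry signal with a 4-channel FOA-SRIR.

    dry:  array of shape (T,), the anechoic source signal.
    srir: array of shape (4, L), one impulse response per FOA channel.
    Returns an FOA signal of shape (4, T + L - 1).
    """
    dry = np.asarray(dry, dtype=float)
    srir = np.asarray(srir, dtype=float)
    # Full linear convolution, applied independently to each channel.
    return np.stack([np.convolve(dry, channel) for channel in srir])
```

Convolving a unit impulse with the SRIR returns the SRIR itself (zero-padded), which makes a quick sanity check before feeding real audio through the chain.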

Panorama 1
Original audio
Generated audio
Panorama 2
Original audio
Generated audio
Panorama 3
Original audio
Generated audio
Panorama 5
Original audio
Generated audio