Pano2IR

Immersive spatial audio for augmented and extended reality (AR/XR) applications requires alignment with visual cues from the surrounding environment. However, inferring realistic room impulse responses (RIRs) from visual information in diverse and unconstrained environments remains a major challenge. We propose Pano2IR, a framework that generates first-order Ambisonics (FOA) spatial room impulse responses (SRIRs) from a single 360° egocentric panoramic image captured at the user's location. This enables user-centric spatial audio rendering in arbitrary environments without the need for extensive RIR measurements or full 3D meshes. To preserve the spatial characteristics of SRIRs, we introduce a diffuseness-weighted intensity vector (Diff-IV) loss, which captures inter-channel relationships by leveraging both the intensity vector and diffuseness inherent in FOA-SRIRs. Pano2IR extracts three acoustic-correlated visual transformations from a single RGB panorama—depth maps, room geometry, and object absorption maps—to effectively bridge the visual-acoustic domain gap. In addition, we encode key acoustic properties of SRIRs into latent acoustic parameter representations. A contrastive learning scheme then aligns these acoustic features with their corresponding visual counterparts, enabling the model to distinguish matching acoustic-visual pairs from non-matching ones.
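As a rough illustration of the two quantities the Diff-IV loss builds on, the sketch below computes a time-domain pseudo-intensity vector and a diffuseness estimate from a four-channel FOA signal, then combines them into a toy loss. Several details are assumptions made only for this example and are not taken from the paper: the channel order (W, X, Y, Z), SN3D-style gains (so that a single plane wave gives X, Y, Z equal to W times a unit direction vector), the moving-average window, and in particular the weighting `1 - diffuseness`. The function names `intensity_and_diffuseness` and `diffuseness_weighted_iv_loss` are hypothetical.

```python
import numpy as np


def intensity_and_diffuseness(foa, win=64, eps=1e-12):
    """Short-time pseudo-intensity vector and diffuseness of an FOA signal.

    foa: array of shape (4, T), assumed channel order W, X, Y, Z with
         SN3D-style gains (an assumption of this sketch).
    Returns (mean_intensity of shape (3, T), diffuseness of shape (T,)).
    """
    foa = np.asarray(foa, dtype=float)
    if foa.ndim != 2 or foa.shape[0] != 4 or foa.shape[1] < win:
        raise ValueError("expected shape (4, T) with T >= win")

    w, xyz = foa[0], foa[1:4]
    intensity = w[None, :] * xyz                     # instantaneous pseudo-intensity
    energy = 0.5 * (w ** 2 + np.sum(xyz ** 2, axis=0))  # energy-density proxy

    kernel = np.ones(win) / win

    def smooth(signal):
        # Moving average; edges are zero-padded by mode="same".
        return np.convolve(signal, kernel, mode="same")

    mean_intensity = np.stack([smooth(c) for c in intensity])
    mean_energy = smooth(energy)

    # Diffuseness: 0 for a single plane wave, approaching 1 for a diffuse field.
    ratio = np.linalg.norm(mean_intensity, axis=0) / (mean_energy + eps)
    diffuseness = np.clip(1.0 - ratio, 0.0, 1.0)
    return mean_intensity, diffuseness


def diffuseness_weighted_iv_loss(pred, target, win=64, eps=1e-12):
    """Toy loss: L1 error between intensity vectors, weighted per time step.

    The weight (1 - target diffuseness) emphasises directional segments.
    This weighting is an assumption of the sketch, not the paper's definition.
    """
    iv_pred, _ = intensity_and_diffuseness(pred, win)
    iv_target, psi_target = intensity_and_diffuseness(target, win)
    weight = 1.0 - psi_target
    error = np.sum(np.abs(iv_pred - iv_target), axis=0)
    return float(np.sum(weight * error) / (np.sum(weight) + eps))
```

Under these assumptions, a single plane wave yields diffuseness near zero, while uncorrelated noise across the four channels yields diffuseness near one, so the weighting concentrates the penalty on the direct sound and early reflections, where directional errors are most audible.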

Pano2IR: Spatial Room Impulse Responses Generation from Egocentric Panoramic Image

Abstract

Experimental Results
