Paper status: completed

Learning Spatially-Aware Language and Audio Embeddings

Published: 09/18/2024
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

The paper introduces ELSA, a multimodal contrastive learning model that captures both semantic and spatial features of audio. Using synthetic spatial audio, ELSA demonstrates superior performance in semantic retrieval and 3D localization, improving accuracy over existing models.

Abstract

Humans can picture a sound scene given an imprecise natural language description. For example, it is easy to imagine an acoustic environment given a phrase like "the lion roar came from right behind me!". For a machine to have the same degree of comprehension, the machine must know what a lion is (semantic attribute), what the concept of "behind" is (spatial attribute) and how these pieces of linguistic information align with the semantic and spatial attributes of the sound (what a roar sounds like when it's coming from behind). State-of-the-art audio foundation models, which learn to map between audio scenes and natural textual descriptions, are trained on non-spatial audio and text pairs, and hence lack spatial awareness. In contrast, sound event localization and detection models are limited to recognizing sounds from a fixed number of classes, and they localize the source to an absolute position (e.g., 0.2m) rather than a position described using natural language (e.g., "next to me"). To address these gaps, we present ELSA, a spatially aware audio and text embedding model trained using multimodal contrastive learning. ELSA supports non-spatial audio, spatial audio, and open vocabulary text captions describing both the spatial and semantic components of sound. To train ELSA: (a) we spatially augment the audio and captions of three open-source audio datasets totaling 4,738 hours of audio, and (b) we design an encoder to capture the semantics of non-spatial audio, and the semantics and spatial attributes of spatial audio using contrastive learning. ELSA is competitive with state-of-the-art for both semantic retrieval and 3D source localization. In particular, ELSA achieves +2.8% mean audio-to-text and text-to-audio R@1 above the baseline, and outperforms the baseline by 11.6° mean-absolute-error in 3D source localization.


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

Learning Spatially-Aware Language and Audio Embeddings (ELSA)

1.2. Authors

Bhavika Devnani, Skyler Seto, Zakaria Aldeneh, Alessandro Toso, Elena Menyaylenko, Barry‑John Theobald, Jonathan Sheaffer, Miguel Sarabia.

Affiliations: The work is primarily from Apple; the first author lists Georgia Tech (email bdevnani3@gatech.edu). The team’s background spans audio signal processing, multimodal learning, spatial audio, and speech technology.

1.3. Journal/Conference

arXiv preprint. arXiv is a widely used pre-publication repository; it is not peer‑reviewed, but often hosts influential early work.

1.4. Publication Year

2024 (v2 posted 2024‑09‑17).

1.5. Abstract (author’s core message, summarized)

The paper proposes ELSA, a multimodal contrastive model that learns joint embeddings for spatial audio and language. Unlike prior audio–language foundation models (e.g., CLAP/LAION‑CLAP) trained with non‑spatial audio, ELSA encodes both semantics (what sound) and spatial attributes (where, e.g., “behind,” “left,” “far”). It uses first‑order ambisonics (FOA) and synthetic spatial augmentation for both audio and captions (rewritten with a large language model). ELSA achieves competitive semantic retrieval and strong 3D source localization, improving retrieval R@1 by +2.8% over a CLAP baseline and lowering 3D localization error by 11.6° compared to SELDNet on TUT Sound Events 2018 (REAL component). The learned space is structured enough to “swap” spatial direction via text embedding arithmetic.

2. Executive Summary

2.1. Background & Motivation

  • Problem: Current audio–language foundation models (AFMs) align audio and text semantically but lack spatial awareness because they are trained on non‑spatial audio. Conversely, sound event localization and detection (SELD) systems localize sources but use fixed class vocabularies and absolute coordinates rather than open‑vocabulary natural language like “next to me” or “behind.”
  • Why it matters: Spatial context is critical for human‑like scene understanding (e.g., “siren from behind you”) and for AR/VR, robotics, assistive tech, and immersive media. Bridging semantics and spatial attributes enables richer retrieval, reasoning, and editing (“move the dog bark from left to right”).
  • Key idea: Train a single contrastive multimodal model, ELSA, that ingests spatial audio (FOA) and spatially‑augmented captions, learning a joint embedding that captures both semantics and spatial attributes.

2.2. Main Contributions / Findings

  • Data: A synthetic spatial audio–text corpus of 4,738 hours (890,038 samples) across 8,972 simulated rooms; spatial captions are produced by prompting an LLM to rewrite original captions with spatial descriptors (direction, elevation, distance, room size/reverb). A small real‑world FOA dataset (70 clips) validates transfer.
  • Model: ELSA pairs a semantic audio branch (HTSAT) and a spatial attributes branch (CNN on active/reactive intensity vectors), concatenates them, projects to a shared embedding, and aligns with a text encoder (RoBERTa) via a CLIP‑style contrastive loss, plus auxiliary spatial regressions (DoA, distance, room area).
  • Results:
    • Semantic retrieval (non‑spatial): Comparable to LAION‑CLAP on AudioCaps/Clotho.
    • Spatial tasks: Strong 3D localization on TUT REAL (14.97° MAE). Zero‑shot spatial attribute classification via templated captions achieves >90% on most synthetic tests and transfers to the real‑world set (with expected gaps).
    • Representation: Direction can be “swapped” using text prototype vector arithmetic with 99.7% success while largely preserving semantic recall.
  • Insight: Mixing spatial and non‑spatial training pairs preserves non‑spatial semantic performance while adding spatial understanding.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

  • Spatial audio: Audio that includes directional cues (azimuth/elevation) and distance. Unlike mono, spatial formats encode the 3D sound field.
  • First‑order ambisonics (FOA): A spatial encoding using spherical harmonics truncated at order N=1, yielding 4 channels: omnidirectional W and dipoles X, Y, Z. FOA is device‑agnostic and can be converted to binaural or other formats.
  • Binaural vs ambisonics: Binaural simulates ear signals using head‑related transfer functions (HRTFs) but mixes content with listener/head effects; ambisonics encodes sound field components separably, enabling general‑purpose processing and later rendering.
  • Direction of Arrival (DoA): The direction from which sound arrives, often parameterized by azimuth (horizontal angle) and elevation (vertical angle). 3D DoA error on the sphere can be measured via angular distance.
  • Mel spectrogram: A time–frequency representation mapping frequency bins to the mel scale; log‑magnitude mel spectrograms are common audio features.
  • Intensity vectors (IVs): In FOA, active (real part) and reactive (imaginary part) intensity vectors, derived from products of W and dipoles in the STFT domain, capture sound energy flow directionality.
  • Contrastive multimodal learning (CLIP): Two encoders map different modalities (e.g., audio and text) into a shared space; a contrastive loss pulls paired embeddings together and pushes non‑pairs apart, enabling zero‑shot retrieval and alignment.
  • UMAP: A dimensionality reduction algorithm used to visualize high‑dimensional embeddings while preserving local/global structure.

3.2. Previous Works (with essential background)

  • Audio–language alignment:
    • CLAP and LAION-CLAP: Extend CLIP to audio–text using paired datasets; excel at semantic retrieval/classification but trained on non‑spatial audio.
    • MULAN, AudioCLIP: Joint embeddings of audio and text (and sometimes images).
    • LLM‑centric audio models: Pengi, LTU, SALMONN, BAT route audio into LLMs for captioning/QA; BAT specifically considers binaural spatial reasoning with QA pairs. These generally do not deliver a task‑agnostic, device‑agnostic spatial audio representation aligned with open‑vocabulary text.
  • Spatial audio learning:
    • SELDNet, PILOT, Spatial LibriSpeech focus on localization/detection with fixed label sets; not open‑vocabulary and not aligned with natural language descriptions.
  • Datasets:
    • Semantic: AudioCaps, Clotho, LAION‑Audio‑630K.
    • Spatial/QA: SPATIALSOUNDQA (binaural, QA‑focused).
    • Real FOA: STARSS23 (spatial annotations but no natural language captions).

3.3. Technological Evolution

  • From unimodal acoustic scene classification to audio–text contrastive foundations (CLAP/LAION‑CLAP) enabling zero‑shot semantics.
  • In parallel, SELD methods refined localization from mic arrays/FOA but remained class‑limited and coordinate‑centric.
  • ELSA merges these lines: contrastive audio–text learning with explicit spatial features and supervision to support open‑vocabulary spatial semantics.

3.4. Differentiation Analysis

  • Device‑agnostic spatial encoding: ELSA uses FOA (not tied to binaural HRTFs).
  • Open‑vocabulary: Text encoder enables natural language spatial phrases (“behind,” “left,” “far,” “reverberant room”), unlike fixed‑class SELD.
  • Joint training objective: CLIP loss plus spatial regressions encourages embeddings to encode “what + where.”
  • Synthetic scale: Augments large semantic datasets with parametric rooms and LLM‑rewritten captions, bridging data scarcity in spatial audio–text pairs.

4. Methodology

4.1. Principles

Learn a shared embedding space where semantically and spatially matching audio/text are close, and mismatched pairs are far. Use FOA to capture spatial cues; use mel spectrograms for semantics and intensity vectors for spatial attributes; align audio and text with a batched contrastive (CLIP‑style) objective augmented by supervised spatial regressions (DoA, distance, room area).

4.2. Core Methodology In-Depth

4.2.1. Audio Feature Construction (FOA, mel, and intensity vectors)

Given FOA STFT features $\mathbf{A} \in \mathbb{C}^{T \times F \times (N+1)^{2}}$ (with $N=1$ for FOA), the real-valued log-mel spectrogram is
$$\mathrm{MEL}(t, \nu) = \log\left(\left|\mathbf{A}(t, f)\right|^{2} \cdot \mathbf{W}_{\mathrm{mel}}(f, \nu)\right),$$

  • $t$: time-frame index; $f$: frequency bin; $\nu$: mel filter index.

  • $\mathbf{W}_{\mathrm{mel}}(f,\nu)$: mel filter bank.

  • $\mathbf{A}(t,f)$: FOA complex STFT coefficients (W, X, Y, Z channels in the spherical-harmonic basis).

    Spatial features use active and reactive intensity vectors:
$$I_{\mathrm{active}}(t,f) = \Re\!\left[A_{0,0}^{*}(t,f)\begin{pmatrix} A_{1,-1}(t,f) \\ A_{1,0}(t,f) \\ A_{1,1}(t,f) \end{pmatrix}\right], \qquad I_{\mathrm{reactive}}(t,f) = \Im\!\left[A_{0,0}^{*}(t,f)\begin{pmatrix} A_{1,-1}(t,f) \\ A_{1,0}(t,f) \\ A_{1,1}(t,f) \end{pmatrix}\right],$$

  • $A_{n,m}(t,f)$: FOA spherical-harmonic STFT components; $A_{0,0}$ is the omnidirectional channel (W) and $\{A_{1,-1}, A_{1,0}, A_{1,1}\}$ are the dipoles (order-1 modes).

  • $\Re[\cdot]$ and $\Im[\cdot]$ denote real and imaginary parts; $(\cdot)^{*}$ is the complex conjugate.

  • IVs are scaled to unit norm following SALSA; in practice, they encode the direction of energy flow (active) and the phase behavior of the reactive component.

    Non-spatial (mono) handling: mono audio is replicated across the FOA channels; because the IVs are normalized against $A_{0,0}$, the model can learn to treat non-spatial audio as having no directional preference (identical IVs).
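To make the feature construction concrete, here is a minimal NumPy sketch (not the authors' code) that computes the log-mel spectrogram of the omnidirectional channel and unit-normalized active/reactive intensity vectors from an FOA STFT; the STFT shape, channel ordering, and mel filter bank used below are illustrative assumptions.

```python
import numpy as np

def foa_features(A, W_mel, eps=1e-8):
    """A: complex FOA STFT of shape (T, F, 4), channels ordered (A_00, A_1-1, A_10, A_11).
    W_mel: mel filter bank of shape (F, n_mels).
    Returns a log-mel spectrogram of the omni channel and unit-norm active/reactive
    intensity vectors, loosely following Sec. 4.2.1."""
    W = A[..., 0]                                # omnidirectional channel A_00
    dipoles = A[..., 1:]                         # (A_1-1, A_10, A_11), shape (T, F, 3)

    # Log-mel spectrogram of the omni channel (input to the semantic branch).
    mel = np.log(np.abs(W) ** 2 @ W_mel + eps)   # (T, n_mels)

    # Active (real) and reactive (imaginary) intensity vectors (input to the spatial branch).
    iv = np.conj(W)[..., None] * dipoles         # (T, F, 3)
    i_active = iv.real / (np.linalg.norm(iv.real, axis=-1, keepdims=True) + eps)
    i_reactive = iv.imag / (np.linalg.norm(iv.imag, axis=-1, keepdims=True) + eps)
    return mel, i_active, i_reactive

# Toy example: 100 frames, 257 frequency bins, 64 mel bands.
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 257, 4)) + 1j * rng.standard_normal((100, 257, 4))
W_mel = np.abs(rng.standard_normal((257, 64)))
mel, ia, ir = foa_features(A, W_mel)
print(mel.shape, ia.shape, ir.shape)             # (100, 64) (100, 257, 3) (100, 257, 3)
```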

4.2.2. Audio Encoders: Semantic branch + Spatial branch

  • Semantic audio branch: HTSAT transformer (initialized from LAION‑CLAP), fed with the mel spectrogram of the FOA omni channel (W). This captures content semantics similar to non‑spatial encoders (HTSAT excels at tagging/classification).

  • Spatial attributes branch: A compact CNN (pre‑trained on Spatial LibriSpeech) that ingests active/reactive IVs to represent DoA, distance, and reverberation properties. Pretraining uses a multi‑task regression loss (azimuth/elevation cosine loss; MSE for distance, room volume, and third‑octave DRR/T30 bands).

    The two branches produce 768‑D (semantic) and 192‑D (spatial) vectors, concatenated to 960‑D, then projected via a 2‑layer MLP to a 512‑D audio embedding.
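A minimal PyTorch sketch of how the two audio branch outputs could be fused and projected into the shared 512-D space, matching the dimensions quoted above; the class name, hidden width, activation, and final normalization are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class ElsaAudioProjection(nn.Module):
    """Concatenate the 768-D semantic and 192-D spatial branch outputs and project
    to the 512-D shared embedding with a 2-layer MLP (Sec. 4.2.2)."""
    def __init__(self, d_semantic=768, d_spatial=192, d_hidden=960, d_embed=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_semantic + d_spatial, d_hidden),  # 960 -> 960 (hidden width is an assumption)
            nn.GELU(),                                    # activation choice is an assumption
            nn.Linear(d_hidden, d_embed),                 # 960 -> 512
        )

    def forward(self, z_semantic, z_spatial):
        z = torch.cat([z_semantic, z_spatial], dim=-1)    # (B, 960)
        z = self.mlp(z)
        return nn.functional.normalize(z, dim=-1)         # unit-norm embeddings for cosine similarity

# Toy usage with random branch outputs for a batch of 8 clips.
proj = ElsaAudioProjection()
z_a = proj(torch.randn(8, 768), torch.randn(8, 192))
print(z_a.shape)  # torch.Size([8, 512])
```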

The full model architecture (ELSA) combines these with the text encoder and projection heads. The following figure (Figure A.F.1 from the original paper) shows the full ELSA architecture:

Figure A.F.1: Full architecture diagram for ELSA. Filled blocks include trainable parameters. The diagram shows how the non-spatial (semantic) audio encoder and the spatial attributes encoder are combined, together with the MLP projection heads for the text and audio features.

The spatial attributes branch architecture is a dedicated CNN operating on IVs (with AddCoords2D), pre‑trained then fine‑tuned within ELSA. The following figure (Figure A.F.2) shows this branch:

Figure A.F.2: Architecture diagram for the spatial attributes branch. Filled blocks include trainable parameters. The AddCoords2D block is described in [20]. The branch comprises six convolutional blocks operating on the active and reactive intensity vectors, followed by a 3-layer MLP.

4.2.3. Text Encoder and Projection

  • Text encoder: RoBERTa‑base (pretrained), tokenized with byte‑pair encoding. The 768‑D output is projected via a 2‑layer MLP to 512‑D to match the audio embeddings.
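For symmetry with the audio side, a sketch of the text encoder using Hugging Face's pretrained RoBERTa-base followed by a 2-layer projection MLP; the pooling strategy (first-token hidden state) and output normalization are assumptions, not details confirmed by the paper.

```python
import torch
import torch.nn as nn
from transformers import RobertaModel, RobertaTokenizerFast

class ElsaTextEncoder(nn.Module):
    """RoBERTa-base text encoder with a 2-layer MLP projection to 512-D (Sec. 4.2.3)."""
    def __init__(self, d_embed=512):
        super().__init__()
        self.backbone = RobertaModel.from_pretrained("roberta-base")
        d_model = self.backbone.config.hidden_size          # 768 for roberta-base
        self.proj = nn.Sequential(
            nn.Linear(d_model, d_model), nn.GELU(), nn.Linear(d_model, d_embed)
        )

    def forward(self, input_ids, attention_mask):
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.last_hidden_state[:, 0]                 # <s> token as sentence summary (assumption)
        return nn.functional.normalize(self.proj(pooled), dim=-1)

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
batch = tokenizer(["a dog barks to the left, far away"], return_tensors="pt", padding=True)
encoder = ElsaTextEncoder()
z_t = encoder(batch["input_ids"], batch["attention_mask"])
print(z_t.shape)  # torch.Size([1, 512])
```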

4.2.4. Contrastive Objective (CLIP‑style) and Similarity

InfoNCE loss for a single matched pair $(x_i, y)$ within a batch $X$:
$$\mathcal{L}_{\mathrm{InfoNCE}}(X, x_i, y) = -\log \frac{f_{\mathrm{sim}}(x_i, y)}{\sum_{x_j \in X} f_{\mathrm{sim}}(x_j, y)},$$
with $f_{\mathrm{sim}}(a,b) = \exp(a \cdot b / \tau)$,

  • $a \cdot b$: dot product in embedding space; $\tau$: learnable temperature.

  • Symmetric CLIP batch loss (averaged over $N$ pairs of audio/text embeddings $\{(z_i^a, z_i^t)\}$):
$$\mathcal{L}_{\mathrm{CLIP}} = \frac{1}{2}\left(\frac{1}{N}\sum_{i=1}^{N}\mathcal{L}_{\mathrm{InfoNCE}}(Z^a, z_i^a, z_i^t) + \frac{1}{N}\sum_{i=1}^{N}\mathcal{L}_{\mathrm{InfoNCE}}(Z^t, z_i^t, z_i^a)\right)$$
$$= -\frac{1}{2N}\sum_{i=1}^{N}\left(\log\frac{f_{\mathrm{sim}}(z_i^a, z_i^t)}{\sum_{j=1}^{N} f_{\mathrm{sim}}(z_j^a, z_i^t)} + \log\frac{f_{\mathrm{sim}}(z_i^t, z_i^a)}{\sum_{j=1}^{N} f_{\mathrm{sim}}(z_j^t, z_i^a)}\right),$$

  • $Z^a=\{z_j^a\}$: batch of audio embeddings; $Z^t=\{z_j^t\}$: batch of text embeddings.

  • The objective tightens alignment between matched audio/text samples while repelling mismatched ones.
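The symmetric objective above can be written compactly with in-batch negatives and a learnable temperature; a minimal PyTorch sketch, not the authors' training code.

```python
import torch
import torch.nn.functional as F

def clip_loss(z_audio, z_text, log_tau):
    """Symmetric InfoNCE loss over a batch of paired, unit-norm embeddings (Sec. 4.2.4).
    z_audio, z_text: (N, D); log_tau: learnable log-temperature so tau = exp(log_tau)."""
    logits = (z_audio @ z_text.t()) / log_tau.exp()     # (N, N) scaled similarity matrix
    targets = torch.arange(z_audio.shape[0])            # matched pairs sit on the diagonal
    loss_a = F.cross_entropy(logits, targets)           # each audio against all texts
    loss_t = F.cross_entropy(logits.t(), targets)       # each text against all audios
    return 0.5 * (loss_a + loss_t)

# Toy example with 8 matched pairs of 512-D embeddings.
z_a = F.normalize(torch.randn(8, 512), dim=-1)
z_t = F.normalize(torch.randn(8, 512), dim=-1)
log_tau = torch.tensor(0.0, requires_grad=True)         # tau = 1.0 initially
print(clip_loss(z_a, z_t, log_tau).item())
```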

4.2.5. Spatial Regression Heads and Final Loss

Three 2‑layer MLPs (each ~33k params) regress:

  • DoA (azimuth, elevation) in 3D space,

  • Source distance,

  • Room floor area.

    Final training loss:
$$\mathcal{L}_{\mathrm{ELSA}} = \mathcal{L}_{\mathrm{CLIP}} + \mathcal{L}_{\mathrm{dir}} + \mathcal{L}_{\mathrm{dist}} + \mathcal{L}_{\mathrm{area}},$$

  • $\mathcal{L}_{\mathrm{dir}}$: cosine-similarity loss between predicted and target angles,

  • $\mathcal{L}_{\mathrm{dist}}$, $\mathcal{L}_{\mathrm{area}}$: mean-squared error for distance and area.

    Ablation (Appendix A.6) indicates negligible semantic retrieval cost (~0.4% mAP@10) for substantial gains in DoA (15.3%) and distance estimation (12.3%) when adding spatial heads vs pure CLIP loss.
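A sketch of how the total loss could be assembled, reusing `clip_loss` from the sketch in Section 4.2.4; the head widths (chosen so each 2-layer head has roughly 33k parameters), the unit-vector direction targets, and the unweighted sum follow the description above, but the details are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def regression_head(d_in=512, d_hidden=64, d_out=1):
    """Small 2-layer MLP head for DoA / distance / area regression (Sec. 4.2.5).
    d_hidden=64 gives ~33k parameters per head, consistent with the text (assumption)."""
    return nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_out))

head_dir, head_dist, head_area = regression_head(d_out=3), regression_head(), regression_head()

def elsa_loss(z_audio, z_text, dir_target, dist_target, area_target, log_tau):
    """L_ELSA = L_CLIP + L_dir + L_dist + L_area, as in the equation above."""
    l_clip = clip_loss(z_audio, z_text, log_tau)                       # from the earlier sketch
    dir_pred = F.normalize(head_dir(z_audio), dim=-1)                  # predicted 3D direction
    l_dir = (1.0 - F.cosine_similarity(dir_pred, dir_target, dim=-1)).mean()
    l_dist = F.mse_loss(head_dist(z_audio).squeeze(-1), dist_target)
    l_area = F.mse_loss(head_area(z_audio).squeeze(-1), area_target)
    return l_clip + l_dir + l_dist + l_area

# Toy batch: 8 pairs with unit-vector direction targets.
z_a = F.normalize(torch.randn(8, 512), dim=-1)
z_t = F.normalize(torch.randn(8, 512), dim=-1)
dirs = F.normalize(torch.randn(8, 3), dim=-1)
print(elsa_loss(z_a, z_t, dirs, torch.rand(8), torch.rand(8), torch.tensor(0.0)).item())
```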

4.2.6. Spatial Audio & Caption Augmentation Pipeline

  • Audio simulation (FOA): Parametric rooms varying in size, material (reverberation/T30), and source/receiver placement; audio clips are trimmed and looped to ≥4s; train/val/test use disjoint room configurations.

  • Caption rewriting: Original captions are augmented with spatial descriptors (direction, elevation, near/far, room size, reverb) by prompting LLaMA‑13B with a template that embeds the original caption and its spatial parameters and asks the model to rephrase the sentence (a sketch of such a prompt appears after this subsection). This mitigates missing spatial text and language inconsistencies; the paper notes occasional hallucinations.

    The core training flow: (1) encode FOA audio via semantic+spatial branches, (2) encode spatially‑augmented text, (3) align with CLIP loss, (4) jointly regress spatial labels from audio embeddings.
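To illustrate the caption-rewriting step, here is a sketch of how a spatial prompt might be assembled before being sent to an instruction-following LLM; the template wording and attribute vocabulary below are hypothetical and do not reproduce the paper's exact prompt.

```python
# Hypothetical prompt builder for spatial caption augmentation (Sec. 4.2.6).
# This only shows the idea of injecting direction / distance / room-size descriptors
# and asking the LLM to rephrase; it is not the paper's LLaMA-13B template.

def build_spatial_prompt(caption: str, direction: str, distance: str, room_size: str) -> str:
    return (
        f'The sound: "{caption}" is coming from the {direction} of a {room_size} room, '
        f"{distance} from the listener. Rephrase this as a single natural sentence that "
        f"keeps the original meaning and includes the spatial information."
    )

print(build_spatial_prompt(
    caption="A bird is loudly making a lot of noises.",
    direction="left", distance="far", room_size="medium-sized",
))
```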

4.2.7. Optional Background: Ambisonics Formulation (from Appendix A.5)

Plane-wave expansion on a sphere, with spherical harmonics $Y_n^m$ and radial functions $b_n$:
$$p_{nm}(k, r) = b_n(kr)\int_{\Omega \in S^2} a(k,\Omega)\,[Y_n^m(\Omega)]^{*}\,\mathrm{d}\Omega = b_n(kr)\,A_{nm}(k),$$
$$b_n(kr) = 4\pi i^n\left[j_n(kr) - \frac{j_n'(kr_0)}{h_n'(kr_0)}\,h_n(kr)\right].$$
Inverse spherical Fourier transform:
$$a(k,\Omega) = \sum_{n=0}^{\infty}\sum_{m=-n}^{n} A_{nm}(k)\,Y_n^m(\Omega).$$
Truncation at order $N$ (FOA: $N=1$):
$$p(kr,\Omega) \approx \sum_{n=0}^{N} b_n(kr) \sum_{m=-n}^{n} A_{nm}(k)\,Y_n^m(\Omega).$$
Linear encoding from the microphone-array STFT $\mathbf{p}$ to ambisonics:
$$\mathbf{a}_{nm} = \mathbf{Y}^{H}\,\mathrm{diag}(\mathbf{b}_n)^{-1}\,\mathbf{p},$$
with the spherical-harmonic matrix $\mathbf{Y}$ as defined in the paper. The FOA channels (order $N=1$) at time $t$ and frequency $f$ are
$$\mathbf{A}(t,f) = \begin{bmatrix} A_{0,0}(t,f) \\ A_{1,-1}(t,f) \\ A_{1,0}(t,f) \\ A_{1,1}(t,f) \end{bmatrix}.$$
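As a companion to the formulation above, a minimal sketch of free-field FOA encoding of a mono signal at a given azimuth and elevation, using real first-order spherical harmonics in ACN channel order with SN3D gains; it ignores the radial terms $b_n$ and room acoustics, and ambisonics conventions vary (e.g., FuMa vs. ACN/SN3D), so treat it as illustrative only.

```python
import numpy as np

def encode_foa(mono, azimuth_deg, elevation_deg):
    """Encode a mono signal into 4-channel FOA (ACN order: W, Y, Z, X; SN3D gains)
    for a far-field source at the given direction.  Free-field approximation only."""
    az, el = np.deg2rad(azimuth_deg), np.deg2rad(elevation_deg)
    gains = np.array([
        1.0,                        # W  (A_00): omnidirectional
        np.sin(az) * np.cos(el),    # Y  (A_1,-1)
        np.sin(el),                 # Z  (A_1,0)
        np.cos(az) * np.cos(el),    # X  (A_1,1)
    ])
    return mono[None, :] * gains[:, None]     # shape (4, num_samples)

# One second of noise placed 90 degrees to the left, at ear level.
foa = encode_foa(np.random.randn(16000), azimuth_deg=90.0, elevation_deg=0.0)
print(foa.shape)  # (4, 16000)
```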

5. Experimental Setup

5.1. Datasets

  • Semantic (non‑spatial) sources:

    • AudioCaps (YouTube audio with captions; 49,274 samples; 136.87 hours).
    • Clotho (crowd‑captioned; 3,839 samples; 23.99 hours).
    • Freesound subset (414,127 samples; 2,528.15 hours).
  • Spatially‑augmented variants (synthetic FOA + LLM‑rewritten captions):

    • Spatial‑AudioCaps (98,459 samples; 258.12 hours),
    • Spatial‑Clotho (8,546 samples; 55.0 hours),
    • Spatial‑Freesound (783,033 samples; 4,425.53 hours).
  • Spatial‑RWD (real‑world FOA): 70 samples in 5 rooms (48 kHz, 24‑bit) with annotated semantic and coarse spatial labels (direction: left/right/front/back; distance: near/far; elevation: up/down/level).

    The following are the results from Table A.T.1 of the original paper:

    Dataset | Spatial audio | Splits | Num. samples | Duration (hrs) | Caption description
    Clotho | ✗ | train, val, test | 3,839 | 23.99 | 5 captions per audio
    AudioCaps | ✗ | train, val, test | 49,274 | 136.87 | 12 captions per audio
    FreeSound | ✗ | train, val, test | 414,127 | 2,528.15 | 12 captions per audio, keyword tags
    Spatial-Clotho | Synthetic | train, val, test | 8,546 | 55.0 | 5 spatially augmented captions per audio
    Spatial-AudioCaps | Synthetic | train, val, test | 98,459 | 258.12 | 12 spatially augmented captions per audio
    Spatial-FreeSound | Synthetic | train, val, test | 783,033 | 4,425.53 | 12 spatially augmented captions per audio
    Spatial-RWD | Recorded | test | 70 | 0.25 | 12 human-annotated spatial captions per audio

Room simulation ranges (for augmentation) include azimuth, elevation, distance, area, T30. The following are the results from Table A.T.2 of the original paper:

Parameter | Train & validation | Test
Number of simulated rooms | 8,952 | 4,970
Source azimuth | [−180.0°, +180.0°] | [−180.0°, +180.0°]
Source elevation | [−47.5°, +48.7°] | [−29.8°, +42.4°]
Source distance | [0.5 m, 4.0 m] | [0.9 m, 4.0 m]
Room floor area | [13.3 m², 277.4 m²] | [14.3 m², 277.4 m²]
Full-band T30 | [144.5 ms, 2671.9 ms] | [167.8 ms, 1254.8 ms]

Example of spatial caption rewriting (from Appendix A.3):

  • Input: Original caption “A bird is loudly making a lot of noises.” Distance: far. Room size: medium.
  • Rewritten: “In a medium-sized room, a bird is emitting loud sounds from a distant location.”

5.2. Evaluation Metrics (definitions and formulas)

  • Recall@K (R@K): Measures how often the correct match is within the top‑K retrieved items.

    • Concept: For each query (audio or text), rank candidates by cosine similarity; success=1 if a correct paired item is among top K, else 0; average over queries.
    • Formula (per query $q$): $r@K(q)=\mathbb{1}\{\exists\,g\in G_q: \mathrm{rank}_q(g)\le K\}$, then $R@K=\frac{1}{|Q|}\sum_{q\in Q} r@K(q)$.
    • Symbols: $Q$ is the set of queries; $G_q$ the ground-truth matches for $q$; $\mathrm{rank}_q(g)$ the rank position of $g$ under query $q$; $\mathbb{1}\{\cdot\}$ the indicator function.
  • Mean Average Precision at 10 (mAP@10): Averages per‑query precision across relevant items up to K=10, then averages across queries.

    • Concept: Rewards correct items appearing early in the ranked list; sensitive to multiple references per query (e.g., multiple captions).
    • Per-query AP@K: $$\mathrm{AP@}K(q) = \frac{1}{\min(K, |G_q|)} \sum_{k=1}^{K} P_q(k)\cdot \mathrm{rel}_q(k),$$ where $P_q(k)$ is the precision at cut-off $k$ and $\mathrm{rel}_q(k)\in\{0,1\}$ indicates whether the item at rank $k$ is relevant for $q$. Then $\mathrm{mAP@}10 = \frac{1}{|Q|}\sum_{q\in Q}\mathrm{AP@}10(q)$.
    • Symbols: $G_q$ is the relevant set for $q$; $P_q(k)=\frac{1}{k}\sum_{i=1}^{k}\mathrm{rel}_q(i)$.
  • Mean Absolute Error (MAE) for 3D DoA:

    • Concept: Average angular error on the unit sphere between predicted and true directions (in degrees).
    • Using 3D unit vectors $\hat{u}$ (predicted) and $u$ (true): $$\theta(u,\hat{u}) = \arccos\left( u \cdot \hat{u} \right), \quad \mathrm{MAE} = \frac{180}{\pi}\cdot \frac{1}{N}\sum_{i=1}^{N}\theta(u_i,\hat{u}_i).$$
    • Symbols: $u,\hat{u}\in\mathbb{R}^{3}$ are unit vectors; $N$ is the number of samples.
  • SPIDEr (for captioning): Combines SPICE (semantic scene graph F1) and CIDEr (n‑gram consensus).

    • Concept: Balances semantic correctness and n‑gram matching; widely used in captioning.
    • Formula (standard): $ \mathrm{SPIDEr} = \frac{1}{2}\left(\mathrm{SPICE} + \mathrm{CIDEr}\right). $
    • Symbols: SPICE and CIDEr scores computed between generated and reference captions.
  • FENSE (for captioning): Fluency‑ENhanced Sentence‑level metric (Zhou et al., 2022).

    • Concept: Combines sentence-level semantic similarity (from a Sentence-BERT model) with a fluency penalty from a learned error detector, so that disfluent or erroneous captions score lower; values lie roughly in [−1, +1].
    • Scoring: the similarity between the candidate and its references is scaled down whenever the fluency classifier flags the candidate; the exact penalty weighting follows Zhou et al. (2022).
  • Accuracy (for zero‑shot spatial attribute classification): Fraction of test items correctly classified when matching audio embeddings to templated spatial probe captions via cosine similarity.
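A small NumPy sketch of two of the metrics above, Recall@K from a similarity matrix and 3D DoA mean absolute error from unit vectors; an illustrative implementation assuming one ground-truth match per query, not the paper's evaluation code.

```python
import numpy as np

def recall_at_k(similarity, k):
    """similarity[i, j]: cosine similarity of query i and candidate j; the correct
    match for query i is assumed to be candidate i (one ground truth per query)."""
    ranks = np.argsort(-similarity, axis=1)                   # candidates sorted best-first
    hits = (ranks[:, :k] == np.arange(len(similarity))[:, None]).any(axis=1)
    return hits.mean()

def doa_mae_degrees(u_true, u_pred, eps=1e-8):
    """Mean angular error (degrees) between true and predicted 3D direction vectors."""
    u_true = u_true / (np.linalg.norm(u_true, axis=1, keepdims=True) + eps)
    u_pred = u_pred / (np.linalg.norm(u_pred, axis=1, keepdims=True) + eps)
    cos = np.clip((u_true * u_pred).sum(axis=1), -1.0, 1.0)
    return np.degrees(np.arccos(cos)).mean()

# Toy check: a near-perfect retrieval matrix and a localizer that is 5 degrees off.
sim = np.eye(4) + 0.01 * np.random.rand(4, 4)
print(recall_at_k(sim, k=1))                                  # 1.0
true = np.array([[1.0, 0.0, 0.0]])
off = np.deg2rad(5.0)
pred = np.array([[np.cos(off), np.sin(off), 0.0]])
print(round(doa_mae_degrees(true, pred), 2))                  # ~5.0
```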

5.3. Baselines

  • LAION-CLAP: Open‑vocabulary audio–text embedding model trained on non‑spatial audio; strong semantic retrieval baseline; expected to fail on spatial understanding.
  • SELDNet: Convolutional recurrent baseline for 3D localization/detection on TUT Sound Events 2018; limited vocabulary; coordinate‑centric outputs.
  • PILOT: Transformer‑based probabilistic SELD; strong 3D DoA performance on datasets akin to TUT; not language‑aligned.
  • Spatial LibriSpeech (model comparison from its paper): System trained to regress spatial attributes on synthetic FOA speech; included for 3D localization context.

6. Results & Analysis

6.1. Core Results Analysis

  • Joint capability: ELSA is the only model in comparisons that performs open‑vocabulary semantic retrieval and spatial localization in one model.
  • 3D localization: On TUT REAL, ELSA achieves 14.97° MAE. LAION‑CLAP (non‑spatial) fails (95.29°). ELSA improves on SELDNet by 11.6° MAE and trails PILOT (reported at ~4.2° on related data), which is specialized for localization.
  • Non‑spatial semantic retrieval: On AudioCaps/Clotho (original, non‑spatial), ELSA remains comparable to CLAP/LAION‑CLAP, despite being trained with mixed spatial+non‑spatial data.
  • Spatial zero‑shot via language: On synthetic spatial datasets, ELSA exceeds 90% accuracy for direction/elevation/distance; real‑world transfer shows meaningful, though reduced, performance (e.g., 67.1% distance; 35.8% direction), reflecting both domain shift and coarse human annotations.
  • Interpretable embedding space: UMAP shows clustering by direction and distance; a regressor trained on audio embeddings can predict direction labels from spatial text embeddings (64.3% 4‑way), indicating cross‑modal spatial alignment.
  • Vector arithmetic: Subtracting a text “direction prototype” from an audio embedding and adding a new prototype flips perceived direction in 99.7% of cases, with ~−0.2% change in semantic recall@10—evidence of disentangled spatial attributes.
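A sketch of the text-prototype arithmetic described above: build a prototype per direction by averaging text embeddings of templated spatial captions, then subtract the source-direction prototype from an audio embedding and add the target-direction prototype; the template wording, prototype construction, and renormalization step are assumptions.

```python
import torch
import torch.nn.functional as F

def direction_prototypes(text_embed_fn, directions=("left", "right", "front", "behind")):
    """Average text embeddings of a few templated spatial captions per direction.
    `text_embed_fn` maps a list of strings to unit-norm embeddings of shape (n, D)."""
    templates = ["a sound coming from the {d}", "the sound is to the {d} of the listener"]
    return {d: text_embed_fn([t.format(d=d) for t in templates]).mean(dim=0)
            for d in directions}

def swap_direction(z_audio, protos, src, dst):
    """Move an audio embedding from direction `src` to `dst` via prototype arithmetic."""
    z = z_audio - protos[src] + protos[dst]
    return F.normalize(z, dim=-1)

# Toy usage with a random stand-in for the ELSA text encoder.
fake_text_embed = lambda caps: F.normalize(torch.randn(len(caps), 512), dim=-1)
protos = direction_prototypes(fake_text_embed)
z_new = swap_direction(F.normalize(torch.randn(512), dim=-1), protos, "left", "right")
print(z_new.shape)  # torch.Size([512])
```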

6.2. Data Presentation (Tables)

The following are the results from Table 1 of the original paper:

MODEL | Semantic capabilities | Spatial capabilities | AudioCaps mAP@10 ↑ | REAL 3D Local. (°) ↓
SELDNet [1] | ✗ (limited vocab.) | ✓ | — | 26.6
PILOT [33] | ✗ (limited vocab.) | ✓ | — | 4.2
Spatial LibriSpeech [31] | ✗ | ✓ | — | 12.4
LAION-CLAP [44] | ✓ (open vocab.) | ✗ | 43.8 | 95.29
ELSA (ours) | ✓ (open vocab.) | ✓ | 44.2 | 14.97

The following are the results from Table 2 of the original paper:

TASK | S-Clotho | S-AC | S-RWD
Distance (2-class) | 96.0% | 92.9% | 67.1%
Direction (4-class) | 92.0% | 92.8% | 35.8%
Elevation (2-class) | 100.0% | 100.0% | 72.1%
Room area (2-class) | 76.6% | 74.7% | N/A
Reverberation (2-class) | 100.0% | 83.3% | N/A

The following are the results from Table 3 of the original paper:

MODEL | Train data | AudioCaps T→A (R@1/5/10) | AudioCaps A→T (R@1/5/10) | Clotho T→A (R@1/5/10) | Clotho A→T (R@1/5/10)
CLAP (paper) | C, AC, LA | 34.7 / 70.5 / 83.2 | 45.3 / 79.5 / 89.2 | 16.4 / 39.0 / 51.0 | 21.8 / 44.6 / 60.1
CLAP (local) | C, AC, FS | 32.7 / 68.8 / 81.5 | 40.7 / 74.0 / 84.7 | 14.4 / 37.6 / 50.7 | 18.3 / 40.5 / 55.1
ELSA | C, AC, FS, CS, ACS, FS | 33.2 / 68.2 / 81.0 | 40.9 / 74.4 / 86.1 | 15.0 / 36.7 / 50.8 | 20.1 / 43.2 / 55.4

The following are the results from Table A.T.4 of the original paper:

MODEL | 3D Local. (°) ↓ | Dist. (cm) ↓ | mAP@10 ↑
Static intensity vectors | 27.86 | 60.2 | 23.43
Pretrained spatial branch & CLIP loss only | 27.4 | 54.31 | 24.93
Pretrained spatial branch & spatial losses | 23.2 | 47.71 | 24.81

The following are the results from Table A.T.5 of the original paper:

MODEL | Train audio | Train captions | AudioCaps T→A (R@1/5/10) | AudioCaps A→T (R@1/5/10) | Clotho T→A (R@1/5/10) | Clotho A→T (R@1/5/10)
CLAP | Non-spatial | Non-spatial | 32.7 / 68.8 / 81.5 | 40.7 / 74.0 / 84.7 | 14.4 / 37.6 / 50.7 | 18.3 / 40.5 / 55.1
ELSA | Spatial | Non-spatial | 27.1 / 62.7 / 76.1 | 36.6 / 68.7 / 78.4 | 11.3 / 32.6 / 44.4 | 12.4 / 28.3 / 50.0
ELSA | Spatial | Spatial | 25.3 / 59.3 / 72.5 | 34.8 / 64.5 / 75.2 | 9.9 / 31.0 / 39.8 | 12.1 / 35.3 / 47.3
ELSA | Mixed | Mixed | 33.2 / 68.2 / 81.0 | 40.9 / 74.4 / 86.1 | 15.0 / 36.7 / 50.8 | 20.1 / 43.2 / 55.4

The following are the results from Table A.T.7 of the original paper (ELSA vs LAION‑CLAP zero‑shot with spatial probe captions):

TASK | ELSA S-Clotho | ELSA S-AC | ELSA S-RWD | LAION-CLAP S-Clotho | LAION-CLAP S-AC | LAION-CLAP S-RWD
Distance (2-class) | 96.0% | 92.9% | 67.1% | 48.0% | 54.3% | 53.0%
Direction (4-class) | 92.0% | 92.8% | 35.8% | 28.2% | 29.3% | 27.3%
Elevation (2-class) | 100.0% | 100.0% | 72.1% | 56.7% | 51.3% | 59.4%
Room area (2-class) | 76.6% | 74.7% | N/A | 46.3% | 66.5% | N/A
Reverberation (2-class) | 100.0% | 83.3% | N/A | 57.3% | 52.5% | N/A

The following are the results from Table A.T.8 of the original paper (spatial retrieval):

MODEL | Train data | Spatial-AudioCaps T→A (R@1/5/10) | Spatial-AudioCaps A→T (R@1/5/10) | Spatial-Clotho T→A (R@1/5/10) | Spatial-Clotho A→T (R@1/5/10)
ELSA | Sp(Clotho + AC) | 25.3 / 58.9 / 73.5 | 32.6 / 61.5 / 73.0 | 9.4 / 25.9 / 38.0 | 10.5 / 27.8 / 40.2
ELSA | Sp(Clotho + AC + FS) | 24.2 / 56.7 / 71.82 | 30.5 / 59.42 / 72.8 | 11.26 / 30.72 / 43.07 | 12.6 / 32.0 / 44.08

The following are the results from Table A.T.9 of the original paper (real‑world spatial retrieval):

Train Data SPATIAL-RWD TEXT-TO-AUDIO SPATIAL-RWD AUDIO-TO-TEXT
ELSA Sp(Clotho + AC) 18.6 R@10 R@1 R@5
ELSA Sp(Clotho + AC + FS) 41.4 54.3 68.6 75.7 25.7 88.6 35.7

The following are the results from Table A.T.11 of the original paper (direction swapping):

Original direction | N | R@10 | New: LEFT (θ / ΔR@10) | New: FRONT (θ / ΔR@10) | New: RIGHT (θ / ΔR@10) | New: BACK (θ / ΔR@10)
LEFT | 14,156 | 94.9% | | | |
FRONT | 13,486 | 81.1% | 100.0% / +1.9% | | 100.0% / +0.6% | 100.0% / −0.4%
RIGHT | 12,196 | 92.3% | 100.0% / −0.5% | 100.0% / +0.5% | | 100.0% / −0.5%
BACK | 9,564 | 81.6% | 97.3% / −1.4% | 100.0% / −0.5% | 100.0% / −0.9% |

The following are the results from Table A.T.12 of the original paper (direction removal ablation):

Direction | N | R@10 | θ | ΔR@10
LEFT | 14,156 | 94.9% | 0.0% | −0.6%
FRONT | 13,486 | 81.1% | 0.0% | −0.6%
RIGHT | 12,196 | 92.3% | 0.0% | −5.1%
BACK | 9,564 | 81.6% | 0.0% | −2.3%

Captioning metrics. The following are the results from Table 4 of the original paper:

METRIC | Range | S-Clotho | S-AC
SPIDEr | [0, 5.5] | 0.19 | 0.34
FENSE | [−1.0, +1.0] | 0.59 | 0.68
# Unique words | [0, ∞) | 1103 | 1258

6.3. Figures and Qualitative Analyses

  • Representation by direction (UMAP): After explaining Section 6.1’s interpretability, we show the projection used to visualize clustering by direction, with audio vs caption embeddings overlapping in structure. The following figure (Figure 2 from the original paper) shows the system’s directional clustering:

    Figure 2: UMAP projection of ELSA embeddings of the test splits of Spatial-Clotho and Spatial-AudioCaps. Filled markers are obtained from spatial audio, and hollow markers are obtained from spatial captions. The UMAP projection was fitted with the train splits of Spatial-Clotho and Spatial-AudioCaps, using supervised dimension reduction to highlight the direction differences rather than the semantic differences in the embeddings (a minimal sketch of this kind of supervised projection appears after this list).

  • Distance clustering (UMAP): Distinct clusters for near vs far indicate that ELSA embeddings encode distance attributes. The following figure (Figure A.F.4) illustrates the separation:

    Figure A.F.4: UMAP projection of ELSA embeddings of the test splits of Spatial-Clotho and Spatial-AudioCaps. Filled markers are obtained from spatial audio, and hollow markers are obtained from spatial captions. The UMAP projection was fitted with the train splits of Spatial-Clotho and Spatial-AudioCaps, using supervised dimension reduction to highlight the distance differences rather than the semantic differences in the embeddings.

  • Fine‑grained DoA error analysis: Errors remain low across azimuth/elevation/distance/room/T30 bins, with slightly larger errors at extremes (e.g., very large rooms/high T30). The following figure (Figure A.F.3) shows boxplots across conditions and semantic classes:

    Figure A.F.3: Boxplots of absolute direction-of-arrival errors predicted by the 2-layer MLP. Figs. (a)-(e) show errors on the Spatial-AudioCaps and Spatial-Clotho test sets broken down by different categories; Fig. (f) shows predictions on the test set of TUT Sound Events 2018 by semantic class. Boxes represent the interquartile range, solid orange lines the median, and dashed green lines the mean.

  • Caption generation architecture: For Section 6.4 below. The following figure (Figure A.F.5) shows the GPT‑2 prefix‑tuning architecture driven by ELSA embeddings:

    Figure A.F.5: Architecture diagram for spatial audio caption generation. FOA audio is processed by the ELSA audio branch to produce an embedding $Z_a$, which is mapped through an MLP into a GPT-2 prefix and decoded autoregressively.
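The supervised UMAP projections referenced in Figures 2 and A.F.4 can be reproduced in spirit with umap-learn's target-supervised mode; a sketch with random stand-in embeddings and coarse direction labels (the actual ELSA embeddings and plotting details are not reproduced here).

```python
import numpy as np
import umap   # pip install umap-learn

# Stand-ins for ELSA embeddings and coarse direction labels (0=left, 1=front, 2=right, 3=back).
rng = np.random.default_rng(0)
train_emb, train_dir = rng.standard_normal((2000, 512)), rng.integers(0, 4, 2000)
test_emb = rng.standard_normal((500, 512))

# Fit on the train split with supervised dimension reduction (labels passed as y),
# then project the test split, mirroring the procedure described in the caption of Figure 2.
reducer = umap.UMAP(n_components=2, random_state=0)
reducer.fit(train_emb, y=train_dir)
test_2d = reducer.transform(test_emb)
print(test_2d.shape)  # (500, 2)
```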

6.4. Ablations / Parameter Analysis

  • Spatial features and losses (A.T.4): Using only static IVs is weakest; a learned spatial branch improves both localization and distance errors; adding spatial losses on top of CLIP further improves spatial regression with minimal semantic trade‑off.
  • Mixed data necessity (A.T.5): Training on mixed mono+spatial audio and captions is crucial for best non‑spatial retrieval, preserving semantic generalization while enabling spatial understanding.
  • LAION‑CLAP spatial probe (A.T.7): As expected, LAION‑CLAP (non‑spatial) performs near random on spatial attribute zero‑shot classification, underscoring the need for spatial training signals.
  • Spatial retrieval on synthetic vs real (A.T.8–A.T.9): ELSA’s spatial retrieval drops on real FOA (domain shift and annotation coarseness) but still leads, indicating partial transfer without fine‑tuning.
  • Direction swapping/removal (A.T.11–A.T.12): Spatial direction is linearly manipulable in embedding space; removing direction prototype collapses classification to 0% (no direction), while swapping achieves ~100% correct new direction with marginal semantic recall change.

6.5. Spatial Caption Generation

  • Method: Freeze ELSA encoder; learn a single dense layer to map 512‑D audio embeddings to a GPT‑2 prefix; fine‑tune GPT‑2 (12 layers, 12 heads; 163M params) on 150k spatial audio–caption pairs (Spatial‑Clotho/Spatial‑AudioCaps).
  • Results: SPIDEr and FENSE indicate viable spatial captioning with moderate vocabulary size; more work needed for richer, less templated outputs.
  • Architecture shown earlier (Figure A.F.5).
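A sketch of the prefix-conditioning idea in Section 6.5: map a frozen 512-D ELSA audio embedding to a short sequence of GPT-2 input embeddings and decode a caption. The prefix length, the mapping layer, and the generation settings are assumptions; decoding from `inputs_embeds` requires a reasonably recent transformers release, and an untrained mapper will of course produce meaningless text.

```python
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
d_gpt, prefix_len = gpt2.config.n_embd, 8                       # 768; prefix length is an assumption

# Dense mapping from the 512-D ELSA audio embedding to `prefix_len` GPT-2 token embeddings.
prefix_mapper = nn.Linear(512, prefix_len * d_gpt)

def generate_caption(z_audio, max_new_tokens=30):
    prefix = prefix_mapper(z_audio).view(1, prefix_len, d_gpt)  # (1, prefix_len, 768)
    out = gpt2.generate(
        inputs_embeds=prefix,
        max_new_tokens=max_new_tokens,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )
    return tokenizer.decode(out[0], skip_special_tokens=True)

# Untrained toy usage with a random stand-in for an ELSA audio embedding.
print(generate_caption(torch.randn(1, 512)))
```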

7. Conclusion & Reflections

7.1. Conclusion Summary

  • ELSA is a single audio–text model that encodes both semantics and spatial attributes using FOA inputs, intensity vectors, and a CLIP‑style contrastive objective augmented with spatial regressions.
  • It retains strong non‑spatial semantic retrieval (comparable to LAION‑CLAP) while greatly improving spatial understanding (e.g., 14.97° DoA MAE vs 95.29° for LAION‑CLAP on TUT REAL).
  • The embedding space is structured: spatial direction can be edited via simple text vector arithmetic without degrading semantics.
  • Synthetic spatial augmentation (audio rooms + LLM‑rewritten captions) is an effective strategy to overcome data scarcity for spatial audio–language learning.

7.2. Limitations & Future Work

  • Data realism: Synthetic rooms and LLM‑rewritten captions introduce domain gaps and potential hallucinations; real‑world transfer (direction classification) is notably lower.
  • FOA order: Only first‑order ambisonics; higher orders could capture finer spatial detail but require different sensors and more compute.
  • Single static sources: Overlapping and moving sources are left for future work.
  • Binaural/HRTF: Device/person‑specific binaural cues are not modeled; future extensions could add rendering‑aware conditioning while keeping device agnosticism.
  • Caption consistency: Ensuring spatial rewrites never alter semantics remains an open challenge; hallucination detection/mitigation is promising future work.

7.3. Personal Insights & Critique

  • Strengths:
    1. A clear, principled integration of spatial audio features with contrastive multimodal learning;
    2. Comprehensive synthetic data pipeline;
    3. Strong evidence that spatial attributes are disentangled enough for controllable editing.
  • Risks:
    • LLM rewrite drift: Even low hallucination rates can bias training if unmonitored; adding verification (e.g., CLIP‑style semantic consistency checks) could help.
    • Evaluation breadth: Beyond TUT REAL, more real‑world FOA datasets with natural language captions would strengthen generalization claims.
  • Transfer to other domains: The approach should transfer to other spatial modalities (e.g., audio‑vision‑text 3D embeddings) and to robotics (open‑vocabulary spatial audio grounding). The vector arithmetic result suggests powerful latent factorization that downstream generation/editing models can exploit (e.g., text‑conditioned spatial mixing).
  • Practical next steps:
    1. Collect or align real FOA with natural descriptions (crowd‑captioned STARSS23‑like data);
    2. Add moving/overlapping sources;
    3. Explore higher‑order ambisonics and cross‑device robustness;
    4. Tighten LLM rewrite quality control and counterfactual augmentations to probe spatial reasoning (e.g., consistent edits across azimuth bins).

Similar papers

Recommended via semantic vector search.

No similar papers found yet.