Sebastian Braun, Hannes Gamper, Dimitra Emmanouilidou
Microsoft Research Redmond, WA, USA
Modern generative and multimodal models increasingly rely on compact latent representations that balance semantic richness against high-fidelity reconstruction. We introduce SALAD-VAE, a continuous and highly compact semantic Audio Variational Autoencoder, which operates in the frequency domain and achieves state-of-the-art compression at a very low latent frame rate (7.8 Hz) while surfacing semantic structure and producing high audio quality. We enhance standard VAE training with semantic losses and augmentation, specifically contrastive learning and CLAP-based embedding distillation, enabling the model to generalize across diverse audio domains. With a significantly less computationally complex architecture than comparable state-of-the-art VAEs, SALAD-VAE matches their reconstruction quality while consistently outperforming them on a wide range of classification benchmarks. Furthermore, the proposed additional loss function provides a trained CLAP projection layer, which can be used for zero-shot audio captioning and classification by matching pretrained CLAP audio-text embeddings.
Figure 1: Main results: SALAD-VAE provides state-of-the-art audio compression and reconstruction quality at a very low latent frame rate of 7.8 Hz, while surfacing semantic structure in its latent space. It outperforms competing models on a wide range of classification benchmarks, despite its significantly lower computational complexity. Additionally, the proposed CLAP-based loss provides a trained CLAP projection layer, which can be used for zero-shot audio captioning and classification.
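
Zero-shot classification with a CLAP projection layer can be pictured as nearest-neighbor matching in the shared audio-text embedding space. The following is a minimal sketch, not the authors' implementation. It assumes the projection layer has already mapped the VAE latents to a CLAP-sized audio embedding, and that text embeddings for the candidate labels come from a pretrained CLAP text encoder. The function names and the toy 3-dimensional vectors are hypothetical and chosen only for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(audio_embedding, text_embeddings):
    """Pick the label whose text embedding is closest to the audio embedding.

    audio_embedding: list of floats (assumed output of a projection layer).
    text_embeddings: dict mapping label -> list of floats (assumed output
                     of a pretrained text encoder for a prompt per label).
    Returns (best_label, scores) where scores maps label -> cosine similarity.
    """
    scores = {
        label: cosine_similarity(audio_embedding, embedding)
        for label, embedding in text_embeddings.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Toy embeddings, made up for illustration; real CLAP embeddings are
# much higher-dimensional and come from trained encoders.
text_embeddings = {
    "speech": [1.0, 0.0, 0.0],
    "music": [0.0, 1.0, 0.0],
    "glass breaking": [0.0, 0.0, 1.0],
}
audio_embedding = [0.1, 0.9, 0.2]

label, scores = zero_shot_classify(audio_embedding, text_embeddings)
print(label)
```

Because no classifier is trained on the labels, the label set can be changed at inference time simply by embedding a different list of text prompts.
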
| Description | Original | StableAudio VAE | Music2Latent | SALAD D=64 | SALAD large D=128 | SALAD large D=128 no semantic losses |
|---|---|---|---|---|---|---|
| funk music | ||||||
| acoustic music | ||||||
| speech and door shutting | ||||||
| bandpassed speech | ||||||
| distorted speech | ||||||
| crowd | ||||||
| glass breaking | ||||||
| violin | ||||||
| rustle | ||||||
@inproceedings{braun2026,
title={SALAD-VAE: Semantic Audio Compression with Language-Audio Distillation},
author={Braun, Sebastian and Gamper, Hannes and Emmanouilidou, Dimitra},
journal={arXiv},
note={Submitted to IEEE ICASSP 2026}
}