SALAD-VAE: Semantic Audio Compression with Language-Audio Distillation

Sebastian Braun, Hannes Gamper, Dimitra Emmanouilidou

Microsoft Research Redmond, WA, USA

Abstract


Modern generative and multimodal models increasingly rely on compact latent representations that balance semantic richness with high-fidelity reconstruction. We introduce SALAD-VAE, a continuous and highly compact semantic audio variational autoencoder (VAE) that operates in the frequency domain and achieves state-of-the-art compression at a very low latent frame rate (7.8 Hz) while surfacing semantic structure and producing high audio quality. We enhance the standard VAE training with semantic losses and augmentation, specifically contrastive learning and CLAP-based embedding distillation, enabling it to generalize across diverse audio domains. With a significantly less computationally complex architecture than comparable state-of-the-art VAEs, SALAD-VAE matches their reconstruction quality while consistently outperforming them on a wide range of classification benchmarks. Furthermore, the proposed additional loss function provides a trained CLAP projection layer, which can be used for zero-shot audio captioning and classification by matching against pretrained CLAP audio-text embeddings.
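As a rough illustration of the zero-shot classification idea mentioned in the abstract, the sketch below scores one projected audio embedding against a set of text embeddings by cosine similarity and picks the best match. It is only a toy under stated assumptions: the function names, the 3-dimensional vectors, and the cosine-distance form of the distillation loss are all hypothetical and are not taken from the paper, which does not specify these details here.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def distillation_loss(projected_audio, clap_target):
    """ASSUMED loss form: cosine distance between the VAE's projected
    embedding and a target embedding from a pretrained CLAP model."""
    return 1.0 - cosine_similarity(projected_audio, clap_target)


def zero_shot_classify(projected_audio, text_embeddings):
    """Return the label whose text embedding is most similar to the audio."""
    scores = {
        label: cosine_similarity(projected_audio, emb)
        for label, emb in text_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy stand-ins for real CLAP text embeddings (hypothetical values).
text_embeddings = {
    "violin": [1.0, 0.0, 0.0],
    "glass breaking": [0.0, 1.0, 0.0],
    "crowd": [0.0, 0.0, 1.0],
}
# Toy stand-in for the output of the trained projection layer.
audio_embedding = [0.9, 0.1, 0.2]

label, scores = zero_shot_classify(audio_embedding, text_embeddings)
print(label)
```

In a real setup the text embeddings would come from the pretrained CLAP text encoder and the audio embedding from the projection layer trained with the additional loss; only the similarity-and-argmax step is shown here.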





Figure 1: Main results: SALAD-VAE provides state-of-the-art audio compression and reconstruction quality at a very low latent frame rate of 7.8 Hz, while surfacing semantic structure in its latent space. It outperforms competing models on a wide range of classification benchmarks, despite its significantly lower computational complexity. Additionally, the proposed CLAP-based loss provides a trained CLAP projection layer, which can be used for zero-shot audio captioning and classification.
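To put the 7.8 Hz latent frame rate in perspective, the following back-of-the-envelope sketch converts it into a frame duration and a count of latent values per second. It assumes that D in the sample table below denotes the latent dimension per frame, and it ignores padding and rounding at clip boundaries; both are assumptions, and the helper names are hypothetical.

```python
FRAME_RATE_HZ = 7.8  # latent frame rate stated on this page


def frame_duration_s(frame_rate_hz=FRAME_RATE_HZ):
    """Seconds of audio covered by one latent frame."""
    return 1.0 / frame_rate_hz


def approx_latent_frames(duration_s, frame_rate_hz=FRAME_RATE_HZ):
    """Approximate number of latent frames for a clip (no padding/rounding)."""
    return duration_s * frame_rate_hz


def latent_values_per_second(latent_dim, frame_rate_hz=FRAME_RATE_HZ):
    """Latent values per second, ASSUMING latent_dim values per frame."""
    return latent_dim * frame_rate_hz


print(frame_duration_s())            # roughly 0.128 s per latent frame
print(approx_latent_frames(10.0))    # roughly 78 frames for a 10 s clip
print(latent_values_per_second(64))  # roughly 499 values/s if D=64
print(latent_values_per_second(128)) # roughly 998 values/s if D=128
```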

Audio Samples

Description | Original | StableAudio VAE | Music2Latent | SALAD D=64 | SALAD large D=128 | SALAD large D=128 (no semantic losses)
funk music
acoustic music
speech and door shutting
bandpassed speech
distorted speech
crowd
glass breaking
violin
rustle

Citation



@article{braun2026,
  title={SALAD-VAE: Semantic Audio Compression with Language-Audio Distillation},
  author={Braun, Sebastian and Gamper, Hannes and Emmanouilidou, Dimitra},
  journal={arXiv},
  note={Submitted to IEEE ICASSP 2026}
}