Interspeech 2026 Challenge

Unsupervised Speech in the Wild

Learning Robust Multilingual Representations from Unsupervised People's Speech

Release & Baselines Dec 12, 2025
Paper Deadline Feb 25, 2026
Conference Sep 27 – Oct 1, 2026

Overview

This challenge evaluates self-supervised speech representation learning on real-world heterogeneous audio at scale. Participants train models without transcripts using the Unsupervised People's Speech (UPS) dataset.

Performance will be assessed via phonetic discrimination, low-resource ASR probes, multilingual language ID, and speaker clustering.

Significance

This challenge advances speech research by moving beyond curated corpora, emphasizing multilingual and accent diversity in realistic acoustic environments that include noise and music. By evaluating self-supervised models on spontaneous, heterogeneous audio, the challenge promotes scalable, inclusive representation learning aligned with the Interspeech 2026 theme "Speaking Together", encouraging models that generalize across global voices and real-world scenarios.

Evaluation Tasks

Submitted models will be evaluated on three primary tasks that reflect real-world speech understanding performance:

Few-Shot ASR

Evaluate automatic speech recognition accuracy with minimal labeled examples, testing the model's ability to rapidly adapt to new transcription tasks.
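
A minimal sketch of what such a probe can look like, assuming a frozen pretrained encoder that maps waveforms to frame-level features; the encoder interface, character vocabulary, and data handling below are placeholders rather than the official evaluation pipeline. Only a small linear CTC head is fit on the few labeled examples.

    # Illustrative few-shot ASR probe: frozen SSL encoder + linear CTC head.
    # Placeholder interfaces only; not the official challenge evaluation code.
    import torch
    import torch.nn as nn

    VOCAB = ["<blank>"] + list("abcdefghijklmnopqrstuvwxyz '")
    char2id = {c: i for i, c in enumerate(VOCAB)}

    class LinearCTCProbe(nn.Module):
        def __init__(self, feat_dim: int, vocab_size: int = len(VOCAB)):
            super().__init__()
            self.head = nn.Linear(feat_dim, vocab_size)

        def forward(self, feats):                        # feats: (batch, frames, feat_dim)
            return self.head(feats).log_softmax(dim=-1)

    def fit_probe(encoder, few_shot_pairs, feat_dim=768, steps=200):
        """Fit only the linear head on a handful of (waveform, text) pairs."""
        probe = LinearCTCProbe(feat_dim)
        opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
        ctc = nn.CTCLoss(blank=0, zero_infinity=True)
        for _ in range(steps):
            for wav, text in few_shot_pairs:
                with torch.no_grad():
                    feats = encoder(wav.unsqueeze(0))        # (1, frames, feat_dim), assumed API
                log_probs = probe(feats).transpose(0, 1)     # (frames, 1, vocab) as CTCLoss expects
                targets = torch.tensor([[char2id[c] for c in text]])
                in_len = torch.tensor([log_probs.size(0)])
                tgt_len = torch.tensor([targets.size(1)])
                loss = ctc(log_probs, targets, in_len, tgt_len)
                opt.zero_grad()
                loss.backward()
                opt.step()
        return probe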

Zero-Shot Language ID

Assess multilingual language identification without any language-specific training, measuring cross-lingual generalization.
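
One illustrative realization, assuming mean-pooled utterance embeddings from a frozen encoder and per-language prototype vectors built from a few reference clips that serve only as anchors; the official protocol, language set, and splits are defined by the organizers.

    # Illustrative zero-shot language ID: cosine similarity to language prototypes.
    # Encoder interface and data handling are assumptions, not the official pipeline.
    import torch
    import torch.nn.functional as F

    def utterance_embedding(encoder, wav):
        """Mean-pool frame-level features into one normalized utterance vector."""
        with torch.no_grad():
            feats = encoder(wav.unsqueeze(0))            # (1, frames, dim), assumed API
        return F.normalize(feats.mean(dim=1).squeeze(0), dim=0)

    def language_prototypes(encoder, reference_clips):
        """reference_clips: dict mapping language code -> list of waveforms."""
        protos = {}
        for lang, clips in reference_clips.items():
            embs = torch.stack([utterance_embedding(encoder, w) for w in clips])
            protos[lang] = F.normalize(embs.mean(dim=0), dim=0)
        return protos

    def predict_language(encoder, wav, protos):
        emb = utterance_embedding(encoder, wav)
        return max(protos, key=lambda lang: torch.dot(emb, protos[lang]).item())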

Speaker Clustering

Evaluate speaker representation quality through clustering metrics, testing the model's ability to distinguish individual speakers.
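
A brief sketch of how clustering quality can be scored, assuming pooled utterance embeddings and reference speaker labels; the metrics shown (adjusted Rand index and normalized mutual information) are illustrative choices rather than the confirmed official ones.

    # Illustrative speaker-clustering scores on pooled utterance embeddings.
    import numpy as np
    from sklearn.cluster import AgglomerativeClustering
    from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score
    from sklearn.preprocessing import normalize

    def speaker_clustering_scores(embeddings: np.ndarray, speaker_labels, n_speakers: int):
        """embeddings: (n_utterances, dim) array of pooled utterance vectors."""
        embeddings = normalize(embeddings)               # length-normalize before clustering
        predicted = AgglomerativeClustering(n_clusters=n_speakers).fit_predict(embeddings)
        return {
            "ARI": adjusted_rand_score(speaker_labels, predicted),
            "NMI": normalized_mutual_info_score(speaker_labels, predicted),
        }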

Additional Considerations

Cross-accent robustness and computational efficiency will also be assessed to ensure fairness and reproducibility. A leaderboard will rank submissions based on overall performance across these criteria.

Dataset & Baselines

Unsupervised People's Speech (UPS)

The UPS dataset is publicly available via Hugging Face, enabling immediate and open access for participants. Challenge participants will be provided with:

  • Official dataloaders designed for efficient streaming and large-scale training (a minimal loading sketch follows this list)
  • Precomputed language identification indices generated using Whisper large-v3, offering lightweight language guidance without explicit supervised labels
  • Designated unlabeled training subsets with held-out multilingual labeled evaluation sets
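
A minimal loading sketch, assuming the data is streamed with the Hugging Face datasets library; the repository id, split, and field names below are placeholders, so prefer the identifiers shipped with the official dataloaders.

    # Sketch: stream UPS audio from the Hugging Face Hub without a full download.
    # Repository id, split, and field names are assumptions.
    from datasets import load_dataset

    ups = load_dataset(
        "MLCommons/unsupervised_peoples_speech",   # placeholder repo id
        split="train",
        streaming=True,                            # iterate without downloading everything
    )

    for example in ups.take(5):
        audio = example["audio"]                   # assumed Audio feature column
        print(audio["sampling_rate"], len(audio["array"]))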

Baseline Systems

The following baseline systems will be provided to establish performance references:

  • wav2vec 2.0 (Coming Soon)
  • HuBERT (Coming Soon)
  • XLSR (Coming Soon)
  • Whisper Encoder (Coming Soon)

Baselines will be released on December 12, 2025.
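
As a rough illustration of how models from the baseline families above are typically used, the sketch below extracts frame-level representations from a public wav2vec 2.0 checkpoint with the transformers library; the checkpoint named here is a stand-in, not the official challenge baseline.

    # Extract frame-level features with a public wav2vec 2.0 checkpoint (illustrative).
    import torch
    from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

    ckpt = "facebook/wav2vec2-base"                   # stand-in public model
    feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(ckpt)
    model = Wav2Vec2Model.from_pretrained(ckpt).eval()

    waveform = torch.zeros(16000)                     # 1 s of dummy 16 kHz audio
    inputs = feature_extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state    # (1, frames, 768)
    print(hidden.shape)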

Rules for Participation

Training Data

Participants must train exclusively on the Unsupervised People's Speech dataset, without the use of transcripts or supervised external data.

Compute Requirements

To ensure reproducibility, all models must be able to run inference on at most a single A100 GPU.
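
A simple local sanity check, assuming a PyTorch model, is to track peak GPU memory during a forward pass and compare it against the A100's capacity; the official evaluation environment may measure compliance differently.

    # Sketch: verify that a forward pass fits within a single GPU's memory budget.
    import torch

    def peak_inference_memory_gb(model, example_batch, device="cuda:0"):
        model = model.to(device).eval()
        torch.cuda.reset_peak_memory_stats(device)
        with torch.no_grad():
            model(example_batch.to(device))
        return torch.cuda.max_memory_allocated(device) / 1024 ** 3

    # e.g. assert peak_inference_memory_gb(my_model, my_batch) < 40  # 40 GB A100 variant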

Submission Platform

All submissions will be made through the Dynabench platform and evaluated under controlled conditions.

Transparency

Full disclosure of training configuration, compute usage, and data preprocessing or filtering steps is required.

Competition Tracks

Main Track

Train on the full UPS dataset without additional filtering or curation.

Open Filtering Sub-Track

Participants may apply custom data filtering or selection strategies on the UPS dataset.

Important Dates

Dec 12, 2025

Release & Baselines

Official dataset release, dataloaders, and baseline models available

Feb 25, 2026

Paper Submission Deadline

Submit your challenge paper describing methods and results

Mar 4, 2026

Paper Update Deadline

Final updates to submitted papers

Apr 24 – May 1, 2026

Rebuttal Period

Respond to reviewer feedback

Jun 5, 2026

Paper Acceptance Notification

Authors notified of acceptance decisions

Jun 19, 2026

Camera-Ready Submission

Final paper submission deadline

Sep 27 – Oct 1, 2026

Interspeech 2026

Conference and challenge presentations (Tutorial Day: Sep 27)

How to Participate

1. Download the Data

Access the UPS dataset from Hugging Face and set up your training environment using the provided dataloaders.

2. Train Your Model

Develop your self-supervised speech representation model using only the UPS training data without transcripts.
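
For orientation only, the toy sketch below shows the shape of a label-free training step (masked-frame reconstruction on mel features); real systems such as wav2vec 2.0 or HuBERT use contrastive or clustered targets and far larger encoders.

    # Toy self-supervised step: mask random frames and reconstruct them.
    import torch
    import torch.nn as nn

    class TinyEncoder(nn.Module):
        def __init__(self, n_mels=80, dim=256):
            super().__init__()
            self.proj = nn.Linear(n_mels, dim)
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=4)
            self.predict = nn.Linear(dim, n_mels)

        def forward(self, mels, mask):                 # mels: (batch, frames, n_mels)
            x = self.proj(mels)
            x = torch.where(mask.unsqueeze(-1), torch.zeros_like(x), x)  # hide masked frames
            return self.predict(self.encoder(x))

    def training_step(model, mels, optimizer, mask_prob=0.15):
        mask = torch.rand(mels.shape[:2]) < mask_prob  # boolean (batch, frames)
        recon = model(mels, mask)
        loss = ((recon - mels) ** 2)[mask].mean()      # reconstruct only masked frames
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()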

3. Submit via Dynabench

Upload your trained model to Dynabench for automated evaluation on the held-out test sets.

4. Write Your Paper

Document your approach, experiments, and results following the Interspeech paper format.

Submissions open December 12, 2025


Ethics & Conduct

All participants must adhere to the ISCA Code of Conduct, ensuring professional behavior, respect for community standards, and the responsible advancement of speech research.

Data Ethics

The challenge data is drawn from legally shareable subsets of the UPS dataset, with appropriate ethics documentation and approvals in place.

Privacy Protection

Participants are strictly prohibited from attempting to identify speakers or infer sensitive personal attributes.

Responsible AI

All submissions must reflect responsible AI development practices, promoting fairness, transparency, and equitable multilingual speech technology.

Organizers

Rafael Mosquera-Gómez

MLCommons / Factored AI

Machine Learning Engineer leading projects at the intersection of AI systems and real-world impact. Core contributor to the MLCommons Datasets Working Group. Co-author of The PRISM Alignment Dataset, a NeurIPS 2024 award-winning paper.

rafael.mosquera@mlcommons.org

Dr. Sarah Luger

MLCommons / iMerit

Expert in AI and NLP with over two decades of experience. Leads the Generative AI Research group at iMerit and co-chairs the MLCommons Datasets Working Group. PhD from University of Edinburgh, former IBM Watson contributor.

sarahluger@gmail.com

Juan Felipe Rodríguez

Factored AI

Machine Learning Engineer specializing in AI systems for education and talent mobility. Background in computational fluid dynamics and robotics, building large-scale applications powered by modern language models.

juan.rodriguez@factored.ai

Daniel Galvez

NVIDIA

AI developer technology engineer working on accelerated speech recognition inference pipelines and toolkits including NeMo, ESPnet, and Kaldi. Previously at LinkedIn and Cornell University.

dt.galvez@nvidia.org

Sheriff Issaka

UCLA / All Lab

PhD student researching bias, fairness, and low-resource language technologies at UCLA MARS Lab. Founder of the African Languages Lab (All Lab), advancing linguistic equity through open-source research.

Chiara Bonfanti

Politecnico di Torino

PhD researcher in NLP and Legal Informatics. Research focuses on semantic analysis and knowledge extraction from complex legal texts, bridging law and cybersecurity domains.

Chris Emezue

Mila / Lanfrica

Researcher at Mila focused on NLP, causality, and reinforcement learning. Founder of Lanfrica Labs, a non-profit mapping AI in Africa to accelerate innovation and enable understanding of the continent's AI landscape.