Interspeech 2026 · Audio Samples

Low-Burden Data Augmentation for Dysarthric ASR via Zero-Shot Voice Cloning

Audio samples accompanying the submission, which presents a lightweight approach that requires only a single short reference utterance per speaker to generate diverse synthetic speech for fine-tuning dysarthric ASR systems.

Project Overview

Collecting dysarthric speech is labour-intensive and expensive. We investigate zero-shot voice cloning as a scalable, low-burden augmentation strategy that removes the need for additional speaker recording sessions.

Dysarthric ASR systems are severely limited by data scarcity and high inter-speaker variability. Traditional synthetic augmentation approaches require substantial speaker-specific recordings to build a voice model, reintroducing the very collection burden they aim to avoid. We address this by using zero-shot voice cloning (Higgs Audio V2) to synthesise diverse training speech from a single ~7-second reference utterance per speaker drawn from the TORGO dataset, with no additional recording sessions required.

We fine-tune Whisper-medium on three data configurations (clone-only, real-only, and a clone+real hybrid) and evaluate on held-out real dysarthric speech across all 8 TORGO speakers. Results show that clone speech alone provides a significant adaptation signal, with the hybrid condition delivering the strongest gains for the most challenging speaker groups.
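
For readers unfamiliar with the metric reported below, word error rate (WER) is the word-level edit distance between the reference transcript and the ASR hypothesis, divided by the number of reference words. The following is a minimal sketch of that definition, not the evaluation code used for the submission; the function name `word_error_rate` and the lower-casing and whitespace tokenisation are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    # Assumed normalisation: lower-case and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One of the prompts from this page, with two hypothetical recognition errors.
print(word_error_rate("I parked on level one", "I park on level won"))
```

Real evaluation pipelines usually apply heavier text normalisation (punctuation, numerals, contractions) before scoring, so values from this sketch are not directly comparable to the figures reported here.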

Key finding: Clone-only fine-tuning (15 h synthetic) reduces overall WER from 31.62 % to 26.00 % (17.8 % relative reduction, p = 0.010). The hybrid achieves the best moderate-severe group WER of 37.49 %, a 31.4 % relative reduction over zero-shot, from just a single reference utterance per speaker.

Reference input: 1 utterance (~7 s per speaker)
Synthetic data generated: 15 h (Higgs Audio V2 · 8 speakers)
Clone-only WER: 26.00 % (↓ from 31.62 % zero-shot)
Best mod-severe WER: 37.49 % (Hybrid · 31.4 % rel. reduction)
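
The relative reductions quoted above follow the usual formula, 100 × (baseline − new) / baseline. As a quick check of the overall clone-only figure, here is a small sketch; the helper name `relative_reduction` is hypothetical, and only the two overall WER values stated on this page are used.

```python
def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction in percent: 100 * (baseline - new) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Overall WER: zero-shot 31.62 % vs. clone-only fine-tuning 26.00 %.
overall = relative_reduction(31.62, 26.00)
print(f"{overall:.1f} %")  # prints "17.8 %"
```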

Mild Dysarthria

Speakers with mild dysarthria retain relatively clear articulation with minor prosodic irregularities. Cloning fidelity is generally high, producing naturalistic voice characteristics with subtle dysarthric traits preserved.

F04

Mild
Reference (Input)

"The quick brown fox jumps over the lazy dog."

Cloned Output

"The princess was the first to speak."

M03

Mild
Reference (Input)

"The quick brown fox jumps over the lazy dog."

Cloned Output

"Play a Beatles song on Amazon music."

F03

Mild
Reference (Input)

"The quick brown fox jumps over the lazy dog."

Cloned Output

"This morning he was feeling very good-natured."

Moderate Dysarthria

Speech in this group shows consistent patterns of dysarthria but better intelligibility than in severe cases. The cloning model captures distinct prosodic irregularities while maintaining overall speaker identity. Note that all fine-tuning conditions degrade WER for this speaker (M05) relative to zero-shot, suggesting a distribution mismatch with the training data.

M05

Moderate
Reference (Input)

"The quick brown fox jumps over the lazy dog."

Cloned Output

"Brown and Day had asked him to call again."

Moderate-Severe Dysarthria

Prosody becomes significantly laboured, with noticeable breathiness, pauses, and articulation errors. Clone fine-tuning is most effective here: the hybrid achieves a 31.4 % relative WER reduction over zero-shot for this group, outperforming real-data-only fine-tuning.

F01

Mod-Severe
Reference (Input)

"The quick brown fox jumps over the lazy dog."

Cloned Output

"Your son told me you were ill and I came right over."

M01

Mod-Severe
Reference (Input)

"The quick brown fox jumps over the lazy dog."

Cloned Output

"What is on my calendar tomorrow?"

M02

Mod-Severe
Reference (Input)

"The quick brown fox jumps over the lazy dog."

Cloned Output

"I parked on level one."

Severe Dysarthria

High variability, significant pauses, slurring, and unstable phonation make this the most challenging group to clone. Despite reduced fidelity, real-data fine-tuning achieves a 26.9 % relative WER reduction for this speaker (82.32 % → 60.22 %).

M04

Severe
Reference (Input)

"The quick brown fox jumps over the lazy dog."

Cloned Output

"I will explain to his lordship."