"The quick brown fox jumps over the lazy dog."
"The princess was the first to speak."
Audio samples accompanying the submission. The submission presents a lightweight approach that requires only a single short reference utterance per speaker to generate diverse synthetic speech for fine-tuning dysarthric ASR systems.
Collecting dysarthric speech is labour-intensive and expensive. We investigate zero-shot voice cloning as a scalable, low-burden augmentation strategy that removes the speaker recording bottleneck entirely.
Dysarthric ASR systems are severely limited by data scarcity and high inter-speaker variability. Traditional synthetic augmentation approaches require substantial speaker-specific recordings to build a voice model, reintroducing the very collection burden they aim to avoid. We address this by using zero-shot voice cloning (Higgs Audio V2) to synthesise diverse training speech from a single ~7-second reference utterance per speaker drawn from the TORGO dataset, with no additional recording sessions required.
We fine-tune Whisper-medium on three data configurations (clone-only, real-only, and a hybrid of clone and real data) and evaluate on held-out real dysarthric speech across all 8 TORGO speakers. Results show that clone speech alone provides a significant adaptation signal, with the hybrid condition delivering the strongest gains for the most challenging speaker groups.
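As a rough illustration of the setup described above, the sketch below assembles the three training configurations and scores a transcript with a word error rate (WER). It is a minimal sketch under stated assumptions, not the pipeline used in the submission: the `(audio_path, transcript)` tuple layout, the function names, and the file names are hypothetical, and the WER here is a plain per-utterance word-level edit distance with lower-casing as the only text normalisation.

```python
from typing import Dict, List, Tuple

# (audio_path, transcript); this layout and any paths are hypothetical.
Utterance = Tuple[str, str]


def build_configs(clone: List[Utterance],
                  real: List[Utterance]) -> Dict[str, List[Utterance]]:
    """Return the three training sets named in the text."""
    return {
        "clone-only": list(clone),
        "real-only": list(real),
        "hybrid": list(clone) + list(real),
    }


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


configs = build_configs(
    clone=[("clone_0001.wav", "the princess was the first to speak")],
    real=[("real_0001.wav", "i parked on level one")],
)
score = wer("the princess was the first to speak",
            "the princess was first to speak")
```

In practice, corpus-level WER is usually computed by summing errors and reference words over all utterances rather than averaging per-utterance scores, and transcripts are typically normalised (punctuation, casing) before scoring.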
1 utterance
15h
26.00 %
37.49 %
Speakers with mild dysarthria retain relatively clear articulation with minor prosodic irregularities. Cloning fidelity is generally high, producing naturalistic voice characteristics with subtle preserved dysarthric traits.
"The quick brown fox jumps over the lazy dog."
"The princess was the first to speak."
"The quick brown fox jumps over the lazy dog."
"Play a Beatles song on Amazon music."
"The quick brown fox jumps over the lazy dog."
"This morning he was feeling very good-natured."
Consistent patterns of dysarthria with better intelligibility than severe cases. The cloning model captures distinct prosodic irregularities while maintaining overall speaker identity. Note that all fine-tuning conditions degrade WER for this speaker relative to zero-shot, indicating a distribution mismatch with the training data.
"The quick brown fox jumps over the lazy dog."
"Brown and Day had asked him to call again."
Prosody becomes significantly laboured with noticeable breathiness, pauses, and articulation errors. Clone fine-tuning is most potent here: the hybrid achieves a 31.4 % relative WER reduction over zero-shot for this group, outperforming real-data-only fine-tuning.
"The quick brown fox jumps over the lazy dog."
"Your son told me you were ill and I came right over."
"The quick brown fox jumps over the lazy dog."
"What is on my calendar tomorrow?"
"The quick brown fox jumps over the lazy dog."
"I parked on level one."
High variability, significant pauses, slurring, and unstable phonation make this the most challenging group to clone. Despite reduced cloning fidelity, real-data fine-tuning achieves a 26.9 % relative WER reduction for this speaker (82.32 % → 60.22 %).
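The relative WER reductions quoted on this page follow the usual definition, (baseline − new) / baseline. The short sketch below applies it to the two WER values given above; the function name is hypothetical. With these rounded endpoints the result is about 26.8 %, close to the reported 26.9 %, which was presumably computed from unrounded WERs.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Percentage reduction of new_wer relative to baseline_wer.

    A negative result means the WER got worse than the baseline.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Zero-shot WER 82.32 %, fine-tuned WER 60.22 % (values from the text).
reduction = relative_wer_reduction(82.32, 60.22)
print(f"{reduction:.2f} % relative reduction")
```

The same function returns a negative value when fine-tuning degrades WER relative to zero-shot.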
"The quick brown fox jumps over the lazy dog."
"I will explain to his lordship."