Low-Burden Data Augmentation for Dysarthric ASR

Project Overview

Collecting dysarthric speech is labour-intensive and expensive. We investigate zero-shot voice cloning as a scalable, low-burden augmentation strategy that removes the speaker recording bottleneck entirely.

Dysarthric ASR systems are severely limited by data scarcity and high inter-speaker variability. Traditional synthetic augmentation approaches require substantial speaker-specific recordings to build a voice model, reintroducing the very collection burden they aim to avoid. We address this by using zero-shot voice cloning (Higgs Audio V2) to synthesise diverse training speech from a single ~7-second reference utterance per speaker drawn from the TORGO dataset, with no additional recording sessions required.

We fine-tune Whisper-medium on three data configurations: clone-only, real-only, and hybrid (clone + real). Each model is evaluated on held-out real dysarthric speech from all 8 TORGO speakers. Results show that clone speech alone provides a significant adaptation signal, with the hybrid condition delivering the strongest gains for the most challenging speaker groups.
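
For readers unfamiliar with the metric, word error rate (WER) can be sketched as a word-level edit distance divided by the number of reference words. The snippet below is a minimal illustration only. The text normalisation (lowercasing and whitespace splitting) and the function name are assumptions, and the scoring tool actually used in the project is not stated on this page.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    # Assumed normalisation: lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of five reference words.
print(word_error_rate("I parked on level one", "I parked on level"))  # 0.2
```

Reported WER figures are typically this ratio expressed as a percentage and accumulated over all test utterances rather than averaged per utterance.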

Key finding: Clone-only fine-tuning (15 h synthetic) reduces overall WER from 31.62 % to 26.00 % (17.8 % relative reduction, p = 0.010). The hybrid achieves the best moderate-severe group WER of 37.49 %, a 31.4 % relative reduction over zero-shot, from just a single reference utterance per speaker.
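
The relative reductions quoted here follow from the absolute WER values. As a quick check of the arithmetic, using a hypothetical helper name:

```python
def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction, in percent, of new_wer against baseline_wer."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Overall WER: zero-shot 31.62 % versus clone-only 26.00 %.
overall = relative_reduction(31.62, 26.00)
print(f"{overall:.1f} %")  # 17.8 %
```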
          

Reference input: 1 utterance (~7 s per speaker)

Synthetic data generated: 15 h (Higgs Audio V2, 8 speakers)

Clone-only WER: 26.00 % (down from 31.62 % zero-shot)

Best mod-severe WER: 37.49 % (hybrid, 31.4 % relative reduction)

Mild Dysarthria

Speakers with mild dysarthria retain relatively clear articulation with minor prosodic irregularities. Cloning fidelity is generally high, producing naturalistic voice characteristics with subtle preserved dysarthric traits.

F04

Mild

Reference (Input)

"The quick brown fox jumps over the lazy dog."

Cloned Output

"The princess was the first to speak."

M03

Mild

Reference (Input)

"The quick brown fox jumps over the lazy dog."

Cloned Output

"Play a Beatles song on Amazon music."

F03

Mild

Reference (Input)

"The quick brown fox jumps over the lazy dog."

Cloned Output

"This morning he was feeling very good-natured."

Moderate Dysarthria

Moderate dysarthria shows consistent dysarthric patterns with better intelligibility than severe cases. The cloning model captures distinct prosodic irregularities while maintaining overall speaker identity. Note that all fine-tuning conditions degrade WER for this speaker (M05) relative to zero-shot, which suggests a distribution mismatch with the training data.

M05

Moderate

Reference (Input)

"The quick brown fox jumps over the lazy dog."

Cloned Output

"Brown and Day had asked him to call again."

Moderate-Severe Dysarthria

Prosody becomes significantly laboured, with noticeable breathiness, pauses, and articulation errors. Clone fine-tuning is most effective here: the hybrid achieves a 31.4 % relative WER reduction over zero-shot for this group, outperforming real-data-only fine-tuning.
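
Group-level figures such as the one above need per-speaker results to be aggregated by severity. The sketch below shows one plausible way to do this, using the speaker-to-severity grouping listed on this page. Pooling errors and reference words across the speakers in a group is an assumption (averaging per-speaker WERs is another common choice), and the counts in the example are made up purely to exercise the function.

```python
from collections import defaultdict

# Speaker-to-severity grouping as listed on this page.
SEVERITY = {
    "F04": "mild", "M03": "mild", "F03": "mild",
    "M05": "moderate",
    "F01": "mod-severe", "M01": "mod-severe", "M02": "mod-severe",
    "M04": "severe",
}


def group_wer(per_speaker):
    """Pool word errors and reference words within each severity group.

    per_speaker maps speaker ID -> (word_errors, reference_words).
    Returns a dict mapping severity group -> WER in percent.
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for speaker, (n_errors, n_words) in per_speaker.items():
        group = SEVERITY[speaker]
        errors[group] += n_errors
        words[group] += n_words
    return {group: 100.0 * errors[group] / words[group] for group in words}


# Hypothetical counts for illustration only; not results from the project.
toy_counts = {"F01": (30, 100), "M01": (50, 100), "M02": (40, 200), "M04": (60, 100)}
print(group_wer(toy_counts))  # {'mod-severe': 30.0, 'severe': 60.0}
```

Pooling weights each speaker by the amount of test speech they contribute, so a speaker with more reference words has more influence on the group figure.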

F01

Mod-Severe

Reference (Input)

"The quick brown fox jumps over the lazy dog."

Cloned Output

"Your son told me you were ill and I came right over."

M01

Mod-Severe

Reference (Input)

"The quick brown fox jumps over the lazy dog."

Cloned Output

"What is on my calendar tomorrow?"

M02

Mod-Severe

Reference (Input)

"The quick brown fox jumps over the lazy dog."

Cloned Output

"I parked on level one."

Severe Dysarthria

High variability, significant pauses, slurring, and unstable phonation make this the most challenging group to clone. Although cloning fidelity is reduced, real-data fine-tuning still achieves a 26.9 % relative WER reduction for this speaker, M04 (82.32 % → 60.22 %).

M04

Severe

Reference (Input)

"The quick brown fox jumps over the lazy dog."

Cloned Output

"I will explain to his lordship."
