A Joint Framework of Denoising Autoencoder and Generative Vocoder for Monaural Speech Enhancement
Part I. Enhancement Performance
To demonstrate the enhancement performance of the different methods in terms of speech intelligibility and perceptual quality, Table I provides two samples enhanced by IRM, CRM, TimeMapping, SEGAN, DenoisingWavenet, and our methods (Mel+Flowavenet and Mel+Flowavenet+JT).
The two noisy utterances are obtained by mixing clean speech with Babble and Factory noise at 0 dB SNR, respectively. The corresponding clean utterances are provided in the second row.
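The mixing procedure can be sketched as follows. This is a minimal NumPy illustration of scaling a noise signal so that the clean-to-noise power ratio reaches a target SNR before addition; the helper name `mix_at_snr` is ours for illustration and is not part of the authors' pipeline.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Add `noise` to `clean` after scaling it to the target SNR (in dB)."""
    # Tile and trim the noise so it covers the clean signal's length.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[:len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Gain that brings the noise power to clean_power / 10^(snr_db / 10).
    gain = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + gain * noise
```

With `snr_db=0`, the clean and scaled-noise powers are equal, matching the 0 dB mixing condition used for the samples above.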
Table I. Samples enhanced by different methods.
Methods
Babble, 0 dB
Factory, 0 dB
Noisy
Clean
IRM
CRM
TimeMapping
SEGAN
DenoisingWavenet
Mel+Flowavenet
Mel+Flowavenet+JT
Part II. Flowavenet Trained on the NSDTSEA Corpus
We also trained Flowavenet on the NSDTSEA corpus, whose training and test sets contain 28 and 2 speakers, respectively, with roughly 0.3 hours of speech per speaker. Compared with the LJ Speech
dataset (24 hours recorded by a single speaker), NSDTSEA is very small, and training Flowavenet on
such a small dataset raises the problem of speaker generalization. Several synthesized samples for trained and
untrained speakers are given in Tables II and III, respectively.
Table II. Samples synthesized by Flowavenet for trained speakers on
the NSDTSEA dataset (Avg. STOI (%): 89.95, Avg. PESQ: 2.92).
Utterance ID
Clean
Generated
p239_068
p250_107
p256_030
p274_135
Table III. Samples synthesized by Flowavenet for untrained speakers on
the NSDTSEA dataset (Avg. STOI (%): 86.18, Avg. PESQ: 1.91).