A Joint Framework of Denoising Autoencoder and Generative Vocoder for Monaural Speech Enhancement
Part I. Enhancement Performance
To demonstrate the enhancement performance of the different methods in terms of speech intelligibility and perceptual quality, Table I provides two samples enhanced by IRM, CRM, TimeMapping, SEGAN, DenoisingWavenet, and our methods (Mel+Flowavenet and Mel+Flowavenet+JT).
The two noisy utterances are obtained by mixing clean speech with Babble and Factory noise at 0 dB SNR, respectively. The corresponding clean utterances are provided in the second row.
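The mixing procedure can be sketched as follows. This is a minimal NumPy illustration of scaling a noise signal so that the clean-to-noise power ratio reaches a target SNR before addition; the helper name `mix_at_snr` is ours for illustration and is not part of the authors' pipeline.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Add `noise` to `clean` after scaling it to the target SNR (in dB)."""
    # Tile and trim the noise so it covers the clean signal's length.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[:len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Gain that brings the noise power to clean_power / 10^(snr_db / 10).
    gain = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + gain * noise
```

With `snr_db=0`, the clean and scaled-noise powers are equal, matching the 0 dB mixing condition used for the samples above.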
Table I. Samples enhanced by different methods.
Methods
Babble, 0 dB
Factory, 0 dB
Noisy
Clean
IRM
CRM
TimeMapping
SEGAN
DenoisingWavenet
Mel+Flowavenet
Mel+Flowavenet+JT
Part II. Flowavenet Trained on the NSDTSEA Corpus
We also trained Flowavenet on the NSDTSEA corpus, whose training and test sets contain 28 and 2 speakers, respectively, with roughly 0.3 hours of speech per speaker. Compared with the LJ Speech
dataset (24 hours recorded by a single speaker), NSDTSEA is very small, and training Flowavenet on
such a small dataset raises the problem of speaker generalization. Several synthesized samples for trained and
untrained speakers are given in Tables II and III, respectively.
Table II. Samples synthesized by Flowavenet for trained speakers on
the NSDTSEA dataset (Avg. STOI (%): 89.95, Avg. PESQ: 2.92).
Utterance ID
Clean
Generated
p239_068
p250_107
p256_030
p274_135
Table III. Samples synthesized by Flowavenet for untrained speakers on
the NSDTSEA dataset (Avg. STOI (%): 86.18, Avg. PESQ: 1.91).