Text-to-speech system for low-resource language using cross-lingual transfer learning and data augmentation

Abstract: Deep learning techniques are currently being applied in automated text-to-speech (TTS) systems, resulting in significant improvements in performance. However, these methods require large amounts of text-speech paired data for model training, and collecting this data is costly. Therefore, in this paper, we propose a single-speaker TTS system containing both a spectrogram prediction network and a neural vocoder for the target language, using only 30 min of target language text-speech paired data for training. We evaluated three approaches for training the spectrogram prediction models of our TTS system, which produce mel-spectrograms from the input phoneme sequence: (1) cross-lingual transfer learning, (2) data augmentation, and (3) a combination of the previous two methods. In the cross-lingual transfer learning method, we used two high-resource language datasets, English (24 h) and Japanese (10 h). We also used 30 min of target language data for training in all three approaches, and for generating the augmented data used for training in methods 2 and 3. We found that using both cross-lingual transfer learning and augmented data during training resulted in the most natural synthesized target speech output. We also compared single-speaker and multi-speaker training methods, using sequential and simultaneous training, respectively. The multi-speaker models were found to be more effective for constructing a single-speaker, low-resource TTS model. In addition, we trained two Parallel WaveGAN (PWG) neural vocoders, one using 13 h of our augmented data with 30 min of target language data and one using the entire 12 h of the original target language dataset. Our subjective AB preference test indicated that the neural vocoder trained with augmented data achieved almost the same perceived speech quality as the vocoder trained with the entire target language dataset. Overall, we found that our proposed TTS system consisting of a spectrogram prediction network and a PWG neural vocoder was able to achieve reasonable performance using only 30 min of target language training data. We also found that by using 3 h of target language data for both training the model and generating augmented data, our proposed TTS model was able to achieve performance very similar to that of the baseline model, which was trained with 12 h of target language data.
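To make the training recipe described in the abstract concrete, the following is a minimal, hypothetical PyTorch sketch of the cross-lingual transfer learning idea: pretrain a spectrogram prediction model on a high-resource language, then fine-tune the same weights on a small amount of target-language data. Everything here (the toy TinySpectrogramPredictor, the tensor shapes, the hyperparameters, and the fake batches) is an illustrative assumption, not the authors' code; the paper itself uses a Tacotron-style spectrogram prediction network and a Parallel WaveGAN vocoder.

# Minimal sketch of pretrain-then-fine-tune for a spectrogram predictor.
# All names, shapes, and hyperparameters are assumptions for illustration.
import torch
import torch.nn as nn

N_PHONEMES = 64   # assumed size of a shared (union) phoneme inventory
N_MELS = 80       # mel-spectrogram channels, a common TTS setting

class TinySpectrogramPredictor(nn.Module):
    """Toy stand-in for a Tacotron-style spectrogram prediction network.

    Real systems use attention or duration modeling to align phonemes to
    mel frames; this toy emits one frame per phoneme purely to keep the
    example short and runnable.
    """
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(N_PHONEMES, 128)  # phoneme encoder input
        self.rnn = nn.GRU(128, 256, batch_first=True)
        self.to_mel = nn.Linear(256, N_MELS)        # frame-level mel output

    def forward(self, phonemes):
        h, _ = self.rnn(self.embed(phonemes))
        return self.to_mel(h)  # (batch, time, n_mels)

def train(model, batches, epochs, lr):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.L1Loss()  # L1 on mel frames is a common TTS training loss
    for _ in range(epochs):
        for phonemes, mels in batches:
            opt.zero_grad()
            loss_fn(model(phonemes), mels).backward()
            opt.step()

# Fake (phoneme ids, mel target) pairs; real training would use e.g. 24 h
# of English for pretraining and 30 min of the target language afterwards.
def fake_batches(n):
    return [(torch.randint(0, N_PHONEMES, (8, 50)),
             torch.randn(8, 50, N_MELS)) for _ in range(n)]

model = TinySpectrogramPredictor()
train(model, fake_batches(100), epochs=5, lr=1e-3)  # high-resource pretraining
train(model, fake_batches(5), epochs=20, lr=1e-4)   # low-resource fine-tuning
torch.save(model.state_dict(), "target_tts.pt")

In the paper's actual pipeline, the fine-tuned spectrogram predictor is paired with a separately trained Parallel WaveGAN vocoder that converts the predicted mel-spectrograms into a waveform; whether to reuse or re-initialize the phoneme embedding during fine-tuning depends on how much the phoneme inventories of the source and target languages overlap.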


Saved in: DOAJ
Bibliographic Details
Main authors: Zolzaya Byambadorj, Ryota Nishimura, Altangerel Ayush, Kengo Ohta, Norihide Kitaoka
Format: article
Language: EN
Published: SpringerOpen, 2021
Subjects: Speech synthesis, Text to speech, Transfer learning, Data augmentation, Low-resource language
Online access: https://doaj.org/article/84f582599e30418fa5395ab20605d377
id oai:doaj.org-article:84f582599e30418fa5395ab20605d377
record_format dspace
doi 10.1186/s13636-021-00225-4
issn 1687-4722
fulltext https://doi.org/10.1186/s13636-021-00225-4
journal_toc https://doaj.org/toc/1687-4722
citation EURASIP Journal on Audio, Speech, and Music Processing, Vol 2021, Iss 1, Pp 1-20 (2021)
institution DOAJ
collection DOAJ
language EN
topic Speech synthesis
Text to speech
Transfer learning
Data augmentation
Low-resource language
Acoustics. Sound
QC221-246
Electronic computers. Computer science
QA75.5-76.95
format article
author Zolzaya Byambadorj
Ryota Nishimura
Altangerel Ayush
Kengo Ohta
Norihide Kitaoka
author_sort Zolzaya Byambadorj
title Text-to-speech system for low-resource language using cross-lingual transfer learning and data augmentation
publisher SpringerOpen
publishDate 2021-12-01
url https://doaj.org/article/84f582599e30418fa5395ab20605d377