Sequence-to-Sequence Acoustic Modeling with Semi-Stepwise Monotonic Attention for Speech Synthesis
An encoder–decoder with attention has become a popular method to achieve sequence-to-sequence (Seq2Seq) acoustic modeling for speech synthesis. To improve the robustness of the attention mechanism, methods utilizing the monotonic alignment between phone sequences and acoustic feature sequences have...
Main authors: | Xiao Zhou, Zhenhua Ling, Yajun Hu, Lirong Dai |
---|---|
Format: | article |
Language: | EN |
Published: | MDPI AG, 2021 |
Subjects: | speech synthesis; sequence-to-sequence; attention; phrase boundary |
Online access: | https://doaj.org/article/99f8bad95ab6464297ca71e3f556b075 |
id |
oai:doaj.org-article:99f8bad95ab6464297ca71e3f556b075 |
---|---|
record_format |
dspace |
spelling |
oai:doaj.org-article:99f8bad95ab6464297ca71e3f556b075; 2021-11-11T15:25:02Z; Sequence-to-Sequence Acoustic Modeling with Semi-Stepwise Monotonic Attention for Speech Synthesis; 10.3390/app112110475; 2076-3417; https://doaj.org/article/99f8bad95ab6464297ca71e3f556b075; 2021-11-01T00:00:00Z; https://www.mdpi.com/2076-3417/11/21/10475; https://doaj.org/toc/2076-3417; An encoder–decoder with attention has become a popular method to achieve sequence-to-sequence (Seq2Seq) acoustic modeling for speech synthesis. To improve the robustness of the attention mechanism, methods utilizing the monotonic alignment between phone sequences and acoustic feature sequences have been proposed, such as stepwise monotonic attention (SMA). However, the phone sequences derived by grapheme-to-phoneme (G2P) conversion may not contain the pauses at the phrase boundaries in utterances, which challenges the assumption of strictly stepwise alignment in SMA. Therefore, this paper proposes to insert hidden states into phone sequences to deal with the situation that pauses are not provided explicitly, and designs a semi-stepwise monotonic attention (SSMA) to model these inserted hidden states. In this method, hidden states are introduced that absorb the pause segments in utterances in an unsupervised way. Thus, the attention at each decoding frame has three options, moving forward to the next phone, staying at the same phone, or jumping to a hidden state. Experimental results show that SSMA can achieve better naturalness of synthetic speech than SMA when phrase boundaries are not available. Moreover, the pause positions derived from the alignment paths of SSMA matched the manually labeled phrase boundaries quite well.; Xiao Zhou; Zhenhua Ling; Yajun Hu; Lirong Dai; MDPI AG; article; speech synthesis; sequence-to-sequence; attention; phrase boundary; Technology; T; Engineering (General). Civil engineering (General); TA1-2040; Biology (General); QH301-705.5; Physics; QC1-999; Chemistry; QD1-999; EN; Applied Sciences, Vol 11, Iss 10475, p 10475 (2021) |
institution |
DOAJ |
collection |
DOAJ |
language |
EN |
topic |
speech synthesis; sequence-to-sequence; attention; phrase boundary; Technology; T; Engineering (General). Civil engineering (General); TA1-2040; Biology (General); QH301-705.5; Physics; QC1-999; Chemistry; QD1-999 |
spellingShingle |
speech synthesis; sequence-to-sequence; attention; phrase boundary; Technology; T; Engineering (General). Civil engineering (General); TA1-2040; Biology (General); QH301-705.5; Physics; QC1-999; Chemistry; QD1-999; Xiao Zhou; Zhenhua Ling; Yajun Hu; Lirong Dai; Sequence-to-Sequence Acoustic Modeling with Semi-Stepwise Monotonic Attention for Speech Synthesis |
description |
An encoder–decoder with attention has become a popular method to achieve sequence-to-sequence (Seq2Seq) acoustic modeling for speech synthesis. To improve the robustness of the attention mechanism, methods utilizing the monotonic alignment between phone sequences and acoustic feature sequences have been proposed, such as stepwise monotonic attention (SMA). However, the phone sequences derived by grapheme-to-phoneme (G2P) conversion may not contain the pauses at the phrase boundaries in utterances, which challenges the assumption of strictly stepwise alignment in SMA. Therefore, this paper proposes to insert hidden states into phone sequences to deal with the situation that pauses are not provided explicitly, and designs a semi-stepwise monotonic attention (SSMA) to model these inserted hidden states. In this method, hidden states are introduced that absorb the pause segments in utterances in an unsupervised way. Thus, the attention at each decoding frame has three options, moving forward to the next phone, staying at the same phone, or jumping to a hidden state. Experimental results show that SSMA can achieve better naturalness of synthetic speech than SMA when phrase boundaries are not available. Moreover, the pause positions derived from the alignment paths of SSMA matched the manually labeled phrase boundaries quite well. |
format |
article |
author |
Xiao Zhou; Zhenhua Ling; Yajun Hu; Lirong Dai |
author_facet |
Xiao Zhou; Zhenhua Ling; Yajun Hu; Lirong Dai |
author_sort |
Xiao Zhou |
title |
Sequence-to-Sequence Acoustic Modeling with Semi-Stepwise Monotonic Attention for Speech Synthesis |
title_short |
Sequence-to-Sequence Acoustic Modeling with Semi-Stepwise Monotonic Attention for Speech Synthesis |
title_full |
Sequence-to-Sequence Acoustic Modeling with Semi-Stepwise Monotonic Attention for Speech Synthesis |
title_fullStr |
Sequence-to-Sequence Acoustic Modeling with Semi-Stepwise Monotonic Attention for Speech Synthesis |
title_full_unstemmed |
Sequence-to-Sequence Acoustic Modeling with Semi-Stepwise Monotonic Attention for Speech Synthesis |
title_sort |
sequence-to-sequence acoustic modeling with semi-stepwise monotonic attention for speech synthesis |
publisher |
MDPI AG |
publishDate |
2021 |
url |
https://doaj.org/article/99f8bad95ab6464297ca71e3f556b075 |
work_keys_str_mv |
AT xiaozhou sequencetosequenceacousticmodelingwithsemistepwisemonotonicattentionforspeechsynthesis AT zhenhualing sequencetosequenceacousticmodelingwithsemistepwisemonotonicattentionforspeechsynthesis AT yajunhu sequencetosequenceacousticmodelingwithsemistepwisemonotonicattentionforspeechsynthesis AT lirongdai sequencetosequenceacousticmodelingwithsemistepwisemonotonicattentionforspeechsynthesis |
_version_ |
1718435379944620032 |
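
The description field in this record says that, in SSMA, the attention at each decoding frame can stay at the same phone, move forward to the next phone, or jump to a hidden state that absorbs a pause. The following is only a minimal sketch of that three-option idea, not the formulation used in the paper: the function name `ssma_step`, the one-hidden-state-per-phone layout, and the fixed transition probabilities are all assumptions made for illustration (a real model would predict these probabilities from the decoder state).

```python
def ssma_step(alpha_phone, alpha_hidden, p_stay, p_jump, p_hidden_stay):
    """One decoding frame of a toy three-option alignment update.

    Hypothetical layout (an assumption, not taken from the paper):
    one hidden "pause" state is inserted after every phone.

    alpha_phone[j]   : probability that the attention sits on phone j
    alpha_hidden[j]  : probability that it sits on the hidden state after phone j
    p_stay[j]        : probability of staying on phone j
    p_jump[j]        : probability of jumping from phone j to its hidden state
    p_hidden_stay[j] : probability of remaining in hidden state j
    The remaining mass, 1 - p_stay[j] - p_jump[j], moves forward to phone j + 1.
    """
    n = len(alpha_phone)
    new_phone = [0.0] * n
    new_hidden = [0.0] * n
    for j in range(n):
        p_next = 1.0 - p_stay[j] - p_jump[j]
        # Option 1: stay at the same phone.
        new_phone[j] += alpha_phone[j] * p_stay[j]
        # Option 2: jump to (or remain in) the hidden state after phone j.
        new_hidden[j] += alpha_phone[j] * p_jump[j]
        new_hidden[j] += alpha_hidden[j] * p_hidden_stay[j]
        # Option 3: move forward to the next phone, from the phone or its hidden state.
        from_phone = alpha_phone[j] * p_next
        from_hidden = alpha_hidden[j] * (1.0 - p_hidden_stay[j])
        if j + 1 < n:
            new_phone[j + 1] += from_phone + from_hidden
        else:
            # No next phone exists: keep the mass where it is so the total stays 1.
            new_phone[j] += from_phone
            new_hidden[j] += from_hidden
    return new_phone, new_hidden


if __name__ == "__main__":
    # Three phones; the attention starts on the first phone.
    phone, hidden = [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]
    for frame in range(4):
        phone, hidden = ssma_step(phone, hidden, [0.5] * 3, [0.2] * 3, [0.5] * 3)
        print(frame, [round(p, 3) for p in phone], [round(h, 3) for h in hidden])
```

Setting every `p_jump[j]` to zero removes the hidden states from play and leaves a plain stay-or-advance update, which is the strictly stepwise behavior the abstract attributes to SMA.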