Sequence-to-Sequence Acoustic Modeling with Semi-Stepwise Monotonic Attention for Speech Synthesis

An encoder–decoder with attention has become a popular method for sequence-to-sequence (Seq2Seq) acoustic modeling in speech synthesis. To improve the robustness of the attention mechanism, methods that exploit the monotonic alignment between phone sequences and acoustic feature sequences have been proposed, such as stepwise monotonic attention (SMA). However, the phone sequences derived by grapheme-to-phoneme (G2P) conversion may not contain the pauses at phrase boundaries in utterances, which challenges SMA's assumption of strictly stepwise alignment. Therefore, this paper proposes inserting hidden states into phone sequences to handle the case in which pauses are not provided explicitly, and designs a semi-stepwise monotonic attention (SSMA) to model these inserted hidden states. In this method, the hidden states absorb the pause segments in utterances in an unsupervised way. Thus, the attention at each decoding frame has three options: moving forward to the next phone, staying at the same phone, or jumping to a hidden state. Experimental results show that SSMA achieves better naturalness of synthetic speech than SMA when phrase boundaries are not available. Moreover, the pause positions derived from the alignment paths of SSMA matched the manually labeled phrase boundaries quite well.
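
The abstract describes the SSMA mechanism only at a high level. The sketch below is a minimal, hypothetical Python/NumPy illustration of the idea of inserting hidden pause states between phones and allowing a three-way transition (stay on the current state, move forward to the next phone, or jump into an inserted pause state) at each decoding frame. The function names, the pause-insertion scheme, and the probability parameterization are assumptions made for this example and do not reproduce the authors' published SSMA equations.

    # Illustrative sketch only: not the authors' formulation. The three-way
    # transition (stay / forward / jump-to-pause) is modeled with per-state
    # probabilities supplied by the caller; how those probabilities would be
    # predicted from the decoder state is out of scope here.
    import numpy as np

    def insert_pause_states(phones):
        """Insert a hidden pause token after every phone so the attention
        can 'jump' to it when a phrase-boundary pause occurs in the audio."""
        extended = []
        for p in phones:
            extended.append(p)          # regular phone state (even index)
            extended.append("<pause>")  # hidden pause state (odd index)
        return extended

    def ssma_step(alpha_prev, p_stay, p_forward, p_jump):
        """One decoding frame of a semi-stepwise monotonic attention update.

        alpha_prev : attention weights over the extended state sequence at
                     the previous frame (even indices = phones, odd = pauses).
        p_stay, p_forward, p_jump : per-state transition probabilities; the
                     jump option only applies to phone states, since a pause
                     state can only stay or move on to the next phone.
        """
        n = len(alpha_prev)
        alpha = np.zeros(n)
        for i in range(n):
            alpha[i] += alpha_prev[i] * p_stay[i]                 # stay
            if i % 2 == 0:                                        # phone state
                if i + 2 < n:
                    alpha[i + 2] += alpha_prev[i] * p_forward[i]  # next phone
                alpha[i + 1] += alpha_prev[i] * p_jump[i]         # pause state
            elif i + 1 < n:                                       # pause state
                alpha[i + 1] += alpha_prev[i] * p_forward[i]      # next phone
        return alpha

    # Toy usage: two phones, attention starts on the first phone.
    states = insert_pause_states(["AH", "B"])   # ['AH', '<pause>', 'B', '<pause>']
    alpha = np.zeros(len(states)); alpha[0] = 1.0
    p_stay    = np.full(len(states), 0.6)       # arbitrary values, for illustration
    p_forward = np.full(len(states), 0.3)
    p_jump    = np.full(len(states), 0.1)
    alpha = ssma_step(alpha, p_stay, p_forward, p_jump)
    print(alpha)                                # [0.6, 0.1, 0.3, 0.0]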

Bibliographic Details
Main Authors: Xiao Zhou, Zhenhua Ling, Yajun Hu, Lirong Dai
Format: Article
Language: English
Published: MDPI AG, 2021
Published in: Applied Sciences, Vol. 11, Iss. 21, Article 10475 (2021)
DOI: 10.3390/app112110475
ISSN: 2076-3417
Subjects: speech synthesis; sequence-to-sequence; attention; phrase boundary; Technology (T); Engineering (General). Civil engineering (General) (TA1-2040); Biology (General) (QH301-705.5); Physics (QC1-999); Chemistry (QD1-999)
Online Access: https://doaj.org/article/99f8bad95ab6464297ca71e3f556b075
Full Text: https://www.mdpi.com/2076-3417/11/21/10475