Sequence-to-Sequence Acoustic Modeling with Semi-Stepwise Monotonic Attention for Speech Synthesis

An encoder–decoder with attention has become a popular method to achieve sequence-to-sequence (Seq2Seq) acoustic modeling for speech synthesis. To improve the robustness of the attention mechanism, methods utilizing the monotonic alignment between phone sequences and acoustic feature sequences have...

Bibliographic Details
Main authors:	Xiao Zhou, Zhenhua Ling, Yajun Hu, Lirong Dai
Format:	article
Language:	EN
Published:	MDPI AG 2021
Subjects:	speech synthesis; sequence-to-sequence; attention; phrase boundary; Technology (T); Engineering (General). Civil engineering (General) (TA1-2040); Biology (General) (QH301-705.5); Physics (QC1-999); Chemistry (QD1-999)
Online access:	https://doaj.org/article/99f8bad95ab6464297ca71e3f556b075

id	oai:doaj.org-article:99f8bad95ab6464297ca71e3f556b075
record_format	dspace
spelling	oai:doaj.org-article:99f8bad95ab6464297ca71e3f556b075 | 2021-11-11T15:25:02Z | Sequence-to-Sequence Acoustic Modeling with Semi-Stepwise Monotonic Attention for Speech Synthesis | 10.3390/app112110475 | 2076-3417 | https://doaj.org/article/99f8bad95ab6464297ca71e3f556b075 | 2021-11-01T00:00:00Z | https://www.mdpi.com/2076-3417/11/21/10475 | https://doaj.org/toc/2076-3417 | An encoder–decoder with attention has become a popular method to achieve sequence-to-sequence (Seq2Seq) acoustic modeling for speech synthesis. To improve the robustness of the attention mechanism, methods utilizing the monotonic alignment between phone sequences and acoustic feature sequences have been proposed, such as stepwise monotonic attention (SMA). However, the phone sequences derived by grapheme-to-phoneme (G2P) conversion may not contain the pauses at the phrase boundaries in utterances, which challenges the assumption of strictly stepwise alignment in SMA. Therefore, this paper proposes to insert hidden states into phone sequences to deal with the situation that pauses are not provided explicitly, and designs a semi-stepwise monotonic attention (SSMA) to model these inserted hidden states. In this method, hidden states are introduced that absorb the pause segments in utterances in an unsupervised way. Thus, the attention at each decoding frame has three options, moving forward to the next phone, staying at the same phone, or jumping to a hidden state. Experimental results show that SSMA can achieve better naturalness of synthetic speech than SMA when phrase boundaries are not available. Moreover, the pause positions derived from the alignment paths of SSMA matched the manually labeled phrase boundaries quite well. | Xiao Zhou | Zhenhua Ling | Yajun Hu | Lirong Dai | MDPI AG | article | speech synthesis | sequence-to-sequence | attention | phrase boundary | Technology | T | Engineering (General). Civil engineering (General) | TA1-2040 | Biology (General) | QH301-705.5 | Physics | QC1-999 | Chemistry | QD1-999 | EN | Applied Sciences, Vol 11, Iss 10475, p 10475 (2021)
institution	DOAJ
collection	DOAJ
language	EN
topic	speech synthesis; sequence-to-sequence; attention; phrase boundary; Technology (T); Engineering (General). Civil engineering (General) (TA1-2040); Biology (General) (QH301-705.5); Physics (QC1-999); Chemistry (QD1-999)
spellingShingle	speech synthesis; sequence-to-sequence; attention; phrase boundary; Technology (T); Engineering (General). Civil engineering (General) (TA1-2040); Biology (General) (QH301-705.5); Physics (QC1-999); Chemistry (QD1-999); Xiao Zhou; Zhenhua Ling; Yajun Hu; Lirong Dai; Sequence-to-Sequence Acoustic Modeling with Semi-Stepwise Monotonic Attention for Speech Synthesis
description	An encoder–decoder with attention has become a popular method to achieve sequence-to-sequence (Seq2Seq) acoustic modeling for speech synthesis. To improve the robustness of the attention mechanism, methods utilizing the monotonic alignment between phone sequences and acoustic feature sequences have been proposed, such as stepwise monotonic attention (SMA). However, the phone sequences derived by grapheme-to-phoneme (G2P) conversion may not contain the pauses at the phrase boundaries in utterances, which challenges the assumption of strictly stepwise alignment in SMA. Therefore, this paper proposes to insert hidden states into phone sequences to deal with the situation that pauses are not provided explicitly, and designs a semi-stepwise monotonic attention (SSMA) to model these inserted hidden states. In this method, hidden states are introduced that absorb the pause segments in utterances in an unsupervised way. Thus, the attention at each decoding frame has three options, moving forward to the next phone, staying at the same phone, or jumping to a hidden state. Experimental results show that SSMA can achieve better naturalness of synthetic speech than SMA when phrase boundaries are not available. Moreover, the pause positions derived from the alignment paths of SSMA matched the manually labeled phrase boundaries quite well.
format	article
author	Xiao Zhou; Zhenhua Ling; Yajun Hu; Lirong Dai
author_facet	Xiao Zhou; Zhenhua Ling; Yajun Hu; Lirong Dai
author_sort	Xiao Zhou
title	Sequence-to-Sequence Acoustic Modeling with Semi-Stepwise Monotonic Attention for Speech Synthesis
title_short	Sequence-to-Sequence Acoustic Modeling with Semi-Stepwise Monotonic Attention for Speech Synthesis
title_full	Sequence-to-Sequence Acoustic Modeling with Semi-Stepwise Monotonic Attention for Speech Synthesis
title_fullStr	Sequence-to-Sequence Acoustic Modeling with Semi-Stepwise Monotonic Attention for Speech Synthesis
title_full_unstemmed	Sequence-to-Sequence Acoustic Modeling with Semi-Stepwise Monotonic Attention for Speech Synthesis
title_sort	sequence-to-sequence acoustic modeling with semi-stepwise monotonic attention for speech synthesis
publisher	MDPI AG
publishDate	2021
url	https://doaj.org/article/99f8bad95ab6464297ca71e3f556b075
work_keys_str_mv	AT xiaozhou sequencetosequenceacousticmodelingwithsemistepwisemonotonicattentionforspeechsynthesis AT zhenhualing sequencetosequenceacousticmodelingwithsemistepwisemonotonicattentionforspeechsynthesis AT yajunhu sequencetosequenceacousticmodelingwithsemistepwisemonotonicattentionforspeechsynthesis AT lirongdai sequencetosequenceacousticmodelingwithsemistepwisemonotonicattentionforspeechsynthesis
_version_	1718435379944620032
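
Illustrative note on the description field: the abstract says that, at each decoding frame, the attention either stays at the same phone, moves forward to the next phone, or jumps to a hidden state that absorbs a pause. The sketch below is a toy illustration of that three-way choice only. It is not the authors' formulation, which this record does not give. The function name ssma_step, the interleaved state layout (one hidden state between each pair of phones), the rule that a hidden state may only stay or move on, and the fixed transition probabilities are all assumptions made for the example; in a real model the probabilities would be predicted by the network at every frame.

```python
def ssma_step(alpha, probs):
    """Advance a toy alignment distribution by one decoding frame.

    alpha: probabilities over interleaved states
           [phone_0, hidden_0, phone_1, hidden_1, ..., phone_{N-1}]
           (length 2*N - 1; even indices are phones, odd are hidden states).
    probs: one (p_stay, p_next, p_hidden) tuple per state, summing to 1.
           For hidden states the third entry is ignored (assumption).
    """
    n = len(alpha)
    new_alpha = [0.0] * n
    for j, mass in enumerate(alpha):
        p_stay, p_next, p_hidden = probs[j]
        if j == n - 1:
            # Assumption: the final phone absorbs all remaining mass.
            new_alpha[j] += mass
        elif j % 2 == 0:
            # Phone state: three options, as listed in the abstract.
            new_alpha[j] += mass * p_stay        # stay at the same phone
            new_alpha[j + 2] += mass * p_next    # move forward to the next phone
            new_alpha[j + 1] += mass * p_hidden  # jump to the hidden (pause) state
        else:
            # Hidden state: stay in the pause or move on to the next phone,
            # renormalised over these two options (assumption).
            total = p_stay + p_next
            new_alpha[j] += mass * p_stay / total
            new_alpha[j + 1] += mass * p_next / total
    return new_alpha


if __name__ == "__main__":
    # Three phones, so five interleaved states; start at the first phone.
    alpha = [1.0, 0.0, 0.0, 0.0, 0.0]
    probs = [(0.5, 0.3, 0.2)] * 5  # hypothetical fixed probabilities
    for frame in range(3):
        alpha = ssma_step(alpha, probs)
        print(frame, [round(a, 4) for a in alpha])
```

Each update conserves total probability, so alpha remains a distribution over states. Setting p_hidden to 0 for every phone removes the third option and leaves only the stay/advance choice that strictly stepwise attention assumes.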
