U2-VC: one-shot voice conversion using two-level nested U-structure

Abstract: Voice conversion transforms the voice of a source speaker into that of a target speaker while keeping the linguistic content unchanged. Recently, one-shot voice conversion has become a hot topic because of its potentially wide range of applications, as it can convert the voice from any so...

Bibliographic Details
Main authors: Fangkun Liu, Hui Wang, Renhua Peng, Chengshi Zheng, Xiaodong Li
Format: article
Language: EN
Published: SpringerOpen 2021
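Subjects: Voice conversion; U2-Net; Sandwich adaptive instance normalization; Acoustics. Sound; Electronic computers. Computer science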
Online access: https://doaj.org/article/9774f3e85b684d4c8afe408040baa9c9
id oai:doaj.org-article:9774f3e85b684d4c8afe408040baa9c9
record_format dspace
spelling oai:doaj.org-article:9774f3e85b684d4c8afe408040baa9c9
2021-11-28T12:27:53Z
U2-VC: one-shot voice conversion using two-level nested U-structure
10.1186/s13636-021-00226-3
1687-4722
https://doaj.org/article/9774f3e85b684d4c8afe408040baa9c9
2021-11-01T00:00:00Z
https://doi.org/10.1186/s13636-021-00226-3
https://doaj.org/toc/1687-4722
Fangkun Liu
Hui Wang
Renhua Peng
Chengshi Zheng
Xiaodong Li
SpringerOpen
article
Voice conversion
U2-Net
Sandwich adaptive instance normalization
Acoustics. Sound
QC221-246
Electronic computers. Computer science
QA75.5-76.95
EN
EURASIP Journal on Audio, Speech, and Music Processing, Vol 2021, Iss 1, Pp 1-15 (2021)
institution DOAJ
collection DOAJ
language EN
topic Voice conversion
U2-Net
Sandwich adaptive instance normalization
Acoustics. Sound
QC221-246
Electronic computers. Computer science
QA75.5-76.95
description Abstract: Voice conversion transforms the voice of a source speaker into that of a target speaker while keeping the linguistic content unchanged. One-shot voice conversion has recently become a hot topic because of its potentially wide range of applications: it can convert the voice of any source speaker to that of any target speaker, even when both speakers are unseen during training. Although great progress has been made in one-shot voice conversion, the naturalness of the converted speech remains a challenging problem. To further improve naturalness, this paper proposes U2-VC, a voice conversion algorithm based on a two-level nested U-structure (U2-Net). The U2-Net can extract both local and multi-scale features of the log-mel spectrogram, which helps it learn the time-frequency structures of the source and target speech. Moreover, we adopt sandwich adaptive instance normalization (SaAdaIN) in the decoder for speaker identity transformation, so as to retain more content information of the source speech while maintaining speaker similarity between the converted speech and the target speech. Experiments on the VCTK dataset show that U2-VC outperforms many state-of-the-art approaches, including AGAIN-VC and AdaIN-VC, in both objective and subjective measurements.
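The abstract names adaptive instance normalization as the mechanism for speaker identity transformation. As a rough illustration only, the sketch below shows plain AdaIN on per-channel features, which the name SaAdaIN suggests it builds on; it is not the paper's sandwich variant, whose details are not given in this record. The function names `channel_stats` and `adain`, the list-of-channels layout, and the `eps` value are assumptions made for the example.

```python
import math


def channel_stats(channel, eps=1e-5):
    """Return (mean, std) of one channel, computed over the time axis."""
    mean = sum(channel) / len(channel)
    var = sum((v - mean) ** 2 for v in channel) / len(channel)
    return mean, math.sqrt(var + eps)


def adain(content, style, eps=1e-5):
    """Plain AdaIN (hypothetical helper, not the paper's SaAdaIN).

    content, style: lists of channels; each channel is a list of floats
    over time. The two inputs may differ in length along time.
    Each content channel is normalized to zero mean and unit variance,
    then rescaled to the mean and std of the matching style channel.
    """
    if len(content) != len(style):
        raise ValueError("content and style need the same number of channels")
    out = []
    for c_chan, s_chan in zip(content, style):
        c_mean, c_std = channel_stats(c_chan, eps)
        s_mean, s_std = channel_stats(s_chan, eps)
        out.append([(v - c_mean) / c_std * s_std + s_mean for v in c_chan])
    return out


# Usage: one channel of "source" features takes on the "target" statistics.
converted = adain([[0.0, 1.0, 2.0, 3.0]], [[10.0, 12.0, 14.0, 16.0, 18.0]])
```

In this toy setting the temporal shape of the content channel is preserved while its mean and spread follow the style channel, which is the intuition behind keeping linguistic content and swapping speaker characteristics.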
format article
author Fangkun Liu
Hui Wang
Renhua Peng
Chengshi Zheng
Xiaodong Li
title U2-VC: one-shot voice conversion using two-level nested U-structure
publisher SpringerOpen
publishDate 2021
url https://doaj.org/article/9774f3e85b684d4c8afe408040baa9c9