StyleS2ST: Zero-shot Style Transfer for Direct Speech-to-speech Translation

0. Contents

  1. Abstract

  2. Demos on Unseen Speakers (Zero-shot)

  3. Demos on Seen Speakers (In-set)



1. Abstract

Direct speech-to-speech translation (S2ST) has steadily gained popularity because it offers many advantages over cascaded S2ST. However, current research focuses mainly on the accuracy of semantic translation and ignores style transfer from the source language to the target language. The lack of high-fidelity expressive parallel data makes such style transfer challenging, especially in the more practical zero-shot scenario. To address this problem, we first build a parallel corpus using a multi-lingual multi-speaker text-to-speech synthesis (TTS) system and then propose the StyleS2ST model, based on a style adaptor within a direct S2ST framework. By enabling continuous style space modeling in the acoustic model through parallel corpus training and non-parallel TTS data augmentation, StyleS2ST captures cross-lingual acoustic feature conversion from the source to the target language. Experiments show that StyleS2ST achieves good style similarity and naturalness in both in-set and out-of-set zero-shot scenarios.



2. Demos on Unseen Speakers (Zero-shot)

The source speech is English and the target speech is Chinese.

Note that the zero-shot setting has no ground-truth target Chinese speech, and because this is a direct S2ST system that does not rely on text, no corresponding Chinese text is available.

Demos:

Source speech | Text of source speech | StyleS2ST (proposed) | directS2UT [1]

Maybe that's why they always stay close to me.

Most, but not all, authorities normalise follow this convention.

In most cases, the background method is preferred and italics are used sparingly.

He was replaced by Los Angeles-based vocalist Bob James.

Hutching now plays bass in Wellington rock band The Accelerants.

Then she again took his hands and studied them carefully.

I would like to focus on the things I would do, he said.

The "Hougoumont"'s passage was the last convict ship transport to Western Australia.

The entire Swiss population is generally well-educated.

Gene duplications account for most of the sequence differences between humans and chimps.

3. Demos on Seen Speakers (In-set)

The source speech is English and the target speech is Chinese.

Demos:

Source speech | Text of source speech | Target speech | Text of target speech | StyleS2ST (proposed) | directS2UT [1]

Fly by night and you waste little time.

夜间飞行的话,你浪费的时间很少。

I tend to avoid speaking to customers much, mostly because of my limited German.

我经常会避免与客户交流太多,主要是因为我的德语能力有限。

Put your trust in God.

相信上帝。

Also, there are lots of small streams passing through the district.

此外,还有许多小溪流经该地区。

Taxis service the airport to downtown Algiers.

从机场到阿尔及尔市区的出租车服务。

He tries to tell her the truth, but she thinks he's only joking.

他试图告诉她真相,但她认为他只是在开玩笑。

You said you wanted fireworks.

你说你想要烟火。

Watch over her tonight.

今晚守护她。

The oxygen's only to help him till the doctor gets here.

氧气只能在医生来之前帮助他。

By the light of their own destruction, I saw them staggering and falling, and their supporters turning to run.

借着他们毁灭的亮光,我看见他们摇摇欲坠,他们的支持者转身逃跑。

Reference

[1] S. Popuri, P. Chen, C. Wang, J. Pino, Y. Adi, J. Gu, W. Hsu, and A. Lee, “Enhanced direct speech-to-speech translation using self-supervised pre-training and data augmentation,” in Interspeech 2022, 23rd Annual Conference of the International Speech Communication Association. ISCA, 2022, pp. 5195–5199.