Advancing Deep-Generated Speech and Defending against Its Misuse

Thumbnail Image
Limited Access
This item is unavailable until:



Journal Title

Journal ISSN

Volume Title


Deep learning has revolutionized speech generation, spanning synthesis areas such as text-to-speech and voice conversion, leading to diverse advancements. On the one hand, when trained on high-quality datasets, artificial voices now exhibit a level of synthesized quality that rivals human speech in naturalness. On the other, cutting-edge deep synthesis research is making strides in producing controllable systems, allowing for generating audio signals in arbitrary voice and speaking style.

Yet, despite their impressive synthesis capabilities, current speech generation systems still face challenges in controlling and manipulating speech attributes. Control over crucial attributes, such as speaker identity and language, essential for enhancing the functionality of a synthesis system, still needs to be improved. Specifically, systems capable of cloning a target speaker's voice in cross-lingual contexts or replicating unseen voices are still in their nascent stages. On the other hand, the heightened naturalness of synthesized speech has raised concerns, posing security threats to both humans and automated speech processing systems. The rise of accessible audio deepfakes, capable of spreading misinformation or bypassing biometric security, accentuates the complex interplay between advancing and defencing against deep-synthesized speech.

Consequently, this dissertation delves into the dynamics of deep-generated speech, viewing it from two perspectives. Offensively, we aim to enhance synthesis systems to elevate their capabilities. On the defensive side, we introduce methodologies to counter emerging audio deepfake threats, offering solutions grounded in detection-based approaches and reliable synthesis system design.

Our research yields several noteworthy findings and conclusions. First, we present an improved voice cloning method incorporated with our novel feedback speaker consistency mechanism. Second, we demonstrate the feasibility of achieving cross-lingual multi-speaker speech synthesis with a limited amount of bilingual data, offering a synthesis method capable of producing diverse audio across various speakers and languages. Third, our proposed frame-level detection model for partially fake audio attacks proves effective in detecting tampered utterances and locating the modified regions within. Lastly, by employing an invertible synthesis system, we can trace back to the original speaker of a converted utterance. Despite these strides, each domain of our study still confronts challenges, further fueling our motivation for persistent research and refinement of the associated performance.





Cai, Zexin (2023). Advancing Deep-Generated Speech and Defending against Its Misuse. Dissertation, Duke University. Retrieved from


Dukes student scholarship is made available to the public using a Creative Commons Attribution / Non-commercial / No derivative (CC-BY-NC-ND) license.