Advancing Deep-Generated Speech and Defending against Its Misuse

dc.contributor.advisor

Li, Ming

dc.contributor.advisor

Li, Xin

dc.contributor.author

Cai, Zexin

dc.date.accessioned

2024-03-07T18:39:41Z

dc.date.issued

2023

dc.department

Electrical and Computer Engineering

dc.description.abstract

Deep learning has revolutionized speech generation, spanning areas such as text-to-speech synthesis and voice conversion and leading to diverse advancements. On the one hand, when trained on high-quality datasets, artificial voices now rival human speech in naturalness. On the other, cutting-edge deep synthesis research is making strides toward controllable systems, allowing audio to be generated in arbitrary voices and speaking styles.

Yet, despite their impressive synthesis capabilities, current speech generation systems still face challenges in controlling and manipulating speech attributes. Control over crucial attributes such as speaker identity and language, essential for enhancing a synthesis system's functionality, remains limited. Specifically, systems capable of cloning a target speaker's voice across languages or replicating unseen voices are still in their nascent stages. At the same time, the heightened naturalness of synthesized speech has raised concerns, posing security threats to both humans and automated speech processing systems. The rise of accessible audio deepfakes, capable of spreading misinformation or bypassing biometric security, accentuates the complex interplay between advancing deep-synthesized speech and defending against its misuse.

Consequently, this dissertation delves into the dynamics of deep-generated speech from two perspectives. On the offensive side, we aim to enhance synthesis systems and elevate their capabilities. On the defensive side, we introduce methodologies to counter emerging audio deepfake threats, offering solutions grounded in detection-based approaches and reliable synthesis system design.

Our research yields several noteworthy findings and conclusions. First, we present an improved voice cloning method incorporating a novel speaker-consistency feedback mechanism. Second, we demonstrate the feasibility of cross-lingual multi-speaker speech synthesis with a limited amount of bilingual data, offering a synthesis method capable of producing diverse audio across various speakers and languages. Third, our proposed frame-level detection model for partially fake audio attacks proves effective in detecting tampered utterances and locating the modified regions within them. Lastly, by employing an invertible synthesis system, we can trace a converted utterance back to its original speaker. Despite these strides, each domain of our study still confronts challenges, motivating continued research and refinement.

dc.identifier.uri

https://hdl.handle.net/10161/30336

dc.rights.uri

https://creativecommons.org/licenses/by-nc-nd/4.0/

dc.subject

Computer science

dc.subject

audio deepfake

dc.subject

Deep learning

dc.subject

Neural network

dc.subject

speech generation

dc.subject

Speech signal processing

dc.subject

speech synthesis

dc.title

Advancing Deep-Generated Speech and Defending against Its Misuse

dc.type

Dissertation

duke.embargo.months

4

duke.embargo.release

2024-07-07T18:39:41Z

Files

Original bundle

Name:
Cai_duke_0066D_17703.pdf
Size:
5.62 MB
Format:
Adobe Portable Document Format