Field | Details |
---|---|
DOI | |
Creator | Anon Bangsan |
Title | The Application of Generative Artificial Intelligence Technology in Voice Conversion |
Contributor | Payap Sirinam |
Publisher | Navaminda Kasatriyadhiraj Royal Air Force Academy |
Publication Year | 2568 B.E. (2025 CE) |
Journal Title | NKRAFA Journal of Science and Technology |
Journal Vol. | 22 |
Journal No. | 2 |
Page no. | 135-157 |
Keywords | Generative Artificial Intelligence, Generative Adversarial Network, Voice Conversion, Cyber Warfare |
URL Website | https://ph02.tci-thaijo.org/index.php/nkrafa-sct |
Website title | NKRAFA Journal of Science and Technology |
ISSN | 3057-0913 |
Abstract | This research aims to 1) explore appropriate applications of artificial intelligence (AI) technology for voice spoofing, 2) develop a generative AI-based voice spoofing model and investigate optimization strategies that make it better suited to cyber-domain applications, 3) evaluate the performance and deception potential of the synthetic voices the model generates, and 4) propose practical applications of generative AI technology in offensive cyber operations. The findings indicated that MaskCycleGAN-VC was a highly effective generative AI model for voice spoofing in the Thai language. The model generated synthetic voices that closely resembled the original speaker in naturalness, including rhythm, intonation, and emotional expression. A key feature of the model was that it could be developed and trained within a single day using only moderate computational resources. In listening tests, up to 56% of the synthetic voices were judged to be genuine, while up to 59% of the genuine voices were misjudged as synthetic, highlighting how difficult it is to distinguish genuine from synthetic voices in noisy environments. The model achieved a Mean Opinion Score (MOS) of up to 3.9 for naturalness and up to 4.2 for similarity, a minimum Mel Cepstral Distortion (MCD) of 5 dB, and a Kernel Deep Speech Distance (KDSD) of 15.9 mKDSD. The model demonstrated significant potential for security and offensive cyber operations, including support for intelligence activities, creating confusion in emergency scenarios, and simulated training exercises. However, it should be used with caution to prevent misuse in unethical contexts. |
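
The abstract reports a minimum Mel Cepstral Distortion (MCD) of 5 dB. A widely used formulation computes, for each pair of time-aligned frames, MCD = (10 / ln 10) · sqrt(2 · Σ_d (c_d − ĉ_d)²) over the mel-cepstral coefficients c_d and ĉ_d, usually excluding the energy coefficient c0, and then averages over frames. The following is a minimal sketch assuming that formulation; it also assumes the reference and synthetic sequences have already been aligned (for example with dynamic time warping). The function name `mel_cepstral_distortion` is hypothetical, and this is not the authors' evaluation code.

```python
import math
from typing import Sequence

# Scale factor (10 / ln 10) * sqrt(2) that converts the Euclidean
# cepstral distance of one frame into decibels.
_MCD_SCALE = 10.0 / math.log(10.0) * math.sqrt(2.0)


def mel_cepstral_distortion(
    ref: Sequence[Sequence[float]],
    syn: Sequence[Sequence[float]],
    skip_c0: bool = True,
) -> float:
    """Return the frame-averaged MCD in dB between two mel-cepstral sequences.

    Assumes ref and syn are already time-aligned (same number of frames,
    e.g. after DTW) and that every frame has the same cepstral order.
    """
    if not ref or len(ref) != len(syn):
        raise ValueError("sequences must be non-empty and time-aligned")
    start = 1 if skip_c0 else 0  # c0 (frame energy) is conventionally excluded
    total = 0.0
    for r, s in zip(ref, syn):
        if len(r) != len(s):
            raise ValueError("frames must have the same cepstral order")
        # Sum of squared coefficient differences for this frame.
        sq = sum((a - b) ** 2 for a, b in zip(r[start:], s[start:]))
        total += _MCD_SCALE * math.sqrt(sq)
    return total / len(ref)
```

Because each frame's distance is scaled by the same constant, a lower average MCD means the synthetic cepstra lie closer to the reference speaker's; excluding c0 keeps overall loudness differences from dominating the score.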