Past measurements of the cognitive load imposed by text-to-speech (TTS) synthesis systems consistently showed that synthetic speech is more difficult to process than human speech in the presence of noise. However, the systems previously evaluated are no longer state-of-the-art. The quality of speech produced by modern TTS systems is considered indistinguishable from that of human speech. Does this mean that the cognitive load demanded by such systems is now equivalent to that of human speech? The work presented in this paper sets out to answer this question by measuring the cognitive load of modern TTS systems under noisy conditions. Results show that the gap in cognitive load between TTS and human speech is narrowing for systems such as Tacotron 2 and FastSpeech 2. However, differences in cognitive load between these systems are still present. Therefore, although modern TTS systems produce high-quality speech, not all of them demand the same amount of cognitive load, and thus not all TTS systems will provide the same user experience when embedded in real-world applications. Interestingly, the results suggest that vocoded speech demands the same cognitive load as human speech, which shows that it is possible to generate synthetic speech that imposes a cognitive load equivalent to that of human speech.
Reference:
Govender, A. & King, S. 2023. Cognitive load of modern TTS systems under noisy conditions. Cognitive AI 2023, Bari, Italy, 13-15 November 2023. http://hdl.handle.net/10204/13517