One of the goals of text-to-speech (TTS) systems is to produce natural-sounding synthesised speech. To this end, various natural language processing (NLP) tasks are performed to model the prosodic aspects of the TTS voice. One of the fundamental NLP tasks used is part-of-speech (POS) tagging of the words in the text. This paper investigates the effect of POS information on the naturalness of a hidden Markov model (HMM) based TTS voice when no additional resources are available to aid in the modelling of prosody. It is found that, when a minimal feature set is used for the HMM context labels, the addition of POS tags does improve the naturalness of the voice. However, the same effect can be achieved by including segmental counting and positional information instead of the POS tags.
Reference:
Schlunz, GI, Barnard, E and Van Huyssteen, GB. 2010. Part-of-speech effects on text-to-speech synthesis. 21st Annual Symposium of the Pattern Recognition Association of South Africa (PRASA), Stellenbosch, South Africa, 22-23 November 2010, pp 257-262. http://hdl.handle.net/10204/4674