With the increasing prominence and maturity of corpus-based techniques for speech synthesis, the process of system development has, in some ways, been simplified considerably. However, dependence on sufficient amounts of relevant, high-quality speech data remains a central challenge in under-resourced environments. In this paper, the authors investigate the implications for synthesis quality of building baseline systems with reduced amounts of speech data. This is done through a perceptual evaluation of synthesis systems based on unit-selection and statistical parametric synthesis techniques. The authors show that, although it is possible to build an acceptable unit-selection synthesizer with as little as 27 minutes of carefully recorded speech data, Hidden Markov Model (HMM)-based synthesis yields more consistent quality and requires significantly less speech data.
Reference:
Van Niekerk, DR, Barnard, E and Schlunz, GI. 2009. Perceptual evaluation of corpus-based speech synthesis techniques in under-resourced environments. 20th Annual Symposium of the Pattern Recognition Association of South Africa (PRASA). Stellenbosch, South Africa, 30 November to 1 December 2009, pp. 71-75. http://hdl.handle.net/10204/3852