ResearchSpace

Gauging the accuracy of automatic speech data harvesting in five under-resourced languages

dc.contributor.author Badenhorst, Jacob AC
dc.contributor.author De Wet, F
dc.date.accessioned 2024-02-15T06:52:07Z
dc.date.available 2024-02-15T06:52:07Z
dc.date.issued 2023-03
dc.identifier.citation Badenhorst, J.A. & De Wet, F. 2023. Gauging the accuracy of automatic speech data harvesting in five under-resourced languages. Journal of the Digital Humanities Association of Southern Africa, 4(2). http://hdl.handle.net/10204/13604 en_ZA
dc.identifier.uri https://doi.org/10.55492/dhasa.v4i02.4031
dc.identifier.uri http://hdl.handle.net/10204/13604
dc.description.abstract Recent research on deep-learning architectures has resulted in substantial improvements in automatic speech recognition accuracy. The leaps of progress made in well-resourced languages can be attributed to the fact that these architectures are able to effectively represent spoken language in all its diversity and complexity. However, developing advanced models of a language without appropriate corpora of speech and text data remains a challenge. For many under-resourced languages, including those spoken in South Africa, such resources simply do not exist. The aim of the work reported on in this paper is to address this situation by investigating the possibility of creating diverse speech resources from unannotated broadcast data. The paper describes how existing speech and text resources were used to develop a semi-automatic data harvesting procedure for two genres of broadcast data, namely news bulletins and radio dramas. It was found that adapting acoustic models with less than 10 hours of manually annotated data from the same domain significantly reduced transcription error rates for speaking styles and acoustic conditions that are not represented in any of the existing speech corpora. Results also indicated that much more automatically transcribed adaptation data is required to achieve similar results. en_US
dc.format Fulltext en_US
dc.language.iso en en_US
dc.relation.uri https://upjournals.up.ac.za/index.php/dhasa/article/view/4031/3878 en_US
dc.source Journal of the Digital Humanities Association of Southern Africa, 4(2) en_US
dc.subject Low-resource languages en_US
dc.subject Automatic speech recognition en_US
dc.subject Data harvesting en_US
dc.subject Domain adaptation en_US
dc.subject Data collection en_US
dc.title Gauging the accuracy of automatic speech data harvesting in five under-resourced languages en_US
dc.type Article en_US
dc.description.pages 17 en_US
dc.description.cluster Next Generation Enterprises & Institutions en_US
dc.description.impactarea Voice Computing en_US
dc.identifier.apacitation Badenhorst, J. A., & De Wet, F. (2023). Gauging the accuracy of automatic speech data harvesting in five under-resourced languages. Journal of the Digital Humanities Association of Southern Africa, 4(2), http://hdl.handle.net/10204/13604 en_ZA
dc.identifier.chicagocitation Badenhorst, Jacob AC, and F De Wet. "Gauging the accuracy of automatic speech data harvesting in five under-resourced languages." Journal of the Digital Humanities Association of Southern Africa, 4(2) (2023). http://hdl.handle.net/10204/13604 en_ZA
dc.identifier.vancouvercitation Badenhorst JA, De Wet F. Gauging the accuracy of automatic speech data harvesting in five under-resourced languages. Journal of the Digital Humanities Association of Southern Africa, 4(2). 2023; http://hdl.handle.net/10204/13604. en_ZA
dc.identifier.ris
TY - Article
AU - Badenhorst, Jacob AC
AU - De Wet, F
AB - Recent research on deep-learning architectures has resulted in substantial improvements in automatic speech recognition accuracy. The leaps of progress made in well-resourced languages can be attributed to the fact that these architectures are able to effectively represent spoken language in all its diversity and complexity. However, developing advanced models of a language without appropriate corpora of speech and text data remains a challenge. For many under-resourced languages, including those spoken in South Africa, such resources simply do not exist. The aim of the work reported on in this paper is to address this situation by investigating the possibility of creating diverse speech resources from unannotated broadcast data. The paper describes how existing speech and text resources were used to develop a semi-automatic data harvesting procedure for two genres of broadcast data, namely news bulletins and radio dramas. It was found that adapting acoustic models with less than 10 hours of manually annotated data from the same domain significantly reduced transcription error rates for speaking styles and acoustic conditions that are not represented in any of the existing speech corpora. Results also indicated that much more automatically transcribed adaptation data is required to achieve similar results.
DA - 2023-03
DB - ResearchSpace
DP - CSIR
J1 - Journal of the Digital Humanities Association of Southern Africa, 4(2)
KW - Low-resource languages
KW - Automatic speech recognition
KW - Data harvesting
KW - Domain adaptation
KW - Data collection
LK - https://researchspace.csir.co.za
PY - 2023
T1 - Gauging the accuracy of automatic speech data harvesting in five under-resourced languages
TI - Gauging the accuracy of automatic speech data harvesting in five under-resourced languages
UR - http://hdl.handle.net/10204/13604
ER -
en_ZA
dc.identifier.worklist 27193 en_US

