dc.contributor.author |
Badenhorst, Jacob AC
|
|
dc.contributor.author |
De Wet, F
|
|
dc.date.accessioned |
2024-02-15T06:52:07Z |
|
dc.date.available |
2024-02-15T06:52:07Z |
|
dc.date.issued |
2023-03 |
|
dc.identifier.citation |
Badenhorst, J.A. & De Wet, F. 2023. Gauging the accuracy of automatic speech data harvesting in five under-resourced languages. Journal of the Digital Humanities Association of Southern Africa, 4(2). http://hdl.handle.net/10204/13604 |
en_ZA |
dc.identifier.uri |
https://doi.org/10.55492/dhasa.v4i02.4031
|
|
dc.identifier.uri |
http://hdl.handle.net/10204/13604
|
|
dc.description.abstract |
Recent research on deep-learning architectures has resulted in substantial improvements in automatic speech recognition accuracy. The leaps of progress made in well-resourced languages can be attributed to the fact that these architectures are able to effectively represent spoken language in all its diversity and complexity. However, developing advanced models of a language without appropriate corpora of speech and text data remains a challenge. For many under-resourced languages, including those spoken in South Africa, such resources simply do not exist. The aim of the work reported on in this paper is to address this situation by investigating the possibility of creating diverse speech resources from unannotated broadcast data. The paper describes how existing speech and text resources were used to develop a semi-automatic data harvesting procedure for two genres of broadcast data, namely news bulletins and radio dramas. It was found that adapting acoustic models with less than 10 hours of manually annotated data from the same domain significantly reduced transcription error rates for speaking styles and acoustic conditions that are not represented in any of the existing speech corpora. Results also indicated that much more automatically transcribed adaptation data is required to achieve similar results. |
en_US |
dc.format |
Fulltext |
en_US |
dc.language.iso |
en |
en_US |
dc.relation.uri |
https://upjournals.up.ac.za/index.php/dhasa/article/view/4031/3878 |
en_US |
dc.source |
Journal of the Digital Humanities Association of Southern Africa, 4(2) |
en_US |
dc.subject |
Low-resource languages |
en_US |
dc.subject |
Automatic speech recognition |
en_US |
dc.subject |
Data harvesting |
en_US |
dc.subject |
Domain adaptation |
en_US |
dc.subject |
Data collection |
en_US |
dc.title |
Gauging the accuracy of automatic speech data harvesting in five under-resourced languages |
en_US |
dc.type |
Article |
en_US |
dc.description.pages |
17 |
en_US |
dc.description.cluster |
Next Generation Enterprises & Institutions |
en_US |
dc.description.impactarea |
Voice Computing |
en_US |
dc.identifier.apacitation |
Badenhorst, J. A., & De Wet, F. (2023). Gauging the accuracy of automatic speech data harvesting in five under-resourced languages. Journal of the Digital Humanities Association of Southern Africa, 4(2), http://hdl.handle.net/10204/13604 |
en_ZA |
dc.identifier.chicagocitation |
Badenhorst, Jacob AC, and F De Wet. "Gauging the accuracy of automatic speech data harvesting in five under-resourced languages." Journal of the Digital Humanities Association of Southern Africa, 4(2) (2023). http://hdl.handle.net/10204/13604 |
en_ZA |
dc.identifier.vancouvercitation |
Badenhorst JA, De Wet F. Gauging the accuracy of automatic speech data harvesting in five under-resourced languages. Journal of the Digital Humanities Association of Southern Africa, 4(2). 2023; http://hdl.handle.net/10204/13604. |
en_ZA |
dc.identifier.ris |
TY - Article
AU - Badenhorst, Jacob AC
AU - De Wet, F
AB - Recent research on deep-learning architectures has resulted in substantial improvements in automatic speech recognition accuracy. The leaps of progress made in well-resourced languages can be attributed to the fact that these architectures are able to effectively represent spoken language in all its diversity and complexity. However, developing advanced models of a language without appropriate corpora of speech and text data remains a challenge. For many under-resourced languages, including those spoken in South Africa, such resources simply do not exist. The aim of the work reported on in this paper is to address this situation by investigating the possibility of creating diverse speech resources from unannotated broadcast data. The paper describes how existing speech and text resources were used to develop a semi-automatic data harvesting procedure for two genres of broadcast data, namely news bulletins and radio dramas. It was found that adapting acoustic models with less than 10 hours of manually annotated data from the same domain significantly reduced transcription error rates for speaking styles and acoustic conditions that are not represented in any of the existing speech corpora. Results also indicated that much more automatically transcribed adaptation data is required to achieve similar results.
DA - 2023-03
DB - ResearchSpace
DP - CSIR
J1 - Journal of the Digital Humanities Association of Southern Africa, 4(2)
KW - Low-resource languages
KW - Automatic speech recognition
KW - Data harvesting
KW - Domain adaptation
KW - Data collection
LK - https://researchspace.csir.co.za
PY - 2023
T1 - Gauging the accuracy of automatic speech data harvesting in five under-resourced languages
TI - Gauging the accuracy of automatic speech data harvesting in five under-resourced languages
UR - http://hdl.handle.net/10204/13604
ER -
|
en_ZA |
dc.identifier.worklist |
27193 |
en_US |