Gauging the accuracy of automatic speech data harvesting in five under-resourced languages

Badenhorst, Jacob AC; De Wet, F

https://doi.org/10.55492/dhasa.v4i02.4031
http://hdl.handle.net/10204/13604

Abstract:

Recent research on deep-learning architectures has resulted in substantial improvements in automatic speech recognition accuracy. The progress made in well-resourced languages can be attributed to the ability of these architectures to represent spoken language effectively in all its diversity and complexity. However, developing advanced models of a language without appropriate corpora of speech and text data remains a challenge. For many under-resourced languages, including those spoken in South Africa, such resources simply do not exist. The work reported in this paper addresses this situation by investigating whether diverse speech resources can be created from unannotated broadcast data. The paper describes how existing speech and text resources were used to develop a semi-automatic data harvesting procedure for two genres of broadcast data, namely news bulletins and radio dramas. It was found that adapting acoustic models with less than 10 hours of manually annotated data from the same domain significantly reduced transcription error rates for speaking styles and acoustic conditions that are not represented in any of the existing speech corpora. Results also indicated that much more automatically transcribed adaptation data is required to achieve similar results.

Reference:

Badenhorst, J.A. & De Wet, F. 2023. Gauging the accuracy of automatic speech data harvesting in five under-resourced languages. Journal of the Digital Humanities Association of Southern Africa, 4(2). http://hdl.handle.net/10204/13604

Date:

March 2023

Keywords:

Low-resource languages
Automatic speech recognition
Data harvesting
Domain adaptation
Data collection

Source

Journal of the Digital Humanities Association of Southern Africa, 4(2)
