ResearchSpace

Exploring ASR fine-tuning on limited domain specific data for low resource languages


dc.contributor.author Mak, Franco
dc.contributor.author Govender, Avashna
dc.contributor.author Badenhorst, Jaco
dc.date.accessioned 2024-06-11T06:49:36Z
dc.date.available 2024-06-11T06:49:36Z
dc.date.issued 2024-02
dc.identifier.citation Mak, F., Govender, A. & Badenhorst, J. 2024. Exploring ASR fine-tuning on limited domain specific data for low resource languages. <i>Journal of the Digital Humanities Association of Southern Africa, 5(1).</i> http://hdl.handle.net/10204/13683 en_ZA
dc.identifier.uri http://hdl.handle.net/10204/13683
dc.description.abstract The majority of South Africa’s eleven languages are low-resourced, posing a major challenge to Automatic Speech Recognition (ASR) development. Modern ASR systems require extensive amounts of data, which are extremely difficult to obtain for low-resourced languages. In addition, the available speech and text corpora for these languages predominantly revolve around government, political and biblical content. Consequently, ASR systems developed for these languages perform poorly, especially when evaluated on data outside of these domains. To alleviate this problem, the Icefall Kaldi II toolkit introduced new transformer model scripts, facilitating the adaptation of pre-trained models using limited adaptation data. In this paper, we explored the technique of taking pre-trained ASR models from a domain where more data is available (government data) and adapting them to an entirely different domain with limited data (broadcast news data). The objective was to assess whether such techniques could surpass the accuracy of prior ASR models developed for these languages. Our results showed that the Conformer connectionist temporal classification (CTC) model obtained substantially lower word error rates than previous TDNN-F models evaluated on the same datasets. This research signifies a step forward in mitigating data scarcity challenges and enhancing ASR performance for low-resourced languages in South Africa. en_US
dc.format Fulltext en_US
dc.language.iso en en_US
dc.relation.uri https://upjournals.up.ac.za/index.php/dhasa/article/view/5024/4137 en_US
dc.relation.uri https://upjournals.up.ac.za/index.php/dhasa en_US
dc.source Journal of the Digital Humanities Association of Southern Africa, 5(1) en_US
dc.subject Automatic speech recognition en_US
dc.subject Fine-tuning en_US
dc.subject Low-resource languages en_US
dc.subject Data harvesting en_US
dc.subject Broadcast news data en_US
dc.title Exploring ASR fine-tuning on limited domain specific data for low resource languages en_US
dc.type Article en_US
dc.description.pages 8 en_US
dc.description.note Copyright (c) 2024 Franco Mak, Avashna Govender, Jaco Badenhorst. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. en_US
dc.description.cluster Next Generation Enterprises & Institutions en_US
dc.identifier.apacitation Mak, F., Govender, A., & Badenhorst, J. (2024). Exploring ASR fine-tuning on limited domain specific data for low resource languages. <i>Journal of the Digital Humanities Association of Southern Africa, 5(1)</i>, http://hdl.handle.net/10204/13683 en_ZA
dc.identifier.chicagocitation Mak, Franco, Avashna Govender, and Jaco Badenhorst. "Exploring ASR fine-tuning on limited domain specific data for low resource languages." <i>Journal of the Digital Humanities Association of Southern Africa, 5(1)</i> (2024). http://hdl.handle.net/10204/13683 en_ZA
dc.identifier.vancouvercitation Mak F, Govender A, Badenhorst J. Exploring ASR fine-tuning on limited domain specific data for low resource languages. Journal of the Digital Humanities Association of Southern Africa, 5(1). 2024; http://hdl.handle.net/10204/13683. en_ZA
dc.identifier.ris TY - Article AU - Mak, Franco AU - Govender, Avashna AU - Badenhorst, Jaco AB - The majority of South Africa’s eleven languages are low-resourced, posing a major challenge to Automatic Speech Recognition (ASR) development. Modern ASR systems require extensive amounts of data, which are extremely difficult to obtain for low-resourced languages. In addition, the available speech and text corpora for these languages predominantly revolve around government, political and biblical content. Consequently, ASR systems developed for these languages perform poorly, especially when evaluated on data outside of these domains. To alleviate this problem, the Icefall Kaldi II toolkit introduced new transformer model scripts, facilitating the adaptation of pre-trained models using limited adaptation data. In this paper, we explored the technique of taking pre-trained ASR models from a domain where more data is available (government data) and adapting them to an entirely different domain with limited data (broadcast news data). The objective was to assess whether such techniques could surpass the accuracy of prior ASR models developed for these languages. Our results showed that the Conformer connectionist temporal classification (CTC) model obtained substantially lower word error rates than previous TDNN-F models evaluated on the same datasets. This research signifies a step forward in mitigating data scarcity challenges and enhancing ASR performance for low-resourced languages in South Africa.
DA - 2024-02 DB - ResearchSpace DP - CSIR J1 - Journal of the Digital Humanities Association of Southern Africa, 5(1) KW - Automatic speech recognition KW - Fine-tuning KW - Low-resource languages KW - Data harvesting KW - Broadcast news data LK - https://researchspace.csir.co.za PY - 2024 T1 - Exploring ASR fine-tuning on limited domain specific data for low resource languages TI - Exploring ASR fine-tuning on limited domain specific data for low resource languages UR - http://hdl.handle.net/10204/13683 ER - en_ZA
dc.identifier.worklist 27440 en_US

