ResearchSpace

Investigating the feasibility of harvesting broadcast speech data to develop resources for South African languages

dc.contributor.author Badenhorst, Jacob AC
dc.contributor.author De Wet, Febe
dc.date.accessioned 2022-05-04T13:22:38Z
dc.date.available 2022-05-04T13:22:38Z
dc.date.issued 2021-11
dc.identifier.citation Badenhorst, J.A. & De Wet, F. 2021. Investigating the feasibility of harvesting broadcast speech data to develop resources for South African languages. http://hdl.handle.net/10204/12380 . en_ZA
dc.identifier.uri https://doi.org/10.55492/dhasa.v3i03.3820
dc.identifier.uri http://hdl.handle.net/10204/12380
dc.description.abstract Sufficient target language data remains an important factor in the development of automatic speech recognition (ASR) systems. For instance, the substantial improvement in acoustic modelling that deep architectures have recently achieved for well-resourced languages requires vast amounts of speech data. Moreover, the acoustic models in state-of-the-art ASR systems that generalise well across different domains are usually trained on various corpora, not just one or two. Diverse corpora containing hundreds of hours of speech data are not available for resource-limited languages. In this paper, we investigate the feasibility of creating additional speech resources for the official languages of South Africa by employing a semi-automatic data harvesting procedure. Factorised time-delay neural network models were used to generate phone-level transcriptions of speech data harvested from different domains. en_US
dc.format Fulltext en_US
dc.language.iso en en_US
dc.relation.uri https://upjournals.up.ac.za/index.php/dhasa/article/view/3820 en_US
dc.source Proceedings of the International Conference of the Digital Humanities Association of Southern Africa. 2nd workshop on Resources for African Indigenous Language (RAIL), Virtual, 29 November - 3 December 2021 en_US
dc.subject Automatic speech recognition en_US
dc.subject Data harvesting en_US
dc.subject Domain adaptation en_US
dc.subject Low-resource languages en_US
dc.subject TDNN-F en_US
dc.title Investigating the feasibility of harvesting broadcast speech data to develop resources for South African languages en_US
dc.type Conference Presentation en_US
dc.description.pages 9pp en_US
dc.description.note Presentation included in the Proceedings of the International Conference of the Digital Humanities Association of Southern Africa, 2nd workshop on Resources for African Indigenous Language (RAIL), Virtual, 29 November - 3 December 2021 en_US
dc.description.cluster Next Generation Enterprises & Institutions en_US
dc.description.impactarea Voice Computing en_US
dc.identifier.apacitation Badenhorst, J. A., & De Wet, F. (2021). Investigating the feasibility of harvesting broadcast speech data to develop resources for South African languages. http://hdl.handle.net/10204/12380 en_ZA
dc.identifier.chicagocitation Badenhorst, Jacob AC, and Febe De Wet. "Investigating the feasibility of harvesting broadcast speech data to develop resources for South African languages." Proceedings of the International Conference of the Digital Humanities Association of Southern Africa. 2nd workshop on Resources for African Indigenous Language (RAIL), Virtual, 29 November - 3 December 2021 (2021): http://hdl.handle.net/10204/12380 en_ZA
dc.identifier.vancouvercitation Badenhorst JA, De Wet F. Investigating the feasibility of harvesting broadcast speech data to develop resources for South African languages; 2021. http://hdl.handle.net/10204/12380 . en_ZA
dc.identifier.ris
TY - Conference Presentation
AU - Badenhorst, Jacob AC
AU - De Wet, Febe
AB - Sufficient target language data remains an important factor in the development of automatic speech recognition (ASR) systems. For instance, the substantial improvement in acoustic modelling that deep architectures have recently achieved for well-resourced languages requires vast amounts of speech data. Moreover, the acoustic models in state-of-the-art ASR systems that generalise well across different domains are usually trained on various corpora, not just one or two. Diverse corpora containing hundreds of hours of speech data are not available for resource-limited languages. In this paper, we investigate the feasibility of creating additional speech resources for the official languages of South Africa by employing a semi-automatic data harvesting procedure. Factorised time-delay neural network models were used to generate phone-level transcriptions of speech data harvested from different domains.
DA - 2021-11
DB - ResearchSpace
DP - CSIR
J1 - Proceedings of the International Conference of the Digital Humanities Association of Southern Africa. 2nd workshop on Resources for African Indigenous Language (RAIL), Virtual, 29 November - 3 December 2021
KW - Automatic speech recognition
KW - Data harvesting
KW - Domain adaptation
KW - Low-resource languages
KW - TDNN-F
LK - https://researchspace.csir.co.za
PY - 2021
T1 - Investigating the feasibility of harvesting broadcast speech data to develop resources for South African languages
TI - Investigating the feasibility of harvesting broadcast speech data to develop resources for South African languages
UR - http://hdl.handle.net/10204/12380
ER - en_ZA
dc.identifier.worklist 25227 en_US

