ResearchSpace

Setswana tokenisation and computational verb morphology: facing the challenge of a disjunctive orthography

dc.contributor.author Pretorius, R
dc.contributor.author Berg, A
dc.contributor.author Pretorius, L
dc.contributor.author Viljoen, B
dc.date.accessioned 2010-04-18T12:53:07Z
dc.date.available 2010-04-18T12:53:07Z
dc.date.issued 2009-03
dc.identifier.citation Pretorius, R, Berg, A, Pretorius, L et al 2009. Setswana tokenisation and computational verb morphology: facing the challenge of a disjunctive orthography. EACL Workshop on Language Technologies for African Languages, Athens, Greece, 31 March 2009, pp 66-73 en
dc.identifier.isbn 1-932432-25-6
dc.identifier.uri http://delivery.acm.org/10.1145/1570000/1564522/p66-pretorius.pdf?key1=1564522&key2=5403951721&coll=GUIDE&dl=GUIDE&CFID=86103539&CFTOKEN=75582888
dc.identifier.uri http://hdl.handle.net/10204/4030
dc.description Copyright: 2009 Association for Computational Linguistics. EACL Workshop on Language Technologies for African Languages, Athens, Greece, 31 March 2009 en
dc.description.abstract Setswana, a Bantu language in the Sotho group, is one of the eleven official languages of South Africa. The language is characterised by a disjunctive orthography, which mainly affects the important word category of verbs. In particular, verbal prefixal morphemes are usually written disjunctively, while suffixal morphemes follow a conjunctive writing style. Setswana tokenisation therefore cannot be based solely on whitespace, as it can in many alphabetic, segmented languages, including the conjunctively written Nguni group of South African Bantu languages. This paper shows how two tokeniser transducers and a finite-state (rule-based) morphological analyser may be combined to effectively solve the Setswana tokenisation problem. The approach has the important advantage of bringing the processing of Setswana beyond the morphological analysis level in line with what is appropriate for the Nguni languages. This means that the challenge of the disjunctive orthography is met at the tokenisation/morphological analysis level and does not in principle propagate to subsequent levels of analysis such as POS tagging and shallow parsing. Indeed, the approach ensures that an aspect such as orthography does not obfuscate sound linguistics and, ultimately, proper semantic analysis, which remains the final aim of linguistic analysis and therefore also of computational linguistic analysis. en
dc.language.iso en en
dc.publisher Association for Computational Linguistics en
dc.subject Setswana en
dc.subject Setswana tokenisation en
dc.subject Computational verb morphology en
dc.subject Disjunctive orthography en
dc.subject Verbal prefixal morphemes en
dc.subject Suffixal morphemes en
dc.subject POS tagging en
dc.subject Shallow parsing en
dc.title Setswana tokenisation and computational verb morphology: facing the challenge of a disjunctive orthography en
dc.type Conference Presentation en
dc.identifier.apacitation Pretorius, R., Berg, A., Pretorius, L., & Viljoen, B. (2009). Setswana tokenisation and computational verb morphology: facing the challenge of a disjunctive orthography. Association for Computational Linguistics. http://hdl.handle.net/10204/4030 en_ZA
dc.identifier.chicagocitation Pretorius, R, A Berg, L Pretorius, and B Viljoen. "Setswana tokenisation and computational verb morphology: facing the challenge of a disjunctive orthography." (2009): http://hdl.handle.net/10204/4030 en_ZA
dc.identifier.vancouvercitation Pretorius R, Berg A, Pretorius L, Viljoen B, Setswana tokenisation and computational verb morphology: facing the challenge of a disjunctive orthography; Association for Computational Linguistics; 2009. http://hdl.handle.net/10204/4030 . en_ZA
dc.identifier.ris TY - Conference Presentation
AU - Pretorius, R
AU - Berg, A
AU - Pretorius, L
AU - Viljoen, B
AB - Setswana, a Bantu language in the Sotho group, is one of the eleven official languages of South Africa. The language is characterised by a disjunctive orthography, mainly affecting the important word category of verbs. In particular, verbal prefixal morphemes are usually written disjunctively, while suffixal morphemes follow a conjunctive writing style. Therefore, Setswana tokenisation cannot be based solely on whitespace, as is the case in many alphabetic, segmented languages, including the conjunctively written Nguni group of South African Bantu languages. This paper shows how a combination of two tokeniser transducers and a finite-state (rule-based) morphological analyser may be combined to effectively solve the Setswana tokenisation problem. The approach has the important advantage of bringing the processing of Setswana beyond the morphological analysis level in line with what is appropriate for the Nguni languages. This means that the challenge of the disjunctive orthography is met at the tokenisation/morphological analysis level and does not in principle propagate to subsequent levels of analysis such as POS tagging and shallow parsing, etc. Indeed, the approach ensures that an aspect such as orthography does not obfuscate sound linguistics and, ultimately, proper semantic analysis, which remains the ultimate aim of linguistic analysis and therefore also computational linguistic analysis.
DA - 2009-03
DB - ResearchSpace
DP - CSIR
KW - Setswana
KW - Setswana tokenisation
KW - Computational verb morphology
KW - Disjunctive orthography
KW - Verbal prefixal morphemes
KW - Suffixal morphemes
KW - POS tagging
KW - Shallow parsing
LK - https://researchspace.csir.co.za
PY - 2009
SM - 1-932432-25-6
T1 - Setswana tokenisation and computational verb morphology: facing the challenge of a disjunctive orthography
TI - Setswana tokenisation and computational verb morphology: facing the challenge of a disjunctive orthography
UR - http://hdl.handle.net/10204/4030
ER - en_ZA

