Two methods for objectively measuring similarities and dissimilarities between the 11 official languages of South Africa are described. The first uses n-grams: the confusions between languages in a text-based language identification system can be used to derive information on the relationships between them. Our classifier calculates n-gram statistics from text documents and uses these statistics as features in classification. We show that the classification results of a validation test can serve as a similarity measure between languages, and we use these measures to represent the relationships graphically. The second method applies the Levenshtein distance measure to orthographic word transcriptions from the 11 languages under investigation. Hierarchical clustering of the distances between the languages shows their relationships in terms of regional groupings and closeness. Both multidimensional scaling and dendrogram analysis reveal results similar to well-known language groupings, and also suggest a finer level of detail on these relationships.
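As an illustration of the Levenshtein-based measure mentioned in the abstract, the following is a minimal Python sketch, not the report's implementation. The `levenshtein` function is the standard edit distance. The helper `mean_normalized_distance`, its length normalization, and the assumption that the two word lists are aligned item by item are hypothetical choices made for this example; the abstract does not state how word-level distances are aggregated into a language-level distance.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb (or match)
            ))
        previous = current
    return previous[-1]


def mean_normalized_distance(words_a, words_b):
    """Hypothetical aggregation (an assumption, not taken from the report):
    average of edit distances between paired words, each divided by the
    length of the longer word so that values lie in [0, 1]."""
    if len(words_a) != len(words_b) or not words_a:
        raise ValueError("word lists must be non-empty and of equal length")
    total = 0.0
    for wa, wb in zip(words_a, words_b):
        longest = max(len(wa), len(wb))
        total += levenshtein(wa, wb) / longest if longest else 0.0
    return total / len(words_a)


if __name__ == "__main__":
    # Toy strings only; these are not data from the report.
    print(levenshtein("kitten", "sitting"))
    print(mean_normalized_distance(["abc", "abcd"], ["abd", "abcd"]))
```

Computing such a distance for every pair of languages gives a symmetric distance matrix, which could then be passed to a hierarchical clustering routine (for example `scipy.cluster.hierarchy.linkage`) to draw a dendrogram of the kind the abstract describes.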
Reference:
Zulu, P.N., Botha, G. and Barnard, E. 2007. Orthographic measures of language distances between the official South African languages. CSIR Report. http://hdl.handle.net/10204/5547