dc.contributor.author |
Zulu, PN
|
|
dc.contributor.author |
Botha, G
|
|
dc.contributor.author |
Barnard, E
|
|
dc.date.accessioned |
2012-01-31T10:12:31Z |
|
dc.date.available |
2012-01-31T10:12:31Z |
|
dc.date.issued |
2007 |
|
dc.identifier.citation |
Zulu, PN, Botha, G and Barnard, E. 2007. Orthographic measures of language distances between the official South African languages. CSIR Report (2007) |
en_US |
dc.identifier.uri |
http://www.docstoc.com/docs/19459727/Orthographic-measures-of-language-distances-between-the-official
|
|
dc.identifier.uri |
http://hdl.handle.net/10204/5547
|
|
dc.description |
Copyright: 2007 CSIR Report |
en_US |
dc.description.abstract |
Two methods for objectively measuring similarities and dissimilarities between the 11 official languages of South Africa are described. The first concerns the use of n-grams. The confusions between different languages in a text-based language identification system can be used to derive information on the relationships between the languages. Our classifier calculates n-gram statistics from text documents and then uses these statistics as features in classification. We show that the classification results of a validation test can be used as a similarity measure of the relationship between languages. Using the similarity measures, we were able to represent the relationships graphically. We also apply the Levenshtein distance measure to the orthographic word transcriptions from the 11 South African languages under investigation. Hierarchical clustering of the distances between the different languages shows the relationships between the languages in terms of regional groupings and closeness. Both multidimensional scaling and dendrogram analysis reveal results similar to well-known language groupings, and also suggest a finer level of detail on these relationships. |
en_US |
dc.language.iso |
en |
en_US |
dc.publisher |
CSIR |
en_US |
dc.subject |
Language distances |
en_US |
dc.subject |
Language identification systems |
en_US |
dc.subject |
Levenshtein distance |
en_US |
dc.subject |
Clustering |
en_US |
dc.subject |
n-gram |
en_US |
dc.subject |
Linguistics |
en_US |
dc.subject |
Literary studies |
en_US |
dc.subject |
South African languages |
en_US |
dc.title |
Orthographic measures of language distances between the official South African languages. |
en_US |
dc.type |
Report |
en_US |
dc.identifier.apacitation |
Zulu, P., Botha, G., & Barnard, E. (2007). Orthographic measures of language distances between the official South African languages. CSIR. Retrieved from http://hdl.handle.net/10204/5547 |
en_ZA |
dc.identifier.chicagocitation |
Zulu, PN, G Botha, and E Barnard. Orthographic measures of language distances between the official South African languages. CSIR, 2007. http://hdl.handle.net/10204/5547 |
en_ZA |
dc.identifier.vancouvercitation |
Zulu P, Botha G, Barnard E. Orthographic measures of language distances between the official South African languages. 2007 [cited yyyy month dd]. Available from: http://hdl.handle.net/10204/5547 |
en_ZA |
dc.identifier.ris |
TY - Report
AU - Zulu, PN
AU - Botha, G
AU - Barnard, E
AB - Two methods for objectively measuring similarities and dissimilarities between the 11 official languages of South Africa are described. The first concerns the use of n-grams. The confusions between different languages in a text-based language identification system can be used to derive information on the relationships between the languages. Our classifier calculates n-gram statistics from text documents and then uses these statistics as features in classification. We show that the classification results of a validation test can be used as a similarity measure of the relationship between languages. Using the similarity measures, we were able to represent the relationships graphically. We also apply the Levenshtein distance measure to the orthographic word transcriptions from the 11 South African languages under investigation. Hierarchical clustering of the distances between the different languages shows the relationships between the languages in terms of regional groupings and closeness. Both multidimensional scaling and dendrogram analysis reveal results similar to well-known language groupings, and also suggest a finer level of detail on these relationships.
DA - 2007
DB - ResearchSpace
DP - CSIR
KW - Language distances
KW - Language identification systems
KW - Levenshtein distance
KW - Clustering
KW - n-gram
KW - Linguistics
KW - Literary studies
KW - South African languages
LK - https://researchspace.csir.co.za
PY - 2007
T1 - Orthographic measures of language distances between the official South African languages
TI - Orthographic measures of language distances between the official South African languages
UR - http://hdl.handle.net/10204/5547
ER -
|
en_ZA |