A comparison of different calculations for N-Gram similarities in a spelling corrector for mobile instant messaging language

Butgereit, L; Botha, RA

dc.contributor.author	Butgereit, L
dc.contributor.author	Botha, RA
dc.date.accessioned	2013-12-12T07:34:27Z
dc.date.available	2013-12-12T07:34:27Z
dc.date.issued	2013-10
dc.identifier.citation	Butgereit, L and Botha, R.A. 2013. A comparison of different calculations for N-Gram similarities in a spelling corrector for mobile instant messaging language. In: South African Institute for Computer Scientists and Information Technologists (SAICSIT) 2013, 7-9 October 2013, East London, South Africa	en_US
dc.identifier.uri	http://delivery.acm.org/10.1145/2520000/2513458/p1-butgereit.pdf?ip=146.64.81.115&id=2513458&acc=ACTIVE%20SERVICE&key=C2716FEBFA981EF16F26307A25115533B16AE41C93EF03EC&CFID=387516605&CFTOKEN=24848873&__acm__=1386666358_3b0c923db5c180fedd0a187906ec5273
dc.identifier.uri	http://hdl.handle.net/10204/7119
dc.description	South African Institute for Computer Scientists and Information Technologists (SAICSIT) 2013, 7-9 October 2013, East London, South Africa. Abstract only attached.	en_US
dc.description.abstract	Mobile Instant Messaging (MIM) systems have produced a new convention in writing where vowels are often omitted, where new suffixes have appeared, where numerals and symbols often appear in the place of letters which have a similar shape or sound, and where words are often spelled phonetically. A word such as mister may be spelled numerous ways including mista and mistr (with new suffixes). When both participants in a MIM conversation understand these new spelling conventions, there is no problem. But in a situation such as automated topic spotting, it is advantageous to attempt to associate these new spellings (mista and mistr) back to the original word (mister). This paper describes work in creating a spelling corrector for MIM conversations for use after stop words have been removed from a conversation, after words have been stemmed, and after double letters have been collapsed to single letters. Four different similarity calculations (Jaccard, Sørensen-Dice, Cosine, and Overlap) are investigated and tested with historical data from the Dr Math mobile tutoring environment. This research found that the Overlap similarity calculation was the least accurate of the four measured. In situations where the lengths of the various words were the same, the Sørensen-Dice and Cosine similarity calculations were identical. Jaccard and Sørensen-Dice worked equally well; however, they required different numerical cut-off values for misspelled words.	en_US
dc.language.iso	en	en_US
dc.publisher	ACM Digital Library	en_US
dc.relation.ispartofseries	Workflow;11770
dc.subject	Algorithms	en_US
dc.subject	N-grams	en_US
dc.subject	Spelling	en_US
dc.subject	Dr Math	en_US
dc.subject	Mobile Instant Messaging	en_US
dc.subject	MIM	en_US
dc.title	A comparison of different calculations for N-Gram similarities in a spelling corrector for mobile instant messaging language	en_US
dc.type	Conference Presentation	en_US
dc.identifier.apacitation	Butgereit, L., & Botha, R. A. (2013). A comparison of different calculations for N-Gram similarities in a spelling corrector for mobile instant messaging language. ACM Digital Library. http://hdl.handle.net/10204/7119	en_ZA
dc.identifier.chicagocitation	Butgereit, L, and RA Botha. "A comparison of different calculations for N-Gram similarities in a spelling corrector for mobile instant messaging language." (2013): http://hdl.handle.net/10204/7119	en_ZA
dc.identifier.vancouvercitation	Butgereit L, Botha RA. A comparison of different calculations for N-Gram similarities in a spelling corrector for mobile instant messaging language; ACM Digital Library; 2013. http://hdl.handle.net/10204/7119	en_ZA
dc.identifier.ris	TY - Conference Presentation
AU - Butgereit, L
AU - Botha, RA
AB - Mobile Instant Messaging (MIM) systems have produced a new convention in writing where vowels are often omitted, where new suffixes have appeared, where numerals and symbols often appear in the place of letters which have a similar shape or sound, and where words are often spelled phonetically. A word such as mister may be spelled numerous ways including mista and mistr (with new suffixes). When both participants in a MIM conversation understand these new spelling conventions, there is no problem. But in a situation such as automated topic spotting, it is advantageous to attempt to associate these new spellings (mista and mistr) back to the original word (mister). This paper describes work in creating a spelling corrector for MIM conversations for use after stop words have been removed from a conversation, after words have been stemmed, and after double letters have been collapsed to single letters. Four different similarity calculations (Jaccard, Sørensen-Dice, Cosine, and Overlap) are investigated and tested with historical data from the Dr Math mobile tutoring environment. This research found that the Overlap similarity calculation was the least accurate of the four measured. In situations where the lengths of the various words were the same, the Sørensen-Dice and Cosine similarity calculations were identical. Jaccard and Sørensen-Dice worked equally well; however, they required different numerical cut-off values for misspelled words.
DA - 2013-10
DB - ResearchSpace
DP - CSIR
KW - Algorithms
KW - N-grams
KW - Spelling
KW - Dr Math
KW - Mobile Instant Messaging
KW - MIM
LK - https://researchspace.csir.co.za
PY - 2013
T1 - A comparison of different calculations for N-Gram similarities in a spelling corrector for mobile instant messaging language
TI - A comparison of different calculations for N-Gram similarities in a spelling corrector for mobile instant messaging language
UR - http://hdl.handle.net/10204/7119
ER -	en_ZA
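
Illustrative sketch (not taken from the paper): the abstract names four n-gram similarity calculations and a preprocessing step that collapses double letters, but it does not give the n-gram length, any padding scheme, or the cut-off values. The Python below assumes character bigrams held as sets, with no padding, and uses the standard set-based definitions of the four measures. The function names collapse_doubles, ngrams, and similarities are hypothetical.

```python
import math
import re


def collapse_doubles(word: str) -> str:
    """Collapse any run of a repeated character to one character."""
    return re.sub(r"(.)\1+", r"\1", word)


def ngrams(word: str, n: int = 2) -> set[str]:
    """Return the set of character n-grams of word.

    Assumption: n = 2 and no padding; the abstract does not state either.
    """
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def similarities(a: str, b: str, n: int = 2) -> dict[str, float]:
    """Compute the four set-based similarities between two words."""
    x, y = ngrams(a, n), ngrams(b, n)
    if not x or not y:
        # Words shorter than n have no n-grams; treat them as dissimilar.
        return {name: 0.0 for name in ("jaccard", "dice", "cosine", "overlap")}
    shared = len(x & y)
    return {
        "jaccard": shared / len(x | y),
        "dice": 2 * shared / (len(x) + len(y)),
        "cosine": shared / math.sqrt(len(x) * len(y)),
        "overlap": shared / min(len(x), len(y)),
    }


if __name__ == "__main__":
    for variant in ("mista", "mistr"):
        print(variant, similarities("mister", variant))
```

Under these assumptions, when both words yield the same number of distinct n-grams, the Sørensen-Dice and Cosine denominators coincide, which is consistent with the abstract's observation that the two measures were identical for words of equal length. The measures also land on different scales for the same word pair, which may be why different cut-off values were needed for Jaccard and Sørensen-Dice.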

Files in this item

Name: Butgereit3_2013_A ...

Size: 26.91Kb

Format: PDF

This item appears in the following Collection(s)

Conference Publications
