dc.contributor.author |
Daelemans, W
|
|
dc.contributor.author |
Groenewald, HJ
|
|
dc.contributor.author |
Van Huyssteen, GB
|
|
dc.date.accessioned |
2009-10-12T07:11:47Z |
|
dc.date.available |
2009-10-12T07:11:47Z |
|
dc.date.issued |
2009-09 |
|
dc.identifier.citation |
Daelemans, W, Groenewald, HJ and Van Huyssteen, GB. 2009. Prototype-based active learning for lemmatization. Proceedings of the RANLP'2009 International Conference Recent Advances in Natural Language Processing. Borovets, Bulgaria, 14-16 September, 2009. pp 65-70 |
en |
dc.identifier.issn |
1313-8502 |
|
dc.identifier.uri |
http://hdl.handle.net/10204/3646
|
|
dc.description |
This is the author's version of the work. The definitive version was published in the Proceedings of the RANLP'2009 International Conference Recent Advances in Natural Language Processing. Borovets, Bulgaria, 14-16 September, 2009. pp 65-70 |
en |
dc.description.abstract |
Annotating training data for machine learning is often laborious and costly. Active Learning (AL) investigates criteria for ordering unannotated data so that the instances likely to contribute most to the speed of learning are annotated first. Within this context, the researchers explore a new approach that uses prototypicality as a criterion for selecting training instances in order to optimize prediction accuracy. In parallel with the prototype-based active classification (PBAC) approach of Cebron & Berthold (2009), they investigate whether the basic PBAC assumption holds for linguistic data. The NLP task addressed is lemmatization, the reduction of inflected word forms to their base form. The paper operationalizes prototypicality as features (i.e. word frequency and word length) of the already available training data items, and combines this with a measure of uncertainty (entropy). It shows that selecting less prototypical instances first yields better performance than random selection or state-of-the-art AL methods. The researchers argue that this improvement is possible because language processing tasks have highly disjunctive instance spaces, as there are often few regularities and many irregularities. |
en |
dc.language.iso |
en |
en |
dc.subject |
Active learning |
en |
dc.subject |
Prototype-based active classification |
en |
dc.subject |
PBAC |
en |
dc.subject |
Lemmatization |
en |
dc.subject |
Prototype theory |
en |
dc.subject |
Afrikaans lemmatization |
en |
dc.subject |
Natural language processing |
en |
dc.subject |
Recent Advances in Natural Language Processing Conference 2009 |
en |
dc.subject |
RANLP 2009 |
en |
dc.subject |
Prototypicality |
en |
dc.subject |
Text technology |
en |
dc.subject |
Linguistic data |
en |
dc.title |
Prototype-based active learning for lemmatization |
en |
dc.type |
Conference Presentation |
en |
dc.identifier.apacitation |
Daelemans, W., Groenewald, H., & Van Huyssteen, G. (2009). Prototype-based active learning for lemmatization. http://hdl.handle.net/10204/3646 |
en_ZA |
dc.identifier.chicagocitation |
Daelemans, W, HJ Groenewald, and GB Van Huyssteen. "Prototype-based active learning for lemmatization." (2009): http://hdl.handle.net/10204/3646 |
en_ZA |
dc.identifier.vancouvercitation |
Daelemans W, Groenewald H, Van Huyssteen G. Prototype-based active learning for lemmatization; 2009. http://hdl.handle.net/10204/3646 |
en_ZA |
dc.identifier.ris |
TY - Conference Presentation
AU - Daelemans, W
AU - Groenewald, HJ
AU - Van Huyssteen, GB
AB - Annotating training data for machine learning is often laborious and costly. Active Learning (AL) investigates criteria for ordering unannotated data so that the instances likely to contribute most to the speed of learning are annotated first. Within this context, the researchers explore a new approach that uses prototypicality as a criterion for selecting training instances in order to optimize prediction accuracy. In parallel with the prototype-based active classification (PBAC) approach of Cebron & Berthold (2009), they investigate whether the basic PBAC assumption holds for linguistic data. The NLP task addressed is lemmatization, the reduction of inflected word forms to their base form. The paper operationalizes prototypicality as features (i.e. word frequency and word length) of the already available training data items, and combines this with a measure of uncertainty (entropy). It shows that selecting less prototypical instances first yields better performance than random selection or state-of-the-art AL methods. The researchers argue that this improvement is possible because language processing tasks have highly disjunctive instance spaces, as there are often few regularities and many irregularities.
DA - 2009-09
DB - ResearchSpace
DP - CSIR
KW - Active learning
KW - Prototype-based active classification
KW - PBAC
KW - Lemmatization
KW - Prototype theory
KW - Afrikaans lemmatization
KW - Natural language processing
KW - Recent Advances in Natural Language Processing Conference 2009
KW - RANLP 2009
KW - Prototypicality
KW - Text technology
KW - Linguistic data
LK - https://researchspace.csir.co.za
PY - 2009
SM - 1313-8502
T1 - Prototype-based active learning for lemmatization
TI - Prototype-based active learning for lemmatization
UR - http://hdl.handle.net/10204/3646
ER -
|
en_ZA |