ResearchSpace

Prototype-based active learning for lemmatization


dc.contributor.author Daelemans, W
dc.contributor.author Groenewald, HJ
dc.contributor.author Van Huyssteen, GB
dc.date.accessioned 2009-10-12T07:11:47Z
dc.date.available 2009-10-12T07:11:47Z
dc.date.issued 2009-09
dc.identifier.citation Daelemans, W, Groenewald, HJ and Van Huyssteen, GB. 2009. Prototype-based active learning for lemmatization. Proceedings of the RANLP'2009 International Conference Recent Advances in Natural Language Processing. Borovets, Bulgaria, 14-16 September, 2009. pp 65-70 en
dc.identifier.issn 1313-8502
dc.identifier.uri http://hdl.handle.net/10204/3646
dc.description This is the author's version of the work. The definitive version was published in the Proceedings of the RANLP'2009 International Conference Recent Advances in Natural Language Processing. Borovets, Bulgaria, 14-16 September, 2009. pp 65-70 en
dc.description.abstract Annotation of training data for machine learning is often a laborious and costly process. In Active Learning (AL), criteria are investigated that order the unannotated data so that the instances potentially contributing most to the speed of learning are annotated first. Within this context, the researchers explore a new approach that uses prototypicality as a criterion for selecting the instances that will act as training data, in order to optimize prediction accuracy. In parallel with the prototype-based active classification (PBAC) approach of Cebron & Berthold (2009), they investigate whether the basic PBAC assumption holds for linguistic data. The NLP task addressed is lemmatization, the reduction of inflected word forms to their base forms. Prototypicality is operationalized as features (i.e. word frequency and word length) of the already available training data items and combined with a measure of uncertainty (entropy). The paper shows that selecting less prototypical instances first yields better performance than random selection or state-of-the-art AL methods. The researchers argue that this improvement is possible because language processing tasks have highly disjunctive instance spaces, with few regularities and many irregularities. en
dc.language.iso en en
dc.subject Active learning en
dc.subject Prototype-based active classification en
dc.subject PBAC en
dc.subject Lemmatization en
dc.subject Prototype theory en
dc.subject Afrikaans lemmatization en
dc.subject Natural language processing en
dc.subject Recent Advances in Natural Language Processing Conference 2009 en
dc.subject RANLP 2009 en
dc.subject Prototypicality en
dc.subject Text technology en
dc.subject Linguistic data en
dc.title Prototype-based active learning for lemmatization en
dc.type Conference Presentation en
dc.identifier.apacitation Daelemans, W., Groenewald, H., & Van Huyssteen, G. (2009). Prototype-based active learning for lemmatization. http://hdl.handle.net/10204/3646 en_ZA
dc.identifier.chicagocitation Daelemans, W, HJ Groenewald, and GB Van Huyssteen. "Prototype-based active learning for lemmatization." (2009): http://hdl.handle.net/10204/3646 en_ZA
dc.identifier.vancouvercitation Daelemans W, Groenewald H, Van Huyssteen G, Prototype-based active learning for lemmatization; 2009. http://hdl.handle.net/10204/3646 . en_ZA
dc.identifier.ris TY - Conference Presentation AU - Daelemans, W AU - Groenewald, HJ AU - Van Huyssteen, GB AB - Annotation of training data for machine learning is often a laborious and costly process. In Active Learning (AL), criteria are investigated that order the unannotated data so that the instances potentially contributing most to the speed of learning are annotated first. Within this context, the researchers explore a new approach that uses prototypicality as a criterion for selecting the instances that will act as training data, in order to optimize prediction accuracy. In parallel with the prototype-based active classification (PBAC) approach of Cebron & Berthold (2009), they investigate whether the basic PBAC assumption holds for linguistic data. The NLP task addressed is lemmatization, the reduction of inflected word forms to their base forms. Prototypicality is operationalized as features (i.e. word frequency and word length) of the already available training data items and combined with a measure of uncertainty (entropy). The paper shows that selecting less prototypical instances first yields better performance than random selection or state-of-the-art AL methods. The researchers argue that this improvement is possible because language processing tasks have highly disjunctive instance spaces, with few regularities and many irregularities. DA - 2009-09 DB - ResearchSpace DP - CSIR KW - Active learning KW - Prototype-based active classification KW - PBAC KW - Lemmatization KW - Prototype theory KW - Afrikaans lemmatization KW - Natural language processing KW - Recent Advances in Natural Language Processing Conference 2009 KW - RANLP 2009 KW - Prototypicality KW - Text technology KW - Linguistic data LK - https://researchspace.csir.co.za PY - 2009 SM - 1313-8502 T1 - Prototype-based active learning for lemmatization TI - Prototype-based active learning for lemmatization UR - http://hdl.handle.net/10204/3646 ER - en_ZA
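
The abstract above describes ranking unannotated word forms by combining entropy-based uncertainty with inverse prototypicality, where prototypicality is operationalized via word frequency and word length. The sketch below illustrates one way such a selection criterion could be implemented; it is not the authors' implementation, and the function names, the equal weighting of frequency and length, and the alpha mixing parameter are illustrative assumptions.

import math
from collections import Counter

def prototypicality(word, freq, max_freq, max_len):
    """Hypothetical prototypicality score: frequent, short word forms are
    treated as more prototypical (both components scaled to [0, 1])."""
    freq_score = freq / max_freq
    length_score = 1.0 - (len(word) / max_len)
    return 0.5 * freq_score + 0.5 * length_score

def entropy(class_probs):
    """Shannon entropy of a classifier's predicted class distribution."""
    return -sum(p * math.log2(p) for p in class_probs if p > 0)

def rank_for_annotation(pool, class_probs, alpha=0.5):
    """Order the unannotated pool so that uncertain, *less* prototypical
    word forms come first, mirroring the selection criterion argued for in
    the paper (the scaling and weighting here are assumptions).

    pool        -- list of word-form tokens (strings, may repeat)
    class_probs -- dict mapping word form -> list of class probabilities
    """
    freqs = Counter(pool)
    max_freq = max(freqs.values())
    max_len = max(len(w) for w in pool)

    def score(word):
        u = entropy(class_probs[word])                        # uncertainty
        p = prototypicality(word, freqs[word], max_freq, max_len)
        return alpha * u + (1 - alpha) * (1.0 - p)            # rare/atypical + uncertain first

    return sorted(set(pool), key=score, reverse=True)

With this scoring, low-frequency, long word forms on which the classifier is uncertain surface at the top of the annotation queue, reflecting the paper's finding that less prototypical instances are the most informative to annotate first.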

