Annotating training data for machine learning is often laborious and costly. Active Learning (AL) investigates criteria for ordering unannotated data so that the instances likely to contribute most to the speed of learning are annotated first. Within this context, the researchers explore a new approach that uses prototypicality as a criterion for selecting training instances in order to optimize prediction accuracy. In parallel with the prototype-based active classification (PBAC) approach of Cebron & Berthold (2009), they investigate whether the basic PBAC assumption holds for linguistic data. The NLP task addressed is lemmatization, the reduction of inflected word forms to their base form. The approach operationalizes prototypicality through features (i.e. word frequency and word length) of the already available training data items, and combines this with a measure of uncertainty (entropy). The paper shows that selecting less prototypical instances first yields better performance than random selection or state-of-the-art AL methods. The researchers argue that this improvement is possible because language processing tasks have highly disjunctive instance spaces, often with few regularities and many irregularities.
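The selection idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how frequency, length and entropy are weighted or combined, so the equal weighting, the assumption that frequent and short words count as more prototypical, and all function names below are hypothetical choices made only for illustration.

```python
import math
from typing import Dict, List


def normalized_entropy(probs: List[float]) -> float:
    """Shannon entropy of a class distribution, scaled to [0, 1]."""
    if len(probs) < 2:
        return 0.0
    h = -sum(p * math.log2(p) for p in probs if p > 0)
    return h / math.log2(len(probs))


def prototypicality(word: str, freq: Dict[str, int],
                    max_freq: int, max_len: int) -> float:
    """Score in [0, 1]; higher means more prototypical.

    ASSUMPTION (not from the paper): frequent and short words are
    treated as more prototypical, with both features weighted equally.
    """
    freq_score = math.log1p(freq.get(word, 0)) / math.log1p(max(max_freq, 1))
    len_score = 1.0 - len(word) / max_len
    return 0.5 * freq_score + 0.5 * len_score


def rank_for_annotation(pool: List[str], freq: Dict[str, int],
                        class_probs: Dict[str, List[float]]) -> List[str]:
    """Order the unannotated pool: least prototypical and most uncertain first.

    ASSUMPTION: the two criteria are simply added together.
    """
    max_freq = max(freq.values(), default=1)
    max_len = max(len(w) for w in pool)

    def score(word: str) -> float:
        atypicality = 1.0 - prototypicality(word, freq, max_freq, max_len)
        return atypicality + normalized_entropy(class_probs[word])

    return sorted(pool, key=score, reverse=True)


if __name__ == "__main__":
    # Toy data, invented for this example only.
    pool = ["cats", "walked", "oxen"]
    freq = {"cats": 100, "walked": 50, "oxen": 2}
    class_probs = {"cats": [0.9, 0.1], "walked": [0.8, 0.2], "oxen": [0.5, 0.5]}
    print(rank_for_annotation(pool, freq, class_probs))
```

In this toy setup, a rare word whose predicted lemmatization class is maximally uncertain is placed ahead of a frequent word that the current classifier already handles confidently, which mirrors the "less prototypical instances first" ordering described in the abstract.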
Reference:
Daelemans, W, Groenewald, HJ and Van Huyssteen, GB. 2009. Prototype-based active learning for lemmatization. Proceedings of the RANLP'2009 International Conference Recent Advances in Natural Language Processing. Borovets, Bulgaria, 14-16 September, 2009. pp 65-70. http://hdl.handle.net/10204/3646
This is the author's version of the work. The definitive version was published in the Proceedings of the RANLP'2009 International Conference Recent Advances in Natural Language Processing. Borovets, Bulgaria, 14-16 September, 2009. pp 65-70