dc.contributor.author |
De Waal, A
|
|
dc.contributor.author |
Barnard, E
|
|
dc.date.accessioned |
2010-12-23T09:14:10Z |
|
dc.date.available |
2010-12-23T09:14:10Z |
|
dc.date.issued |
2010-11 |
|
dc.identifier.citation |
De Waal, A and Barnard, E. 2010. Influence of input matrix representation on topic modelling performance. 21st Annual Symposium of the Pattern Recognition Association of South Africa (PRASA), Stellenbosch, South Africa, 22-23 November 2010, pp 6 |
en |
dc.identifier.isbn |
978-0-7992-2470-2 |
|
dc.identifier.uri |
http://hdl.handle.net/10204/4712
|
|
dc.description |
21st Annual Symposium of the Pattern Recognition Association of South Africa (PRASA), Stellenbosch, South Africa, 22-23 November 2010 |
en |
dc.description.abstract |
Topic models explain a collection of documents with a small set of distributions over terms. These distributions over terms define the topics. Topic models ignore the structure of documents and use a bag-of-words approach which relies solely on the frequency of words in the corpus. We challenge the bag-of-words assumption and propose a method to structure single words into concepts. In this way, the inherent meaning of the feature space is enriched by more descriptive concepts rather than single words. We turn to the field of natural language processing to find processes to structure words into concepts. In order to compare the performance of structured features with the bag-of-words approach, we sketch an evaluation framework that accommodates different feature dimension sizes. This is in contrast with existing methods such as perplexity, which depend on the size of the vocabulary modelled and can therefore not be used to compare models which use different input feature sets. We use a stability-based validation index to measure a model’s ability to replicate similar solutions of independent data sets generated from the same probabilistic source. Stability-based validation acts more consistently across feature dimensions than perplexity or information-theoretic measures. |
en |
dc.language.iso |
en |
en |
dc.publisher |
PRASA 2010 |
en |
dc.relation.ispartofseries |
Conference Paper |
en |
dc.subject |
Bag-of-words approach |
en |
dc.subject |
Descriptive concepts |
en |
dc.subject |
Natural language processing |
en |
dc.subject |
Input matrix representation |
en |
dc.subject |
PRASA 2010 |
en |
dc.title |
Influence of input matrix representation on topic modelling performance |
en |
dc.type |
Conference Presentation |
en |
dc.identifier.apacitation |
De Waal, A., & Barnard, E. (2010). Influence of input matrix representation on topic modelling performance. PRASA 2010. http://hdl.handle.net/10204/4712 |
en_ZA |
dc.identifier.chicagocitation |
De Waal, A, and E Barnard. "Influence of input matrix representation on topic modelling performance." (2010): http://hdl.handle.net/10204/4712 |
en_ZA |
dc.identifier.vancouvercitation |
De Waal A, Barnard E. Influence of input matrix representation on topic modelling performance; PRASA 2010; 2010. http://hdl.handle.net/10204/4712 |
en_ZA |
dc.identifier.ris |
TY - Conference Presentation
AU - De Waal, A
AU - Barnard, E
AB - Topic models explain a collection of documents with a small set of distributions over terms. These distributions over terms define the topics. Topic models ignore the structure of documents and use a bag-of-words approach which relies solely on the frequency of words in the corpus. We challenge the bag-of-words assumption and propose a method to structure single words into concepts. In this way, the inherent meaning of the feature space is enriched by more descriptive concepts rather than single words. We turn to the field of natural language processing to find processes to structure words into concepts. In order to compare the performance of structured features with the bag-of-words approach, we sketch an evaluation framework that accommodates different feature dimension sizes. This is in contrast with existing methods such as perplexity, which depend on the size of the vocabulary modelled and can therefore not be used to compare models which use different input feature sets. We use a stability-based validation index to measure a model’s ability to replicate similar solutions of independent data sets generated from the same probabilistic source. Stability-based validation acts more consistently across feature dimensions than perplexity or information-theoretic measures.
DA - 2010-11
DB - ResearchSpace
DP - CSIR
KW - Bag-of-words approach
KW - Descriptive concepts
KW - Natural language processing
KW - Input matrix representation
KW - PRASA 2010
LK - https://researchspace.csir.co.za
PY - 2010
SM - 978-0-7992-2470-2
T1 - Influence of input matrix representation on topic modelling performance
TI - Influence of input matrix representation on topic modelling performance
UR - http://hdl.handle.net/10204/4712
ER -
|
en_ZA |