Influence of input matrix representation on topic modelling performance

De Waal, A; Barnard, E

dc.contributor.author	De Waal, A
dc.contributor.author	Barnard, E
dc.date.accessioned	2010-12-23T09:14:10Z
dc.date.available	2010-12-23T09:14:10Z
dc.date.issued	2010-11
dc.identifier.citation	De Waal, A and Barnard, E. 2010. Influence of input matrix representation on topic modelling performance. 21st Annual Symposium of the Pattern Recognition Association of South Africa (PRASA), Stellenbosch, South Africa, 22-23 November 2010, pp 6	en
dc.identifier.isbn	978-0-7992-2470-2
dc.identifier.uri	http://hdl.handle.net/10204/4712
dc.description	21st Annual Symposium of the Pattern Recognition Association of South Africa (PRASA), Stellenbosch, South Africa, 22-23 November 2010	en
dc.description.abstract	Topic models explain a collection of documents with a small set of distributions over terms. These distributions over terms define the topics. Topic models ignore the structure of documents and use a bag-of-words approach which relies solely on the frequency of words in the corpus. We challenge the bag-of-words assumption and propose a method to structure single words into concepts. In this way, the inherent meaning of the feature space is enriched by more descriptive concepts rather than single words. We turn to the field of natural language processing to find processes to structure words into concepts. In order to compare the performance of structured features with the bag-of-words approach, we sketch an evaluation framework that accommodates different feature dimension sizes. This is in contrast with existing methods such as perplexity, which depend on the size of the vocabulary modelled and can therefore not be used to compare models which use different input feature sets. We use a stability-based validation index to measure a model’s ability to replicate similar solutions of independent data sets generated from the same probabilistic source. Stability-based validation acts more consistently across feature dimensions than perplexity or information-theoretic measures.	en
dc.language.iso	en	en
dc.publisher	PRASA 2010	en
dc.relation.ispartofseries	Conference Paper	en
dc.subject	Bag-of-words approach	en
dc.subject	Descriptive concepts	en
dc.subject	Natural language processing	en
dc.subject	Input matrix representation	en
dc.subject	PRASA 2010	en
dc.title	Influence of input matrix representation on topic modelling performance	en
dc.type	Conference Presentation	en
dc.identifier.apacitation	De Waal, A., & Barnard, E. (2010). Influence of input matrix representation on topic modelling performance. PRASA 2010. http://hdl.handle.net/10204/4712	en_ZA
dc.identifier.chicagocitation	De Waal, A, and E Barnard. "Influence of input matrix representation on topic modelling performance." (2010): http://hdl.handle.net/10204/4712	en_ZA
dc.identifier.vancouvercitation	De Waal A, Barnard E, Influence of input matrix representation on topic modelling performance; PRASA 2010; 2010. http://hdl.handle.net/10204/4712 .	en_ZA
dc.identifier.ris	TY - Conference Presentation AU - De Waal, A AU - Barnard, E AB - Topic models explain a collection of documents with a small set of distributions over terms. These distributions over terms define the topics. Topic models ignore the structure of documents and use a bag-of-words approach which relies solely on the frequency of words in the corpus. We challenge the bag-of-words assumption and propose a method to structure single words into concepts. In this way, the inherent meaning of the feature space is enriched by more descriptive concepts rather than single words. We turn to the field of natural language processing to find processes to structure words into concepts. In order to compare the performance of structured features with the bag-of-words approach, we sketch an evaluation framework that accommodates different feature dimension sizes. This is in contrast with existing methods such as perplexity, which depend on the size of the vocabulary modelled and can therefore not be used to compare models which use different input feature sets. We use a stability-based validation index to measure a model’s ability to replicate similar solutions of independent data sets generated from the same probabilistic source. Stability-based validation acts more consistently across feature dimensions than perplexity or information-theoretic measures. DA - 2010-11 DB - ResearchSpace DP - CSIR KW - Bag-of-words approach KW - Descriptive concepts KW - Natural language processing KW - Input matrix representation KW - PRASA 2010 LK - https://researchspace.csir.co.za PY - 2010 SM - 978-0-7992-2470-2 T1 - Influence of input matrix representation on topic modelling performance TI - Influence of input matrix representation on topic modelling performance UR - http://hdl.handle.net/10204/4712 ER -	en_ZA
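
The abstract contrasts a bag-of-words input matrix with one built from structured concepts, and notes that perplexity depends on the size of the vocabulary modelled. The sketch below is an illustration only and is not taken from the paper: the toy corpus, the concept list, and the helper names (doc_term_matrix, merge_concepts, uniform_perplexity) are all assumptions, and joining adjacent word pairs is a simple stand-in for whatever natural language processing the authors use to form concepts. It shows that the two representations yield input matrices with different feature dimensions, and that even a model assigning uniform probability to every term has a perplexity equal to the vocabulary size, so perplexities computed over different feature sets are not directly comparable.

```python
import math
from collections import Counter


def doc_term_matrix(docs):
    """Return (vocabulary, matrix); matrix[d][t] counts term t in document d."""
    vocab = sorted({term for doc in docs for term in doc})
    index = {term: i for i, term in enumerate(vocab)}
    matrix = []
    for doc in docs:
        row = [0] * len(vocab)
        for term, count in Counter(doc).items():
            row[index[term]] = count
        matrix.append(row)
    return vocab, matrix


def merge_concepts(tokens, concepts):
    """Greedily join adjacent word pairs listed in `concepts` into one feature.

    Assumption: a hand-written pair list stands in for a real concept extractor.
    """
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in concepts:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def uniform_perplexity(vocab_size):
    """Perplexity of a model that assigns probability 1/V to every term."""
    return math.exp(-math.log(1.0 / vocab_size))


# Hypothetical toy corpus and concept list, for illustration only.
raw_docs = [
    "topic model for natural language processing".split(),
    "natural language processing with a topic model".split(),
]
concepts = {("topic", "model"), ("natural", "language")}

word_vocab, word_matrix = doc_term_matrix(raw_docs)
concept_docs = [merge_concepts(doc, concepts) for doc in raw_docs]
concept_vocab, concept_matrix = doc_term_matrix(concept_docs)

# The two input matrices have different numbers of columns (features).
print(len(word_vocab), len(concept_vocab))
# The uniform baseline perplexity tracks the vocabulary size in each case.
print(uniform_perplexity(len(word_vocab)), uniform_perplexity(len(concept_vocab)))
```

Because the baseline perplexity shifts with the number of features, a lower perplexity on the concept matrix would not by itself show a better model; this is the motivation the abstract gives for a validation index that behaves consistently across feature dimensions.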

Files in this item

Name: de Waal_2010.pdf

Size: 1.008Mb

Format: PDF

This item appears in the following Collection(s)

Conference Publications
