Two approaches to gathering text corpora from the WorldWideWeb

Botha, G; Barnard, E

Two approaches to gathering text corpora from the WorldWideWeb

http://hdl.handle.net/10204/5587

Abstract:

Many applications of pattern recognition to natural language processing require large text corpora in a specified language. For many of the languages of the world, such corpora are not readily available, but significant quantities of text are available on the World Wide Web. We describe and compare two approaches to gathering language-specific corpora from this resource, and show that the use of a commercial search engine as a first stage leads to good results.

Reference:

Botha, G and Barnard, E. Two approaches to gathering text corpora from the WorldWideWeb. Sixteenth Annual Symposium of the Pattern Recognition Association of South Africa, Langebaan, South Africa, 23-25 November 2005

Botha, G., & Barnard, E. (2005). Two approaches to gathering text corpora from the WorldWideWeb. PRASA. http://hdl.handle.net/10204/5587

Botha, G, and E Barnard. "Two approaches to gathering text corpora from the WorldWideWeb." (2005): http://hdl.handle.net/10204/5587

Botha G, Barnard E, Two approaches to gathering text corpora from the WorldWideWeb; PRASA; 2005. http://hdl.handle.net/10204/5587 .

Download RIS

Sixteenth Annual Symposium of the Pattern Recognition Association of South Africa, Langebaan, South Africa, 23-25 November 2005

Botha, G
Barnard, E

Nov 2005

Text corpora
Text collection
Web-crawling

Show full item record

Files in this item

Barnard_2005.pdf

This item appears in the following Collection(s)

Conference Publications

Browse

All of ResearchSpace
This Collection
- By Issue Date
- Authors
- Titles
- Subjects
- Publication Type
- Cluster
- Impact Area

Quick Links

Legislation and compliance

General Enquiries

Tel: + 27 12 841 2911
Email: callcentre@csir.co.za

Physical Address
Meiring Naudé Road
Brummeria
Pretoria
South Africa

Postal Address
PO Box 395
Pretoria 0001
South Africa

Social Connect

Resources on this site are free to download and reuse according to associated licensing provision. Please read the terms and conditions of usage of each resource.

Two approaches to gathering text corpora from the WorldWideWeb

Two approaches to gathering text corpora from the WorldWideWeb

This item appears in the following Collection(s)

Browse

All of ResearchSpace

This Collection

Quick Links

Legislation and compliance

General Enquiries

Social Connect