Many applications of pattern recognition to natural language processing require large text corpora in a specified language. For many of the languages of the world, such corpora are not readily available, but significant quantities of text are available on the World Wide Web. We describe and compare two approaches to gathering language-specific corpora from this resource, and show that the use of a commercial search engine as a first stage leads to good results.
Reference:
Botha, G. and Barnard, E. 2005. Two approaches to gathering text corpora from the WorldWideWeb. Sixteenth Annual Symposium of the Pattern Recognition Association of South Africa (PRASA), Langebaan, South Africa, 23-25 November 2005. http://hdl.handle.net/10204/5587