ResearchSpace

A toolkit for text extraction and analysis for natural language processing tasks

dc.contributor.author Sefara, Tshephisho J
dc.contributor.author Mbooi, Mahlatse S
dc.contributor.author Mashile, Katlego J
dc.contributor.author Rambuda, Thompho
dc.contributor.author Rangata, Mapitsi R
dc.date.accessioned 2022-12-11T14:33:45Z
dc.date.available 2022-12-11T14:33:45Z
dc.date.issued 2022-08
dc.identifier.citation Sefara, T.J., Mbooi, M.S., Mashile, K.J., Rambuda, T. & Rangata, M.R. 2022. A toolkit for text extraction and analysis for natural language processing tasks. http://hdl.handle.net/10204/12565 . en_ZA
dc.identifier.isbn 978-1-6654-8422-0
dc.identifier.isbn 978-1-6654-8421-3
dc.identifier.isbn 978-1-6654-8423-7
dc.identifier.uri DOI: 10.1109/icABCD54961.2022.9856269
dc.identifier.uri http://hdl.handle.net/10204/12565
dc.description.abstract Text extraction is an important part of natural language processing (NLP) tasks. Most NLP tasks like text classification, machine translation, text-to-speech, text-based language identification, text summarization, and named-entity recognition involve the use of textual data. Such data is limited for low-resourced languages, making it difficult to experiment with advanced NLP techniques on these languages. This paper presents a Python-based toolkit for text analysis and text extraction from different types of images, documents, and audio files. The toolkit is built as a library that has functions that can be imported and utilized for text extraction. en_US
dc.format Fulltext en_US
dc.language.iso en en_US
dc.relation.uri https://ieeexplore.ieee.org/document/9856269 en_US
dc.source 2022 International Conference on Artificial Intelligence, Big Data, Computing and Data Communication Systems (icABCD), Durban, South Africa, 4-5 August 2022 en_US
dc.subject Text recognition en_US
dc.subject Text categorization en_US
dc.subject Big data en_US
dc.subject Natural Language Processing en_US
dc.subject Machine translation en_US
dc.subject Data communication en_US
dc.title A toolkit for text extraction and analysis for natural language processing tasks en_US
dc.type Conference Presentation en_US
dc.description.pages 6 en_US
dc.description.note Due to copyright restrictions, the attached PDF file contains the preprint version of the published item. For access to the published version, please consult the publisher's website: https://ieeexplore.ieee.org/document/9856269 en_US
dc.description.cluster Next Generation Enterprises & Institutions en_US
dc.description.impactarea Data Science en_US
dc.identifier.apacitation Sefara, T. J., Mbooi, M. S., Mashile, K. J., Rambuda, T., & Rangata, M. R. (2022). A toolkit for text extraction and analysis for natural language processing tasks. http://hdl.handle.net/10204/12565 en_ZA
dc.identifier.chicagocitation Sefara, Tshephisho J, Mahlatse S Mbooi, Katlego J Mashile, Thompho Rambuda, and Mapitsi R Rangata. "A toolkit for text extraction and analysis for natural language processing tasks." 2022 International Conference on Artificial Intelligence, Big Data, Computing and Data Communication Systems (icABCD), Durban, South Africa, 4-5 August 2022 (2022): http://hdl.handle.net/10204/12565 en_ZA
dc.identifier.vancouvercitation Sefara TJ, Mbooi MS, Mashile KJ, Rambuda T, Rangata MR. A toolkit for text extraction and analysis for natural language processing tasks; 2022. http://hdl.handle.net/10204/12565 . en_ZA
dc.identifier.ris TY - Conference Presentation
AU - Sefara, Tshephisho J
AU - Mbooi, Mahlatse S
AU - Mashile, Katlego J
AU - Rambuda, Thompho
AU - Rangata, Mapitsi R
AB - Text extraction is an important part of natural language processing (NLP) tasks. Most NLP tasks like text classification, machine translation, text-to-speech, text-based language identification, text summarization, and named-entity recognition involve the use of textual data. Such data is limited for low-resourced languages, making it difficult to experiment with advanced NLP techniques on these languages. This paper presents a Python-based toolkit for text analysis and text extraction from different types of images, documents, and audio files. The toolkit is built as a library that has functions that can be imported and utilized for text extraction.
DA - 2022-08
DB - ResearchSpace
DP - CSIR
J1 - 2022 International Conference on Artificial Intelligence, Big Data, Computing and Data Communication Systems (icABCD), Durban, South Africa, 4-5 August 2022
KW - Text recognition
KW - Text categorization
KW - Big data
KW - Natural Language Processing
KW - Machine translation
KW - Data communication
LK - https://researchspace.csir.co.za
PY - 2022
SM - 978-1-6654-8422-0
SM - 978-1-6654-8421-3
SM - 978-1-6654-8423-7
T1 - A toolkit for text extraction and analysis for natural language processing tasks
TI - A toolkit for text extraction and analysis for natural language processing tasks
UR - http://hdl.handle.net/10204/12565
ER - en_ZA
dc.identifier.worklist 26284 en_US

