Vom 20.12.2025 bis 11.01.2026 ist die Universitätsbibliothek geschlossen. Ab dem 12.01.2026 gelten wieder die regulären Öffnungszeiten. Ausnahme: Medizinische Hauptbibliothek und Zentralbibliothek sind bereits ab 05.01.2026 wieder geöffnet. Weitere Informationen

Treffer: Comparison of text preprocessing methods.

Title:
Comparison of text preprocessing methods.
Source:
Natural Language Engineering; May2023, Vol. 29 Issue 3, p509-553, 45p
Database:
Complementary Index

Weitere Informationen

Text preprocessing is not only an essential step to prepare the corpus for modeling but also a key area that directly affects the natural language processing (NLP) application results. For instance, precise tokenization increases the accuracy of part-of-speech (POS) tagging, and retaining multiword expressions improves reasoning and machine translation. The text corpus needs to be appropriately preprocessed before it is ready to serve as the input to computer models. The preprocessing requirements depend on both the nature of the corpus and the NLP application itself, that is, what researchers would like to achieve from analyzing the data. Conventional text preprocessing practices generally suffice, but there exist situations where the text preprocessing needs to be customized for better analysis results. Hence, we discuss the pros and cons of several common text preprocessing methods: removing formatting, tokenization, text normalization, handling punctuation, removing stopwords, stemming and lemmatization, n-gramming, and identifying multiword expressions. Then, we provide examples of text datasets which require special preprocessing and how previous researchers handled the challenge. We expect this article to be a starting guideline on how to select and fine-tune text preprocessing methods. [ABSTRACT FROM AUTHOR]

Copyright of Natural Language Engineering is the property of Cambridge University Press and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)