Supervised Machine Learning mit Nutzergenerierten Inhalten: Oversampling für nicht balancierte Trainingsdaten [Supervised machine learning with user-generated content: oversampling for imbalanced training data].
In online communication research, document classification with Supervised Machine Learning (SML) is increasingly used to automatically identify the content and meaning of User Generated Content (UGC). Detecting categories such as Incivility or Offensive Language is currently of high relevance in this context, but compared to "harmless" content, these categories occur rarely in many samples of UGC. If categories (e.g. UNCIVIL vs. CIVIL) are unequally distributed in a data set, the data are called imbalanced (or unbalanced). For document classification with SML, imbalanced data can lead to several problems. First, if a category is infrequent in the training data, it can hardly be learned by a statistical model. Second, most Machine Learning (ML) algorithms easily become biased towards the overrepresented category, and predictions become inaccurate and unreliable. Such biases often remain undiscovered because they are not immediately apparent from the overall performance measures of a model. For example, an incivility classifier could achieve an accuracy of 90% by only predicting the category CIVIL if the sample contains 90% civil and 10% uncivil instances. Such a classifier would not have learned anything about detecting incivility in UGC, and its predictions would be biased towards the major category CIVIL in the data. However, since a classifier is used to label new data and to answer further research questions, it is particularly important that its predictions are as reliable as possible.
One important strategy to overcome the problem of rare categories and imbalanced data is oversampling. Oversampling means that instances of the underrepresented category are weighted, or that new, synthetic cases of the minor category are generated to balance the sample. In ML research, oversampling is an established technique that is successfully applied in biotechnology and medical technology, e.g. for the diagnosis of diseases and the identification of genes, or in financial management, e.g. for the detection of credit card fraud. However, there is a lack of studies investigating whether oversampling also affects, and possibly improves, document classification with UGC, and whether it can be applied to reduce prediction biases arising from imbalanced samples of text documents such as tweets or user comments. The present study investigates how oversampling can improve the identification of the outcome categories Incivility, Offensive language, and Sentiment on three imbalanced samples of UGC, including English and German tweets and Facebook user comments (n = 55,400). To predict the outcome categories, a logistic regression classifier based on bag-of-words n-grams is applied, which is a common baseline approach to document classification. The results are compared before and after the application of two different oversampling strategies to the training data. Random Over Sampling (ROS) randomly selects cases of the underrepresented category and simply weights them for classification to balance the case numbers of all categories. The more complex algorithm Synthetic Minority Over-sampling Technique (SMOTE) uses the k-nearest-neighbors algorithm to generate new, synthetic cases. The new instances are supposed to be similar to the other cases of the minor category, meaning they should have a small distance to each other in the vector space. Here, SMOTE generates new text documents with a similar frequency distribution of words and word combinations.
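To make this setup more concrete, the following sketch (not the authors' code) shows how such a bag-of-words n-gram logistic regression baseline can be trained with and without oversampling of the training data, using scikit-learn and imbalanced-learn; the toy texts and labels, the roughly 90/10 imbalance and all parameter values are illustrative assumptions rather than the study's actual data or configuration.

# A minimal sketch (hypothetical, not the authors' code): bag-of-words n-gram
# logistic regression trained on the imbalanced data and, for comparison,
# after applying ROS and SMOTE to the training split only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from imblearn.over_sampling import RandomOverSampler, SMOTE

# Toy, illustrative UGC sample with a roughly 90/10 class imbalance
civil = ["thanks for sharing", "interesting point", "i agree with this",
         "well written article", "good discussion here"]
uncivil = ["you are an idiot", "what a stupid take", "shut up you troll"]
texts = civil * 18 + uncivil * 4
labels = ["CIVIL"] * 90 + ["UNCIVIL"] * 12

# Bag-of-words features from uni- and bigrams
X = CountVectorizer(ngram_range=(1, 2)).fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, stratify=labels, random_state=42)

samplers = {"no oversampling": None,
            "ROS": RandomOverSampler(random_state=42),
            "SMOTE": SMOTE(k_neighbors=5, random_state=42)}

for name, sampler in samplers.items():
    if sampler is None:
        X_res, y_res = X_train, y_train        # imbalanced baseline
    else:
        # Oversampling is applied to the training data only,
        # never to the held-out test set.
        X_res, y_res = sampler.fit_resample(X_train, y_train)
    clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)
    macro_f1 = f1_score(y_test, clf.predict(X_test), average="macro")
    print(f"{name}: Macro-F1 = {macro_f1:.2f}")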
Results show that both ROS and SMOTE significantly improve the classification of the outcome categories Offensive language, Incivility and Sentiment. After applying oversampling to the training data, the overall classification results improve by up to 15 percentage points (after oversampling: Macro-F1_Offensive_ROS = 0.71, Macro-F1_Incivility_SMOTE = 0.62, Macro-F1_Sentiment_ROS = 0.72; before oversampling: Macro-F1_Offensive = 0.56, Macro-F1_Sentiment = 0.62, Macro-F1_Incivility = 0.46). Before applying oversampling, the classifiers failed to identify many instances of the minor categories (Recall_UNCIVIL = 0.01, Recall_NEGATIVE_SENTI = 0.26, Recall_OFFENSIVE = 0.18) and overestimated the major categories. The findings are stable across all categories and forms of UGC. The results show that by using oversampling, instances with relevant information for a category (e.g., certain words to detect offensive language) can be taken into account even if they rarely occur in the original, imbalanced distribution. This way, the bias towards the major categories is reduced and the overall model performance increases. At the same time, however, all classifiers lose precision for the minor categories, because both ROS and SMOTE also consider misleading instances from the training data when weighting or generating new text documents. In sum, the study shows that oversampling on UGC can lead to a significant improvement of model performance and a reduction of the estimation bias caused by imbalanced training data. Oversampling works where relevant features (here: words and word combinations) already exist in the sample, since none of the oversampling algorithms can generate data with new information (here: new words or word combinations to identify a category such as incivility). Therefore, oversampling should be considered a useful and important tool for classification with UGC. It is likely that these findings can also be transferred to other use cases of document classification in communication science, which should be investigated in future studies. [ABSTRACT FROM AUTHOR]
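The emphasis on macro-F1 and per-class recall rather than plain accuracy can be illustrated with a short, hypothetical sketch: on a 90/10 sample, a classifier that always predicts the major category CIVIL reaches 90% accuracy, yet its recall for UNCIVIL and its macro-F1 expose the bias (illustrative numbers, not the study's data).

# Illustrative sketch: overall accuracy can hide a biased classifier,
# while macro-F1 and minority-class recall reveal it.
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Hypothetical 90/10 distribution and a degenerate majority-only classifier
y_true = ["CIVIL"] * 90 + ["UNCIVIL"] * 10
y_pred = ["CIVIL"] * 100

print("Accuracy:      ", accuracy_score(y_true, y_pred))                              # 0.90
print("Macro-F1:      ", f1_score(y_true, y_pred, average="macro", zero_division=0))  # ~0.47
print("Recall UNCIVIL:", recall_score(y_true, y_pred, pos_label="UNCIVIL"))           # 0.0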
Summary: Many of the phenomena currently studied in online communication research, such as hate speech, incivility or offensive language, occur comparatively rarely in a sample of User Generated Content (UGC). If the categories in a sample are not equally distributed, the data are called imbalanced. For document classification with Supervised Machine Learning, such imbalanced samples are often problematic because they make the automated identification of the categories more difficult and frequently render classification models (classifiers) inaccurate and unreliable. If a category occurs only rarely in the data, it can hardly be learned by a statistical classification model. In addition, many ML algorithms tend to predict the prevailing category in the data when uncertain, so that the classification becomes biased in favor of the overrepresented category. The present study investigates to what extent the method of oversampling can improve the classification of UGC when a category is clearly underrepresented in the sample. For this purpose, classification models for the identification of offensive language, incivility and sentiment were trained and tested on several imbalanced samples of German- and English-language tweets and user comments. The results were compared before and after the oversampling strategies ROS (Random Over Sampling) and SMOTE (Synthetic Minority Over-sampling Technique) were applied to the training data. The results show that both ROS and SMOTE clearly improve the classification of UGC in all samples, especially the identification of the underrepresented category. Applying oversampling also substantially reduces the bias of the estimation in favor of the prevailing category. The aim of the study is to provide researchers in communication science with insights into how the problem of imbalanced samples affects automated content analysis with Supervised Machine Learning and to what extent this problem can be addressed with oversampling. [ABSTRACT FROM AUTHOR]