Automatic text classification has numerous applications, such as information retrieval from a corpus, sentiment analysis of text, and spam filtering. The use of machine learning in these text classification tasks is increasing in both frequency and performance. Unfortunately, this rapid performance increase in machine learning techniques for text classification is often not accompanied by a better theoretical understanding of the underlying models. This research investigates the performance of several automatic text classification methods and their links to related theory. Broadly speaking, automatic text classification consists of three steps. The first step is pre-processing, which reduces noise and removes unwanted textual features. The second step is feature representation, which transforms the textual corpus into a numerical data matrix. The last step is the classification of this data matrix using a classifier. This thesis specifically focuses on two feature representation methods: Bag-of-Words and Bag-of-Concepts, investigating both their performance and their most important features. Bag-of-Words is commonly used and represents a document by the frequency of its words. On top of the word term frequency representation, we also investigate the effect of applying different weighting schemes (TF-IDF variants) to the Bag-of-Words, as well as using boolean features instead of term frequencies. Bag-of-Concepts is an alternative to Bag-of-Words and works similarly, representing a document by the frequency of its concepts. These concepts are formed by clustering word vectors generated by the word2vec algorithm, which assigns each word a vector based on the contexts it appears in. The clusterings, with the number of clusters ranging from 10 to 1000, were made using K-means. All of these feature representations are classified with multinomial naive Bayes. In this thesis three data sets were used, consisting of labeled documents.
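The three Bag-of-Words variants and the classifier named above can be sketched as follows. This is a minimal illustration, not the thesis code: the library choice (scikit-learn), the toy corpus, and the labels are assumptions made here for the sake of a runnable example.

```python
# Sketch of the Bag-of-Words variants (term frequency, boolean, TF-IDF)
# classified with multinomial naive Bayes. Corpus and labels are toy data.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["the goalie saved the shot",
        "the parser rejected the token",
        "a striker scored a goal",
        "the compiler emitted an error"]
labels = ["sport", "tech", "sport", "tech"]

# 1) Term-frequency Bag-of-Words: each document becomes a word-count vector.
vec = CountVectorizer()
tf = vec.fit_transform(docs)
# 2) Boolean Bag-of-Words: word presence/absence instead of counts.
boolean = CountVectorizer(binary=True).fit_transform(docs)
# 3) TF-IDF weighting: down-weights words that occur in many documents.
tfidf = TfidfVectorizer().fit_transform(docs)

# Each representation is fed to a multinomial naive Bayes classifier.
clf = MultinomialNB().fit(tf, labels)
pred = clf.predict(vec.transform(["a great goal"]))
print(pred[0])  # "goal" only occurs in a sport document, so: sport
```

The same `MultinomialNB` fit can be repeated on `boolean` and `tfidf` to compare the variants, which is essentially the experimental setup the abstract describes.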
The first data set investigated, R52, is a subset of the commonly used Reuters-21578 news set. The other two data sets are sourced from questions on the Q&A forum StackExchange: one contains questions regarding computer games and the other contains questions regarding the English language. Of the four Bag-of-Words variants evaluated in this research, the term frequency Bag-of-Words was found to perform best, as measured by F1-score. The boolean Bag-of-Words performs slightly worse but is a good alternative. In contrast to the literature, the use of TF-IDF decreased the performance of the classifier. It is hypothesized that this is due to the lack of parameter optimization and feature selection in this thesis; however, although investigated, no solid theoretical explanation for this result could be given in this research. In general, the Bag-of-Words representation was found to perform somewhat better than the Bag-of-Concepts representation. Nevertheless, a significant dimension reduction was achieved using Bag-of-Concepts, and the word clusters (concepts) that arise from the Bag-of-Concepts method were judged relevant to the classification tasks by human inspection. Because of this, the Bag-of-Concepts method could be used to identify important keywords for a certain class.
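The Bag-of-Concepts construction sketched below follows the recipe above: cluster word vectors with K-means, then represent each document by the frequency of the clusters (concepts) its words fall into. Since training word2vec requires a large corpus, random vectors stand in for the embeddings here; the vocabulary, vector dimension, and cluster count are all illustrative assumptions.

```python
# Bag-of-Concepts sketch: K-means over word vectors, then concept counts.
import numpy as np
from sklearn.cluster import KMeans

# Stand-in word vectors (the thesis uses word2vec embeddings; random
# vectors are used here only so the clustering step is runnable).
rng = np.random.default_rng(0)
vocab = ["goal", "striker", "goalie", "parser", "compiler", "token"]
vectors = rng.normal(size=(len(vocab), 50))

# Cluster word vectors into concepts (the thesis uses 10 to 1000
# clusters; 2 suffices for this toy vocabulary).
n_concepts = 2
km = KMeans(n_clusters=n_concepts, n_init=10, random_state=0).fit(vectors)
word_to_concept = dict(zip(vocab, km.labels_))

def bag_of_concepts(doc):
    """Represent a document by the frequency of its concepts."""
    counts = np.zeros(n_concepts, dtype=int)
    for word in doc.split():
        if word in word_to_concept:
            counts[word_to_concept[word]] += 1
    return counts

print(bag_of_concepts("the goalie saved the goal"))
```

The resulting document vectors have only `n_concepts` dimensions instead of one per vocabulary word, which is the dimension reduction noted above; inspecting which words share a cluster is how the concepts can be read as class keywords.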

Velden, M. van de
hdl.handle.net/2105/44111
Econometrie
Erasmus School of Economics

Brave, D.M. den. (2018, November 14). Investigating the Performance of Different Feature Representations in Text Classification within Machine Learning. Econometrie. Retrieved from http://hdl.handle.net/2105/44111