This research aims to reduce a hierarchical taxonomy by measuring article overlap between subjects and by co-clustering keywords and subjects. The taxonomy has three levels: headings, subheadings and subjects. It is used to classify news articles into one or more subjects. There are currently 3195 subjects to which an article can be assigned, which is a large number. Reducing it would allow the taxonomy to be used in combination with an international standard, and would also benefit the company itself. By searching for subjects whose articles are all, or almost all, already contained in another subject, we try to find subjects that can be removed or merged with another subject. We also want to merge subjects with related content. We find these by applying a co-clustering algorithm that groups related keywords and subjects together. The keywords used for the co-clustering are terms from the corpus with a relatively high Term Frequency-Inverse Document Frequency (tf-idf) score. A stemming algorithm is applied before tf-idf is computed.
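
The two measurable ingredients named above, article overlap between subjects and tf-idf keyword scoring, can be illustrated with a minimal sketch. It is not the implementation used in this research: the function names, the 0.9 overlap threshold, the particular tf-idf variant (relative term frequency times the natural log of N/df) and the toy data are all assumptions chosen for illustration. Tokens are assumed to be stemmed already, and the co-clustering step is not reproduced here.

```python
import math
from collections import Counter


def containment(articles_a, articles_b):
    """Fraction of subject A's articles that also appear under subject B."""
    if not articles_a:
        return 0.0
    return len(articles_a & articles_b) / len(articles_a)


def merge_candidates(subject_articles, threshold=0.9):
    """List (subject, absorbing_subject, overlap) for pairs at or above threshold.

    The threshold is a hypothetical cut-off, not a value from the thesis.
    """
    candidates = []
    for a, arts_a in subject_articles.items():
        for b, arts_b in subject_articles.items():
            if a == b:
                continue
            score = containment(arts_a, arts_b)
            if score >= threshold:
                candidates.append((a, b, score))
    return candidates


def tfidf(docs):
    """Score terms per document; docs maps doc id -> list of stemmed tokens."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each term once per document
    scores = {}
    for doc_id, tokens in docs.items():
        term_freq = Counter(tokens)
        scores[doc_id] = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        }
    return scores


# Toy data (invented): article ids per subject, and stemmed tokens per article.
subject_articles = {
    "football": {1, 2, 3, 4},
    "soccer": {1, 2, 3, 4, 5, 6},
    "elections": {7, 8},
}
docs = {
    "d1": ["goal", "match", "goal"],
    "d2": ["vote", "match"],
}

candidates = merge_candidates(subject_articles)
scores = tfidf(docs)
```

In this toy example every "football" article is also filed under "soccer", so "football" is flagged as a candidate to be merged into "soccer", but not the reverse. The term "match" occurs in both documents and therefore scores zero, while "goal" and "vote" would be kept as keywords.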

Kaymak, U.
Economie & Informatica
Erasmus School of Economics

Plas, C. van der. (2010, August 30). New Article Taxonomy Reducing by co-clusering or article overlap. Economie & Informatica. Retrieved from