New Article Taxonomy Reducing by co-clusering or article overlap

Plas, C. van der

This research aims to reduce a hierarchical taxonomy by measuring article overlap between subjects and by co-clustering keywords and subjects. The taxonomy consists of three levels: headings, subheadings and subjects. It is used to classify news articles into one or more specific subjects. There are currently 3195 subjects to which an article can be assigned, which is a large number. The number of subjects should be reduced so that the taxonomy can be used in combination with an international standard; a smaller taxonomy would also benefit the company itself. By searching for subjects whose articles are all, or almost all, already contained in another subject, we try to find subjects that can be removed or merged with another subject. We also want to merge subjects whose content is related. We find these subjects by applying a co-clustering algorithm that groups related keywords and subjects together. The keywords used for the co-clustering are terms from the corpus with a relatively high Term Frequency-Inverse Document Frequency (tf-idf) score. A stemming algorithm is applied before tf-idf is computed.
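The two measurements named in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the containment measure, the 0.9 threshold, the tf-idf weighting variant (relative term frequency times the natural log of inverse document frequency), and all function names and toy data are assumptions chosen for illustration. Tokens are assumed to be stemmed already, and the co-clustering step itself is not shown.

```python
import math
from collections import Counter


def containment(articles_a, articles_b):
    """Fraction of subject A's articles that are also filed under subject B."""
    a, b = set(articles_a), set(articles_b)
    if not a:
        return 0.0
    return len(a & b) / len(a)


def merge_candidates(subject_articles, threshold=0.9):
    """List (subject, container, overlap) triples with overlap >= threshold.

    The threshold is a hypothetical value, not one taken from the thesis.
    """
    candidates = []
    for subject, articles in subject_articles.items():
        for other, other_articles in subject_articles.items():
            if subject == other:
                continue
            score = containment(articles, other_articles)
            if score >= threshold:
                candidates.append((subject, other, score))
    return candidates


def tfidf(docs):
    """Compute tf-idf per document.

    docs maps a document id to a list of (already stemmed) tokens.
    Returns a dict: document id -> {term: tf-idf score}.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each term once per document

    scores = {}
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[doc_id] = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return scores


# Toy data: article ids per subject (invented for this example).
subject_articles = {
    "football": {1, 2, 3, 4},
    "sport": {1, 2, 3, 4, 5, 6},
    "politics": {7, 8},
}
candidates = merge_candidates(subject_articles, threshold=0.9)

# Toy corpus of stemmed tokens (invented for this example).
docs = {"d1": ["goal", "match", "goal"], "d2": ["vote", "match"]}
scores = tfidf(docs)
```

In this toy data every article under "football" is also filed under "sport", so "football" is reported as a candidate for removal or merging, while the reverse direction (4 of 6 articles) falls below the threshold. A term that occurs in every document, such as "match", receives a tf-idf score of zero and would therefore not be selected as a keyword.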

Additional Metadata
Thesis Advisor	Kaymak, U.
Persistent URL	hdl.handle.net/2105/7928
Series	Economie & Informatica
Organisation	Erasmus School of Economics
Citation	Plas, C. van der. (2010, August 30). New Article Taxonomy Reducing by co-clusering or article overlap. Economie & Informatica. Retrieved from http://hdl.handle.net/2105/7928

Free Full Text (Final Version, 1 MB)