Named Entity Recognition (NER) is a sub-task of information extraction in which named entities in unstructured text are identified and classified; typical entity types include the names of people, organizations, locations and quantities. In this research, NER is applied to a collection of data science job descriptions, with the goal of extracting the programming skills an applicant needs, which spoken languages are required, how much experience is asked for and whether any educational background is preferred. Several Long Short-Term Memory (LSTM) methods and a Conditional Random Field (CRF) are compared with each other. Although LSTM models have theoretical advantages over CRF models due to their ability to capture long-term dependencies within a sentence, the CRF model obtains the best overall performance, with an F1-score of 0.86. The high F1-score of the CRF can partly be attributed to its ability to classify multi-token chunks well, that is, entities that consist of more than one word. The methods are also compared on different subsets of the data, and this research shows that LSTM-based methods need more data to perform well.
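
To illustrate why multi-token chunks matter for the reported F1-score, the following is a minimal sketch of a chunk-level F1 computation. It assumes a BIO tagging scheme and an exact-match criterion, neither of which is confirmed by the abstract; the label name `SKILL`, the example sentence and the helper names `extract_chunks` and `chunk_f1` are hypothetical and chosen only for illustration.

```python
from typing import List, Set, Tuple

# A chunk is (label, start index, end index exclusive). Hypothetical representation.
Chunk = Tuple[str, int, int]


def extract_chunks(tags: List[str]) -> Set[Chunk]:
    """Collect entity chunks from a BIO tag sequence (assumed scheme)."""
    chunks: Set[Chunk] = set()
    label, start = None, 0
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing chunk
        continues = tag.startswith("I-") and tag[2:] == label
        if label is not None and not continues:
            chunks.add((label, start, i))
            label = None
        # A "B-" tag always opens a chunk; a stray "I-" tag is treated leniently as an opener.
        if tag.startswith("B-") or (tag.startswith("I-") and label is None):
            label, start = tag[2:], i
    return chunks


def chunk_f1(gold: List[str], pred: List[str]) -> float:
    """Exact-match F1: a predicted chunk counts only if label and both boundaries agree."""
    gold_chunks, pred_chunks = extract_chunks(gold), extract_chunks(pred)
    true_positives = len(gold_chunks & pred_chunks)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_chunks)
    recall = true_positives / len(gold_chunks)
    return 2 * precision * recall / (precision + recall)


# Hypothetical sentence: "Experience with machine learning and Python"
gold = ["O", "O", "B-SKILL", "I-SKILL", "O", "B-SKILL"]
pred = ["O", "O", "B-SKILL", "O", "O", "B-SKILL"]  # "machine learning" cut short

score = chunk_f1(gold, pred)
```

In this sketch the prediction tags five of the six tokens correctly, yet only one of the two gold chunks is matched exactly, so a model that truncates multi-word entities is penalized far more at the chunk level than at the token level.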

Groenen, P.J.F.
Business Economics
Erasmus School of Economics

Dijkman, B.N. van (Bjorn). (2019, August 22). LSTM and CRF models for entity extraction from job descriptions. Business Economics. Retrieved from