The PATSTAT database stores information on patent applications and publications. One of its tables, code-named TLS214, stores the scientific references cited by patents. In the 2014 version of PATSTAT, this table holds almost 24 million citations, which makes it a potentially powerful resource for investigating the relation between science and technology. However, TLS214 is poorly designed, and researchers and policy makers find it difficult to use the information it contains. The project described in this paper presents an automated record disambiguation procedure that aims to give the scientific community a reliable way to test hypotheses on the TLS214 table. To this end, we employ basic string cleaning methods alongside pattern harmonization techniques. Next, we extract bibliographic information and use it to detect pairs of records that are potential duplicates. A pair scoring system is used to reject certain pairs, and the final clusters of duplicates are obtained with a clustering algorithm.
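
The stages named in the abstract (cleaning, extraction, candidate pair detection, pair scoring, clustering) can be illustrated with a minimal Python sketch. It is not the procedure used in the project: the abstract does not specify the extraction rules, the scoring system, or the clustering algorithm, so the sketch assumes a publication year as the only extracted field, a character-level similarity ratio from the standard library as the pair score, a threshold of 0.8, and connected components (via union-find) as the clustering step. All function names and sample records are hypothetical.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations


def clean(text):
    """Basic string cleaning: lowercase, drop punctuation, collapse spaces."""
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def extract_year(text):
    """Return the first four-digit token starting with 19 or 20, else None.

    Assumption: the year stands in for richer bibliographic extraction.
    """
    match = re.search(r"\b(19|20)\d{2}\b", text)
    return match.group(0) if match else None


def candidate_pairs(records):
    """Yield index pairs of records that share the same extracted year."""
    blocks = {}
    for idx, raw in enumerate(records):
        blocks.setdefault(extract_year(raw), []).append(idx)
    for year, members in blocks.items():
        if year is not None:  # records without a year are never paired
            yield from combinations(members, 2)


def cluster(records, threshold=0.8):
    """Group records into duplicate clusters; returns lists of indices."""
    parent = list(range(len(records)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    cleaned = [clean(r) for r in records]
    for a, b in candidate_pairs(records):
        # Pair scoring: reject pairs whose similarity is below the threshold.
        score = SequenceMatcher(None, cleaned[a], cleaned[b]).ratio()
        if score >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for i in range(len(records)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical sample records, invented for illustration only.
SAMPLE = [
    "Smith J., Nature, vol. 405, pp. 123-126 (2000)",
    "SMITH, J. Nature vol 405 pp 123-126, 2000",
    "Doe A., Science, vol. 290, pp. 55-60 (2000)",
    "Unknown reference without a date",
]
```

Calling `cluster(SAMPLE)` groups the first two records together and leaves the other two as singletons. Restricting comparisons to records that share an extracted field keeps the number of scored pairs far below the quadratic total, which matters for a table with almost 24 million rows.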

Caron, E.A.M.
Erasmus School of Economics

Guner, S. (2015, July 31). Disambiguation of Scientific References in a Patent Database: A Project to Facilitate Economic Research and Policy Evaluation. Econometrie. Retrieved from