The foundation of our unique data set is the so-called DOCDB database, which is provided by the European Patent Office. This database is very rich and contains more than 120 million patent publications dating back to the beginning of the patent system. Patent numbers are harmonized and therefore follow a homogeneous system.

However, DOCDB does not contain any information about patent representatives. This information is only available in national or international data sets, which we collected separately.

In our Patent-Pilot database, we combine DOCDB with national data collections and further data sources, such as attorney registers, and thereby create the most comprehensive database of patent documents and of the patent law firms that filed and prosecuted these patent applications.

Data cleaning and processing

In general, our data sources provide raw data only. These data are messy and unstructured, and they need to be harmonized and aggregated before they can be used for analytics. For example, in the raw data, attorney names and patent law firms appear in the same bibliographic metadata fields, which makes it impossible to observe and analyze persons and organizations as one joint entity.

Moreover, spelling errors, renamed law firms, and variants of the same names and entities make it almost impossible to use general patent databases for patent representative analytics without correction.
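
To make the problem of name variants concrete, here is a minimal Python sketch of one common way to group spellings of the same law firm. It is an illustration only and not the Patent-Pilot algorithm: the suffix list, the function names normalize_name and group_variants, the 0.85 similarity threshold, and the sample firm names are all assumptions made for this example.

```python
import re
import unicodedata
from difflib import SequenceMatcher

# Hypothetical list of legal-form suffixes to ignore when comparing names.
LEGAL_SUFFIXES = {"llp", "llc", "ltd", "gmbh", "inc", "pc"}


def normalize_name(raw: str) -> str:
    """Lowercase, fold accents, unify '&' and 'and', drop punctuation and legal suffixes."""
    text = unicodedata.normalize("NFKD", raw).encode("ascii", "ignore").decode("ascii")
    text = text.lower().replace("&", " and ")
    text = re.sub(r"[^a-z0-9 ]+", " ", text)
    tokens = [t for t in text.split() if t not in LEGAL_SUFFIXES]
    return " ".join(tokens)


def group_variants(names, threshold=0.85):
    """Greedily group raw names whose normalized forms are similar enough."""
    groups = []  # list of (normalized key of first member, [raw names])
    for raw in names:
        key = normalize_name(raw)
        for canonical, members in groups:
            if SequenceMatcher(None, key, canonical).ratio() >= threshold:
                members.append(raw)
                break
        else:
            # No existing group is close enough, so start a new one.
            groups.append((key, [raw]))
    return [members for _, members in groups]


sample = [
    "Smith & Jones LLP",
    "Smith and Jones",
    "Smith & Jonse LLP",   # spelling error
    "Miller Patent GmbH",
]
groups = group_variants(sample)
```

With these made-up names, the three "Smith & Jones" spellings end up in one group and "Miller Patent GmbH" in another. Simple string similarity of this kind cannot detect a firm that was renamed entirely; that case needs additional sources such as attorney registers.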

We created our own algorithm to clean the raw data, distinguish persons from organizations, and assign attorneys to the law firms they were working for at the time they filed an application for a client.
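
The two remaining steps, telling persons from organizations and linking an attorney to the firm they worked for on the filing date, can be sketched as follows. This is a simplified, rule-based illustration under stated assumptions, not the actual Patent-Pilot implementation: the marker list, the Employment record, the functions classify and firm_at, and all names and dates are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical tokens that suggest a representative entry is an organization.
ORG_MARKERS = {"&", "llp", "llc", "ltd", "gmbh", "inc", "partners", "associates"}


def classify(name: str) -> str:
    """Return 'organization' if the name contains an organization marker, else 'person'."""
    tokens = set(name.lower().replace(",", " ").replace(".", " ").split())
    return "organization" if tokens & ORG_MARKERS else "person"


@dataclass
class Employment:
    """One period during which an attorney worked for a firm (end=None means ongoing)."""
    attorney: str
    firm: str
    start: date
    end: Optional[date] = None


def firm_at(attorney: str, filing_date: date, history: list) -> Optional[str]:
    """Return the firm the attorney belonged to on the filing date, if known."""
    for record in history:
        if record.attorney != attorney:
            continue
        if record.start <= filing_date and (record.end is None or filing_date <= record.end):
            return record.firm
    return None


history = [
    Employment("Jane Doe", "Smith & Jones LLP", date(2010, 1, 1), date(2015, 12, 31)),
    Employment("Jane Doe", "Miller Patent GmbH", date(2016, 1, 1)),
]

early = firm_at("Jane Doe", date(2014, 5, 1), history)   # filed while at the first firm
later = firm_at("Jane Doe", date(2020, 1, 1), history)   # filed after changing firms
```

Keeping the employment periods explicit means that an application filed in 2014 is counted for the firm the attorney worked for in 2014, even if the attorney later moved to another firm.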

How do we do it? We use natural language processing and machine learning, which probably makes Patent-Pilot's algorithm the most advanced for patent representative analytics in the IP industry.

