Patents provide a rich source of technical vocabulary, product names, and person names that complement other data sources used for machine translation.
We have processed patents from the United States Patent and Trademark Office and from the European Patent Organisation. By matching related patents written in different languages, we obtain parallel text that is useful for training machine translation systems. The data can be downloaded as matched sentences for each pair of languages.
We use information from the European Patent Organisation database to identify patents from different countries that belong to the same "family". These are often patents for the same invention that have been registered in different jurisdictions.
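As a rough illustration of the family step, the sketch below groups patent records by a family identifier and lists the cross-language document pairs inside each family. The record layout (family id, document id, language code) and the function name are assumptions made for this example, not the actual database schema.

```python
from collections import defaultdict
from itertools import combinations


def family_pairs(records):
    """Return (family_id, doc_a, doc_b) for documents in the same family
    that are written in different languages.

    `records` is an iterable of (family_id, doc_id, lang) tuples.
    This layout is hypothetical and chosen only for illustration.
    """
    families = defaultdict(list)
    for family_id, doc_id, lang in records:
        families[family_id].append((doc_id, lang))

    pairs = []
    for family_id, members in families.items():
        # Consider every unordered pair of documents within one family.
        for (doc_a, lang_a), (doc_b, lang_b) in combinations(members, 2):
            if lang_a != lang_b:
                pairs.append((family_id, doc_a, doc_b))
    return pairs


# Made-up identifiers, used only to show the shape of the result.
example_records = [
    ("F1", "US-1", "en"),
    ("F1", "EP-1", "de"),
    ("F1", "US-2", "en"),
    ("F2", "EP-9", "fr"),
]
example_pairs = family_pairs(example_records)
```

A family with only one language, such as the hypothetical "F2" here, yields no candidate pairs.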
The title, abstract, claims, and description are extracted from each patent as plain text. All numbers, images, chemical formulas, and DNA sequences are removed.
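The removal of non-text material can be pictured with a small sketch. It handles only two of the categories mentioned above, numbers and DNA-like sequences, and the regular expressions are assumptions for illustration; the actual extraction rules are not described here, and images and chemical formulas would need separate handling.

```python
import re

# Assumption: a DNA/RNA sequence is approximated as a run of 10 or more
# uppercase nucleotide letters. Real sequence listings may look different.
DNA_RUN = re.compile(r"\b[ACGTU]{10,}\b")
# Digits with optional internal separators, e.g. "42" or "1,234.5".
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")
WHITESPACE = re.compile(r"\s+")


def strip_non_text(text: str) -> str:
    """Remove DNA-like runs and numbers, then collapse whitespace."""
    text = DNA_RUN.sub(" ", text)
    text = NUMBER.sub(" ", text)
    return WHITESPACE.sub(" ", text).strip()


cleaned = strip_non_text("The primer ACGTACGTACGTAC binds at 37.5 degrees.")
```

Removing such tokens leaves gaps in the sentence, which is why the whitespace is normalised afterwards.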
The patent text is translated into English so that we can identify matching pairs of patent documents within a family. Once a matching pair of patents has been found, the sentences in each patent are lined up. In this way, the pair of patents becomes a list of paired sentences in two languages.
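To make the document-matching idea concrete, here is a minimal sketch that compares an English translation of one patent against the English documents in the same family and picks the closest one. Jaccard overlap of word sets is a stand-in chosen for simplicity; the similarity measure and the sentence alignment method actually used are not specified in this description, and all names below are hypothetical.

```python
from typing import Dict, Tuple


def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two texts, from 0.0 to 1.0."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def best_match(translated: str, candidates: Dict[str, str]) -> Tuple[str, float]:
    """Return (doc_id, score) of the candidate most similar to `translated`.

    `candidates` maps a document id to its English text and must not be empty.
    """
    scored = ((doc_id, jaccard(translated, text)) for doc_id, text in candidates.items())
    return max(scored, key=lambda pair: pair[1])


# Made-up family members, used only to exercise the function.
family_docs = {
    "EP1": "a valve for controlling fluid flow",
    "EP2": "a method of brewing coffee",
}
match_id, match_score = best_match("valve controlling the fluid flow", family_docs)
```

The same translate-then-compare idea can be applied at sentence level once a document pair is fixed, although a real aligner also has to handle sentences that merge, split, or have no counterpart.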
You can download the paired sentences in three formats. The Text files contain only the sentence pairs that pass our quality threshold. The same pairs are also available in TMX format with additional metadata. The Raw files contain all paired sentences without any filtering based on quality.
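For readers who want to load the TMX files, the sketch below reads sentence pairs with Python's standard library. It relies only on the generic TMX structure (`tu`, `tuv`, and `seg` elements with an `xml:lang` attribute); the language codes and the inline sample are assumptions for illustration, and the additional metadata in the real files is not parsed here.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(tmx_text, src_lang, tgt_lang):
    """Return a list of (source, target) sentence pairs from TMX text."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            # Older TMX versions use a plain "lang" attribute instead.
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if lang and seg is not None:
                segments[lang.lower()] = "".join(seg.itertext()).strip()
        if src_lang in segments and tgt_lang in segments:
            pairs.append((segments[src_lang], segments[tgt_lang]))
    return pairs


# Tiny hand-written sample, not taken from the real data.
SAMPLE_TMX = (
    '<tmx version="1.4"><header/><body>'
    '<tu><tuv xml:lang="en"><seg>A valve.</seg></tuv>'
    '<tuv xml:lang="de"><seg>Ein Ventil.</seg></tuv></tu>'
    '</body></tmx>'
)
sample_pairs = read_tmx_pairs(SAMPLE_TMX, "en", "de")
```

This version parses the whole document in memory, so for large downloads a streaming parser such as `ET.iterparse` is likely the better choice.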