Unleashing European Patent Translations

Downloads

Language pair       Sentences  Source tokens  Files
German-English     19,734,742    472,187,688
Spanish-English    51,352,279  1,629,994,079  raw 14GB
French-English     11,098,710    342,545,387
Croatian-English      154,774      4,801,057
Norwegian-English   4,341,458    109,234,468
Polish-English        332,119      7,822,804

Language pair       Sentences  Source tokens  Files
German-English     15,571,044    468,814,948
Spanish-English    44,063,940  1,262,635,074  raw 13GB
French-English     12,081,950    369,899,080
Croatian-English       75,104      2,331,218  raw 41MB
Norwegian-English   4,050,340    114,959,672
Polish-English         87,983      1,728,224  raw 41MB, tmx 12MB

Language pair       Sentences  Source tokens
German-English     12,614,161    705,958,877
French-English      9,213,466    567,930,281

Patents as parallel corpora

Patents provide a rich source of technical vocabulary, product names, and person names that complement other data sources used for machine translation.

We have processed patents from the United States Patent and Trademark Office and from the European Patent Organisation. By matching related patents written in different languages, we obtain parallel text that is useful for training machine translation systems. The data is available to download as matched sentence pairs for each pair of languages.

How it works

We use information from the European Patent Organisation database to identify patents from different countries that belong to the same "family". These are often patents for the same invention that have been registered in different jurisdictions.
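The family grouping described above can be sketched roughly as follows. This is only an illustration, not the project's actual code: the record fields (`family_id`, `lang`, `doc_id`) and the sample values are hypothetical.

```python
from collections import defaultdict


def group_by_family(records):
    """Group patent records by family and keep families with 2+ languages."""
    families = defaultdict(list)
    for rec in records:
        families[rec["family_id"]].append(rec)
    # Only families that span more than one language can yield parallel text.
    return {
        fam: recs
        for fam, recs in families.items()
        if len({r["lang"] for r in recs}) > 1
    }


# Hypothetical sample records (identifiers are made up for illustration).
records = [
    {"family_id": "F1", "lang": "de", "doc_id": "DE-0001"},
    {"family_id": "F1", "lang": "en", "doc_id": "US-0001"},
    {"family_id": "F2", "lang": "en", "doc_id": "US-0002"},
]
multilingual = group_by_family(records)
```

Here only family `F1` is kept, because `F2` contains a patent in a single language.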

The title, abstract, claims, and description are extracted from each patent as plain text. All numbers, images, chemical formulas, and DNA sequences are removed.
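As a rough illustration of this cleaning step, the sketch below strips numbers and DNA-like letter runs from plain text. The regular expressions and the minimum run length of 10 are assumptions chosen for the example, not the rules used in the actual pipeline, and images and chemical formulas are not handled here.

```python
import re

# Assumed patterns for illustration only.
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")      # 30, 95.5, 1,000
DNA_RE = re.compile(r"\b[ACGTacgt]{10,}\b")     # long runs of A/C/G/T
WHITESPACE_RE = re.compile(r"\s+")


def clean_text(text):
    """Remove DNA-like sequences and numbers, then collapse whitespace."""
    text = DNA_RE.sub(" ", text)
    text = NUMBER_RE.sub(" ", text)
    return WHITESPACE_RE.sub(" ", text).strip()


example = "The primer ACGTACGTACGTAC was heated to 95.5 degrees for 30 seconds."
cleaned = clean_text(example)
```

With this input, `cleaned` is `"The primer was heated to degrees for seconds."`.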

The patent text is translated into English so that we can identify matching pairs of patent documents within a family. Once a matching pair of patents has been found, the sentences in the two patents are lined up with each other. In this way, the pair of patents becomes a list of paired sentences in two languages.
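One simple way to picture the document matching step is to compare the English translation of a patent against each English patent in the same family and pick the closest one. The sketch below uses Jaccard overlap of word sets as a stand-in similarity measure; the measure, the `min_score` threshold, and the function names are assumptions for illustration, and the sentence alignment step itself is not shown.

```python
def jaccard(a, b):
    """Jaccard overlap between the lowercased word sets of two texts."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def best_match(translated_doc, english_docs, min_score=0.5):
    """Return (index, score) of the closest English document, or None."""
    scored = [(jaccard(translated_doc, doc), i) for i, doc in enumerate(english_docs)]
    if not scored:
        return None
    score, idx = max(scored)
    return (idx, score) if score >= min_score else None


# Hypothetical family with one translated patent and two English candidates.
translated = "a device for cutting paper"
candidates = ["method for brewing beer", "a device for cutting paper sheets"]
match = best_match(translated, candidates)
```

A threshold such as `min_score` keeps families with no genuinely matching document from producing a spurious pair.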

You can download the paired sentences in three formats. The Text files contain only the sentence pairs that pass our quality threshold. The same pairs are also available in TMX format with additional metadata. The Raw files contain all paired sentences without any filtering based on quality.
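For the TMX files, a minimal reader might look like the sketch below. It relies only on the standard TMX elements (`tu`, `tuv`, `seg`) and the `xml:lang` attribute; the exact language codes and any additional metadata fields in the released files are not assumed here, and the inline sample document is made up for the example.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(tmx_source, src_lang, tgt_lang):
    """Yield (source, target) segment pairs from a TMX file or file object."""
    for _, elem in ET.iterparse(tmx_source, events=("end",)):
        if elem.tag != "tu":
            continue
        segs = {}
        for tuv in elem.findall("tuv"):
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if lang and seg is not None:
                segs[lang.lower()] = "".join(seg.itertext())
        if src_lang in segs and tgt_lang in segs:
            yield segs[src_lang], segs[tgt_lang]
        elem.clear()  # release the translation unit to limit memory use


# Made-up sample document; real files are read by passing a file path instead.
sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1" segtype="sentence"
          o-tmf="example" adminlang="en" srclang="de" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="de"><seg>Eine Vorrichtung</seg></tuv>
      <tuv xml:lang="en"><seg>A device</seg></tuv>
    </tu>
  </body>
</tmx>"""
pairs = list(read_tmx_pairs(io.BytesIO(sample), "de", "en"))
```

Because the reader streams the file with `iterparse`, it does not need to hold a multi-gigabyte download in memory at once.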

EuroPat project partners