EuroPat

Downloads

| Language pair | Sentences | Source tokens | raw | tmx | txt |
| --- | --- | --- | --- | --- | --- |
| German-English | 19,734,742 | 472,187,688 | 4.0GB | 5.0GB | 2.8GB |
| Spanish-English | 51,352,279 | 1,629,994,079 | 14GB | 15.4GB | 7.4GB |
| French-English | 11,098,710 | 342,545,387 | 2.4GB | 2.8GB | 1.6GB |
| Croatian-English | 154,774 | 4,801,057 | 74.9MB | 46.0MB | 23.0MB |
| Norwegian-English | 4,341,458 | 109,234,468 | 946.3MB | 1.3GB | 605.4MB |
| Polish-English | 332,119 | 7,822,804 | 107.0MB | 100.1MB | 47.3MB |

| Language pair | Sentences | Source tokens | raw | tmx | txt |
| --- | --- | --- | --- | --- | --- |
| German-English | 15,571,044 | 468,814,948 | 4.1GB | 2.0GB | 1.6GB |
| Spanish-English | 44,063,940 | 1,262,635,074 | 13GB | 5.2GB | 4.3GB |
| French-English | 12,081,950 | 369,899,080 | 2.5GB | 1.5GB | 1.2GB |
| Croatian-English | 75,104 | 2,331,218 | 41MB | 9.8MB | 7.4MB |
| Norwegian-English | 4,050,340 | 114,959,672 | 934MB | 490MB | 400MB |
| Polish-English | 87,983 | 1,728,224 | 41MB | 12MB | 7.5MB |

| Language pair | Sentences | Source tokens | raw | tmx | txt |
| --- | --- | --- | --- | --- | --- |
| German-English | 12,614,161 | 705,958,877 | 3.3GB | 2.7GB | 1.6GB |
| French-English | 9,213,466 | 567,930,281 | 2.1GB | 1.9GB | 1.2GB |

Patents as parallel corpora

Patents provide a rich source of technical vocabulary, product names, and person names that complement other data sources used for machine translation.

We have processed patents from the United States Patent and Trademark Office and from the European Patent Organisation. By matching related patents in different languages, we obtain parallel text that is useful for training machine translation systems. The data is available to download as aligned sentence pairs for each language pair.

How it works

We use information from the European Patent Organisation database to identify patents from different countries that belong to the same "family". These are often patents for the same invention that have been registered in different jurisdictions.
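The family grouping can be pictured with a short sketch. It is only an illustration: the record fields `doc_id`, `family_id`, and `lang` are hypothetical names, not the schema of the patent database.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(patents):
    """Group patent records by family and list cross-language pairs.

    Each record is a dict with the hypothetical keys
    'doc_id', 'family_id', and 'lang'.
    """
    families = defaultdict(list)
    for patent in patents:
        families[patent["family_id"]].append(patent)

    pairs = []
    for members in families.values():
        # Only documents in different languages can form parallel text.
        for a, b in combinations(members, 2):
            if a["lang"] != b["lang"]:
                pairs.append((a["doc_id"], b["doc_id"]))
    return pairs


patents = [
    {"doc_id": "EP-1", "family_id": "F1", "lang": "de"},
    {"doc_id": "US-1", "family_id": "F1", "lang": "en"},
    {"doc_id": "US-2", "family_id": "F2", "lang": "en"},
]
print(candidate_pairs(patents))  # [('EP-1', 'US-1')]
```

A family with a single member, such as F2 here, yields no candidate pairs.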

The title, abstract, claims, and description are extracted from each patent as plain text. All numbers, images, chemical formulas, and DNA sequences are removed.
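As a rough illustration of this cleaning step, the sketch below strips numbers and long nucleotide runs with regular expressions. The patterns, including the 12-letter minimum for a DNA run, are assumptions made for the example and not the project's actual rules; images and chemical formulas are not handled here.

```python
import re

# Assumed patterns, for illustration only.
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")         # 20, 37.5, 1,000
DNA_RUN = re.compile(r"\b[ACGTacgt]{12,}\b")    # long nucleotide strings
SPACES = re.compile(r"\s+")


def clean_text(text):
    """Remove DNA-like runs and numbers, then collapse whitespace."""
    text = DNA_RUN.sub(" ", text)
    text = NUMBER.sub(" ", text)
    return SPACES.sub(" ", text).strip()


print(clean_text("Heat the probe ACGTACGTACGTACGT to 37.5 degrees for 20 minutes"))
# Heat the probe to degrees for minutes
```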

The non-English patent text is translated into English so that we can identify matching pairs of patent documents within a family. Once a matching pair has been found, the sentences of the two patents are aligned, which turns the pair of documents into a list of sentence pairs in two languages.
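One simple way to picture the document matching step is to compare the English translation of a patent with each English document in the same family and keep the most similar one. The sketch below uses Jaccard overlap of word sets as a stand-in similarity measure; the measure actually used is not stated here, and the function names are hypothetical. Sentence alignment is not shown.

```python
def tokens(text):
    """Lowercased set of whitespace-separated words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard overlap of the word sets of two texts."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def best_match(translated_doc, english_docs):
    """Return (doc_id, score) of the most similar English document."""
    scored = ((doc_id, jaccard(translated_doc, text))
              for doc_id, text in english_docs.items())
    return max(scored, key=lambda item: item[1])


translated = "a valve for controlling fluid flow in a pipe"
english_docs = {
    "US-1": "a valve that controls fluid flow in a pipe",
    "US-2": "a method for baking bread",
}
doc_id, score = best_match(translated, english_docs)
print(doc_id, round(score, 2))  # US-1 0.6
```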

You can download the paired sentences in three formats. The Text files (txt) contain only the sentence pairs that pass our quality threshold. The same pairs are also available in TMX format (tmx) with additional metadata. The Raw files (raw) contain all paired sentences, without any quality filtering.
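For readers who want to work with the TMX files, the following sketch streams translation units with Python's standard library. It assumes an uncompressed file that follows the usual TMX layout (`tu` elements containing `tuv` elements with an `xml:lang` attribute and a `seg` child); the sample document and its language codes are made up for the example, and the additional metadata is ignored.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx(source):
    """Yield one {language: segment} dict per translation unit."""
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "tu":
            pair = {}
            for tuv in elem.iter("tuv"):
                seg = tuv.find("seg")
                if seg is not None:
                    pair[tuv.get(XML_LANG)] = "".join(seg.itertext())
            yield pair
            elem.clear()  # release the unit so large files stay in bounds


sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header/>
  <body>
    <tu>
      <tuv xml:lang="de"><seg>Ein Ventil.</seg></tuv>
      <tuv xml:lang="en"><seg>A valve.</seg></tuv>
    </tu>
  </body>
</tmx>"""

pairs = list(read_tmx(io.BytesIO(sample)))
print(pairs)  # [{'de': 'Ein Ventil.', 'en': 'A valve.'}]
```

Passing a file path or an open binary file instead of the `io.BytesIO` object works the same way, since `iterparse` accepts either.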
