@incollection{ovadek2023iuropatext,
  author = "Ovádek, Michal and Fjelstul, Joshua and Naurin, Daniel and Lindholm, Johan",
  title = "The IUROPA Text Corpus",
  editor = "Johan Lindholm and Daniel Naurin and Urska Sadl and Anna Wallerman Ghavanini and Stein Arne Brekke and Joshua Fjelstul and Silje Synnøve Lyder Hermansen and Olof Larsson and Andreas Moberg and Moa Näsström and Michal Ovádek and Tommaso Pavone and Philipp Schroeder",
  booktitle = "The Court of Justice of the European Union Database",
  year = 2023,
  publisher = "IUROPA Project",
  url = "https://iuropa.pol.gu.se/"
}
IUROPA Text Corpus
Codebook and release notes
Introduction
This document accompanies the IUROPA Text Corpus, a database of judicial texts from the Court of Justice of the European Union (CJEU). The database comprises all types of judicial decisions and Advocate-General (AG) opinions at the paragraph level. Where available, both the French and English texts are included.
The corpus, codebook and release notes are current as of version 0.3 (release 2024-07-30).
The corpus remains a work in progress. Please let us know if you notice an error in the data or have suggestions for improvement. Consult the IUROPA website for more resources.
Citation
When using any part of the IUROPA Text Corpus (including this page), please cite it as follows:
Ovádek, Michal, Joshua Fjelstul, Daniel Naurin and Johan Lindholm. 2023. “The IUROPA Text Corpus”, in Lindholm, Johan, Daniel Naurin, Urska Sadl, Anna Wallerman Ghavanini, Stein Arne Brekke, Joshua Fjelstul, Silje Synnøve Lyder Hermansen, Olof Larsson, Andreas Moberg, Moa Näsström, Michal Ovádek, Tommaso Pavone, and Philipp Schroeder, The Court of Justice of the European Union (CJEU) Database, IUROPA, https://iuropa.pol.gu.se/.
In BibTeX form, use the entry given at the top of this document.
Please note that this database is licensed for non-commercial use only. Unauthorized commercial use, including but not limited to resale, redistribution, or use as part of a commercial product or service, is strictly prohibited.
Rationale
Prior to the release of the IUROPA Text Corpus, there was no unified and comprehensive database of CJEU decisions. Individually, neither the Curia nor the Eur-Lex website contains all decisions in a plain text format. In addition, older decisions – including landmark rulings such as Costa v ENEL and Cassis de Dijon – are only partially and imperfectly digitized.
The following table compares the number of unique documents present in the IUROPA Text Corpus with the number of documents containing plain (html) text on Curia and Eur-Lex by language version:
Source | English | French |
---|---|---|
Curia | 24334 | 37227 |
Eur-Lex | 32727 | 35919 |
IUROPA | 37170 | 47868 |
Whereas most academic and legal work relating to the CJEU has relied on English texts, this comparison reveals how much more comprehensive the French language corpus is. French remains the working language of the CJEU and not all decisions are translated into English.
Moreover, unlike most previous attempts to create a CJEU database, the unit of analysis in the IUROPA Text Corpus is the paragraph. Paragraphs are understood as blocks of text separated by line breaks in a document. This definition comprises both conventionally designated (and numbered) paragraphs and individual lines of text, such as the presentation of lawyers and judges. Overall, paragraph-level data offers more granularity and versatility than decision-level data.
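The paragraph definition above can be illustrated with a minimal Python sketch. It is not the project's actual pipeline: the function name `split_paragraphs` and the sample text are hypothetical, and dropping blank lines is an assumption (consistent with the codebook stating that every paragraph has at least one character).

```python
def split_paragraphs(document: str) -> list[str]:
    """Split a document into paragraphs, one per non-empty line.

    Assumption: lines that are empty after trimming whitespace are not
    treated as paragraphs.
    """
    paragraphs = []
    for line in document.splitlines():
        stripped = line.strip()
        if stripped:
            paragraphs.append(stripped)
    return paragraphs


# Hypothetical sample text, not taken from the corpus.
sample = "JUDGMENT OF THE COURT\n\n1 First numbered paragraph.\n2 Second numbered paragraph.\n"

# Number the paragraphs from 1, in the manner of a running line identifier.
for line_id, paragraph in enumerate(split_paragraphs(sample), start=1):
    print(line_id, paragraph)
```

Under this definition a heading line counts as a paragraph in the same way as a numbered paragraph of the judgment.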
Overview
As of version 0.3 (release 2024-07-30), the IUROPA Text Corpus contains a total of 10748956 paragraphs across 85038 documents. The number of documents in the corpus varies significantly not only over time, but also between courts and language versions.
Court | Number of documents | Paragraphs (min) | Paragraphs (mean) | Paragraphs (max) | Paragraphs (total) | Words (total) |
---|---|---|---|---|---|---|
Civil Service Tribunal | 2600 | 11 | 94.79 | 659 | 246444 | 10386845 |
Court of Justice | 59048 | 2 | 122.64 | 5219 | 7241762 | 345719703 |
General Court | 23390 | 2 | 139.41 | 7146 | 3260750 | 174925068 |
The corpus contains all available documents for the Court of Justice, the General Court and the now defunct Civil Service Tribunal. Most documents and paragraphs come from the Court of Justice, whose existence pre-dates the European Economic Community. The General Court was established in 1989 as the Court of First Instance. The Civil Service Tribunal was established in 2005 and abolished in 2016. All types of court decisions, including AG opinions, form part of the corpus.
As a general rule, the working language of all three courts is French. Documents are subsequently translated into English and other EU languages. As the workload of the courts increased, so did the pressure to translate selectively.
The maximum number of French documents per year (1966) occurs in 2018. An average year contains 674 documents in French and 524 documents in English. In instances when only the English version was made available in plain text (an issue almost exclusively affecting the General Court), the corpus might not include the French version (though if at least a summary is available, it is included). Otherwise and as a general rule, the French language sub-corpus is much more comprehensive.
The documents contain a varying number of paragraphs, ranging from 2 to 7146. On average, French documents have 120 paragraphs, while English documents have an average of 118 paragraphs. In some cases, the corpus also includes the so-called “report for the hearing” (“rapport d’audience”), which the CJEU produced from the 1980s until 2012. Most of these reports have not yet been released to the public, but in the future they should become a part of the (or a separate) database.
More generally, average differences between language versions stem not only from linguistic variance between English and French, but also from differences between the two samples – recall that French is the default language for the vast majority of documents, not all of which end up being translated into English.
On average, French paragraphs contain 49 words, while English paragraphs have an average of 50 words.
A major contribution of the IUROPA Text Corpus is the inclusion of digitized texts of older decisions. After applying state-of-the-art optical character recognition (OCR) to correctly segmented PDFs of documents from 1954 to 1989, the digitization pipeline involves highly labour-intensive cleaning of plain texts to ensure their high quality at the paragraph level.
The digitization task remains a work in progress. Cleaned data is added to the corpus on a rolling basis. Of the total 86649 pages of text that require manual checking or cleaning, 31491 have been completed and added to the corpus.
Priority has been given to French documents: so far, 56% of French pages and 14% of English pages have been processed. Contact us if you wish to contribute to this ongoing work.
As of version 0.3, all remaining documents were digitized using a combination of OCR and OpenAI’s Generative Pre-trained Transformer 4 (GPT-4, in its early 2024 state) thanks to a collaboration with the University of Oslo’s Centre for Computational and Data Science. The OCR output was fed to GPT-4 accompanied by the following prompt:
Consider the following OCR output text and make corrections. Please note that the text to be corrected is in [language]. Fix spelling mistakes, do not add/remove words, make consistent word spacing, add missing spaces, fix font case issues within words, fix numbering issues, make consistent line breaks. In addition, please remove all unnecessary newline characters that break paragraphs incorrectly, detect paragraphs and ensure to add an extra line between paragraphs. Here is the text:
The prompt was optimized by looking at the edit (Levenshtein) distance between the GPT-augmented output and a subset of manually corrected OCR documents. In our testing, the normalized edit distance between the GPT-corrected and manually corrected documents was low (below 0.05, with 0 indicating identity), which suggests a satisfactory quality of output.
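As an illustration of the metric, the following Python sketch computes a normalized Levenshtein distance between two strings. The text above does not state how the distance was normalized; dividing by the length of the longer string is an assumption made here, and the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1]; 0 means the strings are identical.

    Assumption: normalization by the length of the longer string.
    """
    longest = max(len(a), len(b))
    return 0.0 if longest == 0 else levenshtein(a, b) / longest
```

With this normalization, a value below 0.05 means that fewer than one character in twenty of the longer text would need to be edited to make the two texts identical.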
We maintain that conclusion after manually inspecting a small subset of the processed documents. While the GPT-processed documents contain more errors than the manually corrected ones, ranging from undetected paragraphs to undesired word completion, for the most part they faithfully reproduce the underlying PDF documents. We continue to make manual corrections to the corpus in every update.
Process
The IUROPA Text Corpus is generated using a data pipeline that combines automatic and manual processing. The workflow begins by downloading all html pages from Curia and Eur-Lex relating to CJEU decisions, including AG opinions. Subsequent steps retrieve and clean the texts and metadata at the paragraph level from the downloaded pages. This automatic process is then supplemented with manually curated data on decisions for which the html pages offer incomplete or erroneous texts. The main contribution of the manually curated data is the retrieval of older paragraphs that previously only existed in PDF documents. Finally, where more than one source contains a document, the source with the highest quality text (most complete, best labelled) is selected.
Standardization
In the final stage of preparing the database, a number of standardization steps are performed on the texts to decrease substantively meaningless variation that is likely to be undesired by most users. Namely, the following standardization steps are implemented at the paragraph level:
- guillemets « » are preceded and followed by a single space
- all types of dashes and hyphens are replaced by a single -
- all n° etc. are converted into explicit no or nos
- commas , are not preceded by a space when the preceding character is a letter or a number
- commas , are followed by a space when the next character is a letter
- ellipsis … are always represented by three periods ...
- all types of characters used as single quotation marks are replaced by '
- all types of characters used as double quotation marks except guillemets are replaced by "
- a sequence of single quotation marks is replaced by a single "
- opening parentheses ([{ are never followed by a space, closing parentheses )]} are never preceded by a space
- French characters æ and œ are replaced by ae and oe respectively
- all types of whitespace and blanks are converted into a single space
- no leading or trailing whitespace at the beginning and end of paragraphs
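Most of the steps listed above can be approximated with a few regular expressions. The Python sketch below is an illustration rather than the project's actual code: the function name `standardize`, the exact sets of dash and quotation-mark characters, and the order of the steps are assumptions, and the n° conversion is left out because its rules (no versus nos) are not specified here.

```python
import re


def standardize(text: str) -> str:
    """Apply an approximation of the paragraph-level standardization steps."""
    # Dashes and hyphens (assumed set: U+2010 to U+2015 and U+2212) become "-".
    text = re.sub(r"[\u2010-\u2015\u2212]", "-", text)
    # The ellipsis character becomes three periods.
    text = text.replace("\u2026", "...")
    # Single quotation marks (assumed set) become "'".
    text = re.sub(r"[\u2018\u2019\u201A\u201B\u0060\u00B4]", "'", text)
    # Double quotation marks other than guillemets (assumed set) become '"'.
    text = re.sub(r"[\u201C\u201D\u201E\u201F]", '"', text)
    # A sequence of single quotation marks becomes one double quotation mark.
    text = re.sub(r"'{2,}", '"', text)
    # French ligatures are spelled out.
    text = text.replace("æ", "ae").replace("œ", "oe")
    # Guillemets are surrounded by single spaces.
    text = re.sub(r"\s*«\s*", " « ", text)
    text = re.sub(r"\s*»\s*", " » ", text)
    # All runs of whitespace collapse into one space.
    text = re.sub(r"\s+", " ", text)
    # No space before a comma that follows a letter or a number.
    text = re.sub(r"(?<=[^\W_])\s+,", ",", text)
    # A space after a comma that is followed by a letter.
    text = re.sub(r",(?=[^\W\d_])", ", ", text)
    # No space after opening or before closing parentheses and brackets.
    text = re.sub(r"([(\[{])\s+", r"\1", text)
    text = re.sub(r"\s+([)\]}])", r"\1", text)
    # No leading or trailing whitespace.
    return text.strip()


print(standardize("  il dit «oui»  "))
```

Users comparing their own texts against the corpus may want to apply similar normalization first, since otherwise differences in quotation marks or whitespace alone can prevent exact matches.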
Codebook
The iuropa_text.gz.parquet file contains a rectangular spreadsheet where each row encodes information about a single paragraph of a given document. The paragraph information is spread over columns with the following names and descriptions:
Variable | Description | Type | Values |
---|---|---|---|
document_id | Uniquely identifies each document in the corpus. | character | Document identifiers. |
paragraph_id | Uniquely identifies each paragraph in the corpus. | character | Paragraph identifiers. |
source | Source of the document text. | character | Curia (cur), Eur-Lex (elx), manually validated optical character recognition (ocr) or AI-processed optical character recognition (gpt). |
language | Language version of the document. | character | English (EN) or French (FR). |
ecli | European Case Law Identifier (ECLI). | character | ECLI identifiers. |
court | Authoring EU court. | character | One of Court of Justice, General Court or Civil Service Tribunal. |
date | Date of publication. | character | Date in format YYYY-MM-DD. |
year | Year of publication. | integer | Ranges from 1954 to 2024. |
text | Paragraph text. | character | Free text. |
line_id | Running number for paragraphs in each document. | integer | Ranges from 1 to n paragraphs. |
section | The section of the document where the paragraph is found. | character | Experimental. One of presentation, grounds, costs, operative, footnotes, annex. |
paragraph_type | A basic categorization of paragraph types. | character | Experimental. One of heading, paragraph, quote, footnote, keywords, table. |
paragraph_number | The official paragraph number used in case citations. Note that some older decisions did not follow the numbering scheme which later became standard. | integer | Paragraph numbers. 0 indicates no official paragraph number. Several paragraphs can belong to the same paragraph_number. Contains errors. |
nchar | Number of characters in text. | integer | Minimum 1. |
html_class | Paragraph html class in the source file. | character | Classes. |
html_attr | Paragraph html attribute in the source file. | character | Attributes. |
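Because several rows can share the same paragraph_number, users who want one text per officially numbered paragraph need to combine rows. The Python sketch below shows one possible way to do so, using only the column names from the codebook. The sample rows, the identifier doc-1 and the function name `numbered_paragraphs` are invented for illustration and do not come from the corpus.

```python
from collections import defaultdict

# Invented sample rows that mimic a few codebook columns.
rows = [
    {"document_id": "doc-1", "line_id": 1, "paragraph_number": 0, "text": "JUDGMENT OF THE COURT"},
    {"document_id": "doc-1", "line_id": 2, "paragraph_number": 1, "text": "The applicant claims that:"},
    {"document_id": "doc-1", "line_id": 3, "paragraph_number": 1, "text": "- the decision should be annulled."},
    {"document_id": "doc-1", "line_id": 4, "paragraph_number": 2, "text": "The Court finds as follows."},
]


def numbered_paragraphs(rows):
    """Join the text of rows that share a document and an official paragraph number.

    Rows with paragraph_number 0 (no official number) are skipped.
    """
    grouped = defaultdict(list)
    for row in sorted(rows, key=lambda r: (r["document_id"], r["line_id"])):
        if row["paragraph_number"] == 0:
            continue
        grouped[(row["document_id"], row["paragraph_number"])].append(row["text"])
    return {key: " ".join(parts) for key, parts in grouped.items()}


for (document_id, number), text in numbered_paragraphs(rows).items():
    print(document_id, number, text)
```

Since paragraph_number is documented as containing errors, results obtained this way should be spot-checked against the source documents.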
As the IUROPA Text Corpus remains a work in progress, the information in this codebook as well as the underlying data are subject to change. The database version is included in the metadata of the iuropa_text.gz.parquet file (see Footnotes).
Release notes
Database versioning is encoded in the version number and the date of the release. Version 0.1 constituted the initial release of the database. Minor adjustments and updates are only reflected in the release date (currently 2024-07-30), not the version number (currently 0.3). Major changes are captured by the version number and documented below.
Version 0.3
- Digitization of older documents completed using an OCR-to-GPT pipeline (source == "gpt")
- Where a full decision is missing, but a summary exists, the summary is used in the corpus (French documents only)
- Added ten missing AG opinions (French only)
- Additional manual corrections
- The database now ships with variable html_class
- Improvements to paragraph numbering, but due to persistent issues the number is no longer deleted from text
Version 0.2
- Substantially improved processing of OCR’d text data
- Introduced text standardization to the pipeline
- More manually corrected documents
- More extensive documentation, including packaged data versioning
Version 0.1
- Initial release
Footnotes
Access the version number by running attributes() on the loaded data frame in R or parquet.read_schema().metadata on the file path in Python after importing the pyarrow parquet module.