IUROPA Text Corpus

Codebook and release notes

Author

Dr Michal Ovádek

Published

30 Jul 2024

Introduction

This document accompanies the IUROPA Text Corpus, a database of judicial texts from the Court of Justice of the European Union (CJEU). The database comprises all types of judicial decisions and Advocate-General (AG) opinions at the paragraph level. Where available, both the French and English texts are included.

The corpus, codebook and release notes are current as of version 0.3 (release 2024-07-30).

The corpus remains work-in-progress. Please let us know if you notice an error in the data or if you have suggestions for improvements. Consult the IUROPA website for more resources.

Citation

When using any part of the IUROPA Text Corpus (including this page), please refer to the following citation:

Ovádek, Michal, Joshua Fjelstul, Daniel Naurin and Johan Lindholm. 2023. “The IUROPA Text Corpus”, in Lindholm, Johan, Daniel Naurin, Urska Sadl, Anna Wallerman Ghavanini, Stein Arne Brekke, Joshua Fjelstul, Silje Synnøve Lyder Hermansen, Olof Larsson, Andreas Moberg, Moa Näsström, Michal Ovádek, Tommaso Pavone, and Philipp Schroeder, The Court of Justice of the European Union (CJEU) Database, IUROPA, https://iuropa.pol.gu.se/.

In BibTeX form:

@incollection{ovadek2023iuropatext,
  author       = "Ovádek, Michal and Fjelstul, Joshua and Naurin, Daniel and Lindholm, Johan",
  title        = "The IUROPA Text Corpus",
  editor       = "Johan Lindholm and Daniel Naurin and Urska Sadl and Anna Wallerman Ghavanini and Stein Arne Brekke and Joshua Fjelstul and Silje Synnøve Lyder Hermansen and Olof Larsson and Andreas Moberg and Moa Näsström and Michal Ovádek and Tommaso Pavone and Philipp Schroeder",
  booktitle    = "The Court of Justice of the European Union Database",
  year         = 2023,
  publisher    = "IUROPA Project",
  url          = "https://iuropa.pol.gu.se/"
}

Please note that this database is licensed for non-commercial use only. Unauthorized commercial use, including but not limited to resale, redistribution, or use as part of a commercial product or service, is strictly prohibited.

Rationale

Prior to the release of the IUROPA Text Corpus, there was no unified and comprehensive database of CJEU decisions. Individually, neither the Curia nor the Eur-Lex website contains all decisions in a plain text format. In addition, older decisions – including landmark rulings such as Costa v ENEL and Cassis de Dijon – are only partially and imperfectly digitized.

The following table compares the number of unique documents present in the IUROPA Text Corpus with the number of documents containing plain (html) text on Curia and Eur-Lex by language version:

Number of documents with plain text
Source	English	French
Curia	24334	37227
Eur-Lex	32727	35919
IUROPA	37170	47868

Whereas most academic and legal work relating to the CJEU has relied on English texts, this comparison reveals how much more comprehensive the French language corpus is. French remains the working language of the CJEU and not all decisions are translated into English.

Moreover, unlike most previous attempts to create a CJEU database, the unit of analysis in the IUROPA Text Corpus is the paragraph. Paragraphs are understood as blocks of text separated by line breaks in a document. This definition comprises both conventionally designated (and numbered) paragraphs and individual lines with text, such as presentation of lawyers and judges. Overall, paragraph-level data offers more granularity and versatility than decision-level data.
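To make the paragraph definition concrete, the following minimal Python sketch splits a plain-text document into paragraph rows on line breaks. It is an illustration only and not the project's pipeline: the function name `split_paragraphs`, the decision to drop empty lines, and the sample text are assumptions made here, while `line_id` and `text` mirror the column names in the codebook.

```python
def split_paragraphs(document_text: str) -> list[dict]:
    """Split a document into paragraph rows, one per non-empty line."""
    rows = []
    for line in document_text.splitlines():
        text = line.strip()
        if not text:
            # Blank lines only separate blocks; they are not paragraphs.
            continue
        # line_id is a running number starting at 1 within the document.
        rows.append({"line_id": len(rows) + 1, "text": text})
    return rows


# Invented sample text, for illustration only.
sample = "Heading line\n\n1 First numbered paragraph.\n2 Second numbered paragraph.\n"
for row in split_paragraphs(sample):
    print(row["line_id"], row["text"])
```

Under this definition, headings and other single lines count as paragraphs alongside numbered paragraphs.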

Overview

As of version 0.3 (release 2024-07-30), the IUROPA Text Corpus contains a total of 10748956 paragraphs across 85038 documents. The number of documents in the corpus varies significantly not only over time, but also between courts and language versions.

Comparison of EU courts
Court	Number of documents	Paragraphs (min)	Paragraphs (mean)	Paragraphs (max)	Paragraphs (total)	Words (total)
Civil Service Tribunal	2600	11	94.79	659	246444	10386845
Court of Justice	59048	2	122.64	5219	7241762	345719703
General Court	23390	2	139.41	7146	3260750	174925068

The corpus contains all available documents for the Court of Justice, the General Court and the now defunct Civil Service Tribunal. Most documents and paragraphs come from the Court of Justice whose existence pre-dates the European Economic Community. The General Court was established in 1989 as the Court of First Instance. The Civil Service Tribunal was established in 2005 and abolished in 2016. All types of court decisions, including AG opinions, form part of the corpus.

As a general rule, the working language of all three courts is French. Documents are subsequently translated into English and other EU languages. As the workload of the courts increased, so did the pressure to translate selectively.

The largest annual number of French documents (1966 documents) occurs in 2018. An average year contains 674 documents in French and 524 documents in English. In instances when only the English version was made available in plain text – an issue almost exclusively affecting the General Court – the corpus might not include the French version (though if at least a summary is available, it is included). Otherwise and as a general rule, the French language sub-corpus is much more comprehensive.

The documents contain a varying number of paragraphs, ranging from 2 to 7146. On average, French documents have 120 paragraphs, while English documents have an average of 118 paragraphs. In some cases, the corpus also includes the so-called “report for the hearing” (“rapport d’audience”) which the CJEU produced from the 1980s until 2012. Most of these reports have not yet been released to the public, but in the future they should become part of this database or a separate one.

Average number of paragraphs per document

More generally, average differences between language versions stem not only from linguistic variance between English and French, but also from differences between the two samples – recall that French is the default language for the vast majority of documents, not all of which end up being translated into English.

On average, French paragraphs contain 49 words, while English paragraphs have an average of 50 words.

A major contribution of the IUROPA Text Corpus is the inclusion of digitized texts of older decisions. After applying state-of-the-art optical character recognition (OCR) to correctly segmented PDFs of documents from 1954 to 1989, the digitization pipeline involves highly labour-intensive cleaning of plain texts to ensure their high quality at the paragraph level.

Number of processed OCR pages (current manual progress, remaining pages processed using AI in shaded bars)

The digitization task remains work-in-progress. Cleaned data is added to the corpus on a rolling basis. Of the total 86649 pages of text that require manual checking or cleaning, 31491 have been completed and added to the corpus.

Priority has been given to French documents: so far, 56% of pages in French and 14% of English pages have been processed. Contact us if you wish to contribute to this ongoing work.

As of version 0.3, all remaining documents were digitized using a combination of OCR and OpenAI’s Generative Pre-trained Transformer 4 (GPT-4, in its early 2024 state) thanks to a collaboration with the University of Oslo’s Centre for Computational and Data Science. The OCR output was fed to GPT-4 accompanied by the following prompt:

Consider the following OCR output text and make corrections. Please note that the text to be corrected is in [language]. Fix spelling mistakes, do not add/remove words, make consistent word spacing, add missing spaces, fix font case issues within words, fix numbering issues, make consistent line breaks. In addition, please remove all unnecessary newline characters that break paragraphs incorrectly, detect paragraphs and ensure to add an extra line between paragraphs. Here is the text:

The prompt was optimized by looking at the edit (Levenshtein) distance between the GPT-augmented output and a subset of manually corrected OCR documents. In our testing, the normalized edit distance between the GPT-corrected and manually corrected documents was low – below 0.05 with 0 indicating identity – which suggests a satisfactory quality of output.
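For readers unfamiliar with the metric, here is a minimal Python sketch of a normalized Levenshtein distance. The text above does not state how the distance was normalized, so dividing by the length of the longer string is an assumption made here, as are the function names and the sample strings; the actual evaluation may have used a different library or normalization.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1]; 0 means the strings are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return levenshtein(a, b) / longest


# Invented strings: one OCR-style error ("0" for "o") in a short text.
print(normalized_edit_distance("La Cour de justice", "La C0ur de justice"))
```

With this normalization, a value below 0.05 corresponds to fewer than five character edits per hundred characters of the longer text.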

We maintain that conclusion after manually inspecting a small subset of the processed documents. While the GPT-processed documents contain more errors than the manually corrected ones, ranging from undetected paragraphs to undesired word completion, for the most part they faithfully reproduce the underlying PDF documents. We continue to make manual corrections to the corpus in every update.

Process

The IUROPA Text Corpus is generated using a data pipeline that combines automatic and manual processing. The workflow begins by downloading all html pages from Curia and Eur-Lex relating to CJEU decisions, including AG opinions. Subsequent steps retrieve and clean the texts and metadata at the paragraph level from the downloaded pages. This automatic process is then supplemented with manually curated data on decisions for which the html pages offer incomplete or erroneous texts. The main contribution of the manually curated data is the retrieval of older paragraphs that previously only existed in PDF documents. Finally, where more than one source contains a document, the source with the highest quality text (most complete, best labelled) is selected.

Standardization

In the final stage of preparing the database, a number of standardization steps are performed on the texts to decrease substantively meaningless variation that is likely to be undesired by most users. Namely, the following standardization steps are implemented at the paragraph level:

guillemets « » are preceded and followed by a single space
all types of dashes and hyphens are replaced by a single -
all variants of n° are converted into explicit no or nos
commas , are not preceded by a space when the preceding character is a letter or a number
commas , are followed by a space when the next character is a letter
ellipses … are always represented by three periods ...
all types of characters used as single quotation marks are replaced by '
all types of characters used as double quotation marks except guillemets are replaced by "
a sequence of single quotation marks is replaced by a single "
opening parentheses ([{ are never followed by a space, closing parentheses )]} are never preceded by a space
French characters æ and œ are replaced by ae and oe respectively
all types of whitespace and blanks are converted into a single space
no leading or trailing whitespace at the beginning and end of paragraphs
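The standardization steps listed above can be approximated with a handful of regular expressions. The Python sketch below is an assumption-laden illustration, not the project's code: the function name `standardize`, the exact sets of dash and quotation characters, the handling of n° variants, and the order of the steps are all choices made here.

```python
import re


def standardize(text: str) -> str:
    """Apply a rough approximation of the listed standardization steps."""
    # Guillemets get a space on each side (extra spaces collapse later).
    text = re.sub(r"\s*([«»])\s*", r" \1 ", text)
    # Common Unicode dashes and hyphens become a plain hyphen-minus.
    text = re.sub("[\u2010-\u2015\u2212]", "-", text)
    # "n°" / "nº" become "no"; the plural form "n°s" becomes "nos".
    text = re.sub(r"\b[nN][°º](s\b)?",
                  lambda m: "nos" if m.group(1) else "no", text)
    # No space before a comma that follows a letter or digit.
    text = re.sub(r"(?<=[^\W_])\s+,", ",", text)
    # A space after a comma that is directly followed by a letter.
    text = re.sub(r",(?=[^\W\d_])", ", ", text)
    # The single-character ellipsis becomes three periods.
    text = text.replace("\u2026", "...")
    # Typographic single and double quotation marks become ASCII ones.
    text = re.sub("[\u2018\u2019\u201a\u201b\u2032]", "'", text)
    text = re.sub("[\u201c\u201d\u201e\u201f\u2033]", '"', text)
    # A run of two or more single quotes becomes one double quote.
    text = re.sub(r"'{2,}", '"', text)
    # No space just inside opening or closing brackets.
    text = re.sub(r"([(\[{])\s+", r"\1", text)
    text = re.sub(r"\s+([)\]}])", r"\1", text)
    # Ligatures are spelled out.
    text = text.replace("æ", "ae").replace("œ", "oe")
    # Any run of whitespace becomes one space; trim both ends.
    return re.sub(r"\s+", " ", text).strip()


# Invented sample paragraph, for illustration only.
print(standardize("«Cour»  de  justice ,n° 12 … (  voir  ) œuvre"))
```

Whitespace is collapsed last so that spaces introduced by earlier steps, such as those around guillemets, never leave double spaces behind.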

Codebook

The iuropa_text.gz.parquet file contains a rectangular spreadsheet where each row encodes information about a single paragraph of a given document. The paragraph information is spread over columns with the following names and descriptions:

Variable	Description	Values
`document_id`	Uniquely identifies each document in the corpus.	`character` Document identifiers.
`paragraph_id`	Uniquely identifies each paragraph in the corpus.	`character` Paragraph identifiers.
`source`	Source of the document text.	`character` Curia (`cur`), Eur-Lex (`elx`), manually validated optical character recognition (`ocr`) or AI-processed optical character recognition (`gpt`).
`language`	Language version of the document.	`character` English (`EN`) or French (`FR`).
`ecli`	European Case Law Identifier (ECLI).	`character` ECLI identifiers.
`court`	Authoring EU court.	`character` One of `Court of Justice`, `General Court` or `Civil Service Tribunal`.
`date`	Date of publication.	`character` Date in format `YYYY-MM-DD`.
`year`	Year of publication.	`integer` Ranges from `1954` to `2024`.
`text`	Paragraph text.	`character` Free text.
`line_id`	Running number for paragraphs in each document.	`integer` Ranges from `1` to `n` paragraphs.
`section`	The section of the document where the paragraph is found.	`character` Experimental. One of `presentation`, `grounds`, `costs`, `operative`, `footnotes`, `annex`.
`paragraph_type`	A basic categorization of paragraph types.	`character` Experimental. One of `heading`, `paragraph`, `quote`, `footnote`, `keywords`, `table`.
`paragraph_number`	The official paragraph number used in case citations. Note that some older decisions did not follow the numbering scheme which later became standard.	`integer` Paragraph numbers. `0` indicates no official paragraph number. Several paragraphs can belong to the same `paragraph_number`. Contains errors.
`nchar`	Number of characters in `text`.	`integer` Minimum `1`.
`html_class`	Paragraph html class in the source file.	`character` Classes.
`html_attr`	Paragraph html attribute in the source file.	`character` Attributes.

As the IUROPA Text Corpus remains work-in-progress, the information in this codebook as well as the underlying data are subject to change. The database version is included in the metadata of the iuropa_text.gz.parquet file.¹
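As a hedged illustration of how the codebook columns fit together, the Python sketch below retrieves the text of one officially numbered paragraph. It works on plain dictionaries rather than the parquet file, and the function name `cited_paragraph`, the document identifiers, and all row contents are invented for the example. It relies only on what the codebook states: `0` marks rows without an official number, and several rows can share one `paragraph_number`.

```python
def cited_paragraph(rows: list[dict], document_id: str, paragraph_number: int) -> str:
    """Return the text of one officially numbered paragraph of a document."""
    if paragraph_number == 0:
        raise ValueError("0 marks rows without an official paragraph number")
    matches = [
        row for row in rows
        if row["document_id"] == document_id
        and row["paragraph_number"] == paragraph_number
    ]
    # Several rows can share a number; line_id restores their order.
    matches.sort(key=lambda row: row["line_id"])
    return "\n".join(row["text"] for row in matches)


# Invented rows using the codebook's column names; not real corpus data.
rows = [
    {"document_id": "doc-A", "line_id": 1, "paragraph_number": 0, "text": "Heading"},
    {"document_id": "doc-A", "line_id": 3, "paragraph_number": 1, "text": "(a) first item"},
    {"document_id": "doc-A", "line_id": 2, "paragraph_number": 1, "text": "1 Opening sentence:"},
    {"document_id": "doc-B", "line_id": 2, "paragraph_number": 1, "text": "1 Other document."},
]
print(cited_paragraph(rows, "doc-A", 1))
```

Because the codebook notes that `paragraph_number` contains errors, results from a lookup like this are best checked against the source document.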

Release notes

Database versioning is encoded in the version number and the date of the release. Version 0.1 constituted the initial release of the database. Minor adjustments and updates are only reflected in the release date (currently 2024-07-30), not the version number (currently 0.3). Major changes are captured by the version number and documented below.

Version 0.3

Digitization of older documents completed using an OCR-to-GPT pipeline (`source == "gpt"`)
Where a full decision is missing, but a summary exists, the summary is used in the corpus (French documents only)
Added ten missing AG opinions (French only)
Additional manual corrections
The database now ships with variable html_class
Improvements to paragraph numbering, but due to persistent issues the number is no longer deleted from text

Version 0.2

Substantially improved processing of OCR’d text data
Introduced text standardization to the pipeline
More manually corrected documents
More extensive documentation, including packaged data versioning

Version 0.1

Initial release

Footnotes

1. Access the version number by running `attributes()` on the loaded data frame in R or `parquet.read_schema().metadata` on the file path in Python after importing the pyarrow parquet module.