Open Source LLMs: A Critical Component of Europe's Digital Sovereignty Agenda

In a significant development for Europe's digital sovereignty, a new collaborative project has set out to build a series of truly open source large language models (LLMs) covering all European Union languages.

OpenEuroLLM: A Broad Collaboration

OpenEuroLLM is a consortium of approximately 20 organizations, co-led by Jan Hajič, a computational linguist at Charles University in Prague, and Peter Sarlin, co-founder of the Finnish AI lab Silo AI, which AMD acquired last year.

This project aligns with the broader European strategy of prioritizing digital sovereignty, aiming to bring critical infrastructure and tools closer to home. The EU has been pushing for cloud providers to invest in local infrastructure and retain data within Europe. AI pioneer OpenAI also recently introduced a service to process and store data in Europe.

Budget and Resources

The budget dedicated to developing the models themselves is €37.4 million, of which €20 million comes from the EU's Digital Europe Programme. While this is a significant investment, it pales in comparison to the sums tech giants have invested in AI development. The overall budget is larger once funding for related work is factored in, and compute is a major expense.

Partnerships and Collaborations

OpenEuroLLM partners with EuroHPC supercomputer centers in Spain, Italy, Finland, and the Netherlands, which are part of the broader EuroHPC project with a budget of €7 billion.

Challenges and Concerns

The sheer number and diversity of participating entities, spanning academia, research, and industry, have raised questions about the project's feasibility. Critics have expressed concerns about whether such a large consortium can maintain the focus and efficiency of smaller, privately owned AI firms.

Building on Existing Foundations

OpenEuroLLM leverages the groundwork laid by the High Performance Language Technologies (HPLT) project, which has developed freely accessible datasets, models, and tools using high-performance computing (HPC). HPLT, coordinated by Hajič, is scheduled to conclude in late 2025 and is seen as a "predecessor" to OpenEuroLLM.

Goals and Timeline

The project aims to release the first version of its LLMs by mid-2026, with the final iterations expected by 2028. However, Hajič acknowledges that there is still much groundwork to be done.

Scope and Focus

OpenEuroLLM aims to create a multilingual LLM for general-purpose tasks, as well as smaller, "quantized" versions for edge applications where efficiency is critical. The project places a strong emphasis on preserving the linguistic and cultural diversity of European languages.
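For readers unfamiliar with the term, quantization stores a model's weights at lower numeric precision so the model needs less memory and compute, at the cost of a small rounding error. The Python sketch below illustrates one common scheme, symmetric 8-bit quantization with a single shared scale. It is an illustration only, not OpenEuroLLM's method; the function names and sample weights are hypothetical.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale.

    Illustrative helper (hypothetical name), not taken from any project.
    """
    max_abs = max((abs(w) for w in weights), default=0.0)
    # Avoid dividing by zero when every weight is 0.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantized]


# Made-up sample weights, chosen only to show the round trip.
weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Each restored value differs from the original by at most half a step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight is stored as a one-byte integer instead of a four-byte float, which is where the memory saving for edge devices comes from.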

Data and Open Source Definition

The project will use datasets from the HPLT project and Common Crawl. However, the definition of "open source" in AI is still evolving, and OpenEuroLLM faces the challenge of reconciling the need for transparency with data privacy and copyright regulations.

Funding and Collaborations

The project's funding comes solely from the EU, which limits collaboration with entities outside the bloc, including U.K. universities.

Comparison to EuroLLM

OpenEuroLLM has drawn comparisons to EuroLLM, another European LLM project that launched in 2024. Because the two projects share similar goals, the OpenEuroLLM team hopes to foster cooperation and avoid duplication of effort.

Digital Sovereignty

Ultimately, OpenEuroLLM's primary objective is to establish digital sovereignty for Europe, providing open-source LLMs that are both accessible and under European control. This aligns with the EU's broader strategy to reduce dependency on foreign technology and protect its digital infrastructure.