

“Sprichst du Deutsch?” – German proficiency LLM benchmarking

Why we’ve deployed Llama 3.1 70B on Continuum AI

Frederik Stump



Why do we need German-speaking public LLMs?


In non-English-speaking countries, companies and government agencies have a growing need for AI tools that work well in their native languages. While many LLMs support languages like German, they often produce grammatical errors or mistranslations, or their output is generally of lesser quality than in English.

Many workplaces ban popular AI tools like ChatGPT and DeepL due to data security concerns, fearing the potential leak of sensitive or classified information. However, banning these services isn’t a secure solution, as employees often resort to using them on personal devices—this is known as “shadow AI,” which poses significant security risks to organizations’ data.


To address these concerns, we built Continuum AI, a platform that uses confidential computing technology to run AI models in an isolated environment, ensuring prompts and responses are always encrypted. This makes chatbot applications safe for sensitive data: neither we, the model provider, nor the cloud provider can view the data. Learn more about Continuum AI here.


Edgeless Systems works with public sector, healthcare, and financial services clients in Germany who need state-of-the-art encrypted LLM applications that perform just as well in German as in English. We identified and tested the most relevant LLMs on the market.


This blog post will cover the requirements for LLMs to run on Continuum AI, our approach to selecting suitable LLMs, and our methodology for evaluating the German proficiency of these models. Finally, we’ll share the results of our benchmarking and why we chose to deploy the Llama 3.1 70B AWQ model on Continuum AI.


Which LLMs can we deploy on Continuum AI?


Continuum AI is an open-source platform that runs on public or private clouds, always keeping prompts and responses encrypted. To be deployed on Continuum AI, LLMs must meet two technical requirements:

  1. They must be publicly available: We need to download the model weights to run them on our platform. This excludes proprietary models like those from OpenAI or Anthropic.
  2. Single GPU (multi-GPU support coming soon!): Currently, a model must fit on a single Nvidia H100 GPU, which limits us to around 30 billion parameters. However, we can deploy quantized models (compressed LLMs that take up less memory), allowing us to use advanced models like Meta’s 70B versions (see the serving sketch below). Our tests showed that the quality loss from quantization is minimal.
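
To illustrate what serving such a quantized model can look like, here is a minimal sketch using vLLM. The Hugging Face repository name, context length, and sampling settings are assumptions for illustration, not Continuum’s exact configuration:

```python
# Minimal sketch: serving an AWQ-quantized Llama 3.1 70B on a single H100.
# Repository name, context length, and sampling settings are illustrative
# assumptions, not Continuum's production configuration.
from vllm import LLM, SamplingParams

llm = LLM(
    model="hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",  # assumed repo
    quantization="awq",  # 4-bit weights: roughly 35 GB instead of ~140 GB in FP16
    max_model_len=8192,  # keep the KV cache within the H100's 80 GB
)

params = SamplingParams(temperature=0.2, max_tokens=256)
outputs = llm.generate(["Fasse den folgenden Text kurz zusammen: ..."], params)
print(outputs[0].outputs[0].text)
```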

Llama 3.1 70B: standard vs. quantized version


Methodology: How we benchmarked and tested the LLMs

To identify the best German-proficient open-source LLMs, we first reviewed existing benchmarks. LLM evaluations are usually either human-driven or AI-driven.

LMSYS Org’s Chatbot Arena uses crowdsourcing to create a live leaderboard: users rate answers before knowing which models produced them. Another common evaluation method involves standardized tests like MMLU (Massive Multitask Language Understanding) 5-shot, which is used in model cards on platforms like Hugging Face. MMLU comprises around 16,000 multiple-choice questions across 57 subjects, including mathematics, philosophy, law, and medicine. The 5-shot methodology is based on a paper by Hendrycks et al. (2020): five solved example questions are prepended to each prompt to guide the LLM’s answer.
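
To make the format concrete, here is a rough sketch of how such a 5-shot prompt is assembled (our paraphrase of the Hendrycks et al. setup; the helper name and data layout are illustrative):

```python
# Rough sketch of MMLU-style 5-shot prompt construction (paraphrasing
# Hendrycks et al., 2020; function name and data layout are illustrative).
def build_5shot_prompt(dev_examples, question, choices):
    """Prepend five solved example questions before the actual test question."""
    prompt = "The following are multiple choice questions (with answers).\n\n"
    for q, opts, answer in dev_examples[:5]:
        prompt += q + "\n"
        prompt += "".join(f"{letter}. {opt}\n" for letter, opt in zip("ABCD", opts))
        prompt += f"Answer: {answer}\n\n"
    prompt += question + "\n"
    prompt += "".join(f"{letter}. {opt}\n" for letter, opt in zip("ABCD", choices))
    prompt += "Answer:"  # the model is expected to complete with A, B, C, or D
    return prompt
```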

We focused on MMLU 5-shot scores to shortlist relevant models. We then created our own German test prompts covering tasks our clients face, such as translations, summarizations, explanations of legal texts, and drafting emails, since standard benchmarks aren’t designed to capture language accuracy and grammatical mistakes. Our scoring scheme rated responses on Content Accuracy, Linguistic Clarity, and Adaptation to Context, resulting in an overall score from 1 to 5. GPT-4o acted as a judge and scored the responses to ensure consistency, and we did multiple runs. We checked the initial answers manually to validate the automated scoring.
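
As a sketch of what this LLM-as-a-judge setup can look like (assuming the OpenAI Python SDK; the rubric wording and the integer parsing are simplified illustrations, not our exact evaluation code):

```python
# Sketch of GPT-4o-as-a-judge scoring. Rubric wording and parsing are
# simplified illustrations, not our exact evaluation pipeline.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

RUBRIC = (
    "Rate the following German response on a scale of 1-5, considering "
    "content accuracy, linguistic clarity, and adaptation to context. "
    "Reply with the overall score as a single integer."
)

def judge(task: str, response: str) -> int:
    completion = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # keep the scoring as deterministic as possible
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Task: {task}\n\nResponse: {response}"},
        ],
    )
    return int(completion.choices[0].message.content.strip())
```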


Evaluation benchmarks


Our results aligned closely with the MMLU 5-shot rankings, placing the quantized Llama 3.1 70B at the top of our list. For context, we also tested the prompts with ChatGPT-4o and the standard Llama 3.1 70B. ChatGPT-4o scored highest, though this result has to be taken with a pinch of salt, since LLMs can be biased when rating their own responses.

The outstanding results of the quantized Llama 3.1 70B led us to deploy it on Continuum AI.

We even found that, with a suitable system prompt, the language mistakes could mostly be eliminated: across our set of 10 test prompts, this approach reduced the number of mistakes to a single one (see the Continuum docs).
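
As a sketch, such a system prompt can be set through any OpenAI-compatible chat client. The endpoint URL, model name, and prompt wording below are illustrative placeholders; see the Continuum docs for the exact prompt:

```python
# Sketch: pinning a German system prompt via an OpenAI-compatible client.
# Endpoint URL, model name, and prompt wording are illustrative
# placeholders; see the Continuum docs for the exact prompt.
from openai import OpenAI

client = OpenAI(base_url="https://<continuum-endpoint>/v1", api_key="<key>")

SYSTEM_PROMPT = (
    "Du bist ein hilfreicher Assistent. Antworte ausschließlich auf "
    "Deutsch, grammatikalisch korrekt und in natürlichem Stil."
)

response = client.chat.completions.create(
    model="llama-3.1-70b-awq",  # hypothetical model identifier
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Erkläre kurz, was Confidential Computing ist."},
    ],
)
print(response.choices[0].message.content)
```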

Give it a try and chat with encrypted prompts at ai.confidential.cloud (don’t hesitate to speak German!). You can also join the waitlist on the same page to get enterprise access, or contact our experts to talk about Confidential AI or to learn more about our benchmarking method and results!


Author: Frederik Stump

