Recovered in Translation: Efficient Pipeline for Automated Translation of Benchmarks and Datasets
Translating Benchmarks and Datasets for Multilingual LLMs
Our framework demonstrates that test-time compute strategies significantly enhance machine translation quality for multilingual benchmarks and datasets.
- Our proposed multi-sampling methods achieve the highest COMET scores on WMT24++ and FLORES benchmarks.
- LLM-as-a-judge evaluations show our translations are preferred over existing ones by a substantial margin.
- We trace a direct connection between the translation quality of benchmarks and the downstream performance of models evaluated on them.
Downstream Model Performance by Benchmark
Mid-sized models consistently achieve higher scores on our translations compared to existing benchmarks. Winogrande shows the largest improvement (+3.42%), followed by ARC-Challenge (+2.35%), Hellaswag (+1.63%), and MMLU (+0.94%).
Key Insight: Context-Aware Translation
Translating questions and answer options within the same prompt context is essential for sentence-completion tasks: it preserves the semantic relationship between the question and each option and prevents context mismatches that mislead models during evaluation. This addresses a common flaw in existing benchmarks.
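As an illustration, a context-aware prompt bundles the question and every option into one translation request. The prompt wording and the item format below are our own illustrative choices, not the paper's exact template:

```python
def build_context_prompt(item, target_lang):
    """Build a single prompt carrying the question and all answer options,
    so the model translates them with shared context (illustrative format)."""
    lines = [
        f"Translate this multiple-choice item into {target_lang}.",
        "Keep the semantic link between the question and every option intact.",
        "",
        f"Question: {item['question']}",
        "Options:",
    ]
    lines += [f"{label}) {opt}" for label, opt in zip("ABCD", item["options"])]
    return "\n".join(lines)

item = {
    "question": "Water boils at sea level at",
    "options": ["100 degrees C", "0 degrees C", "50 degrees C", "212 degrees F"],
}
prompt = build_context_prompt(item, "Greek")
```

Translating each option in isolation would lose the completion relationship that this single prompt preserves.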
Our Framework
We present a novel automated translation framework that utilizes Large Language Models with features adapted for both dataset and benchmark (QA/test) formats. The framework offers flexibility in methods and configurable parameters to optimize cost- and time-effectiveness.
Translation Methods
Our framework supports four translation methods, each with distinct advantages for different use cases:
SC (Self-Check)
Simple zero-shot translation with an optional self-correction pass. Best for high-resource languages where baseline translation quality is already sufficient.
Best-of-N
Samples N translation candidates at higher temperature and selects the best one based on LLM scoring. Cost-effective and language-agnostic.
USI
Universal Self-Improvement: samples N candidates, then combines the best features into a refined output. Only N+1 model calls, highly efficient and cost-optimized.
T-RANK
Translation Ranking: uses multi-round competitive ranking to identify subtle errors. Best for complex benchmarks requiring high accuracy.
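As a concrete illustration of the sampling-and-selection pattern, Best-of-N can be sketched as follows. Here `sample_fn` and `score_fn` are hypothetical stand-ins for the translation and judging model calls, not the framework's actual API:

```python
def best_of_n(source, sample_fn, score_fn, n=4, temperature=0.9):
    """Best-of-N sketch: draw n candidates at higher temperature,
    score each with an LLM judge, and keep the top-scoring one."""
    candidates = [sample_fn(source, temperature) for _ in range(n)]
    scores = [score_fn(source, cand) for cand in candidates]
    return candidates[scores.index(max(scores))]

# Deterministic toy stand-ins so the sketch runs end to end.
fake_outputs = iter(["draft one", "a longer, more fluent draft", "draft"])

def fake_sample(source, temperature):
    return next(fake_outputs)

def fake_score(source, candidate):
    return len(candidate)  # placeholder heuristic, not a real LLM judge

best = best_of_n("Hello", fake_sample, fake_score, n=3)
```

Any LLM client can be plugged in for the two stand-in functions; the selection logic itself is model-agnostic.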
T-RANK: Multi-Round Competitive Ranking
Our proposed T-RANK method addresses positional bias in LLM evaluation by systematically presenting candidates in different orders across multiple rounds.
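One way such order rotation can look in code (a minimal sketch: `rank_fn` stands in for an LLM judge that returns candidates ordered best to worst, and the rank-sum aggregation is our illustrative choice, not necessarily the paper's exact scheme):

```python
from collections import defaultdict

def t_rank_sketch(candidates, rank_fn, rounds=3):
    """Present candidates in a rotated order each round so no candidate
    always appears first, then sum per-round ranks (lower is better)."""
    totals = defaultdict(int)
    for r in range(rounds):
        shift = r % len(candidates)
        order = candidates[shift:] + candidates[:shift]  # rotate to counter positional bias
        for position, cand in enumerate(rank_fn(order)):
            totals[cand] += position
    return min(candidates, key=totals.__getitem__)

# Toy judge: prefers shorter candidates regardless of presentation order.
winner = t_rank_sketch(
    ["translation B (wordy)", "translation A", "translation C (very wordy)"],
    rank_fn=lambda order: sorted(order, key=len),
)
```

Because every candidate occupies every position across rounds, a judge that favors whichever option it sees first cannot systematically dominate the aggregate ranking.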
USI: Universal Self-Improvement
USI is a cost-efficient method that generates diverse candidates and synthesizes the best elements into an improved translation with only N+1 model calls.
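A minimal sketch of the N+1 call budget (the fusion prompt wording and the `sample_fn`/`fuse_fn` wrappers are assumptions for illustration, not the paper's exact prompts):

```python
def usi_sketch(source, sample_fn, fuse_fn, n=4):
    """USI sketch: n sampling calls plus one fusion call (n + 1 total)."""
    candidates = [sample_fn(source, temperature=0.9) for _ in range(n)]
    fusion_prompt = (
        "Combine the strongest elements of these candidate translations "
        "into one refined translation:\n"
        + "\n".join(f"{i + 1}. {c}" for i, c in enumerate(candidates))
    )
    return fuse_fn(fusion_prompt)

# Count model calls with toy stand-ins to verify the n + 1 budget.
calls = []

def fake_sample(source, temperature):
    calls.append("sample")
    return f"candidate {len(calls)}"

def fake_fuse(prompt):
    calls.append("fuse")
    return "refined translation"

result = usi_sketch("Hello", fake_sample, fake_fuse, n=4)
```

Unlike ranking-based selection, the fusion step can recombine strengths from several candidates rather than committing to a single one.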
Key Insight: Multiple Candidate Sampling
Sampling multiple translation candidates and combining them leads to improved performance. We also find that mixing prompts in English and target languages enhances the quality of translations.
Benchmark Coverage
We translate four popular benchmarks into eight Eastern and Southern European languages.
MMLU
15,858 samples across 57 subjects testing world knowledge and reasoning.
Hellaswag
10,042 samples for commonsense reasoning about everyday situations.
ARC-Challenge
2,291 grade-school science questions requiring complex reasoning.
Winogrande
1,267 pronoun resolution problems testing commonsense knowledge.
Covered Languages
Comparison with Existing Translations
We compare our translations against existing multilingual benchmarks using LLM-as-a-judge evaluations.
LLM-as-a-Judge: All Languages
Using Gemini-2.5-Flash as a judge, our T-RANK translations are consistently preferred over existing benchmarks across all languages, with win ratios ranging from 3:1 to 4:1.
Language-Specific Results
Our framework demonstrates consistent improvements across all eight target languages. Eastern and Southern European languages were chosen due to their complex grammatical features (case systems, grammatical gender, aspect-based verbs) that are sensitive to contextual misalignment.
Model Performance by Language
Average accuracy improvement across all benchmarks for each language. Greek shows the highest improvement (+3.89%), while languages with existing higher-quality resources show more modest gains. These improvements reflect both translation quality and the elimination of systematic errors in existing benchmarks.
Method Comparison Across Languages
While USI is better suited to translating short, simple dataset entries, T-RANK performs better on benchmarks with complex question structures. Detailed COMET scores for each language are below.
WMT24++ COMET Scores
FLORES COMET Scores
Key Insight: Different Models Need Different Translation Methods
We find that our proposed methods improve scores on machine translation benchmarks. We also observe that different methods perform best with different models, highlighting the importance of model-specific method selection.
Issues in Existing Translations
Existing multilingual benchmarks often contain systematic quality issues that affect evaluation reliability. Below are examples of common problems we identified and how our framework addresses them.
Translation Examples
Our methods identify and correct subtle translation errors through competitive ranking or self-improving fusion. Below are examples showing how the methods improve translation quality by comparing multiple candidates.
Citation
@article{yukhymenko2026recovered,
  title={Recovered in Translation: Efficient Pipeline for Automated Translation of Benchmarks and Datasets},
  author={Yukhymenko, Hanna and Alexandrov, Anton and Vechev, Martin},
  journal={arXiv preprint arXiv:2602.22207},
  year={2026},
}