Recovered in Translation: Efficient Pipeline for Automated Translation of Benchmarks and Datasets
Translating Benchmarks and Datasets for Multilingual LLMs
Our framework demonstrates that test-time compute strategies significantly enhance machine translation quality for multilingual benchmarks and datasets.
- Our proposed multi-sampling methods achieve the highest COMET scores on WMT24++ and FLORES benchmarks.
- LLM-as-a-judge evaluations show our translations are preferred over existing ones by a substantial margin.
- We trace a direct connection between the translation quality of benchmarks and the downstream performance of models evaluated on them.
Downstream Model Performance by Benchmark
Mid-sized models consistently achieve higher scores on our translations compared to existing benchmarks. Winogrande shows the largest improvement (+3.42%), followed by ARC-Challenge (+2.35%), Hellaswag (+1.63%), and MMLU (+0.94%).
Key Insight: Context-Aware Translation
Translating questions and answer options within the same prompt context is essential for sentence-completion tasks: it preserves the semantic relationship between the question and each option and prevents context mismatches that mislead models during evaluation. This addresses a common flaw in existing benchmarks.
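As an illustration, a context-aware prompt bundles the question and every option into one translation request. The prompt wording and the item format below are our own illustrative choices, not the paper's exact template:

```python
def build_context_prompt(item, target_lang):
    """Build a single prompt carrying the question and all answer options,
    so the model translates them with shared context (illustrative format)."""
    lines = [
        f"Translate this multiple-choice item into {target_lang}.",
        "Keep the semantic link between the question and every option intact.",
        "",
        f"Question: {item['question']}",
        "Options:",
    ]
    lines += [f"{label}) {opt}" for label, opt in zip("ABCD", item["options"])]
    return "\n".join(lines)

item = {
    "question": "Water boils at sea level at",
    "options": ["100 degrees C", "0 degrees C", "50 degrees C", "212 degrees F"],
}
prompt = build_context_prompt(item, "Greek")
```

Translating each option in isolation would lose the completion relationship that this single prompt preserves.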
Our Framework
We present a novel automated translation framework that utilizes Large Language Models with features adapted for both dataset and benchmark (QA/test) formats. The framework offers flexibility in methods and configurable parameters to optimize cost- and time-effectiveness.
Translation Methods
Our framework supports four translation methods, each with distinct advantages for different use cases:
SC (Self-Check)
Simple zero-shot translation with an optional self-correction pass. Best for high-resource languages where baseline translation quality is already sufficient.
Best-of-N
Samples N translation candidates at higher temperature and selects the best one based on LLM scoring. Cost-effective and language-agnostic.
USI
Universal Self-Improvement: samples N candidates, then combines the best features into a refined output. Only N+1 model calls, highly efficient and cost-optimized.
T-RANK
Translation Ranking: uses multi-round competitive ranking to identify subtle errors. Best for complex benchmarks requiring high accuracy.
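As a concrete illustration of the sampling-and-selection pattern, Best-of-N can be sketched as follows. Here `sample_fn` and `score_fn` are hypothetical stand-ins for the translation and judging model calls, not the framework's actual API:

```python
def best_of_n(source, sample_fn, score_fn, n=4, temperature=0.9):
    """Best-of-N sketch: draw n candidates at higher temperature,
    score each with an LLM judge, and keep the top-scoring one."""
    candidates = [sample_fn(source, temperature) for _ in range(n)]
    scores = [score_fn(source, cand) for cand in candidates]
    return candidates[scores.index(max(scores))]

# Deterministic toy stand-ins so the sketch runs end to end.
fake_outputs = iter(["draft one", "a longer, more fluent draft", "draft"])

def fake_sample(source, temperature):
    return next(fake_outputs)

def fake_score(source, candidate):
    return len(candidate)  # placeholder heuristic, not a real LLM judge

best = best_of_n("Hello", fake_sample, fake_score, n=3)
```

Any LLM client can be plugged in for the two stand-in functions; the selection logic itself is model-agnostic.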
T-RANK: Multi-Round Competitive Ranking
Our proposed T-RANK method addresses positional bias in LLM evaluation by systematically presenting candidates in different orders across multiple rounds.
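One way such order rotation can look in code (a minimal sketch: `rank_fn` stands in for an LLM judge that returns candidates ordered best to worst, and the rank-sum aggregation is our illustrative choice, not necessarily the paper's exact scheme):

```python
from collections import defaultdict

def t_rank_sketch(candidates, rank_fn, rounds=3):
    """Present candidates in a rotated order each round so no candidate
    always appears first, then sum per-round ranks (lower is better)."""
    totals = defaultdict(int)
    for r in range(rounds):
        shift = r % len(candidates)
        order = candidates[shift:] + candidates[:shift]  # rotate to counter positional bias
        for position, cand in enumerate(rank_fn(order)):
            totals[cand] += position
    return min(candidates, key=totals.__getitem__)

# Toy judge: prefers shorter candidates regardless of presentation order.
winner = t_rank_sketch(
    ["translation B (wordy)", "translation A", "translation C (very wordy)"],
    rank_fn=lambda order: sorted(order, key=len),
)
```

Because every candidate occupies every position across rounds, a judge that favors whichever option it sees first cannot systematically dominate the aggregate ranking.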
USI: Universal Self-Improvement
USI is a cost-efficient method that generates diverse candidates and synthesizes the best elements into an improved translation with only N+1 model calls.
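A minimal sketch of the N+1 call budget (the fusion prompt wording and the `sample_fn`/`fuse_fn` wrappers are assumptions for illustration, not the paper's exact prompts):

```python
def usi_sketch(source, sample_fn, fuse_fn, n=4):
    """USI sketch: n sampling calls plus one fusion call (n + 1 total)."""
    candidates = [sample_fn(source, temperature=0.9) for _ in range(n)]
    fusion_prompt = (
        "Combine the strongest elements of these candidate translations "
        "into one refined translation:\n"
        + "\n".join(f"{i + 1}. {c}" for i, c in enumerate(candidates))
    )
    return fuse_fn(fusion_prompt)

# Count model calls with toy stand-ins to verify the n + 1 budget.
calls = []

def fake_sample(source, temperature):
    calls.append("sample")
    return f"candidate {len(calls)}"

def fake_fuse(prompt):
    calls.append("fuse")
    return "refined translation"

result = usi_sketch("Hello", fake_sample, fake_fuse, n=4)
```

Unlike ranking-based selection, the fusion step can recombine strengths from several candidates rather than committing to a single one.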
Key Insight: Multiple Candidate Sampling
Sampling multiple translation candidates and combining them leads to improved performance. We also find that mixing prompts in English and target languages enhances the quality of translations.
Benchmark Coverage
We translate four popular benchmarks into eight Eastern and Southern European languages.
MMLU
15,858 samples across 57 subjects testing world knowledge and reasoning.
Hellaswag
10,042 samples for commonsense reasoning about everyday situations.
ARC-Challenge
2,291 grade-school science questions requiring complex reasoning.
Winogrande
1,267 pronoun resolution problems testing commonsense knowledge.
Covered Languages
Comparison with Existing Translations
We compare our translations against existing multilingual benchmarks using LLM-as-a-judge evaluations.
LLM-as-a-Judge: All Languages
Using Gemini-2.5-Flash as a judge, our T-RANK translations are consistently preferred over existing benchmarks across all languages, with win ratios ranging from 3:1 to 4:1.
Language-Specific Results
Our framework demonstrates consistent improvements across all eight target languages. Eastern and Southern European languages were chosen due to their complex grammatical features (case systems, grammatical gender, aspect-based verbs) that are sensitive to contextual misalignment.
Model Performance by Language
Average accuracy improvement across all benchmarks for each language. Greek shows the highest improvement (+3.89%), while languages with existing higher-quality resources show more modest gains. These improvements reflect both translation quality and the elimination of systematic errors in existing benchmarks.
Method Comparison Across Languages
While USI is better suited to translating short, simple dataset entries, T-RANK performs better on benchmarks with complex question structures. Detailed COMET scores for each language are below.
WMT24++ COMET Scores
FLORES COMET Scores
Key Insight: Different Models Need Different Translation Methods
We find that our proposed methods improve scores on machine translation benchmarks. We also observe that different methods perform best with different models, highlighting the importance of model-specific method selection.
Issues in Existing Translations
Existing multilingual benchmarks often contain systematic quality issues that affect evaluation reliability. Below are examples of common problems we identified and how our framework addresses them.
Translation Examples
Our methods identify and correct subtle translation errors through competitive ranking or self-improving fusion. Below are examples showing how the methods improve translation quality by comparing multiple candidates.
Citation
@article{yukhymenko2026recovered,
  title={Recovered in Translation: Efficient Pipeline for Automated Translation of Benchmarks and Datasets},
  author={Yukhymenko, Hanna and Alexandrov, Anton and Vechev, Martin},
  journal={arXiv preprint arXiv:2602.22207},
  year={2026},
}