Abstract

Standard AI translation metrics — BLEU, METEOR, chrF, TER, COMET — perform poorly on Inuktitut, a polysynthetic language whose morphological richness is not well captured by surface-similarity or reference-embedding measures. To assess translation quality for Inuktitut in a way that reflects how translations are actually read and understood, Heritage Lab developed a blind pairwise-comparison platform judged by community members and expert translators. This document reports the methodology and current results of that evaluation. On the leaderboard, the Heritage Lab translator holds the top position, with Bing Translate in second and Google Translate in third.

Background

Existing automatic evaluation metrics for AI translation were developed for high-resource languages with relatively rigid morphology. Inuktitut is polysynthetic: a single word can encode the meaning of a full English sentence through stacked morphemes expressing tense, agent, object, mood, and evidentiality. Two valid translations of the same reference can share very little surface form, and two translations that appear near-identical can differ in grammatical features that change the meaning. Metrics that reduce to n-gram overlap or reference-embedding similarity do not reliably track quality under these conditions. Human pairwise judgment, aggregated across many evaluators and many sentence pairs, remains the most reliable signal available for translation quality in low-resource, morphologically rich languages. Our translation evaluation platform was designed to collect this signal at scale for Inuktitut.

Methodology

Pairwise Comparison

Each evaluation presents the participant with a reference sentence and two candidate translations, labeled A and B. System identities are not shown. The participant selects one of four options: A Is Better, B Is Better, Both Are Good, or Both Are Bad.

[Figure: Pairwise comparison interface on the translation evaluation platform.]

Anonymizing system identity removes brand effects from the judgment. Participants authenticate with a name and PIN, which supports auditability.

Elo Rating

The platform aggregates pairwise judgments using the Elo rating system. Each system starts with a rating of 1000. The expected score for system A in a match against system B is

    E_A = 1 / (1 + 10^((R_B - R_A) / 400)),

and the post-match update is

    R_A' = R_A + K * (S_A - E_A),

where S_A ∈ {0, 0.5, 1} is the actual result and K = 32 is the scaling constant used throughout this evaluation. Both tie options (Both Are Good and Both Are Bad) are treated identically as draws; the distinction is retained only to capture evaluator feedback about absolute quality.
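The update rule above can be sketched in a few lines of Python. This is a minimal illustration of the formulas, not the platform's actual code; the function names are ours.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Expected score of system A against system B (Elo, 400-point scale)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, s_a: float, k: float = 32.0) -> tuple[float, float]:
    """Apply one match result and return the new (R_A', R_B').

    s_a is 1 when A is preferred, 0 when B is preferred, and 0.5 for
    either tie option (Both Are Good or Both Are Bad).
    """
    e_a = expected_score(r_a, r_b)
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

# Example: both systems start at 1000 and A is preferred once.
# Equal ratings give E_A = 0.5, so the winner gains K * 0.5 = 16 points
# and the loser gives up the same 16.
ra, rb = update(1000, 1000, 1.0)  # ra == 1016.0, rb == 984.0
```

Because the update is zero-sum and weighted by expectation, an upset win against a much higher-rated system moves both ratings further than an expected win does.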

Reliability Target

The platform targets a minimum of fifty matches per system before any result is treated as stable. The three systems reported here have between 37 and 38 matches each and have not yet reached that threshold, so the leaderboard below should be read as a preliminary estimate.

Results

Leaderboard

Rank  System                   Elo rating  Matches  Wins  Losses
1     Heritage Lab (JAN-20-M)  1060        37       26    11
2     Bing Translate           1005        38       15    23
3     Google Translate         951         37       15    22

The Heritage Lab system leads the leaderboard by 55 Elo points over Bing Translate and by 109 Elo points over Google Translate.

Observed Win Rates

Across all matches on the platform, the three systems recorded the following aggregate win rates: Heritage Lab 70.3%, Bing 39.5%, Google 40.5%. Note that Google's raw win rate is slightly higher than Bing's even though Bing holds the higher rating: Elo weights each result by opponent strength, so raw win rates and ratings need not order systems identically. The observed rates are otherwise consistent with the Elo separation between systems.
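The aggregate rates can be recomputed directly from the wins and match counts in the leaderboard table. A quick check (figures copied from the table above; this is not platform code):

```python
# (wins, matches) per system, taken from the leaderboard table.
records = {
    "Heritage Lab": (26, 37),
    "Bing":         (15, 38),
    "Google":       (15, 37),
}

# Raw win rate as a percentage, rounded to one decimal place.
rates = {name: round(100 * wins / matches, 1)
         for name, (wins, matches) in records.items()}
# rates == {"Heritage Lab": 70.3, "Bing": 39.5, "Google": 40.5}
```

A raw win rate ignores who the wins came against, which is exactly the information the Elo aggregation adds back in.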

Interpretation of Elo Differences

The Elo Scale Is Logarithmic

Elo differences do not correspond linearly to quality differences. Under the standard Elo formulation with a 400-point scale parameter, each additional 100 rating points multiplies the odds that the higher-rated system is preferred by a factor of 10^(100/400) ≈ 1.78, and a full 400-point gap multiplies the odds by 10. Small-looking differences in raw rating therefore correspond to substantial differences in observed preference frequency over repeated trials.
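The scale can be made concrete by plugging rating gaps into the expected-score formula from the Methodology section. The figures below are illustrative; the 1060-vs-1005 pairing mirrors the current Heritage Lab and Bing Translate ratings.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Expected score of system A against system B (Elo, 400-point scale)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

# Each 100 rating points multiplies the preference odds by this factor.
odds_factor = 10 ** (100 / 400)     # ~1.78

p_100 = expected_score(1100, 1000)  # 100-point gap -> ~64% expected score
p_55  = expected_score(1060, 1005)  # 55-point gap  -> ~58% expected score
```

So the current 55-point lead corresponds to the higher-rated system being preferred in roughly 58 of 100 head-to-head matches, not to a near-certain win in any single match.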

Relative, Not Absolute

An Elo score is a relative quantity. It expresses how often a system is preferred in comparison to the other systems on the platform, on the sentences currently included in the evaluation. It does not assign an absolute quality value, and cross-leaderboard comparisons (for example, to Elo scores from chess or from other NLP arenas) are not meaningful.

Limitations

The results reported here reflect a single snapshot of an ongoing evaluation. Elo rankings are inherently dependent on match distribution; under sparse or uneven sampling they can fluctuate more than the underlying quality of the systems would suggest. The reported leaderboard should be treated as the current best estimate from the evaluation platform, not as a final ranking. Results will continue to evolve as additional evaluations are collected and as the platform is extended to cover additional dialects and sentence types.

Next Steps

Ongoing and planned work on the evaluation platform includes increasing the number of evaluators per system, publishing dialect-specific leaderboards as volume permits, expanding the reference sentence set to cover additional domains and registers, and feeding evaluator feedback into the continued development of the Heritage Lab translator.

Participation

Inuktitut speakers, language keepers, and expert translators are invited to contribute evaluations through our translation evaluation platform. Broader participation improves the statistical reliability of the leaderboard and supports community ownership of the ranking.

Contributors

  • Shaun Annanack, Siasie Ilisituk — translation testing
  • Anissa Jean, Ali Mehdi — statistics & platform

Leaderboard figures reflect the evaluation snapshot from April 15, 2026. The platform is continuously updated; consult the live leaderboard for current figures.