
How context-aware AI translation works for product strings

Mar 12, 2026 7 min read

[Image: abstract visualization of a string context graph connecting source strings across 19 locale columns]

Every localization engineer has seen it: a generic machine translation pipeline that produces technically correct output for simple strings and completely wrong output for everything else. The translation of "Save" becomes "Sauvegarder" in French, which is fine for a button. The translation of "Save changes to {{document_name}}?" strips the variable and returns "Enregistrer les modifications apportées?" with a dangling question mark and no binding point for the template engine. That failure mode isn't a language quality problem. It's a context problem.

The root issue with string-by-string MT is that each string is processed as an isolated sentence. The model receives "Save changes to {{document_name}}?" and has no information about what came before it, what comes after it, what UI element it lives in, or what character budget it has to work within. The output may be fluent French — and still be entirely wrong for the product.

What "context" actually means in product string translation

When localization engineers talk about context, they mean several distinct things that often get collapsed into one word. It's worth separating them because they require different solutions:

Linguistic context is the relationship between strings in a file. "Delete" by itself is ambiguous — is it a button label, a confirmation message subject, a menu item? If the surrounding strings include "Are you sure you want to delete this file?" and "This action cannot be undone," the model can infer register and tone with much more accuracy. Without that window, "Delete" might be translated as a verb infinitive in one locale and a command form in another, producing inconsistency across the same UI flow.

UI context is what kind of element the string appears on. A button label has different constraints than a tooltip. A dialog title has different length expectations than a menu option. This information is often available in the source file — .xliff files carry restype attributes, iOS .strings comments often include UI element descriptions, and well-maintained .po files have #. extractor comments. A context-aware pipeline reads these signals rather than discarding them as metadata.
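As a rough illustration of reading those signals instead of discarding them, the sketch below pulls extracted comments (`#.`) and `msgctxt` out of a single .po entry. The function name `read_po_hints` and the returned dictionary shape are assumptions made for this example, and the parser deliberately ignores multi-line strings and escape sequences that a real .po reader would have to handle.

```python
def _unquote(value):
    """Strip one pair of surrounding double quotes, if present."""
    value = value.strip()
    if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
        return value[1:-1]
    return value


def read_po_hints(entry):
    """Collect context hints from one .po entry (hypothetical helper).

    Only single-line msgctxt/msgid values are handled; this is a sketch,
    not a complete gettext parser.
    """
    hints = {"comments": [], "msgctxt": None, "msgid": None}
    for raw in entry.splitlines():
        line = raw.strip()
        if line.startswith("#."):
            # "#." marks an extracted comment written by the developer.
            hints["comments"].append(line[2:].strip())
        elif line.startswith("msgctxt "):
            hints["msgctxt"] = _unquote(line[len("msgctxt "):])
        elif line.startswith("msgid "):
            hints["msgid"] = _unquote(line[len("msgid "):])
    return hints


entry = (
    '#. Button label on the delete confirmation dialog\n'
    '#: src/dialog.js:42\n'
    'msgctxt "button"\n'
    'msgid "Delete"\n'
    'msgstr ""'
)
hints = read_po_hints(entry)
```

The hints can then travel with the string into the translation request, so the model sees "Delete" together with the fact that it is a button label.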

Variable and interpolation context is understanding which parts of a string are placeholders that must be preserved verbatim. {{first_name}}, %s, {count, plural, one {# item} other {# items}}, and %(username)s are all syntactically different ways to say "do not translate this token." Generic MT does not reliably distinguish content from code. A context-aware system normalizes placeholders before translation and re-injects them after, so the output structure is always {translated prefix} {{first_name}} {translated suffix}, not a translated string with the variable removed or renamed.

Character limit context is perhaps the most mechanical of the four, but it's the one that causes the most visible failures. If a button in your design has a 30-character limit and the German translation of "Continue to payment page" is "Weiter zur Zahlungsseite" at 24 characters, great. But "Proceed to checkout and confirm" is already 31 characters in English, over the limit before translation even starts, and translations into some languages routinely run 30–40% longer than the English source. A system without character limit awareness doesn't know to reach for a shorter equivalent form when one is available in the target language.
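The expansion arithmetic can be turned into a simple pre-translation screen. The sketch below is only a heuristic: the function name `at_risk` and the default 40% expansion factor are assumptions for illustration, and `len()` counts code points rather than rendered width.

```python
def at_risk(source, max_chars, expansion=0.40):
    """Flag a source string whose projected translation may exceed the limit.

    `expansion` is an assumed worst-case growth factor, not a measured value.
    """
    projected = len(source) * (1 + expansion)
    return projected > max_chars


flag_long = at_risk("Continue to payment page", max_chars=30)
flag_short = at_risk("Save", max_chars=30)
```

A flagged string is not a prediction of failure. "Continue to payment page" is flagged under the 40% assumption even though its German translation fits, which is the point of a screen: it marks strings that deserve a length-aware translation pass.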

How context window construction works in practice

Consider a mobile checkout flow with 47 strings in a single screen. A context-aware translation system doesn't process each string in isolation; it first builds a context window: an ordered view of neighboring strings, their UI element types, any developer comments, and the character limit annotations on the containing key.

For a concrete scenario: an early-stage productivity app shipped a Japanese localization of its onboarding flow in early 2024. The English string set included a sequence starting with "Welcome back," followed by "You have {{count}} unread notifications," then "Tap to review." Translated in isolation, each string was grammatically correct Japanese. But "Welcome back" in Japanese requires a different honorific register depending on what follows it, and "Tap to review" is an instruction that reads as abrupt in isolation but natural when the reader has just been told they have notifications waiting. The isolated MT produced three strings that were each individually acceptable and collectively awkward. A context window covering all three strings in sequence produces output where the register is consistent and the instruction lands naturally.

The window size question is non-trivial. Too narrow (3–5 strings) and you miss document-level register signals. Too wide (the entire file) and you introduce irrelevant context that confuses more than it helps. In practice, a sliding window of 10–15 surrounding strings, weighted toward immediately adjacent ones, tends to produce the best balance between coherence and noise.
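A minimal sketch of such a window follows, assuming a radius of 6 strings on each side (up to 12 neighbors) and a weight of 1/distance. Both the radius and the weighting function are illustrative choices, not values taken from any particular system.

```python
def context_window(strings, index, radius=6):
    """Return (position, text, weight) for the neighbors of strings[index].

    Weight falls off as 1/distance, so adjacent strings count most.
    The target string itself is excluded from the window.
    """
    start = max(0, index - radius)
    end = min(len(strings), index + radius + 1)
    window = []
    for position in range(start, end):
        if position == index:
            continue
        distance = abs(position - index)
        window.append((position, strings[position], 1.0 / distance))
    return window


flow = [
    "Welcome back",
    "You have {{count}} unread notifications",
    "Tap to review",
]
neighbors = context_window(flow, 1)
```

For the three-string onboarding sequence, the middle string gets both of its neighbors at full weight. Near the start or end of a file the window simply shrinks rather than wrapping around.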

Variable preservation: the mechanics

Placeholder handling is where many pipelines fail silently. The failure is silent because the translated string often looks correct — it's only when the template engine tries to bind the variable at runtime that the error surfaces, usually as a crash, a blank string, or an untranslated literal like "{{first_name}}".
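One way to make that failure loud at build time instead of at runtime is to compare the placeholders in the source and the translation. The sketch below is an assumption-laden example: the regular expression covers only Mustache, Python-style, simple printf, and .NET index placeholders, and the function name `missing_placeholders` is made up for illustration.

```python
import re
from collections import Counter

# Illustrative subset of placeholder syntaxes; ICU messages are not covered.
PLACEHOLDER_RE = re.compile(
    r"\{\{\s*\w+\s*\}\}"   # Mustache: {{first_name}}
    r"|%\(\w+\)[sd]"       # Python-style: %(username)s
    r"|%[sd]"              # printf-style: %s, %d
    r"|\{\d+\}"            # .NET-style: {0}
)


def missing_placeholders(source, translated):
    """Return placeholders present in the source but absent in the translation."""
    expected = Counter(PLACEHOLDER_RE.findall(source))
    found = Counter(PLACEHOLDER_RE.findall(translated))
    return sorted((expected - found).elements())


dropped = missing_placeholders(
    "Save changes to {{document_name}}?",
    "Enregistrer les modifications apportées?",
)
```

Using a `Counter` rather than a set means a string with two `%s` placeholders is still flagged when the translation keeps only one of them.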

Robust variable preservation requires a normalization step before the string reaches the translation layer. All known placeholder syntaxes (ICU message format, printf-style %s/%d, Mustache {{var}}, Python-style %(key)s, and .NET {0}) are identified and replaced with language-neutral tokens like ⟨VAR_0⟩, ⟨VAR_1⟩. The translation model sees "Hello, ⟨VAR_0⟩! You have ⟨VAR_1⟩ messages." and has no ambiguity about what to preserve. After translation, the tokens are swapped back to the original syntax. They keep their original order unless the target language's grammar requires the arguments to be reordered, which is a real requirement for some language pairs, particularly when the target language has SOV word order and the arguments land in different positions in the sentence.
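The normalize-then-restore round trip might look like the following sketch. The helper names `normalize` and `restore` are hypothetical, the regular expression handles only a subset of the syntaxes listed above (no ICU), and the translation step itself is left out.

```python
import re

# Illustrative subset: Mustache, Python-style, printf %s/%d, .NET {0}.
PLACEHOLDER_RE = re.compile(r"\{\{\s*\w+\s*\}\}|%\(\w+\)[sd]|%[sd]|\{\d+\}")
TOKEN_RE = re.compile(r"⟨VAR_(\d+)⟩")


def normalize(text):
    """Replace each placeholder with a neutral token; return text and originals."""
    originals = []

    def swap(match):
        originals.append(match.group(0))
        return f"⟨VAR_{len(originals) - 1}⟩"

    return PLACEHOLDER_RE.sub(swap, text), originals


def restore(translated, originals):
    """Swap tokens back, in whatever order the translation placed them."""
    seen = sorted(int(n) for n in TOKEN_RE.findall(translated))
    if seen != list(range(len(originals))):
        raise ValueError("translation dropped, duplicated, or invented a token")
    return TOKEN_RE.sub(lambda m: originals[int(m.group(1))], translated)


masked, originals = normalize("Hello, {{first_name}}! You have %d messages.")
reordered = restore("⟨VAR_1⟩ Nachrichten für ⟨VAR_0⟩", originals)
```

Because each token carries its index, the translation is free to reorder them and `restore` still binds the right placeholder. One caveat this sketch does not handle: unnamed printf-style placeholders are bound by position at runtime, so reordering two of them would also require rewriting them into positional form.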

ICU message format adds another layer of complexity because the branches of a plural selector contain content that must itself be translated. {count, plural, one {# item} other {# items}} contains two translatable strings inside a structural shell. A context-aware system must recognize the ICU envelope, extract the inner messages, translate each plural form separately against the CLDR plural rules for the target locale, and reconstruct the ICU envelope with the correct number of forms for that locale, which for Arabic is six, not two.
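The extract-and-rebuild step can be sketched as below, under several stated assumptions: the parser handles only a single top-level plural message, the `PLURAL_CATEGORIES` table is a small hand-copied subset of the CLDR plural categories, and `translate` is a hypothetical callable standing in for the translation layer.

```python
import re

# Hand-copied subset of CLDR plural categories, for illustration only.
PLURAL_CATEGORIES = {
    "en": ["one", "other"],
    "ja": ["other"],
    "ru": ["one", "few", "many", "other"],
    "ar": ["zero", "one", "two", "few", "many", "other"],
}

HEAD_RE = re.compile(r"^\{\s*(\w+)\s*,\s*plural\s*,")


def parse_plural(message):
    """Split '{var, plural, key {msg} ...}' into (var, {key: msg})."""
    head = HEAD_RE.match(message)
    if not head or not message.endswith("}"):
        raise ValueError("not a simple ICU plural message")
    body = message[head.end():-1]
    branches, i = {}, 0
    while i < len(body):
        if body[i].isspace():
            i += 1
            continue
        open_at = body.index("{", i)
        key = body[i:open_at].strip()
        depth, j = 1, open_at + 1
        while depth:  # walk to the brace that closes this branch
            depth += {"{": 1, "}": -1}.get(body[j], 0)
            j += 1
        branches[key] = body[open_at + 1:j - 1]
        i = j
    return head.group(1), branches


def build_plural(variable, branches, locale, translate):
    """Rebuild the envelope with one form per target-locale category."""
    forms = []
    for category in PLURAL_CATEGORIES[locale]:
        source = branches.get(category, branches["other"])
        forms.append(f"{category} {{{translate(source, category)}}}")
    return f"{{{variable}, plural, {' '.join(forms)}}}"


variable, branches = parse_plural("{count, plural, one {# item} other {# items}}")
```

Categories that the source language lacks fall back to the source's `other` branch as the text handed to `translate`, which receives the category name so it can produce the right form for, say, the Arabic dual.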

Character limits as first-class translation constraints

Character limits are usually enforced in QA, not at translation time. The standard workflow is: translate everything, export, run a length check script, flag violations, send back for revision. The revision step is expensive — it requires a translator who knows the UI context, the available character budget, and the acceptable compression strategies for that language. In German, compound nouns can often be decomposed or abbreviated. In Japanese, there may be a shorter kanji form. In French, an infinitive can sometimes replace a command form and save a few characters. These are judgment calls that require knowing the string is over-limit before committing to a full translation.

When character limit metadata is available in the source file — whether as a maxwidth attribute in XLIFF, a #. max_length: 30 comment in a .po file, or a custom annotation in a JSON strings file — a context-aware translation system can use it as a hard constraint during generation rather than a post-hoc filter. The model attempts the most natural translation first; if it exceeds the limit, it generates alternative shorter forms, evaluates them for meaning preservation, and selects the best fit. The output arrives pre-validated against the layout constraint.

We're not saying this eliminates all length violations. There are language pairs and string types where compression to a hard limit genuinely loses meaning, and those should surface as warnings for human review rather than being auto-corrected. The goal is to reduce the revision loop, not to pretend the constraint can always be satisfied automatically.
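The selection logic, including the fallback to a warning, can be sketched in a few lines. This assumes the candidate translations arrive already ranked from most to least natural (the meaning-preservation scoring is out of scope here), and that `len()` in code points is an acceptable stand-in for the real layout measure. The name `fit_to_limit` and the sample candidates are illustrative.

```python
def fit_to_limit(candidates, max_chars):
    """Pick the first candidate that fits; otherwise flag for human review.

    `candidates` must be ordered most natural first. Returns a tuple of
    (chosen_text, needs_review).
    """
    for text in candidates:
        if len(text) <= max_chars:
            return text, False
    # Nothing fits: return the shortest attempt, but mark it as a warning
    # instead of silently accepting a violation.
    return min(candidates, key=len), True


chosen, needs_review = fit_to_limit(
    ["Weiter zur Zahlungsseite", "Zur Zahlung"], max_chars=30
)
tight, tight_review = fit_to_limit(
    ["Weiter zur Zahlungsseite", "Zur Zahlung"], max_chars=12
)
over, over_review = fit_to_limit(
    ["Weiter zur Zahlungsseite", "Zur Zahlung"], max_chars=5
)
```

Returning a review flag rather than raising keeps the batch moving while still routing the unsatisfiable cases to a translator.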

Translation memory and context coherence

A context-aware translation system that ignores translation memory (TM) creates a different problem: terminological inconsistency across releases. If "Archive" was translated as "Archivieren" in v1.0 and the context-aware system independently arrives at "Einlagern" for the same string in v2.0, users who upgraded see a different label for the same action. TM matching is not just a cost-reduction mechanism — it's a consistency mechanism.

The interplay between TM and context works best when TM entries carry their original context alongside the translation. A TM match at 100% on the source string but from a different UI context (same words in a tooltip vs. a button) might not be the right match. Fuzzy matches in the 75–85% range that share the same UI element type may be more valuable than a 100% match from a different context class. Modern TM systems that store restype, key patterns, and surrounding string fingerprints alongside the translation pair produce better context-conditioned suggestions than ones that match on source text alone.
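A context-conditioned ranking could be sketched as follows. The `TMMatch` fields, the `rank_matches` function, and especially the 0.3 bonus for a matching UI element type are assumptions chosen so that an 80% same-context match outranks a 100% match from a different context. A real system would tune that trade-off against reviewed data.

```python
from dataclasses import dataclass


@dataclass
class TMMatch:
    target: str        # stored translation
    similarity: float  # source-text match score, 0.0 to 1.0
    ui_type: str       # UI element type stored with the entry, e.g. "button"


def rank_matches(matches, ui_type, context_bonus=0.3):
    """Order TM matches by text similarity plus a bonus for shared UI context."""

    def score(match):
        bonus = context_bonus if match.ui_type == ui_type else 0.0
        return match.similarity + bonus

    return sorted(matches, key=score, reverse=True)


matches = [
    TMMatch(target="tooltip translation", similarity=1.0, ui_type="tooltip"),
    TMMatch(target="button translation", similarity=0.8, ui_type="button"),
]
best_for_button = rank_matches(matches, ui_type="button")[0]
best_for_tooltip = rank_matches(matches, ui_type="tooltip")[0]
```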

What this doesn't solve

Context-aware translation handles structural and mechanical translation failures well. It does not handle cultural localization — the difference between a literal translation and a culturally appropriate one for a given market. "Free trial" translates directly to most European languages and reads naturally. In some markets, "free" has low perceived-value connotations and a culturally aware adaptation might use a different value proposition entirely. No amount of linguistic context in a string file resolves that; it requires a human localization strategist with market knowledge.

Similarly, context-aware translation doesn't solve the problem of untranslatable proper nouns, brand terms that should pass through unchanged, or strings whose meaning depends on visual context that isn't encoded anywhere in the file (a string that labels an icon visible in a screenshot but not described in a comment). These gaps are real, and the honest answer is that a glossary lock for brand terms, a screenshot annotation workflow for visual-context strings, and human review for cultural adaptation are all part of a complete localization pipeline — not replacements for each other.

The machinery of context-aware translation gets product teams to a state where mechanical failures — stripped variables, broken character limits, inconsistent plural forms — stop being the dominant failure mode in localization QA. Once those are eliminated, the remaining failures tend to be the interesting ones: judgment calls that benefit from human expertise rather than pipeline improvements.