RFC: Lexical Entity-Mapping for Safety-Critical Ingredient Matching

Abstract

This document specifies a deterministic, auditable scoring algorithm for matching user-facing entity names to domain catalog descriptions (e.g., USDA FoodData Central, medical code sets, legal corpora). The system replaces a cascade of heuristic strategies --- including raw substring matching --- that produced safety-critical errors in production APIs.

The core invariant: "oil" must never match "boiled." This is not a string similarity problem. It is a word boundary problem, and the solution is architectural: tokenizer-driven boundary correctness enforced via set membership, not regex.

No machine learning. No human-in-the-loop. Fully deterministic and auditable.

Scope note: Examples use nutrition terms for clarity, but the algorithm is domain-agnostic and applies to any knowledge graph or catalog with a prescriptive corpus + user reality.


1. Architecture

The system scores every recipe ingredient against every FDC food description. With ~2,000 ingredients and ~8,158 FDC foods, this is ~16 million scoring operations --- trivially parallelizable and completable in under 10 seconds.

Recipe Ingredient ("olive oil", freq=32,763)
        |
  [Pre-process: tokenize, normalize, stop-word removal]
        |
  Core Tokens: {olive, oil}
        |
  [Score ALL 8,158 FDC foods]
        |
  Per candidate: 5 signals --> composite score
        |
  [Threshold: >= 0.80 mapped, 0.40-0.79 review, < 0.40 no_match]
        |
  Top match + near ties --> persisted with full breakdown

1.1 The Tokenizer Invariant

Word boundary safety comes from the tokenizer, not from regex against raw strings. Both ingredient and FDC descriptions are tokenized by the same function \tau:

\tau(\text{"boiled"}) = \{\text{boiled}\}

\tau(\text{"olive oil"}) = \{\text{olive}, \text{oil}\}

Token overlap is computed via set membership on pre-tokenized sets. The token oil cannot match the description "boiled potatoes" because \text{oil} \notin \tau(\text{"boiled potatoes"}).

This is the system's most important invariant. Every other signal operates on tokenized data. There is no path through the scoring pipeline where raw substring matching can introduce a false positive.
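The invariant can be made concrete in a few lines. This is a minimal sketch (function names are illustrative, not the actual src/lib/lexical-scorer.ts API): both sides pass through the same tokenizer, and overlap is pure set membership, so no substring path exists.

```typescript
// Both ingredient and description go through the same tokenizer.
function tokenize(text: string): Set<string> {
  return new Set(
    text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0)
  );
}

// Overlap is set membership on pre-tokenized sets, never substring search.
function sharesToken(ingredient: string, description: string): boolean {
  const ingredientTokens = tokenize(ingredient);
  const descriptionTokens = tokenize(description);
  for (const t of ingredientTokens) {
    if (descriptionTokens.has(t)) return true;
  }
  return false;
}

// "oil" cannot match "boiled potatoes": "oil" is not a member of
// {boiled, potatoes}, so the false positive is structurally impossible.
```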

1.2 Two-Channel Tokens

Tokens are classified into two channels by a deterministic classifier \sigma:

  • Core tokens (T_I, T_C): identity tokens --- the words that define what a food is. Examples: chicken, breast, olive, oil.
  • State tokens (S_I, S_C): cooking, preservation, and processing state. Examples: raw, cooked, boiled, frozen, sliced.

State tokens are retained as a separate channel, not deleted. Core tokens drive primary matching; state tokens contribute a small bonus signal for disambiguation.


2. Scoring Algorithm

For each ingredient I and candidate food C, five signals are computed and combined into a composite score.

Signal                 Weight   What It Measures
Token overlap          0.35     IDF-weighted fraction of ingredient tokens found in candidate
Jaro-Winkler           0.25     Character-level similarity (gated by token evidence)
Segment match          0.20     Alignment with USDA primary vs. secondary comma segments
Category affinity      0.10     Whether FDC category matches a versioned expectation lexicon
Synonym confirmation   0.10     Whether known synonym tokens all appear in candidate

2.1 IDF-Weighted Token Overlap

Token weights use inverse document frequency to ensure that rare, discriminative tokens (like olive) contribute more than common tokens (like added):

w(t) = \frac{1}{\log(2 + df(t))}

where df(t) is the number of catalog entries containing token t. This is precomputed once at startup from the full corpus.

Edge case: If a token appears in zero catalog entries (df(t) = 0), the formula yields \frac{1}{\log 2} \approx 1.44. Novel tokens receive high weight, which is intentional --- rare user terms should strongly discriminate when they do match.

The total ingredient weight is:

W_I = \sum_{t \in T_I} w(t)

The directional overlap from ingredient to candidate is:

\text{Overlap}(I, C) = \frac{\sum_{t \in T_I \cap T_C} w(t)}{W_I}

Range: [0, 1]. The direction is intentional: we measure what fraction of the ingredient's tokens are found in the candidate, not the reverse. A candidate with many extra tokens (like "Oil, olive, salad or cooking") is not penalized for having tokens the ingredient doesn't mention.
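The IDF weighting and directional overlap above can be sketched as follows, assuming a precomputed df map over the catalog (the names here are illustrative, not the production API):

```typescript
// IDF weight per the formula w(t) = 1 / log(2 + df(t)).
// df = 0 yields 1 / log(2) ≈ 1.44, so novel tokens weigh heavily.
function idfWeight(df: number): number {
  return 1 / Math.log(2 + df);
}

// Directional overlap: fraction of the *ingredient's* weighted tokens
// found in the candidate. Extra candidate tokens are never penalized.
function overlap(
  ingredientTokens: Set<string>,
  candidateTokens: Set<string>,
  df: Map<string, number>
): number {
  let total = 0;
  let matched = 0;
  for (const t of ingredientTokens) {
    const w = idfWeight(df.get(t) ?? 0);
    total += w;
    if (candidateTokens.has(t)) matched += w;
  }
  return total > 0 ? matched / total : 0;
}
```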

2.2 Jaro-Winkler Similarity (Gated)

Jaro-Winkler character-level similarity is computed against multiple candidate representations, taking the maximum:

JW_{\max} = \max\big(JW(I, \text{raw}),\ JW(I, \text{inverted}),\ JW(I, \text{seg\_join})\big)

where inverted is the human-readable form (e.g., "olive oil" from "Oil, olive") and seg_join is the comma segments rejoined.

Gating rule --- Jaro-Winkler is gated by token evidence to prevent character-level similarity from rescuing candidates with no token overlap:

JW_{\text{gated}} = \begin{cases} JW_{\max} & \text{if } \text{Overlap}(I,C) \geq 0.40 \\ \min(JW_{\max},\ 0.20) & \text{if } \text{Overlap}(I,C) < 0.40 \end{cases}

Without this gate, Jaro-Winkler would assign high scores to strings that happen to share many characters but represent completely different foods. The gate ensures that JW only amplifies matches that already have meaningful token evidence.
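The gate itself is a one-liner; a sketch (the Jaro-Winkler computation is elided, since any standard implementation can supply jwMax):

```typescript
// Cap character-level similarity at 0.20 unless token evidence already
// supports the candidate (Overlap >= 0.40).
function gateJaroWinkler(jwMax: number, overlapScore: number): number {
  return overlapScore >= 0.40 ? jwMax : Math.min(jwMax, 0.20);
}
```

This keeps the gate pure and trivially testable: high JW with weak token evidence is clamped; low JW passes through unchanged.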

2.3 Segment Match

USDA FDC descriptions use a comma-separated structure where the first segment is typically the primary identity. For example, in "Butter, salted", the primary segment is "Butter"; in "Cookies, butter, commercially prepared", the primary segment is "Cookies."

We compute overlap with the primary segment (s_0) and the remaining segments (s_{\geq 1}) separately:

o_0 = \frac{\sum_{t \in T_I \cap T_{C,0}} w(t)}{W_I} \qquad o_{\geq 1} = \frac{\sum_{t \in T_I \cap \bigcup_{j \geq 1} T_{C,j}} w(t)}{W_I}

The segment score uses threshold buckets, evaluated top to bottom (the first matching case applies):

\text{Seg}(I,C) = \begin{cases} 1.0 & \text{if } o_0 \geq 0.60 \\ 0.6 & \text{if } o_0 < 0.60 \text{ and } o_{\geq 1} \geq 0.60 \\ 0.3 & \text{if } o_0 \geq 0.30 \text{ or } o_{\geq 1} \geq 0.30 \\ 0.0 & \text{otherwise} \end{cases}

This ensures "Butter, salted" (primary = "butter") beats "Cookies, butter" (primary = "cookies") when the ingredient is "butter."
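Because the cases are ordered, the bucket function is a straightforward chain of guards; a sketch with illustrative names:

```typescript
// o0: IDF-weighted overlap with the primary comma segment.
// oRest: IDF-weighted overlap with the union of remaining segments.
// Cases are checked top to bottom; the first match wins.
function segmentScore(o0: number, oRest: number): number {
  if (o0 >= 0.60) return 1.0;
  if (oRest >= 0.60) return 0.6;
  if (o0 >= 0.30 || oRest >= 0.30) return 0.3;
  return 0.0;
}
```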

2.4 Category Affinity

A small, explicit, versioned lexicon maps ingredient tokens to expected FDC categories:

"oil"     --> ["Fats and Oils"]
"butter"  --> ["Dairy and Egg Products", "Fats and Oils"]
"sugar"   --> ["Sweets"]
"flour"   --> ["Cereal Grains and Pasta"]
"salt"    --> ["Spices and Herbs"]
"chicken" --> ["Poultry Products"]

The affinity score checks all ingredient tokens and takes any match:

\text{Aff}(I,C) = \begin{cases} 1.0 & \text{if any token's expected categories include } \text{cat}(C) \\ 0.0 & \text{otherwise (neutral, not a penalty)} \end{cases}

When no expectation exists for any token, the score is 0 --- neutral, not a penalty. The lexicon is versioned and deterministic; it does not learn or drift.
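A minimal sketch of the lookup, seeded with a few of the entries listed above (the real lexicon is the versioned one in source control; names here are illustrative):

```typescript
// Versioned, explicit token -> expected-categories lexicon (excerpt).
const CATEGORY_LEXICON: Record<string, string[]> = {
  oil: ["Fats and Oils"],
  butter: ["Dairy and Egg Products", "Fats and Oils"],
  sugar: ["Sweets"],
};

// Any-token match scores 1.0; no expectation or mismatch is neutral (0.0).
function affinity(ingredientTokens: Set<string>, candidateCategory: string): number {
  for (const t of ingredientTokens) {
    const expected = CATEGORY_LEXICON[t];
    if (expected && expected.includes(candidateCategory)) return 1.0;
  }
  return 0.0;
}
```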

2.5 Synonym Confirmation (Gated)

A synonym table maps recipe ingredient names to sets of FDC description tokens that confirm a match:

"olive oil"     --> {oil, olive}
"kosher salt"   --> {salt, table}
"cilantro"      --> {coriander, leaves}
"baking soda"   --> {leavening, baking, soda}

The synonym table is defined inline in src/lib/lexical-scorer.ts (~40 entries) and versioned in source control.

Satisfaction requires all tokens in the synonym set to appear in the candidate:

\text{sat}(\Sigma, C) = \begin{cases} 1 & \text{if } \Sigma \subseteq T_C \\ 0 & \text{otherwise} \end{cases}

\text{Syn}(I,C) = \begin{cases} \max_i \text{sat}(\Sigma_i, C) & \text{if } \text{Overlap}(I,C) > 0 \\ 0 & \text{otherwise} \end{cases}

The overlap gate prevents synonym confirmation from independently establishing a match --- it can only strengthen an existing one.
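A sketch of the gated all-tokens check (illustrative names; the production table lives in src/lib/lexical-scorer.ts):

```typescript
// Returns 1 only if some synonym set is fully contained in the candidate's
// tokens AND token overlap is already nonzero; otherwise 0.
function synonymConfirm(
  synonymSets: Set<string>[],
  candidateTokens: Set<string>,
  overlapScore: number
): number {
  if (overlapScore <= 0) return 0; // gate: cannot establish a match alone
  for (const sigma of synonymSets) {
    if (sigma.size === 0) continue;
    let all = true;
    for (const t of sigma) {
      if (!candidateTokens.has(t)) { all = false; break; }
    }
    if (all) return 1;
  }
  return 0;
}
```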

2.6 Composite Score

\text{Score}(I,C) = 0.35 \cdot \text{Overlap} + 0.25 \cdot JW_{\text{gated}} + 0.20 \cdot \text{Seg} + 0.10 \cdot \text{Aff} + 0.10 \cdot \text{Syn}

Range: [0, 1].
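The weighted sum reduces to a small pure function; the interface shape below is illustrative, not the actual in-code signature. Plugging in the §3 worked-example signals gives 0.35 + 0.2375 + 0.12 + 0.10 + 0.10 = 0.9075, which rounds to the 0.908 shown there:

```typescript
// All five signals are expected to be in [0, 1].
interface Signals {
  overlap: number;
  jwGated: number;
  seg: number;
  aff: number;
  syn: number;
}

function composite(s: Signals): number {
  return 0.35 * s.overlap + 0.25 * s.jwGated + 0.20 * s.seg
       + 0.10 * s.aff + 0.10 * s.syn;
}
```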

2.7 Confidence Thresholds

Score         Status         Action
>= 0.80       mapped         Auto-accept. Collect all candidates within 0.05 of best.
0.40 - 0.79   needs_review   Flag for manual inspection. Store best candidate.
< 0.40        no_match       No credible match.
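The threshold mapping is a sketchable two-guard function (status strings follow the table; the type name is illustrative):

```typescript
type Status = "mapped" | "needs_review" | "no_match";

// Thresholds per §2.7: >= 0.80 auto-accept, >= 0.40 review, else no match.
function classify(score: number): Status {
  if (score >= 0.80) return "mapped";
  if (score >= 0.40) return "needs_review";
  return "no_match";
}
```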

3. Worked Example

"olive oil" vs. "Oil, olive, salad or cooking"

Pre-processing:

  • T_I = \{\text{olive}, \text{oil}\}
  • T_C = \{\text{oil}, \text{olive}, \text{salad}, \text{cooking}\}
  • Inverted name: "olive oil" (see §4 for inverted naming resolution)

Signal computation:

Signal      Value                                                   Weighted
Overlap     1.0 (both tokens match)                                 0.35 × 1.0 = 0.350
JW          ~0.95 (not capped; overlap >= 0.40)                     0.25 × 0.95 = 0.238
Segment     0.6 (primary segment is "oil"; rest contains "olive")   0.20 × 0.6 = 0.120
Affinity    1.0 ("oil" expects "Fats and Oils")                     0.10 × 1.0 = 0.100
Synonym     1.0 ({oil, olive} ⊆ T_C)                                0.10 × 1.0 = 0.100
Composite                                                           0.908 --> mapped

vs. "Olives, green, raw"

  • T_C = \{\text{olives}, \text{green}\}
  • Overlap: "olive" matches "olives" via a plural variant (credited at 0.9 \times w(t)); "oil" is not found. Overlap \approx 0.47.
  • Affinity: the candidate's category "Vegetables and Vegetable Products" does not match the "Fats and Oils" expectation from the "oil" token. Score = 0.
  • Composite \approx 0.45 --> needs_review

The system correctly identifies "Oil, olive, salad or cooking" as the intended match, not raw olives.


4. Catalog Naming Normalization

Many authoritative catalogs use inverted or structured naming conventions that differ from natural user language. The system resolves these at load time using configurable domain knowledge sets:

  • Container categories: "Oil, olive" → "olive oil"
  • Entity bases: "Chicken, breast" → "chicken breast"
  • Product forms: "Wheat, flour" → "wheat flour"

The normalized name is one scoring input for Jaro-Winkler. It does not replace token-level matching. Raw description tokens always participate independently.

See Appendix A for USDA FDC-specific normalization rules.
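A load-time inversion pass along these lines can be sketched as follows. The function name and the container set are illustrative; the actual rule sets are the configurable domain knowledge sets described above and in Appendix A:

```typescript
// Excerpt of a container-category set ("Oil, olive" -> "olive oil").
const CONTAINER_CATEGORIES = new Set(["oil", "spices", "nuts", "sauce"]);

// Invert "Category, modifier, ..." descriptions into natural word order.
// Descriptions that don't match a known container are joined as-is.
function invertName(description: string): string {
  const segments = description.split(",").map((s) => s.trim().toLowerCase());
  if (segments.length >= 2 && CONTAINER_CATEGORIES.has(segments[0])) {
    return `${segments[1]} ${segments[0]}`;
  }
  return segments.join(" ");
}
```

The inverted form feeds only the Jaro-Winkler signal; token-level matching still runs on the raw description, per the note above.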


5. Tripwire Tests

The following invariants are enforced as regression tests. Any violation blocks promotion of a scoring run.

Word Boundary Tripwires

Input    Must NOT Match                                Reason
"oil"    "boiled", "broiled", "foil", "toil", "coil"   Substring of cooking method or unrelated word
"salt"   "asphalt", "basalt", "cobalt"                 Substring of mineral/material
"corn"   "corner", "cornucopia"                        Prefix of unrelated word
"ham"    "champignon"                                  Substring of mushroom name

Medical Correctness Tripwires

Ingredient    Must Map To                  Must NOT Map To
"oil"         Fats and Oils category       Boiled/broiled foods
"butter"      Butter, salted (Dairy)       Cookies, butter (Baked)
"sugar"       Sugar, turbinado (Sweets)    Cookies, sugar (Baked)
"olive oil"   Oil, olive (Fats and Oils)   Olives, raw (Vegetables)
"olive"       Olives, ripe (Fruits)        Oil, olive (Fats and Oils)
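A word-boundary tripwire reduces to a direct regression check against the tokenizer. The sketch below is illustrative (a standalone tokenizer and table, not the project's actual test harness); a nonempty violation list would block promotion:

```typescript
// Tripwire table: input token -> descriptions it must never match.
const WORD_BOUNDARY_TRIPWIRES: Array<[string, string[]]> = [
  ["oil", ["boiled", "broiled", "foil", "toil", "coil"]],
  ["salt", ["asphalt", "basalt", "cobalt"]],
  ["corn", ["corner", "cornucopia"]],
];

function tokenizeSimple(text: string): Set<string> {
  return new Set(text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0));
}

// Returns the list of violations; empty means all tripwires pass.
function checkTripwires(): string[] {
  const violations: string[] = [];
  for (const [input, forbidden] of WORD_BOUNDARY_TRIPWIRES) {
    for (const desc of forbidden) {
      if (tokenizeSimple(desc).has(input)) {
        violations.push(`${input} must not match ${desc}`);
      }
    }
  }
  return violations;
}
```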

6. Run-Based Staging

Scoring results are not written directly to production tables. Instead, each scoring run produces a versioned artifact:

  1. Stage: results are written to a staging table keyed by run_id, with full config, tokenizer hash, and IDF hash recorded.
  2. Validate: tripwire tests and distribution sanity checks run against the staged data.
  3. Promote: a single-row pointer table (lexical_mapping_current) is updated to point at the new run_id.

Rollback is instant: repoint the current pointer to the previous run_id. No data is deleted; all runs are retained for audit.

This replaces the previous approach of truncating production tables before rebuilding, which created a window of data loss and prevented comparison between runs.
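The promote/rollback mechanics above reduce to pointer updates over retained runs. A minimal in-memory sketch (the shapes are illustrative, not the actual lexical_mapping_current schema):

```typescript
// Current pointer plus the retained history of previous run_ids.
interface RunPointer {
  currentRunId: string;
  history: string[];
}

// Promotion: repoint to the new run; the old current joins the history.
function promote(p: RunPointer, runId: string): RunPointer {
  return { currentRunId: runId, history: [...p.history, p.currentRunId] };
}

// Rollback: repoint to the most recent previous run. Nothing is deleted.
function rollback(p: RunPointer): RunPointer {
  const prev = p.history[p.history.length - 1];
  if (prev === undefined) throw new Error("no previous run to roll back to");
  return { currentRunId: prev, history: p.history.slice(0, -1) };
}
```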


7. Performance

Metric                       Value
FDC corpus size              8,158 foods (SR Legacy + Foundation)
Recipe ingredients           ~2,000 (frequency >= 25)
Total scoring pairs          ~16.3 million
Scoring throughput           ~50ms per ingredient (all 8K candidates)
Full run                     ~10 seconds
Startup (load + IDF build)   ~200ms
Memory                       ~50MB

The scorer consists of pure functions with no I/O in the hot path. All FDC food data is pre-processed once at startup into in-memory structures with O(1) token set membership.
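The startup pass can be sketched as a single pre-processing function that tokenizes every catalog entry once and accumulates df counts for IDF (names are illustrative; the real structures live in the scorer's load path):

```typescript
// Each catalog entry is tokenized once into a Set for O(1) membership.
interface PreparedFood {
  description: string;
  tokens: Set<string>;
}

function prepareCatalog(descriptions: string[]): {
  foods: PreparedFood[];
  df: Map<string, number>;
} {
  const df = new Map<string, number>();
  const foods = descriptions.map((description) => {
    const tokens = new Set(
      description.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0)
    );
    // df counts entries containing the token, not token occurrences.
    for (const t of tokens) df.set(t, (df.get(t) ?? 0) + 1);
    return { description, tokens };
  });
  return { foods, df };
}
```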


8. False-Negative Analysis

The system prioritizes precision over recall. Some classes of valid matches may score below threshold:

  • Novel ingredients: User terms not in the synonym table and with no token overlap will score poorly. This is acceptable --- false negatives flag for manual review; false positives silently corrupt data.
  • Abbreviations: "evoo" (extra virgin olive oil) has no token overlap with "Oil, olive" and will fail. The synonym table can be extended as abbreviations are discovered.
  • Misspellings: "oliv oil" loses one token's IDF weight. Jaro-Winkler provides partial recovery, but severe misspellings may drop below threshold.

The needs_review tier (0.40–0.79) exists precisely to surface these cases without auto-accepting them.


9. What This Architecture Does Not Do

  • No AI inference. Scoring is deterministic arithmetic on pre-tokenized sets.
  • No embedding similarity. There are no vectors, no cosine similarity, no semantic search.
  • No human-in-the-loop. The needs_review tier flags candidates for optional inspection but does not require it for the system to function.
  • No learning or drift. The category expectation lexicon and synonym table are versioned in source control. They change only through explicit code review.

In a domain where a 25x error in fat content could affect medical decisions, "the model thinks these are similar" is not an acceptable justification. Every match in this system can be explained by citing the exact token overlap, segment position, category match, and synonym confirmation that produced the score.


Appendix A — Nutrition-Specific Notes (Example Domain)

This appendix captures nutrition-specific examples and constraints. The core algorithm is domain-agnostic.

A.1 USDA FDC Inverted Naming Rules

FDC uses an inverted naming convention: "Oil, olive" instead of "olive oil." The system resolves this using domain knowledge sets:

  • Container categories (oil, spices, nuts, seeds, sauce, ...): "Oil, olive" → "olive oil"
  • Protein bases (chicken, beef, pork, ...): "Chicken, breast" → "chicken breast"
  • Product forms (flour, juice, oil, powder, ...): "Wheat, flour" → "wheat flour"
  • Poultry classifiers (broilers or fryers, roasting, ...): skipped to find actual cut

These sets are exported from the canonicalization layer and versioned in source control.

A.2 Tripwire Examples (Nutrition)

Ingredient    Must Map To                  Must NOT Map To
"oil"         Fats and Oils category       Boiled/broiled foods
"butter"      Butter, salted (Dairy)       Cookies, butter (Baked)
"sugar"       Sugar, turbinado (Sweets)    Cookies, sugar (Baked)
"olive oil"   Oil, olive (Fats and Oils)   Olives, raw (Vegetables)
"olive"       Olives, ripe (Fruits)        Oil, olive (Fats and Oils)

A.3 Worked Example (Nutrition)

"olive oil" vs. "Oil, olive, salad or cooking" is a canonical case for token boundary safety and segment weighting.


Changelog

Version   Date      Changes
1.0       2026-02   Initial release. Domain-agnostic framing with nutrition appendix.