RFC: Lexical Entity-Mapping for Safety-Critical Ingredient Matching
Abstract
This document specifies a deterministic, auditable scoring algorithm for matching user-facing entity names to domain catalog descriptions (e.g., USDA FoodData Central, medical code sets, legal corpora). The system replaces a cascade of heuristic strategies --- including raw substring matching --- that produced safety-critical errors in production APIs.
The core invariant: "oil" must never match "boiled." This is not a string similarity problem. It is a word boundary problem, and the solution is architectural: tokenizer-driven boundary correctness enforced via set membership, not regex.
No machine learning. No human-in-the-loop. Fully deterministic and auditable.
Scope note: Examples use nutrition terms for clarity, but the algorithm is domain-agnostic and applies to any knowledge graph or catalog with a prescriptive corpus + user reality.
1. Architecture
The system scores every recipe ingredient against every FDC food description. With ~2,000 ingredients and ~8,158 FDC foods, this is ~16 million scoring operations --- trivially parallelizable and completable in under 10 seconds.
Recipe Ingredient ("olive oil", freq=32,763)
|
[Pre-process: tokenize, normalize, stop-word removal]
|
Core Tokens: {olive, oil}
|
[Score ALL 8,158 FDC foods]
|
Per candidate: 5 signals --> composite score
|
[Threshold: >= 0.80 mapped, 0.40-0.79 review, < 0.40 no_match]
|
Top match + near ties --> persisted with full breakdown
1.1 The Tokenizer Invariant
Word boundary safety comes from the tokenizer, not from regex applied to raw strings. Ingredient names and FDC descriptions are tokenized by the same function, so both sides of every comparison share identical boundary semantics.
Token overlap is computed via set membership on the pre-tokenized sets. The token oil cannot match the description "boiled potatoes" because oil ∉ {boiled, potatoes}.
This is the system's most important invariant. Every other signal operates on tokenized data. There is no path through the scoring pipeline where raw substring matching can introduce a false positive.
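As a minimal sketch of the invariant (the tokenizer and stop-word list shown here are illustrative stand-ins, not the production implementation):

```typescript
// Illustrative tokenizer: both sides pass through the SAME function before
// any comparison, so matching is set membership, never substring search.
const STOP_WORDS = new Set(["or", "and", "with"]); // illustrative stop-word list

function tokenize(text: string): Set<string> {
  return new Set(
    text
      .toLowerCase()
      .split(/[^a-z0-9]+/) // split on non-alphanumerics = word boundaries
      .filter((t) => t.length > 0 && !STOP_WORDS.has(t)),
  );
}

const ingredient = tokenize("olive oil");      // {olive, oil}
const candidate = tokenize("boiled potatoes"); // {boiled, potatoes}

// "oil" is a substring of "boiled", but set membership never sees substrings:
const falsePositive = [...ingredient].some((t) => candidate.has(t)); // false
```

Because every downstream signal consumes these sets, there is no code path where a raw-string containment check can reintroduce the "oil"/"boiled" class of error.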
1.2 Two-Channel Tokens
Tokens are classified into two channels by a deterministic classifier:
- Core tokens: identity tokens --- the words that define what a food is. Examples: chicken, breast, olive, oil.
- State tokens: cooking, preservation, and processing state. Examples: raw, cooked, boiled, frozen, sliced.
State tokens are retained as a separate channel, not deleted. Core tokens drive primary matching; state tokens contribute a small bonus signal for disambiguation.
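A sketch of the two-channel split, assuming a hypothetical STATE_TOKENS lexicon (the production classifier's vocabulary is not reproduced here):

```typescript
// Illustrative state-token lexicon; the real set is larger and versioned.
const STATE_TOKENS = new Set(["raw", "cooked", "boiled", "frozen", "sliced"]);

interface Channels {
  core: Set<string>;  // identity tokens: what the food IS
  state: Set<string>; // processing/preservation state, kept as a side channel
}

function classify(tokens: Iterable<string>): Channels {
  const core = new Set<string>();
  const state = new Set<string>();
  for (const t of tokens) (STATE_TOKENS.has(t) ? state : core).add(t);
  return { core, state };
}

const ch = classify(["chicken", "breast", "raw"]);
// ch.core = {chicken, breast}; ch.state = {raw}
```

The key design point survives even in this sketch: state tokens are routed, not deleted, so they remain available as a disambiguation signal.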
2. Scoring Algorithm
For each ingredient i and candidate food c, five signals are computed and combined into a composite score.
| Signal | Weight | What It Measures |
|---|---|---|
| Token overlap | 0.35 | IDF-weighted fraction of ingredient tokens found in candidate |
| Jaro-Winkler | 0.25 | Character-level similarity (gated by token evidence) |
| Segment match | 0.20 | Alignment with USDA primary vs. secondary comma segments |
| Category affinity | 0.10 | Whether FDC category matches a versioned expectation lexicon |
| Synonym confirmation | 0.10 | Whether known synonym tokens all appear in candidate |
2.1 IDF-Weighted Token Overlap
Token weights use inverse document frequency to ensure that rare, discriminative tokens (like olive) contribute more than common tokens (like added):
    idf(t) = ln((1 + N) / (1 + df(t)))

where df(t) is the number of catalog entries containing token t and N is the total number of catalog entries. This is precomputed once at startup from the full corpus.
Edge case: If a token appears in zero catalog entries (df(t) = 0), the formula yields idf(t) = ln(1 + N). Novel tokens receive high weight, which is intentional — rare user terms should strongly discriminate when they do match.
The total ingredient weight is:

    W(i) = Σ_{t ∈ T(i)} idf(t)

The directional overlap from ingredient to candidate is:

    overlap(i, c) = ( Σ_{t ∈ T(i) ∩ T(c)} idf(t) ) / W(i)

Range: [0, 1]. The direction is intentional: we measure what fraction of the ingredient's tokens are found in the candidate, not the reverse. A candidate with many extra tokens (like "Oil, olive, salad or cooking") is not penalized for having tokens the ingredient doesn't mention.
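The IDF build and the directional overlap can be sketched as follows (a minimal illustration assuming the smoothed formula idf(t) = ln((1 + N) / (1 + df(t)))):

```typescript
// Build an IDF function from a pre-tokenized catalog.
// Smoothed so that unseen tokens (df = 0) get the maximum weight ln(1 + N).
function buildIdf(catalog: Set<string>[]): (t: string) => number {
  const df = new Map<string, number>();
  for (const entry of catalog)
    for (const t of entry) df.set(t, (df.get(t) ?? 0) + 1);
  const n = catalog.length;
  return (t) => Math.log((1 + n) / (1 + (df.get(t) ?? 0)));
}

// Directional overlap: fraction of the INGREDIENT's IDF mass found in the
// candidate. Extra candidate tokens are never penalized.
function overlap(
  ing: Set<string>,
  cand: Set<string>,
  idf: (t: string) => number,
): number {
  let total = 0;
  let matched = 0;
  for (const t of ing) {
    const w = idf(t);
    total += w;
    if (cand.has(t)) matched += w;
  }
  return total > 0 ? matched / total : 0;
}

// Toy three-entry catalog for illustration:
const catalog = [
  new Set(["oil", "olive"]),
  new Set(["boiled", "potatoes"]),
  new Set(["butter", "salted"]),
];
const idf = buildIdf(catalog);
const score = overlap(new Set(["olive", "oil"]), new Set(["oil", "olive", "salad", "cooking"]), idf);
// both ingredient tokens found → score is 1.0
```

Note the asymmetry: swapping the arguments would measure candidate coverage instead, which is exactly what the spec rejects.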
2.2 Jaro-Winkler Similarity (Gated)
Jaro-Winkler character-level similarity is computed against multiple candidate representations, taking the maximum:

    jw(i, c) = max(JW(name_i, desc_c), JW(name_i, inverted_c), JW(name_i, seg_join_c))

where desc_c is the raw description, inverted_c is the human-readable form (e.g., "olive oil" from "Oil, olive"), and seg_join_c is the comma segments rejoined.
Gating rule --- Jaro-Winkler is gated by token evidence to prevent character-level similarity from rescuing candidates with no token overlap:

    jw_gated(i, c) = jw(i, c)            if overlap(i, c) ≥ τ
    jw_gated(i, c) = min(jw(i, c), cap)  otherwise

where τ is the token-evidence gate and cap bounds the character-level contribution for low-evidence candidates.
Without this gate, Jaro-Winkler would assign high scores to strings that happen to share many characters but represent completely different foods. The gate ensures that JW only amplifies matches that already have meaningful token evidence.
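A self-contained sketch using the standard Jaro-Winkler definition; the gate is shown as a hard zero with an illustrative threshold, since the production gate parameters are not specified in this document:

```typescript
// Standard Jaro-Winkler: Jaro similarity plus a prefix bonus (scale 0.1, max 4).
function jaroWinkler(a: string, b: string): number {
  if (a === b) return 1;
  if (a.length === 0 || b.length === 0) return 0;
  const range = Math.max(0, Math.floor(Math.max(a.length, b.length) / 2) - 1);
  const aMatch: boolean[] = new Array(a.length).fill(false);
  const bMatch: boolean[] = new Array(b.length).fill(false);
  let matches = 0;
  for (let i = 0; i < a.length; i++) {
    const lo = Math.max(0, i - range);
    const hi = Math.min(b.length - 1, i + range);
    for (let j = lo; j <= hi; j++) {
      if (!bMatch[j] && a[i] === b[j]) {
        aMatch[i] = bMatch[j] = true;
        matches++;
        break;
      }
    }
  }
  if (matches === 0) return 0;
  let transpositions = 0;
  let k = 0;
  for (let i = 0; i < a.length; i++) {
    if (!aMatch[i]) continue;
    while (!bMatch[k]) k++;
    if (a[i] !== b[k]) transpositions++;
    k++;
  }
  const jaro =
    (matches / a.length + matches / b.length + (matches - transpositions / 2) / matches) / 3;
  let prefix = 0;
  while (prefix < 4 && prefix < a.length && a[prefix] === b[prefix]) prefix++;
  return jaro + prefix * 0.1 * (1 - jaro);
}

// Gate sketch: character similarity only counts when token evidence exists.
const GATE = 0.2; // illustrative threshold, NOT the production value
function gatedJw(jw: number, tokenOverlap: number): number {
  return tokenOverlap >= GATE ? jw : 0;
}
```

With zero token overlap, even a high raw JW score contributes nothing, which is the property the gate exists to guarantee.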
2.3 Segment Match
USDA FDC descriptions use a comma-separated structure where the first segment is typically the primary identity. For example, in "Butter, salted", the primary segment is "Butter"; in "Cookies, butter, commercially prepared", the primary segment is "Cookies."
We compute overlap with the primary segment (overlap_primary) and the remaining segments (overlap_rest) separately. The segment score then maps these into threshold buckets: complete primary-segment coverage of the ingredient's tokens scores highest, coverage split across primary and secondary segments scores in the middle, and secondary-only coverage scores lowest.
This ensures "Butter, salted" (primary = "butter") beats "Cookies, butter" (primary = "cookies") when the ingredient is "butter."
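A bucket sketch with illustrative values (the document's exact bucket boundaries were not preserved; only the ordering is normative):

```typescript
// Segment-aware scoring: FDC's first comma segment carries primary identity.
// Bucket values below are illustrative placeholders, not production constants.
function segmentScore(ingredientTokens: Set<string>, description: string): number {
  const segments = description.toLowerCase().split(",").map((s) => s.trim());
  const primary = new Set(segments[0].split(/\s+/).filter(Boolean));
  const rest = new Set(segments.slice(1).join(" ").split(/\s+/).filter(Boolean));
  const inPrimary = [...ingredientTokens].filter((t) => primary.has(t)).length;
  const inRest = [...ingredientTokens].filter((t) => rest.has(t)).length;
  const frac = inPrimary / ingredientTokens.size;
  if (frac === 1) return 1.0;             // all identity tokens in primary segment
  if (frac > 0 && inRest > 0) return 0.8; // split across primary + secondary
  if (frac > 0) return 0.6;               // partial primary only
  return inRest > 0 ? 0.3 : 0;            // secondary-only, or no alignment
}

// "butter" vs "Butter, salted" hits the top bucket;
// "butter" vs "Cookies, butter, commercially prepared" only the secondary one.
```

Whatever the concrete bucket values, the invariant is that primary-segment alignment must dominate secondary-segment alignment.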
2.4 Category Affinity
A small, explicit, versioned lexicon maps ingredient tokens to expected FDC categories:
"oil" --> ["Fats and Oils"]
"butter" --> ["Dairy and Egg Products", "Fats and Oils"]
"sugar" --> ["Sweets"]
"flour" --> ["Cereal Grains and Pasta"]
"salt" --> ["Spices and Herbs"]
"chicken" --> ["Poultry Products"]
The affinity score checks all ingredient tokens and takes any match:

    affinity(i, c) = 1 if category(c) ∈ expected(t) for any t ∈ T(i), else 0
When no expectation exists for any token, the score is 0 --- neutral, not a penalty. The lexicon is versioned and deterministic; it does not learn or drift.
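The any-match lookup is a few lines; the lexicon below is a truncated copy of the examples above:

```typescript
// Versioned, hand-maintained expectation lexicon (truncated for illustration).
const CATEGORY_EXPECTATIONS: Record<string, string[]> = {
  oil: ["Fats and Oils"],
  butter: ["Dairy and Egg Products", "Fats and Oils"],
  sugar: ["Sweets"],
};

// 1 if ANY ingredient token expects the candidate's category; otherwise 0.
// 0 is neutral: unknown tokens and mismatches simply contribute nothing.
function affinityScore(ingredientTokens: Set<string>, fdcCategory: string): number {
  for (const t of ingredientTokens) {
    if ((CATEGORY_EXPECTATIONS[t] ?? []).includes(fdcCategory)) return 1;
  }
  return 0;
}
```

Because the score is 0-or-1 and never negative, a sparse lexicon can only help disambiguate; it can never sink an otherwise-valid match.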
2.5 Synonym Confirmation (Gated)
A synonym table maps recipe ingredient names to sets of FDC description tokens that confirm a match:
"olive oil" --> {oil, olive}
"kosher salt" --> {salt, table}
"cilantro" --> {coriander, leaves}
"baking soda" --> {leavening, baking, soda}
The synonym table is defined inline in src/lib/lexical-scorer.ts (~40 entries) and versioned in source control.
Satisfaction requires all tokens in the synonym set to appear in the candidate, gated on token overlap:

    synonym(i, c) = 1 if syn(i) ⊆ T(c) and overlap(i, c) clears the gate, else 0
The overlap gate prevents synonym confirmation from independently establishing a match --- it can only strengthen an existing one.
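A sketch of the all-tokens check with the overlap gate (the gate value here is illustrative; the table is truncated from the examples above):

```typescript
// Truncated synonym table; the real ~40-entry table lives in
// src/lib/lexical-scorer.ts and is versioned in source control.
const SYNONYMS: Record<string, string[]> = {
  "olive oil": ["oil", "olive"],
  "baking soda": ["leavening", "baking", "soda"],
};

// All synonym tokens must appear in the candidate, AND token overlap must
// already exist: synonym confirmation strengthens, never establishes, a match.
function synonymScore(name: string, candidateTokens: Set<string>, tokenOverlap: number): number {
  const GATE = 0.2; // illustrative gate value, not the production constant
  const syn = SYNONYMS[name];
  if (!syn || tokenOverlap < GATE) return 0;
  return syn.every((t) => candidateTokens.has(t)) ? 1 : 0;
}
```

The subset semantics matter: a candidate containing only "oil" but not "olive" gets no synonym credit for "olive oil".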
2.6 Composite Score
The composite applies the weights from the signal table in §2:

    score(i, c) = 0.35·overlap + 0.25·jw_gated + 0.20·segment + 0.10·affinity + 0.10·synonym

Range: [0, 1] --- every signal lies in [0, 1] and the weights sum to 1.0.
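The composite is plain weighted arithmetic over the five signals, using the weights from the table in §2:

```typescript
interface Signals {
  overlap: number;  // IDF-weighted token overlap, [0, 1]
  jw: number;       // gated Jaro-Winkler, [0, 1]
  segment: number;  // segment bucket score, [0, 1]
  affinity: number; // category affinity, 0 or 1
  synonym: number;  // synonym confirmation, 0 or 1
}

// Weights sum to 1.0, so the composite stays in [0, 1].
function composite(s: Signals): number {
  return 0.35 * s.overlap + 0.25 * s.jw + 0.2 * s.segment + 0.1 * s.affinity + 0.1 * s.synonym;
}
```

Since the function is a fixed linear combination, every score is reproducible and decomposable: the audit trail is just the five inputs.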
2.7 Confidence Thresholds
| Score | Status | Action |
|---|---|---|
| ≥ 0.80 | mapped | Auto-accept. Collect all candidates within 0.05 of best. |
| 0.40–0.79 | needs_review | Flag for manual inspection. Store best candidate. |
| < 0.40 | no_match | No credible match. |
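The threshold mapping is a straightforward bucketing function:

```typescript
type Status = "mapped" | "needs_review" | "no_match";

// Thresholds from §2.7: >= 0.80 auto-accept, 0.40-0.79 review, < 0.40 reject.
function classifyScore(score: number): Status {
  if (score >= 0.8) return "mapped";
  if (score >= 0.4) return "needs_review";
  return "no_match";
}
```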
3. Worked Example
"olive oil" vs. "Oil, olive, salad or cooking"
Pre-processing:
- Core tokens: {olive, oil}
- Inverted name: "olive oil" (see §4 for inverted naming resolution)
Signal computation:
| Signal | Value | Weighted |
|---|---|---|
| Overlap | 1.00 (both ingredient tokens found in candidate) | 0.350 |
| JW | high (not capped; overlap clears the gate) | |
| Segment | high (primary segment is "oil", rest contains "olive") | |
| Affinity | 1.00 ("oil" expects "Fats and Oils") | 0.100 |
| Synonym | 1.00 (all of {oil, olive} present) | 0.100 |
| Composite | 0.908 --> mapped | |
vs. "Olives, green, raw"
- Overlap: "olive" matches "olives" via a plural variant, but "oil" is not found, so the ingredient loses the IDF weight of its most discriminative token and overlap is only partial.
- Affinity: the candidate's category, "Vegetables and Vegetable Products", does not match the "Fats and Oils" expectation from the "oil" token. Score 0.
- Composite: lands in the 0.40–0.79 band --> needs_review.
The system correctly identifies "Oil, olive, salad or cooking" as the intended match, not raw olives.
4. Catalog Naming Normalization
Many authoritative catalogs use inverted or structured naming conventions that differ from natural user language. The system resolves these at load time using configurable domain knowledge sets:
- Container categories: "Oil, olive" → "olive oil"
- Entity bases: "Chicken, breast" → "chicken breast"
- Product forms: "Wheat, flour" → "wheat flour"
The normalized name is one scoring input for Jaro-Winkler. It does not replace token-level matching. Raw description tokens always participate independently.
See Appendix A for USDA FDC-specific normalization rules.
5. Tripwire Tests
The following invariants are enforced as regression tests. Any violation blocks promotion of a scoring run.
Word Boundary Tripwires
| Input | Must NOT Match | Reason |
|---|---|---|
| "oil" | "boiled", "broiled", "foil", "toil", "coil" | Substring of cooking method |
| "salt" | "asphalt", "basalt", "cobalt" | Substring of mineral/material |
| "corn" | "corner", "cornucopia" | Prefix of unrelated word |
| "ham" | "champignon" | Substring of mushroom name |
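The word-boundary tripwires can be expressed as an executable check against the shared tokenizer (the tokenizer here is a minimal illustrative stand-in):

```typescript
// Minimal stand-in for the shared tokenizer used on both sides of a match.
function tokens(text: string): Set<string> {
  return new Set(text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean));
}

// Trap strings from the tripwire table: "oil" must share NO token with any.
const traps = ["boiled potatoes", "broiled fish", "aluminum foil", "toil", "coil"];
const violations: string[] = [];
for (const trap of traps) {
  const shared = [...tokens("oil")].filter((t) => tokens(trap).has(t));
  if (shared.length > 0) violations.push(trap);
}
// violations stays empty: set membership never sees "oil" inside "boiled".
```

A regression suite built this way fails loudly the moment any refactor reintroduces substring semantics.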
Medical Correctness Tripwires
| Ingredient | Must Map To | Must NOT Map To |
|---|---|---|
| "oil" | Fats and Oils category | Boiled/broiled foods |
| "butter" | Butter, salted (Dairy) | Cookies, butter (Baked) |
| "sugar" | Sugar, turbinado (Sweets) | Cookies, sugar (Baked) |
| "olive oil" | Oil, olive (Fats and Oils) | Olives, raw (Vegetables) |
| "olive" | Olives, ripe (Fruits) | Oil, olive (Fats and Oils) |
6. Run-Based Staging
Scoring results are not written directly to production tables. Instead, each scoring run produces a versioned artifact:
- Stage: results are written to a staging table keyed by run_id, with full config, tokenizer hash, and IDF hash recorded.
- Validate: tripwire tests and distribution sanity checks run against the staged data.
- Promote: a single-row pointer table (lexical_mapping_current) is updated to point at the new run_id.
Rollback is instant: repoint the current pointer to the previous run_id. No data is deleted; all runs are retained for audit.
This replaces the previous approach of truncating production tables before rebuilding, which created a window of data loss and prevented comparison between runs.
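The stage/promote/rollback mechanics can be modeled in a few lines; the in-memory map below is a stand-in for the staging and pointer tables, and all names are illustrative:

```typescript
// In-memory model of run-based staging. In production these are database
// tables; here a Map stands in for staging and a variable for the pointer row.
interface Run {
  runId: string;
  tokenizerHash: string;           // recorded for audit / reproducibility
  results: Map<string, string>;    // ingredient -> matched FDC description
}

const stagedRuns = new Map<string, Run>(); // staging table, keyed by run_id
let currentRunId: string | null = null;    // lexical_mapping_current pointer

function stage(run: Run): void {
  stagedRuns.set(run.runId, run);  // append-only: runs are never deleted
}

function promote(runId: string): void {
  if (!stagedRuns.has(runId)) throw new Error(`unknown run ${runId}`);
  currentRunId = runId; // single pointer swap: promote AND rollback are O(1)
}

stage({ runId: "run-1", tokenizerHash: "t0", results: new Map() });
stage({ runId: "run-2", tokenizerHash: "t0", results: new Map() });
promote("run-2");
promote("run-1"); // rollback is just another pointer swap; no data is deleted
```

Because staging is append-only and promotion is one pointer write, there is never a window where production reads partially rebuilt data.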
7. Performance
| Metric | Value |
|---|---|
| FDC corpus size | 8,158 foods (SR Legacy + Foundation) |
| Recipe ingredients | ~2,000 (frequency ≥ 25) |
| Total scoring pairs | ~16.3 million |
| Scoring throughput | ~50ms per ingredient (all 8K candidates) |
| Full run | ~10 seconds |
| Startup (load + IDF build) | ~200ms |
| Memory | ~50MB |
The scorer is pure functions with no I/O in the hot path. All FDC food data is pre-processed once at startup into in-memory structures with token set membership.
8. False-Negative Analysis
The system prioritizes precision over recall. Some classes of valid matches may score below threshold:
- Novel ingredients: User terms not in the synonym table and with no token overlap will score poorly. This is acceptable — false negatives flag for manual review; false positives silently corrupt data.
- Abbreviations: "evoo" (extra virgin olive oil) has no token overlap with "Oil, olive" and will fail. The synonym table can be extended as abbreviations are discovered.
- Misspellings: "oliv oil" loses one token's IDF weight. Jaro-Winkler provides partial recovery, but severe misspellings may drop below threshold.
The needs_review tier (0.40–0.79) exists precisely to surface these cases without auto-accepting them.
9. What This Architecture Does Not Do
- No AI inference. Scoring is deterministic arithmetic on pre-tokenized sets.
- No embedding similarity. There are no vectors, no cosine similarity, no semantic search.
- No human-in-the-loop. The needs_review tier flags candidates for optional inspection but does not require it for the system to function.
- No learning or drift. The category expectation lexicon and synonym table are versioned in source control. They change only through explicit code review.
In a domain where a 25x error in fat content could affect medical decisions, "the model thinks these are similar" is not an acceptable justification. Every match in this system can be explained by citing the exact token overlap, segment position, category match, and synonym confirmation that produced the score.
Appendix A — Nutrition-Specific Notes (Example Domain)
This appendix captures nutrition-specific examples and constraints. The core algorithm is domain-agnostic.
A.1 USDA FDC Inverted Naming Rules
FDC uses an inverted naming convention: "Oil, olive" instead of "olive oil." The system resolves this using domain knowledge sets:
- Container categories (oil, spices, nuts, seeds, sauce, ...): "Oil, olive" → "olive oil"
- Protein bases (chicken, beef, pork, ...): "Chicken, breast" → "chicken breast"
- Product forms (flour, juice, oil, powder, ...): "Wheat, flour" → "wheat flour"
- Poultry classifiers (broilers or fryers, roasting, ...): skipped to find actual cut
These sets are exported from the canonicalization layer and versioned in source control.
A.2 Tripwire Examples (Nutrition)
| Ingredient | Must Map To | Must NOT Map To |
|---|---|---|
| "oil" | Fats and Oils category | Boiled/broiled foods |
| "butter" | Butter, salted (Dairy) | Cookies, butter (Baked) |
| "sugar" | Sugar, turbinado (Sweets) | Cookies, sugar (Baked) |
| "olive oil" | Oil, olive (Fats and Oils) | Olives, raw (Vegetables) |
| "olive" | Olives, ripe (Fruits) | Oil, olive (Fats and Oils) |
A.3 Worked Example (Nutrition)
"olive oil" vs. "Oil, olive, salad or cooking" is a canonical case for token boundary safety and segment weighting.
Changelog
| Version | Date | Changes |
|---|---|---|
| 1.0 | 2026-02 | Initial release. Domain-agnostic framing with nutrition appendix. |