Atlas Labs
Lab Notes
Entity Resolution · Finance · LLMs · Data Engineering

Entity resolution for financial instruments: deterministic + probabilistic + LLM assist

6 min read

A bond issued by "JPMorgan Chase & Co." appears in one data feed as JPMORGANCHASE, in another as J.P. Morgan Chase, and in a third as JPM Chase Co. Your analytics system needs to know they're the same entity. Your matching logic disagrees.

Entity resolution — the problem of determining when two records refer to the same real-world thing — is one of the most practically important and technically underappreciated problems in financial data engineering. It's the foundation of consolidated positions, risk aggregation, regulatory reporting, and portfolio analytics.

This post covers a layered approach that works in practice: deterministic rules first, probabilistic matching for the ambiguous cases, and LLM assist for the hard ones.

The landscape of financial instrument identifiers

Financial instruments are identified by a confusing array of standards:

  • ISIN (International Securities Identification Number): The gold standard. 12-character alphanumeric. Covers most securities globally.
  • CUSIP: 9-character identifier used primarily in North America.
  • SEDOL: 7-character identifier used in the UK and internationally.
  • Bloomberg ticker: Widely used but proprietary and not stable (tickers get reassigned).
  • RIC (Reuters Instrument Code): Another proprietary identifier.
  • LEI (Legal Entity Identifier): 20-character identifier for legal entities, not instruments.
  • Freetext names: The wild west.

The first challenge is that not all records have all identifiers. A data feed from a smaller broker might have CUSIP but not ISIN. A news feed might have only a company name. Your resolution system needs to handle partial information gracefully.
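One way to model partial records, as a sketch (the dataclass and its field names are illustrative, not a specific library's API — the CUSIP below is JPMorgan Chase's):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class InstrumentRecord:
    source: str                  # which data feed this record came from
    name: Optional[str] = None   # freetext issuer/instrument name
    isin: Optional[str] = None   # 12-char, may be absent
    cusip: Optional[str] = None  # 9-char, North America
    sedol: Optional[str] = None  # 7-char, UK/international

    def identifiers(self) -> dict:
        """Return only the identifiers this record actually has."""
        candidates = {"isin": self.isin, "cusip": self.cusip, "sedol": self.sedol}
        return {k: v for k, v in candidates.items() if v}

# A broker feed record with a CUSIP but no ISIN still carries usable signal:
rec = InstrumentRecord(source="regional_broker", name="JPM Chase Co",
                       cusip="46625H100")
```

Making the optionality explicit in the schema forces every matching layer to check what it has rather than assume a fully populated record.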

Layer 1: Deterministic matching

Deterministic matching is fast, explainable, and high-precision. Start here.

Exact identifier matching. If two records share the same ISIN, they're the same instrument. Full stop. Build a lookup table and resolve everything you can this way first.

def deterministic_match(record_a, record_b) -> bool:
    # Any shared authoritative identifier = match
    authoritative_ids = ["isin", "cusip", "sedol"]
    for id_field in authoritative_ids:
        a_val = record_a.get(id_field)
        b_val = record_b.get(id_field)
        if a_val and b_val and a_val == b_val:
            return True
    return False

Canonical name normalization. Before comparing names, normalize aggressively:

  • Strip legal suffixes: Inc., Corp., Ltd., plc, & Co.
  • Lowercase and remove punctuation
  • Expand common abbreviations: Intl → International, Mfg → Manufacturing
  • Handle common misspellings in your specific domain
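A minimal `normalize_name` along these lines — the suffix and abbreviation tables here are deliberately tiny and would need to grow with your domain:

```python
import re

# Illustrative starter sets; extend these for your universe of issuers.
LEGAL_SUFFIXES = {"inc", "corp", "corporation", "ltd", "plc", "co", "company"}
ABBREVIATIONS = {"intl": "international", "mfg": "manufacturing"}

def normalize_name(name: str) -> str:
    # Lowercase and replace punctuation with spaces
    name = re.sub(r"[^\w\s]", " ", name.lower())
    tokens = []
    for tok in name.split():
        if tok in LEGAL_SUFFIXES:
            continue                                  # drop legal suffixes
        tokens.append(ABBREVIATIONS.get(tok, tok))    # expand abbreviations
    return " ".join(tokens)
```

Normalizing before comparison means "J.P. Morgan Chase & Co." and "jp morgan chase" end up far closer in edit distance than the raw strings suggest.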

Rule-based cross-reference resolution. Financial data providers maintain their own ID mappings. Bloomberg-to-ISIN, CUSIP-to-ISIN, SEDOL-to-ISIN mappings are available (and worth purchasing or sourcing from open providers like OpenFIGI). Load these as lookup tables.
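As a sketch, the cross-reference tables reduce to dictionaries keyed by the vendor identifier (the entries below are illustrative; in practice they'd be loaded from a vendor feed or OpenFIGI):

```python
# Illustrative cross-reference tables (entries shown are Apple Inc. common stock).
CUSIP_TO_ISIN = {"037833100": "US0378331005"}
SEDOL_TO_ISIN = {"2046251": "US0378331005"}

def resolve_to_isin(record: dict) -> "str | None":
    """Map whatever identifier the record carries onto a canonical ISIN."""
    if record.get("isin"):
        return record["isin"]
    if record.get("cusip"):
        return CUSIP_TO_ISIN.get(record["cusip"])
    if record.get("sedol"):
        return SEDOL_TO_ISIN.get(record["sedol"])
    return None    # identifier-free records fall through to later layers
```

Records that resolve to the same ISIN via different identifiers can then be merged deterministically, with full explainability.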

Deterministic matching typically resolves 60–80% of cases. The rest go to probabilistic matching.

Layer 2: Probabilistic matching

Probabilistic matching assigns a confidence score to potential matches rather than making a binary decision. You set thresholds: auto-accept above X, auto-reject below Y, human review in between.
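The triage step itself is simple; the thresholds are what you tune (the values below are illustrative):

```python
AUTO_ACCEPT = 0.92   # above this: merge without review
AUTO_REJECT = 0.40   # below this: treat as distinct

def triage(score: float) -> str:
    """Route a scored candidate pair: accept, reject, or human review."""
    if score >= AUTO_ACCEPT:
        return "accept"
    if score <= AUTO_REJECT:
        return "reject"
    return "review"
```

The gap between the two thresholds is your review queue; widening it trades analyst time for fewer silent errors.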

Blocking first. Don't compare every record against every other — that's O(n²). Use blocking to generate candidate pairs:

  • Records with the same first letter of the issuer name
  • Records with similar CUSIP prefixes (first 6 characters identify the issuer)
  • Records with the same currency and approximate maturity
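The blocking pass can be sketched as generating cheap keys per record and only pairing records that share a key (the field names are assumptions about your record schema, and the keys mirror the bullets above):

```python
from collections import defaultdict
from itertools import combinations

def blocking_keys(record: dict) -> list:
    keys = []
    if record.get("name"):
        keys.append("name0:" + record["name"][0].lower())
    if record.get("cusip"):
        keys.append("cusip6:" + record["cusip"][:6])   # issuer-level prefix
    if record.get("currency") and record.get("maturity_year"):
        keys.append(f"ccy_mat:{record['currency']}:{record['maturity_year']}")
    return keys

def candidate_pairs(records: list) -> set:
    """Index records into blocks, then pair only within each block."""
    blocks = defaultdict(list)
    for i, rec in enumerate(records):
        for key in blocking_keys(rec):
            blocks[key].append(i)
    pairs = set()
    for members in blocks.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs
```

Records sharing multiple keys are paired only once, and records sharing no key are never compared — that is where the O(n²) savings come from.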

Features for scoring:

  • Name similarity (Jaro-Winkler or token set ratio work well for company names)
  • Identifier partial matches (CUSIP prefix match, ISIN country code match)
  • Numeric attribute similarity (maturity date, coupon rate, face value)
  • Data source reliability weights (give higher weight to Bloomberg data than a CSV from a regional broker)

A weighted scoring function that combines these features might look like:

from rapidfuzz import fuzz

def score_pair(record_a, record_b) -> float:
    score = 0.0
    weights = 0.0

    # Name similarity
    if record_a.get("name") and record_b.get("name"):
        name_sim = fuzz.token_set_ratio(
            normalize_name(record_a["name"]),
            normalize_name(record_b["name"])
        ) / 100.0
        score += name_sim * 0.35
        weights += 0.35

    # CUSIP prefix match (issuer-level)
    if record_a.get("cusip") and record_b.get("cusip"):
        prefix_match = record_a["cusip"][:6] == record_b["cusip"][:6]
        score += (1.0 if prefix_match else 0.0) * 0.25
        weights += 0.25

    # Maturity date proximity (values assumed to be date/datetime objects)
    if record_a.get("maturity") and record_b.get("maturity"):
        days_diff = abs((record_a["maturity"] - record_b["maturity"]).days)
        date_sim = max(0, 1 - days_diff / 365)
        score += date_sim * 0.20
        weights += 0.20

    # Coupon rate similarity
    if record_a.get("coupon") and record_b.get("coupon"):
        coupon_sim = 1 - min(1, abs(record_a["coupon"] - record_b["coupon"]) / 5.0)
        score += coupon_sim * 0.20
        weights += 0.20

    return score / weights if weights > 0 else 0.0

Track your threshold performance over time. False positives (incorrectly merged records) are expensive in finance — they corrupt positions and P&L. False negatives (missed matches) lead to fragmented views of risk. The right thresholds depend on your downstream use case.

Layer 3: LLM assist for hard cases

After deterministic and probabilistic matching, you'll have a residual set of ambiguous cases that are genuinely hard. Some of these are worth human review. Many can be resolved by an LLM with the right context.

What LLMs are good at here:

  • Interpreting ambiguous name variants that require world knowledge ("GE Capital" vs "General Electric Capital Corporation")
  • Reasoning about corporate actions (name changes, spin-offs, mergers)
  • Handling non-English names and transliterations
  • Synthesizing context from multiple weak signals

A structured prompt pattern:

import json

def llm_resolve(record_a, record_b, evidence: list[str]) -> dict:
    # `llm` is whatever model client you use; `.complete()` is illustrative
    prompt = f"""
You are a financial data specialist. Determine if these two records refer to the same financial instrument.

Record A:
{json.dumps(record_a, indent=2)}

Record B:
{json.dumps(record_b, indent=2)}

Additional context:
{chr(10).join(f"- {e}" for e in evidence)}

Respond with JSON:
{{
  "is_match": true | false,
  "confidence": 0.0 to 1.0,
  "reasoning": "brief explanation",
  "match_type": "same_instrument" | "same_issuer_different_instrument" | "different_entity"
}}
"""
    response = llm.complete(prompt)
    return json.loads(response.text)

Key design choices:

  • Always ask for structured output (JSON)
  • Ask for confidence and reasoning, not just a binary answer
  • Pass in the evidence from your probabilistic scoring
  • Use a faster/cheaper model (GPT-4o-mini, Haiku) for high-volume cases; reserve larger models for low-confidence cases
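One sketch of that escalation: try the cheap model first and only re-ask the larger one when its confidence is low (`llm_cheap` and `llm_strong` are hypothetical callables wrapping `llm_resolve` with different models, and the 0.7 cutoff is illustrative):

```python
def resolve_hard_case(record_a, record_b, evidence, llm_cheap, llm_strong) -> dict:
    """Escalation sketch: cheap model first, strong model on low confidence."""
    result = llm_cheap(record_a, record_b, evidence)
    if result["confidence"] < 0.7:       # illustrative escalation cutoff
        result = llm_strong(record_a, record_b, evidence)
        result["escalated"] = True       # record that the expensive path ran
    return result
```

Because most residual cases are easy for any model, the expensive tier ends up handling only a small fraction of the volume.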

Human-in-the-loop for critical cases. For your highest-stakes instruments (e.g., large position holdings), route low-confidence LLM decisions to human reviewers. The LLM's reasoning still adds value — it gives the reviewer a starting point.

Making the resolved data trustworthy

Entity resolution is worthless if downstream consumers don't trust the output. A few practices that help:

Audit trail. Record which layer resolved each match and why. Deterministic matches from ISIN lookups are more trustworthy than probabilistic matches — make that visible.
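A minimal audit record might carry the resolving layer alongside the decision (a sketch; the fields are illustrative):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class MatchDecision:
    record_a_id: str
    record_b_id: str
    is_match: bool
    layer: str          # "deterministic" | "probabilistic" | "llm" | "human"
    reason: str         # e.g. "shared ISIN", "score=0.87", LLM reasoning text
    confidence: float   # 1.0 for deterministic identifier matches
    decided_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

decision = MatchDecision("feed1:123", "feed2:456", is_match=True,
                         layer="deterministic", reason="shared ISIN",
                         confidence=1.0)
```

Persisting these alongside the merged records lets downstream consumers filter on `layer` and `confidence` rather than trusting every match equally.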

Confidence scores in the output. Don't collapse confidence to a binary. Let downstream systems decide how to handle uncertain matches.

Feedback loops. When downstream consumers find errors — wrong merges, missed matches — route that signal back to improve thresholds and rules. Track your precision and recall over time.

Versioned canonical identifiers. When you assign an internal canonical ID to an entity, version it. Corporate actions (mergers, spin-offs, name changes) mean entities evolve. Your ID scheme should handle this without breaking historical records.
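A toy sketch of versioned IDs, where a corporate action creates a new version linked to its predecessor instead of rewriting history (the entity data is invented for illustration):

```python
# Canonical IDs are (entity_id, version) pairs; a rename or merger appends a
# new version and links it, so historical records keep pointing at the old one.
entities = {
    ("ENT-00042", 1): {"name": "Acme Capital Corporation",
                       "valid_from": "1990-01-01",
                       "successor": ("ENT-00042", 2)},
    ("ENT-00042", 2): {"name": "Acme Capital",
                       "valid_from": "2015-06-01",
                       "successor": None},
}

def current_version(entity_id: str) -> int:
    """Latest version of an entity; historical records keep their version."""
    return max(v for (eid, v) in entities if eid == entity_id)
```

A position booked in 2010 stays attached to version 1, while new matches resolve to version 2 — nothing historical breaks when the name changes.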

Entity resolution done well is invisible — everything just works. Done poorly, it's a perpetual source of data quality incidents. The layered approach gives you the coverage of probabilistic matching with the precision of deterministic rules, and LLM assist handles the long tail without requiring endless rule engineering.