Deterministic Matching Vs Probabilistic Matching

Why not both together at the same time?

Mar 16, 2024

In the entity resolution space, we refer to exact matching on attributes as deterministic matching. If the application systems consistently collect and persist trusted, known identifiers such as Social Security Numbers, deterministic matching can resolve such records and help us identify the true individual.

In most real-world organisations, different systems capture different attributes. Let us imagine a retail store with both online and offline presence. A web support system would capture the email correctly while also logging the name. A call centre would mostly have a correct phone number, but its other details may not be as reliable. The offline store would have some or all of the details, as accurately as the human behind the computer tasked with entering them can type them in. Thus some of the individual records have identifiers, like phone and email in the above case, which may match exactly, while many of the records have names and addresses with all kinds of variations, typos, abbreviations etc. Assessing the extent of such variations for an entity and reconciling the fuzzy and inexact matches is known as probabilistic matching.

At a simplistic level, one can think of deterministic matching as a database join on trusted id columns like passport number or email. Probabilistic Matching involves statistical and ML based techniques to determine if a given record pair is indeed the same real world entity. If we search these terms, there are multiple articles around Deterministic Vs Probabilistic Matching, pitting one against the other. Since the approaches to solve deterministic matching are very different from the ones for probabilistic matching, these techniques are often viewed as contrary.
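To make the contrast concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the records, the field names, and the 0.85 threshold are made up, and the averaged string similarity from the standard library's difflib is only a simple stand-in for the statistical and ML techniques a real probabilistic matcher would use.

```python
from difflib import SequenceMatcher

# Hypothetical records from two systems; all names and values are invented.
web = [
    {"id": "w1", "name": "Jonathan Smith", "email": "jon.smith@example.com", "city": "Springfield"},
    {"id": "w2", "name": "Maria Garcia", "email": "maria.g@example.com", "city": "Riverton"},
]
store = [
    {"id": "s1", "name": "Jon Smith", "email": "jon.smith@example.com", "city": "Springfield"},
    {"id": "s2", "name": "Maria Garcia", "email": None, "city": "Riverton"},
    {"id": "s3", "name": "Peter Novak", "email": None, "city": "Lakeside"},
]

def deterministic_matches(left, right, key):
    """Exact join on a trusted identifier, skipping records where it is missing."""
    index = {}
    for r in right:
        if r[key]:
            index.setdefault(r[key].lower(), []).append(r["id"])
    return [
        (l["id"], rid)
        for l in left
        if l[key]
        for rid in index.get(l[key].lower(), [])
    ]

def similarity(a, b, fields=("name", "city")):
    """Average string similarity in [0, 1] over the given fields (a stand-in score)."""
    scores = [SequenceMatcher(None, a[f].lower(), b[f].lower()).ratio() for f in fields]
    return sum(scores) / len(scores)

def fuzzy_matches(left, right, threshold=0.85):
    """Pairs whose stand-in similarity score reaches the assumed threshold."""
    return [
        (l["id"], r["id"])
        for l in left
        for r in right
        if similarity(l, r) >= threshold
    ]

print(deterministic_matches(web, store, "email"))
print(fuzzy_matches(web, store))
```

The join on email can only find pairs where both systems captured the email. The fuzzy pass can also link Maria Garcia's store record, which has no email at all.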

However, we have a different view.

As we dived deeper and looked at multiple datasets, we saw that in most cases the data contained multiple signals: some records carried trusted identifiers, and most did not. Even when identifiers are present, there tends to be more than one. For example, one system could make the driver’s license ID mandatory and the passport number optional, while another system could reverse that rule. A third system may treat both of these IDs as optional. If we match deterministically on passport and license in this case, we would only resolve some entities, not all. Worse, a person who provided only a license ID in the third system would match only with the first system, and their record in the second system would go unmatched if it holds only a passport number.
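The three-system scenario above can be reproduced in a few lines. This is an illustrative sketch: the records, identifier values, and field names are hypothetical, and all three records are meant to describe the same person.

```python
from itertools import combinations

# Hypothetical records for one person across three systems.
# System A: license mandatory, passport optional.
# System B: passport mandatory, license optional.
# System C: both optional.
records = [
    {"id": "a1", "system": "A", "license": "D123", "passport": None, "name": "Asha Rao"},
    {"id": "b1", "system": "B", "license": None, "passport": "P987", "name": "Asha Rao"},
    {"id": "c1", "system": "C", "license": "D123", "passport": None, "name": "Asha Rao"},
]

def shares_identifier(r1, r2, id_fields=("license", "passport")):
    """True when both records carry the same non-missing value for any identifier."""
    return any(r1[f] is not None and r1[f] == r2[f] for f in id_fields)

# Deterministic matching only: compare every pair on the trusted identifiers.
pairs = [
    (r1["id"], r2["id"])
    for r1, r2 in combinations(records, 2)
    if shares_identifier(r1, r2)
]
matched_ids = {rid for pair in pairs for rid in pair}
unmatched = [r["id"] for r in records if r["id"] not in matched_ids]

print(pairs)      # [('a1', 'c1')]
print(unmatched)  # ['b1']
```

The record from system B is left out even though its name is identical to the other two, because an identifier-only match never looks at the name.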

This has serious ramifications. In an ecommerce store, the inability to reconcile effectively would jeopardize the million-dollar investment in recommender systems that consume customer preferences. In an anti-money-laundering system, the ability to catch fraud depends directly on the quality of the matching. An insurer cannot understand risk at a household or person level if records remain unresolved.

Thus, we need to solve for both deterministic and probabilistic matching at the same time. An either/or approach would treat records with identifiers as different entities from those without. Why not use every single data point of the record? There is tremendous value to be unlocked by looking at the problem holistically. If a deterministic match can be found, use that. If not, use the other attributes. Essentially, weave them all together so that records matched deterministically and records matched probabilistically are resolved to a single entity.
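One way to picture "use the deterministic match if there is one, otherwise use the other attributes" is the sketch below. It is not how Zingg implements this. It is a toy under stated assumptions: hypothetical records and field names, an assumed 0.85 threshold, difflib string similarity standing in for a real probabilistic model, and a small union-find to group linked records into entities.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical records; the first three are meant to be the same person.
records = [
    {"id": "a1", "license": "D123", "passport": None, "name": "Asha Rao", "city": "Pune"},
    {"id": "b1", "license": None, "passport": "P987", "name": "Asha Rao", "city": "Pune"},
    {"id": "c1", "license": "D123", "passport": None, "name": "A. Rao", "city": "Pune"},
    {"id": "d1", "license": None, "passport": None, "name": "Vikram Mehta", "city": "Delhi"},
]

ID_FIELDS = ("license", "passport")   # trusted identifiers (assumed)
FUZZY_FIELDS = ("name", "city")       # attributes for the fallback (assumed)
THRESHOLD = 0.85                      # assumed cut-off for the stand-in score

def is_match(r1, r2):
    """Deterministic check first; fall back to a fuzzy score on other attributes."""
    if any(r1[f] is not None and r1[f] == r2[f] for f in ID_FIELDS):
        return True
    scores = [
        SequenceMatcher(None, r1[f].lower(), r2[f].lower()).ratio()
        for f in FUZZY_FIELDS
    ]
    return sum(scores) / len(scores) >= THRESHOLD

def resolve(records):
    """Group records into entities by linking every matching pair (union-find)."""
    parent = {r["id"]: r["id"] for r in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for r1, r2 in combinations(records, 2):
        if is_match(r1, r2):
            parent[find(r1["id"])] = find(r2["id"])

    clusters = {}
    for r in records:
        clusters.setdefault(find(r["id"]), []).append(r["id"])
    return sorted(clusters.values())

print(resolve(records))  # [['a1', 'b1', 'c1'], ['d1']]
```

Here the license links a1 and c1, the name and city link a1 and b1, and the grouping step pulls all three into one entity. A real system would also need to decide what to do when identifiers conflict, and would avoid comparing every pair, neither of which this sketch attempts.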

The above is a simplistic take and there is a lot of software engineering involved under the hood to build this out. Effort notwithstanding, we are convinced that it is the right approach. In fact, we have gone all out while building this feature, with support for multiple combinations of attributes to make sure no record which has a match remains unmatched. Deterministic and probabilistic matching are better together, and early Zingg Enterprise customers see an immediate jump in match accuracy.

Is this something you care about too? Hit reply and let's spark a conversation!

Sonal’s Newsletter