What is in a name?
That which we call a rose, by any other name would smell as sweet. But will it resolve to the same entity?
When Shakespeare penned these famous words, he wasn't thinking about data quality or entity resolution—but the sentiment captures a fundamental challenge in modern data management. A person known as "Robert," "Rob," "Bob," or "Bobby" remains the same individual, regardless of how they're addressed in the data. Names are perhaps the most fundamental identifier we use in data systems, yet they're also among the most inconsistent. Let us consider these common scenarios:
Nicknames and diminutives: William becomes Will, Bill, or Billy
Cultural variations: James in English is Diego in Spanish or Jacques in French
Formal vs. informal usage: Margaret on official documents, Maggie among friends
Transliterations: Китай (Russian for China) becomes Kitay in Latin script
Spelling inconsistencies: Steven vs. Stephen, Catherine vs. Katherine
These variations create significant challenges for data matching. Without proper handling, our systems might fail to recognize that Robert Smith and Bob Smith are the same person, leading to duplicated records, fragmented customer views, and inaccurate analytics. For customer-facing applications, recognizing that these records refer to the same person enables more personalized service and prevents frustrating redundancies. Regulatory compliance also depends on comprehensive entity resolution: proper nickname handling helps meet KYC (Know Your Customer), AML (Anti-Money Laundering), and GDPR requirements. False negatives (failing to match records that should be matched) can be particularly costly in fraud detection, risk management, and healthcare.
While it might seem straightforward to compare names for exact matches, the reality is far more complex. We already handle salutations and variations in the data. Yet, without nickname handling, many true matches go undetected. Our internal benchmarks showed that incorporating nickname matching can improve overall match rates by 10-25% in typical customer datasets.
Hence, it was time to build! Amongst the numerous approaches we considered, dictionary matching stood out for several reasons.
Name variations differ significantly across cultures. Dictionaries can be configured to prioritize certain cultural naming patterns based on the data demographics. For example, we are handling Hebrew names in one case and English names in another.
Configurable dictionaries empower users to leverage their domain knowledge for entity resolution.
Even without user involvement, a nickname dictionary mapping formal names to their common variations is not that difficult to build in the age of LLMs.
The surrounding data points of the record (like address, date of birth, and contact information) can easily be used to validate potential nickname matches and reduce false positives.
Different industries and use cases require different sensitivity levels. Healthcare applications might prioritize precision, while marketing applications might favor recall. A configurable dictionary helps in such cases.
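To make the idea concrete, here is a minimal Python sketch of dictionary-based nickname matching. It is an illustration under our own assumptions, not Zingg's actual implementation: the dictionary contents, the record fields (first_name, last_name, dob), and the helper names (build_index, names_match, records_match) are all hypothetical.

```python
from collections import defaultdict

# Hypothetical nickname dictionary: each row groups a formal name
# with its common variations. In practice this would be configurable.
NICKNAME_GROUPS = [
    ["robert", "rob", "bob", "bobby"],
    ["william", "will", "bill", "billy"],
    ["margaret", "maggie"],
]


def build_index(groups):
    """Map every name to the set of group ids it belongs to."""
    index = defaultdict(set)
    for group_id, names in enumerate(groups):
        for name in names:
            index[name.lower()].add(group_id)
    return index


INDEX = build_index(NICKNAME_GROUPS)


def names_match(a, b, index=INDEX):
    """True if the names are equal or share at least one nickname group."""
    a, b = a.strip().lower(), b.strip().lower()
    if a == b:
        return True
    return bool(index.get(a, set()) & index.get(b, set()))


def records_match(r1, r2):
    """Accept a nickname match only when surrounding fields agree too.

    Requiring last name and date of birth to agree is one simple
    (assumed) way to keep false positives down.
    """
    same_last = r1["last_name"].strip().lower() == r2["last_name"].strip().lower()
    return (
        names_match(r1["first_name"], r2["first_name"])
        and same_last
        and r1["dob"] == r2["dob"]
    )
```

With this sketch, a record for Robert Smith and a record for Bob Smith with the same date of birth are treated as the same person, while Bob Smith and Bill Smith are not. How strictly the surrounding fields must agree is exactly the kind of sensitivity setting that differs between a healthcare and a marketing use case.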
As we zeroed in on the dictionary, we realized that the approach is extensible to other domains like corporations, where mapping IBM to International Business Machines will greatly improve matching. In the end, we built a MAPPING match type that can even normalize categorical columns like gender, or states and countries.
Rolling out new features on running production pipelines is not trivial. For existing users, we did comprehensive A/B testing of entity resolution with and without nickname handling to quantify the improvement in their specific datasets. The results have been even better than we expected!
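The same mapping idea can be sketched for company names and categorical columns. The snippet below is a rough Python illustration, not the MAPPING match type itself: the dictionaries and the build_lookup and normalize helpers are hypothetical, and it simply rewrites each known variant to one canonical value before comparison.

```python
# Hypothetical mappings: canonical value -> known variants.
COMPANIES = {
    "International Business Machines": ["IBM", "I.B.M."],
}
GENDERS = {
    "Female": ["F", "woman"],
    "Male": ["M", "man"],
}


def build_lookup(mapping):
    """Flatten canonical -> variants into a variant -> canonical lookup."""
    lookup = {}
    for canonical, variants in mapping.items():
        lookup[canonical.casefold()] = canonical
        for variant in variants:
            lookup[variant.casefold()] = canonical
    return lookup


def normalize(value, lookup):
    """Return the canonical form, or the trimmed input if it is unknown."""
    cleaned = value.strip()
    return lookup.get(cleaned.casefold(), cleaned)


COMPANY_LOOKUP = build_lookup(COMPANIES)
GENDER_LOOKUP = build_lookup(GENDERS)
```

Here normalize(" ibm ", COMPANY_LOOKUP) and normalize("International Business Machines", COMPANY_LOOKUP) both yield the same canonical string, so the two values compare as equal. Unknown values pass through unchanged, which keeps the mapping from silently discarding data it has no entry for.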
Names are at once the most essential and most variable identifiers in our data systems. Effective entity resolution must account for the rich complexity of how names are used in the real world. As Shakespeare suggested, a rose by any other name would indeed smell as sweet—and with Zingg's nickname matching, data systems will finally recognize that truth!