The Identity Crisis: Why “Getting the Basics Right” Is the Hardest Work in Data
And it is actually world-class data engineering!
Matthew Niederberger, Founder of Martech Therapy, recently wrote something that stopped me in my tracks:
“If you’re a practitioner still trying to get a unified customer ID working across channels, you feel left behind technologically and question your competence.”
This single sentence captures a crisis that’s quietly plaguing data engineering teams across the industry. While vendor booths showcase AI-driven personalization and predictive customer journeys, talented practitioners are privately wrestling with something supposedly “simpler”: building a unified customer ID that actually works.
The cruel irony? They’re doing the hardest, most important work. But in a market obsessed with bleeding-edge features, nobody’s celebrating it.
The Vendor Narrative Gap
So why do talented teams feel like they’re failing when they’re still working on identity?
Matthew Niederberger nails the dynamic: vendors showcase what’s possible, not what’s required to get there.
Vendor conference presentations follow a formula:
- Demonstrate the amazing outcomes (real-time personalization! predictive journeys!) 
- Show the sleek UI that makes it look effortless 
- Mention “unified customer data” as an assumed prerequisite 
- Move quickly past the foundation to focus on the flashy features 
The implicit message: if you’re still working on unified identity, you’re behind. Everyone else has already solved this.
But that’s not remotely true. The vast majority of organizations are still struggling with identity resolution. They’re just not talking about it publicly because it feels like admitting failure.
The result is a massive selection bias. We hear about the 5% who’ve moved on to advanced use cases. We don’t hear about the 95% still building foundations—not because they’re incompetent, but because the foundations are genuinely difficult to build well.
The Imposter Syndrome Epidemic
I’ve had countless conversations with data teams at conferences, in Slack channels, and on video calls. A pattern emerges again and again:
“We’re still working on identity resolution.”
“We haven’t gotten to the AI stuff yet.”
“We’re just trying to get our customer IDs right first.”
The word “just” does a lot of heavy lifting in those sentences. It diminishes foundational engineering work into a checkbox, a prerequisite you should have already mastered before moving on to the “real” work.
But here’s what nobody is saying loudly enough: unified identity isn’t a prerequisite. It’s the final boss of data engineering.
Why Unified Identity Is Actually Hard
Let’s be brutally honest about what “getting a unified customer ID working across channels” actually requires:
The Data Quality Reality
Real-world customer data is messy in ways that mock deterministic rules:
- Sarah Johnson gets married and becomes Sarah Martinez 
- john.smith@gmail.com and jsmith@gmail.com might be the same person (or different people with similar names) 
- Phone numbers get recycled—last year’s customer is this year’s prospect 
- Addresses have hundreds of valid formatting variations 
- People move, change jobs, update preferences 
- Data entry errors create systematic inconsistencies 
A unified customer ID system has to handle all of this, at scale, while maintaining high accuracy. Miss too many matches and your “360-degree customer view” is fragmented. Create false positives and you’re merging different people, violating privacy and destroying trust.
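To make that concrete, here is a toy illustration, with entirely made-up records, of how a simple deterministic rule quietly misses a real match:

```python
# Toy records (hypothetical) showing why exact-key matching falls short.
records = [
    {"id": 1, "name": "Sarah Johnson",  "email": "sarah.j@example.com",  "phone": "555-0101"},
    {"id": 2, "name": "Sarah Martinez", "email": "sarah.j@example.com",  "phone": "555-0101"},  # same person, married name
    {"id": 3, "name": "John Smith",     "email": "john.smith@gmail.com", "phone": None},
    {"id": 4, "name": "Jon Smith",      "email": "jsmith@gmail.com",     "phone": None},        # same person? different people?
]

# Deterministic rule: "same name AND same email" means the same customer.
def exact_match(a, b):
    return a["name"] == b["name"] and a["email"] == b["email"]

pairs = [(a["id"], b["id"])
         for i, a in enumerate(records)
         for b in records[i + 1:]
         if exact_match(a, b)]
print(pairs)  # [] -- the rule misses the Johnson/Martinez match entirely
```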
The Technical Challenge
You’re not matching records in a single clean database. You’re resolving identities across:
- Web analytics (cookie-based, increasingly unreliable) 
- Mobile apps (device IDs, app-specific identifiers) 
- CRM systems (email, phone, account IDs) 
- Point-of-sale systems (loyalty cards, payment methods) 
- Call center interactions (phone numbers, case IDs) 
- Marketing platforms (each with their own identifier schemes) 
- Third-party data sources (with their own quality issues) 
Each source has different identifier types, different data quality standards, and different update frequencies. You’re not just joining tables—you’re probabilistically determining which records across fundamentally different systems represent the same human being.
The Scale Problem
Now multiply this across millions or billions of records. The naive approach—comparing every record to every other record—is computationally infeasible. You need sophisticated blocking strategies, efficient algorithms, and infrastructure that can process massive datasets without collapsing under its own weight.
And you need to do this continuously, not as a one-time batch job. New data arrives constantly. Identities evolve. Your resolution system has to keep pace.
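Some rough, purely illustrative arithmetic shows why the naive all-pairs approach breaks down and why the blocking step described below is non-negotiable:

```python
# Back-of-the-envelope math with illustrative numbers.
n = 50_000_000                       # 50M customer records
naive_pairs = n * (n - 1) // 2       # compare everything to everything
print(f"naive comparisons:   {naive_pairs:.2e}")    # ~1.25e+15 -- not realistic

# With blocking into buckets of ~200 candidate records,
# comparisons only happen inside a bucket.
bucket_size = 200
buckets = n // bucket_size
blocked_pairs = buckets * (bucket_size * (bucket_size - 1) // 2)
print(f"blocked comparisons: {blocked_pairs:.2e}")  # ~5e+09 -- large, but tractable
```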
The Governance Complexity
Layer on consent management, privacy regulations, data retention policies, and the right-to-be-forgotten requirements. Your unified ID system isn’t just a technical challenge—it’s a governance framework that has to work across jurisdictions, respect customer preferences, and provide audit trails.
This isn’t “basic.” This is some of the most complex data engineering work you can do.
The Three-Layer Approach That Actually Works
So how do you actually build identity resolution that works at warehouse scale? Let me break down the architecture that successful implementations share:
Layer 1: Blocking
You can’t compare every record to every other record—that’s O(n²) complexity and computationally infeasible at billion-record scale. Smart blocking reduces potential comparisons by 99%+ while maintaining high recall.
The key is choosing blocking keys that cast a wide enough net to catch true matches while being selective enough to make comparison tractable. This might mean blocking on:
- First three characters of last name + ZIP code 
- Phone area code + first name 
- Email domain + approximate birth date 
- Phonetic encoding of names 
Good blocking strategies use multiple passes with different keys, ensuring that records with typos or variations still get compared. At Zingg, we learn the blocking strategy directly from the user’s data and labels, across all the fields present, so that different datasets can be blocked efficiently.
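As a sketch of what a multi-pass blocking step can look like, here is a hand-rolled version with illustrative keys; a learned approach like Zingg’s derives its blocking functions from labeled data rather than relying on fixed rules like these:

```python
from collections import defaultdict
from itertools import combinations

def blocking_keys(rec):
    """Emit several keys per record so a typo in one field doesn't hide a true match."""
    keys = []
    if rec.get("last_name") and rec.get("zip"):
        keys.append(("name_zip", rec["last_name"][:3].lower() + "|" + rec["zip"][:5]))
    if rec.get("email"):
        keys.append(("email_local", rec["email"].split("@")[0].lower()))
    if rec.get("phone"):
        keys.append(("phone_tail", rec["phone"][-4:]))
    return keys

def candidate_pairs(records):
    """Group records by each key; only records sharing at least one key get compared."""
    blocks = defaultdict(list)
    for rec in records:
        for key in blocking_keys(rec):
            blocks[key].append(rec["id"])
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs
```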
Layer 2: Matching
This is where ML earns its keep. For each pair of records that passed blocking, you need to score the likelihood they represent the same entity.
Modern matching models use features like:
- String similarity (edit distance, Jaro-Winkler, token-based matching) 
- Phonetic similarity (Soundex, Metaphone, NYSIIS) 
- Behavioral patterns (purchase timing, channel preferences, engagement signals) 
- Temporal proximity (how close in time were interactions?) 
- Geographic consistency (do locations make sense together?) 
The output isn’t binary—it’s probabilistic scoring. This pair has a 0.95 probability of being a match. That pair scores 0.45. This gives you flexibility in tuning precision vs. recall based on your use case.
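Here is a minimal sketch of the feature side, assuming the open source jellyfish library for Jaro-Winkler and Soundex; the hand-picked weights at the end are a stand-in for a trained model (logistic regression, gradient-boosted trees, or similar) that would actually produce the match probability:

```python
import jellyfish  # assumed dependency for string and phonetic similarity

def pair_features(a, b):
    """Similarity features for one candidate pair; field names are illustrative."""
    return {
        "name_jw": jellyfish.jaro_winkler_similarity(a["name"].lower(), b["name"].lower()),
        "last_soundex": float(
            jellyfish.soundex(a["name"].split()[-1]) == jellyfish.soundex(b["name"].split()[-1])
        ),
        "email_exact": float(bool(a.get("email")) and a.get("email") == b.get("email")),
        "zip_exact": float(bool(a.get("zip")) and a.get("zip") == b.get("zip")),
    }

def match_score(a, b):
    """Toy weighted blend standing in for a trained model's probability output."""
    f = pair_features(a, b)
    return 0.5 * f["name_jw"] + 0.2 * f["last_soundex"] + 0.2 * f["email_exact"] + 0.1 * f["zip_exact"]

print(match_score({"name": "Jon Smith", "email": "jsmith@gmail.com"},
                  {"name": "John Smith", "email": "john.smith@gmail.com"}))
```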
Layer 3: Clustering
The final challenge: transform pairwise match scores into entity groups. This is where you resolve transitive relationships.
If Record A matches Record B with 0.92 confidence, and Record B matches Record C with 0.88 confidence, are A, B, and C all the same entity? What if A and C compared directly would only score 0.65?
Clustering algorithms handle these transitivity questions, creating stable entity groups even as new records arrive and scores shift over time. This layer also handles entity splits (when you discover you incorrectly merged two people) and entity merges (when new evidence shows records you thought were separate are actually one entity).
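A minimal way to picture this step is connected components over pairs that score above a threshold, sketched here with a small union-find; production systems layer split and merge handling and re-validation of weak transitive links on top of this:

```python
def cluster(scored_pairs, threshold=0.85):
    """Group record ids into entities from (id_a, id_b, score) tuples."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for a, b, score in scored_pairs:
        if score >= threshold:
            union(a, b)

    clusters = {}
    for rec_id in parent:
        clusters.setdefault(find(rec_id), set()).add(rec_id)
    return list(clusters.values())

# A-B at 0.92 and B-C at 0.88 land A, B, and C in one entity,
# even though A and C were never compared directly.
print(cluster([("A", "B", 0.92), ("B", "C", 0.88)]))
```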
Why This Must Happen in the Warehouse and Data Lake
Here’s a critical architectural decision: entity resolution must run natively in your data warehouse or data lake. Moving billions of records out for processing defeats the entire purpose of warehouse-native architecture.
Data Gravity
Your customer data is already in Snowflake, Databricks, BigQuery, or Redshift. That’s where your transformation pipelines run, where your analytics queries execute, where your data science models train. Extracting data to a separate entity resolution system creates:
- Data movement costs (bandwidth, storage, processing) 
- Synchronization complexity (keeping multiple copies current) 
- Latency (time to move data before resolution can happen) 
- New data silos (exactly what warehouse consolidation was supposed to eliminate) 
Incremental Processing
Entity resolution isn’t a one-time batch job. New customer records arrive constantly—from web signups, mobile registrations, purchase transactions, support interactions. You need incremental processing that resolves new records against existing entities without full dataset re-processing.
This only works efficiently when resolution runs where the data lives, using the warehouse’s native capabilities for incremental updates and change data capture.
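In rough pseudocode terms, an incremental pass might look like the sketch below; the helper names and structures are illustrative, not a specific product API:

```python
def resolve_increment(new_records, keys_fn, score_fn, block_index, entity_of, threshold=0.85):
    """
    keys_fn(record)                -> iterable of blocking keys for a record
    score_fn(record, existing_id)  -> match probability against an existing record
    block_index[key]               -> ids of existing records sharing that blocking key
    entity_of[existing_id]         -> entity (cluster) id of an existing record
    """
    assignments = {}
    for rec in new_records:
        best_entity, best_score = None, 0.0
        # Only compare against existing records that share a blocking key.
        for key in keys_fn(rec):
            for cand_id in block_index.get(key, []):
                score = score_fn(rec, cand_id)
                if score > best_score:
                    best_entity, best_score = entity_of[cand_id], score
        # Attach to the best existing entity, or start a new one.
        assignments[rec["id"]] = best_entity if best_score >= threshold else f"new::{rec['id']}"
    return assignments
```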
Integration with Data Pipelines
Your entity resolution system needs to integrate seamlessly with:
- dbt transformations and data modeling 
- Reverse ETL for audience activation 
- BI tools for reporting and analytics 
- ML platforms for model training 
- Data quality frameworks for monitoring 
- AI agents for context 
This integration is natural when everything operates in the same environment. It becomes painful when entity resolution is a separate system with its own APIs, data formats, and operational requirements.
Governance and Security
Your data warehouse already has:
- Access controls and role-based permissions 
- Audit logging for compliance 
- Encryption at rest and in transit 
- Data masking and tokenization capabilities 
- Retention policies and deletion workflows 
Replicating customer data to an external entity resolution system means duplicating all of this governance infrastructure, and it quickly becomes a privacy and compliance nightmare under GDPR, CCPA, and other applicable laws. Run resolution in the warehouse, and you inherit these controls automatically.
The Human-in-the-Loop Problem
Here’s what the automation-obsessed narrative misses: pure ML automation for entity resolution rarely works in practice.
Why Full Automation Fails
Even the best ML models make mistakes on edge cases:
- Common names with similar addresses (are these two John Smiths in the same apartment building one person or two?) 
- Shared email addresses (family members, assistants, corporate accounts) 
- Identity changes (marriages, name corrections, gender transitions) 
- Data entry errors that create systematic patterns the model learns incorrectly 
- Inadequate data capture 
You need human judgment on ambiguous cases, and you need to feed that judgment back into the model to improve over time.
Workflows That Actually Work
Successful implementations build workflows for:
Training data labeling: Which record pairs are matches, which are non-matches? Your initial model needs labeled examples, and getting these labels right determines everything downstream.
Threshold tuning: ML models output probability scores, but you choose the thresholds. A score above 0.90 might be an automatic match and one below 0.40 an automatic non-match, but what about the 0.65 that could go either way? Human review helps calibrate these cutoffs based on business impact.
Edge case review: The model flags cases it’s least confident about. A data steward reviews them, makes decisions, and those decisions become training examples for model tuning.
Continuous model refinement: Data patterns evolve. New data sources get added. Data quality issues emerge and get fixed. The model needs periodic retraining, and human feedback on recent predictions drives that retraining.
Dispute resolution: When downstream teams question why records were merged or split, you need workflows to investigate, explain the decision, and potentially override the model when it’s wrong.
The best systems don’t eliminate human review—they make it efficient. Instead of manually reviewing millions of records, you review the hundreds that matter most.
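A small routing sketch makes the threshold-tuning and review-queue workflow above concrete; the two cutoffs here are illustrative business decisions, not model outputs:

```python
AUTO_MATCH = 0.90      # at or above this score: merge automatically
AUTO_NON_MATCH = 0.40  # at or below this score: discard automatically

def route(pair, score):
    """Send each scored pair to one of three queues."""
    if score >= AUTO_MATCH:
        return ("auto_match", pair)
    if score <= AUTO_NON_MATCH:
        return ("auto_non_match", pair)
    # A data steward decides; the decision becomes a labeled example for retraining.
    return ("review_queue", pair)

print(route(("rec_17", "rec_42"), 0.65))  # ('review_queue', ('rec_17', 'rec_42'))
```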
Common Implementation Mistakes
I’ve seen these patterns repeatedly as organizations tackle identity resolution:
Mistake 1: Starting Too Big
Teams try to resolve all entity types simultaneously—customers, accounts, products, locations, employees. This creates enormous scope, makes success criteria fuzzy, and delays time to value.
Better approach: Start with one high-value entity type (usually customers). Prove the value, build organizational capability, then expand to other entity types. Each new entity type gets easier because you’ve learned the patterns.
Mistake 2: Ignoring Data Quality
Entity resolution surfaces every data quality issue you’ve been ignoring. If 40% of your email addresses are null or invalid, you can’t rely on email matching. If name fields contain inconsistent formatting, parsing errors, or non-name data, string matching becomes unreliable.
Better approach: Treat entity resolution and data quality as inseparable workstreams. Profile your data first. Fix systematic quality issues before expecting resolution to work well. Use entity resolution results to identify and prioritize remaining quality problems.
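Profiling doesn’t have to be elaborate. A few lines, here assuming pandas and a handful of made-up rows, tell you whether a field is even usable for matching:

```python
import pandas as pd

# Toy customer records; in practice this would be a sample pulled from the warehouse.
df = pd.DataFrame([
    {"email": "a@example.com", "phone": "555-0101", "name": "Ann Lee"},
    {"email": None,            "phone": "555-0102", "name": "ann lee "},
    {"email": "b@example",     "phone": None,       "name": "Bob K"},
])

null_rate = df.isna().mean()                                          # share of missing values per column
valid_email = df["email"].str.contains(r"@.+\..+", na=False).mean()   # crude format check
print(null_rate)
print(f"usable emails: {valid_email:.0%}")
```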
Mistake 3: Treating It as a Project
Organizations approach entity resolution as a one-time project: scope it, build it, deploy it, move on. Then they’re shocked when accuracy degrades over time as data patterns shift.
Better approach: Treat entity resolution as infrastructure that requires ongoing ownership, monitoring, and refinement—just like your data pipelines or ML models. Assign clear ownership, establish SLAs, build monitoring dashboards, schedule regular review.
Mistake 4: Optimizing for Perfection
Teams spend months trying to achieve 99.9% accuracy before releasing anything. Meanwhile, downstream teams make decisions based on no identity resolution at all (effectively 0% accuracy).
Better approach: Ship at 95% accuracy, deliver value, iterate to higher accuracy based on real-world feedback. You learn more from production usage in two weeks than from six months of pre-release testing. Start good, iterate to great.
Mistake 5: Building From Scratch
Organizations underestimate the complexity and decide to build custom entity resolution systems. Eighteen months later, they have something that works on small datasets but can’t scale, has accuracy issues they don’t understand, and requires specialized knowledge that only one person has.
Better approach: Use proven solutions—whether open source frameworks or commercial platforms—that solve the hard algorithmic and scaling problems. Invest your engineering effort in customization, integration, and business-specific logic, not reinventing core entity resolution algorithms.
The Real Question
At this point in conversations, someone usually asks: “Should we do ML-based entity resolution?”
That’s the wrong question.
The real question is: “Can our organization afford to build composable architectures on a foundation of deterministic rules that we know will fail at scale?”
Every organization moving to warehouse-native architectures will eventually face the identity resolution challenge. The only question is whether you address it proactively—as part of your architectural planning—or reactively, after your first major data quality incident.
The reactive path looks like this:
- Build your composable architecture on simple deduplication rules 
- Activate audiences and launch campaigns 
- Discover that 15% of your “unified” customers are actually duplicates 
- Or worse, discover you’ve merged different people and violated privacy 
- Scramble to retrofit proper identity resolution 
- Rebuild audiences and downstream dependencies 
- Deal with the business impact and loss of trust 
The proactive path looks like this:
- Recognize identity resolution as foundational, not an afterthought 
- Build or adopt ML-based entity resolution that runs in your warehouse 
- Start with one entity type, prove value, iterate 
- Build downstream capabilities on accurate identity from day one 
- Scale confidently knowing your foundation is solid 
The reactive path is more common (because people underestimate the challenge). The proactive path is more successful (because foundations matter).
What Actually Endures
Here’s the uncomfortable truth about chasing bleeding-edge features without solid foundations:
A CDP that can’t accurately unify customer IDs is just an expensive data warehouse with a marketing UI.
You can bolt on AI-driven orchestration, predictive segmentation, and real-time decisioning. But if your underlying identity graph is wrong—if you think one customer is three people, or three people are one customer—every downstream capability inherits those errors.
Your AI doesn’t fix bad identity data. It amplifies it. Your personalization engine delivers fragmented experiences. Your predictive models train on fictional customer journeys. Your real-time decisioning makes decisions based on incomplete context.
The organizations actually positioned to leverage advanced CDP capabilities aren’t the ones with the longest feature lists. They’re the ones who took the time to build reliable identity resolution first.
The Real Achievement
Let me share two examples of organizations that built this foundation right.
Fortnum & Mason, the 300-year-old luxury British retailer, faced a classic challenge: customer data fragmented across restaurant bookings, email signups, online transactions, and in-store purchases. They initially tried a third-party service for identity resolution, but it created non-persistent identifiers, offered limited visibility into the matching process, and raised privacy concerns about sharing their entire dataset externally.
Jon Moss, Fortnum’s Customer Engagement Director, describes the transformation:
“For the first time, we’re able to understand how customers are shopping with us—online, in-store, over the phone, or in restaurants. Zingg has helped us unify this data and gain insights we never had before.”
By implementing ML-based entity resolution running natively in their data warehouse, Fortnum & Mason built:
- Persistent, unified customer identifiers across all touchpoints 
- Complete visibility and control over the matching process 
- Privacy-compliant resolution that keeps sensitive data in their own infrastructure 
- A foundation for their composable CDP architecture 
Orthodox Union, one of the largest Orthodox Jewish organizations in the US, operates over 40 websites, 5 mobile applications, and custom CRMs for different departments. Shelomo Dobkin, their Director of Product Development, explains:
“We moved from using Zingg Open Source to the Enterprise version on Snowflake, making it the engine powering our golden records. Zingg’s powerful features and clean results have really set us apart.”
Their implementation required understanding not just individual identities but household relationships—a critical requirement for their community-focused mission. They built entity resolution that could:
- Match people across multiple websites and CRM systems 
- Understand household connections and family relationships 
- Create golden records that serve as the foundation for all downstream systems 
- Process data incrementally as new interactions occur across their digital properties 
What makes these achievements remarkable isn’t just the technical implementation. It’s what they enable. This is world-class data engineering that delivers measurable business value:
- Marketing teams can trust their audiences 
- Analytics teams can build on reliable customer journeys 
- Data science teams can train models on accurate historical data 
- Operations teams can deliver consistent experiences across channels 
- Legal and compliance teams can demonstrate proper governance 
Everything else—the AI features, the real-time orchestration, the predictive models—becomes possible because the foundation is solid.
Celebrating Foundational Work
The industry needs to reframe how we talk about identity resolution and other foundational data work.
These aren’t prerequisites that should be quick and easy. They’re achievements that deserve recognition.
When a team successfully builds unified customer IDs across disparate systems, that’s a case study worth sharing. When an organization finally has audiences their teams trust, that’s an accomplishment worth celebrating. When identity resolution maintains accuracy as data scales from millions to billions of records, that’s engineering excellence.
The castle that survives wave after wave isn’t the tallest or flashiest. It’s the one with foundations deep enough to withstand the tide.
Moving Forward
If you’re a data practitioner still working on unified customer identity, you’re not behind. You’re doing the work that matters most.
Don’t let vendor demos and curated LinkedIn posts make you feel like an imposter. The bleeding-edge use cases get all the attention precisely because they’re rare. The foundational work gets less visibility precisely because it’s what everyone is actually struggling with.
Focus on making your identity infrastructure trustworthy and durable:
- Invest in ML-based entity resolution that can handle real-world data messiness 
- Build in your data warehouse and data lake so you’re not creating new data silos 
- Design for continuous operation, not one-time batch processing 
- Create human-in-the-loop workflows for edge cases and model refinement 
- Treat it as infrastructure that requires ongoing ownership and monitoring 
The teams that master this foundation aren’t behind. They’re building something that will outlast the next wave of features, the next vendor pivot, the next technological hype cycle.
That’s not just the basics. That’s the real work. And it’s work worth doing right.
What’s your experience building unified customer identity? Where did you hit the biggest obstacles? I’d love to hear your stories—the successes and the struggles.

