The Identity Layer

Before You Build a Knowledge Graph, You Need to Solve Identity

Sonal Goyal — Fri, 12 Jun 2026 10:16:08 GMT

A financial services team I know spent nine months building a knowledge graph. The use case was compelling: link entities across counterparty risk data, beneficial ownership filings, and internal CRM records to power a new compliance analytics layer. They ingested data from six sources. Modeled the ontology carefully. Built the pipelines.

Then they ran a query asking how many distinct legal entities they had relationships with.

The graph said 240,000. Their compliance team knew from experience the real number was closer to 140,000. The graph had created a node for every record, not every entity. “Goldman Sachs,” “Goldman, Sachs & Co.,” “Goldman Sachs International,” and “GS” were four separate nodes — unlinked, unresolved, floating as if they had nothing to do with each other.

Nine months. Significant infrastructure investment. And the output was fundamentally untrustworthy.

This is the quiet failure mode of most enterprise knowledge graph projects. The failure lies not in the technology, but in what teams assumed away before they started.

What a Knowledge Graph Is Actually Promising You

Traditional SQL gives you rows. A knowledge graph gives you the network — customers connected to orders, legal entities connected to subsidiaries and counterparties, relationships that flat tables can’t express.

The power is real. But it comes with a catch that most teams discover too late.

Juan Sequeda, Principal Researcher at ServiceNow and co-author with Ora Lassila of Designing and Building Enterprise Knowledge Graphs, frames the core problem precisely: the goal is to reason across connected entities — but that reasoning only holds if you can trust that each node actually represents what it claims to represent.

Relationships between nodes are only meaningful if you know which node is actually which entity. If “Goldman Sachs” and “GS Capital” are two separate nodes when they should be one, every relationship attached to either is incomplete. Your compliance query returns partial results. Your fraud model misses a connection. Your customer 360 view has invisible holes.

As Ora Lassila — co-author of the original RDF specification and the Semantic Web paper with Tim Berners-Lee — has observed: enterprise data is messy at best. Messiness that gets structured into a graph doesn’t go away. It becomes structurally embedded.

The Specific Problem: Synonym Nodes

In graph terminology, the failure mode has a name: synonym nodes. Two nodes that represent the same real-world entity, appearing as separate entries because source data spelled the name differently, abbreviated differently, or came from a system that had no way to know the other record existed.

The problem is structural, not accidental. Most enterprise knowledge graph construction involves pulling data from multiple sources — CRM, ERP, compliance filings, third-party data providers. These systems were never built to agree on identity. Each assigned its own identifiers, its own formats, its own rules for what counts as a “company” or a “person.” When you combine those sources into a graph, you inherit every one of those disagreements simultaneously.

Variations don’t have to be dramatic to cause real damage. “Robert Smith,” “Bob R. Smith,” and “Robert Smith Jr.” might all be the same person — or they might not. The graph cannot tell. And when you’re integrating a new data source, you face the same question for every incoming record: does this refer to a node you already have, or is it a genuinely new entity? Without entity resolution, you can’t answer that. You just create new nodes.
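
A minimal sketch of how this happens, assuming a hypothetical ingestion step that keys nodes on the raw name string. The records, source names, and the ingest_naively function are illustrative and not taken from any real pipeline.

```python
# Hypothetical records from three source systems; names are illustrative only.
records = [
    {"source": "crm", "name": "Goldman Sachs"},
    {"source": "filings", "name": "Goldman, Sachs & Co."},
    {"source": "risk", "name": "GS"},
    {"source": "crm", "name": "Goldman Sachs"},
]

def ingest_naively(records):
    """Create one node per distinct raw name string, with no identity check."""
    nodes = {}
    for record in records:
        # The raw string is the only key, so every spelling becomes its own node.
        nodes.setdefault(record["name"], []).append(record["source"])
    return nodes

nodes = ingest_naively(records)
print(len(nodes))  # one node per spelling, whether or not the spellings share an entity
```

Only the byte-identical CRM rows collapse into one node. Nothing in this code can say whether the other spellings belong together, which is exactly the question entity resolution exists to answer.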

A March 2026 piece on Medium about knowledge graph normalization described the failure mode directly: if variations are added to a graph without careful handling, the result is a cluttered structure with duplicate nodes, broken identities, and relationships that appear more complex than the underlying reality. This isn’t a fringe problem; it’s the default outcome when identity is treated as something to clean up later.

The scale this reaches in production is significant. Research documented by Graph Praxis found that Children’s Medical Center Dallas discovered 22% of their patient records were duplicates before implementing proper entity resolution — a figure that dropped to 0.14% afterward. In healthcare, that’s the difference between a care team seeing a complete patient history and seeing a fragment. In financial services, it’s the difference between detecting a counterparty exposure and missing it entirely.

Why This Gets Worse Under AI

This problem is old. What’s changed is the cost of ignoring it.

In a traditional BI workflow, an analyst queries the graph, notices something looks off, and flags it. Human attention catches the error before it becomes a decision. The feedback loop is slow but it exists.

In an agentic AI workflow, that loop is broken.

When an AI agent queries your knowledge graph — to answer a question, execute a task, or route a process — it trusts the graph. It takes the retrieved context as ground truth. It doesn’t pause to notice that there are six nodes for one customer, or that two supplier entries that look different are actually the same company. It acts on what it finds.

Research published in 2025 on denoising knowledge graphs for retrieval-augmented generation put numbers to this. The paper — from researchers at Nanyang Technological University, Nanjing University, and Mila — studied what happens when LLMs construct knowledge graphs automatically, as most GraphRAG implementations do. The finding is uncomfortable: LLMs consistently produce duplicate entities because they struggle to maintain consistency across long contexts. The entity “LLMs” might appear alongside “LLM,” “Large Language Models,” and “llms” — all as separate nodes, none linked. The paper found that simply applying entity resolution to remove these redundant nodes — reducing graph size by around 40% — consistently improved question-answering performance across all four GraphRAG variants tested. Less graph, better answers, because the graph that remained was actually trustworthy.
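
To make the duplication concrete, here is a minimal sketch of collapsing surface variants of one entity label. The canonical_key heuristic (initials for multi-word labels, a stripped trailing "s" for single words) is an assumption made purely for illustration. It is not the method used in the paper, and it would over-merge on real data.

```python
from collections import defaultdict

def canonical_key(label):
    """Crude illustrative key: initials for phrases, singular form for single words."""
    words = label.strip().lower().split()
    if len(words) > 1:
        return "".join(word[0] for word in words)   # "large language models" -> "llm"
    key = words[0]
    if len(key) > 1 and key.endswith("s"):
        key = key[:-1]                               # "llms" -> "llm"
    return key

# Entity labels as an LLM-built graph might emit them (the first four come from the text above).
extracted = ["LLMs", "LLM", "Large Language Models", "llms", "GraphRAG"]

merged = defaultdict(list)
for label in extracted:
    merged[canonical_key(label)].append(label)

print(len(extracted), "labels ->", len(merged), "nodes")
```

The point is the shape of the operation rather than the heuristic: several labels map to one key, and the graph keeps one node per key instead of one per label.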

The implication is significant. The failure mode isn’t just in enterprise data pipelines assembled over years from messy source systems. It shows up in graphs that were constructed minutes ago by an LLM. The redundancy is structural to how language models generate graphs, not a legacy data problem you can engineer around. Entity resolution isn’t a preprocessing step for dirty old data — it’s a required component of any production GraphRAG pipeline.

The GraphRAG pattern — using knowledge graphs to ground LLM-based reasoning and reduce hallucination — is one of the most promising architectural approaches in enterprise AI right now. But the value of that pattern depends entirely on reasoning across correctly identified entities. If the nodes aren’t resolved, GraphRAG doesn’t reduce hallucination. It structures it.

What Entity Resolution Is Actually Doing

Entity resolution is the process of determining which records across a dataset — or across datasets — refer to the same real-world entity.

It sounds like deduplication, but it’s more than that. Deduplication removes exact copies. Entity resolution handles the fuzzy cases: variations in spelling, abbreviations, transpositions, name changes over time, intentional obfuscation. It combines probabilistic matching with learned models and, critically, continuous feedback that improves accuracy as new data arrives and edge cases are resolved.
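
The gap between the two can be sketched with Python's standard difflib module. This is a deliberately simple stand-in: the 0.8 threshold is an arbitrary illustrative value, and character similarity alone is far weaker than the probabilistic and learned matching described above.

```python
from difflib import SequenceMatcher

def exact_dedup(names):
    """Deduplication in the narrow sense: drop identical strings, keep first-seen order."""
    return list(dict.fromkeys(names))

def similarity(a, b):
    """Character-level similarity in [0, 1] after lowercasing."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

names = ["Robert Smith", "Robert Smith", "Robert Smith Jr."]
unique = exact_dedup(names)          # the "Jr." variant survives exact deduplication

THRESHOLD = 0.8                      # illustrative cut-off, not a recommended value
candidates = [
    (a, b)
    for i, a in enumerate(unique)
    for b in unique[i + 1:]
    if similarity(a, b) >= THRESHOLD
]
print(candidates)                    # pairs worth a closer look, not confirmed matches
```

A high score only nominates a pair for review. Whether the two records are really the same person still needs more evidence, such as addresses, dates, or shared relationships.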

For knowledge graph construction, entity resolution serves two distinct roles.

The first is deduplication within the graph: ensuring the node for a given entity appears once, with all relevant attributes consolidated, rather than scattered across variants. The second is linking across sources during ingestion: when a new data source arrives, entity resolution is what allows you to match incoming records to existing nodes rather than creating new ones for every record. It is the mechanism that makes multi-source integration coherent rather than merely cumulative.
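
The second role, matching on ingestion, can be sketched as a match-or-create step. The Graph class, the normalize function, and the short suffix list are all hypothetical and kept deliberately small. A real system would match on much more than a cleaned-up name.

```python
import re

LEGAL_SUFFIXES = {"inc", "co", "corp", "ltd", "llc"}   # illustrative, far from complete

def normalize(name):
    """Lowercase, replace punctuation with spaces, and drop a few legal suffixes."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    return " ".join(t for t in cleaned.split() if t not in LEGAL_SUFFIXES)

class Graph:
    def __init__(self):
        self.nodes = {}   # node_id -> set of raw names seen for that entity
        self.index = {}   # normalized name -> node_id

    def ingest(self, raw_name):
        """Return the existing node for this name if one matches, else create one."""
        key = normalize(raw_name)
        node_id = self.index.get(key)
        if node_id is None:
            node_id = len(self.nodes) + 1
            self.nodes[node_id] = set()
            self.index[key] = node_id
        self.nodes[node_id].add(raw_name)   # consolidate the variant on the node
        return node_id

graph = Graph()
first = graph.ingest("Acme Holdings, Inc.")
second = graph.ingest("ACME Holdings Inc")   # resolves to the same node
third = graph.ingest("Borealis Ltd")         # genuinely new entity
```

The lookup before the insert is what keeps integration coherent rather than cumulative: the second source adds a name variant to an existing node instead of adding a node.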

The hardest part of building a knowledge graph isn’t the graph technology itself — it’s the data integration problem underneath. Getting multiple sources to agree on what an entity is, and which records refer to it, is the work that makes everything else possible. Entity resolution is where that agreement is actually enforced.

Academic research on this has been building for decades. A foundational paper from Jay Pujara at USC established that the collective relationships in a knowledge graph are themselves a key input to improving entity resolution — the graph structure helps resolve ambiguous records by triangulating on shared connections. Identity and graph quality improve each other iteratively, which is why treating entity resolution as a one-time preprocessing step misses the point.
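
A simplified illustration of that idea, assuming neighbor overlap (Jaccard similarity) as the relational signal. This is not the method from the paper, and the records and neighbors below are invented for the example.

```python
def jaccard(a, b):
    """Share of neighbors two nodes have in common, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical records with the entities each one is connected to in the graph.
neighbors = {
    "R. Smith (CRM)": {"Acme Holdings", "Borealis Ltd", "12 Harbour St"},
    "Robert Smith (billing)": {"Acme Holdings", "Borealis Ltd", "12 Harbour St", "Northwind"},
    "Robert Smith (registry)": {"Zephyr Partners"},
}

same_circle = jaccard(neighbors["R. Smith (CRM)"], neighbors["Robert Smith (billing)"])
different_circle = jaccard(neighbors["R. Smith (CRM)"], neighbors["Robert Smith (registry)"])
print(same_circle, different_circle)
```

The registry record has the closer name match to the billing record, yet shares no connections with the CRM record, while the abbreviated CRM name shares most of them. The relationships carry evidence that the name strings alone do not.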

What This Looks Like in Practice

Take a financial crimes use case — one of the most common production applications of knowledge graphs. You’re ingesting beneficial ownership data, sanctions lists, transaction records, and corporate registry filings. Each source has its own identifier for legal entities. The same holding company appears under slightly different names in each.

Without entity resolution, your graph has hundreds of thousands of nodes, many representing the same entity. A query looking for connections between a transaction counterparty and a sanctioned entity returns nothing — not because the connection doesn’t exist, but because the two nodes representing the same company were never linked.
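
The empty result can be reproduced in a few lines. This is a sketch under stated assumptions: the entity names, the transaction ID, and the canonical mapping are hypothetical, and the mapping stands in for the output of an entity resolution step.

```python
from collections import defaultdict, deque

def build_graph(edges, canonical=None):
    """Build an undirected graph, optionally rewriting node names to canonical IDs."""
    canonical = canonical or {}
    graph = defaultdict(set)
    for a, b in edges:
        a, b = canonical.get(a, a), canonical.get(b, b)
        graph[a].add(b)
        graph[b].add(a)
    return graph

def connected(graph, start, goal):
    """Breadth-first search: is there any path from start to goal?"""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in graph[node] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return False

edges = [
    ("txn-1001", "Orion Holdings Ltd"),             # transaction counterparty
    ("ORION HOLDINGS LIMITED", "sanctions-list"),   # same company, other spelling
]
# Assumed output of entity resolution: both spellings map to one canonical entity.
canonical = {"Orion Holdings Ltd": "entity-7", "ORION HOLDINGS LIMITED": "entity-7"}

unresolved_hit = connected(build_graph(edges), "txn-1001", "sanctions-list")
resolved_hit = connected(build_graph(edges, canonical), "txn-1001", "sanctions-list")
print(unresolved_hit, resolved_hit)
```

The traversal code is identical in both runs. The only thing that changes the answer is whether the two spellings were resolved to one node before the edges were loaded.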

The Knowledge Graph Conference 2024 featured sessions on exactly this pattern: compliance and financial crime use cases where the graph produced structurally accurate output, but missed the relationships that mattered most because entity resolution had not been done. The graph knew everything about each record. It knew nothing about the entities those records actually referred to.

The same principle applies in retail (supplier networks and customer identity across channels), in healthcare (patient identity across care settings and systems), and in any enterprise where AI agents are making operational decisions against data assembled from multiple source systems that never agreed on identity in the first place.

The Timing Problem

One more thing that makes this hard in practice: entity resolution isn’t a one-time job.

Data arrives continuously. New sources get added. Entities change — companies get acquired, people change names, addresses update. An entity resolution layer that runs once at graph construction time will degrade as new data accumulates. New records land without being matched to existing nodes. The graph grows stale in ways that are invisible until they cause a problem.

Production-grade entity resolution needs to be incremental: processing new records as they arrive, matching them against the existing entity model, updating the graph continuously rather than in batch. This is a substantially harder engineering problem than a one-time deduplication pass. It requires blocking strategies that make pairwise comparison feasible at scale, scoring models that learn from feedback, and audit trails that let you trace why two records were or weren’t resolved together.
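
A minimal sketch of blocking plus an audit trail, using only the standard library. The blocking key (first three characters), the 0.8 threshold, and the record names are illustrative assumptions. Production systems use richer blocking and learned scoring.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

records = ["Acme Holdings", "ACME Holding", "Acme Hldgs",
           "Borealis Ltd", "Borealis Limited", "Zephyr Partners"]

def blocking_key(name):
    """Cheap key that puts likely matches in the same block (illustrative choice)."""
    return name.lower()[:3]

def candidate_pairs(names):
    """Yield only the pairs that share a block instead of all n*(n-1)/2 pairs."""
    blocks = defaultdict(list)
    for name in names:
        blocks[blocking_key(name)].append(name)
    for block in blocks.values():
        yield from combinations(block, 2)

def score(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

audit_log = []   # one entry per comparison, so every decision can be traced later
for a, b in candidate_pairs(records):
    s = score(a, b)
    audit_log.append({"pair": (a, b), "score": round(s, 2), "matched": s >= 0.8})

all_pairs = len(records) * (len(records) - 1) // 2
print(f"compared {len(audit_log)} of {all_pairs} possible pairs")
```

The same structure supports incremental processing: an incoming record is compared only against the records already in its block, and each decision lands in the log with its score.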

Lassila’s observation about enterprise data messiness points to why this is so persistent: the root cause isn’t bad data entry — it’s that the systems generating the data were never designed to coordinate on identity. Every new source integration re-opens the problem. The only durable answer is infrastructure that handles it continuously, not a cleanup pass that eventually becomes stale.

The teams I work with that have gotten this right share one characteristic: they thought about identity before they thought about the graph. They didn’t build the graph and then ask how to clean it up. They solved the identity layer first, and let that become the foundation for everything they wanted to build.

The Right Order of Operations

If you’re planning a knowledge graph project — for GraphRAG, compliance analytics, customer intelligence, or supply chain visibility — the sequence matters more than the tooling.

Build identity first. Establish how your organization will recognize when two records refer to the same entity, across all the sources you plan to ingest. Build the infrastructure to run that matching continuously as new data arrives. Make sure the output of that process — a canonical, deduplicated, linked set of entities — is what flows into your graph.
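
One way to turn pairwise match decisions into that canonical set is a union-find pass, sketched below. The record IDs and matched pairs are hypothetical inputs, and choosing the smallest ID in each cluster as the canonical one is an arbitrary convention for this example.

```python
def find(parent, x):
    """Follow parent links to the cluster root, shortening the path on the way."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def cluster(record_ids, matched_pairs):
    """Map every record ID to one canonical ID per resolved entity."""
    parent = {r: r for r in record_ids}
    for a, b in matched_pairs:
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            # Keep the smaller ID as the root so the result is deterministic.
            parent[max(root_a, root_b)] = min(root_a, root_b)
    return {r: find(parent, r) for r in record_ids}

record_ids = ["crm:17", "erp:203", "filing:88", "crm:54"]
matched_pairs = [("crm:17", "erp:203"), ("erp:203", "filing:88")]   # assumed matcher output

canonical = cluster(record_ids, matched_pairs)
print(canonical)
```

Note that crm:17 and filing:88 were never compared directly. They end up in the same entity because both matched erp:203, and the graph is then loaded with canonical IDs rather than raw record IDs.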

Then build the graph on top of that foundation.

The entity resolution layer isn’t the glamorous part of this work. Knowledge graph visualization is compelling. The identity pipeline underneath it is infrastructure. But it’s the infrastructure that determines whether the graph reflects the world accurately — or just reflects your data’s history of accumulated confusion.

The goal of a knowledge graph is to reliably create knowledge from data. The word “reliably” is doing a lot of work in that sentence. Reliable knowledge requires resolved entities. There is no shortcut around it.

The teams that get this right will build AI systems that can actually be trusted to act. The teams that skip it will spend a long time wondering why their graph-powered AI keeps producing wrong answers — not because the model is bad, but because the ground truth it was given was never true to begin with.

What’s your experience building knowledge graphs in production? I’d love to hear where identity came up — and whether it was planned for, or discovered the hard way.

The MDM Magic Quadrant Is Back. Here’s What Is Missing.

Sonal Goyal — Fri, 22 May 2026 06:22:30 GMT

After a four-year absence, Gartner has brought back the Magic Quadrant for Master Data Management Solutions. Published on April 6, 2026, this is the first MDM MQ since Gartner retired it in 2022 — back when it judged the market “stable and mature with little meaningful differentiation among vendors.”

That retirement, in hindsight, was premature. And Gartner essentially admits as much by resurrecting it.

The return is significant. The MDM MQ is the document that shapes enterprise buying decisions, procurement shortlists, and board-level conversations about data strategy. For four years, data leaders evaluating MDM platforms were navigating without it — relying instead on a Market Guide, Gartner’s lower-fidelity fallback for markets it considers less dynamic. The MQ’s return signals that Gartner now sees MDM as a live, competitive, differentiated market again.

They’re right. But the question worth asking is: does the 2026 evaluation fully reflect the market as it exists today?

What Brought It Back

The honest answer is AI. The report itself frames the shift clearly — MDM has evolved from a “static, back-office system of record into a dynamic, real-time system of intelligence essential for the AI era.” When AI agents start acting on enterprise data — making decisions, triggering workflows, serving customers — the quality of the underlying master data becomes existential. Hallucinating AI agents and broken customer experiences trace back to the same root cause: untrustworthy master data.

The rise of the Chief Data and AI Officer has also changed the conversation. CDOs are rejecting monolithic, multi-year implementations in favor of agile, composable solutions. They want master data treated as a data product — governed, reusable, consumable by humans and AI agents alike. That’s a fundamentally different mandate than the MDM world of 2021.

So the MQ is back, evaluating 20 vendors across a framework of Leaders, Challengers, Visionaries, and Niche Players. The Leaders quadrant includes Salesforce (Informatica), Reltio (SAP), Stibo Systems, and Semarchy — credible, enterprise-proven platforms that have been at this for years. The evaluation criteria are thorough: matching and survivorship, multi-domain support, data stewardship workflows, AI augmentation, governance, and integration patterns.

It is a rigorous document. It is also, in two important ways, a document of the world as it was — not entirely as it is.

Gap #1: Open Source Is Invisible

Scan all 20 vendors and their honorable mentions. Not a single open source MDM project is evaluated.

This is not a criticism of Gartner’s methodology. The inclusion criteria require a minimum of 25 production customers, multi-region presence, and a solution that can be deployed without code changes. Open source projects, almost by definition, struggle to prove the first of these. They are built to be adopted anonymously: downloaded and run in production without the builder ever knowing about it, unless they break. And good-quality open source products do not break like that.

Unfortunately, the inclusion criteria framing itself reveals something important: the Gartner MDM evaluation is structurally oriented toward commercial, packaged, vendor-managed platforms. The assumption baked into the criteria is that MDM is something you buy, not something you build on top of your existing infrastructure.

For a growing class of data teams — especially those running modern, engineering-led data stacks — that assumption no longer holds. These teams are not looking for another SaaS hub with its own data plane, its own governance layer, and its own license metering model. They are looking for MDM capabilities they can run inside their existing environment, version-control alongside their dbt models, and extend without waiting for a vendor roadmap.

That world exists. It is just not in the quadrant.

Gap #2: Warehouse-Native MDM Is an Afterthought

This is the more structurally significant gap, and it is visible right in the report’s own evaluation criteria.

The mandatory features for inclusion require vendors to create “a central, persisted system of record or index of record for master data.” Optional features — not required, merely nice-to-have — include “medallion architecture support” and “data fabric integration.” Warehouse-native execution, where master data is managed inside the customer’s cloud data platform rather than extracted to a proprietary hub, does not appear as a named evaluation criterion at all.

The logic of warehouse-native MDM is straightforward: your data already lives in Snowflake, Databricks, BigQuery, or Microsoft Fabric. Your transformations run there. Your governance catalog governs it there. Your AI models train on it there. Why should MDM require pulling that data out, copying it into a proprietary hub, and then synchronizing it back — creating a new class of latency, duplication, and reconciliation debt?

The answer, for many modern data teams, is that it shouldn’t. The pattern of warehouse-native MDM — managing master data entities directly within the cloud data platform using the compute, lineage, and governance tooling that’s already there — is the logical extension of the same architectural shift that gave us dbt, Hightouch, and composable data stacks.

The 2026 MQ acknowledges the direction. Stibo Systems earns credit for native integrations with Databricks, Snowflake, and BigQuery. Semarchy’s listing of a Snowflake Native App is called out as a strength. Tamr explicitly earns a caution for deprecating proprietary connectors in favor of an API-first model. The signal is in the details — the market is moving toward the lakehouse, and the vendors that acknowledge this are being rewarded for their vision.

But “integrating with the warehouse” and “running natively inside it” are architecturally very different things. The first still requires an external hub. The second does not.

What This Means for Data Leaders

The 2026 MDM MQ is a valuable document. For organizations evaluating commercial MDM platforms — especially those with complex multi-domain requirements, rich hierarchy management, business-user stewardship workflows, and regulatory audit needs — the Leaders quadrant is a legitimate starting point.

But if you are a data leader running a modern lakehouse or warehouse-first stack, read the MQ as a useful reference for one segment of the market, not a complete map of the terrain.

The terrain also includes a different kind of MDM — one where the matching engine runs inside your Databricks workspace or Snowflake account, where data never leaves your environment, where the model is trained by your team’s domain knowledge through active learning rather than configured through a vendor’s pre-built templates, and where the total cost of ownership is not metered by record count or user seat.

This is exactly the space that Zingg occupies. As an open source, ML-native entity resolution engine built to run on Spark, Snowpark, and every major cloud data platform — Databricks, Snowflake, Microsoft Fabric, GCP, AWS — Zingg represents the architectural thesis that Gartner’s evaluation criteria have not yet fully caught up to: that MDM should be a capability of the warehouse, not a system alongside it.

For the matching, entity resolution, golden ID generation, and survivorship layer, which is the computational core of what MDM actually does, a warehouse-native, open source engine is not just viable. It is the superior architecture.

The Bigger Picture

Gartner retiring the MDM MQ in 2022 reflected a moment of genuine market stasis. Bringing it back in 2026 reflects something more interesting: a market in genuine motion.

The AI era has made master data strategic in a way it has never quite been before. The quality of the data feeding your AI agents is now a competitive variable. MDM is no longer infrastructure maintenance — it is an AI readiness prerequisite.

That urgency is real and the 2026 MQ captures it well. Where the report leaves room for further conversation is in its definition of what MDM can look like in 2026 — whether it must be a packaged hub or whether it can be a native capability of the platforms where data already lives.

That conversation is worth having. The market is already having it, even if the quadrant hasn’t drawn it yet.

Are you running MDM natively inside your warehouse or lakehouse, or are you still working with a separate hub? I’d love to hear how your team has approached this.

The Semantic Layer Does Not Have All The Answers

Sonal Goyal — Fri, 15 May 2026 11:59:25 GMT

A few months ago, I was in a conversation with a data architect at a large enterprise company. They had just finished a significant investment in their semantic layer. Metrics were defined once. Every tool — BI dashboards, SQL notebooks, their new AI agent — queried against the same definitions. “Revenue” finally meant the same thing everywhere.

It was a real win. They were proud of it. And rightly so.

Then their agent started acting on customers.

It was supposed to send renewal nudges to accounts that hadn’t logged in for 60 days. Simple enough. Except the agent was working off a customer ID that wasn’t clean. The same enterprise account appeared in the CRM as three separate records — slightly different company names, slightly different contacts, same domain. The agent sent three separate renewal sequences to the same account. The account manager got an angry call.

The semantic layer was perfect. The entity underneath it was broken.

The two questions data infrastructure needs to answer

There are really two different kinds of questions your data stack needs to answer.

The first is a definition

What does “active user” mean? How do we calculate churn? What counts as revenue? These questions have answers. Someone writes the YAML, someone reviews the PR, the definition gets committed. Deterministic. The semantic layer is built precisely for this — and the ecosystem around it (dbt, MetricFlow, Cube, LookML) has matured enormously.

The second is a definite value

Who is this customer? Is the “Priya Sharma” in your CRM the same as “P. Sharma” in your billing system and “priya.sharma@oldcompany.com” in your support tickets? That question doesn’t need a definition. It needs a definite answer. And no amount of metric definition answers it.
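
A small sketch shows why this is a question of evidence rather than definition. The name_signature helper is hypothetical and deliberately crude: it reduces each value to a first initial and a surname, and assumes the value contains at least one alphabetic token.

```python
import re

def name_signature(value):
    """Reduce a name or email address to (first initial, surname). Illustrative only."""
    local = value.split("@")[0]                        # drop any email domain
    tokens = [t for t in re.split(r"[^a-z]+", local.lower()) if t]
    return (tokens[0][0], tokens[-1])

crm = name_signature("Priya Sharma")
billing = name_signature("P. Sharma")
support = name_signature("priya.sharma@oldcompany.com")
someone_else = name_signature("Pooja Sharma")

print(crm == billing == support)   # the three records are compatible
print(crm == someone_else)         # but so is a different person entirely
```

Compatibility is not identity. The three records could be one person, and "P. Sharma" could just as easily be someone else, which is why this needs matching on more attributes and a resolved answer stored somewhere agents can read it.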

While building semantic layers, it is easy to assume that they will answer everything. For most of the agentic rush so far, that assumption was manageable, because agents were mostly querying, not acting. But now, agents are acting. And the assumption that we have clean entities is cracking.

When agents query vs. when agents act

Think about using AI agents to query high-level information, such as “What was APAC revenue last quarter?” The semantic layer handles this well.

The next generation of AI agents is different. Agents are taking actions on behalf of businesses — sending communications, flagging accounts, triggering workflows, making decisions per customer. Now the customer ID has to mean something specific. It has to map to one real-world person, consistently, across every system the agent touches.

Let us see how this plays out across industries:

Retail — An agent personalizes offers per customer. Duplicate records mean the same customer gets competing offers. Margin erodes. The customer is confused.

Financial services — A risk agent monitors accounts for suspicious patterns. If the same individual is fragmented across three records, the pattern never surfaces. The agent sees three clean records instead of one risky one.

Healthcare — A care coordination agent tracks patient history. “Maria Santos” at one clinic is “M. Santos” at the specialist. The agent misses the allergy note. That’s not a data quality problem anymore — that’s a patient safety problem.

In every case, the agent is doing exactly what it was designed to do. The entities beneath it are what’s broken. And because agents move fast and act repeatedly, they don’t make the mistake once. They make it at scale, before anyone notices.

The piece most context stacks haven’t made agent-ready: MDM

Master Data Management has existed as a discipline for decades. The idea is straightforward: maintain a single trusted record for every customer, product, account, location. Know that these three records are all the same person. Keep that knowledge current as data changes.

In practice, MDM has lived inside expensive enterprise software, long implementation cycles, and governance teams. It was built for periodic batch reconciliation. It was never designed to be queried by an agent at runtime.

That’s the gap I keep running into. It is not that MDM and entity resolution don’t exist as disciplines. They do, and people have been working on them seriously for a long time. The gap is that a resolved entity graph hasn’t been made agent-ready in the way that metric stores have. Single source of truth for humans and agents. API-accessible. Continuously updated. Queryable at the point of action.

The semantic layer is only as good as its entities

Here’s the part that doesn’t get said enough: the semantic layer depends on this problem being solved.

The customer ID is the join key that makes your semantic layer useful. It connects revenue to behavior, behavior to support history, support history to renewal risk. The entire model assumes that when you say “customer,” every system is pointing to the same real-world person.

If the entity graph is wrong, the metrics are wrong — even if every definition is perfect. You can define revenue with mathematical precision and still be measuring it for a customer population that’s partly duplicated, partly fragmented, and partly imaginary.
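
A worked example with made-up numbers, assuming a single metric (average revenue per customer) whose definition never changes. The order amounts, customer IDs, and the resolved mapping are all hypothetical.

```python
orders = [
    ("cust-001", 1200.0),
    ("cust-002", 800.0),
    ("cust-003", 1000.0),
    ("cust-004", 500.0),
]

# Assumed output of entity resolution: three of the four IDs are one real account.
resolved = {
    "cust-001": "entity-A",
    "cust-002": "entity-A",
    "cust-003": "entity-A",
    "cust-004": "entity-B",
}

def revenue_per_customer(orders, id_map=None):
    """Same metric definition both times; only the customer key changes."""
    id_map = id_map or {}
    totals = {}
    for customer_id, amount in orders:
        key = id_map.get(customer_id, customer_id)
        totals[key] = totals.get(key, 0.0) + amount
    return sum(totals.values()) / len(totals)

raw_metric = revenue_per_customer(orders)                  # 3500 / 4 customers
resolved_metric = revenue_per_customer(orders, resolved)   # 3500 / 2 customers
print(raw_metric, resolved_metric)
```

Total revenue is identical in both runs. The per-customer figure doubles once the fragments are merged, without touching a single line of the metric definition.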

The semantic layer is foundational infrastructure. The resolved entity graph is the layer giving it strength. Most stacks have invested heavily in the first. The second is still treated as a cleanup task, a data quality ticket, something to handle later.

As agents move from answering questions to taking actions on real customers, accounts, and products, this “later” treatment will be the difference between agentic success and failure. It is time to invest in the entity layer today!

Are you seeing this in your stack? I’m curious how teams are approaching entity resolution today — whether through MDM tooling, home-grown pipelines, or something else entirely. Would love to hear what’s working. Hit reply or comment below!

The Context You’re Missing: Why Identity Resolution Is Marketing’s Center of Gravity

Sonal Goyal — Sat, 09 May 2026 12:46:58 GMT

I spent the last two days reading through Scott Brinker and Frans Riemersma’s State of Martech 2026 report. All 112 pages. The metamorphosis metaphor on the cover — the chrysalis transforming a caterpillar into something unrecognizable — turns out to be more than clever design. It’s an accurate description of what’s happening in marketing right now.

But there’s something hiding in that report that deserves more attention than it’s getting. Buried in the discussion of AI agents, agentic workflows, and context engineering is a problem that predates all of this: most organizations still don’t have clean customer identities.

And in the AI era, that’s not just an annoyance. It’s the difference between golden context and garbage context.

The Three-Circle Problem

The report introduces a framework of three overlapping circles representing Company Context, Customer Context, and Systems Context.

Source: State of Martech 2026, Scott Brinker & Frans Riemersma

Company context is everything you know about your own business — your goals, capabilities, brand voice, governance rules, what you can actually deliver.

Customer context is everything about the customer’s situation — their needs, history, preferences, where they are in their journey, what problem they’re trying to solve right now.

Systems context is what your technology stack can actually access and connect at decision time.

The report calls the overlap between all three circles “Golden Context” — the moment when your company’s capabilities, the customer’s needs, and your systems’ ability to deliver all align perfectly. These are the moments where revenue happens.

Source: State of Martech 2026, Scott Brinker & Frans Riemersma

You might have deep customer insights somewhere in your data warehouse. You might have a thoughtful segmentation strategy in a deck from last quarter. But if your AI agent can’t access that information when it needs to respond to a customer, you don’t have context. You have documentation.

The Context Stack: Different Layers, Different Speeds

The report also introduces the idea of a context stack with different layers operating at different speeds. Each layer has its own timeline and decay rate.

The Context Stack: different layers operating at different speeds. Source: State of Martech 2026, Scott Brinker & Frans Riemersma

At the bottom, market context changes over years — industry structure, regulations, macro trends. Company context shifts over quarters — strategy, capabilities, governance. These are the slow-moving, relatively stable layers.

At the top, the layers move much faster. Journey context changes over days and weeks. Session context over minutes and hours. Moment context in real-time — what someone is doing right now, the question they just asked, the intent signal from this browsing session.

The report’s key insight: the more granular and timely the context, the faster it loses value. That intent signal from this morning? By this afternoon the customer may have moved on, found an alternative, or simply lost interest.

This creates what the report calls a “context distribution problem.” You might have the data somewhere. But if that real-time signal can’t reach the right system before it decays, it becomes worthless.

How Context Reduces Friction

Scott and Frans’ report shows how context transforms customer interactions. The more context an interaction carries, the less work both sides have to do.

How context reduces friction in customer interactions. Source: State of Martech 2026, Scott Brinker & Frans Riemersma

The curve is convex — the first increments of context eliminate the most friction. Knowing who the customer is and what segment they fall into removes the need to start from scratch. Each additional layer helps: connecting previously siloed information, enriching what you already have, or capturing those fast-moving signals at the top of the stack.

But that curve assumes you’re applying the right context to the right person. When customer identities are fragmented, you’re not reducing friction — you’re creating it at scale.

Where Identity Resolution Fits: The Universal Truth That Doesn’t Decay

The report doesn’t focus on identity resolution explicitly. But once you understand the context framework, you realize identity is fundamentally different from every other layer in the stack. Look at the Customer Context circle within the Golden Context. Every piece of information in that circle — purchase history, preferences, journey stage, engagement patterns, support tickets — needs to be tied to the right person. When a customer exists as three separate records across your systems, you don’t have customer context. You have customer fragments.

Those fragments show up in every layer of the context stack, and they break each one:

In the relationship layer (months to quarters): when account history lives under one identity, recent engagement under another, and support interactions under a third, you can’t build a coherent relationship view.

In the journey layer (days to weeks): when a prospect’s web behavior, email engagement, and sales conversations aren’t connected to the same identity, you can’t tell where they actually are in the buying process.

In the session layer (minutes to hours): when someone browses your site, chats with support, and checks pricing — all in one afternoon — but your systems see three different anonymous visitors, you miss the signal entirely.

In the moment layer (real-time): when an AI agent needs to respond to “what’s the status of my order?” but can’t connect the question to the right customer record, the best it can do is ask clarifying questions that make you look incompetent.

Without that persistent identity thread, you’re constantly starting from cold. Every interaction is generic. Every AI agent makes decisions based on incomplete pictures. Every personalization attempt is a guess.

The report frames context as marketing’s alpha — the competitive advantage that generates excess returns by knowing things your competitors don’t, or connecting what you know faster than they can.

Identity is the foundation of that alpha. It’s the thread that ties all the other context layers together. Market context shifts. Company strategy evolves. Journey signals go stale. Session behavior expires. Moment context has a half-life measured in minutes. But identity persists. The fact that customer A browsing your site right now is the same person who chatted with support yesterday and purchased six months ago — that connection doesn’t decay. Identity is the only context that doesn’t decay.

This is Identity Resolution’s Moment

The report makes the case that AI is dissolving production constraints and revealing context constraints hiding behind them. That pattern applies directly to identity.

For years, the constraint was: can we collect enough data about customers? Marketing teams invested in CDPs, data lakes, customer 360 initiatives. Data volume is no longer the bottleneck for most organizations.

The new constraint is: can we connect that data to the right customer at the right moment?

Traditional marketing automation could limp along with imperfect identities. If you sent an email to the wrong version of a customer record, maybe it didn’t convert, but the damage was contained.

AI agents are different. When they make decisions based on fragmented identities, they create contradictory experiences at scale. An AI-powered shopping assistant recommends products based on incomplete purchase history. A chatbot can’t find the support ticket from last week because it’s filed under a different email. A lead scoring model thinks a loyal customer is a cold prospect because their recent activity is tied to a new cookie.

In conversations with teams, I keep hearing the same pattern: the AI works fine in testing, then underperforms in production because the underlying customer data is fragmented. The model isn’t broken. The identity foundation is.

The RAG Pattern and Identity

One of the most interesting findings in the report is what they call “RAG Everywhere” — Retrieval-Augmented Generation showing up across marketing in different forms.

Customer service chatbots, sales enablement Q&A, knowledge management, chat-with-data analytics. These look like different use cases, but underneath they’re solving the same problem: retrieve the right context, then generate an accurate response.

The report notes high concurrent buy-and-build activity in these areas because no vendor has completely solved the context retrieval problem.

But retrieval quality depends entirely on identity quality. When a customer asks “what’s the status of my order?” your RAG system needs to identify who is asking, retrieve their order history, apply the right permissions, and generate a response. If step one fails — if you can’t reliably identify the customer — the rest of the chain collapses.
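A minimal sketch of that chain, with identity as the gate in front of retrieval. Everything here is hypothetical: the in-memory `IDENTITY_INDEX` and `ORDERS` stand in for a real identity graph and order store, and the generation step is left out because it depends on whichever model you use.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Order:
    order_id: str
    customer_id: str
    status: str


# Hypothetical identity index: every known identifier maps to one resolved customer.
IDENTITY_INDEX = {
    "jane@example.com": "cust-1",
    "jane.doe@work.example": "cust-1",  # second email, same resolved customer
}

# Hypothetical order store.
ORDERS = [
    Order("o-100", "cust-1", "shipped"),
    Order("o-200", "cust-2", "processing"),
]


def resolve_customer(identifier: str) -> Optional[str]:
    """Step one: map whatever the caller presented to a resolved customer id."""
    return IDENTITY_INDEX.get(identifier.strip().lower())


def retrieve_order_context(identifier: str) -> List[Order]:
    """Steps two and three: retrieve only the orders this customer may see."""
    customer_id = resolve_customer(identifier)
    if customer_id is None:
        return []  # identity failed, so there is nothing safe to retrieve
    return [o for o in ORDERS if o.customer_id == customer_id]
```

If the second email were missing from the index, the same person would get an empty context and the agent would fall back to asking clarifying questions.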

Every RAG implementation in marketing is, at its foundation, an identity resolution challenge.

The Governance Gap

One pattern that shows up repeatedly in the report: AI production adoption is racing ahead while governance lags behind. In the Data category, coding and automation use cases run at 75–83% adoption, while Data Compliance & Governance sits at 50%, Data Lineage at 49%, and Customer Privacy & Consent Management at 47%. The report notes that in Data specifically, governance failures don’t stay contained. Weak lineage and privacy controls infect every downstream agent that depends on that data. A governance gap in data is a governance gap in everything built on top of it.

The report frames this as a budget problem — there’s no immediate ROI story for governance, so it keeps losing the budget fight. But there’s a deeper issue: you can’t govern what you can’t identify.

Data lineage, privacy controls, consent management — all of these require knowing which data belongs to which customer. When customer identities are fragmented, your governance layer has holes regardless of how sophisticated your policies are. If you can’t trace which systems contributed to a customer’s unified profile, you can’t honor deletion requests. If you can’t connect consent preferences to all versions of a customer identity, you can’t enforce them.
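To make the deletion case concrete, here is a small sketch of fanning a deletion request out to every source record behind one unified profile. The `RESOLVED` mapping and the system names are hypothetical; in practice this would be the output of your identity resolution process.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical resolved-identity mapping:
# (source system, source record id) -> unified customer id.
RESOLVED = {
    ("crm", "C-17"): "person-1",
    ("billing", "B-902"): "person-1",
    ("support", "S-44"): "person-1",
    ("crm", "C-18"): "person-2",
}


def records_to_delete(unified_id: str) -> Dict[str, List[str]]:
    """List every source record, grouped by system, that belongs to one person."""
    by_system = defaultdict(list)
    for (system, record_id), uid in RESOLVED.items():
        if uid == unified_id:
            by_system[system].append(record_id)
    return dict(by_system)
```

If the billing record had never been linked to `person-1`, the request would be honoured in two systems and silently missed in the third.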

What the Martech Landscape Tells Us

What caught my attention in the category-level data: Audience & Identity Resolution sits in an interesting position. It has solid SaaS adoption (48 respondents using AI-native or existing tools), but 57 non-adopters — making it oddly bimodal.

Organizations either have identity resolution through existing CDP, CRM, or ABM platforms, or they don’t have it at all. A meaningful cohort still treats identity as someone else’s problem — owned by sales ops, IT, or data engineering rather than marketing.

The report suggests this structural hesitation may prove costly as AI gets better at fuzzy entity matching and cross-source reconciliation. I think that’s exactly right. The organizations that treat identity as marketing infrastructure — not just a data engineering problem — will have a significant advantage.

Where This Leaves Us

The State of Martech 2026 report makes a compelling case that we’re in the middle of a metamorphosis. The stack is stratifying. AI is everywhere. Context engineering is becoming a distinct discipline. The chrysalis metaphor works because what’s happening inside isn’t gradual improvement. It’s dissolution and reconstruction. The old forms are breaking down. New ones are assembling.

Organizations are investing heavily in the Company Context circle (brand guidelines, knowledge bases) and the Systems Context circle (integrations, APIs, AI tools). That’s necessary work. But they can’t leave the Customer Context circle behind. Understanding who your customers are, what they need, and where they’ve been, and connecting all that information reliably, has to be done on an equal footing too.

AI agents need context to be useful. Context needs to be connected to the right customer at the right moment. And that connection — that fundamental ability to say “this data belongs to this person” with confidence — is identity resolution.

The organizations that win in this environment won’t be the ones with the best AI models or the most sophisticated agents. They’ll be the ones with clean customer identities.

This article draws from insights in “State of Martech 2026” by Scott Brinker and Frans Riemersma. All diagrams and frameworks referenced are from their original report. The author has added her perspective to the findings.

The identity resolution challenge isn’t new, but the stakes have changed dramatically with AI. If you’re interested in how we’re approaching this problem at Zingg, you can learn more at zingg.ai.

More persistence for Zingg IDs!

Sonal Goyal — Sun, 05 Apr 2026 16:23:22 GMT

In my last few posts, I wrote about the persistent ZINGG_ID and incremental flows. Quick recap for those who missed them. As data keeps changing in a growing business - new customers, updated addresses, records arriving from new channels - the identity graph has to keep up. The incremental flow in Zingg Enterprise handles exactly this. New and updated records get matched against existing clusters, clusters merge and split automatically, and the ZINGG_ID travels with the entity through all of it. The identity graph stays consistent even as the data underneath it keeps moving.

That was a big piece of work to build and ship. But as we rolled it out to more customers, we started hearing about a different scenario. One we had not fully planned for.

Some of our customers had been running Zingg for a while. Their identity graphs were in good shape. Incremental flows were humming. And then, their data got better.

Phone numbers that were unreliable at the start of the project - maybe coming from a poorly integrated source system, or just sparsely captured - were now being collected consistently. Email fields that were mostly empty were now populated. A new column from a recently onboarded data source was throwing up a great matching signal. The data engineering team had done some serious cleanup work on one of the source systems.

In other words, the matching could now be sharper than it was when the graph was first built.

And yet, teams were not acting on it. We started hearing the same thing again and again. Customers who loved their ZINGG_IDs - and were relying on them deeply - were finding that the very thing they valued was also holding them back. One customer had significant architecture changes and needed to run a full refresh, but was worried about losing their ZINGG_IDs. Others wanted to add a new matching signal but held off. Some wanted to move to a different data platform entirely, but the ZINGG_ID dependency made the migration feel too risky.

The Zingg identifier had become load-bearing infrastructure. Which we loved to see - it meant the graph was genuinely embedded in how the business operated. But it also meant that any change to the graph felt dangerous. Customers were not able to move from one model to another, capture more signals, or even change their data platform because of Zingg. That was not the position we wanted our customers to be in.

So what do you do when you want a full refresh - running matching again on the full dataset, this time with the richer fields, the better signal, the cleaner data - but your downstream systems are wired to the existing ZINGG_IDs?

Reporting tables, Customer 360 dashboards, ML feature stores, operational workflows at the call centre - all of them reference the ZINGG_IDs from the previous run. If you blow away the graph and rebuild it, you get better clusters but you lose the identifier continuity. Every downstream system breaks or needs a migration. That is a huge cost, and it tends to make teams postpone the refresh indefinitely. The identity graph gets frozen at the quality level it was when it was first built, because improving it feels too disruptive.

This is the gap that Reassign closes.

After you run a full refresh with your updated fields and retrained model, Reassign maps the new cluster assignments back to the existing ZINGG_IDs. It carries the established identifiers forward wherever the entity continuity holds. Downstream systems keep their reference point. The identity graph gets the benefit of the improved matching signal. The refresh is no longer a disruptive operation.
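To illustrate the general idea of carrying identifiers across a rebuild, here is a toy sketch. It is not Zingg's implementation: the greedy largest-overlap rule, the `carry_forward_ids` function name, and the `NEW-` prefix for fresh identifiers are all assumptions made for the example.

```python
from collections import Counter
from itertools import count


def carry_forward_ids(old_ids, new_clusters, fresh_prefix="NEW-"):
    """old_ids: record -> previous identifier.
    new_clusters: record -> cluster label from the refreshed run.
    Returns record -> identifier, reusing a previous identifier for the
    new cluster that shares the most records with it."""
    overlap = Counter()
    for record, cluster in new_clusters.items():
        if record in old_ids:
            overlap[(cluster, old_ids[record])] += 1

    cluster_to_id, used = {}, set()
    # Greedy: the largest overlaps claim their old identifier first.
    # Ties are broken by sorting on the string form, to stay deterministic.
    ranked = sorted(overlap.items(), key=lambda kv: (-kv[1], str(kv[0])))
    for (cluster, old_id), _shared in ranked:
        if cluster not in cluster_to_id and old_id not in used:
            cluster_to_id[cluster] = old_id
            used.add(old_id)

    # Clusters with no usable predecessor get a fresh identifier.
    fresh = count(1)
    for cluster in sorted(set(new_clusters.values()), key=str):
        if cluster not in cluster_to_id:
            cluster_to_id[cluster] = f"{fresh_prefix}{next(fresh)}"

    return {record: cluster_to_id[c] for record, c in new_clusters.items()}


old = {"r1": "Z1", "r2": "Z1", "r3": "Z2", "r4": "Z3"}
new = {"r1": "a", "r2": "a", "r3": "a", "r4": "b", "r5": "c"}
result = carry_forward_ids(old, new)
```

In this example the refreshed run merges `r3` into the cluster that already held `r1` and `r2`, so all three keep `Z1`, `r4` keeps `Z3`, and the brand-new record `r5` gets a fresh identifier. A real system also has to decide what happens to retired identifiers such as `Z2`, which this sketch ignores.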

Here is the more interesting part: what it really means to take the persistent ZINGG_ID one step further. Persistence, as we built it earlier, means the identifier travels with the entity through routine changes - new records, updated records, incremental loads. Reassign extends that guarantee to cover a deliberate, intentional rebuild of the graph itself. The identifier now survives not just the churn of normal operations, but also the data team’s decision to go back and do it better.

We think of this as data maturity support built into the identity graph. Because in practice, the signals get better over time. Source systems get cleaned up. New integrations bring in new attributes. The identity graph should be able to evolve with that maturity - not fight against it.

If you have a full refresh sitting in your backlog because the downstream impact felt too risky, we would love to hear from you. Hit reply and tell us what triggered it - what improved signal are you sitting on that you have not yet been able to incorporate?

From “What Was Revenue Last Quarter?” to “Generate 30% More Revenue This Quarter”

Sonal Goyal — Thu, 12 Mar 2026 06:37:44 GMT

A16z just published a piece that every data practitioner should read. “Your Data Agents Need Context” makes the case that AI agents have been failing not because the models are bad, but because they’ve been operating without the business context they need to reason well: no understanding of how revenue is defined, which tables are the source of truth, or what a fiscal quarter actually means for this organization. The fix, they argue, is a modern context layer: richer than a semantic layer, encompassing canonical entities, tribal knowledge, governance logic, and identity resolution.

They’re right. And I want to extend that argument into territory I think deserves more attention — because the implications of taking it seriously go well beyond the examples that are easiest to reach for.

This Is Data Teams’ Moment — The Question Is How Big We Think

I wrote recently about how agentic AI is finally giving data teams their long-overdue moment. For years, data teams spent budget cycles defending their existence. That is changing fast. AI agents are only as good as the data they run on, and the foundational work data teams have been quietly doing for years — quality, governance, lineage, entity resolution — is now the prerequisite for every enterprise AI initiative that matters.

The budget conversation is flipping — from “justify why we need you” to “how fast can you get us AI-ready?” That’s a genuine shift. But it comes with a risk: if the measure of success for data agents settles at “answering questions BI could already answer,” we’ll have spent enormous energy rebuilding, more expensively and with more failure modes, something that already worked.

Revenue last quarter? Your BI team has a Looker dashboard for that. It refreshes nightly, the fiscal calendar is baked in, and the CFO trusts it. The a16z piece walks through exactly why an agent struggles to replicate even this — wrong revenue definition, stale YAML, ambiguous tables. That’s a real problem worth solving. But it’s also worth asking: once we’ve solved it, what have we actually unlocked?

The answer, I’d argue, is: the foundation for something far more valuable.

The Questions That Context-Plus-Identity Actually Makes Possible

The a16z piece mentions that a modern context layer should go beyond metric definitions to include canonical entities and identity resolution. I want to take that seriously and spell out what it actually unlocks. Walk into any commercial or strategy meeting and listen for the questions that make the room go quiet, not because the answer is sensitive, but because nobody actually knows:

How many real customers do we have? Not accounts or rows in a CRM, but distinct humans or organisations with an active relationship.

Who are our most valuable customers, across product lines, referrals, and renewal behaviour?

What is our customers’ lifetime value when “Acme Corp” in the CRM and “ACME Inc.” in billing are finally recognised as the same entity?

Who can we upsell, and when, based on what similar cohorts did next?

Which customers are three months from churning, based on signals scattered across product, support, and finance systems?

These questions couldn’t be asked before, not because agents were too dumb, but because the identity layer that makes them answerable was never in place.

What Agents Actually Do with a Resolved Identity Graph

This is where the discussion stops being theoretical. Once entity resolution is in place — once the agent knows who it’s reasoning about across every system — a new class of autonomous workflows becomes possible. Here are four that illustrate the shift.

The Churn Prevention Agent. The agent monitors product usage daily — login frequency, feature adoption, support ticket volume — and when a customer’s pattern matches historical churn signatures, it autonomously cross-references their contract renewal date, checks their NPS history across systems, and drafts a personalised outreach for the customer success manager with a suggested intervention. The entire workflow collapses if the agent is seeing three different records for the same customer across the product database, CRM, and support tool. Entity resolution is not a prerequisite step — it is the agent’s eyes.

The Upsell Timing Agent. A customer’s usage of Product A crosses a threshold. The agent compares their trajectory against cohorts of similar customers who went on to adopt Product B within 90 days, finds the pattern match, creates a qualified opportunity in the CRM, and alerts the account manager — without anyone pulling a report. The cohort comparison is only valid if the purchase history and usage data it’s drawing on are resolved to the same underlying customer. A fragmented identity graph means the agent is comparing apples to a mix of apples and ghosts.

The Cross-Sell Across Household or Org Agent. In B2C — a bank, an insurer, a retailer — one household member has a mortgage, another has a current account, a third just opened a savings product. An agent that understands household relationships, not just individual records, can surface the cross-sell opportunity and route it to the right team before a competitor does. In B2B, the same logic applies at the organisational level: one division is already a customer, another is in a competitor’s trial. The agent identifies the relationship and acts. Both scenarios are structurally impossible without entity resolution at the household or org level — and both are routine once it’s in place.

The Supplier Risk Agent. The agent continuously watches news feeds, financial filings, and logistics data for signals about key suppliers — credit downgrades, port disruptions, leadership changes. When a risk threshold is crossed, it maps which SKUs are affected, what inventory buffer exists, and which alternative suppliers are pre-qualified, then surfaces a recommended action to the procurement team. The catch: this requires knowing that “Acme Logistics Ltd” in the ERP, “ACME Logistics” in the contracts system, and “Acme Log.” in the invoicing platform are the same supplier. Without that, the agent is monitoring fragments, not entities — and the risk signal it’s supposed to catch slips through the gaps.

What “Canonical Entities” Actually Requires

It’s worth dwelling on the practical aspects of having canonical entities and identity resolution as components of a modern context layer — because this is where most implementations will either succeed or quietly fail.

I’ve written before about why building a unified customer identity is genuinely hard, and why so many teams feel like imposters for still wrestling with it. The industry narrative treats unified identity as a prerequisite — something that should already be done before you get to the interesting work. In reality, the vast majority of organisations are still fighting this battle. They’re just not talking about it publicly because it feels like admitting failure.

Entity resolution — sometimes called record linkage, identity resolution, or master data management — is the discipline of determining that two records refer to the same real-world person, company, or object, even when names, formats, and identifiers don’t match. It requires probabilistic matching across messy strings, address normalisation, fuzzy deduplication, graph-based clustering, and ongoing maintenance as records evolve. It’s one of the oldest unsolved problems in enterprise data, not because people haven’t tried, but because the messiness is structural. Organisations grow through acquisition. Customers interact across channels. Nobody standardised data entry in 1997.
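To show the basic shape of the problem (score candidate pairs, then cluster the pairs that score high enough), here is a deliberately naive sketch using only the Python standard library. The suffix list and the 0.85 threshold are arbitrary assumptions, and plain string similarity stands in for the trained probabilistic matching a real system needs.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical list of legal suffixes to ignore when comparing names.
LEGAL_SUFFIXES = {"inc", "corp", "ltd", "co", "llc"}


def normalise(name):
    """Lowercase, strip punctuation, and drop common legal suffixes."""
    tokens = name.lower().replace(".", " ").replace(",", " ").split()
    return " ".join(t for t in tokens if t not in LEGAL_SUFFIXES)


def similarity(a, b):
    """Character-level similarity in [0, 1] between two normalised names."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def cluster(names, threshold=0.85):
    """Link every pair scoring at or above the threshold, then return the
    connected groups (union-find over record positions)."""
    parent = list(range(len(names)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(names)), 2):
        if similarity(names[i], names[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)
    return list(groups.values())


clusters = cluster(
    ["Acme Corp", "ACME Inc.", "Acme Logistics Ltd", "ACME Logistics", "Globex"]
)
```

This handles the easy variants, but an abbreviation such as “Acme Log.” falls below the threshold against “ACME Logistics”, and comparing every pair does not scale past small datasets. Those gaps are where the messiness described above starts to bite.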

For the context layer to include this, entity resolution:

Cannot be hard-coded into YAML or a semantic layer — it requires ML-powered probabilistic matching trained on actual data

Cannot be a one-time batch job — it needs to run continuously as new records arrive and existing ones change

Cannot live outside the warehouse — moving billions of records to a separate resolution system reintroduces the very data silos and lack of context that AI agents are struggling with

Cannot be fully automated — it needs human-in-the-loop workflows to handle edge cases and feed corrections back into the model
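The human-in-the-loop requirement can be pictured as a three-way routing of scored candidate pairs. This is a minimal sketch: the function names and both thresholds are assumptions chosen for illustration, not values from any particular tool.

```python
def route_pair(score, auto_match=0.9, needs_review=0.6):
    """Route one scored candidate pair. Thresholds are illustrative assumptions."""
    if score >= auto_match:
        return "match"
    if score >= needs_review:
        return "review"  # ambiguous: send to a human reviewer
    return "no_match"


def split_for_review(scored_pairs):
    """scored_pairs: iterable of (record_a, record_b, score) tuples."""
    queues = {"match": [], "review": [], "no_match": []}
    for a, b, score in scored_pairs:
        queues[route_pair(score)].append((a, b))
    return queues


queues = split_for_review(
    [("A", "B", 0.95), ("C", "D", 0.70), ("E", "F", 0.20)]
)
```

The decisions reviewers make on the `review` queue are what would be fed back as labelled examples when the matching model is retrained.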

This is not background infrastructure that someone configures once and forgets. It’s an ongoing data capability, closer in character to a production ML system than to a semantic layer definition. Teams that treat it like a production ML system will find their context layers actually working. Teams that treat it like a static definition will find their agents confidently wrong.

What This Looks Like When Done Right

Two examples illustrate what the context layer vision looks like when the entity resolution component is actually implemented well.

Fortnum & Mason, the 300-year-old luxury British retailer, had customer data fragmented across restaurant bookings, email signups, online transactions, and in-store purchases. They tried a third-party resolution service first — it created non-persistent identifiers, offered limited visibility, and raised privacy concerns about sending their entire customer dataset externally. By implementing Zingg natively in their Databricks environment, they built persistent, unified customer identifiers across all touchpoints, with full control over the matching process and privacy-compliant resolution that kept sensitive data in their own infrastructure. For the first time in their history, they could understand how customers were shopping across every channel and devise clienteling and personalisation around that. That’s not a BI question answered better. That’s a question that was previously unanswerable and an action that could not be done earlier.

Orthodox Union, operating over 40 websites and 5 mobile applications across disparate CRMs, used Zingg on Snowflake to power their golden records — resolving not just individual identities but household relationships across their entire digital estate. Their agents now reason about constituents as whole people with family connections, not as disconnected records across systems.

These outcomes are exactly what the context layer is promising. They require taking the entity resolution component as seriously as the metric definition component — which means treating it as an engineering discipline, not a checkbox.

The Architecture That Makes This Real

The a16z piece lays out a thoughtful five-step architecture: access the right data, build context automatically, refine with human input, connect to agents, keep it self-updating. That framework is right. The entity resolution layer sits at the foundation of step two — and its quality determines the ceiling for everything above it.

Concretely: before an agent reasons about revenue, lifetime value, churn risk, or upsell opportunity, it needs a stable, persistent entity graph that tells it who “this customer” actually is across every system. That graph needs to be built where the data lives, continuously maintained, and grounded in human judgement for edge cases.

Only on top of this identity foundation does the rest of the context layer — the metric definitions, the table routing, the tribal knowledge — become genuinely trustworthy. Knowing what “revenue” means is important. Knowing that the revenue from “Acme Corp” and “ACME Inc.” should be attributed to the same customer is what makes the answer actually right.
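A tiny worked example of that difference, with made-up figures. The `ENTITY_OF` mapping is a hypothetical output of entity resolution, and the revenue amounts are invented purely for illustration.

```python
from collections import defaultdict

# Hypothetical output of entity resolution: source name -> canonical entity id.
ENTITY_OF = {"Acme Corp": "E1", "ACME Inc.": "E1", "Globex": "E2"}

# Invented revenue rows as they might appear across CRM and billing.
rows = [("Acme Corp", 120_000), ("ACME Inc.", 80_000), ("Globex", 50_000)]


def revenue_by(rows, key):
    """Sum revenue per group, where `key` maps a source name to its group."""
    totals = defaultdict(int)
    for name, amount in rows:
        totals[key(name)] += amount
    return dict(totals)


by_raw_name = revenue_by(rows, key=lambda name: name)
by_entity = revenue_by(rows, key=lambda name: ENTITY_OF[name])
```

Grouped by raw name, the largest “customer” is worth 120,000 and there appear to be three customers. Grouped by resolved entity, there are two, and the largest is worth 200,000. The metric definition is the same in both cases; only the identity layer changed.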

The Real Payoff

As I argued in The Data Team’s Moment, AI doesn’t paper over data problems — it runs them at scale, in production, with consequences. A customer success agent operating on a fragmented identity layer doesn’t just give one bad recommendation; it systematically misjudges an entire customer segment, at machine speed, before anyone notices. The autonomous nature of agents is what makes the identity foundation so critical — errors compound rather than get caught.

But the inverse is equally true, and more exciting: an agent operating on a trustworthy identity graph can answer questions that were structurally impossible before. Not because the model got smarter, but because the data it’s reasoning on finally reflects reality.

AI agents that can tell you which accounts are actually the same company, which of those companies are in an expansion window, which customers share a behavioural signature with cohorts that churned six months ago — that’s the payoff the context layer is reaching for. The revenue-growth question is the proof of concept. The customer intelligence actions are the business transformation.

The context layer is necessary. Entity resolution is what makes it sufficient. And the questions worth asking, and the agents worth running — about customers, not metrics — are what make the whole thing worth building.

Building on the a16z post “Your Data Agents Need Context” by Jason Cui and Jennifer Li. Further reading: The Data Team’s Moment and The Identity Crisis. If you’re building the entity resolution layer, Zingg is open source and fully documented.

The Data Team’s Moment: How Agentic AI Is Turning the “Support Function” Into the Strategic Center of the Enterprise

Sonal Goyal — Sat, 07 Mar 2026 18:21:26 GMT

There’s a meeting that happens in every organization, usually sometime in Q3 or Q4. A CFO pulls up the budget spreadsheet, squints at the line items, and asks the question that makes every data leader’s stomach drop: “What exactly are we getting out of the data team?”

It’s an uncomfortable question — and for a long time, it didn’t have a clean answer.

Data teams built pipelines nobody asked about, maintained dashboards that were exported to Excel before anyone actually read them, and spent political capital defending headcount to executives who viewed them as a sophisticated cost center. As Joe Reis, author of Fundamentals of Data Engineering, put it plainly: “Data teams will continue struggling to justify their value as long as data in Enterpriseland is a back-office function.”

That was the reality for most of the 2010s and into the early 2020s. Data teams worked in the shadows — vital infrastructure, but rarely celebrated. They were the plumbers of the modern enterprise: essential when things worked, blamed when things didn’t, and invisible the rest of the time.

The era of agentic AI is changing all of that — right now, in real time.

The Old Indignity: Fighting for a Seat at the Table

The data ROI conversation has long been a recurring humiliation dressed up as a business exercise.

Unlike sales, whose numbers are on the board every Monday morning, or marketing, which can point to impressions and conversions, data teams have operated in the murky middle. Their impact is indirect — they help other teams make decisions. As data writer Anna Geller captured it well: “Most data teams work as a support function... their involvement in value creation is indirect. You can’t directly quantify the impact of a new table, dashboard, or pipeline.”

This indirectness has been weaponized against them at budget time. Data leaders find themselves in a paradoxical position: the team responsible for quantification can’t quantify itself. Some resort to building internal P&L reports to justify their own existence. Others try charging other departments “internal fees” for data services — a workaround that says more about the dysfunction of how data is valued than it does about the team’s actual contribution.

Meanwhile, data stacks have grown enormous. As Monte Carlo’s blog noted, “data teams made history as one of the first departments to spin up 8-figure technology stacks with little to no questions asked”, a fact that makes them conspicuous targets whenever CFOs start scrutinizing every line item with new intensity.

The modern data stack is a marvel of engineering. But without a clear, legible link to revenue, it remains a liability in any budget conversation. At least, it has been — until now.

The Shift Underway: AI Agents Are Looking for a Nervous System

Here’s the thing about agentic AI: it is only as good as the data it runs on.

This seems obvious, but its implications are still rippling through the enterprise in real time. When a company deploys an AI agent to autonomously manage customer service tickets, optimize supply chains, approve procurement requests, or run financial reconciliations, every decision that agent makes is being built on a data foundation. If that foundation is shaky — inconsistent definitions, poor lineage, unresolved duplication — the agent doesn’t just underperform. It fails loudly, in production, with consequences.

Robin Sutara, Field CDO at Databricks, is making this organizational implication explicit in Databricks’ 2025 strategic priorities report: “A successful AI strategy starts with a solid infrastructure. Addressing fundamental components like data unification and governance through one underlying system lets organizations focus their attention on getting use cases into the real-world, where they can actually drive value for the business.”

The work that data teams have been doing for years — data quality, entity resolution, governance, lineage, pipeline reliability — is no longer background noise. It is becoming the prerequisite for the most important initiative in every company.

The budget conversation is flipping. It’s no longer “justify why we need you.” It’s “how fast can you get us AI-ready?”

The Numbers Bearing This Out

The scale of the agentic AI wave is making clear why data teams are starting to matter in a way they simply haven’t before.

Gartner predicts that 40% of enterprise applications will be integrated with task-specific AI agents by the end of 2026 — up from less than 5% today. By 2028, at least 15% of work decisions across organizations will be made autonomously by AI agents, up from virtually zero in 2024. And Gartner’s best-case projection suggests agentic AI could drive approximately 30% of enterprise application software revenue by 2035, surpassing $450 billion.

The old framing — tools automate tasks, people make decisions — is no longer holding. And the people who build, govern, and maintain the data powering this new paradigm are moving from the back office to the top of the boardroom agenda.

The Verizon Test Case

Few leaders are articulating this shift as clearly as Kalyani Sekar, Verizon’s Chief Data Officer. In a 2025 interview with CDO Magazine, she describes the arc of her team’s work — and its growing centrality — with unusual candor:

“Data quality is foundational to Verizon’s long-term goals. As we continue to roll out more analytical use cases, from predictive and prescriptive to generative and agentic, it’s critical that these AI and analytical capabilities, including the reports built on them, are grounded in data that is trustworthy and of the highest quality.”

Verizon handles petabytes of data daily — from network devices, customer touchpoints, service channels. The data team’s job was always to make sense of it and surface insights. Today, that same infrastructure is becoming the operating layer for autonomous AI systems making real-time decisions affecting 140 million wireless retail connections.

The data team isn’t changing. The stakes around its work are.

From Cost Center to Control Plane

McKinsey’s 2025 research on the “agentic organization” is describing what’s unfolding as the largest organizational paradigm shift since the industrial revolution. The key insight for data leaders: in this new world, data quality cannot remain a background maintenance task. It needs to be the foundation every AI agent is built on — before a single workflow goes live.

Nowhere is this more visceral than in entity resolution — the problem of making sure your AI actually knows who it’s dealing with.

I have been watching this play out in real time across industries: “Agentic AI is the race car. Entity resolution is the track. You wouldn’t run a Formula 1 machine on gravel and expect to win.”

The examples are stark. In retail, duplicate customer records lead autonomous agents to dispatch multiple shipments, issue multiple offers, and trigger multiple refunds — quietly eroding margins at scale. In finance, fraudsters opening several accounts with small name variations go undetected because the fraud-detection AI treats each record as a separate person. In healthcare, a patient known as “Mary Chen” at one hospital and “M. Y. Chen” at a clinic becomes two different people in the AI’s world — and an autonomous care coordinator misses her allergy record before booking a procedure.

In each case, the AI wasn’t making bad decisions. It was making confident decisions on bad data. And because agentic systems are autonomous, they don’t make that mistake once — they repeat it hundreds or thousands of times before anyone notices.

Tools like Zingg — which resolve entities at scale across CRMs, EHRs, billing systems, and data lakes, running natively on Snowflake and Databricks — are becoming the kind of foundational data infrastructure that determines whether an AI deployment succeeds or silently fails. One bank that placed Zingg’s entity resolution layer beneath its fraud-detection AI saw fraud detection improve 4x — without changing the AI model at all. The model didn’t get smarter. The data it was reasoning on did.

That’s the new pitch data leaders are making — and it’s a powerful one. AI doesn’t paper over your data problems. It runs them at scale, in production, with consequences. Every dollar not being invested in data quality is a growing tax on every AI initiative downstream.

What This Means for Data Leaders Right Now

The opportunity is real and it is opening — but it requires a change in posture. Data teams have spent years learning to survive by making themselves useful to whoever controls the budget. The new play is positioning as the precondition for everything the business cares about.

Some practical realities for data leaders navigating this moment:

Your backlog is becoming a risk register. Every unresolved data quality issue, every undocumented pipeline, every inconsistent metric definition is a live failure point for an AI agent that will execute on bad information at machine speed. Frame your roadmap accordingly.

Entity resolution is becoming a first-class AI problem. Before an AI agent can act intelligently, it needs to know who it is acting on — and right now, most enterprise data lakes are full of the same customer, supplier, or patient under a dozen slightly different names. Deduplicating and resolving those identities isn’t a one-time cleanup job anymore; it’s an ongoing data foundation that every agentic workflow sits on top of. Teams that are investing in entity resolution infrastructure — using tools like Zingg to continuously resolve identities across CRMs, EHRs, and data platforms — are finding that their AI deployments perform dramatically better, without touching the models at all. The intelligence was always there. The data just wasn’t ready for it.

Data governance is becoming AI governance. The governance frameworks data teams have been building for years — classification rules, data retention policies, lineage tracking — are exactly what enterprises now need to keep AI agents grounded in reality.

The CDO is becoming a critical executive. The function that once reported to the CIO is increasingly driving the AI strategy conversation alongside the CEO and CFO.

The dbt Labs 2025 State of Analytics Engineering survey is already showing the shift: 40% of data teams added headcount in the past year — compared to just 14% the year before. Data budgets are growing too, with 30% of respondents reporting increases versus just 9% in the prior year. The most common complaint is no longer budget cuts — it’s that the team can’t keep pace with the business’s appetite.

As dbt Labs CTO Mark Porter put it directly: “As companies increase AI investments, leaders are prioritizing the teams responsible for data quality and governance — the essential foundation for AI effectiveness.”

The Table Is Turning

There’s something quietly satisfying about watching an industry vindicate itself not through persuasion, but through circumstance.

Data teams have spent years arguing that their work mattered. They built ROI frameworks and internal P&Ls and political coalitions. They embedded with business units and learned to speak in revenue terms. Some of it worked. Most of it was an uphill grind.

Now, agentic AI is making the argument for them — not through rhetoric, but through dependency. The most transformative technology initiative of the decade cannot function without clean data, governed pipelines, and trustworthy infrastructure. And the people building and maintaining those things are, right now, becoming central to the enterprise.

The data team is no longer supporting the business.

The business is starting to run on the data team.

If you found this useful, share it with a data leader who’s still spending Q4 justifying their existence to a skeptical CFO. Their moment is here — and it’s only getting bigger.

The Real Threat to SaaS: It’s Not Vibe Coding. It’s Access To Your Data.

Sonal Goyal — Wed, 11 Feb 2026 17:11:48 GMT

Vibe coding has everyone debating whether SaaS is dead.

One camp says: when you can build a CRM in an afternoon, why pay Salesforce $150 a seat? The other fires back: a POC is not a product. Security, compliance, scalability, ongoing maintenance — the real cost of DIY reveals itself fast.

Both sides have a point. But they’re arguing about the wrong thing.

The real threat to SaaS isn’t that customers will build replacements. It’s that customers will demand their data back.

The New Requirement Nobody Saw Coming

Most enterprises are not going to build their own CRM, their own support desk, or their own payroll system. That’s not happening. But they are going to build custom AI agents — agents tailored to their workflows, their industry, their competitive edge. And those agents are useless without access to the data that lives inside enterprise SaaS systems.

Think about what that means in practice. A procurement agent that can’t read your ERP. A sales agent that can’t query your CRM. A support agent that can’t see your ticketing history. You end up with AI that is powerful in theory and crippled in practice — not because the models aren’t good enough, but because the data is locked behind walls that the enterprise itself cannot breach.

This is the structural tension SaaS has quietly avoided for two decades. Data lock-in was never just a feature — it was the moat. Salesforce, Workday, ServiceNow, SAP built empires not just on software, but on the gravitational pull of accumulated enterprise data. The switching cost wasn’t the product. It was your own history, trapped inside someone else’s system.

AI is now making that lock-in impossible to ignore.

Enterprises building custom agents will demand data portability as a first-class requirement: not a nice-to-have, not a paid add-on, but a baseline expectation. The competitive pressure will shift toward SaaS vendors who open their data, and away from those who treat it as a proprietary asset. MCP (the Model Context Protocol) is an early signal of this direction, but it is barely the beginning. Real openness means real-time access, structured APIs, granular permissions, and data that can move across systems without friction.

Why We Built Zingg the Way We Did

I want to share something personal here, because it’s directly relevant.

When we built Zingg — an open source entity resolution and data matching tool — one of our earliest and most deliberate architectural decisions was this: Zingg would run natively inside the customer’s own environment. Their data would never move.

Not to our servers. Not to a cloud we managed. Not through a pipeline that touched our infrastructure. Zingg would go to the data, not the other way around.

At the time, this felt like swimming against the tide. The SaaS playbook was clear: centralise everything, host it yourself, lock in the customer through the platform. Everyone was building managed services where customers uploaded their data and got results back. It was convenient. It was scalable. It was the default.

We chose differently — and we took some friction for it. Prospects would ask why they couldn’t just hand us their data and get clean results. Investors would ask why we weren’t building a managed pipeline. The market was optimised for data movement, and we were building for data residency.

Our reasoning was straightforward. Entity resolution deals with some of the most sensitive data an enterprise has — customer records, patient data, financial identities, supplier profiles. The enterprises that needed this most were also the ones with the strictest data governance requirements: banks, healthcare providers, insurers, government agencies. For them, sending data to a third-party system wasn’t a preference issue. It was a compliance wall.

Beyond compliance, there was a deeper principle at work. The enterprise owns its data. The enterprise should control its data. A tool that requires data to leave the enterprise in order to function is not really serving the enterprise — it’s holding the enterprise’s data hostage to deliver a service.

What We Learned From Building This Way

Running natively inside the customer’s environment changed everything about how we thought about the product.

It forced us to build something that was genuinely portable — that could run on on-premise Spark clusters, on Databricks, on AWS EMR, on GCP, on Snowflake, wherever the customer’s data actually lived. It couldn’t be opinionated about infrastructure, because the whole point was to meet the data where it was.

It meant our integration surface was the data itself — structured, well-defined, queryable — not a proprietary pipeline that the customer had to trust and maintain.

And it gave customers something that is genuinely rare in enterprise software: full control and transparency. They could see exactly what Zingg was doing with their data, because Zingg was running on their machines, under their monitoring, in their environment, at their schedule.

There was also a dimension we hadn’t fully appreciated until customers started telling us: it was significantly faster AND cheaper. When your tool runs inside the enterprise’s existing environment, the enterprise leverages infrastructure it has already paid for — compute clusters, data quality frameworks, observability stacks. There is no new data pipeline to design, build, or maintain. There is no ongoing cost of extracting data, shipping it somewhere, and syncing it back. No ETL tax. No vendor markup on compute. The enterprise uses what it already has, and the tool fits around it. For large organisations running entity resolution at scale, this wasn’t a marginal saving — it was a fundamental difference in total cost of ownership.

Interestingly, this also made Zingg stickier in a different way. Not because we had the data — we never did — but because the customer’s confidence in the product grew over time.

Why This Matters Now More Than Ever

Fast forward to today, and the AI agent conversation is making this architectural principle mainstream.

Every enterprise that wants to build custom agents is asking the same question we were thinking about years ago: how do I give my AI access to my data without giving someone else access to my data?

The answer is the same one we arrived at for Zingg. The data should never need to leave.

This is what makes the current MCP momentum so significant. MCP is a protocol that lets AI agents communicate with data systems without requiring data to be extracted, copied, or centralised. It is, in essence, a formalisation of the principle we built Zingg on: bring the compute to the data, not the data to the compute.

But MCP alone is not enough. A faithful MCP implementation means more than exposing an endpoint — it means real-time access, granular permissions, data that reflects the current state of the system, and APIs that are robust enough to support production AI workloads, not just demos.

The enterprises of the next decade will not tolerate data opacity. They have regulators demanding audit trails. They have security teams demanding residency guarantees. And now they have AI ambitions that demand data access. These forces are all pointing in the same direction.

SaaS vendors who open their data thoughtfully — with proper permissioning, auditability, and reliability — will find that openness builds a different kind of loyalty. One that is harder to compete with, because it’s based on trust rather than captivity.

The vendors who resist will find themselves on the wrong side of procurement conversations as AI becomes a standard enterprise requirement. The question won’t be “does your product have AI features?” It will be “can my AI agents use your data?”

If the answer is no, or not without significant friction, the buyer will start to look elsewhere.

The Output Was Always the Point

There is one more dimension to this that feels especially relevant now.

Entity resolution is not a peripheral problem. It is one of the most foundational challenges in enterprise data — the question of whether the “Acme Corp” in your CRM is the same as the “Acme Corporation” in your ERP, whether the customer in your support desk is the same person as the one in your billing system, whether your supplier records are duplicated across three different procurement tools. Resolved, unified data is the prerequisite for almost everything else an enterprise wants to do with its data. Analytics, reporting, compliance, machine learning — none of it works well on dirty, fragmented, unresolved data.

This is why we were always deliberate about how Zingg surfaces its output. We didn’t just want Zingg to solve the matching problem — we wanted the resolved data to be genuinely portable, consumable by whatever system or workflow the enterprise chose to use next. Clean, resolved entities written back to the data store of the enterprise’s choice, in formats that fit naturally into existing pipelines, without any proprietary wrapper that would create a new dependency.

We did not see AI agents coming when we made that decision. But in hindsight, it was exactly the right call. A resolved, unified, trustworthy view of an enterprise’s core entities — customers, suppliers, products, locations — is precisely the kind of context foundation that AI agents need to function well. An agent working from fragmented, duplicated, unresolved data will produce fragmented, unreliable answers. An agent working from a clean, resolved data layer can actually be trusted.

We built Zingg to ensure its output could be consumed anywhere the enterprise wanted. It turns out that “anywhere the enterprise wants” now includes AI agents, RAG pipelines, and real-time reasoning systems we couldn’t have imagined when we started. The principle held.

A Principle Worth Standing Behind

When we made the architectural choice for Zingg, we weren’t predicting AI agents. We were just trying to build something enterprises could actually trust with their most sensitive data.

But the principle was right then, and it is right now: data should live where the enterprise decides it lives. Tools — whether they are matching algorithms or AI agents — should go to the data, not the other way around.

The market is catching up to this idea. And those who embrace it, rather than fight it, will be the ones who matter in the next era of enterprise software.

Zingg is an open source entity resolution framework that runs natively in your environment — no data movement required. You can find it at github.com/zinggAI/zingg.

The Identity Crisis: Why “Getting the Basics Right” Is the Hardest Work in Data

Sonal Goyal — Thu, 23 Oct 2025 14:56:40 GMT


Matthew Niederberger, Founder of Martech Therapy, recently wrote something that stopped me in my tracks:

“If you’re a practitioner still trying to get a unified customer ID working across channels, you feel left behind technologically and question your competence.”

This single sentence captures a crisis that’s quietly plaguing data engineering teams across the industry. While vendor booths showcase AI-driven personalization and predictive customer journeys, talented practitioners are privately wrestling with something supposedly “simpler”: building a unified customer ID that actually works.

The cruel irony? They’re doing the hardest, most important work. But in a market obsessed with bleeding-edge features, nobody’s celebrating it.

The Vendor Narrative Gap

So why do talented teams feel like they’re failing when they’re still working on identity?

Matthew Niederberger nails the dynamic: vendors showcase what’s possible, not what’s required to get there.

Vendor conference presentations follow a formula:

Demonstrate the amazing outcomes (real-time personalization! predictive journeys!)
Show the sleek UI that makes it look effortless
Mention “unified customer data” as an assumed prerequisite
Move quickly past the foundation to focus on the flashy features

The implicit message: if you’re still working on unified identity, you’re behind. Everyone else has already solved this.

But that’s not remotely true. The vast majority of organizations are still struggling with identity resolution. They’re just not talking about it publicly because it feels like admitting failure.

The result is a massive selection bias. We hear about the 5% who’ve moved on to advanced use cases. We don’t hear about the 95% still building foundations—not because they’re incompetent, but because the foundations are genuinely difficult to build well.

The Imposter Syndrome Epidemic

I’ve had countless conversations with data teams at conferences, in Slack channels, and on video calls. A pattern emerges again and again:

“We’re still working on identity resolution.”
“We haven’t gotten to the AI stuff yet.”
“We’re just trying to get our customer IDs right first.”

The word “just” does a lot of heavy lifting in those sentences. It diminishes foundational engineering work into a checkbox, a prerequisite you should have already mastered before moving on to the “real” work.

But here’s what nobody is saying loudly enough: unified identity isn’t a prerequisite. It’s the final boss of data engineering.

Why Unified Identity Is Actually Hard

Let’s be brutally honest about what “getting a unified customer ID working across channels” actually requires:

The Data Quality Reality

Real-world customer data is messy in ways that mock deterministic rules:

Sarah Johnson gets married and becomes Sarah Martinez
john.smith@gmail.com and jsmith@gmail.com might be the same person (or different people with similar names)
Phone numbers get recycled—last year’s customer is this year’s prospect
Addresses have hundreds of valid formatting variations
People move, change jobs, update preferences
Data entry errors create systematic inconsistencies

A unified customer ID system has to handle all of this, at scale, while maintaining high accuracy. Miss too many matches and your “360-degree customer view” is fragmented. Create false positives and you’re merging different people, violating privacy and destroying trust.

The Technical Challenge

You’re not matching records in a single clean database. You’re resolving identities across:

Web analytics (cookie-based, increasingly unreliable)
Mobile apps (device IDs, app-specific identifiers)
CRM systems (email, phone, account IDs)
Point-of-sale systems (loyalty cards, payment methods)
Call center interactions (phone numbers, case IDs)
Marketing platforms (each with their own identifier schemes)
Third-party data sources (with their own quality issues)

Each source has different identifier types, different data quality standards, and different update frequencies. You’re not just joining tables—you’re probabilistically determining which records across fundamentally different systems represent the same human being.

The Scale Problem

Now multiply this across millions or billions of records. The naive approach, comparing every record to every other record, is computationally infeasible at this scale, because the number of pairs grows quadratically with the number of records. You need sophisticated blocking strategies, efficient algorithms, and infrastructure that can process massive datasets without collapsing under its own weight.
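To make the scale concrete, here is a small back-of-the-envelope sketch. The function name is mine, chosen for illustration; the only fact it relies on is that n records produce n(n-1)/2 unordered pairs.

```python
def pairwise_comparisons(n: int) -> int:
    """Number of unordered record pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


for n in (1_000, 1_000_000, 1_000_000_000):
    print(f"{n:>13,} records -> {pairwise_comparisons(n):,} candidate pairs")

# 1,000 records         -> 499,500 pairs
# 1,000,000 records     -> 499,999,500,000 pairs
# 1,000,000,000 records -> 499,999,999,500,000,000 pairs (about 5 x 10^17)
```

A thousand times more records means roughly a million times more comparisons, which is why pairwise matching stops being an option long before you reach a billion rows.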

And you need to do this continuously, not as a one-time batch job. New data arrives constantly. Identities evolve. Your resolution system has to keep pace.

The Governance Complexity

Layer on consent management, privacy regulations, data retention policies, and the right-to-be-forgotten requirements. Your unified ID system isn’t just a technical challenge—it’s a governance framework that has to work across jurisdictions, respect customer preferences, and provide audit trails.

This isn’t “basic.” This is some of the most complex data engineering work you can do.

The Three-Layer Approach That Actually Works

So how do you actually build identity resolution that works at warehouse scale? Let me break down the architecture that successful implementations share:

Layer 1: Blocking

You can’t compare every record to every other record: that is O(n²) comparisons, which is computationally infeasible at billion-record scale. Smart blocking can reduce potential comparisons by 99% or more while maintaining high recall.

The key is choosing blocking keys that cast a wide enough net to catch true matches while being selective enough to make comparison tractable. This might mean blocking on:

First three characters of last name + ZIP code
Phone area code + first name
Email domain + approximate birth date
Phonetic encoding of names

Good blocking strategies use multiple passes with different keys, ensuring that records with typos or variations still get compared. At Zingg, we learn the blocking directly from the user’s data and labels, across all the fields present in the data, so that different datasets can be blocked efficiently.
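A minimal sketch of multi-pass blocking with hand-written keys, using only the Python standard library. The records, field names, and key functions are hypothetical and exist only for illustration; this is not how Zingg learns its blocking, just the simplest form of the idea described above.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample records (the phone numbers are fictional).
records = [
    {"id": 1, "first": "Sarah", "last": "Johnson",  "zip": "94107", "phone": "415-555-0101"},
    {"id": 2, "first": "Sara",  "last": "Johnson",  "zip": "94107", "phone": "415-555-0101"},
    {"id": 3, "first": "Sarah", "last": "Martinez", "zip": "94110", "phone": "415-555-0101"},
    {"id": 4, "first": "John",  "last": "Smith",    "zip": "10001", "phone": "212-555-0199"},
    {"id": 5, "first": "Jon",   "last": "Smith",    "zip": "10001", "phone": "212-555-0142"},
]


def key_last3_zip(r):
    """Pass 1: first three characters of last name + ZIP code."""
    return ("last3+zip", r["last"][:3].lower(), r["zip"])


def key_area_first(r):
    """Pass 2: phone area code + first name."""
    return ("area+first", r["phone"][:3], r["first"].lower())


def candidate_pairs(records, key_funcs):
    """Union of the record pairs that share a block in at least one pass."""
    pairs = set()
    for key_func in key_funcs:
        blocks = defaultdict(list)
        for r in records:
            blocks[key_func(r)].append(r["id"])
        for ids in blocks.values():
            pairs.update(combinations(sorted(ids), 2))
    return pairs


pairs = candidate_pairs(records, [key_last3_zip, key_area_first])
print(sorted(pairs))  # [(1, 2), (1, 3), (4, 5)]
```

Five records have ten possible pairs; the two passes keep three. The pair (1, 3), Sarah Johnson and Sarah Martinez, survives only because of the second pass, which is the point of using more than one key.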

Layer 2: Matching

This is where ML earns its keep. For each pair of records that passed blocking, you need to score the likelihood they represent the same entity.

Modern matching models use features like:

String similarity (edit distance, Jaro-Winkler, token-based matching)
Phonetic similarity (Soundex, Metaphone, NYSIIS)
Behavioral patterns (purchase timing, channel preferences, engagement signals)
Temporal proximity (how close in time were interactions?)
Geographic consistency (do locations make sense together?)

The output isn’t binary—it’s probabilistic scoring. This pair has a 0.95 probability of being a match. That pair scores 0.45. This gives you flexibility in tuning precision vs. recall based on your use case.
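As a rough sketch of what pairwise scoring looks like, the example below turns a few similarity features into a probability with a logistic function. Everything specific here is an assumption for illustration: the field names, the weights, and the bias are hand-picked rather than learned, and `difflib.SequenceMatcher` from the standard library stands in for measures like Jaro-Winkler. A real matcher would learn its parameters from labeled pairs.

```python
import math
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (difflib ratio)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Hypothetical, hand-picked parameters. In practice these are learned
# from labeled match / non-match pairs.
WEIGHTS = {"name": 4.0, "email": 4.0, "zip": 2.0}
BIAS = -6.0


def match_probability(r1: dict, r2: dict) -> float:
    """Combine per-field similarities into a single match probability."""
    features = {
        "name": similarity(r1["name"], r2["name"]),
        "email": similarity(r1["email"], r2["email"]),
        "zip": 1.0 if r1["zip"] == r2["zip"] else 0.0,
    }
    z = BIAS + sum(WEIGHTS[k] * v for k, v in features.items())
    return 1.0 / (1.0 + math.exp(-z))  # logistic function


a = {"name": "John Smith", "email": "john.smith@example.com", "zip": "10001"}
b = {"name": "Jon Smith", "email": "jsmith@example.com", "zip": "10001"}
c = {"name": "Mary Chen", "email": "mchen@example.org", "zip": "94107"}

print(f"a vs b: {match_probability(a, b):.2f}")  # likely the same person
print(f"a vs c: {match_probability(a, c):.2f}")  # likely different people
```

The useful property is that the output is a score, not a verdict, so the cut-off between match and non-match stays a business decision.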

Layer 3: Clustering

The final challenge: transform pairwise match scores into entity groups. This is where you resolve transitive relationships.

If Record A matches Record B with 0.92 confidence, and Record B matches Record C with 0.88 confidence, are A, B, and C all the same entity? What if A and C compared directly would only score 0.65?

Clustering algorithms handle these transitivity questions, creating stable entity groups even as new records arrive and scores shift over time. This layer also handles entity splits (when you discover you incorrectly merged two people) and entity merges (when new evidence shows records you thought were separate are actually one entity).
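The simplest way to turn pairwise scores into entity groups is to link every pair above a threshold and take connected components. The sketch below does that with a small union-find structure. It is one possible approach, not necessarily what any given product uses, and the function name and threshold values are assumptions. It is also deliberately naive: it accepts transitive links without checking the weaker A-to-C score.

```python
def cluster(scored_pairs, threshold):
    """Group records into entities via connected components.

    scored_pairs: iterable of (record_a, record_b, score).
    Pairs scoring at or above `threshold` are linked; links are transitive.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, score in scored_pairs:
        root_a, root_b = find(a), find(b)  # registers both records
        if score >= threshold and root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return sorted(sorted(g) for g in groups.values())


scored = [("A", "B", 0.92), ("B", "C", 0.88), ("A", "C", 0.65), ("D", "E", 0.45)]

print(cluster(scored, threshold=0.85))  # [['A', 'B', 'C'], ['D'], ['E']]
print(cluster(scored, threshold=0.90))  # [['A', 'B'], ['C'], ['D'], ['E']]
```

At a threshold of 0.85, A and C land in the same entity purely through B, even though their direct score is 0.65. Whether that is acceptable is exactly the transitivity question raised above, and stricter schemes (for example, requiring a minimum score between all members of a group) trade recall for fewer false merges.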

Why This Must Happen in the Warehouse and Data Lake

Here’s a critical architectural decision: entity resolution must run natively in your data warehouse or data lake. Moving billions of records out for processing defeats the entire purpose of warehouse-native architecture.

Data Gravity

Your customer data is already in Snowflake, Databricks, BigQuery, or Redshift. That’s where your transformation pipelines run, where your analytics queries execute, where your data science models train. Extracting data to a separate entity resolution system creates:

Data movement costs (bandwidth, storage, processing)
Synchronization complexity (keeping multiple copies current)
Latency (time to move data before resolution can happen)
New data silos (exactly what warehouse consolidation was supposed to eliminate)

Incremental Processing

Entity resolution isn’t a one-time batch job. New customer records arrive constantly—from web signups, mobile registrations, purchase transactions, support interactions. You need incremental processing that resolves new records against existing entities without full dataset re-processing.

This only works efficiently when resolution runs where the data lives, using the warehouse’s native capabilities for incremental updates and change data capture.
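The logic of incremental resolution can be sketched in a few lines. This is an in-memory stand-in for illustration only: the class name, blocking keys, scoring function, and threshold are all hypothetical, and in a warehouse the index lookups would be joins against existing entity tables rather than Python dictionaries.

```python
from collections import defaultdict
from difflib import SequenceMatcher


class IncrementalResolver:
    """Resolve each new record against existing entities via a block index."""

    def __init__(self, threshold: float = 0.85):
        self.threshold = threshold
        self.entities = {}              # entity_id -> list of member records
        self.index = defaultdict(set)   # blocking key -> entity_ids
        self._next_id = 1

    @staticmethod
    def _keys(record):
        # Hypothetical blocking keys: exact email, name prefix + ZIP.
        return [
            ("email", record["email"].lower()),
            ("name3+zip", record["name"][:3].lower(), record["zip"]),
        ]

    @staticmethod
    def _score(r1, r2):
        # Name similarity only, to keep the sketch short.
        return SequenceMatcher(None, r1["name"].lower(), r2["name"].lower()).ratio()

    def add(self, record) -> int:
        """Attach the record to the best matching entity, or create a new one."""
        candidates = set()
        for key in self._keys(record):
            candidates |= self.index[key]

        best_id, best_score = None, 0.0
        for entity_id in candidates:
            score = max(self._score(record, m) for m in self.entities[entity_id])
            if score > best_score:
                best_id, best_score = entity_id, score

        if best_id is None or best_score < self.threshold:
            best_id = self._next_id     # no good candidate: new entity
            self._next_id += 1
            self.entities[best_id] = []

        self.entities[best_id].append(record)
        for key in self._keys(record):
            self.index[key].add(best_id)
        return best_id


resolver = IncrementalResolver()
e1 = resolver.add({"name": "Sarah Johnson", "email": "sarah.j@example.com", "zip": "94107"})
e2 = resolver.add({"name": "Sara Johnson", "email": "sjohnson@example.com", "zip": "94107"})
e3 = resolver.add({"name": "John Smith", "email": "john.smith@example.com", "zip": "10001"})
print(e1, e2, e3)  # 1 1 2
```

Each new record is compared only with entities that share one of its blocking keys, so the cost of adding a record depends on the size of its blocks, not on the size of the whole dataset.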

Integration with Data Pipelines

Your entity resolution system needs to integrate seamlessly with:

dbt transformations and data modeling
Reverse ETL for audience activation
BI tools for reporting and analytics
ML platforms for model training
Data quality frameworks for monitoring
AI agents for context

This integration is natural when everything operates in the same environment. It becomes painful when entity resolution is a separate system with its own APIs, data formats, and operational requirements.

Governance and Security

Your data warehouse already has:

Access controls and role-based permissions
Audit logging for compliance
Encryption at rest and in transit
Data masking and tokenization capabilities
Retention policies and deletion workflows

Replicating customer data to an external entity resolution system means duplicating all of this governance infrastructure. It becomes a privacy and compliance nightmare with GDPR, CCPA and other applicable laws. Run resolution in the warehouse, and you inherit these controls automatically.

The Human-in-the-Loop Problem

Here’s what the automation-obsessed narrative misses: pure ML automation for entity resolution rarely works in practice.

Why Full Automation Fails

Even the best ML models make mistakes on edge cases:

Common names with similar addresses (are these two John Smiths in the same apartment building one person or two?)
Shared email addresses (family members, assistants, corporate accounts)
Identity changes (marriages, name corrections, gender transitions)
Data entry errors that create systematic patterns the model learns incorrectly
Inadequate data capture

You need human judgment on ambiguous cases, and you need to feed that judgment back into the model to improve over time.

Workflows That Actually Work

Successful implementations build workflows for:

Training data labeling: Which record pairs are matches, which are non-matches? Your initial model needs labeled examples, and getting these labels right determines everything downstream.

Threshold tuning: ML models output probability scores, but you choose the thresholds. A score above 0.90 is an automatic match and a score below 0.40 is an automatic non-match, but what about the 0.65 that could go either way? Human review helps calibrate these thresholds based on business impact.

Edge case review: The model flags cases it’s least confident about. A data steward reviews them, makes decisions, and those decisions become training examples for model tuning.

Continuous model refinement: Data patterns evolve. New data sources get added. Data quality issues emerge and get fixed. The model needs periodic retraining, and human feedback on recent predictions drives that retraining.

Dispute resolution: When downstream teams question why records were merged or split, you need workflows to investigate, explain the decision, and potentially override the model when it’s wrong.

The best systems don’t eliminate human review—they make it efficient. Instead of manually reviewing millions of records, you review the hundreds that matter most.
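Here is a minimal sketch of the triage step behind these workflows. The 0.90 and 0.40 cut-offs are the illustrative values from the threshold example above; the function names and the "closest to 0.5 first" ordering are my own assumptions about how a review queue might be prioritized.

```python
AUTO_MATCH = 0.90      # scores above this are accepted automatically
AUTO_NON_MATCH = 0.40  # scores below this are rejected automatically


def triage(score: float) -> str:
    """Route a scored pair to 'match', 'non_match', or human 'review'."""
    if score > AUTO_MATCH:
        return "match"
    if score < AUTO_NON_MATCH:
        return "non_match"
    return "review"


def review_queue(scored_pairs):
    """Ambiguous pairs only, most uncertain (closest to 0.5) first."""
    ambiguous = [p for p in scored_pairs if triage(p[2]) == "review"]
    return sorted(ambiguous, key=lambda p: abs(p[2] - 0.5))


scored = [
    ("r1", "r2", 0.95),
    ("r3", "r4", 0.65),
    ("r5", "r6", 0.45),
    ("r7", "r8", 0.12),
]

print([triage(s) for _, _, s in scored])  # ['match', 'review', 'review', 'non_match']
print(review_queue(scored))  # [('r5', 'r6', 0.45), ('r3', 'r4', 0.65)]
```

The steward sees two pairs instead of four, and the decisions made on those two can be fed back as labeled examples for the next round of training.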

Common Implementation Mistakes

I’ve seen these patterns repeatedly as organizations tackle identity resolution:

Mistake 1: Starting Too Big

Teams try to resolve all entity types simultaneously—customers, accounts, products, locations, employees. This creates enormous scope, makes success criteria fuzzy, and delays time to value.

Better approach: Start with one high-value entity type (usually customers). Prove the value, build organizational capability, then expand to other entity types. Each new entity type gets easier because you’ve learned the patterns.

Mistake 2: Ignoring Data Quality

Entity resolution surfaces every data quality issue you’ve been ignoring. If 40% of your email addresses are null or invalid, you can’t rely on email matching. If name fields contain inconsistent formatting, parsing errors, or non-name data, string matching becomes unreliable.

Better approach: Treat entity resolution and data quality as inseparable workstreams. Profile your data first. Fix systematic quality issues before expecting resolution to work well. Use entity resolution results to identify and prioritize remaining quality problems.
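Profiling does not need heavy tooling to get started. The sketch below measures how much of an email field is actually usable for matching. The function name, the record shape, and the deliberately loose validity pattern are assumptions for illustration; a production check would be stricter and would cover every field you plan to match on.

```python
import re

# Deliberately loose pattern: something@something.something, no spaces.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def email_profile(records):
    """Count missing and malformed emails and report the usable share."""
    total = len(records)
    missing = 0
    invalid = 0
    for r in records:
        value = (r.get("email") or "").strip()
        if not value:
            missing += 1
        elif not EMAIL_RE.match(value):
            invalid += 1
    usable = total - missing - invalid
    return {
        "total": total,
        "missing": missing,
        "invalid": invalid,
        "usable_rate": usable / total if total else 0.0,
    }


sample = [
    {"email": "a@example.com"},
    {"email": None},
    {"email": ""},
    {"email": "not-an-email"},
    {"email": "b@example.org"},
]

profile = email_profile(sample)
print(profile)  # {'total': 5, 'missing': 2, 'invalid': 1, 'usable_rate': 0.4}
```

If the usable rate comes back this low, email cannot carry the matching on its own, and that is worth knowing before the first model is trained.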

Mistake 3: Treating It as a Project

Organizations approach entity resolution as a one-time project: scope it, build it, deploy it, move on. Then they’re shocked when accuracy degrades over time as data patterns shift.

Better approach: Treat entity resolution as infrastructure that requires ongoing ownership, monitoring, and refinement—just like your data pipelines or ML models. Assign clear ownership, establish SLAs, build monitoring dashboards, schedule regular review.

Mistake 4: Optimizing for Perfection

Teams spend months trying to achieve 99.9% accuracy before releasing anything. Meanwhile, downstream teams make decisions based on no identity resolution at all (effectively 0% accuracy).

Better approach: Ship at 95% accuracy, deliver value, iterate to higher accuracy based on real-world feedback. You learn more from production usage in two weeks than from six months of pre-release testing. Start good, iterate to great.

Mistake 5: Building From Scratch

Organizations underestimate the complexity and decide to build custom entity resolution systems. Eighteen months later, they have something that works on small datasets but can’t scale, has accuracy issues they don’t understand, and requires specialized knowledge that only one person has.

Better approach: Use proven solutions—whether open source frameworks or commercial platforms—that solve the hard algorithmic and scaling problems. Invest your engineering effort in customization, integration, and business-specific logic, not reinventing core entity resolution algorithms.

The Real Question

At this point in conversations, someone usually asks: “Should we do ML-based entity resolution?”

That’s the wrong question.

The real question is: “Can our organization afford to build composable architectures on a foundation of deterministic rules that we know will fail at scale?”

Every organization moving to warehouse-native architectures will eventually face the identity resolution challenge. The only question is whether you address it proactively—as part of your architectural planning—or reactively, after your first major data quality incident.

The reactive path looks like this:

Build your composable architecture on simple deduplication rules
Activate audiences and launch campaigns
Discover that 15% of your “unified” customers are actually duplicates
Or worse, discover you’ve merged different people and violated privacy
Scramble to retrofit proper identity resolution
Rebuild audiences and downstream dependencies
Deal with the business impact and loss of trust

The proactive path looks like this:

Recognize identity resolution as foundational, not an afterthought
Build or adopt ML-based entity resolution that runs in your warehouse
Start with one entity type, prove value, iterate
Build downstream capabilities on accurate identity from day one
Scale confidently knowing your foundation is solid

The reactive path is more common (because people underestimate the challenge). The proactive path is more successful (because foundations matter).

What Actually Endures

Here’s the uncomfortable truth about chasing bleeding-edge features without solid foundations:

A CDP that can’t accurately unify customer IDs is just an expensive data warehouse with a marketing UI.

You can bolt on AI-driven orchestration, predictive segmentation, and real-time decisioning. But if your underlying identity graph is wrong—if you think one customer is three people, or three people are one customer—every downstream capability inherits those errors.

Your AI doesn’t fix bad identity data. It amplifies it. Your personalization engine delivers fragmented experiences. Your predictive models train on fictional customer journeys. Your real-time decisioning makes decisions based on incomplete context.

The organizations actually positioned to leverage advanced CDP capabilities aren’t the ones with the longest feature lists. They’re the ones who took the time to build reliable identity resolution first.

The Real Achievement

Let me share two examples of organizations that built this foundation right.

Fortnum & Mason, the 300-year-old luxury British retailer, faced a classic challenge: customer data fragmented across restaurant bookings, email signups, online transactions, and in-store purchases. They initially tried a third-party service for identity resolution, but it created non-persistent identifiers, offered limited visibility into the matching process, and raised privacy concerns about sharing their entire dataset externally.

Jon Moss, Fortnum’s Customer Engagement Director, describes the transformation:

“For the first time, we’re able to understand how customers are shopping with us—online, in-store, over the phone, or in restaurants. Zingg has helped us unify this data and gain insights we never had before.”

By implementing ML-based entity resolution running natively in their data warehouse, Fortnum & Mason built:

Persistent, unified customer identifiers across all touchpoints
Complete visibility and control over the matching process
Privacy-compliant resolution that keeps sensitive data in their own infrastructure
A foundation for their composable CDP architecture

Orthodox Union, one of the largest Orthodox Jewish organizations in the US, operates over 40 websites, 5 mobile applications, and custom CRMs for different departments. Shelomo Dobkin, their Director of Product Development, explains:

“We moved from using Zingg Open Source to the Enterprise version on Snowflake, making it the engine powering our golden records. Zingg’s powerful features and clean results have really set us apart.”

Their implementation required understanding not just individual identities but household relationships—a critical requirement for their community-focused mission. They built entity resolution that could:

Match people across multiple websites and CRM systems
Understand household connections and family relationships
Create golden records that serve as the foundation for all downstream systems
Process data incrementally as new interactions occur across their digital properties

What makes these achievements remarkable isn’t just the technical implementation. It’s what they enable. This is world-class data engineering that delivers measurable business value:

Marketing teams can trust their audiences
Analytics teams can build on reliable customer journeys
Data science teams can train models on accurate historical data
Operations teams can deliver consistent experiences across channels
Legal and compliance teams can demonstrate proper governance

Everything else—the AI features, the real-time orchestration, the predictive models—becomes possible because the foundation is solid.

Celebrating Foundational Work

The industry needs to reframe how we talk about identity resolution and other foundational data work.

These aren’t prerequisites that should be quick and easy. They’re achievements that deserve recognition.

When a team successfully builds unified customer IDs across disparate systems, that’s a case study worth sharing. When an organization finally has audiences their teams trust, that’s an accomplishment worth celebrating. When identity resolution maintains accuracy as data scales from millions to billions of records, that’s engineering excellence.

The castle that survives wave after wave isn’t the tallest or flashiest. It’s the one with foundations deep enough to withstand the tide.

Moving Forward

If you’re a data practitioner still working on unified customer identity, you’re not behind. You’re doing the work that matters most.

Don’t let vendor demos and curated LinkedIn posts make you feel like an imposter. The bleeding-edge use cases get all the attention precisely because they’re rare. The foundational work gets less visibility precisely because it’s what everyone is actually struggling with.

Focus on making your identity infrastructure trustworthy and durable:

Invest in ML-based entity resolution that can handle real-world data messiness
Build in your data warehouse and data lake so you’re not creating new data silos
Design for continuous operation, not one-time batch processing
Create human-in-the-loop workflows for edge cases and model refinement
Treat it as infrastructure that requires ongoing ownership and monitoring

The teams that master this foundation aren’t behind. They’re building something that will outlast the next wave of features, the next vendor pivot, the next technological hype cycle.

That’s not just the basics. That’s the real work. And it’s work worth doing right.

What’s your experience building unified customer identity? Where did you hit the biggest obstacles? I’d love to hear your stories—the successes and the struggles.

London Calling: Do You Really Know Your Customer?

Sonal Goyal — Tue, 26 Aug 2025 18:22:30 GMT

Last week, I was fine-tuning the Zingg website when something caught my eye. We have a quote on Fortnum & Mason’s journey from fragmented customer records to a unified view with Zingg:

"For the first time, we're able to understand how customers are shopping with us—online, in-store, over the phone, or in restaurants. We never had that before."

That sentence got me thinking: in an era where we can track a customer's digital breadcrumbs across every touchpoint, why do so many brands still struggle with the simple question: "Do you really know your customer?"

The Identity Resolution Promise That Keeps Getting Broken

Most of us have watched the same promises get made over and over again. First, it was Master Data Management (MDM) systems that would create "one golden record." Then identity services like LiveRamp promised to connect everyone's messy data through their identity graph. Then for the last few years, Customer Data Platforms (CDPs) became our solution to the elusive single customer view.

But here's what I keep seeing: enterprises continue to struggle with messy and fragmented data. Multi-year MDM implementations that are still not very useful. The MDM system could tell you the exact lineage of how John Smith's email address made it through seven data transformation steps, but it couldn't figure out that J. Smith from the mobile app was the same person. The system was simply not ready for the messy reality of how customer data actually behaves.

The same pattern plays out with identity services. A different promise: "Don't worry about cleaning your own house; we'll connect everyone's messy data." Unfortunately, identity services operate as closed systems where you can't see or validate their matching logic. When they say Customer A matches Customer B, you're essentially trusting a proprietary algorithm without transparency. This becomes problematic when you need to adjust matching to your specific business context, explain results to stakeholders, or debug why certain matches seem incorrect. Add to that the privacy risk of sending trusted data to a third party. And because these services are designed around fixed data models (email, phone, cookie IDs, etc.), it is tough to resolve on more complex or domain-specific attributes (like healthcare patient IDs, loyalty card numbers, or industry-specific identifiers).

CDPs were supposed to fix everything by democratizing customer data for marketers. The reality? Most became expensive data silos with pretty dashboards and lots of data movement, while the hard work of actually resolving who is who got pushed to the background.

What Actually Works

This is why I'm excited about what the team at Fortnum & Mason have built. Starting with data scattered across restaurant bookings, email sign-ups, online orders, and in-store transactions, they built a composable CDP using Zingg for entity resolution on Databricks.

The transformation wasn't about technology for technology's sake.

"We are able to start turning that into insights and use it to personalize experiences online. We are going to start doing this in store and in our restaurants as well. Being able to connect all of these different customer experiences together is really important for us."

What made their approach different? They treated entity resolution as a first-class capability, not an afterthought. They started with Zingg's open source version to build confidence, understood how the matching worked, then moved to the enterprise version. The entire journey from proof of concept to production took just 2-3 months. For the first time, they can understand how customers shop across all channels. Suddenly, “one customer” meant the same thing everywhere — and the results have been transformational, not just for analytics, but for personalization and customer experience.

Or look at the Orthodox Union, a non-profit serving millions globally. They struggled with donor and member records scattered across legacy systems, event platforms, and CRM databases. Without a reliable way to unify identities, their engagement strategies risked missing context and duplicating outreach. By resolving identities natively in their Snowflake data warehouse, they were able to create a consistent single view of donors and members, while respecting privacy and compliance requirements central to their mission. For them, identity resolution wasn’t just a data challenge — it was a way to deepen community trust.

These are two very different organizations, but the pattern is the same: composability delivers agility, while identity delivers coherence. Without both, the vision of modern customer experience falls short.

The Composable CDP Revolution

Fortnum & Mason's and Orthodox Union’s approach reflects a broader trend that's reshaping the market. Composable CDPs sit on top of existing data clouds to preserve the customer 360 and activate real-time data across channel tools. Instead of moving data around, they work where the data lives. They are private by design. They leverage existing investments in governance and observability. They scale as you grow.

But here’s the hard truth: Composable architectures give you flexibility — but only if the components are stitched together with a stable, reliable, and explainable identity layer. You can have the most elegant semantic layer, real-time pipelines, and activation tools, but if “Acme Corp” and “ACME Inc.” are treated as two different customers, your insights fracture. If one table says “Tim Chen” and another says “T. Chen,” your campaigns lose relevance. If customer data is duplicated at ingestion, those cracks echo through every model, dashboard, and recommendation.

And here’s the other piece we don’t talk about enough: owning your identity layer means owning your customer understanding. Instead of shipping your most sensitive data out to a black-box vendor, you resolve identities where your data already lives — in your own warehouse or lake. That means:

Data ownership: You stay in control of your customer data, critical in a world of tightening privacy regulations.
Schema flexibility: As your business evolves — new channels, attributes, and data types — your identity system evolves with it, instead of breaking because it was designed for someone else’s schema.
Privacy and compliance: By resolving identities within your existing data governance framework, you respect customer consent and meet regulatory requirements without creating extra data copies.
Stack leverage: You’re not rebuilding infrastructure. You’re making your existing investments in warehouses, lakes, and analytics more powerful by adding the missing link: identity.

Why I'm Hosting This Roundtable

This brings me to London. On 9th September, I'll be at the Martech World Forum hosting a roundtable titled "Do You Really Know Your Customer?" — and it will be lovely to meet you if you are there!

Here's what we'll cover:

The Identity Crisis in Marketing: Why traditional approaches to customer identity keep failing, and how agentic AI makes the problem more urgent than ever.

Building Composable CDPs That Actually Work: Moving beyond tool marketing to understand what composable really means, when it makes sense, and how to implement it successfully.

The Zingg Approach to Entity Resolution: How modern ML-based entity resolution differs from traditional MDM, why transparency matters, and how to get started with open source tools.

Real Customer Stories: From fragmented data to unified customer understanding, including the practical challenges and business outcomes. The folks from Fortnum & Mason will join us to share their real-world experience, the challenges they overcame, and the business impact they've achieved.

The roundtable format means we can go deep, ask tough questions, and learn from each other's experiences. Whether you're just starting your customer identity journey or you've been through multiple implementations, there is value in understanding what's working now and what's coming next.

Join Us in London

I have a limited number of VIP passes available for the Martech World Forum. If you're a data or marketing leader in London, I'd love to share the passes and have you join the conversation in person. Because at the end of the day, the real question isn’t just “do you know your customer?” — it’s whether your systems know them too.

Not just their demographic data or their transaction history, but their actual journey across all your touchpoints? Can you recognize them when they call customer service after browsing your website? Do you know that the person who signed up for your newsletter is the same one who made a purchase in your store?

If you can't answer these questions confidently, you're not alone. But you also can't afford to stay in that position much longer.

The companies that figure this out — the ones that truly know their customers — are going to have an unfair advantage in the age of AI-driven experiences. They'll deliver more relevant products, more timely services, and more personal interactions. They'll waste less money on duplicate processes and misguided campaigns.

Most importantly, they'll build the kind of customer relationships that survive economic downturns, competitive pressures, and technology transitions.

And if you can't make it to London but have war stories about identity resolution initiatives (successful or otherwise), I'd love to hear them. This isn't really about technology — it's about understanding the humans behind the data. And that, I've learned over the years, is both the hardest and most rewarding part of what we do.

Want to learn more about entity resolution and composable CDPs? Check out Zingg's open source project or read about how we're helping companies like Fortnum & Mason build better customer understanding.

Agentic AI Is Only as Smart as Its Entities

Sonal Goyal — Wed, 13 Aug 2025 09:43:38 GMT

A few weeks ago, I was talking to a data leader at a fast-growing retail brand.
They’d just rolled out their shiny new AI marketing agent. It could segment customers, design campaigns, and send hyper-personalized emails — all without human intervention.

Two weeks in, the VP of Marketing noticed something strange. One of their top VIP customers had received three different “Welcome to our brand!” offers — each with a different discount.

Turns out, their CRM had John A. Smith, J. Smith, and Jonathan Smith as three separate customers.

The agent wasn’t broken. It was doing exactly what it was told — treat each record as a unique person and delight them with a welcome offer. The problem? The records were wrong.

And this, I’ve realized over and over again, is where agentic AI stumbles: when the entities are messy, the automation multiplies the mess.

Over the past few years, I’ve seen this pattern play out in every industry:

Retail — Duplicate customer records mean multiple shipments, multiple offers, multiple refunds. Margins erode quietly.

Finance — Fraudsters open several accounts with tiny name variations. Without linking those identities, your fraud-detection AI treats them as separate people — and misses the pattern.

Healthcare — Patient “Mary Chen” at Hospital A is “M. Y. Chen” at Clinic B. An AI care coordinator misses her allergy record and books a risky procedure.

In each case, the AI wasn’t “dumb.” It was confidently acting on bad data. And because agentic AI is autonomous, it didn’t just make one mistake — it repeated the mistake hundreds or thousands of times before anyone noticed. Smart agents get the “who” wrong and this really hurts the business.

If you follow the hype, everyone talks about the cost of building and running agentic AI. Far fewer people talk about the cost of running it on messy data. Agents run on LLM calls, and every action burns tokens. When your entities aren’t resolved, the agent:

Processes duplicates
Repeats actions unnecessarily
Sends extra API calls trying to reconcile

Here’s the math from just one example scenario:

*Based on $15 per 1M tokens (GPT-4.1 tier).

Even with cheaper models, the % waste stays the same. That’s just compute waste. It doesn’t even include the cost of fixing the damage — reissuing refunds, restoring trust, or dealing with regulatory fallouts.
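
A back-of-the-envelope version of that claim can be sketched in a few lines of Python. Everything here except the $15 per 1M tokens price is a hypothetical assumption chosen for illustration (the record count, the 15% duplicate share, the 2,000 tokens per agent action, and the cheaper $1.50 price), and the function name is made up:

```python
def duplicate_token_waste(total_records, duplicate_share, tokens_per_record, usd_per_million_tokens):
    """Estimate the dollars an agent spends on duplicate records and their share of total tokens."""
    duplicate_records = total_records * duplicate_share
    wasted_tokens = duplicate_records * tokens_per_record
    total_tokens = total_records * tokens_per_record
    wasted_usd = wasted_tokens / 1_000_000 * usd_per_million_tokens
    return wasted_usd, wasted_tokens / total_tokens


# Hypothetical workload: 1M records, 15% duplicates, 2,000 tokens per agent action.
pricey_usd, pricey_share = duplicate_token_waste(1_000_000, 0.15, 2_000, 15.0)
cheap_usd, cheap_share = duplicate_token_waste(1_000_000, 0.15, 2_000, 1.5)

print(f"At $15.00 per 1M tokens: ${pricey_usd:,.0f} wasted ({pricey_share:.0%} of spend)")
print(f"At $1.50 per 1M tokens:  ${cheap_usd:,.0f} wasted ({cheap_share:.0%} of spend)")
```

The dollar figure shrinks with the cheaper model, but the share of spend going to duplicates does not, because the price cancels out of the ratio.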

Why I care about this (and what we built at Zingg)

Before I started Zingg, I kept running into the same frustration:
We had amazing ML models, sophisticated rules engines, smart agents… but they were all reasoning on fragmented, messy entities.

At Zingg, we decided to treat entity resolution as a first-class product, not a one-off data cleanup exercise. That means:

Matching at scale across CRMs, EHRs, billing systems, and data lakes
Adaptive learning so the matching improves with every variation we see
Transparent decisions so you can explain why two records were linked

When customers put this layer under their agentic AI, they stopped paying for their agents to repeatedly do the wrong thing. In one case, a bank’s fraud detection improved 4x — without touching the AI model.

My advice to data leaders

If you’re about to launch an agentic AI project, take a breath.
Before you buy another GPU hour or fine-tune another model, ask:

“Do my agents actually know who they’re dealing with?”

Because in my experience, agentic AI is the race car. Entity resolution is the track.
You wouldn’t run a Formula 1 machine on gravel and expect to win.

P.S. If you’re curious about what we are building at Zingg, or have war stories about entity chaos in AI systems, I’d love to hear from you.

The Curious Case of the Extra Records!

Sonal Goyal — Sat, 31 May 2025 03:30:04 GMT

Earlier last month, we started working on a feature we are calling Match Statistics. The idea came from the incredible folks at Fortnum & Mason, who are using Zingg as the backbone for their Single Customer View. While running Zingg incrementally, Match Statistics would expose how cluster counts change as records get inserted into and updated in the identity graph. This information about changing clusters would be a good first step toward observing the entity resolution pipeline. If the number of clusters changes disproportionately to the number of records updated and added, an alert could be triggered.
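
As a rough illustration of that alerting idea, here is a minimal Python sketch. The function name, its parameters, and the default threshold are hypothetical and are not Zingg's actual API or output format:

```python
def cluster_change_alert(records_changed, clusters_before, clusters_after, max_ratio=1.0):
    """Return True when the cluster count moved disproportionately to the records touched.

    max_ratio is an assumed threshold: the largest acceptable change in cluster
    count per inserted or updated record.
    """
    cluster_delta = abs(clusters_after - clusters_before)
    if records_changed == 0:
        # No records changed, so any movement in cluster count is suspicious.
        return cluster_delta > 0
    return cluster_delta / records_changed > max_ratio


# 1,000 records changed and 600 new clusters appeared: plausible, no alert.
normal_run = cluster_change_alert(1_000, 100_000, 100_600)

# 1,000 records changed but 5,000 clusters disappeared: worth a look.
suspicious_run = cluster_change_alert(1_000, 100_000, 95_000)

print(normal_run, suspicious_run)
```

The right threshold depends on the dataset; a single updated record can legitimately bridge several clusters, so the ratio is a tripwire for investigation, not proof of a bug.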

As we started thinking about this feature, we realized that the Match Statistics feature could be made even more powerful. Currently the Zingg output consists of the original data along with the Zingg ID and the match probabilities. If we could surface information about the linkages we found in the records within the cluster, we could help users with matching internals and anomalies. When our users are dealing with millions of records, finding the needle in the haystack is critical. Hence, the specification for the feature expanded to additional statistics.

While building new features, or augmenting existing flows within Zingg, we have to design both the Community and Enterprise products, since they share the same code base. The design of any feature on either product has to take into consideration the impact on both products and ensure that they continue to work individually as well as in tandem. And then comes the trickier part. The design has to work well for both Snowflake and Spark deployments. Not just functional. Performance wise as well. The Zingg Enterprise code base is mostly common across Spark and Snowflake. However, since both platforms have very different APIs, query planning, optimisation and execution, we have a lot of ground to cover while building or modifying anything.

That’s what we thrive on :-)

Eventually we designed a stats package which would get the record level and cluster level matches from the match process. This package would be responsible for computing the metrics we cared about, and writing them to appropriate locations, or Pipes in Zingg lingo. The main flow would mostly be unaltered, except for invoking the stats package and passing the dataframes around.

As we started building the core functionality, we started seeing a very strange issue. Within our flow, we do probabilistic matching only on records which do not match deterministically. If two records match deterministically, we do not try to figure out if they match probabilistically as well since that is a waste of computation time.

Now, something strange started happening. Within the stats package, which was consuming the probabilistic and deterministic structures, we found some records which matched probabilistically as well as deterministically. This threw an error in the statistics computations, as they banked on the record pair to match one way or the other or not match at all. Our deeply crafted logic was breaking.
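
The invariant that broke can be stated in a few lines. This is a hypothetical Python sketch with made-up record IDs and function names, not the actual stats package: a record pair may appear in the deterministic matches or the probabilistic matches, but never in both.

```python
def normalize(pair):
    """Order a pair of record IDs so that (a, b) and (b, a) compare equal."""
    a, b = pair
    return (a, b) if a <= b else (b, a)


def overlapping_pairs(deterministic_pairs, probabilistic_pairs):
    """Return the pairs that violate the invariant by appearing in both match sets."""
    deterministic = {normalize(p) for p in deterministic_pairs}
    probabilistic = {normalize(p) for p in probabilistic_pairs}
    return deterministic & probabilistic


# Hypothetical record IDs. The pair (1, 2) shows up in both sets, once reversed.
deterministic_matches = [(1, 2), (3, 4)]
probabilistic_matches = [(2, 1), (5, 6)]

violations = overlapping_pairs(deterministic_matches, probabilistic_matches)
print(violations)
```

A check like this, run as an assertion between pipeline stages, turns a confusing downstream failure into an explicit signal at the point where the two match sets meet.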

However, this part of the flow never changed. All we were doing was computing additional metrics in this package. And the main flow was unaltered. So how could stuff which was working earlier AND which was untouched stop working? What followed was a crazy cycle of runs, tests and validations.

Our initial thought was that we had made an error in collecting the data. Why not run it in another sandbox? Umm, the same result. Maybe we had run the flow twice, appending to the original intermediate data and leading to the wrong state? But if that was the case, every record pair would appear twice. In our case, only some of the record pairs were being matched probabilistically as well as deterministically. We then sought to understand if there was anything special about those records. We looked at column values, deterministic rules, and the probabilistic match criteria. To our dismay, every run threw up a different set of extra records. Clearly, this was not a data or code issue then.

Things were surely getting curiouser and curiouser. We were completely off track from the original feature we set out to build :-(

Next, we turned our focus to the execution environment. When we ran the build on Snowflake, there were no extra records. This was a relief, since it meant it was the Spark side of things that we needed to focus on. Now we ran the workload on Databricks. No issue at all! This was a relief too, since it meant earlier released production code deployed at customer locations was working fine and there was no snake in the grass.

After two weeks of racking our brains, we were finally making progress :-) Things started moving pretty fast after that. We searched for known issues and discovered an issue in Apache Spark. When we changed the Apache Spark version, the issue disappeared. To be absolutely sure, we tested different datasets, different volumes, and all looked good.

Debugging complex systems is always challenging, but some techniques always help.

Time travel to the last known good state helps understand which change or set of changes could be the cause.

Surgical elimination is super critical in such scenarios. Being able to quickly run different non-overlapping tests can lead to faster results.

Weird behavior, such as missing records or extra counts that change from run to run, is more likely to be caused by the environment than by the actual code.

Such non-deterministic errors usually stress teams out, since things feel very out of control. For engineers like us who are used to predictable outcomes, such issues can be very frustrating. Having a cool head goes a long way.

After this wild chase, we are now back to building the Match Statistics feature. We hope to ship it soon!

Just one last thing that I forgot to share. Many wise developers have said this before me, but I must say it again! Plan at least twice the time you anticipate it will take :-)

If you have a similar story to share, hit reply!

What is in a name?

Sonal Goyal — Mon, 12 May 2025 06:39:09 GMT

When Shakespeare penned these famous words, he wasn't thinking about data quality or entity resolution—but the sentiment captures a fundamental challenge in modern data management. A person known as "Robert," "Rob," "Bob," or "Bobby" remains the same individual, regardless of how they're addressed in the data. Names are perhaps the most fundamental identifier we use in data systems, yet they're also among the most inconsistent. Let us consider these common scenarios:

Nicknames and diminutives: William becomes Will, Bill, or Billy
Cultural variations: James in English is Diego in Spanish or Jacques in French
Formal vs. informal usage: Margaret on official documents, Maggie among friends
Transliterations: Сергей (Russian) becomes Sergei or Sergey in Latin script
Spelling inconsistencies: Steven vs. Stephen, Catherine vs. Katherine

These variations create significant challenges for data matching. Without proper handling, our systems might fail to recognize that Robert Smith and Bob Smith are the same person, leading to duplicated records, fragmented customer views, and inaccurate analytics. For customer-facing applications, recognizing these are the same person enables more personalized service and prevents frustrating redundancies. Regulatory frameworks require comprehensive entity resolution. Proper nickname handling helps ensure compliance with KYC (Know Your Customer), AML (Anti-Money Laundering), and GDPR requirements. False negatives—failing to match records that should be matched—can be particularly costly in fraud detection, risk and healthcare.

While it might seem straightforward to compare names for exact matches, the reality is far more complex. We already handle salutations and variations in the data. Yet, without nickname handling, many true matches go undetected. Our internal benchmarks showed that incorporating nickname matching can improve overall match rates by 10-25% in typical customer datasets.

Hence, it was time to build! Amongst the numerous approaches we considered, dictionary matching stood out for several reasons.

Name variations differ significantly across cultures. Dictionaries can be configured to prioritize certain cultural naming patterns based on the data demographics. We are doing Hebrew names in one case and English names in another.
Configurable dictionaries empower users to leverage their domain knowledge for entity resolution.
Even without user involvement, a nickname dictionary mapping formal names to their common variations is not that difficult to build in the age of LLMs.
The surrounding data points of the record (like address, date of birth, and contact information) can be easily validated for potential nickname matches to reduce false positives.
Different industries and use cases require different sensitivity levels. Healthcare applications might prioritize precision, while marketing applications might favor recall. A configurable dictionary helps in such cases.

As we zeroed in on the dictionary, we realized that the approach is extensible to other domains like corporations, where mapping IBM to International Business Machines will greatly improve matching. In the end, we built a MAPPING match type that can even normalize categorical columns like gender, or states and countries.
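
To show the general idea of dictionary-based normalization, here is a minimal Python sketch. It is not Zingg's MAPPING implementation; the dictionary contents, function names, and lowercase normalization are illustrative assumptions:

```python
# Hypothetical mapping from a canonical value to its known variants.
NICKNAMES = {
    "robert": {"rob", "bob", "bobby"},
    "william": {"will", "bill", "billy"},
    "margaret": {"maggie"},
}


def build_lookup(mapping):
    """Invert {canonical: variants} into a flat {variant: canonical} lookup."""
    lookup = {}
    for canonical, variants in mapping.items():
        lookup[canonical] = canonical
        for variant in variants:
            lookup[variant] = canonical
    return lookup


LOOKUP = build_lookup(NICKNAMES)


def canonical_name(value):
    """Map a value to its canonical form; unknown values pass through lowercased."""
    key = value.strip().lower()
    return LOOKUP.get(key, key)


def same_first_name(a, b):
    """Treat two first names as equal when they share a canonical form."""
    return canonical_name(a) == canonical_name(b)


print(same_first_name("Bob", "Robert"))  # True
print(same_first_name("Bob", "Bill"))    # False
```

The same lookup shape works for company aliases or categorical columns: swap in a dictionary keyed on the canonical company name, state, or country. A dictionary hit on its own is only a hint, so the other fields of the record should still confirm the match.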

Rolling out new features on running production pipelines is not trivial. For existing users, we did comprehensive A/B testing of entity resolution with and without nickname handling to quantify the improvement in their specific datasets. The results have been even better than we expected!

Names are at once the most essential and most variable identifiers in our data systems. Effective entity resolution must account for the rich complexity of how names are used in the real world. As Shakespeare suggested, a rose by any other name would indeed smell as sweet—and with Zingg's nickname matching, data systems will finally recognize that truth!

The open source identity resolution journey so far..

Sonal Goyal — Tue, 07 Jan 2025 15:32:54 GMT

More than a decade ago, in the year 2013, the identity resolution problem hit me. As a niche consulting company focused on data and AI, we were tasked with integrating customer records from disparate systems. Our project involved setting up a Hadoop cluster on Amazon EC2 and building out the analytics and reporting for our client. This was the time when services like Elastic Map Reduce were about to get launched. It felt like an engineering feat that we could run Hadoop on the cloud through custom scripts. In a way, we were actually mimicking today’s serverless approaches.

We built data extraction, transformation, and loading pipelines along with the reporting using Sqoop, Pig, Hive, and Cascading. But, as we got along the project, we faced a critical issue. Even when we got all the data in one place, it was extremely challenging to identify the real world entity represented by the varying records from different systems. Since the records did not have any common identifiers and they also had all sorts of variations, unifying the data became the biggest issue for us on that project.

Once I became aware of the problem, I started seeing it in other projects. Even when I was doing different work, the problem kind of never left me. How do we say that the records are a match? How do we scale this out? How do we run it for any entity which can appear for an enterprise? For the programmer in me, it became a thrill ride of problem solving with lots of first principle thinking. The more I worked on it, the more challenges I became aware of. Eventually, I got completely immersed and stopped taking other consulting work to focus on building a generic solution.

When I showed an early version to my friend and mentor Joydeep Sen Sarma, creator of Apache Hive, he encouraged me to open source it. As an active consumer of amazing open source technologies, it was a great way to contribute back. So I latched on to the idea. Here is what I wrote when I published Zingg:

I also feel that open source Zingg is more powerful than closed source, because then it is up to the community to put it to use in more ways than I have seen and could ever anticipate. It is also a promise to work closely with different people and collaborate, something I am very sincerely hoping will happen!

I was convinced there were more people like me, but it was not clear if they would find me. Fast forward to today - the Zingg open source community is 700+ members strong. Turns out what felt like an esoteric problem to me is a strongly felt pain for enterprises. Starting with a single person, Zingg now has a founding team that cares deeply about entity resolution. Zingg resolves billions of records every month and our users and customers come from diverse industries like healthcare, insurance, government sector, non profits as well as startups. This love and support is driving us to innovate more, and we are building amazing stuff to run production loads with a persistent ZINGG_ID along with incremental flows. We continue to work extensively on accuracy and performance. Zingg also supports diverse deployments on Microsoft Fabric, Snowflake, AWS Glue, Databricks, and others.

All this would not have been possible without our community members, well wishers and supporters. Thank you all, you have made this journey well worth it.

Celebrating 3 Years of Open Source Entity Resolution!

Sonal Goyal — Wed, 02 Oct 2024 06:05:00 GMT

“The Data Weaver’s Tale”
In fields of data, vast and wide,
Where records dance and entities hide,
A weaver works with threads so fine,
To link and match and intertwine.

Deduplication clears the way,
As fuzzy matching comes to play.
Record linkage, strong and true,
Brings scattered pieces into view.

Entity linking, precise and keen,
Connects the dots once unforeseen.
The knowledge graph, a tapestry,
Reveals hidden identity.

With AI’s wisdom as our guide,
We stitch and resolve, side by side.
In this grand quest for truth and light,
Entity resolution shines so bright.

When I open sourced Zingg three years ago, Entity Resolution felt like an esoteric problem. While I had personally experienced the issue of unharmonized records, it was hard to say if I was an outlier or if this was a much more common problem. Sure, master data management systems have existed since the beginning of the data industry. Or even before it. Customer data platforms have also had their share of the spotlight. Yet, for all of the modern data stack maps out there, identity and entity resolution have been largely omitted.

For someone who had burnt their life’s savings in working on Zingg, I was treading unknown territory. The encouragement I had received from people who understood the problem truly meant a lot to me. Yet, the way forward was unclear. Starting and building a company is terribly hard. It was tempting to think of a cushioned job which would save me from a lot of hassles, guarantee a handsome package and possibly some other interesting problems.

However, I could not bring myself to do anything else - entity resolution felt like a problem I could not forget. If I could help a few practitioners like myself who had struggled with entity resolution, I would be satisfied. If I think back, it is likely that there was a builder bias in my thinking. Working on Zingg is fun! The challenges we see are not common place. They force you to think, they push you to the drawing board and scratch your head again and again in the quest to find a solution. It is wonderful to apply tools and techniques learnt over multiple years, yet see oneself falling short and researching and learning more and more. Working on Zingg feels like an endless maze of hard computational puzzles thrown our way. As a professional, one could not ask for more!

Also, somewhere deep down, I was not too worried about the selling bit. I was convinced that it was worth a try. One big reason was that the existing tooling in entity resolution is outdated. MDM and CDP tools aim to do too much, and the core problem that they try to solve - entity resolution, is something Zingg specializes in. Zingg’s identity resolution capabilities make it a drop in replacement for an MDM system and a key component in composable CDP stacks. Zingg can work with large datasets and resolve different kinds of entities with ease. We are privacy first, with no data ever leaving customer premises. Our warehouse/lakehouse native approach enables an elegant data architecture, letting people leverage their existing investment into data tooling and running Zingg AI as part of their data pipeline. Without any specialized AI skills, one can straight off use Zingg results in marketing attribution, risk or personalisation models. Or pipe Zingg resolved entities to the RAG of an LLM model or to a knowledge graph.

I have surely made my share of mistakes in life, but turns out I was right on betting wholeheartedly on Zingg. Adoption by open source users has been motivating. Due to on premise and multicloud environments along with the proliferation of SAAS tools, there is a growing need for trusted and enriched entity data. The race to deploy large language models, enterprise knowledge graphs and RAG based systems is pushing the demand for entity resolution even more. An enterprise data leader recently told me that Customer 360 is among the top 3 problems their team worries about. We are repeatedly hearing single customer view, SCV, Customer 360, C360 and variants in sales conversations. A very healthy and growing revenue from Zingg Enterprise is giving us the lever to go deeper into the problem, innovate even further and build simple yet delightful experiences for end users.

We are onto something special, and it is an experience worth having!

This would not have been possible without some really solid support and I am extremely grateful for that.

A big thank you to each of our users for trying Zingg, mentioning us to friends and coworkers and for joining our Slack. We hope that Zingg will continue to help you in your data journey. Special thanks to our investors and paying customers of Zingg Enterprise. We could not have reached here without you. Thank you for your early confidence and bet on us. We have built a lot of valuable features like ZINGG_ID, incremental flows, and deterministic matching based on customer asks and our understanding of customer needs. The product is so much richer due to your feedback and feature asks.

A lot of ground has been covered in the last 3 years, and a lot more remains. In the coming months, we plan to grow the team to build more cool stuff we have been aching to, and continue to dabble in interesting problems at the intersection of technology and company building. Wish us luck!

Thank you for reading, and if this story resonates with you, do say hello!

Photo by Jennie Brown on Unsplash

Zingg Incremental Flow

Sonal Goyal — Thu, 22 Aug 2024 17:20:16 GMT

As a startup founder and someone who actively follows companies doing interesting work, I often wonder about the levers behind a viable and thriving business. One thing that comes up repeatedly is Growth.

From the point of view of data teams in a fast growing business, what really does growth amount to? How does it manifest in the day to day data collection and processing?

If we dissect the key activities happening across businesses during the growth phase, there are a few commonalities that stand out. New customers get acquired at a fast pace. Existing customers buy more products and services. They also buy from additional channels, like a website, if they earlier only bought from the brick and mortar store. The business keeps in touch with existing customers and makes sure it has the ways and means to contact them with offers and product updates. Customers inform the business when they move or change jobs. Strategic mergers and acquisitions happen. Additional procurement of products and services happens to fuel the growth and new suppliers get onboarded. New product lines are introduced and additional customers onboarded.

Thus, data about core business entities - customers, suppliers, products and parts starts changing rapidly. New entity records get added, existing ones get updated. Some of the records are deleted.

A small detour before we go any further. In my last post, I discussed the cross reference and persistent ZINGG_ID. Let us recap a bit. When we build our customer journeys, segment users and attribute acquisition to campaigns and channels, we need a consistent view of the customer data. We want to associate an identifier that inherently represents the entity. And we want to use this identifier in all our reporting and analytics, as well as operational systems. If a customer gets in touch with us at the call centre, we want to use the ZINGG_ID to get the context of their purchases and their relationship with us.

Coming back to the current topic. There is immense business value in having a persistent and continuously updated identity graph for each of the downstream use cases like marketing attribution, personalisation, life time value, fraud and risk.

However, as we saw earlier, in a growing business, records are no longer static. They are getting added and deleted and updated. So, how does all this change affect identity resolution? New records would start matching with existing ones. Updated records may start matching with other records they did not match with earlier. They may stop matching with records they matched up earlier. Hence clusters of resolved identities are evolving with the data. They are merging or splitting and new ones are created all the time.

Let us dig deeper into how that looks by going over some examples.

In the following dataset, the records with ID rec-1020-org-incr and rec-1023-org-incr have been updated during business operations and continue to exhibit strong similarity with their original matches. Two new records with ID rec-10000-org-new were added; they match no existing record, only each other.

With me so far? Now, here comes an even more interesting bit! In the following dataset, rec-1022-org and rec-1022-dup-4 had matched originally.

In an update at a later date, rec-1022-dup-4 got altered and these records now stop matching with each other. This usually happens in the case of wrong data capture in an original record.

Thus, when a record is updated yet it is close enough to the original cluster values, we ought to keep the cluster relationship. In the case of adding a new record that does not match any existing ones, a new cluster should get created. When a new record shows up that matches an existing cluster, the new record has to get attached to the cluster. When a new record that matches two clusters is passed, the clusters need to get merged. If we make edits to a record so that it no longer matches its existing cluster and then revert those edits, the models should correctly track the updated and reverted entry, with the record remaining in the correct cluster throughout the incremental runs. There are myriad possible scenarios of clusters splitting and merging with each other. Matching new and updated against existing records gets extremely complex before we realise it.
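The add, attach and merge scenarios above can be sketched with a small union-find structure. This is only an illustration under simple assumptions: the `Clusters` class and the record ids are made up, and the sketch does not cover splits, which need the affected cluster to be re-evaluated. It is not how Zingg Enterprise implements incremental matching.

```python
class Clusters:
    """Union-find over record ids; each root identifies one cluster."""

    def __init__(self):
        self.parent = {}

    def add(self, rec):
        # A record seen for the first time starts as its own cluster.
        self.parent.setdefault(rec, rec)

    def find(self, rec):
        while self.parent[rec] != rec:
            self.parent[rec] = self.parent[self.parent[rec]]  # path halving
            rec = self.parent[rec]
        return rec

    def link(self, a, b):
        # Record a matched record b: their clusters become one.
        self.add(a)
        self.add(b)
        self.parent[self.find(a)] = self.find(b)

    def groups(self):
        out = {}
        for rec in self.parent:
            out.setdefault(self.find(rec), set()).add(rec)
        return sorted(sorted(g) for g in out.values())


c = Clusters()
c.link("a1", "a2")   # existing cluster A
c.link("b1", "b2")   # existing cluster B
c.add("n1")          # new record with no match: a new cluster
c.link("n2", "a1")   # new record matching cluster A: attached to A
c.link("n3", "a2")   # new record matching A ...
c.link("n3", "b1")   # ... and B: the two clusters merge
print(c.groups())
```

Merging is the easy direction. Undoing a link when an updated record stops matching is where a plain union-find falls short, which hints at why the real problem is so much harder.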

Technically, we could run a full match with Zingg Community version on the updated dataset and get the new clusters.

But.

If we run the entire matching process again, how do we preserve the persistent ZINGG_ID for the records? How do we tie back the newly generated ZINGG_IDs on the updated set to the original ZINGG_IDs? Especially when a new cluster will now have multiple older ZINGG_IDs, or when a new cluster may have some old different ZINGG_IDs and some fresh ones. How do we choose the surviving ZINGG_ID? On top of that, running full matching loads on growing datasets is costly in terms of time taken and compute.
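To see why tying back is awkward, here is a small sketch that compares clusters from a fresh full run against the ids assigned earlier. The data, `candidate_ids` and `classify` are hypothetical, and the sketch only detects the ambiguous cases; it deliberately does not pick a surviving id, since that policy is exactly the open question.

```python
from collections import Counter

# Hypothetical state after the previous run: record id -> persistent id.
old_ids = {"r1": "Z1", "r2": "Z1", "r3": "Z2", "r4": "Z3"}

# Hypothetical clusters produced by a fresh full match on the updated data.
new_clusters = [{"r1", "r2", "r3"}, {"r4"}, {"r5", "r6"}]


def candidate_ids(cluster, old_ids):
    """Count how many records in the cluster carried each old id."""
    return Counter(old_ids[r] for r in cluster if r in old_ids)


def classify(cluster, old_ids):
    """Label a new cluster by how it relates to previously assigned ids."""
    seen = candidate_ids(cluster, old_ids)
    if not seen:
        return "new", None                 # only fresh records
    if len(seen) == 1:
        return "carry_over", next(iter(seen))
    return "conflict", sorted(seen)        # several old ids compete


for cluster in new_clusters:
    print(sorted(cluster), classify(cluster, old_ids))
```

The conflict case is the one that needs a survivorship rule, and every system that already stored the losing id has to be told about the change.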

Here is how we handle this in Zingg Enterprise.

Clusters automatically merge, unmerge and new clusters with ZINGG_IDs are created based on new and updated values while only matching the updated records.

Which means that

The identity graph stays updated even when the data changes.

When I had open sourced Zingg, I felt that it was my life’s best work and wondered if I would ever work on anything more challenging. While building the incremental update, I realized that this functionality is way way tougher to build than everything else I had built so far. Magnitudes tougher than the active learning aspects of Zingg, or the blocking model. There are tons of algorithmic and performance aspects to consider. Even chalking out the test strategy and test data proved to be quite a challenge.

After intensive testing on internally generated datasets, we released the runIncremental phase to our early customers. We knew our test data could not mimic every possible scenario, and though we were confident of the algorithms, we suspected edge cases which we had not envisaged. We requested our customers to test the incremental flows on their data and to report any inconsistencies in matching or cluster assignment. We were thrilled to get an extremely positive response!

Here is an excerpt from one of their test reports:

Each test scenario was carefully constructed to challenge the model and verify the expected behavior. The results were consistently positive, demonstrating the model's precision and adaptability in every test case. Its ability to accurately handle minor data modifications and correctly process entirely new, atypical entries was particularly noteworthy. Equally impressive was its capability to manage entries that aligned with several clusters without compromising accuracy and its efficiency in reverting changes to their original form.
The success of these tests is an essential step on the way to releasing the incremental Zingg models in production.

We are super proud to have come so far, and happy to roll it to all our customers. If the idea of an identity graph on your own data within your own data lake and warehouse appeals to you, why not hit reply and let us grab a coffee? :-)

Hello Zingg ID!

Sonal Goyal — Sat, 08 Jun 2024 11:50:27 GMT

Photo by Amol Tyagi on Unsplash

Imagine we have multiple records of a customer spread across ticketing, billing, offline stores, and web properties. This is a usual scenario we witness every day in our data land. Different sources of data come with multiple versions of truth. When they land centrally in a warehouse or datalake, it is hard to figure things out. Revenue attribution and lifetime value cannot be calculated accurately. Nor can we estimate churn or calculate confidently how many new customers were added last quarter. Increasing revenue through personalized offers and exploring opportunities to cross sell and upsell will not be possible either. Let alone anti money laundering or risk. Visualize a multimillion-dollar investment in a GenAI customer service agent. How far can it go using this fragmented data?

To establish the right data foundation, analyze and report on business critical metrics, and productionize AI and ML technology, we will need to unify such records and build a trusted single source of truth for the customer.

Enter entity resolution. Variations in attributes can all be resolved, and clusters of matching records can be created. The Zingg Community Version offers robust clustering and fuzzy matching, which bring these records together with a unique Z_CLUSTER field. Matching records share the same Z_CLUSTER. Voila, there is a way to say who is who!

What happens when we want to track the customer over time? Or build a holistic view? Establish their journey and reference them or search for them in different enterprise systems. When there is no identifier to refer to the customer, we need a way to associate them with an identifier. Kind of like an SSN or a passport number. Or a primary key in the database.

The Z_CLUSTER is extremely powerful and useful. But it is not persistent. There is no guarantee that it will not change in the next Zingg run. If we stream our resolved records into our source systems and want to use Z_CLUSTER as the primary key, it would break.

As we saw different Zingg deployments, this was an ask that we heard again and again. A globally unique persistent identifier that acts like a reference key to the entity and all its records. The same key even when the record changes. The unambiguous way to refer to an entity across multiple enterprise data systems, operational processes, and workflows. Even as new records come in and earlier ones get updated.

This is exactly the ZINGG_ID in the Zingg Enterprise version. Resolved, unified records across the entire enterprise which can be referenced through a persistent key. A help desk agent could use it to look up the entire customer interaction journey. A GPT based chatbot operating on patient data would work seamlessly with all the information at its disposal.

It's been fun solving this complex piece of the puzzle, and it's extremely satisfying to witness the value it generates for Zingg Enterprise customers.

Sounds interesting, or something you care about? Do get in touch!

Deterministic Matching Vs Probabilistic Matching

Sonal Goyal — Sat, 16 Mar 2024 14:09:14 GMT

In the entity resolution space, we refer to exact matching on attributes as deterministic matching. If the application systems are consistently collecting and persisting trusted and known identifiers like Social Security Number, deterministic matching can help to resolve such records and help us identify the true individual.

In most real world systems, data in different systems in an organisation has different attributes. Let us imagine a retail store with both online and offline presence. A web support system would capture the email correctly while also logging name, a call centre would mostly have a correct phone number but other details may not be as reliable. The offline store would have some or all of the details, as accurately as the human behind the computer tasked with entering them can type it in. Thus some of the individual records have identifiers, like phone and email in the above case, which may match exactly, while many of the records have names and addresses with all kinds of variations, typos, abbreviations etc. Assessing the extent of such variations for an entity and reconciling the fuzzy and inexact matches is known as Probabilistic Matching.

At a simplistic level, one can think of deterministic matching as a database join on trusted id columns like passport number or email. Probabilistic Matching involves statistical and ML based techniques to determine if a given record pair is indeed the same real world entity. If we search these terms, there are multiple articles around Deterministic Vs Probabilistic Matching, pitting one against the other. Since the approaches to solve deterministic matching are very different from the ones for probabilistic matching, these techniques are often viewed as contrary.
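A minimal sketch of the two styles, assuming toy records and using Python's `difflib` string similarity as a stand-in for a learned match score. The field names and helper functions are illustrative only.

```python
from difflib import SequenceMatcher


def deterministic_match(a, b, key="email"):
    """Exact equality on a trusted identifier, like a join condition."""
    return bool(a.get(key)) and a.get(key) == b.get(key)


def name_similarity(a, b):
    """String similarity in [0, 1]; a stand-in for a statistical match score."""
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()


web = {"name": "Katherine Smith", "email": "kat@example.com"}
store = {"name": "Kathrine Smyth", "email": ""}

print(deterministic_match(web, store))        # no shared identifier to join on
print(round(name_similarity(web, store), 2))  # close, but not identical
```

The deterministic check gives a yes or no answer, while the probabilistic side gives a score that still has to be turned into a decision.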

However, we have a different view.

As we dived deeper and looked at multiple datasets, we saw that in most cases, data contained multiple signals, some with trusted identifiers and most times without. Even when there are identifiers, they tend to be more than one. For example one system could have driver’s license id mandatory and passport number optional. While another system could have the mandatory rule reversed. A third system may have both of these ids as optional. If we match deterministically on passport and license in this case, we would only resolve some entities, not all. Worse, a person who has provided license id in the third system would only match with the first, and the record in the second system would go unmatched.

This has serious ramifications. In an ecommerce store, the inability to reconcile effectively would jeopardize the million dollar investment in recommender systems that consume the customer preferences. In an anti money laundering system, the ability to catch fraud is directly proportional to the quality of the matching. An insurer can not understand risk at a household or person level if records remain unresolved.

Thus, we need to solve for both deterministic and probabilistic matching at the same time. An either/or approach with respect to deterministic and probabilistic matching would lead to identifying records with identifiers as a different entity than those without. Why not use every single data point of the record? There is tremendous value to be unlocked by looking at the problem holistically. If a deterministic match can be found, use that. If not, use the other attributes. Essentially, weave them all together holistically to ensure that the probabilistic and deterministic matching records are resolved to a single entity.
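Here is a rough sketch of that "deterministic first, then the other attributes" idea, using the three-system example from earlier. The records, the `THRESHOLD` value and the `match` helper are assumptions made for illustration, and `difflib` similarity stands in for a trained model.

```python
from difflib import SequenceMatcher
from itertools import combinations

ID_FIELDS = ("license", "passport")   # assumed trusted identifiers
THRESHOLD = 0.85                      # assumed cut-off, not a tuned value

records = {
    "sys1": {"name": "Rohan Mehta", "license": "DL-77", "passport": None},
    "sys2": {"name": "Rohan Mehtaa", "license": None, "passport": "P-1234"},
    "sys3": {"name": "R. Mehta", "license": "DL-77", "passport": None},
}


def match(a, b):
    """Return the reason two records match, or None if they do not."""
    for field in ID_FIELDS:
        if a[field] and a[field] == b[field]:
            return "deterministic:" + field
    score = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    return "probabilistic" if score >= THRESHOLD else None


links = {
    (x, y): match(records[x], records[y]) for x, y in combinations(records, 2)
}
for pair, reason in links.items():
    print(pair, reason)
```

In this toy data, sys1 and sys3 link on the license id, sys1 and sys2 link on the name, and sys2 and sys3 do not link directly. All three still belong to one entity because both links pass through sys1, which is the weaving together described above.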

The above is a simplistic take and there is a lot of software engineering involved under the hood to build this out. Effort notwithstanding, we are convinced that it is the right approach. In fact we have gone all out while building this feature, with support for multiple combinations of attributes to make sure no record which has a match remains unmatched. Deterministic and probabilistic matching are better together, and early Zingg Enterprise customers see an immediate jump in match accuracy.

Is this something you care about too? Hit a reply and let's spark a conversation!

Zingg Turns Two!

Sonal Goyal — Sat, 30 Sep 2023 17:09:19 GMT

Two years is a journey, yet it seems like yesterday when I open sourced Zingg. In the interim, we have done fulfilling work and received a lot of love. Curious what’s up at our end? Read on!

First thing first.

We now have paying customers for our enterprise products! There is nothing more exhilarating than seeing our life’s work come alive. And to be paid so that we can continue to build and move in the direction we aspire to. In the startup world, we hold a lot of discussions around product market fit, product founder fit, building things people want, user research, and other associated activities. Product-led startups like ours get the first set of customers through the urgent need of a visionary early adopter. The visionary early adopter is the person in a team who has a pressing need along with the foresight to solve the problem and the confidence to trust a new team and an innovative product. And who can influence and move budgets to solve that need. I am very glad to have this opportunity to work closely with this first set of customers who are helping us shape the product. Not just the features, but also the pricing.

Thank you for supporting us so that we can help you!

We are building three flavors of Zingg Enterprise. Zingg Snowflake, Zingg Databricks and Zingg Spark. All of them with the same roots in Zingg open source. Each carries forward Zingg OSS and gives the required ammunition to build and use the persistent and evolving enterprise identity graph. Zingg Enterprise creates global unique identifiers for entities, explains model behaviour and maintains and enhances the identity graph through resolution of new and updated records against previously matched records. Of course there is the improved accuracy and scalability :-)

On top of this, we have added platform-specific features. The Snowflake version runs inside Snowflake without any other infrastructure or software dependence. The Databricks flavour adds Unity Catalog support and a lot of other Databricks goodies. The Spark one has the enterprise features and runs on all other Spark environments like EMR.

On the open source side, we just hit 1000 downloads per month on PyPi! The rate of downloads has doubled over the past few months, and we now have a Slack group touching 450 people, from varied organizations like LinkedIn, Maersk, Newfront, Samba TV, Canadian Football League, Heb and John Deere to name a few. We have federal agencies like Pandemic Response Centre, and established startups like Provenance.org, Redica Systems, Ninjacart and Cherre. We have a good number of nonprofits and data consultancies as well as friends from Databricks, Snowflake and Exasol. The strong code contributions from these partner companies have helped us to support their users more deeply.

We are clearly onto something :-)

I am truly amazed by the depth of use cases we are seeing across big and small organizations in varying domains. Zingg is resolving customer identity for marketing and personalization use cases in retail. It is resolving patient identity for comprehensive health, and healthcare provider identity to understand drug recommendation patterns. And the identity of an insurance policyholder, for compliance and risk, and also from the point of view of a reinsurer. Identity resolution is much broader than we thought initially. It is the identity of members of religious establishments, of donors or recipients. It is the identity of the fans of a football club to send out the right information at the right time for a game of their liking. B2B accounts, which buy from multiple subsidiaries in different geographies, need identity resolution as well. So do products on an e-commerce website for competitive analysis on other sites.

Identity is anything and everything a business cares about.

We are happy that we built Zingg so that users can work on their schema and train on their data without specialized ML skills. All in a privacy centric way. This has allowed us to see Zingg in action in the different use cases we are discovering every passing day.

To each of you who has used Zingg, mentioned us to friends and coworkers or checked us out - thank you for your time and we hope we have removed some data troubles from your life.

I also wanted to thank our investors and advisors, who backed Zingg when no one had heard of us, and who have been generous with their time and support.

These two years have been a tremendous learning experience, and our plans are only getting bigger. There is a lot of AI yet to be applied to enterprise data, and we will love to create more value for our users and customers. For that, we need to continue to work closely with them, ship more and more often, keep the momentum going and work twice as hard.

We are excited for the path ahead. Wish us luck!

From Half A Day To Half An Hour! Performance Tuning Snowpark For Identity Resolution On Snowflake

Sonal Goyal — Sat, 29 Jul 2023 04:55:12 GMT

We have spent the last few weeks on the Zingg Identity Resolution product for Snowflake by battle-testing it for deployment and scale. Built as an entry for the Snowflake Challenge and on top of the Zingg Open Source version, we already had a pretty solid application that could resolve entities at scale within Snowflake. So we did not expect any major hiccups in terms of the overall flow, model building, and prediction. Developer confidence, bolstered by feedback from users you may say ;-) However, since the product leverages Snowpark instead of Spark, we did expect a bit of performance tuning here and there. In this post, I wanted to share the challenges we faced and the changes we made to overcome them. Hope this is useful to others building on Snowflake.

To begin with, we had a 500k test record set running on the x-small size Snowflake warehouse in roughly 12 hours. Though the code was not tuned for Snowflake performance, the job ran successfully to completion without hiccups and we used this as our baseline. Entity Resolution is a tough problem, and harder to do at scale. We wanted to get maximum mileage out of Snowflake and ensure Zingg ran efficiently for our customers. Our philosophy is that the customer should not have to pay the computing bill for any inefficient code. Hence we always aim to extract maximum performance without throwing more hardware at the problem. Unless we absolutely have to. We also had a target dataset of roughly 2.5 million records, 5 times the baseline. This was a real customer dataset, with non-uniform distribution across fields and with triple the columns of our test set. When we do entity resolution, a 5-fold increase in the input with the same number of columns and match types leads to a 25-fold increase in the comparison complexity. As we have robust index learning based on training data in Zingg along with tons of performance optimizations, we expected the matching job to be somewhere close to 18 hours, which we were planning to tune. However, when we ran this dataset on the x-small Snowflake warehouse, we started seeing intermittent session timeout errors after roughly 8 hours.

`com.snowflake.snowpark.SnowparkClientException: Error Code: 0408, Error message: Your Snowpark session has expired. You must recreate your session. Authentication token has expired. The user must authenticate again.`
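As an aside on the 25-fold figure quoted above: it follows from counting unordered record pairs, which grows with the square of the input size. A quick check, assuming the worst case where every record is compared with every other record (the learned index exists precisely to avoid this):

```python
def candidate_pairs(n):
    """Number of unordered record pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


baseline = candidate_pairs(500_000)
target = candidate_pairs(2_500_000)
print(baseline, target, round(target / baseline, 2))
```

This is only the brute-force upper bound. The actual number of comparisons depends on how well the index prunes the candidate pairs.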

Running the Snowpark client as a background process or with nohup helped, and we were able to see the job continue to run beyond 24 hours. Surely this performance was not something we wanted to ship. So we killed the run after identifying the first set of bottlenecks. Our approach was to zero in on the hotspots and solve them surgically in the order of highest impact first. The highest impact was mostly about the functionality taking maximum time, but in some cases, we also fixed stuff that was earlier in the flow but slow enough to not let us go past a particular stage in a reasonable time. We used the query profiler by Snowflake extensively for understanding and improving the flow, along with some traces and timing logs in our own code base. The query profiler gives a lot of valuable statistics about the query plan, function timing, and which part of the code triggered the query. It also provides partition-level and row-level information, so it was our go-to source for hotspot identification as well as figuring out if our changes worked as expected. Here is a snapshot of how it looks.

At a high level, here is how the Zingg flow is structured:

Input Data → Index → Join On Index to get Approximate Similar Pairs → Pairwise Similarity Features → Predict Matches → Graph Processing

We learn an index on the input data to create approximately similar pairs. We then compute similarity features on these pairs and predict if the pair is a match or not. This output is clustered using graph algorithms to convert matched pairs to clusters.
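The stages can be sketched end to end in a few lines of plain Python. Everything here is a toy stand-in chosen for illustration: the index is a hand-written key rather than a learned one, `difflib` similarity replaces the feature set and the trained classifier, and the threshold of 0.8 is an assumption.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

rows = {1: "Jon Smith", 2: "John Smith", 3: "Jane Doe", 4: "Jane  Doe", 5: "Sam Lee"}


def index_key(name):
    """Toy index: first letter of the last token (Zingg learns its index)."""
    return name.split()[-1][0].lower()


def candidate_pairs(rows):
    """Join on the index key, keeping each unordered pair once."""
    blocks = defaultdict(list)
    for rid, name in rows.items():
        blocks[index_key(name)].append(rid)
    for ids in blocks.values():
        yield from combinations(sorted(ids), 2)


def similarity(a, b):
    """Single similarity feature standing in for the pairwise feature vector."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def predict(pairs, rows, threshold=0.8):
    """Keep pairs whose score clears the assumed threshold."""
    return [(a, b) for a, b in pairs if similarity(rows[a], rows[b]) >= threshold]


def clusters(ids, matches):
    """Graph step: connected components over the matched pairs."""
    parent = {i: i for i in ids}

    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i

    for a, b in matches:
        parent[find(a)] = find(b)
    groups = defaultdict(list)
    for i in ids:
        groups[find(i)].append(i)
    return sorted(sorted(g) for g in groups.values())


matches = predict(candidate_pairs(rows), rows)
print(clusters(rows, matches))
```

Only records that share an index key are ever compared, which is what keeps the pairwise step affordable on large inputs.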

Our logs indicated that the biggest bottleneck was the Predict Match flow. The predict flow leveraged a Snowpark user-defined table function to return two columns. The first was if a given pair was a match or not. The second was the score indicating the similarity of records in the pair with each other. The higher the score, the better the match. Since we returned more than one value from this function, we built it as a UDTF instead of a UDF. However, this UDTF was consuming the bulk of our processing time. Given that we were not really creating multiple rows per input row but returning multiple values per row, all the lateral joins we saw on the query plan did not make sense. It made a worthwhile experiment to convert the UDTF to a user defined function, concatenating the prediction and the score in one value and extracting them out through another UDF. This worked like a charm and immediately cut down the processing on our baseline dataset from 12 hours to 5 and a half hours. More than a 2x improvement! This was a massive gain, but we clearly had more work to do.
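The packing trick itself is simple. Below is a sketch of the function bodies only, written as plain Python so it runs anywhere; registering them as Snowpark UDFs is omitted, and the `|` separator, the function names and the score formatting are assumptions rather than the actual Zingg code.

```python
def predict_packed(is_match, score):
    """Pack both outputs into one string, as a scalar UDF would return it."""
    return f"{int(is_match)}|{score:.6f}"


def unpack_prediction(packed):
    """Extract the match flag (1 or 0) from the packed value."""
    return int(packed.split("|", 1)[0])


def unpack_score(packed):
    """Extract the similarity score from the packed value."""
    return float(packed.split("|", 1)[1])


packed = predict_packed(True, 0.93)
print(packed, unpack_prediction(packed), unpack_score(packed))
```

One scalar value per input row keeps the query a plain projection, so the lateral joins that came with the table function are no longer needed.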

We then set out to optimize the queries generated by Snowflake through the Snowpark code we had written. The bottlenecks here were mostly while indexing and generating the near-similar pairs. We saw that if we did computations on a Snowpark Dataframe to build the index and then self-joined the Dataframe on the index, the index computations were repeated. This was counterintuitive to us, as we were using the same instance of the Dataframe. We tried to clone the Dataframe as advised but that did not work. Finally, when we cached the Dataframe, the query plan was altered and our costly computations happened only once. The lesson here was that if the computation is costly and the Dataframe has to be used in a self-join, it is wise to cache the intermediate results. However, this has to be used with caution, as there is an overhead in temporary table creation and disk spillage.
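
The effect is easier to see in a toy model. The sketch below is an analogy in plain Python, not Snowpark code: `LazyFrame` is a made-up class whose plan is re-executed every time its rows are needed, and `cache()` plays the role of materializing the intermediate result once.

```python
class LazyFrame:
    """Toy lazy 'dataframe': the plan runs again each time rows are requested."""

    def __init__(self, compute):
        self._compute = compute

    def rows(self):
        return self._compute()

    def cache(self):
        # Materialize once; the new frame only hands back the stored rows.
        materialized = self._compute()
        return LazyFrame(lambda: materialized)

calls = {"index": 0}

def build_index():
    # Stand-in for the costly index computation.
    calls["index"] += 1
    return [(1, "gol"), (2, "gol"), (3, "acm")]

def self_join(frame):
    # Both sides of the join ask the frame for its rows.
    left, right = frame.rows(), frame.rows()
    return [(a, b) for a, k1 in left for b, k2 in right if k1 == k2 and a > b]

uncached_pairs = self_join(LazyFrame(build_index))
calls_without_cache = calls["index"]

calls["index"] = 0
cached_pairs = self_join(LazyFrame(build_index).cache())
calls_with_cache = calls["index"]
```

Both runs produce the same pairs, but the index is built twice without the cache and once with it. The real cost of caching, the temporary table and possible disk spillage, has no counterpart in this toy.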

Another optimization we did was on the range scan. We join the records on the computed index to build the pairs. To avoid a pair like (Record A, Record B) repeating as (Record B, Record A), we filter the pairs on ids. This is critical as the subsequent similarity feature processing is expensive. However, we saw that the join became a cartesian join with the filtering pushed out to much later in the flow. This led to slow performance along with a suboptimal flow that computed similarity features we did not need. We cached the Dataframe arising out of the join and changed the id1 > id2 range query to id1-id2 > 0, which pushed the filter to the right point in our flow and changed the earlier cartesian join to a normal join.
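
Here is a small sketch of why the id filter matters, written in plain Python rather than Snowpark. The function name and the sample index are made up for illustration. In Python the two forms `id1 > id2` and `id1 - id2 > 0` are equivalent; the difference we saw was only in how Snowflake planned the query.

```python
from collections import defaultdict

def candidate_pairs(index):
    """Self-join (record_id, index_key) rows on the key, keeping each pair once."""
    by_key = defaultdict(list)
    for record_id, key in index:
        by_key[key].append(record_id)

    pairs = []
    for ids in by_key.values():
        for id1 in ids:
            for id2 in ids:
                # Keeps (B, A) and drops both (A, B) and the self-pair (A, A).
                if id1 - id2 > 0:
                    pairs.append((id1, id2))
    return sorted(pairs)

index = [(1, "a"), (2, "a"), (3, "a"), (4, "b"), (5, "b")]
pairs = candidate_pairs(index)
```

A block of n records yields n(n-1)/2 pairs with the filter instead of n squared without it, so applying it before the similarity step roughly halves the expensive work.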

Together, the above changes shaved off roughly 20% of runtime on our baseline dataset, and we were now down to 4 hours and 30 minutes of the entity resolution workflow on 500,000 records.

With these major fixes in, we profiled the run again, closely looking through the hotspots, and to our utter dismay, found that the predict function was still the most time-consuming part of the entity resolution workflow. This is a Python UDF that uses machine learning for the prediction. A pre-built model learnt from the training data is passed as an argument to the function along with multiple similarity features of the pair. This UDF predicts whether the pair is a match or not along with their similarity score. Since the features are different for different datasets, we do not know beforehand the shape of the feature columns. Hence we construct an array over the multiple features so that we can resolve different types of entities. The good news and the bad news were the same: there was no code change we could make to optimize this function. Our code was concise and to the point, and there wasn’t anything to shave off here. So we focused on the Snowflake invocation of this UDF. We learned that Snowflake advocates vectorizing UDFs, especially if they involve matrices and other machine learning workloads. Vectorized UDFs are passed batches of rows and return batches of results. This made immediate sense, as invoking the UDF row by row was surely not going to help with the inherent matrix manipulations. Though we wanted to do this right away, there were quite a few challenges.

It was difficult for us to define the function signature as the number and types of the columns were dynamic. Earlier we used to pass an array, so it had worked. After some thought, we were able to build the signature and define the function on the fly using the metadata we already had while creating the similarity features. Passing the model became tricky, however, and we were not able to deserialize it in the function no matter what we tried. It seemed there were some extra bytes added when the model was passed as a column and received by the UDF as part of a Pandas Dataframe. The same code had been working earlier row by row, so the internal conversion of Snowpark Dataframe rows to a Pandas Dataframe seemed to be the likely cause. There was another complexity. We were invoking the Python function through the Java Snowpark API, so there were too many layers to unravel here. We went over the open-source code of Snowpark, we tried to add function logging and tracing, we tried to debug calling the UDF from Python, and we read all that we could. We pretty much did anything and everything we could think of!

After struggling for a few anxious days, we changed our approach and began persisting the model to a Snowflake stage. Our UDF would import that location and read and use the model. This finally solved the serde issue we were seeing earlier, along with reducing the data sent to the function. In hindsight, we should have tried this approach first, but at that time the thought was to only change what was absolutely necessary. The code for passing the model had been working for a while, and while tuning performance, it is important to control the changes. So we weren’t too wrong either.
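
The pattern of reading the model from a fixed location once, instead of receiving it with every call, can be sketched in plain Python. This is a simplified illustration, not our UDF: a temporary directory stands in for the Snowflake stage, and the file name, the dictionary "model", and the linear scorer are all assumptions made up for the example.

```python
import functools
import os
import pickle
import tempfile

# Hypothetical stand-in for the stage location the UDF imports from.
STAGE_DIR = tempfile.mkdtemp()
MODEL_PATH = os.path.join(STAGE_DIR, "model.pkl")

def save_model(model, path=MODEL_PATH):
    """Persist the trained model once, outside the prediction path."""
    with open(path, "wb") as fh:
        pickle.dump(model, fh)

@functools.lru_cache(maxsize=1)
def load_model(path=MODEL_PATH):
    """Deserialize the model on first use and reuse it on later calls."""
    with open(path, "rb") as fh:
        return pickle.load(fh)

def predict(features):
    """Score one pair using the cached model rather than a model argument."""
    model = load_model()
    score = sum(f * w for f, w in zip(features, model["weights"])) + model["bias"]
    return score >= model["threshold"], score

save_model({"weights": [0.6, 0.4], "bias": 0.0, "threshold": 0.5})
first = predict([1.0, 1.0])
second = predict([0.0, 0.5])
```

Only the feature values travel with each call, and the model bytes never pass through a dataframe column, which is the path where we suspected the extra bytes were creeping in.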

It got a bit late in the day when we ran the build with these changes, and when the test dataset ran in half an hour, we were sure something was crazily wrong. Maybe we had left the bulk of our processing code commented out in the run by mistake? We saw the output table, the results were there, and the predictions looked accurate. It was hard to believe. Too weary to think further, we called it a day.

The next day we double-checked the output and everything was in order. We cross-checked. Yes, it looks fine. No, we are not sure. Not yet. Let us run it once more. Boom, the job was finished by the time we finished our daily catchup + coffee.

28 minutes. From 12 hours. From half a day to half an hour! 24x.

On an x-small warehouse resolving identities across multiple columns for 500,000 rows. Wow! High Five! Now we need a break. Mission Impossible Dead Reckoning, here we come!

It was now time to run the customer data set in their warehouse. Brimming with confidence, we fire the job and monitor it from time to time. Umm... it is not working. The predict function is still a bottleneck. For some reason, the batch size Snowflake has automatically chosen on this dataset is 10, while it was sending 1000 rows at a time to the function on our test data. Possibly because there are way too many columns here compared to our test data? The batch size setting is just a hint, but that seems to be the only ammunition we have right now. Will it work? So we modify the UDF and add the max batch size annotation. This has the desired effect, and we have some great results for our customers.
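
To see why the batch size matters so much, here is a toy sketch in plain Python that counts how often a batch-oriented predict function is invoked. It says nothing about how Snowflake actually picks its batch size. The function names and the simple weighted-sum scorer are made up for illustration.

```python
def batches(rows, batch_size):
    """Yield consecutive slices of at most batch_size rows."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

def predict_batch(batch, weights, threshold=0.5):
    """Batch in, batch out: one prediction per input row."""
    return [
        sum(f * w for f, w in zip(row, weights)) >= threshold
        for row in batch
    ]

def run(rows, weights, batch_size):
    """Return all predictions plus the number of function invocations."""
    invocations = 0
    predictions = []
    for batch in batches(rows, batch_size):
        invocations += 1
        predictions.extend(predict_batch(batch, weights))
    return predictions, invocations

rows = [[i / 1000, 1 - i / 1000] for i in range(1000)]
weights = [0.7, 0.3]
small_out, small_calls = run(rows, weights, batch_size=10)
large_out, large_calls = run(rows, weights, batch_size=1000)
```

The predictions are identical either way, but 1000 rows cost 100 invocations at a batch size of 10 and a single invocation at 1000. Every invocation carries fixed overhead, and small batches also give the matrix operations inside the model very little to work with.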

There were a few other changes we had to make to get here. For example, the Dataframe.show() function would cause an out of memory error on the client machine if the data was big. While working on performance optimization, it is critical that functional parts of the code work flawlessly. We had used show() liberally in our code to verify the logic, assuming that it would limit the records sent to the Snowpark client by limiting rows in the warehouse itself. But show() works differently. It fetches everything to the client and then truncates to 10 rows. This was a bit counterintuitive, but an easy fix.

Overall we are super excited about the gains we have made and so happy that we have come this far. We hope the lessons above will help folks who are building on Snowflake and Snowpark. Feel free to talk to us if you are struggling with Snowpark/Snowflake performance issues. We would love to help and share our learnings! And if you know someone who needs entity resolution on Snowflake, please do send them our way :-)

Ciao!