<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Learning From Data, Data Teams and Ecosystem]]></title><description><![CDATA[MDM and Entity resolution as foundational infrastructure — and why it matters for AI-ready data stacks. By the founder of Zingg.]]></description><link>https://www.learningfromdata.zingg.ai</link><image><url>https://www.learningfromdata.zingg.ai/img/substack.png</url><title>Learning From Data, Data Teams and Ecosystem</title><link>https://www.learningfromdata.zingg.ai</link></image><generator>Substack</generator><lastBuildDate>Tue, 05 May 2026 18:50:57 GMT</lastBuildDate><atom:link href="https://www.learningfromdata.zingg.ai/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Sonal Goyal]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[sonalgoyal@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[sonalgoyal@substack.com]]></itunes:email><itunes:name><![CDATA[Sonal Goyal]]></itunes:name></itunes:owner><itunes:author><![CDATA[Sonal Goyal]]></itunes:author><googleplay:owner><![CDATA[sonalgoyal@substack.com]]></googleplay:owner><googleplay:email><![CDATA[sonalgoyal@substack.com]]></googleplay:email><googleplay:author><![CDATA[Sonal Goyal]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[More persistence for Zingg IDs!]]></title><description><![CDATA[Introducing Reassign - taking the identity graph one step further.]]></description><link>https://www.learningfromdata.zingg.ai/p/more-persistence-for-zingg-ids</link><guid isPermaLink="false">https://www.learningfromdata.zingg.ai/p/more-persistence-for-zingg-ids</guid><dc:creator><![CDATA[Sonal 
Goyal]]></dc:creator><pubDate>Sun, 05 Apr 2026 16:23:22 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!2HD-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d79b6d5-0c0b-43c7-8e66-53eba3e9638b_838x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2HD-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d79b6d5-0c0b-43c7-8e66-53eba3e9638b_838x1048.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2HD-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d79b6d5-0c0b-43c7-8e66-53eba3e9638b_838x1048.png 424w, https://substackcdn.com/image/fetch/$s_!2HD-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d79b6d5-0c0b-43c7-8e66-53eba3e9638b_838x1048.png 848w, https://substackcdn.com/image/fetch/$s_!2HD-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d79b6d5-0c0b-43c7-8e66-53eba3e9638b_838x1048.png 1272w, https://substackcdn.com/image/fetch/$s_!2HD-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d79b6d5-0c0b-43c7-8e66-53eba3e9638b_838x1048.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2HD-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d79b6d5-0c0b-43c7-8e66-53eba3e9638b_838x1048.png" width="838" height="1048" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3d79b6d5-0c0b-43c7-8e66-53eba3e9638b_838x1048.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1048,&quot;width&quot;:838,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1679572,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.learningfromdata.zingg.ai/i/193263282?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d79b6d5-0c0b-43c7-8e66-53eba3e9638b_838x1048.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2HD-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d79b6d5-0c0b-43c7-8e66-53eba3e9638b_838x1048.png 424w, https://substackcdn.com/image/fetch/$s_!2HD-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d79b6d5-0c0b-43c7-8e66-53eba3e9638b_838x1048.png 848w, https://substackcdn.com/image/fetch/$s_!2HD-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d79b6d5-0c0b-43c7-8e66-53eba3e9638b_838x1048.png 1272w, https://substackcdn.com/image/fetch/$s_!2HD-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d79b6d5-0c0b-43c7-8e66-53eba3e9638b_838x1048.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In my last few posts, I wrote about the <a href="https://www.learningfromdata.zingg.ai/p/hello-zingg-id">persistent ZINGG_ID</a> and <a href="https://www.learningfromdata.zingg.ai/p/zingg-incremental-flow">incremental flows</a>. Quick recap for those who missed them. As data keeps changing in a growing business - new customers, updated addresses, records arriving from new channels - the identity graph has to keep up. The <a href="https://www.zingg.ai/product/features/incremental-flow">incremental flow in Zingg Enterprise</a> handles exactly this. New and updated records get matched against existing clusters, clusters merge and split automatically, and the ZINGG_ID travels with the entity through all of it. 
The identity graph stays consistent even as the data underneath it keeps moving.</p><p>That was a big piece of work to build and ship. But as we rolled it out to more customers, we started hearing about a different scenario. One we had not fully planned for.</p><p>Some of our customers had been running Zingg for a while. Their identity graphs were in good shape. Incremental flows were humming. And then, their data got better.</p><p>Phone numbers that were unreliable at the start of the project - maybe coming from a poorly integrated source system, or just sparsely captured - were now being collected consistently. Email fields that were mostly empty were now populated. A new column from a recently onboarded data source was throwing up a great matching signal. The data engineering team had done some serious cleanup work on one of the source systems.</p><p>In other words, the matching could now be sharper than it was when the graph was first built.</p><p>And yet, teams were not acting on it. We started hearing the same thing again and again. Customers who loved their ZINGG_IDs - and were relying on them deeply - were finding that the very thing they valued was also holding them back. One customer had significant architecture changes and needed to run a full refresh, but was worried about losing their ZINGG_IDs. Others wanted to add a new matching signal but held off. Some wanted to move to a different data platform entirely, but the ZINGG_ID dependency made the migration feel too risky.</p><p>The Zingg identifier had become load-bearing infrastructure. Which we loved to see - it meant the graph was genuinely embedded in how the business operated. But it also meant that any change to the graph felt dangerous. Customers were not able to move from one model to another, capture more signals, or even change their data platform because of Zingg. 
That was not the position we wanted our customers to be in.</p><p>So what do you do when you want a full refresh - matching run again on the full dataset, this time with the richer fields, the better signal, the cleaner data - but your downstream systems are wired to the existing ZINGG_IDs?</p><p>Reporting tables, Customer 360 dashboards, ML feature stores, operational workflows at the call centre - all of them reference the ZINGG_IDs from the previous run. If you blow away the graph and rebuild it, you get better clusters but you lose the identifier continuity. Every downstream system breaks or needs a migration. That is a huge cost, and it tends to make teams postpone the refresh indefinitely. The identity graph gets frozen at the quality level it had when it was first built, because improving it feels too disruptive.</p><p>This is the gap that <a href="https://docs.zingg.ai/latest/stepbystep/reassignzinggid">Reassign</a> closes.</p><p>After you run a full refresh with your updated fields and retrained model, Reassign maps the new cluster assignments back to the existing ZINGG_IDs. It carries the established identifiers forward wherever the entity continuity holds. Downstream systems keep their reference point. The identity graph gets the benefit of the improved matching signal. The refresh is no longer a disruptive operation.</p><p>Here is the even more interesting bit - what taking the persistent ZINGG_ID one step further really means. Persistence, as we built it earlier, means the identifier travels with the entity through routine changes - new records, updated records, incremental loads. <strong>Reassign extends that guarantee to cover a deliberate, intentional rebuild of the graph itself.</strong> The identifier now <em>survives</em> not just the churn of normal operations, but also the data team&#8217;s decision to go back and do it better.</p><p>We think of this as data maturity support built into the identity graph. 
Because in practice, the signals get better over time. Source systems get cleaned up. New integrations bring in new attributes. The identity graph should be able to evolve with that maturity - not fight against it.</p><p>If you have a full refresh sitting in your backlog because the downstream impact felt too risky, we would love to hear from you. Hit reply and tell us what triggered it - what improved signal are you sitting on that you have not yet been able to incorporate?</p>]]></content:encoded></item><item><title><![CDATA[From “What Was Revenue Last Quarter?” to “Generate 30% More Revenue This Quarter”]]></title><description><![CDATA[The a16z piece on context layers is spot on. Here&#8217;s what it means when you follow the argument all the way through.]]></description><link>https://www.learningfromdata.zingg.ai/p/from-what-was-revenue-last-quarter</link><guid isPermaLink="false">https://www.learningfromdata.zingg.ai/p/from-what-was-revenue-last-quarter</guid><dc:creator><![CDATA[Sonal Goyal]]></dc:creator><pubDate>Thu, 12 Mar 2026 06:37:44 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!fCcn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88025e21-5b43-4fcb-904b-f381c0b30091_1024x608.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fCcn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88025e21-5b43-4fcb-904b-f381c0b30091_1024x608.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!fCcn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88025e21-5b43-4fcb-904b-f381c0b30091_1024x608.png 424w, https://substackcdn.com/image/fetch/$s_!fCcn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88025e21-5b43-4fcb-904b-f381c0b30091_1024x608.png 848w, https://substackcdn.com/image/fetch/$s_!fCcn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88025e21-5b43-4fcb-904b-f381c0b30091_1024x608.png 1272w, https://substackcdn.com/image/fetch/$s_!fCcn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88025e21-5b43-4fcb-904b-f381c0b30091_1024x608.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fCcn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88025e21-5b43-4fcb-904b-f381c0b30091_1024x608.png" width="1024" height="608" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/88025e21-5b43-4fcb-904b-f381c0b30091_1024x608.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:608,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!fCcn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88025e21-5b43-4fcb-904b-f381c0b30091_1024x608.png 424w, https://substackcdn.com/image/fetch/$s_!fCcn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88025e21-5b43-4fcb-904b-f381c0b30091_1024x608.png 848w, https://substackcdn.com/image/fetch/$s_!fCcn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88025e21-5b43-4fcb-904b-f381c0b30091_1024x608.png 1272w, https://substackcdn.com/image/fetch/$s_!fCcn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88025e21-5b43-4fcb-904b-f381c0b30091_1024x608.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>A16z just published a piece that every data practitioner should read. <a href="https://a16z.com/your-data-agents-need-context/">Your Data Agents Need Context</a> makes the case that AI agents have been failing not because the models are bad, but because they&#8217;ve been operating without the business context they need to reason well &#8212; no understanding of how revenue is defined, which tables are the source of truth, what a fiscal quarter actually means for this organization. The fix, they argue, is a modern context layer: richer than a semantic layer, encompassing canonical entities, tribal knowledge, governance logic, and identity resolution.</p><p>They&#8217;re right. And I want to extend that argument into territory I think deserves more attention &#8212; because the implications of taking it seriously go well beyond the examples that are easiest to reach for.</p><p><strong>This Is Data Teams&#8217; Moment &#8212; The Question Is How Big We Think</strong></p><p>I wrote recently about <a href="https://www.learningfromdata.zingg.ai/p/the-data-teams-moment-how-agentic">how agentic AI is finally giving data teams their long-overdue moment</a>. For years, data teams spent budget cycles defending their existence. That is changing fast. 
AI agents are only as good as the data they run on, and the foundational work data teams have been quietly doing for years &#8212; quality, governance, lineage, entity resolution &#8212; is now the prerequisite for every enterprise AI initiative that matters.</p><p>The budget conversation is flipping &#8212; from &#8220;justify why we need you&#8221; to &#8220;how fast can you get us AI-ready?&#8221; That&#8217;s a genuine shift. But it comes with a risk: if the measure of success for data agents settles at &#8220;answering questions BI could already answer,&#8221; we&#8217;ll have spent enormous energy rebuilding, more expensively and with more failure modes, something that already worked.</p><p>Revenue last quarter? Your BI team has a Looker dashboard for that. It refreshes nightly, the fiscal calendar is baked in, and the CFO trusts it. The a16z piece walks through exactly why an agent struggles to replicate even this &#8212; wrong revenue definition, stale YAML, ambiguous tables. That&#8217;s a real problem worth solving. But it&#8217;s also worth asking: once we&#8217;ve solved it, what have we actually unlocked?</p><p>The answer, I&#8217;d argue, is: the foundation for something far more valuable.</p><p><strong>The Questions That Context-Plus-Identity Actually Makes Possible</strong></p><p>The a16z piece mentions that a modern context layer should go beyond metric definitions to include canonical entities and identity resolution. I want to take that seriously and spell out what it actually unlocks. Walk into any commercial or strategy meeting and listen for the questions that make the room go quiet &#8212; not because the answer is sensitive, but because nobody actually knows: How many real customers do we have &#8212; not accounts or rows in a CRM, but distinct humans or organisations with an active relationship? Who are our most valuable customers, across product lines, referrals, and renewal behaviour? 
What is our customers&#8217; lifetime value when &#8220;Acme Corp&#8221; in the CRM and &#8220;ACME Inc.&#8221; in billing are finally recognised as the same entity? Who can we upsell, and when, based on what similar cohorts did next? Which customers are three months from churning, based on signals scattered across product, support, and finance systems? These questions couldn&#8217;t be asked before, not because agents were too dumb, but because the identity layer that makes them answerable was never in place.</p><p><strong>What Agents Actually Do with a Resolved Identity Graph</strong></p><p>This is where the discussion stops being theoretical. Once entity resolution is in place &#8212; once the agent knows <em>who</em> it&#8217;s reasoning about across every system &#8212; a new class of autonomous workflows becomes possible. Here are four that illustrate the shift.</p><p><strong>The Churn Prevention Agent.</strong> The agent monitors product usage daily &#8212; login frequency, feature adoption, support ticket volume &#8212; and when a customer&#8217;s pattern matches historical churn signatures, it autonomously cross-references their contract renewal date, checks their NPS history across systems, and drafts personalised outreach for the customer success manager with a suggested intervention. The entire workflow collapses if the agent is seeing three different records for the same customer across the product database, CRM, and support tool. Entity resolution is not a prerequisite step &#8212; it is the agent&#8217;s eyes.</p><p><strong>The Upsell Timing Agent.</strong> A customer&#8217;s usage of Product A crosses a threshold. The agent compares their trajectory against cohorts of similar customers who went on to adopt Product B within 90 days, finds the pattern match, creates a qualified opportunity in the CRM, and alerts the account manager &#8212; without anyone pulling a report. 
The cohort comparison is only valid if the purchase history and usage data it&#8217;s drawing on are resolved to the same underlying customer. A fragmented identity graph means the agent is comparing apples to a mix of apples and ghosts.</p><p><strong>The Cross-Sell Across Household or Org Agent.</strong> In B2C &#8212; a bank, an insurer, a retailer &#8212; one household member has a mortgage, another has a current account, a third just opened a savings product. An agent that understands household relationships, not just individual records, can surface the cross-sell opportunity and route it to the right team before a competitor does. In B2B, the same logic applies at the organisational level: one division is already a customer, another is in a competitor&#8217;s trial. The agent identifies the relationship and acts. Both scenarios are structurally impossible without entity resolution at the household or org level &#8212; and both are routine once it&#8217;s in place.</p><p><strong>The Supplier Risk Agent.</strong> The agent continuously watches news feeds, financial filings, and logistics data for signals about key suppliers &#8212; credit downgrades, port disruptions, leadership changes. When a risk threshold is crossed, it maps which SKUs are affected, what inventory buffer exists, and which alternative suppliers are pre-qualified, then surfaces a recommended action to the procurement team. The catch: this requires knowing that &#8220;Acme Logistics Ltd&#8221; in the ERP, &#8220;ACME Logistics&#8221; in the contracts system, and &#8220;Acme Log.&#8221; in the invoicing platform are the same supplier. 
Without that, the agent is monitoring fragments, not entities &#8212; and the risk signal it&#8217;s supposed to catch slips through the gaps.</p><p><strong>What &#8220;Canonical Entities&#8221; Actually Requires</strong></p><p>It&#8217;s worth dwelling on the practical aspects of having canonical entities and identity resolution as components of a modern context layer &#8212; because this is where most implementations will either succeed or quietly fail.</p><p>I&#8217;ve written before about <a href="https://www.learningfromdata.zingg.ai/p/the-identity-crisis-why-getting-the">why building a unified customer identity is genuinely hard, and why so many teams feel like imposters for still wrestling with it</a>. The industry narrative treats unified identity as a prerequisite &#8212; something that should already be done before you get to the interesting work. In reality, the vast majority of organisations are still fighting this battle. They&#8217;re just not talking about it publicly because it feels like admitting failure.</p><p>Entity resolution &#8212; sometimes called record linkage, identity resolution, or master data management &#8212; is the discipline of determining that two records refer to the same real-world person, company, or object, even when names, formats, and identifiers don&#8217;t match. It requires probabilistic matching across messy strings, address normalisation, fuzzy deduplication, graph-based clustering, and ongoing maintenance as records evolve. It&#8217;s one of the oldest unsolved problems in enterprise data, not because people haven&#8217;t tried, but because the messiness is structural. Organisations grow through acquisition. Customers interact across channels. 
Nobody standardised data entry in 1997.</p><p>For the context layer to include this, entity resolution:</p><ul><li><p>Cannot be hard-coded into YAML or a semantic layer &#8212; it requires ML-powered probabilistic matching trained on actual data</p></li><li><p>Cannot be a one-time batch job &#8212; it needs to run continuously as new records arrive and existing ones change</p></li><li><p>Cannot live outside the warehouse &#8212; moving billions of records to a separate resolution system reintroduces the very data silos and lack of context that AI agents are struggling with</p></li><li><p>Cannot be fully automated &#8212; it needs human-in-the-loop workflows to handle edge cases and feed corrections back into the model</p></li></ul><p>This is not background infrastructure that someone configures once and forgets. It&#8217;s an ongoing data capability &#8212; closer in character to a production ML system than to a semantic layer definition. Teams that run it like a production ML system will find their context layers actually working. Teams that treat it as a one-off definition will find their agents confidently wrong.</p><p><strong>What This Looks Like When Done Right</strong></p><p>Two examples illustrate what the context layer vision looks like when the entity resolution component is actually implemented well.</p><p>Fortnum &amp; Mason, the 300-year-old luxury British retailer, had customer data fragmented across restaurant bookings, email signups, online transactions, and in-store purchases. They tried a third-party resolution service first &#8212; it created non-persistent identifiers, offered limited visibility, and raised privacy concerns about sending their entire customer dataset externally. 
<a href="https://www.zingg.ai/case-studies/building-a-composable-cdp-using-entity-resolution-at-fortnum-mason">By implementing Zingg natively in their Databricks environment</a>, they built persistent, unified customer identifiers across all touchpoints, with full control over the matching process and privacy-compliant resolution that kept sensitive data in their own infrastructure. For the first time in their history, they could understand how customers were shopping across every channel and devise clienteling and personalisation around that. That&#8217;s not a BI question answered better. That&#8217;s a question that was previously unanswerable and an action that could not be done earlier.</p><p>Orthodox Union, operating over 40 websites and 5 mobile applications across disparate CRMs, used <a href="https://www.zingg.ai/case-studies/understanding-household-and-person-relationships-using-entity-resolution-on-snowflake">Zingg on Snowflake</a> to power their golden records &#8212; resolving not just individual identities but household relationships across their entire digital estate. Their agents now reason about constituents as whole people with family connections, not as disconnected records across systems.</p><p>These outcomes are exactly what the context layer is promising. They require taking the entity resolution component as seriously as the metric definition component &#8212; which means treating it as an engineering discipline, not a checkbox.</p><p><strong>The Architecture That Makes This Real</strong></p><p>The a16z piece lays out a thoughtful five-step architecture: access the right data, build context automatically, refine with human input, connect to agents, keep it self-updating. That framework is right. 
The entity resolution layer sits at the foundation of step two &#8212; and its quality determines the ceiling for everything above it.</p><p>Concretely: before an agent reasons about revenue, lifetime value, churn risk, or upsell opportunity, it needs a stable, persistent entity graph that tells it who &#8220;this customer&#8221; actually is across every system. That graph needs to be built where the data lives, continuously maintained, and grounded in human judgement for edge cases.</p><p>Only on top of this identity foundation does the rest of the context layer &#8212; the metric definitions, the table routing, the tribal knowledge &#8212; become genuinely trustworthy. Knowing what &#8220;revenue&#8221; means is important. Knowing that the revenue from &#8220;Acme Corp&#8221; and &#8220;ACME Inc.&#8221; should be attributed to the same customer is what makes the answer actually right.</p><p><strong>The Real Payoff</strong></p><p>As I argued in <a href="https://www.learningfromdata.zingg.ai/p/the-data-teams-moment-how-agentic">The Data Team&#8217;s Moment</a>, AI doesn&#8217;t paper over data problems &#8212; it runs them at scale, in production, with consequences. A customer success agent operating on a fragmented identity layer doesn&#8217;t just give one bad recommendation; it systematically misjudges an entire customer segment, at machine speed, before anyone notices. The autonomous nature of agents is what makes the identity foundation so critical &#8212; errors compound rather than get caught.</p><p>But the inverse is equally true, and more exciting: an agent operating on a trustworthy identity graph can answer questions that were structurally impossible before. 
Not because the model got smarter, but because the data it&#8217;s reasoning on finally reflects reality.</p><p>AI agents that can tell you which accounts are actually the same company, which of those companies are in an expansion window, which customers share a behavioural signature with cohorts that churned six months ago &#8212; that&#8217;s the payoff the context layer is reaching for. The revenue-growth question is the proof of concept. The customer intelligence actions are the business transformation.</p><p>The context layer is necessary. Entity resolution is what makes it sufficient. And the questions worth asking, and the agents worth running &#8212; about customers, not metrics &#8212; are what make the whole thing worth building.</p><p><em>Building on the a16z post <a href="https://a16z.com/your-data-agents-need-context/">&#8220;Your Data Agents Need Context&#8221;</a> by Jason Cui and Jennifer Li. Further reading: <a href="https://www.learningfromdata.zingg.ai/p/the-data-teams-moment-how-agentic">The Data Team&#8217;s Moment</a> and <a href="https://www.learningfromdata.zingg.ai/p/the-identity-crisis-why-getting-the">The Identity Crisis</a>. If you&#8217;re building the entity resolution layer, <a href="https://www.zingg.ai/">Zingg</a> is open source and <a href="https://docs.zingg.ai/">fully documented</a>.</em></p>]]></content:encoded></item><item><title><![CDATA[The Data Team’s Moment: How Agentic AI Is Turning the “Support Function” Into the Strategic Center of the Enterprise]]></title><description><![CDATA[For years, data teams were the quiet backbone nobody appreciated. 
That is changing &#8212; fast.]]></description><link>https://www.learningfromdata.zingg.ai/p/the-data-teams-moment-how-agentic</link><guid isPermaLink="false">https://www.learningfromdata.zingg.ai/p/the-data-teams-moment-how-agentic</guid><dc:creator><![CDATA[Sonal Goyal]]></dc:creator><pubDate>Sat, 07 Mar 2026 18:21:26 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!32x6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa153f584-cf5a-480f-9b05-468816237899_1024x608.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!32x6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa153f584-cf5a-480f-9b05-468816237899_1024x608.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!32x6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa153f584-cf5a-480f-9b05-468816237899_1024x608.png 424w, https://substackcdn.com/image/fetch/$s_!32x6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa153f584-cf5a-480f-9b05-468816237899_1024x608.png 848w, https://substackcdn.com/image/fetch/$s_!32x6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa153f584-cf5a-480f-9b05-468816237899_1024x608.png 1272w, https://substackcdn.com/image/fetch/$s_!32x6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa153f584-cf5a-480f-9b05-468816237899_1024x608.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!32x6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa153f584-cf5a-480f-9b05-468816237899_1024x608.png" width="1024" height="608" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a153f584-cf5a-480f-9b05-468816237899_1024x608.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:608,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!32x6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa153f584-cf5a-480f-9b05-468816237899_1024x608.png 424w, https://substackcdn.com/image/fetch/$s_!32x6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa153f584-cf5a-480f-9b05-468816237899_1024x608.png 848w, https://substackcdn.com/image/fetch/$s_!32x6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa153f584-cf5a-480f-9b05-468816237899_1024x608.png 1272w, https://substackcdn.com/image/fetch/$s_!32x6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa153f584-cf5a-480f-9b05-468816237899_1024x608.png 1456w" sizes="100vw" fetchpriority="high"></picture>
</div></a></figure></div><p>There&#8217;s a meeting that happens in every organization, usually sometime in Q3 or Q4. A CFO pulls up the budget spreadsheet, squints at the line items, and asks the question that makes every data leader&#8217;s stomach drop: <em>&#8220;What exactly are we getting out of the data team?&#8221;</em></p><p>It&#8217;s an uncomfortable question &#8212; and for a long time, it didn&#8217;t have a clean answer.</p><p>Data teams built pipelines nobody asked about, maintained dashboards that were exported to Excel before anyone actually read them, and spent political capital defending headcount to executives who viewed them as a sophisticated cost center. 
As Joe Reis, author of <em>Fundamentals of Data Engineering</em>, <a href="https://joereis.substack.com/p/joes-nerdy-weekend-reads-11">put it plainly</a>: <em>&#8220;Data teams will continue struggling to justify their value as long as data in Enterpriseland is a back-office function.&#8221;</em></p><p>That was the reality for most of the 2010s and into the early 2020s. Data teams worked in the shadows &#8212; vital infrastructure, but rarely celebrated. They were the plumbers of the modern enterprise: essential when things worked, blamed when things didn&#8217;t, and invisible the rest of the time.</p><p>The era of agentic AI is changing all of that &#8212; right now, in real time.</p><div><hr></div><h2>The Old Indignity: Fighting for a Seat at the Table</h2><p>The data ROI conversation has long been a recurring humiliation dressed up as a business exercise.</p><p>Unlike sales, whose numbers are on the board every Monday morning, or marketing, which can point to impressions and conversions, data teams have operated in the murky middle. Their impact is indirect &#8212; they help <em>other</em> teams make decisions. As data writer Anna Geller <a href="https://annageller.medium.com/should-you-measure-the-value-of-a-data-team-95c447f28d4a">captured it well</a>: <em>&#8220;Most data teams work as a support function... their involvement in value creation is indirect. You can&#8217;t directly quantify the impact of a new table, dashboard, or pipeline.&#8221;</em></p><p>This indirectness has been weaponized against them at budget time. Data leaders find themselves in a paradoxical position: the team responsible for quantification can&#8217;t quantify itself. Some resort to building internal P&amp;L reports to justify their own existence. 
Others try charging other departments &#8220;internal fees&#8221; for data services &#8212; a workaround that says more about the dysfunction of how data is valued than it does about the team&#8217;s actual contribution.</p><p>Meanwhile, data stacks keep growing enormous. As <a href="https://www.montecarlodata.com/blog-the-next-big-crisis-for-data-teams/">Monte Carlo&#8217;s blog noted</a>, <em>&#8220;data teams made history as one of the first departments to spin up 8-figure technology stacks with little to no questions asked&#8221;</em> &#8212; a fact that makes them conspicuous targets whenever CFOs start scrutinizing every line item with new intensity.</p><p>The modern data stack is a marvel of engineering. But without a clear, legible link to revenue, it remains a liability in any budget conversation. At least, it has been &#8212; until now.</p><div><hr></div><h2>The Shift Underway: AI Agents Are Looking for a Nervous System</h2><p>Here&#8217;s the thing about agentic AI: it is only as good as the data it runs on.</p><p>This seems obvious, but its implications are still rippling through the enterprise in real time. When a company deploys an AI agent to autonomously manage customer service tickets, optimize supply chains, approve procurement requests, or run financial reconciliations, every decision that agent makes is being built on a data foundation. If that foundation is shaky &#8212; inconsistent definitions, poor lineage, unresolved duplication &#8212; the agent doesn&#8217;t just underperform. It fails loudly, in production, with consequences.</p><p>Robin Sutara, Field CDO at Databricks, is making this organizational implication explicit in <a href="https://www.databricks.com/blog/strategic-priorities-data-and-ai-leaders-2025">Databricks&#8217; 2025 strategic priorities report</a>: <em>&#8220;A successful AI strategy starts with a solid infrastructure. 
Addressing fundamental components like data unification and governance through one underlying system lets organizations focus their attention on getting use cases into the real-world, where they can actually drive value for the business.&#8221;</em></p><p>The work that data teams have been doing for years &#8212; data quality, entity resolution, governance, lineage, pipeline reliability &#8212; is no longer background noise. It is becoming the prerequisite for the most important initiative in every company.</p><p>The budget conversation is flipping. It&#8217;s no longer <em>&#8220;justify why we need you.&#8221;</em> It&#8217;s <em>&#8220;how fast can you get us AI-ready?&#8221;</em></p><div><hr></div><h2>The Numbers Bearing This Out</h2><p>The scale of the agentic AI wave is making clear why data teams are starting to matter in a way they simply haven&#8217;t before.</p><p><a href="https://www.gartner.com/en/newsroom/press-releases/2025-08-26-gartner-predicts-40-percent-of-enterprise-apps-will-feature-task-specific-ai-agents-by-2026-up-from-less-than-5-percent-in-2025">Gartner predicts</a> that 40% of enterprise applications will be integrated with task-specific AI agents by the end of 2026 &#8212; up from less than 5% today. <a href="https://www.gartner.com/en/articles/intelligent-agent-in-ai">By 2028</a>, at least 15% of work decisions across organizations will be made autonomously by AI agents, up from virtually zero in 2024. And Gartner&#8217;s best-case projection suggests agentic AI could drive approximately 30% of enterprise application software revenue by 2035, surpassing $450 billion.</p><p>The old framing &#8212; tools automate tasks, people make decisions &#8212; is no longer holding. 
And the people who build, govern, and maintain the data powering this new paradigm are moving from the back office to the top of the boardroom agenda.</p><div><hr></div><h2>The Verizon Test Case</h2><p>Few leaders are articulating this shift as clearly as Kalyani Sekar, Verizon&#8217;s Chief Data Officer. In a <a href="https://www.cdomagazine.tech/data-management/how-verizon-built-a-trusted-data-foundation-for-ai-a-cdos-inside-look">2025 interview with CDO Magazine</a>, she describes the arc of her team&#8217;s work &#8212; and its growing centrality &#8212; with unusual candor:</p><p><em>&#8220;Data quality is foundational to Verizon&#8217;s long-term goals. As we continue to roll out more analytical use cases, from predictive and prescriptive to generative and agentic, it&#8217;s critical that these AI and analytical capabilities, including the reports built on them, are grounded in data that is trustworthy and of the highest quality.&#8221;</em></p><p>Verizon handles petabytes of data daily &#8212; from network devices, customer touchpoints, service channels. The data team&#8217;s job was always to make sense of it and surface insights. Today, that same infrastructure is becoming the operating layer for autonomous AI systems making real-time decisions affecting 140 million wireless retail connections.</p><p>The data team isn&#8217;t changing. <strong>The stakes around its work are.</strong></p><div><hr></div><h2>From Cost Center to Control Plane</h2><p><a href="https://www.mckinsey.com/capabilities/people-and-organizational-performance/our-insights/the-agentic-organization-contours-of-the-next-paradigm-for-the-ai-era">McKinsey&#8217;s 2025 research on the &#8220;agentic organization&#8221;</a> is describing what&#8217;s unfolding as the largest organizational paradigm shift since the industrial revolution. The key insight for data leaders: in this new world, data quality cannot remain a background maintenance task. 
It needs to be the foundation every AI agent is built on &#8212; before a single workflow goes live.</p><p>Nowhere is this more visceral than in entity resolution &#8212; the problem of making sure your AI actually knows who it&#8217;s dealing with.</p><p>I have been <a href="https://www.learningfromdata.zingg.ai/p/agentic-ai-is-only-as-smart-as-its">watching this play out in real time</a> across industries: <em>&#8220;Agentic AI is the race car. Entity resolution is the track. You wouldn&#8217;t run a Formula 1 machine on gravel and expect to win.&#8221;</em></p><p>The examples are stark. In retail, duplicate customer records lead autonomous agents to dispatch multiple shipments, issue multiple offers, and trigger multiple refunds &#8212; quietly eroding margins at scale. In finance, fraudsters opening several accounts with small name variations go undetected because the fraud-detection AI treats each record as a separate person. In healthcare, a patient known as &#8220;Mary Chen&#8221; at one hospital and &#8220;M. Y. Chen&#8221; at a clinic becomes two different people in the AI&#8217;s world &#8212; and an autonomous care coordinator misses her allergy record before booking a procedure.</p><p>In each case, the AI wasn&#8217;t making bad decisions. It was making confident decisions on bad data. And because agentic systems are autonomous, they don&#8217;t make that mistake once &#8212; they repeat it hundreds or thousands of times before anyone notices.</p><p>Tools like Zingg &#8212; which resolve entities at scale across CRMs, EHRs, billing systems, and data lakes, running natively on Snowflake and Databricks &#8212; are becoming the kind of foundational data infrastructure that determines whether an AI deployment succeeds or silently fails. One bank that placed Zingg&#8217;s entity resolution layer beneath its fraud-detection AI saw fraud detection improve 4x &#8212; without changing the AI model at all. The model didn&#8217;t get smarter. 
The data it was reasoning on did.</p><p>That&#8217;s the new pitch data leaders are making &#8212; and it&#8217;s a powerful one. AI doesn&#8217;t paper over your data problems. It runs them at scale, in production, with consequences. Every dollar not being invested in data quality is a growing tax on every AI initiative downstream.</p><div><hr></div><h2>What This Means for Data Leaders Right Now</h2><p>The opportunity is real and it is opening &#8212; but it requires a change in posture. Data teams have spent years learning to survive by making themselves useful to whoever controls the budget. The new play is positioning as the precondition for everything the business cares about.</p><p>Some practical realities for data leaders navigating this moment:</p><p><strong>Your backlog is becoming a risk register.</strong> Every unresolved data quality issue, every undocumented pipeline, every inconsistent metric definition is a live failure point for an AI agent that will execute on bad information at machine speed. Frame your roadmap accordingly.</p><p><strong>Entity resolution is becoming a first-class AI problem.</strong> Before an AI agent can act intelligently, it needs to know <em>who</em> it is acting on &#8212; and right now, most enterprise data lakes are full of the same customer, supplier, or patient under a dozen slightly different names. Deduplicating and resolving those identities isn&#8217;t a one-time cleanup job anymore; it&#8217;s an ongoing data foundation that every agentic workflow sits on top of. Teams that are investing in entity resolution infrastructure &#8212; using tools like <a href="https://www.zingg.ai/">Zingg</a> to continuously resolve identities across CRMs, EHRs, and data platforms &#8212; are finding that their AI deployments perform dramatically better, without touching the models at all. The intelligence was always there. 
The data just wasn&#8217;t ready for it.</p><p><strong>Data governance is becoming AI governance.</strong> The governance frameworks data teams have been building for years &#8212; classification rules, data retention policies, lineage tracking &#8212; are exactly what enterprises now need to keep AI agents grounded in reality.</p><p><strong>The CDO is becoming a critical executive.</strong> The function that once reported to the CIO is increasingly driving the AI strategy conversation alongside the CEO and CFO.</p><p>The <a href="https://www.getdbt.com/resources/state-of-analytics-engineering-2025">dbt Labs 2025 State of Analytics Engineering survey</a> is already showing the shift: 40% of data teams added headcount in the past year &#8212; compared to just 14% the year before. Data budgets are growing too, with 30% of respondents reporting increases versus just 9% in the prior year. The most common complaint is no longer budget cuts &#8212; it&#8217;s that the team can&#8217;t keep pace with the business&#8217;s appetite.</p><p>As dbt Labs CTO Mark Porter <a href="https://www.getdbt.com/blog/ai-driving-surge-in-data-budgets-2025-state-of-analytics-engineering">put it directly</a>: <em>&#8220;As companies increase AI investments, leaders are prioritizing the teams responsible for data quality and governance &#8212; the essential foundation for AI effectiveness.&#8221;</em></p><div><hr></div><h2>The Table Is Turning</h2><p>There&#8217;s something quietly satisfying about watching an industry vindicate itself not through persuasion, but through circumstance.</p><p>Data teams have spent years arguing that their work mattered. They built ROI frameworks and internal P&amp;Ls and political coalitions. They embedded with business units and learned to speak in revenue terms. Some of it worked. Most of it was an uphill grind.</p><p>Now, agentic AI is making the argument for them &#8212; not through rhetoric, but through dependency. 
The most transformative technology initiative of the decade cannot function without clean data, governed pipelines, and trustworthy infrastructure. And the people building and maintaining those things are, right now, becoming central to the enterprise.</p><p>The data team is no longer supporting the business.</p><p>The business is starting to run on the data team.</p><div><hr></div><p><em>If you found this useful, share it with a data leader who&#8217;s still spending Q4 justifying their existence to a skeptical CFO. Their moment is here &#8212; and it&#8217;s only getting bigger.</em></p>]]></content:encoded></item><item><title><![CDATA[The Real Threat to SaaS: It’s Not Vibe Coding. It’s Access To Your Data.]]></title><description><![CDATA[And why we built Zingg the way we did.]]></description><link>https://www.learningfromdata.zingg.ai/p/the-real-threat-to-saas-its-not-vibe</link><guid isPermaLink="false">https://www.learningfromdata.zingg.ai/p/the-real-threat-to-saas-its-not-vibe</guid><dc:creator><![CDATA[Sonal Goyal]]></dc:creator><pubDate>Wed, 11 Feb 2026 17:11:48 GMT</pubDate><content:encoded><![CDATA[<p>Vibe coding has everyone debating whether SaaS is dead.</p><p>One camp says: when you can build a CRM in an afternoon, why pay Salesforce $150 a seat? The other fires back: a POC is not a product. Security, compliance, scalability, ongoing maintenance &#8212; the real cost of DIY reveals itself fast.</p><p>Both sides have a point. But they&#8217;re arguing about the wrong thing.</p><p><strong>The real threat to SaaS isn&#8217;t that customers will build replacements. It&#8217;s that customers will demand their data back.</strong></p><h2>The New Requirement Nobody Saw Coming</h2><p>Most enterprises are not going to build their own CRM, their own support desk, or their own payroll system. That&#8217;s not happening. But they <em>are</em> going to build custom AI agents &#8212; agents tailored to their workflows, their industry, their competitive edge. 
And those agents are useless without access to the data that lives inside enterprise SaaS systems.</p><p>Think about what that means in practice. A procurement agent that can&#8217;t read your ERP. A sales agent that can&#8217;t query your CRM. A support agent that can&#8217;t see your ticketing history. You end up with AI that is powerful in theory and crippled in practice &#8212; not because the models aren&#8217;t good enough, but because the data is locked behind walls that the enterprise itself cannot breach.</p><p>This is the structural tension SaaS has quietly avoided for two decades. Data lock-in was never just a feature &#8212; it was <em>the moat</em>. Salesforce, Workday, ServiceNow, SAP built empires not just on software, but on the gravitational pull of accumulated enterprise data. The switching cost wasn&#8217;t the product. It was your own history, trapped inside someone else&#8217;s system.</p><p>AI is now making that lock-in impossible to ignore.</p><p>Enterprises building custom agents will demand data portability as a first-class requirement &#8212; not a nice-to-have, not a paid add-on, but a baseline expectation. The competitive pressure will shift toward SaaS vendors who open their data, and away from those who treat it as a proprietary asset. MCP is an early signal of this direction, but it is barely the beginning. Real openness means real-time access, structured APIs, granular permissions, and data that can move across systems without friction.</p><h2>Why We Built Zingg the Way We Did</h2><p>I want to share something personal here, because it&#8217;s directly relevant.</p><p>When we built <a href="https://zingg.ai/">Zingg</a> &#8212; an open source entity resolution and data matching tool &#8212; one of our earliest and most deliberate architectural decisions was this: <strong>Zingg would run natively inside the customer&#8217;s own environment. Their data would never move.</strong></p><p>Not to our servers. Not to a cloud we managed. 
Not through a pipeline that touched our infrastructure. Zingg would go to the data, not the other way around.</p><p>At the time, this felt like swimming against the tide. The SaaS playbook was clear: centralise everything, host it yourself, lock in the customer through the platform. Everyone was building managed services where customers uploaded their data and got results back. It was convenient. It was scalable. It was the default.</p><p>We chose differently &#8212; and we took some friction for it. Prospects would ask why they couldn&#8217;t just hand us their data and get clean results. Investors would ask why we weren&#8217;t building a managed pipeline. The market was optimised for data movement, and we were building for data residency.</p><p>Our reasoning was straightforward. Entity resolution deals with some of the most sensitive data an enterprise has &#8212; customer records, patient data, financial identities, supplier profiles. The enterprises that needed this most were also the ones with the strictest data governance requirements: banks, healthcare providers, insurers, government agencies. For them, sending data to a third-party system wasn&#8217;t a preference issue. It was a compliance wall.</p><p>Beyond compliance, there was a deeper principle at work. <strong>The enterprise owns its data. The enterprise should control its data.</strong> A tool that requires data to leave the enterprise in order to function is not really serving the enterprise &#8212; it&#8217;s holding the enterprise&#8217;s data hostage to deliver a service.</p><h2>What We Learned From Building This Way</h2><p>Running natively inside the customer&#8217;s environment changed everything about how we thought about the product.</p><p>It forced us to build something that was genuinely portable &#8212; that could run on on-premise Spark clusters, on Databricks, on AWS EMR, on GCP, on Snowflake, wherever the customer&#8217;s data actually lived. 
It couldn&#8217;t be opinionated about infrastructure, because the whole point was to meet the data where it was.</p><p>It meant our integration surface was the data itself &#8212; structured, well-defined, queryable &#8212; not a proprietary pipeline that the customer had to trust and maintain.</p><p>And it gave customers something that is genuinely rare in enterprise software: full control and transparency. They could see exactly what Zingg was doing with their data, because Zingg was running on their machines, under their monitoring, in their environment, at their schedule.</p><p>There was also a dimension we hadn&#8217;t fully appreciated until customers started telling us: <strong>it was significantly faster AND cheaper.</strong> When your tool runs inside the enterprise&#8217;s existing environment, the enterprise leverages infrastructure it has already paid for &#8212; compute clusters, data quality frameworks, observability stacks. There is no new data pipeline to design, build, or maintain. There is no ongoing cost of extracting data, shipping it somewhere, and syncing it back. <strong>No ETL tax</strong>. No vendor markup on compute. The enterprise uses what it already has, and the tool fits around it. For large organisations running entity resolution at scale, this wasn&#8217;t a marginal saving &#8212; it was a fundamental difference in total cost of ownership.</p><p>Interestingly, this also made Zingg stickier in a different way. Not because we had the data &#8212; we never did &#8212; but because the customer&#8217;s confidence in the product grew over time. 
</p><h2>Why This Matters Now More Than Ever</h2><p>Fast forward to today, and the AI agent conversation is making this architectural principle mainstream.</p><p>Every enterprise that wants to build custom agents is asking the same question we were thinking about years ago: <em>how do I give my AI access to my data without giving someone else access to my data?</em></p><p>The answer is the same one we arrived at for Zingg. The data should never need to leave.</p><p>This is what makes the current MCP momentum so significant. MCP is a protocol that lets AI agents communicate with data systems without requiring data to be extracted, copied, or centralised. It is, in essence, a formalisation of the principle we built Zingg on: bring the compute to the data, not the data to the compute.</p><p>But MCP alone is not enough. A faithful MCP implementation means more than exposing an endpoint &#8212; it means real-time access, granular permissions, data that reflects the current state of the system, and APIs that are robust enough to support production AI workloads, not just demos.</p><p>The enterprises of the next decade will not tolerate data opacity. They have regulators demanding audit trails. They have security teams demanding residency guarantees. And now they have AI ambitions that demand data access. These forces are all pointing in the same direction.</p><p>SaaS vendors who open their data thoughtfully &#8212; with proper permissioning, auditability, and reliability &#8212; will find that openness builds a different kind of loyalty. One that is harder to compete with, because it&#8217;s based on trust rather than captivity.</p><p>The vendors who resist will find themselves on the wrong side of procurement conversations as AI becomes a standard enterprise requirement. 
The question won&#8217;t be &#8220;does your product have AI features?&#8221; It will be &#8220;can my AI agents use your data?&#8221;</p><p>If the answer is no, or not without significant friction, the deal will start to look elsewhere.</p><h2>The Output Was Always the Point</h2><p>There is one more dimension to this that feels especially relevant now.</p><p>Entity resolution is not a peripheral problem. It is one of the most foundational challenges in enterprise data &#8212; the question of whether the &#8220;Acme Corp&#8221; in your CRM is the same as the &#8220;Acme Corporation&#8221; in your ERP, whether the customer in your support desk is the same person as the one in your billing system, whether your supplier records are duplicated across three different procurement tools. Resolved, unified data is the prerequisite for almost everything else an enterprise wants to do with its data. Analytics, reporting, compliance, machine learning &#8212; none of it works well on dirty, fragmented, unresolved data.</p><p>This is why we were always deliberate about how Zingg surfaces its output. We didn&#8217;t just want Zingg to solve the matching problem &#8212; we wanted the resolved data to be genuinely portable, consumable by whatever system or workflow the enterprise chose to use next. Clean, resolved entities written back to the data store of the enterprise&#8217;s choice, in formats that fit naturally into existing pipelines, without any proprietary wrapper that would create a new dependency.</p><p>We did not see AI agents coming when we made that decision. But in hindsight, it was exactly the right call. A resolved, unified, trustworthy view of an enterprise&#8217;s core entities &#8212; customers, suppliers, products, locations &#8212; is precisely the kind of context foundation that AI agents need to function well. An agent working from fragmented, duplicated, unresolved data will produce fragmented, unreliable answers. 
An agent working from a clean, resolved data layer can actually be trusted.</p><p>We built Zingg to ensure its output could be consumed anywhere the enterprise wanted. It turns out that &#8220;anywhere the enterprise wants&#8221; now includes AI agents, RAG pipelines, and real-time reasoning systems we couldn&#8217;t have imagined when we started. The principle held.</p><div><hr></div><h2>A Principle Worth Standing Behind</h2><p>When we made the architectural choice for Zingg, we weren&#8217;t predicting AI agents. We were just trying to build something enterprises could actually trust with their most sensitive data.</p><p>But the principle was right then, and it is right now: <strong>data should live where the enterprise decides it lives. Tools &#8212; whether they are matching algorithms or AI agents &#8212; should go to the data, not the other way around.</strong></p><p>The market is catching up to this idea. And those who embrace it, rather than fight it, will be the ones who matter in the next era of enterprise software.</p><div><hr></div><p><em>Zingg is an open source entity resolution framework that runs natively in your environment &#8212; no data movement required. 
You can find it at <a href="https://github.com/zinggAI/zingg">github.com/zinggAI/zingg</a>.</em></p>]]></content:encoded></item><item><title><![CDATA[The Identity Crisis: Why “Getting the Basics Right” Is the Hardest Work in Data]]></title><description><![CDATA[And it is actually world-class data engineering!]]></description><link>https://www.learningfromdata.zingg.ai/p/the-identity-crisis-why-getting-the</link><guid isPermaLink="false">https://www.learningfromdata.zingg.ai/p/the-identity-crisis-why-getting-the</guid><dc:creator><![CDATA[Sonal Goyal]]></dc:creator><pubDate>Thu, 23 Oct 2025 14:56:40 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!BhOG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50cd106c-5e59-447e-b079-dd5e032d0193_1024x608.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!BhOG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50cd106c-5e59-447e-b079-dd5e032d0193_1024x608.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!BhOG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50cd106c-5e59-447e-b079-dd5e032d0193_1024x608.png 424w, https://substackcdn.com/image/fetch/$s_!BhOG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50cd106c-5e59-447e-b079-dd5e032d0193_1024x608.png 848w, https://substackcdn.com/image/fetch/$s_!BhOG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50cd106c-5e59-447e-b079-dd5e032d0193_1024x608.png 
1272w, https://substackcdn.com/image/fetch/$s_!BhOG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50cd106c-5e59-447e-b079-dd5e032d0193_1024x608.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!BhOG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50cd106c-5e59-447e-b079-dd5e032d0193_1024x608.png" width="1024" height="608" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/50cd106c-5e59-447e-b079-dd5e032d0193_1024x608.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:608,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!BhOG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50cd106c-5e59-447e-b079-dd5e032d0193_1024x608.png 424w, https://substackcdn.com/image/fetch/$s_!BhOG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50cd106c-5e59-447e-b079-dd5e032d0193_1024x608.png 848w, https://substackcdn.com/image/fetch/$s_!BhOG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50cd106c-5e59-447e-b079-dd5e032d0193_1024x608.png 1272w, 
https://substackcdn.com/image/fetch/$s_!BhOG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50cd106c-5e59-447e-b079-dd5e032d0193_1024x608.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">laying the foundation </figcaption></figure></div><p>Matthew Niederberger, Founder of Martech Therapy recently <a href="https://martech.org/why-chasing-shiny-cdp-features-leaves-marketers-feeling-like-imposters/">wrote</a> something that stopped me in my tracks:</p><blockquote><p>&#8220;If you&#8217;re a practitioner still trying to 
get a unified customer ID working across channels, you feel left behind technologically and question your competence.&#8221;</p></blockquote><p>This single sentence captures a crisis that&#8217;s quietly plaguing data engineering teams across the industry. While vendor booths showcase AI-driven personalization and predictive customer journeys, talented practitioners are privately wrestling with something supposedly &#8220;simpler&#8221;: building a unified customer ID that actually works.</p><p>The cruel irony? They&#8217;re doing the hardest, most important work. But in a market obsessed with bleeding-edge features, nobody&#8217;s celebrating it.</p><h2>The Vendor Narrative Gap</h2><p>So why do talented teams feel like they&#8217;re failing when they&#8217;re still working on identity?</p><p>Matthew Niederberger nails the dynamic: vendors showcase what&#8217;s possible, not what&#8217;s required to get there.</p><p>Vendor conference presentations follow a formula:</p><ol><li><p>Demonstrate the amazing outcomes (real-time personalization! predictive journeys!)</p></li><li><p>Show the sleek UI that makes it look effortless</p></li><li><p>Mention &#8220;unified customer data&#8221; as an assumed prerequisite</p></li><li><p>Move quickly past the foundation to focus on the flashy features</p></li></ol><p>The implicit message: if you&#8217;re still working on unified identity, you&#8217;re behind. Everyone else has already solved this.</p><p>But that&#8217;s not remotely true. The vast majority of organizations are still struggling with identity resolution. They&#8217;re just not talking about it publicly because it feels like admitting failure.</p><p>The result is a massive selection bias. We hear about the 5% who&#8217;ve moved on to advanced use cases. 
We don&#8217;t hear about the 95% still building foundations&#8212;not because they&#8217;re incompetent, but because <strong>the foundations are genuinely difficult to build well.</strong></p><h2>The Imposter Syndrome Epidemic</h2><p>I&#8217;ve had countless conversations with data teams at conferences, in Slack channels, and on video calls. A pattern emerges again and again:</p><p>&#8220;We&#8217;re still working on identity resolution.&#8221;<br>&#8220;We haven&#8217;t gotten to the AI stuff yet.&#8221;<br>&#8220;We&#8217;re just trying to get our customer IDs right first.&#8221;</p><p>The word &#8220;just&#8221; does a lot of heavy lifting in those sentences. It diminishes foundational engineering work into a checkbox, a prerequisite you should have already mastered before moving on to the &#8220;real&#8221; work.</p><p>But here&#8217;s what nobody is saying loudly enough: <strong>unified identity isn&#8217;t a prerequisite. It&#8217;s the final boss of data engineering.</strong></p><h2>Why Unified Identity Is Actually Hard</h2><p>Let&#8217;s be brutally honest about what &#8220;getting a unified customer ID working across channels&#8221; actually requires:</p><h3>The Data Quality Reality</h3><p>Real-world customer data is messy in ways that mock deterministic rules:</p><ul><li><p>Sarah Johnson gets married and becomes Sarah Martinez</p></li><li><p>john.smith@gmail.com and jsmith@gmail.com might be the same person (or different people with similar names)</p></li><li><p>Phone numbers get recycled&#8212;last year&#8217;s customer is this year&#8217;s prospect</p></li><li><p>Addresses have hundreds of valid formatting variations</p></li><li><p>People move, change jobs, update preferences</p></li><li><p>Data entry errors create systematic inconsistencies</p></li></ul><p>A unified customer ID system has to handle all of this, at scale, while maintaining high accuracy. Miss too many matches and your &#8220;360-degree customer view&#8221; is fragmented. 
Create false positives and you&#8217;re merging different people, violating privacy and destroying trust.</p><h3>The Technical Challenge</h3><p>You&#8217;re not matching records in a single clean database. You&#8217;re resolving identities across:</p><ul><li><p>Web analytics (cookie-based, increasingly unreliable)</p></li><li><p>Mobile apps (device IDs, app-specific identifiers)</p></li><li><p>CRM systems (email, phone, account IDs)</p></li><li><p>Point-of-sale systems (loyalty cards, payment methods)</p></li><li><p>Call center interactions (phone numbers, case IDs)</p></li><li><p>Marketing platforms (each with their own identifier schemes)</p></li><li><p>Third-party data sources (with their own quality issues)</p></li></ul><p>Each source has different identifier types, different data quality standards, and different update frequencies. You&#8217;re not just joining tables&#8212;you&#8217;re probabilistically determining which records across fundamentally different systems represent the same human being.</p><h3>The Scale Problem</h3><p>Now multiply this across millions or billions of records. The naive approach&#8212;comparing every record to every other record&#8212;is computationally impossible. You need sophisticated blocking strategies, efficient algorithms, and infrastructure that can process massive datasets without collapsing under its own weight.</p><p>And you need to do this continuously, not as a one-time batch job. New data arrives constantly. Identities evolve. Your resolution system has to keep pace.</p><h3>The Governance Complexity</h3><p>Layer on consent management, privacy regulations, data retention policies, and the right-to-be-forgotten requirements. 
Your unified ID system isn&#8217;t just a technical challenge&#8212;it&#8217;s a governance framework that has to work across jurisdictions, respect customer preferences, and provide audit trails.</p><p><strong>This isn&#8217;t &#8220;basic.&#8221; This is some of the most complex data engineering work you can do.</strong></p><h2>The Three-Layer Approach That Actually Works</h2><p>So how do you actually build identity resolution that works at warehouse scale? Let me break down the architecture that successful implementations share:</p><h3>Layer 1: Blocking</h3><p>You can&#8217;t compare every record to every other record&#8212;that&#8217;s O(n&#178;) complexity and computationally infeasible at billion-record scale. Smart blocking reduces potential comparisons by 99% or more while maintaining high recall.</p><p>The key is choosing blocking keys that cast a wide enough net to catch true matches while being selective enough to make comparison tractable. This might mean blocking on:</p><ul><li><p>First three characters of last name + ZIP code</p></li><li><p>Phone area code + first name</p></li><li><p>Email domain + approximate birth date</p></li><li><p>Phonetic encoding of names</p></li></ul><p>Good blocking strategies use multiple passes with different keys, ensuring that records with typos or variations still get compared. At Zingg, we learn the blocking function from the user&#8217;s data and labels, using all the fields present, so that different datasets can each be blocked efficiently.</p><h3>Layer 2: Matching</h3><p>This is where ML earns its keep. 
For each pair of records that passed blocking, you need to score the likelihood that they represent the same entity.</p><p>Modern matching models use features like:</p><ul><li><p>String similarity (edit distance, Jaro-Winkler, token-based matching)</p></li><li><p>Phonetic similarity (Soundex, Metaphone, NYSIIS)</p></li><li><p>Behavioral patterns (purchase timing, channel preferences, engagement signals)</p></li><li><p>Temporal proximity (how close in time were interactions?)</p></li><li><p>Geographic consistency (do locations make sense together?)</p></li></ul><p>The output isn&#8217;t binary&#8212;it&#8217;s probabilistic scoring. This pair has a 0.95 probability of being a match. That pair scores 0.45. This gives you flexibility in tuning precision vs. recall based on your use case.</p><h3>Layer 3: Clustering</h3><p>The final challenge: transform pairwise match scores into entity groups. This is where you resolve transitive relationships.</p><p>If Record A matches Record B with 0.92 confidence, and Record B matches Record C with 0.88 confidence, are A, B, and C all the same entity? What if A and C, compared directly, would score only 0.65?</p><p>Clustering algorithms handle these transitivity questions, creating stable entity groups even as new records arrive and scores shift over time. This layer also handles entity splits (when you discover you incorrectly merged two people) and entity merges (when new evidence shows records you thought were separate are actually one entity).</p><h2>Why This Must Happen in the Warehouse and Data Lake</h2><p>Here&#8217;s a critical architectural decision: entity resolution must run natively in your data warehouse or data lake. Moving billions of records out for processing defeats the entire purpose of warehouse-native architecture.</p><h3>Data Gravity</h3><p>Your customer data is already in Snowflake, Databricks, BigQuery, or Redshift. 
That&#8217;s where your transformation pipelines run, where your analytics queries execute, where your data science models train. Extracting data to a separate entity resolution system creates:</p><ul><li><p>Data movement costs (bandwidth, storage, processing)</p></li><li><p>Synchronization complexity (keeping multiple copies current)</p></li><li><p>Latency (time to move data before resolution can happen)</p></li><li><p>New data silos (exactly what warehouse consolidation was supposed to eliminate)</p></li></ul><h3>Incremental Processing</h3><p>Entity resolution isn&#8217;t a one-time batch job. New customer records arrive constantly&#8212;from web signups, mobile registrations, purchase transactions, support interactions. You need incremental processing that resolves new records against existing entities without full dataset re-processing.</p><p>This only works efficiently when resolution runs where the data lives, using the warehouse&#8217;s native capabilities for incremental updates and change data capture.</p><h3>Integration with Data Pipelines</h3><p>Your entity resolution system needs to integrate seamlessly with:</p><ul><li><p>dbt transformations and data modeling</p></li><li><p>Reverse ETL for audience activation</p></li><li><p>BI tools for reporting and analytics</p></li><li><p>ML platforms for model training</p></li><li><p>Data quality frameworks for monitoring</p></li><li><p>AI agents for context</p></li></ul><p>This integration is natural when everything operates in the same environment. 
It becomes painful when entity resolution is a separate system with its own APIs, data formats, and operational requirements.</p><h3>Governance and Security</h3><p>Your data warehouse already has:</p><ul><li><p>Access controls and role-based permissions</p></li><li><p>Audit logging for compliance</p></li><li><p>Encryption at rest and in transit</p></li><li><p>Data masking and tokenization capabilities</p></li><li><p>Retention policies and deletion workflows</p></li></ul><p>Replicating customer data to an external entity resolution system means duplicating all of this governance infrastructure. It becomes a privacy and compliance nightmare with GDPR, CCPA and other applicable laws. Run resolution in the warehouse, and you inherit these controls automatically. </p><h2>The Human-in-the-Loop Problem</h2><p>Here&#8217;s what the automation-obsessed narrative misses: pure ML automation for entity resolution rarely works in practice.</p><h3>Why Full Automation Fails</h3><p>Even the best ML models make mistakes on edge cases:</p><ul><li><p>Common names with similar addresses (are these two John Smiths in the same apartment building one person or two?)</p></li><li><p>Shared email addresses (family members, assistants, corporate accounts)</p></li><li><p>Identity changes (marriages, name corrections, gender transitions)</p></li><li><p>Data entry errors that create systematic patterns the model learns incorrectly</p></li><li><p>Inadequate data capture</p></li></ul><p>You need human judgment on ambiguous cases, and you need to feed that judgment back into the model to improve over time.</p><h3>Workflows That Actually Work</h3><p>Successful implementations build workflows for:</p><p><strong>Training data labeling</strong>: Which record pairs are matches, which are non-matches? 
Your initial model needs labeled examples, and getting these labels right determines everything downstream.</p><p><strong>Threshold tuning</strong>: ML models output probability scores, but you choose the threshold. Say a score above 0.90 is an automatic match and anything below 0.40 is an automatic non-match; what about the 0.65 that could go either way? Human review helps calibrate these thresholds based on business impact.</p><p><strong>Edge case review</strong>: The model flags cases it&#8217;s least confident about. A data steward reviews them and makes decisions, and those decisions become training examples for model tuning.</p><p><strong>Continuous model refinement</strong>: Data patterns evolve. New data sources get added. Data quality issues emerge and get fixed. The model needs periodic retraining, and human feedback on recent predictions drives that retraining.</p><p><strong>Dispute resolution</strong>: When downstream teams question why records were merged or split, you need workflows to investigate, explain the decision, and potentially override the model when it&#8217;s wrong.</p><p>The best systems don&#8217;t eliminate human review&#8212;they make it efficient. Instead of manually reviewing millions of records, you review the hundreds that matter most.</p><h2>Common Implementation Mistakes</h2><p>I&#8217;ve seen these patterns repeatedly as organizations tackle identity resolution:</p><h3>Mistake 1: Starting Too Big</h3><p>Teams try to resolve all entity types simultaneously&#8212;customers, accounts, products, locations, employees. This creates enormous scope, makes success criteria fuzzy, and delays time to value.</p><p><strong>Better approach</strong>: Start with one high-value entity type (usually customers). Prove the value, build organizational capability, then expand to other entity types. 
Each new entity type gets easier because you&#8217;ve learned the patterns.</p><h3>Mistake 2: Ignoring Data Quality</h3><p>Entity resolution surfaces every data quality issue you&#8217;ve been ignoring. If 40% of your email addresses are null or invalid, you can&#8217;t rely on email matching. If name fields contain inconsistent formatting, parsing errors, or non-name data, string matching becomes unreliable.</p><p><strong>Better approach</strong>: Treat entity resolution and data quality as inseparable workstreams. Profile your data first. Fix systematic quality issues before expecting resolution to work well. Use entity resolution results to identify and prioritize remaining quality problems.</p><h3>Mistake 3: Treating It as a Project</h3><p>Organizations approach entity resolution as a one-time project: scope it, build it, deploy it, move on. Then they&#8217;re shocked when accuracy degrades over time as data patterns shift.</p><p><strong>Better approach</strong>: Treat entity resolution as infrastructure that requires ongoing ownership, monitoring, and refinement&#8212;just like your data pipelines or ML models. Assign clear ownership, establish SLAs, build monitoring dashboards, schedule regular review.</p><h3>Mistake 4: Optimizing for Perfection</h3><p>Teams spend months trying to achieve 99.9% accuracy before releasing anything. Meanwhile, downstream teams make decisions based on no identity resolution at all (effectively 0% accuracy).</p><p><strong>Better approach</strong>: Ship at 95% accuracy, deliver value, iterate to higher accuracy based on real-world feedback. You learn more from production usage in two weeks than from six months of pre-release testing. Start good, iterate to great.</p><h3>Mistake 5: Building From Scratch</h3><p>Organizations underestimate the complexity and decide to build custom entity resolution systems. 
Eighteen months later, they have something that works on small datasets but can&#8217;t scale, has accuracy issues they don&#8217;t understand, and requires specialized knowledge that only one person has.</p><p><strong>Better approach</strong>: Use proven solutions&#8212;whether open source frameworks or commercial platforms&#8212;that solve the hard algorithmic and scaling problems. Invest your engineering effort in customization, integration, and business-specific logic, not reinventing core entity resolution algorithms.</p><h2>The Real Question</h2><p>At this point in conversations, someone usually asks: &#8220;Should we do ML-based entity resolution?&#8221;</p><p>That&#8217;s the wrong question.</p><p>The real question is: <strong>&#8220;Can our organization afford to build composable architectures on a foundation of deterministic rules that we know will fail at scale?&#8221;</strong></p><p>Every organization moving to warehouse-native architectures will eventually face the identity resolution challenge. 
The only question is whether you address it proactively&#8212;as part of your architectural planning&#8212;or reactively, after your first major data quality incident.</p><p>The reactive path looks like this:</p><ul><li><p>Build your composable architecture on simple deduplication rules</p></li><li><p>Activate audiences and launch campaigns</p></li><li><p>Discover that 15% of your &#8220;unified&#8221; customers are actually duplicates</p></li><li><p>Or worse, discover you&#8217;ve merged different people and violated privacy</p></li><li><p>Scramble to retrofit proper identity resolution</p></li><li><p>Rebuild audiences and downstream dependencies</p></li><li><p>Deal with the business impact and loss of trust</p></li></ul><p>The proactive path looks like this:</p><ul><li><p>Recognize identity resolution as foundational, not an afterthought</p></li><li><p>Build or adopt ML-based entity resolution that runs in your warehouse</p></li><li><p>Start with one entity type, prove value, iterate</p></li><li><p>Build downstream capabilities on accurate identity from day one</p></li><li><p>Scale confidently knowing your foundation is solid</p></li></ul><p>The reactive path is more common (because people underestimate the challenge). The proactive path is more successful (because foundations matter).</p><h2>What Actually Endures</h2><p>Here&#8217;s the uncomfortable truth about chasing bleeding-edge features without solid foundations:</p><p><strong>A CDP that can&#8217;t accurately unify customer IDs is just an expensive data warehouse with a marketing UI.</strong></p><p>You can bolt on AI-driven orchestration, predictive segmentation, and real-time decisioning. But if your underlying identity graph is wrong&#8212;if you think one customer is three people, or three people are one customer&#8212;every downstream capability inherits those errors.</p><p>Your AI doesn&#8217;t fix bad identity data. It amplifies it. Your personalization engine delivers fragmented experiences. 
Your predictive models train on fictional customer journeys. Your real-time decisioning makes decisions based on incomplete context.</p><p>The organizations actually positioned to leverage advanced CDP capabilities aren&#8217;t the ones with the longest feature lists. <strong>They&#8217;re the ones who took the time to build reliable identity resolution first.</strong></p><h2>The Real Achievement</h2><p>Let me share two examples of organizations that built this foundation right.</p><p><strong>Fortnum &amp; Mason</strong>, the 300-year-old luxury British retailer, faced a classic challenge: customer data fragmented across restaurant bookings, email signups, online transactions, and in-store purchases. They initially tried a third-party service for identity resolution, but it created non-persistent identifiers, offered limited visibility into the matching process, and raised privacy concerns about sharing their entire dataset externally.</p><p>Jon Moss, Fortnum&#8217;s Customer Engagement Director, describes the transformation: </p><blockquote><p>&#8220;For the first time, we&#8217;re able to understand how customers are shopping with us&#8212;online, in-store, over the phone, or in restaurants. Zingg has helped us unify this data and gain insights we never had before.&#8221;</p></blockquote><p>By implementing ML-based entity resolution running natively in their data warehouse, Fortnum &amp; Mason built:</p><ul><li><p>Persistent, unified customer identifiers across all touchpoints</p></li><li><p>Complete visibility and control over the matching process</p></li><li><p>Privacy-compliant resolution that keeps sensitive data in their own infrastructure</p></li><li><p>A foundation for their composable CDP architecture</p></li></ul><p><strong>Orthodox Union</strong>, one of the largest Orthodox Jewish organizations in the US, operates over 40 websites, 5 mobile applications, and custom CRMs for different departments. 
Shelomo Dobkin, their Director of Product Development, explains: </p><blockquote><p>&#8220;We moved from using Zingg Open Source to the Enterprise version on Snowflake, making it the engine powering our golden records. Zingg&#8217;s powerful features and clean results have really set us apart.&#8221;</p></blockquote><p>Their implementation required understanding not just individual identities but household relationships&#8212;a critical requirement for their community-focused mission. They built entity resolution that could:</p><ul><li><p>Match people across multiple websites and CRM systems</p></li><li><p>Understand household connections and family relationships</p></li><li><p>Create golden records that serve as the foundation for all downstream systems</p></li><li><p>Process data incrementally as new interactions occur across their digital properties</p></li></ul><p>What makes these achievements remarkable isn&#8217;t just the technical implementation. It&#8217;s what they enable. This is <strong>world-class data engineering </strong>that delivers measurable business value:</p><ul><li><p>Marketing teams can trust their audiences</p></li><li><p>Analytics teams can build on reliable customer journeys</p></li><li><p>Data science teams can train models on accurate historical data</p></li><li><p>Operations teams can deliver consistent experiences across channels</p></li><li><p>Legal and compliance teams can demonstrate proper governance</p></li></ul><p>Everything else&#8212;<strong>the AI features, the real-time orchestration, the predictive models&#8212;becomes possible because the foundation is solid</strong>.</p><h2>Celebrating Foundational Work</h2><p>The industry needs to reframe how we talk about identity resolution and other foundational data work.</p><p><strong>These aren&#8217;t prerequisites that should be quick and easy. 
They&#8217;re achievements that deserve recognition.</strong></p><p>When a team successfully builds unified customer IDs across disparate systems, that&#8217;s a case study worth sharing. When an organization finally has audiences their teams trust, that&#8217;s an accomplishment worth celebrating. When identity resolution maintains accuracy as data scales from millions to billions of records, that&#8217;s engineering excellence.</p><p><strong>The castle that survives wave after wave isn&#8217;t the tallest or flashiest. It&#8217;s the one with foundations deep enough to withstand the tide.</strong></p><h2>Moving Forward</h2><p>If you&#8217;re a data practitioner still working on unified customer identity, you&#8217;re not behind. <strong>You&#8217;re doing the work that matters most</strong>.</p><p>Don&#8217;t let vendor demos and curated LinkedIn posts make you feel like an imposter. The bleeding-edge use cases get all the attention precisely because they&#8217;re rare. <strong>The foundational work gets less visibility precisely because it&#8217;s what everyone is actually struggling with.</strong></p><p>Focus on making your identity infrastructure trustworthy and durable:</p><ul><li><p>Invest in ML-based entity resolution that can handle real-world data messiness</p></li><li><p>Build in your data warehouse or data lake so you&#8217;re not creating new data silos</p></li><li><p>Design for continuous operation, not one-time batch processing</p></li><li><p>Create human-in-the-loop workflows for edge cases and model refinement</p></li><li><p>Treat it as infrastructure that requires ongoing ownership and monitoring</p></li></ul><p>The teams that master this foundation aren&#8217;t behind. They&#8217;re building something that will outlast the next wave of features, the next vendor pivot, the next technological hype cycle.</p><p>That&#8217;s not just the basics. That&#8217;s the real work. 
And it&#8217;s work worth doing right.</p><div><hr></div><p><em>What&#8217;s your experience building unified customer identity? Where did you hit the biggest obstacles? I&#8217;d love to hear your stories&#8212;the successes and the struggles.</em></p>]]></content:encoded></item><item><title><![CDATA[London Calling: Do You Really Know Your Customer?]]></title><description><![CDATA[Why I'm hosting a roundtable on Single Customer View at Martech World Forum &#8212; and why you should join us]]></description><link>https://www.learningfromdata.zingg.ai/p/london-calling-do-you-really-know</link><guid isPermaLink="false">https://www.learningfromdata.zingg.ai/p/london-calling-do-you-really-know</guid><dc:creator><![CDATA[Sonal Goyal]]></dc:creator><pubDate>Tue, 26 Aug 2025 18:22:30 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!JRq-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F383e3d4e-c861-4cf6-a823-69070443f651_1024x608.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JRq-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F383e3d4e-c861-4cf6-a823-69070443f651_1024x608.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JRq-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F383e3d4e-c861-4cf6-a823-69070443f651_1024x608.png 424w, https://substackcdn.com/image/fetch/$s_!JRq-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F383e3d4e-c861-4cf6-a823-69070443f651_1024x608.png 848w, 
https://substackcdn.com/image/fetch/$s_!JRq-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F383e3d4e-c861-4cf6-a823-69070443f651_1024x608.png 1272w, https://substackcdn.com/image/fetch/$s_!JRq-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F383e3d4e-c861-4cf6-a823-69070443f651_1024x608.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JRq-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F383e3d4e-c861-4cf6-a823-69070443f651_1024x608.png" width="1024" height="608" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/383e3d4e-c861-4cf6-a823-69070443f651_1024x608.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:608,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!JRq-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F383e3d4e-c861-4cf6-a823-69070443f651_1024x608.png 424w, https://substackcdn.com/image/fetch/$s_!JRq-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F383e3d4e-c861-4cf6-a823-69070443f651_1024x608.png 848w, 
https://substackcdn.com/image/fetch/$s_!JRq-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F383e3d4e-c861-4cf6-a823-69070443f651_1024x608.png 1272w, https://substackcdn.com/image/fetch/$s_!JRq-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F383e3d4e-c861-4cf6-a823-69070443f651_1024x608.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Last week, I was finetuning the Zingg website, when something caught my eye. 
We have a quote on Fortnum &amp; Mason&#8217;s journey from fragmented customer records to a unified view with Zingg:</p><div class="pullquote"><p>"For the first time, we're able to understand how customers are shopping with us&#8212;online, in-store, over the phone, or in restaurants. We never had that before."</p></div><p>That sentence got me thinking: in an era where we can track a customer's digital breadcrumbs across every touchpoint, why do so many brands still struggle with the simple question: "Do you really know your customer?"</p><p><strong>The Identity Resolution Promise That Keeps Getting Broken</strong></p><p>Most of us have watched the same promises get made over and over again. First, it was Master Data Management (MDM) systems that would create "one golden record." Then identity services like LiveRamp promised to connect everyone's messy data through their identity graph. Then, for the last few years, Customer Data Platforms (CDPs) became our answer to the elusive single customer view.</p><p>But here's what I keep seeing: enterprises continue to struggle with messy and fragmented data. I see multi-year MDM implementations that are still not very useful. The MDM system could tell you the exact lineage of how John Smith's email address made it through seven data transformation steps, but it couldn't figure out that J. Smith from the mobile app was the same person. The system was simply not ready for the messy reality of how customer data actually behaves.</p><p>The same pattern plays out with identity services. A different promise: "Don't worry about cleaning your own house &#8212; we'll connect everyone's messy data." Unfortunately, identity services operate as closed systems where you can't see or validate their matching logic. When they say Customer A matches Customer B, you're essentially trusting a proprietary algorithm without transparency. 
This becomes problematic when you need to adjust matching to your specific business context, explain results to stakeholders, or debug why certain matches seem incorrect. Add to that the privacy risk of sending trusted data to a third party. And because these services are designed around <strong>fixed data models</strong> (email, phone, cookie IDs, etc.), it is tough to resolve identities on more complex or domain-specific attributes (like healthcare patient IDs, loyalty card numbers, or industry-specific identifiers).</p><p>CDPs were supposed to fix everything by democratizing customer data for marketers. The reality? Most became expensive data silos with pretty dashboards and lots of data movement, while the hard work of actually resolving who is who got pushed to the background.</p><p><strong>What Actually Works</strong></p><p>This is why I'm excited about what the team at Fortnum &amp; Mason have built. Starting with data scattered across restaurant bookings, email sign-ups, online orders, and in-store transactions, they built a composable CDP using Zingg for entity resolution on Databricks.</p><p>The transformation wasn't about technology for technology's sake. </p><div class="pullquote"><p>"We are able to start turning that into insights and use it to personalize experiences online. We are going to start doing this in store and in our restaurants as well. Being able to connect all of these different customer experiences together is really important for us."</p></div><p>What made their approach different? They treated entity resolution as a first-class capability, not an afterthought. They started with Zingg's open-source version to build confidence, understood how the matching worked, then moved to the enterprise version. The entire journey from proof of concept to production took just two to three months. For the first time, they can understand how customers shop across all channels. 
Suddenly, &#8220;one customer&#8221; meant the same thing everywhere &#8212; and the results have been transformational, not just for analytics, but for personalization and customer experience.</p><p>Or look at the <strong>Orthodox Union</strong>, a non-profit serving millions globally. They struggled with donor and member records scattered across legacy systems, event platforms, and CRM databases. Without a reliable way to unify identities, their engagement strategies risked missing context and duplicating outreach. By resolving identities natively in their Snowflake data warehouse, they were able to create a consistent single view of donors and members, while respecting privacy and compliance requirements central to their mission. For them, identity resolution wasn&#8217;t just a data challenge &#8212; it was a way to deepen community trust.</p><p>These are two very different organizations, but the pattern is the same: composability delivers agility, while identity delivers coherence. Without both, the vision of modern customer experience falls short.</p><p><strong>The Composable CDP Revolution</strong></p><p>The approach taken by Fortnum &amp; Mason and the Orthodox Union reflects a broader trend that's reshaping the market. Composable CDPs sit on top of existing data clouds to preserve the customer 360 and activate real-time data across channel tools. Instead of moving data around, they work where the data lives. They are private by design. They leverage existing investments in governance and observability. They scale as you grow. </p><p>But here&#8217;s the hard truth: Composable architectures give you flexibility &#8212; but only if the components are stitched together with a stable, reliable, and explainable identity layer. You can have the most elegant semantic layer, real-time pipelines, and activation tools, but if &#8220;Acme Corp&#8221; and &#8220;ACME Inc.&#8221; are treated as two different customers, your insights fracture. 
If one table says &#8220;Tim Chen&#8221; and another says &#8220;T. Chen,&#8221; your campaigns lose relevance. If customer data is duplicated at ingestion, those cracks echo through every model, dashboard, and recommendation.</p><p>And here&#8217;s the other piece we don&#8217;t talk about enough: <strong>owning your identity layer means owning your customer understanding</strong>. Instead of shipping your most sensitive data out to a black-box vendor, you resolve identities where your data already lives &#8212; in your own warehouse or lake. That means:</p><ul><li><p><strong>Data ownership</strong>: You stay in control of your customer data, critical in a world of tightening privacy regulations.</p></li><li><p><strong>Schema flexibility</strong>: As your business evolves &#8212; new channels, attributes, and data types &#8212; your identity system evolves with it, instead of breaking because it was designed for someone else&#8217;s schema.</p></li><li><p><strong>Privacy and compliance</strong>: By resolving identities within your existing data governance framework, you respect customer consent and meet regulatory requirements without creating extra data copies.</p></li><li><p><strong>Stack leverage</strong>: You&#8217;re not rebuilding infrastructure. You&#8217;re making your existing investments in warehouses, lakes, and analytics more powerful by adding the missing link: identity.</p></li></ul><p><strong>Why I'm Hosting This Roundtable</strong></p><p>This brings me to London. On 9th September, I'll be at the <a href="https://themartechweekly.com/martech-world-forum/london/">Martech World Forum</a> hosting a roundtable titled "Do You Really Know Your Customer?" 
&#8212; and it would be lovely to meet you if you are there!</p><p>Here's what we'll cover:</p><p><strong>The Identity Crisis in Marketing:</strong> Why traditional approaches to customer identity keep failing, and how agentic AI makes the problem more urgent than ever.</p><p><strong>Building Composable CDPs That Actually Work:</strong> Moving beyond tool marketing to understand what composable really means, when it makes sense, and how to implement it successfully.</p><p><strong>The Zingg Approach to Entity Resolution:</strong> How modern ML-based entity resolution differs from traditional MDM, why transparency matters, and how to get started with open-source tools.</p><p><strong>Real Customer Stories:</strong> The journey from fragmented data to unified customer understanding, including the practical challenges and business outcomes. The folks from Fortnum &amp; Mason will join us to share their real-world experience, the challenges they overcame, and the business impact they've achieved. </p><p>The roundtable format means we can go deep, ask tough questions, and learn from each other's experiences. Whether you're just starting your customer identity journey or you've been through multiple implementations, there is value in understanding what's working now and what's coming next.</p><p><strong>Join Us in London</strong></p><p>I have a limited number of <strong>VIP passes</strong> available for the Martech World Forum. If you're a data or marketing leader in London, I'd love to share the passes and have you join the conversation in person. Because at the end of the day, the real question isn&#8217;t just &#8220;do you know your customer?&#8221; &#8212; it&#8217;s whether your systems know them too.</p><p>Do they know not just the demographic data or the transaction history, but the actual journey across all your touchpoints? Can you recognize them when they call customer service after browsing your website? 
Do you know that the person who signed up for your newsletter is the same one who made a purchase in your store?</p><p>If you can't answer these questions confidently, you're not alone. But you also can't afford to stay in that position much longer.</p><p>The companies that figure this out &#8212; the ones that truly know their customers &#8212; are going to have an unfair advantage in the age of AI-driven experiences. They'll deliver more relevant products, more timely services, and more personal interactions. They'll waste less money on duplicate processes and misguided campaigns.</p><p>Most importantly, they'll build the kind of customer relationships that survive economic downturns, competitive pressures, and technology transitions.</p><p>And if you can't make it to London but have war stories about identity resolution initiatives (successful or otherwise), I'd love to hear them. This isn't really about technology &#8212; it's about understanding the humans behind the data. And that, I've learned over the years, is both the hardest and most rewarding part of what we do.</p><div><hr></div><p><em>Want to learn more about entity resolution and composable CDPs? 
Check out <a href="https://github.com/zinggAI/zingg">Zingg's open source project</a> or read about <a href="https://www.zingg.ai/case-studies/building-a-composable-cdp-using-entity-resolution-at-fortnum-mason">how we're helping companies like Fortnum &amp; Mason</a> build better customer understanding.</em></p><div><hr></div><p></p>]]></content:encoded></item><item><title><![CDATA[Agentic AI Is Only as Smart as Its Entities]]></title><description><![CDATA[And why I&#8217;ve seen smart agents make dumb mistakes &#8212; at scale]]></description><link>https://www.learningfromdata.zingg.ai/p/agentic-ai-is-only-as-smart-as-its</link><guid isPermaLink="false">https://www.learningfromdata.zingg.ai/p/agentic-ai-is-only-as-smart-as-its</guid><dc:creator><![CDATA[Sonal Goyal]]></dc:creator><pubDate>Wed, 13 Aug 2025 09:43:38 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!vE0q!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0c295d1-f874-45b3-bd97-acdf45a2a531_1257x323.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A few weeks ago, I was talking to a data leader at a fast-growing retail brand.<br>They&#8217;d just rolled out their shiny new AI marketing agent. It could segment customers, design campaigns, and send hyper-personalized emails &#8212; all without human intervention.</p><p>Two weeks in, the VP of Marketing noticed something strange. One of their top VIP customers had received <em>three</em> different &#8220;Welcome to our brand!&#8221; offers &#8212; each with a different discount.</p><p>Turns out, their CRM had <strong>John A. Smith</strong>, <strong>J. Smith</strong>, and <strong>Jonathan Smith</strong> as three separate customers.</p><p>The agent wasn&#8217;t broken. It was doing exactly what it was told &#8212; treat each record as a unique person and delight them with a welcome offer. The problem? 
The <em>records</em> were wrong.</p><p>And this, I&#8217;ve realized over and over again, is where agentic AI stumbles: <strong>when the entities are messy, the automation multiplies the mess</strong>. </p><p>Over the past few years, I&#8217;ve seen this pattern play out across industries:</p><p><strong>Retail</strong> &#8212; Duplicate customer records mean multiple shipments, multiple offers, multiple refunds. Margins erode quietly.</p><p><strong>Finance</strong> &#8212; Fraudsters open several accounts with tiny name variations. Without linking those identities, your fraud-detection AI treats them as separate people &#8212; and misses the pattern.</p><p><strong>Healthcare</strong> &#8212; Patient &#8220;Mary Chen&#8221; at Hospital A is &#8220;M. Y. Chen&#8221; at Clinic B. An AI care coordinator misses her allergy record and books a risky procedure.</p><p>In each case, the AI wasn&#8217;t &#8220;dumb.&#8221; It was confidently acting on bad data. And because agentic AI is autonomous, it didn&#8217;t just make one mistake &#8212; it repeated the mistake hundreds or thousands of times before anyone noticed. Smart agents get the &#8220;who&#8221; wrong, and that really hurts the business. </p><p>If you follow the hype, everyone talks about the cost of building and running agentic AI. Yet far fewer people talk about <strong>the cost of running it on messy data</strong>. Agents run on LLM calls, and every action burns tokens. 
When your entities aren&#8217;t resolved, the agent:</p><ul><li><p>Processes duplicates</p></li><li><p>Repeats actions unnecessarily</p></li><li><p>Sends extra API calls trying to reconcile</p></li></ul><p>Here&#8217;s the math from just one example scenario:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vE0q!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0c295d1-f874-45b3-bd97-acdf45a2a531_1257x323.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vE0q!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0c295d1-f874-45b3-bd97-acdf45a2a531_1257x323.png 424w, https://substackcdn.com/image/fetch/$s_!vE0q!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0c295d1-f874-45b3-bd97-acdf45a2a531_1257x323.png 848w, https://substackcdn.com/image/fetch/$s_!vE0q!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0c295d1-f874-45b3-bd97-acdf45a2a531_1257x323.png 1272w, https://substackcdn.com/image/fetch/$s_!vE0q!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0c295d1-f874-45b3-bd97-acdf45a2a531_1257x323.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vE0q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0c295d1-f874-45b3-bd97-acdf45a2a531_1257x323.png" width="1257" height="323" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d0c295d1-f874-45b3-bd97-acdf45a2a531_1257x323.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:323,&quot;width&quot;:1257,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:58928,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.learningfromdata.zingg.ai/i/170856561?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0c295d1-f874-45b3-bd97-acdf45a2a531_1257x323.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!vE0q!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0c295d1-f874-45b3-bd97-acdf45a2a531_1257x323.png 424w, https://substackcdn.com/image/fetch/$s_!vE0q!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0c295d1-f874-45b3-bd97-acdf45a2a531_1257x323.png 848w, https://substackcdn.com/image/fetch/$s_!vE0q!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0c295d1-f874-45b3-bd97-acdf45a2a531_1257x323.png 1272w, https://substackcdn.com/image/fetch/$s_!vE0q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0c295d1-f874-45b3-bd97-acdf45a2a531_1257x323.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>*Based on $15 per 1M tokens (GPT-4.1 tier). </p><p>Even with cheaper models, the % waste stays the same. That&#8217;s just <strong>compute waste</strong>. It doesn&#8217;t even include the cost of fixing the damage &#8212; reissuing refunds, restoring trust, or dealing with regulatory fallouts. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!v2RK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd67a3f44-2127-45d5-96fd-8a6fb626cf8b_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!v2RK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd67a3f44-2127-45d5-96fd-8a6fb626cf8b_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!v2RK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd67a3f44-2127-45d5-96fd-8a6fb626cf8b_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!v2RK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd67a3f44-2127-45d5-96fd-8a6fb626cf8b_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!v2RK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd67a3f44-2127-45d5-96fd-8a6fb626cf8b_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!v2RK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd67a3f44-2127-45d5-96fd-8a6fb626cf8b_1536x1024.png" width="1456" height="971" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d67a3f44-2127-45d5-96fd-8a6fb626cf8b_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1793812,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.learningfromdata.zingg.ai/i/170856561?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd67a3f44-2127-45d5-96fd-8a6fb626cf8b_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!v2RK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd67a3f44-2127-45d5-96fd-8a6fb626cf8b_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!v2RK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd67a3f44-2127-45d5-96fd-8a6fb626cf8b_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!v2RK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd67a3f44-2127-45d5-96fd-8a6fb626cf8b_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!v2RK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd67a3f44-2127-45d5-96fd-8a6fb626cf8b_1536x1024.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p></p><div><hr></div><h2>Why I care about this (and what we built at Zingg)</h2><p>Before I started Zingg, I kept running into the same frustration:<br>We had amazing ML models, sophisticated rules engines, smart agents&#8230; but they were all reasoning on <strong>fragmented, messy entities</strong>.</p><p>At Zingg, we decided to treat entity resolution as a <strong>first-class product</strong>, not a one-off data cleanup exercise. 
That means:</p><ul><li><p>Matching at scale across CRMs, EHRs, billing systems, and data lakes</p></li><li><p>Adaptive learning so the matching improves with every variation we see</p></li><li><p>Transparent decisions so you can explain why two records were linked</p></li></ul><p>When customers put this layer under their agentic AI, they stopped paying for their agents to repeatedly do the wrong thing. In one case, a bank&#8217;s fraud detection improved 4x &#8212; without touching the AI model.</p><div><hr></div><h2>My advice to data leaders</h2><p>If you&#8217;re about to launch an agentic AI project, take a breath.<br>Before you buy another GPU hour or fine-tune another model, ask:</p><blockquote><p>&#8220;Do my agents actually know <em>who</em> they&#8217;re dealing with?&#8221;</p></blockquote><p>Because in my experience, <strong>agentic AI is the race car. Entity resolution is the track.</strong><br>You wouldn&#8217;t run a Formula 1 machine on gravel and expect to win.</p><div><hr></div><p><em>P.S. If you&#8217;re curious about what we are building at Zingg, or have war stories about entity chaos in AI systems, I&#8217;d love to hear from you.</em></p><div><hr></div>]]></content:encoded></item><item><title><![CDATA[The Curious Case of the Extra Records! 
]]></title><description><![CDATA[Earlier last month, we started working on a feature we are calling Match Statistics.]]></description><link>https://www.learningfromdata.zingg.ai/p/the-curious-case-of-the-extra-records</link><guid isPermaLink="false">https://www.learningfromdata.zingg.ai/p/the-curious-case-of-the-extra-records</guid><dc:creator><![CDATA[Sonal Goyal]]></dc:creator><pubDate>Sat, 31 May 2025 03:30:04 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!EYxD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb331d912-b27a-441e-a7c7-164275e59560_1024x608.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EYxD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb331d912-b27a-441e-a7c7-164275e59560_1024x608.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EYxD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb331d912-b27a-441e-a7c7-164275e59560_1024x608.png 424w, https://substackcdn.com/image/fetch/$s_!EYxD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb331d912-b27a-441e-a7c7-164275e59560_1024x608.png 848w, https://substackcdn.com/image/fetch/$s_!EYxD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb331d912-b27a-441e-a7c7-164275e59560_1024x608.png 1272w, 
https://substackcdn.com/image/fetch/$s_!EYxD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb331d912-b27a-441e-a7c7-164275e59560_1024x608.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EYxD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb331d912-b27a-441e-a7c7-164275e59560_1024x608.png" width="1024" height="608" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b331d912-b27a-441e-a7c7-164275e59560_1024x608.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:608,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!EYxD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb331d912-b27a-441e-a7c7-164275e59560_1024x608.png 424w, https://substackcdn.com/image/fetch/$s_!EYxD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb331d912-b27a-441e-a7c7-164275e59560_1024x608.png 848w, https://substackcdn.com/image/fetch/$s_!EYxD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb331d912-b27a-441e-a7c7-164275e59560_1024x608.png 1272w, 
https://substackcdn.com/image/fetch/$s_!EYxD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb331d912-b27a-441e-a7c7-164275e59560_1024x608.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Earlier last month, we started working on a feature we are calling Match Statistics. The idea about statistics came from the incredible folks at <a href="https://www.zingg.ai/case-studies/building-a-composable-cdp-using-entity-resolution-at-fortnum-mason">Fortnum And Mason</a>, who are using Zingg as the backbone for Single Customer View. 
When Zingg runs incrementally, Match Statistics would expose how the number of clusters changes as records get inserted into and updated in the identity graph. This information about changing clusters would be a good first step toward observing the entity resolution pipeline. If the number of clusters changes disproportionately to the number of records updated and added, an alert could be triggered.</p><p>As we started thinking about this feature, we realized that the Match Statistics feature could be made even more powerful. Currently the Zingg output consists of the original data along with the Zingg ID and the match probabilities. If we could surface information about the linkages we found between the records within a cluster, we could help users understand the matching internals and spot anomalies. When our users are dealing with millions of records, finding the needle in the haystack is critical. Hence, the specification for the feature expanded to additional statistics.</p><p>While building new features, or augmenting existing flows within Zingg, we have to design for both the <a href="https://github.com/zinggAI/zingg">Community</a> and <a href="https://www.zingg.ai">Enterprise</a> products, since they share the same code base. The design of any feature on either product has to take into consideration the impact on both products and ensure that they continue to work individually as well as in tandem. And then comes the trickier part. The design has to work well for both <a href="https://www.zingg.ai/product/snowflake/snowflake">Snowflake</a> and <a href="https://www.zingg.ai/product/entity-and-identity-resolution-for-databricks/databricks">Spark</a> deployments. Not just functionally, but performance-wise as well. The Zingg Enterprise code base is mostly common across Spark and Snowflake. 
However, since the two platforms have very different APIs, query planning, optimisation and execution, we have a lot of ground to cover while building or modifying anything.</p><p>That's what we thrive on :-)</p><p>Eventually we designed a stats package which would get the record-level and cluster-level matches from the match process. This package would be responsible for computing the metrics we cared about, and writing them to appropriate locations, or <a href="https://docs.zingg.ai/latest/connectors/pipes">Pipes</a> in Zingg lingo. The main flow would mostly be unaltered, except for invoking the stats package and passing the dataframes around.</p><p>As we started building the core functionality, we ran into a very strange issue. Within our flow, we do <a href="https://www.zingg.ai/product/features/train">probabilistic matching</a> only on records which do not match <a href="https://www.zingg.ai/product/features/deterministic-matching">deterministically</a>. If two records match deterministically, we do not try to figure out if they match probabilistically as well, since that is a waste of computation time.</p><p>Now, something strange started happening. Within the stats package, which was consuming the probabilistic and deterministic structures, we found some record pairs which matched probabilistically as well as deterministically. This threw an error in the statistics computations, as they relied on each record pair matching one way or the other, or not matching at all. Our carefully crafted logic was breaking.</p><p>However, this part of the flow had never changed. All we were doing was computing additional metrics in this package. And the main flow was unaltered. So how could stuff which was working earlier AND which was untouched stop working? What followed was a crazy cycle of runs, tests and validations.</p><p>Our initial thought was that we had made an error in collecting the data. Why not run it in another sandbox? Umm, the same result. 
Maybe we had run the flow twice, appending to the original intermediate data and leaving it in the wrong state? But if that was the case, <strong>every</strong> record pair would appear twice. In our case, only some of the record pairs were being matched probabilistically as well as deterministically. We then sought to understand if there was anything special about those records. We looked at column values, deterministic rules, and the probabilistic match criteria. To our dismay, every run threw up a different set of extra records. Clearly, then, this was not a data or code issue.</p><p>Things were surely getting curiouser and curiouser. We were completely off track from the original feature we set out to build :-(</p><p>Next, we turned our focus to the execution environment. When we ran the build on Snowflake, there were no extra records. This was a relief, since it meant it was the Spark side of things that we needed to focus on. Now we ran the workload on Databricks. No issue at all! This was a relief too, since it meant previously released production code deployed at customer locations was working fine and there was no snake in the grass. </p><p>After two weeks of racking our brains, we were finally making progress :-) Things started moving pretty fast after that. We searched for known issues and discovered an <a href="https://issues.apache.org/jira/browse/SPARK-45282">issue on Apache Spark</a>. When we changed the Apache Spark version, the issue disappeared. To be absolutely sure, we tested different datasets and different volumes, and all looked good.</p><p>Debugging complex systems is always challenging, but some techniques always help.</p><ol><li><p>Time travel to the last known good state helps you understand which change or set of changes could be the cause.</p></li></ol><ol start="2"><li><p>Surgical elimination is super critical in such scenarios. 
Being able to quickly run different non-overlapping tests can lead to faster results.</p></li></ol><ol start="3"><li><p>Weird behavior (missing records, extra counts, etc.) that changes from run to run is more likely to be caused by the environment than by the actual code.</p></li></ol><ol start="4"><li><p>Such non-deterministic errors usually stress teams out, since things feel very out of control. For engineers like us who are used to predictable outcomes, such issues can be very frustrating. Keeping a cool head goes a long way.</p></li></ol><p>After this wild chase, we are now back to building the Match Statistics feature. We hope to ship it soon!</p><p>Just one last thing that I forgot to share. Many wise developers have said this before me, but I must say it again! Plan for at least twice the time you anticipate it will take :-)</p><p>If you have a similar story to share, hit reply!</p>]]></content:encoded></item><item><title><![CDATA[What is in a name?]]></title><description><![CDATA[That which we call a rose, by any other name would smell as sweet. But, will it resolve to the same entity?]]></description><link>https://www.learningfromdata.zingg.ai/p/what-is-in-a-name</link><guid isPermaLink="false">https://www.learningfromdata.zingg.ai/p/what-is-in-a-name</guid><dc:creator><![CDATA[Sonal Goyal]]></dc:creator><pubDate>Mon, 12 May 2025 06:39:09 GMT</pubDate><content:encoded><![CDATA[<p>When Shakespeare penned these famous words, he wasn't thinking about data quality or entity resolution, but the sentiment captures a fundamental challenge in modern data management. A person known as "Robert," "Rob," "Bob," or "Bobby" remains the same individual, regardless of how they're addressed in the data. Names are perhaps the most fundamental identifier we use in data systems, yet they're also among the most inconsistent. 
Let us consider these common scenarios:</p><ul><li><p><strong>Nicknames and diminutives</strong>: William becomes Will, Bill, or Billy</p></li><li><p><strong>Cultural variations</strong>: James in English is Diego in Spanish or Jacques in French</p></li><li><p><strong>Formal vs. informal usage</strong>: Margaret on official documents, Maggie among friends</p></li><li><p><strong>Transliterations</strong>: &#1050;&#1080;&#1090;&#1072;&#1081; (Russian) becomes Kitay in Latin script</p></li><li><p><strong>Spelling inconsistencies</strong>: Steven vs. Stephen, Catherine vs. Katherine</p></li></ul><p>These variations create significant challenges for data matching. Without proper handling, our systems might fail to recognize that Robert Smith and Bob Smith are the same person, leading to duplicated records, fragmented customer views, and inaccurate analytics. For customer-facing applications, recognizing that these are the same person enables more personalized service and prevents frustrating redundancies. Regulatory frameworks require comprehensive entity resolution. Proper nickname handling helps ensure compliance with KYC (Know Your Customer), AML (Anti-Money Laundering), and GDPR requirements. False negatives (failing to match records that should be matched) can be particularly costly in fraud detection, risk and healthcare. </p><p>While it might seem straightforward to compare names for exact matches, the reality is far more complex. We already handle salutations and variations in the data. Yet, without nickname handling, many true matches go undetected. Our internal benchmarks showed that incorporating nickname matching can improve overall match rates by 10-25% in typical customer datasets. </p><p>Hence, it was time to build! Amongst the numerous approaches we considered, dictionary matching stood out for several reasons. </p><ul><li><p>Name variations differ significantly across cultures. 
Dictionaries can be configured to prioritize certain cultural naming patterns based on the data demographics. We are handling Hebrew names in one case and English names in another. </p></li><li><p>Configurable dictionaries empower users to leverage their domain knowledge for entity resolution.</p></li><li><p>Even without user involvement, a nickname dictionary mapping formal names to their common variations is not that difficult to build in the age of LLMs. </p></li><li><p>The surrounding data points of the record (like address, date of birth, and contact information) can be easily checked for potential nickname matches to reduce false positives.</p></li><li><p>Different industries and use cases require different sensitivity levels. Healthcare applications might prioritize precision, while marketing applications might favor recall. A configurable dictionary helps in such cases. </p></li></ul><p>As we zeroed in on the dictionary, we realized that the approach is extensible to other domains like corporations, where mapping IBM to International Business Machines will greatly improve matching. In the end, we built a <a href="https://docs.zingg.ai/latest/stepbystep/configuration/adv-matchtypes">MAPPING</a> match type that can even normalize categorical columns like gender, or states and countries.  </p><p>Rolling out new features on running production pipelines is not trivial. For existing users, we did comprehensive A/B testing of entity resolution with and without nickname handling to quantify the improvement in their specific datasets. The results have been even better than we expected! </p><p>Names are at once the most essential and most variable identifiers in our data systems. Effective entity resolution must account for the rich complexity of how names are used in the real world. 
As Shakespeare suggested, a rose by any other name would indeed smell as sweet, and with Zingg's nickname matching, data systems will finally recognize that truth!</p>]]></content:encoded></item><item><title><![CDATA[The open source identity resolution journey so far..]]></title><description><![CDATA[More than a decade ago, in the year 2013, the identity resolution problem hit me.]]></description><link>https://www.learningfromdata.zingg.ai/p/the-open-source-identity-resolution</link><guid isPermaLink="false">https://www.learningfromdata.zingg.ai/p/the-open-source-identity-resolution</guid><dc:creator><![CDATA[Sonal Goyal]]></dc:creator><pubDate>Tue, 07 Jan 2025 15:32:54 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!BIuc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82917998-fa78-4dbd-b6a5-aa294fbe9454_1000x668.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>More than a decade ago, in the year 2013, the identity resolution problem hit me. As a niche consulting company focused on data and AI, we were tasked with integrating customer records from disparate systems. Our project involved setting up a Hadoop cluster on Amazon EC2 and building out the analytics and reporting for our client. This was the time when services like Elastic MapReduce were about to get launched. It felt like an engineering feat that we could run Hadoop on the cloud through custom scripts. In a way, we were actually mimicking today&#8217;s serverless approaches. </p><p>We built data extraction, transformation, and loading pipelines along with the reporting using Sqoop, Pig, Hive, and Cascading. But, as we got further into the project, we faced a critical issue. Even when we got all the data in one place, it was extremely challenging to identify the real-world entity represented by the varying records from different systems. 
Since the records did not have any common identifiers and they also had all sorts of variations, unifying the data became the biggest issue for us on that project.  </p><p>Once I became aware of the problem, I started seeing it in other projects. Even when I was doing different work, the problem kind of never left me. How do we say that the records are a match? How do we scale this out? How do we run it for any kind of entity an enterprise deals with? For the programmer in me, it became a thrill ride of problem solving with lots of first-principles thinking. The more I worked on it, the more challenges I became aware of. Eventually, I got completely immersed and stopped taking other consulting work to focus on building a generic solution. <br><br>When I showed an early version to my friend and mentor <a href="https://www.linkedin.com/in/joydeeps/">Joydeep Sen Sarma</a>, creator of Apache Hive, he encouraged me to open source it. As an active consumer of amazing open source technologies, I saw it as a great way to contribute back. So I latched on to the idea. Here is what I <a href="https://www.learningfromdata.zingg.ai/p/time-to-zingg">wrote</a> when I published Zingg:</p><blockquote><p>I also feel that open source Zingg is more powerful than closed source, because then it is up to the community to put it to use in more ways than I have seen and could ever anticipate. It is also a <em>promise</em> to work closely with different people and collaborate, something I am very sincerely hoping will happen!</p></blockquote><p>I was convinced there were more people like me, but it was not clear if they would find me. Fast forward to today: the Zingg open source community is 700+ members strong. Turns out what felt like an esoteric problem to me is a strongly felt pain for enterprises. Starting with a single person, Zingg now has a founding team that cares deeply about entity resolution. 
Zingg resolves billions of records every month, and our users and customers come from diverse industries like healthcare, insurance, government, and non-profits, as well as startups. This love and support is driving us to innovate more, and we are building amazing stuff to run production loads with a <a href="https://www.zingg.ai/product/features/zingg-id">persistent ZINGG_ID</a> along with <a href="https://www.zingg.ai/product/features/incremental-flow">incremental flows</a>. We continue to work extensively on accuracy and performance. Zingg also supports diverse deployments on <a href="https://www.zingg.ai/product/fabric">Microsoft Fabric</a>, <a href="https://www.zingg.ai/product/snowflake/snowflake">Snowflake</a>, AWS Glue, <a href="https://www.zingg.ai/product/databricks/databricks">Databricks</a>, and others. <br><br>All this would not have been possible without our community members, well-wishers and supporters. Thank you all, you have made this journey well worth it!</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!BIuc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82917998-fa78-4dbd-b6a5-aa294fbe9454_1000x668.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!BIuc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82917998-fa78-4dbd-b6a5-aa294fbe9454_1000x668.jpeg 424w, https://substackcdn.com/image/fetch/$s_!BIuc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82917998-fa78-4dbd-b6a5-aa294fbe9454_1000x668.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!BIuc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82917998-fa78-4dbd-b6a5-aa294fbe9454_1000x668.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!BIuc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82917998-fa78-4dbd-b6a5-aa294fbe9454_1000x668.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!BIuc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82917998-fa78-4dbd-b6a5-aa294fbe9454_1000x668.jpeg" width="1000" height="668" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/82917998-fa78-4dbd-b6a5-aa294fbe9454_1000x668.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:668,&quot;width&quot;:1000,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:143625,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!BIuc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82917998-fa78-4dbd-b6a5-aa294fbe9454_1000x668.jpeg 424w, https://substackcdn.com/image/fetch/$s_!BIuc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82917998-fa78-4dbd-b6a5-aa294fbe9454_1000x668.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!BIuc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82917998-fa78-4dbd-b6a5-aa294fbe9454_1000x668.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!BIuc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82917998-fa78-4dbd-b6a5-aa294fbe9454_1000x668.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div>]]></content:encoded></item><item><title><![CDATA[Celebrating 3 Years of Open Source Entity 
Resolution!]]></title><description><![CDATA[Miles travelled, and miles to go...]]></description><link>https://www.learningfromdata.zingg.ai/p/celebrating-3-years-of-open-source</link><guid isPermaLink="false">https://www.learningfromdata.zingg.ai/p/celebrating-3-years-of-open-source</guid><dc:creator><![CDATA[Sonal Goyal]]></dc:creator><pubDate>Wed, 02 Oct 2024 06:05:00 GMT</pubDate><enclosure url="https://images.unsplash.com/photo-1519869491916-8ca6f615d6bd?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxOXx8YmlydGhkYXl8ZW58MHx8fHwxNzI3ODAxODkwfDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="pullquote"><p><strong>&#8220;The Data Weaver&#8217;s Tale&#8221;</strong><br>In fields of data, vast and wide,<br>Where records dance and entities hide,<br>A weaver works with threads so fine,<br>To link and match and intertwine.</p><p>Deduplication clears the way,<br>As fuzzy matching comes to play.<br>Record linkage, strong and true,<br>Brings scattered pieces into view.</p><p>Entity linking, precise and keen,<br>Connects the dots once unforeseen.<br>The knowledge graph, a tapestry,<br>Reveals hidden identity.</p><p>With AI&#8217;s wisdom as our guide,<br>We stitch and resolve, side by side.<br>In this grand quest for truth and light,<br>Entity resolution shines so bright.</p></div><p>When I open sourced Zingg three years ago, Entity Resolution felt like an esoteric problem. While I had personally experienced the issue of unharmonized records, it was hard to say if I was an outlier or if this was a much more common problem. Sure, master data management systems have existed since the beginning of the data industry. Or even before it. Customer data platforms have also had their share of the spotlight. Yet, for all of the modern data stack maps out there, identity and entity resolution have been largely omitted. 
</p><p>As someone who had burnt their life&#8217;s savings working on Zingg, I was treading unknown territory. The <a href="https://www.learningfromdata.zingg.ai/p/of-i-and-we">encouragement I had received from people</a><a href="https://www.learningfromdata.zingg.ai/p/sometimes-the-best-things-in-life"> who understood</a> the <a href="https://www.learningfromdata.zingg.ai/p/sometimes-the-best-things-in-life">problem</a> truly meant a lot to me. Yet, the way forward was unclear. Starting and building a company is terribly hard. It was tempting to think of a cushy job which would save me from a lot of hassles, guarantee a handsome package and possibly offer some <em>other</em> interesting problems. </p><p>However, I could not bring myself to do anything else. Entity resolution felt like a problem I could not forget. If I could help a <em>few</em> practitioners like myself who had struggled with entity resolution, I would be satisfied. If I think back, it is likely that there was a builder bias in my thinking. Working on Zingg is fun! The challenges we see are not commonplace. They force you to think, <a href="https://www.learningfromdata.zingg.ai/p/zingg-incremental-flow">they push you back to the drawing board and make you scratch your head again and again</a> in the quest to find a solution. It is wonderful to apply tools and techniques learnt over multiple years, yet see oneself falling short and researching and learning more and more. Working on Zingg feels like an endless maze of <a href="https://www.learningfromdata.zingg.ai/p/performance-tuning-snowpark-for-identity">hard computational puzzles</a> thrown our way. As a professional, one could not ask for more!   </p><p>Also, somewhere deep down, I was not too worried about the selling bit. I was convinced that it was worth a try. One big reason was that the existing tooling in entity resolution is outdated. 
MDM and CDP tools aim to do too much, and the core problem that they try to solve, entity resolution, is something Zingg <a href="https://www.learningfromdata.zingg.ai/p/deterministic-matching-vs-probabilistic">specializes</a> in. Zingg&#8217;s identity resolution capabilities make it a drop-in replacement for an MDM system and a key component in composable CDP stacks. Zingg can work with large datasets and <a href="https://www.learningfromdata.zingg.ai/p/hello-zingg-id">resolve</a> different kinds of entities with ease. We are privacy-first, with no data ever leaving customer premises. Our warehouse/lakehouse-native approach enables an elegant data architecture, letting people leverage their existing investment in data tooling and run Zingg AI as part of their data pipeline. Without any specialized AI skills, one can straight off use <a href="https://www.learningfromdata.zingg.ai/p/hello-zingg-id">Zingg results</a> in marketing attribution, risk or personalisation models. Or pipe Zingg-resolved entities into an LLM's RAG pipeline or a knowledge graph. </p><p>I have surely made my share of mistakes in life, but it turns out I was right to bet wholeheartedly on Zingg. Adoption by open source users has been motivating. Due to on-premise and multicloud environments along with the proliferation of SaaS tools, there is a growing need for trusted and enriched entity data. The race to deploy large language models, enterprise knowledge graphs and RAG-based systems is pushing the demand for entity resolution even higher. An enterprise data leader recently told me that Customer 360 is among the top 3 problems their team worries about. We are repeatedly hearing single customer view, SCV, Customer 360, C360 and variants in sales conversations. A very healthy and growing revenue from Zingg Enterprise is giving us the leverage to go deeper into the problem, innovate even further and build simple yet delightful experiences for end users. 
</p><p>We are onto something special, and it is an experience worth having! </p><p>This would not have been possible without some really solid support and I am extremely grateful for that. </p><p>A big thank you to each of our users for trying Zingg, mentioning us to friends and coworkers and joining our Slack. We hope that Zingg will continue to help you in your data journey. Special thanks to our investors and paying customers of Zingg Enterprise. We could not have gotten here without you. Thank you for your early confidence and for betting on us. We have built a lot of valuable features like <a href="https://www.learningfromdata.zingg.ai/p/hello-zingg-id?r=1adrn&amp;utm_campaign=post&amp;utm_medium=web">ZINGG_ID</a>, <a href="https://www.learningfromdata.zingg.ai/p/zingg-incremental-flow">incremental flows</a>, and <a href="https://www.learningfromdata.zingg.ai/p/deterministic-matching-vs-probabilistic?r=1adrn&amp;utm_campaign=post&amp;utm_medium=web">deterministic matching</a> based on customer asks and our understanding of customer needs. The product is so much richer due to your feedback and feature asks. </p><p>A lot of ground has been covered in the last 3 years, and a lot more remains. In the coming months, we plan to grow the team to build more of the cool stuff we have been aching to build, and continue to dabble in interesting problems at the intersection of technology and company building. 
Wish us luck!</p><p>Thank you for reading, and if this story resonates with you, do say hello!</p><p></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://images.unsplash.com/photo-1519869491916-8ca6f615d6bd?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxOXx8YmlydGhkYXl8ZW58MHx8fHwxNzI3ODAxODkwfDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://images.unsplash.com/photo-1519869491916-8ca6f615d6bd?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxOXx8YmlydGhkYXl8ZW58MHx8fHwxNzI3ODAxODkwfDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 424w, https://images.unsplash.com/photo-1519869491916-8ca6f615d6bd?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxOXx8YmlydGhkYXl8ZW58MHx8fHwxNzI3ODAxODkwfDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 848w, https://images.unsplash.com/photo-1519869491916-8ca6f615d6bd?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxOXx8YmlydGhkYXl8ZW58MHx8fHwxNzI3ODAxODkwfDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1272w, https://images.unsplash.com/photo-1519869491916-8ca6f615d6bd?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxOXx8YmlydGhkYXl8ZW58MHx8fHwxNzI3ODAxODkwfDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1456w" sizes="100vw"><img src="https://images.unsplash.com/photo-1519869491916-8ca6f615d6bd?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxOXx8YmlydGhkYXl8ZW58MHx8fHwxNzI3ODAxODkwfDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080" width="2304" height="1536" 
data-attrs="{&quot;src&quot;:&quot;https://images.unsplash.com/photo-1519869491916-8ca6f615d6bd?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxOXx8YmlydGhkYXl8ZW58MHx8fHwxNzI3ODAxODkwfDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1536,&quot;width&quot;:2304,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;three cupcakes with toppings&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="three cupcakes with toppings" title="three cupcakes with toppings" srcset="https://images.unsplash.com/photo-1519869491916-8ca6f615d6bd?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxOXx8YmlydGhkYXl8ZW58MHx8fHwxNzI3ODAxODkwfDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 424w, https://images.unsplash.com/photo-1519869491916-8ca6f615d6bd?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxOXx8YmlydGhkYXl8ZW58MHx8fHwxNzI3ODAxODkwfDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 848w, https://images.unsplash.com/photo-1519869491916-8ca6f615d6bd?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxOXx8YmlydGhkYXl8ZW58MHx8fHwxNzI3ODAxODkwfDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1272w, https://images.unsplash.com/photo-1519869491916-8ca6f615d6bd?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxOXx8YmlydGhkYXl8ZW58MHx8fHwxNzI3ODAxODkwfDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Photo by Jennie Brown on <a href="https://unsplash.com">Unsplash</a></figcaption></figure></div>]]></content:encoded></item><item><title><![CDATA[Zingg Incremental Flow]]></title><description><![CDATA[because data changes, and the identity graph should keep up too!]]></description><link>https://www.learningfromdata.zingg.ai/p/zingg-incremental-flow</link><guid isPermaLink="false">https://www.learningfromdata.zingg.ai/p/zingg-incremental-flow</guid><dc:creator><![CDATA[Sonal Goyal]]></dc:creator><pubDate>Thu, 22 Aug 2024 17:20:16 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/76e7445f-276e-4972-b0c2-2b12122f9b12_1456x292.jpeg" length="0" 
type="image/jpeg"/><content:encoded><![CDATA[<p>As a startup founder and someone who actively follows companies doing interesting work, I often wonder about the levers behind a viable and thriving business. One thing that comes up repeatedly is <strong>Growth</strong>.</p><p>From the point of view of data teams in a fast-growing business, what does growth really amount to? How does it manifest in day-to-day data collection and processing?</p><p>If we dissect the key activities happening across businesses during the growth phase, a few commonalities stand out. New customers get acquired at a fast pace. Existing customers buy more products and services. They also buy from additional channels, like a website, if they earlier bought only from the brick-and-mortar store. The business keeps in touch with existing customers and makes sure it has the ways and means to contact them with offers and product updates. Customers inform the business when they move or change jobs. Strategic mergers and acquisitions happen. Additional procurement of products and services happens to fuel the growth, and new suppliers get onboarded. New product lines are introduced and additional customers onboarded.</p><p>Thus, data about core business entities (customers, suppliers, products and parts) starts changing rapidly. New entity records get added and existing ones get updated. Some records are deleted.</p><p>A small detour before we go any further. In my last post, I discussed the <a href="https://www.learningfromdata.zingg.ai/p/hello-zingg-id">cross reference and persistent ZINGG_ID</a>. Let us recap a bit. When we build our customer journeys, segment users and attribute acquisition to campaigns and channels, we need a consistent view of the customer data. We want to associate each entity with an identifier that inherently represents it. And we want to use this identifier in all our reporting and analytics, as well as in operational systems. 
If a customer gets in touch with us at the call centre, we want to use the ZINGG_ID to get the context of their purchases and their relationship with us.</p><p>Coming back to the current topic: there is immense business value in having a <strong>persistent</strong> and <strong>continuously</strong> <strong>updated identity graph</strong> for each of the downstream use cases like marketing attribution, personalisation, lifetime value, fraud and risk.</p><p>However, as we saw earlier, in a growing business, records are no longer static. They are being added, deleted and updated. So, how does all this change affect identity resolution? New records start matching with existing ones. Updated records may start matching with records they did not match earlier. They may stop matching with records they matched earlier. Hence, clusters of resolved identities evolve with the data. They merge or split, and new ones are created all the time.</p><p>Let us dig deeper into how that looks by going over some examples.</p><p>In the following dataset, the records with IDs <code>rec-1020-org-incr</code> and <code>rec-1023-org-incr</code> have been updated during business operations and continue to exhibit strong similarity with their original matches. 
Two new records with ID <code>rec-10000-org-new</code>  got added, matched with no other existing record but with each other.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NUPf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78c9f4f2-39dd-4352-b847-234507120421_1763x353.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NUPf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78c9f4f2-39dd-4352-b847-234507120421_1763x353.png 424w, https://substackcdn.com/image/fetch/$s_!NUPf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78c9f4f2-39dd-4352-b847-234507120421_1763x353.png 848w, https://substackcdn.com/image/fetch/$s_!NUPf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78c9f4f2-39dd-4352-b847-234507120421_1763x353.png 1272w, https://substackcdn.com/image/fetch/$s_!NUPf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78c9f4f2-39dd-4352-b847-234507120421_1763x353.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NUPf!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78c9f4f2-39dd-4352-b847-234507120421_1763x353.png" width="1200" height="240.65934065934067" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/78c9f4f2-39dd-4352-b847-234507120421_1763x353.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:292,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:253779,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NUPf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78c9f4f2-39dd-4352-b847-234507120421_1763x353.png 424w, https://substackcdn.com/image/fetch/$s_!NUPf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78c9f4f2-39dd-4352-b847-234507120421_1763x353.png 848w, https://substackcdn.com/image/fetch/$s_!NUPf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78c9f4f2-39dd-4352-b847-234507120421_1763x353.png 1272w, https://substackcdn.com/image/fetch/$s_!NUPf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78c9f4f2-39dd-4352-b847-234507120421_1763x353.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>With me so far? Now, here comes an even more interesting bit! 
In the following dataset, <code>rec-1022-org</code> and <code>rec-1022-dup-4 </code>had matched originally.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dLzz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa93e2cdb-3498-45dd-b9bd-daa2231377c1_1755x107.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dLzz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa93e2cdb-3498-45dd-b9bd-daa2231377c1_1755x107.png 424w, https://substackcdn.com/image/fetch/$s_!dLzz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa93e2cdb-3498-45dd-b9bd-daa2231377c1_1755x107.png 848w, https://substackcdn.com/image/fetch/$s_!dLzz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa93e2cdb-3498-45dd-b9bd-daa2231377c1_1755x107.png 1272w, https://substackcdn.com/image/fetch/$s_!dLzz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa93e2cdb-3498-45dd-b9bd-daa2231377c1_1755x107.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dLzz!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa93e2cdb-3498-45dd-b9bd-daa2231377c1_1755x107.png" width="1200" height="73.35164835164835" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a93e2cdb-3498-45dd-b9bd-daa2231377c1_1755x107.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:89,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:70825,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dLzz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa93e2cdb-3498-45dd-b9bd-daa2231377c1_1755x107.png 424w, https://substackcdn.com/image/fetch/$s_!dLzz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa93e2cdb-3498-45dd-b9bd-daa2231377c1_1755x107.png 848w, https://substackcdn.com/image/fetch/$s_!dLzz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa93e2cdb-3498-45dd-b9bd-daa2231377c1_1755x107.png 1272w, https://substackcdn.com/image/fetch/$s_!dLzz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa93e2cdb-3498-45dd-b9bd-daa2231377c1_1755x107.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>In an update at a later date, <code>rec-1022-dup-4</code> got altered and these records now stop matching with each other. This usually happens in the case of wrong data capture in an original record. 
</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nfJW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62c90324-06c7-40f8-aeda-6ae010bce9eb_1783x105.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nfJW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62c90324-06c7-40f8-aeda-6ae010bce9eb_1783x105.png 424w, https://substackcdn.com/image/fetch/$s_!nfJW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62c90324-06c7-40f8-aeda-6ae010bce9eb_1783x105.png 848w, https://substackcdn.com/image/fetch/$s_!nfJW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62c90324-06c7-40f8-aeda-6ae010bce9eb_1783x105.png 1272w, https://substackcdn.com/image/fetch/$s_!nfJW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62c90324-06c7-40f8-aeda-6ae010bce9eb_1783x105.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nfJW!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62c90324-06c7-40f8-aeda-6ae010bce9eb_1783x105.png" width="1200" height="70.87912087912088" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/62c90324-06c7-40f8-aeda-6ae010bce9eb_1783x105.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:86,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:63226,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!nfJW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62c90324-06c7-40f8-aeda-6ae010bce9eb_1783x105.png 424w, https://substackcdn.com/image/fetch/$s_!nfJW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62c90324-06c7-40f8-aeda-6ae010bce9eb_1783x105.png 848w, https://substackcdn.com/image/fetch/$s_!nfJW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62c90324-06c7-40f8-aeda-6ae010bce9eb_1783x105.png 1272w, https://substackcdn.com/image/fetch/$s_!nfJW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62c90324-06c7-40f8-aeda-6ae010bce9eb_1783x105.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Thus, when a record is updated yet it is close enough to the original cluster values, we ought to keep the cluster relationship. In the case of adding a new record that does not match any existing ones, a new cluster should get created. 
When a new record shows up that matches an existing cluster, the new record has to get attached to that cluster. When a new record arrives that matches two clusters, the clusters need to be merged. If we edit a record so that it no longer matches its existing cluster and then revert those edits, the models should correctly track both the update and the reversal, with the record sitting in the correct cluster throughout the incremental runs. There are myriad possible scenarios of clusters splitting and merging with each other. Matching new and updated records against existing ones gets extremely complex before we realise it.</p><p>Technically, we could run a full match with the Zingg Community version on the updated dataset and get the new clusters.</p><p>But.</p><p>If we run the entire matching process again, how do we preserve the persistent ZINGG_ID for the records? How do we tie the newly generated ZINGG_IDs on the updated set back to the original ZINGG_IDs? Especially when a new cluster now contains multiple older ZINGG_IDs, or a mix of different old ZINGG_IDs and fresh ones. How do we choose the surviving ZINGG_ID? On top of that, running full matching loads on growing datasets is costly in both time and compute.</p><p>Here is how we handle this in Zingg Enterprise.</p><blockquote><p>Clusters <strong>automatically</strong> merge and unmerge, and new clusters with ZINGG_IDs are created based on new and updated values, while <strong>only</strong> the updated records are matched.</p></blockquote><p>Which means that</p><blockquote><p>The identity graph <strong>stays</strong> updated even when the data changes.</p></blockquote><p>When I open sourced Zingg, I felt that it was my life&#8217;s best work and wondered if I would ever work on anything more challenging. While building the incremental update, I realized that this functionality is far tougher to build than anything else I had built so far. 
Orders of magnitude tougher than the <a href="https://docs.zingg.ai/zingg0.4.0/stepbystep/createtrainingdata/label">active learning aspects of Zingg</a>, or the <a href="https://docs.zingg.ai/zingg0.4.0/zmodels">blocking model</a>. There are many algorithmic and performance aspects to consider. Even chalking out the test strategy and test data proved to be quite a challenge.</p><p>After intensive testing on internally generated datasets, we released the <a href="https://docs.zingg.ai/zingg0.4.0/stepbystep/runincremental">runIncremental</a> phase to our early customers. We knew our test data could not mimic every possible scenario, and though we were confident of the algorithms, we suspected there were edge cases we had not envisaged. We asked our customers to test the incremental flows on their data and to report any inconsistencies in matching or cluster assignment. We were thrilled to get an extremely positive response!</p><p>Here is an excerpt from one of their test reports:</p><blockquote><p>Each test scenario was carefully constructed to challenge the model and verify the expected behavior. The results were consistently positive, demonstrating the model's precision and adaptability in every test case. Its ability to accurately handle minor data modifications and correctly process entirely new, atypical entries was particularly noteworthy. Equally impressive was its capability to manage entries that aligned with several clusters without compromising accuracy and its efficiency in reverting changes to their original form. </p><p>The success of these tests is an essential step on the way to releasing the incremental Zingg models in production.</p></blockquote><p>We are super proud to have come so far, and happy to roll it out to <strong>all</strong> our customers. If the idea of an identity graph on your own data within your own data lake and warehouse appeals to you, why not hit reply and let us grab a coffee? 
:-) </p>]]></content:encoded></item><item><title><![CDATA[Hello Zingg ID!]]></title><description><![CDATA[Cross reference every entity consistently throughout the enterprise.]]></description><link>https://www.learningfromdata.zingg.ai/p/hello-zingg-id</link><guid isPermaLink="false">https://www.learningfromdata.zingg.ai/p/hello-zingg-id</guid><dc:creator><![CDATA[Sonal Goyal]]></dc:creator><pubDate>Sat, 08 Jun 2024 11:50:27 GMT</pubDate><enclosure url="https://images.unsplash.com/photo-1609770231080-e321deccc34c?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxNXx8cHJpbWFyeSUyMGtleSUyMGlkZW50aWZpZXJ8ZW58MHx8fHwxNzE3ODQ2ODU2fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://images.unsplash.com/photo-1609770231080-e321deccc34c?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxNXx8cHJpbWFyeSUyMGtleSUyMGlkZW50aWZpZXJ8ZW58MHx8fHwxNzE3ODQ2ODU2fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://images.unsplash.com/photo-1609770231080-e321deccc34c?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxNXx8cHJpbWFyeSUyMGtleSUyMGlkZW50aWZpZXJ8ZW58MHx8fHwxNzE3ODQ2ODU2fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 424w, https://images.unsplash.com/photo-1609770231080-e321deccc34c?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxNXx8cHJpbWFyeSUyMGtleSUyMGlkZW50aWZpZXJ8ZW58MHx8fHwxNzE3ODQ2ODU2fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 848w, 
https://images.unsplash.com/photo-1609770231080-e321deccc34c?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxNXx8cHJpbWFyeSUyMGtleSUyMGlkZW50aWZpZXJ8ZW58MHx8fHwxNzE3ODQ2ODU2fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1272w, https://images.unsplash.com/photo-1609770231080-e321deccc34c?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxNXx8cHJpbWFyeSUyMGtleSUyMGlkZW50aWZpZXJ8ZW58MHx8fHwxNzE3ODQ2ODU2fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1456w" sizes="100vw"><img src="https://images.unsplash.com/photo-1609770231080-e321deccc34c?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxNXx8cHJpbWFyeSUyMGtleSUyMGlkZW50aWZpZXJ8ZW58MHx8fHwxNzE3ODQ2ODU2fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080" width="3765" height="4706" data-attrs="{&quot;src&quot;:&quot;https://images.unsplash.com/photo-1609770231080-e321deccc34c?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxNXx8cHJpbWFyeSUyMGtleSUyMGlkZW50aWZpZXJ8ZW58MHx8fHwxNzE3ODQ2ODU2fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:4706,&quot;width&quot;:3765,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;silver skeleton key on black surface&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="silver skeleton key on black surface" title="silver skeleton key on black surface" 
srcset="https://images.unsplash.com/photo-1609770231080-e321deccc34c?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxNXx8cHJpbWFyeSUyMGtleSUyMGlkZW50aWZpZXJ8ZW58MHx8fHwxNzE3ODQ2ODU2fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 424w, https://images.unsplash.com/photo-1609770231080-e321deccc34c?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxNXx8cHJpbWFyeSUyMGtleSUyMGlkZW50aWZpZXJ8ZW58MHx8fHwxNzE3ODQ2ODU2fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 848w, https://images.unsplash.com/photo-1609770231080-e321deccc34c?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxNXx8cHJpbWFyeSUyMGtleSUyMGlkZW50aWZpZXJ8ZW58MHx8fHwxNzE3ODQ2ODU2fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1272w, https://images.unsplash.com/photo-1609770231080-e321deccc34c?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxNXx8cHJpbWFyeSUyMGtleSUyMGlkZW50aWZpZXJ8ZW58MHx8fHwxNzE3ODQ2ODU2fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Photo by Amol Tyagi on <a href="https://unsplash.com">Unsplash</a></figcaption></figure></div><p>Imagine we have multiple records of a customer spread across ticketing, billing, offline stores, and web properties. This is a common scenario we witness every day in our data land. Different sources of data come with multiple versions of truth. When they land centrally in a warehouse or data lake, it's hard to figure things out. Revenue attribution and lifetime value cannot be calculated accurately. Nor can we estimate churn or confidently calculate how many new customers were added last quarter. Increasing revenue through personalized offers and exploring opportunities to cross-sell and upsell will not be possible either. Let alone anti-money-laundering or risk. Visualize a multimillion-dollar investment in a GenAI customer service agent. How far can it go using this fragmented data?</p><p>To establish the right data foundation, analyze and report on business-critical metrics, and productionize AI and ML technology, we will need to unify such records and build a trusted single source of truth for the customer.</p><p>Enter entity resolution. Variations in attributes can all be resolved, and clusters of matching records can be created. The Zingg Community Version offers robust clustering and fuzzy matching, which bring these records together with a unique Z_CLUSTER field. Matching records share the same Z_CLUSTER. 
Voila, there is a way to say who is who!</p><p>What happens when we want to track the customer over time? Or build a holistic view? Or establish their journey, reference them, and search for them in different enterprise systems? When there is no existing identifier for the customer, we need a way to assign one. Kind of like an SSN or a passport number. Or a primary key in the database.</p><p>The Z_CLUSTER is extremely powerful and useful. But it is not persistent. There is no guarantee that it will not change in the next Zingg run. If we stream our resolved records into our source systems and use Z_CLUSTER as the primary key, the key would break whenever its value changes.</p><p>As we worked across different Zingg deployments, this was an ask that we heard again and again. A globally unique persistent identifier that acts like a reference key to the entity and all its records. The same key even when the record changes. The unambiguous way to refer to an entity across multiple enterprise data systems, operational processes, and workflows. Even as new records come in and earlier ones get updated.</p><p>This is exactly the ZINGG_ID in the <a href="https://www.zingg.ai/company/zingg-enterprise">Zingg Enterprise</a> version. Resolved, unified records across the entire enterprise which can be referenced through a persistent key. A help desk agent could use it to look up the entire customer interaction journey. A GPT-based chatbot operating on patient data would work seamlessly with all the information at its disposal.</p><p>It's been fun solving this complex piece of the puzzle, and it's extremely satisfying to witness the value it generates for Zingg Enterprise customers.</p><p>Sounds interesting, or something you care about? 
Do get in <a href="https://www.zingg.ai/company/contact">touch</a>!</p><p>  </p>]]></content:encoded></item><item><title><![CDATA[Deterministic Matching Vs Probabilistic Matching]]></title><description><![CDATA[Why not both together at the same time?]]></description><link>https://www.learningfromdata.zingg.ai/p/deterministic-matching-vs-probabilistic</link><guid isPermaLink="false">https://www.learningfromdata.zingg.ai/p/deterministic-matching-vs-probabilistic</guid><dc:creator><![CDATA[Sonal Goyal]]></dc:creator><pubDate>Sat, 16 Mar 2024 14:09:14 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!CVPk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb8fc6f8-41d5-4170-8e98-af269b279d05_800x512" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CVPk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb8fc6f8-41d5-4170-8e98-af269b279d05_800x512" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CVPk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb8fc6f8-41d5-4170-8e98-af269b279d05_800x512 424w, https://substackcdn.com/image/fetch/$s_!CVPk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb8fc6f8-41d5-4170-8e98-af269b279d05_800x512 848w, https://substackcdn.com/image/fetch/$s_!CVPk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb8fc6f8-41d5-4170-8e98-af269b279d05_800x512 1272w, 
https://substackcdn.com/image/fetch/$s_!CVPk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb8fc6f8-41d5-4170-8e98-af269b279d05_800x512 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CVPk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb8fc6f8-41d5-4170-8e98-af269b279d05_800x512" width="512" height="512" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fb8fc6f8-41d5-4170-8e98-af269b279d05_800x512&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:512,&quot;width&quot;:512,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CVPk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb8fc6f8-41d5-4170-8e98-af269b279d05_800x512 424w, https://substackcdn.com/image/fetch/$s_!CVPk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb8fc6f8-41d5-4170-8e98-af269b279d05_800x512 848w, https://substackcdn.com/image/fetch/$s_!CVPk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb8fc6f8-41d5-4170-8e98-af269b279d05_800x512 1272w, 
https://substackcdn.com/image/fetch/$s_!CVPk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb8fc6f8-41d5-4170-8e98-af269b279d05_800x512 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><p>In the entity resolution space, we refer to exact matching on attributes as deterministic matching. 
If the application systems consistently collect and persist trusted, known identifiers like a Social Security Number, deterministic matching can resolve such records and help us identify the true individual. </p><p>In most real-world systems, different systems within an organisation capture different attributes. Let us imagine a retail store with both online and offline presence. A web support system would capture the email correctly while also logging the name; a call centre would mostly have a correct phone number, but other details may not be as reliable. The offline store would have some or all of the details, as accurately as the human behind the computer tasked with entering them can type them in. Thus some of the individual records have identifiers, like phone and email in the above case, which may match exactly, while many of the records have names and addresses with all kinds of variations, typos, abbreviations, etc. Assessing the extent of such variations for an entity and reconciling the fuzzy and inexact matches is known as Probabilistic Matching.</p><p>At a simplistic level, one can think of deterministic matching as a database join on trusted id columns like passport number or email. Probabilistic Matching involves statistical and ML-based techniques to determine if a given record pair is indeed the same real-world entity. If we search these terms, there are multiple articles around Deterministic Vs Probabilistic Matching, pitting one against the other. Since the approaches to solve deterministic matching are very different from the ones for probabilistic matching, these techniques are often viewed as opposing. </p><p>However, we have a different view.</p><p>As we dived deeper and looked at multiple datasets, we saw that in most cases, data contained multiple signals, some with trusted identifiers and most without. Even when there are identifiers, there tends to be more than one. 
For example, one system could make the driver&#8217;s license id mandatory and the passport number optional, while another system could have the rule reversed. A third system may have both of these ids as optional. If we match deterministically on passport and license in this case, we would only resolve some entities, not all. Worse, a person who has provided only a license id in the third system would match with the first, and the record in the second system would go unmatched. </p><p>This has serious ramifications. In an e-commerce store, the inability to reconcile effectively would jeopardize the million-dollar investment in recommender systems that consume the customer preferences. In an anti-money-laundering system, the ability to catch fraud is directly proportional to the quality of the matching. An insurer cannot understand risk at a household or person level if records remain unresolved. </p><p>Thus, we need to solve for both deterministic and probabilistic matching <strong>at the same time</strong>. An either/or approach with respect to deterministic and probabilistic matching would lead to identifying records with identifiers as a different entity than those without. Why not use every single data point of the record? There is tremendous value to be unlocked by looking at the problem holistically. If a deterministic match can be found, use that. If not, use the other attributes. Essentially, weave them all together to ensure that the probabilistic and deterministic matching records are resolved to a single entity. </p><p>The above is a simplistic take and there is a lot of software engineering involved under the hood to build this out. Effort notwithstanding, we are convinced that it is the right approach. In fact, we have gone all out while building this feature, with support for multiple combinations of attributes to make sure no record which has a match remains unmatched. 
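</p><p>To make the idea concrete, here is a minimal Python sketch, not Zingg&#8217;s actual implementation. The records, the <code>deterministic_match</code>, <code>probabilistic_match</code> and <code>resolve</code> helpers and the 0.85 threshold are all hypothetical, and a simple string similarity from the standard library stands in for a trained model. Pairs linked by either kind of match are merged into one entity.</p><pre><code>from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical records from three systems; None marks a missing identifier.
records = [
    {"id": 1, "name": "Jane Doe", "license": "L-100", "passport": None},
    {"id": 2, "name": "Jayne Doe", "license": None, "passport": "P-900"},
    {"id": 3, "name": "Jane Do", "license": "L-100", "passport": None},
    {"id": 4, "name": "Rahul Mehta", "license": None, "passport": "P-555"},
]


def deterministic_match(a, b):
    """Exact match on any trusted identifier that both records carry."""
    return any(
        a[key] is not None and a[key] == b[key]
        for key in ("license", "passport")
    )


def probabilistic_match(a, b, threshold=0.85):
    """Fuzzy name similarity; an assumed stand-in for a trained model."""
    ratio = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    return ratio >= threshold


def resolve(rows):
    """Merge records linked by either kind of match (union-find)."""
    parent = {row["id"]: row["id"] for row in rows}

    def find(x):
        # Walk up to the cluster root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(rows, 2):
        if deterministic_match(a, b) or probabilistic_match(a, b):
            parent[find(b["id"])] = find(a["id"])

    clusters = {}
    for row in rows:
        clusters.setdefault(find(row["id"]), []).append(row["id"])
    return sorted(clusters.values())


entities = resolve(records)
print(entities)
</code></pre><p>Here record 3 matches record 1 on the license id, while record 2 shares no identifier with either and is pulled in only through the fuzzy name match, so all three resolve to one entity.</p><p>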
Deterministic and probabilistic matching are better together, and early Zingg Enterprise customers see an immediate jump in match accuracy. </p><p>Is this something you care about too? Hit reply and let&#8217;s spark a conversation!</p><p></p>]]></content:encoded></item><item><title><![CDATA[Zingg Turns Two!]]></title><description><![CDATA[The days were long, but the years have been short.]]></description><link>https://www.learningfromdata.zingg.ai/p/zingg-turns-two</link><guid isPermaLink="false">https://www.learningfromdata.zingg.ai/p/zingg-turns-two</guid><dc:creator><![CDATA[Sonal Goyal]]></dc:creator><pubDate>Sat, 30 Sep 2023 17:09:19 GMT</pubDate><enclosure url="https://images.unsplash.com/photo-1577998474517-7eeeed4e448a?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw3fHxiaXJ0aGRheXxlbnwwfHx8fDE2OTYwOTAwODR8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Two years is a journey, yet it seems like yesterday when I open-sourced Zingg. In the interim, we have done fulfilling work and received a lot of love. Curious what&#8217;s up at our end? 
Read on!</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://images.unsplash.com/photo-1577998474517-7eeeed4e448a?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw3fHxiaXJ0aGRheXxlbnwwfHx8fDE2OTYwOTAwODR8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://images.unsplash.com/photo-1577998474517-7eeeed4e448a?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw3fHxiaXJ0aGRheXxlbnwwfHx8fDE2OTYwOTAwODR8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 424w, https://images.unsplash.com/photo-1577998474517-7eeeed4e448a?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw3fHxiaXJ0aGRheXxlbnwwfHx8fDE2OTYwOTAwODR8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 848w, https://images.unsplash.com/photo-1577998474517-7eeeed4e448a?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw3fHxiaXJ0aGRheXxlbnwwfHx8fDE2OTYwOTAwODR8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1272w, https://images.unsplash.com/photo-1577998474517-7eeeed4e448a?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw3fHxiaXJ0aGRheXxlbnwwfHx8fDE2OTYwOTAwODR8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1456w" sizes="100vw"><img src="https://images.unsplash.com/photo-1577998474517-7eeeed4e448a?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw3fHxiaXJ0aGRheXxlbnwwfHx8fDE2OTYwOTAwODR8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080" width="430" height="644.9510585021625" 
data-attrs="{&quot;src&quot;:&quot;https://images.unsplash.com/photo-1577998474517-7eeeed4e448a?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw3fHxiaXJ0aGRheXxlbnwwfHx8fDE2OTYwOTAwODR8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:6589,&quot;width&quot;:4393,&quot;resizeWidth&quot;:430,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;cake with lit sparkling stick&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="cake with lit sparkling stick" title="cake with lit sparkling stick" srcset="https://images.unsplash.com/photo-1577998474517-7eeeed4e448a?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw3fHxiaXJ0aGRheXxlbnwwfHx8fDE2OTYwOTAwODR8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 424w, https://images.unsplash.com/photo-1577998474517-7eeeed4e448a?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw3fHxiaXJ0aGRheXxlbnwwfHx8fDE2OTYwOTAwODR8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 848w, https://images.unsplash.com/photo-1577998474517-7eeeed4e448a?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw3fHxiaXJ0aGRheXxlbnwwfHx8fDE2OTYwOTAwODR8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1272w, https://images.unsplash.com/photo-1577998474517-7eeeed4e448a?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw3fHxiaXJ0aGRheXxlbnwwfHx8fDE2OTYwOTAwODR8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" 
class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>First things first.</p><p>We now have paying customers for our enterprise products! There is nothing more exhilarating than seeing our life&#8217;s work come alive, and being paid so that we can continue to build and move in the direction we aspire to. In the startup world, we hold a lot of discussions around product market fit, product founder fit, building things people want, user research, and other associated activities. Product-led startups like ours get the first set of customers through the urgent need of a visionary early adopter. 
The <strong>visionary early adopter</strong> is the person in a team who has a <strong>pressing need</strong> along with the <strong>foresight</strong> to solve the problem and the <strong>confidence</strong> to trust a new team and an innovative product. <em>And</em> who can influence and move budgets to solve that need. I am very glad to have this opportunity to work closely with this first set of customers who are helping us shape the product. Not just the features, but also the pricing. </p><p>Thank you for supporting us so that we can help you!</p><p>We are building three flavors of Zingg Enterprise: Zingg Snowflake, Zingg Databricks and Zingg Spark, all with the same roots in Zingg open source. Each carries forward Zingg OSS and gives the required ammunition to build and use the persistent and evolving enterprise identity graph. Zingg Enterprise creates globally unique identifiers for entities, explains model behaviour, and maintains and enhances the identity graph through resolution of new and updated records against previously matched records. Of course, there is also the improved accuracy and scalability :-) </p><p>On top of this, we have added platform-specific features. The Snowflake version runs inside Snowflake without any other infrastructure or software dependency. The Databricks flavor adds Unity Catalog support and a lot of other Databricks goodies. The Spark one has the enterprise features and runs on all other Spark environments like EMR. </p><p>On the open source side, we just hit 1000 downloads per month on PyPI! The rate of downloads has doubled over the past few months, and we now have a Slack group touching 450 people, from varied organizations like LinkedIn, Maersk, Newfront, Samba TV, Canadian Football League, Heb and John Deere, to name a few. 
We have federal agencies like the Pandemic Response Centre, and established startups like <a href="https://www.zingg.ai/case-studies/powering-sustainability-through-brand-and-product-entity-resolution">Provenance.org</a>, Redica Systems, Ninjacart and Cherre. We have a good number of <a href="https://www.zingg.ai/case-studies/transparency-in-north-carolina-campaign-finance-data">nonprofits and data consultancies</a> as well as friends from Databricks, Snowflake and Exasol. The strong code contributions from these partner companies have helped us to support their users more deeply. </p><p>We are clearly onto something :-) </p><p>I am truly amazed by the depth of use cases we are seeing across big and small organizations in varying domains. Zingg is resolving customer identity for marketing and personalization use cases in retail. It is resolving patient identity for comprehensive health, and healthcare provider identity to understand drug recommendation patterns. It is resolving the identity of an insurance policyholder, for compliance and risk, and also from the point of view of a reinsurer. <a href="https://www.learningfromdata.zingg.ai/p/entity-resolution-and-the-sea">Identity resolution</a> is much broader than we thought initially. It is the identity of members of religious establishments, of donors or recipients. It is the identity of the fans of a football club, so that the right information goes out at the right time for a game of their liking. B2B accounts, which buy from multiple subsidiaries in different geographies, need identity resolution as well. So do <a href="https://www.databricks.com/blog/using-images-and-metadata-product-fuzzy-matching-zingg">products on an e-commerce website</a> for competitive analysis against other sites. </p><p>Identity is anything and everything a business cares about. </p><p>We are happy that we built Zingg so that users can work on their schema and train on their data without specialized ML skills. All in a privacy-centric way. 
This has allowed us to see Zingg in action in the different use cases we are discovering every passing day.</p><p>To each of you who has used Zingg, mentioned us to friends and coworkers or checked us out - thank you for your time and we hope we have removed some data troubles from your life. </p><p>I also want to thank our investors and advisors, who backed Zingg when no one had heard of us, and who have been generous with their time and support.</p><p>These two years have been a tremendous learning experience, and our plans are only getting bigger. There is a lot of AI yet to be applied to enterprise data, and we would love to create more value for our users and customers. For that, we need to continue to work closely with them, ship more and more often, keep the momentum going and work twice as hard. </p><p>We are excited for the path ahead. Wish us luck!</p>]]></content:encoded></item><item><title><![CDATA[From Half A Day To Half An Hour! Performance Tuning Snowpark For Identity Resolution On Snowflake]]></title><description><![CDATA[Scaling to millions of records on the x-small warehouse]]></description><link>https://www.learningfromdata.zingg.ai/p/performance-tuning-snowpark-for-identity</link><guid isPermaLink="false">https://www.learningfromdata.zingg.ai/p/performance-tuning-snowpark-for-identity</guid><dc:creator><![CDATA[Sonal Goyal]]></dc:creator><pubDate>Sat, 29 Jul 2023 04:55:12 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!RUUK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7aa9710e-7304-4026-b287-6fd1b293be95_1857x876.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>We have spent the last few weeks battle-testing the <a href="https://www.zingg.ai/company/zingg-enterprise-snowflake">Zingg Identity Resolution product for Snowflake</a> for deployment and scale. 
Since it was built as an entry for the Snowflake Challenge and on top of the <a href="https://github.com/zinggAI/zingg/">Zingg Open Source</a> version, we already had a pretty solid application that could resolve entities at scale within Snowflake. So we did not expect any major hiccups in terms of the overall flow, model building, and prediction. Developer confidence, bolstered by feedback from users, you may say ;-) However, since the product leverages Snowpark instead of Spark, we did expect a bit of performance tuning here and there. In this post, I want to share the challenges we faced and the changes we made to overcome them. Hope this is useful to others building on Snowflake. </p><p>To begin with, we had a 500k test record set running on the x-small Snowflake warehouse in roughly 12 hours. Though the code was not tuned for Snowflake performance, the job ran successfully to completion without hiccups and we used this as our baseline. Entity Resolution is a tough problem, and harder to do at scale. We wanted to get maximum mileage out of Snowflake and ensure Zingg ran efficiently for our customers. Our philosophy is that the customer should not have to pay the computing bill for any inefficient code. Hence we always aim to extract maximum performance without throwing more hardware at the problem. Unless we absolutely have to. We also had a target dataset of roughly 2.5 million records, 5 times the baseline. This was a real customer dataset, with non-uniform distribution across fields and with triple the columns of our test set. When we do entity resolution, a 5-fold increase in the input with the same number of columns and match types leads to a 25-fold increase in the comparison complexity in the worst case, because the number of candidate pairs grows with the square of the record count. As we have robust index learning based on training data in Zingg along with tons of performance optimizations, we expected the matching job to be somewhere close to 18 hours, which we were planning to tune. 
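</p><p>The 25-fold figure follows from the number of record pairs a brute-force comparison would have to look at. A quick sketch of the arithmetic, where <code>candidate_pairs</code> is a hypothetical helper and the learned index is ignored, so this is only the worst case:</p><pre><code>def candidate_pairs(n):
    """Number of unordered record pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


baseline = candidate_pairs(500_000)    # the 500k test set
target = candidate_pairs(2_500_000)    # the 2.5 million record customer set
growth = target / baseline

print(f"{baseline:,} pairs vs {target:,} pairs, growth of about {growth:.1f}x")
</code></pre><p>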
However, when we ran this dataset on the x-small Snowflake warehouse, we started seeing intermittent session timeout errors after roughly 8 hours. </p><h5><code>com.snowflake.snowpark.SnowparkClientException: Error Code: 0408, Error message: Your Snowpark session has expired. You must recreate your session.<br>Authentication token has expired. The user must authenticate again.</code></h5><p></p><p>Running the Snowpark client as a background process or with nohup helped, and we were able to see the job continue to run beyond 24 hours. Surely this performance was not something we wanted to ship. So we killed the run after identifying the first set of bottlenecks. Our approach was to zero in on the hotspots and solve them surgically in the order of highest impact first. The highest impact usually meant the functionality taking the most time, but in some cases, we also fixed steps that came earlier in the flow and were slow enough to keep us from getting past a particular stage in a reasonable time. We used the query profiler by Snowflake extensively for understanding and improving the flow, along with some traces and timing logs in our own code base. 
The query profiler gives a lot of valuable statistics about the query plan, function timing, and which part of the code triggered the query. It also provides partition-level and row-level information, so it was our go-to source for hotspot identification as well as figuring out if our changes worked as expected. Here is a snapshot of how it looks.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RUUK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7aa9710e-7304-4026-b287-6fd1b293be95_1857x876.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RUUK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7aa9710e-7304-4026-b287-6fd1b293be95_1857x876.png 424w, https://substackcdn.com/image/fetch/$s_!RUUK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7aa9710e-7304-4026-b287-6fd1b293be95_1857x876.png 848w, https://substackcdn.com/image/fetch/$s_!RUUK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7aa9710e-7304-4026-b287-6fd1b293be95_1857x876.png 1272w, https://substackcdn.com/image/fetch/$s_!RUUK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7aa9710e-7304-4026-b287-6fd1b293be95_1857x876.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RUUK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7aa9710e-7304-4026-b287-6fd1b293be95_1857x876.png" width="1456" height="687" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7aa9710e-7304-4026-b287-6fd1b293be95_1857x876.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:687,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:520973,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!RUUK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7aa9710e-7304-4026-b287-6fd1b293be95_1857x876.png 424w, https://substackcdn.com/image/fetch/$s_!RUUK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7aa9710e-7304-4026-b287-6fd1b293be95_1857x876.png 848w, https://substackcdn.com/image/fetch/$s_!RUUK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7aa9710e-7304-4026-b287-6fd1b293be95_1857x876.png 1272w, https://substackcdn.com/image/fetch/$s_!RUUK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7aa9710e-7304-4026-b287-6fd1b293be95_1857x876.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>At a high level, here is how the Zingg flow is structured:</p><p>Input Data &#8594; Index &#8594; Join On Index to get Approximate Similar Pairs &#8594; Pairwise Similarity Features &#8594; Predict Matches &#8594; Graph Processing</p><p>We learn an index on the input data to create approximately similar pairs. We then compute similarity features on these pairs and predict if the pair is a match or not. This output is clustered using graph algorithms to convert matched pairs to clusters.</p><p>Our logs indicated that the biggest bottleneck was the Predict Match flow. The predict flow leveraged a Snowpark <a href="https://docs.snowflake.com/en/developer-guide/snowpark/python/creating-udtfs">user-defined table function</a> to return two columns. The first was whether a given pair was a match or not. The second was the score indicating the similarity of records in the pair with each other. The higher the score, the better the match. 
Since we returned more than one value from this function, we built it as a UDTF instead of a UDF. However, this UDTF was consuming the bulk of our processing time. Given that we were not really creating multiple rows per input row but returning multiple values per row, all the lateral joins we saw on the query plan did not make sense. It seemed a worthwhile experiment to convert the UDTF to a <a href="https://docs.snowflake.com/en/developer-guide/snowpark/python/creating-udfs">user-defined function</a>, concatenating the prediction and the score in one value and extracting them out through another UDF. This worked like a charm and immediately cut down the processing on our baseline dataset from 12 hours to 5 and a half hours. More than a 2x improvement! This was a massive gain, but we clearly had more work to do. </p><p>We then set out to optimize the queries that Snowpark generated from the code we had written. The bottlenecks here were mostly while indexing and generating the near-similar pairs. We saw that if we did computations on a <a href="https://docs.snowflake.com/en/developer-guide/snowpark/java/working-with-dataframes">Snowpark Dataframe</a> to build the index and then self-joined the Dataframe on the index, the index computations were repeated. This was counterintuitive to us, as we were using the same instance of the Dataframe. We tried to <a href="https://docs.snowflake.com/en/developer-guide/snowpark/java/working-with-dataframes">clone the Dataframe</a> as advised but that did not work. Finally, when we cached the Dataframe, the query plan was altered and our costly computations happened only once. The lesson here was that if the computation is costly and the Dataframe has to be used in a self-join, it is wise to cache the intermediate results. However, this has to be used with caution, as there is an overhead in temporary table creation and disk spillage. </p><p>Another optimization we did was on the range scan. 
We join the records on the computed index to build the pairs. To avoid a pair like (Record A, Record B) repeating as (Record B, Record A), we filter the pairs on ids. This is critical as the subsequent similarity feature processing is expensive. However, we saw that the join became a cartesian join with filtering pushed out to much later in the flow. This led to slow performance along with a suboptimal flow, making similarity feature computations that we did not need. We cached the Dataframe arising out of the join and changed the id1 &gt; id2 range query to id1-id2 &gt; 0, which pushed the filter to the right point in our flow and changed the earlier cartesian join to a normal join. </p><p>Together, the above changes shaved off roughly 20% of runtime on our baseline dataset, and we were now down to 4 hours and 30 minutes for the entity resolution workflow on 500,000 records. </p><p>With these major fixes in, we profiled the run again, closely looking through the hotspots, and to our utter dismay, found that the predict function was still the most time-consuming part of the entity resolution workflow. This is a Python UDF that uses machine learning for the prediction. A pre-built model learnt from the training data is passed as an argument to the function along with multiple similarity features of the pair. This UDF predicts whether the pair is a match or not along with their similarity score. Since the features are different for different datasets, we do not know beforehand the shape of the feature columns. Hence we construct an array over the multiple features so that we can resolve different types of entities. In a good-bad way, there was no code change we could do to optimize this function. Our code was concise and to the point, and there wasn&#8217;t anything to shave off here. So we focused on the Snowflake invocation of this UDF. 
We learned that <a href="https://docs.snowflake.com/en/developer-guide/udf/python/udf-python-batch">Snowflake advocates vectorizing UDFs</a>, especially if they involve matrices and other machine learning workloads. The vectorized UDFs are passed batches of rows, and return batches with results. This made immediate sense as invoking the UDF row by row was surely not going to help with the inherent matrix manipulations. Though we wanted to do this right away, there were quite a few challenges. </p><p>It was difficult for us to define the function signature as the number and types of columns were dynamic. Earlier we used to pass an array, so it had worked. After some thought, we were able to build the signature and define the function on the fly using the metadata we already had while creating the similarity features. Passing the model became tricky, however, and we were not able to deserialize it in the function no matter what we tried. It seemed there were some extra bytes added when the model was passed as a column received by the UDF as part of a Pandas Dataframe. The same code had been working earlier row by row, so the internal conversion of Snowpark Dataframe rows to a Pandas Dataframe seemed to be the likely cause. There was another complexity. We were invoking the Python function through the Java Snowpark API, so there were too many layers to unravel here. We went over the open-source code of Snowpark, we tried to add function logging and tracing, we tried to debug calling the UDF from Python, and we read all that we could... we pretty much did anything and everything we could think of! </p><p>After struggling for a few anxious days, we changed our approach and began persisting the model to a Snowflake stage. Our UDF would import that location and read and use the model. This finally solved the serde issue we were seeing earlier, along with reducing the data sent to the function. 
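</p><p>The shape of that change can be sketched in plain Python, under some loud assumptions: a temporary directory stands in for the Snowflake stage, a dictionary of weights stands in for the trained model, and <code>save_model</code>, <code>load_model</code> and <code>predict_batch</code> are hypothetical names rather than Snowpark APIs. The model is read from its location once and reused, and each call scores a whole batch of rows instead of receiving the serialized model with every row.</p><pre><code>import os
import pickle
import tempfile
from functools import lru_cache

# Hypothetical stand-in for a trained model: feature weights and a cut-off.
MODEL = {"weights": [0.6, 0.4], "threshold": 0.5}


def save_model(directory):
    """Persist the model once; the directory stands in for a stage."""
    path = os.path.join(directory, "model.pkl")
    with open(path, "wb") as handle:
        pickle.dump(MODEL, handle)
    return path


@lru_cache(maxsize=1)
def load_model(path):
    """Deserialize the model on first use and reuse it for later batches."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def predict_batch(path, feature_rows):
    """Score a whole batch of similarity-feature rows in one call."""
    model = load_model(path)
    results = []
    for features in feature_rows:
        score = sum(w * f for w, f in zip(model["weights"], features))
        results.append((score >= model["threshold"], round(score, 2)))
    return results


with tempfile.TemporaryDirectory() as stage_dir:
    model_path = save_model(stage_dir)
    first = predict_batch(model_path, [[0.9, 0.8], [0.1, 0.3]])
    second = predict_batch(model_path, [[0.2, 0.9]])

print(first, second)
</code></pre><p>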
In hindsight, we should have tried this approach first, but at that time the thought was to only change what was absolutely necessary. The code for passing the model had been working for a while, and while tuning performance, it is important to control the changes. So we weren&#8217;t too wrong either. </p><p>It got a bit late in the day when we ran the build with these changes, and when the test dataset ran in half an hour, we were sure something was crazily wrong. Maybe we had left the bulk of our processing code commented out in the run by mistake? We saw the output table, the results were there, and the predictions looked accurate. It was hard to believe. Too weary to think further, we called it a day.</p><p>The next day we double-checked the output and everything was in order. We cross-checked. Yes, it looks fine. No, we are not sure. Not yet. Let us run it once more. Boom, the job was finished by the time we finished our daily catchup + coffee. </p><p>28 minutes. From 12 hours. From half a day to half an hour! More than 24x. </p><p>On an x-small warehouse resolving identities across multiple columns for 500,000 rows. Wow! High Five! Now we need a break. Mission Impossible Dead Reckoning, here we come! </p><p>It was now time to run the customer data set in their warehouse. Brimming with confidence, we fire the job and monitor it from time to time. Umm... it is not working. The predict function is still a bottleneck. For some reason, the batch size Snowflake has automatically chosen on this dataset is 10, while it was sending 1000 rows at a time to the function on our test data. Possibly because there are way too many columns here compared to our test data? The <a href="https://docs.snowflake.com/en/developer-guide/udf/python/udf-python-batch#setting-a-target-batch-size">batch size hint is just a hint</a>, but that seems to be the only ammunition we have right now. Will it work? So we modify the UDF and add the max batch size annotation. 
This has the desired effect, and we have some great results for our customers. </p><p>There were a few other changes we had to make to get here. For example, the Dataframe.show() function would cause an out-of-memory error on the client machine if the data was big. While working on performance optimization, it is critical that functional parts of the code work flawlessly. We had used show() liberally in our code to verify the logic, assuming that it would limit the records sent to the Snowpark client by limiting rows in the warehouse itself. But show() works differently. It fetches everything to the client and then truncates to 10 rows. This was a bit counterintuitive, but an easy fix. </p><p>Overall we are super excited about the gains we have made and so happy that we have come this far. We hope the lessons above will help folks who are building on Snowflake and Snowpark. Feel free to talk to us if you are struggling with Snowpark/Snowflake performance issues. We would love to help and share our learnings! And if you know someone who needs entity resolution on Snowflake, please do send them our way :-) </p><p>Ciao!</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.learningfromdata.zingg.ai/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Sonal&#8217;s Newsletter! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Building Identity Resolution on Snowflake Using Snowpark]]></title><description><![CDATA[Crossing boundaries of languages and technologies to get stuff done.]]></description><link>https://www.learningfromdata.zingg.ai/p/building-identity-resolution-on-snowflake</link><guid isPermaLink="false">https://www.learningfromdata.zingg.ai/p/building-identity-resolution-on-snowflake</guid><dc:creator><![CDATA[Sonal Goyal]]></dc:creator><pubDate>Mon, 19 Jun 2023 12:32:35 GMT</pubDate><enclosure url="https://images.unsplash.com/photo-1613481950732-c1dcddcb410e?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwyfHxzbm93Zmxha2V8ZW58MHx8fHwxNjg3MTc2ODY2fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://images.unsplash.com/photo-1613481950732-c1dcddcb410e?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwyfHxzbm93Zmxha2V8ZW58MHx8fHwxNjg3MTc2ODY2fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://images.unsplash.com/photo-1613481950732-c1dcddcb410e?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwyfHxzbm93Zmxha2V8ZW58MHx8fHwxNjg3MTc2ODY2fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 424w, 
https://images.unsplash.com/photo-1613481950732-c1dcddcb410e?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwyfHxzbm93Zmxha2V8ZW58MHx8fHwxNjg3MTc2ODY2fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 848w, https://images.unsplash.com/photo-1613481950732-c1dcddcb410e?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwyfHxzbm93Zmxha2V8ZW58MHx8fHwxNjg3MTc2ODY2fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1272w, https://images.unsplash.com/photo-1613481950732-c1dcddcb410e?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwyfHxzbm93Zmxha2V8ZW58MHx8fHwxNjg3MTc2ODY2fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1456w" sizes="100vw"><img src="https://images.unsplash.com/photo-1613481950732-c1dcddcb410e?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwyfHxzbm93Zmxha2V8ZW58MHx8fHwxNjg3MTc2ODY2fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080" width="2520" height="2448" data-attrs="{&quot;src&quot;:&quot;https://images.unsplash.com/photo-1613481950732-c1dcddcb410e?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwyfHxzbm93Zmxha2V8ZW58MHx8fHwxNjg3MTc2ODY2fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2448,&quot;width&quot;:2520,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;silver and diamond studded cross pendant&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="silver and diamond studded cross pendant" title="silver and diamond studded cross pendant" 
srcset="https://images.unsplash.com/photo-1613481950732-c1dcddcb410e?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwyfHxzbm93Zmxha2V8ZW58MHx8fHwxNjg3MTc2ODY2fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 424w, https://images.unsplash.com/photo-1613481950732-c1dcddcb410e?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwyfHxzbm93Zmxha2V8ZW58MHx8fHwxNjg3MTc2ODY2fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 848w, https://images.unsplash.com/photo-1613481950732-c1dcddcb410e?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwyfHxzbm93Zmxha2V8ZW58MHx8fHwxNjg3MTc2ODY2fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1272w, https://images.unsplash.com/photo-1613481950732-c1dcddcb410e?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwyfHxzbm93Zmxha2V8ZW58MHx8fHwxNjg3MTc2ODY2fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Photo by <a href="https://unsplash.com/@zmachacek">Zden&#283;k Mach&#225;&#269;ek</a> on <a href="https://unsplash.com">Unsplash</a></figcaption></figure></div>
<p>I have very exciting news to share - we have signed our first customer for <a href="https://www.zingg.ai/company/zingg-enterprise-snowflake">Zingg Enterprise Snowflake</a>! It feels magical, almost surreal. I call it magical not because I did not believe it would happen, but because of the pace and ease of closing the deal. If you have been doing B2B enterprise sales, you probably know what I mean :-) </p><p>Zingg open source has always supported Snowflake as a data source using Apache Spark as the connector. As we talked with our users, we realized that many of them had only SQL or warehouse workloads, and we could really simplify their stack by using Snowflake directly. Over the last year, we watched Snowpark evolve and did multiple experiments to see how viable it would be to build Zingg over Snowpark. This post will detail some of the things we did and learned as part of the process. </p><p>Our first challenge was building <a href="https://docs.zingg.ai/zingg/zmodels">our extensive ML pipeline for entity resolution</a> in Snowpark. We have our own blocking algorithm but rely on Spark&#8217;s MLlib for classification. Snowpark does not have a built-in machine learning library but facilitates ML through third-party packages like scikit-learn. This means that Snowpark Dataframes cannot be passed directly for training models; a conversion to a Pandas Dataframe becomes necessary. This has two issues. 
Firstly, this limits the size of data one can train on. Fortunately for us, as we use <a href="https://docs.zingg.ai/zingg/stepbystep/createtrainingdata/label">active learning to build training sets</a> efficiently, our training data size is quite small. The second issue was the real one for us. Our code is in Java and Scala, so introducing a Python flow for training and prediction became challenging. Zingg does substantial preprocessing and feature engineering, enriching the input data with various signals before training and prediction. Using Python meant our in-memory enriched Dataframe could not be passed directly for training. Eventually, we decided to persist the features into a temporary table, and built a stored procedure in Python to read it and train a classifier. We persist the classifier into a Snowflake table, and load it in a named Python user-defined function, which is invoked from our codebase during inference. Thankfully, a named user-defined function in one language can be easily invoked from another, which is the strategy we have used extensively in our code. </p><p>Snowpark also lacks a graph library, like GraphX or GraphFrames in Apache Spark. Luckily the Snowflake team helped us here and provided us with a Scala library for graph processing. We had our work cut out to integrate this into our code, especially since the Snowpark Scala Dataframes are not interchangeable with the Snowpark Java ones. </p><p>Building a cross-domain ML product introduces a great deal of generalization in the flow. We do not know beforehand the shape of the input dataset - it could be product matching, it could be customer matching, or it could be a B2B account with domain names and other details. Many row-wise operations become tricky without a map equivalent in Snowpark. Hence we built our own version of map, passing all the columns into a VARIANT type array and then getting individual values as necessary. 
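</p><p>The idea behind that map can be sketched in plain Python. This is an analogy only, not Snowpark code: the helper names <code>make_row_mapper</code> and <code>demo_row_mapper</code> and the sample columns are hypothetical. Every column value is packed into one positional array, and a name-to-index lookup built from runtime metadata pulls individual values back out inside the row function.</p>

```python
def make_row_mapper(column_names, row_fn):
    # Column positions are derived at runtime because the schema is not known upfront.
    index_of = {name: position for position, name in enumerate(column_names)}

    def apply(packed_row):
        # packed_row plays the role of the array that holds every column value.
        def get(name):
            return packed_row[index_of[name]]

        return row_fn(get)

    return apply


def demo_row_mapper():
    columns = ["first_name", "last_name", "city"]
    full_name = make_row_mapper(
        columns, lambda get: get("first_name") + " " + get("last_name")
    )
    rows = [["Ada", "Lovelace", "London"], ["Alan", "Turing", "Wilmslow"]]
    return [full_name(row) for row in rows]
```

<p>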
This was a bit of a tough problem, and we were pretty proud when we could get it to work. </p><p>There were some other minor hiccups, like temporary UDFs not getting cleaned up, having to pass a column index instead of a column name to access data in Dataframes, etc. I am sure these will get sorted out as the Snowpark API matures further. </p><p>One of our biggest concerns was that our codebase assumed Apache Spark as the de facto processing engine, meaning that the dependencies were all over the code. Apache Spark is an integral part of Zingg, and is intricately woven into all Zingg operations. So our first thought was to have two codebases, one for Apache Spark and then the same logic ported for Snowpark. All functionality would exist independently in each. There would be some common code, like reading user configuration, that would be shared between the two. As we were planning the Snowflake version of the product to be closed source, the code would anyway have to be in a different repository than the open source one. Hence, why not? This would also let us build the product in less time, as the open source is extensively validated and already in production. Ship faster, <a href="https://www.ycombinator.com/library/40-the-art-of-shipping-early-and-often">they say</a>! We spent a day or two doing global search and replace operations on dependencies and imports to see how this would shape up, and realized soon enough that this was going to be a mess. If we had to fix a bug, let alone add a new feature, we would have to do it in two places. Clearly, this was going to be a recipe for a lot of overwork and technical debt, and I am glad we abandoned this soon enough. </p><p>The next approach was to build up generic classes and then templatize them with Spark or Snowpark equivalents. All the logic remains centrally located in one place, and the processing engine equivalents have their own extensions. 
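</p><p>A rough sketch of that shape, written in Python for brevity although our codebase is Java and Scala. All class names here are hypothetical, and the two toy engines (a list of row dicts and a dict of column lists) merely stand in for Spark and Snowpark Dataframes. The shared logic is written once against abstract operations, and each engine supplies its own implementations.</p>

```python
from abc import ABC, abstractmethod
from typing import Dict, Generic, List, TypeVar

D = TypeVar("D")  # stands for an engine-specific dataframe type


class DistinctCounter(ABC, Generic[D]):
    """Shared logic, written once against abstract dataframe operations."""

    def count_distinct(self, frame, column):
        return self.count(self.distinct(self.select(frame, column)))

    @abstractmethod
    def select(self, frame, column):
        """Return a frame holding only the given column."""

    @abstractmethod
    def distinct(self, frame):
        """Return the frame without duplicate rows."""

    @abstractmethod
    def count(self, frame):
        """Return the number of rows."""


class RowEngine(DistinctCounter[List[Dict[str, str]]]):
    # Stand-in for one engine: a frame is a list of row dicts.
    def select(self, frame, column):
        return [{column: row[column]} for row in frame]

    def distinct(self, frame):
        seen = []
        for row in frame:
            if row not in seen:
                seen.append(row)
        return seen

    def count(self, frame):
        return len(frame)


class ColumnEngine(DistinctCounter[Dict[str, List[str]]]):
    # Stand-in for a second engine: a frame is a dict of column lists.
    def select(self, frame, column):
        return {column: frame[column]}

    def distinct(self, frame):
        names = list(frame)
        rows = list(dict.fromkeys(zip(*(frame[name] for name in names))))
        return {name: [row[i] for row in rows] for i, name in enumerate(names)}

    def count(self, frame):
        return len(next(iter(frame.values()), []))
```

<p>Supporting another engine then means adding one more subclass, while <code>count_distinct</code> stays untouched.</p><p>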
We can easily add another processing engine if a new Dataframe API comes up. This meant touching every single bit of our code, which was going to be a lot of work on both the development and the testing fronts. Not a decision we wanted to jump into, but there wasn&#8217;t a lot of choice here. I have to let you in on a secret here. Besides the user discussions, we had another motive for the Snowflake product. We wanted to enter the Snowflake challenge. Some cash and publicity can never hurt a frugal startup! We wanted to apply with a running application, which left us with roughly 1.5 months to get everything running, and that is the deadline we set for ourselves at the beginning of the year. It was a lot of work, <em>and</em> <em>more</em> work, <em>and</em> <em>more</em> <em>work</em>. It was tiring, but it was fun. We finished a week before our self-imposed deadline, without any shortcuts or hacks :-) </p><p>Snowpark has recently been open-sourced, so it is going to be even more fun building around it. Zingg is surely one of the more complex applications being built around it, and we can&#8217;t wait to add more stuff to our identity resolution product and push Snowpark further. </p><p>In the end, we did not win the challenge, but we had some nice interactions with the Snowflake team and learned a lot that will help us as we now put Zingg Identity Resolution on Snowflake into production. </p><p>We are really excited about how this journey is shaping up! </p>]]></content:encoded></item><item><title><![CDATA[Measure What Matters! Our Simple Data Stack To Understand Open Source Adoption]]></title><description><![CDATA[Is it a Modern Data Stack? Surely! Minimalist? 
Of course!]]></description><link>https://www.learningfromdata.zingg.ai/p/measure-what-matters-our-simple-data</link><guid isPermaLink="false">https://www.learningfromdata.zingg.ai/p/measure-what-matters-our-simple-data</guid><dc:creator><![CDATA[Sonal Goyal]]></dc:creator><pubDate>Tue, 18 Apr 2023 16:05:53 GMT</pubDate><enclosure url="https://images.unsplash.com/photo-1434626881859-194d67b2b86f?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwzMDAzMzh8MHwxfHNlYXJjaHw2fHxtb2Rlcm4lMjBkYXRhJTIwc3RhY2t8ZW58MHx8fHwxNjgxODI0MDYz&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://images.unsplash.com/photo-1434626881859-194d67b2b86f?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwzMDAzMzh8MHwxfHNlYXJjaHw2fHxtb2Rlcm4lMjBkYXRhJTIwc3RhY2t8ZW58MHx8fHwxNjgxODI0MDYz&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://images.unsplash.com/photo-1434626881859-194d67b2b86f?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwzMDAzMzh8MHwxfHNlYXJjaHw2fHxtb2Rlcm4lMjBkYXRhJTIwc3RhY2t8ZW58MHx8fHwxNjgxODI0MDYz&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 424w, https://images.unsplash.com/photo-1434626881859-194d67b2b86f?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwzMDAzMzh8MHwxfHNlYXJjaHw2fHxtb2Rlcm4lMjBkYXRhJTIwc3RhY2t8ZW58MHx8fHwxNjgxODI0MDYz&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 848w, https://images.unsplash.com/photo-1434626881859-194d67b2b86f?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwzMDAzMzh8MHwxfHNlYXJjaHw2fHxtb2Rlcm4lMjBkYXRhJTIwc3RhY2t8ZW58MHx8fHwxNjgxODI0MDYz&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1272w, 
https://images.unsplash.com/photo-1434626881859-194d67b2b86f?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwzMDAzMzh8MHwxfHNlYXJjaHw2fHxtb2Rlcm4lMjBkYXRhJTIwc3RhY2t8ZW58MHx8fHwxNjgxODI0MDYz&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1456w" sizes="100vw"><img src="https://images.unsplash.com/photo-1434626881859-194d67b2b86f?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwzMDAzMzh8MHwxfHNlYXJjaHw2fHxtb2Rlcm4lMjBkYXRhJTIwc3RhY2t8ZW58MHx8fHwxNjgxODI0MDYz&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080" width="1080" height="717" data-attrs="{&quot;src&quot;:&quot;https://images.unsplash.com/photo-1434626881859-194d67b2b86f?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwzMDAzMzh8MHwxfHNlYXJjaHw2fHxtb2Rlcm4lMjBkYXRhJTIwc3RhY2t8ZW58MHx8fHwxNjgxODI0MDYz&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:717,&quot;width&quot;:1080,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;person holding white Samsung Galaxy Tab&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="person holding white Samsung Galaxy Tab" title="person holding white Samsung Galaxy Tab" srcset="https://images.unsplash.com/photo-1434626881859-194d67b2b86f?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwzMDAzMzh8MHwxfHNlYXJjaHw2fHxtb2Rlcm4lMjBkYXRhJTIwc3RhY2t8ZW58MHx8fHwxNjgxODI0MDYz&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 424w, https://images.unsplash.com/photo-1434626881859-194d67b2b86f?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwzMDAzMzh8MHwxfHNlYXJjaHw2fHxtb2Rlcm4lMjBkYXRhJTIwc3RhY2t8ZW58MHx8fHwxNjgxODI0MDYz&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 848w, 
https://images.unsplash.com/photo-1434626881859-194d67b2b86f?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwzMDAzMzh8MHwxfHNlYXJjaHw2fHxtb2Rlcm4lMjBkYXRhJTIwc3RhY2t8ZW58MHx8fHwxNjgxODI0MDYz&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1272w, https://images.unsplash.com/photo-1434626881859-194d67b2b86f?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwzMDAzMzh8MHwxfHNlYXJjaHw2fHxtb2Rlcm4lMjBkYXRhJTIwc3RhY2t8ZW58MHx8fHwxNjgxODI0MDYz&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1456w" sizes="100vw" loading="lazy" fetchpriority="high"></picture></div></a></figure></div><p>What good is a song composed but not hummed by the audience? 
The musician may love it as their best song, but it will soon fade away from the memory of its listeners. While the pleasure of building and innovating is its own reward, creators need external validation to generate the ammunition to make even bolder bets.</p><p>As people who pour our hearts and souls into building Zingg, we too pause at times to reflect on whether we are indeed creating value for our users. There are two main things we are super curious to learn at this stage.</p><ol><li><p>To learn how people are discovering us</p></li><li><p>To understand how they are using Zingg</p></li></ol><p>Signals for these come from multiple sources. The important ones are direct user and customer communication, Slack, website traffic, product downloads and product usage statistics.</p><p>Direct interaction with users and customers is the richest form of learning. Many of our open source users join our Slack and chat with us, and we learn about the problems they are solving with Zingg. 
Some of them write to us by email, or DM us on <a href="https://join.slack.com/t/zinggai/shared_invite/zt-w7zlcnol-vEuqU9m~Q56kLLUVxRgpOA">Slack</a> or LinkedIn, sharing what they are building with and around Zingg. They want to check the best way to use Zingg for their scenario. We ask them about their stack and use cases, and how we can help them better. Learning more about our users guides us in shaping the roadmap for Zingg. Zingg&#8217;s Docker image and the Zingg <a href="https://readthedocs.org/projects/zingg/">Python API</a> are a direct result of these interactions. We built them because we realized that they would help users generate identity resolution results faster and would enable them to use Zingg as a first-class citizen in their transformation pipelines. </p><p>While interacting directly, we often ask users where they learnt about Zingg. This is augmented with traffic signals from Google Analytics. GitHub also provides repository traffic analysis for the previous fortnight. </p><p>Often, the discovery is word of mouth. For us, having a great product and providing even better support is the best way of engaging people. It suits us as developers, it helps our users drive value faster and it builds a strong relationship with the user. A look at the <a href="https://www.zingg.ai/">testimonials on the Zingg website</a> reveals we are doing a nice job there. </p><p>And we should still aim to get better here. Like identity resolution, support is never going to be perfect, so we always have things to improve upon.</p><p>Many times the discovery is through the tutorials and guides we have written. Or talks and events in select data communities. 
Instead of a lot of content, we have focused on deep and long articles that help the user understand the <a href="https://www.zingg.ai/documentation-article/the-what-and-why-of-entity-resolution">problem of identity and entity resolution, why it is needed, the challenges in resolving entities</a> and how Zingg can be run on different platforms like <a href="https://www.zingg.ai/documentation-article/identity-resolution-on-databricks-for-customer-360">Databricks</a> and <a href="https://www.zingg.ai/documentation-article/identifying-duplicates-in-snowflake-with-zingg">Snowflake</a>. Early adopters of a product know the problem and are looking for a solution. Hence, we owe it to them <a href="https://medium.com/towards-data-science/step-by-step-identity-resolution-with-python-and-zingg-e0895b369c50">to make using Zingg easy</a> so that they can focus on solving the bigger puzzle - Customer 360, Segmentation, Fraud, Risk, Compliance etc. Discoverability is the side benefit of our content; helping our users adopt Zingg is the main goal. </p><p>Broadly, these data points are sufficient for us to understand and improve discoverability at this stage. Besides this, we track open source adoption and product usage anonymously. We gather information about:</p><ol><li><p>GitHub downloads of our releases</p></li><li><p>Docker pulls</p></li><li><p>PyPI installs</p></li><li><p>Slack users</p></li><li><p>Documentation views</p></li></ol><p>Together, the above give us a fair idea of our adoption and how we are growing. For now, we get these values manually every couple of weeks and track them in an Excel file, charting out a time series to understand our growth rate. 
We do have a column for GitHub stars, which we populate but don&#8217;t think much about, as we have richer data to get signals from.</p><p>Coming to product analytics, the aim here is to learn the different aspects of how people are using Zingg:</p><ol><li><p>Which platform do they run Zingg on?</p></li><li><p>Is the product scaling? How many records do they match?</p></li><li><p>How much time does it take to run?</p></li><li><p>Do they run Zingg multiple times or is it a one-time deduplication?</p></li></ol><p>For any product analytics to function, we need a way to identify unique installations and send usage events from the installation to a place where we can analyze them. Our asks are fairly simple questions, but the novel nature of our product as well as its deployment makes it tricky to get the answers. </p><p>Often, Zingg is deployed on cloud resources like Databricks or EMR, or on private enterprise networks which are blocked from sending information outside the VPN. So if a user downloads and uses Zingg in such environments, we have no way of knowing how they are using Zingg. </p><p>Another issue is that it is impossible to identify installs through cloud systems, as virtual machines change with every run. Even when we don&#8217;t want to identify the user but only the install, there is no way to track a unique installation by writing a random UUID on disk. This problem exists in our Docker environment as well. </p><p>Some of the features we are building now will allow users to trigger cloud jobs from their machines, so we will think about the install identification then. Till then, we already have the download stats, so that&#8217;s a good enough proxy for us. Instead, we gather usage data. We <a href="https://docs.zingg.ai/zingg/security">respect user privacy</a>, so we limit data collection to reflect product usage. We also provide a flag to turn data collection off. 
</p><p>We use Google Analytics as our event server and post custom events to it from the code every time Zingg is run. These events are sent to our GA account, which has been configured to pipe them into BigQuery. For every run of Zingg, we collect the data and output formats, the time it took to run, the number of records and a few other stats. Every few weeks, a handful of SQL queries help us understand the data volumes we are processing, the number of daily jobs run and their trends, the time taken to complete each run etc. Mostly, it is a brief glance at the numbers to make sure we are on the right track. </p><p>Many times, the usage data has an unheard story to tell. We discovered that there were a good number of users using Zingg to resolve entities on Snowflake. A Spark deployment is not ideal for them, as they are not using it anywhere in their stack. Could we leverage Snowpark and build a lighter, tighter product for such users? <a href="https://www.zingg.ai/company/zingg-enterprise-snowflake">Seems we can</a>, and I have a lot of exciting stories to tell about that effort and where it is leading us! </p><p>Coming back to our simple data stack. </p><p>Sometimes we do detective work ;-) If we see events of the same format and number of fields triggered at regular time intervals through the day, we treat it as Zingg running in production. Once in a while, it is fun to look into individual events and extrapolate the usage to different scenarios. </p><p>Churning out aggregate stats is a revelation too - for example, Zingg has resolved <em>at least</em> <strong>2 billion records in the last two months</strong>. This is a huge number, and it has grown rapidly over the last few months. </p><p>When I say &#8220;at least&#8221;, it is with the awareness that we are not collecting all the data points. We have talked with many users who are running production loads but do not show up in our telemetry. So our analysis tends to be conservative. 
Which is absolutely fine. </p><p>Overall, the aim of this stack is to help us get a sense of how we are doing, what we can do to help our users and what more we should build. We are on free plans on Slack as well as GA. The frugality is in the cost, and also in the implementation. As a small team, we needed something that is easy to set up and will answer our basic questions without much effort. Also, the trend in the data is more important than the individual data points at this stage. </p><p>We may miss some dots, but the path is clear. Which is what matters!</p>]]></content:encoded></item><item><title><![CDATA[Zingg turns one!]]></title><description><![CDATA[What, already? 
:-)]]></description><link>https://www.learningfromdata.zingg.ai/p/zingg-turns-one</link><guid isPermaLink="false">https://www.learningfromdata.zingg.ai/p/zingg-turns-one</guid><dc:creator><![CDATA[Sonal Goyal]]></dc:creator><pubDate>Sun, 02 Oct 2022 12:27:18 GMT</pubDate><enclosure url="https://images.unsplash.com/photo-1498673394965-85cb14905c89?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwzMDAzMzh8MHwxfHNlYXJjaHw1fHxvbmUlMjB5ZWFyfGVufDB8fHx8MTY2NDcwNjE5Ng&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=1080" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I did not realize that a year has gone by working publicly on Zingg. But now that the calendar has made the fact sink in, it feels surreal. Wasn&#8217;t it just yesterday that I was fretting about the documentation before announcing the open source and <a href="https://www.learningfromdata.zingg.ai/p/time-to-zingg">penning down my reasons for building Zingg</a>? Now here I am, recruiting the <a href="https://www.zingg.ai/resources/careers">team</a> for the enterprise version :-)</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://images.unsplash.com/photo-1498673394965-85cb14905c89?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwzMDAzMzh8MHwxfHNlYXJjaHw1fHxvbmUlMjB5ZWFyfGVufDB8fHx8MTY2NDcwNjE5Ng&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=1080" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://images.unsplash.com/photo-1498673394965-85cb14905c89?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwzMDAzMzh8MHwxfHNlYXJjaHw1fHxvbmUlMjB5ZWFyfGVufDB8fHx8MTY2NDcwNjE5Ng&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=1080 424w, https://images.unsplash.com/photo-1498673394965-85cb14905c89?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwzMDAzMzh8MHwxfHNlYXJjaHw1fHxvbmUlMjB5ZWFyfGVufDB8fHx8MTY2NDcwNjE5Ng&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=1080 848w, 
https://images.unsplash.com/photo-1498673394965-85cb14905c89?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwzMDAzMzh8MHwxfHNlYXJjaHw1fHxvbmUlMjB5ZWFyfGVufDB8fHx8MTY2NDcwNjE5Ng&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=1080 1272w, https://images.unsplash.com/photo-1498673394965-85cb14905c89?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwzMDAzMzh8MHwxfHNlYXJjaHw1fHxvbmUlMjB5ZWFyfGVufDB8fHx8MTY2NDcwNjE5Ng&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=1080 1456w" sizes="100vw"><img src="https://images.unsplash.com/photo-1498673394965-85cb14905c89?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwzMDAzMzh8MHwxfHNlYXJjaHw1fHxvbmUlMjB5ZWFyfGVufDB8fHx8MTY2NDcwNjE5Ng&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=1080" width="1080" height="720" data-attrs="{&quot;src&quot;:&quot;https://images.unsplash.com/photo-1498673394965-85cb14905c89?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwzMDAzMzh8MHwxfHNlYXJjaHw1fHxvbmUlMjB5ZWFyfGVufDB8fHx8MTY2NDcwNjE5Ng&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=1080&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:720,&quot;width&quot;:1080,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;person holding lighted sparklers&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="person holding lighted sparklers" title="person holding lighted sparklers" srcset="https://images.unsplash.com/photo-1498673394965-85cb14905c89?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwzMDAzMzh8MHwxfHNlYXJjaHw1fHxvbmUlMjB5ZWFyfGVufDB8fHx8MTY2NDcwNjE5Ng&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=1080 424w, 
https://images.unsplash.com/photo-1498673394965-85cb14905c89?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwzMDAzMzh8MHwxfHNlYXJjaHw1fHxvbmUlMjB5ZWFyfGVufDB8fHx8MTY2NDcwNjE5Ng&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=1080 848w, https://images.unsplash.com/photo-1498673394965-85cb14905c89?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwzMDAzMzh8MHwxfHNlYXJjaHw1fHxvbmUlMjB5ZWFyfGVufDB8fHx8MTY2NDcwNjE5Ng&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=1080 1272w, https://images.unsplash.com/photo-1498673394965-85cb14905c89?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwzMDAzMzh8MHwxfHNlYXJjaHw1fHxvbmUlMjB5ZWFyfGVufDB8fHx8MTY2NDcwNjE5Ng&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=1080 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Photo by <a href="https://unsplash.com/@cristian1">Cristian Escobar</a> on <a href="https://unsplash.com">Unsplash</a></figcaption></figure></div><p>It feels great to complete an entire year! Days have passed fairly quickly, spread between <a href="https://medium.com/towards-data-science/identifying-duplicates-in-snowflake-e95b3f3fce2b">writing</a>, <a href="https://crossroads-cx.medium.com/building-open-access-to-nc-campaign-finance-data-entity-resolution-pt-1-baff56e573a1">lots</a> <a href="https://hightouch.com/blog/warehouse-identity-resolution">of</a> <a href="https://www.databricks.com/blog/2022/08/04/new-solution-accelerator-customer-entity-resolution.html">reading</a>, <a href="https://www.youtube.com/watch?v=F5Dw1QH0idQ">speaking</a>, giving <a href="https://www.youtube.com/watch?v=zOabyZxN9b0&amp;t=5s">demos</a>, and <a href="https://www.youtube.com/watch?v=tUfmoQY_UyM">meeting people</a>. We have done a lot of <a href="https://www.learningfromdata.zingg.ai/p/to-python-with-love">development</a> <a href="https://github.com/zinggAI/zingg/releases">work</a> as well as spent time on incorporation and funding. I have thoroughly enjoyed interacting with users and learning about the different use cases enabled by identity resolution. We have more than 200 members on our <a href="https://join.slack.com/t/zinggai/shared_invite/zt-w7zlcnol-vEuqU9m~Q56kLLUVxRgpOA">Slack</a> now, with representation from Fortune 500s, digital natives, data platforms, consulting companies and startups. 
Besides Customer 360, personalisation, AML, KYC, GDPR and master data management use cases, the trend towards <a href="https://deloitte.wsj.com/articles/bridge-the-customer-data-divide-between-marketing-and-it-01657637033?mod=Deloitte_cmo_wsjsf_h1&amp;tesla=y&amp;id=us:2sm:3tw:4dd_dualzone::6dd:20220805181527::7325000954:5&amp;utm_source=tw&amp;utm_campaign=dd_dualzone&amp;utm_content=dd&amp;utm_medium=social&amp;linkId=176108611">composable</a> <a href="https://www.google.com/search?q=composable+cdp&amp;rlz=1C5CHFA_enIN1024IN1025&amp;oq=composable+cdp&amp;aqs=chrome..69i57j0i512j0i390l3j69i60l3.5925j0j7&amp;sourceid=chrome&amp;ie=UTF-8">customer data platforms</a> is fuelling a lot of interest in Zingg, which is very promising. The data stack has proven patterns for data ingestion, <a href="https://www.getdbt.com/">transformation</a> and activation, but identity is a core missing piece. <a href="https://www.zingg.ai/">Zingg</a> fits very well into this <a href="https://roundup.getdbt.com/p/from-rows-to-people">space</a>.</p><p>One of the fundamental issues with identity resolution is <a href="https://docs.zingg.ai/zingg-0.3.4/zmodels">scalability</a>. 
While building Zingg, we used <a href="https://github.com/zinggAI/zingg/tree/main/examples">example</a> datasets and tested our algorithms on them. To test scalability, we copy-pasted an example multiple times to build a dummy dataset of roughly 15 million rows. Most enterprise master data is under 10m rows, and Zingg scaled well on this. However, dummy datasets with repeated rows perform differently from actual data with a real distribution of values, so there were some lingering doubts in the back of my mind. These were soon resolved when we consulted with a Fortune 10 company with 12.5m records, where Zingg ran like a breeze. So I was confident of scalability while releasing the open source, but I wasn&#8217;t sure how far it would go beyond 15m records without tuning the knobs. It is therefore extremely uplifting to see our users resolve identities on 70-80m records in a few hours with Zingg. Due to the quadratic nature of the comparisons, five times the records means roughly 20 to 30 times the comparisons of our 15m endeavours, and seeing it work well is a great confidence boost! Another measure for Zingg is the accuracy of predictions, and users have done head-to-head comparisons with their multi-million dollar MDM tools to confirm our results. Yay!</p><p>One very important aspect for us is arming our users with the power of machine learning and distributed systems in a self-serve, no-code way. Many of our users are first timers with Java, Spark or ML. Zingg abstracts away the technical complexity and gives the superpower of identity resolution at scale directly to the end user. Now that our Python release is out, we are actively working on a native Snowflake implementation. This opens up a new way for Snowflake users to use Zingg to identify their customers and personalize their journeys and experiences. </p><p>There have been some mistakes last year, and I plan to do better in the coming year. 
Due to my <a href="https://www.verywellmind.com/what-is-flow-2794768">flow state</a>, I missed applying to some conferences, but the one I am most disappointed about missing is <a href="https://coalesce.getdbt.com/">Coalesce from dbt Labs</a>. I would have met some amazing data practitioners there and made many friends. This time I will clear my calendar and attend virtually. If you are joining, do let me know. </p><p>There have been <a href="https://www.nytimes.com/2022/09/15/sports/tennis/federer-retires-tennis.html#:~:text=%E2%80%9CTennis%20has%20treated%20me%20more,titles%2C%20310%20weeks%20ranked%20No.">some other disappointments</a>, but they had been in the making for some time. Can&#8217;t complain. </p><p>On a philosophical note, <a href="https://www.magicalquote.com/bookquotes/who-in-the-world-am-i-ah-thats-the-great-puzzle/">identity</a> resolution is the <a href="https://www.learningfromdata.zingg.ai/p/of-i-and-we">discovery of self as well as building relationships</a>. Working on Zingg enables me to learn a lot about communication, positioning, customer needs, technology, hiring and sales. It pushes my boundaries and makes me discover myself. It also helps me talk to a lot of people and build relationships with customers, data leaders, practitioners, evangelists, contributors, team, investors and mentors. Many of us feel lost at the thought of networking - I am not sure how I would do if I had the charter to meet x new people every week! But when the baseline is common ground on data or ML, it is much easier to connect. </p><p>Zingg is made with <a href="https://www.linkedin.com/pulse/magic-word-meraki-thought-difference-between-working-just-mescalchin/">meraki</a>. It has been a lot of hard work, but it doesn&#8217;t feel that way at all. The last year has been nice, and <a href="https://www.youtube.com/watch?v=HgzGwKwLmgM">we are having a good time</a>. This coming year will decide how things shape up for us as a project and a startup. 
</p><p>Thanks to all of you for investing your time and effort in Zingg. We are incredibly grateful and motivated to do even more this year.</p><p>Wish us luck! </p>]]></content:encoded></item><item><title><![CDATA[To Python, With Love! ]]></title><description><![CDATA[A new new way to work with Zingg.]]></description><link>https://www.learningfromdata.zingg.ai/p/to-python-with-love</link><guid isPermaLink="false">https://www.learningfromdata.zingg.ai/p/to-python-with-love</guid><dc:creator><![CDATA[Sonal Goyal]]></dc:creator><pubDate>Fri, 26 Aug 2022 07:23:43 GMT</pubDate><enclosure url="https://images.unsplash.com/photo-1526379095098-d400fd0bf935?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwzMDAzMzh8MHwxfHNlYXJjaHwxfHxweXRob258ZW58MHx8fHwxNjY0ODk4ODIz&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=1080" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>There was a feeling of suspense and excitement hanging in the air. From the depths of the cubicles rose a bewitching hymn, whose clicks and clacks surrounded the room. Despite this alluring melody, silence and serenity prevailed. 
The countless fingers on the various keyboards pressed and released in sync as if they were playing one giant piano. Everyone typed in tandem and the entire scene felt straight out of an orchestra. Time, which stood eternally and unmovingly, slid when a delighted exclamation rang through the air. A gentle whoosh brought out the new Zingg feature, acting as a spell breaker to the enchantment everyone was under&#8230; </em></p><p>As you might have guessed by now, we just released a new version of Zingg! This release marks a complete shift from command-line training and matching jobs to a full programmatic interface for building applications with Zingg. With Python.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://images.unsplash.com/photo-1526379095098-d400fd0bf935?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwzMDAzMzh8MHwxfHNlYXJjaHwxfHxweXRob258ZW58MHx8fHwxNjY0ODk4ODIz&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=1080" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://images.unsplash.com/photo-1526379095098-d400fd0bf935?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwzMDAzMzh8MHwxfHNlYXJjaHwxfHxweXRob258ZW58MHx8fHwxNjY0ODk4ODIz&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=1080 424w, https://images.unsplash.com/photo-1526379095098-d400fd0bf935?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwzMDAzMzh8MHwxfHNlYXJjaHwxfHxweXRob258ZW58MHx8fHwxNjY0ODk4ODIz&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=1080 848w, https://images.unsplash.com/photo-1526379095098-d400fd0bf935?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwzMDAzMzh8MHwxfHNlYXJjaHwxfHxweXRob258ZW58MHx8fHwxNjY0ODk4ODIz&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=1080 1272w, 
https://images.unsplash.com/photo-1526379095098-d400fd0bf935?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwzMDAzMzh8MHwxfHNlYXJjaHwxfHxweXRob258ZW58MHx8fHwxNjY0ODk4ODIz&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=1080 1456w" sizes="100vw"><img src="https://images.unsplash.com/photo-1526379095098-d400fd0bf935?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwzMDAzMzh8MHwxfHNlYXJjaHwxfHxweXRob258ZW58MHx8fHwxNjY0ODk4ODIz&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=1080" width="1080" height="608" data-attrs="{&quot;src&quot;:&quot;https://images.unsplash.com/photo-1526379095098-d400fd0bf935?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwzMDAzMzh8MHwxfHNlYXJjaHwxfHxweXRob258ZW58MHx8fHwxNjY0ODk4ODIz&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=1080&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:608,&quot;width&quot;:1080,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;person holding sticky note&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="person holding sticky note" title="person holding sticky note" srcset="https://images.unsplash.com/photo-1526379095098-d400fd0bf935?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwzMDAzMzh8MHwxfHNlYXJjaHwxfHxweXRob258ZW58MHx8fHwxNjY0ODk4ODIz&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=1080 424w, https://images.unsplash.com/photo-1526379095098-d400fd0bf935?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwzMDAzMzh8MHwxfHNlYXJjaHwxfHxweXRob258ZW58MHx8fHwxNjY0ODk4ODIz&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=1080 848w, 
https://images.unsplash.com/photo-1526379095098-d400fd0bf935?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwzMDAzMzh8MHwxfHNlYXJjaHwxfHxweXRob258ZW58MHx8fHwxNjY0ODk4ODIz&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=1080 1272w, https://images.unsplash.com/photo-1526379095098-d400fd0bf935?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwzMDAzMzh8MHwxfHNlYXJjaHwxfHxweXRob258ZW58MHx8fHwxNjY0ODk4ODIz&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=1080 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Photo by <a href="https://unsplash.com/@hiteshchoudhary">Hitesh Choudhary</a> on <a 
href="https://unsplash.com">Unsplash</a></figcaption></figure></div><p>After Zingg became open source, we found that most of our early users came from a non-Java, non-Spark background. Though it was extremely encouraging to see a broader user base for Zingg than originally anticipated, it had been a poor assumption on our part that everyone had the JVM running on their laptops to run Zingg. Hats off to the early adopters who still went ahead and resolved their entities with Zingg; we had definitely not made working with Zingg easy for them :-( </p><p>This realization led to our <a href="https://hub.docker.com/repository/docker/zingg/zingg">urgent Docker release</a>, and it felt like we had crossed a big hurdle! Life and work moved on, and as interactions with users increased, one common pattern emerged - the use of Python. A lot of our users are writing Python data pipelines to get data from multiple sources. They are wrangling data in Python before putting it through Zingg. They are consuming the Zingg output in Python-based machine learning pipelines for risk modeling, customer segmentation, and product recommendations. 
They are on <a href="https://www.databricks.com/blog/2022/08/04/new-solution-accelerator-customer-entity-resolution.html">Databricks notebooks</a>. Wouldn&#8217;t it be cool if such users could use Zingg through Python itself? </p><p>With this in mind, we set about defining a Python interface for Zingg. Our biggest challenge with Python was that the majority of our source code and all our core algorithms are in Java/Scala. Due to the <a href="https://docs.zingg.ai/zingg-0.3.4/zmodels">computationally intensive nature of the identity resolution problem</a>, we keep a very tight and performance-focused implementation. Rewriting everything would not make sense, though we did think about it for a fleeting moment. After some tension, we noticed that Apache Spark was structured similarly, and realized we could borrow the best practices from there. This turned out to be a really good decision. In fact, we referred to the Apache Spark codebase a lot to build a Py4J-based <a href="https://zingg.readthedocs.io/en/latest/zingg.html#contents">interface to Zingg</a>. We would invoke our Java code from Python, and all the Java interfaces would be replicated on the Python side. Things started to look up again. </p><p>There was one more hurdle though, something we realized much later as we got closer to shipping - Python <a href="https://pypi.org/project/zingg/">packaging</a>. Unlike a usual Python application, our interface depends on the compiled Java code, and hence we had to figure out a way to include this as part of our package. Luckily, Spark does something similar, so the packaging there became our big inspiration. Again! </p><p>With these major hurdles crossed, we <a href="https://www.zingg.ai/post/my-internship-experience-with-zingg">got to the grind</a> and were finally ready to put our work out. Honestly, this release has been a really big one - not only did we build the whole Python interface, but we have also added lots of cool things to core Zingg. 
<a href="https://docs.zingg.ai/zingg-0.3.4/improving-accuracy/stopwordsremoval">Stopwords</a> - stuff to ignore while matching attributes, for example. Or special <a href="https://docs.zingg.ai/zingg-0.3.4/stepbystep/configuration/field-definitions">match types</a> for emails or pin codes. More diagnostic messages while running Zingg - getting better there slowly and steadily! </p><p>In the end, once the <a href="https://github.com/zinggAI/zingg/releases">release was done</a>, it felt like a triumph! A mini celebration followed, along with a different version of fun - rest, recharge and laze around a bit. I haven&#8217;t heard any user complaints so far, so it is probably working as planned ;-)</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!P7q5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F6a777a8b-eba5-4cb7-a559-061d50ba24ad_552x72.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!P7q5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F6a777a8b-eba5-4cb7-a559-061d50ba24ad_552x72.png 424w, https://substackcdn.com/image/fetch/$s_!P7q5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F6a777a8b-eba5-4cb7-a559-061d50ba24ad_552x72.png 848w, https://substackcdn.com/image/fetch/$s_!P7q5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F6a777a8b-eba5-4cb7-a559-061d50ba24ad_552x72.png 1272w, 
https://substackcdn.com/image/fetch/$s_!P7q5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F6a777a8b-eba5-4cb7-a559-061d50ba24ad_552x72.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!P7q5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F6a777a8b-eba5-4cb7-a559-061d50ba24ad_552x72.png" width="552" height="72" data-attrs="{&quot;src&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/6a777a8b-eba5-4cb7-a559-061d50ba24ad_552x72.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:72,&quot;width&quot;:552,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!P7q5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F6a777a8b-eba5-4cb7-a559-061d50ba24ad_552x72.png 424w, https://substackcdn.com/image/fetch/$s_!P7q5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F6a777a8b-eba5-4cb7-a559-061d50ba24ad_552x72.png 848w, https://substackcdn.com/image/fetch/$s_!P7q5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F6a777a8b-eba5-4cb7-a559-061d50ba24ad_552x72.png 1272w, 
https://substackcdn.com/image/fetch/$s_!P7q5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F6a777a8b-eba5-4cb7-a559-061d50ba24ad_552x72.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NRtC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F36ec558f-9074-4581-ac13-c0af821b8900_501x41.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NRtC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F36ec558f-9074-4581-ac13-c0af821b8900_501x41.png 424w, https://substackcdn.com/image/fetch/$s_!NRtC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F36ec558f-9074-4581-ac13-c0af821b8900_501x41.png 848w, https://substackcdn.com/image/fetch/$s_!NRtC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F36ec558f-9074-4581-ac13-c0af821b8900_501x41.png 1272w, https://substackcdn.com/image/fetch/$s_!NRtC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F36ec558f-9074-4581-ac13-c0af821b8900_501x41.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!NRtC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F36ec558f-9074-4581-ac13-c0af821b8900_501x41.png" width="501" height="41" data-attrs="{&quot;src&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/36ec558f-9074-4581-ac13-c0af821b8900_501x41.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:41,&quot;width&quot;:501,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:7763,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NRtC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F36ec558f-9074-4581-ac13-c0af821b8900_501x41.png 424w, https://substackcdn.com/image/fetch/$s_!NRtC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F36ec558f-9074-4581-ac13-c0af821b8900_501x41.png 848w, https://substackcdn.com/image/fetch/$s_!NRtC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F36ec558f-9074-4581-ac13-c0af821b8900_501x41.png 1272w, https://substackcdn.com/image/fetch/$s_!NRtC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F36ec558f-9074-4581-ac13-c0af821b8900_501x41.png 1456w" sizes="100vw" 
loading="lazy"></picture><div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!URFH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fd012364d-9306-4cdb-93e4-51311812d2b3_209x33.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!URFH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fd012364d-9306-4cdb-93e4-51311812d2b3_209x33.png 424w, https://substackcdn.com/image/fetch/$s_!URFH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fd012364d-9306-4cdb-93e4-51311812d2b3_209x33.png 848w, https://substackcdn.com/image/fetch/$s_!URFH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fd012364d-9306-4cdb-93e4-51311812d2b3_209x33.png 1272w, https://substackcdn.com/image/fetch/$s_!URFH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fd012364d-9306-4cdb-93e4-51311812d2b3_209x33.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!URFH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fd012364d-9306-4cdb-93e4-51311812d2b3_209x33.png" width="209" height="33" 
data-attrs="{&quot;src&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/d012364d-9306-4cdb-93e4-51311812d2b3_209x33.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:33,&quot;width&quot;:209,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3300,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!URFH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fd012364d-9306-4cdb-93e4-51311812d2b3_209x33.png 424w, https://substackcdn.com/image/fetch/$s_!URFH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fd012364d-9306-4cdb-93e4-51311812d2b3_209x33.png 848w, https://substackcdn.com/image/fetch/$s_!URFH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fd012364d-9306-4cdb-93e4-51311812d2b3_209x33.png 1272w, https://substackcdn.com/image/fetch/$s_!URFH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fd012364d-9306-4cdb-93e4-51311812d2b3_209x33.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Among other things, I am very excited about the ability to finally have a good integration point with dbt. Besides Python, SQL through dbt is something we frequently see (of course! :-)). 
With <a href="https://docs.getdbt.com/docs/building-a-dbt-project/building-models/python-models#:~:text=dbt%20Python%20models%20have%20access,other%20users%2C%20and%20so%20on.">dbt supporting Python</a>, we have a cool new way for Zingg to work with dbt. I can&#8217;t wait to try this out&#8230; or maybe someone is working on this already! Are you <a href="https://www.youtube.com/watch?v=Vy7RaQUmOzE">The One</a>?</p>]]></content:encoded></item></channel></rss>