The open source identity resolution journey so far...

Jan 07, 2025

More than a decade ago, in 2013, the identity resolution problem hit me. As a niche consulting company focused on data and AI, we were tasked with integrating customer records from disparate systems. Our project involved setting up a Hadoop cluster on Amazon EC2 and building out the analytics and reporting for our client. This was the time when services like Elastic MapReduce were about to get launched. Running Hadoop in the cloud through custom scripts felt like an engineering feat. In a way, we were actually mimicking today’s serverless approaches.

We built data extraction, transformation, and loading pipelines, along with the reporting, using Sqoop, Pig, Hive, and Cascading. But as the project progressed, we faced a critical issue. Even when we got all the data in one place, it was extremely challenging to identify the real-world entity represented by the varying records from different systems. Since the records had no common identifiers and came with all sorts of variations, unifying the data became the biggest issue for us on that project.

Once I became aware of the problem, I started seeing it in other projects. Even when I was doing different work, the problem never really left me. How do we say that the records are a match? How do we scale this out? How do we run it for any entity that an enterprise may deal with? For the programmer in me, it became a thrill ride of problem solving with lots of first-principles thinking. The more I worked on it, the more challenges I became aware of. Eventually, I got completely immersed and stopped taking other consulting work to focus on building a generic solution.

When I showed an early version to my friend and mentor Joydeep Sen Sarma, creator of Apache Hive, he encouraged me to open source it. As an active consumer of amazing open source technologies, I saw it as a great way to contribute back. So I latched on to the idea. Here is what I wrote when I published Zingg:

I also feel that open source Zingg is more powerful than closed source, because then it is up to the community to put it to use in more ways than I have seen and could ever anticipate. It is also a promise to work closely with different people and collaborate, something I am very sincerely hoping will happen!

I was convinced there were more people like me, but it was not clear if they would find me. Fast forward to today: the Zingg open source community is 700+ members strong. It turns out that what felt like an esoteric problem to me is a strongly felt pain for enterprises. Starting with a single person, Zingg now has a founding team that cares deeply about entity resolution. Zingg resolves billions of records every month, and our users and customers come from diverse industries like healthcare, insurance, the government sector, and nonprofits, as well as startups. This love and support is driving us to innovate more, and we are building amazing stuff to run production loads with a persistent ZINGG_ID along with incremental flows. We continue to work extensively on accuracy and performance. Zingg also supports diverse deployments on Microsoft Fabric, Snowflake, AWS Glue, Databricks, and others.

All this would not have been possible without our community members, well-wishers, and supporters. Thank you all, you have made this journey well worth it.

Sonal’s Newsletter