When I wrote last time that one of the reasons for open sourcing Zingg was to collaborate and work closely with different people, it was mostly a prayer at my end. I was, and I am, super optimistic about the long-term need for entity resolution - linking records of the same entity without common identifiers. But I was not really expecting anything concrete to happen immediately. An open-source project, without the backing of a major company, without a marketing budget, AND coming from a small suburb of the Indian capital, really needs to find its own way into the crowded data ecosystem. As builders, we tend to think that if we build it, they will come. But that’s not how real life works. At least not in the majority of cases.
I thought it would be an achievement if even a few people heard about Zingg in the first week. I am happy and so excited to tell you that the last few weeks have been pretty unexpected, in a very good way. Read on!
Zingg got a pretty detailed mention in The Analytics Engineering Roundup. For a humble consultant like me, this is way beyond my wildest imagination! Even though I can sometimes believe in as many as six impossible things before breakfast, I could never have dreamt that Zingg would be noticed by the founder of dbt Labs, Tristan Handy himself!!!
When I first saw the post, I couldn’t control my excitement and woke up my sleeping family to show it to them. And I got them to read it word by word, multiple times, with me peering over the screen to see where they were ;-)
I think it is days like these that make all the effort really worth it - the late nights, the sweat and the fret, racking your brain till the problem gets solved, and then fine-tuning it further!
Since then, far more people have learned about Zingg than I could ever have reached alone. The stargazer count on the repository has multiplied, and I have been having good early discussions around Zingg usage in different scenarios - data quality, master data management, as well as pure entity resolution.
Data quality, in terms of deduplicating customer records, is one of the use cases I have solved personally. Duplicates hurt operations - imagine making multiple marketing calls to the same lead, or sending the same person multiple copies of an email. It can only drive business away. Duplicates also distort customer lifetime value, usually the first customer analytics metric one starts with. If I buy a refrigerator as well as a TV from Samsung and they think I am two different people, they will continue to send Person 1 (me) offers for a TV and Person 2 (me again!) offers for a refrigerator.
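To make this concrete, here is a tiny sketch in plain Python - the records and IDs are made up, and the cluster assignment just stands in for what a matching tool like Zingg would produce - showing how duplicates split one customer’s lifetime value across two records:

```python
# Hypothetical purchase history: the same person appears twice in the CRM
# because the two records share no common identifier.
purchases = [
    {"customer_id": "C001", "name": "Samarth Rao", "amount": 1200.0},  # TV
    {"customer_id": "C002", "name": "S. Rao", "amount": 800.0},  # refrigerator
]

# CLV computed on raw records: two "customers", each looking less valuable.
clv_raw = {}
for p in purchases:
    clv_raw[p["customer_id"]] = clv_raw.get(p["customer_id"], 0.0) + p["amount"]
print(clv_raw)  # {'C001': 1200.0, 'C002': 800.0}

# After entity resolution assigns both records to one cluster, the true
# lifetime value of the single underlying person emerges.
clusters = {"C001": "Z1", "C002": "Z1"}  # hypothetical cluster ids from matching
clv_resolved = {}
for p in purchases:
    cid = clusters[p["customer_id"]]
    clv_resolved[cid] = clv_resolved.get(cid, 0.0) + p["amount"]
print(clv_resolved)  # {'Z1': 2000.0}
```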
Master data management is a very broad discipline, and clearly there is disenchantment around the current tooling. Many of the modern tools overlap with the older master data management activities, data catalogs being one example. I feel a modern data stewardship experience would be valuable - one that fixes data directly in the warehouse, from where the corrections flow out through reverse ETL and fix the business tools too. The business tools continue to self-correct, and their data flows into the warehouse and corrects it as well. So the mastering happens at multiple levels, and lineage is preserved. This is something I am thinking a lot about these days. What do you think I am missing?
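To make the loop I am imagining concrete, here is a purely illustrative sketch - the Warehouse and CRM classes below are in-memory stand-ins I made up, not any real API:

```python
# Illustrative only: toy stand-ins showing the stewardship loop,
# where corrections land in the warehouse first and flow outwards.

class Warehouse:
    def __init__(self):
        self.master = {}   # customer id -> golden record
        self.lineage = []  # (customer id, source) pairs, so lineage is kept

    def upsert(self, cid, record, source):
        self.master[cid] = record
        self.lineage.append((cid, source))

class CRM:
    """Stand-in for any business tool fed by reverse ETL."""
    def __init__(self, contacts):
        self.contacts = contacts

    def update_contact(self, cid, record):
        self.contacts[cid] = record

# Step 1: the steward fixes the record in the warehouse directly.
wh = Warehouse()
wh.upsert("C001", {"name": "Asha Mehta", "city": "Gurgaon"}, source="steward")

# Step 2: reverse ETL pushes the corrected master back into the CRM,
# so the business tool self-corrects instead of drifting.
crm = CRM(contacts={"C001": {"name": "A Mehta", "city": "Gurgaonn"}})
for cid, record in wh.master.items():
    crm.update_contact(cid, record)

print(crm.contacts["C001"])  # now matches the warehouse golden record
print(wh.lineage)            # [('C001', 'steward')]
```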
Entity resolution as a pure technology, for AML/KYC scenarios and knowledge graphs, is interesting as well. Most graph providers mention entity resolution; as an example, check this. Assuming common identifiers (SSNs, emails...) exist is a simplistic view of real-world data IMHO, and so it will be super cool to see someone use Zingg there.
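As a rough illustration of why that assumption breaks down, here is a minimal sketch - made-up records, standard-library Python only, a naive stand-in for what a real matcher does - comparing two records of the same person that share no SSN or email:

```python
from difflib import SequenceMatcher

# Two records of the same person from different systems - no shared key.
rec_a = {"name": "Jonathan A. Smith", "address": "12 Baker Street, London"}
rec_b = {"name": "Jon Smith", "address": "12 Baker St, London"}

def field_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] via difflib's ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Average the per-field similarities into a naive match score.
fields = ("name", "address")
score = sum(field_similarity(rec_a[f], rec_b[f]) for f in fields) / len(fields)
print(f"match score: {score:.2f}")  # high despite no common identifier
```

A real entity resolution system learns which field similarities matter and blocks candidate pairs so every record isn’t compared with every other, but the core idea - scoring similarity instead of joining on a key - is the same.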
Identity resolution, a subset of entity resolution that deals only with person data, is intriguing as well. First-party data is even more important to companies now that we are heading towards a cookie-less world. Data matching and identity resolution will be critical tools for thriving in this scenario.
These are some of the discussions I have had in the last few days. Opening Zingg up has helped me articulate these problems clearly for the first time. My intuition behind Zingg comes from my personal experience in this space. My hypothesis is that entity resolution is a core need in the modern data stack. But, like a true machine learning engineer, I need to cross-validate my model :-) So I have been asking people in different forums whether they face challenges while building their warehouse dimensions and core entities in the data lake. I have discovered many things through the responses - homegrown solutions, hacks, mapping tables… and other unique ways of getting around or living with this.
An interesting bit happened here. In response to my question on Reddit about how people link customer records across different systems, Felipe, Data Cloud Advocate at Snowflake, sent me this:
He had read the roundup (thanks again, Tristan!) and it was awesome to experience first-hand how the word is getting around. I had a long chat with Felipe about Snowflake and Zingg, the details of which I will not bore you with right now.
So, with all this going on, my plan for the next few weeks is to continue the conversation - offline as well as online. To assess the issues people encounter while ensuring that “a single row in a single warehouse table corresponds to exactly one human”. And to stay very focused on how Zingg can be made useful, so that people have to do less work to work with their data.
Wanna talk?