Time to Zingg!
because it is important to start with why.
It has been a week of open source with Zingg. Though I had imagined the process of putting it out as a lot of work, it amounted to finally changing the git repository setting to public. AND that was it! Releasing all the hard work of the past few months, and years (more on that later), was as simple as a LinkedIn post. I had been nervous till then, making sure every bit of the documentation was proper, that the build was working... oh, and did we cover all the flavours of Spark? My overwhelming worry has been the documentation: when you have been working long enough on something, a lot of things seem obvious. It is only when you open it up that you realize how much could be better.
I followed up the LinkedIn post with announcements in various LinkedIn groups I was part of. And shares on Reddit and Hacker News. And a long (really long) article in Towards Data Science. The plan for the next few weeks is to spread the word, let people know about it and (hopefully) have some of them find it useful. I am very very glad that in the first week, people noticed Zingg. And some of them wrote back!
In my interactions this past week, people told me about a broken link (oops!) and the need for a step-by-step guide, a Python API, Docker, information on the algorithms and how we compare to Tamr or fuzzywuzzy.
Besides these, most conversations were around the motive - why me, why Zingg, why now, why OSS? In this post, I want to elaborate on these.
I have been working in the data field for a long, long time, on both the warehouse and the data lake sides of things. As a boutique consultancy, we would often get tasked with integrating data from multiple internal systems and building up the dashboards and the models. Sometimes the data also needed enrichment. It was way back in 2013 when we got stuck on a project, and we failed miserably in our first attempt. The work needed us to integrate customer records from 3 different Oracle systems. There were no common unique identifiers, no standard definition of a customer, and too many variations in the data for it to match exactly. Existing tools did not fit into our stack and were way too expensive. They also looked like black boxes with a long learning curve.
How tough could it be? - the programmer in me thought. I took a sample of 25,000 records from the 2 datasets and wrote a small algorithm to match them, which crashed my computer! I had no grounding in entity resolution at that time and did not understand how hard it could be to match everything with everything. I changed my approach and rewrote it. Let me throw a bit of hardware at this, I thought. So we built something on Hadoop with some kind of fuzzy matching using open libraries, a refinement of my single-machine algorithm specifically tailored for the attributes we had, and barely managed to get it running. We shifted to Spark, and it worked better. We were able to deliver something that saved us, but I wasn’t too proud.
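To see why matching everything with everything gets out of hand, here is a minimal Python sketch. It is not Zingg's code or the algorithm from that project: the function names, the 0.8 threshold and the sample company names are made up for illustration, and it assumes the sampled records are all compared with one another.

```python
from difflib import SequenceMatcher
from itertools import combinations


def pair_count(n: int) -> int:
    """Number of unordered record pairs a naive matcher has to compare."""
    return n * (n - 1) // 2


def naive_match(records, threshold=0.8):
    """Compare every record with every other record (quadratic work).

    `threshold` is an arbitrary illustrative cutoff, not a recommended value.
    """
    matches = []
    for a, b in combinations(records, 2):
        # Case-insensitive string similarity between 0.0 and 1.0.
        score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if score >= threshold:
            matches.append((a, b))
    return matches


# Hypothetical sample records, for illustration only.
sample = ["Acme Corp", "ACME Corp.", "Globex Inc"]
print(naive_match(sample))
print(pair_count(25_000))
```

With 25,000 records the naive approach already means more than 312 million comparisons, and that number grows with the square of the record count.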
A few weeks later we again stumbled upon this problem in a different project. This time we had to enrich internal first-party customer data with third-party data sourced externally. I was wiser, or so I thought. This time I read up on a lot of research; arXiv was my friend! And I was glad that I had not been a sloppy programmer, albeit an ignorant one! Top machine learning researchers were building systems on record linkage, and most conferences had a paper or two on it. It was an important AND a hard problem. Clearly, I had a lot to catch up on…
This project was successful, and I was able to write something that worked decently, scaled OK and handled different types of entities, though with varying levels of changes to the program. But it was a start! I talked about our work at Spark Summit, then got on with other work, as is usual in data consultancies. But I did fall in love with the problem, I must admit!
It was in 2015 that a Hong Kong-based event management company approached me. They had seen my Spark Summit talk and were looking to deduplicate their customer records. So far, they had been doing this deduplication manually. It was a big drain on their time and resources, and it was messing up all the customer analytics they wanted to do. They did not want to buy an expensive MDM or Identity Resolution tool, and the open source libraries they tried had not worked for them. I was excited, but there were two problems. Their data was in Chinese, Japanese and English. And they wanted something that would work up to a million records, which was larger than our earlier datasets.
One eye-opening thing I learnt was that they wanted the problem solved, in any way it could be, with a repeatable process for doing so. Thankfully they had some training data which we could use. It was a stretch for us, but it worked. And I couldn’t be happier!
There were many things I wanted to build into the algorithms to take them to the next level. But growing work and other priorities pushed me in another direction. I also saw that people were struggling more at the base layers of extraction and loading. Matching could only happen once that was settled. It made sense to go slow. But I kept speaking about it and having conversations around what we had built or what people were facing. The problem kind of never left me.
Things have changed drastically over time. Thousands of companies use Snowflake. Databricks is rising and rising (and raising :-)). We now have good and established practices around extraction, loading AND transformation. Analytics is a profession!
The modern data stack has arrived!
(Though I do think it is missing Zingg. What good is data without the relationships? :-)
Agree/Disagree/Have opinions? Comment below)
Eventually, when I took a pause to think about what I really want to do and which way I want to go, this problem was the highest on my list. I just see it everywhere - how can you build a customer 360 if the data is not unified? How can you do customer lifetime value or segment customers or personalize their messaging when you have multiple records of the same customer? What is Anti-Money Laundering if the households and links between customers are not established? Oh, and what about all those company names and their details which show up in every vendor system or B2B enterprise without ever getting reconciled?
And so in the last year, I started afresh and wrote Zingg from the ground up.
Now to the question of open source. I have been consuming open source software all my life. Right from Java in 1998! It feels great to give something back in my own way. I also feel that open source Zingg is more powerful than closed source, because then it is up to the community to put it to use in more ways than I have seen and could ever anticipate. It is also a promise to work closely with different people and collaborate, something I am very sincerely hoping will happen!
We will have to watch over the next few months how much of that promise materializes, but it is a start. Wish me luck!