There was a feeling of suspense and excitement hanging in the air. From the depths of the cubicles rose a bewitching hymn, whose clicks and clacks filled the room. Despite this alluring melody, silence and serenity prevailed. The countless fingers on the various keyboards pressed and released in sync, as if they were playing one giant piano. Everyone typed in tandem, and the entire scene felt straight out of an orchestra. Time, which had seemed to stand still, slipped forward when a delighted exclamation rang through the air. A gentle whoosh brought out the new Zingg feature, acting as a spell breaker to the enchantment everyone was under…
As you might have guessed by now, we just released a new version of Zingg! This release marks a complete shift from command-line training and matching jobs to a full programmatic interface for building applications with Zingg. With Python.
After Zingg became open source, we found that most of our early users came from a non-Java, non-Spark background. Though it was extremely encouraging to see a broader user base for Zingg than originally anticipated, it had been a poor assumption on our end that everyone had the JVM on their laptops to run Zingg. Hats off to the early adopters who still went ahead and resolved their entities with Zingg. We had definitely not made working with Zingg easy for them :-(
This realization led to our urgent Docker release, and it felt like we had crossed a big hurdle! Life and work moved on, and as interactions with users increased, one common pattern emerged - the use of Python. A lot of our users are writing Python data pipelines to get data from multiple sources. They are wrangling data in Python before putting it through Zingg. They are consuming the Zingg output in Python-based machine learning pipelines for risk modeling, customer segmentation, and product recommendations. They are on Databricks notebooks. Wouldn’t it be cool if such users could use Zingg through Python itself?
With this in mind, we set about defining a Python interface for Zingg. Our biggest challenge with Python was that the majority of our source code and all our core algorithms are in Java/Scala. Due to the computationally intensive nature of the identity resolution problem, we keep a very tight and performance-focused implementation. Rewriting everything would not make sense, though we did think about it for a fleeting moment. After some tension, we noticed that Apache Spark was structured similarly and realized we could borrow the best practices from there. This turned out to be a really good decision. In fact, we referred to the Apache Spark codebase a lot to build a Py4J-based interface to Zingg. We would invoke our Java code from Python, and all the Java interfaces would be replicated on the Python side. Things started to look up again.
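To make the "replicate the Java interface on the Python side" idea a bit more concrete, here is a minimal sketch of the delegation pattern. It is not Zingg's actual code: the class and method names are made up for illustration, and `FakeJavaArguments` is a plain Python stand-in for what would, in a real Py4J setup, be a proxy to an object living in the JVM.

```python
class FakeJavaArguments:
    """Hypothetical stand-in for a JVM-side object.

    In a real Py4J-based setup this would be a proxy object handed back
    by the gateway, and calls on it would execute in the JVM.
    """

    def __init__(self):
        self._model_id = None

    def setModelId(self, model_id):
        self._model_id = model_id

    def getModelId(self):
        return self._model_id


class Arguments:
    """Hypothetical Python-side mirror of the Java class.

    It keeps no state of its own; every call is forwarded to the
    wrapped Java-side object.
    """

    def __init__(self, java_obj):
        self._java_obj = java_obj

    def set_model_id(self, model_id):
        # Forward to the Java-side setter.
        self._java_obj.setModelId(model_id)

    def get_model_id(self):
        # Forward to the Java-side getter.
        return self._java_obj.getModelId()


args = Arguments(FakeJavaArguments())
args.set_model_id("100")
print(args.get_model_id())
```

The point of the pattern is that the Python class stays thin: the heavy lifting remains in the Java/Scala code, and Python only forwards calls to it.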
There was one more hurdle though, something we realized much later as we got closer to shipping - Python packaging. Unlike a usual Python application, our interface depends on the compiled Java code, so we had to figure out a way to include it as part of our package. Luckily, Spark does something similar, so the packaging there became our big inspiration. Again!
With these major hurdles crossed, we got to the grind and were finally ready to put our work out. Honestly, this release has been a really big one - not only did we build the whole Python interface, but we have also added lots of cool things to core Zingg. Stopwords, for example - stuff to ignore while matching attributes. Or special match types for emails or pin codes. More diagnostic messages while running Zingg - getting better there slowly and steadily!
In the end, once the release was done, it felt like a triumph! We had a mini celebration and then a different version of fun - rest, recharge, and laze around a bit. I haven’t heard any user complaints so far, so it is probably working as planned ;-)
Among other things, I am very excited about finally having a good integration point with dbt. Besides Python, SQL through dbt is something we frequently see (of course! :-)). With dbt supporting Python, we have a cool new way for Zingg to work with dbt. I can’t wait to try this out… or maybe someone is working on this already! Are you The One?