Building Identity Resolution on Snowflake Using Snowpark
Crossing boundaries of languages and technologies to get stuff done.
I have very exciting news to share - we have signed our first customer for Zingg Enterprise Snowflake! It feels magical, almost surreal. I call it magical not because I did not believe it would happen, but because of the pace and ease of closing the deal. If you have been doing B2B enterprise sales, you probably know what I mean :-)
Zingg open source has always supported Snowflake as a data source using Apache Spark as the connector. As we talked with our users, we realized that many of them had only SQL or warehouse workloads, and we could really simplify their stack by using Snowflake directly. Over the last year, we watched Snowpark evolve and did multiple experiments to see how viable it would be to build Zingg over Snowpark. This post will detail some of the things we did and learned as part of the process.
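For context, here is a rough sketch of that Spark-based path, with placeholder connection values and a hypothetical CUSTOMERS table: Spark pulls the data out of Snowflake through the Spark-Snowflake connector, and everything downstream runs as a Spark job.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("zingg").getOrCreate()

// Connection properties for the Spark-Snowflake connector (placeholders).
val sfOptions = Map(
  "sfURL"       -> "<account>.snowflakecomputing.com",
  "sfUser"      -> "<user>",
  "sfPassword"  -> "<password>",
  "sfDatabase"  -> "<database>",
  "sfSchema"    -> "<schema>",
  "sfWarehouse" -> "<warehouse>"
)

// Read the Snowflake table into a Spark Dataframe; blocking, training
// and matching then run on Spark as usual.
val customers = spark.read
  .format("net.snowflake.spark.snowflake")
  .options(sfOptions)
  .option("dbtable", "CUSTOMERS")
  .load()
```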
Our first challenge was building our extensive ML pipeline for entity resolution in Snowpark. We have our own blocking algorithm but rely on Spark's MLlib for classification. Snowpark does not have a built-in machine learning library but facilitates ML through third party packages like scikit-learn. This means that Snowpark Dataframes cannot be passed directly for training models; a conversion to a Pandas Dataframe becomes necessary. This has two issues. Firstly, it limits the size of data one can train on. Fortunately for us, as we use active learning to build training sets efficiently, our training data size is quite small.

The second issue was the real one for us. Our code is in Java and Scala, so introducing a Python flow for training and prediction became challenging. Zingg does substantial preprocessing and feature engineering, enriching the input data with various signals before training and prediction. Using Python meant our in-memory enriched Dataframe could not be passed directly for training. Eventually, we decided to persist the features into a temporary table and built a stored procedure in Python that reads it and trains a classifier. We persist the classifier into a Snowflake table and load it in a named Python user-defined function, which is invoked from our codebase during inference. Thankfully, a named user-defined function written in one language can easily be invoked from another, a strategy we have used extensively in our code.
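To make the flow concrete, here is a minimal sketch of the orchestration as seen from the Scala side. The TRAIN_MATCH_MODEL stored procedure, the PREDICT_MATCH UDF, the table names, and the FEATURES column are all illustrative assumptions, not Zingg's actual objects.

```scala
import com.snowflake.snowpark._
import com.snowflake.snowpark.functions._

// enrichedFeatures is the Snowpark Dataframe holding the engineered features.
def trainAndScore(session: Session, enrichedFeatures: DataFrame): DataFrame = {
  // 1. Persist the enriched features so the Python side can see them.
  enrichedFeatures.write.mode(SaveMode.Overwrite).saveAsTable("ZINGG_FEATURES_TMP")

  // 2. Train via a named Python stored procedure that reads the table,
  //    fits a scikit-learn classifier and saves the model to a Snowflake table.
  session.sql("CALL TRAIN_MATCH_MODEL('ZINGG_FEATURES_TMP', 'ZINGG_MODELS')").collect()

  // 3. Score pairs by calling the named Python UDF that loads the persisted
  //    model; a UDF registered in one language is callable from any other.
  //    col("FEATURES") stands in for whatever columns the UDF expects.
  enrichedFeatures.withColumn(
    "MATCH_PROBABILITY",
    callUDF("PREDICT_MATCH", col("FEATURES"))
  )
}
```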
Snowpark also lacks a graph library like GraphX or GraphFrames in Apache Spark. Luckily, the Snowflake team helped us here and provided us with a Scala library for graph processing. We had our work cut out integrating this into our code, especially since Snowpark Scala Dataframes are not interchangeable with the Snowpark Java ones.
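Because the two Dataframe flavors cannot be handed to each other directly, the simplest bridge we can sketch here is to round-trip through a named table (the table name is illustrative, and this is not necessarily Zingg's exact mechanism):

```scala
import com.snowflake.snowpark.{DataFrame, Session}

// The Snowpark Java side materializes its result first, e.g. with
// javaDf.write().mode(SaveMode.Overwrite).saveAsTable("ZINGG_GRAPH_EDGES"),
// and the Scala graph code then picks the same rows up from that table.
def edgesForGraphLib(session: Session): DataFrame =
  session.table("ZINGG_GRAPH_EDGES")
```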
Building a cross-domain ML product demands a great deal of generalization in the flow. We do not know beforehand the shape of the input dataset - it could be product matching, it could be customer matching, or it could be B2B account matching with domain names and other details. Many row-wise operations become tricky without a map equivalent in Snowpark. Hence we built our own version of map, passing all the columns into a VARIANT type array and then getting individual values out as necessary. This was a fairly tough problem, and we were pretty proud when we got it to work.
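A stripped-down version of that map workaround, assuming a PACKED column name of our own and using Snowflake's ARRAY_CONSTRUCT and GET functions, looks roughly like this:

```scala
import com.snowflake.snowpark._
import com.snowflake.snowpark.functions._

// Pack every column of an arbitrary input Dataframe into one VARIANT array,
// so row-wise logic does not need to know the schema up front.
def packRows(df: DataFrame): DataFrame = {
  val allCols = df.schema.fields.map(f => col(f.name))
  df.withColumn("PACKED", array_construct(allCols: _*))
}

// Pull an individual value back out by position with Snowflake's GET function.
def valueAt(df: DataFrame, index: Int, alias: String): DataFrame =
  df.withColumn(alias, callBuiltin("GET", col("PACKED"), lit(index)))
```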
There were some other minor hiccups, like temporary UDFs not getting cleaned up, and no way to access data in Dataframe rows by column name, only by column index, etc. I am sure these will get sorted out as the Snowpark API matures further.
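For example, reading a value back from a collected row means resolving the column's position from the schema first; a small helper along these lines (the names are just illustrations) does the job:

```scala
import com.snowflake.snowpark.DataFrame

// Snowpark rows are read positionally, so resolve the column index from the schema.
def firstValueOf(df: DataFrame, columnName: String): String = {
  val idx = df.schema.fields.indexWhere(_.name == columnName)
  df.collect().head.getString(idx)
}
```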
One of our biggest concerns was that our codebase assumed Apache Spark as the de facto processing engine, meaning that the dependencies were all over the code. Apache Spark is an integral part of Zingg, and is intricately woven into all Zingg operations. So our first thought was to have two codebases, one for Apache Spark and the same logic ported to Snowpark. All functionality would exist independently in each. There would be some common code, like reading user configuration, that would be shared between the two. As we were planning the Snowflake version of the product to be closed source, the code would anyway have to live in a different repository from the open source one. Hence, why not? This would also let us build the product in less time, as the open source code is extensively validated and already in production. Ship faster, they say! We spent a day or two doing global search and replace operations on dependencies and imports to see how this would shape up, and realized soon enough that this was going to be a mess. If we had to fix a bug, let alone add a new feature, we would have to do it in two places. Clearly, this was a recipe for a lot of duplicated work and technical debt, and I am glad we abandoned it soon enough.
The next approach was to build up generic classes and then templatize them with Spark or Snowpark equivalents. All the logic remains centrally located in one place, and the processing engine equivalents have their own extensions. We can easily add another processing engine if a new Dataframe API comes up. This meant touching every single bit of our code, which was going to be a lot of work on both the development and the testing fronts. Not a decision we wanted to jump into, but there wasn't a lot of choice here. A minimal sketch of this templating pattern follows below.

I have to let you in on a secret here. Besides the user discussions, we had another motive for the Snowflake product. We wanted to enter the Snowflake challenge. Some cash and publicity can never hurt a frugal startup! We wanted to apply with a running application, which left us with roughly 1.5 months to get everything running, and that is the deadline we set for ourselves at the beginning of the year. It was a lot of work, and more work, and more work. It was tiring, but it was fun. We finished a week before our self-imposed deadline, without any shortcuts or hacks :-)
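Here is that much-trimmed sketch of the templating idea, with illustrative names (ZFrame, SparkFrame, SnowFrame) rather than Zingg's actual classes: the matching logic only ever sees the generic wrapper, and each engine supplies its own thin implementation.

```scala
import org.apache.spark.sql.{Column => SparkCol, DataFrame => SparkDF}
import com.snowflake.snowpark.{Column => SnowCol, DataFrame => SnowDF}

// Engine-agnostic view of a dataframe; all shared logic is written against this trait.
trait ZFrame[D, C] {
  def df: D
  def col(name: String): C
  def select(cols: String*): ZFrame[D, C]
  def filter(condition: C): ZFrame[D, C]
}

// Apache Spark flavor.
class SparkFrame(val df: SparkDF) extends ZFrame[SparkDF, SparkCol] {
  def col(name: String): SparkCol = df.col(name)
  def select(cols: String*): ZFrame[SparkDF, SparkCol] =
    new SparkFrame(df.select(cols.map(df.col): _*))
  def filter(condition: SparkCol): ZFrame[SparkDF, SparkCol] =
    new SparkFrame(df.filter(condition))
}

// Snowpark flavor.
class SnowFrame(val df: SnowDF) extends ZFrame[SnowDF, SnowCol] {
  def col(name: String): SnowCol = df.col(name)
  def select(cols: String*): ZFrame[SnowDF, SnowCol] =
    new SnowFrame(df.select(cols.map(df.col)))
  def filter(condition: SnowCol): ZFrame[SnowDF, SnowCol] =
    new SnowFrame(df.filter(condition))
}
```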
Snowpark has recently been open-sourced, so it is going to be even more fun building around it. Zingg is surely one of the more complex applications being built on Snowpark, and we can't wait to add more stuff to our identity resolution product and push Snowpark further.
In the end, we did not win the challenge, but we had some nice interactions with the Snowflake team and learned a lot that will help us as we now put Zingg Identity Resolution on Snowflake into production.
We are really excited about how this journey is shaping up!