Earlier last month, we started working on a feature we are calling Match Statistics. The idea came from the incredible folks at Fortnum And Mason, who are using Zingg as the backbone of their Single Customer View. While running Zingg incrementally, Match Statistics would expose how cluster numbers change as records get inserted and updated in the identity graph. This information about changing clusters would be a good first step towards observing the entity resolution pipeline. If the number of clusters changes disproportionately to the number of records added and updated, an alert could be triggered.
As we started thinking about this feature, we realized that it could be made even more powerful. Currently, the Zingg output consists of the original data along with the Zingg ID and the match probabilities. If we could surface information about the linkages found between the records within a cluster, we could help users understand matching internals and spot anomalies. When our users are dealing with millions of records, finding the needle in the haystack is critical. Hence, the specification for the feature expanded to include additional statistics.
While building new features or augmenting existing flows within Zingg, we have to design for both the Community and Enterprise products, since they share the same code base. The design of any feature on either product has to take into consideration the impact on both and ensure that they continue to work individually as well as in tandem. And then comes the trickier part. The design has to work well for both Snowflake and Spark deployments. Not just functionally, but performance-wise as well. The Zingg Enterprise code base is mostly common across Spark and Snowflake. However, since both platforms have very different APIs, query planning, optimisation and execution, we have a lot of ground to cover while building or modifying anything.
That's what we thrive on :-)
Eventually, we designed a stats package that would get the record level and cluster level matches from the match process. This package would be responsible for computing the metrics we cared about and writing them to the appropriate locations, or Pipes in Zingg lingo. The main flow would stay mostly unaltered, except for invoking the stats package and passing the dataframes around.
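To make the shape of the design a little more concrete, here is a minimal sketch of one such metric, the cluster size distribution. The class name MatchStatsCalculator and the z_cluster column are assumptions for illustration, not Zingg's actual API; the point is simply that the stats package only consumes the match output and derives metrics from it, without touching the main flow.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.count;

// Hypothetical sketch of the stats package boundary, not Zingg's actual API:
// it consumes the match output and produces cluster level metrics, leaving
// the main match flow untouched.
public class MatchStatsCalculator {

    // "z_cluster" is assumed to be the cluster id column in the match output.
    public Dataset<Row> clusterSizeDistribution(Dataset<Row> matchOutput) {
        return matchOutput
                .groupBy("z_cluster")
                .agg(count("*").alias("cluster_size"))   // records per cluster
                .groupBy("cluster_size")
                .agg(count("*").alias("num_clusters"));  // clusters per size
    }
}
```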
As we started building the core functionality, we ran into a very strange issue. Within our flow, we do probabilistic matching only on records which do not match deterministically. If two records match deterministically, we do not try to figure out whether they also match probabilistically, since that would be a waste of computation time.
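In Spark terms, you can picture that step roughly as the sketch below, where pairs that already matched deterministically are removed before the probabilistic matcher ever sees them. The class and column names are illustrative assumptions, not Zingg's internals.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class ProbabilisticCandidateSelector {

    // Illustrative only: drop pairs already matched by deterministic rules so
    // the probabilistic matcher never scores them again. "z_zid1"/"z_zid2"
    // are assumed pair-identifier columns, not Zingg's actual schema.
    public static Dataset<Row> excludeDeterministicMatches(
            Dataset<Row> allPairs, Dataset<Row> deterministicMatches) {
        return allPairs.join(
                deterministicMatches,
                allPairs.col("z_zid1").equalTo(deterministicMatches.col("z_zid1"))
                        .and(allPairs.col("z_zid2").equalTo(deterministicMatches.col("z_zid2"))),
                "left_anti");
    }
}
```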
Now, something odd started happening. Within the stats package, which was consuming the probabilistic and deterministic structures, we found some records which matched probabilistically as well as deterministically. This threw an error in the statistics computations, as they banked on each record pair matching one way or the other, or not matching at all. Our carefully crafted logic was breaking.
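A quick way to surface this kind of anomaly is to intersect the two outputs, which under the intended flow should always be empty. A minimal sketch, again with placeholder names:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class MatchOverlapCheck {

    // Sanity check (illustrative): pairs that appear in BOTH the deterministic
    // and probabilistic outputs. Under the intended flow this should be empty.
    // Column names are placeholders, not Zingg's actual schema.
    public static long countOverlappingPairs(
            Dataset<Row> deterministicMatches, Dataset<Row> probabilisticMatches) {
        return deterministicMatches.select("z_zid1", "z_zid2")
                .intersect(probabilisticMatches.select("z_zid1", "z_zid2"))
                .count();
    }
}
```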
However, this part of the flow never changed. All we were doing was computing additional metrics in this package. And the main flow was unaltered. So how could stuff which was working earlier AND which was untouched stop working? What followed was a crazy cycle of runs, tests and validations.
Our initial thought was that we had made an error in collecting the data. Why not run it in another sandbox? Umm, the same result. Maybe we had run the flow twice, appending to the original intermediate data and leaving it in the wrong state? But if that were the case, every record pair would appear twice. In our case, only some of the record pairs were being matched probabilistically as well as deterministically. We then sought to understand if there was anything special about those records. We looked at column values, deterministic rules, and the probabilistic match criteria. To our dismay, every run threw up a different set of extra records. But clearly, it was not a data or code issue then.
Things were surely getting curiouser and curiouser. We were completely off track from the original feature we set out to build :-(
Next, we turned our focus to the execution environment. When we ran the build on Snowflake, there were no extra records. This was a relief, since it meant it was the Spark side of things that we needed to focus on. Then we ran the workload on Databricks. No issue at all! This was a relief too, since it meant that the previously released production code deployed at customer locations was working fine and there was no snake in the grass.
After two weeks of racking our brains, we were finally making progress :-) Things started moving pretty fast after that. We searched for known issues and discovered an issue in Apache Spark. When we changed the Apache Spark version, the issue disappeared. To be absolutely sure, we tested different datasets and different volumes, and all looked good.
Debugging complex systems is always challenging, but a few techniques consistently help.
Time travelling to the last known good state helps pinpoint which change, or set of changes, could be the cause.
Surgical elimination is super critical in such scenarios. Being able to quickly run different non-overlapping tests can lead to faster results.
Weird behavior - missing records, extra counts, etc. - that varies across runs is more likely to be caused by the environment than by the actual code.
Such non-deterministic errors usually stress teams out, since things feel very out of control. For engineers like us who are used to predictable outcomes, these issues can be very frustrating. Keeping a cool head goes a long way.
After this wild chase, we are now back to building the Match Statistics feature. We hope to ship it soon!
Just one last thing that I forgot to share. Many wise developers have said this before me, but I must say it again! Plan for at least twice the time you anticipate it will take :-)
If you have a similar story to share, hit reply!