Measure What Matters! Our Simple Data Stack To Understand Open Source Adoption
Is it a Modern Data Stack? Surely! Minimalist? Of course!
What good is a song composed but not hummed by the audience? The musician may love it as their best song, but it will soon fade away from the memory of its listeners. While the pleasure of building and innovating is its own reward, creators need external validation to generate the ammunition to make even bolder bets.
As people who pour our hearts and souls into building Zingg, we too pause at times to reflect on whether we are indeed creating value for our users. There are two main things we are super curious to learn at this stage.
To learn how people are discovering us
To understand how they are using Zingg
Signals for these come from multiple sources. The important ones are direct user and customer communication, Slack, website traffic, product downloads and product usage statistics.
Direct interaction with users and customers is the richest form of learning. Many of our open source users join our Slack and chat with us, and we learn about the problems they are solving with Zingg. Some of them write to us over email, or DM us on Slack or LinkedIn, sharing what they are building with and around Zingg. They want to check the best way to use Zingg for their scenario. We ask them about their stack and use cases, and how we can help them better. Learning more about our users guides us in shaping the roadmap for Zingg. Zingg's Docker image and the Zingg Python API are a direct result of these interactions. We built them because we realized they would help users generate identity resolution results faster and enable them to use Zingg as a first class citizen in their transformation pipelines.
While interacting directly, we often ask users where they learnt about Zingg. We augment this with traffic data from Google Analytics. GitHub also provides repository traffic analytics for the previous fortnight.
Often, the discovery is word of mouth. For us, having a great product and providing even better support is the best form of engaging people. It suits us as developers, it helps our users drive value faster, and it builds a strong relationship with the user. A look at the testimonials on the Zingg website reveals we are doing a nice job there.
And we should still aim to get better here. Like identity resolution, support is never going to be perfect, so we always have things to improve upon.
Many times the discovery is through the tutorials and guides we have written, or talks and events in select data communities. Instead of producing a lot of content, we have focused on deep, long-form articles that help the user understand the problem of identity and entity resolution, why it is needed, the challenges in resolving entities, and how Zingg can be run on different platforms like Databricks and Snowflake. Early adopters of a product know the problem and are looking for a solution. Hence, we owe it to them to make using Zingg easy so that they can focus on solving the bigger puzzle - Customer 360, Segmentation, Fraud, Risk, Compliance etc. Discoverability is a side benefit of our content; helping our users adopt Zingg is the main goal.
Broadly, these data points are sufficient for us to understand and improve discoverability at this stage. Besides this, we track open source adoption and product usage anonymously. We gather information about:
GitHub downloads of our releases
Docker pulls
PyPI installs
Slack users
Documentation views
Together, the above give us a fair idea of our adoption and how we are growing. For now, we pull these values manually every couple of weeks and track them in an Excel file, charting out a time series to understand our growth rate. We do have a column for GitHub stars which we populate but don't think much about, as we have richer data to get signals from.
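For anyone curious what the fortnightly pull looks like, these numbers all come from public endpoints and could just as well be scripted. Here is a minimal Python sketch, assuming the public GitHub, Docker Hub and pypistats.org APIs; the repository, image and package names are placeholders rather than the canonical ones.

```python
# A rough sketch of the fortnightly stats pull, using public endpoints.
# The repo, image and package names below are assumptions, not canonical.
import datetime
import requests

def github_release_downloads(repo="zinggAI/zingg"):
    # Sum download_count across all assets on the first page of releases
    releases = requests.get(f"https://api.github.com/repos/{repo}/releases").json()
    return sum(a["download_count"] for r in releases for a in r.get("assets", []))

def docker_pulls(image="zingg/zingg"):
    # Docker Hub exposes a cumulative pull_count per repository
    data = requests.get(f"https://hub.docker.com/v2/repositories/{image}/").json()
    return data["pull_count"]

def pypi_recent_installs(package="zingg"):
    # pypistats.org aggregates PyPI download counts
    data = requests.get(f"https://pypistats.org/api/packages/{package}/recent").json()
    return data["data"]["last_month"]

if __name__ == "__main__":
    print(datetime.date.today().isoformat(),
          github_release_downloads(), docker_pulls(), pypi_recent_installs())
```

A script like this could append a row to a CSV instead of the Excel sheet, but at our scale the manual pull every couple of weeks works just fine.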
Coming to product analytics, the aim here is to learn the different aspects of how people are using Zingg:
Which platform do they run Zingg on?
Is the product scaling? How many records do they match?
How much time does it take to run?
Do they run Zingg multiple times or is it a one time deduplication?
For any product analytics to function, we need a way to identify unique installations and send usage events from the installation to a place where we can analyze them. Our questions are fairly simple, but the novel nature of our product, as well as its deployment, makes it tricky to get the answers.
Often, Zingg is deployed on cloud resources like Databricks or EMR, or on private enterprise networks which are blocked from sending information outside the VPN. So if a user downloads and runs Zingg in such an environment, we have no way of knowing how they are using it.
Another issue is that it is impossible to identify installs on cloud systems, as the virtual machines change with every run. Even though we don't want to identify the user, only the install, there is no way to track a unique installation by writing a random UUID to disk. This problem exists in our Docker environment as well.
Some of the features we are building now will allow users to trigger cloud jobs from their machines, so we will think about install identification then. Till then, we already have the download stats, and that's a good enough proxy for us. Instead, we gather usage data. We respect user privacy, so we limit data collection to what reflects product usage. We also provide a flag to turn data collection off.
We use Google Analytics as our event server and post custom events to it from the code every time Zingg is run. The events land in our GA account, which is configured to pipe them into BigQuery. For every run of Zingg, we collect the data and output formats, the time it took to run, the number of records and a few other stats. Every few weeks, a handful of SQL queries help us understand the data volumes we are processing, the number of daily jobs run and their trends, the time taken to complete each run etc. Mostly, it is a brief glance at the numbers to make sure we are on the right track.
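To make the event flow concrete, here is a rough Python sketch of what posting a run event can look like, assuming GA4's Measurement Protocol. The event name, parameters and the opt-out flag are illustrative, not Zingg's actual field names.

```python
# A minimal sketch of posting a usage event, assuming GA4's Measurement Protocol.
# Event and parameter names are illustrative; GA credentials are placeholders.
import uuid
import requests

GA_ENDPOINT = "https://www.google-analytics.com/mp/collect"
MEASUREMENT_ID = "G-XXXXXXX"   # placeholder
API_SECRET = "secret"          # placeholder

def send_run_event(num_records, data_format, output_format, duration_secs,
                   collect_metrics=True):
    # Respect the user's choice to turn data collection off
    if not collect_metrics:
        return
    payload = {
        # A random client id per run: it identifies the event, not the user or install
        "client_id": str(uuid.uuid4()),
        "events": [{
            "name": "zingg_run",
            "params": {
                "num_records": num_records,
                "data_format": data_format,
                "output_format": output_format,
                "duration_secs": duration_secs,
            },
        }],
    }
    requests.post(GA_ENDPOINT,
                  params={"measurement_id": MEASUREMENT_ID, "api_secret": API_SECRET},
                  json=payload, timeout=5)
```

GA then takes care of storage and the BigQuery export, so there is no event pipeline for us to build or run.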
Many times, the usage data has an unheard story to tell. We discovered that a good number of users were using Zingg to resolve entities on Snowflake. A Spark deployment is not ideal for them, as they are not using Spark anywhere else in their stack. Could we leverage Snowpark and build a lighter, tighter product for such users? It seems we can, and I have a lot of exciting stories to tell about that effort and where it is leading us!
Coming back to our minimalist data stack.
Sometimes we do detective work ;-) If we see events of the same format and number of fields triggered at regular time intervals through the day, we treat it as Zingg running in production. Once in a while, it is fun to look into individual events and extrapolate the usage to different scenarios.
Churning out aggregate stats is a revelation too - like Zingg has resolved at least 2 billion records in the last two months. This is a huge number, and it has grown rapidly over the last few months.
When I say “at least”, it is with the awareness that we are not collecting all the data points. We have talked with many users who are running production loads but they do not show up in our telemetry. So our analysis tends to be conservative. Which is absolutely fine.
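For the curious, an aggregate like that is just a short query over the events exported into BigQuery. Here is a sketch of the idea, assuming the standard GA4 export schema (events_* tables); the project, dataset, event and parameter names are illustrative.

```python
# A sketch of the kind of aggregate we glance at, assuming the GA4 BigQuery export.
# Project, dataset, event and parameter names are illustrative.
from google.cloud import bigquery

QUERY = """
SELECT
  COUNT(*) AS runs,
  SUM((SELECT value.int_value
       FROM UNNEST(event_params) WHERE key = 'num_records')) AS records_matched
FROM `my-project.analytics_123456.events_*`
WHERE event_name = 'zingg_run'
  AND _TABLE_SUFFIX BETWEEN
      FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 60 DAY))
      AND FORMAT_DATE('%Y%m%d', CURRENT_DATE())
"""

client = bigquery.Client()
for row in client.query(QUERY).result():
    print(f"runs={row.runs}, records_matched={row.records_matched}")
```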
Overall, the aim of this stack is to help us get a sense of how we are doing, what we can do to help our users and what more we should build. We are on free plans on both Slack and GA. The frugality is in the cost, and also in the implementation. As a small team, we needed something that is easy to set up and answers our basic questions without much effort. Also, the trend in the data is more important than the individual data points at this stage.
We may miss some dots, but the path is clear. Which is what matters!