What’s the Perfect Data Model for Product Analytics?

Kubit CEO/Founder Alex Li shares his recommended best practices for modeling your Product data – and explains why warehouse-native Product Analytics is such a game-changer.

Alex Li

Alex Li

Founder / CEO

14 minutes

July 7, 2023

Our customers often ask us: what is the recommended data model for Product Analytics?

Just as your product evolves constantly, the best data model must also evolve – tailored to your analytics use cases through an iterative process.

After working in this field for over a decade, my answer is this: while there certainly are some principles and blueprints to follow, a universally perfect data model just doesn’t exist.

In this blog post, I’ll give you my recommended best practices.

Let’s back up –
when should I start doing
Product Analytics?

Before we get into data modeling for Product Analytics, let’s address another related issue.

A comment we often hear from companies is: “We need some time to improve our current data model before deploying Product Analytics.”

But – is that really the case?

First of all, Product Analytics is not just a one-time project.

Rather, it’s a thought process that takes time to fine-tune and merge into your data culture.

Like your digital products, Product Analytics is an iterative process with constant changes.

Without testing and verifying against your actual use cases, the “perfect data model” is just a pipe dream that may fail on launch – and which, even in the best of circumstances, will require ongoing revisions.

Second, you shouldn’t be feeling held back by your data model!

An initial Product Analytics data model can be as simple as having all the raw data in the proper schema to support context-rich and flexible analysis. As long as you follow the basic principles that I outline below, your data model should be able to adapt to changes easily.

Third, how about needing a “clean” data model because your current one is very “raw”?

Well, actually, the raw data model is perfect for running some quick-and-dirty analyses to get instant insights.

It’s also great for figuring out and verifying your analytics needs and – more importantly – for troubleshooting at the lowest possible granular level. Why couldn’t your Product Analytics tool work on top of the current “raw” data model while building the “clean” one? It should be able to support them both concurrently, and give you the iterative path to migrate and evolve your Product Analytics.

Yes, the best time to start Product Analytics is right now.

Free yourself from the past!

I think the question, “How do I model my data for Product Analytics?” stems from misconceptions created by the first generation of Product Analytics tools (like Amplitude and Mixpanel).

During the boom of mobile apps and the mobile app ecosystem, most developers lacked the technologies and expertise to quickly build large-scale data models themselves. Instead, they settled for SDKs – letting, for example, Amplitude collect their product data and store it in a proprietary data silo.

Product Analytics Innovation Timeline

Early Product Analytics tools were built before cloud data warehouses revolutionized the data stack. Kubit connects directly to your data warehouse and leverages modern data-sharing capabilities.

But this infrastructure has many limits. When the quantity of data you need to analyze increases, this solution requires more API integrations or complex ETL jobs to ingest all data sources into the silo.

This approach started to reveal its limits, including duplicated data copies and endless batch jobs to babysit.

Meanwhile, major developments were happening related to both data regulations and data infrastructure. Broadly, we saw a growing focus on data privacy and security, eventually resulting in data privacy policies like GDPR and CCPA and the phasing-out of third-party cookies. The rise of cloud data warehouses and the modern data stack made it possible for organizations to maintain control of their data. It also enabled the elimination of siloed black boxes and the reconciliation of all analytics with a Single Source of Truth.

Kubit, as the first Product Analytics tool that fully leverages data warehouses and the data-sharing capabilities of the modern data stack, empowers companies to do Product Analytics very differently than traditional tools.

But I’ve observed that, sometimes, our customers’ initial instinct is to get started by copying a proprietary data model developed by one of the traditional platforms – forgetting that it’s a model that was initially developed a decade ago, and designed to accommodate hundreds of different customers’ use cases!

We remind our customers that they can free themselves of these old models, which are not universally applicable and are full of legacy bandits.

The only Product Analytics data model that will fit like a glove is one designed for your product and business, and which can adapt to your evolving needs.

Product Analytics:
10 common use cases

Before we discuss the principles of designing a Product Analytics data model, let’s consider the typical use cases for Product Analytics – a specialty that’s all about understanding the full lifecycle of users of your app or other digital product.

The core of Product Analytics is user behavior analysis, which is based on all the action events your users take, explicitly (in the UI) or implicitly (backend or third-party).

For example:

  • App Launch
  • Login
  • Page/Screen View
  • Clicks
  • Purchase
  • Subscription Renewal
  • Likes.

With this detailed tracking information, you can get a variety of insights.

Count the occurrences of certain events and compare their ratios through a time series – to understand how your users engage with your products over time.

For example:

  • DAU, MAU: Number of unique users per Day or Month
  • Streams (Events) / DAU: Number of Streams per Daily Active Users
  • Viewing Minutes: Sum of the total time across multiple View events
  • Revenue / Order: Average purchase amount across multiple Order events.

Measure the conversion/drop-off rate between the steps of some key user flows like Registration, Checkout, or Social Loop funnels.

For example:

  • Registration Success Rate: % of guest users who complete the registration flow
  • Attach Rate: % of users who complete the subscription or purchase flow after seeing the paywall.

Measure how often your users come back and why.

For example:

  • Dn, Wn Retention: % of new users who open the app again on D1/D2, or W1/W2. Typically, you would use this metric to compare the effectiveness of your user acquisition or reactivation campaigns.
  • Viewing Retention: % of users who continuously watch certain shows at different time intervals (daily, weekly, monthly).
  • Subscription Retention: % of your subscribers who renew over time.

To focus and compare on certain perspectives of the measures, they should all be filtered or grouped by certain properties (attributes).

For example:

  • App properties: Platform, App Version, App Id
  • Device properties: Device Model, OS Version, Locale, Country
  • User properties: Age, Gender, Language, Install Date, Persona
  • Context properties: Page, Flow, Campaign.

Cohort means a segment of users who match certain criteria. It’s the crown jewel of Product Analytics, enabling your product decisions to be guided by the dynamic identification and tracking of the behavior of specific groups of users. Cohort can be used as either a Filter or Breakdown in any of the above-mentioned measures.

For example:

  • Active Viewers: Users who watched more than 1 hour of a certain show in the last 7 days
  • Churned Subscribers: Users who didn’t renew their subscriptions
  • Dormant Users: Users who didn’t open the app for more than a week.

All digital products thrive thanks to user acquisition. Being able to identify and attribute the sources of new users is critical. Attribution data is at the heart of major decisions about the effectiveness of marketing spend – decisions that often involve millions of dollars.

The following use cases require you to track and pinpoint which channels your new users come from, for example:

  • paid advertisement
  • Youtube/Instagram campaigns
  • social posts
  • organic installs from App Store or your website
  • content promotion with your apps.

Attribution is notoriously hard to track. This is often due to issues with the accuracy of your data – caused by privacy concerns and the general messiness of the ecosystem. Delayed attribution and multiple touchpoint situations can also present challenges. Once the attribution source is identified, these campaigns can be applied as filters or breakdowns in users’ engagement, conversion, and retention to measure their impact and effectiveness.

Every day, you acquire some new users, but you also lose some through churn. Reactivation (or re-engagement) becomes a critical growth strategy, especially when your product matures. You need to adjust your product constantly based on measurements of the impact of your push notifications and email campaigns.

For example:

  • Open rate: % of push notifications or emails opened
  • Click and conversion rate: % of users who reacted to the messages
  • Retention rate: impact of a campaign, if any, on user retention.

Continuous Product Design requires instant insights from your experiments (e.g. A/B tests). The ability to compare the outcome of different treatments with statistical analysis is a key part of data-driven decision-making. Almost all digital products run constant experiments because they have quite a significant amount of user data from which to derive insights. All you have to do is compare any of the KPIs between test groups and compute the statistical significance.

Content has become a key part of the growth strategy of many digital products. The ability to constantly test different content (e.g. different shows) through various presentation methods (catalogs, carousels, push notifications) and measure the impact on engagement, retention, virality, eventual subscriptions, and LTV (LifeTime Value) is one of the critical success factors of any content-driven platform.

Principles of Product Analytics data modeling

As you see already, Product Analytics is quite complex, and the data model can be even more so – because no two digital products have the same use cases, data sources, or vendors. Instead of pushing a “perfect data model,” I think it’s best to follow these general principles and best practices.

These are based on my own experiences, including – the best way to learn – some mistakes made along the way.

The cornerstone of the analytical data model is the star schema.

Here’s a quick refresher on star schemas. “Star” refers to the shape of the ERD (entity relationship diagram). At the center is a “fact table,” which contains all the details about the event or “what happened.” The fact table is linked by join to various “dimension tables” which, in the ERD, fan out from the centralized fact table. The dimension tables contain attributes that provide context to the fact table – often providing the details needed for slicing and dicing the data in the fact table, and for analyzing it from different perspectives.

Here’s an example:

Typically, the central fact table captures all the action events your users perform. Then you have many dimension tables containing all of the properties for each use case. All of these tables can be joined (associated) through a common user_id.

You can always have multiple fact tables. Each can be treated as the center of its own star schema or unioned together to serve other purposes. For example, Segment typically dumps one table per event type. Your Product Analytics vendor should be able to directly use these tables as they are, instead of forcing you to develop ETL jobs to create a single gigantic fact table.

Unlike tables in transactional databases (like MySQL, PostgreSQL, Oracle, SQL Server, etc.), tables in data warehouses have several special traits specifically for analytical purposes.

Analytical insights must be repeatable and verifiable. Not only are data warehouse tables mostly read-only, they are also append-only. This means data flows into these tables like a log: every entry is immutable and you only append new entries instead of updating/changing what happened in the past. 

Data warehouses are specially designed to store denormalized data efficiently, and to utilize it for very fast analytical queries over billions of rows (terabytes) of data. The fact table should be denormalized to avoid unnecessary table joins. This means you should capture as many properties as possible when the event happens and store them as separate fields/columns in the fact table.

For example, each event should contain all the properties which are available at the time –including user, app, device, and context. Dimension tables come into account for after-the-fact properties – those that are coming from vendors, that are only available in the backend, or that are computed through batch jobs like machine learning algorithms (e.g. each user’s Persona).

One common misconception is that it’s a good idea to use a single JSON field to store together multiple properties which are dependent on event types.

Cloud data warehouses are all columnar-based, which means they can handle thousands of nullable columns at ease and with the best optimizations in place. Settling for JSON fields (a practice sometimes resulting from laziness) will create problems. This is because:

  • There’s no data validation at ingestion time. It’s too easy for typos and other errors to occur without anyone noticing.
  • Everyone who queries the data must be an expert on JSON data structure and duplicate that logic everywhere.
  • JSON keys are not properly indexed, thus, query performance will suffer significantly.

Despite how inefficient JSON fields are, your Product Analytics tool should be flexible enough to support them when needed.

Proper data governance is required to manage the complex data models for Product Analytics.

Data governance best practices (which are often ignored) include maintaining a data dictionary and monitoring data quality. Without these steps, your data risks being mostly garbage in and garbage out.

  • Data Dictionary: All of your events and their properties must be documented thoroughly and ready to use. The documentation provides the context, detailed explanations, and history of any modifications. In product development, change is the only constant – and keeping track of every version of your data dictionary entry is critical to separate noise from meaningful insights.
  • Data Quality: Everything that can possibly fail, will. Your data pipeline may stop, your code changes may introduce bugs, and your data may be lost or duplicated. All of these possible incidents should be monitored and reported to guarantee the accuracy and completeness of your data – ideally before others have wasted days trying to figure out anomalies. 

Most Product Analytics use cases deal with time-series data, which should be captured and stored at the lowest granular level possible, and ideally context-free. Regardless of the timezone the analysis uses, of the calendar month (4 weeks or 30 days), or of the fiscal quarter… Your raw data should contain the basic information needed to satisfy these use cases.

Typically your events should have their timestamps captured and stored (accurate down to the millisecond), and they also should be in sync with your other data sources (e.g. dimension tables) for joining purposes. This is especially important in the attribution and experiment use cases, where the campaign or A/B test treatment should only impact the events that happened after the touch point.

Unfortunately, this kind of straightforward causal dependency is often neglected or even not supported by Product Analytics tools (because it requires a special yet expensive join using “Correlated Subquery”).

One frequently used design pattern for the Product Analytics data model is Slow Changing Dimension (SCD), specifically “Type 2: Add rows with effective date.”

For example, your users may have been touched multiple times by your acquisition or social campaigns, in which case you can append a new row for every touchpoint – when it happens, along with an effective date to mark the effective time range of the entry.

This gives the ultimate flexibility at analysis time, allowing you to apply different kinds of attribution strategies like First-Touch, Last-Touch, or Multi-Touch.

Some measures like session length and total viewing time must be computed across multiple events, such as “Session start/end” and “View start/end.” However, in real life, the ending events have to be derived based on certain thresholds like idleness or no action.

It can be tempting to pre-compute such measures into their own tables, often because your Product Analytics tool can’t handle them on-the-fly.

Unless you have a very fixed definition, usually it’s a bad idea to invest in heavy-duty ETL job s– not to mention another table with “duplicated” information. How about when you want to see the “what-if” scenario? Or what if you want to troubleshoot some bugs and thus need to change these definitions dynamically?

The new generation of warehouse-native Product Analytics tools should be able to let you define such measures as reusable components and easily compute them on-the-fly to give you the maximum flexibility, while hiding the complexity. Another similar use case is combining legacy events into one logical event.

For example, over time, you may have dozens of variants of login events with different names, but for analytical purposes, they should always be treated as a single event. Your Product Analytics tool should allow you to group these together during query time dynamically. 

You don’t want to put all your eggs in one basket and be constrained by your Product Analytics vendor’s SDKs.

Choosing the best tool for the job is always a better idea, instead of getting locked into something that claims to be universal – but which really can’t get anything done well.

For event instrumentation, you don’t need to reinvent the wheel. There are already very mature Customer Data Platforms (CDPs), including commercial and open-source solutions, to provide you with the ultimate freedom to collect and send the data anywhere. With a simple flip of a switch, you can decide what data to collect, and send that data to a cloud data warehouse or other vendor as the destination. Even better, you can choose to replay your history data if there are any mishaps or needs.

The days are long gone when we had to invoke API calls or build designated batch jobs to consume data from your vendors (those handling your acquisition, experiment, and reactivation campaigns). That way of working was expensive, fragile, and a headache to maintain.

All cloud data warehouses are headed towards data-sharing, and your vendor should be able to share your data securely and let you query it in real-time without ever needing to make a copy. All it takes is several lines of SQL code, and you can join with these vendors’ dimension tables live in all of your Product Analytics work.

How one Kubit customer successfully modeled their data for Product Analytics

Let’s look at how one of our customers built their Product Analytics model (let’s just call them “AcmeTV”).

AcmeTV is a top video-streaming platform with millions of daily active users. They started their Product Analytics journey using Segment as their CDP to instrument and collect their user behavior events data.

Their initial plan was to get insights into their product using Tableau (a Business Intelligence reporting tool), hiring many analysts to build reports.

They quickly realized that this solution couldn’t scale. Without the ability to do self-service analytics, their Product and Marketing people couldn’t get their questions answered in a timely manner. They were experiencing delays that significantly impaired the speed of decision-making, and also the speed of adapting to the constant changes in Product and Marketing that are inherent to today’s enterprises.

The challenge: almost all self-service Product Analytics tools would have required AcmeTV to send billions of events daily to proprietary data silos. From the beginning, this was a no-go for two main reasons:

  • Sending billions of events daily would have carried an astronomical price tag for AcmeTV. And they were not open to the option of working just with sampled data, which doesn’t allow for the measurement of crucial heartbeat signals (for example, metrics calculated based on viewing seconds).
  • AcmeTV’s growth relies heavily on its content strategy and user engagement campaigns (such as push notifications). An enormous amount of data already exists in their Snowflake instance that can be used dynamically, joining it with user behavior events. Building and maintaining ETL jobs to sync this other business data to a third-party would be too daunting, error-prone, and expensive.

As discussed above, a common reaction for companies that are exploring Kubit’s warehouse-native Product Analytics is to say: let’s put everything on hold until we have a “proper data model” in place – typically 6-12 months down the line.

We explained to AcmeTV our philosophy that “right now” is always the right time to start deploying Product Analytics.

We backed this up by guaranteeing them an impactful POC with minimal effort on their part. On our side, we knew that we were up to the challenge, including the pressure of launching their subscription-based premium product.

The POC went live within a week. All that the AcmeTV team had to do was flip a switch in Segment, to send their event data into Snowflake. Yes, the data was raw, and the volume was huge–spread over hundreds of tables. This scenario would have been a nightmare for any BI tool. But Kubit managed all the complexity, and immediately delivered a perfectly functional self-service Product Analytics solution.

User Data To Cloud Data Warehouse and Kubit

AcmeTV sends their event data from their Customer Data Platform (Segment) to their cloud data warehouse (Snowflake). Kubit sits directly on the warehouse, making it easy and fast to enrich your event and customer data with any other data stored in the warehouse–including marketing and experimentation data.

As they launched a new platform, AcmeTV got critical insights, specifically into understanding the subscription funnel and how different content drives revenue and retention.

Over the course of the next several months, the solution evolved rapidly – guided by new insights into their data, enabled by Kubit. These included:

  • Catching discrepancies and identifying bugs across their many active platforms. Kubit provided various ways to abstract and remove some of the complexity from the Product team.
  • At instrumentation time, some critical properties weren’t available to the client. Kubit allowed them to join with other dimension tables to filter and slice-and-dice. There were no ETL or backfill jobs required. Everything happened dynamically.
  • Interestingly, many metrics or KPIs the AcmeTV team had defined before the platform launch were proven ineffective or inaccurate. Kubit helped them refine their requirements, and adjust their data strategy – first from instrumentation to measurement, and then decision-making.

At the end of the third month of using Kubit, AcmeTV had a clear grasp of their users’ engagement and retention, and they were able to build a very effective and mature content/push strategy based on those insights.

Throughout this process, many of their original assumptions were quickly proved or disproved; also, they were able to design follow-up experiments to better understand their users’ full lifecycle.

Kubit helped them to iteratively improve their understanding of their data, and utilize it to make data-driven decisions in a timely manner.

Perhaps most surprising of all: Kubit brought about all of these impacts on AcmeTV without any development work from their data engineering team. The Kubit platform can adapt to changes and make them live within days, all thanks to its warehouse-native architecture.

If AcmeTV had waited for their “better data model,” they would have completely missed their platform launch and some of their most critical product insights.

Now with most of AcmeTV’s use cases clearly defined and battle-tested in real life, both teams are working together to build an improved, cleaner data model for the future. Meanwhile, their raw schema still functions as the workhorse for ad-hoc analysis, instant troubleshooting, and hypothesis verification.

Where are you in your Product Analytics journey?

My team and I would love to hear your thoughts on the best practices I’ve outlined here, and learn about where you are on your Product Analytics journey–whether you’re just getting started, or considering updating your tools and workflows. Drop us a note at info@kubut.ai

More Product Analytics Articles

More Product Analytics Articles
Product Analytics
Meaningful Metrics for Product Analytics
Dive into the world of metrics in product analytics and understand their pivotal role in steering business strategies. Uncover the secrets of selecting impactful metrics, measuring them effectively, and evolving them with your business needs.

7 min to read

January 3, 2024

Travis Strickland

Travis Strickland

Read more
Product Analytics
Four Product Analytics Trends Worth Investigating
Product Analytics is a field that’s constantly evolving, and it’s important for companies to stay up-to-date on the latest trends and technologies in order to make informed decisions about their products. In this blog post, we’ll explore some of the latest trends we’ve been observing in both medium-sized and large companies.

5 minutes min to read

February 14, 2023

Alex Li

Alex Li

Founder / CEO

Read more