
Integrity vs Speed: The History of ‘Siloed’ Product Analytics

Historically, analytics has been conducted directly in a data warehouse. Consider traditional Business Intelligence (BI) tools like Tableau, Looker, and Power BI: data analysts typically create reports and charts in these tools to visualize insights that ultimately come from executing SQL queries in their own data warehouse. Control of the data remains entirely in the enterprise’s hands, though this approach requires dedicated and influential engineering and analytics teams.

With the exponential growth of digital products, from web to mobile applications, a different way of conducting analytics emerged, starting with Omniture (later Adobe Analytics) and Google Analytics. Because the ecosystem changed so quickly, few enterprises’ data teams could keep up with the constant requirement changes and new data from different vendors. It became widely accepted to sacrifice integrity for speed by embedding SDKs, sending the data to third-party silos, and relying on a black box for insights.

For a while, everyone was happy to rely on impressions, conversions, CPI/CPM, and other metrics from external analytics platforms to guide their marketing campaigns and product development. With the mobile era, the need for Continuous Product Design arose, along with a new breed of Growth Marketing people who rely on product insights to drive user acquisition, customer engagement, and content strategy. That’s when Mixpanel and Amplitude came into existence, providing self-service customer insights from their proprietary platforms so that teams could move fast and bypass data engineering and analytics teams.

Governance, Security, and Privacy: Rethink the Black Box

Fairly soon, the industry started to correct itself. Sharing customers’ private data, like device identifiers, with other vendors is no longer acceptable. Many enterprises now realize that it is impossible to have complete data governance, security, and privacy control if their sensitive data has been duplicated and stored in third parties’ data silos. How can they trust insights from a black box that can never be reconciled with their own data? Without a Single Source of Truth, there is no point in running fast when your insights don’t have the integrity to justify the decisions.

Let’s face it: why should anyone give up their data to third parties in the first place? With the modern data stack, especially the development of cloud data warehouses like Snowflake, BigQuery, and Databricks, the days of having to rely on external analytics silos are long gone. More and more enterprises have made data control their top priority. It is time to rethink product analytics: is it possible to explore customer insights with integrity and speed at the same time?

The Rise of Warehouse-native

This self-service demand and warehouse-native motion triggered a new generation of tools that provide SaaS product analytics directly from customers’ cloud data warehouses. These tools aim to balance integrity and speed, which should be the objective of any analytics platform.

What is the Warehouse-native Way?

Here are four characteristics to identify a true warehouse-native solution:

Tailored to your data model

A warehouse-native solution should continually adapt to the customer’s data model instead of forcing them to develop ETL jobs to transform their data. Beyond granting data access, the customer should have no engineering work to do; the vendor should handle the integration entirely.

This effortless integration is one of the most significant differences from the traditional data silo approach, which requires the customer to build and maintain heavy-duty ETL batch jobs that can take months to develop and still break frequently. One example is Amplitude, which claims to be warehouse-native; in reality, this only means that its application is “Snowflake Native” (running as containers), and customers must still transform their data into Amplitude’s schema.
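To make the idea concrete, here is a minimal sketch of what adapting to the customer’s data model could look like. It assumes a hypothetical mapping from logical analytics fields to columns the customer already has, and it uses an in-memory SQLite database as a stand-in for a cloud data warehouse. The table, column, and function names are invented for illustration and do not describe any vendor’s product.

```python
import sqlite3

# Hypothetical mapping from logical analytics fields to the customer's
# existing table and columns. Nothing is copied or reshaped.
MAPPING = {
    "table": "app_clicks",
    "user_id": "uid",
    "event_name": "action",
    "event_time": "ts",
}


def daily_active_users_sql(mapping):
    """Build a daily-active-users query against the customer's own schema."""
    # Names come from trusted configuration in this sketch, not user input.
    return (
        f"SELECT date({mapping['event_time']}) AS day, "
        f"COUNT(DISTINCT {mapping['user_id']}) AS dau "
        f"FROM {mapping['table']} "
        f"GROUP BY day ORDER BY day"
    )


# In-memory SQLite stands in for the customer's cloud warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE app_clicks (uid TEXT, action TEXT, ts TEXT)")
conn.executemany(
    "INSERT INTO app_clicks VALUES (?, ?, ?)",
    [
        ("u1", "open", "2024-01-01 09:00:00"),
        ("u2", "open", "2024-01-01 10:00:00"),
        ("u1", "buy", "2024-01-02 11:00:00"),
    ],
)

dau_rows = conn.execute(daily_active_users_sql(MAPPING)).fetchall()
print(dau_rows)
```

In this sketch, if the customer renames a column or points the mapping at a different table, only the mapping changes; there is no ETL job to rewrite.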

Data should never leave your control

This should be implied by the term ‘warehouse-native’. However, some solutions rely on warehouse syncing or mirroring to copy customers’ data into their own data silos. An admin UI may be provided to configure the data connection and eliminate the need for custom ETL jobs, but if you see words like “load,” “transform,” or “sync,” the system is essentially making copies of customers’ data in its silos.

Besides the loss of control, the biggest problem with data duplication is keeping the copies in line with changes to the customer’s data. Data quality issues and data model changes (e.g., a new attribute or a dimension table) are common and happen regularly, and each one triggers another round of backfilling, scrubbing, restating, and reprocessing.

Data syncing may reduce some engineering work, but it cannot deliver a Single Source of Truth or data integrity. It’s difficult to trust a black box without visibility into how insights are generated.

Complete transparency with SQL

One of the most prominent traits of a proper warehouse-native solution is that it provides customers with the SQL behind every report. Since the data lives in the customer’s warehouse anyway, there should be complete transparency on how the insights are computed. This level of transparency lets customers verify accuracy, reconcile the numbers against their own queries, and extend the work from the product analytics platform to more advanced internal development, like machine learning and predictive modeling.
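As a rough illustration of SQL transparency, the sketch below returns the exact query text alongside the report’s numbers, so the customer can rerun it and reconcile the result. The `Report` class, the `conversion_report` function, and the `events` table are hypothetical, and in-memory SQLite again stands in for the customer’s warehouse.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Report:
    sql: str    # the exact query that produced the rows
    rows: list  # the numbers shown in the report


def conversion_report(conn):
    """Count distinct viewers and buyers, exposing the SQL that was run."""
    sql = (
        "SELECT "
        "COUNT(DISTINCT CASE WHEN event = 'view' THEN user_id END) AS viewers, "
        "COUNT(DISTINCT CASE WHEN event = 'purchase' THEN user_id END) AS buyers "
        "FROM events"
    )
    return Report(sql=sql, rows=conn.execute(sql).fetchall())


# In-memory SQLite stands in for the customer's warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, event TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("u1", "view"), ("u1", "purchase"), ("u2", "view"), ("u3", "view")],
)

report = conversion_report(conn)
print(report.sql)
print(report.rows)

# Reconciliation: anyone with warehouse access can rerun the same SQL.
rerun_rows = conn.execute(report.sql).fetchall()
print(rerun_rows == report.rows)
```

The same query text could also be handed to an internal data science team as a starting point for feature engineering, which is the kind of extension described above.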

Dynamic configuration with exploratory insights

Because all reports come directly from the data in a customer’s warehouse leveraging SQL, every insight should be dynamically generated. There are several significant benefits:

Underlying data changes will immediately be reflected in analytics. There is no data to scrub, no cache to poke, and no vendor to wait for.
Raw data can be analyzed on the fly in an exploratory manner. Warehouse-native analytics supports virtual events, virtual properties, dynamic functions, and analysis of unstructured data (e.g., JSON or struct), which helps in hypothesis testing before committing to lengthy data engineering work.
Data model improvements can be iterative and incremental. When new attributes or dimensions are added, they automatically apply to historical data. There is no data backfill required because everything happens with dynamic joins. With multi-schema support, it is possible to have both raw and clean data schemas running in parallel to satisfy the speed and consistency requirements simultaneously.
Operational data can be incorporated without the need for ETL. All of the clickstream/behavior events, vendor data, and operational tables can be dynamically combined for analytics, all inside the customer’s data warehouse with no data movement required.
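The benefits above can be sketched in a few lines. In this hypothetical example, a virtual event is just a named SQL predicate over raw events, and a dimension table added after the fact applies to historical data through a join at query time, with no backfill. The table names, the `VIRTUAL_EVENTS` dictionary, and the `count_by_dimension` helper are assumptions made for illustration, with in-memory SQLite standing in for the warehouse.

```python
import sqlite3

# In-memory SQLite stands in for the customer's warehouse.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE raw_events (user_id TEXT, name TEXT, amount REAL, ts TEXT);
INSERT INTO raw_events VALUES
  ('u1', 'checkout', 120.0, '2024-01-01'),
  ('u2', 'checkout', 15.0,  '2024-01-02'),
  ('u1', 'checkout', 80.0,  '2024-01-03');
""")

# A "virtual event" is only a named predicate over raw data; defining one
# requires no new table and no reprocessing.
VIRTUAL_EVENTS = {
    "big_purchase": "e.name = 'checkout' AND e.amount >= 50",
}


def count_by_dimension(event, dim_table, dim_col):
    """Count a virtual event grouped by a dimension joined at query time."""
    # Names come from trusted configuration in this sketch, not user input.
    predicate = VIRTUAL_EVENTS[event]
    sql = (
        f"SELECT d.{dim_col}, COUNT(*) FROM raw_events e "
        f"JOIN {dim_table} d ON d.user_id = e.user_id "
        f"WHERE {predicate} "
        f"GROUP BY d.{dim_col} ORDER BY d.{dim_col}"
    )
    return conn.execute(sql).fetchall()


# A dimension table added later, e.g. an operational table of user plans.
conn.executescript("""
CREATE TABLE user_plan (user_id TEXT, plan TEXT);
INSERT INTO user_plan VALUES ('u1', 'pro'), ('u2', 'free');
""")

# The new dimension applies to events recorded before it existed.
by_plan = count_by_dimension("big_purchase", "user_plan", "plan")
print(by_plan)
```

Because the join runs against live tables each time, a correction to `user_plan` or to the raw events would show up in the next query without any scrubbing or cache invalidation.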

Summary

With its unique advantages and momentum in the market, enterprises will inevitably choose warehouse-native analytics to optimize their digital products and explore customer insights with integrity. In the meantime, it is vital to look past the marketing claims and find solutions that are genuinely warehouse-native. In upcoming posts, I will cover real-world use cases for applying true warehouse-native product analytics solutions to different teams and industries.


Unraveling The Truth About Warehouse-Native Product Analytics