How To Do Data Democratization With Flexible Data Modeling

Data democratization usually means simple dashboards. Here's how to safely give stakeholders the ability to ask any question

Photo by Carlos Arthur M.R / Unsplash

In its usual form, data democratization delivers data insights despite underlying data modeling limitations, but it is not sustainable long term if stakeholders are also expected to make new discoveries in the data. To deliver a versatile enough modeling layer, we have to get creative.

Why democratize data?

Data democratization means giving a broader group access to data assets. More conventionally, it means giving information to non-technical stakeholders in the form of dashboards or (sometimes) custom drag-and-drop visualization capabilities.

The idea is that business stakeholders will be able to answer their own questions on top of clean, accurate data.

Why Data Democratization Doesn't Work

In reality this doesn't quite happen. The quality of data needed to allow self-service is extremely high: data models need to be complete, make sense, be documented, and have useful metrics. This means that in most cases business stakeholders will either be limited to the nicely-cleaned subset of data, or will need to ask for help for each new question.

If those stakeholders are technical, they may request direct data warehouse access to build ad-hoc queries for more customization. An organization striving to pre-build as many analyses as possible will deny this request. But the question is, why? Wouldn’t more access mean less work for analysts?

Unfortunately, there isn't a clear place for technical stakeholders to query. The wider range of raw or minimally transformed data typically represents multiple sources of truth. Outside of the data team, it's not clear which one to use when.

Data democratization focuses on delivering answers instead of addressing a more widespread underlying problem: the lack of an accessible, flexible, and reliable source of truth for information.

No one can model every possible use case for data (nor should they). This means the analytics engineer implementing the models is a single bottleneck, preventing new data discovery from occurring.

The best way to avoid that bottleneck is to address the issue head-on. Create a clean source of truth that covers all data in the warehouse.

Building data as a source of truth

Photo by Markus Winkler / Unsplash

So you want to expose data (more than just pre-built dashboards) within your organization. How is that done successfully?

The data warehouse must be the single source of truth for any analysis. It should be defined enough to prevent metric divergence, but flexible enough to answer a variety of questions.

We’re not suggesting exposing all raw data from ELT pipelines to stakeholders. Most raw data is complex and messy, and it takes technical context to make sense of it as a whole. Uncleaned data should be queried only by analysts and data engineers, since they have both the technical and the business context to make the inferences that go into the final data models.

I have spent too many days navigating between two different analysts from two different teams trying to figure out why they have different numbers for a presentation. When quoting last month's revenue numbers to a senior exec, the last thing you want to do is have inconsistent slides compiled by different teams.

If data is perceived as inaccurate, data trust is lost and can be difficult to gain back. Losing data trust also inhibits an organization’s ability to make truly data-driven decisions. Similarly, if the data isn’t modeled in a flexible way, those using it are forced to do ad-hoc modeling they simply shouldn’t be doing. If the exposed data layer is well understood and well architected, stakeholders will be empowered, instead of being confused while patching half-baked solutions on top of an existing data modeling layer.

At this phase in the data lifecycle, exposing a well-architected, single-source-of-truth data layer will diversify the possible insights obtained, build data trust, and enable data-driven decision making.

So what does this flexible-but-defined and single-source-of-truth-but-versatile architecture look like?

Let’s talk time-series

Photo by Isaac Smith / Unsplash

Building out a scalable and well-tested data modeling layer is hard. Suggested definitions may miss edge cases or vary too much across functions, and the data layer should be adaptable enough to address both concerns. It also needs to be intuitive enough for people outside the traditional analytics team to actively use it and build reports.

The suggestion: generalizing data into a universal time-series structure.

Imagine that an entire business is summarized by a set of events that occur over time, with each event type having its own metadata.

For example, in an e-commerce world there could be events for a user signing up, viewing a product, carting a product, and checking out. The sign-up event could have user-level metadata such as email, age, address, etc. The shop events would presumably have product information such as some sort of product identifier, category, and other attributes.
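As a rough illustration, the e-commerce example could be represented as a single stream where every event shares the same few columns and carries its own metadata. This is only a sketch in Python: the `Event` class, its field names, the event type names, and the sample values are all assumptions made for illustration, not a prescribed schema.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict


@dataclass
class Event:
    """One row in a hypothetical universal event stream."""
    ts: datetime          # when the event happened
    user_id: str          # who did it
    event_type: str       # what happened, e.g. "signed_up"
    metadata: Dict[str, Any] = field(default_factory=dict)  # event-specific attributes


# Made-up sample data: every event type lives in the same structure.
events = [
    Event(datetime(2023, 1, 1, 9, 0), "u1", "signed_up",
          {"email": "a@example.com", "age": 34}),
    Event(datetime(2023, 1, 1, 9, 5), "u1", "viewed_product",
          {"product_id": "p10", "category": "shoes"}),
    Event(datetime(2023, 1, 1, 9, 7), "u1", "carted_product",
          {"product_id": "p10", "category": "shoes"}),
    Event(datetime(2023, 1, 1, 9, 9), "u1", "checked_out",
          {"product_id": "p10", "category": "shoes"}),
    Event(datetime(2023, 1, 2, 14, 0), "u2", "signed_up",
          {"email": "b@example.com", "age": 27}),
    Event(datetime(2023, 1, 2, 14, 3), "u2", "viewed_product",
          {"product_id": "p22", "category": "hats"}),
]


def events_of_type(stream, event_type):
    """Filter the stream to one event type; the same call works for any type."""
    return [e for e in stream if e.event_type == event_type]


views = events_of_type(events, "viewed_product")
checkouts = events_of_type(events, "checked_out")
```

Because sign-ups and shop events sit in one structure, the same filter works for either; only the keys inside `metadata` differ by event type.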

This approach is inherently a single source of truth: it allows all types of customer interactions to be compared all together, while each event type keeps its own metadata and attributes.

The versatility of the time-series approach creates a much larger sandbox for stakeholders to play in. Because all the data is in one place and in the same format, it can be reliably queried and related to itself. The breadth of aggregations that can be built reliably is dramatically larger.
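To make "related to itself" a little more concrete, here is a minimal Python sketch of how one generic function could answer several different questions against the same stream. The tuple layout, the event type names, the sample rows, and the `conversion_rate` helper are hypothetical and chosen only for illustration.

```python
from collections import Counter
from datetime import datetime

# Hypothetical event stream: (timestamp, user_id, event_type)
stream = [
    (datetime(2023, 1, 1, 9, 0), "u1", "signed_up"),
    (datetime(2023, 1, 1, 9, 5), "u1", "viewed_product"),
    (datetime(2023, 1, 1, 9, 9), "u1", "checked_out"),
    (datetime(2023, 1, 2, 14, 0), "u2", "signed_up"),
    (datetime(2023, 1, 2, 14, 3), "u2", "viewed_product"),
    (datetime(2023, 1, 3, 8, 0), "u3", "signed_up"),
]


def first_time(stream, event_type):
    """Map each user to the timestamp of their first event of the given type."""
    first = {}
    for ts, user_id, kind in sorted(stream):
        if kind == event_type and user_id not in first:
            first[user_id] = ts
    return first


def conversion_rate(stream, start_type, end_type):
    """Share of users with a start event who later had an end event."""
    starts = first_time(stream, start_type)
    if not starts:
        return 0.0
    converted = 0
    for user_id, start_ts in starts.items():
        # Relate the stream to itself: look for a later event by the same user.
        if any(uid == user_id and kind == end_type and ts > start_ts
               for ts, uid, kind in stream):
            converted += 1
    return converted / len(starts)


event_counts = Counter(kind for _, _, kind in stream)
signup_to_checkout = conversion_rate(stream, "signed_up", "checked_out")
view_to_checkout = conversion_rate(stream, "viewed_product", "checked_out")
```

The design point is that a new question (say, sign-up to first product view) only needs different arguments, not a new model, because every event type shares the same shape.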

Larger sandbox, more data discovery, more data-driven decision-making.

Final thoughts

If we haven’t convinced you already: no matter how you spin it, data modeling is hard.

You can never model every piece of information, and there will always be edge cases. Similarly, time-series data is notoriously hard to deal with in traditional SQL, because joining different pieces of information to get aggregated data isn’t very intuitive.

Converging on a way to expose more than pre-built dashboards to stakeholders is key to a company’s ability to deliver timely data insights and to innovate on those insights. At Narrator, we believe time-series is the most flexible approach to data modeling. We’ve built a user interface that turns what was unintuitive into (what we think is) a flexible data modeling layer.

Building a single-source-of-truth architecture that doesn’t confine data users to existing views will facilitate a data-driven culture. Although data teams do not traditionally output event-based systems, time-series event modeling may be exactly what’s needed for the level of flexibility required.

Check us out on the Data Engineering Podcast

Find it on the podcast page.