The Data Stack Show Podcast

Eric and Kostas of The Data Stack Show invited Ahmed onto their podcast earlier this month to discuss the modern data stack's Achilles heel. You won't want to miss this deep dive.

Thanks for having us! The link to listen to the whole episode and some highlights are below.

Why does the modern data stack suck?

I think that everyone who has implemented the stack more than once will tell you that it seems like the only way and it’s a necessary evil. So at the core, you have your data everywhere, and you dump it into a warehouse, and there are a lot of different warehouses. You can use different flavors with different benefits, and that’s been solved. Then you have your middle layer, which we call a transformation layer, where you actually use data and write SQL to represent the questions that you need to answer. That table gets materialized and put into your BI tool, a tool that allows you to build dashboards and visualize it.

And anyone who’s ever done this will tell you what happens… you build a dashboard, then there’s a follow-up question. And they go, "Cool, let’s go back to the data team, let’s build a new transformation. Let’s build a new materialized view. And let’s build a new dashboard."

And as time goes by, the number of transformations you have in the middle continues to grow. The amount of similar data spread across multiple transformations continues to grow. It actually gets so messy that you often have 700 or 800 transformations… And because of this entire cycle of constantly going back to build these new transformations, you end up having the time to answer a question going into weeks and months. Every new question goes into these complex 1000-line SQL queries.

I think the problem we need to solve is why this transformation layer causes all these kinds of roadblocks. And that’s the problem that I went out to really innovate in solving.

I don’t think it’s a problem of the tooling. It’s a problem of the approach.

The approach of building custom tables, to me, is like building a car where every piece has to be custom cast and molded. You need more of the world to have interchangeable parts, where different pieces can fit together really easily.

Today, you still need a 1000-line SQL query to answer a series of questions. And that doesn’t go away. Now, why does that happen? It's because data that isn’t separate is actually captured in separate systems that don’t relate. We need to figure out ways to stitch it together. How do you tie an email to an order? Because everybody wants to know email attribution to order.

The complexity of doing simple joins across all these systems ends up generating these really complex queries. And I think that’s where a lot of the data stack consistently fails.

SQL today requires you to join based on a key, and if that key doesn’t exist, you put a person on it to hack it together with a bunch of complexity. That is the problem. Now, the tool you’re using to manage your SQL doesn’t really matter if your SQL doesn’t solve this fundamental problem. And I think the core problem that we realized is that it is actually a join problem, because joins depend on foreign keys, and foreign keys don’t exist. So to solve it, you actually need to reinvent how you join.

How do you do that with only 11 columns?

So first, I think we like to say one table, because of the kind of like shock factor, it’s like 95% one table. There are ways you can add additional tables, but that’s not the core. The core single table that we’re going to discuss is known as the activity schema. It’s an open source project that kind of discusses this one table approach.

And this is really just kind of taking the way that we speak about data and bringing it to a way you can structure it. So it's just a time series table, where it has customer, time, action, and we just abstract three features. And you’re thinking, how can I just put everything I need in three features? I have so many features I need, I need like 100 features.
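To make the shape of that table concrete, here is a minimal Python sketch of an activity stream. The field names (`customer`, `ts`, `activity`, `feature_1` to `feature_3`) and the sample rows are illustrative assumptions, not the actual activity schema specification or Narrator's column names.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Activity:
    """One row of a hypothetical single activity table."""
    customer: str                    # who did it
    ts: datetime                     # when they did it
    activity: str                    # what they did
    feature_1: Optional[str] = None  # activity-specific attributes (assumed names)
    feature_2: Optional[str] = None
    feature_3: Optional[str] = None


# Made-up rows: data from different systems lands in the same shape.
stream = [
    Activity("ann@example.com", datetime(2021, 5, 1, 9, 20), "completed_order"),
    Activity("ann@example.com", datetime(2021, 5, 1, 9, 0), "opened_email", "spring_sale"),
    Activity("ann@example.com", datetime(2021, 5, 1, 9, 5), "visited_website", "/landing"),
]

# Because everything is one time series, a customer's journey is just a sort.
journey = [a.activity for a in sorted(stream, key=lambda a: a.ts)]
```

Each activity only carries the few attributes that belong to it (the campaign for an email, the path for a visit), which is why three generic feature slots can be enough in this sketch.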

And that’s where Dataset, the tool that Narrator provides, comes in - it's a way to pull in and "borrow" features from different activities.

So let’s take a simple example: Did that email turn into order?

I want to know every email, I want to know what the campaign of each email is, I want to know when that person did that, when that person came to our website, from what email, how many pages did they view, and I want to know what page they landed on.

That seems like we’re already talking about 10-50 features, right? But if you break it down into actions, you have the opened email action, which has one feature, which is the campaign. You have the visited website action, which has the path, which is also one feature. You have the startup page viewed action, which might have some features on it, but it’s just the fact that the customer viewed a page. And you have the activity that they completed an order. So now it’s four activities.

And all I’m doing is really pulling the data from each activity. So if I want to know whether they had an order between emails, I can pull the timestamp of that order, I can count how many pages they viewed between then and the first started session activity, etc. I pretty much like to think of it as this really long table, doing a very clever fancy pivot and pulling the columns I want from each of these activities. And when you do that, what it turns out is that if you actually represent your business as this really rich customer journey, you don’t need that many features per action, but you do have a lot of actions. And those actions are where all that nice rich information comes from.
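The "fancy pivot" can be sketched as a join on customer and time rather than on a foreign key. The Python below is only an illustration under assumptions: the activity names, the tuple layout, and the rule "everything before the customer's next email belongs to this email" are made up for the example and are not Narrator's actual implementation.

```python
from datetime import datetime

# Hypothetical rows: (customer, timestamp, activity, feature)
stream = [
    ("ann", datetime(2021, 5, 1, 9, 0), "opened_email", "spring_sale"),
    ("ann", datetime(2021, 5, 1, 9, 5), "viewed_page", "/landing"),
    ("ann", datetime(2021, 5, 1, 9, 7), "viewed_page", "/pricing"),
    ("ann", datetime(2021, 5, 1, 9, 20), "completed_order", None),
    ("ann", datetime(2021, 5, 8, 9, 0), "opened_email", "summer_sale"),
    ("ann", datetime(2021, 5, 8, 9, 3), "viewed_page", "/landing"),
]


def emails_with_outcomes(rows):
    """For each email, borrow columns from activities before the next email."""
    rows = sorted(rows, key=lambda r: (r[0], r[1]))
    result = []
    for i, (customer, ts, activity, campaign) in enumerate(rows):
        if activity != "opened_email":
            continue
        later = [r for r in rows[i + 1:] if r[0] == customer]
        # The window closes at this customer's next email, if there is one.
        next_email = next((r[1] for r in later if r[2] == "opened_email"), None)
        window = [r for r in later if next_email is None or r[1] < next_email]
        orders = [r[1] for r in window if r[2] == "completed_order"]
        result.append({
            "customer": customer,
            "campaign": campaign,
            "email_at": ts,
            "first_order_at": orders[0] if orders else None,
            "pages_viewed": sum(1 for r in window if r[2] == "viewed_page"),
        })
    return result


report = emails_with_outcomes(stream)
```

No email ID ever appears on the order: the relationship comes entirely from the customer and the ordering of timestamps, which is the point of joining without keys.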

And because time and counting are given to you by Narrator out of the box, you don't need to add so many features - they can all be recomputed on the spot, instantly, when you’re answering the question that you need.

How do we get to the point where we have, let’s say, well curated one table to rule them all?

Narrator provides a very, very thin layer that’s known as our transformation layer. And this is not like a dbt transformation layer, because you’re really just mapping columns. The layer is so thin that it averages around 12 minutes to write. You just kind of map it to our simple structure: per building block, you define each activity. Then Narrator migrates that data, does a bunch of caching, does a bunch of migrating that makes it really nice and easy and fast to use, and provides you with an interface to actually ask and answer these questions.

And the good thing about doing it with activities is that you only ever need to add a new building block when you have a new concept to add, not when you have a new question. So often in tables that you materialize in the modern data stack, every time there’s a new way of relating data you build a new table. Now with Narrator you don’t do that. You just build what’s called a dataset, and that’s added in the UI with a couple of clicks. Every time you have a new concept added to your company, then you add an activity. So you’re often doing these activity transformations in the first weeks, and then you add one every other month. It’s really rare that you’re adding a bunch of new activities. Instead, you’re taking the building blocks that you’ve already built, and you’re reassembling them to answer all sorts of questions.

How do you make this accessible to people who use different syntax and semantics for the same concept?

When you’re depending on dashboarding, you have to force everyone to buy into one definition: total sales has to be total sales, and total customers has to be total customers. What you see a lot in Narrator is that you can have multiple identifiers that get mapped to your global customer. We have companies that are ride sharing services, where a customer is a car. A company like WeWork has a customer that's a building. You can have different ways of defining customer, but that entity still has events - you might have a created lead activity, you might have a started opportunity, you might have a closed opportunity, you might have a signed contract, sent contracts, moved in, made payment, started subscription.

And the reason why that’s so important is when you deal with that argument - well, when is the sale? Is it when they sign? Is it when they move in? Is it when they pay the first invoice? Is it when they start their subscription? When is the sale? You don’t have to actually fight that battle anymore.

Instead, what you do with Narrator is you have this concept of a dataset, where you have these activities and you can represent them in different ways. And then when you go to create your KPI, which is the key performance indicator that Narrator allows you to create, you can choose very explicitly what that is. And when the user sees the KPI, they can always click into it and see the underlying dataset and say, "Oh, this says you used the timestamp of the first opportunity created." And because opportunity is now an activity, it’s just a lot easier to get that transparency.

And I think creating those three layers, whether it’s a company global KPI (which people are using to measure), a dataset (which is answering specific questions), or building blocks that represent real actions the customer is taking, just leaves very little space for ambiguity. A question that we don’t often get at Narrator is "But what does it actually mean?" Just click on the dataset and see exactly what that means.

It creates that separation so that the data team isn’t fighting. And if the company decides, "Actually, we’re redefining total sales to look at it based on when the first invoice is paid," then instead of "let me talk to data about that," the data is already modeled. You just have to choose that for your dataset. And that can be done without involving data at all. And everything will just cascade nicely because, again, your building blocks are what you’re modeling, not the final results. So you’re representing the world as these activities, and everything else happens in Narrator. You can build datasets to combine them, you can build KPIs, and you can change those things without ever thinking about going back to the data model.
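As a rough illustration of why redefining a KPI does not require remodeling, the Python sketch below computes "sales by month" from the same activity rows under two different definitions of a sale. The function name `sales_by_month`, the activity names, and the sample data are hypothetical; this is not how Narrator computes KPIs internally.

```python
from datetime import date

# Hypothetical activity rows: (customer, day, activity)
stream = [
    ("acme", date(2021, 1, 10), "signed_contract"),
    ("acme", date(2021, 2, 1), "moved_in"),
    ("acme", date(2021, 2, 15), "made_payment"),
    ("bolt", date(2021, 1, 25), "signed_contract"),
    ("bolt", date(2021, 3, 5), "made_payment"),
]


def sales_by_month(rows, sale_activity):
    """Count customers by the month of their first `sale_activity`."""
    first_seen = {}
    for customer, day, activity in sorted(rows, key=lambda r: r[1]):
        if activity == sale_activity and customer not in first_seen:
            first_seen[customer] = day
    counts = {}
    for day in first_seen.values():
        month = day.strftime("%Y-%m")
        counts[month] = counts.get(month, 0) + 1
    return counts


# Same building blocks, two definitions of "a sale".
by_signature = sales_by_month(stream, "signed_contract")
by_payment = sales_by_month(stream, "made_payment")
```

Switching the definition is a change of one argument, while the underlying activities stay exactly as they were modeled.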

What comes to your mind as potential challenges when you view the entire world as timestamped activities?

So the thing about the single table approach, and I’ll tell you the honest truth, it has two huge, huge, huge downsides. The first downside is that querying a single table can get really hard.

And the second thing is, if you notice, I’m doing something with every question you’re asking me. I’m actually translating your question to be a little bit more defined in this activity way. I’ve mastered this, but a lot of our customers take a couple of weeks to learn this new way of thinking.

You have to relearn the mental model of how to combine data using these temporal relationships we have. We actually find that most customers who come from a deep SQL background have a harder time learning our relationships than people who come from a marketing or product mindset. Because they’re used to thinking about things from this perspective, while SQL is often thought about from a table perspective.

So we solved the first problem with product, and we solved the second problem by just iterating. We often give customers examples, and we do a lot of documentation and blogging. We do a lot of automatic analysis generation, and we have a series of templates that help you see how to ask a question.

Narrator exists between different spaces in many ways in the data world. So who’s the user?

Everybody got into data to answer questions and make an impact by using data. That job used to be called a data analyst. Data analysts are people who used to take questions and ask good questions to derive answers. And whether you’re a product person operating as a data analyst, or you’re a data engineer answering a question, you’re operating as a data analyst. I think that the tool that we build is for people who want to answer questions, and those people are data analysts.

What we’ve seen in companies is really interesting - it turns out that the job of a data analyst kind of disappeared, and now we have like seven roles that do part of the data analyst job. What if you force everyone to be a data analyst?

Everyone at the end of the day is trying to answer questions. Narrator enables those data analysts, with limited SQL knowledge but with the ability to ask good questions, to do that work end to end. Create a dashboard, create the analysis, create the story, represent the data in a way to answer questions, and do that all in under 10

And I think the future of the world is going to be where everyone becomes a data analyst. That's what drives business value, the people helping make decisions.

Check us out on the Data Engineering Podcast

Find it on the podcast page or stream it below