Bad data analysis questions I see every week (and how to fix them)


For the past few years at Narrator, I’ve been collecting the bad data questions I often stumble upon and reflecting on how to make them better. This simple exercise, done regularly for a few years, significantly improved the quality of questions I ask. Last week, I thought other data analysts might benefit from reading my reflections (and hopefully start doing their own!). Hence this post.

The format is pretty simple. I will provide four poor data analysis questions and then explain what makes them ineffective and how I would make them better. My hope is that as you read through them, you’ll experience at least a few “ah-ha” moments. This means you’re learning.

Interestingly, these questions don't come from junior data analysts fresh out of college who have never done this before. They come from all the teams I work with: big companies, small companies, big data teams, small data teams, C-level or not, etc. So many teams are guilty of this. And that's to be expected! Unless you've worked in data for a while, it's hard to see why these questions are bad. I'm hoping to shed some light on how to make them better.

1. Can we have a real-time map showing our potential leads by country?

What I do like about this question is the goal of identifying leads. As you’ll see later in the post, waaay too many data questions suffer from being aimless ventures into the unknown.

What I don’t like about this question is that it’s prescriptive in terms of the solution. A real-time map is one way to deliver what the stakeholder needs here, but I’m guessing there are many more of them, and the map isn’t necessarily the best one.

Unfortunately, this problem is very common. Stakeholders often ask for specific solutions/approaches (a real-time map! a lead scoring model! a clustering analysis!) just because they've heard of them, because that's what their peers are doing, or because they're subconsciously using the prescription to describe their actual need, e.g., being able to quickly identify top lead candidates without too much cognitive overhead.

This is not a good way to approach a data question, for two reasons. First, the goal of what you're trying to do should inform the approach you take, not the stakeholder's preference. Second, as an analyst/data person, you are far better equipped to decide which method will get you there than the stakeholder is. So when you end up in a situation like this, it's important to pause, think hard about whether the prescribed solution makes sense, and never conform to what's offered if you think there's a better way. This is data, not politics.

2. What do our most successful customers look like? What are their characteristics?

An unfortunate side effect of the “data science” hype is that it’s often portrayed as a magic solution to answer any question. Load in the data, get out an insight! From looking at the question above, it’s clear that whoever asked this is expecting magic. In reality, any good data project requires context and hypotheses about what is important. Only then can data science work its magic.

So if you get a question like this, know that the person asking doesn’t really understand what’s required to find a good insight, or they are just committing the common crime of “lazy hypothesis generation” and sending you off to boil the data ocean.

The problem with this approach isn't even the extra work involved, but that the analyst usually has far less context on the business than the stakeholder and will therefore produce hypotheses that are less relevant and more biased toward whatever data happens to be available. In most cases, the hypotheses a stakeholder comes up with themselves are far more likely to be notable and interesting. Your job as an analyst is then to use data to confirm or refute them, and that, ultimately, is the core of what an analyst does.

The second problem with this question is that "most successful" is super ambiguous and can mean different things for different roles. For a salesperson, "most successful" will likely mean "has the highest contract value/renewal rate," but for a customer success rep it can easily mean "asks the fewest questions," and those two often conflict with each other. So, unless "most successful" is defined at the company level, it's always good to have the stakeholder specify what they mean in advance. (This will also save time on back-and-forth trying to clear it up.)
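To make that ambiguity concrete, here's a toy sketch. Everything in it is invented for illustration (the data, the column names, and both definitions of "successful" are stand-ins for whatever your company actually tracks), but it shows how two perfectly reasonable definitions can point at different customers:

```python
import pandas as pd

# Hypothetical customer data; all columns are made-up stand-ins.
customers = pd.DataFrame({
    "customer":         ["A", "B", "C", "D"],
    "contract_value":   [120_000, 15_000, 80_000, 40_000],
    "renewed":          [True, True, False, True],
    "support_tickets":  [42, 2, 5, 31],
})

# A salesperson's definition: highest contract value among customers who renewed.
sales_view = (
    customers[customers["renewed"]]
    .sort_values("contract_value", ascending=False)
)

# A customer success rep's definition: fewest support tickets.
cs_view = customers.sort_values("support_tickets")

print("Sales' most successful customer:", sales_view.iloc[0]["customer"])  # -> A
print("CS' most successful customer:", cs_view.iloc[0]["customer"])        # -> B
```

Same data, two different "most successful" customers, and neither definition is wrong. That's exactly why the stakeholder needs to pick one before the analysis starts.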

3. What success metrics can we tie directly to the users who engage with chat?

There’s a looooot to unpack in this one.

First of all, people are always looking for ways to justify what they’ve done and want to find something in data that confirms that it wasn’t all for nothing. Confirmation bias sets in, and we’re no longer learning. We’re patting ourselves on the back as we go. Beware of this pattern.

The pressure to be "data-driven" can make this problem worse. When companies start to be data-driven, the push usually comes from leadership saying, "Hey, we need to start using data to measure ourselves!" Because no employee wants to come across as useless/ineffective/unsuccessful in front of their bosses, they all start besieging the data person, trying to find something that proves they're successful at what they do. I think it's an awful practice, but that's how many teams start using data. And because no one wants to be seen as a failure, the first efforts at introducing data usually produce "we're doing great!!" while the business is clearly sinking.

Every analyst will be in this position at some point in their career, and likely more than once. I certainly have. It's an interesting problem to push back when a team uses data in this weird prove-how-great-we're-doing way. It's definitely not a fun role to play; it feels a bit like policing. But that's how you get the most out of using data.

4. What are the profiles of customers most likely to upgrade?

This question is better than #2 because it has a concrete goal: upgrade. However, it has the same issue as question #2: “profile of customers” could mean virtually anything, so I’d push this one back to the stakeholder to come up with a better hypothesis. (See #2 for more on “lazy hypothesis generation.”)

Some people would argue against that. They'd say analysts are just there to answer stakeholders' questions and return data quickly. I don't think that's the best practice, but neither is the opposite school of thought, where the analyst is treated as a strategic partner expected to come up with those hypotheses entirely on their own. As with many dichotomies, the truth is somewhere in the middle. The most effective analysts are invested in the goal, but instead of generating hypotheses from their own bias and limited context (which will creep into the analysis anyway), they are really good at extracting that information out of the stakeholder's head.

When this kind of vagueness comes up in my work, I go through an exercise with the stakeholder to take their high-level ideas and break them down into specific pieces we can actually analyze with data. Take "which profile of customers likes our product more than others?" I'd ask them: Who do you think those customers are? Bigger companies? Smaller companies? And then: How do we know they like it? If you observed their behavior, what would indicate that they liked it versus not? Etc.
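If it helps to see where that exercise ends up, here's a minimal sketch. All of the column names, thresholds, and proxies are hypothetical, and would be whatever the stakeholder actually agrees to, but it shows how "likes our product" becomes a testable flag and how the original hypothesis ("is it bigger companies?") becomes something you can check:

```python
import pandas as pd

# Toy data; every column is a made-up proxy agreed on with the stakeholder.
customers = pd.DataFrame({
    "customer":           ["A", "B", "C", "D", "E"],
    "company_size":       ["large", "small", "large", "small", "small"],
    "weekly_active_days": [5.2, 1.1, 4.8, 3.9, 0.7],  # proxy for "uses it often"
    "features_adopted":   [7, 2, 6, 5, 1],            # proxy for "explores the product"
})

# One possible operationalization of "likes the product":
# active most of the week AND has adopted several features.
customers["likes_product"] = (
    (customers["weekly_active_days"] >= 3) & (customers["features_adopted"] >= 4)
)

# Now the stakeholder's hypothesis ("bigger companies like it more?") is testable:
# share of customers flagged as "likes_product" within each company size.
print(customers.groupby("company_size")["likes_product"].mean())
```

The code itself is trivial; the valuable part is everything the stakeholder had to articulate before it could be written.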


I hope this was useful! If it was, please drop me a line at brittany at narrator dot ai, and I’ll write some more :)


Check us out on the Data Engineering Podcast
