Cutting out noisy data and leaving nothing behind. How do you manage the balance?

Question

I worked for a startup before the pandemic and the CEO was obsessed with collecting user data. We would end up collecting details that no one would need. 

How do you currently approach this? Do you collect everything or just what you believe you will need?

Mr Ethar Alali · Accepted Answer

OK. So I'm a mahoosive data advocate! I run on it. However, I wouldn't take the approach your former CEO advocated of capturing all data, for everything, because capturing and analysing data costs money. It's not automatically free. So your CEO was wasting money if that data was collected too early, or was never used.

The key thing is to think of data like solving a maze. Start from the end goal and work backwards. This allows you to:

1. Start with a much smaller set of data, because the utility function (e.g. how much profit you make, or how many users you help [per impression]) is the ultimate arbiter of your success. Nothing else.

2. Find the "first level" dependent variables by taking the utility and working back through your process just one level.

3. Do a correlation matrix here. It'll be small, but it tells you which of your dependent variables are the most important. A topical example of how to do that is here (https://medium.com/@Axelisys/cornoavirus-lessons-a-scratch-analysis-74c898cb1055)

4. Look at the feed-in factors to these dependent variables and run the campaign again to confirm the correlations hold, collecting only the data that matters for them. Focus especially on the ones whose correlation has an absolute value over 0.7 (so below -0.7 or above 0.7).

5. For each important one, find its dependent variables.

6. Rinse and repeat 3 to 5 until you get back to your impressions (e.g. from ads, social or whatever else).
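As a rough illustration of the correlation step above, here is a minimal Python sketch that scores candidate variables against the utility and labels them using the 0.7 and 0.3 thresholds from this answer. The variable names (profit, signups, page_colour), the synthetic data and the "review" label for the middle band are all assumptions made up for the example, not part of the original method.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def screen(utility, candidates, keep_at=0.7, drop_at=0.3):
    """Label each candidate by the absolute value of its correlation with the utility."""
    report = {}
    for name, values in candidates.items():
        r = pearson(values, utility)
        if abs(r) > keep_at:
            label = "keep"      # strong: trace its own feed-in factors next
        elif abs(r) <= drop_at:
            label = "drop"      # weak: probably not worth collecting
        else:
            label = "review"    # middle band: hypothetical label, a judgment call
        report[name] = (round(r, 2), label)
    return report

# Synthetic data (an assumption for illustration): profit is driven by signups,
# while page_colour is unrelated noise.
random.seed(0)
n = 1000
signups = [random.gauss(100, 20) for _ in range(n)]
page_colour = [random.random() for _ in range(n)]
profit = [2.0 * s + random.gauss(0, 10) for s in signups]

report = screen(profit, {"signups": signups, "page_colour": page_colour})
for name, (r, label) in report.items():
    print(name, r, label)
```

For the next level back, the same function could be called again with a "keep" variable (here signups) in place of the utility and its own feed-in factors as candidates.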

What this does is cut out pointless data and the collection of weakly correlated data (absolute value of 0.3 or below). Make sure you collect at least 400, and ideally 1,000, data points before making your decision (though the sample size you need will depend on the degrees of freedom you have).
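To get a feel for why sample size matters here, the sketch below estimates how tightly a measured correlation of 0.7 is pinned down at different sample sizes. It uses the Fisher z-transformation, a standard textbook approximation that assumes roughly normal data; it is not necessarily how the 400 and 1,000 figures above were arrived at, and the sample sizes tried are just examples.

```python
import math

def correlation_interval(r, n, z_crit=1.96):
    """Approximate 95% confidence interval for a Pearson correlation r from n points."""
    z = math.atanh(r)               # Fisher z-transform of r
    se = 1.0 / math.sqrt(n - 3)     # standard error of z
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

intervals = {n: correlation_interval(0.7, n) for n in (50, 400, 1000)}
for n, (low, high) in intervals.items():
    print(f"n={n}: a measured r of 0.7 is plausibly between {low:.2f} and {high:.2f}")
```

The interval narrows as n grows, so with only a few dozen points a correlation that looks like 0.7 may in truth sit well below the threshold.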

Hope that helps!

Roman Velitskiy · Answer

Generally speaking, there is no useless data; there is only a lack of resources to analyze it or of understanding of how to use it.

What you consider "details that no one would need" might become a starting point for an incremental or even a major product update. People usually only know for sure what they DON'T want. The more data you have, the cleaner the A/B testing you can conduct and the higher the chances you come up with an offer your customers didn't know they wanted :D

frank nipoz · Answer

Hi there! When it comes to data collection, I always prioritize collecting only the information that is necessary for the service or product to function effectively. I understand how frustrating it can be as a user to have your personal information collected unnecessarily. As for the Noisy Windsor Service, I haven't personally used it before, but I hope they take a similar approach to data collection.
