Understanding Data is Like Watching Late Night TV
By Eric Farng – Technical Lead Data Scientist, YP Mobile Labs
People of a certain age will immediately recognize this image of “color bars” above.
It would appear on television late at night only after all the shows were off the air. It gave technicians a way to adjust a monitor's white balance and colors to match the producer's original intent (and served as a reminder to me to turn off the TV and sneak into bed before I got caught).
There’s something quite similar that is a part of the daily life of a data scientist and should be a part of the toolset for anyone working in digital media today, especially in our fast-paced, machine-driven, programmatic world.
Error bars, like their old TV counterpart, are a graphical representation that can help establish the variability of the data you receive from your vendors and media partners to better determine the level of uncertainty in a given measurement.
Some measures are not random, such as a budget or an impression target. Variable measures like click-through rate (CTR) or traffic volume, however, are most useful when their values can be reused in the future, and those values are far more reliable when they come with error estimates. You no longer need to fear the mathematical processes underlying complicated metrics. Deeper insights require more careful thought and statistical analysis, but a savvy marketer knows that understanding the real value of any finding requires an honest statement of confidence in the results. That confidence can be stated simply and graphically using error bars.
What exactly is an error bar? For each statistic, there is a true fixed value. However, we are usually only able to collect a subset of the data, so we can rarely calculate that true value directly. Statistics uses the subset we do have to estimate the true value and to construct a confidence interval around that estimate. A confidence interval is a range that contains the true value with a stated probability (say, 95%), and an error bar is simply its graphical representation.
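As a concrete sketch of the idea, the snippet below estimates a CTR from clicks and impressions and builds a normal-approximation (Wald) 95% confidence interval around it. The function name and the example counts are illustrative assumptions, not figures from the article, and the Wald interval itself assumes a large number of impressions.

```python
import math

def ctr_confidence_interval(clicks, impressions, z=1.96):
    """Estimate CTR and a ~95% confidence interval around it.

    z=1.96 corresponds to 95% confidence. This normal-approximation
    (Wald) interval is a sketch that assumes impressions are large.
    """
    p = clicks / impressions
    half_width = z * math.sqrt(p * (1 - p) / impressions)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)

# Hypothetical campaign: 120 clicks out of 10,000 impressions
ctr, low, high = ctr_confidence_interval(120, 10_000)
# ctr is 1.2%; low/high mark the ends of the error bar you would plot
```

The half-width of this interval is exactly the length of the error bar you would draw above and below the plotted CTR.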
Let’s take the example of CTR by day of week. The first graph below shows the calculated CTR, and the initial read would be that Sunday and Monday have a higher CTR.
When we see the plot above, we often assume the error bars are similar to the plot below. Here, we would say that Sunday and Monday have a higher CTR than the other days of the week. Statistically, we’d conclude that “It is unlikely that the difference in CTR is due to random chance.”
But in fact, the error bars might be quite different. In the next plot below, we actually can’t say anything about the difference in CTR across different days of the week. In this case, we say “there is not enough data to determine if CTR is actually different across days of the week.”
This is an important difference. In this last plot, any difference among the days is likely just random noise, and the conclusions would likely change if the experiment were run again. So any changes made to a campaign based on this analysis will likely not produce the desired change in CTR.
Whether you’re an account manager reviewing campaign results or a brand manager assessing the findings from your agency, knowing just how confident you can be in your data will leave you well-grounded and able to adjust expectations and future planning with confidence. Make sure your partners in data science, and frankly anyone providing reporting, are willing to be transparent enough to share their confidence intervals. Ask them to add error bars to any relevant data to provide a clearer picture of your campaign’s performance.
Error bars, like those old television color bars, will keep you calibrated, ensuring the bright spots are not dark spots in disguise.