Don’t Be Fooled: 17 Common Data Traps 📊
Learn how to spot and avoid the most common data pitfalls
Hello, data-driven and curious minds, welcome to the 22nd edition of DataPulse Weekly.
In the world of data, numbers don’t always tell the full story. Statistics, while powerful, can be twisted to mislead and misinform. From manipulated graphs to skewed averages, the ways data can be distorted are as varied as they are deceptive.
Today, we’ll dive into 17 common pitfalls where data can mislead, illustrating how easily we can be fooled—and how to guard against it.
1. Misusing Percentage Changes
Context: Percentage changes can exaggerate results. They may sound impressive, but without context, they can mislead, making small changes seem significant.
Example: A campaign manager claims a 200% increase in sales after a new initiative. But if sales rose from 10 to 30 units of a low-AOV (average order value) product, the absolute growth of just 20 units is far less impactful.
How to avoid: Always ask for baseline numbers to understand the real impact.
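A quick sketch of the example above, using the hypothetical 10-to-30-unit numbers, shows how the same change looks very different in percentage versus absolute terms:

```python
# Hypothetical figures from the example: sales rose from 10 to 30 units.
baseline = 10
new_value = 30

pct_change = (new_value - baseline) / baseline * 100
absolute_change = new_value - baseline

print(f"Percentage change: {pct_change:.0f}%")        # 200% sounds dramatic...
print(f"Absolute change:   {absolute_change} units")  # ...but it's only 20 units
```

Reporting both numbers side by side keeps the baseline visible.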
2. Cherry-Picking Data
Context: Cherry-picking involves selectively presenting only the data that supports a specific argument while ignoring contradictory data, creating a skewed picture. It is a tactic many find tempting.
Example: A customer experience team highlights a 95% CSAT from top-performing agents but ignores feedback from customers who had poor experiences with less experienced agents.
How to avoid: Request data that represents the entire population to get a true understanding.
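Here is a minimal sketch of the CSAT example with made-up scores: the subset average looks great, but the population average tells a different story.

```python
# Hypothetical CSAT scores (out of 100) per agent group -- illustrative only.
csat_by_group = {
    "top_agents":    [96, 95, 94],
    "junior_agents": [70, 65, 72],
}

# Cherry-picked: average only the top performers.
top = csat_by_group["top_agents"]
cherry_picked = sum(top) / len(top)

# Honest: average across every agent group.
all_scores = [s for scores in csat_by_group.values() for s in scores]
overall = sum(all_scores) / len(all_scores)

print(f"Cherry-picked CSAT: {cherry_picked:.1f}")  # 95.0
print(f"Overall CSAT:       {overall:.1f}")        # 82.0
```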
3. Manipulating Timeframes
Context: Selecting specific time periods that show favorable trends can distort the overall picture.
Example: A government highlights a drop in unemployment rates during a quarter when seasonal employment was at its peak, downplaying rising rates in other periods.
How to avoid: Ensure timeframes represent a full, accurate period, not just a selectively chosen window.
4. Truncating Graph Scales
Context: Graphs can be manipulated by adjusting axes or truncating scales to exaggerate or downplay trends.
Example: A web agency uses a graph whose y-axis starts at 3 seconds to make a small decrease in page load time appear significant.
How to avoid: Check graph axes and scales for accuracy, and seek the full data range.
5. Misinterpreting Correlation as Causation
Context: Correlation shows a relationship between two variables, but it doesn’t prove causation.
Example: A study finds that countries with higher chocolate consumption have more Nobel Prize winners. The real link might be wealth and education, not chocolate consumption.
How to avoid: Consider other factors that could explain the correlation before drawing conclusions. Remember: correlation doesn't imply causation.
6. Poor Quality Data
Context: Poor data quality—due to errors in collection or processing—leads to unreliable results.
Example: A data pipeline failure results in incomplete data, causing analysts to overestimate user engagement or misjudge a feature's effectiveness.
How to avoid: Regularly validate and clean data to reduce errors.
7. Small Sample Sizes
Context: Small samples can lead to misleading results, as they may not represent the broader population.
Example: A survey of 10 people shows that 80% prefer a new product, but a larger sample might reveal a different preference.
How to avoid: Use a sample size large enough to represent the population.
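A small simulation (hypothetical setup: the true preference rate is 50%) shows how often a sample of just 10 people can produce an "80% prefer it" headline by pure chance, while a larger sample almost never does:

```python
import random

random.seed(42)

# Assumption for this sketch: the true preference rate is 50%.
TRUE_RATE = 0.5

def sample_preference(n: int) -> float:
    """Fraction of n simulated respondents who prefer the product."""
    return sum(random.random() < TRUE_RATE for _ in range(n)) / n

trials = 5_000
# How often does each sample size show >=80% preference just by luck?
extreme_small = sum(sample_preference(10) >= 0.8 for _ in range(trials)) / trials
extreme_large = sum(sample_preference(200) >= 0.8 for _ in range(trials)) / trials

print(f"P(>=80% in a sample of 10):  {extreme_small:.3f}")
print(f"P(>=80% in a sample of 200): {extreme_large:.3f}")
```

With 10 respondents the fluke happens in roughly 5% of surveys; with 200 it essentially never does.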
8. Misleading Averages
Context: Averages can be deceptive, especially when outliers skew the data.
Example: A social media manager considers an influencer with an average of 600k views per video. However, a few viral hits could inflate this average; the median might reveal much lower typical views.
How to avoid: Request median and view distribution data for a clearer picture.
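With hypothetical view counts matching the example's 600k average, the mean-versus-median gap is easy to see:

```python
import statistics

# Hypothetical view counts for the influencer's last 10 videos (illustrative):
# eight typical videos plus two viral outliers.
views = [80_000, 90_000, 75_000, 85_000, 95_000,
         70_000, 88_000, 92_000, 2_500_000, 2_825_000]

mean_views = statistics.mean(views)
median_views = statistics.median(views)

print(f"Mean:   {mean_views:,.0f}")    # 600,000 -- inflated by two viral hits
print(f"Median: {median_views:,.0f}")  # 89,000  -- the typical video
```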
9. Average of Averages
Context: Calculating the average of averages can be misleading, especially when group sizes differ.
Example: Two offices: New York with 300 employees averaging $80,000, and San Francisco with 30 employees averaging $120,000. Averaging these gives $100,000, but a weighted average shows the true average is $83,636.
How to avoid: Use weighted averages when combining data from groups of different sizes.
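The two-office example above can be checked directly: dividing total payroll by total headcount gives the weighted average.

```python
# Office headcounts and average salaries from the example above.
offices = {
    "New York":      {"employees": 300, "avg_salary": 80_000},
    "San Francisco": {"employees": 30,  "avg_salary": 120_000},
}

# Naive "average of averages" -- ignores that the groups differ in size.
naive = sum(o["avg_salary"] for o in offices.values()) / len(offices)

# Weighted average -- total payroll divided by total headcount.
total_pay = sum(o["employees"] * o["avg_salary"] for o in offices.values())
total_emp = sum(o["employees"] for o in offices.values())
weighted = total_pay / total_emp

print(f"Average of averages: ${naive:,.0f}")     # $100,000
print(f"Weighted average:    ${weighted:,.0f}")  # $83,636
```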
10. Machine Learning Model Stats: Accuracy
Context: Accuracy can be misleading in imbalanced datasets.
Example: A model predicts no loan defaults with 99% accuracy because only 1% of loans default, but it fails to identify actual defaults.
How to avoid: Use metrics like precision, recall, and F1 score to evaluate model performance.
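A tiny sketch of the loan example (hypothetical counts: 1,000 loans, 10 defaults) shows how a do-nothing model scores 99% accuracy while its recall on defaults is zero:

```python
# Hypothetical loan dataset: 1,000 loans, only 10 defaults (1%).
actual = [1] * 10 + [0] * 990   # 1 = default, 0 = repaid
predicted = [0] * 1000          # model always predicts "no default"

# Confusion-matrix counts.
tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))
tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))

accuracy = (tp + tn) / len(actual)
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(f"Accuracy: {accuracy:.0%}")  # 99% -- looks great
print(f"Recall:   {recall:.0%}")    # 0%  -- catches no defaults at all
```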
11. Improper Test & Control Groups
Context: Improper test and control groups can lead to misleading results.
Example: A company tests a new feature with loyal, long-term customers while the control group includes a mix of new and returning users. The feature appears to perform well, but the results may be skewed by the test group’s familiarity, not the feature’s effectiveness.
How to avoid: Ensure test and control groups are properly randomized and matched.
12. Ignoring the Margin of Error
Context: Ignoring the margin of error can lead to overconfidence in close results.
Example: A poll shows Candidate A leading by 2 points, but with a margin of error of ±3%, the race is a statistical tie.
How to avoid: Consider the margin of error, especially when differences are close.
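Plugging the poll example into the standard 95% margin-of-error formula for a proportion (normal approximation; the sample size of 1,000 is an assumption for this sketch):

```python
import math

# Hypothetical poll: Candidate A at 51% among n = 1,000 respondents.
p = 0.51
n = 1_000

# 95% margin of error for a proportion: 1.96 * sqrt(p(1-p)/n).
moe = 1.96 * math.sqrt(p * (1 - p) / n)

lead = 0.02  # the reported 2-point lead
print(f"Margin of error: ±{moe:.1%}")
print(f"Lead within the margin of error (statistical tie): {lead < moe}")
```

A 2-point lead with a roughly ±3.1% margin of error is indistinguishable from a tie.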
13. Data Dredging (P-Hacking)
Context: Data dredging, or p-hacking, involves searching through data to find statistically significant patterns that are actually due to chance, leading to misleading results.
Example: A research team tests multiple hypotheses on a dataset until they find a correlation with a p-value below 0.05, presenting it as significant. However, this result may be purely coincidental.
How to avoid: Use pre-registered hypotheses and adjust for multiple comparisons to ensure findings are genuinely significant.
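A minimal simulation illustrates the mechanism: under a true null hypothesis, p-values are uniformly distributed, so testing enough hypotheses all but guarantees a "significant" fluke. (The uniform draws here stand in for real tests on pure noise.)

```python
import random

random.seed(0)

# Under the null hypothesis, p-values are uniform on [0, 1], so each test
# has a 5% chance of dipping below 0.05 by pure luck.
n_hypotheses = 20
p_values = [random.random() for _ in range(n_hypotheses)]

significant = [p for p in p_values if p < 0.05]

# Analytically: the chance of at least one false positive across 20 tests.
chance_any = 1 - (1 - 0.05) ** n_hypotheses

# Bonferroni correction: demand p < 0.05 / n_hypotheses instead.
still_significant = [p for p in p_values if p < 0.05 / n_hypotheses]

print(f"'Significant' results from {n_hypotheses} null tests: {len(significant)}")
print(f"Chance of at least one false positive: {chance_any:.0%}")  # 64%
print(f"Surviving Bonferroni correction: {len(still_significant)}")
```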
14. Misleading Visualizations
Context: Visualizations can mislead by exaggerating trends or hiding details using colors, shapes, or selective data.
Example: A revenue chart uses bright colors and exaggerated shapes to overemphasize minor trends, misleading stakeholders on performance.
How to avoid: Use simple, accurate designs without embellishments, focusing on clear, truthful representation.
15. Survivorship Bias
Context: Survivorship bias occurs when analysis focuses only on subjects that survived a process, ignoring those that didn’t, leading to overly optimistic conclusions.
Example: During WWII, planes with visible damage were analyzed, but Abraham Wald suggested reinforcing the undamaged areas, as planes hit there didn’t survive—a classic case of survivorship bias. Explore more: Survivorship Bias.
How to avoid: Include all relevant data, including failures and dropouts, to get a complete view.
16. Simpson's Paradox
Context: Simpson’s Paradox occurs when a trend that appears in different groups reverses when the groups are combined, leading to misleading conclusions.
Example: This is a complex phenomenon best understood with a detailed example. We’ve covered this in-depth in one of our previous articles. You can explore it further here: Simpson’s Paradox in Action.
How to avoid: Always analyze data within subgroups before combining them to uncover hidden patterns and avoid misleading conclusions.
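A small made-up A/B test (hypothetical numbers, not from the linked article) shows the reversal concretely: variant B wins in both traffic segments, yet loses once the segments are pooled, because the variants were exposed to very different mixes of users.

```python
# Hypothetical A/B conversion data split by traffic source (illustrative).
# Format: variant -> (conversions, visitors).
data = {
    "mobile":  {"A": (20, 200),  "B": (120, 1000)},
    "desktop": {"A": (190, 800), "B": (30, 100)},
}

def rate(conv, total):
    return conv / total

# Within each segment, B beats A.
for segment, variants in data.items():
    ra, rb = rate(*variants["A"]), rate(*variants["B"])
    print(f"{segment}: A={ra:.1%}  B={rb:.1%}  (B wins: {rb > ra})")

# Combine the segments and the conclusion flips: A beats B overall.
a_conv = sum(v["A"][0] for v in data.values())
a_tot = sum(v["A"][1] for v in data.values())
b_conv = sum(v["B"][0] for v in data.values())
b_tot = sum(v["B"][1] for v in data.values())
print(f"overall: A={rate(a_conv, a_tot):.1%}  B={rate(b_conv, b_tot):.1%}")
```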
17. Sampling Biases
Context: Sampling biases occur when the sample selected for a study isn’t representative of the broader population, leading to skewed results and inaccurate conclusions.
Example: A survey on exercise habits conducted only at a local gym might overrepresent active individuals, not reflecting the general population’s habits.
How to avoid: Use the right sampling method to ensure your sample is diverse and representative of the entire population.
Conclusion
Data is a powerful asset, but it can easily mislead. By recognizing common pitfalls like misused percentages, misleading averages, and sampling biases, you can avoid falling for skewed narratives. Always approach data critically and dig deeper to get the full picture. Accurate data-driven decisions rely on careful interpretation.
Let us know in the comments if you've fallen victim to any of these pitfalls recently.
That wraps up our newsletter for today! If you found it valuable, please drop a like ❤️
Stay curious and connected!
Until next Tuesday!