Lying with Statistics

One of the most influential books on statistics is How to Lie with Statistics by Darrell Huff (1954). Although the book is dated, it shows the main ways numbers can be used to trick the human mind while teaching important statistical concepts along the way. It gives the layperson an overview of how statistics can be made deceitful, intentionally and sometimes unintentionally. Here I outline some of the most egregious lies from the book and demonstrate the effect these tactics can have on the analysis of variance (ANOVA) test.

One way to lie with statistics is to use flawed data. The first chapter, aptly named “The Sample with Built-In Bias” (Huff, 1954), covers how easily sampling bias creeps in. The issue is difficult because the error occurs while the data is being collected, which makes it hard to detect afterward. A proper random sample is difficult to obtain because it requires that every subject in the population has an equal chance of being in the sample.

Most surveys and experiments assume that subjects answer truthfully. In many cases, though, subjects give an answer that pleases the interviewer or makes themselves look better. For example, in a survey of university graduates’ salaries, unseen factors can affect the answers. Some people understate what they earn, perhaps for tax reasons, while others think they should be earning more and inflate their real salary. Data observed directly is more trustworthy than data reported by the subject. In addition, the sample is inherently biased if we consider how such a survey is run. Finding out a graduate’s salary many years after graduation requires sending a questionnaire by mail, telephone, or email. This limits the sample to those for whom the university has an address, phone number, or email address, which skews it toward graduates who are more likely to be affluent and have publicly listed addresses.

One thing we can do to reduce sampling bias is to use stratified random sampling. Here we divide the population into groups and sample from each group in proportion to its known prevalence. Thanks to computers, we can come closer to picking a truly random sample: a computer generates a random selection of people within each group to send questionnaires to, so the proportions in the sample more closely match those in the population.
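As a rough illustration of proportional stratified sampling, here is a minimal Python sketch. The population, the group labels, and the function name stratified_sample are all hypothetical and chosen only for this example; a real survey would also need a complete and accurate list of the population to draw from.

```python
import random
from collections import defaultdict

def stratified_sample(population, stratum_of, sample_size, seed=0):
    """Draw a random sample whose group proportions match the population's.

    stratum_of is a function that returns the group label for one person.
    Rounding means the total can differ slightly from sample_size when the
    proportions do not divide evenly.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for person in population:
        strata[stratum_of(person)].append(person)

    sample = []
    for members in strata.values():
        # Give each group a share of the sample equal to its share of the population.
        k = round(sample_size * len(members) / len(population))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: 600 arts graduates and 400 engineering graduates.
population = [("arts", i) for i in range(600)] + [("engineering", i) for i in range(400)]
sample = stratified_sample(population, lambda person: person[0], sample_size=50)

arts_count = sum(1 for group, _ in sample if group == "arts")
engineering_count = len(sample) - arts_count
print(arts_count, engineering_count)  # 30 20
```

The sample keeps the 60/40 split of the population, while the choice of individuals inside each group is left to the random number generator.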

Another way to lie with statistics is to conflate types of averages. Someone who is statistically savvy should be skeptical when an unqualified “average” is used to communicate meaning. The mean, median, and mode are all averages, and none of them is the “right” one in every case. The arithmetic mean is the sum of the values divided by how many there are, the median is the middle value once the values are sorted, and the mode is the value that appears most frequently.
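To see how far apart the three averages can be, consider a small made-up salary list with one very high earner. The numbers below are invented for illustration and do not come from the book.

```python
import statistics

# Hypothetical annual salaries, in thousands of dollars.
salaries = [30, 30, 35, 40, 45, 50, 400]

mean_salary = statistics.mean(salaries)      # sum of the values divided by the count
median_salary = statistics.median(salaries)  # middle value of the sorted list
mode_salary = statistics.mode(salaries)      # most frequent value

print(mean_salary, median_salary, mode_salary)  # 90 40 30
```

All three numbers can honestly be called the “average” salary, yet the mean is three times the mode because a single large value pulls it upward.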

Huff warns that small samples are too sensitive to random chance and can easily be manipulated to support almost any claim. A healthy skepticism toward marketing claims, many of them false, means finding out how the studies were conducted, which may reveal experiments run on very small samples. In addition, most of the data described by an “average” lies within a range of numbers, not at a single point. To highlight this, Huff describes a home builder who builds homes for the average-sized family of 3.6 people. Unfortunately, building 3-person and 4-person homes proved inefficient even though they were tailored to the average household size. It turns out the majority of households had either 2 people or 5 or more people. The mistake can be attributed to focusing too intensely on the mean while ignoring the median and mode.
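The sensitivity of small samples to chance can be made concrete with a fair coin. The sketch below computes the exact binomial probability of seeing at least 80% heads purely by luck; the thresholds and the function name prob_at_least are assumptions picked for this example.

```python
from math import comb

def prob_at_least(heads, flips, p=0.5):
    """Exact probability of at least `heads` heads in `flips` independent flips."""
    return sum(
        comb(flips, k) * p**k * (1 - p) ** (flips - k)
        for k in range(heads, flips + 1)
    )

p_small = prob_at_least(8, 10)    # 80% heads in a sample of 10
p_large = prob_at_least(80, 100)  # 80% heads in a sample of 100

print(f"{p_small:.4f}")  # 0.0547
print(p_large < 1e-8)    # True
```

With only 10 flips, a lopsided result turns up in about 5.5% of trials, so someone willing to repeat a tiny study a few dozen times can expect to find an impressive-looking outcome. With 100 flips the same result is so unlikely that chance is no longer a plausible explanation.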

Many statistical fallacies, according to Huff, are rooted in the improper visual display of information. One form is manipulating graph axes to favor whatever perspective you want to portray. When one wants to show a more dramatic increase, the vertical axis may start not at 0 but at a higher number, which narrows the range shown. For example, if we plot a company’s monthly sales on an axis that shows only 18 to 24 billion dollars instead of 0 to 24 billion, the sales line appears to rise more sharply. Another visual deception tactic is to draw visuals out of scale. In a bar graph of salaries in different countries at $30k and $60k, one bar should be exactly twice the height of the other. Using an inaccurate scale to represent the difference between values is an easy way to trick the human mind.
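The effect of a truncated axis can be put into numbers by comparing how tall two points are drawn relative to the bottom of the chart. In this sketch the sales figures of 20 and 22 billion dollars are hypothetical, as is the helper name drawn_height_ratio; only the axis ranges come from the example above.

```python
def drawn_height_ratio(low, high, axis_start=0.0):
    """How many times taller `high` is drawn than `low` when the axis starts at axis_start."""
    return (high - axis_start) / (low - axis_start)

# Hypothetical monthly sales, in billions of dollars.
first_month, last_month = 20.0, 22.0

honest_ratio = drawn_height_ratio(first_month, last_month)                      # axis from 0
truncated_ratio = drawn_height_ratio(first_month, last_month, axis_start=18.0)  # axis from 18

print(round(honest_ratio, 2), truncated_ratio)  # 1.1 2.0
```

A real increase of 10% is drawn as a doubling once the axis starts at 18, which is exactly the impression the truncated chart is meant to give.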

Applying the methods examined in How to Lie with Statistics to our course on Statistics and Experimental Design is straightforward. Many of the graphical forms of deceit are made harder by graphing software such as Excel and Tableau. These tools make it difficult to intentionally draw something out of scale, but a solid understanding of common graphing mistakes is still necessary.

One statistical test covered in this course that can be distorted by a tactic described in the book is ANOVA, the analysis of variance test. At its root, the test compares the means of two or more groups. Any statistical test can be inaccurate because of built-in bias, as Chapter 1 suggests, but there are further problems we can analyze. The biggest way to lie with an ANOVA test would be the improper use of the average. ANOVA calls for the arithmetic mean, but when the test is done by hand it is easy to compute the median or mode in its place. Even though the mean, median, and mode are often close together, this substitution can drastically change the results of the test.
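The following sketch computes a one-way ANOVA F statistic by hand, first with the mean and then with the median wrongly swapped in. The three groups are invented data, and the center parameter is a device added for this illustration only; substituting the median does not produce a valid ANOVA.

```python
import statistics

def f_statistic(groups, center=statistics.mean):
    """One-way ANOVA F statistic: between-group variance over within-group variance.

    The correct test uses the mean as the center. Passing another function
    (an assumption made here for demonstration) shows how the result shifts.
    """
    all_values = [x for group in groups for x in group]
    grand_center = center(all_values)
    k = len(groups)      # number of groups
    n = len(all_values)  # total number of observations

    ss_between = sum(len(g) * (center(g) - grand_center) ** 2 for g in groups)
    ss_within = sum((x - center(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical measurements; the third group contains one extreme value.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 13]]

f_correct = f_statistic(groups)                           # uses the mean
f_wrong = f_statistic(groups, center=statistics.median)   # median swapped in by mistake

print(round(f_correct, 2), round(f_wrong, 2))  # 4.43 1.67
```

On this data the F statistic falls from about 4.43 to about 1.67 when the median replaces the mean, even though the two averages agree exactly in the first two groups.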

In conclusion, How to Lie with Statistics gives plain-language descriptions of how biased samples, misleading averages, and disproportionate charts can affect how people interpret statistics. Whether the claim is that one toothpaste brand performs 23% better than another or that drinking 7 cups of coffee is correlated with a longer life span, we need to be more skeptical of how people present numbers.

 
