Jeremy S. Wu, Ph.D.

胡善庆博士


Lying with Big Data

  • Big Data
  • Statistics
  • Statistics 2.0

About 45 years ago, I spent a whopping $1.95 on a little book titled “How to Lie with Statistics.”

Besides the catchy title, its bright orange cover has a comic character sweeping numbers under a rug.  Darrell Huff, a magazine editor and a freelance writer, wrote the book in 1954.  It went on to become the most popular statistics book in the world for more than half a century.  A translated version was published in China around 2002.

It takes only a few leisurely hours to read the entire book of about 140 pages and 80 pictures, but it was a major reason why I pursued an education and a professional career in statistics.

The corners of the book are now worn; the pages have turned yellow.  One can identify some of the social changes of the last 60 years from the book.  For example, $25,000 is no longer an enviable annual salary; few of today’s younger generation may know what a “telegram” was; “gay” has a very different meaning now; and “African Americans” has replaced “Negroes” in daily usage.  Indicative of a bygone era, the image of a cigar, a cigarette, or a pipe appears in at least one out of every five pictures in the book – even babies are puffing away in high chairs.  The word “computer” does not show up once among its 26,000 words.

Huff’s words were simple, but sharp and direct.  He provided example after example showing that the most respected magazines and newspapers of his time lied with statistics, just like the dreadful “advertising man” and politician.

According to Huff, most humans have “a bias to favor, a point to prove, and an axe to grind.”  They tend to over- or under-state the truth in responding to surveys; those who complete surveys are systematically different from those who do not respond; and built-in partiality occurs in the wording of a questionnaire, appearance of an interviewer, or interpretation of the results.

In Huff’s time, there were no desktop computers or mobile devices; statistical charts and infographics were drawn by hand; data collection, especially complete counts like a census, was difficult and costly.  Huff conjectured, and the statistics profession has concurred, that the only reliable small sample is one that is random and representative, with all sources of bias removed.

Calling anyone a liar was harsh then, and it still is now.  The dictionary definition of a lie is a false statement made with deliberate intent to deceive.  Huff considered lying to include chicanery, distortion, manipulation, omission, and trickery; ignorance and incompetence were only excuses for not recognizing them as lies.  One may also lie by selectively using a mean, a median, or a mode to mislead readers, even though each of them is a legitimate kind of average.
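To make the point about averages concrete, here is a minimal sketch using Python's standard `statistics` module.  The salary figures and the helper name `averages` are made up for illustration; they are not taken from Huff's book.

```python
from statistics import mean, median, mode


def averages(values):
    """Return the three common 'averages' of a list of numbers."""
    return {
        "mean": mean(values),
        "median": median(values),
        "mode": mode(values),
    }


# Hypothetical payroll (invented numbers): five people at 20,000,
# three at 30,000, one at 50,000, and one owner at 260,000.
salaries = [20_000] * 5 + [30_000] * 3 + [50_000, 260_000]

result = averages(salaries)
print(result)
```

For this invented payroll the mean is 50,000, the median is 25,000, and the mode is 20,000.  A recruiter could quote the first and a union organizer the last, and both would be reporting a correct "average."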

No matter how broadly or narrowly lies may be defined, it cannot be denied that people do lie with statistics every day.  To some media’s credit, there are now fact-checkers who regularly examine stories or statements, most of them based on numbers, and evaluate their degree of truthfulness.

In the era of Big Data, lies occur at higher velocity, in bigger volume, and with greater variety.

Moore’s law is not a legal, physical, or natural law, but a loosely fitted regression equation on a logarithmic scale.  Each of us has probably won the Nigerian lottery or its variations via email at least a few times.  While measures of gross domestic product or pollution are becoming more accurate because of Big Data, nations liberally use their aggregate or per capita average, depending on which favors their point of view.
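As a rough illustration of what "a regression equation on a logarithmic scale" means, the sketch below fits a straight line to the base-2 logarithm of a count against the year.  The data are synthetic, generated to double every two years with a deliberate wobble, and are not real transistor counts; the function name `fit_line` is likewise just an assumption for this example.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Synthetic data (NOT real measurements): a count that doubles every
# two years, pushed 20% above or below the trend in alternating years.
years = list(range(1970, 1992, 2))
counts = [1000 * 2 ** k * (1.2 if k % 2 == 0 else 0.8)
          for k in range(len(years))]

# Regress log2(count) on year; the slope is "doublings per year".
slope, intercept = fit_line(years, [math.log2(c) for c in counts])
doubling_years = 1 / slope
print(f"estimated doubling time: {doubling_years:.2f} years")
```

The fitted line recovers a doubling time of about two years even though no single point sits on the line, which is all that a "law" of this kind claims: a trend, not a guarantee.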

Heavy mining of satellite, radar, audio messages, sensor, and other Big Data may one day solve the tragic mystery of Malaysia Airlines Flight MH370, but the many pure speculations, conspiracy theories, accusations of wrongdoing, and irresponsible lies quoting these data have mercilessly added anguish and misery to the families of the passengers and the crew.  No one seems to be tracking the velocity, volume and variety of the false positives that have been generated for this event, or for other data mining efforts with Big Data.

The responsibility is of course not on the data; it is on the people.  There is the old saying that “figures don’t lie, but liars figure.”  Big Data – in terms of advancing technology and availability of massive amounts of randomly and non-randomly collected electronic data – will undoubtedly expand the study of statistics and bring our understanding and governance to new heights.

Huff observed that “without writers who use the words with honesty and understanding and readers who know what they mean, the result can only be semantic nonsense.”  Today many statisticians are still using terms like “Type I error” and “Type II error” in promoting statistical understanding, while these concepts and underlying pitfalls are seldom mentioned in Big Data discussions.
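A Type I error is a false positive and a Type II error is a false negative.  The sketch below, with entirely hypothetical numbers, shows why they matter when mining a very large data set for a very rare event: even a screen that is right 99 percent of the time can bury the few true signals under thousands of false alarms.  The function name `expected_alerts` and its parameters are assumptions made for this illustration.

```python
def expected_alerts(n_records, n_true, sensitivity, specificity):
    """Expected outcome of screening n_records, of which n_true are real events.

    sensitivity = 1 - Type II error rate (share of real events flagged)
    specificity = 1 - Type I error rate (share of non-events left alone)
    """
    true_pos = n_true * sensitivity
    false_neg = n_true - true_pos                         # Type II errors
    false_pos = (n_records - n_true) * (1 - specificity)  # Type I errors
    precision = true_pos / (true_pos + false_pos)
    return {
        "true_pos": true_pos,
        "false_neg": false_neg,
        "false_pos": false_pos,
        "precision": precision,
    }


# Hypothetical scenario: one million records, only ten real events,
# and a screen that is 99% sensitive and 99% specific.
alerts = expected_alerts(1_000_000, 10, sensitivity=0.99, specificity=0.99)
print(alerts)
```

Under these assumptions roughly ten thousand records are flagged in error while about ten are flagged correctly, so fewer than one alert in a thousand is real.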

At the end of his book, Huff suggested that one can try to recognize sound and usable data in the wilderness of fraud by asking five questions: Who says so? How does he know? What’s missing? Did somebody change the subject? Does it make sense?  They are not perfect, but they are worth asking.  On the other hand, healthy skepticism should not become overzealous in discrediting truly sound and innovative findings.

Faced with the self-raised question of why he wrote the book, especially with a title and content that provide ideas for using statistics to deceive and swindle, Huff responded that “[t]he crooks already know these tricks; honest men must learn them in defense.”

How I wish there were a book about how to lie with Big Data now!  In the meantime, Huff’s book remains as enlightening as it was 45 years ago, although the price of the book has gone up to $5.98 and is almost matched by its shipping cost.

Jeremy S. Wu, Ph.D., jeremy.s.wu@gmail.com

Data Quality Lies Random Sampling
April 8, 2014 Jeremy


One thought on “Lying with Big Data”

  1. Anonymous says:
    April 9, 2014 at 1:12 am

    How to lie with Big Data?

    One should tell one Big lie and stick to it, even at the expense of looking ridiculous!
