Big Data Statistics 2.0

2014 Workshop on Big Data and Urban Informatics

After more than a year of preparation, the Workshop on Big Data and Urban Informatics was held at the University of Illinois at Chicago on August 11-12, 2014.

More than 150 persons from at least 10 countries (Australia, Canada, China, Greece, Israel, Italy, Japan, Portugal, United Kingdom, and the U.S.) attended the forum sponsored by the National Science Foundation.  

Piyushimita (Vonu) Thakuriah, co-chair of the workshop, reported on the funding of the Urban Big Data Center at the University of Glasgow in Scotland.  Its mission is to “support research for improved understanding of urban challenges and to provide data, technology and services to manage, make policy, and innovate in cities.”  The Urban Big Data Center partners with five other universities, including the University of Illinois at Chicago.  Vonu, a transportation expert, is the director of the center.

Over two full days, a total of 68 excellent presentations were made, far exceeding the expectations of the organizers a year ago.  These papers will be posted on the web in the near future.  

Two luncheon keynote speakers highlighted the workshop.  

Carlo Ratti presented the state-of-the-art work of the MIT SENSEable City Lab, which specializes in the deployment of sensors and hand-held electronics to study the environment.  Since conventional measures of air quality tend to be collected at stationary locations, they do not always represent the exposure of a mobile individual.  In one project, titled “One Country, Two Lungs,” a team of human probes travelled between Shenzhen and Hong Kong to detect urban air pollution.  The video revealed the divisions in atmospheric quality and individual exposure between these two cities. 

Paul Waddell of the University of California at Berkeley presented his work on urban simulation and dynamic 3-D visualization of land use and transportation.  Some of his impressive images can be found online.  His videos and examples reminded me of their potential applicability for creating the “Three Districts and Four Lines” in China’s National Urbanization Plan.  I also learned about a somewhat similar set of products from a Geographic Information System software company based in Beijing. 

One of the 68 presentations described the use of smart card data to study commuting patterns and volumes in the Beijing subway during rush hours.  Another presentation compared the characteristics of big data and statistics and asked whether big data is a supplement to or a substitute for statistics. 

The issue of data quality was seldom volunteered in the sessions, but questions about it came up frequently.  Judging from the many terms used, such as editing, filtering, cleaning, scrubbing, imputing, curating, and re-structuring, it was clear that some presenters spent an enormous amount of time and effort just getting their data ready for very basic use.

Perhaps data quality is considered secondary in exploratory work.  However, there are good-quality big data and bad-quality big data.  When other options are available, spending too much time and effort on bad-quality big data seems unwise because it serves no practical future purpose.

Few presentations discussed the importance of data structure, whether built in by design or created through metadata.  Structured data contain far more potential information than unstructured data and permit more efficient information extraction, especially when they can be linked across multiple sources.  

For the purpose of governance, I was somewhat surprised that the use of administrative records had not yet caught on at this workshop.  Accessibility and confidentiality appeared to be barriers.  It would seem helpful for future workshops to include city administrators and public officials to help bridge the gap between research and the practical needs of day-to-day operations.  

Nations and cities share a common goal in urban planning and urban informatics: improving the quality of city life and the delivery of services to constituents and businesses alike.  At the same time, there are drastic differences in their current standing and approach.

China is experiencing the largest human migration in history.  It has established goals and direction for urban development, but has little reliable, quantitative research or experience to support and execute its plans.  The West is transitioning from its century-old urban living to a future that is filled with exciting creativity and energy, but does not seem to have as clear a vision or direction.

Confidentiality is an issue that contrasts sharply between China and the West.  The Chinese plans show a strong commitment to collecting and merging linkable individual records extensively.  If implemented successfully, this will generate an unprecedented amount of detailed information that can also be abused and misused.  The same approach would likely face much scrutiny and opposition in the West, which has to consider less reliable but more costly alternatives to meet the same needs. 

There is perhaps no absolute right or wrong approach to these issues.  The workshop and the international community being created offer a valuable opportunity to observe, discuss, and make comparisons in many globally common topics. 

Selected papers from the workshop will now undergo additional peer review.  They will be published in an edited volume titled “See Cities Through Big Data – Research, Methods and Applications in Urban Informatics.”


Crossing the Stream and Reaching the Sky

In the early stages of its economic reform, China chose to “cross a stream by feeling the rocks.”

Limited by the expertise and conditions of that time, when China had no statistical infrastructure to provide accurate and reliable measurements, the chosen path was the only option.

In fact, this path was traveled by many nations, including the U.S.  At the beginning of the 20th century, when the field of modern statistics had not yet taken shape, data were not believable or reliable even when they existed.  The well-known American writer and humorist Mark Twain famously lamented “lies, damned lies, and statistics,” pointing to the data-quality problems of the time.  Over the past hundred years, statistics developed an international common language and reliable data, establishing a long record of success with broad areas of application in the U.S.  This stage of statistics may be generally called Statistics 1.0.

Feeling the rocks may help one cross a stream, but it would be difficult to land on the moon that way, and more difficult still to create smart cities and an affluent society.  If one could scientifically measure the depth of the stream and build roads and bridges, trial and error might be unnecessary.

The long-term development of society must exit this transitional stage and enter a more scientifically based digital culture, where high-quality data and credible, reliable statistics serve to continuously enhance the efficiency, equity, and sustainability of national policies.  At the same time, specialized knowledge must be converted responsibly into practical, useful knowledge, serving the government, enterprises, and the people.

Today, technologies associated with Big Data are advancing rapidly.  A new opportunity has arrived to usher in the Statistics 2.0 era.

Simply stated, Statistics 2.0 elevates the role and technical level of descriptive statistics, extends the theories and methods of mathematical statistics to non-randomly collected data, and expands statistical thinking to include facing the future.

One may observe that in a digital society, from crossing a stream to reaching the sky, and from the governance of a nation to the daily life of ordinary people, what was once “unimaginable” is now “reality.”  Driverless cars, drone delivery of packages, and space travel are no longer science fiction.  Although the data they generate that can be analyzed in practical settings are still limited, they fall within the realistic vision of Statistics 2.0.

In terms of social development, the U.S. and China are actively trying to improve people’s livelihood, enhance governance, and improve the environment. A harmonious and prosperous world cannot be achieved without vibrant and sustainable economies in both China and the U.S., and peaceful, mutually beneficial collaborations between the nations.

Statistics 2.0 can and should play an extremely important role in this evolution.

The WeChat platform Statistics 2.0 will not clog already congested channels with low-quality or duplicative information.  Instead, it values new thinking: sharing a common interest in the study of Statistics 2.0, introducing state-of-the-art developments in the U.S. and China in a simple and timely manner, offering thoughts and discussion on classical issues, exploring innovative applications, and sharing the beauty of the science of data in theory and practice.

WeChat Platform: Statistics 2.0


Lying with Big Data

About 45 years ago, I spent a whopping $1.95 on a little book titled “How to Lie with Statistics.”

Besides the catchy title, its bright orange cover has a comic character sweeping numbers under a rug.  Darrell Huff, a magazine editor and a freelance writer, wrote the book in 1954.  It went on to become the most popular statistics book in the world for more than half a century.  A translated version was published in China around 2002.

It takes only a few hours to read the entire book of about 140 pages and 80 pictures at a leisurely pace, but it was a major reason why I pursued an education and a professional career in statistics.

The corners of the book are now worn; the pages have turned yellow.  One can identify some of the social changes in the last 60 years from the book.  For example, $25,000 is no longer an enviable annual salary; few of today’s younger generation may know what a “telegram” was; “gay” has a very different meaning now; and “African Americans” has replaced “Negroes” in daily usage.  As indicative of the bygone era, the image of a cigar, a cigarette, or a pipe appeared in at least one out of every five pictures in the book – even babies were puffing away in high chairs.  The word “computer” did not show up once among its 26,000 words.

Huff’s words were simple but sharp and direct.  He provided example after example of how the most respected magazines and newspapers of his time lied with statistics, just like the dreaded “advertising man” and politician.

According to Huff, most humans have “a bias to favor, a point to prove, and an axe to grind.”  They tend to over- or under-state the truth in responding to surveys; those who complete surveys are systematically different from those who do not respond; and built-in partiality occurs in the wording of a questionnaire, appearance of an interviewer, or interpretation of the results.

In Huff’s day, there were no desktop computers or mobile devices; statistical charts and infographics were drawn by hand; and data collection, especially complete counts such as a census, was difficult and costly.  Huff maintained, as the statistics profession has also concurred, that the only reliable small sample is one that is random and representative, with all sources of bias removed.
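The effect of a non-random, self-selected sample is easy to demonstrate with a small simulation.  The sketch below (illustrative Python; the population and the response rule are entirely made up) draws a random sample and a self-selected sample from the same population; only the random sample recovers the true mean.

```python
import random

random.seed(42)

# Hypothetical population: a measurement centered at 50
population = [random.gauss(50, 10) for _ in range(100_000)]
true_mean = sum(population) / len(population)

# A random, representative sample of 1,000
random_sample = random.sample(population, 1_000)
random_mean = sum(random_sample) / len(random_sample)

# A self-selected sample: assume people with higher values are
# less likely to respond (response probability falls as x rises)
biased_sample = [x for x in population
                 if random.random() < max(0.05, 1 - x / 100)]
biased_mean = sum(biased_sample) / len(biased_sample)
```

Under this assumed response rule, the self-selected sample understates the true mean by a couple of points, while the random sample of only 1,000 stays within a fraction of a point, which is exactly Huff's argument for random sampling.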

Calling anyone a liar was harsh then, and it still is now.  The dictionary defines a lie as a false statement made with deliberate intent to deceive.  Huff considered lying to include chicanery, distortion, manipulation, omission, and trickery; ignorance and incompetence were merely excuses for failing to recognize them as lies.  One may also lie by selectively using the mean, the median, or the mode to mislead readers, even though each of them is a correct average.
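Huff's “well-chosen average” is easy to reproduce.  In this made-up salary example (the figures are invented for illustration), all three averages are computed correctly, yet each tells a different story:

```python
from statistics import mean, median, mode

# Hypothetical annual salaries at a small firm, in dollars;
# one large salary at the top skews the distribution
salaries = [20_000, 20_000, 20_000, 40_000, 50_000, 60_000, 300_000]

print(mode(salaries))    # 20000: "the typical worker earns $20,000"
print(median(salaries))  # 40000: "half the staff earn $40,000 or more"
print(mean(salaries))    # ~72857: "the average salary is over $70,000"
```

All three claims are arithmetically true; which one is quoted depends on whether the speaker is the union, the newspaper, or the owner.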

No matter how broadly or narrowly lies may be defined, it cannot be denied that people do lie with statistics every day.  To some media’s credit, there are now fact-checkers who regularly examine stories or statements, most of them based on numbers, and evaluate their degree of truthfulness.

In the era of Big Data, lies occur in higher velocity with bigger volume and greater variety.

Moore’s law is not a legal, physical, or natural law, but a loosely fitted regression equation on a logarithmic scale.  Each of us has probably “won” the Nigerian lottery or one of its variations by email at least a few times.  While measures of gross domestic product or pollution are becoming more accurate because of Big Data, nations liberally cite either the aggregate or the per capita average, depending on which favors their point of view.
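The “loosely fitted regression” reading of Moore’s law can be made concrete: regress the base-2 logarithm of transistor counts on the year, and the reciprocal of the slope is the implied doubling time.  The sketch below uses a handful of approximate, publicly known transistor counts purely as illustrative data points:

```python
import math

# Approximate transistor counts for selected processors (illustrative figures)
data = {1971: 2_300, 1978: 29_000, 1985: 275_000,
        1993: 3_100_000, 2000: 42_000_000, 2008: 731_000_000}

years = list(data)
logs = [math.log2(n) for n in data.values()]

# Ordinary least squares of log2(count) on year
n = len(years)
xbar, ybar = sum(years) / n, sum(logs) / n
slope = (sum((x - xbar) * (y - ybar) for x, y in zip(years, logs))
         / sum((x - xbar) ** 2 for x in years))

doubling_time = 1 / slope  # years per doubling, roughly two
```

The fit is good but not exact, and the residuals matter: “doubling about every two years” is an empirical summary of scattered points, not a law that any particular chip must obey.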

Heavy mining of satellite, radar, audio messages, sensor, and other Big Data may one day solve the tragic mystery of Malaysian Flight MH370, but the many pure speculations, conspiracy theories, accusations of wrongdoing, and irresponsible lies quoting these data have mercilessly added anguish and misery to the families of the passengers and the crew.  No one seems to be tracking the velocity, volume and variety of the false positives that have been generated for this event, or other data mining efforts with Big Data.

The responsibility is of course not on the data; it is on the people.  There is the old saying that “figures don’t lie, but liars figure.”  Big Data – in terms of advancing technology and availability of some massive amount of randomly and non-randomly collected electronic data – will undoubtedly expand the study of statistics and bring our understanding and governance to new heights.

Huff observed that “without writers who use the words with honesty and understanding and readers who know what they mean, the result can only be semantic nonsense.”  Today many statisticians are still using terms like “Type I error” and “Type II error” in promoting statistical understanding, while these concepts and underlying pitfalls are seldom mentioned in Big Data discussions.

At the end of his book, Huff suggested that one can try to recognize sound and usable data in the wilderness of fraud by asking five questions: Who says so? How does he know? What’s missing? Did somebody change the subject? Does it make sense?  They are not perfect, but they are worth asking.  On the other hand, healthy skepticism should not become overzealous in discrediting truly sound and innovative findings.

Faced with the self-raised question of why he wrote the book, especially with the title and content that provides ideas to use statistics to deceive and swindle, Huff responded that “[t]he crooks already know these tricks; honest men must learn them in defense.”

How I wish there were a book about how to lie with Big Data now!  In the meantime, Huff’s book remains as enlightening as it was 45 years ago, although its price has risen to $5.98, almost matched by the shipping cost.

Jeremy S. Wu, Ph.D.