Big Data and Statistics 2.0

21st Century Statistical Systems

A combination of traditional censuses and the introduction of random surveys served to measure and make inferences about populations and economies in the 20th century.  Both statistical approaches have been important in supporting decision- and policy-making worldwide, as well as informing the public.  The 21st century has begun with a massive conversion to digital data and the explosive growth of Big Data around the globe, which in turn stimulates an insatiable demand for ever more timely and comprehensive responses to information needs.  Since conventional approaches to censuses and surveys are static and cross-sectional, they will not be able to meet these expanding dynamic requirements without fundamental changes.  In the 21st century the defining characteristics of statistical systems and methods will be the sophisticated application of massive longitudinal data, integration of multiple data sources, and rapid and simple delivery of results, while still strictly protecting confidentiality and data security and assuring accuracy and reliability.  Leaders of the statistical agencies of several nations, including the United States, have recently reiterated these needs and trends.  Government agencies that can successfully overcome these issues will help their nations enjoy unique advantages in global competition; otherwise, they face certain obsolescence.  As a rapidly growing economic power whose statistics are receiving more attention and having greater impact in the world, China faces many of the same challenges.  This article identifies some of the success stories emerging in the U.S. and other nations and discusses the needed changes in statistical paradigms to meet the challenges of dynamic, integrated data systems for the 21st century.

The 20th Century Statistical Systems

Taking a census, i.e., collecting data from every entity in a target population, has been the traditional statistical method for measuring the profile and characteristics of a population for centuries.  China reported its first population census more than 2,200 years ago [1].  By the Western Han Dynasty around 2 A.D., available records [2,3,4,5] placed China’s population at almost 58 million in over 12 million households.  The People’s Republic of China enacted its first laws on the governance of statistics in 1983 [6].  China has taken six population censuses since 1949, and one every ten years since 1990 [7].  Continuing for more than two centuries as required by its constitution, the United States (U.S.) conducted its 23rd and most recent decennial national population census in 2010 [8,9].

Myriad other topical censuses, such as those on the economy, industry, and agriculture, are also commonly conducted in the U.S., China, and other nations.  For example, the U.S. conducts an economic census of business activities every five years.  The next economic census is scheduled to start in 2012 [10].  The 2007 economic census covered 24 million businesses in the non-farm private economy, accounting for about 96% of the U.S. Gross Domestic Product [11].  China conducted its last economic census in 2008 [12].  Although each census may have different legal origins or motivations, the ultimate purpose is similar – to provide relevant, current, and reliable information for research, analysis, and ultimately decision- and policy-making.

While the census has demonstrated its importance for many centuries, it has several well-known practical shortcomings.  Most of all, human activities are continuous and dynamic over time, but a census can only provide a comprehensive snapshot on a designated census day or a defined period of time.  Census results typically become outdated as soon as they are released.

Dynamic human behavior and social, economic and political phenomena cannot be fully captured by a census taken at a single point in time.  The operation of a national census is typically so complex that multiple years are needed to design and collect data.  More time is then spent to process, analyze, and report the results.  The cost of a national census has become so prohibitively high that it is usually supplemented by smaller random surveys to provide more frequent results.

After more than a decade of design, development, and testing, the American Community Survey (ACS) [13] began to implement “continuous measurement” of the characteristics of the U.S. population and housing in 2005.  About 3 million addresses are sampled per year (or 250,000 addresses per month) over a 5-year rolling cycle.  The ACS produces estimates by aggregating data collected in the monthly surveys over a period of time, summarizing them annually on a calendar-year basis.  For local geographies with small populations, ACS estimates may take up to five years to aggregate and become reportable [14].
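The period-estimation idea behind such designs can be sketched with synthetic data (the values and sample sizes below are invented for illustration, not actual ACS figures): pooling more months of samples trades timeliness for precision, which is why small areas must wait for the full multi-year window.

```python
import random

random.seed(0)

# Hypothetical monthly samples: each month yields 250 sampled values
# (say, household incomes drawn around $50,000). Five years = 60 months.
monthly_samples = [[random.gauss(50_000, 10_000) for _ in range(250)]
                   for _ in range(60)]

def period_estimate(samples):
    """Pool the monthly samples and estimate the mean from the pooled data."""
    pooled = [x for month in samples for x in month]
    return sum(pooled) / len(pooled)

one_year = period_estimate(monthly_samples[-12:])  # most recent 12 months
five_year = period_estimate(monthly_samples)       # full 5-year rolling window
# The 5-year estimate pools 60 x 250 = 15,000 observations, so it is far
# more precise than any single-year estimate -- at the cost of currency.
```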

China reported its total population and structural changes for 2011 based on the National Sample Survey on Population Changes, which uses a stratified, multi-stage, cluster, probability-proportional-to-size sampling design.  Nearly 1.5 million persons in 31 provinces, 2,133 counties, 4,420 townships, and 4,800 villages were reportedly interviewed for the most recent update [15].

Random sampling is a relatively new concept, introduced by the director of the Norwegian Statistics Bureau to the International Statistical Institute (ISI) in 1895 [16].  The international statistical community spent more than 30 years debating its merit before deciding that random sampling is an acceptable and sound scientific practice.  During this period, theories and practices of today’s mathematical statistics developed and grew to support the sampling approach.

The first Department of Statistics in an arts and sciences college in the U.S. was established at the George Washington University in 1935 [17].  Academia would become the primary training ground for future statisticians.  According to the U.S. Census Bureau, statistical sampling methods were first used in its 1937 test survey of unemployment, partly in response to the need for more timely information about the scope of unemployment during the Great Depression [18].  Governments would become the primary employer of future statisticians.

Supported by new theories and tested by applications in many fields, combined with the introduction of commercial computers in the 1950s and subsequent desktop computing, random surveys soon became the standard statistical practice to collect data and perform statistical analyses for making informed decisions.   The foundation for today’s statistical systems was built primarily from computing technologies developed in the 1970s, before the commercialization of the Internet ushered in the advanced information age in the 1990s.

By the end of the 20th century, statistical systems including census and survey data were not only core governmental operations, but also the analytical foundation for market research, political predictions, agricultural and economic development planning, environmental management, public health, transportation planning, physical sciences, and other human and societal activities.  However, data must be collected according to statistical designs, including the application of probability principles, before they can be used for statistical inference.  Large-scale statistical analyses were typically conducted by statisticians and subject-matter experts in either government or academia.

21st Century Information Needs and Trends

The first decade of the 21st century was marked by the rapid conversion of data from analog to digital, as well as its quick acceptance by a rapidly growing population of Internet users, most of whom are not statistical experts in academia or government.

According to a study by the University of Southern California [19,20], digital storage surpassed non-digital for the first time in 2002, and at least 94 percent of all information on the planet was in digital form by 2007.

Visualized: One Zettabyte [21]

The capacity to create and store digital data reportedly exceeded one zettabyte (1 ZB or 10^21 bytes) for the first time in 2010 [22,23], compared to about 0.29 ZB in 2007 and 0.00002 ZB in 1986 [19,20]. An industry executive declared that “(e)very two days now we (human beings) create as much information as we did from the dawn of civilization up until 2003” [24]. To illustrate the relative magnitude, the entire human genome [25], containing 3 billion chemical bases along the chromosomes of an individual, can be captured in about 3 gigabytes (3 GB or 0.000000000003 ZB) of computer storage, relatively modest by today’s standards. In contrast, the Alpha Magnetic Spectrometer [26] records cosmic ray data at about 1 GB per second.
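The relative magnitudes quoted above are simple unit arithmetic, sketched here assuming decimal SI prefixes:

```python
# Decimal SI magnitudes: 1 GB = 10**9 bytes, 1 ZB = 10**21 bytes.
GB = 10**9
ZB = 10**21

genome_bytes = 3 * GB             # ~3 billion bases at roughly 1 byte each
genome_in_zb = genome_bytes / ZB  # = 3e-12 ZB, the figure cited above

# At a sustained ~1 GB/s (the spectrometer's rate), filling one zettabyte
# would take 10**12 seconds -- on the order of 30,000 years.
seconds_per_zb = ZB / GB
years_per_zb = seconds_per_zb / (365.25 * 24 * 3600)
```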

In practical terms, it means that paper records are becoming obsolete, the private sector is also generating large amounts of data, and billions of data consumers are not necessarily specialists.

Complete sets of data are easily captured into electronic files for direct machine processing and computation, without the need or consideration for sampling. The speed of this enormous change was concurrently matched by the spread of electronic data beyond political and geographical boundaries. Access to and use of information technology is now pervasive, if not commonplace, in developed nations as well as in less developed nations.  No matter where a computer is located in the world, it can be accessed as long as it is connected to the Internet.

Big Data is a new and loosely defined term for large electronic datasets that may or may not be collected according to the structure and probability principles specified in the traditional statistical systems.  Administrative records, social media, barcode and radio frequency scanners, transportation sensors, energy and environmental monitors, online transactions, streaming videos, and satellite images have all contributed to the explosive growth of Big Data.  Most of these Big Data are not structured for conventional statistical analyses and inferences, nor are they simple or easy to use initially with current software and statistical systems.   However, some contain important information that has not been available before for decision- and policy-making, especially when they are appropriately integrated into government data sources.

The private sector has led the way in generating Big Data, integrating them with government statistics, and developing data mining techniques and methods to identify potential consumers, expand markets, test new products, and extract information for market and consumer research.  In some cases, they may even challenge traditional government functions.  For example, certain search terms [27] in the social media may be good indicators of flu activity and can perform as well as the indicators produced by the public health agencies, if not actually better in terms of reduced lag time. 

Despite its diminished share in the ocean of available data, government statistics remain uniquely important in support of an increasingly global economy and expanding social needs of each nation.  However, in an era when search engines produce millions of results in seconds and international stock market data are reported in almost real-time around the clock, taking years and even months to collect, process, and release static results for limited coverage of geography, industry, or demographics is rapidly losing relevance.

Most nations, even developed ones, are facing severe budgetary constraints.  The high costs and limited return of the current approach preclude the introduction of new censuses and surveys, or any major expansion of the current census and sampling programs.  Declining response rates worldwide compound the problem.  For example, despite intense planning and effort, the participation rate for the 2010 decennial census in the U.S. barely matched the 74 percent achieved in 2000 [28].  Follow-up personal interviews would increase the average census cost to $56 per household [29], about 100 times the original mailing cost.

In fact, the U.S. House of Representatives voted in May 2012 to terminate the American Community Survey, citing both confidentiality and budgetary concerns.  It is also uncertain now whether the 2012 economic census will be conducted in the U.S. as originally planned.

The challenge to the national statistical agencies is real and daunting: the 20th century statistical systems can no longer adequately meet the needs of the 21st century.  Consumers of government statistics are rapidly increasing in number and breadth.  They require more comprehensive, dynamic, and timely data that can be accessed and understood easily, but the resources and development time required by the existing methods are simply not available or affordable.  Governments are still expected to provide statistics that are accurate and reliable, while strictly protecting the confidentiality of the responding entities.

Without meeting these requirements, the Australian Bureau of Statistics is not confident that it “will remain at the heart of official information for societies” [30].  The arrival of the Big Data era, along with the growing user requirements, is inevitable, and yet many governments and their statistical agencies remain unprepared to make the best use of Big Data.

Fundamental change in the historical census and survey paradigm will be necessary to meet these challenges of the 21st century.  Small evolutionary steps to tweak and tinker at the edges of the current statistical systems, built on knowledge and technologies grounded in the 1970s, will simply not be adequate for the Big Data Revolution.

Characteristics of the 21st Century Statistical Systems

The defining characteristics of the 21st century statistical systems will be the sophisticated application of massive longitudinal data, integration of multiple data sources, and rapid and simple delivery of results, while still strictly protecting confidentiality and data security and assuring accuracy and reliability.

Longitudinal data refer to repeated observations of the same entity (such as a worker, a student, a household, a business, a school, or a hospital) over time.  They provide unique measures of a baseline and of change at the individual or business level.  These measures are not tracked in conventional cross-sectional studies, which collect data from multiple subjects at a single point in time.
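A minimal sketch of the distinction, using invented records: each observation is keyed by entity and period, and the within-entity change is computable only because the same entities are observed repeatedly.

```python
# Longitudinal data: repeated observations of the same entities over time,
# keyed by (entity_id, period). Hypothetical quarterly earnings records.
records = [
    ("worker_1", "2011Q4", 9_500),
    ("worker_1", "2012Q1", 10_000),
    ("worker_2", "2011Q4", 12_000),
    ("worker_2", "2012Q1", 11_500),
]

def change_by_entity(rows, start, end):
    """Within-entity change between two periods -- the quantity a
    cross-sectional snapshot of either period alone cannot provide."""
    by_period = {}
    for entity, period, value in rows:
        by_period.setdefault(period, {})[entity] = value
    return {e: by_period[end][e] - by_period[start][e]
            for e in by_period[start] if e in by_period[end]}

changes = change_by_entity(records, "2011Q4", "2012Q1")
# changes == {"worker_1": 500, "worker_2": -500}
```

A cross-sectional average of either quarter would show little movement; only the linked records reveal that one worker gained while the other lost.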

In particular, longitudinal administrative records are potential data sources for developing a comprehensive statistical system.  Statistics Canada defines administrative records simply as “data collected for the purpose of carrying out various non-statistical programs” [31].  Examples include birth and death certificates, customs declarations, marriage and driver’s licenses, individual and business tax filings, unemployment insurance payments, social security records, and medical prescriptions.  Massive amounts of longitudinal administrative records accumulate in everyday transactions:

  • A new business must complete forms to register before it can start operation.  Reports are produced to pay salaries and taxes on a regular basis.  Additional paperwork must be completed if loans are made or if there are mergers and acquisitions.  Corporations must file applications before their shares can be publicly traded.
  • A student must fill out forms to enter a school.  He or she must register to enroll in classes.  Individual grades and test scores are recorded.  A transcript is needed to move from one school to another.  A diploma or degree is issued when a student graduates.
  • Similarly, there are records for each person’s visit to a doctor’s office or admittance to a hospital, vital health signs that are measured during the visit, symptoms of illness, and amount and type of medical prescriptions.

Under proper design and automation, the cost of linking electronic data records is only a fraction of that of labor-intensive survey or census data collection.  There is also no additional burden on the respondents because the administrative records already exist.  Once such a system is established, the need to collect individual demographic data such as gender, date of birth, race, and ethnicity will be greatly reduced because these attributes either do not change or change in predictable ways.

The potential of integrating administrative records into statistical systems and substituting them for a population census was discussed and debated vigorously during the last two decades of the 20th century [e.g., 32,33,34,35,36].  Pioneered by Denmark in 1981, at least 20 of the 27 European Union nations now use population registers, or a combination of population registers and the traditional census, to count their populations [37].

Although longitudinal studies have been used quite extensively in clinical trials for many years, their integration and applications in other areas have been sparse and limited due largely to complex design, high cost of processing and data storage, difficulty in understanding and accessing the data, and concerns about protecting confidentiality.

In a recent blog post by the Director of the U.S. Census Bureau on a summit meeting of the leaders of the government statistical agencies of Australia, Canada, New Zealand, the United Kingdom, and the U.S. [38], consensus and a shared vision for the 21st century official statistical systems were reported.  Among the elements of that vision:

“Blending together multiple available data sources (administrative and other records) with traditional surveys and censuses (using paper, internet, telephone, face-to-face interviewing) to create high quality, timely statistics that tell a coherent story of economic, social and environmental progress must become a major focus of central government statistical agencies.”

Government statistical agencies must continue to create and maintain frames for conducting censuses or surveys, but by making additional, optimal use of available data sources.  Such frames have been static and minimal in content in the past.  In the 21st century, these frames must become dynamic in structure and rich in content, capable of serving as the first source of comprehensive, top-quality, and timely statistics produced regularly and on demand, integrating new relevant data elements and sources as they are identified and introduced.  These dynamic national frames will include both statistical and geographical data for companion mapping and reporting, and will serve in a secondary role as a traditional frame for censuses or surveys where needed.

“Telling a coherent story” is part of the needed change in paradigm for statistical agencies in the 21st century.  Statistical professionals have long relegated descriptive statistics to a role secondary or supplemental to statistical inference.  Yet modern data visualization methods and applications that extract information dynamically from complex data are in every way a valuable statistical practice in the Big Data era.  When government and academic experts are no longer the only or even dominant data suppliers and data analysts, ease of understanding, access, and use must also be an integral part of rapid delivery of results.

The assembly and maintenance of a comprehensive, dynamic statistical system requires massive amounts of sensitive personal and business data.  However, the end results must be statistical summaries that are devoid of any possibility of identification or re-identification of the original entities.  Individuals and enterprises should rightly be concerned and informed about the protection of their confidentiality against any misuse and abuse of their data.  The integrity and security of the infrastructure data, as well as the output statistics, must also be strictly safeguarded against intentional or malicious tampering and alteration.

Emerging Success Stories

Several countries have initiated public integrated longitudinal data programs on employment, education, and public health.  These initiatives are at various stages of development, and they provide encouraging news about the feasibility of creating and maintaining comprehensive dynamic statistical systems in the Age of Big Data, although many challenges remain.

An international symposium was held in 1998, featuring research using integrated employee-employer data from more than 20 nations [36,39].  The U.S. Census Bureau started the Longitudinal Employer-Household Dynamics program later that year to create new, innovative statistical products by linking existing employer-employee data [40].

Today the U.S. federal government and each of the 54 state, city, and territorial governments have agreed to secure the continuing supply of unemployment insurance wage records for workers and employers from the states on a quarterly basis.  The U.S. Census Bureau updates and maintains a longitudinal national frame of jobs going as far back as 1990.  Each job connects a worker with an employer, and a worker can have multiple jobs.  This data infrastructure was designed to track and refresh the employment status and pay for each of the over 140 million workers and more than 10 million employers (including the self-employed) every 3 months, while still strictly protecting the confidentiality of each entity by legal, policy, physical and methodological means.

The longitudinal data infrastructure has stimulated the development of creative, practical online applications using the new data, such as time series indicators to describe the underlying dynamics of the U.S. workforce at unprecedented levels of demographic and geographic detail [41].  In addition, an innovative mapping and reporting application allows a user to select any geographical area online to produce worker profiles and potential commuting reports [42], as well as almost real-time assessments of the potential impacts of hurricanes and other natural disasters in emergency situations [43].  The application was presented as an innovative statistical product in the United Nations Statistical Commission [44] and received the gold medal from the U.S. Department of Commerce, the highest form of recognition for scientific accomplishments in the department.     

The Data Quality Campaign (DQC) [45] was launched in 2005 to empower stakeholders of the U.S. education system, including students, parents, teachers, and policy-makers, with “high quality data from the early childhood, K-12, postsecondary, and workforce systems to make decisions that ensure every student graduates high school prepared for success in college and the workplace.”  To achieve this vision, “DQC supports state policymakers and other key leaders to promote the development and effective use of statewide longitudinal data systems.”

The U.S. Departments of Education and Labor concurrently issued competitive grants to states for the construction and integration of these comprehensive statewide longitudinal data systems.  In the words of DQC, “we can no longer afford to not use data in education” to make informed decisions.   In particular, DQC identified “10 Essential Elements of Statewide Longitudinal Data Systems” and “10 State Actions to Support Effective Data Use” as roadmaps for state policymakers.  Status and progress of each state have been tracked by annual surveys since 2005.

The Health Information Technology for Economic and Clinical Health Act of 2009 [46] established the goal of widespread adoption and meaningful use of electronic health records by 2014 in the U.S.  Belgium [47] reported on the Belgian Longitudinal Health Information System initiative in 2011, broadly defining health-related data as “all personal data that concerns past, current or future states of the physical or mental health of the person.”  The research focused on the longitudinal approach to health and made comparisons to international initiatives, including those in Canada, Denmark, and the United Kingdom.  China has also initiated major public health reforms [48] since 2009, including a national longitudinal system of electronic health records [49] covering its 1.3 billion citizens as part of its institutional infrastructure.  Key policies have been established, and the system is being populated.

Major Challenges and China

The U.S. statistical system is highly decentralized.  Although the 2012 share of the budget resources supporting federal statistics is only 0.02 percent of the Gross Domestic Product in the U.S., it is spread across 13 principal statistical agencies and more than 85 other agencies that carry out statistical activities along with their non-statistical program missions [50].  A sizable portion of the U.S. efforts have been spent on overcoming the barriers inherent in its decentralized structure – inadequate data sharing, competing data quality standards, unnecessary duplication, multiple administrative costs, and resolving difficulty of data access.

For example, the U.S. Census Bureau and the Bureau of Labor Statistics maintain two separate Business Registers; each register is supposed to represent the universe of U.S. businesses.  These registers are the sampling frames from which surveys and censuses are drawn, contributing to important information such as the national economic indicators for the U.S.  However, due to their independent sources and dynamic nature, the two registers have substantial differences in the number of firms and their respective payroll and employment [51].  Although progress has been made in the last decade, a single source Business Register for the U.S. has yet to emerge.

The White House announced a “Big Data Research and Development Initiative” [52] in March 2012, providing $200 million in new research and development investments to improve the ability to extract knowledge and insights from large and complex collections of digital data.  Efforts thus continue to bring convergence on Big Data inside the statistical agencies in the U.S.  Top governmental funding, commitment, and leadership, especially transparency and openness toward data [53], will also be needed in other nations, including China.

A recent Baidu search showed that awareness of the Big Data issues in China appears to be sporadic but increasing rapidly in the last 6 months.  The search results included a translation of a February 11 New York Times article on “The Age of Big Data” [54], an interview published July 14 with the author of the first known Chinese-language book on Big Data [55,56], and a media report, also dated July 14, on how Big Data may threaten individual privacy [57].

A seminar hosted by Tsinghua University on July 1, 2012 was one of very few known activities addressing how Chinese statistical bureaus, universities, and the private sector are actually dealing with Big Data and its impact on the statistical systems.  A notable exception is the research effort being conducted by the Alibaba Group, which has millions of small companies using its site and billions of dollars of e-Commerce transactions in China every day.

The need for top quality statistics is no less in China than in other nations.  Many of the key targets for China’s 12th five-year plan [58] are defined in quantitative terms.  As China transforms from an export-oriented nation to a consumer-oriented nation, the status and progress of each goal over time will be measured and evaluated by statistics and indicators that must be credible, reliable, and timely.   As China’s economic growth has slowed recently, in-depth data are needed to understand the latest trends and patterns, as well as to assist in developing potential mid-course modifications and corrections.  These statistics and indicators have great influence on the global economy as China ascends into world power status.  As the saying goes, when China sneezes, the rest of the world catches cold.

A recent report from Chinese academia [59] provided an overview of a longitudinal micro-level database focused on corporate behavior and performance in China.  The database is known variously as the “Chinese Industrial Enterprises Database,” “China Annual Survey of Industrial Firms,” or “China Annual Survey of Manufacturing Firms.”

Based on regular and annual reports submitted by sampled enterprises to the local statistical bureaus, the National Bureau of Statistics of China has assembled and maintained this database, covering all state-owned and large-scale non-government firms, since 1998.  Its largest industrial component is the manufacturing sector.  This economic database is the only supplement to the Chinese economic census, representing about 90% of the sales volume of all industrial enterprises in China in 2004.  Small and medium-size businesses, as well as e-Commerce companies, are not included.

The article identified nine areas for which important information can be extracted from the database and described its increasing domestic and international interests and use for analysis.  However, the database “suffers from data matching problems as well as measurement errors, unrealistic outliers and definition ambiguities etc., all of which practically lead to research results thereupon that are at best questionable.”

The article appeals for effective leadership and management to overcome the fundamental problems that seriously undermine an otherwise valuable longitudinal statistical system.  Herein lie some of the major challenges for the development of the 21st century statistical systems that are shared across nations.

Not all Big Data are structured or suitable for integration into statistical systems for intended statistical use.  Optimal extraction of information from data and total quality management are core values and functions of the statistics profession.  Awareness of the limitations of data, as well as professional considerations against false discovery, bias, and confounding, is part of the value that can be added by statisticians [60].  With knowledge accumulated from the past centuries, the statistics profession is well positioned to address the many challenges of Big Data, prompting the prophecy that “the sexy jobs for the next 10 years will be statisticians” [61].

Actual experience and empirical evidence have suggested at least the following list of major potential contributions by statisticians:

  • Record Linkage.  Design of system and application of exact and probability matching techniques to improve record linkage from multiple data sources and to minimize mismatches for small populations.
  • Imputation.  Development and application of imputation techniques to reliably replace missing values created by merging data from different data sources, but not to create unsupported, artificial information.
  • Data Quality Assurance.  Establishment of sound methods and rules to continuously measure and detect the presence of outlying or influential observations and to apply best, appropriate resolutions. 
  • Evolving Standards.  Standardization of terms and definitions to provide consistent understanding, but be sufficiently flexible to adjust for new concepts such as “green industry.”
  • Statistical Modeling.  Mathematical abstraction and statistical application that can range from imputing missing values, profiling markets and customers, assessing risks, and predicting future occurrences to creating artificial intelligence.
  • Data Visualization and Innovative Applications.  Innovative and timely dissemination and presentation to develop coherent stories, market new concepts, and improve statistical education.
  • Confidentiality Protection.  Development and application of statistical methods and rules such as noise infusion and synthetic data to protect confidentiality of individual entities and to quantify the level of protection being applied.
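As a toy illustration of the record-linkage item above (all names and identifiers are invented), a two-pass linker can attempt an exact match on a shared identifier first and fall back to a crude string-similarity score as a stand-in for probabilistic matching:

```python
import difflib

# Two invented business lists to be linked; ids and names are made up.
source_a = [{"id": "123", "name": "Acme Manufacturing"},
            {"id": None,  "name": "Beta Logistics Inc"}]
source_b = [{"id": "123", "name": "ACME Mfg Co"},
            {"id": "456", "name": "Beta Logistics, Inc."}]

def name_similarity(x, y):
    return difflib.SequenceMatcher(None, x.lower(), y.lower()).ratio()

def link(a_records, b_records, threshold=0.6):
    links = []
    for a in a_records:
        # Pass 1: exact match on a shared identifier, when both sides have one.
        exact = [b for b in b_records
                 if a["id"] is not None and b["id"] == a["id"]]
        if exact:
            links.append((a["name"], exact[0]["name"], "exact"))
            continue
        # Pass 2: crude similarity score as a stand-in for probabilistic
        # matching; real systems weight evidence across many fields.
        best = max(b_records, key=lambda b: name_similarity(a["name"], b["name"]))
        if name_similarity(a["name"], best["name"]) >= threshold:
            links.append((a["name"], best["name"], "fuzzy"))
    return links

matches = link(source_a, source_b)
# One exact link on id "123" and one fuzzy link on the two Beta names.
```

Production record linkage, of course, involves blocking, field-weighted scoring, and clerical review; the sketch only shows the exact-then-probabilistic structure.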

Big Data is more than another technological advancement that only improves statistical computation.  It is a revolution that challenges conventional statistical thinking and stimulates innovative thinking and development.

Some theories in mathematical statistics that have prevailed for the last century may need extensions.  For example, while it is well known that a 5% random sample yields better measurable properties than a 5% non-random sample, it is unknown how a 5% random sample compares with a 30%, 50%, or larger non-random sample, which is common in Big Data situations.  How should the metrics be modified for story-telling instead of inference-making?
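This open question can at least be probed empirically.  The toy simulation below contrasts a 5% simple random sample with a 50% non-random sample drawn from an entirely synthetic population; the selection mechanism (keeping the upper half by value) is a deliberately extreme assumption, meant only to illustrate how non-random selection can bias an estimate regardless of sample size.

```python
import random
import statistics

random.seed(42)

# Entirely synthetic population: 100,000 "incomes" with a long right tail.
population = [random.lognormvariate(10, 0.8) for _ in range(100_000)]
true_mean = statistics.fmean(population)

# A 5% simple random sample: unbiased by design, with measurable error.
srs = random.sample(population, k=5_000)

# A 50% non-random sample: here, deliberately the upper half by value,
# an extreme stand-in for the self-selection common in Big Data sources.
nonrandom = sorted(population)[len(population) // 2:]

print(f"true mean      : {true_mean:10.0f}")
print(f"5% random      : {statistics.fmean(srs):10.0f}")
print(f"50% non-random : {statistics.fmean(nonrandom):10.0f}")
```

The random 5% lands close to the true mean while the much larger non-random sample does not; what theory should say about milder, unknown selection mechanisms is exactly the open question.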

As happened when the concept of random sampling was first introduced in 1895, construction of 21st century statistical systems will be empirically guided and will proceed concurrently with theoretical development.  However, it is inconceivable that the international statistics community would take more than 30 years to welcome and embrace the use of Big Data.

The revolutionary and innovative development of 21st century statistical systems using Big Data will be multi-disciplinary, involving expertise from statistics, computer science, geography, and subject matters such as economics, education, energy, environment, health care, and transportation.

It will require academic-public-private partnerships.  Subject to the willingness and appropriateness of sharing data, the private sector is a key supplier to the 21st century statistical systems.  The role of academia remains critically important in conducting basic research, training future “data scientists,” and developing supporting theories.  The state of Massachusetts, the Massachusetts Institute of Technology, and multiple private-sector companies recently joined together and took a first step in this direction in the U.S. [62].


Taking a census has been the official statistical method for the last two thousand years.  In the last century, random sampling was introduced and random surveys became the dominant statistical method.  When sample units are selected according to probability theory, results from a small fraction of the population can be used to make valid inferences about the entire population with measurable accuracy and reliability.
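The sketch below illustrates that idea with a hypothetical population and the standard normal-approximation confidence interval for a proportion; the population size, attribute rate, and sample size are all invented for illustration.

```python
import math
import random

random.seed(7)

# Hypothetical population of 1,000,000 units, 30% carrying some attribute.
population = [1] * 300_000 + [0] * 700_000

# Survey only 0.1% of the population, selected at random.
sample = random.sample(population, k=1_000)
p_hat = sum(sample) / len(sample)

# Measurable accuracy: a 95% confidence interval for the population
# proportion via the normal approximation.
se = math.sqrt(p_hat * (1 - p_hat) / len(sample))
lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se
print(f"estimate {p_hat:.3f}, 95% CI ({lo:.3f}, {hi:.3f})")
```

A sample of one unit in a thousand pins the proportion down to within a few percentage points, and the width of the interval itself quantifies the accuracy, which is precisely what a non-random collection of data, however large, cannot offer.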

Modern information technologies, starting in the 1990s, ushered in the Big Data era.  Massive amounts of data are becoming available; the number of potential data users with the capability to access data has grown explosively; and the cost of storing and processing data has decreased dramatically.  The need to collect, analyze, and disseminate data widely, promptly, and comprehensively in a global economy is firmly established for the 21st century.

Most national statistical agencies will not be able to improve and expand their current practices on their own.  Current theories of mathematical statistics are not sufficient to support the empirical use of Big Data.  Twentieth-century statistical systems rooted in 1970s technologies are no longer sufficient to meet the requirements of the 21st century.

This paper outlines the basic challenges facing all nations and provides emerging success stories of how Big Data from multiple sources can be successfully integrated to construct longitudinal data systems.

Development of innovative, dynamic 21st century statistical systems supported by a new statistical foundation is both feasible and necessary.  It will require government commitment and leadership, academic-public-private partnerships, concurrent multi-disciplinary research and development, and innovative use of statistical thinking from the past centuries.  The prophecy that “the sexy jobs for the next 10 years will be statisticians” can be realized.  On the other hand, failure to make a paradigm change now will likely lead to the irrelevance and even disappearance of national statistical agencies and the loss of a nation’s competitive advantage in the global economy.


References

[1] National Bureau of Statistics of China.  “History of Statistics Prior to Qin Dynasty.”  Available at on July 16, 2012.

[2] National Bureau of Statistics of China.  “History of Statistics during the Qin and Han Dynasties.”  Available at on July 16, 2012.

[3] Wikipedia.  “Census.”  Available at on July 16, 2012.

[4] Hays, Jeffrey.  “China – Facts and Details: Han Dynasty (206 B.C. – A.D. 220).”  Available at on July 16, 2012.

[5] Loewe, Michael. “The Former Han dynasty.” The Ch’in and Han Empires, 221 B.C.–A.D. 220. Eds. Denis Twitchett and John K. Fairbank. Cambridge University Press, 1987.  Available at on July 16, 2012.

[6] National Bureau of Statistics of China.  “Statistical Laws of The People’s Republic of China.”  Available at on July 16, 2012.

[7] National Bureau of Statistics of China.  “How Many Years to Conduct a Census; How Many Censuses China Has Conducted.”  Available at on July 16, 2012.

[8] U.S. Census Bureau.  “What is The Census?”  Available at on July 16, 2012.

[9] National Bureau of Statistics of China.  “How Does The United States Conduct Its Population Census?”  Available at on July 16, 2012.

[10] U.S. Census Bureau.  “Economic Census.”  Available at on July 16, 2012.

[11] U.S. Census Bureau.  “About the 2007 Economic Census.”  Available at on July 16, 2012.

[12] National Bureau of Statistics of China.  Communiqué on Major Data of the Second National Economic Census (No.1)”.  December 25, 2009.  Available at on July 16, 2012.

[13] U.S. Census Bureau.  “Design and Methodology – American Community Survey.”  Chapter 2. Program History.  Available at on July 16, 2012.

[14] U.S. Census Bureau.  “Design and Methodology – American Community Survey.”  Chapter 13. Preparation and Review of Data Products.  Available at on July 16, 2012.

[15] National Bureau of Statistics of China.  “China’s Total Population and Structural Changes in 2011.”  Available at on July 16, 2012.

[16] Wu, Jeremy S., Chinese translation by Zhang, Yaoting and Yu, Xiang.  “One Hundred Years of Sampling,” invited paper in “Sampling Theory and Practice”.  ISBN7-5037-1670-3.  China Statistical Publishing Company, 1995.

[17] The George Washington University.  “The Department of Statistics”.  Available at on July 16, 2012.

[18] U.S. Census Bureau.  “Developing Sampling Techniques”.  Available at on July 16, 2012.

[19] The Washington Post.  “Rise of the Digital Information.”  Available at on July 16, 2012.

[20] Hilbert, Martin and Lopez, Priscila.  “The World’s Technological Capacity to Store, Communicate, and Compute Information.”  Science 1 April 2011: Vol. 332 no.6025 pp.60-65. DOI:10.1126/science. 1200970.  Available at on July 16, 2012.

[21] Savov, Vlad.  “Visualized: a zettabyte” June 29, 2011.  Available at on July 16, 2012.

[22] International Data Corporation.  “The Diverse and Exploding Digital Universe.”  Sponsored by EMC Corporation, March 2008.  Available at on July 16, 2012.

[23] Data Center Knowledge.  “‘Digital Universe’ Nears a Zettabyte.”  May 4, 2010.  Available at on July 16, 2012.

[24] TechCrunch.  “Eric Schmidt: Every 2 Days We Create As Much Information As We Did Up To 2003.”  August 4, 2010.  Available at on July 16, 2012.

[25] Human Genome Project.  “Frequently Asked Questions.” Joint international project under the U.S. Departments of Energy and the National Institute of Health.  Available at on July 16, 2012.

[26] Wikipedia.  “Alpha Magnetic Spectrometer.”  Available at on July 16, 2012.

[27] Google.  “Explore Flu Trends around the World.”  Available at on July 16, 2012.

[28] U.S. Census Bureau.  “2010 Census Mail Participation Rate Map.”  Available at on July 16, 2012.

[29] El Nasser, Haya; and Overberg, Paul.  “2010 Census Response Rate Surprisingly Close to 2000 Rate.”  USA Today.  April 26, 2010.  Available at on July 16, 2012.   

[30] Pink, Brian; Borowik, Jenine; Lee, Geoff.  “The Case for an International Statistical Innovation Program – Transforming National and International Statistics Systems.”  Supporting paper, Australian Bureau of Statistics.  10/2009.  Available at$FILE/Supporting%20Discussion%20Paper.pdf on July 16, 2012.

[31] Statistics Canada. “Administrative Data Use.”  Available at on July 16, 2012.

[32] Brackstone, G.J. “Issues in the use of administrative records for statistical purposes.” Survey Methodology. Vol. 13. p. 29–43, 1987.

[33] Scheuren, Fritz and Petska, Tom.  “Turning Administrative Systems into Information Systems.”  Available at on July 16, 2012.

[34] Office of Management and Budget.  “Seminar on Quality of Federal Data.”  Part 1 of 3.  Federal Committee on Statistical Methodology, March 1991.  Available at on July 16, 2012.

[35] Organization for Economic Co-operation and Development.  “Use of Administrative Sources for Business Statistics Purposes.”  Handbook of Good Practices.  Available at on July 16, 2012.

[36] Haltiwanger, John; Lane, Julia; Spletzer, Jim; Theeuwes, Jules; and Troske, Ken.  “Conference Report: International Symposium on Linked Employer-Employee Data.”  Monthly Labor Review, July 1998.  Available at on July 16, 2012.

[37] Valente, Paolo.  “Census Taking in Europe: How are Populations Counted in 2010?”  Bulletin Mensuel d’Information de L’Institut National d’Études Démographiques. Population and Societies, No. 467, May 2010.  Available at on July 16, 2012.

[38] Groves, Robert M.  “National Statistical Offices: Independent, Identical, Simultaneous Actions Thousands of Miles Apart.”  U.S. Census Bureau, February 2, 2012.  Available at on July 16, 2012.

[39] Haltiwanger, John C.; Lane, Julia I.; Spletzer, James, R.; Theeuwes, Jules J.M.; Troske, Kenneth R.  “The Creation and Analysis of Employer-Employee Matched Data: Contributions to Economic Analysis.”  North Holland, 1999.

[40] Wu, Jeremy S.  “State of Longitudinal Employer-Household Dynamics Program.”  Unpublished manuscript, U.S. Census Bureau, January 2006.

[41] U.S. Census Bureau.  “Quarterly Workforce Indicators Online.”  Available at on July 16, 2012.

[42] U.S. Census Bureau.  “OnTheMap.”  Available at on July 16, 2012.

[43] U.S. Census Bureau.  “OnTheMap for Emergency Management.”  Available at on July 16, 2012.

[44] Mesenbourg Jr., Thomas.  “Innovations in Data Dissemination.”  United Nations Statistical Commission Seminar on Innovations in Official Statistics, February 20, 2009.  Available at on July 16, 2012.

[45] Data Quality Campaign.  “Using Data to Improve Student Achievement” Website.  Available at on July 16, 2012

[46] U.S. Department of Health and Human Services.  “Accelerating Electronic Health Records Adoption and Meaningful Use.”  August 5, 2010.  Available at on July 16, 2012.

[47] Ecole de Santé publique, Vakgroep Sociaal Onderzoek – SOCO, and Institut de Recherche Santé et Société.  “Belgian Longitudinal Health Information System: Supplement the health information system by means of longitudinal data.  Summary of the research.” Project AGORA AG / JJ / 139.   February 2011.  Available at on July 16, 2012.

[48] International Health Economics Association.  “China Forum.” Available at on July 16, 2012.

[49]  “China Will Build Unified National Citizen Health Records; Apply Standardized Management.”  April 7, 2009.  Available at on July 16, 2012.

[50] Office of Management and Budget.  “Statistical Programs of the United States Government: Fiscal Year 2012.”  Available at on July 16, 2012.

[51] Foster, Lucia; Elvery, Joel; Becker, Randy; Krizan, Cornell; Nguyen, Sang; and Talan, David.  “A Comparison of the Business Registers used by the Bureau of Labor Statistics and the Bureau of the Census.”  Office of Survey Methods Research, Bureau of Labor Statistics, 2005.  Available at on July 16, 2012.

[52] The Executive Office of the President of the United States.  “Obama Administration Unveils ‘Big Data’ Initiative: Announces $200 Million in New R&D Investments.” March 29, 2012. Available at on July 16, 2012.

[53] The White House. “Open Government Initiative.”  January 21, 2009.  Available at on July 16, 2012.

[54]  “Arrival of the Big Data Era.” March 31, 2012.  Available at on July 16, 2012.

[55] Tu, Zipei 涂子沛.”The Big Data Revolution 大数据:正在到来的数据革命.” Guangxi Normal University Publications.  广西师范大学出版社.

[56]  “An Interview with Tu Zipei: Public Life of Dignity Needs ‘Big Data’.”  July 14, 2012.  Available at on July 16, 2012.

[57]  “Disappearance of Individual Privacy Upon the Arrival of the Big Data Era?”  July 14, 2012.  Available at on July 16, 2012.

[58] China Daily.  “Key Targets of China’s 12thFive-Year Plan.” Available at on July 16, 2012.

[59] Nie, Huihua; Jiang, Ting; and Yang, Rudai.  “A Review and Reflection on the Use and Abuse of Chinese Industrial Enterprises Database.”  To appear in World Economics, Volume 5, 2012.  Available at on July 16, 2012.

[60] Rodriguez, Robert.  “Big Data and Better Data.”  AMSTAT News, President’s Corner, American  Statistical Association.  May 31, 2012.  Available at on July 16, 2012.

[61] Varian, Hal.  “Hal Varian explains why statisticians will be the sexiest job in the next 10 years.”  September 15, 2009.  YouTube.  Available at on July 16, 2012.

[62] Massachusetts Institute of Technology.  “MIT CSAIL & Intel Join State of Massachusetts to Tackle Big Data.”  Press release by MIT Computer Science and Artificial Intelligence Laboratory.  May 30, 2012.  Available at on July 16, 2012.
