This blog in simplified Chinese describes the status and need for statistical monitoring in Smart City development in China; it includes an interactive map of the 291 test locations.
After more than a year of preparation, the Workshop on Big Data and Urban Informatics was held at the University of Illinois at Chicago on August 11-12, 2014.
More than 150 persons from at least 10 countries (Australia, Canada, China, Greece, Israel, Italy, Japan, Portugal, United Kingdom, and the U.S.) attended the forum sponsored by the National Science Foundation.
Piyushimita (Vonu) Thakuriah, co-chair for the workshop, reported on the funding of Urban Big Data Center at the University of Glasgow in Scotland (http://bit.ly/1kXG2Uh). Its mission is to “support research for improved understanding of urban challenges and to provide data, technology and services to manage, make policy, and innovate in cities.” The Urban Big Data Center partners with five other universities including the University of Illinois at Chicago. Vonu, a transportation expert, is the director of the center.
In the course of two full days, 68 excellent presentations were made, far exceeding the expectations of the organizers a year ago. These papers will be posted on the web in the near future.
Two luncheon keynote speakers highlighted the workshop.
Carlo Ratti presented the state-of-the-art work of the MIT SENSEable City Lab, which specializes in the deployment of sensors and hand-held electronics to study the environment. Since conventional measures of air quality tend to be collected at stationary locations, they do not always represent the exposure of a mobile individual. In one project titled “One Country, Two Lungs” (http://bit.ly/1nbSBXi), a team of human probes travelled between Shenzhen and Hong Kong to detect urban air pollution. The video revealed the divisions in atmospheric quality and individual exposure between these two cities.
Paul Waddell of the University of California at Berkeley presented his work on urban simulation and dynamic 3-D visualization of land use and transportation. Some of his impressive work images can be found at http://bit.ly/1rn9hmj. His video and examples reminded me about their potential applicability for creating the “Three Districts and Four Lines” in China’s National Urbanization Plan. I also learned about a somewhat similar set of products from China’s supermap.com, a Geographic Information System software company based in Beijing.
One of the 68 presentations described the use of smart card data to study the commuting patterns and volume in Beijing subways during rush hours. One other presentation compared the characteristics of big data and statistics and raised the question of whether big data is a supplement or a substitute to statistics.
The issue of data quality was seldom volunteered in the sessions, but questions about it came up frequently. Judging from terms such as editing, filtering, cleaning, scrubbing, imputing, curating, and re-structuring, it was clear that some presenters spent an enormous amount of time and effort just getting the data ready for very basic use.
Perhaps data quality is considered secondary in exploratory work. However, there are good quality big data and bad quality big data. When other options are available, spending too much time and effort on bad quality big data seems unwise because it offers little practical value for future use.
There were also few presentations that discussed the importance of data structure, whether it is built in by design or created through metadata. Structured data contain far more potential information content than unstructured data and tend to be more efficient for information extraction, especially when they can be linked across multiple sources.
For the purpose of governance, I was somewhat surprised that use of administrative records has not yet caught on in this workshop. Accessibility and confidentiality appeared to be barriers. It would seem helpful for future workshops to include city administrators and public officials to help bridge the gap between research and practical needs for day-to-day operations.
Nations and cities share a common goal in urban planning and urban informatics – improve the quality of city life and service delivery to constituents and businesses alike. On the other hand, there are drastic differences in their current standing and approach.
China is experiencing the largest human migration in history. It has established goals and direction for urban development, but has little reliable, quantitative research or experience to support and execute its plans. The West is transitioning from its century-old urban living to a future that is filled with exciting creativity and energy, but does not seem to have as clear a vision or direction.
Confidentiality is an issue that contrasts sharply between China and the West. The Chinese plans show strong commitment to collect and merge linkable individual records extensively. If implemented successfully, it will generate unprecedented amount of detailed information that can also be abused and misused. The same approach would likely face much scrutiny and opposition in the West, which has to consider less reliable but more costly alternatives in order to meet the same needs.
There is perhaps no absolute right or wrong approach to these issues. The workshop and the international community being created offer a valuable opportunity to observe, discuss, and make comparisons in many globally common topics.
Selected papers from the workshop will now undergo additional peer review. They will be published in an edited volume titled “See Cities Through Big Data – Research, Methods and Applications in Urban Informatics.”
The U.S. Surgeon General released a landmark report on smoking and health in 1964, concluding that smoking caused lung cancer. At that time, smoking was at its peak in the U.S. – more than half of the men and nearly one-third of the women were reported to be smokers.
The U.S. Surgeon General released another report [1] in June this year, titled “The Health Consequences of Smoking – 50 Years of Progress.”
A time plot based on the recent report [2] shows the trend of one statistic – adult per capita cigarette consumption – for the period of 1900-2012. It reveals the rise of smoking in the U.S. in the first half of the 20th century, coinciding with the Great Depression and two world wars when the government supplied cigarettes as rations to soldiers. There has been a steady decline in the last 50 years.
When the 1964 report was released, an American adult was smoking more than 4,200 cigarettes a year on the average. Today it is less than 1,300. About 18% of Americans smoked in 2012, down from the overall 42% in 1964. The difference between male and female smokers is relatively small – men at 20% and women at 16%. According to a 2013 Gallup poll [3], 95% of the American public believed that smoking is very harmful or somewhat harmful, compared to only 44% of Americans who believed that smoking causes cancer in 1958.
After the release of the 1964 report, Congress required all cigarette packages to carry a health warning label in 1965. Cigarette advertising on television and radio was banned effective in 1970. Taxes on cigarettes were raised; treatments for nicotine addiction were introduced; the non-smokers’ rights movement began. Together, laws, regulations, public education, treatment, taxation, and community efforts have all played an important role in transforming a national habit into a recognized threat to human health and quality of life over the last 50 years. That this could happen in my lifetime was beyond my wildest imagination.
Statistics has been at the center of this enormous social change from the beginning of the smoking and health issue.
As early as 1928, statistical data began to appear showing a higher proportion of heavy smokers among lung cancer patients [4]. A 10-member advisory committee prepared the 1964 report, spending over a year and drawing on some 150 consultants to review more than 7,000 scientific articles. By design, the committee included five non-smokers and five smokers, representing disciplines in medicine, surgery, pharmacology, and STATISTICS. The lone statistician was William G. Cochran, a smoker who was also a founding member of the Statistics Department at Harvard University and author of two classic books, “Experimental Designs” and “Sampling Techniques.”
During the past 50 years, an estimated 21 million Americans have died because of smoking, including nearly 2.5 million non-smokers due to second-hand smoke and 100,000 babies due to parental smoking.
There are still about 42 million adult smokers and 3.5 million middle and high school students smoking cigarettes in the U.S. today. Interestingly, Asian Americans have the lowest rate of smokers at 11% among all racial groups in the U.S.
China agreed to join the World Health Organization Framework Convention on Tobacco Control in 2003. It reported [5] 356 million smokers in 2010, about 28% of its total population and practically unchanged from its 2002 level. The gender difference was remarkable – 340 million male smokers (96%) and 16 million female smokers (4%). About 1.2 million people die from smoking in China each year. Among the remaining over 900 million non-smokers in China, about 738 million, including 182 million children, are exposed to second-hand smoke. Only 20% of Chinese adults reportedly believed that smoking causes cancer in 2010 [6].
More detailed historical records on smoking in China are either inconsistent or fragmented. One source outside of China [7] suggested that there were 281 million Chinese smokers in 2012 and an increase of 100 million smokers from 1980.
China has been stumbling in its efforts to control smoking.
According to a 2013 survey by the Chinese Association on Tobacco Control [8], 50.2% of male school teachers were smokers; male doctors, 47.3%; and male public servants, 61%. Given these high rates among people in such influential roles, there is concern and skepticism about how effectively tobacco control can be implemented or enforced.
Coupled with the institutional issues of its tobacco industry, China has been criticized for ineffective tobacco control. While some American tobacco companies are also very large, they are not state-owned. China is the world’s largest tobacco producer and consumer, and its state-owned monopoly, China National Tobacco Corporation, is the largest company of its type in the world.
Nonetheless, the Chinese government has enacted a number of measures to restrict smoking in recent years. The Ministry of Health took the lead in banning smoking in the medical and healthcare systems in 2009. Smoking in indoor public spaces such as restaurants, hotels, and public transportation was banned beginning in 2011.
According to the Chinese Tobacco Control Program (2012-2015) [9,10], China will ban cigarette advertising, marketing and sponsorship, setting a goal of reducing the smoking rate from 28.1% in 2010 to 25%.
Smoking is a social issue common to both the U.S. and China.
Statistics facilitates understanding of the status and its implications, and provides advice, assistance, and guidance for governance. More statistics can certainly be cited about the ill effects of smoking in both nations. In the end, it is the collective will and wisdom of each nation that will determine the ultimate course of action.
[5] The Central People’s Government of the People’s Republic of China. (2011, January 6) Population of tobacco remains high and not declining; smokers are still over 300 million. Retrieved from http://www.gov.cn/jrzg/2011-01/06/content_1779597.htm.
In the early stages of its economic reform, China chose to “cross a stream by feeling the rocks.”
Limited by the expertise and conditions of the time, when China had no statistical infrastructure to provide accurate and reliable measurements, this was the only viable path.
In fact, this path was traveled by many nations, including the U.S. At the beginning of the 20th century, when the field of modern statistics had not yet taken shape, data were not believable or reliable even when they existed. The well-known American writer and humorist Mark Twain once lamented about “lies, damned lies, and statistics,” pointing out the data quality problem of the time. Over the past hundred years, statistics developed an international common language and reliable data, establishing a long record of success with broad areas of application in the U.S. This stage of statistics may be generally called Statistics 1.0.
Feeling the rocks may help one cross a stream, but it would be difficult to land on the moon that way, and more difficult still to create smart cities and an affluent society. If one could scientifically measure the depth of the stream and build roads and bridges, trial and error might become unnecessary.
The long-term development of society must exit this transitional stage and enter a more scientifically based digital culture where high-quality data and credible, reliable statistics serve to continuously enhance the efficiency, equity, and sustainability of national policies. At the same time, specialized knowledge must be converted responsibly into practical, useful knowledge, serving the government, enterprises, and the people.
Today, technologies associated with Big Data are advancing rapidly. A new opportunity has arrived to usher in the Statistics 2.0 era.
Simply stated, Statistics 2.0 elevates the role and technical level of descriptive statistics, extends the theories and methods of mathematical statistics to non-randomly collected data, and expands statistical thinking to include facing the future.
One may observe that in a digital society, whether crossing a stream or reaching for the sky, from the governance of a nation to the daily life of ordinary people, what was once “unimaginable” is now “reality.” Driverless cars, drone delivery of packages, and space travel are no longer the stuff of fiction. Although the data from them that can be analyzed in practical settings are still limited, they are within the realistic vision of Statistics 2.0.
In terms of social development, the U.S. and China are actively trying to improve people’s livelihood, enhance governance, and improve the environment. A harmonious and prosperous world cannot be achieved without vibrant and sustainable economies in both China and the U.S., and peaceful, mutually beneficial collaborations between the nations.
Statistics 2.0 can and should play an extremely important role in this evolution.
The WeChat platform Statistics 2.0 will not clog already congested channels with low-quality or duplicative information. Instead, it values new thinking and a shared interest in the study of Statistics 2.0: introducing state-of-the-art developments in the U.S. and China in a simple and timely manner, offering thoughts and discussion on classical issues, exploring innovative applications, and sharing the beauty of the science of data in theory and practice.
Suppose we have data on 60,000 households. Are they useful for analysis? If we add that the amount of data is very large, like 3 TB or even 30 TB, does it change your answer?
The U.S. government collects monthly data from 60,000 randomly selected households and reports on the national employment situation. Based on these data, the U.S. unemployment rate is estimated to within a margin of sampling error of about 0.2%. Important inferences are drawn and policies are made from these statistics about a U.S. economy comprising about 120 million households and 310 million individuals.
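As a rough, back-of-envelope illustration of why 60,000 randomly selected households can be enough, the sketch below computes an approximate margin of error; the household size, labor-force share, unemployment rate, and design effect used are illustrative assumptions, not the official survey parameters.

```python
import math

# Illustrative assumptions (not official survey parameters)
households = 60_000
labor_force_members = households * 1.7      # assume ~1.7 labor-force members per household
p = 0.06                                    # assume a 6% unemployment rate
design_effect = 1.6                         # assume clustering inflates variance by ~60%

# Standard error under simple random sampling, inflated by the design effect
se = math.sqrt(p * (1 - p) / labor_force_members * design_effect)
margin_of_error = 1.96 * se                 # 95% confidence level

print(f"Approximate margin of error: {margin_of_error:.2%}")   # about 0.2 percentage points
```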
In this case, data for 60,000 households are very useful.
These 60,000 households represent only 0.05% of all the households in the U.S. If they were not randomly selected, the statistics they generate would contain unknown and potentially large biases and would not be reliable for describing the national employment situation.
In this case, data for 60,000 households are not useful at all, regardless of what the file size may be.
Suppose further that the 60,000 households are all located in a small city that has only 60,000 households. In other words, they represent the entire universe of households in the city. These data are potentially very useful. Depending on their content and relevance to the question of interest, the usefulness of the data may again range widely between the two extremes. If the content is relevant and the quality is good, file size may then become an indicator of the degree of usefulness of the data.
This simple line of reasoning shows that the original question is too incomplete for a direct, satisfactory answer. We must also consider, for example, the sample selection method, representation of the sample in the population under study, and the relevance and quality of the data relative to a specified hypothesis that is being investigated.
The original question of data usefulness was seldom asked until the Big Data era began around 2000, when electronic data became widely available in massive amounts at relatively low cost. Before then, data were usually collected because they were driven by and needed for a known, specific purpose, such as an exploration to conduct, a hypothesis to test, or a problem to resolve. It was costly to collect data; when they were collected, they were already considered potentially useful for the intended analysis.
For example, when the nation was mired in the Great Depression, the U.S. government began to collect data from randomly selected households in the 1930s so that it could produce more reliable and timely statistics about unemployment. This practice has continued to this date.
Statisticians initially considered data mining to be a bad practice. It was argued that without a prior hypothesis, false or misleading identification of “significant” relationships and patterns is inevitable when one “fishes,” “dredges,” or “snoops” data aimlessly. An analogy is over-interpreting why a particular person won a lottery: not because the winner possesses any special skill or knowledge about winning, but because random chance dictates that someone must eventually win.
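A small simulation, using assumed numbers, illustrates the fishing problem: test enough unrelated variables against a random outcome and some will look “significant” purely by chance.

```python
import random

random.seed(1)
n_obs = 200          # observations (assumed)
n_variables = 500    # candidate predictors, all pure noise (assumed)

outcome = [random.gauss(0, 1) for _ in range(n_obs)]

def correlation(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

# |r| > 0.14 corresponds roughly to p < 0.05 when n = 200
false_hits = sum(
    1 for _ in range(n_variables)
    if abs(correlation([random.gauss(0, 1) for _ in range(n_obs)], outcome)) > 0.14
)
print(f"'Significant' correlations found in pure noise: {false_hits} of {n_variables}")
# Expect roughly 5% of the variables, i.e. about 25 spurious findings.
```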
Although the argument of false identification remains valid today, it has also been overwhelmed by the abundance of available Big Data that are frequently collected without design or even structure. Total dismissal of the data-driven approach bypasses the chance of uncovering hidden, meaningful relationships that have not been or cannot be established as a priori hypotheses. An analogy is the prediction of hereditary disease and the study of potential treatment. After data on the entire human genome are collected, they may be explored and compared for the systematic identification and treatment of specific hereditary diseases.
Not all data are created equal and have the same usefulness.
Complete and structured data can create dynamic frames that describe an entire population in detail over time, providing valuable information that has never been available in previous statistical systems. On the other hand, fragmented and unstructured data may not yield any meaningful analysis no matter how large the file size may be.
As problem solving rapidly expands from a hypothesis-driven paradigm to include a data-driven approach, the fundamental questions about the usefulness and quality of these data have also increased in importance. While the question of study interest may not be specified a priori, establishing it after data collection is still necessary before conducting any analysis. We cannot obtain a correct answer to a question that was never asked.
How are the samples selected? How much does the sample represent the universe of inference? What is the relevance and quality of data relative to the posterior hypothesis of interest? File size has little to no meaning if the usefulness of data cannot even be established in the first place.
Ignoring these considerations may lead to the need to update a well-known quote: “Lies, Damned Lies, and Big Data.”
About 45 years ago, I spent a whopping $1.95 on a little book titled “How to Lie with Statistics.”
Besides the catchy title, its bright orange cover has a comic character sweeping numbers under a rug. Darrell Huff, a magazine editor and a freelance writer, wrote the book in 1954. It went on to become the most popular statistics book in the world for more than half a century. A translated version was published in China around 2002.
It takes only a few hours to read the entire book of about 140 pages and 80 pictures at a leisurely pace, but it was a major reason why I pursued an education and a professional career in statistics.
The corners of the book are now worn; the pages have turned yellow. One can identify some of the social changes in the last 60 years from the book. For example, $25,000 is no longer an enviable annual salary; few of today’s younger generation may know what a “telegram” was; “gay” has a very different meaning now; and “African Americans” has replaced “Negroes” in daily usage. As indicative of the bygone era, the image of a cigar, a cigarette, or a pipe appeared in at least one out of every five pictures in the book – even babies were puffing away in high chairs. The word “computer” did not show up once among its 26,000 words.
Huff’s words were simple, but sharp and direct. He provided example after example of how the most respected magazines and newspapers of his time lied with statistics, just like the dreadful “advertising man” and politician.
According to Huff, most humans have “a bias to favor, a point to prove, and an axe to grind.” They tend to over- or under-state the truth in responding to surveys; those who complete surveys are systematically different from those who do not respond; and built-in partiality occurs in the wording of a questionnaire, appearance of an interviewer, or interpretation of the results.
In Huff’s time, there were no desktop computers or mobile devices; statistical charts and infographics were drawn by hand; and data collection, especially complete counts like a census, was difficult and costly. Huff conjectured, and the statistics profession has concurred, that the only reliable small sample is one that is random and representative, with all sources of bias removed.
Calling anyone a liar was harsh then, and it still is now. The dictionary definition of a lie is a false statement made with deliberate intent to deceive. Huff considered lying to include chicanery, distortion, manipulation, omission, and trickery; ignorance and incompetence were only excuses for not recognizing them as lies. One may also lie by selectively using a mean, a median, or a mode to mislead readers although all of them are correct as an average.
No matter how broadly or narrowly lies may be defined, it cannot be denied that people do lie with statistics every day. To some media’s credit, there are now fact-checkers who regularly examine stories or statements, most of them based on numbers, and evaluate their degree of truthfulness.
In the era of Big Data, lies occur in higher velocity with bigger volume and greater variety.
Moore’s law is not a legal, physical, or natural law, but a loosely fitted regression equation on a logarithmic scale. Each of us has probably won the Nigerian lottery or its variations via email at least a few times. While measures of gross domestic product or pollution are becoming more accurate because of Big Data, nations liberally cite either the aggregate or the per capita average, depending on which favors their point of view.
Heavy mining of satellite, radar, audio messages, sensor, and other Big Data may one day solve the tragic mystery of Malaysian Flight MH370, but the many pure speculations, conspiracy theories, accusations of wrongdoing, and irresponsible lies quoting these data have mercilessly added anguish and misery to the families of the passengers and the crew. No one seems to be tracking the velocity, volume and variety of the false positives that have been generated for this event, or other data mining efforts with Big Data.
The responsibility is of course not on the data; it is on the people. There is the old saying that “figures don’t lie, but liars figure.” Big Data – in terms of advancing technology and availability of some massive amount of randomly and non-randomly collected electronic data – will undoubtedly expand the study of statistics and bring our understanding and governance to new heights.
Huff observed that “without writers who use the words with honesty and understanding and readers who know what they mean, the result can only be semantic nonsense.” Today many statisticians are still using terms like “Type I error” and “Type II error” in promoting statistical understanding, while these concepts and underlying pitfalls are seldom mentioned in Big Data discussions.
At the end of his book, Huff suggested that one can try to recognize sound and usable data in the wilderness of fraud by asking five questions: Who says so? How does he know? What’s missing? Did somebody change the subject? Does it make sense? They are not perfect, but they are worth asking. On the other hand, healthy skepticism should not become overzealous in discrediting truly sound and innovative findings.
Faced with the self-raised question of why he wrote the book, especially with the title and content that provides ideas to use statistics to deceive and swindle, Huff responded that “[t]he crooks already know these tricks; honest men must learn them in defense.”
How I wish there were a book about how to lie with Big Data now! In the meantime, Huff’s book remains as enlightening as it was 45 years ago, although its price has gone up to $5.98 and is almost matched by the shipping cost.
The following is an abstract for a presentation given in the Committee of 100 Fourth Tien Changlin (田长霖) Symposium held in Wuhan, China, on June 20, 2013.
The presentation in simplified Chinese is available at 智慧武汉:善用大数据.
The urban population in China doubled between 1990 and 2012. It is estimated that an additional 400 million people will move from the countryside to the cities in the next decade. China has announced plans to become a well-off society, while maintaining harmony, during this time period. This is an enormous challenge to China and its cities like Wuhan.
A well-off society necessarily includes a sound infrastructure and sustainable economic development with entrepreneurial spirits and drive for innovation. It must constantly improve quality of life for its citizens with effective management of the environment and natural resources. Most of all, it must change governance so that flexibility, high efficiency and responsiveness are the norms that its citizens would expect.
If data were letters and single words, statistics would be grammar that binds them together in an international language that quantifies what a well-off society is, measures performance, and communicates results. Modern technology can now collect and deliver electronic information in great variety with massive volume at rapid speed during the Big Data era. Combined with open policy, talented people, and partnership between the academia, government, and private sector, Wuhan can get smart with Big Data, as it has started with projects like “China Technology and Science City” and “Citizen’s Home.” Although there are many areas yet to expand and improve, a smart Wuhan will lead the nation up another level toward a well-off society.
Link to presentation in simplified Chinese: 智慧武汉:善用大数据.
If we melted all the existing gold in the world and put it together, it would amount to about one third of the Washington Monument. The value of gold is high because it is rare and has many uses.
Recent hype about Big Data compares it to gold, as if you could collect it by simply dipping in your hands. If so, many people should already be rich.
It may be more appropriate to compare Big Data to sand, within which there may be gold – sometimes a little, sometimes none at all, and in rare situations, a lot. Whatever the case, cleaning and mining it will require investment and hard work; no software or hardware is a real substitute.
There is value in a big pile of sand too. According to an ancient saying, you can build pagodas with it; it is also the raw material for silicon chips today.
Of higher value still, Big Data is a yet-to-be fully explored branch of new knowledge.
Big Data promises to improve governance of society and better inform the public in the 21st century. Although every data record has some information to contribute, linking and merging relevant electronic records can minimize the collection of duplicate data and increase the value and utility of the integrated data rapidly and exponentially. Essential in this approach is the presence of identification codes that will facilitate the actual integration of record and data.
The identification code is a key to unlocking the enormous power in Big Data. However, it may also be the primary cause of system failures, misuses and abuses, and even fraudulent or criminal activities, if it is not properly applied and managed. In addition to technology, statistical design and quality feedback loops, proper education and training, relevant policies and regulations, and public awareness are all needed for the effective and responsible use of identification codes and Big Data.
The Need for Identification Codes
When a student enters a school, a record will store the student’s name, gender, age, family background, field of study, and other data. When she takes a course and receives a final grade, the results are recorded. When she satisfies all the requirements for graduation, another record will show the grade point average she has achieved and the degree she is awarded.
Each record represents a snapshot for the student. The records are collected over time for administrative purposes. Together the longitudinal snapshots provide comprehensive information about the education of a student.
When the student enters the workforce, additional data are collected over her lifetime about the industry and occupation she works in, the job she performs, the wages and promotions she receives, the taxes and insurances she pays, and the employment or unemployment status she is in.
In like manner, massive amounts of data are collected about a firm, including its initial registration as a business, periodic reports on revenues and expenses, entry into the stock markets, acquisitions or mergers with other companies, payment of taxes and fees, growth in sales and staffing, and expansion or death of the business.
These administrative records used to be stored in dusty file cabinets, but by the time the Big Data era arrived at the turn of the millennium, they were mostly digitized and available for computer processing.
Timely and proper integration of the records of all students would provide unprecedented details about how the school is performing, such as its graduation or dropout rate over time. Further roll up of all schools would inform a nation about its state of education, such as its capacity to support employment and economic growth. This is what Big Data promises to bring in the 21st century. From allocation of resources, measurement of performance, to formulation of policy, every segment of society can benefit from the details and insights of Big Data to improve governance and inform the public.
Although every data record has some information to contribute, linking and merging relevant electronic records minimizes the collection of duplicate data and increases the value and utility of the integrated data exponentially. Essential in this approach is the presence of identification codes that will facilitate the actual integration of record and data. Statisticians can make significant contributions to building new statistical systems with their thinking and methods in this process.
Types of Identification Codes
The name of an individual or a company was the preferred identification code when files were still physical, such as in paper form. It has been conventional to consolidate records under the same name and sort them by alphabetical order in English, number of strokes in Chinese, or chronological order.
However, a major shortcoming of using names, especially when processed massively by computer, is that they are not unique. The top four family names of Lee, Wang, Zhang, and Liu accounted for 334 million individuals in China in 2006 [1], exceeding the total U.S. population. Chinese names may also appear differently because of the simplified and traditional characters. The English first name of Robert, the 61st most popular male name at birth in the U.S. in 2011 [2,3], can have at least 7 common variations for the same person, including Bert, Bo, Bob, Bobby, Rob, Robbie, and Robby. Bert may also be short for Albert. Individuals may apply to change their names or use more than one name; women may change their names after marriage. Human errors can add errant names. References to the same name across nations with different languages can be notoriously difficult.
The name of a company is usually checked and validated to avoid duplication during the registration process and protected by applicable local, national and international rules and laws including trademarks after registration. The company may use multiple names including abbreviations and stock market symbols; it can also change its name due to merger with another company, acquisition agreement, reorganization, or a simple desire to change its brand.
A non-unique identification code poses the risk of linking and merging records incorrectly, leading to incorrect results or conclusions. Supplementing a name with auxiliary information, such as age, gender, and an address, would reduce but not eliminate the chance of record mismatches, and at the cost of increasing processing time.
A series of numbers, letters and special characters (alphanumeric) or a series of numbers alone (numeric) is increasingly used as the identification code of choice with modern machine sorting, linking, and merging of electronic records. Numeric codes tend to be less restrictive because they are independent of the writing system. Alphanumeric codes using letters from the English alphabet may be suitable for systems using languages based on the Latin alphabet, but systems using non-Latin scripts may still find them unavailable or difficult to use, understand, or interpret. It is also easier to understand how numeric codes are sorted compared to alphanumeric codes.
When the Social Security Act of 1935 was passed in the U.S., one of the first challenges in implementation was to create an identification code that would “permanently identify each individual to be covered” and “be sufficiently elastic to function indefinitely as additional workers became covered” [4]. An 8-field alphanumeric code was initially chosen, but it was soon rejected by the statistical agencies, as well as labor and justice departments. This exchange was described [4,5] as the first sign of “the tremendous impact machines would have on the way [government] would do business.” This was BEFORE computers were introduced for actual use.
Today, the impact of information technology is obvious and continues to increase in every aspect of government, business, and individual activities. An identification code may be applied to a person, a company, a vehicle, a credit card, a cargo shipment, an email account, a location, or just about any practical entity.
An electronic record that does not contain an identification code or cannot be correctly linked with other records may be described as lacking in “structure,” or “unstructured,” in the Big Data era. Since the beginning of the 21st century, “unstructured” data have been generated at a much higher frequency than “structured” data. However, they contain relatively limited information content and utility compared to “structured” data, especially for continuing, consistent, and reliable information about a society or an economy over time.
Effective use of the identification code is a key to unlocking the enormous power inherent in Big Data.
Effective Use of Identification Codes
Match and Merge Records. Ideal identification codes are mutually exclusive and exhaustive, establishing an unambiguous one-to-one relationship between the code and the entity, including those yet to appear in the future. The code facilitates direct and perfect machine sorting, matching, and merging of electronic records, potentially increasing the amount of information about the entity with no limit.
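A minimal sketch of matching and merging on a shared identification code follows; the record layouts, fields, and student IDs are hypothetical, chosen only to illustrate the one-to-one linkage.

```python
# Hypothetical enrollment and graduation records keyed by a shared student ID
enrollment = {
    "S001": {"name": "Lee",  "field_of_study": "Statistics"},
    "S002": {"name": "Wang", "field_of_study": "Economics"},
}
graduation = {
    "S001": {"gpa": 3.6, "degree": "BS"},
    "S003": {"gpa": 3.1, "degree": "BA"},   # no matching enrollment record
}

# Merge records that share the same identification code
merged = {
    sid: {**enrollment[sid], **graduation[sid]}
    for sid in enrollment.keys() & graduation.keys()
}
print(merged)
# {'S001': {'name': 'Lee', 'field_of_study': 'Statistics', 'gpa': 3.6, 'degree': 'BS'}}
```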
Anonymize and Protect Identity. A code offers the first-line protection of identity by anonymizing the entity. Due to the increasing importance of the code and the relative ease of linking with other records, the risks and stakes of identity fraud or theft through the identification code have also risen, requiring responsible policy and management of the code as safeguards.
Provide Basic Description and Classification. An identification code can provide the most basic description of the content and context of the data records, from which simple observations or summaries can be quickly derived. Over time, this concept also evolved into codes for classification and the separate development of “metadata” [6,7] for efficiently building structure into data systems and broadening their use across systems.
Perform Initial Quality Check. Unintentional human errors in typing or transcribing an identification code incorrectly may damage the quality of integrated data and the eventual analytical results. Fraudulent or malicious altering of the identification codes may inflict even more severe damage to the integrity and reliability of the data. Early detection with the deployment of “check digit” [8,9] in the identification code may eliminate more than 90 percent of these common errors.
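The references [8,9] describe check-digit schemes in general; as one widely known illustration (not necessarily the scheme used by any code discussed here), the sketch below computes a Luhn check digit, which detects all single-digit typos and most adjacent transpositions.

```python
def luhn_check_digit(payload: str) -> str:
    """Compute the Luhn check digit for a string of decimal digits."""
    total = 0
    # Walk the payload from right to left, doubling every second digit
    for i, ch in enumerate(reversed(payload)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return str((10 - total % 10) % 10)

def is_valid(code: str) -> bool:
    """A code is valid when its last digit matches the recomputed check digit."""
    return luhn_check_digit(code[:-1]) == code[-1]

code = "7992739871" + luhn_check_digit("7992739871")   # -> "79927398713"
print(code, is_valid(code))          # True
print(is_valid("79927398723"))       # a single-digit typo is caught -> False
```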
Facilitate Statistical Innovations. By collecting and integrating data continuously for each entity such as a student, a dynamic frame with rich content can be built for all students and all schools. New data elements may be defined for analysis; statistical summaries may be produced in real time or according to set schedules to describe the performance of a school or the state of education for a nation, while strictly protecting the confidentiality of individuals and security of their data. Innovative efforts to construct these dynamic frames, or longitudinal data systems, have started in the U.S. and China [10]. The Data Quality Campaign [11] lists “a unique statewide student identifier that connects student data across key databases across years” to be the top essential element in building state longitudinal data systems for education in the U.S.
Personal Identification Codes of the U.S. and China
The U.S. does not have a national identification system. The Social Security Number (SSN) was created to track earnings of workers in the U.S. in 1936, before computers were introduced for commercial use. Its transition into the computer age revealed some of the strengths and weaknesses of its evolving role as an identification code.
The 9-digit SSN is composed of 3 parts:
U.S. Social Security Card
Area Number (3 digits) – initially the geographical region where the SSN was issued, and later based on the ZIP code of the mailing address on the application
Group Number (2 digits) – representing each set of SSN being assigned as a group
Serial number (4 digits) – from 0001 to 9999
Demographic data are collected in the SSN application [12], including name, place of birth, date of birth, citizenship, race, ethnicity, gender, parents’ names and SSNs, phone number, and mailing address. The U.S. Social Security Administration is responsible for issuing the SSN. Some SSNs are reserved and never issued. Once issued, an SSN is supposed to be unique because it will not be issued again; however, some duplicates exist.
A wallet manufacturer decided to promote its product in 1938 by showing how a copy of a Social Security card belonging to one of its employees would fit into its wallets, which were sold through department stores [13]. In all, over 40,000 people mistakenly reported this number as their own SSN, some as late as 1977.
Use of the SSN by the government, and later the private sector, has expanded substantially since its creation. Beginning in 1943, federal agencies were required by executive order to use the SSN whenever an agency found it advisable to establish a new system of permanent account numbers for individuals [5]. In the early 1960s, federal employees and individual tax filers were required to use the SSN. In the late 1960s, the SSN began to serve as the military identification number. Throughout the 1970s, as computers were increasingly used, the SSN was required for federal benefits and for financial transactions such as opening bank accounts and applying for credit cards and loans. Beginning in 1986, parents had to list the SSN of each dependent they wanted to claim as a tax deduction. This anti-fraud change resulted in 7 million fewer minor dependents being claimed in the first year of implementation [14].
As the SSN became essentially an unofficial national identifier that can link and merge many electronic files about the same person, it has also become a direct cause of misuse and abuse such as identity theft [15]. The SSN has no check digit, so it cannot be used reliably to authenticate identity. Academic researchers have also demonstrated ways to use publicly available information “to reconstruct SSN with a startling degree of accuracy” [16]. These identified vulnerabilities have led to more cautious, secure, and responsible use of the SSN in the U.S. in recent years. The original 1943 executive order requiring the use of the SSN was rescinded and replaced by another executive order in 2008 that makes its use optional.
China had a relatively late start in personal identification codes. It revised the Resident Identification Number (RIN) from 15 digits to 18 digits on July 1, 1999, raising the embedded birth year from 2 to 4 digits and adding a check digit. The 18-digit RIN is composed of 4 parts [17,18]:
Chinese Identification Card
Address Area Number (6 digits) – administrative code for the individual’s residence
Birthdate Number (8 digits) – in the form of YYYYMMDD where YYYY is year, MM is month and DD is day of the birthdate
Serial Number (3 digits) – with odd numbers reserved for males and even numbers reserved for females
Check Digit (1 digit) – computed from the previous 17 digits using the ISO 7064 standard algorithm [18,19], as sketched below
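The following is a minimal sketch of that check-digit computation (ISO 7064 MOD 11-2 as specified for the RIN, to the best of my understanding); the 17 leading digits used in the example are illustrative, not a real person’s number.

```python
# ISO 7064 MOD 11-2 check digit as used in the 18-digit Resident Identification Number
WEIGHTS = [7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2]
CHECK_CODES = "10X98765432"   # remainder 0..10 maps to this character

def rin_check_digit(first17: str) -> str:
    """Compute the 18th character of a Resident Identification Number."""
    total = sum(int(d) * w for d, w in zip(first17, WEIGHTS))
    return CHECK_CODES[total % 11]

print(rin_check_digit("11010519491231002"))   # illustrative digits -> 'X'
```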
Security offices at county-level local governments issue the resident identification cards to individuals upon application no later than age 16. Data collected include name, gender, race, birthdate, and residential address. The resident identification cards may be valid permanently or for a time period as short as 5 years, depending on the age of the applicant. According to official announcements, the RIN is also used to track individual health records in the National Electronic Health Record System in China [20].
Business Identification and Industry Classification Codes of the U.S. and China
An Employer Identification Number (EIN) is to a business in the U.S. what the SSN is to an individual [21]. However, a “business” in this case may also be a local, state, or federal government; it may be a company without employees or an individual who must pay withholding taxes on his or her employees. The EIN is a unique 9-digit number assigned by the U.S. Internal Revenue Service (IRS) in the format GG-NNNNNNN, where GG was a numerical geographic code for the location of the business prior to 2001 and the remaining 7 digits have no special meaning. Once issued, an EIN will not be reissued by the IRS. In addition, each state has its own, separate employer identification number for its tax collection and administrative purposes.
Information collected about the business during the EIN application process includes the legal name, trade name, executor name, responsible party name, mailing address, location of the principal business, type of entity or company, reason for application, starting date of business, accounting year, highest number of employees expected in the next 12 months, first date of paid wages, and principal activity of the business [22].
U.S. statistical agencies use the North American Industry Classification System (NAICS) to classify business establishments for the purpose of collecting, analyzing, and publishing statistical data related to the U.S. economy [23]. NAICS was adopted and replaced the Standard Industrial Classification (SIC) system in 1997.
NAICS is a hierarchical classification coding system consisting of 2, 3, 4, 5, or up to 6 numeric digits. The top 2-digit codes represent the major economic sectors such as Construction and Manufacturing. Each 2-digit sector contains a collection of 3-digit subsectors, each of which in turn contains a collection of 4-digit industry groups. For example, 31-33 is the Manufacturing sector for which the following hierarchy exists for the Rice Milling industry:
311 Food Manufacturing
3112 Grain and Oilseed Milling
31121 Flour Milling and Malt Manufacturing
311212 Rice Milling
One of the strengths of the hierarchical system is that aggregation can be performed easily up the chain. For example, the sum over all companies whose codes begin with 311 forms the 311 Food Manufacturing subsector for the U.S., as illustrated below.
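Here is a minimal sketch of such a prefix-based roll-up; the companion codes and the figures attached to them are hypothetical, for illustration only.

```python
from collections import defaultdict

# Hypothetical establishments: (NAICS code, some measure such as shipments)
establishments = [
    ("311212", 120.0),   # Rice Milling
    ("311211", 340.0),   # Flour Milling
    ("311230", 210.0),   # Breakfast Cereal Manufacturing
    ("312111", 500.0),   # Soft Drink Manufacturing
]

def roll_up(records, digits):
    """Aggregate a measure to the NAICS level defined by the first `digits` digits."""
    totals = defaultdict(float)
    for code, value in records:
        totals[code[:digits]] += value
    return dict(totals)

print(roll_up(establishments, 3))   # {'311': 670.0, '312': 500.0}
print(roll_up(establishments, 4))   # {'3112': 670.0, '3121': 500.0}
```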
Consistent creation and assignment of NAICS codes is a challenge in a global, dynamic economy where obsolete industries may disappear and new industries may spawn and grow overnight. Examples of challenging industries include the “high technology” industries of the past and the more recent “green” industries. Application of the NAICS codes is subject to interpretation and consistency issues. For example, the business frames maintained by the U.S. Census Bureau and the U.S. Bureau of Labor Statistics disagree because of differences in data sources and in the assignment of NAICS codes [10]. Inconsistent use of NAICS codes disrupts or even invalidates analysis and interpretation of time series or longitudinal data.
A new business in China must apply to the local Quality and Technical Supervision Office for a 9-digit National Organization Code, which contains 8 digits and 1 check digit [22]. The Chinese regulation, GB 11714-1997 on Rules of Coding for the Representation of Organizations, is patterned after the international standard ISO 6523, Information Technology – Structure for the Identification of Organizations and Organization Parts [25]. Online directories exist for looking up information about an organization based on its National Organization Code [26].
The value of the Chinese Industrial Statistical Dataset is well recognized by economists and other analysts both domestically and internationally. Substantial resources were invested in the construction and maintenance of this comprehensive data system, which has described almost all state-owned and large enterprises (annual sales of over RMB ¥5 million until 2010 and over RMB ¥20 million thereafter) in China longitudinally since 1998. However, serious data quality problems have been reported, and the primary cause can be traced to the inconsistent and incorrect application of the identification codes [27]. This situation exists even though China started standardizing its organization codes in 1989 and is currently in the third phase of implementation [28].
As recently as last month, Guangdong province announced its commitment to use a shared platform based on the National Organization Codes as part of its campaign to combat corruption [29].
China also has a standard industry classification system under GB/T 4754-2002 [30]. The hierarchical system has four levels, with the highest level indicated by a single letter and the lower levels represented by 2, 3, and 4 digits respectively. For the previous example of Rice Milling, the Chinese classification system provides the following hierarchy:
C Manufacturing
C13 Food Manufacturing
C131 Grain Milling
C1312 Rice Milling
Summary
As technology continues to evolve and grow, larger amounts of digitized data will be collected more rapidly at relatively low cost. This is what characterizes the Big Data era.
These Big Data contain unprecedented amounts of information. If integrated and structured, their value and power will increase exponentially beyond what any existing statistical system has been able to provide. Identification codes that facilitate the linking and merging of records hold the key to unlocking this enormous trove of opportunities.
As the gateway to the enormous power of Big Data, identification codes may also be the primary cause of system failures, misuses and abuses, and even fraudulent or criminal activities, if they are not properly applied and managed.
The practical challenges of applying an identification code are complex. In addition to technology, statistical design and quality feedback loops, proper education and training, effective policies and regulations, and public awareness are all needed for the effective and responsible use of identification codes. These topics will be discussed in future papers.
[4] U.S. Social Security Administration. Fifty Years of Operations in the Social Security Administration, by Michael A. Cronin, June 1985. Social Security Bulletin, Volume 48, Number 6. Available at http://www.ssa.gov/history///cronin.html on April 29, 2013.
[5] U.S. Social Security Administration. The Story of the Social Security Number, by Carolyn Puckett, 2009. Social Security Bulletin, Volume 69, Number 2. Available at http://www.ssa.gov/policy/docs/ssb/v69n2/v69n2p55.html on April 29, 2013.
[12] U.S. Social Security Administration. Application for a Social Security Card, Form SS-5. Available at http://www.ssa.gov/online/ss-5.pdf on April 29, 2013.
[22] U.S. Internal Revenue Service. Form SS-4: Application for Employer Identification Number. Available at http://www.irs.gov/pub/irs-pdf/fss4.pdf on April 29, 2013.
[26] National Administration for Code Allocation to Organizations. National Organization Code Information Retrieval System, 全国组织机构信息核查系统. Available at http://www.nacao.org.cn/ on April 29, 2013.
[27] Nie, Huihua; Jiang, Ting; and Yang, Rudai. A Review and Reflection on the Use and Abuse of Chinese Industrial Enterprises Database. World Economics, Volume 5, 2012. Available at http://www.niehuihua.com/UploadFile/ea_201251019517.pdf on April 29, 2013.
[28] National Administration for Code Allocation to Organizations. Historical Development of National Organization Codes, 全国组织机构代码发展历程. Available at http://www.nacao.org.cn/publish/main/236/index.html on April 29, 2013.
A frame identifies all the known units in a population from which a census can be conducted or a random sample can be drawn, providing the structural foundation for the extraction of maximum, reliable information from designed statistical studies with the support of established statistical theories. The significance of the Big Data era is that most data are now digitized, easily stored, and processed in large quantity at relatively low cost. Big Data offers unprecedented opportunities for statisticians to rethink and innovate. Among the many possibilities offered by Big Data is the creation and maintenance of Dynamic Frames – frames that are rich in content, capture the most up-to-date data as soon as they become available, and produce results and reports in real time on demand.
Traditional Population and Frame
A population is an important concept in the study of statistics. It is commonly understood to be an entire collection of items of interest, be it a nation’s people or businesses, a day’s production of light bulbs, or an ocean’s fish [1,2,3].
A less well-known term is a frame, or a list of the units that cover the entire population with its identification system. A frame is the working definition of a population under study. It identifies all the known units in a population from which a census can be conducted or a random sample can be drawn, providing the structure for statistical description and analysis about the population [2,4,5].
Figure 1
Figure 1 shows a flow chart of a conventional statistical study by census or random sample. Quoting from [4], an ideal frame should have the following qualities:
All units have a logical, numerical identifier
All units can be found – their contact information, map location or other relevant information is present
The frame is organized in a logical, systematic fashion
The frame has additional information about the units that allow the use of more advanced sampling frames
Every element of the population of interest is present in the frame
Every element of the population is present only once in the frame
No elements from outside the population of interest are present in the frame
The data is “up-to-date”
Modeling may be considered part of a sampling process, sometimes bypassing the need for a frame by assuming that the model and data adequately represent the underlying population.
Practicing statisticians understand the importance of frames – they are the structural foundation for extracting maximum, reliable information from designed statistical studies with the support of established statistical theories. However, there are few statistical papers or forums that discuss best practices for creating and maintaining a frame, primarily because it is viewed as an administrative or clerical task.
Many lament how difficult it is to obtain or maintain a good frame, or recount their bitter experience of working with incomplete or error-prone frames. Indeed, poor-quality frames may prevent a well-planned statistical study from taking place at all, or produce misleading or biased results.
Inadequate attention to the creation and maintenance of a flexible, up-to-date, and dynamic population frame has been costly to the statistics profession and the U.S. in terms of efficiency and innovation.
For example, according to [6], although “an accurate and complete address list is a critical ingredient in all U.S. Census Bureau surveys and censuses,” each program prepared its own separate list until the concept of a national frame was advanced, not even 20 years ago, in the form of the Master Address File (MAF).
The MAF is used primarily to support mail delivery of questionnaires [7], which is increasingly an outdated mode of information collection. It is also relied upon heavily for follow-up visits to non-respondents, at a time when rising labor costs meet tight budget constraints. Web-based questionnaire delivery and data submission were not allowed in the latest 2010 decennial census in the U.S., and the MAF is not designed to promote or support web-based applications.
The arrival of the Big Data era seems to have caught the statistics profession in a deer-in-the-headlights moment. Even as statistician is hailed as “the sexiest job for the next 10 years” and beyond [8], the profession is still wondering why statistics is undervalued and left out, and is still searching for the role it should play in the Big Data era [9].
Only a few seem to recognize that statistics is “the science of learning from data” [10], regardless of how big or small the data are, and that the moment has arrived for the profession to join the revolution and remain relevant in the future.
Statistics 2.0: Dynamic Frames
Big Data is a relative concept. Tomorrow’s Big Data will be bigger than today’s. If statisticians consider only the size of the data, the impact of Big Data will be limited to scaling existing software and methods.
The significance of the Big Data era is that most data are now digitized, including sound, images, and handwriting [e.g., 11], much of which was never available before. These data can be stored and processed in large quantities at relatively low cost. Today’s consumers of statistics are far more numerous and less interested in technical details, yet they want comprehensive, reliable, easy-to-use information quickly and readily.
Big Data is as much a revolution in information technology as it is an opportunity for the advancement of statistics, because it offers unprecedented openings for statisticians to rethink their systems and operations and to innovate.
For example, mathematical statistics clearly demonstrates that a 5 percent random sample is superior to a 5 percent non-random sample. But how does it compare to a 50 percent or a 95 percent non-random sample? We have continued to caution against, condemn, or dismiss large non-random samples, but have done little to go beyond the existing framework of mathematical statistics. Is there not a point, which may vary from case to case, at which the sheer size of a non-random sample offsets its inherent statistical bias enough for the sample to become practically acceptable and meaningful?
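One way to explore this question empirically is a small simulation: compare the error of a 5 percent simple random sample with that of larger self-selected samples whose inclusion probabilities depend slightly on the value being measured. The sketch below is illustrative only; the population, the bias mechanism (bias_strength), and the sample fractions are assumptions, and the answer will differ from case to case.

```python
# Illustrative simulation (assumed numbers): error of a small random sample
# versus large self-selected samples for estimating a population mean.
import numpy as np

rng = np.random.default_rng(0)
N = 100_000
population = rng.normal(loc=50.0, scale=10.0, size=N)
true_mean = population.mean()
z = (population - true_mean) / population.std()


def random_sample_mean(frac):
    """Mean of a simple random sample containing `frac` of the population."""
    idx = rng.choice(N, size=int(frac * N), replace=False)
    return population[idx].mean()


def nonrandom_sample_mean(target_frac, bias_strength=0.05):
    """Mean of a self-selected sample: units with larger values are slightly
    more likely to be included, mimicking found or volunteered data."""
    p = np.clip(target_frac + bias_strength * z, 0.0, 1.0)
    included = rng.random(N) < p
    return population[included].mean()


def rmse(errors):
    return float(np.sqrt(np.mean(np.square(errors))))


reps = 200
print("5%  random     RMSE:", rmse([random_sample_mean(0.05) - true_mean for _ in range(reps)]))
print("50% non-random RMSE:", rmse([nonrandom_sample_mean(0.50) - true_mean for _ in range(reps)]))
print("95% non-random RMSE:", rmse([nonrandom_sample_mean(0.95) - true_mean for _ in range(reps)]))
```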
As another example, as long as Figure 1 remains the typical process for conducting statistical studies in a sequential and cross-sectional manner, there is little room for innovation to reduce turnaround time or to introduce new metrics, such as measures of longitudinal change at the unit level [12]. Is it truly impossible to produce accurate and reliable statistical results in real time? Or have we become so comfortable with the present software, approaches, and conveniences that there is no desire to consider other possibilities?
Random sampling has been the dominant mode of statistical operation for a century [13]. Because of Big Data, one may now study an entire population almost as easily as one can study a random sample today. Should we ignore this opportunity?
If statisticians do not recognize or embrace the challenges of theory and practice posed by Big Data as part of the core of studying and practicing statistics, the risk is high that others, including the yet-undefined “data scientists,” will fill the void [14].
Among the many possibilities offered by Big Data is the creation and maintenance of Dynamic Frames – population frames that are rich in content, capture the most up-to-date data as soon as they become available, and produce results and reports according to established schedules or even in real time.
With user bases that exceed one billion members, e-commerce companies and social media platforms are well positioned to apply data from their online transactions, emails, and blog postings to conduct market research and perform predictive analyses. A lay person may also capture some of these data, though in a less structured manner.
Figure 2. Schematic of a Dynamic Frame.
Figure 2 provides a simple schematic of how Dynamic Frames may work; similar designs are described as longitudinal data systems in educational applications in the U.S. [15,16].
In essence, primary effort goes into the creation and maintenance of the frame so that it satisfies, as far as possible, the qualities identified earlier. The frame is then constantly updated with new data for every sampling unit over time.
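As a rough illustration of the idea, the sketch below models a dynamic frame as a set of unit records keyed by an identification code: new data are absorbed for any unit as soon as they arrive, and a census or random sample can be drawn from the current frame at any time. The class, method, and field names are hypothetical, not drawn from any production system.

```python
# Hypothetical sketch of a "dynamic frame" keyed by a unit identifier.
import random
from datetime import datetime


class DynamicFrame:
    def __init__(self):
        self.units = {}  # unit_id -> {"attributes": {...}, "last_updated": datetime}

    def update(self, unit_id, **attributes):
        """Create or refresh a unit record as soon as new data become available."""
        record = self.units.setdefault(unit_id, {"attributes": {}, "last_updated": None})
        record["attributes"].update(attributes)
        record["last_updated"] = datetime.now()

    def census(self):
        """Return every known unit (the full frame)."""
        return dict(self.units)

    def sample(self, n, seed=None):
        """Draw a simple random sample of unit identifiers from the current frame."""
        return random.Random(seed).sample(sorted(self.units), min(n, len(self.units)))


frame = DynamicFrame()
frame.update("unit-001", address="101 Main St", employees=12)
frame.update("unit-002", address="33 Oak Ave", employees=4)
frame.update("unit-001", employees=13)  # longitudinal change captured at the unit level
print(frame.sample(1, seed=42), len(frame.census()))
```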
Statisticians must be fully engaged in the design, implementation, and operation of Dynamic Frames, in addition to the production of descriptive and analytical results. There are many new and traditional functions to which statisticians can make major contributions.
For example, the identification code is the key to unlocking the enormous power in Big Data. It controls the extent to which additional records and data may be linked, largely determines the overall quality of the data and the study, and is the first safeguard for protecting confidentiality.
As another example, there is no conceivable limit to the size and content of the units; they depend only on the availability of data, the ability to link and match records, and the design of the system. Effective operation minimizes mismatched records and the collection of duplicative data that do not change or that change in a predictable manner. Appropriate replacement or imputation of missing values ensures quality and the timely integration of data.
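A minimal sketch of two of these functions follows, under assumed source and field names: linking records from different sources on the identification code, and carrying a unit’s last known value forward when a current value is missing (one simple form of imputation).

```python
# Hypothetical sources keyed by the same identification code.
tax_records = {"unit-001": {"revenue": 1_200_000}, "unit-002": {"revenue": 250_000}}
survey_records = {"unit-001": {"employees": 13}, "unit-003": {"employees": 7}}


def link_on_id(*sources):
    """Merge records that share the same identification code across sources."""
    linked = {}
    for source in sources:
        for unit_id, fields in source.items():
            linked.setdefault(unit_id, {}).update(fields)
    return linked


def impute_last_value(current, previous, field):
    """Carry a unit's last known value forward when the current value is missing."""
    if current.get(field) is None and field in previous:
        current[field] = previous[field]
    return current


print(link_on_id(tax_records, survey_records)["unit-001"])   # fields joined on the identifier
print(impute_last_value({"revenue": None}, {"revenue": 1_150_000}, "revenue"))
```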
Other enhancements of traditional statistical functions [14] include, but are not limited to: establishing continuous quality feedback loops to the data sources; developing new definitions, metrics, and standards for dynamic frames; applying new statistical models for imputation, profiling, risk assessment, and artificial intelligence; developing innovative visualizations; improving statistical training and education; and protecting confidentiality.
Summary
Dynamic frames will retain their original purpose as a list of known units for conducting censuses and drawing random samples as needed, but the potential uses of structured Big Data are limited only by the imagination and innovative spirit of the statistics profession. Statisticians need to embrace Big Data as their own revolution, one that will lead to the next level of human knowledge and practice through the study and use of data.
[1] Hansen, Morris H.; Hurwitz, William N.; and Madow, William G. (1953). Sample Survey Methods and Theory. Wiley Classics Library Edition, John Wiley & Sons, Inc.
[2] Kish, Leslie (1965). Survey Sampling. Wiley Classics Library Edition, John Wiley & Sons, Inc.
[3] Cochran, William G. (1977). Sampling Techniques. Third Edition, A Wiley Publication in Applied Statistics, John Wiley & Sons, Inc.
[8] Varian, Hal. Hal Varian explains why statisticians will be the sexy job in the next 10 years, September 15, 2009. YouTube. Available at http://www.youtube.com/watch?v=pi472Mi3VLw, accessed April 8, 2013.
[12] Diggle, Peter J.; Heagerty, Patrick J.; Liang, Kung-Yee; and Zeger, Scott L. (2001). Analysis of Longitudinal Data. Second Edition, Oxford University Press.
[13] Wu, Jeremy S. (1995). One Hundred Years of Sampling, Chinese translation by Zhang, Yaoting and Yu, Xiang; invited paper in Sampling Theory and Practice, ISBN 7-5037-1670-3. China Statistical Publishing Company.
[16] U.S. Department of Education. Statewide Longitudinal Data Systems Grant Program, National Center for Education Statistics. Available at http://nces.ed.gov/programs/slds/, accessed April 8, 2013.