Categories
General Statistics

Advancing Smart City Development in China: Small Statistics Are Imperative

胡善庆 王琼 刘真

This blog, originally in simplified Chinese, describes the status of and need for statistical monitoring in Smart City development in China; it includes an interactive map of the 291 test locations.

China's modern system of cities originates from Article 30 of Chapter I of the Chinese Constitution. Cities fall mainly into three tiers: municipalities directly under the central government, prefecture-level cities, and county-level cities. Prefecture-level administrative regions include 30 autonomous prefectures, of which 22 have their seats in county-level cities and the remaining 8 in counties.

Since the reform and opening up, various sub-prefecture-level cities and provincially administered cities have emerged, but these remain informal administrative levels.

According to the National New-type Urbanization Plan (2014-2020), China had a total of 658 cities in 2010. By June 2015, the documented number of Chinese cities had grown to 670: 4 municipalities, 291 prefecture-level cities, and 375 county-level cities. Shandong has the most cities of any province with 48, followed by Guangdong with 44.

A smart city uses advanced information technology, with people at its center, to achieve intelligent urban management and operations. It is an ideal pursued by many nations in the 21st century, and a major goal of China's plan to build a moderately prosperous society by 2020.

According to reports, planned investment in smart cities nationwide during the 12th Five-Year Plan period was expected to exceed 1.6 trillion yuan. Some estimates put the future size of China's smart city market at up to 4 trillion yuan.

1. Significant Variation in the Smart City Landscape

Between January 2013 and April 2015, China's Ministry of Housing and Urban-Rural Development and Ministry of Science and Technology announced, in three batches, a total of 291 smart city pilots, including a batch of pilots with expanded scope.

The 291 smart city pilots do not represent 291 distinct cities. For example, China's four municipalities together account for 24 pilots.

Among the three tiers of cities, the municipalities have the highest participation rate and pilot density, followed by prefecture-level cities. On the other hand, 5 pilots currently belong to 4 county-level administrative regions (Wenchuan County in Sichuan, Fuyun County in Xinjiang, Pingtan County in Fujian, and Hainan Prefecture in Qinghai).

Of China's 670 cities, 210 participate in the smart city pilots, a participation rate of 31.3%. Among prefecture-level cities, the participation rate exceeds half (53.3%).
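The participation rates above reduce to simple arithmetic; a quick sketch (the count of 155 participating prefecture-level cities is inferred from the 53.3% rate, not stated directly in the text):

```python
# Recompute the smart city pilot participation rates cited above.
# 155 participating prefecture-level cities is inferred from the 53.3% rate.
participating_cities = 210
total_cities = 670
overall_rate = participating_cities / total_cities

prefecture_participants = 155
prefecture_cities = 291
prefecture_rate = prefecture_participants / prefecture_cities

print(f"overall: {overall_rate:.1%}, prefecture-level: {prefecture_rate:.1%}")
# overall: 31.3%, prefecture-level: 53.3%
```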

The four municipalities differ the most in their smart city pilots. Beijing has 11 pilots; Shanghai has only 1. In Anhui, nearly 60% of cities participate, but each city has essentially only one pilot. In Zhejiang, fewer than a quarter of cities participate, but most participating cities have more than one pilot.

2. Six Development Directions Are Clear, but Monitoring and Analysis Are Lacking

Because smart city pilots have a development period of 3 to 5 years, some of the first batch should already be nearing the turning point from concept to reality.

The Plan (2014-2020) actively promotes smart city construction and sets out six development directions, the most essential of which is broadband information networks. Can residents and businesses get online? Can they get online quickly, in volume, and cheaply? Simply put, without Internet access there is no smart city. The other five directions can only be measured after access is achieved, in terms of whether residents and businesses can obtain timely, reliable, and high-quality government information and services.

Chapter 31 of the Plan (2014-2020) notes the importance of sound monitoring and evaluation. Beyond strengthening urbanization statistics, it calls for dynamic monitoring and tracking analysis, a mid-term evaluation of the plan, and special-topic monitoring to support the plan's smooth implementation.

Unfortunately, to date none of the 291 pilots has produced any systematic, or even special-topic, dynamic statistical monitoring and tracking analysis of smart city development. Nor have central agencies announced any plan to issue regular or real-time statistical monitoring reports on the pilot batches.

Meanwhile, problems such as information silos, duplicated construction, wasted resources, and vanity projects have surfaced one after another. Some of the "ghost cities" reported to be in crisis are among the 291 smart city pilots. The need for a sound monitoring system seems increasingly evident.

3. Regulating Smart City Oversight Can Start with Small Statistics

Managing smart cities does not require Big Data at the outset; one can start with small statistics, moving from the shallow to the deep and from the simple to the complex. The broadband goals, for example, are already clear, and processing ordinary administrative records should yield these small statistics, which can be published online regularly or even in real time.

Last September, the Ministry of Industry and Information Technology and the Ministry of Science and Technology designated 39 cities (and city clusters) as "Broadband China" demonstration cities, of which at least 32 are already smart city pilots. The indicators in "Broadband China" are more current and more fine-grained.

2014-2015 is the promotion and popularization phase of "Broadband China." The focus is on continuing to raise broadband speeds while expanding network coverage and scale and deepening adoption. The optimization and upgrade phase should begin in 2016.

By the end of 2015, the quantitative targets of "Broadband China" include: more than 270 million fixed broadband subscribers; fixed broadband penetration of 65% for urban households and 30% for rural households; more than 450 million 3G/LTE subscribers, a penetration rate of 32.5%; broadband in 95% of administrative villages; urban household broadband access of essentially 20 Mbps, reaching 100 Mbps in some developed cities, and 4 Mbps for rural households; 3G coverage of essentially all urban and rural areas, LTE in commercial use at scale, and full WLAN hotspot coverage of public areas; and 850 million Internet users.

How many of the current smart city pilots will meet each of these targets by the end of this year?
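Answering that question requires only a "small statistics" comparison of observed indicator values against the 2015 targets. A minimal sketch, where the target numbers come from the "Broadband China" list above and the observed values for one pilot city are hypothetical placeholders (a real report would draw them from administrative records):

```python
# 2015 "Broadband China" quantitative targets (percentages, from the text).
targets = {
    "urban_fixed_broadband_penetration_pct": 65.0,
    "rural_fixed_broadband_penetration_pct": 30.0,
    "3g_lte_penetration_pct": 32.5,
    "admin_villages_with_broadband_pct": 95.0,
}

# Hypothetical observed values for one example pilot city.
observed = {
    "urban_fixed_broadband_penetration_pct": 68.2,
    "rural_fixed_broadband_penetration_pct": 27.5,
    "3g_lte_penetration_pct": 35.0,
    "admin_villages_with_broadband_pct": 96.1,
}

def status_report(targets, observed):
    """Return {indicator: 'met'/'not met'} by comparing observed to target."""
    return {k: ("met" if observed[k] >= v else "not met") for k, v in targets.items()}

for indicator, status in status_report(targets, observed).items():
    print(f"{indicator}: {status}")
```

Published regularly, even a table this simple would constitute the dynamic monitoring the Plan calls for.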

If a statistical evaluation system is not launched now, with dynamic monitoring and tracking analysis put into practice, then when?

Categories
Big Data Statistics 2.0

2014 Workshop on Big Data and Urban Informatics

After more than a year of preparation, the Workshop on Big Data and Urban Informatics was held at the University of Illinois at Chicago on August 11-12, 2014.

More than 150 persons from at least 10 countries (Australia, Canada, China, Greece, Israel, Italy, Japan, Portugal, United Kingdom, and the U.S.) attended the forum sponsored by the National Science Foundation.  

Piyushimita (Vonu) Thakuriah, co-chair for the workshop, reported on the funding of Urban Big Data Center at the University of Glasgow in Scotland (http://bit.ly/1kXG2Uh).  Its mission is to “support research for improved understanding of urban challenges and to provide data, technology and services to manage, make policy, and innovate in cities.”  The Urban Big Data Center partners with five other universities including the University of Illinois at Chicago. Vonu, a transportation expert, is the director of the center.

In the course of two full days, 68 excellent presentations were made in total, far exceeding the expectations of the organizers a year ago.  These papers will be posted on the web in the near future.  

Two luncheon keynote speakers highlighted the workshop.  

Carlo Ratti presented the state-of-the-art work of the MIT SENSEable City Lab, which specializes in the deployment of sensors and hand-held electronics to study the environment.  Since conventional measures of air quality tend to be collected at stationary locations, they do not always represent the exposure of a mobile individual.  In one project titled “One Country, Two Lungs” (http://bit.ly/1nbSBXi), a team of human probes travelled between Shenzhen and Hong Kong to detect urban air pollution.  The video revealed the divisions in atmospheric quality and individual exposure between these two cities. 

Paul Waddell of the University of California at Berkeley presented his work on urban simulation and dynamic 3-D visualization of land use and transportation.  Some of his impressive work images can be found at http://bit.ly/1rn9hmj.  His video and examples reminded me about their potential applicability for creating the “Three Districts and Four Lines” in China’s National Urbanization Plan.  I also learned about a somewhat similar set of products from China’s supermap.com, a Geographic Information System software company based in Beijing. 

One of the 68 presentations described the use of smart card data to study commuting patterns and volume in the Beijing subway during rush hours.  Another presentation compared the characteristics of big data and statistics and raised the question of whether big data is a supplement to or a substitute for statistics. 

The issue of data quality was seldom volunteered in the sessions, but questions about it came up frequently.  Amid talk of editing, filtering, cleaning, scrubbing, imputing, curating, re-structuring, and many other terms, it was clear that some presenters spent an enormous amount of time and effort just to get the data ready for very basic use.

Perhaps data quality is considered secondary in exploratory work.  However, there are good-quality big data and bad-quality big data.  When other options are available, spending too much time and effort on bad-quality big data seems unwise because it serves no practical future purpose.

There were also few presentations that discussed the importance of data structure, whether it is already built in as design or created through metadata.  Structured data contain far more potential information content than unstructured ones and tend to be more efficient and optimal in information extraction, especially if they have the capability to be linked across multiple sources.  

For the purpose of governance, I was somewhat surprised that the use of administrative records had not yet caught on at this workshop.  Accessibility and confidentiality appeared to be barriers.  It would seem helpful for future workshops to include city administrators and public officials to help bridge the gap between research and the practical needs of day-to-day operations.  

Nations and cities share a common goal in urban planning and urban informatics – improve the quality of city life and service delivery to constituents and businesses alike.  On the other hand, there are drastic differences in their current standing and approach.

China is experiencing the largest human migration in history.  It has established goals and direction for urban development, but has little reliable, quantitative research or experience to support and execute its plans.  The West is transitioning from its century-old urban living to a future that is filled with exciting creativity and energy, but does not seem to have as clear a vision or direction.

Confidentiality is an issue that contrasts sharply between China and the West.  The Chinese plans show strong commitment to collecting and merging linkable individual records extensively.  If implemented successfully, they will generate an unprecedented amount of detailed information that can also be abused and misused.  The same approach would likely face much scrutiny and opposition in the West, which has to consider less reliable but more costly alternatives to meet the same needs. 

There is perhaps no absolute right or wrong approach to these issues.  The workshop and the international community being created offer a valuable opportunity to observe, discuss, and make comparisons in many globally common topics. 

Selected papers from the workshop will now undergo additional peer review.  They will be published in an edited volume titled “See Cities Through Big Data – Research, Methods and Applications in Urban Informatics.”

Categories
General Statistics

Smoking Statistics in the U.S. and China

The U.S. Surgeon General released a landmark report on smoking and health in 1964, concluding that smoking caused lung cancer.  At that time, smoking was at its peak in the U.S. – more than half of the men and nearly one-third of the women were reported to be smokers.
 
The U.S. Surgeon General released another report [1] in June this year, titled “The Health Consequences of Smoking – 50 Years of Progress.”
 
A time plot based on the recent report [2] shows the trend of one statistic – adult per capita cigarette consumption – for the period of 1900-2012.  It reveals the rise of smoking in the U.S. in the first half of the 20th century, coinciding with the Great Depression and two world wars when the government supplied cigarettes as rations to soldiers.  There has been a steady decline in the last 50 years.
 
When the 1964 report was released, an American adult smoked more than 4,200 cigarettes a year on average.   Today it is less than 1,300.  About 18% of Americans smoked in 2012, down from 42% overall in 1964.  The difference between male and female smokers is relatively small – men at 20% and women at 16%.  According to a 2013 Gallup poll [3], 95% of the American public believed that smoking is very or somewhat harmful, compared with only 44% of Americans who believed that smoking causes cancer in 1958.
 
After the release of the 1964 report, Congress required all cigarette packages to carry a health warning label in 1965.  Cigarette advertising on television and radio was banned effective in 1970.  Taxes on cigarettes were raised; treatments for nicotine addiction were introduced; the non-smokers' rights movement started.  Together, laws, regulations, public education, treatment, taxation, and community efforts have all played important roles in transforming a national habit into a recognized threat to human health and quality of life over the last 50 years.  It was beyond my wildest imagination that this could happen in my lifetime.
 
Statistics has been at the center of this enormous social change from the beginning of the smoking and health issue.
 
As early as 1928, statistical data began to appear showing a higher proportion of heavy smokers among lung cancer patients [4].  A 10-member advisory committee prepared the 1964 report, spending over a year, along with 150 consultants, to review more than 7,000 scientific articles.  By design, the committee included five non-smokers and five smokers, representing disciplines in medicine, surgery, pharmacology, and STATISTICS.  The lone statistician was William G. Cochran, a smoker who was also a founding member of the Statistics Department at Harvard University and author of two classic books, “Experimental Designs” and “Sampling Techniques.”
 
During the past 50 years, an estimated 21 million Americans have died because of smoking, including nearly 2.5 million non-smokers due to second-hand smoke and 100,000 babies due to parental smoking.
 
There are still about 42 million adult smokers and 3.5 million middle and high school students smoking cigarettes in the U.S. today.  Interestingly, Asian Americans have the lowest rate of smokers at 11% among all racial groups in the U.S.
 
China agreed to join the World Health Organization Framework Convention on Tobacco Control in 2003.  It reported [5] 356 million smokers in 2010, about 28% of its total population and practically unchanged from its 2002 level.  The gender difference was remarkable – 340 million male smokers (96%) and 16 million female smokers (4%).  About 1.2 million people die from smoking in China each year.  Among the remaining over 900 million non-smokers in China, about 738 million, including 182 million children, are exposed to second-hand smoke.   Only 20% of Chinese adults reportedly believed that smoking causes cancer in 2010 [6].
 
More detailed historical records on smoking in China are either inconsistent or fragmented.  One source outside of China [7] suggested that there were 281 million Chinese smokers in 2012 and an increase of 100 million smokers from 1980.
 
China has been stumbling in its efforts to control smoking.
 
According to a 2013 survey by the Chinese Association on Tobacco Control [8], 50.2% of the male school teachers were smokers; male doctors 47.3%; and male public servants 61%.  Given these high rates for their important roles, there is concern and skepticism on how effective tobacco control can be implemented or enforced.
 
Coupled with the institutional issues of its tobacco industry, China has been criticized for ineffective tobacco control.  While some American tobacco companies may be larger, they are not state-owned.  China is the world’s largest tobacco producer and consumer.  Its state-owned monopoly, China National Tobacco Corporation, is the largest company of its type in the world.
 
Nonetheless, the Chinese government has enacted a number of measures to restrict smoking in recent years.  The Ministry of Health took the lead in banning smoking in the medical and healthcare systems in 2009.  Smoking in indoor public spaces such as restaurants, hotels, and public transportation was banned beginning in 2011.
 
According to the Chinese Tobacco Control Program (2012-2015) [9,10], China will ban cigarette advertising, marketing and sponsorship, setting a goal of reducing the smoking rate from 28.1% in 2010 to 25%.
 
Smoking is a social issue common to both the U.S. and China.
 
Statistics facilitates understanding of the status and implications, as well as providing advice, assistance, and guidance for governance.  More statistics can certainly be cited about the ill effects of smoking in both nations.  In the end, it is the collective will and wisdom of each nation that will determine the ultimate course of action.
 
REFERENCES
 
[1] U.S. Department of Health and Human Services. (2014). The Health Consequences of Smoking – 50 Years of Progress: A Report of the Surgeon General.  Retrieved from http://www.surgeongeneral.gov/library/reports/50-years-of-progress/full-report.pdf.
 
[2] Ferdman, Roberto. (2014). The young and poor are keeping big American tobacco alive.  The Washington Post.  Retrieved from http://www.washingtonpost.com/blogs/wonkblog/wp/2014/07/16/the-young-and-poor-are-keeping-the-u-s-tobacco-industry-alive/.
 
[3] Gallup Poll. Tobacco and Smoking. Retrieved from http://www.gallup.com/poll/1717/tobacco-smoking.aspx.
 
[4] National Library of Medicine. Profiles in Science. The Reports of the Surgeon General.  Retrieved from http://profiles.nlm.nih.gov/ps/retrieve/Narrative/NN/p-nid/58.
 
[5] The Central People’s Government of the People’s Republic of China. (2011, January 6) Population of tobacco remains high and not declining; smokers are still over 300 million.  Retrieved from http://www.gov.cn/jrzg/2011-01/06/content_1779597.htm.
 
[6] Xinhuanet.com. (2011, May 2). New smoking ban effective in China. Retrieved from  http://news.xinhuanet.com/english2010/video/2011-05/02/c_13855260.htm.
 
[7] Qin, Amy. (2014, January 9). Smoking Prevalence Steady in China, but Numbers Rise. The New York Times. Retrieved from http://sinosphere.blogs.nytimes.com/2014/01/09/smoking-prevalence-steady-in-china-but-numbers-rise/?_php=true&_type=blogs&_r=0.
 
[8] China News. (2013, December 31). Survey finds over 60% of male public servants smoke; half never quit.  Retrieved from http://www.chinanews.com/sh/2013/12-31/5680798.shtml.
 
[9] The Chinese Ministry of Health. 2013 Report on Tobacco Control in China – Total Prohibition of Tobacco Advertising, Marketing and Sponsorship. Retrieved from http://www.moh.gov.cn/ewebeditor/uploadfile/2013/05/20130531132109426.pdf.
 
[10] China Women’s Federation News. Banning Tobacco Advertising Cannot be Just Paper Planning. Retrieved from http://acwf.people.com.cn/n/2013/0603/c99013-21712571.html.
Categories
Big Data General Statistics Statistics 2.0

Crossing the Stream and Reaching the Sky

In the early stages of its economic reform, China chose to “cross a stream by feeling the rocks.”

Limited by the expertise and conditions of the time, when China had no statistical infrastructure to provide accurate and reliable measurements, the chosen path was the only option.

In fact, this path was traveled by many nations, including the U.S.  At the beginning of the 20th century, when the field of modern statistics had not yet taken shape, data were not believable or reliable even when they existed.  The well-known American writer and humorist Mark Twain once lamented about “lies, damned lies, and statistics,” pointing out the data quality problem of the time.  Over the past hundred years, statistics developed an international common language and reliable data, establishing a long history of success with broad areas of application in the U.S.  This stage of statistics may be generally called Statistics 1.0.

Feeling the rocks may help one cross a stream, but it would be difficult to land on the moon that way, and more difficult still to create smart cities and an affluent society.  If one could scientifically measure the depth of the stream and build roads and bridges, trial and error might be unnecessary.

The long-term development of society must exit this transitional stage and enter a more scientifically-based digital culture where high-quality data and credible, reliable statistics serve to continuously enhance the efficiency, equity and sustainability of national policies. At the same time, specialized knowledge must be converted responsibly to practical useful knowledge, serving the government, enterprises and the people.

Today, technologies associated with Big Data are advancing rapidly.  A new opportunity has arrived to usher in the Statistics 2.0 era.

Simply stated, Statistics 2.0 elevates the role and technical level of descriptive statistics, extends the theories and methods of mathematical statistics to non-randomly collected data, and expands statistical thinking to include facing the future.

One may observe that in a digital society, whether crossing a stream or reaching the sky, from the governance of a nation to the daily life of ordinary people, what was once “unimaginable” is now “reality.”  Driverless cars, drone delivery of packages, and space travel are no longer science fiction.  Although the data about them that can be analyzed in practical settings are still limited, they are within the realistic vision of Statistics 2.0.

In terms of social development, the U.S. and China are actively trying to improve people’s livelihood, enhance governance, and improve the environment. A harmonious and prosperous world cannot be achieved without vibrant and sustainable economies in both China and the U.S., and peaceful, mutually beneficial collaborations between the nations.

Statistics 2.0 can and should play an extremely important role in this evolution.

The WeChat platform Statistics 2.0 will not clog already congested channels with low-quality or duplicative information.  Instead, it values new thinking: sharing a common interest in the study of Statistics 2.0, introducing state-of-the-art developments in the U.S. and China simply and promptly, offering thoughts and discussion on classical issues, exploring innovative applications, and sharing the beauty of the science of data in theory and practice.

WeChat Platform: Statistics 2.0

Categories
Big Data General Statistics Statistics 2.0

Not All Data are Created Equal

Suppose we have data on 60,000 households.  Are they useful for analysis? If we add that the amount of data is very large, like 3 TB or even 30 TB, does it change your answer?
 
The U.S. government collects monthly data from 60,000 randomly selected households and reports on the national employment situation.  Based on these data, the U.S. unemployment rate is estimated to within a margin of sampling error of about 0.2%.  Important inferences are drawn and policies are made from these statistics about the U.S. economy comprised of 120 million households and 310 million individuals.
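The ~0.2% figure can be roughly reproduced with the textbook normal-approximation formula for the sampling error of a proportion. This sketch assumes simple random sampling and an unemployment rate near 5%, ignoring the survey's actual complex design:

```python
import math

def margin_of_error(p, n, z=1.96):
    """95% margin of error for an estimated proportion p from n sampled units."""
    return z * math.sqrt(p * (1 - p) / n)

# With an unemployment rate around 5% and 60,000 sampled households:
moe = margin_of_error(0.05, 60_000)
print(f"{moe:.3%}")  # about 0.17%, i.e. roughly 0.2%
```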
 
In this case, data for 60,000 households are very useful.
 
These 60,000 households represent only 0.05% of all households in the U.S.  If they were not randomly selected, the statistics they generate would contain unknown and potentially large biases.  They would not be reliable for describing the national employment situation.
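A small simulation illustrates the point; the population and its regional concentration of unemployment are invented for illustration:

```python
import random

random.seed(42)

# Invented population of 120,000 households, coded 1 = unemployed, 0 = employed.
# Overall unemployment is 6%, but it is concentrated in the first "region".
population = [1] * 4800 + [0] * 55200 + [1] * 2400 + [0] * 57600

true_rate = sum(population) / len(population)

# A random sample, however small a fraction, is unbiased on average.
random_sample = random.sample(population, 600)
random_est = sum(random_sample) / len(random_sample)

# A convenience sample drawn from one corner of the population is badly biased.
convenience_sample = population[:600]
convenience_est = sum(convenience_sample) / len(convenience_sample)

print(f"true {true_rate:.3f}, random {random_est:.3f}, convenience {convenience_est:.3f}")
```

The convenience sample here estimates 100% unemployment against a true rate of 6%, with nothing in the data themselves to warn of the bias.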
 
In this case, data for 60,000 households are not useful at all, regardless of what the file size may be.
 
Suppose further that the 60,000 households are all located in a small city that has only 60,000 households.  In other words, they represent the entire universe of households in the city.  These data are potentially very useful.  Depending on its content and relevance to the question of interest, usefulness of the data may again range widely between two extremes.  If the content is relevant and the quality is good, file size may then become an indicator of the degree of usefulness for the data.
 
This simple line of reasoning shows that the original question is too incomplete for a direct, satisfactory answer.  We must also consider, for example, the sample selection method, representation of the sample in the population under study, and the relevance and quality of the data relative to a specified hypothesis that is being investigated.
 
The original question of data usefulness was seldom asked until the Big Data era began around 2000, when electronic data became widely available in massive amounts at relatively low cost.  Before then, data were usually collected when driven by a known, specific need, such as an exploration to conduct, a hypothesis to test, or a problem to resolve.  It was costly to collect data.  When data were collected, they were already considered potentially useful for the intended analysis.
 
For example, when the nation was mired in the Great Depression, the U.S. government began to collect data from randomly selected households in the 1930s so that it could produce more reliable and timely statistics about unemployment. This practice has continued to this date.
 
Statisticians initially considered data mining to be a bad practice.   It was argued that without a prior hypothesis, false or misleading identification of “significant” relationships and patterns is inevitable when “fishing,” “dredging,” or “snooping” data aimlessly.  An analogy is the over-interpretation of a person winning a lottery: not necessarily because the person possesses any special skill or knowledge about winning, but because random chance dictates that someone must eventually win.
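The lottery analogy can be demonstrated directly: under a true null hypothesis a p-value is uniformly distributed on [0, 1], so testing many unrelated hypotheses at the 5% level manufactures "significant" findings by chance alone. A minimal simulation:

```python
import random

random.seed(7)

n_tests = 1000
alpha = 0.05

# Each "test" is simulated by its p-value directly: when the null hypothesis
# is true, the p-value is uniform on [0, 1].
p_values = [random.random() for _ in range(n_tests)]
false_positives = sum(p < alpha for p in p_values)

print(f"{false_positives} of {n_tests} true null hypotheses look 'significant'")
# Expected around alpha * n_tests = 50 spurious findings.
```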
 
Although the argument of false identification remains valid today, it has also been overwhelmed by the abundance of available Big Data that are frequently collected without design or even structure.  Total dismissal of the data-driven approach bypasses the chance of uncovering hidden, meaningful relationships that have not been or cannot be established as a priori hypotheses.  An analogy is the prediction of hereditary disease and the study of potential treatment.  After data on the entire human genome are collected, they may be explored and compared for the systematic identification and treatment of specific hereditary diseases.
 
Not all data are created equal, nor are they equally useful.
 
Complete and structured data can create dynamic frames that describe an entire population in detail over time, providing valuable information that has never been available in previous statistical systems.  On the other hand, fragmented and unstructured data may not yield any meaningful analysis no matter how large the file size may be.
 
As problem solving rapidly expands from a hypothesis-driven paradigm to include a data-driven approach, the fundamental questions about the usefulness and quality of the data have also grown in importance.  While the question of study interest may not be specified a priori, establishing it after the data are collected is still necessary before conducting any analysis.  We cannot obtain a correct answer to a non-existent question.
 
How are the samples selected?  How much does the sample represent the universe of inference?  What is the relevance and quality of data relative to the posterior hypothesis of interest?   File size has little to no meaning if the usefulness of data cannot even be established in the first place.  
 
Ignoring these considerations may lead to the need to update a well-known quote: “Lies, Damned Lies, and Big Data.”
Categories
Big Data Statistics Statistics 2.0

Lying with Big Data

About 45 years ago, I spent a whopping $1.95 on a little book titled “How to Lie with Statistics.”

Besides the catchy title, its bright orange cover has a comic character sweeping numbers under a rug.  Darrell Huff, a magazine editor and a freelance writer, wrote the book in 1954.  It went on to become the most popular statistics book in the world for more than half a century.  A translated version was published in China around 2002.

It takes only a few hours to read the entire book of about 140 pages and 80 pictures leisurely, but it was a major reason why I pursued an education and a professional career in statistics.

The corners of the book are now worn; the pages have turned yellow.  One can identify some of the social changes in the last 60 years from the book.  For example, $25,000 is no longer an enviable annual salary; few of today’s younger generation may know what a “telegram” was; “gay” has a very different meaning now; and “African Americans” has replaced “Negroes” in daily usage.  As indicative of the bygone era, the image of a cigar, a cigarette, or a pipe appeared in at least one out of every five pictures in the book – even babies were puffing away in high chairs.  The word “computer” did not show up once among its 26,000 words.

Huff’s words were simple, but sharp and direct.   He provided example after example that the most respected magazines and newspapers of his time lie with statistics, just like the dreadful “advertising man” and politician.

According to Huff, most humans have “a bias to favor, a point to prove, and an axe to grind.”  They tend to over- or under-state the truth in responding to surveys; those who complete surveys are systematically different from those who do not respond; and built-in partiality occurs in the wording of a questionnaire, appearance of an interviewer, or interpretation of the results.

In Huff's day there were no desktop computers or mobile devices; statistical charts and infographics were drawn by hand; and data collection, especially complete counts like a census, was difficult and costly.  Huff conjectured, and the statistics profession has concurred, that the only reliable small sample is one that is random and representative, with all sources of bias removed.

Calling anyone a liar was harsh then, and it still is now.  The dictionary definition of a lie is a false statement made with deliberate intent to deceive.  Huff considered lying to include chicanery, distortion, manipulation, omission, and trickery; ignorance and incompetence were only excuses for not recognizing them as lies.  One may also lie by selectively using a mean, a median, or a mode to mislead readers although all of them are correct as an average.

No matter how broadly or narrowly lies may be defined, it cannot be denied that people do lie with statistics every day.  To some media’s credit, there are now fact-checkers who regularly examine stories or statements, most of them based on numbers, and evaluate their degree of truthfulness.

In the era of Big Data, lies occur in higher velocity with bigger volume and greater variety.

Moore’s law is not a legal, physical, or natural law, but a loosely-fitted regression equation in logarithmic scale.  Each of us has probably won the Nigerian lottery or its variations via email at least a few times.  While measures for gross domestic products or pollution are becoming more accurate because of Big Data, nations liberally use their aggregate or per capita average, depending on which favors their point of view.

Heavy mining of satellite, radar, audio messages, sensor, and other Big Data may one day solve the tragic mystery of Malaysian Flight MH370, but the many pure speculations, conspiracy theories, accusations of wrongdoing, and irresponsible lies quoting these data have mercilessly added anguish and misery to the families of the passengers and the crew.  No one seems to be tracking the velocity, volume and variety of the false positives that have been generated for this event, or other data mining efforts with Big Data.

The responsibility is of course not on the data; it is on the people.  There is the old saying that “figures don’t lie, but liars figure.”  Big Data – in terms of advancing technology and availability of some massive amount of randomly and non-randomly collected electronic data – will undoubtedly expand the study of statistics and bring our understanding and governance to new heights.

Huff observed that “without writers who use the words with honesty and understanding and readers who know what they mean, the result can only be semantic nonsense.”  Today many statisticians are still using terms like “Type I error” and “Type II error” in promoting statistical understanding, while these concepts and underlying pitfalls are seldom mentioned in Big Data discussions.

At the end of his book, Huff suggested that one can try to recognize sound and usable data in the wilderness of fraud by asking five questions: Who says so? How does he know? What’s missing? Did somebody change the subject? Does it make sense?  They are not perfect, but they are worth asking.  On the other hand, healthy skepticism should not become overzealous in discrediting truly sound and innovative findings.

Faced with the self-raised question of why he wrote the book, especially with the title and content that provides ideas to use statistics to deceive and swindle, Huff responded that “[t]he crooks already know these tricks; honest men must learn them in defense.”

How I wish there were a book about how to lie with Big Data now!  In the meantime, Huff’s book remains as enlightening as it was 45 years ago, although its price has gone up to $5.98 and is almost matched by its shipping cost.

Jeremy S. Wu, Ph. D., jeremy.s.wu@gmail.com

Categories
Big Data Statistics 2.0

Smart Wuhan, Built on Big Data

智慧武汉:善用大数据

The following is an abstract of a presentation given at the Committee of 100's Fourth Tien Changlin (田长霖) Symposium held in Wuhan, China, on June 20, 2013.

The presentation in simplified Chinese is available at 智慧武汉:善用大数据.

The urban population in China doubled between 1990 and 2012.  It is estimated that an additional 400 million people will move from the countryside to the cities in the next decade.  China has announced plans to become a well-off society, while maintaining harmony, during this time period.  This is an enormous challenge to China and its cities like Wuhan.

A well-off society necessarily includes a sound infrastructure and sustainable economic development with entrepreneurial spirits and drive for innovation.  It must constantly improve quality of life for its citizens with effective management of the environment and natural resources.  Most of all, it must change governance so that flexibility, high efficiency and responsiveness are the norms that its citizens would expect.

If data were letters and single words, statistics would be grammar that binds them together in an international language that quantifies what a well-off society is, measures performance, and communicates results.  Modern technology can now collect and deliver electronic information in great variety with massive volume at rapid speed during the Big Data era.  Combined with open policy, talented people, and partnership between the academia, government, and private sector, Wuhan can get smart with Big Data, as it has started with projects like “China Technology and Science City” and “Citizen’s Home.”  Although there are many areas yet to expand and improve, a smart Wuhan will lead the nation up another level toward a well-off society.

Link to presentation in simplified Chinese: 智慧武汉:善用大数据.

Categories
Big Data General

Is Big Data Gold or Sand?

If we melted all the existing gold in the world and put it together, it would amount to about one third of the Washington Monument.  The value of gold is high because it is rare and has many uses.

Recent hype about Big Data compares it to gold, as if you could collect it by simply dipping your hands in.  If that were so, many people would already be rich.

It may be more appropriate to compare Big Data to sand, within which there may be gold – sometimes a little, sometimes none at all, and in rare situations, a lot.  Whatever the case, it takes investment and hard work to clean and mine it.  There is no real substitute in software or hardware.

There is value in a big pile of sand too.  According to an ancient saying, you can build pagodas with it; it is also the raw material for silicon chips today.

Of higher value still is that Big Data remains a largely unexplored branch of new knowledge.

Source of Chart: http://www.jsmineset.com/2010/05/27/how-money-works/

Categories
Statistics 2.0 大数据 统计

识别码的要义

简体中文版: 《海外学人》2013年 – 大数据专刊   

This blog is a simplified Chinese translation of my blog on The Essentials of Identification Codes. A published version appeared in the Chinese Association for Science and Technology special edition on Big Data.

在21世纪,大数据承诺将为社会有效治理以及大众信息分享做出贡献。尽管任何数据本身都包含一定的信息与作用,但是关联和整合后的数据不仅减少收集数据的重复性,而且极大地增加它的价值和可用性。识别码在这个过程中促进实际记录和数据的整合,是解放大数据威力的关键。如果识别码没有得到正确的使用和管理,它亦将会是系统失灵、误用和滥用、甚至欺诈及犯罪的元凶。因此,除了技术以外,合理的统计学设计、提高质量的反馈、适当的教育和培训、相关的法律法规、公众的认知,这些都是有效和负责任地应用识别码和大数据的必要条件。

识别码的必要性

在学生入学时,会有档案存储学生的各种数据,比如:姓名,性别,年龄,家庭背景,专业等。当学生选修一门课并获得成绩时,这个结果也被记录下来。当这个学生满足了所有毕业要求,另一条记录会显示出她的加权平均分并且获得的学位。

每一条记录都是这学生的一个“快照”,随时间累积成为行政记录。这些纵向“快照”提供每个学生受教育情况的丰富信息。

当学生进入工作单位,更多关于她工作的数据将被收集,伴随她一生,这些数据包括:她的行业、职业及工作单位,工作表现,工资及晋升情况,保险和税的缴纳数额,就业或失业状态等。

在同样的情形下,大量关于公司的数据也会被收集。这些数据记录了:最初注册成立,收支财政状况报告,上市情况,收购或者与其他公司合并,所缴税费,收入增长和雇员增加,公司的扩增或是公司的倒闭。

这些行政记录过去被封存于满身尘埃的文件柜里,但是在千禧年伴随着大数据时代的到来,它们大部份都已数字化。

对学生数据及时和适当的整合将会提供空前的细节,使我们更详细地了解这所学校运作情况,比如说毕业率随时间的变化。当数据整合扩展到所有学校,我们将更好的了解这个国家的教育状况,例如它对就业和经济增长的潜力和支持。这些就是21世纪大数据承诺将会给我们带来的变化。从分配资源,评估表现,到制定政策,社会的方方面面都可以从大数据的细节和深度中促进社会有效治理及大众信息分享。

尽管任何数据本身都包含一定的信息与作用,但是关联和整合后的数据更为重要,因为它不仅减少重复收集数据,而且极大地增加数据的价值和可用性。识别码的要义是促进实际记录和数据的整合。在这个过程中,统计学家可以运用他们的思想和方法,为创立新的统计系统作出卓越的贡献。

识别码的种类

当文件还是实物形式(例如纸质表格)时,人名或者公司名称是常用的识别码。通常的做法是把相同名称下的记录整合起来,并按英文字母、中文笔画或时间顺序排序。

但是,使用名字的一大弊端是它们并不唯一,在电脑大量处理数据时,这一弊端尤其明显。据2006年的统计,李、王、张、刘这四个中国最大姓氏共有3.34亿人口[1],超过美国人口总数。同一个中文姓名,也可能因繁体和简体字而写法不同。英文名罗伯特(Robert)在2011年美国出生的男性人名中的使用率排第61位[2,3],同一个人至少有七种常见变体,包括Bert, Bo, Bob, Bobby, Rob, Robbie以及Robby,而Bert又可以是英文名Albert的缩写。个人可能更改名字或使用不止一个名字;女性可能在结婚后改姓。人为错误又可能引入不正确的名字。在语言不同的国家之间引用同一名字更是特别困难。

在注册的过程中,公司的名称通常会被检查核实以确保不出现重名,注册后公司名称及其商标也受到当地、全国性以及国际性规则和法律的保护。但公司仍可能使用多个名称,包括缩写和股票代码,而且它也可能在合并、重组、被收购时变更名称,或者只是简单地更改品牌。

非唯一的识别码会造成不正确链接和合并数据的风险,导致不正确的结果或结论。虽然给一个名称增加辅助信息,比如说年龄,性别和地址,可以减少风险,但是并不能完全去除错误配对记录和数据的可能性,而且会增加处理数据的时间。

识别码可以由一系列数字、字母或特殊字符(字母数字式)组成,也可以由纯数字组成。在现代机器对电子记录进行排序、链接和合并时,纯数字识别码的使用越来越多,因为它不依赖于书写系统,受到的限制较少。使用英文字母的字母数字识别码可能适合基于拉丁字母的语言系统,但非拉丁文字的系统可能难以使用、理解或解释它们。同时,数字识别码的排序规则也比字母数字识别码更容易理解。

当美国在1935年通过社会安全法案时,实施中遇到的第一个挑战就是创造一种能“永久识别每个被覆盖个体”,并“有足够弹性,能无限期地识别不断增加的劳工”的识别码[4]。一个八位字母数字系统最先被选定,但很快遭到统计机构以及劳工和司法部门的反对。这次交锋被描述为“机器将如何深远地影响[政府]运作方式”的第一个征兆[4,5]。这些都发生在计算机实际投入使用之前。

如今,信息科技对政府、商业和个人活动方方面面的巨大影响已经很明显,而且还在不断增强。一个识别码可以应用于一个人、一家公司、一辆车、一张信用卡、一箱货物、一个电子邮箱账户、一个地点,或者几乎任何一个实际实体。

如果一条电子记录不包含识别码,或不能正确地和其他记录连接,在大数据时代可称为缺乏“结构”或“无结构”。从21世纪初开始,“无结构”数据出现的频率比有“结构”数据高得多,但它们所含的信息内容和用途相对有限,特别是难以提供关于社会和经济的连续、一致和可靠的时序信息。

如何有效地使用识别码将是发挥大数据巨大作用的关键。

有效使用识别码

  1. 匹配和合并记录。理想的识别码同时互斥、完全穷尽,在代码和实体间建立明确的一对一对应关系,并延伸到未来出现的记录。识别码促进对电子记录直接有效地排序、匹配及合并,具有无限扩增实体信息内容的潜能。
  2. 匿名和保护身份。因为代码将实体匿名化,所以它为身份保护提供第一道防线。但随着识别码重要性的增强,以及它与其他数据链接相对容易,通过识别码伪造及盗用身份的风险也在增加,这就需要负责任的政策和管理,使识别码起到保护的作用。
  3. 基本描述和分类。识别码可以对数据的内容和背景提供最基本的描述,从中迅速得到简单的观察或总结。随着时间的推移,这个概念延伸到分类代码和“元数据”的发展[6,7],用于在数据系统中高效地建立结构并扩展它们跨系统的应用。
  4. 初步质量检查。无意的人为输入或转录错误可能破坏整合数据以及最终分析结果的质量。欺诈或恶意改变识别码更可能对数据的完整性和可靠性造成严重破坏。在识别码中使用“校验码”[8,9]进行早期检查,可消除90%以上这类常见错误。
  5. 促进统计学创新。通过对每个实体(例如学生)数据连续不断的收集和整合,可以建立一个含有所有学生和学校丰富信息的动态框架。在严格保护个人隐私和数据安全的同时,可以定义新的变量用于分析研究;描述一所学校表现或一个国家教育状况的统计结果可以定时甚至实时产生。在美国及中国,建立这些动态框架和纵向数据系统的创新努力都已起步[10]。Data Quality Campaign[11]将“跨年度连接各主要数据库学生数据的唯一州级学生识别码”列为建立全美教育纵向数据系统最关键的要素。

美国和中国的个人识别码

美国没有全国性的个人识别系统。社会安全号创立于1936年,还在商业使用电脑之前,用于追踪劳工的收入。在电脑大规模使用以后,社会安全号作为识别码表现出一些优势和劣势。

美国社会安全证

九位的社会安全号由三部分组成:

  • 地区号码(三位)- 最初是发放社会安全号的地区代码,后来代表申请邮寄地址的邮政代码
  • 组号(两位)- 代表着一个社会安全号集合被指定为一个组
  • 系列号码(四位)- 从0001到9999

社会安全号的申请过程中[12]收集人口信息,包括名字,出生地,出生日期,国籍,种族,性别,父母的名字和社会安全号,电话号码和邮政地址。美国社会安全部负责社会安全号的发放。有一些社会安全号被保留,没有使用。一旦一个社会安全号被发放,它应是唯一的,因为它不会被第二次发放。但重复的情况仍可能存在。

1938年,一个钱包生产厂商为促销其在百货商场出售的钱包,展示社会安全卡是多么适合放入他们的产品,但他们使用的是一张自己员工的真实社会安全卡[13]。这导致共有四万多人错误地使用了这个社会安全号,甚至到1977年还有人将这个号码作为自己的社会安全号。

自从社会安全号产生以来,它被政府部门和私有企业的使用显著增加。从1943年开始,总统行政命令要求各联邦政府部门在建立新的永久个人账户号码系统时必须使用社会安全号[5]。在1960年代初,联邦政府雇员和个人报税者必须使用社会安全号。到1960年代末,社会安全号开始作为军人的识别码。在整个七十年代,随着电脑越来越多地被使用,社会安全号成为联邦福利以及开设银行账户、申请信用卡和贷款等金融活动中必不可少的一部分。从1986年开始,如果父母想要申报受抚养人的免税,就必须将受抚养人的社会安全号列在税表里。在法律实施的第一年,这一反欺诈措施就使申报的受抚养人减少了七百万[14]。

社会安全号可以将同一个人的很多电子文件链接合并到一起,因此它实质上成为了非官方的全国性识别码,但它也可能直接造成误用或滥用,例如身份盗用[15]。社会安全号没有校验码,并不能可靠地用于身份认证。有学者也展示了如何用公开的信息“异常精确地重建社会安全号”[16]。这些已被识别的脆弱点使得近年来美国更加谨慎、安全和负责任地使用社会安全号。1943年要求使用社会安全号的行政命令已被废除,取而代之的是2008年颁布的行政命令,使社会安全号的使用成为可选而非必须。

中国居民身份证

中国相对较晚开始使用个人识别码。在1999年7月1日,居民身份证号码由15位升级为18位,其中出生年份由两位变为四位,并且增加了校验码。18位身份证号由四部分组成[17,18]:

  • 地区代码(六位)— 个人住址的行政编号
  • 生日代码(八位)— 按生日的年月日顺序组成
  • 系列代码(三位)— 其中奇数代表男性,偶数代表女性
  • 校验码(一位)— 使用ISO 7064标准算法,基于前面17位数字计算得到[18,19]
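按照上述规则,校验码的计算可以用几行 Python 勾画出来(以下号码纯属虚构,仅作演示):

```python
# ISO 7064 MOD 11-2:由前17位数字计算第18位校验字符(号码为虚构)
WEIGHTS = [2 ** (17 - i) % 11 for i in range(17)]  # 权重:7, 9, 10, 5, 8, 4, 2, 1, ...
CHECK_CHARS = "10X98765432"                        # 加权和模11的余数对应的校验字符

def rin_check_char(first17: str) -> str:
    """根据身份证号前17位计算校验字符,余数为2时返回'X'。"""
    s = sum(int(d) * w for d, w in zip(first17, WEIGHTS))
    return CHECK_CHARS[s % 11]

print(rin_check_char("11010519491231002"))  # → X
```

把计算结果与第18位字符比较,即可在数据录入时立即发现大部分单字符错误。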

居民身份证由居民常住户口所在地的县级人民政府公安机关签发,居民最迟应在年满16岁时申请。居民身份证登记的项目包括:姓名、性别、民族、出生日期以及居住地址。居民身份证的有效期可以长至永久,也可能短至五年,取决于申请人的年龄。根据官方的声明,居民身份证号在中国电子健康档案中也用于记录个人的健康信息[20]。

中国及美国的商业识别码和工业分类码

美国企业的雇主识别码相当于个人的社会安全号[21]。但是,这里的企业也可以是地方、州和联邦政府,可以是无雇员的公司,亦可以是需要为其雇员缴纳预扣税款的个人。雇主识别码是由美国税务局指派的一个九位数字,形式是GG-NNNNNNN,其中GG在2001年前是公司所在地的地理代码,而后七位数字没有特别的含义。一旦一个雇主识别码被发放,美国税务局就不会再次发放。另外,每个州亦各有自己的雇主识别码,用于税务收缴和行政管理。

在联邦雇主识别码的申请过程中收集以下信息:正式名称,交易名称,法人姓名,责任人,邮政地址,主营业地址,公司类型,申请原因,成立时间,财政年度,未来12个月员工数目估计,首次工资发放日期以及公司主营业务[22]。

美国统计部门使用北美工业分类系统(以下称NAICS)对公司营业进行归类,以便收集、分析以及发布有关美国经济的统计信息[23]。NAICS于1997年取代了标准工业分类系统(SIC)。

NAICS是一个层级分类代码系统,其中可能包含有2到6位数字。最高层级的2位数字代表主要经济部门,例如建筑和生产。每个2位数字所代表的部门都包含一系列的3位数字子部门,而它又包含有一系列4位数字的工业集团。例如31到33是表示生产部门,而碾米工业在其所属层级之中:

311       食品加工制造业
3112     粮食和油菜籽加工业
31121   面粉和麦芽生产业
311212 碾米业

层级系统其中的一个优势就是它可以相当容易地链式聚集产业总值。比如说,所有代码为311x企业的总和就组成了代码为311的食品加工制造业。
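链式聚集可以直接按代码前缀实现。下面是一个 Python 小示例(企业代码和产值均为虚构),把企业数值按 NAICS 前缀向上汇总:

```python
# 按层级代码前缀向上聚集(企业数据为虚构)
firm_output = {"311211": 3.0, "311212": 5.0, "311340": 2.0, "312111": 4.0}  # 代码 → 产值

def roll_up(data: dict, prefix: str) -> float:
    """加总所有代码以 prefix 开头的数值,得到该层级的总值。"""
    return sum(v for code, v in data.items() if code.startswith(prefix))

print(roll_up(firm_output, "3112"))  # 粮食和油菜籽加工业:8.0
print(roll_up(firm_output, "311"))   # 食品加工制造业:10.0
```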

在当今快速变化的动态全球经济环境下,过时的行业可能消失,新的行业可能一夜之间出现成长,持续准确地创建和指派NAICS代码因此是一大挑战,过去的“高科技”行业及近来的“绿色”行业就是例子。NAICS代码的使用也存在理解和一致性的问题,例如美国普查局和美国劳工统计局就因为数据来源和NAICS代码指派不同,使得各自创立和维护的商业框架有异[10]。不一致地使用NAICS代码会破坏甚至使时间序列和纵向数据的分析无效。

中国的新企业必须向当地质量技术监督部门申请9位数的国家组织机构代码,由8位数字(或大写拉丁字母)本体代码和1位数字(或大写拉丁字母)校验码组成[24]。中国的组织机构代码在借鉴ISO 6523《信息技术——组织和组织各部分标识用的结构》国际标准的基础上,根据GB 11714—1997《全国组织机构代码编制规则》国家标准的规定编制,是全国统一的组织机构识别标识码[25]。通过网上信息核查系统,可以基于国家组织机构代码查询组织机构的信息[26]。

国内及国外的经济学家和其他学者十分认可中国工业企业数据库的价值。经过大量投资,这个丰富的综合数据系统从1998年开始纵向描述中国差不多所有的国有和大型企业(2010年前年销售额在500万人民币以上及2010年后年销售额在2000万人民币以上的企业)。但是,十分严重的质量问题已有报导,而主要数据错误原因可以追溯到不正确和不连贯地使用识别码[27]。虽然中国从1989年就开始标准化国家组织机构代码,并且现在已经进行到第三阶段,但是这个问题仍然存在[28]。

就在上个月,广东省宣布他们运用国家组织机构代码这个平台推动反腐败[29]。中国也有一个根据GB/T 4754—2002建立的标准行业分类系统[30]。这个层级系统有四个层级,其中最高层为一个字母,其余分别用2位、3位、4位数字代码表示较低层级。以前述的碾米业为例,中国的分类系统中表示为以下层级:

C           制造业
C13       农副食品加工业
C131     谷物磨制
C1312   大米加工企业

总结

随着科技的转变和发展,收集大规模数字化数据的成本将更低,速度也将更快。这些是大数据时代的标志。

这些大数据包含了空前规模的信息。如果数据得到整合和结构化,它们的价值和功能将呈指数增长,超过现有任何统计系统所能提供的。识别码促进数据的链接和合并,是打开这些巨大机会的关键。

识别码能解放大数据的巨大能量。如果我们不能正确使用和管理识别码,它同样可以成为系统失灵,误用和滥用,甚至是欺骗和犯罪行为的罪魁祸首。

现实中使用识别码的挑战复杂多样。除了技术以外,统计学设计和质量反馈机制、适当的教育和培训、有效的政策和监管,以及公众的认知和参与,都是有效和负责任地使用识别码所必须的。未来的文章中将讨论这些话题。

胡善庆博士, Jeremy.s.wu@gmail.com;丁浩, edwarddh101@gmail.com

参考文献

[1] 360doc.com.  Quantitative Ranking of Chinese Family Names (中國姓氏人口數), November 25, 2012.  Available at http://www.360doc.com/content/12/1125/17/6264479_250155720.shtml on April 29, 2013.

[2] Wikipedia.  Robert.  Available at http://en.wikipedia.org/wiki/Robert on April 29, 2013.

[3] U.S. Social Security Administration.  Change in Name Popularity.  Available at http://www.ssa.gov/OACT/babynames/rankchange.html on April 29, 2013.

[4] U.S. Social Security Administration.  Fifty Years of Operations in the Social Security Administration, by Michael A. Cronin, June 1985.  Social Security Bulletin, Volume 48, Number 6.  Available at http://www.ssa.gov/history///cronin.html on April 29, 2013.

[5] U.S. Social Security Administration.  The Story of the Social Security Number, by Carolyn Puckett, 2009.  Social Security Bulletin, Volume 69, Number 2.  Available at http://www.ssa.gov/policy/docs/ssb/v69n2/v69n2p55.html on April 29, 2013.

[6] Wikipedia.  Metadata. Available at http://en.wikipedia.org/wiki/Metadata on April 29, 2013.

[7] Wikipedia. 元数据. Available at http://zh.wikipedia.org/wiki/%E5%85%83%E6%95%B0%E6%8D%AE on April 29, 2013.

[8] Wikipedia.  Check Digit.  Available at http://en.wikipedia.org/wiki/Check_digit on April 29, 2013.

[9] Wikipedia. 校验码. Available at http://zh.wikipedia.org/wiki/%E6%A0%A1%E9%AA%8C%E7%A0%81 on April 29, 2013.

[10] Wu, Jeremy S. 21st Century Statistical Systems, August 1, 2012.  Available at https://jeremy-wu.info/21st-century-statistical-systems/ on April 29, 2013.

[11] Data Quality Campaign.  10 Essential Elements of a State Longitudinal Data System.  Available at http://www.dataqualitycampaign.org/build/elements/1 on April 29, 2013.

[12] U.S. Social Security Administration.  Application for a Social Security Card, Form SS-5.  Available at http://www.ssa.gov/online/ss-5.pdf on April 29, 2013.

[13] U.S. Social Security Administration.  Social Security Cards Issued by Woolworth.  Available at http://www.socialsecurity.gov/history/ssn/misused.html on April 29, 2013.

[14] Wikipedia.  Social Security Number.  Available at http://en.wikipedia.org/wiki/Social_Security_number, on April 29, 2013.

[15] President’s Identity Theft Task Force. 2007. Combating Identity Theft: A Strategic Plan. Available at http://www.idtheft.gov/reports/StrategicPlan.pdf on April 29, 2013.

[16] Timmer, John.  New Algorithm Guesses SSNs Using Data and Place of Birth, July 6, 2009. Available at http://arstechnica.com/science/2009/07/social-insecurity-numbers-open-to-hacking/ on April 29, 2013.

[17] baidu.com.  GB11643-1999 Citizen Identity Number 公民身份号码.  Available at http://wenku.baidu.com/view/4f19376348d7c1c708a14587.html on April 29, 2013.

[18] Wikipedia.  Resident Identity Card.  Available at http://en.wikipedia.org/wiki/Resident_Identity_Card_%28PRC%29 on April 29, 2013.

[19] Wikipedia.  ISO 7064.  Available at http://en.wikipedia.org/wiki/ISO_7064:1983 on April 29, 2013.

[20] baidu.com.  Electronic Health Record 电子健康档案. Available at http://wenku.baidu.com/view/348d5a18a300a6c30c229fec.html on April 29, 2013.

[21] Wikipedia.  Employer Identification Number.  Available at http://en.wikipedia.org/wiki/Employer_identification_number on April 29, 2013.

[22] U.S. Internal Revenue Service.  Form SS-4: Application for Employer Identification Number.  Available at http://www.irs.gov/pub/irs-pdf/fss4.pdf on April 29, 2013.

[23] U.S. Census Bureau.  North American Industry Classification System.  Available at http://www.census.gov/eos/www/naics/index.html on April 29, 2013

[24] National Administration for Code Allocation to Organizations.  Introduction to Organizational Codes, 组织机构代码简介.  Available at http://www.nacao.org.cn/publish/main/65/index.html on April 29, 2013.

[25] Wikipedia.  ISO/IEC 6523.  Available at http://en.wikipedia.org/wiki/ISO_6523 on April 29, 2013.

[26] National Administration for Code Allocation to Organizations.  National Organization Code Information Retrieval System, 全国组织机构信息核查系. Available at http://www.nacao.org.cn/ on April 29, 2013.

[27] Nie, Huihua; Jiang, Ting; and Yang, Rudai.  A Review and Reflection on the Use and Abuse of Chinese Industrial Enterprises Database.  World Economics, Volume 5, 2012.  Available at http://www.niehuihua.com/UploadFile/ea_201251019517.pdf on April 29, 2013.

[28] National Administration for Code Allocation to Organizations.  Historical Development of National Organization Codes, 全国组织机构代码发展历程. Available at http://www.nacao.org.cn/publish/main/236/index.html on April 29, 2013.

[29] National Administration for Code Allocation to Organizations.  Guangdong Aggressively Promotes the Use of identification Codes in its Campaign against Corruption, 广东积极发挥代码在反腐倡廉中的促进作用, March 7, 2013. Available at http://www.nacao.org.cn/publish/main/13/2013/20130307150216299954995/20130307150216299954995_.html on April 29, 2013.

[30] baidu.com.  National Economic Industry Classification, GB-t4754-2002, 国民经济行业分类(GB-T4754-2002)(总表).  Available at http://wenku.baidu.com/view/69f04af8c8d376eeaeaa31cf.html on April 29, 2013.

Categories
Big Data Statistics Statistics 2.0

The Essentials of Identification Codes

Big Data promises to improve governance of society and better inform the public in the 21st century.  Although every data record has some information to contribute, linking and merging relevant electronic records can minimize the collection of duplicate data and increase the value and utility of the integrated data rapidly and exponentially.  Essential in this approach is the presence of identification codes that will facilitate the actual integration of record and data.

The identification code is a key to unlocking the enormous power in Big Data.  However, it may also be the primary cause of system failures, misuses and abuses, and even fraudulent or criminal activities, if it is not properly applied and managed.  In addition to technology, statistical design and quality feedback loops, proper education and training, relevant policies and regulations, and public awareness are all needed for the effective and responsible use of identification codes and Big Data.

The Need for Identification Codes

When a student enters a school, a record will store the student’s name, gender, age, family background, field of study, and other data.  When she takes a course and receives a final grade, the results are recorded.  When she satisfies all the requirements for graduation, another record will show the grade point average she has achieved and the degree she is awarded.

Each record represents a snapshot for the student.  The records are collected over time for administrative purposes.  Together the longitudinal snapshots provide comprehensive information about the education of a student.

When the student enters the workforce, additional data are collected over her lifetime about the industry and occupation she works in, the job she performs, the wages and promotions she receives, the taxes and insurances she pays, and the employment or unemployment status she is in.

In like manner, massive amounts of data are collected about a firm, including its initial registration as a business, periodic reports on revenues and expenses, entry into the stock markets, acquisitions or mergers with other companies, payment of taxes and fees, growth in sales and staffing, and expansion or death of the business.

These administrative records used to be stored in dusty file cabinets, but most of them became digitized and available for computer processing as the Big Data era arrived at the turn of the millennium.

Timely and proper integration of the records of all students would provide unprecedented details about how the school is performing, such as its graduation or dropout rate over time.  Further roll up of all schools would inform a nation about its state of education, such as its capacity to support employment and economic growth.  This is what Big Data promises to bring in the 21st century.  From allocation of resources, measurement of performance, to formulation of policy, every segment of society can benefit from the details and insights of Big Data to improve governance and inform the public.

Although every data record has some information to contribute, linking and merging relevant electronic records minimizes the collection of duplicate data and increases the value and utility of the integrated data exponentially.  Essential in this approach is the presence of identification codes that will facilitate the actual integration of record and data.  Statisticians can make significant contributions to building new statistical systems with their thinking and methods in this process.

Types of Identification Codes

The name of an individual or a company was the preferred identification code when files were still physical, such as in paper form.  It has been conventional to consolidate records under the same name and sort them by alphabetical order in English, number of strokes in Chinese, or chronological order.

However, a major shortcoming of using names, especially when processed massively by computer, is that they are not unique.  The top four family names of Lee, Wang, Zhang, and Liu accounted for 334 million individuals in China in 2006 [1], exceeding the total U.S. population.  Chinese names may also appear differently because of the simplified and traditional characters.  The English first name of Robert, the 61st most popular male name at birth in the U.S. in 2011 [2,3], can have at least 7 common variations for the same person, including Bert, Bo, Bob, Bobby, Rob, Robbie, and Robby.  Bert may also be short for Albert.  Individuals may apply to change their names or use more than one name; women may change their names after marriage.  Human errors can add errant names.  References to the same name across nations with different languages can be notoriously difficult.

The name of a company is usually checked and validated to avoid duplication during the registration process and protected by applicable local, national and international rules and laws including trademarks after registration.  The company may use multiple names including abbreviations and stock market symbols; it can also change its name due to merger with another company, acquisition agreement, reorganization, or a simple desire to change its brand.

A non-unique identification code poses the risk of linking and merging records incorrectly, leading to incorrect results or conclusions.  Supplementing a name with auxiliary information, such as age, gender, and an address, would reduce but not eliminate the chance of record mismatches, and at the cost of increasing processing time.
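As a minimal sketch of this risk (all records below are fabricated), matching on name plus birth year can both miss a true link because of a nickname and tie one record to two different people, while a unique code resolves both problems:

```python
# Fabricated records illustrating the risk of linking on non-unique keys
people = [
    {"id": "P001", "name": "Robert Lee", "birth_year": 1980},
    {"id": "P002", "name": "Robert Lee", "birth_year": 1980},  # a different person
]
wages = [{"id": "P001", "name": "Bob Lee", "birth_year": 1980, "wage": 50000}]

# Matching on name + birth year: the nickname "Bob" prevents any match,
# and the full name "Robert Lee" would have matched BOTH people above.
by_name = [p for p in people for w in wages
           if p["name"] == w["name"] and p["birth_year"] == w["birth_year"]]
print(len(by_name))  # 0 -- the true link is missed entirely

# Matching on a unique identification code links exactly one record.
by_id = [p for p in people for w in wages if p["id"] == w["id"]]
print(len(by_id))    # 1
```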

A series of numbers, letters and special characters (alphanumeric) or a series of numbers alone (numeric) is increasingly used as the identification code of choice with modern machine sorting, linking, and merging of electronic records.  Numeric codes tend to be less restrictive because they are independent of the writing system.  Alphanumeric codes using letters from the English alphabet may be suitable for systems using languages based on the Latin alphabet, but systems using non-Latin scripts may still find them unavailable or difficult to use, understand, or interpret.  It is also easier to understand how numeric codes are sorted compared to alphanumeric codes.

When the Social Security Act of 1935 was passed in the U.S., one of the first challenges in implementation was to create an identification code that would “permanently identify each individual to be covered” and “be sufficiently elastic to function indefinitely as additional workers became covered” [4].  An 8-field alphanumeric code was initially chosen, but it was soon rejected by the statistical agencies, as well as labor and justice departments.   This exchange was described [4,5] as the first sign of “the tremendous impact machines would have on the way [government] would do business.”  This was BEFORE computers were introduced for actual use.

Today, the impact of information technology is obvious and continues to increase in every aspect of government, business, and individual activities.  An identification code may be applied to a person, a company, a vehicle, a credit card, a cargo, an email account, a location, or just about any practical entity.

An electronic record that does not contain an identification code or cannot be correctly linked with other records may be described as lacking in “structure” or “unstructured” in the Big Data era.  Since the beginning of the 21st century, “unstructured” data have occurred at a much higher frequency than “structured” data.  However, they contain relatively limited information content and utility compared to “structured” data, especially for continuing, consistent, and reliable information about a society or an economy over time.

Effective use of the identification code is a key to unlocking the enormous power inherent in Big Data.

Effective Use of Identification Codes

  1. Match and Merge Records.  Ideal identification codes are mutually exclusive and exhaustive, establishing an unambiguous one-to-one relationship between the code and the entity, including those yet to appear in the future.  The code facilitates direct and perfect machine sorting, matching, and merging of electronic records, potentially increasing the amount of information about the entity with no limit.
  2. Anonymize and Protect Identity.  A code offers the first-line protection of identity by anonymizing the entity.  Due to the increasing importance of the code and the relative ease of linking with other records, the risks and stakes of identity fraud or theft through the identification code have also risen, requiring responsible policy and management of the code as safeguards.
  3. Provide Basic Description and Classification.  An identification code can provide the most basic description of the content and context of the data records, from which simple observations or summaries can be quickly derived.  Over time, this concept also evolved into codes for classification and the separate development of “metadata” [6,7] for efficiently building structure into data systems and broadening their use across systems.
  4. Perform Initial Quality Check.  Unintentional human errors in typing or transcribing an identification code incorrectly may damage the quality of integrated data and the eventual analytical results.  Fraudulent or malicious altering of the identification codes may inflict even more severe damage to the integrity and reliability of the data.  Early detection with the deployment of “check digit” [8,9] in the identification code may eliminate more than 90 percent of these common errors.
  5. Facilitate Statistical Innovations.  By collecting and integrating data continuously for each entity such as a student, a dynamic frame with rich content can be built for all students and all schools.  New data elements may be defined for analysis; statistical summaries may be produced in real time or according to set schedules to describe the performance of a school or the state of education for a nation, while strictly protecting the confidentiality of individuals and security of their data.   Innovative efforts to construct these dynamic frames, or longitudinal data systems, have started in the U.S. and China [10].  The Data Quality Campaign [11] lists “a unique statewide student identifier that connects student data across key databases across years” to be the top essential element in building state longitudinal data systems for education in the U.S.

Personal Identification Codes of the U.S. and China

The U.S. does not have a national identification system. The Social Security Number (SSN) was created to track earnings of workers in the U.S. in 1936, before computers were introduced for commercial use.  Its transition into the computer age revealed some of the strengths and weaknesses of its evolving role as an identification code.

The 9-digit SSN is composed of 3 parts:

U.S. Social Security Card
  • Area Number (3 digits) – initially geographical region where the SSN was issued and later the postal area code of the mailing address in the application
  • Group Number (2 digits) – representing each set of SSN being assigned as a group
  • Serial number (4 digits) – from 0001 to 9999

Demographic data are collected in the SSN application [12], including name, place of birth, date of birth, citizenship, race, ethnicity, gender, parents’ names and SSNs, phone number, and mailing address.  The U.S. Social Security Administration is responsible for issuing the SSN.  Some SSNs are reserved and not used.  Once issued, an SSN is supposed to be unique because it will not be issued again.  However, some duplicates exist.

A wallet manufacturer decided to promote its product in 1938 by showing how well a copy of the Social Security card of one of its employees fit into its wallets, which were sold through department stores [13].  In all, over 40,000 people mistakenly reported this number as their own SSN, some as late as 1977.

Use of the SSN by the government and later the private sector has expanded substantially since its creation.  Beginning in 1943, federal agencies were required by executive order to use the SSN whenever an agency found it advisable to establish a new system of permanent account numbers for individuals [5].  In the early 1960s, federal employees and individual tax filers were required to use the SSN.  In the late 1960s, the SSN began to serve as a military identification number.  Throughout the 1970s, as computers were increasingly used, the SSN was required for federal benefits and financial transactions such as opening bank accounts and applying for credit cards and loans.  Beginning in 1986, parents must list the SSN of each dependent whom they want to claim as a tax deduction.  The anti-fraud change resulted in 7 million fewer minor dependents being claimed in the first year of implementation [14].

As SSN became essentially an unofficial national identifier that can link and merge many electronic files for the same person together, it can also be the direct cause of misuse and abuse such as identity theft [15].  The SSN does not have a check digit; it cannot be used reliably for authentication of identity.  Academic researchers have also demonstrated ways to use publicly available information “to reconstruct SSN with a startling degree of accuracy” [16].  These identified vulnerabilities have resulted in more cautious, secured, and responsible use of the SSN in the U.S. in recent years.  The original 1943 executive order requiring the use of SSN was rescinded and replaced by another executive order in 2008 that makes the use of SSN optional.

China had a relatively late start in personal identification codes.  It revised the Resident Identification Number (RIN) from 15 digits to 18 digits on July 1, 1999, raising the embedded birth year from 2 to 4 digits and adding a check digit.  The 18-digit RIN is composed of 4 parts [17,18]:

Chinese Identification Card
  • Address Area Number (6 digits) – administrative code for the individual’s residence
  • Birthdate Number (8 digits) – in the form of YYYYMMDD where YYYY is year, MM is month and DD is day of the birthdate
  • Serial Number (3 digits) – with odd numbers reserved for males and even numbers reserved for females
  • Check Digit (1 digit) – computed digit based on 17 previous digits using the ISO 7064 standard algorithm [18,19]
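The validation rule above can be sketched in a few lines of Python (the 18-character number below is fabricated for illustration, though its check digit is internally consistent); a single mistyped digit changes the expected check character and is caught at once:

```python
# ISO 7064 MOD 11-2 validation of an 18-digit Resident Identification Number
WEIGHTS = [2 ** (17 - i) % 11 for i in range(17)]  # 7, 9, 10, 5, 8, 4, 2, 1, ...
CHECK_CHARS = "10X98765432"                        # check character per remainder mod 11

def rin_is_valid(rin: str) -> bool:
    """Return True if the 18th character matches the check digit of the first 17."""
    s = sum(int(d) * w for d, w in zip(rin[:17], WEIGHTS))
    return CHECK_CHARS[s % 11] == rin[17].upper()

print(rin_is_valid("11010519491231002X"))  # True  (fabricated but self-consistent)
print(rin_is_valid("11010519491231012X"))  # False (one digit mistyped -> detected)
```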

Public security offices of county-level local governments issue the resident identity cards upon application, which must be filed no later than age 16.  Data collected include name, gender, ethnicity, birthdate, and residential address.  The resident identity cards may be valid permanently or for a period as short as 5 years, depending on the age of the applicant.  According to official announcements, the RIN is also used to track individual health records in the National Electronic Health Record System in China [20].

Business Identification and Industry Classification Codes of the U.S. and China

An Employer Identification Number (EIN) to a business is equivalent to the SSN to an individual in the U.S. [21].  However, a business in this case may also be a local, state, or federal government; it may also be a company without employees or an individual who has to pay withholding taxes on his/her employees.  The EIN is a unique 9-digit number assigned by the U.S. Internal Revenue Service (IRS) according to the GG-NNNNNNN format, where GG was a numerical geographical code to the location of the business prior to 2001 and the remaining 7 numeric digits have no special meanings.  Once issued, an EIN will not be reissued by IRS.  In addition, each state has its own, different Employer Identification Number for its tax collection and administrative purposes.

Information collected about the business during the EIN application process includes legal name, trade name, executor name, responsible party name, mailing address, location of principal business, type of entity or company, reason for application, starting date of business, accounting year, highest number of employees expected in the next 12 months, first date of paid wages, and principal activity of business [22].

U.S. statistical agencies use the North American Industry Classification System (NAICS) to classify business establishments for the purpose of collecting, analyzing, and publishing statistical data related to the U.S. economy [23].  NAICS was adopted and replaced the Standard Industrial Classification (SIC) system in 1997.

NAICS is a hierarchical classification coding system consisting of 2, 3, 4, 5, or up to 6 numeric digits.  The top 2-digit codes represent the major economic sectors such as Construction and Manufacturing.  Each 2-digit sector contains a collection of 3-digit subsectors, each of which in turn contains a collection of 4-digit industry groups.  For example, 31-33 is the Manufacturing sector for which the following hierarchy exists for the Rice Milling industry:

311                  Food Manufacturing
3112                Grain and Oilseed Milling
31121              Flour Milling and Malt Manufacturing
311212            Rice Milling

One of the strengths of the hierarchical system is that aggregation can be performed easily up the chain.  For example, sum of all 311X companies should form the 311 Food Manufacturing industry in the U.S.

Consistent creation and assignment of NAICS codes is a challenge in a global, dynamic economy where obsolete industries may disappear and new industries may spawn and grow overnight.  Examples of challenging industries include “high technology” industries in the past and the recent “green” industries.  Application of the NAICS codes is subject to interpretation and consistency issues.  For example, the U.S. Census Bureau and the U.S. Bureau of Labor Statistics disagree in creating and maintaining their respective business frames due to differences in data sources and assignment of NAICS codes [10].  Inconsistent use of NAICS codes disrupts or even invalidates analysis and interpretation of time series or longitudinal data.
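A toy illustration of the last point (firms and values are fabricated): if one firm is recoded between years while its output is unchanged, the subsector time series shows an apparent collapse that is purely a coding artifact.

```python
# Fabricated data: firm A is recoded from 311212 to 311999 in year 2
year1 = {"A": ("311212", 5.0), "B": ("311340", 2.0)}  # firm -> (NAICS code, output)
year2 = {"A": ("311999", 5.0), "B": ("311340", 2.0)}  # same output, different code

def sector_total(year: dict, prefix: str) -> float:
    """Sum output over all firms whose code starts with the given prefix."""
    return sum(v for code, v in year.values() if code.startswith(prefix))

print(sector_total(year1, "3112"))  # 5.0
print(sector_total(year2, "3112"))  # 0.0 -- the "drop" reflects recoding, not the economy
```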

A new business in China must apply to the local Quality and Technical Supervision Office for a 9-digit National Organization Code, which contains 8 digits and 1 check digit [24].  The Chinese regulation, GB 11714-1997 on Rules of Coding for the Representation of Organizations, is patterned after the international standard, ISO 6523 Information Technology – Structure for the Identification of Organizations and Organization Parts [25].  Online directories exist to look up information about an organization based on its National Organization Code [26].

The value of the Chinese Industrial Statistical Dataset is well recognized by economists and other analysts domestically and internationally.  Substantial resources were invested into the construction and maintenance of this comprehensive data system, which has described almost all state-owned and large enterprises (annual sales of over RMB ¥5 million until 2010 and over RMB ¥20 million thereafter) in China longitudinally since 1998.  However, serious data quality problems have been reported, and the primary cause can be traced to the inconsistent and incorrect application of the identification codes [27].  This situation exists although China started its standardization of organization codes in 1989 and is currently in the third phase of implementation [28].

As recently as last month, Guangdong province announced its commitment to use a shared platform built on the National Organization Codes as part of its campaign to combat corruption [29].

China also has a standard industry classification system under GB/T 4754-2002 [30].  The hierarchical system has 4 levels, with the highest level indicated by a single letter and the lower levels represented by 2, 3, and 4 digits respectively.  For the earlier example of Rice Milling, the Chinese classification system provides the following hierarchy:

C          Manufacturing

C13        Food Manufacturing

C131       Grain Milling

C1312      Rice Milling
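Because each lower level simply appends digits to its parent's code, the full ancestry of any code can be recovered by taking prefixes.  A minimal sketch (the code table below is a tiny illustrative excerpt, not the full classification):

```python
# Prefix-based hierarchy lookup for GB/T 4754-style codes:
# level 1 is one letter; levels 2-4 append digits to the parent code.
TITLES = {
    "C": "Manufacturing",
    "C13": "Food Manufacturing",
    "C131": "Grain Milling",
    "C1312": "Rice Milling",
}

def ancestors(code: str):
    """Return the codes from the top level down to `code` itself."""
    # Prefix lengths 1 (letter), 3, 4, 5 correspond to the 4 levels.
    return [code[:n] for n in (1, 3, 4, 5) if n <= len(code)]

for c in ancestors("C1312"):
    print(c, TITLES[c])
```

This self-describing structure is what makes such codes useful for aggregation: rolling detailed data up to any higher level is a matter of truncating the code.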

Summary

As technology continues to evolve, ever larger amounts of digitized data will be collected, more rapidly and at relatively low cost.  This is what characterizes the Big Data era.

These Big Data contain an unprecedented amount of information.  If integrated and structured, their value and power will grow exponentially beyond what any existing statistical system has been able to provide.  Identification codes, which facilitate the linking and merging of records, hold the key to unlocking this enormous trove of opportunities.

As the gateway to the enormous power of Big Data, identification codes may also be the primary cause of system failures, misuses and abuses, and even fraudulent or criminal activities, if they are not properly applied and managed.

The practical challenges of applying an identification code are complex.  In addition to technology, statistical design and quality feedback loops, proper education and training, effective policies and regulations, and public awareness are all needed for the effective and responsible use of identification codes.  These topics will be discussed in future papers.

Co-authored by Jeremy S. Wu, Ph.D., Jeremy.s.wu@gmail.com and Hao Ding, edwarddh101@gmail.com

References

[1] 360doc.com.  Quantitative Ranking of Chinese Family Names (中國姓氏人口數),November 25, 2012.  Available at http://www.360doc.com/content/12/1125/17/6264479_250155720.shtml on April 29, 2013.

[2] Wikipedia.  Robert.  Available at http://en.wikipedia.org/wiki/Robert on April 29, 2013.

[3] U.S. Social Security Administration.  Change in Name Popularity.  Available at http://www.ssa.gov/OACT/babynames/rankchange.html on April 29, 2013.

[4] U.S. Social Security Administration.  Fifty Years of Operations in the Social Security Administration, by Michael A. Cronin, June 1985.  Social Security Bulletin, Volume 48, Number 6.  Available at http://www.ssa.gov/history///cronin.html on April 29, 2013.

[5] U.S. Social Security Administration.  The Story of the Social Security Number, by Carolyn Puckett, 2009.  Social Security Bulletin, Volume 69, Number 2.  Available at http://www.ssa.gov/policy/docs/ssb/v69n2/v69n2p55.html on April 29, 2013.

[6] Wikipedia.  Metadata. Available at http://en.wikipedia.org/wiki/Metadata on April 29, 2013.

[7] Wikipedia. 元数据. Available at http://zh.wikipedia.org/wiki/%E5%85%83%E6%95%B0%E6%8D%AE on April 29, 2013.

[8] Wikipedia.  Check Digit.  Available at http://en.wikipedia.org/wiki/Check_digit on April 29, 2013.

[9] Wikipedia. 校验码. Available at http://zh.wikipedia.org/wiki/%E6%A0%A1%E9%AA%8C%E7%A0%81 on April 29, 2013.

[10] Wu, Jeremy S. 21st Century Statistical Systems, August 1, 2012.  Available at https://jeremy-wu.info/21st-century-statistical-systems/ on April 29, 2013

[11] Data Quality Campaign.  10 Essential Elements of a State Longitudinal Data System.  Available at http://www.dataqualitycampaign.org/build/elements/1 on April 29, 2013.

[12] U.S. Social Security Administration.  Application for a Social Security Card, Form SS-5.  Available at http://www.ssa.gov/online/ss-5.pdf on April 29, 2013.

[13] U.S. Social Security Administration.  Social Security Cards Issued by Woolworth.  Available at http://www.socialsecurity.gov/history/ssn/misused.html on April 29, 2013.

[14] Wikipedia.  Social Security Number.  Available at http://en.wikipedia.org/wiki/Social_Security_number, on April 29, 2013.

[15] President’s Identity Theft Task Force. 2007. Combating Identity Theft: A Strategic Plan.  Available at http://www.idtheft.gov/reports/StrategicPlan.pdf on April 29, 2013.

[16] Timmer, John.  New Algorithm Guesses SSNs Using Data and Place of Birth, July 6, 2009. Available at http://arstechnica.com/science/2009/07/social-insecurity-numbers-open-to-hacking/ on April 29, 2013.

[17] baidu.com.  GB11643-1999 Citizen Identity Number 公民身份号码.  Available at http://wenku.baidu.com/view/4f19376348d7c1c708a14587.html on April 29, 2013.

[18] Wikipedia.  Resident Identity Card.  Available at http://en.wikipedia.org/wiki/Resident_Identity_Card_%28PRC%29 on April 29, 2013.

[19] Wikipedia.  ISO 7064.  Available at http://en.wikipedia.org/wiki/ISO_7064:1983 on April 29, 2013.

[20] baidu.com.  Electronic Health Record 电子健康档案. Available at http://wenku.baidu.com/view/348d5a18a300a6c30c229fec.html on April 29, 2013.

[21] Wikipedia.  Employer Identification Number.  Available at http://en.wikipedia.org/wiki/Employer_identification_number on April 29, 2013.

[22] U.S. Internal Revenue Service.  Form SS-4: Application for Employer Identification Number.  Available at http://www.irs.gov/pub/irs-pdf/fss4.pdf on April 29, 2013.

[23] U.S. Census Bureau.  North American Industry Classification System.  Available at http://www.census.gov/eos/www/naics/index.html on April 29, 2013.

[24] National Administration for Code Allocation to Organizations.  Introduction to Organizational Codes, 组织机构代码简介.  Available at http://www.nacao.org.cn/publish/main/65/index.html on April 29, 2013.

[25] Wikipedia.  ISO/IEC 6523.  Available at http://en.wikipedia.org/wiki/ISO_6523 on April 29, 2013.

[26] National Administration for Code Allocation to Organizations.  National Organization Code Information Retrieval System, 全国组织机构信息核查系. Available at http://www.nacao.org.cn/ on April 29, 2013.

[27] Nie, Huihua; Jiang, Ting; and Yang, Rudai.  A Review and Reflection on the Use and Abuse of Chinese Industrial Enterprises Database.  World Economics, Volume 5, 2012.  Available at http://www.niehuihua.com/UploadFile/ea_201251019517.pdf on April 29, 2013.

[28] National Administration for Code Allocation to Organizations.  Historical Development of National Organization Codes, 全国组织机构代码发展历程. Available at http://www.nacao.org.cn/publish/main/236/index.html on April 29, 2013.

[29] National Administration for Code Allocation to Organizations.  Guangdong Aggressively Promotes the Use of identification Codes in its Campaign against Corruption, 广东积极发挥代码在反腐倡廉中的促进作用, March 7, 2013. Available at http://www.nacao.org.cn/publish/main/13/2013/20130307150216299954995/20130307150216299954995_.html on April 29, 2013.

[30] baidu.com.  National Economic Industry Classification, GB-t4754-2002, 国民经济行业分类(GB-T4754-2002)(总表).  Available at http://wenku.baidu.com/view/69f04af8c8d376eeaeaa31cf.html on April 29, 2013.

Categories
Big Data Statistics

Statistics 2.0: Dynamic Frames

Abstract

A frame identifies all the known units in a population from which a census can be conducted or a random sample can be drawn, providing the structural foundation for the extraction of maximum, reliable information from designed statistical studies with the support of established statistical theories.  The significance of the Big Data era is that most data are now digitized, easily stored, and processed in large quantity at relatively low cost.  Big Data offers unprecedented opportunities for statisticians to rethink and innovate.  Among the many possibilities offered by Big Data is the creation and maintenance of Dynamic Frames – frames that are rich in content, capture the most up-to-date data as soon as they become available, and produce results and reports in real time on demand.
Traditional Population and Frame
A population is an important concept in the study of statistics.  It is commonly understood to be an entire collection of items of interest, be it a nation’s people or businesses, a day’s production of light bulbs, or an ocean’s fish [1,2,3].
A less well-known term is the frame: a list of the units covering the entire population, together with an identification system.  A frame is the working definition of a population under study.  It identifies all the known units in the population, from which a census can be conducted or a random sample drawn, and provides the structure for statistical description and analysis of the population [2,4,5].
 

Figure 1

Figure 1 shows a flow chart of a conventional statistical study by census or random sample.  Quoting from [4], an ideal frame should have the following qualities:
  • All units have a logical, numerical identifier
  • All units can be found – their contact information, map location or other relevant information is present
  • The frame is organized in a logical, systematic fashion
  • The frame has additional information about the units that allow the use of more advanced sampling frames
  • Every element of the population of interest is present in the frame
  • Every element of the population is present only once in the frame
  • No elements from outside the population of interest are present in the frame
  • The data is “up-to-date”
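Several of these qualities can be checked mechanically.  The sketch below (hypothetical frame and target population, with invented unit IDs) audits a frame for four of them: every unit has an identifier, no duplicates, no out-of-scope units, and no undercoverage:

```python
from collections import Counter

def audit_frame(frame_ids, population_ids):
    """Check a frame against four of the ideal-frame qualities."""
    counts = Counter(frame_ids)
    return {
        "missing_id": counts.get("", 0),                       # units with no identifier
        "duplicates": sum(n - 1 for n in counts.values() if n > 1),
        "out_of_scope": len(set(frame_ids) - {""} - set(population_ids)),
        "undercoverage": len(set(population_ids) - set(frame_ids)),
    }

frame = ["u1", "u2", "u2", "u9", ""]       # u2 duplicated, u9 out of scope, one blank ID
population = ["u1", "u2", "u3"]            # u3 missing from the frame
print(audit_frame(frame, population))
# {'missing_id': 1, 'duplicates': 1, 'out_of_scope': 1, 'undercoverage': 1}
```

Running such an audit routinely, rather than once before a study, is one small step from a static list toward the dynamic frames discussed below.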
Modeling may be considered part of a sampling process, sometimes bypassing the need for a frame by assuming that the model and data adequately represent the underlying population. 
Practicing statisticians understand the importance of frames: a frame is the structural foundation for extracting maximum, reliable information from designed statistical studies with the support of established statistical theories.  However, few statistical papers or forums discuss best practices for creating and maintaining a frame, primarily because the work is viewed as an administrative or clerical task.
Many lament how difficult it is to obtain or maintain a good frame, or recount their bitter experience of working with incomplete or error-prone frames.  Indeed, a poor-quality frame may prevent a well-planned statistical study from taking place at all, or may produce misleading or biased results.
Inadequate attention to the creation and maintenance of flexible, up-to-date, dynamic population frames has been costly to the statistics profession and to the U.S. in terms of efficiency and innovation.
For example, according to [6], although “an accurate and complete address list is a critical ingredient in all U.S. Census Bureau surveys and censuses,” each program prepared its own separate list until the concept of a national frame was advanced, less than 20 years ago, under the name of the Master Address File (MAF).
The MAF is used primarily to support mail delivery of questionnaires [7], an increasingly outdated mode of information collection.  It is also relied upon heavily for follow-up visits to non-respondents, at a time when rising labor costs meet tight budget constraints.  Web-based questionnaire delivery and data submission were not allowed in the most recent 2010 decennial census in the U.S., and the MAF is not designed to promote or support web-based applications.
The arrival of the Big Data era seems to have caught the statistics profession in a deer-in-the-headlights moment.  Even as the statistician is hailed as holding “the sexiest job for the next 10 years” and beyond [8], the profession is still wondering why statistics is undervalued and left out, while searching for the role it should play in the Big Data era [9].
Only a few seem to recognize that statistics is “the science of learning from data” [10], regardless of how big or small the data are, and that the moment has arrived for the profession to join the revolution and remain relevant in the future.
Statistics 2.0: Dynamic Frames
Big Data is a relative concept.  Tomorrow’s Big Data will be bigger than today’s.  If statisticians consider only the size of the data, the impact of Big Data will be limited to scaling existing software and methods.
The significance of the Big Data era is that most data are now digitized, including sound, images, and handwriting [e.g., 11], much of which has never been available before.  They can be easily stored and processed in large quantity at relatively low cost.  Today’s consumers of statistics are far more numerous and less interested in technical details, yet they want comprehensive, reliable, easy-to-use information rapidly and readily.
Big Data is as much a revolution in information technology as it is an advancement in statistics, because it offers unprecedented opportunities for statisticians to rethink their systems and operations and to innovate.
For example, mathematical statistics clearly demonstrates that a 5 percent random sample is superior to a 5 percent non-random sample.  But how does it compare to a 50 percent or a 95 percent non-random sample?  We have continued to caution, warn, condemn, or dismiss large non-random samples, but have done little to go beyond the existing framework of mathematical statistics.  Is there not a point, albeit one that may vary from case to case, where the inherent statistical bias is sufficiently offset by the sheer size of a non-random sample that it becomes practically acceptable and meaningful?
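The comparison can at least be made concrete numerically.  In the sketch below (an artificial population, purely illustrative), a 95 percent non-random sample that systematically excludes the top 5 percent of units carries a fixed bias that no sample size removes, while a 5 percent random sample is unbiased but noisier:

```python
import random

# Artificial population: values 0..999, so the true mean is 499.5.
population = list(range(1000))
true_mean = sum(population) / len(population)

# 95% non-random sample: the frame systematically misses the top 50 units.
nonrandom = population[:950]
nonrandom_mean = sum(nonrandom) / len(nonrandom)   # 474.5, a fixed bias of -25

# 5% simple random sample (seeded for reproducibility).
rng = random.Random(42)
srs = rng.sample(population, 50)
srs_mean = sum(srs) / len(srs)

print(true_mean, nonrandom_mean, round(srs_mean, 1))
```

The trade-off the paragraph raises is visible here: the random sample's error shrinks with its size, while the non-random sample's 25-unit bias is structural and persists however large the sample grows.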
As another example, as long as Figure 1 remains the typical process for conducting statistical studies, sequential and cross-sectional, there is little room for innovation to reduce turnaround time or to introduce new metrics such as measuring longitudinal change at the unit level [12].  Is it absolutely impossible to produce accurate and reliable statistical results in real time?  Or have we become so comfortable with the present software, approaches, and conveniences that there is no desire to consider other possibilities?
Random sampling has been the dominant mode of statistical operation for a century [13].  Because of Big Data, one may now study an entire population almost as easily as one can study a random sample today.  Should we ignore this opportunity? 
If statisticians do not recognize or embrace the challenges of theory and practice posed by Big Data as part of the core of studying and practicing statistics, the risk is high that others, including the yet-undefined “data scientists,” will fill the void [14].
Among the many possibilities offered by Big Data is the creation and maintenance of Dynamic Frames – population frames that are rich in content, capture the most up-to-date data as soon as they become available, and produce results and reports according to established schedules or even in real time. 
With some user bases exceeding one billion members, e-commerce companies and social media are well positioned to apply their data from online transactions, emails, and blog postings to conduct market research and perform predictive analyses.  Even a lay person may capture these data, in a less structured manner.
 

Figure 2

Figure 2 provides a simple schematic of how Dynamic Frames may work; in educational applications in the U.S. they are also described as longitudinal data systems [15,16].
In essence, primary effort goes into the creation and maintenance of the frame so that it is optimized for the previously identified qualities.  It is constantly updated with new data for every sampling unit over time.
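Reduced to its essentials, a dynamic frame is a keyed store whose units accumulate time-stamped attributes and can be summarized on demand.  A minimal sketch (the unit IDs, fields, and class name are invented for illustration):

```python
class DynamicFrame:
    """Toy dynamic frame: units keyed by ID, updated as data arrive."""

    def __init__(self):
        self.units = {}   # unit_id -> {field: (timestamp, value)}

    def update(self, unit_id, timestamp, **attrs):
        """Absorb new data for a unit, keeping the latest value per field."""
        unit = self.units.setdefault(unit_id, {})
        for field, value in attrs.items():
            if field not in unit or unit[field][0] <= timestamp:
                unit[field] = (timestamp, value)

    def snapshot(self, field):
        """On-demand summary (mean) of the latest values of a field."""
        values = [u[field][1] for u in self.units.values() if field in u]
        return sum(values) / len(values) if values else None

frame = DynamicFrame()
frame.update("u1", 1, employees=10)
frame.update("u2", 1, employees=30)
frame.update("u1", 2, employees=20)   # later data supersede earlier data
print(frame.snapshot("employees"))    # 25.0
```

The point of the design is that the summary is computed from whatever data have arrived by the moment of the query, which is what makes real-time, on-demand reporting possible.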
Statisticians must be fully engaged in the design, implementation, and operation of Dynamic Frames, in addition to the production of descriptive and analytical results.  There are many new and traditional functions to which statisticians can make major contributions.
For example, the identification code is a key to unlocking the enormous power of Big Data.  It controls the extent to which additional records and data may be linked, determines firsthand the overall quality of the data and the study, and is the first safeguard protecting confidentiality.
As another example, the size and content of the units have no conceivable limit.  They depend only on the availability of data, the ability to link and match records, and the design of the system.  Effective operation minimizes mismatched records and the duplicative collection of data that do not change or that change in a predictable manner.  Appropriate replacement or imputation of missing values ensures quality and timely integration of the data.
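For the missing-value replacement just mentioned, one simple rule suited to unit-level time series is last-observation-carried-forward (an illustrative choice here, not a recommendation for any particular dataset):

```python
def locf(series):
    """Impute gaps by carrying the last observation forward.

    Leading gaps have no prior observation and stay None.
    """
    out, last = [], None
    for v in series:
        if v is not None:
            last = v
        out.append(last)
    return out

print(locf([None, 4, None, None, 7, None]))  # [None, 4, 4, 4, 7, 7]
```

In a dynamic frame, such a rule would run at update time, so that every snapshot sees a complete series even when some units report late.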
Other enhancements of traditional statistical functions [14] include, but are not limited to: establishing continuous quality loops back to the data sources; developing new definitions, metrics, and standards for dynamic frames; applying new statistical modeling for imputation, profiling, risk assessment, and artificial intelligence; developing innovative visualizations; improving statistical training and education; and protecting confidentiality.
Summary
 
Dynamic Frames will retain their original purpose as a list of known units for conducting censuses and drawing random samples as needed, but the potential use of structured Big Data is limited only by the imagination and innovative spirit of the statistics profession.  Statisticians need to embrace Big Data as their own revolution, one that will lead to the next level of human knowledge and practice through the study and use of data.

Co-authored by 
Jeremy S. Wu, Ph.D., Jeremy.s.wu@gmail.com

Junchi Guo, Ph. D. Candidate, junchi@email.gwu.edu

References
[1] Hansen, Morris H.; Hurwitz, William N.; and Madow, William G.  (1953).  Sample Survey Methods and Theory.  Wiley Classics Library Edition, John Wiley & Sons, Inc. 
[2] Kish, Leslie.  (1965).  Survey Sampling.  Wiley Classics Library Edition, John Wiley & Sons, Inc. 
[3] Cochran, William G.  (1977).  Sampling Techniques.  A Wiley Publication in Applied Statistics, Third Edition, John Wiley & Sons, Inc.
[4] Wikipedia.  Sampling Frame.  Available at http://en.wikipedia.org/wiki/Sampling_frame on April 8, 2013.
[5] Baidu.com.  Sampling Frame 抽样框.  Available at http://baike.baidu.com/view/1652958.htm on April 8, 2013.
[6] U.S. Census Bureau.  Master Address File: Update Methodology and Quality Improvement Program, by Philip M. Ghur,  Machell Kindred, and Michael L. Mersch, 1994.  Available at https://www.amstat.org/sections/srms/Proceedings/papers/1994_128.pdf on April 8, 2013.
[7] U.S. Census Bureau.  The Master Address File for the 2010 Census, by Joseph Salvo, April 7, 2006.  Brookings Breakfast Briefings on the Census.  Available at http://www.brookings.edu/~/media/events/2006/4/07community%20development/20060407_salvo.pdf on April 8, 2013.
[8] Varian, Hal.  Hal Varian explains why statisticians will be the sexy job in the next 10 years,  September 15, 2009.  YouTube.  Available at http://www.youtube.com/watch?v=pi472Mi3VLw on April 8, 2013.
[9] Pierson, Steve and Wasserstein, Ron.  Big Data and the Role of Statistics, March 28, 2012.  Available at http://community.amstat.org/amstat/blogs/blogviewer?BlogKey=737fd276-0225-4c87-b7cb-0cfc7cd9e124 on April 8, 2013.
[10] van der Lann, Mark; Hsu, Jiann-Ping; and Rose, Sherri.  Statistics Ready for a Revolution.  Amstat News, September 1, 2010.  Available at http://magazine.amstat.org/blog/2010/09/01/statrevolution/ on April 8, 2013.
[11] Washington Post.  From the President’s Hand to the Internet.  Available at http://www.washingtonpost.com/lifestyle/style/from-the-presidents-hand-to-the-internet/2013/03/21/0b609e66-9282-11e2-9cfd-36d6c9b5d7ad_graphic.html on April 8, 2013.
[12] Diggle, Peter J.; Heagerty, Patrick J.; Liang, Kung-Yee; and Zeger, Scott L. (2001).  Analysis of Longitudinal Data.  Second Edition, Oxford University Press.
[13] Wu, Jeremy S., Chinese translation by Zhang, Yaoting and Yu, Xiang.  One Hundred Years of Sampling, invited paper in Sampling Theory and Practice, ISBN7-5037-1670-3, 1995.  China Statistical Publishing Company.
[14] Wu, Jeremy S. 21st Century Statistical Systems, August 1, 2012.  Available at https://jeremy-wu.info/21st-century-statistical-systems/ on April 8, 2013. 
[15] Data Quality Campaign.  Using Data to Improve Student Achievement.  Available at http://www.dataqualitycampaign.org/ on April 8, 2013.
[16] U.S. Department of Education.  Statewide Longitudinal Data Systems Grant Program, National Center for Education Statistics.  Available at http://nces.ed.gov/programs/slds/ on April 8, 2013.