Abstract
A frame identifies all the known units in a population from which a census can be conducted or a random sample can be drawn; it provides the structural foundation for extracting the maximum reliable information from designed statistical studies with the support of established statistical theory. The significance of the Big Data era is that most data are now digitized, easily stored, and processed in large quantities at relatively low cost. Big Data offers unprecedented opportunities for statisticians to rethink and innovate. Among the many possibilities it offers is the creation and maintenance of Dynamic Frames: frames that are rich in content, capture the most up-to-date data as soon as they become available, and produce results and reports in real time on demand.
Traditional Population and Frame
A population is an important concept in the study of statistics. It is commonly understood to be an entire collection of items of interest, be it a nation’s people or businesses, a day’s production of light bulbs, or an ocean’s fish [1,2,3].
A less well-known term is the frame: a list of units that covers the entire population, together with its identification system. A frame is the working definition of a population under study. It identifies all the known units in a population from which a census can be conducted or a random sample can be drawn, providing the structure for statistical description and analysis of the population [2,4,5].
[Figure 1: Flow chart of a conventional statistical study by census or random sample]
Figure 1 shows a flow chart of a conventional statistical study by census or random sample. Quoting from [4], an ideal frame should have the following qualities:
- All units have a logical, numerical identifier
- All units can be found – their contact information, map location or other relevant information is present
- The frame is organized in a logical, systematic fashion
- The frame has additional information about the units that allow the use of more advanced sampling frames
- Every element of the population of interest is present in the frame
- Every element of the population is present only once in the frame
- No elements from outside the population of interest are present in the frame
- The data is “up-to-date”
Modeling may be considered part of the sampling process; it sometimes bypasses the need for a frame by assuming that the model and the data adequately represent the underlying population.
Practicing statisticians understand the importance of frames: a frame is the structural foundation for extracting the maximum reliable information from designed statistical studies with the support of established statistical theories. However, few statistical papers or forums discuss best practices for creating and maintaining a frame, primarily because the work is viewed as an administrative or clerical task.
Many statisticians lament how difficult it is to obtain or maintain a good frame, or recount bitter experiences of working with incomplete or error-prone frames. Indeed, a poor-quality frame may prevent a well-planned statistical study from taking place at all, or may produce misleading or biased results.
Inadequate attention to the creation and maintenance of a flexible, up-to-date, and dynamic population frame has been costly to the statistics profession and the U.S. in terms of efficiency and innovation.
For example, according to [6], although "an accurate and complete address list is a critical ingredient in all U.S. Census Bureau surveys and censuses," each program prepared its own separate list until the concept of a national frame was advanced less than 20 years ago in the form of the Master Address File (MAF).
The MAF is used primarily to support mail delivery of questionnaires [7], an increasingly outdated mode of information collection. It is also relied upon heavily for follow-up visits to non-respondents, at a time when rising labor costs meet tight budget constraints. Web-based questionnaire delivery and data submission were not allowed in the most recent 2010 decennial census in the U.S., and the MAF is not designed to promote or support web-based applications.
The arrival of the Big Data era seems to have caught the statistics profession in a deer-in-the-headlights moment. While the statistician is hailed as having "the sexiest job for the next 10 years" and beyond [8], the profession is still wondering why statistics is undervalued and left out, and is searching for the role it should play in the Big Data era [9].
Only a few seem to recognize that statistics is “the science of learning from data” [10], regardless of how big or small the data are, and that the moment has arrived for the profession to join the revolution and remain relevant in the future.
Statistics 2.0: Dynamic Frames
Big Data is a relative concept; tomorrow's Big Data will be bigger than today's. If statisticians consider only the size of the data, the impact of Big Data will be limited to scaling existing software and methods.
The significance of the Big Data era is that most data are now digitized, including sound, vision, and handwriting [e.g., 11], much of which was never available before. Such data can be easily stored and processed in large quantities at relatively low cost. Today's consumers of statistics are far more numerous and less interested in technical details, but they also want comprehensive, reliable, easy-to-use information rapidly and readily.
Big Data is as much a revolution in information technology as it is an advance in statistics, because it offers unprecedented opportunities for statisticians to rethink their systems and operations and to innovate.
For example, mathematical statistics clearly demonstrates that a 5 percent random sample is superior to a 5 percent non-random sample. But how does it compare to a 50 percent or a 95 percent non-random sample? We have continued to caution against, condemn, or dismiss large non-random samples, but have done little to move beyond the existing framework of mathematical statistics. Is there not a point, though it may vary from case to case, at which the inherent selection bias of a non-random sample is contained by its sheer coverage of the population, so that the results become practically acceptable and meaningful?
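As a rough illustration, consider a minimal simulation sketch. The synthetic population and the selection mechanism that mildly favors larger values are assumptions made for this example, not anything taken from the studies cited here:

```python
# Sketch: compare a 5% simple random sample with a 95% non-random sample
# whose inclusion probability is mildly correlated with the measured value.
# The population and the selection mechanism are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
N = 100_000
population = rng.normal(loc=50, scale=10, size=N)  # true mean is 50

# 5% simple random sample: unbiased, but subject to sampling variance.
random_sample = rng.choice(population, size=int(0.05 * N), replace=False)

# 95% non-random sample: larger values are slightly more likely to be
# included, which introduces a selection bias.
inclusion = rng.random(N) < 0.95 + 0.0004 * (population - 50)
nonrandom_sample = population[inclusion]

print("true mean:            ", round(population.mean(), 3))
print("5% random sample:     ", round(random_sample.mean(), 3))
print("95% non-random sample:", round(nonrandom_sample.mean(), 3))
```

Because the non-random sample covers nearly the whole population, its bias is bounded even though its selection is not random; where that trade-off becomes acceptable is exactly the open question posed above.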
As another example, as long as Figure 1 remains the typical process for conducting statistical studies in a sequential and cross-sectional manner, there is little room for innovative improvement, such as reducing turnaround time or introducing new metrics that measure longitudinal change at the unit level [12]. Is it truly impossible to produce accurate and reliable statistical results in real time? Or have we become so comfortable with the present software, approach, and convenience that there is no desire to consider other possibilities?
Random sampling has been the dominant mode of statistical operation for a century [13]. Because of Big Data, one may now study an entire population almost as easily as one can study a random sample today. Should we ignore this opportunity?
If statisticians do not recognize and embrace the challenges to theory and practice posed by Big Data as part of the core of studying and practicing statistics, the risk is high that others, including the yet-undefined "data scientists," will fill the void [14].
Among the many possibilities offered by Big Data is the creation and maintenance of Dynamic Frames – population frames that are rich in content, capture the most up-to-date data as soon as they become available, and produce results and reports according to established schedules or even in real time.
With user bases exceeding one billion members, e-commerce companies and social media platforms are well positioned to apply their data from online transactions, emails, and blog postings to conduct market research and perform predictive analyses. Even a lay person can capture such data, if in a less structured manner.
[Figure 2: Schematic of a Dynamic Frame]
Figure 2 provides a simple schematic of how Dynamic Frames may work; in educational applications in the U.S., such systems are also described as longitudinal data systems [15,16].
In essence, primary effort goes into the creation and maintenance of the frame so that it satisfies the frame qualities identified earlier, and the frame is constantly updated with new data for every sampling unit over time.
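A minimal sketch of such a structure follows, assuming a simple in-memory design; the class and method names are illustrative, not drawn from Figure 2 or any existing system:

```python
# Sketch of a Dynamic Frame: one record per population unit, constantly
# updated as new data arrive, with a history of prior values per unit.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class UnitRecord:
    unit_id: str                                    # logical, unique identifier
    attributes: dict = field(default_factory=dict)  # current data for the unit
    history: list = field(default_factory=list)     # (timestamp, prior attributes)

class DynamicFrame:
    def __init__(self):
        self._units: dict[str, UnitRecord] = {}

    def upsert(self, unit_id: str, **attributes):
        """Insert a new unit or update an existing one, preserving history."""
        record = self._units.setdefault(unit_id, UnitRecord(unit_id))
        record.history.append((datetime.now(timezone.utc), dict(record.attributes)))
        record.attributes.update(attributes)

    def snapshot(self):
        """Current frame contents, ready for a census or a random sample."""
        return {uid: rec.attributes for uid, rec in self._units.items()}

frame = DynamicFrame()
frame.upsert("unit-001", address="123 Main St", employees=12)
frame.upsert("unit-001", employees=14)  # a later update for the same unit
print(frame.snapshot())
```

Because every update is timestamped, the same structure supports both cross-sectional snapshots and the longitudinal, unit-level measures discussed earlier.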
Statisticians must be fully engaged in the design, implementation, and operation of Dynamic Frames, in addition to the production of descriptive and analytical results. There are many new and traditional functions to which statisticians can make major contributions.
For example, the identification code is the key to unlocking the enormous power of Big Data. It controls the extent to which additional records and data may be linked, largely determines the overall quality of the data and the study, and serves as the first safeguard for protecting confidentiality.
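One possible safeguard of this kind is sketched below. The keyed-hash approach and the key name are assumptions for illustration, not a method described in this paper: a stable pseudonym replaces the raw identification code before records are linked.

```python
# Sketch: a keyed hash (HMAC-SHA256) maps a raw identifier to a stable
# pseudonym, so records can still be linked across sources while the raw
# identifier itself is never exposed. The key would be held by the frame
# operator; its value here is a placeholder.
import hashlib
import hmac

SECRET_KEY = b"frame-linkage-key"  # assumption: securely managed in practice

def pseudonymize(raw_id: str) -> str:
    """Deterministic pseudonym: the same raw id always yields the same token."""
    return hmac.new(SECRET_KEY, raw_id.encode("utf-8"), hashlib.sha256).hexdigest()

# The same unit links consistently across data sources:
print(pseudonymize("EIN-12-3456789"))
print(pseudonymize("EIN-12-3456789") == pseudonymize("EIN-12-3456789"))  # True
```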
As another example, the size and content of the unit records have no conceivable limit; they depend only on the availability of data, the ability to link and match records, and the design of the system. Effective operation minimizes mismatched records and avoids re-collecting duplicative data that do not change or that change in a predictable manner. Appropriate replacement or imputation of missing values ensures quality and timely integration of the data.
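As one elementary illustration, missing values in a unit's longitudinal record might be filled from its most recent observation. The data layout and the choice of last-observation-carried-forward are assumptions made for this sketch, not a recommendation from the paper:

```python
# Sketch: last-observation-carried-forward imputation for one unit's
# time series in the frame; None marks a period with no new data.
def carry_forward(series):
    """Replace missing (None) entries with the most recent observed value."""
    filled, last = [], None
    for value in series:
        if value is not None:
            last = value
        filled.append(last)
    return filled

# e.g., quarterly employment counts for one unit in the frame
print(carry_forward([120, None, None, 131, None]))  # [120, 120, 120, 131, 131]
```

More sophisticated model-based imputation would follow the same pattern: the frame supplies each unit's history, and the statistician supplies the method.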
Other enhancements of traditional statistical functions [14] include, but are not limited to: establishing continuous quality feedback loops to the data sources; developing new definitions, metrics, and standards for dynamic frames; applying new statistical modeling for imputation, profiling, risk assessment, and artificial intelligence; developing innovative visualizations; improving statistical training and education; and protecting confidentiality.
Summary
Dynamic frames will retain the original purpose of a frame as a list of known units for conducting censuses and drawing random samples as needed, but the potential uses of structured Big Data are limited only by the imagination and innovative spirit of the statistics profession. Statisticians need to embrace Big Data as their own revolution, one that will lead to the next level of human knowledge and practice through the study and use of data.
References
[1] Hansen, Morris H.; Hurwitz, William N.; and Madow, William G. (1953). Sample Survey Methods and Theory. Wiley Classics Library Edition, John Wiley & Sons, Inc.
[2] Kish, Leslie. (1965). Survey Sampling. Wiley Classics Library Edition, John Wiley & Sons, Inc.
[3] Cochran, William G. (1977). Sampling Techniques. A Wiley Publication in Applied Statistics, Third Edition, John Wiley & Sons, Inc.
[12] Diggle, Peter J.; Heagerty, Patrick J.; Liang, Kung-Yee; and Zeger, Scott L. (2001). Analysis of Longitudinal Data. Second Edition, Oxford University Press.
[13] Wu, Jeremy S. (1995). One Hundred Years of Sampling (Chinese translation by Zhang, Yaoting and Yu, Xiang). Invited paper in Sampling Theory and Practice, ISBN 7-5037-1670-3. China Statistical Publishing Company.
[16] U.S. Department of Education. Statewide Longitudinal Data Systems Grant Program, National Center for Education Statistics. Available at http://nces.ed.gov/programs/slds/ (accessed April 8, 2013).