{"id":385,"date":"2014-05-30T09:32:00","date_gmt":"2014-05-30T13:32:00","guid":{"rendered":""},"modified":"2015-11-30T07:20:03","modified_gmt":"2015-11-30T12:20:03","slug":"not-all-data-are-created-equal","status":"publish","type":"post","link":"https:\/\/jeremy-wu.info\/?p=385","title":{"rendered":"Not All Data are Created Equal"},"content":{"rendered":"<div id=\"pl-385\"  class=\"panel-layout\" >\n<div id=\"pg-385-0\"  class=\"panel-grid panel-no-style\" >\n<div id=\"pgc-385-0-0\"  class=\"panel-grid-cell\" >\n<div id=\"panel-385-0-0-0\" class=\"so-panel widget widget_sow-editor panel-first-child panel-last-child\" data-index=\"0\" >\n<div\n\t\t\t\n\t\t\tclass=\"so-widget-sow-editor so-widget-sow-editor-base\"\n\t\t\t\n\t\t><\/p>\n<div class=\"siteorigin-widget-tinymce textwidget\">\n<div>Suppose we have data on 60,000 households. \u00a0Are they useful for analysis? If we add that the amount of data is very large, like 3 TB or even 30 TB, does it change your answer?<\/div>\n<div>\u00a0<\/div>\n<div>The U.S. government collects monthly data from 60,000 randomly selected households and reports on the national employment situation. \u00a0Based on these data, the U.S. unemployment rate is estimated to within a margin of sampling error of about 0.2 percentage points. \u00a0Important inferences are drawn and policies are made from these statistics about the U.S. economy, which comprises 120 million households and 310 million individuals.<\/div>\n<div>\u00a0<\/div>\n<div>In this case, data for 60,000 households are very useful.<\/div>\n<div>\u00a0<\/div>\n<div>These 60,000 households represent only 0.05% of all the households in the U.S. \u00a0If they were not randomly selected, the statistics they generate would contain unknown and potentially large bias. 
\u00a0They would not be reliable for describing the national employment situation.<\/div>\n<div>\u00a0<\/div>\n<div>In this case, data for 60,000 households are not useful at all, regardless of what the file size may be.<\/div>\n<div>\u00a0<\/div>\n<div>Suppose further that the 60,000 households are all located in a small city that has only 60,000 households. \u00a0In other words, they represent the entire universe of households in the city. \u00a0These data are potentially very useful. \u00a0Depending on their content and relevance to the question of interest, the usefulness of the data may again range widely between two extremes. \u00a0If the content is relevant and the quality is good, file size may then become an indicator of how useful the data are.<\/div>\n<div>\u00a0<\/div>\n<div>This simple line of reasoning shows that the original question is too incomplete for a direct, satisfactory answer. \u00a0We must also consider, for example, the sample selection method, how well the sample represents the population under study, and the relevance and quality of the data relative to the specific hypothesis being investigated.<\/div>\n<div>\u00a0<\/div>\n<div>The original question of data usefulness was seldom asked until the Big Data era began around 2000, when electronic data became widely available in massive amounts at relatively low cost. \u00a0Before then, data were usually collected only when a known, specific purpose required them, such as an exploration to conduct, a hypothesis to test, or a problem to resolve. \u00a0It was costly to collect data. \u00a0When they were collected, they were already considered to be potentially useful for the intended analysis.<\/div>\n<div>\u00a0<\/div>\n<div>For example, when the nation was mired in the Great Depression, the U.S. government began to collect data from randomly selected households in the 1930s so that it could produce more reliable and timely statistics about unemployment. 
This practice continues to this day.<\/div>\n<div>\u00a0<\/div>\n<div>Statisticians initially considered data mining to be a bad practice. \u00a0 The argument was that, without a prior hypothesis, aimlessly \u201cfishing,\u201d \u201cdredging,\u201d or \u201csnooping\u201d through data inevitably produces false or misleading \u201csignificant\u201d relationships and patterns. \u00a0An analogy is overinterpreting a lottery win: the person wins not necessarily because of any special skill or knowledge about winning a lottery, but because random chance dictates that some person(s) must eventually win.<\/div>\n<div>\u00a0<\/div>\n<div>Although the argument of false identification remains valid today, it has also been overwhelmed by the abundance of available Big Data that are frequently collected without design or even structure. \u00a0Total dismissal of the data-driven approach forfeits the chance of uncovering hidden, meaningful relationships that have not been or cannot be established as a priori hypotheses. \u00a0An example is the prediction of hereditary disease and the study of potential treatment. \u00a0After data on the entire human genome are collected, they may be explored and compared for the systematic identification and treatment of specific hereditary diseases.<\/div>\n<div>\u00a0<\/div>\n<div>Not all data are created equal, nor are they equally useful.<\/div>\n<div>\u00a0<\/div>\n<div>Complete and structured data can create dynamic frames that describe an entire population in detail over time, providing valuable information that has never been available in previous statistical systems. 
\u00a0On the other hand, fragmented and unstructured data may not yield any meaningful analysis no matter how large the file size may be.<\/div>\n<div>\u00a0<\/div>\n<div>As problem solving rapidly expands from a hypothesis-driven paradigm to include a data-driven approach, the fundamental questions about the usefulness and quality of these data have also increased in importance. \u00a0Although the question of interest may not be specified a priori, it must still be established after data collection and before any analysis is conducted. \u00a0We cannot obtain a correct answer to a question that does not exist.<\/div>\n<div>\u00a0<\/div>\n<div>How are the samples selected? \u00a0How well does the sample represent the universe of inference? \u00a0What are the relevance and quality of the data relative to the posterior hypothesis of interest? \u00a0 File size has little to no meaning if the usefulness of data cannot even be established in the first place. \u00a0<\/div>\n<div>\u00a0<\/div>\n<div>Ignoring these considerations may lead to the need to update a well-known quote: \u201cLies, Damned Lies, and Big Data.\u201d<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Suppose we have data on 60,000 households. \u00a0Are they useful for analysis? 
If we add that the amount of data is very large, like 3 TB or even 30 TB, [&hellip;]<\/p>\n","protected":false},"author":4,"featured_media":410,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[5,1,6,18],"tags":[89,22,86,85],"class_list":["post-385","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-big-data","category-general","category-statistics","category-statistics-2-0","tag-data-structure","tag-lies","tag-random","tag-sampling"],"_links":{"self":[{"href":"https:\/\/jeremy-wu.info\/index.php?rest_route=\/wp\/v2\/posts\/385","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/jeremy-wu.info\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/jeremy-wu.info\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/jeremy-wu.info\/index.php?rest_route=\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"https:\/\/jeremy-wu.info\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=385"}],"version-history":[{"count":7,"href":"https:\/\/jeremy-wu.info\/index.php?rest_route=\/wp\/v2\/posts\/385\/revisions"}],"predecessor-version":[{"id":513,"href":"https:\/\/jeremy-wu.info\/index.php?rest_route=\/wp\/v2\/posts\/385\/revisions\/513"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/jeremy-wu.info\/index.php?rest_route=\/wp\/v2\/media\/410"}],"wp:attachment":[{"href":"https:\/\/jeremy-wu.info\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=385"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/jeremy-wu.info\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=385"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/jeremy-wu.info\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=385"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}