With all the hype around big data analytics, not enough attention is being given to data quality or the validation of models built on the data. Despite their deterministic nature, algorithms are only as good as the data their modelers work with.
Simply defined, algorithms follow a series of instructions to solve a problem based on the input variables in the underlying model. From high-frequency trading, credit scores and insurance rates to web search, recruiting and online dating, flawed algorithms and models can cause major disruptions in markets and lives. The excessive focus on the volume, velocity and variety of data, and on the technologies emerging to store, process and analyze it, is rendered ineffectual if the algorithms produce bad decisions or enable abuse.
One example is the flash crash of May 6, 2010. Within a few minutes, the Dow Jones Industrial Average plunged roughly 1,000 points, only to recover less than 20 minutes later. While the cause was never fully explained, many market participants agree that quantitative trading algorithms were to blame. With algorithms responsible for up to 75% of trading volume, further calamitous events are more than likely. Whatever the efficiencies, the absence of human intervention allowed a cascade of automated trades that drove the market down further. Have we learned nothing from the portfolio insurance strategies of the 1980s that contributed to the 1987 crash?
On a more individual level, algorithms based on personal data, such as zip codes, payment histories and health records, have the potential to be discriminatory in determining insurance rates and credit scores. Add social data to the mix, and the assumptions baked into the models can skew outcomes even further.
Another example is the revelations about the NSA’s collection and analysis of personal information. Governments have enacted legislation allowing data mining for indirect or non-obvious correlations in the name of national security, and similar algorithms are being used for profiling by municipal police departments. A modeling error here can have devastating effects on everyday citizens, and the potential breach of personal privacy leaves a gaping hole in governance.
Modeling in fields with controlled environments and reliable data inputs, such as drug discovery or traffic prediction, gives scientists the luxury of time to validate their models. In web search, however, the time horizon may be two seconds; on a trading floor, milliseconds.
Focus on model validation
As big data becomes more pervasive, it becomes even more important to validate models and the integrity of the data behind them. A correlation between two variables does not necessarily mean that one causes the other. Coefficients of determination can easily be manipulated to fit the hypothesis behind the model, which in turn distorts any analysis of the residuals. Models for spatial and temporal data only complicate validation further.
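To make the point concrete, here is a minimal sketch (my own illustration with synthetic data, using only NumPy, not anything from a production system) of how a misspecified model can still report a flattering coefficient of determination, and why inspecting the residuals matters:

```python
# Synthetic illustration: a straight line fit to quadratic data still yields a
# high R^2, but the residuals betray the misspecification.
import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(0, 10, 200)
y = 0.5 * x**2 + rng.normal(0, 2, size=x.size)   # the true relationship is quadratic

# Fit the wrong model: y = a*x + b
a, b = np.polyfit(x, y, deg=1)
y_hat = a * x + b

# The coefficient of determination looks impressive (roughly 0.93 with these settings).
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
print("R^2 of the misspecified linear fit:", round(1 - ss_res / ss_tot, 3))

# The residuals, however, show a systematic pattern: the middle of the range
# sits below the fitted line while both ends sit above it.
residuals = y - y_hat
middle = (x > 10 / 3) & (x < 20 / 3)
print("mean residual, middle third:", round(residuals[middle].mean(), 2))
print("mean residual, outer thirds:", round(residuals[~middle].mean(), 2))
```

A validation step that stopped at R² would accept this model; a residual check, or a test on fresh holdout data, would not.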
Data management tools have improved significantly, increasing the reliability of data inputs. Until machines devise the models themselves, a focus on the veracity of the data will improve model validation and reduce, though not eliminate, inherent bias. It will also yield more valuable data.
Ways to improve data quality
Bad data is not just an IT problem. Missing data, misfielded attributes and duplicate records are among the causes of flawed data models. These, in turn, undermine the organization’s ability to execute on strategy, capture revenue and cost opportunities, and adhere to governance, regulatory and compliance (GRC) mandates. Organizations need to enact rules, policies and processes to identify root causes and ensure better data integrity.
Below are some antidotes for common data quality problems:
- Create enterprise-wide metadata with clear definitions and rules. This constrains what users can enter into a particular field, such as customer name, address, SSN, vendor, serial number or part number, and reduces entry errors. The metadata should drive integration with all applications, both behind the firewall and in the cloud.
- Use data quality tools for real-time validation of all relevant information. The data quality solution should deploy flexibly alongside application servers, in cloud environments or on an enterprise service bus (ESB), and mechanisms should exist for internal and external users to double-check the accuracy of their data entries (a rule-driven validation sketch follows this list).
- Establish policies and standards for data handling. Departments must be prevented from using unsanctioned applications or data stores, which often create rogue data or versions that are incompatible or not properly backed up. These policies must be endorsed by senior management to ensure adherence and facilitate enforcement by IT.
- Profile data from the outset. This ensures that data converts smoothly from source application to target, and it means examining the custom code and special processes beneath the data to learn the exact shape and syntax of the source (a simple profiling sketch also follows this list).
- Deploy performance management tools. This includes schema checks in job streams to verify that data is complete and correctly formatted, as well as real-time monitoring to assure the end-user data experience.
- Inventory the entire infrastructure and application environment, including external cloud/SaaS applications.
- Document all IT initiatives, including data quality standards, responsibilities and timelines. This helps define what is happening in the databases and how various processes are interrelated.
- Make data governance an ongoing effort. This ensures that as data usage and the data itself change, the data-handling rules and policies adjust accordingly.
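As a companion to the metadata and real-time validation points above, here is a minimal, hypothetical sketch of rule-driven field validation. The field names, formats and the validate_record helper are illustrative assumptions of mine, not any particular data quality product’s API:

```python
# Hypothetical sketch: shared, enterprise-wide field definitions drive
# validation at the point of entry, before bad values reach downstream systems.
import re
from typing import Dict, List

# One definition per field, reused by every application (illustrative rules).
FIELD_RULES: Dict[str, dict] = {
    "customer_name": {"required": True,  "max_len": 100},
    "ssn":           {"required": True,  "pattern": r"^\d{3}-\d{2}-\d{4}$"},
    "part_number":   {"required": False, "pattern": r"^[A-Z]{2}-\d{6}$"},
}

def validate_record(record: Dict[str, str]) -> List[str]:
    """Return a list of rule violations; an empty list means the record passes."""
    errors = []
    for field, rules in FIELD_RULES.items():
        value = (record.get(field) or "").strip()
        if rules.get("required") and not value:
            errors.append(f"{field}: required field is missing")
            continue
        if value and "max_len" in rules and len(value) > rules["max_len"]:
            errors.append(f"{field}: exceeds {rules['max_len']} characters")
        if value and "pattern" in rules and not re.match(rules["pattern"], value):
            errors.append(f"{field}: '{value}' does not match the expected format")
    return errors

# Example: a malformed SSN is flagged at entry time rather than discovered later.
print(validate_record({"customer_name": "Acme Corp", "ssn": "123-45-678"}))
```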
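And for the profiling and schema-check points, a similarly hedged sketch of what a lightweight profile of a source extract might report before conversion. The file path and column names are assumptions for illustration, and only the Python standard library is used:

```python
# Hypothetical sketch: profile a source CSV for missing values, duplicate keys
# and values that fail a simple numeric check, before loading it into the target.
import csv
from collections import Counter

def profile(path: str, key: str, numeric_cols: set) -> dict:
    missing, bad_numeric = Counter(), Counter()
    keys_seen, duplicates, rows = set(), 0, 0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            rows += 1
            if row.get(key) in keys_seen:
                duplicates += 1
            keys_seen.add(row.get(key))
            for col, value in row.items():
                if value is None or value.strip() == "":
                    missing[col] += 1          # blank or absent field
                elif col in numeric_cols:
                    try:
                        float(value)           # should parse as a number
                    except ValueError:
                        bad_numeric[col] += 1
    return {"rows": rows, "duplicate_keys": duplicates,
            "missing_by_column": dict(missing),
            "non_numeric_by_column": dict(bad_numeric)}

# Run against the source extract before conversion (illustrative file and columns):
# print(profile("customers.csv", key="customer_id", numeric_cols={"credit_limit"}))
```

Even a profile this simple makes the shape and syntax of the source explicit, so surprises surface before the load rather than after.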