Making data actionable with data validations
In the business of data distribution for financial institutions, it’s never fun to receive an alert in the middle of the night because a critical business component went down. You are suddenly woken up by a pager system (or worse, an angry end user) to find that a major model feeding an important dashboard is no longer working. After a little digging, you discover that a primary input to the model is a file that arrived late and malformed. The model is broken, and now your firm is losing time, resources, and opportunities playing catch-up. As bad as this sounds, it’s all too common: processes break, systems fail, and models go sideways, commonly because of bad data. But what is bad data exactly? How can we prevent these catastrophes? And why are all data users forced to set up the same tools to catch the same issues in the same data?
When we started Crux Informatics, we asked ourselves the same questions. With the world working to ingest and distribute data faster every minute, consumers and suppliers of data products face common challenges caused by the perpetual motion of data. Crux Informatics sits in a unique position at the center of these interactions, connecting sources, end users, and platforms to ensure that data arrives cleanly and in the simplest way possible. To do this, Crux started by building a comprehensive set of validations and monitoring tools to actively oversee tens of thousands (and growing) of running data pipelines, covering a full range of formats, delivery methods, and data types.
What are data validations?
A validation is a check that confirms a particular aspect of data quality and results in a pass or a failure. Basic examples are “Price is greater than 0” or “Name is not (empty)”. Validations are created from a mix of technical, domain-specific, and experiential knowledge. Poor past experiences are often the most influential, as users seek to prevent the recurrence of a prior catastrophe. Checks commonly perform calculations, comparisons, and time series analyses to identify potential issues within and across datasets. Validations can become as complex as the analytics they are in place to protect.
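The examples above can be sketched as simple pass/fail checks applied to each record. This is a minimal illustration, not Crux’s actual implementation; the field names (`price`, `name`) and check names are hypothetical.

```python
# A minimal sketch of row-level validations as pass/fail checks.
# Field names ("price", "name") are hypothetical examples, not a Crux API.

def validate_row(row: dict) -> list:
    """Return the names of the validations this record fails."""
    failures = []
    # "Price is greater than 0"
    if not (isinstance(row.get("price"), (int, float)) and row["price"] > 0):
        failures.append("price_greater_than_zero")
    # "Name is not (empty)"
    if not str(row.get("name") or "").strip():
        failures.append("name_not_empty")
    return failures

rows = [
    {"name": "ACME Corp", "price": 101.25},  # passes both checks
    {"name": "", "price": -4.0},             # fails both checks
]
for row in rows:
    print(row["name"] or "<blank>", "->", validate_row(row) or "OK")
```

In practice, checks like these are the simplest tier; the same pattern extends to cross-record comparisons and time series tests.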
Validations are frequently created by the team responsible for responding to their failures: the team that initially builds the data pipeline often builds its validations as well. In some organizations an IT team is responsible, while in others it’s a team of end users, such as data scientists. Ownership of each pipeline runs the gamut from upstream to downstream users, each imparting views based on their use case. When an organization looks to distribute data to multiple groups or users, the dependency structure created by multiple layers of validations and remediations becomes more significant and potentially unbearable.
Monitors, notifications, and response processes are often-overlooked requirements for making validation results actionable. Without these tools, or a team to maintain your pipelines, there is no way to ensure data quality through review of and response to the issues that are raised. Human oversight is essential for interpreting and remediating issues, while automation and algorithms play an increasing role in bringing scale, efficiency, and accuracy to data operations.
Overall, validations are configured to address as many areas of data quality as possible without reducing processing efficiency, a very delicate balance. The operational burden of validations complicates this balance further. Ultimately, teams make concessions among operations, quality, and timeliness to find their optimal configuration. As teams look to increase their data usage and refine the systems that manage it, the multiple dimensions of data quality and validation become very apparent. The challenges of designing an effective validation system affect both data consumers and producers. Furthermore, these systems are built repetitiously, with the same validations, supporting systems, and resulting processes, across organizations, groups, and users.
What are the consequences of failing to use validations?
As the old adage goes, “Garbage in, garbage out” – or, in the case of the data world, “Garbage in, garbage out – if at all”. It’s no mystery that bad inputs can cause systems to fail, with untold business impact and cost to remediate, but this also affects businesses where data is the output. While data is used in many different ways, all data users have one commonality: the negative impact of bad data on their business. Ramifications of data failures are not always isolated; they can cascade throughout the data supply chain. As data takes an increasingly important role in decision making, the cost of a single failure continues to climb.
There are a number of costs incurred as the result of data failure, both direct and indirect. Direct costs are predominantly for the resources needed to identify and resolve failures when they occur. These can extend to acquiring replacement systems or data if the issue cannot be resolved quickly. Indirect costs are the most significant: firms can lose revenue, miss opportunities, and incur severe losses. Furthermore, the reputational risk of these failures is unquantifiable.
Looking across the financial data industry, the costs of validation are ubiquitous. Each participant, supplier and consumer alike, shoulders the cost of data operations, validation, and remediation. With data usage only set to increase, this inefficiency will continue to slow the market’s data consumption and increase the cost of validation. Crux is in a unique position to alleviate this burden equally across the market, benefitting all participants. Rather than loads of firms all watching for and reacting to the exact same errors in identical copies of the data, Crux does it once, at a world-class level, on behalf of all clients, freeing up time, money, and resources.
How does Crux help validate data?
I. Ensure delivery adheres to schedule
Tracking when data arrives is a simple, yet critical task. Data can appear late or not at all, have an inconsistent schedule, or come piecemeal – with some pieces coming in later than expected. While a discrete schedule that is consistently held is desirable, many update schedules are partially or entirely unknown.
Crux ensures the timeliness of all data in and out of its platform. We actively seek to consume data as quickly as possible from its source while identifying trends in availability schedules, and our round-the-clock operations team monitors our data pipelines 24×7.
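The core of a delivery-schedule check is comparing when a file actually arrived against when it was expected, plus some grace period. The sketch below illustrates that idea; the schedule, grace window, and status names are assumptions for illustration, not Crux’s actual configuration.

```python
# A hedged sketch of a delivery-timeliness check: compare a file's actual
# arrival time against its expected schedule plus a grace period.
# The 30-minute grace window and status labels are illustrative assumptions.
from datetime import datetime, timedelta

def delivery_status(expected, arrived, grace=timedelta(minutes=30)):
    """Classify one expected delivery as ON_TIME, LATE, PENDING, or MISSING."""
    deadline = expected + grace
    if arrived is None:
        # Not here yet: only an alert once the grace window has passed.
        return "MISSING" if datetime.now() > deadline else "PENDING"
    return "LATE" if arrived > deadline else "ON_TIME"

expected = datetime(2024, 1, 2, 6, 0)  # file due at 06:00
print(delivery_status(expected, datetime(2024, 1, 2, 6, 10)))  # ON_TIME
print(delivery_status(expected, datetime(2024, 1, 2, 8, 45)))  # LATE
```

A real system would also learn the expected time from historical arrivals when the vendor’s schedule is partially or entirely unknown, as the section above notes.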
II. Maintain consistent access
Extracting data from a source is prone to a variety of errors and complications. Problems start with connection issues: no access, access that is removed after being initially granted, or connections dropped randomly due to instability. Once a connection is made, limitations such as allowed download speed, concurrency, and permissions inhibit processing. The last major hurdle in extraction relates to how data is managed at the source. It is very common for vendors to maintain limited windows of history, remove or republish files, and even shift them around without warning.
Crux builds data pipelines with direct support from the data supplier (where available) to ensure data is retrieved efficiently. Extraction is configured to be both resilient and timely while notifying Crux Data Operations of any potential breaks in processing. Stakeholders are promptly notified of errors in access or connection while engineers work closely with data suppliers to restore service as quickly as possible.
III. Confirm consistent formats
Data formatting naturally has a high degree of variability. As such, errors are most commonly caused by unexpected differences. Breaks can occur when dates are misformatted, resulting in failed processing; or worse, successful processing of incorrect data such as a date with the month and day flipped. Occasionally file names change or the file comes with no data, setting off a series of processes that delays consumption. From letters in numeric fields to differences in the way NA values are represented – formatting errors are widespread and have the potential to cause tremendous damage, including some which is difficult to detect.
Crux is focused on identifying potential data quality failures at both large and small scales – down to each individual data point. Identifying, validating, and remediating data errors is a core part of Crux’s mission and value proposition. As such, each pipeline is configured to focus on core attributes of the dataset including specific value formats, naming conventions, and special characters. The overall validation process feeds a detailed report to the operations team to monitor and respond to. Historical validations are periodically reviewed and enhancements are proactively made to pipelines to avoid future errors.
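Format checks of the kind described above boil down to strict parsing at the level of individual values: dates that must match a known pattern, numeric fields that must contain numbers, and the many spellings of “missing” that must be recognized. This is a minimal sketch under assumptions; the accepted date format and NA tokens are illustrative, not a Crux specification.

```python
# A minimal sketch of field-format validation: strict date parsing, numeric
# checks, and normalizing the many ways NA values are represented.
# The accepted format ("%Y-%m-%d") and NA token list are assumptions.
from datetime import datetime

NA_TOKENS = {"", "na", "n/a", "null", "none", "nan"}

def check_date(value, fmt="%Y-%m-%d"):
    """Return a parsed date, None for a recognized NA token, or raise ValueError."""
    if value.strip().lower() in NA_TOKENS:
        return None
    # strptime rejects impossible values (e.g. month 13), though it cannot
    # catch a day/month flip that still yields a valid date.
    return datetime.strptime(value.strip(), fmt).date()

def check_numeric(value):
    """Return a float, None for a recognized NA token, or raise ValueError."""
    if value.strip().lower() in NA_TOKENS:
        return None
    return float(value)  # raises on letters in numeric fields

print(check_date("2024-03-01"))  # 2024-03-01
print(check_numeric("N/A"))      # None
```

Failures raised by checks like these would feed the detailed report that the operations team monitors and responds to.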
IV. Verify schema and structure
Appending, removing, or updating column properties are known as schema changes; these are some of the most common changes to data and causes of fatal errors. Depending on the nature of the data, a frame’s schema can change multiple times during its lifetime, forcing teams to respond to frequent system failures. Well-established vendors commonly issue advance notice of change; however, many vendors give no notice or maintain datasets with dynamic schemas. Structural errors are also possible – missing columns, rows, or syntax result in direct failures for consuming processes.
Crux’s structural validations scan through each piece of data record by record to compare against our preconfigured expectations. Errors are carefully raised with supporting metadata to the operations team to follow up with vendors where possible. As errors are observed through time, internal teams carefully adjust processes and configurations to handle the structural nuances of each dataset.
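A schema comparison of this kind can be reduced to diffing the observed columns and types against a preconfigured expectation. The expected schema below is hypothetical, chosen only to illustrate the three kinds of change the section describes: missing columns, unexpected columns, and changed types.

```python
# A sketch of a structural/schema validation: diff an incoming file's
# columns and types against a preconfigured expectation.
# EXPECTED is a hypothetical schema, not an actual Crux configuration.

EXPECTED = {"date": "str", "ticker": "str", "price": "float"}

def schema_diff(expected, observed):
    """Report missing columns, unexpected columns, and type changes."""
    shared = set(expected) & set(observed)
    return {
        "missing": sorted(set(expected) - set(observed)),
        "unexpected": sorted(set(observed) - set(expected)),
        "type_changed": sorted(c for c in shared if expected[c] != observed[c]),
    }

# "price" silently became a string and a new "volume" column appeared.
observed = {"date": "str", "ticker": "str", "price": "str", "volume": "int"}
print(schema_diff(EXPECTED, observed))
# {'missing': [], 'unexpected': ['volume'], 'type_changed': ['price']}
```

An empty diff means the structure matches; anything else would be raised with supporting metadata for the operations team to follow up on.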
V. Manage source interaction and problem resolution
After a validation triggers an alert, a user begins the process of triaging and resolving the data error. In many cases, the user needs to contact the source of the data to ask questions, escalate issues, or request changes. The size and type of data supplier has direct implications for the number of support resources available to downstream consumers. This becomes particularly complex for the supplier, which has to field numerous, repetitious requests from consumers around the clock. When issues arise, the volume, velocity, and vulnerability of this dynamic multiply quickly, with both consumers and suppliers seeking the fastest and simplest solution.
Crux’s position in the industry presents a unique opportunity to help both sides of this relationship. For consumers, it presents a managed data service that catches, researches, and resolves errors – in addition to proactively enhancing its data pipeline monitoring to enhance future service. Suppliers benefit by having a single point of contact for a multitude of clients – freeing up internal resources to solve problems more efficiently and prevent them in the future.
Why does an intermediary benefit data suppliers and data consumers?
The data industry is full of repetitive, fragmented tasks that are undifferentiated between its participants. As a result, the ability to evaluate, integrate, and manage data faces an unnecessary headwind. Through a strategically positioned industry utility, these tasks can be addressed in a scalable manner on behalf of data suppliers and consumers.
Data made available through the Crux platform is readily accessible to consumers. Each product is already configured with Crux validations and actively monitored by operations teams, further reducing the overhead and onboarding costs teams incur to get data into their production environment. With plug-and-play data, evaluating data also becomes extremely simple and more reliable: consumers do not need to spend resources setting up a dataset they will not ultimately use.
How does Crux fit in?
Crux is a utility for the data industry – catering to all participants and reducing barriers to widespread data quality and utilization. We accomplish this by:
- Maintaining industry neutrality. We remain objective and unbiased of the data we work with. Crux exists to benefit all consumers, suppliers, and partners equally.
- Employing a multi-phased validation framework. Validations have different uses and cater to different dimensions of quality. Our multi-stage approach combines automation with human insight to deliver a flexible yet robust system.
- Protecting data quality for everyone. Acting as the industry surge protector, Crux actively identifies errors and seeks to resolve them across all of its pipelines. Clients receive detailed notifications when events occur for transparency and to ensure downstream changes can be made swiftly.
- Monitoring pipelines 24/7 and proactively enhancing reliability. Data operations and reliability engineering engage with suppliers and consumers to address data quality and consistency.
- Empowering suppliers with enhanced capabilities. As a delivery service, we provide distribution channels to modern services without consuming internal resources.
- Streamlining problem management and reconciliation. Reconciling issues directly with suppliers to help them construct better products through consolidated and interpreted user feedback.
- Integrating with strategic partners to provide complete solutions. Tapping into industry leading analytic and cloud solutions allows clients to receive the data they want where they want it.
- Removing setup time and cost. For data consumers, the sunk cost of configuring a dataset for evaluation or use is removed. Data on Crux is readily available and persistently managed.
Why does a cloud solution make sense?
Crux was built as a cloud-based solution for two reasons: flexibility and scalability.
Flexibility is essential. While it’s a bit of a cliché, Crux’s position as a utility requires us to address the varying needs of the industry. We work with leaders in computing, finance, and analytics to make data available in the latest platforms and tools. Because we are cloud based, we can pivot quickly with both sides of the market, keeping our platform future-proof.
With scalability, we consistently work with robust systems that ensure what we construct is reliable and resilient. As data needs grow, a cloud-based platform, rather than an internal physical system, enables scaling as a data consumer’s big-data appetite inevitably grows.
Need custom validations?
Learn more about transformations and custom validations by connecting with the Crux team at firstname.lastname@example.org or completing the form below: