
Tackling External Data Integration For a Competitive Advantage

Having multiple external data sources that are linked to each other and to strategic internal data allows a related set of questions to be asked with full context. The combined data sources allow true critical-thinking strategies, such as the 5 Whys, which quickly identify the root issue, to be applied to the whole problem, including the environment and market, not just a technical or single organizational silo.

The competitive advantage for an innovator can be realized because the chain of critical thinking has access to more, and better, information, something traditional organizations that limit themselves mainly to internal data can never see.

Companies solve large problems by quickly creating an Agile environment for integrating data capture into the decision-making process. Within a few weeks, innovators can derive value and start to learn what is and is not working. In many cases the original assumptions and the list of needed data change often as new insights are gained. At each stage, scientific thinking is applied to understand whether the data and the applied process answer a piece of the question, and what should happen next.

During the initial data capture, 1-3-week sprints are run and re-run continuously, refining previous understanding and branching out into related areas.

Here is an example. At first glance, weather and associated local events such as tornadoes, hurricanes, and earthquakes have obvious correlations to longer delivery times. However, if products are being manufactured in different countries, then specific news events, changing economics, and anything that occurs between manufacturer and consumer can all potentially signal delays. As a result, a widening of scope, in terms of datasets, naturally happens.

At this point the potential value of external data should be clear. Yet to harness it you have to solve two challenges:

  • Data Supply Problem – the challenge of rapidly understanding many disparate data sources of varying potential value and making them innovation-ready.
  • External Data in Production – ensuring that the management of external data in production does not become too onerous.

Data Supply Challenge

The upstream pressure on data engineers supporting these global projects quickly shows that speed to data is the biggest problem. To explain the problem in simple terms, let's categorize any external dataset by three levels of complexity, rated low to high as L1C, L2C, and L3C. For each dataset we can assume there are two critical timelines: the time to generate a sample dataset and the time to create a production-ready dataset.

  • L1C - the simplest type of data to ingest, often available publicly from a website, with between 1 and 3 schemas and varying levels of documentation. This may take 5-15 days to generate a sample dataset and 3-4 weeks to create a production-ready dataset.
  • L2C - may involve different interactions, such as a combination of APIs that need to be explored to join and process multiple schemas. This may take 10-25 days to generate a sample dataset and 4-10 weeks to create a production-ready dataset.
  • L3C - usually an unknown data source, bigger than L2C; the work is custom for each source and can involve thousands of files/schemas, so it is broken down into smaller sprints based on schema subsets and relevance.
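
To make the tiers concrete, the timelines above can be captured as simple metadata for planning purposes. This is only an illustrative sketch in Python using the figures quoted above; the class and its fields are not part of any real tooling.

```python
from dataclasses import dataclass

@dataclass
class ComplexityTier:
    """Rough onboarding estimates for an external dataset, per the tiers above."""
    name: str
    sample_days: tuple        # (min, max) days to generate a sample dataset
    production_weeks: tuple   # (min, max) weeks to a production-ready dataset

# Figures taken from the L1C/L2C descriptions above; L3C is too custom to bound.
TIERS = {
    "L1C": ComplexityTier("L1C", sample_days=(5, 15), production_weeks=(3, 4)),
    "L2C": ComplexityTier("L2C", sample_days=(10, 25), production_weeks=(4, 10)),
}
```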

The sample dataset is where the data engineer reviews the source, understands the schema and dependencies, and generates an initial dataset for review by the data scientist. On the first iteration, the innovator can tell from the schema and the initial data whether it can potentially solve the current problem. If it is relevant, the dataset is refined in terms of the validation and transformations performed, which takes a few small iterations over the next few days or weeks to finalize. With the data specification complete, data engineers then bring a historical dataset from the data supplier and make it available for innovation; depending on the problem being solved, the history could cover the last month, the last year, or decades. The more information collected and analyzed, the more accurate the derived model.

Ingesting historical information adds further complexity to data integration. The first issue is the time to compute it. If there are potentially decades of information, the work of downloading and applying validation and transformation can be huge, potentially more than the perceived value of the dataset. Historical data also means that data engineers must understand any schema changes that have happened over the years; many data suppliers process a subset of data once and make it available forever. As the supplier starts to derive new value, new attributes appear, others disappear, and some take on completely different meanings or formats. A data engineer tasked with “just downloading the historical data” must engineer a completely new process for each schema change, sometimes meaning 2-5,100 versions of the same process to manage the schema alone.
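
One common way to keep those schema changes manageable is to register a transform per schema version and dispatch on the version detected for each historical file. The sketch below assumes hypothetical column names and helper functions; the real pipeline will differ for every supplier.

```python
import pandas as pd

# Hypothetical per-version transforms: each maps a raw historical file
# into the current target schema (column names, units, formats).
def transform_v1(df: pd.DataFrame) -> pd.DataFrame:
    return df.rename(columns={"dlvry_days": "delivery_days"})

def transform_v2(df: pd.DataFrame) -> pd.DataFrame:
    # v2 already uses the new column name but reports hours instead of days.
    df = df.copy()
    df["delivery_days"] = df["delivery_hours"] / 24
    return df.drop(columns=["delivery_hours"])

TRANSFORMS = {"v1": transform_v1, "v2": transform_v2}

def normalize(df: pd.DataFrame, schema_version: str) -> pd.DataFrame:
    try:
        return TRANSFORMS[schema_version](df)
    except KeyError:
        # A new, unseen schema version should fail loudly rather than silently
        # producing misaligned history.
        raise ValueError(f"No transform registered for schema {schema_version}")
```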

From a staffing perspective, this means organizations need several data engineers per data scientist, or data scientists will be back to spending 70-90% of their time processing data rather than deriving insights. Alternatively, the number of data sources that can be used at any time will be limited by the stages each dataset must pass through to be onboarded and made ready; plan for at least 6 weeks, or two sprint cycles, before a new L1C dataset is available for innovation. Assuming 2-8 datasets are identified as candidates in early sprints as the problem is understood, with roughly 70% rejected at first look, a pool of about 10 data engineers is quickly required to start and maintain a steady flow to the innovators if the datasets are all L1C; if even a couple are L2C or above, those resources will be consumed quickly and more developers or contractors will be required.
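
As a back-of-the-envelope illustration of that staffing arithmetic, the sketch below uses the figures above (6 weeks per L1C dataset, 3-week sprints, ~70% rejection). The assumption that rejected candidates still consume about half the onboarding effort is purely illustrative, not a measured figure.

```python
def engineers_needed(candidate_datasets: int,
                     weeks_per_dataset: float = 6.0,
                     sprint_weeks: float = 3.0,
                     rejection_rate: float = 0.7) -> float:
    """Rough estimate of data engineers needed to keep candidates flowing."""
    # Assumption: a rejected candidate still consumes ~half the onboarding
    # effort before it can be ruled out.
    effort_weeks = candidate_datasets * (
        rejection_rate * weeks_per_dataset * 0.5
        + (1 - rejection_rate) * weeks_per_dataset
    )
    return effort_weeks / sprint_weeks  # engineers working in parallel per cycle

print(engineers_needed(8))  # ~10 engineers if all 8 candidates are L1C
```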

The final two challenges on the supplier side are contracts and information security. As mentioned above, the variety of data sources quickly consumes engineering resources, which can “simply” be addressed by reassigning existing resources, adding temporary contractors to the team, or outsourcing to a managed service provider to accelerate data integration. Supplier-side contracts become an ever-increasing issue for any organization: the data has to be procured before it can be used, and even many public datasets require an agreement before the complete dataset can be accessed. Reaching a legal agreement with a new supplier can take weeks, losing much time before the engineer can start.

In many cases the innovators want data to be placed in their own environments, leveraging the cloud investments many organizations have made. This often requires a unique ingest pattern from each data provider to the innovator, which creates a new potential threat surface and a security audit for each provider relationship.

The final hurdle is privacy. The act of ingesting data from an unknown external source, especially a foreign entity, can draw scrutiny from security and privacy analysts ensuring no undue risk is incurred by the organization. As a simple rule of thumb, a privacy violation can be costed at $750 per record ingested, although it could be higher in larger breaches. This cost can be substantial if you consider that an average dataset is made up of XX records, which leads to a privacy violation of $$$.
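
The exposure calculation itself is trivial; the point is that it scales linearly with record count. The record count below is purely hypothetical, since the figure is left open above.

```python
COST_PER_RECORD = 750  # rule-of-thumb privacy-violation cost per record, as above

def privacy_exposure(records_ingested: int) -> int:
    """Worst-case exposure if every ingested record is implicated in a violation."""
    return records_ingested * COST_PER_RECORD

# Hypothetical example: a 1,000,000-record dataset
print(f"${privacy_exposure(1_000_000):,}")  # $750,000,000
```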

External Data in Production

Some of the same data supply issues carry into production, while a series of new ones surface. Here, production means the initial innovation has been completed and the derived data product(s) have been deployed to a point where they directly impact the business, offering some combination of reduced risk, reduced cost, or increased revenue.

Schema evolution, as discussed earlier, can quickly increase the effort to ingest a single dataset, because the longer the history, the more changes there are. The same changes continue to occur as new data is added, but unlike before, they now have the potential to break upstream processes; if they are not documented, this can cause unexpected outages.
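
A lightweight guard is to validate each new delivery against the expected schema before it reaches upstream processes, so an undocumented change raises an alert instead of causing a silent outage. A minimal sketch; the expected column set is illustrative.

```python
import pandas as pd

EXPECTED_COLUMNS = {"event_id", "event_date", "region", "delivery_days"}  # illustrative

def check_schema(df: pd.DataFrame) -> None:
    """Fail fast on undocumented schema changes in a new delivery."""
    incoming = set(df.columns)
    missing = EXPECTED_COLUMNS - incoming
    unexpected = incoming - EXPECTED_COLUMNS
    if missing or unexpected:
        raise RuntimeError(
            f"Schema change detected: missing={sorted(missing)}, "
            f"unexpected={sorted(unexpected)}"
        )
```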

Model drift is a subtle production error that causes deployed ML/AI models to produce gradually less effective results. Because the errors grow slowly over time, they are hard to notice and their downstream effects can be difficult to trace. The most common cause of model drift is that newly received data exhibits characteristics different from the data the idea was originally built on. Organizations therefore need to put statistical guidelines in place that catch the deviation and raise an alert.
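
One common statistical guideline is to compare the distribution of an incoming feature against the distribution the model was trained on, for example with a two-sample Kolmogorov-Smirnov test, and alert when the deviation crosses a threshold. The feature, threshold, and data below are assumptions for illustration only.

```python
import numpy as np
from scipy.stats import ks_2samp

def drift_alert(training_values: np.ndarray,
                recent_values: np.ndarray,
                p_threshold: float = 0.01) -> bool:
    """Return True if recent data deviates significantly from the training distribution."""
    statistic, p_value = ks_2samp(training_values, recent_values)
    return p_value < p_threshold

# Hypothetical usage: a delivery-time feature drifting upward
baseline = np.random.normal(loc=5.0, scale=1.0, size=5_000)
recent = np.random.normal(loc=6.5, scale=1.2, size=1_000)
if drift_alert(baseline, recent):
    print("Model drift suspected: recent feature distribution has shifted")
```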

Data Integration: The XDaaS Model by Crux

External (X) Data as a Service is the goal of Crux: delivering external data to innovators in production when it is required, regardless of the consuming organization’s resources. This frees an organization to truly blend internal and external datasets at the speed of thought, removing the friction of, and dependencies on, internal data resources when addressing business challenges in pre- and post-production.

Data integration can be easy when the right outside contractor or managed data services provider is involved. It is critical to bring in supplemental help that has the breadth of knowledge and in-house expertise to get the job done, so that existing company resources can focus on getting value from the data to make the right decisions.

We have pre-built pipelines to hundreds of data sources and over 24,000 datasets to accelerate access to the most used data, and we build specific data pipelines to deliver data at tailored frequencies to the location of our customers’ choice (often a cloud, but we support a range of capabilities). For some of our larger enterprise customers, we manage thousands of datasets across many suppliers to remove the complexity and burden of management from their teams. Clients look to us to handle the details for them, from onboarding, prep, and enhancement through delivery, monitoring, and maintenance of this data. If you are interested in learning more about how we might be able to help your organization, contact us.

To learn more about Crux and its data engineering and operations managed services, contact us.  

This is part 2 of a 2-part series - read part 1.


