Robert Kugel's Analyst Perspectives

A Data Pantry Speeds Development of Machine Learning Models

Written by Robert Kugel | Jul 12, 2022 10:00:00 AM

A few years ago – somewhat tongue in cheek – I began using the term “data pantry” to describe a type of data store that’s part of a business application platform, created for a specific set of users and use cases. It’s a data pantry because, unlike a general-purpose data store such as a data warehouse, everything the user needs is readily available and easily accessible, with labels that are immediately recognized and understood.

A pantry is consistent with the data mesh architecture concept. It is especially suited to business software that describes itself as a platform because, typically, platforms are designed to work with any number of other applications or data sources, using application programming interfaces to automate the integration of processes and data. Eliminating the need for manual integration of data is important because our Analytics and Data Benchmark Research reveals that individuals spend a considerable portion of their time preparing data for analysis and reviewing it for quality and consistency issues, activities that are no longer necessary when a data pantry is available.

Increasingly, business applications – especially those involved in planning and analytics – describe themselves as platforms. Unlike the older connotation of a platform – upon which additional functionality is developed – platforms currently in use are in effect suspended across multiple computing systems to facilitate process or data interactions between systems. This could include any form of forecasting that uses historical data from multiple, disparate sources to better inform projections and present historical results, such as a business planning platform that uses data sourced from human capital management, disparate enterprise resource planning systems, supply chain or customer relationship management systems to support what Ventana Research calls integrated business planning.

A data pantry makes it possible for analysts and business users to immediately access all necessary data gathered from multiple sources for analysis and reporting, without the need to repeatedly extract, enrich or transform that data. This delivers all of the information needed in a consistent and useful form and format. Data pantries are becoming increasingly common in business applications as software vendors adopt a platform approach to their architecture, although so far no one else is using this term. For example, in 2018, Ventana Research gave Workiva an Innovation Award for its Wdata offering because this method of data aggregation is especially useful for reporting corporate data from the wide range of systems necessary for, say, statutory reporting to securities regulators or regulatory bodies. Having a broader set of accessible data is also useful for creating richer and more insightful analyses and for communicating information and insights across an organization.

More recently, it struck me that another compelling use case for a data pantry is to support the machine learning necessary for training artificial intelligence capabilities that are part of a business application. This is likely to be the case where the authoritative source or sources of data necessary for training the system using machine learning exist both inside and outside of the application. The need to support machine learning will increase significantly over the next three years: I assert that by 2025, almost all vendors of software designed for finance organizations will have incorporated some AI capabilities to reduce workloads and improve performance. This will especially be the case for planning and predictive analytics purposes.

In this respect, a data pantry has properties similar to what data scientists call a “feature store.” ML uses “features” (statisticians prefer the term “variables”) to build, train and adjust models capable of making performant predictions based on past experiences. A feature is a distinct and measurable characteristic of a phenomenon – for example, the factors that drive demand for a specific product or that predict the price sensitivity of a buyer. Typically, a feature store ingests raw data taken from data warehouses, streaming data sources, applications and other data sources, and then transforms the data to make it usable for discovering and testing inferences as well as for training the system.
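As a rough illustration of the ingest-and-transform step described above, here is a minimal Python sketch. The source systems, field names and the build_features function are all hypothetical, chosen only to show raw records from two systems being reshaped into one feature row per customer. It is not how any particular feature store or vendor product works.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw records from two source systems (illustrative data only).
erp_orders = [
    {"customer_id": "C1", "amount": 120.0, "discount_pct": 0.0},
    {"customer_id": "C1", "amount": 80.0, "discount_pct": 10.0},
    {"customer_id": "C2", "amount": 200.0, "discount_pct": 20.0},
]
crm_accounts = [
    {"customer_id": "C1", "segment": "smb"},
    {"customer_id": "C2", "segment": "enterprise"},
]


def build_features(orders, accounts):
    """Join and aggregate raw records into one feature row per customer."""
    orders_by_customer = defaultdict(list)
    for order in orders:
        orders_by_customer[order["customer_id"]].append(order)

    segments = {a["customer_id"]: a["segment"] for a in accounts}

    features = {}
    for customer_id, rows in orders_by_customer.items():
        features[customer_id] = {
            "order_count": len(rows),
            "avg_order_amount": mean(r["amount"] for r in rows),
            # Share of orders placed at a discount: a crude, assumed proxy
            # for price sensitivity.
            "share_discounted": sum(r["discount_pct"] > 0 for r in rows) / len(rows),
            "segment": segments.get(customer_id, "unknown"),
        }
    return features


feature_table = build_features(erp_orders, crm_accounts)
```

Each entry in feature_table is a labeled, ready-to-use row, which is the property a pantry is meant to provide: the model builder works with "avg_order_amount" rather than with the raw order lines.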

Feature stores and data pantries continually access primary sources to select, extract, clean and transform data related to the features or variables used in modeling. They are necessary where heterogeneous data sets are used in machine learning because the data almost always must be transformed into a structure and format that facilitates creating and updating models. Staging the data in feature stores and data pantries minimizes time lags that can occur when data moves back and forth between systems, especially when those data movements require some form of transformation. And minimizing lags is essential to creating practical value: Machine learning computation windows are often measured in seconds or minutes, whether the task is responding to queries, generating forecasts or performing full system learning cycles.
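The staging idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a description of any product: the StagedFeatures class, the slow_source_extract function and the 60-second freshness window are all hypothetical. The point is only that requests are served from a staged copy, and the expensive extract-and-transform step runs again only once that copy is older than an agreed age.

```python
import time


class StagedFeatures:
    """Keep a staged copy of transformed data; reload only when it is stale."""

    def __init__(self, loader, max_age_seconds, clock=time.monotonic):
        self._loader = loader            # hypothetical extract-and-transform step
        self._max_age = max_age_seconds  # how long a staged copy stays fresh
        self._clock = clock              # injectable so the demo needs no sleep
        self._data = None
        self._loaded_at = None

    def get(self):
        now = self._clock()
        stale = self._loaded_at is None or now - self._loaded_at >= self._max_age
        if stale:
            self._data = self._loader()
            self._loaded_at = now
        return self._data


source_calls = []  # records each trip back to the source systems


def slow_source_extract():
    """Stand-in for a costly pull and transform from the primary sources."""
    source_calls.append(1)
    return {"C1": {"avg_order_amount": 100.0}}


fake_now = [0.0]  # simulated clock, in seconds
store = StagedFeatures(slow_source_extract, max_age_seconds=60,
                       clock=lambda: fake_now[0])

first = store.get()    # loads from the source
second = store.get()   # served from the staged copy, no source round trip
fake_now[0] = 61.0     # simulated time passes beyond the freshness window
third = store.get()    # staged copy is stale, so it is refreshed
```

In this sketch three requests result in only two trips to the source. How long a staged copy may be reused is a design choice that depends on how quickly the underlying data changes and how current the model's inputs need to be.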

Data pantries are similar to feature stores, but they are a distinct construct, also designed to support a wide range of analytical, business intelligence and reporting tasks. And, because they are created for a specific domain, the challenges data scientists typically encounter in feature stores – including data source integration, modeling constraints or inflexible data structures – are far less of an issue.

I recommend that all vendors of business software – especially those with applications using data from multiple authoritative sources, and particularly those that incorporate AI capabilities – have a “data pantry” as part of the application architecture. I also recommend that organizations become familiar with the benefits of this type of integrated data structure and include it in an evaluation of software vendors’ offerings and roadmap, making it a part of the selection criteria.

Regards,

Robert Kugel