
Using information technology to make data useful is as old as the Information Age. The difference today is that the volume and variety of available data have grown enormously. Big data gets almost all of the attention, but there’s also cryptic data. Both are difficult to harness using basic tools and require new technology to help organizations glean actionable information from the large and chaotic mass of data. “Big data” refers to extremely large data sets that may be analyzed computationally to reveal patterns, trends and associations, especially those related to human behavior and interaction. The challenges in dealing with big data include having the computational power that can scale to the processing requirements for the volumes involved, analytical tools that can work with the large data sets and the governance necessary to manage the large data sets so that the results of the analysis are accurate and meaningful. But that’s not all organizations have to deal with now. I’ve coined the term “cryptic data” to focus on a different, less well-known sort of data challenge that many companies and individuals face.

Cryptic data sets aren’t easy to find or easy to access by the people who could make use of them. Why “cryptic”? As a scuba diver, I donate time to Reef Check by doing scientific species counts in and around Monterey Bay, Calif. Cryptic organisms are ones that hide out deep in the cracks and crevices of our rocky reefs. Finding and counting them accurately is time-consuming and requires skill. Similarly, it’s difficult to locate, access and collect cryptic data routinely. Because it’s difficult to locate or access routinely, those who have it can gain a competitive advantage over those who don’t. The main reason cryptic data is largely untapped is cost vs. benefit: The time, effort, money and other resources required to manually retrieve it and get it into usable form may be greater than the value of having that information.

By automating the process of routinely collecting information and transforming it into a usable form and format, technology can expand the range of data available by lowering the cost side of the equation. So far, most tools, such as Web crawlers, have been designed to be used by IT professionals. Data integration software, also mainly used by IT departments, helps transform the data collected into a form and format in which analysts can use it to create mashups or build data tables for analysis to support operational processes. Data integration tools mainly work with internal, structured data, and a majority have little or no capability to support data acquisition on the Web. Tools designed for IT professionals are a constraint on making better use of cryptic data because business users, not IT staff, are the subject matter experts. They have a better idea of the information they need and are in a better position to understand the subtleties and ambiguities in the information they collect. To address this constraint, Web scraping tools (what I call “data drones”) have appeared that are designed for business users. They use a more visual user interface design and hide some of the complexity inherent in the process. They can automate the process of collecting cryptic data and expand the scope and depth of data used for analysis, alerting and decision support.
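
To make this concrete, here is a minimal sketch in Python of the kind of collection step such tools automate behind their visual interfaces. The URL, the page markup and the output file are hypothetical, invented purely for illustration; a production data drone would also need to respect a site’s terms of use.

    # A minimal sketch of automated collection from a hypothetical competitor
    # page that lists products in simple HTML markup. The URL and CSS selectors
    # are placeholders, not a real site.
    import csv

    import requests
    from bs4 import BeautifulSoup

    URL = "https://example.com/competitor/products"  # hypothetical page

    response = requests.get(URL, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    rows = []
    for item in soup.select("div.product"):          # hypothetical markup
        name = item.select_one("span.name")
        price = item.select_one("span.price")
        if name and price:
            rows.append({"name": name.get_text(strip=True),
                         "price": price.get_text(strip=True)})

    # Write the collected list to a file an analyst can open in a spreadsheet.
    with open("competitor_products.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "price"])
        writer.writeheader()
        writer.writerows(rows)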

Cryptic data can be valuable because, when collected, aggregated and analyzed, it provides companies and individuals with information and insight that were previously unavailable. This is particularly true of data sets gathered over time from a source or combination of sources, which can reveal trends and relationships that otherwise would be difficult to spot.

Cryptic data can exist within a company’s firewall (typically held in desktop spreadsheets or other files maintained by an individual as well as in “dark” operational data sets), but usually it is somewhere in the Internet cloud. For example, it may be

  • Industry data collected by some group that is only available to members
  • A composite list of products gathered from competitors’ websites
  • Data contained in footnotes in financial filings that are not collected in tabular form by data aggregators
  • Tables of related data assembled through repetitive queries of a free or paid data source (such as patents, real estate ownership or uniform commercial code filings); a sketch of this kind of repeated-query assembly follows below.
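
To illustrate the last item on the list, here is a sketch in Python of assembling one table from repeated queries. The search endpoint and the JSON fields are hypothetical; a real patent, real estate or UCC source would have its own API, terms of use and rate limits.

    # A sketch of building one table from repetitive queries of a hypothetical
    # JSON search endpoint; the endpoint, parameters and fields are invented.
    import csv
    import time

    import requests

    SEARCH_URL = "https://example.org/api/filings"   # hypothetical endpoint
    COMPANIES = ["Acme Corp", "Globex", "Initech"]   # the repetitive part

    rows = []
    for company in COMPANIES:
        resp = requests.get(SEARCH_URL, params={"q": company}, timeout=30)
        resp.raise_for_status()
        for filing in resp.json().get("results", []):
            rows.append({"company": company,
                         "filing_date": filing.get("date"),
                         "filing_type": filing.get("type")})
        time.sleep(1)  # pause between queries to avoid hammering the source

    with open("filings.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["company", "filing_date", "filing_type"])
        writer.writeheader()
        writer.writerows(rows)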

Along these lines, our next-generation finance analytics benchmark research shows that companies have limited access to information about markets, industries and economies. Only 14 percent of participants said they have access to all the external data they need. Most (63%) said they can access only some of it, and another 14 percent said they can’t access any such data. In the past, this lack of access was even more common, but the Internet changed that. And this type of external data is worth going after, as it can help organizations build better models, perform deeper analysis or do better in assessing performance, forecasting or gauging threats and opportunities.

Cryptic data poses a different set of challenges than big data. Making big data usable requires the ability to manage large volumes of data. This includes processing large volumes, transforming data sets into usable forms, filtering extraneous data and coding data for relevance or reliability, to name some of the more common tasks. To be useful, big data also requires powerful analytic tools that handle masses of structured and unstructured data and the talent to understand it. By contrast, the challenge of cryptic data lies in identifying and locating useful sources of information and having the ability to collect it efficiently. Both pose difficulties. Whereas making big data useful requires boiling the ocean of data, cryptic data involves collecting samples from widely distributed ponds of data. In the case of cryptic data, automating data collection makes it feasible to assemble a mosaic of data points that improves situational awareness.

Big data analysis typically relies on data scientists to tease meaning out of the masses of data (although analytics software vendors have been working on making this process simpler for business users). Cryptic data analysis is built on individual experience and insight. Often, the starting point is a straightforward hypothesis or a question in the mind of a business user. It can stem from the need to periodically access the same pools of data to better understand the current state of markets, competitors, suppliers or customers. Subject matter expertise, an analytical mind and a researcher’s experience are necessary starting capabilities for those analyzing cryptic data. These skills facilitate knowing what data to look for, how to look for it and where to look for it. Although these qualities are essential, they are not sufficient. Automating the process of retrieving data from sources in a reliable fashion is a must because, as noted above, the time and expense required to acquire the data manually are greater than its value to the individual or organization.

Almost from the dawn of the Internet, Web robots (or crawlers) have been used to automate the collection of information from Web pages. Search engines, for example, use them to index Internet pages, while spammers use them to collect email addresses. These robots are designed and managed by information technology professionals. Automating the process of collecting cryptic data requires software that business people can use. To make accessing cryptic data feasible, they need “data drones” that can be programmed by users with limited training to fetch information from specific Web pages. Tools available from Astera ReportMiner, Connotate, Datawatch, import.io, Kofax Kapow and Mozenda are good examples of where to get started in leveraging cryptic data. I recommend that everyone who has to routinely collect information from Internet sites or from internal data stores that are hard to access, or who thinks they could benefit from using cryptic data, investigate the tools available for collecting it.
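
None of the vendor tools above works this way specifically, but as a rough illustration of what they automate, even a general-purpose library such as pandas can pull an HTML table from a Web page into a usable data set. The URL here is hypothetical, and pandas relies on lxml or html5lib for the parsing.

    # A rough, generic illustration of pulling tabular data from a Web page;
    # the URL is hypothetical and this is not how the vendor tools above work.
    import pandas as pd

    # read_html returns a list of DataFrames, one per <table> found on the page.
    tables = pd.read_html("https://example.com/industry/stats")  # needs lxml or html5lib
    industry_stats = tables[0]
    industry_stats.to_csv("industry_stats.csv", index=False)
    print(industry_stats.head())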

Regards,

Robert Kugel – SVP Research

My colleague Mark Smith and I recently chatted with executives of Tidemark, a company in the early stages of providing business analytics for decision-makers. It has a roster of experienced executive talent and solid financial backing. There’s a strategic link with Workday that reflects a common background at the operational and investor levels. As it gets rolling, Tidemark is targeting large and very large companies as customers for its cloud-based system for analyzing data. The system can automate alerts and enhance operating visibility, collaboratively assess the potential impacts of decisions and support the process of implementing those decisions.

Tidemark’s product fits into the performance management/decision support category but has several points of differentiation. One is that it enables larger enterprises to work interactively with a broad set of operating and financial metrics rather than working with multiple legacy business intelligence and reporting systems. It can integrate enterprise data from ERP, CRM, supply chain, logistics, maintenance, real estate and other systems into a single cloud-based analytic and data store in a way that ensures there is master data control from source systems to the new warehouse, and that uses common semantic-layer definitions consistently across all processes and systems.

The business purpose of marshalling all the data in one place is to take advantage of recent advances in managing large-scale data using new database techniques that we have assessed. These advances, along with in-memory data processing, lower the cost of doing so and make it feasible for large organizations to work with operational and financial information much more interactively. Because this can reduce the time required to assemble information (in reports and dashboards), organizations gain time to focus on making the right decision and implementing it. Rather than struggling to lash together sets of fragmented and siloed data every time a new set of analytics is needed, the right data in the right format and context is available at any level for various purposes, including integrated business planning, predictive analytics, and analysis of trends, cost-to-serve metrics, customer profitability and product profitability. The same data set can be used for dashboards, performance scorecards and operational risk measurement and management. Executives, managers and analysts can go beyond drilling down into the details of what just happened. For example, they can do real-time collaborative contingency planning to explore in detail the impacts of several courses of action during a one-hour meeting.

Tidemark’s second point of differentiation is to make the user experience as close to that of a consumer application (and as far away from traditional business software) as possible. Design has become an increasingly important element in industrial products that once were sold mainly on functional considerations rather than usability. I expect the same is finally coming true in software as past limitations on design melt away with advances in the underlying technology. I also put tablet and mobility support in this category, because these have become table stakes for enterprise applications. An individual vendor’s advantage in mobility may be built initially on technical capabilities, but competitors can quickly replicate it. User interface design, however, can be a source of lasting advantage.

Tidemark has elected to use the HTML5 Internet technology standard so it can support as many Web browsers, smartphones and tablets as possible. In this respect, I think its executives decided that the underlying technology is a commodity and therefore it’s best to embrace a broad standard. This may prove to be a smart bet, or it could be a sticking point, since the product will not use the native capabilities of mobile platforms. Notably, it will not be part of the Apple ecosystem, which I think may hurt Tidemark today but may be of no consequence in the future.

Another new, consumer-like feature is the visual metaphor of using sentences to describe what you are doing and seeing with business analytics, which adds to usability for business people. In this form, analysis, planning and review activities fit the rhythms of how people naturally do and think about business. This approach makes it simple to understand what a business user is asking of the system and what it has presented.

A third point of differentiation from standard performance management offerings is having advanced analytics integrated into the application. Because comprehensive real-time and near-real-time finance and operations data are available for in-memory processing, it’s feasible to employ predictive analytics to improve forecast accuracy and to provide a comprehensive range of alerts when results diverge materially from what’s expected. Such a structure facilitates creating driver-based models for planning and budgeting. Moreover, Tidemark includes risks in its scorecard presentation. All business decisions involve risk, yet risks are rarely part of a balanced scorecard. Scorecards are balanced precisely because they cover trade-offs (for example, in a call center, balancing time-on-call with first-call resolution). Although some risks are implicit in every item on a scorecard (that is, the risk of not achieving a key objective), others are indirect or less than obvious, such as deferring scheduled maintenance or operating above capacity limits to accommodate unexpected orders. That noted, it’s impossible to say, based on this introduction, just how broad and deep Tidemark’s capabilities (especially industry-specific metrics, advanced analytics and risk measures) will be in the initial release. Operational risk analytics are likely to be elementary because, outside of financial services, that is the (sad) state of the art at this moment.
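
To illustrate the general idea rather than Tidemark’s implementation, here is a minimal Python sketch of a driver-based revenue model with a simple materiality alert; the drivers, figures and threshold are invented for the example.

    # A generic sketch of a driver-based plan with a materiality alert.
    # The drivers, figures and threshold are invented for illustration and
    # do not describe Tidemark's actual model or data.

    def planned_revenue(units: float, price: float, discount_rate: float) -> float:
        """Revenue modeled from its drivers rather than entered as a fixed number."""
        return units * price * (1.0 - discount_rate)

    plan = planned_revenue(units=10_000, price=25.0, discount_rate=0.05)  # 237,500
    actual = 205_000.0                                                    # reported result

    variance = (actual - plan) / plan
    THRESHOLD = 0.10  # alert when results diverge more than 10% from plan

    if abs(variance) > THRESHOLD:
        print(f"ALERT: revenue off plan by {variance:.1%} "
              f"(plan {plan:,.0f}, actual {actual:,.0f})")
    else:
        print(f"Revenue within {THRESHOLD:.0%} of plan ({variance:+.1%})")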

It’s still early days for Tidemark, so file the comments above under “excellent if true.” There are a number of big potential snags that the company and its users may confront when they put their system in place.

First on the list are the performance and scalability proof points that must be demonstrated in full-scale deployments. I’m not sure this will be a big issue, since the basic technology foundations (cloud data and applications, in-memory processing and core analytics) that Tidemark rests on are solid. More likely, the challenge to users and Tidemark will be the snags that are inevitable when dealing with large organizations’ IT infrastructure (nonstandard legacy systems, for instance), as well as the ongoing maintenance issues that might cause performance and other system metrics to fall short of what customers expect. Or performance may be adequate for most complex analytical tasks but not for, say, organizations that do complex sales and operations planning involving large numbers of rapidly changing stock-keeping units (SKUs). At this point, I doubt any of these issues will prove to be overall showstoppers, but they may limit the software’s appeal.

A more fundamental issue that I see is whether and to what degree companies will change how they operate to take advantage of the technology available. Just because something is technically feasible doesn’t mean that companies will change old ways of doing things. Our research shows few companies (just 6%) use driver-based plans, which, from a practical standpoint, are the only way to do agile planning. Outside of financial services, few companies understand how to measure and manage operational risks beyond the major ones that are already incorporated in business reviews (such as failure to meet sales and profit targets). Attenuating the impact of unfavorable events collaboratively with established risk mitigation scripts (something that is feasible in Tidemark) is not well practiced outside of the military or the parts of companies facing heavy safety and health liabilities.

The next fundamental issue is data governance. As I look at it, the initial system setup can, with varying degrees of difficulty, overcome the accumulated impact of sloppy data management practices. Thereafter, there can be considerable cost involved in maintaining the integrity of the comprehensive data unless the company is rigorous about data governance.

The last fundamental issue for some companies may be the total cost of ownership. Tidemark’s initial pricing projections are competitive with on-premises systems. However, companies with sprawling, disjointed enterprise system infrastructures and poor data stewardship probably will have to spend quite a bit more, both up front and for ongoing maintenance, for a system that includes data from a broad set of enterprise systems. Since few companies have mature data governance practices, dealing with data issues may prove to be a big cost factor in terms of the time it takes to agree on definitions and representations of the data.

On balance, my initial impression of Tidemark is positive. There’s a great deal of promise and more than a few things that its established competition might want to consider copying. Stay tuned.

Regards,

Robert Kugel CFA – SVP of Research
