Advances in information technology over the past five decades have been nothing short of breathtaking. While this offers tremendous opportunities, it also creates some difficulties, for computing on a vast scale generates data at rates faster than can be managed, understood, or analyzed. That is why, though unit storage costs are going down every year, many large companies are experiencing increased total storage costs. One large financial-services company, in fact, saw its data stores grow from four to 40 petabytes in just the last two years.
Welcome to the “Big Data” era. In many ways, big data is a new frontier connecting consumers and companies, from which communications and activity can be mined to deliver personalized, relevant offers and messages, all executed with unprecedented speed, automation, and intelligence. The opportunities are vast.
Experienced CIOs see this opportunity in context. They know that using big data to deliver real business results will require a focused strategy that leverages and protects their existing data assets, develops new capabilities that are production-ready and reusable, and manages the deluge of new data that will be created in the process.
For many companies, the recent explosion in data is not a result of increased business transactions or better use of information and analytics. Rather, it’s the result of unmanaged replication. Large email attachments that are broadly distributed, hundreds of extracts from production systems sent nightly to departmental managers, and unclear archive-and-purge processes all drive data growth without necessarily creating any new information. The value of big data comes almost exclusively from new information and insights, not copies of existing data, and there are three main ways to get started down the right path.
First, separate the signal from the noise. Begin reducing the noise by locking down and simplifying the data environment with information lifecycle management, data governance, and master data management.
Second, it is critical to identify (even broadly) what new information and insights big data can provide and how that will impact the business, your business. Case studies illustrate this in action. Some of the actions you can take include:
Voice of the Customer: Summarize call-center and customer e-mail correspondence nightly with text mining tools to prioritize top product and service issues and desired features.
Accelerate Analytic Processes: Create a multi-terabyte "analysis-ready" database to support common analytic needs, such as customer marketing segmentation.
Business Event Detection: Design subsystems to identify important business events during interactions with customers and automate responses or alerts.
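As a rough illustration of the "Voice of the Customer" idea above, the following minimal sketch tallies how many messages mention each issue category. It is not a real text-mining tool: the `ISSUE_KEYWORDS` table, the `top_issues` function, and the sample messages are all hypothetical names and data invented for this example, and simple keyword matching stands in for whatever text-mining product a company would actually use.

```python
from collections import Counter
import re

# Hypothetical issue categories and trigger words; a real deployment would
# derive these from the business's own product and service vocabulary.
ISSUE_KEYWORDS = {
    "billing": {"invoice", "charge", "refund", "overcharged"},
    "delivery": {"late", "shipping", "delayed", "tracking"},
    "login": {"password", "login", "locked"},
}

def top_issues(messages, limit=3):
    """Count how many messages mention each issue category, most frequent first."""
    counts = Counter()
    for text in messages:
        # Reduce the message to a set of lowercase words.
        words = set(re.findall(r"[a-z]+", text.lower()))
        for issue, keywords in ISSUE_KEYWORDS.items():
            if words & keywords:
                counts[issue] += 1  # count each message once per category
    return counts.most_common(limit)

# Made-up sample correspondence, for illustration only.
sample = [
    "I was overcharged on my last invoice.",
    "My order is late and the tracking page is empty.",
    "Please refund the duplicate charge.",
    "Cannot reset my password.",
]
print(top_issues(sample))
```

Run nightly over the day's call-center notes and e-mail, a summary of this kind could give managers a ranked list of issues without anyone reading every message.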
Third, define the smallest possible scope for success. Be rigorous in defining the new information that is needed, and then decide if big data is the only source. If it is, then assess the smallest set of data required to generate that information. Ask questions such as: How much history is needed for trend analysis? How granular must the data be? For example, for discovery and analysis projects, statistically relevant data samples can produce the same insights as full-volume historical data sets. Most large companies try to understand patterns in customer behavior and product performance so they can optimize their business processes and performance. An analysis of 500,000 randomly selected phone subscribers will yield just about the same insights as an analysis of 50,000,000. Unless your business can take advantage of micro segmentation, a rigorous sampling and analysis process will yield sufficient actionable insights.
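The sampling claim above can be checked with a short calculation. This sketch uses the standard normal-approximation margin of error for a proportion. It assumes a simple random sample, a 95% confidence level (z = 1.96), and the worst-case proportion of 0.5. The function name `margin_of_error` is chosen here for illustration.

```python
import math

def margin_of_error(sample_size, proportion=0.5, z=1.96):
    """Approximate margin of error for a proportion estimated from a
    simple random sample (normal approximation, 95% level by default)."""
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)

for n in (500_000, 50_000_000):
    moe = margin_of_error(n)
    print(f"n = {n:>10,}: +/- {moe * 100:.3f} percentage points")
```

Under these assumptions the 500,000-subscriber sample is already accurate to roughly 0.14 percentage points, and using 100 times as much data only tightens that to about 0.014. For most aggregate questions the difference would not change a decision. It starts to matter when the analysis is split into many small segments, each of which contains only a small slice of the sample.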
Networked, dynamic business processes built at a very granular level can produce billions or even trillions of bytes of data each month. The demands of big data have traditionally outstripped improvements in technology cost/performance. Fortunately, new architectures and approaches have evolved over the last decade that can simplify managing these enormous data volumes, approaches that are finally being incorporated into the enterprise architectures of many large companies. These include:
Database Appliances and Accelerators: Relational database technology has evolved dramatically over the last decade, allowing terabytes, even petabytes, of data to be loaded and queried quickly and efficiently on a single platform. Database appliances bundle storage, processing, interconnects, and query processing onto a dedicated hardware and software platform optimized for database performance and management. Database accelerators use innovative storage and query optimizations to reduce database size and accelerate complex query performance. Where hardware upgrades on traditional relational databases might improve performance by a factor of two, appliances and accelerators can improve price-performance by a factor of 100.
NOSQL Data Stores: A technology literally born from the Internet, Not-Only-SQL technology was designed from the start to manage enormous, distributed data sets that can be queried in milliseconds. Instead of normalizing data into relational tables that are then joined for answers, very large data sets are distributed across hundreds or thousands of processors, organized so that related data is stored together. Queries run in parallel across all processors, each returning answers based on its local data. This incredibly simple and scalable approach is very efficient and flexible, allowing for a wide variety of data types to be stored together, as well as sophisticated queries to be run.
Automated Analytics: Harvesting insights from big data requires analytics, and, in most companies, this is the domain of a small number of highly trained specialists. Capturing, cleansing, and combing through terabytes of data is often more art than science, and most analysts will tell you that their manual processes cannot be automated. However, over the last decade, advances in self-learning algorithms, genetic algorithms, and automated testing have produced programs that discover patterns, generate insights, and improve over time—in other words, they learn.
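The NOSQL query pattern described above, in which related data is stored together and each processor answers from its local data, can be sketched in a few lines. This is a toy model under stated assumptions, not the API of any particular NOSQL product: the `SHARDS` list, the function names, and the event data are hypothetical, and threads on one machine stand in for the hundreds or thousands of processors of a real cluster.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shards: each holds the complete records for its own customers,
# so related data is stored together and no cross-shard join is needed.
SHARDS = [
    {"alice": ["login", "purchase"], "bob": ["login"]},
    {"carol": ["purchase", "purchase"], "dave": ["support_call"]},
    {"erin": ["login", "support_call", "purchase"]},
]

def count_events_local(shard):
    """Answer the query using only the data held on one shard."""
    counts = Counter()
    for events in shard.values():
        counts.update(events)
    return counts

def count_events(shards):
    """Send the query to every shard in parallel, then merge the partial answers."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        for partial in pool.map(count_events_local, shards):
            total.update(partial)
    return total

print(count_events(SHARDS))
```

Because each shard works only on its own data, adding more shards spreads the same query over more processors. The only shared step is the final merge of small partial results.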
Big data must be considered in the context of the enterprise data and analytics environment: capturing and creating data, cleansing and organizing it, mining business insights from it, and using those insights to drive intelligent alerts and actions in the business. By feeding data that measure the outcomes of these actions back into the system, a closed loop is created that allows companies to use their data to test, learn, and improve potential scenarios. The diagram below depicts three broad domains of the ecosystem: data, insight, and action.
Big data presents opportunities and challenges. Data management leverages big-data technology to eliminate redundancy and provide scalable infrastructure for managing big-data assets. Insight uses appliances and accelerators, NOSQL technology, and automated analytics to expose new value hidden in big data.
Big data presents fascinating opportunities for insight and innovation—as well as the challenge of separating the signal from the noise. Increasingly, companies are overlaying their internal, proprietary data with insights from external structured and unstructured data to better understand their customers, performance, and marketplace. New technologies are making big data useful and manageable, but careful, business-driven planning and governance are essential to success. Starting from clear business objectives, enterprises are evolving to manage the dramatic growth in data, harvest new insights, and continuously optimize their actions.