We now understand that the world is drowning in data. It is estimated that over 15 petabytes of new information are created every day, eight times more than the information held in all the libraries in the United States. This year, the amount of digital information generated is expected to reach 988 exabytes, equivalent to a stack of books reaching from the Sun to Pluto and back.1
Gartner agrees that data growth is now the leading data center infrastructure challenge: in a recent survey, 47 percent of respondents ranked data growth as their number one challenge.2
Data growth can strip entire data centers of cooling and power capacity. System availability suffers as batch processes can no longer meet scheduled completion times. The “outage windows” needed to convert data during ERP upgrade cycles may stretch from hours to days, and other critical processes such as replication and disaster recovery are affected because ever-larger data volumes become ever harder to move.
Additionally, unchecked data growth creates governance, risk and compliance challenges. HIPAA, PCI DSS, FISMA and SAS 70 mandates all require that organizations establish frameworks for data security and compliance. Information Lifecycle Management (ILM) programs are required to meet these compliance objectives throughout the data lifecycle.
Advances in semiconductor technology have enabled impressive new solutions to data growth by using “commodity” hardware to process and store extraordinary amounts of data at lower unit costs. Through virtualization, this low-cost infrastructure may now be utilized with great efficiency.
Apache Hadoop is designed to leverage this powerful, low-cost infrastructure to deliver massive scalability. Using the MapReduce programming model to process large data sets in parallel across distributed compute nodes, Hadoop provides one of the most efficient and cost-effective bulk data storage solutions available. These capabilities enable compelling new big data applications, such as enterprise archiving and the data lake, and establish a new enterprise blueprint for data management at petabyte scale.
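To make the MapReduce model concrete, the canonical word-count job below is a minimal sketch using Hadoop's Java MapReduce API. The input and output paths are placeholders, and the example is purely illustrative; it is not drawn from any specific archiving product described here.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: runs in parallel on each input split and emits (word, 1) pairs.
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts for each word gathered from all mappers.
  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // combine locally to reduce shuffle traffic
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. an HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The framework handles splitting the input, scheduling map and reduce tasks across the cluster, and recovering from node failures, which is what allows this simple program to scale across commodity hardware.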
Experts agree that as much as 80 percent of production data in ERP, CRM, file servers and other mission-critical applications may not be in active use, and both structured and unstructured data become less active as they age. Large amounts of inactive data stored online for too long reduce the performance of production applications, increase costs and create compliance challenges.
Enterprise archiving and data lake applications built on big data offer low-cost bulk storage alternatives to keeping inactive enterprise data online. By moving inactive data to nearline storage, organizations improve application performance and reduce costs because production data sets become smaller and workloads more manageable. Access to the archived data is maintained through analytics applications, structured query and reporting, or simple text search.
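As a simple illustration of the "move inactive data to nearline storage" step, the sketch below copies a local extract of closed records into an HDFS archive directory using Hadoop's FileSystem API. The file names, paths and namenode address are hypothetical, and a production archiving workflow would add cataloging, retention policies and validation not shown here.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ArchiveToHdfs {
  public static void main(String[] args) throws Exception {
    // Hypothetical paths: a local extract of inactive records and an HDFS archive directory.
    Path localExtract = new Path("/data/exports/orders_closed_2009.csv");
    Path archiveDir   = new Path("hdfs://namenode:8020/archive/erp/orders/2009/");

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(archiveDir.toUri(), conf);

    // Create the target directory if needed, then copy the extract into
    // nearline (HDFS) storage, where it remains available for query and search.
    fs.mkdirs(archiveDir);
    fs.copyFromLocalFile(false /* keep local source */, true /* overwrite */,
        localExtract, archiveDir);

    System.out.println("Archived " + localExtract + " to " + archiveDir);
    fs.close();
  }
}
```

Once the data lands in HDFS, downstream analytics, reporting or text-search tools can read it in place, so the production database no longer has to carry the inactive records.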
Big data is driving a new enterprise blueprint that enables organizations to gain more value from their data. Enterprise data warehouse (EDW) and analytics applications leverage big data for richer, better-described views of critical information. As a low-cost repository for copies of enterprise data, big data is an ideal platform to stage critical enterprise data for later use by EDW and analytics applications.
1. http://www.enterprisestorageforum.com/management/features/article.php/3911686/CIOs-Struggling-With-Data-Growth
2. http://www.gartner.com/it/page.jsp?id=146021