Big data: whither the Enterprise Data Warehouse / Relational Database Management System?
Yes, another paper on big data. Even though so much has been written about this phenomenon, everyone still seems interested in the different approaches.
As Business Intelligence is the field in which we work, my attention was drawn to two articles published by the Kimball Group: one on the 'Evolving Role of the EDW (Enterprise Data Warehouse) in the Era of Big Data Analytics', and the other on 'Newly Emerging Best Practices for Big Data'.
This paper will discuss the following:
- Problems faced by RDBMSs due to big data
- Two architectures to address big data analytics: Extended RDBMS and MapReduce/Hadoop
- Why and how to use Hadoop as part of your EDW
For the past decades we have stored many different kinds of data on our systems: some structured, some semi-structured, and some unstructured or raw data in many different formats. Nowadays we call it 'big data'. Big data is causing problems because an 'easy' solution to extract direct business value from it is not yet available.
The main problem is not its volume (that is a myth), but rather the width of the data, which makes queries explode in complexity. The big challenge of big data is knowing the right query to ask; a solution that puts business value one button away from the data is, unfortunately, not there yet.
Problems faced by RDBMSs due to big data
There are three main challenges facing our traditional RDBMSs:
- The standard relational database simply cannot store or process big data volumes and is reaching its capacity and scaling limits.
- The pressure to shift away from slice-and-dice reporting.
- The recognition that the value of big data increases sharply as latency decreases and the data is delivered faster.
We want to continue using SQL and SQL-like languages, but without the storage constraints of the RDBMS. To find a solution, we need to accept that big data is a paradigm shift in how we think about data assets. Questions arise such as: how do we collect the data, how do we analyze it, and how do we monetize it?
Yes, a different mindset is needed to master these different data assets. Next to that, determining the value of a data asset has become even more important:
- How much does it cost to produce the data?
- How much does it cost to replace the data?
- How large is the profit opportunity provided by the data?
- How much does the business lose if the data falls into a competitor's hands?
- What lawsuits could follow if the data is exposed to the wrong parties?
In the coming decade, the analysis of big data will require a technology or combination of technologies capable of:
- Scaling easily to petabytes of data.
- Distributing processing across thousands of processors.
- Delivering fast response times for highly constrained standard SQL queries.
- Embedding arbitrarily complex user-defined functions (UDFs) within processing requests.
- Assembling extensive libraries of reusable UDFs crossing all use cases.
- Supporting a growing variety of data types.
- Loading data to be ready for analysis, at very high rates, at least gigabytes per second.
- Integrating data from multiple sources during the load process at very high rates.
- Loading data before declaring or discovering its structure.
- Updating data in place at full load speeds.
- Joining a billion row dimension to a trillion row fact table without pre-clustering.
- Scheduling and executing complex multi-hundred-node workflows.
- Failover and process continuation when processing nodes fail.
Two architectures to address big data analytics: Extended RDBMS and MapReduce/Hadoop.
Extended RDBMS
Major relational database management system vendors are adding features to cope with big data analytics from a solid relational perspective. The two most important architectural developments are massively parallel processing (MPP) and the growing adoption of columnar storage. When one combines MPP and columnar storage techniques, the following can be added to the list above:
- Scaling to support exabytes of data
- Distributing across thousands of geographically dispersed processors
- Subsecond response time for highly constrained standard SQL queries
- Updating data in place at full load speeds
- Being configured without single points of failure
Example of the standard RDBMS architecture:
MapReduce/Hadoop
MapReduce is a processing framework, and the best-known implementation of MapReduce is Hadoop. Hadoop is familiar to everyone in the Business Intelligence field and various articles have been written about it. Big data summits are also being organized all over the world, and Hadoop is always the most discussed topic there. Still, a short overview will be given of how Hadoop works.
Hadoop runs on its own distributed file system. The framework contains the following modules:
- Hadoop Common – the core libraries
- Hadoop Distributed File System (HDFS) – the distributed file system that stores the data
- Hadoop YARN – the resource management platform
- Hadoop MapReduce – the programming model for large-scale data processing
Example of the Hadoop architecture:
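To make the MapReduce programming model from the module list above concrete, here is a minimal word-count sketch in Python. It only illustrates the map and reduce phases, with the shuffle/sort step simulated locally; on a real cluster the two phases would run as separate tasks (for example via Hadoop Streaming) over blocks stored in HDFS, and the script name is purely illustrative.

# wordcount_sketch.py - minimal illustration of the MapReduce model
import sys
from itertools import groupby

def map_phase(lines):
    # Map: emit one (word, 1) pair per word in the input
    for line in lines:
        for word in line.strip().split():
            yield word, 1

def reduce_phase(pairs):
    # Shuffle/sort, then reduce: sum the counts per word
    for word, group in groupby(sorted(pairs), key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)

if __name__ == '__main__':
    # Usage: cat some_text_file | python wordcount_sketch.py
    for word, total in reduce_phase(map_phase(sys.stdin)):
        print('%s\t%d' % (word, total))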
How is Hadoop different from data warehousing? Let's have a look at the following image created by the Kimball Group:
One needs to keep in mind that an ETL process still has to be executed to extract the source data. The following shows how Hadoop can serve as an exploratory environment for data warehousing / Business Intelligence:
Query engines can access HDFS files before ETL. But how does this work in practice? To make reading and writing data on the grid easier, HCatalog has been developed. HCatalog is a table and storage management layer for Hadoop that enables users with different data processing tools (Apache Pig, Apache MapReduce, and Apache Hive) to read and write data on the grid. 'Query engines can access HDFS files via their metadata descriptions in HCatalog without requiring the data to be transferred to a relational table as in a conventional DBMS' (see the Cloudera blog for more).
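In practice this can look as follows. The sketch below uses the PyHive client against a HiveServer2 endpoint; the host name, HDFS path, table name and columns are illustrative assumptions, not taken from the Kimball or Cloudera articles.

from pyhive import hive

# Connect to HiveServer2 (host and port are assumptions for illustration)
conn = hive.connect(host='hive-host', port=10000, database='default')
cur = conn.cursor()

# Declare a schema over raw files already sitting in HDFS (schema-on-read);
# no data is copied or loaded into a relational table.
cur.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS raw_events (
        event_time STRING,
        event_type STRING,
        payload STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
    LOCATION '/data/landing/events'
""")

# Query the HDFS files in place through their metadata description
cur.execute("SELECT event_type, COUNT(*) FROM raw_events GROUP BY event_type")
for event_type, cnt in cur.fetchall():
    print(event_type, cnt)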
In addition to the image above, Kimball is worth quoting: 'BI tools are the ultimate glue integrating EDW resources'. I agree with that: BI tools present the business value of their source data.
Hadoop clusters are large and need optimization
Overall, clusters can contain dozens of terabytes and up to a petabyte of effective storage. Therefore an extra layer needs to be added between the HDFS files and HCatalog. To get higher performance, Parquet was introduced to the market last year. Parquet is not a database but a columnar storage file format accessible to multiple query and analysis applications. The advantages of Parquet are:
- Limits I/O – only the needed data is read and processed
- Saves space – columnar data compresses better
- Better scans
- Supports vectorized execution engines
- Data can be updated and the schema modified
One can query Parquet data with Hive or Impala and gain at least 10x performance over simple raw files. Parquet is also ideal for ETL and transfer to a conventional EDW, and for interactive queries in a Hadoop destination data warehouse.
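As a small illustration of the columnar idea behind Parquet, the sketch below uses the pyarrow library to write a file and then read back only the columns a query needs; the file name and columns are made up for the example, and a recent pyarrow version is assumed.

import pyarrow as pa
import pyarrow.parquet as pq

# Write a small table to a columnar Parquet file
table = pa.table({
    'customer_id': [1, 2, 3],
    'country': ['NL', 'BE', 'DE'],
    'revenue': [120.5, 80.0, 310.2],
})
pq.write_table(table, 'sales.parquet', compression='snappy')

# Read back only the columns the query needs; this column pruning
# is what limits I/O in a columnar format
subset = pq.read_table('sales.parquet', columns=['country', 'revenue'])
print(subset)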
Why and how to use big data and Hadoop as part of your EDW?
Let's first state that one should not attempt to build a separate big data environment, as developments are moving too fast at the moment; it will be difficult to build an environment with a long-lasting foundation. Rather, integrate big data into your current EDW.
So why should one use Hadoop as part of the EDW? There are many reasons; here is a shortlist:
- It will open doors to new kinds of data.
- New kinds of analysis will be possible in combination with the RDBMS
- Simultaneous, otherwise incompatible analysis modes on the same data files
- It will lower operational costs
- Linear scaling without sacrificing response time
What are the best practices for doing so? This paper will not present the solution, but rather mention a few different options:
- Use Hadoop as a low cost and flexible backup and archiving technology.
- Make a separation between big data workloads and the EDW
- Build a dimension library and link it to sources
- Join dimensions to fact tables at query time (yes, performance might suffer due to the huge volumes); a sketch of such a query-time join follows below.
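To illustrate that last option, the sketch below joins a conformed dimension to a Hadoop-resident fact table at query time through Hive (Impala could run the same statement). The connection details, table names and column names are illustrative assumptions, not from the Kimball articles.

from pyhive import hive

# Connect to HiveServer2 (host and port are assumptions for illustration)
cur = hive.connect(host='hive-host', port=10000, database='default').cursor()

# Join the dimension to the fact table at query time instead of
# pre-integrating the data; with huge volumes this can be slow.
cur.execute("""
    SELECT d.country, SUM(f.revenue) AS revenue
    FROM fact_sales f
    JOIN dim_customer d ON f.customer_key = d.customer_key
    GROUP BY d.country
""")
for country, revenue in cur.fetchall():
    print(country, revenue)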
Summary
The enterprise data warehouse needs to integrate with big data analytics and therefore also with Hadoop. Hadoop will become an equal partner of the EDW, and the scope of the EDW should grow to cover these new big data developments. Hadoop will be the environment for new data types and new kinds of analysis. When it is integrated with the traditional RDBMS, new kinds of analysis become possible and operational costs go down. Business Intelligence tools will eventually need to evolve in order to deliver this value.