Monday, September 16, 2013

Big Data Profile

I am sure most IT managers have some grounding in Big Data: what it is and how it integrates with traditional enterprise data warehouses. Big IT vendors such as IBM, Oracle, HP, Teradata and Microsoft, to name a few, have matured their offerings to integrate Bigdata with their traditional products, while others have perfected niche products and services for Bigdata infrastructure, integration services and analytics. There are even specialized niche consulting companies that provide Bigdata solutions, while the major consulting firms are catching up on this exciting environment and certainly seem to have mastered the challenges and opportunities this new paradigm offers to enterprises of all sizes.

·         Today, social media is integral to studying customer usage and experience with products and services, which can be monitored daily through sentiment. A Bigdata platform facilitates understanding this sentiment by reading the semi-structured and unstructured data in comments and tweets. The Klout score is a perfect example of leveraging social media platforms to measure an individual's online social influence using Bigdata.

·         A wind farm manufacturer was able to identify patterns for a given terrain to site windmills across 80 countries by deploying Bigdata to churn through petabytes of weather data in 15 minutes, an analysis that would otherwise have taken 3 to 4 weeks and cost many lost opportunities.

·         A major care provider was able to deliver better patient care with a 360-degree view of the patient, avoid readmission losses, and get timely analysis of patients' past and current records for key insights by marrying its RDBMS data warehouse with a Bigdata analytics platform.

·         A university built an in-stream Bigdata computing platform to monitor, in real time around the clock, signal streams from patients' medical devices so it could detect and respond before things become critical and avoid potential complications.

Bigdata is now considered the biggest phenomenon since the Internet, and many companies are vying with each other to stay ahead of the curve as they sit on a golden mound of common data sources. These sources are well suited for Bigdata to extract both content and context, providing the analytics needed for a competitive edge and better decision-making in the current ecosystem.

Here are some examples with which I believe we are all familiar:

  • Market research reports
  • Consumer research reports
  • Survey data
  • Call center voice calls
  • Emails
  • Social media data
  • Excel spreadsheets from multiple business units
  • Data from interactive web channels
Even GPS data (time, location) and machine data (activity logs) contain usage and behavior content; deciphering, studying and predicting patterns from these sensors and assembly-line robotic arms can be automated with Bigdata.

Note that these sources existed before, but what is fueling Bigdata today is the availability of commodity infrastructure with automated data processing capabilities that is extremely fast, with scalability exceeding the limitations that once existed in the minds of business and IT managers. Innovation has transformed the way business is done today to meet ever-increasing demand for personalized services with speed, agility and flexibility.

Now let's look at some obvious questions: how do you build a Bigdata solution, and what are its key components?

Bigdata has four dimensions: Volume (terabytes running into petabytes), Velocity (the speed at which data arrives and can change dramatically), Variety (machine-generated logs, posts, tweets and blogs coming in different formats, structures and data types including text, video, audio and images) and Veracity (the trustworthiness of the data).

Bigdata infrastructure needs to scale linearly, with high throughput to meet the velocity dimension's requirements, fault tolerance (automatic recovery with no manual intervention), and a high degree of parallelism to load and process data across multiple machines, each working on its own copy of the distributed data.

To deliver these features, Bigdata relies on the Hadoop framework to support data-intensive processing that scales to thousands of nodes and petabytes of data while remaining fault tolerant. Hadoop is open-source software that supports distributed processing at very large scale and was inspired by Google's MapReduce and the Google File System (GFS). Besides processing large amounts of data, Hadoop can crank through computationally intensive problems such as portfolio evaluation analysis for a financial investment company.
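
To make the divide-and-conquer idea concrete, here is a minimal, purely local Python sketch of the map, shuffle and reduce phases that Hadoop distributes across a cluster. The word-count example, the chunking and the function names are illustrative assumptions, not Hadoop's actual API.

# Minimal local sketch of the MapReduce model (illustrative only; real
# Hadoop distributes these phases across many nodes and handles failures).
from collections import defaultdict

def map_phase(chunk):
    # Map: emit (key, value) pairs -- here, (word, 1) for a word count.
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: combine all values for one key into a final result.
    return key, sum(values)

chunks = ["big data big insights", "data drives decisions"]  # stand-ins for file splits
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
results = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(results)  # {'big': 2, 'data': 2, 'insights': 1, 'drives': 1, 'decisions': 1}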

Some of the common processing tools of the Hadoop-based Bigdata infrastructure (all open source), with a small usage sketch after the list:
  • Hadoop HDFS: A distributed file system that spreads files across multiple machines for scalability, greater throughput and fault tolerance.
  • MapReduce: A programming interface consisting of Map and Reduce functions that follows a divide-and-conquer strategy for distributing an extremely large problem across an extremely large computing cluster.
  • Chukwa: Hadoop's large-scale log collection and analysis tool.
  • Hive: Data warehouse infrastructure built on top of Hadoop that provides data summarization and analysis of structured data with SQL-like queries.
  • HBase: A distributed, column-oriented NoSQL database built on top of HDFS. HBase supports billions of rows and millions of columns on thousands of nodes. (Some NoSQL databases such as CouchDB, MongoDB and Riak have built-in MapReduce functionality.)
  • Pig: A high-level data flow programming language for Hadoop.
  • Mahout: An open-source library that implements several scalable machine-learning algorithms.
  • ZooKeeper: A coordination service that helps Hadoop nodes coordinate efficiently.
  • Oozie: A workflow scheduler system to manage Hadoop jobs.
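
As a rough illustration of how HDFS and MapReduce fit together in practice, here is a hedged Python sketch that drives the standard hadoop fs and Hadoop Streaming command-line tools. The HDFS paths, the streaming jar location and the mapper/reducer script names are assumptions made for illustration, not part of any particular distribution.

import subprocess

# Illustrative location only -- adjust to your cluster layout.
STREAMING_JAR = "/usr/lib/hadoop-mapreduce/hadoop-streaming.jar"  # assumed path

def run(cmd):
    # Thin wrapper that raises if the Hadoop command fails.
    subprocess.run(cmd, check=True)

# 1. Land raw files in HDFS (the "landing zone").
run(["hadoop", "fs", "-mkdir", "-p", "/data/raw/tweets"])
run(["hadoop", "fs", "-put", "tweets.json", "/data/raw/tweets/"])

# 2. Launch a streaming MapReduce job that uses ordinary scripts as mapper/reducer.
run(["hadoop", "jar", STREAMING_JAR,
     "-input", "/data/raw/tweets",
     "-output", "/data/out/word_counts",
     "-mapper", "mapper.py",          # hypothetical scripts shipped with the job
     "-reducer", "reducer.py",
     "-file", "mapper.py", "-file", "reducer.py"])

# 3. Inspect the result written back into HDFS.
run(["hadoop", "fs", "-cat", "/data/out/word_counts/part-00000"])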

Four steps of data processing in Bigdata (sketched in code after the list):
·         Gather data. A landing zone holds data received from different sources; file-name changes are accommodated in this stage.
·         Load data. Metadata is applied to map context to the raw source while creating smaller chunks of files, which are partitioned either horizontally or vertically based on requirements.
·         Transform data. Apply business rules and build intermediary data sets at each processing step as key-value pairs, along with associated metadata and metrics.
·         Extract data. In this stage the resulting data set can be extracted for further processing, including analytics, operational reporting, data warehouse integration and visualization.
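
Here is a minimal, single-machine Python sketch of those four steps working over key-value pairs. The directory layout, field names and the business rule are assumptions made up for illustration.

import csv, json, pathlib

LANDING_ZONE = pathlib.Path("landing_zone")   # hypothetical directory of raw files

# 1. Gather: collect the raw files dropped by upstream sources.
raw_files = sorted(LANDING_ZONE.glob("sales_*.csv"))

# 2. Load: attach simple metadata (source file, line number) to each raw record.
records = []
for path in raw_files:
    with path.open(newline="") as f:
        for line_no, row in enumerate(csv.DictReader(f), start=1):
            records.append({"meta": {"source": path.name, "line": line_no}, "data": row})

# 3. Transform: apply a business rule and build key-value pairs (region -> revenue).
revenue_by_region = {}
for rec in records:
    row = rec["data"]
    amount = float(row["amount"])
    if amount <= 0:                       # example business rule: drop refunds
        continue
    key = row["region"]
    revenue_by_region[key] = revenue_by_region.get(key, 0.0) + amount

# 4. Extract: write the result set for downstream reporting or visualization.
with open("revenue_by_region.json", "w") as out:
    json.dump(revenue_by_region, out, indent=2)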

In Bigdata, data processing is essentially batch oriented, but it can also be real-time stream processing, for example for trending analysis, with some latency for computation. Data transformation is built as multi-step derivations, and complexity is kept to a minimum within each step.

Bigdata works on un-modeled, unstructured data on commodity servers and can be hooked up with Flume, an interface for streaming data into Hadoop HDFS. Tools like Sqoop provide a bidirectional interface for data interchange between Hadoop and an RDBMS. Bigdata can also be paired with NoSQL ("Not only SQL") databases, for example graph DBMSs for social graph analytics on social media networks. So you get the point: you can create a hybrid infrastructure by mixing and matching various options to integrate different data characteristics and analytics workloads.
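
As a rough sketch of that RDBMS-to-Hadoop interchange, the Python snippet below shells out to a standard sqoop import. The JDBC URL, table name, credentials file and target directory are placeholder assumptions.

import subprocess

# Pull a relational table into HDFS with Sqoop (connection details are placeholders).
subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost/sales",   # assumed JDBC URL
    "--username", "etl_user",
    "--password-file", "/user/etl/.dbpass",     # keeps credentials off the command line
    "--table", "orders",
    "--target-dir", "/data/raw/orders",
    "--num-mappers", "4",                       # parallel extract across 4 map tasks
], check=True)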

 There are also several commercial Storage, MapReduce and Query (SMAQ) based Bigdata products on the market to choose from, and they can be integrated with an RDBMS.
  • Greenplum utilizes the open-source PostgreSQL DBMS, runs on clusters of distributed hardware or on EMC appliances, and supports MapReduce functions expressed in Perl or Python.
  • Teradata Aster nCluster, with SQL-MapReduce functions written in C++, C#, Java, R or Python, can intermingle with Teradata.
  • HP's Vertica has built-in Hadoop connectors for its column-oriented MPP database.
  • Netezza utilizes Cloudera's Sqoop connector to interface its appliance with Hadoop.
  • Oracle's Bigdata solution integrates Hadoop via Cloudera's connector with its family of products.
  • Microsoft SQL Server provides a Hadoop connector and a family of products from Hortonworks as part of its Bigdata solutions.
The capabilities of Bigdata analytical solutions can be extended to data at “rest” or “in motion” with the appropriate configuration to churn through huge quantities of data. In day-to-day life we leave a data trail everywhere, whether we are surfing the web, making purchases in a supermarket, or generating sensor readings from the vehicles we drive or fly in; all of it can be collected and analyzed to provide timely decisions, quicker customer gratification, and better products and services.
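
To illustrate the “in motion” side, here is a small Python sketch of a sliding-window count over an event stream, the kind of rolling computation a streaming Bigdata platform performs continuously. The event source and the window length are made-up assumptions.

from collections import deque

WINDOW_SECONDS = 60          # assumed sliding-window length

def rolling_counts(events):
    """Yield (timestamp, count of events seen in the last WINDOW_SECONDS)."""
    window = deque()         # timestamps currently inside the window
    for ts in events:
        window.append(ts)
        while window and ts - window[0] > WINDOW_SECONDS:
            window.popleft() # evict readings that have slid out of the window
        yield ts, len(window)

# Hypothetical sensor timestamps (seconds); in practice these arrive from a live feed.
for ts, count in rolling_counts([0, 10, 25, 70, 71, 140]):
    print(f"t={ts:>3}s events_in_last_minute={count}")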

In the 1950s, a 5 MB data storage unit from Hitachi was the size of a large refrigerator and weighed more than 500 pounds, compared with today's SD cards holding more than 32 GB and weighing just under 0.5 grams. Following the Moore's-law trend, we can certainly expect an ever larger footprint of data being captured, stored and processed as we enter a new age of “data consumerism” with a deluge of “data products” available in the market.

Finally, what is the definition of Bigdata? The best one I have read so far says, “Bigdata is when the size of the data itself becomes part of the problem.”
