I am sure most IT managers have some grounding in Big Data: what it is and how to integrate it with traditional enterprise data warehouses. Big IT vendors like IBM, Oracle, HP, Teradata and Microsoft, to name a few, have matured their offerings to integrate Bigdata with their traditional products, while others have perfected niche products and services for Bigdata in the infrastructure, integration services and analytics areas. There are even specialized niche consulting companies that provide Bigdata solutions, while the major consulting companies are catching up on this exciting environment and certainly seem to have mastered the challenges and opportunities this new paradigm provides to enterprises of all sizes.
- Today, social media is an integral part of studying customer usage and experience with products and services, which can also be monitored on a daily basis through sentiment. A Bigdata platform facilitates understanding this sentiment by integrating and reading the semi-structured and unstructured data in comments and tweets. The Klout score is a perfect example that leverages social media platforms to provide individuals with a measure of their online social influence using Bigdata.
- A wind farm manufacturer was able to identify patterns for a given terrain to install wind mills across 80 countries by deploying Bigdata to churn through petabytes of weather data in 15 minutes for analysis, which otherwise would have taken 3 to 4 weeks and many lost opportunities.
- A major patient caregiver was able to provide better patient care with a 360-degree view of the patient, avoid readmission losses, and get timely analysis from patients' past and current records for key insights by marrying its RDBMS data warehouse with a Bigdata analytics platform.
- A university built an in-stream Bigdata computing platform to monitor, in real time and around the clock, the signal streams from its many patient medical devices, to detect and respond before things become critical and avoid potential complications.
Bigdata is now considered the next big phenomenon, second only to the "Internet", and many companies are vying with each other to stay ahead of the curve as they sit on a gold mine of common data sources. These sources are well suited for Bigdata to extract content and context, providing appropriate analytics for a competitive edge and better decision-making abilities in the current ecosystem.
Here are some examples with which I believe we are all familiar.
- Market research reports
- Consumer research reports
- Survey data
- Call center voice calls
- Emails
- Social media data
- Excel spreadsheets from multiple business units
- Data from interactive web channels
Even GPS data (time, location) and machine data (activity logs) contain usage and behavior content; deciphering, studying and predicting patterns from these sensors and assembly-line robotic machines can be automated with Bigdata.
Note that these sources existed before, but what is fueling Bigdata today is the availability of commodity infrastructure with automated data processing capabilities that are extremely fast, with scalability exceeding the limitations that existed in the mindset of business and IT managers. Innovation has transformed the way business is done today to meet the ever-increasing demand for personalized services with speed, agility and flexibility.
Now let's look at some obvious questions, such as how to build one and what the key components of a Bigdata solution are.
Bigdata has four dimensions: Volume (terabytes running into petabytes), Velocity (the speed at which data can change dramatically), Variety (machine-processed logs, posts, tweets and blogs coming in different formats, structures and data types, including text, video, audio and images) and Veracity (the trustworthiness of the data).
Bigdata infrastructure needs to scale linearly, with high throughput to meet the velocity dimension's requirements, fault tolerance (automatic recovery with no manual intervention), and a high degree of parallelism to load and process distributed copies of the data across multiple machines.
To meet the above requirements, Bigdata relies on the Hadoop framework to support data-intensive processing that can span thousands of nodes and petabytes of data while remaining fault tolerant. Hadoop is open source software that supports distributed processing at very large scale and was inspired by Google's MapReduce and the Google File System (GFS). Besides processing large amounts of data, Hadoop can crank through computationally intensive problems, like portfolio evaluation analysis for a finance investment company.
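To make the divide-and-conquer idea concrete, here is a minimal word-count sketch in the Hadoop Streaming style, where the mapper and reducer are plain Python scripts reading stdin and writing stdout. The file names and the local simulation command are illustrative only, not tied to any particular cluster.

#!/usr/bin/env python
# wordcount.py - minimal Hadoop Streaming style word count.
# Simulate the pipeline locally with:
#   cat input.txt | python wordcount.py map | sort | python wordcount.py reduce
import sys

def mapper(lines):
    """Divide step: emit a (word, 1) key-value pair for every word seen."""
    for line in lines:
        for word in line.strip().split():
            print("%s\t%d" % (word.lower(), 1))

def reducer(lines):
    """Conquer step: sum counts per word; Hadoop delivers keys grouped and sorted."""
    current, total = None, 0
    for line in lines:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, total))
            current, total = word, 0
        total += int(count)
    if current is not None:
        print("%s\t%d" % (current, total))

if __name__ == "__main__":
    # First argument chooses the phase: "map" or "reduce".
    if len(sys.argv) > 1 and sys.argv[1] == "map":
        mapper(sys.stdin)
    else:
        reducer(sys.stdin)

On an actual cluster, the same two phases would be handed to the Hadoop Streaming jar, which takes care of distributing the work across nodes and shuffling and sorting the intermediate key-value pairs between the map and reduce stages.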
Some of the common processing tools of the (open source) Hadoop Bigdata infrastructure:
- Hadoop HDFS: A distributed file system storage layer that spreads files across multiple machines for scalability, greater throughput and fault tolerance.
- MapReduce: A programming interface consisting of Map and Reduce functions, which follow a divide-and-conquer strategy for distributing an extremely large problem across an extremely large computing cluster.
- Chukwa: Hadoop's large-scale log collection and analysis tool.
- Hive: A data warehouse infrastructure built on top of Hadoop, providing data summarization and analysis with SQL-like queries for structured data.
- HBase: A distributed, column-oriented NoSQL database built on top of HDFS. HBase supports billions of rows and millions of columns on thousands of nodes. (Some NoSQL databases like CouchDB, MongoDB and Riak have built-in MapReduce functionality.)
- Pig: A high-level data flow programming language for Hadoop.
- Mahout: An open source library that implements several scalable machine learning algorithms.
- ZooKeeper: Coordinates between Hadoop nodes efficiently.
- Oozie: A workflow scheduler system to manage Hadoop jobs.
Four steps of data processing in Bigdata (a minimal sketch follows the list):
- Gather data. A landing zone holds data received from different sources, with file name changes accommodated in this stage.
- Load data. Metadata is applied to map context to the raw source while creating small chunks of files, which are partitioned either horizontally or vertically based on requirements.
- Transform data. Business rules are applied and intermediary data sets are built at each processing step as key-value pairs, with associated metadata and metrics.
- Extract data. In this stage the result data set can be extracted for further processing, including analytics, operational reporting, data warehouse integration, and visualization.
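As a purely illustrative sketch of these four steps, the snippet below walks a small CSV batch through gather, load, transform and extract on the local file system. The directory names, the customer_id/amount columns and the summing business rule are all hypothetical; a real implementation would run against HDFS and far larger partitions.

import csv, json, os, shutil
from collections import defaultdict

# Hypothetical local directories standing in for HDFS zones.
LANDING_ZONE, STAGING_ZONE, RESULTS_ZONE = "landing_zone", "staging_zone", "results_zone"

def gather(source_file):
    """Step 1 - Gather: copy a raw source file into the landing zone with a batch-tagged name."""
    os.makedirs(LANDING_ZONE, exist_ok=True)
    target = os.path.join(LANDING_ZONE, "batch_001_" + os.path.basename(source_file))
    shutil.copy(source_file, target)
    return target

def load(raw_file, chunk_size=1000):
    """Step 2 - Load: attach metadata and split the raw file into small, horizontally partitioned chunks."""
    os.makedirs(STAGING_ZONE, exist_ok=True)
    with open(raw_file) as f:
        rows = list(csv.DictReader(f))
    chunks = []
    for i in range(0, len(rows), chunk_size):
        chunk_path = os.path.join(STAGING_ZONE, "chunk_%05d.json" % (i // chunk_size))
        with open(chunk_path, "w") as out:
            json.dump({"metadata": {"source": raw_file, "rows": len(rows[i:i + chunk_size])},
                       "records": rows[i:i + chunk_size]}, out)
        chunks.append(chunk_path)
    return chunks

def transform(chunks):
    """Step 3 - Transform: apply a (hypothetical) business rule, building key-value intermediates."""
    totals = defaultdict(float)
    for chunk_path in chunks:
        with open(chunk_path) as f:
            for record in json.load(f)["records"]:
                totals[record["customer_id"]] += float(record["amount"])
    return totals

def extract(totals):
    """Step 4 - Extract: write the result set out for analytics, reporting or DW integration."""
    os.makedirs(RESULTS_ZONE, exist_ok=True)
    out_path = os.path.join(RESULTS_ZONE, "customer_totals.csv")
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["customer_id", "total_amount"])
        for key, value in sorted(totals.items()):
            writer.writerow([key, value])
    return out_path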
In Bigdata, the data processing is essentially batch oriented, but it can also be real-time stream processing, for example for trending analysis, with some latency for computation. Data transformation is built as multistep derivations, and complexity is kept to a minimum within each step.
Bigdata works on un-modeled, unstructured data on commodity servers and can be hooked up with Flume, an interface for streaming data into Hadoop HDFS. Tools like Sqoop provide a bidirectional interface for data interchange between Hadoop and an RDBMS. Bigdata can also be associated with NoSQL (Not Only SQL) databases, notably graph DBMSs for social graph analytics (social media networks). So you get the point: you can create a hybrid infrastructure by mixing and matching various options to handle and integrate different types of data characteristics and analytics workloads.
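At scale, that Hadoop-to-RDBMS hand-off is exactly what Sqoop automates. Just to show the shape of the interchange, the small stand-in below reads tab-separated MapReduce output (the usual part-file format) and loads it into a relational table using Python's built-in sqlite3; the file paths and table are hypothetical, and this is not how Sqoop itself works internally.

import glob
import sqlite3

def load_results_into_rdbms(results_glob="results_zone/part-*", db_path="warehouse.db"):
    """Load tab-separated MapReduce output files into a relational table.
    Illustrative stand-in for a Sqoop-style export; paths and schema are hypothetical."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS word_counts (word TEXT PRIMARY KEY, count INTEGER)")
    for part_file in glob.glob(results_glob):
        with open(part_file) as f:
            for line in f:
                word, count = line.rstrip("\n").split("\t")
                conn.execute("INSERT OR REPLACE INTO word_counts (word, count) VALUES (?, ?)",
                             (word, int(count)))
    conn.commit()
    conn.close()

if __name__ == "__main__":
    load_results_into_rdbms()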
There are also several commercial Storage, MapReduce and Query (SMAQ) based Bigdata solutions in the market to choose from, and those can be integrated with an RDBMS.
- Greenplum utilizes the open source PostgreSQL DBMS, runs on clusters of distributed hardware or on EMC, with MapReduce functions expressed in Perl or Python.
- Teradata Aster nCluster, with SQL-MapReduce written in C++, C#, Java, R or Python, can intermingle with Teradata.
- Vertica from HP has built-in connectors to Hadoop on its column-oriented MPP database.
- Netezza utilizes Cloudera's Sqoop connector to interface its appliance with Hadoop.
- Oracle's Bigdata solution integrates Hadoop with Cloudera's connector and its family of products.
- Microsoft SQL Server provides a Hadoop connector and a family of products from Hortonworks as part of its Bigdata solutions.
The capabilities of Bigdata analytical solutions can be extended to data at "rest" or "in motion" with the appropriate configuration to churn through huge quantities of data. In day-to-day life we leave a data trail everywhere, whether we are surfing the web, making purchases in a supermarket, or generating readings picked up by sensors on the vehicles we drive or the planes we fly; all of it can be collected and analyzed to provide timely decisions for quicker customer gratification, better business products and better services.
In the 1950s, a 5 MB data storage unit from Hitachi was the size of a large refrigerator and weighed more than 500 pounds, compared to today's SD cards holding more than 32 GB and weighing just under 0.5 grams. Based on the trend of Moore's law, we can certainly expect more and more data to be captured, stored and processed in today's world as we enter a new age of "data consumerism", with a deluge of "data products" available in the market.
Finally, what is the definition of Bigdata? The best I have read so far says, "Bigdata is when the size of the data itself becomes part of the problem."