Thursday, September 19, 2013

Big Data Analytics


Use of Big Data is increasing by the hour, and so is the investment in Data Analytics around it. There is a slew of startups and established analytical players such as SAS, SPSS, R, RStat, E-Views and TreeNet building deep algorithms, forecasting and mining tools that can provide business insights from Big Data.

With the deluge of data that flows from the Web and other sources, we now have much wider and deeper pools of information at our disposal. Mining this information with analytics is the need of the hour for many companies that want to stay competitive, improve their processes and overcome structural inefficiencies.

Today’s Big Data analytics provides a slew of statistical techniques and algorithms to assess the current state of the business and to predict future growth models for products and services, whether the workload runs on the Cloud or on SMAQ (Storage, MapReduce and Query) clusters of servers.

A host of statistical techniques and algorithms are being customized and built, along with various visualization techniques: segmentation of profiles across homogeneous or heterogeneous data sets (K-Means, Discriminant Analysis, Bayesian Belief Networks, etc.), forecasting of future events both qualitative and quantitative (Monte Carlo simulation, Markov Chains), predictive modeling with probabilities on outcomes (Linear Regression, Bayesian techniques), and descriptive modeling of trends in a population (Structural Equation Modeling). These may sound like Greek and Latin to most readers who are not Data Scientists (a term now in vogue for folks with a good grasp of statistical techniques and econometric modeling, who were earlier called Operational Research folks).
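To make one of these techniques concrete, here is a minimal K-Means segmentation sketch in Python using scikit-learn. The customer features, values and cluster count are hypothetical placeholders invented for illustration, not data from any real product.

```python
# Minimal K-Means customer segmentation sketch (illustrative only).
# Assumes scikit-learn and NumPy are installed; features and values are hypothetical.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical customer profiles: [annual_spend, visits_per_month, avg_basket_size]
customers = np.array([
    [1200.0,  2.0,  45.0],
    [300.0,   1.0,  20.0],
    [5200.0, 12.0, 110.0],
    [4800.0, 10.0,  95.0],
    [250.0,   0.5,  15.0],
    [1500.0,  3.0,  50.0],
])

# Standardize features so annual spend does not dominate the distance metric
scaled = StandardScaler().fit_transform(customers)

# Segment into 3 clusters (the cluster count is an assumption for the example)
model = KMeans(n_clusters=3, n_init=10, random_state=42).fit(scaled)
print("Segment labels:", model.labels_)
print("Segment centers (scaled):", model.cluster_centers_)
```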

Data-driven decisions are not new in many enterprises, be it CPG (Consumer Packaged Goods), Financial Services, Healthcare, Retail, Media and so forth, but in the past they had only scratched the surface because these techniques required special silos for storage, computation and analysis.

Big Data analytics is used to correlate and identify patterns in customer behavior, brand preferences, loyalties and reward-program usage, so that companies can strategize and position their products and services appropriately on an ongoing basis. Business mining, data analytics and visualization will become key arsenal and core requirements in managers' job profiles, both in understanding these techniques and in implementing them.
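As a small, hedged illustration of this kind of pattern-finding, the sketch below correlates reward-program usage with purchase behavior using pandas; the column names and numbers are invented for the example.

```python
# Illustrative correlation sketch with hypothetical customer data (pandas assumed installed).
import pandas as pd

# Invented sample: reward-program redemptions vs. monthly purchases vs. brand switching
df = pd.DataFrame({
    "reward_redemptions": [0, 1, 4, 6, 2, 8, 3, 0],
    "monthly_purchases":  [1, 2, 6, 9, 3, 11, 5, 1],
    "brand_switches":     [4, 3, 1, 0, 2, 0, 1, 5],
})

# Pairwise Pearson correlations highlight which behaviors move together
print(df.corr())
```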

Business leaders of many US companies are proactively taking note of McKinsey's finding that by 2018 there will be a shortage of 1.5 million managers and analysts with the skills to use analytics in their business repertoire. Many companies are planning to reposition and retrain their employees to meet this new and challenging job environment.

On the other hand, IT consulting companies are already gearing up resources and processes on the positive note that by 2018 the BI Analytics space alone would generate an additional US$20.3 billion. To keep up with growing demand in the global market, major educational institutions have already started offering Business Analytics courses as a core requirement in both graduate and undergraduate degrees. Not to be outdone, many institutes have created certification courses on Business Analytics.

I believe today's mantra on Big Data comes down to how a business wants to model and build its data requirements, much like the glass-half-full or glass-half-empty philosophy:

“We capture what we model or We model what we capture”

Conclusion: Imagine building a Eureqa-like program that keeps iterating over data points until it finds an equation that matches the underlying relationship. The human mind often cannot model or figure such relationships out on its own, just as Einstein's E = mc² revealed a relationship that was far from obvious. Many complexities simply outrun the mind's capability to understand them, and Big Data analytics comes in handy to unravel and demystify them.
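In that spirit, here is a toy sketch of the idea behind such a search: try several candidate equation forms against the data and keep the one with the lowest error. The candidate forms and sample data are assumptions made for illustration; this is not Eureqa's actual algorithm.

```python
# Toy "find the equation" search: fit candidate forms and pick the best (illustrative only).
# Assumes NumPy is installed; the data and candidate forms are invented.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(1, 10, 50)
y = 3.0 * x**2 + rng.normal(0, 2.0, size=x.size)  # hidden relationship: y ~ 3x^2

candidates = {
    "linear    y = a*x + b":     np.vstack([x, np.ones_like(x)]).T,
    "quadratic y = a*x^2 + b":   np.vstack([x**2, np.ones_like(x)]).T,
    "log       y = a*ln(x) + b": np.vstack([np.log(x), np.ones_like(x)]).T,
}

# Fit each candidate by least squares; the smallest residual wins
for name, design in candidates.items():
    coeffs, residuals, *_ = np.linalg.lstsq(design, y, rcond=None)
    err = float(residuals[0]) if residuals.size else 0.0
    print(f"{name}: coefficients={np.round(coeffs, 2)}, residual={err:.1f}")
```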

Monday, September 16, 2013

Big Data Profile

I am sure most IT managers have got some grounding on Big Data in terms of what it is and how to integrate it with traditional enterprise data warehouses. Big IT vendors like IBM, Oracle, HP, Teradata and Microsoft, to name a few, have matured their offerings to integrate Big Data with their traditional products, while others have perfected niche products and services for Big Data in the infrastructure, integration services and analytical areas. There are even specialized niche consulting companies that provide Bigdata solutions, while the major consulting companies are catching up on this exciting environment and certainly seem to have mastered the challenges and opportunities this new paradigm provides to enterprises of all sizes.

·         Today, social media is integral to studying customer usage and experience of products and services, which can also be monitored daily through sentiment. A Bigdata platform facilitates understanding this sentiment by reading the semi-structured and unstructured data in comments and tweets (see the sketch after this list). The Klout score is a good example of leveraging social media platforms and Bigdata to measure an individual's online social influence.

·         A wind-farm manufacturer was able to identify patterns for a given terrain to install wind turbines across 80 countries by deploying Bigdata to churn through petabytes of weather data in 15 minutes of analysis, which would otherwise have taken 3 to 4 weeks and meant many lost opportunities.

·         A major patient caregiver was able to provide better patient care with a 360-degree view of the patient, avoid readmission losses and get timely analysis of patients' past and current records for key insights by marrying its RDBMS data warehouse with a Bigdata analytics platform.

·         A university built an in-stream Bigdata computing platform to monitor, in real time and round the clock, signal streams from many medical devices attached to patients, so it could detect and respond before things become critical and avoid potential complications.
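To make the social-media example concrete, here is a minimal sentiment-tagging sketch in Python. The keyword lists and sample tweets are invented placeholders; a production pipeline would use trained models and proper NLP tooling on the Bigdata platform.

```python
# Naive keyword-based sentiment tagging of tweets (illustrative placeholder only).
POSITIVE = {"love", "great", "awesome", "fast", "recommend"}
NEGATIVE = {"hate", "slow", "broken", "refund", "terrible"}

def sentiment(text: str) -> str:
    words = set(text.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

tweets = [
    "Love the new phone, camera is great",
    "Support is terrible and shipping was slow",
    "Received the package today",
]
for t in tweets:
    print(sentiment(t), "->", t)
```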

Bigdata is now considered the next big phenomenon after the Internet, and many companies are vying with each other to stay ahead of the curve as they sit on a golden mound of common data sources. These sources are well suited for Bigdata to extract both content and context, providing the analytics needed for a competitive edge and better decision-making in the current ecosystem.

Here are some examples of sources I believe we are all too familiar with:

  • Market research reports
  • Consumer research reports
  • Survey data
  • Call center voice calls
  • Emails
  • Social media data
  • Excel spreadsheets from multiple business units
  • Data from interactive web channels
Even GPS data (time, location) and machine data (activity logs) contain usage and behavior content; deciphering, studying and predicting patterns from these sensors and assembly-line robotic arms can be automated in Bigdata.
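As a small illustration of mining machine data, the sketch below counts error events per device from hypothetical activity-log lines; the log format and device names are assumptions made for the example.

```python
# Count error events per device from hypothetical activity-log lines.
# The "timestamp device_id level message" format is an assumption for illustration.
from collections import Counter

log_lines = [
    "2013-09-16T10:01:02 armA INFO cycle complete",
    "2013-09-16T10:01:07 armB ERROR torque limit exceeded",
    "2013-09-16T10:01:09 armA ERROR gripper timeout",
    "2013-09-16T10:01:15 armB ERROR torque limit exceeded",
]

errors_per_device = Counter(
    line.split()[1] for line in log_lines if " ERROR " in line
)
print(errors_per_device)  # e.g. Counter({'armB': 2, 'armA': 1})
```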

Note that these sources existed before; what is fueling Bigdata today is the availability of commodity infrastructure with automated data-processing capabilities that is extremely fast and scalable, exceeding the limitations that existed in the mindset of business and IT managers. Innovation has transformed the way business is done today to meet ever-increasing demand for personalized services with speed, agility and flexibility.

Now let's look at some obvious questions, such as how to build a Bigdata solution and what its key components are.

Bigdata has four dimensions, namely Volume (terabytes running into petabytes), Velocity (the speed at which data arrives and can change dramatically), Variety (machine-generated logs, posts, tweets and blogs coming in different formats, structures and data types including text, video, audio and images) and Veracity (the trustworthiness of the data).

For Bigdata infrastructure, scalability needs to be linear, with high throughput to meet the velocity requirements, a fault-tolerant design (automatic recovery with no manual intervention), and a high degree of parallelism to load and process data across multiple machines, each holding its own copy of the distributed data.

To meet the above requirements, Bigdata relies on the Hadoop framework to support data-intensive processing that can span thousands of nodes and petabytes of data while remaining fault tolerant. Hadoop is open-source software that supports distributed processing on a very large scale and was inspired by Google's MapReduce and the Google File System (GFS). Besides processing large amounts of data, Hadoop can crank through computationally intensive problems such as portfolio evaluation analysis for a finance investment company.
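As a minimal illustration of the MapReduce divide-and-conquer style, here is the classic word-count pair of mapper and reducer scripts written in Python for Hadoop Streaming; the file names are conventional choices, and cluster-specific paths would of course differ.

```python
# mapper.py -- emit (word, 1) pairs, one per line, for Hadoop Streaming
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
# reducer.py -- sum counts per word; Hadoop Streaming sorts mapper output by key
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        count += int(value)
    else:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")
```

These would typically be submitted through the hadoop-streaming jar with the -mapper and -reducer options, with the exact jar location depending on the distribution.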

Some of the common processing tools of the Hadoop-based (open source) Bigdata infrastructure:
  • Hadoop HDFS – a distributed file system that spreads files across multiple machines for scalability, greater throughput and fault tolerance.
  • MapReduce – a programming interface consisting of Map and Reduce functions that follows a divide-and-conquer strategy for distributing an extremely large problem across an extremely large computing cluster.
  • Chukwa – Hadoop's large-scale log collection and analysis tool.
  • Hive – data warehouse infrastructure built on top of Hadoop, providing data summarization and analysis with SQL-like queries over structured data.
  • HBase – a distributed, column-oriented NoSQL database built on top of HDFS. HBase supports billions of rows and millions of columns on thousands of nodes. (Some NoSQL databases such as CouchDB, MongoDB and Riak have built-in MapReduce functionality.)
  • Pig – a high-level data-flow programming language for Hadoop.
  • Mahout – an open-source library that implements several scalable machine-learning algorithms.
  • ZooKeeper – helps coordinate Hadoop nodes efficiently.
  • Oozie – a workflow scheduler system to manage Hadoop jobs.

Four steps of data processing in Bigdata:
·         Gather data. A landing zone holds data received from different sources; file-name changes are accommodated at this stage.
·         Load data. Metadata is applied to map context onto the raw source while creating small chunks of files, which are partitioned either horizontally or vertically based on requirements.
·         Transform data. Apply business rules and build intermediary data sets at each processing step as key-value pairs, along with associated metadata and metrics (see the sketch after this list).
·         Extract data. In this stage the result data set can be extracted for further processing, including analytics, operational reporting, data warehouse integration and visualization.
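As a hedged illustration of the transform step, the sketch below applies a simple business rule to raw records and emits intermediate key-value pairs with a little metadata attached; the record fields, rule and metadata keys are invented for the example.

```python
# Illustrative multistep transform: raw records -> filtered key-value pairs with metadata.
# The fields, threshold and metadata keys are assumptions made for this example.
raw_records = [
    {"customer": "C1", "region": "NE", "amount": 120.0},
    {"customer": "C2", "region": "SW", "amount": 35.0},
    {"customer": "C1", "region": "NE", "amount": 80.0},
]

# Step 1: business rule -- keep only transactions above a threshold
qualified = [r for r in raw_records if r["amount"] >= 50.0]

# Step 2: emit key-value pairs keyed by (customer, region), tagged with step metadata
intermediate = [
    ((r["customer"], r["region"]), {"amount": r["amount"], "step": "transform-1"})
    for r in qualified
]
for key, value in intermediate:
    print(key, value)
```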

In Bigdata, data processing is essentially batch oriented, but it can also be real-time stream processing (for example, trend analysis) with some latency for computation. Data transformation is built as multistep derivations, and complexity is kept to a minimum within each step.

Bigdata works on un-modeled, unstructured data on commodity servers and can be hooked up with Flume, an interface for streaming data into Hadoop HDFS. Tools like Sqoop provide a bidirectional interface for moving data between Hadoop and an RDBMS. Bigdata can also be paired with NoSQL (Not Only SQL) stores, notably graph databases for social-graph analytics (social media networks). So you get the point: you can create a hybrid infrastructure by mixing and matching various options to integrate different types of data and analytics workloads.

 There are also several commercial Storage, MapReduce and Query (SMAQ) based Bigdata products in the market to choose from, and these can be integrated with an RDBMS:
  • Greenplum utilizes the open-source PostgreSQL DBMS, runs on clusters of distributed hardware or on EMC appliances, and supports MapReduce functions expressed in Perl or Python.
  • Teradata Aster nCluster, with SQL-MapReduce written in C++, C#, Java, R or Python, can intermingle with Teradata.
  • Vertica, from HP, has built-in connectors to Hadoop on its column-oriented MPP database.
  • Netezza uses Cloudera's Sqoop connector to interface its appliance with Hadoop.
  • Oracle's Bigdata solution integrates Hadoop through Cloudera's connector and its own family of products.
  • Microsoft SQL Server provides a Hadoop connector and a family of products from Hortonworks as part of its Bigdata solutions.
The capabilities of Bigdata analytical solutions can be extended to data at rest or in motion with appropriate configuration to churn through huge quantities of data. In day-to-day life we leave a data trail everywhere, whether we are surfing the web, making purchases in a supermarket, or generating readings from sensors on the vehicles we drive or fly; all of it can be collected and analyzed to provide timely decisions, quicker customer gratification, and better products and services.

In the 1950s, a 5 MB data store from Hitachi was the size of a large refrigerator and weighed more than 500 pounds; today's SD card holds more than 32 GB and weighs just under 0.5 grams. Following the Moore's-law trend, we will certainly see a growing footprint of data being captured, stored and processed as we enter a new age of "data consumerism" with a deluge of "data products" available in the market.

Finally, what is the definition of Bigdata? The best I have read so far: "Bigdata is when the size of the data itself becomes part of the problem."

Thursday, September 12, 2013

Federated Data Mart & UDBA’s

In many businesses today, users do not find all the data they need in one data mart or EDW; it is distributed across different silos and technology platforms. Some of the challenges faced by businesses are:
·         Required data sets live on different platforms, are difficult to access, and force users to navigate a thin line of permission issues and security concerns
·         The formats of these data sets, be they structured, semi-structured or unstructured, make them difficult to use
·         The grain of the data differs and is often stored at a higher level rather than at the atomic level
·         Even partially satisfying data sets lack utility because of relevance, accuracy and timeliness issues
·         Data sizes are too large to be accommodated in the users' own databases
Businesses have built UDBAs (user-defined databases) just to source these data sets and circumvent the challenges, even though they are difficult to maintain and manage. These data stores are not scalable, their reusability is negligible beyond the group they were built for, and as a result hundreds of different versions get created to satisfy specific business needs.
The issues and concerns are real, and IT managers are aware of them and are trying to pitch in with solutions, but with varying success. Some companies are rewriting their legacy EDWs, either top-down or bottom-up, so that the business can access atomic-grain data in normalized structures as well as in dimensional models for reporting and analytical needs.
 One such solution is to build a federated data model architecture as a bottom-up approach across multiple data mart systems. A successful federated model should share and build a very high level of common data content (the Core), metadata and metrics that are accessible by the various marts. For example, combining digital marketing information with customer information can provide proper customer and product/service segmentation, and even reliable demographic assessment. A financial company, for its Auto Finance LOB, could potentially build various risk-tier propensity models by combining the Origination, Servicing, Collection and Recovery stacks.
Bringing the Core data of employees, customers, products, services and marketing data assets into a federated model makes it easily accessible to the different data marts of each LOB (Line of Business), which add their own specific data for operational and analytical reporting needs. Besides the Core, data sets common to the LOB stacks can also be layered into the Core at different grains, with access privileges, for building mashup BI analytics at the enterprise level. The key requirements of a federated data mart are shared keys, global metadata and the capability to run distributed queries.
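As a hedged sketch of what shared keys buy you, the example below joins extracts from two hypothetical marts on a common customer key using pandas; the mart names, columns and data are invented for illustration.

```python
# Joining extracts from two hypothetical marts on a shared customer key (pandas assumed).
import pandas as pd

# Extract from a hypothetical Digital Marketing mart
marketing = pd.DataFrame({
    "customer_key": [101, 102, 103],
    "campaign_clicks": [12, 0, 7],
})

# Extract from a hypothetical Customer (Core) mart
customers = pd.DataFrame({
    "customer_key": [101, 102, 103],
    "segment": ["premium", "standard", "premium"],
    "region": ["NE", "SW", "NE"],
})

# The shared key is what makes the federated query possible
combined = marketing.merge(customers, on="customer_key", how="inner")
print(combined.groupby("segment")["campaign_clicks"].mean())
```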
One of the true benefits of this design accrues to the UDBAs (user-defined databases). As indicated earlier, UDBAs are decentralized data stores built to meet the customized business requirements of a LOB. Their proliferation in enterprises today stems from the lack of access to Core data as a single-source platform at the atomic grain, and from the multitude of customized business rules and user routines that each group needs to apply to the data sets for its reporting requirements.
These UDBAs come in different forms, shapes, sizes and levels of complexity. Some pull data via ODBC drivers into applications like Excel and run complicated macros to produce reports, while others are small databases of their own that extract data from different sources and apply business transformation logic with user routines, functions, stored procedures and triggers. To top it all, there are hundreds of versions of specific business requirements, and these UDBA processes are duplicated and cloned for each individual group within the LOBs.
In a typical financial institution, an investment fund manager running a portfolio (equities, bonds and other investments) has to comply daily with hundreds of rules and regulations inscribed in the prospectus, as the underlying stocks and bonds change price on the exchanges and Corporate Actions (CA) potentially alter the portfolio. The rules, the terms of those rules and the metadata that defines their decisions all differ from one portfolio to the other. Fund managers have different data source requirements, but most need standard core data sources such as security master data, stock prices, CA and fund indexes, to name a few.
In the absence of common Core data sources and a common rules-construction engine, each investment fund manager ends up building UDBAs under the desk to manage and comply with SEC requirements.
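To illustrate the kind of user-driven rules engine the conclusion below argues for, here is a minimal sketch that evaluates prospectus-style rules against a portfolio; the rule definitions, thresholds and holdings are hypothetical.

```python
# Minimal compliance-rules sketch: evaluate hypothetical prospectus rules against holdings.
# Rules, thresholds and positions are invented for illustration only.
holdings = [
    {"ticker": "ABC", "sector": "tech",   "weight": 0.12},
    {"ticker": "DEF", "sector": "energy", "weight": 0.05},
    {"ticker": "GHI", "sector": "tech",   "weight": 0.09},
]

rules = [
    ("Max single-position weight 10%",
     lambda h: all(p["weight"] <= 0.10 for p in h)),
    ("Max tech sector exposure 20%",
     lambda h: sum(p["weight"] for p in h if p["sector"] == "tech") <= 0.20),
]

# Each rule carries its own terms; a real engine would load them as metadata
for name, check in rules:
    print(f"{name}: {'PASS' if check(holdings) else 'FAIL'}")
```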
Conclusion: First, identify and define the common data sources and make the case to IT to build them into the Core of a federated data mart. Second, commoditize and build a scalable, user-driven rules engine and routines, independent of the database platform technology, as a persistent semantic layer with global metadata.
In a nutshell, there are many benefits to building a federated data mart with a flexible Core, and the UDBA is one prime candidate to benefit from it.