Big Data - An Overview, Big Data Reference Architecture, Real Time Big Data Analytics



Big Data is a huge area. For those of you who have already read much of the literature and are now implementing it in your projects or products, this article may be worth skimming only. For those who are yet to read about Big Data and are interested, I hope this article gives basic information and context and prompts you to explore the fascinating area of Big Data Analytics.


What's Big Data?

Big Data can be thought of as a solution platform and ecosystem that addresses the lack of fast, interactive, accurate, and cost-effective analytics capable of working on humongous amounts of structured, unstructured, and varied data, both offline and in real time.
The "Big" in Big Data refers not only to the humongous volume of the data sets but to anything that is difficult to manage and calls for out-of-the-box solutions.
What are the characteristics of Big Data?
The characteristics of Big Data help to understand it better. There are six characteristics, commonly known as the 5+1 Vs, associated with Big Data. In brief they are:
a) Volume b) Variety c) Velocity d) Value e) Veracity and f) Variability.
Volume refers to the humongous amount of data that's implied when we say Big Data. Analytical data sets can range up to petabytes and even zettabytes.
Variety refers to the format of the data. It can be structured, semi-structured, or unstructured.
Velocity refers to the speed at which these data sets are generated for analysis. Astronomical telescopes, for example, capture enormous volumes of data within minutes that need analysis for storage and further processing.
Value refers to the insights hidden in these seemingly unrelated or uninteresting data sets.
Veracity refers to the genuineness of data sources and to having a single source of truth.
Variability refers to the seasonal/time-scale variation associated with data, both in terms of volume and freshness.
Why Big Data?
Let's look at some of the use cases that Big Data addresses and why it has become so popular and valuable now.
There are a few broad categories of use cases that Big Data can address.
a) Retail: Personalized marketing, campaigns, and offers. Using Big Data to analyze customer interests and translate them into sales, across channels such as e-commerce site visits (click-stream analysis), customers visiting a store, mobile apps, kiosks/self-service, etc.
b) Operations: Analyze information and use it to improve operations, such as optimizing routes for a transport fleet, scheduling repairs of engines and critical parts, improving restaurant menus, capacity planning, and improving catalogs, the look and feel of web pages, merchandizing, etc.
c) Finance: Analyzing portfolios, validating models, and detecting fraud in real time.
d) Prediction: Flu outbreaks, augmenting seismic measurements, preventive steps in healthcare (premature babies, cancer detection, etc.), election results, and analyzing and aiding security.
e) As an end user, I may avail of a service that tells me when to buy a ticket or purchase a product, provides me security information, or offers health and financial advice.
Why now?
We are in a digital age, producing data at an ever larger rate. Of the world population of 7 billion, internet penetration and usage via smart mobile devices has grown to 3 billion people and is increasing at an average of 8% per year. 90% of the world's data was generated in the last two years alone. To handle this data and get insights, technology has also advanced. With the increase in computing power and the decrease in storage cost, distributed computing solutions are getting commoditized. Solutions like Hadoop make it easier to do distributed computing and enable us to implement Big Data Analytics faster and more cost-effectively. File systems have evolved to handle the huge amount of unstructured data generated now. With these enablers, we get an analytics platform that is driven not by predefined queries but by the data itself. We can now store huge volumes of data, structured or unstructured, and have the tools to analyze it. This gives us an analytics where one thinks about what questions one wants answered: there is no need to restructure the data and rebuild a cube as in a warehouse; just use the same data set with different techniques.
Architecture aspects
Before we proceed to look at Big Data Reference Architectures, let's briefly look at the fundamental technique used in distributed computing platforms like Hadoop. The primary programming paradigm behind Hadoop is MapReduce.
Instead of moving data to the computing resources, as in traditional computing and warehouses, where database I/O and disk I/O add severely to query latency, modern techniques move the computing to the data.
If you have a very large data set to process, shard it (split it horizontally) and distribute the shards to different nodes of your cluster. Keep replicas of the shards on other nodes to enable fault tolerance. Each shard can be processed independently at its node, and the intermediate results from the shards can then be merged/reduced to get the final result of processing the very large data set.
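The idea above can be sketched with the classic word-count example. This is a minimal, in-memory illustration of the map, shuffle, and reduce phases; a real framework like Hadoop would distribute the shards across cluster nodes and handle replication and failures.

```python
# Minimal MapReduce sketch: each "shard" is just a list of lines in memory,
# standing in for a horizontal split of a very large data set.
from collections import defaultdict

def map_phase(shard):
    # Emit intermediate (key, value) pairs: one ("word", 1) per word.
    return [(word, 1) for line in shard for word in line.split()]

def shuffle(mapped):
    # Group intermediate values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Merge each key's values into the final result.
    return {key: sum(values) for key, values in groups.items()}

# Two shards, as if processed independently on two different nodes.
shards = [["big data is big"], ["data is everywhere"]]
mapped = [pair for shard in shards for pair in map_phase(shard)]
counts = reduce_phase(shuffle(mapped))
print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

Because each shard is mapped independently, the map phase parallelizes naturally across nodes; only the shuffle and reduce steps need to bring intermediate results together.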

There is quite a lot of literature available on Big Data Analytics architecture; for the sake of brevity and illustration we refer to just one, by the National Institute of Standards and Technology, US (NIST).
NIST has initiated a working group with a mandate to come up with a Big Data Reference Architecture. The draft architecture is shown in the figure below.
The X axis presents the information value chain from the Data Provider to the Data Consumer. It depicts the five phases of information flow and indicates components for Collection, Curation (manage, maintain, validate, preserve), Analytics, Visualization, and Access. Visualization refers to presenting results in graphs, images, tables, etc. to communicate them effectively. Access emphasizes security and authorization.
The Y axis presents the IT value chain. At the bottom is the crucial infrastructure layer, which includes clusters built upon physical and virtual networking and computing resources. Built on the infrastructure are the platforms and processing frameworks.
On top of the framework provider block is the Application Provider, which does the analytics. At the very top is the System Orchestrator, which can orchestrate one or more applications to meet business needs. Two blocks indicate the cross-cutting concerns, namely Security & Privacy and Management. Management here emphasizes the need for an integrated information management system so that the data used is trusted and the analysis is of value.
The NIST Big Data Working Group also suggests a mapping of the architecture framework to the ecosystem, shown in the figure below, to emphasize that each component of the ecosystem has to be addressed to have a meaningful solution. The working group also highlights the infrastructure aspects of Big Data: the impact of inter-cloud and heterogeneous infrastructure, network bandwidth, security, data management, and analytic tools is articulated.
Real Time Analytics
Another space in Big Data that is gaining momentum is Real Time Analytics. In the financial sector, portfolio analysis and fraud detection call for real-time analytics. Adapting online games and the online content of e-commerce sites to customer preferences, analyzing the engine performance of transportation vessels, etc. are other scenarios where real-time analytics is of great value. As the requirements of batch and real-time analytics are different, the solutions are also different. Hadoop is suited to large-scale storage and processing, but for batch jobs; it is not designed for real-time analytics. In the real-time analytics space, Apache Storm and Yahoo's S4 are prominent. They use messaging infrastructure, and their architecture is different.
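The key difference from batch processing is that state is updated per event as it arrives, rather than over a complete data set. The sketch below illustrates this with a sliding-window average over an unbounded stream; this is an illustration of the stream-processing style, not the Storm or S4 API, and the window size and input values are arbitrary example values.

```python
# Stream-style processing: maintain incremental state per event instead of
# recomputing over the full data set, as a batch job would.
from collections import deque

class WindowedAverage:
    def __init__(self, window_size):
        # deque with maxlen drops the oldest value automatically,
        # so stale events leave the window as new ones arrive.
        self.window = deque(maxlen=window_size)

    def update(self, value):
        # Called once per arriving event; returns the current result
        # immediately, with no wait for a batch run.
        self.window.append(value)
        return sum(self.window) / len(self.window)

avg = WindowedAverage(window_size=3)
for amount in [10, 20, 30, 40]:
    current = avg.update(amount)
print(current)  # average of the last 3 events: (20 + 30 + 40) / 3 = 30.0
```

In Storm, logic like this would live inside a bolt receiving tuples from the messaging infrastructure; the point here is only that results stay continuously up to date as events flow in.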
In their book "Big Data - Principles and Best Practices of Scalable Realtime Data Systems", Nathan Marz and James Warren have proposed an architecture called the lambda architecture that has provisions for addressing both batch and real-time processing.

In brief, their architecture leverages Hadoop for batch and Apache Storm for real time. The real-time analytics focuses on forming a real-time data set for processing and then updating that set by discarding stale data and adding new, relevant data for processing/analytics.
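A defining feature of the lambda architecture is that queries merge a precomputed batch view with a small real-time view covering only the events that arrived since the last batch run. The toy sketch below illustrates that query-time merge; the page names and counts are invented example data, not from the book.

```python
# Toy lambda-architecture query: combine the batch layer's precomputed view
# (recomputed periodically, e.g. by Hadoop) with the speed layer's
# incremental view (updated per event, e.g. by Storm).
batch_view = {"page_a": 100, "page_b": 40}   # counts over the master data set
realtime_view = {"page_a": 3, "page_c": 7}   # counts since the last batch run

def query(key):
    # Serve reads by merging both views; once the next batch run absorbs
    # the recent events, the speed layer's entries for them are discarded.
    return batch_view.get(key, 0) + realtime_view.get(key, 0)

print(query("page_a"))  # 100 from batch + 3 from real time = 103
```

Because the speed layer only ever covers a small, recent slice of the data, it stays cheap to maintain, while the batch layer periodically recomputes everything from the immutable master data set.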

Conclusion
To run the business, you organize data to make it do something specific; to change the business, you take data as-is and determine what it can do for you. These two approaches are more powerful together than either alone. The challenge is to bring the two paradigms together.
Challenges in connecting existing structured and unstructured data sets, data privacy, data security, the use of analytics as the sole basis for loan and insurance approval decisions, and data governance are some of the cross-cutting concerns that need to be addressed effectively to ensure a proper ecosystem in which Big Data can flourish.
Big Data is here to stay, and it has become imperative that we all understand its aspects well to make better decisions, help improve business strategy and operations, and succeed over our competitors. It's a survival kit we need to be well versed in.







