Big Data - An Overview, Big Data Reference Architecture, Real Time Big Data Analytics
Big Data is a huge area. For those of you who have
already read much of the literature and are now implementing it in your projects or
products, this article may be worth only a skim. For others who are yet
to read anything on Big Data and are interested, I hope this article gives
basic information and context and prompts you to explore the fascinating area
of Big Data Analytics.
What's Big Data?
Big Data can be thought of as a solution platform and
ecosystem that addresses the lack of fast, interactive,
accurate, and cost-effective analytics that can work on humongous amounts of
structured and unstructured data, both offline and in real time.
The "Big" in Big Data
refers not only to the humongous volume of the data sets but to
anything that is difficult to manage and calls for out-of-the-box solutions.
What are the characteristics of Big Data?
The characteristics of Big Data help us
understand it better. There are six characteristics, commonly known as the 5+1 Vs,
associated with Big Data. In brief they are
a) Volume b) Variety c) Velocity d) Value e) Veracity and
f) Variability.
Volume refers to the humongous amount of data
associated with Big Data. Analytical data sets can range up
to petabytes and even zettabytes.
Variety refers to the format of the data. It
can be structured, semi-structured, or unstructured.
Velocity refers to the speed at which these data sets are
generated for analysis. There are astronomical telescopes that capture
enormous volumes of data in minutes, which must be analyzed for storage and further
processing.
Value refers to the insights hidden in these
seemingly unrelated or uninteresting data sets.
Veracity refers to the genuineness of data sources and to having a
single source of truth.
Variability refers to the seasonal or time-scale variation
associated with data, both in terms of volume and freshness.
Why Big Data?
Let's look at some of the use cases that Big Data addresses
and why it has become so popular and valuable now.
There are a few broad categories of use cases that Big Data
can address.
a) Retail: Personalized marketing, campaigns, and
offers. Big Data is used to analyze customer interests and translate them into a sale.
This applies across channels such as e-commerce site visits (click-stream
analysis), customers visiting a store, mobile apps, kiosks/self-service, etc.
b) Operations: Analyzing information and using it
to improve operations, such as optimizing routes for a transport fleet, scheduling
repairs of engines and critical parts, improving restaurant menus, capacity
planning, and improving catalogs, webpage look and feel, merchandising, etc.
c) Financial: Analyzing portfolios, validating models, and
detecting fraud in real time.
d) Predictive: Predicting flu outbreaks, augmenting earthquake monitoring,
taking preventive steps in healthcare (premature babies, cancer detection, etc.), forecasting election
results, and aiding security analysis.
e) As an end user, I may avail of a service that
tells me when to buy a ticket or purchase a product, provides
me with security information, or offers health and financial advice.
Why now?
We are in a digital age, and we are producing data at an
ever-larger rate. Of the 7 billion world population, internet
usage via smart mobile devices has grown to 3 billion people and is increasing at an average of 8%
per year. In the last 2 years alone, we have generated 90% of the world's data. Technology
has also advanced to handle this data and extract insights from it. With the increase in
computing power and the decrease in storage cost, distributed computing
solutions are getting commoditized. Solutions like Hadoop make it easier to
do distributed computing and enable us to implement Big Data Analytics in a
faster and more cost-effective way. File systems have evolved to handle the huge amounts of unstructured
data now being generated. With these enablers, we get an analytics platform that is built
not on predefined queries but on the data itself. We can now store huge volumes of data,
structured or unstructured, and have the tools to analyze them. This gives
us an analytics approach where one first thinks about what questions one wants answered. There is no need
to restructure the data and rebuild a cube as in a data warehouse; just use the same data set
with different techniques.
Architecture aspects
Before we proceed to look at Big Data reference
architectures, let's briefly look at the fundamental technique used in
distributed computing platforms like Hadoop. The primary programming
paradigm behind Hadoop is MapReduce.
Instead of moving data to the computing resources, as in
traditional computing and data warehouses, where database I/O and disk I/O severely
add to query latency, the modern technique moves the computing to the
data.
If you have a very large data set to process,
shard it (split it horizontally) and distribute the shards to different nodes of your cluster.
Keep replicas of the shards on other nodes of the cluster to enable fault tolerance.
Each shard can be processed independently at its node, and the
intermediate results from each shard can be merged/reduced to get the
final result of processing the very large data set.
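The shard-then-merge flow above can be sketched in a few lines of Python. This is only an illustration of the MapReduce idea using the classic word-count example; in a real Hadoop cluster, each map would run on the node holding its shard, and the framework would handle distribution, replication, and shuffling.

```python
from collections import defaultdict
from functools import reduce

def map_shard(shard):
    """Map step: count words within a single shard, independently of other shards."""
    counts = defaultdict(int)
    for line in shard:
        for word in line.split():
            counts[word] += 1
    return dict(counts)

def reduce_counts(a, b):
    """Reduce step: merge two intermediate count tables into one."""
    for word, n in b.items():
        a[word] = a.get(word, 0) + n
    return a

# Two shards of a (tiny) data set, as if distributed across two nodes.
shards = [["big data is big"], ["data moves compute to data"]]

# Each map runs independently; a real cluster runs them in parallel.
intermediate = [map_shard(s) for s in shards]

# Merge the per-shard results into the final answer.
final = reduce(reduce_counts, intermediate, {})
print(final["data"])  # 3
```

The key property is that the map step touches only its own shard, so no data needs to move between nodes until the much smaller intermediate results are merged.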
There is quite a lot of literature available on Big
Data Analytics architecture; for the sake of brevity and illustration, we refer
to one from the US National Institute of Standards and
Technology (NIST).
NIST has initiated a working group with a
mandate to come up with a Big Data Reference Architecture. The draft architecture
is shown in the figure below.
The X axis presents the information
value chain from the Data Provider to the Data Consumer. It depicts the five phases of
information flow and indicates components for Collection,
Curation (manage, maintain, validate, preserve), Analytics, Visualization, and Access.
Visualization refers to presenting results as graphs, images, tables, etc. to communicate
them effectively. Access emphasizes security and authorization.
The Y axis presents the IT value chain.
At the bottom is the crucial infrastructure layer, which includes clusters
built upon physical and virtual networking and computing resources. Built on
the infrastructure are the platforms and processing frameworks.
On top of the framework provider block
is the Application Provider, which performs the analytics. At the very top is the
System Orchestrator, which can orchestrate one or more applications to meet the
business needs. Two blocks indicate the cross-cutting concerns,
namely Security & Privacy and Management. Management here emphasizes the
need for an integrated information management system so that the data used
is trusted and the analysis is of value.
The NIST Big Data Working Group also
suggests a mapping of the architecture framework to the ecosystem, shown in the figure
below, to emphasize that each component of the ecosystem has to be addressed to
have a meaningful solution. The group also highlights the
infrastructure aspects of Big Data: the impact of inter-cloud and
heterogeneous infrastructure, network bandwidth, security, data management,
and analytics tools is articulated.
Real Time Analytics
Another space in Big Data that is gaining momentum is
real-time analytics. In the financial sector, portfolio analysis and fraud detection call
for real-time analytics. Adapting online games and the content of
e-commerce sites to customer preferences, analyzing the engine performance of
transportation vessels, etc., are some other scenarios where real-time analytics
is of great value. As the requirements of batch and real-time analytics are
different, the solutions are also different. Hadoop is suited for large-scale
storage and processing, but as batch jobs; it is not designed for real-time
analytics. In the real-time analytics space, Apache Storm and Yahoo's S4 are
prominent. They use messaging infrastructure, and their architecture is
different.
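A defining trait of real-time analytics is that results are updated incrementally as each event arrives, rather than recomputed over the whole data set. The sketch below illustrates this style with a sliding-window counter in plain Python; it is only a toy model of the idea, not the API of Storm or S4, and the event names are hypothetical.

```python
from collections import Counter, deque

class SlidingWindowCounter:
    """Maintain event counts over a sliding time window, updating
    incrementally per event (a toy model of stream processing)."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()   # (timestamp, key) pairs, oldest first
        self.counts = Counter()

    def add(self, key, now):
        """Record one event and expire anything older than the window."""
        self.events.append((now, key))
        self.counts[key] += 1
        while self.events and self.events[0][0] < now - self.window:
            _, old_key = self.events.popleft()
            self.counts[old_key] -= 1
            if self.counts[old_key] == 0:
                del self.counts[old_key]

# Events with timestamps in seconds; the window is 60 seconds.
w = SlidingWindowCounter(window_seconds=60)
w.add("login", now=0)
w.add("login", now=10)
w.add("purchase", now=30)
w.add("login", now=70)   # the login at t=0 falls out of the window here
print(dict(w.counts))    # {'login': 2, 'purchase': 1}
```

Each incoming event costs only a small incremental update, which is what makes answering queries with sub-second latency feasible, in contrast to a batch job that rescans everything.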
In their book "Big Data: Principles and
Best Practices of Scalable Realtime Data Systems", Nathan Marz
and James Warren proposed the Lambda Architecture, which
has provisions for addressing both batch and real-time analytics.
In brief, their architecture
leverages Hadoop for batch processing and Apache Storm for real time. The real-time
layer focuses on forming a real-time data set for processing and then
updating that set by discarding stale data and adding new, relevant data for
processing/analytics.
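A central idea of the Lambda Architecture is that queries are answered by merging a precomputed batch view with a small, constantly updated real-time view. The sketch below shows that query-time merge in Python; the page-view numbers and view contents are hypothetical, standing in for Hadoop output (batch view) and stream-processor output (real-time view).

```python
def merge_views(batch_view, realtime_view):
    """Answer a query by combining the batch view with the real-time
    view that covers data arriving after the last batch run."""
    merged = dict(batch_view)
    for key, delta in realtime_view.items():
        merged[key] = merged.get(key, 0) + delta
    return merged

# Batch view: page-view counts precomputed over the full historical data set.
batch_view = {"/home": 1000, "/checkout": 150}

# Real-time view: counts from events since the last batch recomputation.
realtime_view = {"/home": 42, "/offers": 7}

print(merge_views(batch_view, realtime_view))
# {'/home': 1042, '/checkout': 150, '/offers': 7}
```

When the next batch run completes, it absorbs the recent data, and the real-time view is reset, which is the "discard stale, add new" cycle described above.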
Conclusion
To run the business, you organize data to make it do
something specific; to change the business, you take data as-is and determine
what it can do for you. These two approaches are more powerful together than
either alone. The challenge is to bring the two paradigms together.
Challenges in connecting existing structured and
unstructured data sets, data privacy, data security, the use of analytics as the sole
basis for loan and insurance approval decisions, and data governance are some
of the cross-cutting concerns that need to be effectively addressed to ensure a
proper ecosystem in which Big Data can flourish.
Big Data is here to stay, and it has become imperative that
we all understand its aspects well in order to make better decisions, improve
business strategy and operations, and succeed over our competitors. It's a
survival kit we need to be well versed in.