
Performing Hadoop Big Data Analytics using Azure Data Services


More and more organizations are leveraging Big Data consulting to extract value and insights from their data. As data grows exponentially, organizations find that their data warehouse and ETL systems cannot keep up. Distributed storage and compute solutions like Hadoop provide the framework for storing, processing, and analyzing Big Data. However, Hadoop systems deployed on-premise often run into multiple challenges. They require substantial upfront capital investment because they are built for peak capacity.


On-premise Hadoop also provides limited elasticity and scale. Adding new nodes has a long turnaround cycle of procuring, installing, and configuring the hardware before being put to use. On-premise Hadoop also requires a large IT support team to maintain the environment.
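The peak-capacity point can be made concrete with a little arithmetic. The sketch below is only an illustration: the hourly demand figures and the function name are hypothetical, not measurements. It compares the node-hours a cluster sized for peak must provision with the node-hours the workload actually uses.

```python
def utilization_at_peak_sizing(demand_nodes):
    """Return (provisioned, used, utilization) in node-hours for a peak-sized cluster.

    demand_nodes: nodes needed in each hour (hypothetical sample data).
    """
    if not demand_nodes:
        raise ValueError("demand_nodes must not be empty")
    peak = max(demand_nodes)
    provisioned = peak * len(demand_nodes)  # fixed cluster: peak nodes in every hour
    used = sum(demand_nodes)                # elastic cluster: only what each hour needs
    return provisioned, used, used / provisioned


# Hypothetical profile: mostly quiet, with one busy batch window.
demand = [4, 4, 10, 4]
provisioned, used, utilization = utilization_at_peak_sizing(demand)
print(f"provisioned={provisioned} used={used} utilization={utilization:.0%}")
```

With this made-up profile, only 22 of the 40 provisioned node-hours are used. The gap between the two numbers is the capacity that a fixed on-premise cluster pays for but leaves idle.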

With the rise of the cloud and the agility, flexibility, and scale it introduces, running Hadoop on-premise looks like playing an Atari video game in the age of Xbox and PlayStation. Microsoft Azure provides several options for running Big Data workloads in the cloud. The benefits vary with the option chosen and depend on the Big Data workload, whether an existing Hadoop deployment is being migrated, and the Big Data team’s skill set.

Hadoop using IaaS

Cloudera offers Cloudera Enterprise Data Hub as an Azure Marketplace solution that is installed on Azure virtual machines. Cloudera Enterprise Data Hub builds a unified enterprise data platform using the Hadoop framework. The Cloudera option on Azure provides full compatibility with an on-premise deployment. Azure VMs provide leased infrastructure that can be scaled on demand without upfront Capex investment.


Pros:
A full version of Cloudera for running Hadoop big data analytics in the cloud
Compatibility with applications and workloads running on-premise
Converting Capex into Opex
On-demand scaling to add additional nodes or workloads
Cons:
Limited ability to scale up. An Azure VM can be scaled up only to the maximum number of cores that its machine type supports.
IaaS services only offer infrastructure benefits. A dedicated Hadoop admin team (similar to on-premise) will be required to maintain the environment.
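The scale-up limit above can be sketched numerically. This is a minimal illustration, assuming a hypothetical per-VM core ceiling (the real ceiling depends on the VM series chosen, and the function name is invented for this example): once a workload needs more cores than one VM can offer, the only option is to add nodes.

```python
import math


def nodes_needed(target_cores, max_cores_per_vm):
    """Smallest number of worker VMs that together provide target_cores."""
    if target_cores <= 0 or max_cores_per_vm <= 0:
        raise ValueError("both arguments must be positive")
    return math.ceil(target_cores / max_cores_per_vm)


# Hypothetical ceiling; check the limits of the VM series actually in use.
MAX_CORES_PER_VM = 64

print(nodes_needed(48, MAX_CORES_PER_VM))   # fits on one VM, so scaling up is enough
print(nodes_needed(200, MAX_CORES_PER_VM))  # exceeds one VM, so the cluster must scale out
```

Each extra node in an IaaS deployment is another machine that the Hadoop admin team has to configure and maintain.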

Hadoop using PaaS (HDInsight)

Microsoft Azure offers HDInsight, a managed Hadoop service on the Azure cloud platform, for performing Hadoop big data analytics. HDInsight is a first-party Hadoop service built on the Hortonworks distribution: Hortonworks Data Platform (HDP) 3.0 is available through HDInsight version 4.0. Hortonworks integrated its distribution to run natively using Azure PaaS features. HDP services like HDFS, Tez, Hive, HBase, Spark, Storm, and Kafka are available in HDInsight.


HDInsight offers different cluster types depending on the Hadoop services and components used for the workload:

Apache Hadoop: provides HDFS, YARN, and the MapReduce framework
Apache Spark: Spark cluster for in-memory parallel processing to boost big data analysis performance
Apache HBase: NoSQL database based on Hadoop
ML Services: distributed R processes for data science workloads
Apache Storm: real-time processing of streaming data
Interactive Query: in-memory caching for Hive queries
Apache Kafka: streaming data pipelines
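The list above is effectively a lookup from workload to cluster type. A minimal sketch of that lookup follows; the workload labels used as keys are hypothetical names chosen for this example, and the values are the cluster types from the list (with the Hive caching option written as Interactive Query).

```python
# Keys are hypothetical workload labels; values are the cluster types listed above.
HDINSIGHT_CLUSTER_TYPES = {
    "batch_processing": "Apache Hadoop",
    "in_memory_analytics": "Apache Spark",
    "nosql_database": "Apache HBase",
    "r_data_science": "ML Services",
    "real_time_stream_processing": "Apache Storm",
    "interactive_hive_queries": "Interactive Query",
    "streaming_pipelines": "Apache Kafka",
}


def cluster_type_for(workload):
    """Return the HDInsight cluster type for a workload label."""
    try:
        return HDINSIGHT_CLUSTER_TYPES[workload]
    except KeyError:
        known = ", ".join(sorted(HDINSIGHT_CLUSTER_TYPES))
        raise ValueError(f"unknown workload {workload!r}; expected one of: {known}") from None


print(cluster_type_for("interactive_hive_queries"))
```

A workload that spans several labels, for example Spark analytics over HBase data, maps to more than one cluster type.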

Pros:
Separation of storage from compute. HDFS is backed by Azure Data Lake Storage (ADLS), which is separate from the HDInsight cluster
Benefits of Platform as a service (pay as you go, on-demand scale, elasticity, flexibility)
The open-source codebase
Cons:
Only available via Hortonworks distribution (Cloudera version is not available)
Separate clusters for Spark and HDFS
HDFS running on ADLS runs slower than local storage
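The separation of storage from compute shows up in how data is addressed: a path points at the storage account rather than at a cluster node. The sketch below assumes ADLS Gen2, whose paths use the `abfs`/`abfss` URI scheme; the helper name, account name, and container name are hypothetical.

```python
def abfs_uri(account, container, path="", secure=True):
    """Build an ADLS Gen2 URI: abfs[s]://<container>@<account>.dfs.core.windows.net/<path>."""
    if not account or not container:
        raise ValueError("account and container are required")
    scheme = "abfss" if secure else "abfs"  # abfss uses TLS
    return f"{scheme}://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


# Hypothetical storage account and container names.
print(abfs_uri("examplelake", "raw", "/sales/2020/"))
```

Because the URI names only the storage account and container, the same data remains reachable after a cluster is deleted and recreated.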

Big Data using Azure cloud-native services

Hadoop big data analytics can also be achieved with Azure cloud-native services like Azure Databricks, Azure Data Lake Storage, and Azure Synapse Analytics. Databricks provides a unified environment for data engineers, data scientists, and data analysts. Data is stored in a data lake and is processed, analyzed, and reported on using Databricks. Databricks provides a managed Spark engine and tight integration with other Azure services.


Pros:
On-demand scale and option to go serverless
Native integration with Azure services like Synapse, ADLS, etc.
Integrated security through Azure AD and Azure Key Vault
Auto-scaling and option to pause the cluster

Cons:
Requires multiple Azure services to run big data workloads

These are the typical scenarios for running big data workloads in Azure. Depending on the maturity of their existing Hadoop applications, organizations will choose one option or another. The IaaS option provides the most control over the platform. However, it also requires the most work from the application and admin teams.

Organizations looking to migrate existing Hadoop workloads to Azure should first perform a feasibility, cost, and value analysis. Typically, Azure cloud-native big data services like Azure Databricks provide the best performance and the most optimized cost. In our experience, organizations that move Hive and Spark jobs to Databricks see a performance gain of more than 20%. Re-architecting the jobs to leverage cloud-native benefits can improve performance several times over. More and more organizations find it appealing to modernize the monolithic Hadoop platform and achieve a high ROI on the migration investment.

This blog was originally published on Medium.

 

Tarun Agarwal

Data and AI Practice Lead, Motifworks

Tarun has been focused on building digital and data platforms that provide the innovation needed to bridge today’s business realities to future opportunities.

 
