What Is Apache Hadoop and How Does It Work with Amazon EMR?

Hadoop is an open source distributed processing framework that manages data processing and storage for big data applications running in clustered environments. It lies at the center of the big data ecosystem and is used to support advanced analytics such as predictive analytics, data mining, and machine learning applications. It handles many types of structured and unstructured data, giving users more flexibility for processing and analyzing data than relational databases and data warehouses provide.

Hadoop was initially developed by Doug Cutting and Mike Cafarella, and first came into the IT world in 2006. It was named after a toy elephant belonging to Doug Cutting's son. The Apache Software Foundation made it generally available in 2011. It is open source under the Apache License 2.0 and is used by many organizations to manage large volumes of data efficiently.

There are many execution engines and applications in the Hadoop ecosystem, providing a range of tools to fulfill the requirements of your analytics workloads. Amazon EMR uses this ecosystem to create and manage fully configured, elastic clusters of Amazon EC2 instances running Hadoop and other applications in the Hadoop ecosystem.

Key features of Hadoop:

  • A new Hadoop cluster can be initialized dynamically and quickly, or more servers can be added to an existing Amazon EMR cluster, significantly reducing the time it takes to make resources available to data scientists and other users.
  • Hadoop configuration, such as server installation, networking, and security setup, can be a challenging and complicated task. As a managed service, Amazon EMR handles your Hadoop infrastructure requirements so that you can focus on your core business.
  • Hadoop can be easily integrated with other services such as Amazon S3, Amazon Redshift, Amazon Kinesis, and Amazon DynamoDB to enable data movement, workflows, and analytics on the AWS platform.
  • You can flexibly launch your clusters in any number of Availability Zones in any AWS Region by using Hadoop on Amazon EMR. A potential issue in one Region or zone can be avoided by launching a cluster in another zone within a few minutes.
  • Capacity planning in a Hadoop environment can be expensive. With Amazon EMR, clusters can be created with the required capacity within minutes, and Auto Scaling can dynamically scale them out and in.
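As a sketch of how little it takes to describe such a cluster, the function below builds the kind of request dictionary that boto3's EMR `run_job_flow` call accepts. The cluster name, bucket name, and instance counts are hypothetical placeholders, and this is an illustrative outline rather than a production configuration; with boto3 installed and AWS credentials configured, the dict could be passed as `boto3.client("emr").run_job_flow(**config)`.

```python
# Sketch: request parameters for launching a small Hadoop cluster on EMR.
# All names (cluster, bucket) are hypothetical placeholders.

def build_cluster_config(name, log_bucket, core_nodes=2):
    """Build a run_job_flow-style request for a basic Hadoop cluster."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",            # EMR release bundling Hadoop
        "Applications": [{"Name": "Hadoop"}],
        "LogUri": f"s3://{log_bucket}/emr-logs/",
        "Instances": {
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge",
                 "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m5.xlarge",
                 "InstanceCount": core_nodes},   # resized to fit the workload
            ],
            "KeepJobFlowAliveWhenNoSteps": False,  # terminate when idle
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }

config = build_cluster_config("demo-cluster", "my-example-bucket", core_nodes=4)
```

Because the cluster is just data until the API call is made, capacity changes amount to editing `core_nodes` and relaunching, which is the elasticity the bullets above describe.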

Amazon EMR:

Amazon EMR delivers a managed Hadoop framework that offers easy, fast, and cost-effective data processing across dynamically scalable Amazon EC2 instances. Other popular distributed frameworks such as Apache Spark, HBase, Presto, and Flink can also run in Amazon EMR and can interact with data in other AWS data stores such as Amazon S3 and Amazon DynamoDB.

Amazon EMR handles a broad set of big data use cases securely and reliably, including log analysis, data transformations (ETL), web indexing, financial analysis, scientific simulation, machine learning, and bioinformatics.


Components of Hadoop on Amazon EMR:

Hadoop contains three main components: a distributed file system, a parallel programming framework, and a resource/job management system. Linux and Windows are the supported operating systems for Hadoop, but BSD, Mac OS X, and OpenSolaris are also known to work.

1.    Amazon S3 and EMRFS

Using the EMR File System (EMRFS) on an Amazon EMR cluster, Amazon S3 can be leveraged as the data layer for Hadoop. Amazon S3 is highly scalable, low cost, and designed for durability, making it a great data store for big data processing. By storing data in Amazon S3, the compute layer can be decoupled from the storage layer, which allows you to size the Amazon EMR cluster for the memory and CPU requirements of your workloads instead of adding nodes to the cluster just to maximize on-cluster storage. Moreover, the Amazon EMR cluster can be terminated when it is idle to save cost, while the data persists in Amazon S3.

EMRFS is optimized for Hadoop to read from and write directly, in parallel, to Amazon S3, and objects can be protected using Amazon S3 server-side and client-side encryption. EMRFS lets you use Amazon S3 as a data lake, with Hadoop in Amazon EMR serving as an elastic query layer.
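A minimal sketch of the decoupling described above: to a job running on EMR, the storage layer is selected purely by the path's URI scheme. The bucket and path names below are hypothetical, and the helper function is an illustration of the convention, not part of EMRFS itself.

```python
# Sketch: "s3://" routes reads/writes through EMRFS to Amazon S3, while
# "hdfs://" targets the cluster's own disks. Bucket/path names are made up.
from urllib.parse import urlparse

def storage_layer(uri: str) -> str:
    """Classify where a job path lives, based on its URI scheme."""
    scheme = urlparse(uri).scheme
    if scheme in ("s3", "s3n", "s3a"):   # EMRFS / S3 connector schemes
        return "Amazon S3 (durable, decoupled from the cluster)"
    if scheme == "hdfs":
        return "HDFS (on-cluster disks, lost when the cluster terminates)"
    raise ValueError(f"unsupported scheme: {scheme}")

print(storage_layer("s3://my-example-bucket/input/logs/"))
print(storage_layer("hdfs:///tmp/intermediate/"))
```

Because only the scheme differs, the same job can keep its durable inputs and outputs in S3 while using HDFS for transient intermediate data.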

2.    Hadoop Distributed File System (HDFS)

Hadoop includes an open-source, Java-based implementation of a clustered file system called HDFS, which enables cost-effective, reliable, and scalable distributed computing. The architecture of HDFS is highly fault tolerant and designed to be deployed on low-cost hardware.


As a distributed storage system, HDFS stores data in large blocks on the local disks of the cluster's nodes.
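The block-based storage described above can be sketched with a little arithmetic. The 128 MB block size and replication factor of 3 used below are common HDFS defaults, assumed here for illustration.

```python
# Sketch: how HDFS splits a file into fixed-size blocks and replicates each
# block across nodes for fault tolerance. Defaults assumed: 128 MB blocks,
# replication factor 3.
import math

BLOCK_SIZE = 128 * 1024 * 1024      # 128 MB, a common HDFS default
REPLICATION = 3                     # the HDFS default replication factor

def hdfs_footprint(file_size_bytes: int):
    """Return (number of blocks, total bytes stored across the cluster)."""
    blocks = math.ceil(file_size_bytes / BLOCK_SIZE)
    return blocks, file_size_bytes * REPLICATION

blocks, stored = hdfs_footprint(1 * 1024**3)   # a 1 GB file
print(blocks, stored)   # 8 blocks, 3 GB stored across the cluster
```

Replication is why losing a disk or a node does not lose data: two other copies of every block survive elsewhere in the cluster.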

HDFS is automatically installed with Hadoop on your Amazon EMR cluster and can be used alongside Amazon S3 to store your input and output data. HDFS can be easily encrypted using an Amazon EMR security configuration.

3.    Hadoop YARN

The Hadoop YARN framework provides job scheduling and cluster resource management, along with a web user interface for monitoring the Hadoop cluster. In Hadoop, a MapReduce program, packaged as Java JAR files and classes, is called a job. Jobs can be submitted from the CLI or by an HTTP POST to the REST API. Each job comprises the "tasks" that execute the individual map and reduce steps.
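A minimal, in-process sketch of the map and reduce steps that make up such a job: on a real cluster each map task would process one input split and the tasks would run in parallel, but the logic is the same. The function names are illustrative, not Hadoop APIs.

```python
# Sketch: the two phases of a classic word-count MapReduce job, run
# in-process. The shuffle/sort phase between map and reduce is omitted.
from collections import defaultdict

def map_step(line: str):
    """Emit (word, 1) pairs -- one map task's output for a line of input."""
    return [(word.lower(), 1) for word in line.split()]

def reduce_step(pairs):
    """Sum the counts per key -- the aggregation the reduce tasks perform."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["Hadoop stores data", "Hadoop processes data"]
pairs = [p for line in lines for p in map_step(line)]
counts = reduce_step(pairs)
print(counts)   # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

The appeal of the model is that `map_step` calls are independent, so the framework can scatter them across as many task slots as the cluster offers.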


In Hadoop, resources are managed by Yet Another Resource Negotiator (YARN). YARN keeps track of all the resources of the cluster and ensures that they are dynamically allocated to complete jobs. YARN manages Hadoop MapReduce and Tez workloads along with other distributed frameworks such as Apache Spark.
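The resource bookkeeping described above can be sketched as a toy resource manager. `SimpleResourceManager` is a hypothetical illustration; YARN's actual Capacity and Fair schedulers layer queues and sharing policies on top of this basic idea.

```python
# Sketch: track free cluster resources and grant or refuse container
# requests against them, the core of what a YARN ResourceManager does.

class SimpleResourceManager:
    def __init__(self, total_memory_mb: int, total_vcores: int):
        self.free_memory = total_memory_mb
        self.free_vcores = total_vcores

    def allocate(self, memory_mb: int, vcores: int) -> bool:
        """Grant a container if the cluster still has the capacity."""
        if memory_mb <= self.free_memory and vcores <= self.free_vcores:
            self.free_memory -= memory_mb
            self.free_vcores -= vcores
            return True
        return False   # request waits until resources are released

    def release(self, memory_mb: int, vcores: int) -> None:
        """Return a finished container's resources to the pool."""
        self.free_memory += memory_mb
        self.free_vcores += vcores

rm = SimpleResourceManager(total_memory_mb=8192, total_vcores=4)
print(rm.allocate(4096, 2))   # True: the first container fits
print(rm.allocate(8192, 2))   # False: not enough memory left, must wait
```

Dynamic allocation falls out naturally: as containers finish and call `release`, queued requests that previously failed can be granted.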

Other Big Data Tools Associated with Hadoop:

The ecosystem around Hadoop includes other open source tools that enhance its basic capabilities. These tools include:

  • Apache Flume: collects, aggregates, and moves large volumes of streaming data into HDFS
  • Apache HBase: a distributed database that is often paired with Hadoop
  • Apache Hive: a SQL-on-Hadoop tool that delivers data summarization, query, and analysis
  • Apache Oozie: a server-based workflow scheduling system used to manage Hadoop jobs
  • Apache Phoenix: a SQL-based, massively parallel processing (MPP) database engine that uses HBase as its data store
  • Apache Sqoop: a tool for transferring bulk data between Hadoop and structured data stores, and
  • Apache ZooKeeper: a configuration, synchronization, and naming registry service for large distributed systems.

Use-cases of Hadoop:

Following are some use cases of Hadoop:


Razorfish uses clickstream analysis data to segment users and understand user preferences. Advertisers can also analyze clickstreams and advertising impression logs to deliver more effective ads.


At Yelp, Hadoop is used to process the logs generated by its web and mobile apps, turning petabytes of unstructured data into useful insights about its applications and users.
