Top Hadoop Interview questions

Last updated on May 3rd, 2018 at 04:56 am

Here are top Hadoop Interview questions

Hadoop Interview questions

HDFS, Hadoop and Big Data questions

Question 1: What is called Big Data?

Answer
Data that is very large in size, typically in the petabyte range, is called Big Data. It is often stated that almost 90% of today’s data has been generated in the past 3 years.

Bigdata, Nosql, Hadoop terms explained

Question 2: What is Hadoop?
Answer
Hadoop is a distributed computing platform written in Java. It consists of components such as HDFS, MapReduce and YARN, and it is used to process big data.

Question 3: What is HDFS?

Answer

HDFS stands for Hadoop Distributed File System. Apache HDFS is a block-structured file system where each file is divided into blocks of a pre-determined size. These blocks are stored across a cluster of one or more machines. HDFS follows a master/slave architecture, where a cluster comprises a single NameNode (master node) and all the other nodes are DataNodes (slave nodes). HDFS can be deployed on a broad spectrum of machines that support Java.

Question 4: What are data blocks in HDFS?

Answer

A block is the minimum amount of data that HDFS can read or write. HDFS blocks are 128 MB by default, and the size is configurable through the HDFS configuration files. Each file in HDFS is broken into block-sized chunks, which are stored as independent units. Unlike a conventional file system, a file in HDFS that is smaller than the block size does not occupy a full block; for example, a 12 MB file stored in HDFS with a 128 MB block size takes only 12 MB of space. The HDFS block size is large in order to minimize the cost of seeks.
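The block arithmetic above can be sketched in a few lines of Python. This is only an illustration of the sizing rule, not HDFS code; the function name split_into_blocks is hypothetical, and the 128 MB default is the value quoted in the answer.

```python
MB = 1024 * 1024

def split_into_blocks(file_size, block_size=128 * MB):
    """Return the sizes (in bytes) of the blocks a file would be split into.

    Hypothetical helper for illustration only; it models the rule that
    the last block holds just the leftover bytes, not a full block.
    """
    if file_size <= 0:
        return []
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # a partial last block uses only what it needs
    return sizes

small = split_into_blocks(12 * MB)
large = split_into_blocks(300 * MB)
print(len(small), small[0] // MB)            # 1 12
print(len(large), [s // MB for s in large])  # 3 [128, 128, 44]
```

A 300 MB file therefore occupies two full blocks and one 44 MB block, while the 12 MB file from the answer occupies a single 12 MB block.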

Question 5: What is the NameNode in HDFS and how does it work?

Answer

HDFS works in a master-slave pattern where the NameNode acts as the master. The NameNode is the controller and manager of HDFS, as it knows the status and the metadata of all the files in HDFS. The metadata consists of file permissions, file names and the location of each block. The metadata is stored in the memory of the NameNode, allowing faster access to data. File system operations such as opening, closing and renaming files are executed by it.

Question 4: What is a DataNode in HDFS?

Answer
DataNodes are the slaves in the HDFS architecture. They store and retrieve blocks when they are told to by a client or the NameNode. DataNodes periodically report back to the NameNode with the list of blocks that they are storing. A DataNode, which typically runs on commodity hardware, also performs block creation, deletion and replication as instructed by the NameNode.

Question 5: What is meant by ‘commodity hardware’? Can Hadoop work on it?

Answer

Commodity hardware refers to inexpensive, affordable and easy-to-obtain systems with average RAM and processors.

Hadoop can be installed on any of them. Hadoop does not require high-end hardware to function.

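Question 6: Which one is the master node in HDFS? Can it be commodity hardware?

Answer
The NameNode is the master node in HDFS. In Hadoop 1.x the JobTracker often runs on the same machine, although it is a separate daemon. The NameNode holds the file system metadata and, in Hadoop 1.x, is a single point of failure in HDFS. It should therefore run on a reliable, high-availability machine rather than on commodity hardware, because the entire HDFS depends on it.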

Question 7: What is a daemon?

Answer
A daemon is a process that runs in the background in a UNIX environment. The Windows equivalent is a ‘service’.

Question 8: What is a rack in HDFS?

Answer

A rack is the physical location where a group of DataNodes is placed together. It is thus a physical collection of DataNodes housed in a single location.

Question 9: How is HDFS fault tolerant?

Answer
When data is stored in HDFS, each block is replicated to several DataNodes under the direction of the NameNode. The default replication factor is 3, and you can change the replication factor as per your need. If a DataNode goes down, the NameNode automatically arranges for its blocks to be copied to another node from the surviving replicas, so the data remains available. This provides fault tolerance in HDFS.

Question 10: How does the NameNode tackle DataNode failures?

Answer
The NameNode periodically receives a heartbeat (signal) from each DataNode in the cluster, which indicates that the DataNode is functioning properly.

A block report contains a list of all the blocks on a DataNode. If a DataNode fails to send a heartbeat message for a specific period of time, it is marked dead.

The NameNode then replicates the blocks of the dead node to other DataNodes using the replicas created earlier.
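The heartbeat and block report bookkeeping described above can be modelled with a small Python sketch. It is a toy model under stated assumptions, not the real NameNode logic: the class NameNodeSketch, its method names and the 30-second timeout are all hypothetical and chosen only for illustration.

```python
class NameNodeSketch:
    """Toy model of heartbeat tracking (hypothetical, not Hadoop code)."""

    def __init__(self, timeout_seconds, replication_factor=3):
        self.timeout_seconds = timeout_seconds      # illustrative value only
        self.replication_factor = replication_factor
        self.last_heartbeat = {}    # DataNode id -> time of last heartbeat
        self.block_locations = {}   # block id -> set of DataNode ids

    def heartbeat(self, node, now):
        # Record that the DataNode was alive at time `now`.
        self.last_heartbeat[node] = now

    def block_report(self, node, blocks):
        # Record which blocks the DataNode says it stores.
        for block in blocks:
            self.block_locations.setdefault(block, set()).add(node)

    def dead_nodes(self, now):
        # A node is considered dead once its heartbeat is older than the timeout.
        return {node for node, seen in self.last_heartbeat.items()
                if now - seen > self.timeout_seconds}

    def under_replicated(self, now):
        # Blocks with fewer live replicas than the replication factor,
        # mapped to the live nodes that could serve as copy sources.
        dead = self.dead_nodes(now)
        result = {}
        for block, nodes in self.block_locations.items():
            live = nodes - dead
            if len(live) < self.replication_factor:
                result[block] = sorted(live)
        return result


nn = NameNodeSketch(timeout_seconds=30)
for node in ("dn1", "dn2", "dn3"):
    nn.heartbeat(node, now=0)
    nn.block_report(node, ["blk_1"])

# dn3 stops sending heartbeats; dn1 and dn2 keep reporting.
nn.heartbeat("dn1", now=40)
nn.heartbeat("dn2", now=40)

print(nn.dead_nodes(now=45))        # {'dn3'}
print(nn.under_replicated(now=45))  # {'blk_1': ['dn1', 'dn2']}
```

In this model the block blk_1 is left with two live replicas, so a real NameNode would schedule a new copy from dn1 or dn2 onto another DataNode.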

Question 11: What is the difference between HDFS and NAS (Network Attached Storage)?

Answer
HDFS data blocks are distributed across the local drives of all machines in a cluster, whereas NAS data is stored on dedicated hardware.

Question 12 What is the difference between RDBMS and Hadoop?

Answer

a) An RDBMS is a relational database management system. Hadoop uses a node-based flat structure.
b) An RDBMS is used for OLTP processing. Hadoop is used for analytical and big data processing.
c) In an RDBMS, the database cluster uses the same data files stored in shared storage. In Hadoop, the data can be stored independently on each processing node.
d) In an RDBMS, preprocessing of data is required before storing it. In Hadoop, you don’t need to preprocess data before storing it.

Question 13: What are the main Hadoop configuration files?

Answer

Following are the main configuration files in Hadoop:

core-site.xml: This one holds the runtime environment settings of a Hadoop cluster.

mapred-site.xml: This one contains the configuration settings for MapReduce.
hdfs-site.xml: This one has the configuration settings for the NameNode, DataNode and Secondary NameNode, and the default block replication.
hadoop-env.sh: This one contains the environment variables that affect the JDK used by the Hadoop daemons (bin/hadoop).
yarn-site.xml: This one contains the configuration settings related to YARN, i.e. settings for the NodeManager, ResourceManager, containers and ApplicationMaster.
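The XML files listed above share one simple layout: a configuration element containing property entries, each with a name and a value. The Python sketch below reads that layout with the standard library. The sample content and the helper read_hadoop_properties are made up for illustration; dfs.replication and dfs.blocksize are the HDFS property names used in Hadoop 2.x and later.

```python
import xml.etree.ElementTree as ET

# Illustrative hdfs-site.xml content (values chosen for the example).
HDFS_SITE_XML = """<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value>
  </property>
</configuration>
"""

def read_hadoop_properties(xml_text):
    """Return a dict of property name -> value from Hadoop-style XML.

    Hypothetical helper; Hadoop itself loads these files through its
    own Configuration class.
    """
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props

props = read_hadoop_properties(HDFS_SITE_XML)
print(props["dfs.replication"])                      # 3
print(int(props["dfs.blocksize"]) // (1024 * 1024))  # 128
```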

 

Question 14: What are the main network requirements for using Hadoop?

Answer

Password-less SSH connection: Hadoop uses SSH to launch processes on the distributed nodes, and a password-less connection lets it do so quickly.
Secure Shell (SSH) for launching server processes.

Question 15: What platform and Java version are required to run Hadoop?

Answer
Java 1.6.x or higher is suitable for Hadoop, preferably from Sun. Newer Hadoop releases need a newer JDK; Hadoop 3.x, for example, requires Java 8. Linux and Windows are the supported operating systems for Hadoop, but BSD, Mac OS/X and Solaris are also known to work.

Question 16: What kind of hardware is best for Hadoop?

Answer
Hadoop can run on dual-processor/dual-core machines with 4-8 GB of RAM using ECC memory. The best choice depends on the workflow needs.

Question 17: What are some differences between Hadoop 1.x and Hadoop 2.x?

Answer

The major differences are given below:

In Hadoop 1.x, the NameNode is a single point of failure. Hadoop 2.x has an active and a passive NameNode, so it provides better availability.
In Hadoop 1.x, processing is done through MRv1 (JobTracker and TaskTracker), while Hadoop 2.x uses MRv2/YARN (ResourceManager and NodeManager).

Question 18: How do you debug Hadoop code?

Answer
There are many ways to debug Hadoop code, but the most popular methods are:

By using counters.
By using the web interface provided by the Hadoop framework.

Question 19: What is YARN?

Answer
YARN stands for Yet Another Resource Negotiator. YARN is the resource management layer of Hadoop. It allocates cluster resources and allows multiple data processing engines, such as real-time streaming, data science and batch processing, to handle data stored on a single platform.

Map-reduce

Question 1: What is Map/Reduce in Hadoop?

Answer
Map/Reduce is a programming model that allows massive scalability across thousands of servers.

It is the data processing layer of the Hadoop ecosystem. It is a software framework for easily writing applications that process vast amounts of structured and unstructured data stored in the Hadoop Distributed File System (HDFS). MapReduce processes huge amounts of data in parallel by dividing the submitted job into a set of independent tasks (sub-jobs).

MapReduce works by breaking the processing into two phases: Map and Reduce. Map is the first phase of processing. In this step, map tasks take a set of data and convert it into another set of data made up of key-value pairs (tuples).

Reduce is the second phase of processing. In this step, reduce tasks take the output from the map as input and combine those data tuples into a smaller set of tuples. Typically, light-weight processing such as aggregation or summation happens in this phase.
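The two phases can be illustrated with the classic word count, written here as a single-process Python sketch. It only mimics the data flow (map, group by key, reduce); it does not use Hadoop, and the function names are hypothetical.

```python
from collections import defaultdict

def map_phase(line):
    # Map: turn one input line into (word, 1) pairs.
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    # Group all values by key and sort by key, as the framework
    # does between the map and reduce phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())

def reduce_phase(key, values):
    # Reduce: aggregate the values of one key into a single result.
    return key, sum(values)

def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped))

print(word_count(["big data big cluster", "data node"]))
# {'big': 2, 'cluster': 1, 'data': 2, 'node': 1}
```

In a real cluster the map calls run in parallel on different nodes and the grouping step moves data across the network; the logic per key stays the same.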

Question 2: What do the terms “map” and “reducer” mean in Hadoop?

Answer
Map: In Hadoop, map is the first phase of processing a query over data in HDFS. A mapper reads data from an input location and outputs key-value pairs according to the input type.

Reducer: In Hadoop, a reducer collects the output generated by the mapper, processes it, and creates a final output of its own.

Question 3: What do you mean by shuffling in MapReduce?

Answer
Shuffling is the process of sorting the map outputs and transferring them to the reducers as input.

Question 4: What are the most common input formats defined in Hadoop?

Answer
These are the most common input formats defined in Hadoop:

TextInputFormat
KeyValueTextInputFormat
SequenceFileInputFormat

TextInputFormat is the default input format.

Question 5: What is InputSplit in Hadoop? What is the difference between an InputSplit and an HDFS block?

Answer
When a Hadoop job runs, it splits the input files into chunks and assigns each chunk to a mapper for processing. Each such chunk is called an InputSplit.

An InputSplit is a logical division of the data, whereas an HDFS block is a physical division of the data.

Question 6: What is TextInputFormat?

Answer
In TextInputFormat, each line in the text file is a record. The value is the content of the line, while the key is the byte offset of the line. For instance, Key: LongWritable, Value: Text.
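The key/value layout described above can be imitated in plain Python. The sketch below is only a model of the record shape (byte offset as key, line content as value); the function text_records is hypothetical and it ignores split boundaries, which the real TextInputFormat has to handle.

```python
def text_records(data: bytes):
    """Yield (byte_offset, line_text) pairs, mimicking TextInputFormat records."""
    offset = 0
    for raw_line in data.splitlines(keepends=True):
        # The key is where the line starts; the value excludes the line ending.
        yield offset, raw_line.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw_line)

records = list(text_records(b"hello hadoop\nhdfs\nyarn\n"))
print(records)  # [(0, 'hello hadoop'), (13, 'hdfs'), (18, 'yarn')]
```

The second record starts at offset 13 because the first line is 12 bytes long plus one newline byte.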

Question 7: What is SequenceFileInputFormat in Hadoop?
Answer

In Hadoop, SequenceFileInputFormat is used to read sequence files. A sequence file is a binary file format, optionally compressed, that is well suited to passing data from the output of one MapReduce job to the input of another MapReduce job.

Question 8: What is the use of RecordReader in Hadoop?
Answer

An InputSplit is assigned a piece of work but does not know how to access it. The RecordReader class is responsible for loading the data from its source and converting it into key-value pairs suitable for reading by the Mapper. The RecordReader instance is defined by the InputFormat.

Question 9: What is the JobTracker in Hadoop 1.x?

Answer
JobTracker is a service within Hadoop which runs MapReduce jobs on the cluster.

Question 10: Is it possible to provide multiple inputs to Hadoop? If yes, explain.

Answer
Yes, it is possible. The input format class provides methods to add multiple directories as input to a Hadoop job; for example, FileInputFormat.addInputPath can be called once per directory.

Question 11: What is the relation between a job and a task in Hadoop?

Answer
In Hadoop, a job is divided into multiple small parts known as tasks.

Question 12: What is Hadoop Streaming?

Answer
Hadoop Streaming is a utility that allows you to create and run map/reduce jobs. It is a generic API that allows programs written in any language to be used as the Hadoop mapper or reducer, as long as they read from standard input and write to standard output.
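A streaming word count might look like the Python sketch below. In a real streaming job the mapper and reducer would be separate scripts that read sys.stdin and print tab-separated key/value lines; here both are written as functions over iterables of lines so the flow can be run locally, and the sort between them stands in for the framework's shuffle. The function names are hypothetical.

```python
from itertools import groupby

def mapper(lines):
    # Emit one "word<TAB>1" line per word, the usual streaming convention.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    # Input lines arrive sorted by key, so equal keys are adjacent
    # and can be summed with groupby.
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"

lines = ["hdfs yarn", "yarn mapreduce yarn"]
map_output = sorted(mapper(lines))  # stands in for the framework's sort
print(list(reducer(map_output)))
# ['hdfs\t1', 'mapreduce\t1', 'yarn\t3']
```

When run under Hadoop, scripts like these are passed to the streaming jar through its -mapper and -reducer options.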

Question 13: What is a combiner in Hadoop?

Answer
A combiner is a mini-reduce process that operates only on data generated by a single mapper. When the mapper emits data, the combiner receives it as input, aggregates it locally and sends its output to the reducer.
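The effect of a combiner can be seen in a small Python sketch: it sums the counts of one mapper's output locally, so fewer records have to be sent to the reducer. The functions map_words and combine are hypothetical and only model the idea.

```python
from collections import Counter

def map_words(line):
    # One mapper's raw output: a (word, 1) pair for every word.
    return [(word, 1) for word in line.split()]

def combine(pairs):
    """Locally sum counts for one mapper's output before it is shuffled."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())

mapper_output = map_words("yarn yarn hdfs yarn hdfs")
combined = combine(mapper_output)
print(len(mapper_output), len(combined))  # 5 2
print(combined)                           # [('hdfs', 2), ('yarn', 3)]
```

Five map output records shrink to two before the shuffle. This only works because summation gives the same result whether it is applied once or in stages.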

Question 14: What is the functionality of the JobTracker in Hadoop 1.x?

Answer
The JobTracker is the service used to submit and track MapReduce jobs in Hadoop. Only one JobTracker process runs on any Hadoop cluster, and it runs within its own JVM process.

How the JobTracker works in Hadoop:

a) When a client application submits jobs to the JobTracker, the JobTracker talks to the NameNode to find the location of the data.
b) It locates TaskTracker nodes with available slots at or near the data.
c) It assigns the work to the chosen TaskTracker nodes.
d) The TaskTracker nodes are responsible for notifying the JobTracker when a task fails, and the JobTracker then decides what to do. It may resubmit the task on another node, or it may mark that specific record as something to avoid.

Question 15: What is the difference between Hadoop and other data processing tools?

Answer
Hadoop lets you increase or decrease the number of mappers without worrying about the volume of data to be processed.

Question 16: What is the distributed cache in Hadoop?

Answer
The distributed cache is a facility provided by the MapReduce framework to cache files (text, archives etc.) needed at the time of execution of the job. The framework copies the necessary files to a slave node before the execution of any task on that node.

Question 17: What are the main components of the job flow in the YARN (Yet Another Resource Negotiator) architecture?

Answer
A MapReduce job flow on YARN involves the components below.

A client node, which submits the MapReduce job.
The YARN ResourceManager, which allocates the cluster resources to jobs.
The YARN NodeManagers, which launch and monitor the tasks of jobs.
The MapReduce ApplicationMaster, which coordinates the tasks running in the MapReduce job.
The HDFS file system, which is used for sharing job files between the above entities.

Question 18: What is speculative execution in Hadoop and how does it help the system?

Answer

Hadoop is very good at distributed processing, but it has a few limitations. In Hadoop, processing is distributed over several nodes, so there is a chance that a few slow nodes hold up the rest of the program. There can be various reasons for tasks to be slow, and they are sometimes not easy to detect.

Instead of identifying and fixing the slow-running tasks, Hadoop tries to detect when a task runs slower than expected and then launches an equivalent task as a backup on a different node. This backup mechanism is called speculative execution.

In other words, it creates a duplicate task on another node.

The same input can be processed multiple times in parallel. When most tasks in a job come to completion, the speculative execution mechanism schedules duplicate copies of the remaining (slower) tasks across the nodes that are currently free. When one copy of a task finishes, the JobTracker is notified. If other copies are still executing speculatively, Hadoop tells the TaskTrackers to abandon those tasks and discard their output.

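Setting for speculative execution

Speculative execution is enabled by default in Hadoop. To disable it, set the JobConf options below to false.

mapred.map.tasks.speculative.execution

mapred.reduce.tasks.speculative.execution

In Hadoop 2.x these properties are named mapreduce.map.speculative and mapreduce.reduce.speculative.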

 
