Here are the top Hadoop interview questions
HDFS, Hadoop and Big Data questions
Question 1: What is called Big Data?
Data sets that are too large to store and process with traditional tools are called Big Data; such data sets typically run into petabytes. It is often stated that almost 90% of today’s data has been generated in the past 3 years.
Question 2: What is Hadoop?
Hadoop is a distributed computing platform written in Java. Its main components are HDFS, MapReduce and YARN. It is used to store and process big data.
Question 3: What is HDFS?
HDFS stands for Hadoop Distributed File System. Apache HDFS is a block-structured file system in which each file is divided into blocks of a predetermined size. These blocks are stored across a cluster of one or more machines. HDFS follows a master/slave architecture: a cluster comprises a single NameNode (master node), and all the other nodes are DataNodes (slave nodes). HDFS can be deployed on a broad spectrum of machines that support Java.
Question 4: What are data blocks in HDFS?
A block is the minimum amount of data that HDFS can read or write. HDFS blocks are 128 MB by default, and the size is configurable through the HDFS configuration files. Each file in HDFS is broken into block-sized chunks, which are stored as independent units. Unlike in a regular file system, a file in HDFS that is smaller than the block size does not occupy a full block; for example, a 12 MB file stored in HDFS with a block size of 128 MB takes up only 12 MB of space. The HDFS block size is large in order to minimize the cost of seeks.
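The block arithmetic above can be sketched in a few lines of Python. This is only an illustration of the sizing rule, not HDFS code; the function name split_into_blocks is hypothetical, and the 128 MB default is taken from the answer above.

```python
BLOCK_SIZE_MB = 128  # default HDFS block size mentioned in the answer above


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes (in MB) of the blocks a file of the given size would occupy."""
    if file_size_mb <= 0:
        return []
    full_blocks = int(file_size_mb // block_size_mb)
    remainder = file_size_mb - full_blocks * block_size_mb
    sizes = [block_size_mb] * full_blocks
    if remainder > 0:
        # The last block only takes as much space as the data it holds.
        sizes.append(remainder)
    return sizes


print(split_into_blocks(12))   # a 12 MB file occupies one 12 MB block
print(split_into_blocks(300))  # two full blocks plus one partial block
```

The 12 MB case mirrors the example in the answer: the file uses a single block, but only 12 MB of storage.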
Question 5: What is the NameNode in HDFS and how does it work?
HDFS works in a master-slave pattern in which the NameNode acts as the master. The NameNode is the controller and manager of HDFS, as it knows the status and the metadata of all the files in HDFS. The metadata consists of file permissions, names and the location of each block. The metadata is stored in the memory of the NameNode, which allows faster access to data. File system operations such as opening, closing and renaming are executed by it.
Question 4: What is a DataNode in HDFS?
DataNodes are the slaves in the HDFS architecture. They store and retrieve blocks when told to by a client or the NameNode. DataNodes periodically report back to the NameNode with a list of the blocks they are storing. A DataNode, which runs on commodity hardware, also performs block creation, deletion and replication as instructed by the NameNode.
Question 5: What is meant by ‘commodity hardware’? Can Hadoop work on it?
Commodity hardware refers to inexpensive, affordable and easy-to-obtain systems with average RAM and processors.
Hadoop can be installed on any such system; it does not require high-end hardware to function.
Question 6: Which one is the master node in HDFS? Can it be commodity hardware?
The NameNode is the master node in HDFS. It holds the file system metadata and, in Hadoop 1.x, is a single point of failure for HDFS (on small Hadoop 1.x clusters the JobTracker often runs on the same machine). Because the entire HDFS depends on it, it should run on a reliable, high-availability machine rather than on commodity hardware.
Question 7: What is a daemon?
A daemon is a process that runs in the background in a UNIX environment. The Windows equivalent is a ‘service’.
Question 8: What is a rack in HDFS?
A rack is a physical collection of DataNodes that are housed together in a single location. A cluster can contain many racks.
Question 9: How is HDFS fault tolerant?
When data is stored in HDFS, each block is replicated to several DataNodes under the direction of the NameNode. The default replication factor is 3, and you can change the replication factor as needed. If a DataNode goes down, the NameNode automatically has the data copied to another node from the remaining replicas, so the data stays available. This provides fault tolerance in HDFS.
Question 10: How does the NameNode tackle DataNode failures?
The NameNode periodically receives a heartbeat (signal) from each DataNode in the cluster, which indicates that the DataNode is functioning properly.
Each DataNode also sends a block report, which contains a list of all the blocks on that DataNode. If a DataNode fails to send a heartbeat for a specific period of time, it is marked dead.
The NameNode then schedules the dead node’s blocks for replication to other DataNodes, using the replicas created earlier.
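The heartbeat logic described above can be illustrated with a small Python simulation. This is a sketch under stated assumptions, not NameNode code: the function names, the node and block names, and the 630-second timeout are all chosen for illustration, and real clusters configure the timeout themselves.

```python
HEARTBEAT_TIMEOUT_S = 630  # assumed timeout for this sketch only


def find_dead_nodes(last_heartbeat, now, timeout=HEARTBEAT_TIMEOUT_S):
    """Return the DataNodes whose last heartbeat is older than the timeout."""
    return sorted(node for node, seen in last_heartbeat.items() if now - seen > timeout)


def blocks_to_rereplicate(block_locations, dead_nodes, replication_factor=3):
    """Map each under-replicated block to the number of extra copies it needs."""
    dead = set(dead_nodes)
    needed = {}
    for block, nodes in block_locations.items():
        live = [n for n in nodes if n not in dead]
        if len(live) < replication_factor:
            needed[block] = replication_factor - len(live)
    return needed


# Timestamps (in seconds) of the last heartbeat seen from each DataNode.
last_heartbeat = {"dn1": 1000, "dn2": 1000, "dn3": 200, "dn4": 995}
# Block reports, inverted into a block -> DataNodes map.
block_locations = {"blk_1": ["dn1", "dn2", "dn3"], "blk_2": ["dn1", "dn2", "dn4"]}

dead = find_dead_nodes(last_heartbeat, now=1000)
print(dead)
print(blocks_to_rereplicate(block_locations, dead))
```

Here dn3 has been silent for longer than the timeout, so it is treated as dead, and only the block that had a replica on dn3 needs one more copy.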
Question 11: What is the difference between HDFS and NAS (Network Attached Storage)?
HDFS data blocks are distributed across the local drives of all the machines in a cluster, whereas NAS data is stored on dedicated hardware.
Question 12: What is the difference between RDBMS and Hadoop?
a) An RDBMS is a relational database management system. Hadoop is a node-based flat structure.
b) An RDBMS is used for OLTP processing. Hadoop is used for analytical and big data processing.
c) In an RDBMS, the database cluster uses the same data files stored in shared storage. In Hadoop, the data can be stored independently on each processing node.
d) In an RDBMS, data must be preprocessed before it is stored. In Hadoop, you don’t need to preprocess data before storing it.
Question 13: What are the main Hadoop configuration files?
The following are the main configuration files in Hadoop:
core-site.xml: This contains the runtime environment settings of a Hadoop cluster.
mapred-site.xml: This contains the configuration settings for MapReduce.
hdfs-site.xml: This contains the configuration settings for the NameNode, DataNode and Secondary NameNode, and the default block replication.
hadoop-env.sh: This contains the environment variables that affect the JDK used by the Hadoop daemons (bin/hadoop).
yarn-site.xml: This contains the configuration settings related to YARN, i.e. settings for the NodeManager, ResourceManager, containers and ApplicationMaster.
Question 14: What are the main network requirements for using Hadoop?
A password-less SSH connection, because Hadoop uses SSH to launch processes on the distributed nodes; a password-less connection lets these processes start quickly and without manual intervention.
Secure Shell (SSH) for launching server processes.
Question 15: What platform and Java version are required to run Hadoop?
Java 1.6.x or higher is suitable for Hadoop, preferably from Sun (newer Hadoop releases require newer Java versions; for example, Hadoop 3.x requires at least Java 8). Linux and Windows are the supported operating systems for Hadoop, but BSD, Mac OS/X and Solaris are also known to work.
Question 16: What kind of hardware is best for Hadoop?
Hadoop can run on dual-processor or dual-core machines with 4-8 GB of ECC RAM. The right hardware depends on the workflow needs.
Question 17: What are a few differences between Hadoop 1.x and Hadoop 2.x?
The major differences are given below:
In Hadoop 1.x, the NameNode is a single point of failure. Hadoop 2.x supports active and passive (standby) NameNodes, so it provides better availability.
In Hadoop 1.x, processing is done through MRv1 (JobTracker and TaskTracker), while Hadoop 2.x uses MRv2/YARN (ResourceManager and NodeManager).
Question 18: How do you debug Hadoop code?
There are many ways to debug Hadoop code, but the most popular methods are:
By using counters.
By using the web interface provided by the Hadoop framework.
Question 19: What is YARN?
YARN stands for Yet Another Resource Negotiator. It is the resource management layer of Hadoop: it allocates cluster resources and allows multiple data processing engines, such as real-time streaming, data science and batch processing, to handle data stored on a single platform.
Question 1: What is Map/Reduce in Hadoop?
Map/Reduce is a programming model that allows massive scalability across thousands of servers.
It is the data processing layer of the Hadoop ecosystem: a software framework for easily writing applications that process vast amounts of structured and unstructured data stored in the Hadoop Distributed File System (HDFS). MapReduce processes huge amounts of data in parallel by dividing a submitted job into a set of independent tasks (sub-jobs).
MapReduce works by breaking the processing into two phases: Map and Reduce. Map is the first phase of processing. In this step, the map tasks take a set of data and convert it into another set of data made up of key-value tuples.
Reduce is the second phase of processing. In this step, the reduce tasks take the output of the map phase as input and combine those data tuples into a smaller set of tuples. Typically, lightweight processing such as aggregation or summation happens in this phase.
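The two phases above can be illustrated with the classic word count, written here as plain Python rather than with the Hadoop API. The function names are hypothetical, and the in-memory grouping step only stands in for the work the framework does between the two phases.

```python
from collections import defaultdict


def map_phase(line):
    """Map: turn one input line into (word, 1) pairs."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_phase(word, counts):
    """Reduce: collapse all the counts for one word into a single total."""
    return word, sum(counts)


def run_word_count(lines):
    """Run map over every line, group the pairs by key, then reduce each group."""
    grouped = defaultdict(list)
    for line in lines:
        for word, count in map_phase(line):
            grouped[word].append(count)  # stands in for the framework's grouping
    return dict(reduce_phase(word, counts) for word, counts in sorted(grouped.items()))


print(run_word_count(["big data", "big cluster"]))
```

Each call to map_phase is independent of the others, which is what lets a real cluster run many map tasks in parallel.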
Question 2: What do the terms “map” and “reducer” mean in Hadoop?
Map: In Hadoop, map is a phase of MapReduce processing. A mapper reads data from an input location and outputs key-value pairs according to the input type.
Reducer: In Hadoop, a reducer collects the output generated by the mappers, processes it, and creates a final output of its own.
Question 3: What do you mean by shuffling in MapReduce?
Shuffling is the process of sorting the map outputs and transferring them to the reducers as input.
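A rough Python sketch of shuffling is shown below. It is not Hadoop code: the partition and shuffle functions are hypothetical, and a CRC32 checksum is used as the partitioning function only so that the result is stable from run to run.

```python
import zlib
from itertools import groupby


def partition(key, num_reducers):
    """Pick a reducer for a key (CRC32 is an assumption made for this sketch)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(map_outputs, num_reducers):
    """Route (key, value) pairs to reducers, then sort and group them by key."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in map_outputs:
        buckets[partition(key, num_reducers)].append((key, value))
    reducer_inputs = []
    for bucket in buckets:
        bucket.sort(key=lambda kv: kv[0])  # sort by key only; the sort is stable
        grouped = [
            (key, [value for _, value in pairs])
            for key, pairs in groupby(bucket, key=lambda kv: kv[0])
        ]
        reducer_inputs.append(grouped)
    return reducer_inputs


map_outputs = [("b", 1), ("a", 1), ("b", 2)]
print(shuffle(map_outputs, 1))
```

Whatever the number of reducers, all the values for a given key end up at the same reducer, which is the property the reduce phase relies on.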
Question 4: What are the most common input formats defined in Hadoop?
These are the most common input formats defined in Hadoop:
TextInputFormat, which is the default input format.
Question 5: What is InputSplit in Hadoop? What is the difference between an InputSplit and an HDFS block?
When a Hadoop job runs, it splits the input files into chunks and assigns each chunk to a mapper for processing. Each such chunk is called an InputSplit.
An InputSplit is a logical division of the data, whereas an HDFS block is a physical division of the data.
Question 6: What is TextInputFormat?
In TextInputFormat, each line of the text file is a record. The value is the content of the line, while the key is the byte offset of the line. For instance, Key: LongWritable, Value: Text.
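The key-value shape described above can be mimicked in Python. This sketch only imitates the idea of byte-offset keys and line values; text_records is a hypothetical helper, not part of Hadoop.

```python
def text_records(data: bytes):
    """Yield (byte_offset, line_text) pairs, one per line of the input."""
    offset = 0
    for raw_line in data.splitlines(keepends=True):
        # The key is where the line starts; the value is the line without its terminator.
        yield offset, raw_line.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw_line)


for key, value in text_records(b"hello\nbig data\n"):
    print(key, value)
```

The second record starts at offset 6 because the first line, including its newline character, is six bytes long.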
Question 7: What is SequenceFileInputFormat in Hadoop?
In Hadoop, SequenceFileInputFormat is used to read sequence files. A sequence file is a binary file format (optionally compressed) that is well suited to passing data from the output of one MapReduce job to the input of another MapReduce job.
Question 8: What is the use of RecordReader in Hadoop?
An InputSplit defines a unit of work but does not describe how to access it. The RecordReader class is responsible for loading the data from its source and converting it into key-value pairs suitable for reading by the Mapper. The RecordReader instance is defined by the InputFormat.
Question 9: What is JobTracker in Hadoop 1.x?
JobTracker is a service within Hadoop which runs MapReduce jobs on the cluster.
Question 10: Is it possible to provide multiple inputs to Hadoop? If yes, explain.
Yes, it is possible. The input format class provides methods to add multiple directories as input to a Hadoop job.
Question 11: What is the relation between a job and a task in Hadoop?
In Hadoop, a job is divided into multiple small parts known as tasks.
Question 12: What is Hadoop Streaming?
Hadoop Streaming is a utility that allows you to create and run MapReduce jobs. It is a generic API that allows programs written in any language to be used as the Hadoop mapper or reducer.
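As an illustration, a streaming-style mapper can be an ordinary script that reads lines from standard input and writes tab-separated key-value lines to standard output. The sketch below is a word-count mapper; the function name run_mapper is hypothetical, and in-memory streams are used so that the example can run without a cluster.

```python
import io


def run_mapper(stdin, stdout):
    """Read lines from stdin and emit one 'word<TAB>1' line per word."""
    for line in stdin:
        for word in line.split():
            stdout.write(f"{word}\t1\n")


# Demo with in-memory streams; a real streaming script would pass
# sys.stdin and sys.stdout instead.
out = io.StringIO()
run_mapper(io.StringIO("big data\nbig cluster\n"), out)
print(out.getvalue(), end="")
```

Because the contract is just text lines on standard input and output, the same idea works in any language that can read and write those streams.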
Question 13: What is a combiner in Hadoop?
A combiner is a mini-reduce process that operates only on data generated by a single mapper. When the mapper emits data, the combiner receives it as input and sends its output to the reducer.
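The effect of a combiner can be sketched in Python as local aggregation over one mapper's output. The combine function below is hypothetical and assumes the reduce operation is a sum, which is a case where combining early does not change the final result.

```python
from collections import Counter


def combine(mapper_output):
    """Locally sum the values for each key emitted by one mapper."""
    totals = Counter()
    for key, value in mapper_output:
        totals[key] += value
    return sorted(totals.items())


mapper_output = [("big", 1), ("data", 1), ("big", 1), ("big", 1)]
combined = combine(mapper_output)
print(len(mapper_output), "pairs before,", len(combined), "pairs after:", combined)
```

Fewer pairs leave the mapper's node, so less data has to be transferred to the reducers.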
Question 14: What is the functionality of JobTracker in Hadoop 1.x?
JobTracker is the service that is used to submit and track MapReduce jobs in Hadoop. Only one JobTracker process runs on any Hadoop cluster, and it runs within its own JVM process.
How JobTracker works in Hadoop:
a) When a client application submits a job to the JobTracker, the JobTracker talks to the NameNode to find the location of the data.
b) It locates TaskTracker nodes with available slots at or near the data.
c) It assigns the work to the chosen TaskTracker nodes.
d) The TaskTracker nodes notify the JobTracker when a task fails, and the JobTracker then decides what to do. It may resubmit the task on another node, or it may mark that specific record as something to avoid.
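Steps a) to c) above can be sketched as a small scheduling function in Python. This is a simplified illustration, not the JobTracker's actual algorithm: the function name, the task and node names, and the "first free candidate wins" rule are all assumptions made for the example.

```python
def assign_tasks(tasks, free_slots):
    """Assign each task to a node, preferring a node that holds the task's data.

    tasks: dict of task name -> list of nodes holding its input block
    free_slots: dict of node -> number of free task slots (copied, not mutated)
    """
    slots = dict(free_slots)
    assignment = {}
    for task, data_nodes in tasks.items():
        # Prefer nodes that already store the data and still have a free slot.
        local = [node for node in data_nodes if slots.get(node, 0) > 0]
        candidates = local or [node for node, free in slots.items() if free > 0]
        if not candidates:
            assignment[task] = None  # no free slot anywhere; the task has to wait
            continue
        node = candidates[0]
        slots[node] -= 1
        assignment[task] = node
    return assignment


tasks = {"t1": ["n1", "n2"], "t2": ["n1"], "t3": ["n3"]}
free_slots = {"n1": 1, "n2": 1, "n3": 0}
print(assign_tasks(tasks, free_slots))
```

In this run t1 is placed on a node that holds its data, t2 falls back to another node because its data node is full, and t3 waits because no slot is free.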
Question 15: What is the difference between Hadoop and other data processing tools?
Hadoop lets you increase or decrease the number of mappers without worrying about the volume of data to be processed.
Question 16: What is the distributed cache in Hadoop?
The distributed cache is a facility provided by the MapReduce framework for caching files (text, archives, etc.) that are needed at the time of job execution. The framework copies the necessary files to a slave node before any task is executed on that node.
Question 17: What are the main components of the job flow in the YARN (Yet Another Resource Negotiator) architecture?
A MapReduce job flow on YARN involves the components below.
A client node, which submits the MapReduce job.
The YARN ResourceManager, which allocates the cluster resources to jobs.
The YARN NodeManagers, which launch and monitor the tasks of jobs.
The MapReduce ApplicationMaster, which coordinates the tasks running in the MapReduce job.
The HDFS file system, which is used for sharing job files between the above entities.
Question 18: What is speculative execution in Hadoop and how does it help the system?
Hadoop is very good at distributed processing, but it has a few limitations. Because processing is distributed over several nodes, a few slow nodes can hold up the rest of the program. Tasks can be slow for various reasons, which are sometimes not easy to detect.
To work around this limitation, Hadoop does not try to identify and fix slow-running tasks. Instead, it tries to detect when a task is running slower than expected and launches an equivalent task as a backup on a different node. This backup mechanism is called speculative execution.
It creates a duplicate task on another node.
The same input can be processed multiple times in parallel. When most tasks in a job are close to completion, the speculative execution mechanism schedules duplicate copies of the remaining (slower) tasks on nodes that are currently free. When one copy of a task finishes, the JobTracker is notified. If other copies are still executing speculatively, Hadoop tells the TaskTrackers to quit those tasks and discard their output.
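The idea can be sketched in Python as two small steps: pick the tasks that lag well behind the rest, then keep whichever attempt finishes first. This is only an illustration; the function names and the slow_factor threshold are hypothetical, and Hadoop's real heuristics for detecting slow tasks are more involved.

```python
def pick_speculative_tasks(progress, slow_factor=0.5):
    """Return running tasks whose progress is below slow_factor times the average.

    progress: dict of task name -> fraction complete (0.0 to 1.0)
    """
    running = {task: done for task, done in progress.items() if done < 1.0}
    if not running:
        return []
    average = sum(progress.values()) / len(progress)
    return sorted(task for task, done in running.items() if done < slow_factor * average)


def first_finisher(attempts):
    """Given attempt name -> finish time, return the winner and the attempts to kill."""
    winner = min(attempts, key=attempts.get)
    return winner, sorted(name for name in attempts if name != winner)


progress = {"t1": 1.0, "t2": 1.0, "t3": 0.9, "t4": 0.2}
print(pick_speculative_tasks(progress))
print(first_finisher({"t4_original": 120, "t4_backup": 45}))
```

Only t4 is far enough behind to get a backup attempt here, and once the backup finishes first the original attempt is the one that gets killed.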
Setting for speculative execution
Speculative execution is enabled by default in Hadoop. To disable it, set the options below to false.