8/20/2019 Hadoop Questions
1/41
BIG DATA HADOOP BANK
1 | Page https://www.facebook.com/chatchindia
FAQs For Data Science

1. What is the biggest data set that you have processed, and how did you process it? What was the result?
2. Tell me two success stories about your analytics or computer science projects. How was the lift (or success) measured?
3. How do you optimize a web crawler to run much faster, extract better information and summarize data to produce cleaner databases?
4. What is probabilistic merging (AKA fuzzy merging)? Is it easier to handle with SQL or other languages? And
which languages would you choose for semi-structured text data reconciliation?
5. State any 3 positive and negative aspects about your favorite statistical software.
6. You are about to send one million emails (marketing campaign). How do you optimize delivery and response? Can both of these be done separately?
7. How would you turn unstructured data into structured data? Is it really necessary? Is it okay to store data as
flat text files rather than in an SQL-powered RDBMS?
8. In terms of access speed (assuming both fit within RAM) is it better to have 100 small hash tables or one big
hash table in memory? What do you think about in-database analytics?
9. Can you perform logistic regression with Excel? If yes, how can it be done? Would the result be good?
10. Give examples of data that does not have a Gaussian or log-normal distribution. Also give examples of data that has a very chaotic distribution.
11. How can you prove that one improvement you’ve brought to an algorithm is really an improvement over not
doing anything? How familiar are you with A/B testing?
12. What is sensitivity analysis? Is it better to have low sensitivity and low predictive power? How do you perform
good cross-validation? What do you think about the idea of injecting noise in your data set to test the sensitivity
of your models?
13. Compare logistic regression with decision trees and neural networks. How have these technologies improved over the last 15 years?
14. What is root cause analysis? How do you identify a cause vs. a correlation? Give examples.
15. How to detect the best rule set for a fraud detection scoring technology? How do you deal with rule
redundancy, rule discovery and the combinatorial nature of the problem? Can an approximate solution to the
rule set problem be okay? How would you find an okay approximate solution? What factors will help you decide
that it is good enough and stop looking for a better one?
16. Which tools do you use for visualization? What do you think of Tableau, R and SAS (for graphs)? How do you efficiently represent 5 dimensions in a chart or in a video?
17. Which is better: Too many false positives or too many false negatives?
18. Have you used any of the following: time series models, cross-correlations with time lags, correlograms, spectral analysis, signal processing and filtering techniques? If yes, in which context?
19. What is the computational complexity of a good and fast clustering algorithm? What is a good clustering algorithm? How do you determine the number of clusters? How would you perform clustering on one million unique keywords, assuming you have 10 million data points, each consisting of two keywords and a metric measuring how similar these two keywords are? How would you create this 10-million-data-point table in the first place?
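One hedged way to sketch the keyword-clustering part of this question is to treat each similar pair as an edge and group connected keywords with union-find (disjoint sets); the pairs and the 0.5 threshold below are invented for illustration, not from the question bank.

```python
def cluster_keywords(pairs, threshold=0.5):
    """pairs: iterable of (keyword_a, keyword_b, similarity)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    for a, b, sim in pairs:
        find(a); find(b)          # register both keywords
        if sim >= threshold:      # only sufficiently similar pairs merge
            union(a, b)

    clusters = {}
    for kw in parent:
        clusters.setdefault(find(kw), set()).add(kw)
    return list(clusters.values())

pairs = [
    ("car", "auto", 0.9),
    ("auto", "vehicle", 0.7),
    ("banana", "apple", 0.6),
    ("car", "banana", 0.1),
]
print(sorted(len(c) for c in cluster_keywords(pairs)))  # cluster sizes: [2, 3]
```

At this scale (10 million pairs) the same union-find idea still works in near-linear time, which is one reasonable answer to the complexity part of the question.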
20. How can you fit Non-Linear relations between X (say, Age) and Y (say, Income) into a Linear Model?
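A minimal sketch of one common answer to question 20: expand X into polynomial features, so the model stays linear in its coefficients even though it is non-linear in Age. The ages, incomes and the quadratic curve below are invented for illustration.

```python
def solve3(A, y):
    """Gaussian elimination for a 3x3 system (pivots are non-zero here)."""
    A = [row[:] + [v] for row, v in zip(A, y)]   # augmented matrix
    for i in range(3):
        p = A[i][i]
        A[i] = [v / p for v in A[i]]             # normalize pivot row
        for j in range(3):
            if j != i:                           # eliminate column i elsewhere
                f = A[j][i]
                A[j] = [vj - f * vi for vj, vi in zip(A[j], A[i])]
    return [row[3] for row in A]

# made-up "true" relation: income = 10 + 2*age - 0.01*age^2
ages = [20, 40, 60]
incomes = [10 + 2 * a - 0.01 * a * a for a in ages]

# design matrix with polynomial features: the model is linear in the coefficients
X = [[1, a, a * a] for a in ages]
coeffs = solve3(X, incomes)
print([round(c, 4) for c in coeffs])   # recovers the generating coefficients
```

The key point: "linear model" means linear in the parameters, so any fixed transformation of X (powers, logs, bins) can be fit with ordinary linear regression.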
Yes, but only for the members of an object. A null cannot be added to the database collection as it isn't an object, but {} can be added.
Does an update fsync to disk immediately?
No. Writes to disk are lazy by default. A write may only hit the disk a couple of seconds later. For example, if the database receives a thousand increments to an object within one second, it will only be flushed to disk once. (Note: fsync options are available both at the command line and via getLastError.)
How do I do transactions/locking?
MongoDB does not use traditional locking or complex transactions with rollback, as it is designed to be lightweight, fast and predictable in its performance. It can be thought of as analogous to MySQL's MyISAM autocommit model. By keeping transaction support extremely simple, performance is enhanced, especially in a system that may run across many servers.
Why are data files so large?
MongoDB does aggressive preallocation of reserved space to avoid file system fragmentation.
How long does replica set failover take?
It may take 10-30 seconds for the primary to be declared down by the other members and a new primary to be elected. During this window of time, the cluster is down for primary operations, i.e. writes and strongly consistent reads. However, eventually consistent queries may be executed against secondaries at any time (in slaveOk mode), including during this window.
What's a Master or Primary?
This is the node/member which is currently primary and processes all writes for the replica set. During a failover event in a replica set, a different member can become primary.
What's a Secondary or Slave?
A secondary is a node/member which applies operations from the current primary. This is done by tailing the replication oplog (local.oplog.rs). Replication from primary to secondary is asynchronous; however, the secondary will try to stay as close to current as possible (often this is just a few milliseconds on a LAN).
Is it required to call 'getLastError' to make a write durable?
No. If 'getLastError' (aka 'Safe Mode') is not called, the server behaves exactly as if it had been called. The 'getLastError' call simply allows one to get a confirmation that the write operation was successfully committed. Of course, you will often want that confirmation, but the safety and durability of the write are independent of it.
Should you start out with a Sharded or a Non-Sharded MongoDB environment?
We suggest starting with Non-Sharded for simplicity and quick startup, unless your initial data set will not fit on a single server. Upgrading from Non-Sharded to Sharded is easy and seamless, so there is not a lot of advantage in setting up Sharding before your data set is large.
How does Sharding work with replication?
Each Shard is a logical collection of partitioned data. The shard could consist of a single server or a cluster of replicas.
Using a replica set for each Shard is highly recommended.
When will data be on more than one Shard?
MongoDB Sharding is range-based, so all the objects in a collection lie in a chunk. Only when there is more than one chunk is there an option for multiple Shards to get data. Right now, the default chunk size is 64 MB, so you need at least 64 MB of data before a migration can occur.
What happens when a document is updated on a chunk that is being migrated?
The update will go through immediately on the old Shard, and then the change will be replicated to the new Shard before ownership transfers.
What happens when a Shard is down or slow when querying?
If a Shard is down, the query will return an error unless the 'Partial' query option is set. If a Shard is responding slowly, mongos will wait for it.
Can the old files in the 'moveChunk' directory be removed?
Yes, these files are created as backups during normal Shard balancing operations. Once the operations are done, they can be deleted. The clean-up process is currently manual, so this needs to be taken care of to free up space.
How do you see the connections used by Mongos?
The following command needs to be used:
db._adminCommand("connPoolStats");
If a ‘moveChunk’ fails, is it necessary to cleanup the partially moved docs?
No, chunk moves are consistent and deterministic. The move will retry, and when completed, the data will be only on the new Shard.
What are the disadvantages of MongoDB?
1. A 32-bit edition has a 2GB data limit. After that, it will corrupt the entire DB, including the existing data. A 64-bit edition doesn't suffer from this bug/feature.
2. The default installation of MongoDB has asynchronous and batch commits turned on. Meaning, it lies when asked to store something in the DB and commits all changes in a batch at a later time. If there is a server crash or power failure, all those commits buffered in memory will be lost. This functionality can be disabled, but then it will perform only as well as, or worse than, MySQL.
3. MongoDB is only ideal for implementing things like analytics/caching where the impact of a small data loss is negligible.
4. In MongoDB, it's difficult to represent relationships between data, so you end up doing that manually by creating another table to represent the relationship between rows in two or more tables.
FAQs For Hadoop Administration
Explain checkpointing in Hadoop and why is it important?
Checkpointing is an essential part of maintaining and persisting filesystem metadata in HDFS. It is crucial for efficient Namenode recovery and restart, and is an important indicator of overall cluster health.
The Namenode persists filesystem metadata. At a high level, the Namenode's primary responsibility is to store the HDFS namespace: things like the directory tree, file permissions and the mapping of files to block IDs. It is essential that this metadata is safely persisted to stable storage for fault tolerance.
This filesystem metadata is stored in two different parts: the fsimage and the edit log. The fsimage is a file that
represents a point-in-time snapshot of the filesystem’s metadata. However, while the fsimage file format is very
efficient to read, it’s unsuitable for making small incremental updates like renaming a single file. Thus, rather than
writing a new fsimage every time the namespace is modified, the NameNode instead records the modifying operation
in the edit log for durability. This way, if the NameNode crashes, it can restore its state by first loading the fsimage
then replaying all the operations (also called edits or transactions) in the edit log to catch up to the most recent state
of the namesystem. The edit log comprises a series of files, called edit log segments, that together represent all the
namesystem modifications made since the creation of the fsimage.
What is the default block size in HDFS and what are the benefits of having smaller block sizes?
Most block-structured file systems use a block size on the order of 4 or 8 KB. By contrast, the default block size in HDFS is 64 MB, and often larger. This allows HDFS to decrease the amount of metadata storage required per file. Furthermore, it allows fast streaming reads of data, by keeping large amounts of data sequentially organized on the disk. As a result, HDFS is expected to store very large files that are read sequentially. Unlike a file system such as NTFS or EXT, which has numerous small files, HDFS stores a modest number of very large files: hundreds of megabytes, or gigabytes, each.
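A quick back-of-the-envelope sketch of the metadata saving (the 1 GB file is a hypothetical example; the namenode tracks one entry per block):

```python
import math

def num_blocks(file_size, block_size):
    """Number of blocks (and hence namenode metadata entries) for one file."""
    return math.ceil(file_size / block_size)

GB = 1024**3
print(num_blocks(1 * GB, 4 * 1024))       # 4 KB blocks  -> 262144 entries
print(num_blocks(1 * GB, 64 * 1024**2))   # 64 MB blocks -> 16 entries
```

The same 1 GB file needs four orders of magnitude fewer metadata entries with 64 MB blocks, which is the point of the large default.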
What are two main modules which help you interact with HDFS and what are they used for?
user@machine:hadoop$ bin/hadoop moduleName -cmd args...
The moduleName tells the program which subset of Hadoop functionality to use. -cmd is the name of a specific command within this module to execute. Its arguments follow the command name.
The two modules relevant to HDFS are : dfs and dfsadmin.
The dfs module, also known as ‘FsShell’, provides basic file manipulation operations and works with objects within
the file system. The dfsadmin module manipulates or queries the file system as a whole.
How can I setup Hadoop nodes (datanodes/namenodes) to use multiple volumes/disks?
Datanodes can store blocks in multiple directories, typically located on different local disk drives. In order to set up multiple directories, one needs to specify a comma-separated list of pathnames as the value of the config parameter dfs.data.dir/dfs.datanode.data.dir. Datanodes will attempt to place equal amounts of data in each of the directories.
The Namenode also supports multiple directories, which store the namespace image and edit logs. In order to set up multiple directories, one needs to specify a comma-separated list of pathnames as the value of the config parameter dfs.name.dir/dfs.namenode.name.dir. The Namenode directories are used for namespace data replication, so that the image and log can be restored from the remaining disks/volumes if one of the disks fails.
How do you read a file from HDFS?
The following are the steps for doing this:
Step 1: The client uses a Hadoop client program to make the request.
Step 2: The client program reads the cluster config file on the local machine, which tells it where the namenode is located. This has to be configured ahead of time.
Step 3: The client contacts the NameNode and requests the file it would like to read.
Step 4: The client's identity is validated, either by username or by a strong authentication mechanism such as Kerberos.
Step 5: The client's validated request is checked against the owner and permissions of the file.
Step 6: If the file exists and the user has access to it, the NameNode responds with the first block id and provides a list of datanodes where a copy of the block can be found, sorted by their distance to the client (reader).
Step 7: The client now contacts the most appropriate datanode directly and reads the block data. This process repeats until all blocks in the file have been read or the client closes the file stream.
If a datanode dies while the file is being read, the client library will automatically attempt to read another replica of the data from another datanode. If all replicas are unavailable, the read operation fails and the client receives an exception. If the block-location information returned by the NameNode is outdated by the time the client attempts to contact a datanode, a retry will occur if there are other replicas, or else the read will fail.
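The replica-selection and failover behaviour in the last steps can be sketched as a toy simulation (node names and the failure set are invented; this is not real HDFS client code):

```python
def read_block(replicas, failed):
    """replicas: datanodes sorted by distance to the client (closest first)."""
    for node in replicas:
        if node not in failed:
            return f"read from {node}"     # closest live replica wins
    raise IOError("all replicas unavailable")

replicas = ["dn-rack1-a", "dn-rack1-b", "dn-rack2-c"]   # sorted by distance
print(read_block(replicas, failed=set()))               # closest node
print(read_block(replicas, failed={"dn-rack1-a"}))      # falls back to next
```

The real client also re-fetches block locations from the NameNode on failure; the sketch only shows the "try the next replica" part.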
What are schedulers and what are the three types of schedulers that can be used in a Hadoop cluster?
Schedulers are responsible for assigning tasks to open slots on tasktrackers. The scheduler is a plug-in within the jobtracker. The three types of schedulers are:
FIFO (First in First Out) Scheduler
Fair Scheduler
Capacity Scheduler
How do you decide which scheduler to use?
The Capacity Scheduler (CS) can be used in the following situations:
1. When you know a lot about your cluster workloads and utilization and simply want to enforce resource allocation.
2. When you have very little fluctuation within queue utilization. The CS's more rigid resource allocation makes sense when all queues are at capacity almost all the time.
3. When you have high variance in the memory requirements of jobs and you need the CS's memory-based scheduling support.
4. When you demand scheduler determinism.
The Fair Scheduler can be used over the Capacity Scheduler under the following conditions:
1. When you have a slow network and data locality makes a significant difference to job runtime; features like delay scheduling can make a dramatic difference in the effective locality rate of map tasks.
2. When you have a lot of variability in the utilization between pools; the Fair Scheduler's preemption model achieves much greater overall cluster utilization by giving away otherwise reserved resources when they're not used.
3. When you require jobs within a pool to make equal progress rather than running in FIFO order.
Why are 'dfs.name.dir' and 'dfs.data.dir' parameters used? Where are they specified and what happens if you don't specify these parameters?
dfs.name.dir specifies the path of the directory in the Namenode's local file system where HDFS's metadata is stored, and dfs.data.dir specifies the path of the directory in the Datanode's local file system where HDFS's file blocks are stored. These parameters are specified in the hdfs-site.xml config file of all nodes in the cluster, including master and slave nodes.
If these parameters are not specified, the Namenode's metadata and the Datanode's file blocks get stored in /tmp under a hadoop-username directory. This is not a safe place: when nodes are restarted, data will be lost. This is critical for the Namenode, as its formatting information will be lost.
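As an illustration, a minimal hdfs-site.xml fragment setting both parameters might look like this (the paths are placeholders, not recommendations):

```xml
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/data/1/dfs/nn,/data/2/dfs/nn</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/data/1/dfs/dn,/data/2/dfs/dn</value>
  </property>
</configuration>
```

The comma-separated values give each daemon multiple directories, as described above.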
What is the file system checking utility FSCK used for? What kind of information does it show? Can FSCK show information about files which are open for writing by a client?
The FileSystem checking utility fsck is used to check and display the health of the file system and of the files and blocks in it. When used with a path (bin/hadoop fsck <path> -files -blocks -locations -racks), it recursively shows the health of all files under that path; when used with '/', it checks the entire file system. By default, fsck ignores files still open for writing by a client. To list such files, run fsck with the -openforwrite option.
fsck checks the file system, prints a dot for each file found healthy, and prints a message for the ones that are less than healthy, including the ones which have over-replicated blocks, under-replicated blocks, mis-replicated blocks, corrupt blocks and missing replicas.
What are the important configuration files that need to be updated/edited to set up a fully distributed mode of a Hadoop 1.x cluster (Apache distribution)?
The configuration files that need to be updated to set up a fully distributed mode of Hadoop are:
hadoop-env.sh
core-site.xml
hdfs-site.xml
mapred-site.xml
masters
slaves
These files can be found in your Hadoop conf directory. If Hadoop daemons are started individually using 'bin/hadoop-daemon.sh start xxxxx', where xxxxx is the name of a daemon, then the masters and slaves files need not be updated and can be empty. This way of starting daemons requires the command to be issued on the appropriate nodes to start the appropriate daemons. If Hadoop daemons are started using 'bin/start-dfs.sh' and 'bin/start-mapred.sh', then the masters and slaves configuration files on the namenode machine need to be updated:
Masters – IP address/hostname of the node where the secondarynamenode will run.
Slaves – IP addresses/hostnames of the nodes where the datanodes (and eventually task trackers) will run.
FAQs For Hadoop HDFS

What is Big Data?
Big Data is nothing but an assortment of such huge and complex data that it becomes very tedious to capture, store, process, retrieve and analyze it with the help of on-hand database management tools or traditional data processing techniques.
Can you give some examples of Big Data?
There are many real-life examples of Big Data: Facebook is generating 500+ terabytes of data per day, NYSE (New York Stock Exchange) generates about 1 terabyte of new trade data per day, and a jet airliner collects 10 terabytes of sensor data for every 30 minutes of flying time. All these are day-to-day examples of Big Data!

Can you give a detailed overview about the Big Data being generated by Facebook?
As of December 31, 2012, there are 1.06 billion monthly active users on Facebook and 680 million mobile users. On average, 3.2 billion likes and comments are posted every day on Facebook. 72% of the web audience is on Facebook. And why not! There are so many activities going on Facebook, from wall posts, sharing images and videos, to writing comments and liking posts. In fact, Facebook started using Hadoop in mid-2009 and was one of its initial users.
What are the four characteristics of Big Data?
According to IBM, the four characteristics of Big Data are:
Volume: Facebook generating 500+ terabytes of data per day.
Velocity: Analyzing 2 million records each day to identify the reason for losses.
Variety: images, audio, video, sensor data, log files, etc.
Veracity: biases, noise and abnormality in data.
How Big is 'Big Data'?
With time, data volume is growing exponentially. Earlier we used to talk about megabytes or gigabytes. But the time has arrived when we talk about data volume in terms of terabytes, petabytes and even zettabytes! Global data volume was around 1.8 ZB in 2011 and is expected to be 7.9 ZB in 2015. It is also said that global information doubles every two years!
How is analysis of Big Data useful for organizations?
Effective analysis of Big Data provides a lot of business advantage, as organizations learn which areas to focus on and which areas are less important. Big Data analysis provides early key indicators that can prevent the company from a huge loss or help in grasping a great opportunity with open arms! A precise analysis of Big Data helps in decision making. For instance, nowadays people rely so much on Facebook and Twitter before buying any product or service. All thanks to the Big Data explosion.
Who are 'Data Scientists'?
Data scientists are soon replacing business analysts or data analysts. Data scientists are experts who find solutions to analyze data. Just as with web analysts, we have data scientists who have good business insight into how to handle a business challenge. Sharp data scientists are not only involved in dealing with business problems, but also in choosing the relevant issues that can bring value-addition to the organization.
What is Hadoop?
Hadoop is a framework that allows for distributed processing of large data sets across clusters of commodity
computers using a simple programming model.
11 | Page ANTRIXSH GUPTA
Every day a large amount of unstructured data is getting dumped into our machines. The major challenge is not to store large data sets in our systems but to retrieve and analyze the big data in organizations, especially data present in different machines at different locations. In this situation a necessity for Hadoop arises. Hadoop has the ability to analyze the data present in different machines at different locations very quickly and in a very cost-effective way. It uses the concept of MapReduce, which enables it to divide the query into small parts and process them in parallel. This is also known as parallel computing.
What are some of the characteristics of the Hadoop framework?
The Hadoop framework is written in Java. It is designed to solve problems that involve analyzing large data sets (e.g. petabytes). The programming model is based on Google's MapReduce, and the infrastructure is based on Google's BigTable and distributed file system (GFS). Hadoop handles large files/data throughput and supports data-intensive distributed applications. Hadoop is scalable, as more nodes can easily be added to it.
Give a brief overview of Hadoop history.
In 2002, Doug Cutting created an open-source web crawler project. In 2004, Google published the MapReduce and GFS papers. In 2006, Doug Cutting developed the open-source MapReduce and HDFS project. In 2008, Yahoo ran a 4,000-node Hadoop cluster, and Hadoop won the terabyte sort benchmark. In 2009, Facebook launched SQL support for Hadoop.
Give examples of some companies that are using Hadoop.
A lot of companies are using Hadoop, such as Cloudera, EMC, MapR, Hortonworks, Amazon, Facebook, eBay, Twitter, Google and so on.
What is the basic difference between a traditional RDBMS and Hadoop?
A traditional RDBMS is used for transactional systems to report and archive data, whereas Hadoop is an approach to store huge amounts of data in a distributed file system and process it. An RDBMS will be useful when you want to seek one record from big data, whereas Hadoop will be useful when you want big data in one shot and perform analysis on it later.
What is structured and unstructured data?
Structured data is data that is easily identifiable, as it is organized in a structure. The most common form of structured data is a database where specific information is stored in tables, that is, rows and columns. Unstructured data refers to any data that cannot be identified easily. It could be in the form of images, videos, documents, email, logs and random text. It is not in the form of rows and columns.

What are the core components of Hadoop?
The core components of Hadoop are HDFS and MapReduce. HDFS is basically used to store large data sets and MapReduce is used to process such large data sets.
Now, let's get cracking with the hard stuff: What is HDFS?
HDFS is a file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
What are the key features of HDFS?
HDFS is highly fault-tolerant, provides high throughput, is suitable for applications with large data sets, provides streaming access to file system data, and can be built out of commodity hardware.
What is Fault Tolerance?
Suppose you have a file stored in a system, and due to some technical problem that file gets destroyed. Then there is no chance of getting back the data present in that file. To avoid such situations, Hadoop has introduced the feature of fault tolerance in HDFS. In Hadoop, when we store a file, it automatically gets replicated at two other locations as well. So even if one or two of the systems collapse, the file is still available on the third system.
Replication causes data redundancy, then why is it pursued in HDFS?
HDFS works with commodity hardware (systems with average configurations) that has high chances of crashing at any time. Thus, to make the entire system highly fault-tolerant, HDFS replicates and stores data in different places. Any data on HDFS gets stored at at least 3 different locations. So, even if one of them is corrupted and another is unavailable for some time for any reason, the data can still be accessed from the third one. Hence, there is no chance of losing the data. This replication factor helps us attain the Hadoop feature called fault tolerance.
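The storage cost of this design is simple to sketch (the 10 TB figure is invented for illustration): with the default replication factor of 3, raw cluster capacity must be roughly 3x the logical data size.

```python
def raw_storage_needed(logical_bytes, replication=3):
    """Raw disk required to hold logical_bytes at the given replication factor."""
    return logical_bytes * replication

TB = 1024**4
print(raw_storage_needed(10 * TB) / TB)   # 10 TB of data -> 30.0 TB raw
```

This trade of disk space for availability is why HDFS can tolerate the failure of whole nodes without data loss.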
Since the data is replicated thrice in HDFS, does it mean that any calculation done on one node will also be replicated on the other two?
Since there are 3 nodes, when we send the MapReduce programs, calculations will be done only on the original data. The master node will know which node exactly has that particular data. In case one of the nodes is not responding, it is assumed to have failed. Only then will the required calculation be done on the second replica.
What is throughput? How does HDFS get a good throughput?
Throughput is the amount of work done in unit time. It describes how fast the data is getting accessed from the system, and it is usually used to measure the performance of the system. In HDFS, when we want to perform a task or an action, the work is divided and shared among different systems. So all the systems will be executing the tasks assigned to them independently and in parallel, and the work will be completed in a very short period of time. In this way, HDFS gives good throughput. By reading data in parallel, we decrease the actual time to read data tremendously.
What is streaming access?
As HDFS works on the principle of 'Write Once, Read Many', the feature of streaming access is extremely important in HDFS. HDFS focuses not so much on storing the data but on how to retrieve it at the fastest possible speed, especially while analyzing logs. In HDFS, reading the complete data is more important than the time taken to fetch a single record from the data.
What is commodity hardware? Does commodity hardware include RAM?
Commodity hardware is an inexpensive system which is not of high quality or high availability. Hadoop can be installed on any average commodity hardware. We don't need supercomputers or high-end hardware to work with Hadoop. Yes, commodity hardware includes RAM, because there will be some services running in RAM.
What is a Namenode?
The Namenode is the master node on which the job tracker runs, and it holds the metadata. It maintains and manages the blocks which are present on the datanodes. It is a high-availability machine and a single point of failure in HDFS.
Is the Namenode also commodity hardware?
No. The Namenode can never be commodity hardware because the entire HDFS relies on it. It is the single point of failure in HDFS. The Namenode has to be a high-availability machine.
What is metadata?
Metadata is the information about the data stored in datanodes, such as the location of the file, the size of the file and so on.
What is a Datanode?
Datanodes are the slaves which are deployed on each machine and provide the actual storage. They are responsible for serving read and write requests for the clients.
Why do we use HDFS for applications having large data sets and not when there are lots of small files?
HDFS is more suitable for a large amount of data in a single file than for small amounts of data spread across multiple files. This is because the Namenode is a very expensive, high-performance system, so it is not prudent to fill its space with the unnecessary amount of metadata that is generated for multiple small files. When there is a large amount of data in a single file, the Namenode occupies less space. Hence, for optimized performance, HDFS supports large data sets rather than multiple small files.
What is a daemon?
A daemon is a process or service that runs in the background. In general, we use this word in the UNIX environment. The equivalent of a daemon in Windows is a "service" and in DOS a "TSR".
What is a job tracker?
The job tracker is a daemon that runs on the namenode for submitting and tracking MapReduce jobs in Hadoop. It assigns the tasks to the different task trackers. In a Hadoop cluster, there will be only one job tracker but many task trackers. It is the single point of failure for Hadoop and the MapReduce service. If the job tracker goes down, all the running jobs are halted. It receives heartbeats from the task trackers, based on which the job tracker decides whether an assigned task is completed or not.
What is a task tracker?
The task tracker is also a daemon; it runs on the datanodes. Task trackers manage the execution of individual tasks on the slave nodes. When a client submits a job, the job tracker will initialize the job, divide the work and assign it to different task trackers to perform MapReduce tasks. While performing this work, each task tracker simultaneously communicates with the job tracker by sending heartbeats. If the job tracker does not receive a heartbeat from a task tracker within the specified time, it will assume that the task tracker has crashed and assign that task to another task tracker in the cluster.
Is the Namenode machine the same as the datanode machine in terms of hardware?
It depends upon the cluster you are trying to create. The Hadoop VM can be on the same machine or on another
machine. For instance, in a single-node cluster there is only one machine, whereas in a development or testing
environment the Namenode and the datanodes are on different machines.
What is a heartbeat in HDFS?
A heartbeat is a signal indicating that a node is alive. A datanode sends heartbeats to the Namenode, and a task
tracker sends heartbeats to the job tracker. If the Namenode or the job tracker does not receive a heartbeat, it decides
that there is a problem with the datanode, or that the task tracker is unable to perform the assigned task.
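The timeout logic described above is simple enough to sketch. This is a hypothetical illustration, not Hadoop's actual code; the 10-minute timeout roughly matches the Hadoop 1.x default for declaring a datanode dead.

```python
# Sketch: how a master decides a worker is dead from missing heartbeats.
HEARTBEAT_TIMEOUT = 600.0  # seconds; HDFS marks a datanode dead after ~10 min


def is_node_dead(last_heartbeat: float, now: float,
                 timeout: float = HEARTBEAT_TIMEOUT) -> bool:
    """True if the node has not sent a heartbeat within the timeout window."""
    return (now - last_heartbeat) > timeout
```

The master simply records the timestamp of each node's last heartbeat and periodically applies this check to every registered node.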
Are Namenode and job tracker on the same host?
No; in a practical environment, the Namenode runs on one host and the job tracker on a separate host.
What is a ‘block’ in HDFS?
A ‘block’ is the minimum amount of data that can be read or written. In HDFS, the default block size is 64 MB, in
contrast to the 8192-byte block size typical of Unix/Linux file systems. Files in HDFS are broken down into block-sized
chunks, which are stored as independent units. HDFS blocks are large compared to disk blocks, chiefly to minimize
the cost of seeks.
If a particular file is 50 MB, will the HDFS block still consume 64 MB as the default size?
No, not at all! 64
MB is just the unit in which the data will be stored. In this particular situation, only 50 MB will be consumed by the
HDFS block and 14 MB will be free to store something else. It is the master node (Namenode) that allocates the data
efficiently.
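The arithmetic behind this answer can be sketched in Python. The 64 MB constant is the default from the Hadoop 1.x era discussed here; the helper names are illustrative.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # default HDFS block size (Hadoop 1.x)


def blocks_for(file_size: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of HDFS blocks a file occupies (the last block may be partial)."""
    return -(-file_size // block_size)  # ceiling division


def last_block_usage(file_size: int, block_size: int = BLOCK_SIZE) -> int:
    """Bytes actually consumed on disk by the final, possibly partial, block."""
    rem = file_size % block_size
    return rem if rem else block_size
```

So a 50 MB file occupies exactly one block and consumes only 50 MB of it, while a 130 MB file needs three blocks, the last of which holds 2 MB.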
What are the benefits of block transfer?
A file can be larger than any single disk in the network. There’s nothing that requires the blocks from a file to be stored
on the same disk, so they can take advantage of any of the disks in the cluster. Making the unit of abstraction a block
rather than a file simplifies the storage subsystem. Blocks provide fault tolerance and availability. To insure against
corrupted blocks and disk and machine failure, each block is replicated to a small number of physically separate
machines (typically three). If a block becomes unavailable, a copy can be read from another location in a way that is
transparent to the client.
If we want to copy 10 blocks from one machine to another, but the other machine can hold only 8.5 blocks,
can the blocks be broken at the time of replication?
In HDFS, blocks cannot be broken down. Before copying blocks from one machine to another, the master node
figures out the actual amount of space required, how many blocks are being used, and how much space is
available, and it allocates the blocks accordingly.
How is indexing done in HDFS?
Hadoop has its own way of indexing. Depending upon the block size, once the data is stored, HDFS keeps storing
the last part of the data, which indicates where the next part of the data will be. In fact, this is the basis of HDFS.
If a datanode is full, how is that identified?
When data is stored in a datanode, the metadata of that data is stored in the Namenode. So the Namenode will
identify when a datanode is full.
If datanodes increase, do we need to upgrade the Namenode?
While installing the Hadoop system, the Namenode is sized based on the size of the cluster. Most of the time, we
do not need to upgrade the Namenode because it stores only the metadata, not the actual data, so such a
requirement rarely arises.
Are job tracker and task trackers present in separate machines?
Yes, the job tracker and the task trackers are present on different machines. The reason is that the job tracker is a
single point of failure for the Hadoop MapReduce service: if it goes down, all running jobs are halted.
When we send data to a node, do we allow settling time before sending more data to that node?
Yes, we do.
Does Hadoop always require digital data to process?
Yes, Hadoop always requires the data to be in digital form.
On what basis does the Namenode decide which datanode to write to?
As the Namenode has the metadata (information) about all the datanodes, it knows which datanode is free.
Doesn’t Google have its very own version of DFS?
Yes, Google has its own DFS known as the "Google File System (GFS)", developed by Google Inc. for its own use.
Who is a ‘user’ in HDFS?
A user is like you or me, who has some query or who needs some kind of data.
Is client the end user in HDFS?
No, a client is an application that runs on your machine and is used to interact with the Namenode (job tracker) or a
datanode (task tracker).
What is the communication channel between client and namenode/datanode?
Clients communicate with the Namenode and datanodes over TCP using Hadoop's RPC protocol; SSH is used only for starting and stopping the daemons, not for client data communication.
What is a rack?
A rack is a storage area in which datanodes are put together: a physical collection of datanodes stored at a single
location. The datanodes of a cluster may be spread across several racks, and there can be multiple racks in a single
location.
On what basis data will be stored on a rack?
When the client is ready to load a file into the cluster, the content of the file is divided into blocks. The client then
consults the Namenode and gets three datanodes for every block of the file, indicating where each block should be
stored. While placing the datanodes, the key rule followed is: "for every block of data, two copies will exist in one rack,
and the third copy in a different rack". This rule is known as the "Replica Placement Policy".
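A minimal sketch of this placement rule, assuming a toy cluster map. The rack and node names here are hypothetical, and real HDFS also weighs locality and free space when choosing nodes.

```python
# Toy sketch of the default Replica Placement Policy: one replica on the
# writer's rack, the other two together on a different rack.
def place_replicas(racks: dict, local_rack: str) -> list:
    """racks maps rack name -> list of datanodes; returns 3 chosen nodes."""
    other = next(r for r in racks if r != local_rack)  # any different rack
    return [racks[local_rack][0], racks[other][0], racks[other][1]]
```

For a two-rack cluster, a block written from rack1 lands on one rack1 node and two rack2 nodes, so the loss of either whole rack still leaves at least one copy.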
Do we need to place the 2nd and 3rd replicas in rack 2 only?
Yes; keeping them in a different rack protects against datanode failure.
What if rack 2 and the datanode in rack 1 fail?
If both rack 2 and the datanode present in rack 1 fail, there is no chance of getting the data back. In order to avoid
such situations, we need to replicate the data more times instead of replicating it only thrice. This can be done by
changing the replication factor, which is set to 3 by default.
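For example, raising the factor cluster-wide is a one-property change in hdfs-site.xml; the value 4 here is purely illustrative.

```xml
<!-- hdfs-site.xml: raise the default replication factor from 3 to 4 -->
<property>
  <name>dfs.replication</name>
  <value>4</value>
</property>
```

Individual files can also be given a different replication factor at write time without changing the cluster default.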
What is a Secondary Namenode? Is it a substitute to the Namenode?
The secondary Namenode constantly reads the data from the RAM of the Namenode and writes it to the hard disk
or the file system. It is not a substitute for the Namenode, so if the Namenode fails, the entire Hadoop system goes
down.
What is the difference between Gen1 and Gen2 Hadoop with regards to the Namenode?
In Gen 1 Hadoop, the Namenode is the single point of failure. In Gen 2 Hadoop, we have an Active/Passive
Namenode structure: if the active Namenode fails, the passive Namenode takes over.
What is MapReduce?
MapReduce is the ‘heart’ of Hadoop and consists of two parts: ‘map’ and ‘reduce’. Maps and reduces are programs
for processing data. ‘Map’ processes the data first to give intermediate output, which is further processed by
‘Reduce’ to generate the final output. Thus, MapReduce allows for distributed processing of the map and reduce
operations.
Can you explain how ‘map’ and ‘reduce’ work?
The job tracker takes the input, divides it into splits, and assigns them to task trackers on the datanodes. These nodes
process the tasks assigned to them, produce key-value pairs, and return the intermediate output to the reducer. The
reducer collects the key-value pairs from all the nodes, combines them, and generates the final output.
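The two phases can be sketched with the classic word-count example. This is a self-contained, single-process illustration of the idea, not Hadoop's distributed implementation.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit an intermediate (word, 1) key-value pair for every word."""
    for line in lines:
        for word in line.split():
            yield (word, 1)


def reduce_phase(pairs):
    """Collect the intermediate pairs by key and combine their values."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)
```

In a real cluster the mapper output would be partitioned, shuffled, and sorted before many reducers each receive one slice of the keys; here both phases run in one process to show the data flow.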
What is a ‘key-value pair’ in HDFS?
A key-value pair is the intermediate data generated by maps and sent to reduces for generating the final output.
What is the difference between the MapReduce engine and the HDFS cluster?
The HDFS cluster is the name given to the whole configuration of master and slaves where data is stored. The
MapReduce engine is the programming module used to retrieve and analyze the data.
Is map like a pointer?
No, Map is not like a pointer.
Do we require two servers for the Namenode and the datanodes?
Yes, we need two different servers for the Namenode and the datanodes. This is because the Namenode requires a
highly configured system, as it stores information about the location of all the files stored in the different datanodes,
whereas the datanodes require only low-configuration systems.
Why are the number of splits equal to the number of maps?
The number of maps is equal to the number of input splits because we want the key and value pairs of all the input
splits.
Is a job split into maps?
No, a job is not split into maps. Splits are created for the file. The file is placed on datanodes in blocks. For each split, a map task is needed.
Which are the two types of ‘writes’ in HDFS?
There are two types of writes in HDFS: posted and non-posted writes. A posted write is when we write and forget
about it, without worrying about the acknowledgement; it is similar to our traditional Indian post. In a non-posted
write, we wait for the acknowledgement; it is similar to today's courier services. Naturally, a non-posted write is
more expensive than a posted write, though both writes are asynchronous.
Why is ‘reading’ done in parallel in HDFS, but ‘writing’ is not?
Reading is done in parallel because it lets us access the data faster. But we do not perform the write operation in
parallel, because parallel writes might result in data inconsistency. For example, if you have a file and two nodes
try to write data into it in parallel, the first node does not know what the second node has written and vice versa,
which makes it ambiguous which data should be stored and accessed.
Can Hadoop be compared to NOSQL database like Cassandra?
Though NoSQL is the closest technology that can be compared to Hadoop, it has its own pros and cons. There is no
DFS in NoSQL. Hadoop is not a database; it is a filesystem (HDFS) plus a distributed programming framework
(MapReduce).
FAQ’s For Hadoop Cluster
Which are the three modes in which Hadoop can be run?
The three modes in which Hadoop can be run are:
1. Standalone (local) mode
2. Pseudo-distributed mode
3. Fully distributed mode
What are the features of standalone (local) mode?
In standalone mode there are no daemons; everything runs in a single JVM. It has no DFS and uses the local file
system. Standalone mode is suitable only for running MapReduce programs during development. It is one of the
least used environments.
What are the features of Pseudo mode?
Pseudo mode is used both for development and in the QA environment. In the Pseudo mode all the daemons run on
the same machine.
Can we call VMs as pseudos?
No, VMs are not pseudos, because a VM is something different, and pseudo mode is very specific to Hadoop.
What are the features of Fully Distributed mode?
Fully Distributed mode is used in the production environment, where we have ‘n’ number of machines forming a
Hadoop cluster. Hadoop daemons run on a cluster of machines. There is one host onto which Namenode is running
and another host on which datanode is running and then there are machines on which task tracker is running. We
have separate masters and separate slaves in this distribution.
Does Hadoop follow the UNIX pattern?
Yes, Hadoop closely follows the UNIX pattern. Hadoop also has a ‘conf’ directory, as in the case of UNIX.
In which directory is Hadoop installed?
Cloudera and Apache have the same directory structure. Hadoop is installed in /usr/lib/hadoop-0.20/.
What are the port numbers of Namenode, job tracker and task tracker?
The web UI port number for the Namenode is ‘50070’, for the job tracker ‘50030’, and for the task tracker ‘50060’.
What is the Hadoop-core configuration?
Hadoop core was configured by two XML files:
1. hadoop-default.xml
2. hadoop-site.xml
These files are written in XML format and contain a set of properties, each consisting of a name and a value.
However, these files do not exist any more.
What are the Hadoop configuration files at present?
There are 3 configuration files in Hadoop:
1. core-site.xml
2. hdfs-site.xml
3. mapred-site.xml
These files are located in the conf/ subdirectory.
How do you exit the Vi editor?
To exit the Vi Editor, press ESC and type :q and then press enter.
What is a spill factor with respect to the RAM?
The spill factor is the threshold size after which data is moved to a temporary file on disk; Hadoop's temp directory is used for this.
Is fs.mapr.working.dir a single directory?
Yes, fs.mapr.working.dir is just one directory.
Which are the three main hdfs-site.xml properties?
The three main hdfs-site.xml properties are:
1. dfs.name.dir, which gives you the location where the metadata will be stored and where the DFS is located, on
disk or on a remote system.
2. dfs.data.dir, which gives you the location where the data will be stored.
3. fs.checkpoint.dir, which is for the secondary Namenode.
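The three properties listed above look like this in hdfs-site.xml; the directory paths shown are examples only, chosen per installation.

```xml
<!-- hdfs-site.xml (Hadoop 1.x property names; paths are examples) -->
<property>
  <name>dfs.name.dir</name>
  <value>/var/lib/hadoop/name</value>        <!-- namenode metadata -->
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/var/lib/hadoop/data</value>        <!-- datanode block storage -->
</property>
<property>
  <name>fs.checkpoint.dir</name>
  <value>/var/lib/hadoop/checkpoint</value>  <!-- secondary namenode -->
</property>
```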
How to come out of the insert mode?
To come out of the insert mode, press ESC, type :q (if you have not written anything) OR type :wq (if you have written
anything in the file) and then press ENTER.
What is Cloudera and why is it used?
Cloudera is a company that provides a commercial distribution of Apache Hadoop, used for data processing. On
the training VM, ‘cloudera’ is also the user created by default.
What happens if you get a ‘connection refused java exception’ when you type hadoop fsck /?
It could mean that the Namenode is not working on your VM.
We are using the Ubuntu operating system with Cloudera, but from where can we download Hadoop, or does it
come by default with Ubuntu?
Hadoop does not come with Ubuntu by default; you have to download a configured distribution from Cloudera or
from Edureka's Dropbox and then run it on your systems. You can also proceed with your own configuration, but you
need a Linux box, be it Ubuntu or Red Hat. There are installation steps at the Cloudera location and in Edureka's
Dropbox; you can go either way.
What does ‘jps’ command do?
This command checks whether your Namenode, datanode, task tracker, job tracker, etc. are running or not.
How can I restart Namenode?
1. Run stop-all.sh and then start-all.sh, OR
2. Write sudo hdfs (press Enter), su - hdfs (press Enter), /etc/init.d/ha (press Enter) and then
/etc/init.d/hadoop-0.20-namenode start (press Enter).
What is the full form of fsck?
Full form of fsck is File System Check.
How can we check whether Namenode is working or not?
To check whether the Namenode is working or not, use the command /etc/init.d/hadoop-0.20-namenode status, or,
more simply, jps.
What does the mapred.job.tracker property do?
The mapred.job.tracker property specifies which of your nodes acts as the job tracker.
What does /etc /init.d do?
/etc/init.d specifies where daemons (services) are placed, and lets you see the status of these daemons. It is very
Linux-specific and has nothing to do with Hadoop.
How can we look for the Namenode in the browser?
To look at the Namenode in the browser, you do not use localhost:8021; the port number for the Namenode web UI
is 50070.
How to change from SU to Cloudera?
To change from the su (root) shell back to the cloudera user, just type exit.
Which files are used by the startup and shutdown commands?
The slaves and masters files are used by the startup and shutdown commands.
What do slaves consist of?
Slaves consist of a list of hosts, one per line, that host datanode and task tracker servers.
What do masters consist of?
Masters contain a list of hosts, one per line, that are to host secondary namenode servers.
What does hadoop-env.sh do?
hadoop-env.sh provides the environment for Hadoop to run. JAVA_HOME is set over here.
Can we have multiple entries in the master files?
Yes, we can have multiple entries in the Master files.
Where is hadoop-env.sh file present?
hadoop-env.sh file is present in the conf location.
In HADOOP_PID_DIR, what does PID stand for?
PID stands for ‘Process ID’.
What does /var/hadoop/pids do?
It stores the PID.
What does hadoop-metrics.properties file do?
hadoop-metrics.properties is used for ‘reporting’ purposes. It controls the reporting for Hadoop. The default setting
is ‘not to report’.
What are the network requirements for Hadoop?
The Hadoop core uses the secure shell (SSH) to launch the server processes on the slave nodes. It requires a
password-less SSH connection between the master and all the slaves and the secondary machines.
Why do we need a password-less SSH in Fully Distributed environment?
We need password-less SSH in a fully distributed environment because when the cluster is live and running in a
fully distributed environment, the communication is too frequent. The job tracker should be able to send a task to a
task tracker quickly.
Does this lead to security issues?
No, not at all. A Hadoop cluster is an isolated cluster, and generally it has nothing to do with the internet. It has a
different kind of configuration, so we needn't worry about that kind of security breach, for instance, someone
hacking in through the internet. Hadoop has a very secure way of connecting to other machines to fetch and
process data.
On which port does SSH work?
SSH works on Port No. 22, though it can be configured. 22 is the default Port number.
Can you tell us more about SSH?
SSH is nothing but a secure shell communication; it is a protocol that works on port 22, and when you do an SSH,
what you really require is a password.
Why is a password needed in SSH localhost?
A password is required in SSH for security, and in situations where password-less communication is not set up.
Do we need to give a password even if the key is added in SSH?
Yes, a password (the key's passphrase) may still be required even if the key is added in SSH.
What if a Namenode has no data?
If a Namenode has no data, it is not a Namenode. Practically, a Namenode will have some data.
What happens to job tracker when Namenode is down?
When the Namenode is down, your cluster is off, because the Namenode is the single point of failure in HDFS.
What happens to a Namenode, when job tracker is down?
When the job tracker is down, it will not be functional, but the Namenode will still be present. So the cluster is
accessible as long as the Namenode is working, even if the job tracker is not.
Can you give us some more details about SSH communication between the masters and the slaves?
SSH is a password-less secure communication where data packets are sent across to the slaves. It has some
format into which the data is sent. SSH is not only between masters and slaves but also between any two hosts.
What is formatting of the DFS?
Just as we format a disk for Windows, the DFS is formatted for proper structuring. It is not usually done, as it
formats the Namenode too.
Does the HDFS client decide the input split or Namenode?
No, the client does not decide. The input split is already specified in one of the configuration settings.
In Cloudera there is already a cluster, but if I want to form a cluster on Ubuntu can we do it?
Yes, you can go ahead with this! There are installation steps for creating a new cluster. You can uninstall your
present cluster and install the new cluster.
Can we create a Hadoop cluster from scratch?
Yes we can do that also once we are familiar with the Hadoop environment.
Can we use Windows for Hadoop?
Actually, Red Hat Linux or Ubuntu are the best Operating Systems for Hadoop. Windows is not used frequently for
installing Hadoop as there are many support problems attached with Windows. Thus, Windows is not a preferred
environment for Hadoop.
FAQ’s For Hadoop MapReduce
What is MapReduce?
It is a framework or a programming model that is used for processing large data sets over clusters of computers using
distributed programming.
What are ‘maps’ and ‘reduces’?
‘Maps’ and ‘reduces’ are two phases of solving a query in HDFS. ‘Map’ is responsible for reading data from the input
location and, based on the input type, generating key-value pairs, that is, intermediate output on the local machine.
‘Reduce’ is responsible for processing the intermediate output received from the mapper and generating the final output.
What are the four basic parameters of a mapper?
The four basic parameters of a mapper are LongWritable, Text, Text, and IntWritable. The first two represent the
input parameters and the second two the intermediate output parameters.
What are the four basic parameters of a reducer?
The four basic parameters of a reducer are Text, IntWritable, Text, and IntWritable. The first two represent the
intermediate output parameters and the second two the final output parameters.
What do the master class and the output class do?
Master is defined to update the Master or the job tracker and the output class is defined to write data onto the output
location.
What is the input type/format in MapReduce by default?
By default, the input type in MapReduce is ‘text’.
Is it mandatory to set input and output type/format in MapReduce?
No, it is not mandatory to set the input and output type/format in MapReduce. By default, the cluster takes the input
and the output type as ‘text’.
What does the text input format do?
In the text input format, each line creates a line object, keyed by its byte offset in the file. The key is the line's offset
and the value is the content of the whole line. This is how the data gets processed by the mapper: the mapper
receives the ‘key’ as a ‘LongWritable’ parameter and the value as a ‘Text’ parameter.
What does job conf class do?
MapReduce needs to logically separate the different jobs running on the same cluster. The JobConf class helps to
do job-level settings, such as declaring a job in the real environment. It is recommended that the job name be
descriptive and represent the type of job being executed.
What does conf.setMapperClass do?
conf.setMapperClass sets the mapper class and everything related to the map job, such as reading the data and
generating key-value pairs out of the mapper.
What do sorting and shuffling do?
Sorting and shuffling are responsible for creating a unique key and a list of values. Bringing similar keys together
at one location is known as sorting, and the process by which the intermediate output of the mapper is sorted and
sent across to the reducers is known as shuffling.
What does a split do?
Before transferring the data from the hard disk location to the map method, there is a phase or method called the
‘split method’. The split method pulls a block of data from HDFS into the framework. The Split class does not write
anything; it reads data from the block and passes it to the mapper. By default, the split is taken care of by the
framework. The split size is equal to the block size and is used to divide a block into a bunch of splits.
How can we change the split size if our commodity hardware has less storage space?
If our commodity hardware has less storage space, we can change the split size by writing a ‘custom splitter’.
There is a customization feature in Hadoop which can be called from the main method.
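The sort-and-shuffle step described a few answers above can be sketched as an in-memory operation: sort the intermediate pairs by key, then hand each reducer one key with all of its values. This is an illustrative single-process model, not Hadoop's distributed shuffle.

```python
from itertools import groupby
from operator import itemgetter


def shuffle_and_sort(pairs):
    """Sort (key, value) pairs by key and group values per unique key."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [(key, [v for _, v in group])
            for key, group in groupby(ordered, key=itemgetter(0))]
```

Each `(key, [values])` tuple is exactly what one reduce call receives.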
What does a MapReduce partitioner do?
A MapReduce partitioner makes sure that all the values of a single key go to the same reducer, thus allowing even
distribution of the map output over the reducers. It redirects the mapper output to the reducer by determining which
reducer is responsible for a particular key.
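The default behavior is hash partitioning, which can be sketched in a couple of lines. This mirrors the idea of Hadoop's HashPartitioner (mask the hash to keep it non-negative, then take it modulo the reducer count); the function name is illustrative.

```python
def partition(key: str, num_reducers: int) -> int:
    """Deterministically map a key to one of num_reducers reducers."""
    return (hash(key) & 0x7FFFFFFF) % num_reducers
```

Because the function is deterministic, every occurrence of the same key, from every mapper, lands on the same reducer.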
How is Hadoop different from other data processing tools?
In Hadoop, based upon your requirements, you can increase or decrease the number of mappers without worrying
about the volume of data to be processed. This is the beauty of parallel processing, in contrast to the other data
processing tools available.
Can we rename the output file?
Yes, we can rename the output file by implementing a multiple-format output class.
Why can we not do aggregation (addition) in a mapper? Why do we need a reducer for that?
We cannot do aggregation (addition) in a mapper because sorting does not happen in the mapper; sorting happens
only on the reducer side. Mapper initialization depends on each input split: a new mapper gets initialized for each
split, so while doing aggregation we would lose the values from previous instances and have no track of earlier rows.
What is Streaming?
Streaming is a feature of the Hadoop framework that allows us to program MapReduce in any programming
language which can accept standard input and produce standard output; it could be Perl, Python, or Ruby, and need
not be Java. However, customization of MapReduce internals can only be done using Java and not any other
programming language.
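The contract Streaming imposes is simply "lines in on stdin, tab-separated key/value lines out on stdout". A minimal word-count mapper sketch under that contract (in a real job it would be wired to sys.stdin and sys.stdout; the stream parameters here make it testable):

```python
def streaming_mapper(instream, outstream):
    """Emit 'word<TAB>1' for every word read from the input stream."""
    for line in instream:
        for word in line.split():
            outstream.write(f"{word}\t1\n")
```

Hadoop Streaming then sorts these lines by key before feeding them to a reducer script written under the same stdin/stdout contract.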
What is a Combiner?
A ‘combiner’ is a mini-reducer that performs the local reduce task. It receives the input from the mapper on a
particular node and sends the output to the reducer. Combiners help enhance the efficiency of MapReduce by
reducing the quantum of data that must be sent to the reducers.
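The local-reduce idea can be sketched as follows: one mapper's repeated pairs are pre-aggregated before anything crosses the network. The function name is illustrative.

```python
from collections import Counter


def combine(pairs):
    """Pre-aggregate (word, count) pairs emitted by a single mapper."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())
```

Three emitted pairs shrink to two here; on real word-count workloads the reduction in shuffled data is often dramatic.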
What is the difference between an HDFS Block and Input Split?
HDFS Block is the physical division of the data and Input Split is the logical division of the data.
What happens in TextInputFormat?
In TextInputFormat, each line in the text file is a record. The key is the byte offset of the line and the value is the
content of the line. For instance, key: LongWritable, value: Text.
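The byte-offset keying can be made concrete with a small sketch (a simplified model of the record reader; it ignores multi-byte line endings and split boundaries):

```python
def text_input_records(data: bytes):
    """Turn raw bytes into (byte offset, line text) records."""
    records, offset = [], 0
    for line in data.split(b"\n"):
        if line:
            records.append((offset, line.decode()))
        offset += len(line) + 1  # +1 for the newline byte
    return records
```

So the second line of `b"big\ndata\n"` is keyed by offset 4, because "big" plus its newline occupies bytes 0 through 3.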
What do you know about KeyValueTextInputFormat?
In KeyValueTextInputFormat, each line in the text file is a ‘record’. The first separator character divides each line:
everything before the separator is the key, and everything after it is the value. For instance, key: Text,
value: Text.
What do you know about SequenceFileInputFormat?
SequenceFileInputFormat is an input format for reading sequence files. Key and value are user-defined. It is a
specific compressed binary file format optimized for passing data between the output of one MapReduce job and
the input of another MapReduce job.
What do you know about NLineInputFormat?
NLineInputFormat splits ‘n’ lines of input as one split.
FAQ’s For Hadoop PIG
Can you give us some examples of how Hadoop is used in a real-time environment?
Let us assume that we have an exam consisting of 10 multiple-choice questions and 20 students appear for that
exam. Every student will attempt each question. For each question and each answer option, a key will be generated.
So we have a set of key-value pairs for all the questions and all the answer options for every student. Based on the
options that the students have selected, you have to analyze and find out how many students have answered correctly.
This isn't an easy task. Here Hadoop comes into the picture! Hadoop helps you solve these problems quickly and
without much effort. You may also take the case of how many students have wrongly attempted a particular question.
What is BloomMapFile used for?
The BloomMapFile is a class that extends MapFile, so its functionality is similar to MapFile. BloomMapFile uses
dynamic Bloom filters to provide a quick membership test for the keys. It is used in the HBase table format.
What is PIG?
Pig is a platform for analyzing large data sets. It consists of a high-level language for expressing data analysis
programs, coupled with infrastructure for evaluating these programs. Pig's infrastructure layer consists of a compiler
that produces sequences of MapReduce programs.
What is the difference between logical and physical plans?
Pig undergoes some steps when a Pig Latin script is converted into MapReduce jobs. After performing the basic
parsing and semantic checking, it produces a logical plan. The logical plan describes the logical operators that have
to be executed by Pig during execution. After this, Pig produces a physical plan. The physical plan describes the
physical operators needed to execute the script.
Does ‘ILLUSTRATE’ run an MR job?
No, ILLUSTRATE will not launch any MR job; it pulls the internal data. On the console, ILLUSTRATE will not do
any job; it just shows the output of each stage, not the final output.
Is the keyword ‘DEFINE’ like a function name?
Yes, the keyword ‘DEFINE’ is like a function name. Once you have registered a jar, you have to define it. Whatever
logic you have written in a Java program, you have an exported jar which you register. The compiler will then check
for the function in the exported jar; when the function is not present in the library, it looks into your jar.
Is the keyword ‘FUNCTIONAL’ a User Defined Function (UDF)?
No, the keyword ‘FUNCTIONAL’ is not a user-defined function (UDF). When using a UDF, we have to override some
functions; certainly you have to do your job with the help of these functions only. But the keyword ‘FUNCTIONAL’ is
a built-in, pre-defined function, therefore it does not work as a UDF.
Why do we need MapReduce during Pig programming?
Pig is a high-level platform that makes many Hadoop data analysis issues easier to execute. The language we use
for this platform is: Pig Latin. A program written in Pig Latin is like a query written in SQL, where we need an execution
engine to execute the query. So, when a program is written in Pig Latin, Pig compiler will convert the program into
MapReduce jobs. Here, MapReduce acts as the execution engine.
Are there any problems which can only be solved by MapReduce and cannot be solved by Pig? In which kind of
scenarios will MR jobs be more useful than Pig?
Let us take a scenario where we want to count the population in two cities. I have a data set and a sensor list of
different cities, and I want to count the population using one MapReduce job for two cities. Let us assume that one
is Bangalore and the other is Noida. I need to make the key for Bangalore behave like the key for Noida, so that I
can bring the population data of these two cities to one reducer. The idea is that I have to instruct the MapReduce
program that whenever it finds a city named ‘Bangalore’ or a city named ‘Noida’, it creates an alias name which will
be the common name for these two cities, so that a common key is created for both cities and passed to the same
reducer. For this, we have to write a custom partitioner.
In MapReduce, when you create a ‘key’ for a city, you have to consider ‘city’ as the key. So, whenever the framework
comes across a different city, it considers it a different key; hence we need a customized partitioner. There is a
provision in MapReduce where you can write your custom partitioner and specify that if city = Bangalore or Noida,
then pass the same hashcode. However, we cannot create a custom partitioner in Pig: as Pig is not a framework,
we cannot direct the execution engine to customize the partitioner. In such scenarios, MapReduce works better
than Pig.
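The custom partitioner for this scenario can be sketched as follows. The alias table and function names are hypothetical, invented for this example; the point is only that both city keys map to the same partition number.

```python
# Hypothetical custom partitioner: alias the two city keys so that both
# always land on the same reducer.
ALIASES = {"Bangalore": "city-group", "Noida": "city-group"}


def city_partition(key: str, num_reducers: int) -> int:
    """Partition by the alias of the key, falling back to the key itself."""
    canonical = ALIASES.get(key, key)
    return (hash(canonical) & 0x7FFFFFFF) % num_reducers
```

In real Hadoop this logic would live in a Java class extending Partitioner, set on the job configuration.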
Does Pig give any warning when there is a type mismatch or a missing field?
No, Pig will not show any warning if there is a missing field or a mismatch, and if you assume that Pig gives such a
warning, it is difficult to find in the log file. If any mismatch is found, Pig assumes a null value.
What does COGROUP do in Pig?
COGROUP groups each input data set by the specified common field and returns a set of records containing two
separate bags: the first bag consists of the records of the first data set that share the common field value, and the
second bag consists of the matching records of the second data set.
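A small Pig Latin sketch of this (the relations and field names are hypothetical):

```pig
-- Hypothetical relations: owners(owner, pet) and pets(pet, name)
owners = LOAD 'owners' AS (owner:chararray, pet:chararray);
pets   = LOAD 'pets'   AS (pet:chararray, name:chararray);

-- Group both relations on the common field 'pet'
grouped = COGROUP owners BY pet, pets BY pet;

-- Each output record has the shape:
--   (pet, {bag of matching owner tuples}, {bag of matching pet tuples})
DUMP grouped;
```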
Can we say COGROUP is a group of more than one data set?
COGROUP can group a single data set, but when given more than one data set, it groups all of them
and joins them on the common field. Hence, we can say that COGROUP is both a group of more than one data set and a
join of those data sets.
What does FOREACH do?
FOREACH is used to apply transformations to the data and to generate new data items. The name itself indicates
that the respective action will be performed for each element of a data bag.
Syntax: FOREACH bagname GENERATE expression1, expression2, …
The meaning of this statement is that the expressions mentioned after GENERATE will be applied to the current record of the data bag.
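A concrete Pig Latin example of the syntax above (the relation and fields are hypothetical):

```pig
-- Hypothetical relation of employee records
emp = LOAD 'employees' AS (name:chararray, salary:double);

-- Apply an expression to every record: keep the name, raise salary by 10%
raised = FOREACH emp GENERATE name, salary * 1.10;
```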
What is a bag?
A bag is one of the data models in Pig. It is an unordered collection of tuples, with possible duplicates. Bags
are used to store collections while grouping. A bag can grow up to the size of the local disk, which means its size is
limited. When a bag does not fit in memory, Pig spills the bag to the local disk and keeps only part of it
in memory; there is no requirement that the complete bag fit into memory. We represent bags with "{}".
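For instance, grouping produces one bag of tuples per key; the inner braces show the bag notation (the data values are hypothetical):

```pig
-- Hypothetical input relation
emp = LOAD 'employees' AS (name:chararray, dept:chararray);
grouped = GROUP emp BY dept;

-- Each record of 'grouped' now holds a bag, written with {}:
--   (sales, {(anita,sales),(ravi,sales)})
```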
Real BIG DATA Use Cases
Big Data Exploration
Big Data exploration addresses the challenges large organizations face when information is stored in different systems
and must be accessed to complete day-to-day tasks. Big Data exploration allows you to analyse data and gain valuable insights from it.
Enhanced 360º Customer Views
Enhancing existing customer views helps gain a complete understanding of customers, addressing questions like
why they buy, how they prefer to shop, why they switch, what they will buy next, and what makes them
recommend a company to others.
Security/Intelligence Extension
Enhancing cyber security and intelligence analysis platforms with Big Data technologies to process and analyze new
data types from social media, emails, sensors and telco networks, to reduce risk, detect fraud and monitor cyber security in real time,
significantly improving intelligence, security and law enforcement insights.
Operations Analysis
Operations analysis is about using Big Data technologies to enable a new generation of applications that analyze
large volumes of multi-structured data, such as machine and operational data, to improve business. This data can come from
anything from IT machines to sensors, meters and GPS devices, and requires complex analysis and correlation across
different types of data sets.
Data Warehouse Modernization
Big Data needs to be integrated with data warehouse capabilities to increase operational efficiency. Getting rid of
rarely accessed or old data from warehouse and application databases can be done using information integration
software and tools.
Companies and their Big Data Applications:
Guangdong Mobiles:
A popular mobile group in China, Guangdong uses Hadoop to remove data-access bottlenecks and uncover
customer usage patterns for precise, targeted market promotions, and Hadoop HBase to automatically split
data tables across nodes to expand data storage.
Red Sox:
The World Series champs come across huge volumes of structured and unstructured data related to the game, such as
the weather, the opposing team and pre-game promotions. Big Data allows them to forecast the
game and decide how to allocate resources based on expected variations in the upcoming game.
Nokia:
Big Data has helped Nokia make effective use of their data to understand and improve users’ experience with their
products. The company leverages data processing and complex analyses to build maps with predictive traffic and
layered elevation models. Nokia uses Cloudera’s Hadoop platform and Hadoop components like HBase, HDFS,
Sqoop and Scribe for the above application.
Huawei:
The Huawei OceanStor N8000 Hadoop Big Data solution is built on an advanced clustered architecture and
enterprise-level storage capability, integrated with the Hadoop computing framework. This innovative combination
helps enterprises get real-time analysis and processing results from exhaustive data computation, improves decision-making and efficiency, makes management easier and reduces the cost of networking.
SAS:
SAS has combined with Hadoop to help data scientists transform Big Data into bigger insights. As a result, SAS has
come up with an environment that provides a visual and interactive experience, making it easier to gain insights and
explore new trends. Potent analytical algorithms extract valuable insights from the data, while in-memory
technology allows faster access to it.
CERN:
Big Data plays a vital part at CERN, home of the Large Hadron Collider, which collects an unbelievable amount of
data: 40 million pictures per second from its 100-megapixel cameras, giving out 1 petabyte of data per
second. The data from these cameras needs to be analysed. The lab is experimenting with ways to place more data
from its experiments in both relational databases and data stores based on NoSQL technologies, such as Hadoop
and Dynamo on Amazon's S3 cloud storage service.
Buzzdata:
Buzzdata is working on a Big Data project where it needs to combine all the sources and integrate them in a safe
location. This creates a great place for journalists to connect and normalize public data.
Department of Defense:
The Department of Defense (DoD) has invested approximately $250 million in harnessing and utilizing colossal
amounts of data to come up with a system that can control and make autonomous decisions and assist
analysts in supporting operations. The department plans to increase its analytical abilities a hundredfold,
to extract information from texts in any language, with an equivalent increase in the number of objects,
activities and events that analysts can analyze.
Defence Advanced Research Projects Agency (DARPA):
DARPA intends to invest approximately $25 million to improve computational techniques and software tools for
analyzing large amounts of semi-structured and unstructured data.
National Institutes of Health:
With 200 terabytes of data, the 1000 Genomes Project is all set to be a prime example of Big Data. The
datasets are so massive that very few researchers have the computational power to analyse them.
Big Data Application Examples in Different Industries:
Retail/Consumer:
1. Market basket analysis and pricing optimization
2. Merchandizing and market analysis
3. Supply-chain management and analytics
4. Behavior-based targeting
5. Market and consumer segmentation
Finance & Fraud Services:
1. Customer segmentation
2. Compliance and regulatory reporting
3. Risk analysis and management
4. Fraud detection and security analytics
5. Medical insurance fraud
6. CRM
7. Credit risk, scoring and analysis
8. Trade surveillance and abnormal trading pattern analysis
Health & Life Sciences:
1. Clinical trials data analysis
2. Disease pattern analysis
3. Patient care quality analysis
4. Drug development analysis
Telecommunications:
1. Price optimization
2. Customer churn prevention
3. Call detail record (CDR) analysis
4. Network performance and optimization
5. Mobile user location analysis
Enterprise Data Warehouse:
1. Enhance EDW by offloading processing and storage
2. Pre-processing hub before getting to EDW
Gaming:
1. Behavioral analytics
High Tech:
1. Optimize Funnel Conversion
2. Predictive Support
3. Predict Security Threats
4. Device Analytics
Facebook today is a worldwide phenomenon that has caught on with young and old alike. Launched in 2004 by a
bunch of Harvard University students, it was least expected to be such a rage. In a span of just a decade, how did it
manage this giant leap?
With around 1.23 billion users and counting, Facebook definitely has an upper hand over other social media websites.
What is the reason behind this success? This blog is an attempt to answer some of these queries.
It is quite evident that the existence of a durable storage system and high technological expertise has made it possible
to support various kinds of user data, such as messages, applications and personal information, without which all of
it would have come to a grinding halt. So what does a website do when its user count exceeds the number of cars
in the world? How does it manage such massive data?
Data Centre: The Crux of Facebook
Facebook's data center spreads cutting-edge servers and huge memory banks across an area of 300,000 sq ft, with
data running over 23 million ft of fiber-optic cables. The systems are designed to move data at the speed of light,
making sure that once a user logs into his profile, everything works faster. Drawing 30 MW of electricity, they have to
make sure they are never out of power. The warehouse stores up to 300 PB of Hive data, with an incoming daily rate of
600 TB.
An ordinary computer is cooled by a heat sink no bigger than a matchbox, but for Facebook's computers the picture is
evidently bigger. Spread over a huge field, there are cooling systems and fans that help balance the temperature of
these systems. As the count increases, trucks of storage systems keep pouring in on a daily basis, and employees are
now losing count of them.
Hadoop & Cassandra: The Technology Wizards
The use of big data has evolved, and Big Data is crucial to Facebook's existence. A platform this big requires
a number of technologies that enable it to solve problems and store massive data. Hadoop, a highly scalable
open-source framework that uses clusters of low-cost servers to solve problems, is one of the many Big Data
technologies employed at Facebook, but on its own it is insufficient for a company that is growing every minute of the day.
Another technology used and preferred is Cassandra.
Apache Cassandra was initially developed at Facebook, by Avinash Lakshman (a co-author of Amazon's Dynamo) and
Prashant Malik, to power the Inbox Search feature. It is an
open-source distributed database management system designed to handle large amounts of data across many
commodity servers, providing high availability with no single point of failure.
Cassandra offers robust support for clusters spanning multiple data centers and aims to run on top of
an infrastructure of hundreds of nodes. Failures do happen at some point, but the manner in which Cassandra
manages them makes it possible to rely on the service.
Facebook, along with other social media websites, avoids using MySQL due to the complexity of getting good
results. Cassandra has overpowered the rest and has proved its capability in terms of getting quick results. Facebook
originally developed Cassandra to solve its inbox-search problem and to be fast and reliable at
handling read and write requests at the same time. Facebook is a platform that instantly helps you connect
to people far and near, and for this it requires a system that performs and matches the brand.
WHAT IS HADOOP
So, what exactly is Hadoop? It is truly said that 'Necessity is the mother of invention', and Hadoop is among the finest inventions in the world
of Big Data! Hadoop had to be developed sooner or later, as there was an acute need for a framework that could handle
and process Big Data efficiently.
Technically speaking, Hadoop is an open-source software framework that supports data-intensive distributed
applications. Hadoop is licensed under the Apache v2 license and is therefore generally known as Apache Hadoop.
Hadoop was developed based on a paper originally written by Google on the MapReduce system, and it applies
concepts of functional programming. Hadoop is written in the Java programming language and is among the highest-level
Apache projects, built and used by a global community of contributors. Hadoop was developed by Doug
Cutting and Michael J. Cafarella. And the charming yellow elephant you see is named after Doug's son's toy
elephant!
Hadoop Ecosystem:
Once you are familiar with what Hadoop is, let's probe into its ecosystem. The Hadoop ecosystem is nothing but the various
components that make Hadoop so powerful, among which HDFS and MapReduce are the core components!
1. HDFS:
The Hadoop Distributed File System (HDFS) is a very robust part of Apache Hadoop. HDFS is designed to store
gigantic amounts of data reliably, to transfer data at high speed among nodes, and to let the
system continue working smoothly even if some nodes fail, by handling the placement and replication of data
across the cluster. In fact, HDFS manages around 40 petabytes of data at Yahoo! The key components of HDFS are the NameNode, the DataNodes and the Secondary
NameNode.
2. MapReduce:
It all started with Google applying the concept of functional programming to solve the problem of managing large
amounts of data on the internet. Google named it the 'MapReduce' system and described it in a paper
published in 2004. The aim of MapReduce was to help Google search and index the web's huge quantity of pages in a matter of a few
seconds, or even a fraction of a second. With the ever-increasing amount of data generated on the web, Yahoo
stepped in to develop Hadoop in order to implement the MapReduce technique. The key components of MapReduce are the JobTracker, the TaskTrackers and the
JobHistoryServer.
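The map/reduce idea can be illustrated with the canonical word-count example, sketched here in plain Java. This runs both phases in-process to show the data flow only; a real Hadoop job would implement Mapper and Reducer classes and run distributed across a cluster.

```java
import java.util.*;

public class WordCountSketch {

    // "Map" phase: emit a (word, 1) pair for every word in every line.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) {
                    pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
                }
            }
        }
        return pairs;
    }

    // "Reduce" phase: sum the counts per word (the shuffle that groups
    // pairs by key is implicit in the map's merge here).
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("big data", "big hadoop");
        System.out.println(reduce(map(lines))); // {big=2, data=1, hadoop=1}
    }
}
```

In the real framework the pairs emitted by the mappers are partitioned, sorted and shipped to reducers over the network; this sketch collapses that into a single in-memory merge.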
3. Apache Pig:
Apache Pig is another component of Hadoop, used to analyze huge data sets with a high-level
language. In fact, Pig was initiated with the idea of creating and executing commands on Big Data sets. The basic
attribute of Pig programs is 'parallelization', which helps them manage large data sets. Apache Pig consists of a
compiler that generates a series of MapReduce programs and a 'Pig Latin' language layer that lets SQL-like
queries run on distributed data in Hadoop.
http://cdn.edureka.co/blog/wp-content/uploads/2013/03/Hadoop-ecosystem-1.png
4. Apache Hive:
As the name suggests, Hive is Hadoop's data warehouse system. It enables quick data summarization for Hadoop,
handles queries and analyzes huge data sets located in Hadoop's file systems, and maintains full support
for MapReduce. Another striking feature of Apache Hive is that it provides indexes, such as bitmap indexes, to
speed up queries. Apache Hive was originally developed by Facebook, but it is now developed and used by other
companies too, including Netflix.
5. Apache HCatalog
Apache HCatalog is another important component of Apache Hadoop, providing a table and storage
management service for data created with Apache Hadoop. HCatalog offers a shared schema
and data-type mechanism, a table abstraction for users, and smooth interoperation with other components of Hadoop
such as Pig, MapReduce, Streaming and Hive.
6. Apache HBase
HBase stands for Hadoop Database. HBase is a distributed, column-oriented database that uses HDFS for
storage. On one hand it manages batch-style computations using MapReduce, and on the other it
handles point queries (random reads). The key components of Apache HBase are the HBase Master and the
RegionServers.
7. Apache ZooKeeper
Apache ZooKeeper is another significant part of the Hadoop ecosystem. Its major function is to keep a record of
configuration information and to provide naming, distributed synchronization and group services, all of which are
immensely crucial for various distributed systems. In fact, HBase depends upon ZooKeeper for its functioning.
WHY HADOOP
Hadoop can be contagious: its implementation in one organization can lead to another elsewhere. Thanks to
Hadoop being robust and cost-effective, handling humongous data seems much easier now. The ability to include
Hive in an EMR workflow is yet another strong point: it is incredibly easy to boot up a cluster, install Hive, and be
doing simple SQL analytics in no time. Let's take a look at why Hadoop can be so incredible.
Key features that answer 'Why Hadoop?'
1. Flexible:
As it is a known fact that only 20% of data in organizations is structured and the rest is all unstructured, it is very
crucial to manage the unstructured data that otherwise goes unattended.