Apache Hadoop is an open-source software framework for distributed storage and processing of very large data sets on clusters of commodity hardware. Currently, virtually every Big Data solution, open source or commercial (from IBM BigInsights to Cloudera or Oracle), is based on this framework.

Hadoop was designed around one fundamental assumption: hardware failures are common and will happen, and when they do, the framework should handle them automatically, without the developer or the user having to be aware of them or take any action.

The framework also takes into account the fact that data is stored in many places, and that it makes more sense to bring the processing to the data rather than centralizing the data (which may not even be possible if the data set is huge) and working on the whole set. In this way, data is processed faster and more efficiently because:

  • on each node, only a small chunk of the data is processed, not the whole set;
  • each node has fast access to its data, because the data is local to the processing unit;
  • there is no need to transfer data over the network to centralize it; only the results of the processing are transferred.

The base Apache Hadoop framework is composed of the following four modules:

  • Hadoop Common - contains libraries and utilities needed by other Hadoop modules;
  • Hadoop Distributed File System (HDFS) - a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster;
  • Hadoop YARN - a resource-management platform responsible for managing computing resources in clusters and using them for scheduling of users' applications;
  • Hadoop MapReduce - an implementation of the MapReduce programming model for large-scale data processing (a minimal example is sketched below).
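
To make the MapReduce model more concrete, here is a minimal word-count sketch written against the standard Hadoop MapReduce Java API; the class name and the input/output paths passed on the command line are just placeholders. Each map task runs on a node that holds a block of the input, so only the small (word, count) pairs travel over the network to the reducers, which is exactly the data-locality principle described above.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // The map task runs on a DataNode holding the input block, so only
      // small (word, 1) pairs leave the node, not the raw data.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // The reduce task receives all counts for a given word and sums them.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // local aggregation before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory (placeholder)
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory (placeholder)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }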

On top of these modules, many other projects have been built to make the lives of developers and data scientists much easier: Apache Pig, Apache Hive, Apache HBase, Apache Spark, Apache ZooKeeper, Cloudera Impala, Apache Flume, Apache Sqoop, Apache Oozie, Apache Storm. All of these together compose what is called the Hadoop platform.

The Hadoop distributed file system (HDFS) is a distributed, scalable, and portable file-system written in Java for the Hadoop framework. A Hadoop cluster nominally has a single NameNode plus a cluster of DataNodes. Each DataNode serves up blocks of data over the network using a block protocol specific to HDFS. The basic HDFS architecture is represented in the following image.




The NameNode keeps all the metadata about where the data is stored: the location of the data files, how the files are split across DataNodes, etc. HDFS stores large files (typically in the range of gigabytes to terabytes) across multiple machines, called DataNodes. The files are split into blocks (usually 64 MB or 128 MB) and stored on a series of DataNodes (which DataNodes are used for each file is managed by the NameNode). The blocks of data are also replicated (usually three times) on other DataNodes so that, in case of hardware failures, clients can find the data on another server. The information about the location of the data blocks and the replicas is also managed by the NameNode. DataNodes can talk to each other to rebalance data, to move copies around, and to keep the replication of data high.
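
As a rough illustration of the metadata role of the NameNode, the following sketch uses the HDFS Java client API (org.apache.hadoop.fs.FileSystem) to ask for a file's block size, replication factor, and the DataNodes hosting each block. The file path is a hypothetical example, and the NameNode address is assumed to come from the standard core-site.xml configuration on the classpath.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockInfo {
      public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml, so the client knows
        // which NameNode to contact for metadata.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/data/example.txt");  // hypothetical file
        FileStatus status = fs.getFileStatus(file);

        System.out.println("Block size:  " + status.getBlockSize());
        System.out.println("Replication: " + status.getReplication());

        // The NameNode answers with the list of blocks and, for each block,
        // the DataNodes that hold a replica; the actual bytes are then read
        // directly from those DataNodes.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
          System.out.println("Block at offset " + block.getOffset()
              + ", hosts: " + String.join(", ", block.getHosts()));
        }

        fs.close();
      }
    }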

The HDFS file system also includes a secondary NameNode, a misleading name, as it might be interpreted as a backup for the NameNode. In fact, this secondary NameNode regularly connects to the primary NameNode and builds snapshots of all the directory information managed by the latter, which the system then saves to local or remote directories. These snapshots can later be used to restart a failed primary NameNode without having to replay the entire journal of file-system actions and then edit the log to create an up-to-date directory structure.

One of the issues of the HDFS architecture is the fact that the NameNode is the single point for the storage and management of metadata, so it can become a bottleneck when dealing with a huge number of files, especially a large number of small files. HDFS Federation, a newer addition, aims to tackle this problem to a certain extent by allowing multiple namespaces served by separate NameNodes. Other known issues in HDFS are the small-file problem, scalability limits, the Single Point of Failure (SPoF), and the bottleneck caused by huge volumes of metadata requests.