Breaking Down Apache’s Hadoop Distributed File System
Apache Hadoop is a framework for distributed storage and processing of big data. It has two main components: MapReduce, which processes large quantities of data in parallel, and HDFS, the Hadoop Distributed File System, which stores that data.
Let’s examine HDFS. You might expect that a storage framework holding such large quantities of data would require special, state-of-the-art infrastructure to keep the file system from failing, but quite the contrary is true.
Distributed file system
As the name says, HDFS is a distributed file system that spans a multi-node cluster. Data is replicated for durability and high availability.
Durability means that data is not lost: even if one replica is lost to a machine failure, the other replicas are still available. High availability means that when multiple users access the same data, perhaps from different nodes or machines in different regions, the data remains accessible even if some of the replicated copies become unavailable.
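To make this concrete, here is a minimal sketch of writing a file to HDFS and reading it back with the Java client API. The NameNode URI (hdfs://namenode:8020) and the /user/demo path are placeholders for illustration, not part of any real cluster.

// Minimal sketch: write a file to HDFS and read it back.
// The cluster address and path below are hypothetical placeholders.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;
import java.nio.charset.StandardCharsets;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Connect to the cluster; the file appears in one namespace even
        // though its blocks are replicated across many DataNodes.
        try (FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf)) {
            Path file = new Path("/user/demo/hello.txt");

            // Write the file once, sequentially.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("hello, hdfs".getBytes(StandardCharsets.UTF_8));
            }

            // Read it back as a stream.
            try (FSDataInputStream in = fs.open(file)) {
                byte[] buf = new byte[32];
                int n = in.read(buf);
                System.out.println(new String(buf, 0, n, StandardCharsets.UTF_8));
            }
        }
    }
}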
Commodity hardware
HDFS uses off-the-shelf commodity hardware—machines that could be used for any other purpose—so it does not require any special infrastructure. HDFS is portable across heterogeneous hardware and software platforms.
Fault tolerance
HDFS is designed to be fault-tolerant: it expects that some of the machines storing data will fail routinely. With thousands of machines in a cluster, HDFS must detect the failure of a machine and recover from it.
Because data is replicated three times by default, data on a failed machine does not become unavailable; it is simply served from another machine. Failover directs a user request to another machine when a node fails. In the meantime, the data is re-replicated to bring it back up to the configured replication factor.
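As a sketch of how the replication factor is exposed to applications, the following Java snippet reads a file’s current replication and raises it for a single file. The path and the target factor of 5 are illustrative; dfs.replication is the standard property that sets the default for new files.

// Minimal sketch: inspect and change a file's replication factor.
// Assumes fs.defaultFS in the loaded configuration points at the cluster.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3");   // default factor for newly created files

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/demo/hello.txt");   // placeholder path

            FileStatus status = fs.getFileStatus(file);
            System.out.println("current replication: " + status.getReplication());

            // Raise the factor for this one file; the NameNode schedules
            // extra copies in the background until the target is met.
            fs.setReplication(file, (short) 5);
        }
    }
}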
Abstract file system
HDFS is not a native file system on a machine’s hardware, but an abstract file system layered over the underlying file system of each machine. The smallest unit of storage in HDFS is called a block, and an HDFS block is 128 MB by default.
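The block size is a property of this abstract layer, not of the disks underneath, and it can even be set per file. The sketch below, with an illustrative 256 MB override and a placeholder path, shows how a file can be created with its own block size through the Java API.

// Minimal sketch: read the default block size and create a file with an
// explicit block size. The 256 MB value and the path are examples only.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/demo/big-file.dat");

            System.out.println("default block size: " + fs.getDefaultBlockSize(file));

            // Create a file with a 256 MB block size and replication factor 3;
            // HDFS splits the file into blocks of this size across DataNodes.
            long blockSize = 256L * 1024 * 1024;
            try (FSDataOutputStream out =
                     fs.create(file, true, 4096, (short) 3, blockSize)) {
                out.writeBytes("payload goes here");
            }
        }
    }
}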
Locality of data
Locality of data is another benefit of replicating data in a distributed cluster. If replicas are spread across multiple regions, a user is more likely to be served data from a local machine or one in relatively close proximity. Not having to fetch data from a distant machine reduces network load.
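The Java API exposes this placement information directly. The following sketch (with a placeholder path) asks the NameNode where each block of a file lives, which is the same information used to serve or process data close to where it is stored.

// Minimal sketch: list the DataNodes holding each block of a file.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            FileStatus status = fs.getFileStatus(new Path("/user/demo/big-file.dat"));

            // One BlockLocation per block; each lists the DataNodes
            // holding a replica of that block.
            BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.printf("offset %d, length %d, hosts %s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
            }
        }
    }
}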
Data and compute code
Because of the large quantities of data involved with HDFS, data is not moved to the compute code; instead, the compute code is moved to the location of the data, reducing network overhead in terms of load and latency.
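In MapReduce this shows up as input splits that advertise the hosts holding their data, so the scheduler can run each mapper on, or near, one of those hosts. The sketch below only illustrates the idea: it lists those preferred hosts for a placeholder input file and assumes a configured Hadoop client.

// Minimal sketch: show the preferred hosts of each MapReduce input split.
// Each split usually corresponds to one HDFS block.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

import java.util.Arrays;
import java.util.List;

public class SplitLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "split-locations");
        FileInputFormat.addInputPath(job, new Path("/user/demo/big-file.dat"));

        // Each split names the DataNodes holding a replica of its data;
        // the scheduler tries to run the mapper for that split on one of them.
        List<InputSplit> splits = new TextInputFormat().getSplits(job);
        for (InputSplit split : splits) {
            System.out.println(Arrays.toString(split.getLocations()));
        }
    }
}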
Read latency
HDFS is not designed to be your regular, interactive file system for general-purpose applications. It’s meant for batch processing, high throughput, and streaming access of large data sets. HDFS is a write-once-read-many file system. File systems that serve general-purpose, interactive applications generally have a low read latency, but with HDFS, some latency in accessing data is to be expected.
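The sketch below illustrates the write-once-read-many model with the Java API: a file can be written sequentially, appended to (where the cluster supports appends), and streamed back, but there is no call for editing bytes in the middle of an existing file. Paths are placeholders.

// Minimal sketch: append to an existing file and stream it back.
// Assumes the cluster allows appends and the file already exists.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class WriteOnceReadMany {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            Path log = new Path("/user/demo/events.log");

            // Append to the end of the file; no random writes are possible.
            try (FSDataOutputStream out = fs.append(log)) {
                out.writeBytes("one more event\n");
            }

            // Stream the whole file back; throughput matters more than the
            // latency of the first byte.
            try (FSDataInputStream in = fs.open(log)) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
        }
    }
}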