HDFS' Location Awareness
According to several documentation 1, 2, 3 HDFS' Location Awareness is about knowing the physical location of nodes and replicating data on different racks to reduce the impact of rack issues due to, e.g. power supply and/or switch issues.
How does HDFS know the physical location of nodes and racks and subsequently decide to replicate data to nodes located on other racks?
Rack-awareness is configured when the cluster is set up. This can be done either manually for each node or through a script.
Each DataNode is given a network location which is simple a string, much like a file system path.
datacenter-1/rack-1/node1 datacenter-1/rack-1/node2 datacenter-1/rack-2/node3
The NameNode then builds a network topology (basically a tree structure) using the network locations of each DataNode. This topology is then used to determine block replica placement.