Friday, May 18, 2012

Software Configuration for Hadoop HDFS Failure Recoverability

We have been running Hadoop HDFS 0.20.x and have had two namenode failures in the past year. Needless to say, we should buy better hardware, since the cluster holds important data, but that is out of my control.

What I can do is make recovery possible after a namenode failure. Below is the procedure I use. Its goal is to recover the namenode as quickly as possible while minimizing data loss. I hope it is useful to others as well.

Procedure:

  1. Make sure Hadoop DFS is stopped.
  2. Back up, to a safe location, the directories listed in hdfs-site.xml under the properties 'dfs.name.dir' (on the namenode server) and 'fs.checkpoint.dir' (on the secondary namenode server). Optionally, diff the fsimage and edits files across the copies to get a sense of how many distinct versions of the metadata you can recover from, and consider starting with the one whose timestamp is closest to the current time.
  3. On the machine that will take over as the new namenode, remove all remote locations from 'dfs.name.dir' and keep only a local location (if more than one local location is listed, keep just one of them).
  4. Restart Hadoop DFS.
  5. Check the status page at namenode:50070. If the reported block ratio reaches 100%, recovery is done. If it cannot reach 100%, stop HDFS and replace the metadata with the copy from another location (possibly a remote location or the secondary namenode location if there are no other local locations). Always leave only one location in 'dfs.name.dir'. If there are no more locations to try, go to step 7.
  6. Start HDFS again and repeat step 5 until the reported block ratio reaches 100% or every location has been tried.
  7. From all the locations, select the copy of the metadata with the highest reported block ratio, copy it into a local location (if it is not already local), and empty the contents of the remote locations.
  8. In the 'dfs.name.dir' property, list the location holding that copy first, followed by the emptied remote locations.
  9. Start HDFS.
  10. Use 'bin/hadoop fsck /' to check the health of the cluster.
  11. Note the corrupted and missing blocks in the fsck report.
  12. Leave safe mode by executing 'bin/hadoop dfsadmin -safemode leave', then move the files with corrupted or missing blocks to the /lost+found folder by executing 'bin/hadoop fsck / -move'.
  13. Check the health of the cluster again using 'bin/hadoop fsck /'. The cluster should now be reported as healthy.
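
Step 2 of the procedure can be sketched as a small shell helper. This is only a rough sketch, not the exact commands we ran: the function names (backup_metadata, checksum_metadata) and all paths in the usage comments are made up for illustration, and it assumes each metadata directory keeps its files under current/fsimage and current/edits, which is the layout we see on 0.20.x.

```shell
#!/bin/sh
# Hypothetical helpers for step 2: back up every metadata directory and
# print one checksum per fsimage/edits file so identical copies stand out.

# backup_metadata BACKUP_ROOT DIR...
# Copies each DIR into BACKUP_ROOT/<timestamp>/copyN, preserving timestamps.
backup_metadata() {
    backup_root=$1
    shift
    stamp=$(date +%Y%m%d-%H%M%S)
    i=0
    for dir in "$@"; do
        i=$((i + 1))
        dest="$backup_root/$stamp/copy$i"
        mkdir -p "$dest" || return 1
        cp -Rp "$dir/." "$dest/" || return 1
        echo "$dir -> $dest"
    done
}

# checksum_metadata DIR...
# Prints "<checksum> <file> <dir>" for the fsimage and edits of each DIR
# (assumed layout: DIR/current/fsimage and DIR/current/edits), sorted so
# that copies with identical content end up on adjacent lines.
checksum_metadata() {
    for dir in "$@"; do
        for f in fsimage edits; do
            if [ -f "$dir/current/$f" ]; then
                sum=$(cksum < "$dir/current/$f" | awk '{print $1}')
                echo "$sum $f $dir"
            fi
        done
    done | sort
}

# Example usage (all paths are placeholders, run with HDFS stopped):
#   backup_metadata /backup/hdfs-meta /data/dfs/name /mnt/nfs/dfs/name
#   checksum_metadata /data/dfs/name /mnt/nfs/dfs/name
```

Lines that share a checksum are the same version of the metadata, so the number of distinct checksums per file tells you how many different versions there are to try in steps 5 to 7. Use 'ls -l' on the same files to compare timestamps.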

Configurations:

We have also noticed that some configuration properties are very important to Hadoop and HBase (0.90.4) reliability. They are:

Name: dfs.http.address
Value: namenode.hadoop-host.com:50070
Description: Put this property in hdfs-site.xml on the host where the secondary namenode runs. This is the address and base port on which the DFS namenode web UI listens. The secondary namenode uses it to fetch the fsimage and edits from the namenode.

Name: dfs.secondary.http.address
Value: secondary-namenode.hadoop-host.com:50090
Description: Put this property in hdfs-site.xml on the host where the secondary namenode runs. This is the secondary namenode HTTP server address and port. The secondary namenode uses it to tell the namenode where the merged fsimage is located.

Name: fs.checkpoint.period
Value: 600
Description: Put this property in hdfs-site.xml on the host where the secondary namenode runs. The value is the number of seconds between two periodic checkpoints, so 600 means a checkpoint every 10 minutes.

Name: dfs.support.append
Value: true
Description: Put this property in hdfs-site.xml and hbase-site.xml across the entire cluster. It enables durable sync, which minimizes data loss when the namenode fails.
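
Put together, the properties above could look like this in hdfs-site.xml on the secondary namenode host. This is just a sketch: the host names are the placeholder values from the list above, and any other properties you already have in the file stay as they are. Remember that dfs.support.append also has to go into hdfs-site.xml and hbase-site.xml on every other node.

```xml
<!-- hdfs-site.xml on the secondary namenode host (sketch only;
     host names are placeholders, not real servers) -->
<configuration>
  <property>
    <name>dfs.http.address</name>
    <value>namenode.hadoop-host.com:50070</value>
  </property>
  <property>
    <name>dfs.secondary.http.address</name>
    <value>secondary-namenode.hadoop-host.com:50090</value>
  </property>
  <property>
    <!-- seconds between two periodic checkpoints -->
    <name>fs.checkpoint.period</name>
    <value>600</value>
  </property>
  <property>
    <!-- needed cluster-wide, and in hbase-site.xml as well -->
    <name>dfs.support.append</name>
    <value>true</value>
  </property>
</configuration>
```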

