Friday, May 18, 2012

Software Configuration for Hadoop HDFS Failure Recoverability

We have been running Hadoop HDFS 0.20.x and have encountered two namenode failures over the past year. Needless to say, we should buy better hardware since the cluster holds important data, but that is out of my control.

What I can do is make recovery possible after a namenode failure occurs. Below is the procedure I'm using. Its goal is to recover the namenode as quickly as possible while minimizing data loss. I hope it is useful for others as well.

Procedures:

  1. Make sure Hadoop DFS is stopped.
  2. Back up the directories listed in hdfs-site.xml under the property 'dfs.name.dir' (from the namenode server) and 'fs.checkpoint.dir' (from the secondary namenode server) to a safe location. Optionally, diff the fsimage and edits files across the different copies to get a sense of how many unique copies of the metadata you can recover from, and start with the one whose timestamp is closest to the current time.
  3. Remove all remote locations listed in 'dfs.name.dir' and leave only one local location (if there is more than one local location, keep just one in the list) on the machine that takes over the responsibility of the new namenode (see the sketch after this list).
  4. Restart Hadoop DFS.
  5. Check the status page at namenode:50070. If the reported block ratio cannot reach 100%, stop HDFS and replace the metadata with a copy from another location (possibly a remote location or the secondary namenode's location if there are no other local ones). Always leave only one location in 'dfs.name.dir'. If there is no more location to try, go to step 7. Otherwise, you are done.
  6. Restart HDFS and repeat step 5 until it succeeds or there are no more locations to try.
  7. Select the metadata copy with the highest reported block ratio, copy it into a local location (if it is not already the local one), and empty the contents of the remote locations.
  8. List the location with the highest reported block ratio first in the 'dfs.name.dir' property, followed by the emptied remote locations.
  9. Start HDFS.
  10. Run 'bin/hadoop fsck /' to check the health of the cluster.
  11. Pay attention to any corrupt or missing blocks.
  12. Leave safe mode by executing 'bin/hadoop dfsadmin -safemode leave', then move the files with corrupt or missing blocks to lost+found by executing 'bin/hadoop fsck / -move'.
  13. Check the health of the cluster again using 'bin/hadoop fsck /'; the cluster should now be healthy.
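
For step 3, the edit to 'dfs.name.dir' in hdfs-site.xml looks something like this (a sketch only; the paths are placeholders for your own metadata directories):

<property>
<name>dfs.name.dir</name>
<!-- normally e.g. /local/namedir,/remote/namedir (one local copy, one NFS-mounted copy) -->
<!-- during recovery: keep exactly one local copy -->
<value>/local/namedir</value>
</property>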

Configurations:

We have also noticed that some configuration properties are very important to Hadoop and HBase (0.90.4) reliability:

Name: dfs.http.address
Value: namenode.hadoop-host.com:50070
Description: put this property in hdfs-site.xml where the secondary namenode is hosted. This is the address and base port on which the DFS namenode web UI listens. It is used for shipping the fsimage and edits files to the secondary namenode server.

Name: dfs.secondary.http.address
Value: secondary-namenode.hadoop-host.com:50090
Description: put this property in hdfs-site.xml where the secondary namenode is hosted. This is the secondary namenode's HTTP server address and port. It is used by the secondary namenode to notify the namenode of where the merged fsimage is located.

Name: fs.checkpoint.period
Value: 600
Description: put this property in hdfs-site.xml where the secondary namenode is hosted. The value is the number of seconds between two periodic checkpoints.

Name: dfs.support.append
Value: true
Description: put this property in hdfs-site.xml and hbase-site.xml across the entire cluster. It enables durable sync, which minimizes data loss when the namenode fails.
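
Putting these together, the relevant section of hdfs-site.xml looks like this (the host names are the placeholder values from above; remember that dfs.support.append must also be set in hbase-site.xml):

<property>
<name>dfs.http.address</name>
<value>namenode.hadoop-host.com:50070</value>
</property>
<property>
<name>dfs.secondary.http.address</name>
<value>secondary-namenode.hadoop-host.com:50090</value>
</property>
<property>
<name>fs.checkpoint.period</name>
<value>600</value>
</property>
<property>
<name>dfs.support.append</name>
<value>true</value>
</property>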


Wednesday, October 14, 2009

Linkage Error: Loader Constraint Violation when using CXF with Jetty 6 Plus

To keep this post as short as possible, a detailed description of the problem can be found here and here. What actually happens in my case is that when I reload the web service application in the container, this linkage error occurs. The loader constraint is violated because two classloaders (the current classloader and the superclass's loader) have different class objects for the type javax.activation.DataHandler used in the method signature.

There are different ways to fix this problem. For example, you can unjar geronimo-activation_1.1_spec-1.0.2.jar and remove the offending classes from it; however, this is very intrusive. Another approach is to instruct Jetty to follow the normal Java behavior, so that all classes are loaded from the system classpath when possible. This can be done by setting the WebAppContext's parentLoaderPriority to true in jetty-web.xml. This works, but it affects the behavior of every class, since any class found on the system classpath will be loaded from there. A better approach (I think) is to control the class-loading behavior on a per-class basis. This can be done by setting the WebAppContext's systemClasses property to include javax.activation. in the list of classes that cannot be overridden by the webapp's classloader.
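
For reference, the coarser-grained parentLoaderPriority alternative would look something like this in jetty-web.xml (a sketch of the approach I decided against):
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE Configure PUBLIC "-//Mort Bay Consulting//DTD Configure//EN" "http://www.eclipse.org/jetty/configure.dtd">
<Configure class="org.mortbay.jetty.webapp.WebAppContext">
<!-- prefer the system classpath for every class -->
<Set name="parentLoaderPriority">true</Set>
</Configure>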

A sample of the jetty-web.xml is as follows:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE Configure PUBLIC "-//Mort Bay Consulting//DTD Configure//EN" "http://www.eclipse.org/jetty/configure.dtd">
<Configure class="org.mortbay.jetty.webapp.WebAppContext">
<!-- this is needed to avoid LinkageError: loader constraint violation -->
<Set name="systemClasses">
<Array type="String">
<Item>java.</Item>
<Item>javax.servlet.</Item>
<Item>javax.xml.</Item>
<Item>org.mortbay.</Item>
<Item>org.xml.</Item>
<Item>org.w3c.</Item>
<Item>org.apache.commons.logging.</Item>
<Item>org.apache.log4j.</Item>
<Item>javax.activation.</Item>
</Array>
</Set>
</Configure>
Hope this is helpful to someone else as well. :)

Tuesday, September 29, 2009

CXF Development Notes

These days, I'm prototyping SOAP web services using CXF. It is a nice framework with built-in Spring support. Basically, I can now design, develop, and test a web service within Eclipse with the help of the WTP plugin (beta).

Here is what I did:
  1. Start with a POJO; define the methods that will be exposed to the web
  2. Generate getters and setters for the POJO
  3. Use the WTP plugin to develop the service, following the tutorial described here.
  4. Write unit tests for the web service
Sounds simple, doesn't it? However, you will encounter at least two problems:

Problem 1:
The POJO has annotations that CXF does not recognize. When I use the WTP plugin to generate the web service, the generated web service is not usable. The exact cause of the problem is not known yet (and not that important for now, as I'm only prototyping). The workaround is to remove those unrecognized annotations before going to step 3.

Problem 2:
Since one of the private members in the POJO is declared with an interface type, CXF complains:
javax.xml.ws.WebServiceException: org.apache.cxf.service.factory.ServiceConstructionException
at org.apache.cxf.jaxws.EndpointImpl.doPublish(EndpointImpl.java:275)
at org.apache.cxf.jaxws.EndpointImpl.publish(EndpointImpl.java:209)
at org.apache.cxf.jaxws.spi.ProviderImpl.createAndPublishEndpoint(ProviderImpl.java:84)
at javax.xml.ws.Endpoint.publish(Unknown Source)
at com.boci.sample.test.MarketData_PortTypeTest.setupServer(MarketData_PortTypeTest.java:58)
at com.boci.sample.test.MarketData_PortTypeTest.startServices(MarketData_PortTypeTest.java:43)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:27)
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:31)
at org.junit.runners.ParentRunner.run(ParentRunner.java:220)
at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:46)
at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)
Caused by: org.apache.cxf.service.factory.ServiceConstructionException
at org.apache.cxf.jaxb.JAXBDataBinding.initialize(JAXBDataBinding.java:354)
at org.apache.cxf.service.factory.ReflectionServiceFactoryBean.buildServiceFromClass(ReflectionServiceFactoryBean.java:381)
at org.apache.cxf.jaxws.support.JaxWsServiceFactoryBean.buildServiceFromClass(JaxWsServiceFactoryBean.java:523)
at org.apache.cxf.service.factory.ReflectionServiceFactoryBean.initializeServiceModel(ReflectionServiceFactoryBean.java:444)
at org.apache.cxf.service.factory.ReflectionServiceFactoryBean.create(ReflectionServiceFactoryBean.java:195)
at org.apache.cxf.jaxws.support.JaxWsServiceFactoryBean.create(JaxWsServiceFactoryBean.java:163)
at org.apache.cxf.frontend.AbstractWSDLBasedEndpointFactory.createEndpoint(AbstractWSDLBasedEndpointFactory.java:100)
at org.apache.cxf.frontend.ServerFactoryBean.create(ServerFactoryBean.java:117)
at org.apache.cxf.jaxws.JaxWsServerFactoryBean.create(JaxWsServerFactoryBean.java:167)
at org.apache.cxf.jaxws.EndpointImpl.getServer(EndpointImpl.java:346)
at org.apache.cxf.jaxws.EndpointImpl.doPublish(EndpointImpl.java:259)
... 21 more
Caused by: com.sun.xml.bind.v2.runtime.IllegalAnnotationsException: 2 counts of IllegalAnnotationExceptions
org.openspaces.core.GigaSpace is an interface, and JAXB can't handle interfaces.
this problem is related to the following location:
at org.openspaces.core.GigaSpace
at private org.openspaces.core.GigaSpace com.boci.sample.jaxws_asm.SetGigaSpace.arg0
at com.boci.sample.jaxws_asm.SetGigaSpace
org.openspaces.core.GigaSpace does not have a no-arg default constructor.
this problem is related to the following location:
at org.openspaces.core.GigaSpace
at private org.openspaces.core.GigaSpace com.boci.sample.jaxws_asm.SetGigaSpace.arg0
at com.boci.sample.jaxws_asm.SetGigaSpace

at com.sun.xml.bind.v2.runtime.IllegalAnnotationsException$Builder.check(IllegalAnnotationsException.java:102)
at com.sun.xml.bind.v2.runtime.JAXBContextImpl.getTypeInfoSet(JAXBContextImpl.java:472)
at com.sun.xml.bind.v2.runtime.JAXBContextImpl.&lt;init&gt;(JAXBContextImpl.java:302)
at com.sun.xml.bind.v2.runtime.JAXBContextImpl$JAXBContextBuilder.build(JAXBContextImpl.java:1136)
at com.sun.xml.bind.v2.ContextFactory.createContext(ContextFactory.java:154)
at com.sun.xml.bind.v2.ContextFactory.createContext(ContextFactory.java:121)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at javax.xml.bind.ContextFinder.newInstance(Unknown Source)
at javax.xml.bind.ContextFinder.find(Unknown Source)
at javax.xml.bind.JAXBContext.newInstance(Unknown Source)
at org.apache.cxf.jaxb.JAXBDataBinding.createJAXBContextAndSchemas(JAXBDataBinding.java:500)
at org.apache.cxf.jaxb.JAXBDataBinding.initialize(JAXBDataBinding.java:337)
... 31 more
This sounds horrible, but it can be fixed by annotating the getters and the setters with @WebMethod(exclude=true), as in the sketch below. And voila, everything should work. :)
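
As a minimal sketch (the GigaSpace member comes from my case; the service class and business method names are hypothetical):

import javax.jws.WebMethod;
import javax.jws.WebService;

import org.openspaces.core.GigaSpace;

@WebService
public class MarketDataService {

    private GigaSpace gigaSpace;

    // Excluded from the WSDL, so JAXB never tries to bind the
    // GigaSpace interface (which also has no no-arg constructor).
    @WebMethod(exclude = true)
    public GigaSpace getGigaSpace() {
        return gigaSpace;
    }

    @WebMethod(exclude = true)
    public void setGigaSpace(GigaSpace gigaSpace) {
        this.gigaSpace = gigaSpace;
    }

    // Regular business methods remain exposed as web service operations.
    @WebMethod
    public String getQuote(String symbol) {
        return "0.0"; // placeholder
    }
}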

Hopefully, this trick will be helpful for someone else as well.