Archive for the ‘Tips’ Category

This is my log of several mistakes (some pretty dumb in hindsight :)) that I made while getting started with Hadoop and Hive some time back, along with some tricks for debugging Hadoop and Hive. I am using Hadoop 0.20.203 and Hive 0.8.1.

localhost: Error: JAVA_HOME is not set

This cryptic, almost undecipherable error message 🙂 during Hadoop startup (namenode/jobtracker etc.) means Hadoop cannot find the Java installation. Wait!! I have already set the JAVA_HOME environment variable?? Seems it’s not enough. So where else do you set it? It turns out that you have to set JAVA_HOME in hadoop-env.sh, present in the conf folder, to get the elephant moving.
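For example (a sketch; the JDK path below is an assumption, so point it at your own installation):

```sh
# conf/hadoop-env.sh -- Hadoop reads JAVA_HOME from here, not from your shell.
# The path is only an example; use your own JDK directory.
export JAVA_HOME=/usr/lib/jvm/java-6-sun
```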

Name node mysteriously fails to start

When you start the namenode things seem fine, except for the fact that the server is not actually up and running. And of course I hadn’t formatted the HDFS on the namenode. So why should it work, right? 🙂 So there it goes. Format the darn namenode before doing anything distributed with Hadoop.

bin/hadoop namenode -format

java.io.IOException Call to localhost/ failed on local exception: java.io.EOFException

This one was a bit tricky. After fiddling and struggling for some time I found out that the Hadoop dependency version used by the JobClient to communicate with the JobTracker was different from the version of the running Hadoop instance. Hadoop uses a homegrown RPC mechanism to communicate with the job tracker and name nodes, and it seems certain Hadoop versions have incompatibilities in this interface.
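The fix is to align the client-side dependency with the server. A minimal Maven sketch, assuming the cluster runs Hadoop 0.20.203 as in my setup:

```xml
<!-- Keep this version identical to the Hadoop version of the running cluster. -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-core</artifactId>
  <version>0.20.203.0</version>
</dependency>
```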

Now it’s time for some debugging tips.

Debugging Hadoop Local (Standalone) mode

Add debugging options for JVM as follows in conf/hadoop-env.sh.

export HADOOP_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=[DEBUG_PORT]"

Debugging Hive Server

Start Hive with the following command line to remote-debug Hive.

./hive --service hiveserver --debug[port=[DEBUG_PORT],mainSuspend=y,childSuspend=y]

Read Full Post »

Learn by Errors : Java + OSGi

Recently I worked on getting Apache Hive to work inside an OSGi environment. While
not proving to be a proverbial piece of cake (software, right?.. Why am I not
surprised? :)), it led me through an assortment of Java and OSGi errors. Here I
am listing some of them that bit me a bit hard (no pun intended), so I thought
of making a blog post out of them just for my own satisfaction.


java.lang.VerifyError

I got this nastiness during the initialization of one of the OSGi service components.
The culprit was not immediately identifiable since the offending bundle was in the
ACTIVE state. On the surface everything looked fine, except for the fact that the
Hive server which was supposed to start during the initialization of the
service component present in the bundle was not up and running. A quick ‘ls’ in
the OSGi console revealed the service component was in the ‘unsatisfied’ state.
Finally a ‘comp’ revealed the root cause, the VerifyError.

The VerifyError can occur if the runtime dependency of a class is different from the
dependency that was used at compilation time; for example, if method signatures
have changed between the two. This is nicely explained in the accepted answer at
[1]. As it turned out, slightly different versions of a package had been exposed
by two bundles, causing the Hive bundle to pick up a different version than the
one in the compilation environment. Proper OSGi versioning turned out to be the solution.
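As a hedged sketch of what that versioning looks like: the consuming bundle can pin its import to a compatible range in its manifest (the package name and range below are made up for illustration):

```
Import-Package: org.apache.hadoop.hive.service;version="[0.8.1,0.9.0)"
```

With an explicit range the framework refuses to wire the bundle to an incompatible package version at resolve time, instead of blowing up with a VerifyError at run time.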


java.lang.IncompatibleClassChangeError

This error also cropped up under a similar circumstance, where two versions of a
package were present in the system. As [2] clearly explains, the reason in my case
was an interface being changed to an abstract class between the conflicting
package versions. Again, versioning helped save the day.

java.lang.LinkageError : loader constraint violation in xxxx – blah …

Now this seems to be a famous error, specially in OSGi environments. The main root
cause is two classes loaded by different ClassLoaders coming into contact in a
certain way. For example, say a Class A object accepts a Class B object as a
method parameter, and ClassLoader-A loads both Class A and Class B. But at
method invocation time, an object of Class B which has been loaded by
ClassLoader-B is passed as an argument to an object of Class A which has been
loaded by ClassLoader-A. The result is a big fat LinkageError with a very
verbose error message.
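The underlying condition can be reproduced in plain Java, without OSGi: the same class loaded by two different ClassLoaders is two distinct types as far as the JVM is concerned. A self-contained sketch (class and loader names are mine, not from any real bundle):

```java
import javax.tools.ToolProvider;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Files;
import java.nio.file.Path;

public class LoaderDemo {

    // Compiles a trivial class B, loads it through two independent loaders,
    // and reports whether the two copies are distinct types to the JVM.
    public static boolean distinctTypes() throws Exception {
        Path dir = Files.createTempDirectory("loaderdemo");
        Path src = dir.resolve("B.java");
        Files.write(src, "public class B {}".getBytes("UTF-8"));
        ToolProvider.getSystemJavaCompiler().run(null, null, null, src.toString());

        URL[] cp = { dir.toUri().toURL() };
        // Two loaders with no shared delegation for B -- analogous to two
        // bundles each exposing their own copy of the same package.
        try (URLClassLoader loaderA = new URLClassLoader(cp, null);
             URLClassLoader loaderB = new URLClassLoader(cp, null)) {
            Class<?> bFromA = loaderA.loadClass("B");
            Class<?> bFromB = loaderB.loadClass("B");
            Object instanceFromB = bFromB.getDeclaredConstructor().newInstance();
            // Same name, different Class objects; an instance from one loader
            // is not an instance of the "same" class from the other loader.
            return bFromA.getName().equals(bFromB.getName())
                    && bFromA != bFromB
                    && !bFromA.isInstance(instanceFromB);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(distinctTypes()); // prints "true"
    }
}
```

When such an instance crosses between loaders through a shared method signature, the JVM's loader-constraint check is what blows up with the LinkageError.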

The graph-based class loading structure in OSGi makes it specially conducive to
these kinds of errors. In my case the culprit was a package which had been
duplicated in two different bundles, with a particular class in that package,
loaded by the separate ClassLoaders of each bundle, coming into contact via a
third bundle present in the system during a method call. So this was a case of
not following the “import what you export” best practice [3] in OSGi. Doing so
reduces the exposure of duplicated packages across bundles and helps maintain a
consistent class space for a given package. And so that turned out to be the
resolution in this case.

Package uses conflict: Import-Package: yyy; version=”x.x.x”

I had my fair share of this inconvenience thrown in my face every so often
during the exercise. There are two excellent posts [4],[5] exactly on this issue
at SpringSource which helped a lot. However, let me summarize my learning on this
issue. Simply put, if a bundle is exposed to two versions of the same package,
through a direct import and via a uses constraint, this error comes up. Consider
the following situation.

Bundle A imports org.foo version 1.0.0 directly. It also imports package org.bar
from bundle B. As it turns out, package org.bar itself uses the org.foo package,
albeit a different version (2.0.0) than the one imported by bundle A. Now bundle
A is directly wired to version 1.0.0 of org.foo while also being exposed to
version 2.0.0 of org.foo through the import of org.bar, which uses version 2.0.0
of org.foo. Since a bundle cannot be wired to two different versions of the same
package, a uses conflict comes up, with the offending import org.bar as the root
cause (e.g.: Package uses conflict: Import-Package: org.bar; version=”0.0.0″).
The solution is to change the org.foo package import version in either bundle A
or bundle B so that both point to the same version. Another excellent blog by
Neil Bartlett on this can be found at [6].
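In manifest terms, the broken state looks roughly like this (bundle and package names mirror the example above; the version ranges are illustrative):

```
Bundle-SymbolicName: A
Import-Package: org.foo;version="[1.0.0,2.0.0)", org.bar

Bundle-SymbolicName: B
Export-Package: org.bar;uses:="org.foo"
Import-Package: org.foo;version="[2.0.0,3.0.0)"
```

Bundle A can only resolve if its own wiring for org.foo is consistent with the org.foo that org.bar uses, which these disjoint ranges make impossible.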


java.lang.UnsatisfiedLinkError

One of my friends at work came across this while trying to incorporate another
third-party library into our OSGi environment. The JavaDoc says this gets
“Thrown if the Java Virtual Machine cannot find an appropriate
native-language definition of a method declared native”. The offending library
was a Linux .so (dynamically linked library) file which was not visible to the
bundle ClassLoader at runtime. We were able to get it working by directly
including the library resource in the bundle ClassLoader. An earlier attempt at
setting this resource on the TCCL (Thread Context ClassLoader) failed, and this
led us to the realization that the TCCL is typically not the bundle class loader.
A good read on the TCCL under the Equinox OSGi environment can be found at [7].


[1] http://stackoverflow.com/questions/100107/reasons-of-getting-a-java-lang-verifyerror
[2] http://stackoverflow.com/questions/1980452/what-causes-java-lang-incompatibleclasschangeerror
[3] http://blog.osgi.org/2007/04/importance-of-exporting-nd-importing.html
[4] http://blog.springsource.org/2008/10/20/understanding-the-osgi-uses-directive/
[5] http://blog.springsource.org/2008/11/22/diagnosing-osgi-uses-conflicts/
[6] http://njbartlett.name/2011/02/09/uses-constraints.html
[7] http://wiki.eclipse.org/ContextClassLoader_Enhancements

Read Full Post »

Been there. Done that. And suffered for that…

Programming is fun. But there is some other associated stuff we programmers blissfully skip or procrastinate on, because it is not so cool.

End result?…

Somebody is going to get hurt at the end of the day, and that somebody may very well be ourselves. So here is some stuff I have experienced, some of which I myself have been guilty of doing, and the insights I have gotten from them.

Good ol’ docs

It’s a well documented fact that documentation is.. hmm well.. let me think.. good to have. Or is it important? Yep, I know the feeling :). But as things turn out, it is something that needs to be done at the end of the day. Who said we programmers do not have to toil for our food 🙂 right? From a user’s perspective, a feature without proper documentation is close to a feature that is not there at all. Say you developed a really neat feature, and obviously you want people to try it out, right? But what if they are not able to wrap their heads around how to use it, or they have to play a guessing game trying to get it to work, and in the process fail miserably? Now not only have you wasted their time, but you have also earned some bad karma. And yes, an intuitive user interface can go a long way to ease users’ pain, but good, to-the-point documentation sprinkled on top makes a recipe that users can’t get enough of.

The Extra Mile

Say you developed this new cool feature. But in the hurry of pushing it out you cut some corners and left a manual step in the usage flow which would have been better done behind the curtains, unbeknownst to the user. Now the user has to do this manual step every time they use your feature, which quickly becomes a pain, specially if it turns out to be a heavily used feature. Optimize the UX. Cut unnecessary steps from the user flow. Go that extra mile and users will thank you for it.

Mind your cycle

Go easy on yourself. Make your development cycle quicker. Say you have some repetitive process to get the code you wrote running in the test environment in order to check whether your feature/fix is working correctly. Invest some time in automating this process, maybe by writing a handy script, and it will help you finish your work early and then go play :).

Let’s configure it

What if a user wants to fine-tune the size of the foo queue holding tasks for the bar thread pool of your program? Oh OK, let’s make it configurable via the UI then, right? Or should we?? Too much configurability thrown in the user’s face kills the user experience. Do not force your users to fill in stuff which is better left with sensible defaults every time they use your software. There may be no need to configure every nook and corner of your program to make it work the way they want. Decide what should be in and what should be out. Better yet, a middle ground would be to provide those configurations in an optional advanced configuration section, with sensible defaults, which users can go and change if they see fit. And remember to document them clearly as well, so that users know better when configuring them.

Nasty API docs

Wrong API docs are worse than no API docs at all. It really happened to me once with a JMS API not working as published in its API docs. And I thought my thread programming was causing it. It took a considerable amount of hair pulled out to figure out the fault was with the API. Since my assumptions about the API, derived from the docs, were wrong, so was my program. Be specially mindful when you are changing an existing API implementation whether the assumptions and the results returned under certain conditions specified in the API docs still hold. If not, change the docs accordingly.

Carpenters wanted..

Manage your broken windows. You may have to cut some corners and pull out some hacks due to time or release pressures. That’s OK, as long as you know what your broken windows are and you intend to repair them at the first chance you get. Leave yourself some reminders and attend to them when that chance comes.

Love thy code.

Show that you care, so that others will care. If you maintain your code in good condition, the other people taking over or contributing to your code will tend to care about maintaining it the same way. This is specially important in open source settings where, at the end of the day, you will not be the only one mending a piece of code written by you.

So there goes my list of tidbits on programming for the better. Pretty much regulation and common-sense stuff which does not warrant a mention, you might say. But as mentioned in the beginning, I have been there. Done that. And have paid for that :). And we keep doing it as well. So I hope this post will serve as a reminder, for me at least, when I am on the verge of doing some nasty thing next time around :). Anyway, this is just my 2 cents. Please holler if you beg to differ.

Read Full Post »

One of these past several days I decided to have a look at how Hector, the Cassandra access library, works under the hood. This was partially motivated by the confusing notions I had about some of the API classes, specially Cluster. My initial impression was that the Cluster API class represents a Cassandra cluster, but looking at some Hector documentation I found out it has nothing to do with an actual Cassandra cluster; it is just a connection to Cassandra, and the cluster name parameter passed in the API does not have anything to do with the actual cluster name. The name is only for Hector to identify the connection.

I took a few notes while reading through the Hector sources. So here goes..

  • Internally, HConnectionManager is used to hold a connection pool in the form of a map.
    1. Connection Pool = Map<CassandraHost, ConcurrentHClientPool>
    2. ConcurrentHClientPool contains a Queue of HThriftClient, which is a wrapper for Cassandra.Client, the Cassandra Thrift client.
  • A CassandraHost is created for each node in the cluster, and the host configuration is populated using the passed CassandraHostConfigurator, which contains a comma separated list of hosts in the Cassandra cluster. (The CassandraHostConfigurator needs to be passed at Cluster creation time.)
  • API operations are distributed among Cluster, Keyspace and a set of query related classes (MultigetSliceQuery, RangeSliceQuery etc..), each of which uses HConnectionManager for the actual execution of the operations.
  • All API operations are executed using HConnectionManager with a suitable load balancing policy, which decides the distribution of requests among the hosts.
  • Cluster contains operations related to creating and updating keyspaces and column families.
  • Keyspace has operations related to adding and deleting data in column families. Mutator batches these requests and presents them to the keyspace as a batch for execution.
  • Query classes have operations related to fetching data from Cassandra keyspaces.

Read Full Post »

Some Cassandra gotchas

Here are some points to keep in mind when working with Cassandra.

  • You have to use an ordered partitioner if you want row keys in sorted order. Be aware that if the row keys are not distributed properly, this would create hot spots, since most of the rows will be concentrated on a few nodes.
  • Columns are sorted using the definitions given when creating the column family, but row keys are not sorted according to them.
  • If you see only hex values in cassandra-cli for the column keys and values in the shown results, use ‘as’ and ‘assume’ to get human readable values for the column keys and values. See this thread.
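For example, in cassandra-cli (the column family name here is hypothetical, and the exact set of ‘assume’ targets depends on your cli version):

```
assume Users keys as utf8;
assume Users comparator as utf8;
assume Users validator as utf8;
list Users;
```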

Read Full Post »

Recently I ran into an interesting little problem with the “resolution:=optional” OSGi directive. Basically what “resolution:=optional” says is that even if the package imported with this directive is not present in the system, the bundle will still resolve at run time. This is logical when a compile-time dependency of the bundle is not needed at runtime. Or is it?..

I had to ponder a little while before putting down that last statement, until I got how it can be so. Usually a bundle needs its compile-time dependencies at run time as well to function correctly. But what if we have an execution path inside our bundle which never gets called at runtime? If that execution path contains imports, then those imported dependencies will not be needed at run time.

Typically the Maven Bundle Plugin is used to incorporate bundle creation into the build. So the general practice I used to follow was to explicitly import the obvious dependencies and use a “resolution:=optional” directive with the “*” wildcard as a catch-all for all other imports. Even if we include some imports which are not needed at runtime, we can get away with the optional directive since the bundle will resolve without them, right?
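The pattern looks roughly like this in the maven-bundle-plugin configuration (the explicit package name below is a placeholder):

```xml
<plugin>
  <groupId>org.apache.felix</groupId>
  <artifactId>maven-bundle-plugin</artifactId>
  <configuration>
    <instructions>
      <!-- Explicit imports first, then the optional catch-all. -->
      <Import-Package>
        org.example.needed.package,
        *;resolution:=optional
      </Import-Package>
    </instructions>
  </configuration>
</plugin>
```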

This is all good and rosy until p2 comes into the picture. Apparently p2 also installs the bundles exporting optionally imported packages if it can find them during provisioning. So you end up installing bundles alongside the feature that may not be required at runtime. Though generally not disastrous, this is a concern if you want to keep your feature installation lean.

In my case the issue was some unwanted bundle getting mysteriously installed with my feature, even though its packages were not explicitly imported in any of the bundle’s sources. As it turned out, I had included a dependency required for tests without specifying that it should be in the test scope. So naturally, with “*;resolution:=optional” in place, the Bundle Plugin dutifully did its job by placing an import in the manifest, which led to this particular bundle being installed at provisioning time.

The solution was trivial. Place the dependency in the test scope, and the Bundle Plugin no longer generates the import, since test-scoped dependencies are not imported in the manifest at all. Alternatively, I could have explicitly un-imported the offending packages, or set the requires.greedy instruction to false in the feature’s p2.inf to stop p2 greedily installing optional dependencies [1], though these solutions are hacky and do not deal with the root cause directly in my case. But it’s always good to have alternatives, isn’t it? :).

[1] http://ekkescorner.wordpress.com/2010/04/26/p2-is-my-friend-again/

Read Full Post »

I was reading up on JMS of late. The more I read, the more I got confused about the usage of the terms persistent and durable in the context of JMS. Then I found this blog, which cleared up my issues. Thanks Nigel for the great post. :). Anyway, here is the gist of that blog that I committed to memory.

  • Persistent applies to messages while durable applies to subscriptions.
  • Queues don’t have a durable/non-durable distinction because the consumer always gets a durable subscription.
  • For queues:
  1. Persistent : The message will be saved on disk and sent later if the consumer is inactive.
  2. Non-persistent : Messages will be saved in-memory and sent later if the consumer is inactive, but they will not survive a broker re-start.
  • For topics:
  1. Persistent & Durable : The message will be saved both on disk and in-memory and sent later if the subscriber is inactive.
  2. Persistent & Non-Durable : The message will not be saved either in-memory or on disk, and any subscriber inactive at the moment the message is received at the broker will not receive the message.
  3. Non-persistent & Durable : The message will be saved in-memory and sent later if the subscriber is inactive, so it will not survive a broker re-start.
  4. Non-persistent & Non-Durable : The message will not be saved either in-memory or on disk, and any subscriber inactive at the moment of receipt will not receive the message. Same behavior as the Persistent & Non-Durable case.
  • In all these cases the message is sent immediately if the subscriber is active.
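The topic rules above condense into two tiny predicates. This is a plain-Java sketch of the rules as stated, not JMS API code (the class and method names are mine):

```java
public class JmsDelivery {

    // Topic subscriber inactive when the broker receives the message:
    // will the broker hold it and deliver it later? Hinges on durability alone.
    public static boolean deliveredToInactiveSubscriber(boolean persistent,
                                                        boolean durable) {
        return durable;
    }

    // Will a held message survive a broker restart? Only if it is both
    // written to disk (persistent) and held for a durable subscription.
    public static boolean survivesBrokerRestart(boolean persistent,
                                                boolean durable) {
        return persistent && durable;
    }

    public static void main(String[] args) {
        // Non-persistent & Durable: delivered later, but lost on restart.
        System.out.println(deliveredToInactiveSubscriber(false, true)); // true
        System.out.println(survivesBrokerRestart(false, true));         // false
    }
}
```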

Read Full Post »

Older Posts »