
I have been reading up on the join implementations available for Hadoop for the past few days. In this post I recap some of the techniques I learned in the process. Joins can be done at either the map side or the reduce side, according to the nature of the data sets to be joined.

Reduce Side Join

Let’s take two tables containing employee and department data, Employees(Name, Age, Dept_Id) and Department(Dept_Id, Name).

Let’s see how the join query below can be achieved using a reduce side join.


SELECT Employees.Name, Employees.Age, Department.Name  FROM Employees INNER JOIN Department ON Employees.Dept_Id=Department.Dept_Id

The map side is responsible for emitting the join predicate value (Dept_Id) as the key, along with the corresponding record from each table, so that records having the same department id in both tables end up at the same reducer, which then does the joining of the records. However, each record also needs to be tagged to indicate which table it originated from, so that the joining happens between records of the two different tables and not within one table.

Here is the pseudo code for the map function in this scenario.


map (K table, V rec) {
   dept_id = rec.Dept_Id
   tagged_rec.tag = table
   tagged_rec.rec = rec
   emit(dept_id, tagged_rec)
}

At the reduce side, the join happens between records having different tags.


reduce (K dept_id, list<tagged_rec> tagged_recs) {
   for (tagged_rec : tagged_recs) {
      for (tagged_rec1 : tagged_recs) {
         if (tagged_rec.tag != tagged_rec1.tag) {
            joined_rec = join(tagged_rec, tagged_rec1)
            emit(dept_id, joined_rec)
         }
      }
   }
}
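To make this concrete, here is a minimal Hadoop (Java) sketch of the same idea. The comma separated input format, the employees*/departments* file name convention used for tagging, and the field positions of Dept_Id are all assumptions made for the example, not fixed by the discussion above.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class ReduceSideJoin {

    // Tags each record with its source table and keys it by Dept_Id.
    public static class TaggingMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text rec, Context context)
                throws IOException, InterruptedException {
            String file = ((FileSplit) context.getInputSplit()).getPath().getName();
            String tag = file.startsWith("employees") ? "E" : "D"; // assumed naming
            String[] fields = rec.toString().split(",");
            String deptId = tag.equals("E") ? fields[2] : fields[0]; // assumed layout
            context.write(new Text(deptId), new Text(tag + ":" + rec));
        }
    }

    // Joins records carrying different tags for the same Dept_Id.
    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text deptId, Iterable<Text> taggedRecs, Context context)
                throws IOException, InterruptedException {
            List<String> employees = new ArrayList<String>();
            List<String> departments = new ArrayList<String>();
            for (Text rec : taggedRecs) {            // bucket records by tag
                String s = rec.toString();
                if (s.startsWith("E:")) employees.add(s.substring(2));
                else departments.add(s.substring(2));
            }
            for (String emp : employees)             // join across the two buckets
                for (String dept : departments)
                    context.write(deptId, new Text(emp + "," + dept));
        }
    }
}

Note that buffering the records per key in memory assumes the number of records per Dept_Id is modest; for heavily skewed keys a secondary sort is typically used to avoid the buffering.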

Map Side Join (Replicated Join)

Using Distributed Cache on Smaller Table

For this implementation to work, one relation has to fit into memory. The smaller table is replicated to each node and loaded into memory. The join happens at the map side without reducer involvement, which significantly speeds up the process since it avoids shuffling all the data across the network, even though most of the records that do not match would later be dropped anyway. The smaller table can be loaded into a hash table so that look-ups by Dept_Id can be done. The pseudo code is outlined below.


map (K table, V rec) {
   list recs = lookup(rec.Dept_Id) // get the smaller table records having this Dept_Id
   for (small_table_rec : recs) {
      joined_rec = join(small_table_rec, rec)
      emit(rec.Dept_Id, joined_rec)
   }
}
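In Hadoop this replication is typically done via the distributed cache. Here is a minimal sketch along the lines of the pseudo code above, assuming the Department table is a small comma separated file (Dept_Id,Name) that the job driver has shipped as the first cache file; the record layouts are again illustrative.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapSideJoinMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> departments = new HashMap<String, String>();

    // Load the replicated Department table into an in-memory hash table, once per task.
    @Override
    protected void setup(Context context) throws IOException {
        Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        BufferedReader reader = new BufferedReader(new FileReader(cached[0].toString()));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split(",");
                departments.put(fields[0], fields[1]); // Dept_Id -> department name
            }
        } finally {
            reader.close();
        }
    }

    // Join each employee record against the in-memory table; no reducer involved.
    @Override
    protected void map(LongWritable offset, Text rec, Context context)
            throws IOException, InterruptedException {
        String[] emp = rec.toString().split(","); // Name,Age,Dept_Id (assumed layout)
        String deptName = departments.get(emp[2]);
        if (deptName != null) { // inner join: drop employees with no matching department
            context.write(new Text(emp[2]), new Text(emp[0] + "," + emp[1] + "," + deptName));
        }
    }
}

The driver side would add the small table with something like DistributedCache.addCacheFile(new URI("/path/to/department.csv"), job.getConfiguration()), the path being wherever the small table actually lives.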

Using Distributed Cache on Filtered Table

If the smaller table doesn’t fit in memory, it may be possible to prune its contents when a filtering expression has been specified in the query. Consider the following query.


SELECT Employees.Name, Employees.Age, Department.Name  FROM Employees INNER JOIN Department ON Employees.Dept_Id=Department.Dept_Id WHERE Department.Name="Eng"

Here a smaller data set can be derived from the Department table by filtering out records having department names other than “Eng”. Now it may be possible to do a replicated map side join with this smaller data set.

Replicated Semi-Join

Reduce Side Join with Map Side Filtering

Even if the filtered data of the small table doesn’t fit into memory, it may be possible to include just the Dept_Ids of the filtered records in the replicated data set. Then at the map side this cache can be used to filter the records that are sent over to the reduce side, thus reducing the amount of data moved between the mappers and reducers.

The map side logic would look as follows.


map (K table, V rec) {
   // check whether this record needs to be sent to the reducer
   boolean sendToReducer = check_cache(rec.Dept_Id)
   if (sendToReducer) {
      dept_id = rec.Dept_Id
      tagged_rec.tag = table
      tagged_rec.rec = rec
      emit(dept_id, tagged_rec)
   }
}

The reducer side logic would be the same as in the reduce side join case.
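A rough sketch of what could be behind check_cache: the replicated file of filtered Dept_Ids is loaded into an in-memory set once and consulted per record. The file name and the one-id-per-line format here are illustrative assumptions.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class DeptIdCache {

    private final Set<String> deptIds = new HashSet<String>();

    // Load the replicated list of filtered Dept_Ids (one per line) into memory.
    public DeptIdCache(String localCacheFile) throws IOException {
        BufferedReader reader = new BufferedReader(new FileReader(localCacheFile));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                deptIds.add(line.trim());
            }
        } finally {
            reader.close();
        }
    }

    // True if a record with this Dept_Id should be sent to the reducer.
    public boolean checkCache(String deptId) {
        return deptIds.contains(deptId);
    }
}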

Using a Bloom Filter

A Bloom filter is a compact, probabilistic construct which can be used to test whether a given element is contained in a set. A much smaller representation of the filtered Dept_Ids can be derived by adding the Dept_Id values into a Bloom filter. Then this Bloom filter can be replicated to each node. At the map side, for each input record the Bloom filter can be used to check whether the Dept_Id in the record is present in the filter, and only if so emit that particular record to the reduce side. Since a Bloom filter is guaranteed not to produce false negatives, no matching record is ever dropped. It may produce occasional false positives, but those extra records are simply discarded by the actual join at the reduce side, so the result is still accurate.
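Hadoop ships a Bloom filter implementation in org.apache.hadoop.util.bloom which fits this nicely. Below is a minimal map side sketch; the file name dept_ids.bloom, and the assumption that the filter was already built from the filtered Dept_Ids and replicated to each node (e.g. via the distributed cache), are illustrative.

import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;

public class BloomFilterJoinMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final BloomFilter deptIds = new BloomFilter();

    // Deserialize the replicated Bloom filter once per task. The filter is
    // assumed to have been built elsewhere: a BloomFilter(vectorSize, nbHash,
    // hashType) with each filtered Dept_Id add()-ed as a Key, written out
    // using its Writable write() method.
    @Override
    protected void setup(Context context) throws IOException {
        DataInputStream in = new DataInputStream(new FileInputStream("dept_ids.bloom"));
        try {
            deptIds.readFields(in);
        } finally {
            in.close();
        }
    }

    // Emit a record only if its Dept_Id might be in the filtered set. False
    // positives merely shuffle a few extra records, which the actual join at
    // the reduce side drops; false negatives cannot occur.
    @Override
    protected void map(LongWritable offset, Text rec, Context context)
            throws IOException, InterruptedException {
        String deptId = rec.toString().split(",")[2]; // assumed employee record layout
        if (deptIds.membershipTest(new Key(deptId.getBytes()))) {
            context.write(new Text(deptId), rec);
        }
    }
}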



Been there. Done that. And suffered for that…

Programming is fun. But there is some other associated stuff we programmers blissfully skip or procrastinate on, because it is not so cool.

End result?…

Somebody is going to get hurt at the end of the day, and that somebody may very well be ourselves. So here are some things I have experienced, some things I myself have been guilty of doing, and the insights I have gotten from them.

Good ol’ docs

It’s a well documented fact that documentation is.. hmm well.. let me think.. good to have. Or is it important? Yep, I know the feeling :). But as things turn out, it is something that needs to be done at the end of the day. Who said we programmers do not have to toil for our food 🙂 right? From a user’s perspective, a feature without proper documentation is close to a feature which is not there at all. Say you developed a really neat feature, and obviously you want people to try it out, right? But what if they are not able to wrap their heads around how to use it, or they have to play a guessing game trying to get it to work, failing miserably in the process? Now not only have you wasted their time, but you have also earned some bad karma. And yes, an intuitive user interface can go a long way towards easing a user’s pain, but good, to-the-point documentation sprinkled on top makes up a recipe that users can’t get enough of.

The Extra Mile

Say you developed this new cool feature. But in the hurry of pushing it out you cut some corners and left some manual step in the usage flow which would better have been done behind the curtains, unbeknownst to the user. Now the user has to do this manual step every time he uses your feature, which quickly becomes a pain, specially if it turns out to be a heavily used feature. Optimize the UX. Cut unnecessary stuff from the user flow. Go that extra mile and users will thank you for it.

Mind your cycle

Go easy on yourself. Make your development cycle quicker. Say you have some repetitive process to perform in order to get the code you wrote running in the test environment, so you can check whether your feature/fix is working correctly. Invest some time in automating this process, maybe by writing a handy script, and it will help you finish your work early and then go play :).

Let’s configure it

What if a user wants to fine tune the size of the foo queue holding tasks for the bar thread pool of your program? Oh OK, let’s make it configurable via the UI then, right? Or should we?? Too much configurability thrown in the user’s face kills the user experience. Do not force your users to fill in stuff which is better left with some sensible defaults every time they use your software. It may be that there is no need to configure every nook and corner of your program to make it work the way they want. Decide what should be in and what should be out. Better yet, a middle ground would be to provide those configurations in an optional advanced configuration section, with sensible defaults, which users can go and change if they see fit. And remember to document them clearly as well, so that users know better when configuring them.

Nasty API docs

Wrong API docs are worse than no API docs at all. It really happened to me once, with a JMS API not working as published in its API docs. And I thought my thread programming was causing it. It took a considerable amount of hair pulling to figure out that the fault was with the API. Since my assumptions about the API, derived from the docs, were wrong, so was my program. Be especially mindful when you are changing an existing API implementation: check whether the assumptions and the results returned under certain conditions, as specified in the API docs, still hold. If not, change the docs accordingly.

Carpenters wanted..

Manage your broken windows. You may have to cut some corners and pull out some hacks due to time or release pressures. It’s OK as long as you know what your broken windows are and you intend to repair them at the first chance you get. Leave some reminders and attend to them when you get the chance.

Love thy code.

Show that you care, so that others will care. If you maintain your code in good condition, other people taking over or contributing to your code will tend to care about maintaining it the same way. This is especially important in open source settings, where at the end of the day you will not be the only one mending a piece of code written by you.

So there goes my list of tidbits on programming for the better. Pretty much regulation and common sense stuff which does not warrant a mention, you might say. But as mentioned in the beginning, I have been there, done that, and have paid for it :). And we keep doing these things as well. So I hope this post will serve as a reminder, for me at least, when I am on the verge of doing some nasty thing the next time around :). Anyway, this is just my 2 cents. Please holler if you beg to differ.
