
Archive for the ‘miscellaneous’ Category

Let’s see how we can model some commonly used SQL queries using map reduce.

  • select … from … where …
Take the following example:
select f1, f2 from relation where f1 > 500

For this example let’s assume a suitable InputFormat (in the case of Hadoop) does the reading from the database and emits key value pairs in the form (k, rec), where k is the primary key and rec is the entire record. Pseudo code using map reduce is given below.

map (k, rec)  {
   if (rec.f1 > 500) {
      rec1 = <rec.f1, rec.f2>
      collect (k , rec1)
   }
}

As can be seen, this is implemented using only a map function; no reduce is needed. Output is emitted only if the predicate is satisfied.
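The map-only filter above can also be sketched as runnable code. The field names and sample records below are made up purely for illustration:

```python
# Simulate a map-only job implementing:
#   select f1, f2 from relation where f1 > 500
# Input is (key, record) pairs, as an InputFormat would emit them.

def map_fn(k, rec):
    """Emit (k, projected record) only when the predicate holds."""
    if rec["f1"] > 500:
        yield (k, {"f1": rec["f1"], "f2": rec["f2"]})

relation = {
    1: {"f1": 700, "f2": "a", "f3": "x"},
    2: {"f1": 300, "f2": "b", "f3": "y"},   # fails the where clause
    3: {"f1": 900, "f2": "c", "f3": "x"},
}

output = [pair for k, rec in relation.items() for pair in map_fn(k, rec)]
print(output)  # only records with f1 > 500 survive
```

Since no aggregation across records is needed, the job can run without a reduce phase at all.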

  • select aggregate_func() from … where … groupby …
Let’s take the following example.
select f3, sum(f1), avg(f2) from relation where f1 > 500 group by f3
The pseudo-code below describes how this is achieved using map reduce.
map (k, rec)  {
   if (rec.f1 > 500) {
      rec1 = <rec.f1, rec.f2, rec.f3>
      collect (rec.f3 , rec1)
   }
}
reduce(v, list<rec1>) {
   sum := 0
   avg := 0
   for each rec1 in list {
      sum += rec1.f1
      avg += rec1.f2
   }
   avg := avg / size(list)
   rec2 = <v, sum, avg>
   store (v, rec2)
}

Here each v that reduce gets corresponds to a unique value in the rec1.f3 field. The group by is done implicitly by the shuffle phase between the map and reduce functions.
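The shuffle-as-group-by idea above can be simulated in a few lines. The shuffle step below gathers map outputs by key before reduce runs; field names and sample data are invented for illustration:

```python
from collections import defaultdict

# Simulate: select f3, sum(f1), avg(f2) from relation where f1 > 500 group by f3

def map_fn(k, rec):
    # where clause in the map; emit f3 as the key so the shuffle groups by it
    if rec["f1"] > 500:
        yield (rec["f3"], (rec["f1"], rec["f2"]))

def reduce_fn(v, recs):
    # v is one distinct f3 value; recs holds every surviving record for it
    total = sum(f1 for f1, _ in recs)
    avg = sum(f2 for _, f2 in recs) / len(recs)
    return (v, total, avg)

relation = {
    1: {"f1": 700, "f2": 10, "f3": "x"},
    2: {"f1": 300, "f2": 99, "f3": "y"},   # filtered out by the where clause
    3: {"f1": 900, "f2": 30, "f3": "x"},
    4: {"f1": 600, "f2": 50, "f3": "y"},
}

# Shuffle: collect all map outputs sharing the same key.
groups = defaultdict(list)
for k, rec in relation.items():
    for key, val in map_fn(k, rec):
        groups[key].append(val)

results = [reduce_fn(v, recs) for v, recs in groups.items()]
print(results)  # [('x', 1600, 20.0), ('y', 600, 50.0)]
```

In a real Hadoop job the framework performs the shuffle; it is written out explicitly here only to make the implicit group by visible.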

  • select aggregate_func() from … where … groupby … having …

Here an additional having clause is used to filter the grouped results. Let’s take an extended version of the earlier example.

select f3, sum(f1), avg(f2) from relation where f1 > 500 group by f3 having avg(f2) > 50

No change is required in the map function. The modified reduce function is shown below.

reduce(v, list<rec1>) {
   sum := 0
   avg := 0
   for each rec1 in list {
      sum += rec1.f1
      avg += rec1.f2
   }
   avg := avg / size(list)
   rec2 = <v, sum, avg>
   if (avg > 50) {
      store (v, rec2)
   }
}
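The having filter slots naturally into the reduce side, after aggregation. A sketch with made-up sample data (same hypothetical schema as before):

```python
from collections import defaultdict

# Simulate: select f3, sum(f1), avg(f2) from relation
#           where f1 > 500 group by f3 having avg(f2) > 50

def map_fn(k, rec):
    if rec["f1"] > 500:                    # where clause, unchanged
        yield (rec["f3"], (rec["f1"], rec["f2"]))

def reduce_fn(v, recs):
    total = sum(f1 for f1, _ in recs)
    avg = sum(f2 for _, f2 in recs) / len(recs)
    if avg > 50:                           # having: applied after aggregation
        return (v, total, avg)
    return None                            # whole group filtered out

relation = {
    1: {"f1": 700, "f2": 10, "f3": "x"},
    3: {"f1": 900, "f2": 30, "f3": "x"},   # group x: avg(f2) = 20 -> dropped
    4: {"f1": 600, "f2": 80, "f3": "y"},   # group y: avg(f2) = 80 -> kept
}

groups = defaultdict(list)
for k, rec in relation.items():
    for key, val in map_fn(k, rec):
        groups[key].append(val)

results = []
for v, recs in groups.items():
    r = reduce_fn(v, recs)
    if r is not None:
        results.append(r)
print(results)  # [('y', 600, 80.0)]
```

Note the contrast with the where clause: where filters individual records in the map, while having filters whole groups in the reduce, once their aggregates are known.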

Read Full Post »

A few days ago I decided to have a look at how Hector, the Cassandra access library, works under the hood. This was partially motivated by the confusing notions I had about some of the API classes, especially Cluster. My initial impression was that the Cluster API class represents a Cassandra cluster, but looking at some Hector documentation I found out it has nothing to do with the actual Cassandra cluster; it’s just a connection to Cassandra, and the cluster name parameter passed in the API does not have anything to do with the actual cluster name. The name is only for Hector to identify the connection.

I took a few notes while reading through the Hector sources. So here goes..

  • Internally HConnectionManager is used to hold a connection pool in the form of a map.
    1. Connection pool = Map<CassandraHost, ConcurrentHClientPool>
    2. ConcurrentHClientPool contains a queue of HThriftClient instances, each a wrapper around Cassandra.Client, the Cassandra Thrift client.
  • A CassandraHost is created for each node in the cluster, and host configuration is populated from the passed CassandraHostConfigurator, which contains a comma separated list of hosts in the Cassandra cluster. (The CassandraHostConfigurator needs to be passed at Cluster creation time.)
  • API operations are distributed among Cluster, Keyspace and a set of query related classes (MultigetSliceQuery, RangeSliceQuery etc..), each of which uses HConnectionManager for the actual execution of the operations.
  • All API operations are executed through HConnectionManager, with a suitable load balancing policy deciding how requests are distributed among the hosts.
  • Cluster contains operations related to creating and updating keyspaces and column families.
  • Keyspace has operations related to adding data to and deleting data from column families. Mutator batches these requests and presents them to the Keyspace as a batch for execution.
  • Query classes have operations related to fetching data from Cassandra keyspaces.
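The pool layout in the notes above (a map of host to client pool) can be sketched roughly as follows. The class names mirror Hector’s, but the behaviour here is a simplified stand-in, not Hector’s actual implementation:

```python
from collections import deque

class HThriftClient:
    """Stand-in for Hector's wrapper around the Cassandra Thrift client."""
    def __init__(self, host):
        self.host = host

class ConcurrentHClientPool:
    """Holds a queue of reusable client wrappers for one host."""
    def __init__(self, host, size=3):
        self._clients = deque(HThriftClient(host) for _ in range(size))

    def borrow(self):
        return self._clients.popleft()

    def release(self, client):
        self._clients.append(client)

# HConnectionManager's pool: Map<CassandraHost, ConcurrentHClientPool>,
# with hosts taken from the comma separated list in CassandraHostConfigurator.
hosts = "host1:9160,host2:9160".split(",")
pool = {h: ConcurrentHClientPool(h) for h in hosts}

client = pool["host1:9160"].borrow()
print(client.host)  # host1:9160
pool["host1:9160"].release(client)
```

Keying the pool by host is what lets the load balancing policy pick a host first and then draw an idle client for it.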

Read Full Post »

At WSO2Con 2011..

Today is the first day at WSO2Con 2011. Themed “Explore. Connect. Learn”, the conference aims to deliver deep dive knowledge and experiences of SOA middleware and cloud computing technologies. The day consisted of interesting talks, ranging from the keynote on IBM’s global technology perspective for 2011 by Dr. C. Mohan to technical talks focused on cloud computing and real world SOA success stories.

So here is a little insider information: IBM’s vision for the near future, summed up in one word.

“Analytics”.

Yes, the world is in dire need of better analytics for smarter, proactive decision making over ever growing big data. It’s a matter of extracting actionable data from continuous information streams in reasonable processing time. Big data and scalable analytics systems are becoming important in this context.

A related talk by Sumedha focused on why data is important and how database systems should be chosen given the recent advent of myriad database systems on the NoSQL front. The gist of the talk was that, be it relational or NoSQL, the actual requirements of the application should determine the type of data management system. Aspects to consider include expected volume growth, scalability, transactional requirements, data access patterns, required management functions, etc.

The day concluded with a panel discussion among industry experts on the cloud computing paradigm. All in all, it was an interesting day filled with lots of valuable technical insights…

Read Full Post »

Do You Workrave???

Are you a computoholic? Yep, that’s a word I just made up; it’s not in any dictionary. Do you tend to get up from the seat in front of your computer in the evening with the sudden revelation that you didn’t make it out of the room even twice during the day? Or are you suffering from eye problems due to excessive computer use? Then here is something that may be worth checking out. Workrave is a small piece of software which sits behind your back and tells you when to stop! Yes, you heard me. At set intervals it gives you visual indications that you should take a tea break, or perhaps do some exercises for your eyes or your back, and it plays animations showing how to do them. Pretty cool, huh? The time intervals are configurable, so it is advisable to set the break intervals to suit your work style. But don’t cheat and set the interval to 4 hrs. You know what I mean. In Ubuntu you can install it with

$ sudo apt-get install workrave

It is cross platform. Google “workrave” and you will find many references.

Read Full Post »

First Post

Here in this blog I will be thinking aloud about my open source software experience. Feel free to check it out and comment on what you think of my thinkings. Have a nice time.

Read Full Post »