A few days ago I decided to have a look at how Hector, the Cassandra access library, works under the hood. This was partly motivated by some confused notions I had about a few of the API classes, especially Cluster. My initial impression was that the Cluster API class represents a Cassandra cluster, but after looking at some Hector documentation I found out that it has nothing to do with an actual Cassandra cluster. It is just a connection to Cassandra, and the cluster name parameter passed in the API has nothing to do with the actual cluster name. The name is only there for Hector to identify the connection.
I took a few notes while reading through the Hector sources. So here goes..
- Internally HConnectionManager is used to hold a connection pool in the form of a map.
- Connection Pool = Map<CassandraHost, ConcurrentHClientPool>
- ConcurrentHClientPool contains a queue of HThriftClient objects. HThriftClient is a wrapper for Cassandra.Client, which is the Cassandra Thrift client.
- A CassandraHost is created for each node in the cluster, and the host configuration is populated from the passed CassandraHostConfigurator, which contains a comma-separated list of hosts in the Cassandra cluster. (The CassandraHostConfigurator needs to be passed at Cluster creation time.)
- API operations are distributed among Cluster, Keyspace and a set of query-related classes (MultigetSliceQuery, RangeSliceQuery etc.), each of which uses HConnectionManager for the actual execution of the operations.
- All API operations are executed through HConnectionManager with a suitable load balancing policy, which decides how requests are distributed among the hosts.
- Cluster contains operations related to creating and updating keyspaces and column families.
- Keyspace has operations related to adding data to and deleting data from column families. Mutator batches these requests and presents them to the Keyspace as a batch for execution.
- Query classes have operations related to fetching data from Cassandra keyspaces.
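
To make the notes above a bit more concrete, here is a small stand-alone Java sketch of the structure as I understand it: a map from host to client pool, a comma-separated host list, and a load balancing step in front of every operation. It does not use Hector at all. The names Host, Client, ClientPool and ConnectionManager are hypothetical stand-ins for CassandraHost, HThriftClient, ConcurrentHClientPool and HConnectionManager, and round-robin is just one simple policy I picked for illustration, so treat it as a rough model under those assumptions rather than Hector's actual implementation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

public class PoolSketch {

    /** Hypothetical stand-in for CassandraHost: one instance per node. */
    static final class Host {
        final String name;
        final int port;
        Host(String name, int port) { this.name = name; this.port = port; }
        @Override public boolean equals(Object o) {
            return o instanceof Host && ((Host) o).name.equals(name) && ((Host) o).port == port;
        }
        @Override public int hashCode() { return Objects.hash(name, port); }
        @Override public String toString() { return name + ":" + port; }
    }

    /** Hypothetical stand-in for HThriftClient: a wrapper around one connection. */
    static final class Client {
        final Host host;
        Client(Host host) { this.host = host; }
    }

    /** Hypothetical stand-in for ConcurrentHClientPool: a queue of idle clients. */
    static final class ClientPool {
        private final BlockingQueue<Client> idle;
        ClientPool(Host host, int size) {
            idle = new ArrayBlockingQueue<>(size);
            for (int i = 0; i < size; i++) idle.add(new Client(host));
        }
        /** Fails fast when empty; this is a simplification made for the sketch. */
        Client borrow() {
            Client c = idle.poll();
            if (c == null) throw new IllegalStateException("pool exhausted");
            return c;
        }
        void release(Client c) { idle.offer(c); }
        int idleCount() { return idle.size(); }
    }

    /** Hypothetical stand-in for HConnectionManager. */
    static final class ConnectionManager {
        // Connection Pool = Map<Host, ClientPool>
        final Map<Host, ClientPool> pools = new LinkedHashMap<>();
        private final List<Host> hosts = new ArrayList<>();
        private final AtomicInteger next = new AtomicInteger();

        /** hostList is a comma-separated list such as "node1:9160,node2:9160". */
        ConnectionManager(String hostList, int poolSize) {
            for (String entry : hostList.split(",")) {
                String[] parts = entry.trim().split(":");
                Host h = new Host(parts[0], Integer.parseInt(parts[1]));
                hosts.add(h);
                pools.put(h, new ClientPool(h, poolSize));
            }
        }

        /** Picks a host round-robin, borrows a client, runs the operation, returns the client. */
        <T> T execute(Function<Client, T> op) {
            Host h = hosts.get(Math.floorMod(next.getAndIncrement(), hosts.size()));
            ClientPool pool = pools.get(h);
            Client c = pool.borrow();
            try {
                return op.apply(c);
            } finally {
                pool.release(c);
            }
        }
    }

    public static void main(String[] args) {
        ConnectionManager cm = new ConnectionManager("node1:9160, node2:9160", 2);
        for (int i = 0; i < 4; i++) {
            // Each "operation" here just reports which host served it.
            System.out.println(cm.execute(c -> c.host.toString()));
        }
    }
}
```

Running main prints node1:9160, node2:9160, node1:9160, node2:9160, one per line, which shows requests alternating between the two hosts. The borrowed client always goes back to its pool in the finally block, so the pools stay full between operations.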