Introduction to Cassandra

Short notes and books for beginners

Nil Seri
Jan 22, 2021
Photo by Leo Roomets on Unsplash

Book Recommendations:

Some of the books I can recommend:

  • Cassandra: The Definitive Guide: Distributed Data at Web Scale (Eben Hewitt and Jeff Carpenter) 2nd Ed.
  • Practical Cassandra: A Developer’s Approach (Eric Lubow and Russell Bradberry)
  • Learning Apache Cassandra (Sandeep Yarabarla)

I have read “Practical Cassandra: A Developer’s Approach”. Chapter 3, “Data Modeling”, was good and had neat examples for quick understanding. If you need to design tables for logging purposes, see the section titled “Model Queries - Not Data”.

There are short, summary-style notes here -> https://github.com/vrachieru/cheatsheet/blob/master/database/cassandra.cheatsheet Again, the “#-# Tables” section is the place to focus on for design.

There are also good, up-to-date examples for development in the third edition of Cassandra: The Definitive Guide: Distributed Data at Web Scale -> https://www.datastax.com/sites/default/files/content/ebook/2020-04/9781492079514%20%282%29.pdf

Connection Pooling

https://docs.datastax.com/en/developer/java-driver/4.7/manual/core/pooling/ : Here are some notable points from the link:

“The driver communicates with Cassandra over TCP, using the Cassandra binary protocol. This protocol is asynchronous, which allows each TCP connection to handle multiple simultaneous requests:

  • when a query gets executed, a stream id gets assigned to it. It is a unique identifier on the current connection;
  • the driver writes a request containing the stream id and the query on the connection, and then proceeds without waiting for the response (if you’re using the asynchronous API, this is when the driver will send you back a java.util.concurrent.CompletionStage). Once the request has been written to the connection, we say that it is in flight;
  • at some point, Cassandra will send back a response on the connection. This response also contains the stream id, which allows the driver to trigger a callback that will complete the corresponding query (this is the point where your CompletionStage will get completed).

You don’t need to manage connections yourself. You simply interact with a CqlSession object, which takes care of it.

For a given session, there is one connection pool per connected node (a node is connected when it is up and not ignored by the load balancing policy).

The number of connections per pool is configurable (this will be described in the next section). There are up to 32768 stream ids per connection.”
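
To make the stream id idea more concrete, here is a minimal sketch in plain Java of one connection that multiplexes several requests. It is a toy model and not the driver's actual code: ToyConnection, InFlightRequest, send and onResponse are hypothetical names I made up, and nothing is written to a real socket.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

// Toy model of a single connection that multiplexes requests by stream id.
// Assumption: this only illustrates the idea; it is not how the driver is written.
class ToyConnection {

    // One request that has been "written" and is waiting for its response.
    static final class InFlightRequest {
        final int streamId;
        final CompletableFuture<String> response = new CompletableFuture<>();

        InFlightRequest(int streamId) {
            this.streamId = streamId;
        }
    }

    private final Deque<Integer> freeStreamIds = new ArrayDeque<>();
    private final Map<Integer, InFlightRequest> inFlight = new HashMap<>();

    ToyConnection(int maxStreamIds) {
        for (int id = 0; id < maxStreamIds; id++) {
            freeStreamIds.add(id);
        }
    }

    // Reserves a stream id and returns immediately, without waiting for a response.
    synchronized InFlightRequest send(String query) {
        Integer streamId = freeStreamIds.poll();
        if (streamId == null) {
            throw new IllegalStateException("no free stream id on this connection");
        }
        InFlightRequest request = new InFlightRequest(streamId);
        inFlight.put(streamId, request);
        // A real client would now write (streamId, query) to the TCP connection.
        return request;
    }

    // Called when a response arrives: the stream id tells us which request it answers.
    void onResponse(int streamId, String payload) {
        InFlightRequest request;
        synchronized (this) {
            request = inFlight.remove(streamId);
            if (request == null) {
                throw new IllegalArgumentException("unknown stream id " + streamId);
            }
            freeStreamIds.add(streamId); // the id can be reused by a later request
        }
        request.response.complete(payload);
    }

    synchronized int inFlightCount() {
        return inFlight.size();
    }
}
```

Responses may arrive in any order; each one completes only the future registered under its stream id, and that id then goes back to the free list.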

https://stackoverflow.com/questions/49862049/how-many-pools-per-session-for-cassandra-java-driver : Some quotes below

“For each session driver have one control & N data connections (configurable) per connected host. You can configure number of connections as described in documentation, potentially setting different number of connections for local & remote connections (if you’re using DC-aware load balancing policy).

By default the number of data connections for V3 protocol is one — that’s enough, especially if you increase the number of “in-flight” requests to high number (V3 allows to have up to 32k “in-flight” requests). If you combine high number of “in-flight” requests with async operations, you can reach quite high throughput, but you may need to take care that you won’t issue too many requests. (I use following class to control it).”
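
The class linked in that answer is not reproduced here. One common way to keep the number of in-flight async requests under control is a semaphore, so here is a rough sketch in plain Java under that assumption. InFlightLimiter and its methods are hypothetical names, and the Supplier stands in for whatever async call you make (for example an async execute on the session).

```java
import java.util.concurrent.CompletionStage;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Caps how many async requests can be in flight at the same time.
// Assumption: a sketch of the general technique, not the class from the answer.
class InFlightLimiter {
    private final Semaphore slots;

    InFlightLimiter(int maxInFlight) {
        this.slots = new Semaphore(maxInFlight);
    }

    // Blocks the calling thread until a slot is free, then starts the request.
    <T> CompletionStage<T> submit(Supplier<CompletionStage<T>> request) {
        slots.acquireUninterruptibly();
        CompletionStage<T> stage;
        try {
            stage = request.get();
        } catch (RuntimeException e) {
            slots.release(); // the request never started, so give the slot back
            throw e;
        }
        // Free the slot when the request finishes, whether it succeeded or failed.
        stage.whenComplete((result, error) -> slots.release());
        return stage;
    }

    int availableSlots() {
        return slots.availablePermits();
    }
}
```

The caller blocks in submit once the limit is reached, which gives a simple form of back-pressure; choosing the limit is up to you and depends on the cluster.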

https://stackoverflow.com/questions/57141458/how-to-create-a-cassandra-connection-pool-using-datastax-driver : Again, some quotes below

“You just need to create one Session object per your application, and then driver will do all necessary pooling for you. Cassandra protocol allows to execute multiple queries over the one connection, and everything just work out of box - you can execute queries from multiple threads using the same Session object. If necessary (but you need to have a good reason for it), you can increase number of connections from driver to every host in cluster, but in the big cluster this may lead to increased resource consumption.

The complete description on how the pooling is implemented is in the driver’s documentation.”

Important Notes for Implementation:

An example for development: https://www.baeldung.com/cassandra-datastax-java-driver The dependencies here are up to date (cassandra-driver-core is the older 3.x artifact; java-driver-core is the artifact used by the 4.x driver).

From the book “Cassandra: The Definitive Guide: Distributed Data at Web Scale” :

“Sessions are expensive”:

https://docs.datastax.com/en/drivers/java/2.0/com/datastax/driver/core/ResultSet.html :

“The retrieval of the rows of a ResultSet is generally paged (a first page of result is fetched and the next one is only fetched once all the results of the first one has been consumed). The size of the pages can be configured either globally through QueryOptions.setFetchSize(int) or per-statement with Statement.setFetchSize(int). Though new pages are automatically (and transparently) fetched when needed, it is possible to force the retrieval of the next page early through fetchMoreResults().”
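
The paging behaviour described above can be imitated with a small iterator in plain Java. This is only a toy model: PagedRows is a hypothetical class, the in-memory list stands in for rows stored on the server, and the real ResultSet is of course more involved.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Toy model of a paged result: the first page is fetched up front and each
// further page is fetched only when the current one has been fully consumed.
class PagedRows implements Iterator<String> {
    private final List<String> source; // stands in for the rows on the server
    private final int fetchSize;
    private List<String> currentPage = Collections.emptyList();
    private int positionInPage = 0;
    private int nextOffset = 0;
    private int pagesFetched = 0;

    PagedRows(List<String> source, int fetchSize) {
        if (fetchSize <= 0) {
            throw new IllegalArgumentException("fetchSize must be positive");
        }
        this.source = source;
        this.fetchSize = fetchSize;
        fetchNextPage(); // the first page arrives with the initial response
    }

    @Override
    public boolean hasNext() {
        if (positionInPage < currentPage.size()) {
            return true;
        }
        if (nextOffset >= source.size()) {
            return false; // no more pages to fetch
        }
        fetchNextPage(); // transparent to the caller
        return true;
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return currentPage.get(positionInPage++);
    }

    private void fetchNextPage() {
        int end = Math.min(nextOffset + fetchSize, source.size());
        currentPage = new ArrayList<>(source.subList(nextOffset, end));
        nextOffset = end;
        positionInPage = 0;
        pagesFetched++;
    }

    int pagesFetched() {
        return pagesFetched;
    }
}
```

With five rows and a fetch size of two, iterating over everything takes three fetches in this model, and the caller never sees the page boundaries.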

https://stackoverflow.com/questions/35225207/using-a-datastax-cassandra-resultset-with-java-8-parallel-streams-quickly

https://stackoverflow.com/questions/46582211/key-value-store-in-cassandra : A good example of how to store key-value mappings.

https://stackoverflow.com/questions/34570346/cassandra-table-with-all-fields-as-pk

“Cassandra does not let you change the value of a primary key, because you would in effect be referring to a different row completely as the primary key is how a row is uniquely identified.
If you want all columns to be part of the primary key and you want to change one of them, what you could do instead is delete the existing row first, and then insert the new data.”
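
As a small illustration of the "delete, then insert" advice, the helper below builds the two CQL statements for a table whose columns are all part of the primary key. It is a sketch under assumptions: PrimaryKeyRewrite is a hypothetical helper, app.user_tags with columns user_id and tag is a made-up table, and the statements use bind markers that you would fill in with the old and new values.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Builds the DELETE and INSERT needed to "change" a primary key value.
// Assumption: every listed column is part of the table's primary key.
class PrimaryKeyRewrite {

    // DELETE for the old row, with one bind marker per key column.
    static String deleteCql(String table, List<String> keyColumns) {
        List<String> conditions = new ArrayList<>();
        for (String column : keyColumns) {
            conditions.add(column + " = ?");
        }
        return "DELETE FROM " + table + " WHERE " + String.join(" AND ", conditions);
    }

    // INSERT for the new row, again with one bind marker per key column.
    static String insertCql(String table, List<String> keyColumns) {
        String markers = String.join(", ", Collections.nCopies(keyColumns.size(), "?"));
        return "INSERT INTO " + table + " (" + String.join(", ", keyColumns) + ") VALUES (" + markers + ")";
    }
}
```

For the made-up table, deleteCql("app.user_tags", List.of("user_id", "tag")) gives a DELETE with both columns in the WHERE clause, bound to the old values, and insertCql gives the INSERT for the new values. These are two separate writes, so whether to group them (for example in a batch) depends on your consistency needs.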

Delete Operations

Some important parts about delete operations from the book “Cassandra: The Definitive Guide: Distributed Data at Web Scale”:
