Cassandra DataStax driver: sharing a ResultSet across multiple threads for fast reading

I have huge tables in Cassandra, with more than 2 billion rows and growing. The rows have a date field, and the tables follow the date-bucket pattern to limit the size of each row.

Even so, I have more than a million entries for a particular date.

I want to read and process the rows for each day as fast as possible. What I am doing now is getting an instance of com.datastax.driver.core.ResultSet, obtaining an iterator from it, and sharing that iterator across multiple threads.

So, essentially I want to increase the read throughput. Is this the correct way? If not, please suggest a better way.

Answers


Unfortunately, you cannot do this as is. The reason is that a ResultSet keeps internal paging state that it uses to retrieve rows one page at a time, so having several threads pull from the same iterator concurrently is not safe.

You do have options, however. Since I imagine you are doing range queries (queries across multiple partitions), you can split the read into multiple queries, each restricted to one token range with the token() function, and submit several of them at a time. A good example of this is documented in Paging through unordered partitioner results.
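To make the token range idea concrete, here is a minimal sketch that cuts the whole ring into contiguous (start, end] ranges. It assumes the default Murmur3Partitioner, whose tokens span Long.MIN_VALUE to Long.MAX_VALUE; the class and method names (TokenRingSplitter, Range, split) are made up for illustration and are not part of the driver.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits the full Murmur3 token ring into contiguous sub-ranges. */
class TokenRingSplitter {

    /** One sub-range, meant for a query like "token(pk) > start AND token(pk) <= end". */
    static final class Range {
        final long start;
        final long end;

        Range(long start, long end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "(" + start + ", " + end + "]";
        }
    }

    /** Returns `parts` ranges that together cover (Long.MIN_VALUE, Long.MAX_VALUE]. */
    static List<Range> split(int parts) {
        if (parts < 1) {
            throw new IllegalArgumentException("parts must be >= 1");
        }
        BigInteger min = BigInteger.valueOf(Long.MIN_VALUE);
        BigInteger max = BigInteger.valueOf(Long.MAX_VALUE);
        // The ring width is 2^64 - 1, which does not fit in a long, hence BigInteger.
        BigInteger width = max.subtract(min);
        BigInteger n = BigInteger.valueOf(parts);

        List<Range> ranges = new ArrayList<>();
        long start = Long.MIN_VALUE;
        for (int i = 1; i <= parts; i++) {
            // Boundary i sits at min + width * i / parts; the last one is exactly Long.MAX_VALUE.
            long end = min.add(width.multiply(BigInteger.valueOf(i)).divide(n)).longValueExact();
            ranges.add(new Range(start, end));
            start = end; // the next range begins where this one ended (exclusive start)
        }
        return ranges;
    }

    public static void main(String[] args) {
        for (Range r : split(4)) {
            System.out.println(r);
        }
    }
}
```

Each Range would then be bound to a prepared statement of the form shown in the driver example further down. Splitting the ring arithmetically like this ignores which node owns which range, so the ranges the driver reports from cluster metadata are probably the better starting point in practice.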

java-driver 2.0.10 and 2.1.5 each provide a mechanism for retrieving token ranges from Hosts and splitting them. There is an example of how to do this in the java-driver's integration tests in TokenRangeIntegrationTest.java#should_expose_token_ranges():

    PreparedStatement rangeStmt = session.prepare("SELECT i FROM foo WHERE token(i) > ? and token(i) <= ?");

    TokenRange foundRange = null;
    for (TokenRange range : metadata.getTokenRanges()) {
        List<Row> rows = rangeQuery(rangeStmt, range);
        for (Row row : rows) {
            if (row.getInt("i") == testKey) {
                // We should find our test key exactly once
                assertThat(foundRange)
                    .describedAs("found the same key in two ranges: " + foundRange + " and " + range)
                    .isNull();
                foundRange = range;
                // That range should be managed by the replica
                assertThat(metadata.getReplicas("test", range)).contains(replica);
            }
        }
    }
    assertThat(foundRange).isNotNull();
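} // closes should_expose_token_ranges(); the method header is not shown in this excerpt
...
private List<Row> rangeQuery(PreparedStatement rangeStmt, TokenRange range) {
    List<Row> rows = Lists.newArrayList();
    // unwrap() splits a range that wraps around the end of the ring into non-wrapping sub-ranges
    for (TokenRange subRange : range.unwrap()) {
        Statement statement = rangeStmt.bind(subRange.getStart(), subRange.getEnd());
        rows.addAll(session.execute(statement).all());
    }
    return rows;
}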

You could generate your statements in the same way and submit them asynchronously; the example above just iterates through the statements one at a time.
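As a rough sketch of the concurrent version, the class below runs one task per range on a thread pool and collects the rows. It uses only the JDK so that it runs on its own: fetchRange is a hypothetical stand-in for the real per-range query (for example, a call like rangeQuery above), and ParallelRangeReader and readAll are illustrative names, not driver API. With the actual driver you would more likely call session.executeAsync(statement) and collect the returned futures instead of managing a pool yourself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

/** Hypothetical helper: runs one query per range concurrently. */
class ParallelRangeReader {

    /**
     * Submits fetchRange for every range, waits for all of them, and returns the rows
     * in the order the ranges were given.
     *
     * @param fetchRange stand-in for the real per-range query (an assumption of this sketch)
     * @param threads    maximum number of ranges queried at the same time
     */
    static <R, T> List<T> readAll(List<R> ranges, Function<R, List<T>> fetchRange, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit everything first so the queries overlap instead of running one by one.
            List<CompletableFuture<List<T>>> futures = new ArrayList<>();
            for (R range : ranges) {
                futures.add(CompletableFuture.supplyAsync(() -> fetchRange.apply(range), pool));
            }
            // join() blocks until that range is done and rethrows a failure, wrapped in CompletionException.
            List<T> rows = new ArrayList<>();
            for (CompletableFuture<List<T>> future : futures) {
                rows.addAll(future.join());
            }
            return rows;
        } finally {
            pool.shutdown();
        }
    }
}
```

Each task works on its own result, so no iterator is ever shared between threads. For a million rows per day you would probably process each range's rows inside its task rather than gather them all into one list, as this sketch does for simplicity.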

Another option is to use the spark-cassandra-connector, which essentially does this under the covers, and does it very efficiently. I find it very easy to use, and you don't even need to set up a Spark cluster to use it. See this document for how to use the Java API.

