Anurag Kapur - DataStax Cassandra Summit 2016

Making Connections With Graphs

Relational DBs: joins, filtering, normal forms
Relational DB - schema changes become painful as your use cases evolve
- Every entity gets a table
- Lots of many to many tables
- Rigid structure
- Going from one to many requires a migration and a new data model
Solving problems with graph
- Fundamentals
  - Vertex - a thing, example: Movie, Person
  - Edges - labeled, directional relationships
  JCVD – acted in –> Timecop – acted in –> Bloodsport – directed –> Bloodsport
- Properties - similar to fields in a table
- The power of graphs is in the relationships
- Summary
- TinkerPop 3 & Gremlin - API for graph queries
  
  g.V().has("person", "name", "JCVD")
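
To make the vertex / edge / property vocabulary concrete, here is a minimal in-memory sketch in Python. It is not TinkerPop or DSE Graph; the `Graph` class and its `has` / `out` methods are hypothetical names that only loosely mirror the Gremlin steps, and the edge label `acted_in` is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    label: str                                       # e.g. "person" or "movie"
    properties: dict = field(default_factory=dict)   # like fields in a table


@dataclass
class Edge:
    label: str       # relationship name, e.g. "acted_in"
    out_v: Vertex    # vertex the edge starts from
    in_v: Vertex     # vertex the edge points to


class Graph:
    """Toy property graph (hypothetical, for illustration only)."""

    def __init__(self):
        self.vertices = []
        self.edges = []

    def add_vertex(self, label, **properties):
        vertex = Vertex(label, properties)
        self.vertices.append(vertex)
        return vertex

    def add_edge(self, out_v, label, in_v):
        edge = Edge(label, out_v, in_v)
        self.edges.append(edge)
        return edge

    def has(self, label, key, value):
        """Vertices with this label whose property `key` equals `value`."""
        return [v for v in self.vertices
                if v.label == label and v.properties.get(key) == value]

    def out(self, vertex, edge_label):
        """Follow outgoing edges with the given label from `vertex`."""
        return [e.in_v for e in self.edges
                if e.out_v is vertex and e.label == edge_label]


graph = Graph()
jcvd = graph.add_vertex("person", name="JCVD")
timecop = graph.add_vertex("movie", title="Timecop")
bloodsport = graph.add_vertex("movie", title="Bloodsport")
graph.add_edge(jcvd, "acted_in", timecop)
graph.add_edge(jcvd, "acted_in", bloodsport)

person = graph.has("person", "name", "JCVD")[0]
titles = [m.properties["title"] for m in graph.out(person, "acted_in")]
print(titles)
```

Finding the movies is a walk along edges from one vertex, with no join table involved, which is the point of the "power is in the relationships" bullet.
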
RDF stores vs Graph DBs
- RDF - great for inferencing capabilities, but tend to not scale very well
- RDF stores can be considered to be specialist graph DBs
- Good ref: https://www.quora.com/What-are-the-differences-between-a-Graph-database-and-a-Triple-store
Ref: datastax-enterprise-graph

Write path

  Data -> Commit log -> Memtable -> (flush) -> SSTable
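
A rough Python sketch of these stages, assuming a single node and ignoring replication: each write is appended to the commit log for durability and applied to the in-memory memtable, and a full memtable is flushed to an immutable, sorted SSTable. The `Node` class and `flush_threshold` parameter are hypothetical, and lists stand in for files on disk.

```python
class Node:
    """Toy single-node write path (hypothetical, not Cassandra code)."""

    def __init__(self, flush_threshold=3):
        self.commit_log = []    # append-only record of unflushed writes
        self.memtable = {}      # in-memory, latest value per key
        self.sstables = []      # immutable sorted snapshots (stand-in for disk)
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))   # durability first
        self.memtable[key] = value             # then the in-memory table
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Persist the memtable as a sorted, immutable SSTable.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()   # flushed writes no longer need replay

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):   # newest SSTable first
            for k, v in sstable:
                if k == key:
                    return v
        return None


node = Node(flush_threshold=2)
node.write("a", 1)
node.write("b", 2)   # second key reaches the threshold and triggers a flush
node.write("a", 3)   # newer value for "a" lives only in the memtable
print(node.read("a"), node.read("b"), len(node.sstables))
```

Because SSTables are never modified in place, overwritten values pile up across files until compaction merges them.
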

SSTable compaction
- Compaction strategies
  - Size tiered
  - Leveled
  - Date tiered (sort of not recommended - use with care)
  - Time tiered (not available yet, may come)
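
As an illustration of the size-tiered idea only, here is a simplified Python sketch: SSTables of similar size are grouped into buckets, and a bucket is merged into one SSTable once it holds enough tables. The bucketing bounds, the threshold of 4 and the `(timestamp, value)` cell layout are assumptions made for the example, not the exact algorithm Cassandra runs.

```python
def merge(sstables):
    """Merge SSTables, keeping the cell with the highest timestamp per key."""
    merged = {}
    for table in sstables:
        for key, (timestamp, value) in table.items():
            if key not in merged or timestamp > merged[key][0]:
                merged[key] = (timestamp, value)
    return merged


def bucket_by_size(sstables, low=0.5, high=1.5):
    """Group tables whose size is within [low, high] x the bucket average."""
    buckets = []
    for table in sorted(sstables, key=len):
        for bucket in buckets:
            average = sum(len(t) for t in bucket) / len(bucket)
            if low * average <= len(table) <= high * average:
                bucket.append(table)
                break
        else:
            buckets.append([table])   # no similar-sized bucket: start a new one
    return buckets


def compact(sstables, min_threshold=4):
    """Merge every bucket that has at least `min_threshold` tables."""
    result = []
    for bucket in bucket_by_size(sstables):
        if len(bucket) >= min_threshold:
            result.append(merge(bucket))
        else:
            result.extend(bucket)
    return result


small = [
    {"a": (1, "v1"), "b": (1, "x")},
    {"a": (2, "v2"), "c": (2, "y")},
    {"d": (3, "z"), "e": (3, "w")},
    {"a": (4, "v4"), "f": (4, "q")},
]
large = {f"k{i}": (0, i) for i in range(10)}

compacted = compact(small + [large])
print(len(compacted), compacted[0]["a"])
```

The four small tables collapse into one while the large table is left alone, since it has no similar-sized peers yet.
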
Data organisation
- Partition key
- Clustering key
- Columns
- Primary key = partition + clustering key
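
One way to picture this layout is a map from partition key to a list of rows kept sorted by clustering key. The sketch below is a hypothetical Python model (the `Table` class and the orders example are made up for illustration) and does not show how Cassandra stores data on disk.

```python
import bisect
from collections import defaultdict


class Table:
    """Toy table: partition key -> rows sorted by clustering key."""

    def __init__(self):
        self.partitions = defaultdict(list)   # list of (clustering_key, columns)

    def insert(self, partition_key, clustering_key, columns):
        rows = self.partitions[partition_key]
        keys = [ck for ck, _ in rows]
        i = bisect.bisect_left(keys, clustering_key)
        if i < len(rows) and rows[i][0] == clustering_key:
            rows[i] = (clustering_key, columns)   # same primary key: overwrite
        else:
            rows.insert(i, (clustering_key, columns))

    def select(self, partition_key, ck_from=None, ck_to=None):
        """Read one partition, optionally restricted to a clustering range."""
        rows = self.partitions.get(partition_key, [])
        return [(ck, cols) for ck, cols in rows
                if (ck_from is None or ck >= ck_from)
                and (ck_to is None or ck <= ck_to)]


orders = Table()
orders.insert("cust-1", "2016-09-02", {"total": 20})
orders.insert("cust-1", "2016-09-01", {"total": 10})
orders.insert("cust-2", "2016-09-01", {"total": 5})
orders.insert("cust-1", "2016-09-01", {"total": 15})   # same primary key again

rows = orders.select("cust-1")
print(rows)
```

The last insert reuses an existing partition + clustering key pair, so it replaces the row rather than adding one, which is why the primary key has to be unique per row.
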
Data modelling
- Always include partition key in where clause of query
  - User login scenario - customer login by email
- User defined types and collections
- Avoid client side joins from 2 or more tables
- Customer registration problem
  - If an insert is done by 2 different clients at around the same time, the last write wins. This may be a problem in certain use-cases. Example: user registration by email
  - Solution: IF NOT EXISTS (a lightweight transaction) - expensive, so use with caution and only when really required
- Customer login problems
  - Customers, Customers_by_email
  - Materialized view w/ DSE 5.0
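
The registration and login points can be sketched together in Python. This is a single-process model with hypothetical names (`CustomerStore`, `upsert_email`, `insert_email_if_not_exists`); in a real cluster the conditional insert has to be coordinated across replicas, which is where its cost comes from. The `customers_by_email` dict plays the role of the lookup table (or materialized view) keyed by email.

```python
import uuid


class CustomerStore:
    """Toy model of customers plus a customers_by_email lookup table."""

    def __init__(self):
        self.customers = {}            # customer_id -> profile
        self.customers_by_email = {}   # email -> customer_id

    def upsert_email(self, email, customer_id):
        """Plain insert: silently overwrites, so the last write wins."""
        self.customers_by_email[email] = customer_id

    def insert_email_if_not_exists(self, email, customer_id):
        """Conditional insert: returns True only if the email was free."""
        if email in self.customers_by_email:
            return False
        self.customers_by_email[email] = customer_id
        return True

    def register(self, email, name):
        customer_id = str(uuid.uuid4())
        if not self.insert_email_if_not_exists(email, customer_id):
            return None   # email already taken
        self.customers[customer_id] = {"email": email, "name": name}
        return customer_id

    def login(self, email):
        """Look up by email first, then fetch the profile by id."""
        customer_id = self.customers_by_email.get(email)
        return self.customers.get(customer_id)


store = CustomerStore()
first = store.register("a@example.com", "Alice")
second = store.register("a@example.com", "Mallory")
print(first is not None, second, store.login("a@example.com")["name"])
```

With `upsert_email` in place of the conditional insert, the second registration would quietly take over the address, which is the last-write-wins problem described in the bullets.
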

## Scaling DataStax in Docker

Key concepts
- Images
- Registries (example: Docker Hub)
- Containers - a running instance of an image
DSE processes
- Core DSE JVM
- OpsCenter agent
- Spark executor processes
- Single Spark worker process
- etc
Things to consider
- Host and DSE config
- Cassandra data - where will you mount volumes etc
- JVM heap size
- Garbage collector
Default networking is not recommended in prod; use host networking instead
```
  docker run --net=host
```
Storage
- Mount a volume for the commit log and anything else in /var/lib/data
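
A small Python helper can show how the networking and storage points combine into one `docker run` invocation. The image name `example/dse:latest` and the host path `/mnt/dse-data` are placeholders, not values from the talk; `--net=host` and `-v host_path:container_path` are standard Docker flags.

```python
import shlex


def docker_run_command(image, host_data_dir, container_data_dir="/var/lib/data"):
    """Build a docker run command with host networking and a data volume."""
    args = [
        "docker", "run",
        "--net=host",                                    # share the host network stack
        "-v", f"{host_data_dir}:{container_data_dir}",   # keep data outside the container
        image,
    ]
    return shlex.join(args)


# Placeholder image and host path, for illustration only.
command = docker_run_command("example/dse:latest", "/mnt/dse-data")
print(command)
```

Keeping the data directory on a host volume means the commit log and SSTables survive when the container is removed.
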
Ref: https://github.com/joeljacobson/dse-docker