Datastax Cassandra Summit 2016
Making Connections With Graphs
- Relations DBs, joins, filtering, normal forms
- Relational db - schema changes as your use cases evolve become painful
- Every entity gets a table
- Lots of many to many tables
- Rigid structure
- Going from one to many requires a migration and new data model
- Solving problems with graph
- Fundamentals
- Vertex - a thing, example: Movie, Person
- Edges - labeled, directional relationships
JCVD – acted in –> Time cop – acted in –> Blood sport – directed –> Blood sport
- Properties - similar to fields in a table
- Power of graphs are relationships
-
Summary
-
Tinkerpop 3 & Gremlin - API for graph query
g.V().has(“person”, “name”, “JCVD”)
- Fundamentals
- RDF stores vs Graph DBs
- RDF - great for inferencing capabilities, but tend to not scale very well
- RDF stores can be considered to be specialist graph DBs
- Good ref: https://www.quora.com/What-are-the-differences-between-a-Graph-database-and-a-Triple-store
- Ref: datastax-enterprise-graph
Data modelling
-
Write path
Data -> Memtable -> Commit log -> SSTable
- SSTable compaction
- Compaction strategies
- Size tiered
- Leveled
- Date tiered (sort of not recommended - use with care)
- Time tiered (not available yet, may come)
- Compaction strategies
- Data organisation
- Partition key
- Clustering key
- Columns
- Primary key = partition + clustering key
- Data modelling
- Always include partition key in where clause of query
- User login scenario - customer login by email
- User defined types and collections
- Avoid client side joins from 2 or more tables
- Customer registration problem
- If an insert is done by 2 different clients at around the same time, the last write wins. This may be a problem in certain use-cases. Example: user registration by email
- Solution
- IF NOT EXISTS - expensive, use with caution, when really required
- Customer login problems
- Customers, Customers_by_email
- Materialized view w/ DSE 5.0
- Always include partition key in where clause of query
## Scaling DataStax in Docker
- Key concepts
- Images
- Registries (example: Docker hub)
- Containers - running instance of image
- DSE processes
- Core DSE JVM
- Opscentre agent
- Spark executor processes
- Single spark workder process
- etc
- Things to consider
- Host and DSE config
- Cassandra data - where will you mount volumes etc
- JVM heap size
- Garbage colletor
-
Default networking not recommended in prod, instead use host networking
docker run -net=host
- Storage
- commit log or anything else in /var/lib/data
- Ref: https://github.com/joeljacobson/dse-docker