#73: Neo4j: all your data as a graph?
Neo4j is a NoSQL database engine. What makes it different is the unusual data model. In Neo4j everything is modelled as a graph. A graph is a collection of nodes connected with edges. A typical example is a graph of friends on a social media website. Or a network of movies and actors. But it turns out many problems can be efficiently modelled as graphs. Like a customer having orders, each order has items. Or insurance, connected to a certain car and an accident. So what makes Neo4j special?
First of all, how do you model your data in Neo4j? There are two entities: nodes and edges. Nodes represent things, like people or places. Edges connect nodes with some relationships. For example, a person likes another person. Or a geographical location is connected to another location via highway. Both edges and nodes may have multiple key-value attributes. So in some sense, they are dynamically-typed documents.
Additionally, nodes can be tagged with labels. Labels allow grouping nodes of the same type. For example, you can have people and places in one graph.
Last but not least, edges are directed and have a relationship type. This means that for example, Bob may have a like connection to Alice. Alice, on the other hand, may have a dislike connection to Bob. Or no connection whatsoever. Attributes on edges are especially important. For example, they may represent the distance between two places.
All right, so our data is represented as a graph. How do we query it? This is where graph databases like Neo4j shine. The query language is called Cypher. A few years ago it became a de-facto standard for querying graph databases. What does Cypher look like?
It’s a high-level, declarative language. In principle, similar to SQL. With Cypher, you declare what kind of data you are looking for. Let’s consider a few examples from simplest to most complex. For example:
- find all bike lanes in a city
- find all places that are reachable from a certain place via bike lanes
- find all one-way bike lanes going parallel to roads
- find the shortest path via bike to the nearest cafe
All of the above can be achieved with a simple Cypher query. Cypher takes advantage of pattern matching. You declare what kind of patterns in a graph you are looking for. And the engine finds all instances of that pattern.
Let’s get into a concrete example. Remember the Panama Papers scandal from 2016? Gigabytes of leaked documents proving how billionaires and politicians are evading taxes and international sanctions? And then absolutely nothing happened to anyone accused? Well, politics aside, Neo4j helped tremendously in analyzing this leak. Initially, the leak contained more than 11 million unstructured, disconnected documents. The documents were then parsed, OCRed and imported to Neo4j. At this point, it was fairly easy to find non-obvious, relationships. Like family members transferring wealth, offshore tax structures, etc.
Neo4j is dual-licensed. The free and open-source license has no hot backup functionality and can run on just a single node. If you need clustering, it’s commercial. Also, if you are into the cloud, check out AuraDB. It’s a managed Neo4j on AWS or Google Cloud.
That’s it, thanks for listening, bye!
More materials
- Official website
- Neo4j on Wikipedia
- AuraDB
- Path finding
- Cypher on Wikipedia
- Panama Papers