In relational databases, data is kept in relations, commonly known as tables. Simplifying, when data is normalized, it’s not duplicated. For example, when storing books and authors, you don’t keep an author’s name next to a book record. Instead, you use a so-called foreign key that references the author in another table. Thanks to this level of indirection, books by the same author do not store repeated information. This has many benefits and one, huge drawback. In order to look up a book together with a corresponding author you must somehow correlate these two tables. This is called joining.
You can think of joining as taking all records from one table and finding a corresponding record in another table.
Typically, one column in one table must be equal to another column in the other table.
In our example
books table has an
That column must reference an existing record in the other table.
Typically, it’s a primary key, like
Thinking imperatively, for each book the SQL engine must find an author for a given
In SQL terms it’s called inner join.
It’s written as
SELECT * FROM books JOIN authors ON books.author_id = authors.id
or with implicit syntax, simply:
SELECT * FROM books, authors WHERE books.author_id = authors.id
Now, there are some edge cases.
What if there’s a book with no author at all?
For example, the excellent “Frankenstein” novel was first published anonymously.
Mary Shelley revealed herself in the second edition.
In our database, the
author_id is NULL.
Inner join can’t find a corresponding author for “Frankenstein” so this book is entirely skipped.
This is where outer joins start to appear.
LEFT JOIN and you’ll see all books in the result.
Some of them, like “Frankenstein”, will have
NULL where author’s ID and name were expected.
In the same manner, you can try
In that case you’ll end up with authors that never published any book.
Put differently, records in the
authors table, that are not referenced by any record in the
Another interesting side-effect. If an author never published any novel, he or she will appear once. The same applies for authors that published just one novel. Like Margaret Mitchell, who wrote one masterpiece: “Gone with the Wind”. But what about authors who published more than one book? Well, they will appear as many times as many books they written.
There’s one more type of
JOIN worth mentioning.
If you have some math background, this is the Cartesian product of two sets.
Without math background: each and every row from one table is mapped to each and every row of the second table.
This is especially useful combined with another technique: self joins.
Believe it or not, you can join a table with itself.
For example, imagine you are looking for pairs of books published the same year.
book table has a
But how do you find two books with the same
Join every book with every other book first.
Just be careful, for a tiny library of one hundred books, this produces ten thousand pairs.
So an extra filtering condition is needed to only choose pairs with the same year.
select * from books b1 join books b2 on b1.published_year = b2.published_year and b1.id > b2.id
b2 are needed to distinguish between
Moreover, a condition on ID excludes duplicates and self-joins.
Implementing fast joins in relational databases requires quite a lot of engineering and optimizations. But that’s a different story.
Thanks for listening, bye!