Around IT in 256 seconds

#34: SQL joins

February 23, 2021 | 4 Minute Read

In relational databases, data is kept in relations, commonly known as tables. Simplifying, when data is normalized, it’s not duplicated. For example, when storing books and authors, you don’t keep an author’s name next to a book record. Instead, you use a so-called foreign key that references the author in another table. Thanks to this level of indirection, books by the same author do not store repeated information. This has many benefits and one, huge drawback. In order to look up a book together with a corresponding author you must somehow correlate these two tables. This is called joining.

You can think of joining as taking all records from one table and finding a corresponding record in another table. Typically, one column in one table must be equal to another column in the other table. In our example books table has an author_id column. That column must reference an existing record in the other table. Typically, it’s a primary key, like author.id. Thinking imperatively, for each book the SQL engine must find an author for a given author_id. In SQL terms it’s called inner join. It’s written as

SELECT *
FROM books JOIN authors ON books.author_id = authors.id

or with implicit syntax, simply:

SELECT *
FROM books, authors
WHERE books.author_id = authors.id

Now, there are some edge cases. What if there’s a book with no author at all? For example, the excellent “Frankenstein” novel was first published anonymously. Mary Shelley revealed herself in the second edition. In our database, the author_id is NULL. Inner join can’t find a corresponding author for “Frankenstein” so this book is entirely skipped. This is where outer joins start to appear. Replace JOIN with LEFT JOIN and you’ll see all books in the result. Some of them, like “Frankenstein”, will have NULL where author’s ID and name were expected. In the same manner, you can try RIGHT JOIN. In that case you’ll end up with authors that never published any book. Put differently, records in the authors table, that are not referenced by any record in the books table.

Another interesting side-effect. If an author never published any novel, he or she will appear once. The same applies for authors that published just one novel. Like Margaret Mitchell, who wrote one masterpiece: “Gone with the Wind”. But what about authors who published more than one book? Well, they will appear as many times as many books they written.

There’s one more type of JOIN worth mentioning. CROSS JOIN. If you have some math background, this is the Cartesian product of two sets. Without math background: each and every row from one table is mapped to each and every row of the second table. This is especially useful combined with another technique: self joins. Believe it or not, you can join a table with itself. For example, imagine you are looking for pairs of books published the same year. A book table has a published_year column. But how do you find two books with the same published_year value? Simple! Join every book with every other book first. Just be careful, for a tiny library of one hundred books, this produces ten thousand pairs. So an extra filtering condition is needed to only choose pairs with the same year. For example:

select *
from books b1 join books b2
on b1.published_year = b2.published_year
and b1.id > b2.id

Aliases b1 and b2 are needed to distinguish between books tables. Moreover, a condition on ID excludes duplicates and self-joins.

Implementing fast joins in relational databases requires quite a lot of engineering and optimizations. But that’s a different story.

Thanks for listening, bye!

More materials

Tags: cartesian, cross, inner, join, outer, sql

Be the first to listen to new episodes!

To get exclusive content: