# #32: Cryptographic hash function

February 09, 2021 | 3 Minute Read

Sometimes you need to split arbitrary objects into a fixed number of groups. For example, storing a record into one out of many database nodes. Or saving a cookie in a hash table. Or distributing jobs among multiple workers. In all of these cases you later want to know, bucket or worker was chosen. Also, data should be split evenly. You don’t want one node or worker to be overloaded. The above properties are implemented by a so-called hash function. It’s an algorithm that takes arbitrary input and produces fixed-length output. A number. For the same input, often called a message, it always produces the same output, known as a hash. Ideally, different messages should produce a different hash. Even better, two slightly different messages should produce wildly different hash. In practice, hash collisions must happen. After all, we are mapping arbitrarily large messages into a fixed-length hash. Often 32- or 64-bit.

Good hash functions are quintessential to implement efficient hash tables. It’s a data structure which allows searching by key in constant time. You simply calculate the hash of the key and then go to the bucket for that key. In a well-designed hash function and hash table, collisions are infrequent. So, two different keys rarely land in the same bucket. But can we do better? Can we have a hash function without collisions at all? Theoretically, it’s impossible. If the input space is larger than the output space, by definition two inputs must produce the same output. Sooner, or later. In practice, we have several such functions!

Meet cryptographic hash functions. This special class of functions are designed in such a way that finding a collision is almost impossible. Mainly because the output space is quite large. A hash code in Java and C# is just 4 bytes. Cryptographic hashes are typically 16, 32 or even 64 bytes long. A 20-digit number! Finding two messages with the same hash is possible. But it requires brute-forcing billions upon billions of messages. Impractical. Moreover, any reasonable cryptographic hash function is irreversible. Some legacy hash functions, like MD5, are reversible, thus broken. Fairly fast algorithms to find any message for a given hash exist. Why does it mean that MD5 is broken? Let’s think about practical usages.

Another use case is verifying the integrity of a large file. Suppose someone sends you one gigabyte of data. How do you make sure the file was not corrupted, or modified by a malicious man-in-the-middle? Well, if the sender of the file also provides a cryptographic hash of that file, you can verify that file against the hash. Hash is short enough to be provided through a different, more secure channel. Over the phone or in a tweet. We can even go further and hash a message together with our public key. Without going into details, this not only proves that message wasn’t tampered with. The digital signature also proves we are the authors of that message.

You’ll find more materials in the show notes. Especially about birthday attack and rainbow tables. Thanks for listening, bye!

# More materials

Tags: birthday-attack, c#, cryptographic-hash-function, hash-function, java, md5, rainbow-table, sha256

### Be the first to listen to new episodes!

#### To get exclusive content:

• Transcripts
• Unedited, longer content
• More extra materials to learn
• Your user voice ideas are prioritized