# Hashing Functions – Finding Collisions

The simple definition of a hash function is the following:

A one-way hash function, H(M) transforms a message M of any length into a fixed-length hash value, h.

h = H(M), where h is of length m, which is a characteristic of the hash function used.
Given h, it is hard to compute M such that H(M)= h.
Given M, it is easy to compute h.

An additional requirement is also needed when evaluating hash functions, this requirement is called collision-resistance.

It is hard to find two random messages, M and M’, such that H(M) = H(M’).

Two famous hashing functions used in our daily live are MD5 and SHA-1. Surprisingly, Microsoft has banned such hash functions 5 years ago to be used by its developer. In this article, I will investigate a set of issues about hash functions that will help us find collisions.

A birthday attack is simply a brute-force attacks. It gets its name from the surprising result that the probability that two or more people in a group of 23 share the same birthday is greater than 1/2; such a result is called a birthday paradox.

If some function, when supplied with a random input, returns one of k equally-likely values, then by repeatedly evaluating the function for different inputs, we expect to obtain the same output after about 1.2k1/2. For the above birthday paradox, replace k with 365.

You can check how to calculate such a probability here.

Another point about hash functions is that

If MD5(x) == MD5(y) then MD5(x+q) == MD5(y+q)

Having a pair messages x and y with the same MD5 value you can append any payload q and the MD5 values remain equal.

Being able to find collisions in MD5 hashing had the following consequences:

1- Vulnerabilities exposed in the Public Key Infrastructure (PKI) – read article MD5 considered harmful for more details.

2- All digital signatures like RSA, DSA/ElGamel, and Elliptical Curve never hash the data directly, but rather a hash of the data, often the choice is MD5. Also consider DRM (Digital Right Management) implementations using MD5. All these protection signatures and checksums are at risk because of these findings.

3- SSL uses concatenated MD5 and SHA-1sums; that ensures that a method to find collisions in in one of the functions doesn’t allow forging traffic protected with both functions.

However, the concatenated function is only as strong as the best component, not stronger. It was noted that 2-collisions lead to n-collisions: if it is feasible to find two messages with the same MD5 hash, it is effectively no more difficult to find as many messages as the attacker desires with identical MD5 hashes. Among the n messages with the same MD5 hash, there is likely to be a collision in SHA-1. The additional work needed to find the SHA-1 collision (beyond the exponential birthday search) is polynomial. This argument is summarized here.

I will be now investigating the algorithms to find collisions based on the ideas in this article.