defmy_birthday(n,d):return1-pow((d-1)/d , n)defsame_birthday(n,d): p =1for i inrange(1, n):#1 -> n-1 p*=(1-i/d)return1- pprint(same_birthday(23, 365), same_birthday(32, 365), same_birthday(100, 365))# (0.5072972343239854, 0.7533475278503207, 0.9999996927510721)
An useful approximation
From the Taylor approximation we know ex=1+x+2!x2+⋯=>ex≈1+x for x≪1
Apply for each event: ⇒x=−da⇒e−a/d≈1−da⇒Pr(A)=1−∏i=1n−1e−i/d=1−e−n(n−1)/2d≈1−e−n2/2d
If we want to solve for n knowing Pr(A) we take the ln => n≈2dln(1−Pr(A)1)
import hashlibimport randomfrom Crypto.Util.number import long_to_bytes, bytes_to_longdefsmall_hash(m,hash_bits):''' Arguments m: bytes -- input hash_bits: int -- number of bits of the hash Returns: {bytes} - hash of the message of dimension `hash_bits` '''assert hash_bits >0,"no negative number of bits" mask = (1<< hash_bits) -1# Mask of bits t = hashlib.sha256(m).digest()# the hash in bytes t =bytes_to_long(t) t = t & mask # get the last 12 bitsreturnlong_to_bytes(t)
We make the following function to find the hashes:
defsmall_hash_colision(M_bits,T_bits):''' Arguments M_bits: int -- dimension of the message space T_bits: int -- dimension of the hash space Returns: {(bytes, bytes, bytes)} -- message1, message2, hash or {(b'', b'', b'')} -- if no collision found ''' N =1<<T_bits print('Hash size: ', N)# This is `s` num_samples =1*isqrt(N) num_samples += num_samples//5+1# num_samples = 1.2 * isqrt(N) + 1print(f'Making a list of {num_samples} hashes')print(f'Probability of finding a collision is {same_birthday(num_samples, N)}') m_list = [] t_list = []# Make a list of hashesfor i inrange(num_samples): m = random.getrandbits(M_bits)# Get random message#m = random.randint(0, M_bits-1) m =long_to_bytes(m) t =small_hash(m, T_bits)# Hash itif m notin m_list: m_list.append(m) t_list.append(t)# Check for collisionnfor i inrange(len(t_list)):for j inrange(i+1, len(t_list)):if t_list[i]== t_list[j]:print('Collision found!')return m_list[i], m_list[j], t_list[i]else:print('Collision not found :(')returnb"",b"",b""
We use it as follows:
M_bits =30T_bits =20m1, m2, t =small_hash_colision(M_bits, T_bits)print(m1, m2, t)print(small_hash(m1, T_bits) ==small_hash(m2, T_bits))# Hash size: 1048576# Making a list of 1229 hashes# Probability of finding a collision is 0.513213460854798# Collision found!# b'!\xafB\xc5' b'4F\xb6w' b'\x01y\xf7'# True
Eliminating Storage Requirements with Pollard's Rho
While the above algorithm works to find a hash collision in time O(N), it also requires storing O(N)hash values. As such, it represents a classic time-space tradeoff over the naive approach, which involves randomly selecting pairs of inputs until they hash to the same value. While the naive approach does not require any additional storage, it does have runtime O(N).
However, there is a better approach combining the best of both worlds: constant storage requirements and O(N)runtime. This approach is based on Pollard's Rho algorithm, which is better-known for its application to solving discrete logarithms. The core insight behind the algorithm is that by the Birthday Paradox, we expect to encounter a hash collision after trying O(N)random inputs. However, it is possible to detect whether a collision has occurred without needing to store all of the inputs and hashes if the inputs are chosen in a clever way.
This is done by choosing the next input based on the hash of the previous input, according to the following sequence:
Where H:M⟶T is our hash function and g:T⟶M is a "sufficiently random" function which takes a hash value and produces a new. We define the composition of the functions to bef=H∘g:T⟶T.
# TODO:# have the two hash functions used in this chapter be the samefrom Crypto.Hash import SHA256from Crypto.Util.number import bytes_to_long, long_to_bytesfrom tqdm.autonotebook import tqdm# bits in hash outputn =30# Hdefmy_hash(x): x =bytes(x, 'ascii') h = SHA256.new() h.update(x) y = h.digest() y =bytes_to_long(y) y = y % (1<<n) y =int(y)return y# gdefget_message(r): x ="Crypto{"+str(r)+"}"return x# fdeff(r):returnmy_hash(get_message(r))
By the Birthday Paradox, we expect that the sequence will have a collision (where xi=xj for two distinct values i,j) after O(N)values. But once this occurs, then the sequence will begin to cycle, because xi+1=g(H(xi))=g(H(xj))=xj+1.
Therefore, we can detect that a collision has occurred by using standard cycle-detection algorithms, such as Floyd's tortoise and hare!
And finally, we can locate the first place in the sequence where the collision occurred, which will let us determine what the colliding inputs to the hash function are. This is done by determining how many iterations apart the colliding inputs are, and then stepping together one iteration at a time until the collision occurs.
For Floyd's tortoise and hare, this is done by noting that when we found our collision after n iterations, we were comparing H(xn)=H(x2n). And because the sequence is cyclic, if the first colliding input is xi, then it collides withH(xi)=H(xi+n). So we define the new sequence yi=xn+i, and step through the xi and yi sequences together until we find our collision!
This is implemented in the following code snippet.
"""initializationThis routine will find a hash collision in sqrt time with constant space."""seed =0x0 =get_message(seed)x = x0y =f(x)"""detect collisionusing Floyd / tortoise and hare cycle finding algorithmexpected number of iterations is ~ sqrt(pi/2) * 2^(n/2),we run for up to 4 * 2^(n/2) iterations"""for i intqdm(range(4<< (n//2))):iff(x)==f(y):break x =f(x) y =f(y) y =f(y)"""locate collision"""x = x0y =f(y)for j intqdm(range(i)):iff(x)==f(y):break x =f(x) y =f(y)m1 =get_message(x)m2 =get_message(y)assertmy_hash(m1)==f(x)assertmy_hash(m2)==f(y)print("[+] seeds for messages: {}, {}".format(x, y))print("[+] messages: {}, {}".format(m1, m2))print("[+] collided hashes of messages: {}, {}".format(my_hash(m1), my_hash(m2)))# 31% 40391/131072 [00:03<00:08, 10666.20it/s]# 30% 12032/40391 [00:00<00:02, 13112.15it/s]# # [+] seeds for messages: 404842900, 254017312# [+] messages: Crypto{404842900}, Crypto{254017312}# [+] collided hashes of messages: 1022927209, 1022927209
Finally, there is a third algorithm for finding hash collisions in O(N)time, namely the van Oorschot-Wiener algorithm based on Distinguished Points. While it does have additional storage requirements over Pollard's Rho, the main advantage of this algorithm is that it parallelizes extremely well, achieving O(mN)runtime when run on m processors.