CryptoBook
Search…
The Birthday paradox / attack
Authors: Zademn, ireland Reviewed by:
Prerequisites
    Probability theory (for the main idea)
    Hashes (an application)
Motivation
    Breaking a hash function (insert story)

The birthday paradox

*insert story / introduction about why it's called a paradox + use*

Birthday version

Question 1
What is the probability that 1 person has the same birthday as you?
Solution
    Let
    AA
    be the event that someone has the same birthday as you and
    Aˉ\bar{A}
    be the complementary event
      The events are mutually exclusive =>
      Pr(A)=1Pr(Aˉ)Pr(A) = 1 - Pr(\bar{A})
    Let
    EiE_i
    be the events that person
    ii
    does not have your birthday
Then
    Pr(A)=1Pr(Aˉ)=1i=1nPr(Ei)=1(364365)nPr(A) = 1 - Pr(\bar{A}) = 1 - \prod_{i=1}^n Pr(E_i) = 1 - \left( \dfrac {364} {365}\right)^n
Reminder:
Pr(A,B)=Pr(A)Pr(B)Pr(A, B) = Pr(A) \cdot Pr(B)
if
A,BA, B
are independent events
Question 2
What is the probability that 2 out of
nn
people in a room share the same birthday?
    Suppose the birthdays are distributed independently and uniformly
Solution
    Let
    AA
    be the event that 2 people have the same birthday, let
    Aˉ\bar{A}
    be the complementary event (no 2 people have the same birthday)
    Event 1 = Person 1 is born =>
    Pr(E1)=365365Pr(E_1) = \dfrac {365} {365}
    Event 2 = Person 2 is born on a different day than Person 1 =>
    Pr(E2)=364365Pr(E_2) = \dfrac {364} {365}
    \vdots
    Event n = Person n is born on a different day than Person
    1,...,n11,...,n-1 \Rightarrow
    Pr(En)=365(n1)365\Rightarrow Pr(E_n) = \dfrac {365-(n-1)} {365}
Pr(Aˉ)=Pr(E1)Pr(E2)Pr(En)=365365364365365(n1)365=(1365)n365!(365n)!=i=1n1(1i365)Pr(\bar{A}) = Pr(E1) \cdot Pr(E_2) \cdot \dots \cdot Pr(E_n) = \dfrac {365} {365} \cdot \dfrac {364} {365} \cdot \dots \cdot \dfrac {365-(n-1)} {365} = \left( \dfrac {1} {365} \right) ^{n} \cdot \dfrac {365!} {(365-n)!} = \prod_{i=1}^{n-1} \left(1 - \dfrac i {365}\right)

General case

Question 1
    Instead of
    365365
    days we have
    d1(d1d)nd \Rightarrow \boxed{1 - \left( \dfrac {d-1} {d}\right)^n}
Question 2
    Instead of
    365365
    days we have
    d1i=1n1(1id)d \Rightarrow \boxed{1 - \prod_{i=1}^{n-1} \left(1 - \dfrac i {d}\right)}
Code examples
1
def my_birthday(n, d):
2
return 1 - pow((d-1)/d , n)
3
4
def same_birthday(n, d):
5
p = 1
6
for i in range(1, n): #1 -> n-1
7
p*=(1-i/d)
8
return 1 - p
9
10
print(same_birthday(23, 365), same_birthday(32, 365), same_birthday(100, 365))
11
# (0.5072972343239854, 0.7533475278503207, 0.9999996927510721)
Copied!

An useful approximation

From the Taylor approximation we know
ex=1+x+x22!+=>ex1+xe^x = 1 + x + \dfrac {x^2} {2!} + \dots => e_x\approx 1 + x
for
x1x \ll 1
Apply for each event:
x=adea/d1adPr(A)=1i=1n1ei/d=1en(n1)/2d1en2/2d\Rightarrow x = -\dfrac a d \Rightarrow e^{ -a /d} \approx 1- \dfrac a d \Rightarrow Pr(A) = 1 - \prod_{i=1}^{n-1}e^{-i/d} = 1-e^{-n(n-1) /{2d}} \approx 1-\boxed{e^{-{n^2} / {2d}}}
If we want to solve for
nn
knowing
Pr(A)Pr(A)
we take the
ln\ln
=>
n2dln(11Pr(A))\boxed{n \approx \sqrt{2d \ln \left(\dfrac 1 {1-Pr(A)}\right)}}
1
def approx_same_birthday(n, d):
2
return 1 - pow(e, -pow(n, 2) / (2*d)).n()
3
4
print(approx_same_birthday(23, 365))
5
print(approx_same_birthday(32, 365))
6
print(approx_same_birthday(100, 365))
7
8
# 0.515509538061517
9
# 0.754077719532824
10
# 0.999998876014983
Copied!

Finding a collision

    Let
    H:MTH:\mathcal{M} \longrightarrow \mathcal{T}
    be a hash function with
    M>>T|\mathcal{M}| >> |T|
    Let's denote
    N=TN = |\mathcal{T}|
Algorithm
1. Choose
sNs \approx \sqrt{N}
random distinct messages in
M\mathcal{M}
2. Compute
ti=H(mi)t_i = H(m_i)
for
1iN1\leq i \leq \sqrt{N}
3. Look for
(ti=tj)(t_i = t_j) \to
If not found go to step 1
Example:
Consider the following hash function:
1
import hashlib
2
import random
3
from Crypto.Util.number import long_to_bytes, bytes_to_long
4
5
def small_hash(m, hash_bits):
6
'''
7
Arguments
8
m: bytes -- input
9
hash_bits: int -- number of bits of the hash
10
Returns:
11
{bytes} - hash of the message of dimension `hash_bits`
12
'''
13
assert hash_bits > 0, "no negative number of bits"
14
15
mask = (1 << hash_bits) - 1 # Mask of bits
16
t = hashlib.sha256(m).digest() # the hash in bytes
17
t = bytes_to_long(t)
18
t = t & mask # get the last 12 bits
19
return long_to_bytes(t)
Copied!
We make the following function to find the hashes:
1
2
def small_hash_colision(M_bits, T_bits):
3
'''
4
Arguments
5
M_bits: int -- dimension of the message space
6
T_bits: int -- dimension of the hash space
7
Returns:
8
{(bytes, bytes, bytes)} -- message1, message2, hash
9
or
10
{(b'', b'', b'')} -- if no collision found
11
'''
12
N = 1<<T_bits
13
print('Hash size: ', N)
14
# This is `s`
15
num_samples = 1 * isqrt(N)
16
num_samples += num_samples//5 + 1 # num_samples = 1.2 * isqrt(N) + 1
17
print(f'Making a list of {num_samples} hashes')
18
print(f'Probability of finding a collision is {same_birthday(num_samples, N)}')
19
20
m_list = []
21
t_list = []
22
23
# Make a list of hashes
24
for i in range(num_samples):
25
m = random.getrandbits(M_bits) # Get random message
26
#m = random.randint(0, M_bits-1)
27
m = long_to_bytes(m)
28
29
t = small_hash(m, T_bits) # Hash it
30
if m not in m_list:
31
m_list.append(m)
32
t_list.append(t)
33
34
# Check for collisionn
35
for i in range(len(t_list)):
36
for j in range(i+1, len(t_list)):
37
if t_list[i] == t_list[j]:
38
print('Collision found!')
39
return m_list[i], m_list[j], t_list[i]
40
else:
41
print('Collision not found :(')
42
return b"", b"", b""
Copied!
We use it as follows:
1
M_bits = 30
2
T_bits = 20
3
4
m1, m2, t = small_hash_colision(M_bits, T_bits)
5
print(m1, m2, t)
6
print(small_hash(m1, T_bits) == small_hash(m2, T_bits))
7
8
# Hash size: 1048576
9
# Making a list of 1229 hashes
10
# Probability of finding a collision is 0.513213460854798
11
# Collision found!
12
# b'!\xafB\xc5' b'4F\xb6w' b'\x01y\xf7'
13
# True
14
Copied!

Eliminating Storage Requirements with Pollard's Rho

While the above algorithm works to find a hash collision in time
O(N)O(\sqrt N )
, it also requires storing
O(N)O(\sqrt N )
hash values. As such, it represents a classic time-space tradeoff over the naive approach, which involves randomly selecting pairs of inputs until they hash to the same value. While the naive approach does not require any additional storage, it does have runtime
O(N)O(N )
.
However, there is a better approach combining the best of both worlds: constant storage requirements and
O(N)O(\sqrt N)
runtime. This approach is based on Pollard's Rho algorithm, which is better-known for its application to solving discrete logarithms. The core insight behind the algorithm is that by the Birthday Paradox, we expect to encounter a hash collision after trying
O(N)O(\sqrt N)
random inputs. However, it is possible to detect whether a collision has occurred without needing to store all of the inputs and hashes if the inputs are chosen in a clever way.
This is done by choosing the next input based on the hash of the previous input, according to the following sequence:
x0=g(seed)x1=g(H(x0))x2=g(H(x1))xn+1=g(H(xn))x_0 = g(seed) \\ x_1 = g(H(x_0)) \\ x_2 = g(H(x_1)) \\ \dots \\ x_{n+1} = g(H(x_n)) \\
Where
H:MTH:\mathcal{M} \longrightarrow \mathcal{T}
is our hash function and
g:TMg: \mathcal{T} \longrightarrow \mathcal{M}
is a "sufficiently random" function which takes a hash value and produces a new. We define the composition of the functions to be
f=Hg:TTf = H \circ g : \mathcal{T} \longrightarrow \mathcal{T}
.
1
# TODO:
2
# have the two hash functions used in this chapter be the same
3
4
from Crypto.Hash import SHA256
5
from Crypto.Util.number import bytes_to_long, long_to_bytes
6
7
from tqdm.autonotebook import tqdm
8
9
# bits in hash output
10
n = 30
11
12
# H
13
def my_hash(x):
14
x = bytes(x, 'ascii')
15
h = SHA256.new()
16
h.update(x)
17
y = h.digest()
18
y = bytes_to_long(y)
19
y = y % (1<<n)
20
y = int(y)
21
return y
22
23
# g
24
def get_message(r):
25
x = "Crypto{" + str(r) + "}"
26
return x
27
28
# f
29
def f(r):
30
return my_hash(get_message(r))
31
Copied!
By the Birthday Paradox, we expect that the sequence will have a collision (where
xi=xjx_i = x_j
for two distinct values
i,ji,j
) after
O(N)O(\sqrt N)
values. But once this occurs, then the sequence will begin to cycle, because
xi+1=g(H(xi))=g(H(xj))=xj+1x_{i+1} = g(H(x_i)) = g(H(x_j)) = x_{j+1}
.
Therefore, we can detect that a collision has occurred by using standard cycle-detection algorithms, such as Floyd's tortoise and hare!
And finally, we can locate the first place in the sequence where the collision occurred, which will let us determine what the colliding inputs to the hash function are. This is done by determining how many iterations apart the colliding inputs are, and then stepping together one iteration at a time until the collision occurs.
For Floyd's tortoise and hare, this is done by noting that when we found our collision after
nn
iterations, we were comparing
H(xn)=H(x2n)H(x_{n}) = H(x_{2 n})
. And because the sequence is cyclic, if the first colliding input is
xix_i
, then it collides with
H(xi)=H(xi+n)H(x_{i}) = H(x_{i+n})
. So we define the new sequence
yi=xn+iy_i = x_{n+i}
, and step through the
xix_i
and
yiy_i
sequences together until we find our collision!
This is implemented in the following code snippet.
1
"""
2
initialization
3
4
This routine will find a hash collision in sqrt time with constant space.
5
"""
6
7
seed = 0
8
9
x0 = get_message(seed)
10
x = x0
11
y = f(x)
12
13
"""
14
detect collision
15
using Floyd / tortoise and hare cycle finding algorithm
16
17
expected number of iterations is ~ sqrt(pi/2) * 2^(n/2),
18
we run for up to 4 * 2^(n/2) iterations
19
"""
20
for i in tqdm(range(4 << (n//2))):
21
22
if f(x) == f(y):
23
break
24
25
x = f(x)
26
27
y = f(y)
28
y = f(y)
29
30
31
"""
32
locate collision
33
"""
34
x = x0
35
y = f(y)
36
37
38
for j in tqdm(range(i)):
39
if f(x) == f(y):
40
break
41
42
x = f(x)
43
44
y = f(y)
45
46
47
m1 = get_message(x)
48
m2 = get_message(y)
49
50
assert my_hash(m1) == f(x)
51
assert my_hash(m2) == f(y)
52
53
print("[+] seeds for messages: {}, {}".format(x, y))
54
print("[+] messages: {}, {}".format(m1, m2))
55
print("[+] collided hashes of messages: {}, {}".format(my_hash(m1), my_hash(m2)))
56
57
58
# 31% 40391/131072 [00:03<00:08, 10666.20it/s]
59
# 30% 12032/40391 [00:00<00:02, 13112.15it/s]
60
#
61
# [+] seeds for messages: 404842900, 254017312
62
# [+] messages: Crypto{404842900}, Crypto{254017312}
63
# [+] collided hashes of messages: 1022927209, 1022927209
Copied!
Finally, there is a third algorithm for finding hash collisions in
O(N)O(\sqrt N)
time, namely the van Oorschot-Wiener algorithm based on Distinguished Points. While it does have additional storage requirements over Pollard's Rho, the main advantage of this algorithm is that it parallelizes extremely well, achieving
O(Nm)O(\frac{\sqrt N}{m})
runtime when run on
mm
processors.

Resources

Last modified 5mo ago