CryptoBook
  • CryptoBook
  • Book Plan
  • Style Guide
    • Sample Page
  • Contributors
  • Fundamentals
    • Mathematical Notation
    • Division and Greatest common divisor
      • Euclidean Algorithm
    • Modular Arithmetic
      • Theorems of Wilson, Euler, and Fermat
        • Fermat's Little Theorem in Detail
        • Euler's Theorem in Detail
      • Quadratic Residues
    • Continued Fractions
  • Number Theory
  • Ideals
  • Polynomials With Shared Roots
  • Integer Factorization
    • Pollard rho
    • Sieves
  • Abstract algebra
    • Groups
      • Another take on groups
      • Discrete Log Problem
    • Rings
    • Fields
    • Polynomials
  • Elliptic Curves
    • Untitled
  • Lattices
    • Introduction
    • LLL reduction
      • Gram-Schmidt Orthogonalization
      • Lagrange's algorithm
      • LLL reduction
    • Lattice reduction
      • Minkowski reduced
      • HKZ reduced
      • LLL reduced
    • Applications
      • Coppersmith algorithm
      • Extensions of Coppersmith algorithm
    • Hard lattice problems
    • Lattices of interest
    • Cryptographic lattice problems
      • Short integer solutions (SIS)
      • Learning with errors (LWE)
      • Ring-LWE
      • NTRU
    • Interactive fun
    • Resources and notations
  • Asymmetric Cryptography
  • RSA
    • Proof of correctness
    • RSA application
    • Low Private Component Attacks
      • Wiener's Attack
      • Boneh-Durfee Attack
    • Common Modulus Attack
    • Recovering the Modulus
  • Diffie-Hellman
    • MITM
  • Elliptic Curve Cryptography
  • Symmetric Cryptography
    • Encryption
    • The One Time Pad
    • AES
      • Rijndael Finite Field
      • Round Transformations
  • Hashes
    • Introduction / overview
    • The Birthday paradox / attack
  • Isogeny Based Cryptography
    • Introduction to Isogeny Cryptography
    • Isogenies
    • Isogeny and Ramanujan Graphs
  • Appendices
    • Sets and Functions
    • Probability Theory
Powered by GitBook
On this page
  • The birthday paradox
  • Birthday version
  • General case
  • Finding a collision
  • Eliminating Storage Requirements with Pollard's Rho
  • Resources

Was this helpful?

Export as PDF
  1. Hashes

The Birthday paradox / attack

PreviousIntroduction / overviewNextIntroduction to Isogeny Cryptography

Last updated 4 years ago

Was this helpful?

Authors: Zademn, ireland Reviewed by:

Prerequisites

  • Probability theory (for the main idea)

  • Hashes (an application)

Motivation

  • Breaking a hash function (insert story)

The birthday paradox

*insert story / introduction about why it's called a paradox + use*

Birthday version

Question 1

What is the probability that 1 person has the same birthday as you?

Solution

  • Let AAA be the event that someone has the same birthday as you and Aˉ\bar{A}Aˉ be the complementary event

    • The events are mutually exclusive => Pr(A)=1−Pr(Aˉ)Pr(A) = 1 - Pr(\bar{A})Pr(A)=1−Pr(Aˉ)

  • Let EiE_iEi​ be the events that person iii does not have your birthday

Then

Question 2

  • Suppose the birthdays are distributed independently and uniformly

Solution

General case

Question 1

Question 2

Code examples

def my_birthday(n, d):
    return 1 - pow((d-1)/d , n)

def same_birthday(n, d):
    p = 1
    for i in range(1, n): #1 -> n-1
        p*=(1-i/d)
    return 1 - p

print(same_birthday(23, 365), same_birthday(32, 365), same_birthday(100, 365))
# (0.5072972343239854, 0.7533475278503207, 0.9999996927510721)

An useful approximation

def approx_same_birthday(n, d):
    return 1 - pow(e, -pow(n, 2) / (2*d)).n()
    
print(approx_same_birthday(23, 365)) 
print(approx_same_birthday(32, 365))
print(approx_same_birthday(100, 365))

# 0.515509538061517
# 0.754077719532824
# 0.999998876014983

Finding a collision

Algorithm

Example:

Consider the following hash function:

import hashlib
import random
from Crypto.Util.number import long_to_bytes, bytes_to_long

def small_hash(m, hash_bits):
    '''
    Arguments
        m: bytes -- input
        hash_bits: int -- number of bits of the hash
    Returns:
        {bytes} - hash of the message of dimension `hash_bits`
        '''
    assert hash_bits > 0, "no negative number of bits"
    
    mask = (1 << hash_bits) - 1 # Mask of bits
    t = hashlib.sha256(m).digest() # the hash in bytes
    t = bytes_to_long(t)
    t = t & mask # get the last 12 bits
    return long_to_bytes(t)

We make the following function to find the hashes:


def small_hash_colision(M_bits, T_bits):
    '''
    Arguments
        M_bits: int -- dimension of the message space
        T_bits: int -- dimension of the hash space
    Returns:
        {(bytes, bytes, bytes)} -- message1, message2, hash
        or
        {(b'', b'', b'')} -- if no collision found
    '''
    N = 1<<T_bits 
    print('Hash size: ', N)
    # This is `s`
    num_samples = 1 * isqrt(N)
    num_samples += num_samples//5 + 1 # num_samples = 1.2 * isqrt(N) + 1
    print(f'Making a list of {num_samples} hashes')
    print(f'Probability of finding a collision is {same_birthday(num_samples, N)}')
    
    m_list = []
    t_list = []
    
    # Make a list of hashes
    for i in range(num_samples):
        m = random.getrandbits(M_bits) # Get random message
        #m = random.randint(0, M_bits-1) 
        m = long_to_bytes(m)
        
        t = small_hash(m, T_bits) # Hash it
        if m not in m_list:
            m_list.append(m)
            t_list.append(t)
            
    # Check for collisionn
    for i in range(len(t_list)):
        for j in range(i+1, len(t_list)):
            if t_list[i] == t_list[j]:
                print('Collision found!')
                return m_list[i], m_list[j], t_list[i]
    else:
        print('Collision not found :(')
        return b"", b"", b""

We use it as follows:

M_bits = 30
T_bits = 20

m1, m2, t = small_hash_colision(M_bits, T_bits)
print(m1, m2, t)
print(small_hash(m1, T_bits) == small_hash(m2, T_bits))

# Hash size:  1048576
# Making a list of 1229 hashes
# Probability of finding a collision is 0.513213460854798
# Collision found!
# b'!\xafB\xc5' b'4F\xb6w' b'\x01y\xf7'
# True

Eliminating Storage Requirements with Pollard's Rho

This is done by choosing the next input based on the hash of the previous input, according to the following sequence:

# TODO:
# have the two hash functions used in this chapter be the same

from Crypto.Hash import SHA256
from Crypto.Util.number import bytes_to_long, long_to_bytes

from tqdm.autonotebook import tqdm

# bits in hash output
n = 30

# H
def my_hash(x):
    x = bytes(x, 'ascii')
    h = SHA256.new()
    h.update(x)
    y = h.digest()
    y = bytes_to_long(y)
    y = y % (1<<n)
    y = int(y)
    return y

# g
def get_message(r):
    x = "Crypto{" + str(r) + "}"
    return x

# f
def f(r):
    return my_hash(get_message(r))

And finally, we can locate the first place in the sequence where the collision occurred, which will let us determine what the colliding inputs to the hash function are. This is done by determining how many iterations apart the colliding inputs are, and then stepping together one iteration at a time until the collision occurs.

This is implemented in the following code snippet.

"""
initialization

This routine will find a hash collision in sqrt time with constant space.
"""

seed = 0

x0 = get_message(seed)
x = x0
y = f(x)

"""
detect collision
using Floyd / tortoise and hare cycle finding algorithm

expected number of iterations is ~ sqrt(pi/2) * 2^(n/2),
we run for up to 4 * 2^(n/2) iterations
"""
for i in tqdm(range(4 << (n//2))):
        
    if f(x) == f(y):
        break
        
    x = f(x)
    
    y = f(y)
    y = f(y)
    

"""
locate collision
"""
x = x0
y = f(y)


for j in tqdm(range(i)):
    if f(x) == f(y):
        break
    
    x = f(x)
    
    y = f(y)
    

m1 = get_message(x)
m2 = get_message(y)

assert my_hash(m1) == f(x)
assert my_hash(m2) == f(y)

print("[+] seeds for messages: {}, {}".format(x, y))
print("[+] messages: {}, {}".format(m1, m2))
print("[+] collided hashes of messages: {}, {}".format(my_hash(m1), my_hash(m2)))


# 31%    40391/131072 [00:03<00:08, 10666.20it/s]
# 30%    12032/40391 [00:00<00:02, 13112.15it/s]
# 
# [+] seeds for messages: 404842900, 254017312
# [+] messages: Crypto{404842900}, Crypto{254017312}
# [+] collided hashes of messages: 1022927209, 1022927209

Resources

Pr(A)=1−Pr(Aˉ)=1−∏i=1nPr(Ei)=1−(364365)nPr(A) = 1 - Pr(\bar{A}) = 1 - \prod_{i=1}^n Pr(E_i) = 1 - \left( \dfrac {364} {365}\right)^nPr(A)=1−Pr(Aˉ)=1−∏i=1n​Pr(Ei​)=1−(365364​)n

Reminder: Pr(A,B)=Pr(A)⋅Pr(B)Pr(A, B) = Pr(A) \cdot Pr(B)Pr(A,B)=Pr(A)⋅Pr(B)if A,BA, BA,Bare independent events

What is the probability that 2 out of nnn people in a room share the same birthday?

Let AAA be the event that 2 people have the same birthday, let Aˉ\bar{A}Aˉ be the complementary event (no 2 people have the same birthday)

Event 1 = Person 1 is born => Pr(E1)=365365Pr(E_1) = \dfrac {365} {365}Pr(E1​)=365365​

Event 2 = Person 2 is born on a different day than Person 1 => Pr(E2)=364365Pr(E_2) = \dfrac {364} {365}Pr(E2​)=365364​

⋮\vdots⋮

Event n = Person n is born on a different day than Person 1,...,n−1⇒1,...,n-1 \Rightarrow1,...,n−1⇒ ⇒Pr(En)=365−(n−1)365\Rightarrow Pr(E_n) = \dfrac {365-(n-1)} {365}⇒Pr(En​)=365365−(n−1)​

Pr(Aˉ)=Pr(E1)⋅Pr(E2)⋅⋯⋅Pr(En)=365365⋅364365⋅⋯⋅365−(n−1)365=(1365)n⋅365!(365−n)!=∏i=1n−1(1−i365)Pr(\bar{A}) = Pr(E1) \cdot Pr(E_2) \cdot \dots \cdot Pr(E_n) = \dfrac {365} {365} \cdot \dfrac {364} {365} \cdot \dots \cdot \dfrac {365-(n-1)} {365} = \left( \dfrac {1} {365} \right) ^{n} \cdot \dfrac {365!} {(365-n)!} = \prod_{i=1}^{n-1} \left(1 - \dfrac i {365}\right)Pr(Aˉ)=Pr(E1)⋅Pr(E2​)⋅⋯⋅Pr(En​)=365365​⋅365364​⋅⋯⋅365365−(n−1)​=(3651​)n⋅(365−n)!365!​=∏i=1n−1​(1−365i​)

Instead of 365365365 days we have d⇒1−(d−1d)nd \Rightarrow \boxed{1 - \left( \dfrac {d-1} {d}\right)^n}d⇒1−(dd−1​)n​

Instead of 365365365 days we have d⇒1−∏i=1n−1(1−id)d \Rightarrow \boxed{1 - \prod_{i=1}^{n-1} \left(1 - \dfrac i {d}\right)}d⇒1−i=1∏n−1​(1−di​)​

From the Taylor approximation we know ex=1+x+x22!+⋯=>ex≈1+xe^x = 1 + x + \dfrac {x^2} {2!} + \dots => e_x\approx 1 + xex=1+x+2!x2​+⋯=>ex​≈1+x for x≪1x \ll 1x≪1 Apply for each event: ⇒x=−ad⇒e−a/d≈1−ad⇒Pr(A)=1−∏i=1n−1e−i/d=1−e−n(n−1)/2d≈1−e−n2/2d\Rightarrow x = -\dfrac a d \Rightarrow e^{ -a /d} \approx 1- \dfrac a d \Rightarrow Pr(A) = 1 - \prod_{i=1}^{n-1}e^{-i/d} = 1-e^{-n(n-1) /{2d}} \approx 1-\boxed{e^{-{n^2} / {2d}}}⇒x=−da​⇒e−a/d≈1−da​⇒Pr(A)=1−∏i=1n−1​e−i/d=1−e−n(n−1)/2d≈1−e−n2/2d​

If we want to solve for nnn knowing Pr(A)Pr(A)Pr(A) we take the ln⁡\lnln => n≈2dln⁡(11−Pr(A))\boxed{n \approx \sqrt{2d \ln \left(\dfrac 1 {1-Pr(A)}\right)}}n≈2dln(1−Pr(A)1​)​​

Let H:M⟶TH:\mathcal{M} \longrightarrow \mathcal{T}H:M⟶T be a hash function with ∣M∣>>∣T∣|\mathcal{M}| >> |T|∣M∣>>∣T∣

Let's denote N=∣T∣N = |\mathcal{T}|N=∣T∣

1. Choose s≈Ns \approx \sqrt{N}s≈N​ random distinct messages in M\mathcal{M}M

2. Compute ti=H(mi)t_i = H(m_i)ti​=H(mi​) for 1≤i≤N1\leq i \leq \sqrt{N}1≤i≤N​

3. Look for (ti=tj)→(t_i = t_j) \to(ti​=tj​)→ If not found go to step 1

While the above algorithm works to find a hash collision in time O(N)O(\sqrt N )O(N​), it also requires storing O(N)O(\sqrt N )O(N​)hash values. As such, it represents a classic time-space tradeoff over the naive approach, which involves randomly selecting pairs of inputs until they hash to the same value. While the naive approach does not require any additional storage, it does have runtime O(N)O(N )O(N).

However, there is a better approach combining the best of both worlds: constant storage requirements and O(N)O(\sqrt N)O(N​)runtime. This approach is based on Pollard's Rho algorithm, which is better-known for its application to solving discrete logarithms. The core insight behind the algorithm is that by the Birthday Paradox, we expect to encounter a hash collision after trying O(N)O(\sqrt N)O(N​)random inputs. However, it is possible to detect whether a collision has occurred without needing to store all of the inputs and hashes if the inputs are chosen in a clever way.

x0=g(seed)x1=g(H(x0))x2=g(H(x1))…xn+1=g(H(xn))x_0 = g(seed) \\ x_1 = g(H(x_0)) \\ x_2 = g(H(x_1)) \\ \dots \\ x_{n+1} = g(H(x_n)) \\x0​=g(seed)x1​=g(H(x0​))x2​=g(H(x1​))…xn+1​=g(H(xn​))

Where H:M⟶TH:\mathcal{M} \longrightarrow \mathcal{T}H:M⟶T is our hash function and g:T⟶Mg: \mathcal{T} \longrightarrow \mathcal{M} g:T⟶M is a "sufficiently random" function which takes a hash value and produces a new. We define the composition of the functions to bef=H∘g:T⟶Tf = H \circ g : \mathcal{T} \longrightarrow \mathcal{T} f=H∘g:T⟶T.

By the Birthday Paradox, we expect that the sequence will have a collision (where xi=xjx_i = x_jxi​=xj​ for two distinct values i,ji,ji,j) after O(N)O(\sqrt N)O(N​)values. But once this occurs, then the sequence will begin to cycle, because xi+1=g(H(xi))=g(H(xj))=xj+1x_{i+1} = g(H(x_i)) = g(H(x_j)) = x_{j+1}xi+1​=g(H(xi​))=g(H(xj​))=xj+1​.

Therefore, we can detect that a collision has occurred by using standard cycle-detection algorithms, such as !

For Floyd's tortoise and hare, this is done by noting that when we found our collision after nnn iterations, we were comparing H(xn)=H(x2n)H(x_{n}) = H(x_{2 n})H(xn​)=H(x2n​). And because the sequence is cyclic, if the first colliding input is xix_ixi​, then it collides withH(xi)=H(xi+n)H(x_{i}) = H(x_{i+n})H(xi​)=H(xi+n​). So we define the new sequence yi=xn+iy_i = x_{n+i}yi​=xn+i​, and step through the xix_ixi​ and yiy_iyi​ sequences together until we find our collision!

Finally, there is a third algorithm for finding hash collisions in O(N)O(\sqrt N)O(N​)time, namely the based on Distinguished Points. While it does have additional storage requirements over Pollard's Rho, the main advantage of this algorithm is that it parallelizes extremely well, achieving O(Nm)O(\frac{\sqrt N}{m})O(mN​​)runtime when run on mmm processors.

- wiki entry

- wiki entry

- vsauce2 video

Floyd's tortoise and hare
https://en.wikipedia.org/wiki/Birthday_problem
https://en.wikipedia.org/wiki/Birthday_attack
https://www.youtube.com/watch?v=ofTb57aZHZs
van Oorschot-Wiener Parallel Collision Search with Cryptanalytic Applications
http://www.cs.umd.edu/~jkatz/imc/hash-erratum.pdf
https://crypto.stackexchange.com/questions/3295/how-does-a-birthday-attack-on-a-hashing-algorithm-work?rq=1
van Oorschot-Wiener algorithm