arrow-left

All pages
gitbookPowered by GitBook
1 of 1

Loading...

The Birthday paradox / attack

Authors: Zademn, ireland Reviewed by:

Prerequisites

  • Probability theory (for the main idea)

  • Hashes (an application)

Motivation

  • Breaking a hash function (insert story)

hashtag
The birthday paradox

circle-exclamation

*insert story / introduction about why it's called a paradox + use*

hashtag
Birthday version

Question 1

What is the probability that 1 person has the same birthday as you?

Solution

  • Let be the event that someone has the same birthday as you and be the complementary event

    • The events are mutually exclusive =>

  • Let be the events that person does not have your birthday

Then

circle-check

Reminder: if are independent events

Question 2

What is the probability that 2 out of people in a room share the same birthday?

  • Suppose the birthdays are distributed independently and uniformly

Solution

  • Let be the event that 2 people have the same birthday, let be the complementary event (no 2 people have the same birthday)

  • Event 1 = Person 1 is born =>

  • Event 2 = Person 2 is born on a different day than Person 1 =>

hashtag
General case

Question 1

  • Instead of days we have

Question 2

  • Instead of days we have

Code examples

hashtag
An useful approximation

From the Taylor approximation we know for Apply for each event:

If we want to solve for knowing we take the =>

hashtag
Finding a collision

  • Let be a hash function with

  • Let's denote

Algorithm

1. Choose random distinct messages in

2. Compute for

3. Look for If not found go to step 1

Example:

Consider the following hash function:

We make the following function to find the hashes:

We use it as follows:

hashtag
Eliminating Storage Requirements with Pollard's Rho

While the above algorithm works to find a hash collision in time , it also requires storing hash values. As such, it represents a classic time-space tradeoff over the naive approach, which involves randomly selecting pairs of inputs until they hash to the same value. While the naive approach does not require any additional storage, it does have runtime .

However, there is a better approach combining the best of both worlds: constant storage requirements and runtime. This approach is based on Pollard's Rho algorithm, which is better-known for its application to solving discrete logarithms. The core insight behind the algorithm is that by the Birthday Paradox, we expect to encounter a hash collision after trying random inputs. However, it is possible to detect whether a collision has occurred without needing to store all of the inputs and hashes if the inputs are chosen in a clever way.

This is done by choosing the next input based on the hash of the previous input, according to the following sequence:

Where is our hash function and is a "sufficiently random" function which takes a hash value and produces a new. We define the composition of the functions to be.

By the Birthday Paradox, we expect that the sequence will have a collision (where for two distinct values ) after values. But once this occurs, then the sequence will begin to cycle, because .

Therefore, we can detect that a collision has occurred by using standard cycle-detection algorithms, such as !

And finally, we can locate the first place in the sequence where the collision occurred, which will let us determine what the colliding inputs to the hash function are. This is done by determining how many iterations apart the colliding inputs are, and then stepping together one iteration at a time until the collision occurs.

For Floyd's tortoise and hare, this is done by noting that when we found our collision after iterations, we were comparing . And because the sequence is cyclic, if the first colliding input is , then it collides with. So we define the new sequence , and step through the and sequences together until we find our collision!

This is implemented in the following code snippet.

Finally, there is a third algorithm for finding hash collisions in time, namely the based on Distinguished Points. While it does have additional storage requirements over Pollard's Rho, the main advantage of this algorithm is that it parallelizes extremely well, achieving runtime when run on processors.

hashtag
Resources

  • - wiki entry

  • - wiki entry

  • - vsauce2 video

⋮\vdots⋮
  • Event n = Person n is born on a different day than Person 1,...,n−1⇒1,...,n-1 \Rightarrow1,...,n−1⇒ ⇒Pr(En)=365−(n−1)365\Rightarrow Pr(E_n) = \dfrac {365-(n-1)} {365}⇒Pr(En​)=365365−(n−1)​

  • van Oorschot-Wiener Parallel Collision Search with Cryptanalytic Applicationsarrow-up-right

  • http://www.cs.umd.edu/~jkatz/imc/hash-erratum.pdfarrow-up-right

  • https://crypto.stackexchange.com/questions/3295/how-does-a-birthday-attack-on-a-hashing-algorithm-work?rq=1arrow-up-right

  • AAA
    Aˉ\bar{A}Aˉ
    Pr(A)=1−Pr(Aˉ)Pr(A) = 1 - Pr(\bar{A})Pr(A)=1−Pr(Aˉ)
    EiE_iEi​
    iii
    Pr(A)=1−Pr(Aˉ)=1−∏i=1nPr(Ei)=1−(364365)nPr(A) = 1 - Pr(\bar{A}) = 1 - \prod_{i=1}^n Pr(E_i) = 1 - \left( \dfrac {364} {365}\right)^nPr(A)=1−Pr(Aˉ)=1−∏i=1n​Pr(Ei​)=1−(365364​)n
    Pr(A,B)=Pr(A)⋅Pr(B)Pr(A, B) = Pr(A) \cdot Pr(B)Pr(A,B)=Pr(A)⋅Pr(B)
    A,BA, BA,B
    nnn
    AAA
    Aˉ\bar{A}Aˉ
    Pr(E1)=365365Pr(E_1) = \dfrac {365} {365}Pr(E1​)=365365​
    Pr(E2)=364365Pr(E_2) = \dfrac {364} {365}Pr(E2​)=365364​
    Pr(Aˉ)=Pr(E1)⋅Pr(E2)⋅⋯⋅Pr(En)=365365⋅364365⋅⋯⋅365−(n−1)365=(1365)n⋅365!(365−n)!=∏i=1n−1(1−i365)Pr(\bar{A}) = Pr(E1) \cdot Pr(E_2) \cdot \dots \cdot Pr(E_n) = \dfrac {365} {365} \cdot \dfrac {364} {365} \cdot \dots \cdot \dfrac {365-(n-1)} {365} = \left( \dfrac {1} {365} \right) ^{n} \cdot \dfrac {365!} {(365-n)!} = \prod_{i=1}^{n-1} \left(1 - \dfrac i {365}\right)Pr(Aˉ)=Pr(E1)⋅Pr(E2​)⋅⋯⋅Pr(En​)=365365​⋅365364​⋅⋯⋅365365−(n−1)​=(3651​)n⋅(365−n)!365!​=∏i=1n−1​(1−365i​)
    365365365
    d⇒1−(d−1d)nd \Rightarrow \boxed{1 - \left( \dfrac {d-1} {d}\right)^n}d⇒1−(dd−1​)n​
    365365365
    d⇒1−∏i=1n−1(1−id)d \Rightarrow \boxed{1 - \prod_{i=1}^{n-1} \left(1 - \dfrac i {d}\right)}d⇒1−i=1∏n−1​(1−di​)​
    ex=1+x+x22!+⋯=>ex≈1+xe^x = 1 + x + \dfrac {x^2} {2!} + \dots => e_x\approx 1 + xex=1+x+2!x2​+⋯=>ex​≈1+x
    x≪1x \ll 1x≪1
    ⇒x=−ad⇒e−a/d≈1−ad⇒Pr(A)=1−∏i=1n−1e−i/d=1−e−n(n−1)/2d≈1−e−n2/2d\Rightarrow x = -\dfrac a d \Rightarrow e^{ -a /d} \approx 1- \dfrac a d \Rightarrow Pr(A) = 1 - \prod_{i=1}^{n-1}e^{-i/d} = 1-e^{-n(n-1) /{2d}} \approx 1-\boxed{e^{-{n^2} / {2d}}}⇒x=−da​⇒e−a/d≈1−da​⇒Pr(A)=1−∏i=1n−1​e−i/d=1−e−n(n−1)/2d≈1−e−n2/2d​
    nnn
    Pr(A)Pr(A)Pr(A)
    ln⁡\lnln
    n≈2dln⁡(11−Pr(A))\boxed{n \approx \sqrt{2d \ln \left(\dfrac 1 {1-Pr(A)}\right)}}n≈2dln(1−Pr(A)1​)​​
    H:M⟶TH:\mathcal{M} \longrightarrow \mathcal{T}H:M⟶T
    ∣M∣>>∣T∣|\mathcal{M}| >> |T|∣M∣>>∣T∣
    N=∣T∣N = |\mathcal{T}|N=∣T∣
    s≈Ns \approx \sqrt{N}s≈N​
    M\mathcal{M}M
    ti=H(mi)t_i = H(m_i)ti​=H(mi​)
    1≤i≤N1\leq i \leq \sqrt{N}1≤i≤N​
    (ti=tj)→(t_i = t_j) \to(ti​=tj​)→
    O(N)O(\sqrt N )O(N​)
    O(N)O(\sqrt N )O(N​)
    O(N)O(N )O(N)
    O(N)O(\sqrt N)O(N​)
    O(N)O(\sqrt N)O(N​)
    x0=g(seed)x1=g(H(x0))x2=g(H(x1))…xn+1=g(H(xn))x_0 = g(seed) \\ x_1 = g(H(x_0)) \\ x_2 = g(H(x_1)) \\ \dots \\ x_{n+1} = g(H(x_n)) \\x0​=g(seed)x1​=g(H(x0​))x2​=g(H(x1​))…xn+1​=g(H(xn​))
    H:M⟶TH:\mathcal{M} \longrightarrow \mathcal{T}H:M⟶T
    g:T⟶Mg: \mathcal{T} \longrightarrow \mathcal{M} g:T⟶M
    f=H∘g:T⟶Tf = H \circ g : \mathcal{T} \longrightarrow \mathcal{T} f=H∘g:T⟶T
    xi=xjx_i = x_jxi​=xj​
    i,ji,ji,j
    O(N)O(\sqrt N)O(N​)
    xi+1=g(H(xi))=g(H(xj))=xj+1x_{i+1} = g(H(x_i)) = g(H(x_j)) = x_{j+1}xi+1​=g(H(xi​))=g(H(xj​))=xj+1​
    nnn
    H(xn)=H(x2n)H(x_{n}) = H(x_{2 n})H(xn​)=H(x2n​)
    xix_ixi​
    H(xi)=H(xi+n)H(x_{i}) = H(x_{i+n})H(xi​)=H(xi+n​)
    yi=xn+iy_i = x_{n+i}yi​=xn+i​
    xix_ixi​
    yiy_iyi​
    O(N)O(\sqrt N)O(N​)
    O(Nm)O(\frac{\sqrt N}{m})O(mN​​)
    mmm
    Floyd's tortoise and harearrow-up-right
    van Oorschot-Wiener algorithmarrow-up-right
    https://en.wikipedia.org/wiki/Birthday_problemarrow-up-right
    https://en.wikipedia.org/wiki/Birthday_attackarrow-up-right
    https://www.youtube.com/watch?v=ofTb57aZHZsarrow-up-right
    def my_birthday(n, d):
        return 1 - pow((d-1)/d , n)
    
    def same_birthday(n, d):
        p = 1
        for i in range(1, n): #1 -> n-1
            p*=(1-i/d)
        return 1 - p
    
    print(same_birthday(23, 365), same_birthday(32, 365), same_birthday(100, 365))
    # (0.5072972343239854, 0.7533475278503207, 0.9999996927510721)
    def approx_same_birthday(n, d):
        return 1 - pow(e, -pow(n, 2) / (2*d)).n()
        
    print(approx_same_birthday(23, 365)) 
    print(approx_same_birthday(32, 365))
    print(approx_same_birthday(100, 365))
    
    # 0.515509538061517
    # 0.754077719532824
    # 0.999998876014983
    import hashlib
    import random
    from Crypto.Util.number import long_to_bytes, bytes_to_long
    
    def small_hash(m, hash_bits):
        '''
        Arguments
            m: bytes -- input
            hash_bits: int -- number of bits of the hash
        Returns:
            {bytes} - hash of the message of dimension `hash_bits`
            '''
        assert hash_bits > 0, "no negative number of bits"
        
        mask = (1 << hash_bits) - 1 # Mask of bits
        t = hashlib.sha256(m).digest() # the hash in bytes
        t = bytes_to_long(t)
        t = t & mask # get the last 12 bits
        return long_to_bytes(t)
    
    def small_hash_colision(M_bits, T_bits):
        '''
        Arguments
            M_bits: int -- dimension of the message space
            T_bits: int -- dimension of the hash space
        Returns:
            {(bytes, bytes, bytes)} -- message1, message2, hash
            or
            {(b'', b'', b'')} -- if no collision found
        '''
        N = 1<<T_bits 
        print('Hash size: ', N)
        # This is `s`
        num_samples = 1 * isqrt(N)
        num_samples += num_samples//5 + 1 # num_samples = 1.2 * isqrt(N) + 1
        print(f'Making a list of {num_samples} hashes')
        print(f'Probability of finding a collision is {same_birthday(num_samples, N)}')
        
        m_list = []
        t_list = []
        
        # Make a list of hashes
        for i in range(num_samples):
            m = random.getrandbits(M_bits) # Get random message
            #m = random.randint(0, M_bits-1) 
            m = long_to_bytes(m)
            
            t = small_hash(m, T_bits) # Hash it
            if m not in m_list:
                m_list.append(m)
                t_list.append(t)
                
        # Check for collisionn
        for i in range(len(t_list)):
            for j in range(i+1, len(t_list)):
                if t_list[i] == t_list[j]:
                    print('Collision found!')
                    return m_list[i], m_list[j], t_list[i]
        else:
            print('Collision not found :(')
            return b"", b"", b""
    M_bits = 30
    T_bits = 20
    
    m1, m2, t = small_hash_colision(M_bits, T_bits)
    print(m1, m2, t)
    print(small_hash(m1, T_bits) == small_hash(m2, T_bits))
    
    # Hash size:  1048576
    # Making a list of 1229 hashes
    # Probability of finding a collision is 0.513213460854798
    # Collision found!
    # b'!\xafB\xc5' b'4F\xb6w' b'\x01y\xf7'
    # True
    
    # TODO:
    # have the two hash functions used in this chapter be the same
    
    from Crypto.Hash import SHA256
    from Crypto.Util.number import bytes_to_long, long_to_bytes
    
    from tqdm.autonotebook import tqdm
    
    # bits in hash output
    n = 30
    
    # H
    def my_hash(x):
        x = bytes(x, 'ascii')
        h = SHA256.new()
        h.update(x)
        y = h.digest()
        y = bytes_to_long(y)
        y = y % (1<<n)
        y = int(y)
        return y
    
    # g
    def get_message(r):
        x = "Crypto{" + str(r) + "}"
        return x
    
    # f
    def f(r):
        return my_hash(get_message(r))
    
    """
    initialization
    
    This routine will find a hash collision in sqrt time with constant space.
    """
    
    seed = 0
    
    x0 = get_message(seed)
    x = x0
    y = f(x)
    
    """
    detect collision
    using Floyd / tortoise and hare cycle finding algorithm
    
    expected number of iterations is ~ sqrt(pi/2) * 2^(n/2),
    we run for up to 4 * 2^(n/2) iterations
    """
    for i in tqdm(range(4 << (n//2))):
            
        if f(x) == f(y):
            break
            
        x = f(x)
        
        y = f(y)
        y = f(y)
        
    
    """
    locate collision
    """
    x = x0
    y = f(y)
    
    
    for j in tqdm(range(i)):
        if f(x) == f(y):
            break
        
        x = f(x)
        
        y = f(y)
        
    
    m1 = get_message(x)
    m2 = get_message(y)
    
    assert my_hash(m1) == f(x)
    assert my_hash(m2) == f(y)
    
    print("[+] seeds for messages: {}, {}".format(x, y))
    print("[+] messages: {}, {}".format(m1, m2))
    print("[+] collided hashes of messages: {}, {}".format(my_hash(m1), my_hash(m2)))
    
    
    # 31%    40391/131072 [00:03<00:08, 10666.20it/s]
    # 30%    12032/40391 [00:00<00:02, 13112.15it/s]
    # 
    # [+] seeds for messages: 404842900, 254017312
    # [+] messages: Crypto{404842900}, Crypto{254017312}
    # [+] collided hashes of messages: 1022927209, 1022927209