The Birthday paradox / attack

Authors: Zademn, ireland Reviewed by:

Prerequisites

  • Probability theory (for the main idea)

  • Hashes (an application)

Motivation

  • Breaking a hash function (insert story)

The birthday paradox

Birthday version

Question 1

What is the probability that at least one of $n$ people has the same birthday as you?

Solution

  • Let $A$ be the event that at least one person has the same birthday as you, and let $\bar{A}$ be the complementary event

    • The events are complementary $\Rightarrow Pr(A) = 1 - Pr(\bar{A})$

  • Let $E_i$ be the event that person $i$ does not have your birthday

Then, assuming the birthdays are independent and uniform,

  • $Pr(A) = 1 - Pr(\bar{A}) = 1 - \prod_{i=1}^n Pr(E_i) = 1 - \left( \dfrac {364} {365}\right)^n$

Question 2

What is the probability that at least 2 out of $n$ people in a room share the same birthday?

  • Suppose the birthdays are distributed independently and uniformly

Solution

  • Let $A$ be the event that 2 people have the same birthday, let $\bar{A}$ be the complementary event (no 2 people have the same birthday)

  • Event 1 = Person 1 is born $\Rightarrow Pr(E_1) = \dfrac {365} {365}$

  • Event 2 = Person 2 is born on a different day than Person 1 $\Rightarrow Pr(E_2) = \dfrac {364} {365}$

    $\vdots$

  • Event $n$ = Person $n$ is born on a different day than Persons $1, \dots, n-1$ $\Rightarrow Pr(E_n) = \dfrac {365-(n-1)} {365}$

$Pr(\bar{A}) = Pr(E_1) \cdot Pr(E_2) \cdot \dots \cdot Pr(E_n) = \dfrac {365} {365} \cdot \dfrac {364} {365} \cdot \dots \cdot \dfrac {365-(n-1)} {365} = \left( \dfrac {1} {365} \right)^{n} \cdot \dfrac {365!} {(365-n)!} = \prod_{i=1}^{n-1} \left(1 - \dfrac i {365}\right)$

General case

Question 1

  • Instead of $365$ days we have $d \Rightarrow \boxed{1 - \left( \dfrac {d-1} {d}\right)^n}$

Question 2

  • Instead of $365$ days we have $d \Rightarrow \boxed{1 - \prod_{i=1}^{n-1} \left(1 - \dfrac i {d}\right)}$

Code examples
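The two boxed formulas can be evaluated directly; a minimal sketch (the function names `p_same_as_you` and `p_any_shared` are illustrative):

```python
from math import prod

def p_same_as_you(n: int, d: int = 365) -> float:
    # Question 1: P(at least one of n people shares *your* birthday)
    return 1 - ((d - 1) / d) ** n

def p_any_shared(n: int, d: int = 365) -> float:
    # Question 2: P(some pair among n people shares a birthday)
    return 1 - prod(1 - i / d for i in range(1, n))

# The classic result: 23 people suffice for a >50% chance of a shared
# birthday, while the chance someone shares *your* birthday is still small.
print(p_same_as_you(23))  # ~0.061
print(p_any_shared(23))   # ~0.507
```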

A useful approximation

From the Taylor expansion $e^x = 1 + x + \dfrac {x^2} {2!} + \dots$ we know $e^x \approx 1 + x$ for $x \ll 1$. Applying this to each factor with $x = -\dfrac i d$ gives $1 - \dfrac i d \approx e^{-i/d}$, so $Pr(A) \approx 1 - \prod_{i=1}^{n-1}e^{-i/d} = 1-e^{-n(n-1) /{2d}} \approx 1-\boxed{e^{-{n^2} / {2d}}}$

If we want to solve for $n$ knowing $Pr(A)$ we take the $\ln \Rightarrow \boxed{n \approx \sqrt{2d \ln \left(\dfrac 1 {1-Pr(A)}\right)}}$
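Both the approximation and its inversion can be checked numerically; a small sketch (function names are illustrative):

```python
import math

def p_exact(n: int, d: int = 365) -> float:
    # Exact: 1 - prod_{i=1}^{n-1} (1 - i/d)
    p = 1.0
    for i in range(1, n):
        p *= 1 - i / d
    return 1 - p

def p_approx(n: int, d: int = 365) -> float:
    # Approximation: 1 - e^{-n^2 / 2d}
    return 1 - math.exp(-n * n / (2 * d))

def n_for_probability(p: float, d: int = 365) -> float:
    # Inverted approximation: n ~ sqrt(2 d ln(1/(1-p)))
    return math.sqrt(2 * d * math.log(1 / (1 - p)))

print(p_exact(23), p_approx(23))  # both close to 0.5
print(n_for_probability(0.5))     # ~22.5, matching the 23-person rule of thumb
```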

Finding a collision

  • Let $H:\mathcal{M} \longrightarrow \mathcal{T}$ be a hash function with $|\mathcal{M}| \gg |\mathcal{T}|$

  • Let's denote $N = |\mathcal{T}|$

Algorithm

1. Choose $s \approx \sqrt{N}$ random distinct messages in $\mathcal{M}$

2. Compute $t_i = H(m_i)$ for $1\leq i \leq s$

3. Look for a collision ($t_i = t_j$ with $i \neq j$). If none is found, go to step 1

Example:

Consider a toy hash function with a small digest space. We can write a function that hashes random messages until two of them collide, and then use it to recover a colliding pair.
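A minimal sketch of such an example, assuming MD5 truncated to 3 bytes as the toy hash (so $N = 2^{24}$; the names `toy_hash` and `find_collision` are illustrative):

```python
import hashlib
import secrets

def toy_hash(msg: bytes) -> bytes:
    # Toy hash: MD5 truncated to 3 bytes, so N = 2^24 and sqrt(N) = 2^12
    return hashlib.md5(msg).digest()[:3]

def find_collision(num_samples: int = 2**13):
    # Hash ~sqrt(N) random messages; by the birthday bound, a collision
    # within one batch happens with probability ~ 1 - e^{-2} ~ 0.86
    while True:
        seen = {}  # hash value -> message
        for _ in range(num_samples):
            m = secrets.token_bytes(16)
            t = toy_hash(m)
            if t in seen and seen[t] != m:
                return seen[t], m
            seen[t] = m
        # unlucky batch: retry with fresh random messages (step 1 again)

m1, m2 = find_collision()
print(m1.hex(), m2.hex(), toy_hash(m1) == toy_hash(m2))
```

Note that this stores every hash seen so far in a dictionary, which is exactly the $O(\sqrt N)$ storage cost discussed below.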

Eliminating Storage Requirements with Pollard's Rho

While the above algorithm works to find a hash collision in time $O(\sqrt N)$, it also requires storing $O(\sqrt N)$ hash values. As such, it represents a classic time-space tradeoff over the naive approach, which involves randomly selecting pairs of inputs until they hash to the same value. While the naive approach does not require any additional storage, it has runtime $O(N)$.

However, there is a better approach combining the best of both worlds: constant storage requirements and $O(\sqrt N)$ runtime. This approach is based on Pollard's Rho algorithm, which is better known for its application to solving discrete logarithms. The core insight behind the algorithm is that, by the Birthday Paradox, we expect to encounter a hash collision after trying $O(\sqrt N)$ random inputs. However, it is possible to detect whether a collision has occurred without needing to store all of the inputs and hashes, if the inputs are chosen in a clever way.

This is done by choosing the next input based on the hash of the previous input, according to the following sequence:

$$x_0 = g(\text{seed}) \\ x_1 = g(H(x_0)) \\ x_2 = g(H(x_1)) \\ \dots \\ x_{n+1} = g(H(x_n))$$

Where $H:\mathcal{M} \longrightarrow \mathcal{T}$ is our hash function and $g: \mathcal{T} \longrightarrow \mathcal{M}$ is a "sufficiently random" function which takes a hash value and produces a new message. We define the composition of the two functions to be $f = H \circ g : \mathcal{T} \longrightarrow \mathcal{T}$.

By the Birthday Paradox, we expect that the sequence will have a collision (where $x_i = x_j$ for two distinct indices $i, j$) after $O(\sqrt N)$ values. But once this occurs, the sequence will begin to cycle, because $x_{i+1} = g(H(x_i)) = g(H(x_j)) = x_{j+1}$.

Therefore, we can detect that a collision has occurred by using standard cycle-detection algorithms, such as Floyd's tortoise and hare!

And finally, we can locate the first place in the sequence where the collision occurred, which will let us determine what the colliding inputs to the hash function are. This is done by determining how many iterations apart the colliding inputs are, and then stepping together one iteration at a time until the collision occurs.

For Floyd's tortoise and hare, this is done by noting that when we found our collision after $n$ iterations, we were comparing $H(x_{n}) = H(x_{2n})$. And because the sequence is cyclic, if the first colliding input is $x_i$, then it collides with $H(x_{i}) = H(x_{i+n})$. So we define the new sequence $y_i = x_{n+i}$, and step through the $x_i$ and $y_i$ sequences together until we find our collision!

This is implemented in the following code snippet.
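A sketch of this approach, reusing a 3-byte truncated MD5 as the toy hash (the names `toy_hash`, `g`, `f`, and `rho_collision` are illustrative):

```python
import hashlib

def toy_hash(msg: bytes) -> bytes:
    # Toy hash H: MD5 truncated to 3 bytes, so N = |T| = 2^24
    return hashlib.md5(msg).digest()[:3]

def g(t: bytes) -> bytes:
    # A "sufficiently random" injective map from hash values back to messages
    return b"msg-" + t

def f(t: bytes) -> bytes:
    # f = H o g : T -> T
    return toy_hash(g(t))

def rho_collision():
    for s in range(256):  # retry with a fresh seed in the degenerate case
        seed = bytes([s, 0, 0])
        # Phase 1: Floyd's cycle detection on t_{k+1} = f(t_k),
        # using only O(1) storage (two pointers into the sequence)
        tortoise, hare = f(seed), f(f(seed))
        while tortoise != hare:
            tortoise = f(tortoise)
            hare = f(f(hare))
        # Phase 2: restart one pointer at the seed; stepping both pointers
        # one iteration at a time, they first agree at the cycle entrance,
        # so just before that we hold two distinct colliding inputs.
        t1, t2 = seed, tortoise
        if t1 == t2:
            continue  # seed was already on the cycle; try another seed
        while f(t1) != f(t2):
            t1, t2 = f(t1), f(t2)
        return g(t1), g(t2)

m1, m2 = rho_collision()
print(m1.hex(), m2.hex(), toy_hash(m1) == toy_hash(m2))
```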

Finally, there is a third algorithm for finding hash collisions in $O(\sqrt N)$ time, namely the van Oorschot-Wiener algorithm based on Distinguished Points. While it does have additional storage requirements compared to Pollard's Rho, the main advantage of this algorithm is that it parallelizes extremely well, achieving $O\left(\frac{\sqrt N}{m}\right)$ runtime when run on $m$ processors.

Resources
