Yep.
It’s widely believed that the digits of pi are “random”. I personally think this seems likely to be true. I suspect if you go far enough into the digits, you’ll find your social security number, encodings of entire works of Shakespeare, etc., by pure chance.
However, it hasn’t actually been proven that pi is “normal”—having an
even distribution of all its digits. It might be true that after a
quadrillion digits, pi stops having any 0 digits for some
reason.
It’d be infeasible to find an encoding of Hamlet since it’s so long (and therefore so unlikely to show up by random chance), but we can verify that every SSN exists in the digits of pi by actually locating them.
For security, never post your SSN into an unknown website. Even for purposes as cool as locating it in the digits of pi.
Reminder that using someone else’s SSN can constitute aggravated identity theft, a federal crime with a minimum prison sentence of two years. Just because you’ve found a number doesn’t mean you can use it.
Before embarking on my quest, I decided to have a chat with an LLM for some background information. I was especially interested to find whether anyone had done this before (it didn’t know).
If we actually locate every SSN in the digits of pi, we’ll have shown that all SSNs are in the digits of pi. In math, this is called a “constructive proof”. If you can produce something, then it exists. Sadly, my stochastic friend didn’t agree:

This is one of the worst arguments I’ve seen in a long time, from any source.

I suppose this was foreshadowing my upcoming search for meaning in the randomness.
This is the kind of problem where dealing with a billion items is table stakes.
My initial idea was to put all the SSNs in a set, search through the digits of pi in 9-digit windows, and remove found SSNs from the set. Once the set is empty, the search is over.
We could search pi for every 9-digit number and call it a day. However, SSNs have a little bit of structure beyond “9 digit number” that we can take advantage of.1
An SSN is broken into three parts: a 3-digit area, 2-digit group, and 4-digit serial number.
123-45-6789
─┬─ ┬─ ─┬──
│ │ └─ serial
│ └───── group
└──────── areaNone of the area, group, or serial numbers is allowed to be all
0s. However, they can have leading 0s.
001 is okay, but 000 is not.
Because Americans fear Satan, the area cannot be
666.
Finally, no SSN can be in the range 900-01-0001 to
999-99-9999. That is, no SSN can start with a
9. This by itself eliminates about 10% of the work we have
to do.
HashSet<String> is an almost ridiculously
straightforward representation for the set of all SSNs.
let mut ssns: HashSet<String> = HashSet::new();Following the SSN structure rules above, we can insert every possible SSN into this set.
for area in 1..=899 {
if area == 666 {
continue
}
for group in 1..=99 {
for serial in 1..=9999 {
ssns.insert(
format!("{:03}{:02}{:04}", area, group, serial)
);
}
}
}This generates 888_931_098 SSNs, so we’ve saved about
11% of the work compared to checking every 9-digit string.
On my machine, this giant HashSet took 64 seconds to
construct. Dismal!
A String uses 24 bytes of overhead—8 bytes for each of a
pointer to the character data, a length, and a capacity.
size_of::<String>() // 24 bytesBy default, a String that can store 9 digits allocates
16 bytes of capacity (even though 9 bytes would be enough).
let ssn: String = format!("{:03}{:02}{:04}", 123, 45, 6789);
ssn.capacity() // 16 bytesRepresenting each SSN as a String uses 40 total bytes
and a pointer indirection to the heap. We’ll massively improve upon this
representation below.
Rather than calculate them myself, I used the website calculat.io
as a handy way to download the first billion digits of pi. The unzipped
.txt file is a manageable 1GB—a billion digits at one byte
per digit.
Again, we use the most straightforward method possible, reading the
entire file into a big String in RAM. These days, 1GB
easily fits in most machines’ RAM. However, this still needs to do
allocation and copying, which is slow.
fs::read_to_string(&file_path)As a last step before running the code, let’s estimate the result we expect to see.
Let’s assume the digits of pi are uniformly randomly distributed.
Let \(F\) be the event “found a specific SSN in the digits we searched”. Because finding a specific SSN is unlikely and independent2 from finding any others, we’d expect \(F\) to follow a Poisson distribution. This would mean the probability that \(F\) happens \(k\) times is given by:
\[P(F = k) = \frac{\lambda^k e^{-\lambda}}{k!}\]
The probability of finding a specific SSN at least once is the complement of the probability of never finding it at all:
\[P(F \geq 1) = 1 - P(F = 0)\]
Plugging in \(k=0\) to the Poisson distribution formula, we get this reduction, where many terms cancel out:
\[P(F = 0) = \frac{\lambda^0 e^{-\lambda}}{0!} = e^{-\lambda}\]
Take the complement:
\[P(F \geq 1) = 1 - e^{-\lambda}\]
\(\lambda\) is the number of events we expect in a given interval. There is a \(\frac{1}{10^9}\) chance that a random 9-digit string matches a specific SSN. If we search through \(N\) digits of pi, there are \(N - 8\) windows of 9 digits available. This is because we can start a window at each digit except the last 8, where there isn’t enough room for a full window.
\[\lambda = \frac{N - 8}{10^9}\]
Substituting in \(\lambda\), we get this reusable formula.
\[\boxed{P(F \geq 1)_N = 1 - e^{- \frac{N - 8}{10^9}}}\]
Plugging in any number of digits \(N\) will give us the probability of finding a given SSN in the first \(N\) digits of pi.
\[P(F \geq 1) _{N=10^9} = 1 - e^{- \frac{10^9 - 8}{10^9}} \approx 63.2\%\]
The HashSet<String> version of the program takes
several minutes to run. It’s single-threaded and uses an inefficient
data structure. We’ll keep building our way to a much faster solution,
but for now, here’s the output of the final version of the
SSN-finding program, on a billion digits of pi.
[ 0.00s] Generating the set of all valid SSNs
[ 0.13s] Done generating the set of all 888_931_098 valid SSNs
[ 0.13s] Memory-mapping pi digits file
[ 0.13s] Finding SSN matches in file with 1_000_000_000 digits
[ 0.87s] Done searching for matches
[ 0.87s] Total SSNs = 888_931_098, found = 561_910_510, missing = 327_020_588
[ 1.15s] More than 1_000_000 SSNs missing, skipping file write
[ 1.15s] Done
String is not an efficient datatype for representing
SSNs. A simple numeric type would be much more performant.
Because \(2^{16} < 10^{9} <
2^{32}\), we should use a u32 to represent each SSN.
Each u32 is 4 bytes, a 10x improvement from the 40-byte
Strings we had earlier. We can convert from a slice of
bytes to a u32 with this function:
fn digits_to_u32(s: &[u8]) -> u32 {
let mut n = 0u32;
for &b in s {
n = n * 10 + (b - b'0') as u32;
}
n
}Of course, we can always convert back to String if we
need to.
fn u32_to_digits(n: u32) -> String {
let area = n / 1_000_000;
let group = (n / 10_000) % 100;
let serial = n % 10_000;
format!("{:03}{:02}{:04}", area, group, serial)
}For efficient loading of a 10GB digits file, we can memory map it.
This takes barely any extra effort compared to loading a file into a
String.
let mmap = unsafe {
Mmap::map(&file).expect("Failed to mmap file")
};For 10 billion digits, we should move off a single thread.
par_windows lets us do the exact thing we want—examine each
9-byte slice of a file as parallel overlapping windows. Again,
surprisingly small code changes can have a big impact on the low-hanging
fruit of performance.
The single-byte check window[0] != b'9' eliminates about
10% of the work, at the cost of affecting branch prediction. I found
this to be worthwhile, but your profiler may vary.
mmap.par_windows(9).for_each(|window| {
if window[0] != b'9' {
let potential_ssn = digits_to_u32(window);
ssns.remove(&potential_ssn);
}
});Initially, there wasn’t much speedup from making this program
multithreaded. This was because the HashSet<u32>
required a lock on every change, meaning all threads were contending for
one lock. DashSet is a sharded HashSet, which
only requires locking a portion of the overall set when making changes
to it. This easy data structure change unlocked a lot of parallel
performance.
The program was still fairly slow, with a 3.5 minute runtime. Often when things are slow, it means we should think about switching to a new data structure.
How much information do we really need to store about each SSN? Just whether it’s been found or not. That’s a single bit.
We can represent a BitSet by using a
Vec<AtomicU64> with 900 million total bits (SSNs
upwards of 899-99-9999 are invalid). Each bit represents
whether the SSN at that index has been found in the digits of pi
(1) or not (0). Setting a bit means locating
its AtomicU64 container, then updating a single bit in that
u64 using bit arithmetic.
900 million bits is about 113 MB—extremely reasonable for a modern
computer to handle. Building the set of valid SSNs was much faster now,
taking under 2 seconds instead of over 60. The main loop of the program
remained largely unchanged, except that instead of removing SSNs from
the set as we went along, we would now collect all the 0
bits at the end of processing, to determine which SSNs were still
missing.
A BitSet also buys us much finer granularity for
parallelism than the DashMap did. There are about 14
million separate AtomicU64s, meaning contention for each
one from a pool of 12 threads is pretty unlikely.
Using BitSet brought the runtime for a 10 billion digit
search down to 12 seconds.
Using our formula from above…
\[\boxed{P(F \geq 1)_N = 1 - e^{- \frac{N - 8}{10^9}}}\]
…we can estimate the probability of finding a given SSN in the first 10 billion digits of pi to be:
\[P(F \geq 1)_{N=10^{10}} = 1 - e^{- \frac{10^{10} - 8}{10^9}} \approx99.995\%\]
Searching 10 billion digits leads to only 40_258 missing
SSNs. Out of 889 million, that’s a tiny fraction. However, it does mean
we’re not done yet.
[ 0.00s] Generating the set of all valid SSNs
[ 0.13s] Done generating the set of all 888_931_098 valid SSNs
[ 0.13s] Memory-mapping pi digits file
[ 0.13s] Finding SSN matches in file with 10_000_000_000 digits
[ 4.28s] Done searching for matches
[ 4.28s] Total SSNs = 888_931_098, found = 888_890_840, missing = 40_258
[ 4.28s] Writing missing SSNs to missing.txt
[ 4.29s] Wrote missing SSNs to missing.txt
[ 4.29s] Done
For 1 billion and even 10 billion digits, I was able to find and
directly download .txt files full of decimal digits from
the internet. However, in the 100 billion digit range this got
tricky.
There is a
Google website with 100 trillion digits of pi (in 100 billion digit
chunks), in both decimal and hex formats. However, it’s using the
.ycd file format, which I wasn’t familiar with. We’d expect
a 100GB file, but the .ycd was 39GB, indicating some form
of compression.
The .ycd format comes from y-cruncher, a program
I’d never heard of before. It only runs on Windows and Linux, and I had
been writing my code on a Mac. Luckily, I have a Windows PC, so I was
able to download the file there and convert a 39GB .ycd to
a 100GB .txt file using y-cruncher.
I then transferred the 100GB file to my laptop using LocalSend, which took over an hour. I used the time to watch a movie.
Parallel windows are nice, but they’re not taking advantage of useful
information: most of each window overlaps with the next window. Each
9-digit window has 8 digits of reused information (marked
R):
789012345678901234
----123456789-----
-----234567890----
RRRRRRRRWith a little bit of math, we can peel off the old leftmost digit by subtracting away that digit times \(10^8\), multiply the whole number by 10, and add the new rightmost digit to get a brand new number. There’s no need to re-parse or recalculate any of the 8 digits in the middle. We fully parse the first window of every chunk of digits, then only do partial parsing until reaching the end of that chunk.
Another place ripe for some optimization at this point was the
initial BitSet population. It was taking around 1.6
seconds, and an extremely straightforward parallelization over each
area got it down to 0.13s.
(1..=899u32)
.into_par_iter()
.map(|area| {
// set every valid SSN in this `area` as missing
});Some of my favorite optimizations were the last ones I did, going deep into the inner loop of the main part of the program and trying to make it as fast as I could. I’m not an optimization expert, so surely there are speedups I missed, but this was still incredibly fun.
As an example optimization, ssn / 64 can be rewritten as
ssn >> 6 and bit shifting is much faster than
division. The Rust compiler will often spot these things for you, so it
pays to always profile with and without a change in place, or to read
the generated assembly.
The innermost part of the main loop was a call to the function below.
Marking it #[inline(always)] wasn’t necessary—the compiler
did it anyways—but I think it’s still a nice indication for humans
reading the code. Instead of bounds-checked slice accesses, it uses
pointer math. Finally, after profiling both the branching (current)
version and a branchless version, the speedup from a load
and checking (current & bit) == 0 outweighed the branch
penalty and increase in fetch_or calls.
// this should be heavily optimized, since it's in the tightest loop
#[inline(always)]
fn mark_ssn_as_found(bitset_ptr: *const AtomicU64, ssn: usize) {
let word = ssn >> 6;
let bit = 1u64 << (ssn & 63);
// SAFETY: using pointers for speedy unchecked access,
// but still loading from the same range as before
unsafe {
let current = (*bitset_ptr.add(word)).load(Ordering::Relaxed);
if (current & bit) == 0 {
(*bitset_ptr.add(word)).fetch_or(bit, Ordering::Relaxed);
}
}
}I found it interesting to poke around the assembly as a way to try to
spot missing optimizations. cargo asm makes this really
easy. The final output of
cargo asm ssns_in_pi::mark_ssn_as_found was:
ssns_in_pi::mark_ssn_as_found:
lsr x8, x1, #6
mov w9, #1
lsl x9, x9, x1
ldr x10, [x0, x8, lsl, #3]
tst x10, x9
b.eq LBB9_2
ret
LBB9_2:
add x8, x0, x8, lsl, #3
ldset x9, x8, [x8]
ret
I’m nowhere near an expert on optimization (or even reading assembly), so if you spot something obvious that would speed this program up, I’d welcome feedback!
The probability of finding a given SSN in the first 100 billion digits of pi is:
\[P(F \geq 1)_{N=10^{11}} = 1 - e^{- \frac{10^{11} - 8}{10^9}} \approx 100\%\]
The decimal expansion isn’t quite 1, but it’s close.
0.9999999999999999999999999999999999999999999627992...
Finally, the quest to ensure every SSN exists in the digits of pi was complete.
[ 0.00s] Generating the set of all valid SSNs
[ 0.13s] Done generating the set of all 888_931_098 valid SSNs
[ 0.13s] Memory-mapping pi digits file
[ 0.13s] Finding SSN matches in file with 100_000_000_000 digits
[ 36.89s] Done searching for matches
[ 36.89s] Total SSNs = 888_931_098, found = 888_931_098, missing = 0
[ 36.90s] Done
Of course, searching 100 billion digits could be overkill. Our program doesn’t stop its search just because it’s found all the possible SSNs. If we search the first 20 billion digits, there are only three missing SSNs.
[ 0.00s] Generating the set of all valid SSNs
[ 0.13s] Done generating the set of all 888_931_098 valid SSNs
[ 0.13s] Memory-mapping pi digits file
[ 0.13s] Finding SSN matches in file with 20_000_000_000 digits
[ 7.86s] Done searching for matches
[ 7.86s] Total SSNs = 888_931_098, found = 888_931_095, missing = 3
[ 7.86s] Writing missing SSNs to missing.txt
[ 7.86s] Wrote missing SSNs to missing.txt
[ 7.86s] Done
I’m not going to post specific SSNs, but we can use a verification site to check whether each is real. According to that check, the last three SSNs in pi were:
US life expectancy is 79 years,3 and 1971 was only 55 years ago. There’s probably someone still alive, perhaps somewhere in Pennsylvania, with a wildly remote SSN, not located until searching 22.4 billion digits into pi.
I wonder if they know.
The Social Security Administration actually changed the rules in 2011 to be much more “random”, but since my SSN is from before then, I’m sticking with the structure I know and love.↩︎
These aren’t truly independent events, since each window of 9 digits partially overlaps 8 other windows. We’ll ignore this for simplicity. We actually expect the Poisson distribution’s prediction to be an overestimate.↩︎
The National Center for Health Statistics puts it at 79.0 years.↩︎