Debugging Multithreaded Rust Scaling Problems
I was recently listening to one of my go-to podcasts, A Problem Squared, where the hosts tackled the problem of working out the probabilities of the outcomes of a particular card game. They approached it by simulating the game 10 million times and collecting statistics on the results. I'd been wanting a small Rust project to work on, so I thought it would be fun to do this in Rust and run 1B+ simulations to see how the results compared. I learned far more than I expected.
Background: Writing the Simulation
The card game is fairly simple, and the specifics aren't really important to this post. In short, it's a one-player game that is entirely deterministic once the deck is shuffled: there are no choices for the player to make.
My code looks roughly like this:
use game_stats::deck::Deck;
use rayon::prelude::*;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread::available_parallelism;

fn main() {
    let n_games: usize = std::env::args()
        .nth(1)
        .map(|s| {
            s.parse::<usize>().unwrap_or_else(|_| {
                eprintln!("Invalid number of games '{}', using default 100_000_000", s);
                100_000_000usize
            })
        })
        .unwrap_or(100_000_000usize);

    // Print diagnostic information about Rayon configuration
    println!("Available parallelism: {:?}", available_parallelism());
    println!("Rayon thread pool size: {}", rayon::current_num_threads());
    println!("Running {} games...", n_games);

    let counts: Arc<Vec<AtomicU64>> = Arc::new((0..=52).map(|_| AtomicU64::new(0)).collect());

    let start = std::time::Instant::now();
    (0..n_games)
        .into_par_iter()
        .for_each_with(counts.clone(), |local_counts, _i| {
            let outcome = run_game();
            local_counts[outcome as usize].fetch_add(1, Ordering::Relaxed);
        });
    let elapsed = start.elapsed();

    println!("Elapsed time: {:.2?} for {} games", elapsed, n_games);
    println!("\nResults:");
    for i in 0_u8..=52_u8 {
        let count = counts[i as usize].load(Ordering::Relaxed);
        if count > 0 {
            let percentage = (count as f64 / n_games as f64) * 100.0;
            println!(" {:2} cards remaining: {:12} games ({:.6}%)", i, count, percentage);
        }
    }
}

fn run_game() -> u8 {
    let mut deck = Deck::new();
    deck.shuffle();
    /* Simulate a game */
}
I also wrote a Deck struct to represent and manipulate the deck of cards,
and added some debugging output. The full code is available on GitHub.
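The real Deck is in the repo, but for context, a stripped-down version looks something like this. This is a simplified sketch rather than the exact code, and it assumes the rand crate's SliceRandom for shuffling:

```rust
use rand::seq::SliceRandom;

/// A standard 52-card deck; cards are encoded as 0..52 here for simplicity.
pub struct Deck {
    cards: Vec<u8>,
}

impl Deck {
    /// Build an ordered deck. Note this heap-allocates a fresh Vec on every
    /// call - a detail that comes back later.
    pub fn new() -> Self {
        Deck { cards: (0..52).collect() }
    }

    /// Shuffle the deck in place using the thread-local RNG.
    pub fn shuffle(&mut self) {
        self.cards.shuffle(&mut rand::thread_rng());
    }

    /// Deal the top card, if any remain (illustrative helper).
    pub fn deal(&mut self) -> Option<u8> {
        self.cards.pop()
    }

    /// Number of cards left in the deck (illustrative helper).
    pub fn remaining(&self) -> usize {
        self.cards.len()
    }
}
```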
Running on Apple M2 Max
I first ran the code on my Apple M2 Max laptop to validate correctness and get a baseline for performance. The code ran correctly and I got results that made sense.
Running cargo run --release --bin the-52-game -- 1000000 completed 1 million
simulations in about 70ms, using all 12 available cores.
Cross-Compiling for Linux
To build for Linux on a Mac, I used cargo-zigbuild. This first requires
installing Zig and cargo-zigbuild:
brew install zig
cargo install cargo-zigbuild
Then add the aarch64 Linux musl target:
rustup target add aarch64-unknown-linux-musl
I chose musl for simplicity to avoid glibc version issues on AWS Graviton instances (you'll see later how this is the root cause of my problems).
Then build with:
cargo zigbuild --release --target aarch64-unknown-linux-musl
This produces a binary in target/aarch64-unknown-linux-musl/release/ that
will run on my EC2 Graviton instances.
Orchestrating Runs on AWS EC2
An important goal here was to fully automate the runs on EC2 to control my hardware costs. I was worried I'd forget to tear down an EC2 instance that was costing me a non-trivial amount of money - if I left one running overnight, I'd have a rather unpleasant AWS bill to deal with.
So the process looks something like this (all orchestrated via a Bash script):
- Use Terraform to spin up the EC2 instance
- In the Terraform code, use provisioner "file" to copy the binary to the instance and provisioner "remote-exec" to run it.
- Poll the server for job output, then SCP it back to my local machine.
- Use Terraform to destroy the EC2 instance
I ran this first on t4g.nano instances to validate the process and keep costs
low before slowly scaling up. I also sanity-checked the teardowns in the AWS console when I was done with each run.
Surprising Scaling Problems
Once I validated my workflow on t4g.nano instances, I moved to c8g instances - these are what I planned to run on at scale.
Again, I started with c8g.medium (1 vCPU), then c8g.large (2 vCPUs), then c8g.4xlarge (16 vCPUs), to check scaling behavior and get real
performance numbers.
For each of these, I ran 10 million simulations to keep run times small - and got a surprising result: beyond 2 vCPUs, performance completely broke down.
The code scaled nearly perfectly on Apple Silicon (my M2 Max) but failed to scale past 2 cores on the AWS Graviton instances, despite the program reporting that it was using all available cores.
| Machine | vCPUs | Time (10M games) | Speedup vs 1 vCPU | Expected Time |
|---|---|---|---|---|
| c8g.medium | 1 | 11.0s | 1.0x (baseline) | 11.0s ✓ |
| c8g.large | 2 | 5.7s | 1.9x ✓ | 5.5s ✓ |
| c8g.4xlarge | 16 | 6.3-7.3s | 1.5-1.7x ❌ | ~0.7s |
| c8g.48xlarge | 192 | 10-12s | ~1.0x ❌ | ~0.06s |
| Apple M2 Max | 1 | 4.9s | 1.0x (M2 baseline) | - |
| Apple M2 Max | 12 | 0.55s | 8.9x ✓ | - |
This was very surprising to me. I double- and triple-checked that the program was actually using all the cores (it was), that the EC2 instance types were correct (they were), and that there wasn't some issue with the single atomic vector being shared across threads (there wasn't).
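For what it's worth, one way to take those shared atomics out of the picture entirely is Rayon's fold/reduce, which gives each task a private tally and merges them at the end. This is just a sketch of that variant for comparison, not the code I actually ran:

```rust
use rayon::prelude::*;

// Per-task tallies merged at the end, so the hot loop touches no shared
// state at all. run_game() is the same per-game simulation as before.
fn count_outcomes(n_games: usize) -> [u64; 53] {
    (0..n_games)
        .into_par_iter()
        .fold(
            || [0u64; 53], // each Rayon task starts from its own zeroed tally
            |mut local, _i| {
                local[run_game() as usize] += 1;
                local
            },
        )
        .reduce(
            || [0u64; 53], // identity value for the merge step
            |mut merged, other| {
                for (slot, count) in merged.iter_mut().zip(other.iter()) {
                    *slot += *count;
                }
                merged
            },
        )
}
```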
This was when I learned (with the help of GPT-5) a bit about memory allocators.
Root Cause and Fix
After some research, I learned more about what musl does differently from glibc. I chose musl for simplicity and portability, but at the time I didn't fully understand the tradeoffs musl makes to achieve that.
It turns out musl's memory allocator is optimized for small, portable binaries, not high-performance parallel workloads; in particular, it isn't built for heavy concurrent allocation, so allocations from many threads end up contending on shared locks. That meant for each of my very fast simulations, every time I allocated a Deck struct or removed
items from it, the memory allocator became a point of contention. And this happens A LOT.
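Swapping the allocator, which I describe next, is the route I took, but the allocation pressure itself can also be reduced. Purely as a sketch (using a plain Vec as a stand-in for the real Deck state, so none of this is my actual code), Rayon's for_each_init hands each work chunk a reusable scratch value so the hot loop stops hitting the allocator:

```rust
use rayon::prelude::*;

// Sketch of an allocation-reuse pattern: each Rayon work chunk gets one
// scratch buffer up front and refills it per game, instead of allocating
// a fresh Vec for every single game.
fn run_games_reusing_buffers(n_games: usize) {
    (0..n_games)
        .into_par_iter()
        .for_each_init(
            || Vec::with_capacity(52), // roughly one allocation per work chunk
            |scratch: &mut Vec<u8>, _i| {
                scratch.clear();         // keep the capacity, drop the old cards
                scratch.extend(0u8..52); // refill with a fresh deck's worth of cards
                // ... shuffle `scratch` and run the simulation against it ...
            },
        );
}
```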
I also learned how easy it is in Rust to swap out the global memory allocator. GPT-5 helped me with this as well, recommending mimalloc as a high-performance drop-in replacement for musl's default allocator.
Adding this to the top of my code allowed me to swap out the allocator:
#[cfg(target_os = "linux")]
#[global_allocator]
static GLOBAL: mimalloc::MiMalloc = mimalloc::MiMalloc;
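For this to build, the mimalloc crate also needs to be declared as a dependency in Cargo.toml (e.g. mimalloc = "0.1").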
Sure enough, after rebuilding and re-running on the EC2 instances, much more reasonable scaling behavior emerged.
Final Results
It's still not perfect scaling, but it's much more reasonable now. On the 1B-game runs, the 192 vCPU instance is about 2.4x faster than the 64 vCPU instance (4.09s vs 9.74s), against an ideal of 3x. Given the overhead of thread management and result aggregation, it makes sense that it falls somewhat short of that ideal.
| Machine | vCPUs | # Sims | Runtime |
|---|---|---|---|
| c8g.4xlarge | 16 | 10M | 428.92ms |
| c8g.16xlarge | 64 | 10M | 144.49ms |
| c8g.48xlarge | 192 | 10M | 250.00ms |
| c8g.16xlarge | 64 | 1B | 9.74s |
| c8g.48xlarge | 192 | 1B | 4.09s |
| c8g.48xlarge | 192 | 10B | 35.44s |
| c8g.48xlarge | 192 | 25B | 86.30s |
What's particularly useful here is being able to run 1B+ simulations with fully automated runs. I end up paying for less than 2 minutes of EC2 time on 192 vCPUs to get results that would take hours on my laptop.
Since EC2 bills per second, my 25B-simulation run cost about $0.20 at the $7.63008/hour on-demand price for a c8g.48xlarge: 86.30 s at that rate works out to roughly $0.18, plus a little billed time around the run itself.
Closing Thoughts
I can definitely take this further - profiling the code, exploring compiler optimizations, and more. But this felt like a good enough point to stop and write up everything so far.
I learned far more than I expected when I set out to simulate a simple card game. It was a great deep dive into a very specific problem space. The key takeaway: this experience was a powerful reminder that a deep understanding of a system is critical to making informed trade-offs.