Walmart’s Million-Core Cloud Revolution: How OpenStack Defies Myths and Redefines Hybrid Scaling
Discover the Untold Story of Walmart’s Decade-Long Cloud Journey, Shattering OpenStack Stereotypes with Hyperscale Innovation
What We Discovered: How Walmart’s Cloud Journey Redefines OpenStack and Hybrid Scaling
Dear Readers,
We recently dove deep into Walmart’s cloud journey—and what we found completely reshaped our view of OpenStack and private cloud scaling.
At OpenInfra Days North America, we heard firsthand from Walmart’s Senior Director of Cloud Engineering, Gerald Bothello, about the company’s decade-long experience with OpenStack.
What really struck us was the stark contrast between common tech chatter and real-world facts. You’ve probably heard it too—“OpenStack is dead,” they say. We admit, we've heard it countless times ourselves. But Walmart’s story flips that idea on its head. Instead of fading away, OpenStack has become the backbone of one of the most massive, dynamic private clouds in the world.

Humble Beginnings: How Success Starts Small
Their journey began in 2014, starting humbly with a few thousand cores during a crucial holiday season. It worked—and that early success catapulted OpenStack into the heart of Walmart’s operations.
When we reflected on this, it made us question: How many other technologies get dismissed before people even look under the hood at what's actually happening?
Scaling Beyond Limits: Yearly Doubling and Million-Core Milestones
What we discovered next honestly blew our minds.
Between 2014 and 2017, Walmart didn’t just scale—they doubled their cloud capacity every year. By 2017, they crossed 100,000 cores (and even made it a milestone with “100,000-core club” laptop stickers, which we thought was pretty cool).
Today, they’ve passed one million cores, and they’re still climbing.
Hearing Gerald talk about it so casually—like managing a million cores was just another Tuesday—made us realize how often true tech milestones get overshadowed by marketing hype elsewhere.
With our team currently supporting around 40,000 cores, we couldn’t help but feel the weight of that comparison. At our scale, we're already facing challenges: resource contention, upgrade coordination, hardware lifecycle management, and team bandwidth are all non-trivial problems. The idea of scaling that up 25x seems almost unfathomable, until you hear how Walmart has been doing it.
Their success isn’t magic. It’s the result of intentional design, deep automation, and relentless iteration. They’ve managed to tame complexity at hyperscale by treating scale as a constraint, not a goal—and engineering around that constraint with systems that are both flexible and resilient.
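To put the figures quoted above side by side, here is a small back-of-the-envelope sketch. It uses only numbers from this post (100,000 cores in 2017, one million today, and our own 40,000). The function names are ours, chosen for illustration.

```python
import math


def doublings_needed(start_cores: int, target_cores: int) -> float:
    """How many times capacity must double to grow from start to target."""
    return math.log2(target_cores / start_cores)


def scale_factor(our_cores: int, their_cores: int) -> float:
    """How many times larger their fleet is than ours."""
    return their_cores / our_cores


# From the 100,000-core milestone to one million cores.
print(round(doublings_needed(100_000, 1_000_000), 2))  # 3.32

# Our 40,000 cores compared with a million-core fleet.
print(scale_factor(40_000, 1_000_000))  # 25.0
```

In other words, going from the 2017 milestone to a million cores takes a little more than three further doublings, and a million-core fleet is 25 times the size of ours.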
Hybrid by Necessity: Adapting to New Demands
But it’s not just about big numbers. What really inspired us was how Walmart handled scale. Rather than sticking to old models, they evolved.
When they realized they were outgrowing their data centers, they didn’t cling to the past—they built a hybrid cloud, blending OpenStack with public cloud services.
To us, that adaptability is the real story here. It’s not about being loyal to a tool. It’s about using what works, continuously improving, and never standing still.
Upgrading at Scale: From Painful Migrations to Seamless Speed
Upgrades are another part of Walmart’s story that really hit home for us.
In the early days, upgrading OpenStack sounded like a nightmare—painful migrations, endless hand-holding, and a full month of stress. Honestly, we could feel Gerald’s frustration even years later as he described it. And we couldn’t help but think: haven’t we all been there with complex systems?
But here’s where it gets impressive: Walmart refused to accept "that's just the way it is."
Starting in 2018, they engineered a stateless, in-place upgrade system. No migrating workloads, no downtime, just smooth upgrades—even at hyperscale. Now they upgrade their entire million-core fleet every single year in just three months, skipping versions to move faster.
For us, with 40,000 cores, even planning an upgrade can be a multi-month affair. Watching Walmart do it at a million-core scale—annually and seamlessly—really puts things into perspective. It's a masterclass in removing friction through systemic innovation.
Automation as Oxygen: Scaling with Small Teams
Another thing that stuck with us was how Walmart thinks about automation.
Gerald made it clear: running over a million cores with a small team is only possible with deep automation. Manual processes are simply not an option.
They’ve built layers of self-healing systems: automation on the control plane to fix OpenStack component issues automatically, and automation at the hardware layer to detect and pull out failing machines before they cause problems.
They also developed internal tools like Galaxy, which gives them a real-time, visual health check of their cloud. We were particularly impressed by how fast they can pinpoint and fix issues now—within minutes.
And that’s the lesson we’re trying to bring home. At 40,000 cores, we already feel the pressure to automate everything we can—so the idea that Walmart’s team operates at 25x that scale with relatively lean staffing shows just how far thoughtful automation can take you. It’s not about making humans obsolete—it’s about making them superpowered.
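To make the hardware-layer idea concrete for ourselves, we sketched what a "detect and pull failing machines" loop could look like in its simplest form. This is not Walmart's implementation, which the talk did not describe at this level of detail. The `Host` class, the `reconcile` function, and the three-strikes threshold are all hypothetical choices of ours.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Host:
    name: str
    failed_checks: int = 0  # consecutive failed health checks
    in_service: bool = True


def reconcile(
    hosts: Iterable[Host],
    check: Callable[[Host], bool],
    max_failures: int = 3,  # hypothetical threshold, not from the talk
) -> List[str]:
    """Run one health-check pass and return the hosts pulled from service."""
    drained = []
    for host in hosts:
        if not host.in_service:
            continue  # already pulled; skip until a human or job repairs it
        if check(host):
            host.failed_checks = 0  # a healthy pass resets the counter
        else:
            host.failed_checks += 1
            if host.failed_checks >= max_failures:
                host.in_service = False
                drained.append(host.name)
    return drained


# Example: node-b fails every check and is drained on the third pass.
hosts = [Host("node-a"), Host("node-b")]
failing = {"node-b"}
for _ in range(3):
    drained = reconcile(hosts, lambda h: h.name not in failing)
print(drained)  # ['node-b']
```

Requiring several consecutive failures before draining a host is one simple way to avoid pulling machines over a single transient blip. A real system would also need to move workloads, open a repair ticket, and return the host to service afterwards.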
Our Big Takeaways: Pragmatism Over Hype
Looking back at everything we learned, our biggest takeaway is this:
Walmart’s cloud success story isn’t about OpenStack versus public cloud, or private versus hybrid. It’s about relentless pragmatism, scalability, and a refusal to accept outdated limits.
OpenStack is still Walmart’s cloud of choice—not because they’re stuck with it, but because it works at scale, and it continues to evolve with them.
Hearing Gerald’s story made us rethink a lot about how we approach infrastructure choices, upgrades, and even team structures. It challenged us to be bolder in how we design and maintain systems: to think longer-term, to innovate under pressure, and to automate wisely.
Closing Thoughts: Learning from Walmart’s Mindset
We hope this discovery gives you as much food for thought as it gave us.
There’s a quiet kind of brilliance in what Walmart has done: they’ve built a system so resilient and so efficient that it can keep growing without adding massive headcount or accepting tech debt as inevitable.
At the end of the day, it’s a reminder that the best engineering doesn’t always make the loudest noise—but it’s the kind that endures.
That’s it! Keep innovating and stay inspired! If you think your colleagues and friends would find this content valuable, we’d love it if you shared our newsletter with them!
That’s it for this episode!
Thank you for taking the time to read today’s email! Your support allows me to send out this newsletter for free every day.
Disclaimer: The "Tiny Big Spark" newsletter is for informational and educational purposes only, not a substitute for professional advice, including financial, legal, medical, or technical. We strive for accuracy but make no guarantees about the completeness or reliability of the information provided. Any reliance on this information is at your own risk. The views expressed are those of the authors and do not reflect any organization's official position. This newsletter may link to external sites we don't control; we do not endorse their content. We are not liable for any losses or damages from using this information.