🍔🧠 Zero Downtime: How Yelp Upgraded 1000+ Cassandra Nodes (No Rollback)

Happy Monday! ☀️

Welcome to the 530 new hungry minds who have joined us since last Monday!

If you aren't subscribed yet, join smart, curious, and hungry folks by subscribing here.

📚 Software Engineering Articles

Data mesh at Grab: foundational tools behind certification
Adaptive parallel reasoning: the next paradigm in inference
The bottleneck was never the code; rethink your agents
Stop failures from spreading between your microservices
Bad retries can break good systems; here's why

🗞️ Tech and AI Trends

AI load breaks GitHub – why other vendors survived
Google Chrome silently installs a 4GB AI model on your device
OpenAI fast-tracking AI phone for 2027 launch

👨🏻‍💻 Coding Tip

Use OpenTelemetry exemplars to link metrics directly to traces, cutting debugging time dramatically

Time-to-digest: 5 minutes

Zero downtime upgrade: Yelp's Cassandra 4.x upgrade story 🐿

Yelp's Database Reliability Engineering team upgraded over 1,000 Cassandra nodes from 3.11 to 4.1 with zero downtime. This wasn't just a version bump: it required orchestrating complex distributed systems, fixing compatibility issues, and keeping production traffic flowing the entire time.

The challenge: Upgrade a thousand production nodes without breaking client code, without requiring schema migrations, and without accepting any downtime, all while managing gossip communication failures, proxy incompatibilities, and CDC architectural changes.

Implementation highlights:

Init container gossip trick: Use Kubernetes init containers to let nodes gossip with their new IP on 3.11 before upgrading to 4.1, avoiding gossip failures during simultaneous IP and version changes
Parallel Stargate versions: Run version-specific Stargate proxy instances side-by-side in the service mesh, seamlessly routing to matching Cassandra versions during the transition
Separate Git branches: Publish version-specific Cassandra images from dedicated branches, enabling independent deployments to 3.11 and 4.1 without hard blocking either version
CDC Publisher backwards compatibility: Refactor the Cassandra Source Connector to handle architectural changes in 4.1 (immediate CDC log creation, schema change detection) while maintaining 3.11 support
Orchestrated three-stage automation: Implement checkpoint-enabled scripts for pre-flight (schema agreement, backup verification), flight (rolling node upgrades with monitoring), and post-flight (repair re-enablement) stages
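
To make the checkpoint idea concrete, here is a minimal Python sketch of a resumable three-stage run. It is not Yelp's tooling: the function names (plan_units, run_upgrade), the JSON checkpoint file, and the stand-in actions are all assumptions made for illustration. The point it shows is that each finished unit of work is recorded, so a rerun after a failure picks up where the last one stopped.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict, List, Optional

# Hypothetical action signature: receives a node name, or None for cluster-wide stages.
Action = Callable[[Optional[str]], None]


def plan_units(nodes: List[str]) -> List[str]:
    """Order of work: one pre-flight, one flight step per node, one post-flight."""
    return ["pre_flight"] + [f"flight:{n}" for n in nodes] + ["post_flight"]


def run_upgrade(nodes: List[str], actions: Dict[str, Action], checkpoint: Path) -> List[str]:
    """Run pending units in order and return the units executed in this call."""
    done = json.loads(checkpoint.read_text()) if checkpoint.exists() else []
    executed = []
    for unit in plan_units(nodes):
        if unit in done:
            continue  # already finished in an earlier run
        stage, _, node = unit.partition(":")
        actions[stage](node or None)  # raises on failure; checkpoint stays as it was
        done.append(unit)
        checkpoint.write_text(json.dumps(done))  # persist progress after each unit
        executed.append(unit)
    return executed


# Demo with stand-in actions: node "b" fails once, then the rerun resumes.
failures = {"b": 1}
log: List[str] = []


def flight(node: Optional[str]) -> None:
    if failures.get(node, 0) > 0:
        failures[node] -= 1
        raise RuntimeError(f"upgrade of {node} failed")
    log.append(f"upgraded {node}")


actions: Dict[str, Action] = {
    "pre_flight": lambda _: log.append("checked schema agreement and backups"),
    "flight": flight,
    "post_flight": lambda _: log.append("re-enabled repairs"),
}

with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "upgrade.json"
    try:
        run_upgrade(["a", "b", "c"], actions, ckpt)
    except RuntimeError:
        pass  # an operator would investigate before rerunning
    second_run = run_upgrade(["a", "b", "c"], actions, ckpt)

print(second_run)
```

The second call skips pre-flight and node "a" because the checkpoint already lists them, which is the property that makes a long rolling upgrade safe to interrupt.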

Results and learnings:

Massive latency wins: Achieved up to a 58% reduction in p99 latencies on key clusters, plus an 11% throughput improvement
Zero production impact: No client code changes, no downtime, no incidents—the upgrade was invisible to users
Faster restarts: Leveraged non-disruptive seed list reload for quicker gossip convergence and significantly improved node restart times

Yelp proved that big infrastructure upgrades don't require chaos. Their playbook (automation with checkpoints, version-agnostic design, and obsessive monitoring) shows you can move mountains without breaking a sweat.

The bottleneck was never the code

The other month I finally ran an experiment we had been postponing for over a year at .txt.

Create Your First AI Agent In 5 Minutes

A practical guide to building real Claude Code subagents with tools, context, and well-defined outputs

How to Stop Failures from Spreading Between Services

Practical patterns to protect your services from failing dependencies and excessive load.

How to correctly use MCP servers with your AI Agents

MCP servers are not dead. Blindly enabling them bloats your context, which leads to higher cost and worse performance. Here are two proven patterns on how to correctly use MCP servers and avoid the bloat.

Bad Retries Can Break Good Systems

A practical guide to using backoff, jitter, retry budgets, idempotency, and circuit breakers in backend systems.

Data Mesh at Grab (Part II): The foundational tools behind certification

How does Grab manage quality across hundreds of thousands of data assets? Discover the foundational tools powering our Signals Marketplace. We dive into Hubble for discovery, Genchi for observability, and our Data Contract Registry to see how event-driven certification turns 'data as a product' into a reliable, AI-ready reality. Stop guessing and start trusting your data.

Adaptive Parallel Reasoning: The Next Paradigm in Efficient Inference Scaling

The BAIR Blog

ARTICLE (shadow clone jutsu)
Enhancing Flink deployment with shadow testing

ARTICLE (robot boss energy)
How to Thrive as an EM in the AI Era

ESSENTIAL (a11y zen mode)
Three stoic principles for better web accessibility

ARTICLE (auth roulette)
From Supabase to Clerk to Better Auth

ARTICLE (ai editor go brr)
Editing my LLM assisted Articles

ESSENTIAL (redis patience speedrun)
Redis array type: short story of a long development

ARTICLE (whisper sweet prompts)
Realtime prompting guide

ARTICLE (form go spinny)
Introducing TanStack Form

ARTICLE (ai interview hot take)
Removing AI in Tech Interviews is Wrong

ARTICLE (api wallet cries)
Computer use is 45x More Expensive Than Structured APIs

ARTICLE (postgres oops moment)
How Linux 7.0 Broke PostgreSQL

Want to reach 200,000+ engineers?

Let’s work together! Whether it’s your product, service, or event, we’d love to help you connect with this awesome community.

WORK WITH US

🔥 AI Load Breaks GitHub – Why Other Vendors Stay Strong (8 min)

Brief: GitHub suffers critical reliability meltdown with data integrity failures, multiple outages, and 85% uptime as AI agent load spikes 3.5x, prompting famous developers like HashiCorp's founder to abandon the platform, while competitors like GitLab and Vercel handle similar growth without major issues—exposing GitHub's infrastructure missteps and organizational bloat.

🚨 Google Chrome Silently Installs 4GB AI Model Without Consent—Here's the Climate Cost (27 min)

Brief: Google Chrome automatically downloads a 4GB Gemini Nano AI model to user devices without permission, persists even when deleted, and carries a massive carbon footprint estimated between 6,000–60,000 tonnes of CO2 equivalent—a practice that violates GDPR, ePrivacy law, and deceptive design standards while giving users zero transparency about the cloud-backed AI Mode that doesn't even use the local model.

🤖 Designing the AI-Native Engineering Organization (5 min)

Brief: Engineering leaders from Microsoft, 1Password, and Atlassian reveal how AI is reshaping dev workflows—compressing coding and operations while expanding planning and validation, shifting teams to smaller mission-driven squads, collapsing planning horizons to 90-day cycles, and demanding cost management rigor on token spend that rivals cloud infrastructure negotiations.

📱 OpenAI Fast-Tracking AI Phone for 2027 Launch, Says Kuo (3 min)

Brief: OpenAI is accelerating its AI agent phone to hit mass production in early 2027 instead of 2028, driven by its planned IPO and competition, with MediaTek as the lead processor supplier, a custom Dimensity 9600 chip, and an advanced image signal processor for enhanced camera perception—potentially reaching 30 million units by 2028.

🔒 Mozilla Hardens Firefox with AI-Powered Security Audits, Fixes 271 Bugs (5 min)

Brief: Mozilla used Claude Mythos Preview and other AI models to identify and fix an unprecedented 271 latent security bugs in Firefox, including complex sandbox escapes and use-after-free vulnerabilities that fuzzing alone couldn't detect, while building a scalable agentic harness pipeline that dynamically tests AI-generated hypotheses to eliminate false positives.

This week’s tip:

Instrument RED metrics (Rate, Errors, Duration) with OpenTelemetry Exemplars to link slow traces to metrics dashboards. Exemplars embed trace IDs in histogram buckets, enabling one-click drill-down from 'p95 latency spike' to the actual trace without log searching.

When?

SLO burn debugging: Exemplars auto-link p99 latency alerts to slow traces, cutting MTTR from "search logs for timestamp" to "click exemplar, see trace graph".
Capacity planning: Correlating error rate spikes with trace samples shows if errors are transient network blips or systematic (e.g., OOM), guiding scaling decisions.
Incident postmortems: Collecting exemplars during the incident window creates a trail of representative traces at each percentile, preserving evidence without storing every trace.
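
If exemplars are new to you, this small Python sketch shows the mechanism in isolation. It deliberately does not use the OpenTelemetry API: the LatencyHistogram class, the bucket bounds, and the trace IDs are all made up for illustration. What it demonstrates is the core idea from the tip, that each histogram bucket can carry a sample trace ID alongside its count, so a slow bucket points straight at a trace.

```python
import bisect
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Exemplar:
    value_ms: float
    trace_id: str


@dataclass
class LatencyHistogram:
    """Toy histogram (not an OpenTelemetry type) that keeps one exemplar per bucket."""
    bounds: List[float]  # sorted, inclusive upper bounds in milliseconds
    counts: List[int] = field(init=False)
    exemplars: Dict[int, Exemplar] = field(default_factory=dict)

    def __post_init__(self) -> None:
        self.counts = [0] * (len(self.bounds) + 1)  # last slot is the overflow bucket

    def record(self, value_ms: float, trace_id: Optional[str] = None) -> None:
        bucket = bisect.bisect_left(self.bounds, value_ms)
        self.counts[bucket] += 1
        if trace_id is not None:
            # Keep the most recent traced measurement as this bucket's exemplar.
            self.exemplars[bucket] = Exemplar(value_ms, trace_id)

    def slowest_exemplar(self) -> Optional[Exemplar]:
        """Exemplar from the highest bucket that has one, or None."""
        if not self.exemplars:
            return None
        return self.exemplars[max(self.exemplars)]


hist = LatencyHistogram([50, 100, 250])
hist.record(12.0, "trace-aaa")
hist.record(80.0, "trace-bbb")
hist.record(900.0, "trace-ccc")
hist.record(30.0)  # untraced request: counted, but contributes no exemplar

slow = hist.slowest_exemplar()
print(hist.counts, slow.trace_id if slow else None)
```

In a real setup the metrics SDK and backend handle this bookkeeping for you; the sketch is only meant to show why a latency spike on a dashboard can resolve to a single trace ID.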

The more people you help become successful the more successful you become.
Steve Harvey

That’s it for today! ☀️

Enjoyed this issue? Send it to your friends so they can sign up here, or share it on Twitter!

If you want to submit a section to the newsletter or tell us what you think about today’s issue, reply to this email or DM me on Twitter! 🐦

Thanks for spending part of your Monday morning with Hungry Minds.
See you in a week — Alex.

Icons by Icons8.

*I may earn a commission if you get a subscription through the links marked with “aff.” (at no extra cost to you).
