Happy Monday! ☀️

Welcome to the 530 new hungry minds who have joined us since last Monday!

If you aren't subscribed yet, join smart, curious, and hungry folks by subscribing here.

📚 Software Engineering Articles

🗞️ Tech and AI Trends

👨🏻‍💻 Coding Tip

  • Use OpenTelemetry exemplars to link metrics directly to traces, cutting debugging time dramatically

Time-to-digest: 5 minutes

Yelp's Database Reliability Engineering team upgraded over 1,000 Cassandra nodes from 3.11 to 4.1 with zero downtime. This wasn't just a version bump: it required orchestrating complex distributed systems, fixing compatibility issues, and keeping production traffic flowing the entire time.

The challenge: Upgrade a thousand production nodes without breaking client code, requiring schema migrations, or accepting any downtime, all while managing gossip communication failures, proxy incompatibilities, and CDC architectural changes.

Implementation highlights:

  1. Init container gossip trick: Use Kubernetes init containers to let nodes gossip with their new IP on 3.11 before upgrading to 4.1, avoiding gossip failures during simultaneous IP and version changes

  2. Parallel Stargate versions: Run version-specific Stargate proxy instances side-by-side in the service mesh, seamlessly routing to matching Cassandra versions during the transition

  3. Separate Git branches: Publish version-specific Cassandra images from dedicated branches, enabling independent deployments to 3.11 and 4.1 without hard blocking either version

  4. CDC Publisher backwards compatibility: Refactor the Cassandra Source Connector to handle architectural changes in 4.1 (immediate CDC log creation, schema change detection) while maintaining 3.11 support

  5. Orchestrated three-stage automation: Implement checkpoint-enabled scripts for pre-flight (schema agreement, backup verification), flight (rolling node upgrades with monitoring), and post-flight (repair re-enablement) stages
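To make the checkpoint idea in item 5 concrete, here is a minimal Python sketch. It is not Yelp's tooling: the stage names come from the summary above, while `run_upgrade`, `load_checkpoint`, the JSON checkpoint file, and the placeholder step functions are all assumptions made for illustration. The point is only that a failed run can be restarted without repeating stages that already finished.

```python
import json
import tempfile
from pathlib import Path

# Stage names follow the summary above; everything else here is hypothetical.
STAGES = ["pre_flight", "flight", "post_flight"]


def load_checkpoint(path: Path) -> list[str]:
    """Return the stages already completed, or [] on a first run."""
    if path.exists():
        return json.loads(path.read_text())
    return []


def run_upgrade(steps: dict, checkpoint: Path) -> list[str]:
    """Run stages in order, skipping any stage recorded in the checkpoint."""
    done = load_checkpoint(checkpoint)
    executed = []
    for stage in STAGES:
        if stage in done:
            continue  # finished in an earlier run
        steps[stage]()  # raises on failure, leaving the checkpoint untouched
        done.append(stage)
        checkpoint.write_text(json.dumps(done))  # persist progress per stage
        executed.append(stage)
    return executed


def demo() -> tuple[list[str], list[str]]:
    """Simulate a failure during 'flight', then a successful rerun."""
    with tempfile.TemporaryDirectory() as tmp:
        checkpoint = Path(tmp) / "upgrade.checkpoint.json"
        attempts = {"flight": 0}

        def flaky_flight():
            # Placeholder for a rolling node upgrade that fails once.
            attempts["flight"] += 1
            if attempts["flight"] == 1:
                raise RuntimeError("node failed health check")

        steps = {
            "pre_flight": lambda: None,   # e.g. schema agreement checks
            "flight": flaky_flight,
            "post_flight": lambda: None,  # e.g. re-enabling repairs
        }
        try:
            run_upgrade(steps, checkpoint)
        except RuntimeError:
            pass  # an operator would investigate, then rerun
        saved = load_checkpoint(checkpoint)
        resumed = run_upgrade(steps, checkpoint)
        return saved, resumed
```

Calling `demo()` shows the rerun picking up at the failed stage rather than repeating pre-flight. A real orchestrator would likely checkpoint per node as well as per stage, but that detail is not covered in the summary.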

Results and learnings:

  • Massive latency wins: Achieved up to 58% reduction in p99 latencies on key clusters, plus 11% throughput improvement

  • Zero production impact: No client code changes, no downtime, no incidents—the upgrade was invisible to users

  • Faster restarts: Leveraged non-disruptive seed list reload for quicker gossip convergence and significantly improved node restart times

Yelp proved that big infrastructure upgrades don't require chaos. Their playbook (automation with checkpoints, version-agnostic design, and obsessive monitoring) shows you can move mountains without breaking a sweat.

ARTICLE (robot boss energy)
How to Thrive as an EM in the AI Era

ARTICLE (ai editor go brr)
Editing my LLM assisted Articles

ESSENTIAL (redis patience speedrun)
Redis array type: short story of a long development

ARTICLE (whisper sweet prompts)
Realtime prompting guide

ARTICLE (form go spinny)
Introducing TanStack Form

ARTICLE (ai interview hot take)
Removing AI in Tech Interviews is Wrong

ARTICLE (postgres oops moment)
How Linux 7.0 Broke PostgreSQL

Want to reach 200,000+ engineers?

Let’s work together! Whether it’s your product, service, or event, we’d love to help you connect with this awesome community.

Brief: GitHub suffers critical reliability meltdown with data integrity failures, multiple outages, and 85% uptime as AI agent load spikes 3.5x, prompting famous developers like HashiCorp's founder to abandon the platform, while competitors like GitLab and Vercel handle similar growth without major issues—exposing GitHub's infrastructure missteps and organizational bloat.

Brief: Google Chrome automatically downloads a 4GB Gemini Nano AI model to user devices without permission, persists even when deleted, and carries a massive carbon footprint estimated between 6,000–60,000 tonnes of CO2 equivalent—a practice that violates GDPR, ePrivacy law, and deceptive design standards while giving users zero transparency about the cloud-backed AI Mode that doesn't even use the local model.

Brief: Engineering leaders from Microsoft, 1Password, and Atlassian reveal how AI is reshaping dev workflows—compressing coding and operations while expanding planning and validation, shifting teams to smaller mission-driven squads, collapsing planning horizons to 90-day cycles, and demanding cost management rigor on token spend that rivals cloud infrastructure negotiations.

Brief: OpenAI is accelerating its AI agent phone to hit mass production in early 2027 instead of 2028, driven by its planned IPO and competition, with MediaTek as the lead processor supplier, a custom Dimensity 9600 chip, and an advanced image signal processor for enhanced camera perception—potentially reaching 30 million units by 2028.

Brief: Mozilla used Claude Mythos Preview and other AI models to identify and fix an unprecedented 271 latent security bugs in Firefox, including complex sandbox escapes and use-after-free vulnerabilities that fuzzing alone couldn't detect, while building a scalable agentic harness pipeline that dynamically tests AI-generated hypotheses to eliminate false positives.

This week’s tip:

Instrument RED metrics (Rate, Errors, Duration) with OpenTelemetry Exemplars to link slow traces to metrics dashboards. Exemplars embed trace IDs in histogram buckets, enabling one-click drill-down from 'p95 latency spike' to the actual trace without log searching.
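To show what "embedding trace IDs in histogram buckets" means, here is a small conceptual sketch in plain Python. It deliberately does not use the OpenTelemetry SDK: `ExemplarHistogram`, `Exemplar`, the bucket bounds, and the trace IDs are all hypothetical names chosen for illustration. In a real setup the SDK and your metrics backend would record and store exemplars for you.

```python
import bisect
from dataclasses import dataclass, field

# Upper bounds of the latency buckets in seconds; the last bucket is +Inf.
# These bounds are an assumption for the example.
BOUNDS = [0.1, 0.5, 1.0]


@dataclass
class Exemplar:
    value: float   # the observed latency in seconds
    trace_id: str  # the trace that produced this observation


@dataclass
class ExemplarHistogram:
    counts: list = field(default_factory=lambda: [0] * (len(BOUNDS) + 1))
    exemplars: dict = field(default_factory=dict)  # bucket index -> Exemplar

    def observe(self, seconds: float, trace_id: str) -> None:
        # bisect_left finds the first bucket whose upper bound is >= seconds.
        idx = bisect.bisect_left(BOUNDS, seconds)
        self.counts[idx] += 1
        # Keep only the most recent exemplar per bucket.
        self.exemplars[idx] = Exemplar(seconds, trace_id)

    def slowest_trace(self):
        """Return the trace ID attached to the highest non-empty bucket."""
        if not self.exemplars:
            return None
        return self.exemplars[max(self.exemplars)].trace_id


histogram = ExemplarHistogram()
histogram.observe(0.05, "trace-a")
histogram.observe(0.30, "trace-b")
histogram.observe(2.30, "trace-c")  # lands in the +Inf bucket
histogram.observe(0.07, "trace-d")  # replaces trace-a as the bucket exemplar
```

After these four observations, `histogram.slowest_trace()` returns the trace ID stored with the slowest bucket, which is the lookup a dashboard performs when you click an exemplar. Storing one exemplar per bucket, instead of every trace, keeps the overhead small while still giving a representative trace at each latency range.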

When?

  • SLO burn debugging: Exemplars link p99 latency alerts to slow traces, shortening the debugging path from "search logs for timestamp" to "click exemplar, see trace graph" and reducing MTTR.

  • Capacity planning: Correlating error rate spikes with trace samples shows if errors are transient network blips or systematic (e.g., OOM), guiding scaling decisions.

  • Incident postmortems: Collecting exemplars during the incident window creates a trail of representative traces at each percentile, preserving evidence without storing every trace.

The more people you help become successful the more successful you become.
Steve Harvey

That’s it for today! ☀️

Enjoyed this issue? Send it to your friends so they can sign up here, or share it on Twitter!

If you want to submit a section to the newsletter or tell us what you think about today’s issue, reply to this email or DM me on Twitter! 🐦

Thanks for spending part of your Monday morning with Hungry Minds.
See you in a week — Alex.

Icons by Icons8.

*I may earn a commission if you get a subscription through the links marked with “aff.” (at no extra cost to you).
