Diluting complex research, spotting silent data leaks, and why the best way to learn is often backwards.
The post Bridging the Gap Between Research and Readability with Marco Hening Tallarico appeared first on Towards Data Science.
How I used open-source models to explore new frontiers in efficient code generation, using my MacBook and local LLMs.
The post Using Local LLMs to Discover High-Performance Algorithms appeared first on Towards Data Science.
Why modeling SKUs as a network reveals what traditional forecasts miss
The post Time Series Isn’t Enough: How Graph Neural Networks Change Demand Forecasting appeared first on Towards Data Science.
How to use n8n with multimodal AI and optimisation tools to help companies with low data maturity accelerate their digital transformation.
The post The Hidden Opportunity in AI Workflow Automation with n8n for Low-Tech Companies appeared first on Towards Data Science.
How science, regulation, collaboration, and public funding shaped the world’s most mature semantic infrastructure
The post Why Healthcare Leads in Knowledge Graphs appeared first on Towards Data Science.
Do you know where your data has been?
The post Data Poisoning in Machine Learning: Why and How People Manipulate Training Data appeared first on Towards Data Science.
Imagine a flock of birds in flight. There’s no leader. No central command. Each bird aligns with its neighbors—matching direction, adjusting speed, maintaining coherence through purely local coordination. The result is global order emerging from local consistency. Now imagine one bird flying with the same conviction as the others. Its wingbeats are confident. Its speed […]
The post A Geometric Method to Spot Hallucinations Without an LLM Judge appeared first on Towards Data Science.
Learn how to be a more efficient programmer
The post Maximum-Efficiency Coding Setup appeared first on Towards Data Science.
Why your final LLM layer is OOMing and how to fix it with a custom Triton kernel.
The post Cutting LLM Memory by 84%: A Deep Dive into Fused Kernels appeared first on Towards Data Science.
A multi-tier approach to segmentation, color correction, and domain-specific enhancement
The post From RGB to Lab: Addressing Color Artifacts in AI Image Compositing appeared first on Towards Data Science.
Acquisitions, venture, and an increasingly competitive landscape all point to a market ceiling
The post The Great Data Closure: Why Databricks and Snowflake Are Hitting Their Ceiling appeared first on Towards Data Science.
Let's make sense of the current state of retrieval-augmented generation
The post TDS Newsletter: Is It Time to Revisit RAG? appeared first on Towards Data Science.
Shapley Values are one of the most common methods for explainability, yet they can be misleading. Discover how to overcome these limitations to achieve better insights.
The post When Shapley Values Break: A Guide to Robust Model Explainability appeared first on Towards Data Science.
Get the most out of Claude Code
The post How to Run Coding Agents in Parallel appeared first on Towards Data Science.
Designing a centralized system to track daily habits and long-term goals
The post The 2026 Goal Tracker: How I Built a Data-Driven Vision Board Using Python, Streamlit, and Neon appeared first on Towards Data Science.
Why speed without standards creates fragile AI products
The post Do You Smell That? Hidden Technical Debt in AI Development appeared first on Towards Data Science.
From optimizing metrics to designing meaning: putting people back into data-driven decisions
The post Why Human-Centered Data Analytics Matters More Than Ever appeared first on Towards Data Science.
How structured knowledge became healthcare’s quiet advantage
The post What Is a Knowledge Graph — and Why It Matters appeared first on Towards Data Science.
A history of Transformer artifacts and the latest research on how to fix them
The post Glitches in the Attention Matrix appeared first on Towards Data Science.
Seeded topic modeling, integration with LLMs, and training on summarized data are the fresh parts of the NLP toolkit.
The post Topic Modeling Techniques for 2026: Seeded Modeling, LLM Integration, and Data Summaries appeared first on Towards Data Science.

I’ve been using Linux since 1997. As the ’90s came to a close, there were so many new distributions coming
The post Mageia Harkens Back to the Glory Days of Mandrake Linux appeared first on The New Stack.

The Repair Association is a U.S. advocacy group fighting for customers’ “Right to Repair.” If you buy something, you should
The post Repair Advocates Name CES 2026’s Most Anticonsumer Tech appeared first on The New Stack.

Astro — whose acquisition by Cloudflare was announced on Friday — on Wednesday released its first beta of Astro 6,
The post Astro Redesigns Its Development Server appeared first on The New Stack.

For years, site reliability engineering (SRE) has centered on one mission: keeping systems healthy while everything else — code, configurations
The post The Future of AI in SRE: Preventing Failures, Not Fixing Them appeared first on The New Stack.

While code and data interplay with each other to form running programs, we still tend to concern ourselves more with
The post A Developer’s Guide to Marshaling Data With JSON appeared first on The New Stack.

Security platform provider Arcjet has launched a Python SDK to bring application-layer security directly into code. The SDK, now in
The post Arcjet’s Python SDK Embeds Security in Code appeared first on The New Stack.

Cloudflare announced Friday it will acquire the Astro Technology Company team, which is responsible for the open source JavaScript web
The post Cloudflare Acquires Team Behind Open Source Framework Astro appeared first on The New Stack.

Recently, I procured a Zettlab AI NAS. This was mostly to be used to house a metric ton of video
The post You Might Not Know This, but Your NAS Might Be a Good Docker Server appeared first on The New Stack.

Enterprise AI inherited the consumer model of AI, but it’s the wrong one for most business-to-business (B2B) problems. In the
The post SLMs vs. LLMs: Why Smaller AI Models Win in Business appeared first on The New Stack.

In today’s enterprise landscape, AI may dominate the conversation, but legacy systems still underpin mission-critical operations for many — if
The post Orchestration: The Key to Integrating AI with Legacy Systems appeared first on The New Stack.

For today’s AI agents, memory is a moat. Every conversation counts, but traditional large language models (LLMs) are stateless —
The post Memory for AI Agents: A New Paradigm of Context Engineering appeared first on The New Stack.

The Dragonfly project, an open source peer-to-peer image and file distribution system, has graduated from the CNCF’s program for incubating
The post CNCF Dragonfly Speeds Container, Model Sharing with P2P appeared first on The New Stack.

The ability to run large AI models locally defines the next frontier of developer productivity. Without desktop-class AI compute, data
The post Nvidia DGX Spark: The New Stack Developer’s Guide appeared first on The New Stack.

These days, malicious actors succeed not by breaking systems, but by blending into them. Increasingly, the intruder looks like a
The post The New Threats: Attackers Don’t Just Break In, They Blend In appeared first on The New Stack.

Logs are essentially a language, and if you throw the right information into a large language model (LLM), it can
The post Start Small and Go Big With Open Source Gonzo for Observability appeared first on The New Stack.
API sprawl hides both security dangers and missed opportunities. If an organization has more APIs than it can easily keep
The post Solving the Problems That Accompany API Sprawl With AI appeared first on The New Stack.

High-performance database system provider ScyllaDB has launched a new database cloud service that promises to rival the performance of DynamoDB
The post ScyllaDB’s New Cloud Challenges DynamoDB Cost, Performance appeared first on The New Stack.

This is an excerpt from Chapter 5 of “AI for the Enterprise: The Playbook for Developing and Scaling Your AI
The post Level 3 AI Collaboration: The Sweet Spot for Developers appeared first on The New Stack.

Tailwind has been in the news lately, as it struggles to keep its doors open in the AI era. But
The post StyleX vs. Tailwind: Meta’s Take on CSS-in-JS Maintainability appeared first on The New Stack.

Two years ago, I started experimenting with AI-assisted development tools. Today, they’re embedded into daily workflows across our engineering organization.
The post Lessons from 2 Years of Integrating AI into Development Workflows appeared first on The New Stack.

Change approval has been a constraining force on software delivery for decades. With teams adopting AI coding assistants and agents,
The post Stop Wasting AI Investment on a Broken Change Approval Process appeared first on The New Stack.

In 2025, the Ada programming language made what might be considered a comeback. (But don’t call it one! Yet.) Last
The post 2025: The Year of the Return of the Ada Programming Language? appeared first on The New Stack.

Anthropic’s $1.5 million investment in Python security is both self-interested and smart, analysts say, addressing a critical vulnerability in the
The post Experts Hail Anthropic’s $1.5M Python Security Commitment appeared first on The New Stack.

Whamm is designed to allow users to instrument their WebAssembly (or Wasm) applications with a programming language or code, or
The post Open Source Whamm: Use WebAssembly To Monitor and Fix Running Apps appeared first on The New Stack.

The ability to connect AI agents to external systems defines their practical utility. Without tools, an agent is limited to
The post How To Choose the Right Tool for Your Google ADK Agent appeared first on The New Stack.

Frontend developer and educator Kent Dodds has some good news for frontend developers confused about which React-based framework to choose.
The post The React Framework Face-Off: Which One Owns the Future? appeared first on The New Stack.
FINRA, the Financial Industry Regulatory Authority, consistently seeks to achieve the highest standards in its technology practices. To elevate its software development lifecycle, FINRA — which oversees member broker-dealers — engaged Google consultants to help apply a metrics-driven methodology to its engineering practices.
DORA is a popular framework for helping organizations improve software delivery performance through capabilities that can be measured by key metrics. These include deployment frequency, change lead time, change failure rate, failed deployment recovery time, and rework.
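To make the metrics concrete, here is a minimal sketch of how four of them (rework is left out) might be computed from deployment records. The `Deployment` record and its field names are hypothetical and invented for this illustration; a real implementation would pull this data from CI/CD and incident systems, and nothing here reflects FINRA's or Google's actual tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    # Hypothetical record shape, for illustration only.
    committed_at: datetime                   # when the change was committed
    deployed_at: datetime                    # when it reached production
    failed: bool = False                     # did it cause a production failure?
    recovered_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_metrics(deploys: list[Deployment], window_days: int) -> dict:
    """Summarize four software delivery metrics over a time window."""
    if not deploys:
        raise ValueError("need at least one deployment")
    failures = [d for d in deploys if d.failed]
    recoveries = [d.recovered_at - d.deployed_at for d in failures if d.recovered_at]
    return {
        "deployment_frequency_per_day": len(deploys) / window_days,
        "median_change_lead_time": median(d.deployed_at - d.committed_at for d in deploys),
        "change_failure_rate": len(failures) / len(deploys),
        "median_failed_deployment_recovery_time": median(recoveries) if recoveries else None,
    }


# Synthetic data: four deployments in a two-day window, one of which failed.
t0 = datetime(2025, 1, 1)
deploys = [
    Deployment(t0, t0 + timedelta(hours=2)),
    Deployment(t0, t0 + timedelta(hours=4), failed=True,
               recovered_at=t0 + timedelta(hours=5)),
    Deployment(t0, t0 + timedelta(hours=6)),
    Deployment(t0, t0 + timedelta(hours=8)),
]
metrics = dora_metrics(deploys, window_days=2)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

Medians are used here instead of means so that a single slow change does not dominate the lead-time figure; that is a design choice of this sketch, not a DORA requirement.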
While FINRA had begun laying the groundwork to adopt DORA internally, the organization recognized an opportunity to accelerate implementation by tapping Google's firsthand experience.
Google conducted a discovery effort alongside technology leaders to identify opportunities for improvement. The recommendation that followed included increasing the existing focus on continuous improvement, adopting a user-centric approach to developing software and further enabling a generative culture within the department.
The implementation itself was deliberately flexible. Rather than recommending a one-size-fits-all approach, Google helped FINRA tailor its actions to individual team objectives. Teams prioritizing product value concentrated on lead time and deployment frequency metrics, while teams focused on stability concentrated on change failure rates and failed deployment recovery time.
Over the first year of implementation, engineering teams demonstrated continuous improvement across DORA capabilities, achieving a 9% per-developer productivity gain and reporting directionally positive developer experience feedback.
Sprint velocities also improved by 5%, enabling smaller engineering teams to deliver greater incremental product value to the business. Beyond raw metrics, teams also reported heightened transparency around delivery performance and appreciation for a standardized methodology.
Looking ahead, FINRA is maturing its DORA practice by providing more granular metrics tied to high-level DORA measurements, increasing emphasis on developer experience and correlating product metrics with software delivery performance indicators.
Want to discover what AI can do for governments, nonprofits, and other public sector organizations? Register to attend our upcoming Gemini for Government webinar on February 5, where we will dive deeper into the transformative technology powering the next wave of innovation across the public sector.
The 2025 State of AI-assisted Software Development report revealed a critical truth: AI is an amplifier. It magnifies the strengths of high-performing organizations and the dysfunctions of struggling ones.
While AI adoption is now near-universal, with 90% of developers using it in their daily workflows, success is not guaranteed. Our cluster analysis of nearly 5,000 technology professionals reveals significant variation in team performance: Not everyone experiences the same outcomes from adopting AI.
From this disparity, we can conclude that how teams use AI is a critical factor. We wanted to understand the particular capabilities and conditions that enable teams to achieve positive outcomes, leading us to develop the DORA AI Capabilities Model report.
This companion guide to the 2025 DORA Report is designed to help you navigate our new reality. It provides actionable strategies, implementation tactics, and measurement frameworks to help technology leaders build an environment where AI thrives.
Successfully using AI requires cultivating your technical and cultural environment. From the same set of respondents who participated in the 2025 DORA survey, we identified seven foundational capabilities that are proven to amplify the positive impact of AI on organizational performance:
[Figure: The DORA AI Capabilities Model shows which capabilities amplify the effect of AI adoption on specific outcomes]
Every organization starts their AI journey differently. To help you prioritize, this report introduces seven distinct team archetypes derived from our cluster analysis. These profiles range from "harmonious high-achievers," who excel in both performance and well-being, to teams facing "foundational challenges" or those stuck in a "legacy bottleneck," where unstable systems undermine morale.
Identifying the profile that best matches your team can help pinpoint the most impactful interventions. For example, a "high impact, low cadence" team might prioritize automation to improve stability, while a team "constrained by process" might focus on reducing friction through a better AI stance.
Once you understand your team's profile, how do you direct your efforts? The report includes a step-by-step facilitation guide for running a Value Stream Mapping (VSM) exercise.
VSM acts as an AI force multiplier. By visualizing your flow from idea to customer, you can identify where work waits and where friction exists. This ensures that the efficiency gains from AI aren't just creating local optimizations that pile up work downstream, but are instead channeled into solving system-level constraints.
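As a rough illustration of what a value stream map lets you calculate, the sketch below computes flow efficiency (active work time divided by total elapsed time) and finds the stage where work waits longest. The stage names and hours are invented for this example and do not come from the report.

```python
from typing import NamedTuple


class Stage(NamedTuple):
    name: str
    work_hours: float  # time someone is actively working on the item
    wait_hours: float  # time the item sits idle before or during this stage


def flow_efficiency(stages: list[Stage]) -> float:
    """Share of total elapsed time that is spent on active work."""
    work = sum(s.work_hours for s in stages)
    wait = sum(s.wait_hours for s in stages)
    return work / (work + wait)


def longest_wait(stages: list[Stage]) -> Stage:
    """The stage where work waits the longest, a candidate system constraint."""
    return max(stages, key=lambda s: s.wait_hours)


# Hypothetical value stream from idea to customer.
stream = [
    Stage("design", work_hours=4, wait_hours=8),
    Stage("code", work_hours=6, wait_hours=2),
    Stage("review", work_hours=1, wait_hours=24),
    Stage("deploy", work_hours=1, wait_hours=4),
]
print(f"flow efficiency: {flow_efficiency(stream):.0%}")
print(f"longest wait: {longest_wait(stream).name}")
```

In this made-up stream, speeding up the coding stage with AI would barely move the total, because most of the elapsed time is spent waiting for review. That is the local-optimization trap described above.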
AI adoption is an organizational transformation. The greatest returns come not from the tools themselves, but from investing in the foundational systems that enable them.
When was the last time you knew — not just hoped — that your disaster recovery plan would work perfectly?
For most of us, the answer is unclear. Sure, you may have a DR plan, a meticulously crafted document stored in a wiki or a shared drive, that gets dusted off for compliance audits or the occasional tabletop drill. You assume its procedures are correct, its contact lists are current, and its dependencies are fully mapped, and you certainly hope it works.
Why wouldn’t it work? One problem is that systems are rarely static anymore. In a world where you deploy new microservices dozens of times per day, make constant configuration changes, and maintain an ever-growing web of third-party API dependencies, the DR plan you wrote last quarter is probably just as useful as one from 10 years ago.
And if the failover does work, will it work well enough to meet the promises you've made to your customers (or board of directors or regulators)? When a key component fails, could you still meet your availability and latency targets, a.k.a. your Service Level Objectives (SLOs)?
So, how do you close this gap between your current aspirational DR plan and a DR plan that you actually have confidence in? The answer isn't to write more documents or run more theatrical drills. The answer is to stop assuming and start proving.
This is where chaos engineering comes in. Unlike what the name might imply, chaos engineering isn’t a tool for recklessly breaking things. Instead, it’s a framework that provides data-driven confidence in your SLOs under stress. By running controlled experiments that simulate real-world disasters like a database failover or a regional outage, you can quantitatively measure the impact of those failures on your systems’ performance. Chaos engineering is how you transform your DR hypotheses into a proven method to ensure resilience. By validating your plan through experimentation, you create tangible evidence, verifying that your plan will safeguard your infrastructure and keep your promises to customers.
In a nutshell, chaos engineering is the practice of running controlled, scientific experiments to find weaknesses in your system before they cause a real outage.
At its core, it’s about building confidence in your system’s resilience. The process starts with understanding your system's steady state, which is its normal, measurable, and healthy output. You can't know the true impact of a failure without first defining what "good" looks like. This understanding allows you to form a clear, testable hypothesis: a statement of belief that your system's steady state will persist even when a specific, turbulent condition is introduced.
To test this hypothesis, you then execute a controlled action, which is a precise and targeted failure injected into the system. This isn't random mischief; it's a specific simulation of real-world failures, such as consuming all CPU on a host (resource exhaustion), adding network latency (network failure), or terminating a virtual machine (state failure). While this action is running, automated probes act as your scientific instruments, continuously monitoring the system's state to measure the effect.
Together, these components form a complete scientific loop: you use a hypothesis to predict resilience, run an experiment by applying an action to simulate adversity, and use probes to measure the impact, turning uncertainty into hard data.
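The loop described above can be sketched in a few lines. This is a toy model under stated assumptions, not a chaos engineering tool: `ToyService`, `run_experiment`, and the helper functions are hypothetical names invented for this illustration, and the "failure" is just a counter being decremented.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToyService:
    """Hypothetical stand-in for a real system: serves while any replica is healthy."""
    healthy_replicas: int = 2

    def handle_request(self) -> bool:
        return self.healthy_replicas > 0


def run_experiment(probe: Callable[[], bool],
                   inject: Callable[[], None],
                   rollback: Callable[[], None]) -> bool:
    """Return True if the steady state persisted while the failure was active."""
    if not probe():
        # Never inject failure into a system that is already unhealthy.
        raise RuntimeError("steady state not met before the experiment; aborting")
    inject()
    try:
        return probe()  # measure the steady state during the failure
    finally:
        rollback()      # always undo the injected failure


service = ToyService(healthy_replicas=2)


def steady_state() -> bool:
    # Steady state for this toy: 100 consecutive requests all succeed.
    return all(service.handle_request() for _ in range(100))


def terminate_replica() -> None:
    service.healthy_replicas -= 1


def restore_replica() -> None:
    service.healthy_replicas += 1


# Hypothesis: losing one of two replicas does not break the steady state.
hypothesis_holds = run_experiment(steady_state, terminate_replica, restore_replica)
print("hypothesis holds:", hypothesis_holds)
```

The two guard rails matter more than the toy itself: the experiment refuses to start if the system is not already in its steady state, and the rollback runs even if the probe raises.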
Now that you understand the building blocks of a chaos experiment, you can build the bridge to your ultimate goal: transforming your DR plan from a document of hope into an evidence-based procedure. The key is to stop seeing your DR plan as a set of instructions and start seeing it for what it truly is: a collection of unproven hypotheses.
When you think about it, every significant statement in your DR document is a claim waiting to be tested. When your plan states, "The database will fail over to the replica in under 5 minutes," that isn't a fact; it's a hypothesis. When it says, "In the event of a regional outage, traffic will be successfully rerouted to the secondary region," that's another hypothesis. Your DR plan is filled with these critical assumptions about how your system should behave under duress. Until you test them, they remain nothing more than educated guesses.
Chaos experiments are the ultimate validation tools, live-fire drills that put your DR hypotheses to a real, empirical test. Instead of just talking through a scenario, you use controlled actions to safely and precisely simulate the disaster. You're no longer asking "what if?"; you're actively measuring "what happens when."
For example, imagine you have a DR plan for a regional outage. When you adopt chaos engineering, you break down that plan into a hypothesis and an experiment:
The hypothesis: "In case our primary region us-central1 becomes unreachable, the load balancers will fail over all traffic to us-east1 within 3 minutes, with an error rate below 1%."
The chaos experiment: Run an action that simulates a regional outage by injecting a "blackhole" that drops all network traffic to and from us-central1 for a limited time. Your probes then measure the actual failover time and error rates to validate the hypothesis.
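Here is one way the probe data from such an experiment might be checked against the hypothesis. This is a sketch under stated assumptions: the `ProbeSample` shape is hypothetical, the sample values are synthetic, and "failover time" is taken to mean the first moment the secondary region serves traffic, which is only one possible definition.

```python
from typing import NamedTuple, Optional


class ProbeSample(NamedTuple):
    t_seconds: int  # seconds since the blackhole was injected
    region: str     # region that served the probe's requests
    requests: int
    errors: int


def evaluate_failover(samples: list[ProbeSample],
                      secondary: str = "us-east1",
                      max_failover_s: int = 180,
                      max_error_rate: float = 0.01) -> dict:
    # Failover time: first sample in which the secondary region serves traffic.
    failover_s: Optional[int] = next(
        (s.t_seconds for s in samples if s.region == secondary), None)
    error_rate = sum(s.errors for s in samples) / sum(s.requests for s in samples)
    return {
        "failover_s": failover_s,
        "error_rate": error_rate,
        "hypothesis_holds": (failover_s is not None
                             and failover_s <= max_failover_s
                             and error_rate < max_error_rate),
    }


# Synthetic probe data, invented for illustration only.
samples = [
    ProbeSample(30, "us-central1", 1000, 40),
    ProbeSample(60, "us-central1", 1000, 35),
    ProbeSample(90, "us-east1", 1000, 5),
    ProbeSample(120, "us-east1", 1000, 0),
]
result = evaluate_failover(samples)
print(result)
```

With this made-up data, traffic moves to us-east1 after 90 seconds, well inside the 3-minute target, yet the overall error rate is about 2%, so the hypothesis as a whole does not hold. A drill that only checked whether the failover happened would have passed.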
In other words, by applying the chaos engineering methodology, you systematically move through your DR plan, turning each assumption into a proven fact. You're not just testing your plan; you're forging it in a controlled fire.
Beyond simply proving system availability, chaos engineering builds trust in your reliability metrics, ensuring that you meet your SLOs even when services become unavailable. An SLO is a specific, acceptable target level of your service's performance measured over a specified period that reflects the user's experience. SLOs aren't just internal goals; they are the bedrock of customer trust and the foundation of your contractual service level agreements (SLAs).
A traditional DR drill might get a "pass" because the backup system came online. But what if it took 20 minutes to fail over, during which every user saw errors? What if the backup region was under-provisioned, and performance became so slow that the service was unusable? From a technical perspective, you "recovered." But from a customer's perspective, you were down.
A chaos experiment, however, can help you answer a critical question: "During a failover, did we still meet our SLOs?" Because your probes are constantly measuring performance against your SLOs, you get the full picture. You don't just see that the database failed over; you see that it took 7 minutes, during which your latency SLO was breached and your error budget was completely burned. This is the crucial, game-changing insight. It shifts the entire goal from simple disaster recovery to SLO preservation, which is what actually determines if a failure was a minor hiccup or a major business-impacting incident. It also provides the data necessary to set goals for system improvement. So the next time you run this experiment, you can measure whether, and by how much, your system resilience has improved, and ultimately whether you can maintain your SLO during the disaster event.
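The error budget arithmetic behind that statement is simple enough to sketch. The SLO targets and the 30-day window below are assumptions chosen for illustration, and the sketch treats the whole failover as full unavailability, which is the worst case.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of full unavailability an availability SLO allows in the window."""
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_burned(outage_minutes: float, slo_target: float,
                  window_days: int = 30) -> float:
    """Fraction of the error budget consumed by an outage (1.0 = fully burned)."""
    return outage_minutes / error_budget_minutes(slo_target, window_days)


# Compare the same 7-minute failover against two assumed availability targets.
for target in (0.999, 0.9999):
    budget = error_budget_minutes(target)
    burned = budget_burned(7, target)
    print(f"SLO {target:.2%}: budget {budget:.1f} min, "
          f"7-minute failover burns {burned:.0%}")
```

Under these assumptions, a 99.9% monthly target leaves roughly 43 minutes of budget, so a 7-minute failover is painful but survivable. A 99.99% target leaves only about 4.3 minutes, so the same failover burns the entire budget and more.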
The journey to resilience doesn't start by simulating a full regional failover. It starts with a single, small experiment. The goal is not to boil the ocean; it's to build momentum. Test one timeout, one retry mechanism, or one graceful error message.
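A first experiment on a retry mechanism can be as small as the sketch below. Everything in it is hypothetical: `FlakyDependency` stands in for a real downstream service, and `call_with_retries` is a bare-bones retry policy written for this example (no backoff or jitter, which a production policy would normally have).

```python
from typing import Callable


class FlakyDependency:
    """Hypothetical dependency that fails a fixed number of calls, then recovers."""

    def __init__(self, failures_to_inject: int) -> None:
        self.failures_to_inject = failures_to_inject
        self.calls = 0

    def call(self) -> str:
        self.calls += 1
        if self.calls <= self.failures_to_inject:
            raise ConnectionError("injected failure")
        return "ok"


def call_with_retries(fn: Callable[[], str], max_attempts: int = 3) -> str:
    """Call fn, retrying on ConnectionError up to max_attempts times in total."""
    last_error: Exception = RuntimeError("max_attempts must be at least 1")
    for _ in range(max_attempts):
        try:
            return fn()
        except ConnectionError as exc:
            last_error = exc
    raise last_error


# Hypothesis: two consecutive failures are absorbed by three attempts.
dependency = FlakyDependency(failures_to_inject=2)
print(call_with_retries(dependency.call), "after", dependency.calls, "calls")

# Counter-experiment: three consecutive failures exceed the retry budget.
try:
    call_with_retries(FlakyDependency(failures_to_inject=3).call)
except ConnectionError:
    print("retry budget exhausted, caller sees the failure")
```

The counter-experiment is as useful as the passing one: it tells you exactly where the retry policy stops protecting the caller.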
The biggest win from your first successful experiment won't be the technical data you gather. It will be the confidence you build. When your team sees that they can safely inject failure, learn from it, and improve the system, their entire relationship with failure changes. Fear is replaced by curiosity. That confidence is the catalyst for building a true, enduring culture of resilience. To learn more and get started with chaos engineering, check out this blog and this podcast. And if you’re ready to get started, but unsure how, reach out to Google Cloud professional services to discuss how we can help.
Earlier this year, we unveiled a big investment in platform and developer team productivity, with the launch of Application Design Center, helping them streamline the design and deployment of cloud application infrastructure, while ensuring applications are secure, reliable, and aligned with best practices. And today, Application Design Center is generally available.
We built Application Design Center to put applications at the center of your cloud experience, with a visual, canvas-style and AI-powered approach to design and modify Terraform-backed application templates. It also offers full lifecycle management that’s aligned with DevOps best practices across application design and deployment.
Application Design Center is a core component of our application-centric cloud experience. When you use Application Design Center to design and deploy your application infrastructure, your applications are easily discoverable, observable, and manageable. Application Design Center works in concert with App Hub to automatically register application deployments, enabling a unified view and control plane for your application portfolio, and Cloud Hub, to provide operational insights for your applications.
“Google Application Design Center is a valuable enabler for Platform Engineering, providing a structured approach to harmonizing resource creation in Google Cloud Platform. By aligning tools, processes, and technologies, it streamlines workflows, reducing friction between development, operations, and other teams. This harmonization enhances collaboration, accelerates delivery, and ensures consistency across Google Cloud environments.” - Ervis Duraj, Principal Engineer, MediaMarktSaturn Technology
Our goal with Application Design Center is for you to innovate more, and administer less. It consists of four key elements to help you minimize administrative overhead and maximize efficiency, so you can design and deploy applications with integrated best practices and essential guardrails. Let’s take a closer look.
1. Terraform components and application templates
Develop applications faster with our growing library of opinionated application templates. These provide well-architected patterns and pre-built components, including innovative "AI inference templates" to help you leverage AI to create dynamic and intelligent application foundations. As an example, at launch, Application Design Center provides opinionated templates for Google Kubernetes Engine (GKE) clusters (Standard, Autopilot and NodePool) to run AI inference workloads using a variety of LLMs, as well as for enterprise-grade production clusters or single-region web app clusters.
You can also ingest and manage your existing Terraform configurations (“Bring your own Terraform”) directly from Git repositories. Once imported, you can use Application Design Center to design with your own Terraform, or in combination with Google-provided Terraform, to create standardized, opinionated infrastructure patterns for sharing and reuse across your application teams.
2. AI-powered design for rapid application designing and prototyping
Application Design Center integrates with Google's Gemini Cloud Assist Design Agent, empowering you to design deployable application infrastructure templates on Google Cloud that you can export as Terraform infrastructure-as-code.
With Gemini Cloud Assist, you can describe your application design intents using natural language. In return, Gemini interactively generates multi-product application template suggestions, complete with visual architecture diagrams and summarized benefits. You can then refine these proposals through multi-turn reasoning or by directly manipulating the architecture within the Application Design Center canvas.
Additionally, all designs that you create with Gemini are automatically observable, optimizable, and enabled for troubleshooting assistance during runtime, thanks to their tight integration with Gemini Cloud Assist.
3. A secure, sharable catalog of application templates with full lifecycle management
Platform admins can curate a collection of application templates built from Google's best-practice components. This provides developers a trusted, self-service experience from which they can quickly discover and deploy compliant applications. Tight integration with Cloud Hub transforms these governed templates into a live operational command center, complete with unified visibility into the health and deployment status of the resulting applications. This closes the critical loop between design and runtime, so that your production environments reflect your organization’s approved architectural standards.
Also, Application Design Center’s robust application template revisions serve as an immutable audit trail. It automatically detects and flags configuration drift between your intended designs and deployed applications, so that developers can remediate unauthorized changes or safely push approved configuration updates. This helps ensure continuous state consistency and compliance from Day 1 and through the subsequent evolution of your application.
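To illustrate the general idea of drift detection (not how Application Design Center implements it, which the post does not describe), here is a minimal sketch that compares an intended configuration with a deployed one. The configuration keys and values are hypothetical.

```python
def detect_drift(intended: dict, deployed: dict, path: str = "") -> list[str]:
    """List the differences between an intended and a deployed configuration."""
    drift = []
    for key in sorted(set(intended) | set(deployed)):
        here = f"{path}.{key}" if path else key
        if key not in deployed:
            drift.append(f"missing: {here}")
        elif key not in intended:
            drift.append(f"unexpected: {here}")
        elif isinstance(intended[key], dict) and isinstance(deployed[key], dict):
            # Recurse into nested sections so the report names the exact setting.
            drift.extend(detect_drift(intended[key], deployed[key], here))
        elif intended[key] != deployed[key]:
            drift.append(f"changed: {here}: {intended[key]!r} -> {deployed[key]!r}")
    return drift


# Hypothetical template revision versus what is actually running.
intended = {
    "cluster": {"mode": "autopilot", "region": "us-central1"},
    "labels": {"team": "payments"},
}
deployed = {
    "cluster": {"mode": "autopilot", "region": "us-east1"},
    "labels": {"team": "payments", "debug": "true"},
}
for finding in detect_drift(intended, deployed):
    print(finding)
```

Each finding names the full path of the setting that drifted, which is what a developer needs in order to either remediate the change or promote it into an approved revision.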
4. GitOps integration automating developers’ day-to-day software design lifecycle tasks
By integrating Application Design Center into existing CI/CD workflows, platform teams empower developers to own the complete software delivery lifecycle right from their IDE. Developers can leverage compliant application and infrastructure (IaC) code using Application Design Center application templates.
Further, every infrastructure decision made through Application Design Center is committed to code, versioned, and auditable. Specifically, developers can download the application IaC template from Application Design Center and import it into their app repos (the single source of truth), clone their repo, and edit the Terraform directly in their local IDEs. Any modifications go through a Git pull request for review. Once approved, this automatically triggers the existing CI/CD setup to build, test, and deploy both app and infra changes in lockstep. This unified approach minimizes friction, enforcing "golden paths" and providing an end-to-end automated pathway from a line of code in the IDE to a fully deployed change in production.
This GA launch is packed with features that users have been asking for. We’re excited to share powerful new capabilities: enterprise-grade governance and security with public APIs and gcloud CLI support; full compatibility with VPC service controls; bring your own Terraform and GitOps support for integration with your existing application patterns and automation pipelines; agentic application patterns using GKE templates (Standard, Autopilot and NodePool); and finally, a simplified onboarding experience with app-managed project support, making Application Design Center an AI-powered engine for your applications on Google Cloud.
To help you get started, Google provides a growing library of curated Google application templates built by experts. These templates combine multiple Google Cloud products and best practices to serve common use cases, which you can configure for deployment, and view as infrastructure as code in-line. Platform teams can then create and securely share the catalogs and collaborate with teammates on designs and self-service deployment for developers. For enterprises with existing Terraform patterns and assets, Application Design Center interoperates by enabling their import and reuse within its native design and configuration experience.
Ready to experience the power of Application Design Center? You can learn more about ADC and get started using the quickstart. You can build your first AI-powered application template in minutes, free of cost, and quickly deploy applications with working code. For deeper insights, explore the comprehensive public documentation. We can't wait to see how you innovate with the Application Design Center!
Editor's note: This blog was updated on Dec. 4, 5, 7, and 12, 2025, with additional guidance on Cloud Armor WAF rule syntax, and WAF enforcement across App Engine Standard, Cloud Functions, and Cloud Run.
Earlier today, Meta and Vercel publicly disclosed two vulnerabilities that expose services built using the popular open-source frameworks React Server Components (CVE-2025-55182) and Next.js to remote code execution risks when used for some server-side use cases. At Google Cloud, we understand the severity of these vulnerabilities, also known as React2Shell, and our security teams have shared their recommendations to help our customers take immediate, decisive action to secure their applications.
The React Server Components framework is commonly used for building user interfaces. On Dec. 3, 2025, CVE.org assigned this vulnerability the identifier CVE-2025-55182. Its official Common Vulnerability Scoring System (CVSS) base score is 10.0, which is rated Critical.
Vulnerable versions: React 19.0, 19.1.0, 19.1.1, and 19.2.0
Patched in React 19.2.1
Fix: https://github.com/facebook/react/commit/7dc903cd29dac55efb4424853fd0442fef3a8700
Announcement: https://react.dev/blog/2025/12/03/critical-security-vulnerability-in-react-server-components
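As a rough illustration of how the version list above could be checked in an inventory script, here is a minimal Python sketch. It only compares exact release versions against the four React versions named in this advisory. The function names are hypothetical, and reading "19.0" as 19.0.0 is an assumption. It does not handle pre-release tags or Next.js, and it is not a substitute for the official announcement.

```python
# Versions this advisory lists as vulnerable. "19.0" is read as 19.0.0 (assumption).
VULNERABLE_REACT = {(19, 0, 0), (19, 1, 0), (19, 1, 1), (19, 2, 0)}


def parse_version(version: str) -> tuple:
    """Turn '19.1' or 'v19.1.0' into a 3-part tuple, padding missing parts with 0.

    Only plain release versions are supported; pre-release tags raise ValueError.
    """
    core = version.strip().lstrip("v")
    parts = [int(p) for p in core.split(".")]
    parts += [0] * (3 - len(parts))
    return (parts[0], parts[1], parts[2])


def is_vulnerable_react(version: str) -> bool:
    """Return True if the installed React version is one the advisory names."""
    return parse_version(version) in VULNERABLE_REACT


if __name__ == "__main__":
    for installed in ["19.0", "19.1.1", "19.2.0", "19.2.1"]:
        status = "VULNERABLE" if is_vulnerable_react(installed) else "not listed"
        print(f"react {installed}: {status}")
```

A check like this only tells you which instances to patch first. The fix remains upgrading to React 19.2.1 or later.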
Next.js is a web development framework that depends on React, and is also commonly used for building user interfaces. (The Next.js vulnerability was referenced as CVE-2025-66478 before being marked as a duplicate.)
Vulnerable versions: Next.js 15.x, Next.js 16.x, Next.js 14.3.0-canary.77 and later canary releases
Patched versions are listed here.
Fix: https://github.com/vercel/next.js/commit/6ef90ef49fd32171150b6f81d14708aa54cd07b2
Announcement: https://nextjs.org/blog/CVE-2025-66478
Google Threat Intelligence Group (GTIG) has also published a new report to help understand the specific threats exploiting React2Shell.
We strongly encourage organizations that manage environments relying on the React and Next.js frameworks to update to the latest version and take the mitigation actions outlined below.
We have created and rolled out a new Cloud Armor web application firewall (WAF) rule designed to detect and block exploitation attempts related to CVE-2025-55182. This new rule is available now and is intended to help protect your internet-facing applications and services that use global or regional Application Load Balancers. We recommend deploying this rule as a temporary mitigation while your vulnerability management program patches and verifies all vulnerable instances in your environment.
For customers using App Engine Standard, Cloud Functions, Cloud Run, Firebase Hosting or Firebase App Hosting, we provide an additional layer of defense for serverless workloads by automatically enforcing platform-level WAF rules that can detect and block the most common exploitation attempts related to CVE-2025-55182.
For Project Shield users, we have deployed WAF protections for all sites and no action is necessary to enable these WAF rules. For long-term mitigation, you will need to patch your origin servers as an essential step to eliminate the vulnerability (see additional guidance below).
Cloud Armor and the Application Load Balancer can be used to deliver and protect your applications and services regardless of whether they are deployed on Google Cloud, on-premises, or on another infrastructure provider. If you are not yet using Cloud Armor and the Application Load Balancer, please follow the guidance further down to get started.
While these platform-level rules and the optional Cloud Armor WAF rules (for services behind an Application Load Balancer) help mitigate the risk from exploits of the CVE, we continue to strongly recommend updating your application dependencies as the primary long-term mitigation.
To configure Cloud Armor to detect and protect from CVE-2025-55182, you can use the cve-canary preconfigured WAF rule leveraging the new ruleID that we have added for this vulnerability. This rule is opt-in only, and must be added to your policy even if you are already using the cve-canary rules.
In your Cloud Armor backend security policy, create a new rule and configure the following match condition:
This can be accomplished from the Google Cloud console by navigating to Cloud Armor and modifying an existing or creating a new policy.
[Figure: Cloud Armor rule creation in the Google Cloud console.]
Alternatively, the gcloud CLI can be used to create or modify a policy with the requisite rule:
Additionally, if you are managing your rules with Terraform, you may implement the rule via the following syntax:
Cloud Armor rules can be configured in preview mode, a logging-only mode to test or monitor the expected impact of the rule without Cloud Armor enforcing the configured action. We recommend that the new rule described above first be deployed in preview mode in your production environments so that you can see what traffic it would block.
Once you verify that the new rule is behaving as desired in your environment, then you can disable preview mode to allow Cloud Armor to actively enforce it.
Cloud Armor per-request WAF logs are emitted as part of the Application Load Balancer logs to Cloud Logging. To see what Cloud Armor’s decision was on every request, load balancer logging first needs to be enabled on a per backend service basis. Once it is enabled, all subsequent Cloud Armor decisions will be logged and can be found in Cloud Logging by following these instructions.
There has been a proliferation of scanning tools designed to help identify vulnerable instances of React and Next.js in your environments. Many of those scanners are designed to identify the version number of relevant frameworks in your servers and do so by crafting a legitimate query and inspecting the response from the server to detect the version of React and Next.js that is running.
Our WAF rule is designed to detect and prevent exploit attempts of CVE-2025-55182. As the scanners discussed above are not attempting an exploit, but sending a safe query to elicit a response revealing indications of the version of the software, the above Cloud Armor rule will not detect or block such scanners.
If the findings of these scanners indicate a vulnerable instance of software protected by Cloud Armor, that does not mean that an actual exploit attempt of the vulnerability will successfully get through your Cloud Armor security policy. Instead, such findings mean that the version of React or Next.js detected is known to be vulnerable and should be patched.
If your workload is already using an Application Load Balancer to receive traffic from the internet, you can configure Cloud Armor to protect your workload from this and other application-level vulnerabilities (as well as DDoS attacks) by following these instructions.
If you are not yet using an Application Load Balancer and Cloud Armor, you can get started with the external Application Load Balancer overview, the Cloud Armor overview, and the Cloud Armor best practices.
If your workload is using Cloud Run, Cloud Run functions, or App Engine and receives traffic from the internet, you must first set up an Application Load Balancer in front of your endpoint to leverage Cloud Armor security policies to protect your workload. You will then need to configure the appropriate controls to ensure that Cloud Armor and the Application Load Balancer can’t be bypassed.
Once you configure Cloud Armor, we recommend consulting our best practices guide. Be sure to account for limitations discussed in the documentation to minimize risk and optimize performance while ensuring the safety and availability of your workloads.
Google Cloud is enforcing platform-level protections across App Engine Standard, Cloud Functions, and Cloud Run to automatically help protect against common exploit attempts of CVE-2025-55182. This protection supplements the protections already in place for Firebase Hosting and Firebase App Hosting.
What this means for you:
Applications deployed to those serverless services benefit from these WAF rules that are enabled by default to help provide a base level of protection without requiring manual configuration.
These rules are designed to block known malicious payloads targeting this vulnerability.
Important considerations:
Patching is still critical: These platform-level defenses are intended to be a temporary mitigation. The most effective long-term solution is to update your application's dependencies to non-vulnerable versions of React and Next.js, and redeploy them.
Potential impacts: While unlikely, if you believe this platform-level filtering is incorrectly impacting your application's traffic, please contact Google Cloud Support and reference issue number 465748820.
While WAF rules provide critical frontline defense, the most comprehensive long-term solution is to patch the underlying frameworks.
While Google Cloud is providing platform-level protections and Cloud Armor options, we urge all customers running React and Next.js applications on Google Cloud to immediately update their dependencies to the latest stable versions (React 19.2.1 or the relevant version of Next.js listed here), and redeploy their services.
This applies specifically to applications deployed on:
Patching your applications is an essential step to eliminate the vulnerability at its source and ensure the continued integrity and security of your services.
We will continue to monitor the situation closely and provide further updates and guidance as necessary. Please refer to our official Google Cloud Security advisories for the most current information and detailed steps.
If you have any questions or require assistance, please contact Google Cloud Support and reference issue number 465748820.
As engineers, we all dream of perfectly resilient systems — ones that scale perfectly, provide a great user experience, and never ever go down. What if we told you the key to building these kinds of resilient systems isn't avoiding failures, but deliberately causing them? Welcome to the world of chaos engineering, where you stress test your systems by introducing chaos, i.e., failures, into a system under a controlled environment. In an era where downtime can cost millions and destroy reputations in minutes, the most innovative companies aren't just waiting for disasters to happen — they're causing them and learning from the resulting failures, so they can build immunity to chaos before it strikes in production.
Chaos engineering is useful for all kinds of systems, but particularly for cloud-based distributed ones. Modern architectures have evolved from monolithic to microservices-based systems, often comprising hundreds or thousands of services. These complex service dependencies introduce multiple points of failure, and it’s difficult if not impossible to predict all the possible failure modes through traditional testing methods. When these applications are deployed on the cloud, they are deployed across multiple availability zones and regions. This increases the likelihood of failure due to the highly distributed nature of cloud environments and the large number of services that coexist within them.
A common misconception is that cloud environments automatically provide application resiliency, eliminating the need for testing. Although cloud providers do offer various levels of resiliency and SLAs for their cloud products, these alone do not guarantee that your business applications are protected. If applications are not designed to be fault-tolerant or if they assume constant availability of cloud services, they will fail when a particular cloud service they depend on is not available.
In short, chaos engineering can take a team's worst "what if?" scenarios and transform them into well-rehearsed responses. Chaos engineering isn’t about breaking systems — engineering chaotically, as it were — it's about building teams that face production incidents with the calm confidence that only comes from having weathered that chaos before, albeit in controlled conditions.
Google Cloud’s Professional Services Organization (PSO) Enterprise Architecture team consults on and provides hands-on expertise for customers’ cloud transformation journeys, including application development, cloud migrations, and enterprise architecture. When advising on designing resilient architecture for cloud environments, we routinely introduce the principles and practices of chaos engineering and Site Reliability Engineering (SRE).
In this first blog post in a series, we explain the basics of chaos engineering — what it is and its core principles and elements. We then explore how chaos engineering is particularly helpful and important for teams running distributed applications in the cloud. Finally, we’ll talk about how to get started, and point you to further resources.
Chaos engineering is a methodology invented by Netflix in 2010 when it created and popularized ‘Chaos Monkey’ to address the need to build more resilient and reliable systems in the face of increasing complexity in their AWS environment. Around the same time, Google introduced Disaster Resilience Testing, or DiRT, which enabled continuous and automated disaster readiness, response, and recovery of Google’s business, systems, and data. Here on Google Cloud’s PSO team, we offer various services to help customers implement DiRT as part of SRE practices. These offerings also include training on how to perform DiRT on applications and systems operating on Google Cloud. The central concept is straightforward: deliberately introduce controlled disruptions into a system to identify vulnerabilities, evaluate its resilience, and enhance its overall reliability.
As a proactive discipline, chaos engineering enables organizations to identify weaknesses in their systems before they lead to significant outages or failures, where a system includes not only the technology components but also the people and processes of an organization. By introducing controlled, real-world disruptions, chaos engineering helps test a system's robustness, recoverability, and fault tolerance. This approach allows teams to uncover potential vulnerabilities, so that systems are better equipped to handle unexpected events and continue functioning smoothly under stress.
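The central idea described above (verify a steady state, inject a controlled fault, observe, then roll back) can be sketched in a few lines of Python. This is a toy model under stated assumptions, not a real chaos tool: `ToyService`, `steady_state`, and `run_experiment` are hypothetical names, and the "fault" is just a flag on an in-memory object.

```python
from dataclasses import dataclass


@dataclass
class ToyService:
    """Stand-in for a real system: a primary backend plus an optional fallback."""
    primary_up: bool = True
    has_fallback: bool = True

    def handle_request(self) -> bool:
        """Return True if the request succeeds."""
        if self.primary_up:
            return True
        return self.has_fallback  # degraded path when the primary is down


def steady_state(service: ToyService, requests: int = 100,
                 min_success: float = 0.99) -> bool:
    """Steady-state hypothesis: at least min_success of requests succeed."""
    successes = sum(service.handle_request() for _ in range(requests))
    return successes / requests >= min_success


def run_experiment(service: ToyService) -> dict:
    """Check steady state, inject a fault, re-check, and always roll back."""
    result = {"steady_before": steady_state(service)}
    if not result["steady_before"]:
        return result  # never inject faults into an already unhealthy system
    service.primary_up = False  # controlled disruption
    try:
        result["steady_during"] = steady_state(service)
    finally:
        service.primary_up = True  # rollback runs even if the check raises
    result["steady_after"] = steady_state(service)
    return result


if __name__ == "__main__":
    print("with fallback:   ", run_experiment(ToyService(has_fallback=True)))
    print("without fallback:", run_experiment(ToyService(has_fallback=False)))
```

The service without a fallback fails the hypothesis during the fault, which is exactly the kind of weakness an experiment is meant to surface before a real outage does.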
Chaos engineering is guided by a set of core principles about why it should be done, while practices define what needs to be done.
Below are the principles of chaos engineering:
With these principles established, follow these practices when conducting a chaos engineering experiment:
In other words, chaos engineering isn't about breaking things for the sake of it, but about building more resilient systems by understanding their limitations and addressing them proactively.
Here are the core elements you'll use in a chaos engineering experiment, derived from these five principles:
Now that you have a good understanding of chaos engineering and why to use it in your cloud environment, the next step is to try it out for yourself in your own development environment.
There are multiple chaos engineering solutions in the market; some are paid products and some are open-source frameworks. To get started quickly, we recommend that you use Chaos Toolkit as your chaos engineering framework.
Chaos Toolkit is an open-source framework written in Python that provides a modular architecture where you can plug in other libraries (also known as ‘drivers’) to extend your chaos engineering experiments. For example, there are extension libraries for Google Cloud, Kubernetes, and many other technologies. Since Chaos Toolkit is a Python-based developer tool, you can begin by configuring your Python environment. You can find a good example of a Chaos Toolkit experiment and step-by-step explanation here.
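To give a sense of what a Chaos Toolkit experiment looks like, the sketch below builds a small experiment definition in Python and serializes it to JSON. The top-level keys (`steady-state-hypothesis`, `method`, `rollbacks`) follow the Chaos Toolkit experiment format. The health URL and the `stop_backend.sh` and `start_backend.sh` scripts are hypothetical placeholders, not part of any recipe mentioned in this post.

```python
import json


def build_experiment(health_url: str) -> dict:
    """Build a minimal Chaos Toolkit experiment as a plain dictionary."""
    return {
        "version": "1.0.0",
        "title": "Service stays healthy when one backend is stopped",
        "description": "Hypothetical example: stop a backend, then probe health.",
        # Probed before and after the method to verify the system's normal behavior.
        "steady-state-hypothesis": {
            "title": "Health endpoint returns HTTP 200",
            "probes": [
                {
                    "type": "probe",
                    "name": "health-endpoint-responds",
                    "tolerance": 200,  # expected HTTP status code
                    "provider": {"type": "http", "url": health_url, "timeout": 3},
                }
            ],
        },
        # The fault to inject (hypothetical script).
        "method": [
            {
                "type": "action",
                "name": "stop-backend",
                "provider": {"type": "process", "path": "./stop_backend.sh"},
            }
        ],
        # Undo the fault once the experiment finishes (hypothetical script).
        "rollbacks": [
            {
                "type": "action",
                "name": "start-backend",
                "provider": {"type": "process", "path": "./start_backend.sh"},
            }
        ],
    }


experiment = build_experiment("http://localhost:8080/health")
experiment_json = json.dumps(experiment, indent=2)

if __name__ == "__main__":
    print(experiment_json)
```

Saving the printed JSON to a file such as `experiment.json` and running `chaos run experiment.json` would execute it, assuming Chaos Toolkit is installed and the placeholder scripts exist in your environment.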
Finally, to enable Google Cloud customers and engineers to introduce chaos testing in their applications, we’ve created a series of Google Cloud-specific chaos engineering recipes. Each recipe covers a specific scenario to introduce chaos in a particular Google Cloud service. For example, one recipe covers introducing chaos in an application/service running behind a Google Cloud internal or external application load balancer; another recipe covers simulating a network outage between an application running on Cloud Run and connecting to a Cloud SQL database by leveraging another Chaos Toolkit extension named ToxiProxy.
You can find a complete collection of recipes, including step-by-step instructions, scripts, and sample code, to learn how to introduce chaos engineering in your Google Cloud environment on GitHub. Then, stay tuned for subsequent posts, where we’ll talk about chaos engineering techniques, such as how to introduce faults into your Google Cloud environment.
Today, we are excited to announce the 2025 DORA Report: State of AI-assisted Software Development. The report draws on insights from over 100 hours of qualitative data and survey responses from nearly 5,000 technology professionals from around the world.
The report reveals a key insight: AI doesn't fix a team; it amplifies what's already there. Strong teams use AI to become even better and more efficient. Struggling teams will find that AI only highlights and intensifies their existing problems. The greatest return comes not from the AI tools themselves, but from a strategic focus on the quality of internal platforms, the clarity of workflows, and the alignment of teams.
As we established in the 2024 report, as well as in the special report published this year called “Impact of Generative AI in Software Development”, organizations are continuing to heavily adopt AI and receive substantial benefits across important outcomes. There is also evidence that people are learning to better integrate these tools into their workflows. Unlike last year, we observe a positive relationship between AI adoption and both software delivery throughput and product performance. It appears that people, teams, and tools are learning where, when, and how AI is most useful. However, AI adoption does continue to have a negative relationship with software delivery stability.
This confirms our central theory: AI accelerates software development, but that acceleration can expose weaknesses downstream. Without robust control systems, like strong automated testing, mature version control practices, and fast feedback loops, an increase in change volume leads to instability. Teams working in loosely coupled architectures with fast feedback loops see gains, while those constrained by tightly coupled systems and slow processes see little or no benefit.
Key findings from the 2025 report
Beyond this central theme, this year’s research highlighted the following about modern software development:
AI adoption is near-universal: 90% of survey respondents report using AI at work. More than 80% believe it has increased their productivity. However, skepticism remains as 30% report little or no trust in the code generated by AI, a slightly lower percentage than last year but a key trend to note.
User-centricity is a prerequisite for AI success: AI becomes most useful when it's pointed at a clear problem, and a user-centric focus provides that essential direction. Our data shows this focus amplifies AI’s positive influence on team performance.
Platform engineering is the foundation: Our data shows that 90% of organizations have adopted at least one platform, and there is a direct correlation between a high-quality internal platform and an organization’s ability to unlock the value of AI, making it an essential foundation for success.
Simple software delivery metrics alone aren’t sufficient. They tell you what is happening but not why it’s happening. To connect performance data to experience, we conducted a cluster analysis that reveals seven common team profiles or archetypes, each with a unique interplay of performance, stability, and well-being. This model provides leaders with a way to diagnose team health and apply the right interventions.
The ‘Foundational challenges’ group is trapped in survival mode and faces significant gaps in its processes and environment, leading to low performance, high system instability, and high levels of burnout and friction. By contrast, the ‘Harmonious high achievers’ excel across multiple areas, showing positive metrics for team well-being, product outcomes, and software delivery.
Read more details of each archetype in the "Understanding your software delivery performance: A look at seven team profiles" chapter of the report.
This year, we went beyond identifying AI’s impact to investigating the conditions in which AI-assisted technology professionals realize the best outcomes. The value of AI is unlocked not by the tools themselves, but by the surrounding technical practices and cultural environment.
Our research identified seven capabilities that are shown to magnify the positive impact of AI in organizations.
One of the key insights derived from the research this year is that the value of AI will be unlocked by reimagining the system of work it inhabits. Technology leaders should treat AI adoption as an organizational transformation.
Here’s where we suggest you begin:
Clarify and socialize your AI policies
Connect AI to your internal context
Prioritize foundational practices
Fortify your safety nets
Invest in your internal platform
Focus on your end-users
The DORA research program is committed to serving as a compass to teams and organizations as we navigate the important and transformative period with AI. We hope the new team profiles and the DORA AI capabilities model provide a clear roadmap for you to move beyond simply adopting AI to unlocking its value by investing in teams and people. We look forward to learning how you put these insights into practice. To learn more:
What guides your approach to software development? In our roles at Google, we’re constantly working to build better software, faster. Within Google, our Developer Platform team and Google Cloud have a strategic partnership and a shared strategy: together, we take our internal capabilities and engineering tools and package them up for Google Cloud customers.
At the heart of this is understanding the many ways that software teams, big and small, need to balance efficiency, quality, and cost, all while delivering value. In our recent talk at PlatformCon 2025, we shared key parts of our platform strategy, which we call “shift down.”
Shift down is an approach that advocates for embedding decisions and responsibilities into underlying internal developer platforms (IDPs), thereby reducing the operational burden on developers. This contrasts with the DevOps trend of "shift left," which pushes more effort earlier into the development cycle, a method that is proving difficult at scale due to the sheer volume and rate of change in requirements. Our shift down strategy helps us maximize value with existing resources so businesses can achieve high innovation velocity with acceptable quality, acceptable risk, and sustainable costs across a diverse range of business models. In the talk, we share learnings that have been really helpful to us in our software and platform engineering journey:
6. Divide up the problem space by identifying different platform and ecosystem types.
Because the developer experience and platform infrastructure change with scale and degree of shifting down, it’s not enough to just know where the ecosystem effectiveness zone is — you have to identify the ecosystem by type. We differentiate ecosystem types by the degree of oversight and assurance for quality attributes. As an ecosystem becomes more vertically integrated, such as Google's highly optimized "Assured" (Type 4) ecosystem, the platform itself assumes increasing responsibility for vital quality attributes, allowing specialists like site reliability engineers (SRE) and security teams to have full ownership in taking action through large-scale observability and embedded capabilities. Conversely, in less uniform "YOLO," "AdHoc," or "Guided" (Type 0-2) ecosystems, developers have more responsibility for assuring these attributes, while central specialist teams have less direct control and enforcement mechanisms are less pervasive. It’s really important to note here that this is not a maturity model — the best ecosystem and platform type is the one that best fits your business need (see point #1 above!).
The most important takeaway is to make active choices. Tailor platform engineering for each business unit and application to achieve the best outcomes. Place critical emphasis on identifying and solving stable sub-problems in reliable, reusable ways across various business problems. This approach directly underpins our "shift down" strategy, moving toward composable platforms that embed decisions and responsibilities for software quality directly into the underlying platform infrastructure, thereby improving our ability to maximize business value with the right resources, at the right quality level, and with sustainable costs.
Watch our full discussion for more insights on effective platform engineering.
Application owners are looking for three things when they think about optimizing cloud costs:
What are the most expensive resources?
Which resources are costing me more this week or month?
Which resources are poorly utilized?
To help you answer these questions quickly and easily, we announced Cloud Hub Optimization and Cost Explorer, in private preview, at Google Cloud Next 2025. And today, we are excited to announce that both Cloud Hub Optimization and Cost Explorer are now in public preview.
As an app owner, your primary objective is keeping your application healthy at all times. Yet monitoring all the individual components of your application, which may straddle dozens of Projects, can be quite overwhelming. App Hub applications allow you to reorganize your cloud resources around your application, giving you the information and controls you need at your fingertips.
In addition to supporting Google Cloud Projects, Cloud Hub Optimization and Cost Explorer leverage App Hub applications to show you the cost-efficiency of your application’s workloads and services instantly. This is useful, for instance, when you are trying to pinpoint deployments running on GKE clusters that might be wasting valuable resources, such as GPUs.
When you bring up Cloud Hub Optimization, you can immediately see the resources that are costing you the most, along with the percentage change in their cost. With this highly granular cost information, you can now attribute your costs to specific resources and resource owners to reason about any changes in costs.
We have additionally integrated granular cost data from Cloud Billing and resource utilization data from Cloud Monitoring to give you a comprehensive picture of your cost efficiency. This includes average vCPU utilization for your Project, which helps you find the most promising optimization candidates across hundreds of Google Cloud Projects.
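To make the reasoning concrete, here is a small Python sketch of the kind of calculation described above: ranking resources by cost, computing the percentage change between two periods, and flagging low average vCPU utilization. It is illustrative only. The record fields, the sample figures, and the 20% utilization threshold are all assumptions, and it does not call or represent any Cloud Hub API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResourceCost:
    name: str
    current_cost: float          # spend in the current period
    previous_cost: float         # spend in the previous period
    avg_vcpu_utilization: float  # fraction between 0.0 and 1.0


def percent_change(current: float, previous: float) -> Optional[float]:
    """Percentage change between periods; None when there is no baseline."""
    if previous == 0:
        return None
    return (current - previous) * 100.0 / previous


def rank_by_cost(resources, low_utilization: float = 0.2) -> list:
    """Most expensive first, with cost change and an optimization flag."""
    rows = []
    for r in sorted(resources, key=lambda r: r.current_cost, reverse=True):
        rows.append({
            "name": r.name,
            "cost": r.current_cost,
            "change_pct": percent_change(r.current_cost, r.previous_cost),
            "optimization_candidate": r.avg_vcpu_utilization < low_utilization,
        })
    return rows


if __name__ == "__main__":
    # Made-up sample data for illustration.
    sample = [
        ResourceCost("project-b", 300.0, 400.0, 0.65),
        ResourceCost("project-a", 1200.0, 1000.0, 0.10),
        ResourceCost("project-c", 50.0, 0.0, 0.05),
    ]
    for row in rank_by_cost(sample):
        print(row)
```

Combining cost and utilization in one view is what separates an expensive but busy resource from an expensive and idle one.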
The Cost Explorer dashboard also shows you your costs logically organized at the product level, for even more cost explainability. Instead of seeing a lump sum cost for Compute Engine, you can now see your exact spend on individual products including Google Kubernetes Engine (GKE) clusters, Persistent Disks, Cloud Load Balancing, and more.
Customers who have tried these new tools love the information that is surfaced as well as the simplicity of the interfaces.
“My team has to keep an eye on cloud costs across tens of business units and hundreds of developers. The Cloud Hub Optimization and Cost Explorer dashboards are a force multiplier for my team as they tell us where to look for cost savings and potential optimization opportunities.” - Frank Dice, Principal Cloud Architect, Major League Baseball
Customers especially appreciate the breadth of product coverage available out of the box without any additional setup, and the fact that there is no additional charge to using these features.
As your organization “shifts left” on cloud cost management, we are working to help application owners and developers understand and optimize their cloud costs. You can try Cloud Hub Optimization and Cost Explorer here.
You can also see a live demo of how Cloud Hub Optimization and Cost Explorer can be used to identify underutilized GKE clusters within seconds in the Google Cloud Next 2025 talk Maximize Your Cloud ROI.
Major League Baseball trademarks and copyrights are used with permission of Major League Baseball. Visit MLB.com.
Are you ready to unlock the power of Google Cloud and want guidance on how to set up your environment effectively? Whether you're a cloud novice or part of an experienced team looking to migrate critical workloads, getting your foundational infrastructure right is the key to success. That's where Google Cloud Setup comes in — your guided pathway to a secure cloud foundation and quick start on Google Cloud.
Google Cloud Setup helps you quickly implement Google Cloud's recommended best practices. Our goal is to provide a fast and easy path to deploying your workloads without unnecessary configuration effort. Think of it as your expert guide, walking you through the essential first steps so you can focus on what truly matters: rapidly deploying your innovative applications and services. To help you get started without financial barriers, all components and service integrations enabled during the setup process are free or include some level of no-cost access.
We understand that every organization and project has unique requirements. That's why Cloud Setup offers three distinct guided flows to choose from:
Proof-of-concept: Designed for users who want to set up a lightweight environment to explore Google Cloud and run initial tests or sandbox workloads. This flow focuses on the minimum configuration to get you started quickly.
Production: This flow is recommended for supporting production-ready workloads with security and scalability in mind. It aligns with Google Cloud’s best practices and is tailored for administrators setting up basic foundational infrastructure for production workloads.
Enhanced security: Designed for organizations, regions or workloads with advanced security and compliance requirements, this flow defaults to more advanced security controls and is designed to help you meet rigorous requirements. Even this advanced foundation sets you up with a perpetual free tier up to certain usage limits.
Cloud Setup guides you through a series of onboarding steps, presenting defaults backed by Google Cloud best practices. Throughout the process, you'll also encounter key features designed to help protect your organization and prepare it for growth, including:
Cloud KMS AutoKey: Automates the provisioning and assignment of customer-managed encryption keys (CMEK).
Security Command Center: Provides security posture management for Google Cloud deployments including automatic project scanning for security issues such as open ports and misconfigured access controls.
Centralized Logging and Monitoring: Enables you to easily set up infrastructure to monitor your system's health and performance from a central location — critical for audit logging compliance and visualizing metrics across projects.
Shared VPC Networks: Allows you to establish a centralized network across multiple projects, enabling secure and efficient communication between your Google Cloud resources.
Hybrid Connectivity: Facilitates connecting your Google Cloud environment to your on-premises infrastructure or other cloud providers. This is often a critical step for workload migrations.
Support plan: Enables you to quickly resolve any issues with help from experts at Google Cloud.
At the end of the guided flow, you can deploy your configuration directly via the Google Cloud console or download a Terraform configuration file for later deployment using other Infrastructure as Code (IaC) methods.
Organizations using Cloud Setup enjoy:
Faster application deployment: By simplifying the initial setup, you can get your applications up and running more quickly, accelerating your cloud journey.
Reduced setup effort: Our streamlined flow significantly reduces the number of manual steps, allowing you to establish a basic foundation with less effort.
Greater access to Google Cloud's full potential: By establishing a solid foundation quickly, you can more easily explore and leverage a wider range of Google Cloud services to meet your evolving needs and unlock greater value.
Ready to start your Google Cloud journey? Visit Google Cloud Setup today for a streamlined path to a secure cloud foundation. Let us guide you through the initial steps so you can focus on innovation and growth.
To learn more, visit:
Cloud Setup overview (requires login)
As developers and operators, you know that having access to the right information in the proper context is crucial for effective troubleshooting. This is why organizations invest a lot upfront curating monitoring resources across different business units: so information is easy to find and contextualize when needed.
Today we are reducing the need for this upfront investment with an out-of-the-box Application Monitoring experience for your organization on Google Cloud within Cloud Observability.
Application Monitoring consists of a set of pre-curated dashboards with relevant metrics and logs mapped to a user-defined application in App Hub. It incorporates best practices pioneered by Google Site Reliability Engineers (SRE) to optimize manual troubleshooting and unlock AI-assisted troubleshooting.
Application Monitoring automatically labels and brings together key telemetry for your application into a centralized experience, making it easy to discover, filter and correlate trends. It also feeds application context into Gemini Cloud Assist Investigations, for AI-assisted troubleshooting.
No more spending hours configuring application dashboards.
From the moment you describe your application in App Hub, Application Monitoring starts to automatically build dashboards tailored to your environment. Each dashboard comprises relevant telemetry for your application and is searchable, filterable and ready for deep dives — no configuration required.
The dashboards offer an overview of charts detailing the SRE Four Golden Signals: traffic, latency, error rate, and saturation. This provides a high-level view of application performance, integrating automatically collected system metrics across various services and workloads such as load balancers, Cloud Run, GKE workloads, MIGs, and databases. From this overview, you can then drill down into services or workloads with performance issues or active alerts to access detailed metrics and logs.
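As a rough illustration of what the four golden signals measure, the sketch below computes them from a window of request records. It is not how Application Monitoring derives its charts; the Request type, the golden_signals function, and the use of CPU as the saturation measure are assumptions made for this example.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class Request:
    latency_ms: float  # time taken to serve the request
    status: int        # HTTP status code


def golden_signals(requests, window_s, cpu_used, cpu_limit):
    """Summarize one time window into the four golden signals.

    Hypothetical helper: CPU usage stands in for saturation here, and
    cpu_limit is assumed to be greater than zero.
    """
    total = len(requests)
    errors = sum(1 for r in requests if r.status >= 500)
    latencies = sorted(r.latency_ms for r in requests)
    if total >= 2:
        # quantiles(n=20) returns 19 cut points; the last is the 95th percentile.
        p95 = quantiles(latencies, n=20)[-1]
    else:
        p95 = float(latencies[0]) if latencies else 0.0
    return {
        "traffic_rps": total / window_s,
        "latency_p95_ms": p95,
        "error_rate": errors / total if total else 0.0,
        "saturation": cpu_used / cpu_limit,
    }
```

A real dashboard would read these values from collected system metrics rather than from raw request records.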
For example, in the image below, a user has defined an App Hub application called Cymbal BnB app, with multiple services and workloads. The flow shows the automatically generated experience with golden signals, alerts, and relevant logs.
Figure 1 - A user’s flow from an App Hub defined application (i.e. Cymbal BnB) to the automatic prebuilt Application Monitoring experience in Cloud Observability
See application labels propagated seamlessly across Google Cloud
Once Application Monitoring is enabled, your application labels are propagated across Google Cloud, so you can see and use them to filter and focus on the most essential signals across the logs, metrics and trace explorers.
Figure 2 - Logs Explorer showing application automatically tagged with application labels
Figure 3 - Metrics Explorer showing application labels automatically associated with metrics
Figure 4 - Trace Explorer showing App Hub label integration
Troubleshoot issues faster with AI powered Investigations.
Gemini Cloud Assist’s investigation feature makes it easier to troubleshoot issues because application boundaries and relationships have been propagated into the AI model, grounding it in context about your environment.
Figure 5 - Seamless entry point into Gemini Cloud Assist powered Investigations from application logs
Note - Gemini Cloud Assist Investigations is currently in private preview
The new Application Monitoring experience provides a low-effort unified view of application and infrastructure performance for your troubleshooting needs.
Take advantage of the new Google Cloud Application Monitoring experience by:
Visiting your Cloud console
Adding Services and Workloads to your Application
Navigating to Application Monitoring in Cloud Observability to see your automatically built experience
Enabling your Gemini Cloud Assist SKU and signing up for the trusted tester program to get access to the Investigations experience
Application Monitoring docs
AppHub docs
At Google Cloud, we are committed to making it as seamless as possible for you to build and deploy the next generation of AI and agentic applications. Today, we’re thrilled to announce that we are collaborating with Docker to drastically simplify your deployment workflows, enabling you to bring your sophisticated AI applications from local development to Cloud Run with ease.
Previously, bridging the gap between your development environment and managed platforms like Cloud Run required you to manually translate and configure your infrastructure. Agentic applications that use MCP servers and self-hosted models added additional complexity.
The open-source Compose Specification is one of the most popular ways for developers to iterate on complex applications in their local environment, and is the basis of Docker Compose. Now, gcloud run compose up brings the simplicity of Docker Compose to Cloud Run, automating this entire process. With this feature, now in private preview, you can deploy your existing compose.yaml file to Cloud Run with a single command, including building containers from source and leveraging Cloud Run’s volume mounts for data persistence.
Supporting the Compose Specification with Cloud Run makes for easy transitions across your local and cloud deployments, where you can keep the same configuration format, ensuring consistency and accelerating your dev cycle.
“We’ve recently evolved Docker Compose to support agentic applications, and we’re excited to see that innovation extend to Google Cloud Run with support for GPU-backed execution. Using Docker and Cloud Run, developers can now iterate locally and deploy intelligent agents to production at scale with a single command. It’s a major step forward in making AI-native development accessible and composable. We’re looking forward to continuing our close collaboration with Google Cloud to simplify how developers build and run the next generation of intelligent applications.” - Tushar Jain, EVP Engineering and Product, Docker
Support for the Compose Specification isn’t the only AI-friendly innovation you’ll find in Cloud Run. We recently announced general availability of Cloud Run GPUs, removing a significant barrier to entry for developers who want access to GPUs for AI workloads. With its pay-per-second billing, scale to zero, and rapid scaling (a time-to-first-token of approximately 19 seconds for a gemma3:4b model), Cloud Run is a great hosting solution for deploying and serving LLMs.
This also makes Cloud Run a strong solution for Docker’s recently announced OSS MCP Gateway and Model Runner, making it easy for developers to take AI applications from local development to production in the cloud. Because Cloud Run supports Docker’s recent addition of ‘models’ to the open Compose Spec, you can deploy these complex solutions to the cloud with a single command.
Let's review the compose file for the above demo. It consists of a multi-container application (defined in services) built from sources and leveraging a storage volume (defined in volumes). It also uses the new models attribute to define AI models and a Cloud Run-extension defining the runtime image to use:
We’re committed to offering developers maximum flexibility and choice by adopting open standards and supporting various agent frameworks. This collaboration on Cloud Run and Docker is another example of how we aim to simplify the process for developers to build and deploy intelligent applications.
Compose Specification support is available for our trusted users — sign up here for the private preview.
Editor's note: This is part one of the story. After you’re finished reading, head over to part two.
In 2017, John Lewis, a major UK retailer with a £2.5bn annual online turnover, was hampered by its monolithic e-commerce platform. This outdated approach led to significant cross-team dependencies, cumbersome and infrequent releases (monthly at best), and excessive manual testing, all further hindered by complex on-premises infrastructure. What was needed were some bold decisions to drive a quick and significant transformation.
The John Lewis engineers knew there was a better way. Working with Google Cloud, they modernized their e-commerce operations with Google Kubernetes Engine. They started with the frontend, and started to see results fast: the frontend was moved onto Google Cloud in mere months, releases to the frontend browser journey started to happen weekly, and the business gladly backed expansion into other areas.
At the same time, the team had a broader strategy in mind: to take a platform engineering approach, creating many product teams who built their own microservices to replace the functionality of the legacy commerce engine, as well as creating brand new experiences for customers.
And so The John Lewis Digital Platform was born. The vision was to empower development teams and arm them with the tools and processes they needed to go to market fast, with full ownership of their own business services. The team’s motto? "You Build It. You Run It. You Own It." This decentralization of development and operational responsibilities would also enable the team to scale.
This article features insights from Principal Platform Engineer Alex Moss, who delves into their strategy, platform build, and key learnings of John Lewis’ journey to modernize and streamline its operations with platform engineering — so you can begin to think about how you might apply platform engineering to your own organization.
In order to make this happen, John Lewis needed to adopt a multi-tenant architecture, with one tenant for each business service. This allows each owning team to work independently without risk to others, which in turn permits the Platform team to give each team a greater degree of freedom.
Knowing that the business' primary objective was to greatly increase the number of product teams helped inform our initial design thinking, positioning ourselves to enable many independent teams even though we only had a handful of tenants.
This foundational design has served us very well and is largely unchanged now, seven years later. Central to the multi-tenant concept is what we chose to term a "Service" — a logical business application, usually composed of several microservices plus components for storing data.
We largely position our platform as a “bring your own container” experience, but encourage teams to make use of other Google Cloud services, particularly for handling state. Adopting services like Firestore and Pub/Sub reduces the complexity that our platform team has to work with, particularly for areas like resilience and disaster recovery. We also favor Kubernetes over compute products like Cloud Run because it strikes the right balance for us between giving development teams freedom and allowing our platform to drive certain behaviours (e.g., the right level of guardrails) without introducing too much friction.
On our platform, Product Teams (i.e., tenants) have a large amount of control over their own Namespaces and Projects. This allows them to prototype, build, and ultimately operate, their workloads without dependency on others — a crucial element of enabling scale.
Our early-adopter teams were extremely helpful in evolving the platform; they were accepting of the lack of features and willing to develop their own solutions, and provided very rich feedback on whether we were building something that met their needs.
The first tenant to adopt the platform was rebuilding the johnlewis.com search capability, replacing a commercial-off-the-shelf solution. This team was staffed with experienced engineers familiar with modern software development and the advantages of a microservice-based architecture. They quickly identified the need for supporting services for their application to store data and asynchronously communicate between their components. They worked with the Platform Team to identify options, and were on board with our desire to lean into Google Cloud native services to avoid running our own databases or messaging. This led to us adopting Cloud Datastore and Pub/Sub for our first features that extended beyond Google Kubernetes Engine.
A risk with a platform that allows very high team autonomy is that it can turn into a bit of a wild-west of technology choices and implementation patterns. To handle this, but to do so in a way that remained developer-centric, we adopted the concept of a paved road, analogous to a “golden path.”
We found that the paved road approach made it easier to:
build useful platform features to help developers do things rapidly and safely
share approaches and techniques, and for engineers to move between teams
demonstrate to the wider organisation that teams are following required practices (which we do by building assurance capabilities, not by gating release)
The concept of the paved road permeates most of what the platform builds, and has inspired other areas of the John Lewis Partnership beyond the John Lewis Digital space.
Our paved road is powered by two key features to enable simplification for teams:
The Paved Road Pipeline. This operates on the whole Service and drives capabilities such as Google Cloud resource provisioning and observability tools.
The Microservice CRD. As the name implies, this is an abstraction at the microservice level. The majority of the benefit here is in making it easier for teams to work with Kubernetes.
Whilst both features were created with the developer experience in mind, we discovered that they also hold a number of benefits for the platform team too.
The Paved Road Pipeline is driven by a configuration file — in yaml (of course!) — which we call the Service Definition. This allows the team that owns the tenancy to describe, through easy-to-reason-about configuration, what they would like the platform to provide for them. Supporting documentation and examples help them understand what can be achieved. Pushes to this file then drive a CI/CD pipeline for a number of platform-owned jobs, which we refer to as provisioners. These provisioners are microservices-like themselves in that they are independently releasable and generally focus on performing one task well. Here are some examples of our provisioners and what they can do:
Our product teams are therefore freed from the need to familiarize themselves deeply with how Google Cloud resource provisioning works, or Infrastructure-as-Code (IaC) tooling for that matter. Our preferred technologies and good practices can be curated by our experts, and developers can focus on building differentiating software for the business, while remaining fully in control of what is provisioned and when.
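The flow described above, in which a Service Definition file drives a set of independent provisioners, might be sketched as follows. This is a minimal illustration under stated assumptions, not the John Lewis implementation: the section names (pubsub_topics, dashboards), the provisioner functions, and run_pipeline are all hypothetical, and the provisioners only report what they would do rather than calling Google Cloud.

```python
# Hypothetical Service Definition, shown as the dict a YAML parser would return.
SERVICE_DEFINITION = {
    "service": "search",
    "pubsub_topics": ["product-updates"],
    "dashboards": ["golden-signals"],
}


def provision_pubsub(service, topics):
    # A real provisioner would call Google Cloud; this one only reports intent.
    return [f"{service}: ensure Pub/Sub topic '{t}'" for t in topics]


def provision_dashboards(service, dashboards):
    return [f"{service}: ensure dashboard '{d}'" for d in dashboards]


# Each provisioner owns one section of the file and focuses on one task.
PROVISIONERS = {
    "pubsub_topics": provision_pubsub,
    "dashboards": provision_dashboards,
}


def run_pipeline(definition):
    """Run every provisioner whose section appears in the definition."""
    service = definition["service"]
    actions = []
    for section, provisioner in PROVISIONERS.items():
        if section in definition:
            actions.extend(provisioner(service, definition[section]))
    return actions
```

Keeping each provisioner behind its own section of the file is one way to let them be released independently: adding a new capability means registering one more entry, without touching the tenants' existing configuration.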
Earlier, we mentioned that this approach has the added benefit of being something that the platform team can rely upon to build their own features. The configuration updated by teams for their Service can be combined with metadata about their team and surfaced via an API and events published to Pub/Sub. This can then drive updates to other features like incident response and security tooling, pre-provision documentation repositories, and more. This is an example of how something that was originally intended as a means to help teams avoid writing their own IaC can also be used to make it easier for us to build platform features, further improving the value-add — without the developer even needing to be aware of it!
We think this approach is also more scalable than providing pre-built Terraform modules for teams to use. That approach still burdens teams with being familiar with Terraform, and versioning and dependency complexities can create maintenance headaches for platform engineers. Instead, we provide an easy-to-reason-about API and deliberately burden the platform team, ensuring that the Service provides all the functionality our tenants require. This abstraction also means we can make significant refactoring choices if we need to.
Adopting this approach also results in a broad consistency in technologies across our platform. For example, why would a team implement Kafka when the platform makes creating resources in Pub/Sub so easy? When you consider that this spans not just the runtime components that assemble into a working business service, but also all the ancillary needs for operating that software — resilience engineering, monitoring & alerting, incident response, security tooling, service management, and so on— this has a massive amplifying effect on our engineers’ productivity. All of these areas have full paved road capabilities on the John Lewis Digital Platform, reducing the cognitive load for teams in recognizing the need for, identifying appropriate options, and then implementing technology or processes to use them.
That being said, one of the reasons we particularly like the paved road concept is because it doesn't preclude teams choosing to "go off-road." A paved road shouldn’t be mandatory, but it should be compelling to use, so that engineers aren’t tempted to do something else. Preventing use of other approaches risks stifling innovation and the temptation to think the features you've built are "good enough." The paved road challenges our Platform Engineers to keep improving their product so that it continues to meet our Developers' changing needs. Likewise, development teams tempted to go off-road are put off by the increasing burden of replicating powerful platform features.
The needs of our Engineers don’t remain fixed, and Google Cloud are of course releasing new capabilities all the time, so we have extended the analogy to include a “dusty path” representing brand new platform features that aren’t as feature-rich as we’d like (perhaps they lack self-service provisioning or out-the-box observability). Teams are trusted to try different options and make use of Google Cloud products that we haven't yet paved. The Paved Road Pipeline allows for this experimentation - what we term "snowflaking". We then have an unofficial "rule of three", whereby if we notice at least 3 teams requesting the same feature, we move to make the use of it self-service.
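The "rule of three" is informal, but its logic is simple enough to sketch. The example below assumes feature requests are recorded as (team, feature) pairs; the function name and data shape are hypothetical.

```python
from collections import defaultdict


def features_to_pave(requests, threshold=3):
    """Return features requested by at least `threshold` distinct teams."""
    teams_by_feature = defaultdict(set)
    for team, feature in requests:
        # A set ignores repeat requests from the same team.
        teams_by_feature[feature].add(team)
    return sorted(
        feature
        for feature, teams in teams_by_feature.items()
        if len(teams) >= threshold
    )
```

Counting distinct teams rather than raw requests means one enthusiastic team asking repeatedly does not trigger paving on its own.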
At the other end of the scale, teams can go completely solo, which we refer to as “crazy paving.” This might be needed to support wild experimentation or to accommodate a workload that cannot comply with the platform’s expectations for safe operation. Solutions in this space are generally not long-lived.
In this article, we've covered how John Lewis revolutionized its e-commerce operations by adopting a multi-tenant, "paved road" approach to platform engineering. We explored how this strategy empowered development teams and streamlined their ability to provision Google Cloud resources and deploy operational and security features.
In part 2 of this series, we'll dive deeper into how John Lewis further simplified the developer experience by introducing the Microservice CRD. You'll discover how this custom Kubernetes abstraction significantly reduced the complexity of working with Kubernetes at the component level, leading to faster development cycles and enhanced operational efficiency.
To learn more about shifting down with platform engineering on Google Cloud, you can find more information available here. To learn more about how Google Kubernetes Engine (GKE) empowers developers to effortlessly deploy, scale, and manage containerized applications with its fully managed, robust, and intelligent Kubernetes service, you can find more information here.
In our previous article we introduced the John Lewis Digital Platform and its approach to simplifying the developer experience through platform engineering and so-called paved road features. We focused on the ways that platform engineering enables teams to create resources in Google Cloud and deploy the platform's operational and security features within dedicated tenant environments. In this article, we will build upon that concept for the next level of detail — how the platform simplifies build and run at a component (typically for us, a microservice) level too.
Within just over a year, the John Lewis Digital Platform had fully evolved into a product. We had approximately 25 teams using our platform, with several key parts of the johnlewis.com retail website running in production. We had built a self-service capability to help teams provision resources in Google Cloud, and firmly established that the foundation of our platform was on Google Kubernetes Engine (GKE). But we were hearing signals from some of the recent teams that there was a learning curve to Kubernetes. This was expected — we were driving a cultural change for teams to build and run their own services, and so we anticipated that our application developers would need some Kubernetes skills to support their own software. But our vision was that we wanted to make developers' lives easier — and their feedback was clear. In some cases, we observed that teams weren't following "good practice" (despite the existence of good documentation!) such as not using anti-affinity rules or PodDisruptionBudgets to help their workloads tolerate failure.
All the way back in 2017, Kelsey Hightower wrote: “Kubernetes is a platform for building platforms. It's a better place to start, not the endgame.”
Kelsey's quote inspired us to act. We had the idea to write our own custom controller to simplify the point of interaction for a developer with Kubernetes — a John Lewis-specific abstraction that aligned to our preferred approaches. And thus the JL Microservice was born.
To do this, we declared a Kubernetes CustomResourceDefinition with a simplified specification containing just the fields we felt our developers needed to set. For example, as we expect our tenants to build and operate their applications themselves, attributes such as the number of replicas and the amount of resources needed are best left up to the developers themselves. But do they really need to be able to customize the rules defining how to distribute pods across nodes? How often do they need to change the Service pointing towards their Deployment? When we looked closer, we realized just how much duplication there was — our analysis at the time suggested that only around 33% of the lines in the yaml files developers were producing were relevant to their application. This was a target-rich scenario for simplification.
To help us build this feature, we selected Kubebuilder, using it to declare our CustomResourceDefinition and then build the Controller (what we call MicroserviceManager). This turned out to be a beneficial decision — initial prototyping was quick, and the feature was launched a few months later, and very well-received. Our team had to skill up in the Go programming language, but this trade-off felt worthwhile due to the advantages Kubebuilder was bringing to the table, and it has continued to be helpful for other software engineering since.
The initial implementation replaced an engineer's need to understand and fully configure a Deployment and Service, instead applying a much briefer yaml file containing only the fields they need to change. As well as direct translation of identical fields (image and replicas are equivalent to what you would see in a Deployment, for example), it also allowed us to simplify the choices made by the Kubernetes APIs, because in John Lewis we didn't need some of that functionality. For example, writablePaths: [] is an easy concept for our engineers to understand, and behind the scenes, our controller is converting those into the more complex combination of Volumes and VolumeMounts. Likewise, visibleToOtherServices: true is an example of us simplifying the interaction with Kubernetes NetworkPolicy — rather than requiring teams to read our documentation to understand the necessary incantations to label their resources correctly, the controller understands those conventions and handles it for them.
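To make the translation concrete, here is a minimal sketch of how a simplified Microservice spec could be expanded into Deployment and Service manifests. The real controller is written in Go with Kubebuilder; this Python version only illustrates the idea. The fields image, replicas, writablePaths, and visibleToOtherServices come from the description above, while the label key, the use of emptyDir volumes, and the port numbers are assumptions made for the example.

```python
def render_microservice(name, spec):
    """Expand a simplified Microservice spec into Deployment and Service dicts."""
    labels = {"app": name}
    if spec.get("visibleToOtherServices"):
        # Hypothetical label that a NetworkPolicy could select on.
        labels["visible-to-other-services"] = "true"

    # Each writable path becomes a Volume plus a matching VolumeMount.
    volumes, mounts = [], []
    for i, path in enumerate(spec.get("writablePaths", [])):
        volume_name = f"writable-{i}"
        volumes.append({"name": volume_name, "emptyDir": {}})  # assumed volume type
        mounts.append({"name": volume_name, "mountPath": path})

    container = {"name": name, "image": spec["image"], "volumeMounts": mounts}
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": spec["replicas"],
            "selector": {"matchLabels": {"app": name}},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [container], "volumes": volumes},
            },
        },
    }
    service = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name, "labels": labels},
        # Port numbers are placeholders for illustration.
        "spec": {"selector": {"app": name}, "ports": [{"port": 80, "targetPort": 8080}]},
    }
    return deployment, service
```

A developer supplies only a handful of fields; everything else in the two manifests is derived from conventions the platform owns.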
With the core concept of the Microservice resource established, we were able to improve the value-add by augmenting it with further features. We rapidly extended it out to define our Prometheus scrape configuration, then added more complex features such as allowing teams to declare that they use Google Cloud Endpoints and having the controller inject the necessary sidecar container into their Deployment and wire it up to the Service. As we added more features, existing tenants converted to use this specification, and it now makes up the majority of workloads declared on the platform.
Our motivation to build MicroserviceManager was focused on making developers' lives easier. But we discovered an additional benefit that we had not initially expected: it was something we could greatly benefit from within the platform as well. It enabled us to make changes behind the scenes without needing to involve our tenants, reducing toil for them and making it easier for us to improve our product. This proved to be an exceptionally powerful benefit. It is generally difficult to change the agreement that you’ve established between your tenants and the platform, and creating an abstraction like this has allowed us to bring more under our control, for everyone’s benefit.
An example of this was something we observed through our live load testing of johnlewis.com, when certain workloads burst up to several hundred Pods, numbers that exceeded the typical number of Nodes we had running in the cluster. This led to new Node creation, and therefore slower Pod autoscaling and poor bin-packing. Experienced Kubernetes operators can probably guess what was happening here: our default antiAffinity rules were set to optimize for resilience such that no more than one replica was allowed on any given Node. The good news, though, was that because the workloads were under the control of our Microservice Manager, we did not have to instruct our tenants to copy the relevant yaml into their Deployments. Instead, it was a straightforward change for us to replace the antiAffinity rules with the more modern topologySpreadConstraints, allowing us to customize the number of replicas that could be stacked on a Node for workloads exceeding a certain replica count. And this happened with no intervention from our tenants.
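The kind of switch described here could look roughly like the sketch below, which picks scheduling rules from the replica count. It uses the standard Kubernetes pod spec fields podAntiAffinity and topologySpreadConstraints; the function name, the threshold of 50 replicas, and the choice of maxSkew and whenUnsatisfiable values are assumptions for illustration, not the settings John Lewis uses.

```python
def scheduling_rules(app, replicas, threshold=50):
    """Return pod-spec scheduling fields for a workload of the given size."""
    selector = {"matchLabels": {"app": app}}
    hostname = "kubernetes.io/hostname"
    if replicas <= threshold:
        # Small workloads: strictly one replica per Node for resilience.
        return {
            "affinity": {
                "podAntiAffinity": {
                    "requiredDuringSchedulingIgnoredDuringExecution": [
                        {"labelSelector": selector, "topologyKey": hostname}
                    ]
                }
            }
        }
    # Large workloads: spread evenly across Nodes but allow stacking,
    # so a burst does not force one new Node per replica.
    return {
        "topologySpreadConstraints": [
            {
                "maxSkew": 1,
                "topologyKey": hostname,
                "whenUnsatisfiable": "ScheduleAnyway",
                "labelSelector": selector,
            }
        ]
    }
```

Because a controller generates these fields, a change like this can be rolled out centrally: tenants keep the same short spec and the rendered pod template changes underneath it.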
A more complex example of this was when we rolled out our service mesh. In keeping with our general desire to let Google Cloud handle the complexity of running control plane components, we opted to use Google's Cloud Service Mesh product. But even then, rolling out a mesh to a business-critical platform in constant use is not without its risks. Microservice Manager allowed us to control the rate at which we enrolled workloads into the mesh through the use of a feature flag on the Microservice resource. We could start rollout with platform-owned workloads first to test our approach, then make tenants aware of the flag for early adopters to validate and take advantage of some of Cloud Service Mesh’s features. To scale the rollout, we could then manipulate the flag to release in waves based on business importance, providing an opt-out mechanism if needed. This again greatly simplified the implementation: product teams had very little to do, and we avoided having to chase approximately 40 teams running hundreds of Microservices to make the appropriate changes in their configuration. This feature flagging technique is something we make extensive use of to support our own experimentation.
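A wave-based rollout with early adopters and an opt-out can be expressed as a small decision function. The sketch below is an assumption about how such a flag might be evaluated; the field names meshOptOut, meshEarlyAdopter, and wave are hypothetical and do not come from the actual Microservice resource.

```python
def mesh_enabled(workload, current_wave):
    """Decide whether a workload should be enrolled in the service mesh."""
    if workload.get("meshOptOut"):
        # An explicit opt-out always wins.
        return False
    if workload.get("meshEarlyAdopter"):
        # Tenants may opt in ahead of their scheduled wave.
        return True
    # Otherwise follow the platform-controlled rollout.
    return workload["wave"] <= current_wave


# Usage: platform-owned workloads sit in wave 0, business-critical ones later.
workloads = [
    {"name": "platform-tool", "wave": 0},
    {"name": "search", "wave": 2, "meshEarlyAdopter": True},
    {"name": "checkout", "wave": 3},
    {"name": "legacy-batch", "wave": 1, "meshOptOut": True},
]
enrolled = [w["name"] for w in workloads if mesh_enabled(w, current_wave=1)]
```

Advancing current_wave is then the only change needed to enroll the next group of workloads.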
Building the Microservice Manager has led to further thinking in Kubernetes-native ways: the Custom Resource + Controller concept is a powerful technique, and we have built other features since using it. One example is a controller that converts the need for external connectivity into Istio resources to route via our egress gateway. Istio in particular is an example of a very powerful platform capability that comes with a high cognitive load for its users, and so is a perfect example of where platform engineering can help manage that for teams whilst still allowing them to take advantage of it. We have a number of ideas in this area now that our confidence in the technology has grown.
In summary, the John Lewis Partnership leveraged Google Cloud and platform engineering to modernize their e-commerce operations and developer experience. By implementing a "paved road" approach with a multi-tenant architecture, they empowered development teams, accelerated deployment cycles, and simplified Kubernetes interactions using a custom Microservice CRD. This strategy allowed them to scale their engineering teams effectively and to enhance the developer experience by reducing complexity while maintaining operational efficiency.
To learn more about platform engineering on Google Cloud, check out some of our other articles: 5 myths about platform engineering: what it is and what it isn’t, Another five myths about platform engineering, and Light the way ahead: Platform Engineering, Golden Paths, and the power of self-service.
In the event of a cloud incident, everyone wants swift and clear communication from the cloud provider, and to be able to leverage that information effectively. Personalized Service Health in the Google Cloud console addresses this need with fast, transparent, relevant, and actionable communications about Google Cloud service disruptions, customized to your specific footprint. This helps you quickly identify the source of the problem and answer the question, “Is it Google or is it me?” You can then integrate this information into your incident response workflows to resolve the incident more efficiently.
We're excited to announce that you can prompt Gemini Cloud Assist to pull real-time information about active incidents, powered by Personalized Service Health, providing you with streamlined incident management, including discovery, impact assessment, and recovery. By combining Gemini's guidance with Personalized Service Health insights and up-to-the-minute information, you can assess the scope of impact and begin troubleshooting – all within a single, AI-driven Gemini Cloud Assist chat. Further, you can initiate this sort of incident discovery from anywhere within the console, offering immediate access to relevant incidents without interrupting your workflow. You can also check for active incidents impacting your projects, gathering details on their scope and the latest updates directly sourced from Personalized Service Health.
We designed Gemini Cloud Assist with a user-friendly layout and a well-organized information structure. Crucial details, including dynamic timelines, latest updates, symptoms, and workarounds sourced directly from Personalized Service Health, are now presented in the console, enabling conversational follow-ups. Gemini Cloud Assist highlights critical insights from Personalized Service Health, helping you refine your investigations and understand the impact of incidents.
To illustrate the power of this integration, the following demo showcases a typical incident response workflow leveraging the combined capabilities of Gemini and Personalized Service Health.
Incident discovery and triage
In the crucial first moments of an incident, Gemini Cloud Assist helps you answer "Is it Google or is it me?" Gemini Cloud Assist accesses data directly from Personalized Service Health, and provides feedback on which projects and at what locations are affected by a Google Cloud incident, speeding up the triage process.
To illustrate how you can start this process, try asking Gemini Cloud Assist questions like:
Is my project impacted by a Google Cloud incident?
Are there any incidents impacting Google Cloud at the moment?
Investigating and evaluating impact
Once you’ve identified a relevant Google Cloud incident, you can use Gemini Cloud Assist to delve deeper into the specifics and evaluate its impact on your environment. By asking follow-up questions, you can have Gemini Cloud Assist retrieve updates from Personalized Service Health about the incident as it evolves. You can then investigate further by asking Gemini to pinpoint exactly which of your apps or projects, and at what locations, might be affected by the reported incident.
Here are examples of prompts you might pose to Gemini Cloud Assist:
Tell me more about the ongoing Incident ID [X] (Replace [X] with the Incident ID)
Is [X] impacted? (Replace [X] with your specific location or Google Cloud product)
What is the latest update on Incident ID [X]?
Show me the details of Incident ID [X].
Can you guide me through some troubleshooting steps for [impacted Google Cloud product]?
Mitigation and recovery
Finally, Gemini Cloud Assist can also act as an intelligent assistant during the recovery phase, providing you with actionable guidance. You can gain access to relevant logs and monitoring data for more efficient resolution. Additionally, Gemini Cloud Assist can help surface potential workarounds from Personalized Service Health and direct you to the tools and information you need to restore your projects or applications. Here are some sample prompts:
What are the workarounds for the incident ID [X]? (Replace [X] with the Incident ID)
Can you suggest a temporary solution to keep my application running?
How can I find logs for this impacted project?
From these prompts, Gemini retrieves relevant information from Personalized Service Health to provide you with personalized insights into your Google Cloud environment's health — both for ongoing events and incidents from up to one year in the past. This helps when investigating an incident to narrow down its impact, as well as assisting in recovery.
Looking ahead, we are excited to provide even deeper insights and more comprehensive incident management with Gemini Cloud Assist and Personalized Service Health, extending these AI-driven capabilities beyond a single project view. Ready to get started?
Learn more about Personalized Service Health, or reach out to your account team to enable it.
Get started with Gemini Cloud Assist. Refine your prompts to ask about your specific regions or Google Cloud products, and experiment to discover how it can help you proactively manage incidents.
In 2023, the Waze platform engineering team transitioned to Infrastructure as Code (IaC) using Google Cloud's Config Connector (KCC) — and we haven’t looked back since. We embraced Config Connector, an open-source Kubernetes add-on, to manage Google Cloud resources through Kubernetes. To streamline management, we also leverage Config Controller, a hosted version of Config Connector on Google Kubernetes Engine (GKE), incorporating Policy Controller and Config Sync. This shift has significantly improved our infrastructure management and is shaping our future infrastructure.
Previously, Waze relied on Terraform to manage resources, particularly during our dual-cloud, VM-based phase. However, maintaining state and ensuring reconciliation proved challenging, leading to inconsistent configurations and increased management overhead.
In 2023, we adopted Config Connector, transforming our Google Cloud infrastructure into Kubernetes Resource Model (KRM) objects within a GKE cluster. This approach addresses the reconciliation issues encountered with Terraform. Config Sync, paired with Config Connector, automates KRM synchronization from source repositories to our live GKE cluster. This managed solution eliminates the need for us to build and maintain custom reconciliation systems.
The shift helped us meet the needs of three key roles within Waze’s infrastructure team:
Infrastructure consumers: Application developers who want to easily deploy infrastructure without worrying about the maintenance and complexity of underlying resources.
Infrastructure owners: Experts in specific resource types (e.g., Spanner, Google Cloud Storage, Load Balancers, etc.), who want to define and standardize best practices in how resources are created across Waze on Google Cloud.
Platform engineers: Engineers who build the system that enables infrastructure owners to codify and define best practices, while also providing a seamless API for infrastructure consumers.
It may seem circular to define all of our Google Cloud infrastructure as KRMs within a Google Cloud service. However, compared with existing IaC tooling, KRM is actually a great representation for our infrastructure.
Terraform's reconciliation issues – state drift, version management, out-of-band changes – are a significant pain. Config Connector, through Config Sync, offers out-of-the-box reconciliation, a managed solution we prefer. Both KRM and Terraform offer templating, but KCC's managed nature aligns with our shift to Google Cloud-native solutions and reduces our maintenance burden.
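As a rough illustration of what reconciliation means here, the following Python sketch compares a declared state with a live state and plans the changes needed to close the gap. It is a generic model, not Config Connector or Config Sync code; the resource names, fields, and the choice to delete undeclared resources are assumptions made for the example.

```python
from typing import Dict, List, Tuple

Resource = Dict[str, str]


def plan_reconcile(desired: Dict[str, Resource],
                   actual: Dict[str, Resource]) -> List[Tuple[str, str]]:
    """Return (action, name) pairs that would move `actual` toward `desired`."""
    actions: List[Tuple[str, str]] = []
    for name, spec in sorted(desired.items()):
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            # Drift: the live resource no longer matches its declaration.
            actions.append(("update", name))
    for name in sorted(actual):
        if name not in desired:
            # Assumption for this sketch: undeclared resources are removed.
            actions.append(("delete", name))
    return actions


def reconcile(desired: Dict[str, Resource],
              actual: Dict[str, Resource]) -> List[Tuple[str, str]]:
    """Apply the plan to `actual` in place and return the plan."""
    plan = plan_reconcile(desired, actual)
    for action, name in plan:
        if action == "delete":
            del actual[name]
        else:
            actual[name] = dict(desired[name])
    return plan


# Hypothetical example data.
desired_state = {
    "bucket-a": {"location": "US", "versioning": "on"},
    "bucket-b": {"location": "EU", "versioning": "on"},
}
live_state = {
    "bucket-a": {"location": "US", "versioning": "off"},  # changed out of band
    "bucket-c": {"location": "US", "versioning": "on"},   # never declared
}

first_plan = reconcile(desired_state, live_state)
second_plan = reconcile(desired_state, live_state)
print(first_plan)
print(second_plan)
```

Running the loop a second time yields an empty plan, which is the property a continuously running reconciler relies on: once live state matches the declaration, there is nothing left to do until one of them changes.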
Infrastructure complexity requires generalization regardless of the tool. We can see this when we look at the Spanner requirements at Waze:
Consistent backups for all Spanner databases
Each Spanner database utilizes a dedicated Cloud Storage bucket and Service Account to automate the execution of DDL jobs.
All IAM policies for Spanner instances, databases, and Cloud Storage buckets are defined in code to ensure consistent and auditable access control.
To define these resources, we evaluated various templating and rendering tools and selected Helm, a robust CNCF package manager for Kubernetes. Its strong open-source community, rich templating capabilities, and native rendering features made it a natural fit. We can now refer to our bundled infrastructure configurations as 'Charts.' Although KRO has since emerged and serves a similar purpose, our selection process predated its availability.
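To make the idea of a bundled configuration concrete, here is a small Python sketch of what such a bundle does conceptually: a few consumer inputs are expanded into the full set of resources that the Spanner requirements above call for. It is not a Helm chart and not Waze's actual template; the function name, the resource kinds, and the naming scheme are all illustrative assumptions.

```python
from typing import Dict, List


def render_spanner_bundle(db_name: str, instance: str,
                          owner_group: str) -> List[Dict[str, str]]:
    """Expand a few consumer inputs into the resources each database needs.

    The "kind" values are illustrative labels, not real Config Connector kinds.
    """
    bucket = f"{db_name}-ddl"
    service_account = f"{db_name}-ddl-runner"
    return [
        {"kind": "SpannerDatabase", "name": db_name, "instance": instance},
        {"kind": "BackupSchedule", "name": f"{db_name}-backup",
         "database": db_name},
        {"kind": "StorageBucket", "name": bucket},
        {"kind": "ServiceAccount", "name": service_account},
        # Access is declared next to the resources it protects.
        {"kind": "IamBinding", "name": f"{db_name}-ddl-access",
         "member": service_account, "resource": bucket},
        {"kind": "IamBinding", "name": f"{db_name}-owners",
         "member": owner_group, "resource": db_name},
    ]


# Hypothetical inputs from an infrastructure consumer.
bundle = render_spanner_bundle("orders", "prod-instance",
                               "group:orders-owners@example.com")
for resource in bundle:
    print(resource["kind"], resource["name"])
```

The consumer supplies three values, while the backup, bucket, service account, and access rules come along automatically, which is the standardization that infrastructure owners are after.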
Let's open the hood and dive into how the system works and is driving value for Waze.
Waze infrastructure owners generically define Waze-flavored infrastructure in Helm Charts.
Infrastructure consumers use these Charts with simplified inputs to generate infrastructure (demo).
Infrastructure code is stored in repositories, enabling validation and presubmit checks.
Code is uploaded to Artifact Registry, where Config Sync and Config Connector align Google Cloud infrastructure with the code definitions.
This diagram represents a single "data domain," a collection of bounded services, databases, networks, and data. Many tech orgs today consist of Prod, QA, Staging, Development, etc.
So why does all of this matter? Adopting this approach allowed us to move from Infrastructure as Code to Infrastructure as Software. By treating each Chart as a software component, our infrastructure management goes beyond simple code declaration. Now, versioned Charts and configurations enable us to leverage a rich ecosystem of software practices, including sophisticated release management, automated rollbacks, and granular change tracking.
Here's where we apply this in practice: our configuration inheritance model minimizes redundancy. Resource Charts inherit settings from Projects, which inherit from Bootstraps. All three are defined as Charts. Consequently, Bootstrap configurations apply to all Projects, and Project configurations apply to all Resources.
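The inheritance chain can be pictured as a layered merge. The Python sketch below overlays Project settings on Bootstrap settings and Resource settings on the result. It assumes that a more specific layer may override a value from its parent, which the description above does not state, and the keys and values are hypothetical.

```python
from typing import Any, Dict


def merge(parent: Dict[str, Any], child: Dict[str, Any]) -> Dict[str, Any]:
    """Recursively overlay `child` on `parent`; child values win on conflict."""
    result = dict(parent)
    for key, value in child.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge(result[key], value)
        else:
            result[key] = value
    return result


# Hypothetical settings at each layer.
bootstrap = {"labels": {"org": "waze"}, "audit_logging": True}
project = {"labels": {"env": "prod"}, "region": "us-east1"}
resource = {"labels": {"team": "maps"}, "region": "europe-west1"}

effective = merge(merge(bootstrap, project), resource)
print(effective)
```

Settings declared once at the Bootstrap layer, such as `audit_logging` here, reach every Resource without being repeated, which is where the reduction in redundancy comes from.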
Every change to our infrastructure – from changes on existing infrastructure to rolling out new resource types – can be treated like a software rollout.
Now that all of our infrastructure is treated like software, we can see what this does for us system-wide:
In summary, Config Connector and Config Controller have enabled Waze to achieve true Infrastructure as Software, providing a robust and scalable platform for our infrastructure needs, along with many other benefits including:
Infrastructure consumers receive the latest best practices through versioned updates.
Infrastructure owners can iterate and improve infrastructure safely.
Platform engineers and security teams are confident our resources are auditable and compliant.
Config Connector leverages Google's managed services, reducing operational overhead.
Distributed tracing is a critical part of an observability stack, letting you troubleshoot latency and errors in your applications. Cloud Trace, part of Google Cloud Observability, is Google Cloud’s native tracing product, and we’ve made numerous improvements to the Trace explorer UI on top of a new analytics backend.
The new Trace explorer page contains:
A filter bar with options for users to choose a Google Cloud project-based trace scope, all/root spans and a custom attribute filter.
A faceted span filter pane that displays commonly used filters based on OpenTelemetry conventions.
A visualization of matching spans including an interactive span duration heatmap (default), a span rate line chart, and a span duration percentile chart.
A table of matching spans that can be narrowed down further by selecting a cell of interest on the heatmap.
Let’s take a closer look at these new features and how you can use them to troubleshoot your applications. Imagine you’re a developer working on the checkoutservice of a retail webstore application and you’ve been paged because there’s an ongoing incident.
This application is instrumented using OpenTelemetry and sends trace data to Google Cloud Trace, so you navigate to the Trace explorer page on the Google Cloud console with the context set to the Google Cloud project that hosts the checkoutservice.
Before starting your investigation, you remember that your admin recommended using the webstore-prod trace scope when investigating webstore app-wide prod issues. By using this Trace scope, you'll be able to see spans stored in other Google Cloud projects that are relevant to your investigation.
You set the trace scope to webstore-prod and your queries will now include spans from all the projects included in this trace scope.
You select checkoutservice in Span filters (1) and the following updates load on the page:
Other sections such as Span name in the span filter pane (2) are updated with counts and percentages that take into account the selection made under service name. This can help you narrow down your search criteria to be more specific.
The span Filter bar (3) is updated to display the active filter.
The heatmap visualization (4) is updated to only display spans from the checkoutservice in the last 1 hour (default). You can change the time-range using the time-picker (5). The heatmap’s x-axis is time and the y-axis is span duration. It uses color shades to denote the number of spans in each cell with a legend that indicates the corresponding range.
The Spans table (6) is updated with matching spans sorted by duration (default).
Other Chart views (7) that you can switch to are also updated with the applied filter.
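For readers who want a feel for how a duration heatmap is built, the Python sketch below groups spans into cells by start time (columns) and duration (rows) and counts the spans in each cell. The bucket boundaries, the five-minute column width, and the sample spans are assumptions for illustration, not Cloud Trace's actual bucketing.

```python
from collections import Counter
from typing import Iterable, Tuple

# Upper bounds (seconds) of the duration rows; the last row is open-ended.
DURATION_BOUNDS = [0.1, 1.0, 10.0, 100.0]


def duration_row(duration_s: float) -> int:
    """Return the index of the duration row that contains `duration_s`."""
    for row, bound in enumerate(DURATION_BOUNDS):
        if duration_s < bound:
            return row
    return len(DURATION_BOUNDS)


def heatmap(spans: Iterable[Tuple[float, float]],
            window_s: float = 300.0) -> Counter:
    """Count spans per (time column, duration row) cell.

    Each span is a (start_time_s, duration_s) pair.
    """
    cells: Counter = Counter()
    for start_s, duration_s in spans:
        column = int(start_s // window_s)
        cells[(column, duration_row(duration_s))] += 1
    return cells


# Hypothetical spans: mostly fast, plus two that take over two minutes.
sample_spans = [(12.0, 0.05), (40.0, 0.3), (290.0, 0.4),
                (310.0, 0.2), (620.0, 130.0), (650.0, 125.0)]
cells = heatmap(sample_spans)
for cell, count in sorted(cells.items()):
    print(cell, count)
```

The color of each cell in the chart corresponds to the count stored for that cell, so outliers show up as a sparsely populated cell far above the rest.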
From looking at the heatmap, you can see that there are some spans in the >100s range which is abnormal and concerning. But first, you’re curious about the traffic and corresponding latency of calls handled by the checkoutservice.
Switching to the Span rate line chart gives you an idea of the traffic handled by your service. The x-axis is time and the y-axis is spans/second. The traffic handled by your service looks normal as you know from past experience that 1.5-2 spans/second is quite typical.
Switching to the Span duration percentile chart gives you p50/p90/p95/p99 span duration trends. While p50 looks fine, the p9x durations are greater than you expect for your service.
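To see why p50 can look healthy while the higher percentiles do not, consider this Python sketch. It uses the nearest-rank definition of a percentile, which is one common convention and not necessarily the one Cloud Trace uses, and the sample durations are invented.

```python
from typing import Dict, Iterable, Sequence


def percentile(sorted_values: Sequence[float], p: int) -> float:
    """Nearest-rank percentile of an ascending sequence, for integer 1 <= p <= 100."""
    if not sorted_values:
        raise ValueError("need at least one value")
    # Integer ceiling of p * n / 100 avoids floating-point rounding surprises.
    rank = (p * len(sorted_values) + 99) // 100
    return sorted_values[max(rank, 1) - 1]


def duration_summary(durations_s: Iterable[float]) -> Dict[str, float]:
    """Return the p50/p90/p95/p99 durations for a set of spans."""
    ordered = sorted(durations_s)
    return {f"p{p}": percentile(ordered, p) for p in (50, 90, 95, 99)}


# Hypothetical durations in seconds: 90 fast spans and a slow tail.
durations = [0.2] * 90 + [1.5] * 5 + [40.0] * 3 + [125.0, 130.0]
summary = duration_summary(durations)
print(summary)
```

Because the median only looks at the middle of the distribution, a handful of very slow spans leaves it untouched while pulling p95 and p99 far upward.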
You switch back to the heatmap chart and select one of the outlier cells to investigate further. This particular cell has two matching spans with a duration of over 2 minutes, which is concerning.
You investigate one of those spans by viewing the full trace and notice that the orders publish span takes up the majority of the time when servicing this request. Given this, you form a hypothesis that the checkoutservice is having issues handling these types of calls. To validate your hypothesis, you note that the rpc.method attribute is PlaceOrder and exit this trace using the X button.
You add an attribute filter for key: rpc.method value:PlaceOrder using the Filter bar, which shows you that there is a clear latency issue with PlaceOrder calls handled by your service. You’ve seen this issue before and know that there is a runbook that addresses it, so you alert the SRE team with the appropriate action that needs to be taken to mitigate the incident.
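The same check can be expressed outside the UI. The Python sketch below filters spans by an attribute and compares how often each group exceeds a latency threshold. The span dictionaries, the `GetCart` method, and the 100-second threshold are hypothetical; real span data would come from your tracing backend.

```python
from typing import Any, Dict, List

Span = Dict[str, Any]


def with_attributes(spans: List[Span], wanted: Dict[str, str]) -> List[Span]:
    """Keep spans whose attributes contain every key/value pair in `wanted`."""
    return [s for s in spans
            if all(s["attributes"].get(k) == v for k, v in wanted.items())]


def slow_fraction(spans: List[Span], threshold_s: float) -> float:
    """Fraction of spans slower than `threshold_s` (0.0 for an empty list)."""
    if not spans:
        return 0.0
    return sum(1 for s in spans if s["duration_s"] > threshold_s) / len(spans)


# Hypothetical spans from the checkoutservice.
spans = [
    {"duration_s": 130.0, "attributes": {"rpc.method": "PlaceOrder"}},
    {"duration_s": 125.0, "attributes": {"rpc.method": "PlaceOrder"}},
    {"duration_s": 0.3, "attributes": {"rpc.method": "PlaceOrder"}},
    {"duration_s": 0.2, "attributes": {"rpc.method": "GetCart"}},
    {"duration_s": 0.25, "attributes": {"rpc.method": "GetCart"}},
]

place_order = with_attributes(spans, {"rpc.method": "PlaceOrder"})
get_cart = with_attributes(spans, {"rpc.method": "GetCart"})
print(len(place_order), slow_fraction(place_order, 100.0))
print(len(get_cart), slow_fraction(get_cart, 100.0))
```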
This new experience is powered by BigQuery, using the same platform that backs Log Analytics. We plan to launch new features that take full advantage of this platform: SQL queries, flexible sampling, export, and regional storage.
In summary, you can use the new Cloud Trace explorer to perform service-oriented investigations with advanced querying and visualization of trace data. This allows developers and SREs to effectively troubleshoot production incidents and identify mitigating measures to restore normal operations.
The new Cloud Trace explorer is generally available to all users — try it out and share your feedback with us via the Send feedback button.
Picture this: you’re a Site Reliability Engineer (SRE) responsible for the systems that power your company’s machine learning (ML) services. What do you do to ensure you have a reliable ML service, how do you know you’re doing it well, and how can you build strong systems to support these services?
As artificial intelligence (AI) becomes more widely available, its features, including ML, will matter more to SREs. That’s because ML becomes both a part of the infrastructure used in production software systems and an important feature of the software itself.
Abstractly, machine learning relies on its pipelines … and you know how to manage those! So you can begin with pipeline management, then look to other factors that will strengthen your ML services: training, model freshness, and efficiency. In the resources below, we'll look at some of the ML-specific characteristics of these pipelines that you’ll want to consider in your operations. Then, we draw on the experience of Google SREs to show you how to apply your core SRE skills to operating and managing your organization’s machine-learning pipelines.
Training ML models applies the notion of pipelines to specific types of data, often running on specialized hardware. Critical aspects to consider about the pipeline:
how much data you’re ingesting
how fresh this data needs to be
how the system trains and deploys the models
how efficiently the system handles these first three things
This keynote presents an SRE perspective on the value of applying reliability principles to the components of machine learning systems. It provides insight into why ML systems matter for products, and how SREs should think about them. The challenges that ML systems present include capacity planning, resource management, and monitoring; other challenges include understanding the cost of ML systems as part of your overall operations environment.
As with any pipeline-based system, a big part of understanding the system is describing how much data it typically ingests and processes. The Data Processing Pipelines chapter in the SRE Workbook lays out the fundamentals: automate the pipeline’s operation so that it is resilient, and can operate unattended.
You’ll want to develop Service Level Objectives (SLOs) in order to measure the pipeline’s health, especially for data freshness, i.e., how recently the model got the data it’s using to produce an inference for a customer. Understanding freshness provides an important measure of an ML system’s health, as data that becomes stale may lead to lower-quality inferences and sub-optimal outcomes for the user. For some systems, such as weather forecasting, data may need to be very fresh (just minutes or seconds old); for other systems, such as spell-checkers, data freshness can lag on the order of days — or longer! Freshness requirements will vary by product, so it’s important that you know what you’re building and how the audience expects to use it.
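One way to turn freshness into a measurable indicator is to record, for each inference, when it was served and when its input data was last updated. The Python sketch below computes the fraction of inferences served with sufficiently fresh data and compares it with a target. The ten-minute limit, the 90% target, and the sample events are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, Tuple

Event = Tuple[datetime, datetime]  # (served_at, data_updated_at)


def freshness_sli(events: Iterable[Event], max_age: timedelta) -> float:
    """Fraction of inferences whose input data was at most `max_age` old."""
    events = list(events)
    if not events:
        return 1.0  # no traffic means nothing violated the objective
    fresh = sum(1 for served_at, data_at in events
                if served_at - data_at <= max_age)
    return fresh / len(events)


def meets_slo(sli: float, target: float) -> bool:
    """True when the measured indicator reaches the objective."""
    return sli >= target


# Hypothetical inference events, all served at the same moment.
now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
events = [
    (now, now - timedelta(minutes=2)),
    (now, now - timedelta(minutes=8)),
    (now, now - timedelta(minutes=30)),  # stale input data
    (now, now - timedelta(minutes=1)),
]

strict_sli = freshness_sli(events, max_age=timedelta(minutes=10))
relaxed_sli = freshness_sli(events, max_age=timedelta(days=1))
print(strict_sli, meets_slo(strict_sli, 0.9))
print(relaxed_sli, meets_slo(relaxed_sli, 0.9))
```

The same events pass or fail depending on the age limit, which mirrors the point above: a forecasting product and a spell-checker can share a pipeline design yet need very different freshness objectives.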
In this way, freshness is a part of the critical user journey described in the SRE Workbook, describing one aspect of the customer experience. You can read more about data freshness as a component of pipeline systems in the Google SRE article Reliable Data Processing with Minimal Toil.
There’s more than freshness to ensuring high-quality data — there’s also how you define the model-training pipeline. A Brief Guide To Running ML Systems in Production gives you the nuts and bolts of this discipline, from using contextual metrics to understand freshness and throughput, to methods for understanding the quality of your input data.
The 2021 SRE blog post Efficient Machine Learning Inference provides a valuable resource to learn about improving your model’s performance in a production environment. (And remember, training is never the same as production for ML services!)
Optimizing machine learning inference serving is crucial for real-world deployment. In this article, the authors explore multi-model serving off of a shared VM. They cover realistic use cases and how to manage trade-offs between cost, utilization, and latency of model responses. By changing the allocation of models to VMs, and varying the size and shape of those VMs in terms of processing, GPU, and RAM attached, you can improve the cost effectiveness of model serving.
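To illustrate how the allocation of models to VMs affects cost, here is a Python sketch that packs models onto as few equally sized VMs as it can using a first-fit-decreasing heuristic on memory alone. This is a swapped-in textbook heuristic, not the method from the article, and it ignores latency, GPU, and CPU; the model names and sizes are hypothetical.

```python
from typing import Dict, List


def pack_models(model_ram_gb: Dict[str, int], vm_ram_gb: int) -> List[List[str]]:
    """Assign models to VMs with first-fit decreasing, considering RAM only."""
    vms: List[List[str]] = []   # model names placed on each VM
    free: List[int] = []        # remaining RAM on each VM
    # Place the largest models first; break ties by name for stable output.
    for name, need in sorted(model_ram_gb.items(),
                             key=lambda kv: (-kv[1], kv[0])):
        if need > vm_ram_gb:
            raise ValueError(f"{name} does not fit on a {vm_ram_gb} GB VM")
        for i, room in enumerate(free):
            if need <= room:
                vms[i].append(name)
                free[i] -= need
                break
        else:
            # No existing VM has room, so start a new one.
            vms.append([name])
            free.append(vm_ram_gb - need)
    return vms


# Hypothetical models and their memory needs in GB.
models = {"ranker": 10, "embedder": 6, "spellcheck": 4,
          "classifier": 8, "tagger": 3}
placement = pack_models(models, vm_ram_gb=16)
print(placement)
```

With one model per VM this workload would need five machines, while sharing brings it down to two. A real allocation would also have to weigh the latency impact of co-locating models.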
We mentioned that these AI pipelines often rely on specialized hardware. How do you know you’re using this hardware efficiently? Todd Underwood’s talk from SREcon EMEA 2023 on Artificial Intelligence: What Will It Cost You? gives you a sense of how much this specialized hardware costs to run, and how you can provide incentives for using it efficiently.
This article from Google's SRE team outlines strategies for ensuring reliable data processing while minimizing manual effort, or toil. One of the key takeaways: use an existing, standard platform for as much of the pipeline as possible. After all, your business goals should focus on innovations in presenting the data and the ML model, not in the pipeline itself. The article covers automation, monitoring, and incident response, with a focus on using these concepts to build resilient data pipelines. You’ll read best practices for designing data systems that can handle failures gracefully and reduce a team’s operational burden. This article is essential reading for anyone involved in data engineering or operations. Read more about toil in the SRE Workbook: https://sre.google/workbook/eliminating-toil/.
Successful ML deployments require careful management and monitoring for systems to be reliable and sustainable. That means taking a holistic approach, including implementing data pipelines, training pathways, model management, and validation, alongside monitoring and accuracy metrics. To go deeper, check out this guide on how to use GKE for your AI orchestration.
In today's dynamic digital landscape, building and operating secure, reliable, cost-efficient and high-performing cloud solutions is no easy feat. Enterprises grapple with the complexities of cloud adoption, and often struggle to bridge the gap between business needs, technical implementation, and operational readiness. This is where the Google Cloud Well-Architected Framework comes in. The framework provides comprehensive guidance to help you design, develop, deploy, and operate efficient, secure, resilient, high-performing, and cost-effective Google Cloud topologies that support your security and compliance requirements.
The Well-Architected Framework caters to a broad spectrum of cloud professionals. Cloud architects, developers, IT administrators, decision makers and other practitioners can benefit from years of subject-matter expertise and knowledge both from within Google and from the industry. The framework distills this vast expertise and presents it as an easy-to-consume set of recommendations.
The recommendations in the Well-Architected Framework are organized under five business-focused pillars.
We recently completed a revamp of the guidance in all the pillars and perspectives of the Well-Architected Framework to center the recommendations around a core set of design principles.
In addition to the above pillars, the Well-Architected Framework provides cross-pillar perspectives that present recommendations for selected domains, industries, and technologies like AI and machine learning (ML).
The Well-Architected Framework is much more than a collection of design and operational recommendations. The framework empowers you with a structured principles-oriented design methodology that unlocks many advantages:
Enhanced security, privacy, and compliance: Security is paramount in the cloud. The Well-Architected Framework incorporates industry-leading security practices, helping ensure that your cloud architecture meets your security, privacy, and compliance requirements.
Optimized cost: The Well-Architected Framework lets you build and operate cost-efficient cloud solutions by promoting a cost-aware culture, focusing on resource optimization, and leveraging built-in cost-saving features in Google Cloud.
Resilience, scalability, and flexibility: As your business needs evolve, the Well-Architected Framework helps you design cloud deployments that can scale to accommodate changing demands, remain highly available, and be resilient to disasters and failures.
Operational excellence: The Well-Architected Framework promotes operationally sound architectures that are easy to operate, monitor, and maintain.
Predictable and workload-specific performance: The Well-Architected Framework offers guidance to help you build, deploy, and operate workloads that provide predictable performance based on your workloads’ needs.
The Well-Architected Framework also includes cross-pillar perspectives for selected domains, industries, and technologies like AI and machine learning (ML).
The principles and recommendations in the Google Cloud Well-Architected Framework are aligned with Google and industry best practices like Google’s Site Reliability Engineering (SRE) practices, DORA capabilities, the Google HEART framework for user-centered metrics, the FinOps framework, Supply-chain Levels for Software Artifacts (SLSA), and Google's Secure AI Framework (SAIF).
Embrace the Well-Architected Framework to transform your Google Cloud journey, and get comprehensive guidance on security, reliability, cost, performance, and operations — as well as targeted recommendations for specific industries and domains like AI and ML. To learn more, visit Google Cloud Well-Architected Framework.
We are thrilled to announce the collaboration between Google Cloud, AWS, and Azure on Kube Resource Orchestrator, or kro (pronounced “crow”). kro introduces a Kubernetes-native, cloud-agnostic way to define groupings of Kubernetes resources. With kro, you can group your applications and their dependencies as a single resource that can be easily consumed by end users.
Platform and DevOps teams want to define standards for how application teams deploy their workloads, and they want to use Kubernetes as the platform for creating and enforcing these standards. Each service needs to handle everything from resource creation to security configurations, monitoring setup, defining the end-user interface, and more. There are client-side templating tools that can help with this (e.g., Helm, Kustomize), but Kubernetes lacks a native way for platform teams to create custom groupings of resources for consumption by end users.
Before kro, platform teams needed to invest in custom solutions such as building custom Kubernetes controllers, or using packaging tools like Helm, which can’t leverage the benefits of Kubernetes CRDs. These approaches are costly to build, maintain, and troubleshoot, and complex for non-Kubernetes experts to consume. This is a problem many Kubernetes users face. Rather than developing vendor-specific solutions, we’ve partnered with Amazon and Microsoft on making K8s APIs simpler for all Kubernetes users.
kro is a Kubernetes-native framework that lets you create reusable APIs to deploy multiple resources as a single unit. You can use it to encapsulate a Kubernetes deployment and its dependencies into a single API that your application teams can use, even if they aren’t familiar with Kubernetes. You can use kro to create custom end-user interfaces that expose only the parameters an end user should see, hiding the complexity of Kubernetes and cloud-provider APIs.
kro does this by introducing the concept of a ResourceGraphDefinition, which specifies how a standard Kubernetes Custom Resource Definition (CRD) should be expanded into a set of Kubernetes resources. End users define a single resource, which kro then expands into the custom resources defined in the CRD.
kro can be used to group and manage any Kubernetes resources. Tools like ACK, KCC, or ASO define CRDs to manage cloud provider resources from Kubernetes (these tools enable cloud provider resources, like storage buckets, to be created and managed as Kubernetes resources). kro can also be used to group resources from these tools, along with any other Kubernetes resources, to define an entire application deployment and the cloud provider resources it depends on.
Below, you’ll find some examples of kro being used with Google Cloud. You can find additional examples on the kro website.
Example 1: GKE cluster definition
Imagine that a platform administrator wants to give end users in their organization self-service access to create GKE clusters. The platform administrator creates a kro ResourceGraphDefinition called GKEclusterRGD that defines the required Kubernetes resources and a CRD called GKEcluster that exposes only the options they want to be configurable by end users. In addition to creating a cluster, the platform team also wants clusters to deploy administrative workloads such as policies, agents, etc. The ResourceGraphDefinition defines the following resources, using KCC to provide the mappings from K8s CRDs to Google Cloud APIs:
GKE cluster, Container Node Pools, IAM ServiceAccount, IAM PolicyMember, Services, Policies
The platform administrator would then define the end-user interface so that they can create a new cluster by creating an instance of the CRD that defines:
Cluster name, Nodepool name, Max nodes, Location (e.g. us-east1), Networks (optional)
Everything related to policy, service accounts, and service activation (and how these resources relate to each other) is hidden from the end user, simplifying their experience.
Example 2: Web application definition
In this example, a DevOps Engineer wants to create a reusable definition of a web application and its dependencies. They create a ResourceGraphDefinition called WebAppRGD, which defines a new Kubernetes CRD called WebApp. This new resource encapsulates all the necessary resources for a web application environment, including:
Deployments, service, service accounts, monitoring agents, and cloud resources like object storage buckets.
The WebAppRGD ResourceGraphDefinition can set a default configuration, and also define which parameters can be set by the end user at deployment time (kro gives you the flexibility to decide what is immutable, and what an end user is able to configure). A developer then creates an instance of the WebApp CRD, inputting any user-facing parameters. kro then deploys the desired Kubernetes resources.
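The split between defaults, user-settable parameters, and fixed settings can be modeled in a few lines. The Python sketch below is a conceptual illustration only: kro definitions are written as Kubernetes manifests, not Python, and every parameter name, default, and resource kind here is a hypothetical stand-in.

```python
from typing import Any, Dict, List

# Values the platform team fixes unless a user-settable parameter overrides them.
DEFAULTS = {"replicas": 2, "monitoring": True, "bucket_location": "US"}
# Parameters an end user may supply; everything else stays hidden.
USER_SETTABLE = {"name", "image", "replicas"}
REQUIRED = {"name", "image"}


def expand_web_app(params: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Validate user parameters and expand them into a group of resources."""
    unknown = set(params) - USER_SETTABLE
    if unknown:
        raise ValueError(f"not configurable by end users: {sorted(unknown)}")
    missing = REQUIRED - set(params)
    if missing:
        raise ValueError(f"missing required parameters: {sorted(missing)}")
    cfg = {**DEFAULTS, **params}
    name = cfg["name"]
    resources: List[Dict[str, Any]] = [
        {"kind": "Deployment", "name": name, "image": cfg["image"],
         "replicas": cfg["replicas"]},
        {"kind": "Service", "name": name},
        {"kind": "ServiceAccount", "name": f"{name}-sa"},
        {"kind": "Bucket", "name": f"{name}-assets",
         "location": cfg["bucket_location"]},
    ]
    if cfg["monitoring"]:
        resources.append({"kind": "MonitoringAgent", "name": f"{name}-monitor"})
    return resources


web_app = expand_web_app({"name": "shop", "image": "shop:1.0"})
for resource in web_app:
    print(resource["kind"], resource["name"])
```

A request that tries to set a hidden parameter, such as `monitoring`, is rejected instead of silently applied, which keeps the platform team's choices intact.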
We believe kro is a big step forward for platform engineering teams, delivering a number of advantages:
Kubernetes-native: kro leverages Kubernetes Custom Resource Definitions (CRDs) to extend Kubernetes, so it works with any Kubernetes resource and integrates with existing Kubernetes tools and workflows.
Lets you create a simplified end-user experience: kro makes it easy to define end-user interfaces for complex groups of Kubernetes resources, so that people who are not Kubernetes experts can consume services built on Kubernetes.
Enables standardized services for application teams: kro templates can be reused across different projects and environments, promoting consistency and reducing duplication of effort.
kro is available as an open-source project on GitHub. The GitHub organization is currently jointly owned by teams from Google, AWS, and Microsoft, and we welcome contributions from the community. We also have a website with documentation on installing and using kro, including example use cases. As an early-stage project, kro is not yet ready for production use, but we still encourage you to test it out in your own Kubernetes development environments!

When we first launched DevOps.com, the goal was never just to report on tools or trends. It was to elevate the people, ideas, and communities shaping how software is built and delivered. The DevOps Dozen Awards exist for that exact reason. Each year, these awards recognize individuals, teams, technologies, and community efforts that move DevOps […]

Mainframes don’t have to block DevOps. Learn how extending CI/CD, Git, and security scanning to mainframe applications modernizes delivery without replacing mission-critical systems.

In 2026, cloud cost overruns stop being finance’s problem and become an engineering responsibility. Here’s how treating cost as code finally makes FinOps work.

GitLab today made generally available an agentic artificial intelligence (AI) platform that automates software engineering tasks ranging from planning to application security. Coinciding with the release of version 18.8 of the core GitLab platform, the GitLab Duo Agent Platform initially provides access to seven AI agents that DevOps teams can assign a range of tasks […]

Anaconda CEO David DeSanto explores what it takes to make AI-native development practical, secure and scalable for modern engineering teams. DeSanto reflects on his journey from leading product at GitLab to stepping into the CEO role at Anaconda, and why the next wave of software delivery will be shaped by the intersection of open source, […]

Jonathan Rende, chief product officer at Checkmarx, tackles one of the most urgent questions in AppSec right now: what happens when AI starts writing the majority of your software? With estimates that as much as 60% of code is being generated by AI in some environments, and that AI-authored code is already finding its way into […]

McLean, Virginia, United States, 15th January 2026, CyberNewsWire

Learn how AWS Bedrock, retrieval-augmented generation (RAG), and ChatOps in Microsoft Teams transform incident response by turning fragmented runbooks into a GenAI-powered assistant that reduces MTTR and improves SRE efficiency.

Let’s get the obvious out of the way right up front. AI isn’t a person. Thank you, Captain Obvious. We’re all on the same page. And yet, when we announced that AI is Techstrong’s “Person of the Year” for our Predict 2026 virtual event on January 15, a few folks felt compelled to remind us […]

Austin, TX / USA, 14th January 2026, CyberNewsWire