This one little trick can bring about enhanced training stability, larger learning rates, and improved scaling properties.
The post NeurIPS 2025 Best Paper Review: Qwen’s Systematic Exploration of Attention Gating appeared first on Towards Data Science.
In this article, we rebuild Logistic Regression step by step directly in Excel.
Starting from a binary dataset, we explore why linear regression struggles as a classifier, how the logistic function fixes these issues, and how log-loss naturally appears from the likelihood.
With a transparent gradient-descent table, you can watch the model learn at each iteration—making the whole process intuitive, visual, and surprisingly satisfying.
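For readers who want to mirror the spreadsheet outside Excel, here is a minimal NumPy sketch of the same gradient-descent loop on the log-loss; the toy dataset, learning rate, and iteration count are illustrative only and are not values from the article.

```python
import numpy as np

# Toy binary dataset: one feature x, binary label y (illustrative values only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0, 0, 0, 1, 1, 1])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Parameters: intercept b0 and slope b1, learned by gradient descent on the log-loss
b0, b1 = 0.0, 0.0
lr = 0.1  # learning rate (illustrative)

for step in range(1000):
    p = sigmoid(b0 + b1 * x)  # predicted probabilities
    log_loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    # Gradients of the log-loss with respect to b0 and b1
    grad_b0 = np.mean(p - y)
    grad_b1 = np.mean((p - y) * x)
    b0 -= lr * grad_b0
    b1 -= lr * grad_b1
    if step % 200 == 0:
        print(f"step={step:4d}  log-loss={log_loss:.4f}  b0={b0:.3f}  b1={b1:.3f}")
```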
The post The Machine Learning “Advent Calendar” Day 12: Logistic Regression in Excel appeared first on Towards Data Science.
Most breakthroughs in deep learning — from simple neural networks to large language models — are built upon a principle that is much older than AI itself: decentralization. Instead of relying on a powerful “central planner” coordinating and commanding the behaviors of other components, modern deep-learning-based AI models succeed because many simple units interact locally […]
The post Decentralized Computation: The Hidden Principle Behind Deep Learning appeared first on Towards Data Science.
Hey everyone! Welcome to the start of a major data journey that I’m calling “EDA in Public.” For those who know me, I believe the best way to learn anything is to tackle a real-world problem and share the entire messy process — including mistakes, victories, and everything in between. If you’ve been looking to level up […]
The post EDA in Public (Part 1): Cleaning and Exploring Sales Data with Pandas appeared first on Towards Data Science.
Introduction: How do we identify latent groups of patients in a large cohort? How can we find similarities among patients that go beyond the well-known comorbidity clusters associated with specific diseases? And more importantly, how can we extract quantitative signals that can be analyzed, compared, and reused across different clinical scenarios? The information associated with […]
The post Spectral Community Detection in Clinical Knowledge Graphs appeared first on Towards Data Science.
Linear Regression looks simple, but it introduces the core ideas of modern machine learning: loss functions, optimization, gradients, scaling, and interpretation.
In this article, we rebuild Linear Regression in Excel, compare the closed-form solution with Gradient Descent, and see how the coefficients evolve step by step.
This foundation naturally leads to regularization, kernels, classification, and the dual view.
Linear Regression is not just a straight line, but the starting point for many models we will explore next in the Advent Calendar.
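As a companion to the spreadsheet, the short sketch below fits a toy one-feature dataset with both the closed-form (normal-equation) solution and a plain gradient-descent loop; the data and learning rate are made up for illustration and are not taken from the article.

```python
import numpy as np

# Toy dataset (illustrative): y is roughly 2x + 1 with a little noise
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
X = np.column_stack([np.ones_like(x), x])  # add an intercept column

# Closed-form solution: beta = (X^T X)^{-1} X^T y
beta_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent on the mean squared error
beta_gd = np.zeros(2)
lr = 0.05
for _ in range(5000):
    residuals = X @ beta_gd - y
    beta_gd -= lr * (X.T @ residuals) / len(y)

print("closed-form:      ", beta_closed)
print("gradient descent: ", beta_gd)  # converges to the same coefficients
```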
The post The Machine Learning “Advent Calendar” Day 11: Linear Regression in Excel appeared first on Towards Data Science.
A step-by-step tutorial that explores the Python Turtle Module
The post Drawing Shapes with the Python Turtle Module appeared first on Towards Data Science.
What I've learned about making Pandas faster after too many slow notebooks and frozen sessions
The post 7 Pandas Performance Tricks Every Data Scientist Should Know appeared first on Towards Data Science.
Understanding how LLM agents transfer control to each other in a multi-agent system with LangGraph
The post How Agent Handoffs Work in Multi-Agent Systems appeared first on Towards Data Science.
DBSCAN shows how far we can go with a very simple idea: count how many neighbors live close to each point.
It finds clusters and marks anomalies without any probabilistic model, and it works beautifully in Excel.
But because DBSCAN relies on a single fixed radius, HDBSCAN is needed to make the method robust on real data.
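To see the same neighbor-counting idea in code, here is a minimal sketch on a made-up one-dimensional dataset; eps and min_samples are illustrative choices, and the scikit-learn call at the end is only a cross-check, not the article's Excel construction.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Tiny 1-D dataset (illustrative); real data can have many features
points = np.array([1.0, 1.2, 1.4, 2.0, 2.1, 8.0])
eps = 0.5        # the single fixed radius DBSCAN depends on
min_samples = 2  # neighbors required (counting the point itself) to be a core point

# The core idea: count how many neighbors live within eps of each point
for p in points:
    neighbors = int(np.sum(np.abs(points - p) <= eps))
    label = "core point" if neighbors >= min_samples else "possible noise"
    print(f"point {p:>4}: {neighbors} neighbor(s) within eps -> {label}")

# Cross-check with scikit-learn: label -1 marks noise, other integers are cluster ids
print(DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points.reshape(-1, 1)))
```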
The post The Machine Learning “Advent Calendar” Day 10: DBSCAN in Excel appeared first on Towards Data Science.
Learn how to become an effective engineer with continual learning LLMs
The post How to Maximize Agentic Memory for Continual Learning appeared first on Towards Data Science.
What recruiters are looking for in machine learning portfolios
The post Don’t Build an ML Portfolio Without These Projects appeared first on Towards Data Science.
Tips for accelerating AI/ML on CPU — Part 2
The post Optimizing PyTorch Model Inference on AWS Graviton appeared first on Towards Data Science.
In this article, we explore LOF through three simple steps: distances and neighbors, reachability distances, and the final LOF score. Using tiny datasets, we see how two anomalies can look obvious to us but completely different to different algorithms. This reveals the key idea of unsupervised learning: there is no single “true” outlier, only definitions. Understanding these definitions is the real skill.
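As a rough code counterpart to the Excel walkthrough, the sketch below scores a made-up one-dimensional dataset with scikit-learn's LocalOutlierFactor; the data and the n_neighbors value are illustrative, and the library hides the intermediate reachability-distance step that the article computes by hand.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Tiny 1-D dataset (illustrative): a dense group plus two very different "anomalies"
X = np.array([[1.0], [1.1], [1.2], [1.3], [5.0], [50.0]])

# n_neighbors is the definition-setting choice; 3 is arbitrary here
lof = LocalOutlierFactor(n_neighbors=3)
labels = lof.fit_predict(X)             # -1 = flagged as outlier, 1 = inlier
scores = -lof.negative_outlier_factor_  # higher = more of a "local" outlier

for x, label, score in zip(X.ravel(), labels, scores):
    verdict = "outlier" if label == -1 else "inlier"
    print(f"x={x:5.1f}  LOF={score:6.2f}  {verdict}")
```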
The post The Machine Learning “Advent Calendar” Day 9: LOF in Excel appeared first on Towards Data Science.
Build a self-hosted, end-to-end platform that gives each user a personal, agentic chatbot that can autonomously vector-search through files that the user explicitly allows it to access.
The post Personal, Agentic Assistants: A Practical Blueprint for a Secure, Multi-User, Self-Hosted Chatbot appeared first on Towards Data Science.
From idea to impact: using AI as your accelerating copilot
The post How to Develop AI-Powered Solutions, Accelerated by AI appeared first on Towards Data Science.
Smarter retrieval strategies that outperform dense graphs — with hybrid pipelines and lower cost
The post GraphRAG in Practice: How to Build Cost-Efficient, High-Recall Retrieval Systems appeared first on Towards Data Science.
How to learn AI in 2026 through real, usable projects
The post A Realistic Roadmap to Start an AI Career in 2026 appeared first on Towards Data Science.
Why on-device intelligence and low-orbit constellations are the only viable path to universal accessibility
The post Bridging the Silence: How LEO Satellites and Edge AI Will Democratize Connectivity appeared first on Towards Data Science.
Isolation Forest may look technical, but its idea is simple: isolate points using random splits. If a point is isolated quickly, it is an anomaly; if it takes many splits, it is normal.
Using the tiny dataset 1, 2, 3, 9, we can see the logic clearly. We build several random trees, measure how many splits each point needs, average the depths, and convert them into anomaly scores. Short depths become scores close to 1, long depths close to 0.
The Excel implementation is painful, but the algorithm itself is elegant. It scales to many features, makes no assumptions about distributions, and even works with categorical data. Above all, Isolation Forest asks a different question: not “What is normal?”, but “How fast can I isolate this point?”
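For comparison with the spreadsheet, here is a minimal scikit-learn sketch on the same tiny dataset 1, 2, 3, 9; the number of trees and the random seed are arbitrary, and the library's score convention differs from the 0-to-1 scores described above (lower score_samples values mean easier to isolate).

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# The tiny dataset from the article: 9 should be isolated much faster than 1, 2, 3
X = np.array([[1.0], [2.0], [3.0], [9.0]])

# 100 random trees; both the tree count and the seed are illustrative
forest = IsolationForest(n_estimators=100, random_state=0).fit(X)

labels = forest.predict(X)        # -1 = anomaly, 1 = normal
scores = forest.score_samples(X)  # lower = shorter average path = more anomalous

for x, label, score in zip(X.ravel(), labels, scores):
    verdict = "anomaly" if label == -1 else "normal"
    print(f"x={x:3.0f}  score={score:7.3f}  {verdict}")
```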
The post The Machine Learning “Advent Calendar” Day 8: Isolation Forest in Excel appeared first on Towards Data Science.

Going forward, when you run HashiCorp’s Terraform Infrastructure as Code (IaC) software, you will have one language to write your
The post IBM HashiCorp ‘Sunsets’ Terraform’s External Language Support appeared first on The New Stack.

TOKYO — When I worked for NASA in the 1980s, every satellite and spaceship that went into orbit ran one-off,
The post Papermoon: A Space-Grade Linux for the NewSpace Era appeared first on The New Stack.

Too often, developers are unfairly accused of being careless about data integrity. The logic goes: Without the rigid structure of
The post Rethinking Data Integrity: Why Domain-Driven Design Is Crucial appeared first on The New Stack.

TOKYO — Forking an open source project is never a first choice. It is divisive, dangerous, and politically risky. But
The post How the Team Behind Valkey Knew It Was Time to Fork appeared first on The New Stack.

We have all seen the headlines that enterprise AI is failing at a high rate. MIT has reported that 95%
The post Your AI Is Working With Half a Brain. You Need the Other Half appeared first on The New Stack.

You may not have heard of open source developer Tony Kovanen, but the 34-year-old has collaborated on projects you know
The post Why Next.js Co-Creator Tony Kovanen Prefers the Sidelines appeared first on The New Stack.

Most data processing frameworks are built on a single execution engine. Not Apache Wayang. Last week, The Apache Software Foundation
The post Apache Wayang Makes Data Processing a Cross-Platform Job appeared first on The New Stack.

Some time ago, Google released the Gemini CLI (Command Line Interface) tool, which is pretty impressive. Unlike many AI tools, the
The post Coding With the Gemini CLI Tool appeared first on The New Stack.
“As a low-level systems engineer, if you do your job right, no one knows you exist — but the minute
The post Kubernetes GPU Management Just Got a Major Upgrade appeared first on The New Stack.

AI leader Andrej Karpathy said Inception Labs’ approach to diffusion has the potential to differ from all the
The post Inception Labs: Making LLMs Faster and More Cost-Efficient appeared first on The New Stack.

Kubernetes is one of the fastest growing open source projects in history. In 2024, it generated $1.71 billion in revenue,
The post Why Open Platforms Are the Future of Kubernetes Deployments appeared first on The New Stack.

AI agents transform enterprise operations by autonomously interpreting context, making decisions and executing tasks with minimal human input. But the
The post A 5-Step Checklist for Building Collaborative AI Agent Systems appeared first on The New Stack.

Everyone wants their database to scale. In fact, if you are old enough, you might even remember when saying your
The post AI Agents Create a New Dimension for Database Scalability appeared first on The New Stack.

IBM is no stranger to AI, given its long history with its AI Watson project and countless other efforts. But
The post IBM’s Confluent Acquisition Is About Event-Driven AI appeared first on The New Stack.

Linus Torvalds wasn’t thrilled with how the final days leading up to the 6.18 release of the Linux kernel went.
The post Linux 6.18: All About the New Long-Term Support Linux Kernel appeared first on The New Stack.

OpenAI’s Apps SDK launched in early October, enabling developers to build mini web apps for ChatGPT. While we haven’t yet
The post Why Capability-Driven Protocols Are Key for ChatGPT Apps appeared first on The New Stack.
The role of the developer is changing fast. At KubeCon North America 2025 in Atlanta, we sat down with Emilio
The post The Rise of the Cognitive Architect appeared first on The New Stack.

Model Context Protocol (MCP) has become the de facto standard for large language models (LLMs) to interact with third-party services,
The post Google Launches Managed Remote MCP Servers for Its Cloud Services appeared first on The New Stack.

TOKYO — At the invitation-only Linux Kernel Maintainers Summit here, the top Linux maintainers decided, as Jonathan Corbet, Linux kernel
The post Rust Goes Mainstream in the Linux Kernel appeared first on The New Stack.

Ripple, a new TypeScript-based UI framework, might be shrugged off as just another framework if it weren’t created by Dominic
The post Inferno Vet Creates Frontend Framework Built With AI in Mind appeared first on The New Stack.

Kubernetes applications in production traditionally require manual work and time. GitOps with Argo CD solves this. It keeps your cluster
The post Make Your Kubernetes Apps Self-Heal With Argo CD and GitOps appeared first on The New Stack.

In my previous article on preparing CI/CD pipelines to ship production-ready agents, I argued that we cannot ship agents to
The post Why the MCP Server Is Now a Critical Microservice appeared first on The New Stack.

Unlike cloud native, the term “AI native” is ill-defined. One might even say undefined. Many have glommed onto the term
The post What Is ‘AI Native’ and Why Is MCP Key? appeared first on The New Stack.

Demand for bandwidth continues to surge with increasing use of AI, high-definition video streaming, cloud and 5G mobile applications. In
The post Why AI Traffic Growth Demands Optical Network Automation Now appeared first on The New Stack.

Somewhere between disciplined software engineering and late-night code jazz lives a strange new philosophy: Vibe coding. It’s the act of
The post Why There Might Be Something to Vibe Coding After All appeared first on The New Stack.
“I’m obsessed with inference,” Jonathan Bryce, who took over as the executive director of the CNCF this summer, said during a
The post Why the CNCF’s New Executive Director Is Obsessed With Inference appeared first on The New Stack.
The 2025 State of AI-assisted Software Development report revealed a critical truth: AI is an amplifier. It magnifies the strengths of high-performing organizations and the dysfunctions of struggling ones.
While AI adoption is now near-universal, with 90% of developers using it in their daily workflows, success is not guaranteed. Our cluster analysis of nearly 5,000 technology professionals reveals significant variation in team performance: Not everyone experiences the same outcomes from adopting AI.
From this disparity, we can conclude that how teams are using AI is a critical factor. We wanted to understand the particular capabilities and conditions that enable teams to achieve positive outcomes, leading us to develop the DORA AI Capabilities Model report.
This companion guide to the 2025 DORA Report is designed to help you navigate our new reality. It provides actionable strategies, implementation tactics, and measurement frameworks to help technology leaders build an environment where AI thrives.
Successfully using AI requires cultivating your technical and cultural environment. From the same set of respondents who participated in the 2025 DORA survey, we identified seven foundational capabilities that are proven to amplify the positive impact of AI on organizational performance:
The DORA AI Capabilities Model shows which capabilities amplify the effect of AI adoption on specific outcomes.
Every organization starts their AI journey differently. To help you prioritize, this report introduces seven distinct team archetypes derived from our cluster analysis. These profiles range from "harmonious high-achievers," who excel in both performance and well-being, to teams facing "foundational challenges" or those stuck in a "legacy bottleneck," where unstable systems undermine morale.
Identifying the profile that best matches your team can help pinpoint the most impactful interventions. For example, a "high impact, low cadence" team might prioritize automation to improve stability, while a team "constrained by process" might focus on reducing friction through a better AI stance.
Once you understand your team's profile, how do you direct your efforts? The report includes a step-by-step facilitation guide for running a Value Stream Mapping (VSM) exercise.
VSM acts as an AI force multiplier. By visualizing your flow from idea to customer, you can identify where work waits and where friction exists. This ensures that the efficiency gains from AI aren't just creating local optimizations that pile up work downstream, but are instead channeled into solving system-level constraints.
AI adoption is an organizational transformation. The greatest returns come not from the tools themselves, but from investing in the foundational systems that enable them.
When was the last time you knew — not just hoped — that your disaster recovery plan would work perfectly?
For most of us, the answer is unclear. Sure, you may have a DR plan, a meticulously crafted document stored in a wiki or a shared drive, that gets dusted off for compliance audits or the occasional tabletop drill. You assume its procedures are correct, its contact lists are current, and its dependencies are fully mapped, and you certainly hope it works.
Why wouldn’t it work? One problem is that systems are rarely static anymore. In a world where you deploy new microservices dozens of times per day, make constant configuration changes, and maintain an ever-growing web of third-party API dependencies, the DR plan you wrote last quarter is probably just as useful as one from 10 years ago.
And if the failover does work, will it work well enough to meet the promises you've made to your customers (or board of directors or regulators)? When a key component fails, could you still meet your availability and latency targets, a.k.a. your Service Level Objectives (SLOs)?
So, how do you close this gap between your current aspirational DR plan and a DR plan that you actually have confidence in? The answer isn't to write more documents or run more theatrical drills. The answer is to stop assuming and start proving.
This is where chaos engineering comes in. Unlike what the name might imply, chaos engineering isn’t a tool for recklessly breaking things. Instead, it’s a framework that provides data-driven confidence in your SLOs under stress. By running controlled experiments that simulate real-world disasters like a database failover or a regional outage, you can quantitatively measure the impact of those failures on your systems’ performance. Chaos engineering is how you transform your DR hypotheses into a proven method to ensure resilience. By validating your plan through experimentation, you create tangible evidence, verifying that your plan will safeguard your infrastructure and keep your promises to customers.
In a nutshell, chaos engineering is the practice of running controlled, scientific experiments to find weaknesses in your system before they cause a real outage.
At its core, it’s about building confidence in your system’s resilience. The process starts with understanding your system's steady state, which is its normal, measurable, and healthy output. You can't know the true impact of a failure without first defining what "good" looks like. This understanding allows you to form a clear, testable hypothesis: a statement of belief that your system's steady state will persist even when a specific, turbulent condition is introduced.
To test this hypothesis, you then execute a controlled action, which is a precise and targeted failure injected into the system. This isn't random mischief; it's a specific simulation of real-world failures, such as consuming all CPU on a host (resource exhaustion), adding network latency (network failure), or terminating a virtual machine (state failure). While this action is running, automated probes act as your scientific instruments, continuously monitoring the system's state to measure the effect.
Together, these components form a complete scientific loop: you use a hypothesis to predict resilience, run an experiment by applying an action to simulate adversity, and use probes to measure the impact, turning uncertainty into hard data.
Now that you understand the building blocks of a chaos experiment, you can build the bridge to your ultimate goal: transforming your DR plan from a document of hope into an evidence-based procedure. The key is to stop seeing your DR plan as a set of instructions and start seeing it for what it truly is: a collection of unproven hypotheses.
When you think about it, every significant statement in your DR document is a claim waiting to be tested. When your plan states, "The database will failover to the replica in under 5 minutes," that isn't a fact, it's a hypothesis. When it says, "In the event of a regional outage, traffic will be successfully rerouted to the secondary region," that's another hypothesis. Your DR plan is filled with these critical assumptions about how your system should behave under duress. Until you test them, they remain nothing more than educated guesses.
Chaos experiments are the ultimate validation tools, live-fire drills that put your DR hypotheses to a real, empirical test. Instead of just talking through a scenario, you use controlled actions to safely and precisely simulate the disaster. You're no longer asking "what if?"; you're actively measuring "what happens when."
For example, imagine you have a DR plan for a regional outage. When you adopt chaos engineering, you break down that plan into a hypothesis and an experiment. For example:
The hypothesis: "In case our primary region us-central1 becomes unreachable, the load balancers will failover all traffic to us-east1 within 3 minutes, with an error rate below 1%."
The chaos experiment: Run an action that simulates a regional outage by injecting a "blackhole" that drops all network traffic to and from us-central1 for a limited time. Your probes then measure the actual failover time and error rates to validate the hypothesis.
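As an illustration of how probes might turn that hypothesis into numbers, here is a minimal Python sketch that samples a health endpoint while the blackhole action runs; the endpoint URL, sampling rate, and recovery heuristic are placeholders, and it assumes the probe loop starts roughly when the fault is injected.

```python
import time
import urllib.request

# Placeholders: the health endpoint, budgets, and sampling rate are illustrative only.
ENDPOINT = "https://example.com/healthz"  # served through the load balancer
FAILOVER_BUDGET_S = 3 * 60                # "within 3 minutes" from the hypothesis
ERROR_BUDGET = 0.01                       # "error rate below 1%"

def steady_state_ok(timeout=2):
    """Probe: is the normal, healthy output (an HTTP 200) still being served?"""
    try:
        with urllib.request.urlopen(ENDPOINT, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False

# While the blackhole action runs elsewhere, sample the steady state once per second.
samples, recovered_at = [], None
start = time.time()
while time.time() - start < FAILOVER_BUDGET_S:
    ok = steady_state_ok()
    samples.append(ok)
    if ok and recovered_at is None:
        recovered_at = time.time() - start  # first healthy response after the fault
    time.sleep(1)

error_rate = 1 - sum(samples) / len(samples)
if recovered_at is None:
    print("steady state never returned within the window")
else:
    print(f"failover completed after roughly {recovered_at:.0f}s")
print(f"error rate during the window: {error_rate:.2%}")
held = recovered_at is not None and error_rate < ERROR_BUDGET
print("hypothesis held" if held else "hypothesis violated")
```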
In other words, by applying the chaos engineering methodology, you systematically move through your DR plan, turning each assumption into a proven fact. You're not just testing your plan; you're forging it in a controlled fire.
Beyond simply proving system availability, chaos engineering builds trust in your reliability metrics, ensuring that you meet your SLOs even when services become unavailable. An SLO is a specific, acceptable target level of your service's performance measured over a specified period that reflects the user's experience. SLOs aren't just internal goals; they are the bedrock of customer trust and the foundation of your contractual service level agreements (SLAs).
A traditional DR drill might get a "pass" because the backup system came online. But what if it took 20 minutes to fail over, during which every user saw errors? What if the backup region was under-provisioned, and performance became so slow that the service was unusable? From a technical perspective, you "recovered." But from a customer's perspective, you were down.
A chaos experiment, however, can help you answer a critical question: "During a failover, did we still meet our SLOs?” Because your probes are constantly measuring performance against your SLOs, you get the full picture. You don't just see that the database failed over; you see that it took 7 minutes, during which your latency SLO was breached and your error budget was completely burned. This is the crucial, game-changing insight. It shifts the entire goal from simple disaster recovery to SLO preservation, which is what actually determines if a failure was a minor hiccup or a major business-impacting incident. It also provides the data necessary to set goals for system improvement. So the next time you run this experiment, you can measure if and how much your system resilience has improved, and ultimately if you can maintain your SLO during the disaster event.
The journey to resilience doesn't start by simulating a full regional failover. It starts with a single, small experiment. The goal is not to boil the ocean; it's to build momentum. Test one timeout, one retry mechanism, or one graceful error message.
The biggest win from your first successful experiment won't be the technical data you gather. It will be the confidence you build. When your team sees that they can safely inject failure, learn from it, and improve the system, their entire relationship with failure changes. Fear is replaced by curiosity. That confidence is the catalyst for building a true, enduring culture of resilience. To learn more and get started with chaos engineering, check out this blog and this podcast. And if you’re ready to get started, but unsure how, reach out to Google Cloud professional services to discuss how we can help.
Earlier this year, we unveiled a big investment in platform and developer team productivity with the launch of Application Design Center, helping these teams streamline the design and deployment of cloud application infrastructure while ensuring applications are secure, reliable, and aligned with best practices. And today, Application Design Center is generally available.
We built Application Design Center to put applications at the center of your cloud experience, with a visual, canvas-style and AI-powered approach to design and modify Terraform-backed application templates. It also offers full lifecycle management that’s aligned with DevOps best practices across application design and deployment.
Application Design Center is a core component of our application-centric cloud experience. When you use Application Design Center to design and deploy your application infrastructure, your applications are easily discoverable, observable, and manageable. Application Design Center works in concert with App Hub to automatically register application deployments, enabling a unified view and control plane for your application portfolio, and Cloud Hub, to provide operational insights for your applications.
“Google Application Design Center is a valuable enabler for Platform Engineering, providing a structured approach to harmonizing resource creation in Google Cloud Platform. By aligning tools, processes, and technologies, it streamlines workflows, reducing friction between development, operations, and other teams. This harmonization enhances collaboration, accelerates delivery, and ensures consistency across Google Cloud environments.” - Ervis Duraj, Principal Engineer, MediaMarktSaturn Technology
Our goal with Application Design Center is for you to innovate more, and administer less. It consists of four key elements to help you minimize administrative overhead and maximize efficiency, so you can design and deploy applications with integrated best practices and essential guardrails. Let’s take a closer look.
1. Terraform components and application templates
Develop applications faster with our growing library of opinionated application templates. These provide well-architected patterns and pre-built components, including innovative "AI inference templates" to help you leverage AI to create dynamic and intelligent application foundations. As an example, at launch, Application Design Center provides opinionated templates for Google Kubernetes Engine (GKE) clusters (Standard, Autopilot and NodePool) to run AI inference workloads using a variety of LLM models, as well as for enterprise-grade production clusters or single-region web app clusters.
You can also ingest and manage your existing Terraform configurations (“Bring your own Terraform”) directly from Git repositories. Once imported, you can use Application Design Center to design with your own Terraform, or in combination with Google-provided Terraform, to create standardized, opinionated infrastructure patterns for sharing and reuse across your application teams.
2. AI-powered design for rapid application design and prototyping
Application Design Center integrates with Google's Gemini Cloud Assist Design Agent, empowering you to design actual, deployable application infrastructure templates on Google Cloud that you can export as Terraform infrastructure-as-code.
With Gemini Cloud Assist, you can describe your application design intents using natural language. In return, Gemini interactively generates multi-product application template suggestions, complete with visual architecture diagrams and summarized benefits. You can then refine these proposals through multi-turn reasoning or by directly manipulating the architecture within the Application Design Center canvas.
Additionally, all designs that you create with Gemini are automatically observable, optimizable, and enabled for troubleshooting assistance during runtime, thanks to their tight integration with Gemini Cloud Assist.
3. A secure, sharable catalog of application templates with full lifecycle management
Platform admins can curate a collection of application templates built from Google's best-practice components. This provides developers a trusted, self-service experience from which they can quickly discover and deploy compliant applications. Tight integration with Cloud Hub transforms these governed templates into a live operational command center, complete with unified visibility into the health and deployment status of the resulting applications. This closes the critical loop between design and runtime, so that your production environments reflect your organization’s approved architectural standards.
Also, Application Design Center’s robust application template revisions serve as an immutable audit trail. It automatically detects and flags configuration drift between your intended designs and deployed applications, so that developers can remediate unauthorized changes or safely push approved configuration updates. This helps ensure continuous state consistency and compliance from Day 1 and through the subsequent evolution of your application.
4. GitOps integration automating developers’ day-to-day software design lifecycle tasks
By integrating Application Design Center into existing CI/CD workflows, platform teams empower developers to own the complete software delivery lifecycle right from their IDE. Developers can leverage compliant application and infrastructure (IaC) code using Application Design Center application templates.
Further, every infrastructure decision made through Application Design Center is committed to code, versioned, and auditable. Specifically, developers can download the application IaC template from Application Design Center and import it into their app repos (the single source of truth), clone their repo, and edit the Terraform directly in their local IDEs. Any modifications go through a Git pull request for review. Once approved, this automatically triggers the existing CI/CD setup to build, test, and deploy both app and infra changes in lockstep. This unified approach minimizes friction, enforcing "golden paths" and providing an end-to-end automated pathway from a line of code in the IDE to a fully deployed change in production.
This GA launch is packed with features that users have been asking for. We’re excited to share powerful new capabilities: enterprise-grade governance and security with public APIs and gcloud CLI support; full compatibility with VPC service controls; bring your own Terraform and GitOps support for integration with your existing application patterns and automation pipelines; agentic application patterns using GKE templates (Standard, Autopilot and NodePool); and finally, a simplified onboarding experience with app-managed project support, making Application Design Center an AI-powered engine for your applications on Google Cloud.
To help you get started, Google provides a growing library of curated Google application templates built by experts. These templates combine multiple Google Cloud products and best practices to serve common use cases, which you can configure for deployment, and view as infrastructure as code in-line. Platform teams can then create and securely share the catalogs and collaborate with teammates on designs and self-service deployment for developers. For enterprises with existing Terraform patterns and assets, Application Design Center interoperates by enabling their import and reuse within its native design and configuration experience.
Ready to experience the power of Application Design Center? You can learn more about ADC and get started building in minutes using the quickstart. You can start building your first AI-powered application template in minutes, free of cost, and quickly deploy applications with working code. For deeper insights, explore the comprehensive public documentation here. We can't wait to see how you innovate with the Application Design Center!
Editor's note: This blog was updated on Dec. 4, 5, 7, and 12, 2025, with additional guidance on Cloud Armor WAF rule syntax, and WAF enforcement across App Engine Standard, Cloud Functions, and Cloud Run.
Earlier today, Meta and Vercel publicly disclosed two vulnerabilities that expose services built using the popular open-source frameworks React Server Components (CVE-2025-55182) and Next.js to remote code execution risks when used for some server-side use cases. At Google Cloud, we understand the severity of these vulnerabilities, also known as React2Shell, and our security teams have shared their recommendations to help our customers take immediate, decisive action to secure their applications.
The React Server Components framework is commonly used for building user interfaces. On Dec. 3, 2025, CVE.org assigned this vulnerability as CVE-2025-55182. The official Common Vulnerability Scoring System (CVSS) base severity score has been determined as Critical, a severity of 10.0.
Vulnerable versions: React 19.0, 19.1.0, 19.1.1, and 19.2.0
Patched in React 19.2.1
Fix: https://github.com/facebook/react/commit/7dc903cd29dac55efb4424853fd0442fef3a8700
Announcement: https://react.dev/blog/2025/12/03/critical-security-vulnerability-in-react-server-components
Next.js is a web development framework that depends on React, and is also commonly used for building user interfaces. (The Next.js vulnerability was referenced as CVE-2025-66478 before being marked as a duplicate.)
Vulnerable versions: Next.js 15.x, Next.js 16.x, Next.js 14.3.0-canary.77 and later canary releases
Patched versions are listed here.
Fix: https://github.com/vercel/next.js/commit/6ef90ef49fd32171150b6f81d14708aa54cd07b2
Announcement: https://nextjs.org/blog/CVE-2025-66478
Google Threat Intelligence Group (GTIG) has also published a new report to help understand the specific threats exploiting React2Shell.
We strongly encourage organizations that manage environments relying on the React and Next.js frameworks to update to the latest version and take the mitigation actions outlined below.
We have created and rolled out a new Cloud Armor web application firewall (WAF) rule designed to detect and block exploitation attempts related to CVE-2025-55182. This new rule is available now and is intended to help protect your internet-facing applications and services that use global or regional Application Load Balancers. We recommend deploying this rule as a temporary mitigation while your vulnerability management program patches and verifies all vulnerable instances in your environment.
For customers using App Engine Standard, Cloud Functions, Cloud Run, Firebase Hosting or Firebase App Hosting, we provide an additional layer of defense for serverless workloads by automatically enforcing platform-level WAF rules that can detect and block the most common exploitation attempts related to CVE-2025-55182.
For Project Shield users, we have deployed WAF protections for all sites and no action is necessary to enable these WAF rules. For long-term mitigation, you will need to patch your origin servers as an essential step to eliminate the vulnerability (see additional guidance below).
Cloud Armor and the Application Load Balancer can be used to deliver and protect your applications and services regardless of whether they are deployed on Google Cloud, on-premises, or on another infrastructure provider. If you are not yet using Cloud Armor and the Application Load Balancer, please follow the guidance further down to get started.
While these platform-level rules and the optional Cloud Armor WAF rules (for services behind an Application Load Balancer) help mitigate the risk from exploits of the CVE, we continue to strongly recommend updating your application dependencies as the primary long-term mitigation.
To configure Cloud Armor to detect and protect from CVE-2025-55182, you can use the cve-canary preconfigured WAF rule leveraging the new ruleID that we have added for this vulnerability. This rule is opt-in only, and must be added to your policy even if you are already using the cve-canary rules.
In your Cloud Armor backend security policy, create a new rule and configure the following match condition:
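A representative match condition, assuming the cve-canary ruleset named above (confirm the exact expression and any new CVE-specific rule ID against the current advisory and Cloud Armor documentation), is:

```
evaluatePreconfiguredExpr('cve-canary')
```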
This can be accomplished from the Google Cloud console by navigating to Cloud Armor and modifying an existing or creating a new policy.
Cloud Armor rule creation in the Google Cloud console.
Alternatively, the gcloud CLI can be used to create or modify a policy with the requisite rule:
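A representative command is sketched below; the priority, policy name, and description are placeholders, and the exact expression (including any new CVE-specific rule ID referenced above) should be taken from the current advisory:

```bash
gcloud compute security-policies rules create 1000 \
    --security-policy=YOUR_POLICY_NAME \
    --expression="evaluatePreconfiguredExpr('cve-canary')" \
    --action=deny-403 \
    --description="Block CVE-2025-55182 (React2Shell) exploit attempts" \
    --preview
```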
Additionally, if you are managing your rules with Terraform, you may implement the rule via the following syntax:
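A sketch using the google provider's google_compute_security_policy resource is shown below; the resource and policy names are placeholders, and if you already manage the policy in Terraform you would add only the new rule block to your existing resource:

```hcl
resource "google_compute_security_policy" "react2shell" {
  name = "your-policy-name" # placeholder

  rule {
    action      = "deny(403)"
    priority    = 1000
    preview     = true # start in preview mode, as recommended below
    description = "Block CVE-2025-55182 (React2Shell) exploit attempts"

    match {
      expr {
        expression = "evaluatePreconfiguredExpr('cve-canary')"
      }
    }
  }

  # The default rule every Cloud Armor policy must keep
  rule {
    action      = "allow"
    priority    = 2147483647
    description = "default rule"

    match {
      versioned_expr = "SRC_IPS_V1"
      config {
        src_ip_ranges = ["*"]
      }
    }
  }
}
```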
Cloud Armor rules can be configured in preview mode, a logging-only mode to test or monitor the expected impact of the rule without Cloud Armor enforcing the configured action. We recommend that the new rule described above first be deployed in preview mode in your production environments so that you can see what traffic it would block.
Once you verify that the new rule is behaving as desired in your environment, then you can disable preview mode to allow Cloud Armor to actively enforce it.
Cloud Armor per-request WAF logs are emitted as part of the Application Load Balancer logs to Cloud Logging. To see what Cloud Armor’s decision was on every request, load balancer logging first needs to be enabled on a per backend service basis. Once it is enabled, all subsequent Cloud Armor decisions will be logged and can be found in Cloud Logging by following these instructions.
There has been a proliferation of scanning tools designed to help identify vulnerable instances of React and Next.js in your environments. Many of these scanners work by crafting a legitimate query and inspecting the server's response to determine which version of React or Next.js is running.
Our WAF rule is designed to detect and prevent exploit attempts of CVE-2025-55182. As the scanners discussed above are not attempting an exploit, but sending a safe query to elicit a response revealing indications of the version of the software, the above Cloud Armor rule will not detect or block such scanners.
If the findings of these scanners indicate a vulnerable instance of software protected by Cloud Armor, that does not mean that an actual exploit attempt of the vulnerability will successfully get through your Cloud Armor security policy. Instead, such findings mean that the version of React or Next.js detected is known to be vulnerable and should be patched.
If your workload is already using an Application Load Balancer to receive traffic from the internet, you can configure Cloud Armor to protect your workload from this and other application-level vulnerabilities (as well as DDoS attacks) by following these instructions.
If you are not yet using an Application Load Balancer and Cloud Armor, you can get started with the external Application Load Balancer overview, the Cloud Armor overview, and the Cloud Armor best practices.
If your workload is using Cloud Run, Cloud Run functions, or App Engine and receives traffic from the internet, you must first set up an Application Load Balancer in front of your endpoint to leverage Cloud Armor security policies to protect your workload. You will then need to configure the appropriate controls to ensure that Cloud Armor and the Application Load Balancer can’t be bypassed.
Once you configure Cloud Armor, we recommend consulting our best practices guide. Be sure to account for limitations discussed in the documentation to minimize risk and optimize performance while ensuring the safety and availability of your workloads.
Google Cloud is enforcing platform-level protections across App Engine Standard, Cloud Functions, and Cloud Run to automatically help protect against common exploit attempts of CVE-2025-55182. This protection supplements the protections already in place for Firebase Hosting and Firebase App Hosting.
What this means for you:
Applications deployed to those serverless services benefit from these WAF rules that are enabled by default to help provide a base level of protection without requiring manual configuration.
These rules are designed to block known malicious payloads targeting this vulnerability.
Important considerations:
Patching is still critical: These platform-level defenses are intended to be a temporary mitigation. The most effective long-term solution is to update your application's dependencies to non-vulnerable versions of React and Next.js, and redeploy them.
Potential impacts: While unlikely, if you believe this platform-level filtering is incorrectly impacting your application's traffic, please contact Google Cloud Support and reference issue number 465748820.
While WAF rules provide critical frontline defense, the most comprehensive long-term solution is to patch the underlying frameworks.
While Google Cloud is providing platform-level protections and Cloud Armor options, we urge all customers running React and Next.js applications on Google Cloud to immediately update their dependencies to the latest stable versions (React 19.2.1 or the relevant version of Next.js listed here), and redeploy their services.
This applies specifically to applications deployed on:
Patching your applications is an essential step to eliminate the vulnerability at its source and ensure the continued integrity and security of your services.
We will continue to monitor the situation closely and provide further updates and guidance as necessary. Please refer to our official Google Cloud Security advisories for the most current information and detailed steps.
If you have any questions or require assistance, please contact Google Cloud Support and reference issue number 465748820.
As engineers, we all dream of perfectly resilient systems — ones that scale perfectly, provide a great user experience, and never ever go down. What if we told you the key to building these kinds of resilient systems isn't avoiding failures, but deliberately causing them? Welcome to the world of chaos engineering, where you stress test your systems by introducing chaos, i.e., failures, into a system under a controlled environment. In an era where downtime can cost millions and destroy reputations in minutes, the most innovative companies aren't just waiting for disasters to happen — they're causing them and learning from the resulting failures, so they can build immunity to chaos before it strikes in production.
Chaos engineering is useful for all kinds of systems, but particularly for cloud-based distributed ones. Modern architectures have evolved from monolithic to microservices-based systems, often comprising hundreds or thousands of services. These complex service dependencies introduce multiple points of failure, and it’s difficult if not impossible to predict all the possible failure modes through traditional testing methods. When these applications are deployed on the cloud, they are deployed across multiple availability zones and regions. This increases the likelihood of failure due to the highly distributed nature of cloud environments and the large number of services that coexist within them.
A common misconception is that cloud environments automatically provide application resiliency, eliminating the need for testing. Although cloud providers do offer various levels of resiliency and SLAs for their cloud products, these alone do not guarantee that your business applications are protected. If applications are not designed to be fault-tolerant or if they assume constant availability of cloud services, they will fail when a particular cloud service they depend on is not available.
In short, chaos engineering can take a team's worst "what if?" scenarios and transform them into well-rehearsed responses. Chaos engineering isn’t about breaking systems — engineering chaotically, as it were — it's about building teams that face production incidents with the calm confidence that only comes from having weathered that chaos before, albeit in controlled conditions.
Google Cloud’s Professional Service Organization (PSO) Enterprise Architecture team consults on and provides hands-on expertise for customers’ cloud transformation journeys, including application development, cloud migrations, and enterprise architecture. And when advising on designing resilient architecture for cloud environments, we routinely introduce the principles and practices of chaos engineering and Site Reliability Engineering (SRE).
In this first blog post in a series, we explain the basics of chaos engineering — what it is and its core principles and elements. We then explore how chaos engineering is particularly helpful and important for teams running distributed applications in the cloud. Finally, we’ll talk about how to get started, and point you to further resources.
Chaos engineering is a methodology invented by Netflix in 2010 when it created and popularized ‘Chaos Monkey’ to address the need to build more resilient and reliable systems in the face of increasing complexity in its AWS environment. Around the same time, Google introduced Disaster Resilience Testing, or DiRT, which enabled continuous and automated disaster readiness, response, and recovery of Google’s business, systems, and data. Here on Google Cloud’s PSO team, we offer various services to help customers implement DiRT as part of SRE practices. These offerings also include training on how to perform DiRT on applications and systems operating on Google Cloud. The central concept is straightforward: deliberately introduce controlled disruptions into a system to identify vulnerabilities, evaluate its resilience, and enhance its overall reliability.
As a proactive discipline, chaos engineering enables organizations to identify weaknesses in their systems before they lead to significant outages or failures, where a system includes not only the technology components but also the people and processes of an organization. By introducing controlled, real-world disruptions, chaos engineering helps test a system's robustness, recoverability, and fault tolerance. This approach allows teams to uncover potential vulnerabilities, so that systems are better equipped to handle unexpected events and continue functioning smoothly under stress.
Chaos engineering is guided by a set of core principles about why it should be done, while practices define what needs to be done.
Below are the principles of chaos engineering:
With these principles established, follow these practices when conducting a chaos engineering experiment:
In other words, chaos engineering isn't about breaking things for the sake of it, but about building more resilient systems by understanding their limitations and addressing them proactively.
Here are the core elements you'll use in a chaos engineering experiment, derived from these five principles:
Now that you have a good understanding of chaos engineering and why to use it in your cloud environment, the next step is to try it out for yourself in your own development environment.
There are multiple chaos engineering solutions in the market; some are paid products and some are open-source frameworks. To get started quickly, we recommend that you use Chaos Toolkit as your chaos engineering framework.
Chaos Toolkit is an open-source framework written in Python that provides a modular architecture where you can plug in other libraries (also known as ‘drivers’) to extend your chaos engineering experiments. For example, there are extension libraries for Google Cloud, Kubernetes, and many other technologies. Since Chaos Toolkit is a Python-based developer tool, you can begin by configuring your Python environment. You can find a good example of a Chaos Toolkit experiment and step-by-step explanation here.
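To give a feel for what an experiment looks like, here is a minimal, illustrative sketch written as a Python script that emits the JSON file Chaos Toolkit expects; the health-check URL and the fault-injection command are placeholders, and a real experiment would typically call a driver such as the Google Cloud or Kubernetes extension instead of a shell command.

```python
import json

# Minimal, illustrative Chaos Toolkit experiment; the URL and the fault-injection
# command are placeholders -- real experiments would use a driver such as the
# Google Cloud or Kubernetes extension.
experiment = {
    "version": "1.0.0",
    "title": "Service stays healthy while a fault is injected",
    "description": "Illustrative sketch only.",
    "steady-state-hypothesis": {
        "title": "Application responds with HTTP 200",
        "probes": [
            {
                "type": "probe",
                "name": "app-is-healthy",
                "tolerance": 200,
                "provider": {
                    "type": "http",
                    "url": "https://example.com/healthz",
                    "timeout": 3,
                },
            }
        ],
    },
    "method": [
        {
            "type": "action",
            "name": "inject-failure",
            "provider": {
                "type": "process",
                "path": "echo",
                "arguments": "replace with a real fault-injection command or driver call",
            },
        }
    ],
    "rollbacks": [],
}

with open("experiment.json", "w") as f:
    json.dump(experiment, f, indent=2)

# Then execute it with the Chaos Toolkit CLI:  chaos run experiment.json
```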
Finally, to enable Google Cloud customers and engineers to introduce chaos testing in their applications, we’ve created a series of Google Cloud-specific chaos engineering recipes. Each recipe covers a specific scenario to introduce chaos in a particular Google Cloud service. For example, one recipe covers introducing chaos in an application or service running behind a Google Cloud internal or external application load balancer; another recipe covers simulating a network outage between an application running on Cloud Run and the Cloud SQL database it connects to, leveraging another Chaos Toolkit extension named ToxiProxy.
You can find a complete collection of recipes, including step-by-step instructions, scripts, and sample code, to learn how to introduce chaos engineering in your Google Cloud environment on GitHub. Then, stay tuned for subsequent posts, where we’ll talk about chaos engineering techniques, such as how to introduce faults into your Google Cloud environment.
Today, we are excited to announce the 2025 DORA Report: State of AI-assisted Software Development. The report draws on insights from over 100 hours of qualitative data and survey responses from nearly 5,000 technology professionals around the world.
The report reveals a key insight: AI doesn't fix a team; it amplifies what's already there. Strong teams use AI to become even better and more efficient. Struggling teams will find that AI only highlights and intensifies their existing problems. The greatest return comes not from the AI tools themselves, but from a strategic focus on the quality of internal platforms, the clarity of workflows, and the alignment of teams.
As we established in the 2024 report, as well as in the special report published this year, “Impact of Generative AI in Software Development”, organizations are continuing to heavily adopt AI and receive substantial benefits across important outcomes. And there is evidence that teams are learning to better integrate these tools into their workflows. Unlike last year, we observe a positive relationship between AI adoption and both software delivery throughput and product performance. It appears that people, teams, and tools are learning where, when, and how AI is most useful. However, AI adoption does continue to have a negative relationship with software delivery stability.
This confirms our central theory: AI accelerates software development, but that acceleration can expose weaknesses downstream. Without robust control systems, like strong automated testing, mature version control practices, and fast feedback loops, an increase in change volume leads to instability. Teams working in loosely coupled architectures with fast feedback loops see gains, while those constrained by tightly coupled systems and slow processes see little or no benefit.
Key findings from the 2025 report
Beyond this central theme, this year’s research highlighted the following about modern software development:
AI adoption is near-universal: 90% of survey respondents report using AI at work. More than 80% believe it has increased their productivity. However, skepticism remains as 30% report little or no trust in the code generated by AI, a slightly lower percentage than last year but a key trend to note.
User-centricity is a prerequisite for AI success: AI becomes most useful when it's pointed at a clear problem, and a user-centric focus provides that essential direction. Our data shows this focus amplifies AI’s positive influence on team performance.
Platform engineering is the foundation: Our data shows that 90% of organizations have adopted at least one platform, and there is a direct correlation between a high-quality internal platform and an organization’s ability to unlock the value of AI, making it an essential foundation for success.
Simple software delivery metrics alone aren’t sufficient. They tell you what is happening but not why it’s happening. To connect performance data to experience, we conducted a cluster analysis that reveals seven common team profiles or archetypes, each with a unique interplay of performance, stability, and well-being. This model provides leaders with a way to diagnose team health and apply the right interventions.
The ‘Foundational challenges’ group is trapped in survival mode and faces significant gaps in its processes and environment, leading to low performance, high system instability, and high levels of burnout and friction. The ‘Harmonious high achievers’, by contrast, excel across multiple areas, showing positive metrics for team well-being, product outcomes, and software delivery.
Read more details of each archetype in the "Understanding your software delivery performance: A look at seven team profiles" chapter of the report.
This year, we went beyond identifying AI’s impact to investigating the conditions in which AI-assisted technology professionals realize the best outcomes. The value of AI is unlocked not by the tools themselves, but by the surrounding technical practices and cultural environment.
Our research identified seven capabilities that are shown to magnify the positive impact of AI in organizations.
One of the key insights derived from the research this year is that the value of AI will be unlocked by reimagining the system of work it inhabits. Technology leaders should treat AI adoption as an organizational transformation.
Here’s where we suggest you begin:
Clarify and socialize your AI policies
Connect AI to your internal context
Prioritize foundational practices
Fortify your safety nets
Invest in your internal platform
Focus on your end-users
The DORA research program is committed to serving as a compass to teams and organizations as we navigate the important and transformative period with AI. We hope the new team profiles and the DORA AI capabilities model provide a clear roadmap for you to move beyond simply adopting AI to unlocking its value by investing in teams and people. We look forward to learning how you put these insights into practice. To learn more:
What guides your approach to software development? In our roles at Google, we’re constantly working to build better software, faster. Within Google, our Developer Platform team and Google Cloud have a strategic partnership and a shared strategy: together, we take our internal capabilities and engineering tools and package them up for Google Cloud customers.
At the heart of this is understanding the many ways that software teams, big and small, need to balance efficiency, quality, and cost, all while delivering value. In our recent talk at PlatformCon 2025, we shared key parts of our platform strategy, which we call “shift down.”
Shift down is an approach that advocates for embedding decisions and responsibilities into underlying internal developer platforms (IDPs), thereby reducing the operational burden on developers. This contrasts with the DevOps trend of "shift left," which pushes more effort earlier into the development cycle, a method that is proving difficult at scale due to the sheer volume and rate of change in requirements. Our shift down strategy helps us maximize value with existing resources so businesses can achieve high innovation velocity with acceptable quality, acceptable risk, and sustainable costs across a diverse range of business models. In the talk, we share learnings that have been really helpful to us in our software and platform engineering journey:
6. Divide up the problem space by identifying different platform and ecosystem types.
Because the developer experience and platform infrastructure change with scale and degree of shifting down, it’s not enough to just know where the ecosystem effectiveness zone is — you have to identify the ecosystem by type. We differentiate ecosystem types by the degree of oversight and assurance for quality attributes. As an ecosystem becomes more vertically integrated, such as Google's highly optimized "Assured" (Type 4) ecosystem, the platform itself assumes increasing responsibility for vital quality attributes, allowing specialists like site reliability engineers (SRE) and security teams to have full ownership in taking action through large-scale observability and embedded capabilities. Conversely, in less uniform "YOLO," "AdHoc," or "Guided" (Type 0-2) ecosystems, developers have more responsibility for assuring these attributes, while central specialist teams have less direct control and enforcement mechanisms are less pervasive. It’s really important to note here that this is not a maturity model — the best ecosystem and platform type is the one that best fits your business need (see point #1 above!).
The most important takeaway is to make active choices. Tailor platform engineering for each business unit and application to achieve the best outcomes. Place critical emphasis on identifying and solving stable sub-problems in reliable, reusable ways across various business problems. This approach directly underpins our "shift down" strategy, moving toward composable platforms that embed decisions and responsibilities for software quality directly into the underlying platform infrastructure, thereby improving our ability to maximize business value with the right resources, at the right quality level, and with sustainable costs.
Watch our full discussion for more insights on effective platform engineering.
Application owners are looking for three things when they think about optimizing cloud costs:
What are the most expensive resources?
Which resources are costing me more this week or month?
Which resources are poorly utilized?
To help you answer these questions quickly and easily, we announced Cloud Hub Optimization and Cost Explorer, in private preview, at Google Cloud Next 2025. And today, we are excited to announce that both Cloud Hub Optimization and Cost Explorer are now in public preview.
As an app owner, your primary objective is keeping your application healthy at all times. Yet monitoring all the individual components of your application, which may straddle dozens of Projects, can be quite overwhelming. App Hub applications allow you to reorganize your cloud resources around your application, putting the information and controls you need at your fingertips.
In addition to supporting Google Cloud Projects, Cloud Hub Optimization and Cost Explorer leverage App Hub applications to show you the cost-efficiency of your application’s workloads and services instantly. This is great for instance when you are trying to pinpoint deployments running on GKE clusters that might be wasting valuable resources, such as GPUs.
When you bring up Cloud Hub Optimization, you can immediately see the resources that are costing you the most, along with the percentage change in their cost. With this highly granular cost information, you can now attribute your costs to specific resources and resource owners to reason about any changes in costs.
We have additionally integrated granular cost data from Cloud Billing and resource utilization data from Cloud Monitoring to give you a comprehensive picture of your cost efficiency. This includes average vCPU utilization for your Project, which helps you find the most promising optimization candidates across hundreds of Google Cloud Projects.
The Cost Explorer dashboard also shows you your costs logically organized at the product level, for even more cost explainability. Instead of seeing a lump sum cost for Compute Engine, you can now see your exact spend on individual products including Google Kubernetes Engine (GKE) clusters, Persistent Disks, Cloud Load Balancing, and more.
Customers who have tried these new tools love the information that is surfaced as well as the simplicity of the interfaces.
“My team has to keep an eye on cloud costs across tens of business units and hundreds of developers. The Cloud Hub Optimization and Cost Explorer dashboards are a force multiplier for my team as they tell us where to look for cost savings and potential optimization opportunities.” - Frank Dice, Principal Cloud Architect, Major League Baseball
Customers especially appreciate the breadth of product coverage available out of the box without any additional setup, and the fact that there is no additional charge to using these features.
As your organization "shifts left" on cloud cost management, we are working to help application owners and developers understand and optimize their cloud costs. You can try Cloud Hub Optimization and Cost Explorer here.
You can also see a live demo of how Cloud Hub Optimization and Cost Explorer can be used to identify underutilized GKE clusters within seconds in the Google Cloud Next 2025 talk Maximize Your Cloud ROI.
Major League Baseball trademarks and copyrights are used with permission of Major League Baseball. Visit MLB.com.
Are you ready to unlock the power of Google Cloud and want guidance on how to set up your environment effectively? Whether you're a cloud novice or part of an experienced team looking to migrate critical workloads, getting your foundational infrastructure right is the key to success. That's where Google Cloud Setup comes in — your guided pathway to a secure cloud foundation and quick start on Google Cloud.
Google Cloud Setup helps you quickly implement Google Cloud's recommended best practices. Our goal is to provide a fast and easy path to deploying your workloads without unnecessary configuration effort. Think of it as your expert guide, walking you through the essential first steps so you can focus on what truly matters: rapidly deploying your innovative applications and services. To help you get started without financial barriers, all components and service integrations enabled during the setup process are free or include some level of no-cost access.
We understand that every organization and project has unique requirements. That's why Cloud Setup offers three distinct guided flows to choose from:
Proof-of-concept: Designed for users who want to set up a lightweight environment to explore Google Cloud and run initial tests or sandbox workloads. This flow focuses on the minimum configuration to get you started quickly.
Production: This flow is recommended for supporting production-ready workloads with security and scalability in mind. It aligns with Google Cloud’s best practices and is tailored for administrators setting up basic foundational infrastructure for production workloads.
Enhanced security: Designed for organizations, regions or workloads with advanced security and compliance requirements, this flow defaults to more advanced security controls and is designed to help you meet rigorous requirements. Even this advanced foundation sets you up with a perpetual free tier up to certain usage limits.
Cloud Setup guides you through a series of onboarding steps, presenting defaults backed by Google Cloud best practices. Throughout the process, you'll also encounter key features designed to help protect your organization and prepare it for growth, including:
Cloud KMS AutoKey: Automates the provisioning and assignment of customer-managed encryption keys (CMEK).
Security Command Center: Provides security posture management for Google Cloud deployments including automatic project scanning for security issues such as open ports and misconfigured access controls.
Centralized Logging and Monitoring: Enables you to easily set up infrastructure to monitor your system's health and performance from a central location — critical for audit logging compliance and visualizing metrics across projects.
Shared VPC Networks: Allows you to establish a centralized network across multiple projects, enabling secure and efficient communication between your Google Cloud resources.
Hybrid Connectivity: Facilitates connecting your Google Cloud environment to your on-premises infrastructure or other cloud providers. This is often a critical step for workload migrations.
Support plan: Enables you to quickly resolve any issues with help from experts at Google Cloud.
At the end of the guided flow, you can deploy your configuration directly via the Google Cloud console or download a Terraform configuration file for later deployment using other Infrastructure as Code (IaC) methods.
Organizations that use Cloud Setup enjoy:
Faster application deployment: By simplifying the initial setup, you can get your applications up and running more quickly, accelerating your cloud journey.
Reduced setup effort: Our streamlined flow significantly reduces the number of manual steps, allowing you to establish a basic foundation with less effort.
Greater access to Google Cloud's full potential: By establishing a solid foundation quickly, you can more easily explore and leverage a wider range of Google Cloud services to meet your evolving needs and unlock greater value.
Ready to start your Google Cloud journey? Visit Google Cloud Setup today for a streamlined path to a secure cloud foundation. Let us guide you through the initial steps so you can focus on innovation and growth.
To learn more, visit:
Cloud Setup overview (requires login)
As developers and operators, you know that having access to the right information in the proper context is crucial for effective troubleshooting. This is why organizations invest heavily upfront in curating monitoring resources across different business units: so information is easy to find and contextualize when needed.
Today we are reducing the need for this upfront investment with an out-of-the-box Application Monitoring experience for your organization on Google Cloud within Cloud Observability.
Application Monitoring consists of a set of pre-curated dashboards with relevant metrics and logs mapped to a user-defined application in App Hub. It incorporates best practices pioneered by Google Site Reliability Engineers (SRE) to optimize manual troubleshooting and unlock AI-assisted troubleshooting.
Application Monitoring automatically labels and brings together key telemetry for your application into a centralized experience, making it easy to discover, filter and correlate trends. It also feeds application context into Gemini Cloud Assist Investigations, for AI-assisted troubleshooting.
No more spending hours configuring application dashboards.
From the moment you describe your application in App Hub, Application Monitoring starts to automatically build dashboards tailored to your environment. Each dashboard comprises relevant telemetry for your application and is searchable, filterable and ready for deep dives — no configuration required.
The dashboards open with an overview of charts covering the SRE four golden signals: traffic, latency, error rate, and saturation. This provides a high-level view of application performance, integrating automatically collected system metrics across various services and workloads such as load balancers, Cloud Run, GKE workloads, MIGs, and databases. From this overview, you can then drill down into services or workloads with performance issues or active alerts to access detailed metrics and logs.
For example in the image below, a user defined an App Hub application called Cymbal BnB app, with multiple services and workloads. The flow below shows the automatically generated experience with golden signals, alerts and relevant logs.
Figure 1 - A user’s flow from an App Hub defined application (i.e. Cymbal BnB) to the automatic prebuilt Application Monitoring experience in Cloud Observability
See application labels propagated seamlessly across Google Cloud
Once Application Monitoring is enabled, your application labels are propagated across Google Cloud, so you can see and use them to filter and focus on the most essential signals across the logs, metrics and trace explorers.
Figure 2 - Logs Explorer showing application automatically tagged with application labels
Figure 3 - Metrics Explorer showing application labels automatically associated with metrics
Figure 4 - Trace Explorer showing App Hub label integration
Troubleshoot issues faster with AI-powered Investigations.
Gemini Cloud Assist’s investigation feature makes it easier to troubleshoot issues because application boundaries and relationships have been propagated into the AI model, grounding it in context about your environment.
Figure 5 - Seamless entry point into Gemini Cloud Assist powered Investigations from application logs
Note - Gemini Cloud Assist Investigations is currently in private preview
The new Application Monitoring experience provides a low-effort unified view of application and infrastructure performance for your troubleshooting needs.
Take advantage of the new Google Cloud Application Monitoring experience by:
Visiting your Cloud console
Adding Services and Workloads to your Application
Navigating to Application Monitoring in Cloud Observability to see your automatically built experience
Enabling your Gemini Cloud Assist SKU and signing up for the trusted tester program to get access to the Investigations experience
Application Monitoring docs
App Hub docs
At Google Cloud, we are committed to making it as seamless as possible for you to build and deploy the next generation of AI and agentic applications. Today, we’re thrilled to announce that we are collaborating with Docker to drastically simplify your deployment workflows, enabling you to bring your sophisticated AI applications from local development to Cloud Run with ease.
Previously, bridging the gap between your development environment and managed platforms like Cloud Run required you to manually translate and configure your infrastructure. Agentic applications that use MCP servers and self-hosted models added additional complexity.
The open-source Compose Specification is one of the most popular ways for developers to iterate on complex applications in their local environment, and is the basis of Docker Compose. And now, gcloud run compose up brings the simplicity of Docker Compose to Cloud Run, automating this entire process. Now in private preview, you can deploy your existing compose.yaml file to Cloud Run with a single command, including building containers from source and leveraging Cloud Run’s volume mounts for data persistence.
Supporting the Compose Specification with Cloud Run makes for easy transitions across your local and cloud deployments, where you can keep the same configuration format, ensuring consistency and accelerating your dev cycle.
“We’ve recently evolved Docker Compose to support agentic applications, and we’re excited to see that innovation extend to Google Cloud Run with support for GPU-backed execution. Using Docker and Cloud Run, developers can now iterate locally and deploy intelligent agents to production at scale with a single command. It’s a major step forward in making AI-native development accessible and composable. We’re looking forward to continuing our close collaboration with Google Cloud to simplify how developers build and run the next generation of intelligent applications.” - Tushar Jain, EVP Engineering and Product, Docker
Support for the Compose Specification isn't the only AI-friendly innovation you'll find in Cloud Run. We recently announced general availability of Cloud Run GPUs, removing a significant barrier to entry for developers who want access to GPUs for AI workloads. With pay-per-second billing, scale to zero, and rapid scaling (approximately 19 seconds to time-to-first-token for a gemma3:4b model), Cloud Run is a great hosting solution for deploying and serving LLMs.
This also makes Cloud Run a strong solution for Docker's recently announced OSS MCP Gateway and Model Runner, making it easy for developers to take AI applications from local development to production in the cloud seamlessly. By supporting Docker's recent addition of 'models' to the open Compose Spec, you can deploy these complex solutions to the cloud with a single command.
Let's review the compose file for the demo. It consists of a multi-container application (defined in services) built from sources and leveraging a storage volume (defined in volumes). It also uses the new models attribute to define AI models and a Cloud Run extension defining the runtime image to use:
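A sketch of such a compose.yaml might look like the following; the service names, images, and the Cloud Run extension key shown are illustrative assumptions rather than the exact demo configuration.

```yaml
# Illustrative compose.yaml, deployable with `gcloud run compose up` (private preview).
# Names, images, and the Cloud Run extension key are assumptions for this sketch.
name: ai-demo
services:
  webapp:
    build: ./webapp            # container built from source
    ports:
      - "8080:8080"
    depends_on:
      - agent
  agent:
    build: ./agent
    volumes:
      - app-data:/data         # mapped to a Cloud Run volume mount for persistence
    models:
      - gemma                  # this service consumes the model defined below

models:
  gemma:
    model: ai/gemma3:4b        # model reference in the style of Docker Model Runner
    x-google-cloudrun:         # hypothetical Cloud Run extension naming the runtime image
      inference-endpoint: docker/model-runner:latest-cuda

volumes:
  app-data:
```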
We’re committed to offering developers maximum flexibility and choice by adopting open standards and supporting various agent frameworks. This collaboration on Cloud Run and Docker is another example of how we aim to simplify the process for developers to build and deploy intelligent applications.
Compose Specification support is available for our trusted users — sign up here for the private preview.
Editor's note: This is part one of the story. After you’re finished reading, head over to part two.
In 2017, John Lewis, a major UK retailer with a £2.5bn annual online turnover, was hampered by its monolithic e-commerce platform. This outdated approach led to significant cross-team dependencies, cumbersome and infrequent releases (monthly at best), and excessive manual testing, all further hindered by complex on-premises infrastructure. What was needed were some bold decisions to drive a quick and significant transformation.
The John Lewis engineers knew there was a better way. Working with Google Cloud, they modernized their e-commerce operations with Google Kubernetes Engine. They began with the frontend and saw results fast: the frontend moved onto Google Cloud in mere months, releases to the frontend browser journey became weekly, and the business gladly backed expansion into other areas.
At the same time, the team had a broader strategy in mind: to take a platform engineering approach, creating many product teams who built their own microservices to replace the functionality of the legacy commerce engine, as well as creating brand new experiences for customers.
And so The John Lewis Digital Platform was born. The vision was to empower development teams and arm them with the tools and processes they needed to go to market fast, with full ownership of their own business services. The team’s motto? "You Build It. You Run It. You Own It." This decentralization of development and operational responsibilities would also enable the team to scale.
This article features insights from Principal Platform Engineer Alex Moss, who delves into their strategy, platform build, and key learnings of John Lewis’ journey to modernize and streamline its operations with platform engineering — so you can begin to think about how you might apply platform engineering to your own organization.
To make this happen, John Lewis needed to adopt a multi-tenant architecture, with one tenant for each business service. This allows each owning team to work independently without risk to others, and permits the Platform team to give each team a greater degree of freedom.
Knowing that the business's primary objective was to greatly increase the number of product teams helped inform our initial design thinking: we positioned ourselves to enable many independent teams even though we only had a handful of tenants.
This foundational design has served us very well and is largely unchanged now, seven years later. Central to the multi-tenant concept is what we chose to term a "Service" — a logical business application, usually composed of several microservices plus components for storing data.
We largely position our platform as a “bring your own container” experience, but encourage teams to make use of other Google Cloud services — particularly for handling state. Adopting services like Firestore and Pub/Sub reduces the complexity that our platform team has to work with, particularly for areas like resilience and disaster recovery. We also favor Kubernetes over compute products like Cloud Run because it strikes the right balance for us between giving development teams freedom and allowing our platform to drive certain behaviours, e.g., the right level of guardrails, without introducing too much friction.
On our platform, Product Teams (i.e., tenants) have a large amount of control over their own Namespaces and Projects. This allows them to prototype, build, and ultimately operate, their workloads without dependency on others — a crucial element of enabling scale.
Our early-adopter teams were extremely helpful in evolving the platform; they accepted the lack of features, were willing to develop their own solutions, and provided very rich feedback on whether we were building something that met their needs.
The first tenant to adopt the platform was rebuilding the johnlewis.com search capability, replacing a commercial off-the-shelf solution. This team was staffed with experienced engineers familiar with modern software development and the advantages of a microservice-based architecture. They quickly identified the need for supporting services for their application to store data and asynchronously communicate between their components. They worked with the Platform Team to identify options, and were on board with our desire to lean into Google Cloud native services to avoid running our own databases or messaging. This led to us adopting Cloud Datastore and Pub/Sub for our first features that extended beyond Google Kubernetes Engine.
A risk with a platform that allows very high team autonomy is that it can turn into a bit of a wild-west of technology choices and implementation patterns. To handle this, but to do so in a way that remained developer-centric, we adopted the concept of a paved road, analogous to a “golden path.”
We found that the paved road approach made it easier to:
build useful platform features to help developers do things rapidly and safely
share approaches and techniques, and let engineers move between teams
demonstrate to the wider organisation that teams are following required practices (which we do by building assurance capabilities, not by gating release)
The concept of the paved road permeates most of what the platform builds, and has inspired other areas of the John Lewis Partnership beyond the John Lewis Digital space.
Our paved road is powered by two key features to enable simplification for teams:
The Paved Road Pipeline. This operates on the whole Service and drives capabilities such as Google Cloud resource provisioning and observability tools.
The Microservice CRD. As the name implies, this is an abstraction at the microservice level. The majority of the benefit here is in making it easier for teams to work with Kubernetes.
Whilst both features were created with the developer experience in mind, we discovered that they also hold a number of benefits for the platform team too.
The Paved Road Pipeline is driven by a configuration file — in yaml (of course!) — which we call the Service Definition. This allows the team that owns the tenancy to describe, through easy-to-reason-about configuration, what they would like the platform to provide for them. Supporting documentation and examples help them understand what can be achieved. Pushes to this file then drive a CI/CD pipeline for a number of platform-owned jobs, which we refer to as provisioners. These provisioners are microservice-like themselves: they are independently releasable and each generally focuses on doing one task well, such as provisioning Google Cloud resources or setting up observability tooling.
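As a minimal sketch, a Service Definition might look something like this; the field names here are hypothetical and chosen for illustration, not the actual John Lewis schema.

```yaml
# Hypothetical Service Definition (field names are illustrative only)
service:
  name: checkout
  team: payments
  resources:
    pubsub:
      topics:
        - order-events          # picked up by a Pub/Sub provisioner
    storage:
      buckets:
        - name: checkout-archive
          location: europe-west2
  observability:
    dashboards: default         # picked up by an observability provisioner
    alerting:
      channel: "#checkout-support"
```

A push to this file would then trigger the relevant provisioners, each reconciling its slice of the configuration independently.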
Our product teams are therefore freed from the need to familiarize themselves deeply with how Google Cloud resource provisioning works, or Infrastructure-as-Code (IaC) tooling for that matter. Our preferred technologies and good practices can be curated by our experts, and developers can focus on building differentiating software for the business, while remaining fully in control of what is provisioned and when.
Earlier, we mentioned that this approach has the added benefit of being something that the platform team can rely upon to build their own features. The configuration updated by teams for their Service can be combined with metadata about their team and surfaced via an API and events published to Pub/Sub. This can then drive updates to other features like incident response and security tooling, pre-provision documentation repositories, and more. This is an example of how something that was originally intended as a means to help teams avoid writing their own IaC can also be used to make it easier for us to build platform features, further improving the value-add — without the developer even needing to be aware of it!
We think this approach is also more scalable than providing pre-built Terraform modules for teams to use. That approach still burdens teams with being familiar with Terraform, and versioning and dependency complexities can create maintenance headaches for platform engineers. Instead, we provide an easy-to-reason-about API and deliberately burden the platform team, ensuring that the Service provides all the functionality our tenants require. This abstraction also means we can make significant refactoring choices if we need to.
Adopting this approach also results in broad consistency in technologies across our platform. For example, why would a team implement Kafka when the platform makes creating resources in Pub/Sub so easy? When you consider that this spans not just the runtime components that assemble into a working business service, but also all the ancillary needs for operating that software — resilience engineering, monitoring and alerting, incident response, security tooling, service management, and so on — the amplifying effect on our engineers’ productivity is massive. All of these areas have full paved road capabilities on the John Lewis Digital Platform, reducing the cognitive load on teams: they no longer have to recognize the need for each of these capabilities, identify appropriate options, and then implement the technology or processes themselves.
That being said, one of the reasons we particularly like the paved road concept is that it doesn't preclude teams choosing to "go off-road." A paved road shouldn’t be mandatory, but it should be compelling to use, so that engineers aren’t tempted to do something else. Preventing the use of other approaches risks stifling innovation, and breeds the temptation to think the features you've built are "good enough." The paved road challenges our Platform Engineers to keep improving their product so that it continues to meet our Developers' changing needs. Likewise, development teams tempted to go off-road are put off by the increasing burden of replicating powerful platform features.
The needs of our Engineers don’t remain fixed, and Google Cloud is of course releasing new capabilities all the time, so we have extended the analogy to include a “dusty path” representing brand new platform features that aren’t as feature-rich as we’d like (perhaps they lack self-service provisioning or out-of-the-box observability). Teams are trusted to try different options and make use of Google Cloud products that we haven't yet paved. The Paved Road Pipeline allows for this experimentation, which we term "snowflaking". We then have an unofficial "rule of three": if we notice at least three teams requesting the same feature, we move to make it self-service.
At the other end of the scale, teams can go completely solo — which we refer to as “crazy paving” — and might be needed to support wild experimentation or to accommodate a workload which cannot comply with the platform’s expectations for safe operation. Solutions in this space are generally not long-lived.
In this article, we've covered how John Lewis revolutionized its e-commerce operations by adopting a multi-tenant, "paved road" approach to platform engineering. We explored how this strategy empowered development teams and streamlined their ability to provision Google Cloud resources and deploy operational and security features.
In part 2 of this series, we'll dive deeper into how John Lewis further simplified the developer experience by introducing the Microservice CRD. You'll discover how this custom Kubernetes abstraction significantly reduced the complexity of working with Kubernetes at the component level, leading to faster development cycles and enhanced operational efficiency.
To learn more about shifting down with platform engineering on Google Cloud, you can find more information here. To learn more about how Google Kubernetes Engine (GKE) empowers developers to effortlessly deploy, scale, and manage containerized applications with its fully managed, robust, and intelligent Kubernetes service, you can find more information here.
In our previous article we introduced the John Lewis Digital Platform and its approach to simplifying the developer experience through platform engineering and so-called paved road features. We focused on the ways that platform engineering enables teams to create resources in Google Cloud and deploy the platform's operational and security features within dedicated tenant environments. In this article, we will build upon that concept for the next level of detail — how the platform simplifies build and run at a component (typically for us, a microservice) level too.
Within just over a year, the John Lewis Digital Platform had fully evolved into a product. We had approximately 25 teams using our platform, with several key parts of the johnlewis.com retail website running in production. We had built a self-service capability to help teams provision resources in Google Cloud, and firmly established that the foundation of our platform was on Google Kubernetes Engine (GKE). But we were hearing signals from some of the recent teams that there was a learning curve to Kubernetes. This was expected — we were driving a cultural change for teams to build and run their own services, and so we anticipated that our application developers would need some Kubernetes skills to support their own software. But our vision was that we wanted to make developers' lives easier — and their feedback was clear. In some cases, we observed that teams weren't following "good practice" (despite the existence of good documentation!) such as not using anti-affinity rules or PodDisruptionBudgets to help their workloads tolerate failure.
All the way back in 2017, Kelsey Hightower wrote: “Kubernetes is a platform for building platforms. It's a better place to start, not the endgame.”
Kelsey's quote inspired us to act. We had the idea to write our own custom controller to simplify the point of interaction for a developer with Kubernetes — a John Lewis-specific abstraction that aligned to our preferred approaches. And thus the JL Microservice was born.
To do this, we declared a Kubernetes CustomResourceDefinition with a simplified specification containing just the fields we felt our developers needed to set. For example, as we expect our tenants to build and operate their applications themselves, attributes such as the number of replicas and the amount of resources needed are best left up to the developers themselves. But do they really need to be able to customize the rules defining how to distribute pods across nodes? How often do they need to change the Service pointing towards their Deployment? When we looked closer, we realized just how much duplication there was — our analysis at the time suggested that only around 33% of the lines in the yaml files developers were producing were relevant to their application. This was a target-rich scenario for simplification.
To help us build this feature, we selected Kubebuilder, using it to declare our CustomResourceDefinition and then build the Controller (what we call Microservice Manager). This turned out to be a beneficial decision: initial prototyping was quick, and the feature launched a few months later to a very positive reception. Our team had to skill up in the Go programming language, but this trade-off felt worthwhile given the advantages Kubebuilder brought to the table, and it has continued to be helpful for other software engineering since.
The initial implementation replaced an engineer's need to understand and fully configure a Deployment and Service, instead applying a much briefer yaml file containing only the fields they need to change. As well as direct translation of identical fields (image and replicas are equivalent to what you would see in a Deployment, for example), it also allowed us to simplify the choices made by the Kubernetes APIs, because in John Lewis we didn't need some of that functionality. For example, writablePaths: [] is an easy concept for our engineers to understand, and behind the scenes, our controller is converting those into the more complex combination of Volumes and VolumeMounts. Likewise, visibleToOtherServices: true is an example of us simplifying the interaction with Kubernetes NetworkPolicy — rather than requiring teams to read our documentation to understand the necessary incantations to label their resources correctly, the controller understands those conventions and handles it for them.
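Putting those examples together, a Microservice resource might look like the sketch below; the API group and overall layout are assumptions, while image, replicas, writablePaths, and visibleToOtherServices are the fields described above.

```yaml
# Sketch of a Microservice resource; the API group/version is hypothetical.
apiVersion: platform.johnlewis.example/v1
kind: Microservice
metadata:
  name: order-service
spec:
  image: europe-docker.pkg.dev/example/images/order-service:1.4.2  # as in a Deployment
  replicas: 3                                                      # as in a Deployment
  writablePaths:
    - /tmp                       # expanded by the controller into Volumes + VolumeMounts
  visibleToOtherServices: true   # expanded into the platform's NetworkPolicy labelling conventions
```

Behind the scenes, the controller expands this into the full Deployment, Service, and associated NetworkPolicy configuration.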
With the core concept of the Microservice resource established, we were able to improve the value-add by augmenting it with further features. We rapidly extended it to define our Prometheus scrape configuration, then added more complex features, such as allowing teams to declare that they use Google Cloud Endpoints and having the controller inject the necessary sidecar container into their Deployment and wire it up to the Service. As we added more features, existing tenants converted to this specification, and it now makes up the majority of workloads declared on the platform.
Our motivation for building Microservice Manager was to make developers' lives easier. But we discovered a benefit we had not initially anticipated: the platform team gains from it too. It enables us to make changes behind the scenes without needing to involve our tenants — reducing toil for them and making it easier for us to improve our product. It is generally difficult to change the agreement that you’ve established between your tenants and the platform, and creating an abstraction like this has allowed us to bring more under our control, for everyone’s benefit.
An example of this was something we observed through our live load testing of johnlewis.com, when certain workloads burst up to several hundred Pods — numbers that exceeded the typical number of Nodes we had running in the cluster. This led to new Node creation — and therefore slower Pod autoscaling and poor bin-packing. Experienced Kubernetes operators can probably guess what was happening here: our default antiAffinity rules were set to optimize for resilience, such that no more than one replica was allowed on any given Node. The good news was that because the workloads were under the control of our Microservice Manager, rather than having to instruct our tenants to copy the relevant yaml into their Deployments, it was a straightforward change for us to replace the antiAffinity rules with the more modern topologySpreadConstraints, allowing us to customize the number of replicas that could be stacked on a Node for workloads exceeding a certain replica count. And this happened with no intervention from our tenants.
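The kind of change involved is sketched below as two rendered Deployment fragments; the values are illustrative rather than the actual John Lewis configuration, but they show a strict anti-affinity rule being swapped for a topology spread constraint that tolerates stacking once replica counts grow.

```yaml
# Before: at most one replica per Node
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: order-service
        topologyKey: kubernetes.io/hostname

# After: prefer spreading across Nodes, but allow replicas to stack when counts are high
topologySpreadConstraints:
  - maxSkew: 2                          # illustrative value
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: order-service
```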
A more complex example of this was when we rolled out our service mesh. In keeping with our general desire to let Google Cloud handle the complexity of running control plane components, we opted to use Google's Cloud Service Mesh product. But even then, rolling out a mesh to a business-critical platform in constant use is not without its risks. Microservice Manager allowed us to control the rate at which we enrolled workloads into the mesh through the use of a feature flag on the Microservice resource. We could start the rollout with platform-owned workloads first to test our approach, then make tenants aware of the flag so that early adopters could validate it and take advantage of some of Cloud Service Mesh’s features. To scale the rollout, we could then manipulate the flag to release in waves based on business importance, providing an opt-out mechanism if needed. This again greatly simplified the implementation — product teams had very little to do, and we avoided having to chase approximately 40 teams running hundreds of Microservices to make the appropriate changes in their configuration. This feature flagging technique is something we make extensive use of to support our own experimentation.
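The flag itself can be as simple as one more field on the Microservice spec; the field name below is an assumption for illustration.

```yaml
# Hypothetical mesh-enrollment flag on the Microservice resource
spec:
  serviceMesh:
    enabled: true    # flipped by the platform in waves; tenants could opt out if needed
```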
Building the Microservice Manager has led to further thinking in Kubernetes-native ways: the Custom Resource + Controller concept is a powerful technique, and we have built other features with it since. One example is a controller that converts the need for external connectivity into Istio resources that route via our egress gateway. Istio in particular is a very powerful platform capability that comes with a high cognitive load for its users, and so is a perfect example of where platform engineering can manage that complexity for teams whilst still allowing them to take advantage of it. We have a number of ideas in this area now that our confidence in the technology has grown.
In summary, the John Lewis Partnership leveraged Google Cloud and platform engineering to modernize its e-commerce operations and developer experience. By implementing a "paved road" approach with a multi-tenant architecture, the team empowered development teams, accelerated deployment cycles, and simplified Kubernetes interactions using a custom Microservice CRD. This strategy allowed them to scale their engineering organization effectively and enhance the developer experience, reducing complexity while maintaining operational efficiency.
To learn more about platform engineering on Google Cloud, check out some of our other articles: 5 myths about platform engineering: what it is and what it isn’t, Another five myths about platform engineering, and Light the way ahead: Platform Engineering, Golden Paths, and the power of self-service.
In the event of a cloud incident, everyone wants swift and clear communication from the cloud provider, and to be able to leverage that information effectively. Personalized Service Health in the Google Cloud console addresses this need with fast, transparent, relevant, and actionable communications about Google Cloud service disruptions, customized to your specific footprint. This helps you to quickly identify the source of the problem, helping you answer the question, “Is it Google or is it me?” You can then integrate this information into your incident response workflows to resolve the incident more efficiently.
We're excited to announce that you can prompt Gemini Cloud Assist to pull real-time information about active incidents, powered by Personalized Service Health, providing you with streamlined incident management, including discovery, impact assessment, and recovery. By combining Gemini's guidance with Personalized Service Health insights and up-to-the-minute information, you can assess the scope of impact and begin troubleshooting – all within a single, AI-driven Gemini Cloud Assist chat. Further, you can initiate this sort of incident discovery from anywhere within the console, offering immediate access to relevant incidents without interrupting your workflow. You can also check for active incidents impacting your projects, gathering details on their scope and the latest updates directly sourced from Personalized Service Health.
We designed Gemini Cloud Assist with a user-friendly layout and a well-organized information structure. Crucial details, including dynamic timelines, latest updates, symptoms, and workarounds sourced directly from Personalized Service Health, are now presented in the console, enabling conversational follow-ups. Gemini Cloud Assist highlights critical insights from Personalized Service Health, helping you refine your investigations and understand the impact of incidents.
To illustrate the power of this integration, the following demo showcases a typical incident response workflow leveraging the combined capabilities of Gemini and Personalized Service Health.
Incident discovery and triage
In the crucial first moments of an incident, Gemini Cloud Assist helps you answer "Is it Google or is it me?" Gemini Cloud Assist accesses data directly from Personalized Service Health and reports which of your projects are affected by a Google Cloud incident, and in which locations, speeding up the triage process.
To illustrate how you can start this process, try asking Gemini Cloud Assist questions like:
Is my project impacted by a Google Cloud incident?
Are there any incidents impacting Google Cloud at the moment?
Investigating and evaluating impact
Once you’ve identified a relevant Google Cloud incident, you can use Gemini Cloud Assist to delve deeper into the specifics and evaluate its impact on your environment. Furthermore, as you ask follow-up questions, Gemini Cloud Assist can retrieve updates from Personalized Service Health about the incident as it evolves. You can then investigate further by asking Gemini to pinpoint exactly which of your apps or projects, and in which locations, might be affected by the reported incident.
Here are examples of prompts you might pose to Gemini Cloud Assist:
Tell me more about the ongoing Incident ID [X] (Replace [X] with the Incident ID)
Is [X] impacted? (Replace [X] with your specific location or Google Cloud product)
What is the latest update on Incident ID [X]?
Show me the details of Incident ID [X].
Can you guide me through some troubleshooting steps for [impacted Google Cloud product]?
Mitigation and recovery
Finally, Gemini Cloud Assist can also act as an intelligent assistant during the recovery phase, providing you with actionable guidance. You can gain access to relevant logs and monitoring data for more efficient resolution. Additionally, Gemini Cloud Assist can help surface potential workarounds from Personalized Service Health and direct you to the tools and information you need to restore your projects or applications. Here are some sample prompts:
What are the workarounds for the incident ID [X]? (Replace [X] with the Incident ID)
Can you suggest a temporary solution to keep my application running?
How can I find logs for this impacted project?
From these prompts, Gemini retrieves relevant information from Personalized Service Health to provide you with personalized insights into your Google Cloud environment's health — both for ongoing events and incidents from up to one year in the past. This helps when investigating an incident to narrow down its impact, as well as assisting in recovery.
Looking ahead, we are excited to provide even deeper insights and more comprehensive incident management with Gemini Cloud Assist and Personalized Service Health, extending these AI-driven capabilities beyond a single project view. Ready to get started?
Learn more about Personalized Service Health, or reach out to your account team to enable it.
Get started with Gemini Cloud Assist. Refine your prompts to ask about your specific regions or Google Cloud products, and experiment to discover how it can help you proactively manage incidents.
In 2023, the Waze platform engineering team transitioned to Infrastructure as Code (IaC) using Google Cloud's Config Connector (KCC) — and we haven’t looked back since. We embraced Config Connector, an open-source Kubernetes add-on, to manage Google Cloud resources through Kubernetes. To streamline management, we also leverage Config Controller, a hosted version of Config Connector on Google Kubernetes Engine (GKE), incorporating Policy Controller and Config Sync. This shift has significantly improved our infrastructure management and is shaping our future infrastructure.
Previously, Waze relied on Terraform to manage resources, particularly during our dual-cloud, VM-based phase. However, maintaining state and ensuring reconciliation proved challenging, leading to inconsistent configurations and increased management overhead.
In 2023, we adopted Config Connector, transforming our Google Cloud infrastructure into Kubernetes Resource Modules (KRMs) within a GKE cluster. This approach addresses the reconciliation issues encountered with Terraform. Config Sync, paired with Config Connector, automates KRM synchronization from source repositories to our live GKE cluster. This managed solution eliminates the need for us to build and maintain custom reconciliation systems.
The shift helped us meet the needs of three key roles within Waze’s infrastructure team:
Infrastructure consumers: Application developers who want to easily deploy infrastructure without worrying about the maintenance and complexity of underlying resources.
Infrastructure owners: Experts in specific resource types (e.g., Spanner, Google Cloud Storage, Load Balancers, etc.), who want to define and standardize best practices in how resources are created across Waze on Google Cloud.
Platform engineers: Engineers who build the system that enables infrastructure owners to codify and define best practices, while also providing a seamless API for infrastructure consumers.
It may seem circular to define all of our Google Cloud infrastructure as KRMs within a Google Cloud service. However, KRM has proven to be a better representation for our infrastructure than existing IaC tooling.
Terraform's reconciliation issues – state drift, version management, out-of-band changes – are a significant pain. Config Connector, through Config Sync, offers out-of-the-box reconciliation, a managed solution we prefer. Both KRM and Terraform offer templating, but KCC's managed nature aligns with our shift to Google Cloud-native solutions and reduces our maintenance burden.
Infrastructure complexity requires generalization regardless of the tool. We can see this when we look at the Spanner requirements at Waze:
Consistent backups for all Spanner databases
Each Spanner database utilizes a dedicated Cloud Storage bucket and Service Account to automate the execution of DDL jobs.
All IAM policies for Spanner instances, databases, and Cloud Storage buckets are defined in code to ensure consistent and auditable access control.
To define these resources, we evaluated various templating and rendering tools and selected Helm, a robust CNCF package manager for Kubernetes. Its strong open-source community, rich templating capabilities, and native rendering features made it a natural fit. We now refer to our bundled infrastructure configurations as 'Charts.' While KRO, which achieves a similar purpose, has since emerged, our selection process predated its availability.
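As an illustration of the idea (not Waze's actual Chart), a template in a hypothetical Spanner Chart could render a Config Connector resource like this:

```yaml
# templates/spanner-instance.yaml in a hypothetical "spanner" Chart
apiVersion: spanner.cnrm.cloud.google.com/v1beta1
kind: SpannerInstance
metadata:
  name: {{ .Values.name }}
  annotations:
    cnrm.cloud.google.com/project-id: {{ .Values.projectId }}   # illustrative project reference
spec:
  config: {{ .Values.region | default "regional-us-central1" }}
  displayName: {{ .Values.displayName | default .Values.name }}
  numNodes: {{ .Values.nodes | default 1 }}
```

Infrastructure consumers then supply only a handful of values, such as a name and node count, while the Chart encodes the owning team's standards.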
Let's open the hood and dive into how the system works and is driving value for Waze.
Waze infrastructure owners generically define Waze-flavored infrastructure in Helm Charts.
Infrastructure consumers use these Charts with simplified inputs to generate infrastructure (demo).
Infrastructure code is stored in repositories, enabling validation and presubmit checks.
Code is uploaded to Artifact Registry, where Config Sync and Config Connector align Google Cloud infrastructure with the code definitions (see the sketch below).
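As a sketch of what that sync might look like, with illustrative values rather than Waze's actual configuration, a Config Sync RootSync object can point the cluster at an OCI package stored in Artifact Registry:

```yaml
# Illustrative RootSync pulling rendered KRM from Artifact Registry
apiVersion: configsync.gke.io/v1beta1
kind: RootSync
metadata:
  name: root-sync
  namespace: config-management-system
spec:
  sourceFormat: unstructured
  sourceType: oci
  oci:
    image: us-docker.pkg.dev/example-project/krm-packages/data-domain:v1.4.0  # hypothetical package
    dir: .
    auth: gcpserviceaccount
    gcpServiceAccountEmail: config-sync@example-project.iam.gserviceaccount.com
```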
Together, these pieces make up a single "data domain": a collection of bounded services, databases, networks, and data. Many tech organizations today run several such domains, such as Prod, QA, Staging, and Development.
So why does all of this matter? Adopting this approach allowed us to move from Infrastructure as Code to Infrastructure as Software. By treating each Chart as a software component, our infrastructure management goes beyond simple code declaration. Now, versioned Charts and configurations enable us to leverage a rich ecosystem of software practices, including sophisticated release management, automated rollbacks, and granular change tracking.
Here's where we apply this in practice: our configuration inheritance model minimizes redundancy. Resource Charts inherit settings from Projects, which inherit from Bootstraps. All three are defined as Charts. Consequently, Bootstrap configurations apply to all Projects, and Project configurations apply to all Resources.
Every change to our infrastructure – from changes on existing infrastructure to rolling out new resource types – can be treated like a software rollout.
Now that all of our infrastructure is treated like software, the benefits show up system-wide.
In summary, Config Connector and Config Controller have enabled Waze to achieve true Infrastructure as Software, providing a robust and scalable platform for our infrastructure needs, along with many other benefits including:
Infrastructure consumers receive the latest best practices through versioned updates.
Infrastructure owners can iterate and improve infrastructure safely.
Platform engineers and security teams are confident our resources are auditable and compliant.
Config Connector leverages Google's managed services, reducing operational overhead.
Distributed tracing is a critical part of an observability stack, letting you troubleshoot latency and errors in your applications. Cloud Trace, part of Google Cloud Observability, is Google Cloud’s native tracing product, and we’ve made numerous improvements to the Trace explorer UI on top of a new analytics backend.
The new Trace explorer page contains:
A filter bar with options for users to choose a Google Cloud project-based trace scope, all spans or root spans only, and a custom attribute filter.
A faceted span filter pane that displays commonly used filters based on OpenTelemetry conventions.
A visualization of matching spans including an interactive span duration heatmap (default), a span rate line chart, and a span duration percentile chart.
A table of matching spans that can be narrowed down further by selecting a cell of interest on the heatmap.
Let’s take a closer look at these new features and how you can use them to troubleshoot your applications. Imagine you’re a developer working on the checkoutservice of a retail webstore application and you’ve been paged because there’s an ongoing incident.
This application is instrumented using OpenTelemetry and sends trace data to Google Cloud Trace, so you navigate to the Trace explorer page on the Google Cloud console with the context set to the Google Cloud project that hosts the checkoutservice.
Before starting your investigation, you remember that your admin recommended using the webstore-prod trace scope when investigating webstore app-wide prod issues. By using this Trace scope, you'll be able to see spans stored in other Google Cloud projects that are relevant to your investigation.
You set the trace scope to webstore-prod and your queries will now include spans from all the projects included in this trace scope.
You select checkoutservice in Span filters (1) and the following updates load on the page:
Other sections such as Span name in the span filter pane (2) are updated with counts and percentages that take into account the selection made under service name. This can help you narrow down your search criteria to be more specific.
The span Filter bar (3) is updated to display the active filter.
The heatmap visualization (4) is updated to only display spans from the checkoutservice in the last 1 hour (default). You can change the time-range using the time-picker (5). The heatmap’s x-axis is time and the y-axis is span duration. It uses color shades to denote the number of spans in each cell with a legend that indicates the corresponding range.
The Spans table (6) is updated with matching spans sorted by duration (default).
Other Chart views (7) that you can switch to are also updated with the applied filter.
From looking at the heatmap, you can see that there are some spans in the >100s range which is abnormal and concerning. But first, you’re curious about the traffic and corresponding latency of calls handled by the checkoutservice.
Switching to the Span rate line chart gives you an idea of the traffic handled by your service. The x-axis is time and the y-axis is spans/second. The traffic handled by your service looks normal as you know from past experience that 1.5-2 spans/second is quite typical.
Switching to the Span duration percentile chart gives you p50/p90/p95/p99 span duration trends. While p50 looks fine, the p9x durations are greater than you expect for your service.
You switch back to the heatmap chart and select one of the outlier cells to investigate further. This particular cell has two matching spans with a duration of over 2 minutes, which is concerning.
You investigate one of those spans by viewing the full trace and notice that the orders publish span is the one taking up the majority of the time when servicing this request. Given this, you form a hypothesis that the checkoutservice is having issues handling these types of calls. To validate your hypothesis, you note the rpc.method attribute being PlaceOrder and exit this trace using the X button.
You add an attribute filter for key: rpc.method value:PlaceOrder using the Filter bar, which shows you that there is a clear latency issue with PlaceOrder calls handled by your service. You’ve seen this issue before and know that there is a runbook that addresses it, so you alert the SRE team with the appropriate action that needs to be taken to mitigate the incident.
This new experience is powered by BigQuery, using the same platform that backs Log Analytics. We plan to launch new features that take full advantage of this platform: SQL queries, flexible sampling, export, and regional storage.
In summary, you can use the new Cloud Trace explorer to perform service-oriented investigations with advanced querying and visualization of trace data. This allows developers and SREs to effectively troubleshoot production incidents and identify mitigating measures to restore normal operations.
The new Cloud Trace explorer is generally available to all users — try it out and share your feedback with us via the Send feedback button.
Picture this: you’re a Site Reliability Engineer (SRE) responsible for the systems that power your company’s machine learning (ML) services. What do you do to ensure you have a reliable ML service, how do you know you’re doing it well, and how can you build strong systems to support these services?
As artificial intelligence (AI) becomes more widely available, its features — including ML — will matter more to SREs. That’s because ML becomes both a part of the infrastructure used in production software systems, as well as an important feature of the software itself.
Abstractly, machine learning relies on its pipelines … and you know how to manage those! So you can begin with pipeline management, then look to other factors that will strengthen your ML services: training, model freshness, and efficiency. In the resources below, we'll look at some of the ML-specific characteristics of these pipelines that you’ll want to consider in your operations. Then, we draw on the experience of Google SREs to show you how to apply your core SRE skills to operating and managing your organization’s machine-learning pipelines.
Training ML models applies the notion of pipelines to specific types of data, often running on specialized hardware. Critical aspects to consider about the pipeline:
how much data you’re ingesting
how fresh this data needs to be
how the system trains and deploys the models
how efficiently the system handles these first three things
This keynote presents an SRE perspective on the value of applying reliability principles to the components of machine learning systems. It provides insight into why ML systems matter for products, and how SREs should think about them. The challenges that ML systems present include capacity planning, resource management, and monitoring; other challenges include understanding the cost of ML systems as part of your overall operations environment.
As with any pipeline-based system, a big part of understanding the system is describing how much data it typically ingests and processes. The Data Processing Pipelines chapter in the SRE Workbook lays out the fundamentals: automate the pipeline’s operation so that it is resilient, and can operate unattended.
You’ll want to develop Service Level Objectives (SLOs) in order to measure the pipeline’s health, especially for data freshness, i.e., how recently the model got the data it’s using to produce an inference for a customer. Understanding freshness provides an important measure of an ML system’s health, as data that becomes stale may lead to lower-quality inferences and sub-optimal outcomes for the user. For some systems, such as weather forecasting, data may need to be very fresh (just minutes or seconds old); for other systems, such as spell-checkers, data freshness can lag on the order of days — or longer! Freshness requirements will vary by product, so it’s important that you know what you’re building and how the audience expects to use it.
In this way, freshness is part of the critical user journey described in the SRE Workbook, capturing one aspect of the customer experience. You can read more about data freshness as a component of pipeline systems in the Google SRE article Reliable Data Processing with Minimal Toil.
There’s more than freshness to ensuring high-quality data — there’s also how you define the model-training pipeline. A Brief Guide To Running ML Systems in Production gives you the nuts and bolts of this discipline, from using contextual metrics to understand freshness and throughput, to methods for understanding the quality of your input data.
The 2021 SRE blog post Efficient Machine Learning Inference provides a valuable resource to learn about improving your model’s performance in a production environment. (And remember, training is never the same as production for ML services!)
Optimizing machine learning inference serving is crucial for real-world deployment. In this article, the authors explore serving multiple models from a shared VM. They cover realistic use cases and how to manage trade-offs between the cost, utilization, and latency of model responses. By changing the allocation of models to VMs, and varying the size and shape of those VMs in terms of processing, GPU, and RAM attached, you can improve the cost-effectiveness of model serving.
We mentioned that these AI pipelines often rely on specialized hardware. How do you know you’re using this hardware efficiently? Todd Underwood’s talk from SREcon EMEA 2023 on Artificial Intelligence: What Will It Cost You? gives you a sense of how much this specialized hardware costs to run, and how you can provide incentives for using it efficiently.
This article from Google's SRE team outlines strategies for ensuring reliable data processing while minimizing manual effort, or toil. One of the key takeaways: use an existing, standard platform for as much of the pipeline as possible. After all, your business goals should focus on innovations in presenting the data and the ML model, not in the pipeline itself. The article covers automation, monitoring, and incident response, with a focus on using these concepts to build resilient data pipelines. You’ll read best practices for designing data systems that can handle failures gracefully and reduce a team’s operational burden. This article is essential reading for anyone involved in data engineering or operations. Read more about toil in the SRE Workbook: https://sre.google/workbook/eliminating-toil/.
Successful ML deployments require careful management and monitoring for systems to be reliable and sustainable. That means taking a holistic approach, including implementing data pipelines, training pathways, model management, and validation, alongside monitoring and accuracy metrics. To go deeper, check out this guide on how to use GKE for your AI orchestration.
In today's dynamic digital landscape, building and operating secure, reliable, cost-efficient and high-performing cloud solutions is no easy feat. Enterprises grapple with the complexities of cloud adoption, and often struggle to bridge the gap between business needs, technical implementation, and operational readiness. This is where the Google Cloud Well-Architected Framework comes in. The framework provides comprehensive guidance to help you design, develop, deploy, and operate efficient, secure, resilient, high-performing, and cost-effective Google Cloud topologies that support your security and compliance requirements.
The Well-Architected Framework caters to a broad spectrum of cloud professionals. Cloud architects, developers, IT administrators, decision makers and other practitioners can benefit from years of subject-matter expertise and knowledge both from within Google and from the industry. The framework distills this vast expertise and presents it as an easy-to-consume set of recommendations.
The recommendations in the Well-Architected Framework are organized under five business-focused pillars.
We recently completed a revamp of the guidance in all the pillars and perspectives of the Well-Architected Framework to center the recommendations around a core set of design principles.
In addition to the above pillars, the Well-Architected Framework provides cross-pillar perspectives that present recommendations for selected domains, industries, and technologies like AI and machine learning (ML).
The Well-Architected Framework is much more than a collection of design and operational recommendations. The framework empowers you with a structured principles-oriented design methodology that unlocks many advantages:
Enhanced security, privacy, and compliance: Security is paramount in the cloud. The Well-Architected Framework incorporates industry-leading security practices, helping ensure that your cloud architecture meets your security, privacy, and compliance requirements.
Optimized cost: The Well-Architected Framework lets you build and operate cost-efficient cloud solutions by promoting a cost-aware culture, focusing on resource optimization, and leveraging built-in cost-saving features in Google Cloud.
Resilience, scalability, and flexibility: As your business needs evolve, the Well-Architected Framework helps you design cloud deployments that can scale to accommodate changing demands, remain highly available, and be resilient to disasters and failures.
Operational excellence: The Well-Architected Framework promotes operationally sound architectures that are easy to operate, monitor, and maintain.
Predictable and workload-specific performance: The Well-Architected Framework offers guidance to help you build, deploy, and operate workloads that provide predictable performance based on your workloads’ needs.
The principles and recommendations in the Google Cloud Well-Architected Framework are aligned with Google and industry best practices like Google’s Site Reliability Engineering (SRE) practices, DORA capabilities, the Google HEART framework for user-centered metrics, the FinOps framework, Supply-chain Levels for Software Artifacts (SLSA), and Google's Secure AI Framework (SAIF).
Embrace the Well-Architected Framework to transform your Google Cloud journey, and get comprehensive guidance on security, reliability, cost, performance, and operations — as well as targeted recommendations for specific industries and domains like AI and ML. To learn more, visit Google Cloud Well-Architected Framework.
We are thrilled to announce the collaboration between Google Cloud, AWS, and Azure on Kube Resource Orchestrator, or kro (pronounced “crow”). kro introduces a Kubernetes-native, cloud-agnostic way to define groupings of Kubernetes resources. With kro, you can group your applications and their dependencies as a single resource that can be easily consumed by end users.
Platform and DevOps teams want to define standards for how application teams deploy their workloads, and they want to use Kubernetes as the platform for creating and enforcing these standards. Each service needs to handle everything from resource creation to security configurations, monitoring setup, defining the end-user interface, and more. There are client-side templating tools that can help with this (e.g., Helm, Kustomize), but Kubernetes lacks a native way for platform teams to create custom groupings of resources for consumption by end users.
Before kro, platform teams needed to invest in custom solutions such as building custom Kubernetes controllers, or using packaging tools like Helm, which can’t leverage the benefits of Kubernetes CRDs. These approaches are costly to build, maintain, and troubleshoot, and complex for non-Kubernetes experts to consume. This is a problem many Kubernetes users face. Rather than developing vendor-specific solutions, we’ve partnered with Amazon and Microsoft on making K8s APIs simpler for all Kubernetes users.
kro is a Kubernetes-native framework that lets you create reusable APIs to deploy multiple resources as a single unit. You can use it to encapsulate a Kubernetes deployment and its dependencies into a single API that your application teams can use, even if they aren’t familiar with Kubernetes. You can use kro to create custom end-user interfaces that expose only the parameters an end user should see, hiding the complexity of Kubernetes and cloud-provider APIs.
kro does this by introducing the concept of a ResourceGraphDefinition, which specifies how a standard Kubernetes Custom Resource Definition (CRD) should be expanded into a set of Kubernetes resources. End users define a single resource, which kro then expands into the custom resources defined in the CRD.
kro can be used to group and manage any Kubernetes resources. Tools like ACK, KCC, or ASO define CRDs to manage cloud provider resources from Kubernetes (these tools enable cloud provider resources, like storage buckets, to be created and managed as Kubernetes resources). kro can also be used to group resources from these tools, along with any other Kubernetes resources, to define an entire application deployment and the cloud provider resources it depends on.
Below, you’ll find some examples of kro being used with Google Cloud. You can find additional examples on the kro website.
Example 1: GKE cluster definition
Imagine that a platform administrator wants to give end users in their organization self-service access to create GKE clusters. The platform administrator creates a kro ResourceGraphDefinition called GKEclusterRGD that defines the required Kubernetes resources and a CRD called GKEcluster that exposes only the options they want to be configurable by end users. In addition to creating a cluster, the platform team also wants clusters to deploy administrative workloads such as policies, agents, etc. The ResourceGraphDefinition defines the following resources, using KCC to provide the mappings from K8s CRDs to Google Cloud APIs:
GKE cluster, Container Node Pools, IAM ServiceAccount, IAM PolicyMember, Services, Policies
The platform administrator then defines the end-user interface so that an end user can create a new cluster simply by creating an instance of the CRD, specifying only:
Cluster name, Nodepool name, Max nodes, Location (e.g. us-east1), Networks (optional)
Everything related to policy, service accounts, and service activation (and how these resources relate to each other) is hidden from the end user, simplifying their experience.
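For a sense of what that simplified experience might look like from the end user’s side, here is a hedged Python sketch that creates an instance of the GKEcluster CRD through the Kubernetes API. The API group, version, resource plural, and spec field names are illustrative assumptions (the platform team defines them in the ResourceGraphDefinition), not kro’s fixed schema.

```python
# Sketch of the end-user side of Example 1. The API group, version, plural,
# and spec field names are illustrative guesses chosen for this example;
# in reality the platform team defines them in GKEclusterRGD.
from kubernetes import client, config

cluster_request = {
    "apiVersion": "kro.run/v1alpha1",   # assumed group/version
    "kind": "GKEcluster",               # the CRD exposed to end users (see above)
    "metadata": {"name": "team-a-dev"},
    "spec": {                           # only the fields the admin chose to expose
        "clusterName": "team-a-dev",
        "nodepoolName": "default-pool",
        "maxNodes": 4,
        "location": "us-east1",
        # "network": "dev-vpc",         # optional, per the interface above
    },
}

config.load_kube_config()
client.CustomObjectsApi().create_namespaced_custom_object(
    group="kro.run", version="v1alpha1", namespace="default",
    plural="gkeclusters", body=cluster_request,
)
# kro expands this one object into the cluster, node pools, IAM, and policy
# resources defined in the ResourceGraphDefinition; the user never touches those.
```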
Example 2: Web application definition
In this example, a DevOps Engineer wants to create a reusable definition of a web application and its dependencies. They create a ResourceGraphDefinition called WebAppRGD, which defines a new Kubernetes CRD called WebApp. This new resource encapsulates all the necessary resources for a web application environment, including:
Deployments, Services, ServiceAccounts, monitoring agents, and cloud resources like object storage buckets.
The WebAppRGD ResourceGraphDefinition can set a default configuration and define which parameters the end user can set at deployment time (kro gives you the flexibility to decide what is immutable and what an end user is able to configure). A developer then creates an instance of the WebApp CRD, supplying the user-facing parameters, and kro deploys the desired Kubernetes resources.
We believe kro is a big step forward for platform engineering teams, delivering a number of advantages:
Kubernetes-native: kro leverages Kubernetes Custom Resource Definitions (CRDs) to extend Kubernetes, so it works with any Kubernetes resource and integrates with existing Kubernetes tools and workflows.
Lets you create a simplified end-user experience: kro makes it easy to define end-user interfaces for complex groups of Kubernetes resources, so that people who are not Kubernetes experts can consume services built on Kubernetes.
Enables standardized services for application teams: kro templates can be reused across different projects and environments, promoting consistency and reducing duplication of effort.
kro is available as an open-source project on GitHub. The GitHub organization is currently jointly owned by teams from Google, AWS, and Microsoft, and we welcome contributions from the community. We also have a website with documentation on installing and using kro, including example use cases. As an early-stage project, kro is not yet ready for production use, but we still encourage you to test it out in your own Kubernetes development environments!
Platform engineering, one of Gartner’s top 10 strategic technology trends for 2024, is rapidly becoming indispensable for enterprises seeking to accelerate software delivery and improve developer productivity. How does it do that? Platform engineering is about providing the right infrastructure, tools, and processes that enable efficient, scalable software development, deployment, and management, all while minimizing the cognitive burden on developers.
To uncover the secrets to platform engineering success, Google Cloud partnered with Enterprise Strategy Group (ESG) on a comprehensive research study of 500 global IT professionals and application developers working at organizations with at least 500 employees, all with formal platform engineering teams. Our goal was to understand whether they had adopted platform engineering, and if so, the impact that has had on their company’s software delivery capabilities.
The resulting report, “Building Competitive Edge With Platform Engineering: A Strategic Guide,” reveals common patterns, expectations, and actionable best practices for overcoming challenges and fully leveraging platform engineering. This blog post highlights some of the most powerful insights from this study.
The research confirms that platform engineering is no longer a nascent concept. 55% of the global organizations we invited to participate have already adopted platform engineering. Of those, 90% plan to expand its reach to more developers. Furthermore, 85% of companies using platform engineering report that their developers rely on the platform to succeed. These figures highlight that platform engineering is no longer just a trend; it's becoming a vital strategy for organizations seeking to unlock the full potential of their cloud and IT investments and gain a competitive edge.
Figure 1: 55% of 900+ global organizations surveyed have adopted platform engineering
The report identifies three critical components that are central to the success of mature platform engineering leaders.
Fostering close collaboration between platform engineers and other teams to ensure alignment
Adopting a “platform as a product” approach, which means managing the developer platform as a product, with a clear roadmap, communicated value, and tight feedback loops
Defining success by measuring performance through clear metrics such as deployment frequency, failure recovery time, and lead time for changes (a minimal example of computing these follows this list)
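These are essentially the DORA delivery metrics. As a minimal, hedged illustration (the records below are invented), computing them from deployment and incident logs is straightforward.

```python
from datetime import datetime

# Invented records; in practice these come from CI/CD and incident tooling.
deploys = [
    {"committed": datetime(2025, 3, 1, 9),  "deployed": datetime(2025, 3, 1, 15)},
    {"committed": datetime(2025, 3, 3, 11), "deployed": datetime(2025, 3, 4, 10)},
    {"committed": datetime(2025, 3, 5, 14), "deployed": datetime(2025, 3, 5, 17)},
]
incidents = [
    {"start": datetime(2025, 3, 4, 12), "restored": datetime(2025, 3, 4, 13, 30)},
]
window_days = 7

deploy_frequency = len(deploys) / window_days
lead_time_h = sum((d["deployed"] - d["committed"]).total_seconds() / 3600
                  for d in deploys) / len(deploys)
recovery_h = sum((i["restored"] - i["start"]).total_seconds() / 3600
                 for i in incidents) / len(incidents)

print(f"deployment frequency: {deploy_frequency:.2f} per day")
print(f"lead time for changes: {lead_time_h:.1f} hours")
print(f"failure recovery time: {recovery_h:.1f} hours")
```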
It's noteworthy that while many organizations have begun their platform engineering journey, only 27% of adopters have fully integrated these three key components into their practices, signaling a significant opportunity for further improvement.
One of the most compelling insights of this report is the synergistic relationship between platform engineering and AI. A remarkable 86% of respondents believe that platform engineering is essential to realizing the full business value of AI. At the same time, a vast majority of companies view AI as a catalyst for advancing platform engineering, with 94% of organizations identifying AI as ‘Critical’ or ‘Important’ to the future of platform engineering.
The study also identified three cohorts of platform engineering adopters — nascent, established, and leading — based on whether and how much adopters had embraced the above-mentioned three key components of platform engineering success. The study shows that leading adopters gain more in terms of speed, efficiency, and productivity, and offers guidance for nascent and established adopters to improve their overall platform engineering maturity to gain more benefits.
The report also identified some additional benefits of platform engineering, including:
Improved employee satisfaction, talent acquisition & retention: mature platforms foster a positive developer experience that directly impacts company culture. Developers and IT pros working for organizations with mature developer platforms are much more likely to recommend their workplace to their peers.
Accelerated time to market: mature platform engineering adopters have significantly shortened time to market. 71% of leading adopters of platform engineering indicated they have significantly accelerated their time to market, compared with 28% of less mature adopters.
A vast majority (96%) of surveyed organizations are leveraging open-source tools to build their developer platforms. Moreover, most (84%) are partnering with external vendors to manage and support their open-source environments. Co-managed platforms with a third party or a cloud partner benefit from a higher degree of innovation. Organizations with co-managed platforms allocate an average of 47% of their developers’ productive time to innovation and experimentation, compared to just 38% for those that prefer to manage their platforms with internal staff.
While this blog provides a glimpse into the key findings from this study, the full report goes much further, revealing key platform engineering strategies and practices that will help you stay ahead of the curve. Download the report to explore additional topics, including:
The strategic considerations of centralized and distributed platform engineering teams
The key drivers behind platform engineering investments
Top priorities driving platform adoption for developers, ensuring alignment with their needs
Key pain points to anticipate and navigate on the road to platform engineering success
How platform engineering boosts productivity, performance, and innovation across the entire organization
The strategic importance of open source in platform engineering for competitive advantage
The transformative role of platform engineering for AI/ML workloads as adoption of AI increases
How to develop the right platform engineering strategy to drive scalability and innovation

Azul this week acquired Payara, a provider of a Java-based application server and microservices framework that extends the scope of the company’s portfolio beyond Java runtimes. The two companies were previously allied in 2018 when Payara embedded the Azul Platform Core into Payara Server Enterprise. In addition, both companies have a long history of contributing […]

Explore the key software development trends for 2026, including AI-enabled development, low-code platforms, and talent density maximization shaping modern SDLCs.

Cary, North Carolina, USA, 11th December 2025, CyberNewsWire

Alan breaks down Harness’s $240M raise, $5.5B valuation, and Jyoti Bansal’s AI-native platform reshaping the software delivery pipeline.

Developers were the targets of two new malicious Microsoft Visual Studio Code (VS Code) extensions created by a threat actor that security researchers believe is experimenting with methods for delivering information-stealing malware to the victims’ systems. The malicious extensions come posing as a harmless “premium dark theme” and an AI-powered coding assistant, but both – […]

The next wave of user interfaces demands a complete rethinking of software architecture. For decades, digital products have relied on menus, checkboxes, search bars, and forms. These elements worked, but they were never designed around how people naturally think, speak, or interact. In 2025, clinging to them feels like printing paper maps for directions or […]

Last winter, my city Richmond VA suffered water distribution outages for days after a blizzard. Not because of one big failure, but because backup pumps failed, sensors misread, alerts got buried, and then another pump died during recovery. The whole city ended up under a boil‑water advisory. Sound familiar? Replace “water pumps” with “microservices” and […]

As cloud-native architectures scale and regulatory pressure intensifies, organizations are finally recognizing that their logging pipelines contain sensitive data. Logs fuel observability, debugging, compliance investigations, and incident response, yet they also remain one of the least governed data streams in the enterprise. Despite years of progress in DevSecOps, true privacy-safe logging, logs that remain operationally useful […]

Black Duck today added a tool for analyzing and remediating code that is directly integrated into artificial intelligence (AI) coding tools. Company CEO Jason Schmitt said Black Duck Signal makes it possible to discover issues as application developers increasingly rely on AI coding tools to generate more code faster, which paradoxically also typically contains more […]

IT outages cost companies over $14,000 per minute. IBM Research’s Project ALICE uses multiple AI agents to help engineers find bugs faster and restore systems. Software bugs are expensive. When a critical system goes down, every minute of downtime costs revenue, frustrates customers, and strains engineering teams scrambling to find the problem. The challenge isn’t […]