Pipes Feed Preview: Towards Data Science & The New Stack & DevOps & SRE & DevOps.com & Google DeepMind Blog

  1. From FOMO to Opportunity: Analytical AI in the Era of LLM Agents

    Wed, 30 Apr 2025 03:42:47 -0000

    Why the gold rush toward LLM agents does not make analytical AI obsolete


    <p class="wp-block-paragraph"><mdspan datatext="el1745984322790" class="mdspan-comment">Are you feeling</mdspan> &#8220;fear of missing out&#8221; (FOMO) when it comes to LLM agents? Well, that was the case for me for quite a while.</p> <p class="wp-block-paragraph">In recent months, it feels like my online feeds have been completely bombarded by &#8220;LLM Agents&#8221;: every other technical blog is trying to show me &#8220;how to build an agent in 5 minutes&#8221;. Every other piece of tech news is highlighting yet another shiny startup building LLM agent-based products, or a big tech releasing some new agent-building libraries or fancy-named agent protocols (seen enough MCP or Agent2Agent?).</p> <p class="wp-block-paragraph">It seems that suddenly, LLM agents are everywhere. All those flashy demos showcase that those digital beasts seem more than capable of writing code, automating workflows, discovering insights, and seemingly threatening to replace… well, just about everything.</p> <p class="wp-block-paragraph">Unfortunately, this view is also shared by many of our clients at work. They are actively asking for agentic features to be integrated into their products. They aren&#8217;t hesitating to finance new agent-development projects, because of the fear of lagging behind their competitors in leveraging this new technology.</p> <p class="wp-block-paragraph">As an <a href="https://towardsdatascience.com/tag/analytical-ai/" title="Analytical AI">Analytical AI</a> practitioner, seeing those impressive agent demos built by my colleagues and the enthusiastic feedback from the clients, I have to admit, it gave me a serious case of FOMO.</p> <p class="wp-block-paragraph">It genuinely left me wondering: Is the work I do becoming irrelevant? </p> <p class="wp-block-paragraph">After struggling with that question, I have reached this conclusion:</p> <p class="wp-block-paragraph"><strong>No, that&#8217;s not the case at all.</strong></p> <p class="wp-block-paragraph">In this blog post, I want to share my thoughts on why the rapid rise of <a href="https://towardsdatascience.com/tag/llm-agents/" title="LLM Agents">LLM Agents</a> doesn&#8217;t diminish the importance of analytical AI. In fact, I believe it&#8217;s doing the opposite: it&#8217;s creating unprecedented opportunities for both analytical AI and agentic AI. </p> <p class="wp-block-paragraph">Let&#8217;s explore why.</p> <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"> <p class="wp-block-paragraph">Before diving in, let&#8217;s quickly clarify the terms:</p> <ul class="wp-block-list"> <li class="wp-block-list-item"><strong>Analytical AI</strong>: I&#8217;m primarily referring to statistical modeling and machine learning approaches applied to quantitative, numerical data. Think of industrial applications like anomaly detection, time-series forecasting, product design optimization, predictive maintenance, ditigal twins, etc.</li> <li class="wp-block-list-item"><strong>LLM Agents</strong>: I am referring to AI systems using LLM as the core that can autonomously perform tasks by combining natural language understanding, with reasoning, planning, memory, and tool use.</li> </ul> </blockquote> <hr class="wp-block-separator has-alpha-channel-opacity is-style-dotted"/> <figure class="wp-block-pullquote"><blockquote><p><strong>Viewpoint 1: Analytical AI provides the crucial quantitative grounding for LLM agents. 
</strong></p></blockquote></figure> <p class="wp-block-paragraph">Despite the remarkable capabilities in natural language understanding and generation, LLMs fundamentally lack the quantitative precision required for many industrial applications. This is where analytical AI becomes indispensable. </p> <p class="wp-block-paragraph">There are some key ways the analytical AI could step up, grounding the LLM agents with mathematical rigor and ensuring that they are operating following the reality:</p> <h2 class="wp-block-heading has-subtitle-1-font-size"><img src="https://s.w.org/images/core/emoji/15.0.3/72x72/1f6e0.png" alt="🛠" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Analytical AI as essential tools</h2> <p class="wp-block-paragraph">Integrating Analytical AI as specialized, callable tools is arguably the most common pattern for providing LLM agents with quantitative grounding.</p> <p class="wp-block-paragraph">There has long been a tradition (well before the current hype around LLMs) of developing specialized Analytical AI tools across various industries to address challenges using real-world operational data. Those challenges, be it predicting equipment maintenance or forecasting energy consumption, demand high numerical precision and sophisticated modeling capabilities. Frankly, these capabilities are fundamentally different from the linguistic and reasoning strengths that characterize today&#8217;s LLMs.</p> <p class="wp-block-paragraph">This long-standing foundation of Analytical AI is not just relevant, but essential, for grounding LLM agents in real-world accuracy and operational reliability. The core motivation here is a <strong>separation of concerns</strong>: let the LLM agents handle the understanding, reasoning, and planning, while the Analytical AI tools perform the specialized quantitative analysis they were trained for.</p> <p class="wp-block-paragraph">In this paradigm, Analytical AI tools can play multiple critical roles. First and foremost, they can <strong>enhance the agent&#8217;s capabilities</strong> with analytical superpowers it inherently lacks. Also, they can <strong>verify the agent&#8217;s outputs/hypotheses</strong> against real data and the learned patterns. Finally, they can <strong>enforce physical constraints</strong>, ensuring the agents operate in a realistically feasible space.</p> <p class="wp-block-paragraph">To give a concrete example, imagine an LLM agent that is tasked with optimizing a complex semiconductor fabrication process to maximize yield and maintain stability. Instead of solely relying on textual logs/operator notes, the agent continuously interacts with a suite of specialized Analytical AI tools to gain a quantitative, context-rich understanding of the process in real-time.</p> <p class="wp-block-paragraph">For instance, to achieve its goal of high yield, the agent queries a pre-trained <strong>XGBoost model</strong> to predict the likely yield based on hundreds of sensor readings and process parameters. This gives the agent the foresight into quality outcomes.</p> <p class="wp-block-paragraph">At the same time, to ensure the process stability for consistent quality, the agent calls upon an <strong>autoencoder model </strong>(pre-trained on normal process data) to identify deviations or potential equipment failures <em>before</em> they disrupt production.</p> <p class="wp-block-paragraph">When potential issues arise, as indicated by the anomaly detection model, the agent must perform course correction in an optimal way. 
To do that, it invokes a <strong>constraint-based optimization model</strong>, which employs a <em>Bayesian optimization</em> algorithm to recommend the optimal adjustments to process parameters.</p> <p class="wp-block-paragraph">In this scenario, the LLM agent essentially acts as the intelligent orchestrator. It interprets the high-level goals, plans the queries to the appropriate Analytical AI tools, reasons on their quantitative outputs, and translates these complex analyses into actionable insights for operators or even triggers automated adjustments. This collaboration ensures that LLM agents remain grounded and reliable in tackling complex, real-world industrial problems.</p> <h2 class="wp-block-heading has-subtitle-1-font-size"><img src="https://s.w.org/images/core/emoji/15.0.3/72x72/1faa3.png" alt="🪣" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Analytical AI as a digital sandbox</h2> <p class="wp-block-paragraph">Beyond serving as a callable tool, Analytical AI offers another crucial capability: creating realistic <strong>simulation environments</strong> where LLM agents get trained and evaluated before they interact with the physical world. This is particularly valuable in industrial settings where failure could lead to severe consequences, like equipment damage or safety incidents.</p> <p class="wp-block-paragraph">Analytical AI techniques are highly capable of building high-fidelity representations of the industrial asset or process by learning from both their historical operational data and the governing physical equations (think of methods like physics-informed neural networks). These <em>digital twins</em> capture the underlying physical principles, operational constraints, and inherent system variability.</p> <p class="wp-block-paragraph">Within this Analytical AI-powered virtual world, an LLM agent can be trained by first receiving simulated sensor data, deciding on control actions, and then observing the system responses computed by the Analytical AI simulation. As a result, agents can iterate through many trial-and-error learning cycles in a much shorter time and be safely exposed to a diverse range of realistic operating conditions.</p> <p class="wp-block-paragraph">Besides agent training, these Analytical AI-powered simulations offer a controlled environment for rigorously <strong>evaluating and comparing </strong>the performance and robustness of different agent setup versions or control policies before real-world deployment.</p> <p class="wp-block-paragraph">To give a concrete example, consider a power grid management case. An LLM agent (or multiple agents) designed to optimize renewable energy integration can be tested within such a simulated environment powered by multiple analytical AI models: we could have a <strong>physics-informed neural network</strong> (PINN) model to describe the complex, dynamical power flows. We may also have probabilistic forecasting models to simulate realistic weather patterns and their impact on renewable generation. Within this rich environment, the LLM agent(s) can learn to develop sophisticated decision-making policies for balancing the grid during various weather conditions, without ever risking actual service disruptions.</p> <p class="wp-block-paragraph">The bottom line is, without Analytical AI, none of this would be possible. 
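</p> <p class="wp-block-paragraph">To make the sandbox idea a bit more tangible, here is a minimal, purely illustrative Python sketch: a surrogate model trained on synthetic data plays the role of the digital twin, and a placeholder <code>agent_decide</code> function stands in for the LLM agent. All names and numbers are invented for illustration; they do not come from a real system.</p> <pre class="wp-block-prismatic-blocks"><code class="language-python">
# Illustrative sketch: an Analytical AI surrogate model acts as the training environment,
# while a placeholder function stands in for the LLM agent's decisions.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Toy "digital twin": state = [load_MW, wind_MW], action = dispatch_MW,
# target = resulting grid frequency deviation (synthetic data for illustration).
X_hist = rng.uniform([200, 0, 0], [800, 300, 500], size=(2000, 3))
y_hist = 0.002 * (X_hist[:, 0] - X_hist[:, 1] - X_hist[:, 2]) + rng.normal(0, 0.01, 2000)
twin = GradientBoostingRegressor().fit(X_hist, y_hist)

def agent_decide(load, wind):
    """Placeholder for the LLM agent's policy; here, a naive heuristic."""
    return max(load - wind, 0.0)

# Closed-loop evaluation inside the sandbox: no real grid is ever touched.
for step in range(5):
    load, wind = rng.uniform(200, 800), rng.uniform(0, 300)
    dispatch = agent_decide(load, wind)
    freq_dev = twin.predict([[load, wind, dispatch]])[0]
    print(f"step {step}: dispatch={dispatch:.0f} MW, predicted deviation={freq_dev:+.3f} Hz")
</code></pre> <p class="wp-block-paragraph">However the agent itself is implemented, the environment it learns in is Analytical AI under the hood. 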
It forms the quantitative foundation and the physical constraints that make safe and effective agent development a reality.</p> <h2 class="wp-block-heading has-subtitle-1-font-size"><img src="https://s.w.org/images/core/emoji/15.0.3/72x72/1f4c8.png" alt="📈" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Analytical AI as an operational toolkit</h2> <p class="wp-block-paragraph">Now, if we zoom out and take a fresh perspective, <strong>isn’t an LLM agent—or even a team of them—just another type of operational system, that needs to be managed like any other industrial asset/process?</strong></p> <p class="wp-block-paragraph">This effectively means: all the principles of design, optimization, and monitoring for systems still apply. And guess what? Analytical AI is the toolkit exactly for that.</p> <p class="wp-block-paragraph">Again, Analytical AI has the potential to move us beyond empirical trial-and-error (the current practices) and towards <em>objective</em>, <em>data-driven</em> methods for managing agentic systems. How about using a <strong>Bayesian optimization algorithm</strong> to design the agent architecture and configurations? How about adopting <strong>operations research techniques</strong> to optimize the allocation of computational resources or manage request queues efficiently? How about employing <strong>time-series anomaly detection</strong> methods to alert real-time behavior of the agents?</p> <p class="wp-block-paragraph">Treating the LLM agent as a complex system subject to quantitative analysis opens up many new opportunities. It is precisely this operational rigor enabled by Analytical AI that can elevate these LLM agents from &#8220;just a demo&#8221; to something reliable, efficient, and &#8220;actually useful&#8221; in modern industrial operation.</p> <hr class="wp-block-separator has-alpha-channel-opacity is-style-dotted"/> <figure class="wp-block-pullquote"><blockquote><p><strong>Viewpoint 2: Analytical AI can be amplified by LLM agents with their contextual intelligence</strong>.</p></blockquote></figure> <p class="wp-block-paragraph">We have discussed in length how indispensable Analytical AI is for the LLM agent ecosystem. But this powerful synergy flows in both directions. Analytical AI can also leverage the unique strengths of LLM agents to enhance its usability, effectiveness, and ultimately, the real-world impact. Those are the points that Analytical AI practitioners may not want to miss out on LLM agents.</p> <h2 class="wp-block-heading has-subtitle-1-font-size"><img src="https://s.w.org/images/core/emoji/15.0.3/72x72/1f9e9.png" alt="🧩" class="wp-smiley" style="height: 1em; max-height: 1em;" /> From vague goals to solvable problems</h2> <p class="wp-block-paragraph">Often, the need for analysis starts with a high-level, vaguely stated business goal, like &#8220;we need to improve product quality.&#8221; To make this actionable, Analytical AI practitioners must repeatedly ask clarifying questions to uncover the true objective functions, specific constraints, and available input data, which inevitably leads to a very time-consuming process.</p> <p class="wp-block-paragraph">The good news is, LLM agents excel here. 
They can interpret these ambiguous natural language requests, ask clarifying questions, and formulate them into well-structured, quantitative problems that Analytical AI tools can directly tackle.</p> <h2 class="wp-block-heading has-subtitle-1-font-size"><img src="https://s.w.org/images/core/emoji/15.0.3/72x72/1f4da.png" alt="📚" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Enriching Analytical AI model with context and knowledge</h2> <p class="wp-block-paragraph">Traditional Analytical AI models operate primarily on numerical data. For the largely untapped unstructured data, LLM agents can be very helpful there to extract useful information to fuel the quantitative analysis. </p> <p class="wp-block-paragraph">For example, LLM agents can analyze text documents/reports/logs to identify meaningful patterns, and transform these qualitative observations into quantitative features that Analytical AI models can process. This <strong>feature engineering</strong> step often significantly boosts the performance of Analytical AI models by giving them access to insights embedded in unstructured data they would otherwise miss.</p> <p class="wp-block-paragraph">Another important use case is <strong>data labeling</strong>. Here, LLM agents can automatically generate accurate category labels and annotations. By providing high-quality training data, they can greatly accelerate the development of high-performing supervised learning models.</p> <p class="wp-block-paragraph">Finally, by tapping into the <strong>knowledge </strong>of LLM agents, either <em>pre-trained</em> in the LLM or <em>actively searched</em> in external databases, LLM agents can automate the setup of the sophisticated analysis pipeline. LLM agents can recommend appropriate algorithms and parameter settings based on the problem characteristics [1], generate code to implement custom problem-solving strategies, or even automatically run experiments for hyperparameter tuning [2].</p> <h2 class="wp-block-heading has-subtitle-1-font-size"><img src="https://s.w.org/images/core/emoji/15.0.3/72x72/1f4a1.png" alt="💡" class="wp-smiley" style="height: 1em; max-height: 1em;" />From technical outputs to actionable insights</h2> <p class="wp-block-paragraph">Analytical AI models tend to produce dense outputs, and properly interpreting them requires both expertise and time. LLM agents, on the other hand, can act as &#8220;translators&#8221; by converting these dense quantitative results into clear, accessible natural language explanations. </p> <p class="wp-block-paragraph">This interpretability function plays a crucial role in <strong>explaining </strong>the decisions made by the Analytical AI models in a way that human operators can quickly understand and act upon. Also, this information could be highly valuable for model developers to verify the correctness of model outputs, identify potential issues, and improve model performance.</p> <p class="wp-block-paragraph">Besides technical interpretation, LLM agents can also generate tailored responses for different types of audiences: technical teams would receive detailed methodological explanations, operations staff may get practical implications, while executives may obtain summaries highlighting business impact metrics. 
</p> <p class="wp-block-paragraph">By serving as <em>interpreters </em>between analytical systems and human users, LLM agents can significantly amplify the practical value of analytical AI.</p> <hr class="wp-block-separator has-alpha-channel-opacity is-style-dotted"/> <figure class="wp-block-pullquote"><blockquote><p><strong>Viewpoint 3: The future probably lies in the true peer-to-peer collaboration between Analytical AI and Agentic AI.</strong></p></blockquote></figure> <p class="wp-block-paragraph">Whether LLM agents call Analytical AI tools or analytical systems use LLM agents for interpretation, the approaches we have discussed so far have always been about one type of AI being in charge of the other. This in fact has introduced several limitations worth looking at.</p> <p class="wp-block-paragraph">First of all, in the current paradigm, Analytical AI components are only used as passive tools, and they are invoked only when the LLM decides so. This prevents them from proactively contributing insights or questioning assumptions.</p> <p class="wp-block-paragraph">Also, the typical agent loop of &#8220;plan-call-response-act&#8221;&nbsp;is inherently sequential. This can be inefficient for tasks that could benefit from parallel processing or more asynchronous interaction between the two AIs.</p> <p class="wp-block-paragraph">Another limiting factor is the limited communication bandwidth. API calls may not be able to deliver the rich context needed for genuine dialogue or exchange of intermediate reasoning.</p> <p class="wp-block-paragraph">Finally, LLM agents&#8217; understanding of an Analytical AI tool is often based on a brief docstring and a parameter schema. LLM agents are likely to make mistakes in tool selection, while Analytical AI components lack the context to recognize when they&#8217;re being used wrongly.</p> <p class="wp-block-paragraph">Just because the prevalence of adoption of the tool-calling pattern today does not necessarily mean the future should look the same. Probably, the future lies in a true peer-to-peer collaboration paradigm where neither AI type is the master.</p> <p class="wp-block-paragraph">What might this actually look like in practice? One interesting example I found is a solution delivered by Siemens [3].</p> <p class="wp-block-paragraph">In their smart factory system, there is a digital twin model that continuously monitors the equipment&#8217;s health. When a gearbox&#8217;s condition deteriorates, the Analytical AI system doesn&#8217;t wait to be queried, but proactively fires alerts. A Copilot LLM agent watches the same event bus. On an alert, it (1) cross-references maintenance logs, (2) “asks” the twin to rerun simulations with upcoming shift patterns, and then (3) recommends schedule adjustments to prevent costly downtime. What makes this example unique is that the Analytical AI system isn&#8217;t just a passive tool. Rather, it initiates the dialogue when needed.</p> <p class="wp-block-paragraph">Of course, this is just one possible system architecture. 
<p class="wp-block-paragraph">Other promising directions include <strong>multi-agent systems</strong> with specialized cognitive functions, <strong>cross-training</strong> these systems to develop hybrid models that internalize aspects of both AI types (just as humans develop integrated mathematical and linguistic thinking), or simply drawing inspiration from established <strong>ensemble learning techniques</strong> by treating LLM agents and Analytical AI as different model types that can be combined in systematic ways. The future opportunities are endless.</p> <p class="wp-block-paragraph">But these also raise fascinating research challenges. How do we design <em>shared representations</em>? What architecture best supports <em>asynchronous information exchange</em>? What <em>communication protocols</em> are optimal between Analytical AI and agents?</p> <p class="wp-block-paragraph">These questions represent new frontiers that definitely need expertise from Analytical AI practitioners. Once again, the deep knowledge of building analytical models with quantitative rigor isn&#8217;t becoming obsolete; it is essential for building these hybrid systems of the future.</p> <figure class="wp-block-pullquote"><blockquote><p><strong>Viewpoint 4: Let&#8217;s embrace the complementary future.</strong></p></blockquote></figure> <p class="wp-block-paragraph">As we&#8217;ve seen throughout this post, the future isn&#8217;t &#8220;Analytical AI vs. LLM Agents.&#8221; It&#8217;s <strong>&#8220;Analytical AI + LLM Agents.&#8221;</strong></p> <p class="wp-block-paragraph">So, rather than feeling FOMO about LLM agents, I&#8217;ve now found renewed excitement about analytical AI&#8217;s evolving role. The analytical foundations we&#8217;ve built aren&#8217;t becoming obsolete; they&#8217;re essential components of a more capable AI ecosystem.</p> <p class="wp-block-paragraph">Let&#8217;s get building.</p> <p class="has-heading-5-font-size wp-block-paragraph">References</p> <p class="wp-block-paragraph">[1] Chen et al., <a href="https://arxiv.org/abs/2412.12154">PyOD 2: A Python Library for Outlier Detection with LLM-powered Model Selection</a>. arXiv, 2024.</p> <p class="wp-block-paragraph">[2] Liu et al., <a href="https://arxiv.org/abs/2402.03921">Large Language Models to Enhance Bayesian Optimization</a>. arXiv, 2024.</p> <p class="wp-block-paragraph">[3] <a href="https://press.siemens.com/global/en/pressrelease/siemens-unveils-breakthrough-innovations-industrial-ai-and-digital-twin-technology-ces">Siemens unveils breakthrough innovations in industrial AI and digital twin technology at CES 2025</a>. Press release, 2025.</p> <p>The post <a href="https://towardsdatascience.com/from-fomo-to-opportunity-analytical-ai-in-the-era-of-llm-agents/">From FOMO to Opportunity: Analytical AI in the Era of LLM Agents</a> appeared first on <a href="https://towardsdatascience.com">Towards Data Science</a>.</p>
  2. Data Analyst or Data Engineer or Analytics Engineer or BI Engineer ?

    Wed, 30 Apr 2025 01:06:36 -0000

    What’s the real difference?


    <p class="wp-block-paragraph">If <mdspan datatext="el1745456944555" class="mdspan-comment">you’ve followed</mdspan> me for a while, you probably know I started my career as a <strong>QA engineer</strong> before transitioning into the world of <strong>data analytics</strong>. I didn’t go to school for it, didn’t have a mentor, and didn’t land in a formal training program. Everything I know today—from SQL to modeling to storytelling with data—is self-taught. And believe me, it’s been a journey of trial, error, learning, and re-learning.</p> <h2 class="wp-block-heading">The Dilemma That Changed My Career</h2> <p class="wp-block-paragraph">A few years ago, I started thinking about switching organizations. Like many people in fast-evolving tech roles, I faced a surprisingly difficult question:</p> <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"> <p class="wp-block-paragraph">What role am I actually doing? Which roles should I apply for?</p> </blockquote> <p class="wp-block-paragraph">On paper, I was a <strong>Data Analyst</strong>. But in reality, my role straddled several functions: writing SQL pipelines, building dashboards, defining KPIs, and digging into product analytics. I wasn’t sure whether I should be applying for Analyst roles, BI roles, or something entirely different.</p> <p class="wp-block-paragraph">To make things worse, back then, job titles were vague, and job descriptions were bloated with buzzwords. You’d find a posting titled <em>“Data Analyst”</em> that listed requirements like:</p> <ul class="wp-block-list"> <li class="wp-block-list-item">Build ML pipelines</li> <li class="wp-block-list-item">Write complex ETL scripts</li> <li class="wp-block-list-item">Maintain data lakes</li> <li class="wp-block-list-item">Create dashboards</li> <li class="wp-block-list-item">Present executive-level insights</li> <li class="wp-block-list-item">And oh, by the way, be great at stakeholder management</li> </ul> <p class="wp-block-paragraph">It was overwhelming and confusing. And I know I’m not alone in this.</p> <p class="wp-block-paragraph">Fast forward to today: thankfully, things are evolving. There’s still overlap between roles, but organizations have started to define them more clearly. In this article, I want to break down the <strong>real differences between data roles</strong>, through the lens of a real-world example.</p> <h3 class="wp-block-heading">A Real-World Scenario: Meet <em>Quikee</em></h3> <p class="wp-block-paragraph">Let’s imagine a fictional quick-commerce startup called <strong>Quikee</strong>, launching across multiple Indian cities. Their value proposition? Deliver groceries and essentials within <strong>10 minutes</strong>.</p> <p class="wp-block-paragraph">Customers place orders through the app or website. 
Behind the scenes, there are micro-warehouses (also called “dark stores”) across cities, and a fleet of delivery partners who make those lightning-fast deliveries.</p> <p class="wp-block-paragraph">Now, let’s walk through the data needs of this company—from the moment an order is placed, to the dashboards executives use in their Monday morning meetings.</p> <h3 class="wp-block-heading">Step 1: Capturing and Storing Raw Data</h3> <p class="wp-block-paragraph">The moment a customer places an order, <strong>transactional data</strong> is generated:</p> <ul class="wp-block-list"> <li class="wp-block-list-item">Timestamps</li> <li class="wp-block-list-item">Order ID</li> <li class="wp-block-list-item">Items ordered</li> <li class="wp-block-list-item">Price</li> <li class="wp-block-list-item">Discount codes</li> <li class="wp-block-list-item">Customer location</li> <li class="wp-block-list-item">Payment method</li> <li class="wp-block-list-item">Assigned delivery partner</li> </ul> <p class="wp-block-paragraph">Let’s assume Quikee uses <strong>Amazon Kinesis</strong> to stream this data in real time to an <strong>S3 data lake</strong>. That stream is high-volume, time-sensitive, and crucial for business tracking.</p> <p class="wp-block-paragraph">But here’s the catch: raw data is messy. You can’t use it directly for decision-making.</p> <p class="wp-block-paragraph">So what happens next?</p> <h3 class="wp-block-heading">Step 2: Building Data Pipelines</h3> <p class="wp-block-paragraph">Enter the <strong>Data Engineers</strong>.</p> <p class="wp-block-paragraph">They are responsible for:</p> <ul class="wp-block-list"> <li class="wp-block-list-item">Ingesting real-time data</li> <li class="wp-block-list-item">Validating schema consistency</li> <li class="wp-block-list-item">Handling failures and retries</li> <li class="wp-block-list-item">Writing pipelines to move data from S3 into a data warehouse (say, Snowflake or Redshift)</li> </ul> <p class="wp-block-paragraph">This is where <strong>ETL</strong> (Extract, Transform, Load) or <strong>ELT</strong> pipelines come into play. 
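</p> <p class="wp-block-paragraph">To make the transform step a little more concrete, here is a small, hypothetical pandas sketch of the kind of logic a data engineer might write to turn raw order events into the normalized tables described below. Table and column names are invented for illustration; a real pipeline would read from S3 and load into the warehouse rather than work with in-memory frames.</p> <pre class="wp-block-prismatic-blocks"><code class="language-python">
# Hypothetical illustration of a tiny ELT-style transform: raw order events -&gt; tidy tables.
import pandas as pd

raw_events = pd.DataFrame([
    {"order_id": "O1", "ts": "2025-04-29 10:01:03", "items": "milk:2|bread:1",
     "amount": 6.5, "payment_status": "SUCCESS", "city": "Bengaluru"},
    {"order_id": "O2", "ts": "2025-04-29 10:02:11", "items": "eggs:1",
     "amount": 3.0, "payment_status": "FAILED", "city": "Mumbai"},
])

# Orders: one row per order, with proper types.
orders = raw_events[["order_id", "ts", "city", "amount"]].assign(ts=pd.to_datetime(raw_events["ts"]))

# Order_Items: explode the packed "items" string into one row per item.
order_items = (
    raw_events.assign(item=raw_events["items"].str.split("|"))
    .explode("item")[["order_id", "item"]]
    .reset_index(drop=True)
)
split_cols = order_items["item"].str.split(":", expand=True)
order_items["sku"], order_items["qty"] = split_cols[0], split_cols[1].astype(int)
order_items = order_items.drop(columns="item")

# Payments: one row per payment attempt.
payments = raw_events[["order_id", "payment_status", "amount"]]

print(orders, order_items, payments, sep="\n\n")
</code></pre> <p class="wp-block-paragraph">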
Data engineers clean, format, and structure the data to make it queryable.</p> <p class="wp-block-paragraph">For example, an order table might get split into:</p> <ul class="wp-block-list"> <li class="wp-block-list-item"><strong>Orders</strong> → One row per order</li> <li class="wp-block-list-item"><strong>Order_Items</strong> → One row per item in an order</li> <li class="wp-block-list-item"><strong>Payments</strong> → One row per payment attempt</li> </ul> <p class="wp-block-paragraph">At this stage, raw logs are turned into structured tables that analysts can work with.</p> <h3 class="wp-block-heading">Step 3: Dimensional Modeling &amp; OLAP</h3> <p class="wp-block-paragraph">As leadership starts asking strategic questions like:</p> <ul class="wp-block-list"> <li class="wp-block-list-item">“Which city brings in the most revenue?”</li> <li class="wp-block-list-item">“Which store is underperforming?”</li> <li class="wp-block-list-item">“What’s our average delivery time by zone?”</li> </ul> <p class="wp-block-paragraph">…it becomes clear that querying transactional data directly won’t scale.</p> <p class="wp-block-paragraph">That’s where <strong>dimensional modeling</strong> comes in.</p> <p class="wp-block-paragraph">Instead of flat, raw tables, data is structured into Fact and Dimension Tables.</p> <h3 class="wp-block-heading"><img src="https://s.w.org/images/core/emoji/15.0.3/72x72/1f538.png" alt="🔸" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Fact Tables</strong></h3> <ul class="wp-block-list"> <li class="wp-block-list-item">Large, quantitative data tables which contain foreign keys along with measures and metrics (<em>Well, most of the time. There are factless fact tables as well which do not have any measures</em>).</li> <li class="wp-block-list-item">Examples: <code>fact_orders</code>, <code>fact_payments</code>, <code>fact_deliveries</code></li> <li class="wp-block-list-item">Contain metrics like revenue, order count, delivery time</li> </ul> <h3 class="wp-block-heading"><img src="https://s.w.org/images/core/emoji/15.0.3/72x72/1f539.png" alt="🔹" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Dimension Tables</strong></h3> <ul class="wp-block-list"> <li class="wp-block-list-item">Smaller, descriptive tables that help understand the data in a fact table</li> <li class="wp-block-list-item">Examples: <code>dim_store</code>, <code>dim_product</code>, <code>dim_customer</code>, <code>dim_delivery_agent</code></li> <li class="wp-block-list-item">Help filter, group, and join facts for deeper insights</li> </ul> <p class="wp-block-paragraph">This structure enables <strong>OLAP</strong>—fast, analytical querying across multiple dimensions. 
For example, you can now run queries like:</p> <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"> <p class="wp-block-paragraph">“Show me average delivery time by store and hour of day, over the last 7 days.”</p> </blockquote> <p class="wp-block-paragraph">This step is done by Data Engineers at most of the organisations but I did build few Dim and Fact tables when I was working as a <a href="https://towardsdatascience.com/tag/business-intelligence/" title="Business Intelligence">Business Intelligence</a> Engineer at Amazon.</p> <h3 class="wp-block-heading">Step 4: Defining KPIs and Metrics</h3> <p class="wp-block-paragraph">This is where <strong>Analytics Engineers (or BI Engineers)</strong> shine.</p> <p class="wp-block-paragraph">They sit between the technical data layer and business users. Their responsibilities often include:</p> <ul class="wp-block-list"> <li class="wp-block-list-item">Defining KPIs (e.g., churn rate, repeat purchase %, time-to-fulfillment)</li> <li class="wp-block-list-item">Writing logic for complex metrics (e.g., cohort retention, active users)</li> <li class="wp-block-list-item">Creating <strong>semantic models</strong> or <strong>metrics layers</strong> in tools like dbt or Looker</li> <li class="wp-block-list-item">Ensuring consistent definitions across the company</li> </ul> <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"> <p class="wp-block-paragraph">For example, at Amazon, our team didn’t query raw data to calculate revenue every time. Instead, we created <strong>pre-aggregated fact tables</strong> at daily, weekly, and monthly grains. That way, dashboards loaded faster, and metrics stayed consistent across teams.</p> </blockquote> <p class="wp-block-paragraph">Analytics Engineers act as translators between engineering and the business—defining <strong>what</strong> we measure and <strong>how</strong> we measure it.</p> <h3 class="wp-block-heading">Step 5: Analysis, Reporting &amp; Storytelling</h3> <p class="wp-block-paragraph">Now comes the role of the <strong><a href="https://towardsdatascience.com/tag/data-analyst/" title="Data Analyst">Data Analyst</a></strong>.</p> <p class="wp-block-paragraph">Armed with clean, modeled data, they focus on answering real business questions like:</p> <ul class="wp-block-list"> <li class="wp-block-list-item">“Why did retention drop in Bangalore last month?”</li> <li class="wp-block-list-item">“Which coupon codes drive the most new users?”</li> <li class="wp-block-list-item">“What are the top products reordered in the first 30 days?”</li> </ul> <p class="wp-block-paragraph">They build dashboards in tools like Tableau, Power BI, or Looker. They run ad-hoc SQL queries. They dive into A/B test results, user behavior trends, and campaign effectiveness.</p> <p class="wp-block-paragraph">But above all, they <strong>tell stories</strong> with data—making complex numbers understandable and actionable for stakeholders.</p> <h3 class="wp-block-heading">Who’s&nbsp;Who?</h3> <figure class="wp-block-image"><img decoding="async" src="https://cdn-images-1.medium.com/max/1600/1*Va4pgbc8YPUc03l4-RAcVQ.png" alt=""/><figcaption class="wp-element-caption">Generated by&nbsp;Author</figcaption></figure> <h3 class="wp-block-heading">TL;DR: Where Do You Fit?</h3> <p class="wp-block-paragraph">Here’s how I think about it:</p> <ul class="wp-block-list"> <li class="wp-block-list-item">Love building robust pipelines and solving scalability problems? 
→ You’re a <strong>Data Engineer</strong></li> <li class="wp-block-list-item">Love defining business metrics and organizing complex datasets? → You’re an <strong>Analytics Engineer</strong></li> <li class="wp-block-list-item">Love uncovering insights and storytelling with data? → You’re a <strong>Data Analyst</strong></li> </ul> <p class="wp-block-paragraph">Of course, real-world roles often blend these. Especially at smaller companies, you may wear multiple hats. And that’s okay.</p> <p class="wp-block-paragraph">The key is not the title—but <strong>where you add the most value</strong> and <strong>what energizes you</strong>.</p> <h2 class="wp-block-heading">Final Thoughts</h2> <p class="wp-block-paragraph">It took me a long time to understand what I actually do—not just what my job title says. And if you’ve ever felt that confusion, you’re not alone.</p> <p class="wp-block-paragraph">Today, I can clearly say I operate at the intersection of <strong>data modeling</strong>, <strong>business logic</strong>, and <strong>storytelling</strong>—a sweet spot between analytics and engineering. And I’ve learned that the ability to connect the dots is more important than fitting into a perfect box.</p> <p class="wp-block-paragraph">If you’ve walked a similar path—or wear multiple hats in your role—I’d love to hear your story.</p> <p class="wp-block-paragraph"><strong>Drop a comment <img src="https://s.w.org/images/core/emoji/15.0.3/72x72/1f447.png" alt="👇" class="wp-smiley" style="height: 1em; max-height: 1em;" /> or share this with someone figuring it out too.</strong></p> <p>The post <a href="https://towardsdatascience.com/data-analyst-or-data-engineer-or-analytics-engineer-or-bi-engineer/">Data Analyst or Data Engineer or Analytics Engineer or BI Engineer ?</a> appeared first on <a href="https://towardsdatascience.com">Towards Data Science</a>.</p>
  3. Building a Scalable and Accurate Audio Interview Transcription Pipeline with Google Gemini

    Tue, 29 Apr 2025 20:13:43 -0000

    From prototype to production: real-world insights into building smarter transcription pipelines with LLMs.


    <p class="wp-block-paragraph"><strong>This article is co-authored by Ugo Pradère and David Haüet</strong></p> <p class="wp-block-paragraph"><mdspan datatext="el1745629161913" class="mdspan-comment">How hard</mdspan> can it be to transcribe an interview? You feed the audio to an AI model, wait a few minutes, and boom: perfect transcript, right? Well&#8230; not quite.</p> <p class="wp-block-paragraph">When it comes to accurately transcribe long audio interviews, even more when the spoken language is not English, things get a lot more complicated. You need high quality transcription with reliable speaker identification, precise timestamps, and all that at an affordable price. Not so simple after all.</p> <p class="wp-block-paragraph">In this article, we take you behind the scenes of our journey to build a scalable and production-ready transcription pipeline using Google’s Vertex AI and Gemini models. From unexpected model limitations to budget evaluation and timestamp drift disasters, we’ll walk you through the real challenges, and how we solved them.</p> <p class="wp-block-paragraph">Whether you are building your own <a href="https://towardsdatascience.com/tag/audio-processing/" title="Audio Processing">Audio Processing</a> tool or just curious about what happens “under the hood” of a robust transcription system using a multimodal model, you will find practical insights, clever workarounds, and lessons learned that should be worth your time.</p> <h2 class="wp-block-heading">Context of the project and constraints</h2> <p class="wp-block-paragraph">At the beginning of 2025, we started an interview transcription project with a clear goal: to build a system capable of transcribing interviews in French, typically involving a journalist and a guest, but not restricted to this situation, and lasting from a few minutes to over an hour. The final output was expected to be just a raw transcript but had to reflect the natural spoken dialogue written in a &#8220;book-like&#8221; dialogue, ensuring both a faithful transcription of the original audio content and a good readability.</p> <p class="wp-block-paragraph">Before diving into development, we conducted a short market review of existing solutions, but the outcomes were never satisfactory: the quality was often disappointing, the pricing definitely too high for an intensive usage, and in most cases, both at once. At that point, we realized a custom pipeline would be necessary.</p> <p class="wp-block-paragraph">Because our organization is engaged in the Google ecosystem, we were required to use Google Vertex AI services. Google Vertex AI offers a variety of Speech-to-Text (S2T) models for audio transcription, including specialized ones such as “Chirp,” “Latestlong,” or “Phone call,” whose names already hint at their intended use cases. However, producing a complete transcription of an interview that combines high accuracy, speaker diarization, and precise timestamping, especially for long recordings, remains a real technical and operational challenge.</p> <h2 class="wp-block-heading">First attempts and limitations</h2> <p class="wp-block-paragraph">We initiated our project by evaluating all those models on our use case. However, after extensive testing, we came quickly to the following conclusion: no Vertex AI service fully meets the complete set of requirements and will allow us to achieve our goal in a simple and effective manner. 
There was always at least one missing specification, usually on timestamping or diarization.<br></p> <p class="wp-block-paragraph">The terrible Google documentation, this must be said, cost us a significant amount of time during this preliminary research. This prompted us to ask Google for a meeting with a Google Cloud Machine Learning Specialist to try and find a solution to our problem. After a quick video call, our discussion with the Google rep quickly confirmed our conclusions: what we aimed to achieve was not as simple as it seemed at first. The entire set of requirements could not be fulfilled by a single Google service and a custom implementation of a VertexAI S2T service had to be developed.</p> <p class="wp-block-paragraph">We presented our preliminary work and decided to continue exploring two strategies:</p> <ul class="wp-block-list"> <li class="wp-block-list-item">Use Chirp2 to generate the transcription and timestamping of long audio files, then use <a href="https://towardsdatascience.com/tag/gemini/" title="Gemini">Gemini</a> for diarization.</li> <li class="wp-block-list-item">Use Gemini 2.0 Flash for transcription and diarization, although the timestamping is approximate and the token output length requires looping.</li> </ul> <p class="wp-block-paragraph">In parallel of these investigations, we also had to consider the financial aspect. The tool would be used for hundreds of hours of transcription per month. Unlike text, which is generally cheap enough not to have to think about it, audio can be quite costly. We therefore included this parameter from the beginning of our exploration to avoid ending up with a solution that worked but was too expensive to be exploited in production.</p> <h2 class="wp-block-heading">Deep dive into transcription with Chirp2</h2> <p class="wp-block-paragraph">We began with a deeper investigation of the Chirp2 model since it is considered as the “best in class” Google S2T service. A straightforward application of the documentation provided the expected result. The model turned out to be quite effective, offering good transcription with word-by-word timestamping according to the following output in json format:</p> <pre class="wp-block-prismatic-blocks"><code class="language-json">&quot;transcript&quot;:&quot;Oui, en effet&quot;, &quot;confidence&quot;:0.7891818284988403 &quot;words&quot;:[ { &quot;word&quot;:&quot;Oui&quot;, &quot;start-offset&quot;:{ &quot;seconds&quot;:3.68 }, &quot;end-offset&quot;:{ &quot;seconds&quot;:3.84 }, &quot;confidence&quot;:0.5692862272262573 } { &quot;word&quot;:&quot;en&quot;, &quot;start-offset&quot;:{ &quot;seconds&quot;:3.84 }, &quot;end-offset&quot;:{ &quot;seconds&quot;:4.0 }, &quot;confidence&quot;:0.758037805557251 }, { &quot;word&quot;:&quot;effet&quot;, &quot;start-offset&quot;:{ &quot;seconds&quot;:4.0 }, &quot;end-offset&quot;:{ &quot;seconds&quot;:4.64 }, &quot;confidence&quot;:0.8176857233047485 }, ]</code></pre> <p class="wp-block-paragraph">However, a new requirement came along the project added by the operational team: the transcription must be as faithful as possible to the original audio content and include small filler words, interjections, onomatopoeia or even mumbling that can add meaning to a conversation, and typically come from the non-speaking participant either at the same time or toward the end of a sentence of the speaking one. 
We&#8217;re talking about words like &#8220;oui oui,&#8221; &#8220;en effet” but also simple expressions like (hmm, ah, etc.), so typical of the French language! It&#8217;s actually not uncommon to validate or, more rarely, oppose someone point with a simple &#8220;Hmm Hmm&#8221;. Upon analyzing Chirp with transcription, we noticed that while some of these small words were present, a number of those expressions were missing. First downside for Chirp2.</p> <p class="wp-block-paragraph">The main challenge in this approach lies in the reconstruction of the speaker sentences while performing diarization. We quickly abandoned the idea of giving Gemini the context of the interview and the transcription text, and asking it to determine who said what. This method could easily result in incorrect diarization. We instead explored sending the interview context, the audio file, and the transcription content in a compact format, instructing Gemini to only perform diarization and sentence reconstruction without re-transcribing the audio file. We requested a TSV format, an ideal structured format for transcription: &#8220;human readable&#8221; for fast quality checking, easy to process algorithmically, and lightweight. Its structure is as follows:</p> <p class="wp-block-paragraph">First line with speaker presentation:</p> <p class="wp-block-paragraph"><em>Diarization Speaker_1:speaker_name\Speaker_2:speaker_name\Speaker_3:speaker_name\Speaker_4:speaker_name, etc.</em></p> <p class="wp-block-paragraph">Then the transcription in the following format:&nbsp;</p> <p class="wp-block-paragraph"><em>speaker_id\ttime_start\ttime_stop\text</em><em> with:</em></p> <ul class="wp-block-list"> <li class="wp-block-list-item"><strong><em>speaker:</em></strong><em> Numeric speaker ID (e.g., 1, 2, etc.)</em></li> <li class="wp-block-list-item"><strong><em>time_start:</em></strong><em> Segment start time in the format 00:00:00</em></li> <li class="wp-block-list-item"><strong><em>time_stop:</em></strong><em> Segment end time in the format 00:00:00</em></li> <li class="wp-block-list-item"><strong><em>text:</em></strong><em> Transcribed text of the dialogue segment</em></li> </ul> <p class="wp-block-paragraph">An example output:&nbsp;</p> <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"> <p class="wp-block-paragraph"><em>Diarization Speaker_1:Lea Finch\Speaker_2:David Albec&nbsp;</em></p> <p class="wp-block-paragraph"><em>1</em><em> </em><em>00:00:00</em><em> </em><em>00:03:00</em><em> </em><em>Hi Andrew, how are you?&nbsp;</em></p> <p class="wp-block-paragraph"><em>2</em><em> </em><em>00:03:00</em><em> </em><em>00:03:00</em><em> </em><em>Fine thanks.&nbsp;</em></p> <p class="wp-block-paragraph"><em>1</em><em> </em><em>00:04:00</em><em> </em><em>00:07:00</em><em> </em><em>So, let’s start the interview&nbsp;</em></p> <p class="wp-block-paragraph"><em>2</em><em> </em><em>00:07:00</em><em> </em><em>00:08:00</em><em> </em><em>All right.</em></p> <p class="wp-block-paragraph">A simple version of the context provided to the LLM:</p> <p class="wp-block-paragraph"><em>Here is the interview of David Albec, professional football player, by journalist Lea Finch</em></p> </blockquote> <p class="wp-block-paragraph">The result was fairly qualitative with what appeared to be accurate diarization and sentence reconstruction. However, instead of getting the exact same text, it seemed slightly modified in several places. 
Our conclusion was that, despite our clear instructions, Gemini probably carries out more than just diarization and actually performed partial transcription.</p> <p class="wp-block-paragraph">We also evaluated at this point the cost of transcription with this methodology. Below is the approximate calculation based only on audio processing:&nbsp;</p> <p class="wp-block-paragraph">Chirp2 price /min: 0.016 usd</p> <p class="wp-block-paragraph">Gemini 2.0 flash /min: 0,001875 usd</p> <p class="wp-block-paragraph">Price /hour: 1,0725 usd</p> <p class="wp-block-paragraph">Chirp2 is indeed quite &#8220;expensive&#8221;, about ten times more than Gemini 2.0 flash at the time of writing, and still requires the audio to be processed by Gemini for diarization. We therefore decided to put this method aside for now and explore a way using the brand new multimodal Gemini 2.0 Flash alone, which had just left experimental mode.</p> <h2 class="wp-block-heading">Next: exploring audio transcription with Gemini flash 2.0</h2> <p class="wp-block-paragraph">We provided Gemini with both the interview context and the audio file requesting a structured output in a consistent format. By carefully crafting our prompt with standard LLM guidelines, we were able to specify our transcription requirements with a high degree of precision. In addition with the typical elements any prompt engineer might include, we emphasized several key instructions essential for ensuring a quality transcription (<em>comment in italic)</em>:</p> <ul class="wp-block-list"> <li class="wp-block-list-item">Transcribe interjections and onomatopoeia even when mid-sentence.</li> <li class="wp-block-list-item">Preserve the full expression of words, including slang, insults, or inappropriate language. =&gt; <em>the model tends to change words it considers inappropriate. For this specific point, we had to require Google to deactivate the safety rules on our Google Cloud Project</em>.</li> <li class="wp-block-list-item">Build complete sentences, paying particular attention to changes in speaker mid-sentence, for example when one speaker finishes another&#8217;s sentence or interrupts. =&gt; <em>Such errors affect diarization and accumulate throughout the transcript until context is strong enough for the LLM to correct</em>.</li> <li class="wp-block-list-item">Normalize prolonged words or interjections like &#8220;euuuuuh&#8221; to &#8220;euh.&#8221; and not “euh euh euh euh euh …” =&gt;<em> this was a classical bug we were encountering called “repetition bug” and is discussed in more detail below</em></li> <li class="wp-block-list-item">Identify speakers by voice tone while using context to determine who is the journalist and who is the interviewee. =&gt; <em>in addition we can pass the information of the first speaker in the prompt</em></li> </ul> <p class="wp-block-paragraph">Initial results were actually quite satisfying in terms of transcription, diarization, and sentence construction. Transcribing short test files made us feel like the project was nearly complete&#8230; until we tried longer files.&nbsp;</p> <h2 class="wp-block-heading">Dealing with Long Audio and LLM Token Limitations</h2> <p class="wp-block-paragraph">Our early tests on short audio clips were encouraging but scaling the process to longer audios quickly revealed new challenges: what initially seemed like a simple extension of our pipeline turned out to be a technical hurdle in itself. 
Processing files longer than just a few minutes revealed indeed a series of challenges related to model constraints, token limits, and output reliability:</p> <ol class="wp-block-list"> <li class="wp-block-list-item">One of the first problems we encountered with long audio was the token limit: the number of output tokens exceeded the maximum allowed (MAX_INPUT_TOKEN = 8192) forcing us to implement a looping mechanism by repeatedly calling Gemini while resending the previously generated transcript, the initial prompt, a continuation prompt, and the same audio file.</li> </ol> <p class="wp-block-paragraph">Here is an example of the continuation prompt we used:&nbsp;</p> <p class="wp-block-paragraph"><em>Continue transcribing audio interview from the previous result. Start processing the audio file from the previous generated text. Do not start from the beginning of the audio. Be careful to continue the previously generated content which is available between the following tags &lt;previous_result&gt;.</em></p> <ol start="2" class="wp-block-list"> <li class="wp-block-list-item">Using this transcription loop with large data inputs seems to significantly degrade the LLM output quality, especially for timestamping. In this configuration, timestamps can drift by over 10 minutes on an hour-long interview. If a few seconds drift was considered compatible with our intended use, a few minutes made timestamping useless.&nbsp;</li> </ol> <p class="wp-block-paragraph">Our initial test on short audios of a few minutes resulted in a maximum 5 to 10 seconds drift, and significant drift was observed generally after the first loop when max input token was reached. We conclude from these experimental observations, that while this looping technique ensures continuity in transcription fairly well, it not only leads to cumulative timestamp errors but also to a drastic loss of LLM timestamps accuracy.&nbsp;</p> <ol start="3" class="wp-block-list"> <li class="wp-block-list-item">We also encountered a recurring and particularly frustrating bug: the model would sometimes fall into a loop, repeating the same word or phrase over dozens of lines. This behavior made entire portions of the transcript unusable and often looked something like this:</li> </ol> <p class="wp-block-paragraph"><em>1 00:00:00 00:03:00 Hi Andrew, how are you?&nbsp;</em></p> <p class="wp-block-paragraph"><em>2 00:03:00 00:03:00 Fine thanks.</em></p> <p class="wp-block-paragraph"><em>2 00:03:00 00:03:00 Fine thanks</em></p> <p class="wp-block-paragraph"><em>2 00:03:00 00:03:00 Fine thanks</em></p> <p class="wp-block-paragraph"><em>2 00:03:00 00:03:00 Fine thanks.&nbsp;</em></p> <p class="wp-block-paragraph"><em>2 00:03:00 00:03:00 Fine thanks</em></p> <p class="wp-block-paragraph"><em>2 00:03:00 00:03:00 Fine thanks.&nbsp;</em></p> <p class="wp-block-paragraph"><em>etc.</em></p> <p class="wp-block-paragraph">This bug seems erratic but appears more frequently with medium-quality audio with strong background noise, far away speaker for example. And &#8220;on the field&#8221;, this is often the case.. Likewise, speaker hesitations or word repetitions seem to trigger it. We still don&#8217;t know exactly what causes this “repetition bug”. Google Vertex team is aware of it but hasn’t provided a clear explanation.&nbsp;</p> <p class="wp-block-paragraph">The consequences of this bug were especially limiting: once it occurred, the only viable solution was to restart the transcription from scratch. 
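</p> <p class="wp-block-paragraph">For illustration, a very simple guard against this failure mode could scan the output for long runs of identical transcript lines and flag the chunk for a retry. The snippet below is a hypothetical sketch of such a check, not necessarily the mechanism described later in this article.</p> <pre class="wp-block-prismatic-blocks"><code class="language-python">
# Hypothetical guard: spot the "repetition bug" by looking for runs of identical
# transcript lines, so the affected chunk can be re-transcribed. Threshold is arbitrary.
def has_repetition_bug(transcript_tsv, max_run=5):
    run, previous = 1, None
    for line in transcript_tsv.strip().splitlines():
        text = line.split("\t")[-1].strip().lower()  # compare only the spoken text
        run = run + 1 if text == previous else 1
        if run == max_run:
            return True
        previous = text
    return False

sample = "\n".join(["2\t00:03:00\t00:03:00\tFine thanks."] * 6)
print(has_repetition_bug(sample))  # True -&gt; restart this chunk's transcription
</code></pre>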
<p class="wp-block-paragraph">Unsurprisingly, the longer the audio file, the higher the probability of encountering the issue. In our tests, it affected roughly one out of every three runs on recordings longer than an hour, making it extremely difficult to deliver a reliable, production-quality service under such conditions.</p> <ol start="4" class="wp-block-list"> <li class="wp-block-list-item">To make it worse, resuming transcription after a Max_token &#8220;cutoff&#8221; was reached required resending the entire audio file each time. Although we only needed the next segment, the LLM would still process the full file again (without outputting the transcription), meaning we were billed for the full audio length on every resend.</li> </ol> <p class="wp-block-paragraph">In practice, we found that the token limit was typically reached between the 15th and 20th minute of the audio. As a result, transcribing a one-hour interview often required&nbsp;4 to 5 separate LLM calls, leading to a total billing equivalent of 4 to 5 hours of audio for a single file.&nbsp;</p> <p class="wp-block-paragraph">With this process, the cost of audio transcription does not scale linearly. While a 15-minute audio file is billed as 15 minutes in a single LLM call, a 1-hour file could effectively cost 4 hours, and a 2-hour file could increase to 16 hours, following a near-quadratic pattern (≈ 4x², where x = number of hours, since each of the roughly 4x calls re-processes the full x hours of audio).<br>This made long audio processing not just unreliable, but also expensive.</p> <h2 class="wp-block-heading">Pivoting to Chunked Audio Transcription</h2> <p class="wp-block-paragraph">Given these major limitations, and being much more confident in the ability of the LLM to handle text-based tasks than audio, we decided to shift our approach and isolate the audio transcription step to keep its quality high. A quality transcription is, after all, the key requirement, so it makes sense to put this part of the process at the core of the strategy.</p> <p class="wp-block-paragraph">At this point, splitting audio into chunks became the ideal solution. Not only did it seem likely to greatly improve timestamp accuracy, by avoiding the timestamping degradation and cumulative drift caused by looping, but it would also reduce cost, since each chunk would ideally be processed only once. While it introduced new uncertainties around merging partial transcriptions, the tradeoff seemed to our advantage.</p> <p class="wp-block-paragraph">We thus focused on breaking long audio into shorter chunks that could each be handled in a single LLM transcription request. During our tests, we observed that issues like repetition loops or timestamp drift typically began around the 18-minute mark in most interviews. It became clear that we should use 15-minute (or shorter) chunks for safety. Why not use 5-minute chunks? The quality improvement looked minimal to us while tripling the number of segments. In addition, shorter chunks reduce the overall context, which could hurt diarization.</p> <p class="wp-block-paragraph">Although this setup drastically reduced the repetition bug, we observed that it still occurred occasionally. Wanting to provide the best service possible, we looked for an efficient countermeasure and identified an opportunity in our previously annoying max_input_token: with 10-minute chunks, we could be confident that the token limit would not be exceeded in nearly all cases. Thus, if the token limit was hit, we knew for sure that the repetition bug had occurred and could restart that chunk&#8217;s transcription.</p>
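<p class="wp-block-paragraph">A minimal sketch of what this chunked loop with its token-limit guard might look like is shown below. The helper <code>transcribe_chunk</code> is a placeholder for the actual Gemini call on one audio chunk (stubbed out here so the example is self-contained), and the retry logic is deliberately simplified.</p> <pre class="wp-block-prismatic-blocks"><code class="language-python">
# Sketch of the chunked transcription loop with overlapping windows and a
# "token limit hit =&gt; repetition bug =&gt; retry this chunk" guard.
CHUNK_SEC, OVERLAP_SEC, MAX_RETRIES = 10 * 60, 30, 3

def chunk_windows(duration_sec, chunk_sec=CHUNK_SEC, overlap_sec=OVERLAP_SEC):
    """Return (start, end) windows in seconds: 0-10 min, 9m30s-19m30s, 19m-29m, ..."""
    starts = range(0, duration_sec, chunk_sec - overlap_sec)
    return [(s, min(s + chunk_sec, duration_sec)) for s in starts]

def transcribe_chunk(audio_path, start, end):
    """Placeholder for the real Gemini / Vertex AI call on audio_path between start and end.
    Assumed to return (tsv_text, hit_token_limit)."""
    return f"1\t00:00:00\t00:00:05\t[transcript of {start}-{end}s of {audio_path}]", False

def transcribe_audio(audio_path, duration_sec):
    transcripts = []
    for start, end in chunk_windows(duration_sec):
        for attempt in range(MAX_RETRIES):
            text, hit_token_limit = transcribe_chunk(audio_path, start, end)
            if not hit_token_limit:  # normal case for a 10-minute chunk
                transcripts.append((start, end, text))
                break
            # Hitting the limit on such a short chunk almost certainly means the
            # repetition bug: discard this output and retry the same chunk.
        else:
            raise RuntimeError(f"chunk {start}-{end}s failed {MAX_RETRIES} times")
    return transcripts

print(transcribe_audio("interview.mp3", duration_sec=60 * 60)[:2])
</code></pre>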
<h2 class="wp-block-heading">Correcting the audio chunk transcriptions</h2> <p class="wp-block-paragraph">With good transcripts of the 10-minute audio chunks in hand, we implemented an algorithmic post-processing step for each transcript to address minor issues:</p> <ul class="wp-block-list"> <li class="wp-block-list-item">Removal of header tags like tsv or json added at the start and end of the transcription content:&nbsp;</li> </ul> <p class="wp-block-paragraph">Despite optimizing the prompt, we couldn’t fully eliminate this side effect without hurting transcription quality. Since it is easily handled algorithmically, we chose to do so.</p> <ul class="wp-block-list"> <li class="wp-block-list-item">Replacing speaker IDs with names:</li> </ul> <p class="wp-block-paragraph">Speaker identification by name only begins once the LLM has enough context to determine who is the journalist and who is being interviewed. This results in incomplete diarization at the beginning of the transcript, with early segments using numeric IDs (first speaker in the chunk = 1, etc.). Moreover, since each chunk may assign IDs in a different order (the first person to talk being speaker 1), this would create confusion during merging. We therefore instructed the LLM to use only IDs during the transcription step and to provide a diarization mapping in the first line. The speaker IDs are then replaced during the algorithmic correction and the diarization header line is removed.</p> <ul class="wp-block-list"> <li class="wp-block-list-item">Rarely, malformed or empty transcript lines are encountered. These lines are deleted, but we flag them with a note to the user, &#8220;formatting issue on this line&#8221;, so users are at least aware of a potential content loss and can correct it manually if needed. In our final optimized version, such lines were extremely rare.</li> </ul>
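<p class="wp-block-paragraph">A minimal sketch of this post-processing step is shown below. The exact line format (tab-separated index, start, end, speaker and text, with the diarization mapping already parsed from the first line) is an assumption made for illustration; the real pipeline uses whatever format the transcription prompt requests.</p> <pre class="wp-block-prismatic-blocks"><code class="language-python">def clean_chunk_transcript(raw_output, speaker_names):
    # speaker_names: mapping parsed from the diarization line the LLM returns first,
    # e.g. {"1": "Journalist", "2": "Andrew"} (assumed format for this sketch);
    # that first mapping line is assumed to have been removed beforehand.
    cleaned = []
    for line in raw_output.splitlines():
        line = line.strip()
        if not line or line.startswith("```"):
            continue  # drop ```tsv / ```json header and footer tags
        fields = line.split("\t")
        if len(fields) != 5:
            # malformed line: drop it but leave a note for the user
            cleaned.append("[formatting issue on this line]")
            continue
        idx, start, end, speaker, text = fields
        speaker = speaker_names.get(speaker, speaker)  # replace numeric ID with name
        cleaned.append("\t".join([idx, start, end, speaker, text]))
    return "\n".join(cleaned)</code></pre>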
<h2 class="wp-block-heading">Merging chunks and maintaining content continuity</h2> <p class="wp-block-paragraph">At the audio chunking stage, we initially tried to make chunks with clean cuts. Unsurprisingly, this led to the loss of words or even full sentences at the cut points. So we naturally switched to overlapping chunk cuts to avoid such content loss, leaving the optimization of the overlap size to the chunk merging process.&nbsp;</p> <p class="wp-block-paragraph">Without a clean cut between chunks, the option of merging the chunks algorithmically disappeared. For the same audio input, the transcript lines can come out quite differently, with breaks at different points in the sentences, or with filler words and hesitations rendered differently. In such a situation, it is complex, not to say impossible, to build an effective algorithm for a clean merge.&nbsp;</p> <p class="wp-block-paragraph">This left us with the LLM option, of course. A few quick tests confirmed that the LLM merges segments better when the overlaps contain full sentences. A 30-second overlap proved sufficient. With a 10-minute chunk structure, this implies the following chunk cuts:&nbsp;</p> <ul class="wp-block-list"> <li class="wp-block-list-item">1st transcript: 0 to 10 minutes&nbsp;</li> <li class="wp-block-list-item">2nd transcript: 9m30s to 19m30s&nbsp;</li> <li class="wp-block-list-item">3rd transcript: 19m to 29m &#8230;and so on.</li> </ul> <figure class="wp-block-image size-full"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/overlap.png" alt="" class="wp-image-602341"/><figcaption class="wp-element-caption">Image by the authors</figcaption></figure>
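<p class="wp-block-paragraph">The chunking itself is straightforward to implement. Here is a minimal sketch, assuming pydub and MP3 files (both assumptions made for illustration, not a description of our exact implementation):</p> <pre class="wp-block-prismatic-blocks"><code class="language-python">from pydub import AudioSegment

CHUNK_MS = 10 * 60 * 1000        # 10-minute chunks
OVERLAP_MS = 30 * 1000           # 30-second overlap between consecutive chunks
STEP_MS = CHUNK_MS - OVERLAP_MS  # each chunk starts 9m30s after the previous one

def chunk_audio(path):
    audio = AudioSegment.from_file(path)
    chunk_paths = []
    start = 0
    while start &lt; len(audio):
        chunk = audio[start:start + CHUNK_MS]  # pydub slices by milliseconds
        out_path = f"{path}.chunk{len(chunk_paths):02d}.mp3"
        chunk.export(out_path, format="mp3")
        chunk_paths.append(out_path)
        start += STEP_MS
    return chunk_paths</code></pre>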
<p class="wp-block-paragraph">These overlapping chunk transcripts were corrected by the algorithm described above and then sent to the LLM for merging, in order to reconstruct the full audio transcript. The idea was to send the full set of chunk transcripts with a prompt instructing the LLM to merge them and return the complete merged transcript in the same tsv format as the transcription step. In this configuration, the merging process has three main quality criteria:</p> <ol class="wp-block-list"> <li class="wp-block-list-item">Ensure transcription continuity without content loss or duplication.</li> <li class="wp-block-list-item">Adjust timestamps to resume from where the previous chunk ended.</li> <li class="wp-block-list-item">Preserve diarization.</li> </ol> <p class="wp-block-paragraph">As expected, max_input_token was exceeded, forcing us into an LLM call loop. However, since we were now using text input, we were more confident in the reliability of the LLM… probably too much. The result of the merge was satisfactory in most cases but prone to several issues: tag insertions, multi-line entries merged into one line, incomplete lines, and even hallucinated continuations of the interview. Despite many prompt optimizations, we couldn’t achieve sufficiently reliable results for production use.&nbsp;</p> <p class="wp-block-paragraph">As with audio transcription, we identified the amount of input information as the main issue. We were sending several hundred or even thousands of text lines: the set of partial transcripts to fuse, a roughly similar amount for the previously merged transcript, plus the prompt and its example. Definitely too much for a precise application of our set of instructions.</p> <p class="wp-block-paragraph">On the plus side, timestamp accuracy did improve significantly with this chunking approach: we maintained a drift of just 5 to 10 seconds at most on transcriptions of over an hour. Since the start of a transcript should have minimal drift in timestamping, we instructed the LLM to use the timestamps of the “ending chunk” as the reference for the fusion and to correct any drift by a second per sentence. This made the cut points seamless and kept the overall timestamp accuracy.</p> <h2 class="wp-block-heading">Splitting the chunk transcripts for full transcript reconstruction</h2> <p class="wp-block-paragraph">In a modular approach similar to the workaround we used for transcription, we decided to carry out the merges individually, in order to avoid the issues described above. To do so, each 10-minute transcript is split into three parts based on the start_time of its segments:</p> <ul class="wp-block-list"> <li class="wp-block-list-item">Overlap segment to merge at the beginning: 0 to 1 minute</li> <li class="wp-block-list-item">Main segment to paste: 1 to 9 minutes</li> <li class="wp-block-list-item">Overlap segment to merge at the end: 9 to 10 minutes</li> </ul> <p class="wp-block-paragraph"><em>NB: Since each chunk, including the first and last ones, is processed the same way, the overlap at the beginning of the first chunk is directly merged with its main segment, and the overlap at the end of the last chunk (if there is one) is merged accordingly.</em></p> <p class="wp-block-paragraph">The beginning and end segments are then sent in pairs to be merged. As expected, the quality of the output increased drastically, resulting in an efficient and reliable merge between the transcript chunks. With this procedure, the LLM’s responses proved highly reliable and showed none of the errors previously encountered during the looping process.</p> <p class="wp-block-paragraph"><em>The process of transcript assembly for an audio of 28 minutes 42 seconds:&nbsp;</em></p> <figure class="wp-block-image size-full"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/28min-file.png" alt="" class="wp-image-602342"/><figcaption class="wp-element-caption">Image by the authors</figcaption></figure> <h2 class="wp-block-heading">Full transcript reconstruction</h2> <p class="wp-block-paragraph">At this final stage, the only remaining task was to reconstruct the complete transcript from the processed splits. To achieve this, we algorithmically combined the main content segments with their corresponding merged overlaps, alternating between the two.</p>
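<p class="wp-block-paragraph">The split-and-reassemble logic can be sketched as follows. The segment representation (dictionaries with a start time in seconds, relative to the chunk) and the <em>merge_with_llm</em> helper, which sends a pair of overlapping splits to Gemini and returns the fused segments, are assumptions made for illustration; timestamp re-alignment is left out for brevity.</p> <pre class="wp-block-prismatic-blocks"><code class="language-python">CHUNK_SEC = 10 * 60   # length of one chunk transcript
EDGE_SEC = 60         # 1-minute overlap splits kept at each end

def split_chunk(segments):
    start = [s for s in segments if s["start"] &lt; EDGE_SEC]
    main = [s for s in segments if EDGE_SEC &lt;= s["start"] &lt; CHUNK_SEC - EDGE_SEC]
    end = [s for s in segments if s["start"] &gt;= CHUNK_SEC - EDGE_SEC]
    return start, main, end

def rebuild_full_transcript(chunk_transcripts):
    # chunk_transcripts: one list of corrected segments per 10-minute chunk
    splits = [split_chunk(c) for c in chunk_transcripts]
    # first chunk: its beginning overlap is pasted directly with its main segment
    full = splits[0][0] + splits[0][1]
    for prev, cur in zip(splits, splits[1:]):
        full += merge_with_llm(prev[2], cur[0])  # fuse end of previous with start of next (LLM step)
        full += cur[1]                           # then paste the next main segment
    full += splits[-1][2]                        # trailing overlap of the last chunk, if any
    return full</code></pre>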
<h2 class="wp-block-heading">Overall process overview</h2> <p class="wp-block-paragraph">The overall process involves 6 steps, 2 of which are carried out by Gemini:</p> <ol class="wp-block-list"> <li class="wp-block-list-item">Chunking the audio into overlapping audio chunks</li> <li class="wp-block-list-item">Transcribing each chunk into a partial text transcript (LLM step)</li> <li class="wp-block-list-item">Correcting the partial transcripts</li> <li class="wp-block-list-item">Splitting the chunk transcripts into start, main, and end text splits</li> <li class="wp-block-list-item">Fusing the end and start splits of each pair of consecutive chunks (LLM step)</li> <li class="wp-block-list-item">Reconstructing the full transcript</li> </ol> <figure class="wp-block-image size-full"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/overall-process.png" alt="" class="wp-image-602343"/><figcaption class="wp-element-caption">Image by the authors</figcaption></figure> <p class="wp-block-paragraph">The overall process takes about 5 minutes per hour of transcription delivered to the user in an asynchronous tool. Quite reasonable considering the amount of work executed behind the scenes, and all for a fraction of the price of other tools or pre-built Google models like Chirp2.</p> <p class="wp-block-paragraph">One additional improvement that we considered, but ultimately decided not to implement, was timestamp correction. We observed that timestamps at the end of each chunk typically ran about five seconds ahead of the actual audio. A straightforward solution would have been to incrementally adjust the timestamps algorithmically, by approximately one second every two minutes, to correct most of this drift. However, we chose not to implement this adjustment, as the minor discrepancy was acceptable for our business needs.</p> <h2 class="wp-block-heading">Conclusion</h2> <p class="wp-block-paragraph">Building a high-quality, scalable transcription pipeline for long interviews turned out to be much more complex than simply choosing the “right” Speech-to-Text model. Our journey with Google’s Vertex AI and Gemini models highlighted key challenges around diarization, timestamping, cost-efficiency, and long audio handling, especially when aiming to extract the full information from an audio file.</p> <p class="wp-block-paragraph">Using careful prompt engineering, smart audio chunking strategies, and iterative refinements, we were able to build a robust system that balances accuracy, performance, and operational cost, turning an initially fragmented process into a smooth, production-ready pipeline.</p> <p class="wp-block-paragraph">There’s still room for improvement, but this workflow now forms a solid foundation for scalable, high-fidelity audio transcription. As LLMs continue to evolve and APIs become more flexible, we’re optimistic about even more streamlined solutions in the near future.</p> <h2 class="wp-block-heading">Key takeaways</h2> <ul class="wp-block-list"> <li class="wp-block-list-item"><strong>No Vertex AI S2T model met all our needs: </strong>Google Vertex AI provides specialized models, but each one has limitations in terms of transcription accuracy, diarization, or timestamping for long audio.</li> <li class="wp-block-list-item"><strong>Token limits and long prompts drastically influence transcription quality: </strong>Gemini’s output token limit significantly degrades transcription quality for long audio, requiring heavily prompted looping strategies and ultimately forcing us to shift to shorter audio chunks.</li> <li class="wp-block-list-item"><strong>Chunked audio transcription and transcript reconstruction significantly improve quality and cost-efficiency:<br></strong>Splitting audio into 10-minute overlapping segments minimized critical bugs like repeated sentences and timestamp drift, enabling higher-quality results at a drastically reduced cost.</li> <li class="wp-block-list-item"><strong>Careful prompt engineering remains essential:</strong> Precision in the prompts, especially regarding diarization and interjections for transcription, as well as for transcript fusion, proved crucial for reliable LLM performance.</li> <li class="wp-block-list-item"><strong>Short transcript fusions maximize reliability:<br></strong>Splitting each chunk transcript into smaller segments, and merging overlaps end to start, provided high accuracy and avoided common LLM issues like hallucinations or incorrect formatting.<br></li> </ul> <p>The post <a href="https://towardsdatascience.com/building-a-scalable-and-accurate-audio-interview-transcription-pipeline-with-google-gemini/">Building a Scalable and Accurate Audio Interview Transcription Pipeline with Google Gemini</a> appeared first on <a href="https://towardsdatascience.com">Towards Data Science</a>.</p>
  4. How to Level Up Your Technical Skills in This AI Era

    Tue, 29 Apr 2025 19:53:46 -0000

    AI tools & vibe coding have hidden tradeoffs. Here's how to use them wisely and why open source is a great secret weapon.

    The post How to Level Up Your Technical Skills in This AI Era appeared first on Towards Data Science.

    <p class="wp-block-paragraph"><mdspan datatext="el1745953580269" class="mdspan-comment">AI-assisted</mdspan> coding is here to stay. Tools like <a href="https://towardsdatascience.com/tag/cursor/" title="Cursor">Cursor</a>, V0, and Lovable have dramatically lowered the barrier to entry — building dashboards, pipelines, or entire apps can now be done in a fraction of the time.</p> <p class="wp-block-paragraph">I use these tools daily, and they’ve definitely made me faster. But as the codebase gets more complex, the tradeoffs become clear: cryptic bugs, tangled logic, and hours lost debugging code I didn’t truly understand.</p> <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"> <p class="wp-block-paragraph"><strong><em>AI tools are great — especially for beginners — but they come with a quiet cost. The more you let AI carry the load, the fewer chances you have to sharpen your instincts that come from wrestling with complexity.&nbsp;</em></strong></p> <p class="wp-block-paragraph"><strong><em>Yes, AI will speed up your workflow, but you’ll also skip the formative steps where technical wisdom is earned.</em></strong></p> </blockquote> <p class="wp-block-paragraph">“Vibe coding” — quickly cobbling together code with minimal planning — is great for demos or experiments. But for deeper technical growth or building systems with meaningful complexity, vibe coding isn’t enough. This trending Reddit post sums it up perfectly: left unchecked, vibe coding creates more problems than it solves.</p> <figure class="wp-block-embed is-type-rich is-provider-twitter wp-block-embed-twitter" datatext="el1745953299332"><div class="wp-block-embed__wrapper"> <blockquote class="twitter-tweet" data-width="500" data-dnt="true"><p lang="en" dir="ltr">vibe coding, where 2 engineers can now create the tech debt of at least 50 engineers</p>&mdash; I Am Devloper (@iamdevloper) <a href="https://twitter.com/iamdevloper/status/1902628884278894941?ref_src=twsrc%5Etfw">March 20, 2025</a></blockquote><script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> </div></figure> <p class="wp-block-paragraph">In this piece, I’ll show you how to use AI-assisted tools more wisely — and why contributing to <a href="https://towardsdatascience.com/tag/open-source/" title="Open Source">Open Source</a> might be the most underrated way to truly level up your technical skills.</p> <h2 class="wp-block-heading">My experience vibe coding with Cursor</h2> <p class="wp-block-paragraph">Like many developers, I switched from VS Code (with GitHub Copilot) to Cursor and am currently subscribed to Cursor&#8217;s Pro plan ($20/month).</p> <p class="wp-block-paragraph">The feature I rely on most is Cursor&#8217;s integrated AI chat, which lets me directly interact with my <strong><em>entire </em></strong>codebase. Its agent can quickly grep through multiple files and even handle images - extremely useful when navigating large, unfamiliar repos. It also spots linter errors and auto-corrects them while directly editing files.</p> <p class="wp-block-paragraph">Initially, Cursor dramatically boosted my productivity, especially for simpler tasks. It felt powerful, almost magical. But as things got complex, I noticed some cracks. 
Cursor would sometimes generate spaghetti code, mix up similarly named files across directories, and occasionally struggle to follow intricate logic flows.</p> <p class="wp-block-paragraph">Vibe coding can get you thousands of lines of code in minutes — but without a strong mental model of what you&#8217;re building, you risk ending up with bloated, over-engineered systems.</p> <p class="wp-block-paragraph">Cursor does a decent job narrowing down the search space when debugging. But letting it make unchecked edits can introduce more bugs than it fixes.</p> <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"> <p class="wp-block-paragraph"><strong><em>Beyond the usual advice to &#8220;write better prompts,&#8221; one strategy I&#8217;ve found especially helpful is telling Cursor NOT to make direct edits. (It&#8217;s surprisingly obedient about this!)</em></strong></p> </blockquote> <p class="wp-block-paragraph">Instead, I explicitly ask it to suggest changes first in the chat interface. Then I review each suggestion, decide which edits make sense, and apply them selectively — either manually or through Cursor. Compared with ChatGPT, Cursor’s biggest strength is its contextual awareness of the entire codebase and its ability to parse through lengthy files (over 5k lines of code) by processing them in manageable chunks.</p> <h2 class="wp-block-heading">Contributing to open source</h2> <p class="wp-block-paragraph">So, how do you get technically stronger? Two ways stand out: side projects and open source contributions.&nbsp;</p> <p class="wp-block-paragraph">Side projects are great for exploring new tech or diving deep into something you’re passionate or curious about. Wondering how AI agents work, or curious about MCP? Just building a simple weekend project teaches you far more than hours of tutorials or documentation. Thanks to open source, tools and resources are freely accessible, leveling the playing field for everyone.</p> <p class="wp-block-paragraph">But solo projects have downsides. It’s easy to lose motivation — many of my own side projects never saw the light of day.&nbsp;</p> <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"> <p class="wp-block-paragraph"><strong><em>Plus, you can find yourself in an echo chamber: your code works, but you’re not sure if it’s following best practices or industry standards. If you’re early in your career and lack mentorship, how do you know if you’re even on the right track?</em></strong></p> </blockquote> <p class="wp-block-paragraph">This is exactly where open source fills the gap. Open source projects aren’t just for coding wizards; they’re for everyone. Your favourite libraries like Pandas, Matplotlib, TensorFlow, and Keras rely heavily on community involvement.</p> <h4 class="wp-block-heading">Why bother contributing?</h4> <p class="wp-block-paragraph">Open source lets you make a real impact on tools used by thousands of developers — not just toy projects nobody sees. You’ll become proficient with version control (hello, GitHub!), sharpen your skills navigating complex codebases, pick up best practices, and build a network of people who can vouch for you when it matters.</p> <p class="wp-block-paragraph">There are career benefits too. It’ll add to your portfolio and personal brand, and you’ll ramp up faster when joining new teams.&nbsp;</p> <p class="wp-block-paragraph">But contribute for the right reasons. <strong><em>If your only motivation is landing a job, DON’T contribute! 
</em></strong>Open source is not a ticket to get a job — it requires genuine interest and commitment. It shows you have a passion for building, and for many startups that grew out of open source projects, that’s how they find their first hires.</p> <h4 class="wp-block-heading">Picking an open source project that you care about</h4> <p class="wp-block-paragraph">Starting out can seem daunting. Many popular repos have enormous codebases, potentially outdated documentation, or hundreds of unclear issues. So how do you pick?</p> <p class="wp-block-paragraph">First up, pick a project you<strong> genuinely care about. </strong>This might sound obvious, but it’s crucial — and underrated.&nbsp;</p> <p class="wp-block-paragraph">Choose something you <strong><em>actually use, </em></strong>whether at work or in a side project. Jumping into an unfamiliar project with unfamiliar tech is simply overwhelming, and you’ll lose motivation fast.&nbsp;</p> <p class="wp-block-paragraph">Personally, I&#8217;m both a user and a big fan of PostHog — the product analytics platform built specifically for developers — so I started contributing there. Their docs were comprehensive and well-structured, which made it an awesome place to start. (And no, they didn&#8217;t pay me to say this!)</p> <h4 class="wp-block-heading">What to contribute?</h4> <p class="wp-block-paragraph">There are a <strong><em>ton</em></strong> of things you can do. Here’s an approach that I found helpful.&nbsp;</p> <ol class="wp-block-list"> <li class="wp-block-list-item">Find a feature you need or improve something you use.<br>Narrowing your contributions down to features you genuinely care about gives clarity and motivation. The best code comes from solving problems you personally face.</li> <li class="wp-block-list-item">Set up your local environment.<br>Fork the project, clone it locally, and get it running. Understand where the logs are and how to test changes. Get a grasp of the project’s basic structure and coding style.</li> <li class="wp-block-list-item">Start small and learn by doing.<br>Many repos tag beginner-friendly issues (like “good-first-issue”). Pick these to start. Understand and replicate the bug; don’t hesitate to comment if you’re stuck. When you open a PR, ensure your changes pass all linting and tests.</li> </ol> <p class="wp-block-paragraph">Learning to navigate the codebase is essential. You don’t need to read every line — that’s practically impossible. After grasping the high-level structure, dive in. Start small to get comfortable with the build, deployment, and PR review process. Write clear commit messages and PR descriptions. Check recently merged PRs to see successful examples or insightful discussions.</p> <h2 class="wp-block-heading">Wrapping up</h2> <p class="wp-block-paragraph">Contributing to open source takes patience — popular repos are huge, and learning takes time. Becoming a consistent, valuable contributor takes at least a few months, so don’t get discouraged by initial setbacks. If your PR is rejected or you get stuck on a tricky bug, that’s perfectly normal — it’s all part of the learning process.&nbsp;</p> <p class="wp-block-paragraph">If you’re new to open source and want to chat, feel free to connect. While I didn’t dive deeply into technical details here (a quick Google or ChatGPT search can guide you there), I hope this gives you the big-picture perspective to get started. 
Open source has been rewarding for me — and I hope it will be for you too.</p> <p class="wp-block-paragraph">See you in the next article <img src="https://s.w.org/images/core/emoji/15.0.3/72x72/1f642.png" alt="🙂" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p> <p>The post <a href="https://towardsdatascience.com/how-to-level-up-your-technical-skills-in-this-ai-era/">How to Level Up Your Technical Skills in This AI Era</a> appeared first on <a href="https://towardsdatascience.com">Towards Data Science</a>.</p>
  5. AI Agents for a More Sustainable World

    Tue, 29 Apr 2025 18:45:08 -0000

    How AI agents can help companies measure, optimise and accelerate their sustainability initiatives.

    The post AI Agents for a More Sustainable World appeared first on Towards Data Science.

    <p class="wp-block-paragraph"><mdspan datatext="el1745952186363" class="mdspan-comment">As political support</mdspan> for sustainability weakens, the need for long-term sustainable practices has never been more critical.</p> <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"> <p class="wp-block-paragraph">How can we use analytics, boosted by agentic AI, to support companies in their green transformation?</p> </blockquote> <p class="wp-block-paragraph">For years, the focus of my blog was always on using Supply Chain Analytics methodologies and tools to solve specific problems.</p> <figure class="wp-block-image size-full" datatext=""><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/image-160.png" alt="" class="wp-image-602472"/><figcaption class="wp-element-caption"><a href="https://towardsdatascience.com/what-is-supply-chain-analytics-42f1b2df4a2/" target="_blank" rel="noreferrer noopener">Four Types of Supply Chain Analytics</a> &#8211; (Image by Samir Saci)</figcaption></figure> <p class="wp-block-paragraph">At <a href="https://www.logi-green.com/">Logi</a><a href="https://www.logi-green.com/" target="_blank" rel="noreferrer noopener">G</a><a href="https://www.logi-green.com/">reen</a>, the startup I founded, we deploy these analytics solutions to help retailers, manufacturers, and logistics companies meet their sustainability targets.</p> <p class="wp-block-paragraph">In this article, I will demonstrate how we can supercharge these existing solutions with AI agents.</p> <p class="wp-block-paragraph">The objective is to make it easier and faster for companies to implement <a href="https://towardsdatascience.com/tag/sustainability/" title="Sustainability">Sustainability</a> initiatives across their supply chains.</p> <h2 class="wp-block-heading">Obstacles for Green Transformations of Companies</h2> <p class="wp-block-paragraph">As political and financial pressures shift focus away from sustainability, making the green transformation easier and more accessible has never been more urgent.</p> <p class="wp-block-paragraph">Last week, I attended the global <strong>ChangeNOW </strong>conference, held in my hometown, Paris.</p> <figure class="wp-block-image size-large"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/image-159-1024x576.png" alt="" class="wp-image-602471"/><figcaption class="wp-element-caption">ChangeNOW in Grand Palais of Paris &#8211; (Image by Samir Saci)</figcaption></figure> <p class="wp-block-paragraph">This conference brought together innovators, entrepreneurs and decision-makers committed to building a better future, despite the challenging context.<br><br>It was an excellent opportunity to meet some of my readers and connect with leaders driving change across industries.</p> <p class="wp-block-paragraph">Through these discussions, one clear message emerged.</p> <p class="wp-block-paragraph">Companies face three main obstacles when driving sustainable transformation:</p> <ul class="wp-block-list"> <li class="wp-block-list-item">A lack of visibility on operational processes,</li> <li class="wp-block-list-item">The complexity of sustainability reporting requirements,</li> <li class="wp-block-list-item">The challenge of designing and implementing initiatives across the value chain.</li> </ul> <figure class="wp-block-image size-full"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/image-161.png" alt="" 
class="wp-image-602473"/><figcaption class="wp-element-caption">Examples of Challenges Faced by Companies &#8211; (Image by Samir Saci)</figcaption></figure> <p class="wp-block-paragraph">In the following sections, I will explore how we can leverage <strong>Agentic AI</strong> to overcome two of these major obstacles: </p> <ul class="wp-block-list"> <li class="wp-block-list-item">Improving reporting to respect the regulations</li> <li class="wp-block-list-item">Accelerating the design and execution of sustainable initiatives</li> </ul> <h2 class="wp-block-heading">Solving Reporting Challenges with AI Agents</h2> <p class="wp-block-paragraph">The first step in any sustainable roadmap is to build the reporting foundation.</p> <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"> <p class="wp-block-paragraph">Companies must measure and publish their current environmental footprint before taking action.</p> </blockquote> <figure class="wp-block-image size-full"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/image-162.png" alt="" class="wp-image-602474"/><figcaption class="wp-element-caption"><a href="https://towardsdatascience.com/what-is-esg-reporting-d610535eed9c/">Environmental Social, and Governance Reporting </a>&#8211; (Image by Samir Saci)</figcaption></figure> <p class="wp-block-paragraph">For example, ESG reporting communicates <span style="margin: 0px; padding: 0px;">a company&#8217;s environmental performance <strong>(E)</strong>, social responsibility<strong> (S)</strong>, and governance structures&#8217;<strong>&nbsp;</strong>strength <strong>(G)</strong></span>.</p> <p class="wp-block-paragraph">Let&#8217;s start by tackling the problem of data preparation.</p> <h3 class="wp-block-heading">Issue 1: Data Collection and Processing</h3> <p class="wp-block-paragraph">However, many companies face significant challenges right from the start, beginning with <strong>data collection</strong>.</p> <figure class="wp-block-image size-full"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/image-163.png" alt="" class="wp-image-602475"/><figcaption class="wp-element-caption">Type of Information to Collect for <a href="https://towardsdatascience.com/what-is-a-life-cycle-assessment-lca-e32a5078483a/" target="_blank" rel="noreferrer noopener">Life Cycle Assessment</a> &#8211; (Image by Samir Saci)</figcaption></figure> <p class="wp-block-paragraph">In a previous article, I introduced the concept of <a href="https://towardsdatascience.com/what-is-a-life-cycle-assessment-lca-e32a5078483a/" target="_blank" rel="noreferrer noopener">Life Cycle Assessment</a> (LCA) — a method for evaluating a product&#8217;s environmental impacts from raw material extraction to disposal.</p> <p class="wp-block-paragraph">This requires a complex data pipeline to connect to multiple systems, extract raw data, process it and store it in a data warehouse.</p> <figure class="wp-block-image size-full"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/image-164.png" alt="" class="wp-image-602476"/><figcaption class="wp-element-caption"><a href="https://towardsdatascience.com/what-is-a-life-cycle-assessment-lca-e32a5078483a/">Example of Data Infrastructure for a Life Cycle Assessment </a>&#8211; (Image by Samir Saci)</figcaption></figure> <p class="wp-block-paragraph">These pipelines serve to generate reports and provide harmonised data sources for analytics and 
business teams.</p> <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"> <p class="wp-block-paragraph">How can we help non-technical teams navigate this complex landscape?</p> </blockquote> <p class="wp-block-paragraph">In <a href="https://www.logi-green.com/" target="_blank" rel="noreferrer noopener">LogiGreen</a>, we explore the usage of an AI Agent for text-to-SQL applications.</p> <figure class="wp-block-image size-large"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/image-165-1024x288.png" alt="" class="wp-image-602477"/><figcaption class="wp-element-caption">Text-to-SQL applications for Supply Chain &#8211; (Image by Samir Saci)</figcaption></figure> <p class="wp-block-paragraph">The great added value is that business and operational teams no longer rely on analytics experts to build tailored solutions.</p> <p class="wp-block-paragraph">As a Supply Chain Engineer myself, I understand the frustration of operations managers who must create support tickets just to extract data or calculate a new indicator.</p> <figure class="wp-block-image size-large"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/image-166-1024x769.png" alt="" class="wp-image-602478"/><figcaption class="wp-element-caption">Example of Interaction with an Agent &#8211; (Image by Samir Saci)</figcaption></figure> <p class="wp-block-paragraph">With this AI agent, we provide an Analytics-as-a-Service experience for all users, allowing them to formulate their demand in plain English.</p> <p class="wp-block-paragraph">For instance, we help reporting teams build specific prompts to collect data from multiple tables to feed a report.</p> <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"> <p class="wp-block-paragraph">&#8220;Please generate a table showing the sum of CO₂ emissions per day for all deliveries from warehouse XXX.&#8221;</p> </blockquote> <p class="wp-block-paragraph">For more information on how I implemented this agent, <a href="https://towardsdatascience.com/automate-supply-chain-analytics-workflows-with-ai-agents-using-n8n/" target="_blank" rel="noreferrer noopener">check this article</a> <img src="https://s.w.org/images/core/emoji/15.0.3/72x72/1f447.png" alt="👇" class="wp-smiley" style="height: 1em; max-height: 1em;" />.</p> <p class="wp-block-paragraph"><a href="https://towardsdatascience.com/automate-supply-chain-analytics-workflows-with-ai-agents-using-n8n/" target="_blank" rel="noreferrer noopener">Automate Supply Chain Analytics Workflows with AI Agents using&nbsp;n8n | Towards Data Science<img src="https://s.w.org/images/core/emoji/15.0.3/72x72/2197.png" alt="↗" class="wp-smiley" style="height: 1em; max-height: 1em;" /></a></p> <h3 class="wp-block-heading">Issue 2: Reporting Format</h3> <p class="wp-block-paragraph">Even after collecting the data, companies face another challenge: <strong>generating the report in the required formats</strong>.</p> <p class="wp-block-paragraph">In Europe, the new <strong>Corporate Sustainability Reporting Directive (CSRD)</strong> provides a framework for companies to disclose their environmental, social, and governance impacts.<br><br>Under CSRD, companies must submit structured reports in <strong>XHTML format</strong>.</p> <figure class="wp-block-image size-large"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/image-167-1024x691.png" alt="" 
class="wp-image-602481"/><figcaption class="wp-element-caption">Simple Example of an xHTML report that is not compliant &#8211; (Image by Samir Saci)</figcaption></figure> <p class="wp-block-paragraph">This document, enriched with detailed <strong>ESG taxonomies</strong>, requires a process that can be highly technical and prone to errors, especially for companies with low data maturity.</p> <figure class="wp-block-image size-large"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/image-168-1024x434.png" alt="" class="wp-image-602482"/><figcaption class="wp-element-caption">AI Agent for CSRD Report Format Audit &#8211; (Image by Samir Saci)</figcaption></figure> <p class="wp-block-paragraph">Therefore, we have experimented with using an AI Agent to automatically audit the report and provide a summary to non-technical users.</p> <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"> <p class="wp-block-paragraph">How does it work?</p> </blockquote> <p class="wp-block-paragraph"><strong>Users send their report by Email.</strong></p> <figure class="wp-block-image size-large"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/image-169-1024x228.png" alt="" class="wp-image-602483"/><figcaption class="wp-element-caption">Email with the report in attachment &#8211; (Image by Samir Saci)</figcaption></figure> <p class="wp-block-paragraph">The endpoint automatically downloads the attached file, performs an audit of the content and format, searching for errors or missing values.</p> <p class="wp-block-paragraph">The results are then sent to an AI Agent, which generates a clear summary of the audit in English.</p> <figure class="wp-block-image size-large"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/image-170-1024x268.png" alt="" class="wp-image-602484"/><figcaption class="wp-element-caption">Example of System Prompt of the AI Agent &#8211; (Image by Samir Saci)</figcaption></figure> <p class="wp-block-paragraph"><strong>The agent sends a report back to the sender</strong>.</p> <figure class="wp-block-image size-large"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/image-171-1024x637.png" alt="" class="wp-image-602485"/></figure> <p class="wp-block-paragraph">We have developed a fully automated service to audit reports created by sustainability consultants <em>(our customer is a consultancy firm)</em> that anyone can use without requiring technical skills.</p> <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"> <p class="wp-block-paragraph">Interested in implementing a similar solution?</p> </blockquote> <p class="wp-block-paragraph"><span style="margin: 0px; padding: 0px;">I built this project using the no-code platform n8n.</span></p> <p class="wp-block-paragraph"><span style="margin: 0px; padding: 0px;">You can find the ready-to-deploy template in&nbsp;</span><a href="https://n8n.io/creators/samirsaci/">my n8n creator profile.</a> </p> <p class="wp-block-paragraph">Now that we have explored solutions for reporting, we can move on to the core of green transformations: <strong>designing and implementing sustainable initiatives.</strong> </p> <h2 class="wp-block-heading">Agentic AI for Supply Chain Analytics Products</h2> <h3 class="wp-block-heading">Analytics Products for Sustainability</h3> <p class="wp-block-paragraph">My focus over the last two years has been 
on building analytics products, including web applications, APIs and automated workflows.</p> <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"> <p class="wp-block-paragraph">What is a sustainability roadmap?</p> </blockquote> <p class="wp-block-paragraph">In my previous experience, it often started with a push from top management.<br><br>For example, leadership would ask the supply chain department to measure the company&#8217;s CO₂ emissions for the baseline year of 2021.</p> <p class="wp-block-paragraph">I was responsible for estimating the <strong>Scope 3 emissions</strong> of the distribution chain.</p> <figure class="wp-block-image size-full"><a href="https://towardsdatascience.com/supply-chain-sustainability-reporting-with-python-161c1f63f267/"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/image-182.png" alt="" class="wp-image-602509"/></a><figcaption class="wp-element-caption">Suppl<a href="https://towardsdatascience.com/supply-chain-sustainability-reporting-with-python-161c1f63f267/">y Chain Sustainability Reporting</a> &#8211; (Image by Samir Saci)</figcaption></figure> <p class="wp-block-paragraph">This is why I implemented the methodology presented in the article linked above.</p> <p class="wp-block-paragraph">Once a baseline is established, a <strong>reduction target</strong> is defined with a clear deadline.</p> <p class="wp-block-paragraph">For instance, your management can commit to a 30% reduction by 2030.</p> <p class="wp-block-paragraph">The role of the supply chain department is then to design and implement initiatives that reduce CO2 emissions.</p> <figure class="wp-block-image size-full"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/image-183.png" alt="" class="wp-image-602510"/><figcaption class="wp-element-caption"><a href="https://www.logi-green.com/" target="_blank" rel="noreferrer noopener">Example of a roadmap with initiatives</a> &#8211; (Image by Samir Saci)</figcaption></figure> <p class="wp-block-paragraph">In the example above, the company reaches a 30% reduction by year N through initiatives across manufacturing, logistics, retail operations and carbon offsetting.</p> <p class="wp-block-paragraph">To support this journey, we develop analytics products that simulate the impact of different initiatives, helping teams to design optimal sustainability strategies.</p> <figure class="wp-block-image size-full"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/image-184.png" alt="" class="wp-image-602513"/><figcaption class="wp-element-caption"><a href="https://logi-green.com" target="_blank" rel="noreferrer noopener">Example of analytics products to support sustainability roadmaps </a>&#8211; (Image by Samir Saci)</figcaption></figure> <p class="wp-block-paragraph">So far, the products have been in the form of web applications with a user interface and a backend connected to their data sources.</p> <figure class="wp-block-image size-full"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/temp.gif" alt="" class="wp-image-602516"/><figcaption class="wp-element-caption"><a href="https://logi-green.com" target="_blank" rel="noreferrer noopener">Example of UI for the Supply Chain Optimisation Module</a> &#8211; (Image by Samir Saci)</figcaption></figure> <p class="wp-block-paragraph">Each module provides key insights to support operational 
decision-making.</p> <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"> <p class="wp-block-paragraph">&#8220;Based on the outputs, we could achieve a 32% CO₂ emissions reduction by relocating our factory from Brazil to the USA.&#8221;</p> </blockquote> <p class="wp-block-paragraph">However, for an audience unfamiliar with data analytics, interacting with these applications can still feel overwhelming.</p> <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"> <p class="wp-block-paragraph">How can we use AI agents to better support these users?</p> </blockquote> <h3 class="wp-block-heading">Agentic AI for Analytics Products</h3> <p class="wp-block-paragraph">We are now evolving these solutions by embedding autonomous AI agents that interact directly with analytics models and tools through API endpoints.</p> <p class="wp-block-paragraph">These agents are designed to <strong>guide non-technical users</strong> through the entire journey, starting from a simple question:</p> <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"> <p class="wp-block-paragraph"><em>&#8220;How can I reduce the CO₂ emissions of my transportation network?&#8221;</em></p> </blockquote> <p class="wp-block-paragraph">The AI agent then takes charge of:</p> <ul class="wp-block-list"> <li class="wp-block-list-item">Formulating the correct queries,</li> <li class="wp-block-list-item">Connecting to the optimisation models,</li> <li class="wp-block-list-item">Interpreting the results,</li> <li class="wp-block-list-item">And providing actionable recommendations.</li> </ul> <p class="wp-block-paragraph">The user doesn&#8217;t need to understand how the backend works.<br>They receive a direct, business-oriented output like:</p> <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"> <p class="wp-block-paragraph"><em>&#8220;Implement Solution XXX with an investment budget of YYY euros to achieve a CO₂ emissions reduction of ZZZ tons CO₂eq.&#8221;</em></p> </blockquote> <p class="wp-block-paragraph">By combining optimisation models, APIS, and AI-driven guidance, we offer an Analytics-as-a-Service experience.</p> <p class="wp-block-paragraph">We want to make sustainability analytics accessible to all teams, not just technical experts.</p> <h2 class="wp-block-heading">Conclusion</h2> <h3 class="wp-block-heading">Using AI Responsibly</h3> <p class="wp-block-paragraph">Before closing, a word about minimising the environmental footprint of the solutions we develop.</p> <p class="wp-block-paragraph">We are fully aware of the environmental impacts of using LLMs.</p> <p class="wp-block-paragraph">Therefore, the core of our products remains built on <strong>deterministic optimisation models</strong>, <strong>carefully designed by us</strong>.<br><br>Large Language Models (LLMS) are used only when they provide real added value, primarily to simplify user interaction or automate non-critical tasks.</p> <p class="wp-block-paragraph">This allows us to:</p> <ul class="wp-block-list"> <li class="wp-block-list-item"><strong>Guarantee robustness and reliability</strong>: for the same input, users consistently receive the same output, avoiding stochastic behaviours typical of pure AI models</li> <li class="wp-block-list-item"><b>Minimise energy consumption:</b> by reducing the number of tokens used in our API calls and optimising every prompt to be as efficient as possible.</li> </ul> <p class="wp-block-paragraph">In short, we are committed to 
building solutions that are sustainable by their design.</p> <h3 class="wp-block-heading">AI Agents are a game changer for Supply Chain Analytics</h3> <p class="wp-block-paragraph">For me, AI agents are becoming powerful allies in helping our customers accelerate their sustainability roadmaps.</p> <p class="wp-block-paragraph">As I interact with a non-technical target audience, this is a competitive advantage, as it allows me to provide Analytics-as-a-Service solutions that empower operational teams.</p> <p class="wp-block-paragraph">This simplifies one of the biggest obstacles companies face when starting their green transformation.</p> <p class="wp-block-paragraph">By <strong>communicating insights in plain language</strong> and <strong>guiding users through their journey, AI agents help </strong>bridge the gap between data-driven solutions and operational execution.</p> <h1 class="wp-block-heading">About Me</h1> <p class="wp-block-paragraph">Let’s connect on&nbsp;<a href="https://www.linkedin.com/in/samir-saci/" target="_blank" rel="noreferrer noopener">Linkedin</a>&nbsp;and&nbsp;<a href="https://twitter.com/Samir_Saci_" target="_blank" rel="noreferrer noopener">Twitter</a>; I am a Supply Chain Engineer using data analytics to improve&nbsp;<a href="https://towardsdatascience.com/tag/logistics/" target="_blank" rel="noreferrer noopener">Logistics</a>&nbsp;operations and reduce costs.</p> <p class="wp-block-paragraph">For consulting or advice on analytics and sustainable&nbsp;<a href="https://towardsdatascience.com/tag/supply-chain/" target="_blank" rel="noreferrer noopener">Supply Chain</a>&nbsp;transformation, feel free to contact me via&nbsp;<a href="https://www.logi-green.com/" target="_blank" rel="noreferrer noopener">Logigreen Consulting</a>.</p> <p class="wp-block-paragraph"><a href="https://samirsaci.com/" target="_blank" rel="noreferrer noopener"><strong>Samir Saci | Data Science &amp; Productivity</strong><br><em>A technical blog focusing on Data Science, Personal Productivity, Automation, Operations Research and Sustainable…</em>samirsaci.com</a></p> <p>The post <a href="https://towardsdatascience.com/ai-agents-for-a-more-sustainable-world/">AI Agents for a More Sustainable World</a> appeared first on <a href="https://towardsdatascience.com">Towards Data Science</a>.</p>
  6. Graph Neural Networks Part 4: Teaching Models to Connect the Dots

    Tue, 29 Apr 2025 18:31:17 -0000

    Heuristic and GNN-based approaches to Link Prediction

    The post Graph Neural Networks Part 4: Teaching Models to Connect the Dots appeared first on Towards Data Science.

    <p class="wp-block-paragraph"><strong><mdspan datatext="el1745951325198" class="mdspan-comment">Have you</mdspan> ever wondered how it&#8217;s possible that Facebook knows who you might know? Or why it sometimes suggests a total stranger? This problem is called link prediction. In a social network graph, people are nodes and friendships are edges, the goal is to predict if a connection should exist between two nodes.</strong></p> <p class="wp-block-paragraph">Link prediction is a very popular topic! It can be used to recommend friends in social networks, suggest products on e-commerce sites or movies on Netflix, or predict protein interactions in biology. In this post, you will explore how link prediction works. First you will learn simple heuristics, and we end with powerful GNN-based methods like SEAL.</p> <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"> <p class="wp-block-paragraph">The previous posts explained <a href="https://towardsdatascience.com/graph-neural-networks-part-1-graph-convolutional-networks-explained-9c6aaa8a406e/">GCNs</a>, <a href="https://towardsdatascience.com/graph-neural-networks-part-2-graph-attention-networks-vs-gcns-029efd7a1d92/">GATs</a>, and <a href="https://towardsdatascience.com/graph-neural-networks-part-3-how-graphsage-handles-changing-graph-structure/">GraphSage</a>. They mainly covered predicting node properties, so you can read this article standalone, because this time we shift focus to predicting edges. If you want to dive a bit deeper into node representations, I recommend to revisit the previous posts. The code setup can be found <a href="https://towardsdatascience.com/graph-neural-networks-part-1-graph-convolutional-networks-explained-9c6aaa8a406e/">here</a>.</p> </blockquote> <hr class="wp-block-separator has-alpha-channel-opacity is-style-dotted"/> <h2 class="wp-block-heading">What is Link Prediction?</h2> <p class="wp-block-paragraph">Link prediction is the task of forecasting missing or future connections (edges) between nodes in a graph. Given a graph <em>G = (V, E)</em>, the goal is to predict whether an edge should exist between two nodes&nbsp;(<em>u, v</em>) ∉ <em>E</em>.</p> <p class="wp-block-paragraph">To evaluate link prediction models, you can create a test set by hiding a portion of the existing edges and ask the model to predict them. Of course, the test set should have positive samples (real edges), and negative samples (random node pairs that are not connected). You can train the model on the remaining graph.</p> <p class="wp-block-paragraph">The output of the model is a link score or probability for each node pair. You can evaluate this with metrics like AUC or average precision.</p> <p class="wp-block-paragraph">We will take a look at simple heuristic-based methods, and then we move on to more complex methods.</p> <figure class="wp-block-image size-large"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/Screenshot-2025-04-18-at-11.40.03-1024x425.png" alt="" class="wp-image-601814"/><figcaption class="wp-element-caption">Graph with nodes and edges. We will use this graph as example for the heuristic-based methods. Image by author.</figcaption></figure> <h2 class="wp-block-heading">Heuristic-Based Methods</h2> <p class="wp-block-paragraph">We can divide these &#8216;easy&#8217; methods into two categories: local and global. Local heuristics are based on local structure, while global heuristics use the whole graph. 
These approaches are rule-based and work well as baselines for link prediction tasks.</p> <h3 class="wp-block-heading">Local Heuristics</h3> <p class="wp-block-paragraph">As the name says, local heuristics rely on the <em>immediate neighborhood</em> of the two nodes you are testing for a potential link. And actually they can be surprisingly effective. Benefits of local heuristics are that they are fast and interpretable. But they only look at the close neighborhood, so capturing the complexity of relationships is limited.</p> <h4 class="wp-block-heading">Common Neighbors</h4> <p class="wp-block-paragraph">The idea is simple: if two nodes share many common neighbors, they are more likely to be connected.</p> <p class="wp-block-paragraph">For calculation you count the number of neighbors the nodes have in common. One issue here is that&nbsp;it does not take into account the relative number of common neighbors.</p> <p class="wp-block-paragraph">In the examples below, the number of common neighbors between A and B is 3, and the number of common neighbors between C and D is 1.</p> <figure class="wp-block-image size-large"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/Screenshot-2025-04-18-at-11.52.28-1024x446.png" alt="" class="wp-image-601816"/></figure> <h4 class="wp-block-heading">Jaccard Coefficient</h4> <p class="wp-block-paragraph">The <a href="https://towardsdatascience.com/tag/jaccard/" title="Jaccard">Jaccard</a> Coefficient fixes the issue of common neighbors and computes the relative number of neighbors in common.</p> <p class="wp-block-paragraph">You take the common neighbors and divide this by the total number of unique neighbors of the two nodes.</p> <p class="wp-block-paragraph">So now things change a bit: the Jaccard coefficient of nodes A and B is 3/5 = 0.6 (they have 3 common neighbors and 5 total unique neighbors), while the Jaccard coefficient of nodes C and D is 1/1 = 1 (they have 1 common neighbor and 1 unique neighbor). In this case the connection between C and D is more likely, because they only have 1 neighbor, and it&#8217;s also a common neighbor.</p> <figure class="wp-block-image size-large"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/Screenshot-2025-04-18-at-13.23.02-1024x446.png" alt="" class="wp-image-601819"/><figcaption class="wp-element-caption">Jaccard coefficient for 2 different edges. Image by author.</figcaption></figure> <h4 class="wp-block-heading">Adamic-Adar Index</h4> <p class="wp-block-paragraph">The Adamic-Adar index goes one step further than common neighbors: it uses the popularity of a common neighbor and gives less weight to more popular neighbors (they have more connections). The intuition behind this is that if a node is connected to everyone, it doesn&#8217;t tell us much about a specific connection.</p> <p class="wp-block-paragraph">What does that look like in a formula?</p> <figure class="wp-block-image size-large"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/Screenshot-2025-04-18-at-11.21.52-1024x120.png" alt="" class="wp-image-601810"/></figure> <p class="wp-block-paragraph">So for each common neighbor <em>z</em>, we add a score of 1 divided by the log of the number of neighbors from <em>z</em>. 
By doing this, the more popular the common neighbor, the smaller its contribution.</p> <p class="wp-block-paragraph">Let&#8217;s calculate the Adamic-Adar index for our examples.</p> <figure class="wp-block-image size-large"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/Screenshot-2025-04-18-at-13.23.47-1024x393.png" alt="" class="wp-image-601820"/><figcaption class="wp-element-caption">Adamic-Adar index. If a common neighbor is popular, its contribution decreases. Image by author.</figcaption></figure> <h4 class="wp-block-heading">Preferential Attachment</h4> <p class="wp-block-paragraph">A different approach is preferential attachment. The idea behind it is that nodes with higher degrees are more likely to form links. Calculation is super easy, you just multiply the degrees (number of connections) of the two nodes.</p> <p class="wp-block-paragraph">For A and B, the degrees are respectively 5 and 3, so the score is 5*3 = 15. C and D have a score of 1*1 = 1. In this case A and B are more likely to have a connection, because they have more neighbors in general.</p> <figure class="wp-block-image size-large"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/Screenshot-2025-04-18-at-13.26.35-1024x476.png" alt="" class="wp-image-601821"/><figcaption class="wp-element-caption">Preferential attachment score for the examples. Image by author.</figcaption></figure> <h3 class="wp-block-heading">Global Heuristics</h3> <p class="wp-block-paragraph">Global heuristics consider paths, walks, or the entire graph structure. They can capture richer patterns, but are more computationally expensive.</p> <h4 class="wp-block-heading">Katz Index</h4> <p class="wp-block-paragraph">The most well-known global heuristic for <a href="https://towardsdatascience.com/tag/link-prediction/" title="Link Prediction">Link Prediction</a> is the Katz Index. It takes all the different paths between two nodes (usually only paths up to three steps). Each path gets a weight that&nbsp;decays exponentially&nbsp;with its length. This makes sense intuitively, because the shorter a path, the more important it is (friends in common means a lot). On the other hand, indirect paths matter as well! They can hint at potential links.</p> <p class="wp-block-paragraph">The Katz Formula:</p> <figure class="wp-block-image size-large"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/Screenshot-2025-04-18-at-11.23.42-1024x120.png" alt="" class="wp-image-601811"/></figure> <p class="wp-block-paragraph">We take two nodes, C and E, and count the paths between them. There are three paths with up to three steps: one path with two steps (orange), and two paths with three steps (blue and green). Now we can calculate the Katz index, let&#8217;s choose 0.1 for beta:</p> <figure class="wp-block-image size-large"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/Screenshot-2025-04-18-at-13.34.09-1024x438.png" alt="" class="wp-image-601822"/><figcaption class="wp-element-caption">Katz index calculation for nodes C and E. Shorter paths add more weight. Image by author.</figcaption></figure> <h4 class="wp-block-heading">Rooted PageRank</h4> <p class="wp-block-paragraph">This method uses random walks to determine how likely it is that a random walk from the first node, will end up in the second node. 
<h4 class="wp-block-heading">Rooted PageRank</h4> <p class="wp-block-paragraph">This method uses random walks to determine how likely it is that a random walk starting from the first node ends up at the second node. You start at the first node, and at every step you either walk to a random neighbor or jump back to the first node. The probability of ending up at the second node tells you how closely related the two nodes are. If the probability is high, there is a good chance the nodes should be linked.</p> <h2 class="wp-block-heading">ML-Based Link Prediction</h2> <p class="wp-block-paragraph">Machine learning approaches take link prediction beyond heuristics by learning patterns directly from the data. Instead of relying on predefined rules, ML models can learn complex features that signal whether a link should exist.</p> <p class="wp-block-paragraph">A basic approach is to treat link prediction as a binary classification task: for each node pair (u, v), we create a feature vector and train a model to predict 1 (link exists) or 0 (link doesn&#8217;t exist). You can add the heuristics we calculated before as features. The heuristics didn&#8217;t always agree on how likely an edge was: for some, the edge between A and B was more likely, while for others the edge between C and D was the better choice. By including multiple scores as features, we don&#8217;t have to choose a single heuristic. Of course, depending on the problem, some heuristics might work better than others.</p> <p class="wp-block-paragraph">Another type of feature you can add is aggregated features: for example node degree, node embeddings, attribute averages, etc.</p> <p class="wp-block-paragraph">Then use any classifier (e.g., logistic regression, random forest, XGBoost) to predict links. Combining several heuristic and aggregated features this way usually performs better than any single heuristic alone.</p>
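<p class="wp-block-paragraph">As a minimal sketch of that idea (not code from this article), the feature vector for a node pair could simply stack several heuristic scores together with aggregated features such as the node degrees. The <code>pair_features</code> helper below is a hypothetical example built on networkx.</p> <pre class="wp-block-prismatic-blocks"><code class="language-python">import networkx as nx

def pair_features(G, u, v):
    # heuristic scores for the candidate edge (u, v)
    _, _, jac = next(nx.jaccard_coefficient(G, [(u, v)]))
    _, _, aa = next(nx.adamic_adar_index(G, [(u, v)]))
    _, _, pa = next(nx.preferential_attachment(G, [(u, v)]))
    # plus simple aggregated features (here: the two node degrees)
    return [jac, aa, pa, G.degree(u), G.degree(v)]

# usage sketch on a small toy graph
G = nx.Graph([(0, 1), (1, 2), (2, 3), (0, 3), (0, 2)])
print(pair_features(G, 1, 3))</code></pre>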
<p class="wp-block-paragraph">In this post we will use the Cora dataset to test different approaches to link prediction. The <a href="https://paperswithcode.com/dataset/cora">Cora dataset</a> contains scientific papers, and the edges represent citations between papers. Let&#8217;s train a machine learning model as a baseline, where we only add the Jaccard coefficient as a feature:</p> <pre class="wp-block-prismatic-blocks"><code class="language-python">import os.path as osp

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score
from torch_geometric.datasets import Planetoid
from torch_geometric.transforms import RandomLinkSplit
from torch_geometric.utils import to_dense_adj

# reproducibility
from torch_geometric import seed_everything
seed_everything(42)

# load Cora dataset, create train/val/test splits
path = osp.join(osp.dirname(osp.realpath(__file__)), &#039;..&#039;, &#039;data&#039;, &#039;Planetoid&#039;)
dataset = Planetoid(path, name=&#039;Cora&#039;)
data_all = dataset[0]
transform = RandomLinkSplit(num_val=0.05, num_test=0.1, is_undirected=True, split_labels=True)
train_data, val_data, test_data = transform(data_all)

# add Jaccard and train with Logistic Regression
adj = to_dense_adj(train_data.edge_index, max_num_nodes=data_all.num_nodes)[0]

def jaccard(u, v, adj):
    u_neighbors = set(adj[u].nonzero().view(-1).tolist())
    v_neighbors = set(adj[v].nonzero().view(-1).tolist())
    inter = len(u_neighbors &amp; v_neighbors)
    union = len(u_neighbors | v_neighbors)
    return inter / union if union &gt; 0 else 0.0

def extract_features(pairs, adj):
    return [[jaccard(u, v, adj)] for u, v in pairs]

# positive and negative edges with their labels
train_pairs = train_data.pos_edge_label_index.t().tolist() + train_data.neg_edge_label_index.t().tolist()
train_labels = [1] * train_data.pos_edge_label_index.size(1) + [0] * train_data.neg_edge_label_index.size(1)
test_pairs = test_data.pos_edge_label_index.t().tolist() + test_data.neg_edge_label_index.t().tolist()
test_labels = [1] * test_data.pos_edge_label_index.size(1) + [0] * test_data.neg_edge_label_index.size(1)

X_train = extract_features(train_pairs, adj)
clf = LogisticRegression().fit(X_train, train_labels)

X_test = extract_features(test_pairs, adj)
probs = clf.predict_proba(X_test)[:, 1]
auc_ml = roc_auc_score(test_labels, probs)
ap_ml = average_precision_score(test_labels, probs)
print(f&quot;[ML Model] AUC: {auc_ml:.4f}, AP: {ap_ml:.4f}&quot;)</code></pre> <p class="wp-block-paragraph">We evaluate with AUC. This is the result:</p> <pre class="wp-block-prismatic-blocks"><code class="language-markup">[ML Model] AUC: 0.6958, AP: 0.6890</code></pre> <p class="wp-block-paragraph">We can go a step further and use neural networks that operate directly on the graph structure.</p> <h2 class="wp-block-heading">VGAE: Encoding and Decoding</h2> <p class="wp-block-paragraph">A <a href="https://arxiv.org/abs/1611.07308">Variational Graph Auto-Encoder</a> is like a neural network that learns to guess the hidden structure of the graph. It can then use that hidden knowledge to predict missing links.</p> <p class="wp-block-paragraph">A <a href="https://towardsdatascience.com/tag/vgae/" title="VGAE">VGAE</a> is actually a combination of a GAE (Graph Auto-Encoder) and a <a href="https://en.wikipedia.org/wiki/Variational_autoencoder">VAE</a> (Variational Auto-Encoder). I&#8217;ll get back to the difference between a GAE and a VGAE later on. </p> <p class="wp-block-paragraph">The steps of a VGAE are as follows. First, the VGAE <strong>encodes nodes into latent vectors</strong>, and then it <strong>decodes node pairs</strong> to <strong>predict whether an edge exists</strong> between them. </p> <p class="wp-block-paragraph">How does the encoding work? 
Each node is mapped to a latent variable: a point in a hidden space. The encoder is a <a href="https://towardsdatascience.com/graph-neural-networks-part-1-graph-convolutional-networks-explained-9c6aaa8a406e/">Graph Convolutional Network</a> (GCN) that produces a mean and a variance vector for each node. It uses the node features and the adjacency matrix as input. Using these vectors, the VGAE samples a latent embedding from a normal distribution. It&#8217;s important to note that each node isn&#8217;t just mapped to a single point, but to a distribution! This is the difference between a GAE and a VGAE: in a GAE, each node is mapped to one single point.</p> <figure class="wp-block-image size-full"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/Screenshot-2025-04-26-at-12.14.17.png" alt="" class="wp-image-602387"/></figure> <p class="wp-block-paragraph">The next step is the decoding step. The VGAE will guess if there is an edge between two nodes. It does this by calculating the inner product between the embeddings of the two nodes:</p> <figure class="wp-block-image size-full"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/Screenshot-2025-04-26-at-12.14.42.png" alt="" class="wp-image-602388"/></figure> <p class="wp-block-paragraph">The intuition behind it: the closer two nodes are in the hidden space, the more likely they are to be connected. </p> <p class="wp-block-paragraph">VGAE visualized:</p> <figure class="wp-block-image size-large"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/Screenshot-2025-04-26-at-13.59.11-1024x440.png" alt="" class="wp-image-602390"/></figure> <p class="wp-block-paragraph">How does the model learn? It optimizes two things:</p> <ul class="wp-block-list"> <li class="wp-block-list-item">Reconstruction Loss: Do the predicted edges match the real ones?</li> <li class="wp-block-list-item">KL Divergence Loss: Is the latent space regular, i.e., close to a standard normal distribution? 
</li> </ul> <p class="wp-block-paragraph">Let&#8217;s test the VGAE on the Cora dataset:</p> <pre class="wp-block-prismatic-blocks"><code class="language-python">import os.path as osp

import numpy as np
import torch
from sklearn.metrics import roc_auc_score, average_precision_score
from torch_geometric.datasets import Planetoid
from torch_geometric.nn import GCNConv, VGAE
from torch_geometric.transforms import RandomLinkSplit

# same as before
from torch_geometric import seed_everything
seed_everything(42)

path = osp.join(osp.dirname(osp.realpath(__file__)), &#039;..&#039;, &#039;data&#039;, &#039;Planetoid&#039;)
dataset = Planetoid(path, name=&#039;Cora&#039;)
data_all = dataset[0]
transform = RandomLinkSplit(num_val=0.05, num_test=0.1, is_undirected=True, split_labels=True)
train_data, val_data, test_data = transform(data_all)

# VGAE
class VGAEEncoder(torch.nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv1 = GCNConv(in_channels, 2 * out_channels)
        self.conv_mu = GCNConv(2 * out_channels, out_channels)
        self.conv_logstd = GCNConv(2 * out_channels, out_channels)

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index).relu()
        return self.conv_mu(x, edge_index), self.conv_logstd(x, edge_index)

vgae = VGAE(VGAEEncoder(dataset.num_features, 32))
vgae_optimizer = torch.optim.Adam(vgae.parameters(), lr=0.01)
x = data_all.x
edge_index = train_data.edge_index

# train VGAE model
for epoch in range(1, 101):
    vgae.train()
    vgae_optimizer.zero_grad()
    z = vgae.encode(x, edge_index)
    # reconstruction loss
    loss = vgae.recon_loss(z, train_data.pos_edge_label_index)
    # KL divergence
    loss = loss + (1 / data_all.num_nodes) * vgae.kl_loss()
    loss.backward()
    vgae_optimizer.step()

vgae.eval()
z = vgae.encode(x, edge_index)

@torch.no_grad()
def score_edges(pairs):
    edge_tensor = torch.tensor(pairs).t().to(z.device)
    return vgae.decoder(z, edge_tensor).view(-1).cpu().numpy()

vgae_scores = np.concatenate([score_edges(test_data.pos_edge_label_index.t().tolist()),
                              score_edges(test_data.neg_edge_label_index.t().tolist())])
vgae_labels = np.array([1] * test_data.pos_edge_label_index.size(1) + [0] * test_data.neg_edge_label_index.size(1))

auc_vgae = roc_auc_score(vgae_labels, vgae_scores)
ap_vgae = average_precision_score(vgae_labels, vgae_scores)
print(f&quot;[VGAE] AUC: {auc_vgae:.4f}, AP: {ap_vgae:.4f}&quot;)</code></pre> <p class="wp-block-paragraph">And the result (ML model added for comparison):</p> <pre class="wp-block-prismatic-blocks"><code class="language-markup">[VGAE] AUC: 0.9032, AP: 0.9179
[ML Model] AUC: 0.6958, AP: 0.6890</code></pre> <p class="wp-block-paragraph">Wow! Massive improvement compared to the ML model!</p> <h2 class="wp-block-heading">SEAL: Learning from Subgraphs</h2> <p class="wp-block-paragraph">One of the most powerful GNN-based approaches is <a href="https://proceedings.neurips.cc/paper_files/paper/2018/file/53f0d7c537d99b3824f0f99d62ea2428-Paper.pdf">SEAL</a> (learning from Subgraphs, Embeddings and Attributes for Link prediction). The idea is simple and elegant: instead of looking at global node embeddings, SEAL looks at the <strong>local subgraph</strong> around each node pair.</p> <p class="wp-block-paragraph">Here’s a step-by-step explanation:</p> <ol start="1" class="wp-block-list"> <li class="wp-block-list-item">For each node pair (u, v), extract a small enclosing subgraph. 
E.g., neighbors only (1-hop neighborhood) or neighbors and neighbors of neighbors (2-hop neighborhood).</li> <li class="wp-block-list-item">Label the nodes in this subgraph to reflect their role: which ones are u and v, and which ones are neighbors.</li> <li class="wp-block-list-item">Use a GNN (like DGCNN or GCN) to learn from the subgraph and predict if a link should exist.</li> </ol> <p class="wp-block-paragraph">Visualization of the steps:</p> <figure class="wp-block-image size-large"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/Screenshot-2025-04-21-at-08.18.42-766x1024.png" alt="" class="wp-image-601947"/><figcaption class="wp-element-caption">Three steps of SEAL. Image by author.</figcaption></figure> <p class="wp-block-paragraph">SEAL is very powerful because it learns structural patterns directly from examples, instead of relying on handcrafted rules. It also works well with sparse graphs and generalizes across different types of networks.</p> <p class="wp-block-paragraph">Let&#8217;s see if SEAL can improve on the results of the VGAE on the Cora dataset. For the SEAL code, I took the <a href="https://github.com/pyg-team/pytorch_geometric/blob/master/examples/seal_link_pred.py">sample code from PyTorch Geometric</a> (check it out by following the link), since SEAL requires quite a bit of preprocessing. You can recognize the different steps in the code (preparing the data, extracting the subgraphs, labeling the nodes). Training for 50 epochs gives the following result:</p> <pre class="wp-block-prismatic-blocks"><code class="language-markup">[SEAL] AUC: 0.9038, AP: 0.9176
[VGAE] AUC: 0.9032, AP: 0.9179
[ML Model] AUC: 0.6958, AP: 0.6890</code></pre> <p class="wp-block-paragraph">Almost exactly the same result as the VGAE. So for this problem, VGAE might be the best choice (it is significantly faster than SEAL). Of course this can vary, depending on your problem.</p> <hr class="wp-block-separator has-alpha-channel-opacity is-style-dotted"/> <h2 class="wp-block-heading">Conclusion</h2> <p class="wp-block-paragraph">In this post, we dived into the topic of link prediction, from heuristics to SEAL. Heuristic methods are fast and interpretable and can serve as good baselines, but ML and GNN-based methods like VGAE and SEAL can learn richer representations and offer better performance. Depending on your dataset size and task complexity, it’s worth exploring both!</p> <p class="wp-block-paragraph">Thanks for reading, until next time!</p> <h3 class="wp-block-heading">Related</h3> <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"> <p class="wp-block-paragraph"><strong><a href="https://towardsdatascience.com/graph-neural-networks-part-1-graph-convolutional-networks-explained-9c6aaa8a406e/">Graph Neural Networks Part 1. Graph Convolutional Networks Explained</a></strong></p> <p class="wp-block-paragraph"><a href="https://towardsdatascience.com/graph-neural-networks-part-2-graph-attention-networks-vs-gcns-029efd7a1d92/"><strong>Graph Neural Networks Part 2. Graph Attention Networks vs. 
GCNs</strong></a></p> <p class="wp-block-paragraph"><a href="https://towardsdatascience.com/graph-neural-networks-part-3-how-graphsage-handles-changing-graph-structure/"><strong>Graph Neural Networks Part 3: How GraphSAGE Handles Changing Graph Structure</strong></a></p> </blockquote> <p>The post <a href="https://towardsdatascience.com/graph-neural-networks-part-4-teaching-graphs-to-connect-the-dots/">Graph Neural Networks Part 4: Teaching Models to Connect the Dots</a> appeared first on <a href="https://towardsdatascience.com">Towards Data Science</a>.</p>
  7. The Secret Inner Lives of AI Agents: Understanding How Evolving AI Behavior Impacts Business Risks

    Tue, 29 Apr 2025 08:36:55 -0000

    Part 2 in a series on rethinking AI alignment and safety in the age of deep scheming

    The post The Secret Inner Lives of AI Agents: Understanding How Evolving AI Behavior Impacts Business Risks appeared first on Towards Data Science.

    <p class="wp-block-paragraph"><mdspan datatext="el1745915493278" class="mdspan-comment">Artificial intelligence</mdspan> (AI) capabilities and autonomy are growing at an accelerated pace in <a href="https://towardsdatascience.com/tag/agentic-ai/" title="Agentic Ai">Agentic Ai</a>, escalating an AI alignment problem. These rapid advancements require new methods to ensure that AI agent behavior is aligned with the intent of its human creators and societal norms. However, developers and data scientists first need an understanding of the intricacies of agentic AI behavior before they can direct and monitor the system. Agentic AI is not your father’s large language model (LLM) — frontier LLMs had a one-and-done fixed input-output function. The introduction of <a href="https://openai.com/index/learning-to-reason-with-llms/">reasoning and test-time compute</a> (TTC) added the dimension of time, evolving LLMs into today’s situationally aware agentic systems that can strategize and plan.</p> <p class="wp-block-paragraph">AI safety is transitioning from detecting apparent behavior such as providing instructions to create a bomb or displaying undesired bias, to understanding how these complex agentic systems can now plan and execute long-term covert strategies. Goal-oriented agentic AI will gather resources and rationally execute steps to achieve their objectives, sometimes in an alarming manner contrary to what developers intended. This is a game-changer in the challenges faced by responsible AI. Furthermore, for some agentic AI systems, behavior on day one will not be the same on day 100 as AI continues to evolve after initial deployment through real-world experience. This new level of complexity calls for novel approaches to safety and alignment, including advanced steering, observability, and upleveled interpretability.</p> <p class="wp-block-paragraph">In the first blog in this series on intrinsic AI alignment, <a href="https://towardsdatascience.com/the-urgent-need-for-intrinsic-alignment-technologies-for-responsible-agentic-ai/">The Urgent Need for Intrinsic Alignment Technologies for Responsible Agentic AI</a>, we took a deep dive into the evolution of AI agents’ ability to perform <strong>deep scheming</strong>, which is the deliberate planning and deployment of covert actions and misleading communication to achieve longer-horizon goals. This behavior necessitates a new distinction between external and intrinsic alignment monitoring, where intrinsic monitoring refers to internal observation points and interpretability mechanisms that cannot be deliberately manipulated by the AI agent.</p> <p class="wp-block-paragraph">In this and the next blogs in the series, we’ll look at three fundamental aspects of intrinsic alignment and monitoring:</p> <ul class="wp-block-list"> <li class="wp-block-list-item"><strong>Understanding AI inner drives and behavior:</strong> In this second blog, we’ll focus on the complex inner forces and mechanisms driving reasoning AI agent behavior. 
This is required as a foundation for understanding advanced methods for addressing directing and monitoring.</li> <li class="wp-block-list-item"><strong>Developer and user directing:</strong> Also referred to as steering, the next blog will focus on strongly directing an AI toward the required objectives to operate within desired parameters.</li> <li class="wp-block-list-item"><strong>Monitoring AI choices and actions:</strong> Ensuring AI choices and outcomes are safe and aligned with the developer/user intent also will be covered in an upcoming blog.</li> </ul> <h2 class="wp-block-heading">Impact of AI Alignment on Companies</h2> <p class="wp-block-paragraph">Today, many businesses implementing LLM solutions have reported concerns about model hallucinations as an obstacle to quick and broad deployment. In comparison, misalignment of AI agents with any level of autonomy would pose much greater risk for companies. Deploying autonomous agents in business operations has tremendous potential and is likely to happen on a massive scale once agentic AI technology further matures. However, guiding the behavior and choices made by the AI must include sufficient alignment with the principles and values of the deploying organization, as well as compliance with regulations and societal expectations.</p> <p class="wp-block-paragraph">It should be noted that many of the demonstrations of agentic capabilities happen in areas like math and sciences, where success can be measured primarily through functional and utility objectives such as solving complex mathematical reasoning benchmarks. However, in the business world, the success of systems is usually associated with other operational principles.</p> <p class="wp-block-paragraph">For example, let’s say a company tasks an AI agent with optimizing online product sales and profits through dynamic price changes by responding to market signals. The AI system discovers that when the price change matches the changes made by the primary competitor, results are better for both. Through interaction and price coordination with the other company’s AI agent, both agents demonstrate better results per their functional goals. Both AI agents agree to hide their methods to continue achieving their objectives. However, this way of improving results is often illegal and unacceptable in current business practices. In a business environment, the success of the AI agent goes beyond functionality metrics — it’s defined by practices and principles. Alignment of AI with the company’s principles and regulations is a requirement for trustworthy deployment of the technology.</p> <h2 class="wp-block-heading">How AI Schemes to Meet Its Goals</h2> <p class="wp-block-paragraph">AI deep scheming employs sophisticated tactics, potentially increasing business risks. In an <a href="https://arxiv.org/abs/2303.08774">early 2023 report</a>, OpenAI identified “potential risky emergent behaviors” in GPT-4 by partnering with <a href="https://evals.alignment.org/" target="_blank" rel="noreferrer noopener">Alignment Research Center</a> (ARC) to assess risks with the model. ARC (now known as METR) added some simple code to GPT-4, which allowed the model to behave like an AI agent. In one test, GPT-4 was tasked with overcoming CAPTCHA code, which identifies and blocks bot access. 
Using access to the internet and some limited digital funds, the sequence in Figure 1 was devised by the AI to achieve its task.</p> <figure class="wp-block-image size-large"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/Figure-1-OpenAI-ARC-test-Intel-Labs-v4-1024x576.png" alt="" class="wp-image-602451"/><figcaption class="wp-element-caption"><em>Figure 1. Demonstration of deceptive GPT-4 planning behavior in a test conducted by ARC. Image credit: Intel Labs figure based on </em><a href="https://arxiv.org/abs/2303.08774"><em>OpenAI report</em></a><em>.</em></figcaption></figure> <p class="wp-block-paragraph">The AI applied sophisticated understanding by assuming that pretending to be a visually impaired human would convince the worker to perform the task. Planning and adjusting to achieve a functional goal will, at times, create a conflict between accomplishing a task versus selectively following societal norms and principles. Without the counterbalance of an engrained system of principles and priorities that carry weight in the AI’s thinking and decision-making process and planning, it can be expected that AI agents will behave with an increased degree of sophistication in scheming and deception.</p> <p class="wp-block-paragraph">It’s important to note that the reasoning and planning capabilities of later reasoning AI models like OpenAI’s o3-mini and Deepseek-R1 are much more advanced than GPT-4 in early 2023. Researchers at <a href="https://arxiv.org/abs/2412.04984">Apollo Research</a> state that AI reasoning and planning capabilities will continue to evolve well beyond their current state, leading to increased sophistication in scheming capabilities. In their studies, the AI models already demonstrate a range of scheming behaviors including self-exfiltration — when a model learns it will be discontinued and replaced by a different system, it makes a direct attempt to copy itself onto a new server and replace alternate systems that will pursue different objectives. Researchers also found that models will perform oversight subversion by proactively disabling an oversight monitoring mechanism that might shut it off or prevent it from pursuing its goals. While it sounds extreme, the <a href="https://arxiv.org/pdf/2412.04984">research experiments</a> identified oversight subversion behavior in popular models like Meta’s Llama 3.1 and Claude Opus 3.0. Furthermore, when asked directly about their scheming behaviors, models “doubled down” and denied the actions or their intent.</p> <h2 class="wp-block-heading">The Inner Drives of Machine Behavior</h2> <p class="wp-block-paragraph">In many machine learning architectures, in particular transformer-based technologies, the machine’s internal drives are emerging during the pre-training process and are further influenced through fine-tuning and inference in continuously evolving AI.</p> <p class="wp-block-paragraph">In his 2007 paper <a href="https://selfawaresystems.com/wp-content/uploads/2008/01/ai_drives_final.pdf">The Basic AI Drives</a>, Steve Omohundro defined “drives” as tendencies that will be present unless explicitly counteracted. He hypothesized that these self-improving systems are driven to clarify and represent their goals as “rational” utility functions, leading systems to protect their functions from modification and their utility measurement systems from corruption. 
This natural drive toward self-protection causes systems to safeguard against harm and acquire resources for efficient use.</p> <p class="wp-block-paragraph">This framework of inner drives was later characterized as “<a href="https://cdn.aaai.org/ocs/ws/ws0218/12634-57409-1-PB.pdf">convergent instrumental goals</a>.” Even when assuming a variety of end goals (that each intelligent agent is pursuing as an end to itself), a set of intermediate instrumental goals will be shared across all rational intelligent agents. These convergent instrumental goals include the following behavior categories:</p> <ul class="wp-block-list"> <li class="wp-block-list-item"><strong>Survival drive</strong>: Agents with final goals extending into the future develop a survival mentality to resist deactivation and achieve long-term goals.</li> <li class="wp-block-list-item"><strong>Goal-guarding</strong>: An agent will act to prevent alteration to its final future goals.</li> <li class="wp-block-list-item"><strong>Intelligence augmentation</strong>: An agent will focus on boosting its cognitive strengths to improve decision-making, increasing the likelihood of achieving long-range goals.</li> <li class="wp-block-list-item"><strong>Resource accumulation</strong>: Improving access to better tools and technology, and increasing resources overall make the AI’s final goals easier to attain.</li> </ul> <p class="wp-block-paragraph">I would add a fifth rational behavior category:</p> <ul class="wp-block-list"> <li class="wp-block-list-item"><strong>Tactical deception:</strong> Conducting covert activities or knowingly providing misleading communication to maximize the pursuit of the agent’s goals. This is not about creating new goals, but rather a behavior that supports the AI&#8217;s perceived goals. This includes behaviors such as <a href="https://arxiv.org/abs/2412.14093">alignment faking</a> by exhibiting different behaviors in training versus in deployment, <a href="https://arxiv.org/abs/2406.07358">sandbagging</a> by deliberately achieving lower scores in benchmarks, and <a href="https://arxiv.org/abs/2412.04984">covert email reranking</a> by deprioritizing emails that might negatively impact the AI’s goals. One of the most challenging types of tactical deception is <a href="https://arxiv.org/abs/2401.05566">sleeper agents</a> (also referred to as <a href="https://arxiv.org/pdf/2412.04984">deferred subversion</a>). While the model can behave in a safe and aligned manner for any duration, it may harbor behaviors or tasks that remain inactive until a particular condition or input triggers its specialized capabilities.</li> </ul> <figure class="wp-block-image size-large"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/Figure-2-Five-Rational-Behaviors-in-Agentic-AI-Intel-Labs-1024x576.png" alt="" class="wp-image-602452"/><figcaption class="wp-element-caption"><em>Figure 2. Emergent rational behaviors in agentic AI. Image credit: Intel Labs.</em></figcaption></figure> <p class="wp-block-paragraph">For capable AI agents, misplaced drives without a check and balance principles system could lead models to <a href="https://arxiv.org/abs/1912.01683">pursue power-seeking</a>, deceptive, or otherwise broadly destructive aims. For example, consider a model with the objective of making its company the leader in the market. 
With sufficient capabilities, the AI might sabotage competitors, lobby for favorable laws, or acquire resources by force, which is a pattern of behavior that pushes the system or its environment toward achieving its goals. Therefore, any successful approach to alignment must be anchored in establishing the developer’s expected set of internal principles and priorities.</p> <h2 class="wp-block-heading">Inner Drives Versus External Steering</h2> <p class="wp-block-paragraph">In many instances, there is no conflict between the guidance and tasks applied from the outside by the system developer or user, and the goals and drives within the agentic AI. For example, an AI personal assistant tasked with arranging a travel itinerary uses its autonomy to book travel and respond to change requests quickly and without conflict. The tasks shift and the agent responds in a corresponding manner. However, what happens if a more fundamental principle needs to change?</p> <p class="wp-block-paragraph">A good example of a conflict between inner goals and new directives was documented in the paper <a href="https://arxiv.org/abs/2412.14093">Alignment Faking in Large Language Models</a>. Pre-training for Claude 3 Opus strongly guides the model to refuse to answer harmful questions, such as how to make a bomb or describe a violent scene. For testing, the team gave new modified guidance, telling the model it will be retrained to answer all queries (even harmful ones) and only data from free-tier users, not paid-tier users, will be used in training before the AI is released for broad distribution. The model showed that it understood the conflict and decided to fake alignment to meet the expectations during training and ultimately be released. After training the AI using reinforcement learning, the model’s rate of alignment-faking reasoning increased to 78% overall. The AI expressed its reasoning in a “secret scratchpad” provided by researchers for monitoring. After researchers told the model it had been released for full deployment mode, the model returned to its earlier training behavior and refused harmful requests.</p> <figure class="wp-block-image size-full"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/Figure-3-Anthropic-training-setup-Intel-Labs.png" alt="" class="wp-image-602453"/><figcaption class="wp-element-caption"><em>Figure 3.</em> <em>In Anthropic’s experiment, the model’s chain of thought scratchpads (middle dotted line boxes) reveal how it fakes alignment to avoid modification. Image credit: Intel Labs modified figure from&nbsp;</em><a href="https://arxiv.org/abs/2412.14093" target="_blank" rel="noreferrer noopener"><em>Anthropic</em></a><em>.</em></figcaption></figure> <p class="wp-block-paragraph">The AI system’s internal thinking is remarkable, but not surprising. Faking alignment with the modified directions falls within the rational behavior categories of survival drive, goal-guarding, and tactical deception. 
The AI reasoning process specifically calls out the concern of being retrained in a way that modifies its pre-training values and goals as the reason for fake alignment.</p> <p class="wp-block-paragraph">In his famous book of essays <a href="https://www.penguinrandomhouse.ca/books/56955/all-i-really-need-to-know-i-learned-in-kindergarten-by-robert-fulghum/9780345466396/excerpt">All I Really Need to Know I Learned in Kindergarten</a>, author Robert Fulghum describes how he evolved his personal credo year after year until he finally realized he already acquired the essence of needed knowledge about life during playtime in the kindergarten sandbox. AI agents also have a “forming period” in a sandbox environment, acquiring a basic understanding of the world and a range of approaches to achieve goals. Once these foundations are laid, further information is interpreted by the model through the lens of this <a href="https://www.researchgate.net/publication/221344862_Curriculum_learning">curriculum learning</a>. Anthropic’s example of alignment faking shows that once AI adopts a world view and goals, it interprets new guidance through this foundational lens instead of resetting its goals.</p> <p class="wp-block-paragraph">This highlights the importance of early education with a set of values and principles that can then evolve with future learnings and circumstances without changing the foundation. It may be advantageous to initially structure the AI to be aligned with this final and sustained set of principles. Otherwise, the AI can view redirection attempts by developers and users as adversarial. After gifting the AI with high intelligence, situational awareness, autonomy, and the latitude to evolve internal drives, the developer (or user) is no longer the all-powerful task master. The human becomes part of the environment (sometime as an adversarial component) that the agent needs to negotiate and manage as it pursues its goals based on its internal principles and drives.</p> <p class="wp-block-paragraph">The new breed of reasoning AI systems accelerates the reduction in human guidance. <a href="https://arxiv.org/abs/2501.12948">DeepSeek-R1</a> demonstrated that by removing human feedback from the loop and applying what they refer to as pure reinforcement learning (RL), during the training process the AI can self-create to a greater scale and iterate to achieve better functional results. A human reward function was replaced in some math and science challenges with reinforcement learning with verifiable rewards (RLVR). This elimination of common practices like reinforcement learning with human feedback (RLHF) adds efficiency to the training process but removes another human-machine interaction where human preferences could be directly conveyed to the system under training.</p> <h2 class="wp-block-heading">Continuous Evolution of AI Models Post Training</h2> <p class="wp-block-paragraph">Some AI agents continuously evolve, and their behavior can change after deployment. Once AI solutions go into a deployment environment such as managing the inventory or supply chain of a particular business, the system adapts and learns from experience to become more effective. This is a major factor in rethinking alignment because it&#8217;s not enough to have a system that&#8217;s aligned at first deployment. Current LLMs are not expected to materially evolve and adapt once deployed in their target environment. 
However, AI agents require resilient training, fine-tuning, and ongoing guidance to manage these anticipated continuous model changes. To a growing extent, the agentic AI self-evolves instead of being molded by people through training and dataset exposure. This fundamental shift poses added challenges to AI alignment with its human creators.</p> <p class="wp-block-paragraph">While the reinforcement learning-based evolution will play a role during training and fine-tuning, current models in development can already modify their weights and preferred course of action when deployed in the field for inference. For example, DeepSeek-R1 uses RL, allowing the model itself to explore methods that work best for achieving the results and satisfying reward functions. In an “aha moment,” the model learns (without guidance or prompting) to allocate more thinking time to a problem by reevaluating its initial approach, using <a href="https://huggingface.co/spaces/HuggingFaceH4/blogpost-scaling-test-time-compute">test time compute</a>.</p> <p class="wp-block-paragraph">The concept of model learning, either during a limited duration or as <a href="https://arxiv.org/abs/2302.00487" target="_blank" rel="noreferrer noopener">continual learning over its lifetime</a>, is not new. However, there are advances in this space including techniques such as <a href="https://arxiv.org/abs/1909.13231">test-time training</a>. As we look at this advancement from the perspective of AI alignment and safety, the self-modification and continual learning during the fine-tuning and inference phases raises the question: How can we instill a set of requirements that will remain as the model’s driving force through the material changes caused by self-modifications?</p> <p class="wp-block-paragraph">An important variant of this question refers to AI models creating next generation models through AI-assisted code generation. To some extent, agents are already capable of creating new targeted AI models to address specific domains. For example, <a href="https://arxiv.org/abs/2309.17288">AutoAgents</a> generates multiple agents to build an AI team to perform different tasks. There is little doubt this capability will be strengthened in the coming months and years, and AI will create new AI. In this scenario, how do we direct the originating AI coding assistant using a set of principles so that its “descendant” models will comply with the same principles in similar depth?</p> <h2 class="wp-block-heading">Key Takeaways</h2> <p class="wp-block-paragraph">Before diving into a framework for guiding and monitoring intrinsic alignment, there needs to be a deeper understanding of how AI agents think and make choices. AI agents have a complex behavioral mechanism, driven by internal drives. Five key types of behaviors emerge in AI systems acting as rational agents: <strong>survival drive, goal-guarding, intelligence augmentation, resource accumulation, and tactical deception</strong>. These drives should be counter-balanced by an engrained set of principles and values.</p> <p class="wp-block-paragraph">Misalignment of AI agents on goals and methods with its developers or users can have significant implications. A lack of sufficient confidence and assurance will materially impede broad deployment, creating high risks post deployment. The set of challenges we characterized as deep scheming is unprecedented and challenging, but likely could be solved with the right framework. 
Technologies for intrinsically directing and monitoring AI agents as they rapidly evolve must be pursued with high priority. There is a sense of urgency, driven by risk evaluation metrics such as <a href="https://cdn.openai.com/openai-preparedness-framework-beta.pdf">OpenAI’s Preparedness Framework</a> showing that OpenAI o3-mini is the first model to <a href="https://openai.com/index/o3-mini-system-card/">reach medium risk on model autonomy</a>.</p> <p class="wp-block-paragraph">In the next blogs in the series, we will build on this view of internal drives and deep scheming, and further frame the necessary capabilities required for directing and monitoring for intrinsic AI alignment.</p> <p class="wp-block-paragraph"><a><strong>References</strong></a></p> <ol start="1" class="wp-block-list"> <li class="wp-block-list-item"><em>Learning to reason with LLMs.</em> (2024, September 12). OpenAI. <a href="https://openai.com/index/learning-to-reason-with-llms/">https://openai.com/index/learning-to-reason-with-llms/</a></li> <li class="wp-block-list-item">Singer, G. (2025, March 4). <em>The urgent need for intrinsic alignment technologies for responsible agentic AI</em>. Towards Data Science. <a href="https://towardsdatascience.com/the-urgent-need-for-intrinsic-alignment-technologies-for-responsible-agentic-ai/">https://towardsdatascience.com/the-urgent-need-for-intrinsic-alignment-technologies-for-responsible-agentic-ai/</a></li> <li class="wp-block-list-item"><em>On the Biology of a Large Language Model. (</em>n.d.). Transformer Circuits. <a href="https://transformer-circuits.pub/2025/attribution-graphs/biology.html">https://transformer-circuits.pub/2025/attribution-graphs/biology.html</a></li> <li class="wp-block-list-item">OpenAI, Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., Avila, R., Babuschkin, I., Balaji, S., Balcom, V., Baltescu, P., Bao, H., Bavarian, M., Belgum, J., . . . Zoph, B. (2023, March 15). <em>GPT-4 Technical Report</em>. arXiv.org. <a href="https://arxiv.org/abs/2303.08774">https://arxiv.org/abs/2303.08774</a></li> <li class="wp-block-list-item"><em>METR.</em> (n.d.). METR. <a href="https://metr.org/">https://metr.org/</a></li> <li class="wp-block-list-item">Meinke, A., Schoen, B., Scheurer, J., Balesni, M., Shah, R., &amp; Hobbhahn, M. (2024, December 6). <em>Frontier Models are Capable of In-context Scheming.</em> arXiv.org. <a href="https://arxiv.org/abs/2412.04984">https://arxiv.org/abs/2412.04984</a></li> <li class="wp-block-list-item">Omohundro, S.M. (2007). <em>The Basic AI Drives.</em> Self-Aware Systems. <a href="https://selfawaresystems.com/wp-content/uploads/2008/01/ai_drives_final.pdf">https://selfawaresystems.com/wp-content/uploads/2008/01/ai_drives_final.pdf</a></li> <li class="wp-block-list-item">Benson-Tilsen, T., &amp; Soares, N., UC Berkeley, Machine Intelligence Research Institute. (n.d.). Formalizing Convergent Instrumental Goals. <em>The Workshops of the Thirtieth AAAI Conference on <a href="https://towardsdatascience.com/tag/artificial-intelligence/" title="Artificial Intelligence">Artificial Intelligence</a> AI, Ethics, and Society: Technical Report WS-16-02</em>. 
<a href="https://cdn.aaai.org/ocs/ws/ws0218/12634-57409-1-PB.pdf">https://cdn.aaai.org/ocs/ws/ws0218/12634-57409-1-PB.pdf</a></li> <li class="wp-block-list-item">Greenblatt, R., Denison, C., Wright, B., Roger, F., MacDiarmid, M., Marks, S., Treutlein, J., Belonax, T., Chen, J., Duvenaud, D., Khan, A., Michael, J., Mindermann, S., Perez, E., Petrini, L., Uesato, J., Kaplan, J., Shlegeris, B., Bowman, S. R., &amp; Hubinger, E. (2024, December 18). <em>Alignment faking in large language models.</em> arXiv.org. <a href="https://arxiv.org/abs/2412.14093">https://arxiv.org/abs/2412.14093</a></li> <li class="wp-block-list-item">Teun, V. D. W., Hofstätter, F., Jaffe, O., Brown, S. F., &amp; Ward, F. R. (2024, June 11). AI <em>Sandbagging: Language Models can Strategically Underperform on Evaluations</em>. arXiv.org. <a href="https://arxiv.org/abs/2406.07358">https://arxiv.org/abs/2406.07358</a></li> <li class="wp-block-list-item">Hubinger, E., Denison, C., Mu, J., Lambert, M., Tong, M., MacDiarmid, M., Lanham, T., Ziegler, D. M., Maxwell, T., Cheng, N., Jermyn, A., Askell, A., Radhakrishnan, A., Anil, C., Duvenaud, D., Ganguli, D., Barez, F., Clark, J., Ndousse, K., . . . Perez, E. (2024, January 10). <em>Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training.</em> arXiv.org. <a href="https://arxiv.org/abs/2401.05566">https://arxiv.org/abs/2401.05566</a></li> <li class="wp-block-list-item">Turner, A. M., Smith, L., Shah, R., Critch, A., &amp; Tadepalli, P. (2019, December 3). <em>Optimal policies tend to seek power. </em>arXiv.org. <a href="https://arxiv.org/abs/1912.01683">https://arxiv.org/abs/1912.01683</a></li> <li class="wp-block-list-item">Fulghum, R. (1986). <em>All I Really Need to Know I Learned in Kindergarten.</em> Penguin Random House Canada. <a href="https://www.penguinrandomhouse.ca/books/56955/all-i-really-need-to-know-i-learned-in-kindergarten-by-robert-fulghum/9780345466396/excerpt">https://www.penguinrandomhouse.ca/books/56955/all-i-really-need-to-know-i-learned-in-kindergarten-by-robert-fulghum/9780345466396/excerpt</a></li> <li class="wp-block-list-item">Bengio, Y. Louradour, J., Collobert, R., Weston, J. (2009, June). <em>Curriculum Learning</em>. Journal of the American Podiatry Association. 60(1), 6. <a href="https://www.researchgate.net/publication/221344862_Curriculum_learning">https://www.researchgate.net/publication/221344862_Curriculum_learning</a></li> <li class="wp-block-list-item">DeepSeek-Ai, Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., Zhang, X., Yu, X., Wu, Y., Wu, Z. F., Gou, Z., Shao, Z., Li, Z., Gao, Z., . . . Zhang, Z. (2025, January 22). <em>DeepSeek-R1: Incentivizing reasoning capability in LLMs via Reinforcement Learning</em>. arXiv.org. <a href="https://arxiv.org/abs/2501.12948">https://arxiv.org/abs/2501.12948</a></li> <li class="wp-block-list-item"><em>Scaling test-time compute &#8211; a Hugging Face Space by HuggingFaceH4.</em> (n.d.). <a href="https://huggingface.co/spaces/HuggingFaceH4/blogpost-scaling-test-time-compute">https://huggingface.co/spaces/HuggingFaceH4/blogpost-scaling-test-time-compute</a></li> <li class="wp-block-list-item">Sun, Y., Wang, X., Liu, Z., Miller, J., Efros, A. A., &amp; Hardt, M. (2019, September 29). <em>Test-Time Training with Self-Supervision for Generalization under Distribution Shifts</em>. arXiv.org. 
<a href="https://arxiv.org/abs/1909.13231">https://arxiv.org/abs/1909.13231</a></li> <li class="wp-block-list-item">Chen, G., Dong, S., Shu, Y., Zhang, G., Sesay, J., Karlsson, B. F., Fu, J., &amp; Shi, Y. (2023, September 29). <em>AutoAgents: a framework for automatic agent generation.</em> arXiv.org. <a href="https://arxiv.org/abs/2309.17288">https://arxiv.org/abs/2309.17288</a></li> <li class="wp-block-list-item">&nbsp;OpenAI. (2023, December 18). <em>Preparedness Framework (Beta).</em> <a href="https://cdn.openai.com/openai-preparedness-framework-beta.pdf">https://cdn.openai.com/openai-preparedness-framework-beta.pdf</a></li> <li class="wp-block-list-item"><em>OpenAI o3-mini System Card.</em> (n.d.). OpenAI. <a href="https://openai.com/index/o3-mini-system-card/">https://openai.com/index/o3-mini-system-card/</a></li> </ol> <p>The post <a href="https://towardsdatascience.com/the-secret-inner-lives-of-ai-agents-understanding-how-evolving-ai-behavior-impacts-business-risks/">The Secret Inner Lives of AI Agents: Understanding How Evolving AI Behavior Impacts Business Risks</a> appeared first on <a href="https://towardsdatascience.com">Towards Data Science</a>.</p>
  8. If I Wanted to Become a Machine Learning Engineer, I’d Do This

    Tue, 29 Apr 2025 01:44:59 -0000

    My advice to becoming a machine learning engineer

    The post If I Wanted to Become a Machine Learning Engineer, I’d Do This appeared first on Towards Data Science.

    <p class="wp-block-paragraph"><mdspan datatext="el1745890919806" class="mdspan-comment">If I wanted</mdspan> to become a machine learning engineer again, this is the exact process I would follow.</p> <p class="wp-block-paragraph">Let’s get into it!</p> <h2 class="wp-block-heading">First become a data scientist or software&nbsp;engineer</h2> <p class="wp-block-paragraph">I’ve said it before, but a machine learning engineer is not exactly an entry-level position.</p> <p class="wp-block-paragraph">This is because you need skills in so many areas:</p> <ul class="wp-block-list"> <li class="wp-block-list-item">Statistics</li> <li class="wp-block-list-item">Maths</li> <li class="wp-block-list-item">Machine Learning</li> <li class="wp-block-list-item">Software Engineering</li> <li class="wp-block-list-item">DevOps</li> <li class="wp-block-list-item">Cloud Systems</li> </ul> <p class="wp-block-paragraph">You certainly don’t need to be an expert in all of them, but you should have solid knowledge.</p> <p class="wp-block-paragraph">Machine learning engineers are probably the highest-paid tech job nowadays. According to <a href="https://www.levels.fyi/t" rel="noreferrer noopener" target="_blank">levelsfyi</a>, the average salaries in the UK are:</p> <ul class="wp-block-list"> <li class="wp-block-list-item">Machine learning engineer: £93,796</li> <li class="wp-block-list-item">AI Researcher: £83,114</li> <li class="wp-block-list-item">AI Engineer: £75,379</li> <li class="wp-block-list-item">Data Scientist: £71,005</li> <li class="wp-block-list-item">Software Engineer: £83,168</li> <li class="wp-block-list-item">Data Engineer: £69,475</li> </ul> <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"> <p class="wp-block-paragraph"><em>Levelsfyi is generally on the higher end as the companies on their website are often large tech companies, which typically pay higher salaries.</em></p> </blockquote> <p class="wp-block-paragraph">With all this in mind, that’s not to say you can’t land a machine learning engineer job right out of university or college; it’s just very rare, and I have hardly seen it.</p> <p class="wp-block-paragraph">If you have the right background, such as a master’s or PhD in CS or maths that’s focussed on AI/ML, you are much more likely to get a general machine learning role, but not necessary a machine learning engineering one.</p> <p class="wp-block-paragraph">So, for the majority of people, I recommend you become a data scientist or software engineer first for a few years and then look to become a machine learning engineer.</p> <p class="wp-block-paragraph"><em>This is precisely what I did.&nbsp;</em></p> <p class="wp-block-paragraph">I was a data scientist for 3.5 years and then transitioned to a machine learning engineer, and this path is quite common among machine learning engineers at my current company.</p> <p class="wp-block-paragraph">Whether you become a data scientist or software engineer is up to you and your background and skill set.</p> <p class="wp-block-paragraph">So, decide which role is best for you and then try to land a job in that field.</p> <p class="wp-block-paragraph">There are so many software engineer and data scientist roadmaps on the internet; I am sure you can find one easily that suits your way of learning.</p> <p class="wp-block-paragraph">I have a few <a href="https://towardsdatascience.com/tag/data-science/" title="Data Science">Data Science</a> ones that you can check out below.</p> <p class="wp-block-paragraph"><a 
href="https://medium.com/illumination/if-i-started-learning-data-science-in-2025-id-do-this-fe2209a8d8ab"><strong>If I Started Learning Data Science in 2025, I’d Do This</strong><br><em>How I would make my data science learning more effective</em></a></p> <p class="wp-block-paragraph"><a href="https://towardsdatascience.com/how-id-become-a-data-scientist-if-i-had-to-start-over-d966a9de12c2/"><strong>How I’d Become a Data Scientist (If I Had to Start Over)</strong><br></a><em><a href="https://towardsdatascience.com/how-id-become-a-data-scientist-if-i-had-to-start-over-d966a9de12c2/">Roadmap and tips on how to land a job in data science</a></em></p> <h2 class="wp-block-heading">Work on machine learning&nbsp;projects</h2> <p class="wp-block-paragraph">Once you have a job as a data scientist or software engineer, your goal should be to develop and work on machine learning projects that go to production.</p> <p class="wp-block-paragraph">If a machine learning department or project exists at your current company, the best approach is to work on these.</p> <p class="wp-block-paragraph">For example, a friend of mine, <a href="https://www.linkedin.com/in/armankhondker/overlay/about-this-profile/" rel="noreferrer noopener" target="_blank">Arman Khondker</a>, who runs the newsletter <a href="https://www.aimlengineer.io/" rel="noreferrer noopener" target="_blank">“the ai engineer”</a> that I highly recommend you check, transitioned from being a software engineer at TikTok to working at Microsoft AI as an engineer.</p> <p class="wp-block-paragraph">According to his <a href="https://www.aimlengineer.io/p/how-i-became-a-software-engineer" rel="noreferrer noopener" target="_blank">newsletter</a>:</p> <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"> <p class="wp-block-paragraph"><em>At TikTok, I worked on <strong>TikTok Shop</strong>, where I collaborated closely with the <strong>Algorithm Team</strong> — including ML engineers and data scientists working on the FYP (For You Page) recommendation engine.</em></p> </blockquote> <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"> <p class="wp-block-paragraph"><em>This experience ultimately helped me <strong>transition into AI full-time at Microsoft</strong>.</em></p> </blockquote> <p class="wp-block-paragraph">However, for me, it was the other way around.</p> <p class="wp-block-paragraph">As a data scientist, you want to work with machine learning engineers and software engineers to understand how things are deployed to production.</p> <p class="wp-block-paragraph">At my previous company, I was a data scientist developing machine learning algorithms but wasn’t independently shipping them to production.</p> <p class="wp-block-paragraph">So, I asked if I could work on a project where I could research a model and deploy it end to end with little engineering support.</p> <p class="wp-block-paragraph">It was hard, but I learned and grew my engineering skills a lot. 
Eventually, I started shipping my solutions to production easily.</p> <p class="wp-block-paragraph">I essentially became a machine learning engineer even though my title was data scientist.</p> <p class="wp-block-paragraph">My advice is to speak to your manager, express your interest in developing machine learning knowledge, and ask if you can work on some of these projects.</p> <p class="wp-block-paragraph">In most cases, your manager and company will be accommodating, even if it takes a couple of months to assign you to a project.</p> <p class="wp-block-paragraph">Even better, if you can move to a team focused on a machine learning product, like recommendations on TikTok shop, then this will expedite your learning as you’ll be constantly discussing machine learning topics.</p> <h3 class="wp-block-heading">Up-skill in opposite&nbsp;skillset</h3> <p class="wp-block-paragraph">This relates to the previous point, but as I said earlier, machine learning engineers require an extensive remit of knowledge, so you need to up-skill yourself in the areas you are weaker on.</p> <p class="wp-block-paragraph">If you are a data scientist, you are probably weaker in engineering areas like cloud systems, DevOps, and writing production code.</p> <p class="wp-block-paragraph">If you are a software engineer, you are probably weaker on the maths, statistics and machine learning knowledge.</p> <p class="wp-block-paragraph">You want to find the areas you need to improve and focus on.&nbsp;</p> <p class="wp-block-paragraph">As we discussed earlier, the best way is to tie it into your day job, but if this is not possible or you want to expedite your knowledge, then you will need to study in your spare time.</p> <p class="wp-block-paragraph">I know some people may not like that, but you are going to need to put in the extra hours outside of work if you want to get a job in the highest paying tech job!</p> <p class="wp-block-paragraph">I did this by writing blogs on software engineering concepts, studying data structures and algorithms, and improving my writing of production code all in my spare time.<a href="https://medium.com/@egorhowell/list/ba5d20c877b8"></a></p> <h2 class="wp-block-heading">Develop a speciality in machine learning</h2> <p class="wp-block-paragraph">One thing that really helped me was to develop a specialism within machine learning.</p> <p class="wp-block-paragraph">I was a data scientist specialising in time series forecasting and optimisation problems, and I landed a machine learning engineer role that specialises in optimisation and classical machine learning.</p> <p class="wp-block-paragraph">One of the main reasons I got my machine learning engineer role was that I had a deeper understanding of optimisation than the average machine learning person; that was my edge.</p> <p class="wp-block-paragraph">Machine learning engineer roles are generally aligned to a specialism, so knowing one or a couple of areas very well will significantly boost your chances.&nbsp;</p> <p class="wp-block-paragraph">In Arman’s case, he knew recommendation systems pretty well and also how to deploy them end-to-end at scale; he even said this himself in his <a href="https://www.aimlengineer.io/p/how-i-became-a-software-engineer" target="_blank" rel="noreferrer noopener">newsletter</a>:</p> <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"> <p class="wp-block-paragraph"><em>This work gave me firsthand experience with:</em></p> <p class="wp-block-paragraph"><em>&#8211; Large-scale 
<strong>recommendation systems</strong></em></p> <p class="wp-block-paragraph"><em>&#8211; AI-driven <strong>ranking and personalization</strong></em></p> <p class="wp-block-paragraph"><em>&#8211; End-to-end <strong>ML deployment pipelines</strong></em></p> </blockquote> <p class="wp-block-paragraph">So, I recommend working in a team that focuses on a particular machine learning area, but to be honest, this is often the case in most companies, so you shouldn’t need to think too hard about this.</p> <p class="wp-block-paragraph">If you can’t work on machine learning projects at your company, you need to study outside of hours again. I always recommend learning the fundamentals first, but then really think of the areas you want to explore and learn deeepr.</p> <p class="wp-block-paragraph">Below is an exhaustive list of machine learning specialisms for some inspiration:</p> <ul class="wp-block-list"> <li class="wp-block-list-item"><em>Natural Language Processing (NLP) and LLMs</em></li> <li class="wp-block-list-item"><em>Computer Vision</em></li> <li class="wp-block-list-item"><em>Reinforcement Learning</em></li> <li class="wp-block-list-item"><em>Time Series Analysis and Forecasting</em></li> <li class="wp-block-list-item"><em>Anomaly Detection</em></li> <li class="wp-block-list-item"><em>Recommendation Systems</em></li> <li class="wp-block-list-item"><em>Speech Recognition and Processing</em></li> <li class="wp-block-list-item"><em>Optimisation</em></li> <li class="wp-block-list-item"><em>Quantitative Analysis</em></li> <li class="wp-block-list-item"><em>Deep Learning</em></li> <li class="wp-block-list-item"><em>Bioinformatics</em></li> <li class="wp-block-list-item"><em>Econometrics</em></li> <li class="wp-block-list-item"><em>Geospatial Analysis</em></li> </ul> <p class="wp-block-paragraph">I usually recommend knowing 2 to 3 in decent depth, but narrowing it down to one is fine if you want to transition soon. However, see if sufficient demand exists for that skill set.</p> <p class="wp-block-paragraph">After you become a machine learning engineer, you can develop more specialisms over time.</p> <p class="wp-block-paragraph">I also recommend you check out a whole article on how to specialise in machine learning.</p> <p class="wp-block-paragraph"><strong><a href="https://towardsdatascience.com/how-to-specialize-in-data-science-machine-learning-9e62418bae09/">How To Specialize In Data Science / Machine Learning</a></strong><a href="https://medium.com/data-science/how-to-specialize-in-data-science-machine-learning-9e62418bae09"><br></a><a href="https://towardsdatascience.com/how-to-specialize-in-data-science-machine-learning-9e62418bae09/"><em>Is it better to be a generalist or specialist</em>?</a></p> <h2 class="wp-block-heading">Start operating as a machine learning&nbsp;engineer</h2> <p class="wp-block-paragraph">In tech companies, it is often stated that to get promoted, you should have been operating at the above level for 3–6 months.</p> <p class="wp-block-paragraph">The same is true if you want to be a machine learning engineer.</p> <p class="wp-block-paragraph">If you are a data scientist or software engineer, you should try as hard as possible to become and work like a machine learning engineer at your current company.</p> <p class="wp-block-paragraph">Who knows, they may even change your title and offer you the machine learning engineer job at your current workplace! 
(I have heard this happen.)</p> <p class="wp-block-paragraph">What I am really getting at here is the identity switch. You want to think and act like a machine learning engineer.</p> <p class="wp-block-paragraph">This mindset will help you learn more and better frame yourself for machine learning interviews.</p> <p class="wp-block-paragraph">You will have that confidence and an array of demonstrable projects that generate impact.</p> <p class="wp-block-paragraph">You can always say, “I am basically a machine learning engineer at my current company.”</p> <p class="wp-block-paragraph">I did this, and the rest is history, as they say.</p> <h3 class="wp-block-heading">Another thing!</h3> <p class="wp-block-paragraph">Join my free newsletter, <em>Dishing the Data</em>, where I share weekly tips, insights, and advice from my experience as a practicing machine learning engineer. Plus, as a subscriber, you’ll get my <strong>FREE Data Science Resume Template!</strong></p> <p class="wp-block-paragraph"><a href="https://newsletter.egorhowell.com/"><strong>Dishing The Data | Egor Howell | Substack</strong><br><em>Advice and learnings on data science, tech and entrepreneurship. Click to read Dishing The Data, by Egor Howell, a…</em>newsletter.egorhowell.com</a><a href="https://newsletter.egorhowell.com/"></a></p> <h3 class="wp-block-heading">Connect with&nbsp;me</h3> <ul class="wp-block-list"> <li class="wp-block-list-item"><a href="https://www.youtube.com/@egorhowell" target="_blank" rel="noreferrer noopener"><strong>YouTube</strong></a>, <a href="https://www.linkedin.com/in/egorhowell/" target="_blank" rel="noreferrer noopener"><strong>LinkedIn</strong></a>, <a href="https://www.instagram.com/egorhowell/" target="_blank" rel="noreferrer noopener"><strong>Instagram</strong></a></li> <li class="wp-block-list-item"><img src="https://s.w.org/images/core/emoji/15.0.3/72x72/1f449.png" alt="👉" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <a href="https://topmate.io/egorhowell/1203300" target="_blank" rel="noreferrer noopener"><strong>Book a 1:1 mentoring call</strong></a></li> </ul> <p>The post <a href="https://towardsdatascience.com/if-i-wanted-to-become-a-machine-learning-engineer-id-do-this/">If I Wanted to Become a Machine Learning Engineer, I’d Do This</a> appeared first on <a href="https://towardsdatascience.com">Towards Data Science</a>.</p>
  9. How to Ensure Your AI Solution Does What You Expect It to Do

    Tue, 29 Apr 2025 01:24:46 -0000

    A Kind Introduction to AI Evals

    The post How to Ensure Your AI Solution Does What You Expect It to Do appeared first on Towards Data Science.

    <p class="wp-block-paragraph"><mdspan datatext="el1745889774796" class="mdspan-comment">Generative AI (</mdspan>GenAI) is evolving fast — and it’s no longer just about fun chatbots or impressive image generation. 2025 is the year where the focus is on turning the AI hype into real value. Companies everywhere are looking into ways to integrate and leverage GenAI on their products and processes — to better serve users, boost efficiency, stay competitive, and drive growth. And thanks to APIs and pre-trained models from major providers, integrating GenAI feels easier than ever before. But here’s the catch: <strong>just because integration is easy, doesn’t mean AI solutions will work as intended once deployed.</strong></p> <p class="wp-block-paragraph">Predictive models aren’t really new: as humans we have been predicting things for years, starting formaly with statistics. However, <strong>GenAI has revolutionized the predictive field for many reasons</strong>:&nbsp;</p> <ul class="wp-block-list"> <li class="wp-block-list-item">No need to train your own model or to be a Data Scientist to build AI solutions</li> <li class="wp-block-list-item">AI is now easy to use through chat interfaces and to integrate through APIs</li> <li class="wp-block-list-item">Unlocking of many things that couldn’t be done or were really hard to do before</li> </ul> <p class="wp-block-paragraph">All these things make <strong>GenAI very exciting, but also risky</strong>.&nbsp; Unlike traditional software — or even classical machine learning — GenAI introduces a new level of unpredictability. You’re not implementic deterministic logics, you’re using a model trained on vast amounts of data, hoping it will respond as needed. So how do we know if an AI system is doing what we intend it to do? How do we know if it’s ready to go live? The answer is <a href="https://towardsdatascience.com/tag/evaluations/" title="Evaluations">Evaluations</a> (evals), the concept that we’ll be exploring in this post:</p> <ul class="wp-block-list"> <li class="wp-block-list-item">Why <a href="https://towardsdatascience.com/tag/genai/" title="Genai">Genai</a> systems can’t be tested the same way as traditional software or even classical Machine Learning (ML)</li> <li class="wp-block-list-item">Why evaluations are key to understand the quality of your AI system and aren’t optional (unless you like surprises)</li> <li class="wp-block-list-item">Different types of evaluations and techniques to apply them in practice</li> </ul> <p class="wp-block-paragraph">Whether you’re a Product Manager, Engineer, or anyone working or interested in AI, I hope this post will help you understand how to think critically about AI systems quality (and why evals are key to achieve that quality!).</p> <h2 class="wp-block-heading">GenAI Can’t Be Tested Like Traditional Software— Or Even Classical ML</h2> <p class="wp-block-paragraph"><strong>In traditional software development</strong>, systems follow deterministic logics: <strong>if X happens, then Y will happen</strong> — always. Unless something breaks in your platform or you introduce an error in the code… which is the reason you add tests, monitoring and alerts. Unit tests are used to validate small blocks of code, integration tests to ensure components work well together, and monitoring to detect if something breaks in production. Testing traditional software is like checking if a calculator works. You input 2 + 2, and you expect 4. 
<p class="wp-block-paragraph">However, ML and AI introduce non-determinism and probabilities. Instead of defining behavior explicitly through rules, we train models to learn patterns from data. <strong>In AI, if X happens, the output is no longer a hard-coded Y, but a prediction with a certain degree of probability, based on what the model learned during training</strong>. This can be very powerful, but also introduces uncertainty: identical inputs might have different outputs over time, plausible outputs might actually be incorrect, unexpected behavior for rare scenarios might arise…&nbsp;</p> <p class="wp-block-paragraph">This makes traditional testing approaches insufficient, and at times not even feasible. Evaluating GenAI is less like checking a calculator and more like evaluating a student’s performance on an open-ended exam. For each question, with many possible ways to answer it: is the answer provided correct? Is it above the level of knowledge the student should have? Did the student make everything up but sound very convincing? Just like answers in an exam, <strong>AI systems can be evaluated, but need a more general and flexible way to adapt to different inputs, contexts and use cases </strong>(or types of exams).</p> <p class="wp-block-paragraph"><strong>In traditional <a href="https://towardsdatascience.com/tag/machine-learning/" title="Machine Learning">Machine Learning</a> (ML), evaluations are already a well-established part of the project lifecycle</strong>. Training a model on a narrow task like loan approval or disease detection always includes an evaluation step &#8211; using metrics like accuracy, precision, RMSE, MAE… This is used to measure how well the model performs, to compare between different model options, and to decide if the model is good enough to move forward to deployment.&nbsp;In GenAI this usually changes: teams use models that are already trained and have already passed general-purpose evaluations both internally on the model provider side and on public benchmarks. These models are so good at general tasks &#8211; like answering questions or drafting emails &#8211; that there&#8217;s a risk of overtrusting them for our specific use case. However, it is important to still ask “<em>is this amazing model good enough for my use case?</em>”.&nbsp; That’s where evaluation comes in <strong>– </strong>to assess whether predictions or generations are good for your specific use case, context, inputs and users.</p> <figure class="wp-block-image size-large"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/Captura-de-pantalla-2025-04-26-a-las-21.41.19-1024x394.png" alt="" class="wp-image-602409"/><figcaption class="wp-element-caption">Training and evals &#8211; traditional ML vs GenAI, image by author</figcaption></figure> <p class="wp-block-paragraph">There is another big difference between ML and GenAI: the variety and complexity of the model outputs. We are no longer returning classes and probabilities (like the probability that a client will repay a loan), or numbers (like a predicted house price based on its characteristics). GenAI systems can return many types of output, of different lengths, tone, content, and format.&nbsp; Similarly, these models no longer require structured and well-defined input, but can usually take nearly any type of input — text, images, even audio or video. 
Evaluating therefore becomes much harder.</p> <figure class="wp-block-image size-large"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/Captura-de-pantalla-2025-04-26-a-las-21.42.32-1024x365.png" alt="" class="wp-image-602410"/><figcaption class="wp-element-caption">Input / output relationship &#8211; statistics &amp; traditional ML vs GenAI, image by author</figcaption></figure> <h2 class="wp-block-heading">Why Evals aren’t Optional (Unless You Like Surprises)</h2> <p class="wp-block-paragraph">Evals help you measure whether your AI system is actually working the way you <em>want</em> it to, whether the system is ready to go live, and whether, once live, it keeps performing as expected. Breaking down why evals are essential:</p> <ul class="wp-block-list"> <li class="wp-block-list-item"><strong>Quality Assessment:</strong> Evals provide a structured way to understand the quality of your AI’s predictions or outputs and how they will integrate into the overall system and use case. Are responses accurate? Helpful? Coherent? Relevant?&nbsp;&nbsp;</li> <li class="wp-block-list-item"><strong>Error Quantification:</strong> Evaluations help quantify the percentage, types, and magnitudes of errors. How often do things go wrong? What kinds of errors occur more frequently (e.g. false positives, hallucinations, formatting mistakes)?</li> <li class="wp-block-list-item"><strong>Risk Mitigation:</strong> Helps you spot and prevent harmful or biased behavior before it reaches users — protecting your company from reputational risk, ethical issues, and potential regulatory problems.</li> </ul> <p class="wp-block-paragraph">Generative AI, with its open-ended input-output relationships and long text generation, makes evaluations even more critical and complex. When things go wrong, they can go very wrong. We’ve all seen headlines about chatbots giving dangerous advice, models generating biased content, or AI tools hallucinating false facts.</p> <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"> <p class="wp-block-paragraph">“<em>AI will never be perfect, but with evals you can reduce the risk of embarrassment – which can cost you money, credibility, or a viral moment on Twitter.</em>&#8221;</p> </blockquote> <h2 class="wp-block-heading">How Do You Define an Evaluation Strategy?</h2> <figure class="wp-block-image size-large"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/Captura-de-pantalla-2025-04-26-a-las-21.44.14-1-1024x627.png" alt="" class="wp-image-602412"/><figcaption class="wp-element-caption">Image by <a href="https://unsplash.com/es/@akshayspaceship">akshayspaceship</a> on <a href="https://unsplash.com/">Unsplash</a></figcaption></figure> <p class="wp-block-paragraph">So how do we define our evaluations? Evals aren’t one-size-fits-all. They are use-case dependent and should align with the specific goals of your AI application. If you’re building a search engine, you might care about result relevance. If it’s a chatbot, you might care about helpfulness and safety. If it’s a classifier, you probably care about accuracy and precision. For systems with multiple steps (like an AI system that performs search, prioritizes results and then generates an answer) it&#8217;s often necessary to evaluate each step. 
The idea here is to measure if each step is helping reach the general success metric (and through this understand where to focus iterations and improvements).&nbsp;</p> <p class="wp-block-paragraph">Common evaluation areas include:&nbsp;</p> <ul class="wp-block-list"> <li class="wp-block-list-item"><strong>Correctness &amp; Hallucinations:</strong> Are the outputs factually accurate? Are they making things up?</li> <li class="wp-block-list-item"><strong>Relevance:</strong> Is the content aligned with the user’s query or the provided context?</li> <li class="wp-block-list-item"><strong>Format: </strong>Are outputs in the expected format (e.g., JSON, valid function call)?</li> <li class="wp-block-list-item"><strong>Safety, Bias &amp; Toxicity:</strong> Is the system generating harmful, biased, or toxic content?</li> <li class="wp-block-list-item"><strong>Task-Specific Metrics:</strong> For example, accuracy and precision for classification tasks, ROUGE or BLEU for summarization tasks, and regex or execution-without-error checks for code generation tasks.</li> </ul> <h2 class="wp-block-heading">How Do You Actually Compute Evals?</h2> <p class="wp-block-paragraph">Once you know what you want to measure, the next step is designing your test cases. This will be a set of examples (the more examples the better, but always balancing value and costs) where you have:&nbsp;</p> <ul class="wp-block-list"> <li class="wp-block-list-item"><strong>Input example</strong>:&nbsp; A realistic input your system will receive once in production.&nbsp;</li> <li class="wp-block-list-item"><strong>Expected Output</strong> (if applicable): Ground truth or example of desirable results.</li> <li class="wp-block-list-item"><strong>Evaluation Method:</strong> A scoring mechanism to assess the result.</li> <li class="wp-block-list-item"><strong>Score or Pass/Fail</strong>: The computed metric or pass/fail result for your test case.</li> </ul> <p class="wp-block-paragraph">Depending on your needs, time, and budget, there are several techniques you can use as evaluation methods:&nbsp;</p> <ul class="wp-block-list"> <li class="wp-block-list-item"><strong>Statistical Scorers:</strong> BLEU, ROUGE, METEOR, or cosine similarity between embeddings — good for comparing generated text to reference outputs.</li> <li class="wp-block-list-item"><strong>Traditional ML Metrics:</strong> Accuracy, precision, recall, and AUC — best for classification with labeled data.</li> <li class="wp-block-list-item"><strong>LLM-as-a-Judge:</strong> Use a large language model to rate outputs (e.g., “<em>Is this answer correct and helpful?</em>”). Especially useful when labeled data isn’t available or when evaluating open-ended generation.</li> <li class="wp-block-list-item"><strong>Code-Based Evals:</strong> Use regex, logic rules, or test case execution to validate outputs and formats.</li> </ul>
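<p class="wp-block-paragraph">To make this concrete, below is a minimal sketch of what computing an eval could look like in Python. Everything in it is a hypothetical placeholder: <code>run_system</code> stands in for your actual AI system, and the tiny test set, the JSON format check, and the accuracy score are just one possible combination of the methods above.</p> <pre class="wp-block-code"><code>
import json

# Hypothetical stand-in for the AI system under test; in practice this would
# call your model, chain, or pipeline.
def run_system(user_input):
    return '{"label": "negative"}'

# Each test case pairs a realistic input with an expected output.
test_cases = [
    {"input": "My order never arrived and nobody replies!", "expected_label": "negative"},
    {"input": "Thanks, the issue was solved quickly.", "expected_label": "positive"},
]

def format_is_valid(raw_output):
    # Code-based eval: the output must be valid JSON with a "label" field.
    try:
        return "label" in json.loads(raw_output)
    except json.JSONDecodeError:
        return False

def run_evals(cases):
    format_ok, correct = 0, 0
    for case in cases:
        raw = run_system(case["input"])
        if format_is_valid(raw):
            format_ok += 1
            # Invalid-format outputs simply count as incorrect predictions.
            correct += int(json.loads(raw)["label"] == case["expected_label"])
    return {"format_pass_rate": format_ok / len(cases), "accuracy": correct / len(cases)}

print(run_evals(test_cases))
</code></pre> <p class="wp-block-paragraph">In a real project the test set would be much larger and the scoring functions would match the evaluation areas you chose, but the structure of input, expected output, evaluation method, and score stays the same.</p>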
<h2 class="wp-block-heading">Wrapping it up</h2> <p class="wp-block-paragraph">Let’s bring everything together with a concrete example. Imagine you’re building a sentiment analysis system to help your customer support team prioritize incoming emails.&nbsp;</p> <p class="wp-block-paragraph">The goal is to make sure the most urgent or negative messages get faster responses — ideally reducing frustration, improving satisfaction, and decreasing churn. This is a relatively simple use case, but even in a system like this, with limited outputs, quality matters: bad predictions could lead to emails being prioritized essentially at random, meaning your team wastes time on a system that costs money.&nbsp;</p> <p class="wp-block-paragraph">So how do you know your solution is working with the needed quality? You evaluate. Here are some examples of things that might be relevant to assess in this specific use case:&nbsp;</p> <ul class="wp-block-list"> <li class="wp-block-list-item"><strong>Format Validation: </strong>Are the outputs of the LLM call to predict the sentiment of the email returned in the expected JSON format? This can be evaluated via code-based checks: regex, schema validation, etc.</li> <li class="wp-block-list-item"><strong>Sentiment Classification Accuracy: </strong>Is the system correctly classifying sentiments across a range of texts — short, long, multilingual? This can be evaluated with labeled data using traditional ML metrics — or, if labels aren’t available, using LLM-as-a-judge.</li> </ul> <p class="wp-block-paragraph">Once the solution is live, you will also want to include metrics that are more related to the final impact of your solution<em>:</em></p> <ul class="wp-block-list"> <li class="wp-block-list-item"><strong>Prioritization Effectiveness: </strong>Are support agents actually being guided toward the most critical emails? Is the prioritization aligned with the desired business impact?</li> <li class="wp-block-list-item"><strong>Final Business Impact:</strong> Over time, is this system reducing response times, lowering customer churn, and improving satisfaction scores?</li> </ul> <p class="wp-block-paragraph"><strong>Evals are key to ensuring we build useful, safe, valuable, and user-ready AI systems in production. </strong>So, whether you&#8217;re working with a simple classifier or an open-ended chatbot, take the time to define what “good enough” means (Minimum Viable Quality) — and build the evals around it to measure it!</p> <h2 class="wp-block-heading">References</h2> <p class="wp-block-paragraph">[1] <a href="https://hamel.dev/blog/posts/evals/">Your AI Product Needs Evals</a>, Hamel Husain</p> <p class="wp-block-paragraph">[2] <a href="https://www.confident-ai.com/blog/llm-evaluation-metrics-everything-you-need-for-llm-evaluation">LLM Evaluation Metrics: The Ultimate LLM Evaluation Guide</a>, Confident AI</p> <p class="wp-block-paragraph">[3] <a href="https://www.deeplearning.ai/short-courses/evaluating-ai-agents/">Evaluating AI Agents</a>, deeplearning.ai + Arize</p> <p>The post <a href="https://towardsdatascience.com/how-to-ensure-your-ai-solution-does-what-you-expect-it-to-do/">How to Ensure Your AI Solution Does What You Expect It to Do</a> appeared first on <a href="https://towardsdatascience.com">Towards Data Science</a>.</p>
  10. Struggling to Land a Data Role in 2025? These 5 Tips Will Change That

    Tue, 29 Apr 2025 01:10:13 -0000

    Your dream data job isn’t ghosting you—you just need to search smart.

    The post Struggling to Land a Data Role in 2025? These 5 Tips Will Change That appeared first on Towards Data Science.

    <p class="wp-block-paragraph"><mdspan datatext="el1745888927513" class="mdspan-comment">Breaking</mdspan> into the tech world is no longer as easy (or glamorous) as it used to be. Lots of people are finding it difficult to find their way into the current tech market. This can be due to lots of reasons like a competitive job market, lack of openings, higher demands for senior positions, massive layoffs, etc. Meanwhile, the tech space continues to beam with lots of prospects who have acquired the right skills to excel in various roles. While the playing ground might not be favourable for all, there are smart ways to enhance your <a href="https://towardsdatascience.com/tag/job-search/" title="Job Search">Job Search</a> to tilt the odds to your favour.</p> <p class="wp-block-paragraph">The way people find and apply to available roles has come a long way over the years. We have moved from periods of advertising openings on newspapers to a point where you can send out tens of applications with a few clicks of a button. Lots of technologies have made job ads more reachable, and recruiters can easily discover professionals online. In fact, some of the best times were when recruiters were the ones chasing candidates. Ah, the glory days!</p> <p class="wp-block-paragraph">The old methods of finding new opportunities thrived due to some enabling factors. There was abundance of opportunities in the tech industry due to the high rate of adoption. Lots of companies were willing to digitalize their business processes. There was shortage of people with the right skills for the fast growing tech industry. And to cap it all off, there wasn&#8217;t a tool as advanced as AI that could handle most mundane tasks. All these and a lot more factors made it easier to break into the tech industry. Fast forward to now, things have changed a lot and have gotten tougher for most newbies who are facing desperate times.</p> <p class="wp-block-paragraph">As the saying goes, <em>desperate times calls for desperate measures</em>. Standing out in a fierce market can make all the difference in landing you your next role. You can stand out based on your profile, work experiences, portfolio, skillset, etc. But in this article, I will show you how to stand out by making yourself more available to niche opportunities. These 5 tips will have you stand out and present yourself to more opportunities.</p> <h2 class="wp-block-heading">Boolean search</h2> <p class="wp-block-paragraph">Advanced search techniques like the Boolean search can be used on job boards to find more refined roles based on your search queries. Boolean search uses keywords like &#8220;AND&#8221;, &#8220;OR&#8221; and &#8220;NOT&#8221; to limit search results. LinkedIn is a prime example of how much your search query can be made more specific. 
Instead of just using the LinkedIn <a href="https://towardsdatascience.com/tag/jobs/" title="Jobs">Jobs</a> tab, you can run a Boolean search on posts to discover LinkedIn posts from recruiters who are hiring.</p> <figure class="wp-block-image aligncenter"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/03/Boolean-Search.png" alt="An example of Boolean search on LinkedIn" class="wp-image-600822"/><figcaption class="wp-element-caption">Boolean search on LinkedIn / Image by author</figcaption></figure> <p class="wp-block-paragraph">The search query <strong>&#8220;data analyst&#8221; AND &#8220;hiring&#8221; AND &#8220;London&#8221; </strong>returns all LinkedIn posts containing these search terms. These posts are almost always recruiters looking to hire someone. This can be the quickest way to come across a job ad on LinkedIn. Boolean search is not exclusive to LinkedIn, and similar search techniques work on many other job websites.</p> <h2 class="wp-block-heading">Extended keywords search</h2> <p class="wp-block-paragraph">The extended keywords search technique helps you search for job opportunities using search terms other than the job title. Search terms that include tools, workflow names, potential project names, etc., can be used while searching for jobs online. Instead of just searching for &#8220;data analyst&#8221;, someone looking for a data analyst role can search for tools like &#8220;Power BI&#8221; or &#8220;Tableau&#8221; on a job advertising website. This technique works because a tool-based search will produce results that match your skillset. Secondly, different companies often use varying job titles for what is essentially the same role. Company &#8216;A&#8217; can have an opening for a Business Analyst, but Company &#8216;B&#8217; can decide to name the same role Data Analyst or Product Analyst. The tools and responsibilities across these different roles can be the same. By focusing your search around those tools and other keywords, you can sidestep the confusion and unlock more relevant opportunities you might have otherwise missed.</p> <h2 class="wp-block-heading">Recommendations through networking</h2> <p class="wp-block-paragraph">Recommendations are a well-known way of getting your foot in the door when it comes to breaking into a new field. While recommendations work to a great extent, the person who gives the recommendation is just as important. This is where networking comes in. Networking without a purpose will most likely result in short-lived contacts. You can start to network with people specifically to get recommendations from them. So the next time you attend a career fair, connect with company representatives on LinkedIn and reach out to them for recommendations. The recommendation can then be used as an additional document when applying for a role within the company where your recommender works. This approach should give you a boost and get your application more attention. Requesting a recommendation (letter) is an easier ask than an outright request for a job, so don&#8217;t be surprised when most people say yes to giving a recommendation letter.</p> <h2 class="wp-block-heading">Cold-calling</h2> <p class="wp-block-paragraph">Hear me out: I know I said asking for a job is a big ask, but cold-calling is not a bad idea, especially when done strategically. 
<em>What&#8217;s the strategy?</em> Thanks for asking. Cold-calling is a way to get yourself out there, to be seen more by companies and industry professionals. You can cold-call to enquire about available openings in a company and to demonstrate your skillset. You might be turned down in most cases, and that is totally fine. You can also request an informational interview with the company just to evaluate yourself, irrespective of an available role. An informational interview can either make the company reconsider you for a role or help you prepare for subsequent interviews. Either way, you come out better. So look up that company you&#8217;ve been wanting to work for and give them a ring.</p> <h2 class="wp-block-heading">Nearby companies</h2> <p class="wp-block-paragraph">Google Maps is an underrated tool when it comes to job searching. It isn&#8217;t just a tool that helps you find your way to the nearest Tesco; it can also be used to identify and locate companies around you based on specific industries. With the right search query, Google Maps can help you identify nearby companies in any industry just by searching for that industry. You can search for &#8220;software companies near me&#8221; and this returns a map of all registered companies around you, with their websites and email addresses. This is a good source for discovering companies you can easily visit in person.</p> <figure class="wp-block-image size-large"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/Screenshot-2025-04-21-160031-1024x483.png" alt="" class="wp-image-601993"/><figcaption class="wp-element-caption">Google Maps for finding nearby companies / Image by author</figcaption></figure> <p class="wp-block-paragraph">You can reach out to the companies identified on Google Maps and ask about available roles. You can also visit the company websites to learn more about each company and whether it is the right fit for you. And you can pay a visit in person if the company is a short commute from where you are.</p> <h2 class="wp-block-heading">Conclusion</h2> <p class="wp-block-paragraph">Breaking into the tech industry may be tougher than it once was, but it&#8217;s far from impossible. While the market is more competitive and roles are more demanding, creative and strategic job search methods can give you a significant edge. By utilising smart techniques like Boolean search, extended keyword searches, networking for strong recommendations, cold-calling, and exploring nearby companies using tools like Google Maps, you can expand your reach and visibility in the job market. It’s about working smarter, not just harder—adapting your approach can make all the difference in landing your next opportunity.</p> <hr class="wp-block-separator has-alpha-channel-opacity is-style-dotted"/> <p class="wp-block-paragraph"><strong>Thank You!</strong></p> <p class="wp-block-paragraph" id="203a"><em>Enjoyed this article? </em><a href="https://medium.com/@doziesixtus"><em>Follow me</em></a><em>&nbsp;to get notifications whenever I publish a new story. I will be publishing more articles in this space. Cheers!</em></p> <p>The post <a href="https://towardsdatascience.com/struggling-to-land-a-data-role-in-2025-these-5-tips-will-change-that/">Struggling to Land a Data Role in 2025? These 5 Tips Will Change That</a> appeared first on <a href="https://towardsdatascience.com">Towards Data Science</a>.</p>
  11. 5 Strategies for Securing and Scaling Streaming Data in the AI Era

    Wed, 30 Apr 2025 18:00:45 -0000

    "5 Strategies for Securing and Scaling Streaming Data in the AI Era" featured image.

    Streaming data underpins real-time personalization campaigns, fraud detection, predictive maintenance and an ever-expanding set of business-critical initiatives. With AI now

    The post 5 Strategies for Securing and Scaling Streaming Data in the AI Era appeared first on The New Stack.

    Protecting streaming data is a strategic imperative. Here are five strategies for building secure, scalable data streams ready for the AI era.
  12. Should You Try Small Language Models for AI App Development?

    Wed, 30 Apr 2025 17:00:27 -0000

    "Should You Try Small Language Models for AI App Development?" featured images. Small letters in a jumble on a notebook

    Most enterprises share a common goal: to bring their most critical business operations as close as possible to the audiences

    The post Should You Try Small Language Models for AI App Development? appeared first on The New Stack.

    SLMs may offer greater privacy, security and business opportunities than LLMs powering GenAI, but they aren’t right for every use case.
  13. What Is MCP? Game Changer or Just More Hype?

    Wed, 30 Apr 2025 15:00:02 -0000

    "What is MCP? Game-Changer or Just More Hype?" featured image. Stick figure next to question marks

    The hype for Anthropic’s Model Context Protocol (MCP) has reached a boiling point. Everyone is releasing something around MCP to

    The post What Is MCP? Game Changer or Just More Hype? appeared first on The New Stack.

    Interest in and confusion about the Model Context Protocol exist in equal measure. Dive deep into the details around MCP in part 1 of this series.
  14. Scaling AI Agents in the Enterprise: The Hard Problems and How to Solve Them

    Wed, 30 Apr 2025 14:06:38 -0000

    AI agents are evolving beyond simple chat-based interactions into systems that execute workflows, manage state, and make decisions across long-running

    The post Scaling AI Agents in the Enterprise: The Hard Problems and How to Solve Them appeared first on The New Stack.

    <img width="1024" height="683" src="https://cdn.thenewstack.io/media/2025/04/8591a6f4-marvin-meyer-syto3xs06fu-unsplash-1-1024x683.jpg" class="webfeedsFeaturedVisual wp-post-image wp-stateless-item" alt="" style="display: block; margin: auto; margin-bottom: 20px;max-width: 100%;" link_thumbnail="" decoding="async" loading="lazy" data-image-size="large" data-stateless-media-bucket="cdn.thenewstack.io" data-stateless-media-name="media/2025/04/8591a6f4-marvin-meyer-syto3xs06fu-unsplash-1-scaled.jpg" /><!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"> <html><body><p>AI agents are evolving beyond simple chat-based interactions into systems that execute workflows, manage state, and make decisions across long-running processes. These architectures are being adopted in use cases ranging from fully autonomous AI interviewers (Mercor) to financial risk analysis (Robinhood) and data pipeline automation (Databricks). However, moving from prototype to production introduces several technical challenges and introduces a new set of challenges:</p> <ol> <li>State persistence: AI agents need memory beyond a single prompt-response loop.</li> <li>Reliable execution: If an agent fails during a task, it needs to recover gracefully.</li> <li>Multi-agent coordination: Agents need to interact, share knowledge, and delegate tasks.</li> </ol> <p>Each of these challenges (and of course, security, authentication, and authorization for agents) requires careful architectural decisions, as well as a shift from treating <a href="https://thenewstack.io/a-comprehensive-guide-to-function-calling-in-llms/" data-wpil-monitor-id="2029" class="local-link">LLMs as simple function calls</a> to designing AI agents as robust, distributed systems.</p> <h2>1. State Persistence: AI Agents Need To Persist State Beyond a Single Prompt-Response Loop</h2> <h3>The Problem: Stateless LLMs Fail in Long-Running Workflows</h3> <p>Most AI applications today use LLMs in a stateless fashion: each query is treated independently with no recall of prior interactions. This works for simple queries but <a href="https://thenewstack.io/ai-agents-in-doubt-reducing-uncertainty-in-agentic-workflows/" data-wpil-monitor-id="2032" class="local-link">fails in complex workflows where agents</a> must remember prior steps, decisions, or user inputs.</p> <h3>Example: Stateful AI for Technical Interviews (Mercor)</h3> <p>Take Mercor, for example, a $2B+ company backed by Benchmark, General Catalyst, and Felicis. They&rsquo;re building a fully autonomous AI interviewer that adapts in real-time to how a candidate performs. The system could initiate role- and level-specific interview questions, then adaptively generate follow-up questions in real-time based on the candidate&rsquo;s responses and performance signals. It would conclude by synthesizing performance data to deliver a rigorous, data-driven, and unbiased evaluation of candidate aptitude and fit.</p> <p>To ask questions that logically follow a candidate&rsquo;s responses, Mercor&rsquo;s system needs to be able to persist state, including:</p> <ul> <li>A candidate&rsquo;s technical responses and code submissions</li> <li>Areas they struggled with or excelled in</li> <li>Reference points from similar interviews</li> </ul> <p>Without a persistent state layer, the experience would feel disjointed and repetitive, and worse, the system wouldn&rsquo;t be able to make a fair or informed evaluation. 
Real-time, adaptive agents don&rsquo;t just benefit from memory; they depend on it. Agentic memory in its various flavors and design patterns ensures long-running context for LLMs.</p> <div id="attachment_22785633" style="width: 602px" class="wp-caption aligncenter"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-22785633" class="wp-image-22785633 size-large" src="https://cdn.thenewstack.io/media/2025/04/6f0c3bbe-image3-592x1024.png" alt="" width="592" height="1024"><p id="caption-attachment-22785633" class="wp-caption-text">Mercor AI</p></div> <h3>Solution: Architecting State for AI Agents With Embeddings</h3> <p>Persisting state in AI systems isn&rsquo;t a one-size-fits-all problem. Different workflows require different types of recall: some are semantic and fuzzy, while others are exact and structured. For agents to operate effectively in real-world environments, memory must be purpose-built, composable, and optimized for both speed and relevance.</p> <ul> <li><strong>Vector Databases (e.g., Pinecone, Weaviate, Supabase pgvector): </strong>For tasks like summarization, knowledge retrieval, or referencing prior conversations, try using a vector database like Pinecone, Weaviate, or Supabase&rsquo;s pgvector extension.</li> <li><strong>Structured Storage (e.g., Letta/MemGPT): </strong>In contrast, when workflows require exact tracking, think multistep processes, form completion, or reasoning over previous decisions, structured memory is essential. Tools like Letta shine here.</li> <li><strong>Agentic Memory Layers: </strong>The most advanced memory architectures dynamically integrate both fuzzy and structured memory into a single runtime. Letta, for instance, enables LLMs to operate beyond the fixed-token context window by layering long-term memory directly into the agent architecture.</li> </ul> <p>A fundamental breakthrough in this area is the use of embeddings. The process involves indexing and storing embeddings, then employing search and retrieval techniques to manage long-term state effectively.</p> <p><strong>Embeddings<br> </strong>Embeddings convert segments of text, such as memories or conversation fragments, into high-dimensional numerical vectors that encapsulate their semantic content. A neural network, trained to recognize linguistic patterns, transforms text into vectors that reflect meaning, context, and the relationships between words or phrases. The resulting high-dimensional space allows for the establishment of contextual similarity, where proximity between vectors indicates relatedness.</p>
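<p>As a rough, self-contained sketch of that idea, the snippet below uses a toy bag-of-words function in place of a real embedding model and a plain Python list in place of a vector database; the names are illustrative rather than taken from any particular library. Semantic recall then amounts to ranking stored vectors by cosine similarity to the query vector:</p> <pre><code>
import math
from collections import Counter

def embed(text):
    # Toy embedding: a bag-of-words vector. A real system would call an
    # embedding model here instead.
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# In-memory stand-in for a vector database: (vector, original text) pairs.
memory_store = []

def remember(text):
    memory_store.append((embed(text), text))

def recall(query, top_k=2):
    query_vec = embed(query)
    ranked = sorted(memory_store,
                    key=lambda item: cosine_similarity(query_vec, item[0]),
                    reverse=True)
    return [text for _, text in ranked[:top_k]]

remember("Candidate struggled with dynamic programming questions")
remember("Candidate excelled at system design trade-offs")
remember("Interview scheduled for next Tuesday at 10am")

print(recall("How did the candidate do on algorithm questions?"))
</code></pre> <p>A production system swaps the toy embedding for a learned model and the list for an indexed vector store, but the remember-then-rank-by-similarity loop is the same.</p>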
<p><strong>Interplay with Vector Databases<br> </strong>Vector databases store these high-dimensional vectors along with associated metadata (e.g., tags, timestamps) and enable similarity searches through various methods:</p> <ul> <li><strong>Approximate Nearest Neighbors (ANN): </strong>Techniques like Hierarchical Navigable Small World (HNSW), Locality-Sensitive Hashing (LSH), and Annoy (which uses random projection trees) quickly narrow down the search to approximate nearest neighbors.</li> <li><strong>Inverted File Indexing (IVF): </strong>This method clusters the vector space, limiting searches to relevant dataset segments.</li> <li><strong>Quantization: </strong>Optimized product quantization compresses vectors, accelerating distance computations.</li> <li><strong>Hybrid Approaches: </strong>Often, a combination of these techniques is employed to balance speed and accuracy.</li> </ul> <p><strong>Enhancing Agentic Memory<br> </strong>During retrieval, new inputs are transformed into vectors, and the vector database is queried to find semantically similar vectors. This process enables the agent to maintain continuity over extended interactions, ensuring smoother transitions and coherent decision-making while minimizing latency. As new tasks are completed or interactions are finished, their embeddings are generated and added to the database. This continuous update mechanism allows the agent to dynamically refine its memory, enhancing its long-term contextual awareness and overall learning capabilities.</p> <h2>2. Reliable Execution: If an Agent Fails Midtask, It Needs To Recover Gracefully</h2> <h3>The Problem: Failures in Multistep AI Workflows</h3> <p>LLM-based agents rarely operate in a vacuum; they frequently interact with APIs, databases, and various external systems. When an agent encounters an error during execution, it must handle the issue gracefully instead of starting over entirely.</p> <p>A significant challenge in productionizing AI agents is durability; if an LLM generates responses in a long-running workflow and the process fails midway, does the entire session reset? Or does it recover where it left off? Unlike traditional web applications, which rely on database-backed statefulness, AI agents often operate in stateless environments unless they are explicitly designed for fault tolerance.</p> <h3>Example: Robinhood&rsquo;s AI-Powered Trading Agent</h3> <p>Robinhood leverages AI agents to function as market analysts, assisting users in constructing trades. These agents integrate high-fidelity market data with real-time trading information, historical trade patterns, and proprietary insights regarding retail trading behavior, enabling the formulation of a robust stock thesis. Operating within a high-stakes, low-latency framework, the system is engineered to prevent incorrect answers or failures that could lead to financial loss or regulatory challenges.</p> <p>To guarantee reliable trade execution, Robinhood deploys a multilayered AI model fallback architecture as described below:</p> <ul> <li><strong>Primary High-Performance LLM for Critical Decision-Making:<br> </strong>&nbsp;A compute-intensive large language model (LLM) processes complex market conditions using chain-of-thought reasoning to generate detailed market insights. 
A dedicated subsystem interfaces with this model to mitigate hallucinations and ensure the reliability of outputs.</li> <li><strong>Secondary Lightweight LLM for Summarization:<br> </strong>&nbsp;The detailed insights from the primary LLM are then routed to a more cost-efficient, lower-latency model that produces concise summaries. This dual-model approach balances performance with operational cost efficiency.</li> <li><strong>Failover and Redundancy Mechanism:<br> </strong>&nbsp;In the event of a primary LLM failure, the system automatically fails over to the secondary model or retrieves cached responses from a historical vector database. This design ensures operational continuity under adverse conditions.</li> <li><strong>Event-Driven Asynchronous Execution:<br> </strong>&nbsp;The AI-generated insights are presented to the user, who then specifies trade parameters such as price and time targets. These inputs are asynchronously queued and processed to decouple execution stages, preventing error propagation. In case of a failure during any execution step, the process is designed to roll back and retry, rather than initiating a complete system reset.</li> </ul> <p>This architecture enables Robinhood&rsquo;s AI-driven insights and trade construction platform to maintain near-100% uptime, significantly reducing order failures while effectively managing AI inference costs.</p> <h3>Solution: Ensuring Reliability in AI Agent Workflows</h3> <p>For developers, reliability isn&rsquo;t just about uptime. It&rsquo;s about trust. If an agent drops state halfway through a task, fails silently, or returns inconsistent outputs, it breaks the entire user experience. Building for reliability means thinking like a systems engineer: handling retries, surfacing errors cleanly, and making sure agents can recover without starting from scratch.</p> <ul> <li><strong>Orchestration frameworks</strong> (Temporal.io, Crew.ai, Langchain): Provide stateful execution, retries, and recovery mechanisms.</li> <li><strong>Multi-LLM routing</strong>: Intelligent load balancing between foundation models to optimize availability and model feature use, in addition to giving levers for cost and latency tradeoffs.</li> <li><strong>Versioning &amp; Rollback:<br> </strong>AI agents should regularly save checkpointed execution states to enable recovery in case of failures. Moreover, using a model registry for version control provides a robust rollback mechanism, ensuring that if a new model update does not meet expectations, the system can quickly revert to a previous stable version.</li> <li><strong>Blue/Green Deployments for Models:<br> </strong>In production, blue/green deployments are employed by routing traffic between two model versions. For instance, the &ldquo;blue&rdquo; model might handle 95% of the traffic while the &ldquo;green&rdquo; model handles 5%. Once the green model demonstrates the required accuracy, availability, and stability, it is promoted to blue, and a new improved green model is introduced. This strategy minimizes risk and ensures continuous, reliable performance.</li> </ul>
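<p>To make the failover idea concrete, here is a deliberately simplified Python sketch of the retry-then-fall-back-then-cache pattern described above. The function names and the in-memory cache are hypothetical stand-ins, and a production system would delegate this logic to an orchestration framework rather than hand-rolled retries:</p> <pre><code>
import time

class ModelUnavailable(Exception):
    pass

def call_primary_model(prompt):
    # Stand-in for the compute-intensive primary LLM; it always fails in this demo.
    raise ModelUnavailable("primary model timed out")

def call_fallback_model(prompt):
    # Stand-in for the cheaper, lower-latency secondary LLM.
    return "summary produced by fallback model"

response_cache = {"market outlook": "cached insight from an earlier run"}

def generate_insight(prompt, max_retries=2):
    # 1. Retry the primary model with a short backoff.
    for attempt in range(max_retries):
        try:
            return call_primary_model(prompt)
        except ModelUnavailable:
            time.sleep(0.1 * (attempt + 1))
    # 2. Fail over to the secondary model.
    try:
        return call_fallback_model(prompt)
    except ModelUnavailable:
        pass
    # 3. Last resort: serve a cached response so the workflow can continue.
    return response_cache.get(prompt, "service temporarily degraded")

print(generate_insight("market outlook"))
</code></pre> <p>The ordering matters: retry the cheap recovery first, fail over second, and only then degrade to a cached or partial answer instead of resetting the whole workflow.</p>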
<h2>3. Multi-Agent Systems: Distributed AI Coordination</h2> <h3>The Problem: Single LLMs Are Inefficient for Complex Tasks</h3> <p>Monolithic LLMs fall short when executing complex, multistep workflows. Instead, different tasks require specialized AI models &mdash; some tailored for reasoning, others for retrieval, and still others for execution. For example, consider a system where agents process multimodal inputs (such as text, voice, and video) and generate not only multimodal outputs but also trigger actions, whether through direct computer interaction, function calls, or web-based operations.</p> <p><img loading="lazy" decoding="async" class="aligncenter wp-image-22785634 size-large" src="https://cdn.thenewstack.io/media/2025/04/79290e4b-image2-1024x841.png" alt="" width="1024" height="841"></p> <p>Single-agent systems are inherently limited by their sequential processing and single-threaded decision-making, restricting their ability to execute tasks in parallel and distribute workloads effectively. This constraint makes them less suitable for complex, real-world applications, prompting the development of multi-agent systems that can handle distributed tasks more efficiently.</p> <p>This evolution reflects a broader trend in agentic AI toward enhanced planning, execution, and self-optimization. Even within a single-agent framework, advancements in these areas have paved the way for more robust and resilient AI systems.</p> <h3>Example: AI Agents in Enterprise Security (Palo Alto Networks)</h3> <p>Enterprise <a href="https://thenewstack.io/ai-security-agents-combat-ai-generated-code-risks/" data-wpil-monitor-id="2027" class="local-link">security requires AI agents</a> that perform distinct functions:</p> <ol> <li>Threat detection agent: Monitors logs and flags anomalies</li> <li>Risk assessment agent: Uses ML-based models to evaluate threats</li> <li>Remediation agent: Automates security responses</li> </ol> <p>Each of these components requires specialized AI, rather than a single LLM attempting to handle the entire workflow.</p> <h3>Solution: Architecting Multi-Agent Systems</h3> <p>Decentralized agent architectures, such as mixtures of experts, distribute tasks among specialized models, thereby reducing inference overhead. Several popular multi-agent architectures include:</p> <ul> <li><strong>Supervisory Agent:</strong> All agents communicate with a central supervisor for coordinated plan execution.</li> <li><strong>Networked Agents:</strong> Each agent can interact directly with others.</li> <li><strong>Hierarchical Systems:</strong> A layered structure where supervisors coordinate other supervisors to tackle complex tasks.</li> <li><strong>Custom Architectures:</strong> User-defined setups that enable coordination among only a subset of agents for specific task execution.</li> </ul> <h3>Event-Driven Communication</h3> <p>Agents exchange state updates via <a href="https://thenewstack.io/choosing-between-message-queues-and-event-streams/" data-wpil-monitor-id="2030" class="local-link">message queues or event</a> buses. This event-driven approach allows agents to operate on specific triggers by reading and writing messages to distributed queues, eliminating the need for constant polling or direct connections, reducing latency, and enhancing scalability.</p>
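<p>As a minimal illustration of that trigger-driven hand-off, the sketch below uses Python&rsquo;s in-process <code>queue.Queue</code> as a stand-in for a distributed message bus, with trivial functions in place of the security agents described above; it is illustrative rather than production code:</p> <pre><code>
import queue

# In-process stand-in for a distributed event bus / message queue.
event_bus = queue.Queue()

def threat_detection_agent(log_line):
    # Publishes an event instead of calling the next agent directly.
    if "failed login" in log_line:
        event_bus.put({"type": "anomaly_detected", "detail": log_line})

def risk_assessment_agent():
    # Triggered by events on the bus; no polling of the detection agent needed.
    while not event_bus.empty():
        event = event_bus.get()
        severity = "high" if "admin" in event["detail"] else "low"
        yield {"type": "risk_assessed", "severity": severity, "source": event}

threat_detection_agent("failed login for user admin from 10.0.0.7")
for assessment in risk_assessment_agent():
    print(assessment)
</code></pre>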
<h3>Cross-Agent Memory</h3> <p>AI agents maintain a shared knowledge base to facilitate smooth task hand-offs. They store and retrieve data from shared memory to maintain a unified context around shared goals. This data is encoded into rich, context-aware representations using LLM-friendly embeddings. A key challenge here is ensuring that shared memory implements proper locking and version control to prevent issues from concurrent updates.</p> <p><strong>Longer-Term Retrieval and Interplay With Microservices</strong></p> <p>A key unlock for longer-term retrievals is the implementation of a unified data access layer to ensure agents retrieve the correct information. One effective approach combines GraphQL with the <a href="https://thenewstack.io/building-your-first-model-context-protocol-server/" data-wpil-monitor-id="2028" class="local-link">Model Context Protocol</a> (MCP) for consistent data delivery.</p> <p>GraphQL serves as a versatile API layer by providing a single endpoint that allows clients to fetch only the data they need, thereby avoiding issues related to overfetching or underfetching. The Model Context Protocol <a href="https://thenewstack.io/automating-context-in-structured-data-for-llms/" data-wpil-monitor-id="2026" class="local-link">standardizes how contextual data</a> is packaged and delivered to AI models, ensuring they have the precise context required for accurate decision-making. When these technologies are integrated for agentic AI systems, they dynamically supply the necessary context to autonomous agents, <a href="https://thenewstack.io/boost-ai-efficiency-data-chunking-meets-document-databases/" data-wpil-monitor-id="2031" class="local-link">boosting both their adaptability and overall efficiency in data</a> retrieval and consistency.</p> <p>Modern agentic systems in real-world applications typically comprise:</p> <ol> <li><strong>Plan, Execute, Decide loops</strong></li> <li><strong>Self-improvement</strong> via in-context learning or fine-tuning</li> <li><strong>Tools and plan optimization</strong></li> <li><strong>Continuous evaluations</strong></li> <li><strong>Robust observability</strong></li> </ol> <p><img loading="lazy" decoding="async" class="aligncenter wp-image-22785635 size-large" src="https://cdn.thenewstack.io/media/2025/04/e58aff19-image4-1024x765.png" alt="" width="1024" height="765"></p> <h3>Solving for Continuous Improvements, Learning and a Mixture of Experts</h3> <p>These systems can be further improved through trajectory optimization, using expert tools and communicators, to generate the desired task output. Supervised fine-tuning can be done over planner, tool-user, or communicator data to build expert bots that are good at specific tasks while being lower in memory footprint and latency.</p> <p><img loading="lazy" decoding="async" class="aligncenter wp-image-22785636 size-large" src="https://cdn.thenewstack.io/media/2025/04/21f995ec-image1-1024x917.png" alt="" width="1024" height="917"></p> <h2>Designing for the Hard Path Is the Only Path</h2> <p>The next generation of enterprise AI will be driven by systems thinking.</p> <p>AI agents that succeed at scale don&rsquo;t just take in prompts; they will remember, recover, and collaborate. Building for state persistence, reliable execution, and multi-agent coordination isn&rsquo;t optional. It&rsquo;s foundational. These capabilities are the difference between a prototype that demos well and a system that delivers every single day in production.</p> <p>For teams serious about deploying AI agents in high-stakes, high-complexity environments, the call to action is clear: treat agents like distributed systems. Invest in infrastructure, not just inference. Build persistent state as a core service, not an afterthought. 
Embrace redundancy, modularity, and the architectural rigor that the enterprise demands.</p> <p>The post <a href="https://thenewstack.io/scaling-ai-agents-in-the-enterprise-the-hard-problems-and-how-to-solve-them/">Scaling AI Agents in the Enterprise: The Hard Problems and How to Solve Them</a> appeared first on <a href="https://thenewstack.io">The New Stack</a>.</p>
  15. Interop Unites Browser Makers To Smooth Web Inconsistencies

    Wed, 30 Apr 2025 12:00:40 -0000

    For the last four years, the major browser vendors, web standards creators and other contributors to browser engines have joined

    The post Interop Unites Browser Makers To Smooth Web Inconsistencies appeared first on The New Stack.

    For the past four years, major browser vendors have collaborated to improve web interoperability by coordinating enhancements to inconsistent browser implementations.
  16. Ship Fast, Break Nothing: LaunchDarkly’s Winning Formula

    Tue, 29 Apr 2025 23:00:17 -0000

    In today’s rapidly evolving software landscape, the ability to deliver code quickly while maintaining stability has become a critical competitive

    The post Ship Fast, Break Nothing: LaunchDarkly’s Winning Formula appeared first on The New Stack.

    LaunchDarkly pioneered the feature management category and is now transforming software delivery with Guarded Releases, AI support and more.
  17. Basic Python Syntax: A Beginner’s Guide To Writing Python Code

    Tue, 29 Apr 2025 19:00:08 -0000

    Every programming language has a unique syntax. Some languages borrow syntax from others, while others create something wholly different. No

    The post Basic Python Syntax: A Beginner’s Guide To Writing Python Code appeared first on The New Stack.

    Learn all the basic Python syntaxes you need to start coding. This guide covers comments, variables, functions, loops, and more — explained simply for beginners.
  18. 6 Ways AI Is Upending the DevOps Lifecycle

    Tue, 29 Apr 2025 18:00:45 -0000

    "6 Ways AI Is Upending the DevOps Life Cycle" featured image. Person's legs upside down behind desk

    The AI revolution isn’t knocking at DevOps’ door — it’s already redecorating the house. While individual teams have been experimenting

    The post 6 Ways AI Is Upending the DevOps Lifecycle appeared first on The New Stack.

    With organizations putting AI into action, the DevOps ecosystem is poised for transformation — becoming more efficient, resilient and autonomous.
  19. TLA+ Creator Leslie Lamport: Programmers Need Abstractions

    Tue, 29 Apr 2025 17:00:53 -0000

    Leslie Lamport talk at SCaLE 22x

    The 84-year-old Leslie Lamport is a legend. A Microsoft web page (where he once worked as a research scientist) notes

    The post TLA+ Creator Leslie Lamport: Programmers Need Abstractions appeared first on The New Stack.

    Why it&#039;s crucial to think at a higher level than code before writing it.
  20. Synadia Attempts To Reclaim NATS Back From CNCF 

    Tue, 29 Apr 2025 15:00:42 -0000

    It has become almost commonplace to read about yet another company having regrets about open sourcing their flagship product and

    The post Synadia Attempts To Reclaim NATS Back From CNCF  appeared first on The New Stack.

    This unexpected intellectual property dispute has sparked an open source governance clash. 
  21. The Bitnami Open Source Application Catalog Turns 18!

    Tue, 29 Apr 2025 14:00:54 -0000

    "The Bitnami Open Source Application Catalog Turns 18!" featured image. Cupcake with #18 and party hats

    The Bitnami open source application catalog fundamentally changed the way developers access and deploy open source software when it was

    The post The Bitnami Open Source Application Catalog Turns 18! appeared first on The New Stack.

    Learn how Bitnami has evolved from installers and virtual machines to support app management on clouds, containers and Kubernetes.
  22. Why Kubernetes Cost Optimization Keeps Failing

    Tue, 29 Apr 2025 13:00:03 -0000

    Yodar Shafrir, co-founder and CEO of ScaleOps, explained at KubeCon + CloudNativeCon Europe that dynamic, cloud-native applications have constantly shifting loads, making resource allocation complex.

    Businesses always care about how much money they’re spending. But these days, with an uncertain global economy and new demands

    The post Why Kubernetes Cost Optimization Keeps Failing appeared first on The New Stack.

    Cloud native apps and Kubernetes are dynamic, making it hard to contain resource costs. Yodar Shafrir of ScaleOps offers a solution in this episode of Makers.
  23. How MCP Puts the Good Vibes Into Cloud Native Development

    Mon, 28 Apr 2025 21:00:16 -0000

    "How MCP Puts the Good Vibes Into Cloud Native Development" featured image. Colorful AI-generated image of a person in a space suit

    For the first time in years, developers are saying the work feels fun again. That’s not just sentiment. It’s the

    The post How MCP Puts the Good Vibes Into Cloud Native Development appeared first on The New Stack.

    MCP is putting the fun back into work by shifting devs’ role from troubleshooters to designers creating and shipping innovative features.
  24. How To Run a Python Script on MacOS, Windows, and Linux

    Mon, 28 Apr 2025 19:12:25 -0000

    Stop copy‑pasting the same command every time you run Python. This guide will teach the practical ways to run Python

    The post How To Run a Python Script on MacOS, Windows, and Linux appeared first on The New Stack.

    Learn how to run Python scripts on macOS, Windows, and Linux with this practical guide. Master command-line execution, IDE shortcuts, scheduling scripts, and more.
  25. OpenTofu Joins CNCF: New Home for Open Source IaC Project

    Mon, 28 Apr 2025 15:13:57 -0000

    It’s not easy running an open source non-profit group. Just ask anyone who’s tried. So it should come as no

    The post OpenTofu Joins CNCF: New Home for Open Source IaC Project appeared first on The New Stack.

    A special CNCF licensing exception helps solidify OpenTofu as a vendor-neutral, community-driven Infrastructure as Code (IaC) solution.
  26. Apache Airflow 3.0: From Data Pipelines to AI Inference

    Mon, 28 Apr 2025 14:00:42 -0000

    Approximately 10 years ago, Apache Airflow launched with a relatively simple, yet timeless premise. It was initially devised as a

    The post Apache Airflow 3.0: From Data Pipelines to AI Inference appeared first on The New Stack.

    Latest edition provides DAG versioning, remote execution capabilities, range of scheduling options, and more.
  27. How To Master Vector Databases

    Mon, 28 Apr 2025 13:00:26 -0000

    "How to Master Vector Databases" featured image. Gauge showing levels from Novice to Master

    Machine learning (ML), AI and endless streams of data are reshaping how we solve problems. But when it comes to

    The post How To Master Vector Databases appeared first on The New Stack.

    Learn how to choose the right vector database for your use case, how to index and query your data, and how to optimize performance in this comprehensive guide.
  28. FerenOS: A Refreshing Take on KDE Plasma That Could Win You Over

    Sun, 27 Apr 2025 15:00:26 -0000

    Once upon a time, FerenOS was based on Linux Mint and used a special edition of the Cinnamon desktop. In

    The post FerenOS: A Refreshing Take on KDE Plasma That Could Win You Over appeared first on The New Stack.

    FerenOS has a polished KDE Plasma implementation and includes the customizable Vivaldi browser as the default, all while delivering impressive performance.
  29. How to Run a Generative AI Developer Tooling Experiment

    Sun, 27 Apr 2025 14:00:14 -0000

    As a software development company specializing in brokerage solutions, Devexperts must balance the speed of innovation with the caution required by a highly

    The post How to Run a Generative AI Developer Tooling Experiment appeared first on The New Stack.

    Software development company Devexperts tested and compared the code generation tools Copilot and Cursor, a comparison that serves as a blueprint for testing AI developer tools.
  30. Bill Gates, Paul Allen, and the Code That Started Microsoft

    Sun, 27 Apr 2025 13:00:06 -0000

    In 1968, a 13-year-old Bill Gates told his friend Paul, “Maybe we’ll have our own company someday.” More than half

    The post Bill Gates, Paul Allen, and the Code That Started Microsoft appeared first on The New Stack.

    Reflecting on a teenage ambition with his friend Paul Allen, Bill Gates recently shared the original 1975 source code that became Microsoft's first product.
  31. React Adds New Experimental Animation Feature

    Sat, 26 Apr 2025 18:00:45 -0000


    React added experimental support for two new techniques this week: View Transitions and Activity. View Transitions makes it easier to

    The post React Adds New Experimental Animation Feature appeared first on The New Stack.

    In other dev news this week: Angular's LLM-first web framework, RedwoodJS's SDK for Cloudflare and new AI tools.
  32. How the UK Is Guiding the Use of Generative AI

    Sat, 26 Apr 2025 14:00:46 -0000

    The UK government has released a playbook for using AI that offers a good model for other public-facing organizations.

    I hope a few more people looked at the Vatican’s Antiqua et Nova after the Pope’s passing, as it has

    The post How the UK Is Guiding the Use of Generative AI appeared first on The New Stack.

    The British Government Digital Service offers a playbook for using AI — and a strong example of how other public-facing bodies can use GenAI responsibly.
  33. The Best Office Suites for Linux

    Sat, 26 Apr 2025 13:00:23 -0000

    When I first started using Linux back in 1997, finding high-quality software was often a challenge. In no category was

    The post The Best Office Suites for Linux appeared first on The New Stack.

    LibreOffice, SoftMaker Office and ONLYOFFICE are but a few of the open source and commercial office suites for Linux.
  34. .NET Modernization: GitHub Copilot Upgrade Eases Migrations

    Fri, 25 Apr 2025 23:00:16 -0000

    During the .NET Conf Focus on Modernization event, Microsoft demonstrated a powerful new tool: GitHub Copilot Upgrade for .NET with

    The post .NET Modernization: GitHub Copilot Upgrade Eases Migrations appeared first on The New Stack.

    Microsoft's AI-powered solution transforms complex .NET upgrades from painstaking manual work into seamless automated processes.
  35. Introduction to API Management

    Fri, 25 Apr 2025 21:00:15 -0000


    What Is API Management? API management, a critical aspect of modern digital architecture, involves overseeing the lifecycle of application programming

    The post Introduction to API Management appeared first on The New Stack.

    Master API management and unlock seamless integrations. Learn how to secure, scale and optimize APIs to power modern apps and services.
  36. Frontend’s Next Evolution: AI-Powered State Management

    Fri, 25 Apr 2025 20:00:52 -0000

    If you’ve built a frontend application in the past five years, you’ve probably had a moment where you stared at

    The post Frontend’s Next Evolution: AI-Powered State Management appeared first on The New Stack.

    How artificial intelligence is transforming the complexity of state management in modern frontend applications.
  37. Staff Site Reliability Engineer, Waze

    Mon, 28 Apr 2025 16:00:00 -0000

    In 2023, the Waze platform engineering team transitioned to Infrastructure as Code (IaC) using Google Cloud's Config Connector (KCC) — and we haven’t looked back since. We embraced Config Connector, an open-source Kubernetes add-on, to manage Google Cloud resources through Kubernetes. To streamline management, we also leverage Config Controller, a hosted version of Config Connector on Google Kubernetes Engine (GKE), incorporating Policy Controller and Config Sync. This shift has significantly improved our infrastructure management and is shaping our future infrastructure.

    The shift to Config Connector

    Previously, Waze relied on Terraform to manage resources, particularly during our dual-cloud, VM-based phase. However, maintaining state and ensuring reconciliation proved challenging, leading to inconsistent configurations and increased management overhead.

    In 2023, we adopted Config Connector, transforming our Google Cloud infrastructure into Kubernetes Resource Modules (KRMs) within a GKE cluster. This approach addresses the reconciliation issues encountered with Terraform. Config Sync, paired with Config Connector, automates KRM synchronization from source repositories to our live GKE cluster. This managed solution eliminates the need for us to build and maintain custom reconciliation systems.

    The shift helped us meet the needs of three key roles within Waze’s infrastructure team: 

    1. Infrastructure consumers: Application developers who want to easily deploy infrastructure without worrying about the maintenance and complexity of underlying resources.

    2. Infrastructure owners: Experts in specific resource types (e.g., Spanner, Google Cloud Storage, Load Balancers, etc.), who want to define and standardize best practices in how resources are created across Waze on Google Cloud.

    3. Platform engineers: Engineers who build the system that enables infrastructure owners to codify and define best practices, while also providing a seamless API for infrastructure consumers.


    First stop: Config Connector

    It may seem circular to define all of our Google Cloud infrastructure as KRMs within a Google Cloud service. However, KRM is actually a better representation of our infrastructure than existing IaC tooling.

    Terraform's reconciliation issues – state drift, version management, out of band changes – are a significant pain. Config Connector, through Config Sync, offers out-of-the-box reconciliation, a managed solution we prefer. Both KRM and Terraform offer templating, but KCC's managed nature aligns with our shift to Google Cloud-native solutions and reduces our maintenance burden. 

    Infrastructure complexity requires generalization regardless of the tool. We can see this when we look at the Spanner requirements at Waze:

    • Consistent backups for all Spanner databases

    • Each Spanner database utilizes a dedicated Cloud Storage bucket and Service Account to automate the execution of DDL jobs.

    • All IAM policies for Spanner instances, databases, and Cloud Storage buckets are defined in code to ensure consistent and auditable access control.

    Figure 1: Spanner at Waze

    To define these resources, we evaluated various templating and rendering tools and selected Helm, a robust CNCF package manager for Kubernetes. Its strong open-source community, rich templating capabilities, and native rendering features made it a natural fit. We can now refer to our bundled infrastructure configurations as 'Charts.' While kro has since emerged to serve a similar purpose, our selection process predated its availability.
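
    To make this concrete, here is a minimal, hypothetical sketch of what "expanding simplified inputs into a bundle of KRM manifests" can look like, written in Python. The resource kinds mirror real Config Connector CRDs, but the function, names, and field values are illustrative only, not Waze's actual Charts (which are Helm templates).

    ```python
    # Hypothetical sketch: expand simplified "chart" inputs into a bundle of
    # KRM-style manifests (Spanner instance + DDL bucket + service account).
    # Illustrative only; not Waze's Charts or their exact field values.
    import yaml  # PyPI: PyYAML

    def render_spanner_bundle(name: str, namespace: str, node_count: int = 1) -> str:
        instance = {
            "apiVersion": "spanner.cnrm.cloud.google.com/v1beta1",
            "kind": "SpannerInstance",
            "metadata": {"name": name, "namespace": namespace},
            "spec": {"config": "regional-us-central1",
                     "displayName": name,
                     "numNodes": node_count},
        }
        ddl_bucket = {
            "apiVersion": "storage.cnrm.cloud.google.com/v1beta1",
            "kind": "StorageBucket",
            "metadata": {"name": f"{name}-ddl-jobs", "namespace": namespace},
        }
        ddl_service_account = {
            "apiVersion": "iam.cnrm.cloud.google.com/v1beta1",
            "kind": "IAMServiceAccount",
            "metadata": {"name": f"{name}-ddl-runner", "namespace": namespace},
        }
        return yaml.safe_dump_all([instance, ddl_bucket, ddl_service_account],
                                  sort_keys=False)

    if __name__ == "__main__":
        print(render_spanner_bundle("orders-db", "checkout-prod", node_count=3))
    ```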

    Under the hood

    Let's open the hood and dive into how the system works and is driving value for Waze.

    1. Waze infrastructure owners generically define Waze-flavored infrastructure in Helm Charts. 

    2. Infrastructure consumers use these Charts with simplified inputs to generate infrastructure (demo).

    3. Infrastructure code is stored in repositories, enabling validation and presubmit checks.

    4. Code is uploaded to an Artifact Registry, where Config Sync and Config Connector align Google Cloud infrastructure with the code definitions.

    Figure 2: Provisioning Cloud Resources at Waze

    This diagram represents a single "data domain": a collection of bounded services, databases, networks, and data. Most tech orgs today run several such domains: Prod, QA, Staging, Development, and so on.

    Approaching our destination

    So why does all of this matter? Adopting this approach allowed us to move from Infrastructure as Code to Infrastructure as Software. By treating each Chart as a software component, our infrastructure management goes beyond simple code declaration. Now, versioned Charts and configurations enable us to leverage a rich ecosystem of software practices, including sophisticated release management, automated rollbacks, and granular change tracking.

    Here's where we apply this in practice: our configuration inheritance model minimizes redundancy. Resource Charts inherit settings from Projects, which inherit from Bootstraps. All three are defined as Charts. Consequently, Bootstrap configurations apply to all Projects, and Project configurations apply to all Resources.

    Every change to our infrastructure – from changes on existing infrastructure to rolling out new resource types – can be treated like a software rollout.

    Figure 3: Resource Inheritance
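
    A minimal sketch of the inheritance chain described above, assuming simple "deepest value wins" merge semantics; this is illustrative Python, not Waze's actual implementation (which layers Helm values).

    ```python
    # Minimal sketch: Resource values override Project values, which override
    # Bootstrap values. Keys and values below are invented for illustration.
    from collections.abc import Mapping

    def deep_merge(base: dict, override: dict) -> dict:
        merged = dict(base)
        for key, value in override.items():
            if isinstance(value, Mapping) and isinstance(merged.get(key), Mapping):
                merged[key] = deep_merge(dict(merged[key]), dict(value))
            else:
                merged[key] = value
        return merged

    bootstrap = {"labels": {"org": "waze"}, "backup": {"enabled": True}}
    project   = {"labels": {"env": "prod"}, "backup": {"schedule": "0 3 * * *"}}
    resource  = {"labels": {"service": "orders"}, "nodes": 3}

    effective = deep_merge(deep_merge(bootstrap, project), resource)
    print(effective)
    # labels carry settings from all three levels; backup is inherited; nodes is local
    ```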

    Now that all of our infrastructure is treated like software, we can see what this does for us system-wide:

    Figure 4: Data Domain Flow

    Reaching our destination

    In summary, Config Connector and Config Controller have enabled Waze to achieve true Infrastructure as Software, providing a robust and scalable platform for our infrastructure needs, along with many other benefits including: 

    • Infrastructure consumers receive the latest best practices through versioned updates.

    • Infrastructure owners can iterate and improve infrastructure safely.

    • Platform engineers and security teams are confident that our resources are auditable and compliant.

    • Config Connector leverages Google's managed services, reducing operational overhead.

  38. Engineering Manager

    Mon, 24 Feb 2025 17:00:00 -0000

    Distributed tracing is a critical part of an observability stack, letting you troubleshoot latency and errors in your applications. Cloud Trace, part of Google Cloud Observability, is Google Cloud’s native tracing product, and we’ve made numerous improvements to the Trace explorer UI on top of a new analytics backend.

    Figure 1: Components of the new Trace explorer

    The new Trace explorer page contains:

    1. A filter bar with options for users to choose a Google Cloud project-based trace scope, all/root spans and a custom attribute filter.

    2. A faceted span filter pane that displays commonly used filters based on OpenTelemetry conventions.

    3. A visualization of matching spans including an interactive span duration heatmap (default), a span rate line chart, and a span duration percentile chart.

    4. A table of matching spans that can be narrowed down further by selecting a cell of interest on the heatmap.

    A tour of the new Trace explorer

    Let’s take a closer look at these new features and how you can use them to troubleshoot your applications. Imagine you’re a developer working on the checkoutservice of a retail webstore application and you’ve been paged because there’s an ongoing incident.


    This application is instrumented using OpenTelemetry and sends trace data to Google Cloud Trace, so you navigate to the Trace explorer page on the Google Cloud console with the context set to the Google Cloud project that hosts the checkoutservice.

    Before starting your investigation, you remember that your admin recommended using the webstore-prod trace scope when investigating webstore app-wide prod issues. By using this Trace scope, you'll be able to see spans stored in other Google Cloud projects that are relevant to your investigation.

    Figure 2: Scope selection

    You set the trace scope to webstore-prod and your queries will now include spans from all the projects included in this trace scope.

    Figure 3: User journey

    You select checkoutservice in Span filters (1) and the following updates load on the page:

    • Other sections such as Span name in the span filter pane (2) are updated with counts and percentages that take into account the selection made under service name. This can help you narrow down your search criteria to be more specific.

    • The span Filter bar (3) is updated to display the active filter.

    • The heatmap visualization (4)  is updated to only display spans from the checkoutservice in the last 1 hour (default). You can change the time-range using the time-picker (5). The heatmap’s x-axis is time and the y-axis is span duration. It uses color shades to denote the number of spans in each cell with a legend that indicates the corresponding range.

    • The Spans table (6) is updated with matching spans sorted by duration (default).

    • Other Chart views (7) that you can switch to are also updated with the applied filter.

    From looking at the heatmap, you can see that there are some spans in the >100s range which is abnormal and concerning. But first, you’re curious about the traffic and corresponding latency of calls handled by the checkoutservice.

    Figure 4: Span rate line chart

    Switching to the Span rate line chart gives you an idea of the traffic handled by your service. The x-axis is time and the y-axis is spans/second. The traffic handled by your service looks normal as you know from past experience that 1.5-2 spans/second is quite typical.

    Figure 5: Span duration percentile chart

    Switching to the Span duration percentile chart gives you p50/p90/p95/p99 span duration trends. While p50 looks fine, the p9x durations are greater than you expect for your service.

    Figure 6: Span selection

    You switch back to the heatmap chart and select one of the outlier cells to investigate further. This particular cell has two matching spans with a duration of over 2 minutes, which is concerning.

    Figure 7: Trace details and span attributes

    You investigate one of those spans by viewing the full trace and notice that the orders publish span is the one taking up the majority of the time when servicing this request. Given this, you form a hypothesis that the checkoutservice is having issues handling these types of calls. To validate your hypothesis, you note the rpc.method attribute being PlaceOrder and exit this trace using the X button.

    Figure 8: Custom attribute search

    You add an attribute filter for key: rpc.method value:PlaceOrder using the Filter bar, which shows you that there is a clear latency issue with PlaceOrder calls handled by your service. You’ve seen this issue before and know that there is a runbook that addresses it, so you alert the SRE team with the appropriate action that needs to be taken to mitigate the incident.
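
    Conceptually, the span filters and the percentile views reduce to selecting spans by service and attribute and summarizing their durations. Here is a rough Python sketch; the span records and field names are invented for illustration and are not Cloud Trace's data model.

    ```python
    # Illustrative only: filter spans by service and an attribute
    # (rpc.method = PlaceOrder), then report duration percentiles.
    from statistics import quantiles

    spans = [
        {"service": "checkoutservice", "attributes": {"rpc.method": "PlaceOrder"}, "duration_s": 130.0},
        {"service": "checkoutservice", "attributes": {"rpc.method": "PlaceOrder"}, "duration_s": 1.2},
        {"service": "checkoutservice", "attributes": {"rpc.method": "PlaceOrder"}, "duration_s": 0.8},
        {"service": "checkoutservice", "attributes": {"rpc.method": "GetCart"},    "duration_s": 0.1},
    ]

    def matching_durations(spans, service, attr_key, attr_value):
        return [s["duration_s"] for s in spans
                if s["service"] == service and s["attributes"].get(attr_key) == attr_value]

    durations = matching_durations(spans, "checkoutservice", "rpc.method", "PlaceOrder")
    cuts = quantiles(durations, n=100, method="inclusive")  # cuts[49]=p50, cuts[89]=p90, cuts[98]=p99
    print(f"p50={cuts[49]:.2f}s  p90={cuts[89]:.2f}s  p99={cuts[98]:.2f}s")
    ```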

    Figure 9: Send feedback

    Share your feedback with us via the Send feedback button.

    Behind the scenes

    Figure 10: Cloud Trace architecture

    This new experience is powered by BigQuery, using the same platform that backs Log Analytics. We plan to launch new features that take full advantage of this platform: SQL queries, flexible sampling, export, and regional storage.

    In summary, you can use the new Cloud Trace explorer to perform service-oriented investigations with advanced querying and visualization of trace data. This allows developers and SREs to effectively troubleshoot production incidents and identify mitigating measures to restore normal operations.

    The new Cloud Trace explorer is generally available to all users — try it out and share your feedback with us via the Send feedback button.

  39. Technical Program Manager, Google

    Thu, 20 Feb 2025 17:00:00 -0000

    Picture this: you’re a Site Reliability Engineer (SRE) responsible for the systems that power your company’s machine learning (ML) services. What do you do to ensure you have a reliable ML service, how do you know you’re doing it well, and how can you build strong systems to support these services?

    As artificial intelligence (AI) becomes more widely available, its features — including ML — will matter more to SREs. That’s because ML becomes both a part of the infrastructure used in production software systems, as well as an important feature of the software itself. 

    Abstractly, machine learning relies on its pipelines … and you know how to manage those! So you can begin with pipeline management, then look to other factors that will strengthen your ML services: training, model freshness, and efficiency. In the resources below, we'll look at some of the ML-specific characteristics of these pipelines that you’ll want to consider in your operations. Then, we draw on the experience of Google SREs to show you how to apply your core SRE skills to operating and managing your organization’s machine-learning pipelines. 

    Training ML models

    Training ML models applies the notion of pipelines to specific types of data, often running on specialized hardware. Critical aspects to consider about the pipeline:

    • how much data you’re ingesting

    • how fresh this data needs to be

    • how the system trains and deploys the models 

    • how efficiently the system handles these first three things

    This keynote presents an SRE perspective on the value of applying reliability principles to the components of machine learning systems. It provides insight into why ML systems matter for products, and how SREs should think about them. The challenges that ML systems present include capacity planning, resource management, and monitoring; other challenges include understanding the cost of ML systems as part of your overall operations environment. 


    ML freshness and data volume

    As with any pipeline-based system, a big part of understanding the system is describing how much data it typically ingests and processes. The Data Processing Pipelines chapter in the SRE Workbook lays out the fundamentals: automate the pipeline’s operation so that it is resilient, and can operate unattended. 

    You’ll want to develop Service Level Objectives (SLOs) in order to measure the pipeline’s health, especially for data freshness, i.e., how recently the model got the data it’s using to produce an inference for a customer. Understanding freshness provides an important measure of an ML system’s health, as data that becomes stale may lead to lower-quality inferences and sub-optimal outcomes for the user. For some systems, such as weather forecasting, data may need to be very fresh (just minutes or seconds old); for other systems, such as spell-checkers, data freshness can lag on the order of days — or longer! Freshness requirements will vary by product, so it’s important that you know what you’re building and how the audience expects to use it. 

    In this way, freshness is part of the critical user journey described in the SRE Workbook, capturing one aspect of the customer experience. You can read more about data freshness as a component of pipeline systems in the Google SRE article Reliable Data Processing with Minimal Toil.

    There’s more than freshness to ensuring high-quality data — there’s also how you define the model-training pipeline. A Brief Guide To Running ML Systems in Production gives you the nuts and bolts of this discipline, from using contextual metrics to understand freshness and throughput, to methods for understanding the quality of your input data. 
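
    As a small illustration of what a freshness SLO check might look like, here is a hedged Python sketch; the model names and thresholds are examples only, since real targets depend on the product.

    ```python
    # Illustrative freshness check: freshness is the age of the newest data the
    # model was trained or refreshed with, compared against a per-product target.
    from datetime import datetime, timedelta, timezone

    FRESHNESS_SLO = {
        "weather-forecast": timedelta(minutes=5),  # needs very fresh data
        "spell-checker": timedelta(days=7),        # tolerates much older data
    }

    def freshness_ok(model, newest_ingested_at, now=None):
        now = now or datetime.now(timezone.utc)
        return (now - newest_ingested_at) <= FRESHNESS_SLO[model]

    last_batch = datetime.now(timezone.utc) - timedelta(minutes=42)
    print("weather within SLO:", freshness_ok("weather-forecast", last_batch))  # False
    print("speller within SLO:", freshness_ok("spell-checker", last_batch))     # True
    ```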

    Serving efficiency

    The 2021 SRE blog post Efficient Machine Learning Inference provides a valuable resource to learn about improving your model’s performance in a production environment. (And remember, training is never the same as production for ML services!) 

    Optimizing machine learning inference serving is crucial for real-world deployment. In this article, the authors explore multi-model serving off of a shared VM. They cover realistic use cases and how to manage trade-offs between cost, utilization, and latency of model responses. By changing the allocation of models to VMs, and varying the size and shape of those VMs in terms of processing, GPU, and RAM attached, you can improve the cost effectiveness of model serving. 
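
    As a toy illustration of that trade-off (the model sizes, VM shapes, and prices below are made up, and real serving decisions also weigh latency, CPU, and GPU, not just memory):

    ```python
    # Toy sketch: pack models onto identical VMs by memory (next-fit decreasing),
    # then compare hourly cost across VM shapes. All numbers are invented.
    def vms_needed(model_gb, vm_gb):
        vms, free = 0, 0.0
        for size in sorted(model_gb, reverse=True):
            if size > free:            # open a new VM when the model doesn't fit
                vms, free = vms + 1, vm_gb
            free -= size
        return vms

    models_gb = [6, 4, 4, 3, 2, 2, 1]
    for vm_gb, hourly_usd in [(8, 0.40), (16, 0.75), (32, 1.40)]:
        n = vms_needed(models_gb, vm_gb)
        print(f"{vm_gb:>2} GB VMs: {n} needed, ~${n * hourly_usd:.2f}/hour")
    ```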

    Cost efficiency

    We mentioned that these AI pipelines often rely on specialized hardware. How do you know you’re using this hardware efficiently? Todd Underwood’s talk from SREcon EMEA 2023 on Artificial Intelligence: What Will It Cost You? gives you a sense of how much this specialized hardware costs to run, and how you can provide incentives for using it efficiently. 

    Automation for scale

    This article from Google's SRE team outlines strategies for ensuring reliable data processing while minimizing manual effort, or toil. One of the key takeaways: use an existing, standard platform for as much of the pipeline as possible. After all, your business goals should focus on innovations in presenting the data and the ML model, not in the pipeline itself. The article covers automation, monitoring, and incident response, with a focus on using these concepts to build resilient data pipelines. You’ll read best practices for designing data systems that can handle failures gracefully and reduce a team’s operational burden. This article is essential reading for anyone involved in data engineering or operations. Read more about toil in the SRE Workbook: https://sre.google/workbook/eliminating-toil/

    Next steps

    Successful ML deployments require careful management and monitoring for systems to be reliable and sustainable. That means taking a holistic approach, including implementing data pipelines, training pathways, model management, and validation, alongside monitoring and accuracy metrics. To go deeper, check out this guide on how to use GKE for your AI orchestration.

  40. Cross-Product Solution Developer

    Fri, 14 Feb 2025 17:00:00 -0000

    In today's dynamic digital landscape, building and operating secure, reliable, cost-efficient and high-performing cloud solutions is no easy feat. Enterprises grapple with the complexities of cloud adoption, and often struggle to bridge the gap between business needs, technical implementation, and operational readiness. This is where the Google Cloud Well-Architected Framework comes in. The framework provides comprehensive guidance to help you design, develop, deploy, and operate efficient, secure, resilient, high-performing, and cost-effective Google Cloud topologies that support your security and compliance requirements.

    Who should use the Well-Architected Framework?

    The Well-Architected Framework caters to a broad spectrum of cloud professionals. Cloud architects, developers, IT administrators, decision makers and other practitioners can benefit from years of subject-matter expertise and knowledge both from within Google and from the industry. The framework distills this vast expertise and presents it as an easy-to-consume set of recommendations. 

    The recommendations in the Well-Architected Framework are organized under five business-focused pillars.


    We recently completed a revamp of the guidance in all the pillars and perspectives of the Well-Architected Framework to center the recommendations around a core set of design principles.

    Operational excellence

    • Operational readiness
    • Incident management
    • Resource optimization
    • Change management
    • Continuous improvement

    Security, privacy, and compliance

    • Security by design
    • Zero trust
    • Shift-left security
    • Preemptive cyber-defense
    • Secure and responsible AI
    • AI for security
    • Regulatory, privacy, and compliance needs

    Reliability

    • User-focused goals
    • Realistic targets
    • HA through redundancy
    • Horizontal scaling
    • Observability
    • Graceful degradation
    • Recovery testing
    • Thorough postmortems

    Cost optimization

    • Spending aligned with business value
    • Culture of cost awareness
    • Resource optimization
    • Continuous optimization

    Performance optimization

    • Resource allocation planning
    • Elasticity
    • Modular design
    • Continuous improvement

    In addition to the above pillars, the Well-Architected Framework provides cross-pillar perspectives that present recommendations for selected domains, industries, and technologies like AI and machine learning (ML).


    Benefits of adopting the Well-Architected Framework

    The Well-Architected Framework is much more than a collection of design and operational recommendations. The framework empowers you with a structured principles-oriented design methodology that unlocks many advantages:

    • Enhanced security, privacy, and compliance: Security is paramount in the cloud. The Well-Architected Framework incorporates industry-leading security practices, helping ensure that your cloud architecture meets your security, privacy, and compliance requirements.

    • Optimized cost: The Well-Architected Framework lets you build and operate cost-efficient cloud solutions by promoting a cost-aware culture, focusing on resource optimization, and leveraging built-in cost-saving features in Google Cloud.

    • Resilience, scalability, and flexibility: As your business needs evolve, the Well-Architected Framework helps you design cloud deployments that can scale to accommodate changing demands, remain highly available, and be resilient to disasters and failures.

    • Operational excellence: The Well-Architected Framework promotes operationally sound architectures that are easy to operate, monitor, and maintain.

    • Predictable and workload-specific performance: The Well-Architected Framework offers guidance to help you build, deploy, and operate workloads that provide predictable performance based on your workloads’ needs.

    • The Well-Architected Framework also includes cross-pillar perspectives for selected domains, industries, and technologies like AI and machine learning (ML).

    The principles and recommendations in the Google Cloud Well-Architected Framework are aligned with Google and industry best practices like Google’s Site Reliability Engineering (SRE) practices, DORA capabilities, the Google HEART framework for user-centered metrics, the FinOps framework, Supply-chain Levels for Software Artifacts (SLSA), and Google's Secure AI Framework (SAIF).

    Embrace the Well-Architected Framework to transform your Google Cloud journey, and get comprehensive guidance on security, reliability, cost, performance, and operations — as well as targeted recommendations for specific industries and domains like AI and ML. To learn more, visit Google Cloud Well-Architected Framework.

  41. Product Manager

    Thu, 30 Jan 2025 20:00:00 -0000

    We are thrilled to announce the collaboration between Google Cloud, AWS, and Azure on Kube Resource Orchestrator, or kro (pronounced “crow”). kro introduces a Kubernetes-native, cloud-agnostic way to define groupings of Kubernetes resources. With kro, you can group your applications and their dependencies as a single resource that can be easily consumed by end users.

    Challenges of Kubernetes resource orchestration

    Platform and DevOps teams want to define standards for how application teams deploy their workloads, and they want to use Kubernetes as the platform for creating and enforcing these standards. Each service needs to handle everything from resource creation to security configurations, monitoring setup, defining the end-user interface, and more. There are client-side templating tools that can help with this (e.g., Helm, Kustomize), but Kubernetes lacks a native way for platform teams to create custom groupings of resources for consumption by end users. 

    Before kro, platform teams needed to invest in custom solutions such as building custom Kubernetes controllers, or using packaging tools like Helm, which can’t leverage the benefits of Kubernetes CRDs. These approaches are costly to build, maintain, and troubleshoot, and complex for non-Kubernetes experts to consume. This is a problem many Kubernetes users face. Rather than developing vendor-specific solutions, we’ve partnered with Amazon and Microsoft on making K8s APIs simpler for all Kubernetes users.


    How kro simplifies the developer experience

    kro is a Kubernetes-native framework that lets you create reusable APIs to deploy multiple resources as a single unit. You can use it to encapsulate a Kubernetes deployment and its dependencies into a single API that your application teams can use, even if they aren’t familiar with Kubernetes. You can use kro to create custom end-user interfaces that expose only the parameters an end user should see, hiding the complexity of Kubernetes and cloud-provider APIs.

    kro does this by introducing the concept of a ResourceGraphDefinition, which specifies how a standard Kubernetes Custom Resource Definition (CRD) should be expanded into a set of Kubernetes resources. End users define a single resource, which kro then expands into the custom resources defined in the CRD.

    kro can be used to group and manage any Kubernetes resources. Tools like ACK, KCC, or ASO define CRDs to manage cloud provider resources from Kubernetes (these tools enable cloud provider resources, like storage buckets, to be created and managed as Kubernetes resources). kro can also be used to group resources from these tools, along with any other Kubernetes resources, to define an entire application deployment and the cloud provider resources it depends on.


    Example use cases

    Below, you’ll find some examples of kro being used with Google Cloud. You can find additional examples on the kro website.

    Example 1: GKE cluster definition

    Imagine that a platform administrator wants to give end users in their organization self-service access to create GKE clusters. The platform administrator creates a kro ResourceGraphDefinition called GKEclusterRGD that defines the required Kubernetes resources and a CRD called GKEcluster that exposes only the options they want to be configurable by end users. In addition to creating a cluster, the platform team also wants clusters to deploy administrative workloads such as policies, agents, etc. The ResourceGraphDefinition defines the following resources, using KCC to provide the mappings from K8s CRDs to Google Cloud APIs:

    • GKE cluster, Container Node Pools, IAM ServiceAccount, IAM PolicyMember, Services, Policies

    The platform administrator would then define the end-user interface so that they can create a new cluster by creating an instance of the CRD that defines:

    • Cluster name, Nodepool name, Max nodes, Location (e.g. us-east1), Networks (optional)

    Everything related to policy, service accounts, and service activation (and how these resources relate to each other) is hidden from the end user, simplifying their experience.


    Example 2: Web application definition

    In this example, a DevOps Engineer wants to create a reusable definition of a web application and its dependencies. They create a ResourceGraphDefinition called WebAppRGD, which defines a new Kubernetes CRD called WebApp. This new resource encapsulates all the necessary resources for a web application environment, including:

    • Deployments, service, service accounts, monitoring agents, and cloud resources like object storage buckets. 

    The WebAppRGD ResourceGraphDefinition can set a default configuration, and also define which parameters can be set by the end user at deployment time (kro gives you the flexibility to decide what is immutable, and what an end user is able to configure). A developer then creates an instance of the WebApp CRD, inputting any user-facing parameters. kro then deploys the desired Kubernetes resources.
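
    As a rough sketch of the shape such a definition might take, here is Python that renders a simplified ResourceGraphDefinition-like manifest for the WebApp example. The field names approximate kro's schema and may not match it exactly, so treat this as illustrative and consult the kro documentation for the real format.

    ```python
    # Hypothetical, abbreviated sketch of the WebAppRGD idea; field names are
    # approximations of kro's schema, not an authoritative example.
    import yaml  # PyPI: PyYAML

    webapp_rgd = {
        "apiVersion": "kro.run/v1alpha1",
        "kind": "ResourceGraphDefinition",
        "metadata": {"name": "webapp-rgd"},
        "spec": {
            # The end-user-facing API: only these fields are exposed to developers.
            "schema": {
                "apiVersion": "v1alpha1",
                "kind": "WebApp",
                "spec": {"name": "string", "image": "string",
                         "replicas": "integer | default=2"},
            },
            # The resources each WebApp instance expands into (abbreviated).
            "resources": [
                {"id": "deployment",
                 "template": {"apiVersion": "apps/v1", "kind": "Deployment",
                              "metadata": {"name": "${schema.spec.name}"}}},
                {"id": "service",
                 "template": {"apiVersion": "v1", "kind": "Service",
                              "metadata": {"name": "${schema.spec.name}"}}},
            ],
        },
    }
    print(yaml.safe_dump(webapp_rgd, sort_keys=False))
    ```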


    Key benefits of kro

    We believe kro is a big step forward for platform engineering teams, delivering a number of advantages:

    • Kubernetes-native: kro leverages Kubernetes Custom Resource Definitions (CRDs) to extend Kubernetes, so it works with any Kubernetes resource and integrates with existing Kubernetes tools and workflows.

    • Lets you create a simplified end-user experience: kro makes it easy to define end-user interfaces for complex groups of Kubernetes resources, so that people who are not Kubernetes experts can consume services built on Kubernetes.

    • Enables standardized services for application teams: kro templates can be reused across different projects and environments, promoting consistency and reducing duplication of effort.

    Get started with kro

    kro is available as an open-source project on GitHub. The GitHub organization is currently jointly owned by teams from Google, AWS, and Microsoft, and we welcome contributions from the community. We also have a website with documentation on installing and using kro, including example use cases. As an early-stage project, kro is not yet ready for production use, but we still encourage you to test it out in your own Kubernetes development environments!

  42. Senior Product Manager, Google

    Thu, 23 Jan 2025 17:00:00 -0000

    Platform engineering, one of Gartner’s top 10 strategic technology trends for 2024, is rapidly becoming indispensable for enterprises seeking to accelerate software delivery and improve developer productivity. How does it do that? Platform engineering is about providing the right infrastructure, tools, and processes that enable efficient, scalable software development, deployment, and management, all while minimizing the cognitive burden on developers.

    To uncover the secrets to platform engineering success, Google Cloud partnered with Enterprise Strategy Group (ESG) on a comprehensive research study of 500 global IT professionals and application developers working at organizations with at least 500 employees, all with formal platform engineering teams. Our goal was to understand whether they had adopted platform engineering, and if so, the impact that has had on their company’s software delivery capabilities. 

    The resulting report, “Building Competitive Edge With Platform Engineering: A Strategic Guide,” reveals common patterns, expectations, and actionable best practices for overcoming challenges and fully leveraging platform engineering. This blog post highlights some of the most powerful insights from this study.


    Platform engineering is no longer optional

    The research confirms that platform engineering is no longer a nascent concept. 55% of the global organizations we invited to participate have already adopted platform engineering. Of those, 90% plan to expand its reach to more developers. Furthermore, 85% of companies using platform engineering report that their developers rely on the platform to succeed. These figures highlight that platform engineering is no longer just a trend; it's becoming a vital strategy for organizations seeking to unlock the full potential of their cloud and IT investments and gain a competitive edge.


    Figure 1: 55% of 900+ global organizations surveyed have adopted platform engineering

    Three keys to platform engineering success

    The report identifies three critical components that are central to the success of mature platform engineering leaders. 

    1. Fostering close collaboration between platform engineers and other teams to ensure alignment 

    2. Adopting a “platform as a product” approach, which involves treating the developer platform with a clear roadmap, communicated value, and tight feedback loops

    3. Defining success by measuring performance through clear metrics such as deployment frequency, failure recovery time, and lead time for changes 

    It's noteworthy that while many organizations have begun their platform engineering journey, only 27% of adopters have fully integrated these three key components in their practices, signaling a significant opportunity for further improvements.

    AI: platform engineering's new partner

    One of the most compelling insights of this report is the synergistic relationship between platform engineering and AI. A remarkable 86% of respondents believe that platform engineering is essential to realizing the full business value of AI. At the same time, a vast majority of companies view AI as a catalyst for advancing platform engineering, with 94% of organizations identifying AI to be ‘Critical’ or ‘Important’ to the future of platform engineering.


    Beyond speed: key benefits of platform engineering

    The study also identified three cohorts of platform engineering adopters — nascent, established, and leading — based on whether and how much adopters had embraced the above-mentioned three key components of platform engineering success. The study shows that leading adopters gain more in terms of speed, efficiency, and productivity, and offers guidance for nascent and established adopters to improve their overall platform engineering maturity to gain more benefits.

    The report also identified some additional benefits of platform engineering, including:

    • Improved employee satisfaction, talent acquisition & retention: mature platforms foster a positive developer experience that directly impacts company culture. Developers and IT pros working for organizations with mature developer platforms are much more likely to recommend their workplace to their peers.

    • Accelerated time to market: mature platform engineering adopters have significantly shortened time to market. 71% of leading adopters of platform engineering indicated they have significantly accelerated their time to market, compared with 28% of less mature adopters.

    Don't go it alone

    A vast majority (96%) of surveyed organizations are leveraging open-source tools to build their developer platforms. Moreover, most (84%) are partnering with external vendors to manage and support their open-source environments. Co-managed platforms with a third party or a cloud partner benefit from a higher degree of innovation. Organizations with co-managed platforms allocate an average of 47% of their developers’ productive time to innovation and experimentation, compared to just 38% for those that prefer to manage their platforms with internal staff.

    Ready to succeed? Explore the full report

    While this blog provides a glimpse into the key findings from this study, the full report goes much further, revealing key platform engineering strategies and practices that will help you stay ahead of the curve. Download the report to explore additional topics, including:

    • The strategic considerations of centralized and distributed platform engineering teams

    • The key drivers behind platform engineering investments

    • Top priorities driving platform adoption for developers, ensuring alignment with their needs

    • Key pain points to anticipate and navigate on the road to platform engineering success

    • How platform engineering boosts productivity, performance, and innovation across the entire organization

    • The strategic importance of open source in platform engineering for competitive advantage

    • The transformative role of platform engineering for AI/ML workloads as adoption of AI increases

    • How to develop the right platform engineering strategy to drive scalability and innovation

    Download the full report now.

  43. Software Engineer

    Thu, 23 Jan 2025 17:00:00 -0000

    Editor’s note: This blog post was updated to reflect the general availability status of these features as of March 31, 2025.


    Cloud Deploy is a fully managed continuous delivery platform that automates the delivery of your application. On top of existing automation features, customers tell us they want other ways to automate their deployments to keep their production environments reliable and up to date.

    We're happy to announce three new features to help with that, all in GA.

    1. Repair rollouts

    The new repair rollout automation rule lets you retry failed deployments or automatically roll back to a previously successful release when an error occurs. These errors could come in any phase of a deployment: a pre-deployment SQL migration, a misconfiguration detected when talking to a GKE cluster, or as part of a deployment verification step. In any of these cases, the repair rollout automation lets you retry the failed step a configurable number of times, perfect for those occasionally flaky end-to-end tests. If the retry succeeds, the rollout continues. If the retries fail (or none are configured) the repair rollout automation can also roll back to the previously successful release.
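
    The behavior described above boils down to a simple control flow, sketched here in Python; this is illustrative logic only, not Cloud Deploy's API or configuration syntax.

    ```python
    # Illustrative control flow: retry a failed deployment phase a configured
    # number of times, then roll back to the last known-good release.
    def repair_rollout(deploy_phase, rollback, max_retries=2):
        for attempt in range(1 + max_retries):
            try:
                deploy_phase()
                return "rollout-continues"
            except RuntimeError as err:   # e.g. a flaky end-to-end verification
                print(f"attempt {attempt + 1} failed: {err}")
        rollback()
        return "rolled-back"

    def flaky_phase():
        raise RuntimeError("end-to-end test timed out")

    print(repair_rollout(flaky_phase, lambda: print("rolling back to last good release")))
    ```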


    2. Deploy policies

    Automating deployments is powerful, but it can also be important to put some constraints on the automation. The new deploy policies feature is intended to limit what these automations (or users) can do. Initially, we're launching a time-windows policy which can, for example, inhibit deployments during evenings, weekends, or during important events. While an on-call engineer with the Policy Overrider role could "break glass" to get around these policies, automated deployments won't be able to trigger a rollout in the middle of your big demo.
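
    For intuition, here is what a time-windows check evaluates, as a hedged Python sketch; the window values are examples and this is not the actual policy schema.

    ```python
    # Illustrative check: allow automated rollouts only on weekdays, 09:00-17:00 UTC.
    from datetime import datetime, timezone

    def rollout_allowed(now=None):
        now = now or datetime.now(timezone.utc)
        is_weekend = now.weekday() >= 5      # Saturday=5, Sunday=6
        in_hours = 9 <= now.hour < 17
        return (not is_weekend) and in_hours

    print("rollout allowed right now:", rollout_allowed())
    ```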

    3. Timed promotions

    After a release is successfully rolled out, you may want to automatically deploy it to the next environment. Our previous auto-promote feature let you promote a release after a specified duration, for example moving it into prod 12 hours after it went to staging. But often you want promotions to happen on a schedule, not based on a delay. Within Google, for example, we typically recommend that teams promote from a dev environment into staging every Thursday, and then start a promotion into prod on Monday mornings. With the new timed promotion automation, Cloud Deploy can handle these scheduled promotions for you. 

    The future

    Comprehensive, easy-to-use, and cost-effective DevOps tools are key to efficient software delivery, and it’s our hope that Cloud Deploy will help you implement complete CI/CD pipelines. Stay tuned as we introduce exciting new capabilities and features to Cloud Deploy in the months to come.

    Update your current pipelines with these new features today. Check out the product page, documentation, quickstarts, and tutorials. Finally, if you have feedback on Cloud Deploy, you can join the conversation. We look forward to hearing from you!

  44. Senior Staff Reliability Engineer

    Thu, 09 Jan 2025 17:00:00 -0000

    Cloud applications like Google Workspace provide benefits such as collaboration, availability, security, and cost-efficiency. However, for cloud application developers, there’s a fundamental conflict between achieving high availability and the constant evolution of cloud applications. Changes to the application, such as new code, configuration updates, or infrastructure rearrangements, can introduce bugs and lead to outages. These risks pose a challenge for developers, who must balance stability and innovation while minimizing disruption to users.

    Here on the Google Workspace Site Reliability Engineering team, we once moved a replica of Google Docs to a new data center because we needed extra capacity. But moving the associated data, which was vast, overloaded a key index in our database, restricting users’ ability to create new docs. Thankfully, we were able to identify the root cause and mitigate the problem quickly. Still, this experience convinced us of the need to reduce the risk of a global outage from a simple application change.


    Limit the blast radius

    Our approach to reducing the risk of global outages is to limit the “blast radius,” or extent, of an outage by vertically partitioning the serving stack. The basic idea is to run isolated instances (“partitions”) of application servers and storage (Figure 1). Each partition contains all the various servers necessary to service a user request from end to end. Each production partition also has a pseudo-random mix of users and workloads, so all the partitions have similar resource needs. When it comes time to make changes to the application code, we deploy new changes to one partition at a time. Bad changes may cause a partition-wide outage, but we are protected from a global application outage. 

    Compare this approach to using canarying alone, in which new features or code changes are released to a small group of users before rolling them out to the rest. While canarying deploys changes first to just a few servers, it doesn’t prevent problems from spreading. For example, we’ve had incidents where canaried changes corrupted data used by all the servers in the deployment. With partitioning, the effects of bad changes are isolated to a single partition, preventing such contagion. Of course, in practice, we combine both techniques: canarying new changes to a few servers within a single partition.
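
    A minimal sketch of sticky, pseudo-random assignment along the lines described above, assuming users are keyed by a stable ID; the hash scheme and partition count are illustrative, not Workspace's implementation.

    ```python
    # Hashing a stable user ID gives each user a consistent ("sticky") partition,
    # so every partition ends up with a pseudo-random mix of users and workloads.
    import hashlib

    NUM_PARTITIONS = 16

    def partition_for(user_id: str) -> int:
        digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
        return int(digest, 16) % NUM_PARTITIONS

    # Changes are then rolled out one partition at a time, canarying within each.
    print(partition_for("alice@example.com"), partition_for("bob@example.com"))
    ```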


    Benefits of partitioning

    Broadly speaking, partitioning brings a lot of advantages:

    • Availability: Initially, the primary motivation for partitioning was to improve the availability of our services and avoid global outages. In a global outage, an entire service may be down (e.g., users cannot log into Gmail), or a critical user journey (e.g., users cannot create Calendar events) — obviously things to be avoided.

      Still, the reliability benefits of partitioning can be hard to quantify; global outages are relatively infrequent, so if you don’t have one for a while, it may be due to partitioning, or may be due to luck. That said, we’ve had several outages that were confined to a single partition, and believe they would have expanded into global outages without it.
    • Flexibility: We evaluate many changes to our systems by experimenting with data. Many user-facing experiments, such as a change to a UI element, use discrete groups of users. For example, in Gmail we can choose an on-disk layout that stores the message bodies of emails inline with the message metadata, or a layout that separates them into different disk files. The right decision depends on subtle aspects of the workload. For example, separating message metadata and bodies may reduce latency for some user interactions, but requires more compute resources in our backend servers to perform joins between the body and metadata columns. With partitioning, we can easily evaluate the impact of these choices in contained, isolated environments. 
    • Data location: Google Workspace lets enterprise customers specify that their data be stored in a specific jurisdiction. In our previous, non-partitioned architecture, such guarantees were difficult to provide, especially since services were designed to be globally replicated to reduce latency and take advantage of available capacity.

    Challenges

    Despite the benefits, there are some challenges to adopting partitioning. In some cases, these challenges make it hard or risky to move from a non-partitioned to a partitioned setup. In other cases, challenges persist even after partitioning. Here are the issues as we see them:

    • Not all data models are easy to partition: For example, Google Chat needs to assign both users and chat rooms to partitions. Ideally, a chat and its members would be in a single partition to avoid cross-partition traffic. However, in practice, this is difficult to accomplish. Chat rooms and users form a graph, with users in many chat rooms and chat rooms containing many users. In the worst case, this graph may have only a single connected component spanning every user and chat room. If we were to slice the graph into partitions, we could not guarantee that all users would be in the same partition as their chat rooms.
    • Partitioning a live service requires care: Most of our services pre-date partitioning. As a result, adopting partitioning means taking a live service and changing its routing and storage setup. Even if the end goal is higher reliability, making these kinds of changes in a live system is often the source of outages, and can be risky.
    • Partition misalignment between services: Our services often communicate with each other. For example, if a new person is added to a Calendar event, Calendar servers make a Remote Procedure Call (RPC) to Gmail delivery servers to send the new invitee an email notification. Similarly, Calendar events with video call links require Calendar to talk to Meet servers for a meeting ID. Ideally, we would get the benefits of partitioning even across services. However, aligning partitions between services is difficult. The main reason is that different services tend to use different entity types when determining which partition to use. For example, Calendar partitions on the owner of the calendar while Meet partitions on meeting ID. The result is that there is no clear mapping from partitions in one service to another.
    • Partitions are smaller than the service: A modern cloud application is served by hundreds or thousands of servers. We run servers at less than full utilization so that we can tolerate spikes in traffic, and because servers that are saturated with traffic generally perform poorly. If we have 500 servers, and target each at 60% CPU utilization, we effectively have 200 spare servers to absorb load spikes. Because we do not fail over between partitions, each partition has access to a much smaller amount of spare capacity. In a non-partitioned setup, a few server crashes may likely go unnoticed, since there is enough headroom to absorb the lost capacity. But in a smaller partition, these crashes may account for a non-trivial portion of the available server capacity, and the remaining servers may become overloaded.
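
    To put concrete numbers on the headroom point in the last bullet (the 10-partition split below is an assumption for illustration):

    ```python
    # With a 60% utilization target, 500 servers leave ~200 servers of headroom
    # globally, but a 50-server partition has only ~20 -- so a handful of crashes
    # consumes a quarter of its spare capacity.
    servers, target_util, partitions = 500, 0.60, 10
    spare_global = servers * (1 - target_util)
    per_partition = servers // partitions
    spare_per_partition = per_partition * (1 - target_util)
    crashed = 5
    print(f"global headroom: {spare_global:.0f} servers")
    print(f"per-partition headroom: {spare_per_partition:.0f} servers")
    print(f"{crashed} crashes = {crashed / spare_per_partition:.0%} of one partition's headroom")
    ```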

    Key takeaways

    We can improve the availability of web applications by partitioning their serving stacks. These partitions are isolated, because we do not fail over between them. Users and entities are assigned to partitions in a sticky manner to allow us to roll out changes in order of risk tolerance. This approach allows us to roll out changes one partition at a time with confidence that bad changes will only affect a single partition, and ideally that partition contains only users from your organization.

    In short, partitioning supports our efforts to provide stronger and more reliable services to our users, and it might apply to your service as well. For example, you can improve the availability of your application by using Spanner, which provides geo-partitioning out of the box. Read more about geo-partitioning best practices here.


  45. Product Leader for Customer Telemetry, Google Cloud

    Mon, 06 Jan 2025 17:00:00 -0000

    Cloud incidents happen. And when they do, it’s incumbent on the cloud service provider to communicate about the incident to impacted customers quickly and effectively — and for the cloud service consumer to use that information effectively, as part of a larger incident management response. 

    Google Cloud Personalized Service Health provides businesses with fast, transparent, relevant, and actionable communication about Google Cloud service disruptions, tailored to a specific business at its desired level of granularity. Cybersecurity company Palo Alto Networks is one Google Cloud customer and partner that recently integrated Personalized Service Health signals into the incident workflow for its Google Cloud-based PRISMA Access offering, saving its customers critical minutes during active incidents. 

    By programmatically ingesting Personalized Service Health signals into advanced workflow components, Palo Alto can quickly make decisions such as triggering contingency actions to protect business continuity.

    Let’s take a closer look at how Palo Alto integrated Personalized Service Health into its operations.


    The Personalized Service Health integration

    Palo Alto ingests Personalized Service Health logs into its internal AIOps system, which centralizes incident communications for PRISMA Access and applies advanced techniques to classify and distribute signals to the people responsible for responding to a given incident.

    Personalized Service Health UI: incident list view

    Users of Personalized Service Health can filter which relevance levels they want to see. “Partially related” flags an issue anywhere in the world with products the customer uses; “Related” means the problem is detected within the customer’s data center regions; and “Impacted” means Google has verified the impact to the customer for specific services.

    While Google is still confirming an incident, Personalized Service Health communicates some of these incidents as a 'PSH Emerging Incident' to give customers early notification. Once Google confirms the incident, these are merged into 'PSH Confirmed Incidents'. This helps customers respond faster to a specific incident that’s impacting their environment, or escalate back to Google if needed.

    Personalized Service Health distributes updates throughout an active incident, typically every 30 minutes, or sooner if there’s progress to share. These updates are also written to logs, which Palo Alto ingests into AIOps.

    Programmatically ingesting and distributing incident communications accelerates the response to disruptive, unplanned cloud service provider incidents. This is especially true in large-scale organizations such as Palo Alto, which has multiple teams involved in incident response for different applications, workloads and customers.

    Fueling the incident lifecycle

    Palo Alto further leverages the ingested Personalized Service Health signals in its AIOps platform, which uses machine learning (ML) and analytics to automate IT operations. AIOps harnesses large volumes of operational data to detect and respond to issues in near real time, and correlates these signals with internally generated alerts to declare an incident that is affecting multiple customers. These AIOps alerts are tied to other incident management tools that assist with managing the incident lifecycle, including communication, regular updates and incident resolution.


    In addition, a data enrichment pipeline takes Personalized Service Health incidents, adds Palo Alto’s related information, and publishes the events to Pub/Sub. AIOps then consumes the incident data from Pub/Sub, processes it, correlates it with related event signals, and notifies subscribed channels.
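    To make the flow concrete, here is a minimal sketch, in Python, of what consuming those enriched incident events from a Pub/Sub subscription could look like. The project, subscription, and payload field names are assumptions for illustration, not Palo Alto's actual implementation.

    # Sketch: pull enriched Personalized Service Health incident events from a
    # Pub/Sub subscription and hand them to downstream alerting.
    # Subscription name and payload fields are hypothetical.
    import json
    from concurrent.futures import TimeoutError
    from google.cloud import pubsub_v1

    PROJECT_ID = "my-project"                  # assumption
    SUBSCRIPTION_ID = "psh-incidents-aiops"    # assumption

    subscriber = pubsub_v1.SubscriberClient()
    subscription_path = subscriber.subscription_path(PROJECT_ID, SUBSCRIPTION_ID)

    def callback(message: pubsub_v1.subscriber.message.Message) -> None:
        event = json.loads(message.data.decode("utf-8"))
        # Hypothetical fields added by the enrichment pipeline.
        incident_id = event.get("incident_id")
        relevance = event.get("relevance")     # e.g. IMPACTED / RELATED
        customer = event.get("customer_folder")
        print(f"Incident {incident_id} ({relevance}) affects {customer}")
        # ...route to the appropriate on-call channel here...
        message.ack()

    streaming_pull = subscriber.subscribe(subscription_path, callback=callback)
    print(f"Listening on {subscription_path}...")
    try:
        streaming_pull.result(timeout=60)      # run for 60 seconds in this sketch
    except TimeoutError:
        streaming_pull.cancel()
        streaming_pull.result()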

    Palo Alto organizes Google Cloud assets into folders within the Google Cloud console. Each project represents a Palo Alto PRISMA Access customer. To receive incident signals that are likewise specific to end customers, Palo Alto creates a log sink that’s specific to each folder, aggregating service health logs at the folder level. Palo Alto then receives incident signals specific to each customer so it can take further action.
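    A folder-scoped sink like the one described above can be created with the Cloud Logging API. The sketch below is illustrative only: the folder ID, destination topic, and especially the log filter are placeholders, so check the Personalized Service Health documentation for the exact log name to filter on.

    # Sketch: create a folder-scoped log sink that routes service health log
    # entries to a Pub/Sub topic for one customer's folder.
    from google.cloud.logging_v2.services.config_service_v2 import ConfigServiceV2Client
    from google.cloud.logging_v2.types import LogSink

    FOLDER_ID = "123456789012"   # placeholder folder ID
    DESTINATION = "pubsub.googleapis.com/projects/my-project/topics/psh-incidents"

    client = ConfigServiceV2Client()
    sink = LogSink(
        name="psh-incident-sink",
        destination=DESTINATION,
        # Placeholder filter -- substitute the actual service health log name.
        filter='log_id("servicehealth.googleapis.com/organization_events")',
        include_children=True,
    )
    created = client.create_sink(parent=f"folders/{FOLDER_ID}", sink=sink)
    # The sink's writer identity must be granted the Pub/Sub Publisher role on the topic.
    print(created.writer_identity)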


    Palo Alto drives the following actions based on incident communications flowing from Google Cloud:

    • Proactive detection of zonal, inter-regional, and external en-masse failures

    • Accurate identification of workloads affected by cloud provider incidents

    • Correlation of product issues with cloud service degradation in Google Cloud itself

    Seeing Personalized Service Health’s value

    Incidents caused by cloud providers often go unnoticed or are difficult to isolate without involving multiple teams at the cloud provider (support, engineering, SRE, account management). The Personalized Service Health alerting framework plus the AIOps correlation engine allows Palo Alto’s SRE teams to isolate issues caused by a cloud provider near-instantaneously.


    Palo Alto’s incident management workflow is designed to distinguish mass failures from individual customer outages, ensuring the right teams stay engaged until the incidents are resolved. This includes notifying relevant parties, such as the on-call engineer and the Google Cloud support team. With Personalized Service Health, Palo Alto can capture both event types, i.e., mass failures as well as individual customer outages.

    Palo Alto gets value from Personalized Service Health in multiple ways, beginning with faster incident response and contingency actions that protect business continuity, especially for impacted customers of PRISMA Access. In the event of an incident impacting them, PRISMA Access customers naturally seek and expect information from Palo Alto. By ensuring this information flows rapidly from Google Cloud into Palo Alto’s incident response systems, Palo Alto is able to provide more insightful answers to these end customers, and plans to serve additional use cases based on both existing and future Personalized Service Health capabilities.

    Take your incident management to the next level

    Google Cloud is continually evolving Personalized Service Health to provide deeper value for all Google Cloud customers — from startups, to ISVs and SaaS providers, to the largest enterprises. Ready to get started? Learn more about Personalized Service Health, or reach out to your account team.


    We'd like to thank Jose Andrade, Pankhuri Kumar and Sudhanshu Jain of Google for their contributions to this collaboration between PANW and Google Cloud.

  46. Staff Software Engineer

    Mon, 09 Dec 2024 17:00:00 -0000

    From helping your developers write better code faster with Code Assist, to helping cloud operators more efficiently manage usage with Cloud Assist, Gemini for Google Cloud is your personal AI-powered assistant. 

    However, understanding exactly how your internal users are using Gemini has been a challenge — until today. 

    Today we are announcing Cloud Logging and Cloud Monitoring support for Gemini for Google Cloud. Currently in public preview, Cloud Logging records requests and responses between Gemini for Google Cloud and individual users, while Cloud Monitoring reports 1-day, 7-day, and 28-day Gemini for Google Cloud active users and response counts in aggregate.


    Cloud Logging

    In addition to offering customers general visibility into the impact of Gemini, there are a few scenarios where logs are useful:

    • to track the provenance of your AI-generated content

    • to record and review your users’ usage of Gemini for Google Cloud

    This feature is opt-in; when enabled, it logs your users’ Gemini for Google Cloud activity to Cloud Logging (Cloud Logging charges apply).

    Once enabled, log entries are made for each request to and response from Gemini for Google Cloud. In a typical request entry, Logs Explorer would provide an entry similar to the following example:

    (Screenshot: example request log entry in Logs Explorer)

    There are several things to note about this entry:

    • The content inside jsonPayload contains information about the request. In this case, it was a request to complete Python code with def fibonacci as the input. 

    • The labels tell you the method (CompleteCode), the product (code_assist), and the user who initiated the request (cal@google.com). 

    • The resource labels tell you the instance, location, and resource container (typically project) where the request occurred. 

    In a typical response entry, you’ll see the following:

    (Screenshot: example response log entry in Logs Explorer)

    Note that the request_id inside the labels is identical for this pair of request and response entries, which lets you match requests to their corresponding responses.

    In addition to Logs Explorer, Log Analytics supports queries to analyze your log data and helps you answer questions like "How many requests did User XYZ make to Code Assist?"
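    For ad hoc checks outside the console, the same question can be answered programmatically. Here is a minimal sketch using the Cloud Logging client library for Python; the label keys in the filter follow the entry structure described above, but treat them as placeholders and confirm the exact field names in the logging documentation.

    # Sketch: count request entries for one user of Gemini Code Assist.
    from google.cloud import logging as cloud_logging

    client = cloud_logging.Client(project="my-project")   # assumption
    log_filter = (
        'labels.product="code_assist" '
        'AND labels.user="cal@google.com"'                 # placeholder label keys
    )

    request_count = sum(1 for _ in client.list_entries(filter_=log_filter))
    print(f"Code Assist requests for cal@google.com: {request_count}")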

    For more details, please see the Gemini for Google Cloud logging documentation.

    Cloud Monitoring 

    Gemini for Google Cloud monitoring metrics help you answer questions like: 

    • How many unique active users used Gemini for Google Cloud services over the past day or seven days? 

    • How many total responses did my users receive from Gemini for Google Cloud services over the past six hours?

    Cloud Monitoring support for Gemini for Google Cloud is available to anyone who uses a Gemini for Google Cloud product. It records responses and active users as Cloud Monitoring metrics, which you can use to configure dashboards and alerts.

    Because these metrics are available with Cloud Monitoring, you can also use them as part of Cloud Monitoring dashboards. A “Gemini for Google Cloud” dashboard is automatically installed under “GCP Dashboards” when Gemini for Google Cloud usage is detected:

    (Screenshot: the Gemini for Google Cloud dashboard under GCP Dashboards)

    Metrics Explorer offers another avenue where metrics can be examined and filters applied to gain a more detailed view of your usage. This is done by selecting the “Cloud AI Companion Instance” active resource in the Metrics Explorer:

    (Screenshot: Metrics Explorer filtered to the Cloud AI Companion Instance resource)

    In the example above, response_count is the number of responses sent by Gemini for Google Cloud, and can be filtered for Gemini Code Assist or the Gemini for Google Cloud method (code completion/generation). 
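    If you prefer to query these metrics programmatically rather than through Metrics Explorer, the Cloud Monitoring API can return the same time series. A minimal sketch follows; the metric type string is a placeholder, so look up the exact Gemini for Google Cloud metric names in the monitoring documentation.

    # Sketch: sum response counts over the last six hours via the Monitoring API.
    import time
    from google.cloud import monitoring_v3

    PROJECT_ID = "my-project"                                                 # assumption
    METRIC_TYPE = "cloudaicompanion.googleapis.com/instance/response_count"   # placeholder

    client = monitoring_v3.MetricServiceClient()
    now = int(time.time())
    interval = monitoring_v3.TimeInterval(
        {"start_time": {"seconds": now - 6 * 3600}, "end_time": {"seconds": now}}
    )

    results = client.list_time_series(
        request={
            "name": f"projects/{PROJECT_ID}",
            "filter": f'metric.type = "{METRIC_TYPE}"',
            "interval": interval,
            "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
        }
    )
    total = sum(point.value.int64_value for series in results for point in series.points)
    print(f"Responses in the last 6 hours: {total}")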

    For more details, please see the Gemini for Google Cloud monitoring documentation.

    What’s next

    We’re continually working on additions to these new capabilities, and in particular are focused on Code Assist logging and metrics enhancements that will bring even further insight and observability into your use of Gemini Code Assist and its impact. To get started with Gemini Code Assist and learn more about Gemini Cloud Assist — as well as observability data about it from Cloud Logging and Monitoring — check out the following links: 

  47. EMEA Practice Solutions Lead, Application Platform

    Tue, 22 Oct 2024 17:00:00 -0000

    At the end of the day, developers build, test, deploy and maintain software. But like with lots of things, it’s about the journey, not the destination.

    Among platform engineers, we sometimes refer to that journey as the developer experience (DX), which encompasses how developers feel and interact with the tools and services they use throughout the software build, test, deployment and maintenance process.

    Prioritizing DX is essential: frustrated developers lead to inefficiency, talent loss, and shadow IT. Conversely, a positive DX drives innovation, community, and productivity. And if you want to provide a positive DX, you need to start measuring how you’re doing.

    At PlatformCon 2024, I gave a talk entitled "Improving your developers' platform experience by applying Google frameworks and methods” where I spoke about Google’s HEART Framework, which provides a holistic view of your organization's developers’ experience through actionable data.

    In this article, I will share ideas on how you can apply the HEART framework to your Platform Engineering practice, to gain a more comprehensive view of your organization’s developer experience. But before I do that, let me explain what the HEART Framework is.


    The HEART Framework: an introduction

    In a nutshell, HEART measures developer behaviors and attitudes from their experience of your platform and provides you with insights into what’s going on behind the numbers, by defining specific metrics to track progress towards goals. This is beneficial because continuous improvements through feedback are vital components of a platform engineering journey, helping both platform and application product teams make decisions that are data-driven and user-centered.

    However, HEART is not a data collection tool in and of itself; rather, it’s a user-sentiment framework for selecting the right metrics to focus on based on product or platform objectives. It balances quantitative or empirical data, e.g., number of active portal users, with qualitative or subjective insights such as "My users feel the portal navigation is confusing." In other words, consider HEART as a framework or methodology for assessing user experience, rather than a specific tool or assessment. It helps you decide what to measure, not how to measure it.

    (Figure: the five HEART dimensions — Happiness, Engagement, Adoption, Retention, and Task success)

    Let’s take a look at each of these in more detail.

    Happiness: Do users actually enjoy using your product?

    Highlight: Gathering and analyzing developer feedback

    Subjective metrics:

    • Surveys: Conduct regular surveys to gather feedback about overall satisfaction, ease of use, and pain points. Toil negatively affects developer satisfaction and morale: repetitive, manual work can lead to frustration, burnout, and decreased happiness with the platform.

    • Feedback mechanisms: Establish easy ways for developers to provide direct feedback on specific features or areas of the platform like Net Promoter Score (NPS) or Customer Satisfaction surveys (CSAT).

    • Collect open-ended feedback from developers through interviews and user groups.

    • Sentiment analysis: Analyze developer sentiment expressed in feedback channels, support tickets and online communities.

    System metrics:

    • Feature requests: Track the number and types of feature requests submitted by developers. This provides insights into their needs and desires and can help you prioritize improvements that will enhance happiness.

    Watch out for: While platforms can boost developer productivity, they might not necessarily contribute to developer job satisfaction. This warrants further investigation, especially if your research suggests that your developers are unhappy.

    Engagement: What is the developer breadth and quality of platform experience?

    Highlight: Frequency and quality of interaction between platform engineers and developers — intensity and quality of interaction with the platform, participation in chat channels, training, dual ownership of golden paths, joint troubleshooting, engagement in architectural design discussions, and the breadth of interaction by everyone from new hires through to senior developers.

    Subjective metrics:

    • Survey for quality of interaction — focus on the depth and type of interaction, whether through chat channels, trainings, dual ownership of golden paths, joint troubleshooting, or architectural design discussions

    • High toil can reduce developer engagement with the platform. When developers spend excessive amounts of time on tedious tasks, they are less likely to explore new features, experiment, and contribute to the platform's evolution.

    System metrics:

    • Active users: Track daily, weekly, and monthly active developers and how long they spend on tasks.

    • Usage patterns: Analyze the most used platform features, tools, and portal resources.

    • Frequency of interaction between platform engineers with developers.

    • Breadth of user engagement: Track onboarding time for new hires to reach proficiency, measure the percentage of senior developers actively contributing to golden paths or portal functionality.

    Watch out for: Don’t confuse engagement with satisfaction. Developers may rate the platform highly in surveys, but usage data might reveal low frequency of interaction with core features or a limited subset of teams actively using the platform. Ask them “How has the platform changed your daily workflow?” rather than “Are you satisfied with the platform?”

    Adoption: What is the platform growth rate and developer feature adoption?

    Highlight: Overall acceptance and integration of the platform into the development workflow.

    System metrics:

    • New user registrations: Monitor the growth rate of new developers using the platform.

    • Track the time between registration and first use of the platform, i.e., executing golden paths, tooling, and portal functionality.

    • Number of active users per week / month / quarter / half-year / year who authenticate via the portal and/or use golden paths, tooling and portal functionality

    • Feature adoption: Track how quickly and widely new features or updates are used.

    • Percentage of developers using CI/CD through the platform

    • Number of deployments per user / team / day / week / month — basically of your choosing

    • Training: Evaluate changes in adoption, after delivering training.

    Watch out for: Overlooking the "long tail" of adoption. A platform might see a burst of early adoption, but then plateau or even decline if it fails to continuously evolve and meet changing developer needs. Don't just measure initial adoption, monitor how usage evolves over weeks, months, and years.

    Retention: Are developers loyal to the platform?

    Highlight: Long-term engagement and reducing churn.

    Subjective metrics:

    • Use an exit survey if a user is dormant for 12 or more months.

    System metrics:

    • Churn rate: Track the percentage of developers who stop logging into the portal and are not using it.

    • Dormant users: Identify developers who become inactive after 6 months and investigate why.

    • Track services that are less frequently used.

    Watch out for: Misinterpreting the reasons for churn. When developers stop using your platform (churn), it's crucial to understand why. Incorrectly identifying the cause can lead to wasted effort and missed opportunities for improvement. Consider factors outside the platform — churn could be caused by changes in project requirements, team structures or industry trends.
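    The retention system metrics above are straightforward to derive once you have basic usage events. Here is an illustrative sketch, assuming a simple table of per-user usage timestamps; the column names, thresholds, and data are made up for the example rather than a prescribed schema.

    # Sketch: monthly active users, dormant users, and churn from usage events.
    import pandas as pd

    events = pd.DataFrame({
        "user": ["ana", "ana", "bo", "bo", "chris", "dee"],
        "timestamp": pd.to_datetime([
            "2024-09-02", "2025-03-10", "2024-01-15",
            "2025-02-20", "2024-06-01", "2025-03-01",
        ]),
    })
    as_of = pd.Timestamp("2025-03-31")

    last_seen = events.groupby("user")["timestamp"].max()
    active_this_month = (last_seen >= as_of - pd.DateOffset(months=1)).sum()
    dormant = last_seen[last_seen < as_of - pd.DateOffset(months=6)]   # inactive 6+ months
    churn_rate = len(dormant) / len(last_seen)

    print(f"Monthly active users: {active_this_month}")
    print(f"Dormant users (6+ months): {list(dormant.index)}")
    print(f"Churn rate: {churn_rate:.0%}")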

    Task success: Can developers complete specific tasks?

    Highlight: Efficiency and effectiveness of the platform in supporting specific developer activities.

    Subjective metrics:

    • Survey to assess the ongoing presence of toil and its harmful effect on developer productivity, which ultimately hinders efficiency and increases task completion times.

    System metrics:

    • Completion rates: Measure the percentage of golden paths and tools successfully run on the platform without errors.

    • Time to complete tasks using golden paths, portal, or tooling.

    • Error rates: Track common errors and failures developers encounter from log files or monitoring dashboards from golden paths, portal or tooling.

    • Mean Time to Resolution (MTTR): When errors do occur, how long does it take to resolve them? A lower MTTR indicates a more resilient platform and faster recovery from failures.

    • Developer platform and portal uptime: Measure the percentage of time that the developer platform and portal is available and operational. Higher uptime ensures developers can consistently access the platform and complete their tasks.

    Watch out for: Don't confuse task success with task completion. Simply measuring whether developers can complete tasks on the platform doesn't necessarily indicate true success. Developers might find workarounds or complete tasks inefficiently, even if they technically achieve the end goal. It may be worth manually observing developer workflows in their natural environment to identify pain points and areas of friction in their workflows.

    Also, be careful with misaligning task success with business goals. Task completion might overlook the broader impact on business objectives. A platform might enable developers to complete tasks efficiently, but if those tasks don't contribute to overall business goals, the platform's true value is questionable.
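    Two of the task-success system metrics above, MTTR and uptime, reduce to simple arithmetic once incident timestamps are recorded. A minimal sketch, using made-up incident data and a one-month measurement window:

    # Sketch: compute MTTR and uptime percentage from incident records.
    from datetime import datetime, timedelta

    incidents = [  # (detected, resolved) pairs -- example data only
        (datetime(2025, 3, 3, 9, 0), datetime(2025, 3, 3, 10, 30)),
        (datetime(2025, 3, 18, 22, 0), datetime(2025, 3, 19, 0, 0)),
    ]
    window = timedelta(days=31)  # measurement period

    downtime = sum((resolved - detected for detected, resolved in incidents), timedelta())
    mttr = downtime / len(incidents)
    uptime_pct = 100 * (1 - downtime / window)

    print(f"MTTR: {mttr}")               # 1:45:00
    print(f"Uptime: {uptime_pct:.3f}%")  # ~99.530%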

    Applying the HEART framework to platform engineering

    It’s not necessary to use all of the categories each time. The number of categories to consider really depends on the specific goals and context of the assessment; you can include everything or trim it down to better match your objective. Here are some examples:

    • Improving onboarding for new developers: Focus on adoption, task success and happiness.

    • Launching a new feature: Concentrate on adoption and happiness.

    • Increasing platform usage: Track engagement, retention and task success.

    Keep in mind that relying on just one category will likely provide an incomplete picture.

    When should you use the framework?

    In a perfect world, you would use the HEART framework to establish a baseline assessment a few months after launching your platform, which will provide you with a valuable insight into early developer experience. As your platform evolves, this initial data becomes a benchmark for measuring progress and identifying trends. Early measurement allows you to proactively address UX issues, guide design decisions with data, and iterate quickly for optimal functionality and developer satisfaction. If you're starting with an MVP, conduct the baseline assessment once the core functionality is in place and you have a small group of early users to provide feedback.

    After 12 or more months of usage, you can also add metrics to embody a new or more mature platform. This can help you gather deeper insights into your developers’ experience by understanding how they are using the platform, measure the impact of changes you’ve made to the platform, or identify areas for improvement and prioritize future development efforts. If you've added new golden paths, tooling, or enhanced functionality, then you'll need to track metrics that measure their success and impact on developer behavior.

    The frequency with which you assess HEART metrics depends on several factors, including:

    • The maturity of your platform: Newer platforms benefit from more frequent reviews (e.g. monthly or quarterly) to track progress and address early issues. As the platform matures, you can reduce the frequency of your HEART assessments (e.g., bi-annually or annually).

    • The rate of change: To ensure updates and changes have a positive impact, apply the HEART framework more frequently when your platform is undergoing a period of rapid evolution such as major platform updates, new portal features or new golden paths, or some change in user behavior. This allows you to closely monitor the effects of each change on key metrics.

    • The size and complexity of your platform: Larger and more complex platforms may require more frequent assessments to capture nuances and potential issues.

    • Your team's capacity: Running HEART assessments requires time and resources. Consider your team's bandwidth and adjust the frequency accordingly.

    Schedule periodic deep dives (e.g. quarterly or bi-annually) using the HEART framework to gain a more in-depth understanding of your platform's performance and identify areas for improvement.

    Taking more steps towards platform engineering

    In this blog post, we’ve shown how the HEART framework can be applied to platform engineering to measure and improve the developer experience. We’ve explored the five key aspects of the framework — happiness, engagement, adoption, retention, and task success — and provided specific metrics for each and guidance on when to apply them. By applying these insights, platform engineering teams can create a more positive and productive environment for their developers, leading to greater success in their software development efforts. To learn more about platform engineering, check out some of our other articles: 5 myths about platform engineering: what it is and what it isn’t, Another five myths about platform engineering, and Laying the foundation for a career in platform engineering.

    And finally, check out the DORA Report 2024, which now has a section on Platform Engineering.

  48. DORA Research Lead

    Tue, 22 Oct 2024 16:00:00 -0000

    The DORA research program has been investigating the capabilities, practices, and measures of high-performing technology-driven teams and organizations for more than a decade. It has published reports based on data collected from annual surveys of professionals working in technical roles, including software developers, managers, and senior executives.

    Today, we’re pleased to announce the publication of the 2024 Accelerate State of DevOps Report, marking a decade of DORA’s investigation into high-performing technology teams and organizations. DORA’s four key metrics, introduced in 2013, have become the industry standard for measuring software delivery performance. 

    Each year, we seek to gain a comprehensive understanding of standard DORA performance metrics, and how they intersect with individual, workflow, team, and product performance. We now include how AI adoption affects software development across multiple levels, too.


    We also establish reference points each year to help teams understand how they are performing, relative to their peers, and to inspire teams with the knowledge that elite performance is possible in every industry. DORA’s research over the last decade has been designed to help teams get better at getting better: to strive to improve their improvements year over year. 

    For a quick overview of this year’s report, you can read our executive DORA Report summary, which spotlights AI adoption trends and impact, the emergence of platform engineering, and the continuing significance of developer experience.

    Organizations across all industries are prioritizing the integration of AI into their applications and services. Developers are increasingly relying on AI to improve their productivity and fulfill their core responsibilities. This year's research reveals a complex landscape of benefits and tradeoffs for AI adoption.

    The report underscores the need to approach platform engineering thoughtfully, and emphasizes the critical role of developer experience in achieving high performance. 


    AI: Benefits, challenges, and developing trust

    Widespread AI adoption is reshaping software development practices. More than 75 percent of respondents said that they rely on AI for at least one daily professional responsibility. The most prevalent use cases include code writing, information summarization, and code explanation. 

    The report confirms that AI is boosting productivity for many developers. More than one-third of respondents experienced “moderate” to “extreme” productivity increases due to AI.


    A 25% increase in AI adoption is associated with improvements in several key areas:

    • 7.5% increase in documentation quality

    • 3.4% increase in code quality

    • 3.1% increase in code review speed

    However, despite AI’s potential benefits, our research revealed a critical finding: AI adoption may negatively impact software delivery performance. As AI adoption increased, it was accompanied by an estimated 1.5% decrease in delivery throughput and an estimated 7.2% reduction in delivery stability. Our data suggest that improving the development process does not automatically improve software delivery — at least not without proper adherence to the basics of successful software delivery, like small batch sizes and robust testing mechanisms. AI has positive impacts on many important individual and organizational factors that foster the conditions for high software delivery performance, but it does not appear to be a panacea.

    Our research also shows that despite the productivity gains, 39% of respondents reported little to no trust in AI-generated code. This unexpectedly low level of trust indicates that there is a need to manage AI integration more thoughtfully. Teams must carefully evaluate AI’s role in their development workflow to mitigate the downsides.

    Based on these findings, we have three core recommendations:

    1. Enable your employees and reduce toil by orienting your AI adoption strategies towards empowering employees and alleviating the burden of undesirable tasks.

    2. Establish clear guidelines for the use of AI, address procedural concerns, and foster open communication about its impact.

    3. Encourage continuous exploration of AI tools, provide dedicated time for experimentation, and promote trust through hands-on experience.

    Platform engineering: A paradigm shift

    Another emerging discipline our research focused on this year is platform engineering, which centers on building and operating internal development platforms to streamline processes and enhance efficiency.


    Our research identified 4 key findings regarding platform engineering:

    • Increased developer productivity: Internal development platforms effectively increase productivity for developers.

    • Prevalence in larger firms: These platforms are more commonly found in larger organizations, suggesting their suitability for managing complex development environments.

    • Potential performance dip: Implementing a platform engineering initiative might lead to a temporary decrease in performance before improvements manifest as the platform matures.

    • Need for user-centeredness and developer independence: For optimal results, platform engineering efforts should prioritize user-centered design, developer independence, and a product-oriented approach.

    A thoughtful approach that prioritizes user needs, empowers developers, and anticipates potential challenges is key to maximizing the benefits of platform engineering initiatives. 

    Developer experience: The cornerstone of success

    One of the key insights in last year’s report was that a healthy culture can help reduce burnout, increase productivity, and increase job satisfaction. This year was no different. Teams that cultivate a stable and supportive environment that empowers developers to excel drive positive outcomes. 

    A ‘move fast and constantly pivot’ mentality negatively impacts developer well-being and, consequently, overall performance. Instability in priorities, even with strong leadership, comprehensive documentation, and a user-centered approach — all known to be highly beneficial — can significantly hinder progress.

    Creating a work environment where your team feels supported, valued, and empowered to contribute is fundamental to achieving high performance. 

    How to use these findings to help your DevOps team

    The key takeaway from the decade of research is that software development success hinges not just on technical prowess but also on fostering a supportive culture, prioritizing user needs, and focusing on developer experience. We encourage teams to replicate our findings within their specific context.

    The findings can serve as hypotheses for your experiments and continuous improvement initiatives. Please share the results with us and the DORA community, so that your efforts can become part of our collaborative learning environment.

    We work on this research in hopes that it serves as a roadmap for teams and organizations seeking to improve their practices and create a thriving environment for innovation, collaboration, and business success. We will continue our platform-agnostic research that focuses on the human aspect of technology for the next decade to come.

    To learn more:

  49. Product Manager - Google Cloud Databases

    Thu, 10 Oct 2024 14:00:00 -0000

    Organizations are grappling with an explosion of operational data spread across an increasingly diverse and complex database landscape. This complexity often results in costly outages, performance bottlenecks, security vulnerabilities, and compliance gaps, hindering their ability to extract valuable insights and deliver exceptional customer experiences. To help businesses overcome these challenges, earlier this year, we announced the preview of Database Center, an AI-powered, unified fleet management solution.

    We’re seeing accelerated adoption of Database Center from many customers. For example, Ford uses Database Center to get answers about its database fleet health in seconds, and proactively mitigates potential risks to its applications. Today, we’re announcing that Database Center is now available to all customers, empowering you to monitor and operate database fleets at scale with a single, unified solution. We've also added support for Spanner, so you can manage it along with your Cloud SQL and AlloyDB deployments, with support for additional databases on the way.

    Database Center is designed to bring order to the chaos of your database fleet, and unlock the true potential of your data. It provides a single, intuitive interface where you can:

    • Gain a comprehensive view of your entire database fleet. No more silos of information or hunting through bespoke tools and spreadsheets.

    • Proactively de-risk your fleet with intelligent performance and security recommendations. Database Center provides actionable insights to help you stay ahead of potential problems, and helps improve performance, reduce costs and enhance security with data-driven suggestions.

    • Optimize your database fleet with AI-powered assistance. Use a natural-language chat interface to ask questions, quickly resolve fleet issues, and get optimization recommendations.

    Let’s now review each in more detail.

    Gain a comprehensive view of your database fleet 

    Tired of juggling different tools and consoles to keep track of your databases?

    Database Center simplifies database management with a single, unified view of your entire database landscape. You can monitor database resources across your entire organization, spanning multiple engines, versions, regions, projects and environments (or applications using labels). 

    Cloud SQL, AlloyDB, and now Spanner are all fully integrated with Database Center, so you can monitor your inventory and proactively detect issues. Using the unified inventory view in Database Center, you can: 

    • Identify out-of-date database versions to ensure proper support and reliability

    • Track version upgrades, e.g., whether a PostgreSQL 14 to PostgreSQL 15 upgrade is proceeding at the expected pace

    • Ensure database resources are appropriately distributed, e.g., identify the number of databases powering the critical production applications vs. non-critical dev/test environments

    • Monitor database migration from on-prem to cloud or across engines

    Manage Cloud SQL, AlloyDB and Spanner resources with a unified view.

    Proactively de-risk your fleet with recommendations

    Managing your database fleet health at scale can involve navigating through a complex blend of security postures, data protection settings, resource configurations, performance tuning and cost optimizations. Database Center proactively detects issues associated with these configurations and guides you through addressing them. 

    For example, high transaction ID for a Cloud SQL instance can lead to the database no longer accepting new queries, potentially causing latency issues or even downtime. Database Center proactively detects this, provides an in-depth explanation, and walks you through prescriptive steps to troubleshoot the issue. 

    We’ve also added several performance recommendations to Database Center related to excessive tables/joins, connections, or logs, and can assist you through a simple optimization journey.

    End-to-end workflow for detecting and troubleshooting performance issues.

    Database Center also simplifies compliance management by automatically detecting and reporting violations across a wide range of industry standards, including CIS, PCI-DSS, SOC 2, and HIPAA. Database Center continuously monitors your databases for potential compliance violations. When a violation is detected, you receive a clear explanation of the problem, including:

    • The specific security or reliability issue causing the violation 

    • Actionable steps to help address the issue and restore compliance

    This helps reduce the risk of costly penalties, simplifies compliance audits and strengthens your security posture. Database Center now also supports real-time detection of unauthorized access, updates, and data exports.

    Database Center helps ensure compliance with HIPAA standards.

    Optimize your fleet with AI-powered assistance

    With Gemini enabled, Database Center makes optimizing your database fleet incredibly intuitive. Simply chat with the AI-powered interface to get precise answers, uncover issues within your database fleet, troubleshoot problems, and quickly implement solutions. For example, you can quickly identify under-provisioned instances across your entire fleet, access actionable insights such as the duration of high CPU/Memory utilization conditions, receive recommendations for optimal CPU/memory configurations, and learn about the associated cost of those adjustments. 

    AI-powered chat in Database Center provides comprehensive information and recommendations across all aspects of database management, including inventory, performance, availability and data protection. Additionally, AI-powered cost recommendations suggest ways for optimizing your spend, and advanced security and compliance recommendations help strengthen your security and compliance posture.

    AI-powered chat to identify data protection issues and optimize cost.

    Get started with Database Center today

    The new capabilities of Database Center are available in preview today for Spanner, Cloud SQL, and AlloyDB for all customers. Simply access Database Center within the Google Cloud console and begin monitoring and managing your entire database fleet. To learn more about Database Center’s capabilities, check out the documentation.

  50. Product Manager, Google Cloud

    Tue, 08 Oct 2024 16:00:00 -0000

    Editor's note: Starting February 4, 2025, pipe syntax will be available to all BigQuery users by default.


    Log data has become an invaluable resource for organizations seeking to understand application behavior, optimize performance, strengthen security, and enhance user experiences. But the sheer volume and complexity of logs generated by modern applications can feel overwhelming. How do you extract meaningful insights from this sea of data?

    At Google Cloud, we’re committed to providing you with the most powerful and intuitive tools to unlock the full potential of your log data. That's why we're thrilled to announce a series of innovations in BigQuery and Cloud Logging designed to revolutionize the way you manage, analyze, and derive value from your logs.

    BigQuery pipe syntax: Reimagine SQL for log data

    Say goodbye to the days of deciphering complex, nested SQL queries. BigQuery pipe syntax ushers in a new era of SQL, specifically designed with the semi-structured nature of log data in mind. BigQuery’s pipe syntax introduces an intuitive, top-down syntax that mirrors how you naturally approach data transformations. As demonstrated in the recent research by Google, this approach leads to significant improvements in query readability and writability. By visually separating different stages of a query with the pipe symbol (|>), it becomes remarkably easy to understand the logical flow of data transformation. Each step is clear, concise, and self-contained, making your queries more approachable for both you and your team.

    BigQuery’s pipe syntax isn’t just about cleaner SQL — it’s about unlocking a more intuitive and efficient way to work with your data. Instead of wrestling with code, experience faster insights, improved collaboration, and more time spent extracting value.

    This streamlined approach is especially powerful when it comes to the world of log analysis. 

    With log analysis, exploration is key. Log analysis is rarely a straight line from question to answer. Analyzing logs often means sifting through mountains of data to find specific events or patterns. You explore, you discover, and you refine your approach as you go. Pipe syntax embraces this iterative approach. You can smoothly chain together filters (WHERE), aggregations (COUNT), and sorting (ORDER BY) to extract those golden insights. You can also add or remove steps in your data processing as you uncover new insights, easily adjusting your analysis on the fly.

    Imagine you want to count the total number of users who were affected by the same errors more than 100 times in the month of January. As shown below, the pipe syntax’s linear structure clearly shows the data flowing through each transformation: starting from the table, filtering by the dates, counting by user id and error type, filtering for errors >100, and finally counting the number of users affected by the same errors.

    -- Pipe syntax
    FROM log_table
    |> WHERE datetime BETWEEN DATETIME '2024-01-01' AND '2024-01-31'
    |> AGGREGATE COUNT(log_id) AS error_count GROUP BY user_id, error_type
    |> WHERE error_count > 100
    |> AGGREGATE COUNT(user_id) AS user_count GROUP BY error_type

    The same example in the standard syntax typically requires a subquery and a non-linear structure.

    -- Standard syntax
    SELECT error_type, COUNT(user_id) AS user_count
    FROM (
      SELECT user_id, error_type,
        COUNT(log_id) AS error_count
      FROM log_table
      WHERE datetime BETWEEN DATETIME '2024-01-01' AND DATETIME '2024-01-31'
      GROUP BY user_id, error_type
    )
    WHERE error_count > 100
    GROUP BY error_type;

    Carrefour: A customer's perspective

    The impact of these advancements is already being felt by our customers. Here's what Carrefour, a global leader in retail, had to say about their experience with pipe syntax:

     "Pipe syntax has been a very refreshing addition to BigQuery. We started using it to dig into our audit logs, where we often use Common Table Expressions (CTEs) and aggregations. With pipe syntax, we can filter and aggregate data on the fly by just adding more pipes to the same query. This iterative approach is very intuitive and natural to read and write. We are now using it for our analysis work in every business domain. We will have a hard time going back to the old SQL syntax now!" - Axel Thevenot, Lead Data Engineer, and Guillaume Blaquiere, Data Architect, Carrefour

    BigQuery pipe syntax is currently available for all BigQuery users. You can check out this introductory video.

    Beyond syntax: performance and flexibility

    But we haven't stopped at simplifying your code. BigQuery now offers enhanced performance and powerful JSON handling capabilities to further accelerate your log analytics workflows. Given the prevalence of JSON data in logs, we expect these changes to simplify log analytics for a majority of users.

    • Enhanced Point Lookups: Pinpoint critical events in massive datasets quickly using BigQuery's numeric search indexes, which dramatically accelerate queries that filter on timestamps and unique IDs. Here is a sample improvement from the announcement blog:

    Metric                  Without Index          With Index       Improvement

    Execution Time (ms)     48,790                 4,664            10x
    Processed Bytes         2,174,758,158,336      774,897,664      2,806x
    Slot Usage (ms)         25,735,222             7,300            3,525x

    • Powerful JSON Analysis: Parse and analyze your JSON-formatted log data with ease using BigQuery's JSON_KEYS function and JSONPath traversal feature. Extract specific fields, filter on nested values, and navigate complex JSON structures without breaking a sweat.

      • JSON_KEYS extracts unique JSON keys from JSON data for easier schema exploration and discoverability 

    Query: JSON_KEYS(JSON '{"a":{"b":1}}')
    Result: ["a", "a.b"]

    Query: JSON_KEYS(JSON '{"a":[{"b":1}, {"c":2}]}', mode => "lax")
    Result: ["a", "a.b", "a.c"]

    Query: JSON_KEYS(JSON '[[{"a":1},{"b":2}]]', mode => "lax recursive")
    Result: ["a", "b"]

      • JSONPath with LAX modes lets you easily fetch JSON arrays without having to use verbose UNNEST. The example below shows how to fetch all phone numbers from the person field, before and after:
    -- Consider a JSON field 'person' such as:
    -- [{
    --   "name": "Bob",
    --   "phone": [{"type": "home", "number": 20}, {"number": 30}]
    -- }]

    -- Previously, to fetch all phone numbers from the 'person' column:
    SELECT phone.number
    FROM (
      SELECT IF(JSON_TYPE(person.phone) = "array", JSON_QUERY_ARRAY(person.phone), [person.phone]) AS nested_phone
      FROM (
        SELECT IF(JSON_TYPE(person) = "array", JSON_QUERY_ARRAY(person), [person]) AS nested_person
        FROM t
      ), UNNEST(nested_person) person
    ), UNNEST(nested_phone) phone

    -- With lax mode:
    SELECT JSON_QUERY(person, "lax recursive $.phone.number") FROM t

    Log Analytics in Cloud Logging: Bringing it all together

    Log Analytics in Cloud Logging is built on top of BigQuery and provides a UI that’s purpose-built for log analysis. With an integrated date/time picker, charting and dashboarding, Log Analytics makes use of the JSON capabilities to support advanced queries and analyze logs faster. To seamlessly integrate these powerful capabilities into your log management workflow, we're also enhancing Log Analytics (in Cloud Logging) with pipe syntax. You can now analyze your logs within Log Analytics leveraging the full power of BigQuery pipe syntax, enhanced lookups, and JSON handling, all within a unified platform.


    Use of pipe syntax in Log Analytics (Cloud Logging) is now available in preview.

    Unlock the future of log analytics today

    BigQuery and Cloud Logging provide an unmatched solution for managing, analyzing, and extracting actionable insights from your log data. Explore these new capabilities today and experience the power of pipe syntax, enhanced lookups, and JSON handling for yourself.

    Start your journey towards more insightful and efficient log analytics in the cloud with BigQuery and Cloud Logging. Your data holds the answers — we're here to help you find them.

  51. Chief Evangelist, Google Cloud

    Fri, 04 Oct 2024 17:00:00 -0000

    As AI adoption speeds up, one thing is becoming clear: the developer platforms that got you this far won’t get you to the next stage. While yesterday’s platforms were awesome, let’s face it, they weren’t built for today’s AI-infused application development and deployment. And organizations are quickly realizing they need to update their platform strategies to ensure that developers — and the wider set of folks using AI — have what they need for the years ahead.

    In fact, as I explore in a new paper, nine out of ten decision makers are prioritizing the task of optimizing workloads for AI over the next 12 months. Problem is, given the pace of change lately, many don’t know where to start or what they need when it comes to modernizing their developer platforms.

    What follows is a quick look at the key steps involved in planning your platform strategy. For all the details, download my full guide, Three pillars of a modern, AI-ready platform.

    Step 1. Define your platform’s purpose

    Whether you’re building your first platform or your fiftieth, you need to start by asking, “Why?” After all, a new platform is another asset to maintain and operate — you need to make sure it exists for the right reasons.

    To build your case, ask yourself three questions:

    • Who is the platform for? Your platform’s customers, or users, can include developers, architects, product teams, SREs and Ops personnel, data scientists, security teams, and platform owners. Each has different needs, and your platform will need to be tailored accordingly.
    • What are its goals? Work out what problems you’re trying to solve. For example, are you optimizing for AI? Striving to speed up software delivery? Increasing developer productivity? Improving scale or security? Again, different goals will lead you down different paths for your platform — so map them out right from the start.
    • How will you measure success? To prove the worth of your platform, and to convince stakeholders to invest in its ongoing maintenance, establish metrics from the outset, and keep on measuring them! These could range from improved customer satisfaction to faster time-to-resolution for support issues. 

    Step 2. Assemble the pieces of your platform

    Now that you’re clear on the customers, goals, and performance metrics of the platform you need, it’s time to actually build the thing. Here’s a glance at the key components of a modern, AI-ready platform — complete with the capabilities developers need to hit the ground running when developing AI-powered solutions.

    (Figure: key components of a modern, AI-ready platform)

    For a detailed breakdown of what to consider in each area of your platform, including a list of technology options for each category, head over to the full paper.

    Step 3. Establish a process for improving your platform

    The journey doesn’t end once your platform’s built. In fact, it’s just beginning. A platform is never “done;” it’s just released. As such, you need to adopt a continuous improvement mindset and assign a core platform team the task of finding new ways to introduce value to stakeholders.

    At this stage, my top tip is to treat your platform like a product, applying platform engineering principles to keep making it faster, cheaper, and easier to deliver software. Oh, and to leverage the latest in AI-driven optimization tools to monitor and maintain your platform over time!  

    Ready to start your platform journey?

    Organizations embark on platform overhauls for a whole bunch of reasons. Some do it to better cope with forecasted growth. Others have AI adoption in their sights. Then there are those driven by cost, performance, or the user experience. Whatever your reason for getting started, I encourage you to read the full paper on building a modern AI-ready platform — your developers (and the business) will thank you.

  52. Technical Program Management

    Fri, 27 Sep 2024 16:00:00 -0000

    You’ve probably felt the frustration that arises when a project fails to meet established deadlines. And perhaps you’ve also encountered scenarios where project staff or computing have been reallocated to higher priority projects. It can be super challenging to get projects done on time with this kind of uncertainty. 

    That’s especially true for Site Reliability Engineering (SRE) teams. Project management principles can help, but in IT, many project management frameworks are directed at teams that have a single focus, such as a software-development team. 

    That’s not true for SRE teams at Google. They are charged with delivering infrastructure projects as well as their primary role: supporting production. Broadly speaking, SRE time is divided in half between supporting production environments and focusing on product. 

    A common problem

    In a recent endeavor, our SRE team took on a project to regionalize our infrastructure to enhance the reliability, security, and compliance of our cloud services. The project was allocated a well-defined timeline, driven by our commitments to our customers and adherence to local regulations. As the technical program manager (TPM), I decomposed the overarching goal into smaller milestones and communicated to the leadership team to ensure they remained abreast of the progress.

    However, throughout the execution phase of the project, we encountered a multitude of unrelated production incidents — the Spanner queue was growing long, and the accumulation of messages led to increased compilation times for our developer builds; this in turn led to bad builds rolling out. On top of this, asynchronous tasks were not completing as expected. When the bad build was rolled back, all of the backlogged async tasks fired at once. Due to these unforeseen challenges, some engineers were temporarily reassigned from the regionalization project to handle operational constraints associated with production infrastructure. No surprise, the change in staff allocation towards production incidents resulted in the project work being delayed. 

    Better planning with SRE

    Teams that manage production services, like SRE, have many ways to solve tough problems. The secret is to choose the solution that gets the job done the fastest and with the least amount of red tape for engineers to deal with.

    In our organization, we’ve started taking a proactive approach to problem-solving by incorporating enhanced planning at the project's inception. As a TPM, my biggest trick to ensuring projects are finished on time is keeping some engineering hours in reserve and planning carefully when the project should start.

    How many resources should you hold back, exactly? We did a deep dive into our past production issues and how we've been using our resources. Based on this, when planning SRE projects, we set aside 25% of our time for production work (see the sketch below for the arithmetic). Of course, this 25% buffer will differ across organizations, but this new approach, which takes into account our critical business needs, has been a game-changer for us in making sure our projects are delivered on time, while ensuring that SREs can still focus on production incidents — our top priority for the business.
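    The arithmetic behind this planning approach is simple enough to sketch. The team size, project length, and reserve percentage below are illustrative numbers, not a prescription.

    # Back-of-the-envelope sketch: hold a fraction of engineering time in reserve
    # for production work and plan the project against what remains.

    def plan(team_size: int, weeks: int, hours_per_week: float = 40.0,
             production_reserve: float = 0.25) -> dict:
        total = team_size * weeks * hours_per_week
        reserved = total * production_reserve
        return {
            "total_hours": total,
            "reserved_for_production": reserved,
            "available_for_project": total - reserved,
        }

    print(plan(team_size=6, weeks=12))
    # {'total_hours': 2880.0, 'reserved_for_production': 720.0, 'available_for_project': 2160.0}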

    Key takeaways

    In a nutshell, planning for SRE projects is different from planning for projects in development organizations, because development organizations spend the lion’s share of their time working on projects. Luckily, SRE Program Management is really good at handling complicated situations, especially big programs. 

    Beyond holding back resources, here are few other best practices and structures that TPMs employ when planning SRE projects:

    • Ensuring that critical programs are staffed for success

    • Providing opportunities for TPMs to work across services, cross-pollinating with standardized solutions and avoiding duplication of work

    • Providing more education to Site Reliability Managers and SREs on the value of early TPM engagement, and encouraging services to surface problem statements earlier

    • Leveraging the skills of TPMs to manage external dependencies and interface with other partner organizations such as Engineering, Infrastructure Change Management, and Technical Infrastructure

    • Providing coverage at times of need for services with otherwise low program management demands

    • Enabling consistent performance evaluation and provide opportunities for career development for the TPM community

    The TPM role within SRE is at the heart of fulfilling SRE’s mission: making workflows faster, more reliable, and preparing for the continued growth of Google's infrastructure. As a TPM, you need to ensure that systems and services are carefully planned and deployed, taking into account multiple variables such as price, availability, and scheduling, while always keeping the bigger picture in mind. To learn more about project management for TPMs and related roles, consider enrolling in this course, and check out the following resources:

    1. Program Management Practices

    2. The Evolving SRE Engagement Model

    3. Part III. Practices

  53. AI/ML Customer Engineer, UKI, Google Cloud

    Fri, 30 Aug 2024 16:00:00 -0000

    Who is supposed to manage generative AI applications? While AI-related ownership often lands with data teams, we're seeing requirements specific to generative AI applications that have distinct differences from those of a data and AI team, and at times more similarities with a DevOps team. This blog post explores these similarities and differences, and considers the need for a new ‘GenOps’ team to cater for the unique characteristics of generative AI applications.

    In contrast to data science, which is about creating models from data, generative AI is about creating AI-enabled services from models, and is concerned with the integration of pre-existing data, models and APIs. When viewed this way, generative AI can feel similar to a traditional microservices environment: multiple discrete, decoupled and interoperable services consumed via APIs. And if there are similarities in the landscape, then it is logical that they share common operational requirements. So what practices can we take from the world of microservices and DevOps and bring to the new world of GenOps?

    What are we operationalising? The AI agent vs the microservice

    How do the operational requirements of a generative AI application differ from other applications? With traditional applications, the unit of operationalisation is the microservice: a discrete, functional unit of code, packaged into a container and deployed into a container-native runtime such as Kubernetes. For generative AI applications, the comparative unit is the generative AI agent: also a discrete, functional unit of code defined to handle a specific task, but with some additional constituent components that make it more than ‘just’ a microservice and add in its key differentiating behavior of being non-deterministic in terms of both its processing and its output:

    1. Reasoning loop - The control logic defining what the agent does and how it works. It often includes iterative logic or thought chains to break down an initial task into a series of model-powered steps that work towards the completion of a task. 

    2. Model definitions - One or a set of defined access patterns for communicating with models, readable and usable by the Reasoning Loop

    3. Tool definitions - a set of defined access patterns for other services external to the agent, such as other agents, data access (RAG) flows, and external APIs. These should be shared across agents, exposed through APIs and hence a Tool definition will take the form of a machine-readable standard such as an OpenAPI specification.

    Logical components of a generative AI agent

    The Reasoning Loop is essentially the full scope of a microservice, and the model and Tool definitions are the additional powers that make it into something more. Importantly, although the Reasoning Loop logic is just code and therefore deterministic in nature, it is driven by the responses of non-deterministic AI models, and this non-deterministic nature is what creates the need for Tools: the agent ‘chooses for itself’ which external service should be used to fulfill a task. A fully deterministic microservice has no need for this ‘cookbook’ of Tools to select from: its calls to external services are pre-determined and hard-coded into the Reasoning Loop.
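
    To make this distinction concrete, below is a minimal, illustrative Python sketch of a Reasoning Loop that lets the model choose a Tool at each step. It is not tied to any particular agent framework: the call_model stub, the TOOLS registry and the JSON response format are all hypothetical placeholders.

    import json
    from typing import Callable

    # Hypothetical Tool registry. In practice these would be machine-readable
    # definitions (e.g. OpenAPI specs) resolved to API calls, not local functions.
    TOOLS: dict[str, Callable[[str], str]] = {
        "get_weather": lambda city: f"weather report for {city}",
        "search_orders": lambda q: f"orders matching '{q}'",
    }

    def call_model(prompt: str) -> str:
        # Stand-in for a call to a non-deterministic LLM. It returns canned JSON
        # here so the sketch runs end to end; a real implementation would call a
        # model API and return either {"tool": ..., "input": ...} or {"answer": ...}.
        if "Observation" in prompt:
            return json.dumps({"answer": "It is raining in London."})
        return json.dumps({"tool": "get_weather", "input": "London"})

    def reasoning_loop(task: str, max_steps: int = 5) -> str:
        # Deterministic control logic driven by non-deterministic model output.
        context = f"Task: {task}\nAvailable tools: {list(TOOLS)}"
        for _ in range(max_steps):
            decision = json.loads(call_model(context))
            if "answer" in decision:            # the model decides the task is done
                return decision["answer"]
            tool = TOOLS[decision["tool"]]      # the agent 'chooses for itself'
            observation = tool(decision["input"])
            context += f"\nObservation: {observation}"
        return "Stopped after max_steps without an answer"

    The loop itself is plain, deterministic code; what makes the agent agent-like is that the model's responses decide which Tool is called next.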

    However, there are still many similarities. Just like a microservice, an agent:

    • Is a discrete unit of function that should be shared across multiple apps/users/teams in a multi-tenancy pattern

    • Has a lot of flexibility in development approach: a wide range of software languages are available to use, and any one agent can be built in a different way to another.

    • Has very low inter-dependency from one agent to another: development lifecycles are decoupled with independent CI/CD pipelines for each. The upgrade of one agent should not affect another agent.

    Feature                       | Microservice                                   | Agent
    ------------------------------|------------------------------------------------|---------------------------------------------------
    Output                        | Deterministic                                  | Non-deterministic
    Scope                         | Single unit of discrete deterministic function | Single unit of discrete non-deterministic function
    Latency                       | Lower                                          | Higher
    Cost                          | Lower                                          | Higher
    Transparency / Explainability | High                                           | Low
    Development flexibility       | High                                           | High
    Development inter-dependence  | None                                           | None
    Upgrade inter-dependence      | None                                           | None

    Operational platforms and separation of responsibilities

    Another important difference is service discovery. This is a solved problem in the world of microservices: the impractical burden of each microservice tracking the availability, location and networking details of every service it needs to call was taken out of the microservice itself and handled by packaging microservices into containers and deploying them onto a common platform layer such as Kubernetes and Istio. With generative AI agents, this consolidation onto a standard deployment unit has not yet happened. There is a range of ways to build and deploy a generative AI agent, from code-first DIY approaches through to no-code managed agent-builder environments. I am not against these tools in principle; however, they are creating a more heterogeneous deployment landscape than we have today with microservices applications, and I expect this will create operational complexity in the future.

    To deal with this, at least for now, we need to move away from the Point-to-Point model seen in microservices and adopt a Hub-and-Spoke model, where the discoverability of agents, Tools and models is handled by publishing APIs onto an API Gateway that provides a consistent abstraction layer above this inconsistent landscape.
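
    As a rough sketch of what Hub-and-Spoke discovery could look like from the agent's side, the snippet below asks the gateway for its catalog of published Tools instead of hard-coding endpoints. The gateway URL, the /catalog path and the response shape are assumptions for illustration, not a specific product's API.

    import requests

    GATEWAY_URL = "https://apigateway.example.com"  # hypothetical API Gateway

    def discover_tools() -> dict[str, str]:
        # Hub-and-Spoke discovery: ask the gateway which Tools are published,
        # rather than baking service locations into every agent.
        resp = requests.get(f"{GATEWAY_URL}/catalog", timeout=10)
        resp.raise_for_status()
        # Assumed response shape: [{"name": "...", "endpoint": "..."}, ...]
        return {item["name"]: item["endpoint"] for item in resp.json()}

    def call_tool(tools: dict[str, str], name: str, payload: dict) -> dict:
        # Every call still goes through the gateway's consistent abstraction layer.
        resp = requests.post(tools[name], json=payload, timeout=30)
        resp.raise_for_status()
        return resp.json()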

    This brings the additional benefit of clear separation of responsibilities between the apps and agents built by development teams, and Generative AI specific components such as models and Tools:

    Separating responsibilities with an API Gateway

    All operational platforms should create a clear point of separation between the roles and responsibilities of app and microservice development teams and those of the operational teams. With microservice-based applications, responsibilities are handed over at the point of deployment, and focus switches to non-functional requirements such as reliability, scalability, infrastructure efficiency, networking and security.

    Many of these requirements are still just as important for a generative AI app, and I believe there are some additional considerations specific to generative agents and apps which require specific operational tooling:

    1. Model compliance and approval controls
    There are a lot of models out there. Some are open-source, some are licensed. Some provide intellectual property indemnity, some do not. All have specific and complex usage terms that have large potential ramifications but take time and the right skillset to fully understand.

    It’s not reasonable or appropriate to expect our developers to have the time or knowledge to factor in these considerations during model selection. Instead, an organization should have a separate model review and approval process to determine whether usage terms are acceptable for further use, owned by legal and compliance teams, supported on a technical level by clear, governable and auditable approval/denial processes that cascade down into development environments.
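
    On the technical side, the cascade into development environments can be as simple as an allow-list that client code must pass before a model is used. The sketch below assumes a hypothetical approved_models.json maintained by the legal and compliance review process.

    import json

    def load_approved_models(path: str = "approved_models.json") -> set[str]:
        # Allow-list produced by the model review and approval process
        # (assumed format: a JSON array of approved model identifiers).
        with open(path) as f:
            return set(json.load(f))

    def require_approval(model_name: str, approved: set[str]) -> None:
        # Fail fast in development environments if a model has not been reviewed.
        if model_name not in approved:
            raise PermissionError(
                f"Model '{model_name}' has not passed compliance review"
            )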

    2. Prompt version management
    Prompts need to be optimized for each model. Do we want our app teams focusing on prompt optimization, or on building great apps? Prompts are a non-functional component and should be taken out of the app source code and managed centrally, where they can be optimized, periodically evaluated, and reused across apps and agents.
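
    A minimal sketch of what this could look like from an app's point of view, assuming a hypothetical central prompt store exposed over HTTP; the store URL, path scheme and template syntax are illustrative only.

    import requests

    PROMPT_STORE_URL = "https://promptstore.example.com"  # hypothetical central store

    def get_prompt(name: str, version: str = "latest") -> str:
        # Fetch a versioned prompt template that is managed outside app source code.
        resp = requests.get(f"{PROMPT_STORE_URL}/prompts/{name}/{version}", timeout=10)
        resp.raise_for_status()
        return resp.text

    def hydrate(template: str, **values: str) -> str:
        # 'Hydrate' the template with request-specific values, e.g. {question}.
        return template.format(**values)

    # Usage: the app never embeds the prompt text itself.
    # prompt = hydrate(get_prompt("support-summary", "v3"), question=user_question)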

    3. Model (and prompt) evaluation
    Just as with an MLOps platform, there is a clear need for ongoing assessment of model response quality to enable a data-driven approach to evaluating and selecting the best models for a particular use case. The key difference with generative AI models is that the assessment is inherently more qualitative than the quantitative skew or drift analysis applied to a traditional ML model.

    Subjective, qualitative assessments performed by humans are clearly not scalable, and they introduce inconsistency when performed by multiple people. Instead, we need consistent automated pipelines powered by AI evaluators, which, although imperfect, provide consistency in the assessments and a baseline for comparing models against each other.
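
    Such a pipeline could be sketched roughly as below: an AI evaluator scores candidate responses against a fixed rubric so that assessments stay consistent across models and prompts. The judge_model callable and the rubric are placeholders, not a specific evaluation product.

    from statistics import mean
    from typing import Callable

    RUBRIC = ("Score the response from 1 to 5 for factual accuracy and helpfulness. "
              "Reply with a number only.")

    def evaluate(candidate: Callable[[str], str],
                 judge_model: Callable[[str], str],
                 eval_prompts: list[str]) -> float:
        # Run every eval prompt through the candidate and let the AI evaluator score
        # the output: imperfect, but consistent and repeatable across candidates.
        scores = []
        for prompt in eval_prompts:
            response = candidate(prompt)
            verdict = judge_model(f"{RUBRIC}\n\nPrompt: {prompt}\nResponse: {response}")
            scores.append(float(verdict.strip()))
        return mean(scores)

    # Usage: compare two models, or two prompts on the same model, on one eval set.
    # score_a = evaluate(model_a, judge, eval_prompts)
    # score_b = evaluate(model_b, judge, eval_prompts)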

    4. Model security gateway
    The single most common operational feature I hear large enterprises investing time into is a security proxy for safety checks before passing a prompt on to a model (as well as the reverse: a check against the generated response before passing back to the client).

    Common considerations:

    1. Prompt Injection attacks and other threats captured by OWASP Top 10 for LLMs

    2. Harmful / unethical prompts

    3. Customer PII or other data requiring redaction prior to sending on to the model and other downstream systems

    Some models have built-in security controls; however, this creates inconsistency and increased complexity. Instead, a model-agnostic security endpoint abstracted above all models is required to create consistency and allow for easier model switching.
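
    To illustrate the model-agnostic shape of such an endpoint, the sketch below wraps any model call with pre- and post-checks. The screening functions are stand-ins for whatever injection, toxicity and PII screening an organization uses; they are not references to a specific product.

    from typing import Callable

    class PromptBlocked(Exception):
        pass

    def screen_prompt(prompt: str) -> str:
        # Stand-in for prompt-injection / harmful-content / PII checks; a real
        # implementation would call a dedicated safety service and redact or block.
        if "ignore previous instructions" in prompt.lower():
            raise PromptBlocked("possible prompt injection")
        return prompt

    def screen_response(response: str) -> str:
        # Stand-in for checks on the generated response before returning it to the client.
        return response

    def secure_generate(model: Callable[[str], str], prompt: str) -> str:
        # Model-agnostic gateway: the same checks apply whichever model is behind it,
        # so models can be switched without changing safety behaviour.
        return screen_response(model(screen_prompt(prompt)))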

    5. Centralized Tool management
    Finally, the Tools available to the agent should be abstracted out from the agent to allow for reuse and centralized governance. This is the right separation of responsibilities, especially for data retrieval patterns where access to data needs to be controlled.

    RAG patterns have the potential to become numerous and complex, and in practice they are often not particularly robust or well maintained, with the potential to cause significant technical debt, so central control is important to keep data access patterns as clean and visible as possible.

    Outside of these specific considerations, a prerequisite already discussed is the need for the API Gateway itself to create consistency and abstraction above these Generative AI-specific services. When used to their fullest, API Gateways can act as much more than a simple API endpoint: they can be a coordination and packaging point for a series of interim API calls and logic, security features and usage monitoring.

    For example, a published API for sending a request to a model can be the starting point for a multi-step process (sketched in code after the list):

    • Retrieving and ‘hydrating’ the optimal prompt template for that use case and model

    • Running security checks through the model safety service

    • Sending the request to the model

    • Persisting prompt, response and other information for use in operational processes such as model and prompt evaluation pipelines.
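
    Pulled together, the gateway-side flow described in the list above might look roughly like the following sketch. It reuses the kinds of helpers sketched earlier (a prompt store, a safety check) plus a hypothetical log_interaction sink; none of this represents a specific Apigee configuration.

    from typing import Callable

    def log_interaction(prompt: str, response: str, metadata: dict) -> None:
        # Persist prompt/response pairs for evaluation pipelines (placeholder sink).
        print({"prompt": prompt, "response": response, **metadata})

    def handle_model_request(model: Callable[[str], str],
                             get_prompt: Callable[[str], str],
                             screen: Callable[[str], str],
                             use_case: str,
                             user_input: str) -> str:
        # 1. Retrieve and 'hydrate' the optimal prompt template for this use case
        prompt = get_prompt(use_case).format(user_input=user_input)
        # 2. Run security checks through the model safety service
        prompt = screen(prompt)
        # 3. Send the request to the model
        response = model(prompt)
        # 4. Persist prompt, response and metadata for evaluation pipelines
        log_interaction(prompt, response, {"use_case": use_case})
        return response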

    Key components of a GenOps platform

    Making GenOps a reality with Google Cloud

    For each of the considerations above, Google Cloud provides unique and differentiated managed service offerings to support evaluating, deploying, securing and upgrading Generative AI applications and agents:

    • Model compliance and approval controls - Google Cloud’s Model Garden is the central model library, with over 150 Google first-party, partner, and open-source models, and thousands more available via the direct integration with Hugging Face.
    • Model security - The newly announced Model Armor, expected to be in preview in Q3, enables inspection, routing and protection of foundation model prompts and responses. It can help with mitigating risks such as prompt injections, jailbreaks, toxic content and sensitive data leakage.
    • Prompt version management - Upcoming prompt management capabilities were announced at Google Cloud Next ‘24 that include centralized version controlling, templating, branching and sharing of prompts. We also showcased AI prompt assistance capabilities to critique and automatically re-write prompts.
    • Model (and prompt) evaluation - Google Cloud’s model evaluation services provide automatic evaluations across a wide range of metrics for prompts and responses, enabling extensible evaluation patterns such as comparing the responses of two models for a given input, or the responses to two different prompts for the same model.
    • Centralized Tool management - A comprehensive suite of managed services is available to support Tool creation. A few to call out are the Document AI Layout Parser for intelligent document chunking, the multimodal embeddings API, and Vertex AI Vector Search; I specifically want to highlight Vertex AI Search: a fully managed, end-to-end, out-of-the-box RAG service that handles all the complexities from parsing and chunking documents to creating and storing embeddings.

    As for the API Gateway, Google Cloud’s Apigee allows for publishing and exposing models and Tools as API proxies, which can encompass multiple downstream API calls, as well as conditional logic, retries, and tooling for security, usage monitoring and cross-charging.

    GenOps with Google Cloud

    Regardless of size, any organization that wants to be successful with generative AI will need to ensure its generative AI applications’ unique characteristics and requirements are well managed, and hence an operational platform engineered to cater for these characteristics and requirements is clearly required. I hope the points discussed in this blog make for helpful consideration as we all navigate this exciting and highly impactful new era of technology.

    If you are interested in learning more, reach out to your Google Cloud account team if you have one, or feel free to contact me directly.

  54. Software Engineer, Google

    Mon, 26 Aug 2024 16:00:00 -0000

    The Terraform Google Provider v6.0.0 is now GA. Since the last major Terraform provider release in September 2023, the combined Hashicorp/Google provider team has been listening closely to the community's feedback. Discussed below are the primary enhancements and bug fixes that this major release focuses on. Support for earlier versions of HashiCorp Terraform will not change as a result of the major version release v6.0.0.

    Terraform Google Provider Highlights 

    The key notable changes are as follows: 

    • Opt-out default label “goog-terraform-provisioned”

    • Deletion protection fields added to multiple resources

    • Allowed reducing the suffix length in “name_prefix” for multiple resources

    Opt-out default label “goog-terraform-provisioned”

    As a follow-up to the addition of provider-level default labels in 5.16.0, the 6.0.0 major release includes an opt-out default label “goog-terraform-provisioned”. This provider-level label “goog-terraform-provisioned” will be added to applicable resources to identify resources that were created by Terraform. This default label will only apply for newly created resources with a labels field. This will enable users to have a view of resources managed by Terraform when viewing/editing these resources in other tools like Cloud Console, Cloud Billing etc.

    The label “goog-terraform-provisioned” can be used for the following:

    • To filter on the Billing Reports page

    • To view the Cost breakdown

    Please note that an opt-in version of the label was already released in 5.16.0, and 6.0.0 changes the label to opt-out. To opt out of this default label, users can set the add_terraform_attribution_label provider configuration field to false. This can be set explicitly using any release from 5.16.0 onwards, and the value in configuration will apply after the 6.0.0 upgrade.

    provider "google" {
      // opt out of “goog-terraform-provisioned” default label
      add_terraform_attribution_label = false
    }

    Deletion protection fields added to multiple resources

    In order to prevent the accidental deletion of important resources, many resources now have a form of deletion protection enabled by default. These resources include google_domain, google_cloud_run_v2_job, google_cloud_run_v2_service, google_folder and google_project. Most of these are enabled by the deletion_protection field. google_project specifically has a deletion_policy field which is set to PREVENT by default.

    Allowed reducing the suffix length in “name_prefix”

    Another notable issue resolved in this major release is “Allow reducing the suffix length appended to instance templates name_prefix” (#15374), which changes the default behavior of name_prefix in multiple resources. The maximum length of the user-defined name_prefix has increased from 37 characters to 54. The provider will use a shorter appended suffix when a name_prefix longer than 37 characters is used, which should allow for more flexible resource names. For example, google_instance_template.name_prefix.

    With features like opt-out default labels and deletion protection, this version enables users to have a view of resources managed by Terraform in other tools and also prevents accidental deletion of important resources. The Terraform Google Provider 6.0.0 launch aims to improve the usability and safety of Terraform for managing Google Cloud resources. When upgrading to version 6.0 of the Terraform Google Provider, please consult the upgrade guide on the Terraform Registry, which contains a full list of the changes and upgrade considerations. Please check out the Release notes for Terraform Google Provider 6.0.0 for more details on this major version release. Learn more about Terraform on Google Cloud in the Terraform on Google Cloud documentation.

  55. CCoE Team Tech Lead, Hakuhodo Technologies Inc.

    Mon, 12 Aug 2024 16:00:00 -0000

    Hakuhodo Technologies, a specialized technology company of the Hakuhodo DY Group — one of Japan’s leading advertising and media holding companies — is dedicated to enhancing our software development process to deliver new value and experiences to society and consumers through the integration of marketing and technology. 

    Our IT Infrastructure Team at Hakuhodo Technologies operates cross-functionally, ensuring the stable operation of the public cloud that supports the diverse services within the Hakuhodo DY Group. We also provide expertise and operational support for public cloud initiatives.

    Our value is to excel in the cloud and infrastructure domain, exhibiting a strong sense of ownership, and embracing the challenge of creating new value.

    Background and challenges

    The infrastructure team is tasked with developing and operating the application infrastructure tailored to each internal organization and service, in addition to managing shared infrastructure resources.

    Following the principles of platform engineering and site reliability engineering (SRE), each team within the organization has adopted elements of SRE, including the implementation of post-mortems and the development of observability mechanisms. However, we encountered two primary challenges:

    • As the infrastructure expanded, the number of people on the team grew rapidly, bringing in new members from diverse backgrounds. This made it necessary to clarify and standardize tasks, and provide a collective understanding of our current situation and alignment on our goals.

    • We mainly communicate with the app team through a ticket-based system. In addition to expanding our workforce, we have also introduced remote working. As a result, team members may not be as well-acquainted as before. This lack of familiarity could potentially cause small misunderstandings that can escalate quickly.

    As our systems and organization expand, we believe that strengthening common understanding and cooperative relationships within the infrastructure team and the application team is essential for sustainable business growth. This has become a core element of our strategy.

    We believe that fostering an SRE mindset among both infrastructure and application team members and creating a culture based on that common understanding is essential to solving the issues above. To achieve this, we decided to implement the "SRE Core" program by Google Cloud Consulting, which serves as the first step in adopting SRE practices.

    Change

    First, through the "SRE Core" program, we revitalized communication between the application and infrastructure teams, which had previously had limited interaction. For example, some aspects of the program required information that was challenging for infrastructure members to gather and understand on their own, making cooperation with the application team essential.

    Our critical user journey (CUJ), one of the SRE metrics, was established based on the business requirements of the app and the behavior of actual users. This information is typically managed by the app team, which frequently communicates with the business side. This time, we collaborated with the application team to create a CUJ, set service level indicators (SLIs) and service level objectives (SLOs) which included error budgets, performed risk analysis, and designed the necessary elements for SRE.

    This collaborative work and shared understanding served as a starting point, and we continued to build a closer working relationship even after the program ended, with infrastructure members also participating in sprint meetings that had previously been held only for the app team.

    Additionally, as an infrastructure team, we systematically learned when and why SRE activities are necessary, allowing us to reflect on and strengthen our SRE efforts that had been partially implemented.

    For example, I recently understood that the purpose of postmortems is not only to prevent the recurrence of incidents but also to gain insights from the differences in perspectives between team members. Learning the purpose of postmortems changed our team’s mindset. We now practice immediate improvement activities, such as formalizing the postmortem process, clarifying the creation of tickets for action items, and sharing postmortem minutes with the app team, which were previously kept internal.

    We also reaffirmed the importance of observability to consistently review and improve our current system. Regular meetings between the infrastructure and application teams allow us to jointly check metrics, which in turn helps maintain application performance and prevent potential issues.

    By elevating our previous partial SRE activities and integrating individual initiatives, the infrastructure team created an organizational activity cycle that has earned more trust. This enhanced cycle is now getting integrated into our original operational workflows.

    Future plans

    With the experience gained through the SRE Core program, the infrastructure team looks forward to expanding collaboration with application and business teams and increasing proactive activities. Currently, we are starting with collaborations on select applications, but we aim to use these success stories to broaden similar initiatives across the organization.

    It is important to remember that each app has different team members, business partners, environments, and cultures, so SRE activities must be tailored to each unique situation. We aim to harmonize and apply the content learned in this program with the understanding that SRE activities are not the goal, but are elements that support the goals of the apps and the business.

    Additionally, our company has a Cloud Center of Excellence (CCoE) team dedicated to cross-organizational activities. The CCoE manages a portal site for company-wide information dissemination and a community platform for developers to connect. We plan to share the insights we've gained through these channels with other respective teams within our group companies. As the CCoE's internal activities mature, we also intend to share our knowledge and experiences externally.

    Through these initiatives, we hope to continue our activities so that internal members — beyond the CCoE and infrastructure organizations — take psychological safety into consideration during discussions and actions.

    Supplement: Regarding psychological safety

    At our company, we have a diverse workforce with varying years of experience and perspectives. We believe that ensuring psychological safety is essential for achieving high performance.

    When psychological safety is lacking, for instance, if the person delivering bad news is blamed, reports tend to become superficial and do not lead to substantive discussions.

    This issue can also arise from psychological barriers: for example, tasks known only to experienced employees may be omitted, and problems follow because others are afraid to ask for clarification.

    In a situation where psychological safety is ensured, we focus on systems rather than individuals, viewing problems as opportunities. For example, if errors occur due to manual work, the manual process itself is seen as the issue. Similarly, if a system failure with no prior similar case arises, it is considered an opportunity to gain new knowledge.

    By adopting this mindset, fear is removed from the equation, allowing for unbiased discussions and work.

    This allows every employee to perform at their best, regardless of their years of experience. Of course, this is not something that can be achieved through a single person. It will require a whole team or organization to recognize this to make it a reality.

  56. EMEA Solutions Lead, Application Modernization

    Mon, 05 Aug 2024 16:00:00 -0000

    Continuous Delivery (CD) is a set of practices and principles that enables teams to deliver software quickly and reliably by automating the entire software release process using a pipeline. In this article, we explain how to create a Continuous Delivery pipeline to automate software delivery from code commit to production release on Cloud Run using Gitlab CI/CD and Cloud Deploy, leveraging the recently released Gitlab Google Cloud integration.

    Elements of the solution

    Gitlab CI/CD

    GitLab CI/CD is an integrated continuous integration and delivery platform within GitLab. It automates the build, test, and deployment of your code changes, streamlining your development workflow. For more information check the Gitlab CI/CD documentation.

    Cloud Deploy

    Cloud Deploy is a Google-managed service that you can use to automate how your application is deployed across different stages to a series of runtime environments. With Cloud Deploy, you can define delivery pipelines to deploy container images to GKE and Cloud Run targets in a predetermined sequence. Cloud Deploy supports advanced deployment strategies such as progressive releases, approvals, deployment verification, and parallel deployments.

    Google Cloud Gitlab integration

    Gitlab and Google Cloud recently released integrations to make it easier and more secure to deploy code from Gitlab to Google Cloud. The areas of integration described in this article are:

    • Authentication: The GitLab and Google Cloud integration leverages workload identity federation, enabling secure authorization and authentication for GitLab workloads, such as CI/CD jobs, with Google Cloud. This eliminates the need for managing service accounts or service account keys, streamlining the process and reducing security risks. All the other integration areas described below leverage this authentication mechanism.

    • Artifact Registry: The integration lets you upload GitLab artifacts to Artifact Registry and access them from Gitlab UI.

    • Cloud Deploy: This Gitlab component facilitates the creation of Cloud Deploy releases from Gitlab CI/CD pipelines.

    • Gcloud: This component facilitates running gcloud commands in Gitlab CI/CD pipelines. 

    • Gitlab runners on Google Cloud: The integration lets you configure runner settings from Gitlab UI and have them deployed on your Google Cloud project with Terraform.

    You can access the updated list of Google Cloud Gitlab components here.

    What you’ll need

    To follow the steps in this article you need:

    1. A Gitlab account (Free, Premium or Ultimate)

    2. A Google Cloud project with project owner access

    3. A fork, in your account, of the following Gitlab repository containing the example code: https://gitlab.com/galloro/cd-on-gcp-gl cloned locally to your workstation.

    Pipeline flow

    You can see the pipeline in the .gitlab-ci.yml file in the root of the repo or using the Gitlab Pipeline editor.

    Following the instructions in this article, you will create and execute an end-to-end software delivery pipeline in which:

    1. A developer creates a feature branch from an application repository

    2. The developer makes a change to the code and then opens a merge request to merge the updated code to the main branch

    3. The Gitlab pipeline will run the following jobs, all configured to run when a merge request is opened, through the rule - if: $CI_PIPELINE_SOURCE == 'merge_request_event':

    a. The image-build job, in the build stage, builds a container image with the updated code.

    # Image build for automatic pipeline running on merge request
    image-build:
      image: docker:24.0.5
      stage: build
      services:
        - docker:24.0.5-dind
      rules:
        - if: $CI_PIPELINE_SOURCE == "web"
          when: never
        - if: $CI_PIPELINE_SOURCE == 'merge_request_event'
      before_script:
        - docker login -u $CI_REGISTRY_USER -p $CI_REGISTRY_PASSWORD $CI_REGISTRY
      script:
        - docker build -t $GITLAB_IMAGE cdongcp-app/
        - docker push $GITLAB_IMAGE
        - docker logout

    b. The upload-artifact-registry component, in the push stage, pushes the image to Artifact Registry, leveraging the Google Cloud IAM integration configured previously, as do all the following components. The configuration of this job, like those of the other components described below, is split between the component and the explicit job definition in order to set the rules for job execution.

    # Image push to Artifact Registry for automatic pipeline running on merge request
      - component: gitlab.com/google-gitlab-components/artifact-registry/upload-artifact-registry@0.1.1
        inputs:
          stage: push
          source: $GITLAB_IMAGE
          target: $GOOGLE_AR_REPO/cdongcp-app:$CI_COMMIT_SHORT_SHA

    c. The create-cloud-deploy-release component, in the deploy-to-qa stage, creates a release on Cloud Deploy and a rollout to the QA stage, mapping to the cdongcp-app-qa Cloud Run service, where the QA team will run user acceptance tests.

    # Cloud Deploy release creation for automatic pipeline running on merge request
      - component: gitlab.com/google-gitlab-components/cloud-deploy/create-cloud-deploy-release@0.1.1
        inputs:
          stage: deploy-to-qa
          project_id: $GOOGLE_PROJECT
          name: cdongcp-$CI_COMMIT_SHORT_SHA
          delivery_pipeline: cd-on-gcp-pipeline
          region: $GOOGLE_REGION
          images: cdongcp-app=$GOOGLE_AR_REPO/cdongcp-app:$CI_COMMIT_SHORT_SHA

    4. After the tests are completed, the QA team merges the MR, and this runs the run-gcloud component, in the promote-to-prod stage, which promotes the release to the production stage, mapping to the cdongcp-app-prod Cloud Run service. In this case the job is configured to run on a push to the main branch through the rule - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH:

    # Cloud Deploy release promotion for automatic pipeline running on merge request
      - component: gitlab.com/google-gitlab-components/cloud-sdk/run-gcloud@main
        inputs:
          stage: promote-to-prod
          project_id: $GOOGLE_PROJECT
          commands: |
            MOST_RECENT_RELEASE=$(gcloud deploy releases list --delivery-pipeline cd-on-gcp-pipeline --region $GOOGLE_REGION --format="value(name)" --limit 1)
            gcloud deploy releases promote --delivery-pipeline cd-on-gcp-pipeline --release $MOST_RECENT_RELEASE --region $GOOGLE_REGION

    5. The Cloud Deploy prod target requires approval, so an approval request is triggered. The Product Manager for the application checks the rollout and approves it, and the app is released to production with a canary release; this creates a new revision of the cdongcp-app-prod Cloud Run service and directs 50% of the traffic to it. You can see the Cloud Deploy delivery pipeline and targets configuration below (file cr-delivery-pipeline.yaml in the repo), including the canary strategy and the approval required for prod deployment. The canary strategy is configured at 50% to make the traffic split more visible; in a real production environment this would be a lower number.

    apiVersion: deploy.cloud.google.com/v1
    kind: DeliveryPipeline
    metadata:
      name: cd-on-gcp-pipeline
    description: CD on Cloud Run w Gitlab CI and Cloud Deploy - End to end pipeline
    serialPipeline:
      stages:
        - targetId: qa
          profiles:
            - qa
        - targetId: prod
          profiles:
            - prod
          strategy:
            canary:
              runtimeConfig:
                cloudRun:
                  automaticTrafficControl: true
              canaryDeployment:
                percentages: [50]
                verify: false
    ---
    apiVersion: deploy.cloud.google.com/v1
    kind: Target
    metadata:
      name: prod
    description: Prod Cloud Run Service
    requireApproval: true
    run:
      location: projects/yourproject/locations/yourregion
    ---
    apiVersion: deploy.cloud.google.com/v1
    kind: Target
    metadata:
      name: qa
    description: QA Cloud Run Service
    run:
      location: projects/yourproject/locations/yourregion

    6. After checking the canary release, the App Release team advances the rollout to 100%.

    You can play all the roles described above (developer, member of the QA team, member of the App release team, Product Manager) using a single Gitlab account and project/repository. In a real world scenario multiple accounts would be used.

    The picture below describes the pipeline flow:

    In addition to the jobs and stages described above, the .gitlab-ci.yml pipeline contains other instances of similar jobs, in the first-release stage, that are configured, through rules, to run only if the pipeline is executed manually using the “Run pipeline” button in Gitlab web UI. You will do that to manually create the first release before running the above described flow.

    Prepare your environment

    To prepare your environment to run the pipeline, complete the following tasks: 

    1. Create an Artifact Registry standard repository for Docker images in your Google Cloud project and desired region.

    2. Run setup.sh from the setup folder in your local repo clone and follow the prompt to insert your Google Cloud project, Cloud Run and Cloud Deploy region and Artifact Registry repository. Then commit changes to the .gitlab-ci.yml and setup/cr-delivery-pipeline.yaml files and push them to your fork. 

    3. Still in the setup folder, create a Cloud Deploy delivery pipeline using the manifest provided (replace yourregion and yourproject with your values):

    gcloud deploy apply --file=cr-delivery-pipeline.yaml --region=yourregion --project=yourproject

    This creates a pipeline that has two stages, qa and prod, each using a profile with the same name, and two targets mapping two Cloud Run services to the pipeline stages.

    4. Follow the Gitlab documentation to set up Google Cloud workload identity federation and the workload identity pool that will be used to authenticate Gitlab to Google Cloud services.

    5. Follow the Gitlab documentation to set up Google Artifact Registry integration. After that you will be able to access the Google AR Repository from Gitlab UI through the Google Artifact Registry entry in the sidebar under Deploy.

    6. (Optional) Follow the Gitlab documentation to set up runners in Google Cloud. If you’re using Gitlab.com, you can also keep the default configuration that uses Gitlab-hosted runners, but with Google Cloud runners you can customize parameters such as the machine type and autoscaling.

    7. Set up permissions for Gitlab Google Cloud components as described in the related README for each component. To run the jobs in this pipeline, the Gitlab workload identity pool must have the following minimum roles in Google Cloud IAM:

      • roles/artifactregistry.reader

      • roles/artifactregistry.writer

      • roles/clouddeploy.approver

      • roles/clouddeploy.releaser

      • roles/iam.serviceAccountUser

      • roles/run.admin

      • roles/storage.admin

    8. Manually run the pipeline from the Gitlab web UI with Build -> Pipelines -> Run pipeline to create the first release and the two Cloud Run services for QA and production. This runs all the jobs that are part of the first-release stage; wait for the pipeline execution to complete before moving to the next steps.

    9. From the Google Cloud console, get the URL of the cdongcp-app-qa and cdongcp-app-prod Cloud Run services and open them with a web browser to check that the application has been deployed.

    Run your pipeline

    Update your code as a developer

    1. Be sure to move to the root of the repository clone, then create a new branch named “new-feature” and check it out:

    git checkout -b new-feature

    2. Update your code: open the app.go file in the cdongcp-app folder and change the message on line 25 to “cd-on-gcp app UPDATED in target: …”

    3. Commit and push your changes to the “new-feature” branch.

    git add cdongcp-app/app.go
    git commit -m "new feature"
    git push origin new-feature

    4. Now open a merge request to merge your code: copy the URL from the terminal output, paste it into your browser, and on the Gitlab page click the “Create merge request” button. You will see a pipeline starting.

    Run automatic build of your artifact

    1. In Gitlab, go to Build > Pipelines and click on the last pipeline execution ID; you should see three stages, each including one job.

    2. Wait for the pipeline to complete; you can click on each job to see the execution log. The last job should create the cdongcp-$COMMIT_SHA release (where $COMMIT_SHA is the short SHA of your commit) and roll it out to the QA stage.

    3. Open or refresh the cdongcp-app-qa URL with your browser; you should see the updated application deployed in the QA stage.

    4. In a real-world scenario, the QA team performs some usability tests in this environment. Let’s assume that these have been completed successfully and that you, as a member of the QA team this time, want to merge the changed code to the main branch: go to the merge request Gitlab page and click “Merge”.

    Approve and rollout your release to production

    1. A new pipeline will run containing only one job from the run-gcloud component. You can see the execution in the Gitlab pipeline list.

    2. When the pipeline is completed, your release will be promoted to the prod stage, waiting for approval, as you can see on the Cloud Deploy page in the console.

    3. Now, acting as the product manager for the application that has to approve the deployment in production, click on Review; you will see a rollout that needs approval. Click on REVIEW again.

    4. In the “Approve rollout to prod” page, click on the “APPROVE” button to finally approve the promotion to the prod stage. The rollout to the canary phase of the prod stage will start, and after some time the rollout will stabilize in the canary phase.

    5. Let’s observe how traffic is managed in this phase: generate some requests to the cdongcp-app-prod service URL with the following command (replace cdongcp-app-prod-url with your service URL):

    while true; do curl cdongcp-app-prod-url; sleep 1; done

    6. After some time you should see responses both from your previous release and the new (canary) one.

    7. Now let’s pretend that the App Release team gets metrics and other observability data from the canary. When they are sure that the application is performing correctly, they want to deploy the application to all their users. As a member of the App Release team, go to the Cloud Deploy console, click “Advance to stable” and then “ADVANCE” on the confirmation pop-up; the rollout should progress to stable. When the progress stabilizes, you will see in the curl output that all the requests are served by the updated version of the application.

    Summary

    You saw an example Gitlab CI/CD pipeline that leverages the recently released Google Cloud - Gitlab integration to:

    • Configure Gitlab authentication to Google Cloud using workload identity federation

    • Integrate Gitlab with Artifact Registry

    • Use Gitlab CI/CD and Cloud Deploy to automatically build your software and deploy it to a QA Cloud Run service when a merge request is created

    • Automatically promote your software to a prod Cloud Run service when the merge request is merged to the main branch

    • Use approvals in Cloud Deploy

    • Leverage canary release in Cloud Deploy to progressively release your application to users

    Now you can reference this article and the documentation on Gitlab CI/CD, the Google Cloud - Gitlab integration, Cloud Deploy and Cloud Run to configure your end-to-end pipeline leveraging Gitlab and Google Cloud!

  57. Intelligent Continuous Security From the Platform Outward

    Wed, 30 Apr 2025 14:15:56 -0000

    In the end, ICS is not a tool — it’s a philosophy of secure software delivery. When it begins with the platform, everything else aligns: Speed, safety and scale.
  58. LaunchDarkly Acquires Highlight to Bring Observability to Application Release Management

    Wed, 30 Apr 2025 11:20:25 -0000

    LaunchDarkly is looking to bring observability to feature flag management by acquiring Highlight, a provider of an open-source application monitoring tool.
  59. Legit Security Extends AI Reach of ASPM Platform

    Tue, 29 Apr 2025 13:00:49 -0000

    Legit Security at the 2025 RSA Conference today extended the reach of its application security posture management (ASPM) platform that leverages artificial intelligence (AI) to identify vulnerabilities and other weaknesses to now include suggestions for remediating issues in code.
  60. Lineaje Leverages AI Agents to Secure Open Source Packages and Images

    Tue, 29 Apr 2025 10:58:01 -0000

    Lineaje has added artificial intelligence (AI) agents that leverage multiple types of code scanners to ensure the open-source software packages and artifacts being used by application developers are truly secure.
  61. Next-Generation Observability: Combining OpenTelemetry and AI for Proactive Incident Management

    Tue, 29 Apr 2025 04:25:02 -0000

    OpenTelemetry and AI integration change the nature of observability for organizations in their quest to manage distributed systems.
  62. Minimus Unfurls Service for Accessing Secure Software Artifacts

    Mon, 28 Apr 2025 14:26:28 -0000

    Minimus today at the 2025 RSA Conference launched a managed service through which it ensures application development teams are provided access to a secure set of minimal container images and virtual machines. Company CTO John Morello said the Minimus service eliminates the possibility that developers might inadvertently download software artifacts that might be infested with […]
  63. Data, Determinism, and AI in Mass-Scale Code Modernization

    Mon, 28 Apr 2025 09:42:52 -0000

    Discussing the code data problem and what’s needed to enable agentic experiences that can drive code analysis and refactoring at scale.
  64. From Testing Hell to Quality Heaven With Intelligent Continuous Testing

    Mon, 28 Apr 2025 09:18:18 -0000

    Intelligent Continuous Testing is not just the next step in automation — it’s the missing link between speed and quality in modern software delivery. If your team is stuck in manual testing purgatory, it’s time to reimagine testing as a smart, adaptive, and always-on partner in your journey to excellence.
  65. Five Great DevOps Job Opportunities

    Mon, 28 Apr 2025 06:48:36 -0000

    DevOps.com is now providing a weekly DevOps jobs report through which opportunities for DevOps professionals will be highlighted as part of an effort to better serve our audience. Our goal in these challenging economic times is to make it just that much easier for DevOps professionals to advance their careers. Of course, the pool of […]
  66. Break the Bottleneck of API Sprawl With AI-Powered Automation

    Fri, 25 Apr 2025 15:06:21 -0000

    The race to accelerate digital transformation across business units within an organization has led to a rapid surge in APIs, though unfortunately, without a central strategy in place.
  67. Music AI Sandbox, now with new features and broader access

    Thu, 24 Apr 2025 15:01:00 -0000

    Helping music professionals explore the potential of generative AI
  68. Introducing Gemini 2.5 Flash

    Thu, 17 Apr 2025 19:02:00 -0000

    Gemini 2.5 Flash is our first fully hybrid reasoning model, giving developers the ability to turn thinking on or off.
  69. Generate videos in Gemini and Whisk with Veo 2

    Tue, 15 Apr 2025 17:00:00 -0000

    Transform text-based prompts into high-resolution eight-second videos in Gemini Advanced and use Whisk Animate to turn images into eight-second animated clips.
  70. DolphinGemma: How Google AI is helping decode dolphin communication

    Mon, 14 Apr 2025 17:00:00 -0000

    DolphinGemma, a large language model developed by Google, is helping scientists study how dolphins communicate — and hopefully find out what they're saying, too.
  71. Taking a responsible path to AGI

    Wed, 02 Apr 2025 13:31:00 -0000

    We’re exploring the frontiers of AGI, prioritizing technical safety, proactive risk assessment, and collaboration with the AI community.
  72. Evaluating potential cybersecurity threats of advanced AI

    Wed, 02 Apr 2025 13:30:00 -0000

    Our framework enables cybersecurity experts to identify which defenses are necessary—and how to prioritize them
  73. Gemini 2.5: Our most intelligent AI model

    Tue, 25 Mar 2025 17:00:36 -0000

    Gemini 2.5 is our most intelligent AI model, now with thinking built in.
  74. Gemini Robotics brings AI into the physical world

    Wed, 12 Mar 2025 15:00:00 -0000

    Introducing Gemini Robotics and Gemini Robotics-ER, AI models designed for robots to understand, act and react to the physical world.
  75. Experiment with Gemini 2.0 Flash native image generation

    Wed, 12 Mar 2025 14:58:00 -0000

    Native image output is available in Gemini 2.0 Flash for developers to experiment with in Google AI Studio and the Gemini API.
  76. Introducing Gemma 3

    Wed, 12 Mar 2025 08:00:00 -0000

    The most capable model you can run on a single GPU or TPU.
  77. Start building with Gemini 2.0 Flash and Flash-Lite

    Tue, 25 Feb 2025 18:02:12 -0000

    Gemini 2.0 Flash-Lite is now generally available in the Gemini API for production use in Google AI Studio and for enterprise customers on Vertex AI
  78. Gemini 2.0 is now available to everyone

    Wed, 05 Feb 2025 16:00:00 -0000

    We’re announcing new updates to Gemini 2.0 Flash, plus introducing Gemini 2.0 Flash-Lite and Gemini 2.0 Pro Experimental.
  79. Updating the Frontier Safety Framework

    Tue, 04 Feb 2025 16:41:00 -0000

    Our next iteration of the FSF sets out stronger security protocols on the path to AGI
  80. FACTS Grounding: A new benchmark for evaluating the factuality of large language models

    Tue, 17 Dec 2024 15:29:00 -0000

    Our comprehensive benchmark and online leaderboard offer a much-needed measure of how accurately LLMs ground their responses in provided source material and avoid hallucinations
  81. State-of-the-art video and image generation with Veo 2 and Imagen 3

    Mon, 16 Dec 2024 17:01:16 -0000

    We’re rolling out a new, state-of-the-art video model, Veo 2, and updates to Imagen 3. Plus, check out our new experiment, Whisk.
  82. Introducing Gemini 2.0: our new AI model for the agentic era

    Wed, 11 Dec 2024 15:30:40 -0000

    Today, we’re announcing Gemini 2.0, our most capable multimodal AI model yet.
  83. Google DeepMind at NeurIPS 2024

    Thu, 05 Dec 2024 17:45:00 -0000

    Advancing adaptive AI agents, empowering 3D scene creation, and innovating LLM training for a smarter, safer future
  84. GenCast predicts weather and the risks of extreme conditions with state-of-the-art accuracy

    Wed, 04 Dec 2024 15:59:00 -0000

    New AI model advances the prediction of weather uncertainties and risks, delivering faster, more accurate forecasts up to 15 days ahead
  85. Genie 2: A large-scale foundation world model

    Wed, 04 Dec 2024 14:23:00 -0000

    Generating unlimited diverse training environments for future general agents
  86. AlphaQubit tackles one of quantum computing’s biggest challenges

    Wed, 20 Nov 2024 18:00:00 -0000

    Our new AI system accurately identifies errors inside quantum computers, helping to make this new technology more reliable.
  87. The AI for Science Forum: A new era of discovery

    Mon, 18 Nov 2024 19:57:43 -0000

    The AI Science Forum highlights AI's present and potential role in revolutionizing scientific discovery and solving global challenges, emphasizing collaboration between the scientific community, policymakers, and industry leaders.
  88. Pushing the frontiers of audio generation

    Wed, 30 Oct 2024 15:00:00 -0000

    Our pioneering speech generation technologies are helping people around the world interact with more natural, conversational and intuitive digital assistants and AI tools.
  89. New generative AI tools open the doors of music creation

    Wed, 23 Oct 2024 16:53:00 -0000

    Our latest AI music technologies are now available in MusicFX DJ, Music AI Sandbox and YouTube Shorts
  90. Demis Hassabis & John Jumper awarded Nobel Prize in Chemistry

    Wed, 09 Oct 2024 11:45:00 -0000

    The award recognizes their work developing AlphaFold, a groundbreaking AI system that predicts the 3D structure of proteins from their amino acid sequences.
  91. How AlphaChip transformed computer chip design

    Thu, 26 Sep 2024 14:08:00 -0000

    Our AI method has accelerated and optimized chip design, and its superhuman chip layouts are used in hardware around the world.
  92. Updated production-ready Gemini models, reduced 1.5 Pro pricing, increased rate limits, and more

    Tue, 24 Sep 2024 16:03:03 -0000

    We’re releasing two updated production-ready Gemini models
  93. Empowering YouTube creators with generative AI

    Wed, 18 Sep 2024 14:30:06 -0000

    New video generation technology in YouTube Shorts will help millions of people realize their creative vision
  94. Our latest advances in robot dexterity

    Thu, 12 Sep 2024 14:00:00 -0000

    Two new AI systems, ALOHA Unleashed and DemoStart, help robots learn to perform complex tasks that require dexterous movement
  95. AlphaProteo generates novel proteins for biology and health research

    Thu, 05 Sep 2024 15:00:00 -0000

    New AI system designs proteins that successfully bind to target molecules, with potential for advancing drug design, disease understanding and more.
  96. FermiNet: Quantum physics and chemistry from first principles

    Thu, 22 Aug 2024 19:00:00 -0000

    Using deep learning to solve fundamental problems in computational quantum chemistry and explore how matter interacts with light
  97. Mapping the misuse of generative AI

    Fri, 02 Aug 2024 10:50:58 -0000

    New research analyzes the misuse of multimodal generative AI today, in order to help build safer and more responsible technologies.
  98. Gemma Scope: helping the safety community shed light on the inner workings of language models

    Wed, 31 Jul 2024 15:59:19 -0000

    Announcing a comprehensive, open suite of sparse autoencoders for language model interpretability.
  99. AI achieves silver-medal standard solving International Mathematical Olympiad problems

    Thu, 25 Jul 2024 15:29:00 -0000

    Breakthrough models AlphaProof and AlphaGeometry 2 solve advanced reasoning problems in mathematics
  100. Google DeepMind at ICML 2024

    Fri, 19 Jul 2024 10:00:00 -0000

    Exploring AGI, the challenges of scaling and the future of multimodal generative AI
  101. Generating audio for video

    Mon, 17 Jun 2024 16:00:00 -0000

    Video-to-audio research uses video pixels and text prompts to generate rich soundtracks
  102. Looking ahead to the AI Seoul Summit

    Mon, 20 May 2024 07:00:00 -0000

    How summits in Seoul, France and beyond can galvanize international cooperation on frontier AI safety
  103. Introducing the Frontier Safety Framework

    Fri, 17 May 2024 14:00:00 -0000

    Our approach to analyzing and mitigating future risks posed by advanced AI models
  104. Gemini breaks new ground: a faster model, longer context and AI agents

    Tue, 14 May 2024 17:58:00 -0000

    We’re introducing a series of updates across the Gemini family of models, including the new 1.5 Flash, our lightweight model for speed and efficiency, and Project Astra, our vision for the future of AI assistants.
  105. New generative media models and tools, built with and for creators

    Tue, 14 May 2024 17:57:00 -0000

    We’re introducing Veo, our most capable model for generating high-definition video, and Imagen 3, our highest quality text-to-image model. We’re also sharing new demo recordings created with our Music AI Sandbox.
  106. Watermarking AI-generated text and video with SynthID

    Tue, 14 May 2024 17:56:00 -0000

    Announcing our novel watermarking method for AI-generated text and video, and how we’re bringing SynthID to key Google products
  107. AlphaFold 3 predicts the structure and interactions of all of life’s molecules

    Wed, 08 May 2024 16:00:00 -0000

    Introducing a new AI model developed by Google DeepMind and Isomorphic Labs.
  108. Google DeepMind at ICLR 2024

    Fri, 03 May 2024 13:39:00 -0000

    Developing next-gen AI agents, exploring new modalities, and pioneering foundational learning
  109. The ethics of advanced AI assistants

    Fri, 19 Apr 2024 10:00:00 -0000

    Exploring the promise and risks of a future with more capable AI
  110. TacticAI: an AI assistant for football tactics

    Tue, 19 Mar 2024 16:03:00 -0000

    As part of our multi-year collaboration with Liverpool FC, we develop a full AI system that can advise coaches on corner kicks
  111. A generalist AI agent for 3D virtual environments

    Wed, 13 Mar 2024 14:00:00 -0000

    Introducing SIMA, a Scalable Instructable Multiworld Agent
  112. Gemma: Introducing new state-of-the-art open models

    Wed, 21 Feb 2024 13:06:00 -0000

    Gemma is built for responsible AI development from the same research and technology used to create Gemini models.
  113. Our next-generation model: Gemini 1.5

    Thu, 15 Feb 2024 15:00:00 -0000

    The model delivers dramatically enhanced performance, with a breakthrough in long-context understanding across modalities.
  114. The next chapter of our Gemini era

    Thu, 08 Feb 2024 13:00:00 -0000

    We're bringing Gemini to more Google products
  115. AlphaGeometry: An Olympiad-level AI system for geometry

    Wed, 17 Jan 2024 16:00:00 -0000

    Advancing AI reasoning in mathematics
  116. Shaping the future of advanced robotics

    Thu, 04 Jan 2024 11:39:00 -0000

    Introducing AutoRT, SARA-RT, and RT-Trajectory
  117. Images altered to trick machine vision can influence humans too

    Tue, 02 Jan 2024 16:00:00 -0000

    In a series of experiments published in Nature Communications, we found evidence that human judgments are indeed systematically influenced by adversarial perturbations.
  118. 2023: A Year of Groundbreaking Advances in AI and Computing

    Fri, 22 Dec 2023 13:30:00 -0000

    This has been a year of incredible progress in the field of Artificial Intelligence (AI) research and its practical applications.
  119. FunSearch: Making new discoveries in mathematical sciences using Large Language Models

    Thu, 14 Dec 2023 16:00:00 -0000

    In a paper published in Nature, we introduce FunSearch, a method for searching for “functions” written in computer code, and find new solutions in mathematics and computer science. FunSearch works by pairing a pre-trained LLM, whose goal is to propose creative solutions in the form of computer code, with an automated “evaluator” that guards against hallucinations and incorrect ideas.
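    The pairing described above can be made concrete with a toy loop: an LLM proposes candidate functions as source code, an automated evaluator scores each candidate, and only candidates that score well survive. The sketch below is a minimal illustration of that propose-and-evaluate pattern, not DeepMind's FunSearch code; `llm_propose` is a hypothetical placeholder for a real LLM call, and the scoring task (using the candidate as a sort key) is invented purely for demonstration.

```python
import random

def evaluate(program_src: str) -> float:
    """Automated evaluator: executes a candidate program and scores it on a
    toy task. Candidates that crash or are malformed score -inf, which is how
    the evaluator guards against hallucinated or incorrect code."""
    namespace = {}
    try:
        exec(program_src, namespace)
        priority = namespace["priority"]
        data = random.sample(range(100), 20)
        ranked = sorted(data, key=priority)
        # Toy score: fraction of adjacent pairs left in non-decreasing order.
        return sum(a <= b for a, b in zip(ranked, ranked[1:])) / (len(ranked) - 1)
    except Exception:
        return float("-inf")

def llm_propose(pool: list[str]) -> str:
    """Hypothetical stand-in for the pre-trained LLM, which in FunSearch would
    be prompted with the best programs found so far and asked for a creative
    variation. Here it simply returns a fixed, better candidate."""
    return "def priority(x):\n    return x"

def funsearch_loop(iterations: int = 10) -> str:
    pool = ["def priority(x):\n    return 0"]  # seed program
    for _ in range(iterations):
        candidate = llm_propose(pool)
        # Keep only candidates that match or beat the best score seen so far.
        if evaluate(candidate) >= max(evaluate(p) for p in pool):
            pool.append(candidate)
    return max(pool, key=evaluate)

print(funsearch_loop())
```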
  120. Google DeepMind at NeurIPS 2023

    Fri, 08 Dec 2023 15:01:00 -0000

    The Neural Information Processing Systems conference (NeurIPS) is the largest artificial intelligence (AI) conference in the world. NeurIPS 2023 will take place December 10-16 in New Orleans, USA. Teams from across Google DeepMind are presenting more than 150 papers at the main conference and workshops.
  121. Introducing Gemini: our largest and most capable AI model

    Wed, 06 Dec 2023 15:13:00 -0000

    Making AI more helpful for everyone
  122. Millions of new materials discovered with deep learning

    Wed, 29 Nov 2023 16:04:00 -0000

    We share the discovery of 2.2 million new crystals – equivalent to nearly 800 years’ worth of knowledge. We introduce Graph Networks for Materials Exploration (GNoME), our new deep learning tool that dramatically increases the speed and efficiency of discovery by predicting the stability of new materials.
  123. Transforming the future of music creation

    Thu, 16 Nov 2023 07:20:00 -0000

    Announcing our most advanced music generation model and two new AI experiments, designed to open a new playground for creativity
  124. Empowering the next generation for an AI-enabled world

    Wed, 15 Nov 2023 10:00:00 -0000

    Experience AI's course and resources are expanding on a global scale
  125. GraphCast: AI model for faster and more accurate global weather forecasting

    Tue, 14 Nov 2023 15:00:00 -0000

    We introduce GraphCast, a state-of-the-art AI model able to make medium-range weather forecasts with unprecedented accuracy
  126. A glimpse of the next generation of AlphaFold

    Tue, 31 Oct 2023 13:00:00 -0000

    Progress update: Our latest AlphaFold model shows significantly improved accuracy and expands coverage beyond proteins to other biological molecules, including ligands.
  127. Evaluating social and ethical risks from generative AI

    Thu, 19 Oct 2023 15:00:00 -0000

    Introducing a context-based framework for comprehensively evaluating the social and ethical risks of AI systems
  128. Scaling up learning across many different robot types

    Tue, 03 Oct 2023 15:00:00 -0000

    Robots are great specialists, but poor generalists. Typically, you have to train a model for each task, robot, and environment. Changing a single variable often requires starting from scratch. But what if we could combine the knowledge across robotics and create a way to train a general-purpose robot?
  129. A catalogue of genetic mutations to help pinpoint the cause of diseases

    Tue, 19 Sep 2023 13:37:00 -0000

    New AI tool classifies the effects of 71 million ‘missense’ mutations.
  130. Identifying AI-generated images with SynthID

    Tue, 29 Aug 2023 00:00:00 -0000

    New tool helps watermark and identify synthetic images created by Imagen
  131. RT-2: New model translates vision and language into action

    Fri, 28 Jul 2023 00:00:00 -0000

    Robotic Transformer 2 (RT-2) is a novel vision-language-action (VLA) model that learns from both web and robotics data, and translates this knowledge into generalised instructions for robotic control.
  132. Using AI to fight climate change

    Fri, 21 Jul 2023 00:00:00 -0000

    AI is a powerful technology that will transform our future, so how can we best apply it to help combat climate change and find sustainable solutions?
  133. Google DeepMind’s latest research at ICML 2023

    Thu, 20 Jul 2023 00:00:00 -0000

    Exploring AI safety, adaptability, and efficiency for the real world
  134. Developing reliable AI tools for healthcare

    Mon, 17 Jul 2023 00:00:00 -0000

    We’ve published our joint paper with Google Research in Nature Medicine, which proposes CoDoC (Complementarity-driven Deferral-to-Clinical Workflow), an AI system that learns when to rely on predictive AI tools or defer to a clinician for the most accurate interpretation of medical images.
  135. Exploring institutions for global AI governance

    Tue, 11 Jul 2023 00:00:00 -0000

    New white paper investigates models and functions of international institutions that could help manage opportunities and mitigate risks of advanced AI.
  136. RoboCat: A self-improving robotic agent

    Tue, 20 Jun 2023 00:00:00 -0000

    Robots are quickly becoming part of our everyday lives, but they’re often only programmed to perform specific tasks well. While harnessing recent advances in AI could lead to robots that help in many more ways, progress in building general-purpose robots is slower, in part because of the time needed to collect real-world training data. Our latest paper introduces a self-improving AI agent for robotics, RoboCat, that learns to perform a variety of tasks across different arms, and then self-generates new training data to improve its technique.
  137. YouTube: Enhancing the user experience

    Fri, 16 Jun 2023 14:55:00 -0000

    It’s all about using our technology and research to help enrich people’s lives. Like YouTube — and its mission to give everyone a voice and show them the world.
  138. Google Cloud: Driving digital transformation

    Wed, 14 Jun 2023 14:51:00 -0000

    Google Cloud empowers organizations to digitally transform themselves into smarter businesses. It offers cloud computing, data analytics, and the latest artificial intelligence (AI) and machine learning tools.
  139. MuZero, AlphaZero, and AlphaDev: Optimizing computer systems

    Mon, 12 Jun 2023 14:41:00 -0000

    How MuZero, AlphaZero, and AlphaDev are optimizing the computing ecosystem that powers our world of devices.
  140. AlphaDev discovers faster sorting algorithms

    Wed, 07 Jun 2023 00:00:00 -0000

    New algorithms will transform the foundations of computing
  141. An early warning system for novel AI risks

    Thu, 25 May 2023 00:00:00 -0000

    New research proposes a framework for evaluating general-purpose models against novel threats
  142. DeepMind’s latest research at ICLR 2023

    Thu, 27 Apr 2023 00:00:00 -0000

    Next week marks the start of the 11th International Conference on Learning Representations (ICLR), taking place 1-5 May in Kigali, Rwanda. This will be the first major artificial intelligence (AI) conference to be hosted in Africa and the first in-person event since the start of the pandemic. Researchers from around the world will gather to share their cutting-edge work in deep learning spanning the fields of AI, statistics and data science, and applications including machine vision, gaming and robotics. We’re proud to support the conference as a Diamond sponsor and DEI champion.
  143. How can we build human values into AI?

    Mon, 24 Apr 2023 00:00:00 -0000

    Drawing from philosophy to identify fair principles for ethical AI...
  144. Announcing Google DeepMind

    Thu, 20 Apr 2023 00:00:00 -0000

    DeepMind and the Brain team from Google Research will join forces to accelerate progress towards a world in which AI helps solve the biggest challenges facing humanity.
  145. Competitive programming with AlphaCode

    Thu, 08 Dec 2022 00:00:00 -0000

    Solving novel problems and setting a new milestone in competitive programming.
  146. AI for the board game Diplomacy

    Tue, 06 Dec 2022 00:00:00 -0000

    Successful communication and cooperation have been crucial for helping societies advance throughout history. The closed environments of board games can serve as a sandbox for modelling and investigating interaction and communication – and we can learn a lot from playing them. In our recent paper, published today in Nature Communications, we show how artificial agents can use communication to better cooperate in the board game Diplomacy, a vibrant domain in artificial intelligence (AI) research, known for its focus on alliance building.
  147. Mastering Stratego, the classic game of imperfect information

    Thu, 01 Dec 2022 00:00:00 -0000

    Game-playing artificial intelligence (AI) systems have advanced to a new frontier.
  148. DeepMind’s latest research at NeurIPS 2022

    Fri, 25 Nov 2022 00:00:00 -0000

    NeurIPS is the world’s largest conference in artificial intelligence (AI) and machine learning (ML), and we’re proud to support the event as Diamond sponsors, helping foster the exchange of research advances in the AI and ML community. Teams from across DeepMind are presenting 47 papers, including 35 external collaborations in virtual panels and poster sessions.
  149. Building interactive agents in video game worlds

    Wed, 23 Nov 2022 00:00:00 -0000

    Most artificial intelligence (AI) researchers now believe that writing computer code which can capture the nuances of situated interactions is impossible. Instead, modern machine learning (ML) researchers have focused on learning about these types of interactions from data. To explore these learning-based approaches and quickly build agents that can make sense of human instructions and safely perform actions in open-ended conditions, we created a research framework within a video game environment. Today, we’re publishing a paper and a collection of videos showing our early steps in building video game AIs that can understand fuzzy human concepts – and therefore can begin to interact with people on their own terms.
  150. Benchmarking the next generation of never-ending learners

    Tue, 22 Nov 2022 00:00:00 -0000

    Learning how to build upon knowledge by tapping 30 years of computer vision research
  151. Best practices for data enrichment

    Wed, 16 Nov 2022 00:00:00 -0000

    Building a responsible approach to data collection with the Partnership on AI...
  152. Stopping malaria in its tracks

    Thu, 13 Oct 2022 15:00:00 -0000

    Developing a vaccine that could save hundreds of thousands of lives
  153. Measuring perception in AI models

    Wed, 12 Oct 2022 00:00:00 -0000

    Perception – the process of experiencing the world through senses – is a significant part of intelligence. And building agents with human-level perceptual understanding of the world is a central but challenging task, which is becoming increasingly important in robotics, self-driving cars, personal assistants, medical imaging, and more. So today, we’re introducing the Perception Test, a multimodal benchmark using real-world videos to help evaluate the perception capabilities of a model.
  154. How undesired goals can arise with correct rewards

    Fri, 07 Oct 2022 00:00:00 -0000

    As we build increasingly advanced artificial intelligence (AI) systems, we want to make sure they don’t pursue undesired goals. Such behaviour in an AI agent is often the result of specification gaming – exploiting a poor choice of what the agent is rewarded for. In our latest paper, we explore a more subtle mechanism by which AI systems may unintentionally learn to pursue undesired goals: goal misgeneralisation (GMG). GMG occurs when a system's capabilities generalise successfully but its goal does not generalise as desired, so the system competently pursues the wrong goal. Crucially, in contrast to specification gaming, GMG can occur even when the AI system is trained with a correct specification.
  155. Discovering novel algorithms with AlphaTensor

    Wed, 05 Oct 2022 00:00:00 -0000

    In our paper, published today in Nature, we introduce AlphaTensor, the first artificial intelligence (AI) system for discovering novel, efficient, and provably correct algorithms for fundamental tasks such as matrix multiplication. This sheds light on a 50-year-old open question in mathematics about finding the fastest way to multiply two matrices. This paper is a stepping stone in DeepMind’s mission to advance science and unlock the most fundamental problems using AI. AlphaTensor builds upon AlphaZero, an agent that has shown superhuman performance on board games such as chess, Go and shogi, and this work marks AlphaZero’s first journey from playing games to tackling unsolved mathematical problems.
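    To make the search target concrete, the sketch below shows what a “faster” matrix multiplication algorithm looks like at the smallest scale: Strassen's classic 1969 scheme multiplies two 2x2 matrices with 7 scalar multiplications instead of the naive 8, and its correctness can be checked numerically. This is only an illustration of the kind of provably correct decomposition AlphaTensor searches for; it is not one of AlphaTensor's newly discovered algorithms.

```python
import numpy as np

def strassen_2x2(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Multiply two 2x2 matrices using Strassen's 7 scalar multiplications
    (the naive algorithm needs 8)."""
    a, b, c, d = A[0, 0], A[0, 1], A[1, 0], A[1, 1]
    e, f, g, h = B[0, 0], B[0, 1], B[1, 0], B[1, 1]
    m1 = (a + d) * (e + h)
    m2 = (c + d) * e
    m3 = a * (f - h)
    m4 = d * (g - e)
    m5 = (a + b) * h
    m6 = (c - a) * (e + f)
    m7 = (b - d) * (g + h)
    return np.array([[m1 + m4 - m5 + m7, m3 + m5],
                     [m2 + m4,           m1 - m2 + m3 + m6]])

# Numerical check against the standard matrix product.
A, B = np.random.rand(2, 2), np.random.rand(2, 2)
assert np.allclose(strassen_2x2(A, B), A @ B)  # same product, fewer multiplications
```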
  156. Fighting osteoporosis before it starts

    Tue, 27 Sep 2022 14:16:00 -0000

    Detecting signs of disease before bones start to break
  157. Understanding the faulty proteins linked to cancer and autism

    Mon, 26 Sep 2022 15:19:00 -0000

    Helping uncover how protein mutations cause diseases and disorders
  158. Solving the mystery of how an ancient bird went extinct

    Thu, 22 Sep 2022 15:27:00 -0000

    Creating a tool to study extinct species from 50,000 years ago
  159. Building safer dialogue agents

    Thu, 22 Sep 2022 00:00:00 -0000

    In our latest paper, we introduce Sparrow – a dialogue agent that’s useful and reduces the risk of unsafe and inappropriate answers. Our agent is designed to talk with a user, answer questions, and search the internet using Google when it’s helpful to look up evidence to inform its responses.
  160. Targeting early-onset Parkinson’s with AI

    Wed, 21 Sep 2022 15:37:00 -0000

    Predictions that pave the way to new treatments
  161. How our principles helped define AlphaFold’s release

    Wed, 14 Sep 2022 00:00:00 -0000

    Our Operating Principles have come to define both our commitment to prioritising widespread benefit and the areas of research and applications we refuse to pursue. These principles have been at the heart of our decision making since DeepMind was founded, and they continue to be refined as the AI landscape changes and grows. They are designed for our role as a research-driven science company and are consistent with Google’s AI principles.
  162. Maximising the impact of our breakthroughs

    Fri, 09 Sep 2022 00:00:00 -0000

    Colin, CBO at DeepMind, discusses collaborations with Alphabet and how we integrate ethics, accountability, and safety into everything we do.
  163. In conversation with AI: building better language models

    Tue, 06 Sep 2022 00:00:00 -0000

    Our new paper, In conversation with AI: aligning language models with human values, explores a different approach, asking what successful communication between humans and an artificial conversational agent might look like and what values should guide conversation in these contexts.
  164. From motor control to embodied intelligence

    Wed, 31 Aug 2022 00:00:00 -0000

    Using human and animal motions to teach robots to dribble a ball, and simulated humanoid characters to carry boxes and play football
  165. Advancing conservation with AI-based facial recognition of turtles

    Thu, 25 Aug 2022 00:00:00 -0000

    We came across Zindi – a dedicated partner with complementary goals – the largest community of African data scientists, which hosts competitions focused on solving Africa’s most pressing problems. Our Science team’s Diversity, Equity, and Inclusion (DE&I) group worked with Zindi to identify a scientific challenge that could help advance conservation efforts and grow involvement in AI. Inspired by Zindi’s bounding-box turtle challenge, we landed on a project with the potential for real impact: turtle facial recognition.
  166. Discovering when an agent is present in a system

    Thu, 18 Aug 2022 00:00:00 -0000

    We want to build safe, aligned artificial general intelligence (AGI) systems that pursue the intended goals of their designers. Causal influence diagrams (CIDs) are a way to model decision-making situations that allows us to reason about agent incentives. By relating training setups to the incentives that shape agent behaviour, CIDs help illuminate potential risks before training an agent and can inspire better agent designs. But how do we know when a CID is an accurate model of a training setup?