When small and medium companies achieve success in building Data and ML platforms, building AI platforms is now profoundly challenging
The post Stop Building AI Platforms appeared first on Towards Data Science.
An LLM in 2018 would not have trivialized a complex project, although it could have enhanced the final solution
The post What If I had AI in 2018: Rent the Runway Fulfillment Center Optimization appeared first on Towards Data Science.
Compared to the opacity around human intelligence, AI is more transparent in some very tangible ways.
The post AI Is Not a Black Box (Relatively Speaking) appeared first on Towards Data Science.
Minimize chaos and maintain inter-agent harmony in your projects
The post How AI Agents “Talk” to Each Other appeared first on Towards Data Science.
Connecting the Dots for Better Movie Recommendations: Lightweight graph RAG on Rotten Tomatoes movie reviews
The post Connecting the Dots for Better Movie Recommendations appeared first on Towards Data Science.
Build multi-agent teams that can automate tasks and enhance productivity.
The post Agentic AI 103: Building Multi-Agent Teams appeared first on Towards Data Science.
Not just what you ask, but how you ask it. Practical techniques for prompt engineering that deliver
The post Design Smarter Prompts and Boost Your LLM Output: Real Tricks from an AI Engineer’s Toolbox appeared first on Towards Data Science.
Log in to a Streamlit app with a Google email account
The post User Authorisation in Streamlit With OIDC and Google appeared first on Towards Data Science.
Understanding and Implementing Brant’s Tests in Ordinal Logistic Regression with Python
The post Exploring the Proportional Odds Model for Ordinal Logistic Regression appeared first on Towards Data Science.
Exploring Titans: A new architecture equipping LLMs with human-inspired memory that learns and updates itself during test-time.
The post Can AI Truly Develop a Memory That Adapts Like Ours? appeared first on Towards Data Science.
A beginner-friendly tutorial of MCP architecture, with the focus on MCP server components and applications, guiding through the process of building a custom MCP server that enables code-to-diagram.
The post Model Context Protocol (MCP) Tutorial: Build Your First MCP Server in 6 Steps appeared first on Towards Data Science.
Build iOS & Android Apps with Kivy
The post Mobile App Development with Python appeared first on Towards Data Science.
A recipe for building a portable soundscape monitoring app with AudioMoth, Raspberry Pi, and a decent dose of deep learning.
The post Audio Spectrogram Transformers Beyond the Lab appeared first on Towards Data Science.
A step-by-step guide to containerizing and orchestrating an ML training workflow without the Dockerfile headache, using a lightweight GPT-2 example.
The post Automate Models Training: An MLOps Pipeline with Tekton and Buildpacks appeared first on Towards Data Science.
Using GPU acceleration to speed up Bayesian Inference from months to minutes...
The post 10,000x Faster Bayesian Inference: Multi-GPU SVI vs. Traditional MCMC appeared first on Towards Data Science.
A brief analysis using density estimation to compare the two-verdict and three-verdict systems.
The post Applications of Density Estimation to Legal Theory appeared first on Towards Data Science.
Understand how to use Window Functions to perform calculations without losing details
The post Mastering SQL Window Functions appeared first on Towards Data Science.
Let’s observe the matter on the atomic level
The post Exploratory Data Analysis: Gamma Spectroscopy in Python appeared first on Towards Data Science.
We roll up our sleeves and start to deal with matrices
The post A Bird’s-Eye View of Linear Algebra: Measure of a Map — Determinants appeared first on Towards Data Science.
Playbook on how data analysts can become data scientists
The post How to Transition From Data Analyst to Data Scientist appeared first on Towards Data Science.
While the advancements in Generative AI continue to introduce astounding possibilities, Agentic AI has emerged as a solution to complex
The post Ready or Not, Agentic AI Is Disrupting Corporate Landscapes appeared first on The New Stack.
Historically, Mary Meeker‘s “Internet Trends” reports have been legendary. After a five-year hiatus, the VC/analyst has co-authored a remarkable new
The post Mary Meeker’s New AI Trends Report Makes the Case for Optimism appeared first on The New Stack.
AI agents are becoming true collaborators in the workplace, with more developers every day beginning to build them. To reach
The post Meet Embabel: A Framework for Building AI Agents With Java appeared first on The New Stack.
After my successful use of Claude Code, I was keen to try other equivalent agentic large language model (LLM) tools
The post Agentic Coding: How Google’s Jules Compares to Claude Code appeared first on The New Stack.
The TC39, aka the Ecma International’s Technical Committee 39, met in early June to review ECMAScript (a.k.a. JavaScript) specification submissions.
The post ECMAScript Committee Advances 3 Proposals to Stage 4 appeared first on The New Stack.
Java is an object-oriented language; it supports encapsulation, inheritance and polymorphism. As we can see from Java’s continuing success in
The post Async Programming in Java Repositories appeared first on The New Stack.
The use of generative AI (GenAI) is growing at an unprecedented rate across various industries. Significant technical advances in AI
The post GenAI and Flexible Consumption Models Reshape Hybrid Storage Infrastructure appeared first on The New Stack.
It may sound like heresy, but some developers regret their commitment to JavaScript. Brian Cardarella, the founder of web and
The post Elixir: An Alternative to JavaScript-Based Web Development appeared first on The New Stack.
Platform engineering teams create self-service capabilities that provide development teams with golden paths for building reliable and secure software. When
The post Accelerating Developer Velocity With Effective Platform Teams appeared first on The New Stack.
Natural human language is the ideal interface for accessing data, and making it conversational is the ultimate goal. Advanced AI,
The post Why AI and SQL Go Together Like Peanut Butter and Jelly appeared first on The New Stack.
Although macOS is one of the more user-friendly operating systems on the market, as a developer, you might find it
The post Install Homebrew on MacOS for More Dev Tool Options appeared first on The New Stack.
The idea behind Infrastructure from Code (IfC) is that you would simply write out all deployment and configuration steps in
The post Infrastructure From Code: What Went Wrong appeared first on The New Stack.
The open source vLLM represents a milestone in large language model (LLM) serving technology, providing developers with a fast, flexible
The post Introduction to vLLM: A High-Performance LLM Serving Engine appeared first on The New Stack.
This month marks the 30th anniversary of PHP being released to the world. To find out how PHP has evolved
The post PHP Turns 30: Language and Ecosystem Are Stronger Than Ever appeared first on The New Stack.
React is one of the most popular tools for building modern web applications. If you’ve worked with React, you’ve probably
The post Boost Performance With React Server Components and Next.js appeared first on The New Stack.
Enterprise automation is increasingly leveraging intelligent agent workflows driven by AI, typically relying on large language models (LLMs) for these
The post Cloud Native and Open Source Help Scale Agentic AI Workflows appeared first on The New Stack.
Anthropic’s CEO has said that in three to six months, AI will be writing 90% of the code that software
The post AI Will Steal Developer Jobs (But Not How You Think) appeared first on The New Stack.
At WWDC 2025, Apple announced something that will fundamentally reshape the way we think about container security: its Containerization framework
The post What You Need To Know About Apple’s New Container Framework appeared first on The New Stack.
AI is rapidly transforming software development, with AI-coding assistants now commonplace, offering everything from autocompletion to generating substantial code blocks.
The post Using AI for Test Generation: Powerful Tool or Risky Shortcut? appeared first on The New Stack.
The rise of container-based Linux distros is real, especially now with the demand for deploying to edge environments that require
The post No SSH? What Is Talos, This Linux Distro for Kubernetes? appeared first on The New Stack.
JavaScript isn’t just a language — it’s a specialized craft. And like any skilled martial art, the difference between being
The post JavaScript Kung Fu: Elegant Techniques To Master the Language appeared first on The New Stack.
Customers expect fast, transparent updates about their orders. Think shipping status, delays, delivery confirmations and more. The WhatsApp Business App
The post Build a Package Tracker With WhatsApp API appeared first on The New Stack.
At its Data+AI Summit today in San Francisco, Databricks launched a number of new features for its data platform, including
The post Databricks Launches a No-Code Tool for Building Data Pipelines appeared first on The New Stack.
When Databricks announced its plans to acquire the serverless Postgres startup Neon a few weeks ago, it clearly signaled it’s
The post Lakebase Is Databricks’ Fully-Managed Postgres Database for the AI Era appeared first on The New Stack.
Jellyfish, a software engineering intelligence platform provider, today expanded its integrations to support additional AI coding tools, as engineering team
The post Jellyfish Tracks AI Impact Across Four Major Coding Tools appeared first on The New Stack.
AI agents, vibe coding, AI code review, and other AI-centric topics are all I see developers talk about on social
The post AI and Vibe Coding Are Radically Impacting Senior Devs in Code Review appeared first on The New Stack.
In the event of a cloud incident, everyone wants swift and clear communication from the cloud provider, and to be able to leverage that information effectively. Personalized Service Health in the Google Cloud console addresses this need with fast, transparent, relevant, and actionable communications about Google Cloud service disruptions, customized to your specific footprint. This helps you to quickly identify the source of the problem, helping you answer the question, “Is it Google or is it me?” You can then integrate this information into your incident response workflows to resolve the incident more efficiently.
We're excited to announce that you can prompt Gemini Cloud Assist to pull real-time information about active incidents, powered by Personalized Service Health, providing you with streamlined incident management, including discovery, impact assessment, and recovery. By combining Gemini's guidance with Personalized Service Health insights and up-to-the-minute information, you can assess the scope of impact and begin troubleshooting – all within a single, AI-driven Gemini Cloud Assist chat. Further, you can initiate this sort of incident discovery from anywhere within the console, offering immediate access to relevant incidents without interrupting your workflow. You can also check for active incidents impacting your projects, gathering details on their scope and the latest updates directly sourced from Personalized Service Health.
We designed Gemini Cloud Assist with a user-friendly layout and a well-organized information structure. Crucial details, including dynamic timelines, latest updates, symptoms, and workarounds sourced directly from Personalized Service Health, are now presented in the console, enabling conversational follow-ups. Gemini Cloud Assist highlights critical insights from Personalized Service Health, helping you refine your investigations and understand the impact of incidents.
To illustrate the power of this integration, the following demo showcases a typical incident response workflow leveraging the combined capabilities of Gemini and Personalized Service Health.
Incident discovery and triage
In the crucial first moments of an incident, Gemini Cloud Assist helps you answer "Is it Google or is it me?" Gemini Cloud Assist accesses data directly from Personalized Service Health, and provides feedback on which projects and at what locations are affected by a Google Cloud incident, speeding up the triage process.
To illustrate how you can start this process, try asking Gemini Cloud Assist questions like:
Is my project impacted by a Google Cloud incident?
Are there any incidents impacting Google Cloud at the moment?
Investigating and evaluating impact
Once you’ve identified a relevant Google Cloud incident, you can use Gemini Cloud Assist to delve deeper into the specifics and evaluate its impact on your environment. Furthermore, by asking follow-up questions, Gemini Cloud Assist can retrieve updates from Personalized Service Health about the incident as it evolves. You can then further investigate by asking Gemini to pinpoint exactly which of your apps or projects, and at what locations, might be affected by the reported incident.
Here are examples of prompts you might pose to Gemini Cloud Assist:
Tell me more about the ongoing Incident ID [X] (Replace [X] with the Incident ID)
Is [X] impacted? (Replace [X] with your specific location or Google Cloud product)
What is the latest update on Incident ID [X]?
Show me the details of Incident ID [X].
Can you guide me through some troubleshooting steps for [impacted Google Cloud product]?
Mitigation and recovery
Finally, Gemini Cloud Assist can also act as an intelligent assistant during the recovery phase, providing you with actionable guidance. You can gain access to relevant logs and monitoring data for more efficient resolution. Additionally, Gemini Cloud Assist can help surface potential workarounds from Personalized Service Health and direct you to the tools and information you need to restore your projects or applications. Here are some sample prompts:
What are the workarounds for the incident ID [X]? (Replace [X] with the Incident ID)
Can you suggest a temporary solution to keep my application running?
How can I find logs for this impacted project?
From these prompts, Gemini retrieves relevant information from Personalized Service Health to provide you with personalized insights into your Google Cloud environment's health — both for ongoing events and incidents from up to one year in the past. This helps when investigating an incident to narrow down its impact, as well as assisting in recovery.
Looking ahead, we are excited to provide even deeper insights and more comprehensive incident management with Gemini Cloud Assist and Personalized Service Health, extending these AI-driven capabilities beyond a single project view. Ready to get started?
Learn more about Personalized Service Health, or reach out to your account team to enable it.
Get started with Gemini Cloud Assist. Refine your prompts to ask about your specific regions or Google Cloud products, and experiment to discover how it can help you proactively manage incidents.
In 2023, the Waze platform engineering team transitioned to Infrastructure as Code (IaC) using Google Cloud's Config Connector (KCC) — and we haven’t looked back since. We embraced Config Connector, an open-source Kubernetes add-on, to manage Google Cloud resources through Kubernetes. To streamline management, we also leverage Config Controller, a hosted version of Config Connector on Google Kubernetes Engine (GKE), incorporating Policy Controller and Config Sync. This shift has significantly improved our infrastructure management and is shaping our future infrastructure.
Previously, Waze relied on Terraform to manage resources, particularly during our dual-cloud, VM-based phase. However, maintaining state and ensuring reconciliation proved challenging, leading to inconsistent configurations and increased management overhead.
In 2023, we adopted Config Connector, transforming our Google Cloud infrastructure into Kubernetes Resource Modules (KRMs) within a GKE cluster. This approach addresses the reconciliation issues encountered with Terraform. Config Sync, paired with Config Connector, automates KRM synchronization from source repositories to our live GKE cluster. This managed solution eliminates the need for us to build and maintain custom reconciliation systems.
The shift helped us meet the needs of three key roles within Waze’s infrastructure team:
Infrastructure consumers: Application developers who want to easily deploy infrastructure without worrying about the maintenance and complexity of underlying resources.
Infrastructure owners: Experts in specific resource types (e.g., Spanner, Google Cloud Storage, Load Balancers, etc.), who want to define and standardize best practices in how resources are created across Waze on Google Cloud.
Platform engineers: Engineers who build the system that enables infrastructure owners to codify and define best practices, while also providing a seamless API for infrastructure consumers.
It may seem circular to define all of our Google Cloud infrastructure as KRMs within a Google Cloud service, however, KRM is actually a great representation for our infrastructure as opposed to existing IaC tooling.
Terraform's reconciliation issues – state drift, version management, out of band changes – are a significant pain. Config Connector, through Config Sync, offers out-of-the-box reconciliation, a managed solution we prefer. Both KRM and Terraform offer templating, but KCC's managed nature aligns with our shift to Google Cloud-native solutions and reduces our maintenance burden.
Infrastructure complexity requires generalization regardless of the tool. We can see this when we look at the Spanner requirements at Waze:
Consistent backups for all Spanner databases
Each Spanner database utilizes a dedicated Cloud Storage bucket and Service Account to automate the execution of DDL jobs.
All IAM policies for Spanner instances, databases, and Cloud Storage buckets are defined in code to ensure consistent and auditable access control.
To define these resources, we evaluated various templating and rendering tools and selected Helm, a robust CNCF package manager for Kubernetes. Its strong open-source community, rich templating capabilities, and native rendering features made it a natural fit. We can now refer to our bundled infrastructure configurations as 'Charts.' While KRO has since emerged that achieves a similar purpose, our selection process predated its availability.
Let's open the hood and dive into how the system works and is driving value for Waze.
Waze infrastructure owners generically define Waze-flavored infrastructure in Helm Charts.
Infrastructure consumers use these Charts with simplified inputs to generate infrastructure (demo).
Infrastructure code is stored in repositories, enabling validation and presubmit checks.
Code is uploaded to a Artifact Registry where Config Sync and Config Connector align Google Cloud infrastructure with the code definitions.
This diagram represents a single "data domain," a collection of bounded services, databases, networks, and data. Many tech orgs today consist of Prod, QA, Staging, Development, etc.
So why does all of this matter? Adopting this approach allowed us to move from Infrastructure as Code to Infrastructure as Software. By treating each Chart as a software component, our infrastructure management goes beyond simple code declaration. Now, versioned Charts and configurations enable us to leverage a rich ecosystem of software practices, including sophisticated release management, automated rollbacks, and granular change tracking.
Here's where we apply this in practice: our configuration inheritance model minimizes redundancy. Resource Charts inherit settings from Projects, which inherit from Bootstraps. All three are defined as Charts. Consequently, Bootstrap configurations apply to all Projects, and Project configurations apply to all Resources.
Every change to our infrastructure – from changes on existing infrastructure to rolling out new resource types – can be treated like a software rollout.
Now that all of our infrastructure is treated like software, we can see what this does for us system-wide:
In summary, Config Connector and Config Controller have enabled Waze to achieve true Infrastructure as Software, providing a robust and scalable platform for our infrastructure needs, along with many other benefits including:
Infrastructure consumers receive the latest best practices through versioned updates.
Infrastructure owners can iterate and improve infrastructure safely.
Platform Engineers and Security teams are confident our resources are auditable and compliant
Config Connector leverages Google's managed services, reducing operational overhead.
Distributed tracing is a critical part of an observability stack, letting you troubleshoot latency and errors in your applications. Cloud Trace, part of Google Cloud Observability, is Google Cloud’s native tracing product, and we’ve made numerous improvements to the Trace explorer UI on top of a new analytics backend.
The new Trace explorer page contains:
A filter bar with options for users to choose a Google Cloud project-based trace scope, all/root spans and a custom attribute filter.
A faceted span filter pane that displays commonly used filters based on OpenTelemetry conventions.
A visualization of matching spans including an interactive span duration heatmap (default), a span rate line chart, and a span duration percentile chart.
A table of matching spans that can be narrowed down further by selecting a cell of interest on the heatmap.
Let’s take a closer look at these new features and how you can use them to troubleshoot your applications. Imagine you’re a developer working on the checkoutservice of a retail webstore application and you’ve been paged because there’s an ongoing incident.
This application is instrumented using OpenTelemetry and sends trace data to Google Cloud Trace, so you navigate to the Trace explorer page on the Google Cloud console with the context set to the Google Cloud project that hosts the checkoutservice.
Before starting your investigation, you remember that your admin recommended using the webstore-prod trace scope when investigating webstore app-wide prod issues. By using this Trace scope, you'll be able to see spans stored in other Google Cloud projects that are relevant to your investigation.
You set the trace scope to webstore-prod and your queries will now include spans from all the projects included in this trace scope.
You select checkoutservice in Span filters (1) and the following updates load on the page:
Other sections such as Span name in the span filter pane (2) are updated with counts and percentages that take into account the selection made under service name. This can help you narrow down your search criteria to be more specific.
The span Filter bar (3) is updated to display the active filter.
The heatmap visualization (4) is updated to only display spans from the checkoutservice in the last 1 hour (default). You can change the time-range using the time-picker (5). The heatmap’s x-axis is time and the y-axis is span duration. It uses color shades to denote the number of spans in each cell with a legend that indicates the corresponding range.
The Spans table (6) is updated with matching spans sorted by duration (default).
Other Chart views (7) that you can switch to are also updated with the applied filter.
From looking at the heatmap, you can see that there are some spans in the >100s range which is abnormal and concerning. But first, you’re curious about the traffic and corresponding latency of calls handled by the checkoutservice.
Switching to the Span rate line chart gives you an idea of the traffic handled by your service. The x-axis is time and the y-axis is spans/second. The traffic handled by your service looks normal as you know from past experience that 1.5-2 spans/second is quite typical.
Switching to the Span duration percentile chart gives you p50/p90/p95/p99 span duration trends. While p50 looks fine, the p9x durations are greater than you expect for your service.
You switch back to the heatmap chart and select one of the outlier cells to investigate further. This particular cell has two matching spans with a duration of over 2 minutes, which is concerning.
You investigate one of those spans by viewing the full trace and notice that the orders publish span is the one taking up the majority of the time when servicing this request. Given this, you form a hypothesis that the checkoutservice is having issues handling these types of calls. To validate your hypothesis, you note the rpc.method attribute being PlaceOrder and exit this trace using the X button.
You add an attribute filter for key: rpc.method value:PlaceOrder using the Filter bar, which shows you that there is a clear latency issue with PlaceOrder calls handled by your service. You’ve seen this issue before and know that there is a runbook that addresses it, so you alert the SRE team with the appropriate action that needs to be taken to mitigate the incident.
Share your feedback with us via the Send feedback button.
This new experience is powered by BigQuery, using the same platform that backs Log Analytics. We plan to launch new features that take full advantage of this platform: SQL queries, flexible sampling, export, and regional storage.
In summary, you can use the new Cloud Trace explorer to perform service-oriented investigations with advanced querying and visualization of trace data. This allows developers and SREs to effectively troubleshoot production incidents and identify mitigating measures to restore normal operations.
The new Cloud Trace explorer is generally available to all users — try it out and share your feedback with us via the Send feedback button.
Picture this: you’re an Site Reliability Engineer (SRE) responsible for the systems that power your company’s machine learning (ML) services. What do you do to ensure you have a reliable ML service, how do you know you’re doing it well, and how can you build strong systems to support these services?
As artificial intelligence (AI) becomes more widely available, its features — including ML — will matter more to SREs. That’s because ML becomes both a part of the infrastructure used in production software systems, as well as an important feature of the software itself.
Abstractly, machine learning relies on its pipelines … and you know how to manage those! So you can begin with pipeline management, then look to other factors that will strengthen your ML services: training, model freshness, and efficiency. In the resources below, we'll look at some of the ML-specific characteristics of these pipelines that you’ll want to consider in your operations. Then, we draw on the experience of Google SREs to show you how to apply your core SRE skills to operating and managing your organization’s machine-learning pipelines.
Training ML models applies the notion of pipelines to specific types of data, often running on specialized hardware. Critical aspects to consider about the pipeline:
how much data you’re ingesting
how fresh this data needs to be
how the system trains and deploys the models
how efficiently the system handles these first three things
This keynote presents an SRE perspective on the value of applying reliability principles to the components of machine learning systems. It provides insight into why ML systems matter for products, and how SREs should think about them. The challenges that ML systems present include capacity planning, resource management, and monitoring; other challenges include understanding the cost of ML systems as part of your overall operations environment.
As with any pipeline-based system, a big part of understanding the system is describing how much data it typically ingests and processes. The Data Processing Pipelines chapter in the SRE Workbook lays out the fundamentals: automate the pipeline’s operation so that it is resilient, and can operate unattended.
You’ll want to develop Service Level Objectives (SLOs) in order to measure the pipeline’s health, especially for data freshness, i.e., how recently the model got the data it’s using to produce an inference for a customer. Understanding freshness provides an important measure of an ML system’s health, as data that becomes stale may lead to lower-quality inferences and sub-optimal outcomes for the user. For some systems, such as weather forecasting, data may need to be very fresh (just minutes or seconds old); for other systems, such as spell-checkers, data freshness can lag on the order of days — or longer! Freshness requirements will vary by product, so it’s important that you know what you’re building and how the audience expects to use it.
In this way, freshness is a part of the critical user journey described in the SRE Workbook, describing one aspect of the customer experience. You can read more about data freshness as a component of pipeline systems in the Google SRE article Reliable Data Processing with Minimal Toil.
There’s more than freshness to ensuring high-quality data — there’s also how you define the model-training pipeline. A Brief Guide To Running ML Systems in Production gives you the nuts and bolts of this discipline, from using contextual metrics to understand freshness and throughput, to methods for understanding the quality of your input data.
The 2021 SRE blog post Efficient Machine Learning Inference provides a valuable resource to learn about improving your model’s performance in a production environment. (And remember, training is never the same as production for ML services!)
Optimizing machine learning inference serving is crucial for real-world deployment. In this article, the authors explore multi-model serving off of a shared VM. They cover realistic use cases and how to manage trade-offs between cost, utilization, and latency of model responses. By changing the allocation of models to VMs, and varying the size and shape of those VMs in terms of processing, GPU, and RAM attached, you can improve the cost effectiveness of model serving.
We mentioned that these AI pipelines often rely on specialized hardware. How do you know you’re using this hardware efficiently? Todd Underwood’s talk from SREcon EMEA 2023 on Artificial Intelligence: What Will It Cost You? gives you a sense of how much this specialized hardware costs to run, and how you can provide incentives for using it efficiently.
This article from Google's SRE team outlines strategies for ensuring reliable data processing while minimizing manual effort, or toil. One of the key takeaways: use an existing, standard platform for as much of the pipeline as possible. After all, your business goals should focus on innovations in presenting the data and the ML model, not in the pipeline itself. The article covers automation, monitoring, and incident response, with a focus on using these concepts to build resilient data pipelines. You’ll read best practices for designing data systems that can handle failures gracefully and reduce a team’s operational burden. This article is essential reading for anyone involved in data engineering or operations. Read more about toil in the SRE Workbook: https://sre.google/workbook/eliminating-toil/.
Successful ML deployments require careful management and monitoring for systems to be reliable and sustainable. That means taking a holistic approach, including implementing data pipelines, training pathways, model management, and validation, alongside monitoring and accuracy metrics. To go deeper, check out this guide on how to use GKE for your AI orchestration.
In today's dynamic digital landscape, building and operating secure, reliable, cost-efficient and high-performing cloud solutions is no easy feat. Enterprises grapple with the complexities of cloud adoption, and often struggle to bridge the gap between business needs, technical implementation, and operational readiness. This is where the Google Cloud Well-Architected Framework comes in. The framework provides comprehensive guidance to help you design, develop, deploy, and operate efficient, secure, resilient, high-performing, and cost-effective Google Cloud topologies that support your security and compliance requirements.
The Well-Architected Framework caters to a broad spectrum of cloud professionals. Cloud architects, developers, IT administrators, decision makers and other practitioners can benefit from years of subject-matter expertise and knowledge both from within Google and from the industry. The framework distills this vast expertise and presents it as an easy-to-consume set of recommendations.
The recommendations in the Well-Architected Framework are organized under five, business-focused pillars.
We recently completed a revamp of the guidance in all the pillars and perspectives of the Well-Architected Framework to center the recommendations around a core set of design principles.
In addition to the above pillars, the Well-Architected Framework provides cross-pillar perspectives that present recommendations for selected domains, industries, and technologies like AI and machine learning (ML).
The Well-Architected Framework is much more than a collection of design and operational recommendations. The framework empowers you with a structured principles-oriented design methodology that unlocks many advantages:
Enhanced security, privacy, and compliance: Security is paramount in the cloud. The Well-Architected Framework incorporates industry-leading security practices, helping ensure that your cloud architecture meets your security, privacy, and compliance requirements.
Optimized cost: The Well-Architected Framework lets you build and operate cost-efficient cloud solutions by promoting a cost-aware culture, focusing on resource optimization, and leveraging built-in cost-saving features in Google Cloud.
Resilience, scalability, and flexibility: As your business needs evolve, the Well-Architected Framework helps you design cloud deployments that can scale to accommodate changing demands, remain highly available, and be resilient to disasters and failures.
Operational excellence: The Well-Architected Framework promotes operationally sound architectures that are easy to operate, monitor, and maintain.
Predictable and workload-specific performance: The Well-Architected Framework offers guidance to help you build, deploy, and operate workloads that provide predictable performance based on your workloads’ needs.
The Well-Architected Framework also includes cross-pillar perspectives for selected domains, industries, and technologies like AI and machine learning (ML).
The principles and recommendations in the Google Cloud Well-Architected Framework are aligned with Google and industry best practices like Google’s Site Reliability Engineering (SRE) practices, DORA capabilities, the Google HEART framework for user-centered metrics, the FinOps framework, Supply-chain Levels for Software Artifacts (SLSA), and Google's Secure AI Framework (SAIF).
Embrace the Well-Architected Framework to transform your Google Cloud journey, and get comprehensive guidance on security, reliability, cost, performance, and operations — as well as targeted recommendations for specific industries and domains like AI and ML. To learn more, visit Google Cloud Well-Architected Framework.
We are thrilled to announce the collaboration between Google Cloud, AWS, and Azure on Kube Resource Orchestrator, or kro (pronounced “crow”). kro introduces a Kubernetes-native, cloud-agnostic way to define groupings of Kubernetes resources. With kro, you can group your applications and their dependencies as a single resource that can be easily consumed by end users.
Platform and DevOps teams want to define standards for how application teams deploy their workloads, and they want to use Kubernetes as the platform for creating and enforcing these standards. Each service needs to handle everything from resource creation to security configurations, monitoring setup, defining the end-user interface, and more. There are client-side templating tools that can help with this (e.g., Helm, Kustomize), but Kubernetes lacks a native way for platform teams to create custom groupings of resources for consumption by end users.
Before kro, platform teams needed to invest in custom solutions such as building custom Kubernetes controllers, or using packaging tools like Helm, which can’t leverage the benefits of Kubernetes CRDs. These approaches are costly to build, maintain, and troubleshoot, and complex for non-Kubernetes experts to consume. This is a problem many Kubernetes users face. Rather than developing vendor-specific solutions, we’ve partnered with Amazon and Microsoft on making K8s APIs simpler for all Kubernetes users.
kro is a Kubernetes-native framework that lets you create reusable APIs to deploy multiple resources as a single unit. You can use it to encapsulate a Kubernetes deployment and its dependencies into a single API that your application teams can use, even if they aren’t familiar with Kubernetes. You can use kro to create custom end-user interfaces that expose only the parameters an end user should see, hiding the complexity of Kubernetes and cloud-provider APIs.
kro does this by introducing the concept of a ResourceGraphDefinition, which specifies how a standard Kubernetes Custom Resource Definition (CRD) should be expanded into a set of Kubernetes resources. End users define a single resource, which kro then expands into the custom resources defined in the CRD.
kro can be used to group and manage any Kubernetes resources. Tools like ACK, KCC, or ASO define CRDs to manage cloud provider resources from Kubernetes (these tools enable cloud provider resources, like storage buckets, to be created and managed as Kubernetes resources). kro can also be used to group resources from these tools, along with any other Kubernetes resources, to define an entire application deployment and the cloud provider resources it depends on.
Below, you’ll find some examples of kro being used with Google Cloud. You can find additional examples on the kro website.
Example 1: GKE cluster definition
Imagine that a platform administrator wants to give end users in their organization self-service access to create GKE clusters. The platform administrator creates a kro ResourceGraphDefinition called GKEclusterRGD that defines the required Kubernetes resources and a CRD called GKEcluster that exposes only the options they want to be configurable by end users. In addition to creating a cluster, the platform team also wants clusters to deploy administrative workloads such as policies, agents, etc. The ResourceGraphDefinition defines the following resources, using KCC to provide the mappings from K8s CRDs to Google Cloud APIs:
GKE cluster, Container Node Pools, IAM ServiceAccount, IAM PolicyMember, Services, Policies
The platform administrator would then define the end-user interface so that they can create a new cluster by creating an instance of the CRD that defines:
Cluster name, Nodepool name, Max nodes, Location (e.g. us-east1), Networks (optional)
Everything related to policy, service accounts, and service activation (and how these resources relate to each other) is hidden from the end user, simplifying their experience.
Example 2: Web application definition
In this example, a DevOps Engineer wants to create a reusable definition of a web application and its dependencies. They create a ResourceGraphDefinition called WebAppRGD, which defines a new Kubernetes CRD called WebApp. This new resource encapsulates all the necessary resources for a web application environment, including:
Deployments, service, service accounts, monitoring agents, and cloud resources like object storage buckets.
The WebAppRGD ResourceGraphDefinition can set a default configuration, and also define which parameters can be set by the end user at deployment time (kro gives you the flexibility to decide what is immutable, and what an end user is able to configure). A developer then creates an instance of the WebApp CRD, inputting any user-facing parameters. kro then deploys the desired Kubernetes resource.
We believe kro is a big step forward for platform engineering teams, delivering a number of advantages:
Kubernetes-native: kro leverages Kubernetes Custom Resource Definitions (CRDs) to extend Kubernetes, so it works with any Kubernetes resource and integrates with existing Kubernetes tools and workflows.
Lets you create a simplified end user experience: kro makes it easy to define end-user interfaces for complex groups of Kubernetes resources, making it easy for people who are not Kubernetes experts to consume services built on Kubernetes.
Enables standardized services for application teams: kro templates can be reused across different projects and environments, promoting consistency and reducing duplication of effort.
kro is available as an open-source project on GitHub. The GitHub organization is currently jointly owned by teams from Google, AWS, and Microsoft, and we welcome contributions from the community. We also have a website with documentation on installing and using kro, including example use cases. As an early-stage project, kro is not yet ready for production use, but we still encourage you to test it out in your own Kubernetes development environments!
Platform engineering, one of Gartner’s top 10 strategic technology trends for 2024, is rapidly becoming indispensable for enterprises seeking to accelerate software delivery and improve developer productivity. How does it do that? Platform engineering is about providing the right infrastructure, tools, and processes that enable efficient, scalable software development, deployment, and management, all while minimizing the cognitive burden on developers.
To uncover the secrets to platform engineering success, Google Cloud partnered with Enterprise Strategy Group (ESG) on a comprehensive research study of 500 global IT professionals and application developers working at organizations with at least 500 employees, all with formal platform engineering teams. Our goal was to understand whether they had adopted platform engineering, and if so, the impact that has had on their company’s software delivery capabilities.
The resulting report, “Building Competitive Edge With Platform Engineering: A Strategic Guide,” reveals common patterns, expectations, and actionable best practices for overcoming challenges and fully leveraging platform engineering. This blog post highlights some of the most powerful insights from this study.
The research confirms that platform engineering is no longer a nascent concept. 55% of the global organizations we invited to participate have already adopted platform engineering. Of those, 90% plan to expand its reach to more developers. Furthermore, 85% of companies using platform engineering report that their developers rely on the platform to succeed. These figures highlight that platform engineering is no longer just a trend; it's becoming a vital strategy for organizations seeking to unlock the full potential of their cloud and IT investments and gain a competitive edge.
Figure 1: 55% of 900+ global organizations surveyed have adopted platform engineering
The report identifies three critical components that are central to the success of mature platform engineering leaders.
Fostering close collaboration between platform engineers and other teams to ensure alignment
Adopting a “platform as a product” approach, which involves treating the developer platform with a clear roadmap, communicated value, and tight feedback loops
Defining success by measuring performance through clear metrics such as deployment frequency, failure recovery time, and lead time for changes
It's noteworthy that while many organizations have begun their platform engineering journey, only 27% of adopters have fully integrated these three key components in their practices, signaling a significant opportunity for further improvements.
One of the most compelling insights of this report is the synergistic relationship between platform engineering and AI. A remarkable 86% of respondents believe that platform engineering is essential to realizing the full business value of AI. At the same time, a vast majority of companies view AI as a catalyst for advancing platform engineering, with 94% of organizations identifying AI to be ‘Critical’ or ‘Important’ to the future of platform engineering.
The study also identified three cohorts of platform engineering adopters — nascent, established, and leading — based on whether and how much adopters had embraced the above-mentioned three key components of platform engineering success. The study shows that leading adopters gain more in terms of speed, efficiency, and productivity, and offers guidance for nascent and established adopters to improve their overall platform engineering maturity to gain more benefits.
The report also identified some additional benefits of platform engineering, including:
Improved employee satisfaction, talent acquisition & retention: mature platforms foster a positive developer experience that directly impacts company culture. Developers and IT pros working for organizations with mature developer platforms are much more likely to recommend their workplace to their peers.
Accelerated time to market: mature platform engineering adopters have significantly shortened time to market. 71% of leading adopters of platform engineering indicated they have significantly accelerated their time to market, compared with 28% of less mature adopters.
A vast majority (96%) of surveyed organizations are leveraging open-source tools to build their developer platforms. Moreover, most (84%) are partnering with external vendors to manage and support their open-source environments. Co-managed platforms with a third party or a cloud partner benefit from a higher degree of innovation. Organizations with co-managed platforms allocate an average of 47% of their developers’ productive time to innovation and experimentation, compared to just 38% for those that prefer to manage their platforms with internal staff.
While this blog provides a glimpse into the key findings from this study, the full report goes much further, revealing key platform engineering strategies and practices that will help you stay ahead of the curve. Download the report to explore additional topics, including:
The strategic considerations of centralized and distributed platform engineering teams
The key drivers behind platform engineering investments
Top priorities driving platform adoption for developers, ensuring alignment with their needs
Key pain points to anticipate and navigate on the road to platform engineering success
How platform engineering boosts productivity, performance, and innovation across the entire organization
The strategic importance of open source in platform engineering for competitive advantage
The transformative role of platform engineering for AI/ML workloads as adoption of AI increases
How to develop the right platform engineering strategy to drive scalability and innovation
Editor’s note: This blog post was updated to reflect the general availability status of these features as of March 31, 2025.
Cloud Deploy is a fully managed continuous delivery platform that automates the delivery of your application. On top of existing automation features, customers tell us they want other ways to automate their deployments to keep their production environments reliable and up to date.
We're happy to announce three new features to help with that, all in GA.
The new repair rollout automation rule lets you retry failed deployments or automatically roll back to a previously successful release when an error occurs. These errors could come in any phase of a deployment: a pre-deployment SQL migration, a misconfiguration detected when talking to a GKE cluster, or as part of a deployment verification step. In any of these cases, the repair rollout automation lets you retry the failed step a configurable number of times, perfect for those occasionally flaky end-to-end tests. If the retry succeeds, the rollout continues. If the retries fail (or none are configured) the repair rollout automation can also roll back to the previously successful release.
Automating deployments is powerful, but it can also be important to put some constraints on the automation. The new deploy policies feature is intended to limit what these automations (or users) can do. Initially, we're launching a time-windows policy which can, for example, inhibit deployments during evenings, weekends, or during important events. While an on-caller with the Policy Overrider role could "break glass" to get around these policies, automated deployments won't be able to trigger a rollout in the middle of your big demo.
After a release is successfully rolled out, you may want to automatically deploy it to the next environment. Our previous auto-promote feature let you promote a release after a specified duration, for example moving it into prod 12 hours after it went to staging. But often you want promotions to happen on a schedule, not based on a delay. Within Google, for example, we typically recommend that teams promote from a dev environment into staging every Thursday, and then start a promotion into prod on Monday mornings. With the new timed promotion automation, Cloud Deploy can handle these scheduled promotions for you.
Comprehensive, easy-to-use, and cost-effective DevOps tools are key to efficient software delivery, and it’s our hope that Cloud Deploy will help you implement complete CI/CD pipelines. Stay tuned as we introduce exciting new capabilities and features to Cloud Deploy in the months to come.
Update your current pipelines with these new features today. Check out the product page, documentation, quickstarts, and tutorials. Finally, if you have feedback on Cloud Deploy, you can join the conversation. We look forward to hearing from you!
Cloud applications like Google Workspace provide benefits such as collaboration, availability, security, and cost-efficiency. However, for cloud application developers, there’s a fundamental conflict between achieving high availability and the constant evolution of cloud applications. Changes to the application, such as new code, configuration updates, or infrastructure rearrangements, can introduce bugs and lead to outages. These risks pose a challenge for developers, who must balance stability and innovation while minimizing disruption to users.
Here on the Google Workspace Site Reliability Engineering team, we once moved a replica of Google Docs to a new data center because we needed extra capacity. But moving the associated data, which was vast, overloaded a key index in our database, restricting user ability to create new docs. Thankfully, we were able to identify the root cause and mitigate the problem quickly. Still, this experience convinced us of the need to reduce the risk of a global outage from a simple application change.
Our approach to reducing the risk of global outages is to limit the “blast radius,” or extent, of an outage by vertically partitioning the serving stack. The basic idea is to run isolated instances (“partitions”) of application servers and storage (Figure 1). Each partition contains all the various servers necessary to service a user request from end to end. Each production partition also has a pseudo-random mix of users and workloads, so all the partitions have similar resource needs. When it comes time to make changes to the application code, we deploy new changes to one partition at a time. Bad changes may cause a partition-wide outage, but we are protected from a global application outage.
Compare this approach to using canarying alone, in which new features or code changes are released to a small group of users before rolling them out to the rest. While canarying deploys changes first to just a few servers, it doesn’t prevent problems from spreading. For example, we’ve had incidents where canaried changes corrupted data used by all the servers in the deployment. With partitioning, the effects of bad changes are isolated to a single partition, preventing such contagion. Of course, in practice, we combine both techniques: canarying new changes to a few servers within a single partition.
Broadly speaking, partitioning brings a lot of advantages:
Despite the benefits, there are some challenges to adopt partitioning. In some cases, these challenges make it hard or risky to move from a non-partitioned to a partitioned setup. In other cases, challenges persist even after partitioning. Here are the issues as we see them:
We can improve the availability of web applications by partitioning their serving stacks. These partitions are isolated, because we do not fail over between them. Users and entities are assigned to partitions in a sticky manner to allow us to roll out changes in order of risk tolerance. This approach allows us to roll out changes one partition at a time with confidence that bad changes will only affect a single partition, and ideally that partition contains only users from your organization.
In short, partitioning supports our efforts to provide stronger and more reliable services to our users, and it might apply to your service as well. For example, you can improve the availability of your application by using Spanner, which provides geo-partitioning out of the box. Read more about geo-partitioning best practices here.
References
Cloud incidents happen. And when they do, it’s incumbent on the cloud service provider to communicate about the incident to impacted customers quickly and effectively — and for the cloud service consumer to use that information effectively, as part of a larger incident management response.
Google Cloud Personalized Service Health provides businesses with fast, transparent, relevant, and actionable communication about Google Cloud service disruptions, tailored to a specific business at its desired level of granularity. Cybersecurity company Palo Alto Networks is one Google Cloud customer and partner that recently integrated Personalized Service Health signals into the incident workflow for its Google Cloud-based PRISMA Access offering, saving its customers critical minutes during active incidents.
By programmatically ingesting Personalized Service Health signals into advanced workflow components, Palo Alto can quickly make decisions such as triggering contingency actions to protect business continuity.
Let’s take a closer look at how Palo Alto integrated Personalized Service Health into its operations.
Palo Alto ingests Personalized Service Health logs into its internal AIOps system, which centralizes incident communications for PRISMA Access and applies advanced techniques to classify and distribute signals to the people responsible for responding to a given incident.
Personalized Service Health UI Incident list view
Users of Personalized Service Health can filter what relevance levels they want to see. Here, “Partially related” reflects an issue anywhere in the world with the products that are used. “Related” reflects that the problem is detected within the data center regions, while “Impacted” means that Google has verified the impact to the customer for specific services.
While Google is still confirming an incident, Personalized Service Health communicates some of these incidents as 'PSH Emerging Incident' to provide customers with early notification. Once Google confirms the incident, these incidents are merged with 'PSH Confirmed Incidents'. This helps customers respond faster to a specific incident that’s impacting their environment or escalate back to Google, if needed.
Personalized Service Health distributes updates throughout an active incident, typically every 30 minutes, or sooner if there’s progress to share. These updates are also written to logs, which Palo Alto ingests into AIOps.
Responding to disruptive, unplanned cloud service provider incidents can be accelerated by programmatically ingesting and distributing incident communications. This is especially true in large-scale organizations such as Palo Alto, which has multiple teams involved in incident response for different applications, workloads and customers.
Palo Alto further leverages the ingested Personalized Service Health signals in its AIOps platform, which uses machine learning (ML) and analytics to automate IT operations. AIOps harnesses big data from operational appliances to detect and respond to issues instantaneously. AIOps correlates these signals with internally generated alerts to declare an incident that is affecting multiple customers. These AIOps alerts are tied to other incident management tools that assist with managing the incident lifecycle, including communication, regular updates and incident resolution.
In addition, a data enrichment pipeline takes Personalized Service Health incidents, adds Palo Alto’s related information, and publishes the events to Pub/Sub. AIOps then consumes the incident data from Pub/Sub, processes it, correlates it to related events signals, and notifies subscribed channels.
Palo Alto organizes Google Cloud assets into folders within the Google Cloud console. Each project represents a Palo Alto PRISMA Access customer. To receive incident signals that are likewise specific to end customers, Palo Alto creates a log sink that’s specific to each folder, aggregating service health logs at the folder level. Palo Alto then receives incident signals specific to each customer so it can take further action.
Palo Alto drives the following actions based on incident communications flowing from Google Cloud:
Proactive detection of zonal, inter-regional, external en-masse failures
Accurately identifying workloads affected by cloud provider incidents
Correlation of product issue caused by cloud service degradation in Google Cloud Platform itself
Incidents caused by cloud providers often go unnoticed or are difficult to isolate without involving multiple of the cloud provider’s teams (support, engineering, SRE, account management). The Personalized Service Health alerting framework plus AIOps correlation engine allows Palo Alto’s SRE teams to isolate issues caused by a cloud provider near-instantaneously.
Palo Alto’s incident management workflow is designed to address mass failures versus individual customer outages, ensuring the right teams are engaged until the incidents are resolved. This includes notifying relevant parties, such as the on-call engineer and the Google Cloud support team. With Personalized Service Health, Palo Alto can capture both event types i.e., mass failures as well as individual customer outages.
Palo Alto gets value from Personalized Service Health in multiple ways, beginning with faster incident response and contingency actions with which to optimize business continuity, especially for impacted customers of PRISMA Access. In the event of an incident impacting them, Prisma Access customers naturally seek and expect information from Palo Alto. By ensuring this information flows rapidly from Google Cloud to Palo Alto’s incident response systems, Palo Alto is able to provide more insightful answers to these end customers, and plans to serve additional Palo Alto use cases based on both existing and future Personalized Service Health capabilities.
Google Cloud is continually evolving Personalized Service Health to provide deeper value for all Google Cloud customers — from startups, to ISVs and SaaS providers, to the largest enterprises. Ready to get started? Learn more about Personalized Service Health, or reach out to your account team.
We'd like to thank Jose Andrade, Pankhuri Kumar and Sudhanshu Jain of Google for their contributions to this collaboration between PANW and Google Cloud.
From helping your developers write better code faster with Code Assist, to helping cloud operators more efficiently manage usage with Cloud Assist, Gemini for Google Cloud is your personal AI-powered assistant.
However, understanding exactly how your internal users are using Gemini has been a challenge — until today.
Today we are announcing that Cloud Logging and Cloud Monitoring support for Gemini for Google Cloud. Currently in public preview, Cloud Logging records requests and responses between Gemini for Google Cloud and individual users, while Cloud Monitoring reports 1-day, 7-day, and 28-day Gemini for Google Cloud active users and response counts in aggregate.
In addition to offering customers general visibility into the impact of Gemini, there are a few scenarios where logs are useful:
to track the provenance of your AI-generated content
to record and review user usage of Gemini for Google Cloud
This feature is available as opt-in and when enabled, logs your users’ Gemini for Google Cloud activity to Cloud Logging (Cloud Logging charges apply).
Once enabled, log entries are made for each request to and response from Gemini for Google Cloud. In a typical request entry, Logs Explorer would provide an entry similar to the following example:
There are several things to note about this entry:
The content inside jsonPayload
contains information about the request. In this case, it was a request to complete Python code with def fibonacci
as the input.
The labels tell you the method (CompleteCode
), the product (code_assist
), and the user who initiated the request (cal@google.com
).
The resource labels tell you the instance, location, and resource container (typically project) where the request occurred.
In a typical response entry, you’ll see the following:
Note that the request_id
inside the label are identical for this pair of requests and responses, enabling identification of request and response pairs.
In addition to the Log Explorer, Log Analytics supports queries to analyze your log data, and help you answer questions like "How many requests did User XYZ make to Code Assist?"
For more details, please see the Gemini for Google Cloud logging documentation.
Gemini for Google Cloud monitoring metrics help you answer questions like:
How many unique active users used Gemini for Google Cloud services over the past day or seven days?
How many total responses did my users receive from Gemini for Google Cloud services over the past six hours?
Cloud Monitoring support for Gemini for Google Cloud is available to anybody who uses a Gemini for Google Cloud product and records responses and active users as Cloud Monitoring metrics, with which dashboards and alerts can be configured.
Because these metrics are available with Cloud Monitoring, you can also use them as part of Cloud Monitoring dashboards. A “Gemini for Google Cloud” dashboard is automatically installed under “GCP Dashboards” when Gemini for Google Cloud usage is detected:
Metrics Explorer offers another avenue where metrics can be examined and filters applied to gain a more detailed view of your usage. This is done by selecting the “Cloud AI Companion Instance” active resource in the Metrics Explorer:
In the example above, response_count
is the number of responses sent by Gemini for Google Cloud, and can be filtered for Gemini Code Assist or the Gemini for Google Cloud method (code completion/generation).
For more details, please see the Gemini for Google Cloud monitoring documentation.
We’re continually working on additions to these new capabilities, and in particular are focused on Code Assist logging and metrics enhancements that will bring even further insight and observability into your use of Gemini Code Assist and its impact. To get started with Gemini Code Assist and learn more about Gemini Cloud Assist — as well as observability data about it from Cloud Logging and Monitoring — check out the following links:
At the end of the day, developers build, test, deploy and maintain software. But like with lots of things, it’s about the journey, not the destination.
Among platform engineers, we sometimes refer to that journey as the developer experience (DX), which encompasses how developers feel and interact with the tools and services they use throughout the software build, test, deployment and maintenance process.
Prioritizing DX is essential: Frustrated developers lead to inefficiency and talent loss as well as to shadow IT. Conversely, a positive DX drives innovation, community, and productivity. And if you want to provide a positive DX, you need to start measuring how you’re doing.
At PlatformCon 2024, I gave a talk entitled "Improving your developers' platform experience by applying Google frameworks and methods” where I spoke about Google’s HEART Framework, which provides a holistic view of your organization's developers’ experience through actionable data.
In this article, I will share ideas on how you can apply the HEART framework to your Platform Engineering practice, to gain a more comprehensive view of your organization’s developer experience. But before I do that, let me explain what the HEART Framework is.
In a nutshell, HEART measures developer behaviors and attitudes from their experience of your platform and provides you with insights into what’s going on behind the numbers, by defining specific metrics to track progress towards goals. This is beneficial because continuous improvements through feedback are vital components of a platform engineering journey, helping both platform and application product teams make decisions that are data-driven and user-centered.
However, HEART is not a data collection tool in and of itself; rather, it’s a user-sentiment framework for selecting the right metrics to focus on based on product or platform objectives. It balances quantitative or empirical data, e.g., number of active portal users, with qualitative or subjective insights such as "My users feel the portal navigation is confusing." In other words, consider HEART as a framework or methodology for assessing user experience, rather than a specific tool or assessment. It helps you decide what to measure, not how to measure it.
Let’s take a look at each of these in more detail.
Highlight: Gathering and analyzing developer feedback
Subjective metrics:
Surveys: Conduct regular surveys to gather feedback about overall satisfaction, ease of use, and pain points. Toil negatively affects developer satisfaction and morale. Repetitive, manual work can lead to frustration burnout and decreased happiness with the platform.
Feedback mechanisms: Establish easy ways for developers to provide direct feedback on specific features or areas of the platform like Net Promoter Score (NPS) or Customer Satisfaction surveys (CSAT).
Collect open-ended feedback from developers through interviews and user groups.
Sentiment analysis: Analyze developer sentiment expressed in feedback channels, support tickets and online communities.
System metrics:
Feature requests: Track the number and types of feature requests submitted by developers. This provides insights into their needs and desires and can help you prioritize improvements that will enhance happiness.
Watch out for: While platforms can boost developer productivity, they might not necessarily contribute to developer job satisfaction. This warrants further investigation, especially if your research suggests that your developers are unhappy.
Highlight: Frequency of interaction between platform engineers with developers and quality of interaction — intensity and quality of interaction with the platform, participation on chat channels, training, dual ownership of golden paths, joint troubleshooting, engaging in architectural design discussions, and the breadth of interaction by everyone from new hires through to senior developers.
Subjective metrics:
Survey for quality of interaction — focus on depth and type of interaction whether through chat channel, trainings, dual ownership of golden paths, joint troubleshooting, or architectural design discussions
High toil can reduce developer engagement with the platform. When developers spend excessive amounts of time on tedious tasks, they are less likely to explore new features, experiment, and contribute to the platform's evolution.
System metrics:
Active users: Track daily, weekly, and monthly active developers and how long they spend on tasks.
Usage patterns: Analyze the most used platform features, tools, and portal resources.
Frequency of interaction between platform engineers with developers.
Breadth of user engagement: Track onboarding time for new hires to reach proficiency, measure the percentage of senior developers actively contributing to golden paths or portal functionality.
Watch out for: Don’t confuse engagement with satisfaction. Developers may rate the platform highly in surveys, but usage data might reveal low frequency of interaction with core features or a limited subset of teams actively using the platform. Ask them “How has the platform changed your daily workflow?” rather than "Are you satisfied with the platform?”
Highlight: Overall acceptance and integration of the platform into the development workflow.
System metrics:
New user registrations: Monitor the growth rate of new developers using the platform.
Track time between registration and time to use the platform i.e., executing golden paths, tooling and portal functionality.
Number of active users per week / month / quarter / half-year / year who authenticate via the portal and/or use golden paths, tooling and portal functionality
Feature adoption: Track how quickly and widely new features or updates are used.
Percentage of developers using CI/CD through the platform
Number of deployments per user / team / day / week / month — basically of your choosing
Training: Evaluate changes in adoption, after delivering training.
Watch out for: Overlooking the "long tail" of adoption. A platform might see a burst of early adoption, but then plateau or even decline if it fails to continuously evolve and meet changing developer needs. Don't just measure initial adoption, monitor how usage evolves over weeks, months, and years.
Highlight: Long-term engagement and reducing churn.
Subjective metrics:
System metrics:
Churn rate: Track the percentage of developers who stop logging into the portal and are not using it.
Dormant users: Identify developers who become inactive after 6 months and investigate why.
Track services that are less frequently used.
Watch out for: Misinterpreting the reasons for churn. When developers stop using your platform (churn), it's crucial to understand why. Incorrectly identifying the cause can lead to wasted effort and missed opportunities for improvement. Consider factors outside the platform — churn could be caused by changes in project requirements, team structures or industry trends.
Highlight: Efficiency and effectiveness of the platform in supporting specific developer activities.
Subjective metrics:
Survey to assess the ongoing presence of toil and its inimical influence on developer productivity, ultimately hindering efficiency and leading to increased task completion times.
System metrics:
Completion rates: Measure the percentage of golden paths and tools successfully run on the platform without errors.
Time to complete tasks using golden paths, portal, or tooling.
Error rates: Track common errors and failures developers encounter from log files or monitoring dashboards from golden paths, portal or tooling.
Mean Time to Resolution (MTTR): When errors do occur, how long does it take to resolve them? A lower MTTR indicates a more resilient platform and faster recovery from failures.
Developer platform and portal uptime: Measure the percentage of time that the developer platform and portal is available and operational. Higher uptime ensures developers can consistently access the platform and complete their tasks.
Watch out for: Don't confuse task success with task completion. Simply measuring whether developers can complete tasks on the platform doesn't necessarily indicate true success. Developers might find workarounds or complete tasks inefficiently, even if they technically achieve the end goal. It may be worth manually observing developer workflows in their natural environment to identify pain points and areas of friction in their workflows.
Also, be careful with misaligning task success with business goals. Task completion might overlook the broader impact on business objectives. A platform might enable developers to complete tasks efficiently, but if those tasks don't contribute to overall business goals, the platform's true value is questionable.
It’s not necessary to use all of the categories each time. The number of categories to consider really depends on the specific goals and context of the assessment; you can include everything or trim it down to better match your objective. Here are some examples:
Improving onboarding for new developers: Focus on adoption, task success and happiness.
Launching a new feature: Concentrate on adoption and happiness.
Increasing platform usage: Track engagement, retention and task success.
Keep in mind that relying on just one category will likely provide an incomplete picture.
In a perfect world, you would use the HEART framework to establish a baseline assessment a few months after launching your platform, which will provide you with a valuable insight into early developer experience. As your platform evolves, this initial data becomes a benchmark for measuring progress and identifying trends. Early measurement allows you to proactively address UX issues, guide design decisions with data, and iterate quickly for optimal functionality and developer satisfaction. If you're starting with an MVP, conduct the baseline assessment once the core functionality is in place and you have a small group of early users to provide feedback.
After 12 or more months of usage, you can also add metrics to embody a new or more mature platform. This can help you gather deeper insights into your developers’ experience by understanding how they are using the platform, measure the impact of changes you’ve made to the platform, or identify areas for improvement and prioritize future development efforts. If you've added new golden paths, tooling, or enhanced functionality, then you'll need to track metrics that measure their success and impact on developer behavior.
The frequency with which you assess HEART metrics depends on several factors, including:
The maturity of your platform: Newer platforms benefit from more frequent reviews (e.g. monthly or quarterly) to track progress and address early issues. As the platform matures, you can reduce the frequency of your HEART assessments (e.g., bi-annually or annually).
The rate of change: To ensure updates and changes have a positive impact, apply the HEART framework more frequently when your platform is undergoing a period of rapid evolution such as major platform updates, new portal features or new golden paths, or some change in user behavior. This allows you to closely monitor the effects of each change on key metrics.
The size and complexity of your platform: Larger and more complex platforms may require more frequent assessments to capture nuances and potential issues.
Your team's capacity: Running HEART assessments requires time and resources. Consider your team's bandwidth and adjust the frequency accordingly.
Schedule periodic deep dives (e.g. quarterly or bi-annually) using the HEART framework to gain a more in-depth understanding of your platform's performance and identify areas for improvement.
In this blog post, we’ve shown how the HEART framework can be applied to platform engineering to measure and improve the developer experience. We’ve explored the five key aspects of the framework — happiness, engagement, adoption, retention, and task success — and provided specific metrics for each and guidance on when to apply them.By applying these insights, platform engineering teams can create a more positive and productive environment for their developers, leading to greater success in their software development efforts.To learn more about platform engineering, check out some of our other articles: 5 myths about platform engineering: what it is and what it isn’t, Another five myths about platform engineering, and Laying the foundation for a career in platform engineering.
And finally, check out the DORA Report 2024, which now has a section on Platform Engineering.
The DORA research program has been investigating the capabilities, practices, and measures of high-performing technology-driven teams and organizations for more than a decade. It has published reports based on data collected from annual surveys of professionals working in technical roles, including software developers, managers, and senior executives.
Today, we’re pleased to announce the publication of the 2024 Accelerate State of DevOps Report, marking a decade of DORA’s investigation into high-performing technology teams and organizations. DORA’s four key metrics, introduced in 2013, have become the industry standard for measuring software delivery performance.
Each year, we seek to gain a comprehensive understanding of standard DORA performance metrics, and how they intersect with individual, workflow, team, and product performance. We now include how AI adoption affects software development across multiple levels, too.
We also establish reference points each year to help teams understand how they are performing, relative to their peers, and to inspire teams with the knowledge that elite performance is possible in every industry. DORA’s research over the last decade has been designed to help teams get better at getting better: to strive to improve their improvements year over year.
For a quick overview of this year’s report, you can read in our executive DORA Report summary the spotlight adoption trends and the impact of AI, the emergence of platform engineering, and the continuing significance of developer experience.
Organizations across all industries are prioritizing the integration of AI into their applications and services. Developers are increasingly relying on AI to improve their productivity and fulfill their core responsibilities. This year's research reveals a complex landscape of benefits and tradeoffs for AI adoption.
The report underscores the need to approach platform engineering thoughtfully, and emphasizes the critical role of developer experience in achieving high performance.
Widespread AI adoption is reshaping software development practices. More than 75 percent of respondents said that they rely on AI for at least one daily professional responsibility. The most prevalent use cases include code writing, information summarization, and code explanation.
The report confirms that AI is boosting productivity for many developers. More than one-third of respondents experienced”‘moderate” to “extreme” productivity increases due to AI.
A 25% increase in AI adoption is associated with improvements in several key areas:
7.5% increase in documentation quality
3.4% increase in code quality
3.1% increase in code review speed
However, despite AI’s potential benefits, our research revealed a critical finding: AI adoption may negatively impact software delivery performance. As AI adoption increased, it was accompanied by an estimated decrease in delivery throughput by 1.5%, and an estimated reduction in delivery stability by 7.2%. Our data suggest that improving the development process does not automatically improve software delivery — at least not without proper adherence to the basics of successful software delivery, like small batch sizes and robust testing mechanisms. AI has positive impacts on many important individual and organizational factors which foster the conditions for high software delivery performance. But, AI does not appear to be a panacea.
Our research also shows that despite the productivity gains, 39% of the respondents reported little to no trust in AI-generated code. This unexpected low level of trust indicates to us that there is a need to manage AI integration more thoughtfully. Teams must carefully evaluate AI’s role in their development workflow to mitigate the downsides.
Based on these findings, we have three core recommendations:
Enable your employees and reduce toil by orienting your AI adoption strategies towards empowering employees and alleviating the burden of undesirable tasks.
Establish clear guidelines for the use of AI and address procedural concerns and foster open communication about its impact.
Encourage continuous exploration of AI tools and provide them dedicated time for experimentation, and promote trust through hands-on experience.
Another emerging discipline our research focused this year is on platform engineering. Its focus is on building and operating internal development platforms to streamline processes and enhance efficiency
Our research identified 4 key findings regarding platform engineering:
Increased developer productivity: Internal development platforms effectively increase productivity for developers.
Prevalence in larger firms: These platforms are more commonly found in larger organizations, suggesting their suitability for managing complex development environments.
Potential performance dip: Implementing a platform engineering initiative might lead to a temporary decrease in performance before improvements manifest as the platform matures.
Need for user-centeredness and developer independence: For optimal results, platform engineering efforts should prioritize user-centered design, developer independence, and a product-oriented approach
A thoughtful approach that prioritizes user needs, empowers developers, and anticipates potential challenges is key to maximizing the benefits of platform engineering initiatives.
One of the key insights in last year’s report was that a healthy culture can help reduce burnout, increase productivity, and increase job satisfaction. This year was no different. Teams that cultivate a stable and supportive environment that empowers developers to excel drive positive outcomes.
Move fast and constantly pivot’ mentality negatively impacts developer well-being and consequently, on overall performance. Instability in priorities, even with strong leadership, comprehensive documentation, and a user-centered approach — all known to be highly beneficial — can significantly hinder progress.
Creating a work environment where your team feels supported, valued, and empowered to contribute is fundamental to achieving high performance.
The key takeaway from the decade of research is that software development success hinges not just on technical prowess but also on fostering a supportive culture, prioritizing user needs, and focusing on developer experience. We encourage teams to replicate our findings within your specific context.
It can be used as a hypothesis for your experiments and continuous improvement initiatives. Please share those with us and the DORA community, so that your efforts can become part of our collaborative learning environment.
We work on this research in hopes that it serves as a roadmap for teams and organizations seeking to improve their practices and create a thriving environment for innovation, collaboration, and business success. We will continue our platform-agnostic research that focuses on the human aspect of technology for the next decade to come.
To learn more:
Share your experiences, learn from others, and get inspiration by joining the DORA community.
Measure your team’s software delivery performance in less than a minute using DORA's Quick Check.
Organizations are grappling with an explosion of operational data spread across an increasingly diverse and complex database landscape. This complexity often results in costly outages, performance bottlenecks, security vulnerabilities, and compliance gaps, hindering their ability to extract valuable insights and deliver exceptional customer experiences. To help businesses overcome these challenges, earlier this year, we announced the preview of Database Center, an AI-powered, unified fleet management solution.
We’re seeing accelerated adoption for Database Center from many customers. For example, Ford uses Database Center to get answers on their database fleet health in seconds, and proactively mitigates potential risks to their applications. Today, we’re announcing that Database Center is now available to all customers, empowering you to monitor and operate database fleets at scale with a single, unified solution. We've also added support for Spanner, so you can manage it along with your Cloud SQL and AlloyDB deployments, with support for additional databases on the way.
Database Center is designed to bring order to the chaos of your database fleet, and unlock the true potential of your data. It provides a single, intuitive interface where you can:
Gain a comprehensive view of your entire database fleet. No more silos of information or hunting through bespoke tools and spreadsheets.
Proactively de-risk your fleet with intelligent performance and security recommendations. Database Center provides actionable insights to help you stay ahead of potential problems, and helps improve performance, reduce costs and enhance security with data-driven suggestions.
Optimize your database fleet with AI-powered assistance. Use a natural-language chat interface to ask questions and quickly resolve fleet issues and get optimization recommendations.
Let’s now review each in more detail.
Tired of juggling different tools and consoles to keep track of your databases?
Database Center simplifies database management with a single, unified view of your entire database landscape. You can monitor database resources across your entire organization, spanning multiple engines, versions, regions, projects and environments (or applications using labels).
Cloud SQL, AlloyDB, and now Spanner are all fully integrated with Database Center, so you can monitor your inventory and proactively detect issues. Using the unified inventory view in Database Center, you can:
Identify out-of-date database versions to ensure proper support and reliability
Track version upgrades, e.g., if PostgreSQL 14 to PostgreSQL 15 is updating at an expected pace
Ensure database resources are appropriately distributed, e.g., identify the number of databases powering the critical production applications vs. non-critical dev/test environments
Monitor database migration from on-prem to cloud or across engines
Manage Cloud SQL, AlloyDB and Spanner resources with a unified view.
Managing your database fleet health at scale can involve navigating through a complex blend of security postures, data protection settings, resource configurations, performance tuning and cost optimizations. Database Center proactively detects issues associated with these configurations and guides you through addressing them.
For example, high transaction ID for a Cloud SQL instance can lead to the database no longer accepting new queries, potentially causing latency issues or even downtime. Database Center proactively detects this, provides an in-depth explanation, and walks you through prescriptive steps to troubleshoot the issue.
We’ve also added several performance recommendations to Database Center related to excessive tables/joins, connections, or logs, and can assist you through a simple optimization journey.
End-to-end workflow for detecting and troubleshooting performance issues.
Database Center also simplifies compliance management by automatically detecting and reporting violations across a wide range of industry standards, including CIS, PCI-DSS, SOC2, HIPAA. Database Center continuously monitors your databases for potential compliance violations. When a violation is detected, you receive a clear explanation of the problem, including:
The specific security or reliability issue causing the violation
Actionable steps to help address the issue and restore compliance
This helps reduce the risk of costly penalties, simplifies compliance audits and strengthens your security posture. Database Center now also supports real-time detection of unauthorized access, updates, and data exports.
Database Center helps ensure compliance to HIPAA standards.
With Gemini enabled, Database Center makes optimizing your database fleet incredibly intuitive. Simply chat with the AI-powered interface to get precise answers, uncover issues within your database fleet, troubleshoot problems, and quickly implement solutions. For example, you can quickly identify under-provisioned instances across your entire fleet, access actionable insights such as the duration of high CPU/Memory utilization conditions, receive recommendations for optimal CPU/memory configurations, and learn about the associated cost of those adjustments.
AI-powered chat in Database Center provides comprehensive information and recommendations across all aspects of database management, including inventory, performance, availability and data protection. Additionally, AI-powered cost recommendations suggest ways for optimizing your spend, and advanced security and compliance recommendations help strengthen your security and compliance posture.
AI-powered chat to identify data protection issues and optimize cost.
The new capabilities of Database Center are available in preview today for Spanner, Cloud SQL, and AlloyDB for all customers. Simply access Database Center within the Google Cloud console and begin monitoring and managing your entire databases fleet. To learn more about Database Center’s capabilities, check out the documentation.
Editor's note: Starting February 4, 2025, pipe syntax will be available to all BigQuery users by default.
Log data has become an invaluable resource for organizations seeking to understand application behavior, optimize performance, strengthen security, and enhance user experiences. But the sheer volume and complexity of logs generated by modern applications can feel overwhelming. How do you extract meaningful insights from this sea of data?
At Google Cloud, we’re committed to providing you with the most powerful and intuitive tools to unlock the full potential of your log data. That's why we're thrilled to announce a series of innovations in BigQuery and Cloud Logging designed to revolutionize the way you manage, analyze, and derive value from your logs.
Say goodbye to the days of deciphering complex, nested SQL queries. BigQuery pipe syntax ushers in a new era of SQL, specifically designed with the semi-structured nature of log data in mind. BigQuery’s pipe syntax introduces an intuitive, top-down syntax that mirrors how you naturally approach data transformations. As demonstrated in the recent research by Google, this approach leads to significant improvements in query readability and writability. By visually separating different stages of a query with the pipe symbol (|>), it becomes remarkably easy to understand the logical flow of data transformation. Each step is clear, concise, and self-contained, making your queries more approachable for both you and your team.
BigQuery’s pipe syntax isn’t just about cleaner SQL — it’s about unlocking a more intuitive and efficient way to work with your data. Instead of wrestling with code, experience faster insights, improved collaboration, and more time spent extracting value.
This streamlined approach is especially powerful when it comes to the world of log analysis.
With log analysis, exploration is key. Log analysis is rarely a straight line from question to answer. Analyzing logs often means sifting through mountains of data to find specific events or patterns. You explore, you discover, and you refine your approach as you go. Pipe syntax embraces this iterative approach. You can smoothly chain together filters (WHERE), aggregations (COUNT), and sorting (ORDER BY) to extract those golden insights. You can also add or remove steps in your data processing as you uncover new insights, easily adjusting your analysis on the fly.
Imagine you want to count the total number of users who were affected by the same errors more than 100 times in the month of January. As shown below, the pipe syntax’s linear structure clearly shows the data flowing through each transformation: starting from the table, filtering by the dates, counting by user id and error type, filtering for errors >100, and finally counting the number of users affected by the same errors.
The same example in the standard syntax will typically require using a subquery and non linear structure.
The impact of these advancements is already being felt by our customers. Here's what Carrefour, a global leader in retail, had to say about their experience with pipe syntax:
"Pipe syntax has been a very refreshing addition to BigQuery. We started using it to dig into our audit logs, where we often use Common Table Expressions (CTEs) and aggregations. With pipe syntax, we can filter and aggregate data on the fly by just adding more pipes to the same query. This iterative approach is very intuitive and natural to read and write. We are now using it for our analysis work in every business domain. We will have a hard time going back to the old SQL syntax now!" - Axel Thevenot, Lead Data Engineer, and Guillaume Blaquiere, Data Architect, Carrefour
BigQuery pipe syntax is currently available for all BigQuery users. You can check-out this introductory video.
But we haven't stopped at simplifying your code. BigQuery now offers enhanced performance and powerful JSON handling capabilities to further accelerate your log analytics workflows. Given the prevalence of json data in logs, we expect these changes to simplify log analytics for a majority of users.
Enhanced Point Lookups: Pinpoint critical events in massive datasets quickly using BigQuery's numeric search indexes, which dramatically accelerates queries that filter on timestamps and unique IDs. Here is a sample improvement from the announcement blog.
Powerful JSON Analysis: Parse and analyze your JSON-formatted log data with ease using BigQuery's JSON_KEYS function and JSONPath traversal feature. Extract specific fields, filter on nested values, and navigate complex JSON structures without breaking a sweat.
JSON_KEYS extracts unique JSON keys from JSON data for easier schema exploration and discoverability
Log Analytics in Cloud Logging is built on top of BigQuery and provides a UI that’s purpose-built for log analysis. With an integrated date/time picker, charting and dashboarding, Log Analytics makes use of the JSON capabilities to support advanced queries and analyze logs faster. To seamlessly integrate these powerful capabilities into your log management workflow, we're also enhancing Log Analytics (in Cloud Logging) with pipe syntax. You can now analyze your logs within Log Analytics leveraging the full power of BigQuery pipe syntax, enhanced lookups, and JSON handling, all within a unified platform.
Use of pipe syntax in Log Analytics (Cloud Logging) is now available in preview.
BigQuery and Cloud Logging provide an unmatched solution for managing, analyzing, and extracting actionable insights from your log data. Explore these new capabilities today and experience the power of:
Intuitive querying with pipe syntax - Introductory video, Documentation
Unified log management and analysis with Log Analytics in Cloud Logging
Blazing-fast lookups with numeric search indexes - Documentation
Start your journey towards more insightful and efficient log analytics in the cloud with BigQuery and Cloud Logging. Your data holds the answers — we're here to help you find them.
As AI adoption speeds up, one thing is becoming clear: the developer platforms that got you this far won’t get you to the next stage. While yesterday’s platforms were awesome, let’s face it, they weren’t built for today’s AI-infused application development and deployment. And organizations are quickly realizing they need to update their platform strategies to ensure that developers — and the wider set of folks using AI — have what they need for the years ahead.
In fact, as I explore in a new paper, nine out of ten decision makers are prioritizing the task of optimizing workloads for AI over the next 12 months. Problem is, given the pace of change lately, many don’t know where to start or what they need when it comes to modernizing their developer platforms.
What follows is a quick look at the key steps involved in planning your platform strategy. For all the details, download my full guide, Three pillars of a modern, AI-ready platform.
Whether you’re building your first platform or your fiftieth, you need to start by asking, “Why?” After all, a new platform is another asset to maintain and operate —you need to make sure it exists for the right reasons.
To build your case, ask yourself three questions:
Now that you’re clear on the customers, goals, and performance metrics of the platform you need, it’s time to actually build the thing. Here’s a glance at the key components of a modern, AI-ready platform — complete with the capabilities developers need to hit the ground running when developing AI-powered solutions.
For a detailed breakdown of what to consider in each area of your platform, including a list of technology options for each category, head over to the full paper.
The journey doesn’t end once your platform’s built. In fact, it’s just beginning. A platform is never “done;” it’s just released. As such, you need to adopt a continuous improvement mindset and assign a core platform team the task of finding new ways to introduce value to stakeholders.
At this stage, my top tip is to treat your platform like a product, applying platform engineering principles to keep making it faster, cheaper, and easier to deliver software. Oh, and to leverage the latest in AI-driven optimization tools to monitor and maintain your platform over time!
Organizations embark on platform overhauls for a whole bunch of reasons. Some do it to better cope with forecasted growth. Others have AI adoption in their sights. Then there are those driven by cost, performance, or the user experience. Whatever your reason for getting started, I encourage you to read the full paper on building a modern AI-ready platform — your developers (and the business) will thank you.
You’ve probably felt the frustration that arises when a project fails to meet established deadlines. And perhaps you’ve also encountered scenarios where project staff or computing have been reallocated to higher priority projects. It can be super challenging to get projects done on time with this kind of uncertainty.
That’s especially true for Site Reliability Engineering (SRE) teams. Project management principles can help, but in IT, many project management frameworks are directed at teams that have a single focus, such as a software-development team.
That’s not true for SRE teams at Google. They are charged with delivering infrastructure projects as well as their primary role: supporting production. Broadly speaking, SRE time is divided in half between supporting production environments and focusing on product.
In a recent endeavor, our SRE team took on a project to regionalize our infrastructure to enhance the reliability, security, and compliance of our cloud services. The project was allocated a well-defined timeline, driven by our commitments to our customers and adherence to local regulations. As the technical program manager (TPM), I decomposed the overarching goal into smaller milestones and communicated to the leadership team to ensure they remained abreast of the progress.
However, throughout the execution phase of the project, we encountered a multitude of unrelated production incidents — the Spanner queue was growing long, and the accumulation of messages led to increased compilation times for our developer builds; this in turn led to bad builds rolling out. On top of this, asynchronous tasks were not completing as expected. When the bad build was rolled back, all of the backlogged async tasks fired at once. Due to these unforeseen challenges, some engineers were temporarily reassigned from the regionalization project to handle operational constraints associated with production infrastructure. No surprise, the change in staff allocation towards production incidents resulted in the project work being delayed.
Teams that manage production services, like SRE, have many ways to solve tough problems. The secret is to choose the solution that gets the job done the fastest and with the least amount of red tape for engineers to deal with.
In our organization, we’ve started taking a proactive approach to problem-solving by incorporating enhanced planning at the project's inception. As a TPM, my biggest trick to ensuring projects are finished on time is keeping some engineering hours in reserve and planning carefully when the project should start.
How many resources should you hold back, exactly? We did a deep dive into our past production issues and how we've been using our resources. Based on this, when planning SRE projects, we set aside 25% of our time for production work. Of course, this 25% buffer number will differ across organizations, but this new approach, which takes into account our critical business needs, has been a game-changer for us in making sure our projects are delivered on time, while ensuring that SREs can still focus on production incidents — our top priority for the business.
In a nutshell, planning for SRE projects is different from planning for projects in development organizations, because development organizations spend the lion’s share of their time working on projects. Luckily, SRE Program Management is really good at handling complicated situations, especially big programs.
Beyond holding back resources, here are few other best practices and structures that TPMs employ when planning SRE projects:
Ensuring that critical programs are staffed for success
Providing opportunities for TPMs to work across services, cross pollinating with standardized solutions and avoiding duplication of work
Providing more education to Site Reliability Managers and SREs on the value of early TPM engagement and encourage services to surface problem statements earlier
Leveraging the skills of TPMs to manage external dependencies and interface with other partner organizations such as Engineering, Infrastructure Change Management, and Technical Infrastructure
Providing coverage at times of need for services with otherwise low program management demands
Enabling consistent performance evaluation and provide opportunities for career development for the TPM community
The TPM role within SRE is at the heart of fulfilling SRE’s mission: making workflows faster, more reliable, and preparing for the continued growth of Google's infrastructure. As a TPM, you need to ensure that systems and services are carefully planned and deployed, taking into account multiple variables such as price, availability, and scheduling, while always keeping the bigger picture in mind. To learn more about project management for TPMs and related roles, consider enrolling in this course, and check out the following resources:
Who is supposed to manage generative AI applications? While AI-related ownership often lands with data teams, we're seeing requirements specific to generative AI applications that have distinct differences from those of a data and AI team, and at times more similarities with a DevOps team. This blog post explores these similarities and differences, and considers the need for a new ‘GenOps’ team to cater for the unique characteristics of generative AI applications.
In contrast to data science which is about creating models from data, Generative AI relates to creating AI enabled services from models and is concerned with the integration of pre-existing data, models and APIs. When viewed this way, Generative AI can feel similar to a traditional microservices environment: multiple discrete, decoupled and interoperable services consumed via APIs. And if there are similarities with the landscape, then it is logical that they share common operational requirements. So what practices can we take from the world of microservices and DevOps and bring to the new world of GenOps?
How do the operational requirements of a generative AI application differ from other applications? With traditional applications, the unit of operationalisation is the microservice. A discrete, functional unit of code, packaged up into a container and deployed into a container-native runtime such as kubernetes. For generative AI applications, the comparative unit is the generative AI agent: also a discrete, functional unit of code defined to handle a specific task, but with some additional constituent components that make it more than ‘just’ a microservice and add in its key differentiating behavior of being non-deterministic in terms of both its processing and its output:
Reasoning loop - The control logic defining what the agent does and how it works. It often includes iterative logic or thought chains to break down an initial task into a series of model-powered steps that work towards the completion of a task.
Model definitions - One or a set of defined access patterns for communicating with models, readable and usable by the Reasoning Loop
Tool definitions - a set of defined access patterns for other services external to the agent, such as other agents, data access (RAG) flows, and external APIs. These should be shared across agents, exposed through APIs and hence a Tool definition will take the form of a machine-readable standard such as an OpenAPI specification.
Logical components of a generative AI agent
The Reasoning Loop is essentially the full scope of a microservice, and the model and Tool definitions are its additional powers that make it into something more. Importantly, although the Reasoning Loop logic is just code and therefore deterministic in nature, it is driven by the responses from non-deterministic AI models, and this non-deterministic nature is what provides the need for the Tool, as the agent ‘chooses for itself’ which external service should be used to fulfill a task. A fully deterministic microservice has no need for this ‘cookbook’ of Tools for it to select from: Its calls to external services are pre-determined and hard coded into the Reasoning Loop.
However there are still many similarities. Just like a microservice, an agent:
Is a discrete unit of function that should be shared across multiple apps/users/teams in a multi-tenancy pattern
Has a lot of flexibility with development approaches, a wide range of software languages are available to use, and any one agent can be built in a different way to another.
Has very low inter-dependency from one agent to another: development lifecycles are decoupled with independent CI/CD pipelines for each. The upgrade of one agent should not affect another agent.
Another important difference is service-discovery. This is a solved-problem in the world of microservices where the impracticalities for microservices to track the availability, whereabouts and networking considerations for communicating with each other were taken out of the microservice itself and handled by packaging the microservices into containers and deploying these into a common platform layer of kubernetes and Istio. With Generative AI agents, this consolidation onto a standard deployment unit has not yet happened. There are a range of ways to build and deploy a generative AI agent, from code-first DIY approaches through to no-code managed agent builder environments. I am not against these tools in principle, however they are creating a more heterogeneous deployment landscape than what we have today with microservices applications and I expect this will create future operational complexities.
To deal with this, at least for now, we need to move away from the Point-to-Point model seen in microservices and adopt a Hub-and-Spoke model, where the discoverability of agents, Tools and models is done via the publication of APIs onto an API Gateway that provides a consistent abstraction layer above this inconsistent landscape.
This brings the additional benefit of clear separation of responsibilities between the apps and agents built by development teams, and Generative AI specific components such as models and Tools:
Separating responsibilities with an API Gateway
All operational platforms should create a clear point of separation between the roles and responsibilities of app and microservice development teams from the responsibilities of the operational teams. With microservice based applications, responsibilities are handed over at the point of deployment, and focus switches to non-functional requirements such as reliability, scalability, infrastructure efficiency, networking and security.
Many of these requirements are still just as important for a generative AI app, and I believe there are some additional considerations specific to generative agents and apps which require specific operational tooling:
1. Model compliance and approval controls
There are a lot of models out there. Some are open-source, some are licensed. Some provide intellectual property indemnity, some do not. All have specific and complex usage terms that have large potential ramifications but take time and the right skillset to fully understand.
It’s not reasonable or appropriate to expect our developers to have the time or knowledge to factor in these considerations during model selection. Instead, an organization should have a separate model review and approval process to determine whether usage terms are acceptable for further use, owned by legal and compliance teams, supported on a technical level by clear, governable and auditable approval/denial processes that cascade down into development environments.
2. Prompt version management
Prompts need to be optimized for each model. Do we want our app teams focusing on prompt optimization, or on building great apps? Prompt management is a non-functional component and should be taken out of the app source code and managed centrally where they can be optimized, periodically evaluated, and reused across apps and agents.
3. Model (and prompt) evaluation
Just like an MLOps platform, there is clearly a need for ongoing assessments of model response quality to enable a data-driven approach to evaluating and selecting the most optimal models for a particular use-case. The key difference with Gen AI models being the assessment is inherently more qualitative compared to the quantitative analysis of skew or drift detection of a traditional ML model.
Subjective, qualitative assessments performed by humans are clearly not scalable, and introduce inconsistency when performed by multiple people. Instead, we need consistent automated pipelines powered by AI evaluators, which although imperfect, will provide consistency in the assessments and a baseline to compare models against each other.
4. Model security gateway
The single most common operational feature I hear large enterprises investing time into is a security proxy for safety checks before passing a prompt on to a model (as well as the reverse: a check against the generated response before passing back to the client).
Common considerations:
Prompt Injection attacks and other threats captured by OWASP Top 10 for LLMs
Harmful / unethical prompts
Customer PII or other data requiring redaction prior to sending on to the model and other downstream systems
Some models have built in security controls; however this creates inconsistency and increased complexity. Instead a model agnostic security endpoint abstracted above all models is required to create consistency and allow for easier model switching.
5. Centralized Tool management
Finally, the Tools available to the agent should be abstracted out from the agent to allow for reuse and centralized governance. This is the right separation of responsibilities especially when involving data retrieval patterns where access to data needs to be controlled.
RAG patterns have the potential to become numerous and complex, as well as in practice not being particularly robust or well maintained with the potential of causing significant technical debt, so central control is important to keep data access patterns as clean and visible as possible.
Outside of these specific considerations, a prerequisite already discussed is the need for the API Gateway itself to create consistency and abstraction above these Generative AI specific services. When used to their fullest, API Gateways can act as much more than simply an API Endpoint but can be a coordination and packaging point for a series of interim API calls and logic, security features and usage monitoring.
For example, a published API for sending a request to a model can be the starting point for a multi-step process:
Retrieving and ‘hydrating’ the optimal prompt template for that use case and model
Running security checks through the model safety service
Sending the request to the model
Persisting prompt, response and other information for use in operational processes such as model and prompt evaluation pipelines.
Key components of a GenOps platform
For each of the considerations above, Google Cloud provides unique and differentiating managed services offerings to support with evaluating, deploying, securing and upgrading Generative AI applications and agents:
As for the API Gateway, Google Cloud’s Apigee allows for publishing and exposure of models and Tools as API Proxies which can encompass multiple downstream API calls, as well as include conditional logic, reties, and tooling for security, usage monitoring and cross charging.
GenOps with Google Cloud
Regardless of size, for any organization to be successful with generative AI, they will need to ensure their generative AI application’s unique characteristics and requirements are well managed, and hence an operational platform engineered to cater for these characteristics and requirements is clearly required. I hope the points discussed in this blog make for helpful consideration as we all navigate through this exciting and highly impactful new era of technology.
If you are interested in learning more, reach out to your Google Cloud account team if you have one, or feel free to contact me directly.
The Terraform Google Provider v6.0.0 is now GA. Since the last major Terraform provider release in September 2023, the combined Hashicorp/Google provider team has been listening closely to the community's feedback. Discussed below are the primary enhancements and bug fixes that this major release focuses on. Support for earlier versions of HashiCorp Terraform will not change as a result of the major version release v6.0.0.
The key notable changes are as follows:
Opt-out default label “goog-terraform-provisioned”
Deletion protection fields added to multiple resources
Allowed reducing the suffix length in “name_prefix” for multiple resources
As a follow-up to the addition of provider-level default labels in 5.16.0, the 6.0.0 major release includes an opt-out default label “goog-terraform-provisioned”. This provider-level label “goog-terraform-provisioned” will be added to applicable resources to identify resources that were created by Terraform. This default label will only apply for newly created resources with a labels field. This will enable users to have a view of resources managed by Terraform when viewing/editing these resources in other tools like Cloud Console, Cloud Billing etc.
The label “goog-terraform-provisioned” can be used for the following:
To filter on the Billing Reports page:
The label can also be used with BigQuery export.
Please note that an opt-in version of the label was already released in 5.16.0, and 6.0.0 will change the label to opt-out. To opt-out of this default label, the users may toggle the add_terraform_attribution_label provider configuration field. This can be set explicitly using any release from 5.16.0 onwards and the value in configuration will apply after the 6.0.0 upgrade.
In order to prevent the accidental deletion of important resources, many resources now have a form of deletion protection enabled by default. These resources include google_domain, google_cloud_run_v2_job, google_cloud_run_v2_service, google_folder and google_project. Most of these are enabled by the deletion_protection field. google_project specifically has a deletion_policy field which is set to PREVENT by default.
Another notable issue resolved in this major release is, “Allow reducing the suffix length appended to instance templates name_prefix (#15374 ),” which changes the default behavior for name_prefix in multiple resources. The max length of the user-defined name_prefix has increased from 37 characters to 54. The provider will use a shorter appended suffix when using a name_prefix longer than 37 characters, which should allow for more flexible resource names. For example, google_instance_template.name_prefix.
With features like opt-out default labels and deletion protection, this version enables users to have a view of resources managed by Terraform in other tools and also prevents accidental deletion of important resources. The Terraform Google Provider 6.0.0 launch aims to improve the usability and safety of Terraform for managing Google Cloud resources on Google Cloud. When upgrading to version 6.0 of the Terraform Google Provider, please consult the upgrade guide on the Terraform Registry, which contains a full list of the changes and upgrade considerations. Please check out the Release notes for Terraform Google Provider 6.0.0 for more details on this major version release. Learn more about Terraform on Google Cloud in the Terraform on Google Cloud documentation.
Hakuhodo Technologies, a specialized technology company of the Hakuhodo DY Group — one of Japan’s leading advertising and media holding companies — is dedicated to enhancing our software development process to deliver new value and experiences to society and consumers through the integration of marketing and technology.
Our IT Infrastructure Team at Hakuhodo Technologies operates cross-functionally, ensuring the stable operation of the public cloud that supports the diverse services within the Hakuhodo DY Group. We also provide expertise and operational support for public cloud initiatives.
Our value is to excel in the cloud and infrastructure domain, exhibiting a strong sense of ownership, and embracing the challenge of creating new value.
The infrastructure team is tasked with developing and operating the application infrastructure tailored to each internal organization and service, in addition to managing shared infrastructure resources.
Following the principles of platform engineering and site reliability engineering (SRE), each team within the organization has adopted elements of SRE, including the implementation of post-mortems and the development of observability mechanisms. However, we encountered two primary challenges:
As the infrastructure expanded, the number of people on the team grew rapidly, bringing in new members from diverse backgrounds. This made it necessary to clarify and standardize tasks, and provide a collective understanding of our current situation and alignment on our goals.
We mainly communicate with the app team through a ticket-based system. In addition to expanding our workforce, we have also introduced remote working. As a result, team members may not be as well-acquainted as before. This lack of familiarity could potentially cause small misunderstandings that can escalate quickly.
As our systems and organization expand, we believe that strengthening common understanding and cooperative relationships within the infrastructure team and the application team is essential for sustainable business growth. This has become a core element of our strategy.
We believe that fostering an SRE mindset among both infrastructure and application team members and creating a culture based on that common understanding is essential to solving the issues above. To achieve this, we decided to implement the "SRE Core" program by Google Cloud Consulting, which serves as the first step in adopting SRE practices.
First, through the "SRE Core" program, we revitalized communication between the application and infrastructure teams, which had previously had limited interaction. For example, some aspects of the program required information that was challenging for infrastructure members to gather and understand on their own, making cooperation with the application team essential.
Our critical user journey (CUJ), one of the SRE metrics, was established based on the business requirements of the app and the behavior of actual users. This information is typically managed by the app team, which frequently communicates with the business side. This time, we collaborated with the application team to create a CUJ, set service level indicators (SLIs) and service level objectives (SLOs) which included error budgets, performed risk analysis, and designed the necessary elements for SRE.
This collaborative work and shared understanding served as a starting point. As we continued to build a closer working relationship even after the program ended, with infrastructure members also participating in sprint meetings that had previously been held only for the app team.
Additionally, as an infrastructure team, we systematically learned when and why SRE activities are necessary, allowing us to reflect on and strengthen our SRE efforts that had been partially implemented.
For example, I recently understood that the purpose of postmortems is not only to prevent the recurrence of incidents but also to gain insights from the differences in perspectives between team members. Learning the purpose of postmortems changed our team’s mindset. We now practice immediate improvement activities, such as formalizing the postmortem process, clarifying the creation of tickets for action items, and sharing postmortem minutes with the app team, which were previously kept internal.
We also reaffirmed the importance of observability to consistently review and improve our current system. Regular meetings between the infrastructure and application teams allow us to jointly check metrics, which in turn helps maintain application performance and prevent potential issues.
By elevating our previous partial SRE activities and integrating individual initiatives, the infrastructure team created an organizational activity cycle that has earned more trust. This enhanced cycle is now getting integrated into our original operational workflows.
With the experience gained through the SRE Core program, the infrastructure team looks forward to expanding collaboration with application and business teams and increasing proactive activities. Currently, we are starting with collaborations on select applications, but we aim to use these success stories to broaden similar initiatives across the organization.
It is important to remember that each app has different team members, business partners, environments, and cultures, so SRE activities must be tailored to each unique situation. We aim to harmonize and apply the content learned in this program with the understanding that SRE activities are not the goal, but are elements that support the goals of the apps and the business.
Additionally, our company has a Cloud Center of Excellence (CCoE) team dedicated to cross-organizational activities. The CCoE manages a portal site for company-wide information dissemination and a community platform for developers to connect. We plan to share the insights we've gained through these channels with other respective teams within our group companies. As the CCoE's internal activities mature, we also intend to share our knowledge and experiences externally.
Through these initiatives, we hope to continue our activities with the hope that internal members — beyond the CCoE and infrastructure organizations — take psychological safety into consideration during discussions and actions.
At our company, we have a diverse workforce with varying years of experience and perspectives. We believe that ensuring psychological safety is essential for achieving high performance.
When psychological safety is lacking, for instance, if the person delivering bad news is blamed, reports tend to become superficial and do not lead to substantive discussions.
This issue can also arise from psychological barriers, such as the omission of tasks known only to experienced employees, leading to problems caused by the fear of asking for clarification.
In a situation where psychological safety is ensured, we focus on systems rather than individuals, viewing problems as opportunities. For example, if errors occur due to manual work, the manual process itself is seen as the issue. Similarly, if a system failure with no prior similar case arises, it is considered an opportunity to gain new knowledge.
By adopting this mindset, fear is removed from the equation, allowing for unbiased discussions and work.
This allows every employee to perform at their best, regardless of their years of experience. Of course, this is not something that can be achieved through a single person. It will require a whole team or organization to recognize this to make it a reality.