Pipes Feed Preview: Towards Data Science & The New Stack & DevOps & SRE & DevOps.com & Google DeepMind Blog

  1. What Being a Data Scientist at a Startup Really Looks Like

    Wed, 03 Sep 2025 14:00:00 -0000

    What I learned about growth, visibility, and chaos over the past five years

    The post What Being a Data Scientist at a Startup Really Looks Like appeared first on Towards Data Science.

  2. A Deep Dive into RabbitMQ & Python’s Celery: How to Optimise Your Queues

    Wed, 03 Sep 2025 04:35:15 -0000

    Key lessons I’ve learned running RabbitMQ + Celery in production

    The post A Deep Dive into RabbitMQ & Python’s Celery: How to Optimise Your Queues appeared first on Towards Data Science.

  3. Implementing the Caesar Cipher in Python

    Tue, 02 Sep 2025 21:32:57 -0000

    Julius Caesar was a Roman ruler known for his military strategies and excellent leadership. Named after him, the Caesar Cipher is a fascinating cryptographic technique that Julius Caesar employed to send secret signals and messages to his military personnel. The Caesar Cipher is quite basic in its working. It works by shifting all the letters […]
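
    As a minimal illustration of the technique the article describes (the function name and the classic 3-letter shift below are our own example, not the article's code):

      def caesar_encrypt(text: str, shift: int) -> str:
          # Shift alphabetic characters by `shift` positions, wrapping around
          # the alphabet; leave spaces, digits and punctuation untouched.
          out = []
          for ch in text:
              if ch.isalpha():
                  base = ord("A") if ch.isupper() else ord("a")
                  out.append(chr((ord(ch) - base + shift) % 26 + base))
              else:
                  out.append(ch)
          return "".join(out)

      print(caesar_encrypt("Attack at dawn", 3))  # Dwwdfn dw gdzq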

    The post Implementing the Caesar Cipher in Python appeared first on Towards Data Science.

  4. How to Scale Your AI Search to Handle 10M Queries with 5 Powerful Techniques

    Tue, 02 Sep 2025 19:46:01 -0000

    Optimize your AI search with RAG, contextual retrieval and evaluations

    The post How to Scale Your AI Search to Handle 10M Queries with 5 Powerful Techniques appeared first on Towards Data Science.

  5. What is Universality in LLMs? How to Find Universal Neurons

    Tue, 02 Sep 2025 19:29:23 -0000

    How independently trained transformers form the same neurons

    The post What is Universality in LLMs? How to Find Universal Neurons appeared first on Towards Data Science.

  6. 3 Greedy Algorithms for Decision Trees, Explained with Examples

    Tue, 02 Sep 2025 19:21:14 -0000

    Learn the inner workings of decision trees

    The post 3 Greedy Algorithms for Decision Trees, Explained with Examples appeared first on Towards Data Science.

  7. Writing Is Thinking

    Tue, 02 Sep 2025 15:05:35 -0000

    Egor Howell on breaking into ML without a CS degree, surviving 80+ interviews, and what to do if you feel stuck in your career.

    The post Writing Is Thinking appeared first on Towards Data Science.

  8. The Generalist: The New All-Around Type of Data Professional?

    Mon, 01 Sep 2025 12:00:00 -0000

    Is over-specialization ending and are data generalists on the rise?

    The post The Generalist: The New All-Around Type of Data Professional? appeared first on Towards Data Science.

  9. How to Develop a Bilingual Voice Assistant

    Sun, 31 Aug 2025 16:00:00 -0000

    Exploring ways to make voice assistants more personal

    The post How to Develop a Bilingual Voice Assistant appeared first on Towards Data Science.

  10. The Machine Learning Lessons I’ve Learned This Month

    Sun, 31 Aug 2025 14:00:00 -0000

    August 2025: logging, lab notebooks, overnight runs

    The post The Machine Learning Lessons I’ve Learned This Month appeared first on Towards Data Science.

  11. Understanding Matrices | Part 4: Matrix Inverse

    Sat, 30 Aug 2025 16:00:00 -0000

    The physical meaning of matrix inversion, related formulas, and how inversion behaves on several special types of matrices.

    The post Understanding Matrices | Part 4: Matrix Inverse appeared first on Towards Data Science.

  12. Crafting a Custom Voice Assistant with Perplexity

    Sat, 30 Aug 2025 14:00:00 -0000

    How to build a fully functional, hands-free voice assistant on a Raspberry Pi

    The post Crafting a Custom Voice Assistant with Perplexity appeared first on Towards Data Science.

  13. Marginal Effect of Hyperparameter Tuning with XGBoost

    Fri, 29 Aug 2025 16:00:00 -0000

    Demystifying Bayesian hyperparameter optimization and comparing hyperparameter tuning paradigms

    The post Marginal Effect of Hyperparameter Tuning with XGBoost appeared first on Towards Data Science.

  14. Toward Digital Well-Being: Using Generative AI to Detect and Mitigate Bias in Social Networks

    Fri, 29 Aug 2025 14:45:00 -0000

    This research answered the question: How can machine learning and artificial intelligence help us to unlearn bias?

    The post Toward Digital Well-Being: Using Generative AI to Detect and Mitigate Bias in Social Networks appeared first on Towards Data Science.

  15. Unlocking Multimodal Video Transcription with Gemini

    Fri, 29 Aug 2025 13:30:00 -0000

    Explore how to transcribe videos with speaker identification in a single prompt

    The post Unlocking Multimodal Video Transcription with Gemini appeared first on Towards Data Science.

  16. How to Import Pre-Annotated Data into Label Studio and Run the Full Stack with Docker

    Fri, 29 Aug 2025 12:30:00 -0000

    From VOC to JSON: Importing pre-annotations made simple

    The post How to Import Pre-Annotated Data into Label Studio and Run the Full Stack with Docker appeared first on Towards Data Science.

  17. Implementing the Hangman Game in Python

    Thu, 28 Aug 2025 18:00:00 -0000

    A beginner-friendly project to understand variables, loops, and conditions in Python

    The post Implementing the Hangman Game in Python appeared first on Towards Data Science.

  18. Stepwise Selection Made Simple: Improve Your Regression Models in Python

    Thu, 28 Aug 2025 15:30:00 -0000

    Dimensionality reduction in linear regression: classical stepwise methods and a Python application on real-world data

    The post Stepwise Selection Made Simple: Improve Your Regression Models in Python appeared first on Towards Data Science.

  19. Graph Coloring for Data Science: A Comprehensive Guide

    Thu, 28 Aug 2025 14:15:00 -0000

    From theoretical puzzles to practical applications

    The post Graph Coloring for Data Science: A Comprehensive Guide appeared first on Towards Data Science.

  20. A Visual Guide to Tuning Decision-Tree Hyperparameters

    Thu, 28 Aug 2025 13:05:00 -0000

    How hyperparameter tuning visually changes decision trees

    The post A Visual Guide to Tuning Decision-Tree Hyperparameters appeared first on Towards Data Science.

  21. How to Make OpenTelemetry Better in the Browser

    Wed, 03 Sep 2025 17:00:46 -0000

    This is the second of two parts. Read Part 1: Why OpenTelemetry Is So Clunky for the Frontend In Part

    The post How to Make OpenTelemetry Better in the Browser appeared first on The New Stack.

    Addressing the OpenTelemetry API's design might make instrumentation more ergonomic, but improving the API alone isn't enough.
  22. Why Developers Don’t Know What Dev Ex Is

    Wed, 03 Sep 2025 16:00:58 -0000

    I first heard the term “developer experience” in late 2023. Everything I had been doing “by intuition” for years suddenly

    The post Why Developers Don’t Know What Dev Ex Is appeared first on The New Stack.

    Bottom-up developer experience turns silent frustrations into clear arguments for change.
  23. How To Enable Platform Engineering That Developers Love

    Wed, 03 Sep 2025 15:00:11 -0000

    Platform engineering is a foundational strategy for scaling developer productivity, improving software quality, and unifying fragmented tooling. But there’s a

    The post How To Enable Platform Engineering That Developers Love appeared first on The New Stack.

    Unpack the critical data, features, and adoption strategies that drive true platform success for developers.
  24. What Is Open Source AI Anyway?

    Wed, 03 Sep 2025 14:00:56 -0000

    AMSTERDAM — Last October, the Open Source Initiative (OSI) published its definition of what it would take for an AI

    The post What Is Open Source AI Anyway? appeared first on The New Stack.

    The discussion about when a large language model can be considered open source continues. We talked to the OSI about its definition.
  25. Pods Is a Handy Linux GUI for Managing Your Podman Containers

    Wed, 03 Sep 2025 13:00:05 -0000

    If you’ve used Linux as your container development environment, and your distribution of choice is based on Fedora, then you’ve

    The post Pods Is a Handy Linux GUI for Managing Your Podman Containers appeared first on The New Stack.

    Pods is a Linux GUI that can be installed and used for free, on any distribution that supports both Podman and Flatpak.
  26. Why Tech Giants Are Backing the New Agentgateway Project

    Tue, 02 Sep 2025 21:00:02 -0000

    A new open source project aims to accelerate the emergence of AI agents using multiple large language models (LLMs). At

    The post Why Tech Giants Are Backing the New Agentgateway Project appeared first on The New Stack.

    The open source, AI-native proxy, created by Solo.io to optimize connectivity, security and observability for AI agentic software, has joined the Linux Foundation.
  27. The Linux Foundation in the Age of AI

    Tue, 02 Sep 2025 19:00:41 -0000

    Jim Zemlin, the executive director of the Linux Foundation, joined us on the showfloor of the Open Source Summit in Amsterdam to talk about the state of open source AI, the role of the Linux Foundation in keeping AI models and tooling open, emerging standards and more.

    AMSTERDAM — This week, the New Stack Agents is all about open source AI. Jim Zemlin, the executive director of

    The post The Linux Foundation in the Age of AI appeared first on The New Stack.

    The New Stack Agents met with Jim Zemlin, the executive director of the Linux Foundation, at the Open Source Summit in Amsterdam to talk about the state of open source AI.
  28. Vibe Coding Python: Testing Copilot vs. CodeGPT vs. Tabnine

    Tue, 02 Sep 2025 18:00:39 -0000

    I put three vibe coding tools to the test. Not with the goal of finding the best one, just to

    The post Vibe Coding Python: Testing Copilot vs. CodeGPT vs. Tabnine appeared first on The New Stack.

    This tutorial tests GitHub Copilot, CodeGPT, and Tabnine by vibe coding the same Python calculator app with each tool.
  29. Cloudflare’s Balancing Act: Protect Content While Pushing AI

    Tue, 02 Sep 2025 15:00:51 -0000

    This year Cloudflare has been outspoken in its support of the web’s content creators in the AI era, who have

    The post Cloudflare’s Balancing Act: Protect Content While Pushing AI appeared first on The New Stack.

    Cloudflare announces an implementation of NLWeb, an open protocol for AI chat on websites. Also, an update on its creator compensation drive.
  30. AI Combined With Agile Lets Developers Focus on Craft

    Tue, 02 Sep 2025 14:00:24 -0000

    Speed isn’t the end-all, be-all for agile programming, but it certainly doesn’t hurt. AI can be a resource for generating

    The post AI Combined With Agile Lets Developers Focus on Craft appeared first on The New Stack.

    Can AI help you achieve the coveted flow state? Shannon Mason of Tempo Software explains how AI can help with agile development.
  31. XSLT Debate Leads to Bigger Questions of Web Governance

    Mon, 01 Sep 2025 15:00:46 -0000

    XSLT (eXtensible Stylesheet Language Transformations) is a powerful tool with significant potential, but it’s currently poorly and sometimes insecurely implemented

    The post XSLT Debate Leads to Bigger Questions of Web Governance appeared first on The New Stack.

    Security issues and web compatibility are pulling in different directions, as Google and Firefox discuss dropping XSLT support from browsers.
  32. Containerized Apps for Your Home Network

    Mon, 01 Sep 2025 14:00:28 -0000

    If you run a home network lab or just like to make your daily routine considerably more effective, I would

    The post Containerized Apps for Your Home Network appeared first on The New Stack.

    Six containerized applications that can be used to build a foundational home network lab to improve efficiency and privacy.
  33. Unix Co-Creator Brian Kernighan on Rust, Distros and NixOS

    Sun, 31 Aug 2025 13:00:11 -0000

    “I’m still teaching at Princeton,” 83-year-old Brian Kernighan recently told an audience at the InfoAge Science and History Museums in

    The post Unix Co-Creator Brian Kernighan on Rust, Distros and NixOS appeared first on The New Stack.

    Kernighan shared his thoughts on what he thinks of the world today — with its push away from C to more memory-safe programming languages, its hundreds of distributions of Linux — and with descendants of Unix powering nearly every cellphone.
  34. Why AI Alone Fails at Large-Scale Code Modernization

    Sat, 30 Aug 2025 16:00:55 -0000

    Modern software teams face a double bind: deliver business value faster while maintaining aging, complex codebases. For executives, delivery speed

    The post Why AI Alone Fails at Large-Scale Code Modernization appeared first on The New Stack.

    AI alone can't handle large-scale code modernization. Discover why it falls short and how pairing it with deterministic automation is key to safe, scalable change.
  35. How To Build an App With Enhance, a Backend-First Framework

    Sat, 30 Aug 2025 14:00:40 -0000

    At one point, the frontend framework landscape had three big players: React, Vue and Angular. Knowing all three was basically

    The post How To Build an App With Enhance, a Backend-First Framework appeared first on The New Stack.

    If you’re looking for rapid prototyping or you're doing a backend-first small project, Enhance is a solid web framework choice.
  36. Flutter 3.35 Offers New Developer Experience Features

    Sat, 30 Aug 2025 13:00:51 -0000

    Flutter 3.35 was released on Aug. 16 with updates that include the stable release of stateful hot reload on the

    The post Flutter 3.35 Offers New Developer Experience Features appeared first on The New Stack.

    Also: Google will require Android devs to be verified, and DigitalOcean's MCP server that allows devs to manage cloud resources with AI.
  37. Is Your Talent the Bottleneck to GenAI Success?

    Fri, 29 Aug 2025 17:00:10 -0000

    Generative AI (GenAI) is redefining how enterprises use the cloud. From automating infrastructure to advancing security measures, GenAI is helping

    The post Is Your Talent the Bottleneck to GenAI Success? appeared first on The New Stack.

    GenAI is transforming the cloud, but technology alone isn't enough. Discover why talent and employee upskilling are the real drivers of innovation and ROI.
  38. Why AI Search Platforms Are Gaining Attention

    Fri, 29 Aug 2025 14:00:19 -0000

    A few years ago, my daughter told me that her school research project was so deep she had to venture

    The post Why AI Search Platforms Are Gaining Attention appeared first on The New Stack.

    Users expect search not just to return accurate results, but to do the heavy lifting: Answer a question, summarize research, or even solve a problem.
  39. How Webflow Got 89% of Its Engineers To Use AI Daily

    Fri, 29 Aug 2025 13:19:05 -0000

    The push to integrate AI technologies into IT departments is intense, but it must be even more so for companies

    The post How Webflow Got 89% of Its Engineers To Use AI Daily appeared first on The New Stack.

    We talk to Webflow's CTO about how the company has gone all-in on AI for its 300 engineers, including offering each one an AI toolkit.
  40. The Cyber Resilience Act: Fear, Confusion — And Reassurance

    Thu, 28 Aug 2025 21:00:17 -0000

    AMSTERDAM — If your organization generates revenue from open source software, whether directly or indirectly, you should at least be

    The post The Cyber Resilience Act: Fear, Confusion — And Reassurance appeared first on The New Stack.

    The EU's CRA takes effect in 2027. But with just months to prepare and rules still being written, nobody knows exactly what compliance looks like yet.
  41. Creating an Immutable ‘Family Tree’ for AI Training Data

    Thu, 28 Aug 2025 19:00:01 -0000

    The race for every company to embrace AI has them combining myriad and often inconsistently documented datasets in their training,

    The post Creating an Immutable ‘Family Tree’ for AI Training Data appeared first on The New Stack.

    From secure chips to blockchain “billboard,” organizations can track and verify the history of even complex, mashed-together data sets.
  42. Building Real Enterprise AI Agents With Apache Flink

    Thu, 28 Aug 2025 18:00:31 -0000

    The current focus on AI chatbots overlooks the real opportunity for businesses: building autonomous agents. We want AI systems that

    The post Building Real Enterprise AI Agents With Apache Flink appeared first on The New Stack.

    Stateful stream processing is the necessary foundation, and Apache Flink provides a robust, low-latency engine to bring these autonomous agents to life.
  43. Guido van Rossum Revisits Python’s Life in a New Documentary

    Thu, 28 Aug 2025 17:30:51 -0000

    “It was a massive undertaking,” said filmmaker Ida Bechtle. Her new 84-minute documentary on Python attempts to cover 34 years

    The post Guido van Rossum Revisits Python’s Life in a New Documentary appeared first on The New Stack.

    Ida Bechtle's much-anticipated documentary on Python has arrived.
  44. Going for Silver: Making the Most of Tiered Observability

    Thu, 28 Aug 2025 17:00:24 -0000

    Just a few years ago, many enterprises were working toward consolidating log data into a single observability platform. By bringing

    The post Going for Silver: Making the Most of Tiered Observability appeared first on The New Stack.

    To save money, enterprises are sending just a small fraction of their observability data to premium platforms and the rest to more affordable solutions.
  45. Install Cursor and Learn Programming With AI Help

    Thu, 28 Aug 2025 16:00:10 -0000

    I’m not a big fan of using AI as a shortcut. On the other hand, I’m perfectly OK using it

    The post Install Cursor and Learn Programming With AI Help appeared first on The New Stack.

    This tutorial shows how to make Cursor a part of your workflow and learn how this new IDE sets the standard for AI-powered programming tools.
  46. Is Your Data Strategy Ready for the Agentic AI Era?

    Thu, 28 Aug 2025 15:00:15 -0000

    Raj Verma, CEO of SingleStore, predicts on The New Stack Makers that AI will evolve in three phases: first, the easy tasks will be automated; next, complex tasks will become easier; and finally, the seemingly impossible will become achievable, likely within three years.

    It may seem like all organizations are in the AI business right now, but enterprises are just getting started, according

    The post Is Your Data Strategy Ready for the Agentic AI Era? appeared first on The New Stack.

    AI is only just beginning to transform the enterprise — and most organizations' data infrastructure can't handle what's ahead, said SingleStore's CEO in this episode of Makers.
  47. Cloud Solutions Architect Manager, Google Cloud

    Wed, 13 Aug 2025 16:00:00 -0000

    What guides your approach to software development? In our roles at Google, we’re constantly working to build better software, faster. Within Google, our Developer Platform team and Google Cloud have a strategic partnership and a shared strategy: together, we take our internal capabilities and engineering tools and package them up for Google Cloud customers.

    At the heart of this is understanding the many ways that software teams, big and small, need to balance efficiency, quality, and cost, all while delivering value. In our recent talk at PlatformCon 2025, we shared key parts of our platform strategy, which we call “shift down.”

    Shift down is an approach that advocates for embedding decisions and responsibilities into underlying internal developer platforms (IDPs), thereby reducing the operational burden on developers. This contrasts with the DevOps trend of "shift left," which pushes more effort earlier into the development cycle, a method that is proving difficult at scale due to the sheer volume and rate of change in requirements. Our shift down strategy helps us maximize value with existing resources so businesses can achieve high innovation velocity with acceptable quality, acceptable risk, and sustainable costs across a diverse range of business models. In the talk, we share learnings that have been really helpful to us in our software and platform engineering journey:

    1. Work backwards from the business model: By starting with the business model, organizations can intentionally guide platform evolution and investment to align with desired margins, risk tolerance, and quality requirements. At Google, our central platform must support diverse business models, necessitating continuous strategic refinement and adaptation.
    2. Focus on quality attributes for central software control: Quality attributes, such as reliability, security, efficiency, and performance, are emergent properties of software systems and are important for creating business value and managing risk. These are often referred to as “non-functional requirements” because they define how our software behaves, not what it functionally does. With a shift down strategy, we can embed the responsibility for assuring quality attributes directly into the underlying platform systems and infrastructure, thereby significantly reducing the operational burden on individual developers.
    3. Abstractions and coupling are key technical tools to gain control of quality attributes: We define two key technical components in the way we build platforms: abstractions and coupling. In a shift down strategy, abstractions provide understandability, risk management levers, accountability, and cost control by encapsulating complexity. Coupling refers to the interconnectedness and interdependence of components within a system or development ecosystem. For a successful shift down strategy, the right degree of coupling is crucial because it allows the development platform and ecosystem design to directly implement and influence quality attributes. In fact, coupling is how we offer entire infrastructure and platform solutions as coherent services like Google Kubernetes Engine (GKE).
    4. Shared responsibility, education, and policy are equally important social tools: Shared responsibility is a crucial social tool within software at scale. This is actively cultivated through education, such as training engineers on platform and AI usage, and fostering a "one team" culture that encourages a shift from artifact-bound identities to overarching mission goals and client-focused engagement. Furthermore, explicit policies like centrally enforced style guides and secure-by-design APIs are fundamental for embedding quality attribute assurance directly into the platform and infrastructure, significantly reducing the operational burden on individual developers by ensuring consistency and automated controls at scale.
    5. Use a map. Supporting many business units with one platform is a vast and complex problem; we need a map. The ecosystem model is a framework that categorizes different types of software development environments, ranging from highly flexible, developer-controlled systems to highly opinionated, vertically integrated ones where the ecosystem itself assures quality attributes. Its critical purpose is to provide a visual and conceptual tool for evaluating how well our ecosystem controls match our business risk. This helps us ensure that the level of oversight and assurance of quality attributes aligns with the potential cost of mistakes. The goal is to be in the "ecosystem effectiveness zone," where controls are balanced to mitigate significant risks from human error without imposing overly restrictive systems that negatively impact velocity and developer satisfaction.

    6. Divide up the problem space by identifying different platform and ecosystem types.

    Because the developer experience and platform infrastructure change with scale and degree of shifting down, it’s not enough to just know where the ecosystem effectiveness zone is — you have to identify the ecosystem by type. We differentiate ecosystem types by the degree of oversight and assurance for quality attributes. As an ecosystem becomes more vertically integrated, such as Google's highly optimized "Assured" (Type 4) ecosystem, the platform itself assumes increasing responsibility for vital quality attributes, allowing specialists like site reliability engineers (SRE) and security teams to have full ownership in taking action through large-scale observability and embedded capabilities. Conversely, in less uniform "YOLO," "AdHoc," or "Guided" (Type 0-2) ecosystems, developers have more responsibility for assuring these attributes, while central specialist teams have less direct control and enforcement mechanisms are less pervasive. It’s really important to note here that this is not a maturity model — the best ecosystem and platform type is the one that best fits your business need (see point #1 above!).

    Intentional choices in platform engineering

    The most important takeaway is to make active choices. Tailor platform engineering for each business unit and application to achieve the best outcomes. Place critical emphasis on identifying and solving stable sub-problems in reliable, reusable ways across various business problems. This approach directly underpins our "shift down" strategy, moving toward composable platforms that embed decisions and responsibilities for software quality directly into the underlying platform infrastructure, thereby improving our ability to maximize business value with the right resources, at the right quality level, and with sustainable costs.

    Watch our full discussion for more insights on effective platform engineering.

  48. Product Manager

    Mon, 04 Aug 2025 16:00:00 -0000

    Application owners are looking for three things when they think about optimizing cloud costs:

    1. What are the most expensive resources?

    2. Which resources are costing me more this week or month?

    3. Which resources are poorly utilized?

    To help you answer these questions quickly and easily, we announced Cloud Hub Optimization and Cost Explorer, in private preview, at Google Cloud Next 2025. And today, we are excited to announce that both Cloud Hub Optimization and Cost Explorer are now in public preview.
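
    For comparison, answering the first of those questions by hand has typically meant querying your Cloud Billing export in BigQuery. Here is a minimal Python sketch, assuming the standard billing export to BigQuery is enabled; the table name is a placeholder:

      from google.cloud import bigquery

      client = bigquery.Client()
      # Placeholder: substitute your own billing export table.
      BILLING_TABLE = "my-project.billing_ds.gcp_billing_export_v1_XXXXXX"

      sql = f"""
      SELECT service.description AS service, SUM(cost) AS total_cost
      FROM `{BILLING_TABLE}`
      WHERE usage_start_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
      GROUP BY service
      ORDER BY total_cost DESC
      LIMIT 10
      """

      # Top spenders over the last 30 days, grouped by product/service.
      for row in client.query(sql).result():
          print(f"{row.service}: {row.total_cost:,.2f}")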

    Application cost and utilization

    As an app owner, your primary objective is keeping your application healthy at all times. Yet, monitoring all the individual components of your application, which may straddle dozens of Projects, can be quite overwhelming. App Hub applications allow you to reorganize your cloud resources around your application, giving you the information and controls you need at your fingertips.

    In addition to supporting Google Cloud Projects, Cloud Hub Optimization and Cost Explorer leverage App Hub applications to show you the cost-efficiency of your application’s workloads and services instantly. This is great, for instance, when you are trying to pinpoint deployments running on GKE clusters that might be wasting valuable resources, such as GPUs.

    Not just another cost dashboard

    When you bring up Cloud Hub Optimization, you can immediately see the resources that are costing you the most, along with the percentage change in their cost. With this highly granular cost information, you can now attribute your costs to specific resources and resource owners to reason about any changes in costs.

    We have additionally integrated granular cost data from Cloud Billing and resource utilization data from Cloud Monitoring to give you a comprehensive picture of your cost efficiency. This includes average vCPU utilization for your Project, which helps you find the most promising optimization candidates across hundreds of Google Cloud Projects.
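
    The same utilization signal can also be pulled programmatically. Here is a minimal sketch using the google-cloud-monitoring client library (the project ID is a placeholder), computing average vCPU utilization per Compute Engine instance over the last hour:

      import time

      from google.cloud import monitoring_v3

      PROJECT_ID = "your-project-id"  # placeholder

      client = monitoring_v3.MetricServiceClient()
      now = int(time.time())
      interval = monitoring_v3.TimeInterval(
          {"end_time": {"seconds": now}, "start_time": {"seconds": now - 3600}}
      )

      # vCPU utilization for Compute Engine instances is exposed as this metric.
      results = client.list_time_series(
          request={
              "name": f"projects/{PROJECT_ID}",
              "filter": 'metric.type = "compute.googleapis.com/instance/cpu/utilization"',
              "interval": interval,
              "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
          }
      )

      for series in results:
          points = list(series.points)
          if points:
              avg = sum(p.value.double_value for p in points) / len(points)
              print(series.resource.labels.get("instance_id"), f"avg vCPU: {avg:.1%}")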

    The Cost Explorer dashboard also shows you your costs logically organized at the product level, for even more cost explainability. Instead of seeing a lump sum cost for Compute Engine, you can now see your exact spend on individual products including Google Kubernetes Engine (GKE) clusters, Persistent Disks, Cloud Load Balancing, and more.

    Simple is powerful

    Customers who have tried these new tools love the information that is surfaced as well as the simplicity of the interfaces.

    “My team has to keep an eye on cloud costs across tens of business units and hundreds of developers. The Cloud Hub Optimization and Cost Explorer dashboards are a force multiplier for my team as they tell us where to look for cost savings and potential optimization opportunities.” - Frank Dice, Principal Cloud Architect, Major League Baseball

    Customers especially appreciate the breadth of product coverage available out of the box without any additional setup, and the fact that there is no additional charge for using these features.

    What’s next

    As your organization “shifts left” on cloud cost management, we are working to help application owners and developers understand and optimize their cloud costs. You can try Cloud Hub Optimization and Cost Explorer here.

    You can also see a live demo of how Cloud Hub Optimization and Cost Explorer can be used to identify underutilized GKE clusters within seconds in the Google Cloud Next 2025 talk Maximize Your Cloud ROI.


    Major League Baseball trademarks and copyrights are used with permission of Major League Baseball. Visit MLB.com.

  49. Senior Product Manager

    Fri, 01 Aug 2025 16:00:00 -0000

    Are you ready to unlock the power of Google Cloud and want guidance on how to set up your environment effectively? Whether you're a cloud novice or part of an experienced team looking to migrate critical workloads, getting your foundational infrastructure right is the key to success. That's where Google Cloud Setup comes in — your guided pathway to a secure cloud foundation and quick start on Google Cloud.

    Google Cloud Setup helps you quickly implement Google Cloud's recommended best practices. Our goal is to provide a fast and easy path to deploying your workloads without unnecessary configuration effort. Think of it as your expert guide, walking you through the essential first steps so you can focus on what truly matters: rapidly deploying your innovative applications and services. To help you get started without financial barriers, all components and service integrations enabled during the setup process are free or include some level of no-cost access.

    Choose the foundation that fits your needs

    We understand that every organization and project has unique requirements. That's why Cloud Setup offers three distinct guided flows to choose from:

    • Proof-of-concept: Designed for users who want to set up a lightweight environment to explore Google Cloud and run initial tests or sandbox workloads. This flow focuses on the minimum configuration to get you started quickly.

    • Production: This flow is recommended for supporting production-ready workloads with security and scalability in mind. It aligns with Google Cloud’s best practices and is tailored for administrators setting up basic foundational infrastructure for production workloads.

    • Enhanced security: Designed for organizations, regions or workloads with advanced security and compliance requirements, this flow defaults to more advanced security controls and is designed to help you meet rigorous requirements. Even this advanced foundation sets you up with a perpetual free tier up to certain usage limits.

    Building blocks for a solid foundation

    Cloud Setup guides you through a series of onboarding steps, presenting defaults backed by Google Cloud best practices. Throughout the process, you'll also encounter key features designed to help protect your organization and prepare it for growth, including:

    • Cloud KMS AutoKey: Automates the provisioning and assignment of customer-managed encryption keys (CMEK).

    • Security Command Center: Provides security posture management for Google Cloud deployments including automatic project scanning for security issues such as open ports and misconfigured access controls.

    • Centralized Logging and Monitoring: Enables you to easily set up infrastructure to monitor your system's health and performance from a central location — critical for audit logging compliance and visualizing metrics across projects.

    • Shared VPC Networks: Allows you to establish a centralized network across multiple projects, enabling secure and efficient communication between your Google Cloud resources.

    • Hybrid Connectivity: Facilitates connecting your Google Cloud environment to your on-premises infrastructure or other cloud providers. This is often a critical step for workload migrations.

    • Support plan: Enables you to quickly resolve any issues with help from experts at Google Cloud.

    At the end of the guided flow, you can deploy your configuration directly via the Google Cloud console or download a Terraform configuration file for later deployment using other Infrastructure as Code (IaC) methods.

    Experience the cloud faster and smarter

    Organizations using Cloud Setup enjoy:

    • Faster application deployment: By simplifying the initial setup, you can get your applications up and running more quickly, accelerating your cloud journey.

    • Reduced setup effort: Our streamlined flow significantly reduces the number of manual steps, allowing you to establish a basic foundation with less effort.

    • Greater access to Google Cloud's full potential: By establishing a solid foundation quickly, you can more easily explore and leverage a wider range of Google Cloud services to meet your evolving needs and unlock greater value.

    Ready to start your Google Cloud journey? Visit Google Cloud Setup today for a streamlined path to a secure cloud foundation. Let us guide you through the initial steps so you can focus on innovation and growth.

    To learn more, visit:

  50. Product Manager

    Fri, 18 Jul 2025 16:00:00 -0000

    As developers and operators, you know that having access to the right information in the proper context is crucial for effective troubleshooting. This is why organizations invest a lot upfront curating monitoring resources across different business units: so information is easy to find and contextualize when needed.

    Today we are reducing the need for this upfront investment with an out-of-the-box Application Monitoring experience for your organization on Google Cloud within Cloud Observability.

    Application Monitoring consists of a set of pre-curated dashboards with relevant metrics and logs mapped to a user-defined application in App Hub. It incorporates best practices pioneered by Google Site Reliability Engineers (SRE) to optimize manual troubleshooting and unlock AI-assisted troubleshooting.

    Application Monitoring automatically labels and brings together key telemetry for your application into a centralized experience, making it easy to discover, filter and correlate trends. It also feeds application context into Gemini Cloud Assist Investigations, for AI-assisted troubleshooting.

    1. Application, service and workload dashboards 

    No more spending hours configuring application dashboards. 

    From the moment you describe your application in App Hub, Application Monitoring starts to automatically build dashboards tailored to your environment. Each dashboard comprises relevant telemetry for your application and is searchable, filterable and ready for deep dives — no configuration required. 

    The dashboards offer an overview of charts detailing the SRE Four Golden Signals: traffic, latency, error rate, and saturation. This provides a high-level view of application performance, integrating automatically collected system metrics across various services and workloads such as load balancers, Cloud Run, GKE workloads, MIGs, and databases. From this overview, you can then drill down into services or workloads with performance issues or active alerts to access detailed metrics and logs.

    For example, in the image below, a user defined an App Hub application called Cymbal BnB app, with multiple services and workloads. The flow below shows the automatically generated experience with golden signals, alerts and relevant logs.

    Figure 1 - A user’s flow from an App Hub-defined application (i.e., Cymbal BnB) to the automatically built Application Monitoring experience in Cloud Observability

    2. Labels and context propagation 

    See application labels propagated seamlessly across Google Cloud 

    Once Application Monitoring is enabled, your application labels are propagated across Google Cloud, so you can see and use them to filter and focus on the most essential signals across the logs, metrics and trace explorers.

    Figure 2 - Logs Explorer showing application logs automatically tagged with application labels

    Figure 3 - Metrics Explorer showing application labels automatically associated with metrics

    Figure 4 - Trace Explorer showing App Hub label integration
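
    Because the labels land on the log entries themselves, you can filter on them programmatically as well as in the console. Here is a minimal sketch using the google-cloud-logging client library; the label key and application name are illustrative placeholders, not documented values, so check your own entries in Logs Explorer for the exact key:

      from google.cloud import logging

      client = logging.Client(project="your-project-id")  # placeholder

      # Hypothetical App Hub label key and application name; inspect your own
      # tagged entries (as in Figure 2) to find the exact key to filter on.
      FILTER = 'labels."apphub.googleapis.com/application"="cymbal-bnb-app"'

      for entry in client.list_entries(filter_=FILTER, max_results=10):
          print(entry.timestamp, entry.log_name, entry.payload)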

    3. Gemini Cloud Assist Investigations

    Troubleshoot issues faster with AI-powered Investigations.

    Gemini Cloud Assist’s investigation feature makes it easier to troubleshoot issues because application boundaries and relationships have been propagated into the AI model, grounding it in context about your environment. 

    Figure 5 - Seamless entry point into Gemini Cloud Assist powered Investigations from application logs

    Note - Gemini Cloud Assist Investigations is currently in private preview

    Try Application Monitoring today

    The new Application Monitoring experience provides a low-effort unified view of application and infrastructure performance for your troubleshooting needs.

    Take advantage of the new Google Cloud Application Monitoring experience by:

    1. Visiting your Cloud console

    2. Setting up Applications in App Hub

      1. Adding Services and Workloads to your Application

    3. Navigating to Application Monitoring in Cloud Observability to see your automatically built experience

    4. Enabling your Gemini Cloud Assist SKU and signing up for the trusted tester program to get access to the Investigations experience

    Related docs

    1. Application Monitoring docs

    2. App Hub docs

      1. App Hub coverage docs

  51. Director of Engineering, Google Cloud

    Thu, 10 Jul 2025 09:30:00 -0000

    At Google Cloud, we are committed to making it as seamless as possible for you to build and deploy the next generation of AI and agentic applications. Today, we’re thrilled to announce that we are collaborating with Docker to drastically simplify your deployment workflows, enabling you to bring your sophisticated AI applications from local development to Cloud Run with ease. 

    Deploy your compose.yaml directly to Cloud Run

    Previously, bridging the gap between your development environment and managed platforms like Cloud Run required you to manually translate and configure your infrastructure. Agentic applications that use MCP servers and self-hosted models added additional complexity. 

    The open-source Compose Specification is one of the most popular ways for developers to iterate on complex applications in their local environment, and is the basis of Docker Compose. And now, gcloud run compose up brings the simplicity of Docker Compose to Cloud Run, automating this entire process. Now in private preview, you can deploy your existing compose.yaml file to Cloud Run with a single command, including building containers from source and leveraging Cloud Run’s volume mounts for data persistence. 

    Supporting the Compose Specification with Cloud Run makes for easy transitions across your local and cloud deployments, where you can keep the same configuration format, ensuring consistency and accelerating your dev cycle.

    “We’ve recently evolved Docker Compose to support agentic applications, and we’re excited to see that innovation extend to Google Cloud Run with support for GPU-backed execution. Using Docker and Cloud Run, developers can now iterate locally and deploy intelligent agents to production at scale with a single command. It’s a major step forward in making AI-native development accessible and composable. We’re looking forward to continuing our close collaboration with Google Cloud to simplify how developers build and run the next generation of intelligent applications.” - Tushar Jain, EVP Engineering and Product, Docker

    Cloud Run, your home for AI applications

    Support for the compose spec isn’t the only AI-friendly innovation you’ll find in Cloud Run. We recently announced general availability of Cloud Run GPUs, removing a significant barrier to entry for developers who want access to GPUs for AI workloads. With its pay-per-second billing, scale to zero, and rapid scaling (approximately 19 seconds to time-to-first-token for a gemma3:4b model), Cloud Run is a great hosting solution for deploying and serving LLMs.

    This also makes Cloud Run a strong solution for Docker’s recently announced OSS MCP Gateway and Model Runner, making it easy for developers to take AI applications from local development to production in the cloud seamlessly. By supporting Docker’s recent addition of ‘models’ to the open Compose Spec, you can deploy these complex solutions to the cloud with a single command.

    Bringing it all together

    Let's review the compose file for the above demo. It consists of a multi-container application (defined in services) built from sources and leveraging a storage volume (defined in volumes). It also uses the new models attribute to define AI models and a Cloud Run extension defining the runtime image to use:

      name: agent
      services:
        webapp:
          build: .
          ports:
            - "8080:8080"
          volumes:
            - web_images:/assets/images
          depends_on:
            - adk

        adk:
          image: us-central1-docker.pkg.dev/jmahood-demo/adk:latest
          ports:
            - "3000:3000"
          models:
            - ai-model

      models:
        ai-model:
          model: ai/gemma3-qat:4B-Q4_K_M
          x-google-cloudrun:
            inference-endpoint: docker/model-runner:latest-cuda12.2.2

      volumes:
        web_images:
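
    With this file in place, the deployment described above is a single command, run from the directory containing your compose.yaml: gcloud run compose up.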

    Building the future of AI

    We’re committed to offering developers maximum flexibility and choice by adopting open standards and supporting various agent frameworks. This collaboration on Cloud Run and Docker is another example of how we aim to simplify the process for developers to build and deploy intelligent applications. 

    Compose Specification support is available for our trusted users — sign up here for the private preview.

  52. Principal Platform Engineer, John Lewis Partnership

    Thu, 26 Jun 2025 16:00:00 -0000

    Editor's note: This is part one of the story. After you’re finished reading, head over to part two


    In 2017, John Lewis, a major UK retailer with a £2.5bn annual online turnover, was hampered by its monolithic e-commerce platform. This outdated approach led to significant cross-team dependencies, cumbersome and infrequent releases (monthly at best), and excessive manual testing, all further hindered by complex on-premises infrastructure. What was needed were some bold decisions to drive a quick and significant transformation.

    The John Lewis engineers knew there was a better way. Working with Google Cloud, they modernized their e-commerce operations with Google Kubernetes Engine. They began with the frontend and saw results fast: it was moved onto Google Cloud in mere months, releases to the frontend browser journey became weekly, and the business gladly backed expansion into other areas.

    At the same time, the team had a broader strategy in mind: to take a platform engineering approach, creating many product teams who built their own microservices to replace the functionality of the legacy commerce engine, as well as creating brand new experiences for customers. 

    And so The John Lewis Digital Platform was born. The vision was to empower development teams and arm them with the tools and processes they needed to go to market fast, with full ownership of their own business services. The team’s motto? "You Build It. You Run It. You Own It." This decentralization of development and operational responsibilities would also enable the team to scale. 

    This article features insights from Principal Platform Engineer Alex Moss, who delves into their strategy, platform build, and key learnings of John Lewis’ journey to modernize and streamline its operations with platform engineering — so you can begin to think about how you might apply platform engineering to your own organization.

    Step 1: From monolithic to multi-tenant

    In order to make this happen, John Lewis needed to adopt a multi-tenant architecture: one tenant for each business service, allowing each owning team to work independently without risk to others, and thereby permitting the Platform team to give each team a greater degree of freedom.

    Knowing that the business' primary objective was to greatly increase the number of product teams helped inform our initial design thinking, positioning ourselves to enable many independent teams even though we only had a handful of tenants. 

    This foundational design has served us very well and is largely unchanged now, seven years later. Central to the multi-tenant concept is what we chose to term a "Service" — a logical business application, usually composed of several microservices plus components for storing data.

    We largely position our platform as a “bring your own container” experience, but encourage teams to make use of other Google Cloud services — particularly for handling state. Adopting services like Firestore and Pub/Sub reduces the complexity that our platform team has to work with, particularly for areas like resilience and disaster recovery. We also favor Kubernetes over compute products like Cloud Run because it strikes the right balance for us between giving development teams freedom and allowing our platform to drive certain behaviours, e.g., the right level of guardrails, without introducing too much friction.

    On our platform, Product Teams (i.e., tenants) have a large amount of control over their own Namespaces and Projects. This allows them to prototype, build, and ultimately operate their workloads without dependency on others — a crucial element of enabling scale.

    Our early-adopter teams were extremely helpful in evolving the platform: they accepted the lack of features, were willing to develop their own solutions, and provided very rich feedback on whether we were building something that met their needs.

    The first tenant to adopt the platform was rebuilding the johnlewis.com search capability, replacing a commercial off-the-shelf solution. This team was staffed with experienced engineers familiar with modern software development and the advantages of a microservice-based architecture. They quickly identified the need for supporting services for their application to store data and asynchronously communicate between their components. They worked with the Platform Team to identify options, and were on board with our desire to lean into Google Cloud native services to avoid running our own databases or messaging. This led to us adopting Cloud Datastore and Pub/Sub for our first features that extended beyond Google Kubernetes Engine.

    All roads lead to success

    A risk with a platform that allows very high team autonomy is that it can turn into a bit of a wild west of technology choices and implementation patterns. To handle this in a way that remained developer-centric, we adopted the concept of a paved road, analogous to a “golden path.”

    We found that the paved road approach made it easier to:

    • build useful platform features to help developers do things rapidly and safely

    • share approaches and techniques, and for engineers to move between teams

    • demonstrate to the wider organisation that teams are following required practices (which we do by building assurance capabilities, not by gating release)

    The concept of the paved road permeates most of what the platform builds, and has inspired other areas of the John Lewis Partnership beyond the John Lewis Digital space.

    Our paved road is powered by two key features to enable simplification for teams:

    1. The Paved Road Pipeline. This operates on the whole Service and drives capabilities such as Google Cloud resource provisioning and observability tools.

    2. The Microservice CRD. As the name implies, this is an abstraction at the microservice level. The majority of the benefit here is in making it easier for teams to work with Kubernetes.

    Whilst both features were created with the developer experience in mind, we discovered that they also hold a number of benefits for the platform team.

    The Paved Road Pipeline is driven by a configuration file — in yaml (of course!) — which we call the Service Definition. This allows the team that owns the tenancy to describe, through easy-to-reason-about configuration, what they would like the platform to provide for them. Supporting documentation and examples help them understand what can be achieved. Pushes to this file then drive a CI/CD pipeline for a number of platform-owned jobs, which we refer to as provisioners. These provisioners are microservices-like themselves in that they are independently releasable and generally focus on performing one task well. Here are some examples of our provisioners and what they can do:

    • Create Google Cloud resources in a tenant’s Project. For example, Buckets, PubSub, and Firestore — amongst many others
    • Configure platform-provided dashboards and custom dashboards based on golden-signal and self-instrumented metrics
    • Tune alert configurations for a given microservice’s SLOs, and the incident response behaviour for those alerts

    Our product teams are therefore freed from the need to familiarize themselves deeply with how Google Cloud resource provisioning works, or Infrastructure-as-Code (IaC) tooling for that matter. Our preferred technologies and good practices can be curated by our experts, and developers can focus on building differentiating software for the business, while remaining fully in control of what is provisioned and when.

    Earlier, we mentioned that this approach has the added benefit of being something that the platform team can rely upon to build their own features. The configuration updated by teams for their Service can be combined with metadata about their team and surfaced via an API and events published to Pub/Sub. This can then drive updates to other features like incident response and security tooling, pre-provision documentation repositories, and more. This is an example of how something that was originally intended as a means to help teams avoid writing their own IaC can also be used to make it easier for us to build platform features, further improving the value-add — without the developer even needing to be aware of it!

    We think this approach is also more scalable than providing pre-built Terraform modules for teams to use. That approach still burdens teams with being familiar with Terraform, and versioning and dependency complexities can create maintenance headaches for platform engineers. Instead, we provide an easy-to-reason-about API and deliberately burden the platform team, ensuring that the Service provides all the functionality our tenants require. This abstraction also means we can make significant refactoring choices if we need to.

    Adopting this approach also results in a broad consistency in technologies across our platform. For example, why would a team implement Kafka when the platform makes creating resources in Pub/Sub so easy? When you consider that this spans not just the runtime components that assemble into a working business service, but also all the ancillary needs for operating that software — resilience engineering, monitoring & alerting, incident response, security tooling, service management, and so on — this has a massive amplifying effect on our engineers’ productivity. All of these areas have full paved road capabilities on the John Lewis Digital Platform, reducing the cognitive load on teams of recognizing the need for a capability, identifying appropriate options, and then implementing the technology or processes to use them.

    That being said, one of the reasons we particularly like the paved road concept is that it doesn’t preclude teams choosing to “go off-road.” A paved road shouldn’t be mandatory, but it should be compelling to use, so that engineers aren’t tempted to do something else. Preventing the use of other approaches risks stifling innovation and breeds the temptation to think the features you’ve built are “good enough.” The paved road challenges our Platform Engineers to keep improving their product so that it continues to meet our Developers’ changing needs. Likewise, development teams tempted to go off-road are put off by the increasing burden of replicating powerful platform features.

    The needs of our Engineers don’t remain fixed, and Google Cloud are of course releasing new capabilities all the time, so we have extended the analogy to include a “dusty path” representing brand new platform features that aren’t as feature-rich as we’d like (perhaps they lack self-service provisioning or out-of-the-box observability). Teams are trusted to try different options and make use of Google Cloud products that we haven’t yet paved. The Paved Road Pipeline allows for this experimentation, which we term “snowflaking.” We then have an unofficial “rule of three,” whereby if we notice at least three teams requesting the same feature, we move to make its use self-service.

    At the other end of the scale, teams can go completely solo — which we refer to as “crazy paving” — something that might be needed to support wild experimentation or to accommodate a workload that cannot comply with the platform’s expectations for safe operation. Solutions in this space are generally not long-lived.

    In this article, we've covered how John Lewis revolutionized its e-commerce operations by adopting a multi-tenant, "paved road" approach to platform engineering. We explored how this strategy empowered development teams and streamlined their ability to provision Google Cloud resources and deploy operational and security features.

    In part 2 of this series, we'll dive deeper into how John Lewis further simplified the developer experience by introducing the Microservice CRD. You'll discover how this custom Kubernetes abstraction significantly reduced the complexity of working with Kubernetes at the component level, leading to faster development cycles and enhanced operational efficiency.

To learn more about shifting down with platform engineering on Google Cloud, you can find more information here. To learn more about how Google Kubernetes Engine (GKE) empowers developers to effortlessly deploy, scale, and manage containerized applications with its fully managed, robust, and intelligent Kubernetes service, you can find more information here.

  53. Sr. Staff UX Designer

    Wed, 28 May 2025 16:00:00 -0000

In the event of a cloud incident, everyone wants swift and clear communication from the cloud provider, and to be able to leverage that information effectively. Personalized Service Health in the Google Cloud console addresses this need with fast, transparent, relevant, and actionable communications about Google Cloud service disruptions, customized to your specific footprint. This helps you quickly identify the source of the problem and answer the question, "Is it Google or is it me?" You can then integrate this information into your incident response workflows to resolve the incident more efficiently.

We're excited to announce that you can prompt Gemini Cloud Assist to pull real-time information about active incidents, powered by Personalized Service Health, providing you with streamlined incident management, including discovery, impact assessment, and recovery. By combining Gemini's guidance with Personalized Service Health insights and up-to-the-minute information, you can assess the scope of impact and begin troubleshooting, all within a single, AI-driven Gemini Cloud Assist chat. Further, you can initiate this sort of incident discovery from anywhere within the console, offering immediate access to relevant incidents without interrupting your workflow. You can also check for active incidents impacting your projects, gathering details on their scope and the latest updates directly sourced from Personalized Service Health.
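If you prefer to consume these signals programmatically rather than through chat, the underlying Service Health API can also be queried directly. A rough sketch, assuming the API is enabled and using a placeholder project ID:

```python
import google.auth
from google.auth.transport.requests import AuthorizedSession

PROJECT_ID = "my-project"  # placeholder

credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
session = AuthorizedSession(credentials)

# List service health events for the project (location is always "global").
url = (
    "https://servicehealth.googleapis.com/v1/"
    f"projects/{PROJECT_ID}/locations/global/events"
)
response = session.get(url)
response.raise_for_status()

for event in response.json().get("events", []):
    # Field names such as "title", "relevance", and "state" follow the
    # public Service Health API; check the docs for the full schema.
    print(event.get("title"), event.get("relevance"), event.get("state"))
```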


    Using Gemini Cloud Assist with Personalized Service Health

    We designed Gemini Cloud Assist with a user-friendly layout and a well-organized information structure. Crucial details, including dynamic timelines, latest updates, symptoms, and workarounds sourced directly from Personalized Service Health, are now presented in the console, enabling conversational follow-ups. Gemini Cloud Assist highlights critical insights from Personalized Service Health, helping you refine your investigations and understand the impact of incidents.

    To illustrate the power of this integration, the following demo showcases a typical incident response workflow leveraging the combined capabilities of Gemini and Personalized Service Health.

    Incident discovery and triage
In the crucial first moments of an incident, Gemini Cloud Assist helps you answer "Is it Google or is it me?" It accesses data directly from Personalized Service Health and reports which of your projects are affected by a Google Cloud incident, and in which locations, speeding up the triage process.

    To illustrate how you can start this process, try asking Gemini Cloud Assist questions like:

    • Is my project impacted by a Google Cloud incident?

    • Are there any incidents impacting Google Cloud at the moment?


    Investigating and evaluating impact
    Once you’ve identified a relevant Google Cloud incident, you can use Gemini Cloud Assist to delve deeper into the specifics and evaluate its impact on your environment. Furthermore, by asking follow-up questions, Gemini Cloud Assist can retrieve updates from Personalized Service Health about the incident as it evolves. You can then further investigate by asking Gemini to pinpoint exactly which of your apps or projects, and at what locations, might be affected by the reported incident.

    Here are examples of prompts you might pose to Gemini Cloud Assist:

    • Tell me more about the ongoing Incident ID [X] (Replace [X] with the Incident ID)

    • Is [X] impacted? (Replace [X] with your specific location or Google Cloud product)

    • What is the latest update on Incident ID [X]?

    • Show me the details of Incident ID [X].

    • Can you guide me through some troubleshooting steps for [impacted Google Cloud product]?


    Mitigation and recovery
    Finally, Gemini Cloud Assist can also act as an intelligent assistant during the recovery phase, providing you with actionable guidance. You can gain access to relevant logs and monitoring data for more efficient resolution. Additionally, Gemini Cloud Assist can help surface potential workarounds from Personalized Service Health and direct you to the tools and information you need to restore your projects or applications. Here are some sample prompts:

    • What are the workarounds for the incident ID [X]? (Replace [X] with the Incident ID)

    • Can you suggest a temporary solution to keep my application running?

    • How can I find logs for this impacted project?


From these prompts, Gemini retrieves relevant information from Personalized Service Health to provide you with personalized insights into your Google Cloud environment's health, both for ongoing events and for incidents from up to one year in the past. This helps you narrow down an incident's impact during investigation, and assists in recovery.

    Next steps

    Looking ahead, we are excited to provide even deeper insights and more comprehensive incident management with Gemini Cloud Assist and Personalized Service Health, extending these AI-driven capabilities beyond a single project view. Ready to get started? 

    • Learn more about Personalized Service Health, or reach out to your account team to enable it.

    • Get started with Gemini Cloud Assist. Refine your prompts to ask about your specific regions or Google Cloud products, and experiment to discover how it can help you proactively manage incidents.

  54. Staff Site Reliability Engineer, Waze

    Mon, 28 Apr 2025 16:00:00 -0000

    In 2023, the Waze platform engineering team transitioned to Infrastructure as Code (IaC) using Google Cloud's Config Connector (KCC) — and we haven’t looked back since. We embraced Config Connector, an open-source Kubernetes add-on, to manage Google Cloud resources through Kubernetes. To streamline management, we also leverage Config Controller, a hosted version of Config Connector on Google Kubernetes Engine (GKE), incorporating Policy Controller and Config Sync. This shift has significantly improved our infrastructure management and is shaping our future infrastructure.

    The shift to Config Connector

    Previously, Waze relied on Terraform to manage resources, particularly during our dual-cloud, VM-based phase. However, maintaining state and ensuring reconciliation proved challenging, leading to inconsistent configurations and increased management overhead.

    In 2023, we adopted Config Connector, transforming our Google Cloud infrastructure into Kubernetes Resource Modules (KRMs) within a GKE cluster. This approach addresses the reconciliation issues encountered with Terraform. Config Sync, paired with Config Connector, automates KRM synchronization from source repositories to our live GKE cluster. This managed solution eliminates the need for us to build and maintain custom reconciliation systems.

    The shift helped us meet the needs of three key roles within Waze’s infrastructure team: 

    1. Infrastructure consumers: Application developers who want to easily deploy infrastructure without worrying about the maintenance and complexity of underlying resources.

    2. Infrastructure owners: Experts in specific resource types (e.g., Spanner, Google Cloud Storage, Load Balancers, etc.), who want to define and standardize best practices in how resources are created across Waze on Google Cloud.

    3. Platform engineers: Engineers who build the system that enables infrastructure owners to codify and define best practices, while also providing a seamless API for infrastructure consumers.


    First stop: Config Connector

It may seem circular to define all of our Google Cloud infrastructure as KRMs within a Google Cloud service. However, KRM is actually a better representation for our infrastructure than existing IaC tooling offers.

Terraform's reconciliation issues (state drift, version management, out-of-band changes) are a significant pain. Config Connector, through Config Sync, offers out-of-the-box reconciliation, a managed solution we prefer. Both KRM and Terraform offer templating, but KCC's managed nature aligns with our shift to Google Cloud-native solutions and reduces our maintenance burden.

    Infrastructure complexity requires generalization regardless of the tool. We can see this when we look at the Spanner requirements at Waze:

    • Consistent backups for all Spanner databases

    • Each Spanner database utilizes a dedicated Cloud Storage bucket and Service Account to automate the execution of DDL jobs.

    • All IAM policies for Spanner instances, databases, and Cloud Storage buckets are defined in code to ensure consistent and auditable access control.

Figure 1: Spanner at Waze

To define these resources, we evaluated various templating and rendering tools and selected Helm, a robust CNCF package manager for Kubernetes. Its strong open-source community, rich templating capabilities, and native rendering features made it a natural fit. We can now refer to our bundled infrastructure configurations as "Charts." While kro, which achieves a similar purpose, has since emerged, our selection process predated its availability.
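As a rough illustration of what this looks like in practice (the chart name, values, and expected kinds below are hypothetical, not Waze's internal Charts), a presubmit check might render a chart with helm template and assert that each Spanner bundle carries its companion resources:

```python
import subprocess
import yaml  # PyYAML

# Render the chart exactly as Config Sync would later apply it.
rendered = subprocess.run(
    [
        "helm", "template", "orders-db",   # release name (hypothetical)
        "./charts/spanner-bundle",         # chart path (hypothetical)
        "--set", "database.name=orders",
        "--set", "backup.schedule=0 2 * * *",
    ],
    check=True, capture_output=True, text=True,
).stdout

# Each rendered document is a KRM. Verify the bundle includes the
# dedicated bucket and service account alongside the database itself.
kinds = [doc["kind"] for doc in yaml.safe_load_all(rendered) if doc]
assert "SpannerDatabase" in kinds
assert "StorageBucket" in kinds
assert "IAMServiceAccount" in kinds
```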

    Under the hood

    Let's open the hood and dive into how the system works and is driving value for Waze.

    1. Waze infrastructure owners generically define Waze-flavored infrastructure in Helm Charts. 

    2. Infrastructure consumers use these Charts with simplified inputs to generate infrastructure (demo).

    3. Infrastructure code is stored in repositories, enabling validation and presubmit checks.

4. Code is uploaded to Artifact Registry, where Config Sync and Config Connector align Google Cloud infrastructure with the code definitions.

Figure 2: Provisioning cloud resources at Waze

This diagram represents a single "data domain": a collection of bounded services, databases, networks, and data. Most tech orgs run several such domains, e.g., Prod, QA, Staging, and Development.

    Approaching our destination

    So why does all of this matter? Adopting this approach allowed us to move from Infrastructure as Code to Infrastructure as Software. By treating each Chart as a software component, our infrastructure management goes beyond simple code declaration. Now, versioned Charts and configurations enable us to leverage a rich ecosystem of software practices, including sophisticated release management, automated rollbacks, and granular change tracking.

    Here's where we apply this in practice: our configuration inheritance model minimizes redundancy. Resource Charts inherit settings from Projects, which inherit from Bootstraps. All three are defined as Charts. Consequently, Bootstrap configurations apply to all Projects, and Project configurations apply to all Resources.
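As a toy illustration of the idea (not Waze's implementation), this layered inheritance behaves like an ordered merge of configuration maps, with more specific layers overriding broader ones:

```python
from functools import reduce

def merge(base: dict, override: dict) -> dict:
    """Recursively merge override onto base; override wins on conflicts."""
    out = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            out[key] = merge(out[key], value)
        else:
            out[key] = value
    return out

# Hypothetical layers: Bootstrap -> Project -> Resource.
bootstrap = {"labels": {"org": "waze"}, "backups": {"enabled": True}}
project = {"labels": {"env": "prod"}}
resource = {"backups": {"schedule": "0 2 * * *"}}

effective = reduce(merge, [bootstrap, project, resource])
# {'labels': {'org': 'waze', 'env': 'prod'},
#  'backups': {'enabled': True, 'schedule': '0 2 * * *'}}
```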

    Every change to our infrastructure – from changes on existing infrastructure to rolling out new resource types – can be treated like a software rollout.

Figure 3: Resource inheritance

    Now that all of our infrastructure is treated like software, we can see what this does for us system-wide:

Figure 4: Data domain flow

    Reaching our destination

    In summary, Config Connector and Config Controller have enabled Waze to achieve true Infrastructure as Software, providing a robust and scalable platform for our infrastructure needs, along with many other benefits including: 

    • Infrastructure consumers receive the latest best practices through versioned updates.

    • Infrastructure owners can iterate and improve infrastructure safely.

• Platform Engineers and Security teams are confident our resources are auditable and compliant.

    • Config Connector leverages Google's managed services, reducing operational overhead.

  55. Engineering Manager

    Mon, 24 Feb 2025 17:00:00 -0000

    Distributed tracing is a critical part of an observability stack, letting you troubleshoot latency and errors in your applications. Cloud Trace, part of Google Cloud Observability, is Google Cloud’s native tracing product, and we’ve made numerous improvements to the Trace explorer UI on top of a new analytics backend.

Figure 1: Components of the new Trace explorer

    The new Trace explorer page contains:

1. A filter bar with options for users to choose a Google Cloud project-based trace scope, all/root spans, and a custom attribute filter.

    2. A faceted span filter pane that displays commonly used filters based on OpenTelemetry conventions.

    3. A visualization of matching spans including an interactive span duration heatmap (default), a span rate line chart, and a span duration percentile chart.

    4. A table of matching spans that can be narrowed down further by selecting a cell of interest on the heatmap.

    A tour of the new Trace explorer

    Let’s take a closer look at these new features and how you can use them to troubleshoot your applications. Imagine you’re a developer working on the checkoutservice of a retail webstore application and you’ve been paged because there’s an ongoing incident.


    This application is instrumented using OpenTelemetry and sends trace data to Google Cloud Trace, so you navigate to the Trace explorer page on the Google Cloud console with the context set to the Google Cloud project that hosts the checkoutservice.
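As a minimal sketch of what that instrumentation might look like (the service and span names follow this walkthrough; the setup itself is a generic OpenTelemetry-to-Cloud-Trace configuration, not the webstore's actual code):

```python
# Requires: pip install opentelemetry-sdk opentelemetry-exporter-gcp-trace
from opentelemetry import trace
from opentelemetry.exporter.cloud_trace import CloudTraceSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Send spans to Cloud Trace in batches.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(CloudTraceSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkoutservice")

with tracer.start_as_current_span("orders publish") as span:
    # Attributes like rpc.method are what the explorer's span filters
    # and custom attribute search operate on later in this walkthrough.
    span.set_attribute("rpc.method", "PlaceOrder")
    # ... publish the order ...
```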

    Before starting your investigation, you remember that your admin recommended using the webstore-prod trace scope when investigating webstore app-wide prod issues. By using this Trace scope, you'll be able to see spans stored in other Google Cloud projects that are relevant to your investigation.

Figure 2: Scope selection

    You set the trace scope to webstore-prod and your queries will now include spans from all the projects included in this trace scope.

Figure 3: User journey

    You select checkoutservice in Span filters (1) and the following updates load on the page:

    • Other sections such as Span name in the span filter pane (2) are updated with counts and percentages that take into account the selection made under service name. This can help you narrow down your search criteria to be more specific.

    • The span Filter bar (3) is updated to display the active filter.

• The heatmap visualization (4) is updated to display only spans from the checkoutservice in the last hour (the default). You can change the time range using the time-picker (5). The heatmap's x-axis is time and the y-axis is span duration. It uses color shades to denote the number of spans in each cell, with a legend that indicates the corresponding range.

    • The Spans table (6) is updated with matching spans sorted by duration (default).

    • Other Chart views (7) that you can switch to are also updated with the applied filter.

From looking at the heatmap, you can see that there are some spans in the >100s range, which is abnormal and concerning. But first, you're curious about the traffic and corresponding latency of calls handled by the checkoutservice.

Figure 4: Span rate line chart

    Switching to the Span rate line chart gives you an idea of the traffic handled by your service. The x-axis is time and the y-axis is spans/second. The traffic handled by your service looks normal as you know from past experience that 1.5-2 spans/second is quite typical.

Figure 5: Span duration percentile chart

    Switching to the Span duration percentile chart gives you p50/p90/p95/p99 span duration trends. While p50 looks fine, the p9x durations are greater than you expect for your service.

Figure 6: Span selection

    You switch back to the heatmap chart and select one of the outlier cells to investigate further. This particular cell has two matching spans with a duration of over 2 minutes, which is concerning.

Figure 7: Trace details and span attributes

    You investigate one of those spans by viewing the full trace and notice that the orders publish span is the one taking up the majority of the time when servicing this request. Given this, you form a hypothesis that the checkoutservice is having issues handling these types of calls. To validate your hypothesis, you note the rpc.method attribute being PlaceOrder and exit this trace using the X button.

Figure 8: Custom attribute search

    You add an attribute filter for key: rpc.method value:PlaceOrder using the Filter bar, which shows you that there is a clear latency issue with PlaceOrder calls handled by your service. You’ve seen this issue before and know that there is a runbook that addresses it, so you alert the SRE team with the appropriate action that needs to be taken to mitigate the incident.

Figure 9: Send feedback

    Share your feedback with us via the Send feedback button.

    Behind the scenes

Figure 10: Cloud Trace architecture

    This new experience is powered by BigQuery, using the same platform that backs Log Analytics. We plan to launch new features that take full advantage of this platform: SQL queries, flexible sampling, export, and regional storage.

    In summary, you can use the new Cloud Trace explorer to perform service-oriented investigations with advanced querying and visualization of trace data. This allows developers and SREs to effectively troubleshoot production incidents and identify mitigating measures to restore normal operations.

    The new Cloud Trace explorer is generally available to all users — try it out and share your feedback with us via the Send feedback button.

  56. Technical Program Manager, Google

    Thu, 20 Feb 2025 17:00:00 -0000

Picture this: you're a Site Reliability Engineer (SRE) responsible for the systems that power your company's machine learning (ML) services. What do you do to ensure you have a reliable ML service, how do you know you're doing it well, and how can you build strong systems to support these services?

As artificial intelligence (AI) becomes more widely available, its features, including ML, will matter more to SREs. That's because ML becomes both a part of the infrastructure used in production software systems and an important feature of the software itself.

    Abstractly, machine learning relies on its pipelines … and you know how to manage those! So you can begin with pipeline management, then look to other factors that will strengthen your ML services: training, model freshness, and efficiency. In the resources below, we'll look at some of the ML-specific characteristics of these pipelines that you’ll want to consider in your operations. Then, we draw on the experience of Google SREs to show you how to apply your core SRE skills to operating and managing your organization’s machine-learning pipelines. 

    Training ML models

    Training ML models applies the notion of pipelines to specific types of data, often running on specialized hardware. Critical aspects to consider about the pipeline:

    • how much data you’re ingesting

    • how fresh this data needs to be

    • how the system trains and deploys the models 

    • how efficiently the system handles these first three things

    This keynote presents an SRE perspective on the value of applying reliability principles to the components of machine learning systems. It provides insight into why ML systems matter for products, and how SREs should think about them. The challenges that ML systems present include capacity planning, resource management, and monitoring; other challenges include understanding the cost of ML systems as part of your overall operations environment. 


    ML freshness and data volume

    As with any pipeline-based system, a big part of understanding the system is describing how much data it typically ingests and processes. The Data Processing Pipelines chapter in the SRE Workbook lays out the fundamentals: automate the pipeline’s operation so that it is resilient, and can operate unattended. 

    You’ll want to develop Service Level Objectives (SLOs) in order to measure the pipeline’s health, especially for data freshness, i.e., how recently the model got the data it’s using to produce an inference for a customer. Understanding freshness provides an important measure of an ML system’s health, as data that becomes stale may lead to lower-quality inferences and sub-optimal outcomes for the user. For some systems, such as weather forecasting, data may need to be very fresh (just minutes or seconds old); for other systems, such as spell-checkers, data freshness can lag on the order of days — or longer! Freshness requirements will vary by product, so it’s important that you know what you’re building and how the audience expects to use it. 
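As a minimal sketch of what a freshness SLI might look like in code (the six-hour target is purely illustrative; real targets depend on the product):

```python
import time

FRESHNESS_SLO_SECONDS = 6 * 3600  # illustrative target: data no older than 6 hours

def freshness_seconds(last_ingested_ts, now=None):
    """Age, in seconds, of the newest data the serving model was trained on."""
    return (now if now is not None else time.time()) - last_ingested_ts

def freshness_slo_met(last_ingested_ts):
    return freshness_seconds(last_ingested_ts) <= FRESHNESS_SLO_SECONDS

# Example: data last ingested two hours ago meets a six-hour SLO.
two_hours_ago = time.time() - 2 * 3600
print(freshness_seconds(two_hours_ago), freshness_slo_met(two_hours_ago))
```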

In this way, freshness is part of the critical user journey described in the SRE Workbook, capturing one aspect of the customer experience. You can read more about data freshness as a component of pipeline systems in the Google SRE article Reliable Data Processing with Minimal Toil.

    There’s more than freshness to ensuring high-quality data — there’s also how you define the model-training pipeline. A Brief Guide To Running ML Systems in Production gives you the nuts and bolts of this discipline, from using contextual metrics to understand freshness and throughput, to methods for understanding the quality of your input data. 

    Serving efficiency

    The 2021 SRE blog post Efficient Machine Learning Inference provides a valuable resource to learn about improving your model’s performance in a production environment. (And remember, training is never the same as production for ML services!) 

Optimizing machine learning inference serving is crucial for real-world deployment. In this article, the authors explore serving multiple models from a shared VM. They cover realistic use cases and how to manage trade-offs between cost, utilization, and latency of model responses. By changing the allocation of models to VMs, and by varying the size and shape of those VMs in terms of processing, GPU, and RAM attached, you can improve the cost-effectiveness of model serving.

    Cost efficiency

    We mentioned that these AI pipelines often rely on specialized hardware. How do you know you’re using this hardware efficiently? Todd Underwood’s talk from SREcon EMEA 2023 on Artificial Intelligence: What Will It Cost You? gives you a sense of how much this specialized hardware costs to run, and how you can provide incentives for using it efficiently. 

    Automation for scale

    This article from Google's SRE team outlines strategies for ensuring reliable data processing while minimizing manual effort, or toil. One of the key takeaways: use an existing, standard platform for as much of the pipeline as possible. After all, your business goals should focus on innovations in presenting the data and the ML model, not in the pipeline itself. The article covers automation, monitoring, and incident response, with a focus on using these concepts to build resilient data pipelines. You’ll read best practices for designing data systems that can handle failures gracefully and reduce a team’s operational burden. This article is essential reading for anyone involved in data engineering or operations. Read more about toil in the SRE Workbook: https://sre.google/workbook/eliminating-toil/

    Next steps

    Successful ML deployments require careful management and monitoring for systems to be reliable and sustainable. That means taking a holistic approach, including implementing data pipelines, training pathways, model management, and validation, alongside monitoring and accuracy metrics. To go deeper, check out this guide on how to use GKE for your AI orchestration.

  57. Cross-Product Solution Developer

    Fri, 14 Feb 2025 17:00:00 -0000

    In today's dynamic digital landscape, building and operating secure, reliable, cost-efficient and high-performing cloud solutions is no easy feat. Enterprises grapple with the complexities of cloud adoption, and often struggle to bridge the gap between business needs, technical implementation, and operational readiness. This is where the Google Cloud Well-Architected Framework comes in. The framework provides comprehensive guidance to help you design, develop, deploy, and operate efficient, secure, resilient, high-performing, and cost-effective Google Cloud topologies that support your security and compliance requirements.

    Who should use the Well-Architected Framework?

    The Well-Architected Framework caters to a broad spectrum of cloud professionals. Cloud architects, developers, IT administrators, decision makers and other practitioners can benefit from years of subject-matter expertise and knowledge both from within Google and from the industry. The framework distills this vast expertise and presents it as an easy-to-consume set of recommendations. 

The recommendations in the Well-Architected Framework are organized under five business-focused pillars.


    We recently completed a revamp of the guidance in all the pillars and perspectives of the Well-Architected Framework to center the recommendations around a core set of design principles.

Operational excellence
• Operational readiness
• Incident management
• Resource optimization
• Change management
• Continuous improvement

Security, privacy, and compliance
• Security by design
• Zero trust
• Shift-left security
• Preemptive cyber-defense
• Secure and responsible AI
• AI for security
• Regulatory, privacy, and compliance needs

Reliability
• User-focused goals
• Realistic targets
• HA through redundancy
• Horizontal scaling
• Observability
• Graceful degradation
• Recovery testing
• Thorough postmortems

Cost optimization
• Spending aligned with business value
• Culture of cost awareness
• Resource optimization
• Continuous optimization

Performance optimization
• Resource allocation planning
• Elasticity
• Modular design
• Continuous improvement

    In addition to the above pillars, the Well-Architected Framework provides cross-pillar perspectives that present recommendations for selected domains, industries, and technologies like AI and machine learning (ML).


    Benefits of adopting the Well-Architected Framework

The Well-Architected Framework is much more than a collection of design and operational recommendations. The framework empowers you with a structured, principles-oriented design methodology that unlocks many advantages:

    • Enhanced security, privacy, and compliance: Security is paramount in the cloud. The Well-Architected Framework incorporates industry-leading security practices, helping ensure that your cloud architecture meets your security, privacy, and compliance requirements.

    • Optimized cost: The Well-Architected Framework lets you build and operate cost-efficient cloud solutions by promoting a cost-aware culture, focusing on resource optimization, and leveraging built-in cost-saving features in Google Cloud.

    • Resilience, scalability, and flexibility: As your business needs evolve, the Well-Architected Framework helps you design cloud deployments that can scale to accommodate changing demands, remain highly available, and be resilient to disasters and failures.

    • Operational excellence: The Well-Architected Framework promotes operationally sound architectures that are easy to operate, monitor, and maintain.

    • Predictable and workload-specific performance: The Well-Architected Framework offers guidance to help you build, deploy, and operate workloads that provide predictable performance based on your workloads’ needs.


    The principles and recommendations in the Google Cloud Well-Architected Framework are aligned with Google and industry best practices like Google’s Site Reliability Engineering (SRE) practices, DORA capabilities, the Google HEART framework for user-centered metrics, the FinOps framework, Supply-chain Levels for Software Artifacts (SLSA), and Google's Secure AI Framework (SAIF).

    Embrace the Well-Architected Framework to transform your Google Cloud journey, and get comprehensive guidance on security, reliability, cost, performance, and operations — as well as targeted recommendations for specific industries and domains like AI and ML. To learn more, visit Google Cloud Well-Architected Framework.

  58. Product Manager

    Thu, 30 Jan 2025 20:00:00 -0000

    We are thrilled to announce the collaboration between Google Cloud, AWS, and Azure on Kube Resource Orchestrator, or kro (pronounced “crow”). kro introduces a Kubernetes-native, cloud-agnostic way to define groupings of Kubernetes resources. With kro, you can group your applications and their dependencies as a single resource that can be easily consumed by end users.

    Challenges of Kubernetes resource orchestration

    Platform and DevOps teams want to define standards for how application teams deploy their workloads, and they want to use Kubernetes as the platform for creating and enforcing these standards. Each service needs to handle everything from resource creation to security configurations, monitoring setup, defining the end-user interface, and more. There are client-side templating tools that can help with this (e.g., Helm, Kustomize), but Kubernetes lacks a native way for platform teams to create custom groupings of resources for consumption by end users. 

    Before kro, platform teams needed to invest in custom solutions such as building custom Kubernetes controllers, or using packaging tools like Helm, which can’t leverage the benefits of Kubernetes CRDs. These approaches are costly to build, maintain, and troubleshoot, and complex for non-Kubernetes experts to consume. This is a problem many Kubernetes users face. Rather than developing vendor-specific solutions, we’ve partnered with Amazon and Microsoft on making K8s APIs simpler for all Kubernetes users.


    How kro simplifies the developer experience

    kro is a Kubernetes-native framework that lets you create reusable APIs to deploy multiple resources as a single unit. You can use it to encapsulate a Kubernetes deployment and its dependencies into a single API that your application teams can use, even if they aren’t familiar with Kubernetes. You can use kro to create custom end-user interfaces that expose only the parameters an end user should see, hiding the complexity of Kubernetes and cloud-provider APIs.

kro does this by introducing the concept of a ResourceGraphDefinition, which defines a new Kubernetes Custom Resource Definition (CRD) and specifies how instances of that CRD should be expanded into a set of Kubernetes resources. End users define a single resource, which kro then expands into the set of resources specified in the ResourceGraphDefinition.

    kro can be used to group and manage any Kubernetes resources. Tools like ACK, KCC, or ASO define CRDs to manage cloud provider resources from Kubernetes (these tools enable cloud provider resources, like storage buckets, to be created and managed as Kubernetes resources). kro can also be used to group resources from these tools, along with any other Kubernetes resources, to define an entire application deployment and the cloud provider resources it depends on.


    Example use cases

    Below, you’ll find some examples of kro being used with Google Cloud. You can find additional examples on the kro website

    Example 1: GKE cluster definition

    Imagine that a platform administrator wants to give end users in their organization self-service access to create GKE clusters. The platform administrator creates a kro ResourceGraphDefinition called GKEclusterRGD that defines the required Kubernetes resources and a CRD called GKEcluster that exposes only the options they want to be configurable by end users. In addition to creating a cluster, the platform team also wants clusters to deploy administrative workloads such as policies, agents, etc. The ResourceGraphDefinition defines the following resources, using KCC to provide the mappings from K8s CRDs to Google Cloud APIs:

    • GKE cluster, Container Node Pools, IAM ServiceAccount, IAM PolicyMember, Services, Policies

    The platform administrator would then define the end-user interface so that they can create a new cluster by creating an instance of the CRD that defines:

    • Cluster name, Nodepool name, Max nodes, Location (e.g. us-east1), Networks (optional)

    Everything related to policy, service accounts, and service activation (and how these resources relate to each other) is hidden from the end user, simplifying their experience.
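To make that end-user experience concrete, here is a hedged sketch of how a developer might create an instance of the GKEcluster CRD with the Kubernetes Python client. The API group, version, and field names are hypothetical, since they are defined by the platform team's ResourceGraphDefinition:

```python
from kubernetes import client, config

config.load_kube_config()

cluster = {
    "apiVersion": "example.platform.io/v1alpha1",  # hypothetical group/version
    "kind": "GKEcluster",
    "metadata": {"name": "team-a-cluster"},
    "spec": {  # only the fields the platform administrator chose to expose
        "clusterName": "team-a-cluster",
        "nodepoolName": "default-pool",
        "maxNodes": 5,
        "location": "us-east1",
    },
}

# kro expands this single object into the full resource graph
# (cluster, node pools, service accounts, policies, and so on).
client.CustomObjectsApi().create_namespaced_custom_object(
    group="example.platform.io",
    version="v1alpha1",
    namespace="default",
    plural="gkeclusters",
    body=cluster,
)
```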


    Example 2: Web application definition

    In this example, a DevOps Engineer wants to create a reusable definition of a web application and its dependencies. They create a ResourceGraphDefinition called WebAppRGD, which defines a new Kubernetes CRD called WebApp. This new resource encapsulates all the necessary resources for a web application environment, including:

    • Deployments, service, service accounts, monitoring agents, and cloud resources like object storage buckets. 

    The WebAppRGD ResourceGraphDefinition can set a default configuration, and also define which parameters can be set by the end user at deployment time (kro gives you the flexibility to decide what is immutable, and what an end user is able to configure). A developer then creates an instance of the WebApp CRD, inputting any user-facing parameters. kro then deploys the desired Kubernetes resource.


    Key benefits of kro

    We believe kro is a big step forward for platform engineering teams, delivering a number of advantages:

    • Kubernetes-native: kro leverages Kubernetes Custom Resource Definitions (CRDs) to extend Kubernetes, so it works with any Kubernetes resource and integrates with existing Kubernetes tools and workflows.

    • Lets you create a simplified end user experience: kro makes it easy to define end-user interfaces for complex groups of Kubernetes resources, making it easy for people who are not Kubernetes experts to consume services built on Kubernetes. 

    • Enables standardized services for application teams: kro templates can be reused across different projects and environments, promoting consistency and reducing duplication of effort.

    Get started with kro

    kro is available as an open-source project on GitHub. The GitHub organization is currently jointly owned by teams from Google, AWS, and Microsoft, and we welcome contributions from the community. We also have a website with documentation on installing and using kro, including example use cases. As an early-stage project, kro is not yet ready for production use, but we still encourage you to test it out in your own Kubernetes development environments!

  59. Senior Product Manager, Cloud Runtimes

    Thu, 23 Jan 2025 17:00:00 -0000

    Platform engineering, one of Gartner’s top 10 strategic technology trends for 2024, is rapidly becoming indispensable for enterprises seeking to accelerate software delivery and improve developer productivity. How does it do that? Platform engineering is about providing the right infrastructure, tools, and processes that enable efficient, scalable software development, deployment, and management, all while minimizing the cognitive burden on developers.

    To uncover the secrets to platform engineering success, Google Cloud partnered with Enterprise Strategy Group (ESG) on a comprehensive research study of 500 global IT professionals and application developers working at organizations with at least 500 employees, all with formal platform engineering teams. Our goal was to understand whether they had adopted platform engineering, and if so, the impact that has had on their company’s software delivery capabilities. 

The resulting report, "Building Competitive Edge With Platform Engineering: A Strategic Guide," reveals common patterns, expectations, and actionable best practices for overcoming challenges and fully leveraging platform engineering. This blog post highlights some of the most powerful insights from this study.


    Platform engineering is no longer optional

    The research confirms that platform engineering is no longer a nascent concept. 55% of the global organizations we invited to participate have already adopted platform engineering. Of those, 90% plan to expand its reach to more developers. Furthermore, 85% of companies using platform engineering report that their developers rely on the platform to succeed. These figures highlight that platform engineering is no longer just a trend; it's becoming a vital strategy for organizations seeking to unlock the full potential of their cloud and IT investments and gain a competitive edge.


    Figure 1: 55% of 900+ global organizations surveyed have adopted platform engineering

    Three keys to platform engineering success

    The report identifies three critical components that are central to the success of mature platform engineering leaders. 

    1. Fostering close collaboration between platform engineers and other teams to ensure alignment 

2. Adopting a "platform as a product" approach: treating the developer platform as a product, with a clear roadmap, communicated value, and tight feedback loops

    3. Defining success by measuring performance through clear metrics such as deployment frequency, failure recovery time, and lead time for changes 

    It's noteworthy that while many organizations have begun their platform engineering journey, only 27% of adopters have fully integrated these three key components in their practices, signaling a significant opportunity for further improvements.

    AI: platform engineering's new partner

One of the most compelling insights of this report is the synergistic relationship between platform engineering and AI. A remarkable 86% of respondents believe that platform engineering is essential to realizing the full business value of AI. At the same time, the vast majority of companies view AI as a catalyst for advancing platform engineering, with 94% of organizations identifying AI as "Critical" or "Important" to the future of platform engineering.


    Beyond speed: key benefits of platform engineering

The study also segmented platform engineering adopters into three cohorts (nascent, established, and leading), based on how fully they had embraced the three key components of platform engineering success described above. The study shows that leading adopters gain more in terms of speed, efficiency, and productivity, and it offers guidance to help nascent and established adopters improve their platform engineering maturity and reap greater benefits.

    The report also identified some additional benefits of platform engineering, including:

    • Improved employee satisfaction, talent acquisition & retention: mature platforms foster a positive developer experience that directly impacts company culture. Developers and IT pros working for organizations with mature developer platforms are much more likely to recommend their workplace to their peers.

    • Accelerated time to market: mature platform engineering adopters have significantly shortened time to market. 71% of leading adopters of platform engineering indicated they have significantly accelerated their time to market, compared with 28% of less mature adopters.

    Don't go it alone

    A vast majority (96%) of surveyed organizations are leveraging open-source tools to build their developer platforms. Moreover, most (84%) are partnering with external vendors to manage and support their open-source environments. Co-managed platforms with a third party or a cloud partner benefit from a higher degree of innovation. Organizations with co-managed platforms allocate an average of 47% of their developers’ productive time to innovation and experimentation, compared to just 38% for those that prefer to manage their platforms with internal staff.

    Ready to succeed? Explore the full report

    While this blog provides a glimpse into the key findings from this study, the full report goes much further, revealing key platform engineering strategies and practices that will help you stay ahead of the curve. Download the report to explore additional topics, including:

    • The strategic considerations of centralized and distributed platform engineering teams

    • The key drivers behind platform engineering investments

    • Top priorities driving platform adoption for developers, ensuring alignment with their needs

    • Key pain points to anticipate and navigate on the road to platform engineering success

    • How platform engineering boosts productivity, performance, and innovation across the entire organization

    • The strategic importance of open source in platform engineering for competitive advantage

    • The transformative role of platform engineering for AI/ML workloads as adoption of AI increases

    • How to develop the right platform engineering strategy to drive scalability and innovation

    Download the full report now.

  60. Software Engineer

    Thu, 23 Jan 2025 17:00:00 -0000

    Editor’s note: This blog post was updated to reflect the general availability status of these features as of March 31, 2025.


    Cloud Deploy is a fully managed continuous delivery platform that automates the delivery of your application. On top of existing automation features, customers tell us they want other ways to automate their deployments to keep their production environments reliable and up to date.

    We're happy to announce three new features to help with that, all in GA.

    1. Repair rollouts

The new repair rollout automation rule lets you retry failed deployments or automatically roll back to a previously successful release when an error occurs. These errors could occur in any phase of a deployment: a pre-deployment SQL migration, a misconfiguration detected when talking to a GKE cluster, or a failed deployment verification step. In any of these cases, the repair rollout automation lets you retry the failed step a configurable number of times, perfect for those occasionally flaky end-to-end tests. If the retry succeeds, the rollout continues. If the retries fail (or none are configured), the repair rollout automation can also roll back to the previously successful release.


    2. Deploy policies

Automating deployments is powerful, but it can also be important to put some constraints on the automation. The new deploy policies feature is intended to limit what these automations (or users) can do. Initially, we're launching a time-windows policy, which can, for example, inhibit deployments during evenings, weekends, or important events. While an on-caller with the Policy Overrider role could "break glass" to get around these policies, automated deployments won't be able to trigger a rollout in the middle of your big demo.

    3. Timed promotions

    After a release is successfully rolled out, you may want to automatically deploy it to the next environment. Our previous auto-promote feature let you promote a release after a specified duration, for example moving it into prod 12 hours after it went to staging. But often you want promotions to happen on a schedule, not based on a delay. Within Google, for example, we typically recommend that teams promote from a dev environment into staging every Thursday, and then start a promotion into prod on Monday mornings. With the new timed promotion automation, Cloud Deploy can handle these scheduled promotions for you. 

    The future

    Comprehensive, easy-to-use, and cost-effective DevOps tools are key to efficient software delivery, and it’s our hope that Cloud Deploy will help you implement complete CI/CD pipelines. Stay tuned as we introduce exciting new capabilities and features to Cloud Deploy in the months to come.

    Update your current pipelines with these new features today. Check out the product page, documentation, quickstarts, and tutorials. Finally, if you have feedback on Cloud Deploy, you can join the conversation. We look forward to hearing from you!

  61. Senior Staff Reliability Engineer

    Thu, 09 Jan 2025 17:00:00 -0000

    Cloud applications like Google Workspace provide benefits such as collaboration, availability, security, and cost-efficiency. However, for cloud application developers, there’s a fundamental conflict between achieving high availability and the constant evolution of cloud applications. Changes to the application, such as new code, configuration updates, or infrastructure rearrangements, can introduce bugs and lead to outages. These risks pose a challenge for developers, who must balance stability and innovation while minimizing disruption to users.

Here on the Google Workspace Site Reliability Engineering team, we once moved a replica of Google Docs to a new data center because we needed extra capacity. But moving the associated data, which was vast, overloaded a key index in our database, restricting users' ability to create new docs. Thankfully, we were able to identify the root cause and mitigate the problem quickly. Still, this experience convinced us of the need to reduce the risk of a global outage from a simple application change.


    Limit the blast radius

    Our approach to reducing the risk of global outages is to limit the “blast radius,” or extent, of an outage by vertically partitioning the serving stack. The basic idea is to run isolated instances (“partitions”) of application servers and storage (Figure 1). Each partition contains all the various servers necessary to service a user request from end to end. Each production partition also has a pseudo-random mix of users and workloads, so all the partitions have similar resource needs. When it comes time to make changes to the application code, we deploy new changes to one partition at a time. Bad changes may cause a partition-wide outage, but we are protected from a global application outage. 
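As a toy sketch of sticky, pseudo-random assignment (illustrative only; this post doesn't describe Google's actual mechanism), hashing a stable user ID yields a fixed partition per user and a roughly uniform mix of users per partition:

```python
import hashlib

NUM_PARTITIONS = 16  # illustrative

def partition_for(user_id: str) -> int:
    """Sticky assignment: the same user always lands in the same partition."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS

assert partition_for("alice@example.com") == partition_for("alice@example.com")

# Rollouts then proceed one partition at a time, e.g.:
# for p in range(NUM_PARTITIONS):
#     roll_out_to(partition=p)
#     verify_health(partition=p)
```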

    Compare this approach to using canarying alone, in which new features or code changes are released to a small group of users before rolling them out to the rest. While canarying deploys changes first to just a few servers, it doesn’t prevent problems from spreading. For example, we’ve had incidents where canaried changes corrupted data used by all the servers in the deployment. With partitioning, the effects of bad changes are isolated to a single partition, preventing such contagion. Of course, in practice, we combine both techniques: canarying new changes to a few servers within a single partition.

Figure 1

    Benefits of partitioning

    Broadly speaking, partitioning brings a lot of advantages:

    • Availability: Initially, the primary motivation for partitioning was to improve the availability of our services and avoid global outages. In a global outage, an entire service may be down (e.g., users cannot log into Gmail), or a critical user journey (e.g., users cannot create Calendar events) — obviously things to be avoided.

      Still, the reliability benefits of partitioning can be hard to quantify; global outages are relatively infrequent, so if you don’t have one for a while, it may be due to partitioning, or may be due to luck. That said, we’ve had several outages that were confined to a single partition, and believe they would have expanded into global outages without it.
    • Flexibility: We evaluate many changes to our systems by experimenting with data. Many user-facing experiments, such as a change to a UI element, use discrete groups of users. For example, in Gmail we can choose an on-disk layout that stores the message bodies of emails inline with the message metadata, or a layout that separates them into different disk files. The right decision depends on subtle aspects of the workload. For example, separating message metadata and bodies may reduce latency for some user interactions, but requires more compute resources in our backend servers to perform joins between the body and metadata columns. With partitioning, we can easily evaluate the impact of these choices in contained, isolated environments. 
    • Data location: Google Workspace lets enterprise customers specify that their data be stored in a specific jurisdiction. In our previous, non-partitioned architecture, such guarantees were difficult to provide, especially since services were designed to be globally replicated to reduce latency and take advantage of available capacity.

    Challenges

    Despite the benefits, there are some challenges to adopt partitioning. In some cases, these challenges make it hard or risky to move from a non-partitioned to a partitioned setup. In other cases, challenges persist even after partitioning. Here are the issues as we see them:

• Not all data models are easy to partition: For example, Google Chat needs to assign both users and chat rooms to partitions. Ideally, a chat room and its members would be in a single partition to avoid cross-partition traffic. However, in practice, this is difficult to accomplish. Chat rooms and users form a graph, with users in many chat rooms and chat rooms containing many users. In the worst case, this graph may have only a single connected component. If we were to slice the graph into partitions, we could not guarantee that all users would be in the same partition as their chat rooms.
    • Partitioning a live service requires care: Most of our services pre-date partitioning. As a result, adopting partitioning means taking a live service and changing its routing and storage setup. Even if the end goal is higher reliability, making these kinds of changes in a live system is often the source of outages, and can be risky.
• Partition misalignment between services: Our services often communicate with each other. For example, if a new person is added to a Calendar event, Calendar servers make a Remote Procedure Call (RPC) to Gmail delivery servers to send the new invitee an email notification. Similarly, Calendar events with video call links require Calendar to talk to Meet servers for a meeting id. Ideally, we would get the benefits of partitioning even across services. However, aligning partitions between services is difficult. The main reason is that different services tend to use different entity types when determining which partition to use. For example, Calendar partitions on the owner of the calendar while Meet partitions on meeting id. The result is that there is no clear mapping from partitions in one service to another.
    • Partitions are smaller than the service: A modern cloud application is served by hundreds or thousands of servers. We run servers at less than full utilization so that we can tolerate spikes in traffic, and because servers that are saturated with traffic generally perform poorly. If we have 500 servers, and target each at 60% CPU utilization, we effectively have 200 spare servers to absorb load spikes. Because we do not fail over between partitions, each partition has access to a much smaller amount of spare capacity. In a non-partitioned setup, a few server crashes may likely go unnoticed, since there is enough headroom to absorb the lost capacity. But in a smaller partition, these crashes may account for a non-trivial portion of the available server capacity, and the remaining servers may become overloaded.

    Key takeaways

    We can improve the availability of web applications by partitioning their serving stacks. These partitions are isolated, because we do not fail over between them. Users and entities are assigned to partitions in a sticky manner to allow us to roll out changes in order of risk tolerance. This approach allows us to roll out changes one partition at a time with confidence that bad changes will only affect a single partition, and ideally that partition contains only users from your organization.

    In short, partitioning supports our efforts to provide stronger and more reliable services to our users, and it might apply to your service as well. For example, you can improve the availability of your application by using Spanner, which provides geo-partitioning out of the box. Read more about geo-partitioning best practices here.


  62. Product Leader for Customer Telemetry, Google Cloud

    Mon, 06 Jan 2025 17:00:00 -0000

    Cloud incidents happen. And when they do, it’s incumbent on the cloud service provider to communicate about the incident to impacted customers quickly and effectively — and for the cloud service consumer to use that information effectively, as part of a larger incident management response. 

    Google Cloud Personalized Service Health provides businesses with fast, transparent, relevant, and actionable communication about Google Cloud service disruptions, tailored to a specific business at its desired level of granularity. Cybersecurity company Palo Alto Networks is one Google Cloud customer and partner that recently integrated Personalized Service Health signals into the incident workflow for its Google Cloud-based PRISMA Access offering, saving its customers critical minutes during active incidents. 

    By programmatically ingesting Personalized Service Health signals into advanced workflow components, Palo Alto can quickly make decisions such as triggering contingency actions to protect business continuity.

    Let’s take a closer look at how Palo Alto integrated Personalized Service Health into its operations.

    The Personalized Service Health integration

    Palo Alto ingests Personalized Service Health logs into its internal AIOps system, which centralizes incident communications for PRISMA Access and applies advanced techniques to classify and distribute signals to the people responsible for responding to a given incident.

    Personalized Service Health UI: incident list view

    Users of Personalized Service Health can filter which relevance levels they want to see. Here, “Partially related” reflects an issue anywhere in the world with products the customer uses. “Related” reflects a problem detected within the customer’s data center regions, while “Impacted” means that Google has verified the impact to the customer for specific services.

    While Google is still confirming an incident, Personalized Service Health communicates it as a 'PSH Emerging Incident' to give customers early notification. Once Google confirms the incident, it is merged into 'PSH Confirmed Incidents'. This helps customers respond faster to a specific incident that’s impacting their environment, or escalate back to Google if needed.

    Personalized Service Health distributes updates throughout an active incident, typically every 30 minutes, or sooner if there’s progress to share. These updates are also written to logs, which Palo Alto ingests into AIOps.

    Responding to disruptive, unplanned cloud service provider incidents can be accelerated by programmatically ingesting and distributing incident communications. This is especially true in large-scale organizations such as Palo Alto, which has multiple teams involved in incident response for different applications, workloads and customers. 

    Fueling the incident lifecycle

    Palo Alto further leverages the ingested Personalized Service Health signals in its AIOps platform, which uses machine learning (ML) and analytics to automate IT operations. AIOps harnesses big data from operational appliances to detect and respond to issues instantaneously.  AIOps correlates these signals with internally generated alerts to declare an incident that is affecting multiple customers. These AIOps alerts are tied to other incident management tools that assist with managing the incident lifecycle, including communication, regular updates and incident resolution.

    In addition, a data enrichment pipeline takes Personalized Service Health incidents, adds Palo Alto’s related information, and publishes the events to Pub/Sub. AIOps then consumes the incident data from Pub/Sub, processes it, correlates it with related event signals, and notifies subscribed channels.
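
    As a sketch of what this consumption step might look like, the following subscriber pulls the enriched incident events from Pub/Sub. The project and subscription IDs, the payload fields, and the notify_oncall hook are all hypothetical.

    import json
    from concurrent import futures

    from google.cloud import pubsub_v1  # pip install google-cloud-pubsub

    subscriber = pubsub_v1.SubscriberClient()
    # Hypothetical project and subscription IDs.
    subscription = subscriber.subscription_path("panw-aiops", "psh-incidents-sub")

    def notify_oncall(event: dict) -> None:
        # Placeholder for a real downstream hook (paging, ticketing, etc.).
        print(f"Paging on-call for incident {event.get('incident_id')}")

    def on_incident(message: pubsub_v1.subscriber.message.Message) -> None:
        event = json.loads(message.data)
        # Field names are illustrative; real payloads carry the Personalized
        # Service Health log entry plus Palo Alto's enrichment data.
        if event.get("relevance") == "IMPACTED":
            notify_oncall(event)
        message.ack()

    streaming_pull = subscriber.subscribe(subscription, callback=on_incident)
    with subscriber:
        try:
            streaming_pull.result(timeout=60)  # process messages for one minute
        except futures.TimeoutError:
            streaming_pull.cancel()
            streaming_pull.result()  # wait for shutdown to complete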

    Palo Alto organizes Google Cloud assets into folders within the Google Cloud console. Each project represents a Palo Alto PRISMA Access customer. To receive incident signals that are likewise specific to end customers, Palo Alto creates a log sink that’s specific to each folder, aggregating service health logs at the folder level. Palo Alto then receives incident signals specific to each customer so it can take further action.
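
    Folder-scoped sinks like the ones described above can also be created programmatically. Below is a minimal sketch using the Cloud Logging API client; the folder ID, Pub/Sub topic, and log filter are placeholder assumptions (consult the Personalized Service Health documentation for the exact log name to filter on).

    from google.cloud.logging_v2.services.config_service_v2 import ConfigServiceV2Client
    from google.cloud.logging_v2.types import LogSink  # pip install google-cloud-logging

    config_client = ConfigServiceV2Client()

    # One sink per customer folder; all IDs here are hypothetical.
    sink = LogSink(
        name="psh-incidents-to-pubsub",
        destination="pubsub.googleapis.com/projects/panw-aiops/topics/psh-incidents",
        # Illustrative filter for service health event logs.
        filter='log_id("servicehealth.googleapis.com/event_log")',
        include_children=True,  # aggregate logs from every project in the folder
    )
    created = config_client.create_sink(parent="folders/123456789", sink=sink)

    # Grant the sink's writer identity permission to publish to the topic.
    print(created.name, created.writer_identity)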

    Palo Alto drives the following actions based on incident communications flowing from Google Cloud:

    • Proactive detection of zonal, inter-regional, and external mass failures

    • Accurate identification of workloads affected by cloud provider incidents

    • Correlation of product issues caused by service degradation in Google Cloud itself

    Seeing Personalized Service Health’s value

    Incidents caused by cloud providers often go unnoticed or are difficult to isolate without involving several of the cloud provider’s teams (support, engineering, SRE, account management). The Personalized Service Health alerting framework, combined with the AIOps correlation engine, allows Palo Alto’s SRE teams to isolate issues caused by a cloud provider near-instantaneously.

    Palo Alto’s incident management workflow is designed to address mass failures as distinct from individual customer outages, ensuring the right teams are engaged until the incidents are resolved. This includes notifying relevant parties, such as the on-call engineer and the Google Cloud support team. With Personalized Service Health, Palo Alto can capture both event types: mass failures as well as individual customer outages.

    Palo Alto gets value from Personalized Service Health in multiple ways, beginning with faster incident response and contingency actions that help maintain business continuity, especially for impacted customers of PRISMA Access. In the event of an incident impacting them, PRISMA Access customers naturally seek and expect information from Palo Alto. By ensuring this information flows rapidly from Google Cloud to Palo Alto’s incident response systems, Palo Alto is able to provide more insightful answers to these end customers, and it plans to serve additional use cases based on both existing and future Personalized Service Health capabilities.

    Take your incident management to the next level

    Google Cloud is continually evolving Personalized Service Health to provide deeper value for all Google Cloud customers — from startups, to ISVs and SaaS providers, to the largest enterprises. Ready to get started? Learn more about Personalized Service Health, or reach out to your account team.


    We'd like to thank Jose Andrade, Pankhuri Kumar and Sudhanshu Jain of Google for their contributions to this collaboration between PANW and Google Cloud.

  63. Staff Software Engineer

    Mon, 09 Dec 2024 17:00:00 -0000

    From helping your developers write better code faster with Code Assist, to helping cloud operators more efficiently manage usage with Cloud Assist, Gemini for Google Cloud is your personal AI-powered assistant. 

    However, understanding exactly how your internal users are using Gemini has been a challenge — until today. 

    Today we are announcing Cloud Logging and Cloud Monitoring support for Gemini for Google Cloud. Currently in public preview, Cloud Logging records requests and responses between Gemini for Google Cloud and individual users, while Cloud Monitoring reports 1-day, 7-day, and 28-day Gemini for Google Cloud active users and response counts in aggregate.

    Cloud Logging

    In addition to offering customers general visibility into the impact of Gemini, there are a few scenarios where logs are useful:

    • to track the provenance of your AI-generated content

    • to record and review user usage of Gemini for Google Cloud 

    This feature is opt-in; when enabled, it logs your users’ Gemini for Google Cloud activity to Cloud Logging (Cloud Logging charges apply).

    Once enabled, log entries are made for each request to and response from Gemini for Google Cloud. For a typical request, Logs Explorer provides an entry similar to the following example:

    (example request log entry)

    There are several things to note about this entry:

    • The content inside jsonPayload contains information about the request. In this case, it was a request to complete Python code with def fibonacci as the input. 

    • The labels tell you the method (CompleteCode), the product (code_assist), and the user who initiated the request (cal@google.com). 

    • The resource labels tell you the instance, location, and resource container (typically project) where the request occurred. 

    In a typical response entry, you’ll see the following:

    (example response log entry)

    Note that the request_id inside the labels is identical for this pair of request and response entries, enabling identification of request and response pairs.

    In addition to Logs Explorer, Log Analytics supports queries to analyze your log data and help you answer questions like “How many requests did User XYZ make to Code Assist?”
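
    As a sketch of this kind of analysis, the following counts Code Assist requests per user with the Cloud Logging Python client. The filter string and label keys (product, method, user) are assumptions based on the entry fields described above; check the logging documentation for the exact names.

    from collections import Counter
    from google.cloud import logging  # pip install google-cloud-logging

    client = logging.Client(project="my-project")  # hypothetical project ID

    # Illustrative filter matching Code Assist request entries.
    log_filter = 'labels.product="code_assist" AND labels.method="CompleteCode"'

    requests_per_user = Counter()
    for entry in client.list_entries(filter_=log_filter, page_size=1000):
        requests_per_user[entry.labels.get("user", "unknown")] += 1

    for user, count in requests_per_user.most_common():
        print(f"{user}: {count} Code Assist requests")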

    For more details, please see the Gemini for Google Cloud logging documentation.

    Cloud Monitoring 

    Gemini for Google Cloud monitoring metrics help you answer questions like: 

    • How many unique active users used Gemini for Google Cloud services over the past day or seven days? 

    • How many total responses did my users receive from Gemini for Google Cloud services over the past six hours?

    Cloud Monitoring support for Gemini for Google Cloud is available to anybody who uses a Gemini for Google Cloud product. It records responses and active users as Cloud Monitoring metrics, with which dashboards and alerts can be configured.

    Because these metrics are available in Cloud Monitoring, you can also use them as part of Cloud Monitoring dashboards. A “Gemini for Google Cloud” dashboard is automatically installed under “GCP Dashboards” when Gemini for Google Cloud usage is detected:

    (Gemini for Google Cloud dashboard)

    Metrics Explorer offers another avenue for examining metrics and applying filters to gain a more detailed view of your usage. To use it, select the “Cloud AI Companion Instance” active resource in Metrics Explorer:

    (Metrics Explorer view of Gemini for Google Cloud metrics)

    In the example above, response_count is the number of responses sent by Gemini for Google Cloud; it can be filtered for Gemini Code Assist or by the Gemini for Google Cloud method (code completion/generation).
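
    As an illustration, the sketch below totals response_count over the past six hours with the Cloud Monitoring Python client. The metric type string is an assumption for illustration only; the monitoring documentation lists the actual metric names.

    import time
    from google.cloud import monitoring_v3  # pip install google-cloud-monitoring

    client = monitoring_v3.MetricServiceClient()
    now = int(time.time())
    interval = monitoring_v3.TimeInterval(
        {"end_time": {"seconds": now}, "start_time": {"seconds": now - 6 * 3600}}
    )

    results = client.list_time_series(
        request={
            "name": "projects/my-project",  # hypothetical project
            # Assumed metric type, for illustration only:
            "filter": 'metric.type = "cloudaicompanion.googleapis.com/instance/response_count"',
            "interval": interval,
            "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
        }
    )

    # Assumes an int64-valued metric; adjust for the actual value type.
    total = sum(point.value.int64_value for series in results for point in series.points)
    print(f"Responses in the past six hours: {total}")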

    For more details, please see the Gemini for Google Cloud monitoring documentation.

    What’s next

    We’re continually working on additions to these new capabilities, and in particular are focused on Code Assist logging and metrics enhancements that will bring even further insight and observability into your use of Gemini Code Assist and its impact. To get started with Gemini Code Assist and learn more about Gemini Cloud Assist — as well as observability data about it from Cloud Logging and Monitoring — check out the following links: 

  64. EMEA Practice Solutions Lead, Application Platform

    Tue, 22 Oct 2024 17:00:00 -0000

    At the end of the day, developers build, test, deploy and maintain software. But as with lots of things, it’s about the journey, not the destination.

    Among platform engineers, we sometimes refer to that journey as the developer experience (DX), which encompasses how developers feel and interact with the tools and services they use throughout the software build, test, deployment and maintenance process.

    Prioritizing DX is essential: Frustrated developers lead to inefficiency and talent loss as well as to shadow IT. Conversely, a positive DX drives innovation, community, and productivity. And if you want to provide a positive DX, you need to start measuring how you’re doing.

    At PlatformCon 2024, I gave a talk entitled "Improving your developers' platform experience by applying Google frameworks and methods” where I spoke about Google’s HEART Framework, which provides a holistic view of your organization's developers’ experience through actionable data.

    In this article, I will share ideas on how you can apply the HEART framework to your Platform Engineering practice, to gain a more comprehensive view of your organization’s developer experience. But before I do that, let me explain what the HEART Framework is.

    The HEART Framework: an introduction

    In a nutshell, HEART measures developer behaviors and attitudes from their experience of your platform and provides you with insights into what’s going on behind the numbers, by defining specific metrics to track progress towards goals. This is beneficial because continuous improvements through feedback are vital components of a platform engineering journey, helping both platform and application product teams make decisions that are data-driven and user-centered.

    However, HEART is not a data collection tool in and of itself; rather, it’s a user-sentiment framework for selecting the right metrics to focus on based on product or platform objectives. It balances quantitative or empirical data, e.g., number of active portal users, with qualitative or subjective insights such as "My users feel the portal navigation is confusing." In other words, consider HEART as a framework or methodology for assessing user experience, rather than a specific tool or assessment. It helps you decide what to measure, not how to measure it.

    (The five HEART categories: Happiness, Engagement, Adoption, Retention, and Task success)

    Let’s take a look at each of these in more detail.

    Happiness: Do users actually enjoy using your product?

    Highlight: Gathering and analyzing developer feedback

    Subjective metrics:

    • Surveys: Conduct regular surveys to gather feedback about overall satisfaction, ease of use, and pain points. Toil negatively affects developer satisfaction and morale: repetitive, manual work can lead to frustration, burnout, and decreased happiness with the platform.

    • Feedback mechanisms: Establish easy ways for developers to provide direct feedback on specific features or areas of the platform, such as Net Promoter Score (NPS) or Customer Satisfaction (CSAT) surveys.

    • Collect open-ended feedback from developers through interviews and user groups.

    • Sentiment analysis: Analyze developer sentiment expressed in feedback channels, support tickets and online communities.

    System metrics:

    • Feature requests: Track the number and types of feature requests submitted by developers. This provides insights into their needs and desires and can help you prioritize improvements that will enhance happiness.

    Watch out for: While platforms can boost developer productivity, they might not necessarily contribute to developer job satisfaction. This warrants further investigation, especially if your research suggests that your developers are unhappy.

    Engagement: What is the developer breadth and quality of platform experience?

    Highlight: Frequency and quality of interaction between platform engineers and developers — the intensity and quality of interaction with the platform, participation in chat channels, training, dual ownership of golden paths, joint troubleshooting, engagement in architectural design discussions, and the breadth of interaction by everyone from new hires through to senior developers.

    Subjective metrics:

    • Survey for quality of interaction — focus on the depth and type of interaction, whether through chat channels, trainings, dual ownership of golden paths, joint troubleshooting, or architectural design discussions

    • High toil can reduce developer engagement with the platform. When developers spend excessive amounts of time on tedious tasks, they are less likely to explore new features, experiment, and contribute to the platform's evolution.

    System metrics:

    • Active users: Track daily, weekly, and monthly active developers and how long they spend on tasks.

    • Usage patterns: Analyze the most used platform features, tools, and portal resources.

    • Frequency of interaction between platform engineers with developers.

    • Breadth of user engagement: Track onboarding time for new hires to reach proficiency, measure the percentage of senior developers actively contributing to golden paths or portal functionality.

    Watch out for: Don’t confuse engagement with satisfaction. Developers may rate the platform highly in surveys, but usage data might reveal low frequency of interaction with core features or a limited subset of teams actively using the platform. Ask them “How has the platform changed your daily workflow?” rather than “Are you satisfied with the platform?”

    Adoption: What is the platform growth rate and developer feature adoption?

    Highlight: Overall acceptance and integration of the platform into the development workflow.

    System metrics:

    • New user registrations: Monitor the growth rate of new developers using the platform.

    • Track the time between registration and first use of the platform, i.e., executing golden paths, tooling, and portal functionality.

    • Number of active users per week / month / quarter / half-year / year who authenticate via the portal and/or use golden paths, tooling and portal functionality

    • Feature adoption: Track how quickly and widely new features or updates are used.

    • Percentage of developers using CI/CD through the platform

    • Number of deployments per user / team / day / week / month — at a granularity of your choosing

    • Training: Evaluate changes in adoption, after delivering training.

    Watch out for: Overlooking the "long tail" of adoption. A platform might see a burst of early adoption, but then plateau or even decline if it fails to continuously evolve and meet changing developer needs. Don't just measure initial adoption, monitor how usage evolves over weeks, months, and years.

    Retention: Are developers loyal to the platform?

    Highlight: Long-term engagement and reducing churn.

    Subjective metrics:

    • Use an exit survey if a user is dormant for 12 or more months.

    System metrics:

    • Churn rate: Track the percentage of developers who stop logging into the portal and are not using it.

    • Dormant users: Identify developers who become inactive after 6 months and investigate why.

    • Track services that are less frequently used.

    Watch out for: Misinterpreting the reasons for churn. When developers stop using your platform (churn), it's crucial to understand why. Incorrectly identifying the cause can lead to wasted effort and missed opportunities for improvement. Consider factors outside the platform — churn could be caused by changes in project requirements, team structures or industry trends.

    Task success: Can developers complete specific tasks?

    Highlight: Efficiency and effectiveness of the platform in supporting specific developer activities.

    Subjective metrics:

    • Survey to assess the ongoing presence of toil and its harmful influence on developer productivity, which ultimately hinders efficiency and increases task completion times.

    System metrics:

    • Completion rates: Measure the percentage of golden paths and tools successfully run on the platform without errors.

    • Time to complete tasks using golden paths, portal, or tooling.

    • Error rates: Track common errors and failures developers encounter from log files or monitoring dashboards from golden paths, portal or tooling.

    • Mean Time to Resolution (MTTR): When errors do occur, how long does it take to resolve them? A lower MTTR indicates a more resilient platform and faster recovery from failures.

    • Developer platform and portal uptime: Measure the percentage of time that the developer platform and portal are available and operational. Higher uptime ensures developers can consistently access the platform and complete their tasks.

    Watch out for: Don't confuse task success with task completion. Simply measuring whether developers can complete tasks on the platform doesn't necessarily indicate true success. Developers might find workarounds or complete tasks inefficiently, even if they technically achieve the end goal. It may be worth manually observing developer workflows in their natural environment to identify pain points and areas of friction in their workflows.

    Also, be careful not to misalign task success with business goals. Measuring task completion alone might overlook the broader impact on business objectives. A platform might enable developers to complete tasks efficiently, but if those tasks don’t contribute to overall business goals, the platform’s true value is questionable.
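
    To ground a few of the system metrics above, here is a minimal sketch that computes churn rate, golden-path completion rate, and MTTR from a simple event log. The Event schema and action names are hypothetical; adapt them to whatever your platform telemetry actually emits.

    from dataclasses import dataclass
    from datetime import datetime, timedelta

    @dataclass
    class Event:
        user: str
        action: str                      # e.g. "login", "golden_path_run", "incident_resolved"
        timestamp: datetime
        success: bool = True
        minutes_to_resolve: float = 0.0  # only meaningful for incident events

    def churn_rate(events, now, dormant_after=timedelta(days=180)):
        """Share of known users whose last activity is older than the dormancy window."""
        last_seen = {}
        for e in events:
            last_seen[e.user] = max(last_seen.get(e.user, e.timestamp), e.timestamp)
        dormant = sum(1 for t in last_seen.values() if now - t > dormant_after)
        return dormant / len(last_seen) if last_seen else 0.0

    def completion_rate(events, action="golden_path_run"):
        """Percentage of golden-path runs that finished without errors."""
        runs = [e for e in events if e.action == action]
        return 100 * sum(e.success for e in runs) / len(runs) if runs else 0.0

    def mttr(events):
        """Mean time to resolution, in minutes, across resolved incidents."""
        incidents = [e for e in events if e.action == "incident_resolved"]
        return sum(e.minutes_to_resolve for e in incidents) / len(incidents) if incidents else 0.0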

    Applying the HEART framework to platform engineering

    It’s not necessary to use all of the categories each time. The number of categories to consider really depends on the specific goals and context of the assessment; you can include everything or trim it down to better match your objective. Here are some examples:

    • Improving onboarding for new developers: Focus on adoption, task success and happiness.

    • Launching a new feature: Concentrate on adoption and happiness.

    • Increasing platform usage: Track engagement, retention and task success.

    Keep in mind that relying on just one category will likely provide an incomplete picture.

    When should you use the framework?

    In a perfect world, you would use the HEART framework to establish a baseline assessment a few months after launching your platform, which will provide you with valuable insight into early developer experience. As your platform evolves, this initial data becomes a benchmark for measuring progress and identifying trends. Early measurement allows you to proactively address UX issues, guide design decisions with data, and iterate quickly for optimal functionality and developer satisfaction. If you're starting with an MVP, conduct the baseline assessment once the core functionality is in place and you have a small group of early users to provide feedback.

    After 12 or more months of usage, you can also add metrics to embody a new or more mature platform. This can help you gather deeper insights into your developers’ experience by understanding how they are using the platform, measure the impact of changes you’ve made to the platform, or identify areas for improvement and prioritize future development efforts. If you've added new golden paths, tooling, or enhanced functionality, then you'll need to track metrics that measure their success and impact on developer behavior.

    The frequency with which you assess HEART metrics depends on several factors, including:

    • The maturity of your platform: Newer platforms benefit from more frequent reviews (e.g. monthly or quarterly) to track progress and address early issues. As the platform matures, you can reduce the frequency of your HEART assessments (e.g., bi-annually or annually).

    • The rate of change: To ensure updates and changes have a positive impact, apply the HEART framework more frequently when your platform is undergoing a period of rapid evolution such as major platform updates, new portal features or new golden paths, or some change in user behavior. This allows you to closely monitor the effects of each change on key metrics.

    • The size and complexity of your platform: Larger and more complex platforms may require more frequent assessments to capture nuances and potential issues.

    • Your team's capacity: Running HEART assessments requires time and resources. Consider your team's bandwidth and adjust the frequency accordingly.

    Schedule periodic deep dives (e.g. quarterly or bi-annually) using the HEART framework to gain a more in-depth understanding of your platform's performance and identify areas for improvement.

    Taking more steps towards platform engineering

    In this blog post, we’ve shown how the HEART framework can be applied to platform engineering to measure and improve the developer experience. We’ve explored the five key aspects of the framework — happiness, engagement, adoption, retention, and task success — and provided specific metrics for each and guidance on when to apply them. By applying these insights, platform engineering teams can create a more positive and productive environment for their developers, leading to greater success in their software development efforts. To learn more about platform engineering, check out some of our other articles: 5 myths about platform engineering: what it is and what it isn’t, Another five myths about platform engineering, and Laying the foundation for a career in platform engineering.

    And finally, check out the DORA Report 2024, which now has a section on Platform Engineering.

  65. DORA Research Lead

    Tue, 22 Oct 2024 16:00:00 -0000

    The DORA research program has been investigating the capabilities, practices, and measures of high-performing technology-driven teams and organizations for more than a decade. It has published reports based on data collected from annual surveys of professionals working in technical roles, including software developers, managers, and senior executives.

    Today, we’re pleased to announce the publication of the 2024 Accelerate State of DevOps Report, marking a decade of DORA’s investigation into high-performing technology teams and organizations. DORA’s four key metrics, introduced in 2013, have become the industry standard for measuring software delivery performance. 

    Each year, we seek to gain a comprehensive understanding of standard DORA performance metrics, and how they intersect with individual, workflow, team, and product performance. We now include how AI adoption affects software development across multiple levels, too.

    We also establish reference points each year to help teams understand how they are performing, relative to their peers, and to inspire teams with the knowledge that elite performance is possible in every industry. DORA’s research over the last decade has been designed to help teams get better at getting better: to strive to improve their improvements year over year. 

    For a quick overview of this year’s report, you can read our executive DORA Report summary, which spotlights AI adoption trends and impact, the emergence of platform engineering, and the continuing significance of developer experience.

    Organizations across all industries are prioritizing the integration of AI into their applications and services. Developers are increasingly relying on AI to improve their productivity and fulfill their core responsibilities. This year's research reveals a complex landscape of benefits and tradeoffs for AI adoption.

    The report underscores the need to approach platform engineering thoughtfully, and emphasizes the critical role of developer experience in achieving high performance. 

    AI: Benefits, challenges, and developing trust

    Widespread AI adoption is reshaping software development practices. More than 75 percent of respondents said that they rely on AI for at least one daily professional responsibility. The most prevalent use cases include code writing, information summarization, and code explanation. 

    The report confirms that AI is boosting productivity for many developers. More than one-third of respondents experienced “moderate” to “extreme” productivity increases due to AI.

    A 25% increase in AI adoption is associated with improvements in several key areas:

    • 7.5% increase in documentation quality

    • 3.4% increase in code quality

    • 3.1% increase in code review speed

    However, despite AI’s potential benefits, our research revealed a critical finding: AI adoption may negatively impact software delivery performance. As AI adoption increased, it was accompanied by an estimated 1.5% decrease in delivery throughput and an estimated 7.2% reduction in delivery stability. Our data suggest that improving the development process does not automatically improve software delivery — at least not without proper adherence to the basics of successful software delivery, like small batch sizes and robust testing mechanisms. AI has positive impacts on many important individual and organizational factors that foster the conditions for high software delivery performance. But AI does not appear to be a panacea.

    Our research also shows that despite the productivity gains, 39% of respondents reported little to no trust in AI-generated code. This unexpectedly low level of trust indicates to us that there is a need to manage AI integration more thoughtfully. Teams must carefully evaluate AI’s role in their development workflow to mitigate the downsides.

    Based on these findings, we have three core recommendations:

    1. Enable your employees and reduce toil by orienting your AI adoption strategies towards empowering employees and alleviating the burden of undesirable tasks.

    2. Establish clear guidelines for the use of AI, address procedural concerns, and foster open communication about its impact.

    3. Encourage continuous exploration of AI tools, provide dedicated time for experimentation, and promote trust through hands-on experience.

    Platform engineering: A paradigm shift

    Another emerging discipline our research focused on this year is platform engineering, which centers on building and operating internal development platforms to streamline processes and enhance efficiency.

    Our research identified 4 key findings regarding platform engineering:

    • Increased developer productivity: Internal development platforms effectively increase productivity for developers.

    • Prevalence in larger firms: These platforms are more commonly found in larger organizations, suggesting their suitability for managing complex development environments.

    • Potential performance dip: Implementing a platform engineering initiative might lead to a temporary decrease in performance before improvements manifest as the platform matures.

    • Need for user-centeredness and developer independence: For optimal results, platform engineering efforts should prioritize user-centered design, developer independence, and a product-oriented approach.

    A thoughtful approach that prioritizes user needs, empowers developers, and anticipates potential challenges is key to maximizing the benefits of platform engineering initiatives. 

    Developer experience: The cornerstone of success

    One of the key insights in last year’s report was that a healthy culture can help reduce burnout, increase productivity, and increase job satisfaction. This year was no different. Teams that cultivate a stable and supportive environment that empowers developers to excel drive positive outcomes. 

    A ‘move fast and constantly pivot’ mentality negatively impacts developer well-being and, consequently, overall performance. Instability in priorities, even with strong leadership, comprehensive documentation, and a user-centered approach — all known to be highly beneficial — can significantly hinder progress.

    Creating a work environment where your team feels supported, valued, and empowered to contribute is fundamental to achieving high performance. 

    How to use these findings to help your DevOps team

    The key takeaway from the decade of research is that software development success hinges not just on technical prowess but also on fostering a supportive culture, prioritizing user needs, and focusing on developer experience. We encourage teams to replicate our findings within their specific context.

    The findings can be used as hypotheses for your experiments and continuous improvement initiatives. Please share the results with us and the DORA community, so that your efforts can become part of our collaborative learning environment.

    We work on this research in hopes that it serves as a roadmap for teams and organizations seeking to improve their practices and create a thriving environment for innovation, collaboration, and business success. We will continue our platform-agnostic research that focuses on the human aspect of technology for the next decade to come.

    To learn more:

  66. Product Manager, Google Cloud Databases

    Thu, 10 Oct 2024 14:00:00 -0000

    Organizations are grappling with an explosion of operational data spread across an increasingly diverse and complex database landscape. This complexity often results in costly outages, performance bottlenecks, security vulnerabilities, and compliance gaps, hindering their ability to extract valuable insights and deliver exceptional customer experiences. To help businesses overcome these challenges, earlier this year, we announced the preview of Database Center, an AI-powered, unified fleet management solution.

    We’re seeing accelerated adoption for Database Center from many customers. For example, Ford uses Database Center to get answers on their database fleet health in seconds, and proactively mitigates potential risks to their applications. Today, we’re announcing that Database Center is now available to all customers, empowering you to monitor and operate database fleets at scale with a single, unified solution. We've also added support for Spanner, so you can manage it along with your Cloud SQL and AlloyDB deployments, with support for additional databases on the way.

    Database Center is designed to bring order to the chaos of your database fleet, and unlock the true potential of your data. It provides a single, intuitive interface where you can:

    • Gain a comprehensive view of your entire database fleet. No more silos of information or hunting through bespoke tools and spreadsheets.

    • Proactively de-risk your fleet with intelligent performance and security recommendations. Database Center provides actionable insights to help you stay ahead of potential problems, and helps improve performance, reduce costs and enhance security with data-driven suggestions.

    • Optimize your database fleet with AI-powered assistance. Use a natural-language chat interface to ask questions, quickly resolve fleet issues, and get optimization recommendations.

    Let’s now review each in more detail.

    Gain a comprehensive view of your database fleet 

    Tired of juggling different tools and consoles to keep track of your databases?

    Database Center simplifies database management with a single, unified view of your entire database landscape. You can monitor database resources across your entire organization, spanning multiple engines, versions, regions, projects and environments (or applications using labels). 

    Cloud SQL, AlloyDB, and now Spanner are all fully integrated with Database Center, so you can monitor your inventory and proactively detect issues. Using the unified inventory view in Database Center, you can: 

    • Identify out-of-date database versions to ensure proper support and reliability

    • Track version upgrades, e.g., whether the PostgreSQL 14 to PostgreSQL 15 upgrade is proceeding at the expected pace

    • Ensure database resources are appropriately distributed, e.g., identify the number of databases powering the critical production applications vs. non-critical dev/test environments

    • Monitor database migration from on-prem to cloud or across engines

    Manage Cloud SQL, AlloyDB and Spanner resources with a unified view.

    Proactively de-risk your fleet with recommendations

    Managing your database fleet health at scale can involve navigating through a complex blend of security postures, data protection settings, resource configurations, performance tuning and cost optimizations. Database Center proactively detects issues associated with these configurations and guides you through addressing them. 

    For example, a high transaction ID count in a Cloud SQL instance can lead to the database no longer accepting new queries, potentially causing latency issues or even downtime. Database Center proactively detects this, provides an in-depth explanation, and walks you through prescriptive steps to troubleshoot the issue.

    We’ve also added several performance recommendations to Database Center related to excessive tables/joins, connections, or logs, and can assist you through a simple optimization journey.

    End-to-end workflow for detecting and troubleshooting performance issues.

    Database Center also simplifies compliance management by automatically detecting and reporting violations across a wide range of industry standards, including CIS, PCI-DSS, SOC 2, and HIPAA. Database Center continuously monitors your databases for potential compliance violations. When a violation is detected, you receive a clear explanation of the problem, including:

    • The specific security or reliability issue causing the violation 

    • Actionable steps to help address the issue and restore compliance

    This helps reduce the risk of costly penalties, simplifies compliance audits and strengthens your security posture. Database Center now also supports real-time detection of unauthorized access, updates, and data exports.

    Database Center helps ensure compliance to HIPAA standards.

    Optimize your fleet with AI-powered assistance

    With Gemini enabled, Database Center makes optimizing your database fleet incredibly intuitive. Simply chat with the AI-powered interface to get precise answers, uncover issues within your database fleet, troubleshoot problems, and quickly implement solutions. For example, you can quickly identify under-provisioned instances across your entire fleet, access actionable insights such as the duration of high CPU/Memory utilization conditions, receive recommendations for optimal CPU/memory configurations, and learn about the associated cost of those adjustments. 

    AI-powered chat in Database Center provides comprehensive information and recommendations across all aspects of database management, including inventory, performance, availability and data protection. Additionally, AI-powered cost recommendations suggest ways for optimizing your spend, and advanced security and compliance recommendations help strengthen your security and compliance posture.

    AI-powered chat to identify data protection issues and optimize cost.

    Get started with Database Center today

    The new capabilities of Database Center are available in preview today for Spanner, Cloud SQL, and AlloyDB for all customers. Simply access Database Center within the Google Cloud console and begin monitoring and managing your entire database fleet. To learn more about Database Center’s capabilities, check out the documentation.

  67. Kong Acquires OpenMeter for API Metering and Billing

    Wed, 03 Sep 2025 13:00:45 -0000

    Kong acquires OpenMeter to add usage-based API metering and billing to Kong Konnect—critical for managing AI, LLM, and cloud service costs securely
  68. Can Vibe Coding Work on an Enterprise Level?

    Wed, 03 Sep 2025 10:12:37 -0000

    Vibe coding promises AI-generated apps in seconds, but is it enterprise-ready? Explore risks, governance needs, and productivity potential.
  69. The AI Implementation Paradox: Why Your Best Strategy Might Be Your Smallest Move

    Tue, 02 Sep 2025 17:16:32 -0000

    Technology leaders are wrestling with the same contradiction: they know AI will transform their business, but they’re paralyzed by the scope of that transformation. The result? Organizations either chase moonshot AI projects that never deliver, or they delay action until competitors force their hand.
  70. Accelerating DevOps Automation: How AWS and Platformr Streamline Your Cloud Journey

    Tue, 02 Sep 2025 17:14:14 -0000

    Innovation simply can’t wait. Not today, when rapid evolution and change are the baseline, the table stakes. Organizations are under constant pressure to ship software faster, make their operations more reliable, and scale everything efficiently. And they have to do all this while dealing with complex regulations, ongoing talent shortages, and ever-increasing technical complexity. In […]
  71. Accelerate Data Management Modernization with Persistent’s iAURA 2.0 Agentic AI Suite and AWS

    Tue, 02 Sep 2025 17:06:40 -0000

    As businesses like yours turn to AI to drive innovation, data has become the strategic lever for agility and growth. Yet for many organizations, the promise of AI remains out of reach because entrenched legacy data systems stand in the way of progress. These aging architectures, characterized by siloed data, technical debt, and a lack […]
  72. Vibing with the Future: Why “Vibe Coding” Is the Next Big Wave for DevOps and CI/CD Teams

    Tue, 02 Sep 2025 11:05:25 -0000

    Vibe coding blends AI with team culture, style, and workflows — reshaping DevOps, CI/CD, and enterprise development for speed, alignment, and resilience.
  73. Five Great DevOps Job Opportunities

    Mon, 01 Sep 2025 05:30:05 -0000

    The five DevOps job postings shared this week are selected from available opportunities at Citi, Qualcomm, DCI solutions and others.
  74. Malicious Nx Packages Used in Two Waves of Supply Chain Attack

    Fri, 29 Aug 2025 17:15:10 -0000

    The Nx build system was hit by a supply chain attack dubbed “s1ngularity,” leaking thousands of secrets and exploiting AI tools for data theft.
  75. Bringing Order to Chaotic Software Engineering Workflows

    Fri, 29 Aug 2025 16:16:47 -0000

    Software development has never been tidy, but the current landscape feels more chaotic than ever. Shannon Mason, chief strategy officer for Tempo, dives into why software engineering workflows remain chaotic and what teams should be doing to try and restore application development order. Developers today face a relentless push to innovate while keeping complex codebases […]
  76. Why APIs Alone Won’t Cut It in the AI Era

    Fri, 29 Aug 2025 16:03:12 -0000

    Kumar Chivukula, co-founder and CEO of Codeglide.ai (a subsidiary of Upsera), explains why the rise of the Model Context Protocol (MCP) is reshaping how enterprises connect APIs to large language models. For years, APIs have served as the backbone of data access, but they were never designed with AI in mind. They lack memory, context, […]
  77. Image editing in Gemini just got a major upgrade

    Tue, 26 Aug 2025 14:00:59 -0000

    Transform images in amazing new ways with updated native image editing in the Gemini app.
  78. Introducing Gemma 3 270M: The compact model for hyper-efficient AI

    Thu, 14 Aug 2025 16:00:00 -0000

    Today, we're adding a new, highly specialized tool to the Gemma 3 toolkit: Gemma 3 270M, a compact, 270-million parameter model.
  79. How AI is helping advance the science of bioacoustics to save endangered species

    Thu, 07 Aug 2025 14:59:00 -0000

    Our new Perch model helps conservationists analyze audio faster to protect endangered species, from Hawaiian honeycreepers to coral reefs.
  80. Genie 3: A new frontier for world models

    Tue, 05 Aug 2025 14:00:00 -0000

    Genie 3 can generate dynamic worlds that you can navigate in real time at 24 frames per second, retaining consistency for a few minutes at a resolution of 720p.
  81. Rethinking how we measure AI intelligence

    Mon, 04 Aug 2025 16:07:00 -0000

    Game Arena is a new, open-source platform for rigorous evaluation of AI models. It allows for head-to-head comparison of frontier systems in environments with clear winning conditions.
  82. Try Deep Think in the Gemini app

    Fri, 01 Aug 2025 11:09:00 -0000

    Deep Think utilizes extended, parallel thinking and novel reinforcement learning techniques for significantly improved problem-solving.
  83. AlphaEarth Foundations helps map our planet in unprecedented detail

    Wed, 30 Jul 2025 14:00:00 -0000

    New AI model integrates petabytes of Earth observation data to generate a unified data representation that revolutionizes global mapping and monitoring
  84. Aeneas transforms how historians connect the past

    Wed, 23 Jul 2025 14:59:00 -0000

    We’re publishing a paper in Nature introducing Aeneas, the first AI model for contextualizing ancient inscriptions.
  85. Gemini 2.5 Flash-Lite is now ready for scaled production use

    Tue, 22 Jul 2025 16:00:00 -0000

    Gemini 2.5 Flash-Lite, previously in preview, is now stable and generally available. This cost-efficient model provides high quality in a small size, and includes 2.5 family features like a 1 million-token context window and multimodality.
  86. Advanced version of Gemini with Deep Think officially achieves gold-medal standard at the International Mathematical Olympiad

    Mon, 21 Jul 2025 16:30:00 -0000

    Our advanced model officially achieved a gold-medal level performance on problems from the International Mathematical Olympiad (IMO), the world’s most prestigious competition for young mathematicians. It earned a total of 35 points by perfectly solving five out of the six problems.
  87. Exploring the context of online images with Backstory

    Mon, 21 Jul 2025 15:00:00 -0000

    New experimental AI tool helps people explore the context and origin of images seen online.
  88. AlphaGenome: AI for better understanding the genome

    Wed, 25 Jun 2025 13:59:00 -0000

    Introducing a new, unifying DNA sequence model that advances regulatory variant-effect prediction and promises to shed new light on genome function — now available via API.
  89. Gemini Robotics On-Device brings AI to local robotic devices

    Tue, 24 Jun 2025 14:00:00 -0000

    We’re introducing an efficient, on-device robotics model with general-purpose dexterity and fast task adaptation.
  90. Gemini 2.5: Updates to our family of thinking models

    Tue, 17 Jun 2025 16:03:00 -0000

    Explore the latest Gemini 2.5 model updates with enhanced performance and accuracy: Gemini 2.5 Pro now stable, Flash generally available, and the new Flash-Lite in preview.
  91. We’re expanding our Gemini 2.5 family of models

    Tue, 17 Jun 2025 16:01:00 -0000

    Gemini 2.5 Flash and Pro are now generally available, and we’re introducing 2.5 Flash-Lite, our most cost-efficient and fastest 2.5 model yet.
  92. Behind “ANCESTRA”: combining Veo with live-action filmmaking

    Fri, 13 Jun 2025 13:30:00 -0000

    We partnered with Darren Aronofsky, Eliza McNitt and a team of more than 200 people to make a film using Veo and live-action filmmaking.
  93. How we're supporting better tropical cyclone prediction with AI

    Thu, 12 Jun 2025 15:00:00 -0000

    We’re launching Weather Lab, featuring our experimental cyclone predictions, and we’re partnering with the U.S. National Hurricane Center to support their forecasts and warnings this cyclone season.
  94. Advanced audio dialog and generation with Gemini 2.5

    Tue, 03 Jun 2025 17:15:00 -0000

    Gemini 2.5 has new capabilities in AI-powered audio dialog and generation.
  95. Our vision for building a universal AI assistant

    Tue, 20 May 2025 09:45:00 -0000

    We’re extending Gemini to become a world model that can make plans and imagine new experiences by simulating aspects of the world.
  96. Advancing Gemini's security safeguards

    Tue, 20 May 2025 09:45:00 -0000

    We’ve made Gemini 2.5 our most secure model family to date.
  97. Fuel your creativity with new generative media models and tools

    Tue, 20 May 2025 09:45:00 -0000

    Introducing Veo 3 and Imagen 4, and a new tool for filmmaking called Flow.
  98. SynthID Detector — a new portal to help identify AI-generated content

    Tue, 20 May 2025 09:45:00 -0000

    Learn about the new SynthID Detector portal we announced at I/O to help people understand how the content they see online was generated.
  99. Announcing Gemma 3n preview: Powerful, efficient, mobile-first AI

    Tue, 20 May 2025 09:45:00 -0000

    Gemma 3n is a cutting-edge open model designed for fast, multimodal AI on devices, featuring optimized performance, unique flexibility with a 2-in-1 model, and expanded multimodal understanding with audio, empowering developers to build live, interactive applications and sophisticated audio-centric experiences.
  100. Gemini 2.5: Our most intelligent models are getting even better

    Tue, 20 May 2025 09:45:00 -0000

    Gemini 2.5 Pro continues to be loved by developers as the best model for coding, and 2.5 Flash is getting even better with a new update. We’re bringing new capabilities to our models, including Deep Think, an experimental enhanced reasoning mode for 2.5 Pro.
  101. AlphaEvolve: A Gemini-powered coding agent for designing advanced algorithms

    Wed, 14 May 2025 14:59:00 -0000

    New AI agent evolves algorithms for math and practical applications in computing by combining the creativity of large language models with automated evaluators
  102. Gemini 2.5 Pro Preview: even better coding performance

    Tue, 06 May 2025 15:06:00 -0000

    We’ve seen developers doing amazing things with Gemini 2.5 Pro, so we decided to release an updated version a couple of weeks early to get it into developers’ hands sooner.
  103. Build rich, interactive web apps with an updated Gemini 2.5 Pro

    Tue, 06 May 2025 15:00:00 -0000

    Our updated version of Gemini 2.5 Pro Preview has improved capabilities for coding.
  104. Music AI Sandbox, now with new features and broader access

    Thu, 24 Apr 2025 15:01:00 -0000

    Helping music professionals explore the potential of generative AI
  105. Introducing Gemini 2.5 Flash

    Thu, 17 Apr 2025 19:02:00 -0000

    Gemini 2.5 Flash is our first fully hybrid reasoning model, giving developers the ability to turn thinking on or off.
  106. Generate videos in Gemini and Whisk with Veo 2

    Tue, 15 Apr 2025 17:00:00 -0000

    Transform text-based prompts into high-resolution eight-second videos in Gemini Advanced and use Whisk Animate to turn images into eight-second animated clips.
  107. DolphinGemma: How Google AI is helping decode dolphin communication

    Mon, 14 Apr 2025 17:00:00 -0000

    DolphinGemma, a large language model developed by Google, is helping scientists study how dolphins communicate — and hopefully find out what they're saying, too.
  108. Taking a responsible path to AGI

    Wed, 02 Apr 2025 13:31:00 -0000

    We’re exploring the frontiers of AGI, prioritizing technical safety, proactive risk assessment, and collaboration with the AI community.
  109. Evaluating potential cybersecurity threats of advanced AI

    Wed, 02 Apr 2025 13:30:00 -0000

    Our framework enables cybersecurity experts to identify which defenses are necessary—and how to prioritize them
  110. Gemini 2.5: Our most intelligent AI model

    Tue, 25 Mar 2025 17:00:00 -0000

    Gemini 2.5 is our most intelligent AI model, now with thinking built in.
  111. Gemini Robotics brings AI into the physical world

    Wed, 12 Mar 2025 15:00:00 -0000

    Introducing Gemini Robotics and Gemini Robotics-ER, AI models designed for robots to understand, act and react to the physical world.
  112. Experiment with Gemini 2.0 Flash native image generation

    Wed, 12 Mar 2025 14:58:00 -0000

    Native image output is available in Gemini 2.0 Flash for developers to experiment with in Google AI Studio and the Gemini API.
  113. Introducing Gemma 3

    Wed, 12 Mar 2025 08:00:00 -0000

    The most capable model you can run on a single GPU or TPU.
  114. Start building with Gemini 2.0 Flash and Flash-Lite

    Tue, 25 Feb 2025 18:02:00 -0000

    Gemini 2.0 Flash-Lite is now generally available in the Gemini API for production use in Google AI Studio and for enterprise customers on Vertex AI
  115. Gemini 2.0 is now available to everyone

    Wed, 05 Feb 2025 16:00:00 -0000

    We’re announcing new updates to Gemini 2.0 Flash, plus introducing Gemini 2.0 Flash-Lite and Gemini 2.0 Pro Experimental.
  116. Updating the Frontier Safety Framework

    Tue, 04 Feb 2025 16:41:00 -0000

    Our next iteration of the FSF sets out stronger security protocols on the path to AGI
  117. FACTS Grounding: A new benchmark for evaluating the factuality of large language models

    Tue, 17 Dec 2024 15:29:00 -0000

    Our comprehensive benchmark and online leaderboard offer a much-needed measure of how accurately LLMs ground their responses in provided source material and avoid hallucinations
  118. State-of-the-art video and image generation with Veo 2 and Imagen 3

    Mon, 16 Dec 2024 17:01:00 -0000

    We’re rolling out a new, state-of-the-art video model, Veo 2, and updates to Imagen 3. Plus, check out our new experiment, Whisk.
  119. Introducing Gemini 2.0: our new AI model for the agentic era

    Wed, 11 Dec 2024 15:30:00 -0000

    Today, we’re announcing Gemini 2.0, our most capable multimodal AI model yet.
  120. Google DeepMind at NeurIPS 2024

    Thu, 05 Dec 2024 17:45:00 -0000

    Advancing adaptive AI agents, empowering 3D scene creation, and innovating LLM training for a smarter, safer future
  121. GenCast predicts weather and the risks of extreme conditions with state-of-the-art accuracy

    Wed, 04 Dec 2024 15:59:00 -0000

    New AI model advances the prediction of weather uncertainties and risks, delivering faster, more accurate forecasts up to 15 days ahead
  122. Genie 2: A large-scale foundation world model

    Wed, 04 Dec 2024 14:23:00 -0000

    Generating unlimited diverse training environments for future general agents
  123. AlphaQubit tackles one of quantum computing’s biggest challenges

    Wed, 20 Nov 2024 18:00:00 -0000

    Our new AI system accurately identifies errors inside quantum computers, helping to make this new technology more reliable.
  124. The AI for Science Forum: A new era of discovery

    Mon, 18 Nov 2024 19:57:00 -0000

    The AI Science Forum highlights AI's present and potential role in revolutionizing scientific discovery and solving global challenges, emphasizing collaboration between the scientific community, policymakers, and industry leaders.
  125. Pushing the frontiers of audio generation

    Wed, 30 Oct 2024 15:00:00 -0000

    Our pioneering speech generation technologies are helping people around the world interact with more natural, conversational and intuitive digital assistants and AI tools.
  126. New generative AI tools open the doors of music creation

    Wed, 23 Oct 2024 16:53:00 -0000

    Our latest AI music technologies are now available in MusicFX DJ, Music AI Sandbox and YouTube Shorts
  127. Demis Hassabis & John Jumper awarded Nobel Prize in Chemistry

    Wed, 09 Oct 2024 11:45:00 -0000

    The award recognizes their work developing AlphaFold, a groundbreaking AI system that predicts the 3D structure of proteins from their amino acid sequences.
  128. How AlphaChip transformed computer chip design

    Thu, 26 Sep 2024 14:08:00 -0000

    Our AI method has accelerated and optimized chip design, and its superhuman chip layouts are used in hardware around the world.
  129. Updated production-ready Gemini models, reduced 1.5 Pro pricing, increased rate limits, and more

    Tue, 24 Sep 2024 16:03:00 -0000

    We’re releasing two updated production-ready Gemini models
  130. Empowering YouTube creators with generative AI

    Wed, 18 Sep 2024 14:30:00 -0000

    New video generation technology in YouTube Shorts will help millions of people realize their creative vision
  131. Our latest advances in robot dexterity

    Thu, 12 Sep 2024 14:00:00 -0000

    Two new AI systems, ALOHA Unleashed and DemoStart, help robots learn to perform complex tasks that require dexterous movement
  132. AlphaProteo generates novel proteins for biology and health research

    Thu, 05 Sep 2024 15:00:00 -0000

    New AI system designs proteins that successfully bind to target molecules, with potential for advancing drug design, disease understanding and more.
  133. FermiNet: Quantum physics and chemistry from first principles

    Thu, 22 Aug 2024 19:00:00 -0000

    Using deep learning to solve fundamental problems in computational quantum chemistry and explore how matter interacts with light
  134. Mapping the misuse of generative AI

    Fri, 02 Aug 2024 10:50:58 -0000

    New research analyzes the misuse of multimodal generative AI today, in order to help build safer and more responsible technologies.
  135. Gemma Scope: helping the safety community shed light on the inner workings of language models

    Wed, 31 Jul 2024 15:59:19 -0000

    Announcing a comprehensive, open suite of sparse autoencoders for language model interpretability.
  136. AI achieves silver-medal standard solving International Mathematical Olympiad problems

    Thu, 25 Jul 2024 15:29:00 -0000

    Breakthrough models AlphaProof and AlphaGeometry 2 solve advanced reasoning problems in mathematics
  137. Google DeepMind at ICML 2024

    Fri, 19 Jul 2024 10:00:00 -0000

    Exploring AGI, the challenges of scaling and the future of multimodal generative AI
  138. Generating audio for video

    Mon, 17 Jun 2024 16:00:00 -0000

    Video-to-audio research uses video pixels and text prompts to generate rich soundtracks
  139. Looking ahead to the AI Seoul Summit

    Mon, 20 May 2024 07:00:00 -0000

    How summits in Seoul, France and beyond can galvanize international cooperation on frontier AI safety
  140. Introducing the Frontier Safety Framework

    Fri, 17 May 2024 14:00:00 -0000

    Our approach to analyzing and mitigating future risks posed by advanced AI models
  141. Gemini breaks new ground: a faster model, longer context and AI agents

    Tue, 14 May 2024 17:58:00 -0000

    We’re introducing a series of updates across the Gemini family of models, including the new 1.5 Flash, our lightweight model for speed and efficiency, and Project Astra, our vision for the future of AI assistants.
  142. New generative media models and tools, built with and for creators

    Tue, 14 May 2024 17:57:00 -0000

    We’re introducing Veo, our most capable model for generating high-definition video, and Imagen 3, our highest quality text-to-image model. We’re also sharing new demo recordings created with our Music AI Sandbox.
  143. Watermarking AI-generated text and video with SynthID

    Tue, 14 May 2024 17:56:00 -0000

    Announcing our novel watermarking method for AI-generated text and video, and how we’re bringing SynthID to key Google products
  144. AlphaFold 3 predicts the structure and interactions of all of life’s molecules

    Wed, 08 May 2024 16:00:00 -0000

    Introducing a new AI model developed by Google DeepMind and Isomorphic Labs.
  145. Google DeepMind at ICLR 2024

    Fri, 03 May 2024 13:39:00 -0000

    Developing next-gen AI agents, exploring new modalities, and pioneering foundational learning
  146. The ethics of advanced AI assistants

    Fri, 19 Apr 2024 10:00:00 -0000

    Exploring the promise and risks of a future with more capable AI
  147. TacticAI: an AI assistant for football tactics

    Tue, 19 Mar 2024 16:03:00 -0000

    As part of our multi-year collaboration with Liverpool FC, we develop a full AI system that can advise coaches on corner kicks
  148. A generalist AI agent for 3D virtual environments

    Wed, 13 Mar 2024 14:00:00 -0000

    Introducing SIMA, a Scalable Instructable Multiworld Agent
  149. Gemma: Introducing new state-of-the-art open models

    Wed, 21 Feb 2024 13:06:00 -0000

    Gemma is built for responsible AI development from the same research and technology used to create Gemini models.
  150. Our next-generation model: Gemini 1.5

    Thu, 15 Feb 2024 15:00:00 -0000

    The model delivers dramatically enhanced performance, with a breakthrough in long-context understanding across modalities.
  151. The next chapter of our Gemini era

    Thu, 08 Feb 2024 13:00:00 -0000

    We're bringing Gemini to more Google products
  152. AlphaGeometry: An Olympiad-level AI system for geometry

    Wed, 17 Jan 2024 16:00:00 -0000

    Advancing AI reasoning in mathematics
  153. Shaping the future of advanced robotics

    Thu, 04 Jan 2024 11:39:00 -0000

    Introducing AutoRT, SARA-RT, and RT-Trajectory
  154. Images altered to trick machine vision can influence humans too

    Tue, 02 Jan 2024 16:00:00 -0000

    In a series of experiments published in Nature Communications, we found evidence that human judgments are indeed systematically influenced by adversarial perturbations.
  155. 2023: A Year of Groundbreaking Advances in AI and Computing

    Fri, 22 Dec 2023 13:30:00 -0000

    This has been a year of incredible progress in the field of Artificial Intelligence (AI) research and its practical applications.
  156. FunSearch: Making new discoveries in mathematical sciences using Large Language Models

    Thu, 14 Dec 2023 16:00:00 -0000

    In a paper published in Nature, we introduce FunSearch, a method for searching for “functions” written in computer code, and finding new solutions in mathematics and computer science. FunSearch works by pairing a pre-trained LLM, whose goal is to provide creative solutions in the form of computer code, with an automated “evaluator” that guards against hallucinations and incorrect ideas. (A minimal sketch of this generate-and-evaluate loop appears after this list.)
  157. Google DeepMind at NeurIPS 2023

    Fri, 08 Dec 2023 15:01:00 -0000

    Neural Information Processing Systems (NeurIPS) is the largest artificial intelligence (AI) conference in the world. NeurIPS 2023 will take place December 10-16 in New Orleans, USA. Teams from across Google DeepMind are presenting more than 150 papers at the main conference and workshops.
  158. Introducing Gemini: our largest and most capable AI model

    Wed, 06 Dec 2023 15:13:00 -0000

    Making AI more helpful for everyone
  159. Millions of new materials discovered with deep learning

    Wed, 29 Nov 2023 16:04:00 -0000

    We share the discovery of 2.2 million new crystals – equivalent to nearly 800 years’ worth of knowledge. We introduce Graph Networks for Materials Exploration (GNoME), our new deep learning tool that dramatically increases the speed and efficiency of discovery by predicting the stability of new materials.
  160. Transforming the future of music creation

    Thu, 16 Nov 2023 07:20:00 -0000

    Announcing our most advanced music generation model and two new AI experiments, designed to open a new playground for creativity
  161. Empowering the next generation for an AI-enabled world

    Wed, 15 Nov 2023 10:00:00 -0000

    Experience AI's course and resources are expanding on a global scale
  162. GraphCast: AI model for faster and more accurate global weather forecasting

    Tue, 14 Nov 2023 15:00:00 -0000

    We introduce GraphCast, a state-of-the-art AI model able to make medium-range weather forecasts with unprecedented accuracy
  163. A glimpse of the next generation of AlphaFold

    Tue, 31 Oct 2023 13:00:00 -0000

    Progress update: Our latest AlphaFold model shows significantly improved accuracy and expands coverage beyond proteins to other biological molecules, including ligands.
  164. Evaluating social and ethical risks from generative AI

    Thu, 19 Oct 2023 15:00:00 -0000

    Introducing a context-based framework for comprehensively evaluating the social and ethical risks of AI systems
  165. Scaling up learning across many different robot types

    Tue, 03 Oct 2023 15:00:00 -0000

    Robots are great specialists, but poor generalists. Typically, you have to train a model for each task, robot, and environment. Changing a single variable often requires starting from scratch. But what if we could combine the knowledge across robotics and create a way to train a general-purpose robot?
  166. A catalogue of genetic mutations to help pinpoint the cause of diseases

    Tue, 19 Sep 2023 13:37:00 -0000

    New AI tool classifies the effects of 71 million ‘missense’ mutations.
  167. Identifying AI-generated images with SynthID

    Tue, 29 Aug 2023 00:00:00 -0000

    New tool helps watermark and identify synthetic images created by Imagen
  168. RT-2: New model translates vision and language into action

    Fri, 28 Jul 2023 00:00:00 -0000

    Robotic Transformer 2 (RT-2) is a novel vision-language-action (VLA) model that learns from both web and robotics data, and translates this knowledge into generalised instructions for robotic control.
  169. Using AI to fight climate change

    Fri, 21 Jul 2023 00:00:00 -0000

    AI is a powerful technology that will transform our future, so how can we best apply it to help combat climate change and find sustainable solutions?
  170. Google DeepMind’s latest research at ICML 2023

    Thu, 20 Jul 2023 00:00:00 -0000

    Exploring AI safety, adaptability, and efficiency for the real world
  171. Developing reliable AI tools for healthcare

    Mon, 17 Jul 2023 00:00:00 -0000

    We’ve published our joint paper with Google Research in Nature Medicine, which proposes CoDoC (Complementarity-driven Deferral-to-Clinical Workflow), an AI system that learns when to rely on predictive AI tools and when to defer to a clinician for the most accurate interpretation of medical images. (A minimal deferral sketch appears after this list.)
  172. Exploring institutions for global AI governance

    Tue, 11 Jul 2023 00:00:00 -0000

    New white paper investigates models and functions of international institutions that could help manage opportunities and mitigate risks of advanced AI.
  173. RoboCat: A self-improving robotic agent

    Tue, 20 Jun 2023 00:00:00 -0000

    Robots are quickly becoming part of our everyday lives, but they’re often only programmed to perform specific tasks well. While recent advances in AI could lead to robots that help in many more ways, progress in building general-purpose robots has been slower, in part because of the time needed to collect real-world training data. Our latest paper introduces a self-improving AI agent for robotics, RoboCat, that learns to perform a variety of tasks across different arms, and then self-generates new training data to improve its technique. (A minimal sketch of this self-improvement loop appears after this list.)
  174. YouTube: Enhancing the user experience

    Fri, 16 Jun 2023 14:55:00 -0000

    It’s all about using our technology and research to help enrich people’s lives, like YouTube and its mission to give everyone a voice and show them the world.
  175. Google Cloud: Driving digital transformation

    Wed, 14 Jun 2023 14:51:00 -0000

    Google Cloud empowers organizations to digitally transform themselves into smarter businesses. It offers cloud computing, data analytics, and the latest artificial intelligence (AI) and machine learning tools.
  176. MuZero, AlphaZero, and AlphaDev: Optimizing computer systems

    Mon, 12 Jun 2023 14:41:00 -0000

    How MuZero, AlphaZero, and AlphaDev are optimizing the computing ecosystem that powers our world of devices.
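
The FunSearch entry above describes pairing a pre-trained LLM that proposes programs with an automated evaluator that executes and scores them, discarding anything broken or hallucinated. Below is a minimal, hypothetical sketch of that generate-and-evaluate loop; llm_propose and the toy priority task are stand-ins invented for illustration, not the published system.

    # FunSearch-style loop: the "LLM" proposes candidate programs, the
    # evaluator executes and scores them, and only programs that run and
    # improve the score survive as parents for the next round.
    import random
    import re

    def evaluate(program_src: str) -> float | None:
        """Run the candidate on a test input; None means it crashed,
        which guards against hallucinated or incorrect code."""
        try:
            namespace: dict = {}
            exec(program_src, namespace)
            return float(namespace["priority"](7))
        except Exception:
            return None

    def llm_propose(parents: list[str]) -> str:
        """Stand-in for the pre-trained LLM: mutate a known-good program.
        A real system would prompt a language model with the parents."""
        base = random.choice(parents)
        return re.sub(r"\d+", lambda _: str(random.randint(1, 99)), base, count=1)

    # Seed program: a trivial priority function to improve on.
    pool = ["def priority(n):\n    return n * 1\n"]
    best_score = evaluate(pool[0])

    for _ in range(200):                       # evolutionary search loop
        candidate = llm_propose(pool)
        score = evaluate(candidate)
        if score is not None and score > best_score:
            best_score = score                 # keep verified improvements only
            pool.append(candidate)             # better programs become parents

    print(best_score)

The design point the post highlights is that the evaluator, not the LLM, is the source of ground truth: any proposal that fails to run or score is simply discarded.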
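
The CoDoC entry above describes a system that learns when to rely on a predictive AI tool and when to defer to a clinician. One simple, hypothetical reading of that idea, with synthetic data standing in for real validation cases (this is not the published method): pick a confidence threshold on held-out data such that deferring below it maximizes combined accuracy.

    # Threshold-based deferral: use the AI above the threshold, defer to
    # the clinician below it. All data here is synthetic.
    import numpy as np

    rng = np.random.default_rng(0)
    ai_confidence = rng.uniform(0.5, 1.0, size=1000)       # model confidence
    ai_correct = rng.uniform(size=1000) < ai_confidence    # AI better when confident
    clinician_correct = rng.uniform(size=1000) < 0.85      # flat clinician accuracy

    def combined_accuracy(threshold: float) -> float:
        """Accuracy of the deferral system at a given threshold."""
        use_ai = ai_confidence >= threshold
        return np.where(use_ai, ai_correct, clinician_correct).mean()

    # Sweep thresholds and keep the one with the best combined accuracy.
    thresholds = np.linspace(0.5, 1.0, 51)
    best = max(thresholds, key=combined_accuracy)
    print(f"defer below confidence {best:.2f}: accuracy {combined_accuracy(best):.3f}")

The appeal of the deferral framing is that the combined system can beat either the AI or the clinician alone, because each handles the cases it is better at.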
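
The RoboCat entry above describes an agent that learns tasks from demonstrations and then self-generates new training data from its own successful attempts. A minimal sketch of that self-improvement loop, where train, rollout, and the toy success model are hypothetical stand-ins for real fine-tuning, robot execution, and success detection:

    import random

    # Stand-ins: `train` pretends a policy's success rate grows with data;
    # `rollout` pretends to run one episode on a robot arm.
    def train(policy: float, dataset: list) -> float:
        return min(0.95, 0.2 + 0.005 * len(dataset))

    def rollout(policy: float) -> tuple[str, bool]:
        return "trajectory", random.random() < policy

    def self_improve(policy: float, demos: list, rounds: int = 3,
                     rollouts_per_round: int = 100) -> float:
        dataset = list(demos)                  # seed with demonstrations
        for _ in range(rounds):
            policy = train(policy, dataset)    # fine-tune on all data so far
            for _ in range(rollouts_per_round):
                traj, success = rollout(policy)
                if success:                    # keep successful attempts only
                    dataset.append(traj)       # self-generated training data
        return policy

    print(self_improve(0.2, ["demo"] * 60))    # success rate after 3 rounds

The loop captures the post's core claim: once the agent succeeds often enough, its own successes become training data, reducing dependence on slow real-world data collection.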