Pipes Feed Preview: Towards Data Science & The New Stack & DevOps & SRE & DevOps.com & Google DeepMind Blog

  1. Stop Building AI Platforms

    Sat, 14 Jun 2025 01:26:49 -0000

    Small and medium companies have succeeded in building data and ML platforms, but building AI platforms is now profoundly challenging

    The post Stop Building AI Platforms appeared first on Towards Data Science.

  2. What If I had AI in 2018: Rent the Runway Fulfillment Center Optimization

    Fri, 13 Jun 2025 23:03:03 -0000

    An LLM in 2018 would not have trivialized a complex project, although it could have enhanced the final solution

    The post What If I had AI in 2018: Rent the Runway Fulfillment Center Optimization appeared first on Towards Data Science.

  3. AI Is Not a Black Box (Relatively Speaking)

    Fri, 13 Jun 2025 20:02:51 -0000

    Compared to the opacity around human intelligence, AI is more transparent in some very tangible ways.

    The post AI Is Not a Black Box (Relatively Speaking) appeared first on Towards Data Science.

  4. How AI Agents “Talk” to Each Other

    Fri, 13 Jun 2025 19:21:14 -0000

    Minimize chaos and maintain inter-agent harmony in your projects

    The post How AI Agents “Talk” to Each Other appeared first on Towards Data Science.

  5. Connecting the Dots for Better Movie Recommendations

    Fri, 13 Jun 2025 00:27:55 -0000

    Lightweight graph RAG on Rotten Tomatoes movie reviews

    The post Connecting the Dots for Better Movie Recommendations appeared first on Towards Data Science.

  6. Agentic AI 103: Building Multi-Agent Teams

    Thu, 12 Jun 2025 19:34:10 -0000

    Build multi-agent teams that can automate tasks and enhance productivity.

    The post Agentic AI 103: Building Multi-Agent Teams appeared first on Towards Data Science.

  7. Design Smarter Prompts and Boost Your LLM Output: Real Tricks from an AI Engineer’s Toolbox

    Thu, 12 Jun 2025 18:54:58 -0000

    Not just what you ask, but how you ask it. Practical techniques for prompt engineering that deliver

    The post Design Smarter Prompts and Boost Your LLM Output: Real Tricks from an AI Engineer’s Toolbox appeared first on Towards Data Science.

  8. User Authorisation in Streamlit With OIDC and Google

    Thu, 12 Jun 2025 16:54:07 -0000

    Log in to a Streamlit app with a Google email account

    The post User Authorisation in Streamlit With OIDC and Google appeared first on Towards Data Science.

  9. Exploring the Proportional Odds Model for Ordinal Logistic Regression

    Thu, 12 Jun 2025 05:45:56 -0000

    Understanding and Implementing Brant’s Tests in Ordinal Logistic Regression with Python

    The post Exploring the Proportional Odds Model for Ordinal Logistic Regression appeared first on Towards Data Science.

  10. Can AI Truly Develop a Memory That Adapts Like Ours?

    Thu, 12 Jun 2025 05:32:11 -0000

    Exploring Titans: A new architecture equipping LLMs with human-inspired memory that learns and updates itself at test time.

    The post Can AI Truly Develop a Memory That Adapts Like Ours? appeared first on Towards Data Science.

  11. Model Context Protocol (MCP) Tutorial: Build Your First MCP Server in 6 Steps

    Wed, 11 Jun 2025 19:40:45 -0000

    A beginner-friendly tutorial on MCP architecture that focuses on MCP server components and applications and walks through building a custom MCP server that enables code-to-diagram conversion.

    The post Model Context Protocol (MCP) Tutorial: Build Your First MCP Server in 6 Steps appeared first on Towards Data Science.

  12. Mobile App Development with Python

    Wed, 11 Jun 2025 16:55:03 -0000

    Build iOS & Android Apps with Kivy

    The post Mobile App Development with Python appeared first on Towards Data Science.

  13. Audio Spectrogram Transformers Beyond the Lab

    Tue, 10 Jun 2025 23:47:29 -0000

    A recipe for building a portable soundscape monitoring app with AudioMoth, Raspberry Pi, and a decent dose of deep learning.

    The post Audio Spectrogram Transformers Beyond the Lab appeared first on Towards Data Science.

  14. Automate Models Training: An MLOps Pipeline with Tekton and Buildpacks

    Tue, 10 Jun 2025 23:37:18 -0000

    A step-by-step guide to containerizing and orchestrating an ML training workflow without the Dockerfile headache, using a lightweight GPT-2 example.

    The post Automate Models Training: An MLOps Pipeline with Tekton and Buildpacks appeared first on Towards Data Science.

  15. 10,000x Faster Bayesian Inference: Multi-GPU SVI vs. Traditional MCMC

    Tue, 10 Jun 2025 23:29:05 -0000

    Using GPU acceleration to speed up Bayesian Inference from months to minutes...

    The post 10,000x Faster Bayesian Inference: Multi-GPU SVI vs. Traditional MCMC appeared first on Towards Data Science.

  16. Applications of Density Estimation to Legal Theory

    Tue, 10 Jun 2025 16:36:24 -0000

    A brief analysis using density estimation to compare the two-verdict and three-verdict systems.

    The post Applications of Density Estimation to Legal Theory appeared first on Towards Data Science.

  17. Mastering SQL Window Functions

    Tue, 10 Jun 2025 05:36:15 -0000

    Understand how to use Window Functions to perform calculations without losing details

    The post Mastering SQL Window Functions appeared first on Towards Data Science.

  18. Exploratory Data Analysis: Gamma Spectroscopy in Python

    Tue, 10 Jun 2025 05:27:05 -0000

    Let’s observe matter at the atomic level

    The post Exploratory Data Analysis: Gamma Spectroscopy in Python appeared first on Towards Data Science.

  19. A Bird’s-Eye View of Linear Algebra: Measure of a Map — Determinants

    Tue, 10 Jun 2025 05:00:27 -0000

    We roll up our sleeves and start to deal with matrices

    The post A Bird’s-Eye View of Linear Algebra: Measure of a Map — Determinants appeared first on Towards Data Science.

  20. How to Transition From Data Analyst to Data Scientist

    Mon, 09 Jun 2025 23:09:55 -0000

    Playbook on how data analysts can become data scientists

    The post How to Transition From Data Analyst to Data Scientist appeared first on Towards Data Science.

  21. Ready or Not, Agentic AI Is Disrupting Corporate Landscapes

    Sun, 15 Jun 2025 14:00:45 -0000

    While the advancements in Generative AI continue to introduce astounding possibilities, Agentic AI has emerged as a solution to complex

    The post Ready or Not, Agentic AI Is Disrupting Corporate Landscapes appeared first on The New Stack.

    The technology industry has witnessed the most substantial integration of agentic AI, characterized by increased autonomy.
  22. Mary Meeker’s New AI Trends Report Makes the Case for Optimism

    Sun, 15 Jun 2025 13:00:54 -0000

    Historically, Mary Meeker’s “Internet Trends” reports have been legendary. After a five-year hiatus, the VC/analyst has co-authored a remarkable new

    The post Mary Meeker’s New AI Trends Report Makes the Case for Optimism appeared first on The New Stack.

    The occasionally poetic report "Trends – Artificial Intelligence" likens AI's unprecedented pace of change to a "kinetic" force transforming various sectors, usually for the better.
  23. Meet Embabel: A Framework for Building AI Agents With Java

    Sat, 14 Jun 2025 16:00:34 -0000

    Embabel is an open source project that aims to make it easier to build AI agents in Java.

    AI agents are becoming true collaborators in the workplace, with more developers every day beginning to build them. To reach

    The post Meet Embabel: A Framework for Building AI Agents With Java appeared first on The New Stack.

    Embabel is a new open source project intended to simplify the creation of safe, enterprise-level AI agent workflows inside the rich Java ecosystem.
  24. Agentic Coding: How Google’s Jules Compares to Claude Code

    Sat, 14 Jun 2025 14:00:55 -0000

    After my successful use of Claude Code, I was keen to try other equivalent agentic large language model (LLM) tools

    The post Agentic Coding: How Google’s Jules Compares to Claude Code appeared first on The New Stack.

    Google's version of an agentic coding tool that isn't IDE-based is Jules. We look at how it works and how it compares to Claude Code.
  25. ECMAScript Committee Advances 3 Proposals to Stage 4

    Sat, 14 Jun 2025 13:00:56 -0000

    The TC39, aka the Ecma International’s Technical Committee 39, met in early June to review ECMAScript (a.k.a. JavaScript) specification submissions.

    The post ECMAScript Committee Advances 3 Proposals to Stage 4 appeared first on The New Stack.

    Also this week in Dev News, React offers experimental MCP servers, Remix reimagines what a web framework can be, and more.
  26. Async Programming in Java Repositories

    Fri, 13 Jun 2025 20:00:11 -0000

    Java is an object-oriented language; it supports encapsulation, inheritance and polymorphism. As we can see from Java’s continuing success in

    The post Async Programming in Java Repositories appeared first on The New Stack.

    Java's Repository pattern, combined with asynchronous programming, can improve performance and maintainability when accessing multiple data sources.
  27. GenAI and Flexible Consumption Models Reshape Hybrid Storage Infrastructure

    Fri, 13 Jun 2025 19:00:57 -0000

    The use of generative AI (GenAI) is growing at an unprecedented rate across various industries. Significant technical advances in AI

    The post GenAI and Flexible Consumption Models Reshape Hybrid Storage Infrastructure appeared first on The New Stack.

    Due to its significant resource demands and requirement for rapid scalability, AI is the ultimate hybrid application.
  28. Elixir: An Alternative to JavaScript-Based Web Development

    Fri, 13 Jun 2025 18:00:16 -0000

    It may sound like heresy, but some developers regret their commitment to JavaScript. Brian Cardarella, the founder of web and

    The post Elixir: An Alternative to JavaScript-Based Web Development appeared first on The New Stack.

    Some frustrated JavaScript developers have turned to Elixir and its Phoenix framework. They claim it has faster development and lower costs.
  29. Accelerating Developer Velocity With Effective Platform Teams

    Fri, 13 Jun 2025 17:00:01 -0000

    Platform engineering teams create self-service capabilities that provide development teams with golden paths for building reliable and secure software. When

    The post Accelerating Developer Velocity With Effective Platform Teams appeared first on The New Stack.

    A look at three key considerations for leaders building strong platform teams.
  30. Why AI and SQL Go Together Like Peanut Butter and Jelly

    Fri, 13 Jun 2025 16:00:59 -0000

    Natural human language is the ideal interface for accessing data, and making it conversational is the ultimate goal. Advanced AI,

    The post Why AI and SQL Go Together Like Peanut Butter and Jelly appeared first on The New Stack.

    The future of information retrieval combines the power of AI's natural language capabilities with advanced, scalable SQL systems for data access.
  31. Install Homebrew on MacOS for More Dev Tool Options

    Fri, 13 Jun 2025 15:00:43 -0000

    Although macOS is one of the more user-friendly operating systems on the market, as a developer, you might find it

    The post Install Homebrew on MacOS for More Dev Tool Options appeared first on The New Stack.

    Package manager simplifies installing and managing development tools and command-line applications on macOS.
  32. Infrastructure From Code: What Went Wrong

    Fri, 13 Jun 2025 14:00:31 -0000

    The idea behind Infrastructure from Code (IfC) is that you would simply write out all deployment and configuration steps in

    The post Infrastructure From Code: What Went Wrong appeared first on The New Stack.

    Developers were reluctant to give up control and vendors struggled to meet diverse compliance and operational needs. Now, the focus is shifting.
  33. Introduction to vLLM: A High-Performance LLM Serving Engine

    Fri, 13 Jun 2025 13:00:12 -0000

    The open source vLLM represents a milestone in large language model (LLM) serving technology, providing developers with a fast, flexible

    The post Introduction to vLLM: A High-Performance LLM Serving Engine appeared first on The New Stack.

    Created at UC Berkeley, this community-driven library addresses memory management, throughput optimization and scalable deployment in LLM applications.
  34. PHP Turns 30: Language and Ecosystem Are Stronger Than Ever

    Thu, 12 Jun 2025 22:00:57 -0000

    This month marks the 30th anniversary of PHP being released to the world. To find out how PHP has evolved

    The post PHP Turns 30: Language and Ecosystem Are Stronger Than Ever appeared first on The New Stack.

    PHP 8 is worlds apart from the humble toolset launched 30 years ago — helped by modern frameworks like Laravel and new tools like FrankenPHP.
  35. Boost Performance With React Server Components and Next.js

    Thu, 12 Jun 2025 21:00:57 -0000

    React is one of the most popular tools for building modern web applications. If you’ve worked with React, you’ve probably

    The post Boost Performance With React Server Components and Next.js appeared first on The New Stack.

    A practical guide to building fast product pages with React and Next.js.
  36. Cloud Native and Open Source Help Scale Agentic AI Workflows

    Thu, 12 Jun 2025 20:00:35 -0000

    Enterprise automation is increasingly leveraging intelligent agent workflows driven by AI, typically relying on large language models (LLMs) for these

    The post Cloud Native and Open Source Help Scale Agentic AI Workflows appeared first on The New Stack.

    Small language models (SLMs) paired with Kubernetes and Function as a Service (FaaS) have emerged as alternatives to LLMs for agentic AI use cases.
  37. AI Will Steal Developer Jobs (But Not How You Think)

    Thu, 12 Jun 2025 19:00:39 -0000

    Anthropic’s CEO has said that in three to six months, AI will be writing 90% of the code that software

    The post AI Will Steal Developer Jobs (But Not How You Think) appeared first on The New Stack.

    AI may or may not eliminate development jobs — but it will certainly change the way developers work and perhaps shift job titles.
  38. What You Need To Know About Apple’s New Container Framework

    Thu, 12 Jun 2025 17:00:53 -0000

    At WWDC 2025, Apple announced something that will fundamentally reshape the way we think about container security: its Containerization framework

    The post What You Need To Know About Apple’s New Container Framework appeared first on The New Stack.

    Starting in macOS 26, every macOS developer will have access to proper container isolation in their development workflow.
  39. Using AI for Test Generation: Powerful Tool or Risky Shortcut?

    Thu, 12 Jun 2025 16:00:51 -0000

    AI is rapidly transforming software development, with AI-coding assistants now commonplace, offering everything from autocompletion to generating substantial code blocks.

    The post Using AI for Test Generation: Powerful Tool or Risky Shortcut? appeared first on The New Stack.

    How developers can use AI for test generation effectively, reaping its benefits without compromising code quality.
  40. No SSH? What Is Talos, This Linux Distro for Kubernetes?

    Thu, 12 Jun 2025 15:00:58 -0000

    Sidero CTO Andrew Rynhard and Head of Product Justin Garrison explained Talos’s design philosophy, highlighting its minimalism and focus on automation

    The rise of container-based Linux distros is real, especially now with the demand for deploying to edge environments that require

    The post No SSH? What Is Talos, This Linux Distro for Kubernetes? appeared first on The New Stack.

    Talos Linux, developed by Sidero Labs, is a Linux distro built for Kubernetes with SSH access disabled and built-in security.
  41. JavaScript Kung Fu: Elegant Techniques To Master the Language

    Thu, 12 Jun 2025 14:00:53 -0000

    JavaScript isn’t just a language — it’s a specialized craft. And like any skilled martial art, the difference between being

    The post JavaScript Kung Fu: Elegant Techniques To Master the Language appeared first on The New Stack.

    Whether you’re building enterprise software or just honing your skills, these patterns will help you write cleaner, faster and more maintainable code.
  42. Build a Package Tracker With WhatsApp API

    Thu, 12 Jun 2025 13:30:08 -0000

    Customers expect fast, transparent updates about their orders. Think shipping status, delays, delivery confirmations and more. The WhatsApp Business App

    The post Build a Package Tracker With WhatsApp API appeared first on The New Stack.

    The WhatsApp Business Platform Cloud API paired with Stream Chat gives you a powerful tool for real-time customer support, order tracking and automation.
  43. Databricks Launches a No-Code Tool for Building Data Pipelines

    Wed, 11 Jun 2025 23:00:25 -0000

    At its Data+AI Summit today in San Francisco, Databricks launched a number of new features for its data platform, including

    The post Databricks Launches a No-Code Tool for Building Data Pipelines appeared first on The New Stack.

    Lakeflow Designer, a no-code tool for building data pipelines, is built on top of the company's more code-heavy data engineering platform.
  44. Lakebase Is Databricks’ Fully-Managed Postgres Database for the AI Era

    Wed, 11 Jun 2025 19:00:04 -0000

    When Databricks announced its plans to acquire the serverless Postgres startup Neon a few weeks ago, it clearly signaled it’s

    The post Lakebase Is Databricks’ Fully-Managed Postgres Database for the AI Era appeared first on The New Stack.

    Lakebase is Databricks' new fully-managed serverless Postgres service, based on its acquisition of Neon, which closed only a week ago.
  45. Jellyfish Tracks AI Impact Across Four Major Coding Tools

    Wed, 11 Jun 2025 18:00:18 -0000

    Jellyfish, a software engineering intelligence platform provider, today expanded its integrations to support additional AI coding tools, as engineering team

    The post Jellyfish Tracks AI Impact Across Four Major Coding Tools appeared first on The New Stack.

    Software intelligence platform Jellyfish expands beyond GitHub Copilot to track Cursor, Google Gemini Code Assist, and Sourcegraph as engineering teams increasingly adopt multiple AI coding tools simultaneously.
  46. AI and Vibe Coding Are Radically Impacting Senior Devs in Code Review

    Wed, 11 Jun 2025 17:00:07 -0000

    AI agents, vibe coding, AI code review, and other AI-centric topics are all I see developers talk about on social

    The post AI and Vibe Coding Are Radically Impacting Senior Devs in Code Review appeared first on The New Stack.

    AI is enabling senior developers to focus on strategic management by automating redundant tasks, thereby boosting overall team productivity.
  47. Sr. Staff UX Designer

    Wed, 28 May 2025 16:00:00 -0000

    In the event of a cloud incident, everyone wants swift and clear communication from the cloud provider, and to be able to leverage that information effectively. Personalized Service Health in the Google Cloud console addresses this need with fast, transparent, relevant, and actionable communications about Google Cloud service disruptions, customized to your specific footprint. This helps you to quickly identify the source of the problem, helping you answer the question, “Is it Google or is it me?” You can then integrate this information into your incident response workflows to resolve the incident more efficiently.

    We're excited to announce that you can prompt Gemini Cloud Assist to pull real-time information about active incidents, powered by Personalized Service Health, providing you with streamlined incident management, including discovery, impact assessment, and recovery. By combining Gemini's guidance with Personalized Service Health insights and up-to-the-minute information, you can assess the scope of impact and begin troubleshooting – all within a single, AI-driven Gemini Cloud Assist chat. Further, you can initiate this sort of incident discovery from anywhere within the console, offering immediate access to relevant incidents without interrupting your workflow. You can also check for active incidents impacting your projects, gathering details on their scope and the latest updates directly sourced from Personalized Service Health.

    Using Gemini Cloud Assist with Personalized Service Health

    We designed Gemini Cloud Assist with a user-friendly layout and a well-organized information structure. Crucial details, including dynamic timelines, latest updates, symptoms, and workarounds sourced directly from Personalized Service Health, are now presented in the console, enabling conversational follow-ups. Gemini Cloud Assist highlights critical insights from Personalized Service Health, helping you refine your investigations and understand the impact of incidents.

    To illustrate the power of this integration, the following demo showcases a typical incident response workflow leveraging the combined capabilities of Gemini and Personalized Service Health.

    Incident discovery and triage

    In the crucial first moments of an incident, Gemini Cloud Assist helps you answer "Is it Google or is it me?" It accesses data directly from Personalized Service Health and reports which of your projects are affected by a Google Cloud incident, and in which locations, speeding up the triage process.

    To illustrate how you can start this process, try asking Gemini Cloud Assist questions like:

    • Is my project impacted by a Google Cloud incident?

    • Are there any incidents impacting Google Cloud at the moment?

    Investigating and evaluating impact

    Once you’ve identified a relevant Google Cloud incident, you can use Gemini Cloud Assist to delve deeper into the specifics and evaluate its impact on your environment. When you ask follow-up questions, Gemini Cloud Assist retrieves updates from Personalized Service Health about the incident as it evolves. You can then investigate further by asking Gemini to pinpoint exactly which of your apps or projects, and at what locations, might be affected by the reported incident.

    Here are examples of prompts you might pose to Gemini Cloud Assist:

    • Tell me more about the ongoing Incident ID [X] (Replace [X] with the Incident ID)

    • Is [X] impacted? (Replace [X] with your specific location or Google Cloud product)

    • What is the latest update on Incident ID [X]?

    • Show me the details of Incident ID [X].

    • Can you guide me through some troubleshooting steps for [impacted Google Cloud product]?

    Mitigation and recovery

    Finally, Gemini Cloud Assist can also act as an intelligent assistant during the recovery phase, providing you with actionable guidance. You can gain access to relevant logs and monitoring data for more efficient resolution. Additionally, Gemini Cloud Assist can help surface potential workarounds from Personalized Service Health and direct you to the tools and information you need to restore your projects or applications. Here are some sample prompts:

    • What are the workarounds for Incident ID [X]? (Replace [X] with the Incident ID)

    • Can you suggest a temporary solution to keep my application running?

    • How can I find logs for this impacted project?

    From these prompts, Gemini retrieves relevant information from Personalized Service Health to provide you with personalized insights into your Google Cloud environment's health, both for ongoing events and for incidents from up to one year in the past. This helps you narrow down an incident's impact during an investigation and assists in recovery.

    Next steps

    Looking ahead, we are excited to provide even deeper insights and more comprehensive incident management with Gemini Cloud Assist and Personalized Service Health, extending these AI-driven capabilities beyond a single project view. Ready to get started? 

    • Learn more about Personalized Service Health, or reach out to your account team to enable it.

    • Get started with Gemini Cloud Assist. Refine your prompts to ask about your specific regions or Google Cloud products, and experiment to discover how it can help you proactively manage incidents.

  48. Staff Site Reliability Engineer, Waze

    Mon, 28 Apr 2025 16:00:00 -0000

    In 2023, the Waze platform engineering team transitioned to Infrastructure as Code (IaC) using Google Cloud's Config Connector (KCC) — and we haven’t looked back since. We embraced Config Connector, an open-source Kubernetes add-on, to manage Google Cloud resources through Kubernetes. To streamline management, we also leverage Config Controller, a hosted version of Config Connector on Google Kubernetes Engine (GKE), incorporating Policy Controller and Config Sync. This shift has significantly improved our infrastructure management and is shaping our future infrastructure.

    The shift to Config Connector

    Previously, Waze relied on Terraform to manage resources, particularly during our dual-cloud, VM-based phase. However, maintaining state and ensuring reconciliation proved challenging, leading to inconsistent configurations and increased management overhead.

    In 2023, we adopted Config Connector, transforming our Google Cloud infrastructure into Kubernetes Resource Model (KRM) resources within a GKE cluster. This approach addresses the reconciliation issues encountered with Terraform. Config Sync, paired with Config Connector, automates KRM synchronization from source repositories to our live GKE cluster. This managed solution eliminates the need for us to build and maintain custom reconciliation systems.

    The shift helped us meet the needs of three key roles within Waze’s infrastructure team: 

    1. Infrastructure consumers: Application developers who want to easily deploy infrastructure without worrying about the maintenance and complexity of underlying resources.

    2. Infrastructure owners: Experts in specific resource types (e.g., Spanner, Google Cloud Storage, Load Balancers, etc.), who want to define and standardize best practices in how resources are created across Waze on Google Cloud.

    3. Platform engineers: Engineers who build the system that enables infrastructure owners to codify and define best practices, while also providing a seamless API for infrastructure consumers.

    First stop: Config Connector

    It may seem circular to define all of our Google Cloud infrastructure as KRMs within a Google Cloud service. However, KRM is actually a great representation for our infrastructure compared with existing IaC tooling.

    Terraform's reconciliation issues – state drift, version management, out-of-band changes – are a significant pain. Config Connector, through Config Sync, offers out-of-the-box reconciliation, a managed solution we prefer. Both KRM and Terraform offer templating, but KCC's managed nature aligns with our shift to Google Cloud-native solutions and reduces our maintenance burden.
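
    To make the reconciliation idea concrete, here is a minimal Python sketch of a single reconcile pass. It compares the declared (desired) resources with what actually exists and derives the actions needed to remove drift. The resource names and the dictionary layout are hypothetical, chosen only for illustration; this is not how Config Sync or Config Connector is implemented.

```python
from typing import Dict, List, Tuple

Resource = Dict[str, object]  # hypothetical: a resource spec as a plain dict


def reconcile(desired: Dict[str, Resource],
              actual: Dict[str, Resource]) -> List[Tuple[str, str]]:
    """Return the (action, name) pairs needed to make `actual` match `desired`."""
    actions: List[Tuple[str, str]] = []
    for name in sorted(desired):
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != desired[name]:
            # Drift: the live resource no longer matches its declaration,
            # for example after an out-of-band change.
            actions.append(("update", name))
    for name in sorted(actual):
        if name not in desired:
            actions.append(("delete", name))
    return actions


desired = {
    "bucket-a": {"location": "US", "versioning": True},
    "bucket-b": {"location": "EU", "versioning": False},
}
actual = {
    "bucket-a": {"location": "US", "versioning": False},  # drifted
    "bucket-c": {"location": "US", "versioning": False},  # not declared
}

plan = reconcile(desired, actual)
print(plan)
```

    A managed reconciler runs a pass like this continuously, so drift is corrected without anyone maintaining a separate state file.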

    Infrastructure complexity requires generalization regardless of the tool. We can see this when we look at the Spanner requirements at Waze:

    • Consistent backups for all Spanner databases

    • Each Spanner database utilizes a dedicated Cloud Storage bucket and Service Account to automate the execution of DDL jobs.

    • All IAM policies for Spanner instances, databases, and Cloud Storage buckets are defined in code to ensure consistent and auditable access control.

    Figure 1: Spanner at Waze

    To define these resources, we evaluated various templating and rendering tools and selected Helm, a robust CNCF package manager for Kubernetes. Its strong open-source community, rich templating capabilities, and native rendering features made it a natural fit. We can now refer to our bundled infrastructure configurations as 'Charts.' KRO, which serves a similar purpose, has since emerged, but our selection process predated its availability.

    Under the hood

    Let's open the hood and dive into how the system works and is driving value for Waze.

    1. Waze infrastructure owners generically define Waze-flavored infrastructure in Helm Charts. 

    2. Infrastructure consumers use these Charts with simplified inputs to generate infrastructure (demo).

    3. Infrastructure code is stored in repositories, enabling validation and presubmit checks.

    Code is uploaded to Artifact Registry, where Config Sync and Config Connector align Google Cloud infrastructure with the code definitions.
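
    As a rough illustration of how a Chart can turn simplified inputs into a bundle of resources, the Python sketch below expands one database request into a database, a dedicated bucket, and a service account. The function name, the naming suffixes, and the field layout are assumptions made for this example. The `kind` values are modeled on Config Connector resource kinds, but real manifests also need an `apiVersion` and fields checked against the Config Connector reference. Waze's actual Charts are Helm templates, not Python.

```python
from typing import Dict, List


def render_spanner_bundle(name: str, instance: str, location: str) -> List[Dict]:
    """Expand one simplified input into the resources a database needs.

    Hypothetical stand-in for a Helm chart; field layouts are illustrative.
    """
    bucket = f"{name}-ddl"            # assumed naming convention
    account = f"{name}-ddl-runner"    # assumed naming convention
    return [
        {"kind": "SpannerDatabase",
         "metadata": {"name": name},
         "spec": {"instanceRef": {"name": instance}}},
        {"kind": "StorageBucket",
         "metadata": {"name": bucket},
         "spec": {"location": location}},
        {"kind": "IAMServiceAccount",
         "metadata": {"name": account},
         "spec": {"displayName": f"DDL runner for {name}"}},
    ]


bundle = render_spanner_bundle("orders", instance="main", location="US")
for resource in bundle:
    print(resource["kind"], resource["metadata"]["name"])
```

    The consumer supplies three values, and the owner's template decides everything else. That split is what lets best practices live in one place.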

    Figure 2: Provisioning Cloud Resources at Waze

    This diagram represents a single "data domain," a collection of bounded services, databases, networks, and data. Many tech orgs today consist of Prod, QA, Staging, Development, etc.

    Approaching our destination

    So why does all of this matter? Adopting this approach allowed us to move from Infrastructure as Code to Infrastructure as Software. By treating each Chart as a software component, our infrastructure management goes beyond simple code declaration. Now, versioned Charts and configurations enable us to leverage a rich ecosystem of software practices, including sophisticated release management, automated rollbacks, and granular change tracking.

    Here's where we apply this in practice: our configuration inheritance model minimizes redundancy. Resource Charts inherit settings from Projects, which inherit from Bootstraps. All three are defined as Charts. Consequently, Bootstrap configurations apply to all Projects, and Project configurations apply to all Resources.
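
    The inheritance chain can be pictured as layered configuration merging. The Python sketch below assumes that a more specific layer overrides a more general one and that nested settings are merged key by key. The source does not spell out the override rules, so treat both as assumptions; the setting names are hypothetical as well.

```python
from typing import Any, Dict


def merge(parent: Dict[str, Any], child: Dict[str, Any]) -> Dict[str, Any]:
    """Return parent settings overlaid with child settings (child wins)."""
    result = dict(parent)
    for key, value in child.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            # Merge nested mappings instead of replacing them wholesale.
            result[key] = merge(result[key], value)
        else:
            result[key] = value
    return result


# Hypothetical settings at each level of the chain.
bootstrap = {"labels": {"org": "example"}, "audit_logging": True}
project = {"labels": {"team": "maps"}, "region": "us-central1"}
resource = {"labels": {"app": "routing"}, "region": "europe-west1"}

effective = merge(merge(bootstrap, project), resource)
print(effective)
```

    Because Bootstrap settings flow into every Project and Resource, a change made once at the top reaches everything below it.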

    Every change to our infrastructure – from changes to existing infrastructure to rollouts of new resource types – can be treated like a software rollout.

    Figure 3: Resource Inheritance

    Now that all of our infrastructure is treated like software, we can see what this does for us system-wide:

    Figure 4: Data Domain Flow

    Reaching our destination

    In summary, Config Connector and Config Controller have enabled Waze to achieve true Infrastructure as Software, providing a robust and scalable platform for our infrastructure needs, along with many other benefits including: 

    • Infrastructure consumers receive the latest best practices through versioned updates.

    • Infrastructure owners can iterate and improve infrastructure safely.

    • Platform Engineers and Security teams are confident our resources are auditable and compliant.

    • Config Connector leverages Google's managed services, reducing operational overhead.

  49. Engineering Manager

    Mon, 24 Feb 2025 17:00:00 -0000

    Distributed tracing is a critical part of an observability stack, letting you troubleshoot latency and errors in your applications. Cloud Trace, part of Google Cloud Observability, is Google Cloud’s native tracing product, and we’ve made numerous improvements to the Trace explorer UI on top of a new analytics backend.

    1_Components of the new trace explorer

    The new Trace explorer page contains:

    1. A filter bar with options for users to choose a Google Cloud project-based trace scope, all/root spans and a custom attribute filter.

    2. A faceted span filter pane that displays commonly used filters based on OpenTelemetry conventions.

    3. A visualization of matching spans including an interactive span duration heatmap (default), a span rate line chart, and a span duration percentile chart.

    4. A table of matching spans that can be narrowed down further by selecting a cell of interest on the heatmap.

    A tour of the new Trace explorer

    Let’s take a closer look at these new features and how you can use them to troubleshoot your applications. Imagine you’re a developer working on the checkoutservice of a retail webstore application and you’ve been paged because there’s an ongoing incident.

    This application is instrumented using OpenTelemetry and sends trace data to Google Cloud Trace, so you navigate to the Trace explorer page on the Google Cloud console with the context set to the Google Cloud project that hosts the checkoutservice.

    Before starting your investigation, you remember that your admin recommended using the webstore-prod trace scope when investigating webstore app-wide prod issues. By using this Trace scope, you'll be able to see spans stored in other Google Cloud projects that are relevant to your investigation.

    2_Scope selection

    You set the trace scope to webstore-prod and your queries will now include spans from all the projects included in this trace scope.

    3_User Journey

    You select checkoutservice in Span filters (1) and the following updates load on the page:

    • Other sections such as Span name in the span filter pane (2) are updated with counts and percentages that take into account the selection made under service name. This can help you narrow down your search criteria to be more specific.

    • The span Filter bar (3) is updated to display the active filter.

    • The heatmap visualization (4) is updated to only display spans from the checkoutservice in the last 1 hour (default). You can change the time range using the time-picker (5). The heatmap’s x-axis is time and the y-axis is span duration. It uses color shades to denote the number of spans in each cell, with a legend that indicates the corresponding range.

    • The Spans table (6) is updated with matching spans sorted by duration (default).

    • Other Chart views (7) that you can switch to are also updated with the applied filter.
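
    To make the heatmap description concrete, here is a rough Python sketch of how spans could be counted into cells, with time on the x-axis and duration bands on the y-axis. It is only an illustration of the binning idea, not how Cloud Trace computes its chart; the span data, bucket width, and band edges are all made-up assumptions.

```python
from collections import Counter

# Hypothetical spans as (start_time_seconds, duration_seconds) pairs.
spans = [(5, 0.2), (12, 0.3), (14, 0.25), (70, 130.0), (75, 125.0)]

TIME_BUCKET_S = 60                  # assumed width of one x-axis cell
DURATION_EDGES = [0.1, 1, 10, 100]  # assumed y-axis band boundaries, in seconds


def duration_row(duration: float) -> int:
    """Index of the duration band: 0 is below the first edge, 4 is above the last."""
    return sum(duration >= edge for edge in DURATION_EDGES)


def heatmap(spans):
    """Count spans per (time column, duration row) cell."""
    cells = Counter()
    for start, duration in spans:
        cells[(int(start // TIME_BUCKET_S), duration_row(duration))] += 1
    return cells


cells = heatmap(spans)
print(dict(cells))
```

    A color scale would then be applied to the per-cell counts; the two slow spans land together in the top band, which is the kind of outlier cell the walkthrough goes on to select.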

    From looking at the heatmap, you can see that there are some spans in the >100s range which is abnormal and concerning. But first, you’re curious about the traffic and corresponding latency of calls handled by the checkoutservice.

    4_Span rate line chart

    Switching to the Span rate line chart gives you an idea of the traffic handled by your service. The x-axis is time and the y-axis is spans/second. The traffic handled by your service looks normal as you know from past experience that 1.5-2 spans/second is quite typical.

    5_Span duration percentile chart

    Switching to the Span duration percentile chart gives you p50/p90/p95/p99 span duration trends. While p50 looks fine, the p9x durations are greater than you expect for your service.
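
    The two charts above summarize the same spans in different ways: a rate (spans per second) and duration percentiles. The sketch below shows one way to compute both in Python, using the nearest-rank percentile definition as an assumption; Cloud Trace may use a different method, and the sample durations are invented.

```python
def span_rate(span_count: int, window_seconds: float) -> float:
    """Average number of spans per second over a time window."""
    return span_count / window_seconds


def percentile(values, pct: int):
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Hypothetical durations in seconds: mostly fast, with a slow tail.
durations = [0.2] * 90 + [1.5] * 5 + [120.0] * 5

summary = {p: percentile(durations, p) for p in (50, 90, 95, 99)}
print(summary)
print(span_rate(6300, 3600))
```

    With this sample, p50 stays low while p95 and p99 are pulled up by the slow tail, which mirrors the pattern described in the walkthrough.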

    6_Span selection

    You switch back to the heatmap chart and select one of the outlier cells to investigate further. This particular cell has two matching spans with a duration of over 2 minutes, which is concerning.

    7_Trace details & span attributes

    You investigate one of those spans by viewing the full trace and notice that the orders publish span takes up the majority of the time when servicing this request. Given this, you form a hypothesis that the checkoutservice is having issues handling these types of calls. To validate your hypothesis, you note that the rpc.method attribute is PlaceOrder and exit this trace using the X button.

    8_Custom attribute search

    You add an attribute filter for key: rpc.method value:PlaceOrder using the Filter bar, which shows you that there is a clear latency issue with PlaceOrder calls handled by your service. You’ve seen this issue before and know that there is a runbook that addresses it, so you alert the SRE team with the appropriate action that needs to be taken to mitigate the incident.
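
    The attribute filter in this step amounts to keeping only spans whose attributes match a set of key/value pairs. Below is a small Python sketch of that logic over in-memory records. The record layout, the `filter_spans` helper, and the sample values are assumptions for illustration; this is not the Cloud Trace API.

```python
from statistics import mean

# Hypothetical span records; the attribute key follows OpenTelemetry naming.
spans = [
    {"service": "checkoutservice", "duration_s": 0.3, "attributes": {"rpc.method": "GetCart"}},
    {"service": "checkoutservice", "duration_s": 125.0, "attributes": {"rpc.method": "PlaceOrder"}},
    {"service": "checkoutservice", "duration_s": 131.0, "attributes": {"rpc.method": "PlaceOrder"}},
    {"service": "cartservice", "duration_s": 0.1, "attributes": {"rpc.method": "GetCart"}},
]


def filter_spans(spans, service, attributes):
    """Keep spans from one service whose attributes match every given key/value pair."""
    return [
        span
        for span in spans
        if span["service"] == service
        and all(span["attributes"].get(key) == value for key, value in attributes.items())
    ]


place_order = filter_spans(spans, "checkoutservice", {"rpc.method": "PlaceOrder"})
mean_place_order = mean(span["duration_s"] for span in place_order)
print(len(place_order), mean_place_order)
```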

    9_Send feedback

    Share your feedback with us via the Send feedback button.

    Behind the scenes

    10_Cloud Trace architecture

    This new experience is powered by BigQuery, using the same platform that backs Log Analytics. We plan to launch new features that take full advantage of this platform: SQL queries, flexible sampling, export, and regional storage.

    In summary, you can use the new Cloud Trace explorer to perform service-oriented investigations with advanced querying and visualization of trace data. This allows developers and SREs to effectively troubleshoot production incidents and identify mitigating measures to restore normal operations.

    The new Cloud Trace explorer is generally available to all users — try it out and share your feedback with us via the Send feedback button.

  50. Technical Program Manager, Google

    Thu, 20 Feb 2025 17:00:00 -0000

    Picture this: you’re a Site Reliability Engineer (SRE) responsible for the systems that power your company’s machine learning (ML) services. What do you do to ensure you have a reliable ML service, how do you know you’re doing it well, and how can you build strong systems to support these services? 

    As artificial intelligence (AI) becomes more widely available, its features — including ML — will matter more to SREs. That’s because ML becomes both a part of the infrastructure used in production software systems and an important feature of the software itself. 

    Abstractly, machine learning relies on its pipelines … and you know how to manage those! So you can begin with pipeline management, then look to other factors that will strengthen your ML services: training, model freshness, and efficiency. In the resources below, we'll look at some of the ML-specific characteristics of these pipelines that you’ll want to consider in your operations. Then, we draw on the experience of Google SREs to show you how to apply your core SRE skills to operating and managing your organization’s machine-learning pipelines. 

    Training ML models

    Training ML models applies the notion of pipelines to specific types of data, often running on specialized hardware. Critical aspects to consider about the pipeline:

    • how much data you’re ingesting

    • how fresh this data needs to be

    • how the system trains and deploys the models 

    • how efficiently the system handles these first three things

    This keynote presents an SRE perspective on the value of applying reliability principles to the components of machine learning systems. It provides insight into why ML systems matter for products, and how SREs should think about them. The challenges that ML systems present include capacity planning, resource management, and monitoring; other challenges include understanding the cost of ML systems as part of your overall operations environment. 

    ML freshness and data volume

    As with any pipeline-based system, a big part of understanding the system is describing how much data it typically ingests and processes. The Data Processing Pipelines chapter in the SRE Workbook lays out the fundamentals: automate the pipeline’s operation so that it is resilient, and can operate unattended. 

    You’ll want to develop Service Level Objectives (SLOs) in order to measure the pipeline’s health, especially for data freshness, i.e., how recently the model got the data it’s using to produce an inference for a customer. Understanding freshness provides an important measure of an ML system’s health, as data that becomes stale may lead to lower-quality inferences and sub-optimal outcomes for the user. For some systems, such as weather forecasting, data may need to be very fresh (just minutes or seconds old); for other systems, such as spell-checkers, data freshness can lag on the order of days — or longer! Freshness requirements will vary by product, so it’s important that you know what you’re building and how the audience expects to use it. 
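
    As a rough illustration of a freshness SLO, the Python sketch below measures how old the data was at several observation points and reports the fraction that met a target age. The 10-minute target, the sample ingest times, and the function names are hypothetical; a real SLO would be defined from your product's requirements.

```python
from datetime import datetime, timedelta, timezone


def freshness(last_ingested: datetime, now: datetime) -> timedelta:
    """Age of the newest data available to the model at time `now`."""
    return now - last_ingested


def slo_compliance(ages, target: timedelta) -> float:
    """Fraction of observations whose data age is within the target."""
    return sum(age <= target for age in ages) / len(ages)


# Hypothetical observations: data was 2, 4, 7, and 30 minutes old.
now = datetime(2025, 2, 20, 12, 0, tzinfo=timezone.utc)
ingest_times = [now - timedelta(minutes=m) for m in (2, 4, 7, 30)]
ages = [freshness(t, now) for t in ingest_times]

compliance = slo_compliance(ages, target=timedelta(minutes=10))
print(compliance)
```

    A weather-style system might set the target to seconds or minutes, while a spell-checker-style system could set it to days; only the `target` value changes.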

    In this way, freshness is a part of the critical user journey described in the SRE Workbook, describing one aspect of the customer experience. You can read more about data freshness as a component of pipeline systems in the Google SRE article Reliable Data Processing with Minimal Toil.  

    There’s more than freshness to ensuring high-quality data — there’s also how you define the model-training pipeline. A Brief Guide To Running ML Systems in Production gives you the nuts and bolts of this discipline, from using contextual metrics to understand freshness and throughput, to methods for understanding the quality of your input data. 

    Serving efficiency

    The 2021 SRE blog post Efficient Machine Learning Inference provides a valuable resource to learn about improving your model’s performance in a production environment. (And remember, training is never the same as production for ML services!) 

    Optimizing machine learning inference serving is crucial for real-world deployment. In this article, the authors explore multi-model serving off of a shared VM. They cover realistic use cases and how to manage trade-offs between cost, utilization, and latency of model responses. By changing the allocation of models to VMs, and varying the size and shape of those VMs in terms of processing, GPU, and RAM attached, you can improve the cost effectiveness of model serving. 
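
    To get a feel for how VM size changes the number of VMs needed for multi-model serving, here is a toy Python sketch that packs models onto identical VMs by memory using the first-fit-decreasing heuristic. This is a swapped-in textbook heuristic, not the method from the article, and it ignores CPU, GPU, and latency; the model names and sizes are invented.

```python
def first_fit_decreasing(model_ram_gb: dict, vm_ram_gb: int) -> list:
    """Place models on identical VMs, largest first, opening a new VM only when none fits."""
    vms = []  # each VM is {"free": remaining GB, "models": [names]}
    for name, ram in sorted(model_ram_gb.items(), key=lambda kv: kv[1], reverse=True):
        if ram > vm_ram_gb:
            raise ValueError(f"{name} does not fit on a {vm_ram_gb} GB VM")
        for vm in vms:
            if vm["free"] >= ram:
                vm["free"] -= ram
                vm["models"].append(name)
                break
        else:
            vms.append({"free": vm_ram_gb - ram, "models": [name]})
    return vms


# Hypothetical models and their memory needs in GB.
models = {"ranker": 10, "spam": 6, "ocr": 5, "intent": 4, "lang-id": 3}

small = first_fit_decreasing(models, vm_ram_gb=16)
large = first_fit_decreasing(models, vm_ram_gb=32)
print(len(small), len(large))
```

    Comparing the VM count at each size against per-VM price gives a first-order view of the cost side of the trade-off; utilization and latency would need separate measurement.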

    Cost efficiency

    We mentioned that these AI pipelines often rely on specialized hardware. How do you know you’re using this hardware efficiently? Todd Underwood’s talk from SREcon EMEA 2023 on Artificial Intelligence: What Will It Cost You? gives you a sense of how much this specialized hardware costs to run, and how you can provide incentives for using it efficiently. 

    Automation for scale

    This article from Google's SRE team outlines strategies for ensuring reliable data processing while minimizing manual effort, or toil. One of the key takeaways: use an existing, standard platform for as much of the pipeline as possible. After all, your business goals should focus on innovations in presenting the data and the ML model, not in the pipeline itself. The article covers automation, monitoring, and incident response, with a focus on using these concepts to build resilient data pipelines. You’ll read best practices for designing data systems that can handle failures gracefully and reduce a team’s operational burden. This article is essential reading for anyone involved in data engineering or operations. Read more about toil in the SRE Workbook: https://sre.google/workbook/eliminating-toil/

    Next steps

    Successful ML deployments require careful management and monitoring for systems to be reliable and sustainable. That means taking a holistic approach, including implementing data pipelines, training pathways, model management, and validation, alongside monitoring and accuracy metrics. To go deeper, check out this guide on how to use GKE for your AI orchestration.

  51. Cross-Product Solution Developer

    Fri, 14 Feb 2025 17:00:00 -0000

    In today's dynamic digital landscape, building and operating secure, reliable, cost-efficient and high-performing cloud solutions is no easy feat. Enterprises grapple with the complexities of cloud adoption, and often struggle to bridge the gap between business needs, technical implementation, and operational readiness. This is where the Google Cloud Well-Architected Framework comes in. The framework provides comprehensive guidance to help you design, develop, deploy, and operate efficient, secure, resilient, high-performing, and cost-effective Google Cloud topologies that support your security and compliance requirements.

    Who should use the Well-Architected Framework?

    The Well-Architected Framework caters to a broad spectrum of cloud professionals. Cloud architects, developers, IT administrators, decision makers and other practitioners can benefit from years of subject-matter expertise and knowledge both from within Google and from the industry. The framework distills this vast expertise and presents it as an easy-to-consume set of recommendations. 

    The recommendations in the Well-Architected Framework are organized under five business-focused pillars.

    We recently completed a revamp of the guidance in all the pillars and perspectives of the Well-Architected Framework to center the recommendations around a core set of design principles.

    Operational excellence

    • Operational readiness

    • Incident management

    • Resource optimization

    • Change management

    • Continuous improvement

    Security, privacy, and compliance

    • Security by design

    • Zero trust

    • Shift-left security

    • Preemptive cyber-defense

    • Secure and responsible AI

    • AI for security

    • Regulatory, privacy, and compliance needs

    Reliability

    • User-focused goals

    • Realistic targets

    • HA through redundancy

    • Horizontal scaling

    • Observability

    • Graceful degradation

    • Recovery testing

    • Thorough postmortems

    Cost optimization

    • Spending aligned with business value

    • Culture of cost awareness

    • Resource optimization

    • Continuous optimization

    Performance optimization

    • Resource allocation planning

    • Elasticity

    • Modular design

    • Continuous improvement

    In addition to the above pillars, the Well-Architected Framework provides cross-pillar perspectives that present recommendations for selected domains, industries, and technologies like AI and machine learning (ML).

    Benefits of adopting the Well-Architected Framework

    The Well-Architected Framework is much more than a collection of design and operational recommendations. The framework empowers you with a structured principles-oriented design methodology that unlocks many advantages:

    • Enhanced security, privacy, and compliance: Security is paramount in the cloud. The Well-Architected Framework incorporates industry-leading security practices, helping ensure that your cloud architecture meets your security, privacy, and compliance requirements.

    • Optimized cost: The Well-Architected Framework lets you build and operate cost-efficient cloud solutions by promoting a cost-aware culture, focusing on resource optimization, and leveraging built-in cost-saving features in Google Cloud.

    • Resilience, scalability, and flexibility: As your business needs evolve, the Well-Architected Framework helps you design cloud deployments that can scale to accommodate changing demands, remain highly available, and be resilient to disasters and failures.

    • Operational excellence: The Well-Architected Framework promotes operationally sound architectures that are easy to operate, monitor, and maintain.

    • Predictable and workload-specific performance: The Well-Architected Framework offers guidance to help you build, deploy, and operate workloads that provide predictable performance based on your workloads’ needs.

    • The Well-Architected Framework also includes cross-pillar perspectives for selected domains, industries, and technologies like AI and machine learning (ML).

    The principles and recommendations in the Google Cloud Well-Architected Framework are aligned with Google and industry best practices like Google’s Site Reliability Engineering (SRE) practices, DORA capabilities, the Google HEART framework for user-centered metrics, the FinOps framework, Supply-chain Levels for Software Artifacts (SLSA), and Google's Secure AI Framework (SAIF).

    Embrace the Well-Architected Framework to transform your Google Cloud journey, and get comprehensive guidance on security, reliability, cost, performance, and operations — as well as targeted recommendations for specific industries and domains like AI and ML. To learn more, visit Google Cloud Well-Architected Framework.

  52. Product Manager

    Thu, 30 Jan 2025 20:00:00 -0000

    We are thrilled to announce the collaboration between Google Cloud, AWS, and Azure on Kube Resource Orchestrator, or kro (pronounced “crow”). kro introduces a Kubernetes-native, cloud-agnostic way to define groupings of Kubernetes resources. With kro, you can group your applications and their dependencies as a single resource that can be easily consumed by end users.

    Challenges of Kubernetes resource orchestration

    Platform and DevOps teams want to define standards for how application teams deploy their workloads, and they want to use Kubernetes as the platform for creating and enforcing these standards. Each service needs to handle everything from resource creation to security configurations, monitoring setup, defining the end-user interface, and more. There are client-side templating tools that can help with this (e.g., Helm, Kustomize), but Kubernetes lacks a native way for platform teams to create custom groupings of resources for consumption by end users. 

    Before kro, platform teams needed to invest in custom solutions such as building custom Kubernetes controllers, or using packaging tools like Helm, which can’t leverage the benefits of Kubernetes CRDs. These approaches are costly to build, maintain, and troubleshoot, and complex for non-Kubernetes experts to consume. This is a problem many Kubernetes users face. Rather than developing vendor-specific solutions, we’ve partnered with Amazon and Microsoft on making K8s APIs simpler for all Kubernetes users.

    How kro simplifies the developer experience

    kro is a Kubernetes-native framework that lets you create reusable APIs to deploy multiple resources as a single unit. You can use it to encapsulate a Kubernetes deployment and its dependencies into a single API that your application teams can use, even if they aren’t familiar with Kubernetes. You can use kro to create custom end-user interfaces that expose only the parameters an end user should see, hiding the complexity of Kubernetes and cloud-provider APIs.

    kro does this by introducing the concept of a ResourceGraphDefinition, which specifies how a standard Kubernetes Custom Resource Definition (CRD) should be expanded into a set of Kubernetes resources. End users define a single resource, which kro then expands into the custom resources defined in the CRD.

    kro can be used to group and manage any Kubernetes resources. Tools like ACK, KCC, or ASO define CRDs to manage cloud provider resources from Kubernetes (these tools enable cloud provider resources, like storage buckets, to be created and managed as Kubernetes resources). kro can also be used to group resources from these tools, along with any other Kubernetes resources, to define an entire application deployment and the cloud provider resources it depends on.

    Example use cases

    Below, you’ll find some examples of kro being used with Google Cloud. You can find additional examples on the kro website.

    Example 1: GKE cluster definition

    Imagine that a platform administrator wants to give end users in their organization self-service access to create GKE clusters. The platform administrator creates a kro ResourceGraphDefinition called GKEclusterRGD that defines the required Kubernetes resources and a CRD called GKEcluster that exposes only the options they want to be configurable by end users. In addition to creating a cluster, the platform team also wants clusters to deploy administrative workloads such as policies, agents, etc. The ResourceGraphDefinition defines the following resources, using KCC to provide the mappings from K8s CRDs to Google Cloud APIs:

    • GKE cluster, Container Node Pools, IAM ServiceAccount, IAM PolicyMember, Services, Policies

    The platform administrator would then define the end-user interface so that they can create a new cluster by creating an instance of the CRD that defines:

    • Cluster name, Nodepool name, Max nodes, Location (e.g. us-east1), Networks (optional)

    Everything related to policy, service accounts, and service activation (and how these resources relate to each other) is hidden from the end user, simplifying their experience.

    Example 2: Web application definition

    In this example, a DevOps Engineer wants to create a reusable definition of a web application and its dependencies. They create a ResourceGraphDefinition called WebAppRGD, which defines a new Kubernetes CRD called WebApp. This new resource encapsulates all the necessary resources for a web application environment, including:

    • Deployments, service, service accounts, monitoring agents, and cloud resources like object storage buckets. 

    The WebAppRGD ResourceGraphDefinition can set a default configuration, and also define which parameters can be set by the end user at deployment time (kro gives you the flexibility to decide what is immutable, and what an end user is able to configure). A developer then creates an instance of the WebApp CRD, inputting any user-facing parameters. kro then deploys the desired Kubernetes resources.
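
    Conceptually, the expansion described here takes one small, user-facing object and produces several full Kubernetes manifests, filling in defaults the platform team chose. The Python sketch below imitates that idea with plain dictionaries. It is not kro's implementation or its ResourceGraphDefinition syntax: the `expand_web_app` function, the `example.com/v1alpha1` API group, the `image` and `replicas` fields, and the default of two replicas are all hypothetical.

```python
def expand_web_app(instance: dict) -> list:
    """Expand one WebApp-style instance into the manifests it stands for."""
    # Platform default first, then whatever the end user is allowed to set.
    spec = {"replicas": 2, **instance["spec"]}
    name = instance["metadata"]["name"]
    labels = {"app": name}

    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": spec["replicas"],
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{"name": name, "image": spec["image"]}]},
            },
        },
    }
    service = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {"selector": labels, "ports": [{"port": 80}]},
    }
    service_account = {
        "apiVersion": "v1",
        "kind": "ServiceAccount",
        "metadata": {"name": name},
    }
    return [deployment, service, service_account]


# What an end user might write: only the parameters exposed to them.
web_app = {
    "apiVersion": "example.com/v1alpha1",  # hypothetical group/version
    "kind": "WebApp",
    "metadata": {"name": "storefront"},
    "spec": {"image": "nginx:1.27"},
}

manifests = expand_web_app(web_app)
print([m["kind"] for m in manifests])
```

    In kro itself this mapping is declared in the ResourceGraphDefinition and carried out inside the cluster, rather than by client-side code like this.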

    Key benefits of kro

    We believe kro is a big step forward for platform engineering teams, delivering a number of advantages:

    • Kubernetes-native: kro leverages Kubernetes Custom Resource Definitions (CRDs) to extend Kubernetes, so it works with any Kubernetes resource and integrates with existing Kubernetes tools and workflows.

    • Lets you create a simplified end user experience: kro makes it easy to define end-user interfaces for complex groups of Kubernetes resources, making it easy for people who are not Kubernetes experts to consume services built on Kubernetes. 

    • Enables standardized services for application teams: kro templates can be reused across different projects and environments, promoting consistency and reducing duplication of effort.

    Get started with kro

    kro is available as an open-source project on GitHub. The GitHub organization is currently jointly owned by teams from Google, AWS, and Microsoft, and we welcome contributions from the community. We also have a website with documentation on installing and using kro, including example use cases. As an early-stage project, kro is not yet ready for production use, but we still encourage you to test it out in your own Kubernetes development environments!

  53. Senior Product Manager, Google

    Thu, 23 Jan 2025 17:00:00 -0000

    Platform engineering, one of Gartner’s top 10 strategic technology trends for 2024, is rapidly becoming indispensable for enterprises seeking to accelerate software delivery and improve developer productivity. How does it do that? Platform engineering is about providing the right infrastructure, tools, and processes that enable efficient, scalable software development, deployment, and management, all while minimizing the cognitive burden on developers.

    To uncover the secrets to platform engineering success, Google Cloud partnered with Enterprise Strategy Group (ESG) on a comprehensive research study of 500 global IT professionals and application developers working at organizations with at least 500 employees, all with formal platform engineering teams. Our goal was to understand whether they had adopted platform engineering, and if so, the impact that has had on their company’s software delivery capabilities. 

    The resulting report, “Building Competitive Edge With Platform Engineering: A Strategic Guide,” reveals common patterns, expectations, and actionable best practices for overcoming challenges and fully leveraging platform engineering. This blog post highlights some of the most powerful insights from this study.

    Platform engineering is no longer optional

    The research confirms that platform engineering is no longer a nascent concept. 55% of the global organizations we invited to participate have already adopted platform engineering. Of those, 90% plan to expand its reach to more developers. Furthermore, 85% of companies using platform engineering report that their developers rely on the platform to succeed. These figures highlight that platform engineering is no longer just a trend; it's becoming a vital strategy for organizations seeking to unlock the full potential of their cloud and IT investments and gain a competitive edge.

    Figure 1: 55% of 900+ global organizations surveyed have adopted platform engineering

    Three keys to platform engineering success

    The report identifies three critical components that are central to the success of mature platform engineering leaders. 

    1. Fostering close collaboration between platform engineers and other teams to ensure alignment 

    2. Adopting a “platform as a product” approach, which involves treating the developer platform with a clear roadmap, communicated value, and tight feedback loops

    3. Defining success by measuring performance through clear metrics such as deployment frequency, failure recovery time, and lead time for changes 
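
    The metrics named in the third item can be derived from a simple deployment log. The Python sketch below computes deployment frequency, median lead time for changes, and mean failure recovery time from a handful of invented records; the record layout and the seven-day reporting period are assumptions, not something the report prescribes.

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical records: (commit time, deploy time, time service was restored or None if no failure).
deployments = [
    (datetime(2025, 1, 6, 9), datetime(2025, 1, 6, 15), None),
    (datetime(2025, 1, 7, 10), datetime(2025, 1, 8, 10), datetime(2025, 1, 8, 11)),
    (datetime(2025, 1, 9, 8), datetime(2025, 1, 9, 12), None),
    (datetime(2025, 1, 10, 9), datetime(2025, 1, 10, 11), datetime(2025, 1, 10, 14)),
]
period_days = 7  # assumed reporting window

# Deployment frequency: deployments per day over the window.
deploys_per_day = len(deployments) / period_days

# Lead time for changes: commit to deploy.
lead_times = [deployed - committed for committed, deployed, _ in deployments]
median_lead_time = median(lead_times)

# Failure recovery time: deploy to restoration, for failed deployments only.
recoveries = [restored - deployed for _, deployed, restored in deployments if restored]
mean_recovery = sum(recoveries, timedelta()) / len(recoveries)

print(deploys_per_day, median_lead_time, mean_recovery)
```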

    It's noteworthy that while many organizations have begun their platform engineering journey, only 27% of adopters have fully integrated these three key components in their practices, signaling a significant opportunity for further improvements.

    AI: platform engineering's new partner

    One of the most compelling insights of this report is the synergistic relationship between platform engineering and AI. A remarkable 86% of respondents believe that platform engineering is essential to realizing the full business value of AI. At the same time, a vast majority of companies view AI as a catalyst for advancing platform engineering, with 94% of organizations identifying AI to be ‘Critical’ or ‘Important’ to the future of platform engineering.

    Beyond speed: key benefits of platform engineering

    The study also identified three cohorts of platform engineering adopters — nascent, established, and leading — based on whether and how much adopters had embraced the above-mentioned three key components of platform engineering success. The study shows that leading adopters gain more in terms of speed, efficiency, and productivity, and offers guidance for nascent and established adopters to improve their overall platform engineering maturity to gain more benefits.

    The report also identified some additional benefits of platform engineering, including:

    • Improved employee satisfaction, talent acquisition & retention: mature platforms foster a positive developer experience that directly impacts company culture. Developers and IT pros working for organizations with mature developer platforms are much more likely to recommend their workplace to their peers.

    • Accelerated time to market: mature platform engineering adopters have significantly shortened time to market. 71% of leading adopters of platform engineering indicated they have significantly accelerated their time to market, compared with 28% of less mature adopters.

    Don't go it alone

    A vast majority (96%) of surveyed organizations are leveraging open-source tools to build their developer platforms. Moreover, most (84%) are partnering with external vendors to manage and support their open-source environments. Co-managed platforms with a third party or a cloud partner benefit from a higher degree of innovation. Organizations with co-managed platforms allocate an average of 47% of their developers’ productive time to innovation and experimentation, compared to just 38% for those that prefer to manage their platforms with internal staff.

    Ready to succeed? Explore the full report

    While this blog provides a glimpse into the key findings from this study, the full report goes much further, revealing key platform engineering strategies and practices that will help you stay ahead of the curve. Download the report to explore additional topics, including:

    • The strategic considerations of centralized and distributed platform engineering teams

    • The key drivers behind platform engineering investments

    • Top priorities driving platform adoption for developers, ensuring alignment with their needs

    • Key pain points to anticipate and navigate on the road to platform engineering success

    • How platform engineering boosts productivity, performance, and innovation across the entire organization

    • The strategic importance of open source in platform engineering for competitive advantage

    • The transformative role of platform engineering for AI/ML workloads as adoption of AI increases

    • How to develop the right platform engineering strategy to drive scalability and innovation

    Download the full report now.

  54. Software Engineer

    Thu, 23 Jan 2025 17:00:00 -0000

    Editor’s note: This blog post was updated to reflect the general availability status of these features as of March 31, 2025.


    Cloud Deploy is a fully managed continuous delivery platform that automates the delivery of your application. On top of existing automation features, customers tell us they want other ways to automate their deployments to keep their production environments reliable and up to date.

    We're happy to announce three new features to help with that, all in GA.

    1. Repair rollouts

    The new repair rollout automation rule lets you retry failed deployments or automatically roll back to a previously successful release when an error occurs. These errors could come in any phase of a deployment: a pre-deployment SQL migration, a misconfiguration detected when talking to a GKE cluster, or as part of a deployment verification step. In any of these cases, the repair rollout automation lets you retry the failed step a configurable number of times, perfect for those occasionally flaky end-to-end tests. If the retry succeeds, the rollout continues. If the retries fail (or none are configured) the repair rollout automation can also roll back to the previously successful release.
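
    The retry-then-roll-back behavior described above can be sketched in a few lines of Python. This is only an illustration of the decision logic, not the Cloud Deploy API: `repair_rollout`, `run_step`, and `roll_back` are hypothetical names.

```python
from typing import Callable, Tuple


def repair_rollout(
    run_step: Callable[[], bool],
    max_retries: int,
    roll_back: Callable[[], None],
) -> Tuple[str, int]:
    """Run a deployment step, retrying on failure, then roll back.

    Hypothetical helper for illustration; not part of any Cloud Deploy API.
    Returns the outcome and the number of attempts made.
    """
    attempts = 0
    # One initial attempt plus up to max_retries retries.
    for _ in range(1 + max_retries):
        attempts += 1
        if run_step():
            return ("continued", attempts)
    # Every attempt failed (or no retries were configured): roll back.
    roll_back()
    return ("rolled_back", attempts)


# A flaky end-to-end test that fails twice and then passes.
results = iter([False, False, True])
print(repair_rollout(lambda: next(results), max_retries=2, roll_back=lambda: None))
```

    With two retries configured, the flaky step gets three attempts in total before a rollback would be considered.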

    2. Deploy policies

    Automating deployments is powerful, but it can also be important to put some constraints on the automation. The new deploy policies feature is intended to limit what these automations (or users) can do. Initially, we're launching a time-windows policy which can, for example, inhibit deployments during evenings, weekends, or during important events. While an on-caller with the Policy Overrider role could "break glass" to get around these policies, automated deployments won't be able to trigger a rollout in the middle of your big demo.
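
    As a rough sketch of how a time-window policy behaves, the check below blocks rollouts during configured windows unless the caller holds an override. The `BlockedWindow` type, the window definitions, and `rollout_allowed` are assumptions made for illustration and do not mirror the actual deploy policy schema.

```python
from dataclasses import dataclass
from datetime import datetime, time
from typing import FrozenSet, Iterable


@dataclass(frozen=True)
class BlockedWindow:
    """A recurring weekly window in which rollouts are inhibited (hypothetical)."""
    weekdays: FrozenSet[int]  # Monday=0 ... Sunday=6, as in datetime.weekday()
    start: time               # inclusive
    end: time                 # inclusive


# Example windows: weekday evenings and entire weekends.
EVENINGS = BlockedWindow(frozenset({0, 1, 2, 3, 4}), time(18, 0), time.max)
WEEKENDS = BlockedWindow(frozenset({5, 6}), time.min, time.max)


def rollout_allowed(
    now: datetime,
    windows: Iterable[BlockedWindow],
    has_override: bool = False,
) -> bool:
    """Return True if a rollout may start at `now`."""
    if has_override:
        # "Break glass": an on-caller with the override role bypasses the policy.
        return True
    for window in windows:
        if now.weekday() in window.weekdays and window.start <= now.time() <= window.end:
            return False
    return True
```

    An automation would call `rollout_allowed(now, [EVENINGS, WEEKENDS])` and never pass the override, while a human on-caller could.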

    3. Timed promotions

    After a release is successfully rolled out, you may want to automatically deploy it to the next environment. Our previous auto-promote feature let you promote a release after a specified duration, for example moving it into prod 12 hours after it went to staging. But often you want promotions to happen on a schedule, not based on a delay. Within Google, for example, we typically recommend that teams promote from a dev environment into staging every Thursday, and then start a promotion into prod on Monday mornings. With the new timed promotion automation, Cloud Deploy can handle these scheduled promotions for you. 
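
    The difference between a delay-based and a schedule-based promotion can be made concrete with a small sketch. Both functions are hypothetical and use naive datetimes for simplicity; they illustrate the two timing models rather than how Cloud Deploy implements them.

```python
from datetime import datetime, time, timedelta


def delayed_promotion(rolled_out: datetime, delay: timedelta) -> datetime:
    """Delay-based: promote a fixed duration after the rollout finished."""
    return rolled_out + delay


def next_scheduled_promotion(after: datetime, weekday: int, at: time) -> datetime:
    """Schedule-based: first occurrence of `weekday` at `at` strictly after `after`.

    `weekday` follows datetime.weekday(): Monday=0 ... Sunday=6.
    """
    days_ahead = (weekday - after.weekday()) % 7
    candidate = datetime.combine(after.date() + timedelta(days=days_ahead), at)
    if candidate <= after:
        # Today's slot has already passed, so wait for next week's.
        candidate += timedelta(days=7)
    return candidate


# A release reaches staging on a Thursday afternoon.
staged = datetime(2025, 1, 23, 15, 0)
print(delayed_promotion(staged, timedelta(hours=12)))     # early Friday morning
print(next_scheduled_promotion(staged, 0, time(9, 0)))    # the following Monday, 09:00
```

    The delay-based promotion lands at an arbitrary hour that depends on when the rollout happened, whereas the scheduled one always lands in the agreed slot.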

    The future

    Comprehensive, easy-to-use, and cost-effective DevOps tools are key to efficient software delivery, and it’s our hope that Cloud Deploy will help you implement complete CI/CD pipelines. Stay tuned as we introduce exciting new capabilities and features to Cloud Deploy in the months to come.

    Update your current pipelines with these new features today. Check out the product page, documentation, quickstarts, and tutorials. Finally, if you have feedback on Cloud Deploy, you can join the conversation. We look forward to hearing from you!

  55. Senior Staff Reliability Engineer

    Thu, 09 Jan 2025 17:00:00 -0000

    Cloud applications like Google Workspace provide benefits such as collaboration, availability, security, and cost-efficiency. However, for cloud application developers, there’s a fundamental conflict between achieving high availability and the constant evolution of cloud applications. Changes to the application, such as new code, configuration updates, or infrastructure rearrangements, can introduce bugs and lead to outages. These risks pose a challenge for developers, who must balance stability and innovation while minimizing disruption to users.

    Here on the Google Workspace Site Reliability Engineering team, we once moved a replica of Google Docs to a new data center because we needed extra capacity. But moving the associated data, which was vast, overloaded a key index in our database, restricting user ability to create new docs. Thankfully, we were able to identify the root cause and mitigate the problem quickly. Still, this experience convinced us of the need to reduce the risk of a global outage from a simple application change.

    Limit the blast radius

    Our approach to reducing the risk of global outages is to limit the “blast radius,” or extent, of an outage by vertically partitioning the serving stack. The basic idea is to run isolated instances (“partitions”) of application servers and storage (Figure 1). Each partition contains all the various servers necessary to service a user request from end to end. Each production partition also has a pseudo-random mix of users and workloads, so all the partitions have similar resource needs. When it comes time to make changes to the application code, we deploy new changes to one partition at a time. Bad changes may cause a partition-wide outage, but we are protected from a global application outage. 
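
    A minimal sketch of the idea, under stated assumptions: users are mapped to partitions with a stable hash (one simple way to get a pseudo-random but sticky mix), and a change is rolled out one partition at a time, halting at the first unhealthy partition. The function names and the hashing scheme are illustrative choices, not a description of how Google Workspace assigns users.

```python
import hashlib
from typing import Callable, Iterable, List, Optional, Tuple


def partition_for(user_id: str, num_partitions: int) -> int:
    """Map a user to a partition with a stable hash (sticky, pseudo-random)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def roll_out(
    partitions: Iterable[int],
    deploy: Callable[[int], None],
    is_healthy: Callable[[int], bool],
) -> Tuple[List[int], Optional[int]]:
    """Deploy to one partition at a time and stop at the first bad one.

    Returns the partitions that were updated successfully and the partition
    (if any) where the rollout halted.
    """
    updated: List[int] = []
    for partition in partitions:
        deploy(partition)
        if not is_healthy(partition):
            # The bad change is confined to this partition; the rest are untouched.
            return updated, partition
        updated.append(partition)
    return updated, None
```

    Note that a plain modulo hash reshuffles users if the number of partitions changes, so a production system would more likely persist the assignment.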

    Compare this approach to using canarying alone, in which new features or code changes are released to a small group of users before rolling them out to the rest. While canarying deploys changes first to just a few servers, it doesn’t prevent problems from spreading. For example, we’ve had incidents where canaried changes corrupted data used by all the servers in the deployment. With partitioning, the effects of bad changes are isolated to a single partition, preventing such contagion. Of course, in practice, we combine both techniques: canarying new changes to a few servers within a single partition.

    [Figure 1: isolated partitions of application servers and storage]

    Benefits of partitioning

    Broadly speaking, partitioning brings a lot of advantages:

    • Availability: Initially, the primary motivation for partitioning was to improve the availability of our services and avoid global outages. In a global outage, an entire service may be down (e.g., users cannot log into Gmail), or a critical user journey may be broken (e.g., users cannot create Calendar events). Both are obviously things to be avoided.

      Still, the reliability benefits of partitioning can be hard to quantify; global outages are relatively infrequent, so if you don’t have one for a while, it may be due to partitioning, or may be due to luck. That said, we’ve had several outages that were confined to a single partition, and believe they would have expanded into global outages without it.
    • Flexibility: We evaluate many changes to our systems by experimenting with data. Many user-facing experiments, such as a change to a UI element, use discrete groups of users. For example, in Gmail we can choose an on-disk layout that stores the message bodies of emails inline with the message metadata, or a layout that separates them into different disk files. The right decision depends on subtle aspects of the workload. For example, separating message metadata and bodies may reduce latency for some user interactions, but requires more compute resources in our backend servers to perform joins between the body and metadata columns. With partitioning, we can easily evaluate the impact of these choices in contained, isolated environments. 
    • Data location: Google Workspace lets enterprise customers specify that their data be stored in a specific jurisdiction. In our previous, non-partitioned architecture, such guarantees were difficult to provide, especially since services were designed to be globally replicated to reduce latency and take advantage of available capacity.

    Challenges

    Despite the benefits, there are some challenges to adopting partitioning. In some cases, these challenges make it hard or risky to move from a non-partitioned to a partitioned setup. In other cases, challenges persist even after partitioning. Here are the issues as we see them:

    • Not all data models are easy to partition: For example, Google Chat needs to assign both users and chat rooms to partitions. Ideally, a chat and its members would be in a single partition to avoid cross-partition traffic. However, in practice, this is difficult to accomplish. Chat rooms and users form a graph, with users in many chat rooms and chat rooms containing many users. In the worst case, this graph may have only a single connected component — the user. If we were to slice the graph into partitions, we could not guarantee that all users would be in the same partition as their chat rooms.
    • Partitioning a live service requires care: Most of our services pre-date partitioning. As a result, adopting partitioning means taking a live service and changing its routing and storage setup. Even if the end goal is higher reliability, making these kinds of changes in a live system is often the source of outages, and can be risky.
    • Partition misalignment between services: Our services often communicate with each other. For example, if a new person is added to a Calendar event, Calendar servers make a Remote Procedure Call (RPC) to Gmail delivery servers to send the new invitee an email notification. Similarly, Calendar events with video call links require Calendar to talk to Meet servers for a meeting id. Ideally, we would get the benefits of partitioning even across services. However, aligning partitions between services is difficult. The main reason is that different services tend to use different entity types when determining which partition to use. For example, Calendar partitions on the owner of the calendar while Meet partitions on meeting id. The result is that there is no clear mapping from partitions in one service to another.
    • Partitions are smaller than the service: A modern cloud application is served by hundreds or thousands of servers. We run servers at less than full utilization so that we can tolerate spikes in traffic, and because servers that are saturated with traffic generally perform poorly. If we have 500 servers, and target each at 60% CPU utilization, we effectively have 200 spare servers to absorb load spikes. Because we do not fail over between partitions, each partition has access to a much smaller amount of spare capacity. In a non-partitioned setup, a few server crashes will likely go unnoticed, since there is enough headroom to absorb the lost capacity. But in a smaller partition, these crashes may account for a non-trivial portion of the available server capacity, and the remaining servers may become overloaded.
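
    The headroom arithmetic in the last point can be worked through directly. The 500 servers and the 60% target come from the text; the split into 10 equal partitions and the 10 crashed servers are assumptions chosen only to illustrate the effect.

```python
def spare_servers(total_servers: int, target_utilization: float) -> float:
    """Server-equivalents of idle capacity at the target utilization."""
    return total_servers * (1.0 - target_utilization)


def utilization_after_crashes(
    servers: int, target_utilization: float, crashed: int
) -> float:
    """Utilization of the surviving servers once `crashed` servers are lost."""
    load = servers * target_utilization  # work to be served, in server-equivalents
    return load / (servers - crashed)


# Whole fleet: 500 servers at 60% leaves 200 server-equivalents of headroom.
print(spare_servers(500, 0.60))

# Losing 10 servers out of 500 barely moves utilization (roughly 61%).
print(utilization_after_crashes(500, 0.60, 10))

# The same 10 crashes inside one 50-server partition push it to about 75%.
print(utilization_after_crashes(50, 0.60, 10))
```

    The absolute number of crashes is the same in both cases; only the pool that has to absorb them differs.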

    Key takeaways

    We can improve the availability of web applications by partitioning their serving stacks. These partitions are isolated, because we do not fail over between them. Users and entities are assigned to partitions in a sticky manner to allow us to roll out changes in order of risk tolerance. This approach allows us to roll out changes one partition at a time with confidence that bad changes will only affect a single partition, and ideally that partition contains only users from your organization.

    In short, partitioning supports our efforts to provide stronger and more reliable services to our users, and it might apply to your service as well. For example, you can improve the availability of your application by using Spanner, which provides geo-partitioning out of the box. Read more about geo-partitioning best practices here.

  56. Product Leader for Customer Telemetry, Google Cloud

    Mon, 06 Jan 2025 17:00:00 -0000

    Cloud incidents happen. And when they do, it’s incumbent on the cloud service provider to communicate about the incident to impacted customers quickly and effectively — and for the cloud service consumer to use that information effectively, as part of a larger incident management response. 

    Google Cloud Personalized Service Health provides businesses with fast, transparent, relevant, and actionable communication about Google Cloud service disruptions, tailored to a specific business at its desired level of granularity. Cybersecurity company Palo Alto Networks is one Google Cloud customer and partner that recently integrated Personalized Service Health signals into the incident workflow for its Google Cloud-based PRISMA Access offering, saving its customers critical minutes during active incidents. 

    By programmatically ingesting Personalized Service Health signals into advanced workflow components, Palo Alto can quickly make decisions such as triggering contingency actions to protect business continuity.

    Let’s take a closer look at how Palo Alto integrated Personalized Service Health into its operations.

    The Personalized Service Health integration

    Palo Alto ingests Personalized Service Health logs into its internal AIOps system, which centralizes incident communications for PRISMA Access and applies advanced techniques to classify and distribute signals to the people responsible for responding to a given incident.

    Personalized Service Health UI Incident list view

    Users of Personalized Service Health can filter what relevance levels they want to see. Here, “Partially related” reflects an issue anywhere in the world with the products that are used. “Related” reflects that the problem is detected within the data center regions, while “Impacted” means that Google has verified the impact to the customer for specific services.

    While Google is still confirming an incident, Personalized Service Health communicates some of these incidents as 'PSH Emerging Incident' to provide customers with early notification. Once Google confirms the incident, these incidents are merged with 'PSH Confirmed Incidents'. This helps customers respond faster to a specific incident that’s impacting their environment or escalate back to Google, if needed. 

    Personalized Service Health distributes updates throughout an active incident, typically every 30 minutes, or sooner if there’s progress to share. These updates are also written to logs, which Palo Alto ingests into AIOps.

    Responding to disruptive, unplanned cloud service provider incidents can be accelerated by programmatically ingesting and distributing incident communications. This is especially true in large-scale organizations such as Palo Alto, which has multiple teams involved in incident response for different applications, workloads and customers. 

    Fueling the incident lifecycle

    Palo Alto further leverages the ingested Personalized Service Health signals in its AIOps platform, which uses machine learning (ML) and analytics to automate IT operations. AIOps harnesses big data from operational appliances to detect and respond to issues instantaneously. AIOps correlates these signals with internally generated alerts to declare an incident that is affecting multiple customers. These AIOps alerts are tied to other incident management tools that assist with managing the incident lifecycle, including communication, regular updates and incident resolution.

    In addition, a data enrichment pipeline takes Personalized Service Health incidents, adds Palo Alto’s related information, and publishes the events to Pub/Sub. AIOps then consumes the incident data from Pub/Sub, processes it, correlates it with related event signals, and notifies subscribed channels.
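
    The enrichment step might look something like the sketch below. Every field name here (`project_id`, `customer`) and the lookup table are hypothetical, since the post does not describe Palo Alto's actual schema. The function returns bytes because Pub/Sub message payloads are byte strings; the publish call itself is left out.

```python
import json
from typing import Dict


def enrich_incident(incident: Dict, customers_by_project: Dict[str, str]) -> bytes:
    """Attach customer context to an incident and serialize it for publishing.

    Field names are illustrative assumptions, not a real schema.
    """
    enriched = dict(incident)  # copy so the caller's incident is not mutated
    project = incident.get("project_id")
    enriched["customer"] = customers_by_project.get(project, "unknown")
    return json.dumps(enriched, sort_keys=True).encode("utf-8")


payload = enrich_incident(
    {"project_id": "proj-123", "state": "ACTIVE"},
    {"proj-123": "Example Corp"},
)
print(payload.decode("utf-8"))
```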

    Palo Alto organizes Google Cloud assets into folders within the Google Cloud console. Each project represents a Palo Alto PRISMA Access customer. To receive incident signals that are likewise specific to end customers, Palo Alto creates a log sink that’s specific to each folder, aggregating service health logs at the folder level. Palo Alto then receives incident signals specific to each customer so it can take further action.

    Palo Alto drives the following actions based on incident communications flowing from Google Cloud:

    • Proactive detection of zonal, inter-regional, and external en-masse failures

    • Accurate identification of workloads affected by cloud provider incidents

    • Correlation of product issues caused by cloud service degradation in Google Cloud Platform itself

    Seeing Personalized Service Health’s value

    Incidents caused by cloud providers often go unnoticed or are difficult to isolate without involving several of the cloud provider’s teams (support, engineering, SRE, account management). The Personalized Service Health alerting framework, combined with the AIOps correlation engine, allows Palo Alto’s SRE teams to isolate issues caused by a cloud provider near-instantaneously.

    Palo Alto’s incident management workflow is designed to address mass failures versus individual customer outages, ensuring the right teams are engaged until the incidents are resolved. This includes notifying relevant parties, such as the on-call engineer and the Google Cloud support team. With Personalized Service Health, Palo Alto can capture both event types, i.e., mass failures as well as individual customer outages.

    Palo Alto gets value from Personalized Service Health in multiple ways, beginning with faster incident response and contingency actions with which to optimize business continuity, especially for impacted customers of PRISMA Access. In the event of an incident impacting them, Prisma Access customers naturally seek and expect information from Palo Alto. By ensuring this information flows rapidly from Google Cloud to Palo Alto’s incident response systems, Palo Alto is able to provide more insightful answers to these end customers, and plans to serve additional Palo Alto use cases based on both existing and future Personalized Service Health capabilities. 

    Take your incident management to the next level

    Google Cloud is continually evolving Personalized Service Health to provide deeper value for all Google Cloud customers — from startups, to ISVs and SaaS providers, to the largest enterprises. Ready to get started? Learn more about Personalized Service Health, or reach out to your account team.


    We'd like to thank Jose Andrade, Pankhuri Kumar and Sudhanshu Jain of Google for their contributions to this collaboration between PANW and Google Cloud.

  57. Staff Software Engineer

    Mon, 09 Dec 2024 17:00:00 -0000

    From helping your developers write better code faster with Code Assist, to helping cloud operators more efficiently manage usage with Cloud Assist, Gemini for Google Cloud is your personal AI-powered assistant. 

    However, understanding exactly how your internal users are using Gemini has been a challenge — until today. 

    Today we are announcing Cloud Logging and Cloud Monitoring support for Gemini for Google Cloud. Currently in public preview, Cloud Logging records requests and responses between Gemini for Google Cloud and individual users, while Cloud Monitoring reports 1-day, 7-day, and 28-day Gemini for Google Cloud active users and response counts in aggregate.

    Cloud Logging

    In addition to offering customers general visibility into the impact of Gemini, there are a few scenarios where logs are useful:

    • to track the provenance of your AI-generated content

    • to record and review user usage of Gemini for Google Cloud 

    This feature is opt-in. When enabled, it logs your users’ Gemini for Google Cloud activity to Cloud Logging (Cloud Logging charges apply).

    Once enabled, log entries are made for each request to and response from Gemini for Google Cloud. In a typical request entry, Logs Explorer would provide an entry similar to the following example:

    [Screenshot: example request log entry in Logs Explorer]

    There are several things to note about this entry:

    • The content inside jsonPayload contains information about the request. In this case, it was a request to complete Python code with def fibonacci as the input. 

    • The labels tell you the method (CompleteCode), the product (code_assist), and the user who initiated the request (cal@google.com). 

    • The resource labels tell you the instance, location, and resource container (typically project) where the request occurred. 

    In a typical response entry, you’ll see the following:

    [Screenshot: example response log entry in Logs Explorer]

    Note that the request_id values inside the labels are identical for this request and response pair, which lets you match each request to its response.
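
    If you export these entries for offline analysis, the pairing can be sketched as a simple grouping on that label. The entry shape used here (a dict with a `labels` mapping that holds `request_id`) is an assumption based on the description above, and the sample values are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_request_id(entries: Iterable[Dict]) -> Dict[str, List[Dict]]:
    """Group log entries by the request_id found in their labels."""
    groups: Dict[str, List[Dict]] = defaultdict(list)
    for entry in entries:
        request_id = entry.get("labels", {}).get("request_id")
        if request_id is not None:
            groups[request_id].append(entry)
    return dict(groups)


def complete_pairs(groups: Dict[str, List[Dict]]) -> Dict[str, List[Dict]]:
    """Keep only request IDs that have exactly two entries (request + response)."""
    return {rid: items for rid, items in groups.items() if len(items) == 2}


# Made-up sample entries; real entries carry many more fields.
sample = [
    {"labels": {"request_id": "abc", "method": "CompleteCode"}},
    {"labels": {"request_id": "abc", "method": "CompleteCode"}},
    {"labels": {"request_id": "xyz", "method": "CompleteCode"}},
]
print(sorted(complete_pairs(group_by_request_id(sample))))
```

    A request ID with only one entry usually means the matching response has not been logged yet or fell outside the exported time range.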

    In addition to Logs Explorer, Log Analytics supports queries to analyze your log data and helps you answer questions like "How many requests did User XYZ make to Code Assist?" 

    For more details, please see the Gemini for Google Cloud logging documentation.

    Cloud Monitoring 

    Gemini for Google Cloud monitoring metrics help you answer questions like: 

    • How many unique active users used Gemini for Google Cloud services over the past day or seven days? 

    • How many total responses did my users receive from Gemini for Google Cloud services over the past six hours?

    Cloud Monitoring support for Gemini for Google Cloud is available to anybody who uses a Gemini for Google Cloud product. It records responses and active users as Cloud Monitoring metrics, with which you can configure dashboards and alerts. 
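
    Cloud Monitoring computes these aggregates for you, but it can help to see what a windowed active-user count and response count mean. The sketch below assumes a hypothetical list of `(user, timestamp)` pairs, one per response; it is not how the metrics are produced.

```python
from datetime import datetime, timedelta
from typing import Iterable, Tuple

Event = Tuple[str, datetime]  # (user, time a response was sent); hypothetical shape


def active_users(events: Iterable[Event], now: datetime, window: timedelta) -> int:
    """Number of distinct users with at least one response inside the window."""
    cutoff = now - window
    return len({user for user, ts in events if cutoff < ts <= now})


def response_count(events: Iterable[Event], now: datetime, window: timedelta) -> int:
    """Total number of responses inside the window."""
    cutoff = now - window
    return sum(1 for _, ts in events if cutoff < ts <= now)


now = datetime(2025, 1, 1, 12, 0)
events = [
    ("ana", now - timedelta(hours=1)),
    ("ben", now - timedelta(days=3)),
    ("ana", now - timedelta(days=10)),
    ("cal", now - timedelta(days=30)),
]
for days in (1, 7, 28):
    print(days, active_users(events, now, timedelta(days=days)))
```

    A user who appears several times in a window counts once toward active users but every time toward the response count.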

    Because these metrics are available with Cloud Monitoring, you can also use them as part of Cloud Monitoring dashboards. A “Gemini for Google Cloud” dashboard is automatically installed under “GCP Dashboards” when Gemini for Google Cloud usage is detected:

    [Screenshot: the “Gemini for Google Cloud” dashboard in Cloud Monitoring]

    Metrics Explorer offers another avenue where metrics can be examined and filters applied to gain a more detailed view of your usage. This is done by selecting the “Cloud AI Companion Instance” active resource in the Metrics Explorer:

    [Screenshot: Metrics Explorer with the “Cloud AI Companion Instance” resource selected]

    In the example above, response_count is the number of responses sent by Gemini for Google Cloud, and can be filtered for Gemini Code Assist or the Gemini for Google Cloud method (code completion/generation). 

    For more details, please see the Gemini for Google Cloud monitoring documentation.

    What’s next

    We’re continually working on additions to these new capabilities, and in particular are focused on Code Assist logging and metrics enhancements that will bring even further insight and observability into your use of Gemini Code Assist and its impact. To get started with Gemini Code Assist and learn more about Gemini Cloud Assist — as well as observability data about it from Cloud Logging and Monitoring — check out the following links: 

  58. EMEA Practice Solutions Lead, Application Platform

    Tue, 22 Oct 2024 17:00:00 -0000

    At the end of the day, developers build, test, deploy and maintain software. But like with lots of things, it’s about the journey, not the destination.

    Among platform engineers, we sometimes refer to that journey as the developer experience (DX), which encompasses how developers feel and interact with the tools and services they use throughout the software build, test, deployment and maintenance process.

    Prioritizing DX is essential: Frustrated developers lead to inefficiency and talent loss as well as to shadow IT. Conversely, a positive DX drives innovation, community, and productivity. And if you want to provide a positive DX, you need to start measuring how you’re doing.

    At PlatformCon 2024, I gave a talk entitled “Improving your developers' platform experience by applying Google frameworks and methods” where I spoke about Google’s HEART Framework, which provides a holistic view of your organization's developers’ experience through actionable data.

    In this article, I will share ideas on how you can apply the HEART framework to your Platform Engineering practice, to gain a more comprehensive view of your organization’s developer experience. But before I do that, let me explain what the HEART Framework is.

    The HEART Framework: an introduction

    In a nutshell, HEART measures developer behaviors and attitudes from their experience of your platform and provides you with insights into what’s going on behind the numbers, by defining specific metrics to track progress towards goals. This is beneficial because continuous improvements through feedback are vital components of a platform engineering journey, helping both platform and application product teams make decisions that are data-driven and user-centered.

    However, HEART is not a data collection tool in and of itself; rather, it’s a user-sentiment framework for selecting the right metrics to focus on based on product or platform objectives. It balances quantitative or empirical data, e.g., number of active portal users, with qualitative or subjective insights such as "My users feel the portal navigation is confusing." In other words, consider HEART as a framework or methodology for assessing user experience, rather than a specific tool or assessment. It helps you decide what to measure, not how to measure it.

    [Figure: the five HEART categories: Happiness, Engagement, Adoption, Retention, and Task success]

    Let’s take a look at each of these in more detail.

    Happiness: Do users actually enjoy using your product?

    Highlight: Gathering and analyzing developer feedback

    Subjective metrics:

    • Surveys: Conduct regular surveys to gather feedback about overall satisfaction, ease of use, and pain points. Toil negatively affects developer satisfaction and morale. Repetitive, manual work can lead to frustration, burnout, and decreased happiness with the platform.

    • Feedback mechanisms: Establish easy ways for developers to provide direct feedback on specific features or areas of the platform, such as Net Promoter Score (NPS) or Customer Satisfaction (CSAT) surveys.

    • Collect open-ended feedback from developers through interviews and user groups.

    • Sentiment analysis: Analyze developer sentiment expressed in feedback channels, support tickets and online communities.
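
    Of the survey instruments mentioned above, NPS has a simple standard formula: on a 0 to 10 scale, the share of promoters (scores of 9 or 10) minus the share of detractors (scores of 0 to 6), expressed in percentage points. A minimal sketch, with made-up survey responses:

```python
from typing import Sequence


def net_promoter_score(scores: Sequence[int]) -> float:
    """NPS = % promoters (9-10) minus % detractors (0-6), from -100 to 100."""
    if not scores:
        raise ValueError("at least one survey response is required")
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100.0 * (promoters - detractors) / len(scores)


# Made-up responses to "How likely are you to recommend the platform?"
print(net_promoter_score([10, 9, 8, 7, 6, 3]))
```

    Scores of 7 and 8 count as passives: they add to the total number of responses but to neither group.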

    System metrics:

    • Feature requests: Track the number and types of feature requests submitted by developers. This provides insights into their needs and desires and can help you prioritize improvements that will enhance happiness.

    Watch out for: While platforms can boost developer productivity, they might not necessarily contribute to developer job satisfaction. This warrants further investigation, especially if your research suggests that your developers are unhappy.

    Engagement: What is the developer breadth and quality of platform experience?

    Highlight: Frequency and quality of interaction between platform engineers and developers. This covers the intensity and quality of interaction with the platform, participation in chat channels, training, dual ownership of golden paths, joint troubleshooting, engagement in architectural design discussions, and the breadth of interaction by everyone from new hires through to senior developers.

    Subjective metrics:

    • Survey for quality of interaction — focus on depth and type of interaction whether through chat channel, trainings, dual ownership of golden paths, joint troubleshooting, or architectural design discussions

    • High toil can reduce developer engagement with the platform. When developers spend excessive amounts of time on tedious tasks, they are less likely to explore new features, experiment, and contribute to the platform's evolution.

    System metrics:

    • Active users: Track daily, weekly, and monthly active developers and how long they spend on tasks.

    • Usage patterns: Analyze the most used platform features, tools, and portal resources.

    • Frequency of interaction between platform engineers and developers.

    • Breadth of user engagement: Track onboarding time for new hires to reach proficiency, measure the percentage of senior developers actively contributing to golden paths or portal functionality.

    Watch out for: Don’t confuse engagement with satisfaction. Developers may rate the platform highly in surveys, but usage data might reveal low frequency of interaction with core features or a limited subset of teams actively using the platform. Ask them “How has the platform changed your daily workflow?” rather than “Are you satisfied with the platform?”

    Adoption: What is the platform growth rate and developer feature adoption?

    Highlight: Overall acceptance and integration of the platform into the development workflow.

    System metrics:

    • New user registrations: Monitor the growth rate of new developers using the platform.

    • Track the time between registration and first use of the platform, i.e., executing golden paths or using tooling and portal functionality.

    • Number of active users per week / month / quarter / half-year / year who authenticate via the portal and/or use golden paths, tooling and portal functionality

    • Feature adoption: Track how quickly and widely new features or updates are used.

    • Percentage of developers using CI/CD through the platform

    • Number of deployments per user / team / day / week / month, at whatever granularity you choose

    • Training: Evaluate changes in adoption after delivering training.
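
    Two of these adoption metrics, the share of registered developers who have actually used the platform and the time from registration to first use, can be sketched as follows. The input dictionaries and the sample dates are hypothetical; in practice they would come from your portal's registration and usage logs.

```python
from datetime import date
from statistics import median
from typing import Dict, List


def adoption_rate(registered: Dict[str, date], first_used: Dict[str, date]) -> float:
    """Fraction of registered developers who have used the platform at least once."""
    if not registered:
        return 0.0
    adopters = [user for user in registered if user in first_used]
    return len(adopters) / len(registered)


def days_to_first_use(
    registered: Dict[str, date], first_used: Dict[str, date]
) -> List[int]:
    """Days between registration and first use, for developers who adopted."""
    return [
        (first_used[user] - registered[user]).days
        for user in registered
        if user in first_used
    ]


registered = {"ana": date(2025, 1, 1), "ben": date(2025, 1, 1),
              "cal": date(2025, 1, 5), "dee": date(2025, 1, 10)}
first_used = {"ana": date(2025, 1, 3), "ben": date(2025, 1, 11),
              "cal": date(2025, 1, 9)}
print(adoption_rate(registered, first_used))
print(median(days_to_first_use(registered, first_used)))
```

    The median is used rather than the mean so that a few very late adopters do not dominate the figure.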

    Watch out for: Overlooking the "long tail" of adoption. A platform might see a burst of early adoption, but then plateau or even decline if it fails to continuously evolve and meet changing developer needs. Don't just measure initial adoption; monitor how usage evolves over weeks, months, and years.

    Retention: Are developers loyal to the platform?

    Highlight: Long-term engagement and reducing churn.

    Subjective metrics:

    • Use an exit survey if a user is dormant for 12 or more months.

    System metrics:

    • Churn rate: Track the percentage of developers who stop logging into the portal and are not using it.

    • Dormant users: Identify developers who become inactive after 6 months and investigate why.

    • Track services that are less frequently used.
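
    As a sketch of the first two system metrics, churn rate can be computed by comparing the sets of active developers in two periods, and dormant users can be found from last-login dates. Using 180 days as a stand-in for "6 months" is an assumption, as are the input shapes.

```python
from datetime import date, timedelta
from typing import Dict, List, Set


def churn_rate(active_at_start: Set[str], active_at_end: Set[str]) -> float:
    """Fraction of developers active in the first period who are gone in the second."""
    if not active_at_start:
        return 0.0
    churned = active_at_start - active_at_end
    return len(churned) / len(active_at_start)


def dormant_users(
    last_login: Dict[str, date], today: date, inactive_days: int = 180
) -> List[str]:
    """Developers whose last login is more than `inactive_days` days ago."""
    cutoff = today - timedelta(days=inactive_days)
    return sorted(user for user, seen in last_login.items() if seen < cutoff)


print(churn_rate({"ana", "ben", "cal", "dee"}, {"ana", "ben", "eve"}))
print(dormant_users(
    {"ana": date(2024, 12, 1), "ben": date(2025, 6, 15)},
    today=date(2025, 7, 1),
))
```

    New developers who join during the second period ("eve" here) do not reduce the churn rate, since churn is measured against the starting cohort.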

    Watch out for: Misinterpreting the reasons for churn. When developers stop using your platform (churn), it's crucial to understand why. Incorrectly identifying the cause can lead to wasted effort and missed opportunities for improvement. Consider factors outside the platform — churn could be caused by changes in project requirements, team structures or industry trends.

    Task success: Can developers complete specific tasks?

    Highlight: Efficiency and effectiveness of the platform in supporting specific developer activities.

    Subjective metrics:

    • Survey to assess the ongoing presence of toil and its inimical influence on developer productivity, ultimately hindering efficiency and leading to increased task completion times.

    System metrics:

    • Completion rates: Measure the percentage of golden paths and tools successfully run on the platform without errors.

    • Time to complete tasks using golden paths, portal, or tooling.

    • Error rates: Track common errors and failures developers encounter from log files or monitoring dashboards from golden paths, portal or tooling.

    • Mean Time to Resolution (MTTR): When errors do occur, how long does it take to resolve them? A lower MTTR indicates a more resilient platform and faster recovery from failures.

    • Developer platform and portal uptime: Measure the percentage of time that the developer platform and portal are available and operational. Higher uptime ensures developers can consistently access the platform and complete their tasks.
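
    Three of these system metrics reduce to short formulas. The sketch below assumes you already have the raw inputs (a list of run outcomes, a list of resolution durations, and total downtime for a period); how you collect them depends on your tooling.

```python
from datetime import timedelta
from typing import Sequence


def completion_rate(run_succeeded: Sequence[bool]) -> float:
    """Fraction of golden-path or tool runs that finished without errors."""
    return sum(run_succeeded) / len(run_succeeded)


def mean_time_to_resolution(durations: Sequence[timedelta]) -> timedelta:
    """Average time from an error occurring to it being resolved."""
    return sum(durations, timedelta()) / len(durations)


def uptime_percent(period: timedelta, downtime: timedelta) -> float:
    """Percentage of the period in which the platform was available."""
    return 100.0 * (1.0 - downtime / period)


print(completion_rate([True, True, False, True]))
print(mean_time_to_resolution([timedelta(minutes=30), timedelta(minutes=90)]))
print(uptime_percent(timedelta(days=30), timedelta(minutes=43.2)))
```

    In the last call, 43.2 minutes of downtime over 30 days corresponds to roughly 99.9% uptime.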

    Watch out for: Don't confuse task success with task completion. Simply measuring whether developers can complete tasks on the platform doesn't necessarily indicate true success. Developers might find workarounds or complete tasks inefficiently, even if they technically achieve the end goal. It may be worth manually observing developer workflows in their natural environment to identify pain points and areas of friction in their workflows.

    Also, be careful not to misalign task success with business goals. Measuring task completion alone might overlook the broader impact on business objectives. A platform might enable developers to complete tasks efficiently, but if those tasks don't contribute to overall business goals, the platform's true value is questionable.

    Applying the HEART framework to platform engineering

    It’s not necessary to use all of the categories each time. The number of categories to consider really depends on the specific goals and context of the assessment; you can include everything or trim it down to better match your objective. Here are some examples:

    • Improving onboarding for new developers: Focus on adoption, task success and happiness.

    • Launching a new feature: Concentrate on adoption and happiness.

    • Increasing platform usage: Track engagement, retention and task success.

    Keep in mind that relying on just one category will likely provide an incomplete picture.

    When should you use the framework?

    In a perfect world, you would use the HEART framework to establish a baseline assessment a few months after launching your platform, which will provide you with valuable insight into early developer experience. As your platform evolves, this initial data becomes a benchmark for measuring progress and identifying trends. Early measurement allows you to proactively address UX issues, guide design decisions with data, and iterate quickly for optimal functionality and developer satisfaction. If you're starting with an MVP, conduct the baseline assessment once the core functionality is in place and you have a small group of early users to provide feedback.

    After 12 or more months of usage, you can also add metrics that reflect a more mature platform. This can help you gather deeper insights into your developers’ experience by understanding how they are using the platform, measure the impact of changes you’ve made to the platform, or identify areas for improvement and prioritize future development efforts. If you've added new golden paths, tooling, or enhanced functionality, then you'll need to track metrics that measure their success and impact on developer behavior.

    The frequency with which you assess HEART metrics depends on several factors, including:

    • The maturity of your platform: Newer platforms benefit from more frequent reviews (e.g. monthly or quarterly) to track progress and address early issues. As the platform matures, you can reduce the frequency of your HEART assessments (e.g., bi-annually or annually).

    • The rate of change: To ensure updates and changes have a positive impact, apply the HEART framework more frequently when your platform is undergoing a period of rapid evolution such as major platform updates, new portal features or new golden paths, or some change in user behavior. This allows you to closely monitor the effects of each change on key metrics.

    • The size and complexity of your platform: Larger and more complex platforms may require more frequent assessments to capture nuances and potential issues.

    • Your team's capacity: Running HEART assessments requires time and resources. Consider your team's bandwidth and adjust the frequency accordingly.

    Schedule periodic deep dives (e.g. quarterly or bi-annually) using the HEART framework to gain a more in-depth understanding of your platform's performance and identify areas for improvement.

    Taking more steps towards platform engineering

    In this blog post, we’ve shown how the HEART framework can be applied to platform engineering to measure and improve the developer experience. We’ve explored the five key aspects of the framework (happiness, engagement, adoption, retention, and task success) and provided specific metrics for each and guidance on when to apply them.

    By applying these insights, platform engineering teams can create a more positive and productive environment for their developers, leading to greater success in their software development efforts.

    To learn more about platform engineering, check out some of our other articles: “5 myths about platform engineering: what it is and what it isn’t,” “Another five myths about platform engineering,” and “Laying the foundation for a career in platform engineering.”

    And finally, check out the DORA Report 2024, which now has a section on Platform Engineering.

  59. DORA Research Lead

    Tue, 22 Oct 2024 16:00:00 -0000

    The DORA research program has been investigating the capabilities, practices, and measures of high-performing technology-driven teams and organizations for more than a decade. It has published reports based on data collected from annual surveys of professionals working in technical roles, including software developers, managers, and senior executives.

    Today, we’re pleased to announce the publication of the 2024 Accelerate State of DevOps Report, marking a decade of DORA’s investigation into high-performing technology teams and organizations. DORA’s four key metrics, introduced in 2013, have become the industry standard for measuring software delivery performance. 

    Each year, we seek to gain a comprehensive understanding of standard DORA performance metrics, and how they intersect with individual, workflow, team, and product performance. We now include how AI adoption affects software development across multiple levels, too.

    We also establish reference points each year to help teams understand how they are performing, relative to their peers, and to inspire teams with the knowledge that elite performance is possible in every industry. DORA’s research over the last decade has been designed to help teams get better at getting better: to strive to improve their improvements year over year. 

    For a quick overview of this year’s report, you can read our executive DORA Report summary, which spotlights adoption trends and the impact of AI, the emergence of platform engineering, and the continuing significance of developer experience.

    Organizations across all industries are prioritizing the integration of AI into their applications and services. Developers are increasingly relying on AI to improve their productivity and fulfill their core responsibilities. This year's research reveals a complex landscape of benefits and tradeoffs for AI adoption.

    The report underscores the need to approach platform engineering thoughtfully, and emphasizes the critical role of developer experience in achieving high performance. 

    AI: Benefits, challenges, and developing trust

    Widespread AI adoption is reshaping software development practices. More than 75 percent of respondents said that they rely on AI for at least one daily professional responsibility. The most prevalent use cases include code writing, information summarization, and code explanation. 

    The report confirms that AI is boosting productivity for many developers. More than one-third of respondents experienced “moderate” to “extreme” productivity increases due to AI.

    A 25% increase in AI adoption is associated with improvements in several key areas:

    • 7.5% increase in documentation quality

    • 3.4% increase in code quality

    • 3.1% increase in code review speed

    However, despite AI’s potential benefits, our research revealed a critical finding: AI adoption may negatively impact software delivery performance. As AI adoption increased, it was accompanied by an estimated 1.5% decrease in delivery throughput and an estimated 7.2% reduction in delivery stability. Our data suggest that improving the development process does not automatically improve software delivery, at least not without proper adherence to the basics of successful software delivery, like small batch sizes and robust testing mechanisms. AI has positive impacts on many important individual and organizational factors that foster the conditions for high software delivery performance. But AI does not appear to be a panacea.

    Our research also shows that despite the productivity gains, 39% of the respondents reported little to no trust in AI-generated code. This unexpectedly low level of trust indicates to us that there is a need to manage AI integration more thoughtfully. Teams must carefully evaluate AI’s role in their development workflow to mitigate the downsides.

    Based on these findings, we have three core recommendations:

    1. Enable your employees and reduce toil by orienting your AI adoption strategies towards empowering employees and alleviating the burden of undesirable tasks.

    2. Establish clear guidelines for the use of AI, address procedural concerns, and foster open communication about its impact.

    3. Encourage continuous exploration of AI tools, give employees dedicated time for experimentation, and promote trust through hands-on experience.

    Platform engineering: A paradigm shift

    Another emerging discipline our research focused on this year is platform engineering, which centers on building and operating internal development platforms to streamline processes and enhance efficiency.

    Our research identified four key findings regarding platform engineering:

    • Increased developer productivity: Internal development platforms effectively increase productivity for developers.

    • Prevalence in larger firms: These platforms are more commonly found in larger organizations, suggesting their suitability for managing complex development environments.

    • Potential performance dip: Implementing a platform engineering initiative might lead to a temporary decrease in performance before improvements manifest as the platform matures.

    • Need for user-centeredness and developer independence: For optimal results, platform engineering efforts should prioritize user-centered design, developer independence, and a product-oriented approach.

    A thoughtful approach that prioritizes user needs, empowers developers, and anticipates potential challenges is key to maximizing the benefits of platform engineering initiatives. 

    Developer experience: The cornerstone of success

    One of the key insights in last year’s report was that a healthy culture can help reduce burnout, increase productivity, and increase job satisfaction. This year was no different. Teams that cultivate a stable and supportive environment that empowers developers to excel drive positive outcomes. 

    A ‘move fast and constantly pivot’ mentality negatively impacts developer well-being and, consequently, overall performance. Instability in priorities, even with strong leadership, comprehensive documentation, and a user-centered approach (all known to be highly beneficial), can significantly hinder progress.

    Creating a work environment where your team feels supported, valued, and empowered to contribute is fundamental to achieving high performance. 

    How to use these findings to help your DevOps team

    The key takeaway from the decade of research is that software development success hinges not just on technical prowess but also on fostering a supportive culture, prioritizing user needs, and focusing on developer experience. We encourage you to replicate our findings within your specific context.

    Our findings can serve as hypotheses for your experiments and continuous improvement initiatives. Please share those with us and the DORA community, so that your efforts can become part of our collaborative learning environment.

    We work on this research in hopes that it serves as a roadmap for teams and organizations seeking to improve their practices and create a thriving environment for innovation, collaboration, and business success. We will continue our platform-agnostic research that focuses on the human aspect of technology for the next decade to come.

  60. Product Manager, Google Cloud Databases

    Thu, 10 Oct 2024 14:00:00 -0000

    Organizations are grappling with an explosion of operational data spread across an increasingly diverse and complex database landscape. This complexity often results in costly outages, performance bottlenecks, security vulnerabilities, and compliance gaps, hindering their ability to extract valuable insights and deliver exceptional customer experiences. To help businesses overcome these challenges, earlier this year, we announced the preview of Database Center, an AI-powered, unified fleet management solution.

    We’re seeing accelerated adoption for Database Center from many customers. For example, Ford uses Database Center to get answers on their database fleet health in seconds, and proactively mitigates potential risks to their applications. Today, we’re announcing that Database Center is now available to all customers, empowering you to monitor and operate database fleets at scale with a single, unified solution. We've also added support for Spanner, so you can manage it along with your Cloud SQL and AlloyDB deployments, with support for additional databases on the way.

    Database Center is designed to bring order to the chaos of your database fleet, and unlock the true potential of your data. It provides a single, intuitive interface where you can:

    • Gain a comprehensive view of your entire database fleet. No more silos of information or hunting through bespoke tools and spreadsheets.

    • Proactively de-risk your fleet with intelligent performance and security recommendations. Database Center provides actionable insights to help you stay ahead of potential problems, and helps improve performance, reduce costs and enhance security with data-driven suggestions.

    • Optimize your database fleet with AI-powered assistance. Use a natural-language chat interface to ask questions and quickly resolve fleet issues and get optimization recommendations.

    Let’s now review each in more detail.

    Gain a comprehensive view of your database fleet 

    Tired of juggling different tools and consoles to keep track of your databases?

    Database Center simplifies database management with a single, unified view of your entire database landscape. You can monitor database resources across your entire organization, spanning multiple engines, versions, regions, projects and environments (or applications using labels). 

    Cloud SQL, AlloyDB, and now Spanner are all fully integrated with Database Center, so you can monitor your inventory and proactively detect issues. Using the unified inventory view in Database Center, you can: 

    • Identify out-of-date database versions to ensure proper support and reliability

    • Track version upgrades, e.g., whether upgrades from PostgreSQL 14 to PostgreSQL 15 are progressing at the expected pace

    • Ensure database resources are appropriately distributed, e.g., identify the number of databases powering the critical production applications vs. non-critical dev/test environments

    • Monitor database migration from on-prem to cloud or across engines
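
    To make these inventory checks concrete, here is a minimal sketch of the kind of grouping involved, run over a hypothetical inventory list. The field names, the sample instances, and the minimum supported versions are illustrative assumptions and do not reflect Database Center's actual data model or API.

```python
from collections import Counter

# Hypothetical inventory export; field names are illustrative only.
inventory = [
    {"name": "orders", "engine": "PostgreSQL", "major": 14, "env": "prod"},
    {"name": "billing", "engine": "PostgreSQL", "major": 15, "env": "prod"},
    {"name": "sandbox", "engine": "PostgreSQL", "major": 12, "env": "dev"},
    {"name": "catalog", "engine": "MySQL", "major": 8, "env": "prod"},
]

# Assumed minimum supported major version per engine (set per your own policy).
min_supported = {"PostgreSQL": 13, "MySQL": 8}

# How many instances run each engine/version, and how many sit in each environment.
by_version = Counter((db["engine"], db["major"]) for db in inventory)
by_env = Counter(db["env"] for db in inventory)

# Instances running a version older than the assumed minimum.
out_of_date = [
    db["name"] for db in inventory if db["major"] < min_supported[db["engine"]]
]

# Upgrade progress: share of PostgreSQL instances already on version 15.
pg_total = sum(n for (engine, _), n in by_version.items() if engine == "PostgreSQL")
pg15_share = by_version[("PostgreSQL", 15)] / pg_total

print(out_of_date, dict(by_env), f"{pg15_share:.0%}")
```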

    Figure 1. Unified fleet view: Manage Cloud SQL, AlloyDB and Spanner resources with a unified view.

    Proactively de-risk your fleet with recommendations

    Managing your database fleet health at scale can involve navigating through a complex blend of security postures, data protection settings, resource configurations, performance tuning and cost optimizations. Database Center proactively detects issues associated with these configurations and guides you through addressing them. 

    For example, high transaction ID for a Cloud SQL instance can lead to the database no longer accepting new queries, potentially causing latency issues or even downtime. Database Center proactively detects this, provides an in-depth explanation, and walks you through prescriptive steps to troubleshoot the issue. 

    We’ve also added several performance recommendations to Database Center related to excessive tables/joins, connections, or logs, and it can guide you through a simple optimization journey.

    Figure 2. High transaction ID: End-to-end workflow for detecting and troubleshooting performance issues.

    Database Center also simplifies compliance management by automatically detecting and reporting violations across a wide range of industry standards, including CIS, PCI-DSS, SOC2, and HIPAA. Database Center continuously monitors your databases for potential compliance violations. When a violation is detected, you receive a clear explanation of the problem, including:

    • The specific security or reliability issue causing the violation 

    • Actionable steps to help address the issue and restore compliance

    This helps reduce the risk of costly penalties, simplifies compliance audits and strengthens your security posture. Database Center now also supports real-time detection of unauthorized access, updates, and data exports.

    Figure 3. Compliance: Database Center helps ensure compliance with HIPAA standards.

    Optimize your fleet with AI-powered assistance

    With Gemini enabled, Database Center makes optimizing your database fleet incredibly intuitive. Simply chat with the AI-powered interface to get precise answers, uncover issues within your database fleet, troubleshoot problems, and quickly implement solutions. For example, you can quickly identify under-provisioned instances across your entire fleet, access actionable insights such as the duration of high CPU/Memory utilization conditions, receive recommendations for optimal CPU/memory configurations, and learn about the associated cost of those adjustments. 

    AI-powered chat in Database Center provides comprehensive information and recommendations across all aspects of database management, including inventory, performance, availability and data protection. Additionally, AI-powered cost recommendations suggest ways for optimizing your spend, and advanced security and compliance recommendations help strengthen your security and compliance posture.

    Figure 4. Chat: AI-powered chat to identify data protection issues and optimize cost.

    Get started with Database Center today

    The new capabilities of Database Center are available in preview today for Spanner, Cloud SQL, and AlloyDB for all customers. Simply access Database Center within the Google Cloud console and begin monitoring and managing your entire database fleet. To learn more about Database Center’s capabilities, check out the documentation.

  61. Product Manager, Google Cloud

    Tue, 08 Oct 2024 16:00:00 -0000

    Editor's note: Starting February 4, 2025, pipe syntax will be available to all BigQuery users by default.


    Log data has become an invaluable resource for organizations seeking to understand application behavior, optimize performance, strengthen security, and enhance user experiences. But the sheer volume and complexity of logs generated by modern applications can feel overwhelming. How do you extract meaningful insights from this sea of data?

    At Google Cloud, we’re committed to providing you with the most powerful and intuitive tools to unlock the full potential of your log data. That's why we're thrilled to announce a series of innovations in BigQuery and Cloud Logging designed to revolutionize the way you manage, analyze, and derive value from your logs.

    BigQuery pipe syntax: Reimagine SQL for log data

    Say goodbye to the days of deciphering complex, nested SQL queries. BigQuery pipe syntax ushers in a new era of SQL, specifically designed with the semi-structured nature of log data in mind. BigQuery’s pipe syntax introduces an intuitive, top-down syntax that mirrors how you naturally approach data transformations. As demonstrated in the recent research by Google, this approach leads to significant improvements in query readability and writability. By visually separating different stages of a query with the pipe symbol (|>), it becomes remarkably easy to understand the logical flow of data transformation. Each step is clear, concise, and self-contained, making your queries more approachable for both you and your team.

    BigQuery’s pipe syntax isn’t just about cleaner SQL — it’s about unlocking a more intuitive and efficient way to work with your data. Instead of wrestling with code, experience faster insights, improved collaboration, and more time spent extracting value.

    This streamlined approach is especially powerful when it comes to the world of log analysis. 

    With log analysis, exploration is key. Log analysis is rarely a straight line from question to answer. Analyzing logs often means sifting through mountains of data to find specific events or patterns. You explore, you discover, and you refine your approach as you go. Pipe syntax embraces this iterative approach. You can smoothly chain together filters (WHERE), aggregations (COUNT), and sorting (ORDER BY) to extract those golden insights. You can also add or remove steps in your data processing as you uncover new insights, easily adjusting your analysis on the fly.

    Imagine you want to count the total number of users who were affected by the same errors more than 100 times in the month of January. As shown below, the pipe syntax’s linear structure clearly shows the data flowing through each transformation: starting from the table, filtering by the dates, counting by user id and error type, filtering for errors >100, and finally counting the number of users affected by the same errors.

```sql
-- Pipe syntax
-- The half-open date range keeps all of January 31 in scope.
FROM log_table
|> WHERE datetime >= DATETIME '2024-01-01' AND datetime < DATETIME '2024-02-01'
|> AGGREGATE COUNT(log_id) AS error_count GROUP BY user_id, error_type
|> WHERE error_count > 100
|> AGGREGATE COUNT(user_id) AS user_count GROUP BY error_type
```

    The same example in the standard syntax typically requires a subquery and a nonlinear structure.

```sql
-- Standard syntax
SELECT error_type, COUNT(user_id) AS user_count
FROM (
  SELECT user_id, error_type,
    COUNT(log_id) AS error_count
  FROM log_table
  -- The half-open date range keeps all of January 31 in scope.
  WHERE datetime >= DATETIME '2024-01-01' AND datetime < DATETIME '2024-02-01'
  GROUP BY user_id, error_type
)
-- The filter on the subquery result must come before GROUP BY.
WHERE error_count > 100
GROUP BY error_type;
```
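
    For readers who want to check the logic outside BigQuery, the following sketch walks through the same four stages in plain Python over in-memory rows. The sample rows, the function names, and the half-open January date range are assumptions made for this illustration; it mirrors the shape of the query rather than how BigQuery executes it.

```python
from collections import Counter
from datetime import datetime


def make_rows():
    """Build hypothetical log rows: (log_id, user_id, error_type, timestamp)."""
    rows, log_id = [], 0
    for user, error, n in [("u1", "TIMEOUT", 150), ("u2", "TIMEOUT", 101),
                           ("u3", "TIMEOUT", 5), ("u1", "AUTH", 120)]:
        for _ in range(n):
            log_id += 1
            rows.append((log_id, user, error, datetime(2024, 1, 15, 12, 0)))
    # A February row that the date filter should drop.
    rows.append((log_id + 1, "u3", "AUTH", datetime(2024, 2, 1, 0, 0)))
    return rows


def users_per_error(rows, start, end, threshold=100):
    # Stage 1: keep rows whose timestamp falls in [start, end).
    in_range = (r for r in rows if start <= r[3] < end)
    # Stage 2: count log entries per (user_id, error_type).
    error_count = Counter((user, error) for _, user, error, _ in in_range)
    # Stage 3: keep only pairs seen more than `threshold` times.
    heavy = [key for key, n in error_count.items() if n > threshold]
    # Stage 4: count affected users per error_type.
    return dict(Counter(error for _, error in heavy))


result = users_per_error(make_rows(), datetime(2024, 1, 1), datetime(2024, 2, 1))
print(result)
```

    Each stage consumes only the output of the previous one, which is the same top-to-bottom flow that the pipe operator expresses.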

    Carrefour: A customer's perspective

    The impact of these advancements is already being felt by our customers. Here's what Carrefour, a global leader in retail, had to say about their experience with pipe syntax:

     "Pipe syntax has been a very refreshing addition to BigQuery. We started using it to dig into our audit logs, where we often use Common Table Expressions (CTEs) and aggregations. With pipe syntax, we can filter and aggregate data on the fly by just adding more pipes to the same query. This iterative approach is very intuitive and natural to read and write. We are now using it for our analysis work in every business domain. We will have a hard time going back to the old SQL syntax now!" - Axel Thevenot, Lead Data Engineer, and Guillaume Blaquiere, Data Architect, Carrefour

    BigQuery pipe syntax is currently available for all BigQuery users. You can check out this introductory video.

    Beyond syntax: performance and flexibility

    But we haven't stopped at simplifying your code. BigQuery now offers enhanced performance and powerful JSON handling capabilities to further accelerate your log analytics workflows. Given the prevalence of JSON data in logs, we expect these changes to simplify log analytics for a majority of users.

    • Enhanced Point Lookups: Pinpoint critical events in massive datasets quickly using BigQuery's numeric search indexes, which dramatically accelerate queries that filter on timestamps and unique IDs. Here is a sample improvement from the announcement blog:

    Metrics                Without Index        With Index     Improvement
    Execution Time (ms)    48,790               4,664          10x
    Processed Bytes        2,174,758,158,336    774,897,664    2,806x
    Slot Usage (ms)        25,735,222           7,300          3,525x

    • Powerful JSON Analysis: Parse and analyze your JSON-formatted log data with ease using BigQuery's JSON_KEYS function and JSONPath traversal feature. Extract specific fields, filter on nested values, and navigate complex JSON structures without breaking a sweat.

      • JSON_KEYS extracts unique JSON keys from JSON data for easier schema exploration and discoverability 

    Query                                                             Results
    JSON_KEYS(JSON '{"a":{"b":1}}')                                   ["a", "a.b"]
    JSON_KEYS(JSON '{"a":[{"b":1}, {"c":2}]}', mode => "lax")         ["a", "a.b", "a.c"]
    JSON_KEYS(JSON '[[{"a":1},{"b":2}]]', mode => "lax recursive")    ["a", "b"]

      • JSONPath with LAX modes lets you easily fetch JSON arrays without having to use verbose UNNEST. The example below shows how to fetch all phone numbers from the person field, before and after:

```sql
-- Consider a JSON field 'person' such as:
-- [{
--   "name": "Bob",
--   "phone": [{"type": "home", "number": 20}, {"number": 30}]
-- }]

-- Previously, to fetch all phone numbers from the 'person' column:
SELECT phone.number
FROM (
  SELECT IF(JSON_TYPE(person.phone) = "array", JSON_QUERY_ARRAY(person.phone), [person.phone]) AS nested_phone
  FROM (
    SELECT IF(JSON_TYPE(person) = "array", JSON_QUERY_ARRAY(person), [person]) AS nested_person
    FROM t
  ), UNNEST(nested_person) person
), UNNEST(nested_phone) phone;

-- With lax mode:
SELECT JSON_QUERY(person, "lax recursive $.phone.number") FROM t;
```
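
    To build intuition for what the lax recursive path saves you from writing, the sketch below approximates the idea in plain Python: at every step of the path, arrays are unwrapped automatically, however deeply they are nested. The function `lax_recursive` is a hypothetical helper written for this illustration, not BigQuery's implementation, and the exact form of BigQuery's output may differ.

```python
import json


def lax_recursive(value, keys):
    """Follow `keys` through `value`, descending into arrays (at any depth)
    at every step, and collect all matching values."""
    if isinstance(value, list):
        matches = []
        for item in value:
            matches.extend(lax_recursive(item, keys))
        return matches
    if not keys:
        return [value]
    if isinstance(value, dict) and keys[0] in value:
        return lax_recursive(value[keys[0]], keys[1:])
    # Missing key or scalar where an object was expected: no match.
    return []


person = json.loads(
    '[{"name": "Bob", "phone": [{"type": "home", "number": 20}, {"number": 30}]}]'
)
numbers = lax_recursive(person, ["phone", "number"])
print(numbers)
```

    The same call works whether `person` or `phone` is a single object or an array of objects, which is the case analysis the earlier query had to spell out with `IF` and `UNNEST`.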

    Log Analytics in Cloud Logging: Bringing it all together

    Log Analytics in Cloud Logging is built on top of BigQuery and provides a UI that’s purpose-built for log analysis. With an integrated date/time picker, charting and dashboarding, Log Analytics makes use of the JSON capabilities to support advanced queries and analyze logs faster. To seamlessly integrate these powerful capabilities into your log management workflow, we're also enhancing Log Analytics (in Cloud Logging) with pipe syntax. You can now analyze your logs within Log Analytics leveraging the full power of BigQuery pipe syntax, enhanced lookups, and JSON handling, all within a unified platform.

    Use of pipe syntax in Log Analytics (Cloud Logging) is now available in preview.

    Unlock the future of log analytics today

    BigQuery and Cloud Logging provide an unmatched solution for managing, analyzing, and extracting actionable insights from your log data. Explore these new capabilities today.

    Start your journey towards more insightful and efficient log analytics in the cloud with BigQuery and Cloud Logging. Your data holds the answers — we're here to help you find them.

  62. Chief Evangelist, Google Cloud

    Fri, 04 Oct 2024 17:00:00 -0000

    As AI adoption speeds up, one thing is becoming clear: the developer platforms that got you this far won’t get you to the next stage. While yesterday’s platforms were awesome, let’s face it, they weren’t built for today’s AI-infused application development and deployment. And organizations are quickly realizing they need to update their platform strategies to ensure that developers — and the wider set of folks using AI — have what they need for the years ahead.

    In fact, as I explore in a new paper, nine out of ten decision makers are prioritizing the task of optimizing workloads for AI over the next 12 months. Problem is, given the pace of change lately, many don’t know where to start or what they need when it comes to modernizing their developer platforms.

    What follows is a quick look at the key steps involved in planning your platform strategy. For all the details, download my full guide, Three pillars of a modern, AI-ready platform.

    Step 1. Define your platform’s purpose

    Whether you’re building your first platform or your fiftieth, you need to start by asking, “Why?” After all, a new platform is another asset to maintain and operate, so you need to make sure it exists for the right reasons.

    To build your case, ask yourself three questions:

    • Who is the platform for? Your platform’s customers, or users, can include developers, architects, product teams, SREs and Ops personnel, data scientists, security teams, and platform owners. Each has different needs, and your platform will need to be tailored accordingly.
    • What are its goals? Work out what problems you’re trying to solve. For example, are you optimizing for AI? Striving to speed up software delivery? Increasing developer productivity? Improving scale or security? Again, different goals will lead you down different paths for your platform — so map them out right from the start.
    • How will you measure success? To prove the worth of your platform, and to convince stakeholders to invest in its ongoing maintenance, establish metrics from the outset, and keep on measuring them! These could range from improved customer satisfaction to faster time-to-resolution for support issues. 

    Step 2. Assemble the pieces of your platform

    Now that you’re clear on the customers, goals, and performance metrics of the platform you need, it’s time to actually build the thing. Here’s a glance at the key components of a modern, AI-ready platform — complete with the capabilities developers need to hit the ground running when developing AI-powered solutions.

    Figure: Key components of a modern, AI-ready platform.

    For a detailed breakdown of what to consider in each area of your platform, including a list of technology options for each category, head over to the full paper.

    Step 3. Establish a process for improving your platform

    The journey doesn’t end once your platform’s built. In fact, it’s just beginning. A platform is never “done;” it’s just released. As such, you need to adopt a continuous improvement mindset and assign a core platform team the task of finding new ways to introduce value to stakeholders.

    At this stage, my top tip is to treat your platform like a product, applying platform engineering principles to keep making it faster, cheaper, and easier to deliver software. Oh, and to leverage the latest in AI-driven optimization tools to monitor and maintain your platform over time!  

    Ready to start your platform journey?

    Organizations embark on platform overhauls for a whole bunch of reasons. Some do it to better cope with forecasted growth. Others have AI adoption in their sights. Then there are those driven by cost, performance, or the user experience. Whatever your reason for getting started, I encourage you to read the full paper on building a modern AI-ready platform — your developers (and the business) will thank you.

  63. Technical Program Management

    Fri, 27 Sep 2024 16:00:00 -0000

    You’ve probably felt the frustration that arises when a project fails to meet established deadlines. And perhaps you’ve also encountered scenarios where project staff or computing resources have been reallocated to higher priority projects. It can be super challenging to get projects done on time with this kind of uncertainty.

    That’s especially true for Site Reliability Engineering (SRE) teams. Project management principles can help, but in IT, many project management frameworks are directed at teams that have a single focus, such as a software-development team. 

    That’s not true for SRE teams at Google. They are charged with delivering infrastructure projects as well as their primary role: supporting production. Broadly speaking, SRE time is divided in half between supporting production environments and focusing on product. 

    A common problem

    In a recent endeavor, our SRE team took on a project to regionalize our infrastructure to enhance the reliability, security, and compliance of our cloud services. The project was allocated a well-defined timeline, driven by our commitments to our customers and adherence to local regulations. As the technical program manager (TPM), I decomposed the overarching goal into smaller milestones and communicated to the leadership team to ensure they remained abreast of the progress.

    However, throughout the execution phase of the project, we encountered a multitude of unrelated production incidents — the Spanner queue was growing long, and the accumulation of messages led to increased compilation times for our developer builds; this in turn led to bad builds rolling out. On top of this, asynchronous tasks were not completing as expected. When the bad build was rolled back, all of the backlogged async tasks fired at once. Due to these unforeseen challenges, some engineers were temporarily reassigned from the regionalization project to handle operational constraints associated with production infrastructure. No surprise, the change in staff allocation towards production incidents resulted in the project work being delayed. 

    Better planning with SRE

    Teams that manage production services, like SRE, have many ways to solve tough problems. The secret is to choose the solution that gets the job done the fastest and with the least amount of red tape for engineers to deal with.

    In our organization, we’ve started taking a proactive approach to problem-solving by incorporating enhanced planning at the project's inception. As a TPM, my biggest trick to ensuring projects are finished on time is keeping some engineering hours in reserve and planning carefully when the project should start.

    How many resources should you hold back, exactly? We did a deep dive into our past production issues and how we've been using our resources. Based on this, when planning SRE projects, we set aside 25% of our time for production work. Of course, this 25% buffer number will differ across organizations, but this new approach, which takes into account our critical business needs, has been a game-changer for us in making sure our projects are delivered on time, while ensuring that SREs can still focus on production incidents — our top priority for the business.
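
    The arithmetic behind a production reserve can be sketched in a few lines. This is only an illustration of the idea above: the ProjectPlan class, the team size, and the hour figures are hypothetical, and the 25% value is simply the buffer quoted in this post, which will differ across organizations.

```python
from dataclasses import dataclass


@dataclass
class ProjectPlan:
    """Hypothetical planning helper; all field names are illustrative."""
    engineers: int              # people assigned to the project
    hours_per_week: float       # working hours per engineer per week
    production_reserve: float   # fraction of time held back for production work

    def project_hours_per_week(self) -> float:
        """Hours per week that can be committed to project work."""
        total = self.engineers * self.hours_per_week
        return total * (1.0 - self.production_reserve)

    def weeks_needed(self, estimated_project_hours: float) -> float:
        """Calendar weeks required once the reserve is taken into account."""
        return estimated_project_hours / self.project_hours_per_week()


# Assumed example: 4 engineers, 40-hour weeks, 25% held back for production.
plan = ProjectPlan(engineers=4, hours_per_week=40, production_reserve=0.25)
print(plan.project_hours_per_week())  # 120.0
print(plan.weeks_needed(960))         # 8.0
```

    Without the reserve, the same 960 hours would be planned as six weeks, which is the kind of schedule that slips as soon as production incidents pull engineers away.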

    Key takeaways

    In a nutshell, planning for SRE projects is different from planning for projects in development organizations, because development organizations spend the lion’s share of their time working on projects. Luckily, SRE Program Management is really good at handling complicated situations, especially big programs. 

    Beyond holding back resources, here are a few other best practices and structures that TPMs employ when planning SRE projects:

    • Ensuring that critical programs are staffed for success

    • Providing opportunities for TPMs to work across services, cross pollinating with standardized solutions and avoiding duplication of work

    • Providing more education to Site Reliability Managers and SREs on the value of early TPM engagement, and encouraging services to surface problem statements earlier

    • Leveraging the skills of TPMs to manage external dependencies and interface with other partner organizations such as Engineering, Infrastructure Change Management, and Technical Infrastructure

    • Providing coverage at times of need for services with otherwise low program management demands

    • Enabling consistent performance evaluation and providing opportunities for career development for the TPM community

    The TPM role within SRE is at the heart of fulfilling SRE’s mission: making workflows faster, more reliable, and preparing for the continued growth of Google's infrastructure. As a TPM, you need to ensure that systems and services are carefully planned and deployed, taking into account multiple variables such as price, availability, and scheduling, while always keeping the bigger picture in mind. To learn more about project management for TPMs and related roles, consider enrolling in this course, and check out the following resources:

    1. Program Management Practices

    2. The Evolving SRE Engagement Model

    3. Part III. Practices

  64. AI/ML Customer Engineer, UKI, Google Cloud

    Fri, 30 Aug 2024 16:00:00 -0000

    Who is supposed to manage generative AI applications? While AI-related ownership often lands with data teams, we're seeing requirements specific to generative AI applications that have distinct differences from those of a data and AI team, and at times more similarities with a DevOps team. This blog post explores these similarities and differences, and considers the need for a new ‘GenOps’ team to cater for the unique characteristics of generative AI applications.

    In contrast to data science which is about creating models from data, Generative AI relates to creating AI enabled services from models and is concerned with the integration of pre-existing data, models and APIs. When viewed this way, Generative AI can feel similar to a traditional microservices environment: multiple discrete, decoupled and interoperable services consumed via APIs. And if there are similarities with the landscape, then it is logical that they share common operational requirements. So what practices can we take from the world of microservices and DevOps and bring to the new world of GenOps? 

    What are we operationalising? The AI agent vs the microservice

    How do the operational requirements of a generative AI application differ from other applications? With traditional applications, the unit of operationalisation is the microservice: a discrete, functional unit of code, packaged up into a container and deployed into a container-native runtime such as Kubernetes. For generative AI applications, the comparative unit is the generative AI agent: also a discrete, functional unit of code defined to handle a specific task, but with some additional constituent components that make it more than ‘just’ a microservice and give it its key differentiating behavior of being non-deterministic in terms of both its processing and its output:

    1. Reasoning loop - The control logic defining what the agent does and how it works. It often includes iterative logic or thought chains to break down an initial task into a series of model-powered steps that work towards the completion of a task. 

    2. Model definitions - One or a set of defined access patterns for communicating with models, readable and usable by the Reasoning Loop.

    3. Tool definitions - A set of defined access patterns for other services external to the agent, such as other agents, data access (RAG) flows, and external APIs. These should be shared across agents and exposed through APIs, so a Tool definition will take the form of a machine-readable standard such as an OpenAPI specification.

    Logical components of a generative AI agent

    The Reasoning Loop is essentially the full scope of a microservice, and the model and Tool definitions are its additional powers that make it into something more. Importantly, although the Reasoning Loop logic is just code and therefore deterministic in nature, it is driven by the responses from non-deterministic AI models, and this non-deterministic nature is what provides the need for the Tool, as the agent ‘chooses for itself’ which external service should be used to fulfill a task. A fully deterministic microservice has no need for this ‘cookbook’ of Tools for it to select from: Its calls to external services are pre-determined and hard coded into the Reasoning Loop.
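
    To make the three components concrete, here is a minimal sketch of an agent. It is an illustration under stated assumptions, not a reference design: the Tool and Agent classes, the "FINAL:" and "tool: input" reply conventions, and the fake_model stand-in are all hypothetical, and a real agent would call an actual model endpoint instead.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Tool:
    """Tool definition: a named access pattern for an external service."""
    name: str
    description: str
    call: Callable[[str], str]


@dataclass
class Agent:
    model: Callable[[str], str]   # model definition (assumed: text in, text out)
    tools: Dict[str, Tool]        # the 'cookbook' of Tools the agent may choose from
    max_steps: int = 3

    def run(self, task: str) -> str:
        """Reasoning loop: deterministic code driven by model responses."""
        context = task
        for _ in range(self.max_steps):
            decision = self.model(context)
            if decision.startswith("FINAL:"):
                return decision[len("FINAL:"):].strip()
            # Assumed convention: the model replies "<tool name>: <tool input>".
            tool_name, _, tool_input = decision.partition(":")
            tool = self.tools.get(tool_name.strip())
            if tool is None:
                return f"unknown tool: {tool_name.strip()}"
            context = f"{task}\nobservation: {tool.call(tool_input.strip())}"
        return "step limit reached"


def fake_model(context: str) -> str:
    """Hypothetical stand-in for a model: asks for a lookup, then answers."""
    if "observation:" in context:
        return "FINAL: " + context.split("observation:")[1].strip()
    return "lookup: order 42"


tools = {"lookup": Tool("lookup", "Fetch order status", lambda q: f"{q} is shipped")}
agent = Agent(model=fake_model, tools=tools)
print(agent.run("Where is order 42?"))  # order 42 is shipped
```

    The loop itself never decides which Tool to call; it only executes whatever the model's reply selects, which is where the non-determinism enters.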

    However there are still many similarities. Just like a microservice, an agent:

    • Is a discrete unit of function that should be shared across multiple apps/users/teams in a multi-tenancy pattern

    • Has a lot of flexibility with development approaches, a wide range of software languages are available to use, and any one agent can be built in a different way to another.

    • Has very low inter-dependency from one agent to another: development lifecycles are decoupled with independent CI/CD pipelines for each. The upgrade of one agent should not affect another agent.

    Feature                         | Microservice                                   | Agent
    Output                          | Deterministic                                  | Non-deterministic
    Scope                           | Single unit of discrete deterministic function | Single unit of discrete non-deterministic function
    Latency                         | Lower                                          | Higher
    Cost                            | Lower                                          | Higher
    Transparency / Explainability   | High                                           | Low
    Development flexibility         | High                                           | High
    Development inter-dependence    | None                                           | None
    Upgrade inter-dependence        | None                                           | None

    Operational platforms and separation of responsibilities

    Another important difference is service discovery. This is a solved problem in the world of microservices, where the impracticality of having each microservice track the availability, whereabouts and networking considerations of its peers was taken out of the microservice itself and handled by packaging microservices into containers and deploying them onto a common platform layer of Kubernetes and Istio. With generative AI agents, this consolidation onto a standard deployment unit has not yet happened. There is a range of ways to build and deploy a generative AI agent, from code-first DIY approaches through to no-code managed agent builder environments. I am not against these tools in principle; however, they are creating a more heterogeneous deployment landscape than we have today with microservices applications, and I expect this will create future operational complexities.

    To deal with this, at least for now, we need to move away from the Point-to-Point model seen in microservices and adopt a Hub-and-Spoke model, where the discoverability of agents, Tools and models is done via the publication of APIs onto an API Gateway that provides a consistent abstraction layer above this inconsistent landscape.
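
    A Hub-and-Spoke model can be pictured as a single registry that every agent, Tool and model is published to and discovered through. The sketch below is a toy in-process stand-in for an API Gateway; the ApiGateway class, its method names, and the paths are hypothetical and do not correspond to any particular gateway product.

```python
from typing import Callable, Dict, List

Handler = Callable[[str], str]


class ApiGateway:
    """Hub: the single place where agents, Tools and models are published."""

    def __init__(self) -> None:
        self._routes: Dict[str, Handler] = {}

    def publish(self, path: str, handler: Handler) -> None:
        """Register a spoke (agent, Tool or model) under a stable path."""
        self._routes[path] = handler

    def discover(self, prefix: str = "") -> List[str]:
        """List published paths, optionally filtered by prefix."""
        return sorted(p for p in self._routes if p.startswith(prefix))

    def call(self, path: str, payload: str) -> str:
        """Route a request to whatever is published at the path."""
        if path not in self._routes:
            raise KeyError(f"nothing published at {path}")
        return self._routes[path](payload)


gateway = ApiGateway()
# Hypothetical spokes; how each one is built or hosted is hidden behind the hub.
gateway.publish("/models/summarizer", lambda text: text[:20])
gateway.publish("/tools/order-lookup", lambda order_id: f"order {order_id}: shipped")

print(gateway.discover("/tools"))                  # ['/tools/order-lookup']
print(gateway.call("/tools/order-lookup", "42"))   # order 42: shipped
```

    Callers only ever know the hub and a path, so a spoke can be rebuilt with a different framework without any consumer changing.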

    This brings the additional benefit of clear separation of responsibilities between the apps and agents built by development teams, and Generative AI specific components such as models and Tools:

    Separating responsibilities with an API Gateway

    All operational platforms should create a clear point of separation between the roles and responsibilities of app and microservice development teams from the responsibilities of the operational teams. With microservice based applications, responsibilities are handed over at the point of deployment, and focus switches to non-functional requirements such as reliability, scalability, infrastructure efficiency, networking and security.

    Many of these requirements are still just as important for a generative AI app, and I believe there are some additional considerations specific to generative agents and apps which require specific operational tooling:

    1. Model compliance and approval controls
    There are a lot of models out there. Some are open-source, some are licensed. Some provide intellectual property indemnity, some do not. All have specific and complex usage terms that have large potential ramifications but take time and the right skillset to fully understand.

    It’s not reasonable or appropriate to expect our developers to have the time or knowledge to factor in these considerations during model selection. Instead, an organization should have a separate model review and approval process to determine whether usage terms are acceptable for further use, owned by legal and compliance teams, supported on a technical level by clear, governable and auditable approval/denial processes that cascade down into development environments.
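
    On the technical side, an approval process like this can cascade into development environments as a simple allowlist check with an audit trail. The following is a minimal sketch; the ModelRegistry class, the status values, and the model and team names are hypothetical placeholders.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class Status(Enum):
    APPROVED = "approved"
    DENIED = "denied"
    PENDING = "pending"


@dataclass(frozen=True)
class ModelRecord:
    name: str
    status: Status
    reviewed_by: str  # e.g. the legal or compliance reviewer


class ModelRegistry:
    """Hypothetical registry owned by legal and compliance teams."""

    def __init__(self, records: List[ModelRecord]) -> None:
        self._records = {r.name: r for r in records}
        self.audit_log: List[Tuple[str, str, bool]] = []

    def check(self, model_name: str, requester: str) -> bool:
        """Allow only explicitly approved models and record every decision."""
        record = self._records.get(model_name)
        allowed = record is not None and record.status is Status.APPROVED
        self.audit_log.append((requester, model_name, allowed))
        return allowed


registry = ModelRegistry([
    ModelRecord("model-a", Status.APPROVED, "legal"),   # placeholder names
    ModelRecord("model-b", Status.DENIED, "legal"),
])
print(registry.check("model-a", "team-x"))  # True
print(registry.check("model-c", "team-x"))  # False: never reviewed, so not allowed
```

    Unknown models are denied by default, so a model nobody has reviewed cannot slip into use.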

    2. Prompt version management
    Prompts need to be optimized for each model. Do we want our app teams focusing on prompt optimization, or on building great apps? Prompt management is a non-functional concern: prompts should be taken out of the app source code and managed centrally, where they can be optimized, periodically evaluated, and reused across apps and agents.
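
    As a rough illustration of central prompt management, the sketch below keeps versioned templates per use case and model outside the app code. The PromptRegistry class and its method names are hypothetical; it uses Python's standard string.Template for substitution and is not any product's API.

```python
from string import Template
from typing import Dict, List, Optional, Tuple


class PromptRegistry:
    """Hypothetical central store of versioned prompt templates."""

    def __init__(self) -> None:
        self._versions: Dict[Tuple[str, str], List[str]] = {}

    def publish(self, use_case: str, model: str, template: str) -> int:
        """Add a new template version and return its 1-based version number."""
        versions = self._versions.setdefault((use_case, model), [])
        versions.append(template)
        return len(versions)

    def render(self, use_case: str, model: str,
               version: Optional[int] = None, **values: str) -> str:
        """Fill in ('hydrate') the latest template, or a pinned version."""
        versions = self._versions.get((use_case, model))
        if not versions:
            raise KeyError(f"no prompt published for {use_case}/{model}")
        template = versions[-1] if version is None else versions[version - 1]
        return Template(template).substitute(**values)


prompts = PromptRegistry()
prompts.publish("summary", "model-a", "Summarize: $text")
prompts.publish("summary", "model-a", "Summarize in one sentence: $text")

print(prompts.render("summary", "model-a", text="hello"))
# Summarize in one sentence: hello
print(prompts.render("summary", "model-a", version=1, text="hello"))
# Summarize: hello
```

    Apps ask for a use case and model rather than embedding prompt text, so a prompt can be improved or rolled back without redeploying the app.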

    3. Model (and prompt) evaluation
    Just like an MLOps platform, there is clearly a need for ongoing assessments of model response quality to enable a data-driven approach to evaluating and selecting the best models for a particular use case. The key difference with Gen AI models is that the assessment is inherently more qualitative than the quantitative analysis of skew or drift detection used for a traditional ML model.

    Subjective, qualitative assessments performed by humans are clearly not scalable, and introduce inconsistency when performed by multiple people. Instead, we need consistent automated pipelines powered by AI evaluators, which although imperfect, will provide consistency in the assessments and a baseline to compare models against each other.
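
    The shape of such a pipeline can be sketched as a pairwise comparison in which the same evaluator judges every prompt. Everything here is illustrative: compare_models is a hypothetical helper, and length_judge is a deliberately crude placeholder standing in for an AI evaluator so that the example runs without any model access.

```python
from collections import Counter
from typing import Callable, Iterable

Model = Callable[[str], str]
Judge = Callable[[str, str, str], str]  # (prompt, response_a, response_b) -> "A", "B" or "TIE"


def compare_models(prompts: Iterable[str], model_a: Model,
                   model_b: Model, judge: Judge) -> Counter:
    """Run both models on every prompt and tally the judge's verdicts."""
    tally: Counter = Counter()
    for prompt in prompts:
        verdict = judge(prompt, model_a(prompt), model_b(prompt))
        tally[verdict] += 1
    return tally


def length_judge(prompt: str, a: str, b: str) -> str:
    """Placeholder evaluator (an assumption): prefers the shorter response."""
    if len(a) == len(b):
        return "TIE"
    return "A" if len(a) < len(b) else "B"


# Hypothetical stand-ins for two models.
model_a = lambda p: p.upper()
model_b = lambda p: p[:3]

result = compare_models(["hi", "hello"], model_a, model_b, length_judge)
print(result["TIE"], result["B"], result["A"])  # 1 1 0
```

    The same function also covers comparing two prompts for one model: pass the same model twice, each wrapped with a different prompt template.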

    4. Model security gateway
    The single most common operational feature I hear large enterprises investing time into is a security proxy for safety checks before passing a prompt on to a model (as well as the reverse: a check against the generated response before passing back to the client).

    Common considerations:

    1. Prompt Injection attacks and other threats captured by OWASP Top 10 for LLMs

    2. Harmful / unethical prompts

    3. Customer PII or other data requiring redaction prior to sending on to the model and other downstream systems

    Some models have built in security controls; however this creates inconsistency and increased complexity. Instead a model agnostic security endpoint abstracted above all models is required to create consistency and allow for easier model switching.
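
    A model-agnostic security endpoint can be pictured as a wrapper that screens the prompt on the way in and the response on the way out. The sketch below is a toy: the marker phrases, the email-only redaction pattern, and the function names are assumptions for illustration, and simple keyword matching is nowhere near sufficient protection against real prompt injection.

```python
import re
from typing import Callable

# Assumed, deliberately simple patterns; real checks would be far broader.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
INJECTION_MARKERS = ("ignore previous instructions", "disregard the system prompt")


class BlockedPrompt(Exception):
    """Raised when a prompt fails the safety screen."""


def screen_prompt(prompt: str) -> str:
    """Reject suspicious prompts and redact email addresses from the rest."""
    lowered = prompt.lower()
    for marker in INJECTION_MARKERS:
        if marker in lowered:
            raise BlockedPrompt(marker)
    return EMAIL.sub("[REDACTED_EMAIL]", prompt)


def guarded_call(model: Callable[[str], str], prompt: str) -> str:
    """Apply the same checks in front of any model, then screen its response."""
    response = model(screen_prompt(prompt))
    return EMAIL.sub("[REDACTED_EMAIL]", response)


print(screen_prompt("Contact jane.doe@example.com please"))
# Contact [REDACTED_EMAIL] please
```

    Because guarded_call takes the model as a parameter, switching models does not change which checks are applied.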

    5. Centralized Tool management
    Finally, the Tools available to the agent should be abstracted out from the agent to allow for reuse and centralized governance. This is the right separation of responsibilities especially when involving data retrieval patterns where access to data needs to be controlled.

    RAG patterns have the potential to become numerous and complex. In practice they are often not particularly robust or well maintained, and can cause significant technical debt, so central control is important to keep data access patterns as clean and visible as possible.

    Outside of these specific considerations, a prerequisite already discussed is the need for the API Gateway itself to create consistency and abstraction above these Generative AI specific services. When used to their fullest, API Gateways can act as much more than a simple API endpoint: they can be a coordination and packaging point for a series of interim API calls and logic, security features and usage monitoring.

    For example, a published API for sending a request to a model can be the starting point for a multi-step process:

    • Retrieving and ‘hydrating’ the optimal prompt template for that use case and model

    • Running security checks through the model safety service

    • Sending the request to the model

    • Persisting prompt, response and other information for use in operational processes such as model and prompt evaluation pipelines.
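
    The four steps above can be sketched as a single request handler. This is a simplified illustration, assuming the prompt templates, safety check, model, and storage are all injected as plain Python objects; handle_request and its parameter names are hypothetical, not a gateway API.

```python
from string import Template
from typing import Callable, Dict, List


def handle_request(use_case: str, user_input: str, *,
                   templates: Dict[str, str],
                   safety_check: Callable[[str], None],
                   model: Callable[[str], str],
                   store: List[dict]) -> str:
    """One published API call that fans out into the multi-step process."""
    # 1. Retrieve and hydrate the prompt template for this use case.
    prompt = Template(templates[use_case]).substitute(input=user_input)
    # 2. Run security checks; the check is assumed to raise on failure.
    safety_check(prompt)
    # 3. Send the request to the model.
    response = model(prompt)
    # 4. Persist prompt and response for later evaluation pipelines.
    store.append({"use_case": use_case, "prompt": prompt, "response": response})
    return response


def demo_safety_check(prompt: str) -> None:
    """Hypothetical check: rejects prompts containing a forbidden word."""
    if "forbidden" in prompt:
        raise ValueError("prompt rejected by safety check")


records: List[dict] = []
answer = handle_request(
    "summary", "hello",
    templates={"summary": "Summarize: $input"},
    safety_check=demo_safety_check,
    model=lambda p: p.upper(),   # stand-in for a real model call
    store=records,
)
print(answer)                # SUMMARIZE: HELLO
print(records[0]["prompt"])  # Summarize: hello
```

    A rejected prompt raises before step 3, so nothing is sent to the model and nothing is persisted.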

    Key components of a GenOps platform

    Making GenOps a reality with Google Cloud

    For each of the considerations above, Google Cloud provides unique and differentiating managed services offerings to support with evaluating, deploying, securing and upgrading Generative AI applications and agents:

    • Model compliance and approval controls - Google Cloud’s Model Garden is the central model library for over 150 of Google first-party models, partner models, or open source models, with thousands more available via the direct integration with Hugging Face.
    • Model security - The newly announced Model Armor, expected to be in preview in Q3, enables inspection, routing and protection of foundation model prompts and responses. It can help with mitigating risks such as prompt injections, jailbreaks, toxic content and sensitive data leakage.
    • Prompt version management - Upcoming prompt management capabilities were announced at Google Cloud Next ‘24 that include centralized version controlling, templating, branching and sharing of prompts. We also showcased AI prompt assistance capabilities to critique and automatically re-write prompts.
    • Model (and prompt) evaluation - Google Cloud’s model evaluation services provide automatic evaluations for a wide range of metrics across prompts and responses, enabling extensible evaluation patterns such as evaluating the responses from two models for a given input, or the responses from two different prompts for the same model.
    • Centralized Tool management - A comprehensive suite of managed services is available to support Tool creation. A few to call out are the Document AI Layout Parser for intelligent document chunking, the multimodal embeddings API, and Vertex AI Vector Search. I specifically want to highlight Vertex AI Search: a fully managed, end-to-end OOTB RAG service, handling all the complexities from parsing and chunking documents to creating and storing embeddings.

    As for the API Gateway, Google Cloud’s Apigee allows for publishing and exposure of models and Tools as API Proxies which can encompass multiple downstream API calls, as well as include conditional logic, retries, and tooling for security, usage monitoring and cross charging.

    GenOps with Google Cloud

    Regardless of size, for any organization to be successful with generative AI, they will need to ensure their generative AI application’s unique characteristics and requirements are well managed, and hence an operational platform engineered to cater for these characteristics and requirements is clearly required. I hope the points discussed in this blog make for helpful consideration as we all navigate through this exciting and highly impactful new era of technology.

    If you are interested in learning more, reach out to your Google Cloud account team if you have one, or feel free to contact me directly.

  65. Software Engineer, Google

    Mon, 26 Aug 2024 16:00:00 -0000

    The Terraform Google Provider v6.0.0 is now GA. Since the last major Terraform provider release in September 2023, the combined HashiCorp/Google provider team has been listening closely to the community's feedback. Discussed below are the primary enhancements and bug fixes that this major release focuses on. Support for earlier versions of HashiCorp Terraform will not change as a result of the major version release v6.0.0.

    Terraform Google Provider Highlights 

    The key notable changes are as follows: 

    • Opt-out default label “goog-terraform-provisioned”

    • Deletion protection fields added to multiple resources

    • Allowed reducing the suffix length in “name_prefix” for multiple resources

    Opt-out default label “goog-terraform-provisioned”

    As a follow-up to the addition of provider-level default labels in 5.16.0, the 6.0.0 major release includes an opt-out default label “goog-terraform-provisioned”. This provider-level label “goog-terraform-provisioned” will be added to applicable resources to identify resources that were created by Terraform. This default label will only apply for newly created resources with a labels field. This will enable users to have a view of resources managed by Terraform when viewing/editing these resources in other tools like Cloud Console, Cloud Billing etc.

    The label “goog-terraform-provisioned” can be used for the following:

    • To filter on the Billing Reports page:

    [Image 1: Billing Reports page]

    • To view the Cost breakdown:

    [Image 2: Cost Breakdown]

    Please note that an opt-in version of the label was already released in 5.16.0, and 6.0.0 changes the label to opt-out. To opt out of this default label, users may toggle the add_terraform_attribution_label provider configuration field. This can be set explicitly using any release from 5.16.0 onwards, and the value in configuration will apply after the 6.0.0 upgrade.

    provider "google" {
      // opt out of "goog-terraform-provisioned" default label
      add_terraform_attribution_label = false
    }

    Deletion protection fields added to multiple resources

    In order to prevent the accidental deletion of important resources, many resources now have a form of deletion protection enabled by default. These resources include google_domain, google_cloud_run_v2_job, google_cloud_run_v2_service, google_folder and google_project. Most of these are enabled by the deletion_protection field. google_project specifically has a deletion_policy field which is set to PREVENT by default.

    Allowed reducing the suffix length in “name_prefix”

    Another notable issue resolved in this major release is “Allow reducing the suffix length appended to instance templates name_prefix” (#15374), which changes the default behavior for name_prefix in multiple resources. The max length of the user-defined name_prefix has increased from 37 characters to 54. The provider will use a shorter appended suffix when using a name_prefix longer than 37 characters, which should allow for more flexible resource names. This applies, for example, to google_instance_template.name_prefix.

    With features like opt-out default labels and deletion protection, this version enables users to have a view of resources managed by Terraform in other tools and also prevents accidental deletion of important resources. The Terraform Google Provider 6.0.0 launch aims to improve the usability and safety of Terraform for managing Google Cloud resources. When upgrading to version 6.0 of the Terraform Google Provider, please consult the upgrade guide on the Terraform Registry, which contains a full list of the changes and upgrade considerations. Please check out the Release notes for Terraform Google Provider 6.0.0 for more details on this major version release. Learn more about Terraform on Google Cloud in the Terraform on Google Cloud documentation.

  66. CCoE Team Tech Lead, Hakuhodo Technologies Inc.

    Mon, 12 Aug 2024 16:00:00 -0000

    Hakuhodo Technologies, a specialized technology company of the Hakuhodo DY Group — one of Japan’s leading advertising and media holding companies — is dedicated to enhancing our software development process to deliver new value and experiences to society and consumers through the integration of marketing and technology. 

    Our IT Infrastructure Team at Hakuhodo Technologies operates cross-functionally, ensuring the stable operation of the public cloud that supports the diverse services within the Hakuhodo DY Group. We also provide expertise and operational support for public cloud initiatives.

    Our value is to excel in the cloud and infrastructure domain, exhibiting a strong sense of ownership, and embracing the challenge of creating new value.

    Background and challenges

    The infrastructure team is tasked with developing and operating the application infrastructure tailored to each internal organization and service, in addition to managing shared infrastructure resources.

    Following the principles of platform engineering and site reliability engineering (SRE), each team within the organization has adopted elements of SRE, including the implementation of post-mortems and the development of observability mechanisms. However, we encountered two primary challenges:

    • As the infrastructure expanded, the number of people on the team grew rapidly, bringing in new members from diverse backgrounds. This made it necessary to clarify and standardize tasks, and provide a collective understanding of our current situation and alignment on our goals.

    • We mainly communicate with the app team through a ticket-based system. In addition to expanding our workforce, we have also introduced remote working. As a result, team members may not be as well-acquainted as before. This lack of familiarity could potentially cause small misunderstandings that can escalate quickly.

    As our systems and organization expand, we believe that strengthening common understanding and cooperative relationships within the infrastructure team and the application team is essential for sustainable business growth. This has become a core element of our strategy.

    We believe that fostering an SRE mindset among both infrastructure and application team members and creating a culture based on that common understanding is essential to solving the issues above. To achieve this, we decided to implement the "SRE Core" program by Google Cloud Consulting, which serves as the first step in adopting SRE practices.

    Change

    First, through the "SRE Core" program, we revitalized communication between the application and infrastructure teams, which had previously had limited interaction. For example, some aspects of the program required information that was challenging for infrastructure members to gather and understand on their own, making cooperation with the application team essential.

    Our critical user journey (CUJ), one of the SRE metrics, was established based on the business requirements of the app and the behavior of actual users. This information is typically managed by the app team, which frequently communicates with the business side. This time, we collaborated with the application team to create a CUJ, set service level indicators (SLIs) and service level objectives (SLOs) which included error budgets, performed risk analysis, and designed the necessary elements for SRE.
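
    For readers new to these terms, the relationship between an SLI, an SLO and an error budget is simple arithmetic. The sketch below uses a request-based availability SLI; the function names, the 99.9% target and the request counts are illustrative assumptions, not figures from our services.

```python
def sli(total_requests: int, failed_requests: int) -> float:
    """Availability SLI: fraction of requests that succeeded."""
    return (total_requests - failed_requests) / total_requests


def error_budget(slo_target: float, total_requests: int) -> int:
    """Number of requests allowed to fail while still meeting the SLO."""
    return round((1.0 - slo_target) * total_requests)


def budget_remaining(slo_target: float, total_requests: int,
                     failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget(slo_target, total_requests)
    if budget == 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return (budget - failed_requests) / budget


# Assumed example: 99.9% SLO, 1,000,000 requests, 250 failures in the window.
print(error_budget(0.999, 1_000_000))           # 1000
print(budget_remaining(0.999, 1_000_000, 250))  # 0.75
print(sli(1_000_000, 250) >= 0.999)             # True
```

    With three quarters of the budget left, the team in this example still has room to ship changes; a negative value would signal that reliability work should take priority.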

    This collaborative work and shared understanding served as a starting point. We continued to build a closer working relationship even after the program ended, with infrastructure members also participating in sprint meetings that had previously been held only for the app team.

    Additionally, as an infrastructure team, we systematically learned when and why SRE activities are necessary, allowing us to reflect on and strengthen our SRE efforts that had been partially implemented.

    For example, I recently understood that the purpose of postmortems is not only to prevent the recurrence of incidents but also to gain insights from the differences in perspectives between team members. Learning the purpose of postmortems changed our team’s mindset. We now practice immediate improvement activities, such as formalizing the postmortem process, clarifying the creation of tickets for action items, and sharing postmortem minutes with the app team, which were previously kept internal.

    We also reaffirmed the importance of observability to consistently review and improve our current system. Regular meetings between the infrastructure and application teams allow us to jointly check metrics, which in turn helps maintain application performance and prevent potential issues.

    By elevating our previous partial SRE activities and integrating individual initiatives, the infrastructure team created an organizational activity cycle that has earned more trust. This enhanced cycle is now getting integrated into our original operational workflows.

    Future plans

    With the experience gained through the SRE Core program, the infrastructure team looks forward to expanding collaboration with application and business teams and increasing proactive activities. Currently, we are starting with collaborations on select applications, but we aim to use these success stories to broaden similar initiatives across the organization.

    It is important to remember that each app has different team members, business partners, environments, and cultures, so SRE activities must be tailored to each unique situation. We aim to harmonize and apply the content learned in this program with the understanding that SRE activities are not the goal, but are elements that support the goals of the apps and the business.

    Additionally, our company has a Cloud Center of Excellence (CCoE) team dedicated to cross-organizational activities. The CCoE manages a portal site for company-wide information dissemination and a community platform for developers to connect. We plan to share the insights we've gained through these channels with other respective teams within our group companies. As the CCoE's internal activities mature, we also intend to share our knowledge and experiences externally.

    Through these initiatives, we intend to continue our activities so that internal members, including those beyond the CCoE and infrastructure organizations, take psychological safety into consideration during discussions and actions.

    Supplement: Regarding psychological safety

    At our company, we have a diverse workforce with varying years of experience and perspectives. We believe that ensuring psychological safety is essential for achieving high performance.

    When psychological safety is lacking, for instance, if the person delivering bad news is blamed, reports tend to become superficial and do not lead to substantive discussions.

    Psychological barriers can cause similar problems: for example, when tasks known only to experienced employees are left out of instructions and others are afraid to ask for clarification.

    In a situation where psychological safety is ensured, we focus on systems rather than individuals, viewing problems as opportunities. For example, if errors occur due to manual work, the manual process itself is seen as the issue. Similarly, if a system failure with no prior similar case arises, it is considered an opportunity to gain new knowledge.

    By adopting this mindset, fear is removed from the equation, allowing for unbiased discussions and work.

    This allows every employee to perform at their best, regardless of their years of experience. Of course, this is not something that can be achieved through a single person. It will require a whole team or organization to recognize this to make it a reality.

  67. From Automation to Orchestration: The Key to Digital Success

    Fri, 13 Jun 2025 17:40:16 -0000

    Digital transformation continues to be a key focus for many organizations, and this usually means the automation of processes and data. At its core, automation is meant to simplify and streamline business operations. However, if not implemented correctly, it can introduce complexity, risk and fragmentation. The proliferation of automation tools, combined with the rapid growth […]
  68. New Relic Adds Support for Model Context Protocol to Observability Platform

    Fri, 13 Jun 2025 17:36:03 -0000

    New Relic this week added support for the Model Context Protocol (MCP) to its observability platform to surface insights into artificial intelligence (AI) agents and applications. Originally developed by Anthropic, MCP is rapidly becoming a de facto application programming interface (API) for enabling interoperability between AI agents and other sources of data. New Relic has […]
  69. Gearset Extends Reach of DevOps Platform for Salesforce Applications

    Fri, 13 Jun 2025 11:29:45 -0000

    Gearset has extended the scope of the observability of Salesforce applications it provides to include software developed using low-code Flex and object-oriented Apex programming tools.
  70. From Noise to Narrative: Rethinking Observability for AI-Augmented DevOps Pipelines 

    Fri, 13 Jun 2025 10:23:09 -0000

    Supporting today’s DevOps, where AI is involved, requires developers to switch from seeing observability as a solely technical aspect to thinking of it as a way to explain what happens in the system
  71. Smooth the Path to AI Agent Adoption for Software Engineers 

    Fri, 13 Jun 2025 09:57:40 -0000

    To encourage effective AI use among developers and smooth the path to adoption, enterprises must define clear use cases for AI agents.
  72. Shift Left Alone is No Longer Enough, Runtime Context is Key 

    Fri, 13 Jun 2025 05:33:49 -0000

    For a long time, security teams have been told that shifting left is the key to securing their apps and systems. And until recently, this was (mostly) sufficient. As long as security experts were included early enough in the development process, it worked to ensure that security awareness starts at the development and even design […]
  73. Telemetry-Driven DevOps: Using Data to Drive Product and Platform Decisions

    Fri, 13 Jun 2025 05:19:35 -0000

    Whether you’re launching a new product, modernizing a legacy platform, or scaling your DevOps practice, start with telemetry.
  74. 5 FinOps Strategies Every DevOps Team Should Adopt

    Fri, 13 Jun 2025 04:53:35 -0000

    If cloud costs feel unpredictable, tools can help teams take control, shifting FinOps from a reactive process to a proactive strategy that eliminates guesswork.
  75. Navigating the Complexity of Hard Choices in Software Development 

    Wed, 11 Jun 2025 13:10:25 -0000

    Although having options gives developers more ways to solve specific project requirements, an excessive number of choices can be overwhelming, leading to indecision, procrastination, suboptimal selections or even complete inaction.
  76. JFrog Extends Alliance With NVIDIA to Secure AI Software Supply Chain

    Wed, 11 Jun 2025 12:04:37 -0000

    JFrog and NVIDIA today announced they have expanded the integrations between their software development platforms to now include the Enterprise AI Factory, a set of frameworks and blueprints for building artificial intelligence (AI) applications. As a result, software artifacts created using the NVIDIA Enterprise AI Factory can be housed in the JFrog Software Supply Chain […]
  77. Behind “ANCESTRA”: combining Veo with live-action filmmaking

    Fri, 13 Jun 2025 13:30:00 -0000

    We partnered with Darren Aronofsky, Eliza McNitt and a team of more than 200 people to make a film using Veo and live-action filmmaking.
  78. How we're supporting better tropical cyclone prediction with AI

    Thu, 12 Jun 2025 15:00:00 -0000

    We’re launching Weather Lab, featuring our experimental cyclone predictions, and we’re partnering with the U.S. National Hurricane Center to support their forecasts and warnings this cyclone season.
  79. Advanced audio dialog and generation with Gemini 2.5

    Tue, 03 Jun 2025 17:15:47 -0000

    Gemini 2.5 has new capabilities in AI-powered audio dialog and generation.
  80. Our vision for building a universal AI assistant

    Tue, 20 May 2025 09:45:00 -0000

    We’re extending Gemini to become a world model that can make plans and imagine new experiences by simulating aspects of the world.
  81. SynthID Detector — a new portal to help identify AI-generated content

    Tue, 20 May 2025 09:45:00 -0000

    Learn about the new SynthID Detector portal we announced at I/O to help people understand how the content they see online was generated.
  82. Advancing Gemini's security safeguards

    Tue, 20 May 2025 09:45:00 -0000

    We’ve made Gemini 2.5 our most secure model family to date.
  83. Fuel your creativity with new generative media models and tools

    Tue, 20 May 2025 09:45:00 -0000

    Introducing Veo 3 and Imagen 4, and a new tool for filmmaking called Flow.
  84. Gemini 2.5: Our most intelligent models are getting even better

    Tue, 20 May 2025 09:45:00 -0000

    Gemini 2.5 Pro continues to be loved by developers as the best model for coding, and 2.5 Flash is getting even better with a new update. We’re bringing new capabilities to our models, including Deep Think, an experimental enhanced reasoning mode for 2.5 Pro.
  85. Announcing Gemma 3n preview: Powerful, efficient, mobile-first AI

    Tue, 20 May 2025 09:45:00 -0000

    Gemma 3n is a cutting-edge open model designed for fast, multimodal AI on devices, featuring optimized performance, unique flexibility with a 2-in-1 model, and expanded multimodal understanding with audio, empowering developers to build live, interactive applications and sophisticated audio-centric experiences.
  86. AlphaEvolve: A Gemini-powered coding agent for designing advanced algorithms

    Wed, 14 May 2025 14:59:00 -0000

    New AI agent evolves algorithms for math and practical applications in computing by combining the creativity of large language models with automated evaluators
  87. Gemini 2.5 Pro Preview: even better coding performance

    Tue, 06 May 2025 15:06:55 -0000

    We’ve seen developers doing amazing things with Gemini 2.5 Pro, so we decided to release an updated version a couple of weeks early to get it into developers’ hands sooner.
  88. Build rich, interactive web apps with an updated Gemini 2.5 Pro

    Tue, 06 May 2025 15:00:00 -0000

    Our updated version of Gemini 2.5 Pro Preview has improved capabilities for coding.
  89. Music AI Sandbox, now with new features and broader access

    Thu, 24 Apr 2025 15:01:00 -0000

    Helping music professionals explore the potential of generative AI
  90. Introducing Gemini 2.5 Flash

    Thu, 17 Apr 2025 19:02:00 -0000

    Gemini 2.5 Flash is our first fully hybrid reasoning model, giving developers the ability to turn thinking on or off.
  91. Generate videos in Gemini and Whisk with Veo 2

    Tue, 15 Apr 2025 17:00:00 -0000

    Transform text-based prompts into high-resolution eight-second videos in Gemini Advanced and use Whisk Animate to turn images into eight-second animated clips.
  92. DolphinGemma: How Google AI is helping decode dolphin communication

    Mon, 14 Apr 2025 17:00:00 -0000

    DolphinGemma, a large language model developed by Google, is helping scientists study how dolphins communicate — and hopefully find out what they're saying, too.
  93. Taking a responsible path to AGI

    Wed, 02 Apr 2025 13:31:00 -0000

    We’re exploring the frontiers of AGI, prioritizing technical safety, proactive risk assessment, and collaboration with the AI community.
  94. Evaluating potential cybersecurity threats of advanced AI

    Wed, 02 Apr 2025 13:30:00 -0000

    Our framework enables cybersecurity experts to identify which defenses are necessary—and how to prioritize them
  95. Gemini 2.5: Our most intelligent AI model

    Tue, 25 Mar 2025 17:00:36 -0000

    Gemini 2.5 is our most intelligent AI model, now with thinking built in.
  96. Gemini Robotics brings AI into the physical world

    Wed, 12 Mar 2025 15:00:00 -0000

    Introducing Gemini Robotics and Gemini Robotics-ER, AI models designed for robots to understand, act and react to the physical world.
  97. Experiment with Gemini 2.0 Flash native image generation

    Wed, 12 Mar 2025 14:58:00 -0000

    Native image output is available in Gemini 2.0 Flash for developers to experiment with in Google AI Studio and the Gemini API.
  98. Introducing Gemma 3

    Wed, 12 Mar 2025 08:00:00 -0000

    The most capable model you can run on a single GPU or TPU.
  99. Start building with Gemini 2.0 Flash and Flash-Lite

    Tue, 25 Feb 2025 18:02:12 -0000

    Gemini 2.0 Flash-Lite is now generally available in the Gemini API for production use in Google AI Studio and for enterprise customers on Vertex AI
  100. Gemini 2.0 is now available to everyone

    Wed, 05 Feb 2025 16:00:00 -0000

    We’re announcing new updates to Gemini 2.0 Flash, plus introducing Gemini 2.0 Flash-Lite and Gemini 2.0 Pro Experimental.
  101. Updating the Frontier Safety Framework

    Tue, 04 Feb 2025 16:41:00 -0000

    Our next iteration of the FSF sets out stronger security protocols on the path to AGI
  102. FACTS Grounding: A new benchmark for evaluating the factuality of large language models

    Tue, 17 Dec 2024 15:29:00 -0000

    Our comprehensive benchmark and online leaderboard offer a much-needed measure of how accurately LLMs ground their responses in provided source material and avoid hallucinations
  103. State-of-the-art video and image generation with Veo 2 and Imagen 3

    Mon, 16 Dec 2024 17:01:16 -0000

    We’re rolling out a new, state-of-the-art video model, Veo 2, and updates to Imagen 3. Plus, check out our new experiment, Whisk.
  104. Introducing Gemini 2.0: our new AI model for the agentic era

    Wed, 11 Dec 2024 15:30:40 -0000

    Today, we’re announcing Gemini 2.0, our most capable multimodal AI model yet.
  105. Google DeepMind at NeurIPS 2024

    Thu, 05 Dec 2024 17:45:00 -0000

    Advancing adaptive AI agents, empowering 3D scene creation, and innovating LLM training for a smarter, safer future
  106. GenCast predicts weather and the risks of extreme conditions with state-of-the-art accuracy

    Wed, 04 Dec 2024 15:59:00 -0000

    New AI model advances the prediction of weather uncertainties and risks, delivering faster, more accurate forecasts up to 15 days ahead
  107. Genie 2: A large-scale foundation world model

    Wed, 04 Dec 2024 14:23:00 -0000

    Generating unlimited diverse training environments for future general agents
  108. AlphaQubit tackles one of quantum computing’s biggest challenges

    Wed, 20 Nov 2024 18:00:00 -0000

    Our new AI system accurately identifies errors inside quantum computers, helping to make this new technology more reliable.
  109. The AI for Science Forum: A new era of discovery

    Mon, 18 Nov 2024 19:57:00 -0000

    The AI Science Forum highlights AI's present and potential role in revolutionizing scientific discovery and solving global challenges, emphasizing collaboration between the scientific community, policymakers, and industry leaders.
  110. Pushing the frontiers of audio generation

    Wed, 30 Oct 2024 15:00:00 -0000

    Our pioneering speech generation technologies are helping people around the world interact with more natural, conversational and intuitive digital assistants and AI tools.
  111. New generative AI tools open the doors of music creation

    Wed, 23 Oct 2024 16:53:00 -0000

    Our latest AI music technologies are now available in MusicFX DJ, Music AI Sandbox and YouTube Shorts
  112. Demis Hassabis & John Jumper awarded Nobel Prize in Chemistry

    Wed, 09 Oct 2024 11:45:00 -0000

    The award recognizes their work developing AlphaFold, a groundbreaking AI system that predicts the 3D structure of proteins from their amino acid sequences.
  113. How AlphaChip transformed computer chip design

    Thu, 26 Sep 2024 14:08:00 -0000

    Our AI method has accelerated and optimized chip design, and its superhuman chip layouts are used in hardware around the world.
  114. Updated production-ready Gemini models, reduced 1.5 Pro pricing, increased rate limits, and more

    Tue, 24 Sep 2024 16:03:03 -0000

    We’re releasing two updated production-ready Gemini models
  115. Empowering YouTube creators with generative AI

    Wed, 18 Sep 2024 14:30:06 -0000

    New video generation technology in YouTube Shorts will help millions of people realize their creative vision
  116. Our latest advances in robot dexterity

    Thu, 12 Sep 2024 14:00:00 -0000

    Two new AI systems, ALOHA Unleashed and DemoStart, help robots learn to perform complex tasks that require dexterous movement
  117. AlphaProteo generates novel proteins for biology and health research

    Thu, 05 Sep 2024 15:00:00 -0000

    New AI system designs proteins that successfully bind to target molecules, with potential for advancing drug design, disease understanding and more.
  118. FermiNet: Quantum physics and chemistry from first principles

    Thu, 22 Aug 2024 19:00:00 -0000

    Using deep learning to solve fundamental problems in computational quantum chemistry and explore how matter interacts with light
  119. Mapping the misuse of generative AI

    Fri, 02 Aug 2024 10:50:58 -0000

    New research analyzes the misuse of multimodal generative AI today, in order to help build safer and more responsible technologies.
  120. Gemma Scope: helping the safety community shed light on the inner workings of language models

    Wed, 31 Jul 2024 15:59:19 -0000

    Announcing a comprehensive, open suite of sparse autoencoders for language model interpretability.
  121. AI achieves silver-medal standard solving International Mathematical Olympiad problems

    Thu, 25 Jul 2024 15:29:00 -0000

    Breakthrough models AlphaProof and AlphaGeometry 2 solve advanced reasoning problems in mathematics
  122. Google DeepMind at ICML 2024

    Fri, 19 Jul 2024 10:00:00 -0000

    Exploring AGI, the challenges of scaling and the future of multimodal generative AI
  123. Generating audio for video

    Mon, 17 Jun 2024 16:00:00 -0000

    Video-to-audio research uses video pixels and text prompts to generate rich soundtracks
  124. Looking ahead to the AI Seoul Summit

    Mon, 20 May 2024 07:00:00 -0000

    How summits in Seoul, France and beyond can galvanize international cooperation on frontier AI safety
  125. Introducing the Frontier Safety Framework

    Fri, 17 May 2024 14:00:00 -0000

    Our approach to analyzing and mitigating future risks posed by advanced AI models
  126. Gemini breaks new ground: a faster model, longer context and AI agents

    Tue, 14 May 2024 17:58:00 -0000

    We’re introducing a series of updates across the Gemini family of models, including the new 1.5 Flash, our lightweight model for speed and efficiency, and Project Astra, our vision for the future of AI assistants.
  127. New generative media models and tools, built with and for creators

    Tue, 14 May 2024 17:57:00 -0000

    We’re introducing Veo, our most capable model for generating high-definition video, and Imagen 3, our highest quality text-to-image model. We’re also sharing new demo recordings created with our Music AI Sandbox.
  128. Watermarking AI-generated text and video with SynthID

    Tue, 14 May 2024 17:56:00 -0000

    Announcing our novel watermarking method for AI-generated text and video, and how we’re bringing SynthID to key Google products
  129. AlphaFold 3 predicts the structure and interactions of all of life’s molecules

    Wed, 08 May 2024 16:00:00 -0000

    Introducing a new AI model developed by Google DeepMind and Isomorphic Labs.
  130. Google DeepMind at ICLR 2024

    Fri, 03 May 2024 13:39:00 -0000

    Developing next-gen AI agents, exploring new modalities, and pioneering foundational learning
  131. The ethics of advanced AI assistants

    Fri, 19 Apr 2024 10:00:00 -0000

    Exploring the promise and risks of a future with more capable AI
  132. TacticAI: an AI assistant for football tactics

    Tue, 19 Mar 2024 16:03:00 -0000

    As part of our multi-year collaboration with Liverpool FC, we develop a full AI system that can advise coaches on corner kicks
  133. A generalist AI agent for 3D virtual environments

    Wed, 13 Mar 2024 14:00:00 -0000

    Introducing SIMA, a Scalable Instructable Multiworld Agent
  134. Gemma: Introducing new state-of-the-art open models

    Wed, 21 Feb 2024 13:06:00 -0000

    Gemma is built for responsible AI development from the same research and technology used to create Gemini models.
  135. Our next-generation model: Gemini 1.5

    Thu, 15 Feb 2024 15:00:00 -0000

    The model delivers dramatically enhanced performance, with a breakthrough in long-context understanding across modalities.
  136. The next chapter of our Gemini era

    Thu, 08 Feb 2024 13:00:00 -0000

    We're bringing Gemini to more Google products
  137. AlphaGeometry: An Olympiad-level AI system for geometry

    Wed, 17 Jan 2024 16:00:00 -0000

    Advancing AI reasoning in mathematics
  138. Shaping the future of advanced robotics

    Thu, 04 Jan 2024 11:39:00 -0000

    Introducing AutoRT, SARA-RT, and RT-Trajectory
  139. Images altered to trick machine vision can influence humans too

    Tue, 02 Jan 2024 16:00:00 -0000

    In a series of experiments published in Nature Communications, we found evidence that human judgments are indeed systematically influenced by adversarial perturbations.
  140. 2023: A Year of Groundbreaking Advances in AI and Computing

    Fri, 22 Dec 2023 13:30:00 -0000

    This has been a year of incredible progress in the field of Artificial Intelligence (AI) research and its practical applications.
  141. FunSearch: Making new discoveries in mathematical sciences using Large Language Models

    Thu, 14 Dec 2023 16:00:00 -0000

    In a paper published in Nature, we introduce FunSearch, a method for searching for “functions” written in computer code, and find new solutions in mathematics and computer science. FunSearch works by pairing a pre-trained LLM, whose goal is to provide creative solutions in the form of computer code, with an automated “evaluator”, which guards against hallucinations and incorrect ideas.
  142. Google DeepMind at NeurIPS 2023

    Fri, 08 Dec 2023 15:01:00 -0000

    The Neural Information Processing Systems (NeurIPS) is the largest artificial intelligence (AI) conference in the world. NeurIPS 2023 will be taking place December 10-16 in New Orleans, USA. Teams from across Google DeepMind are presenting more than 150 papers at the main conference and workshops.
  143. Introducing Gemini: our largest and most capable AI model

    Wed, 06 Dec 2023 15:13:00 -0000

    Making AI more helpful for everyone
  144. Millions of new materials discovered with deep learning

    Wed, 29 Nov 2023 16:04:00 -0000

    We share the discovery of 2.2 million new crystals – equivalent to nearly 800 years’ worth of knowledge. We introduce Graph Networks for Materials Exploration (GNoME), our new deep learning tool that dramatically increases the speed and efficiency of discovery by predicting the stability of new materials.
  145. Transforming the future of music creation

    Thu, 16 Nov 2023 07:20:00 -0000

    Announcing our most advanced music generation model and two new AI experiments, designed to open a new playground for creativity
  146. Empowering the next generation for an AI-enabled world

    Wed, 15 Nov 2023 10:00:00 -0000

    Experience AI's course and resources are expanding on a global scale
  147. GraphCast: AI model for faster and more accurate global weather forecasting

    Tue, 14 Nov 2023 15:00:00 -0000

    We introduce GraphCast, a state-of-the-art AI model able to make medium-range weather forecasts with unprecedented accuracy
  148. A glimpse of the next generation of AlphaFold

    Tue, 31 Oct 2023 13:00:00 -0000

    Progress update: Our latest AlphaFold model shows significantly improved accuracy and expands coverage beyond proteins to other biological molecules, including ligands.
  149. Evaluating social and ethical risks from generative AI

    Thu, 19 Oct 2023 15:00:00 -0000

    Introducing a context-based framework for comprehensively evaluating the social and ethical risks of AI systems
  150. Scaling up learning across many different robot types

    Tue, 03 Oct 2023 15:00:00 -0000

    Robots are great specialists, but poor generalists. Typically, you have to train a model for each task, robot, and environment. Changing a single variable often requires starting from scratch. But what if we could combine the knowledge across robotics and create a way to train a general-purpose robot?
  151. A catalogue of genetic mutations to help pinpoint the cause of diseases

    Tue, 19 Sep 2023 13:37:00 -0000

    New AI tool classifies the effects of 71 million ‘missense’ mutations.
  152. Identifying AI-generated images with SynthID

    Tue, 29 Aug 2023 00:00:00 -0000

    New tool helps watermark and identify synthetic images created by Imagen
  153. RT-2: New model translates vision and language into action

    Fri, 28 Jul 2023 00:00:00 -0000

    Robotic Transformer 2 (RT-2) is a novel vision-language-action (VLA) model that learns from both web and robotics data, and translates this knowledge into generalised instructions for robotic control.
  154. Using AI to fight climate change

    Fri, 21 Jul 2023 00:00:00 -0000

    AI is a powerful technology that will transform our future, so how can we best apply it to help combat climate change and find sustainable solutions?
  155. Google DeepMind’s latest research at ICML 2023

    Thu, 20 Jul 2023 00:00:00 -0000

    Exploring AI safety, adaptability, and efficiency for the real world
  156. Developing reliable AI tools for healthcare

    Mon, 17 Jul 2023 00:00:00 -0000

    We’ve published our joint paper with Google Research in Nature Medicine, which proposes CoDoC (Complementarity-driven Deferral-to-Clinical Workflow), an AI system that learns when to rely on predictive AI tools or defer to a clinician for the most accurate interpretation of medical images.
  157. Exploring institutions for global AI governance

    Tue, 11 Jul 2023 00:00:00 -0000

    New white paper investigates models and functions of international institutions that could help manage opportunities and mitigate risks of advanced AI.
  158. RoboCat: A self-improving robotic agent

    Tue, 20 Jun 2023 00:00:00 -0000

    Robots are quickly becoming part of our everyday lives, but they’re often only programmed to perform specific tasks well. While harnessing recent advances in AI could lead to robots that could help in many more ways, progress in building general-purpose robots is slower in part because of the time needed to collect real-world training data. Our latest paper introduces a self-improving AI agent for robotics, RoboCat, that learns to perform a variety of tasks across different arms, and then self-generates new training data to improve its technique.
  159. YouTube: Enhancing the user experience

    Fri, 16 Jun 2023 14:55:00 -0000

    It’s all about using our technology and research to help enrich people’s lives. Like YouTube — and its mission to give everyone a voice and show them the world.
  160. Google Cloud: Driving digital transformation

    Wed, 14 Jun 2023 14:51:00 -0000

    Google Cloud empowers organizations to digitally transform themselves into smarter businesses. It offers cloud computing, data analytics, and the latest artificial intelligence (AI) and machine learning tools.
  161. MuZero, AlphaZero, and AlphaDev: Optimizing computer systems

    Mon, 12 Jun 2023 14:41:00 -0000

    How MuZero, AlphaZero, and AlphaDev are optimizing the computing ecosystem that powers our world of devices.
  162. AlphaDev discovers faster sorting algorithms

    Wed, 07 Jun 2023 00:00:00 -0000

    New algorithms will transform the foundations of computing
  163. An early warning system for novel AI risks

    Thu, 25 May 2023 00:00:00 -0000

    New research proposes a framework for evaluating general-purpose models against novel threats
  164. DeepMind’s latest research at ICLR 2023

    Thu, 27 Apr 2023 00:00:00 -0000

    Next week marks the start of the 11th International Conference on Learning Representations (ICLR), taking place 1-5 May in Kigali, Rwanda. This will be the first major artificial intelligence (AI) conference to be hosted in Africa and the first in-person event since the start of the pandemic. Researchers from around the world will gather to share their cutting-edge work in deep learning spanning the fields of AI, statistics and data science, and applications including machine vision, gaming and robotics. We’re proud to support the conference as a Diamond sponsor and DEI champion.
  165. How can we build human values into AI?

    Mon, 24 Apr 2023 00:00:00 -0000

    Drawing from philosophy to identify fair principles for ethical AI...
  166. Announcing Google DeepMind

    Thu, 20 Apr 2023 00:00:00 -0000

    DeepMind and the Brain team from Google Research will join forces to accelerate progress towards a world in which AI helps solve the biggest challenges facing humanity.
  167. Competitive programming with AlphaCode

    Thu, 08 Dec 2022 00:00:00 -0000

    Solving novel problems and setting a new milestone in competitive programming.
  168. AI for the board game Diplomacy

    Tue, 06 Dec 2022 00:00:00 -0000

    Successful communication and cooperation have been crucial for helping societies advance throughout history. The closed environments of board games can serve as a sandbox for modelling and investigating interaction and communication – and we can learn a lot from playing them. In our recent paper, published today in Nature Communications, we show how artificial agents can use communication to better cooperate in the board game Diplomacy, a vibrant domain in artificial intelligence (AI) research, known for its focus on alliance building.
  169. Mastering Stratego, the classic game of imperfect information

    Thu, 01 Dec 2022 00:00:00 -0000

    Game-playing artificial intelligence (AI) systems have advanced to a new frontier.
  170. DeepMind’s latest research at NeurIPS 2022

    Fri, 25 Nov 2022 00:00:00 -0000

    NeurIPS is the world’s largest conference in artificial intelligence (AI) and machine learning (ML), and we’re proud to support the event as Diamond sponsors, helping foster the exchange of research advances in the AI and ML community. Teams from across DeepMind are presenting 47 papers, including 35 external collaborations in virtual panels and poster sessions.
  171. Building interactive agents in video game worlds

    Wed, 23 Nov 2022 00:00:00 -0000

    Most artificial intelligence (AI) researchers now believe that writing computer code which can capture the nuances of situated interactions is impossible. Alternatively, modern machine learning (ML) researchers have focused on learning about these types of interactions from data. To explore these learning-based approaches and quickly build agents that can make sense of human instructions and safely perform actions in open-ended conditions, we created a research framework within a video game environment. Today, we’re publishing a paper and collection of videos, showing our early steps in building video game AIs that can understand fuzzy human concepts – and therefore, can begin to interact with people on their own terms.
  172. Benchmarking the next generation of never-ending learners

    Tue, 22 Nov 2022 00:00:00 -0000

    Learning how to build upon knowledge by tapping 30 years of computer vision research
  173. Best practices for data enrichment

    Wed, 16 Nov 2022 00:00:00 -0000

    Building a responsible approach to data collection with the Partnership on AI...
  174. Stopping malaria in its tracks

    Thu, 13 Oct 2022 15:00:00 -0000

    Developing a vaccine that could save hundreds of thousands of lives
  175. Measuring perception in AI models

    Wed, 12 Oct 2022 00:00:00 -0000

    Perception – the process of experiencing the world through senses – is a significant part of intelligence. And building agents with human-level perceptual understanding of the world is a central but challenging task, which is becoming increasingly important in robotics, self-driving cars, personal assistants, medical imaging, and more. So today, we’re introducing the Perception Test, a multimodal benchmark using real-world videos to help evaluate the perception capabilities of a model.
  176. How undesired goals can arise with correct rewards

    Fri, 07 Oct 2022 00:00:00 -0000

    As we build increasingly advanced artificial intelligence (AI) systems, we want to make sure they don’t pursue undesired goals. Such behaviour in an AI agent is often the result of specification gaming – exploiting a poor choice of what they are rewarded for. In our latest paper, we explore a more subtle mechanism by which AI systems may unintentionally learn to pursue undesired goals: goal misgeneralisation (GMG). GMG occurs when a system's capabilities generalise successfully but its goal does not generalise as desired, so the system competently pursues the wrong goal. Crucially, in contrast to specification gaming, GMG can occur even when the AI system is trained with a correct specification.