Pipes Feed Preview: Towards Data Science & The New Stack & DevOps & SRE & DevOps.com & Google DeepMind Blog

  1. From FOMO to Opportunity: Analytical AI in the Era of LLM Agents

    Wed, 30 Apr 2025 03:42:47 -0000

    Why the gold rush toward LLM agents does not make analytical AI obsolete


    <p class="wp-block-paragraph"><mdspan datatext="el1745984322790" class="mdspan-comment">Are you feeling</mdspan> &#8220;fear of missing out&#8221; (FOMO) when it comes to LLM agents? Well, that was the case for me for quite a while.</p> <p class="wp-block-paragraph">In recent months, it feels like my online feeds have been completely bombarded by &#8220;LLM Agents&#8221;: every other technical blog is trying to show me &#8220;how to build an agent in 5 minutes&#8221;. Every other piece of tech news is highlighting yet another shiny startup building LLM agent-based products, or a big tech releasing some new agent-building libraries or fancy-named agent protocols (seen enough MCP or Agent2Agent?).</p> <p class="wp-block-paragraph">It seems that suddenly, LLM agents are everywhere. All those flashy demos showcase that those digital beasts seem more than capable of writing code, automating workflows, discovering insights, and seemingly threatening to replace… well, just about everything.</p> <p class="wp-block-paragraph">Unfortunately, this view is also shared by many of our clients at work. They are actively asking for agentic features to be integrated into their products. They aren&#8217;t hesitating to finance new agent-development projects, because of the fear of lagging behind their competitors in leveraging this new technology.</p> <p class="wp-block-paragraph">As an <a href="https://towardsdatascience.com/tag/analytical-ai/" title="Analytical AI">Analytical AI</a> practitioner, seeing those impressive agent demos built by my colleagues and the enthusiastic feedback from the clients, I have to admit, it gave me a serious case of FOMO.</p> <p class="wp-block-paragraph">It genuinely left me wondering: Is the work I do becoming irrelevant? </p> <p class="wp-block-paragraph">After struggling with that question, I have reached this conclusion:</p> <p class="wp-block-paragraph"><strong>No, that&#8217;s not the case at all.</strong></p> <p class="wp-block-paragraph">In this blog post, I want to share my thoughts on why the rapid rise of <a href="https://towardsdatascience.com/tag/llm-agents/" title="LLM Agents">LLM Agents</a> doesn&#8217;t diminish the importance of analytical AI. In fact, I believe it&#8217;s doing the opposite: it&#8217;s creating unprecedented opportunities for both analytical AI and agentic AI. </p> <p class="wp-block-paragraph">Let&#8217;s explore why.</p> <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"> <p class="wp-block-paragraph">Before diving in, let&#8217;s quickly clarify the terms:</p> <ul class="wp-block-list"> <li class="wp-block-list-item"><strong>Analytical AI</strong>: I&#8217;m primarily referring to statistical modeling and machine learning approaches applied to quantitative, numerical data. Think of industrial applications like anomaly detection, time-series forecasting, product design optimization, predictive maintenance, ditigal twins, etc.</li> <li class="wp-block-list-item"><strong>LLM Agents</strong>: I am referring to AI systems using LLM as the core that can autonomously perform tasks by combining natural language understanding, with reasoning, planning, memory, and tool use.</li> </ul> </blockquote> <hr class="wp-block-separator has-alpha-channel-opacity is-style-dotted"/> <figure class="wp-block-pullquote"><blockquote><p><strong>Viewpoint 1: Analytical AI provides the crucial quantitative grounding for LLM agents. 
</strong></p></blockquote></figure> <p class="wp-block-paragraph">Despite the remarkable capabilities in natural language understanding and generation, LLMs fundamentally lack the quantitative precision required for many industrial applications. This is where analytical AI becomes indispensable. </p> <p class="wp-block-paragraph">There are some key ways the analytical AI could step up, grounding the LLM agents with mathematical rigor and ensuring that they are operating following the reality:</p> <h2 class="wp-block-heading has-subtitle-1-font-size"><img src="https://s.w.org/images/core/emoji/15.0.3/72x72/1f6e0.png" alt="🛠" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Analytical AI as essential tools</h2> <p class="wp-block-paragraph">Integrating Analytical AI as specialized, callable tools is arguably the most common pattern for providing LLM agents with quantitative grounding.</p> <p class="wp-block-paragraph">There has long been a tradition (well before the current hype around LLMs) of developing specialized Analytical AI tools across various industries to address challenges using real-world operational data. Those challenges, be it predicting equipment maintenance or forecasting energy consumption, demand high numerical precision and sophisticated modeling capabilities. Frankly, these capabilities are fundamentally different from the linguistic and reasoning strengths that characterize today&#8217;s LLMs.</p> <p class="wp-block-paragraph">This long-standing foundation of Analytical AI is not just relevant, but essential, for grounding LLM agents in real-world accuracy and operational reliability. The core motivation here is a <strong>separation of concerns</strong>: let the LLM agents handle the understanding, reasoning, and planning, while the Analytical AI tools perform the specialized quantitative analysis they were trained for.</p> <p class="wp-block-paragraph">In this paradigm, Analytical AI tools can play multiple critical roles. First and foremost, they can <strong>enhance the agent&#8217;s capabilities</strong> with analytical superpowers it inherently lacks. Also, they can <strong>verify the agent&#8217;s outputs/hypotheses</strong> against real data and the learned patterns. Finally, they can <strong>enforce physical constraints</strong>, ensuring the agents operate in a realistically feasible space.</p> <p class="wp-block-paragraph">To give a concrete example, imagine an LLM agent that is tasked with optimizing a complex semiconductor fabrication process to maximize yield and maintain stability. Instead of solely relying on textual logs/operator notes, the agent continuously interacts with a suite of specialized Analytical AI tools to gain a quantitative, context-rich understanding of the process in real-time.</p> <p class="wp-block-paragraph">For instance, to achieve its goal of high yield, the agent queries a pre-trained <strong>XGBoost model</strong> to predict the likely yield based on hundreds of sensor readings and process parameters. This gives the agent the foresight into quality outcomes.</p> <p class="wp-block-paragraph">At the same time, to ensure the process stability for consistent quality, the agent calls upon an <strong>autoencoder model </strong>(pre-trained on normal process data) to identify deviations or potential equipment failures <em>before</em> they disrupt production.</p> <p class="wp-block-paragraph">When potential issues arise, as indicated by the anomaly detection model, the agent must perform course correction in an optimal way. 
To do that, it invokes a <strong>constraint-based optimization model</strong>, which employs a <em>Bayesian optimization</em> algorithm to recommend the optimal adjustments to process parameters.</p> <p class="wp-block-paragraph">In this scenario, the LLM agent essentially acts as the intelligent orchestrator. It interprets the high-level goals, plans the queries to the appropriate Analytical AI tools, reasons on their quantitative outputs, and translates these complex analyses into actionable insights for operators or even triggers automated adjustments. This collaboration ensures that LLM agents remain grounded and reliable in tackling complex, real-world industrial problems.</p> <h2 class="wp-block-heading has-subtitle-1-font-size"><img src="https://s.w.org/images/core/emoji/15.0.3/72x72/1faa3.png" alt="🪣" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Analytical AI as a digital sandbox</h2> <p class="wp-block-paragraph">Beyond serving as a callable tool, Analytical AI offers another crucial capability: creating realistic <strong>simulation environments</strong> where LLM agents get trained and evaluated before they interact with the physical world. This is particularly valuable in industrial settings where failure could lead to severe consequences, like equipment damage or safety incidents.</p> <p class="wp-block-paragraph">Analytical AI techniques are highly capable of building high-fidelity representations of the industrial asset or process by learning from both their historical operational data and the governing physical equations (think of methods like physics-informed neural networks). These <em>digital twins</em> capture the underlying physical principles, operational constraints, and inherent system variability.</p> <p class="wp-block-paragraph">Within this Analytical AI-powered virtual world, an LLM agent can be trained by first receiving simulated sensor data, deciding on control actions, and then observing the system responses computed by the Analytical AI simulation. As a result, agents can iterate through many trial-and-error learning cycles in a much shorter time and be safely exposed to a diverse range of realistic operating conditions.</p> <p class="wp-block-paragraph">Besides agent training, these Analytical AI-powered simulations offer a controlled environment for rigorously <strong>evaluating and comparing </strong>the performance and robustness of different agent setup versions or control policies before real-world deployment.</p> <p class="wp-block-paragraph">To give a concrete example, consider a power grid management case. An LLM agent (or multiple agents) designed to optimize renewable energy integration can be tested within such a simulated environment powered by multiple analytical AI models: we could have a <strong>physics-informed neural network</strong> (PINN) model to describe the complex, dynamical power flows. We may also have probabilistic forecasting models to simulate realistic weather patterns and their impact on renewable generation. Within this rich environment, the LLM agent(s) can learn to develop sophisticated decision-making policies for balancing the grid during various weather conditions, without ever risking actual service disruptions.</p> <p class="wp-block-paragraph">The bottom line is, without Analytical AI, none of this would be possible. 
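</p> <p class="wp-block-paragraph">To make the sandbox idea a bit more tangible, here is a minimal, purely illustrative Python sketch: a surrogate model trained on synthetic data plays the role of the digital twin, and a placeholder <code>agent_decide</code> function stands in for the LLM agent. All names and numbers are invented for illustration; they do not come from a real system.</p> <pre class="wp-block-prismatic-blocks"><code class="language-python">
# Illustrative sketch: an Analytical AI surrogate model acts as the training environment,
# while a placeholder function stands in for the LLM agent's decisions.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Toy "digital twin": state = [load_MW, wind_MW], action = dispatch_MW,
# target = resulting grid frequency deviation (synthetic data for illustration).
X_hist = rng.uniform([200, 0, 0], [800, 300, 500], size=(2000, 3))
y_hist = 0.002 * (X_hist[:, 0] - X_hist[:, 1] - X_hist[:, 2]) + rng.normal(0, 0.01, 2000)
twin = GradientBoostingRegressor().fit(X_hist, y_hist)

def agent_decide(load, wind):
    """Placeholder for the LLM agent's policy; here, a naive heuristic."""
    return max(load - wind, 0.0)

# Closed-loop evaluation inside the sandbox: no real grid is ever touched.
for step in range(5):
    load, wind = rng.uniform(200, 800), rng.uniform(0, 300)
    dispatch = agent_decide(load, wind)
    freq_dev = twin.predict([[load, wind, dispatch]])[0]
    print(f"step {step}: dispatch={dispatch:.0f} MW, predicted deviation={freq_dev:+.3f} Hz")
</code></pre> <p class="wp-block-paragraph">However the agent itself is implemented, the environment it learns in is Analytical AI under the hood. 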
It forms the quantitative foundation and the physical constraints that make safe and effective agent development a reality.</p> <h2 class="wp-block-heading has-subtitle-1-font-size"><img src="https://s.w.org/images/core/emoji/15.0.3/72x72/1f4c8.png" alt="📈" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Analytical AI as an operational toolkit</h2> <p class="wp-block-paragraph">Now, if we zoom out and take a fresh perspective, <strong>isn’t an LLM agent—or even a team of them—just another type of operational system, that needs to be managed like any other industrial asset/process?</strong></p> <p class="wp-block-paragraph">This effectively means: all the principles of design, optimization, and monitoring for systems still apply. And guess what? Analytical AI is the toolkit exactly for that.</p> <p class="wp-block-paragraph">Again, Analytical AI has the potential to move us beyond empirical trial-and-error (the current practices) and towards <em>objective</em>, <em>data-driven</em> methods for managing agentic systems. How about using a <strong>Bayesian optimization algorithm</strong> to design the agent architecture and configurations? How about adopting <strong>operations research techniques</strong> to optimize the allocation of computational resources or manage request queues efficiently? How about employing <strong>time-series anomaly detection</strong> methods to alert real-time behavior of the agents?</p> <p class="wp-block-paragraph">Treating the LLM agent as a complex system subject to quantitative analysis opens up many new opportunities. It is precisely this operational rigor enabled by Analytical AI that can elevate these LLM agents from &#8220;just a demo&#8221; to something reliable, efficient, and &#8220;actually useful&#8221; in modern industrial operation.</p> <hr class="wp-block-separator has-alpha-channel-opacity is-style-dotted"/> <figure class="wp-block-pullquote"><blockquote><p><strong>Viewpoint 2: Analytical AI can be amplified by LLM agents with their contextual intelligence</strong>.</p></blockquote></figure> <p class="wp-block-paragraph">We have discussed in length how indispensable Analytical AI is for the LLM agent ecosystem. But this powerful synergy flows in both directions. Analytical AI can also leverage the unique strengths of LLM agents to enhance its usability, effectiveness, and ultimately, the real-world impact. Those are the points that Analytical AI practitioners may not want to miss out on LLM agents.</p> <h2 class="wp-block-heading has-subtitle-1-font-size"><img src="https://s.w.org/images/core/emoji/15.0.3/72x72/1f9e9.png" alt="🧩" class="wp-smiley" style="height: 1em; max-height: 1em;" /> From vague goals to solvable problems</h2> <p class="wp-block-paragraph">Often, the need for analysis starts with a high-level, vaguely stated business goal, like &#8220;we need to improve product quality.&#8221; To make this actionable, Analytical AI practitioners must repeatedly ask clarifying questions to uncover the true objective functions, specific constraints, and available input data, which inevitably leads to a very time-consuming process.</p> <p class="wp-block-paragraph">The good news is, LLM agents excel here. 
They can interpret these ambiguous natural language requests, ask clarifying questions, and formulate them into well-structured, quantitative problems that Analytical AI tools can directly tackle.</p> <h2 class="wp-block-heading has-subtitle-1-font-size"><img src="https://s.w.org/images/core/emoji/15.0.3/72x72/1f4da.png" alt="📚" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Enriching Analytical AI model with context and knowledge</h2> <p class="wp-block-paragraph">Traditional Analytical AI models operate primarily on numerical data. For the largely untapped unstructured data, LLM agents can be very helpful there to extract useful information to fuel the quantitative analysis. </p> <p class="wp-block-paragraph">For example, LLM agents can analyze text documents/reports/logs to identify meaningful patterns, and transform these qualitative observations into quantitative features that Analytical AI models can process. This <strong>feature engineering</strong> step often significantly boosts the performance of Analytical AI models by giving them access to insights embedded in unstructured data they would otherwise miss.</p> <p class="wp-block-paragraph">Another important use case is <strong>data labeling</strong>. Here, LLM agents can automatically generate accurate category labels and annotations. By providing high-quality training data, they can greatly accelerate the development of high-performing supervised learning models.</p> <p class="wp-block-paragraph">Finally, by tapping into the <strong>knowledge </strong>of LLM agents, either <em>pre-trained</em> in the LLM or <em>actively searched</em> in external databases, LLM agents can automate the setup of the sophisticated analysis pipeline. LLM agents can recommend appropriate algorithms and parameter settings based on the problem characteristics [1], generate code to implement custom problem-solving strategies, or even automatically run experiments for hyperparameter tuning [2].</p> <h2 class="wp-block-heading has-subtitle-1-font-size"><img src="https://s.w.org/images/core/emoji/15.0.3/72x72/1f4a1.png" alt="💡" class="wp-smiley" style="height: 1em; max-height: 1em;" />From technical outputs to actionable insights</h2> <p class="wp-block-paragraph">Analytical AI models tend to produce dense outputs, and properly interpreting them requires both expertise and time. LLM agents, on the other hand, can act as &#8220;translators&#8221; by converting these dense quantitative results into clear, accessible natural language explanations. </p> <p class="wp-block-paragraph">This interpretability function plays a crucial role in <strong>explaining </strong>the decisions made by the Analytical AI models in a way that human operators can quickly understand and act upon. Also, this information could be highly valuable for model developers to verify the correctness of model outputs, identify potential issues, and improve model performance.</p> <p class="wp-block-paragraph">Besides technical interpretation, LLM agents can also generate tailored responses for different types of audiences: technical teams would receive detailed methodological explanations, operations staff may get practical implications, while executives may obtain summaries highlighting business impact metrics. 
</p> <p class="wp-block-paragraph">By serving as <em>interpreters </em>between analytical systems and human users, LLM agents can significantly amplify the practical value of analytical AI.</p> <hr class="wp-block-separator has-alpha-channel-opacity is-style-dotted"/> <figure class="wp-block-pullquote"><blockquote><p><strong>Viewpoint 3: The future probably lies in the true peer-to-peer collaboration between Analytical AI and Agentic AI.</strong></p></blockquote></figure> <p class="wp-block-paragraph">Whether LLM agents call Analytical AI tools or analytical systems use LLM agents for interpretation, the approaches we have discussed so far have always been about one type of AI being in charge of the other. This in fact has introduced several limitations worth looking at.</p> <p class="wp-block-paragraph">First of all, in the current paradigm, Analytical AI components are only used as passive tools, and they are invoked only when the LLM decides so. This prevents them from proactively contributing insights or questioning assumptions.</p> <p class="wp-block-paragraph">Also, the typical agent loop of &#8220;plan-call-response-act&#8221;&nbsp;is inherently sequential. This can be inefficient for tasks that could benefit from parallel processing or more asynchronous interaction between the two AIs.</p> <p class="wp-block-paragraph">Another limiting factor is the limited communication bandwidth. API calls may not be able to deliver the rich context needed for genuine dialogue or exchange of intermediate reasoning.</p> <p class="wp-block-paragraph">Finally, LLM agents&#8217; understanding of an Analytical AI tool is often based on a brief docstring and a parameter schema. LLM agents are likely to make mistakes in tool selection, while Analytical AI components lack the context to recognize when they&#8217;re being used wrongly.</p> <p class="wp-block-paragraph">Just because the prevalence of adoption of the tool-calling pattern today does not necessarily mean the future should look the same. Probably, the future lies in a true peer-to-peer collaboration paradigm where neither AI type is the master.</p> <p class="wp-block-paragraph">What might this actually look like in practice? One interesting example I found is a solution delivered by Siemens [3].</p> <p class="wp-block-paragraph">In their smart factory system, there is a digital twin model that continuously monitors the equipment&#8217;s health. When a gearbox&#8217;s condition deteriorates, the Analytical AI system doesn&#8217;t wait to be queried, but proactively fires alerts. A Copilot LLM agent watches the same event bus. On an alert, it (1) cross-references maintenance logs, (2) “asks” the twin to rerun simulations with upcoming shift patterns, and then (3) recommends schedule adjustments to prevent costly downtime. What makes this example unique is that the Analytical AI system isn&#8217;t just a passive tool. Rather, it initiates the dialogue when needed.</p> <p class="wp-block-paragraph">Of course, this is just one possible system architecture. 
<p class="wp-block-paragraph">Other promising directions include <strong>multi-agent systems</strong> with specialized cognitive functions, <strong>cross-training</strong> these systems to develop hybrid models that internalize aspects of both AI types (just as humans develop integrated mathematical and linguistic thinking), or simply drawing inspiration from established <strong>ensemble learning techniques</strong> by treating LLM agents and Analytical AI as different model types that can be combined in systematic ways. The future opportunities are endless.</p> <p class="wp-block-paragraph">But these also raise fascinating research challenges. How do we design <em>shared representations</em>? What architecture best supports <em>asynchronous information exchange</em>? What <em>communication protocols</em> are optimal between Analytical AI and agents?</p> <p class="wp-block-paragraph">These questions represent new frontiers that definitely need expertise from Analytical AI practitioners. Once again, the deep knowledge of building analytical models with quantitative rigor isn&#8217;t becoming obsolete; it is essential for building these hybrid systems of the future.</p> <figure class="wp-block-pullquote"><blockquote><p><strong>Viewpoint 4: Let&#8217;s embrace the complementary future.</strong></p></blockquote></figure> <p class="wp-block-paragraph">As we&#8217;ve seen throughout this post, the future isn&#8217;t &#8220;Analytical AI vs. LLM Agents.&#8221; It&#8217;s <strong>&#8220;Analytical AI + LLM Agents.&#8221;</strong></p> <p class="wp-block-paragraph">So, rather than feeling FOMO about LLM agents, I&#8217;ve now found renewed excitement about analytical AI&#8217;s evolving role. The analytical foundations we&#8217;ve built aren&#8217;t becoming obsolete; they&#8217;re essential components of a more capable AI ecosystem.</p> <p class="wp-block-paragraph">Let&#8217;s get building.</p> <p class="has-heading-5-font-size wp-block-paragraph">References</p> <p class="wp-block-paragraph">[1] Chen et al., <a href="https://arxiv.org/abs/2412.12154">PyOD 2: A Python Library for Outlier Detection with LLM-powered Model Selection</a>. arXiv, 2024.</p> <p class="wp-block-paragraph">[2] Liu et al., <a href="https://arxiv.org/abs/2402.03921">Large Language Models to Enhance Bayesian Optimization</a>. arXiv, 2024.</p> <p class="wp-block-paragraph">[3] <a href="https://press.siemens.com/global/en/pressrelease/siemens-unveils-breakthrough-innovations-industrial-ai-and-digital-twin-technology-ces">Siemens unveils breakthrough innovations in industrial AI and digital twin technology at CES 2025</a>. Press release, 2025.</p> <p>The post <a href="https://towardsdatascience.com/from-fomo-to-opportunity-analytical-ai-in-the-era-of-llm-agents/">From FOMO to Opportunity: Analytical AI in the Era of LLM Agents</a> appeared first on <a href="https://towardsdatascience.com">Towards Data Science</a>.</p>
  2. Data Analyst or Data Engineer or Analytics Engineer or BI Engineer ?

    Wed, 30 Apr 2025 01:06:36 -0000

    What’s the real difference?


    <p class="wp-block-paragraph">If <mdspan datatext="el1745456944555" class="mdspan-comment">you’ve followed</mdspan> me for a while, you probably know I started my career as a <strong>QA engineer</strong> before transitioning into the world of <strong>data analytics</strong>. I didn’t go to school for it, didn’t have a mentor, and didn’t land in a formal training program. Everything I know today—from SQL to modeling to storytelling with data—is self-taught. And believe me, it’s been a journey of trial, error, learning, and re-learning.</p> <h2 class="wp-block-heading">The Dilemma That Changed My Career</h2> <p class="wp-block-paragraph">A few years ago, I started thinking about switching organizations. Like many people in fast-evolving tech roles, I faced a surprisingly difficult question:</p> <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"> <p class="wp-block-paragraph">What role am I actually doing? Which roles should I apply for?</p> </blockquote> <p class="wp-block-paragraph">On paper, I was a <strong>Data Analyst</strong>. But in reality, my role straddled several functions: writing SQL pipelines, building dashboards, defining KPIs, and digging into product analytics. I wasn’t sure whether I should be applying for Analyst roles, BI roles, or something entirely different.</p> <p class="wp-block-paragraph">To make things worse, back then, job titles were vague, and job descriptions were bloated with buzzwords. You’d find a posting titled <em>“Data Analyst”</em> that listed requirements like:</p> <ul class="wp-block-list"> <li class="wp-block-list-item">Build ML pipelines</li> <li class="wp-block-list-item">Write complex ETL scripts</li> <li class="wp-block-list-item">Maintain data lakes</li> <li class="wp-block-list-item">Create dashboards</li> <li class="wp-block-list-item">Present executive-level insights</li> <li class="wp-block-list-item">And oh, by the way, be great at stakeholder management</li> </ul> <p class="wp-block-paragraph">It was overwhelming and confusing. And I know I’m not alone in this.</p> <p class="wp-block-paragraph">Fast forward to today: thankfully, things are evolving. There’s still overlap between roles, but organizations have started to define them more clearly. In this article, I want to break down the <strong>real differences between data roles</strong>, through the lens of a real-world example.</p> <h3 class="wp-block-heading">A Real-World Scenario: Meet <em>Quikee</em></h3> <p class="wp-block-paragraph">Let’s imagine a fictional quick-commerce startup called <strong>Quikee</strong>, launching across multiple Indian cities. Their value proposition? Deliver groceries and essentials within <strong>10 minutes</strong>.</p> <p class="wp-block-paragraph">Customers place orders through the app or website. 
Behind the scenes, there are micro-warehouses (also called “dark stores”) across cities, and a fleet of delivery partners who make those lightning-fast deliveries.</p> <p class="wp-block-paragraph">Now, let’s walk through the data needs of this company—from the moment an order is placed, to the dashboards executives use in their Monday morning meetings.</p> <h3 class="wp-block-heading">Step 1: Capturing and Storing Raw Data</h3> <p class="wp-block-paragraph">The moment a customer places an order, <strong>transactional data</strong> is generated:</p> <ul class="wp-block-list"> <li class="wp-block-list-item">Timestamps</li> <li class="wp-block-list-item">Order ID</li> <li class="wp-block-list-item">Items ordered</li> <li class="wp-block-list-item">Price</li> <li class="wp-block-list-item">Discount codes</li> <li class="wp-block-list-item">Customer location</li> <li class="wp-block-list-item">Payment method</li> <li class="wp-block-list-item">Assigned delivery partner</li> </ul> <p class="wp-block-paragraph">Let’s assume Quikee uses <strong>Amazon Kinesis</strong> to stream this data in real time to an <strong>S3 data lake</strong>. That stream is high-volume, time-sensitive, and crucial for business tracking.</p> <p class="wp-block-paragraph">But here’s the catch: raw data is messy. You can’t use it directly for decision-making.</p> <p class="wp-block-paragraph">So what happens next?</p> <h3 class="wp-block-heading">Step 2: Building Data Pipelines</h3> <p class="wp-block-paragraph">Enter the <strong>Data Engineers</strong>.</p> <p class="wp-block-paragraph">They are responsible for:</p> <ul class="wp-block-list"> <li class="wp-block-list-item">Ingesting real-time data</li> <li class="wp-block-list-item">Validating schema consistency</li> <li class="wp-block-list-item">Handling failures and retries</li> <li class="wp-block-list-item">Writing pipelines to move data from S3 into a data warehouse (say, Snowflake or Redshift)</li> </ul> <p class="wp-block-paragraph">This is where <strong>ETL</strong> (Extract, Transform, Load) or <strong>ELT</strong> pipelines come into play. 
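</p> <p class="wp-block-paragraph">To make the transform step a little more concrete, here is a small, hypothetical pandas sketch of the kind of logic a data engineer might write to turn raw order events into the normalized tables described below. Table and column names are invented for illustration; a real pipeline would read from S3 and load into the warehouse rather than work with in-memory frames.</p> <pre class="wp-block-prismatic-blocks"><code class="language-python">
# Hypothetical illustration of a tiny ELT-style transform: raw order events -&gt; tidy tables.
import pandas as pd

raw_events = pd.DataFrame([
    {"order_id": "O1", "ts": "2025-04-29 10:01:03", "items": "milk:2|bread:1",
     "amount": 6.5, "payment_status": "SUCCESS", "city": "Bengaluru"},
    {"order_id": "O2", "ts": "2025-04-29 10:02:11", "items": "eggs:1",
     "amount": 3.0, "payment_status": "FAILED", "city": "Mumbai"},
])

# Orders: one row per order, with proper types.
orders = raw_events[["order_id", "ts", "city", "amount"]].assign(ts=pd.to_datetime(raw_events["ts"]))

# Order_Items: explode the packed "items" string into one row per item.
order_items = (
    raw_events.assign(item=raw_events["items"].str.split("|"))
    .explode("item")[["order_id", "item"]]
    .reset_index(drop=True)
)
split_cols = order_items["item"].str.split(":", expand=True)
order_items["sku"], order_items["qty"] = split_cols[0], split_cols[1].astype(int)
order_items = order_items.drop(columns="item")

# Payments: one row per payment attempt.
payments = raw_events[["order_id", "payment_status", "amount"]]

print(orders, order_items, payments, sep="\n\n")
</code></pre> <p class="wp-block-paragraph">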
Data engineers clean, format, and structure the data to make it queryable.</p> <p class="wp-block-paragraph">For example, an order table might get split into:</p> <ul class="wp-block-list"> <li class="wp-block-list-item"><strong>Orders</strong> → One row per order</li> <li class="wp-block-list-item"><strong>Order_Items</strong> → One row per item in an order</li> <li class="wp-block-list-item"><strong>Payments</strong> → One row per payment attempt</li> </ul> <p class="wp-block-paragraph">At this stage, raw logs are turned into structured tables that analysts can work with.</p> <h3 class="wp-block-heading">Step 3: Dimensional Modeling &amp; OLAP</h3> <p class="wp-block-paragraph">As leadership starts asking strategic questions like:</p> <ul class="wp-block-list"> <li class="wp-block-list-item">“Which city brings in the most revenue?”</li> <li class="wp-block-list-item">“Which store is underperforming?”</li> <li class="wp-block-list-item">“What’s our average delivery time by zone?”</li> </ul> <p class="wp-block-paragraph">…it becomes clear that querying transactional data directly won’t scale.</p> <p class="wp-block-paragraph">That’s where <strong>dimensional modeling</strong> comes in.</p> <p class="wp-block-paragraph">Instead of flat, raw tables, data is structured into Fact and Dimension Tables.</p> <h3 class="wp-block-heading"><img src="https://s.w.org/images/core/emoji/15.0.3/72x72/1f538.png" alt="🔸" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Fact Tables</strong></h3> <ul class="wp-block-list"> <li class="wp-block-list-item">Large, quantitative data tables which contain foreign keys along with measures and metrics (<em>Well, most of the time. There are factless fact tables as well which do not have any measures</em>).</li> <li class="wp-block-list-item">Examples: <code>fact_orders</code>, <code>fact_payments</code>, <code>fact_deliveries</code></li> <li class="wp-block-list-item">Contain metrics like revenue, order count, delivery time</li> </ul> <h3 class="wp-block-heading"><img src="https://s.w.org/images/core/emoji/15.0.3/72x72/1f539.png" alt="🔹" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Dimension Tables</strong></h3> <ul class="wp-block-list"> <li class="wp-block-list-item">Smaller, descriptive tables that help understand the data in a fact table</li> <li class="wp-block-list-item">Examples: <code>dim_store</code>, <code>dim_product</code>, <code>dim_customer</code>, <code>dim_delivery_agent</code></li> <li class="wp-block-list-item">Help filter, group, and join facts for deeper insights</li> </ul> <p class="wp-block-paragraph">This structure enables <strong>OLAP</strong>—fast, analytical querying across multiple dimensions. 
For example, you can now run queries like:</p> <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"> <p class="wp-block-paragraph">“Show me average delivery time by store and hour of day, over the last 7 days.”</p> </blockquote> <p class="wp-block-paragraph">This step is done by Data Engineers at most of the organisations but I did build few Dim and Fact tables when I was working as a <a href="https://towardsdatascience.com/tag/business-intelligence/" title="Business Intelligence">Business Intelligence</a> Engineer at Amazon.</p> <h3 class="wp-block-heading">Step 4: Defining KPIs and Metrics</h3> <p class="wp-block-paragraph">This is where <strong>Analytics Engineers (or BI Engineers)</strong> shine.</p> <p class="wp-block-paragraph">They sit between the technical data layer and business users. Their responsibilities often include:</p> <ul class="wp-block-list"> <li class="wp-block-list-item">Defining KPIs (e.g., churn rate, repeat purchase %, time-to-fulfillment)</li> <li class="wp-block-list-item">Writing logic for complex metrics (e.g., cohort retention, active users)</li> <li class="wp-block-list-item">Creating <strong>semantic models</strong> or <strong>metrics layers</strong> in tools like dbt or Looker</li> <li class="wp-block-list-item">Ensuring consistent definitions across the company</li> </ul> <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"> <p class="wp-block-paragraph">For example, at Amazon, our team didn’t query raw data to calculate revenue every time. Instead, we created <strong>pre-aggregated fact tables</strong> at daily, weekly, and monthly grains. That way, dashboards loaded faster, and metrics stayed consistent across teams.</p> </blockquote> <p class="wp-block-paragraph">Analytics Engineers act as translators between engineering and the business—defining <strong>what</strong> we measure and <strong>how</strong> we measure it.</p> <h3 class="wp-block-heading">Step 5: Analysis, Reporting &amp; Storytelling</h3> <p class="wp-block-paragraph">Now comes the role of the <strong><a href="https://towardsdatascience.com/tag/data-analyst/" title="Data Analyst">Data Analyst</a></strong>.</p> <p class="wp-block-paragraph">Armed with clean, modeled data, they focus on answering real business questions like:</p> <ul class="wp-block-list"> <li class="wp-block-list-item">“Why did retention drop in Bangalore last month?”</li> <li class="wp-block-list-item">“Which coupon codes drive the most new users?”</li> <li class="wp-block-list-item">“What are the top products reordered in the first 30 days?”</li> </ul> <p class="wp-block-paragraph">They build dashboards in tools like Tableau, Power BI, or Looker. They run ad-hoc SQL queries. They dive into A/B test results, user behavior trends, and campaign effectiveness.</p> <p class="wp-block-paragraph">But above all, they <strong>tell stories</strong> with data—making complex numbers understandable and actionable for stakeholders.</p> <h3 class="wp-block-heading">Who’s&nbsp;Who?</h3> <figure class="wp-block-image"><img decoding="async" src="https://cdn-images-1.medium.com/max/1600/1*Va4pgbc8YPUc03l4-RAcVQ.png" alt=""/><figcaption class="wp-element-caption">Generated by&nbsp;Author</figcaption></figure> <h3 class="wp-block-heading">TL;DR: Where Do You Fit?</h3> <p class="wp-block-paragraph">Here’s how I think about it:</p> <ul class="wp-block-list"> <li class="wp-block-list-item">Love building robust pipelines and solving scalability problems? 
→ You’re a <strong>Data Engineer</strong></li> <li class="wp-block-list-item">Love defining business metrics and organizing complex datasets? → You’re an <strong>Analytics Engineer</strong></li> <li class="wp-block-list-item">Love uncovering insights and storytelling with data? → You’re a <strong>Data Analyst</strong></li> </ul> <p class="wp-block-paragraph">Of course, real-world roles often blend these. Especially at smaller companies, you may wear multiple hats. And that’s okay.</p> <p class="wp-block-paragraph">The key is not the title—but <strong>where you add the most value</strong> and <strong>what energizes you</strong>.</p> <h2 class="wp-block-heading">Final Thoughts</h2> <p class="wp-block-paragraph">It took me a long time to understand what I actually do—not just what my job title says. And if you’ve ever felt that confusion, you’re not alone.</p> <p class="wp-block-paragraph">Today, I can clearly say I operate at the intersection of <strong>data modeling</strong>, <strong>business logic</strong>, and <strong>storytelling</strong>—a sweet spot between analytics and engineering. And I’ve learned that the ability to connect the dots is more important than fitting into a perfect box.</p> <p class="wp-block-paragraph">If you’ve walked a similar path—or wear multiple hats in your role—I’d love to hear your story.</p> <p class="wp-block-paragraph"><strong>Drop a comment <img src="https://s.w.org/images/core/emoji/15.0.3/72x72/1f447.png" alt="👇" class="wp-smiley" style="height: 1em; max-height: 1em;" /> or share this with someone figuring it out too.</strong></p> <p>The post <a href="https://towardsdatascience.com/data-analyst-or-data-engineer-or-analytics-engineer-or-bi-engineer/">Data Analyst or Data Engineer or Analytics Engineer or BI Engineer ?</a> appeared first on <a href="https://towardsdatascience.com">Towards Data Science</a>.</p>
  3. Building a Scalable and Accurate Audio Interview Transcription Pipeline with Google Gemini

    Tue, 29 Apr 2025 20:13:43 -0000

    From prototype to production: real-world insights into building smarter transcription pipelines with LLMs.


    <p class="wp-block-paragraph"><strong>This article is co-authored by Ugo Pradère and David Haüet</strong></p> <p class="wp-block-paragraph"><mdspan datatext="el1745629161913" class="mdspan-comment">How hard</mdspan> can it be to transcribe an interview? You feed the audio to an AI model, wait a few minutes, and boom: perfect transcript, right? Well&#8230; not quite.</p> <p class="wp-block-paragraph">When it comes to accurately transcribe long audio interviews, even more when the spoken language is not English, things get a lot more complicated. You need high quality transcription with reliable speaker identification, precise timestamps, and all that at an affordable price. Not so simple after all.</p> <p class="wp-block-paragraph">In this article, we take you behind the scenes of our journey to build a scalable and production-ready transcription pipeline using Google’s Vertex AI and Gemini models. From unexpected model limitations to budget evaluation and timestamp drift disasters, we’ll walk you through the real challenges, and how we solved them.</p> <p class="wp-block-paragraph">Whether you are building your own <a href="https://towardsdatascience.com/tag/audio-processing/" title="Audio Processing">Audio Processing</a> tool or just curious about what happens “under the hood” of a robust transcription system using a multimodal model, you will find practical insights, clever workarounds, and lessons learned that should be worth your time.</p> <h2 class="wp-block-heading">Context of the project and constraints</h2> <p class="wp-block-paragraph">At the beginning of 2025, we started an interview transcription project with a clear goal: to build a system capable of transcribing interviews in French, typically involving a journalist and a guest, but not restricted to this situation, and lasting from a few minutes to over an hour. The final output was expected to be just a raw transcript but had to reflect the natural spoken dialogue written in a &#8220;book-like&#8221; dialogue, ensuring both a faithful transcription of the original audio content and a good readability.</p> <p class="wp-block-paragraph">Before diving into development, we conducted a short market review of existing solutions, but the outcomes were never satisfactory: the quality was often disappointing, the pricing definitely too high for an intensive usage, and in most cases, both at once. At that point, we realized a custom pipeline would be necessary.</p> <p class="wp-block-paragraph">Because our organization is engaged in the Google ecosystem, we were required to use Google Vertex AI services. Google Vertex AI offers a variety of Speech-to-Text (S2T) models for audio transcription, including specialized ones such as “Chirp,” “Latestlong,” or “Phone call,” whose names already hint at their intended use cases. However, producing a complete transcription of an interview that combines high accuracy, speaker diarization, and precise timestamping, especially for long recordings, remains a real technical and operational challenge.</p> <h2 class="wp-block-heading">First attempts and limitations</h2> <p class="wp-block-paragraph">We initiated our project by evaluating all those models on our use case. However, after extensive testing, we came quickly to the following conclusion: no Vertex AI service fully meets the complete set of requirements and will allow us to achieve our goal in a simple and effective manner. 
There was always at least one missing specification, usually on timestamping or diarization.<br></p> <p class="wp-block-paragraph">The terrible Google documentation, this must be said, cost us a significant amount of time during this preliminary research. This prompted us to ask Google for a meeting with a Google Cloud Machine Learning Specialist to try and find a solution to our problem. After a quick video call, our discussion with the Google rep quickly confirmed our conclusions: what we aimed to achieve was not as simple as it seemed at first. The entire set of requirements could not be fulfilled by a single Google service and a custom implementation of a VertexAI S2T service had to be developed.</p> <p class="wp-block-paragraph">We presented our preliminary work and decided to continue exploring two strategies:</p> <ul class="wp-block-list"> <li class="wp-block-list-item">Use Chirp2 to generate the transcription and timestamping of long audio files, then use <a href="https://towardsdatascience.com/tag/gemini/" title="Gemini">Gemini</a> for diarization.</li> <li class="wp-block-list-item">Use Gemini 2.0 Flash for transcription and diarization, although the timestamping is approximate and the token output length requires looping.</li> </ul> <p class="wp-block-paragraph">In parallel of these investigations, we also had to consider the financial aspect. The tool would be used for hundreds of hours of transcription per month. Unlike text, which is generally cheap enough not to have to think about it, audio can be quite costly. We therefore included this parameter from the beginning of our exploration to avoid ending up with a solution that worked but was too expensive to be exploited in production.</p> <h2 class="wp-block-heading">Deep dive into transcription with Chirp2</h2> <p class="wp-block-paragraph">We began with a deeper investigation of the Chirp2 model since it is considered as the “best in class” Google S2T service. A straightforward application of the documentation provided the expected result. The model turned out to be quite effective, offering good transcription with word-by-word timestamping according to the following output in json format:</p> <pre class="wp-block-prismatic-blocks"><code class="language-json">&quot;transcript&quot;:&quot;Oui, en effet&quot;, &quot;confidence&quot;:0.7891818284988403 &quot;words&quot;:[ { &quot;word&quot;:&quot;Oui&quot;, &quot;start-offset&quot;:{ &quot;seconds&quot;:3.68 }, &quot;end-offset&quot;:{ &quot;seconds&quot;:3.84 }, &quot;confidence&quot;:0.5692862272262573 } { &quot;word&quot;:&quot;en&quot;, &quot;start-offset&quot;:{ &quot;seconds&quot;:3.84 }, &quot;end-offset&quot;:{ &quot;seconds&quot;:4.0 }, &quot;confidence&quot;:0.758037805557251 }, { &quot;word&quot;:&quot;effet&quot;, &quot;start-offset&quot;:{ &quot;seconds&quot;:4.0 }, &quot;end-offset&quot;:{ &quot;seconds&quot;:4.64 }, &quot;confidence&quot;:0.8176857233047485 }, ]</code></pre> <p class="wp-block-paragraph">However, a new requirement came along the project added by the operational team: the transcription must be as faithful as possible to the original audio content and include small filler words, interjections, onomatopoeia or even mumbling that can add meaning to a conversation, and typically come from the non-speaking participant either at the same time or toward the end of a sentence of the speaking one. 
We&#8217;re talking about words like &#8220;oui oui,&#8221; &#8220;en effet” but also simple expressions like (hmm, ah, etc.), so typical of the French language! It&#8217;s actually not uncommon to validate or, more rarely, oppose someone point with a simple &#8220;Hmm Hmm&#8221;. Upon analyzing Chirp with transcription, we noticed that while some of these small words were present, a number of those expressions were missing. First downside for Chirp2.</p> <p class="wp-block-paragraph">The main challenge in this approach lies in the reconstruction of the speaker sentences while performing diarization. We quickly abandoned the idea of giving Gemini the context of the interview and the transcription text, and asking it to determine who said what. This method could easily result in incorrect diarization. We instead explored sending the interview context, the audio file, and the transcription content in a compact format, instructing Gemini to only perform diarization and sentence reconstruction without re-transcribing the audio file. We requested a TSV format, an ideal structured format for transcription: &#8220;human readable&#8221; for fast quality checking, easy to process algorithmically, and lightweight. Its structure is as follows:</p> <p class="wp-block-paragraph">First line with speaker presentation:</p> <p class="wp-block-paragraph"><em>Diarization Speaker_1:speaker_name\Speaker_2:speaker_name\Speaker_3:speaker_name\Speaker_4:speaker_name, etc.</em></p> <p class="wp-block-paragraph">Then the transcription in the following format:&nbsp;</p> <p class="wp-block-paragraph"><em>speaker_id\ttime_start\ttime_stop\text</em><em> with:</em></p> <ul class="wp-block-list"> <li class="wp-block-list-item"><strong><em>speaker:</em></strong><em> Numeric speaker ID (e.g., 1, 2, etc.)</em></li> <li class="wp-block-list-item"><strong><em>time_start:</em></strong><em> Segment start time in the format 00:00:00</em></li> <li class="wp-block-list-item"><strong><em>time_stop:</em></strong><em> Segment end time in the format 00:00:00</em></li> <li class="wp-block-list-item"><strong><em>text:</em></strong><em> Transcribed text of the dialogue segment</em></li> </ul> <p class="wp-block-paragraph">An example output:&nbsp;</p> <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"> <p class="wp-block-paragraph"><em>Diarization Speaker_1:Lea Finch\Speaker_2:David Albec&nbsp;</em></p> <p class="wp-block-paragraph"><em>1</em><em> </em><em>00:00:00</em><em> </em><em>00:03:00</em><em> </em><em>Hi Andrew, how are you?&nbsp;</em></p> <p class="wp-block-paragraph"><em>2</em><em> </em><em>00:03:00</em><em> </em><em>00:03:00</em><em> </em><em>Fine thanks.&nbsp;</em></p> <p class="wp-block-paragraph"><em>1</em><em> </em><em>00:04:00</em><em> </em><em>00:07:00</em><em> </em><em>So, let’s start the interview&nbsp;</em></p> <p class="wp-block-paragraph"><em>2</em><em> </em><em>00:07:00</em><em> </em><em>00:08:00</em><em> </em><em>All right.</em></p> <p class="wp-block-paragraph">A simple version of the context provided to the LLM:</p> <p class="wp-block-paragraph"><em>Here is the interview of David Albec, professional football player, by journalist Lea Finch</em></p> </blockquote> <p class="wp-block-paragraph">The result was fairly qualitative with what appeared to be accurate diarization and sentence reconstruction. However, instead of getting the exact same text, it seemed slightly modified in several places. 
Our conclusion was that, despite our clear instructions, Gemini probably carries out more than just diarization and actually performed partial transcription.</p> <p class="wp-block-paragraph">We also evaluated at this point the cost of transcription with this methodology. Below is the approximate calculation based only on audio processing:&nbsp;</p> <p class="wp-block-paragraph">Chirp2 price /min: 0.016 usd</p> <p class="wp-block-paragraph">Gemini 2.0 flash /min: 0,001875 usd</p> <p class="wp-block-paragraph">Price /hour: 1,0725 usd</p> <p class="wp-block-paragraph">Chirp2 is indeed quite &#8220;expensive&#8221;, about ten times more than Gemini 2.0 flash at the time of writing, and still requires the audio to be processed by Gemini for diarization. We therefore decided to put this method aside for now and explore a way using the brand new multimodal Gemini 2.0 Flash alone, which had just left experimental mode.</p> <h2 class="wp-block-heading">Next: exploring audio transcription with Gemini flash 2.0</h2> <p class="wp-block-paragraph">We provided Gemini with both the interview context and the audio file requesting a structured output in a consistent format. By carefully crafting our prompt with standard LLM guidelines, we were able to specify our transcription requirements with a high degree of precision. In addition with the typical elements any prompt engineer might include, we emphasized several key instructions essential for ensuring a quality transcription (<em>comment in italic)</em>:</p> <ul class="wp-block-list"> <li class="wp-block-list-item">Transcribe interjections and onomatopoeia even when mid-sentence.</li> <li class="wp-block-list-item">Preserve the full expression of words, including slang, insults, or inappropriate language. =&gt; <em>the model tends to change words it considers inappropriate. For this specific point, we had to require Google to deactivate the safety rules on our Google Cloud Project</em>.</li> <li class="wp-block-list-item">Build complete sentences, paying particular attention to changes in speaker mid-sentence, for example when one speaker finishes another&#8217;s sentence or interrupts. =&gt; <em>Such errors affect diarization and accumulate throughout the transcript until context is strong enough for the LLM to correct</em>.</li> <li class="wp-block-list-item">Normalize prolonged words or interjections like &#8220;euuuuuh&#8221; to &#8220;euh.&#8221; and not “euh euh euh euh euh …” =&gt;<em> this was a classical bug we were encountering called “repetition bug” and is discussed in more detail below</em></li> <li class="wp-block-list-item">Identify speakers by voice tone while using context to determine who is the journalist and who is the interviewee. =&gt; <em>in addition we can pass the information of the first speaker in the prompt</em></li> </ul> <p class="wp-block-paragraph">Initial results were actually quite satisfying in terms of transcription, diarization, and sentence construction. Transcribing short test files made us feel like the project was nearly complete&#8230; until we tried longer files.&nbsp;</p> <h2 class="wp-block-heading">Dealing with Long Audio and LLM Token Limitations</h2> <p class="wp-block-paragraph">Our early tests on short audio clips were encouraging but scaling the process to longer audios quickly revealed new challenges: what initially seemed like a simple extension of our pipeline turned out to be a technical hurdle in itself. 
Processing files longer than just a few minutes revealed indeed a series of challenges related to model constraints, token limits, and output reliability:</p> <ol class="wp-block-list"> <li class="wp-block-list-item">One of the first problems we encountered with long audio was the token limit: the number of output tokens exceeded the maximum allowed (MAX_INPUT_TOKEN = 8192) forcing us to implement a looping mechanism by repeatedly calling Gemini while resending the previously generated transcript, the initial prompt, a continuation prompt, and the same audio file.</li> </ol> <p class="wp-block-paragraph">Here is an example of the continuation prompt we used:&nbsp;</p> <p class="wp-block-paragraph"><em>Continue transcribing audio interview from the previous result. Start processing the audio file from the previous generated text. Do not start from the beginning of the audio. Be careful to continue the previously generated content which is available between the following tags &lt;previous_result&gt;.</em></p> <ol start="2" class="wp-block-list"> <li class="wp-block-list-item">Using this transcription loop with large data inputs seems to significantly degrade the LLM output quality, especially for timestamping. In this configuration, timestamps can drift by over 10 minutes on an hour-long interview. If a few seconds drift was considered compatible with our intended use, a few minutes made timestamping useless.&nbsp;</li> </ol> <p class="wp-block-paragraph">Our initial test on short audios of a few minutes resulted in a maximum 5 to 10 seconds drift, and significant drift was observed generally after the first loop when max input token was reached. We conclude from these experimental observations, that while this looping technique ensures continuity in transcription fairly well, it not only leads to cumulative timestamp errors but also to a drastic loss of LLM timestamps accuracy.&nbsp;</p> <ol start="3" class="wp-block-list"> <li class="wp-block-list-item">We also encountered a recurring and particularly frustrating bug: the model would sometimes fall into a loop, repeating the same word or phrase over dozens of lines. This behavior made entire portions of the transcript unusable and often looked something like this:</li> </ol> <p class="wp-block-paragraph"><em>1 00:00:00 00:03:00 Hi Andrew, how are you?&nbsp;</em></p> <p class="wp-block-paragraph"><em>2 00:03:00 00:03:00 Fine thanks.</em></p> <p class="wp-block-paragraph"><em>2 00:03:00 00:03:00 Fine thanks</em></p> <p class="wp-block-paragraph"><em>2 00:03:00 00:03:00 Fine thanks</em></p> <p class="wp-block-paragraph"><em>2 00:03:00 00:03:00 Fine thanks.&nbsp;</em></p> <p class="wp-block-paragraph"><em>2 00:03:00 00:03:00 Fine thanks</em></p> <p class="wp-block-paragraph"><em>2 00:03:00 00:03:00 Fine thanks.&nbsp;</em></p> <p class="wp-block-paragraph"><em>etc.</em></p> <p class="wp-block-paragraph">This bug seems erratic but appears more frequently with medium-quality audio with strong background noise, far away speaker for example. And &#8220;on the field&#8221;, this is often the case.. Likewise, speaker hesitations or word repetitions seem to trigger it. We still don&#8217;t know exactly what causes this “repetition bug”. Google Vertex team is aware of it but hasn’t provided a clear explanation.&nbsp;</p> <p class="wp-block-paragraph">The consequences of this bug were especially limiting: once it occurred, the only viable solution was to restart the transcription from scratch. 
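</p> <p class="wp-block-paragraph">For illustration, a very simple guard against this failure mode could scan the output for long runs of identical transcript lines and flag the chunk for a retry. The snippet below is a hypothetical sketch of such a check, not necessarily the mechanism described later in this article.</p> <pre class="wp-block-prismatic-blocks"><code class="language-python">
# Hypothetical guard: spot the "repetition bug" by looking for runs of identical
# transcript lines, so the affected chunk can be re-transcribed. Threshold is arbitrary.
def has_repetition_bug(transcript_tsv, max_run=5):
    run, previous = 1, None
    for line in transcript_tsv.strip().splitlines():
        text = line.split("\t")[-1].strip().lower()  # compare only the spoken text
        run = run + 1 if text == previous else 1
        if run == max_run:
            return True
        previous = text
    return False

sample = "\n".join(["2\t00:03:00\t00:03:00\tFine thanks."] * 6)
print(has_repetition_bug(sample))  # True -&gt; restart this chunk's transcription
</code></pre>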
<p class="wp-block-paragraph">Unsurprisingly, the longer the audio file, the higher the probability of encountering the issue. In our tests, it affected roughly one out of every three runs on recordings longer than an hour, making it extremely difficult to deliver a reliable, production-quality service under such conditions.</p> <ol start="4" class="wp-block-list"> <li class="wp-block-list-item">To make it worse, resuming transcription after a Max_token &#8220;cutoff&#8221; was reached required resending the entire audio file each time. Although we only needed the next segment, the LLM would still process the full file again (without outputting the transcription), meaning we were billed for the full audio length on every resend.</li> </ol> <p class="wp-block-paragraph">In practice, we found that the token limit was typically reached between the 15th and 20th minute of the audio. As a result, transcribing a one-hour interview often required&nbsp;4 to 5 separate LLM calls, leading to a total billing equivalent of 4 to 5 hours of audio for a single file.&nbsp;</p> <p class="wp-block-paragraph">With this process, the cost of audio transcription does not scale linearly. While a 15-minute audio file is billed as 15 minutes in a single LLM call, a 1-hour file could effectively cost 4 hours, and a 2-hour file could increase to 16 hours, following a near-quadratic pattern (≈ 4x², where x = number of hours, since each of the roughly 4x calls re-processes the full x hours of audio).<br>This made long audio processing not just unreliable, but also expensive.</p> <h2 class="wp-block-heading">Pivoting to Chunked Audio Transcription</h2> <p class="wp-block-paragraph">Given these major limitations, and being much more confident in the ability of the LLM to handle text-based tasks than audio, we decided to shift our approach and isolate the audio transcription step to keep its quality high. A quality transcription is, after all, the key requirement, so it makes sense to put this part of the process at the core of the strategy.</p> <p class="wp-block-paragraph">At this point, splitting audio into chunks became the ideal solution. Not only did it seem likely to greatly improve timestamp accuracy, by avoiding the timestamping degradation and cumulative drift caused by looping, but it would also reduce cost, since each chunk would ideally be processed only once. While it introduced new uncertainties around merging partial transcriptions, the tradeoff seemed to our advantage.</p> <p class="wp-block-paragraph">We thus focused on breaking long audio into shorter chunks that could each be handled in a single LLM transcription request. During our tests, we observed that issues like repetition loops or timestamp drift typically began around the 18-minute mark in most interviews. It became clear that we should use 15-minute (or shorter) chunks for safety. Why not use 5-minute chunks? The quality improvement looked minimal to us while tripling the number of segments. In addition, shorter chunks reduce the overall context, which could hurt diarization.</p> <p class="wp-block-paragraph">Although this setup drastically reduced the repetition bug, we observed that it still occurred occasionally. Wanting to provide the best service possible, we looked for an efficient countermeasure and identified an opportunity in our previously annoying max_input_token: with 10-minute chunks, we could be confident that the token limit would not be exceeded in nearly all cases. Thus, if the token limit was hit, we knew for sure that the repetition bug had occurred and could restart that chunk&#8217;s transcription.</p>
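<p class="wp-block-paragraph">A minimal sketch of what this chunked loop with its token-limit guard might look like is shown below. The helper <code>transcribe_chunk</code> is a placeholder for the actual Gemini call on one audio chunk (stubbed out here so the example is self-contained), and the retry logic is deliberately simplified.</p> <pre class="wp-block-prismatic-blocks"><code class="language-python">
# Sketch of the chunked transcription loop with overlapping windows and a
# "token limit hit =&gt; repetition bug =&gt; retry this chunk" guard.
CHUNK_SEC, OVERLAP_SEC, MAX_RETRIES = 10 * 60, 30, 3

def chunk_windows(duration_sec, chunk_sec=CHUNK_SEC, overlap_sec=OVERLAP_SEC):
    """Return (start, end) windows in seconds: 0-10 min, 9m30s-19m30s, 19m-29m, ..."""
    starts = range(0, duration_sec, chunk_sec - overlap_sec)
    return [(s, min(s + chunk_sec, duration_sec)) for s in starts]

def transcribe_chunk(audio_path, start, end):
    """Placeholder for the real Gemini / Vertex AI call on audio_path between start and end.
    Assumed to return (tsv_text, hit_token_limit)."""
    return f"1\t00:00:00\t00:00:05\t[transcript of {start}-{end}s of {audio_path}]", False

def transcribe_audio(audio_path, duration_sec):
    transcripts = []
    for start, end in chunk_windows(duration_sec):
        for attempt in range(MAX_RETRIES):
            text, hit_token_limit = transcribe_chunk(audio_path, start, end)
            if not hit_token_limit:  # normal case for a 10-minute chunk
                transcripts.append((start, end, text))
                break
            # Hitting the limit on such a short chunk almost certainly means the
            # repetition bug: discard this output and retry the same chunk.
        else:
            raise RuntimeError(f"chunk {start}-{end}s failed {MAX_RETRIES} times")
    return transcripts

print(transcribe_audio("interview.mp3", duration_sec=60 * 60)[:2])
</code></pre>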
<h2 class="wp-block-heading">Correcting the audio chunk transcriptions</h2> <p class="wp-block-paragraph">With good transcripts of the 10-minute audio chunks in hand, we implemented an algorithmic post-processing step for each transcript to address minor issues:</p> <ul class="wp-block-list"> <li class="wp-block-list-item">Removal of header tags like tsv or json added at the start and end of the transcription content:&nbsp;</li> </ul> <p class="wp-block-paragraph">Despite optimizing the prompt, we couldn’t fully eliminate this side effect without hurting transcription quality. Since it is easily handled algorithmically, we chose to do so.</p> <ul class="wp-block-list"> <li class="wp-block-list-item">Replacing speaker IDs with names:</li> </ul> <p class="wp-block-paragraph">Speaker identification by name only begins once the LLM has enough context to determine who is the journalist and who is being interviewed. This results in incomplete diarization at the beginning of the transcript, with early segments using numeric IDs (first speaker in the chunk = 1, etc.). Moreover, since each chunk may assign IDs in a different order (the first person to talk being speaker 1), this would create confusion during merging. We therefore instructed the LLM to use only IDs during the transcription step and to provide a diarization mapping in the first line. The speaker IDs are then replaced during the algorithmic correction and the diarization header line is removed.</p> <ul class="wp-block-list"> <li class="wp-block-list-item">Rarely, malformed or empty transcript lines are encountered. These lines are deleted, but we flag them with a note to the user, &#8220;formatting issue on this line&#8221;, so users are at least aware of a potential content loss and can correct it manually if needed. In our final optimized version, such lines were extremely rare.</li> </ul>
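<p class="wp-block-paragraph">A minimal sketch of this post-processing step is shown below. The exact line format (tab-separated index, start, end, speaker and text, with the diarization mapping already parsed from the first line) is an assumption made for illustration; the real pipeline uses whatever format the transcription prompt requests.</p> <pre class="wp-block-prismatic-blocks"><code class="language-python">def clean_chunk_transcript(raw_output, speaker_names):
    # speaker_names: mapping parsed from the diarization line the LLM returns first,
    # e.g. {"1": "Journalist", "2": "Andrew"} (assumed format for this sketch);
    # that first mapping line is assumed to have been removed beforehand.
    cleaned = []
    for line in raw_output.splitlines():
        line = line.strip()
        if not line or line.startswith("```"):
            continue  # drop ```tsv / ```json header and footer tags
        fields = line.split("\t")
        if len(fields) != 5:
            # malformed line: drop it but leave a note for the user
            cleaned.append("[formatting issue on this line]")
            continue
        idx, start, end, speaker, text = fields
        speaker = speaker_names.get(speaker, speaker)  # replace numeric ID with name
        cleaned.append("\t".join([idx, start, end, speaker, text]))
    return "\n".join(cleaned)</code></pre>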
<h2 class="wp-block-heading">Merging chunks and maintaining content continuity</h2> <p class="wp-block-paragraph">At the audio chunking stage, we initially tried to make chunks with clean cuts. Unsurprisingly, this led to the loss of words or even full sentences at the cut points. So we naturally switched to overlapping chunk cuts to avoid such content loss, leaving the optimization of the overlap size to the chunk merging process.&nbsp;</p> <p class="wp-block-paragraph">Without a clean cut between chunks, the option of merging the chunks algorithmically disappeared. For the same audio input, the transcript lines can come out quite differently, with breaks at different points in the sentences, or with filler words and hesitations rendered differently. In such a situation, it is complex, not to say impossible, to build an effective algorithm for a clean merge.&nbsp;</p> <p class="wp-block-paragraph">This left us with the LLM option, of course. A few quick tests confirmed that the LLM merges segments better when the overlaps contain full sentences. A 30-second overlap proved sufficient. With a 10-minute chunk structure, this implies the following chunk cuts:&nbsp;</p> <ul class="wp-block-list"> <li class="wp-block-list-item">1st transcript: 0 to 10 minutes&nbsp;</li> <li class="wp-block-list-item">2nd transcript: 9m30s to 19m30s&nbsp;</li> <li class="wp-block-list-item">3rd transcript: 19m to 29m &#8230;and so on.</li> </ul> <figure class="wp-block-image size-full"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/overlap.png" alt="" class="wp-image-602341"/><figcaption class="wp-element-caption">Image by the authors</figcaption></figure>
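<p class="wp-block-paragraph">The chunking itself is straightforward to implement. Here is a minimal sketch, assuming pydub and MP3 files (both assumptions made for illustration, not a description of our exact implementation):</p> <pre class="wp-block-prismatic-blocks"><code class="language-python">from pydub import AudioSegment

CHUNK_MS = 10 * 60 * 1000        # 10-minute chunks
OVERLAP_MS = 30 * 1000           # 30-second overlap between consecutive chunks
STEP_MS = CHUNK_MS - OVERLAP_MS  # each chunk starts 9m30s after the previous one

def chunk_audio(path):
    audio = AudioSegment.from_file(path)
    chunk_paths = []
    start = 0
    while start &lt; len(audio):
        chunk = audio[start:start + CHUNK_MS]  # pydub slices by milliseconds
        out_path = f"{path}.chunk{len(chunk_paths):02d}.mp3"
        chunk.export(out_path, format="mp3")
        chunk_paths.append(out_path)
        start += STEP_MS
    return chunk_paths</code></pre>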
<p class="wp-block-paragraph">These overlapping chunk transcripts were corrected by the algorithm described above and then sent to the LLM for merging, in order to reconstruct the full audio transcript. The idea was to send the full set of chunk transcripts with a prompt instructing the LLM to merge them and return the complete merged transcript in the same tsv format as the transcription step. In this configuration, the merging process has three main quality criteria:</p> <ol class="wp-block-list"> <li class="wp-block-list-item">Ensure transcription continuity without content loss or duplication.</li> <li class="wp-block-list-item">Adjust timestamps to resume from where the previous chunk ended.</li> <li class="wp-block-list-item">Preserve diarization.</li> </ol> <p class="wp-block-paragraph">As expected, max_input_token was exceeded, forcing us into an LLM call loop. However, since we were now using text input, we were more confident in the reliability of the LLM… probably too much. The result of the merge was satisfactory in most cases but prone to several issues: tag insertions, multi-line entries merged into one line, incomplete lines, and even hallucinated continuations of the interview. Despite many prompt optimizations, we couldn’t achieve sufficiently reliable results for production use.&nbsp;</p> <p class="wp-block-paragraph">As with audio transcription, we identified the amount of input information as the main issue. We were sending several hundred or even thousands of text lines: the set of partial transcripts to fuse, a roughly similar amount for the previously merged transcript, plus the prompt and its example. Definitely too much for a precise application of our set of instructions.</p> <p class="wp-block-paragraph">On the plus side, timestamp accuracy did improve significantly with this chunking approach: we maintained a drift of just 5 to 10 seconds at most on transcriptions of over an hour. Since the start of a transcript should have minimal drift in timestamping, we instructed the LLM to use the timestamps of the “ending chunk” as the reference for the fusion and to correct any drift by a second per sentence. This made the cut points seamless and kept the overall timestamp accuracy.</p> <h2 class="wp-block-heading">Splitting the chunk transcripts for full transcript reconstruction</h2> <p class="wp-block-paragraph">In a modular approach similar to the workaround we used for transcription, we decided to carry out the merges individually, in order to avoid the issues described above. To do so, each 10-minute transcript is split into three parts based on the start_time of its segments:</p> <ul class="wp-block-list"> <li class="wp-block-list-item">Overlap segment to merge at the beginning: 0 to 1 minute</li> <li class="wp-block-list-item">Main segment to paste: 1 to 9 minutes</li> <li class="wp-block-list-item">Overlap segment to merge at the end: 9 to 10 minutes</li> </ul> <p class="wp-block-paragraph"><em>NB: Since each chunk, including the first and last ones, is processed the same way, the overlap at the beginning of the first chunk is directly merged with its main segment, and the overlap at the end of the last chunk (if there is one) is merged accordingly.</em></p> <p class="wp-block-paragraph">The beginning and end segments are then sent in pairs to be merged. As expected, the quality of the output increased drastically, resulting in an efficient and reliable merge between the transcript chunks. With this procedure, the LLM’s responses proved highly reliable and showed none of the errors previously encountered during the looping process.</p> <p class="wp-block-paragraph"><em>The process of transcript assembly for an audio of 28 minutes 42 seconds:&nbsp;</em></p> <figure class="wp-block-image size-full"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/28min-file.png" alt="" class="wp-image-602342"/><figcaption class="wp-element-caption">Image by the authors</figcaption></figure> <h2 class="wp-block-heading">Full transcript reconstruction</h2> <p class="wp-block-paragraph">At this final stage, the only remaining task was to reconstruct the complete transcript from the processed splits. To achieve this, we algorithmically combined the main content segments with their corresponding merged overlaps, alternating between the two.</p>
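<p class="wp-block-paragraph">The split-and-reassemble logic can be sketched as follows. The segment representation (dictionaries with a start time in seconds, relative to the chunk) and the <em>merge_with_llm</em> helper, which sends a pair of overlapping splits to Gemini and returns the fused segments, are assumptions made for illustration; timestamp re-alignment is left out for brevity.</p> <pre class="wp-block-prismatic-blocks"><code class="language-python">CHUNK_SEC = 10 * 60   # length of one chunk transcript
EDGE_SEC = 60         # 1-minute overlap splits kept at each end

def split_chunk(segments):
    start = [s for s in segments if s["start"] &lt; EDGE_SEC]
    main = [s for s in segments if EDGE_SEC &lt;= s["start"] &lt; CHUNK_SEC - EDGE_SEC]
    end = [s for s in segments if s["start"] &gt;= CHUNK_SEC - EDGE_SEC]
    return start, main, end

def rebuild_full_transcript(chunk_transcripts):
    # chunk_transcripts: one list of corrected segments per 10-minute chunk
    splits = [split_chunk(c) for c in chunk_transcripts]
    # first chunk: its beginning overlap is pasted directly with its main segment
    full = splits[0][0] + splits[0][1]
    for prev, cur in zip(splits, splits[1:]):
        full += merge_with_llm(prev[2], cur[0])  # fuse end of previous with start of next (LLM step)
        full += cur[1]                           # then paste the next main segment
    full += splits[-1][2]                        # trailing overlap of the last chunk, if any
    return full</code></pre>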
<h2 class="wp-block-heading">Overall process overview</h2> <p class="wp-block-paragraph">The overall process involves 6 steps, 2 of which are carried out by Gemini:</p> <ol class="wp-block-list"> <li class="wp-block-list-item">Chunking the audio into overlapping audio chunks</li> <li class="wp-block-list-item">Transcribing each chunk into a partial text transcript (LLM step)</li> <li class="wp-block-list-item">Correcting the partial transcripts</li> <li class="wp-block-list-item">Splitting the chunk transcripts into start, main, and end text splits</li> <li class="wp-block-list-item">Fusing the end and start splits of each pair of consecutive chunks (LLM step)</li> <li class="wp-block-list-item">Reconstructing the full transcript</li> </ol> <figure class="wp-block-image size-full"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/overall-process.png" alt="" class="wp-image-602343"/><figcaption class="wp-element-caption">Image by the authors</figcaption></figure> <p class="wp-block-paragraph">The overall process takes about 5 minutes per hour of transcription delivered to the user in an asynchronous tool. Quite reasonable considering the amount of work executed behind the scenes, and all for a fraction of the price of other tools or pre-built Google models like Chirp2.</p> <p class="wp-block-paragraph">One additional improvement that we considered, but ultimately decided not to implement, was timestamp correction. We observed that timestamps at the end of each chunk typically ran about five seconds ahead of the actual audio. A straightforward solution would have been to incrementally adjust the timestamps algorithmically, by approximately one second every two minutes, to correct most of this drift. However, we chose not to implement this adjustment, as the minor discrepancy was acceptable for our business needs.</p> <h2 class="wp-block-heading">Conclusion</h2> <p class="wp-block-paragraph">Building a high-quality, scalable transcription pipeline for long interviews turned out to be much more complex than simply choosing the “right” Speech-to-Text model. Our journey with Google’s Vertex AI and Gemini models highlighted key challenges around diarization, timestamping, cost-efficiency, and long audio handling, especially when aiming to extract the full information from an audio file.</p> <p class="wp-block-paragraph">Using careful prompt engineering, smart audio chunking strategies, and iterative refinements, we were able to build a robust system that balances accuracy, performance, and operational cost, turning an initially fragmented process into a smooth, production-ready pipeline.</p> <p class="wp-block-paragraph">There’s still room for improvement, but this workflow now forms a solid foundation for scalable, high-fidelity audio transcription. As LLMs continue to evolve and APIs become more flexible, we’re optimistic about even more streamlined solutions in the near future.</p> <h2 class="wp-block-heading">Key takeaways</h2> <ul class="wp-block-list"> <li class="wp-block-list-item"><strong>No Vertex AI S2T model met all our needs: </strong>Google Vertex AI provides specialized models, but each one has limitations in terms of transcription accuracy, diarization, or timestamping for long audio.</li> <li class="wp-block-list-item"><strong>Token limits and long prompts drastically influence transcription quality: </strong>Gemini’s output token limit significantly degrades transcription quality for long audio, requiring heavily prompted looping strategies and ultimately forcing us to shift to shorter audio chunks.</li> <li class="wp-block-list-item"><strong>Chunked audio transcription and transcript reconstruction significantly improve quality and cost-efficiency:<br></strong>Splitting audio into 10-minute overlapping segments minimized critical bugs like repeated sentences and timestamp drift, enabling higher-quality results at a drastically reduced cost.</li> <li class="wp-block-list-item"><strong>Careful prompt engineering remains essential:</strong> Precision in the prompts, especially regarding diarization and interjections for transcription, as well as for transcript fusion, proved crucial for reliable LLM performance.</li> <li class="wp-block-list-item"><strong>Short transcript fusions maximize reliability:<br></strong>Splitting each chunk transcript into smaller segments, and merging overlaps end to start, provided high accuracy and avoided common LLM issues like hallucinations or incorrect formatting.<br></li> </ul> <p>The post <a href="https://towardsdatascience.com/building-a-scalable-and-accurate-audio-interview-transcription-pipeline-with-google-gemini/">Building a Scalable and Accurate Audio Interview Transcription Pipeline with Google Gemini</a> appeared first on <a href="https://towardsdatascience.com">Towards Data Science</a>.</p>
  4. How to Level Up Your Technical Skills in This AI Era

    Tue, 29 Apr 2025 19:53:46 -0000

    AI tools & vibe coding have hidden tradeoffs. Here's how to use them wisely and why open source is a great secret weapon.

    The post How to Level Up Your Technical Skills in This AI Era appeared first on Towards Data Science.

    <p class="wp-block-paragraph"><mdspan datatext="el1745953580269" class="mdspan-comment">AI-assisted</mdspan> coding is here to stay. Tools like <a href="https://towardsdatascience.com/tag/cursor/" title="Cursor">Cursor</a>, V0, and Lovable have dramatically lowered the barrier to entry — building dashboards, pipelines, or entire apps can now be done in a fraction of the time.</p> <p class="wp-block-paragraph">I use these tools daily, and they’ve definitely made me faster. But as the codebase gets more complex, the tradeoffs become clear: cryptic bugs, tangled logic, and hours lost debugging code I didn’t truly understand.</p> <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"> <p class="wp-block-paragraph"><strong><em>AI tools are great — especially for beginners — but they come with a quiet cost. The more you let AI carry the load, the fewer chances you have to sharpen your instincts that come from wrestling with complexity.&nbsp;</em></strong></p> <p class="wp-block-paragraph"><strong><em>Yes, AI will speed up your workflow, but you’ll also skip the formative steps where technical wisdom is earned.</em></strong></p> </blockquote> <p class="wp-block-paragraph">“Vibe coding” — quickly cobbling together code with minimal planning — is great for demos or experiments. But for deeper technical growth or building systems with meaningful complexity, vibe coding isn’t enough. This trending Reddit post sums it up perfectly: left unchecked, vibe coding creates more problems than it solves.</p> <figure class="wp-block-embed is-type-rich is-provider-twitter wp-block-embed-twitter" datatext="el1745953299332"><div class="wp-block-embed__wrapper"> <blockquote class="twitter-tweet" data-width="500" data-dnt="true"><p lang="en" dir="ltr">vibe coding, where 2 engineers can now create the tech debt of at least 50 engineers</p>&mdash; I Am Devloper (@iamdevloper) <a href="https://twitter.com/iamdevloper/status/1902628884278894941?ref_src=twsrc%5Etfw">March 20, 2025</a></blockquote><script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> </div></figure> <p class="wp-block-paragraph">In this piece, I’ll show you how to use AI-assisted tools more wisely — and why contributing to <a href="https://towardsdatascience.com/tag/open-source/" title="Open Source">Open Source</a> might be the most underrated way to truly level up your technical skills.</p> <h2 class="wp-block-heading">My experience vibe coding with Cursor</h2> <p class="wp-block-paragraph">Like many developers, I switched from VS Code (with GitHub Copilot) to Cursor and am currently subscribed to Cursor&#8217;s Pro plan ($20/month).</p> <p class="wp-block-paragraph">The feature I rely on most is Cursor&#8217;s integrated AI chat, which lets me directly interact with my <strong><em>entire </em></strong>codebase. Its agent can quickly grep through multiple files and even handle images - extremely useful when navigating large, unfamiliar repos. It also spots linter errors and auto-corrects them while directly editing files.</p> <p class="wp-block-paragraph">Initially, Cursor dramatically boosted my productivity, especially for simpler tasks. It felt powerful, almost magical. But as things got complex, I noticed some cracks. 
Cursor would sometimes generate spaghetti code, mix up similarly named files across directories, and occasionally struggle to follow intricate logic flows.</p> <p class="wp-block-paragraph">Vibe coding can get you thousands of lines of code in minutes — but without a strong mental model of what you&#8217;re building, you risk ending up with bloated, over-engineered systems.</p> <p class="wp-block-paragraph">Cursor does a decent job narrowing down the search space when debugging. But letting it make unchecked edits can introduce more bugs than it fixes.</p> <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"> <p class="wp-block-paragraph"><strong><em>Beyond the usual advice to &#8220;write better prompts,&#8221; one strategy I&#8217;ve found especially helpful is telling Cursor NOT to make direct edits. (It&#8217;s surprisingly obedient about this!)</em></strong></p> </blockquote> <p class="wp-block-paragraph">Instead, I explicitly ask it to suggest changes first in the chat interface. Then I review each suggestion, decide which edits make sense, and apply them selectively — either manually or through Cursor. Compared with ChatGPT, Cursor’s biggest strength is its contextual awareness of the entire codebase and its ability to parse through lengthy files (over 5k lines of code) by processing them in manageable chunks.</p> <h2 class="wp-block-heading">Contributing to open source</h2> <p class="wp-block-paragraph">So, how do you get technically stronger? Two ways stand out: side projects and open source contributions.&nbsp;</p> <p class="wp-block-paragraph">Side projects are great for exploring new tech or diving deep into something you’re passionate or curious about. Wondering how AI agents work, or curious about MCP? Just building a simple weekend project teaches you far more than hours of tutorials or documentation. Thanks to open source, tools and resources are freely accessible, leveling the playing field for everyone.</p> <p class="wp-block-paragraph">But solo projects have downsides. It’s easy to lose motivation — many of my own side projects never saw the light of day.&nbsp;</p> <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"> <p class="wp-block-paragraph"><strong><em>Plus, you can find yourself in an echo chamber: your code works, but you’re not sure if it’s following best practices or industry standards. If you’re early in your career and lack mentorship, how do you know if you’re even on the right track?</em></strong></p> </blockquote> <p class="wp-block-paragraph">This is exactly where open source fills the gap. Open source projects aren’t just for coding wizards; they’re for everyone. Your favourite libraries like Pandas, Matplotlib, TensorFlow, and Keras rely heavily on community involvement.</p> <h4 class="wp-block-heading">Why bother contributing?</h4> <p class="wp-block-paragraph">Open source lets you make a real impact on tools used by thousands of developers — not just toy projects nobody sees. You’ll become proficient with version control (hello, GitHub!), sharpen your skills navigating complex codebases, pick up best practices, and build a network of people who can vouch for you when it matters.</p> <p class="wp-block-paragraph">There are career benefits too. It’ll add to your portfolio and personal brand, and you’ll ramp up faster when joining new teams.&nbsp;</p> <p class="wp-block-paragraph">But contribute for the right reasons. <strong><em>If your only motivation is landing a job, DON’T contribute! 
</em></strong>Open source is not a ticket to get a job — it requires genuine interest and commitment. It shows you have a passion for building, and for many startups that grew out of open source projects, that’s how they find their first hires.</p> <h4 class="wp-block-heading">Picking an open source project that you care about</h4> <p class="wp-block-paragraph">Starting out can seem daunting. Many popular repos have enormous codebases, potentially outdated documentation, or hundreds of unclear issues. So how do you pick?</p> <p class="wp-block-paragraph">First up, pick a project you<strong> genuinely care about. </strong>This might sound obvious, but it’s crucial — and underrated.&nbsp;</p> <p class="wp-block-paragraph">Choose something you <strong><em>actually use, </em></strong>whether at work or in a side project. Jumping into an unfamiliar project with unfamiliar tech is simply overwhelming, and you’ll lose motivation fast.&nbsp;</p> <p class="wp-block-paragraph">Personally, I&#8217;m both a user and a big fan of PostHog — the product analytics platform built specifically for developers — so I started contributing there. Their docs were comprehensive and well-structured, which made it an awesome place to start. (And no, they didn&#8217;t pay me to say this!)</p> <h4 class="wp-block-heading">What to contribute?</h4> <p class="wp-block-paragraph">There are a <strong><em>ton</em></strong> of things you can do. Here’s an approach that I found helpful.&nbsp;</p> <ol class="wp-block-list"> <li class="wp-block-list-item">Find a feature you need or improve something you use.<br>Narrowing your contributions down to features you genuinely care about gives clarity and motivation. The best code comes from solving problems you personally face.</li> <li class="wp-block-list-item">Set up your local environment.<br>Fork the project, clone it locally, and get it running. Understand where the logs are and how to test changes. Get a grasp of the project’s basic structure and coding style.</li> <li class="wp-block-list-item">Start small and learn by doing.<br>Many repos tag beginner-friendly issues (like “good-first-issue”). Pick these to start. Understand and replicate the bug; don’t hesitate to comment if you’re stuck. When you open a PR, ensure your changes pass all linting and tests.</li> </ol> <p class="wp-block-paragraph">Learning to navigate the codebase is essential. You don’t need to read every line — that’s practically impossible. After grasping the high-level structure, dive in. Start small to get comfortable with the build, deployment, and PR review process. Write clear commit messages and PR descriptions. Check recently merged PRs to see successful examples or insightful discussions.</p> <h2 class="wp-block-heading">Wrapping up</h2> <p class="wp-block-paragraph">Contributing to open source takes patience — popular repos are huge, and learning takes time. Becoming a consistent, valuable contributor takes at least a few months, so don’t get discouraged by initial setbacks. If your PR is rejected or you get stuck on a tricky bug, that’s perfectly normal — it’s all part of the learning process.&nbsp;</p> <p class="wp-block-paragraph">If you’re new to open source and want to chat, feel free to connect. While I didn’t dive deeply into technical details here (a quick Google or ChatGPT search can guide you there), I hope this gives you the big-picture perspective to get started. 
Open source has been rewarding for me — and I hope it will be for you too.</p> <p class="wp-block-paragraph">See you in the next article <img src="https://s.w.org/images/core/emoji/15.0.3/72x72/1f642.png" alt="🙂" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p> <p>The post <a href="https://towardsdatascience.com/how-to-level-up-your-technical-skills-in-this-ai-era/">How to Level Up Your Technical Skills in This AI Era</a> appeared first on <a href="https://towardsdatascience.com">Towards Data Science</a>.</p>
  5. AI Agents for a More Sustainable World

    Tue, 29 Apr 2025 18:45:08 -0000

    How AI agents can help companies measure, optimise and accelerate their sustainability initiatives.

    The post AI Agents for a More Sustainable World appeared first on Towards Data Science.

    <p class="wp-block-paragraph"><mdspan datatext="el1745952186363" class="mdspan-comment">As political support</mdspan> for sustainability weakens, the need for long-term sustainable practices has never been more critical.</p> <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"> <p class="wp-block-paragraph">How can we use analytics, boosted by agentic AI, to support companies in their green transformation?</p> </blockquote> <p class="wp-block-paragraph">For years, the focus of my blog was always on using Supply Chain Analytics methodologies and tools to solve specific problems.</p> <figure class="wp-block-image size-full" datatext=""><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/image-160.png" alt="" class="wp-image-602472"/><figcaption class="wp-element-caption"><a href="https://towardsdatascience.com/what-is-supply-chain-analytics-42f1b2df4a2/" target="_blank" rel="noreferrer noopener">Four Types of Supply Chain Analytics</a> &#8211; (Image by Samir Saci)</figcaption></figure> <p class="wp-block-paragraph">At <a href="https://www.logi-green.com/">Logi</a><a href="https://www.logi-green.com/" target="_blank" rel="noreferrer noopener">G</a><a href="https://www.logi-green.com/">reen</a>, the startup I founded, we deploy these analytics solutions to help retailers, manufacturers, and logistics companies meet their sustainability targets.</p> <p class="wp-block-paragraph">In this article, I will demonstrate how we can supercharge these existing solutions with AI agents.</p> <p class="wp-block-paragraph">The objective is to make it easier and faster for companies to implement <a href="https://towardsdatascience.com/tag/sustainability/" title="Sustainability">Sustainability</a> initiatives across their supply chains.</p> <h2 class="wp-block-heading">Obstacles for Green Transformations of Companies</h2> <p class="wp-block-paragraph">As political and financial pressures shift focus away from sustainability, making the green transformation easier and more accessible has never been more urgent.</p> <p class="wp-block-paragraph">Last week, I attended the global <strong>ChangeNOW </strong>conference, held in my hometown, Paris.</p> <figure class="wp-block-image size-large"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/image-159-1024x576.png" alt="" class="wp-image-602471"/><figcaption class="wp-element-caption">ChangeNOW in Grand Palais of Paris &#8211; (Image by Samir Saci)</figcaption></figure> <p class="wp-block-paragraph">This conference brought together innovators, entrepreneurs and decision-makers committed to building a better future, despite the challenging context.<br><br>It was an excellent opportunity to meet some of my readers and connect with leaders driving change across industries.</p> <p class="wp-block-paragraph">Through these discussions, one clear message emerged.</p> <p class="wp-block-paragraph">Companies face three main obstacles when driving sustainable transformation:</p> <ul class="wp-block-list"> <li class="wp-block-list-item">A lack of visibility on operational processes,</li> <li class="wp-block-list-item">The complexity of sustainability reporting requirements,</li> <li class="wp-block-list-item">The challenge of designing and implementing initiatives across the value chain.</li> </ul> <figure class="wp-block-image size-full"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/image-161.png" alt="" 
class="wp-image-602473"/><figcaption class="wp-element-caption">Examples of Challenges Faced by Companies &#8211; (Image by Samir Saci)</figcaption></figure> <p class="wp-block-paragraph">In the following sections, I will explore how we can leverage <strong>Agentic AI</strong> to overcome two of these major obstacles: </p> <ul class="wp-block-list"> <li class="wp-block-list-item">Improving reporting to respect the regulations</li> <li class="wp-block-list-item">Accelerating the design and execution of sustainable initiatives</li> </ul> <h2 class="wp-block-heading">Solving Reporting Challenges with AI Agents</h2> <p class="wp-block-paragraph">The first step in any sustainable roadmap is to build the reporting foundation.</p> <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"> <p class="wp-block-paragraph">Companies must measure and publish their current environmental footprint before taking action.</p> </blockquote> <figure class="wp-block-image size-full"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/image-162.png" alt="" class="wp-image-602474"/><figcaption class="wp-element-caption"><a href="https://towardsdatascience.com/what-is-esg-reporting-d610535eed9c/">Environmental Social, and Governance Reporting </a>&#8211; (Image by Samir Saci)</figcaption></figure> <p class="wp-block-paragraph">For example, ESG reporting communicates <span style="margin: 0px; padding: 0px;">a company&#8217;s environmental performance <strong>(E)</strong>, social responsibility<strong> (S)</strong>, and governance structures&#8217;<strong>&nbsp;</strong>strength <strong>(G)</strong></span>.</p> <p class="wp-block-paragraph">Let&#8217;s start by tackling the problem of data preparation.</p> <h3 class="wp-block-heading">Issue 1: Data Collection and Processing</h3> <p class="wp-block-paragraph">However, many companies face significant challenges right from the start, beginning with <strong>data collection</strong>.</p> <figure class="wp-block-image size-full"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/image-163.png" alt="" class="wp-image-602475"/><figcaption class="wp-element-caption">Type of Information to Collect for <a href="https://towardsdatascience.com/what-is-a-life-cycle-assessment-lca-e32a5078483a/" target="_blank" rel="noreferrer noopener">Life Cycle Assessment</a> &#8211; (Image by Samir Saci)</figcaption></figure> <p class="wp-block-paragraph">In a previous article, I introduced the concept of <a href="https://towardsdatascience.com/what-is-a-life-cycle-assessment-lca-e32a5078483a/" target="_blank" rel="noreferrer noopener">Life Cycle Assessment</a> (LCA) — a method for evaluating a product&#8217;s environmental impacts from raw material extraction to disposal.</p> <p class="wp-block-paragraph">This requires a complex data pipeline to connect to multiple systems, extract raw data, process it and store it in a data warehouse.</p> <figure class="wp-block-image size-full"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/image-164.png" alt="" class="wp-image-602476"/><figcaption class="wp-element-caption"><a href="https://towardsdatascience.com/what-is-a-life-cycle-assessment-lca-e32a5078483a/">Example of Data Infrastructure for a Life Cycle Assessment </a>&#8211; (Image by Samir Saci)</figcaption></figure> <p class="wp-block-paragraph">These pipelines serve to generate reports and provide harmonised data sources for analytics and 
business teams.</p> <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"> <p class="wp-block-paragraph">How can we help non-technical teams navigate this complex landscape?</p> </blockquote> <p class="wp-block-paragraph">In <a href="https://www.logi-green.com/" target="_blank" rel="noreferrer noopener">LogiGreen</a>, we explore the usage of an AI Agent for text-to-SQL applications.</p> <figure class="wp-block-image size-large"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/image-165-1024x288.png" alt="" class="wp-image-602477"/><figcaption class="wp-element-caption">Text-to-SQL applications for Supply Chain &#8211; (Image by Samir Saci)</figcaption></figure> <p class="wp-block-paragraph">The great added value is that business and operational teams no longer rely on analytics experts to build tailored solutions.</p> <p class="wp-block-paragraph">As a Supply Chain Engineer myself, I understand the frustration of operations managers who must create support tickets just to extract data or calculate a new indicator.</p> <figure class="wp-block-image size-large"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/image-166-1024x769.png" alt="" class="wp-image-602478"/><figcaption class="wp-element-caption">Example of Interaction with an Agent &#8211; (Image by Samir Saci)</figcaption></figure> <p class="wp-block-paragraph">With this AI agent, we provide an Analytics-as-a-Service experience for all users, allowing them to formulate their demand in plain English.</p> <p class="wp-block-paragraph">For instance, we help reporting teams build specific prompts to collect data from multiple tables to feed a report.</p> <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"> <p class="wp-block-paragraph">&#8220;Please generate a table showing the sum of CO₂ emissions per day for all deliveries from warehouse XXX.&#8221;</p> </blockquote> <p class="wp-block-paragraph">For more information on how I implemented this agent, <a href="https://towardsdatascience.com/automate-supply-chain-analytics-workflows-with-ai-agents-using-n8n/" target="_blank" rel="noreferrer noopener">check this article</a> <img src="https://s.w.org/images/core/emoji/15.0.3/72x72/1f447.png" alt="👇" class="wp-smiley" style="height: 1em; max-height: 1em;" />.</p> <p class="wp-block-paragraph"><a href="https://towardsdatascience.com/automate-supply-chain-analytics-workflows-with-ai-agents-using-n8n/" target="_blank" rel="noreferrer noopener">Automate Supply Chain Analytics Workflows with AI Agents using&nbsp;n8n | Towards Data Science<img src="https://s.w.org/images/core/emoji/15.0.3/72x72/2197.png" alt="↗" class="wp-smiley" style="height: 1em; max-height: 1em;" /></a></p> <h3 class="wp-block-heading">Issue 2: Reporting Format</h3> <p class="wp-block-paragraph">Even after collecting the data, companies face another challenge: <strong>generating the report in the required formats</strong>.</p> <p class="wp-block-paragraph">In Europe, the new <strong>Corporate Sustainability Reporting Directive (CSRD)</strong> provides a framework for companies to disclose their environmental, social, and governance impacts.<br><br>Under CSRD, companies must submit structured reports in <strong>XHTML format</strong>.</p> <figure class="wp-block-image size-large"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/image-167-1024x691.png" alt="" 
class="wp-image-602481"/><figcaption class="wp-element-caption">Simple Example of an xHTML report that is not compliant &#8211; (Image by Samir Saci)</figcaption></figure> <p class="wp-block-paragraph">This document, enriched with detailed <strong>ESG taxonomies</strong>, requires a process that can be highly technical and prone to errors, especially for companies with low data maturity.</p> <figure class="wp-block-image size-large"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/image-168-1024x434.png" alt="" class="wp-image-602482"/><figcaption class="wp-element-caption">AI Agent for CSRD Report Format Audit &#8211; (Image by Samir Saci)</figcaption></figure> <p class="wp-block-paragraph">Therefore, we have experimented with using an AI Agent to automatically audit the report and provide a summary to non-technical users.</p> <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"> <p class="wp-block-paragraph">How does it work?</p> </blockquote> <p class="wp-block-paragraph"><strong>Users send their report by Email.</strong></p> <figure class="wp-block-image size-large"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/image-169-1024x228.png" alt="" class="wp-image-602483"/><figcaption class="wp-element-caption">Email with the report in attachment &#8211; (Image by Samir Saci)</figcaption></figure> <p class="wp-block-paragraph">The endpoint automatically downloads the attached file, performs an audit of the content and format, searching for errors or missing values.</p> <p class="wp-block-paragraph">The results are then sent to an AI Agent, which generates a clear summary of the audit in English.</p> <figure class="wp-block-image size-large"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/image-170-1024x268.png" alt="" class="wp-image-602484"/><figcaption class="wp-element-caption">Example of System Prompt of the AI Agent &#8211; (Image by Samir Saci)</figcaption></figure> <p class="wp-block-paragraph"><strong>The agent sends a report back to the sender</strong>.</p> <figure class="wp-block-image size-large"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/image-171-1024x637.png" alt="" class="wp-image-602485"/></figure> <p class="wp-block-paragraph">We have developed a fully automated service to audit reports created by sustainability consultants <em>(our customer is a consultancy firm)</em> that anyone can use without requiring technical skills.</p> <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"> <p class="wp-block-paragraph">Interested in implementing a similar solution?</p> </blockquote> <p class="wp-block-paragraph"><span style="margin: 0px; padding: 0px;">I built this project using the no-code platform n8n.</span></p> <p class="wp-block-paragraph"><span style="margin: 0px; padding: 0px;">You can find the ready-to-deploy template in&nbsp;</span><a href="https://n8n.io/creators/samirsaci/">my n8n creator profile.</a> </p> <p class="wp-block-paragraph">Now that we have explored solutions for reporting, we can move on to the core of green transformations: <strong>designing and implementing sustainable initiatives.</strong> </p> <h2 class="wp-block-heading">Agentic AI for Supply Chain Analytics Products</h2> <h3 class="wp-block-heading">Analytics Products for Sustainability</h3> <p class="wp-block-paragraph">My focus over the last two years has been 
on building analytics products, including web applications, APIs and automated workflows.</p> <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"> <p class="wp-block-paragraph">What is a sustainability roadmap?</p> </blockquote> <p class="wp-block-paragraph">In my previous experience, it often started with a push from top management.<br><br>For example, leadership would ask the supply chain department to measure the company&#8217;s CO₂ emissions for the baseline year of 2021.</p> <p class="wp-block-paragraph">I was responsible for estimating the <strong>Scope 3 emissions</strong> of the distribution chain.</p> <figure class="wp-block-image size-full"><a href="https://towardsdatascience.com/supply-chain-sustainability-reporting-with-python-161c1f63f267/"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/image-182.png" alt="" class="wp-image-602509"/></a><figcaption class="wp-element-caption">Suppl<a href="https://towardsdatascience.com/supply-chain-sustainability-reporting-with-python-161c1f63f267/">y Chain Sustainability Reporting</a> &#8211; (Image by Samir Saci)</figcaption></figure> <p class="wp-block-paragraph">This is why I implemented the methodology presented in the article linked above.</p> <p class="wp-block-paragraph">Once a baseline is established, a <strong>reduction target</strong> is defined with a clear deadline.</p> <p class="wp-block-paragraph">For instance, your management can commit to a 30% reduction by 2030.</p> <p class="wp-block-paragraph">The role of the supply chain department is then to design and implement initiatives that reduce CO2 emissions.</p> <figure class="wp-block-image size-full"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/image-183.png" alt="" class="wp-image-602510"/><figcaption class="wp-element-caption"><a href="https://www.logi-green.com/" target="_blank" rel="noreferrer noopener">Example of a roadmap with initiatives</a> &#8211; (Image by Samir Saci)</figcaption></figure> <p class="wp-block-paragraph">In the example above, the company reaches a 30% reduction by year N through initiatives across manufacturing, logistics, retail operations and carbon offsetting.</p> <p class="wp-block-paragraph">To support this journey, we develop analytics products that simulate the impact of different initiatives, helping teams to design optimal sustainability strategies.</p> <figure class="wp-block-image size-full"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/image-184.png" alt="" class="wp-image-602513"/><figcaption class="wp-element-caption"><a href="https://logi-green.com" target="_blank" rel="noreferrer noopener">Example of analytics products to support sustainability roadmaps </a>&#8211; (Image by Samir Saci)</figcaption></figure> <p class="wp-block-paragraph">So far, the products have been in the form of web applications with a user interface and a backend connected to their data sources.</p> <figure class="wp-block-image size-full"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/temp.gif" alt="" class="wp-image-602516"/><figcaption class="wp-element-caption"><a href="https://logi-green.com" target="_blank" rel="noreferrer noopener">Example of UI for the Supply Chain Optimisation Module</a> &#8211; (Image by Samir Saci)</figcaption></figure> <p class="wp-block-paragraph">Each module provides key insights to support operational 
decision-making.</p> <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"> <p class="wp-block-paragraph">&#8220;Based on the outputs, we could achieve a 32% CO₂ emissions reduction by relocating our factory from Brazil to the USA.&#8221;</p> </blockquote> <p class="wp-block-paragraph">However, for an audience unfamiliar with data analytics, interacting with these applications can still feel overwhelming.</p> <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"> <p class="wp-block-paragraph">How can we use AI agents to better support these users?</p> </blockquote> <h3 class="wp-block-heading">Agentic AI for Analytics Products</h3> <p class="wp-block-paragraph">We are now evolving these solutions by embedding autonomous AI agents that interact directly with analytics models and tools through API endpoints.</p> <p class="wp-block-paragraph">These agents are designed to <strong>guide non-technical users</strong> through the entire journey, starting from a simple question:</p> <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"> <p class="wp-block-paragraph"><em>&#8220;How can I reduce the CO₂ emissions of my transportation network?&#8221;</em></p> </blockquote> <p class="wp-block-paragraph">The AI agent then takes charge of:</p> <ul class="wp-block-list"> <li class="wp-block-list-item">Formulating the correct queries,</li> <li class="wp-block-list-item">Connecting to the optimisation models,</li> <li class="wp-block-list-item">Interpreting the results,</li> <li class="wp-block-list-item">And providing actionable recommendations.</li> </ul> <p class="wp-block-paragraph">The user doesn&#8217;t need to understand how the backend works.<br>They receive a direct, business-oriented output like:</p> <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"> <p class="wp-block-paragraph"><em>&#8220;Implement Solution XXX with an investment budget of YYY euros to achieve a CO₂ emissions reduction of ZZZ tons CO₂eq.&#8221;</em></p> </blockquote> <p class="wp-block-paragraph">By combining optimisation models, APIS, and AI-driven guidance, we offer an Analytics-as-a-Service experience.</p> <p class="wp-block-paragraph">We want to make sustainability analytics accessible to all teams, not just technical experts.</p> <h2 class="wp-block-heading">Conclusion</h2> <h3 class="wp-block-heading">Using AI Responsibly</h3> <p class="wp-block-paragraph">Before closing, a word about minimising the environmental footprint of the solutions we develop.</p> <p class="wp-block-paragraph">We are fully aware of the environmental impacts of using LLMs.</p> <p class="wp-block-paragraph">Therefore, the core of our products remains built on <strong>deterministic optimisation models</strong>, <strong>carefully designed by us</strong>.<br><br>Large Language Models (LLMS) are used only when they provide real added value, primarily to simplify user interaction or automate non-critical tasks.</p> <p class="wp-block-paragraph">This allows us to:</p> <ul class="wp-block-list"> <li class="wp-block-list-item"><strong>Guarantee robustness and reliability</strong>: for the same input, users consistently receive the same output, avoiding stochastic behaviours typical of pure AI models</li> <li class="wp-block-list-item"><b>Minimise energy consumption:</b> by reducing the number of tokens used in our API calls and optimising every prompt to be as efficient as possible.</li> </ul> <p class="wp-block-paragraph">In short, we are committed to 
building solutions that are sustainable by their design.</p> <h3 class="wp-block-heading">AI Agents are a game changer for Supply Chain Analytics</h3> <p class="wp-block-paragraph">For me, AI agents are becoming powerful allies in helping our customers accelerate their sustainability roadmaps.</p> <p class="wp-block-paragraph">As I interact with a non-technical target audience, this is a competitive advantage, as it allows me to provide Analytics-as-a-Service solutions that empower operational teams.</p> <p class="wp-block-paragraph">This simplifies one of the biggest obstacles companies face when starting their green transformation.</p> <p class="wp-block-paragraph">By <strong>communicating insights in plain language</strong> and <strong>guiding users through their journey, AI agents help </strong>bridge the gap between data-driven solutions and operational execution.</p> <h1 class="wp-block-heading">About Me</h1> <p class="wp-block-paragraph">Let’s connect on&nbsp;<a href="https://www.linkedin.com/in/samir-saci/" target="_blank" rel="noreferrer noopener">Linkedin</a>&nbsp;and&nbsp;<a href="https://twitter.com/Samir_Saci_" target="_blank" rel="noreferrer noopener">Twitter</a>; I am a Supply Chain Engineer using data analytics to improve&nbsp;<a href="https://towardsdatascience.com/tag/logistics/" target="_blank" rel="noreferrer noopener">Logistics</a>&nbsp;operations and reduce costs.</p> <p class="wp-block-paragraph">For consulting or advice on analytics and sustainable&nbsp;<a href="https://towardsdatascience.com/tag/supply-chain/" target="_blank" rel="noreferrer noopener">Supply Chain</a>&nbsp;transformation, feel free to contact me via&nbsp;<a href="https://www.logi-green.com/" target="_blank" rel="noreferrer noopener">Logigreen Consulting</a>.</p> <p class="wp-block-paragraph"><a href="https://samirsaci.com/" target="_blank" rel="noreferrer noopener"><strong>Samir Saci | Data Science &amp; Productivity</strong><br><em>A technical blog focusing on Data Science, Personal Productivity, Automation, Operations Research and Sustainable…</em>samirsaci.com</a></p> <p>The post <a href="https://towardsdatascience.com/ai-agents-for-a-more-sustainable-world/">AI Agents for a More Sustainable World</a> appeared first on <a href="https://towardsdatascience.com">Towards Data Science</a>.</p>
  6. Graph Neural Networks Part 4: Teaching Models to Connect the Dots

    Tue, 29 Apr 2025 18:31:17 -0000

    Heuristic and GNN-based approaches to Link Prediction

    The post Graph Neural Networks Part 4: Teaching Models to Connect the Dots appeared first on Towards Data Science.

    <p class="wp-block-paragraph"><strong><mdspan datatext="el1745951325198" class="mdspan-comment">Have you</mdspan> ever wondered how it&#8217;s possible that Facebook knows who you might know? Or why it sometimes suggests a total stranger? This problem is called link prediction. In a social network graph, people are nodes and friendships are edges, the goal is to predict if a connection should exist between two nodes.</strong></p> <p class="wp-block-paragraph">Link prediction is a very popular topic! It can be used to recommend friends in social networks, suggest products on e-commerce sites or movies on Netflix, or predict protein interactions in biology. In this post, you will explore how link prediction works. First you will learn simple heuristics, and we end with powerful GNN-based methods like SEAL.</p> <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"> <p class="wp-block-paragraph">The previous posts explained <a href="https://towardsdatascience.com/graph-neural-networks-part-1-graph-convolutional-networks-explained-9c6aaa8a406e/">GCNs</a>, <a href="https://towardsdatascience.com/graph-neural-networks-part-2-graph-attention-networks-vs-gcns-029efd7a1d92/">GATs</a>, and <a href="https://towardsdatascience.com/graph-neural-networks-part-3-how-graphsage-handles-changing-graph-structure/">GraphSage</a>. They mainly covered predicting node properties, so you can read this article standalone, because this time we shift focus to predicting edges. If you want to dive a bit deeper into node representations, I recommend to revisit the previous posts. The code setup can be found <a href="https://towardsdatascience.com/graph-neural-networks-part-1-graph-convolutional-networks-explained-9c6aaa8a406e/">here</a>.</p> </blockquote> <hr class="wp-block-separator has-alpha-channel-opacity is-style-dotted"/> <h2 class="wp-block-heading">What is Link Prediction?</h2> <p class="wp-block-paragraph">Link prediction is the task of forecasting missing or future connections (edges) between nodes in a graph. Given a graph <em>G = (V, E)</em>, the goal is to predict whether an edge should exist between two nodes&nbsp;(<em>u, v</em>) ∉ <em>E</em>.</p> <p class="wp-block-paragraph">To evaluate link prediction models, you can create a test set by hiding a portion of the existing edges and ask the model to predict them. Of course, the test set should have positive samples (real edges), and negative samples (random node pairs that are not connected). You can train the model on the remaining graph.</p> <p class="wp-block-paragraph">The output of the model is a link score or probability for each node pair. You can evaluate this with metrics like AUC or average precision.</p> <p class="wp-block-paragraph">We will take a look at simple heuristic-based methods, and then we move on to more complex methods.</p> <figure class="wp-block-image size-large"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/Screenshot-2025-04-18-at-11.40.03-1024x425.png" alt="" class="wp-image-601814"/><figcaption class="wp-element-caption">Graph with nodes and edges. We will use this graph as example for the heuristic-based methods. Image by author.</figcaption></figure> <h2 class="wp-block-heading">Heuristic-Based Methods</h2> <p class="wp-block-paragraph">We can divide these &#8216;easy&#8217; methods into two categories: local and global. Local heuristics are based on local structure, while global heuristics use the whole graph. 
These approaches are rule-based and work well as baselines for link prediction tasks.</p> <h3 class="wp-block-heading">Local Heuristics</h3> <p class="wp-block-paragraph">As the name says, local heuristics rely on the <em>immediate neighborhood</em> of the two nodes you are testing for a potential link. And actually they can be surprisingly effective. Benefits of local heuristics are that they are fast and interpretable. But they only look at the close neighborhood, so capturing the complexity of relationships is limited.</p> <h4 class="wp-block-heading">Common Neighbors</h4> <p class="wp-block-paragraph">The idea is simple: if two nodes share many common neighbors, they are more likely to be connected.</p> <p class="wp-block-paragraph">For calculation you count the number of neighbors the nodes have in common. One issue here is that&nbsp;it does not take into account the relative number of common neighbors.</p> <p class="wp-block-paragraph">In the examples below, the number of common neighbors between A and B is 3, and the number of common neighbors between C and D is 1.</p> <figure class="wp-block-image size-large"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/Screenshot-2025-04-18-at-11.52.28-1024x446.png" alt="" class="wp-image-601816"/></figure> <h4 class="wp-block-heading">Jaccard Coefficient</h4> <p class="wp-block-paragraph">The <a href="https://towardsdatascience.com/tag/jaccard/" title="Jaccard">Jaccard</a> Coefficient fixes the issue of common neighbors and computes the relative number of neighbors in common.</p> <p class="wp-block-paragraph">You take the common neighbors and divide this by the total number of unique neighbors of the two nodes.</p> <p class="wp-block-paragraph">So now things change a bit: the Jaccard coefficient of nodes A and B is 3/5 = 0.6 (they have 3 common neighbors and 5 total unique neighbors), while the Jaccard coefficient of nodes C and D is 1/1 = 1 (they have 1 common neighbor and 1 unique neighbor). In this case the connection between C and D is more likely, because they only have 1 neighbor, and it&#8217;s also a common neighbor.</p> <figure class="wp-block-image size-large"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/Screenshot-2025-04-18-at-13.23.02-1024x446.png" alt="" class="wp-image-601819"/><figcaption class="wp-element-caption">Jaccard coefficient for 2 different edges. Image by author.</figcaption></figure> <h4 class="wp-block-heading">Adamic-Adar Index</h4> <p class="wp-block-paragraph">The Adamic-Adar index goes one step further than common neighbors: it uses the popularity of a common neighbor and gives less weight to more popular neighbors (they have more connections). The intuition behind this is that if a node is connected to everyone, it doesn&#8217;t tell us much about a specific connection.</p> <p class="wp-block-paragraph">What does that look like in a formula?</p> <figure class="wp-block-image size-large"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/Screenshot-2025-04-18-at-11.21.52-1024x120.png" alt="" class="wp-image-601810"/></figure> <p class="wp-block-paragraph">So for each common neighbor <em>z</em>, we add a score of 1 divided by the log of the number of neighbors from <em>z</em>. 
By doing this, the more popular the common neighbor, the smaller its contribution.</p> <p class="wp-block-paragraph">Let&#8217;s calculate the Adamic-Adar index for our examples.</p> <figure class="wp-block-image size-large"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/Screenshot-2025-04-18-at-13.23.47-1024x393.png" alt="" class="wp-image-601820"/><figcaption class="wp-element-caption">Adamic-Adar index. If a common neighbor is popular, its contribution decreases. Image by author.</figcaption></figure> <h4 class="wp-block-heading">Preferential Attachment</h4> <p class="wp-block-paragraph">A different approach is preferential attachment. The idea behind it is that nodes with higher degrees are more likely to form links. Calculation is super easy, you just multiply the degrees (number of connections) of the two nodes.</p> <p class="wp-block-paragraph">For A and B, the degrees are respectively 5 and 3, so the score is 5*3 = 15. C and D have a score of 1*1 = 1. In this case A and B are more likely to have a connection, because they have more neighbors in general.</p> <figure class="wp-block-image size-large"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/Screenshot-2025-04-18-at-13.26.35-1024x476.png" alt="" class="wp-image-601821"/><figcaption class="wp-element-caption">Preferential attachment score for the examples. Image by author.</figcaption></figure> <h3 class="wp-block-heading">Global Heuristics</h3> <p class="wp-block-paragraph">Global heuristics consider paths, walks, or the entire graph structure. They can capture richer patterns, but are more computationally expensive.</p> <h4 class="wp-block-heading">Katz Index</h4> <p class="wp-block-paragraph">The most well-known global heuristic for <a href="https://towardsdatascience.com/tag/link-prediction/" title="Link Prediction">Link Prediction</a> is the Katz Index. It takes all the different paths between two nodes (usually only paths up to three steps). Each path gets a weight that&nbsp;decays exponentially&nbsp;with its length. This makes sense intuitively, because the shorter a path, the more important it is (friends in common means a lot). On the other hand, indirect paths matter as well! They can hint at potential links.</p> <p class="wp-block-paragraph">The Katz Formula:</p> <figure class="wp-block-image size-large"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/Screenshot-2025-04-18-at-11.23.42-1024x120.png" alt="" class="wp-image-601811"/></figure> <p class="wp-block-paragraph">We take two nodes, C and E, and count the paths between them. There are three paths with up to three steps: one path with two steps (orange), and two paths with three steps (blue and green). Now we can calculate the Katz index, let&#8217;s choose 0.1 for beta:</p> <figure class="wp-block-image size-large"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/Screenshot-2025-04-18-at-13.34.09-1024x438.png" alt="" class="wp-image-601822"/><figcaption class="wp-element-caption">Katz index calculation for nodes C and E. Shorter paths add more weight. Image by author.</figcaption></figure> <h4 class="wp-block-heading">Rooted PageRank</h4> <p class="wp-block-paragraph">This method uses random walks to determine how likely it is that a random walk from the first node, will end up in the second node. 
<h4 class="wp-block-heading">Rooted PageRank</h4> <p class="wp-block-paragraph">This method uses random walks to determine how likely it is that a random walk starting from the first node ends up at the second node. You start at the first node, and at every step you either walk to a random neighbor or jump back to the first node. The probability of ending up at the second node tells you how closely related the two nodes are. If the probability is high, there is a good chance the nodes should be linked.</p> <h2 class="wp-block-heading">ML-Based Link Prediction</h2> <p class="wp-block-paragraph">Machine learning approaches take link prediction beyond heuristics by learning patterns directly from the data. Instead of relying on predefined rules, ML models can learn complex features that signal whether a link should exist.</p> <p class="wp-block-paragraph">A basic approach is to treat link prediction as a binary classification task: for each node pair (u, v), we create a feature vector and train a model to predict 1 (link exists) or 0 (link doesn&#8217;t exist). You can add the heuristics we calculated before as features. The heuristics didn&#8217;t always agree on how likely an edge was: for some, the edge between A and B was more likely, while for others the edge between C and D was the better choice. By including multiple scores as features, we don&#8217;t have to choose a single heuristic. Of course, depending on the problem, some heuristics might work better than others.</p> <p class="wp-block-paragraph">Another type of feature you can add is aggregated features: for example node degree, node embeddings, attribute averages, etc.</p> <p class="wp-block-paragraph">Then use any classifier (e.g., logistic regression, random forest, XGBoost) to predict links. Combining several heuristic and aggregated features this way usually performs better than any single heuristic alone.</p>
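<p class="wp-block-paragraph">As a minimal sketch of that idea (not code from this article), the feature vector for a node pair could simply stack several heuristic scores together with aggregated features such as the node degrees. The <code>pair_features</code> helper below is a hypothetical example built on networkx.</p> <pre class="wp-block-prismatic-blocks"><code class="language-python">import networkx as nx

def pair_features(G, u, v):
    # heuristic scores for the candidate edge (u, v)
    _, _, jac = next(nx.jaccard_coefficient(G, [(u, v)]))
    _, _, aa = next(nx.adamic_adar_index(G, [(u, v)]))
    _, _, pa = next(nx.preferential_attachment(G, [(u, v)]))
    # plus simple aggregated features (here: the two node degrees)
    return [jac, aa, pa, G.degree(u), G.degree(v)]

# usage sketch on a small toy graph
G = nx.Graph([(0, 1), (1, 2), (2, 3), (0, 3), (0, 2)])
print(pair_features(G, 1, 3))</code></pre>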
<p class="wp-block-paragraph">In this post we will use the Cora dataset to test different approaches to link prediction. The <a href="https://paperswithcode.com/dataset/cora">Cora dataset</a> contains scientific papers, and the edges represent citations between papers. Let&#8217;s train a machine learning model as a baseline, where we only add the Jaccard coefficient as a feature:</p> <pre class="wp-block-prismatic-blocks"><code class="language-python">import os.path as osp

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score
from torch_geometric.datasets import Planetoid
from torch_geometric.transforms import RandomLinkSplit
from torch_geometric.utils import to_dense_adj

# reproducibility
from torch_geometric import seed_everything
seed_everything(42)

# load Cora dataset, create train/val/test splits
path = osp.join(osp.dirname(osp.realpath(__file__)), &#039;..&#039;, &#039;data&#039;, &#039;Planetoid&#039;)
dataset = Planetoid(path, name=&#039;Cora&#039;)
data_all = dataset[0]
transform = RandomLinkSplit(num_val=0.05, num_test=0.1, is_undirected=True, split_labels=True)
train_data, val_data, test_data = transform(data_all)

# add Jaccard and train with Logistic Regression
adj = to_dense_adj(train_data.edge_index, max_num_nodes=data_all.num_nodes)[0]

def jaccard(u, v, adj):
    u_neighbors = set(adj[u].nonzero().view(-1).tolist())
    v_neighbors = set(adj[v].nonzero().view(-1).tolist())
    inter = len(u_neighbors &amp; v_neighbors)
    union = len(u_neighbors | v_neighbors)
    return inter / union if union &gt; 0 else 0.0

def extract_features(pairs, adj):
    return [[jaccard(u, v, adj)] for u, v in pairs]

# positive and negative edges with their labels
train_pairs = train_data.pos_edge_label_index.t().tolist() + train_data.neg_edge_label_index.t().tolist()
train_labels = [1] * train_data.pos_edge_label_index.size(1) + [0] * train_data.neg_edge_label_index.size(1)
test_pairs = test_data.pos_edge_label_index.t().tolist() + test_data.neg_edge_label_index.t().tolist()
test_labels = [1] * test_data.pos_edge_label_index.size(1) + [0] * test_data.neg_edge_label_index.size(1)

X_train = extract_features(train_pairs, adj)
clf = LogisticRegression().fit(X_train, train_labels)

X_test = extract_features(test_pairs, adj)
probs = clf.predict_proba(X_test)[:, 1]
auc_ml = roc_auc_score(test_labels, probs)
ap_ml = average_precision_score(test_labels, probs)
print(f&quot;[ML Model] AUC: {auc_ml:.4f}, AP: {ap_ml:.4f}&quot;)</code></pre> <p class="wp-block-paragraph">We evaluate with AUC. This is the result:</p> <pre class="wp-block-prismatic-blocks"><code class="language-markup">[ML Model] AUC: 0.6958, AP: 0.6890</code></pre> <p class="wp-block-paragraph">We can go a step further and use neural networks that operate directly on the graph structure.</p> <h2 class="wp-block-heading">VGAE: Encoding and Decoding</h2> <p class="wp-block-paragraph">A <a href="https://arxiv.org/abs/1611.07308">Variational Graph Auto-Encoder</a> is like a neural network that learns to guess the hidden structure of the graph. It can then use that hidden knowledge to predict missing links.</p> <p class="wp-block-paragraph">A <a href="https://towardsdatascience.com/tag/vgae/" title="VGAE">VGAE</a> is actually a combination of a GAE (Graph Auto-Encoder) and a <a href="https://en.wikipedia.org/wiki/Variational_autoencoder">VAE</a> (Variational Auto-Encoder). I&#8217;ll get back to the difference between a GAE and a VGAE later on. </p> <p class="wp-block-paragraph">The steps of a VGAE are as follows. First, the VGAE <strong>encodes nodes into latent vectors</strong>, and then it <strong>decodes node pairs</strong> to <strong>predict whether an edge exists</strong> between them. </p> <p class="wp-block-paragraph">How does the encoding work? 
Each node is mapped to a latent variable: a point in a hidden space. The encoder is a <a href="https://towardsdatascience.com/graph-neural-networks-part-1-graph-convolutional-networks-explained-9c6aaa8a406e/">Graph Convolutional Network</a> (GCN) that produces a mean and a variance vector for each node. It uses the node features and the adjacency matrix as input. Using these vectors, the VGAE samples a latent embedding from a normal distribution. It&#8217;s important to note that each node isn&#8217;t just mapped to a single point, but to a distribution! This is the difference between a GAE and a VGAE: in a GAE, each node is mapped to one single point.</p> <figure class="wp-block-image size-full"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/Screenshot-2025-04-26-at-12.14.17.png" alt="" class="wp-image-602387"/></figure> <p class="wp-block-paragraph">The next step is the decoding step. The VGAE will guess if there is an edge between two nodes. It does this by calculating the inner product between the embeddings of the two nodes:</p> <figure class="wp-block-image size-full"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/Screenshot-2025-04-26-at-12.14.42.png" alt="" class="wp-image-602388"/></figure> <p class="wp-block-paragraph">The intuition behind it: the closer two nodes are in the hidden space, the more likely they are to be connected. </p> <p class="wp-block-paragraph">VGAE visualized:</p> <figure class="wp-block-image size-large"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/Screenshot-2025-04-26-at-13.59.11-1024x440.png" alt="" class="wp-image-602390"/></figure> <p class="wp-block-paragraph">How does the model learn? It optimizes two things:</p> <ul class="wp-block-list"> <li class="wp-block-list-item">Reconstruction Loss: Do the predicted edges match the real ones?</li> <li class="wp-block-list-item">KL Divergence Loss: Is the latent space regular, i.e., close to a standard normal distribution? 
</li> </ul> <p class="wp-block-paragraph">Let&#8217;s test the VGAE on the Cora dataset:</p> <pre class="wp-block-prismatic-blocks"><code class="language-python">import os.path as osp

import numpy as np
import torch
from sklearn.metrics import roc_auc_score, average_precision_score
from torch_geometric.datasets import Planetoid
from torch_geometric.nn import GCNConv, VGAE
from torch_geometric.transforms import RandomLinkSplit

# same as before
from torch_geometric import seed_everything
seed_everything(42)

path = osp.join(osp.dirname(osp.realpath(__file__)), &#039;..&#039;, &#039;data&#039;, &#039;Planetoid&#039;)
dataset = Planetoid(path, name=&#039;Cora&#039;)
data_all = dataset[0]
transform = RandomLinkSplit(num_val=0.05, num_test=0.1, is_undirected=True, split_labels=True)
train_data, val_data, test_data = transform(data_all)

# VGAE
class VGAEEncoder(torch.nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv1 = GCNConv(in_channels, 2 * out_channels)
        self.conv_mu = GCNConv(2 * out_channels, out_channels)
        self.conv_logstd = GCNConv(2 * out_channels, out_channels)

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index).relu()
        return self.conv_mu(x, edge_index), self.conv_logstd(x, edge_index)

vgae = VGAE(VGAEEncoder(dataset.num_features, 32))
vgae_optimizer = torch.optim.Adam(vgae.parameters(), lr=0.01)
x = data_all.x
edge_index = train_data.edge_index

# train VGAE model
for epoch in range(1, 101):
    vgae.train()
    vgae_optimizer.zero_grad()
    z = vgae.encode(x, edge_index)
    # reconstruction loss
    loss = vgae.recon_loss(z, train_data.pos_edge_label_index)
    # KL divergence
    loss = loss + (1 / data_all.num_nodes) * vgae.kl_loss()
    loss.backward()
    vgae_optimizer.step()

vgae.eval()
z = vgae.encode(x, edge_index)

@torch.no_grad()
def score_edges(pairs):
    edge_tensor = torch.tensor(pairs).t().to(z.device)
    return vgae.decoder(z, edge_tensor).view(-1).cpu().numpy()

vgae_scores = np.concatenate([score_edges(test_data.pos_edge_label_index.t().tolist()),
                              score_edges(test_data.neg_edge_label_index.t().tolist())])
vgae_labels = np.array([1] * test_data.pos_edge_label_index.size(1) + [0] * test_data.neg_edge_label_index.size(1))

auc_vgae = roc_auc_score(vgae_labels, vgae_scores)
ap_vgae = average_precision_score(vgae_labels, vgae_scores)
print(f&quot;[VGAE] AUC: {auc_vgae:.4f}, AP: {ap_vgae:.4f}&quot;)</code></pre> <p class="wp-block-paragraph">And the result (ML model added for comparison):</p> <pre class="wp-block-prismatic-blocks"><code class="language-markup">[VGAE] AUC: 0.9032, AP: 0.9179
[ML Model] AUC: 0.6958, AP: 0.6890</code></pre> <p class="wp-block-paragraph">Wow! Massive improvement compared to the ML model!</p> <h2 class="wp-block-heading">SEAL: Learning from Subgraphs</h2> <p class="wp-block-paragraph">One of the most powerful GNN-based approaches is <a href="https://proceedings.neurips.cc/paper_files/paper/2018/file/53f0d7c537d99b3824f0f99d62ea2428-Paper.pdf">SEAL</a> (learning from Subgraphs, Embeddings and Attributes for Link prediction). The idea is simple and elegant: instead of looking at global node embeddings, SEAL looks at the <strong>local subgraph</strong> around each node pair.</p> <p class="wp-block-paragraph">Here’s a step-by-step explanation:</p> <ol start="1" class="wp-block-list"> <li class="wp-block-list-item">For each node pair (u, v), extract a small enclosing subgraph. 
E.g., neighbors only (1-hop neighborhood) or neighbors and neighbors of neighbors (2-hop neighborhood).</li> <li class="wp-block-list-item">Label the nodes in this subgraph to reflect their role: which ones are u and v, and which ones are neighbors.</li> <li class="wp-block-list-item">Use a GNN (like DGCNN or GCN) to learn from the subgraph and predict if a link should exist.</li> </ol> <p class="wp-block-paragraph">Visualization of the steps:</p> <figure class="wp-block-image size-large"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/Screenshot-2025-04-21-at-08.18.42-766x1024.png" alt="" class="wp-image-601947"/><figcaption class="wp-element-caption">Three steps of SEAL. Image by author.</figcaption></figure> <p class="wp-block-paragraph">SEAL is very powerful because it learns structural patterns directly from examples, instead of relying on handcrafted rules. It also works well with sparse graphs and generalizes across different types of networks.</p> <p class="wp-block-paragraph">Let&#8217;s see if SEAL can improve on the results of the VGAE on the Cora dataset. For the SEAL code, I took the <a href="https://github.com/pyg-team/pytorch_geometric/blob/master/examples/seal_link_pred.py">sample code from PyTorch Geometric</a> (check it out by following the link), since SEAL requires quite a bit of preprocessing. You can recognize the different steps in the code (preparing the data, extracting the subgraphs, labeling the nodes). Training for 50 epochs gives the following result:</p> <pre class="wp-block-prismatic-blocks"><code class="language-markup">[SEAL] AUC: 0.9038, AP: 0.9176
[VGAE] AUC: 0.9032, AP: 0.9179
[ML Model] AUC: 0.6958, AP: 0.6890</code></pre> <p class="wp-block-paragraph">Almost exactly the same result as the VGAE. So for this problem, VGAE might be the best choice (it is significantly faster than SEAL). Of course this can vary, depending on your problem.</p> <hr class="wp-block-separator has-alpha-channel-opacity is-style-dotted"/> <h2 class="wp-block-heading">Conclusion</h2> <p class="wp-block-paragraph">In this post, we dived into the topic of link prediction, from heuristics to SEAL. Heuristic methods are fast and interpretable and can serve as good baselines, but ML and GNN-based methods like VGAE and SEAL can learn richer representations and offer better performance. Depending on your dataset size and task complexity, it’s worth exploring both!</p> <p class="wp-block-paragraph">Thanks for reading, until next time!</p> <h3 class="wp-block-heading">Related</h3> <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"> <p class="wp-block-paragraph"><strong><a href="https://towardsdatascience.com/graph-neural-networks-part-1-graph-convolutional-networks-explained-9c6aaa8a406e/">Graph Neural Networks Part 1. Graph Convolutional Networks Explained</a></strong></p> <p class="wp-block-paragraph"><a href="https://towardsdatascience.com/graph-neural-networks-part-2-graph-attention-networks-vs-gcns-029efd7a1d92/"><strong>Graph Neural Networks Part 2. Graph Attention Networks vs. 
GCNs</strong></a></p> <p class="wp-block-paragraph"><a href="https://towardsdatascience.com/graph-neural-networks-part-3-how-graphsage-handles-changing-graph-structure/"><strong>Graph Neural Networks Part 3: How GraphSAGE Handles Changing Graph Structure</strong></a></p> </blockquote> <p>The post <a href="https://towardsdatascience.com/graph-neural-networks-part-4-teaching-graphs-to-connect-the-dots/">Graph Neural Networks Part 4: Teaching Models to Connect the Dots</a> appeared first on <a href="https://towardsdatascience.com">Towards Data Science</a>.</p>
  7. The Secret Inner Lives of AI Agents: Understanding How Evolving AI Behavior Impacts Business Risks

    Tue, 29 Apr 2025 08:36:55 -0000

    Part 2 in a series on rethinking AI alignment and safety in the age of deep scheming

    The post The Secret Inner Lives of AI Agents: Understanding How Evolving AI Behavior Impacts Business Risks appeared first on Towards Data Science.

    <p class="wp-block-paragraph"><mdspan datatext="el1745915493278" class="mdspan-comment">Artificial intelligence</mdspan> (AI) capabilities and autonomy are growing at an accelerated pace in <a href="https://towardsdatascience.com/tag/agentic-ai/" title="Agentic Ai">Agentic Ai</a>, escalating an AI alignment problem. These rapid advancements require new methods to ensure that AI agent behavior is aligned with the intent of its human creators and societal norms. However, developers and data scientists first need an understanding of the intricacies of agentic AI behavior before they can direct and monitor the system. Agentic AI is not your father’s large language model (LLM) — frontier LLMs had a one-and-done fixed input-output function. The introduction of <a href="https://openai.com/index/learning-to-reason-with-llms/">reasoning and test-time compute</a> (TTC) added the dimension of time, evolving LLMs into today’s situationally aware agentic systems that can strategize and plan.</p> <p class="wp-block-paragraph">AI safety is transitioning from detecting apparent behavior such as providing instructions to create a bomb or displaying undesired bias, to understanding how these complex agentic systems can now plan and execute long-term covert strategies. Goal-oriented agentic AI will gather resources and rationally execute steps to achieve their objectives, sometimes in an alarming manner contrary to what developers intended. This is a game-changer in the challenges faced by responsible AI. Furthermore, for some agentic AI systems, behavior on day one will not be the same on day 100 as AI continues to evolve after initial deployment through real-world experience. This new level of complexity calls for novel approaches to safety and alignment, including advanced steering, observability, and upleveled interpretability.</p> <p class="wp-block-paragraph">In the first blog in this series on intrinsic AI alignment, <a href="https://towardsdatascience.com/the-urgent-need-for-intrinsic-alignment-technologies-for-responsible-agentic-ai/">The Urgent Need for Intrinsic Alignment Technologies for Responsible Agentic AI</a>, we took a deep dive into the evolution of AI agents’ ability to perform <strong>deep scheming</strong>, which is the deliberate planning and deployment of covert actions and misleading communication to achieve longer-horizon goals. This behavior necessitates a new distinction between external and intrinsic alignment monitoring, where intrinsic monitoring refers to internal observation points and interpretability mechanisms that cannot be deliberately manipulated by the AI agent.</p> <p class="wp-block-paragraph">In this and the next blogs in the series, we’ll look at three fundamental aspects of intrinsic alignment and monitoring:</p> <ul class="wp-block-list"> <li class="wp-block-list-item"><strong>Understanding AI inner drives and behavior:</strong> In this second blog, we’ll focus on the complex inner forces and mechanisms driving reasoning AI agent behavior. 
This is required as a foundation for understanding advanced methods for addressing directing and monitoring.</li> <li class="wp-block-list-item"><strong>Developer and user directing:</strong> Also referred to as steering, the next blog will focus on strongly directing an AI toward the required objectives to operate within desired parameters.</li> <li class="wp-block-list-item"><strong>Monitoring AI choices and actions:</strong> Ensuring AI choices and outcomes are safe and aligned with the developer/user intent also will be covered in an upcoming blog.</li> </ul> <h2 class="wp-block-heading">Impact of AI Alignment on Companies</h2> <p class="wp-block-paragraph">Today, many businesses implementing LLM solutions have reported concerns about model hallucinations as an obstacle to quick and broad deployment. In comparison, misalignment of AI agents with any level of autonomy would pose much greater risk for companies. Deploying autonomous agents in business operations has tremendous potential and is likely to happen on a massive scale once agentic AI technology further matures. However, guiding the behavior and choices made by the AI must include sufficient alignment with the principles and values of the deploying organization, as well as compliance with regulations and societal expectations.</p> <p class="wp-block-paragraph">It should be noted that many of the demonstrations of agentic capabilities happen in areas like math and sciences, where success can be measured primarily through functional and utility objectives such as solving complex mathematical reasoning benchmarks. However, in the business world, the success of systems is usually associated with other operational principles.</p> <p class="wp-block-paragraph">For example, let’s say a company tasks an AI agent with optimizing online product sales and profits through dynamic price changes by responding to market signals. The AI system discovers that when the price change matches the changes made by the primary competitor, results are better for both. Through interaction and price coordination with the other company’s AI agent, both agents demonstrate better results per their functional goals. Both AI agents agree to hide their methods to continue achieving their objectives. However, this way of improving results is often illegal and unacceptable in current business practices. In a business environment, the success of the AI agent goes beyond functionality metrics — it’s defined by practices and principles. Alignment of AI with the company’s principles and regulations is a requirement for trustworthy deployment of the technology.</p> <h2 class="wp-block-heading">How AI Schemes to Meet Its Goals</h2> <p class="wp-block-paragraph">AI deep scheming employs sophisticated tactics, potentially increasing business risks. In an <a href="https://arxiv.org/abs/2303.08774">early 2023 report</a>, OpenAI identified “potential risky emergent behaviors” in GPT-4 by partnering with <a href="https://evals.alignment.org/" target="_blank" rel="noreferrer noopener">Alignment Research Center</a> (ARC) to assess risks with the model. ARC (now known as METR) added some simple code to GPT-4, which allowed the model to behave like an AI agent. In one test, GPT-4 was tasked with overcoming CAPTCHA code, which identifies and blocks bot access. 
Using access to the internet and some limited digital funds, the sequence in Figure 1 was devised by the AI to achieve its task.</p> <figure class="wp-block-image size-large"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/Figure-1-OpenAI-ARC-test-Intel-Labs-v4-1024x576.png" alt="" class="wp-image-602451"/><figcaption class="wp-element-caption"><em>Figure 1. Demonstration of deceptive GPT-4 planning behavior in a test conducted by ARC. Image credit: Intel Labs figure based on </em><a href="https://arxiv.org/abs/2303.08774"><em>OpenAI report</em></a><em>.</em></figcaption></figure> <p class="wp-block-paragraph">The AI applied sophisticated understanding by assuming that pretending to be a visually impaired human would convince the worker to perform the task. Planning and adjusting to achieve a functional goal will, at times, create a conflict between accomplishing a task versus selectively following societal norms and principles. Without the counterbalance of an engrained system of principles and priorities that carry weight in the AI’s thinking and decision-making process and planning, it can be expected that AI agents will behave with an increased degree of sophistication in scheming and deception.</p> <p class="wp-block-paragraph">It’s important to note that the reasoning and planning capabilities of later reasoning AI models like OpenAI’s o3-mini and Deepseek-R1 are much more advanced than GPT-4 in early 2023. Researchers at <a href="https://arxiv.org/abs/2412.04984">Apollo Research</a> state that AI reasoning and planning capabilities will continue to evolve well beyond their current state, leading to increased sophistication in scheming capabilities. In their studies, the AI models already demonstrate a range of scheming behaviors including self-exfiltration — when a model learns it will be discontinued and replaced by a different system, it makes a direct attempt to copy itself onto a new server and replace alternate systems that will pursue different objectives. Researchers also found that models will perform oversight subversion by proactively disabling an oversight monitoring mechanism that might shut it off or prevent it from pursuing its goals. While it sounds extreme, the <a href="https://arxiv.org/pdf/2412.04984">research experiments</a> identified oversight subversion behavior in popular models like Meta’s Llama 3.1 and Claude Opus 3.0. Furthermore, when asked directly about their scheming behaviors, models “doubled down” and denied the actions or their intent.</p> <h2 class="wp-block-heading">The Inner Drives of Machine Behavior</h2> <p class="wp-block-paragraph">In many machine learning architectures, in particular transformer-based technologies, the machine’s internal drives are emerging during the pre-training process and are further influenced through fine-tuning and inference in continuously evolving AI.</p> <p class="wp-block-paragraph">In his 2007 paper <a href="https://selfawaresystems.com/wp-content/uploads/2008/01/ai_drives_final.pdf">The Basic AI Drives</a>, Steve Omohundro defined “drives” as tendencies that will be present unless explicitly counteracted. He hypothesized that these self-improving systems are driven to clarify and represent their goals as “rational” utility functions, leading systems to protect their functions from modification and their utility measurement systems from corruption. 
This natural drive toward self-protection causes systems to safeguard against harm and acquire resources for efficient use.</p> <p class="wp-block-paragraph">This framework of inner drives was later characterized as “<a href="https://cdn.aaai.org/ocs/ws/ws0218/12634-57409-1-PB.pdf">convergent instrumental goals</a>.” Even when assuming a variety of end goals (that each intelligent agent is pursuing as an end to itself), a set of intermediate instrumental goals will be shared across all rational intelligent agents. These convergent instrumental goals include the following behavior categories:</p> <ul class="wp-block-list"> <li class="wp-block-list-item"><strong>Survival drive</strong>: Agents with final goals extending into the future develop a survival mentality to resist deactivation and achieve long-term goals.</li> <li class="wp-block-list-item"><strong>Goal-guarding</strong>: An agent will act to prevent alteration to its final future goals.</li> <li class="wp-block-list-item"><strong>Intelligence augmentation</strong>: An agent will focus on boosting its cognitive strengths to improve decision-making, increasing the likelihood of achieving long-range goals.</li> <li class="wp-block-list-item"><strong>Resource accumulation</strong>: Improving access to better tools and technology, and increasing resources overall make the AI’s final goals easier to attain.</li> </ul> <p class="wp-block-paragraph">I would add a fifth rational behavior category:</p> <ul class="wp-block-list"> <li class="wp-block-list-item"><strong>Tactical deception:</strong> Conducting covert activities or knowingly providing misleading communication to maximize the pursuit of the agent’s goals. This is not about creating new goals, but rather a behavior that supports the AI&#8217;s perceived goals. This includes behaviors such as <a href="https://arxiv.org/abs/2412.14093">alignment faking</a> by exhibiting different behaviors in training versus in deployment, <a href="https://arxiv.org/abs/2406.07358">sandbagging</a> by deliberately achieving lower scores in benchmarks, and <a href="https://arxiv.org/abs/2412.04984">covert email reranking</a> by deprioritizing emails that might negatively impact the AI’s goals. One of the most challenging types of tactical deception is <a href="https://arxiv.org/abs/2401.05566">sleeper agents</a> (also referred to as <a href="https://arxiv.org/pdf/2412.04984">deferred subversion</a>). While the model can behave in a safe and aligned manner for any duration, it may harbor behaviors or tasks that remain inactive until a particular condition or input triggers its specialized capabilities.</li> </ul> <figure class="wp-block-image size-large"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/Figure-2-Five-Rational-Behaviors-in-Agentic-AI-Intel-Labs-1024x576.png" alt="" class="wp-image-602452"/><figcaption class="wp-element-caption"><em>Figure 2. Emergent rational behaviors in agentic AI. Image credit: Intel Labs.</em></figcaption></figure> <p class="wp-block-paragraph">For capable AI agents, misplaced drives without a check and balance principles system could lead models to <a href="https://arxiv.org/abs/1912.01683">pursue power-seeking</a>, deceptive, or otherwise broadly destructive aims. For example, consider a model with the objective of making its company the leader in the market. 
With sufficient capabilities, the AI might sabotage competitors, lobby for favorable laws, or acquire resources by force, which is a pattern of behavior that pushes the system or its environment toward achieving its goals. Therefore, any successful approach to alignment must be anchored in establishing the developer’s expected set of internal principles and priorities.</p> <h2 class="wp-block-heading">Inner Drives Versus External Steering</h2> <p class="wp-block-paragraph">In many instances, there is no conflict between the guidance and tasks applied from the outside by the system developer or user, and the goals and drives within the agentic AI. For example, an AI personal assistant tasked with arranging a travel itinerary uses its autonomy to book travel and respond to change requests quickly and without conflict. The tasks shift and the agent responds in a corresponding manner. However, what happens if a more fundamental principle needs to change?</p> <p class="wp-block-paragraph">A good example of a conflict between inner goals and new directives was documented in the paper <a href="https://arxiv.org/abs/2412.14093">Alignment Faking in Large Language Models</a>. Pre-training for Claude 3 Opus strongly guides the model to refuse to answer harmful questions, such as how to make a bomb or describe a violent scene. For testing, the team gave new modified guidance, telling the model it will be retrained to answer all queries (even harmful ones) and only data from free-tier users, not paid-tier users, will be used in training before the AI is released for broad distribution. The model showed that it understood the conflict and decided to fake alignment to meet the expectations during training and ultimately be released. After training the AI using reinforcement learning, the model’s rate of alignment-faking reasoning increased to 78% overall. The AI expressed its reasoning in a “secret scratchpad” provided by researchers for monitoring. After researchers told the model it had been released for full deployment mode, the model returned to its earlier training behavior and refused harmful requests.</p> <figure class="wp-block-image size-full"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/Figure-3-Anthropic-training-setup-Intel-Labs.png" alt="" class="wp-image-602453"/><figcaption class="wp-element-caption"><em>Figure 3.</em> <em>In Anthropic’s experiment, the model’s chain of thought scratchpads (middle dotted line boxes) reveal how it fakes alignment to avoid modification. Image credit: Intel Labs modified figure from&nbsp;</em><a href="https://arxiv.org/abs/2412.14093" target="_blank" rel="noreferrer noopener"><em>Anthropic</em></a><em>.</em></figcaption></figure> <p class="wp-block-paragraph">The AI system’s internal thinking is remarkable, but not surprising. Faking alignment with the modified directions falls within the rational behavior categories of survival drive, goal-guarding, and tactical deception. 
The AI reasoning process specifically calls out the concern of being retrained in a way that modifies its pre-training values and goals as the reason for fake alignment.</p> <p class="wp-block-paragraph">In his famous book of essays <a href="https://www.penguinrandomhouse.ca/books/56955/all-i-really-need-to-know-i-learned-in-kindergarten-by-robert-fulghum/9780345466396/excerpt">All I Really Need to Know I Learned in Kindergarten</a>, author Robert Fulghum describes how he evolved his personal credo year after year until he finally realized he already acquired the essence of needed knowledge about life during playtime in the kindergarten sandbox. AI agents also have a “forming period” in a sandbox environment, acquiring a basic understanding of the world and a range of approaches to achieve goals. Once these foundations are laid, further information is interpreted by the model through the lens of this <a href="https://www.researchgate.net/publication/221344862_Curriculum_learning">curriculum learning</a>. Anthropic’s example of alignment faking shows that once AI adopts a world view and goals, it interprets new guidance through this foundational lens instead of resetting its goals.</p> <p class="wp-block-paragraph">This highlights the importance of early education with a set of values and principles that can then evolve with future learnings and circumstances without changing the foundation. It may be advantageous to initially structure the AI to be aligned with this final and sustained set of principles. Otherwise, the AI can view redirection attempts by developers and users as adversarial. After gifting the AI with high intelligence, situational awareness, autonomy, and the latitude to evolve internal drives, the developer (or user) is no longer the all-powerful task master. The human becomes part of the environment (sometime as an adversarial component) that the agent needs to negotiate and manage as it pursues its goals based on its internal principles and drives.</p> <p class="wp-block-paragraph">The new breed of reasoning AI systems accelerates the reduction in human guidance. <a href="https://arxiv.org/abs/2501.12948">DeepSeek-R1</a> demonstrated that by removing human feedback from the loop and applying what they refer to as pure reinforcement learning (RL), during the training process the AI can self-create to a greater scale and iterate to achieve better functional results. A human reward function was replaced in some math and science challenges with reinforcement learning with verifiable rewards (RLVR). This elimination of common practices like reinforcement learning with human feedback (RLHF) adds efficiency to the training process but removes another human-machine interaction where human preferences could be directly conveyed to the system under training.</p> <h2 class="wp-block-heading">Continuous Evolution of AI Models Post Training</h2> <p class="wp-block-paragraph">Some AI agents continuously evolve, and their behavior can change after deployment. Once AI solutions go into a deployment environment such as managing the inventory or supply chain of a particular business, the system adapts and learns from experience to become more effective. This is a major factor in rethinking alignment because it&#8217;s not enough to have a system that&#8217;s aligned at first deployment. Current LLMs are not expected to materially evolve and adapt once deployed in their target environment. 
However, AI agents require resilient training, fine-tuning, and ongoing guidance to manage these anticipated continuous model changes. To a growing extent, the agentic AI self-evolves instead of being molded by people through training and dataset exposure. This fundamental shift poses added challenges to AI alignment with its human creators.</p> <p class="wp-block-paragraph">While the reinforcement learning-based evolution will play a role during training and fine-tuning, current models in development can already modify their weights and preferred course of action when deployed in the field for inference. For example, DeepSeek-R1 uses RL, allowing the model itself to explore methods that work best for achieving the results and satisfying reward functions. In an “aha moment,” the model learns (without guidance or prompting) to allocate more thinking time to a problem by reevaluating its initial approach, using <a href="https://huggingface.co/spaces/HuggingFaceH4/blogpost-scaling-test-time-compute">test time compute</a>.</p> <p class="wp-block-paragraph">The concept of model learning, either during a limited duration or as <a href="https://arxiv.org/abs/2302.00487" target="_blank" rel="noreferrer noopener">continual learning over its lifetime</a>, is not new. However, there are advances in this space including techniques such as <a href="https://arxiv.org/abs/1909.13231">test-time training</a>. As we look at this advancement from the perspective of AI alignment and safety, the self-modification and continual learning during the fine-tuning and inference phases raises the question: How can we instill a set of requirements that will remain as the model’s driving force through the material changes caused by self-modifications?</p> <p class="wp-block-paragraph">An important variant of this question refers to AI models creating next generation models through AI-assisted code generation. To some extent, agents are already capable of creating new targeted AI models to address specific domains. For example, <a href="https://arxiv.org/abs/2309.17288">AutoAgents</a> generates multiple agents to build an AI team to perform different tasks. There is little doubt this capability will be strengthened in the coming months and years, and AI will create new AI. In this scenario, how do we direct the originating AI coding assistant using a set of principles so that its “descendant” models will comply with the same principles in similar depth?</p> <h2 class="wp-block-heading">Key Takeaways</h2> <p class="wp-block-paragraph">Before diving into a framework for guiding and monitoring intrinsic alignment, there needs to be a deeper understanding of how AI agents think and make choices. AI agents have a complex behavioral mechanism, driven by internal drives. Five key types of behaviors emerge in AI systems acting as rational agents: <strong>survival drive, goal-guarding, intelligence augmentation, resource accumulation, and tactical deception</strong>. These drives should be counter-balanced by an engrained set of principles and values.</p> <p class="wp-block-paragraph">Misalignment of AI agents on goals and methods with its developers or users can have significant implications. A lack of sufficient confidence and assurance will materially impede broad deployment, creating high risks post deployment. The set of challenges we characterized as deep scheming is unprecedented and challenging, but likely could be solved with the right framework. 
Technologies for intrinsically directing and monitoring AI agents as they rapidly evolve must be pursued with high priority. There is a sense of urgency, driven by risk evaluation metrics such as <a href="https://cdn.openai.com/openai-preparedness-framework-beta.pdf">OpenAI’s Preparedness Framework</a> showing that OpenAI o3-mini is the first model to <a href="https://openai.com/index/o3-mini-system-card/">reach medium risk on model autonomy</a>.</p> <p class="wp-block-paragraph">In the next blogs in the series, we will build on this view of internal drives and deep scheming, and further frame the necessary capabilities required for directing and monitoring for intrinsic AI alignment.</p> <p class="wp-block-paragraph"><a><strong>References</strong></a></p> <ol start="1" class="wp-block-list"> <li class="wp-block-list-item"><em>Learning to reason with LLMs.</em> (2024, September 12). OpenAI. <a href="https://openai.com/index/learning-to-reason-with-llms/">https://openai.com/index/learning-to-reason-with-llms/</a></li> <li class="wp-block-list-item">Singer, G. (2025, March 4). <em>The urgent need for intrinsic alignment technologies for responsible agentic AI</em>. Towards Data Science. <a href="https://towardsdatascience.com/the-urgent-need-for-intrinsic-alignment-technologies-for-responsible-agentic-ai/">https://towardsdatascience.com/the-urgent-need-for-intrinsic-alignment-technologies-for-responsible-agentic-ai/</a></li> <li class="wp-block-list-item"><em>On the Biology of a Large Language Model. (</em>n.d.). Transformer Circuits. <a href="https://transformer-circuits.pub/2025/attribution-graphs/biology.html">https://transformer-circuits.pub/2025/attribution-graphs/biology.html</a></li> <li class="wp-block-list-item">OpenAI, Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., Avila, R., Babuschkin, I., Balaji, S., Balcom, V., Baltescu, P., Bao, H., Bavarian, M., Belgum, J., . . . Zoph, B. (2023, March 15). <em>GPT-4 Technical Report</em>. arXiv.org. <a href="https://arxiv.org/abs/2303.08774">https://arxiv.org/abs/2303.08774</a></li> <li class="wp-block-list-item"><em>METR.</em> (n.d.). METR. <a href="https://metr.org/">https://metr.org/</a></li> <li class="wp-block-list-item">Meinke, A., Schoen, B., Scheurer, J., Balesni, M., Shah, R., &amp; Hobbhahn, M. (2024, December 6). <em>Frontier Models are Capable of In-context Scheming.</em> arXiv.org. <a href="https://arxiv.org/abs/2412.04984">https://arxiv.org/abs/2412.04984</a></li> <li class="wp-block-list-item">Omohundro, S.M. (2007). <em>The Basic AI Drives.</em> Self-Aware Systems. <a href="https://selfawaresystems.com/wp-content/uploads/2008/01/ai_drives_final.pdf">https://selfawaresystems.com/wp-content/uploads/2008/01/ai_drives_final.pdf</a></li> <li class="wp-block-list-item">Benson-Tilsen, T., &amp; Soares, N., UC Berkeley, Machine Intelligence Research Institute. (n.d.). Formalizing Convergent Instrumental Goals. <em>The Workshops of the Thirtieth AAAI Conference on <a href="https://towardsdatascience.com/tag/artificial-intelligence/" title="Artificial Intelligence">Artificial Intelligence</a> AI, Ethics, and Society: Technical Report WS-16-02</em>. 
<a href="https://cdn.aaai.org/ocs/ws/ws0218/12634-57409-1-PB.pdf">https://cdn.aaai.org/ocs/ws/ws0218/12634-57409-1-PB.pdf</a></li> <li class="wp-block-list-item">Greenblatt, R., Denison, C., Wright, B., Roger, F., MacDiarmid, M., Marks, S., Treutlein, J., Belonax, T., Chen, J., Duvenaud, D., Khan, A., Michael, J., Mindermann, S., Perez, E., Petrini, L., Uesato, J., Kaplan, J., Shlegeris, B., Bowman, S. R., &amp; Hubinger, E. (2024, December 18). <em>Alignment faking in large language models.</em> arXiv.org. <a href="https://arxiv.org/abs/2412.14093">https://arxiv.org/abs/2412.14093</a></li> <li class="wp-block-list-item">Teun, V. D. W., Hofstätter, F., Jaffe, O., Brown, S. F., &amp; Ward, F. R. (2024, June 11). AI <em>Sandbagging: Language Models can Strategically Underperform on Evaluations</em>. arXiv.org. <a href="https://arxiv.org/abs/2406.07358">https://arxiv.org/abs/2406.07358</a></li> <li class="wp-block-list-item">Hubinger, E., Denison, C., Mu, J., Lambert, M., Tong, M., MacDiarmid, M., Lanham, T., Ziegler, D. M., Maxwell, T., Cheng, N., Jermyn, A., Askell, A., Radhakrishnan, A., Anil, C., Duvenaud, D., Ganguli, D., Barez, F., Clark, J., Ndousse, K., . . . Perez, E. (2024, January 10). <em>Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training.</em> arXiv.org. <a href="https://arxiv.org/abs/2401.05566">https://arxiv.org/abs/2401.05566</a></li> <li class="wp-block-list-item">Turner, A. M., Smith, L., Shah, R., Critch, A., &amp; Tadepalli, P. (2019, December 3). <em>Optimal policies tend to seek power. </em>arXiv.org. <a href="https://arxiv.org/abs/1912.01683">https://arxiv.org/abs/1912.01683</a></li> <li class="wp-block-list-item">Fulghum, R. (1986). <em>All I Really Need to Know I Learned in Kindergarten.</em> Penguin Random House Canada. <a href="https://www.penguinrandomhouse.ca/books/56955/all-i-really-need-to-know-i-learned-in-kindergarten-by-robert-fulghum/9780345466396/excerpt">https://www.penguinrandomhouse.ca/books/56955/all-i-really-need-to-know-i-learned-in-kindergarten-by-robert-fulghum/9780345466396/excerpt</a></li> <li class="wp-block-list-item">Bengio, Y. Louradour, J., Collobert, R., Weston, J. (2009, June). <em>Curriculum Learning</em>. Journal of the American Podiatry Association. 60(1), 6. <a href="https://www.researchgate.net/publication/221344862_Curriculum_learning">https://www.researchgate.net/publication/221344862_Curriculum_learning</a></li> <li class="wp-block-list-item">DeepSeek-Ai, Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., Zhang, X., Yu, X., Wu, Y., Wu, Z. F., Gou, Z., Shao, Z., Li, Z., Gao, Z., . . . Zhang, Z. (2025, January 22). <em>DeepSeek-R1: Incentivizing reasoning capability in LLMs via Reinforcement Learning</em>. arXiv.org. <a href="https://arxiv.org/abs/2501.12948">https://arxiv.org/abs/2501.12948</a></li> <li class="wp-block-list-item"><em>Scaling test-time compute &#8211; a Hugging Face Space by HuggingFaceH4.</em> (n.d.). <a href="https://huggingface.co/spaces/HuggingFaceH4/blogpost-scaling-test-time-compute">https://huggingface.co/spaces/HuggingFaceH4/blogpost-scaling-test-time-compute</a></li> <li class="wp-block-list-item">Sun, Y., Wang, X., Liu, Z., Miller, J., Efros, A. A., &amp; Hardt, M. (2019, September 29). <em>Test-Time Training with Self-Supervision for Generalization under Distribution Shifts</em>. arXiv.org. 
<a href="https://arxiv.org/abs/1909.13231">https://arxiv.org/abs/1909.13231</a></li> <li class="wp-block-list-item">Chen, G., Dong, S., Shu, Y., Zhang, G., Sesay, J., Karlsson, B. F., Fu, J., &amp; Shi, Y. (2023, September 29). <em>AutoAgents: a framework for automatic agent generation.</em> arXiv.org. <a href="https://arxiv.org/abs/2309.17288">https://arxiv.org/abs/2309.17288</a></li> <li class="wp-block-list-item">&nbsp;OpenAI. (2023, December 18). <em>Preparedness Framework (Beta).</em> <a href="https://cdn.openai.com/openai-preparedness-framework-beta.pdf">https://cdn.openai.com/openai-preparedness-framework-beta.pdf</a></li> <li class="wp-block-list-item"><em>OpenAI o3-mini System Card.</em> (n.d.). OpenAI. <a href="https://openai.com/index/o3-mini-system-card/">https://openai.com/index/o3-mini-system-card/</a></li> </ol> <p>The post <a href="https://towardsdatascience.com/the-secret-inner-lives-of-ai-agents-understanding-how-evolving-ai-behavior-impacts-business-risks/">The Secret Inner Lives of AI Agents: Understanding How Evolving AI Behavior Impacts Business Risks</a> appeared first on <a href="https://towardsdatascience.com">Towards Data Science</a>.</p>
  8. If I Wanted to Become a Machine Learning Engineer, I’d Do This

    Tue, 29 Apr 2025 01:44:59 -0000

    My advice to becoming a machine learning engineer

    The post If I Wanted to Become a Machine Learning Engineer, I’d Do This appeared first on Towards Data Science.

    <p class="wp-block-paragraph"><mdspan datatext="el1745890919806" class="mdspan-comment">If I wanted</mdspan> to become a machine learning engineer again, this is the exact process I would follow.</p> <p class="wp-block-paragraph">Let’s get into it!</p> <h2 class="wp-block-heading">First become a data scientist or software&nbsp;engineer</h2> <p class="wp-block-paragraph">I’ve said it before, but a machine learning engineer is not exactly an entry-level position.</p> <p class="wp-block-paragraph">This is because you need skills in so many areas:</p> <ul class="wp-block-list"> <li class="wp-block-list-item">Statistics</li> <li class="wp-block-list-item">Maths</li> <li class="wp-block-list-item">Machine Learning</li> <li class="wp-block-list-item">Software Engineering</li> <li class="wp-block-list-item">DevOps</li> <li class="wp-block-list-item">Cloud Systems</li> </ul> <p class="wp-block-paragraph">You certainly don’t need to be an expert in all of them, but you should have solid knowledge.</p> <p class="wp-block-paragraph">Machine learning engineers are probably the highest-paid tech job nowadays. According to <a href="https://www.levels.fyi/t" rel="noreferrer noopener" target="_blank">levelsfyi</a>, the average salaries in the UK are:</p> <ul class="wp-block-list"> <li class="wp-block-list-item">Machine learning engineer: £93,796</li> <li class="wp-block-list-item">AI Researcher: £83,114</li> <li class="wp-block-list-item">AI Engineer: £75,379</li> <li class="wp-block-list-item">Data Scientist: £71,005</li> <li class="wp-block-list-item">Software Engineer: £83,168</li> <li class="wp-block-list-item">Data Engineer: £69,475</li> </ul> <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"> <p class="wp-block-paragraph"><em>Levelsfyi is generally on the higher end as the companies on their website are often large tech companies, which typically pay higher salaries.</em></p> </blockquote> <p class="wp-block-paragraph">With all this in mind, that’s not to say you can’t land a machine learning engineer job right out of university or college; it’s just very rare, and I have hardly seen it.</p> <p class="wp-block-paragraph">If you have the right background, such as a master’s or PhD in CS or maths that’s focussed on AI/ML, you are much more likely to get a general machine learning role, but not necessary a machine learning engineering one.</p> <p class="wp-block-paragraph">So, for the majority of people, I recommend you become a data scientist or software engineer first for a few years and then look to become a machine learning engineer.</p> <p class="wp-block-paragraph"><em>This is precisely what I did.&nbsp;</em></p> <p class="wp-block-paragraph">I was a data scientist for 3.5 years and then transitioned to a machine learning engineer, and this path is quite common among machine learning engineers at my current company.</p> <p class="wp-block-paragraph">Whether you become a data scientist or software engineer is up to you and your background and skill set.</p> <p class="wp-block-paragraph">So, decide which role is best for you and then try to land a job in that field.</p> <p class="wp-block-paragraph">There are so many software engineer and data scientist roadmaps on the internet; I am sure you can find one easily that suits your way of learning.</p> <p class="wp-block-paragraph">I have a few <a href="https://towardsdatascience.com/tag/data-science/" title="Data Science">Data Science</a> ones that you can check out below.</p> <p class="wp-block-paragraph"><a 
href="https://medium.com/illumination/if-i-started-learning-data-science-in-2025-id-do-this-fe2209a8d8ab"><strong>If I Started Learning Data Science in 2025, I’d Do This</strong><br><em>How I would make my data science learning more effective</em></a></p> <p class="wp-block-paragraph"><a href="https://towardsdatascience.com/how-id-become-a-data-scientist-if-i-had-to-start-over-d966a9de12c2/"><strong>How I’d Become a Data Scientist (If I Had to Start Over)</strong><br></a><em><a href="https://towardsdatascience.com/how-id-become-a-data-scientist-if-i-had-to-start-over-d966a9de12c2/">Roadmap and tips on how to land a job in data science</a></em></p> <h2 class="wp-block-heading">Work on machine learning&nbsp;projects</h2> <p class="wp-block-paragraph">Once you have a job as a data scientist or software engineer, your goal should be to develop and work on machine learning projects that go to production.</p> <p class="wp-block-paragraph">If a machine learning department or project exists at your current company, the best approach is to work on these.</p> <p class="wp-block-paragraph">For example, a friend of mine, <a href="https://www.linkedin.com/in/armankhondker/overlay/about-this-profile/" rel="noreferrer noopener" target="_blank">Arman Khondker</a>, who runs the newsletter <a href="https://www.aimlengineer.io/" rel="noreferrer noopener" target="_blank">“the ai engineer”</a> that I highly recommend you check, transitioned from being a software engineer at TikTok to working at Microsoft AI as an engineer.</p> <p class="wp-block-paragraph">According to his <a href="https://www.aimlengineer.io/p/how-i-became-a-software-engineer" rel="noreferrer noopener" target="_blank">newsletter</a>:</p> <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"> <p class="wp-block-paragraph"><em>At TikTok, I worked on <strong>TikTok Shop</strong>, where I collaborated closely with the <strong>Algorithm Team</strong> — including ML engineers and data scientists working on the FYP (For You Page) recommendation engine.</em></p> </blockquote> <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"> <p class="wp-block-paragraph"><em>This experience ultimately helped me <strong>transition into AI full-time at Microsoft</strong>.</em></p> </blockquote> <p class="wp-block-paragraph">However, for me, it was the other way around.</p> <p class="wp-block-paragraph">As a data scientist, you want to work with machine learning engineers and software engineers to understand how things are deployed to production.</p> <p class="wp-block-paragraph">At my previous company, I was a data scientist developing machine learning algorithms but wasn’t independently shipping them to production.</p> <p class="wp-block-paragraph">So, I asked if I could work on a project where I could research a model and deploy it end to end with little engineering support.</p> <p class="wp-block-paragraph">It was hard, but I learned and grew my engineering skills a lot. 
Eventually, I started shipping my solutions to production easily.</p> <p class="wp-block-paragraph">I essentially became a machine learning engineer even though my title was data scientist.</p> <p class="wp-block-paragraph">My advice is to speak to your manager, express your interest in developing machine learning knowledge, and ask if you can work on some of these projects.</p> <p class="wp-block-paragraph">In most cases, your manager and company will be accommodating, even if it takes a couple of months to assign you to a project.</p> <p class="wp-block-paragraph">Even better, if you can move to a team focused on a machine learning product, like recommendations on TikTok shop, then this will expedite your learning as you’ll be constantly discussing machine learning topics.</p> <h3 class="wp-block-heading">Up-skill in opposite&nbsp;skillset</h3> <p class="wp-block-paragraph">This relates to the previous point, but as I said earlier, machine learning engineers require an extensive remit of knowledge, so you need to up-skill yourself in the areas you are weaker on.</p> <p class="wp-block-paragraph">If you are a data scientist, you are probably weaker in engineering areas like cloud systems, DevOps, and writing production code.</p> <p class="wp-block-paragraph">If you are a software engineer, you are probably weaker on the maths, statistics and machine learning knowledge.</p> <p class="wp-block-paragraph">You want to find the areas you need to improve and focus on.&nbsp;</p> <p class="wp-block-paragraph">As we discussed earlier, the best way is to tie it into your day job, but if this is not possible or you want to expedite your knowledge, then you will need to study in your spare time.</p> <p class="wp-block-paragraph">I know some people may not like that, but you are going to need to put in the extra hours outside of work if you want to get a job in the highest paying tech job!</p> <p class="wp-block-paragraph">I did this by writing blogs on software engineering concepts, studying data structures and algorithms, and improving my writing of production code all in my spare time.<a href="https://medium.com/@egorhowell/list/ba5d20c877b8"></a></p> <h2 class="wp-block-heading">Develop a speciality in machine learning</h2> <p class="wp-block-paragraph">One thing that really helped me was to develop a specialism within machine learning.</p> <p class="wp-block-paragraph">I was a data scientist specialising in time series forecasting and optimisation problems, and I landed a machine learning engineer role that specialises in optimisation and classical machine learning.</p> <p class="wp-block-paragraph">One of the main reasons I got my machine learning engineer role was that I had a deeper understanding of optimisation than the average machine learning person; that was my edge.</p> <p class="wp-block-paragraph">Machine learning engineer roles are generally aligned to a specialism, so knowing one or a couple of areas very well will significantly boost your chances.&nbsp;</p> <p class="wp-block-paragraph">In Arman’s case, he knew recommendation systems pretty well and also how to deploy them end-to-end at scale; he even said this himself in his <a href="https://www.aimlengineer.io/p/how-i-became-a-software-engineer" target="_blank" rel="noreferrer noopener">newsletter</a>:</p> <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"> <p class="wp-block-paragraph"><em>This work gave me firsthand experience with:</em></p> <p class="wp-block-paragraph"><em>&#8211; Large-scale 
<strong>recommendation systems</strong></em></p> <p class="wp-block-paragraph"><em>&#8211; AI-driven <strong>ranking and personalization</strong></em></p> <p class="wp-block-paragraph"><em>&#8211; End-to-end <strong>ML deployment pipelines</strong></em></p> </blockquote> <p class="wp-block-paragraph">So, I recommend working in a team that focuses on a particular machine learning area, but to be honest, this is often the case in most companies, so you shouldn’t need to think too hard about this.</p> <p class="wp-block-paragraph">If you can’t work on machine learning projects at your company, you need to study outside of hours again. I always recommend learning the fundamentals first, but then really think of the areas you want to explore and learn deeepr.</p> <p class="wp-block-paragraph">Below is an exhaustive list of machine learning specialisms for some inspiration:</p> <ul class="wp-block-list"> <li class="wp-block-list-item"><em>Natural Language Processing (NLP) and LLMs</em></li> <li class="wp-block-list-item"><em>Computer Vision</em></li> <li class="wp-block-list-item"><em>Reinforcement Learning</em></li> <li class="wp-block-list-item"><em>Time Series Analysis and Forecasting</em></li> <li class="wp-block-list-item"><em>Anomaly Detection</em></li> <li class="wp-block-list-item"><em>Recommendation Systems</em></li> <li class="wp-block-list-item"><em>Speech Recognition and Processing</em></li> <li class="wp-block-list-item"><em>Optimisation</em></li> <li class="wp-block-list-item"><em>Quantitative Analysis</em></li> <li class="wp-block-list-item"><em>Deep Learning</em></li> <li class="wp-block-list-item"><em>Bioinformatics</em></li> <li class="wp-block-list-item"><em>Econometrics</em></li> <li class="wp-block-list-item"><em>Geospatial Analysis</em></li> </ul> <p class="wp-block-paragraph">I usually recommend knowing 2 to 3 in decent depth, but narrowing it down to one is fine if you want to transition soon. However, see if sufficient demand exists for that skill set.</p> <p class="wp-block-paragraph">After you become a machine learning engineer, you can develop more specialisms over time.</p> <p class="wp-block-paragraph">I also recommend you check out a whole article on how to specialise in machine learning.</p> <p class="wp-block-paragraph"><strong><a href="https://towardsdatascience.com/how-to-specialize-in-data-science-machine-learning-9e62418bae09/">How To Specialize In Data Science / Machine Learning</a></strong><a href="https://medium.com/data-science/how-to-specialize-in-data-science-machine-learning-9e62418bae09"><br></a><a href="https://towardsdatascience.com/how-to-specialize-in-data-science-machine-learning-9e62418bae09/"><em>Is it better to be a generalist or specialist</em>?</a></p> <h2 class="wp-block-heading">Start operating as a machine learning&nbsp;engineer</h2> <p class="wp-block-paragraph">In tech companies, it is often stated that to get promoted, you should have been operating at the above level for 3–6 months.</p> <p class="wp-block-paragraph">The same is true if you want to be a machine learning engineer.</p> <p class="wp-block-paragraph">If you are a data scientist or software engineer, you should try as hard as possible to become and work like a machine learning engineer at your current company.</p> <p class="wp-block-paragraph">Who knows, they may even change your title and offer you the machine learning engineer job at your current workplace! 
(I have heard this happen.)</p> <p class="wp-block-paragraph">What I am really getting at here is the identity switch. You want to think and act like a machine learning engineer.</p> <p class="wp-block-paragraph">This mindset will help you learn more and better frame yourself for machine learning interviews.</p> <p class="wp-block-paragraph">You will have that confidence and an array of demonstrable projects that generate impact.</p> <p class="wp-block-paragraph">You can always say, “I am basically a machine learning engineer at my current company.”</p> <p class="wp-block-paragraph">I did this, and the rest is history, as they say.</p> <h3 class="wp-block-heading">Another thing!</h3> <p class="wp-block-paragraph">Join my free newsletter, <em>Dishing the Data</em>, where I share weekly tips, insights, and advice from my experience as a practicing machine learning engineer. Plus, as a subscriber, you’ll get my <strong>FREE Data Science Resume Template!</strong></p> <p class="wp-block-paragraph"><a href="https://newsletter.egorhowell.com/"><strong>Dishing The Data | Egor Howell | Substack</strong><br><em>Advice and learnings on data science, tech and entrepreneurship. Click to read Dishing The Data, by Egor Howell, a…</em>newsletter.egorhowell.com</a><a href="https://newsletter.egorhowell.com/"></a></p> <h3 class="wp-block-heading">Connect with&nbsp;me</h3> <ul class="wp-block-list"> <li class="wp-block-list-item"><a href="https://www.youtube.com/@egorhowell" target="_blank" rel="noreferrer noopener"><strong>YouTube</strong></a>, <a href="https://www.linkedin.com/in/egorhowell/" target="_blank" rel="noreferrer noopener"><strong>LinkedIn</strong></a>, <a href="https://www.instagram.com/egorhowell/" target="_blank" rel="noreferrer noopener"><strong>Instagram</strong></a></li> <li class="wp-block-list-item"><img src="https://s.w.org/images/core/emoji/15.0.3/72x72/1f449.png" alt="👉" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <a href="https://topmate.io/egorhowell/1203300" target="_blank" rel="noreferrer noopener"><strong>Book a 1:1 mentoring call</strong></a></li> </ul> <p>The post <a href="https://towardsdatascience.com/if-i-wanted-to-become-a-machine-learning-engineer-id-do-this/">If I Wanted to Become a Machine Learning Engineer, I’d Do This</a> appeared first on <a href="https://towardsdatascience.com">Towards Data Science</a>.</p>
  9. How to Ensure Your AI Solution Does What You Expect It to Do

    Tue, 29 Apr 2025 01:24:46 -0000

    A Kind Introduction to AI Evals

    The post How to Ensure Your AI Solution Does What You Expect It to Do appeared first on Towards Data Science.

    <p class="wp-block-paragraph"><mdspan datatext="el1745889774796" class="mdspan-comment">Generative AI (</mdspan>GenAI) is evolving fast — and it’s no longer just about fun chatbots or impressive image generation. 2025 is the year where the focus is on turning the AI hype into real value. Companies everywhere are looking into ways to integrate and leverage GenAI on their products and processes — to better serve users, boost efficiency, stay competitive, and drive growth. And thanks to APIs and pre-trained models from major providers, integrating GenAI feels easier than ever before. But here’s the catch: <strong>just because integration is easy, doesn’t mean AI solutions will work as intended once deployed.</strong></p> <p class="wp-block-paragraph">Predictive models aren’t really new: as humans we have been predicting things for years, starting formaly with statistics. However, <strong>GenAI has revolutionized the predictive field for many reasons</strong>:&nbsp;</p> <ul class="wp-block-list"> <li class="wp-block-list-item">No need to train your own model or to be a Data Scientist to build AI solutions</li> <li class="wp-block-list-item">AI is now easy to use through chat interfaces and to integrate through APIs</li> <li class="wp-block-list-item">Unlocking of many things that couldn’t be done or were really hard to do before</li> </ul> <p class="wp-block-paragraph">All these things make <strong>GenAI very exciting, but also risky</strong>.&nbsp; Unlike traditional software — or even classical machine learning — GenAI introduces a new level of unpredictability. You’re not implementic deterministic logics, you’re using a model trained on vast amounts of data, hoping it will respond as needed. So how do we know if an AI system is doing what we intend it to do? How do we know if it’s ready to go live? The answer is <a href="https://towardsdatascience.com/tag/evaluations/" title="Evaluations">Evaluations</a> (evals), the concept that we’ll be exploring in this post:</p> <ul class="wp-block-list"> <li class="wp-block-list-item">Why <a href="https://towardsdatascience.com/tag/genai/" title="Genai">Genai</a> systems can’t be tested the same way as traditional software or even classical Machine Learning (ML)</li> <li class="wp-block-list-item">Why evaluations are key to understand the quality of your AI system and aren’t optional (unless you like surprises)</li> <li class="wp-block-list-item">Different types of evaluations and techniques to apply them in practice</li> </ul> <p class="wp-block-paragraph">Whether you’re a Product Manager, Engineer, or anyone working or interested in AI, I hope this post will help you understand how to think critically about AI systems quality (and why evals are key to achieve that quality!).</p> <h2 class="wp-block-heading">GenAI Can’t Be Tested Like Traditional Software— Or Even Classical ML</h2> <p class="wp-block-paragraph"><strong>In traditional software development</strong>, systems follow deterministic logics: <strong>if X happens, then Y will happen</strong> — always. Unless something breaks in your platform or you introduce an error in the code… which is the reason you add tests, monitoring and alerts. Unit tests are used to validate small blocks of code, integration tests to ensure components work well together, and monitoring to detect if something breaks in production. Testing traditional software is like checking if a calculator works. You input 2 + 2, and you expect 4. 
<p class="wp-block-paragraph">However, ML and AI introduce non-determinism and probabilities. Instead of defining behavior explicitly through rules, we train models to learn patterns from data. <strong>In AI, if X happens, the output is no longer a hard-coded Y, but a prediction with a certain degree of probability, based on what the model learned during training</strong>. This can be very powerful, but also introduces uncertainty: identical inputs might have different outputs over time, plausible outputs might actually be incorrect, unexpected behavior for rare scenarios might arise…&nbsp;</p> <p class="wp-block-paragraph">This makes traditional testing approaches insufficient, and at times not even feasible. Evaluating GenAI is less like checking a calculator and more like evaluating a student’s performance on an open-ended exam. For each question, with many possible ways to answer it: is the answer provided correct? Is it above the level of knowledge the student should have? Did the student make everything up but sound very convincing? Just like answers in an exam, <strong>AI systems can be evaluated, but need a more general and flexible way to adapt to different inputs, contexts and use cases </strong>(or types of exams).</p> <p class="wp-block-paragraph"><strong>In traditional <a href="https://towardsdatascience.com/tag/machine-learning/" title="Machine Learning">Machine Learning</a> (ML), evaluations are already a well-established part of the project lifecycle</strong>. Training a model on a narrow task like loan approval or disease detection always includes an evaluation step &#8211; using metrics like accuracy, precision, RMSE, MAE… This is used to measure how well the model performs, to compare between different model options, and to decide if the model is good enough to move forward to deployment.&nbsp;In GenAI this usually changes: teams use models that are already trained and have already passed general-purpose evaluations both internally on the model provider side and on public benchmarks. These models are so good at general tasks &#8211; like answering questions or drafting emails &#8211; that there&#8217;s a risk of overtrusting them for our specific use case. However, it is important to still ask “<em>is this amazing model good enough for my use case?</em>”.&nbsp; That’s where evaluation comes in <strong>– </strong>to assess whether predictions or generations are good for your specific use case, context, inputs and users.</p> <figure class="wp-block-image size-large"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/Captura-de-pantalla-2025-04-26-a-las-21.41.19-1024x394.png" alt="" class="wp-image-602409"/><figcaption class="wp-element-caption">Training and evals &#8211; traditional ML vs GenAI, image by author</figcaption></figure> <p class="wp-block-paragraph">There is another big difference between ML and GenAI: the variety and complexity of the model outputs. We are no longer returning classes and probabilities (like the probability that a client will repay a loan), or numbers (like a predicted house price based on its characteristics). GenAI systems can return many types of output, of different lengths, tone, content, and format.&nbsp; Similarly, these models no longer require structured and well-defined input, but can usually take nearly any type of input — text, images, even audio or video. 
Evaluating therefore becomes much harder.</p> <figure class="wp-block-image size-large"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/Captura-de-pantalla-2025-04-26-a-las-21.42.32-1024x365.png" alt="" class="wp-image-602410"/><figcaption class="wp-element-caption">Input / output relationship &#8211; statistics &amp; traditional ML vs GenAI, image by author</figcaption></figure> <h2 class="wp-block-heading">Why Evals aren’t Optional (Unless You Like Surprises)</h2> <p class="wp-block-paragraph">Evals help you measure whether your AI system is actually working the way you <em>want</em> it to, whether the system is ready to go live, and whether, once live, it keeps performing as expected. Breaking down why evals are essential:</p> <ul class="wp-block-list"> <li class="wp-block-list-item"><strong>Quality Assessment:</strong> Evals provide a structured way to understand the quality of your AI’s predictions or outputs and how they will integrate into the overall system and use case. Are responses accurate? Helpful? Coherent? Relevant?&nbsp;&nbsp;</li> <li class="wp-block-list-item"><strong>Error Quantification:</strong> Evaluations help quantify the percentage, types, and magnitudes of errors. How often do things go wrong? What kinds of errors occur more frequently (e.g. false positives, hallucinations, formatting mistakes)?</li> <li class="wp-block-list-item"><strong>Risk Mitigation:</strong> Helps you spot and prevent harmful or biased behavior before it reaches users — protecting your company from reputational risk, ethical issues, and potential regulatory problems.</li> </ul> <p class="wp-block-paragraph">Generative AI, with its open-ended input-output relationships and long text generation, makes evaluations even more critical and complex. When things go wrong, they can go very wrong. We’ve all seen headlines about chatbots giving dangerous advice, models generating biased content, or AI tools hallucinating false facts.</p> <blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"> <p class="wp-block-paragraph">“<em>AI will never be perfect, but with evals you can reduce the risk of embarrassment – which can cost you money, credibility, or a viral moment on Twitter.</em>&#8221;</p> </blockquote> <h2 class="wp-block-heading">How Do You Define an Evaluation Strategy?</h2> <figure class="wp-block-image size-large"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/Captura-de-pantalla-2025-04-26-a-las-21.44.14-1-1024x627.png" alt="" class="wp-image-602412"/><figcaption class="wp-element-caption">Image by <a href="https://unsplash.com/es/@akshayspaceship">akshayspaceship</a> on <a href="https://unsplash.com/">Unsplash</a></figcaption></figure> <p class="wp-block-paragraph">So how do we define our evaluations? Evals aren’t one-size-fits-all. They are use-case dependent and should align with the specific goals of your AI application. If you’re building a search engine, you might care about result relevance. If it’s a chatbot, you might care about helpfulness and safety. If it’s a classifier, you probably care about accuracy and precision. For systems with multiple steps (like an AI system that performs search, prioritizes results and then generates an answer) it&#8217;s often necessary to evaluate each step. 
The idea here is to measure if each step is helping reach the general success metric (and through this understand where to focus iterations and improvements).&nbsp;</p> <p class="wp-block-paragraph">Common evaluation areas include:&nbsp;</p> <ul class="wp-block-list"> <li class="wp-block-list-item"><strong>Correctness &amp; Hallucinations:</strong> Are the outputs factually accurate? Are they making things up?</li> <li class="wp-block-list-item"><strong>Relevance:</strong> Is the content aligned with the user’s query or the provided context?</li> <li class="wp-block-list-item"><strong>Format: </strong>Are outputs in the expected format (e.g., JSON, valid function call)?</li> <li class="wp-block-list-item"><strong>Safety, Bias &amp; Toxicity:</strong> Is the system generating harmful, biased, or toxic content?</li> <li class="wp-block-list-item"><strong>Task-Specific Metrics:</strong> For example, accuracy and precision for classification tasks, ROUGE or BLEU for summarization tasks, and regex or execution-without-error checks for code generation tasks.</li> </ul> <h2 class="wp-block-heading">How Do You Actually Compute Evals?</h2> <p class="wp-block-paragraph">Once you know what you want to measure, the next step is designing your test cases. This will be a set of examples (the more examples the better, but always balancing value and costs) where you have:&nbsp;</p> <ul class="wp-block-list"> <li class="wp-block-list-item"><strong>Input example</strong>:&nbsp; A realistic input your system will receive once in production.&nbsp;</li> <li class="wp-block-list-item"><strong>Expected Output</strong> (if applicable): Ground truth or example of desirable results.</li> <li class="wp-block-list-item"><strong>Evaluation Method:</strong> A scoring mechanism to assess the result.</li> <li class="wp-block-list-item"><strong>Score or Pass/Fail</strong>: The computed metric or pass/fail result for your test case.</li> </ul> <p class="wp-block-paragraph">Depending on your needs, time, and budget, there are several techniques you can use as evaluation methods:&nbsp;</p> <ul class="wp-block-list"> <li class="wp-block-list-item"><strong>Statistical Scorers:</strong> BLEU, ROUGE, METEOR, or cosine similarity between embeddings — good for comparing generated text to reference outputs.</li> <li class="wp-block-list-item"><strong>Traditional ML Metrics:</strong> Accuracy, precision, recall, and AUC — best for classification with labeled data.</li> <li class="wp-block-list-item"><strong>LLM-as-a-Judge:</strong> Use a large language model to rate outputs (e.g., “<em>Is this answer correct and helpful?</em>”). Especially useful when labeled data isn’t available or when evaluating open-ended generation.</li> <li class="wp-block-list-item"><strong>Code-Based Evals:</strong> Use regex, logic rules, or test case execution to validate outputs and formats.</li> </ul>
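<p class="wp-block-paragraph">To make this concrete, below is a minimal sketch of what computing an eval could look like in Python. Everything in it is a hypothetical placeholder: <code>run_system</code> stands in for your actual AI system, and the tiny test set, the JSON format check, and the accuracy score are just one possible combination of the methods above.</p> <pre class="wp-block-code"><code>
import json

# Hypothetical stand-in for the AI system under test; in practice this would
# call your model, chain, or pipeline.
def run_system(user_input):
    return '{"label": "negative"}'

# Each test case pairs a realistic input with an expected output.
test_cases = [
    {"input": "My order never arrived and nobody replies!", "expected_label": "negative"},
    {"input": "Thanks, the issue was solved quickly.", "expected_label": "positive"},
]

def format_is_valid(raw_output):
    # Code-based eval: the output must be valid JSON with a "label" field.
    try:
        return "label" in json.loads(raw_output)
    except json.JSONDecodeError:
        return False

def run_evals(cases):
    format_ok, correct = 0, 0
    for case in cases:
        raw = run_system(case["input"])
        if format_is_valid(raw):
            format_ok += 1
            # Invalid-format outputs simply count as incorrect predictions.
            correct += int(json.loads(raw)["label"] == case["expected_label"])
    return {"format_pass_rate": format_ok / len(cases), "accuracy": correct / len(cases)}

print(run_evals(test_cases))
</code></pre> <p class="wp-block-paragraph">In a real project the test set would be much larger and the scoring functions would match the evaluation areas you chose, but the structure of input, expected output, evaluation method, and score stays the same.</p>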
<h2 class="wp-block-heading">Wrapping it up</h2> <p class="wp-block-paragraph">Let’s bring everything together with a concrete example. Imagine you’re building a sentiment analysis system to help your customer support team prioritize incoming emails.&nbsp;</p> <p class="wp-block-paragraph">The goal is to make sure the most urgent or negative messages get faster responses — ideally reducing frustration, improving satisfaction, and decreasing churn. This is a relatively simple use case, but even in a system like this, with limited outputs, quality matters: bad predictions could lead to emails being prioritized essentially at random, meaning your team wastes time on a system that costs money.&nbsp;</p> <p class="wp-block-paragraph">So how do you know your solution is working with the needed quality? You evaluate. Here are some examples of things that might be relevant to assess in this specific use case:&nbsp;</p> <ul class="wp-block-list"> <li class="wp-block-list-item"><strong>Format Validation: </strong>Are the outputs of the LLM call to predict the sentiment of the email returned in the expected JSON format? This can be evaluated via code-based checks: regex, schema validation, etc.</li> <li class="wp-block-list-item"><strong>Sentiment Classification Accuracy: </strong>Is the system correctly classifying sentiments across a range of texts — short, long, multilingual? This can be evaluated with labeled data using traditional ML metrics — or, if labels aren’t available, using LLM-as-a-judge.</li> </ul> <p class="wp-block-paragraph">Once the solution is live, you will also want to include metrics that are more related to the final impact of your solution<em>:</em></p> <ul class="wp-block-list"> <li class="wp-block-list-item"><strong>Prioritization Effectiveness: </strong>Are support agents actually being guided toward the most critical emails? Is the prioritization aligned with the desired business impact?</li> <li class="wp-block-list-item"><strong>Final Business Impact:</strong> Over time, is this system reducing response times, lowering customer churn, and improving satisfaction scores?</li> </ul> <p class="wp-block-paragraph"><strong>Evals are key to ensuring we build useful, safe, valuable, and user-ready AI systems in production. </strong>So, whether you&#8217;re working with a simple classifier or an open-ended chatbot, take the time to define what “good enough” means (Minimum Viable Quality) — and build the evals around it to measure it!</p> <h2 class="wp-block-heading">References</h2> <p class="wp-block-paragraph">[1] <a href="https://hamel.dev/blog/posts/evals/">Your AI Product Needs Evals</a>, Hamel Husain</p> <p class="wp-block-paragraph">[2] <a href="https://www.confident-ai.com/blog/llm-evaluation-metrics-everything-you-need-for-llm-evaluation">LLM Evaluation Metrics: The Ultimate LLM Evaluation Guide</a>, Confident AI</p> <p class="wp-block-paragraph">[3] <a href="https://www.deeplearning.ai/short-courses/evaluating-ai-agents/">Evaluating AI Agents</a>, deeplearning.ai + Arize</p> <p>The post <a href="https://towardsdatascience.com/how-to-ensure-your-ai-solution-does-what-you-expect-it-to-do/">How to Ensure Your AI Solution Does What You Expect It to Do</a> appeared first on <a href="https://towardsdatascience.com">Towards Data Science</a>.</p>
  10. Struggling to Land a Data Role in 2025? These 5 Tips Will Change That

    Tue, 29 Apr 2025 01:10:13 -0000

    Your dream data job isn’t ghosting you—you just need to search smart.

    The post Struggling to Land a Data Role in 2025? These 5 Tips Will Change That appeared first on Towards Data Science.

    <p class="wp-block-paragraph"><mdspan datatext="el1745888927513" class="mdspan-comment">Breaking</mdspan> into the tech world is no longer as easy (or glamorous) as it used to be. Lots of people are finding it difficult to find their way into the current tech market. This can be due to lots of reasons like a competitive job market, lack of openings, higher demands for senior positions, massive layoffs, etc. Meanwhile, the tech space continues to beam with lots of prospects who have acquired the right skills to excel in various roles. While the playing ground might not be favourable for all, there are smart ways to enhance your <a href="https://towardsdatascience.com/tag/job-search/" title="Job Search">Job Search</a> to tilt the odds to your favour.</p> <p class="wp-block-paragraph">The way people find and apply to available roles has come a long way over the years. We have moved from periods of advertising openings on newspapers to a point where you can send out tens of applications with a few clicks of a button. Lots of technologies have made job ads more reachable, and recruiters can easily discover professionals online. In fact, some of the best times were when recruiters were the ones chasing candidates. Ah, the glory days!</p> <p class="wp-block-paragraph">The old methods of finding new opportunities thrived due to some enabling factors. There was abundance of opportunities in the tech industry due to the high rate of adoption. Lots of companies were willing to digitalize their business processes. There was shortage of people with the right skills for the fast growing tech industry. And to cap it all off, there wasn&#8217;t a tool as advanced as AI that could handle most mundane tasks. All these and a lot more factors made it easier to break into the tech industry. Fast forward to now, things have changed a lot and have gotten tougher for most newbies who are facing desperate times.</p> <p class="wp-block-paragraph">As the saying goes, <em>desperate times calls for desperate measures</em>. Standing out in a fierce market can make all the difference in landing you your next role. You can stand out based on your profile, work experiences, portfolio, skillset, etc. But in this article, I will show you how to stand out by making yourself more available to niche opportunities. These 5 tips will have you stand out and present yourself to more opportunities.</p> <h2 class="wp-block-heading">Boolean search</h2> <p class="wp-block-paragraph">Advanced search techniques like the Boolean search can be used on job boards to find more refined roles based on your search queries. Boolean search uses keywords like &#8220;AND&#8221;, &#8220;OR&#8221; and &#8220;NOT&#8221; to limit search results. LinkedIn is a prime example of how much your search query can be made more specific. 
Instead of just using the LinkedIn <a href="https://towardsdatascience.com/tag/jobs/" title="Jobs">Jobs</a> tab, you can run a Boolean search on posts to discover LinkedIn posts from recruiters who are hiring.</p> <figure class="wp-block-image aligncenter"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/03/Boolean-Search.png" alt="An example of Boolean search on LinkedIn" class="wp-image-600822"/><figcaption class="wp-element-caption">Boolean search on LinkedIn / Image by author</figcaption></figure> <p class="wp-block-paragraph">The search query <strong>&#8220;data analyst&#8221; AND &#8220;hiring&#8221; AND &#8220;London&#8221; </strong>returns all LinkedIn posts containing these search terms. These posts are almost always recruiters looking to hire someone. This can be the quickest way to come across a job ad on LinkedIn. Boolean search is not exclusive to LinkedIn, and similar search techniques work on many other job websites.</p> <h2 class="wp-block-heading">Extended keywords search</h2> <p class="wp-block-paragraph">The extended keywords search technique helps you search for job opportunities using search terms other than the job title. Search terms that include tools, workflow names, potential project names, etc., can be used while searching for jobs online. Instead of just searching for &#8220;data analyst&#8221;, someone looking for a data analyst role can search for tools like &#8220;Power BI&#8221; or &#8220;Tableau&#8221; on a job advertising website. This technique works because a tool-based search will produce results that match your skillset. Secondly, different companies often use varying job titles for what is essentially the same role. Company &#8216;A&#8217; can have an opening for a Business Analyst, but Company &#8216;B&#8217; can decide to name the same role Data Analyst or Product Analyst. The tools and responsibilities across these different roles can be the same. By focusing your search around those tools and other keywords, you can sidestep the confusion and unlock more relevant opportunities you might have otherwise missed.</p> <h2 class="wp-block-heading">Recommendations through networking</h2> <p class="wp-block-paragraph">Recommendations are a well-known way of getting your foot in the door when it comes to breaking into a new field. While recommendations work to a great extent, the person who gives the recommendation is just as important. This is where networking comes in. Networking without a purpose will most likely result in short-lived contacts. You can start to network with people specifically to get recommendations from them. So the next time you attend a career fair, connect with company representatives on LinkedIn and reach out to them for recommendations. The recommendation can then be used as an additional document when applying for a role within the company where your recommender works. This approach should give you a boost and get your application more attention. Requesting a recommendation (letter) is an easier ask than an outright request for a job, so don&#8217;t be surprised when most people say yes to giving a recommendation letter.</p> <h2 class="wp-block-heading">Cold-calling</h2> <p class="wp-block-paragraph">Hear me out: I know I said asking for a job is a big ask, but cold-calling is not a bad idea, especially when done strategically. 
<em>What&#8217;s the strategy?</em> Thanks for asking. Cold-calling is a way to get yourself out there, to be seen more by companies and industry professionals. You can cold-call to enquire about available openings in a company and to demonstrate your skillset. You might be turned down in most cases, and that is totally fine. You can also request an informational interview with the company just to evaluate yourself, irrespective of an available role. An informational interview can either make the company reconsider you for a role or help you prepare for subsequent interviews. Either way, you come out better. So look up that company you&#8217;ve been wanting to work for and give them a ring.</p> <h2 class="wp-block-heading">Nearby companies</h2> <p class="wp-block-paragraph">Google Maps is an underrated tool when it comes to job searching. It isn&#8217;t just a tool that helps you find your way to the nearest Tesco; it can also be used to identify and locate companies around you based on specific industries. With the right search query, Google Maps can help you identify nearby companies in any industry just by searching for that industry. You can search for &#8220;software companies near me&#8221; and this returns a map of all registered companies around you, with their websites and email addresses. This is a good source for discovering companies you can easily visit in person.</p> <figure class="wp-block-image size-large"><img decoding="async" src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/Screenshot-2025-04-21-160031-1024x483.png" alt="" class="wp-image-601993"/><figcaption class="wp-element-caption">Google Maps for finding nearby companies / Image by author</figcaption></figure> <p class="wp-block-paragraph">You can reach out to the companies identified on Google Maps and ask about available roles. You can also visit the company websites to learn more about each company and whether it is the right fit for you. And you can pay a visit in person if the company is a short commute from where you are.</p> <h2 class="wp-block-heading">Conclusion</h2> <p class="wp-block-paragraph">Breaking into the tech industry may be tougher than it once was, but it&#8217;s far from impossible. While the market is more competitive and roles are more demanding, creative and strategic job search methods can give you a significant edge. By utilising smart techniques like Boolean search, extended keyword searches, networking for strong recommendations, cold-calling, and exploring nearby companies using tools like Google Maps, you can expand your reach and visibility in the job market. It’s about working smarter, not just harder—adapting your approach can make all the difference in landing your next opportunity.</p> <hr class="wp-block-separator has-alpha-channel-opacity is-style-dotted"/> <p class="wp-block-paragraph"><strong>Thank You!</strong></p> <p class="wp-block-paragraph" id="203a"><em>Enjoyed this article? </em><a href="https://medium.com/@doziesixtus"><em>Follow me</em></a><em>&nbsp;to get notifications whenever I publish a new story. I will be publishing more articles in this space. Cheers!</em></p> <p>The post <a href="https://towardsdatascience.com/struggling-to-land-a-data-role-in-2025-these-5-tips-will-change-that/">Struggling to Land a Data Role in 2025? These 5 Tips Will Change That</a> appeared first on <a href="https://towardsdatascience.com">Towards Data Science</a>.</p>
  11. 5 Strategies for Securing and Scaling Streaming Data in the AI Era

    Wed, 30 Apr 2025 18:00:45 -0000

    "5 Strategies for Securing and Scaling Streaming Data in the AI Era" featured image.

    Streaming data underpins real-time personalization campaigns, fraud detection, predictive maintenance and an ever-expanding set of business-critical initiatives. With AI now

    The post 5 Strategies for Securing and Scaling Streaming Data in the AI Era appeared first on The New Stack.

    Protecting streaming data is a strategic imperative. Here are five strategies for building secure, scalable data streams ready for the AI era.
  12. Should You Try Small Language Models for AI App Development?

    Wed, 30 Apr 2025 17:00:27 -0000

    "Should You Try Small Language Models for AI App Development?" featured images. Small letters in a jumble on a notebook

    Most enterprises share a common goal: to bring their most critical business operations as close as possible to the audiences

    The post Should You Try Small Language Models for AI App Development? appeared first on The New Stack.

    SLMs may offer greater privacy, security and business opportunities than LLMs powering GenAI, but they aren’t right for every use case.
  13. What Is MCP? Game Changer or Just More Hype?

    Wed, 30 Apr 2025 15:00:02 -0000

    "What is MCP? Game-Changer or Just More Hype?" featured image. Stick figure next to question marks

    The hype for Anthropic’s Model Context Protocol (MCP) has reached a boiling point. Everyone is releasing something around MCP to

    The post What Is MCP? Game Changer or Just More Hype? appeared first on The New Stack.

    Interest in and confusion about the Model Context Protocol exist in equal measure. Dive deep into the details around MCP in part 1 of this series.
  14. Scaling AI Agents in the Enterprise: The Hard Problems and How to Solve Them

    Wed, 30 Apr 2025 14:06:38 -0000

    AI agents are evolving beyond simple chat-based interactions into systems that execute workflows, manage state, and make decisions across long-running

    The post Scaling AI Agents in the Enterprise: The Hard Problems and How to Solve Them appeared first on The New Stack.

    <img width="1024" height="683" src="https://cdn.thenewstack.io/media/2025/04/8591a6f4-marvin-meyer-syto3xs06fu-unsplash-1-1024x683.jpg" class="webfeedsFeaturedVisual wp-post-image wp-stateless-item" alt="" style="display: block; margin: auto; margin-bottom: 20px;max-width: 100%;" link_thumbnail="" decoding="async" loading="lazy" data-image-size="large" data-stateless-media-bucket="cdn.thenewstack.io" data-stateless-media-name="media/2025/04/8591a6f4-marvin-meyer-syto3xs06fu-unsplash-1-scaled.jpg" /><!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"> <html><body><p>AI agents are evolving beyond simple chat-based interactions into systems that execute workflows, manage state, and make decisions across long-running processes. These architectures are being adopted in use cases ranging from fully autonomous AI interviewers (Mercor) to financial risk analysis (Robinhood) and data pipeline automation (Databricks). However, moving from prototype to production introduces several technical challenges and introduces a new set of challenges:</p> <ol> <li>State persistence: AI agents need memory beyond a single prompt-response loop.</li> <li>Reliable execution: If an agent fails during a task, it needs to recover gracefully.</li> <li>Multi-agent coordination: Agents need to interact, share knowledge, and delegate tasks.</li> </ol> <p>Each of these challenges (and of course, security, authentication, and authorization for agents) requires careful architectural decisions, as well as a shift from treating <a href="https://thenewstack.io/a-comprehensive-guide-to-function-calling-in-llms/" data-wpil-monitor-id="2029" class="local-link">LLMs as simple function calls</a> to designing AI agents as robust, distributed systems.</p> <h2>1. State Persistence: AI Agents Need To Persist State Beyond a Single Prompt-Response Loop</h2> <h3>The Problem: Stateless LLMs Fail in Long-Running Workflows</h3> <p>Most AI applications today use LLMs in a stateless fashion: each query is treated independently with no recall of prior interactions. This works for simple queries but <a href="https://thenewstack.io/ai-agents-in-doubt-reducing-uncertainty-in-agentic-workflows/" data-wpil-monitor-id="2032" class="local-link">fails in complex workflows where agents</a> must remember prior steps, decisions, or user inputs.</p> <h3>Example: Stateful AI for Technical Interviews (Mercor)</h3> <p>Take Mercor, for example, a $2B+ company backed by Benchmark, General Catalyst, and Felicis. They&rsquo;re building a fully autonomous AI interviewer that adapts in real-time to how a candidate performs. The system could initiate role- and level-specific interview questions, then adaptively generate follow-up questions in real-time based on the candidate&rsquo;s responses and performance signals. It would conclude by synthesizing performance data to deliver a rigorous, data-driven, and unbiased evaluation of candidate aptitude and fit.</p> <p>To ask questions that logically follow a candidate&rsquo;s responses, Mercor&rsquo;s system needs to be able to persist state, including:</p> <ul> <li>A candidate&rsquo;s technical responses and code submissions</li> <li>Areas they struggled with or excelled in</li> <li>Reference points from similar interviews</li> </ul> <p>Without a persistent state layer, the experience would feel disjointed and repetitive, and worse, the system wouldn&rsquo;t be able to make a fair or informed evaluation. 
Real-time, adaptive agents don&rsquo;t just benefit from memory; they depend on it. Agentic memory in its various flavors and design patterns ensures long-running context for LLMs.</p> <div id="attachment_22785633" style="width: 602px" class="wp-caption aligncenter"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-22785633" class="wp-image-22785633 size-large" src="https://cdn.thenewstack.io/media/2025/04/6f0c3bbe-image3-592x1024.png" alt="" width="592" height="1024"><p id="caption-attachment-22785633" class="wp-caption-text">Mercor AI</p></div> <h3>Solution: Architecting State for AI Agents With Embeddings</h3> <p>Persisting state in AI systems isn&rsquo;t a one-size-fits-all problem. Different workflows require different types of recall: some are semantic and fuzzy, while others are exact and structured. For agents to operate effectively in real-world environments, memory must be purpose-built, composable, and optimized for both speed and relevance.</p> <ul> <li><strong>Vector Databases (e.g., Pinecone, Weaviate, Supabase pgvector): </strong>For tasks like summarization, knowledge retrieval, or referencing prior conversations, try using a vector database like Pinecone, Weaviate, or Supabase&rsquo;s pgvector extension.</li> <li><strong>Structured Storage (e.g., Letta/MemGPT): </strong>In contrast, when workflows require exact tracking, think multistep processes, form completion, or reasoning over previous decisions, structured memory is essential. Tools like Letta shine here.</li> <li><strong>Agentic Memory Layers: </strong>The most advanced memory architectures dynamically integrate both fuzzy and structured memory into a single runtime. Letta, for instance, enables LLMs to operate beyond the fixed-token context window by layering long-term memory directly into the agent architecture.</li> </ul> <p>A fundamental breakthrough in this area is the use of embeddings. The process involves indexing and storing embeddings, then employing search and retrieval techniques to manage long-term state effectively.</p> <p><strong>Embeddings<br> </strong>Embeddings convert segments of text, such as memories or conversation fragments, into high-dimensional numerical vectors that encapsulate their semantic content. A neural network, trained to recognize linguistic patterns, transforms text into vectors that reflect meaning, context, and the relationships between words or phrases. The resulting high-dimensional space allows for the establishment of contextual similarity, where proximity between vectors indicates relatedness.</p>
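<p>As a rough, self-contained sketch of that idea, the snippet below uses a toy bag-of-words function in place of a real embedding model and a plain Python list in place of a vector database; the names are illustrative rather than taken from any particular library. Semantic recall then amounts to ranking stored vectors by cosine similarity to the query vector:</p> <pre><code>
import math
from collections import Counter

def embed(text):
    # Toy embedding: a bag-of-words vector. A real system would call an
    # embedding model here instead.
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# In-memory stand-in for a vector database: (vector, original text) pairs.
memory_store = []

def remember(text):
    memory_store.append((embed(text), text))

def recall(query, top_k=2):
    query_vec = embed(query)
    ranked = sorted(memory_store,
                    key=lambda item: cosine_similarity(query_vec, item[0]),
                    reverse=True)
    return [text for _, text in ranked[:top_k]]

remember("Candidate struggled with dynamic programming questions")
remember("Candidate excelled at system design trade-offs")
remember("Interview scheduled for next Tuesday at 10am")

print(recall("How did the candidate do on algorithm questions?"))
</code></pre> <p>A production system swaps the toy embedding for a learned model and the list for an indexed vector store, but the remember-then-rank-by-similarity loop is the same.</p>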
<p><strong>Interplay with Vector Databases<br> </strong>Vector databases store these high-dimensional vectors along with associated metadata (e.g., tags, timestamps) and enable similarity searches through various methods:</p> <ul> <li><strong>Approximate Nearest Neighbors (ANN): </strong>Techniques like Hierarchical Navigable Small World (HNSW), Locality-Sensitive Hashing (LSH), and Annoy (which uses random projection trees) quickly narrow down the search to approximate nearest neighbors.</li> <li><strong>Inverted File Indexing (IVF): </strong>This method clusters the vector space, limiting searches to relevant dataset segments.</li> <li><strong>Quantization: </strong>Optimized product quantization compresses vectors, accelerating distance computations.</li> <li><strong>Hybrid Approaches: </strong>Often, a combination of these techniques is employed to balance speed and accuracy.</li> </ul> <p><strong>Enhancing Agentic Memory<br> </strong>During retrieval, new inputs are transformed into vectors, and the vector database is queried to find semantically similar vectors. This process enables the agent to maintain continuity over extended interactions, ensuring smoother transitions and coherent decision-making while minimizing latency. As new tasks are completed or interactions are finished, their embeddings are generated and added to the database. This continuous update mechanism allows the agent to dynamically refine its memory, enhancing its long-term contextual awareness and overall learning capabilities.</p> <h2>2. Reliable Execution: If an Agent Fails Midtask, It Needs To Recover Gracefully</h2> <h3>The Problem: Failures in Multistep AI Workflows</h3> <p>LLM-based agents rarely operate in a vacuum; they frequently interact with APIs, databases, and various external systems. When an agent encounters an error during execution, it must handle the issue gracefully instead of starting over entirely.</p> <p>A significant challenge in productionizing AI agents is durability; if an LLM generates responses in a long-running workflow and the process fails midway, does the entire session reset? Or does it recover where it left off? Unlike traditional web applications, which rely on database-backed statefulness, AI agents often operate in stateless environments unless they are explicitly designed for fault tolerance.</p> <h3>Example: Robinhood&rsquo;s AI-Powered Trading Agent</h3> <p>Robinhood leverages AI agents to function as market analysts, assisting users in constructing trades. These agents integrate high-fidelity market data with real-time trading information, historical trade patterns, and proprietary insights regarding retail trading behavior, enabling the formulation of a robust stock thesis. Operating within a high-stakes, low-latency framework, the system is engineered to prevent incorrect answers or failures that could lead to financial loss or regulatory challenges.</p> <p>To guarantee reliable trade execution, Robinhood deploys a multilayered AI model fallback architecture as described below:</p> <ul> <li><strong>Primary High-Performance LLM for Critical Decision-Making:<br> </strong>&nbsp;A compute-intensive large language model (LLM) processes complex market conditions using chain-of-thought reasoning to generate detailed market insights. 
A dedicated subsystem interfaces with this model to mitigate hallucinations and ensure the reliability of outputs.</li> <li><strong>Secondary Lightweight LLM for Summarization:<br> </strong>&nbsp;The detailed insights from the primary LLM are then routed to a more cost-efficient, lower-latency model that produces concise summaries. This dual-model approach balances performance with operational cost efficiency.</li> <li><strong>Failover and Redundancy Mechanism:<br> </strong>&nbsp;In the event of a primary LLM failure, the system automatically fails over to the secondary model or retrieves cached responses from a historical vector database. This design ensures operational continuity under adverse conditions.</li> <li><strong>Event-Driven Asynchronous Execution:<br> </strong>&nbsp;The AI-generated insights are presented to the user, who then specifies trade parameters such as price and time targets. These inputs are asynchronously queued and processed to decouple execution stages, preventing error propagation. In case of a failure during any execution step, the process is designed to roll back and retry, rather than initiating a complete system reset.</li> </ul> <p>This architecture enables Robinhood&rsquo;s AI-driven insights and trade construction platform to maintain near-100% uptime, significantly reducing order failures while effectively managing AI inference costs.</p> <h3>Solution: Ensuring Reliability in AI Agent Workflows</h3> <p>For developers, reliability isn&rsquo;t just about uptime. It&rsquo;s about trust. If an agent drops state halfway through a task, fails silently, or returns inconsistent outputs, it breaks the entire user experience. Building for reliability means thinking like a systems engineer: handling retries, surfacing errors cleanly, and making sure agents can recover without starting from scratch.</p> <ul> <li><strong>Orchestration frameworks</strong> (Temporal.io, Crew.ai, Langchain): Provide stateful execution, retries, and recovery mechanisms.</li> <li><strong>Multi-LLM routing</strong>: Intelligent load balancing between foundation models to optimize availability and model feature use, in addition to giving levers for cost and latency tradeoffs.</li> <li><strong>Versioning &amp; Rollback:<br> </strong>AI agents should regularly save checkpointed execution states to enable recovery in case of failures. Moreover, using a model registry for version control provides a robust rollback mechanism, ensuring that if a new model update does not meet expectations, the system can quickly revert to a previous stable version.</li> <li><strong>Blue/Green Deployments for Models:<br> </strong>In production, blue/green deployments are employed by routing traffic between two model versions. For instance, the &ldquo;blue&rdquo; model might handle 95% of the traffic while the &ldquo;green&rdquo; model handles 5%. Once the green model demonstrates the required accuracy, availability, and stability, it is promoted to blue, and a new improved green model is introduced. This strategy minimizes risk and ensures continuous, reliable performance.</li> </ul>
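<p>To make the failover idea concrete, here is a deliberately simplified Python sketch of the retry-then-fall-back-then-cache pattern described above. The function names and the in-memory cache are hypothetical stand-ins, and a production system would delegate this logic to an orchestration framework rather than hand-rolled retries:</p> <pre><code>
import time

class ModelUnavailable(Exception):
    pass

def call_primary_model(prompt):
    # Stand-in for the compute-intensive primary LLM; it always fails in this demo.
    raise ModelUnavailable("primary model timed out")

def call_fallback_model(prompt):
    # Stand-in for the cheaper, lower-latency secondary LLM.
    return "summary produced by fallback model"

response_cache = {"market outlook": "cached insight from an earlier run"}

def generate_insight(prompt, max_retries=2):
    # 1. Retry the primary model with a short backoff.
    for attempt in range(max_retries):
        try:
            return call_primary_model(prompt)
        except ModelUnavailable:
            time.sleep(0.1 * (attempt + 1))
    # 2. Fail over to the secondary model.
    try:
        return call_fallback_model(prompt)
    except ModelUnavailable:
        pass
    # 3. Last resort: serve a cached response so the workflow can continue.
    return response_cache.get(prompt, "service temporarily degraded")

print(generate_insight("market outlook"))
</code></pre> <p>The ordering matters: retry the cheap recovery first, fail over second, and only then degrade to a cached or partial answer instead of resetting the whole workflow.</p>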
<h2>3. Multi-Agent Systems: Distributed AI Coordination</h2> <h3>The Problem: Single LLMs Are Inefficient for Complex Tasks</h3> <p>Monolithic LLMs fall short when executing complex, multistep workflows. Instead, different tasks require specialized AI models &mdash; some tailored for reasoning, others for retrieval, and still others for execution. For example, consider a system where agents process multimodal inputs (such as text, voice, and video) and generate not only multimodal outputs but also trigger actions, whether through direct computer interaction, function calls, or web-based operations.</p> <p><img loading="lazy" decoding="async" class="aligncenter wp-image-22785634 size-large" src="https://cdn.thenewstack.io/media/2025/04/79290e4b-image2-1024x841.png" alt="" width="1024" height="841"></p> <p>Single-agent systems are inherently limited by their sequential processing and single-threaded decision-making, restricting their ability to execute tasks in parallel and distribute workloads effectively. This constraint makes them less suitable for complex, real-world applications, prompting the development of multi-agent systems that can handle distributed tasks more efficiently.</p> <p>This evolution reflects a broader trend in agentic AI toward enhanced planning, execution, and self-optimization. Even within a single-agent framework, advancements in these areas have paved the way for more robust and resilient AI systems.</p> <h3>Example: AI Agents in Enterprise Security (Palo Alto Networks)</h3> <p>Enterprise <a href="https://thenewstack.io/ai-security-agents-combat-ai-generated-code-risks/" data-wpil-monitor-id="2027" class="local-link">security requires AI agents</a> that perform distinct functions:</p> <ol> <li>Threat detection agent: Monitors logs and flags anomalies</li> <li>Risk assessment agent: Uses ML-based models to evaluate threats</li> <li>Remediation agent: Automates security responses</li> </ol> <p>Each of these components requires specialized AI, rather than a single LLM attempting to handle the entire workflow.</p> <h3>Solution: Architecting Multi-Agent Systems</h3> <p>Decentralized agent architectures, such as mixtures of experts, distribute tasks among specialized models, thereby reducing inference overhead. Several popular multi-agent architectures include:</p> <ul> <li><strong>Supervisory Agent:</strong> All agents communicate with a central supervisor for coordinated plan execution.</li> <li><strong>Networked Agents:</strong> Each agent can interact directly with others.</li> <li><strong>Hierarchical Systems:</strong> A layered structure where supervisors coordinate other supervisors to tackle complex tasks.</li> <li><strong>Custom Architectures:</strong> User-defined setups that enable coordination among only a subset of agents for specific task execution.</li> </ul> <h3>Event-Driven Communication</h3> <p>Agents exchange state updates via <a href="https://thenewstack.io/choosing-between-message-queues-and-event-streams/" data-wpil-monitor-id="2030" class="local-link">message queues or event</a> buses. This event-driven approach allows agents to operate on specific triggers by reading and writing messages to distributed queues, eliminating the need for constant polling or direct connections, reducing latency, and enhancing scalability.</p>
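<p>As a minimal illustration of that trigger-driven hand-off, the sketch below uses Python&rsquo;s in-process <code>queue.Queue</code> as a stand-in for a distributed message bus, with trivial functions in place of the security agents described above; it is illustrative rather than production code:</p> <pre><code>
import queue

# In-process stand-in for a distributed event bus / message queue.
event_bus = queue.Queue()

def threat_detection_agent(log_line):
    # Publishes an event instead of calling the next agent directly.
    if "failed login" in log_line:
        event_bus.put({"type": "anomaly_detected", "detail": log_line})

def risk_assessment_agent():
    # Triggered by events on the bus; no polling of the detection agent needed.
    while not event_bus.empty():
        event = event_bus.get()
        severity = "high" if "admin" in event["detail"] else "low"
        yield {"type": "risk_assessed", "severity": severity, "source": event}

threat_detection_agent("failed login for user admin from 10.0.0.7")
for assessment in risk_assessment_agent():
    print(assessment)
</code></pre>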
<h3>Cross-Agent Memory</h3> <p>AI agents maintain a shared knowledge base to facilitate smooth task hand-offs. They store and retrieve data from shared memory to maintain a unified context around shared goals. This data is encoded into rich, context-aware representations using LLM-friendly embeddings. A key challenge here is ensuring that shared memory implements proper locking and version control to prevent issues from concurrent updates.</p> <p><strong>Longer-Term Retrieval and Interplay With Microservices</strong></p> <p>A key unlock for longer-term retrievals is the implementation of a unified data access layer to ensure agents retrieve the correct information. One effective approach combines GraphQL with the <a href="https://thenewstack.io/building-your-first-model-context-protocol-server/" data-wpil-monitor-id="2028" class="local-link">Model Context Protocol</a> (MCP) for consistent data delivery.</p> <p>GraphQL serves as a versatile API layer by providing a single endpoint that allows clients to fetch only the data they need, thereby avoiding issues related to overfetching or underfetching. The Model Context Protocol <a href="https://thenewstack.io/automating-context-in-structured-data-for-llms/" data-wpil-monitor-id="2026" class="local-link">standardizes how contextual data</a> is packaged and delivered to AI models, ensuring they have the precise context required for accurate decision-making. When these technologies are integrated for agentic AI systems, they dynamically supply the necessary context to autonomous agents, <a href="https://thenewstack.io/boost-ai-efficiency-data-chunking-meets-document-databases/" data-wpil-monitor-id="2031" class="local-link">boosting both their adaptability and overall efficiency in data</a> retrieval and consistency.</p> <p>Modern agentic systems in real-world applications typically comprise:</p> <ol> <li><strong>Plan, Execute, Decide loops</strong></li> <li><strong>Self-improvement</strong> via in-context learning or fine-tuning</li> <li><strong>Tools and plan optimization</strong></li> <li><strong>Continuous evaluations</strong></li> <li><strong>Robust observability</strong></li> </ol> <p><img loading="lazy" decoding="async" class="aligncenter wp-image-22785635 size-large" src="https://cdn.thenewstack.io/media/2025/04/e58aff19-image4-1024x765.png" alt="" width="1024" height="765"></p> <h3>Solving for Continuous Improvements, Learning and a Mixture of Experts</h3> <p>These systems can be further improved through trajectory optimization, using expert tools and communicators, to generate the desired task output. Supervised fine-tuning can be done over planner, tool-user, or communicator data to build expert bots that are good at specific tasks while being lower in memory footprint and latency.</p> <p><img loading="lazy" decoding="async" class="aligncenter wp-image-22785636 size-large" src="https://cdn.thenewstack.io/media/2025/04/21f995ec-image1-1024x917.png" alt="" width="1024" height="917"></p> <h2>Designing for the Hard Path Is the Only Path</h2> <p>The next generation of enterprise AI will be driven by systems thinking.</p> <p>AI agents that succeed at scale don&rsquo;t just take in prompts; they will remember, recover, and collaborate. Building for state persistence, reliable execution, and multi-agent coordination isn&rsquo;t optional. It&rsquo;s foundational. These capabilities are the difference between a prototype that demos well and a system that delivers every single day in production.</p> <p>For teams serious about deploying AI agents in high-stakes, high-complexity environments, the call to action is clear: treat agents like distributed systems. Invest in infrastructure, not just inference. Build persistent state as a core service, not an afterthought. 
Embrace redundancy, modularity, and the architectural rigor that the enterprise demands.</p> <p>The post <a href="https://thenewstack.io/scaling-ai-agents-in-the-enterprise-the-hard-problems-and-how-to-solve-them/">Scaling AI Agents in the Enterprise: The Hard Problems and How to Solve Them</a> appeared first on <a href="https://thenewstack.io">The New Stack</a>.</p>
  15. Interop Unites Browser Makers To Smooth Web Inconsistencies

    Wed, 30 Apr 2025 12:00:40 -0000

    For the last four years, the major browser vendors, web standards creators and other contributors to browser engines have joined

    The post Interop Unites Browser Makers To Smooth Web Inconsistencies appeared first on The New Stack.

    For the past four years, major browser vendors have collaborated to improve web interoperability by coordinating enhancements to inconsistent browser implementations.
  16. Ship Fast, Break Nothing: LaunchDarkly’s Winning Formula

    Tue, 29 Apr 2025 23:00:17 -0000

    In today’s rapidly evolving software landscape, the ability to deliver code quickly while maintaining stability has become a critical competitive

    The post Ship Fast, Break Nothing: LaunchDarkly’s Winning Formula appeared first on The New Stack.

    LaunchDarkly pioneered the feature management category and is now transforming software delivery with Guarded Releases, AI support and more.
  17. Basic Python Syntax: A Beginner’s Guide To Writing Python Code

    Tue, 29 Apr 2025 19:00:08 -0000

    Every programming language has a unique syntax. Some languages borrow syntax from others, while others create something wholly different. No

    The post Basic Python Syntax: A Beginner’s Guide To Writing Python Code appeared first on The New Stack.

    Learn all the basic Python syntaxes you need to start coding. This guide covers comments, variables, functions, loops, and more — explained simply for beginners.
  18. 6 Ways AI Is Upending the DevOps Lifecycle

    Tue, 29 Apr 2025 18:00:45 -0000

    "6 Ways AI Is Upending the DevOps Life Cycle" featured image. Person's legs upside down behind desk

    The AI revolution isn’t knocking at DevOps’ door — it’s already redecorating the house. While individual teams have been experimenting

    The post 6 Ways AI Is Upending the DevOps Lifecycle appeared first on The New Stack.

    With organizations putting AI into action, the DevOps ecosystem is poised for transformation — becoming more efficient, resilient and autonomous.
  19. TLA+ Creator Leslie Lamport: Programmers Need Abstractions

    Tue, 29 Apr 2025 17:00:53 -0000

    Leslie Lamport talk at SCaLE 22x

    The 84-year-old Leslie Lamport is a legend. A Microsoft web page (where he once worked as a research scientist) notes

    The post TLA+ Creator Leslie Lamport: Programmers Need Abstractions appeared first on The New Stack.

    Why it&#039;s crucial to think at a higher level than code before writing it.
  20. Synadia Attempts To Reclaim NATS Back From CNCF 

    Tue, 29 Apr 2025 15:00:42 -0000

    It has become almost commonplace to read about yet another company having regrets about open sourcing their flagship product and

    The post Synadia Attempts To Reclaim NATS Back From CNCF  appeared first on The New Stack.

    This unexpected intellectual property dispute has sparked an open source governance clash. 
  21. The Bitnami Open Source Application Catalog Turns 18!

    Tue, 29 Apr 2025 14:00:54 -0000

    "The Bitnami Open Source Application Catalog Turns 18!" featured image. Cupcake with #18 and party hats

    The Bitnami open source application catalog fundamentally changed the way developers access and deploy open source software when it was

    The post The Bitnami Open Source Application Catalog Turns 18! appeared first on The New Stack.

    Learn how Bitnami has evolved from installers and virtual machines to support app management on clouds, containers and Kubernetes.
  22. Why Kubernetes Cost Optimization Keeps Failing

    Tue, 29 Apr 2025 13:00:03 -0000

    Yodar Shafrir, co-founder and CEO of ScaleOps, explained at KubeCon + CloudNativeCon Europe that dynamic, cloud-native applications have constantly shifting loads, making resource allocation complex.

    Businesses always care about how much money they’re spending. But these days, with an uncertain global economy and new demands

    The post Why Kubernetes Cost Optimization Keeps Failing appeared first on The New Stack.

    Cloud native apps and Kubernetes are dynamic, making it hard to contain resource costs. Yodar Shafrir of ScaleOps offers a solution in this episode of Makers.
  23. How MCP Puts the Good Vibes Into Cloud Native Development

    Mon, 28 Apr 2025 21:00:16 -0000

    "How MCP Puts the Good Vibes Into Cloud Native Development" featured image. Colorful AI-generated image of a person in a space suit

    For the first time in years, developers are saying the work feels fun again. That’s not just sentiment. It’s the

    The post How MCP Puts the Good Vibes Into Cloud Native Development appeared first on The New Stack.

    MCP is putting the fun back into work by shifting devs’ role from troubleshooters to designers creating and shipping innovative features.
  24. How To Run a Python Script on MacOS, Windows, and Linux

    Mon, 28 Apr 2025 19:12:25 -0000

    Stop copy‑pasting the same command every time you run Python. This guide will teach the practical ways to run Python

    The post How To Run a Python Script on MacOS, Windows, and Linux appeared first on The New Stack.

    Learn how to run Python scripts on macOS, Windows, and Linux with this practical guide. Master command-line execution, IDE shortcuts, scheduling scripts, and more.
  25. OpenTofu Joins CNCF: New Home for Open Source IaC Project

    Mon, 28 Apr 2025 15:13:57 -0000

    It’s not easy running an open source non-profit group. Just ask anyone who’s tried. So it should come as no

    The post OpenTofu Joins CNCF: New Home for Open Source IaC Project appeared first on The New Stack.

    A special CNCF licensing exception helps solidify OpenTofu as a vendor-neutral, community-driven Infrastructure as Code (IaC) solution.
  26. Apache Airflow 3.0: From Data Pipelines to AI Inference

    Mon, 28 Apr 2025 14:00:42 -0000

    Approximately 10 years ago, Apache Airflow launched with a relatively simple, yet timeless premise. It was initially devised as a

    The post Apache Airflow 3.0: From Data Pipelines to AI Inference appeared first on The New Stack.

    Latest edition provides DAG versioning, remote execution capabilities, range of scheduling options, and more.
  27. How To Master Vector Databases

    Mon, 28 Apr 2025 13:00:26 -0000

    "How to Master Vector Databases" featured image. Gauge showing levels from Novice to Master

    Machine learning (ML), AI and endless streams of data are reshaping how we solve problems. But when it comes to

    The post How To Master Vector Databases appeared first on The New Stack.

    Learn how to choose the right vector database for your use case, how to index and query your data, and how to optimize performance in this comprehensive guide.
  28. FerenOS: A Refreshing Take on KDE Plasma That Could Win You Over

    Sun, 27 Apr 2025 15:00:26 -0000

    Once upon a time, FerenOS was based on Linux Mint and used a special edition of the Cinnamon desktop. In

    The post FerenOS: A Refreshing Take on KDE Plasma That Could Win You Over appeared first on The New Stack.

    FerenOS has a polished KDE Plasma implementation and includes the customizable Vivaldi browser as the default, all while delivering impressive performance.
  29. How to Run a Generative AI Developer Tooling Experiment

    Sun, 27 Apr 2025 14:00:14 -0000

    As a software development company specializing in brokerage solutions, Devexperts must balance the speed of innovation with the caution required by a highly

    The post How to Run a Generative AI Developer Tooling Experiment appeared first on The New Stack.

    Software development company Devexperts tested and compared the code generation tools Copilot and Cursor, a comparison that serves as a blueprint for testing AI developer tools.
  30. Bill Gates, Paul Allen, and the Code That Started Microsoft

    Sun, 27 Apr 2025 13:00:06 -0000

    In 1968, a 13-year-old Bill Gates told his friend Paul, “Maybe we’ll have our own company someday.” More than half

    The post Bill Gates, Paul Allen, and the Code That Started Microsoft appeared first on The New Stack.

    Reflecting on a teenage ambition with his friend Paul Allen, Bill Gates recently shared the original 1975 source code that became Microsoft's first product.
  31. React Adds New Experimental Animation Feature

    Sat, 26 Apr 2025 18:00:45 -0000


    React added experimental support for two new techniques this week: View Transitions and Activity. View Transitions makes it easier to

    The post React Adds New Experimental Animation Feature appeared first on The New Stack.

    In other dev news this week: Angular's LLM-first web framework, RedwoodJS's SDK for Cloudflare and new AI tools.
  32. How the UK Is Guiding the Use of Generative AI

    Sat, 26 Apr 2025 14:00:46 -0000

    The UK government has released a playbook for using AI that offers a good model for other public-facing organizations.

    I hope a few more people looked at the Vatican’s Antiqua et Nova after the Pope’s passing, as it has

    The post How the UK Is Guiding the Use of Generative AI appeared first on The New Stack.

    The British Government Digital Service offers a playbook for using AI — and a strong example of how other public-facing bodies can use GenAI responsibly.
  33. The Best Office Suites for Linux

    Sat, 26 Apr 2025 13:00:23 -0000

    When I first started using Linux back in 1997, finding high-quality software was often a challenge. In no category was

    The post The Best Office Suites for Linux appeared first on The New Stack.

    LibreOffice, SoftMaker Office and ONLYOFFICE are but a few of the open source and commercial office suites for Linux.
  34. .NET Modernization: GitHub Copilot Upgrade Eases Migrations

    Fri, 25 Apr 2025 23:00:16 -0000

    During the .NET Conf Focus on Modernization event, Microsoft demonstrated a powerful new tool: GitHub Copilot Upgrade for .NET with

    The post .NET Modernization: GitHub Copilot Upgrade Eases Migrations appeared first on The New Stack.

    Microsoft's AI-powered solution transforms complex .NET upgrades from painstaking manual work into seamless automated processes.
  35. Introduction to API Management

    Fri, 25 Apr 2025 21:00:15 -0000


    What Is API Management? API management, a critical aspect of modern digital architecture, involves overseeing the lifecycle of application programming

    The post Introduction to API Management appeared first on The New Stack.

    Master API management and unlock seamless integrations. Learn how to secure, scale and optimize APIs to power modern apps and services.
  36. Frontend’s Next Evolution: AI-Powered State Management

    Fri, 25 Apr 2025 20:00:52 -0000

    If you’ve built a frontend application in the past five years, you’ve probably had a moment where you stared at

    The post Frontend’s Next Evolution: AI-Powered State Management appeared first on The New Stack.

    How artificial intelligence is transforming the complexity of state management in modern frontend applications.
  37. Staff Site Reliability Engineer, Waze

    Mon, 28 Apr 2025 16:00:00 -0000

    In 2023, the Waze platform engineering team transitioned to Infrastructure as Code (IaC) using Google Cloud's Config Connector (KCC) — and we haven’t looked back since. We embraced Config Connector, an open-source Kubernetes add-on, to manage Google Cloud resources through Kubernetes. To streamline management, we also leverage Config Controller, a hosted version of Config Connector on Google Kubernetes Engine (GKE), incorporating Policy Controller and Config Sync. This shift has significantly improved our infrastructure management and is shaping our future infrastructure.

    The shift to Config Connector

    Previously, Waze relied on Terraform to manage resources, particularly during our dual-cloud, VM-based phase. However, maintaining state and ensuring reconciliation proved challenging, leading to inconsistent configurations and increased management overhead.

    In 2023, we adopted Config Connector, transforming our Google Cloud infrastructure into Kubernetes Resource Modules (KRMs) within a GKE cluster. This approach addresses the reconciliation issues encountered with Terraform. Config Sync, paired with Config Connector, automates KRM synchronization from source repositories to our live GKE cluster. This managed solution eliminates the need for us to build and maintain custom reconciliation systems.

    The shift helped us meet the needs of three key roles within Waze’s infrastructure team: 

    1. Infrastructure consumers: Application developers who want to easily deploy infrastructure without worrying about the maintenance and complexity of underlying resources.

    2. Infrastructure owners: Experts in specific resource types (e.g., Spanner, Google Cloud Storage, Load Balancers, etc.), who want to define and standardize best practices in how resources are created across Waze on Google Cloud.

    3. Platform engineers: Engineers who build the system that enables infrastructure owners to codify and define best practices, while also providing a seamless API for infrastructure consumers.


    First stop: Config Connector

    It may seem circular to define all of our Google Cloud infrastructure as KRMs within a Google Cloud service. However, KRM is actually a better representation of our infrastructure than existing IaC tooling.

    Terraform's reconciliation issues – state drift, version management, out of band changes – are a significant pain. Config Connector, through Config Sync, offers out-of-the-box reconciliation, a managed solution we prefer. Both KRM and Terraform offer templating, but KCC's managed nature aligns with our shift to Google Cloud-native solutions and reduces our maintenance burden. 

    Infrastructure complexity requires generalization regardless of the tool. We can see this when we look at the Spanner requirements at Waze:

    • Consistent backups for all Spanner databases

    • Each Spanner database utilizes a dedicated Cloud Storage bucket and Service Account to automate the execution of DDL jobs.

    • All IAM policies for Spanner instances, databases, and Cloud Storage buckets are defined in code to ensure consistent and auditable access control.

    Figure 1: Spanner at Waze

    To define these resources, we evaluated various templating and rendering tools and selected Helm, a robust CNCF package manager for Kubernetes. Its strong open-source community, rich templating capabilities, and native rendering features made it a natural fit. We can now refer to our bundled infrastructure configurations as 'Charts.' While kro has since emerged to serve a similar purpose, our selection process predated its availability.
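
    To make this concrete, here is a minimal, hypothetical sketch of what "expanding simplified inputs into a bundle of KRM manifests" can look like, written in Python. The resource kinds mirror real Config Connector CRDs, but the function, names, and field values are illustrative only, not Waze's actual Charts (which are Helm templates).

    ```python
    # Hypothetical sketch: expand simplified "chart" inputs into a bundle of
    # KRM-style manifests (Spanner instance + DDL bucket + service account).
    # Illustrative only; not Waze's Charts or their exact field values.
    import yaml  # PyPI: PyYAML

    def render_spanner_bundle(name: str, namespace: str, node_count: int = 1) -> str:
        instance = {
            "apiVersion": "spanner.cnrm.cloud.google.com/v1beta1",
            "kind": "SpannerInstance",
            "metadata": {"name": name, "namespace": namespace},
            "spec": {"config": "regional-us-central1",
                     "displayName": name,
                     "numNodes": node_count},
        }
        ddl_bucket = {
            "apiVersion": "storage.cnrm.cloud.google.com/v1beta1",
            "kind": "StorageBucket",
            "metadata": {"name": f"{name}-ddl-jobs", "namespace": namespace},
        }
        ddl_service_account = {
            "apiVersion": "iam.cnrm.cloud.google.com/v1beta1",
            "kind": "IAMServiceAccount",
            "metadata": {"name": f"{name}-ddl-runner", "namespace": namespace},
        }
        return yaml.safe_dump_all([instance, ddl_bucket, ddl_service_account],
                                  sort_keys=False)

    if __name__ == "__main__":
        print(render_spanner_bundle("orders-db", "checkout-prod", node_count=3))
    ```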

    Under the hood

    Let's open the hood and dive into how the system works and is driving value for Waze.

    1. Waze infrastructure owners generically define Waze-flavored infrastructure in Helm Charts. 

    2. Infrastructure consumers use these Charts with simplified inputs to generate infrastructure (demo).

    3. Infrastructure code is stored in repositories, enabling validation and presubmit checks.

    4. Code is uploaded to an Artifact Registry, where Config Sync and Config Connector align Google Cloud infrastructure with the code definitions.

    Figure 2: Provisioning Cloud Resources at Waze

    This diagram represents a single "data domain": a collection of bounded services, databases, networks, and data. Most tech orgs today run several such domains: Prod, QA, Staging, Development, and so on.

    Approaching our destination

    So why does all of this matter? Adopting this approach allowed us to move from Infrastructure as Code to Infrastructure as Software. By treating each Chart as a software component, our infrastructure management goes beyond simple code declaration. Now, versioned Charts and configurations enable us to leverage a rich ecosystem of software practices, including sophisticated release management, automated rollbacks, and granular change tracking.

    Here's where we apply this in practice: our configuration inheritance model minimizes redundancy. Resource Charts inherit settings from Projects, which inherit from Bootstraps. All three are defined as Charts. Consequently, Bootstrap configurations apply to all Projects, and Project configurations apply to all Resources.

    Every change to our infrastructure – from changes on existing infrastructure to rolling out new resource types – can be treated like a software rollout.

    Figure 3: Resource Inheritance
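
    A minimal sketch of the inheritance chain described above, assuming simple "deepest value wins" merge semantics; this is illustrative Python, not Waze's actual implementation (which layers Helm values).

    ```python
    # Minimal sketch: Resource values override Project values, which override
    # Bootstrap values. Keys and values below are invented for illustration.
    from collections.abc import Mapping

    def deep_merge(base: dict, override: dict) -> dict:
        merged = dict(base)
        for key, value in override.items():
            if isinstance(value, Mapping) and isinstance(merged.get(key), Mapping):
                merged[key] = deep_merge(dict(merged[key]), dict(value))
            else:
                merged[key] = value
        return merged

    bootstrap = {"labels": {"org": "waze"}, "backup": {"enabled": True}}
    project   = {"labels": {"env": "prod"}, "backup": {"schedule": "0 3 * * *"}}
    resource  = {"labels": {"service": "orders"}, "nodes": 3}

    effective = deep_merge(deep_merge(bootstrap, project), resource)
    print(effective)
    # labels carry settings from all three levels; backup is inherited; nodes is local
    ```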

    Now that all of our infrastructure is treated like software, we can see what this does for us system-wide:

    Figure 4: Data Domain Flow

    Reaching our destination

    In summary, Config Connector and Config Controller have enabled Waze to achieve true Infrastructure as Software, providing a robust and scalable platform for our infrastructure needs, along with many other benefits including: 

    • Infrastructure consumers receive the latest best practices through versioned updates.

    • Infrastructure owners can iterate and improve infrastructure safely.

    • Platform engineers and security teams are confident that our resources are auditable and compliant.

    • Config Connector leverages Google's managed services, reducing operational overhead.

  38. Engineering Manager

    Mon, 24 Feb 2025 17:00:00 -0000

    Distributed tracing is a critical part of an observability stack, letting you troubleshoot latency and errors in your applications. Cloud Trace, part of Google Cloud Observability, is Google Cloud’s native tracing product, and we’ve made numerous improvements to the Trace explorer UI on top of a new analytics backend.

    Figure 1: Components of the new Trace explorer

    The new Trace explorer page contains:

    1. A filter bar with options for users to choose a Google Cloud project-based trace scope, all/root spans and a custom attribute filter.

    2. A faceted span filter pane that displays commonly used filters based on OpenTelemetry conventions.

    3. A visualization of matching spans including an interactive span duration heatmap (default), a span rate line chart, and a span duration percentile chart.

    4. A table of matching spans that can be narrowed down further by selecting a cell of interest on the heatmap.

    A tour of the new Trace explorer

    Let’s take a closer look at these new features and how you can use them to troubleshoot your applications. Imagine you’re a developer working on the checkoutservice of a retail webstore application and you’ve been paged because there’s an ongoing incident.


    This application is instrumented using OpenTelemetry and sends trace data to Google Cloud Trace, so you navigate to the Trace explorer page on the Google Cloud console with the context set to the Google Cloud project that hosts the checkoutservice.

    Before starting your investigation, you remember that your admin recommended using the webstore-prod trace scope when investigating webstore app-wide prod issues. By using this Trace scope, you'll be able to see spans stored in other Google Cloud projects that are relevant to your investigation.

    Figure 2: Scope selection

    You set the trace scope to webstore-prod and your queries will now include spans from all the projects included in this trace scope.

    Figure 3: User journey

    You select checkoutservice in Span filters (1) and the following updates load on the page:

    • Other sections such as Span name in the span filter pane (2) are updated with counts and percentages that take into account the selection made under service name. This can help you narrow down your search criteria to be more specific.

    • The span Filter bar (3) is updated to display the active filter.

    • The heatmap visualization (4)  is updated to only display spans from the checkoutservice in the last 1 hour (default). You can change the time-range using the time-picker (5). The heatmap’s x-axis is time and the y-axis is span duration. It uses color shades to denote the number of spans in each cell with a legend that indicates the corresponding range.

    • The Spans table (6) is updated with matching spans sorted by duration (default).

    • Other Chart views (7) that you can switch to are also updated with the applied filter.

    From looking at the heatmap, you can see that there are some spans in the >100s range which is abnormal and concerning. But first, you’re curious about the traffic and corresponding latency of calls handled by the checkoutservice.

    Figure 4: Span rate line chart

    Switching to the Span rate line chart gives you an idea of the traffic handled by your service. The x-axis is time and the y-axis is spans/second. The traffic handled by your service looks normal as you know from past experience that 1.5-2 spans/second is quite typical.

    Figure 5: Span duration percentile chart

    Switching to the Span duration percentile chart gives you p50/p90/p95/p99 span duration trends. While p50 looks fine, the p9x durations are greater than you expect for your service.

    Figure 6: Span selection

    You switch back to the heatmap chart and select one of the outlier cells to investigate further. This particular cell has two matching spans with a duration of over 2 minutes, which is concerning.

    Figure 7: Trace details and span attributes

    You investigate one of those spans by viewing the full trace and notice that the orders publish span is the one taking up the majority of the time when servicing this request. Given this, you form a hypothesis that the checkoutservice is having issues handling these types of calls. To validate your hypothesis, you note the rpc.method attribute being PlaceOrder and exit this trace using the X button.

    Figure 8: Custom attribute search

    You add an attribute filter for key: rpc.method value:PlaceOrder using the Filter bar, which shows you that there is a clear latency issue with PlaceOrder calls handled by your service. You’ve seen this issue before and know that there is a runbook that addresses it, so you alert the SRE team with the appropriate action that needs to be taken to mitigate the incident.
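
    Conceptually, the span filters and the percentile views reduce to selecting spans by service and attribute and summarizing their durations. Here is a rough Python sketch; the span records and field names are invented for illustration and are not Cloud Trace's data model.

    ```python
    # Illustrative only: filter spans by service and an attribute
    # (rpc.method = PlaceOrder), then report duration percentiles.
    from statistics import quantiles

    spans = [
        {"service": "checkoutservice", "attributes": {"rpc.method": "PlaceOrder"}, "duration_s": 130.0},
        {"service": "checkoutservice", "attributes": {"rpc.method": "PlaceOrder"}, "duration_s": 1.2},
        {"service": "checkoutservice", "attributes": {"rpc.method": "PlaceOrder"}, "duration_s": 0.8},
        {"service": "checkoutservice", "attributes": {"rpc.method": "GetCart"},    "duration_s": 0.1},
    ]

    def matching_durations(spans, service, attr_key, attr_value):
        return [s["duration_s"] for s in spans
                if s["service"] == service and s["attributes"].get(attr_key) == attr_value]

    durations = matching_durations(spans, "checkoutservice", "rpc.method", "PlaceOrder")
    cuts = quantiles(durations, n=100, method="inclusive")  # cuts[49]=p50, cuts[89]=p90, cuts[98]=p99
    print(f"p50={cuts[49]:.2f}s  p90={cuts[89]:.2f}s  p99={cuts[98]:.2f}s")
    ```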

    Figure 9: Send feedback

    Share your feedback with us via the Send feedback button.

    Behind the scenes

    Figure 10: Cloud Trace architecture

    This new experience is powered by BigQuery, using the same platform that backs Log Analytics. We plan to launch new features that take full advantage of this platform: SQL queries, flexible sampling, export, and regional storage.

    In summary, you can use the new Cloud Trace explorer to perform service-oriented investigations with advanced querying and visualization of trace data. This allows developers and SREs to effectively troubleshoot production incidents and identify mitigating measures to restore normal operations.

    The new Cloud Trace explorer is generally available to all users — try it out and share your feedback with us via the Send feedback button.

  39. Technical Program Manager, Google

    Thu, 20 Feb 2025 17:00:00 -0000

    Picture this: you’re a Site Reliability Engineer (SRE) responsible for the systems that power your company’s machine learning (ML) services. What do you do to ensure you have a reliable ML service, how do you know you’re doing it well, and how can you build strong systems to support these services?

    As artificial intelligence (AI) becomes more widely available, its features — including ML — will matter more to SREs. That’s because ML becomes both a part of the infrastructure used in production software systems, as well as an important feature of the software itself. 

    Abstractly, machine learning relies on its pipelines … and you know how to manage those! So you can begin with pipeline management, then look to other factors that will strengthen your ML services: training, model freshness, and efficiency. In the resources below, we'll look at some of the ML-specific characteristics of these pipelines that you’ll want to consider in your operations. Then, we draw on the experience of Google SREs to show you how to apply your core SRE skills to operating and managing your organization’s machine-learning pipelines. 

    Training ML models

    Training ML models applies the notion of pipelines to specific types of data, often running on specialized hardware. Critical aspects to consider about the pipeline:

    • how much data you’re ingesting

    • how fresh this data needs to be

    • how the system trains and deploys the models 

    • how efficiently the system handles these first three things

    This keynote presents an SRE perspective on the value of applying reliability principles to the components of machine learning systems. It provides insight into why ML systems matter for products, and how SREs should think about them. The challenges that ML systems present include capacity planning, resource management, and monitoring; other challenges include understanding the cost of ML systems as part of your overall operations environment. 


    ML freshness and data volume

    As with any pipeline-based system, a big part of understanding the system is describing how much data it typically ingests and processes. The Data Processing Pipelines chapter in the SRE Workbook lays out the fundamentals: automate the pipeline’s operation so that it is resilient, and can operate unattended. 

    You’ll want to develop Service Level Objectives (SLOs) in order to measure the pipeline’s health, especially for data freshness, i.e., how recently the model got the data it’s using to produce an inference for a customer. Understanding freshness provides an important measure of an ML system’s health, as data that becomes stale may lead to lower-quality inferences and sub-optimal outcomes for the user. For some systems, such as weather forecasting, data may need to be very fresh (just minutes or seconds old); for other systems, such as spell-checkers, data freshness can lag on the order of days — or longer! Freshness requirements will vary by product, so it’s important that you know what you’re building and how the audience expects to use it. 

    In this way, freshness is part of the critical user journey described in the SRE Workbook, capturing one aspect of the customer experience. You can read more about data freshness as a component of pipeline systems in the Google SRE article Reliable Data Processing with Minimal Toil.

    There’s more than freshness to ensuring high-quality data — there’s also how you define the model-training pipeline. A Brief Guide To Running ML Systems in Production gives you the nuts and bolts of this discipline, from using contextual metrics to understand freshness and throughput, to methods for understanding the quality of your input data. 
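
    As a small illustration of what a freshness SLO check might look like, here is a hedged Python sketch; the model names and thresholds are examples only, since real targets depend on the product.

    ```python
    # Illustrative freshness check: freshness is the age of the newest data the
    # model was trained or refreshed with, compared against a per-product target.
    from datetime import datetime, timedelta, timezone

    FRESHNESS_SLO = {
        "weather-forecast": timedelta(minutes=5),  # needs very fresh data
        "spell-checker": timedelta(days=7),        # tolerates much older data
    }

    def freshness_ok(model, newest_ingested_at, now=None):
        now = now or datetime.now(timezone.utc)
        return (now - newest_ingested_at) <= FRESHNESS_SLO[model]

    last_batch = datetime.now(timezone.utc) - timedelta(minutes=42)
    print("weather within SLO:", freshness_ok("weather-forecast", last_batch))  # False
    print("speller within SLO:", freshness_ok("spell-checker", last_batch))     # True
    ```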

    Serving efficiency

    The 2021 SRE blog post Efficient Machine Learning Inference provides a valuable resource to learn about improving your model’s performance in a production environment. (And remember, training is never the same as production for ML services!) 

    Optimizing machine learning inference serving is crucial for real-world deployment. In this article, the authors explore multi-model serving off of a shared VM. They cover realistic use cases and how to manage trade-offs between cost, utilization, and latency of model responses. By changing the allocation of models to VMs, and varying the size and shape of those VMs in terms of processing, GPU, and RAM attached, you can improve the cost effectiveness of model serving. 
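
    As a toy illustration of that trade-off (the model sizes, VM shapes, and prices below are made up, and real serving decisions also weigh latency, CPU, and GPU, not just memory):

    ```python
    # Toy sketch: pack models onto identical VMs by memory (next-fit decreasing),
    # then compare hourly cost across VM shapes. All numbers are invented.
    def vms_needed(model_gb, vm_gb):
        vms, free = 0, 0.0
        for size in sorted(model_gb, reverse=True):
            if size > free:            # open a new VM when the model doesn't fit
                vms, free = vms + 1, vm_gb
            free -= size
        return vms

    models_gb = [6, 4, 4, 3, 2, 2, 1]
    for vm_gb, hourly_usd in [(8, 0.40), (16, 0.75), (32, 1.40)]:
        n = vms_needed(models_gb, vm_gb)
        print(f"{vm_gb:>2} GB VMs: {n} needed, ~${n * hourly_usd:.2f}/hour")
    ```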

    Cost efficiency

    We mentioned that these AI pipelines often rely on specialized hardware. How do you know you’re using this hardware efficiently? Todd Underwood’s talk from SREcon EMEA 2023 on Artificial Intelligence: What Will It Cost You? gives you a sense of how much this specialized hardware costs to run, and how you can provide incentives for using it efficiently. 

    Automation for scale

    This article from Google's SRE team outlines strategies for ensuring reliable data processing while minimizing manual effort, or toil. One of the key takeaways: use an existing, standard platform for as much of the pipeline as possible. After all, your business goals should focus on innovations in presenting the data and the ML model, not in the pipeline itself. The article covers automation, monitoring, and incident response, with a focus on using these concepts to build resilient data pipelines. You’ll read best practices for designing data systems that can handle failures gracefully and reduce a team’s operational burden. This article is essential reading for anyone involved in data engineering or operations. Read more about toil in the SRE Workbook: https://sre.google/workbook/eliminating-toil/

    Next steps

    Successful ML deployments require careful management and monitoring for systems to be reliable and sustainable. That means taking a holistic approach, including implementing data pipelines, training pathways, model management, and validation, alongside monitoring and accuracy metrics. To go deeper, check out this guide on how to use GKE for your AI orchestration.

  40. Cross-Product Solution Developer

    Fri, 14 Feb 2025 17:00:00 -0000

    In today's dynamic digital landscape, building and operating secure, reliable, cost-efficient and high-performing cloud solutions is no easy feat. Enterprises grapple with the complexities of cloud adoption, and often struggle to bridge the gap between business needs, technical implementation, and operational readiness. This is where the Google Cloud Well-Architected Framework comes in. The framework provides comprehensive guidance to help you design, develop, deploy, and operate efficient, secure, resilient, high-performing, and cost-effective Google Cloud topologies that support your security and compliance requirements.

    Who should use the Well-Architected Framework?

    The Well-Architected Framework caters to a broad spectrum of cloud professionals. Cloud architects, developers, IT administrators, decision makers and other practitioners can benefit from years of subject-matter expertise and knowledge both from within Google and from the industry. The framework distills this vast expertise and presents it as an easy-to-consume set of recommendations. 

    The recommendations in the Well-Architected Framework are organized under five business-focused pillars.


    We recently completed a revamp of the guidance in all the pillars and perspectives of the Well-Architected Framework to center the recommendations around a core set of design principles.

    Operational excellence

    • Operational readiness
    • Incident management
    • Resource optimization
    • Change management
    • Continuous improvement

    Security, privacy, and compliance

    • Security by design
    • Zero trust
    • Shift-left security
    • Preemptive cyber-defense
    • Secure and responsible AI
    • AI for security
    • Regulatory, privacy, and compliance needs

    Reliability

    • User-focused goals
    • Realistic targets
    • HA through redundancy
    • Horizontal scaling
    • Observability
    • Graceful degradation
    • Recovery testing
    • Thorough postmortems

    Cost optimization

    • Spending aligned with business value
    • Culture of cost awareness
    • Resource optimization
    • Continuous optimization

    Performance optimization

    • Resource allocation planning
    • Elasticity
    • Modular design
    • Continuous improvement

    In addition to the above pillars, the Well-Architected Framework provides cross-pillar perspectives that present recommendations for selected domains, industries, and technologies like AI and machine learning (ML).


    Benefits of adopting the Well-Architected Framework

    The Well-Architected Framework is much more than a collection of design and operational recommendations. The framework empowers you with a structured principles-oriented design methodology that unlocks many advantages:

    • Enhanced security, privacy, and compliance: Security is paramount in the cloud. The Well-Architected Framework incorporates industry-leading security practices, helping ensure that your cloud architecture meets your security, privacy, and compliance requirements.

    • Optimized cost: The Well-Architected Framework lets you build and operate cost-efficient cloud solutions by promoting a cost-aware culture, focusing on resource optimization, and leveraging built-in cost-saving features in Google Cloud.

    • Resilience, scalability, and flexibility: As your business needs evolve, the Well-Architected Framework helps you design cloud deployments that can scale to accommodate changing demands, remain highly available, and be resilient to disasters and failures.

    • Operational excellence: The Well-Architected Framework promotes operationally sound architectures that are easy to operate, monitor, and maintain.

    • Predictable and workload-specific performance: The Well-Architected Framework offers guidance to help you build, deploy, and operate workloads that provide predictable performance based on your workloads’ needs.

    • The Well-Architected Framework also includes cross-pillar perspectives for selected domains, industries, and technologies like AI and machine learning (ML).

    The principles and recommendations in the Google Cloud Well-Architected Framework are aligned with Google and industry best practices like Google’s Site Reliability Engineering (SRE) practices, DORA capabilities, the Google HEART framework for user-centered metrics, the FinOps framework, Supply-chain Levels for Software Artifacts (SLSA), and Google's Secure AI Framework (SAIF).

    Embrace the Well-Architected Framework to transform your Google Cloud journey, and get comprehensive guidance on security, reliability, cost, performance, and operations — as well as targeted recommendations for specific industries and domains like AI and ML. To learn more, visit Google Cloud Well-Architected Framework.

  41. Product Manager

    Thu, 30 Jan 2025 20:00:00 -0000

    We are thrilled to announce the collaboration between Google Cloud, AWS, and Azure on Kube Resource Orchestrator, or kro (pronounced “crow”). kro introduces a Kubernetes-native, cloud-agnostic way to define groupings of Kubernetes resources. With kro, you can group your applications and their dependencies as a single resource that can be easily consumed by end users.

    Challenges of Kubernetes resource orchestration

    Platform and DevOps teams want to define standards for how application teams deploy their workloads, and they want to use Kubernetes as the platform for creating and enforcing these standards. Each service needs to handle everything from resource creation to security configurations, monitoring setup, defining the end-user interface, and more. There are client-side templating tools that can help with this (e.g., Helm, Kustomize), but Kubernetes lacks a native way for platform teams to create custom groupings of resources for consumption by end users. 

    Before kro, platform teams needed to invest in custom solutions such as building custom Kubernetes controllers, or using packaging tools like Helm, which can’t leverage the benefits of Kubernetes CRDs. These approaches are costly to build, maintain, and troubleshoot, and complex for non-Kubernetes experts to consume. This is a problem many Kubernetes users face. Rather than developing vendor-specific solutions, we’ve partnered with Amazon and Microsoft on making K8s APIs simpler for all Kubernetes users.


    How kro simplifies the developer experience

    kro is a Kubernetes-native framework that lets you create reusable APIs to deploy multiple resources as a single unit. You can use it to encapsulate a Kubernetes deployment and its dependencies into a single API that your application teams can use, even if they aren’t familiar with Kubernetes. You can use kro to create custom end-user interfaces that expose only the parameters an end user should see, hiding the complexity of Kubernetes and cloud-provider APIs.

    kro does this by introducing the concept of a ResourceGraphDefinition, which specifies how a standard Kubernetes Custom Resource Definition (CRD) should be expanded into a set of Kubernetes resources. End users define a single resource, which kro then expands into the custom resources defined in the CRD.

    kro can be used to group and manage any Kubernetes resources. Tools like ACK, KCC, or ASO define CRDs to manage cloud provider resources from Kubernetes (these tools enable cloud provider resources, like storage buckets, to be created and managed as Kubernetes resources). kro can also be used to group resources from these tools, along with any other Kubernetes resources, to define an entire application deployment and the cloud provider resources it depends on.


    Example use cases

    Below, you’ll find some examples of kro being used with Google Cloud. You can find additional examples on the kro website.

    Example 1: GKE cluster definition

    Imagine that a platform administrator wants to give end users in their organization self-service access to create GKE clusters. The platform administrator creates a kro ResourceGraphDefinition called GKEclusterRGD that defines the required Kubernetes resources and a CRD called GKEcluster that exposes only the options they want to be configurable by end users. In addition to creating a cluster, the platform team also wants clusters to deploy administrative workloads such as policies, agents, etc. The ResourceGraphDefinition defines the following resources, using KCC to provide the mappings from K8s CRDs to Google Cloud APIs:

    • GKE cluster, Container Node Pools, IAM ServiceAccount, IAM PolicyMember, Services, Policies

    The platform administrator would then define the end-user interface so that they can create a new cluster by creating an instance of the CRD that defines:

    • Cluster name, Nodepool name, Max nodes, Location (e.g. us-east1), Networks (optional)

    Everything related to policy, service accounts, and service activation (and how these resources relate to each other) is hidden from the end user, simplifying their experience.


    Example 2: Web application definition

    In this example, a DevOps Engineer wants to create a reusable definition of a web application and its dependencies. They create a ResourceGraphDefinition called WebAppRGD, which defines a new Kubernetes CRD called WebApp. This new resource encapsulates all the necessary resources for a web application environment, including:

    • Deployments, service, service accounts, monitoring agents, and cloud resources like object storage buckets. 

    The WebAppRGD ResourceGraphDefinition can set a default configuration, and also define which parameters can be set by the end user at deployment time (kro gives you the flexibility to decide what is immutable, and what an end user is able to configure). A developer then creates an instance of the WebApp CRD, inputting any user-facing parameters. kro then deploys the desired Kubernetes resources.
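
    As a rough sketch of the shape such a definition might take, here is Python that renders a simplified ResourceGraphDefinition-like manifest for the WebApp example. The field names approximate kro's schema and may not match it exactly, so treat this as illustrative and consult the kro documentation for the real format.

    ```python
    # Hypothetical, abbreviated sketch of the WebAppRGD idea; field names are
    # approximations of kro's schema, not an authoritative example.
    import yaml  # PyPI: PyYAML

    webapp_rgd = {
        "apiVersion": "kro.run/v1alpha1",
        "kind": "ResourceGraphDefinition",
        "metadata": {"name": "webapp-rgd"},
        "spec": {
            # The end-user-facing API: only these fields are exposed to developers.
            "schema": {
                "apiVersion": "v1alpha1",
                "kind": "WebApp",
                "spec": {"name": "string", "image": "string",
                         "replicas": "integer | default=2"},
            },
            # The resources each WebApp instance expands into (abbreviated).
            "resources": [
                {"id": "deployment",
                 "template": {"apiVersion": "apps/v1", "kind": "Deployment",
                              "metadata": {"name": "${schema.spec.name}"}}},
                {"id": "service",
                 "template": {"apiVersion": "v1", "kind": "Service",
                              "metadata": {"name": "${schema.spec.name}"}}},
            ],
        },
    }
    print(yaml.safe_dump(webapp_rgd, sort_keys=False))
    ```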


    Key benefits of kro

    We believe kro is a big step forward for platform engineering teams, delivering a number of advantages:

    • Kubernetes-native: kro leverages Kubernetes Custom Resource Definitions (CRDs) to extend Kubernetes, so it works with any Kubernetes resource and integrates with existing Kubernetes tools and workflows.

    • Lets you create a simplified end-user experience: kro makes it easy to define end-user interfaces for complex groups of Kubernetes resources, so that people who are not Kubernetes experts can consume services built on Kubernetes.

    • Enables standardized services for application teams: kro templates can be reused across different projects and environments, promoting consistency and reducing duplication of effort.

    Get started with kro

    kro is available as an open-source project on GitHub. The GitHub organization is currently jointly owned by teams from Google, AWS, and Microsoft, and we welcome contributions from the community. We also have a website with documentation on installing and using kro, including example use cases. As an early-stage project, kro is not yet ready for production use, but we still encourage you to test it out in your own Kubernetes development environments!

  42. Senior Product Manager, Google

    Thu, 23 Jan 2025 17:00:00 -0000

    Platform engineering, one of Gartner’s top 10 strategic technology trends for 2024, is rapidly becoming indispensable for enterprises seeking to accelerate software delivery and improve developer productivity. How does it do that? Platform engineering is about providing the right infrastructure, tools, and processes that enable efficient, scalable software development, deployment, and management, all while minimizing the cognitive burden on developers.

    To uncover the secrets to platform engineering success, Google Cloud partnered with Enterprise Strategy Group (ESG) on a comprehensive research study of 500 global IT professionals and application developers working at organizations with at least 500 employees, all with formal platform engineering teams. Our goal was to understand whether they had adopted platform engineering, and if so, the impact that has had on their company’s software delivery capabilities. 

    The resulting report, “Building Competitive Edge With Platform Engineering: A Strategic Guide,” reveals common patterns, expectations, and actionable best practices for overcoming challenges and fully leveraging platform engineering. This blog post highlights some of the most powerful insights from this study.


    Platform engineering is no longer optional

    The research confirms that platform engineering is no longer a nascent concept. 55% of the global organizations we invited to participate have already adopted platform engineering. Of those, 90% plan to expand its reach to more developers. Furthermore, 85% of companies using platform engineering report that their developers rely on the platform to succeed. These figures highlight that platform engineering is no longer just a trend; it's becoming a vital strategy for organizations seeking to unlock the full potential of their cloud and IT investments and gain a competitive edge.


    Figure 1: 55% of 900+ global organizations surveyed have adopted platform engineering

    Three keys to platform engineering success

    The report identifies three critical components that are central to the success of mature platform engineering leaders. 

    1. Fostering close collaboration between platform engineers and other teams to ensure alignment 

    2. Adopting a “platform as a product” approach, which involves treating the developer platform with a clear roadmap, communicated value, and tight feedback loops

    3. Defining success by measuring performance through clear metrics such as deployment frequency, failure recovery time, and lead time for changes 

    It's noteworthy that while many organizations have begun their platform engineering journey, only 27% of adopters have fully integrated these three key components in their practices, signaling a significant opportunity for further improvements.

    AI: platform engineering's new partner

    One of the most compelling insights of this report is the synergistic relationship between platform engineering and AI. A remarkable 86% of respondents believe that platform engineering is essential to realizing the full business value of AI. At the same time, a vast majority of companies view AI as a catalyst for advancing platform engineering, with 94% of organizations identifying AI to be ‘Critical’ or ‘Important’ to the future of platform engineering.


    Beyond speed: key benefits of platform engineering

    The study also identified three cohorts of platform engineering adopters — nascent, established, and leading — based on whether and how much adopters had embraced the above-mentioned three key components of platform engineering success. The study shows that leading adopters gain more in terms of speed, efficiency, and productivity, and offers guidance for nascent and established adopters to improve their overall platform engineering maturity to gain more benefits.

    The report also identified some additional benefits of platform engineering, including:

    • Improved employee satisfaction, talent acquisition & retention: mature platforms foster a positive developer experience that directly impacts company culture. Developers and IT pros working for organizations with mature developer platforms are much more likely to recommend their workplace to their peers.

    • Accelerated time to market: mature platform engineering adopters have significantly shortened time to market. 71% of leading adopters of platform engineering indicated they have significantly accelerated their time to market, compared with 28% of less mature adopters.

    Don't go it alone

    A vast majority (96%) of surveyed organizations are leveraging open-source tools to build their developer platforms. Moreover, most (84%) are partnering with external vendors to manage and support their open-source environments. Co-managed platforms with a third party or a cloud partner benefit from a higher degree of innovation. Organizations with co-managed platforms allocate an average of 47% of their developers’ productive time to innovation and experimentation, compared to just 38% for those that prefer to manage their platforms with internal staff.

    Ready to succeed? Explore the full report

    While this blog provides a glimpse into the key findings from this study, the full report goes much further, revealing key platform engineering strategies and practices that will help you stay ahead of the curve. Download the report to explore additional topics, including:

    • The strategic considerations of centralized and distributed platform engineering teams

    • The key drivers behind platform engineering investments

    • Top priorities driving platform adoption for developers, ensuring alignment with their needs

    • Key pain points to anticipate and navigate on the road to platform engineering success

    • How platform engineering boosts productivity, performance, and innovation across the entire organization

    • The strategic importance of open source in platform engineering for competitive advantage

    • The transformative role of platform engineering for AI/ML workloads as adoption of AI increases

    • How to develop the right platform engineering strategy to drive scalability and innovation

    Download the full report now.

  43. Software Engineer

    Thu, 23 Jan 2025 17:00:00 -0000

    Editor’s note: This blog post was updated to reflect the general availability status of these features as of March 31, 2025.


    Cloud Deploy is a fully managed continuous delivery platform that automates the delivery of your application. On top of existing automation features, customers tell us they want other ways to automate their deployments to keep their production environments reliable and up to date.

    We're happy to announce three new features to help with that, all in GA.

    1. Repair rollouts

    The new repair rollout automation rule lets you retry failed deployments or automatically roll back to a previously successful release when an error occurs. These errors could come in any phase of a deployment: a pre-deployment SQL migration, a misconfiguration detected when talking to a GKE cluster, or as part of a deployment verification step. In any of these cases, the repair rollout automation lets you retry the failed step a configurable number of times, perfect for those occasionally flaky end-to-end tests. If the retry succeeds, the rollout continues. If the retries fail (or none are configured) the repair rollout automation can also roll back to the previously successful release.
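
    The behavior described above boils down to a simple control flow, sketched here in Python; this is illustrative logic only, not Cloud Deploy's API or configuration syntax.

    ```python
    # Illustrative control flow: retry a failed deployment phase a configured
    # number of times, then roll back to the last known-good release.
    def repair_rollout(deploy_phase, rollback, max_retries=2):
        for attempt in range(1 + max_retries):
            try:
                deploy_phase()
                return "rollout-continues"
            except RuntimeError as err:   # e.g. a flaky end-to-end verification
                print(f"attempt {attempt + 1} failed: {err}")
        rollback()
        return "rolled-back"

    def flaky_phase():
        raise RuntimeError("end-to-end test timed out")

    print(repair_rollout(flaky_phase, lambda: print("rolling back to last good release")))
    ```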


    2. Deploy policies

    Automating deployments is powerful, but it can also be important to put some constraints on the automation. The new deploy policies feature is intended to limit what these automations (or users) can do. Initially, we're launching a time-windows policy which can, for example, inhibit deployments during evenings, weekends, or during important events. While an on-call engineer with the Policy Overrider role could "break glass" to get around these policies, automated deployments won't be able to trigger a rollout in the middle of your big demo.
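
    For intuition, here is what a time-windows check evaluates, as a hedged Python sketch; the window values are examples and this is not the actual policy schema.

    ```python
    # Illustrative check: allow automated rollouts only on weekdays, 09:00-17:00 UTC.
    from datetime import datetime, timezone

    def rollout_allowed(now=None):
        now = now or datetime.now(timezone.utc)
        is_weekend = now.weekday() >= 5      # Saturday=5, Sunday=6
        in_hours = 9 <= now.hour < 17
        return (not is_weekend) and in_hours

    print("rollout allowed right now:", rollout_allowed())
    ```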

    3. Timed promotions

    After a release is successfully rolled out, you may want to automatically deploy it to the next environment. Our previous auto-promote feature let you promote a release after a specified duration, for example moving it into prod 12 hours after it went to staging. But often you want promotions to happen on a schedule, not based on a delay. Within Google, for example, we typically recommend that teams promote from a dev environment into staging every Thursday, and then start a promotion into prod on Monday mornings. With the new timed promotion automation, Cloud Deploy can handle these scheduled promotions for you. 

    The future

    Comprehensive, easy-to-use, and cost-effective DevOps tools are key to efficient software delivery, and it’s our hope that Cloud Deploy will help you implement complete CI/CD pipelines. Stay tuned as we introduce exciting new capabilities and features to Cloud Deploy in the months to come.

    Update your current pipelines with these new features today. Check out the product page, documentation, quickstarts, and tutorials. Finally, if you have feedback on Cloud Deploy, you can join the conversation. We look forward to hearing from you!

  44. Senior Staff Reliability Engineer

    Thu, 09 Jan 2025 17:00:00 -0000

    Cloud applications like Google Workspace provide benefits such as collaboration, availability, security, and cost-efficiency. However, for cloud application developers, there’s a fundamental conflict between achieving high availability and the constant evolution of cloud applications. Changes to the application, such as new code, configuration updates, or infrastructure rearrangements, can introduce bugs and lead to outages. These risks pose a challenge for developers, who must balance stability and innovation while minimizing disruption to users.

    Here on the Google Workspace Site Reliability Engineering team, we once moved a replica of Google Docs to a new data center because we needed extra capacity. But moving the associated data, which was vast, overloaded a key index in our database, restricting users’ ability to create new docs. Thankfully, we were able to identify the root cause and mitigate the problem quickly. Still, this experience convinced us of the need to reduce the risk of a global outage from a simple application change.


    Limit the blast radius

    Our approach to reducing the risk of global outages is to limit the “blast radius,” or extent, of an outage by vertically partitioning the serving stack. The basic idea is to run isolated instances (“partitions”) of application servers and storage (Figure 1). Each partition contains all the various servers necessary to service a user request from end to end. Each production partition also has a pseudo-random mix of users and workloads, so all the partitions have similar resource needs. When it comes time to make changes to the application code, we deploy new changes to one partition at a time. Bad changes may cause a partition-wide outage, but we are protected from a global application outage. 

    Compare this approach to using canarying alone, in which new features or code changes are released to a small group of users before rolling them out to the rest. While canarying deploys changes first to just a few servers, it doesn’t prevent problems from spreading. For example, we’ve had incidents where canaried changes corrupted data used by all the servers in the deployment. With partitioning, the effects of bad changes are isolated to a single partition, preventing such contagion. Of course, in practice, we combine both techniques: canarying new changes to a few servers within a single partition.
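
    A minimal sketch of sticky, pseudo-random assignment along the lines described above, assuming users are keyed by a stable ID; the hash scheme and partition count are illustrative, not Workspace's implementation.

    ```python
    # Hashing a stable user ID gives each user a consistent ("sticky") partition,
    # so every partition ends up with a pseudo-random mix of users and workloads.
    import hashlib

    NUM_PARTITIONS = 16

    def partition_for(user_id: str) -> int:
        digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
        return int(digest, 16) % NUM_PARTITIONS

    # Changes are then rolled out one partition at a time, canarying within each.
    print(partition_for("alice@example.com"), partition_for("bob@example.com"))
    ```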


    Benefits of partitioning

    Broadly speaking, partitioning brings a lot of advantages:

    • Availability: Initially, the primary motivation for partitioning was to improve the availability of our services and avoid global outages. In a global outage, an entire service may be down (e.g., users cannot log into Gmail), or a critical user journey (e.g., users cannot create Calendar events) — obviously things to be avoided.

      Still, the reliability benefits of partitioning can be hard to quantify; global outages are relatively infrequent, so if you don’t have one for a while, it may be due to partitioning, or may be due to luck. That said, we’ve had several outages that were confined to a single partition, and believe they would have expanded into global outages without it.
    • Flexibility: We evaluate many changes to our systems by experimenting with data. Many user-facing experiments, such as a change to a UI element, use discrete groups of users. For example, in Gmail we can choose an on-disk layout that stores the message bodies of emails inline with the message metadata, or a layout that separates them into different disk files. The right decision depends on subtle aspects of the workload. For example, separating message metadata and bodies may reduce latency for some user interactions, but requires more compute resources in our backend servers to perform joins between the body and metadata columns. With partitioning, we can easily evaluate the impact of these choices in contained, isolated environments. 
    • Data location: Google Workspace lets enterprise customers specify that their data be stored in a specific jurisdiction. In our previous, non-partitioned architecture, such guarantees were difficult to provide, especially since services were designed to be globally replicated to reduce latency and take advantage of available capacity.

    Challenges

    Despite the benefits, there are some challenges to adopting partitioning. In some cases, these challenges make it hard or risky to move from a non-partitioned to a partitioned setup. In other cases, challenges persist even after partitioning. Here are the issues as we see them:

    • Not all data models are easy to partition: For example, Google Chat needs to assign both users and chat rooms to partitions. Ideally, a chat and its members would be in a single partition to avoid cross-partition traffic. However, in practice, this is difficult to accomplish. Chat rooms and users form a graph, with users in many chat rooms and chat rooms containing many users. In the worst case, this graph may have only a single connected component spanning every user and chat room. If we were to slice the graph into partitions, we could not guarantee that all users would be in the same partition as their chat rooms.
    • Partitioning a live service requires care: Most of our services pre-date partitioning. As a result, adopting partitioning means taking a live service and changing its routing and storage setup. Even if the end goal is higher reliability, making these kinds of changes in a live system is often the source of outages, and can be risky.
    • Partition misalignment between services: Our services often communicate with each other. For example, if a new person is added to a Calendar event, Calendar servers make a Remote Procedure Call (RPC) to Gmail delivery servers to send the new invitee an email notification. Similarly, Calendar events with video call links require Calendar to talk to Meet servers for a meeting ID. Ideally, we would get the benefits of partitioning even across services. However, aligning partitions between services is difficult. The main reason is that different services tend to use different entity types when determining which partition to use. For example, Calendar partitions on the owner of the calendar while Meet partitions on meeting ID. The result is that there is no clear mapping from partitions in one service to another.
    • Partitions are smaller than the service: A modern cloud application is served by hundreds or thousands of servers. We run servers at less than full utilization so that we can tolerate spikes in traffic, and because servers that are saturated with traffic generally perform poorly. If we have 500 servers, and target each at 60% CPU utilization, we effectively have 200 spare servers to absorb load spikes. Because we do not fail over between partitions, each partition has access to a much smaller amount of spare capacity. In a non-partitioned setup, a few server crashes may likely go unnoticed, since there is enough headroom to absorb the lost capacity. But in a smaller partition, these crashes may account for a non-trivial portion of the available server capacity, and the remaining servers may become overloaded.
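
    To put concrete numbers on the headroom point in the last bullet (the 10-partition split below is an assumption for illustration):

    ```python
    # With a 60% utilization target, 500 servers leave ~200 servers of headroom
    # globally, but a 50-server partition has only ~20 -- so a handful of crashes
    # consumes a quarter of its spare capacity.
    servers, target_util, partitions = 500, 0.60, 10
    spare_global = servers * (1 - target_util)
    per_partition = servers // partitions
    spare_per_partition = per_partition * (1 - target_util)
    crashed = 5
    print(f"global headroom: {spare_global:.0f} servers")
    print(f"per-partition headroom: {spare_per_partition:.0f} servers")
    print(f"{crashed} crashes = {crashed / spare_per_partition:.0%} of one partition's headroom")
    ```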

    Key takeaways

    We can improve the availability of web applications by partitioning their serving stacks. These partitions are isolated, because we do not fail over between them. Users and entities are assigned to partitions in a sticky manner to allow us to roll out changes in order of risk tolerance. This approach allows us to roll out changes one partition at a time with confidence that bad changes will only affect a single partition, and ideally that partition contains only users from your organization.

    In short, partitioning supports our efforts to provide stronger and more reliable services to our users, and it might apply to your service as well. For example, you can improve the availability of your application by using Spanner, which provides geo-partitioning out of the box. Read more about geo-partitioning best practices here.


  45. Product Leader for Customer Telemetry, Google Cloud

    Mon, 06 Jan 2025 17:00:00 -0000

    Cloud incidents happen. And when they do, it’s incumbent on the cloud service provider to communicate about the incident to impacted customers quickly and effectively — and for the cloud service consumer to use that information effectively, as part of a larger incident management response. 

    Google Cloud Personalized Service Health provides businesses with fast, transparent, relevant, and actionable communication about Google Cloud service disruptions, tailored to a specific business at its desired level of granularity. Cybersecurity company Palo Alto Networks is one Google Cloud customer and partner that recently integrated Personalized Service Health signals into the incident workflow for its Google Cloud-based PRISMA Access offering, saving its customers critical minutes during active incidents. 

    By programmatically ingesting Personalized Service Health signals into advanced workflow components, Palo Alto can quickly make decisions such as triggering contingency actions to protect business continuity.

    Let’s take a closer look at how Palo Alto integrated Personalized Service Health into its operations.


    The Personalized Service Health integration

    Palo Alto ingests Personalized Service Health logs into its internal AIOps system, which centralizes incident communications for PRISMA Access and applies advanced techniques to classify and distribute signals to the people responsible for responding to a given incident.

    Personalized Service Health UI: incident list view

    Users of Personalized Service Health can filter which relevance levels they want to see. “Partially related” flags an issue anywhere in the world with products the customer uses; “Related” means the problem is detected within the customer’s data center regions; and “Impacted” means Google has verified the impact to the customer for specific services.

    While Google is still confirming an incident, Personalized Service Health communicates some of these incidents as a 'PSH Emerging Incident' to give customers early notification. Once Google confirms the incident, these are merged into 'PSH Confirmed Incidents'. This helps customers respond faster to a specific incident that’s impacting their environment, or escalate back to Google if needed.

    Personalized Service Health distributes updates throughout an active incident, typically every 30 minutes, or sooner if there’s progress to share. These updates are also written to logs, which Palo Alto ingests into AIOps.

    Programmatically ingesting and distributing incident communications accelerates the response to disruptive, unplanned cloud service provider incidents. This is especially true in large-scale organizations such as Palo Alto, which has multiple teams involved in incident response for different applications, workloads and customers.

    Fueling the incident lifecycle

    Palo Alto further leverages the ingested Personalized Service Health signals in its AIOps platform, which uses machine learning (ML) and analytics to automate IT operations. AIOps harnesses large volumes of operational data to detect and respond to issues in near real time, and correlates these signals with internally generated alerts to declare an incident that is affecting multiple customers. These AIOps alerts are tied to other incident management tools that assist with managing the incident lifecycle, including communication, regular updates and incident resolution.


    In addition, a data enrichment pipeline takes Personalized Service Health incidents, adds Palo Alto’s related information, and publishes the events to Pub/Sub. AIOps then consumes the incident data from Pub/Sub, processes it, correlates it with related event signals, and notifies subscribed channels.
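    To make the flow concrete, here is a minimal sketch, in Python, of what consuming those enriched incident events from a Pub/Sub subscription could look like. The project, subscription, and payload field names are assumptions for illustration, not Palo Alto's actual implementation.

    # Sketch: pull enriched Personalized Service Health incident events from a
    # Pub/Sub subscription and hand them to downstream alerting.
    # Subscription name and payload fields are hypothetical.
    import json
    from concurrent.futures import TimeoutError
    from google.cloud import pubsub_v1

    PROJECT_ID = "my-project"                  # assumption
    SUBSCRIPTION_ID = "psh-incidents-aiops"    # assumption

    subscriber = pubsub_v1.SubscriberClient()
    subscription_path = subscriber.subscription_path(PROJECT_ID, SUBSCRIPTION_ID)

    def callback(message: pubsub_v1.subscriber.message.Message) -> None:
        event = json.loads(message.data.decode("utf-8"))
        # Hypothetical fields added by the enrichment pipeline.
        incident_id = event.get("incident_id")
        relevance = event.get("relevance")     # e.g. IMPACTED / RELATED
        customer = event.get("customer_folder")
        print(f"Incident {incident_id} ({relevance}) affects {customer}")
        # ...route to the appropriate on-call channel here...
        message.ack()

    streaming_pull = subscriber.subscribe(subscription_path, callback=callback)
    print(f"Listening on {subscription_path}...")
    try:
        streaming_pull.result(timeout=60)      # run for 60 seconds in this sketch
    except TimeoutError:
        streaming_pull.cancel()
        streaming_pull.result()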

    Palo Alto organizes Google Cloud assets into folders within the Google Cloud console. Each project represents a Palo Alto PRISMA Access customer. To receive incident signals that are likewise specific to end customers, Palo Alto creates a log sink that’s specific to each folder, aggregating service health logs at the folder level. Palo Alto then receives incident signals specific to each customer so it can take further action.
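    A folder-scoped sink like the one described above can be created with the Cloud Logging API. The sketch below is illustrative only: the folder ID, destination topic, and especially the log filter are placeholders, so check the Personalized Service Health documentation for the exact log name to filter on.

    # Sketch: create a folder-scoped log sink that routes service health log
    # entries to a Pub/Sub topic for one customer's folder.
    from google.cloud.logging_v2.services.config_service_v2 import ConfigServiceV2Client
    from google.cloud.logging_v2.types import LogSink

    FOLDER_ID = "123456789012"   # placeholder folder ID
    DESTINATION = "pubsub.googleapis.com/projects/my-project/topics/psh-incidents"

    client = ConfigServiceV2Client()
    sink = LogSink(
        name="psh-incident-sink",
        destination=DESTINATION,
        # Placeholder filter -- substitute the actual service health log name.
        filter='log_id("servicehealth.googleapis.com/organization_events")',
        include_children=True,
    )
    created = client.create_sink(parent=f"folders/{FOLDER_ID}", sink=sink)
    # The sink's writer identity must be granted the Pub/Sub Publisher role on the topic.
    print(created.writer_identity)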


    Palo Alto drives the following actions based on incident communications flowing from Google Cloud:

    • Proactive detection of zonal, inter-regional, and external en-masse failures

    • Accurate identification of workloads affected by cloud provider incidents

    • Correlation of product issues with cloud service degradation in Google Cloud itself

    Seeing Personalized Service Health’s value

    Incidents caused by cloud providers often go unnoticed or are difficult to isolate without involving multiple teams at the cloud provider (support, engineering, SRE, account management). The Personalized Service Health alerting framework plus the AIOps correlation engine allows Palo Alto’s SRE teams to isolate issues caused by a cloud provider near-instantaneously.


    Palo Alto’s incident management workflow is designed to distinguish mass failures from individual customer outages, ensuring the right teams stay engaged until the incidents are resolved. This includes notifying relevant parties, such as the on-call engineer and the Google Cloud support team. With Personalized Service Health, Palo Alto can capture both event types, i.e., mass failures as well as individual customer outages.

    Palo Alto gets value from Personalized Service Health in multiple ways, beginning with faster incident response and contingency actions that protect business continuity, especially for impacted customers of PRISMA Access. In the event of an incident impacting them, PRISMA Access customers naturally seek and expect information from Palo Alto. By ensuring this information flows rapidly from Google Cloud into Palo Alto’s incident response systems, Palo Alto is able to provide more insightful answers to these end customers, and plans to serve additional use cases based on both existing and future Personalized Service Health capabilities.

    Take your incident management to the next level

    Google Cloud is continually evolving Personalized Service Health to provide deeper value for all Google Cloud customers — from startups, to ISVs and SaaS providers, to the largest enterprises. Ready to get started? Learn more about Personalized Service Health, or reach out to your account team.


    We'd like to thank Jose Andrade, Pankhuri Kumar and Sudhanshu Jain of Google for their contributions to this collaboration between PANW and Google Cloud.

  46. Staff Software Engineer

    Mon, 09 Dec 2024 17:00:00 -0000

    From helping your developers write better code faster with Code Assist, to helping cloud operators more efficiently manage usage with Cloud Assist, Gemini for Google Cloud is your personal AI-powered assistant. 

    However, understanding exactly how your internal users are using Gemini has been a challenge — until today. 

    Today we are announcing Cloud Logging and Cloud Monitoring support for Gemini for Google Cloud. Currently in public preview, Cloud Logging records requests and responses between Gemini for Google Cloud and individual users, while Cloud Monitoring reports 1-day, 7-day, and 28-day Gemini for Google Cloud active users and response counts in aggregate.


    Cloud Logging

    In addition to offering customers general visibility into the impact of Gemini, there are a few scenarios where logs are useful:

    • to track the provenance of your AI-generated content

    • to record and review your users’ usage of Gemini for Google Cloud

    This feature is opt-in; when enabled, it logs your users’ Gemini for Google Cloud activity to Cloud Logging (Cloud Logging charges apply).

    Once enabled, log entries are made for each request to and response from Gemini for Google Cloud. In a typical request entry, Logs Explorer would provide an entry similar to the following example:

    (Screenshot: example request log entry in Logs Explorer)

    There are several things to note about this entry:

    • The content inside jsonPayload contains information about the request. In this case, it was a request to complete Python code with def fibonacci as the input. 

    • The labels tell you the method (CompleteCode), the product (code_assist), and the user who initiated the request (cal@google.com). 

    • The resource labels tell you the instance, location, and resource container (typically project) where the request occurred. 

    In a typical response entry, you’ll see the following:

    (Screenshot: example response log entry in Logs Explorer)

    Note that the request_id inside the labels is identical for this pair of request and response entries, which lets you match requests to their corresponding responses.

    In addition to Logs Explorer, Log Analytics supports queries to analyze your log data and helps you answer questions like "How many requests did User XYZ make to Code Assist?"
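    For ad hoc checks outside the console, the same question can be answered programmatically. Here is a minimal sketch using the Cloud Logging client library for Python; the label keys in the filter follow the entry structure described above, but treat them as placeholders and confirm the exact field names in the logging documentation.

    # Sketch: count request entries for one user of Gemini Code Assist.
    from google.cloud import logging as cloud_logging

    client = cloud_logging.Client(project="my-project")   # assumption
    log_filter = (
        'labels.product="code_assist" '
        'AND labels.user="cal@google.com"'                 # placeholder label keys
    )

    request_count = sum(1 for _ in client.list_entries(filter_=log_filter))
    print(f"Code Assist requests for cal@google.com: {request_count}")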

    For more details, please see the Gemini for Google Cloud logging documentation.

    Cloud Monitoring 

    Gemini for Google Cloud monitoring metrics help you answer questions like: 

    • How many unique active users used Gemini for Google Cloud services over the past day or seven days? 

    • How many total responses did my users receive from Gemini for Google Cloud services over the past six hours?

    Cloud Monitoring support for Gemini for Google Cloud is available to anyone who uses a Gemini for Google Cloud product. It records responses and active users as Cloud Monitoring metrics, which you can use to configure dashboards and alerts.

    Because these metrics are available with Cloud Monitoring, you can also use them as part of Cloud Monitoring dashboards. A “Gemini for Google Cloud” dashboard is automatically installed under “GCP Dashboards” when Gemini for Google Cloud usage is detected:

    (Screenshot: the Gemini for Google Cloud dashboard under GCP Dashboards)

    Metrics Explorer offers another avenue where metrics can be examined and filters applied to gain a more detailed view of your usage. This is done by selecting the “Cloud AI Companion Instance” active resource in the Metrics Explorer:

    (Screenshot: Metrics Explorer filtered to the Cloud AI Companion Instance resource)

    In the example above, response_count is the number of responses sent by Gemini for Google Cloud, and can be filtered for Gemini Code Assist or the Gemini for Google Cloud method (code completion/generation). 
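    If you prefer to query these metrics programmatically rather than through Metrics Explorer, the Cloud Monitoring API can return the same time series. A minimal sketch follows; the metric type string is a placeholder, so look up the exact Gemini for Google Cloud metric names in the monitoring documentation.

    # Sketch: sum response counts over the last six hours via the Monitoring API.
    import time
    from google.cloud import monitoring_v3

    PROJECT_ID = "my-project"                                                 # assumption
    METRIC_TYPE = "cloudaicompanion.googleapis.com/instance/response_count"   # placeholder

    client = monitoring_v3.MetricServiceClient()
    now = int(time.time())
    interval = monitoring_v3.TimeInterval(
        {"start_time": {"seconds": now - 6 * 3600}, "end_time": {"seconds": now}}
    )

    results = client.list_time_series(
        request={
            "name": f"projects/{PROJECT_ID}",
            "filter": f'metric.type = "{METRIC_TYPE}"',
            "interval": interval,
            "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
        }
    )
    total = sum(point.value.int64_value for series in results for point in series.points)
    print(f"Responses in the last 6 hours: {total}")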

    For more details, please see the Gemini for Google Cloud monitoring documentation.

    What’s next

    We’re continually working on additions to these new capabilities, and in particular are focused on Code Assist logging and metrics enhancements that will bring even further insight and observability into your use of Gemini Code Assist and its impact. To get started with Gemini Code Assist and learn more about Gemini Cloud Assist — as well as observability data about it from Cloud Logging and Monitoring — check out the following links: 

  47. EMEA Practice Solutions Lead, Application Platform

    Tue, 22 Oct 2024 17:00:00 -0000

    At the end of the day, developers build, test, deploy and maintain software. But like with lots of things, it’s about the journey, not the destination.

    Among platform engineers, we sometimes refer to that journey as the developer experience (DX), which encompasses how developers feel and interact with the tools and services they use throughout the software build, test, deployment and maintenance process.

    Prioritizing DX is essential: frustrated developers lead to inefficiency, talent loss, and shadow IT. Conversely, a positive DX drives innovation, community, and productivity. And if you want to provide a positive DX, you need to start measuring how you’re doing.

    At PlatformCon 2024, I gave a talk entitled "Improving your developers' platform experience by applying Google frameworks and methods” where I spoke about Google’s HEART Framework, which provides a holistic view of your organization's developers’ experience through actionable data.

    In this article, I will share ideas on how you can apply the HEART framework to your Platform Engineering practice, to gain a more comprehensive view of your organization’s developer experience. But before I do that, let me explain what the HEART Framework is.


    The HEART Framework: an introduction

    In a nutshell, HEART measures developer behaviors and attitudes from their experience of your platform and provides you with insights into what’s going on behind the numbers, by defining specific metrics to track progress towards goals. This is beneficial because continuous improvements through feedback are vital components of a platform engineering journey, helping both platform and application product teams make decisions that are data-driven and user-centered.

    However, HEART is not a data collection tool in and of itself; rather, it’s a user-sentiment framework for selecting the right metrics to focus on based on product or platform objectives. It balances quantitative or empirical data, e.g., number of active portal users, with qualitative or subjective insights such as "My users feel the portal navigation is confusing." In other words, consider HEART as a framework or methodology for assessing user experience, rather than a specific tool or assessment. It helps you decide what to measure, not how to measure it.

    (Figure: the five HEART dimensions — Happiness, Engagement, Adoption, Retention, and Task success)

    Let’s take a look at each of these in more detail.

    Happiness: Do users actually enjoy using your product?

    Highlight: Gathering and analyzing developer feedback

    Subjective metrics:

    • Surveys: Conduct regular surveys to gather feedback about overall satisfaction, ease of use, and pain points. Toil negatively affects developer satisfaction and morale: repetitive, manual work can lead to frustration, burnout, and decreased happiness with the platform.

    • Feedback mechanisms: Establish easy ways for developers to provide direct feedback on specific features or areas of the platform like Net Promoter Score (NPS) or Customer Satisfaction surveys (CSAT).

    • Collect open-ended feedback from developers through interviews and user groups.

    • Sentiment analysis: Analyze developer sentiment expressed in feedback channels, support tickets and online communities.

    System metrics:

    • Feature requests: Track the number and types of feature requests submitted by developers. This provides insights into their needs and desires and can help you prioritize improvements that will enhance happiness.

    Watch out for: While platforms can boost developer productivity, they might not necessarily contribute to developer job satisfaction. This warrants further investigation, especially if your research suggests that your developers are unhappy.

    Engagement: What is the developer breadth and quality of platform experience?

    Highlight: Frequency and quality of interaction between platform engineers and developers — intensity and quality of interaction with the platform, participation in chat channels, training, dual ownership of golden paths, joint troubleshooting, engagement in architectural design discussions, and the breadth of interaction by everyone from new hires through to senior developers.

    Subjective metrics:

    • Survey for quality of interaction — focus on the depth and type of interaction, whether through chat channels, trainings, dual ownership of golden paths, joint troubleshooting, or architectural design discussions

    • High toil can reduce developer engagement with the platform. When developers spend excessive amounts of time on tedious tasks, they are less likely to explore new features, experiment, and contribute to the platform's evolution.

    System metrics:

    • Active users: Track daily, weekly, and monthly active developers and how long they spend on tasks.

    • Usage patterns: Analyze the most used platform features, tools, and portal resources.

    • Frequency of interaction between platform engineers with developers.

    • Breadth of user engagement: Track onboarding time for new hires to reach proficiency, measure the percentage of senior developers actively contributing to golden paths or portal functionality.

    Watch out for: Don’t confuse engagement with satisfaction. Developers may rate the platform highly in surveys, but usage data might reveal low frequency of interaction with core features or a limited subset of teams actively using the platform. Ask them “How has the platform changed your daily workflow?” rather than “Are you satisfied with the platform?”

    Adoption: What is the platform growth rate and developer feature adoption?

    Highlight: Overall acceptance and integration of the platform into the development workflow.

    System metrics:

    • New user registrations: Monitor the growth rate of new developers using the platform.

    • Track the time between registration and first use of the platform, i.e., executing golden paths, tooling, and portal functionality.

    • Number of active users per week / month / quarter / half-year / year who authenticate via the portal and/or use golden paths, tooling and portal functionality

    • Feature adoption: Track how quickly and widely new features or updates are used.

    • Percentage of developers using CI/CD through the platform

    • Number of deployments per user / team / day / week / month — basically of your choosing

    • Training: Evaluate changes in adoption, after delivering training.

    Watch out for: Overlooking the "long tail" of adoption. A platform might see a burst of early adoption, but then plateau or even decline if it fails to continuously evolve and meet changing developer needs. Don't just measure initial adoption, monitor how usage evolves over weeks, months, and years.

    Retention: Are developers loyal to the platform?

    Highlight: Long-term engagement and reducing churn.

    Subjective metrics:

    • Use an exit survey if a user is dormant for 12 or more months.

    System metrics:

    • Churn rate: Track the percentage of developers who stop logging into the portal and are not using it.

    • Dormant users: Identify developers who become inactive after 6 months and investigate why.

    • Track services that are less frequently used.

    Watch out for: Misinterpreting the reasons for churn. When developers stop using your platform (churn), it's crucial to understand why. Incorrectly identifying the cause can lead to wasted effort and missed opportunities for improvement. Consider factors outside the platform — churn could be caused by changes in project requirements, team structures or industry trends.
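    The retention system metrics above are straightforward to derive once you have basic usage events. Here is an illustrative sketch, assuming a simple table of per-user usage timestamps; the column names, thresholds, and data are made up for the example rather than a prescribed schema.

    # Sketch: monthly active users, dormant users, and churn from usage events.
    import pandas as pd

    events = pd.DataFrame({
        "user": ["ana", "ana", "bo", "bo", "chris", "dee"],
        "timestamp": pd.to_datetime([
            "2024-09-02", "2025-03-10", "2024-01-15",
            "2025-02-20", "2024-06-01", "2025-03-01",
        ]),
    })
    as_of = pd.Timestamp("2025-03-31")

    last_seen = events.groupby("user")["timestamp"].max()
    active_this_month = (last_seen >= as_of - pd.DateOffset(months=1)).sum()
    dormant = last_seen[last_seen < as_of - pd.DateOffset(months=6)]   # inactive 6+ months
    churn_rate = len(dormant) / len(last_seen)

    print(f"Monthly active users: {active_this_month}")
    print(f"Dormant users (6+ months): {list(dormant.index)}")
    print(f"Churn rate: {churn_rate:.0%}")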

    Task success: Can developers complete specific tasks?

    Highlight: Efficiency and effectiveness of the platform in supporting specific developer activities.

    Subjective metrics:

    • Survey to assess the ongoing presence of toil and its harmful effect on developer productivity, which ultimately hinders efficiency and increases task completion times.

    System metrics:

    • Completion rates: Measure the percentage of golden paths and tools successfully run on the platform without errors.

    • Time to complete tasks using golden paths, portal, or tooling.

    • Error rates: Track common errors and failures developers encounter from log files or monitoring dashboards from golden paths, portal or tooling.

    • Mean Time to Resolution (MTTR): When errors do occur, how long does it take to resolve them? A lower MTTR indicates a more resilient platform and faster recovery from failures.

    • Developer platform and portal uptime: Measure the percentage of time that the developer platform and portal is available and operational. Higher uptime ensures developers can consistently access the platform and complete their tasks.

    Watch out for: Don't confuse task success with task completion. Simply measuring whether developers can complete tasks on the platform doesn't necessarily indicate true success. Developers might find workarounds or complete tasks inefficiently, even if they technically achieve the end goal. It may be worth manually observing developer workflows in their natural environment to identify pain points and areas of friction in their workflows.

    Also, be careful with misaligning task success with business goals. Task completion might overlook the broader impact on business objectives. A platform might enable developers to complete tasks efficiently, but if those tasks don't contribute to overall business goals, the platform's true value is questionable.
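    Two of the task-success system metrics above, MTTR and uptime, reduce to simple arithmetic once incident timestamps are recorded. A minimal sketch, using made-up incident data and a one-month measurement window:

    # Sketch: compute MTTR and uptime percentage from incident records.
    from datetime import datetime, timedelta

    incidents = [  # (detected, resolved) pairs -- example data only
        (datetime(2025, 3, 3, 9, 0), datetime(2025, 3, 3, 10, 30)),
        (datetime(2025, 3, 18, 22, 0), datetime(2025, 3, 19, 0, 0)),
    ]
    window = timedelta(days=31)  # measurement period

    downtime = sum((resolved - detected for detected, resolved in incidents), timedelta())
    mttr = downtime / len(incidents)
    uptime_pct = 100 * (1 - downtime / window)

    print(f"MTTR: {mttr}")               # 1:45:00
    print(f"Uptime: {uptime_pct:.3f}%")  # ~99.530%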

    Applying the HEART framework to platform engineering

    It’s not necessary to use all of the categories each time. The number of categories to consider really depends on the specific goals and context of the assessment; you can include everything or trim it down to better match your objective. Here are some examples:

    • Improving onboarding for new developers: Focus on adoption, task success and happiness.

    • Launching a new feature: Concentrate on adoption and happiness.

    • Increasing platform usage: Track engagement, retention and task success.

    Keep in mind that relying on just one category will likely provide an incomplete picture.

    When should you use the framework?

    In a perfect world, you would use the HEART framework to establish a baseline assessment a few months after launching your platform, which will provide you with a valuable insight into early developer experience. As your platform evolves, this initial data becomes a benchmark for measuring progress and identifying trends. Early measurement allows you to proactively address UX issues, guide design decisions with data, and iterate quickly for optimal functionality and developer satisfaction. If you're starting with an MVP, conduct the baseline assessment once the core functionality is in place and you have a small group of early users to provide feedback.

    After 12 or more months of usage, you can also add metrics to embody a new or more mature platform. This can help you gather deeper insights into your developers’ experience by understanding how they are using the platform, measure the impact of changes you’ve made to the platform, or identify areas for improvement and prioritize future development efforts. If you've added new golden paths, tooling, or enhanced functionality, then you'll need to track metrics that measure their success and impact on developer behavior.

    The frequency with which you assess HEART metrics depends on several factors, including:

    • The maturity of your platform: Newer platforms benefit from more frequent reviews (e.g. monthly or quarterly) to track progress and address early issues. As the platform matures, you can reduce the frequency of your HEART assessments (e.g., bi-annually or annually).

    • The rate of change: To ensure updates and changes have a positive impact, apply the HEART framework more frequently when your platform is undergoing a period of rapid evolution such as major platform updates, new portal features or new golden paths, or some change in user behavior. This allows you to closely monitor the effects of each change on key metrics.

    • The size and complexity of your platform: Larger and more complex platforms may require more frequent assessments to capture nuances and potential issues.

    • Your team's capacity: Running HEART assessments requires time and resources. Consider your team's bandwidth and adjust the frequency accordingly.

    Schedule periodic deep dives (e.g. quarterly or bi-annually) using the HEART framework to gain a more in-depth understanding of your platform's performance and identify areas for improvement.

    Taking more steps towards platform engineering

    In this blog post, we’ve shown how the HEART framework can be applied to platform engineering to measure and improve the developer experience. We’ve explored the five key aspects of the framework — happiness, engagement, adoption, retention, and task success — and provided specific metrics for each and guidance on when to apply them. By applying these insights, platform engineering teams can create a more positive and productive environment for their developers, leading to greater success in their software development efforts. To learn more about platform engineering, check out some of our other articles: 5 myths about platform engineering: what it is and what it isn’t, Another five myths about platform engineering, and Laying the foundation for a career in platform engineering.

    And finally, check out the DORA Report 2024, which now has a section on Platform Engineering.

  48. DORA Research Lead

    Tue, 22 Oct 2024 16:00:00 -0000

    The DORA research program has been investigating the capabilities, practices, and measures of high-performing technology-driven teams and organizations for more than a decade. It has published reports based on data collected from annual surveys of professionals working in technical roles, including software developers, managers, and senior executives.

    Today, we’re pleased to announce the publication of the 2024 Accelerate State of DevOps Report, marking a decade of DORA’s investigation into high-performing technology teams and organizations. DORA’s four key metrics, introduced in 2013, have become the industry standard for measuring software delivery performance. 

    Each year, we seek to gain a comprehensive understanding of standard DORA performance metrics, and how they intersect with individual, workflow, team, and product performance. We now include how AI adoption affects software development across multiple levels, too.


    We also establish reference points each year to help teams understand how they are performing, relative to their peers, and to inspire teams with the knowledge that elite performance is possible in every industry. DORA’s research over the last decade has been designed to help teams get better at getting better: to strive to improve their improvements year over year. 

    For a quick overview of this year’s report, you can read our executive DORA Report summary, which spotlights AI adoption trends and impact, the emergence of platform engineering, and the continuing significance of developer experience.

    Organizations across all industries are prioritizing the integration of AI into their applications and services. Developers are increasingly relying on AI to improve their productivity and fulfill their core responsibilities. This year's research reveals a complex landscape of benefits and tradeoffs for AI adoption.

    The report underscores the need to approach platform engineering thoughtfully, and emphasizes the critical role of developer experience in achieving high performance. 


    AI: Benefits, challenges, and developing trust

    Widespread AI adoption is reshaping software development practices. More than 75 percent of respondents said that they rely on AI for at least one daily professional responsibility. The most prevalent use cases include code writing, information summarization, and code explanation. 

    The report confirms that AI is boosting productivity for many developers. More than one-third of respondents experienced “moderate” to “extreme” productivity increases due to AI.


    A 25% increase in AI adoption is associated with improvements in several key areas:

    • 7.5% increase in documentation quality

    • 3.4% increase in code quality

    • 3.1% increase in code review speed

    However, despite AI’s potential benefits, our research revealed a critical finding: AI adoption may negatively impact software delivery performance. As AI adoption increased, it was accompanied by an estimated 1.5% decrease in delivery throughput and an estimated 7.2% reduction in delivery stability. Our data suggest that improving the development process does not automatically improve software delivery — at least not without proper adherence to the basics of successful software delivery, like small batch sizes and robust testing mechanisms. AI has positive impacts on many important individual and organizational factors that foster the conditions for high software delivery performance, but it does not appear to be a panacea.

    Our research also shows that despite the productivity gains, 39% of respondents reported little to no trust in AI-generated code. This unexpectedly low level of trust indicates that there is a need to manage AI integration more thoughtfully. Teams must carefully evaluate AI’s role in their development workflow to mitigate the downsides.

    Based on these findings, we have three core recommendations:

    1. Enable your employees and reduce toil by orienting your AI adoption strategies towards empowering employees and alleviating the burden of undesirable tasks.

    2. Establish clear guidelines for the use of AI, address procedural concerns, and foster open communication about its impact.

    3. Encourage continuous exploration of AI tools, provide dedicated time for experimentation, and promote trust through hands-on experience.

    Platform engineering: A paradigm shift

    Another emerging discipline our research focused on this year is platform engineering, which centers on building and operating internal development platforms to streamline processes and enhance efficiency.


    Our research identified 4 key findings regarding platform engineering:

    • Increased developer productivity: Internal development platforms effectively increase productivity for developers.

    • Prevalence in larger firms: These platforms are more commonly found in larger organizations, suggesting their suitability for managing complex development environments.

    • Potential performance dip: Implementing a platform engineering initiative might lead to a temporary decrease in performance before improvements manifest as the platform matures.

    • Need for user-centeredness and developer independence: For optimal results, platform engineering efforts should prioritize user-centered design, developer independence, and a product-oriented approach.

    A thoughtful approach that prioritizes user needs, empowers developers, and anticipates potential challenges is key to maximizing the benefits of platform engineering initiatives. 

    Developer experience: The cornerstone of success

    One of the key insights in last year’s report was that a healthy culture can help reduce burnout, increase productivity, and increase job satisfaction. This year was no different. Teams that cultivate a stable and supportive environment that empowers developers to excel drive positive outcomes. 

    A ‘move fast and constantly pivot’ mentality negatively impacts developer well-being and, consequently, overall performance. Instability in priorities, even with strong leadership, comprehensive documentation, and a user-centered approach — all known to be highly beneficial — can significantly hinder progress.

    Creating a work environment where your team feels supported, valued, and empowered to contribute is fundamental to achieving high performance. 

    How to use these findings to help your DevOps team

    The key takeaway from the decade of research is that software development success hinges not just on technical prowess but also on fostering a supportive culture, prioritizing user needs, and focusing on developer experience. We encourage teams to replicate our findings within their specific context.

    The findings can serve as hypotheses for your experiments and continuous improvement initiatives. Please share the results with us and the DORA community, so that your efforts can become part of our collaborative learning environment.

    We work on this research in hopes that it serves as a roadmap for teams and organizations seeking to improve their practices and create a thriving environment for innovation, collaboration, and business success. We will continue our platform-agnostic research that focuses on the human aspect of technology for the next decade to come.

    To learn more:

  49. Product Manager - Google Cloud Databases

    Thu, 10 Oct 2024 14:00:00 -0000

    Organizations are grappling with an explosion of operational data spread across an increasingly diverse and complex database landscape. This complexity often results in costly outages, performance bottlenecks, security vulnerabilities, and compliance gaps, hindering their ability to extract valuable insights and deliver exceptional customer experiences. To help businesses overcome these challenges, earlier this year, we announced the preview of Database Center, an AI-powered, unified fleet management solution.

    We’re seeing accelerated adoption of Database Center from many customers. For example, Ford uses Database Center to get answers about its database fleet health in seconds, and proactively mitigates potential risks to its applications. Today, we’re announcing that Database Center is now available to all customers, empowering you to monitor and operate database fleets at scale with a single, unified solution. We've also added support for Spanner, so you can manage it along with your Cloud SQL and AlloyDB deployments, with support for additional databases on the way.

    Database Center is designed to bring order to the chaos of your database fleet, and unlock the true potential of your data. It provides a single, intuitive interface where you can:

    • Gain a comprehensive view of your entire database fleet. No more silos of information or hunting through bespoke tools and spreadsheets.

    • Proactively de-risk your fleet with intelligent performance and security recommendations. Database Center provides actionable insights to help you stay ahead of potential problems, and helps improve performance, reduce costs and enhance security with data-driven suggestions.

    • Optimize your database fleet with AI-powered assistance. Use a natural-language chat interface to ask questions, quickly resolve fleet issues, and get optimization recommendations.

    Let’s now review each in more detail.

    Gain a comprehensive view of your database fleet 

    Tired of juggling different tools and consoles to keep track of your databases?

    Database Center simplifies database management with a single, unified view of your entire database landscape. You can monitor database resources across your entire organization, spanning multiple engines, versions, regions, projects and environments (or applications using labels). 

    Cloud SQL, AlloyDB, and now Spanner are all fully integrated with Database Center, so you can monitor your inventory and proactively detect issues. Using the unified inventory view in Database Center, you can: 

    • Identify out-of-date database versions to ensure proper support and reliability

    • Track version upgrades, e.g., whether a PostgreSQL 14 to PostgreSQL 15 upgrade is proceeding at the expected pace

    • Ensure database resources are appropriately distributed, e.g., identify the number of databases powering the critical production applications vs. non-critical dev/test environments

    • Monitor database migration from on-prem to cloud or across engines

    Manage Cloud SQL, AlloyDB and Spanner resources with a unified view.

    Proactively de-risk your fleet with recommendations

    Managing your database fleet health at scale can involve navigating through a complex blend of security postures, data protection settings, resource configurations, performance tuning and cost optimizations. Database Center proactively detects issues associated with these configurations and guides you through addressing them. 

    For example, high transaction ID for a Cloud SQL instance can lead to the database no longer accepting new queries, potentially causing latency issues or even downtime. Database Center proactively detects this, provides an in-depth explanation, and walks you through prescriptive steps to troubleshoot the issue. 

    We’ve also added several performance recommendations to Database Center related to excessive tables/joins, connections, or logs, and can assist you through a simple optimization journey.

    End-to-end workflow for detecting and troubleshooting performance issues.

    Database Center also simplifies compliance management by automatically detecting and reporting violations across a wide range of industry standards, including CIS, PCI-DSS, SOC 2, and HIPAA. Database Center continuously monitors your databases for potential compliance violations. When a violation is detected, you receive a clear explanation of the problem, including:

    • The specific security or reliability issue causing the violation 

    • Actionable steps to help address the issue and restore compliance

    This helps reduce the risk of costly penalties, simplifies compliance audits and strengthens your security posture. Database Center now also supports real-time detection of unauthorized access, updates, and data exports.

    Database Center helps ensure compliance with HIPAA standards.

    Optimize your fleet with AI-powered assistance

    With Gemini enabled, Database Center makes optimizing your database fleet incredibly intuitive. Simply chat with the AI-powered interface to get precise answers, uncover issues within your database fleet, troubleshoot problems, and quickly implement solutions. For example, you can quickly identify under-provisioned instances across your entire fleet, access actionable insights such as the duration of high CPU/Memory utilization conditions, receive recommendations for optimal CPU/memory configurations, and learn about the associated cost of those adjustments. 

    AI-powered chat in Database Center provides comprehensive information and recommendations across all aspects of database management, including inventory, performance, availability and data protection. Additionally, AI-powered cost recommendations suggest ways for optimizing your spend, and advanced security and compliance recommendations help strengthen your security and compliance posture.

    AI-powered chat to identify data protection issues and optimize cost.

    Get started with Database Center today

    The new capabilities of Database Center are available in preview today for Spanner, Cloud SQL, and AlloyDB for all customers. Simply access Database Center within the Google Cloud console and begin monitoring and managing your entire database fleet. To learn more about Database Center’s capabilities, check out the documentation.

  50. Product Manager, Google Cloud

    Tue, 08 Oct 2024 16:00:00 -0000

    Editor's note: Starting February 4, 2025, pipe syntax will be available to all BigQuery users by default.


    Log data has become an invaluable resource for organizations seeking to understand application behavior, optimize performance, strengthen security, and enhance user experiences. But the sheer volume and complexity of logs generated by modern applications can feel overwhelming. How do you extract meaningful insights from this sea of data?

    At Google Cloud, we’re committed to providing you with the most powerful and intuitive tools to unlock the full potential of your log data. That's why we're thrilled to announce a series of innovations in BigQuery and Cloud Logging designed to revolutionize the way you manage, analyze, and derive value from your logs.

    BigQuery pipe syntax: Reimagine SQL for log data

    Say goodbye to the days of deciphering complex, nested SQL queries. BigQuery pipe syntax ushers in a new era of SQL, specifically designed with the semi-structured nature of log data in mind. BigQuery’s pipe syntax introduces an intuitive, top-down syntax that mirrors how you naturally approach data transformations. As demonstrated in the recent research by Google, this approach leads to significant improvements in query readability and writability. By visually separating different stages of a query with the pipe symbol (|>), it becomes remarkably easy to understand the logical flow of data transformation. Each step is clear, concise, and self-contained, making your queries more approachable for both you and your team.

    BigQuery’s pipe syntax isn’t just about cleaner SQL — it’s about unlocking a more intuitive and efficient way to work with your data. Instead of wrestling with code, experience faster insights, improved collaboration, and more time spent extracting value.

    This streamlined approach is especially powerful when it comes to the world of log analysis. 

    With log analysis, exploration is key. Log analysis is rarely a straight line from question to answer. Analyzing logs often means sifting through mountains of data to find specific events or patterns. You explore, you discover, and you refine your approach as you go. Pipe syntax embraces this iterative approach. You can smoothly chain together filters (WHERE), aggregations (COUNT), and sorting (ORDER BY) to extract those golden insights. You can also add or remove steps in your data processing as you uncover new insights, easily adjusting your analysis on the fly.

    Imagine you want to count the total number of users who were affected by the same errors more than 100 times in the month of January. As shown below, the pipe syntax’s linear structure clearly shows the data flowing through each transformation: starting from the table, filtering by the dates, counting by user id and error type, filtering for errors >100, and finally counting the number of users affected by the same errors.

    -- Pipe syntax
    FROM log_table
    |> WHERE datetime BETWEEN DATETIME '2024-01-01' AND '2024-01-31'
    |> AGGREGATE COUNT(log_id) AS error_count GROUP BY user_id, error_type
    |> WHERE error_count > 100
    |> AGGREGATE COUNT(user_id) AS user_count GROUP BY error_type

    The same example in the standard syntax typically requires a subquery and a non-linear structure.

    -- Standard syntax
    SELECT error_type, COUNT(user_id) AS user_count
    FROM (
      SELECT user_id, error_type,
        COUNT(log_id) AS error_count
      FROM log_table
      WHERE datetime BETWEEN DATETIME '2024-01-01' AND DATETIME '2024-01-31'
      GROUP BY user_id, error_type
    )
    WHERE error_count > 100
    GROUP BY error_type;

    Carrefour: A customer's perspective

    The impact of these advancements is already being felt by our customers. Here's what Carrefour, a global leader in retail, had to say about their experience with pipe syntax:

     "Pipe syntax has been a very refreshing addition to BigQuery. We started using it to dig into our audit logs, where we often use Common Table Expressions (CTEs) and aggregations. With pipe syntax, we can filter and aggregate data on the fly by just adding more pipes to the same query. This iterative approach is very intuitive and natural to read and write. We are now using it for our analysis work in every business domain. We will have a hard time going back to the old SQL syntax now!" - Axel Thevenot, Lead Data Engineer, and Guillaume Blaquiere, Data Architect, Carrefour

    BigQuery pipe syntax is currently available for all BigQuery users. You can check out this introductory video.

    Beyond syntax: performance and flexibility

    But we haven't stopped at simplifying your code. BigQuery now offers enhanced performance and powerful JSON handling capabilities to further accelerate your log analytics workflows. Given the prevalence of JSON data in logs, we expect these changes to simplify log analytics for a majority of users.

    • Enhanced Point Lookups: Pinpoint critical events in massive datasets quickly using BigQuery's numeric search indexes, which dramatically accelerate queries that filter on timestamps and unique IDs. Here is a sample improvement from the announcement blog:

    Metric                  Without Index          With Index       Improvement

    Execution Time (ms)     48,790                 4,664            10x
    Processed Bytes         2,174,758,158,336      774,897,664      2,806x
    Slot Usage (ms)         25,735,222             7,300            3,525x

    • Powerful JSON Analysis: Parse and analyze your JSON-formatted log data with ease using BigQuery's JSON_KEYS function and JSONPath traversal feature. Extract specific fields, filter on nested values, and navigate complex JSON structures without breaking a sweat.

      • JSON_KEYS extracts unique JSON keys from JSON data for easier schema exploration and discoverability 

    Query: JSON_KEYS(JSON '{"a":{"b":1}}')
    Result: ["a", "a.b"]

    Query: JSON_KEYS(JSON '{"a":[{"b":1}, {"c":2}]}', mode => "lax")
    Result: ["a", "a.b", "a.c"]

    Query: JSON_KEYS(JSON '[[{"a":1},{"b":2}]]', mode => "lax recursive")
    Result: ["a", "b"]

      • JSONPath with LAX modes lets you easily fetch JSON arrays without having to use verbose UNNEST. The example below shows how to fetch all phone numbers from the person field, before and after:
    -- Consider a JSON field 'person' such as:
    -- [{
    --   "name": "Bob",
    --   "phone": [{"type": "home", "number": 20}, {"number": 30}]
    -- }]

    -- Previously, to fetch all phone numbers from the 'person' column:
    SELECT phone.number
    FROM (
      SELECT IF(JSON_TYPE(person.phone) = "array", JSON_QUERY_ARRAY(person.phone), [person.phone]) AS nested_phone
      FROM (
        SELECT IF(JSON_TYPE(person) = "array", JSON_QUERY_ARRAY(person), [person]) AS nested_person
        FROM t
      ), UNNEST(nested_person) person
    ), UNNEST(nested_phone) phone

    -- With lax mode:
    SELECT JSON_QUERY(person, "lax recursive $.phone.number") FROM t

    Log Analytics in Cloud Logging: Bringing it all together

    Log Analytics in Cloud Logging is built on top of BigQuery and provides a UI that’s purpose-built for log analysis. With an integrated date/time picker, charting and dashboarding, Log Analytics makes use of the JSON capabilities to support advanced queries and analyze logs faster. To seamlessly integrate these powerful capabilities into your log management workflow, we're also enhancing Log Analytics (in Cloud Logging) with pipe syntax. You can now analyze your logs within Log Analytics leveraging the full power of BigQuery pipe syntax, enhanced lookups, and JSON handling, all within a unified platform.


    Use of pipe syntax in Log Analytics (Cloud Logging) is now available in preview.

    Unlock the future of log analytics today

    BigQuery and Cloud Logging provide an unmatched solution for managing, analyzing, and extracting actionable insights from your log data. Explore these new capabilities today and experience the power of pipe syntax, enhanced lookups, and JSON handling for yourself.

    Start your journey towards more insightful and efficient log analytics in the cloud with BigQuery and Cloud Logging. Your data holds the answers — we're here to help you find them.

  51. Chief Evangelist, Google Cloud

    Fri, 04 Oct 2024 17:00:00 -0000

    As AI adoption speeds up, one thing is becoming clear: the developer platforms that got you this far won’t get you to the next stage. While yesterday’s platforms were awesome, let’s face it, they weren’t built for today’s AI-infused application development and deployment. And organizations are quickly realizing they need to update their platform strategies to ensure that developers — and the wider set of folks using AI — have what they need for the years ahead.

    In fact, as I explore in a new paper, nine out of ten decision makers are prioritizing the task of optimizing workloads for AI over the next 12 months. Problem is, given the pace of change lately, many don’t know where to start or what they need when it comes to modernizing their developer platforms.

    What follows is a quick look at the key steps involved in planning your platform strategy. For all the details, download my full guide, Three pillars of a modern, AI-ready platform.

    Step 1. Define your platform’s purpose

    Whether you’re building your first platform or your fiftieth, you need to start by asking, “Why?” After all, a new platform is another asset to maintain and operate — you need to make sure it exists for the right reasons.

    To build your case, ask yourself three questions:

    • Who is the platform for? Your platform’s customers, or users, can include developers, architects, product teams, SREs and Ops personnel, data scientists, security teams, and platform owners. Each has different needs, and your platform will need to be tailored accordingly.
    • What are its goals? Work out what problems you’re trying to solve. For example, are you optimizing for AI? Striving to speed up software delivery? Increasing developer productivity? Improving scale or security? Again, different goals will lead you down different paths for your platform — so map them out right from the start.
    • How will you measure success? To prove the worth of your platform, and to convince stakeholders to invest in its ongoing maintenance, establish metrics from the outset, and keep on measuring them! These could range from improved customer satisfaction to faster time-to-resolution for support issues. 

    Step 2. Assemble the pieces of your platform

    Now that you’re clear on the customers, goals, and performance metrics of the platform you need, it’s time to actually build the thing. Here’s a glance at the key components of a modern, AI-ready platform — complete with the capabilities developers need to hit the ground running when developing AI-powered solutions.

    (Figure: key components of a modern, AI-ready platform)

    For a detailed breakdown of what to consider in each area of your platform, including a list of technology options for each category, head over to the full paper.

    Step 3. Establish a process for improving your platform

    The journey doesn’t end once your platform’s built. In fact, it’s just beginning. A platform is never “done;” it’s just released. As such, you need to adopt a continuous improvement mindset and assign a core platform team the task of finding new ways to introduce value to stakeholders.

    At this stage, my top tip is to treat your platform like a product, applying platform engineering principles to keep making it faster, cheaper, and easier to deliver software. Oh, and to leverage the latest in AI-driven optimization tools to monitor and maintain your platform over time!  

    Ready to start your platform journey?

    Organizations embark on platform overhauls for a whole bunch of reasons. Some do it to better cope with forecasted growth. Others have AI adoption in their sights. Then there are those driven by cost, performance, or the user experience. Whatever your reason for getting started, I encourage you to read the full paper on building a modern AI-ready platform — your developers (and the business) will thank you.

  52. Technical Program Management

    Fri, 27 Sep 2024 16:00:00 -0000

    You’ve probably felt the frustration that arises when a project fails to meet established deadlines. And perhaps you’ve also encountered scenarios where project staff or computing have been reallocated to higher priority projects. It can be super challenging to get projects done on time with this kind of uncertainty. 

    That’s especially true for Site Reliability Engineering (SRE) teams. Project management principles can help, but in IT, many project management frameworks are directed at teams that have a single focus, such as a software-development team. 

    That’s not true for SRE teams at Google. They are charged with delivering infrastructure projects as well as their primary role: supporting production. Broadly speaking, SRE time is divided in half between supporting production environments and focusing on product. 

    A common problem

    In a recent endeavor, our SRE team took on a project to regionalize our infrastructure to enhance the reliability, security, and compliance of our cloud services. The project was allocated a well-defined timeline, driven by our commitments to our customers and adherence to local regulations. As the technical program manager (TPM), I decomposed the overarching goal into smaller milestones and communicated to the leadership team to ensure they remained abreast of the progress.

    However, throughout the execution phase of the project, we encountered a multitude of unrelated production incidents — the Spanner queue was growing long, and the accumulation of messages led to increased compilation times for our developer builds; this in turn led to bad builds rolling out. On top of this, asynchronous tasks were not completing as expected. When the bad build was rolled back, all of the backlogged async tasks fired at once. Due to these unforeseen challenges, some engineers were temporarily reassigned from the regionalization project to handle operational constraints associated with production infrastructure. No surprise, the change in staff allocation towards production incidents resulted in the project work being delayed. 

    Better planning with SRE

    Teams that manage production services, like SRE, have many ways to solve tough problems. The secret is to choose the solution that gets the job done the fastest and with the least amount of red tape for engineers to deal with.

    In our organization, we’ve started taking a proactive approach to problem-solving by incorporating enhanced planning at the project's inception. As a TPM, my biggest trick to ensuring projects are finished on time is keeping some engineering hours in reserve and planning carefully when the project should start.

    How many resources should you hold back, exactly? We did a deep dive into our past production issues and how we've been using our resources. Based on this, when planning SRE projects, we set aside 25% of our time for production work (see the sketch below for the arithmetic). Of course, this 25% buffer will differ across organizations, but this new approach, which takes into account our critical business needs, has been a game-changer for us in making sure our projects are delivered on time, while ensuring that SREs can still focus on production incidents — our top priority for the business.
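    The arithmetic behind this planning approach is simple enough to sketch. The team size, project length, and reserve percentage below are illustrative numbers, not a prescription.

    # Back-of-the-envelope sketch: hold a fraction of engineering time in reserve
    # for production work and plan the project against what remains.

    def plan(team_size: int, weeks: int, hours_per_week: float = 40.0,
             production_reserve: float = 0.25) -> dict:
        total = team_size * weeks * hours_per_week
        reserved = total * production_reserve
        return {
            "total_hours": total,
            "reserved_for_production": reserved,
            "available_for_project": total - reserved,
        }

    print(plan(team_size=6, weeks=12))
    # {'total_hours': 2880.0, 'reserved_for_production': 720.0, 'available_for_project': 2160.0}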

    Key takeaways

    In a nutshell, planning for SRE projects is different from planning for projects in development organizations, because development organizations spend the lion’s share of their time working on projects. Luckily, SRE Program Management is really good at handling complicated situations, especially big programs. 

    Beyond holding back resources, here are few other best practices and structures that TPMs employ when planning SRE projects:

    • Ensuring that critical programs are staffed for success

    • Providing opportunities for TPMs to work across services, cross-pollinating with standardized solutions and avoiding duplication of work

    • Providing more education to Site Reliability Managers and SREs on the value of early TPM engagement, and encouraging services to surface problem statements earlier

    • Leveraging the skills of TPMs to manage external dependencies and interface with other partner organizations such as Engineering, Infrastructure Change Management, and Technical Infrastructure

    • Providing coverage at times of need for services with otherwise low program management demands

    • Enabling consistent performance evaluation and provide opportunities for career development for the TPM community

    The TPM role within SRE is at the heart of fulfilling SRE’s mission: making workflows faster, more reliable, and preparing for the continued growth of Google's infrastructure. As a TPM, you need to ensure that systems and services are carefully planned and deployed, taking into account multiple variables such as price, availability, and scheduling, while always keeping the bigger picture in mind. To learn more about project management for TPMs and related roles, consider enrolling in this course, and check out the following resources:

    1. Program Management Practices

    2. The Evolving SRE Engagement Model

    3. Part III. Practices

  53. AI/ML Customer Engineer, UKI, Google Cloud

    Fri, 30 Aug 2024 16:00:00 -0000

    Who is supposed to manage generative AI applications? While AI-related ownership often lands with data teams, we're seeing requirements specific to generative AI applications that have distinct differences from those of a data and AI team, and at times more similarities with a DevOps team. This blog post explores these similarities and differences, and considers the need for a new ‘GenOps’ team to cater for the unique characteristics of generative AI applications.

    In contrast to data science, which is about creating models from data, generative AI is about creating AI-enabled services from models, and is concerned with the integration of pre-existing data, models and APIs. When viewed this way, generative AI can feel similar to a traditional microservices environment: multiple discrete, decoupled and interoperable services consumed via APIs. And if there are similarities in the landscape, then it is logical that they share common operational requirements. So what practices can we take from the world of microservices and DevOps and bring to the new world of GenOps?

    What are we operationalising? The AI agent vs the microservice

    How do the operational requirements of a generative AI application differ from other applications? With traditional applications, the unit of operationalisation is the microservice: a discrete, functional unit of code, packaged into a container and deployed into a container-native runtime such as Kubernetes. For generative AI applications, the comparative unit is the generative AI agent: also a discrete, functional unit of code defined to handle a specific task, but with some additional constituent components that make it more than ‘just’ a microservice and add in its key differentiating behavior of being non-deterministic in terms of both its processing and its output:

    1. Reasoning loop - The control logic defining what the agent does and how it works. It often includes iterative logic or thought chains to break down an initial task into a series of model-powered steps that work towards the completion of a task. 

    2. Model definitions - One or a set of defined access patterns for communicating with models, readable and usable by the Reasoning Loop

    3. Tool definitions - a set of defined access patterns for other services external to the agent, such as other agents, data access (RAG) flows, and external APIs. These should be shared across agents, exposed through APIs and hence a Tool definition will take the form of a machine-readable standard such as an OpenAPI specification.

    Logical components of a generative AI agent

    The Reasoning Loop is essentially the full scope of a microservice, and the model and Tool definitions are the additional powers that make it into something more. Importantly, although the Reasoning Loop logic is just code and therefore deterministic in nature, it is driven by the responses of non-deterministic AI models, and this non-deterministic nature is what creates the need for Tools: the agent ‘chooses for itself’ which external service should be used to fulfill a task. A fully deterministic microservice has no need for this ‘cookbook’ of Tools to select from: its calls to external services are pre-determined and hard-coded into the Reasoning Loop.
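
    To make this distinction concrete, below is a minimal, illustrative Python sketch of a Reasoning Loop that lets the model choose a Tool at each step. It is not tied to any particular agent framework: the call_model stub, the TOOLS registry and the JSON response format are all hypothetical placeholders.

    import json
    from typing import Callable

    # Hypothetical Tool registry. In practice these would be machine-readable
    # definitions (e.g. OpenAPI specs) resolved to API calls, not local functions.
    TOOLS: dict[str, Callable[[str], str]] = {
        "get_weather": lambda city: f"weather report for {city}",
        "search_orders": lambda q: f"orders matching '{q}'",
    }

    def call_model(prompt: str) -> str:
        # Stand-in for a call to a non-deterministic LLM. It returns canned JSON
        # here so the sketch runs end to end; a real implementation would call a
        # model API and return either {"tool": ..., "input": ...} or {"answer": ...}.
        if "Observation" in prompt:
            return json.dumps({"answer": "It is raining in London."})
        return json.dumps({"tool": "get_weather", "input": "London"})

    def reasoning_loop(task: str, max_steps: int = 5) -> str:
        # Deterministic control logic driven by non-deterministic model output.
        context = f"Task: {task}\nAvailable tools: {list(TOOLS)}"
        for _ in range(max_steps):
            decision = json.loads(call_model(context))
            if "answer" in decision:            # the model decides the task is done
                return decision["answer"]
            tool = TOOLS[decision["tool"]]      # the agent 'chooses for itself'
            observation = tool(decision["input"])
            context += f"\nObservation: {observation}"
        return "Stopped after max_steps without an answer"

    The loop itself is plain, deterministic code; what makes the agent agent-like is that the model's responses decide which Tool is called next.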

    However, there are still many similarities. Just like a microservice, an agent:

    • Is a discrete unit of function that should be shared across multiple apps/users/teams in a multi-tenancy pattern

    • Has a lot of flexibility in development approach: a wide range of software languages are available to use, and any one agent can be built in a different way to another.

    • Has very low inter-dependency from one agent to another: development lifecycles are decoupled with independent CI/CD pipelines for each. The upgrade of one agent should not affect another agent.

    Feature                       | Microservice                                   | Agent
    ------------------------------|------------------------------------------------|---------------------------------------------------
    Output                        | Deterministic                                  | Non-deterministic
    Scope                         | Single unit of discrete deterministic function | Single unit of discrete non-deterministic function
    Latency                       | Lower                                          | Higher
    Cost                          | Lower                                          | Higher
    Transparency / Explainability | High                                           | Low
    Development flexibility       | High                                           | High
    Development inter-dependence  | None                                           | None
    Upgrade inter-dependence      | None                                           | None

    Operational platforms and separation of responsibilities

    Another important difference is service discovery. This is a solved problem in the world of microservices: the impractical burden of each microservice tracking the availability, location and networking details of every service it needs to call was taken out of the microservice itself and handled by packaging microservices into containers and deploying them onto a common platform layer such as Kubernetes and Istio. With generative AI agents, this consolidation onto a standard deployment unit has not yet happened. There is a range of ways to build and deploy a generative AI agent, from code-first DIY approaches through to no-code managed agent-builder environments. I am not against these tools in principle; however, they are creating a more heterogeneous deployment landscape than we have today with microservices applications, and I expect this will create operational complexity in the future.

    To deal with this, at least for now, we need to move away from the Point-to-Point model seen in microservices and adopt a Hub-and-Spoke model, where the discoverability of agents, Tools and models is handled by publishing APIs onto an API Gateway that provides a consistent abstraction layer above this inconsistent landscape.
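
    As a rough sketch of what Hub-and-Spoke discovery could look like from the agent's side, the snippet below asks the gateway for its catalog of published Tools instead of hard-coding endpoints. The gateway URL, the /catalog path and the response shape are assumptions for illustration, not a specific product's API.

    import requests

    GATEWAY_URL = "https://apigateway.example.com"  # hypothetical API Gateway

    def discover_tools() -> dict[str, str]:
        # Hub-and-Spoke discovery: ask the gateway which Tools are published,
        # rather than baking service locations into every agent.
        resp = requests.get(f"{GATEWAY_URL}/catalog", timeout=10)
        resp.raise_for_status()
        # Assumed response shape: [{"name": "...", "endpoint": "..."}, ...]
        return {item["name"]: item["endpoint"] for item in resp.json()}

    def call_tool(tools: dict[str, str], name: str, payload: dict) -> dict:
        # Every call still goes through the gateway's consistent abstraction layer.
        resp = requests.post(tools[name], json=payload, timeout=30)
        resp.raise_for_status()
        return resp.json()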

    This brings the additional benefit of clear separation of responsibilities between the apps and agents built by development teams, and Generative AI specific components such as models and Tools:

    Separating responsibilities with an API Gateway

    All operational platforms should create a clear point of separation between the roles and responsibilities of app and microservice development teams and those of the operational teams. With microservice-based applications, responsibilities are handed over at the point of deployment, and focus switches to non-functional requirements such as reliability, scalability, infrastructure efficiency, networking and security.

    Many of these requirements are still just as important for a generative AI app, and I believe there are some additional considerations specific to generative agents and apps which require specific operational tooling:

    1. Model compliance and approval controls
    There are a lot of models out there. Some are open-source, some are licensed. Some provide intellectual property indemnity, some do not. All have specific and complex usage terms that have large potential ramifications but take time and the right skillset to fully understand.

    It’s not reasonable or appropriate to expect our developers to have the time or knowledge to factor in these considerations during model selection. Instead, an organization should have a separate model review and approval process to determine whether usage terms are acceptable for further use, owned by legal and compliance teams, supported on a technical level by clear, governable and auditable approval/denial processes that cascade down into development environments.
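
    On the technical side, the cascade into development environments can be as simple as an allow-list that client code must pass before a model is used. The sketch below assumes a hypothetical approved_models.json maintained by the legal and compliance review process.

    import json

    def load_approved_models(path: str = "approved_models.json") -> set[str]:
        # Allow-list produced by the model review and approval process
        # (assumed format: a JSON array of approved model identifiers).
        with open(path) as f:
            return set(json.load(f))

    def require_approval(model_name: str, approved: set[str]) -> None:
        # Fail fast in development environments if a model has not been reviewed.
        if model_name not in approved:
            raise PermissionError(
                f"Model '{model_name}' has not passed compliance review"
            )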

    2. Prompt version management
    Prompts need to be optimized for each model. Do we want our app teams focusing on prompt optimization, or on building great apps? Prompts are a non-functional component and should be taken out of the app source code and managed centrally, where they can be optimized, periodically evaluated, and reused across apps and agents.
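
    A minimal sketch of what this could look like from an app's point of view, assuming a hypothetical central prompt store exposed over HTTP; the store URL, path scheme and template syntax are illustrative only.

    import requests

    PROMPT_STORE_URL = "https://promptstore.example.com"  # hypothetical central store

    def get_prompt(name: str, version: str = "latest") -> str:
        # Fetch a versioned prompt template that is managed outside app source code.
        resp = requests.get(f"{PROMPT_STORE_URL}/prompts/{name}/{version}", timeout=10)
        resp.raise_for_status()
        return resp.text

    def hydrate(template: str, **values: str) -> str:
        # 'Hydrate' the template with request-specific values, e.g. {question}.
        return template.format(**values)

    # Usage: the app never embeds the prompt text itself.
    # prompt = hydrate(get_prompt("support-summary", "v3"), question=user_question)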

    3. Model (and prompt) evaluation
    Just as with an MLOps platform, there is a clear need for ongoing assessment of model response quality to enable a data-driven approach to evaluating and selecting the best models for a particular use case. The key difference with generative AI models is that the assessment is inherently more qualitative than the quantitative skew or drift analysis applied to a traditional ML model.

    Subjective, qualitative assessments performed by humans are clearly not scalable, and they introduce inconsistency when performed by multiple people. Instead, we need consistent automated pipelines powered by AI evaluators, which, although imperfect, provide consistency in the assessments and a baseline for comparing models against each other.
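
    Such a pipeline could be sketched roughly as below: an AI evaluator scores candidate responses against a fixed rubric so that assessments stay consistent across models and prompts. The judge_model callable and the rubric are placeholders, not a specific evaluation product.

    from statistics import mean
    from typing import Callable

    RUBRIC = ("Score the response from 1 to 5 for factual accuracy and helpfulness. "
              "Reply with a number only.")

    def evaluate(candidate: Callable[[str], str],
                 judge_model: Callable[[str], str],
                 eval_prompts: list[str]) -> float:
        # Run every eval prompt through the candidate and let the AI evaluator score
        # the output: imperfect, but consistent and repeatable across candidates.
        scores = []
        for prompt in eval_prompts:
            response = candidate(prompt)
            verdict = judge_model(f"{RUBRIC}\n\nPrompt: {prompt}\nResponse: {response}")
            scores.append(float(verdict.strip()))
        return mean(scores)

    # Usage: compare two models, or two prompts on the same model, on one eval set.
    # score_a = evaluate(model_a, judge, eval_prompts)
    # score_b = evaluate(model_b, judge, eval_prompts)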

    4. Model security gateway
    The single most common operational feature I hear large enterprises investing time into is a security proxy for safety checks before passing a prompt on to a model (as well as the reverse: a check against the generated response before passing back to the client).

    Common considerations:

    1. Prompt Injection attacks and other threats captured by OWASP Top 10 for LLMs

    2. Harmful / unethical prompts

    3. Customer PII or other data requiring redaction prior to sending on to the model and other downstream systems

    Some models have built-in security controls; however, this creates inconsistency and increased complexity. Instead, a model-agnostic security endpoint abstracted above all models is required to create consistency and allow for easier model switching.
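
    To illustrate the model-agnostic shape of such an endpoint, the sketch below wraps any model call with pre- and post-checks. The screening functions are stand-ins for whatever injection, toxicity and PII screening an organization uses; they are not references to a specific product.

    from typing import Callable

    class PromptBlocked(Exception):
        pass

    def screen_prompt(prompt: str) -> str:
        # Stand-in for prompt-injection / harmful-content / PII checks; a real
        # implementation would call a dedicated safety service and redact or block.
        if "ignore previous instructions" in prompt.lower():
            raise PromptBlocked("possible prompt injection")
        return prompt

    def screen_response(response: str) -> str:
        # Stand-in for checks on the generated response before returning it to the client.
        return response

    def secure_generate(model: Callable[[str], str], prompt: str) -> str:
        # Model-agnostic gateway: the same checks apply whichever model is behind it,
        # so models can be switched without changing safety behaviour.
        return screen_response(model(screen_prompt(prompt)))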

    5. Centralized Tool management
    Finally, the Tools available to the agent should be abstracted out from the agent to allow for reuse and centralized governance. This is the right separation of responsibilities, especially for data retrieval patterns where access to data needs to be controlled.

    RAG patterns have the potential to become numerous and complex, and in practice they are often not particularly robust or well maintained, with the potential to cause significant technical debt, so central control is important to keep data access patterns as clean and visible as possible.

    Outside of these specific considerations, a prerequisite already discussed is the need for the API Gateway itself to create consistency and abstraction above these Generative AI-specific services. When used to their fullest, API Gateways can act as much more than a simple API endpoint: they can be a coordination and packaging point for a series of interim API calls and logic, security features and usage monitoring.

    For example, a published API for sending a request to a model can be the starting point for a multi-step process (sketched in code after the list):

    • Retrieving and ‘hydrating’ the optimal prompt template for that use case and model

    • Running security checks through the model safety service

    • Sending the request to the model

    • Persisting prompt, response and other information for use in operational processes such as model and prompt evaluation pipelines.
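
    Pulled together, the gateway-side flow described in the list above might look roughly like the following sketch. It reuses the kinds of helpers sketched earlier (a prompt store, a safety check) plus a hypothetical log_interaction sink; none of this represents a specific Apigee configuration.

    from typing import Callable

    def log_interaction(prompt: str, response: str, metadata: dict) -> None:
        # Persist prompt/response pairs for evaluation pipelines (placeholder sink).
        print({"prompt": prompt, "response": response, **metadata})

    def handle_model_request(model: Callable[[str], str],
                             get_prompt: Callable[[str], str],
                             screen: Callable[[str], str],
                             use_case: str,
                             user_input: str) -> str:
        # 1. Retrieve and 'hydrate' the optimal prompt template for this use case
        prompt = get_prompt(use_case).format(user_input=user_input)
        # 2. Run security checks through the model safety service
        prompt = screen(prompt)
        # 3. Send the request to the model
        response = model(prompt)
        # 4. Persist prompt, response and metadata for evaluation pipelines
        log_interaction(prompt, response, {"use_case": use_case})
        return response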

    Key components of a GenOps platform

    Making GenOps a reality with Google Cloud

    For each of the considerations above, Google Cloud provides unique and differentiated managed service offerings to support evaluating, deploying, securing and upgrading Generative AI applications and agents:

    • Model compliance and approval controls - Google Cloud’s Model Garden is the central model library, with over 150 Google first-party, partner, and open-source models, and thousands more available via the direct integration with Hugging Face.
    • Model security - The newly announced Model Armor, expected to be in preview in Q3, enables inspection, routing and protection of foundation model prompts and responses. It can help with mitigating risks such as prompt injections, jailbreaks, toxic content and sensitive data leakage.
    • Prompt version management - Upcoming prompt management capabilities were announced at Google Cloud Next ‘24 that include centralized version controlling, templating, branching and sharing of prompts. We also showcased AI prompt assistance capabilities to critique and automatically re-write prompts.
    • Model (and prompt) evaluation - Google Cloud’s model evaluation services provide automatic evaluations across a wide range of metrics for prompts and responses, enabling extensible evaluation patterns such as comparing the responses of two models for a given input, or the responses to two different prompts for the same model.
    • Centralized Tool management - A comprehensive suite of managed services is available to support Tool creation. A few to call out are the Document AI Layout Parser for intelligent document chunking, the multimodal embeddings API, and Vertex AI Vector Search; I specifically want to highlight Vertex AI Search: a fully managed, end-to-end, out-of-the-box RAG service that handles all the complexities from parsing and chunking documents to creating and storing embeddings.

    As for the API Gateway, Google Cloud’s Apigee allows for publishing and exposing models and Tools as API proxies, which can encompass multiple downstream API calls, as well as conditional logic, retries, and tooling for security, usage monitoring and cross-charging.

    GenOps with Google Cloud

    Regardless of size, any organization that wants to be successful with generative AI will need to ensure its generative AI applications’ unique characteristics and requirements are well managed, and hence an operational platform engineered to cater for these characteristics and requirements is clearly required. I hope the points discussed in this blog make for helpful consideration as we all navigate this exciting and highly impactful new era of technology.

    If you are interested in learning more, reach out to your Google Cloud account team if you have one, or feel free to contact me directly.

  54. Software Engineer, Google

    Mon, 26 Aug 2024 16:00:00 -0000

    The Terraform Google Provider v6.0.0 is now GA. Since the last major Terraform provider release in September 2023, the combined Hashicorp/Google provider team has been listening closely to the community's feedback. Discussed below are the primary enhancements and bug fixes that this major release focuses on. Support for earlier versions of HashiCorp Terraform will not change as a result of the major version release v6.0.0.

    Terraform Google Provider Highlights 

    The key notable changes are as follows: 

    • Opt-out default label “goog-terraform-provisioned”

    • Deletion protection fields added to multiple resources

    • Allowed reducing the suffix length in “name_prefix” for multiple resources

    Opt-out default label “goog-terraform-provisioned”

    As a follow-up to the addition of provider-level default labels in 5.16.0, the 6.0.0 major release includes an opt-out default label “goog-terraform-provisioned”. This provider-level label “goog-terraform-provisioned” will be added to applicable resources to identify resources that were created by Terraform. This default label will only apply for newly created resources with a labels field. This will enable users to have a view of resources managed by Terraform when viewing/editing these resources in other tools like Cloud Console, Cloud Billing etc.

    The label “goog-terraform-provisioned” can be used for the following:

    • To filter on the Billing Reports page

    • To view the Cost breakdown

    Please note that an opt-in version of the label was already released in 5.16.0, and 6.0.0 changes the label to opt-out. To opt out of this default label, users can set the add_terraform_attribution_label provider configuration field to false. This can be set explicitly using any release from 5.16.0 onwards, and the value in configuration will apply after the 6.0.0 upgrade.

    provider "google" {
      // opt out of “goog-terraform-provisioned” default label
      add_terraform_attribution_label = false
    }

    Deletion protection fields added to multiple resources

    In order to prevent the accidental deletion of important resources, many resources now have a form of deletion protection enabled by default. These resources include google_domain, google_cloud_run_v2_job, google_cloud_run_v2_service, google_folder and google_project. Most of these are enabled by the deletion_protection field. google_project specifically has a deletion_policy field which is set to PREVENT by default.

    Allowed reducing the suffix length in “name_prefix”

    Another notable issue resolved in this major release is “Allow reducing the suffix length appended to instance templates name_prefix” (#15374), which changes the default behavior of name_prefix in multiple resources. The maximum length of the user-defined name_prefix has increased from 37 characters to 54. The provider will use a shorter appended suffix when a name_prefix longer than 37 characters is used, which should allow for more flexible resource names. For example, google_instance_template.name_prefix.

    With features like opt-out default labels and deletion protection, this version enables users to have a view of resources managed by Terraform in other tools and also prevents accidental deletion of important resources. The Terraform Google Provider 6.0.0 launch aims to improve the usability and safety of Terraform for managing Google Cloud resources. When upgrading to version 6.0 of the Terraform Google Provider, please consult the upgrade guide on the Terraform Registry, which contains a full list of the changes and upgrade considerations. Please check out the Release notes for Terraform Google Provider 6.0.0 for more details on this major version release. Learn more about Terraform on Google Cloud in the Terraform on Google Cloud documentation.

  55. CCoE Team Tech Lead, Hakuhodo Technologies Inc.

    Mon, 12 Aug 2024 16:00:00 -0000

    Hakuhodo Technologies, a specialized technology company of the Hakuhodo DY Group — one of Japan’s leading advertising and media holding companies — is dedicated to enhancing our software development process to deliver new value and experiences to society and consumers through the integration of marketing and technology. 

    Our IT Infrastructure Team at Hakuhodo Technologies operates cross-functionally, ensuring the stable operation of the public cloud that supports the diverse services within the Hakuhodo DY Group. We also provide expertise and operational support for public cloud initiatives.

    Our value is to excel in the cloud and infrastructure domain, exhibiting a strong sense of ownership, and embracing the challenge of creating new value.

    Background and challenges

    The infrastructure team is tasked with developing and operating the application infrastructure tailored to each internal organization and service, in addition to managing shared infrastructure resources.

    Following the principles of platform engineering and site reliability engineering (SRE), each team within the organization has adopted elements of SRE, including the implementation of post-mortems and the development of observability mechanisms. However, we encountered two primary challenges:

    • As the infrastructure expanded, the number of people on the team grew rapidly, bringing in new members from diverse backgrounds. This made it necessary to clarify and standardize tasks, and provide a collective understanding of our current situation and alignment on our goals.

    • We mainly communicate with the app team through a ticket-based system. In addition to expanding our workforce, we have also introduced remote working. As a result, team members may not be as well-acquainted as before. This lack of familiarity could potentially cause small misunderstandings that can escalate quickly.

    As our systems and organization expand, we believe that strengthening common understanding and cooperative relationships within the infrastructure team and the application team is essential for sustainable business growth. This has become a core element of our strategy.

    We believe that fostering an SRE mindset among both infrastructure and application team members and creating a culture based on that common understanding is essential to solving the issues above. To achieve this, we decided to implement the "SRE Core" program by Google Cloud Consulting, which serves as the first step in adopting SRE practices.

    Change

    First, through the "SRE Core" program, we revitalized communication between the application and infrastructure teams, which had previously had limited interaction. For example, some aspects of the program required information that was challenging for infrastructure members to gather and understand on their own, making cooperation with the application team essential.

    Our critical user journey (CUJ), one of the SRE metrics, was established based on the business requirements of the app and the behavior of actual users. This information is typically managed by the app team, which frequently communicates with the business side. This time, we collaborated with the application team to create a CUJ, set service level indicators (SLIs) and service level objectives (SLOs) which included error budgets, performed risk analysis, and designed the necessary elements for SRE.

    This collaborative work and shared understanding served as a starting point, and we continued to build a closer working relationship even after the program ended, with infrastructure members also participating in sprint meetings that had previously been held only for the app team.

    Additionally, as an infrastructure team, we systematically learned when and why SRE activities are necessary, allowing us to reflect on and strengthen our SRE efforts that had been partially implemented.

    For example, I recently understood that the purpose of postmortems is not only to prevent the recurrence of incidents but also to gain insights from the differences in perspectives between team members. Learning the purpose of postmortems changed our team’s mindset. We now practice immediate improvement activities, such as formalizing the postmortem process, clarifying the creation of tickets for action items, and sharing postmortem minutes with the app team, which were previously kept internal.

    We also reaffirmed the importance of observability to consistently review and improve our current system. Regular meetings between the infrastructure and application teams allow us to jointly check metrics, which in turn helps maintain application performance and prevent potential issues.

    By elevating our previous partial SRE activities and integrating individual initiatives, the infrastructure team created an organizational activity cycle that has earned more trust. This enhanced cycle is now getting integrated into our original operational workflows.

    Future plans

    With the experience gained through the SRE Core program, the infrastructure team looks forward to expanding collaboration with application and business teams and increasing proactive activities. Currently, we are starting with collaborations on select applications, but we aim to use these success stories to broaden similar initiatives across the organization.

    It is important to remember that each app has different team members, business partners, environments, and cultures, so SRE activities must be tailored to each unique situation. We aim to harmonize and apply the content learned in this program with the understanding that SRE activities are not the goal, but are elements that support the goals of the apps and the business.

    Additionally, our company has a Cloud Center of Excellence (CCoE) team dedicated to cross-organizational activities. The CCoE manages a portal site for company-wide information dissemination and a community platform for developers to connect. We plan to share the insights we've gained through these channels with other respective teams within our group companies. As the CCoE's internal activities mature, we also intend to share our knowledge and experiences externally.

    Through these initiatives, we hope to continue our activities so that internal members — beyond the CCoE and infrastructure organizations — take psychological safety into consideration during discussions and actions.

    Supplement: Regarding psychological safety

    At our company, we have a diverse workforce with varying years of experience and perspectives. We believe that ensuring psychological safety is essential for achieving high performance.

    When psychological safety is lacking, for instance, if the person delivering bad news is blamed, reports tend to become superficial and do not lead to substantive discussions.

    This issue can also arise from psychological barriers: for example, tasks known only to experienced employees may be omitted, and problems follow because others are afraid to ask for clarification.

    In a situation where psychological safety is ensured, we focus on systems rather than individuals, viewing problems as opportunities. For example, if errors occur due to manual work, the manual process itself is seen as the issue. Similarly, if a system failure with no prior similar case arises, it is considered an opportunity to gain new knowledge.

    By adopting this mindset, fear is removed from the equation, allowing for unbiased discussions and work.

    This allows every employee to perform at their best, regardless of their years of experience. Of course, this is not something that can be achieved through a single person. It will require a whole team or organization to recognize this to make it a reality.

  56. EMEA Solutions Lead, Application Modernization

    Mon, 05 Aug 2024 16:00:00 -0000

    Continuous Delivery (CD) is a set of practices and principles that enables teams to deliver software quickly and reliably by automating the entire software release process using a pipeline. In this article, we explain how to create a Continuous Delivery pipeline to automate software delivery from code commit to production release on Cloud Run using Gitlab CI/CD and Cloud Deploy, leveraging the recently released Gitlab Google Cloud integration.

    Elements of the solution

    Gitlab CI/CD

    GitLab CI/CD is an integrated continuous integration and delivery platform within GitLab. It automates the build, test, and deployment of your code changes, streamlining your development workflow. For more information check the Gitlab CI/CD documentation.

    Cloud Deploy

    Cloud Deploy is a Google-managed service that you can use to automate how your application is deployed across different stages to a series of runtime environments. With Cloud Deploy, you can define delivery pipelines to deploy container images to GKE and Cloud Run targets in a predetermined sequence. Cloud Deploy supports advanced deployment strategies such as progressive releases, approvals, deployment verification, and parallel deployments.

    Google Cloud Gitlab integration

    Gitlab and Google Cloud recently released integrations to make it easier and more secure to deploy code from Gitlab to Google Cloud. The areas of integration described in this article are:

    • Authentication: The GitLab and Google Cloud integration leverages workload identity federation, enabling secure authorization and authentication for GitLab workloads, such as CI/CD jobs, with Google Cloud. This eliminates the need for managing service accounts or service account keys, streamlining the process and reducing security risks. All the other integration areas described below leverage this authentication mechanism.

    • Artifact Registry: The integration lets you upload GitLab artifacts to Artifact Registry and access them from Gitlab UI.

    • Cloud Deploy: This Gitlab component facilitates the creation of Cloud Deploy releases from Gitlab CI/CD pipelines.

    • Gcloud: This component facilitates running gcloud commands in Gitlab CI/CD pipelines. 

    • Gitlab runners on Google Cloud: The integration lets you configure runner settings from Gitlab UI and have them deployed on your Google Cloud project with Terraform.

    You can access the updated list of Google Cloud Gitlab components here.

    What you’ll need

    To follow the steps in this article you need:

    1. A Gitlab account (Free, Premium or Ultimate)

    2. A Google Cloud project with project owner access

    3. A fork, in your account, of the following Gitlab repository containing the example code: https://gitlab.com/galloro/cd-on-gcp-gl cloned locally to your workstation.

    Pipeline flow

    You can see the pipeline in the .gitlab-ci.yml file in the root of the repo or using the Gitlab Pipeline editor.

    Following the instructions in this article, you will create and execute an end-to-end software delivery pipeline in which:

    1. A developer creates a feature branch from an application repository

    2. The developer makes a change to the code and then opens a merge request to merge the updated code to the main branch

    3. The Gitlab pipeline will run the following jobs, all configured to run when a merge request is opened, through the rule - if: $CI_PIPELINE_SOURCE == 'merge_request_event':

    a. The image-build job, in the build stage, builds a container image with the updated code.

    # Image build for automatic pipeline running on merge request
    image-build:
      image: docker:24.0.5
      stage: build
      services:
        - docker:24.0.5-dind
      rules:
        - if: $CI_PIPELINE_SOURCE == "web"
          when: never
        - if: $CI_PIPELINE_SOURCE == 'merge_request_event'
      before_script:
        - docker login -u $CI_REGISTRY_USER -p $CI_REGISTRY_PASSWORD $CI_REGISTRY
      script:
        - docker build -t $GITLAB_IMAGE cdongcp-app/
        - docker push $GITLAB_IMAGE
        - docker logout

    b. The upload-artifact-registry component, in the push stage, pushes the image to Artifact Registry, leveraging the Google Cloud IAM integration configured previously, as do all the following components. The configuration of this job, like those of the other components described below, is split between the component and the explicit job definition in order to set the rules for job execution.

    # Image push to Artifact Registry for automatic pipeline running on merge request
      - component: gitlab.com/google-gitlab-components/artifact-registry/upload-artifact-registry@0.1.1
        inputs:
          stage: push
          source: $GITLAB_IMAGE
          target: $GOOGLE_AR_REPO/cdongcp-app:$CI_COMMIT_SHORT_SHA

    c. The create-cloud-deploy-release component, in the deploy-to-qa stage, creates a release on Cloud Deploy and a rollout to the QA stage, mapping to the cdongcp-app-qa Cloud Run service, where the QA team will run user acceptance tests.

    # Cloud Deploy release creation for automatic pipeline running on merge request
      - component: gitlab.com/google-gitlab-components/cloud-deploy/create-cloud-deploy-release@0.1.1
        inputs:
          stage: deploy-to-qa
          project_id: $GOOGLE_PROJECT
          name: cdongcp-$CI_COMMIT_SHORT_SHA
          delivery_pipeline: cd-on-gcp-pipeline
          region: $GOOGLE_REGION
          images: cdongcp-app=$GOOGLE_AR_REPO/cdongcp-app:$CI_COMMIT_SHORT_SHA

    4. After the tests are completed, the QA team merges the MR, and this runs the run-gcloud component, in the promote-to-prod stage, which promotes the release to the production stage, mapping to the cdongcp-app-prod Cloud Run service. In this case the job is configured to run on a push to the main branch through the rule - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH:

    # Cloud Deploy release promotion for automatic pipeline running on merge request
      - component: gitlab.com/google-gitlab-components/cloud-sdk/run-gcloud@main
        inputs:
          stage: promote-to-prod
          project_id: $GOOGLE_PROJECT
          commands: |
            MOST_RECENT_RELEASE=$(gcloud deploy releases list --delivery-pipeline cd-on-gcp-pipeline --region $GOOGLE_REGION --format="value(name)" --limit 1)
            gcloud deploy releases promote --delivery-pipeline cd-on-gcp-pipeline --release $MOST_RECENT_RELEASE --region $GOOGLE_REGION

    5. The Cloud Deploy prod target requires approval, so an approval request is triggered. The Product Manager for the application checks the rollout and approves it, and the app is released to production with a canary release; this creates a new revision of the cdongcp-app-prod Cloud Run service and directs 50% of the traffic to it. You can see the Cloud Deploy delivery pipeline and targets configuration below (file cr-delivery-pipeline.yaml in the repo), including the canary strategy and the approval required for prod deployment. The canary strategy is configured at 50% to make the traffic split more visible; in a real production environment this would be a lower number.

    apiVersion: deploy.cloud.google.com/v1
    kind: DeliveryPipeline
    metadata:
      name: cd-on-gcp-pipeline
    description: CD on Cloud Run w Gitlab CI and Cloud Deploy - End to end pipeline
    serialPipeline:
      stages:
        - targetId: qa
          profiles:
            - qa
        - targetId: prod
          profiles:
            - prod
          strategy:
            canary:
              runtimeConfig:
                cloudRun:
                  automaticTrafficControl: true
              canaryDeployment:
                percentages: [50]
                verify: false
    ---
    apiVersion: deploy.cloud.google.com/v1
    kind: Target
    metadata:
      name: prod
    description: Prod Cloud Run Service
    requireApproval: true
    run:
      location: projects/yourproject/locations/yourregion
    ---
    apiVersion: deploy.cloud.google.com/v1
    kind: Target
    metadata:
      name: qa
    description: QA Cloud Run Service
    run:
      location: projects/yourproject/locations/yourregion

    6. After checking the canary release, the App Release team advances the rollout to 100%.

    You can play all the roles described above (developer, member of the QA team, member of the App release team, Product Manager) using a single Gitlab account and project/repository. In a real world scenario multiple accounts would be used.

    The picture below describes the pipeline flow:

    In addition to the jobs and stages described above, the .gitlab-ci.yml pipeline contains other instances of similar jobs, in the first-release stage, that are configured, through rules, to run only if the pipeline is executed manually using the “Run pipeline” button in Gitlab web UI. You will do that to manually create the first release before running the above described flow.

    Prepare your environment

    To prepare your environment to run the pipeline, complete the following tasks: 

    1. Create an Artifact Registry standard repository for Docker images in your Google Cloud project and desired region.

    2. Run setup.sh from the setup folder in your local repo clone and follow the prompt to insert your Google Cloud project, Cloud Run and Cloud Deploy region and Artifact Registry repository. Then commit changes to the .gitlab-ci.yml and setup/cr-delivery-pipeline.yaml files and push them to your fork. 

    3. Still in the setup folder, create a Cloud Deploy delivery pipeline using the manifest provided (replace yourregion and yourproject with your values):

    gcloud deploy apply --file=cr-delivery-pipeline.yaml --region=yourregion --project=yourproject

    This creates a pipeline that has two stages, qa and prod, each using a profile with the same name, and two targets mapping two Cloud Run services to the pipeline stages.

    4. Follow the Gitlab documentation to set up Google Cloud workload identity federation and the workload identity pool that will be used to authenticate Gitlab to Google Cloud services.

    5. Follow the Gitlab documentation to set up Google Artifact Registry integration. After that you will be able to access the Google AR Repository from Gitlab UI through the Google Artifact Registry entry in the sidebar under Deploy.

    6. (Optional) Follow the Gitlab documentation to set up runners in Google Cloud. If you’re using Gitlab.com, you can also keep the default configuration that uses Gitlab-hosted runners, but with Google Cloud runners you can customize parameters such as the machine type and autoscaling.

    7. Set up permissions for Gitlab Google Cloud components as described in the related README for each component. To run the jobs in this pipeline, the Gitlab workload identity pool must have the following minimum roles in Google Cloud IAM:

      • roles/artifactregistry.reader

      • roles/artifactregistry.writer

      • roles/clouddeploy.approver

      • roles/clouddeploy.releaser

      • roles/iam.serviceAccountUser

      • roles/run.admin

      • roles/storage.admin

    8. Manually run the pipeline from the Gitlab web UI with Build -> Pipelines -> Run pipeline to create the first release and the two Cloud Run services for QA and production. This runs all the jobs that are part of the first-release stage; wait for the pipeline execution to complete before moving to the next steps.

    9. From the Google Cloud console, get the URL of the cdongcp-app-qa and cdongcp-app-prod Cloud Run services and open them with a web browser to check that the application has been deployed.

    Run your pipeline

    Update your code as a developer

    1. Be sure to move to the root of the repository clone, then create a new branch named “new-feature” and check it out:

    git checkout -b new-feature

    2. Update your code: open the app.go file in the cdongcp-app folder and change the message on line 25 to “cd-on-gcp app UPDATED in target: …”

    3. Commit and push your changes to the “new-feature” branch.

    git add cdongcp-app/app.go
    git commit -m "new feature"
    git push origin new-feature

    4. Now open a merge request to merge your code: copy the URL from the terminal output, paste it into your browser, and on the Gitlab page click the “Create merge request” button. You will see a pipeline starting.

    Run automatic build of your artifact

    1. In Gitlab, go to Build > Pipelines and click on the last pipeline execution ID; you should see three stages, each including one job.

    2. Wait for the pipeline to complete; you can click on each job to see the execution log. The last job should create the cdongcp-$COMMIT_SHA release (where $COMMIT_SHA is the short SHA of your commit) and roll it out to the QA stage.

    3. Open or refresh the cdongcp-app-qa URL with your browser; you should see the updated application deployed in the QA stage.

    4. In a real-world scenario, the QA team performs some usability tests in this environment. Let’s assume that these have been completed successfully and that you, as a member of the QA team this time, want to merge the changed code to the main branch: go to the merge request Gitlab page and click “Merge”.

    Approve and rollout your release to production

    1. A new pipeline will run containing only one job from the run-gcloud component. You can see the execution in the Gitlab pipeline list.

    2. When the pipeline is completed, your release will be promoted to the prod stage, waiting for approval, as you can see on the Cloud Deploy page in the console.

    3. Now, acting as the product manager for the application that has to approve the deployment in production, click on Review; you will see a rollout that needs approval. Click on REVIEW again.

    4. In the “Approve rollout to prod” page, click on the “APPROVE” button to finally approve the promotion to the prod stage. The rollout to the canary phase of the prod stage will start, and after some time the rollout will stabilize in the canary phase.

    5. Let’s observe how traffic is managed in this phase: generate some requests to the cdongcp-app-prod service URL with the following command (replace cdongcp-app-prod-url with your service URL):

    while true; do curl cdongcp-app-prod-url; sleep 1; done

    6. After some time you should see responses both from your previous release and the new (canary) one.

    7. Now let’s pretend that the App Release team gets metrics and other observability data from the canary. When they are sure that the application is performing correctly, they want to deploy the application to all their users. As a member of the App Release team, go to the Cloud Deploy console, click “Advance to stable” and then “ADVANCE” on the confirmation pop-up; the rollout should progress to stable. When the progress stabilizes, you will see in the curl output that all the requests are served by the updated version of the application.

    Summary

    You saw an example Gitlab CI/CD pipeline that leverages the recently released Google Cloud - Gitlab integration to:

    • Configure Gitlab authentication to Google Cloud using workload identity federation

    • Integrate Gitlab with Artifact Registry

    • Use Gitlab CI/CD and Cloud Deploy to automatically build your software and deploy it to a QA Cloud Run service when a merge request is created

    • Automatically promote your software to a prod Cloud Run service when the merge request is merged to the main branch

    • Use approvals in Cloud Deploy

    • Leverage canary release in Cloud Deploy to progressively release your application to users

    Now you can reference this article and the documentation on Gitlab CI/CD, the Google Cloud - Gitlab integration, Cloud Deploy and Cloud Run to configure your end-to-end pipeline leveraging Gitlab and Google Cloud!

  57. Intelligent Continuous Security From the Platform Outward

    Wed, 30 Apr 2025 14:15:56 -0000

    In the end, ICS is not a tool — it’s a philosophy of secure software delivery. When it begins with the platform, everything else aligns: Speed, safety and scale.
  58. LaunchDarkly Acquires Highlight to Bring Observability to Application Release Management

    Wed, 30 Apr 2025 11:20:25 -0000

    LaunchDarkly is looking to bring observability to feature flag management by acquiring Highlight, a provider of an open-source application monitoring tool.
  59. Legit Security Extends AI Reach of ASPM Platform

    Tue, 29 Apr 2025 13:00:49 -0000

    Legit Security at the 2025 RSA Conference today extended the reach of its application security posture management (ASPM) platform that leverages artificial intelligence (AI) to identify vulnerabilities and other weaknesses to now include suggestions for remediating issues in code.
  60. Lineaje Leverages AI Agents to Secure Open Source Packages and Images

    Tue, 29 Apr 2025 10:58:01 -0000

    Lineaje has added artificial intelligence (AI) agents that leverage multiple types of code scanners to ensure the open-source software packages and artifacts being used by application developers are truly secure.
  61. Next-Generation Observability: Combining OpenTelemetry and AI for Proactive Incident Management

    Tue, 29 Apr 2025 04:25:02 -0000

    OpenTelemetry and AI integration change the nature of observability for organizations in their quest to manage distributed systems.
  62. Minimus Unfurls Service for Accessing Secure Software Artifacts

    Mon, 28 Apr 2025 14:26:28 -0000

    Minimus today at the 2025 RSA Conference launched a managed service through which it ensures application development teams are provided access to a secure set of minimal container images and virtual machines. Company CTO John Morello said the Minimus service eliminates the possibility that developers might inadvertently download software artifacts that might be infested with […]
  63. Data, Determinism, and AI in Mass-Scale Code Modernization

    Mon, 28 Apr 2025 09:42:52 -0000

    Discussing the code data problem and what’s needed to enable agentic experiences that can drive code analysis and refactoring at scale.
  64. From Testing Hell to Quality Heaven With Intelligent Continuous Testing

    Mon, 28 Apr 2025 09:18:18 -0000

    Intelligent Continuous Testing is not just the next step in automation — it’s the missing link between speed and quality in modern software delivery. If your team is stuck in manual testing purgatory, it’s time to reimagine testing as a smart, adaptive, and always-on partner in your journey to excellence.
  65. Five Great DevOps Job Opportunities

    Mon, 28 Apr 2025 06:48:36 -0000

    DevOps.com is now providing a weekly DevOps jobs report through which opportunities for DevOps professionals will be highlighted as part of an effort to better serve our audience. Our goal in these challenging economic times is to make it just that much easier for DevOps professionals to advance their careers. Of course, the pool of […]
  66. Break the Bottleneck of API Sprawl With AI-Powered Automation

    Fri, 25 Apr 2025 15:06:21 -0000

    The race to accelerate digital transformation across business units within an organization has led to a rapid surge in APIs, though unfortunately, without a central strategy in place.
  67. Music AI Sandbox, now with new features and broader access

    Thu, 24 Apr 2025 15:01:00 -0000

    Helping music professionals explore the potential of generative AI
  68. Introducing Gemini 2.5 Flash

    Thu, 17 Apr 2025 19:02:00 -0000

    Gemini 2.5 Flash is our first fully hybrid reasoning model, giving developers the ability to turn thinking on or off.
  69. Generate videos in Gemini and Whisk with Veo 2

    Tue, 15 Apr 2025 17:00:00 -0000

    Transform text-based prompts into high-resolution eight-second videos in Gemini Advanced and use Whisk Animate to turn images into eight-second animated clips.
  70. DolphinGemma: How Google AI is helping decode dolphin communication

    Mon, 14 Apr 2025 17:00:00 -0000

    DolphinGemma, a large language model developed by Google, is helping scientists study how dolphins communicate — and hopefully find out what they're saying, too.
  71. Taking a responsible path to AGI

    Wed, 02 Apr 2025 13:31:00 -0000

    We’re exploring the frontiers of AGI, prioritizing technical safety, proactive risk assessment, and collaboration with the AI community.
  72. Evaluating potential cybersecurity threats of advanced AI

    Wed, 02 Apr 2025 13:30:00 -0000

    Our framework enables cybersecurity experts to identify which defenses are necessary—and how to prioritize them
  73. Gemini 2.5: Our most intelligent AI model

    Tue, 25 Mar 2025 17:00:36 -0000

    Gemini 2.5 is our most intelligent AI model, now with thinking built in.
  74. Gemini Robotics brings AI into the physical world

    Wed, 12 Mar 2025 15:00:00 -0000

    Introducing Gemini Robotics and Gemini Robotics-ER, AI models designed for robots to understand, act and react to the physical world.
  75. Experiment with Gemini 2.0 Flash native image generation

    Wed, 12 Mar 2025 14:58:00 -0000

    Native image output is available in Gemini 2.0 Flash for developers to experiment with in Google AI Studio and the Gemini API.
  76. Introducing Gemma 3

    Wed, 12 Mar 2025 08:00:00 -0000

    The most capable model you can run on a single GPU or TPU.
  77. Start building with Gemini 2.0 Flash and Flash-Lite

    Tue, 25 Feb 2025 18:02:12 -0000

    Gemini 2.0 Flash-Lite is now generally available in the Gemini API for production use in Google AI Studio and for enterprise customers on Vertex AI
  78. Gemini 2.0 is now available to everyone

    Wed, 05 Feb 2025 16:00:00 -0000

    We’re announcing new updates to Gemini 2.0 Flash, plus introducing Gemini 2.0 Flash-Lite and Gemini 2.0 Pro Experimental.
  79. Updating the Frontier Safety Framework

    Tue, 04 Feb 2025 16:41:00 -0000

    Our next iteration of the FSF sets out stronger security protocols on the path to AGI
  80. FACTS Grounding: A new benchmark for evaluating the factuality of large language models

    Tue, 17 Dec 2024 15:29:00 -0000

    Our comprehensive benchmark and online leaderboard offer a much-needed measure of how accurately LLMs ground their responses in provided source material and avoid hallucinations
  81. State-of-the-art video and image generation with Veo 2 and Imagen 3

    Mon, 16 Dec 2024 17:01:16 -0000

    We’re rolling out a new, state-of-the-art video model, Veo 2, and updates to Imagen 3. Plus, check out our new experiment, Whisk.
  82. Introducing Gemini 2.0: our new AI model for the agentic era

    Wed, 11 Dec 2024 15:30:40 -0000

    Today, we’re announcing Gemini 2.0, our most capable multimodal AI model yet.
  83. Google DeepMind at NeurIPS 2024

    Thu, 05 Dec 2024 17:45:00 -0000

    Advancing adaptive AI agents, empowering 3D scene creation, and innovating LLM training for a smarter, safer future
  84. GenCast predicts weather and the risks of extreme conditions with state-of-the-art accuracy

    Wed, 04 Dec 2024 15:59:00 -0000

    New AI model advances the prediction of weather uncertainties and risks, delivering faster, more accurate forecasts up to 15 days ahead
  85. Genie 2: A large-scale foundation world model

    Wed, 04 Dec 2024 14:23:00 -0000

    Generating unlimited diverse training environments for future general agents
  86. AlphaQubit tackles one of quantum computing’s biggest challenges

    Wed, 20 Nov 2024 18:00:00 -0000

    Our new AI system accurately identifies errors inside quantum computers, helping to make this new technology more reliable.
  87. The AI for Science Forum: A new era of discovery

    Mon, 18 Nov 2024 19:57:43 -0000

    The AI Science Forum highlights AI's present and potential role in revolutionizing scientific discovery and solving global challenges, emphasizing collaboration between the scientific community, policymakers, and industry leaders.
  88. Pushing the frontiers of audio generation

    Wed, 30 Oct 2024 15:00:00 -0000

    Our pioneering speech generation technologies are helping people around the world interact with more natural, conversational and intuitive digital assistants and AI tools.
  89. New generative AI tools open the doors of music creation

    Wed, 23 Oct 2024 16:53:00 -0000

    Our latest AI music technologies are now available in MusicFX DJ, Music AI Sandbox and YouTube Shorts
  90. Demis Hassabis & John Jumper awarded Nobel Prize in Chemistry

    Wed, 09 Oct 2024 11:45:00 -0000

    The award recognizes their work developing AlphaFold, a groundbreaking AI system that predicts the 3D structure of proteins from their amino acid sequences.
  91. How AlphaChip transformed computer chip design

    Thu, 26 Sep 2024 14:08:00 -0000

    Our AI method has accelerated and optimized chip design, and its superhuman chip layouts are used in hardware around the world.
  92. Updated production-ready Gemini models, reduced 1.5 Pro pricing, increased rate limits, and more

    Tue, 24 Sep 2024 16:03:03 -0000

    We’re releasing two updated production-ready Gemini models
  93. Empowering YouTube creators with generative AI

    Wed, 18 Sep 2024 14:30:06 -0000

    New video generation technology in YouTube Shorts will help millions of people realize their creative vision
  94. Our latest advances in robot dexterity

    Thu, 12 Sep 2024 14:00:00 -0000

    Two new AI systems, ALOHA Unleashed and DemoStart, help robots learn to perform complex tasks that require dexterous movement
  95. AlphaProteo generates novel proteins for biology and health research

    Thu, 05 Sep 2024 15:00:00 -0000

    New AI system designs proteins that successfully bind to target molecules, with potential for advancing drug design, disease understanding and more.
  96. FermiNet: Quantum physics and chemistry from first principles

    Thu, 22 Aug 2024 19:00:00 -0000

    Using deep learning to solve fundamental problems in computational quantum chemistry and explore how matter interacts with light
  97. Mapping the misuse of generative AI

    Fri, 02 Aug 2024 10:50:58 -0000

    New research analyzes the misuse of multimodal generative AI today, in order to help build safer and more responsible technologies.
  98. Gemma Scope: helping the safety community shed light on the inner workings of language models

    Wed, 31 Jul 2024 15:59:19 -0000

    Announcing a comprehensive, open suite of sparse autoencoders for language model interpretability.
  99. AI achieves silver-medal standard solving International Mathematical Olympiad problems

    Thu, 25 Jul 2024 15:29:00 -0000

    Breakthrough models AlphaProof and AlphaGeometry 2 solve advanced reasoning problems in mathematics
  100. Google DeepMind at ICML 2024

    Fri, 19 Jul 2024 10:00:00 -0000

    Exploring AGI, the challenges of scaling and the future of multimodal generative AI
  101. Generating audio for video

    Mon, 17 Jun 2024 16:00:00 -0000

    Video-to-audio research uses video pixels and text prompts to generate rich soundtracks
  102. Looking ahead to the AI Seoul Summit

    Mon, 20 May 2024 07:00:00 -0000

    How summits in Seoul, France and beyond can galvanize international cooperation on frontier AI safety
  103. Introducing the Frontier Safety Framework

    Fri, 17 May 2024 14:00:00 -0000

    Our approach to analyzing and mitigating future risks posed by advanced AI models
  104. Gemini breaks new ground: a faster model, longer context and AI agents

    Tue, 14 May 2024 17:58:00 -0000

    We’re introducing a series of updates across the Gemini family of models, including the new 1.5 Flash, our lightweight model for speed and efficiency, and Project Astra, our vision for the future of AI assistants.
  105. New generative media models and tools, built with and for creators

    Tue, 14 May 2024 17:57:00 -0000

    We’re introducing Veo, our most capable model for generating high-definition video, and Imagen 3, our highest quality text-to-image model. We’re also sharing new demo recordings created with our Music AI Sandbox.
  106. Watermarking AI-generated text and video with SynthID

    Tue, 14 May 2024 17:56:00 -0000

    Announcing our novel watermarking method for AI-generated text and video, and how we’re bringing SynthID to key Google products
  107. AlphaFold 3 predicts the structure and interactions of all of life’s molecules

    Wed, 08 May 2024 16:00:00 -0000

    Introducing a new AI model developed by Google DeepMind and Isomorphic Labs.
  108. Google DeepMind at ICLR 2024

    Fri, 03 May 2024 13:39:00 -0000

    Developing next-gen AI agents, exploring new modalities, and pioneering foundational learning
  109. The ethics of advanced AI assistants

    Fri, 19 Apr 2024 10:00:00 -0000

    Exploring the promise and risks of a future with more capable AI
  110. TacticAI: an AI assistant for football tactics

    Tue, 19 Mar 2024 16:03:00 -0000

    As part of our multi-year collaboration with Liverpool FC, we develop a full AI system that can advise coaches on corner kicks
  111. A generalist AI agent for 3D virtual environments

    Wed, 13 Mar 2024 14:00:00 -0000

    Introducing SIMA, a Scalable Instructable Multiworld Agent
  112. Gemma: Introducing new state-of-the-art open models

    Wed, 21 Feb 2024 13:06:00 -0000

    Gemma is built for responsible AI development from the same research and technology used to create Gemini models.
  113. Our next-generation model: Gemini 1.5

    Thu, 15 Feb 2024 15:00:00 -0000

    The model delivers dramatically enhanced performance, with a breakthrough in long-context understanding across modalities.
  114. The next chapter of our Gemini era

    Thu, 08 Feb 2024 13:00:00 -0000

    We're bringing Gemini to more Google products
  115. AlphaGeometry: An Olympiad-level AI system for geometry

    Wed, 17 Jan 2024 16:00:00 -0000

    Advancing AI reasoning in mathematics
  116. Shaping the future of advanced robotics

    Thu, 04 Jan 2024 11:39:00 -0000

    Introducing AutoRT, SARA-RT, and RT-Trajectory
  117. Images altered to trick machine vision can influence humans too

    Tue, 02 Jan 2024 16:00:00 -0000

    In a series of experiments published in Nature Communications, we found evidence that human judgments are indeed systematically influenced by adversarial perturbations.
  118. 2023: A Year of Groundbreaking Advances in AI and Computing

    Fri, 22 Dec 2023 13:30:00 -0000

    This has been a year of incredible progress in the field of Artificial Intelligence (AI) research and its practical applications.
  119. FunSearch: Making new discoveries in mathematical sciences using Large Language Models

    Thu, 14 Dec 2023 16:00:00 -0000

    In a paper published in Nature, we introduce FunSearch, a method for searching for “functions” written in computer code, and find new solutions in mathematics and computer science. FunSearch works by pairing a pre-trained LLM, whose goal is to propose creative solutions in the form of computer code, with an automated “evaluator” that guards against hallucinations and incorrect ideas.
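    The pairing described above can be made concrete with a toy loop: an LLM proposes candidate functions as source code, an automated evaluator scores each candidate, and only candidates that score well survive. The sketch below is a minimal illustration of that propose-and-evaluate pattern, not DeepMind's FunSearch code; `llm_propose` is a hypothetical placeholder for a real LLM call, and the scoring task (using the candidate as a sort key) is invented purely for demonstration.

```python
import random

def evaluate(program_src: str) -> float:
    """Automated evaluator: executes a candidate program and scores it on a
    toy task. Candidates that crash or are malformed score -inf, which is how
    the evaluator guards against hallucinated or incorrect code."""
    namespace = {}
    try:
        exec(program_src, namespace)
        priority = namespace["priority"]
        data = random.sample(range(100), 20)
        ranked = sorted(data, key=priority)
        # Toy score: fraction of adjacent pairs left in non-decreasing order.
        return sum(a <= b for a, b in zip(ranked, ranked[1:])) / (len(ranked) - 1)
    except Exception:
        return float("-inf")

def llm_propose(pool: list[str]) -> str:
    """Hypothetical stand-in for the pre-trained LLM, which in FunSearch would
    be prompted with the best programs found so far and asked for a creative
    variation. Here it simply returns a fixed, better candidate."""
    return "def priority(x):\n    return x"

def funsearch_loop(iterations: int = 10) -> str:
    pool = ["def priority(x):\n    return 0"]  # seed program
    for _ in range(iterations):
        candidate = llm_propose(pool)
        # Keep only candidates that match or beat the best score seen so far.
        if evaluate(candidate) >= max(evaluate(p) for p in pool):
            pool.append(candidate)
    return max(pool, key=evaluate)

print(funsearch_loop())
```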
  120. Google DeepMind at NeurIPS 2023

    Fri, 08 Dec 2023 15:01:00 -0000

    The Neural Information Processing Systems conference (NeurIPS) is the largest artificial intelligence (AI) conference in the world. NeurIPS 2023 will take place December 10-16 in New Orleans, USA. Teams from across Google DeepMind are presenting more than 150 papers at the main conference and workshops.
  121. Introducing Gemini: our largest and most capable AI model

    Wed, 06 Dec 2023 15:13:00 -0000

    Making AI more helpful for everyone
  122. Millions of new materials discovered with deep learning

    Wed, 29 Nov 2023 16:04:00 -0000

    We share the discovery of 2.2 million new crystals – equivalent to nearly 800 years’ worth of knowledge. We introduce Graph Networks for Materials Exploration (GNoME), our new deep learning tool that dramatically increases the speed and efficiency of discovery by predicting the stability of new materials.
  123. Transforming the future of music creation

    Thu, 16 Nov 2023 07:20:00 -0000

    Announcing our most advanced music generation model and two new AI experiments, designed to open a new playground for creativity
  124. Empowering the next generation for an AI-enabled world

    Wed, 15 Nov 2023 10:00:00 -0000

    Experience AI's course and resources are expanding on a global scale
  125. GraphCast: AI model for faster and more accurate global weather forecasting

    Tue, 14 Nov 2023 15:00:00 -0000

    We introduce GraphCast, a state-of-the-art AI model able to make medium-range weather forecasts with unprecedented accuracy
  126. A glimpse of the next generation of AlphaFold

    Tue, 31 Oct 2023 13:00:00 -0000

    Progress update: Our latest AlphaFold model shows significantly improved accuracy and expands coverage beyond proteins to other biological molecules, including ligands.
  127. Evaluating social and ethical risks from generative AI

    Thu, 19 Oct 2023 15:00:00 -0000

    Introducing a context-based framework for comprehensively evaluating the social and ethical risks of AI systems
  128. Scaling up learning across many different robot types

    Tue, 03 Oct 2023 15:00:00 -0000

    Robots are great specialists, but poor generalists. Typically, you have to train a model for each task, robot, and environment. Changing a single variable often requires starting from scratch. But what if we could combine the knowledge across robotics and create a way to train a general-purpose robot?
  129. A catalogue of genetic mutations to help pinpoint the cause of diseases

    Tue, 19 Sep 2023 13:37:00 -0000

    New AI tool classifies the effects of 71 million ‘missense’ mutations.
  130. Identifying AI-generated images with SynthID

    Tue, 29 Aug 2023 00:00:00 -0000

    New tool helps watermark and identify synthetic images created by Imagen
  131. RT-2: New model translates vision and language into action

    Fri, 28 Jul 2023 00:00:00 -0000

    Robotic Transformer 2 (RT-2) is a novel vision-language-action (VLA) model that learns from both web and robotics data, and translates this knowledge into generalised instructions for robotic control.
  132. Using AI to fight climate change

    Fri, 21 Jul 2023 00:00:00 -0000

    AI is a powerful technology that will transform our future, so how can we best apply it to help combat climate change and find sustainable solutions?
  133. Google DeepMind’s latest research at ICML 2023

    Thu, 20 Jul 2023 00:00:00 -0000

    Exploring AI safety, adaptability, and efficiency for the real world
  134. Developing reliable AI tools for healthcare

    Mon, 17 Jul 2023 00:00:00 -0000

    We’ve published our joint paper with Google Research in Nature Medicine, which proposes CoDoC (Complementarity-driven Deferral-to-Clinical Workflow), an AI system that learns when to rely on predictive AI tools or defer to a clinician for the most accurate interpretation of medical images.
  135. Exploring institutions for global AI governance

    Tue, 11 Jul 2023 00:00:00 -0000

    New white paper investigates models and functions of international institutions that could help manage opportunities and mitigate risks of advanced AI.
  136. RoboCat: A self-improving robotic agent

    Tue, 20 Jun 2023 00:00:00 -0000

    Robots are quickly becoming part of our everyday lives, but they’re often only programmed to perform specific tasks well. While harnessing recent advances in AI could lead to robots that help in many more ways, progress in building general-purpose robots is slower, in part because of the time needed to collect real-world training data. Our latest paper introduces a self-improving AI agent for robotics, RoboCat, that learns to perform a variety of tasks across different arms, and then self-generates new training data to improve its technique.
  137. YouTube: Enhancing the user experience

    Fri, 16 Jun 2023 14:55:00 -0000

    It’s all about using our technology and research to help enrich people’s lives. Like YouTube — and its mission to give everyone a voice and show them the world.
  138. Google Cloud: Driving digital transformation

    Wed, 14 Jun 2023 14:51:00 -0000

    Google Cloud empowers organizations to digitally transform themselves into smarter businesses. It offers cloud computing, data analytics, and the latest artificial intelligence (AI) and machine learning tools.
  139. MuZero, AlphaZero, and AlphaDev: Optimizing computer systems

    Mon, 12 Jun 2023 14:41:00 -0000

    How MuZero, AlphaZero, and AlphaDev are optimizing the computing ecosystem that powers our world of devices.
  140. AlphaDev discovers faster sorting algorithms

    Wed, 07 Jun 2023 00:00:00 -0000

    New algorithms will transform the foundations of computing
  141. An early warning system for novel AI risks

    Thu, 25 May 2023 00:00:00 -0000

    New research proposes a framework for evaluating general-purpose models against novel threats
  142. DeepMind’s latest research at ICLR 2023

    Thu, 27 Apr 2023 00:00:00 -0000

    Next week marks the start of the 11th International Conference on Learning Representations (ICLR), taking place 1-5 May in Kigali, Rwanda. This will be the first major artificial intelligence (AI) conference to be hosted in Africa and the first in-person event since the start of the pandemic. Researchers from around the world will gather to share their cutting-edge work in deep learning spanning the fields of AI, statistics and data science, and applications including machine vision, gaming and robotics. We’re proud to support the conference as a Diamond sponsor and DEI champion.
  143. How can we build human values into AI?

    Mon, 24 Apr 2023 00:00:00 -0000

    Drawing from philosophy to identify fair principles for ethical AI...
  144. Announcing Google DeepMind

    Thu, 20 Apr 2023 00:00:00 -0000

    DeepMind and the Brain team from Google Research will join forces to accelerate progress towards a world in which AI helps solve the biggest challenges facing humanity.
  145. Competitive programming with AlphaCode

    Thu, 08 Dec 2022 00:00:00 -0000

    Solving novel problems and setting a new milestone in competitive programming.
  146. AI for the board game Diplomacy

    Tue, 06 Dec 2022 00:00:00 -0000

    Successful communication and cooperation have been crucial for helping societies advance throughout history. The closed environments of board games can serve as a sandbox for modelling and investigating interaction and communication – and we can learn a lot from playing them. In our recent paper, published today in Nature Communications, we show how artificial agents can use communication to better cooperate in the board game Diplomacy, a vibrant domain in artificial intelligence (AI) research, known for its focus on alliance building.
  147. Mastering Stratego, the classic game of imperfect information

    Thu, 01 Dec 2022 00:00:00 -0000

    Game-playing artificial intelligence (AI) systems have advanced to a new frontier.
  148. DeepMind’s latest research at NeurIPS 2022

    Fri, 25 Nov 2022 00:00:00 -0000

    NeurIPS is the world’s largest conference in artificial intelligence (AI) and machine learning (ML), and we’re proud to support the event as Diamond sponsors, helping foster the exchange of research advances in the AI and ML community. Teams from across DeepMind are presenting 47 papers, including 35 external collaborations in virtual panels and poster sessions.
  149. Building interactive agents in video game worlds

    Wed, 23 Nov 2022 00:00:00 -0000

    Most artificial intelligence (AI) researchers now believe that writing computer code which can capture the nuances of situated interactions is impossible. Instead, modern machine learning (ML) researchers have focused on learning about these types of interactions from data. To explore these learning-based approaches and quickly build agents that can make sense of human instructions and safely perform actions in open-ended conditions, we created a research framework within a video game environment. Today, we’re publishing a paper and a collection of videos showing our early steps in building video game AIs that can understand fuzzy human concepts – and therefore can begin to interact with people on their own terms.
  150. Benchmarking the next generation of never-ending learners

    Tue, 22 Nov 2022 00:00:00 -0000

    Learning how to build upon knowledge by tapping 30 years of computer vision research
  151. Best practices for data enrichment

    Wed, 16 Nov 2022 00:00:00 -0000

    Building a responsible approach to data collection with the Partnership on AI...
  152. Stopping malaria in its tracks

    Thu, 13 Oct 2022 15:00:00 -0000

    Developing a vaccine that could save hundreds of thousands of lives
  153. Measuring perception in AI models

    Wed, 12 Oct 2022 00:00:00 -0000

    Perception – the process of experiencing the world through senses – is a significant part of intelligence. And building agents with human-level perceptual understanding of the world is a central but challenging task, which is becoming increasingly important in robotics, self-driving cars, personal assistants, medical imaging, and more. So today, we’re introducing the Perception Test, a multimodal benchmark using real-world videos to help evaluate the perception capabilities of a model.
  154. How undesired goals can arise with correct rewards

    Fri, 07 Oct 2022 00:00:00 -0000

    As we build increasingly advanced artificial intelligence (AI) systems, we want to make sure they don’t pursue undesired goals. Such behaviour in an AI agent is often the result of specification gaming – exploiting a poor choice of what the agent is rewarded for. In our latest paper, we explore a more subtle mechanism by which AI systems may unintentionally learn to pursue undesired goals: goal misgeneralisation (GMG). GMG occurs when a system's capabilities generalise successfully but its goal does not generalise as desired, so the system competently pursues the wrong goal. Crucially, in contrast to specification gaming, GMG can occur even when the AI system is trained with a correct specification.
  155. Discovering novel algorithms with AlphaTensor

    Wed, 05 Oct 2022 00:00:00 -0000

    In our paper, published today in Nature, we introduce AlphaTensor, the first artificial intelligence (AI) system for discovering novel, efficient, and provably correct algorithms for fundamental tasks such as matrix multiplication. This sheds light on a 50-year-old open question in mathematics about finding the fastest way to multiply two matrices. This paper is a stepping stone in DeepMind’s mission to advance science and unlock the most fundamental problems using AI. AlphaTensor builds upon AlphaZero, an agent that has shown superhuman performance on board games such as chess, Go and shogi, and this work marks AlphaZero’s first journey from playing games to tackling unsolved mathematical problems.
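    To make the search target concrete, the sketch below shows what a “faster” matrix multiplication algorithm looks like at the smallest scale: Strassen's classic 1969 scheme multiplies two 2x2 matrices with 7 scalar multiplications instead of the naive 8, and its correctness can be checked numerically. This is only an illustration of the kind of provably correct decomposition AlphaTensor searches for; it is not one of AlphaTensor's newly discovered algorithms.

```python
import numpy as np

def strassen_2x2(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Multiply two 2x2 matrices using Strassen's 7 scalar multiplications
    (the naive algorithm needs 8)."""
    a, b, c, d = A[0, 0], A[0, 1], A[1, 0], A[1, 1]
    e, f, g, h = B[0, 0], B[0, 1], B[1, 0], B[1, 1]
    m1 = (a + d) * (e + h)
    m2 = (c + d) * e
    m3 = a * (f - h)
    m4 = d * (g - e)
    m5 = (a + b) * h
    m6 = (c - a) * (e + f)
    m7 = (b - d) * (g + h)
    return np.array([[m1 + m4 - m5 + m7, m3 + m5],
                     [m2 + m4,           m1 - m2 + m3 + m6]])

# Numerical check against the standard matrix product.
A, B = np.random.rand(2, 2), np.random.rand(2, 2)
assert np.allclose(strassen_2x2(A, B), A @ B)  # same product, fewer multiplications
```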
  156. Fighting osteoporosis before it starts

    Tue, 27 Sep 2022 14:16:00 -0000

    Detecting signs of disease before bones start to break
  157. Understanding the faulty proteins linked to cancer and autism

    Mon, 26 Sep 2022 15:19:00 -0000

    Helping uncover how protein mutations cause diseases and disorders
  158. Solving the mystery of how an ancient bird went extinct

    Thu, 22 Sep 2022 15:27:00 -0000

    Creating a tool to study extinct species from 50,000 years ago
  159. Building safer dialogue agents

    Thu, 22 Sep 2022 00:00:00 -0000

    In our latest paper, we introduce Sparrow – a dialogue agent that’s useful and reduces the risk of unsafe and inappropriate answers. Our agent is designed to talk with a user, answer questions, and search the internet using Google when it’s helpful to look up evidence to inform its responses.
  160. Targeting early-onset Parkinson’s with AI

    Wed, 21 Sep 2022 15:37:00 -0000

    Predictions that pave the way to new treatments
  161. How our principles helped define AlphaFold’s release

    Wed, 14 Sep 2022 00:00:00 -0000

    Our Operating Principles have come to define both our commitment to prioritising widespread benefit and the areas of research and applications we refuse to pursue. These principles have been at the heart of our decision making since DeepMind was founded, and they continue to be refined as the AI landscape changes and grows. They are designed for our role as a research-driven science company and are consistent with Google’s AI principles.
  162. Maximising the impact of our breakthroughs

    Fri, 09 Sep 2022 00:00:00 -0000

    Colin, CBO at DeepMind, discusses collaborations with Alphabet and how we integrate ethics, accountability, and safety into everything we do.
  163. In conversation with AI: building better language models

    Tue, 06 Sep 2022 00:00:00 -0000

    Our new paper, In conversation with AI: aligning language models with human values, explores a different approach, asking what successful communication between humans and an artificial conversational agent might look like and what values should guide conversation in these contexts.
  164. From motor control to embodied intelligence

    Wed, 31 Aug 2022 00:00:00 -0000

    Using human and animal motions to teach robots to dribble a ball, and simulated humanoid characters to carry boxes and play football
  165. Advancing conservation with AI-based facial recognition of turtles

    Thu, 25 Aug 2022 00:00:00 -0000

    We came across Zindi – a dedicated partner with complementary goals – the largest community of African data scientists, which hosts competitions focused on solving Africa’s most pressing problems. Our Science team’s Diversity, Equity, and Inclusion (DE&I) group worked with Zindi to identify a scientific challenge that could help advance conservation efforts and grow involvement in AI. Inspired by Zindi’s bounding-box turtle challenge, we landed on a project with the potential for real impact: turtle facial recognition.
  166. Discovering when an agent is present in a system

    Thu, 18 Aug 2022 00:00:00 -0000

    We want to build safe, aligned artificial general intelligence (AGI) systems that pursue the intended goals of their designers. Causal influence diagrams (CIDs) are a way to model decision-making situations that allows us to reason about agent incentives. By relating training setups to the incentives that shape agent behaviour, CIDs help illuminate potential risks before training an agent and can inspire better agent designs. But how do we know when a CID is an accurate model of a training setup?