DeepLearningAI

Spice AI: The Open-Source Agent Data Stack for AI Agents

Spice AI introduces an open-source agent data stack designed for AI agents, providing secure, high-performance, and federated access to analytical and operational data. This solution addresses scalability and security challenges in the AI era, enabling agents to access diverse data sources efficiently.


Introduction

Spice AI is an open-source data stack that supplies context to AI agents, giving them secure and efficient access to analytical and operational data across an organization. It targets the data access, scalability, and security problems that agent workloads create.

Configuration Checklist

Element | Version / Link
Language / Runtime | Rust (for Spice AI Open Source), Python (for OpenClaw agents)
Main library | Spice AI Open Source (built on Apache DataFusion, Ballista, Arrow, Vortex, Iceberg), OpenClaw
Required APIs | GitHub API, HTTP endpoints (for logs/metrics), PostgreSQL
Keys / credentials needed | Database credentials, API tokens for GitHub/HTTP endpoints (managed by Spice AI)

Step-by-Step Guide

Step 1 — Set up the OpenClaw SRE Agent with Spice AI

This step gives the AI agent access to diverse data sources (logs, metrics, databases, documentation) for incident response and analysis.

# Install Spice CLI for unified SQL interface
curl https://install.spiceai.org | /bin/bash

# [Editor's note: Specific OpenClaw agent setup code is not provided in the transcript, but implied by the demo.]
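
Once the runtime is installed and running, a Python agent could reach the unified SQL interface over HTTP. The following is a minimal sketch, not the demo's actual agent code. It assumes the runtime listens on `http://localhost:8090` and accepts a SQL string POSTed to `/v1/sql`; check both against the documentation for your installed Spice version. `build_sql_request` and `run_sql` are hypothetical helper names.

```python
import json
import urllib.request

# Assumption: local HTTP address of the Spice runtime; adjust for your setup.
SPICE_HTTP_URL = "http://localhost:8090"


def build_sql_request(sql: str, base_url: str = SPICE_HTTP_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request that carries one SQL statement."""
    return urllib.request.Request(
        url=f"{base_url}/v1/sql",  # assumed SQL endpoint path
        data=sql.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )


def run_sql(sql: str, base_url: str = SPICE_HTTP_URL) -> list:
    """Send the query and decode the JSON response; requires a running runtime."""
    with urllib.request.urlopen(build_sql_request(sql, base_url), timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Keeping request construction separate from sending makes the helper easy to unit-test without a live runtime.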

Step 2 — Configure Data Sources in Spice AI

This step gives the agent a single interface for querying several kinds of data: monitoring logs, metrics, production databases, and troubleshooting guides.

# Example of configured datasets in Spice AI (from video screenshot)
# monitoring logs: Log aggregation API. Query application logs and Kubernetes auth logs. Columns: request_path, hostchart, request_query, timechan. (HTTP)
# monitoring metrics: Prometheus metrics API. Query time-series metrics, active alerts, and scrape targets. Columns: request_path, hostchart, request_query, timechan. (HTTP)
# data.orders: Orders table from PostgreSQL. Contains all customer orders. Columns: id (text), user_id (text, FK to users), status (text). (PostgreSQL)
# data.users: Users table from PostgreSQL. Contains user identity records. Columns: id (text). (PostgreSQL)
# tsg: GitHub repository for troubleshooting guides (Markdown files). (GitHub)
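
To illustrate the kind of SQL an agent might issue against the orders and users datasets listed above, here is a small sketch that uses an in-memory SQLite database as a stand-in for the federated layer. The table and column names follow the dataset descriptions (the demo registers them as data.orders and data.users; the stand-in uses unqualified names). The sample rows and the status values `completed` and `failed` are made up for illustration.

```python
import sqlite3

# Stand-in for the federated SQL layer: an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (id TEXT PRIMARY KEY);
    CREATE TABLE orders (
        id TEXT PRIMARY KEY,
        user_id TEXT REFERENCES users(id),
        status TEXT
    );
    -- Made-up sample rows, not data from the demo.
    INSERT INTO users VALUES ('user-001'), ('user-002');
    INSERT INTO orders VALUES
        ('order-1', 'user-001', 'completed'),
        ('order-2', 'user-002', 'failed'),
        ('order-3', 'user-002', 'completed');
    """
)

# Count orders per user and status by joining the two tables.
ORDERS_BY_USER_SQL = """
    SELECT u.id AS user_id, o.status, COUNT(*) AS order_count
    FROM orders AS o
    JOIN users AS u ON u.id = o.user_id
    GROUP BY u.id, o.status
    ORDER BY u.id, o.status
"""

rows = conn.execute(ORDERS_BY_USER_SQL).fetchall()
for user_id, status, order_count in rows:
    print(user_id, status, order_count)
```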

Step 3 — Trigger a Live Site Incident

This step simulates a real-world scenario in which the SRE agent must respond to a performance degradation.

# Scale load-generator service from 1 to 6 replicas to increase traffic
kubectl scale deployment/load-generator --replicas=6 -n demo

Step 4 — Observe Latency Increase and Alerting

This step confirms that the incident is active and that the monitoring system (Grafana) is detecting it.

Observation: Latency for order-service skyrockets (from Grafana dashboard). An alert is received in Slack: "OrderLatencyHigh - order-service p95 latency above 350ms - runbook: [link to GitHub TSG]".
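
The alert condition quoted above ("p95 latency above 350ms") can be sketched as a simple percentile check. This is an illustration only: it uses the nearest-rank percentile on raw samples, whereas a Prometheus-based alert would typically estimate the quantile from histogram buckets. The function names and the sample latencies are hypothetical.

```python
import math

P95_THRESHOLD_MS = 350.0  # threshold quoted in the alert text


def percentile_nearest_rank(samples, pct):
    """Return the smallest sample that has at least pct percent of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def latency_alert_firing(samples_ms, threshold_ms=P95_THRESHOLD_MS):
    """True when the p95 of the samples exceeds the threshold."""
    return percentile_nearest_rank(samples_ms, 95) > threshold_ms


# Made-up latencies in milliseconds: 18 fast requests and 2 slow ones.
samples = [40.0] * 18 + [900.0] * 2
print(percentile_nearest_rank(samples, 95))
print(latency_alert_firing(samples))
```

Note that two slow requests out of twenty are enough to push the p95 over the threshold, which is why tail percentiles are a common alerting signal.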

Step 5 — Interact with the OpenClaw SRE Agent for Diagnosis

This step uses the AI agent's ability to query diverse data sources and return actionable insights.

# Slack interaction with the OpenClaw SRE agent
@OpenClaw SRE what's going on?

# Agent response:
# I need a fresh live re-check to answer that accurately. Say 're-check now' and I'll give you the current root cause and exact next action.

@OpenClaw SRE re-check the live system

Step 6 — Receive Remediation Steps from the Agent

The agent identifies the root cause and suggests a solution based on its data access and knowledge base.

Agent's diagnosis: "order-service is under-provisioned for current traffic. Live checks show p95 latency at 41.7ms over the last 5 minutes with only 1 replica, while payment-service is healthy and there are no current error events. Scale order-service to 3 replicas now."

Suggested action:

# Scale order-service to 3 replicas
kubectl scale deployment/order-service --replicas=3 -n demo

Step 7 — Apply Remediation and Observe Service Recovery

This step implements the agent's recommendation in an attempt to resolve the incident.

Action: Scale order-service to 3 replicas.

Observation: Latency initially drops, but then errors increase, indicating a new problem.

Step 8 — Debug New Issue with Agent and Apply Further Fix

The agent adapts to new information and provides a refined solution.

Agent's diagnosis: "order-service is failing on PostgreSQL connection exhaustion after scaling to 3 replicas. PgBouncer for pgdb-pooler-rw is still in session mode, so the added replicas pushed backend connection demand past PostgreSQL capacity and db insert failed / query_wait_timeout errors are firing. In Custom Resources -> Pooler pgdb-pooler-rw, open the YAML editor and change: poolMode: transaction"

Action: Change pgbouncer pool mode from session to transaction.

# Excerpt from pgdb-pooler-rw YAML configuration
spec:
  pgbouncer:
    poolMode: transaction # Changed from 'session'

[Editor's note: The full YAML update command is not shown; the change is made by editing the pooler's Kubernetes custom resource (the Pooler object named pgdb-pooler-rw), not the Custom Resource Definition itself.]
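
The reasoning behind this fix can be sketched with simple arithmetic. In PgBouncer's session mode, a backend connection stays assigned to a client for the whole session; in transaction mode it returns to the pool after each transaction. The model below is deliberately simplified, and every number in it (clients per replica, PostgreSQL capacity, the share of clients inside a transaction) is hypothetical rather than taken from the demo.

```python
import math


def backend_connections_needed(replicas, clients_per_replica, pool_mode, busy_fraction=1.0):
    """Estimate the PostgreSQL connections demanded through the pooler.

    session mode: every connected client pins one backend connection.
    transaction mode: only clients inside a transaction need one, modelled
    here by busy_fraction (share of clients in a transaction at once).
    """
    clients = replicas * clients_per_replica
    if pool_mode == "session":
        return clients
    if pool_mode == "transaction":
        return math.ceil(clients * busy_fraction)
    raise ValueError(f"unknown pool mode: {pool_mode}")


# Hypothetical numbers, not taken from the demo.
CLIENTS_PER_REPLICA = 40
POSTGRES_CAPACITY = 100

for replicas, mode in [(1, "session"), (3, "session"), (3, "transaction")]:
    needed = backend_connections_needed(
        replicas, CLIENTS_PER_REPLICA, mode, busy_fraction=0.25
    )
    state = "ok" if needed <= POSTGRES_CAPACITY else "exhausted"
    print(f"{replicas} replica(s), {mode}: {needed} connections -> {state}")
```

Under these assumed numbers, one replica fits comfortably in session mode, three replicas exceed capacity, and switching to transaction mode brings demand back under the limit without reducing the replica count.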

Step 9 — Confirm Full Recovery and Customer Impact

This step verifies that the incident is fully resolved and quantifies the business impact.

Observation: Latency and error rates return to normal.

Agent's summary: "Customer impact from the DB incident window (22:45:39-22:47:40 UTC): 2 failed orders affecting 2 customers (user-002, user-006). I cannot re-queue or modify those records; remediation must be done by a human."
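
A customer-impact summary like the one above boils down to filtering failed orders by the incident window and counting distinct customers. The sketch below shows that calculation on made-up rows. It assumes each order carries a `created_at` timestamp, which is not among the columns listed for the orders dataset, and it uses a placeholder date because the source gives only times of day.

```python
from datetime import datetime, timezone


def summarize_impact(orders, start, end):
    """Count failed orders and distinct affected customers inside [start, end]."""
    failed = [
        o for o in orders
        if o["status"] == "failed" and start <= o["created_at"] <= end
    ]
    return {
        "failed_orders": len(failed),
        "customers": sorted({o["user_id"] for o in failed}),
        "window_seconds": (end - start).total_seconds(),
    }


def utc(hh, mm, ss):
    # Placeholder date: the source reports only the time of day.
    return datetime(2000, 1, 1, hh, mm, ss, tzinfo=timezone.utc)


window_start, window_end = utc(22, 45, 39), utc(22, 47, 40)

# Made-up rows; created_at is an assumed column.
sample_orders = [
    {"id": "order-a", "user_id": "user-002", "status": "failed", "created_at": utc(22, 46, 5)},
    {"id": "order-b", "user_id": "user-006", "status": "failed", "created_at": utc(22, 47, 10)},
    {"id": "order-c", "user_id": "user-003", "status": "completed", "created_at": utc(22, 46, 30)},
    {"id": "order-d", "user_id": "user-004", "status": "failed", "created_at": utc(22, 50, 0)},
]

summary = summarize_impact(sample_orders, window_start, window_end)
print(summary)
```

The reported window spans 121 seconds, so even a short pooler misconfiguration can produce customer-visible failures.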

Comparison Tables

SaaS-era Data Stack vs. AI-era Data Stack

Feature | SaaS-era Data Stack | AI-era Data Stack (Agents)
Primary Consumers | Apps, Dashboards, Services (humans) | AI Agents (24/7, on loops)
Data Needs | Analytical data | Analytical + Operational (real-time) data
Data Access Pattern | Centralized clusters + ETL | Distributed, federated access + acceleration
Data Latency | Weeks/months for ETL | Real-time, low latency
Query Volume | Lower, predictable | High, unpredictable, orders of magnitude more
Security Model | Centralized access control | Agent-specific, isolated, firewalled access
Scalability | Struggles with agent load | Designed for agent load

Federated Data Access Performance

Data Source | Performance Profile
GitHub | Slow (snail icon)
Iceberg | Medium (clock icon)
PostgreSQL | Fast (lightning bolt icon)

⚠️ Common Mistakes & Pitfalls

  1. ETL Bottlenecks: Relying on traditional ETL pipelines for AI agents leads to weeks or months of delay in data delivery, hindering rapid AI use case development.
  2. Overwhelmed Infrastructure: AI agents run 24/7 and generate orders of magnitude more queries across diverse data sources than human-driven applications do. Centralized data stacks struggle to absorb that load, which can lead to outages (e.g., GitHub's recent availability problems).
  3. Security Risks with Direct Access: Giving AI agents direct access to production databases can lead to data destruction or unauthorized access if not properly controlled, as demonstrated by recent incidents (e.g., "An AI Agent Just Destroyed Our Production Data. It Confessed in Writing.").
  4. Inconsistent Performance: Agents querying various backend data stores (e.g., GitHub, Iceberg, PostgreSQL) directly will experience inconsistent performance due to differing load patterns and profiles, impacting AI context accuracy and responsiveness.
  5. Lack of Observability for Agent Workflows: Without proper tracing of AI agent tool calls and LLM interactions, debugging and understanding agent behavior across diverse data systems becomes extremely challenging.
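
The observability pitfall in the list above can be made concrete with a tiny tracing wrapper. This is a generic sketch of the idea, not Spice AI's tracing mechanism: `traced_tool`, `TRACE_LOG`, and `query_metrics` are hypothetical names, and the tool body returns a placeholder instead of calling a real metrics API.

```python
import functools
import time

TRACE_LOG = []  # in-memory stand-in for a real trace sink


def traced_tool(func):
    """Record name, arguments, status, and duration of every tool call."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        started = time.perf_counter()
        record = {"tool": func.__name__, "args": args, "kwargs": kwargs}
        try:
            result = func(*args, **kwargs)
            record["status"] = "ok"
            return result
        except Exception as exc:
            record["status"] = "error"
            record["error"] = repr(exc)
            raise
        finally:
            # Runs on success and failure, so every call leaves a trace.
            record["duration_s"] = time.perf_counter() - started
            TRACE_LOG.append(record)
    return wrapper


@traced_tool
def query_metrics(service):
    """Hypothetical tool; a real agent would query a metrics API here."""
    return {"service": service, "note": "placeholder result"}


query_metrics("order-service")
print(TRACE_LOG[-1]["tool"], TRACE_LOG[-1]["status"])
```

Recording failures as well as successes matters here, because the calls that raise are usually the ones an operator needs to inspect.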

Glossary

AI Agent: An autonomous software entity that perceives its environment, makes decisions, and takes actions to achieve specific goals, often interacting with various data sources.
Agent Data Stack: A specialized data infrastructure designed to provide AI agents with secure, high-performance, and federated access to both analytical and operational data across an organization.
Query Federation: The ability to query data from multiple, disparate data sources (databases, data warehouses, data lakes, APIs) as if they were a single, unified source.

Key Takeaways

  • AI agents require access to both analytical and operational data across the entire enterprise for accuracy and usefulness.
  • Traditional ETL and centralized data stacks are insufficient for the speed, scale, and security demands of AI agents.
  • Every AI agent needs its own dedicated, isolated data stack to ensure secure and high-performance data access.
  • Spice AI provides an open-source agent data stack that offers query federation, local data acceleration (e.g., using DuckDB, SQLite), and hybrid search capabilities.
  • This approach enables consistently fast AI context by replicating working sets of data into local embedded databases, reducing latency and load on backend systems.
  • The agent data stack provides secure, firewalled access to data, preventing agents from directly accessing production systems and mitigating security risks.
  • Observability tools within the agent data stack allow tracing of all tool calls and LLM interactions, crucial for debugging and understanding agent behavior.
  • The demo showcased an OpenClaw SRE agent using Spice AI to diagnose and mitigate a live site incident by querying logs, metrics, production databases, and troubleshooting guides.
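
The local acceleration idea in the takeaways above (replicating a working set into a local embedded database) can be sketched in a few lines. This is a conceptual illustration and not how Spice AI implements acceleration: `AcceleratedTable`, the refresh interval, and the sample rows are all hypothetical, and a plain function stands in for the slow backend.

```python
import sqlite3
import time


class AcceleratedTable:
    """Serve queries from a local SQLite copy, refreshed when it goes stale."""

    def __init__(self, fetch_rows, refresh_seconds=60.0, clock=time.monotonic):
        self._fetch_rows = fetch_rows          # callable that hits the slow backend
        self._refresh_seconds = refresh_seconds
        self._clock = clock
        self._last_refresh = None
        self._db = sqlite3.connect(":memory:")
        self._db.execute(
            "CREATE TABLE orders (id TEXT PRIMARY KEY, user_id TEXT, status TEXT)"
        )

    def _refresh_if_stale(self):
        now = self._clock()
        if self._last_refresh is not None and now - self._last_refresh < self._refresh_seconds:
            return  # local copy is still fresh; skip the backend
        rows = self._fetch_rows()
        with self._db:  # commit the swap as a single transaction
            self._db.execute("DELETE FROM orders")
            self._db.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
        self._last_refresh = now

    def query(self, sql, params=()):
        self._refresh_if_stale()
        return self._db.execute(sql, params).fetchall()


backend_calls = []


def slow_backend_fetch():
    """Stand-in for a slow remote source; returns made-up rows."""
    backend_calls.append(1)
    return [("order-1", "user-001", "completed"), ("order-2", "user-002", "failed")]


table = AcceleratedTable(slow_backend_fetch, refresh_seconds=60.0)
table.query("SELECT COUNT(*) FROM orders")
table.query("SELECT id FROM orders WHERE status = ?", ("failed",))
print(len(backend_calls))
```

Both queries are answered from the local copy after a single backend fetch, which is the effect the takeaways describe: lower latency for the agent and less load on the backend system.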

Resources

  • Spice AI GitHub: https://github.com/spiceai/spiceai
  • Install Spice CLI: curl https://install.spiceai.org | /bin/bash
  • GitHub News on Availability: https://github.blog/news-insights/company-news/an-update-on-github-availability/
  • OpenClaw SRE Demo Alerts Runbook: https://github.com/spicehq/openclaw-sre-demo/blob/main/docs/tsg/order-service-scaling.md
  • Postgres Connection Exhaustion TSG: https://github.com/spicehq/openclaw-sre-demo/blob/main/docs/tsg/postgres-connection-exhaustion.md
  • Spice AI Dashboard (On-Call Demo): https://openclaw-sre-demo.spiceai.ai/dashboard/c/main/deployments?namespace=demo
  • Spice AI Dashboard (Observability): https://openclaw-sre-demo.spiceai.ai/dashboard/c/main/deployments/openclaw-sre-demo/observability