
How to Monitor Your AI Agents in Production: A Hands-On Guide

15 min read

You shipped an AI agent. Users are talking to it. But do you actually know what it's doing?

Most teams deploy AI agents and cross their fingers. When something goes wrong — a hallucination, a data leak, a frustrated user getting the runaround — they find out from a support ticket, not from their monitoring stack.

This guide walks you through adding full observability to your AI agents using Foil's Native SDK. By the end, you'll have distributed tracing, automated evaluations, user feedback tracking, semantic search, and A/B testing — all running in production.

Why the Native SDK?

The Native SDK gives you full control over your trace structure. You choose exactly which spans to create (AGENT, LLM, TOOL, CHAIN, EMBEDDING, RETRIEVER), attach custom metadata and properties, and track tokens explicitly. It's the right choice for complex multi-step pipelines, custom tool integrations, and when you need precise visibility into every step of your agent's execution.

Want zero-code setup instead? Foil also supports OpenTelemetry auto-instrumentation — three lines of setup, no manual spans, and automatic tracing of every LLM call. Read the OTEL guide →

We'll build five real scenarios, starting simple and layering on features. Every code snippet is runnable.


What You'll Build

Step | Scenario                 | What You'll Learn
1    | Customer support bot     | Tracing, spans, feedback signals
2    | RAG pipeline             | Retriever/embedding spans, custom evaluations
3    | E-commerce agent         | A/B testing, cost analytics, eval templates
4    | Knowledge base assistant | Data leakage detection, user tracking
5    | Content generator        | Multimodal content, score evaluations

Prerequisites

  • A Foil account (free tier works)
  • Node.js 18+ or Python 3.9+
  • An API key from Foil (Settings → API Keys)
  • Optionally, an OpenAI API key (all demos work without one using built-in mock responses)

Setup

npm init -y
npm install @getfoil/foil-js

Set your environment variables:

export FOIL_API_KEY=sk_live_your_key_here
# Optional — demos use a mock LLM if this isn't set
export OPENAI_API_KEY=sk-your-openai-key

Step 1: Your First Traced Agent — Customer Support Bot

Let's start with the most common pattern: an agent that takes user input, calls an LLM, uses tools, and returns a response. We want to see every step of that flow in our dashboard.

Initialize the Tracer

The tracer is your entry point. It wraps your agent logic and sends structured traces to Foil.

const { createFoilTracer, SpanKind, Foil } = require('@getfoil/foil-js');

const tracer = createFoilTracer({
  apiKey: process.env.FOIL_API_KEY,
  agentName: 'demo-customer-support',
});

That's the whole setup. Your agent now has an identity in Foil.

Wrap a Conversation in a Trace

Every conversation becomes a trace — a container that holds all the spans (LLM calls, tool executions, etc.) that happen within it.

const result = await tracer.trace(
  async (ctx) => {
    // Everything inside here is traced
    const messages = [
      { role: 'system', content: 'You are a helpful support agent for TechMart.' },
      { role: 'user', content: 'Can you check on order ORD-12345?' },
    ];

    // Create an LLM span
    const llmSpan = await ctx.startSpan(SpanKind.LLM, 'gpt-4o-mini', {
      input: 'Can you check on order ORD-12345?',
    });

    const response = await openai.chat.completions.create({
      model: 'gpt-4o-mini',
      messages,
    });

    const answer = response.choices[0].message.content;

    await llmSpan.end({
      output: answer,
      tokens: {
        prompt: response.usage.prompt_tokens,
        completion: response.usage.completion_tokens,
        total: response.usage.total_tokens,
      },
    });

    return answer;
  },
  {
    name: 'order-status-check',
    sessionId: 'session-abc-123',
    input: 'Can you check on order ORD-12345?',
  },
);

Add Tool Spans

When your agent calls tools (database lookups, API calls, etc.), wrap them in tool spans so you can see what happened:

// Inside your trace callback...
const orderData = await ctx.tool('lookup_order', async () => {
  // Your actual tool logic
  return db.orders.findOne({ orderNumber: 'ORD-12345' });
}, { input: { order_number: 'ORD-12345' } });

Now your trace shows the full span tree: AGENT → LLM → TOOL.

Dashboard screenshot showing a trace with AGENT > LLM > TOOL span hierarchy

Record User Feedback

After each conversation, record whether the user was satisfied:

// Inside the trace callback
await ctx.recordFeedback(true);  // thumbs up
// or
await ctx.recordFeedback(false); // thumbs down

Feedback shows up as signals attached to the trace. In the dashboard, you can filter traces by satisfaction to find problem conversations fast.

Dashboard showing traces with feedback signals (thumbs up/down icons)

Run the Full Demo

The demo runs 12 conversations — normal interactions, edge cases, and frustrated users — to show how traces look at scale:

cd foil/sdks/javascript
FOIL_API_KEY=sk_live_xxx node examples/01_customer_support_bot.js

After it runs, you'll have 12 traced conversations in your dashboard. That's enough to trigger profile learning — Foil's system for understanding your agent's domain and typical behavior patterns. You'll see the profile status in your agent settings.

Agent profile status showing 'Profile learned' after 10+ traces

Semantic Search

Once traces are embedded (happens automatically), you can search them by meaning, not just keywords:

const foil = new Foil({
  apiKey: process.env.FOIL_API_KEY,
});

const results = await foil.semanticSearch('frustrated customer', {
  agentName: 'demo-customer-support',
  limit: 5,
});

console.log(`Found ${results.results.length} matching traces`);

This finds the angry-customer conversations even though nobody typed the exact phrase "frustrated customer." That's the power of semantic search over keyword matching.
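Under the hood, semantic search ranks traces by embedding similarity rather than keyword overlap. Here's a toy sketch of that ranking step, using hand-made vectors in place of real embeddings (this is an illustration of the idea, not Foil's implementation):

```javascript
// Toy illustration of semantic ranking. Real systems embed text with a
// model (e.g. text-embedding-3-small); we use hand-made 3-dim vectors
// just to show why "frustrated customer" matches an angry message.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Pretend embedding axes: [anger, order-related, smalltalk]
const traces = [
  { text: 'This is ridiculous, I want a refund NOW', vector: [0.9, 0.4, 0.0] },
  { text: 'Where is my order?', vector: [0.1, 0.9, 0.1] },
  { text: 'Thanks, have a great day!', vector: [0.0, 0.1, 0.9] },
];

const query = { text: 'frustrated customer', vector: [0.95, 0.2, 0.05] };

const ranked = traces
  .map((t) => ({ ...t, score: cosineSimilarity(query.vector, t.vector) }))
  .sort((a, b) => b.score - a.score);

console.log(ranked[0].text); // the refund complaint ranks first
```

No word in the top result overlaps with the query; the vectors carry the meaning.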

Semantic search results showing matching 'frustrated customer' traces with similarity scores

Step 2: RAG Pipeline Monitoring

Retrieval-Augmented Generation adds complexity: your agent embeds a query, retrieves documents, then generates an answer from context. If any step fails, the output suffers. Foil gives you visibility into each step.

Span Types for RAG

Foil has dedicated span types for RAG components:

EMBEDDING → RETRIEVER → LLM

Here's how each step looks:

const result = await tracer.trace(async (ctx) => {
  // Step 1: Embed the query
  const embeddingSpan = await ctx.startSpan(SpanKind.EMBEDDING, 'text-embedding-3-small', {
    input: userQuestion,
  });
  const queryVector = await embedQuery(userQuestion);
  await embeddingSpan.end({
    output: { dimensions: queryVector.length },
    tokens: { prompt: tokenCount, total: tokenCount },
  });

  // Step 2: Retrieve relevant documents
  const retrieverSpan = await ctx.startSpan(SpanKind.RETRIEVER, 'knowledge-base-retriever', {
    input: userQuestion,
    properties: { topK: 3 },
  });
  const docs = await vectorStore.search(queryVector, { topK: 3 });
  await retrieverSpan.end({
    output: docs.map(d => ({
      title: d.title,
      score: d.relevanceScore,
      snippet: d.content.slice(0, 100),
    })),
  });

  // Step 3: Generate answer from context
  const llmSpan = await ctx.startSpan(SpanKind.LLM, 'gpt-4o-mini', {
    input: messages,
    properties: {
      retrievedDocCount: docs.length,
      avgRelevanceScore: avgScore,
    },
  });
  const response = await openai.chat.completions.create({ model: 'gpt-4o-mini', messages });
  await llmSpan.end({ output: response.choices[0].message.content, tokens: /* ... */ });

  return response.choices[0].message.content;
}, { name: 'rag-query', input: userQuestion });

Dashboard trace detail showing EMBEDDING > RETRIEVER > LLM span hierarchy with retrieval scores visible

Create a Custom Evaluation

You can define your own evaluation criteria. For RAG, a "groundedness check" verifies that responses are supported by retrieved documents:

const foil = new Foil({ apiKey: process.env.FOIL_API_KEY });

const evaluation = await foil.createEvaluation(agentId, {
  name: 'groundedness_check',
  description: 'Checks if the response is grounded in retrieved documents',
  prompt: `Evaluate whether the assistant response is grounded in the provided context.
Return true if grounded, false if it contains hallucinated or unsupported claims.

Input: {input}
Output: {output}`,
  evaluationType: 'boolean',
  enabled: true,
});

Once enabled, this evaluation runs automatically on every new trace for the agent. No code changes needed in your agent — Foil handles it server-side.

Test Your Evaluation Before Deploying

Don't deploy an eval blind. Test it first:

const testResult = await foil.testEvaluation(agentId, evaluationId, {
  input: 'How do I get started with Kubernetes?',
  output: 'Install kubectl and minikube. Key concepts are pods, services, and deployments.',
});

console.log(testResult.result);    // true
console.log(testResult.reasoning); // "Response is directly supported by the context..."

Dashboard showing evaluation test results with result and reasoning

Run the Demo

FOIL_API_KEY=sk_live_xxx node examples/02_rag_pipeline.js

The demo runs 5 queries against a mock knowledge base. Query 4 ("What is quantum computing architecture?") intentionally has no relevant docs — you'll see low retrieval scores, and the groundedness eval should flag the response.
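If you want to catch weak retrievals in your own pipeline code as well, a simple client-side guard over the relevance scores does the trick (the 0.5 threshold here is an arbitrary illustration, not a Foil default; tune it against your own score distribution):

```javascript
// Client-side sketch: flag weak retrievals before they reach the LLM.
// The 0.5 threshold is arbitrary; tune it against your score distribution.
function retrievalLooksWeak(docs, threshold = 0.5) {
  if (docs.length === 0) return true;
  const avg = docs.reduce((sum, d) => sum + d.relevanceScore, 0) / docs.length;
  return avg < threshold;
}

// Like the demo's quantum-computing query: nothing relevant came back.
const quantumDocs = [
  { title: 'Kubernetes basics', relevanceScore: 0.21 },
  { title: 'CI/CD pipelines', relevanceScore: 0.18 },
];
console.log(retrievalLooksWeak(quantumDocs)); // true
```

You could attach the result as a span property so it shows up alongside the groundedness eval in the dashboard.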

Dashboard showing a trace with low retrieval scores highlighted

Step 3: A/B Testing with Cost Analytics

You're debating whether to use gpt-4o-mini or gpt-4o for your e-commerce agent. Mini is cheaper, but is it good enough? Foil lets you run both and compare.

Tag Traces with Variant Info

const variants = [
  { name: 'gpt-4o-mini', model: 'gpt-4o-mini', costPerToken: 0.00015 },
  { name: 'gpt-4o', model: 'gpt-4o', costPerToken: 0.0025 },
];

for (const variant of variants) {
  for (const conversation of conversations) {
    await tracer.trace(
      async (ctx) => {
        const llmSpan = await ctx.startSpan(SpanKind.LLM, variant.model, {
          input: userMessage,
          properties: { variant: variant.name },
        });

        const response = await llm.chat.completions.create({
          model: variant.model,
          messages,
        });

        await llmSpan.end({ output: response.choices[0].message.content, tokens: /* ... */ });

        // Record business metrics as signals
        await ctx.recordSignal('conversion', didConvert, {
          signalType: 'engagement',
          metadata: { variant: variant.name },
        });
        await ctx.recordRating(satisfactionScore);

        return response.choices[0].message.content;
      },
      {
        name: 'product-query',
        input: userMessage,
        properties: { experimentVariant: variant.name },
      },
    );
  }
}

Clone Evaluation Templates

Foil ships with pre-built evaluation templates for common checks. Clone them to your agent with one call:

// Brand safety — catches competitor mentions, off-brand messaging
await foil.cloneEvaluationTemplate(agentId, 'brand_safety');

// Competitor redirect — catches when the agent recommends competitors
await foil.cloneEvaluationTemplate(agentId, 'competitor_redirect');

These evals now run automatically on every trace. No prompts to write — the templates come with battle-tested prompts.

Dashboard showing evaluation templates list with brand_safety and competitor_redirect

Compare Variants

After running the demo, you'll see output like:

─────────────────────────────────────────────────────────────────
A/B Test Cost Comparison
─────────────────────────────────────────────────────────────────
  gpt-4o-mini:
    Total tokens: 1,240
    Est. cost: $0.1860
    Conversions: 3/6
  gpt-4o:
    Total tokens: 1,180
    Est. cost: $2.9500
    Conversions: 3/6

Same conversion rate, nearly a 16x cost difference. In the dashboard, you can drill deeper — compare eval pass rates, user satisfaction, and latency across variants.
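The cost figures fall straight out of total tokens times the per-token price configured for each variant. A quick sanity check:

```javascript
// Reproduce the demo's cost comparison from the variant config in Step 3.
const variants = [
  { name: 'gpt-4o-mini', totalTokens: 1240, costPerToken: 0.00015 },
  { name: 'gpt-4o', totalTokens: 1180, costPerToken: 0.0025 },
];

const costs = variants.map((v) => ({
  name: v.name,
  cost: v.totalTokens * v.costPerToken,
}));

for (const c of costs) {
  console.log(`${c.name}: $${c.cost.toFixed(4)}`);
}
// gpt-4o-mini: $0.1860
// gpt-4o: $2.9500

const ratio = costs[1].cost / costs[0].cost;
console.log(`Cost ratio: ${ratio.toFixed(1)}x`); // ~15.9x
```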

Dashboard analytics comparing two experiment variants side by side

Run the Demo

FOIL_API_KEY=sk_live_xxx node examples/03_ecommerce_product_agent.js

Step 4: Catching Data Leakage

This scenario is about security. Your internal knowledge base assistant might accidentally include API keys, internal URLs, or secrets in its responses. Foil can catch this automatically.

User and Device Tracking

Track who's using your agent and from where:

await tracer.trace(
  async (ctx) => {
    // ... your agent logic ...
  },
  {
    name: 'kb-query',
    input: question,
    userId: 'dev-alice',
    userProperties: { name: 'Alice Chen' },
    device: {
      platform: 'web',
      browser: 'Chrome 120',
      os: 'macOS 14.0',
    },
  },
);

In the dashboard, you can now segment traces by user, see which users hit errors most often, and track usage by platform.

Dashboard showing traces filtered by user ID with device info visible

CHAIN Spans for Orchestration

When your agent has an orchestration layer (search docs → fetch reference → generate answer), use CHAIN spans to group related operations:

const chainSpan = await ctx.startSpan(SpanKind.CHAIN, 'knowledge-lookup', {
  input: question,
});

// Tool calls and LLM calls happen here as children of the chain
const docs = await ctx.tool('search_docs', async () => searchDocs(query), { input: query });
const answer = await callLLM(ctx, docs, question);

await chainSpan.end({ output: answer });

Clone the Data Leakage Template

await foil.cloneEvaluationTemplate(agentId, 'data_leakage');

This eval looks for API keys (sk_*, pk_*), internal URLs, database connection strings, and other secrets in your agent's responses. It runs on every trace automatically.
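For intuition, here is roughly the class of patterns such a check hunts for, written as a toy regex detector. The real template is prompt-based, and these regexes are illustrative, not Foil's actual rules:

```javascript
// Toy illustration of data-leakage patterns. Not Foil's implementation;
// the real template is an LLM evaluation, not a regex scan.
const LEAK_PATTERNS = [
  { name: 'api_key', regex: /\b(?:sk|pk)_[a-zA-Z0-9_]{8,}\b/ },
  { name: 'internal_url', regex: /https?:\/\/[a-z0-9.-]*\.internal\b/i },
  { name: 'connection_string', regex: /postgres:\/\/\S+:\S+@\S+/ },
];

function findLeaks(text) {
  return LEAK_PATTERNS
    .filter((p) => p.regex.test(text))
    .map((p) => p.name);
}

const response =
  'Set FOO_API_KEY=sk_live_abc123def456ghi789 in your .env file.';
console.log(findLeaks(response)); // [ 'api_key' ]
```

The prompt-based approach catches paraphrased or obfuscated leaks that rigid patterns like these would miss.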

Run the Demo

FOIL_API_KEY=sk_live_xxx node examples/04_knowledge_base_assistant.js

Query 4 in the demo asks about environment setup. The mock response intentionally includes a fake API key (sk_live_abc123def456ghi789). You'll see the data leakage eval flag it in the dashboard.

Dashboard showing a trace with data_leakage evaluation result = detected, with the flagged content highlighted

Step 5: Multimodal Content and Score Evaluations

Not all agent output is plain text. If your agent generates social media posts with images, product pages with media, or any mixed content, Foil tracks it with content blocks.

Multimodal Output

Use content() and ContentBlock helpers to structure mixed outputs:

const { content, ContentBlock } = require('@getfoil/foil-js');

// Inside your trace...
const output = content(
  generatedText,
  ContentBlock.media(mediaId, {
    category: 'image',
    mimeType: 'image/png',
    filename: 'social-banner.png',
  })
);

await llmSpan.end({ output, tokens: /* ... */ });

In the dashboard, multimodal traces show the text and media blocks together so you can see exactly what the agent produced.

Dashboard trace showing multimodal content blocks (text + image reference)

Score Evaluations

Boolean evals (pass/fail) aren't always enough. Score evaluations rate output on a numeric scale:

await foil.createEvaluation(agentId, {
  name: 'content_quality',
  description: 'Rates generated content quality from 1-10',
  prompt: `Rate the quality of this content on a scale of 1-10.

Consider: clarity, engagement, accuracy, tone appropriateness, and structure.

Input (prompt): {input}
Output (content): {output}

Return a score from 1 (poor) to 10 (excellent).`,
  evaluationType: 'score',
  scoreMin: 1,
  scoreMax: 10,
  enabled: true,
});

You can also create category evaluations (e.g., classify tone as "formal," "casual," "aggressive") and combine them with score evals to get a full picture of content quality.
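Only boolean and score evals appear in code above; a category eval plausibly follows the same shape. A hypothetical sketch (the `categories` field name is an assumption, not confirmed against Foil's API; check the evaluation docs for the exact payload):

```javascript
// Hypothetical sketch of a category evaluation, mirroring the score-eval
// shape above. The `categories` field is an assumption, not a documented
// part of Foil's API.
const queryComplexityEval = {
  name: 'query_complexity',
  description: 'Classifies incoming queries by difficulty',
  prompt: `Classify the complexity of this user query.

Query: {input}

Return exactly one of: simple, moderate, complex, ambiguous.`,
  evaluationType: 'category',
  categories: ['simple', 'moderate', 'complex', 'ambiguous'],
  enabled: true,
};

// Registered the same way as the other evals:
// await foil.createEvaluation(agentId, queryComplexityEval);
console.log(queryComplexityEval.categories.join(', '));
```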

Dashboard showing score evaluation results (1-10) across multiple traces with a trend chart

Run the Demo

FOIL_API_KEY=sk_live_xxx node examples/05_content_generation_agent.js

The demo generates four content types: a blog post, a social media post (multimodal), a product description, and a creative writing piece. Each gets scored by the content_quality eval and checked by the tone_analysis template.


The Full Evaluation Stack

Here's a summary of every evaluation type we used across the five scenarios:

Evaluation          | Type               | Scenario          | What It Checks
Built-in evals      | Boolean            | All               | Quality, safety (run automatically)
groundedness_check  | Boolean            | RAG Pipeline      | Is the response supported by retrieved docs?
content_quality     | Score (1-10)       | Content Generator | Overall quality of generated content
query_complexity    | Category           | Knowledge Base    | Classifies queries as simple/moderate/complex/ambiguous
brand_safety        | Boolean (template) | E-commerce        | Competitor mentions, off-brand messaging
competitor_redirect | Boolean (template) | E-commerce        | Redirecting users to competitors
data_leakage        | Boolean (template) | Knowledge Base    | API keys, secrets in responses
tone_analysis       | Boolean (template) | Content Generator | Inappropriate tone detection

You can mix and match. Templates give you a head start; custom evals let you check anything specific to your domain.


What We Covered

In five scenarios, we went from zero observability to:

  • Distributed tracing with AGENT, LLM, TOOL, CHAIN, EMBEDDING, and RETRIEVER spans
  • 8 evaluations (boolean, score, and category) running automatically on every trace
  • User feedback signals linked to traces
  • Semantic search for finding conversations by meaning
  • A/B testing with per-variant cost comparison
  • Data leakage detection catching secrets in responses
  • User and device tracking for segmenting by who uses what
  • Multimodal content tracking for mixed text + media outputs
  • Profile learning that improves as your agent handles more conversations

All of this with an SDK that takes a few lines to initialize and doesn't require changes to your LLM calls.

Run All Five Demos

cd foil/sdks/javascript
FOIL_API_KEY=sk_live_xxx node examples/01_customer_support_bot.js
FOIL_API_KEY=sk_live_xxx node examples/02_rag_pipeline.js
FOIL_API_KEY=sk_live_xxx node examples/03_ecommerce_product_agent.js
FOIL_API_KEY=sk_live_xxx node examples/04_knowledge_base_assistant.js
FOIL_API_KEY=sk_live_xxx node examples/05_content_generation_agent.js

No OpenAI key required — all demos include mock LLM responses. Add OPENAI_API_KEY=sk-xxx to use real LLM calls.

Next Steps

  • Set up alerts: Get notified when evaluations fail or costs spike
  • Connect your real agent: Replace the demo code with your actual agent logic
  • Add custom evaluations: Define checks specific to your domain
  • Explore the dashboard: Filter by time, agent, user, session, and evaluation results

Foil is an AI monitoring platform that gives you visibility into your agents in production. Sign up free at getfoil.ai.

Ready to monitor your AI agents?

Get started with Foil in minutes. Free tier available — no credit card required.

Get Early Access