You shipped an AI agent. Users are talking to it. But do you actually know what it's doing?
Most teams deploy AI agents and cross their fingers. When something goes wrong — a hallucination, a data leak, a frustrated user getting the runaround — they find out from a support ticket, not from their monitoring stack.
This guide walks you through adding full observability to your AI agents using Foil's Native SDK. By the end, you'll have distributed tracing, automated evaluations, user feedback tracking, semantic search, and A/B testing — all running in production.
Why the Native SDK?
The Native SDK gives you full control over your trace structure. You choose exactly which spans to create (AGENT, LLM, TOOL, CHAIN, EMBEDDING, RETRIEVER), attach custom metadata and properties, and track tokens explicitly. It's the right choice for complex multi-step pipelines, custom tool integrations, and when you need precise visibility into every step of your agent's execution.
Want zero-code setup instead? Foil also supports OpenTelemetry auto-instrumentation — three lines of setup, no manual spans, and automatic tracing of every LLM call. Read the OTEL guide →
We'll build five real scenarios, starting simple and layering on features. Every code snippet is runnable.
What You'll Build
| Step | Scenario | What You'll Learn |
|---|---|---|
| 1 | Customer support bot | Tracing, spans, feedback signals |
| 2 | RAG pipeline | Retriever/embedding spans, custom evaluations |
| 3 | E-commerce agent | A/B testing, cost analytics, eval templates |
| 4 | Knowledge base assistant | Data leakage detection, user tracking |
| 5 | Content generator | Multimodal content, score evaluations |
Prerequisites
- A Foil account (free tier works)
- Node.js 18+ or Python 3.9+
- An API key from Foil (Settings → API Keys)
- Optionally, an OpenAI API key (all demos work without one using built-in mock responses)
Setup
npm init -y
npm install @getfoil/foil-js
Set your environment variables:
export FOIL_API_KEY=sk_live_your_key_here
# Optional — demos use a mock LLM if this isn't set
export OPENAI_API_KEY=sk-your-openai-key
Step 1: Your First Traced Agent — Customer Support Bot
Let's start with the most common pattern: an agent that takes user input, calls an LLM, uses tools, and returns a response. We want to see every step of that flow in our dashboard.
Initialize the Tracer
The tracer is your entry point. It wraps your agent logic and sends structured traces to Foil.
const { createFoilTracer, SpanKind, Foil } = require('@getfoil/foil-js');

const tracer = createFoilTracer({
  apiKey: process.env.FOIL_API_KEY,
  agentName: 'demo-customer-support',
});
A few lines of setup, and your agent now has an identity in Foil.
Wrap a Conversation in a Trace
Every conversation becomes a trace — a container that holds all the spans (LLM calls, tool executions, etc.) that happen within it.
const result = await tracer.trace(
  async (ctx) => {
    // Everything inside here is traced
    const messages = [
      { role: 'system', content: 'You are a helpful support agent for TechMart.' },
      { role: 'user', content: 'Can you check on order ORD-12345?' },
    ];

    // Create an LLM span
    const llmSpan = await ctx.startSpan(SpanKind.LLM, 'gpt-4o-mini', {
      input: 'Can you check on order ORD-12345?',
    });
    const response = await openai.chat.completions.create({
      model: 'gpt-4o-mini',
      messages,
    });
    const answer = response.choices[0].message.content;

    await llmSpan.end({
      output: answer,
      tokens: {
        prompt: response.usage.prompt_tokens,
        completion: response.usage.completion_tokens,
        total: response.usage.total_tokens,
      },
    });
    return answer;
  },
  {
    name: 'order-status-check',
    sessionId: 'session-abc-123',
    input: 'Can you check on order ORD-12345?',
  },
);
Add Tool Spans
When your agent calls tools (database lookups, API calls, etc.), wrap them in tool spans so you can see what happened:
// Inside your trace callback...
const orderData = await ctx.tool('lookup_order', async () => {
  // Your actual tool logic
  return db.orders.findOne({ orderNumber: 'ORD-12345' });
}, { input: { order_number: 'ORD-12345' } });
Now your trace shows the full span tree: AGENT → LLM → TOOL.

Record User Feedback
After each conversation, record whether the user was satisfied:
// Inside the trace callback
await ctx.recordFeedback(true); // thumbs up
// or
await ctx.recordFeedback(false); // thumbs down
Feedback shows up as signals attached to the trace. In the dashboard, you can filter traces by satisfaction to find problem conversations fast.
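Feedback booleans also roll up into the satisfaction rates the dashboard shows. As a rough local sketch of that aggregation (the signal shape here is hypothetical, not Foil's export format):

```javascript
// Sketch: aggregating thumbs up/down feedback into a satisfaction rate.
// The { traceId, positive } record shape is illustrative only.
function satisfactionRate(signals) {
  if (signals.length === 0) return null;
  const positive = signals.filter((s) => s.positive).length;
  return positive / signals.length;
}

const signals = [
  { traceId: 't1', positive: true },
  { traceId: 't2', positive: false },
  { traceId: 't3', positive: true },
  { traceId: 't4', positive: true },
];
console.log(satisfactionRate(signals)); // 0.75
```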

Run the Full Demo
The demo runs 12 conversations — normal interactions, edge cases, and frustrated users — to show how traces look at scale:
cd foil/sdks/javascript
FOIL_API_KEY=sk_live_xxx node examples/01_customer_support_bot.js
After it runs, you'll have 12 traced conversations in your dashboard. That's enough to trigger profile learning — Foil's system for understanding your agent's domain and typical behavior patterns. You'll see the profile status in your agent settings.

Semantic Search
Once traces are embedded (happens automatically), you can search them by meaning, not just keywords:
const foil = new Foil({
  apiKey: process.env.FOIL_API_KEY,
});
const results = await foil.semanticSearch('frustrated customer', {
  agentName: 'demo-customer-support',
  limit: 5,
});
console.log(`Found ${results.results.length} matching traces`);
This finds the angry-customer conversations even though nobody typed the exact phrase "frustrated customer." That's the power of semantic search over keyword matching.
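To see why this works, recall that embeddings map text to vectors, and similarity is measured geometrically, typically with cosine similarity. A minimal sketch with toy 3-dimensional vectors (real embeddings have hundreds of dimensions):

```javascript
// Sketch: semantic nearness as cosine similarity between embedding vectors.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// "frustrated customer" vs. "this is unacceptable, I want a refund":
// different words, but similar direction in embedding space (toy values).
const query = [0.9, 0.1, 0.2];
const angryTrace = [0.85, 0.15, 0.25];
const neutralTrace = [0.1, 0.9, 0.3];

console.log(cosineSimilarity(query, angryTrace) > cosineSimilarity(query, neutralTrace)); // true
```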

Step 2: RAG Pipeline Monitoring
Retrieval-Augmented Generation adds complexity: your agent embeds a query, retrieves documents, then generates an answer from context. If any step fails, the output suffers. Foil gives you visibility into each step.
Span Types for RAG
Foil has dedicated span types for RAG components:
EMBEDDING → RETRIEVER → LLM
Here's how each step looks:
const result = await tracer.trace(async (ctx) => {
  // Step 1: Embed the query
  const embeddingSpan = await ctx.startSpan(SpanKind.EMBEDDING, 'text-embedding-3-small', {
    input: userQuestion,
  });
  const queryVector = await embedQuery(userQuestion);
  await embeddingSpan.end({
    output: { dimensions: queryVector.length },
    tokens: { prompt: tokenCount, total: tokenCount },
  });

  // Step 2: Retrieve relevant documents
  const retrieverSpan = await ctx.startSpan(SpanKind.RETRIEVER, 'knowledge-base-retriever', {
    input: userQuestion,
    properties: { topK: 3 },
  });
  const docs = await vectorStore.search(queryVector, { topK: 3 });
  await retrieverSpan.end({
    output: docs.map(d => ({
      title: d.title,
      score: d.relevanceScore,
      snippet: d.content.slice(0, 100),
    })),
  });

  // Step 3: Generate answer from context
  const llmSpan = await ctx.startSpan(SpanKind.LLM, 'gpt-4o-mini', {
    input: messages,
    properties: {
      retrievedDocCount: docs.length,
      avgRelevanceScore: avgScore,
    },
  });
  const response = await openai.chat.completions.create({ model: 'gpt-4o-mini', messages });
  await llmSpan.end({ output: response.choices[0].message.content, tokens: /* ... */ });
  return response.choices[0].message.content;
}, { name: 'rag-query', input: userQuestion });
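The snippet above assumes a `vectorStore.search` helper. It is not part of the Foil SDK; as a rough sketch of what such a helper does, here is an in-memory version that ranks documents by cosine similarity:

```javascript
// Minimal in-memory stand-in for vectorStore.search(queryVector, { topK }).
// Real deployments would use a vector database; this just ranks docs.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function makeVectorStore(docs) {
  return {
    search(queryVector, { topK }) {
      return docs
        .map((d) => ({ ...d, relevanceScore: cosine(queryVector, d.vector) }))
        .sort((a, b) => b.relevanceScore - a.relevanceScore)
        .slice(0, topK);
    },
  };
}

const store = makeVectorStore([
  { title: 'Kubernetes basics', vector: [0.9, 0.1], content: 'Pods, services...' },
  { title: 'Billing FAQ', vector: [0.1, 0.9], content: 'Invoices...' },
]);
const hits = store.search([0.8, 0.2], { topK: 1 });
console.log(hits[0].title); // 'Kubernetes basics'
```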
Create a Custom Evaluation
You can define your own evaluation criteria. For RAG, a "groundedness check" verifies that responses are supported by retrieved documents:
const foil = new Foil({ apiKey: process.env.FOIL_API_KEY });
const evaluation = await foil.createEvaluation(agentId, {
  name: 'groundedness_check',
  description: 'Checks if the response is grounded in retrieved documents',
  prompt: `Evaluate whether the assistant response is grounded in the provided context.
Return true if grounded, false if it contains hallucinated or unsupported claims.
Input: {input}
Output: {output}`,
  evaluationType: 'boolean',
  enabled: true,
});
Once enabled, this evaluation runs automatically on every new trace for the agent. No code changes needed in your agent — Foil handles it server-side.
Test Your Evaluation Before Deploying
Don't deploy an eval blind. Test it first:
const testResult = await foil.testEvaluation(agentId, evaluationId, {
  input: 'How do I get started with Kubernetes?',
  output: 'Install kubectl and minikube. Key concepts are pods, services, and deployments.',
});
console.log(testResult.result); // true
console.log(testResult.reasoning); // "Response is directly supported by the context..."
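If you want a cheap local sanity check before wiring up the LLM-judged eval, a lexical-overlap heuristic can approximate groundedness. This is purely illustrative and much cruder than Foil's server-side evaluation:

```javascript
// Sketch: a cheap lexical groundedness heuristic. It reports what fraction of
// the response's content words appear somewhere in the retrieved context.
// Not how Foil's LLM-based evaluation works; a local sanity check only.
function lexicalGroundedness(response, contextDocs) {
  const contextWords = new Set(
    contextDocs.join(' ').toLowerCase().match(/[a-z]+/g) || []
  );
  const responseWords = (response.toLowerCase().match(/[a-z]+/g) || [])
    .filter((w) => w.length > 3); // skip stopword-ish short tokens
  if (responseWords.length === 0) return 1;
  const supported = responseWords.filter((w) => contextWords.has(w)).length;
  return supported / responseWords.length;
}

const docs = ['Install kubectl and minikube. Key concepts are pods, services, and deployments.'];
console.log(lexicalGroundedness('Install kubectl and minikube first.', docs)); // 0.75
```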
Run the Demo
FOIL_API_KEY=sk_live_xxx node examples/02_rag_pipeline.js
The demo runs 5 queries against a mock knowledge base. Query 4 ("What is quantum computing architecture?") intentionally has no relevant docs — you'll see low retrieval scores, and the groundedness eval should flag the response.

Step 3: A/B Testing with Cost Analytics
You're debating whether to use gpt-4o-mini or gpt-4o for your e-commerce agent. Mini is cheaper, but is it good enough? Foil lets you run both and compare.
Tag Traces with Variant Info
const variants = [
  { name: 'gpt-4o-mini', model: 'gpt-4o-mini', costPerToken: 0.00015 },
  { name: 'gpt-4o', model: 'gpt-4o', costPerToken: 0.0025 },
];

for (const variant of variants) {
  for (const conversation of conversations) {
    await tracer.trace(
      async (ctx) => {
        const llmSpan = await ctx.startSpan(SpanKind.LLM, variant.model, {
          input: userMessage,
          properties: { variant: variant.name },
        });
        const response = await llm.chat.completions.create({
          model: variant.model,
          messages,
        });
        await llmSpan.end({ output: response.choices[0].message.content, tokens: /* ... */ });

        // Record business metrics as signals
        await ctx.recordSignal('conversion', didConvert, {
          signalType: 'engagement',
          metadata: { variant: variant.name },
        });
        await ctx.recordRating(satisfactionScore);
        return response.choices[0].message.content;
      },
      {
        name: 'product-query',
        input: userMessage,
        properties: { experimentVariant: variant.name },
      },
    );
  }
}
Clone Evaluation Templates
Foil ships with pre-built evaluation templates for common checks. Clone them to your agent with one call:
// Brand safety — catches competitor mentions, off-brand messaging
await foil.cloneEvaluationTemplate(agentId, 'brand_safety');
// Competitor redirect — catches when the agent recommends competitors
await foil.cloneEvaluationTemplate(agentId, 'competitor_redirect');
These evals now run automatically on every trace. No prompts to write — the templates come with battle-tested prompts.

Compare Variants
After running the demo, you'll see output like:
─────────────────────────────────────────────────────────────────
A/B Test Cost Comparison
─────────────────────────────────────────────────────────────────
gpt-4o-mini:
  Total tokens: 1,240
  Est. cost: $0.1860
  Conversions: 3/6

gpt-4o:
  Total tokens: 1,180
  Est. cost: $2.9500
  Conversions: 3/6
Same conversion rate, roughly 16x the cost. In the dashboard, you can drill deeper — compare eval pass rates, user satisfaction, and latency across variants.
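The cost estimates above are simple arithmetic: total tokens times the per-token rate configured in the variant list. A sketch of that calculation (the rates mirror the earlier config; check current provider pricing before relying on them):

```javascript
// Sketch: deriving the per-variant cost estimates shown in the demo output.
function estimateCost(totalTokens, costPerToken) {
  return totalTokens * costPerToken;
}

const variants = [
  { name: 'gpt-4o-mini', tokens: 1240, costPerToken: 0.00015 },
  { name: 'gpt-4o', tokens: 1180, costPerToken: 0.0025 },
];
for (const v of variants) {
  console.log(`${v.name}: $${estimateCost(v.tokens, v.costPerToken).toFixed(4)}`);
}
// gpt-4o-mini: $0.1860
// gpt-4o: $2.9500
```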

Run the Demo
FOIL_API_KEY=sk_live_xxx node examples/03_ecommerce_product_agent.js
Step 4: Catching Data Leakage
This scenario is about security. Your internal knowledge base assistant might accidentally include API keys, internal URLs, or secrets in its responses. Foil can catch this automatically.
User and Device Tracking
Track who's using your agent and from where:
await tracer.trace(
  async (ctx) => {
    // ... your agent logic ...
  },
  {
    name: 'kb-query',
    input: question,
    userId: 'dev-alice',
    userProperties: { name: 'Alice Chen' },
    device: {
      platform: 'web',
      browser: 'Chrome 120',
      os: 'macOS 14.0',
    },
  },
);
In the dashboard, you can now segment traces by user, see which users hit errors most often, and track usage by platform.
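You can reproduce the per-user segmentation locally once you have trace records to aggregate. A sketch, assuming a hypothetical `{ userId, error }` record shape rather than Foil's actual export schema:

```javascript
// Sketch: error rate per user over a list of trace records.
// The { userId, error } shape is illustrative only.
function errorRateByUser(traces) {
  const byUser = {};
  for (const t of traces) {
    const stats = (byUser[t.userId] ??= { total: 0, errors: 0 });
    stats.total += 1;
    if (t.error) stats.errors += 1;
  }
  return Object.fromEntries(
    Object.entries(byUser).map(([user, s]) => [user, s.errors / s.total])
  );
}

const rates = errorRateByUser([
  { userId: 'dev-alice', error: false },
  { userId: 'dev-alice', error: true },
  { userId: 'dev-bob', error: false },
]);
console.log(rates); // { 'dev-alice': 0.5, 'dev-bob': 0 }
```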
CHAIN Spans for Orchestration
When your agent has an orchestration layer (search docs → fetch reference → generate answer), use CHAIN spans to group related operations:
const chainSpan = await ctx.startSpan(SpanKind.CHAIN, 'knowledge-lookup', {
  input: question,
});

// Tool calls and LLM calls happen here as children of the chain
const docs = await ctx.tool('search_docs', async () => searchDocs(query), { input: query });
const answer = await callLLM(ctx, docs, question);

await chainSpan.end({ output: answer });
Clone the Data Leakage Template
await foil.cloneEvaluationTemplate(agentId, 'data_leakage');
This eval looks for API keys (sk_*, pk_*), internal URLs, database connection strings, and other secrets in your agent's responses. It runs on every trace automatically.
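To get a feel for what such a check matches, here is a local sketch with a few illustrative patterns. These regexes are assumptions for demonstration; Foil's template does its own server-side detection, and real scanners cover many more secret formats:

```javascript
// Sketch: local secret scanning with illustrative patterns.
const SECRET_PATTERNS = [
  { name: 'api_key', re: /\b(?:sk|pk)_[A-Za-z0-9_]{8,}/ },
  { name: 'connection_string', re: /\b(?:postgres|mysql|mongodb):\/\/\S+/ },
  { name: 'internal_url', re: /\bhttps?:\/\/[^\s/]*\.internal\b/ },
];

function findLeaks(text) {
  return SECRET_PATTERNS.filter((p) => p.re.test(text)).map((p) => p.name);
}

console.log(findLeaks('Set FOIL_API_KEY=sk_live_abc123def456ghi789 in your env.'));
// [ 'api_key' ]
console.log(findLeaks('Everything looks fine here.')); // []
```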
Run the Demo
FOIL_API_KEY=sk_live_xxx node examples/04_knowledge_base_assistant.js
Query 4 in the demo asks about environment setup. The mock response intentionally includes a fake API key (sk_live_abc123def456ghi789). You'll see the data leakage eval flag it in the dashboard.

Step 5: Multimodal Content and Score Evaluations
Not all agent output is plain text. If your agent generates social media posts with images, product pages with media, or any mixed content, Foil tracks it with content blocks.
Multimodal Output
Use content() and ContentBlock helpers to structure mixed outputs:
const { content, ContentBlock } = require('@getfoil/foil-js');

// Inside your trace...
const output = content(
  generatedText,
  ContentBlock.media(mediaId, {
    category: 'image',
    mimeType: 'image/png',
    filename: 'social-banner.png',
  })
);
await llmSpan.end({ output, tokens: /* ... */ });
In the dashboard, multimodal traces show the text and media blocks together so you can see exactly what the agent produced.

Score Evaluations
Boolean evals (pass/fail) aren't always enough. Score evaluations rate output on a numeric scale:
await foil.createEvaluation(agentId, {
  name: 'content_quality',
  description: 'Rates generated content quality from 1-10',
  prompt: `Rate the quality of this content on a scale of 1-10.
Consider: clarity, engagement, accuracy, tone appropriateness, and structure.
Input (prompt): {input}
Output (content): {output}
Return a score from 1 (poor) to 10 (excellent).`,
  evaluationType: 'score',
  scoreMin: 1,
  scoreMax: 10,
  enabled: true,
});
You can also create category evaluations (e.g., classify tone as "formal," "casual," "aggressive") and combine them with score evals to get a full picture of content quality.
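Client-side, score results are easy to act on, for example flagging content that falls below a quality bar. A sketch, assuming a hypothetical `{ name, score }` result shape:

```javascript
// Sketch: flagging low-scoring content from score-evaluation results.
// The { name, score } result shape is illustrative only; scoreMin/scoreMax
// mirror the evaluation defined above.
function flagLowScores(results, { scoreMin, scoreMax, threshold }) {
  return results
    .filter((r) => r.score >= scoreMin && r.score <= scoreMax) // drop out-of-range noise
    .filter((r) => r.score < threshold)
    .map((r) => r.name);
}

const flagged = flagLowScores(
  [
    { name: 'blog-post', score: 8 },
    { name: 'social-post', score: 4 },
    { name: 'product-desc', score: 7 },
  ],
  { scoreMin: 1, scoreMax: 10, threshold: 6 }
);
console.log(flagged); // [ 'social-post' ]
```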

Run the Demo
FOIL_API_KEY=sk_live_xxx node examples/05_content_generation_agent.js
The demo generates four content types: a blog post, a social media post (multimodal), a product description, and a creative writing piece. Each gets scored by the content_quality eval and checked by the tone_analysis template.
The Full Evaluation Stack
Here's a summary of every evaluation type we used across the five scenarios:
| Evaluation | Type | Scenario | What It Checks |
|---|---|---|---|
| Built-in evals | Boolean | All | Quality, safety (run automatically) |
| groundedness_check | Boolean | RAG Pipeline | Is the response supported by retrieved docs? |
| content_quality | Score (1-10) | Content Generator | Overall quality of generated content |
| query_complexity | Category | Knowledge Base | Classifies queries as simple/moderate/complex/ambiguous |
| brand_safety | Boolean (template) | E-commerce | Competitor mentions, off-brand messaging |
| competitor_redirect | Boolean (template) | E-commerce | Recommending users to competitors |
| data_leakage | Boolean (template) | Knowledge Base | API keys, secrets in responses |
| tone_analysis | Boolean (template) | Content Generator | Inappropriate tone detection |
You can mix and match. Templates give you a head start; custom evals let you check anything specific to your domain.
What We Covered
In five scenarios, we went from zero observability to:
- Distributed tracing with AGENT, LLM, TOOL, CHAIN, EMBEDDING, and RETRIEVER spans
- 8 evaluations (boolean, score, and category types) running automatically on every trace
- User feedback signals linked to traces
- Semantic search for finding conversations by meaning
- A/B testing with per-variant cost comparison
- Data leakage detection catching secrets in responses
- User and device tracking for segmenting by who uses what
- Multimodal content tracking for mixed text + media outputs
- Profile learning that improves as your agent handles more conversations
All of this with an SDK that takes three lines to initialize and doesn't require changes to your LLM calls.
Run All Five Demos
cd foil/sdks/javascript
FOIL_API_KEY=sk_live_xxx node examples/01_customer_support_bot.js
FOIL_API_KEY=sk_live_xxx node examples/02_rag_pipeline.js
FOIL_API_KEY=sk_live_xxx node examples/03_ecommerce_product_agent.js
FOIL_API_KEY=sk_live_xxx node examples/04_knowledge_base_assistant.js
FOIL_API_KEY=sk_live_xxx node examples/05_content_generation_agent.js
No OpenAI key required — all demos include mock LLM responses. Add OPENAI_API_KEY=sk-xxx to use real LLM calls.
Next Steps
- Set up alerts: Get notified when evaluations fail or costs spike
- Connect your real agent: Replace the demo code with your actual agent logic
- Add custom evaluations: Define checks specific to your domain
- Explore the dashboard: Filter by time, agent, user, session, and evaluation results
Foil is an AI monitoring platform that gives you visibility into your agents in production. Sign up free at getfoil.ai.