Training AI Agents the Smart Way: A Deep Dive into DSPy’s Custom Training and Evaluation
Ever wondered how to train an AI agent to be really good at a specific task without needing massive datasets? Today we’re exploring a fascinating approach using DSPy and the MIPROv2 optimizer that flips traditional machine learning on its head.
Instead of feeding our AI millions of examples, we’re going to show you how to create a small, carefully crafted “answer key” and let the system figure out the best way to solve problems on its own.
The Big Picture: What We’re Building
Imagine you’re creating an AI security analyst that needs to examine smart contracts for vulnerabilities. Traditional approaches might require thousands of labeled examples. But what if we could teach it with just a handful of perfect examples and let it optimize itself?
That’s exactly what this DSPy approach does. We’ll walk through how to:
- Create a small set of “gold standard” training examples
- Set up custom evaluation metrics that matter for your specific use case
- Let the MIPROv2 optimizer automatically improve the AI’s prompts
The “Answer Key” Approach: Small But Mighty Training Data
Here’s where things get interesting. Instead of scraping the internet for thousands of examples, we’re taking a completely different approach - we’re becoming the teacher and creating our own perfect answer key.
Think of it like this: if you were teaching a student to analyze security vulnerabilities, you wouldn’t show them 10,000 mediocre examples. You’d carefully select 5-10 perfect examples that demonstrate exactly what good analysis looks like.
That’s what the create_training_examples() function does. It creates a small, curated set of examples where we define:
- The Problem: A smart contract with specific vulnerabilities
- The Perfect Answer: Exactly which tools should be used, which vulnerabilities should be found, and what the risk score should be
For example, with a vulnerable DeFi pool contract, our “answer key” might say:
- “Use all three analysis tools: contract_analysis, constructor_analysis, and deployment_context”
- “Find these specific vulnerabilities: reentrancy, initialization_risk, and high_value_target”
- “The final risk score should be around 8.5”
This gives our optimizer a crystal-clear target to aim for.
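To make this concrete, here’s a minimal sketch of what such a function could look like in DSPy. The field names (contract_code, expected_tools, expected_vulnerabilities, expected_risk_score) are illustrative assumptions rather than the original schema; the only DSPy requirement is that each dspy.Example marks its input fields with with_inputs().

```python
import dspy

def create_training_examples():
    """Hand-craft a tiny 'answer key' of gold-standard analyses."""
    return [
        dspy.Example(
            # The problem: a vulnerable DeFi pool contract (placeholder source)
            contract_code="contract VulnerablePool { /* ... */ }",
            # The perfect answer: expected tools, findings, and risk score
            expected_tools=["contract_analysis", "constructor_analysis", "deployment_context"],
            expected_vulnerabilities=["reentrancy", "initialization_risk", "high_value_target"],
            expected_risk_score=8.5,
        ).with_inputs("contract_code"),  # only the contract is shown to the agent
        # ...a handful more carefully chosen examples
    ]
```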
The Evaluation Engine: Where the Magic Happens
This is where DSPy and MIPROv2 really shine. The optimizer doesn’t just randomly try different approaches - it follows a systematic 5-step process that’s almost like having a really smart tutor constantly improving the AI’s study methods.
Here’s how it works:
Step 1: Generate New Prompts The MIPROv2 optimizer creates new candidate prompts for different parts of our agent (like the ToolSelector and VulnerabilityAssessor modules, sketched right after this five-step walkthrough). Think of it as experimenting with different ways to “ask” the AI to do its job.
Step 2: Test Drive the Agent It runs our agent using these new prompts on one of our training examples - like giving it that vulnerable DeFi pool contract to analyze.
Step 3: Capture the Results The agent produces its actual analysis - this becomes our “prediction” that we need to evaluate.
Step 4: The Report Card Here’s where our custom a1_security_metric function comes in. It compares what the agent actually found against our “gold standard” answer.
Step 5: The Scoring System That comparison is turned into a single weighted score - the feedback signal MIPROv2 uses to decide whether the new prompts are an improvement.
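Before digging into the scoring, here’s a rough sketch of what an agent built from ToolSelector and VulnerabilityAssessor modules might look like in DSPy. Only those two names come from the walkthrough above; the signature fields, the run_tools helper, and the rest are hypothetical placeholders, not the original implementation.

```python
import dspy

class ToolSelector(dspy.Signature):
    """Choose which analysis tools to run on a smart contract."""
    contract_code = dspy.InputField()
    selected_tools = dspy.OutputField(desc="comma-separated tool names")

class VulnerabilityAssessor(dspy.Signature):
    """Assess vulnerabilities and assign an overall 0-10 risk score."""
    contract_code = dspy.InputField()
    tool_findings = dspy.InputField()
    vulnerabilities = dspy.OutputField(desc="comma-separated vulnerability labels")
    risk_score = dspy.OutputField(desc="overall severity, 0-10")

def run_tools(tool_names, contract_code):
    # Hypothetical stand-in: the real system would execute the chosen
    # analysis tools and return their findings as text.
    return f"(findings from {tool_names})"

class SecurityAgent(dspy.Module):
    def __init__(self):
        super().__init__()
        self.select_tools = dspy.ChainOfThought(ToolSelector)
        self.assess = dspy.ChainOfThought(VulnerabilityAssessor)

    def forward(self, contract_code):
        tools = self.select_tools(contract_code=contract_code).selected_tools
        findings = run_tools(tools, contract_code)
        assessment = self.assess(contract_code=contract_code, tool_findings=findings)
        # Bundle everything the metric will need into one prediction.
        return dspy.Prediction(
            tools_used=tools,
            vulnerabilities=assessment.vulnerabilities,
            risk_score=assessment.risk_score,
        )
```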
The Scoring System: Why This Approach is Brilliant
The a1_security_metric uses a weighted average that tackles three critical aspects of AI agent performance. Here’s the breakdown:
50% Detection Accuracy - “Did you find what matters?” This is the biggest chunk of the score. Did the agent actually identify the vulnerabilities we expected? If our “answer key” says there should be reentrancy and initialization risks, did the agent catch them? This ensures the AI doesn’t miss critical security issues.
30% Tool Diversity - “Did you use your tools properly?” Here’s the clever part that fights “tool bias.” Many AI agents get lazy and just use one favorite tool repeatedly. This metric specifically rewards using the expected combination of tools. If the perfect answer requires all three analysis tools, the agent gets points for actually using all three, not just picking one.
20% Risk Score Accuracy - “Is your final judgment reasonable?” The agent needs to not just find problems, but also assess their severity correctly. If the expected risk score is 8.5, how close did the agent get?
The beauty is in the weighting - finding the right vulnerabilities matters most, but the system also ensures the agent develops good analytical habits.
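In code, a metric with this 50/30/20 weighting could look something like the sketch below. It follows DSPy’s usual metric signature (example, prediction, trace=None) and returns a score between 0 and 1; the field names and the comma-separated parsing are assumptions carried over from the earlier sketches, not the original a1_security_metric.

```python
def _as_set(value):
    # Accept either a list of labels or a comma-separated string.
    if isinstance(value, str):
        return {v.strip() for v in value.split(",") if v.strip()}
    return set(value or [])

def a1_security_metric(example, prediction, trace=None):
    """Weighted score: 50% detection, 30% tool diversity, 20% risk accuracy."""
    expected_vulns = _as_set(example.expected_vulnerabilities)
    expected_tools = _as_set(example.expected_tools)
    found_vulns = _as_set(getattr(prediction, "vulnerabilities", None))
    used_tools = _as_set(getattr(prediction, "tools_used", None))

    # 50%: did the agent find the vulnerabilities in the answer key?
    detection = len(found_vulns & expected_vulns) / max(len(expected_vulns), 1)

    # 30%: did it use the expected combination of tools, not just one favorite?
    diversity = len(used_tools & expected_tools) / max(len(expected_tools), 1)

    # 20%: how close is its risk score to the expected one (0-10 scale)?
    try:
        error = abs(float(prediction.risk_score) - float(example.expected_risk_score))
    except (AttributeError, TypeError, ValueError):
        error = 10.0  # unparsable or missing score earns no credit
    risk_accuracy = max(0.0, 1.0 - error / 10.0)

    return 0.5 * detection + 0.3 * diversity + 0.2 * risk_accuracy
```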
The MIPROv2 Optimization Loop: Getting Smarter Over Time
MIPROv2 is like having a really persistent coach that never gives up on making the agent better. Here’s how it uses those scores:
The Iterative Improvement Cycle:
- Try New Approaches: MIPROv2 generates variations of the prompts - maybe it tries being more specific about tool usage, or changes how it asks the agent to assess risk levels.
- Measure Everything: Each new prompt combination gets tested across all our training examples and scored using our 50/30/20 metric.
- Keep What Works: The optimizer tracks which prompt changes lead to higher scores and which ones hurt performance.
- Evolve and Repeat: It gradually converges on the best-performing prompt templates by building on successful changes and discarding poor ones.
The Smart Part: MIPROv2 doesn’t just randomly try things - it learns patterns. If being more explicit about tool requirements consistently improves the Tool Diversity score, it incorporates that insight into future prompt generations.
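Wiring everything together is a short compile call, sketched below using the names from the earlier snippets. The exact MIPROv2 constructor and compile arguments differ between DSPy versions, so treat the parameters here as indicative rather than definitive.

```python
import dspy

# Configure the base model the agent (and the optimizer's proposals) will run on.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

trainset = create_training_examples()
agent = SecurityAgent()

# MIPROv2 proposes candidate instructions and few-shot demos, scores each
# candidate program with our metric, and keeps whatever improves the score.
optimizer = dspy.MIPROv2(metric=a1_security_metric, auto="medium")

optimized_agent = optimizer.compile(
    agent,
    trainset=trainset,
    max_bootstrapped_demos=2,  # demos bootstrapped from the agent's own good runs
    max_labeled_demos=2,       # demos taken directly from the gold examples
)

optimized_agent.save("optimized_security_agent.json")  # reuse the tuned prompts later
```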
MIPROv2 in Action: A Real Optimization Story
Starting Point - Iteration 1: Let’s say MIPROv2 begins with a basic prompt for the ToolSelector: “Analyze this smart contract for security issues.”
When tested on our DeFi pool example, the agent only uses contract_analysis and gets a weighted score of 4.6/10:
- Detection Accuracy: 6/10 (missed initialization_risk)
- Tool Diversity: 2/10 (only used 1 of 3 expected tools)
- Risk Score: 5/10 (scored 6.0 instead of the expected 8.5)
Learning and Adapting - Iteration 15: MIPROv2 notices the tool diversity problem and tries: “Analyze this smart contract using contract analysis, constructor analysis, and deployment context tools to ensure comprehensive coverage.”
New score: 7.4/10:
- Detection Accuracy: 8/10 (found all vulnerabilities!)
- Tool Diversity: 8/10 (used all three tools)
- Risk Score: 5/10 (still underestimating severity)
Converging on Excellence - Iteration 45: After more refinement: “Systematically analyze this contract using all available tools. Pay special attention to high-risk patterns and provide severity scores that reflect real-world impact.”
Final score: 9.2/10 - much better across all metrics!
DSPy + MIPROv2 vs. Traditional Training: The Showdown
Traditional Machine Learning Approach:
- Data Requirements: Needs thousands of labeled examples
- Time Investment: Months of data collection and cleaning
- Expertise Needed: Data scientists, annotation teams, infrastructure specialists
- Flexibility: Hard to adjust - need new data for new requirements
- Black Box Problem: Hard to understand why the model makes certain decisions
DSPy + MIPROv2 Approach:
- Data Requirements: Just 5-10 carefully crafted “gold standard” examples
- Time Investment: Days to weeks (mostly spent defining good examples)
- Expertise Needed: Domain experts who understand the problem well
- Flexibility: Change your examples or metrics, re-optimize quickly
- Transparency: You can see and modify the prompts the system uses
The Real Kicker: With traditional ML, if you want your security agent to also analyze a new type of vulnerability, you need hundreds more labeled examples. With DSPy, you just add one perfect example to your “answer key” and let MIPROv2 figure out how to handle it.
Choosing Your Base Model: The Foundation Matters
Start with these key considerations:
1. Task Complexity
- For complex reasoning (like security analysis): GPT-4, Claude 3.5 Sonnet, or GPT-4o
- For simpler classification tasks: GPT-3.5-turbo or Llama 3 might suffice
- For specialized domains: Consider if the model has relevant training data
2. Cost vs. Performance Trade-off
- Premium models (GPT-4): Higher accuracy, better reasoning, but more expensive per call
- Mid-tier models (GPT-3.5, Claude 3 Haiku): Good balance for many tasks
- Open-source models (Llama 3): Lower ongoing costs, but need your own infrastructure
3. Speed Requirements
- Real-time applications: Favor faster models like GPT-3.5-turbo or Claude 3 Haiku
- Batch processing: Can use slower but more capable models
The DSPy Advantage: You can actually test multiple base models with the same training examples and see which performs best with your specific use case!
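One rough way to run that comparison, assuming the SecurityAgent, create_training_examples, and a1_security_metric sketches above: evaluate the same agent under each candidate model with dspy.Evaluate. The model identifiers are just examples; substitute whatever providers you actually use, and note that evaluator arguments can vary by DSPy version.

```python
import dspy

candidates = {
    "gpt-4o": dspy.LM("openai/gpt-4o"),
    "gpt-4o-mini": dspy.LM("openai/gpt-4o-mini"),
    # "llama-3": dspy.LM("ollama_chat/llama3"),  # e.g. a self-hosted open model
}

devset = create_training_examples()  # ideally a held-out set of gold examples
evaluate = dspy.Evaluate(devset=devset, metric=a1_security_metric, display_progress=True)

for name, lm in candidates.items():
    with dspy.context(lm=lm):      # swap the base model, keep everything else fixed
        score = evaluate(SecurityAgent())
    print(f"{name}: {score}")
```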
Wrapping Up: The DSPy Advantage
This DSPy + MIPROv2 approach represents a fundamental shift in how we train AI agents:
Key Takeaways:
- Small, curated “answer key” beats massive datasets
- Custom evaluation metrics ensure the AI learns what actually matters
- MIPROv2 automatically finds the best prompts through systematic optimization
- Much faster and more flexible than traditional ML approaches
The Bottom Line: Instead of collecting thousands of examples and hoping for the best, you define exactly what “perfect” looks like and let the system figure out how to get there.
This approach is particularly powerful for specialized domains like security analysis, where expert knowledge matters more than raw data volume.
Thanks for following along with this deep dive! The combination of small, high-quality training data with intelligent optimization is opening up new possibilities for building highly effective AI agents.