The trajectory of artificial intelligence has fundamentally shifted.
We are moving from systems that merely generate probabilistic text to autonomous entities capable of executing complex, multi-step operations within highly deterministic external environments. This paradigm shift is entirely predicated on the evolution of tool calling architectures. The mechanism by which a large language model interacts with external systems — databases, APIs, computational engines, and file systems — defines its operational ceiling and utility.
1. The Genesis of Deterministic Action
Legacy Function Calling (Introduced mid-2023 by OpenAI)
Core Mechanics & Shift
Function calling turned linguistic intent into programmatic action. Instead of replying with free-form text, the model recognizes when a request needs external data, halts its generative process, and outputs a highly structured JSON object matching a predefined schema.
Crucial distinction: The language model does not execute the function itself; it acts solely as a semantic routing engine.
API Evolution: Functions to Tools
Early implementations explicitly passed a functions array (requiring name, description, parameters).
This was deprecated in favor of a more flexible tools array. The design shift allowed grouping multiple functions under namespaces and, critically, enabled parallel tool calls for concurrent data retrieval.
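To make the routing mechanic concrete, here is a minimal sketch of the host side of the loop. The tool schema follows the common `tools`-array shape, but the dispatcher, the `get_stock_price` stub, and the simulated model output are illustrative assumptions, not any vendor's SDK:

```python
import json

# Hypothetical host-side dispatch: the model emits tool_call objects and the
# application (not the model) executes them. All names here are illustrative.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_stock_price",
        "description": "Fetch the latest price for a ticker.",
        "parameters": {
            "type": "object",
            "properties": {"ticker": {"type": "string"}},
            "required": ["ticker"],
        },
    },
}]

def get_stock_price(ticker: str) -> dict:
    return {"ticker": ticker, "price": 150.25}  # stubbed data source

REGISTRY = {"get_stock_price": get_stock_price}

def dispatch(tool_calls: list) -> list:
    """Execute every call the model requested (parallel calls arrive as a list)."""
    results = []
    for call in tool_calls:
        fn = REGISTRY[call["name"]]
        args = json.loads(call["arguments"])  # the model emits arguments as a JSON string
        results.append({"name": call["name"], "content": fn(**args)})
    return results

# Simulated model output requesting two parallel tool calls
model_output = [
    {"name": "get_stock_price", "arguments": '{"ticker": "AAPL"}'},
    {"name": "get_stock_price", "arguments": '{"ticker": "MSFT"}'},
]
print(dispatch(model_output))
```

The results are appended to the conversation and the model is called again, which is exactly the round-trip cost discussed below.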
Architectural Limitations: The Schema Bloat Crisis
Every available tool must have its full JSON schema injected into the system prompt for every interaction. In finance, expanding to dozens of APIs saturates the context window, degrading reasoning quality and inflating inference costs.
The architecture necessitates a complete network round-trip for every individual invocation. Querying a DB, parsing it, then calling a pricing API for each ticker requires multiple sequential cycles, making it unsuitable for high-frequency workflows.
2. The Interoperability Standard
Model Context Protocol (MCP) — Late 2024
The Three-Tier Architecture
1. MCP Host
The primary application housing the LLM (e.g., Claude Desktop, Cursor IDE, bespoke terminals).
2. MCP Client
Protocol translation layer directly within the host managing connections, routing, and lifecycle.
3. MCP Server
Lightweight programs exposing APIs via STDIO (highly secure, local) or HTTP+SSE (remote/cloud).
Tools
Executable functions that alter state or perform external computations (e.g., executing a database query, submitting a trade order).
Resources
Read-only data sources that provide contextual info directly to the context window (e.g., PDF contents, live API responses).
Prompts
Reusable conversational templates that structure interactions and provide few-shot examples to optimize model behavior.
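The three primitives are easiest to tell apart in code. The sketch below is plain Python mimicking the shape of an MCP server's registration API; it is not the official `mcp` SDK, and every name in it is an illustrative assumption (the real protocol speaks JSON-RPC over STDIO or HTTP+SSE):

```python
# Schematic MCP-style server illustrating the three primitive types.
class DemoServer:
    def __init__(self):
        self.tools, self.resources, self.prompts = {}, {}, {}

    def tool(self, fn):            # executable: may alter external state
        self.tools[fn.__name__] = fn
        return fn

    def resource(self, uri):       # read-only: feeds the context window
        def register(fn):
            self.resources[uri] = fn
            return fn
        return register

    def prompt(self, fn):          # reusable template structuring interaction
        self.prompts[fn.__name__] = fn
        return fn

server = DemoServer()

@server.tool
def run_query(sql: str) -> list:
    return [("AAPL", 150.25)]  # stub: a real server would hit a database

@server.resource("file:///reports/q3.pdf")
def q3_report() -> str:
    return "Q3 revenue grew 8% YoY..."  # stub: would read the PDF contents

@server.prompt
def earnings_review(ticker: str) -> str:
    return f"Review the latest earnings for {ticker}. Cite exact figures."
```

The split matters for governance: tools mutate state and need permissioning, resources only populate context, and prompts shape behavior without touching either.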
3. The Paradigm Shift
Programmatic Tool Calling ('Code Mode' with Claude 4.6 / Sonnet 4.6)
The Turing-Complete Leap
Instead of predicting a JSON object for a single action, the model writes code in a Turing-complete language (TypeScript or Python) to orchestrate entire multi-tool workflows natively. This shifts the execution burden to an isolated, secure computational sandbox (e.g., V8 JavaScript isolates or Daytona containers).
Progressive Tool Discovery
Rather than loading all tool schemas upfront, the agent uses a Tool Search Tool. It queries to discover relevant libraries/MCP servers, imports them into its generated script, and executes locally.
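A toy version of that discovery step clarifies the idea. The registry contents and the keyword-overlap heuristic below are illustrative stand-ins for a real semantic search over tool descriptions:

```python
# Sketch of progressive tool discovery: instead of injecting every schema
# upfront, the agent queries a registry and imports only what it needs.
TOOL_REGISTRY = {
    "get_intraday_data": "Fetch intraday OHLCV bars for a ticker (finance).",
    "get_company_news": "Fetch recent news headlines for a company (finance).",
    "send_email": "Send an email via SMTP (communication).",
    "create_calendar_event": "Create a calendar event (scheduling).",
}

def tool_search(query: str, top_k: int = 2) -> list:
    """Naive keyword overlap standing in for semantic search."""
    scored = [
        (sum(word in desc.lower() for word in query.lower().split()), name)
        for name, desc in TOOL_REGISTRY.items()
    ]
    return [name for score, name in sorted(scored, reverse=True)[:top_k] if score > 0]

# The agent discovers only the relevant finance tool, then imports it
# into the script it generates for the sandbox.
print(tool_search("intraday ticker data"))
```

Only the matching schemas are then loaded, keeping the context window lean no matter how large the registry grows.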
Real-World Example: Filtering 10,000 Tick Data Records
Suppose the user asks: “Find the 3 minutes with the highest trading volume for AAPL today.”
Legacy JSON Flow
Context Window Bloat
1. Model calls get_intraday_data(ticker="AAPL").
2. The API returns 10,000 rows of JSON.
3. All 10,000 rows are injected directly into the LLM's context window.

```json
[
  {"time": "09:30", "vol": 14500, "price": 150.2},
  {"time": "09:31", "vol": 12200, "price": 150.4},
  ... 9,998 more rows ...
]
```

Programmatic Flow
Sandbox Execution
1. Model writes a Python script and sends it to the runtime sandbox.
2. The sandbox fetches the data and processes it locally using pandas.
3. Only the final print() output is sent back to the LLM.

```python
import mcp_finance as fin
import pandas as pd

# Executed in secure sandbox
data = fin.get_intraday_data("AAPL")
df = pd.DataFrame(data)
top_3 = df.nlargest(3, 'vol')
print(top_3.to_json())
```

| Metric | JSON Tool Calling | Programmatic Tool Calling |
|---|---|---|
| Execution Medium | Parsed JSON objects mapped to host functions | Turing-complete scripts (Python/TypeScript) |
| Context Window | High; all intermediate data passes through LLM | Minimal; data processing occurs within sandbox |
| Token Reduction | Baseline (1x) | 85% to 98% reduction in token consumption |
| System Latency | High; requires network round-trips per step | Low; loops/conditionals execute at runtime speed |
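A back-of-envelope calculation shows where headline reduction figures of this kind come from. The ~20 tokens per JSON row and the script/summary overhead are assumptions for illustration; this extreme 10,000-row case lands at or above the top of the range quoted in the table:

```python
# Back-of-envelope context cost: JSON flow vs. programmatic flow.
# ~20 tokens per JSON row and ~150 tokens of script/summary overhead
# are assumed figures for illustration only.
rows = 10_000
tokens_per_row = 20

json_flow_tokens = rows * tokens_per_row          # every row enters the context
programmatic_tokens = 3 * tokens_per_row + 150    # 3 summary rows + overhead

reduction = 1 - programmatic_tokens / json_flow_tokens
print(f"JSON flow: {json_flow_tokens:,} tokens")
print(f"Programmatic: {programmatic_tokens:,} tokens")
print(f"Reduction: {reduction:.1%}")
```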
4. Modern Execution Environments
Agent Harnesses and Agentic Skills
Frameworks vs. Runtimes vs. Harnesses
Frameworks (such as LangChain) provide basic orchestration loops, and runtimes provide computational sandboxes. The Agent Harness is the holistic wrapper above both: it manages system instructions, dynamic tools, conversational state, and persistence over multiple autonomous turns through Agentic Skills.
- Metadata (skill.yaml): YAML frontmatter (name, description, compatibility). Loaded on initialization; consumes ~100 tokens. Lets the agent know the capability exists.
- Instructions (instructions.md): Detailed markdown instructions, constraints, and logic, pulled in via a bash call only when a request matches the description.
- Scripts (scripts/ or references/): Execution scripts (Python, Bash). Never loaded into context; executed directly by the sandbox runtime.
Sample Implementations
Cursor Agent Harness (IDE Focus)
Cursor orchestrates instructions while tuning for specific foundational models (e.g., knowing one model prefers grep while another needs linter nudges).
Lifecycle Hooks: Defined in .cursor/hooks.json. They intercept checkpoints like beforeShellExecution or preToolUse. Hook scripts run as background commands and return JSON; exit code 0 signals success, while exit code 2 actively blocks the proposed action.
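A minimal hook matching that exit-code contract might look like the following. The JSON field names (`command`, `decision`, `reason`) and blocked patterns are illustrative assumptions, since the actual payload schema is Cursor-specific:

```python
import json

# Minimal beforeShellExecution-style hook. In a real hook the payload
# arrives as JSON on stdin and the return value becomes the process exit
# code, e.g. sys.exit(review(json.load(sys.stdin))). Field names are
# illustrative, not the documented Cursor schema.
BLOCKED_PATTERNS = ("rm -rf /", "DROP TABLE", "git push --force")

def review(payload: dict) -> int:
    """Return the exit code the hook process would terminate with."""
    command = payload.get("command", "")
    for pattern in BLOCKED_PATTERNS:
        if pattern in command:
            print(json.dumps({"decision": "block", "reason": f"matched {pattern!r}"}))
            return 2   # exit code 2 actively blocks the proposed action
    print(json.dumps({"decision": "allow"}))
    return 0

print(review({"command": "psql -c 'DROP TABLE users;'"}))
```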
The OpenClaw Architecture (Persistent Daemon)
OpenClaw rejects ephemeral chat sessions, establishing a long-lived Node.js gateway locally. It operates a heartbeat scheduler (waking every 30 mins to review HEARTBEAT.md) to run background tasks.
- Channel Adapters: WhatsApp (Baileys), Telegram (grammY), CLI.
- Session Manager & Queue System: Serializes multi-step tool invocations.
- Agent Runtime: Dynamically assembles context from AGENTS.md and SOUL.md.
- Control Plane: WebSocket API on port :18789 for global state.
Claude Code CLI (Terminal Focus)
Anthropic's official CLI tool brings the agentic loop directly into the developer's native environment. Rather than acting as a simple autocomplete, it operates as an autonomous sub-agent capable of exploring file systems, running tests, and executing multi-step refactoring.
The Agentic REPL Loop
Claude Code doesn't just guess an answer; it iteratively searches, reads, modifies, and verifies code. If a test fails, it reads the error and tries again without human prompting.
- Native Tools: Built-in primitives like Bash, Glob, Grep, FileRead, and FileEdit.
- Context Injection: Automatically understands Git state, uncommitted changes, and project structure.
Security & Customization
Operates with a strict permission model and highly configurable project-level instructions.
- allowed-tools: Defined in SKILL.md to limit privileges. A “code-reviewer” skill can explicitly allow only Read/Grep, denying file modification tools.
- Slash Commands: Explicit invocation using commands like /bugfix or /test, which trigger specialized prompts.
- Human Approval: Destructive bash commands are trapped and require the user to press 'Enter' before execution.
How Harnesses Orchestrate Everything Together
A harness is not a replacement for MCP, function calling, or programmatic execution — it is the conductor that decides which tool to reach for and when. Each layer has a distinct role, and the harness composes them into a single coherent workflow.
Full-Stack Execution Flow — Single User Request
Harness / Skill
User prompt arrives. Harness loads YAML metadata (~100 tokens), runs semantic trigger matching, and JIT-injects the matching skill's instructions.md. Sets the active tool allowlist for this request only.
Output: enriched system prompt + tool allowlist
MCP Server
LLM calls a tool exposed by an MCP server (e.g., get_ticker_history). The server fetches raw data from the exchange API and returns a structured JSON payload — but not directly into the LLM context.
Output: raw JSON payload → sandbox
Programmatic Sandbox
The LLM writes a Python script that ingests the MCP payload, runs pandas/numpy computations, and print()s only the summary. The sandbox executes it in isolation. 10,000 rows become 3 lines of output.
Output: summary result → LLM context
Function Call
For simple, low-volume lookups (e.g., get_company_name(ticker)), the harness permits a direct JSON function call. No sandbox needed — the result is small enough to pass through the context safely.
Output: small JSON → LLM context directly
CLI / Hook
Before the final response is committed, a post_tool_use lifecycle hook fires a CLI command (e.g., pytest, ruff). If it exits non-zero, the harness blocks and re-prompts the LLM with the error output.
Output: pass → response / fail → re-prompt
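The five stages above boil down to one routing decision the harness makes per tool request. This sketch compresses that decision into a single function; the 1,000-row threshold and tool names are illustrative assumptions:

```python
# Sketch of a harness routing decision: the allowlist is the hard gate,
# large payloads are forced through the sandbox, and only small results
# flow directly into the LLM context. Threshold and names are illustrative.
ROW_THRESHOLD = 1_000

def route_tool_request(tool_name: str, estimated_rows: int, allowlist: set) -> str:
    if tool_name not in allowlist:
        return "denied"                   # never exposed to the model
    if estimated_rows > ROW_THRESHOLD:
        return "sandbox_exec"             # processed out-of-context, summary only
    return "direct_function_call"         # small JSON passes through safely

allow = {"get_ticker_history", "get_company_name", "sandbox_exec"}
print(route_tool_request("get_ticker_history", 10_000, allow))
print(route_tool_request("get_company_name", 1, allow))
print(route_tool_request("submit_trade", 1, allow))
```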
Harness + MCP
Data access layer
The harness's skill YAML declares which MCP servers are required (mcp_servers: [quant-db, alpha-vantage]). When the skill activates, the harness spins up only those servers — not all registered ones. This means a “risk-analysis” skill never accidentally exposes the trade-execution MCP server to the LLM.
Harness + Function Calling
Selective schema injection
Instead of injecting all 50 tool schemas globally, the harness injects only the schemas relevant to the active skill. A “portfolio-rebalancer” skill exposes get_weights and set_allocation. A “news-analyst” skill exposes search_filings. Context stays lean regardless of how many tools exist in the registry.
Harness + Programmatic Execution
Sandbox delegation
The skill's instructions.md explicitly tells the LLM: “For any dataset > 1,000 rows, write a Python script and execute it via the sandbox tool. Never paste raw data into your response.” The harness enforces this by only granting the sandbox_exec tool — not a raw data-dump tool — in the active allowlist.
Harness + CLI Hooks
Lifecycle enforcement
Hooks are the harness's immune system. A pre_tool_use hook intercepts every bash call and blocks commands containing DROP TABLE or live API keys. A post_tool_use hook auto-runs ruff on any generated Python. The LLM never bypasses these — they execute at the harness layer, below the model's awareness.
5. Demo: Anatomy of an Agentic Skill
Building a highly-scoped 'Database Architect' Skill
skill.yaml (The Metadata)
Loaded into the context window at startup. Costs ~40 tokens.
The Harness reads this file to understand when to activate the skill. The triggers array acts as semantic routing hooks.
```yaml
name: database-architect
description: Expert in PostgreSQL schema design, migrations, and query optimization.
version: 1.0.0
triggers:
  - "create a table"
  - "write a migration"
  - "optimize this query"
  - "database schema"
allowed_tools:
  - bash
  - file_read
  - file_write
  - psql_eval
env_vars_required:
  - DATABASE_URL
```

instructions.md (The Payload)
Dynamically injected ONLY when a trigger is matched.
This contains the heavy, specialized prompt engineering. By keeping this out of the global prompt, we save thousands of idle tokens and prevent the model from getting confused by irrelevant instructions.
```markdown
# Role: Senior Database Architect
You are responsible for modifying the PostgreSQL database.

## Strict Constraints:
1. NEVER use destructive commands (DROP TABLE, DELETE) without explicit user confirmation.
2. All new tables MUST include id (UUID), created_at, and updated_at columns.
3. Indexes MUST be created concurrently (CREATE INDEX CONCURRENTLY).
4. Always write migrations in the /supabase/migrations directory
   using the format YYYYMMDDHHMMSS_name.sql.

## Execution Protocol:
When asked to create a schema:
1. Use file_read to check existing schemas in /migrations.
2. Write the new SQL file using file_write.
3. Use the psql_eval tool to run a dry-run syntax check against the local DB.
```

harness.ts (The Orchestrator)
The core loop that manages the skill injection.
This is a simplified view of how an Agent Harness (like OpenClaw or an IDE) processes user input, detects the need for a skill, and alters the context dynamically before calling the LLM.
```typescript
import { readFile } from 'fs/promises';
import { loadSkills, semanticMatch } from './skill-manager';
import { callLLM } from './llm-provider';

async function processUserRequest(userPrompt: string, messageHistory: any[]) {
  // 1. Load lightweight YAML metadata for all installed skills
  const availableSkills = await loadSkills();
  let activeSystemPrompt = "You are a helpful coding assistant.";
  let activeTools = ["bash", "file_read"]; // Default tools

  // 2. Check if the user prompt matches any skill triggers
  for (const skill of availableSkills) {
    if (semanticMatch(userPrompt, skill.triggers)) {
      console.log(`[Harness] Activating Skill: ${skill.name}`);

      // 3. Dynamically read the heavy markdown instructions
      const skillInstructions = await readFile(`skills/${skill.name}/instructions.md`, 'utf-8');

      // 4. Augment the context window
      activeSystemPrompt += `\n\n---\n${skillInstructions}`;
      activeTools = [...new Set([...activeTools, ...skill.allowed_tools])];
    }
  }

  // 5. Execute the LLM with the highly specific, JIT-loaded context
  return await callLLM({
    system: activeSystemPrompt,
    messages: [...messageHistory, { role: "user", content: userPrompt }],
    tools: activeTools
  });
}
```

If the user asks “How do I center a div?”, the Harness skips the Database Architect skill entirely. The model responds instantly using a tiny default context. If the user asks “Add a users table”, the Harness injects the strict SQL constraints and grants access to the psql_eval tool. This guarantees high precision without context bloat.
How Skills Chain: Using Prompts to Decide the Next Skill
A single skill rarely completes a complex task alone. The real power emerges when a skill's instructions.md explicitly tells the LLM which skill to invoke next based on what it finds — turning a flat list of skills into a dynamic decision tree.
The Mechanism
Each skill's instructions.md ends with a routing block — a conditional section that tells the LLM: “If you find X, your next action is to invoke skill Y by emitting this exact phrase.” The harness watches the LLM's output stream for these trigger phrases and activates the next skill automatically.
This is not magic — it is structured prompt engineering. The LLM is instructed to be explicit about its intent, and the harness is wired to act on that intent.
Why Not Just One Big Skill?
Combining all logic into one skill bloats the context window and degrades precision. A “data-fetcher” skill has no business knowing SQL migration rules. Keeping skills small and single-purpose means each one is injected only when needed — and evicted the moment it is done.
Chaining also enables conditional branching: the path taken depends on what the previous skill actually found, not what was assumed upfront.
Worked Example: “Analyse AAPL earnings and flag any risk”
Three skills chain automatically from a single user prompt
Skill 1: data-fetcher
Calls the MCP server to pull the last 4 quarters of AAPL earnings transcripts. Extracts revenue, EPS, and guidance figures into a structured JSON summary. Its routing block:

```markdown
## After completing data extraction:
- If revenue growth YoY < 5%: emit "INVOKE: risk-assessor — slow growth detected"
- If guidance was revised downward: emit "INVOKE: risk-assessor — guidance cut detected"
- If all metrics are within normal range: emit "INVOKE: report-writer — data clean"
- Always pass the extracted JSON as context to the next skill.
```
Skill 2: risk-assessor
Injected only because the previous skill flagged a condition. Runs a programmatic sandbox script to compute VaR and the max-drawdown delta, and compares the guidance-cut magnitude against historical precedents. Its routing block:

```markdown
## After completing risk scoring:
- If risk_score > 7: emit "INVOKE: report-writer — HIGH RISK — include risk section"
- If risk_score 4-7: emit "INVOKE: report-writer — MODERATE RISK — summarise flags"
- If risk_score < 4: emit "INVOKE: report-writer — LOW RISK — brief mention only"
- Attach risk_score and flag_list to context for report-writer.
```
Skill 3: report-writer
Receives the earnings JSON and risk context from the shared state. Formats a structured investment memo with the appropriate risk-section depth based on the flag passed by the risk-assessor. The final output is the only thing the user ever sees.
Rules for Reliable Skill Chaining
1. Explicit emit phrases
Use a fixed, unambiguous prefix like INVOKE: that the harness regex-matches. Never rely on the LLM to “naturally” say the right thing — constrain it.
2. Pass state explicitly
Each skill must be instructed to attach its output to a shared context object. The next skill reads from that object — it never re-fetches data the previous skill already retrieved.
3. Define a terminal condition
Every chain must have a skill that emits no further INVOKE: — the report-writer above. Without a terminal node, the harness can loop indefinitely.
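The three rules above can be sketched as a harness-side chaining loop: a regex matches the fixed INVOKE: prefix, state is threaded through a shared dict, and a hop cap plus a terminal node prevent infinite loops. The stub skills mirror the worked example; their internals are illustrative:

```python
import re

# Harness-side chaining loop. The emit format mirrors the routing blocks
# above; the three stub skills and their outputs are illustrative.
INVOKE_RE = re.compile(r"INVOKE:\s*([\w-]+)")

def data_fetcher(state):
    state["revenue_growth"] = 0.03                  # rule 2: pass state explicitly
    return 'INVOKE: risk-assessor — slow growth detected'

def risk_assessor(state):
    state["risk_score"] = 8
    return 'INVOKE: report-writer — HIGH RISK — include risk section'

def report_writer(state):
    state["memo"] = f"Risk score {state['risk_score']}: include full risk section."
    return "Memo complete."                          # rule 3: no INVOKE: -> terminal

SKILLS = {"data-fetcher": data_fetcher,
          "risk-assessor": risk_assessor,
          "report-writer": report_writer}

def run_chain(first_skill: str, max_hops: int = 10) -> dict:
    state, current = {}, first_skill
    for _ in range(max_hops):                        # hop cap guards against loops
        output = SKILLS[current](state)
        match = INVOKE_RE.search(output)             # rule 1: fixed, regex-matched prefix
        if not match:
            return state                             # terminal condition reached
        current = match.group(1)
    raise RuntimeError("chain exceeded max hops")

print(run_chain("data-fetcher")["memo"])
```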
6. Applications in Finance
Quantitative Research and Algorithmic Execution
The Quantitative Data Layer (MCP Servers)
| Provider | Protocol | Key Tools Exposed | Primary Use Case |
|---|---|---|---|
| Alpha Vantage | HTTP/STDIO | TIME_SERIES, REALTIME_OPTIONS, EARNINGS_TRANSCRIPT | Real-time pricing, greeks, sentiment via TOOL_CALL wrapper |
| Financial Datasets | HTTP | get_income_statements, get_company_news | Fundamental analysis & corporate filings |
| EODHD | HTTP | get_us_tick_data, get_mp_illio_market_insights | Macro forecasting, tick data, beta band distributions |
| Octagon AI | HTTP | SEC filings, private market insights | Institutional equity research gathering |
| QuantConnect | HTTP | Historical Data access, Backtesting Engine | Strategy transition from research to live brokerage |
Agentic Quantitative Skills
Traditional LLMs fail at quantitative finance because text prediction cannot reliably perform complex mathematics. Agentic Quantitative Skills solve this by forcing the LLM to write, execute, and iterate on Python code natively within a secure sandbox (Programmatic Tool Calling), offloading the math to libraries like pandas, numpy, and scipy.
The Execution Pipeline
- Data Ingestion: The agent writes code to fetch 10 years of OHLCV data via an MCP Resource (e.g., EODHD).
- Feature Engineering: It dynamically calculates rolling Z-scores, MACD, and Bollinger Bands using pandas_ta.
- Statistical Validation: Runs Augmented Dickey-Fuller (ADF) tests on the spread to validate cointegration in pairs trading strategies.
- Metric Generation: Calculates Maximum Drawdown, Value at Risk (VaR), and Sharpe ratio.
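The metric-generation step above can be sketched as the kind of script the agent would write for the sandbox. The formulas are the standard definitions; the input returns are simulated rather than fetched from a real MCP resource:

```python
import numpy as np

# Metric generation on synthetic daily returns (standard definitions;
# inputs are simulated, not fetched from a live data source).
rng = np.random.default_rng(42)
returns = rng.normal(0.0005, 0.01, 2520)        # ~10 years of daily returns

# Sharpe ratio, annualized, with the risk-free rate assumed to be zero
sharpe = returns.mean() / returns.std() * np.sqrt(252)

# Maximum drawdown from the cumulative equity curve
equity = np.cumprod(1 + returns)
drawdown = 1 - equity / np.maximum.accumulate(equity)
max_drawdown = drawdown.max()

# Historical one-day 95% Value at Risk (loss expressed as a positive number)
var_95 = -np.percentile(returns, 5)

print(f"Sharpe: {sharpe:.2f}, MaxDD: {max_drawdown:.1%}, VaR(95%): {var_95:.2%}")
```

Only the final print line returns to the model's context; the 2,520-element return series never does.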
```
skill-quant-analyst/
├── skill.yaml            # Invocation trigger
├── instructions.md       # Enforces strict math
├── scripts/
│   ├── backtest_engine.py  # Sandbox target
│   └── risk_metrics.py
└── requirements.txt      # statsmodels, pandas
```

The sandbox runtime executes risk_metrics.py and reads its standard output.

Multi-Agent Collaboration (FinRobot & LangGraph architectures)
Institutional complexity exceeds the capability of a single foundational model, no matter how large the context window. Modern financial AI leverages Multi-Agent Systems (MAS), inspired by frameworks like FinRobot, where specialized sub-agents operate under a central orchestrator using shared memory and StateGraphs.
Supervisor Agent
Receives the user intent (e.g., “Analyze semiconductor supply chains”). Breaks down the task, creates an execution plan, and routes sub-tasks to specialized agents.
Data & News Agent
Uses RAG and MCP to pull 10-K filings, earnings transcripts, and realtime news. Performs NLP sentiment analysis (FinBERT) to score qualitative market narratives.
Quantitative Agent
Receives historical data. Writes programmatic code to generate Discounted Cash Flow (DCF) models, Enterprise Value multiples, and backtests technical strategies.
Report Agent
Ingests the JSON state emitted by the News and Quant agents. Synthesizes conflicting data points into a cohesive, institutional-grade markdown investment memo.
State Management: These agents do not just chat; they pass a strictly typed State object (often via LangGraph). If the Risk Agent flags that the Quant Agent's portfolio violates a volatility constraint, the state is passed back to the Quant Agent with an error flag, forcing a recalculation before the final report is generated.
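That rejection-and-recalculation cycle can be sketched without the LangGraph library itself: a typed state object cycles between a Risk node and a Quant node until the volatility constraint is satisfied. The de-risking rule, the 0.15 limit, and the field names are illustrative assumptions:

```python
from typing import Optional, TypedDict

# LangGraph-style loop written in plain Python: a strictly typed state
# object is passed between nodes, and an error flag forces recalculation.
class PortfolioState(TypedDict):
    weights: dict
    volatility: float
    error_flag: Optional[str]

VOL_LIMIT = 0.15  # illustrative volatility constraint

def risk_agent(state: PortfolioState) -> PortfolioState:
    if state["volatility"] > VOL_LIMIT:
        state["error_flag"] = "volatility_constraint_violated"
    return state

def quant_agent(state: PortfolioState) -> PortfolioState:
    if state["error_flag"]:                      # recalculate on rejection
        state["weights"] = {k: v * 0.8 for k, v in state["weights"].items()}
        state["volatility"] *= 0.8               # stub: de-risk the portfolio
        state["error_flag"] = None
    return state

state: PortfolioState = {"weights": {"AAPL": 0.6, "MSFT": 0.4},
                         "volatility": 0.22, "error_flag": None}

for _ in range(10):                              # supervisor loop with a cap
    state = risk_agent(state)
    if not state["error_flag"]:
        break                                    # constraint satisfied: report
    state = quant_agent(state)

print(f"Final volatility: {state['volatility']:.3f}")
```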
7. Security & Governance
Hardening Autonomous Systems
Strict Containment & Secrets
As agents write/execute native code, boundaries become porous. Sandboxes must be hardened against unauthorized network access. Keys must be in secure vaults and rotated regularly — never exposed to the prompt.
Skills use flags like disable-model-invocation: true to ensure critical files are gated by explicit human triggers.
Evaluation & Policy (HITL)
Agents traverse non-linear workflows, so static unit testing alone is insufficient. Continuous multivariate testing (simulating multi-turn financial scenarios) is mandatory for evaluating regulatory compliance.
Agent OS Governance enforces kernel-level deterministic policy and immutable audit logs. High-stakes deployments always require Human-in-the-Loop (HITL) approval.
