The trajectory of artificial intelligence has fundamentally shifted.
We are moving from systems that merely generate probabilistic text to autonomous entities capable of executing complex, multi-step operations within highly deterministic external environments. This paradigm shift is entirely predicated on the evolution of tool calling architectures. The mechanism by which a large language model interacts with external systems — databases, APIs, computational engines, and file systems — defines its operational ceiling and utility.
1. The Genesis of Deterministic Action
Legacy Function Calling (Introduced mid-2023 by OpenAI)
Core Mechanics & Shift
Function calling turned linguistic intent into programmatic action. Instead of replying with free-form text, the model recognizes when a request needs external data, halts its generative process, and outputs a highly structured JSON object matching a predefined schema.
Crucial distinction: The language model does not execute the function itself; it acts solely as a semantic routing engine.
API Evolution: Functions to Tools
Early implementations explicitly passed a functions array (requiring name, description, parameters).
This was deprecated in favor of a more flexible tools array. The design shift allowed grouping multiple functions under namespaces and, critically, enabled parallel tool calls for concurrent data retrieval.
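To make the routing mechanic concrete, here is a minimal sketch of the host side of the loop. The tool schema follows the common `tools`-array shape, but the dispatcher, the `get_stock_price` stub, and the simulated model output are illustrative assumptions, not any vendor's SDK:

```python
import json

# Hypothetical host-side dispatch: the model emits tool_call objects and the
# application (not the model) executes them. All names here are illustrative.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_stock_price",
        "description": "Fetch the latest price for a ticker.",
        "parameters": {
            "type": "object",
            "properties": {"ticker": {"type": "string"}},
            "required": ["ticker"],
        },
    },
}]

def get_stock_price(ticker: str) -> dict:
    return {"ticker": ticker, "price": 150.25}  # stubbed data source

REGISTRY = {"get_stock_price": get_stock_price}

def dispatch(tool_calls: list) -> list:
    """Execute every call the model requested (parallel calls arrive as a list)."""
    results = []
    for call in tool_calls:
        fn = REGISTRY[call["name"]]
        args = json.loads(call["arguments"])  # the model emits arguments as a JSON string
        results.append({"name": call["name"], "content": fn(**args)})
    return results

# Simulated model output requesting two parallel tool calls
model_output = [
    {"name": "get_stock_price", "arguments": '{"ticker": "AAPL"}'},
    {"name": "get_stock_price", "arguments": '{"ticker": "MSFT"}'},
]
print(dispatch(model_output))
```

The results are appended to the conversation and the model is called again, which is exactly the round-trip cost discussed below.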
Architectural Limitations: The Schema Bloat Crisis
Every available tool must have its full JSON schema injected into the system prompt for every interaction. In finance, expanding to dozens of APIs saturates the context window, degrading reasoning quality and inflating inference costs.
The architecture necessitates a complete network round-trip for every individual invocation. Querying a DB, parsing it, then calling a pricing API for each ticker requires multiple sequential cycles, making it unsuitable for high-frequency workflows.
2. The Interoperability Standard
Model Context Protocol (MCP) — Late 2024
The Three-Tier Architecture
1. MCP Host
The primary application housing the LLM (e.g., Claude Desktop, Cursor IDE, bespoke terminals).
2. MCP Client
Protocol translation layer directly within the host managing connections, routing, and lifecycle.
3. MCP Server
Lightweight programs exposing APIs via STDIO (highly secure, local) or HTTP+SSE (remote/cloud).
Tools
Executable functions that alter state or perform external computations (e.g., executing a database query, submitting a trade order).
Resources
Read-only data sources that provide contextual info directly to the context window (e.g., PDF contents, live API responses).
Prompts
Reusable conversational templates that structure interactions and provide few-shot examples to optimize model behavior.
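The three primitives are easiest to tell apart in code. The sketch below is plain Python mimicking the shape of an MCP server's registration API; it is not the official `mcp` SDK, and every name in it is an illustrative assumption (the real protocol speaks JSON-RPC over STDIO or HTTP+SSE):

```python
# Schematic MCP-style server illustrating the three primitive types.
class DemoServer:
    def __init__(self):
        self.tools, self.resources, self.prompts = {}, {}, {}

    def tool(self, fn):            # executable: may alter external state
        self.tools[fn.__name__] = fn
        return fn

    def resource(self, uri):       # read-only: feeds the context window
        def register(fn):
            self.resources[uri] = fn
            return fn
        return register

    def prompt(self, fn):          # reusable template structuring interaction
        self.prompts[fn.__name__] = fn
        return fn

server = DemoServer()

@server.tool
def run_query(sql: str) -> list:
    return [("AAPL", 150.25)]  # stub: a real server would hit a database

@server.resource("file:///reports/q3.pdf")
def q3_report() -> str:
    return "Q3 revenue grew 8% YoY..."  # stub: would read the PDF contents

@server.prompt
def earnings_review(ticker: str) -> str:
    return f"Review the latest earnings for {ticker}. Cite exact figures."
```

The split matters for governance: tools mutate state and need permissioning, resources only populate context, and prompts shape behavior without touching either.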
3. The Paradigm Shift
Programmatic Tool Calling ('Code Mode' with Claude 4.6 / Sonnet 4.6)
The Turing-Complete Leap
Instead of predicting a JSON object for a single action, the model writes code in a Turing-complete language (TypeScript or Python) to orchestrate entire multi-tool workflows natively. This shifts the execution burden to an isolated, secure computational sandbox (e.g., V8 JavaScript isolates or Daytona containers).
Progressive Tool Discovery
Rather than loading all tool schemas upfront, the agent uses a Tool Search Tool. It queries to discover relevant libraries/MCP servers, imports them into its generated script, and executes locally.
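A toy version of that discovery step clarifies the idea. The registry contents and the keyword-overlap heuristic below are illustrative stand-ins for a real semantic search over tool descriptions:

```python
# Sketch of progressive tool discovery: instead of injecting every schema
# upfront, the agent queries a registry and imports only what it needs.
TOOL_REGISTRY = {
    "get_intraday_data": "Fetch intraday OHLCV bars for a ticker (finance).",
    "get_company_news": "Fetch recent news headlines for a company (finance).",
    "send_email": "Send an email via SMTP (communication).",
    "create_calendar_event": "Create a calendar event (scheduling).",
}

def tool_search(query: str, top_k: int = 2) -> list:
    """Naive keyword overlap standing in for semantic search."""
    scored = [
        (sum(word in desc.lower() for word in query.lower().split()), name)
        for name, desc in TOOL_REGISTRY.items()
    ]
    return [name for score, name in sorted(scored, reverse=True)[:top_k] if score > 0]

# The agent discovers only the relevant finance tool, then imports it
# into the script it generates for the sandbox.
print(tool_search("intraday ticker data"))
```

Only the matching schemas are then loaded, keeping the context window lean no matter how large the registry grows.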
Real-World Example: Filtering 10,000 Tick Data Records
Suppose the user asks: “Find the 3 minutes with the highest trading volume for AAPL today.”
Legacy JSON Flow
Context Window Bloat
1. Model calls get_intraday_data(ticker="AAPL").
2. The API returns 10,000 rows of JSON.
3. All 10,000 rows are injected directly into the LLM's context window.

```json
[
  {"time": "09:30", "vol": 14500, "price": 150.2},
  {"time": "09:31", "vol": 12200, "price": 150.4},
  ... 9,998 more rows ...
]
```

Programmatic Flow
Sandbox Execution
1. Model writes a Python script and sends it to the runtime sandbox.
2. The sandbox fetches the data and processes it locally using pandas.
3. Only the final print() output is sent back to the LLM.

```python
import mcp_finance as fin
import pandas as pd

# Executed in secure sandbox
data = fin.get_intraday_data("AAPL")
df = pd.DataFrame(data)
top_3 = df.nlargest(3, 'vol')
print(top_3.to_json())
```

| Metric | JSON Tool Calling | Programmatic Tool Calling |
|---|---|---|
| Execution Medium | Parsed JSON objects mapped to host functions | Turing-complete scripts (Python/TypeScript) |
| Context Window | High; all intermediate data passes through LLM | Minimal; data processing occurs within sandbox |
| Token Reduction | Baseline (1x) | 85% to 98% reduction in token consumption |
| System Latency | High; requires network round-trips per step | Low; loops/conditionals execute at runtime speed |
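A back-of-envelope calculation shows where headline reduction figures of this kind come from. The ~20 tokens per JSON row and the script/summary overhead are assumptions for illustration; this extreme 10,000-row case lands at or above the top of the range quoted in the table:

```python
# Back-of-envelope context cost: JSON flow vs. programmatic flow.
# ~20 tokens per JSON row and ~150 tokens of script/summary overhead
# are assumed figures for illustration only.
rows = 10_000
tokens_per_row = 20

json_flow_tokens = rows * tokens_per_row          # every row enters the context
programmatic_tokens = 3 * tokens_per_row + 150    # 3 summary rows + overhead

reduction = 1 - programmatic_tokens / json_flow_tokens
print(f"JSON flow: {json_flow_tokens:,} tokens")
print(f"Programmatic: {programmatic_tokens:,} tokens")
print(f"Reduction: {reduction:.1%}")
```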
4. Modern Execution Environments
Agent Harnesses and Agentic Skills
Frameworks vs. Runtimes vs. Harnesses
Frameworks (such as LangChain) provide basic orchestration loops, and runtimes provide computational sandboxes. The Agent Harness is the holistic wrapper above both: it manages system instructions, dynamic tools, conversational state, and persistence over multiple autonomous turns through Agentic Skills.
- Metadata (skill.yaml): YAML frontmatter (name, description, compatibility). Loaded on initialization; consumes ~100 tokens. Lets the agent know the capability exists.
- Instructions (instructions.md): Detailed markdown instructions, constraints, and logic, pulled in via a bash call only when a request matches the description.
- Scripts (scripts/ or references/): Execution scripts (Python, Bash). Never loaded into context; executed directly by the sandbox runtime.
Sample Implementations
Cursor Agent Harness (IDE Focus)
Cursor orchestrates instructions while tuning for specific foundational models (e.g., knowing one model prefers grep while another needs linter nudges).
Lifecycle Hooks: Defined in .cursor/hooks.json. They intercept checkpoints like beforeShellExecution or preToolUse. Hook scripts run as background commands and return JSON; exit code 0 signals success, while exit code 2 actively blocks the proposed action.
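A minimal hook matching that exit-code contract might look like the following. The JSON field names (`command`, `decision`, `reason`) and blocked patterns are illustrative assumptions, since the actual payload schema is Cursor-specific:

```python
import json

# Minimal beforeShellExecution-style hook. In a real hook the payload
# arrives as JSON on stdin and the return value becomes the process exit
# code, e.g. sys.exit(review(json.load(sys.stdin))). Field names are
# illustrative, not the documented Cursor schema.
BLOCKED_PATTERNS = ("rm -rf /", "DROP TABLE", "git push --force")

def review(payload: dict) -> int:
    """Return the exit code the hook process would terminate with."""
    command = payload.get("command", "")
    for pattern in BLOCKED_PATTERNS:
        if pattern in command:
            print(json.dumps({"decision": "block", "reason": f"matched {pattern!r}"}))
            return 2   # exit code 2 actively blocks the proposed action
    print(json.dumps({"decision": "allow"}))
    return 0

print(review({"command": "psql -c 'DROP TABLE users;'"}))
```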
The OpenClaw Architecture (Persistent Daemon)
OpenClaw rejects ephemeral chat sessions, establishing a long-lived Node.js gateway locally. It operates a heartbeat scheduler (waking every 30 mins to review HEARTBEAT.md) to run background tasks.
- Channel Adapters: WhatsApp (Baileys), Telegram (grammY), CLI.
- Session Manager & Queue System: Serializes multi-step tool invocations.
- Agent Runtime: Dynamically assembles context from AGENTS.md and SOUL.md.
- Control Plane: WebSocket API on port :18789 for global state.
Claude Code CLI (Terminal Focus)
Anthropic's official CLI tool brings the agentic loop directly into the developer's native environment. Rather than acting as a simple autocomplete, it operates as an autonomous sub-agent capable of exploring file systems, running tests, and executing multi-step refactoring.
The Agentic REPL Loop
Claude Code doesn't just guess an answer; it iteratively searches, reads, modifies, and verifies code. If a test fails, it reads the error and tries again without human prompting.
- Native Tools: Built-in primitives like Bash, Glob, Grep, FileRead, and FileEdit.
- Context Injection: Automatically understands Git state, uncommitted changes, and project structure.
Security & Customization
Operates with a strict permission model and highly configurable project-level instructions.
- allowed-tools: Defined in SKILL.md to limit privileges. A “code-reviewer” skill can explicitly allow only Read/Grep, denying file modification tools.
- Slash Commands: Explicit invocation using commands like /bugfix or /test, which trigger specialized prompts.
- Human Approval: Destructive bash commands are trapped and require the user to press 'Enter' before execution.
How Harnesses Orchestrate Everything Together
A harness is not a replacement for MCP, function calling, or programmatic execution — it is the conductor that decides which tool to reach for and when. Each layer has a distinct role, and the harness composes them into a single coherent workflow.
Full-Stack Execution Flow — Single User Request
Harness / Skill
User prompt arrives. Harness loads YAML metadata (~100 tokens), runs semantic trigger matching, and JIT-injects the matching skill's instructions.md. Sets the active tool allowlist for this request only.
Output: enriched system prompt + tool allowlist
MCP Server
LLM calls a tool exposed by an MCP server (e.g., get_ticker_history). The server fetches raw data from the exchange API and returns a structured JSON payload — but not directly into the LLM context.
Output: raw JSON payload → sandbox
Programmatic Sandbox
The LLM writes a Python script that ingests the MCP payload, runs pandas/numpy computations, and print()s only the summary. The sandbox executes it in isolation. 10,000 rows become 3 lines of output.
Output: summary result → LLM context
Function Call
For simple, low-volume lookups (e.g., get_company_name(ticker)), the harness permits a direct JSON function call. No sandbox needed — the result is small enough to pass through the context safely.
Output: small JSON → LLM context directly
CLI / Hook
Before the final response is committed, a post_tool_use lifecycle hook fires a CLI command (e.g., pytest, ruff). If it exits non-zero, the harness blocks and re-prompts the LLM with the error output.
Output: pass → response / fail → re-prompt
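The five stages above boil down to one routing decision the harness makes per tool request. This sketch compresses that decision into a single function; the 1,000-row threshold and tool names are illustrative assumptions:

```python
# Sketch of a harness routing decision: the allowlist is the hard gate,
# large payloads are forced through the sandbox, and only small results
# flow directly into the LLM context. Threshold and names are illustrative.
ROW_THRESHOLD = 1_000

def route_tool_request(tool_name: str, estimated_rows: int, allowlist: set) -> str:
    if tool_name not in allowlist:
        return "denied"                   # never exposed to the model
    if estimated_rows > ROW_THRESHOLD:
        return "sandbox_exec"             # processed out-of-context, summary only
    return "direct_function_call"         # small JSON passes through safely

allow = {"get_ticker_history", "get_company_name", "sandbox_exec"}
print(route_tool_request("get_ticker_history", 10_000, allow))
print(route_tool_request("get_company_name", 1, allow))
print(route_tool_request("submit_trade", 1, allow))
```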
Harness + MCP
Data access layer
The harness's skill YAML declares which MCP servers are required (mcp_servers: [quant-db, alpha-vantage]). When the skill activates, the harness spins up only those servers — not all registered ones. This means a “risk-analysis” skill never accidentally exposes the trade-execution MCP server to the LLM.
Harness + Function Calling
Selective schema injection
Instead of injecting all 50 tool schemas globally, the harness injects only the schemas relevant to the active skill. A “portfolio-rebalancer” skill exposes get_weights and set_allocation. A “news-analyst” skill exposes search_filings. Context stays lean regardless of how many tools exist in the registry.
Harness + Programmatic Execution
Sandbox delegation
The skill's instructions.md explicitly tells the LLM: “For any dataset > 1,000 rows, write a Python script and execute it via the sandbox tool. Never paste raw data into your response.” The harness enforces this by only granting the sandbox_exec tool — not a raw data-dump tool — in the active allowlist.
Harness + CLI Hooks
Lifecycle enforcement
Hooks are the harness's immune system. A pre_tool_use hook intercepts every bash call and blocks commands containing DROP TABLE or live API keys. A post_tool_use hook auto-runs ruff on any generated Python. The LLM never bypasses these — they execute at the harness layer, below the model's awareness.
5. Demo: Anatomy of an Agentic Skill
Building a highly-scoped 'Database Architect' Skill
skill.yaml (The Metadata)
Loaded into the context window at startup. Costs ~40 tokens.
The Harness reads this file to understand when to activate the skill. The triggers array acts as semantic routing hooks.
```yaml
name: database-architect
description: Expert in PostgreSQL schema design, migrations, and query optimization.
version: 1.0.0
triggers:
  - "create a table"
  - "write a migration"
  - "optimize this query"
  - "database schema"
allowed_tools:
  - bash
  - file_read
  - file_write
  - psql_eval
env_vars_required:
  - DATABASE_URL
```

instructions.md (The Payload)
Dynamically injected ONLY when a trigger is matched.
This contains the heavy, specialized prompt engineering. By keeping this out of the global prompt, we save thousands of idle tokens and prevent the model from getting confused by irrelevant instructions.
```markdown
# Role: Senior Database Architect
You are responsible for modifying the PostgreSQL database.

## Strict Constraints:
1. NEVER use destructive commands (DROP TABLE, DELETE) without explicit user confirmation.
2. All new tables MUST include id (UUID), created_at, and updated_at columns.
3. Indexes MUST be created concurrently (CREATE INDEX CONCURRENTLY).
4. Always write migrations in the /supabase/migrations directory
   using the format YYYYMMDDHHMMSS_name.sql.

## Execution Protocol:
When asked to create a schema:
1. Use file_read to check existing schemas in /migrations.
2. Write the new SQL file using file_write.
3. Use the psql_eval tool to run a dry-run syntax check against the local DB.
```

harness.ts (The Orchestrator)
The core loop that manages the skill injection.
This is a simplified view of how an Agent Harness (like OpenClaw or an IDE) processes user input, detects the need for a skill, and alters the context dynamically before calling the LLM.
```typescript
import { readFile } from 'fs/promises';
import { loadSkills, semanticMatch } from './skill-manager';
import { callLLM } from './llm-provider';

async function processUserRequest(userPrompt: string, messageHistory: any[]) {
  // 1. Load lightweight YAML metadata for all installed skills
  const availableSkills = await loadSkills();
  let activeSystemPrompt = "You are a helpful coding assistant.";
  let activeTools = ["bash", "file_read"]; // Default tools

  // 2. Check if the user prompt matches any skill triggers
  for (const skill of availableSkills) {
    if (semanticMatch(userPrompt, skill.triggers)) {
      console.log(`[Harness] Activating Skill: ${skill.name}`);

      // 3. Dynamically read the heavy markdown instructions
      const skillInstructions = await readFile(`skills/${skill.name}/instructions.md`, 'utf-8');

      // 4. Augment the context window
      activeSystemPrompt += `\n\n---\n${skillInstructions}`;
      activeTools = [...new Set([...activeTools, ...skill.allowed_tools])];
    }
  }

  // 5. Execute the LLM with the highly specific, JIT-loaded context
  return await callLLM({
    system: activeSystemPrompt,
    messages: [...messageHistory, { role: "user", content: userPrompt }],
    tools: activeTools
  });
}
```

If the user asks “How do I center a div?”, the Harness skips the Database Architect skill entirely. The model responds instantly using a tiny default context. If the user asks “Add a users table”, the Harness injects the strict SQL constraints and grants access to the psql_eval tool. This guarantees high precision without context bloat.
How Skills Chain: Using Prompts to Decide the Next Skill
A single skill rarely completes a complex task alone. The real power emerges when a skill's instructions.md explicitly tells the LLM which skill to invoke next based on what it finds — turning a flat list of skills into a dynamic decision tree.
The Mechanism
Each skill's instructions.md ends with a routing block — a conditional section that tells the LLM: “If you find X, your next action is to invoke skill Y by emitting this exact phrase.” The harness watches the LLM's output stream for these trigger phrases and activates the next skill automatically.
This is not magic — it is structured prompt engineering. The LLM is instructed to be explicit about its intent, and the harness is wired to act on that intent.
Why Not Just One Big Skill?
Combining all logic into one skill bloats the context window and degrades precision. A “data-fetcher” skill has no business knowing SQL migration rules. Keeping skills small and single-purpose means each one is injected only when needed — and evicted the moment it is done.
Chaining also enables conditional branching: the path taken depends on what the previous skill actually found, not what was assumed upfront.
Worked Example: “Analyse AAPL earnings and flag any risk”
Three skills chain automatically from a single user prompt
Skill 1: data-fetcher
Calls the MCP server to pull the last 4 quarters of AAPL earnings transcripts. Extracts revenue, EPS, and guidance figures into a structured JSON summary. Its routing block:

```markdown
## After completing data extraction:
- If revenue growth YoY < 5%: emit "INVOKE: risk-assessor — slow growth detected"
- If guidance was revised downward: emit "INVOKE: risk-assessor — guidance cut detected"
- If all metrics are within normal range: emit "INVOKE: report-writer — data clean"
- Always pass the extracted JSON as context to the next skill.
```
Skill 2: risk-assessor
Injected only because the previous skill flagged a condition. Runs a programmatic sandbox script to compute VaR and the max-drawdown delta, and compares the guidance-cut magnitude against historical precedents. Its routing block:

```markdown
## After completing risk scoring:
- If risk_score > 7: emit "INVOKE: report-writer — HIGH RISK — include risk section"
- If risk_score 4-7: emit "INVOKE: report-writer — MODERATE RISK — summarise flags"
- If risk_score < 4: emit "INVOKE: report-writer — LOW RISK — brief mention only"
- Attach risk_score and flag_list to context for report-writer.
```
Skill 3: report-writer
Receives the earnings JSON and risk context from the shared state. Formats a structured investment memo with the appropriate risk-section depth based on the flag passed by the risk-assessor. The final output is the only thing the user ever sees.
Rules for Reliable Skill Chaining
1. Explicit emit phrases
Use a fixed, unambiguous prefix like INVOKE: that the harness regex-matches. Never rely on the LLM to “naturally” say the right thing — constrain it.
2. Pass state explicitly
Each skill must be instructed to attach its output to a shared context object. The next skill reads from that object — it never re-fetches data the previous skill already retrieved.
3. Define a terminal condition
Every chain must have a skill that emits no further INVOKE: — the report-writer above. Without a terminal node, the harness can loop indefinitely.
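The three rules above can be sketched as a harness-side chaining loop: a regex matches the fixed INVOKE: prefix, state is threaded through a shared dict, and a hop cap plus a terminal node prevent infinite loops. The stub skills mirror the worked example; their internals are illustrative:

```python
import re

# Harness-side chaining loop. The emit format mirrors the routing blocks
# above; the three stub skills and their outputs are illustrative.
INVOKE_RE = re.compile(r"INVOKE:\s*([\w-]+)")

def data_fetcher(state):
    state["revenue_growth"] = 0.03                  # rule 2: pass state explicitly
    return 'INVOKE: risk-assessor — slow growth detected'

def risk_assessor(state):
    state["risk_score"] = 8
    return 'INVOKE: report-writer — HIGH RISK — include risk section'

def report_writer(state):
    state["memo"] = f"Risk score {state['risk_score']}: include full risk section."
    return "Memo complete."                          # rule 3: no INVOKE: -> terminal

SKILLS = {"data-fetcher": data_fetcher,
          "risk-assessor": risk_assessor,
          "report-writer": report_writer}

def run_chain(first_skill: str, max_hops: int = 10) -> dict:
    state, current = {}, first_skill
    for _ in range(max_hops):                        # hop cap guards against loops
        output = SKILLS[current](state)
        match = INVOKE_RE.search(output)             # rule 1: fixed, regex-matched prefix
        if not match:
            return state                             # terminal condition reached
        current = match.group(1)
    raise RuntimeError("chain exceeded max hops")

print(run_chain("data-fetcher")["memo"])
```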
6. Applications in Finance
Quantitative Research and Algorithmic Execution
The Quantitative Data Layer (MCP Servers)
| Provider | Protocol | Key Tools Exposed | Primary Use Case |
|---|---|---|---|
| Alpha Vantage | HTTP/STDIO | TIME_SERIES, REALTIME_OPTIONS, EARNINGS_TRANSCRIPT | Real-time pricing, greeks, sentiment via TOOL_CALL wrapper |
| Financial Datasets | HTTP | get_income_statements, get_company_news | Fundamental analysis & corporate filings |
| EODHD | HTTP | get_us_tick_data, get_mp_illio_market_insights | Macro forecasting, tick data, beta band distributions |
| Octagon AI | HTTP | SEC filings, private market insights | Institutional equity research gathering |
| QuantConnect | HTTP | Historical Data access, Backtesting Engine | Strategy transition from research to live brokerage |
Agentic Quantitative Skills
Traditional LLMs fail at quantitative finance because text prediction cannot reliably perform complex mathematics. Agentic Quantitative Skills solve this by forcing the LLM to write, execute, and iterate on Python code natively within a secure sandbox (Programmatic Tool Calling), offloading the math to libraries like pandas, numpy, and scipy.
The Execution Pipeline
- Data Ingestion: The agent writes code to fetch 10 years of OHLCV data via an MCP Resource (e.g., EODHD).
- Feature Engineering: It dynamically calculates rolling Z-scores, MACD, and Bollinger Bands using pandas_ta.
- Statistical Validation: Runs Augmented Dickey-Fuller (ADF) tests on the spread to validate cointegration in pairs trading strategies.
- Metric Generation: Calculates Maximum Drawdown, Value at Risk (VaR), and Sharpe ratio.
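The metric-generation step above can be sketched as the kind of script the agent would write for the sandbox. The formulas are the standard definitions; the input returns are simulated rather than fetched from a real MCP resource:

```python
import numpy as np

# Metric generation on synthetic daily returns (standard definitions;
# inputs are simulated, not fetched from a live data source).
rng = np.random.default_rng(42)
returns = rng.normal(0.0005, 0.01, 2520)        # ~10 years of daily returns

# Sharpe ratio, annualized, with the risk-free rate assumed to be zero
sharpe = returns.mean() / returns.std() * np.sqrt(252)

# Maximum drawdown from the cumulative equity curve
equity = np.cumprod(1 + returns)
drawdown = 1 - equity / np.maximum.accumulate(equity)
max_drawdown = drawdown.max()

# Historical one-day 95% Value at Risk (loss expressed as a positive number)
var_95 = -np.percentile(returns, 5)

print(f"Sharpe: {sharpe:.2f}, MaxDD: {max_drawdown:.1%}, VaR(95%): {var_95:.2%}")
```

Only the final print line returns to the model's context; the 2,520-element return series never does.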
```
skill-quant-analyst/
├── skill.yaml            # Invocation trigger
├── instructions.md       # Enforces strict math
├── scripts/
│   ├── backtest_engine.py  # Sandbox target
│   └── risk_metrics.py
└── requirements.txt      # statsmodels, pandas
```

The sandbox runtime executes risk_metrics.py and reads its standard output.

Multi-Agent Collaboration (FinRobot & LangGraph architectures)
Institutional complexity exceeds the capability of a single foundational model, no matter how large the context window. Modern financial AI leverages Multi-Agent Systems (MAS), inspired by frameworks like FinRobot, where specialized sub-agents operate under a central orchestrator using shared memory and StateGraphs.
Supervisor Agent
Receives the user intent (e.g., “Analyze semiconductor supply chains”). Breaks down the task, creates an execution plan, and routes sub-tasks to specialized agents.
Data & News Agent
Uses RAG and MCP to pull 10-K filings, earnings transcripts, and realtime news. Performs NLP sentiment analysis (FinBERT) to score qualitative market narratives.
Quantitative Agent
Receives historical data. Writes programmatic code to generate Discounted Cash Flow (DCF) models, Enterprise Value multiples, and backtests technical strategies.
Report Agent
Ingests the JSON state emitted by the News and Quant agents. Synthesizes conflicting data points into a cohesive, institutional-grade markdown investment memo.
State Management: These agents do not just chat; they pass a strictly typed State object (often via LangGraph). If the Risk Agent flags that the Quant Agent's portfolio violates a volatility constraint, the state is passed back to the Quant Agent with an error flag, forcing a recalculation before the final report is generated.
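That rejection-and-recalculation cycle can be sketched without the LangGraph library itself: a typed state object cycles between a Risk node and a Quant node until the volatility constraint is satisfied. The de-risking rule, the 0.15 limit, and the field names are illustrative assumptions:

```python
from typing import Optional, TypedDict

# LangGraph-style loop written in plain Python: a strictly typed state
# object is passed between nodes, and an error flag forces recalculation.
class PortfolioState(TypedDict):
    weights: dict
    volatility: float
    error_flag: Optional[str]

VOL_LIMIT = 0.15  # illustrative volatility constraint

def risk_agent(state: PortfolioState) -> PortfolioState:
    if state["volatility"] > VOL_LIMIT:
        state["error_flag"] = "volatility_constraint_violated"
    return state

def quant_agent(state: PortfolioState) -> PortfolioState:
    if state["error_flag"]:                      # recalculate on rejection
        state["weights"] = {k: v * 0.8 for k, v in state["weights"].items()}
        state["volatility"] *= 0.8               # stub: de-risk the portfolio
        state["error_flag"] = None
    return state

state: PortfolioState = {"weights": {"AAPL": 0.6, "MSFT": 0.4},
                         "volatility": 0.22, "error_flag": None}

for _ in range(10):                              # supervisor loop with a cap
    state = risk_agent(state)
    if not state["error_flag"]:
        break                                    # constraint satisfied: report
    state = quant_agent(state)

print(f"Final volatility: {state['volatility']:.3f}")
```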
7. Security & Governance
Hardening Autonomous Systems
Strict Containment & Secrets
As agents write/execute native code, boundaries become porous. Sandboxes must be hardened against unauthorized network access. Keys must be in secure vaults and rotated regularly — never exposed to the prompt.
Skills use flags like disable-model-invocation: true to ensure critical files are gated by explicit human triggers.
Evaluation & Policy (HITL)
Agents traverse non-linear workflows, so static unit testing alone is insufficient. Continuous multivariate testing (simulating multi-turn financial scenarios) is mandatory for evaluating regulatory compliance.
Agent OS Governance enforces kernel-level deterministic policy and immutable audit logs. High-stakes deployments always require Human-in-the-Loop (HITL) approval.
