The explainable, auditable prompt optimizer

Rules you can read. Gains you can trust. No fine-tuning.

Python 3.10+ · dual-licensed (see License)
  • +123% F1 on Healthcare PII
  • +17% F1 on Legal PII
  • +63% on HoVer fact verification
  • 100% on GSM8K math reasoning
  • Beats GEPA/MIPRO in 8 of 9 configurations

Overview

Constitutional AI steers LLM behavior through natural-language rules, but writing those rules by hand is hard, and getting them right is harder. Existing prompt optimizers try to automate this but fall short: they need many labeled examples, produce opaque prompt edits, and hit diminishing returns as prompts grow.

MAC fixes this. It optimizes over structured sets of rules using a network of specialized agents that propose, edit, and validate rule updates against a held-out set. The result is a human-readable constitution with no fine-tuning and no gradient updates.

Real rules MAC learned for PII tagging

Legal: "Mark as private specific dates when they appear in the context of personal events or actions, such as births, deaths, or significant life events. Do not mark general references or narrative text." Examples: mark "1975", "22 August 2003"; do not mark "on a day in June".

Healthcare: "Mark terms such as heart failure subtypes (e.g., diastolic heart failure, systolic heart failure) when explicitly mentioned in a patient's medical history as private. Do not mark generic medical conditions without an explicit subtype."

Finance: "Mark as private any phrase indicating a specific financial timeframe (e.g., FY2022, YTD FY2021) when it appears in direct association with identifiable information. Do not mark standalone labels without specific identifiers."


  • Explainable - every rule is natural language you can read, audit, and hand-edit
  • Structured - optimizes over a set of rules, not a monolithic prompt blob
  • Auditable - each rule change is validated on a held-out set before acceptance
  • Transferable - rules learned on one model work on another without retraining
  • Sample-efficient - far fewer labeled examples than GEPA or MIPRO to converge
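The held-out validation gate above can be sketched in a few lines. This is a minimal illustration, not MAC's actual implementation: `accept_rule_edit` and `toy_validate` are hypothetical names, and the real validation step scores annotations on held-out examples rather than inspecting rule text.

```python
def accept_rule_edit(constitution, proposed_rule, validate):
    """Keep `proposed_rule` only if the held-out validation score improves."""
    baseline = validate(constitution)
    candidate = constitution + [proposed_rule]
    score = validate(candidate)
    if score > baseline:
        return candidate, score      # accept: measurable gain on held-out set
    return constitution, baseline   # reject: constitution stays as audited

# Toy stand-in for a real held-out metric (fraction of rules naming "private").
def toy_validate(rules):
    if not rules:
        return 0.0
    return sum("private" in r for r in rules) / len(rules)

rules = ["Mark specific dates of personal events as private."]
rules, score = accept_rule_edit(rules, "Do not mark narrative text.", toy_validate)
```

Because every accepted change must clear this gate, the final constitution is an audit trail: each rule earned its place on held-out data.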

MAC system overview showing 4 agents iteratively learning constitution rules

Four agents coordinate in a loop: Annotator (applies rules) → Decision Agent (analyzes errors) → Rule Proposer (drafts new rules) / Rule Editor (refines existing ones). A meta-model rewrites all agent prompts so MAC generalizes to any downstream task. Just provide examples and a metric.
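The coordination loop above can be sketched as plain Python. All function names here are hypothetical stubs standing in for the real agents, which are LLM calls; the sketch only shows the control flow of annotate → analyze → propose/edit.

```python
def annotate(rules, example):
    # Annotator (stub): apply the current rules to one example.
    return {"prediction": example["text"], "rules_used": list(rules)}

def analyze_errors(prediction, gold, rules):
    # Decision Agent (stub): classify the failure mode, if any.
    if prediction["prediction"] == gold:
        return None
    return "missing_rule" if not rules else "bad_rule"

def propose_rule(example):
    # Rule Proposer (stub): draft a new rule from the failing example.
    return f"Handle cases like: {example['text'][:30]}"

def edit_rule(rule):
    # Rule Editor (stub): refine an existing rule.
    return rule + " (refined)"

def mac_loop(rules, batch):
    for example in batch:
        pred = annotate(rules, example)
        error = analyze_errors(pred, example["gold"], rules)
        if error == "missing_rule":
            rules.append(propose_rule(example))
        elif error == "bad_rule":
            rules[-1] = edit_rule(rules[-1])
    return rules

rules = mac_loop([], [
    {"text": "born on 22 August 2003", "gold": "DATE"},
    {"text": "on a day in June", "gold": "O"},
])
```

The first failing example triggers the proposer path, the second triggers the editor path; in MAC the accepted outcome of either path is then checked against the held-out set.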


MAC vs Prompt Optimizers

Domain-specific PII tagging across Legal, Finance, and Healthcare documents using Qwen2.5-Instruct models at 3B, 7B, and 14B scales.

MAC vs GEPA vs MIPRO F1 by domain and model scale

| Dataset    | Method | 3B   | 7B   | 14B  |
|------------|--------|------|------|------|
| Legal      | GEPA   | 12.7 | 52.1 | 50.1 |
| Legal      | MIPRO  | 13.2 | 38.6 | 44.3 |
| Legal      | MAC    | 36.0 | 55.1 | 67.3 |
| Finance    | GEPA   | 11.9 | 22.5 | 28.8 |
| Finance    | MIPRO  | 9.8  | 22.3 | 26.8 |
| Finance    | MAC    | 30.1 | 37.5 | 45.5 |
| Healthcare | GEPA   | 16.5 | 12.9 | 16.8 |
| Healthcare | MIPRO  | 12.5 | 16.8 | 20.6 |
| Healthcare | MAC    | 9.7  | 20.1 | 26.7 |

MAC wins 8 of 9 configurations. Largest gain: Legal at 3B, +174% over the next-best optimizer.

MAC vs Pretrained Taggers (14B)

| Domain     | MAC  | Best Baseline   | Gain  |
|------------|------|-----------------|-------|
| Legal      | 67.3 | 57.3 (Presidio) | +17%  |
| Finance    | 45.5 | 44.7 (Presidio) | +2%   |
| Healthcare | 26.7 | 12.0 (GLiNER)   | +123% |

Training Dynamics

Validation F1 over training batches on 3B models (ECHR dataset). MAC steadily improves while baselines plateau or fluctuate.

Validation F1 vs training batches on 3B models


General-Purpose Benchmarks

MAC is task-agnostic. A meta-model rewrites all agent prompts to fit any downstream task. Below are results on three diverse benchmarks.

HoVer (Fact Verification)

HoVer results

| Setup             | Worker      | Style  | Baseline | Best | Delta |
|-------------------|-------------|--------|----------|------|-------|
| API + Cloud MAC   | gpt-4o-mini | adapt  | 25%      | 88%  | +63%  |
| Local + Cloud MAC | Qwen3-8B    | custom | 62%      | 88%  | +26%  |
| Local + Cloud MAC | Qwen3-8B    | adapt  | 69%      | 81%  | +12%  |
| Fully Local       | Qwen3-8B    | custom | 75%      | 81%  | +6%   |
| API + Cloud MAC   | gpt-4o-mini | custom | 88%      | 88%  | 0%    |
| Fully Local       | Qwen3-8B    | adapt  | 75%      | 75%  | 0%    |

GSM8K (Math Reasoning)

GSM8K results

| Setup             | Worker      | Style  | Baseline | Best | Delta |
|-------------------|-------------|--------|----------|------|-------|
| Local + Cloud MAC | Qwen3-8B    | adapt  | 94%      | 100% | +6%   |
| Local + Cloud MAC | Qwen3-8B    | custom | 94%      | 100% | +6%   |
| Fully Local       | Qwen3-8B    | custom | 94%      | 100% | +6%   |
| API + Cloud MAC   | gpt-4o-mini | adapt  | 100%     | 100% | 0%    |
| API + Cloud MAC   | gpt-4o-mini | custom | 94%      | 94%  | 0%    |
| Fully Local       | Qwen3-8B    | adapt  | 100%     | 100% | 0%    |

HotpotQA (Multi-Hop QA)

HotpotQA results

| Setup             | Worker      | Style  | Baseline | Best | Delta |
|-------------------|-------------|--------|----------|------|-------|
| Fully Local       | Qwen3-8B    | custom | 22%      | 36%  | +14%  |
| Local + Cloud MAC | Qwen3-8B    | adapt  | 29%      | 38%  | +9%   |
| API + Cloud MAC   | gpt-4o-mini | adapt  | 25%      | 34%  | +9%   |
| Local + Cloud MAC | Qwen3-8B    | custom | 29%      | 36%  | +7%   |
| API + Cloud MAC   | gpt-4o-mini | custom | 26%      | 32%  | +6%   |
| Fully Local       | Qwen3-8B    | adapt  | 27%      | 27%  | 0%    |

Any Task, Any Model

MAC is not tied to a single domain. Give it a task_description, a few examples, and a scoring function, and it learns a constitution for classification, extraction, math, QA, tool calling, or anything else you can evaluate. A meta-model automatically rewrites every agent prompt to fit the new task, so you never write task-specific plumbing.
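The scoring function can be any callable that maps predictions and gold labels to a number. As one illustration (a sketch, not MAC's actual metric interface), a span-level F1 for PII tagging might look like this:

```python
def pii_f1(predicted: set, gold: set) -> float:
    """Span-level F1 between predicted and gold PII spans."""
    if not predicted and not gold:
        return 1.0  # nothing to tag and nothing tagged: perfect
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# One correct span, one false positive, one miss -> F1 = 0.5
score = pii_f1({"1975", "22 August 2003"}, {"1975", "on a day in June"})
```

Any metric with this shape works: exact-match accuracy for math, token F1 for QA, or a custom business rule.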

Under the hood, MAC uses a three-tier model setup: a cheap worker does annotation, strong MAC agents learn the rules, and an optional adapt model rewrites the agent prompts:

compiler = MAC(
    model="Qwen/Qwen3-8B",           # Tier 1: worker (cheap / local)
    base_url="http://localhost:8000/v1",
    mac_model="gpt-5.2",              # Tier 2: MAC agents (strong)
    # adapt_model="gpt-5.2",          # Tier 3: defaults to mac_model
    task_description="Solve AIME competition math problems.",
    rule_type="math reasoning rules",
)

See Model Configuration for the full fallback cascade and provider examples.


Citation

@article{thareja2025mac,
  title={MAC: Multi-Agent Constitution Learning for Generalizable Text Annotation},
  author={Thareja, Rushil},
  year={2025}
}

License

MAC is released under a dual license:

  • Non-commercial use (academic research, education, personal projects): free
  • Commercial use: requires a separate license. Contact rushil.thareja@mbzuai.ac.ae