Agentic AI in Financial Services: The Model Was Never the Hard Part
Multi-agent systems have moved out of research labs and into production at major banks. The frontier models change every few weeks now. The thing that decides whether a bank ships agentic AI to production has almost nothing to do with which model it picked.
The short version
- Agentic AI in banking has moved past pilots. Production results are public: a Dutch institution cut KYC onboarding time by around 90 percent, a Singapore bank took onboarding from days to minutes, and trade-surveillance teams report workload down by roughly 60 percent.
- The model layer has become a moving target. Seven frontier models shipped in the first quarter of 2026 alone. Any architecture pinned to one model version is already out of date.
- That makes the model the least interesting decision. The work that determines success is compliance engineering: model risk management, lineage, fairness testing, reproducibility, and explainability built for an examiner.
- Design for auditability first. The most successful agentic systems in financial services are not the most autonomous ones. They are the ones a regulator can inspect.
Why banks are betting on agents now
Every Tier-1 bank I've spoken with in the past year has moved past the pilot stage. The technology underneath (tool use, chain-of-thought reasoning) has been around for a while. What changed is the evidence. The results are now public and specific. A large Dutch financial institution reports cutting KYC onboarding time by roughly 90 percent and analyst workload by about 30 percent. A bank in Singapore took onboarding from three days to minutes. Trade-surveillance teams report review workload down by around 60 percent. When the case studies stop being projections and start being post-mortems, the board conversation changes.
But financial services is not a typical deployment environment. You cannot iterate quickly when you are handling PII, processing live trades, or generating regulatory filings. The constraints are not blockers. They are requirements. The teams that treat them that way are the ones shipping to production. The teams that treat them as friction are still running pilots.
The pattern we've landed on
The systems I design at IBM follow a structure that balances autonomy with auditability. It has four layers, and each one matters equally.
Orchestrator agent
A central reasoning agent that breaks tasks apart, selects tools, and tracks workflow state. In our deployments this runs on a current-generation frontier model through Azure OpenAI, behind structured output schemas. The vagueness about which model is deliberate: the model that wins a benchmark this month is rarely the one in production next quarter, so the architecture treats the orchestrator model as a swappable component. The schemas, not the model, enforce deterministic routing, and deterministic routing is the difference between a demo and something you can explain to a regulator.
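To make "the schemas, not the model, enforce routing" concrete, here is a minimal sketch using only the Python standard library. The route names mirror the specialist tasks named in this article; the field names (route, case_id), the RoutingDecision type, and the parse_routing_decision helper are hypothetical, not the production schema.

```python
import json
from dataclasses import dataclass
from enum import Enum


class Route(str, Enum):
    """Closed set of destinations. The model can only pick from these."""
    DOCUMENT_EXTRACTION = "document_extraction"
    ENTITY_RESOLUTION = "entity_resolution"
    WATCHLIST_SCREENING = "watchlist_screening"
    RISK_SCORING = "risk_scoring"


@dataclass(frozen=True)
class RoutingDecision:
    route: Route
    case_id: str
    model_version: str  # recorded so the decision can be traced later


def parse_routing_decision(raw_model_output: str, model_version: str) -> RoutingDecision:
    """Validate raw model output against the schema, or raise ValueError.

    Hypothetical helper: anything outside the schema is rejected instead of
    being interpreted, so swapping the model cannot add new routes.
    """
    payload = json.loads(raw_model_output)  # JSONDecodeError is a ValueError
    if not isinstance(payload, dict):
        raise ValueError("routing output must be a JSON object")
    if set(payload) != {"route", "case_id"}:
        raise ValueError(f"unexpected fields: {sorted(payload)}")
    if not isinstance(payload["case_id"], str) or not payload["case_id"]:
        raise ValueError("case_id must be a non-empty string")
    if not isinstance(payload["route"], str):
        raise ValueError("route must be a string")
    return RoutingDecision(
        route=Route(payload["route"]),  # ValueError for an unknown route
        case_id=payload["case_id"],
        model_version=model_version,
    )
```

The point of the design is that the model's output is treated as untrusted input: a different model behind the same schema can change which valid route is chosen, but never the set of routes that exist.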
Specialist tool agents
Each one handles a single domain task: document extraction, entity resolution, watchlist screening, or risk scoring. We enforce three rules on each specialist:
- Bounded scope. One agent, one job. No generalists.
- Typed inputs and outputs. No free-text handoffs between agents.
- Confidence-gated fallback. If the model isn't sure, a human gets the task. No exceptions.
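The three rules can be sketched together in a few lines. This is an illustration under stated assumptions, not the deployed code: the watchlist-screening types, the gate function, and the 0.90 threshold are all hypothetical, and a real threshold would come out of model validation.

```python
from dataclasses import dataclass
from typing import Union

CONFIDENCE_THRESHOLD = 0.90  # hypothetical value, set by validation in practice


@dataclass(frozen=True)
class ScreeningRequest:
    """Typed input: one job (watchlist screening), no free text."""
    case_id: str
    full_name: str
    date_of_birth: str  # ISO 8601 date string


@dataclass(frozen=True)
class ScreeningResult:
    """Typed output handed to the next agent."""
    case_id: str
    is_match: bool
    confidence: float  # expected in [0.0, 1.0]


@dataclass(frozen=True)
class HumanReviewTask:
    """What the analyst receives when the agent is not sure."""
    case_id: str
    reason: str
    draft: ScreeningResult


def gate(
    result: ScreeningResult, threshold: float = CONFIDENCE_THRESHOLD
) -> Union[ScreeningResult, HumanReviewTask]:
    """Pass confident results through and route everything else to a human."""
    if not 0.0 <= result.confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if result.confidence < threshold:
        return HumanReviewTask(
            case_id=result.case_id,
            reason=f"confidence {result.confidence:.2f} below {threshold:.2f}",
            draft=result,
        )
    return result
```

Frozen dataclasses keep the handoff honest: a downstream agent cannot quietly rewrite a result it received, it can only produce a new one.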
Human-in-the-loop checkpoints
Every workflow has mandatory review gates. This isn't a trust problem; it's a regulatory one. The OCC is not yet ready to accept fully autonomous decisions on risk assessments. So the agent does roughly 80 percent of the work (extraction, cross-referencing, drafting), and the analyst owns 100 percent of the decision. That split is deliberate.
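One way to make a review gate mandatory rather than advisory is to make release impossible without a recorded sign-off. The sketch below assumes a hypothetical ReviewGate class and a hypothetical set of allowed decisions; it shows the shape of the control, not a specific workflow engine.

```python
from dataclasses import dataclass
from typing import Optional

ALLOWED_DECISIONS = {"approve", "reject", "escalate"}  # hypothetical decision set


@dataclass
class ReviewGate:
    """Holds an agent draft until a named analyst records a decision."""
    case_id: str
    agent_draft: str
    analyst_id: Optional[str] = None
    decision: Optional[str] = None

    def sign_off(self, analyst_id: str, decision: str) -> None:
        """Record who decided and what they decided."""
        if not analyst_id:
            raise ValueError("analyst_id is required")
        if decision not in ALLOWED_DECISIONS:
            raise ValueError(f"unknown decision: {decision!r}")
        self.analyst_id = analyst_id
        self.decision = decision

    def release(self) -> dict:
        """Return the final record, or refuse if nobody has signed off."""
        if self.analyst_id is None or self.decision is None:
            raise PermissionError("no analyst sign-off recorded for this case")
        return {
            "case_id": self.case_id,
            "agent_draft": self.agent_draft,
            "analyst_id": self.analyst_id,
            "decision": self.decision,
        }
```

The agent can populate agent_draft, but only sign_off sets the decision, which keeps the decision field owned by a person.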
Audit and lineage layer
Every agent action, tool invocation, and reasoning step writes to an immutable log. When a regulator asks "why was this customer flagged?", you need the full decision chain, including which model version produced it. This layer is non-negotiable.
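As an illustration of the idea, here is a minimal in-memory sketch of an append-only log where each entry is chained to the previous one by a SHA-256 hash. Hash chaining is an assumption on my part about how to get tamper evidence; it is not a description of the production layer, which would also need write-once storage underneath. The AuditLog class and its field names are hypothetical.

```python
import hashlib
import json
from typing import Dict, List

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry


def _entry_hash(prev_hash: str, record: Dict[str, str]) -> str:
    """Hash the previous entry's hash together with a canonical record body."""
    body = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


class AuditLog:
    """Append-only, hash-chained log. Editing any entry breaks verification."""

    def __init__(self) -> None:
        self._entries: List[dict] = []

    def append(self, case_id: str, actor: str, action: str,
               model_version: str, detail: str) -> str:
        """Add one agent action and return its hash."""
        prev_hash = self._entries[-1]["hash"] if self._entries else GENESIS_HASH
        record = {
            "case_id": case_id,
            "actor": actor,  # which agent or person acted
            "action": action,  # e.g. a tool invocation or reasoning step
            "model_version": model_version,
            "detail": detail,
        }
        entry_hash = _entry_hash(prev_hash, record)
        self._entries.append(
            {"record": record, "prev_hash": prev_hash, "hash": entry_hash}
        )
        return entry_hash

    def verify(self) -> bool:
        """Recompute the whole chain and report whether it is intact."""
        prev_hash = GENESIS_HASH
        for entry in self._entries:
            if entry["prev_hash"] != prev_hash:
                return False
            if entry["hash"] != _entry_hash(prev_hash, entry["record"]):
                return False
            prev_hash = entry["hash"]
        return True

    def decision_chain(self, case_id: str) -> List[Dict[str, str]]:
        """Answer "why was this customer flagged?" in the order it happened."""
        return [e["record"] for e in self._entries
                if e["record"]["case_id"] == case_id]
```

Note that model_version is part of every record, so the decision chain for a case shows which model produced each step.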
Key insight
The best agentic systems in financial services aren't the most autonomous. They're the most auditable. Design for explainability first; automation follows naturally.
Three use cases running in production
KYC/AML document processing
This is where the ROI is clearest. Agents pull entities from identity documents, cross-check against sanctions lists (OFAC, EU, UN), and generate structured risk profiles. On the system we deployed, manual review time dropped by 60 percent while the false-negative rate stayed below 0.1 percent. The key was keeping the human sign-off in the loop; the agent does the legwork, not the judgment call.
Trade surveillance
Traditional rule-based surveillance generates thousands of false positives. Agents change that equation by contextualizing alerts against market conditions, news, and historical trader behavior. We've seen alert-to-investigation ratios improve from roughly 50:1 down to 8:1. The agents aren't replacing the surveillance team; they're filtering the noise so investigators can focus on the signals that matter.
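It helps to turn those ratios into rates. Reading 50:1 as fifty alerts reviewed for every one that becomes an investigation (my assumption about how the ratio is defined), about 2 percent of alerts used to lead anywhere; at 8:1 that is 12.5 percent, and the alerts reviewed per investigation fall by 84 percent. A small sketch of the arithmetic, with hypothetical function names:

```python
def investigation_yield(alerts_per_investigation: float) -> float:
    """Share of alerts that turn into an investigation."""
    if alerts_per_investigation <= 0:
        raise ValueError("alerts_per_investigation must be positive")
    return 1.0 / alerts_per_investigation


def alert_reduction(before: float, after: float) -> float:
    """Fractional drop in alerts reviewed per investigation opened."""
    if before <= 0 or after <= 0:
        raise ValueError("ratios must be positive")
    return 1.0 - after / before


yield_before = investigation_yield(50)  # ratio of 50:1
yield_after = investigation_yield(8)    # ratio of 8:1
reduction = alert_reduction(50, 8)

print(f"yield before: {yield_before:.1%}")
print(f"yield after: {yield_after:.1%}")
print(f"fewer alerts per investigation: {reduction:.0%}")
```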
Regulatory reporting copilots
These agents help compliance teams draft regulatory filings. They pull data from source systems, populate templates, run validation checks, and flag inconsistencies. The compliance officer still owns the filing. What the agent eliminates is the hours of manual data gathering that used to precede each submission.
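The validation step is the easiest part to picture in code. Below is a minimal sketch, assuming the filing template and the source system can both be reduced to named numeric fields; the Inconsistency type, the find_inconsistencies helper, and the 0.01 tolerance are hypothetical. The check only flags differences. It never corrects them, which matches the division of labor above.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Inconsistency:
    field_name: str
    filing_value: Decimal
    source_value: Optional[Decimal]  # None when no source value exists
    reason: str


def find_inconsistencies(
    filing: Dict[str, Decimal],
    source: Dict[str, Decimal],
    tolerance: Decimal = Decimal("0.01"),  # hypothetical rounding tolerance
) -> List[Inconsistency]:
    """Compare each populated filing field with its source-system value."""
    issues: List[Inconsistency] = []
    for name in sorted(filing):
        if name not in source:
            # A value with no authoritative source is itself a finding.
            issues.append(
                Inconsistency(name, filing[name], None, "no source value")
            )
            continue
        if abs(filing[name] - source[name]) > tolerance:
            issues.append(
                Inconsistency(name, filing[name], source[name],
                              "differs from source")
            )
    return issues
```

Decimal is used instead of float so that currency amounts compare exactly rather than by binary approximation.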
Compliance is the engineering problem
Here's what gets lost at AI conferences: building the agent is maybe 30 percent of the effort. The remaining 70 percent is compliance engineering:
- Model risk management (SR 11-7, the Federal Reserve's guidance on model risk). Your AI system is a "model" under that framework. It needs formal validation with the same rigor as a credit risk model.
- Data lineage. Every input to the agent must trace back to an authoritative source. No shortcuts.
- Fairness testing. The system must not produce biased outcomes across protected classes, and you have to be able to demonstrate that with test results. Period.
- Versioning. You must be able to reproduce any past decision using the exact model version (or the exact weights, where you host the model yourself) and the data that were active at the time.
- Explainability. Human-readable reports that articulate why each decision was made. Not for a data scientist. For an examiner.
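Of the items above, fairness testing is the one that most directly reduces to a computation. As one illustration, the sketch below compares the rate of favorable outcomes across groups and flags any group whose rate falls below 80 percent of the best group's rate. That "four-fifths" screen is my choice of heuristic for the example, not something a specific regulation in this article prescribes, and the function names and group labels are hypothetical. A real fairness program uses more than one metric.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def favorable_rates(outcomes: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Share of favorable outcomes per group from (group, is_favorable) pairs."""
    totals: Dict[str, int] = defaultdict(int)
    favorable: Dict[str, int] = defaultdict(int)
    for group, is_favorable in outcomes:
        totals[group] += 1
        favorable[group] += int(is_favorable)
    return {group: favorable[group] / totals[group] for group in totals}


def impact_ratios(rates: Dict[str, float]) -> Dict[str, float]:
    """Each group's rate divided by the highest group's rate."""
    if not rates:
        raise ValueError("no outcomes to compare")
    best = max(rates.values())
    if best == 0:
        raise ValueError("no group has any favorable outcomes")
    return {group: rate / best for group, rate in rates.items()}


def groups_below(ratios: Dict[str, float], threshold: float = 0.8) -> List[str]:
    """Groups whose ratio falls under the screening threshold."""
    return sorted(group for group, ratio in ratios.items() if ratio < threshold)
```

For example, if 8 of 10 cases in one group clear without escalation and 5 of 10 in another do, the second group's ratio is 0.625 and it is flagged for a closer look.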
"Regulation doesn't slow AI adoption down. It forces better engineering. The banks that treat compliance as a design constraint, not an afterthought, deploy faster."
The stack: Databricks + Azure OpenAI
We've standardized on a reference architecture at IBM's One Microsoft Practice that addresses both the AI problem and the governance problem at once:
- Azure OpenAI for model access with enterprise security, content filtering, and data residency controls.
- Databricks Mosaic AI as the agent framework, with MLflow for evaluation and vector search for RAG workloads.
- Unity Catalog as the single governance layer across data access, function permissions, and model lineage.
- Microsoft Fabric for downstream reporting, Power BI dashboards, and sharing results with business teams that don't touch Spark.
This stack works because the two halves depend on each other: strong AI capabilities, and a governance layer that regulators can actually inspect. Neither one matters without the other.
What I'm watching next
Three developments that could shift the field significantly:
- Agent-to-agent protocols. Standards for agents at different institutions to exchange information securely. Think interbank KYC data sharing through AI-mediated APIs. Early, but promising.
- Continuous evaluation. The industry is moving from periodic model validation to real-time drift monitoring. If your agent's behavior changes between quarterly reviews, you should know immediately.
- Regulatory sandboxes. Singapore and the UK are already creating safe-harbor environments for testing autonomous financial AI. Expect the US to follow, slowly.
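On the continuous evaluation item above: one widely used drift statistic in model monitoring is the population stability index (PSI), which compares the distribution of a score or output category at validation time with the distribution seen now. Using PSI here is my assumption for illustration, and the function below is a hypothetical sketch; the inputs are bin proportions that each sum to 1. Teams often alert around 0.1 and 0.25, but those are rules of thumb, not a standard.

```python
import math
from typing import Sequence


def population_stability_index(
    expected: Sequence[float],
    actual: Sequence[float],
    epsilon: float = 1e-6,
) -> float:
    """PSI between two binned distributions given as proportions per bin.

    expected: proportions observed at validation time.
    actual: proportions observed in the current monitoring window.
    epsilon: floor applied to empty bins so the logarithm is defined.
    """
    if len(expected) != len(actual) or len(expected) == 0:
        raise ValueError("expected and actual need the same non-zero bin count")
    for name, dist in (("expected", expected), ("actual", actual)):
        if any(p < 0 for p in dist):
            raise ValueError(f"{name} contains a negative proportion")
        if not math.isclose(sum(dist), 1.0, abs_tol=1e-6):
            raise ValueError(f"{name} proportions must sum to 1")
    total = 0.0
    for e, a in zip(expected, actual):
        e = max(e, epsilon)
        a = max(a, epsilon)
        total += (a - e) * math.log(a / e)
    return total
```

Run on a schedule against the agent's confidence scores or routing choices, a check like this turns "behavior changed between quarterly reviews" into an alert rather than a finding.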
Want to talk architecture?
I work with financial services teams building and deploying these systems. Happy to compare notes.