Responsible AI on Databricks: Taking a Bank's Agent to Production
I have spent years building data and AI platforms on Databricks for financial institutions: lakehouse foundations, Unity Catalog governance, MLflow pipelines, and lately the agentic stack. So this is a point of view, not a product tour. I want to take one problem I am handed almost every week, an AI agent that works in a notebook but cannot get to production, and walk it the rest of the way. The agent in the example triages KYC alerts. The tools are the ones Databricks shipped through 2026: Agent Bricks to build it, Unity Catalog to govern it. The interesting part is not the agent. It is what it takes to make a bank trust it.
The short version
- The problem I am handed almost weekly: a financial-services AI agent that works in a notebook and cannot reach production. The agent is usually fine. The path to production is the gap.
- An agent is not a static model. It plans, calls tools, and behaves differently on every run, so a one-time review board, the control a bank knows, inspects a snapshot of something that will not hold still.
- The way I build this on Databricks: governance is a property of every phase, not a final gate. Agent Bricks gives the shape: Build, Evaluate, Deploy, and Govern.
- Evaluate is where I turn the sign-off into a number. The CLEARS framework scores the agent on six dimensions in MLflow, so a model risk function governs a metric, not a feeling.
- Unity Catalog is the spine. It is the identity, permission, and lineage layer the agent and its data sit inside, and the AI Gateway runs the runtime guardrails on top of it.
- This is now urgent for banks. The US Treasury's Financial Services AI Risk Management Framework (the FS AI RMF), published in February 2026, sets 230 control objectives across the AI lifecycle, and most banks run agentic AI without a mature governance model.
The problem I keep being handed
Almost every week, a team shows me an agent. It works. It does something genuinely useful: in the running example for this piece, it triages KYC alerts, reads the case file, cross-checks the watchlists, and drafts a recommendation. And it lives in a notebook, because nobody will let it near production.
This is not a quality problem. The agent is often good. It is a trust problem, and the trust problem has a specific shape. For a decade, a bank's answer to "can we trust this model" was a review. A model was built, documented, and handed to model risk management for validation under SR 11-7. The review was a gate: pass it, ship it. That gate assumes the system you reviewed is the system that runs. For a credit scorecard, a fixed function, that holds.
An agent is not a fixed function. It plans, calls tools, retries, and composes its own steps from the input it receives. Two KYC alerts that look similar can send it down different paths. A review board inspecting that agent in a test environment is looking at a snapshot of something that will not hold still. You cannot review what you cannot predict.
So the agent sits in the notebook. Not because the model risk team is obstructive, but because the control they hold, a point-in-time gate, is the wrong shape for a system that generates its risk continuously, at runtime, on inputs nobody validated.
How I structure the build
If the risk is generated continuously, the control has to be continuous. So the first thing I change is not the agent. It is the shape of the pipeline. Responsible AI stops being a stage near the end and becomes a property of every stage.
The Databricks agentic stack is built to that shape, which is why I design to it rather than around it. Agent Bricks organises an agent's life into four phases (Build, Evaluate, Deploy, and Govern), and Unity Catalog runs underneath all of them.
The win is not any single feature. It is that a control living in the pipeline runs on every build, every evaluation, and every call, while a control living in a review board runs once. For the KYC agent, that is the difference between "it passed review in March" and "it is scored and guarded on every alert it touches."
Build: on the platform, not beside it
Most agents I am handed were built beside the data, not on it. Someone exported a sample of cases to a notebook, wired up a model, and got a good demo. Production is where that breaks, because the export was a copy and the copy has no governance.
The way I build the KYC agent now is as a Custom Agent deployed as a Databricks App, on serverless compute, so the team ships without re-architecting code or running infrastructure. It reads its context through Unity Catalog, not a copy. That matters more than it sounds. Unity Catalog supplies the schema, the business definitions, the lineage, and the permissions, so the agent reasons over governed data and inherits the access controls that already exist. The analyst who cannot see a customer's records does not get to see them because an agent fetched them on their behalf.
Build on the platform and governance is the substrate. Build beside it and governance is a retrofit, which is exactly the conversation that keeps the agent in the notebook.
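To make the access-inheritance point concrete, here is a minimal sketch of an agent tool that reads on behalf of the requesting analyst. It is a toy stand-in, not the Unity Catalog API: `Grant`, `GovernedCatalog`, `fetch_case_file`, and the `kyc.cases` table name are all hypothetical names I made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Grant:
    """Hypothetical permission: this principal may read this table."""
    principal: str
    table: str


@dataclass
class GovernedCatalog:
    """Toy stand-in for a governed catalog (an assumption, not a real API)."""
    grants: set = field(default_factory=set)
    tables: dict = field(default_factory=dict)

    def select(self, principal: str, table: str) -> list:
        # Every read is checked against the caller's own grants.
        if Grant(principal, table) not in self.grants:
            raise PermissionError(f"{principal} cannot read {table}")
        return list(self.tables[table])


def fetch_case_file(catalog: GovernedCatalog, on_behalf_of: str, table: str) -> list:
    """Agent tool: reads as the requesting analyst, never as a super-user."""
    return catalog.select(on_behalf_of, table)


catalog = GovernedCatalog(
    grants={Grant("analyst_a", "kyc.cases")},
    tables={"kyc.cases": [{"case_id": 1, "status": "open"}]},
)
```

With this setup, `fetch_case_file(catalog, "analyst_a", "kyc.cases")` returns the case rows, while the same call for `"analyst_b"` raises `PermissionError`. The agent adds no access of its own, which is the property the copy-in-a-notebook approach loses.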
Evaluate: I turn the sign-off into a number
The weakest link in every responsible-AI programme I have seen is evaluation. "The agent works well" is not a control. It is an opinion, and opinions do not survive an examiner.
So in the Evaluate phase I make the sign-off a number. The CLEARS framework scores the agent across six dimensions: correctness, latency, execution, adherence, relevance, and safety. It runs in MLflow, the same place a bank's model metrics already live, and Agent Bricks generates synthetic task data and uses model-based judges, so the score is repeatable rather than hand-curated.
This is the move that gets the KYC agent out of the notebook. A model risk function cannot sign off on a feeling. It can sign off on a metric with a threshold, a method, and a history. Six scored dimensions, regenerated on every change, is a thing the second line can actually govern. It also gives my build team a target: "raise adherence above the threshold" is an engineering task; "make the agent more responsible" is not.
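A gate of that kind can be sketched in a few lines. This is an illustration under assumptions, not how Agent Bricks or MLflow computes anything: the threshold values are invented, every score is assumed to be normalised to a 0 to 1 scale where higher is better (latency included), and `release_gate` is a hypothetical name. In practice the scores would come out of the evaluation run rather than a hand-written dictionary.

```python
# The six dimensions named above.
CLEARS_DIMENSIONS = (
    "correctness", "latency", "execution", "adherence", "relevance", "safety",
)

# Hypothetical thresholds on a 0-1 scale where higher is better.
THRESHOLDS = {
    "correctness": 0.90,
    "latency": 0.80,
    "execution": 0.85,
    "adherence": 0.90,
    "relevance": 0.85,
    "safety": 0.95,
}


def release_gate(scores: dict, thresholds: dict = THRESHOLDS) -> tuple:
    """Return (passed, failing_dimensions) for one evaluation run."""
    # A dimension that was never scored must fail loudly, not pass silently.
    missing = [d for d in CLEARS_DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"unscored dimensions: {missing}")
    failures = [d for d in CLEARS_DIMENSIONS if scores[d] < thresholds[d]]
    return len(failures) == 0, failures
```

A run that scores 0.96 everywhere except 0.82 on adherence returns `(False, ["adherence"])`. That failing list is the engineering task: it names the dimension to raise, and the history of these results is what the second line reviews.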
Govern: Unity Catalog is the spine
When people say "AI governance" they often mean a document. The governance that actually lets a KYC agent run inside a bank is not written. It runs, and on Databricks it has a name: Unity Catalog.
Unity Catalog is the spine. It is the single place identity, permissions, lineage, and audit live, for data, for models, for functions, and now for agents. The agent does not get its own parallel access model. It inherits the one the bank already governs. Every call the agent makes is attributable to an identity and written to a log.
The AI Gateway runs on top of that spine. It sits in front of models, coding agents, and tools reached over MCP, and it enforces identity, permissions, and observability on every interaction. Its guardrails execute on live traffic: detection for PII exposure, prompt injection, data exfiltration, unsafe content, and hallucination.
For the KYC agent this is the whole game. The agent handles customer PII and drafts a risk recommendation. A guardrail in a policy binder stops nothing. A guardrail in the request path stops a prompt-injection attempt, or an attempt to pull a customer record it should not touch, at 2 a.m. on a Sunday, without anyone being asked. That is the difference between a control I can describe to an examiner and one I can demonstrate.
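As a rough picture of what "in the request path" means, here is a toy wrapper that screens the prompt before the agent sees it and screens the response before anyone else does. It is not the AI Gateway's implementation: the marker phrases, the single SSN-style pattern, and the names `screen_prompt` and `guarded_call` are assumptions chosen for illustration, and real detectors are far more capable than a substring match and one regular expression.

```python
import re
from typing import Callable

# Toy detectors (assumptions for illustration only).
INJECTION_MARKERS = (
    "ignore previous instructions",
    "disregard your instructions",
)
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def screen_prompt(prompt: str) -> bool:
    """Return True if the prompt looks like an injection attempt."""
    lowered = prompt.lower()
    return any(marker in lowered for marker in INJECTION_MARKERS)


def guarded_call(prompt: str, agent: Callable[[str], str]) -> str:
    """Run the agent only if the prompt passes, and release the
    response only if it carries no SSN-style identifier."""
    if screen_prompt(prompt):
        return "BLOCKED: prompt_injection"
    response = agent(prompt)
    if SSN_PATTERN.search(response):
        return "BLOCKED: pii_exposure"
    return response
```

The design point is the placement, not the detectors. Because the check wraps every call, nothing reaches or leaves the agent without passing through it, and a blocked request needs no human awake to block it.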
Why a bank cannot wait this out
Two things happened in early 2026 that put a clock on this.
The first is the US Treasury's Financial Services AI Risk Management Framework, published in February 2026. It is aligned to the NIST AI RMF but written for financial institutions, and it sets out 230 control objectives mapped across the AI lifecycle. It is soft law: non-binding, but it standardises what "good" looks like, and soft law is what an examiner reaches for first. The EU AI Act's operational deadlines are phasing in across the same period.
The second is the gap. EY's 2026 outlook reports that more than 70 percent of banking firms now use agentic AI to some degree, while industry surveys put the share with a mature governance model for autonomous agents at roughly one in five.
I read that gap as the real exposure. It is not that banks are failing to adopt agents; they are adopting them, the ones I work with included. It is that adoption has run ahead of governance, and a 230-control framework has just told every examiner exactly what to ask for.
The lifecycle is the unit of control
The FS AI RMF does one thing I want every team I work with to copy: it maps its controls across the whole AI lifecycle, design through monitoring, not at a single checkpoint. That is the same move as putting governance in the pipeline.
A control that exists at only one stage is a control I have already lost. Bias tested at design but not monitored in production drifts. A guardrail set at deployment but not re-checked after a model swap goes stale. The KYC agent that passed in March and was never re-scored is an unreviewed agent by June. The lifecycle, running as a loop, is the unit of control, and a platform that scores and guards the agent on every iteration enforces that loop whether the team thinks in those terms or not.
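The "passed in March, unreviewed by June" failure is easy to check for mechanically. The sketch below assumes an evaluation record that stores the model version and the date it was scored. The 30-day maximum age is an invented default, and `EvaluationRecord` and `needs_reevaluation` are hypothetical names, not part of any platform API.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class EvaluationRecord:
    """Hypothetical record of the last scored evaluation."""
    model_version: str
    evaluated_on: date


def needs_reevaluation(
    last: Optional[EvaluationRecord],
    deployed_model_version: str,
    today: date,
    max_age: timedelta = timedelta(days=30),  # assumed policy, not a standard
) -> bool:
    """True if the agent in production is effectively unreviewed."""
    if last is None:
        return True  # never scored at all
    if last.model_version != deployed_model_version:
        return True  # model swapped since the last score
    return today - last.evaluated_on > max_age  # score has gone stale
```

An agent scored on 1 March 2026 and checked on 1 June 2026 comes back `True` even with the same model version, and any model swap comes back `True` immediately. Run on a schedule or as a deployment check, this turns "govern the lifecycle" into something that fails visibly.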
What I hand the team on day one
When I start one of these, here is what goes on the table on day one. Five things, in order.
- Map a control framework to the pipeline, not the org chart. The NIST AI RMF and the Treasury FS AI RMF both give you repeatable control categories. Map each to a pipeline phase: build, evaluate, deploy, govern. A control with no phase is a control with no owner.
- Make evaluation emit numbers. If the agent review produces prose, it is not a gate. A gate produces a score against a threshold. CLEARS-style scoring belongs in the same place the bank's model metrics already live.
- Put the agent on Unity Catalog from the first commit. Identity, lineage, and permissions are not a deployment task. If the agent reads a copy of the data, the governance conversation is already lost.
- Treat guardrails as runtime services. PII detection, prompt-injection screening, and exfiltration checks belong in the request path, not in a standard you circulate. If a guardrail cannot block something at 2 a.m. unattended, it is documentation.
- Govern the lifecycle, not the launch. Re-evaluate on every model swap and every material change. The platform should make shipping the governed way the lowest-effort way, so the team never has a reason to route around it.
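The first item on that list can start as something this small. The control names below are invented for illustration and are not taken from the NIST AI RMF or the FS AI RMF, and `unowned_controls` is a hypothetical helper. The only idea it carries is that every control maps to one of the four phases or gets flagged.

```python
PHASES = ("build", "evaluate", "deploy", "govern")

# Hypothetical control names mapped to the pipeline phase that enforces them.
CONTROL_TO_PHASE = {
    "agent-reads-governed-tables": "build",
    "scored-evaluation-with-threshold": "evaluate",
    "release-blocked-on-failed-gate": "deploy",
    "runtime-guardrails-in-request-path": "govern",
    "annual-policy-attestation": None,  # exists on paper, enforced nowhere
}


def unowned_controls(mapping: dict) -> list:
    """Return controls that no pipeline phase enforces, sorted by name."""
    return sorted(
        name for name, phase in mapping.items() if phase not in PHASES
    )
```

Here `unowned_controls(CONTROL_TO_PHASE)` returns `["annual-policy-attestation"]`. Anything on that list is either a control to wire into a phase or a document to stop calling a control.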
Responsible AI is something you build
Responsible AI is not a document, a board, or a one-time sign-off. For an agent it is a property of the pipeline that builds, scores, ships, and governs it, on every iteration.
My point of view, after enough of these: the platform decides whether that property is realistic or aspirational. Agent Bricks and Unity Catalog matter because they put evaluation and guardrails where the work happens, so the governed path and the fast path are the same path. The KYC agent reaches production not when the policy is thick enough, but when the pipeline makes shipping it any other way the harder option. That is the version of responsible AI a bank can actually run, and it is the one I build.
Want to talk architecture?
I work with financial services teams designing and governing agentic AI on Databricks and Microsoft Fabric. Happy to compare notes.