
Responsible AI on Databricks: Taking a Bank's Agent to Production

I have spent years building data and AI platforms on Databricks for financial institutions: lakehouse foundations, Unity Catalog governance, MLflow pipelines, and lately the agentic stack. So this is a point of view, not a product tour. I want to take one problem I am handed almost every week, an AI agent that works in a notebook but cannot get to production, and walk it the rest of the way. The agent in the example triages KYC alerts. The tools are the ones Databricks shipped through 2026: Agent Bricks to build it, Unity Catalog to govern it. The interesting part is not the agent. It is what it takes to make a bank trust it.

Views are my own and do not represent IBM. This piece reflects personal analysis of public information; nothing here references confidential client work.


The problem I keep being handed

Almost every week, a team shows me an agent. It works. It does something genuinely useful: in the running example for this piece, it triages KYC alerts, reads the case file, cross-checks the watchlists, and drafts a recommendation. And it lives in a notebook, because nobody will let it near production.

This is not a quality problem. The agent is often good. It is a trust problem, and the trust problem has a specific shape. For a decade, a bank's answer to "can we trust this model" was a review. A model was built, documented, and handed to model risk management for validation under SR 11-7. The review was a gate: pass it, ship it. That gate assumes the system you reviewed is the system that runs. For a credit scorecard, a fixed function, that holds.

An agent is not a fixed function. It plans, calls tools, retries, and composes its own steps from the input it receives. Two KYC alerts that look similar can send it down different paths. A review board inspecting that agent in a test environment is looking at a snapshot of something that will not hold still. You cannot review what you cannot predict.

[Figure: the review-gate model against agent behaviour. A linear pipeline runs Build, Train, Validate, a responsible-AI review gate, then Production. Past Production the agent fans out into many branching paths, none of which the review covered.]
Fig. 1: A point-in-time review gate sits before production. The agent generates its risk after that gate, at runtime, along paths nobody validated.

So the agent sits in the notebook. Not because the model risk team is obstructive, but because the control they hold, a point-in-time gate, is the wrong shape for a system that generates its risk continuously, at runtime, on inputs nobody validated.

How I structure the build

If the risk is generated continuously, the control has to be continuous. So the first thing I change is not the agent. It is the shape of the pipeline. Responsible AI stops being a stage near the end and becomes a property of every stage.

The Databricks agentic stack is built to that shape, which is why I design to it rather than around it. Agent Bricks organises an agent's life into four phases, and Unity Catalog runs underneath all of them.

[Figure: the Build, Evaluate, Deploy, Govern pipeline. Four connected phases run left to right: Build with Custom Agents on Databricks Apps, Evaluate with CLEARS scoring in MLflow, Deploy with Supervisor Agent orchestration, and Govern with Unity Catalog and the AI Gateway. A continuous governance band runs beneath all four.]
Fig. 2: Build, Evaluate, Deploy, Govern. Governance is not a phase at the end. Unity Catalog is the band under all of them, enforced on every iteration.

The win is not any single feature. It is that a control living in the pipeline runs on every build, every evaluation, and every call, while a control living in a review board runs once. For the KYC agent, that is the difference between "it passed review in March" and "it is scored and guarded on every alert it touches."

Build: on the platform, not beside it

Most agents I am handed were built beside the data, not on it. Someone exported a sample of cases to a notebook, wired up a model, and got a good demo. Production is where that breaks, because the export was a copy and the copy has no governance.

The way I build the KYC agent now is as a Custom Agent deployed as a Databricks App, on serverless compute, so the team ships without re-architecting code or running infrastructure. It reads its context through Unity Catalog, not a copy. That matters more than it sounds. Unity Catalog supplies the schema, the business definitions, the lineage, and the permissions, so the agent reasons over governed data and inherits the access controls that already exist. The analyst who cannot see a customer's records does not get to see them because an agent fetched them on their behalf.
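
To make that inheritance concrete, here is a minimal sketch in plain Python. It does not call any Databricks or Unity Catalog API: `GRANTS`, `RECORDS`, `fetch_customer_record`, and `agent_fetch` are hypothetical names standing in for the catalog's permission check. The only point it illustrates is that the agent forwards the caller's identity rather than using a privileged identity of its own.

```python
# Illustrative only: a toy permission table standing in for catalog grants.
# None of these names are Databricks or Unity Catalog APIs.
GRANTS = {
    "analyst_a": {"kyc.cases"},
    "analyst_b": {"kyc.cases", "kyc.customer_records"},
}

RECORDS = {"cust-001": {"name": "Example Customer", "risk": "medium"}}


class AccessDenied(Exception):
    """Raised when the calling identity lacks a grant on the table."""


def fetch_customer_record(caller: str, customer_id: str) -> dict:
    # The check runs against the caller, not against the agent.
    if "kyc.customer_records" not in GRANTS.get(caller, set()):
        raise AccessDenied(f"{caller} has no grant on kyc.customer_records")
    return RECORDS[customer_id]


def agent_fetch(on_behalf_of: str, customer_id: str) -> dict:
    # The agent holds no grants of its own; it forwards the user's identity.
    return fetch_customer_record(on_behalf_of, customer_id)
```

In this toy, `agent_fetch("analyst_b", "cust-001")` returns the record, while the same call for `analyst_a` raises `AccessDenied`. Asking through the agent changes nothing about who is allowed to see what.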

Build on the platform and governance is the substrate. Build beside it and governance is a retrofit, which is exactly the conversation that keeps the agent in the notebook.

Evaluate: I turn the sign-off into a number

The weakest link in every responsible-AI programme I have seen is evaluation. "The agent works well" is not a control. It is an opinion, and opinions do not survive an examiner.

So in the Evaluate phase I make the sign-off a number. The CLEARS framework scores the agent across six dimensions: correctness, latency, execution, adherence, relevance, and safety. It runs in MLflow, the same place a bank's model metrics already live, and Agent Bricks generates synthetic task data and uses model-based judges, so the score is repeatable rather than hand-curated.

[Figure: the CLEARS evaluation framework. A six-sided radar with one axis for each dimension: correctness, latency, execution, adherence, relevance, and safety, scored in MLflow.]
Fig. 3: CLEARS scores an agent on six dimensions. "Responsible" stops being a feeling and becomes a metric with a threshold, a method, and a history.

This is the move that gets the KYC agent out of the notebook. A model risk function cannot sign off on a feeling. It can sign off on a metric with a threshold, a method, and a history. Six scored dimensions, regenerated on every change, is a thing the second line can actually govern. It also gives my build team a target: "raise adherence above the threshold" is an engineering task; "make the agent more responsible" is not.
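
A sketch of what "a metric with a threshold" can look like as a gate, in plain Python. The six dimension names come from CLEARS as described above; everything else is my assumption for illustration: the 0 to 1 scale where higher is better (including for latency), the threshold values, and the names `THRESHOLDS`, `GateResult`, and `clears_gate`. It is not the Agent Bricks or MLflow API.

```python
from dataclasses import dataclass

# The six CLEARS dimensions named in the text.
DIMENSIONS = ("correctness", "latency", "execution", "adherence", "relevance", "safety")

# Hypothetical thresholds on a 0-1 scale where higher is better.
THRESHOLDS = {
    "correctness": 0.90,
    "latency": 0.80,
    "execution": 0.85,
    "adherence": 0.90,
    "relevance": 0.85,
    "safety": 0.95,
}


@dataclass(frozen=True)
class GateResult:
    passed: bool
    failures: dict  # dimension -> (score, threshold)


def clears_gate(scores: dict, thresholds: dict = THRESHOLDS) -> GateResult:
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        # Refuse to pass the gate on partial evidence.
        raise ValueError(f"missing scores for: {', '.join(missing)}")
    # Collect every dimension that falls below its threshold.
    failures = {
        d: (scores[d], thresholds[d])
        for d in DIMENSIONS
        if scores[d] < thresholds[d]
    }
    return GateResult(passed=not failures, failures=failures)
```

A failing result names the dimension, the score, and the threshold it missed, which is the form in which "raise adherence above the threshold" becomes a ticket rather than a debate.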

Govern: Unity Catalog is the spine

When people say "AI governance" they often mean a document. The governance that actually lets a KYC agent run inside a bank is not written down; it runs. On Databricks it has a name: Unity Catalog.

Unity Catalog is the spine. It is the single place identity, permissions, lineage, and audit live, for data, for models, for functions, and now for agents. The agent does not get its own parallel access model. It inherits the one the bank already governs. Every call the agent makes is attributable to an identity and written to a log.

The AI Gateway runs on top of that spine. It sits in front of models, coding agents, and tools reached over MCP, and it enforces identity, permissions, and observability on every interaction. Its guardrails execute on live traffic: detection for PII exposure, prompt injection, data exfiltration, unsafe content, and hallucination.

[Figure: the AI Gateway runtime guardrails. Agent traffic (request and output) passes through the AI Gateway, which applies identity, guardrails, and observability. Five threats are caught at the gateway: PII exposure, prompt injection, data exfiltration, unsafe content, and hallucination. Clean output passes through to the user.]
Fig. 4: The AI Gateway runs in the request path. Each guardrail blocks a class of failure on live traffic, unattended, every time.

For the KYC agent this is the whole game. The agent handles customer PII and drafts a risk recommendation. A guardrail in a policy binder stops nothing. A guardrail in the request path stops a prompt-injection attempt, or an attempt to pull a customer record it should not touch, at 2 a.m. on a Sunday, without anyone being asked. That is the difference between a control I can describe to an examiner and one I can demonstrate.
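
To show what "in the request path" means structurally, here is a toy sketch in plain Python. It is not how the AI Gateway is implemented or configured. The detectors are deliberately naive (a regex for SSN-like strings, a short phrase list for injection), and `pii_guardrail`, `injection_guardrail`, and `guarded_call` are hypothetical names. What it illustrates is the shape: every request and every response passes through the checks, and a block needs no human in the loop.

```python
import re
from typing import Callable, Optional

# Toy detectors for illustration. Real PII and injection detection is far
# more involved; these patterns are assumptions, not the AI Gateway's rules.
SSN_LIKE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
INJECTION_PHRASES = ("ignore previous instructions", "disregard your instructions")

Guardrail = Callable[[str], Optional[str]]  # returns a reason when it blocks


def pii_guardrail(text: str) -> Optional[str]:
    return "pii_exposure" if SSN_LIKE.search(text) else None


def injection_guardrail(text: str) -> Optional[str]:
    lowered = text.lower()
    return "prompt_injection" if any(p in lowered for p in INJECTION_PHRASES) else None


def guarded_call(prompt: str, model: Callable[[str], str],
                 input_guards: list, output_guards: list) -> dict:
    # Screen the request before the model sees it.
    for guard in input_guards:
        reason = guard(prompt)
        if reason:
            return {"blocked": True, "stage": "input", "reason": reason}
    output = model(prompt)
    # Screen the response before the user sees it.
    for guard in output_guards:
        reason = guard(output)
        if reason:
            return {"blocked": True, "stage": "output", "reason": reason}
    return {"blocked": False, "output": output}
```

Each blocked call returns a stage and a reason, so the same structure that stops the request also produces the evidence I can show an examiner.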

Why a bank cannot wait this out

Two things happened in early 2026 that put a clock on this.

The first is the US Treasury's Financial Services AI Risk Management Framework, published in February 2026. It is aligned to the NIST AI RMF but written for financial institutions, and it sets out 230 control objectives mapped across the AI lifecycle. It is soft law: non-binding, but it standardises what "good" looks like, and soft law is what an examiner reaches for first. The EU AI Act's operational deadlines are phasing in across the same period.

The second is the gap. EY's 2026 outlook reports that more than 70 percent of banking firms now use agentic AI to some degree, while industry surveys put the share with a mature governance model for autonomous agents at roughly one in five.

[Figure: the agentic AI governance gap. Two bars: banks using agentic AI, at more than 70 percent, and banks with a mature agent governance model, at roughly one in five. The distance between them is the governance gap.]
Fig. 5: Adoption has run ahead of governance. The gap between the two bars is where the regulatory risk now sits.

I read that gap as the real exposure. It is not that banks are not using agents, mine included. It is that adoption has run ahead of governance, and a 230-control framework has just told every examiner exactly what to ask for.

The lifecycle is the unit of control

The FS AI RMF does one thing I want every team I work with to copy: it maps its controls across the whole AI lifecycle, design through monitoring, not at a single checkpoint. That is the same move as putting governance in the pipeline.

[Figure: the AI lifecycle as a control loop. Four stages, design, develop, deploy, and monitor, sit around a circular loop, with the FS AI RMF's 230 control objectives at the centre.]
Fig. 6: Design, develop, deploy, monitor, running as a loop. A control that exists at only one stage is a control you have already lost.

A control that exists at only one stage is a control I have already lost. Bias tested at design but not monitored in production drifts. A guardrail set at deployment but not re-checked after a model swap goes stale. The KYC agent that passed in March and was never re-scored is an unreviewed agent by June. The lifecycle, running as a loop, is the unit of control, and a platform that scores and guards the agent on every iteration enforces that loop whether the team thinks in those terms or not.
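
One way to make "passed in March, unreviewed by June" detectable is to tie each evaluation to a fingerprint of what was evaluated. The sketch below is plain Python under my own assumptions: the choice of inputs to the hash (model version, system prompt, tool list) and the names `agent_fingerprint` and `evaluation_is_current` are hypothetical, not a platform feature.

```python
import hashlib
import json


def agent_fingerprint(model_version: str, system_prompt: str, tools: list) -> str:
    # Hash everything whose change should force a re-evaluation.
    # Tools are sorted so that ordering alone does not change the hash.
    payload = json.dumps(
        {"model": model_version, "prompt": system_prompt, "tools": sorted(tools)},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def evaluation_is_current(last_evaluated_fingerprint: str, model_version: str,
                          system_prompt: str, tools: list) -> bool:
    # The stored evaluation only counts if it was run on this exact agent.
    current = agent_fingerprint(model_version, system_prompt, tools)
    return last_evaluated_fingerprint == current
```

If the model is swapped, the prompt edited, or a tool added, the fingerprint changes and the old score no longer counts. A deployment step can refuse to proceed until the agent is re-scored.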

What I hand the team on day one

When I start one of these, here is what goes on the table on day one. Five things, in order.

  1. Map a control framework to the pipeline, not the org chart. Both the NIST AI RMF and the Treasury FS AI RMF give you repeatable control categories. Map each to a pipeline phase: build, evaluate, deploy, govern. A control with no phase is a control with no owner.
  2. Make evaluation emit numbers. If the agent review produces prose, it is not a gate. A gate produces a score against a threshold. CLEARS-style scoring belongs in the same place the bank's model metrics already live.
  3. Put the agent on Unity Catalog from the first commit. Identity, lineage, and permissions are not a deployment task. If the agent reads a copy of the data, the governance conversation is already lost.
  4. Treat guardrails as runtime services. PII detection, prompt-injection screening, and exfiltration checks belong in the request path, not in a standard you circulate. If a guardrail cannot block something at 2 a.m. unattended, it is documentation.
  5. Govern the lifecycle, not the launch. Re-evaluate on every model swap and every material change. The platform should make shipping the governed way the lowest-effort way, so the team never has a reason to route around it.
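
The first item on that list can start as something very small. Below is a sketch of a control register checked against the four pipeline phases, in plain Python. The control IDs, descriptions, and owners are placeholders I made up for illustration; they are not entries from the NIST AI RMF or the Treasury FS AI RMF, and `unowned_controls` and `coverage_by_phase` are hypothetical helpers.

```python
PHASES = ("build", "evaluate", "deploy", "govern")

# Hypothetical control register. The IDs and descriptions are placeholders,
# not entries from the NIST AI RMF or the Treasury FS AI RMF.
CONTROLS = [
    {"id": "CTRL-01", "description": "Agent reads governed data only",
     "phase": "build", "owner": "platform"},
    {"id": "CTRL-02", "description": "Scored evaluation against thresholds",
     "phase": "evaluate", "owner": "model-risk"},
    {"id": "CTRL-03", "description": "Runtime PII screening",
     "phase": "govern", "owner": "security"},
    {"id": "CTRL-04", "description": "Re-evaluate after model swap",
     "phase": None, "owner": None},
]


def unowned_controls(controls: list) -> list:
    # A control with no valid phase or no owner is flagged.
    return [c["id"] for c in controls
            if c.get("phase") not in PHASES or not c.get("owner")]


def coverage_by_phase(controls: list) -> dict:
    # Count the controls mapped to each phase; a zero is a gap to explain.
    counts = {phase: 0 for phase in PHASES}
    for c in controls:
        if c.get("phase") in PHASES:
            counts[c["phase"]] += 1
    return counts
```

Run against the placeholder register, this flags `CTRL-04` as unowned and shows zero controls mapped to the deploy phase. Those are the two conversations I want on day one, not in the week before an exam.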

Responsible AI is something you build

Responsible AI is not a document, a board, or a one-time sign-off. For an agent it is a property of the pipeline that builds, scores, ships, and governs it, on every iteration.

My point of view, after enough of these: the platform decides whether that property is realistic or aspirational. Agent Bricks and Unity Catalog matter because they put evaluation and guardrails where the work happens, so the governed path and the fast path are the same path. The KYC agent reaches production not when the policy is thick enough, but when the pipeline makes shipping it any other way the harder option. That is the version of responsible AI a bank can actually run, and it is the one I build.

Reminder: This reflects my personal analysis and opinions. It does not represent the views, strategy, or endorsement of IBM, Databricks, Microsoft, or any other organization. All trademarks belong to their respective owners.

Want to talk architecture?

I work with financial services teams designing and governing agentic AI on Databricks and Microsoft Fabric. Happy to compare notes.
