What Changed with GPT-4

Reassessing AI for Banking

June 17, 2023 · 4 min read

GPT-4 was released in March 2023, and I've had time since then to spend with it on personal projects, comparing it against 3.5. The difference is not subtle.

When ChatGPT launched, the wonder phase set the bar high. GPT-4 clears that bar in a way that changes my mental model of what these tools could eventually do in a regulated context, even though my engagement with them stays on personal time, on personal projects, until policy and governance catch up.

The Gap

The reasoning is noticeably better. GPT-4 follows multi-step instructions without losing the thread. It hallucinates less, though it still does. Give it something complicated and ask it to work through the logic, and it actually works through the logic instead of generating something that looks right but falls apart on inspection.

The context window is bigger too. GPT-3.5 could handle about 4,000 tokens. GPT-4 handles about 8,000 by default and up to about 32,000 in the extended version. That matters when you're working with anything longer than a few paragraphs. The kinds of documents that would eventually be useful to feed into one of these models (regulatory guidance, policy manuals, long-form analyses) are not short.
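To make those window sizes concrete, here is a rough Python sketch that estimates which models a document would fit into. The limits are the rounded figures from this post. The four-characters-per-token ratio is a common rule of thumb for English text, not a real tokenizer count, and the function names and the 500-token reply reserve are assumptions for illustration.

```python
import math

# Rounded context limits as described in this post (tokens).
CONTEXT_LIMITS = {"gpt-3.5": 4_000, "gpt-4": 8_000, "gpt-4-32k": 32_000}


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate. ASSUMPTION: about 4 characters per token."""
    return math.ceil(len(text) / chars_per_token)


def models_that_fit(text: str, reserve_for_reply: int = 500) -> list[str]:
    """Return the models whose window holds the text plus room for a reply.

    reserve_for_reply is a hypothetical budget for the model's answer.
    """
    needed = estimate_tokens(text) + reserve_for_reply
    return [name for name, limit in CONTEXT_LIMITS.items() if needed <= limit]


if __name__ == "__main__":
    policy_manual = "x" * 40_000  # stand-in for a roughly 40,000-character document
    print(models_that_fit(policy_manual))
```

A 40,000-character document comes out to roughly 10,000 tokens under this estimate, which is already past the default GPT-4 window before the model has written a word.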

Where the Threshold Moved

A few months ago, when I was thinking through where AI might fit in a banking context and where it would be too risky, I sorted use cases into roughly two buckets, things that looked promising and things to wait on, with a framework for deciding between them. GPT-4 shifts some of those lines.

Long-document summarization was on the promising side, and it was already decent with 3.5. With GPT-4, it is meaningfully better. It catches nuance. It can pull out the parts relevant to a particular context instead of giving a generic summary. The longer context window helps here too.

First-draft writing was borderline before. With 3.5, the output needed heavy editing. With GPT-4, the first drafts are closer to usable. They still need review, but the starting point is better. I've felt that consistently in personal projects.

The wait list is more interesting. Regulatory reporting and credit decisions belong on it, and they still belong on it. Even with better reasoning, the cost of a mistake in a call report or a fair lending decision is too high to ever hand off entirely to a model. But some of the analysis that feeds into those decisions feels more approachable as a future use case, once the policy and governance work is done. Not the decisions themselves. The supporting work.

The Limits

GPT-4 still hallucinates. Less often, but it is confident when it does, which might be worse. For anything that touches compliance or client-facing communication, human review will not be optional. That has not changed, and I do not think it changes soon.

It also struggles with arithmetic in ways that matter for finance. Ask it to do math on financial data and it will sometimes get it wrong. Not often, but often enough that I would not trust it with anything where precision matters.
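A minimal sketch of one way to keep the arithmetic out of the model's hands, assuming the amounts arrive as strings: let ordinary code compute the figure with exact decimal math, and treat the model's number as something to check rather than trust. The function names here are hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def total_cents(amounts: list[str]) -> Decimal:
    """Sum string amounts exactly and round to the cent."""
    total = sum((Decimal(a) for a in amounts), Decimal("0"))
    return total.quantize(CENT, rounding=ROUND_HALF_UP)


def check_model_total(amounts: list[str], model_total: str) -> bool:
    """True only if the model's stated total matches the exact sum."""
    stated = Decimal(model_total).quantize(CENT, rounding=ROUND_HALF_UP)
    return total_cents(amounts) == stated


if __name__ == "__main__":
    balances = ["1250.75", "349.25"]
    print(total_cents(balances))
    print(check_model_total(balances, "1600.10"))  # a plausible-looking wrong answer
```

Using `Decimal` instead of floats matters for the same reason: binary floating point cannot represent amounts like 0.10 exactly, so even the checking code would drift.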

The Real Question

What strikes me most is not any single capability. It is that GPT-4 crossed a threshold where use cases I had been thinking about abstractly start to look practical, for when the time comes. The conversation moves from "could AI theoretically do this?" to "this is the kind of capability the eventual policy work has to account for."

That is a different conversation from the one I was having a year ago. The technology has gotten stronger faster than the operational side can keep up, and that gap is where the careful work happens. Policy, governance, vendor relationships, and training need to catch up with what the technology can actually do, and that work takes time. Banks have always been good at that kind of disciplined integration, and the AI work will follow the same pattern. The pace of improvement just makes the runway feel shorter.
