Coding Back Office

Eleos Health · 2025 · In Development

AI that finds the revenue your clinicians are leaving behind.

Behavioral health organizations systematically undercode, not because clinicians are negligent, but because CPT and E/M coding is genuinely complex, inconsistently trained, and completely invisible until a claim is denied. I'm designing the AI tool that surfaces this gap, with audit-ready justification, before money is lost.

Company
Eleos Health
Timeline
2025–Present
My role
Lead Designer
Status
Alpha · April 2026

There's money missing from every behavioral health claim.

Behavioral health is one of the most undercoded specialties in medicine. Clinicians spend significant time with patients, time they document meticulously for clinical reasons, and then bill for shorter, lower-reimbursed service codes than the documentation supports.

This isn't negligence. It's a training problem, a workflow problem, and an information problem. CPT coding for behavioral health involves judgment calls about session duration, service type, add-ons, and payor-specific rules that most clinicians were never formally taught. The feedback loop is also broken: claims are coded at the point of service, but denials arrive weeks later, with no connection back to the original coding decision.

For a mid-sized behavioral health organization seeing a few hundred patients a week, the revenue gap can be substantial, often six figures annually, sometimes more. And unlike clinical errors, this one is fixable without changing how care is delivered.

Why now? The data is already there. Behavioral health providers using Eleos generate rich session notes tied to specific clinicians, dates, and durations. With SFTP access to historical billing records, we can cross-reference what was documented against what was billed, without requiring any clinician onboarding or workflow change for the initial analysis.
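To make the shape of that initial analysis concrete, here's a minimal sketch of the cross-reference join, assuming simplified record shapes. Every field name below is illustrative; none of it reflects Eleos's actual schema.

```typescript
// Illustrative only: simplified record shapes for the historical
// analysis. Assumes at most one claim per clinician per service date,
// which a real pipeline would not.
interface SessionNote {
  clinicianId: string;
  sessionDate: string; // ISO date, e.g. "2025-06-12"
  documentedMinutes: number;
}

interface BilledClaim {
  clinicianId: string;
  serviceDate: string; // ISO date
  cptCode: string;
}

// Join documentation to claims on clinician + date: the minimum needed
// to ask "does what was billed match what was documented?"
function crossReference(notes: SessionNote[], claims: BilledClaim[]) {
  const byKey = new Map<string, BilledClaim>();
  for (const claim of claims) {
    byKey.set(`${claim.clinicianId}|${claim.serviceDate}`, claim);
  }
  return notes.flatMap((note) => {
    const claim = byKey.get(`${note.clinicianId}|${note.sessionDate}`);
    return claim ? [{ note, claim }] : [];
  });
}
```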

15%+
Average revenue uplift identified in organizations audited before launch
4 types
Distinct undercoding patterns surfaced across behavioral health billing data
0 EHR changes
Required from clinicians to generate the initial opportunity analysis

I started by learning the domain, not the design.

Before I opened Figma, I spent time with billing analysts, RCM directors, and clinicians to understand how coding actually works at behavioral health organizations. This is a domain where the terminology alone (CPT codes, E/M levels, HCPCS modifiers, add-on codes, prior auth) can obscure more than it reveals.

I wanted to understand the jobs-to-be-done at different levels. A billing director wants a portfolio view: where are the biggest opportunities across my organization? A coder working the back office wants a workflow: show me the claims with the highest yield, give me the evidence, let me act. A clinical supervisor wants confidence: is this defensible if we get audited?

"We know we're undercoding. We just have no way to know where or by how much. It shows up in denials occasionally, but by then it's too late."

RCM Director · Discovery interview

What emerged from discovery was a clear primary persona: not the clinician, but the billing back office. The person who already understands coding, is already working claims, and just needs better data. That framing changed everything about the product.

Original assumption
Help clinicians code better
Surface CPT suggestions in the clinical note-writing flow, at the point of care, before the claim is filed.
What research revealed
Help billing staff recover revenue
Analyze historical data in bulk, surface patterns, and support retroactive and prospective corrections by people who already know coding.

This isn't a minor pivot. Point-of-care suggestions require clinician onboarding, EHR integration, and behavior change at scale. Back-office analysis requires SFTP access and a browser. The right initial product was much smaller, and much more deployable.

Four types of uncaptured revenue, each with different evidence requirements.

Not all undercoding looks the same. During discovery, we identified four distinct opportunity types, each with different causes, different evidence quality, and different implications for how confident the system should be in its recommendations.

01 · Primary focus
Time-based undercoding
90834 → 90837
Documentation shows a session duration that qualifies for a higher-reimbursed timed code, but a lower code was billed. This is the strongest signal: duration is objective, documented, and directly tied to CPT thresholds.
Strongest data signal
02
E/M level selection
E/M 99213 → 99214
Evaluation and management level doesn't match the clinical complexity documented in the note. Requires NLP-level analysis of note content, higher evidence bar, higher yield when correct.
03
Psychotherapy add-on codes
90833, 90836, 90838
Psychotherapy was provided alongside an E/M visit but no add-on was billed. Often missed because clinicians don't know the add-on exists, or assume it's redundant. Meaningful revenue per encounter.
04
Billable vs. non-billable confusion
H2014, T1016, H0036
Services billed under general HCPCS codes when more specific, higher-reimbursed codes apply, or services not billed at all because staff assumed they weren't billable. Particularly common in crisis and intensive outpatient settings.

The ordering matters for the product. Time-based undercoding becomes the default starting point: highest confidence, clearest evidence, easiest for billing staff to verify and act on (a sketch of the threshold logic follows below). Other opportunity types are surfaced with explicit confidence signals and different levels of review friction.
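To make that starting point concrete, here's a minimal sketch of the time-threshold check. The 53-minute threshold for 90837 is the one cited in this case study; the other ranges follow commonly cited CPT psychotherapy times, and real payor-specific rules would live in configuration, not code.

```typescript
// Minimal sketch of the time-based check. Thresholds are the commonly
// cited CPT psychotherapy time ranges (90837 >= 53 minutes, per the
// calculator example later in this study); payor rules vary, so a real
// system would load these from configuration.
const TIMED_CODES = [
  { code: "90837", minMinutes: 53 }, // 60-minute psychotherapy
  { code: "90834", minMinutes: 38 }, // 45-minute psychotherapy
  { code: "90832", minMinutes: 16 }, // 30-minute psychotherapy
] as const;

// Returns the highest timed code the documented duration supports,
// or null if the session is too short for any timed code.
function supportedTimedCode(documentedMinutes: number): string | null {
  // TIMED_CODES is ordered highest threshold first, so the first
  // match is the highest-reimbursed code the documentation supports.
  const match = TIMED_CODES.find((c) => documentedMinutes >= c.minMinutes);
  return match ? match.code : null;
}

// Example: 53 documented minutes supports 90837. If 90834 was billed,
// that delta is a time-based undercoding opportunity.
console.log(supportedTimedCode(53)); // "90837"
```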

The hardest design question wasn't the layout. It was how certain to sound.

Coding recommendations aren't just suggestions; they have downstream consequences. If a billing coder acts on a bad recommendation, they file an incorrect claim. If they file enough incorrect claims, they're audited. In behavioral health, an audit can mean clawbacks, compliance investigations, and reputational damage.

So the question I kept returning to was: what's the right interaction model for this? I mapped out the spectrum.

Interaction model spectrum
Calculator "You documented 53 minutes. The threshold for 90837 is 53 minutes. Here is the code." Objective, mechanical. High user trust in data. Low AI judgment.
Trusted advisor "Based on this note, we recommend upgrading to 90837. Here's why." Interpretive, contextual. Requires trust in AI judgment. Higher yield potential.
Current design: informed recommendation with full evidence trail

I landed somewhere deliberate: not a calculator (too narrow, misses complexity), not an unqualified advisor (too much trust too fast). The system surfaces a recommended code with an explicit evidence chain: the documentation that supports it, the CPT threshold that applies, the specific language from the note. The coder reviews and approves. The AI doesn't act; it informs.
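As a way to picture what "recommend, don't act" implies structurally, here's a hypothetical shape for a recommendation and its evidence chain. The field names are mine, for illustration, not the product's actual data model.

```typescript
// Hypothetical recommendation shape: the AI assembles the suggestion
// and its evidence chain; the status only changes when a coder acts.
interface Recommendation {
  billedCode: string;      // what was actually filed, e.g. "90834"
  recommendedCode: string; // what the documentation supports, e.g. "90837"
  evidence: {
    documentedMinutes: number;   // duration extracted from the note
    cptThresholdMinutes: number; // the rule the recommendation rests on
    noteExcerpt: string;         // the specific language from the note
  };
  status: "pending_review" | "approved" | "rejected";
}
```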

The audit requirement shaped everything. Billing staff told us that any recommendation they act on needs to be defensible if they're ever asked why the code changed. That's not a nice-to-have; it's a compliance requirement. Every recommendation in the interface is built around the evidence that supports it.

"I need to be able to explain this to my compliance team. If something looks off, they're going to ask me why we changed it."
Billing director · User interview
"Coders already know the codes. What they don't have is a systematic way to check whether what was documented actually matches what was billed."
RCM analyst · Discovery session

Confidence isn't uniform, and the design reflects that.

The four opportunity types don't have equal evidence quality, and pretending they do would be a design failure. Time-based undercoding has objective evidence: duration is a number in the note, the CPT threshold is a fixed rule, the delta is calculable. The model's confidence here is high, and the interface presents it directly: here's what was documented, here's the code it supports, here's the threshold. It's close to arithmetic.

Add-on codes and level-of-service upgrades are different. They require reading the note for clinical content, not just extracting a number. The model is interpreting documentation, not matching it against a threshold. For these opportunity types, I'm designing more friction into the review flow, not less. The evidence panel shows more supporting context. The expected yield is labeled as estimated. The call-to-action language shifts from "apply this" to "review this opportunity."

Time-based undercoding → direct action affordance
Duration documented vs. CPT threshold required. Calculable delta. High-confidence signal. The interface shows the gap plainly and makes the action direct: review the evidence, apply the code. The coder is verifying, not interpreting.
Evidence type: objective · Confidence: high · UI friction: low
Level-of-service and add-on opportunities → review-first affordance
These require clinical interpretation of note content. The model surfaces supporting documentation, but the coder needs to read it and make a judgment call. The interface labels these as opportunities to investigate rather than recommendations to act on: different language, more evidence surface area, explicit "estimated" yield framing.
Evidence type: interpretive · Confidence: variable · UI friction: intentionally higher
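One way to encode that split, purely as a sketch: evidence type drives the UI treatment. The labels and values below are illustrative placeholders, not the shipped copy.

```typescript
// Sketch: evidence type drives call-to-action language, yield framing,
// and review friction. All strings are illustrative placeholders.
type EvidenceType = "objective" | "interpretive";

interface UiTreatment {
  ctaLabel: string;
  yieldFraming: "calculated" | "estimated";
  reviewFriction: "low" | "high";
}

const TREATMENT: Record<EvidenceType, UiTreatment> = {
  objective: {
    ctaLabel: "Review evidence and apply code",
    yieldFraming: "calculated",
    reviewFriction: "low",
  },
  interpretive: {
    ctaLabel: "Review this opportunity",
    yieldFraming: "estimated",
    reviewFriction: "high",
  },
};
```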

Designing for the wrong recommendation.

If a coder acts on a bad recommendation and files an incorrect claim, two things need to be true: the coder should have had enough evidence to evaluate it before acting, and there needs to be a traceable record of what supported the decision. This is both an audit requirement and a trust architecture.

Every recommendation has a named approver (the coder who reviewed and acted) and a timestamp. The supporting documentation that triggered the recommendation is preserved alongside it. If the claim gets audited later, the organization can reconstruct exactly what the note said, what threshold applied, and what the coder saw when they made the call. The system doesn't act autonomously. It recommends. The human approves. The record reflects that chain.
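A hypothetical shape for that decision record, enough to reconstruct who approved what, when, and on what evidence. Again, the field names are illustrative, not the actual system's.

```typescript
// Hypothetical audit record: preserves the evidence the coder saw at
// decision time, so the chain survives even if the note is later edited.
interface AuditRecord {
  recommendationId: string;
  decidedBy: string; // the named coder who reviewed and acted
  decidedAt: string; // ISO timestamp of the decision
  decision: "approved" | "rejected";
  preservedEvidence: {
    noteExcerpt: string; // documentation that triggered the flag
    documentedMinutes: number;
    cptThresholdMinutes: number;
  };
}
```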

This also shapes how I think about false positives. A billing coder who reviews a recommendation, disagrees, and doesn't act has done exactly what the interface was designed for. The friction is intentional. The goal isn't to maximize the number of codes that get changed; it's to maximize the number of correct coding decisions, whether that's an upgrade or a confirmation that the original code was right.

Decisions I've made, and what I gave up to make them.

Active design work always involves tradeoffs. Here's where I've taken a clear position, and what's sitting on the other side of each decision.

Removed compliance scoring from the MVP
Early concepts included a compliance risk score alongside revenue uplift, flagging claims that might be overbilled or auditable. I pulled it. The product's job is revenue recovery, and conflating that with compliance monitoring creates a confused product that does neither well. Compliance can be a future module with its own UX surface and buyer motion.
Tradeoff: narrower market short-term; cleaner product immediately
Duration as a first-class attribute
Session duration isn't metadata; it's the primary evidence for time-based coding. I made it prominent in every view, not tucked into a detail panel. When a coder is reviewing a recommendation, they need to see the documented time and the CPT threshold together, immediately, without drilling down. This is the moment of trust, and it should be easy.
Tradeoff: requires time to be consistently documented in source notes
Aggregated view is primary; line-level is supporting
The homepage is a portfolio view: by clinician, by service type, by opportunity size. The first question a billing director asks isn't "show me every claim"; it's "where do I have the biggest opportunity?" Line-level deep dives exist, but you enter them from the aggregate, not the other way around.
Tradeoff: power users may want line-level first; watching this in alpha
Replaced opaque labels with clear language
Early terminology included "Service Translation," internal language that meant nothing to billing staff. I replaced it throughout with plain descriptions: "Duration mismatch," "Add-on not captured," "Level upgrade opportunity." The product earns trust through clarity, not jargon.
Tradeoff: requires alignment across sales and CS on new terminology

What I'm designing against.

This isn't a greenfield product. It's shipping on a specific timeline, into a specific regulatory environment, for a specific buyer persona with specific compliance anxieties.

April 2026 alpha · SFTP data ingestion only · Billing staff, not clinicians · HIPAA compliance requirements · No EHR write-back · Audit defensibility · Payor rule variation · Historical data only at launch
In scope · Alpha
Historical billing analysis
Cross-reference existing claims against documentation. Surface time-based and add-on opportunities. Aggregated org view + line-level drill-down. Evidence chain for every recommendation.
Future scope
Prospective and embedded flows
Pre-submission flagging in the EHR workflow. Real-time coding suggestions as notes are completed. Compliance risk monitoring. Payor-specific rule sets and override tracking.

The constraint I find most interesting is the no-write-back limitation. The product surfaces opportunities, but billing staff have to act on them in their existing EHR or billing system. That means the interface is purely informational, which actually clarifies the design challenge. I'm designing a decision-support tool, not an action tool. Every screen asks: does this person have what they need to go act elsewhere?

This is the work in front of me.

Alpha goes live in April 2026 with our first customer. Here's where I'm focused between now and then, and the questions I'm still working through.

01
Resolve the inline vs. tabbed layout question
Where do code recommendations live relative to the supporting evidence? Inline collapses context; tabs create navigation overhead. I'm prototyping both and running them by billing staff before committing.
02
Define the right confidence signaling language
Not all recommendations carry the same weight. Time-based opportunities are objective; E/M level upgrades involve interpretation. The UI needs a consistent, legible system for communicating confidence, without creating anxiety or false certainty.
03
Test the aggregate-first navigation assumption
My current model puts the org-level dashboard first. But some billing staff may want to work claim-by-claim. Alpha will tell me whether the entry point is right, or whether I need a mode toggle.
04
Design the "act" moment, without owning the action
Users leave my product to take action in their billing system. I need to design that handoff clearly: what information do they need to bring, and what's the right format for a recommendation they can hand to their biller? This is a UX problem decision-support tools rarely solve well.

This case study will continue growing as the product does. If you want to talk through any of it (the domain, the trust design, the AI recommendation problem), I'm always up for it.
