
CXR Draft Auditor

A research and educational quality-assurance tool that gives a chest X-ray draft impression a transparent second read. Two tiny 4B models — a fine-tuned MedGemma that grounds the image into labeled findings with bounding boxes and an NVIDIA Nemotron-3 Nano 4B that parses the draft into the same labels — feed a deterministic, model-free comparator that flags MISSING, UNSUPPORTED, and URGENT discrepancies, with every flag traced to a specific finding, a specific phrase, and a box on the image. Not a medical device and not a diagnostic tool.

Overview

CXR Draft Auditor is a small multimodal quality-assurance tool that gives a chest X-ray draft impression a transparent second read. You hand it a chest X-ray and the human-written draft impression; it reads the image and the words separately, then quietly compares them and flags where they appear to disagree — and instead of just saying "look again," it draws a box on the image, right where it wants you to look.

The point of the tool is the audit loop, never a verdict. It does not diagnose. It points a person back at their own image and says, gently, "take one more look at this." The radiologist is always in the loop and always the one who decides; the tool just gives a tired pair of eyes near the bottom of a long worklist a second chance to catch a finding before the report reaches the patient. It is live on a Hugging Face Space, runs both models inside the Space itself, and was built for the Build Small Hackathon, but this page is about the project, not the contest.

Research and educational quality-assurance only. CXR Draft Auditor is NOT a medical device, NOT a diagnostic tool, and NOT for clinical use of any kind. Its outputs are frequently wrong. It must never inform screening, triage, or patient care. Always consult a qualified radiologist. Imaging findings cannot be interpreted in isolation from a particular patient's clinical picture, and the final word always rests with a qualified specialist.

Hugging Face Space: build-small-hackathon / cxr-draft-auditor. Research QA for chest X-ray draft impressions (not a device).

The Problem

Automated chest X-ray report generation is not a solved problem. Recent work shows generated reports are error-free on fewer than half of abnormal cases, and the grounded-fact-checking literature still leaves omission detection as future work. The two failure modes that matter most clinically are also the two hardest to catch automatically: a draft that misses a finding that is actually present, and a draft that over-calls a finding the image does not support.

There is a human side to that problem that I could not stop thinking about. A radiologist at hour eleven of a long shift may be on their two-hundredth scan; the worklist does not get shorter, attention frays, and some findings are genuinely, stubbornly quiet — a small effusion, a faint nodule tucked behind a rib, a line sitting a few millimeters off. This is not a story about anyone making mistakes; the skill is not in question, the workload is. The most careful person in the world is still a person under real pressure of volume, fatigue, and time.

So instead of generating yet another report, I built an auditor. It asks a narrower, safer question than "what is wrong with this patient": where do the image and the draft appear to disagree, and can I show the evidence? The output is never a diagnosis. It is a set of flags, each tied to a region on the image, that send a person back to look again.

The Solution

CXR Draft Auditor is deliberately simple and transparent: two tiny 4B models and no black box. The system is three layers. The two perception layers each use the model that is genuinely good at its job, and the only layer that makes a judgment is the one with no model in it at all.

  • Image to grounded findings. A fine-tuned MedGemma 4B vision-language model emits a constrained JSON list of findings over a fixed set of six labels, each with a normalized bounding box. It is the half that turns pixels into labeled, boxed evidence.
  • Draft to labels. NVIDIA Nemotron-3 Nano 4B parses the draft impression into the same six labels, marking each as asserted or denied and keeping the verbatim draft phrase that produced it. Paste "Cardiomegaly is present. No pneumothorax." and it returns cardiomegaly asserted plus pneumothorax denied, each with the exact span it came from. It reasons briefly before emitting the labels — which materially improves extraction on multi-clause drafts — and that reasoning trace is stripped before parsing.
  • Deterministic comparison. A pure-logic comparator, no model and no randomness, applies three rules. A finding present in the image but absent or denied in the draft is MISSING. A finding asserted in the draft but absent from the image is UNSUPPORTED. Any image-present finding on the urgent whitelist, which holds the two can't-miss labels (pneumothorax, and nodule or mass), is surfaced as URGENT for radiologist review.
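As an illustration, the three comparator rules could be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the project's actual code: the class names, field names, and label identifiers (`ImageFinding`, `DraftLabel`, `Flag`, `nodule_or_mass`, and so on) are hypothetical, and the choice to let one finding carry both an URGENT and a MISSING flag is a guess about how the rules combine.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # normalized (x_min, y_min, x_max, y_max)

# Assumed identifiers for the labels; the real spellings may differ.
NO_FINDING = "no_finding"
URGENT_LABELS = frozenset({"pneumothorax", "nodule_or_mass"})


@dataclass(frozen=True)
class ImageFinding:
    label: str
    box: Box


@dataclass(frozen=True)
class DraftLabel:
    label: str
    asserted: bool  # False means the draft explicitly denies the finding
    phrase: str     # verbatim span from the draft


@dataclass(frozen=True)
class Flag:
    kind: str              # "MISSING", "UNSUPPORTED", or "URGENT"
    label: str
    box: Optional[Box]     # evidence on the image, when there is any
    phrase: Optional[str]  # evidence in the draft, when there is any


def compare(image_findings: List[ImageFinding],
            draft_labels: List[DraftLabel]) -> List[Flag]:
    """Apply the three rules with plain set logic: no model, no randomness."""
    asserted = {d.label: d for d in draft_labels if d.asserted}
    denied = {d.label: d for d in draft_labels if not d.asserted}
    flags: List[Flag] = []
    image_labels = set()

    for finding in image_findings:
        if finding.label == NO_FINDING:
            continue
        image_labels.add(finding.label)
        # URGENT: any image-present finding on the whitelist.
        if finding.label in URGENT_LABELS:
            flags.append(Flag("URGENT", finding.label, finding.box, None))
        # MISSING: present in the image, absent or denied in the draft.
        if finding.label not in asserted:
            denial = denied.get(finding.label)
            flags.append(Flag("MISSING", finding.label, finding.box,
                              denial.phrase if denial else None))

    # UNSUPPORTED: asserted in the draft, absent from the image.
    for label in sorted(asserted):
        if label != NO_FINDING and label not in image_labels:
            flags.append(Flag("UNSUPPORTED", label, None, asserted[label].phrase))
    return flags
```

Because every flag carries the box, the phrase, or both, a reader can always trace a flag back to the specific evidence that produced it.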

The six canonical labels are pleural effusion; pneumothorax; lung opacity or consolidation; nodule or mass; cardiomegaly; and no-finding. Every dataset's native labels are normalized into that small set; labels with no canonical counterpart are dropped rather than forced. Both models run on the GPU inside the Space, so a full audit takes roughly fifteen to thirty seconds, and if the draft cannot be parsed at all the audit degrades to a visible image-only pass rather than failing silently.
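The normalization step can be pictured as a lookup table plus a drop rule. The sketch below is illustrative only: the mapping entries and the canonical identifiers are assumptions chosen for the example, not the project's actual table.

```python
from typing import Iterable, List, Optional

# Hypothetical native-to-canonical table; the real project's mapping may differ.
NATIVE_TO_CANONICAL = {
    "pleural effusion": "pleural_effusion",
    "effusion": "pleural_effusion",
    "pneumothorax": "pneumothorax",
    "lung opacity": "lung_opacity_or_consolidation",
    "consolidation": "lung_opacity_or_consolidation",
    "nodule/mass": "nodule_or_mass",
    "nodule": "nodule_or_mass",
    "mass": "nodule_or_mass",
    "cardiomegaly": "cardiomegaly",
    "no finding": "no_finding",
}


def normalize_label(native: str) -> Optional[str]:
    """Return the canonical label, or None when there is no counterpart."""
    return NATIVE_TO_CANONICAL.get(native.strip().lower())


def normalize_labels(natives: Iterable[str]) -> List[str]:
    """Map native labels to canonical ones, dropping unmapped labels
    rather than forcing them into the nearest class."""
    canonical: List[str] = []
    for native in natives:
        label = normalize_label(native)
        if label is not None and label not in canonical:
            canonical.append(label)
    return canonical
```

Returning `None` for an unmapped label, instead of guessing a nearest class, keeps the drop explicit at the call site.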

Why Two Models, and Why a Model-Free Judge

The two-model split is practical, not decorative. My first design used the fine-tuned MedGemma for both jobs. It is genuinely good at grounding the image, because that is exactly what I fine-tuned it for — but I had narrowed it so hard on grounded finding extraction that it became an unreliable reader of free text, missing denials, dropping the verbatim span, or wandering outside the label set on ordinary report phrasing. The draft parser does not need to look at the image; it needs to follow instructions over text and respect a strict schema. So I gave that job to a model built for it. Two narrow models each doing what they are good at beat one model stretched across two jobs.
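To make the draft parser's contract concrete, here is a minimal sketch of the post-processing it implies: strip the reasoning trace, parse the JSON, and reject anything off-schema. The `<think>...</think>` delimiter, the field names (`label`, `status`, `phrase`), and the label identifiers are all assumptions for illustration; the source does not specify them.

```python
import json
import re
from typing import Dict, List

# Assumed label identifiers and status values.
ALLOWED_LABELS = {
    "pleural_effusion", "pneumothorax", "lung_opacity_or_consolidation",
    "nodule_or_mass", "cardiomegaly", "no_finding",
}
ALLOWED_STATUS = {"asserted", "denied"}

# Assumption: the reasoning trace is wrapped in <think>...</think> tags.
REASONING_TRACE = re.compile(r"<think>.*?</think>", re.DOTALL)


def parse_draft_labels(raw_output: str, draft: str) -> List[Dict[str, str]]:
    """Strip the reasoning trace, then validate every extracted label."""
    payload = REASONING_TRACE.sub("", raw_output).strip()
    items = json.loads(payload)
    if not isinstance(items, list):
        raise ValueError("expected a JSON list of labels")

    parsed: List[Dict[str, str]] = []
    for item in items:
        if not isinstance(item, dict):
            raise ValueError("each entry must be a JSON object")
        label = item.get("label")
        status = item.get("status")
        phrase = item.get("phrase")
        if label not in ALLOWED_LABELS:
            raise ValueError(f"label outside the fixed set: {label!r}")
        if status not in ALLOWED_STATUS:
            raise ValueError(f"unknown status: {status!r}")
        # The phrase must be a verbatim span of the draft.
        if not isinstance(phrase, str) or phrase not in draft:
            raise ValueError(f"phrase is not a verbatim span: {phrase!r}")
        parsed.append({"label": label, "status": status, "phrase": phrase})
    return parsed
```

Checking that each phrase occurs verbatim in the draft is what lets a downstream flag point at an exact span rather than a paraphrase.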

The reason for the decomposition as a whole is trust. Because the comparator is deterministic and reads two explicit label sets, every flag is explainable: you can see which image finding and which draft phrase produced it, and you can see the box. The two perception models can each be wrong, but a wrong flag can always be traced to the model output that caused it, and I valued that debuggability more than a slightly higher end-to-end score. The pure-logic core (the label set, the schema, the prompts, the comparator, the metrics, and the synthetic-draft generator) depends only on the Python standard library, NumPy, and Pydantic, so it unit-tests with no GPU, no Torch, and no network. The heavy stacks are optional extras, imported lazily, which kept iteration fast and the tests objective.
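One common way to keep heavy dependencies optional is to import them only at the point of use. The helper below is a generic sketch of that pattern, not the project's code; the function name `lazy_import` and the extra name passed to it are hypothetical.

```python
import importlib
from types import ModuleType


def lazy_import(module_name: str, extra: str) -> ModuleType:
    """Import a heavy dependency on first use instead of at module load.

    The pure-logic code never calls this, so it stays importable and
    testable on a machine without the optional stack installed.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise RuntimeError(
            f"'{module_name}' is required for this step; "
            f"install the optional '{extra}' extra to enable it"
        ) from exc
```

A model-loading function would call, for example, `lazy_import("torch", "gpu")` inside its body, so importing the comparator or the metrics never touches Torch.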

Building It from Open Data Alone

The single biggest constraint shaped everything: no PhysioNet credentials, which rules out MIMIC-CXR and most of the paired image-plus-report-plus-box datasets the field leans on. I needed real radiologist bounding boxes from a source I could actually reach. VinDr-CXR turned out to be the crux — commonly described as PhysioNet-gated, but reachable through Kaggle, with radiologist-drawn boxes. Its upstream Data Use Agreement is non-commercial research only, which fits a research and educational project, and because of that agreement I keep the fine-tuning corpora private and never redistribute its pixels on the public Space. I layered a few more open sources on top, normalized everything into the six canonical findings, and held out NIH ChestX-ray14 boxes for evaluation, using a handful of openly licensed NIH images as the Space's examples.

The gap is that no instant-access open dataset gives you images, real free-text reports, and boxes all at once. My way around it was a synthetic-draft method. Starting from the real box labels, I generate a synthetic draft that faithfully describes those findings, then corrupt it in exactly one controlled way: drop a present finding to manufacture a MISSING case, add an absent finding to manufacture an UNSUPPORTED case, or change nothing as the negative control that must produce no flags. Because I know the corruption I applied, I know the correct audit decision, so I can measure audit precision and recall directly. Real radiology reports are used only to confirm the parser reads realistic prose, never as box supervision. That decoupling is what made a credible audit loop possible from open data alone.
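The corrupt-one-thing idea can be sketched in a few lines. This is an illustration under assumptions, not the project's generator: the label identifiers, the mode names (`drop`, `add`, `control`), and the function names are hypothetical, and rendering the label list into draft prose is left out.

```python
import random
from typing import Iterable, List, Optional, Set, Tuple

# Assumed identifiers for the five positive findings.
FINDINGS = [
    "pleural_effusion", "pneumothorax", "lung_opacity_or_consolidation",
    "nodule_or_mass", "cardiomegaly",
]
Flag = Tuple[str, str]  # (kind, label)


def corrupt(present: List[str], mode: str,
            rng: random.Random) -> Tuple[List[str], Optional[Flag]]:
    """Return the findings the synthetic draft will describe, plus the one
    flag a correct audit should raise (None for the negative control)."""
    if mode == "drop":
        if not present:
            raise ValueError("nothing to drop")
        victim = rng.choice(present)
        return [l for l in present if l != victim], ("MISSING", victim)
    if mode == "add":
        absent = [l for l in FINDINGS if l not in present]
        if not absent:
            raise ValueError("nothing to add")
        extra = rng.choice(absent)
        return list(present) + [extra], ("UNSUPPORTED", extra)
    if mode == "control":
        return list(present), None
    raise ValueError(f"unknown mode: {mode}")


def precision_recall(cases: Iterable[Tuple[Set[Flag], Set[Flag]]]) -> Tuple[float, float]:
    """Micro-averaged precision and recall over (expected, predicted) flag sets."""
    tp = fp = fn = 0
    for expected, predicted in cases:
        tp += len(expected & predicted)
        fp += len(predicted - expected)
        fn += len(expected - predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Since each case has exactly one known corruption (or none), any flag the auditor raises on a control case counts directly as a false positive.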

Fine-Tuning the Grounding Model

Before fine-tuning anything, I tried the stock base model on the grounding job with my production prompt. The encouraging part was that the medical knowledge was plainly already there — it reasons sensibly about opacities and heart size. The discouraging part was that it would not honor the contract the rest of the auditor depends on: a clean, on-vocabulary JSON list of findings with boxes and nothing else. On one image it narrated paragraphs instead of emitting JSON, so nothing parsed; on another it called a real cardiomegaly normal; on the mass image it mislabeled the mass and then tacked on a contradictory "no finding" that silently knocked out the URGENT flag — the single case I least wanted to lose. The knowledge was present; the discipline was not. That framed exactly what fine-tuning was for: not to teach the model to see a chest X-ray, which it already could, but to discipline it into reliable, parseable, on-vocabulary findings-with-boxes every time, so the comparator downstream could trust what it was handed.
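The contract described above (a JSON list, on-vocabulary labels, normalized boxes, and no contradictory "no finding") can be written down as a strict checker. The sketch below is an assumption-laden illustration: the field names `label` and `box`, the label identifiers, and the rule that `no_finding` carries no box are hypothetical, and the project's own schema is described elsewhere as tolerant rather than strict.

```python
import json
from typing import Any, Dict, List

# Assumed label identifiers.
NO_FINDING = "no_finding"
CANONICAL = {
    "pleural_effusion", "pneumothorax", "lung_opacity_or_consolidation",
    "nodule_or_mass", "cardiomegaly", NO_FINDING,
}


def _is_number(value: Any) -> bool:
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_findings(raw: str) -> List[Dict[str, Any]]:
    """Raise ValueError unless raw is a clean, on-vocabulary list of findings."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError("output is not JSON") from exc
    if not isinstance(data, list):
        raise ValueError("output is not a JSON list")

    labels = set()
    for item in data:
        if not isinstance(item, dict) or item.get("label") not in CANONICAL:
            raise ValueError(f"off-vocabulary or malformed entry: {item!r}")
        labels.add(item["label"])
        if item["label"] == NO_FINDING:
            continue  # assumed: no box is required for no_finding
        box = item.get("box")
        if not (isinstance(box, list) and len(box) == 4
                and all(_is_number(v) for v in box)):
            raise ValueError(f"missing or malformed box: {item!r}")
        x0, y0, x1, y1 = box
        if not (0 <= x0 < x1 <= 1 and 0 <= y0 < y1 <= 1):
            raise ValueError(f"box is not normalized: {box!r}")

    # A "no finding" next to a real finding is the contradiction that
    # would otherwise knock out a downstream flag.
    if NO_FINDING in labels and len(labels) > 1:
        raise ValueError("no_finding contradicts other findings")
    return data
```

Run against the three base-model failures described above, a checker like this would reject the narrated paragraphs and the contradictory "no finding"; the cardiomegaly miscalled as normal is a content error that only evaluation against ground truth can catch.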

The fine-tune used QLoRA through TRL's supervised trainer with PEFT, on a single A100 via Hugging Face Jobs, after a cheap smoke run to catch container and CUDA issues first. I merged the adapter into a clean bf16 base and verified that the merge actually captured the adapter before publishing; a merge that silently failed to apply it would have shipped the unchanged base. I trained it twice. The first pass under-localized on harder images and kept double-labeling one region as both opacity and nodule, so I re-curated the corpus to deduplicate same-region overlaps and merge duplicate triple-reader boxes, then trained again over the cleaner data. The second version, alex-feeel/medgemma-cxr-auditor-v2, is the model the Space serves today; the first version, alex-feeel/medgemma-cxr-auditor, is kept public for reference and comparison.
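Merging duplicate boxes from multiple readers is typically done with an overlap measure such as intersection over union (IoU). The sketch below shows one simple way to do it, assuming a greedy same-label merge with an IoU threshold of 0.5 and coordinate averaging; the threshold, the averaging, and the function names are assumptions, and how the project resolved overlaps between different labels is not covered here.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def mean_box(boxes: List[Box]) -> Box:
    """Coordinate-wise average of a non-empty list of boxes."""
    n = len(boxes)
    return tuple(sum(b[i] for b in boxes) / n for i in range(4))  # type: ignore[return-value]


def merge_reader_boxes(annotations: List[Tuple[str, Box]],
                       threshold: float = 0.5) -> List[Tuple[str, Box]]:
    """Greedily merge same-label boxes that overlap by at least `threshold`.

    Boxes with different labels are never merged by this sketch.
    """
    clusters: List[Tuple[str, List[Box]]] = []
    for label, box in annotations:
        for cluster_label, members in clusters:
            if cluster_label == label and iou(mean_box(members), box) >= threshold:
                members.append(box)
                break
        else:
            # No existing cluster matched, so this box starts a new one.
            clusters.append((label, [box]))
    return [(label, mean_box(members)) for label, members in clusters]
```

With three readers drawing nearly the same region, this collapses three training targets into one, so the model is not asked to emit the same finding three times.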

A Lesson in Evaluation Integrity

The most useful thing I learned came from nearly shipping the wrong model. For a while my own numbers said the second version had regressed on exactly the can't-miss findings I cared about most, and I came close to keeping the first version as the served model on the strength of that comparison. That would have been the wrong call. The numbers were not measuring the model: a silent output-handling bug was corrupting the evaluation itself, throwing away any generation it could not consume cleanly, with no error and no count, so a chunk of the better model's outputs was being dropped before it was ever scored. The model was producing the findings; the evaluation was discarding them and then scoring the model as if it had stayed silent.

Once I made the output handling refuse to drop anything silently — recover what it can, fall back visibly when it genuinely cannot, and never discard a generation without saying so — and re-ran the held-out comparison, the real signal showed through and the decision flipped to the right, better model. A measurement that silently drops data does not produce a smaller dataset; it produces wrong conclusions. The lesson is about evaluation integrity, not one bug: trust your evaluation before you trust its verdict. The same principle now governs the live system — when the draft cannot be parsed, the audit degrades to a visible image-only pass rather than silently dropping the draft, so a failure is always something you can see. I want to be equally honest about the limits: a 4B model on noisy, triple-annotated boxes is good enough to demonstrate the audit loop and to flag can't-miss findings for a second look, but it is frequently wrong and is research and educational only.
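The "never drop silently" rule can be illustrated with a parser that accounts for every generation it sees. This is a minimal sketch, not the project's output handler: the `ParseReport` class, the regex-based recovery step, and the function name are assumptions made for the example.

```python
import json
import re
from dataclasses import dataclass, field
from typing import Any, List, Optional

# Greedy match from the first "[" to the last "]" in the text.
JSON_ARRAY = re.compile(r"\[.*\]", re.DOTALL)


@dataclass
class ParseReport:
    """Running tally, so every generation is accounted for."""
    parsed: int = 0
    recovered: int = 0
    failed: int = 0
    failures: List[str] = field(default_factory=list)


def _load_list(text: str) -> Optional[List[Any]]:
    try:
        value = json.loads(text)
    except json.JSONDecodeError:
        return None
    return value if isinstance(value, list) else None


def parse_generation(raw: str, report: ParseReport) -> Optional[List[Any]]:
    """Parse cleanly if possible, recover if not, and record every failure."""
    result = _load_list(raw)
    if result is not None:
        report.parsed += 1
        return result

    # Recovery: pull a JSON array out of surrounding chatter.
    match = JSON_ARRAY.search(raw)
    if match:
        result = _load_list(match.group(0))
        if result is not None:
            report.recovered += 1
            return result

    # Visible fallback: count the failure and keep the raw text for inspection.
    report.failed += 1
    report.failures.append(raw)
    return None
```

After an evaluation run, `parsed + recovered + failed` should equal the number of generations; if it does not, something is still being dropped before scoring.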

Where It Came From

This did not start with a model; it started with a person. A radiologist I know, whose mostly-pediatric practice is exactly the kind of high-volume reading the problem describes, needed a reliable second read on chest X-ray draft impressions, and the whole tool is shaped around that need. He has since tried it himself, and his framing matches mine: a calm second pair of eyes, not a replacement for the specialist; the decision always stays with the doctor. He also added an important caveat I want to carry forward — for pediatric chest radiography, AI is less validated than it is for adults, because the major open chest-radiograph databases were collected from adults, so tools that grew out of adult data must be independently validated and recalibrated on pediatric data before there is any talk of using them in children. That real need, and that honesty about limits, is the spine of the project.

My Role

I designed and built CXR Draft Auditor end-to-end as a solo project:

  • The canonical six-finding label space and the dataset normalization.
  • The tolerant output schema and box conversions.
  • The deterministic MISSING / UNSUPPORTED / URGENT comparator that is the only judgment layer.
  • The synthetic-draft generator that made the audit loop measurable.
  • The dependency-light pure-logic core and its GPU-free test suite.
  • The data-sourcing and licensing work to build from open data without PhysioNet.
  • The twice-over QLoRA fine-tuning of the MedGemma grounding model with a verified adapter merge.
  • The held-out evaluation, including catching the silent output-handling bug that nearly shipped the worse model.
  • The integration of NVIDIA Nemotron-3 Nano 4B as the draft parser.
  • The Gradio Space that runs the full loop with both tiny models hosted inside it.

It draws on the same machine-learning workflow discipline behind my earlier ML Prep and Train Toolkit: careful data preparation, reproducible training, and metrics you can actually trust. If a transparent, evidence-first second read for chest X-ray drafts sounds useful for your own research or educational work — or if you simply want to talk it through — the contact page is the best way to reach me.