
Field Notes: Building a Chest X-ray Draft Auditor with Two Tiny 4B Models — MedGemma and NVIDIA Nemotron

Aleksandr Filippov · Artificial Intelligence · June 14, 2026 · 19-minute read

You have probably had a chest X-ray. And you probably never saw what was on it.

Maybe it was before an operation. Maybe a routine check-up, or a cough that just would not quit, or a night when something hurt and nobody could tell you why yet. You held still, the machine clicked, and a few minutes later someone told you it looked fine. Then you went home and forgot about it.

Here is the part you never saw. That image of your chest went into a long, quiet list, and somewhere a radiologist you will likely never meet pulled it up as one of hundreds that day. We hand that person an enormous, silent trust, usually without ever thinking of them at all.

Picture them now. It is late. This might be the 200th scan of a long shift, maybe a night shift, maybe a stack read in from another city. The list does not get shorter; it just keeps scrolling. They have trained for years to see what most of us never could. But eyes get tired near the bottom of a stack. Attention frays. And some findings are genuinely, stubbornly quiet: a small effusion, a faint nodule tucked behind a rib, a line or tube sitting a few millimeters off, a subtle change you would only catch by holding it against an image from a year ago.

This is not a story about doctors making mistakes. The skill is not in question; the workload is. It is a story about being human under real pressure: volume, fatigue, time, the late hour. The most careful person in the world is still a person. And when the stakes are someone's lungs, "almost caught it" is not where any of us want to land.

That is the moment I kept thinking about. Not the technology. The tired person at hour eleven who genuinely cares and just wants a backstop.

So I built a second pair of eyes. Not a replacement, not a diagnosis, never the one who decides. A quiet helper that looks again.

You give CXR Draft Auditor a chest X-ray and the human-written draft impression. It reads the image and the words separately, then quietly compares them and flags where they seem to disagree. And it does not just say "look again," it draws a box on the image, right where it wants you to look.

Here is the line that matters most to me. It never diagnoses. It points a person back at their own image and says, gently, "take one more look at this." The radiologist is always in the loop, always the one who decides. The tool just makes sure the tired eyes at hour eleven get a second chance before the patient does. A safety net under people who are already trying their hardest. That is the whole idea.

Under the hood it stays deliberately simple and transparent: two tiny 4B models and no black box. A fine-tuned MedGemma grounds the image into labeled boxes, NVIDIA Nemotron-3 Nano 4B reads the draft into the same labels, and a deterministic, model-free comparator does the only judging, so every flag traces back to a specific finding and a specific phrase. This is a Field Notes write-up of how I built that from open data alone, including the evaluation-integrity bug that nearly made me ship the worse of my two models. The rest is the story behind those decisions: the problem, the data, the architecture, the synthetic-draft method, and the evaluation lesson.

The problem

Automated chest X-ray report generation is not solved. Recent work shows generated reports are error-free on fewer than half of abnormal cases, and the literature on grounded fact-checking explicitly leaves omission detection as future work. The two failure modes that matter clinically are the two hardest to catch automatically: a draft that misses a finding that is actually present (an omission), and a draft that asserts a finding the image does not support (an over-call).

So instead of generating yet another report, I built an auditor. It takes a chest X-ray and a human-written draft impression, and it asks a narrower question: where do the image and the draft appear to disagree, and can I show the evidence? The output is never a verdict. It is a set of flags, each tied to a region on the image, that send a person back to look again.

The no-PhysioNet data story

The single biggest constraint shaped everything: no PhysioNet credentials. That rules out MIMIC-CXR and most of the paired image-plus-report-plus-box datasets the field leans on. I needed real radiologist bounding boxes from a source I could actually access.

The crux turned out to be VinDr-CXR. It is commonly described as PhysioNet-gated, but it is reachable without PhysioNet through Kaggle: the VinBigData Chest X-ray Abnormalities Detection competition (via the rules click-through, or Late Submission) and public resized PNG mirrors that need no competition entry at all. The boxes are radiologist-drawn, with up to three readers per image. The important caveat is licensing: the upstream VinDr Data Use Agreement is non-commercial research only, and a CC0 tag on a downstream mirror does not override that. My project is research and educational, which fits. Because of that DUA I keep the SFT corpora private and never redistribute VinDr pixels on the public Space.

I layered a few more open sources on top. The VinDr-CXR-VQA dataset (faizan711/VinDR-CXR-VQA) is annotations only, no images: it ships a single data_v1.json that I join to the Kaggle VinDr pixels by image_id, a 32-character hex filename. Its gt_location boxes are in original full-resolution pixel space, so I rescale them per image whenever I pair them against a resized image mirror. ChestX-Det (natealberti/ChestX-Det) gave me a second box source under Apache-2.0 annotations. NIH ChestX-ray14 with BBox_List_2017.csv is held out for box evaluation, and because its images are openly licensed I use a few of them as the example images on the public Space (via the natealberti/ChestX-Det redistribution). IU-Xray / Open-i gave me real radiology reports, used only to check that my draft parser handles realistic phrasing.
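The per-image rescale mentioned above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `rescale_box`, the (x_min, y_min, x_max, y_max) coordinate order, and the example sizes are all assumptions, and it only covers a plain resize (a letterboxed or cropped mirror would also need an offset).

```python
from typing import Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def rescale_box(box: Box, orig_size: Tuple[int, int], new_size: Tuple[int, int]) -> Box:
    """Map a box from original-resolution pixels onto a resized image.

    Sizes are (width, height). Each axis is scaled independently, which is
    only correct for a plain resize with no padding or cropping.
    """
    orig_w, orig_h = orig_size
    new_w, new_h = new_size
    if orig_w <= 0 or orig_h <= 0:
        raise ValueError("original size must be positive")
    sx = new_w / orig_w  # horizontal scale factor
    sy = new_h / orig_h  # vertical scale factor
    x0, y0, x1, y1 = box
    return (x0 * sx, y0 * sy, x1 * sx, y1 * sy)


# Hypothetical example: a 2400x3000 original paired with a 1024x1024 mirror.
scaled = rescale_box((600, 1200, 1200, 1800), orig_size=(2400, 3000), new_size=(1024, 1024))
```

Because the two axes scale by different factors whenever the mirror changes the aspect ratio, reusing one factor for both would quietly shift every box.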

The gap: there is no instant-access open dataset with images, real free-text reports, and boxes all at once. PadChest-GR is the closest, but it is request-gated, so I never put it on the critical path. The way around the gap is the synthetic-draft method below.

I normalized every dataset's native labels into one small canonical set of six labels: five findings (pleural effusion, pneumothorax, lung opacity / consolidation, nodule / mass, cardiomegaly) plus no-finding. Labels with no canonical counterpart (aortic enlargement, atelectasis, calcification, and so on) are dropped rather than forced.
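A normalization step of this kind can be sketched as a lookup table that returns nothing for unmapped labels. The canonical identifiers and the native label spellings below are assumptions made for illustration; the real mapping tables are not shown in this write-up.

```python
from typing import Iterable, List, Optional

# Assumed identifiers for the six canonical labels.
CANONICAL = (
    "pleural_effusion", "pneumothorax", "lung_opacity_consolidation",
    "nodule_mass", "cardiomegaly", "no_finding",
)

# Hypothetical native-to-canonical table; keys are lowercased native names.
NATIVE_TO_CANONICAL = {
    "pleural effusion": "pleural_effusion",
    "effusion": "pleural_effusion",
    "pneumothorax": "pneumothorax",
    "lung opacity": "lung_opacity_consolidation",
    "consolidation": "lung_opacity_consolidation",
    "nodule/mass": "nodule_mass",
    "nodule": "nodule_mass",
    "mass": "nodule_mass",
    "cardiomegaly": "cardiomegaly",
    "no finding": "no_finding",
}


def normalize_label(native: str) -> Optional[str]:
    """Return the canonical label, or None when there is no counterpart."""
    return NATIVE_TO_CANONICAL.get(native.strip().lower())


def normalize_all(natives: Iterable[str]) -> List[str]:
    """Normalize a list of native labels, dropping unmapped ones and duplicates."""
    out: List[str] = []
    for native in natives:
        canonical = normalize_label(native)
        if canonical is not None and canonical not in out:
            out.append(canonical)
    return out


example = normalize_all(["Aortic enlargement", "Nodule/Mass", "Atelectasis", "Consolidation"])
```

Returning `None` for an unknown label, rather than guessing the nearest canonical one, is what "dropped rather than forced" amounts to in code.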

The decomposed, transparent architecture

I deliberately did not build one end-to-end black box. The system has three layers: the two perception layers each use the model that is actually good at that job, and the only layer that makes a judgment is the one with no model in it.

  1. Image to grounded findings. A fine-tuned MedGemma 4B vision-language model, running on the GPU, emits a constrained JSON list of findings over the six labels, each with a normalized bounding box in MedGemma's native [y0, x0, y1, x1] format.
  2. Draft to labels. NVIDIA Nemotron-3 Nano 4B, running on the GPU through Hugging Face transformers, parses the draft impression into the same six labels, marking each as present or absent and keeping the verbatim draft phrase that produced each label. It reasons briefly over the draft before emitting the label JSON, which materially improves extraction on multi-clause drafts; the reasoning trace is stripped before the labels are parsed. Crucially, it reads explicit denials: paste "Cardiomegaly is present. No pneumothorax." and it returns cardiomegaly present plus pneumothorax absent, each with the exact span it came from.
  3. Deterministic comparison. A pure-logic comparator, no model, no randomness, applies three rules: a finding present in the image but absent or denied in the draft is MISSING; a finding asserted in the draft but absent from the image is UNSUPPORTED; any image-present finding on the urgent whitelist (pneumothorax and nodule / mass, both can't-miss findings) is surfaced as URGENT.
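The three comparator rules in the list above are small enough to sketch in full. This is an illustrative version under stated assumptions, not the project's code: the label identifiers, the `Flag` dataclass, and the shape of the `draft` dictionary are hypothetical, and it assumes that URGENT is raised in addition to MISSING and that no-finding is excluded from the comparison.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Urgent whitelist from the text; the identifier spellings are assumptions.
URGENT = {"pneumothorax", "nodule_mass"}


@dataclass(frozen=True)
class Flag:
    kind: str                       # "MISSING", "UNSUPPORTED", or "URGENT"
    label: str
    draft_phrase: Optional[str] = None


def compare(image_present: List[str], draft: Dict[str, dict]) -> List[Flag]:
    """Compare image findings against draft labels with no model involved.

    `draft` maps label -> {"present": bool, "phrase": str}; a label that is
    not in `draft` was simply never mentioned.
    """
    flags: List[Flag] = []
    # De-duplicate while keeping order; "no_finding" is not a comparable finding.
    img = [label for label in dict.fromkeys(image_present) if label != "no_finding"]
    for label in img:
        entry = draft.get(label)
        phrase = entry["phrase"] if entry else None
        if entry is None or not entry["present"]:
            flags.append(Flag("MISSING", label, phrase))      # unmentioned or denied
        if label in URGENT:
            flags.append(Flag("URGENT", label, phrase))       # always surfaced
    for label, entry in draft.items():
        if label != "no_finding" and entry["present"] and label not in img:
            flags.append(Flag("UNSUPPORTED", label, entry["phrase"]))
    return flags


draft = {
    "cardiomegaly": {"present": True, "phrase": "Cardiomegaly is present."},
    "pneumothorax": {"present": False, "phrase": "No pneumothorax."},
}
flags = compare(["pneumothorax"], draft)
```

Each flag carries the draft phrase that triggered it (or `None` when the draft never mentioned the finding), so it stays traceable to its evidence.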

The reason for the two-model split is practical, not decorative. My first design used the fine-tuned MedGemma for both jobs: grounding the image and parsing the draft. It is genuinely good at the first, because that is what I fine-tuned it to do, but I had narrowed it so hard on grounded finding extraction that it had become an unreliable reader of free text. It would miss denials, drop the verbatim span, or wander outside the label set on ordinary report phrasing. The draft parser does not need to look at the image at all; it needs to follow instructions over text and respect a strict schema. So I gave that job to a model built for it. NVIDIA Nemotron-3 Nano 4B is a small, instruction-following text model whose native nemotron_h architecture (a Mamba2-Transformer hybrid) is supported directly by transformers, so it loads from the bf16 weights and runs on the GPU with no extra runtime and no CUDA build of its own. It parses the draft cleanly, including the denials and the spans, while MedGemma keeps doing the grounding it was tuned for. Both models run on the GPU, so a full audit takes roughly 15 to 30 seconds. If the draft cannot be parsed at all, the audit degrades to an image-only pass with a visible note rather than failing.
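Stripping the reasoning trace and degrading visibly could look something like the sketch below. The `<think>...</think>` delimiter, the function name, and the wording of the note are assumptions for illustration; the actual trace format and fallback text are not specified here.

```python
import json
import re
from typing import List, Optional, Tuple

# Assumed delimiter for the reasoning trace; the real format may differ.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)

FALLBACK_NOTE = "Draft could not be parsed; showing an image-only audit."


def parse_draft_output(raw: str) -> Tuple[Optional[List[dict]], Optional[str]]:
    """Return (labels, note).

    `labels` is the parsed label list, or None when the output is unusable,
    in which case `note` holds a message to show the user.
    """
    text = THINK_BLOCK.sub("", raw).strip()  # drop the reasoning trace first
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None, FALLBACK_NOTE
    if not isinstance(data, list):
        return None, FALLBACK_NOTE
    return data, None


raw_output = (
    "<think>Two clauses: one assertion, one denial.</think>\n"
    '[{"label": "cardiomegaly", "present": true, "phrase": "Cardiomegaly is present."},'
    ' {"label": "pneumothorax", "present": false, "phrase": "No pneumothorax."}]'
)
labels, note = parse_draft_output(raw_output)
bad_labels, bad_note = parse_draft_output("<think>hmm</think> I could not decide.")
```

The caller can branch on `labels is None` to run the image-only pass and display `note`, so a parse failure never looks like an empty draft.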

The reason for the decomposition as a whole is trust. Because the comparator is deterministic and reads two explicit label sets, every flag is explainable: you can see which image finding and which draft phrase produced it, and you can see the box. Swapping the draft parser for Nemotron did not add a judgment layer; the only thing that decides MISSING, UNSUPPORTED, or URGENT is still the model-free comparator. The two perception models can each be wrong, but the flag they feed into is never a mystery.

The pure-logic core (the label set, the schema, the prompts, the comparator, the metrics, and the synthetic-draft generator) depends only on the Python standard library, numpy, and pydantic. It unit-tests with no GPU, no torch, and no network. The heavy stacks are optional extras, imported lazily. That separation kept iteration fast and the tests objective.

The synthetic-draft method

I could not get the perfect triple of image, real report, and box, so I manufactured the part I was missing. Starting from the real box labels, I generate a synthetic draft impression that faithfully describes those findings, and then I corrupt it in exactly one of three controlled ways:

  • Drop a present finding. This produces a MISSING case with known ground truth.
  • Add an absent finding. This produces an UNSUPPORTED case with known ground truth.
  • Change nothing. This is the faithful draft, my negative control, which must produce no flags.

Because I know the corruption I applied, I know the correct audit decision, so I can measure audit precision and recall directly. The real IU-Xray reports are used only to validate that the parser reads realistic prose, never as box supervision. This decoupling is what made a credible audit loop possible inside one week from open data alone.
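A toy version of the three corruptions, returning the audit decision that each one should produce, might look like this. The finding names, the sentence template, and the function names are hypothetical; a real generator would presumably use more varied phrasing.

```python
import random
from typing import List, Optional, Tuple

# Assumed vocabulary of findings a draft can mention.
FINDINGS = ["pleural effusion", "pneumothorax", "lung opacity", "nodule or mass", "cardiomegaly"]


def render(findings: List[str]) -> str:
    """Render a faithful one-sentence-per-finding draft (toy template)."""
    if not findings:
        return "No acute cardiopulmonary abnormality."
    return " ".join(f"{f.capitalize()} is present." for f in findings)


def make_case(true_findings: List[str], mode: str,
              rng: random.Random) -> Tuple[str, Optional[Tuple[str, str]]]:
    """Return (draft_text, expected_flag); expected_flag is None for a faithful draft."""
    findings = list(true_findings)
    if mode == "drop":
        if not findings:
            raise ValueError("nothing to drop")
        victim = rng.choice(findings)
        findings.remove(victim)
        return render(findings), ("MISSING", victim)
    if mode == "add":
        candidates = [f for f in FINDINGS if f not in findings]
        if not candidates:
            raise ValueError("nothing left to add")
        extra = rng.choice(candidates)
        return render(findings + [extra]), ("UNSUPPORTED", extra)
    if mode == "faithful":
        return render(findings), None  # negative control: no flag expected
    raise ValueError(f"unknown mode: {mode}")


truth = ["cardiomegaly", "pleural effusion"]
rng = random.Random(0)  # seeded so the corruption is reproducible
drop_draft, drop_flag = make_case(truth, "drop", rng)
add_draft, add_flag = make_case(truth, "add", rng)
clean_draft, clean_flag = make_case(truth, "faithful", rng)
```

Because each case carries its own expected flag, audit precision and recall come from comparing the auditor's flags against `expected_flag`, with no human labeling of the drafts.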

Training

Before I fine-tuned anything, I tried the stock base model. I pointed google/medgemma-1.5-4b-it, exactly as it ships, at the grounding job with the same production prompt I planned to use, on the handful of example chest X-rays. The encouraging part was that the medical knowledge was plainly already in there: read its output and it reasons sensibly about opacities, heart size, the things a chest X-ray model should know. The discouraging part was that it would not honor the contract the rest of the auditor depends on: a clean, on-vocabulary JSON list of {label, box} and nothing else, which is one of the two explicit label sets the deterministic comparator reads without guessing. The knowledge was present; the discipline was not.

A small, illustrative pass over the four example images (not a benchmark, just enough to see the shape of the problem) made the gap concrete. On one image it narrated paragraphs of step-by-step reasoning instead of emitting the JSON, so nothing parsed and the audit came back with zero findings on an image that had them. On another it called a real cardiomegaly normal and missed the finding entirely. On the mass image it labeled the mass as "consolidation," off-vocabulary for what it was, and then tacked on a contradictory "no finding" to the same output, which silently knocked out the URGENT flag, the single case I least wanted to lose. None of this was a knowledge gap; it was a reliability gap: narrating instead of emitting, missing findings, and mislabels that quietly cost the urgent flag. That is what sent me to fine-tuning, and it framed exactly what fine-tuning was for: not to teach the model to see a chest X-ray, which it already could, but to discipline it into reliable, parseable, on-vocabulary findings-with-boxes, every time, so the comparator downstream could trust what it was handed.

The fine-tune itself used QLoRA (4-bit NF4) on that same base, through TRL's SFTTrainer with PEFT: rank 16, alpha 16, learning rate 2e-4, with LoRA on the attention and MLP projections and the loss over the assistant target. The training target is the constrained finding JSON with boxes. I ran it on a single A100 through Hugging Face Jobs (after a cheap smoke run to catch container and CUDA issues first), merged the adapter into a clean bf16 base rather than the 4-bit model, and verified the merge actually captured the adapter before publishing, because a silent merge would publish the unchanged base. The merged 16-bit model fits the ZeroGPU tier at bf16 with no quantization and serves through a Gradio Space. An Unsloth FastVisionModel path is also provided for free Kaggle or local training.
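One library-free way to check that a merge actually changed the weights is to compare the targeted tensors before and after. The sketch below is hypothetical throughout: the tensor names, the flat-list representation, and the tolerance are assumptions, and a real check would read values from the two models' state dicts instead.

```python
from typing import Dict, List, Sequence


def changed_tensors(base: Dict[str, Sequence[float]],
                    merged: Dict[str, Sequence[float]],
                    tol: float = 1e-6) -> List[str]:
    """Return the names of tensors whose values differ by more than `tol`."""
    if base.keys() != merged.keys():
        raise ValueError("base and merged must expose the same tensor names")
    changed = []
    for name in base:
        if len(base[name]) != len(merged[name]):
            raise ValueError(f"shape mismatch for {name}")
        delta = max((abs(a - b) for a, b in zip(base[name], merged[name])), default=0.0)
        if delta > tol:
            changed.append(name)
    return changed


def assert_merge_took(base, merged, expected_targets: Sequence[str]) -> None:
    """Raise if any tensor the adapter targeted is identical to the base."""
    changed = set(changed_tensors(base, merged))
    untouched = [name for name in expected_targets if name not in changed]
    if untouched:
        raise RuntimeError(f"merge left these adapter targets unchanged: {untouched}")


# Hypothetical flattened weights: one adapted projection, one untouched norm.
base_w = {"q_proj.weight": [0.10, 0.20], "norm.weight": [1.0, 1.0]}
merged_w = {"q_proj.weight": [0.11, 0.20], "norm.weight": [1.0, 1.0]}
assert_merge_took(base_w, merged_w, ["q_proj.weight"])  # passes: weights moved
```

A silent merge shows up as every targeted tensor matching the base exactly, which this check turns into a hard failure before anything is published.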

I trained this twice. V1 was one epoch over roughly 6,800 curated, class-balanced grounding examples (alex-feeel/cxr-sft). When I looked at v1 closely it under-localized on harder images and kept double-labeling one region as both opacity and nodule, so I re-curated the corpus: alex-feeel/cxr-sft-v2 deduplicates same-region cross-finding overlaps down to the more specific label (cross-finding overlap at IoU >= 0.6) and merges duplicate triple-reader boxes (union-find at IoU >= 0.5). V2 is two epochs over that cleaner corpus. V2 (alex-feeel/medgemma-cxr-auditor-v2) is the model the Space serves today; v1 (alex-feeel/medgemma-cxr-auditor) is kept public for reference and comparison.
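The two curation steps can be sketched as follows, using the thresholds quoted above. Several details are assumptions, not something the write-up states: boxes are (x0, y0, x1, y1), a merged cluster is replaced by its coordinate mean, and nodule / mass is treated as more specific than lung opacity.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # assumed (x0, y0, x1, y1)

# Assumed specificity ranking; only labels listed here can shadow each other.
SPECIFICITY = {"lung_opacity_consolidation": 0, "nodule_mass": 1}


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def merge_reader_boxes(boxes: List[Box], thr: float = 0.5) -> List[Box]:
    """Cluster same-label boxes with union-find, then average each cluster."""
    parent = list(range(len(boxes)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if iou(boxes[i], boxes[j]) >= thr:
                parent[find(i)] = find(j)
    clusters: Dict[int, List[Box]] = {}
    for i, box in enumerate(boxes):
        clusters.setdefault(find(i), []).append(box)
    return [tuple(sum(b[k] for b in group) / len(group) for k in range(4))
            for group in clusters.values()]


def drop_generic_overlaps(items: List[Tuple[str, Box]], thr: float = 0.6) -> List[Tuple[str, Box]]:
    """Drop a generic label when a more specific one covers the same region."""
    kept = []
    for i, (label, box) in enumerate(items):
        shadowed = label in SPECIFICITY and any(
            other in SPECIFICITY
            and SPECIFICITY[other] > SPECIFICITY[label]
            and iou(box, other_box) >= thr
            for j, (other, other_box) in enumerate(items) if j != i
        )
        if not shadowed:
            kept.append((label, box))
    return kept


merged_boxes = merge_reader_boxes([(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)])
deduped = drop_generic_overlaps([
    ("lung_opacity_consolidation", (0, 0, 10, 10)),
    ("nodule_mass", (0, 0, 10, 9)),
    ("cardiomegaly", (0, 0, 10, 10)),
])
```

Union-find matters when three readers draw a chain of boxes in which the first and third overlap the second but not each other: all three still end up in one cluster.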

What the evaluation showed

For a while my own impression was that v2 had regressed: on a couple of out-of-distribution cancer X-rays it seemed to drop an urgent flag, and on the held-out comparison v2 looked worse than v1 on exactly the can't-miss findings I cared about most. I came close to keeping v1 as the served model on the strength of that comparison. That would have been the wrong call, and catching why is the most useful thing I learned this week.

The numbers were not measuring the model. A silent output-handling failure was quietly corrupting the evaluation itself. The step that turned a raw generation into a scored result threw away any output it could not consume cleanly, with no error and no count, so a chunk of v2's generations were being dropped before they were ever scored. Six of v2's nine held-out nodule cases vanished that way, which is exactly why v2 looked worse on urgent recall. The model was producing the findings; the evaluation was discarding them and then scoring v2 as if it had stayed silent. A measurement that silently drops data does not produce a smaller dataset, it produces wrong conclusions, and this one was about to flip a real decision toward the worse model.

The lesson I took from it is about evaluation integrity, not about one bug. Once I made the output handling refuse to drop anything silently (recover what it can from imperfect output, fall back visibly when it genuinely cannot, and never discard a generation without saying so) and re-ran the held-out comparison, the real signal showed through and the v1-versus-v2 decision flipped: v2 was the better model and had been all along. I would not have known that without distrusting my own first set of numbers and tracing where the data went. The same principle now governs the live system: when the draft parser cannot make sense of a draft, the audit degrades to an image-only pass with a visible note rather than silently dropping the draft, so a failure is always something you can see.
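An output handler that refuses to drop anything silently can be sketched as a parser that always returns a status, plus a tally that has to add up. The status names and the bracket-matching recovery heuristic are assumptions for illustration, not a description of the actual fix.

```python
import json
import re
from collections import Counter
from typing import List, Tuple

# Greedy match from the first "[" to the last "]" in the generation.
JSON_ARRAY = re.compile(r"\[.*\]", re.DOTALL)


def parse_generation(raw: str) -> Tuple[list, str]:
    """Return (findings, status), status in {'clean', 'recovered', 'unparsed'}."""
    try:
        data = json.loads(raw)
        if isinstance(data, list):
            return data, "clean"
    except json.JSONDecodeError:
        pass
    match = JSON_ARRAY.search(raw)  # try to salvage an array from narration
    if match:
        try:
            data = json.loads(match.group(0))
            if isinstance(data, list):
                return data, "recovered"
        except json.JSONDecodeError:
            pass
    return [], "unparsed"


def parse_all(generations: List[str]) -> Tuple[List[list], Counter]:
    """Parse every generation and account for each one explicitly."""
    tally: Counter = Counter()
    parsed = []
    for raw in generations:
        findings, status = parse_generation(raw)
        tally[status] += 1
        parsed.append(findings)  # unparsed outputs stay in the list, visibly empty
    # Every generation must be accounted for; nothing may vanish.
    assert sum(tally.values()) == len(generations) == len(parsed)
    return parsed, tally


generations = [
    '[{"label": "nodule_mass", "box": [10, 20, 30, 40]}]',
    'Here is my answer: [{"label": "cardiomegaly", "box": [1, 2, 3, 4]}] Hope that helps.',
    "I think the heart looks large.",
]
parsed, tally = parse_all(generations)
```

Reporting `tally` alongside the metrics makes a spike in `unparsed` visible as a harness problem before anyone reads it as a model regression.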

With the evaluation trustworthy I ran the systematic held-out comparison: 273 images held out from both models, a single greedy generation that matches exactly what production does, scored end to end with nothing silently discarded. V2 wins on every axis that matters.

Metric                           | V1    | V2
Presence macro-F1                | 0.646 | 0.735
Box [email protected] rate         | 0.484 | 0.633
Box [email protected] precision    | 0.613 | 0.791
Box [email protected] rate         | 0.360 | 0.531
Mean IoU on matched boxes        | 0.614 | 0.700
Urgent recall, nodule / mass     | 3/9   | 4/9
Urgent recall, pneumothorax      | 0/1   | 1/1

So v2 detects better, localizes better, and catches more of the can't-miss findings. That is why v2 is the served model. The earlier "v2 regressed" story was entirely an artifact of the corrupted evaluation; once nothing was silently dropped, the real signal showed through and the decision flipped to the right model.
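For readers unfamiliar with the first metric, presence macro-F1 averages a per-label F1 over the label set, so rare labels count as much as common ones. The sketch below shows one common convention (a label with no true or predicted positives scores 0); whether the numbers above used that convention is not stated, and the label identifiers are assumptions.

```python
from typing import List, Sequence, Set


def macro_f1(truth: List[Set[str]], pred: List[Set[str]], labels: Sequence[str]) -> float:
    """Average per-label F1 over image-level presence sets."""
    if len(truth) != len(pred):
        raise ValueError("truth and pred must have one entry per image")
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(truth, pred) if label in t and label in p)
        fp = sum(1 for t, p in zip(truth, pred) if label not in t and label in p)
        fn = sum(1 for t, p in zip(truth, pred) if label in t and label not in p)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)  # assumed zero-support convention
    return sum(scores) / len(scores)


# Toy example with three images and two labels.
truth = [{"cardiomegaly"}, {"nodule_mass"}, {"cardiomegaly", "nodule_mass"}]
pred = [{"cardiomegaly"}, set(), {"cardiomegaly", "nodule_mass"}]
score = macro_f1(truth, pred, ["cardiomegaly", "nodule_mass"])
```

In the toy example cardiomegaly scores 1.0 and nodule / mass scores 2/3, so the macro average is 5/6; the single missed nodule pulls the score down even though most predictions are right.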

I want to be clear about the limits of these numbers. The urgent classes are scarce in the held-out data (nodule / mass N=9, pneumothorax N=1), so the urgent recall figures are directional, not statistically robust, and I would not present them as anything stronger. The held-out ground truth still carries the same-region double-labels that v2's curation deduplicated, which slightly understates v2's generic-label recall against that ground truth. And the headline remains: a 4B model on noisy, triple-annotated boxes is good enough to demonstrate the audit loop and to flag can't-miss findings for a second look, but it is frequently wrong and is research and educational only, never a diagnosis and never a substitute for a radiologist.
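To put a number on "directional, not statistically robust," a Wilson score interval on the urgent recall counts is one standard choice. This is an added illustration, not part of the original evaluation, and the 95% level (z = 1.96) is an assumption.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# V2 urgent recall on nodule / mass: 4 of 9 held-out cases.
low, high = wilson_interval(4, 9)
```

For 4 of 9 the interval runs from roughly 0.19 to 0.73, which comfortably contains v1's 3/9, so the urgent-recall rows cannot separate the two models on their own.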

Where this came from: a radiologist's need

This did not start with a model; it started with a person. The radiologist behind it is one I know, Alexey Amelin (https://vk.ru/xraydiag), whose mostly-pediatric practice is exactly the kind of high-volume reading the opening describes. He has since tried the tool himself. Here, in his own words, is what he thinks:

When a colleague told me about this project, the idea landed on familiar ground right away. Every radiologist knows the value of a second look. The reading volume is high, the eyes tire toward the end of a long worklist, time is short — add night shifts and even an experienced specialist may not catch a detail immediately. This is no reproach to the profession; it is its everyday reality. And some findings are genuinely subtle: the smallest pleural effusion; focal changes superimposed on dense tissue; abnormalities of the chest organs in patients with coexisting somatic disease; the early changes of a disseminating process. It is not always possible to ask a colleague for a second read, even when the need for one is obvious. Emerging radiograph-audit models are a safety net for the radiologist and the patient alike.

The concept of the tool itself resonates with me. It does not make a diagnosis; it compares the doctor's draft impression against the image and highlights the disagreements: a finding present on the image but missed or denied in the text; a claim in the text that the image does not support; and, separately, it surfaces potentially urgent findings for a second look — all marked on the image so a person can look again. The decision always stays with the doctor. That is precisely the kind of helper I would value — a calm second pair of eyes, not a replacement for the specialist.

At the same time, as a radiologist whose practice is mainly pediatric, I should be clear: for pediatric chest radiography, artificial intelligence is less validated than it is for adults. Large, high-quality, age-diverse pediatric image sets are scarce — children make up only a small share of publicly available medical imaging, and the major open chest-radiograph databases were collected from adults. Because of this, models trained mostly on adult data can show clinically meaningful age-related bias in children — for example, a noticeable rise in false-positive cardiomegaly and thymomegaly in infants. So tools that grew out of adult data should not be assumed reliable by default: they need to be independently validated and recalibrated on pediatric data before there is any talk of using them in children.

And let me underline this separately: this is a research and educational quality-assurance tool, not a medical device and not a diagnostic instrument. Imaging findings cannot be interpreted in isolation from a particular patient's clinical and laboratory picture and history. The final word always rests with a qualified radiologist.

— Alexey Amelin (https://vk.ru/xraydiag), pediatric radiologist

What I learned

  • The data constraint, not the model, was the real problem. Once I accepted that the perfect triple did not exist as open data, the synthetic-draft method unblocked everything.
  • Decomposition buys explainability. The two perception models can each be wrong, so the only judgment lives in a deterministic, model-free comparator: every flag traces back to two explicit label sets — what the image found, what the draft said, and where they disagree. A wrong flag is then debuggable, not mysterious, which I valued more than a slightly higher end-to-end score.
  • Trust your evaluation before you trust its verdict. A silent output-handling failure was corrupting my numbers and nearly led me to serve the worse model. Nothing in a measurement pipeline should ever discard data without saying so; the moment it does, you are scoring an artifact, not a model.
  • Use the right model for each job, not the same model for everything. Fine-tuning MedGemma hard on grounded extraction made it worse at reading free text, so the draft is now parsed by a small instruction-following text model (NVIDIA Nemotron-3 Nano 4B, on the GPU through transformers) while MedGemma keeps doing the grounding. Two narrow models each doing what they are good at beat one model stretched across two jobs.
  • Keeping the core dependency-light made the week survivable. Pure-logic modules that test without a GPU meant I could iterate on the audit rules and the schema without waiting on the model stack.
  • A tiny model is enough for this framing. I am not asking the vision model to write a report; I am asking it for a constrained label set with boxes, and I am asking a small text model to read a draft into the same labels. Both are narrow asks that small models do usefully.

Try it and read the rest

The Space runs the full loop end to end. The grounding model is alex-feeel/medgemma-cxr-auditor-v2, with the earlier v1 alex-feeel/medgemma-cxr-auditor kept up for comparison, and the draft parser is NVIDIA Nemotron-3 Nano 4B.

Hugging Face Space: build-small-hackathon / cxr-draft-auditor. Research QA for chest X-ray draft impressions (not a device).

Research and educational QA only. The system described here is NOT a medical device, NOT diagnosis, and NOT for clinical use. Outputs are frequently wrong. Always consult a qualified radiologist.
