Five checks before you let AI screen your studies

Why screening is the riskiest place to add AI

Title-and-abstract screening is the most attractive task to automate in a review: it is high-volume, repetitive, and slow. It is also where a wrong call is hardest to see after the fact. A study that an AI classifier wrongly excludes never reaches full-text review, never appears in your PRISMA flow as anything but a number, and never gets a second look. The time you save is visible; the evidence you lose is not.

That asymmetry is the whole reason to put deliberate checks around AI screening. The goal is not to avoid AI — used well it genuinely accelerates a review — but to make sure the studies your conclusions rest on are still in the set when you reach them.

1. Know what the tool was validated on

A screening tool's performance is not a fixed property — it depends on the review it was measured against. A recall of 98% on a clinical-drug corpus tells you little about how it behaves on implementation research or grey literature. Before you trust a tool, ask what it was validated on and how close that is to your question.

2. Decide your recall threshold before you start

Most AI screeners rank records by relevance rather than giving a clean include/exclude. That makes the cut-off your decision, not the tool's. Set the threshold — and the recall you are willing to accept — before you see the results, so the number is a methodological choice rather than a convenience.
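One way to make the cut-off an explicit, pre-registered choice is to derive it from a human-labelled pilot sample. The sketch below assumes the tool gives each record a relevance score where higher means more relevant, and that you have human include/exclude labels for the pilot records. The function name, the data layout, and the pilot values are all illustrative, not taken from any particular tool.

```python
from typing import List, Tuple


def cutoff_for_recall(
    scored: List[Tuple[float, bool]], target_recall: float
) -> Tuple[float, int]:
    """Find the score cut-off that reaches a target recall on a labelled pilot set.

    `scored` holds (tool_score, human_says_relevant) pairs.
    Returns (cutoff, n_to_screen): screening every record with a score >= cutoff
    captures at least `target_recall` of the relevant records in the pilot set.
    """
    if not 0 < target_recall <= 1:
        raise ValueError("target_recall must be in (0, 1]")
    total_relevant = sum(1 for _, relevant in scored if relevant)
    if total_relevant == 0:
        raise ValueError("pilot set contains no relevant records")

    # Walk down the ranking from the highest score until recall is reached.
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    found = 0
    for score, relevant in ranked:
        if relevant:
            found += 1
        if found / total_relevant >= target_recall:
            # Count ties at the cut-off too, since they cannot be told apart.
            n_to_screen = sum(1 for s, _ in ranked if s >= score)
            return score, n_to_screen
    raise AssertionError("unreachable: recall is 1.0 at the end of the ranking")


# Invented pilot data, for illustration only.
pilot = [
    (0.9, True), (0.8, False), (0.7, True), (0.6, False), (0.5, True),
    (0.4, False), (0.3, False), (0.2, True), (0.1, False),
]
cutoff, n_to_screen = cutoff_for_recall(pilot, target_recall=0.75)
```

The recall this reports is only an estimate from the pilot sample, so a small pilot gives a shaky cut-off. Fixing `target_recall` in the protocol before running this keeps the decision methodological.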

3. Keep a human in the loop on the exclusions

The cheapest, most effective safeguard is to sample what the tool excludes, not just what it includes. A small human-screened sample of the rejected pile surfaces systematic blind spots — a phrasing the model never learned, a study type it under-weights — long before they reach your synthesis.
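A minimal sketch of that check: draw a reproducible random sample from the excluded pile, have a human screen it, then put an interval around the share of relevant studies found there. The Wilson score interval is used here as one common choice for a proportion; the helper names, the record IDs, and the default 95% level (z = 1.96) are assumptions for illustration.

```python
import math
import random
from typing import List, Sequence, Tuple


def sample_excluded(excluded_ids: Sequence[str], n: int, seed: int) -> List[str]:
    """Draw a reproducible simple random sample of excluded record IDs."""
    rng = random.Random(seed)  # fixed seed so the sample can be re-drawn later
    return rng.sample(list(excluded_ids), min(n, len(excluded_ids)))


def wilson_interval(hits: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (hits out of n sampled records)."""
    if n <= 0 or not 0 <= hits <= n:
        raise ValueError("need n > 0 and 0 <= hits <= n")
    p = hits / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical IDs standing in for the tool's rejected pile.
excluded = [f"rec-{i}" for i in range(1000)]
to_check = sample_excluded(excluded, n=200, seed=42)

# Suppose a human reviewer finds 2 relevant studies among the 200 sampled.
low, high = wilson_interval(hits=2, n=200)
```

Finding zero relevant studies in the sample does not mean the excluded pile holds none; the upper bound of the interval is the figure to carry forward. Reading the missed studies themselves is what reveals a systematic blind spot, which the interval alone cannot show.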

4. Report the tool, the settings, and the checks

If an AI tool touched your study set, that is part of your methods. Name the tool and version, the threshold you used, and the validation you ran. Reporting is what lets a reader judge the review — and what lets the next team reproduce it.
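One way to keep those details together is a small structured record saved alongside the review files. The sketch below is a hypothetical layout, not a reporting standard: every field name is an assumption, and the values are placeholders rather than real results.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ScreeningReport:
    """Hypothetical record of how an AI screener was used in one review."""

    tool_name: str
    tool_version: str
    score_cutoff: float            # threshold applied to the tool's ranking
    target_recall: float           # recall fixed in the protocol beforehand
    validation_sample_size: int    # human-labelled records used for validation
    validation_recall: float       # recall observed on that sample
    excluded_sample_size: int      # excluded records re-screened by a human
    excluded_sample_missed: int    # relevant studies found among them

    def to_json(self) -> str:
        """Serialise with sorted keys so successive versions diff cleanly."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Placeholder values only; substitute what your review actually used and found.
report = ScreeningReport(
    tool_name="ExampleScreener",
    tool_version="0.0.0",
    score_cutoff=0.5,
    target_recall=0.95,
    validation_sample_size=300,
    validation_recall=0.96,
    excluded_sample_size=200,
    excluded_sample_missed=2,
)
report_json = report.to_json()
```

Keeping this file under version control alongside the search strategy gives the methods section a single source to quote from.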

5. Re-check when the inputs change

A tool that performed well on your pilot search can drift when the search is broadened, the topic shifts, or the model is updated underneath you. Treat validation as something you repeat at the points where the inputs change, not a box ticked once at the start.
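A simple way to notice those points is to record the inputs that were in place when you last validated and compare them with the current ones before each screening run. The sketch below assumes both sets of inputs are kept as plain dictionaries; the keys and values shown are hypothetical.

```python
from typing import Any, Dict, List

_MISSING = object()  # distinguishes "key absent" from "value is None"


def changed_inputs(validated: Dict[str, Any], current: Dict[str, Any]) -> List[str]:
    """Return the sorted names of inputs that differ from the last validation."""
    keys = sorted(set(validated) | set(current))
    return [
        key
        for key in keys
        if validated.get(key, _MISSING) != current.get(key, _MISSING)
    ]


# Hypothetical inputs recorded at the last validation, and the current ones.
validated_inputs = {
    "search_string": "(placeholder pilot search)",
    "model_version": "1.0",
    "review_question": "(placeholder question)",
}
current_inputs = dict(validated_inputs, model_version="1.1")

changes = changed_inputs(validated_inputs, current_inputs)
if changes:
    print("Re-validate before screening; changed inputs:", ", ".join(changes))
```

This only catches changes you record. A model updated silently by the vendor will not show up unless the tool exposes a version you can log, which is one more reason to repeat the excluded-pile sample at intervals.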