LLM-resistant exams

Last updated: 30 April 2026 · Reviewed by Tim Burnett (Admin)

TLDR

LLM-resistant exams are designed to make it harder for generative AI to complete or significantly assist with the work without detection, while still testing the intended human capability. The central question is not whether a task can frustrate AI, but whether it still validly evidences the learner’s own knowledge, judgement, and performance. The stronger sources suggest that assessment design matters more than after-the-fact policing, but there is no settled best model across subjects or stakes. The main risk is mistaking superficial resistance for genuine validity and authenticity.

Definition

LLM-resistant exams are assessment designs intended to reduce the extent to which generative AI can complete, materially assist, or disguise the candidate’s work without being detected. In assessment terms, the issue is whether the exam still captures the construct it is meant to measure, rather than whether it simply makes AI use awkward.

Why It Matters

This matters because many assessment teams are moving from asking whether AI use is possible to asking whether the assessment should be redesigned. If an exam relies on generic prompts, predictable structures, or easily outsourced reasoning, it may be more vulnerable to AI than its designers assumed. The deeper assessment issue is authenticity: what does the exam really prove about the candidate’s own judgement, knowledge, and performance?

Key Concepts

- **AI resistance**: the extent to which a task is difficult for AI to complete well enough to substitute for the candidate.
- **Authenticity**: whether the work still shows the candidate’s own capability.
- **Construct alignment**: whether the task still measures the intended knowledge or skill.
- **Task design**: the choice of scenario, prompt style, response mode, and timing conditions.
- **Contextualised judgement**: assessment of decisions grounded in specific evidence, local context, or live reasoning, which is often harder to outsource than generic text production.

What Experts Agree On

The main point of agreement is that exam design matters more than after-the-fact policing. Stronger sources point towards redesigning vulnerable tasks rather than relying only on rules, detection, or suspicion. There is also broad convergence around contextualised, real-world, and judgement-heavy tasks as more defensible where they fit the construct.

A second shared view is that AI resistance should not be treated as a technical bolt-on. If the task no longer captures the intended capability, it has not become better simply because it is harder for AI to answer.

What Is Contested

What is unsettled is how resistant an exam needs to be before it is fit for purpose. Making AI assistance harder is not the same as making it irrelevant, and some redesigns may raise the cost of AI use without solving the underlying validity question. The open question is which changes genuinely preserve the construct and which mainly shift the format.

There is also tension around the label itself. “LLM-resistant” can sound like a neat technical fix for what is really a design and governance issue. If an assessment still rewards polished output more than observed reasoning, it may remain vulnerable even after prompt or scenario changes.

Risks

- Overstating the protection offered by task redesign alone.
- Introducing artificial scenarios that reduce authenticity in the name of AI resistance.
- Narrowing the construct so much that the assessment no longer reflects real-world competence.
- Increasing workload for assessment writers and moderators.
- Confusing resistance to generic LLM use with resilience against broader AI-enabled assistance.

Good Practice

A sensible assessment approach is to start with the construct and ask what the exam is actually meant to evidence.

1. Define the capability that must be shown unaided.
2. Identify where AI support would change the meaning of the result.
3. Test whether the prompt, timing, format, or evidence requirements still capture the intended judgement or knowledge.
4. Add checkpoints, viva elements, or process evidence where the candidate’s own reasoning needs to be visible.
5. Check whether the redesign improves authenticity and validity, rather than merely making the task harder to automate.
6. Ask what a candidate would need to show to demonstrate competence without relying on generic generated text.

Options or Comparison

### 1. Prohibit AI use

Best suited to settings where unaided performance is essential and easily specified. This is clear to communicate, but it can be hard to enforce and may not align with authentic workplace practice.

### 2. Permit limited AI use

Useful where the construct includes responsible tool use, but assessment teams need to define exactly what is allowed and what evidence remains individual. This can be more realistic, but it depends on strong task design and clear governance.

### 3. Integrate AI into the assessment design

Appropriate when using AI is part of the capability being assessed, such as evaluation, judgement, or checking. The advantage is better alignment with practice; the trade-off is that the assessment must still isolate the learner’s own understanding and decision-making.

Example in Practice

A qualification team redesigns a short written response task that previously rewarded generic explanation. Instead of asking for a broad essay, the task now uses a local scenario with specific evidence, followed by a brief viva or justification step. That does not make AI irrelevant, but it makes it harder for a polished generated answer to stand in for the candidate’s own reasoning.

Key Sources

- Research-oriented article on creating large language model resistant exams.
- Webinar page on Gen AI in assessment, deepfakes, and synthetic media.
- Regulatory signal that authenticity pressure is already affecting assessment design.

Vendor Landscape

The vendor footprint is indirect. Suppliers are more likely to frame this as exam security, integrity, or adaptive design than as “LLM resistance” specifically. That makes the market useful as a signal of response options, but not as validation that a redesign preserves validity and fairness.

FAQs

### How can an exam be made more resistant to ChatGPT and other LLMs?

Use task designs that require specific evidence, live judgement, local context, or visible reasoning, rather than generic explanations that an AI can easily produce.

### Does making an exam AI-resistant solve authenticity problems?

Not by itself. It may help, but the task still has to measure the right construct and remain defensible in context.

### What should assessment teams ask when reviewing an exam for AI resistance?

Ask what evidence the task still captures, whether the candidate’s own reasoning remains visible, and how the redesign preserves validity as well as resistance.

### Is a more AI-resistant exam always a better exam?

No. If the redesign reduces authenticity or narrows the construct too far, it may be less defensible even if it is harder for AI to use.

Suggested Citation

Test Community Network. "LLM-resistant exams." TCN AI & Assessment Wiki. Last reviewed 2026-04-30. https://www.testcommunity.network/wiki/llm-resistant-exams.html

Sources

- Research-oriented article on creating large language model resistant exams.
- Webinar page on Gen AI in assessment, deepfakes, and synthetic media.
- FE Week / Schools Week article on Ofqual, AI misuse, coursework authenticity, and digital exams.
