A New Era: Has Online Proctoring Reached Its 4th Generation?
How Agentic AI and Contextual Analytics Are Reshaping Assessment Security
As we've discussed in previous articles, remotely proctored assessment is now routine, but it is facing some new and interesting challenges.
Since 2022, the threat has shifted from visible behaviours to how responses are generated and by whom. Large Language Models (LLMs) and other low-friction tools now give candidates near PhD-level knowledge, complete with vision capabilities, in real time. Candidates can therefore obtain covert assistance without obvious on-camera cues. That change, from behavioural to cognitive risk, demands proctoring (invigilation) that reasons over evidence in real time and, as we will explore, more importantly across multiple sessions.
Before we explore Gen 4, let me explain my taxonomy:
Gen 3 was built to spot things in a session - faces, gazes, second people, objects, focus changes - and escalate. It reduced some manual effort but remained narrow: each decision focused on a single candidate at a single point in time.
Two weaknesses follow: it cannot see risk that spans candidates or sessions, such as collusion or item exposure, and it generates a stream of isolated flags that still demand heavy manual follow-up.
To be fair, it's perfectly possible that some systems I'm classifying as Gen 3 were reasoning over triggers behind the scenes. My argument is that once AI is introduced to process all the evidence and make the calls, it is worth drawing some more detailed distinctions.
Gen 4 introduces agentic reasoning and scope.
Talview "Alvy": positions an autonomous, agentic proctor that perceives, reasons, and acts within a session with human-in-the-loop controls.
Excelsoft: advocates contextual, near-real-time analytics and forensics to protect item confidentiality and detect anomalous behaviour quickly across multiple candidates.
Caveon Observer: emphasises data-driven, risk-based monitoring that brings humans in only when risk thresholds are crossed, combining real-time behavioural and response data with targeted expert review to lower false alerts and reduce invasiveness.
Interestingly, the last of these, Caveon, suggests its service is not a proctoring solution, but a solution to proctoring.
If you work in an awarding organisation or professional certification provider, the bar is not "did we spot something odd," it is "can we show proportionate, fair, auditable control." Gen 4 helps you do that. Instead of a heap of disconnected flags, you get a single narrative: what happened, when thresholds were crossed, who reviewed it, and why the final call was made.
That, in theory, makes malpractice prevention, detection, and investigation calmer and more defensible, with the added bonus of fewer flags to follow up on.
Think of Gen 4 as moving from "that looked odd" to "here is why this pattern matters." During a session it pulls in identity checks, device posture, window focus, typing cadence, per-item timing, answer changes, environmental audio, and the delivery platform's own safeguards. If someone repeatedly loses focus, types at a cadence that is not humanly plausible, and suddenly spikes accuracy on the hardest items, the system asks for a quick re-auth and pings a reviewer. No drama, just a targeted check.
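The in-session logic above can be sketched as signal fusion: no single signal triggers anything, but several co-occurring ones do. This is a loose illustration under my own assumptions; the signal names, thresholds, and escalation rule are invented for the example, not taken from any product:

```python
from dataclasses import dataclass


@dataclass
class SessionSignals:
    focus_losses: int          # window/tab focus changes this session
    chars_per_minute: float    # observed typing cadence
    hard_item_accuracy: float  # accuracy on the hardest items (0..1)
    baseline_accuracy: float   # accuracy on the easier items (0..1)


def assess(signals: SessionSignals) -> list[str]:
    """Collect independent risk indicators from one session."""
    flags = []
    if signals.focus_losses >= 3:
        flags.append("repeated focus loss")
    if signals.chars_per_minute > 900:  # ~180 wpm sustained is implausible
        flags.append("implausible typing cadence")
    if signals.hard_item_accuracy - signals.baseline_accuracy > 0.30:
        flags.append("accuracy spike on hardest items")
    return flags


def actions(flags: list[str]) -> list[str]:
    """One flag alone stays quiet; two or more trigger a targeted check."""
    if len(flags) >= 2:
        return ["request re-authentication", "notify human reviewer"]
    return []
```

The design choice worth noting is the quiet failure mode: a lone anomaly is logged but not escalated, which is where the reduction in false positives comes from.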
At the same time it is watching the cohort. If several live sessions show the same improbable timing on the same items and their answer similarity jumps above norms, it raises an item-exposure alert, notifies test security, and rotates variants before the leak spreads. Every step is logged. After the sitting, those logs feed the thresholds, so the system gets quieter on noise and sharper on risk.
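The answer-similarity part of that cohort check can be sketched very simply: compare response vectors pairwise and alert when too many pairs agree beyond a norm. Again, a minimal sketch, assuming multiple-choice responses keyed by item ID; the threshold values are illustrative:

```python
from itertools import combinations


def similarity(a: dict[str, str], b: dict[str, str]) -> float:
    """Fraction of shared items on which two candidates gave identical answers."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    return sum(a[i] == b[i] for i in shared) / len(shared)


def exposure_alert(sessions: dict[str, dict[str, str]],
                   norm: float = 0.85,
                   min_pairs: int = 2) -> list[tuple[str, str]]:
    """Return candidate pairs whose answer similarity exceeds the cohort norm.

    Enough such pairs at once suggests the items themselves have leaked,
    which is the cue to rotate variants rather than chase individuals.
    """
    suspect = [(x, y) for x, y in combinations(sessions, 2)
               if similarity(sessions[x], sessions[y]) > norm]
    return suspect if len(suspect) >= min_pairs else []
```

A real system would weight this by item difficulty and timing, but the core shift is visible even here: the unit of analysis is the cohort and the item pool, not the individual candidate.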
Gen 4 widens the lens to the cohort and to the live delivery itself, reasons across multiple sessions, and cuts false positives without going soft on security.
As ever, online proctoring is only part of the story. The strongest security option is to connect your online proctoring with other systems in real time, such as the test delivery client. And, of course, the most important question we should be asking is whether assessment design itself is fit for purpose in the age of AI.