Validating Deception Detection: Benchmarking Against 8,000 Labeled Statements

Why Validation Is the Hard Part

Building a model that produces a score is easy. Every NLP library in existence will let you assign a number to a sentence in an afternoon. The difficult part — the part that most "AI fraud detection" vendors quietly skip — is demonstrating that the number means anything.

Deception detection has a particular measurement problem. Unlike spam classification, where you can harvest labeled examples at scale by watching which emails users mark as junk, obtaining ground-truth deception labels requires human experts who have independently verified the factual status of each claim. That's slow, expensive, and hard to scale.

The result: most deception detection systems are validated on toy datasets (dozens of sentences from a single study), on proprietary data that can't be reproduced, or not validated at all — just deployed with a marketing claim and no methodology. We chose a different approach.

The gold standard for text-based deception detection benchmarking is the LIAR dataset. It has a published methodology, it's publicly available for reproduction, and it's large enough to produce statistically meaningful results. We run every version of Candor's model against it continuously, and we publish the results — even when they're inconvenient.

The LIAR Dataset: What It Is and Why It Matters

The LIAR dataset was introduced by William Yang Wang in his 2017 paper "Liar, Liar Pants on Fire: A New Benchmark Dataset for Fake News Detection." Wang scraped 12,836 short political statements from Politifact.com, each accompanied by a veracity ruling from trained Politifact fact-checkers. After filtering for data quality, the usable evaluation corpus is approximately 8,041 statements.

What makes LIAR useful is its six fine-grained veracity labels — not a binary true/false split, but a spectrum:

true: The statement is accurate and nothing significant has been left out.
mostly-true: Accurate but missing context or caveats that could give a different impression.
half-true: Partially accurate but leaving out important details or taking things out of context.
mostly-false: Contains an element of truth but ignores critical facts that would give a very different impression. (The LIAR data files name this class barely-true.)
false: Not accurate.
pants-fire: Not accurate and makes a ridiculous claim.

The spectrum matters. Real-world deception isn't binary; it runs on a continuum from minor omissions to brazen fabrication. A benchmark that treats "mostly-true" the same as "pants-fire" is measuring something coarser than what a production deception API actually needs to handle.

Wang's original paper reported a best six-class accuracy of 27.4%, obtained with a hybrid convolutional network that combines the statement text with speaker metadata; the simpler baselines in the paper, logistic regression included, scored lower. Wang also noted that even state-of-the-art models in 2017 struggled to exceed 40% on multi-class classification. This contextualizes the hardness of the task: it is genuinely difficult, and any system claiming 95%+ accuracy on LIAR should be treated with suspicion.

"Automatic fake news detection is a challenging problem in deception detection, and it has tremendous real-world political and social impacts." — Wang, W.Y. (2017). "Liar, Liar Pants on Fire." Proceedings of ACL 2017.

Our Methodology: Mapping Six Labels to Binary

Candor's API produces a deception score on a 0–100 scale. Once thresholded, that score is a binary deception signal, not a six-class classifier: it answers "does this text exhibit linguistic deception signals?" not "which of Politifact's six categories does this claim fall into?"

To benchmark against LIAR, we map the six-class labels to binary ground truth:

Honest (score expected low): true, mostly-true
Deceptive (score expected high): half-true, mostly-false, false, pants-fire

This mapping is conservative. We classify half-true as deceptive — statements that are misleading even when they contain a factual kernel are precisely the class that psycholinguistic deception signals are designed to catch. The linguistic pattern of strategic omission and framing is still a deception pattern, regardless of whether any individual claim is technically false.
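The mapping is small enough to write down directly. The sketch below is illustrative rather than Candor's evaluation code: the function name `to_binary` and the two constant names are assumptions made for this example. It also accepts `barely-true`, the spelling the LIAR data files use for the class this article calls mostly-false.

```python
# Illustrative sketch; names are hypothetical, not Candor's code.
HONEST_LABELS = frozenset({"true", "mostly-true"})
DECEPTIVE_LABELS = frozenset(
    # "barely-true" is the LIAR file spelling of the mostly-false class.
    {"half-true", "mostly-false", "barely-true", "false", "pants-fire"}
)


def to_binary(label: str) -> int:
    """Map a six-class veracity label to 1 (deceptive) or 0 (honest)."""
    normalized = label.strip().lower()
    if normalized in DECEPTIVE_LABELS:
        return 1
    if normalized in HONEST_LABELS:
        return 0
    # Fail loudly instead of silently dropping rows with unexpected labels.
    raise ValueError(f"unknown veracity label: {label!r}")
```

Raising on an unknown label is a deliberate choice here: a silently skipped row would shrink the benchmark without anyone noticing.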

Batch Processing and Error Handling

We process the LIAR corpus in batches through the Candor API. Each statement is submitted as a standalone text input. The API returns a deception score and five sub-signal scores (pronoun distancing, hedging, emotional leakage, cognitive complexity, detail specificity). We record the full response for each statement along with its ground-truth LIAR label.

Some statements in the LIAR corpus are very short — one sentence, sometimes fewer than 10 words. Psycholinguistic deception signals require sufficient text to compute reliably; a single declarative sentence gives limited signal. We do not filter these out of the benchmark — they count against us — because removing short inputs would produce artificially inflated scores that don't reflect production conditions.

The evaluation is ongoing. We checkpoint results after each batch and publish the running metrics at /validation, updated as each batch completes. The live page reflects the current state of the evaluation, not a snapshot from launch.
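A batch loop with checkpointing might look like the sketch below. It is a minimal illustration under stated assumptions, not the production pipeline: `score_fn` is a hypothetical stand-in for the API call, and the record fields (`id`, `text`, `label`), the JSON checkpoint file, and the batch size are all invented for this example.

```python
import json
from pathlib import Path
from typing import Callable, Iterator, Sequence


def batches(items: Sequence[dict], size: int) -> Iterator[Sequence[dict]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_with_checkpoints(
    statements: Sequence[dict],       # each: {"id": str, "text": str, "label": str}
    score_fn: Callable[[str], dict],  # hypothetical stand-in for the API call
    checkpoint_path: Path,
    batch_size: int = 50,
) -> dict:
    """Score statements in batches, saving all results after every batch."""
    results: dict = {}
    if checkpoint_path.exists():
        # Resume: reuse responses recorded by an earlier run.
        results = json.loads(checkpoint_path.read_text())
    for batch in batches(statements, batch_size):
        for item in batch:
            if item["id"] in results:
                continue  # already scored before the last checkpoint
            results[item["id"]] = {
                "label": item["label"],              # ground-truth label
                "response": score_fn(item["text"]),  # full response, stored as returned
            }
        # Checkpoint once per batch so an interrupted run loses at most one batch.
        checkpoint_path.write_text(json.dumps(results))
    return results
```

Because the checkpoint is keyed by statement id, rerunning the function after an interruption skips everything already scored and only pays for the remaining statements.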

Results: 905 Samples, F1 = 0.534

Samples evaluated: 905
Total corpus size: 8,041
F1 score: 0.534
Human baseline: 54%

As of the current evaluation run, we have processed 905 of the 8,041 LIAR statements, approximately 11.3% of the corpus. The model achieves an F1 score of 0.534 against the binary ground truth mapping described above.

To put that number in context:

System	Metric	Notes
Random classifier	Accuracy ~50%	Coin flip on binary task
Human judges (unaided)	Accuracy ~54%	Meta-analysis across studies; Bond & DePaulo 2006
Candor (current)	F1 = 0.534	905 LIAR samples; binary mapping; all input lengths
Wang (2017) best model	Accuracy 27.4%	Six-class task; hybrid CNN using statement text plus speaker metadata

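F1 = 0.534 on a binary task puts the model in the same narrow band as the accuracy baselines above, roughly at parity with unaided human judges, who per the research literature detect deception only barely above chance. One caveat: F1 and accuracy are different metrics, and the F1 a chance-level classifier earns depends on the share of statements labeled deceptive under the mapping, so the comparison is approximate. This isn't a triumphant benchmark. It's an honest one.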
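Since the comparison sets an F1 score beside accuracy figures, it helps to see what F1 a chance-level classifier would earn. The sketch below computes that for two trivial baselines as a function of the deceptive share of the data. The function name `chance_f1` is made up for this example, and the prevalence values are placeholders, not measured properties of the 905 evaluated statements.

```python
def chance_f1(prevalence: float, flag_rate: float) -> float:
    """F1 at the expected confusion counts of a label-independent classifier.

    A classifier that flags a fraction `flag_rate` of inputs at random has
    precision equal to the prevalence and recall equal to the flag rate.
    """
    precision, recall = prevalence, flag_rate
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


for prevalence in (0.5, 0.6, 0.7):  # placeholder deceptive shares
    coin = chance_f1(prevalence, 0.5)    # coin flip
    always = chance_f1(prevalence, 1.0)  # flag everything
    print(f"prevalence={prevalence:.1f}  coin-flip F1={coin:.3f}  flag-all F1={always:.3f}")
```

With balanced classes a coin flip earns F1 = 0.5, and both baselines rise as the deceptive share grows, which is why the class balance of the evaluated sample belongs next to any headline F1.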

The LIAR dataset is also a harder test than most real-world use cases. Short political claims — the average LIAR statement is under 20 words — give psycholinguistic models far less text to work with than the insurance narratives, legal statements, or long-form reviews where these signals are most reliable. We benchmark against LIAR specifically because it's hard, not because it flatters us.

What F1 Doesn't Tell You

F1 is the harmonic mean of precision and recall. It collapses two important numbers into one, which obscures meaningful information about where a model is actually failing.

Precision vs. Recall Tradeoffs

Precision measures what fraction of our "deceptive" predictions are actually deceptive. Recall measures what fraction of actually-deceptive statements we correctly flag. These two metrics are in tension: a model that flags everything achieves 100% recall at the cost of precision; a model that only flags the most obvious cases gets high precision at the cost of recall.
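The two extremes described above can be checked on a toy example. This is a minimal sketch: the helper name `precision_recall_f1` and the six labels are invented for illustration and are not drawn from the LIAR evaluation.

```python
from typing import Sequence, Tuple


def precision_recall_f1(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for binary labels where 1 means deceptive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels, invented for illustration: 1 = deceptive, 0 = honest.
truth = [1, 1, 0, 0, 1, 0]
flag_everything = [1] * len(truth)
flag_obvious_only = [1, 0, 0, 0, 0, 0]

print(precision_recall_f1(truth, flag_everything))    # full recall, precision = deceptive share
print(precision_recall_f1(truth, flag_obvious_only))  # full precision, low recall
```

The two strategies fail in opposite directions yet land on F1 values that are not far apart, which is the information a single F1 number hides.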

For a triage tool — which is what Candor is — recall matters more than precision. The cost of a missed deceptive statement (a false negative) is typically higher than the cost of flagging a legitimate statement for extra human review (a false positive). Our threshold tuning reflects this: we tune for recall while maintaining precision at a level where the false positive rate doesn't overwhelm the human reviewer.
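One simple way to express "tune for recall subject to a precision floor" is to scan candidate thresholds and keep the one with the highest recall among those that meet the floor. The sketch below is an assumption about how such tuning could be done, not a description of Candor's procedure; `pick_threshold` and `min_precision` are hypothetical names.

```python
from typing import Optional, Sequence


def pick_threshold(
    scores: Sequence[float],
    labels: Sequence[int],   # 1 = deceptive, 0 = honest
    min_precision: float,
) -> Optional[float]:
    """Return the threshold with the highest recall whose precision meets the floor.

    A statement is flagged when its score is >= the threshold. Returns None if
    no threshold satisfies the floor or there are no deceptive examples.
    """
    total_positive = sum(labels)
    if total_positive == 0:
        return None
    best = None  # (recall, precision, threshold)
    for threshold in sorted(set(scores)):
        flagged = [label for score, label in zip(scores, labels) if score >= threshold]
        tp = sum(flagged)
        precision = tp / len(flagged)  # at least one score equals the threshold
        recall = tp / total_positive
        if precision < min_precision:
            continue
        # Prefer higher recall; break ties with higher precision.
        if best is None or (recall, precision) > best[:2]:
            best = (recall, precision, threshold)
    return None if best is None else best[2]
```

The precision floor is where reviewer capacity enters the picture: lowering it buys recall at the cost of more legitimate statements landing in the review queue.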

Domain Specificity

The LIAR dataset is entirely composed of political speech. Political claims have a specific rhetorical register — hedging that signals political positioning rather than cognitive load, emotional language calibrated for public persuasion rather than personal narrative. The six psycholinguistic signals described in our earlier article on the science were largely validated on personal narratives, legal testimony, and interpersonal deception studies — not political discourse.

This means F1 = 0.534 on LIAR is likely a conservative estimate of performance on the use cases where Candor is most applicable: insurance claims, marketplace reviews, compliance narratives, and written statements in institutional contexts. These domains have more text per input and more pronounced signal differentiation between honest and deceptive accounts.

Label Ambiguity in the Gray Zone

The hardest evaluation cases are half-true statements — claims that contain accurate elements but mislead through framing or omission. These are linguistically ambiguous: a skilled author can construct a misleading statement using entirely factual components with no observable hedging or pronoun distancing. The psycholinguistic features that Candor measures are features of the writing process, not the factual content. When deception is architectural rather than lexical, the linguistic signals are weaker.

We do not exclude half-true statements from the benchmark. They represent a real and important class of deceptive communication. But users should understand that Candor's performance on short, structurally ambiguous statements will trail its performance on longer, narratively deceptive content.

What Comes Next

The evaluation is running. We add batches incrementally and update the live metrics at /validation as each checkpoint completes. The F1 figure you see there is derived from the actual API — the same API available to paying customers — not a research prototype running in isolation.

Two things will improve the benchmark over time:

More samples. We're at 905 of 8,041. As we complete the corpus, the F1 estimate will stabilize. Early-batch variance is real: a run of easier or harder samples can move the running F1 by ±0.03. The full corpus result will be the most reliable figure.
Domain-specific calibration. The LIAR corpus is political speech. We're building supplemental labeled datasets from insurance claims, marketplace reviews, and legal testimony — domains that better reflect production use cases. Domain-specific tuning should improve F1 materially for customers in those verticals.
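The effect of sample size on the running F1 can be made concrete with a percentile bootstrap, which resamples the evaluated statements with replacement and reports the spread of the resulting F1 values. This is a generic sketch, not the method behind the ±0.03 figure above; the function names, the resample count, and the 95% level are assumptions chosen for illustration.

```python
import random
from typing import Sequence, Tuple


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Binary F1 where 1 is the positive (deceptive) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def bootstrap_f1_interval(
    y_true: Sequence[int],
    y_pred: Sequence[int],
    n_resamples: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for F1 at confidence level 1 - alpha."""
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(y_true)
    stats = []
    for _ in range(n_resamples):
        # Resample statement indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(f1_score([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper
```

Run on the recorded labels and predictions at each checkpoint, an interval like this would narrow as more of the corpus is processed, giving readers a direct view of how settled the running figure is.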

We won't publish a benchmark improvement until it's reflected in the live model. No research-only numbers, no "lab results may vary." The validation page always reflects the current production state.

"The purpose of a validation benchmark is to constrain claims, not to amplify them. A published benchmark you didn't cherry-pick is worth more than a private benchmark you did." — Internal engineering principle, Candor

See the live benchmark results

The validation page publishes live F1, precision, recall, and accuracy as the LIAR evaluation runs. Or try the API free — paste any text and see the score in real time.


References

Wang, W. Y. (2017). "Liar, Liar Pants on Fire": A New Benchmark Dataset for Fake News Detection. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), 422–426.
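Bond, C. F., Jr., & DePaulo, B. M. (2006). Accuracy of deception judgments. Personality and Social Psychology Review, 10(3), 214–234.
DePaulo, B. M., Lindsay, J. J., Malone, B. E., Muhlenbruck, L., Charlton, K., & Cooper, H. (2003). Cues to deception. Psychological Bulletin, 129(1), 74–118.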
Newman, M. L., Pennebaker, J. W., Berry, D. S., & Richards, J. M. (2003). Lying words: Predicting deception from linguistic styles. Personality and Social Psychology Bulletin, 29(5), 665–675.
Pérez-Rosas, V., & Mihalcea, R. (2015). Experiments in open domain deception detection. Proceedings of EMNLP 2015, 1120–1125.
Hancock, J. T., Thom-Santelli, J., & Ritchie, T. (2004). Deception and design: The impact of communication technology on lying behavior. CHI 2004 Proceedings, 130–136.