ReAct: Synergizing Reasoning and Acting in Language Models [p_react] — supported
"ReAct reduces hallucination on Fever from 0.34 (CoT baseline) to 0.17 when an external Wikipedia tool is available."
The paper's FEVER results section reports exactly these two numbers for the CoT-only and ReAct conditions.
Excerpt: We observe a hallucination reduction on Fever from 0.34 (CoT) to 0.17 (ReAct) measured as the fraction of factually unsupported answers.
ReAct: Synergizing Reasoning and Acting in Language Models [p_react] — supported
"ReAct delivers 10-34% absolute gains on ALFWorld over imitation-learning baselines."
The paper's decision-task evaluation reports this range directly, although the excerpt states it for ALFWorld and WebShop together rather than for ALFWorld alone.
Excerpt: On ALFWorld and WebShop, ReAct outperforms imitation learning by 10-34% absolute.
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models [p_cot] — supported
"Chain-of-thought reaches 56.9% on GSM8K with an 8-shot prompt on a 540B-parameter PaLM model."
The headline GSM8K result in the abstract matches the cited figure exactly.
Excerpt: On GSM8K, an 8-shot CoT prompt with a 540B-parameter PaLM model achieves 56.9% accuracy compared to 17.9% for standard prompting.
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models [p_cot] — supported
"Chain-of-thought does not improve models smaller than 60B parameters."
The paper explicitly identifies CoT as an emergent capability with this scale threshold.
Excerpt: The reasoning benefits are emergent: models smaller than 60B parameters do not gain from CoT exemplars.
Toolformer: Language Models Can Teach Themselves to Use Tools [p_toolformer] — supported
"A 6.7B-parameter Toolformer outperforms a 175B-parameter baseline on numerical and factual QA."
The claim matches the paper's headline result for the 6.7B base model.
Excerpt: On a 6.7B-parameter base, Toolformer with five tools outperforms a 175B-parameter baseline on numerical and factual QA.
Reflexion: Language Agents with Verbal Reinforcement Learning [p_reflexion] — supported
"Reflexion-augmented GPT-4 reaches 91% pass@1 on HumanEval vs 80% for ReAct-only."
Both figures appear in the paper's HumanEval results.
Excerpt: On HumanEval, Reflexion-augmented GPT-4 reaches 91% pass@1, up from 80% for ReAct-only.
ReAct: Synergizing Reasoning and Acting in Language Models [p_react] — unsupported
"ReAct was first deployed in production at OpenAI for the ChatGPT browsing plugin in early 2023."
No deployment history, production claim, or vendor reference appears in the cited paper. This sentence is invented context that the draft attaches to the ReAct citation; the paper covers research methodology and benchmarks only.
ReAct: Synergizing Reasoning and Acting in Language Models [p_react] — unsupported
"78% of LangChain users reported using a ReAct-style agent in 2024."
The cited ReAct paper does not survey users, mention LangChain, or report any 2024 adoption statistic. The figure has no basis in the cited source, which is exactly the failure mode cite_check is designed to catch before the user reads the draft.
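One way to picture the check behind the verdicts above is a numeric-grounding test: every number in a draft claim should also appear in the excerpt taken from the cited source. The sketch below is illustrative only and is not cite_check's actual implementation, which this report does not describe. The names `numbers_in` and `numbers_grounded` are hypothetical, and an unsupported claim is modeled as having an empty excerpt.

```python
from __future__ import annotations

import re

# Matches integers and decimals such as "78", "0.34", or the "540" in "540B".
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def numbers_in(text: str) -> set[str]:
    """Return the set of numeric tokens found in the text."""
    return set(NUMBER.findall(text))


def numbers_grounded(claim: str, excerpt: str) -> bool:
    """True when every number in the claim also appears in the excerpt.

    A claim with no numbers passes trivially, so this can only flag
    numeric mismatches; it says nothing about non-numeric assertions.
    """
    return numbers_in(claim) <= numbers_in(excerpt)


fever_claim = (
    "ReAct reduces hallucination on Fever from 0.34 (CoT baseline) "
    "to 0.17 when an external Wikipedia tool is available."
)
fever_excerpt = (
    "We observe a hallucination reduction on Fever from 0.34 (CoT) to 0.17 "
    "(ReAct) measured as the fraction of factually unsupported answers."
)
langchain_claim = "78% of LangChain users reported using a ReAct-style agent in 2024."

print(numbers_grounded(fever_claim, fever_excerpt))  # True
print(numbers_grounded(langchain_claim, ""))         # False: no excerpt backs 78 or 2024
```

A check like this is only a first filter. It would pass the ALFWorld claim because the digits match, even though the excerpt attaches the range to two benchmarks, so the written rationale for each entry is still needed.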