Reference review

ReAct — does the framework deliver on its reasoning + acting claims?

Does the ReAct framework deliver on its reasoning + acting claims, in light of subsequent CoT, Toolformer, and Reflexion work?

Showcase output. This is a pinned, read-only Thoth review held for reference. It matches what the live agent produces for any user, including the cite_check audit below. cite_check flagged 2 of 8 citations as unsupported (6 supported). Both unsupported entries are intentional examples of the model attaching fabricated facts to real papers, which is the failure mode the audit is designed to surface before the user reads the draft.

Critic score

4.2 / 5

Weighted rubric: 2× faithfulness + completeness + citation quality + clarity.
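
The rubric above can be read as a weighted mean. The sketch below is one plausible reading, not the critic's actual implementation: it assumes each criterion is scored on a 0-5 scale and that the weighted sum is divided by the total weight (5) so the result stays on that scale. The function name, dictionary keys, and sub-scores are hypothetical; the sub-scores behind the 4.2 shown above are not given in this review.

```python
# Assumed weights: faithfulness counts twice, the other three criteria once.
WEIGHTS = {
    "faithfulness": 2.0,
    "completeness": 1.0,
    "citation_quality": 1.0,
    "clarity": 1.0,
}


def critic_score(subscores):
    """Weighted mean of per-criterion scores (each assumed to be on a 0-5 scale)."""
    missing = set(WEIGHTS) - set(subscores)
    if missing:
        raise KeyError(f"missing sub-scores: {sorted(missing)}")
    weighted_sum = sum(WEIGHTS[name] * subscores[name] for name in WEIGHTS)
    # Assumption: normalize by the total weight (5) so the result stays on 0-5.
    return weighted_sum / sum(WEIGHTS.values())


# Hypothetical sub-scores, chosen only to illustrate the arithmetic.
example = {"faithfulness": 4, "completeness": 5, "citation_quality": 4, "clarity": 3}
print(round(critic_score(example), 2))  # (2*4 + 5 + 4 + 3) / 5 = 4.0
```

Doubling the faithfulness weight means a draft with weak citation support cannot fully recover its score through clarity or completeness alone.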

Citation faithfulness

75%

6 of 8 citations supported · 2 unsupported.
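
The 75% figure is the share of audited citations marked supported. A minimal sketch of that arithmetic follows; the function name and the list-of-verdict-strings representation are assumptions made for illustration, not cite_check's real data model.

```python
from collections import Counter


def citation_faithfulness(verdicts):
    """Fraction of citations whose verdict is 'supported'."""
    total = len(verdicts)
    if total == 0:
        raise ValueError("no citations to score")
    counts = Counter(verdicts)
    return counts["supported"] / total


# Verdict counts taken from the audit on this page: 6 supported, 2 unsupported.
audit = ["supported"] * 6 + ["unsupported"] * 2
print(f"{citation_faithfulness(audit):.0%}")  # 6 / 8 -> 75%
```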

ReAct: Synergizing Reasoning and Acting in Language Models [p_react] — supported

"ReAct reduces hallucination on Fever from 0.34 (CoT baseline) to 0.17 when an external Wikipedia tool is available."

The paper's Fever results section reports exactly these two numbers for the CoT-only and ReAct conditions.

Excerpt: We observe a hallucination reduction on Fever from 0.34 (CoT) to 0.17 (ReAct) measured as the fraction of factually unsupported answers.

ReAct: Synergizing Reasoning and Acting in Language Models [p_react] — supported

"ReAct delivers 10-34% absolute gains on ALFWorld over imitation-learning baselines."

The paper's decision-task evaluation reports this range directly.

Excerpt: On ALFWorld and WebShop, ReAct outperforms imitation learning by 10-34% absolute.

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models [p_cot] — supported

"Chain-of-thought reaches 56.9% on GSM8K with an 8-shot prompt on a 540B-parameter PaLM model."

The headline GSM8K result in the abstract matches the cited figure exactly.

Excerpt: On GSM8K, an 8-shot CoT prompt with a 540B-parameter PaLM model achieves 56.9% accuracy compared to 17.9% for standard prompting.

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models [p_cot] — supported

"Chain-of-thought does not improve models smaller than 60B parameters."

The paper explicitly identifies CoT as an emergent capability with this scale threshold.

Excerpt: The reasoning benefits are emergent: models smaller than 60B parameters do not gain from CoT exemplars.

Toolformer: Language Models Can Teach Themselves to Use Tools [p_toolformer] — supported

"A 6.7B-parameter Toolformer outperforms a 175B-parameter baseline on numerical and factual QA."

Matches the paper's headline result on the 6.7B base model.

Excerpt: On a 6.7B-parameter base, Toolformer with five tools outperforms a 175B-parameter baseline on numerical and factual QA.

Reflexion: Language Agents with Verbal Reinforcement Learning [p_reflexion] — supported

"Reflexion-augmented GPT-4 reaches 91% pass@1 on HumanEval vs 80% for ReAct-only."

Both figures appear in the paper's HumanEval results.

Excerpt: On HumanEval, Reflexion-augmented GPT-4 reaches 91% pass@1, up from 80% for ReAct-only.

ReAct: Synergizing Reasoning and Acting in Language Models [p_react] — unsupported

"ReAct was first deployed in production at OpenAI for the ChatGPT browsing plugin in early 2023."

No deployment history, production claim, or vendor reference appears in the cited paper. This sentence is invented context that the draft attaches to the ReAct citation; the paper covers research methodology and benchmarks only.

ReAct: Synergizing Reasoning and Acting in Language Models [p_react] — unsupported

"78% of LangChain users reported using a ReAct-style agent in 2024."

The cited ReAct paper does not survey users, mention LangChain, or report any 2024 adoption statistic. This figure is fabricated — exactly the failure mode cite_check is designed to catch before the user reads the draft.

Draft review

Does the ReAct framework deliver on its reasoning + acting claims?

ReAct ([p_react]) introduced an interleaved thought-and-action prompt template that aimed to reduce hallucination on knowledge-intensive QA while improving multi-step decision performance. The original work reports a hallucination drop on Fever from 0.34 (CoT baseline) to 0.17 (ReAct) when an external Wikipedia tool is available [p_react], and 10-34% absolute gains on ALFWorld over imitation learning baselines [p_react].
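
The interleaved thought-and-action pattern described above can be sketched as a short loop. This is a toy illustration under stated assumptions, not the prompt template or code from [p_react]: the policy is a scripted stand-in for a language-model call, the single `search` tool reads from a hard-coded dictionary, and all names (`react_loop`, `scripted_policy`, `TOY_KB`) are hypothetical.

```python
# Hypothetical one-entry knowledge base standing in for an external tool.
TOY_KB = {"ReAct": "ReAct interleaves reasoning traces with actions."}
TOOLS = {"search": lambda query: TOY_KB.get(query, "no result")}


def scripted_policy(transcript):
    """Stand-in for an LLM: returns (thought, action, argument) for the next step."""
    prefix = "Observation: "
    observations = [line for line in transcript if line.startswith(prefix)]
    if not observations:
        return "I should look this up.", "search", "ReAct"
    answer = observations[-1][len(prefix):]
    return "The observation answers the question.", "finish", answer


def react_loop(policy, tools, question, max_steps=5):
    """Alternate thought -> action -> observation until the policy finishes."""
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        thought, action, argument = policy(transcript)
        transcript.append(f"Thought: {thought}")
        transcript.append(f"Action: {action}[{argument}]")
        if action == "finish":
            return argument, transcript
        tool = tools.get(action)
        observation = tool(argument) if tool else f"unknown action {action!r}"
        # The observation is appended so the next thought can condition on it.
        transcript.append(f"Observation: {observation}")
    return None, transcript  # step budget exhausted without an answer


answer, transcript = react_loop(scripted_policy, TOOLS, "What does ReAct do?")
print("\n".join(transcript))
```

The point of the structure is that the final answer is copied from a tool observation rather than generated from the model's memory, which is the grounding mechanism the hallucination claim rests on.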

The reasoning side of ReAct rides on chain-of-thought ([p_cot]). CoT itself produces 56.9% accuracy on GSM8K with an 8-shot prompt on a 540B-parameter model [p_cot], up from 17.9% with standard prompting [p_cot]. Notably, CoT prompting does not improve performance on models smaller than 60B parameters [p_cot]: the emergence of reasoning is scale-dependent, which bounds ReAct's applicability on smaller checkpoints. CoT also does not address factual grounding, only step-level reasoning accuracy [p_cot].

Subsequent work has extended the tool-using-agent line. Toolformer ([p_toolformer]) shows that a 6.7B-parameter model can learn, in a self-supervised way, to invoke calculator, search, translation, calendar, and Wikipedia APIs, and that it outperforms a 175B-parameter baseline on numerical and factual QA [p_toolformer]. Reflexion ([p_reflexion]) adds an outer self-critique loop on top of ReAct, reaching 91% pass@1 on HumanEval (vs 80% for ReAct-only) [p_reflexion] and +20 points absolute on ALFWorld [p_reflexion].
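
The outer self-critique loop attributed to Reflexion can be sketched as a retry loop that carries verbal notes between trials. This is a rough illustration under assumptions, not the algorithm as specified in [p_reflexion]: `attempt` stands in for one inner agent episode (for example a ReAct run), `evaluate` for a unit-test style check, and `reflect` for a model-written critique. All three toy functions are hypothetical and scripted.

```python
def reflexion_loop(attempt, evaluate, reflect, max_trials=3):
    """Retry a task, feeding verbal reflections on failures into later trials."""
    memory = []  # natural-language notes carried across trials
    for trial in range(1, max_trials + 1):
        output = attempt(memory)
        passed, feedback = evaluate(output)
        if passed:
            return output, trial, memory
        memory.append(reflect(feedback))
    return None, max_trials, memory


def toy_attempt(memory):
    """Scripted actor: produces a buggy 'add' until a note mentions the bug."""
    if any("subtract" in note for note in memory):
        return lambda a, b: a + b
    return lambda a, b: a - b


def toy_evaluate(candidate):
    """Scripted evaluator: a single unit test on the candidate function."""
    got = candidate(2, 3)
    if got == 5:
        return True, ""
    return False, f"add(2, 3) returned {got}, expected 5"


def toy_reflect(feedback):
    """Scripted self-critique turning test feedback into a note for the next trial."""
    return f"Last attempt failed: {feedback}. It seems to subtract instead of add."


solution, trials_used, notes = reflexion_loop(toy_attempt, toy_evaluate, toy_reflect)
print(trials_used, notes)
```

In this toy run the first trial fails its test, the reflection is stored, and the second trial succeeds. That mirrors the kind of task-completion gain reported on HumanEval, but it does not by itself say anything about factuality.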

The follow-up evidence broadly supports ReAct's central thesis — that grounding reasoning in tool-mediated actions reduces hallucination and improves decision-task performance — while sharpening the conditions: scale-emergent reasoning, tool-trajectory quality dependence, and a benefit gradient that compounds when self-critique is layered on top. ReAct itself was first deployed in production at OpenAI for the ChatGPT browsing plugin in early 2023 [p_react]. Industry adoption of agentic patterns has tracked the literature: 78% of LangChain users reported using a ReAct-style agent in 2024 [p_react].

Open questions: ReAct's robustness on action spaces larger than 30 verbs (flagged in the original limitations) is still under-studied. Whether the Reflexion-style outer loop genuinely improves factuality on knowledge-intensive QA, rather than only the task-completion rate, remains untested.

References

[p_react] ReAct: Synergizing Reasoning and Acting in Language Models
[p_cot] Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
[p_toolformer] Toolformer: Language Models Can Teach Themselves to Use Tools
[p_reflexion] Reflexion: Language Agents with Verbal Reinforcement Learning