Thoth

Reference review

ReAct — does the framework deliver on its reasoning + acting claims?

Does the ReAct framework deliver on its reasoning + acting claims, in light of subsequent CoT, Toolformer, and Reflexion work?

Showcase output. This is a pinned, read-only Thoth review held for reference. It is exactly what the live agent produces for any user, complete with the cite_check audit below. cite_check flagged 2 of 8 citations as unsupported and confirmed the other 6. Both unsupported entries are intentional examples of the model attaching fabricated facts to real papers, the exact failure mode the audit is designed to surface before the user reads the draft.

Critic score

4.2 / 5

Weighted rubric: 2× faithfulness + completeness + citation quality + clarity.
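
One way to read the rubric line above is as a weighted mean in which faithfulness counts twice. The sketch below assumes each criterion is scored on a 0-5 scale and that the weighted sum is divided by the total weight (5). Neither assumption is stated on this page, and `critic_score`, `WEIGHTS`, and the sample sub-scores are hypothetical.

```python
# Hypothetical weights read off the rubric line: faithfulness counts double.
WEIGHTS = {
    "faithfulness": 2.0,
    "completeness": 1.0,
    "citation_quality": 1.0,
    "clarity": 1.0,
}


def critic_score(scores: dict) -> float:
    """Weighted mean of per-criterion scores (assumed 0-5 scale)."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    total = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    return total / sum(WEIGHTS.values())


# Illustrative sub-scores only; the page does not publish a per-criterion breakdown.
example = critic_score(
    {"faithfulness": 3.0, "completeness": 4.0, "citation_quality": 4.0, "clarity": 4.0}
)
print(round(example, 2))
```

Dividing by the total weight keeps the result on the same 0-5 scale as the inputs, so a weak faithfulness score pulls the total down twice as hard as a weak score on any other criterion.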

Citation faithfulness

75%

6 of 8 citations supported · 2 unsupported.
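
The percentage above is the share of citations that the audit marked as supported. A minimal sketch of that arithmetic follows. The verdict labels `"supported"` and `"unsupported"` and the function name `citation_faithfulness` are assumptions for illustration, not cite_check's actual output format.

```python
def citation_faithfulness(verdicts: list) -> float:
    """Fraction of citation verdicts equal to 'supported'."""
    if not verdicts:
        raise ValueError("no citations to score")
    supported = sum(1 for verdict in verdicts if verdict == "supported")
    return supported / len(verdicts)


# The counts reported on this page: 6 supported, 2 unsupported.
verdicts = ["supported"] * 6 + ["unsupported"] * 2
rate = citation_faithfulness(verdicts)
print(f"{rate:.0%}")
```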

Draft review

Does the ReAct framework deliver on its reasoning + acting claims?

ReAct ([p_react]) introduced an interleaved thought-and-action prompt template that aimed to reduce hallucination on knowledge-intensive QA while improving multi-step decision performance. The original work reports a hallucination drop on FEVER from 0.34 (CoT baseline) to 0.17 (ReAct) when an external Wikipedia tool is available [p_react], and 10-34% absolute gains on ALFWorld over imitation learning baselines [p_react].
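
The interleaved thought-and-action pattern described above can be sketched as a short loop. This is a simplified illustration, not the paper's implementation: the `Action: name[argument]` line format, the `finish` action, the `react_loop` function, and the scripted stand-in for the language model are all assumptions made so the example runs without a real model or search tool.

```python
import re
from typing import Callable, Dict, Optional

# Assumed action syntax for this sketch: a line of the form "Action: name[argument]".
ACTION_RE = re.compile(r"^Action:\s*(\w+)\[(.*)\]\s*$", re.MULTILINE)


def react_loop(
    question: str,
    llm: Callable[[str], str],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> Optional[str]:
    """Alternate model steps (thought + action) with tool observations."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)  # model emits a thought and, usually, an action
        transcript += step + "\n"
        match = ACTION_RE.search(step)
        if match is None:
            continue  # thought-only step: let the model continue reasoning
        name, argument = match.group(1), match.group(2)
        if name == "finish":
            return argument  # the model declares its final answer
        tool = tools.get(name)
        observation = tool(argument) if tool else f"unknown tool: {name}"
        # Feed the tool result back so the next thought is grounded in it.
        transcript += f"Observation: {observation}\n"
    return None  # step budget exhausted without a final answer


def make_scripted_llm(steps):
    """Hypothetical stand-in for a model: replays canned steps in order."""
    remaining = iter(steps)
    return lambda transcript: next(remaining)


scripted = make_scripted_llm([
    "Thought: I should look up the capital.\nAction: search[capital of France]",
    "Thought: The observation answers the question.\nAction: finish[Paris]",
])
tools = {"search": lambda query: "Paris is the capital of France."}
answer = react_loop("What is the capital of France?", scripted, tools)
print(answer)
```

The point of the structure is that each observation is appended to the transcript before the next model call, which is what lets later reasoning steps depend on retrieved evidence rather than on the model's recall alone.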

The reasoning side of ReAct rides on chain-of-thought ([p_cot]). CoT itself produces 56.9% accuracy on GSM8K with an 8-shot prompt on a 540B-parameter model [p_cot], up from 17.9% standard prompting [p_cot]. Notably, CoT prompting does not improve performance on models smaller than 60B parameters [p_cot] — the reasoning emergence is a scale-dependent phenomenon that bounds ReAct's applicability on smaller checkpoints. CoT also does not address factual grounding, only step-level reasoning accuracy [p_cot].

Subsequent work has extended the tool-using-agent line. Toolformer ([p_toolformer]) shows that a 6.7B-parameter model can learn — self-supervised — to invoke calculator, search, translation, calendar, and Wikipedia APIs, and outperforms a 175B baseline on factual QA [p_toolformer]. Reflexion ([p_reflexion]) adds an outer self-critique loop on top of ReAct, reaching 91% pass@1 on HumanEval (vs 80% for ReAct-alone) [p_reflexion] and +20 absolute on ALFWorld [p_reflexion].
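
The outer self-critique loop attributed to Reflexion above can be sketched as a retry loop that carries verbal feedback from one trial to the next. This is a toy illustration under stated assumptions: `reflexion_loop`, `run_trial`, `evaluate`, and `reflect` are hypothetical names, and the stubs stand in for an agent, a task evaluator, and a model-written reflection.

```python
from typing import Callable, List, Optional, Tuple


def reflexion_loop(
    run_trial: Callable[[List[str]], str],
    evaluate: Callable[[str], bool],
    reflect: Callable[[str], str],
    max_trials: int = 3,
) -> Tuple[Optional[str], List[str]]:
    """Retry a task, passing accumulated reflections into each new trial."""
    reflections: List[str] = []
    for _ in range(max_trials):
        attempt = run_trial(reflections)  # inner agent, e.g. a ReAct-style loop
        if evaluate(attempt):
            return attempt, reflections
        # On failure, store a verbal critique for the next trial to read.
        reflections.append(reflect(attempt))
    return None, reflections


# Toy stubs: the "agent" only gets the answer right once it has a reflection.
def run_trial(reflections: List[str]) -> str:
    return "4" if reflections else "5"


def evaluate(attempt: str) -> bool:
    return attempt == "4"


def reflect(attempt: str) -> str:
    return f"Answer {attempt} was wrong; recheck the arithmetic."


result, notes = reflexion_loop(run_trial, evaluate, reflect)
print(result, len(notes))
```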

The follow-up evidence broadly supports ReAct's central thesis — that grounding reasoning in tool-mediated actions reduces hallucination and improves decision-task performance — while sharpening the conditions: scale-emergent reasoning, tool-trajectory quality dependence, and a benefit gradient that compounds when self-critique is layered on top. ReAct itself was first deployed in production at OpenAI for the ChatGPT browsing plugin in early 2023 [p_react]. Industry adoption of agentic patterns has tracked the literature: 78% of LangChain users reported using a ReAct-style agent in 2024 [p_react].

Open questions: ReAct's robustness on action spaces > 30 verbs (flagged in the original limitations) is still under-studied. Whether the Reflexion-style outer loop genuinely improves factuality (vs only task-completion rate) on knowledge-intensive QA remains untested.

References

  • [p_react] ReAct: Synergizing Reasoning and Acting in Language Models
  • [p_cot] Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
  • [p_toolformer] Toolformer: Language Models Can Teach Themselves to Use Tools
  • [p_reflexion] Reflexion: Language Agents with Verbal Reinforcement Learning