EngChain: A Symbolic Benchmark for Verifiable Multi-Step Reasoning in Engineering

Abstract

While current benchmarks assess language understanding, factual recall, mathematics, or code generation, none capture the integrative reasoning central to engineering, where scientific principles, quantitative modeling, and practical constraints must converge. To address this gap, we introduce EngChain, a benchmark for verifiable multi-step engineering problem-solving. EngChain contains 90 problem types spanning three engineering branches, organized into 9 domains and 20 distinct areas. Problems are generated from symbolic templates with a high degree of randomization to ensure diversity and eliminate the risk of contamination. The benchmark moves beyond final-answer accuracy with a two-stage evaluation: it first quantitatively verifies the numerical and semantic validity of each reasoning step and then applies LLM-as-Judge, an automated system that qualitatively categorizes any identified reasoning errors.


Figure 1: EngChain's taxonomy covers 90 distinct problem types across 9 domains and 3 engineering branches.


Overview and Generation Pipeline

The benchmark is built upon a comprehensive taxonomy of 90 distinct problem templates across 9 core engineering domains and 3 engineering branches. This structure is designed to systematically evaluate reasoning by scaling difficulty across three axes: Conceptual Complexity, Mathematical Sophistication, and Procedural Depth. Problems are generated from symbolic templates with domain-aware parameterization to ensure physical realism (e.g., using real material properties for "6061-T6 Aluminum"). The benchmark features a Two-Stage Verifiable Evaluation that moves beyond final-answer accuracy to check the numerical and semantic validity of each reasoning step.
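For concreteness, the following is a minimal Python sketch of what a domain-aware symbolic template could look like. The template itself (an axial-stress problem), the parameter ranges, and the material table are illustrative assumptions rather than EngChain's actual implementation; only the idea of pairing randomized inputs with real handbook properties (e.g., 6061-T6 Aluminum) reflects the benchmark's design.

```python
import math
import random

# Approximate handbook values for real materials (illustrative, not EngChain's data files).
MATERIALS = {
    "6061-T6 Aluminum": {"yield_strength_MPa": 276.0, "density_kg_m3": 2700.0},
    "ASTM A36 Steel":   {"yield_strength_MPa": 250.0, "density_kg_m3": 7850.0},
}

def generate_axial_stress_instance(rng: random.Random) -> dict:
    """Sample one problem instance from a hypothetical symbolic template.

    Numeric inputs are randomized within physically realistic ranges, while
    material properties are pinned to real handbook values, so every
    generated instance stays plausible and no two instances repeat exactly.
    """
    material = rng.choice(list(MATERIALS))
    props = MATERIALS[material]
    load_kN = round(rng.uniform(5.0, 50.0), 1)       # applied axial load
    diameter_mm = rng.choice([10, 12, 16, 20, 25])   # standard bar diameters

    area_mm2 = math.pi / 4.0 * diameter_mm ** 2
    stress_MPa = load_kN * 1e3 / area_mm2            # N / mm^2 == MPa
    safety_factor = props["yield_strength_MPa"] / stress_MPa

    question = (
        f"A solid {material} bar, {diameter_mm} mm in diameter, carries an axial load "
        f"of {load_kN} kN. Determine the normal stress and the factor of safety "
        f"against yielding."
    )
    # Ground-truth chain: each step pairs an explanation with its intermediate value.
    steps = [
        ("Cross-sectional area A = pi * d^2 / 4", area_mm2),
        ("Normal stress sigma = P / A", stress_MPa),
        ("Factor of safety n = S_y / sigma", safety_factor),
    ]
    return {"question": question, "steps": steps, "final_answer": safety_factor}

print(generate_axial_stress_instance(random.Random(0))["question"])
```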

To ensure correctness, every template underwent an AI-Assisted Quality Assurance (QA) workflow. This process employed an LLM as a 'Peer Reviewer' to evaluate each template's code and example outputs against a multi-axis rubric covering physical plausibility and mathematical correctness. Any template scoring below the automated threshold of 4/5 was flagged for human review by the authors, who possess domain expertise. This AI-first, human-in-the-loop process proved highly efficient: only 5.56% of templates (5 out of 90) required manual inspection and correction.
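A minimal sketch of the flagging rule, assuming rubric scores on a 1-5 scale averaged over illustrative axes (the axis names and the aggregation are our assumptions; the below-4/5 threshold is the one described above):

```python
from statistics import mean

# Illustrative rubric axes; the paper's exact rubric may differ.
RUBRIC_AXES = ("physical_plausibility", "mathematical_correctness",
               "unit_consistency", "clarity")

def needs_human_review(rubric_scores: dict[str, float], threshold: float = 4.0) -> bool:
    """Flag a template for expert review when its averaged LLM 'Peer Reviewer'
    score falls below the automated threshold (score < 4 out of 5)."""
    return mean(rubric_scores[axis] for axis in RUBRIC_AXES) < threshold

# Example: one weak axis pulls the average under 4/5, so the template is flagged.
scores = {"physical_plausibility": 3, "mathematical_correctness": 5,
          "unit_consistency": 4, "clarity": 3}
print(needs_human_review(scores))  # True -> route to a domain-expert author
```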

Figure 2: Example of an EngChain symbolic template (CSTR Volume Calculation).
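As a rough illustration of the stepwise ground truth such a template encodes, the sketch below sizes an isothermal CSTR from the standard design equation V = F_A0 * X / (-r_A), assuming first-order kinetics. The kinetics and the numbers are illustrative and are not taken from the actual EngChain template.

```python
def cstr_volume_steps(v0_m3_s: float, CA0_mol_m3: float, k_per_s: float, X: float):
    """Work a CSTR sizing problem step by step (first-order, liquid phase).

    Design equation: V = F_A0 * X / (-r_A), evaluated at exit conditions,
    with -r_A = k * C_A for a first-order reaction A -> products.
    """
    steps = []

    FA0 = v0_m3_s * CA0_mol_m3                      # inlet molar flow of A, mol/s
    steps.append(("Inlet molar flow F_A0 = v0 * C_A0", FA0))

    CA_exit = CA0_mol_m3 * (1.0 - X)                # exit concentration, mol/m^3
    steps.append(("Exit concentration C_A = C_A0 * (1 - X)", CA_exit))

    rate_exit = k_per_s * CA_exit                   # -r_A at exit, mol/(m^3*s)
    steps.append(("Exit rate -r_A = k * C_A", rate_exit))

    V = FA0 * X / rate_exit                         # reactor volume, m^3
    steps.append(("CSTR design equation V = F_A0 * X / (-r_A)", V))
    return steps

# Illustrative numbers: 2 L/s feed, 100 mol/m^3 of A, k = 0.05 1/s, 80% conversion.
for explanation, value in cstr_volume_steps(0.002, 100.0, 0.05, 0.80):
    print(f"{explanation}: {value:.4g}")
```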


A Two-Stage Verifiable Evaluation Framework

To assess the entire reasoning process, we move beyond final-answer accuracy and propose a multi-faceted framework. The first stage is a quantitative verification that computes a Step F1-Score for procedural correctness. This process compares each predicted reasoning step against every ground-truth step. A step is only considered a "match" if it simultaneously satisfies two criteria: high semantic similarity (measured via a Cross-Encoder model) and numerical correctness (within a 2% relative error tolerance for intermediate values). This dual-criteria check ensures that a step with a plausible explanation but a flawed calculation is correctly identified as an error.
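A minimal sketch of this matching and scoring logic, assuming the sentence-transformers CrossEncoder interface, an illustrative similarity cutoff, and greedy one-to-one matching (only the 2% relative-error tolerance comes from the benchmark description above):

```python
from sentence_transformers import CrossEncoder  # pip install sentence-transformers

# Illustrative choices: any STS cross-encoder and cutoff could be substituted.
similarity_model = CrossEncoder("cross-encoder/stsb-roberta-base")
SIM_THRESHOLD = 0.7          # assumed semantic-similarity cutoff
REL_TOL = 0.02               # 2% relative-error tolerance on intermediate values

def steps_match(pred: tuple[str, float], gold: tuple[str, float]) -> bool:
    """A predicted step matches a ground-truth step only if it is both
    semantically similar AND numerically correct."""
    (pred_text, pred_val), (gold_text, gold_val) = pred, gold
    semantic_ok = similarity_model.predict([(pred_text, gold_text)])[0] >= SIM_THRESHOLD
    numeric_ok = abs(pred_val - gold_val) <= REL_TOL * abs(gold_val)
    return semantic_ok and numeric_ok

def step_f1(pred_steps: list, gold_steps: list) -> float:
    """Greedily match predicted steps to ground-truth steps one-to-one,
    then compute the usual F1 over matched steps."""
    unmatched_gold = list(gold_steps)
    matches = 0
    for p in pred_steps:
        for g in unmatched_gold:
            if steps_match(p, g):
                matches += 1
                unmatched_gold.remove(g)
                break
    if matches == 0:
        return 0.0
    precision = matches / len(pred_steps)
    recall = matches / len(gold_steps)
    return 2 * precision * recall / (precision + recall)
```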

This quantitative score, however, only identifies that a reasoning chain is flawed, not why. To diagnose failures, we introduce a second, qualitative stage: an automated error-analysis system called LLM-as-Judge. When a step is flagged as a mismatch, this system uses an expert-persona LLM to classify the failure into categories such as Conceptual Error, Calculation Error, or Input Error. Critically, this analysis revealed that the vast majority (73.94%) of steps flagged as errors by the rigid F1-score were actually Alternative Correct solutions: valid reasoning paths that simply differed from the single ground-truth solution. This finding demonstrates that our two-stage evaluation is essential for fairly assessing model capabilities and for avoiding penalizing valid, creative problem-solving.
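A model-agnostic sketch of the judging call, assuming a hypothetical call_llm(prompt) -> str helper and the categories named above plus Alternative Correct; EngChain's actual judge prompt and response parsing may differ.

```python
ERROR_CATEGORIES = (
    "Conceptual Error", "Calculation Error", "Input Error", "Alternative Correct",
)

JUDGE_PROMPT = """You are a senior {branch} engineer reviewing a student's solution step.

Problem: {problem}
Ground-truth step: {gold_step}
Predicted step:    {pred_step}

Classify the predicted step as exactly one of: {categories}.
Answer with the category name only."""

def judge_step(call_llm, branch, problem, gold_step, pred_step):
    """Classify a mismatched reasoning step with an expert-persona LLM.

    `call_llm(prompt) -> str` is a hypothetical helper wrapping whatever
    chat-completion API is available.
    """
    prompt = JUDGE_PROMPT.format(
        branch=branch, problem=problem, gold_step=gold_step,
        pred_step=pred_step, categories=", ".join(ERROR_CATEGORIES),
    )
    reply = call_llm(prompt).strip()
    # Fall back to a generic bucket if the judge answers off-format.
    return reply if reply in ERROR_CATEGORIES else "Unclassified"
```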


Results

Our primary finding from evaluating 11 frontier models is a stark and universal divergence between final-answer accuracy and the correctness of the reasoning process. While top models such as GPT-4.1 Mini and GPT-5 Mini achieve a modest Final Answer Accuracy of ~63.1%, their procedural reasoning is critically low: the best reasoning model, Gemini 2.5 Flash, achieves a Step F1-Score of only 19.32%. This confirms a widespread "right for the wrong reasons" phenomenon, in which models excel at generating fluent, semantically appropriate text (averaging ~0.87 BERTScore) that effectively masks flawed underlying logic.

Overall Model Performance

Table 1: Zero-shot performance of 11 frontier models on the EngChain benchmark (N=1350).

Branch-Level Performance

Figure 3: Branch-level reasoning performance across Chemical, Electrical, and Mechanical Engineering. Chemical Engineering is consistently the most challenging branch.

Domain-Level Performance

Figure 4: Domain-level performance across the 9 engineering domains. The 'spiky' profile indicates specialized, not generalized, reasoning.

Performance Across Difficulty Levels

Figure 5: Reasoning performance across Easy, Intermediate, and Advanced difficulty levels. Procedural reasoning remains critically low across all complexity levels.