Planning in lexical-prior-free environments presents a fundamental challenge for evaluating whether large language models (LLMs) possess genuine structural reasoning capabilities beyond lexical memorization. When predicates and action names are replaced with semantically irrelevant random symbols while the logical structure is preserved, existing direct-generation approaches exhibit severe performance degradation. This paper proposes a symbol-agnostic closed-loop planning pipeline that enables models to construct executable plans through systematic validation and iterative refinement. The system implements a complete generate-verify-repair cycle through six core processing components: semantic comprehension extracts structural constraints, a language planner generates text plans, a symbol translator performs structure-preserving mapping, a consistency checker conducts static screening, a Stanford Research Institute Problem Solver (STRIPS) simulator executes step-by-step validation, and VAL (Validator) provides semantic verification. A repair controller orchestrates four targeted strategies addressing typical failure patterns, including first-step precondition errors and mid-segment state-maintenance issues. Comprehensive evaluation on PlanBench Mystery Blocksworld demonstrates substantial improvements over baseline approaches across both language models and reasoning models. Ablation studies confirm that each architectural component contributes non-redundantly to overall effectiveness, with targeted repair providing the largest impact, followed by deep constraint extraction and stepwise validation; superior performance thus emerges from the synergistic integration of these mechanisms rather than from any single dominant factor. Analysis reveals distinct failure patterns between model types: language models struggle with local precondition satisfaction, whereas reasoning models face global goal-achievement challenges. The validation-driven mechanism nevertheless addresses both classes of weakness. A particularly noteworthy finding is the convergence of final success rates across models with varying intrinsic capabilities, suggesting that systematic validation and repair mechanisms play a more decisive role than raw model capacity in lexical-prior-free scenarios. This work establishes a rigorous evaluation framework incorporating statistical significance testing and mechanistic failure analysis, providing methodological contributions for fair assessment and practical insights into building reliable planning systems under extreme constraint conditions.
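The step-by-step STRIPS validation described in the abstract can be illustrated with a minimal sketch. The set-based state representation, the `Action` fields, and the symbol names below are illustrative assumptions, not taken from the paper; predicate and action names are deliberately meaningless tokens, mirroring the lexical-prior-free setting of Mystery Blocksworld:

```python
# Minimal STRIPS-style stepwise plan validation (illustrative sketch only).
# An action is modeled as preconditions, add effects, and delete effects
# over a set of ground facts.

from typing import NamedTuple, Optional

class Action(NamedTuple):
    name: str
    pre: frozenset      # facts that must hold before the action executes
    add: frozenset      # facts added by the action
    delete: frozenset   # facts removed by the action

def validate_plan(state: set, plan: list, goal: set) -> tuple[bool, Optional[int]]:
    """Simulate a plan step by step; return (success, index of failed step)."""
    for i, act in enumerate(plan):
        if not act.pre <= state:            # precondition failure at step i
            return False, i
        state = (state - act.delete) | act.add
    return goal <= state, None              # goal check after the final step

# Hypothetical opaque symbols, as in the lexical-prior-free benchmark.
a1 = Action("op7", frozenset({"p1"}), frozenset({"p2"}), frozenset({"p1"}))
a2 = Action("op3", frozenset({"p2"}), frozenset({"p3"}), frozenset())

ok, failed_step = validate_plan({"p1"}, [a1, a2], {"p3"})
print(ok, failed_step)  # True None
```

A check of this kind can localize the failure (e.g. a first-step precondition error versus a goal miss after the final step), which is the signal a repair controller needs to choose a targeted repair strategy.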
Funding: supported by the Information, Production and Systems Research Center, Waseda University; partly supported by the Future Robotics Organization, Waseda University, and the Humanoid Robotics Institute, Waseda University, under the Humanoid Project; and by the Waseda University Grant for Special Research Projects (grant numbers 2024C-518 and 2025E-027). This work was partly executed under the cooperation of organization between Kioxia Corporation and Waseda University.