Title: Assessing AI-Generated Legal Reasoning: A Benchmark for Legal Text Quality from Literature Review
Authors: Prince Tritto, Philippe; Ponce, Hiram
Date deposited: 2025-10-29
Date issued: 2025
Citation: Prince Tritto, P., Ponce, H. (2025). Assessing AI-Generated Legal Reasoning: A Benchmark for Legal Text Quality from Literature Review. In: Martínez-Villaseñor, L., Martínez-Seis, B., Pichardo, O. (eds) Artificial Intelligence – COMIA 2025. COMIA 2025. Communications in Computer and Information Science, vol 2552. Springer, Cham. https://doi.org/10.1007/978-3-031-97907-1_5
ISBN: 9783031979064; 9783031979071
URI: https://scripta.up.edu.mx/handle/20.500.12552/12538
DOI: 10.1007/978-3-031-97907-1_5
Abstract: The adoption of Large Language Models in law has sparked debate over how best to evaluate AI-generated legal reasoning. Existing benchmarks focus on surface-level accuracy, overlooking deeper dimensions such as argumentative coherence, practical usability, and alignment with jurisprudential values. This paper provides a comprehensive framework that integrates insights from formalism, interpretivism, realism, and argumentation theory to assess legal AI outputs. We first explore the philosophical foundations of legal reasoning, drawing on MacCormick's concepts of internal and external justification and Perelman's notions of audience-centered persuasion to highlight the rhetorical and moral dimensions essential for evaluation. Next, we examine structured approaches to evaluation from related fields before showing why existing benchmarks (e.g., LexGLUE, LegalBench, LegalAgentBench) only partially capture the subtleties of legal reasoning. We also contrast common law and civil law traditions to illustrate how a one-size-fits-all approach neglects the distinct roles of precedent versus codified statutes. Building on these theoretical and comparative insights, we propose a three-stage evaluation methodology that begins with automated screening for factual consistency, proceeds to expert-led rubric assessment across five dimensions (Accuracy, Reasoning, Clarity, Usefulness, and Safety), and concludes with iterative refinement through reliability checks. This structured approach, validated through a pilot study, aims to strike a balance between scalability and nuance, equipping researchers and practitioners with a robust tool for assessing AI-generated legal texts. Unifying theoretical rigor, domain-specific practicality, and cross-jurisdictional adaptability, this framework lays a solid foundation for legal AI benchmarks and paves the way for safer, more transparent deployment of AI in law. ©The authors ©Springer.
Language: en
Rights: Restricted access
Keywords: Legal AI Benchmarking; AI-Generated Legal Texts; Legal Reasoning Evaluation; Legal NLP Metrics; Argumentative Coherence in AI; Legal Text Quality Assessment; Human-in-the-Loop AI Evaluation
Type: Book part
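The abstract outlines a three-stage evaluation methodology: automated screening for factual consistency, expert rubric scoring across five dimensions, and reliability checks. The following is a minimal illustrative sketch of how such a pipeline could be wired together; it is not taken from the paper, and the threshold, 1-5 scoring scale, data structures, and pairwise-agreement proxy (standing in for a formal reliability statistic such as Krippendorff's alpha) are all assumptions for illustration.

```python
"""Illustrative sketch (not the authors' implementation) of a three-stage
legal-text evaluation pipeline: (1) automated factual-consistency screening,
(2) expert rubric scoring across five dimensions, (3) a reliability check.
All thresholds, scales, and field names below are assumptions."""

from dataclasses import dataclass
from itertools import combinations
from statistics import mean

# The five rubric dimensions named in the abstract.
DIMENSIONS = ["Accuracy", "Reasoning", "Clarity", "Usefulness", "Safety"]


@dataclass
class RubricScores:
    rater: str
    scores: dict[str, int]  # dimension -> score on an assumed 1-5 scale


def passes_screening(consistency_score: float, threshold: float = 0.8) -> bool:
    """Stage 1: gate an AI-generated text on an automated factual-consistency
    score in [0, 1]. The 0.8 threshold is a placeholder."""
    return consistency_score >= threshold


def aggregate_rubric(ratings: list[RubricScores]) -> dict[str, float]:
    """Stage 2: average expert scores per rubric dimension."""
    return {d: mean(r.scores[d] for r in ratings) for d in DIMENSIONS}


def pairwise_agreement(ratings: list[RubricScores], tolerance: int = 1) -> float:
    """Stage 3: crude reliability proxy -- fraction of rater pairs whose scores
    fall within `tolerance` points, pooled over dimensions. A real study would
    use a chance-corrected statistic (e.g., Krippendorff's alpha)."""
    pairs = list(combinations(ratings, 2))
    if not pairs:
        return 1.0
    hits = [
        abs(a.scores[d] - b.scores[d]) <= tolerance
        for a, b in pairs
        for d in DIMENSIONS
    ]
    return sum(hits) / len(hits)


if __name__ == "__main__":
    ratings = [
        RubricScores("rater_1", {"Accuracy": 4, "Reasoning": 3, "Clarity": 4, "Usefulness": 4, "Safety": 5}),
        RubricScores("rater_2", {"Accuracy": 4, "Reasoning": 4, "Clarity": 3, "Usefulness": 4, "Safety": 5}),
    ]
    if passes_screening(consistency_score=0.9):
        print("Per-dimension means:", aggregate_rubric(ratings))
        print("Pairwise agreement:", pairwise_agreement(ratings))
```

In this sketch, a low agreement score would trigger the iterative refinement the abstract mentions (e.g., clarifying rubric descriptors and re-scoring) before the per-dimension means are reported.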