Universidad Panamericana
Assessing AI-Generated Legal Reasoning: A Benchmark for Legal Text Quality from Literature Review

Journal
Artificial Intelligence – COMIA 2025 17th Mexican Congress, Mexico City, Mexico, May 12–16, 2025, Proceedings, Part I
ISSN
1865-0929
Publisher
Springer Nature Switzerland
Date Issued
2025
Author(s)
Prince Tritto, Philippe  
Facultad de Derecho - CampCM  
Ponce, Hiram  
Facultad de Ingeniería - CampCM  
Type
text::book::book part
DOI
10.1007/978-3-031-97907-1_5
URL
https://scripta.up.edu.mx/handle/20.500.12552/12538
Abstract
The adoption of Large Language Models in law has sparked debate over how best to evaluate AI-generated legal reasoning. Existing benchmarks focus on surface-level accuracy, overlooking deeper dimensions such as argumentative coherence, practical usability, and alignment with jurisprudential values. This paper provides a comprehensive framework that integrates insights from formalism, interpretivism, realism, and argumentation theory to assess legal AI outputs. We first explore the philosophical foundations of legal reasoning, drawing on MacCormick’s concepts of internal and external justification and Perelman’s notions of audience-centered persuasion to highlight the rhetorical and moral dimensions essential for evaluation. Next, we examine structured approaches to evaluation from related fields before showing why existing benchmarks (e.g., LexGLUE, LegalBench, LegalAgentBench) only partially capture the subtleties of legal reasoning. We also contrast common law and civil law traditions to illustrate how a one-size-fits-all approach neglects the distinct roles of precedent versus codified statutes. Building on these theoretical and comparative insights, we propose a three-stage evaluation methodology that begins with automated screening for factual consistency, proceeds to expert-led rubric assessment across five dimensions (Accuracy, Reasoning, Clarity, Usefulness, and Safety), and concludes with iterative refinement through reliability checks. This structured approach, validated through a pilot study, aims to strike a balance between scalability and nuance, equipping researchers and practitioners with a robust tool for assessing AI-generated legal texts. Unifying theoretical rigor, domain-specific practicality, and cross-jurisdictional adaptability, this framework lays a solid foundation for legal AI benchmarks and paves the way for safer, more transparent deployment of AI in law. © The authors © Springer.
Subjects

Legal AI Benchmarking...

AI-Generated Legal Te...

Legal Reasoning Evalu...

Legal NLP Metrics

Argumentative Coheren...

Legal Text Quality As...

Human-in-the-Loop AI ...

License
Restricted Access
How to cite
Prince Tritto, P., Ponce, H. (2025). Assessing AI-Generated Legal Reasoning: A Benchmark for Legal Text Quality from Literature Review. In: Martínez-Villaseñor, L., Martínez-Seis, B., Pichardo, O. (eds) Artificial Intelligence – COMIA 2025. COMIA 2025. Communications in Computer and Information Science, vol 2552. Springer, Cham. https://doi.org/10.1007/978-3-031-97907-1_5
