JT/DL: Court Innovation Is Benchmarks
Plus News from Minnesota
The JT/DL is a twice-monthly newsletter about justice technology news, events, and opportunities.
Court Innovation Is Benchmarks
After introducing the Court Innovation Fund and exploring how bad court function can hinder everything from affordable housing to national security, I want to turn to what can be done to innovate courts and improve just outcomes. Starting today, I’ll explore what a philanthropic fund focused on court innovation can accomplish and why it matters.
Time and again, justice technology developers do the justice system dirty. They’ve sold biased risk assessment tools, software packages that lead to false arrests and habeas petitions, and “AI” that wasn’t. This embarrassing trend persists because the justice system and its advocates don’t build the tools and practices needed to interrogate and validate technologies old and new.
It’s in this ecosystem that state and local courts expect to double their use of genAI this year. To state the obvious: the courts are not ready to vet AI—to know if an AI product does what its developers claim. This means untested technology will do everything from handling banal administrative tasks to reviewing filings for judgment. Deployed incorrectly, genAI can wrongly take away people’s freedom, limit their access to justice, and demolish their economic well-being. Without standardization, evaluation, and training around genAI, courts run the risk of perpetuating existing harms and creating new ones at the scale and speed of AI.
As part of the development of the Court Innovation Fund at Renaissance Philanthropy, we are exploring the development and use of AI benchmarks for court software. We believe that building benchmarks will create standards for court AI, incentivize a new generation of technologists to work on court innovation, and provide court officials with actionable information when they consider purchasing and adopting AI systems.
First, a brief primer on benchmarks. Benchmarking is the process of figuring out whether you can trust an AI system by gauging factors like accuracy and bias.
To accomplish this, a benchmark dataset pairs model inputs with model outputs. The inputs are the data fed into the model, and the outputs are the ideal answers, like an answer sheet. To put this into context, say you wanted to test an AI model on how well it categorizes pictures of blueberry muffins and chihuahuas. You would put together a dataset of blueberry muffin and chihuahua images (the model inputs) and feed them into the AI. Then you would compare the AI’s output (its attempt at classification) with the benchmark’s ideal outputs. The delta between the two is your accuracy number.
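For the technically curious, here is that loop as a minimal Python sketch. The image paths, labels, and classify() function are all hypothetical stand-ins, but the mechanics are the real thing: run the benchmark’s inputs through the model, then score the model’s answers against the answer sheet.

```python
# A minimal sketch of the benchmark loop described above. The image paths,
# labels, and classify() function are hypothetical stand-ins.

# The benchmark dataset: model inputs paired with ideal outputs (the answer sheet).
benchmark = [
    ("images/muffin_01.jpg", "blueberry muffin"),
    ("images/chihuahua_01.jpg", "chihuahua"),
    ("images/muffin_02.jpg", "blueberry muffin"),
    ("images/chihuahua_02.jpg", "chihuahua"),
]

def classify(image_path: str) -> str:
    """Stand-in for the AI model under test. A real model would analyze
    the image; this dummy just guesses 'chihuahua' every time."""
    return "chihuahua"

def evaluate(dataset) -> float:
    """Score the model's outputs against the benchmark's ideal outputs."""
    correct = sum(1 for path, label in dataset if classify(path) == label)
    return correct / len(dataset)

print(f"Accuracy: {evaluate(benchmark):.0%}")  # the dummy model scores 50%
```

Swap in a real model for classify() and a real labeled dataset, and this little loop is the whole idea: one number telling you how often the model matched the answer sheet.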
Once the developers get their accuracy numbers back, they can make improvements to their model and then test again. It’s through this iterative process that AI models become better and more trustworthy. When applying benchmarks to genAI, this form of validation is even more important, because we are no longer just asking “can the software classify this?” but “does the software do what its developers claim?”
Benchmarks are standard across fields like education, medicine, and science. They are often made public, so anyone can test against them and the results can be viewed like a nerdy leaderboard. This is an incredibly powerful feedback loop, creating a level of standardization and transparency that would otherwise be missing. Yet there is no concerted effort to create benchmarks for AI deployed in the courts.
This is a problem because AI in the courts isn’t trouble ahead; it’s here now. AI tools are helping litigants fill out forms, helping officers draft police reports, and assisting with courtroom interpretation. AI is also being developed and tested to review eviction and debt filings, helping a judge determine how she should rule.
I find filing review tools illustrative of both the potential of genAI in the courtroom, for courts and the public alike, and the dire need for benchmarks.
Debt cases, which already accounted for 1-in-4 civil cases in state courts before the pandemic, are surging across the country. Usually suing for $10,000 or less, debt-buying companies file millions of cases across the U.S. in the hope that the defendant doesn’t appear in court. A no-show allows the judge to rule against the defendant without considering the facts of the case. With that judgment in hand, the company takes the defendant to collections, perpetuating a punishing cycle of debt. The problem is that many of these claims, themselves drafted with the help of AI, do not pass basic scrutiny, such as whether the defendant actually owes the debt.
To fight back against this predatory practice, states like Arizona and New York now require a judge to ensure that a claim meets statutory standards before entering a default judgment. This is a smart policy change to curb predatory debt suits. It also creates a massive administrative challenge, leaving judges to sift through an ever-growing pile of claims. Luckily, automating that review is a perfect job for AI.
However, that leaves us wondering: do these new review tools actually work? Without independent verification, courts have two main ways to vet the technology: asking other courts whether they liked a tool, and vendor ad copy. This status quo is woefully insufficient. A benchmark for these tools, sketched below, would give courts something far better to go on.
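To make that concrete, here is a sketch of what a filing-review benchmark could look like, following the same pattern as the muffin-and-chihuahua example above. Every field, label, and function is invented for illustration; a real benchmark would be built from actual court records with expert-verified answers.

```python
# A hypothetical filing-review benchmark, same pattern as the image example.
# Each input is a debt claim; each ideal output is an expert's verdict on
# whether the claim meets statutory standards for a default judgment.
# All fields and values are invented for illustration.

benchmark = [
    (
        {"plaintiff": "Acme Debt Buyers LLC", "amount": 4250,
         "proof_of_ownership": True, "within_statute_of_limitations": True},
        "meets standards",
    ),
    (
        {"plaintiff": "Acme Debt Buyers LLC", "amount": 9800,
         "proof_of_ownership": False, "within_statute_of_limitations": True},
        "fails: no proof the plaintiff owns the debt",
    ),
]

def review_claim(claim: dict) -> str:
    """Stand-in for the AI review tool being evaluated. A real tool would
    analyze the filing; this dummy approves everything."""
    return "meets standards"

# Score the tool's conclusions against the expert answer sheet.
correct = sum(1 for claim, verdict in benchmark if review_claim(claim) == verdict)
print(f"Agreement with answer sheet: {correct} of {len(benchmark)}")
```

Even this toy version shows the value: a review tool that rubber-stamps every claim matches only half the answer sheet, which is exactly the kind of finding you will never get from ad copy.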
Still, it’s easy to understand how we got here. The significant cost, technical expertise, and logistical challenge of collecting and publishing court data mean there is a dearth of datasets available for research and development in the courts. Complicating matters, benchmarks need to evolve over time, either because the underlying foundation models improve or because the application of the tool changes. Collectively, these factors put benchmark development out of reach for most courts. The work needs to happen at scale, building a commons infrastructure that all courts can benefit from.
That is why the Court Innovation Fund is focused on building a strategy to develop benchmark datasets for courts. The first step will be to figure out the economics and practicalities of doing this work at scale and over time. With that information, we can bring foundational benchmarks to court AI. Doing so creates standardized, reliable data for building and evaluating new tools that support just outcomes. It also creates an easy on-ramp for technologists to explore how their skills could address court issues, expanding the coalition of people developing solutions to these problems. Finally, it gives courts more information when considering AI adoption.
Courts are places for fact-finding. They provide a process to gather and assess evidence to determine the truth behind a legal claim. We can’t let AI’s accuracy and dependability—literally its ability to be factual—fall short of the standards required in court. Yet that is the path we are currently on. Luckily, we can do better.
News
The war in Minnesota is for our phones. (New York Times)
The FBI is investigating Minnesota Signal chats tracking ICE, Patel says. (NBC)
The White House is altering ICE arrest photos, says “the memes will continue”. (Ars Technica)
How local police body cams help ICE. (NPR)
Palantir defends working with ICE to staff following Pretti killing. (Wired)
ICE keeps trying and failing to unmask anonymous critics online. (Ars Technica)
TikTok, now under U.S. control, is collecting immigration status and gender identity. (AV Club)
UK police blame Microsoft Copilot for intelligence mistake. (The Verge)
Interest in law school is surging, but AI makes the payoff less certain. (New York Times)
Angry Norfolk residents lose lawsuit to stop Flock license plate scanners. (Ars Technica)
Oregon legislation poised to tackle “fishing expedition” searches of license plate data. (OPB)
States strengthen shield laws to protect abortion and gender-affirming care data. (Route Fifty)
All rise for judge GPT. (The Verge) (h/t Keith Porcaro)
The Honorable AI? (Lawfare)
AI video tools depict lawyers and judges as women at far lower rates than real life. (LawNext)
Reboot launched the Observatory of Public Sector AI. (Reboot)
There’s a new VC in the UK for early stage social ventures. (Social Tech Ventures)
Events
Pathfinders for Justice is hosting a virtual AI for Justice discussion Feb. 5. (PfJ)
Suffolk Law is hosting LIT Con April 13. (SLS)
RightsCon is May 5-8 in Zambia. (RC)
Jobs & Opportunities
The Brennan Center has paid internship opportunities, including in justice reform. (BC) (h/t Eduardo Gonzalez)
[New] The Center for Democracy and Technology has multiple openings. (CDT)
Code for America has multiple openings. (CfA) (h/t Russ Finkelstein)
The Electronic Frontier Foundation is hiring legal interns. (EFF)
[New] The Ford Foundation needs a director for its office of innovation. (FF)
The Gates Foundation needs a deputy director for its AI and Data Enablement Hub. (GF)
The Institute for Law and AI has multiple openings. (ILAI)
Lambda Legal needs a legal project manager. (LL)
The MacArthur Foundation needs a director of AI and opportunity. (MF)
[New] The Montgomery County DA (Penn.) needs a data analyst. (MDA)
The Mozilla Foundation is accepting nominations for fellows. (MF)
[New] New America opened applications for its Eviction Data Response Network 2026 cohort. (NA)
Oregon Law Help needs a content creator. (OSB)
[New] Recidiviz needs a product manager. (R)
[New] RTI is looking for a director of data governance. (RTI)
[New] The Surveillance Technology Oversight Project needs a legal director. (STOP) (h/t Eleni Manis)
[New] TechTonic Justice is hiring for multiple roles. (TTJ)
[New] The University of Texas School of Law needs an AI and legal practice fellow. (UT)
Wired needs a senior politics reporter. (W)