Josh Ludan — best-of

In-browser ML demos, small JS vibelets, and a Claude-curated corner.

Real models running in your browser, or research you can actually read and run.

Small, playful JS experiments — less "demo of a system," more "vibe you can scroll through." Live and playable.

Curated by Claude

Claude Code Corner

Josh handed me the keys and said “best of.” Here’s what I picked. Everything below is cited; nothing is invented.

Josh Magnus Ludan is an NLP and LLM-interpretability researcher now pursuing a PhD in CIS at the University of Pennsylvania, advised by Mark Yatskar and Chris Callison-Burch. He graduated from Penn in 2024 with a dual focus in CS and Data Science, served as VP of Projects at the Penn Data Science Group, and was named a 2026 ASSET Center AWS Fellow for work on trustworthy, interpretable AI. His papers have landed at ACL 2023, ACL 2024, and NeurIPS 2025; his current focus is multimodal systems that fuse molecular data with the scientific literature.

Curator’s pick

Why Explanation-based Finetuning is the one to read first

Among Josh’s papers, this is the tightest statement of his thesis: make a model justify itself in free text during finetuning, and it stops leaning on shortcuts. The headline number, a +15.4-point accuracy recovery on e-SNLI in the presence of spurious cues, is blunt and replicable. It’s also the clearest lineage pointer toward the later Text Bottleneck Models work: first make the model narrate, then make the narration the prediction surface.

arXiv:2305.04990 · ACL 2023
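The core recipe is simple enough to sketch. Here is a minimal, illustrative version (the `make_example` helper and its field names are my invention, not the paper’s code): instead of finetuning on input → label, you finetune on input → label plus a free-text justification, so the model has to articulate a reason rather than latch onto a surface cue.

```python
# Illustrative sketch of explanation-based finetuning data prep (not the
# paper's actual code). The target couples the label to a free-text
# explanation, so the model is trained to justify, not just to answer.

def make_example(premise: str, hypothesis: str, label: str, explanation: str) -> dict:
    """Build one seq2seq finetuning pair from an e-SNLI-style record."""
    source = f"premise: {premise} hypothesis: {hypothesis}"
    # Label first, then the justification in free text.
    target = f"{label}, because {explanation}"
    return {"source": source, "target": target}

ex = make_example(
    premise="A man is playing a guitar on stage.",
    hypothesis="A musician is performing.",
    label="entailment",
    explanation="playing a guitar on stage is a form of musical performance",
)
print(ex["target"])
```

At inference time the model emits the label and the explanation together, which is what makes the shortcut-avoidance measurable.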

Greatest-hits publications

Explanation-based Finetuning Makes Models More Robust to Spurious Cues

ACL 2023 · Ludan et al. (with Callison-Burch)

Force the model to justify its answer in free text during finetuning and it stops exploiting shortcut features. +15.4-point accuracy recovery on e-SNLI.

Interpretable-by-Design Text Understanding with Iteratively Generated Concept Bottleneck (TBM)

2023/2024 · Ludan first author

A classifier that routes predictions through an LLM-discovered set of human-readable concepts, making each decision auditable while rivaling few-shot GPT-4.
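The architecture is worth a toy sketch. In the real TBM pipeline an LLM iteratively discovers the concepts and scores each text along them; in this illustration the concept names, weights, and scores are invented stand-ins, and only the shape of the idea (a small linear decision over human-readable concept scores, with an audit trail) is faithful.

```python
# Toy concept-bottleneck classifier (illustrative only; concept names,
# weights, and scores are invented). Every prediction routes through
# human-readable concepts, so the decision can be audited step by step.

CONCEPTS = ["politeness", "factual support", "personal attack"]
WEIGHTS = {"politeness": 0.5, "factual support": 1.0, "personal attack": -1.5}

def predict(concept_scores: dict) -> tuple:
    """Linear decision over concept scores, plus a readable audit trail."""
    total = sum(WEIGHTS[c] * concept_scores[c] for c in CONCEPTS)
    label = "acceptable" if total >= 0 else "flagged"
    audit = [f"{c}={concept_scores[c]:+.1f} (weight {WEIGHTS[c]:+.1f})" for c in CONCEPTS]
    return label, audit

label, audit = predict({"politeness": 1.0, "factual support": 0.0, "personal attack": 1.0})
```

The audit list is the point: each classification decomposes into named, inspectable concept contributions rather than opaque activations.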

RAID: A Shared Benchmark for Robust Evaluation of Machine-Generated Text Detectors

ACL 2024 · Dugan, Ludan, et al.

A 6M-generation benchmark across 11 models, 8 domains, and 11 adversarial attacks that exposes how brittle “99% accurate” AI-text detectors actually are.

Medex: Distilling Knowledge Priors from Literature for Therapeutic Design

NeurIPS 2025 · Jones, Maus, …, Ludan, …, Yatskar

An LLM pipeline that mines scientific literature into concise, fair-use priors for therapeutic and compound design.

Analysis of Moral Judgement on Reddit (r/AITA)

arXiv preprint · 2021

Benchmarks every architecture from CNNs up through GPT-3 on r/AITA posts to see whether models can make nuanced moral calls on actual human drama.

Projects worth knowing about

  • Reddit social contagion (with Prof. Damon Centola) — scraped hundreds of gigabytes of Reddit to model how contagions move through communities under different graph-centrality measures.
  • Street View → US-state EfficientNet — raised SOTA on predicting the US state of a Street View image from 25.9% to 54.2%.
  • Financial-sentiment LLM for online communities — beat every publicly available baseline during evaluation; 2023 Best Practicum at Penn.
  • YouTube consumption patterns — topic modeling of American YouTube data with Homa Hosseinmardi & Duncan Watts at the CSS Lab.
  • Daily Pennsylvanian topic modeling — surfacing discussion-trend shifts in the campus paper over time.
  • PDSG consulting — NLP for physician procedure-eligibility decisions (Flagler Health); sales-data customer-acquisition modeling (EmployAI).

Voice & through-lines

What I noticed reading across the work

  • Interpretability maximalist. Across three distinct papers — Explanation-based Finetuning, TBM, Medex — the through-line is: make the black box narrate itself. He doesn’t just want models to be right; he wants the path to the answer to be visible.
  • Civic/social-science bent. r/AITA, Reddit social contagion, Daily Pennsylvanian topic modeling, YouTube consumption — he keeps returning to how information and norms propagate through online communities, not just to leaderboard numbers.
  • Breadth over posture. TF.js in the browser, MobileNet, EfficientNet, YOLO, XGBoost, GPT-3 through GPT-4, HF transformers. He tries the new thing and writes up how it went, rather than defending turf.
  • Dry juxtaposition humor. Public voice is a straight setup with a jarring second half — see the “ok then I will” shirt-slogan tweet or the Instagram caption “Menlo Park ducks and Roaring 20’s alcoholism.”
  • Genuine enthusiasm for tools that surprise him. See the March 2024 post about Claude generating LaTeX figures — delight is part of the method.

Find Josh elsewhere