AI Research

Plain-English summaries of papers worth reading, with what changed and why it matters.

59 articles

Figure 1 of Zou et al.'s Representation Engineering paper (arXiv:2310.01405) showing the top-down framework that places population-level representations at the center of analysis for monitoring and controlling cognitive phenomena in language models

Activation steering and representation engineering for LLM safety — a multi-paper review

Plain-English walkthrough of ActAdd, Representation Engineering, Contrastive Activation Addition, and Anthropic's Persona Vectors — what each paper proves, where…

Neural Tech Daily · 20 May 2026 · ~62 min

Figure showing an ARC-AGI task grid from On the Measure of Intelligence (Chollet, arXiv:1911.01547), with demonstration input-output pairs and a held-out test input that the system must complete

ai-research

ARC-AGI, Five Years On: A Multi-Paper Review of Chollet's Benchmark, the 2024 Prize, o3, and ARC-AGI-2

Multi-paper review of the ARC-AGI benchmark: Chollet's On the Measure of Intelligence (2019), the 2024 Prize Technical Report, o3's 87.5% breakthrough, and ARC-AGI-2.

Neural Tech Daily · 20 May 2026 · ~59 min

Figure 1 of ColBERT (Khattab and Zaharia, arXiv:2004.12832) showing the four query-document interaction paradigms: representation-similarity, query-document interaction, all-to-all interaction, and late interaction

ai-research

ColBERT, ColBERTv2, Sentence-BERT and Modern Rerankers: A Multi-Paper Review of Late-Interaction Retrieval

Multi-paper review of late-interaction retrieval (ColBERT, ColBERTv2), bi-encoder baselines (Sentence-BERT), and cross-encoder rerankers (BGE, Cohere) for production RAG.

Neural Tech Daily · 20 May 2026 · ~40 min

Figure 1 of arXiv:2212.08073 Constitutional AI: process diagram showing the two-phase training pipeline (Supervised Learning from critique-revision pairs followed by Reinforcement Learning from AI Feedback using a learned preference model trained on constitutional comparisons)

ai-research

Constitutional AI: A Technical Reference with 2026 Update

Bai et al. (2022) introduced RLAIF. Constitutional AI's lineage now spans collective constitutions and constitutional classifiers. A reference for ML teams.

Neural Tech Daily · 20 May 2026 · ~60 min

Figure 2 of Ho et al., Denoising Diffusion Probabilistic Models (arXiv:2006.11239): the directed graphical model showing the forward noising chain q(x_t | x_{t-1}) and the learned reverse chain p_theta(x_{t-1} | x_t), the foundational diagram for the diffusion-image-generation lineage covered in this multi-paper review

ai-research

Diffusion image generation from DDPM to Flux: a multi-paper lineage review

Multi-paper review of the diffusion-image-generation lineage — DDPM, Latent Diffusion, SDXL, and Flux. What each paper changed about the loss, the architecture, and…

Neural Tech Daily · 20 May 2026 · ~68 min

Figure 4 of DeepSeekMath (arXiv:2402.03300): schematic comparing PPO (with separate value model) and Group Relative Policy Optimization (GRPO, which replaces the value model with group-averaged rewards across multiple sampled outputs)

ai-research

GRPO and Reinforcement Fine-Tuning: A Multi-Paper Technical Reference

GRPO replaces PPO's value network with group-averaged rewards. Multi-paper review of DeepSeekMath, DeepSeek-R1, and OpenAI RFT — math, algorithms, benchmarks.

Neural Tech Daily · 20 May 2026 · ~52 min

Figure 1 of MiniLLM (arXiv:2306.08543) — the schematic comparing forward-KL (mode-covering, student spreads mass over teacher's low-probability tails) against reverse-KL (mode-seeking, student concentrates on the teacher's high-probability modes), the central distinction this multi-paper review unpacks across three distillation papers

ai-research

Knowledge distillation for LLMs: MiniLLM, Distilling Step-by-Step, and the Gemma-2 recipe

Multi-paper review of three LLM knowledge-distillation methods: MiniLLM reverse-KL on-policy, Distilling Step-by-Step CoT rationales, Gemma-2 pre-training soft-target.

Neural Tech Daily · 20 May 2026 · ~37 min

Schematic of the Joint Embedding Predictive Architecture (JEPA): a context encoder, a target encoder updated by exponential moving average, and a predictor that maps context embeddings to target embeddings — figure reproduced from arXiv:2301.08243 (I-JEPA, Assran et al.).

ai-research

JEPA, I-JEPA, V-JEPA: Predicting in Latent Space Instead of Pixels

Multi-paper review of Joint Embedding Predictive Architectures: LeCun's framework, I-JEPA (Assran 2023), V-JEPA (Bardes 2024), feature-space prediction.

Neural Tech Daily · 20 May 2026 · ~55 min

Figure 1 of Zhang et al. (arXiv:2306.14048, H2O): visualization of accumulated attention scores across decoding steps showing the heavy-hitter phenomenon — a small subset of tokens absorbs a disproportionate share of attention, the empirical observation that motivates H2O's eviction policy and anchors this multi-paper review of KV cache compression

ai-research

KV cache compression: H2O, SnapKV, StreamingLLM, and prompt caching reviewed

Multi-paper review of four approaches to the linear KV-cache growth problem: H2O heavy-hitters, SnapKV observation window, StreamingLLM attention sinks, Anthropic…

Neural Tech Daily · 20 May 2026 · ~52 min

Figure 2 of The Llama 3 Herd of Models (arXiv:2407.21783) — isoFLOP scaling curves used by Meta to predict the compute-optimal 405B-parameter dense Transformer trained on 15.6 trillion tokens, the dataset and compute regime this multi-paper review reconstructs

ai-research

The Llama 3 and Llama 4 technical reports — a multi-paper review

Multi-paper review of Meta's Llama 3 herd (arXiv:2407.21783) and Llama 4 release material. Scaling laws, data mixture, post-training pipeline, MoE shift.

Neural Tech Daily · 20 May 2026 · ~48 min

Composite of paper figures from MMLU-Pro (arXiv:2406.01574), GPQA (arXiv:2311.12022), and SWE-Bench (arXiv:2310.06770), the three benchmark papers covered in this multi-paper review

ai-research

LLM evaluation benchmarks in 2026: MMLU-Pro, GPQA, SWE-Bench and the verification problem

Multi-paper review of MMLU-Pro, GPQA, SWE-Bench, and SWE-Bench Verified: how each was built, what they measure, where contamination and saturation bite.

Neural Tech Daily · 20 May 2026 · ~48 min

Figure 1 of Kirchenbauer et al. 'A Watermark for Large Language Models' (arXiv:2301.10226), the teaser comparing unwatermarked text against watermarked output where 28 of the response tokens fall on the secret 'green list' instead of the 9 expected by chance — the foundational visual for the green/red list watermarking scheme this multi-paper review covers

ai-research

LLM watermarking and AI-content detection: a four-paper review

Kirchenbauer green/red lists, Aaronson's Gumbel scheme, Christ-Gunn-Zamir undetectable watermarks, and SynthID-Text. What each method changes and where each one breaks.

Neural Tech Daily · 20 May 2026 · ~58 min

Figure 3 of arXiv:2307.12856 (A Real-World WebAgent): the WebAgent architecture pipeline showing HTML-T5 performing planning + HTML summarisation, feeding Flan-U-PaLM which synthesises Python Selenium scripts for browser actions

ai-research

Long-Horizon LLM Agents — WebGPT, WebAgent, V-IRL, Operator, Computer Use: A Multi-Paper Review

Five-paper review tracing how LLM agents close long task horizons via browser tools — from WebGPT (2021) through WebAgent, V-IRL, OpenAI Operator and Anthropic…

Neural Tech Daily · 20 May 2026 · ~60 min

Composite header for Anthropic's mechanistic-interpretability research thread, covering the 2021 transformer-circuits framework, the 2022 induction-heads paper, and the 2024 Scaling Monosemanticity report on Claude 3 Sonnet, the three papers reviewed in this article

ai-research

Mechanistic interpretability: induction heads, transformer circuits, and scaling monosemanticity — a multi-paper review

Plain-English walkthrough of induction heads, transformer circuits, and Anthropic's 2024 scaling-monosemanticity work — what's known, what's open.

Neural Tech Daily · 20 May 2026 · ~51 min

Figure 1 of arXiv:2404.02258: schematic of a Mixture-of-Depths transformer block; a per-token router scores each position, the top-k tokens enter the self-attention plus MLP path while the remaining tokens skip the block via the residual connection, producing a static computation graph at a user-defined fraction of the baseline FLOP cost

ai-research

Mixture-of-Depths (MoD): A Technical Reference

DeepMind's MoD lets transformer tokens skip entire blocks via top-k routing. Technical reference covering the routing math, FLOP savings, and MoE comparison.

Neural Tech Daily · 20 May 2026 · ~49 min

Figure 1 of the LLaVA paper (arXiv:2304.08485) — the architecture diagram showing a CLIP vision encoder feeding image features through a trainable projection matrix W into a language model (Vicuna) that processes the projected vision tokens alongside the user instruction, the canonical open-VLM recipe this multi-paper review traces across LLaVA, PaliGemma, and Qwen2-VL

ai-research

Open VLMs in 2026: a multi-paper review of LLaVA, PaliGemma, and Qwen2-VL

Multi-paper review of the open vision-language model landscape: LLaVA's projector lineage, PaliGemma's prefix-LM recipe, Qwen2-VL's dynamic resolution and M-RoPE.

Neural Tech Daily · 20 May 2026 · ~54 min

Figure 1 of Abdin et al. Phi-4 Technical Report (arXiv:2412.08905), reproduced for editorial coverage — AMC-10/12 mathematics test scores positioning Phi-4 14B alongside substantially larger frontier models

ai-research

Phi-4, Phi-4-Mini, and Phi-4-reasoning: small LLMs chasing frontier reasoning

Multi-paper review of the Microsoft Phi-4 family — the 14B base model, the 3.8B Phi-4-Mini multimodal variant, and the Phi-4-reasoning SFT plus GRPO pipeline.

Neural Tech Daily · 20 May 2026 · ~52 min

Figure 1 of DistServe (arXiv:2401.09670, OSDI 2024), performance comparison between colocated prefill+decode serving and disaggregated serving, the headline result motivating the architecture pattern covered in this multi-paper review

ai-research

Prefill-decode disaggregation: a DistServe and Splitwise paper review

Multi-paper review of DistServe (OSDI 2024) and Splitwise (ISCA 2024). Why separating prefill and decode GPUs cuts tail latency, plus the KV-cache transfer.

Neural Tech Daily · 20 May 2026 · ~33 min

Figure 1 of RAFT (Zhang et al., arXiv:2403.10131) contrasting closed-book exam, open-book exam, and domain-specific open-book training paradigms

ai-research

RAFT, Self-RAG, Adaptive-RAG, and Corrective-RAG: A Multi-Paper Review of the 2024 RAG-Improvements Wave

Multi-paper review of RAFT, Self-RAG, Adaptive-RAG, and Corrective-RAG covering training, reflection tokens, query routing, and retrieval evaluation in modern RAG.

Neural Tech Daily · 20 May 2026 · ~35 min

Figure 1 of Ring Attention with Blockwise Transformers (arXiv:2310.01889) — the ring-topology diagram showing how key-value blocks rotate around N devices while each device holds a fixed query block, with communication overlapping computation

ai-research

Ring Attention and Striped Attention: a multi-paper review of long-context attention engineering

Multi-paper review of Ring Attention and Striped Attention — how blockwise distributed self-attention pushes context windows past a million tokens.

Neural Tech Daily · 20 May 2026 · ~49 min

Figure 1 of Self-Refine (arXiv:2303.17651) — the teaser diagram of the generate-feedback-refine loop covered in this multi-paper review alongside Reflexion and Constitutional AI

ai-research

Self-Refine, Reflexion, Constitutional AI: a multi-paper review of verbal self-correction in LLMs

Multi-paper review of Self-Refine, Reflexion, and Constitutional AI — three ways an LLM uses its own natural-language feedback to improve outputs without weight updates.

Neural Tech Daily · 20 May 2026 · ~57 min

Header figure from Anthropic's Towards Monosemanticity report showing the sparse autoencoder pipeline that decomposes a one-layer transformer's MLP activations into a large dictionary of features

ai-research

Sparse autoencoders for LLM interpretability — a three-paper review

Plain-English walkthrough of Bricken et al.'s Towards Monosemanticity, Anthropic's Scaling Monosemanticity on Claude 3 Sonnet, and OpenAI's TopK SAEs on GPT-4.

Neural Tech Daily · 20 May 2026 · ~61 min

Figure 3 of EAGLE (arXiv:2401.15077), reproduced for editorial coverage — diagram illustrating feature-level uncertainty and the second-to-top-layer drafting that distinguishes EAGLE from token-level speculative decoding

ai-research

Speculative decoding for LLM inference: Leviathan, Medusa, EAGLE, EAGLE-2 multi-paper review

Multi-paper review of speculative decoding (Leviathan 2023), Medusa, EAGLE, and EAGLE-2: how each accelerates LLM inference, what speedups they ship, where they break.

Neural Tech Daily · 20 May 2026 · ~47 min

Figure from Gunasekar et al. Textbooks Are All You Need (arXiv:2306.11644) showing phi-1's pass@1 HumanEval performance against larger models, the headline result that motivated the synthetic-data-for-small-LLMs research line covered in this multi-paper review

ai-research

Synthetic data + textbook-quality data for small LLMs: Phi, Orca, Self-Instruct, Cosmopedia

Multi-paper review of the synthetic-data lineage that produced small but capable LLMs: Self-Instruct, Orca, Phi-1, Phi-3, and the Cosmopedia open replication.

Neural Tech Daily · 20 May 2026 · ~44 min

Figure 1 of Sun et al. 2020 (arXiv:1909.13231): bar chart of test-error reductions on CIFAR-10-C level 5 corruptions, comparing TTT and online TTT against baselines across 15 corruption types.

ai-research

Test-Time Training and Continual Learning at Inference: A Multi-Paper Technical Reference

Multi-paper review of test-time training (TTT). Sun et al. 2020 + Akyürek et al. 2024: how parameter updates at inference extend test-time compute beyond CoT and RFT.

Neural Tech Daily · 20 May 2026 · ~62 min

Figure 1 of arXiv:2305.10601 (Yao et al.): the Tree of Thoughts schematic contrasts input-output prompting, chain-of-thought, self-consistency over chains, and the new tree-search framework that explores multiple branches with self-evaluation and backtracking.

ai-research

Tree of Thoughts: A Technical Reference on Yao et al. (NeurIPS 2023)

Technical reference on Tree of Thoughts (Yao et al. 2023). Walks through the BFS and DFS variants, Game of 24 and Creative Writing results, and where the claims break.

Neural Tech Daily · 20 May 2026 · ~45 min

Figure from Kwon et al.'s PagedAttention paper (arXiv:2309.06180) illustrating the block-table mapping between a request's logical KV blocks and the underlying physical GPU memory blocks that vLLM allocates on demand

ai-research

LLM serving systems — PagedAttention, RadixAttention, and TensorRT-LLM in one technical reference

Multi-paper review of vLLM's PagedAttention, SGLang's RadixAttention, and NVIDIA's TensorRT-LLM — how each system manages the KV cache and where they diverge.

Neural Tech Daily · 20 May 2026 · ~56 min

Figure 1 of AlphaGeometry (Nature 2024) — neuro-symbolic loop where the language model proposes auxiliary geometric constructions to unblock the symbolic deduction engine

ai-research

AlphaGeometry, AlphaProof, and AlphaGeometry2: a multi-paper review of neuro-symbolic math reasoning

Multi-paper review of AlphaGeometry, AlphaProof, and AlphaGeometry2 — how a language model paired with a symbolic engine reaches IMO medallist level on olympiad math.

Neural Tech Daily · 19 May 2026 · ~58 min

Figure 1 of Code Llama: Open Foundation Models for Code (arXiv:2308.12950) — the specialization pipeline from Llama 2 through code-tokens pretraining, long-context fine-tuning, and instruction tuning that produced the Code Llama family this multi-paper review reconstructs alongside StarCoder, StarCoder 2, and DeepSeek-Coder-V2

ai-research

Code Llama, StarCoder, and DeepSeek-Coder — a multi-paper review

Multi-paper review of Code Llama (2308.12950), StarCoder + StarCoder2 (2305.06161, 2402.19173), and DeepSeek-Coder-V2 (2406.11931). FIM, HumanEval, MoE for code.

Neural Tech Daily · 19 May 2026 · ~51 min

Figure 1 of DiLoCo (arXiv:2311.08105) — overview of k workers training independently for H inner steps before averaging their parameter deltas through an outer Nesterov-momentum optimizer that updates the global model

ai-research

DiLoCo, OpenDiLoCo, and AsyncDiLoCo: a multi-paper review of decentralized low-communication LLM training

Multi-paper review of DiLoCo, OpenDiLoCo, and AsyncDiLoCo — how language models can be trained across non-co-located GPUs with infrequent gradient sync.

Neural Tech Daily · 19 May 2026 · ~61 min

Figure 1 of arXiv:2310.16834 (SEDD): the graphs of L_CSM versus L_SE for a ground-truth score of 0.2, showing the score-entropy loss respects nonnegativity where the conditional-score-matching loss does not.

ai-research

Diffusion Language Models: A Technical Reference (SEDD, DiffuLLaMA)

Technical reference on discrete-diffusion language models. Walks SEDD (Lou 2024) and DiffuLLaMA (Gong 2024): what they solve, where AR baselines still win.

Neural Tech Daily · 19 May 2026 · ~28 min

Figure 1 of the DPO paper (arXiv:2305.18290) — the teaser diagram contrasting the standard RLHF pipeline (separate reward model trained from preferences, then RL fine-tuning) with Direct Preference Optimization's single-stage policy update directly from preference pairs, the foundational method covered in this multi-paper review

ai-research

DPO vs IPO vs KTO vs SimPO: a multi-paper review of direct-preference-optimization variants

Multi-paper review of DPO and three successors (IPO, KTO, SimPO). What each paper changes about the loss, what it gains, and where it still fails.

Neural Tech Daily · 19 May 2026 · ~19 min

Figure 1 of WizardLM (arXiv:2304.12244) — the Evol-Instruct schematic showing an initial instruction iteratively rewritten into deeper and broader variants via prompted in-depth and in-breadth evolution operations, the foundational diagram for the multi-paper lineage covered in this review

ai-research

Evol-Instruct, WizardLM, WizardCoder, and the Tülu lineage: a multi-paper review

Multi-paper review of Evol-Instruct and the WizardLM/WizardCoder/Tülu open-instruction-tuning lineage — what each paper changes, and where the lineage still leaks.

Neural Tech Daily · 19 May 2026 · ~48 min

Figure 1 of Granite Code Models (arXiv:2405.04324) — HumanEvalPack performance comparison of Granite-8B-Code-Base and Granite-8B-Code-Instruct against same-class open-weights code models across six programming languages and three coding tasks

ai-research

IBM's Granite open-model lineage — a multi-paper review

Multi-paper review of IBM's Granite-3.0 language models, Granite Code Models (arXiv:2405.04324), and Granite-TimeSeries TTM (arXiv:2401.03955).

Neural Tech Daily · 19 May 2026 · ~60 min

Figure 1 of Lookahead Decoding (arXiv:2402.02057), reproduced for editorial coverage — diagram showing the lookahead window, verification branch, and n-gram pool in a single decoding step

ai-research

Lookahead Decoding (Fu et al., ICML 2024) — paper review and Jacobi-decoding context

Paper review of Lookahead Decoding (Fu, Bailis, Stoica, Zhang) — parallel LLM inference via Jacobi iteration + n-gram pool, no draft model required.

Neural Tech Daily · 19 May 2026 · ~35 min

Figure 1 of arXiv:2402.02057 — Lookahead Decoding workflow showing the lookahead branch generating N-grams in parallel and the verification branch validating candidates against the target model

ai-research

Lookahead Decoding (arXiv:2402.02057): parallel LLM decoding without a draft model, reviewed

Lookahead Decoding (Fu et al.) breaks autoregressive sequential dependency using Jacobi iteration. This article reconstructs the method, math, and 1.8x speedup claim.

Neural Tech Daily · 19 May 2026 · ~37 min

Architectural diagram from arXiv:2405.21060 (Dao and Gu, Mamba-2): the structured state space duality framework, depicting the equivalence between selective SSMs and a specific class of attention via semiseparable matrix decompositions.

ai-research

Mamba-2 and State Space Duality: A Technical Reference

Technical reference on Mamba-2 (Dao and Gu, ICML 2024). Walks the State Space Duality framework — what holds, where transformers still win.

Neural Tech Daily · 19 May 2026 · ~24 min

Figure 1 of arXiv:2202.08906 (ST-MoE): training instabilities for sparse models, comparing an unstable training run on the left where the training loss diverges with a stable run on the right under the same configuration.

ai-research

Mixture-of-Experts Routing Stability: A Multi-Paper Technical Reference

Technical reference on MoE routing stability. Walks ST-MoE (Zoph 2022), Expert Choice (Zhou 2022), StableMoE (Dai 2022) — what each fixes, where they disagree.

Neural Tech Daily · 19 May 2026 · ~31 min

Composite illustration of the closed-model reasoning paradigm: OpenAI o1 and Claude Extended Thinking, both characterised by long internal chains of thought before producing a final answer.

ai-research

Reasoning Models: o1, o3, and Claude Extended Thinking Reviewed

Technical reference on the closed-model reasoning paradigm: OpenAI o1/o3 and Claude Extended Thinking. What is disclosed, what is hidden, what the benchmarks show.

Neural Tech Daily · 19 May 2026 · ~44 min

Figure 2 of P-Tuning v2 (Liu et al., arXiv:2110.07602) showing the architecture difference between Lester-style prompt tuning (input-layer-only soft prompts) and P-Tuning v2 (per-layer prefix prompts)

ai-research

Prompt Tuning, Prefix-Tuning and P-Tuning v2: A Multi-Paper Review of Soft-Prompt PEFT

Multi-paper review of soft-prompt PEFT — Lester prompt tuning, Li and Liang prefix-tuning, P-Tuning v2 deep prompts — across NLG, NLU and sequence labelling.

Neural Tech Daily · 19 May 2026 · ~48 min

Figure 1 of the ReAct paper (arXiv:2210.03629) — the four-panel comparison of Standard, Chain-of-Thought, Act-only, and ReAct prompting on a HotpotQA question and an ALFWorld household task, the foundational diagram for the multi-paper review on agent reasoning patterns

ai-research

ReAct vs ReWOO vs Plan-and-Solve: a multi-paper review of agent reasoning patterns

Multi-paper review of ReAct (Yao 2022), ReWOO (Xu 2023), and Plan-and-Solve (Wang 2023): how the three prompting patterns differ, what each measures, where each breaks.

Neural Tech Daily · 19 May 2026 · ~49 min

Figure 1 of the Qwen3 Technical Report (arXiv:2505.09388) — the four-stage post-training pipeline integrating long-CoT cold start, reasoning RL with GRPO, thinking-mode fusion, and general RL into the flagship Qwen3-235B-A22B mixture-of-experts model reviewed here alongside the Qwen 2.5 dense lineage

ai-research

The Qwen 2.5 and Qwen 3 technical reports — a multi-paper review

Multi-paper review of Alibaba's Qwen 2.5 (arXiv:2412.15115) and Qwen 3 (arXiv:2505.09388). Pretraining recipe, post-training, multilingual sweep, MoE.

Neural Tech Daily · 19 May 2026 · ~56 min

Figure 1 of Kaplan et al. arXiv:2001.08361 showing test loss as a power-law function of compute, dataset size, and parameter count across more than seven orders of magnitude

ai-research

Scaling Laws, Chinchilla, and Emergent Abilities: A Multi-Paper Technical Reference

Kaplan 2020, Chinchilla 2022, and Wei 2022 emergence — power-law exponents, compute-optimal allocation, emergence thresholds, and the mirage critique.

Neural Tech Daily · 19 May 2026 · ~49 min

Figure 3 of Peebles and Xie — Scalable Diffusion Models with Transformers (arXiv:2212.09748), the four DiT block variants (in-context, cross-attention, adaLN, adaLN-Zero) that became the architectural template behind Sora, Veo, and Movie Gen

ai-research

Generative video models from DiT to Sora, Veo, and Movie Gen: a multi-paper review

Multi-paper review of generative video models — the DiT architecture, OpenAI's Sora, DeepMind's Veo 2/3, and Meta Movie Gen. Architecture, training, data curation…

Neural Tech Daily · 19 May 2026 · ~64 min

Header image from Anthropic's engineering blog post on raising the bar on SWE-bench Verified with Claude 3.5 Sonnet, showing the visual framing of the two-tool minimal-harness approach that is one of the three designs reviewed in this article.

ai-research

SWE-agent, OpenHands, and Claude on SWE-bench — A Multi-Paper Technical Reference

Three software-engineering agent designs read together: SWE-agent's ACI, OpenHands' CodeAct platform, and Anthropic's minimal two-tool harness on SWE-bench Verified.

Neural Tech Daily · 19 May 2026 · ~63 min

Figure 1 of arXiv:2408.03314 (Snell et al.): compute-optimal scaling of test-time compute matches or exceeds a 14x larger pretrained baseline on MATH at equivalent FLOPs, with the gain widening on easier and medium-difficulty questions and narrowing on the hardest ones.

ai-research

Test-Time Compute Scaling: A Multi-Paper Technical Reference (Snell 2024, Liu 2025)

Technical reference on test-time compute scaling for LLMs. Walks through Snell et al. (2024) and Liu et al. (2025) — what holds, where the claims break.

Neural Tech Daily · 19 May 2026 · ~35 min

Figure 1 of Toolformer (Schick et al., arXiv:2302.04761) showing example tool calls for question answering, calculator, machine translation, Wikipedia search and calendar inserted into a single passage of natural text

ai-research

Toolformer, Gorilla and the Berkeley Function Calling Leaderboard: A Multi-Paper Review of Tool Learning

Multi-paper review of tool learning: Toolformer's self-supervised API filtering, Gorilla's retriever-aware fine-tuning, and the Berkeley Function Calling…

Neural Tech Daily · 19 May 2026 · ~52 min

Figure 1 of Learning Transferable Visual Models From Natural Language Supervision (CLIP), reproduced from arXiv:2103.00020 — the contrastive image-text pretraining diagram showing N image-text pairs aligned along the diagonal of a similarity matrix.

ai-research

Vision Encoders: CLIP, SigLIP, EVA-CLIP — Contrastive vs Sigmoid vs Scaling

Multi-paper review of CLIP (Radford 2021), SigLIP / SigLIP 2 (Zhai 2023/2025), and EVA-CLIP-18B (Sun 2024) — losses, batch scaling, and recipes.

Neural Tech Daily · 19 May 2026 · ~57 min

Figure 1 of arXiv:2311.16502 (MMMU): overview of the MMMU dataset showing comprehensiveness across 11.5K college-level problems, six broad disciplines, 30 subjects, heterogeneous image types, interleaved text-image questions, and expert-level perception-and-reasoning requirements.

ai-research

Vision-Language Model Benchmarks: A Multi-Paper Reference (MMMU, MathVista)

Technical reference on vision-language model benchmarks. Walks MMMU (Yue, CVPR 2024) and MathVista (Lu, ICLR 2024) — what each catches, where leaderboards mislead.

Neural Tech Daily · 19 May 2026 · ~25 min

Hugging Face 'Getting Started With Embeddings' blog post header graphic, a stylised illustration depicting text passages being encoded into vector representations for semantic similarity search

ai-research

What are embeddings? Vector representations of text, images, and code, in 2026

An embedding is a learned vector that places similar items close together in a high-dimensional space. The 2026 landscape of models, dimensions, and pitfalls.

Neural Tech Daily · 19 May 2026 · ~11 min

Hugging Face Transformers quantization overview page listing AWQ, GPTQ, bitsandbytes, and GGUF as supported backends

ai-research

What Is Quantization? Running LLMs on Commodity Hardware (GGUF, AWQ, GPTQ, INT4/INT8)

A plain-English explainer on LLM weight quantization: what 4-bit and 8-bit actually mean, how GGUF, AWQ, and GPTQ differ, and which format fits which hardware.

Neural Tech Daily · 19 May 2026 · ~9 min

Close-up photograph of a matrix-style code background — ambient stock framing for an article on LLM security and prompt injection

ai-research

What is prompt injection? The LLM security problem nobody has solved (2026 explainer)

Prompt injection is the unsolved LLM vulnerability where instructions and data share one input channel. Direct vs indirect, OWASP LLM01, and why mitigations are partial.

Neural Tech Daily · 19 May 2026 · ~10 min

Composed hero card naming the three reviewed architectures — xLSTM, RWKV-7, and RetNet — over a neutral canvas with the multi-paper review framing

ai-research

xLSTM, RWKV-7, and RetNet — a multi-paper review of linear-time transformer alternatives

Technical walkthrough of three non-attention sequence architectures: xLSTM's exponential gating, RWKV-7's generalized delta rule, and RetNet's three-form retention.

Neural Tech Daily · 19 May 2026 · ~71 min

Figure 2 of arXiv:2510.05592: AgentFlow architecture diagram with planner, executor, verifier, generator modules around an evolving memory and a five-tool toolset

ai-research

AgentFlow (ICLR 2026 Oral): a trainable agent framework, reviewed in plain English

AgentFlow (arXiv:2510.05592) trains a planner with on-policy RL across a four-module agent. We summarise what it claims, where the evidence holds, and where it doesn't.

Neural Tech Daily · 9 May 2026 · ~22 min

Schematic showing how RLVR re-weights base-model reasoning paths versus the CoT-Pass@K metric that requires both correct answer and correct chain of thought.

ai-research

RLVR at ICLR 2026: what Wen et al. prove about LLM reasoning, and what the abstract glosses over

Wen et al.'s ICLR 2026 paper says RLVR genuinely improves reasoning under a corrected metric. The headline holds for math; the limits matter.

Neural Tech Daily · 8 May 2026 · ~34 min

Figure 1 of arXiv:2604.28139 (Claw-Eval-Live): overview of the benchmark design showing how the refreshable ClawHub Top-500 signal pool feeds task construction, which feeds the hybrid deterministic-plus-LLM-judge evaluation loop across the two domains (controlled business services and local workspace repair)

ai-research

Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows — A Technical Reference

Claw-Eval-Live evaluates 13 frontier models on 105 live workflow tasks across business services and workspace repair. Top score 66.7%.

Neural Tech Daily · 4 May 2026 · ~43 min

Figure 1 of arXiv:2305.18290: schematic comparing PPO-RLHF (reward model + RL training) to DPO (single supervised classification loss); DPO directly optimises for the policy best satisfying preferences via an implicit reward model whose corresponding optimal policy can be extracted in closed form

ai-research

Direct Preference Optimization (DPO): A Technical Reference

DPO replaces the entire RLHF pipeline with a single supervised loss. Technical reference for ML teams choosing between PPO-RLHF, DPO, and 2024 successors.

Neural Tech Daily · 4 May 2026 · ~32 min

Figure 1 of arXiv:2512.24601: GPT-5 vs RLM performance on three long-context benchmarks (S-NIAH, OOLONG, OOLONG-Pairs) as input length scales from 2^13 to 2^18 tokens, showing RLM holding stronger across the scaling curve

ai-research

Recursive Language Models (RLM): A Technical Reference

Technical reference for arXiv:2512.24601: RLM architecture, mathematical contributions, algorithmic pseudocode, ablations, and reusable components.

Neural Tech Daily · 4 May 2026 · ~33 min

Figure 5 of arXiv:2604.27085: RoundPipe system overview showing the stateless GPU workers, the round-robin dispatcher, and the CPU-resident master parameters that stream to assigned GPUs on demand

ai-research

RoundPipe (arXiv:2604.27085): Training Qwen3-235B on 8x RTX 4090, Technical Reference

RoundPipe (arXiv:2604.27085): stateless pipeline parallelism with CPU-offloaded master weights. 1.48-2.16x speedup on 1.7B-32B models; 235B feasibility on 8x RTX 4090.

Neural Tech Daily · 4 May 2026 · ~46 min