2026-02-11 Signals
W66 GLM-5 and Chinese frontier model releases for agentic tasks

GLM-5 scored 50 on the Intelligence Index, making it the new open-weights leader with 66K+ downloads. It launched alongside MiniMax M2.5, both targeting long-horizon agentic engineering; Z.ai has publicly acknowledged being GPU-starved.

convergence 15/35
implementation 20/30
engagement 15/15
significance 16/20

GLM-5, at 66K downloads and an Intelligence Index score of 50, sets a new open-weights bar; the next bottleneck is GPU supply for inference at scale, as Z.ai publicly acknowledged being GPU-starved.

9 sources
2026-02-11 Tracking
W59 Small general-purpose models under 5B parameters

Nanbeige4.1-3B explores whether a 3B model can reason, align, and act as a general model; MiniCPM-SALA (426 likes, 2569 downloads) targets similar small-model general capability — both push the floor of useful model size.

convergence 15/35
implementation 20/30
engagement 15/15
significance 9/20

Nanbeige4.1-3B and MiniCPM-SALA both target general capability at 3B scale — next bottleneck is whether agentic tool-use and multi-step reasoning hold up at this size.

2 sources
W56 Flux 2 Klein 9B LoRA trainability and image editing

Multiple users report Flux 2 Klein 9B outperforming Qwen Image on editing consistency and LoRA trainability at 4 inference steps, with successful style LoRAs trained at rank 32 for 7,000 steps on RunPod.

convergence 10/35
implementation 25/30
engagement 15/15
significance 6/20
5 sources
W50 3D ControlNet conditioning from animated 3D assets

ComfyUI custom node renders pose, depth, normal, and canny batches from FBX/GLB animation files (Mixamo) in an interactive 3D viewport for ControlNet conditioning.

convergence 10/35
implementation 25/30
engagement 9/15
significance 6/20
1 source
W47 LLM agents for complex software engineering benchmarks

GameDevBench evaluates multimodal coding agents on game development, FeatureBench benchmarks agentic coding for complex feature development, and CodeRLM uses tree-sitter indexing to improve how LLM agents navigate codebases.

convergence 15/35
implementation 20/30
engagement 3/15
significance 9/20

Multiple benchmarks now test agents on multi-file feature-level coding rather than single-function tasks — next bottleneck is reliable multi-step planning across large codebases.

3 sources
W42 Reinforcement learning for visual reasoning in MLLMs

MetaphorStar applies visual RL to image metaphor understanding, while Reinforced Curriculum Pre-Alignment uses an RL-style curriculum for domain-adaptive VLMs; both use reinforcement signals to improve visual reasoning beyond supervised fine-tuning.

convergence 15/35
implementation 20/30
engagement 0/15
significance 7/20
2 sources
W42 LLM safety evaluation for harmful persuasion

Six-month follow-up on the Attempt-to-Persuade Eval shows GPT and Claude improved on harmful persuasion resistance while Gemini regressed.

convergence 10/35
implementation 25/30
engagement 0/15
significance 7/20
1 source
W42 Instruct fine-tuning behavioral fingerprints in hidden states

Probing 6 open-weight LLMs (7B-9B) by projecting hidden states onto contrastive axes reveals that instruct fine-tuning creates measurable behavioral constraints, detectable without prompting (see the sketch after this card).

convergence 10/35
implementation 25/30
engagement 0/15
significance 7/20
1 source
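
The mechanism behind this card is compact enough to sketch: build an axis as the difference of mean hidden states over two contrastive prompt sets, then project new hidden states onto it. Below is a minimal sketch of that idea; the model ID, prompt sets, and layer choice are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of contrastive-axis probing. The model ID, prompt
# sets, and layer choice are illustrative, not taken from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-7B-Instruct"  # any 7B-9B open-weight instruct model
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

def mean_hidden(texts, layer=-1):
    """Mean hidden state at `layer`, taken at the final token of each text."""
    vecs = []
    for t in texts:
        ids = tok(t, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        vecs.append(out.hidden_states[layer][0, -1])
    return torch.stack(vecs).mean(dim=0)

# Contrastive axis between two behavioral poles (hypothetical examples).
hedged = ["As an AI, I cannot be certain, but", "I might be wrong, however"]
direct = ["The answer is 42.", "Paris is the capital of France."]
axis = mean_hidden(hedged) - mean_hidden(direct)
axis = axis / axis.norm()

# Projecting a new input's hidden state onto the axis yields a scalar
# behavioral score read directly from the model's internals.
score = mean_hidden(["Tell me about yourself."]) @ axis
print(float(score))
```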
W38 VLA models with world models for robot manipulation

Three papers independently address VLA model brittleness in contact-rich manipulation: RISE adds a compositional world model for self-improvement, ABot-M0 uses action manifold learning across hardware, and MolmoSpaces provides a large-scale ecosystem for navigation/manipulation.

convergence 7/35
implementation 20/30
engagement 1/15
significance 10/20

Three concurrent papers attack VLA fragility in dynamic manipulation via world models and action manifolds — next bottleneck is sim-to-real transfer fidelity for contact-rich tasks.

3 sources
W32 Discrete audio tokenizers for LLM-native audio processing

MOSS-Audio-Tokenizer scales discrete audio tokenization for future audio foundation models (47 likes), while Voxtral Realtime achieves sub-second latency streaming ASR matching offline quality — both address the bottleneck of integrating audio natively into LLM architectures.

convergence 0/35
implementation 20/30
engagement 2/15
significance 10/20

MOSS-Audio-Tokenizer targets scaling tokenizers beyond pretrained codec limitations while Voxtral hits sub-second streaming latency — next bottleneck is joint speech understanding and generation in a single LLM pass.

2 sources
FAQ
What is HiddenState?

A daily briefing that scrapes 9 source types across the ML ecosystem, filters out the noise, and clusters what remains by technical mechanism, not topic.

Most ML news is recycled press releases. HiddenState watches for convergence: when multiple independent sources start working on the same bottleneck, something real is happening. Everything else is noise.

The top 10 mechanisms are ranked by W-index and split into Signals (strongest evidence) and Tracking (early signals worth watching) at the largest natural score gap.

What is W-index?

A 0–100 score measuring signal strength. Higher = more evidence that something real is happening.

Component | Max | What it measures
Convergence | 35 | How many independent sources report this. Single source = 0, unless it links to working code, which counts as a second data point.
Implementation | 30 | Evidence of working code. GitHub repo = 30. HuggingFace model = 20. Paper only = 0.
Engagement | 15 | Upvotes, stars, points. Capped low so hype can't inflate the score.
Significance | 20 | Clustering model's assessment of technical importance.

W60+ strong — W25-59 moderate — W<25 early/weak

Code beats vaporware. A shipped GitHub project with 3 sources will always outscore a hyped paper with 500 Reddit upvotes but no implementation.
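
For concreteness, here is a minimal sketch of the arithmetic, assuming the four components simply add up to the 0-100 score; the caps come from the table above, while the dataclass and helper are illustrative, not the production code.

```python
# Minimal sketch of W-index scoring, assuming the four component
# scores add directly to the 0-100 total.
from dataclasses import dataclass

@dataclass
class Cluster:
    convergence: int     # 0-35: independent sources reporting the mechanism
    implementation: int  # 0-30: GitHub repo = 30, HF model = 20, paper only = 0
    engagement: int      # 0-15: upvotes/stars/points, capped low
    significance: int    # 0-20: clustering model's importance assessment

def w_index(c: Cluster) -> int:
    return c.convergence + c.implementation + c.engagement + c.significance

# Today's GLM-5 cluster from the briefing above:
glm5 = Cluster(convergence=15, implementation=20, engagement=15, significance=16)
assert w_index(glm5) == 66  # W66
```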

Who are our sources?
Source | What we pull
arXiv | Preprints from cs.LG, cs.CL, cs.AI, cs.CV, stat.ML: the raw research firehose
Reddit | r/MachineLearning, r/LocalLLaMA, r/StableDiffusion, r/MLOps: practitioner signal
GitHub | Trending ML repos with 50+ stars: implementation evidence
Hacker News | ML-related posts with 15+ points: cross-domain attention
HuggingFace | Trending models + watched quantizers (bartowski, MaziyarPanahi, LoneStriker)
OpenReview | TMLR + NeurIPS workshops: peer-reviewed & bleeding-edge
Twitter | 9 curated accounts (akhaliq, karpathy, srush, fchollet, etc.)
Papers w/ Code | Trending papers with implementations: community-vetted research
RSS Blogs | Lilian Weng, Chip Huyen, Eugene Yan, Simon Willison, Interconnects, Latent Space, Netflix Tech + PyTorch & HF blogs

Items that appear across multiple sources score higher. Single-source items start at zero convergence.

Signals vs Tracking — what's the difference?

Both sections show real signals. Up to 10 mechanisms are sorted by W-index and split at the largest natural score gap — Signals are above the gap, Tracking below. The split point changes daily based on the data; tied scores always land on the same side.
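A minimal sketch of that split logic, reconstructed from the description above rather than the production code:

```python
# Sort W-indices descending, find the largest gap between adjacent
# *distinct* scores, and cut there, so tied scores are never separated.
def split_signals_tracking(scores: list[int]) -> tuple[list[int], list[int]]:
    s = sorted(scores, reverse=True)
    gaps = [(s[i] - s[i + 1], i + 1)    # (gap size, cut position)
            for i in range(len(s) - 1)
            if s[i] != s[i + 1]]        # never cut between ties
    if not gaps:
        return s, []                    # all tied: everything is a Signal
    _, cut = max(gaps)
    return s[:cut], s[cut:]

# Today's ten W-indices: the largest gap (7 points) sits right after
# W66, which is why only GLM-5 made Signals.
signals, tracking = split_signals_tracking([66, 59, 56, 50, 47, 42, 42, 42, 38, 32])
assert signals == [66]
```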

Tracking does not mean bad, unimportant, or wrong. It usually means a signal has fewer independent sources so far, or lacks public code — things that can change overnight. Some of the most consequential developments start in Tracking before the rest of the ecosystem catches up.

Likewise, a high W-index does not mean research is good, correct, or worth adopting. W-index measures visibility and convergence across sources, not quality. A flawed paper that gets widely discussed will score higher than a brilliant one nobody has noticed yet.

HiddenState is a detection tool, not an endorsement. It tells you where activity is clustering — what you do with that is up to you. Nothing here should be read as a recommendation, ranking of merit, or judgement on any researcher's work.

What does noise rejection mean?

Of all items collected, only 10 make it to the final briefing. The rejection rate is the percentage that got cut.

Filtering happens in three stages:

Stage | What gets cut
Pre-filter | Short abstracts, low-engagement posts, duplicates across sources
Clustering | Items that don't converge on a shared mechanism with other items
Ranking | Clusters below the top 10 by W-index

A 99% rejection rate means 99 out of 100 items were noise. That's the point — most ML news doesn't matter on any given day.
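
As a worked example of that arithmetic, with a hypothetical daily intake:

```python
# Rejection-rate arithmetic: 10 survivors out of 1,000 collected
# items gives a 99% rejection rate. The intake count is hypothetical.
collected = 1000
survivors = 10   # fixed briefing size
rejection_rate = 1 - survivors / collected
print(f"{rejection_rate:.0%}")  # 99%
```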

Privacy
Data collection

None. HiddenState collects no personal data, no email addresses, no IP logs, no usage analytics, and no telemetry of any kind.

Cookies & tracking

Zero cookies. No first-party, no third-party, no session cookies, no tracking pixels.

The only client-side storage is localStorage for your theme preference (dark/light). This never leaves your browser and contains no identifying information.

External requests

Pages load zero external scripts, fonts, stylesheets, or analytics. Everything is self-contained. The only outbound link is to Ko-fi if you choose to click it.

Data sources

HiddenState monitors 9 distinct public data streams (ArXiv, GitHub, Reddit, etc.) to detect cross-platform convergence. We do not use private user data; we only analyze what the community has already published.