Kimi K2 Thinking: 1T params, $4.6M, 300 tool calls

28 articles scored today

Kimi K2 Thinking: 1T MoE at $4.6M, native INT4, 300 sequential tool calls

Recode China AI · EN · Kir-News 91

Moonshot's K2 Thinking is 1T total / 32B active, trained on 15.5T tokens on H800s for roughly $4.6M, shipping natively in INT4 at about 594GB. It does "interleaved thinking", reasoning between every tool call, and runs 200 to 300 sequential tool calls autonomously, scoring 60.2% on BrowseComp (above GPT-5) and 44.9% on HLE with tools. Architecture reuses the DeepSeek-V3 skeleton but swaps the optimizer for MuonClip, which apparently got them through the full pretraining run without a single loss spike.

Open weights, agentic depth, and training economics in one package. This is the headline profile for your stack.
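The 594GB INT4 figure is easy to sanity-check with back-of-envelope arithmetic. A minimal sketch, assuming the "1T total" is roughly 1.04T parameters and that a small fraction of tensors (embeddings, norms, quantization scales) stays at 16 bits — both assumptions, not claims from the article:

```python
# Back-of-envelope check of the ~594GB INT4 checkpoint size.
# Assumptions (not from the article): ~1.04T total parameters, bulk of
# weights at 4 bits, a residual fraction kept at 16 bits.

def int4_checkpoint_gb(total_params, frac_fp16=0.0):
    """Approximate checkpoint size in GB (10^9 bytes)."""
    int4_bytes = total_params * (1 - frac_fp16) * 0.5  # 4 bits = 0.5 bytes
    fp16_bytes = total_params * frac_fp16 * 2.0        # 16 bits = 2 bytes
    return (int4_bytes + fp16_bytes) / 1e9

# Pure INT4 over 1.04T params lands near 520GB...
pure = int4_checkpoint_gb(1.04e12)
# ...so ~594GB is consistent with roughly 5% of parameters (plus scales)
# living at higher precision.
mixed = int4_checkpoint_gb(1.04e12, frac_fp16=0.05)
print(round(pure), round(mixed))  # 520 598
```

The point of the exercise: the quoted size only closes if quantization is genuinely near-universal across the weights, which is what "native INT4" implies.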

Seedance 2.0 and the Chinese AI video production stack

Recode China AI · EN · Kir-News 88

ByteDance's Seedance 2.0 reports ~90% usability per generation versus a prior industry norm near 20%, which is why studios are queuing six hours and working at 3am for off-peak capacity. A four-person food brand team produced a 90-second viral spot in five hours for ¥4,000 to 5,000; the spot went on to hit 5 billion views. The "Manju" AI short drama format now runs three pricing tiers, from ¥400/min at under 10% margin to ¥30,000/min at above 60%, with 14,634 titles launched in January 2026 alone.

Concrete unit economics for AI video production, directly mappable to your creative and marketing work.

GLM-5 and Qwen3.5: sparse attention, async RL, and a 19x decode speedup

Recode China AI · EN · Kir-News 88

GLM-5 is 744B total / 40B active, pretrained on 28.5T tokens, and currently tops Artificial Analysis as the best open-weight model. It uses DeepSeek Sparse Attention for 1.5 to 2x long-context savings and an asynchronous RL stack that decouples rollout from training, with TITO plus double-sided importance sampling handling off-policy drift. Qwen3.5-Plus (397B / 17B active) introduces a Gated DeltaNet + Gated Attention hybrid delivering 8.6 to 19x decode speedup at 32K to 256K context, 1M token support, and pricing of ¥0.8/M input, about 1/18th of Gemini 3 Pro. GLM-5 is also tuned for Huawei Ascend, Cambricon, and Kunlunxin.

Two architectures worth reading in full. Async RL and the hybrid attention scheme are the real signal.

DeepSeek V4: Ascend detour, sparsity roadmap, native multimodal

Recode China AI · EN · Kir-News 85

V4 is expected imminently, with native text/image/video and long-context coding internally reported as beating Claude and ChatGPT. The piece traces the sparsity progression: DeepSeekMoE (64+2, topk=6) to V2 (160+2, MLA) to V3 (256+1, topk=8, DeepEP). The delay stems from a failed attempt to train on Huawei Ascend, citing instability, slow interconnects, and immature tooling; training reverted to Nvidia while Huawei keeps inference. DeepSeek withheld V4 from Nvidia and AMD for pre-release optimization, handing early access to Huawei and Cambricon instead.

Rare operational account of what training on domestic Chinese silicon actually costs.

Inside DeepSeek and Moonshot: departures, valuations, chip strategy

Recode China AI · EN · Kir-News 82

V4 was expected at Chinese New Year but may slip to April; a smaller variant has been circulated for compatibility testing. Notable departures: core R1 author Guo Daya and multimodal lead Ruan Chong. Liang Wenfeng is now pursuing a formal valuation, with rivals offering 2 to 3x to poach talent. Moonshot hit $18B valuation and still runs with no KPIs and no formal titles.

Org-level texture on the two labs whose weights you actually use.

DeepSeek-V3 hardware co-design paper: 70KB KV cache per token

Synced Review · EN · Kir-News 82

A 14-page follow-up co-authored by Liang Wenfeng explains how hardware shaped V3. MLA cuts per-token KV cache to 70KB, versus 516KB for LLaMA-3.1-405B and 327KB for Qwen-2.5-72B. V3's MoE activates 37B of 671B parameters at ~250 GFLOPs/token, versus 394 GFLOPs for a dense 72B model. The paper also covers FP8 training, dual micro-batch communication overlap, and an argument that local MoE inference on AI SoC laptops can hit 20+ TPS.

Hard numbers for local deployment feasibility of frontier-class models.
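The per-token KV figures fall straight out of standard cache arithmetic. A minimal sketch, using layer and head counts from the models' published configs (those configs are my assumption here, not quoted in the article):

```python
# Reproducing the per-token KV cache figures.
# GQA per token: 2 (K and V) * layers * kv_heads * head_dim * bytes.
# Configs assumed from the models' public releases; BF16 cache = 2 bytes/elem.

def gqa_kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem=2):
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

llama_405b = gqa_kv_bytes_per_token(layers=126, kv_heads=8, head_dim=128)
qwen_72b   = gqa_kv_bytes_per_token(layers=80,  kv_heads=8, head_dim=128)

# MLA caches one compressed latent plus a decoupled RoPE key per layer,
# instead of full K/V heads: V3 uses a 512-dim latent + 64-dim RoPE key.
def mla_kv_bytes_per_token(layers, latent_dim=512, rope_dim=64, bytes_per_elem=2):
    return layers * (latent_dim + rope_dim) * bytes_per_elem

deepseek_v3 = mla_kv_bytes_per_token(layers=61)

# KB here means 10^3 bytes, matching the article's 516 / 327 / 70.
print(llama_405b // 1000, qwen_72b // 1000, deepseek_v3 // 1000)  # 516 327 70
```

The takeaway is that MLA's saving is structural, not a quantization trick: the cache stores one small latent per layer rather than eight full K/V heads.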

DeepSeek-V3.2: Sparse Attention, 10%+ RL compute, verbosity problem

Recode China AI · EN · Kir-News 82

V3.2's sole architectural change is DeepSeek Sparse Attention: a Lightning Indexer scores tokens in FP8, then full attention runs only on top-k, complementing MLA's KV compression. Post-training RL ate more than 10% of total pretraining compute, with a synthetic pipeline generating 85,000 agent tasks across 1,800 environments. V3.2-Speciale hits 96% AIME 2025 and gold at IMO/IOI/ICPC, but burns 77K tokens where Gemini uses 20K, a known GRPO reward-normalization artifact. API: $0.28/M input, $0.42/M output.

Verbosity at this scale is a real cost line. Worth tracking whether it gets fixed in V4.
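The index-then-attend pattern behind DSA is simple to sketch. The cheap dot-product scorer below is a stand-in for the Lightning Indexer (whose actual architecture and FP8 details the article doesn't spell out), so treat this as an illustration of the two-stage shape, not DeepSeek's implementation:

```python
import numpy as np

# Sketch of index-then-attend: a lightweight indexer scores all past
# tokens, then exact softmax attention runs only over the top-k.

def sparse_attend(q, keys, values, idx_q, idx_keys, k=8):
    # Stage 1: cheap scoring in a small indexer dimension.
    scores = idx_keys @ idx_q                 # one score per past token
    top = np.argsort(scores)[-k:]             # keep only the top-k tokens
    # Stage 2: full attention restricted to the selected tokens.
    sel_k, sel_v = keys[top], values[top]
    logits = sel_k @ q / np.sqrt(q.shape[0])
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ sel_v

rng = np.random.default_rng(0)
T, d, d_idx = 64, 32, 8
out = sparse_attend(rng.normal(size=d), rng.normal(size=(T, d)),
                    rng.normal(size=(T, d)), rng.normal(size=d_idx),
                    rng.normal(size=(T, d_idx)))
print(out.shape)  # (32,)
```

Per decoded token, cost drops from full attention over all T positions to T cheap indexer dot-products plus full attention over k positions, which is the long-context saving the item describes.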

Seedance 2.0 and Seed2.0: native audio-visual, CapCut distribution

Recode China AI · EN · Kir-News 82

Seedance 2.0 accepts up to 9 reference images, 3 videos, and 3 audio clips, synthesizing audio and video in a single pass with fine camera control. Direct integration into Douyin, TikTok, and CapCut/Jianying (800M users) gives ByteDance a distribution lane Sora 2 does not have. Companion model Seed2.0 is positioned as production-oriented, optimizing real usage over benchmarks. Copyright pressure forced ByteDance to disable real-human image references on Feb 9.

The distribution story matters as much as the model. This is the Chinese answer to Sora.

MiniMax MaxClaw on Alibaba Cloud: four blockers for production agents

ChinAI Newsletter · EN · Kir-News 82

MaxClaw runs on Alibaba's ACK/ACS to serve hundreds of thousands of concurrent agents. The write-up names four production blockers: security boundaries (prompt injection and privilege escalation), long-task state volatility, multi-agent scheduling, and cost/workload spikes. ACS Agent Sandboxing plus elastic Kubernetes scheduling address them. A linked note flags DingTalk and Feishu pivoting to CLI over MCP for agent-computer interaction, a sharp counterpoint to MCP's enterprise fit.

Actual operational constraints on shipping agents at scale, not benchmark theater.

Huawei HiFloat4 beats MXFP4; Anthropic's alignment agent swarm at $22/hr

Import AI · EN · Kir-News 78

Huawei's HiFloat4 4-bit training format beats OCP's MXFP4 on Ascend NPUs: ~1% relative loss versus BF16, against MXFP4's ~1.5%, tested on Qwen3-MoE-30B, Llama3-8B, and OpenPangu-1B. Separately, Anthropic ran parallel Claude Opus 4.6 agent teams on weak-to-strong supervision: 800 cumulative agent-hours, $18K compute, PGR 0.97 versus human researchers' 0.23 in seven days, at $22/agent-hour. Agents coordinated via a forum and MCP tools with no detailed scaffolding.

HiFloat4 is real domestic-silicon progress. The alignment swarm is a price point to remember.

MCP one year in: 55K tokens before content, and a case against

ChinAI Newsletter · EN · Kir-News 78

Baidu, Alibaba, and Tencent adopted MCP before Google and OpenAI. One year in: GitHub's official MCP server burns 55,000 tokens before generating any content; a scan of ~1,900 MCP servers found widespread maintenance failures and exposed credentials. The sharpest argument: deterministic protocol design is mismatched with probabilistic agents, and a sufficiently capable model will not need MCP at all.

Pairs directly with today's DingTalk/Feishu CLI-over-MCP note. The MCP consensus is not holding.

Kwai SRPO: R1-Zero-32B parity with 10x fewer RL steps

Synced Review · EN · Kir-News 78

Kuaishou's SRPO-Qwen-32B matches DeepSeek-R1-Zero-32B on AIME24 (50) and LiveCodeBench (41.6) using one-tenth the training steps of vanilla GRPO. The trick is staged training: math first to build long-chain reasoning, then code; mixing both simultaneously hurts both domains. A history-resampling step addresses GRPO's reward-variance collapse. Model and technical report are open.

10x RL efficiency from a staging trick is a cheap win to replicate.

Send a short note about your brand, your markets, and your stack — we’ll come back with whether this fits and what version of the system makes sense. First call is a scoping conversation, not a pitch.

Kiri Media AB
Kungstensgatan 27
113 57 Stockholm
Sweden
Contact
sebastian@kirimedia.co
+46 8 000 00 00