LLM
Strong on Delivery, Stronger on Polish: MiniMax M2.1 vs. GLM 4.7, a Showdown Between China's Top Open-Source Models丨302.AI Benchmark Lab
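On December 23, MiniMax officially released MiniMax M2.1, its new-generation flagship Coding & Agent model. Unlike many model launches that fixate on listing general-knowledge scores, M2.1 puts the entire spotlight on two keywords, "coding" and "agents," and the official positioning is blunt: built for complex real-world tasks. Clearly this is not just a routine version iteration; it looks more like MiniMax…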
302.AI Client: Zero Configuration, Any Model Supported, the Most Beginner-Friendly Vibe Coding Tool | New Release
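In 2025, a year of rapid growth for the AI industry, "Vibe Coding" is certainly among the hottest keywords. Vibe Coding means you describe what you need in natural language and the AI generates the code for you. This shift tears down the technical barrier to programming, letting ordinary people skip arcane programming languages and build their own apps. Tools built for Vibe Coding keep appearing as well; among Cursor, L…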
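Hands-On with Zhipu's Year-End Finale GLM-4.7: From Topping Benchmarks to Delivering Tasks, Firmly First Among Open-Source Models丨302.AI Benchmark Lab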
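As 2025 draws to a close, competition among large models has not slowed; instead it has brought a wave of major updates. In the early hours of today, Zhipu made a surprise release of its new-generation flagship model, GLM-4.7, whose series of SOTA results makes it a fitting "finale" for this year's open-source race. The update focuses on coding ability, long-horizon task planning, and agent collaboration. It not only sweeps the open-source leaderboards on multiple mainstream international benchmarks but also centers on task delivery, aiming to become, in developers' hands, a truly efficient, reliable…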
Google's "Nuke for Everyone": Hands-On with Gemini 3 Flash, Where Faster, Stronger, and Cheaper Go Together丨302.AI Benchmark Lab
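Late on the night of December 18, Google dropped a "bombshell" without warning: the release of Gemini 3 Flash. There was little build-up, but the combination of performance and cost it shows is enough to make the whole AI field re-examine the current competitive landscape. In short, Gemini 3 Flash does something seemingly contradictory: positioned as a "lightweight" model at very low cost, it delivers top-tier performance that approaches, and in places exceeds, that of flagship models. Performance: breaking the "lightweight means…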
OpenAI's Ten-Year Report Card, GPT-5.2 Hands-On: With the Myth of Disruption Fading, Where Does the Mission Go Next?丨302.AI Benchmark Lab
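On its tenth anniversary, OpenAI made a surprise release of the new-generation GPT-5.2 series on December 12, only one month after the previous GPT-5.1. In that interval, with Gemini 3 and Claude Opus 4.5 taking turns making a splash, industry competition has reached a stalemate, and the market shock that once made every release feel disruptive is showing diminishing returns. This time OpenAI did not simply pile on parameters; instead, for the first time, it brought out a three-version lineup with precise…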
GLM-4.6V Hands-On: When a Vision Model Learns to "Act," What Still Separates It from "Top-Tier"?丨302.AI Benchmark Lab
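On December 8, Zhipu AI officially open-sourced its new-generation multimodal model series GLM-4.6V, comprising a 106B version for high-performance scenarios and a lightweight 9B Flash version for local deployment. The upgrade not only pushes the training context window to 128K tokens but also makes a key architectural change: tool calling (Function Call) becomes a native capability of the vision model. This means the model no longer stops at recognizing images but can…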
Hands-On with the Open-Source Benchmark DeepSeek-V3.2: Seeking a New Balance Between "Efficiency" and "Depth"丨302.AI Benchmark Lab
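Early in December, DeepSeek once again released, without notice, the much-anticipated V3.2 series, DeepSeek-V3.2 and DeepSeek-V3.2-Speciale, only two months after releasing DeepSeek-V3.2-Exp at the end of September. The update is not just the result of technical iteration; it reads more like a deliberate probe of the ceiling of large-model capability. The two models share the same origin but have a clear division of labor: one aims at efficient, practical…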
Price Cut by 66% with Performance Still at the Ceiling? Claude Opus 4.5: Who Is Panicking over This “Price-Cut Strike”?丨302.AI Benchmark Lab
On November 25, while the spotlight of the large-model race was still moving between GPT-5.1 and Gemini 3 Pro, Anthropic brought back its flagship Claude Opus 4.5 in force, claiming it is currently the world's most capable model for programming, agents, and computer use, with programming ability surpassing that of human experts. The Claude series' most eye-catching trump card has always been its dominant performance in programming. In the authoritative real-world…
Done Competing on Parameters, Now Competing on "Personality"? Grok 4.1 Hands-On: EQ Maxed Out, Programming Greatly Improved丨302.AI Benchmark Lab
Last week, while the entire AI community was focused on the latest iterations from the two giants Google and OpenAI, xAI once again used its signature surprise-launch approach, opening the Grok 4.1 series to all users for free in the early hours of November 18. This means the Grok 4 series has completed a key upgrade in just four months, and the upgrade clearly signals xAI's distinctive competitive strategy: the next frontier for large models may no longer be cold computing power and parameters, but…
Doubao-Seed-Code Hands-On: Competing on Price and Benchmark Scores, but Falling Short on Real Code?丨302.AI Benchmark Lab
The AI programming race in the second half of this year has been a fierce contest against the clock. First Kimi-K2-0905 pushed firmly into the top tier, then Zhipu's GLM-4.5 challenged the reigning champion Claude Sonnet 4.5. MiniMax also launched its latest model, MiniMax-M2, which topped the open-source leaderboard on its merits. These models, appearing one after another like stones thrown into a lake, without exception emphasized significant gains in programming ability at release. This trend is clear…