DeepSeek has released a new paper, with co-founder Liang Wenfeng credited as a contributor, detailing how its latest large language model DeepSeek-V3 achieves efficient training and inference using only 2,048 H800 GPUs – significantly fewer than the tens of thousands typically required. The team attributes this efficiency to four key innovations: memory optimization through multi-head latent attention (MLA), computational savings via a Mixture-of-Experts (MoE) design with FP8 precision, communication improvements using a multi-plane network topology, and faster inference through multi-token prediction (MTP). With MLA, KV cache memory usage is cut to just 70KB per token, as little as one-seventh of that in competing models. The MoE architecture activates only 37 billion of the model's 671 billion parameters per forward pass, reducing training costs by 90% compared to dense models. FP8 training further halves compute and memory usage with minimal accuracy tradeoff. Beyond the model, the paper also outlines five future directions for AI hardware design, advocating tighter integration between software and hardware to address memory, compute, and networking bottlenecks. [36Kr, in Chinese]
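To make the sparse-activation point concrete, here is a minimal toy sketch (not DeepSeek's actual code) of how an MoE layer routes each token to only its top-k experts, so only a fraction of the layer's parameters are exercised per forward pass. The expert count, hidden sizes, and top-k values below are illustrative assumptions, not DeepSeek-V3's configuration.

```python
# Toy top-k MoE routing: only the selected experts' weights participate
# in the computation for a given token, which is why "active" parameters
# are far fewer than total parameters.
import numpy as np

rng = np.random.default_rng(0)

d_model, d_ff = 64, 256          # toy hidden sizes (assumed, not DeepSeek-V3's)
num_experts, top_k = 8, 2        # route each token to 2 of 8 experts (assumed)

# Each expert is a small feed-forward block: W_in (d_model x d_ff), W_out (d_ff x d_model)
experts = [
    (rng.standard_normal((d_model, d_ff)) * 0.02,
     rng.standard_normal((d_ff, d_model)) * 0.02)
    for _ in range(num_experts)
]
router_w = rng.standard_normal((d_model, num_experts)) * 0.02

def moe_forward(x):
    """Route one token vector x to its top-k experts and mix their outputs."""
    logits = x @ router_w                      # one routing score per expert
    top = np.argsort(logits)[-top_k:]          # indices of the k highest-scoring experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                   # softmax over the selected experts only
    out = np.zeros_like(x)
    for w, idx in zip(weights, top):
        w_in, w_out = experts[idx]
        out += w * (np.maximum(x @ w_in, 0.0) @ w_out)   # weighted ReLU feed-forward
    return out

token = rng.standard_normal(d_model)
y = moe_forward(token)

# Parameter accounting: only top_k of num_experts experts run per token,
# so roughly top_k / num_experts of the expert parameters are active.
per_expert = d_model * d_ff * 2
print(f"total expert params: {num_experts * per_expert:,}")
print(f"active per token:    {top_k * per_expert:,} "
      f"(~{top_k / num_experts:.0%} of expert parameters)")
```

In the same spirit, DeepSeek-V3's reported 37B active out of 671B total parameters means each token touches only about 5–6% of the model's weights, which is where the claimed training-cost reduction relative to an equally large dense model comes from.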