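A Llama-2-class LLM trained for $100,000: an all-Chinese team builds a new MoE, drawing attention from Jia Yangqing and Stable Diffusion's former CEO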

QbitAI (量子位), 2024-04-05

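Train a Llama-2-class large model for "just" $100,000. A smaller MoE model with no loss in performance has arrived: it is called JetMoE, and it comes from research institutions including MIT and Princeton. Its performance comfortably exceeds that of Llama-2 at a comparable scale. (Image caption: Jia Yangqing's repost.) Keep in mind that Llama-2 took investment on the order of billions of dollars. JetMoE is fully open-source from release and friendly to academia: it uses only public datasets and open-source code, and it can be fine-tuned on consumer-grade GPUs. It has to be said that building a large model is turning out to be far cheaper than people assumed. P.S. ...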


Emad, the former boss of Stable Diffusion, also gave it a like.

Llama-2 performance for $100,000

JetMoE is inspired by the sparsely activated architecture of ModuleFormer. (ModuleFormer is a modular architecture based on sparse mixture of experts (SMoE) that improves the efficiency and flexibility of large models; it was proposed in June of last year.)

MoE is also used in its attention layers: the 8-billion-parameter JetMoE has 24 blocks, each containing two MoE layers, a mixture of attention heads (MoA) and a mixture of MLP experts (MoE). Each MoA and MoE layer has 8 experts, of which 2 are activated for each input token.

JetMoE-8B was trained on 1.25T tokens from public datasets, with a learning rate of 5.0 × 10⁻⁴ and a global batch size of 4M tokens.

The training recipe follows the approach of MiniCPM (from ModelBest (面壁智能), whose 2B model can catch up with and overtake Mistral-7B) and has two stages:

Stage 1 uses a constant learning rate with linear warmup and trains on 1 trillion tokens from large-scale open-source pretraining datasets, including RefinedWeb, Pile, GitHub data, and others.

Stage 2 uses exponential learning-rate decay and trains on 250 billion tokens drawn from the stage-1 datasets and from extra-high-quality open-source datasets.

In the end, the team used a cluster of 96 H100 GPUs and spent 2 weeks and about $80,000 to complete JetMoE-8B. More technical details will be disclosed in a technical report to be released soon.

At inference time, JetMoE-8B has only 2.2 billion active parameters, so the compute cost drops substantially, and it still performs well.

In the benchmark comparison, JetMoE-8B achieves the best result on 5 of 8 evaluation benchmarks (including the Open LLM Leaderboard), surpassing LLaMA-13B, LLaMA2-7B, and DeepseekMoE-16B. On MT-Bench it scores 6.681, also ahead of 13-billion-parameter models such as LLaMA2 and Vicuna.

About the authors

JetMoE has four authors:

Yikang Shen: researcher at the MIT-IBM Watson Lab, working on NLP. He received his bachelor's and master's degrees from Beihang University and did his PhD at Mila, the research institute founded by Yoshua Bengio.

Gavin Guo (国振): PhD student at MIT, working on data-efficient machine learning for 3D imaging. He completed his undergraduate degree at UC Berkeley and joined the MIT-IBM Watson Lab as a student researcher last summer, with Yikang Shen among his mentors.

Tianle Cai (蔡天乐): PhD student at Princeton. He completed his undergraduate degree in applied mathematics and computer science at Peking University, and is also a part-time researcher at Together.ai, working with Tri Dao.

Zengyi Qin: PhD student at MIT who is also running a startup as head of AI R&D at MyShell. The company recently raised $11 million, and its investors include an author of the Transformer.

Original article: https://tech.ifeng.com/c/8YX2WfAzoip
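As a supplementary illustration, the two-stage training recipe reported in the article (linear warmup into a constant learning rate for 1 trillion tokens, then exponential decay over 250 billion tokens) can be sketched as a step-to-learning-rate function. This is a minimal sketch, not the team's actual code. The peak learning rate, token counts, and batch size come from the article; reading "4M tokens" as exactly 4,000,000, the warmup length `WARMUP_STEPS`, and the final decay ratio `FINAL_LR_RATIO` are hypothetical values chosen only for illustration.

```python
import math

PEAK_LR = 5.0e-4                      # learning rate stated in the article
BATCH_TOKENS = 4_000_000              # "4M tokens", read as 4e6 (assumption)
STAGE1_TOKENS = 1_000_000_000_000     # stage 1: 1 trillion tokens
STAGE2_TOKENS = 250_000_000_000       # stage 2: 250 billion tokens

WARMUP_STEPS = 2_000                  # hypothetical: not given in the article
FINAL_LR_RATIO = 0.1                  # hypothetical: LR at end of stage 2 / peak

# Number of optimizer steps per stage under the assumptions above.
STAGE1_STEPS = STAGE1_TOKENS // BATCH_TOKENS
STAGE2_STEPS = STAGE2_TOKENS // BATCH_TOKENS


def learning_rate(step: int) -> float:
    """Return the learning rate for a zero-based optimizer step."""
    if step < WARMUP_STEPS:
        # Stage 1, warmup: ramp linearly up to the peak learning rate.
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    if step < STAGE1_STEPS:
        # Stage 1, plateau: hold the learning rate constant.
        return PEAK_LR
    # Stage 2: exponential decay from PEAK_LR to PEAK_LR * FINAL_LR_RATIO.
    progress = min(step - STAGE1_STEPS, STAGE2_STEPS) / STAGE2_STEPS
    return PEAK_LR * FINAL_LR_RATIO ** progress


if __name__ == "__main__":
    for s in (0, WARMUP_STEPS, STAGE1_STEPS, STAGE1_STEPS + STAGE2_STEPS):
        print(s, learning_rate(s))
```

Under these assumptions stage 1 spans 250,000 steps and stage 2 spans 62,500 steps, which together cover the 1.25T training tokens the article reports. Changing `WARMUP_STEPS` or `FINAL_LR_RATIO` alters only the shape of the ramp and the decay, not the stage boundaries.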