Dolly

Dolly: The Open-Source Model That Puts an AI Brain Within Everyone’s Reach

Artificial intelligence (AI) is changing daily life at an unprecedented pace, from voice assistants on smartphones to self-driving cars. Within AI, large language models (LLMs) have drawn the most attention: they can understand and generate human language and carry out a wide range of complex tasks. “Dolly” usually refers to the family of large language models released by Databricks, and in particular to the widely noted Dolly 2.0. Its openness and ease of use set it apart, giving far more people the chance to work with and build on this technology.

What Is Dolly? Where Did It Come From?

Imagine a very bright student who has read every book in the library. This is the base model, in Dolly’s case one of EleutherAI’s Pythia models. The student knows a great deal but does not yet know how to complete an assignment exactly the way you ask.

Dolly 2.0 is that student after some “special tutoring”. It is a 12-billion-parameter large language model developed by Databricks, a data and AI company. Unlike the proprietary models that big tech companies keep in-house, Dolly is open, and its defining capability is that it has been trained to understand and follow human instructions: give it an assignment, and it not only grasps what you mean but also works through the task as instructed.

This “special tutoring” is called instruction tuning. During March and April 2023, more than 5,000 Databricks employees wrote by hand a high-quality dataset of about 15,000 instruction-response pairs, named databricks-dolly-15k. The records cover task types such as brainstorming, classification, question answering, content generation, information extraction, and summarization. These human-written “assignments” are what turned Dolly from a well-read student with no practical experience into an assistant that can put its knowledge to work.
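To make the idea of an instruction-response record concrete, here is a minimal sketch of how one such record could be turned into a single training string. The field names (`instruction`, `context`, `response`, `category`) mirror those of the published databricks-dolly-15k dataset; the record contents, the `format_example` helper, and the exact prompt layout are assumptions made up for illustration and are not necessarily what Databricks used.

```python
# Illustrative record in the shape used by databricks-dolly-15k.
# The text values below are invented for this example.
record = {
    "instruction": "Summarize the paragraph in one sentence.",
    "context": "Dolly 2.0 is a 12-billion-parameter model fine-tuned on human-written data.",
    "response": "Dolly 2.0 is a 12B model tuned on human-written instructions.",
    "category": "summarization",
}

# Assumed prompt layout (hypothetical; shown only to illustrate the idea).
INTRO = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request."
)


def format_example(rec: dict) -> str:
    """Turn one instruction-response record into a single training string."""
    parts = [INTRO, "", "### Instruction:", rec["instruction"], ""]
    # The context section is optional: many records have no reference text.
    if rec.get("context"):
        parts += ["### Input:", rec["context"], ""]
    parts += ["### Response:", rec["response"]]
    return "\n".join(parts)


training_text = format_example(record)
print(training_text)
```

During instruction tuning, the model sees many strings of this kind and learns to continue the text after the response marker with an answer that fits the instruction.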

What Sets Dolly Apart: The Open-Source Spirit

Many of the most powerful models in AI are closed source, like a top chef’s secret recipe that is used only in the chef’s own restaurant. Using them usually means paying API fees, which can be expensive at scale, and your data may be used to train the provider’s models, which raises privacy concerns.

Dolly 2.0 takes the opposite approach. Databricks open-sourced the model weights, the complete training code, and the human-generated dataset, and it permits commercial use. In terms of the analogy, the chef has published the recipe (the model weights), explained step by step how to cook it (the training code), and handed out the high-quality ingredients (the dataset) for free.

This openness matters for three reasons:

  • Lower barrier to entry: Small and medium-sized businesses and individual developers can own and customize a large language model without a huge R&D budget.
  • Data sovereignty: Companies can run Dolly on their own infrastructure, so sensitive data never has to be shared with a third-party service.
  • Faster innovation: Open code and an open dataset let developers and researchers anywhere modify, extend, and optimize the model, advancing AI collectively.

What Can Dolly Do?

After instruction tuning, Dolly behaves like a versatile assistant that can understand and carry out many kinds of natural-language instructions. Its capabilities include, but are not limited to:

  • Summarization: Condense a long article into a few key points.
  • Question answering: Answer your questions by drawing on what it learned during training.
  • Brainstorming: Suggest ideas on a given topic.
  • Content generation: Write blog posts, poems, emails, and similar text.
  • Information extraction: Identify and pull out specific information from a text.
  • Classification: Determine a text’s sentiment, topic category, and so on.

For example, you can ask it, “Please summarize recent progress on open-source AI models,” or tell it, “Help me write a thank-you letter to my colleague.” Dolly 2.0 will try to work out your intent and generate a suitable response.
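As a rough sketch of what sending such instructions looks like in code, the snippet below loads Dolly 2.0 through the Hugging Face `transformers` library, following the usage shown on the model’s Hugging Face page. Treat the exact arguments as assumptions that may vary across library versions; `device_map="auto"` also requires the `accelerate` package. The helper names `load_dolly` and `run_instructions` are hypothetical, and loading the 12B model downloads tens of gigabytes of weights.

```python
MODEL_ID = "databricks/dolly-v2-12b"  # Hugging Face model ID for Dolly 2.0 (12B)


def load_dolly(model_id: str = MODEL_ID):
    """Load Dolly as a text-generation callable (large download, needs a big GPU)."""
    # Imported lazily so the rest of this sketch runs without these packages.
    import torch
    from transformers import pipeline

    return pipeline(
        model=model_id,
        torch_dtype=torch.bfloat16,  # half-precision weights to save memory
        trust_remote_code=True,      # the model repo ships its own pipeline code
        device_map="auto",           # place layers on the available devices
    )


def run_instructions(instructions, generate_text):
    """Send each instruction to the model and collect the generated replies."""
    replies = {}
    for instruction in instructions:
        output = generate_text(instruction)
        # The pipeline is assumed to return a list of dicts with "generated_text".
        replies[instruction] = output[0]["generated_text"]
    return replies
```

With suitable hardware, `run_instructions(["Help me write a thank-you letter to my colleague."], load_dolly())` would return a dictionary that maps the instruction to the generated letter. Passing the generator in as an argument keeps the loop easy to try out with a stand-in function before committing to the full model download.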

Why Is Dolly So Important?

Dolly 2.0 marks a new stage for large language models: the democratization of AI. Previously, developing and deploying an LLM was so expensive and technically demanding that only a handful of tech giants could do it. That concentrated AI development in a few hands and limited innovation.

Dolly breaks down this barrier by offering an option that is both genuinely open source and licensed for commercial use. It gives more companies and individuals:

  • Customization: They can fine-tune Dolly further on their own business data or domain knowledge so that it performs better on their specific tasks.
  • Cost-effectiveness: Compared with models that are available only through a paid API, Dolly can be the more economical choice for organizations that need to control costs.
  • Autonomy: They control the model fully and are no longer subject to an external provider’s policy or price changes.

In the past, only large companies could afford their own supercomputers and the teams to run them. Dolly is like a high-quality, affordable “home supercomputer” kit: small companies and individual developers can now set up their own AI workstation on their own hardware or in the cloud.

Dolly’s Limitations and Outlook

For all its significance, Dolly 2.0 has flaws. Databricks itself states that Dolly 2.0 is not a “state-of-the-art” model, and on some benchmarks it may fall short of commercial models that have more parameters and more advanced architectures. Its fine-tuning dataset, though high in quality, is relatively small, and it inherits the limitations of its base model, so it can produce inaccurate or biased content.

Dolly’s value lies in the high-quality starting point and open ecosystem it provides. It shows that a relatively small model (compared with models that have hundreds of billions of parameters) can follow instructions surprisingly well when it is fine-tuned on high-quality instruction data. It also sets an example for the open-source AI community and encourages more organizations to invest in open models.
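To put “relatively small” in perspective, the following back-of-the-envelope sketch estimates how much memory the weights alone would occupy at different numeric precisions. The 12-billion figure comes from the text; the 175-billion-parameter model is a hypothetical stand-in for “hundreds of billions”, and the estimate deliberately ignores activations, caches, and other runtime overhead.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


dolly_params = 12e9    # Dolly 2.0, per the text
large_params = 175e9   # hypothetical much larger model, for contrast only

for dtype in BYTES_PER_PARAM:
    dolly_gb = weight_memory_gb(dolly_params, dtype)
    large_gb = weight_memory_gb(large_params, dtype)
    print(f"{dtype:>8}: Dolly {dolly_gb:6.0f} GB, larger model {large_gb:6.0f} GB")
```

The gap in this rough estimate suggests why a 12-billion-parameter model is within reach of a single well-equipped server, while a model more than ten times its size generally is not.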

Conclusion

As AI advances rapidly, Dolly 2.0 is more than a large language model: it stands for a spirit of openness and sharing that is speeding up both the spread of AI technology and innovation in it. Capabilities that were once out of reach are now available to many more developers and companies, who can help shape a smarter and more inclusive future.