What is Dolly 2.0 by Databricks?
🐑 Licensed to allow independent developers and companies alike to use it commercially. 🐏
Hey Guys,
I’m always watching Databricks and Snowflake. It feels like everyone is building their own large language model by now, and it’s hard to tell exactly what it all means.
Databricks released Dolly 2.0, a text-generating AI model that can power apps like chatbots, text summarizers and basic search engines.
It’s the successor to the first-generation Dolly, which was released in late March. And — importantly — it’s licensed to allow independent developers and companies alike to use it commercially.
Is AI-as-a-Service taking off in 2023? It certainly feels that way, baa (mehhh)
Read it from the source:
I cannot help but feel like Dolly should be a meme.
Dolly 2.0 is a 12B parameter language model based on the EleutherAI pythia model family and fine-tuned exclusively on a new, high-quality human generated instruction following dataset, crowdsourced among Databricks employees.
🐏 Databricks Is Open Sourcing It!
So I really like that Databricks is open-sourcing the entirety of Dolly 2.0, including the training code, the dataset, and the model weights, all suitable for commercial use. This means that any organization can create, own, and customize powerful LLMs that can talk to people, without paying for API access or sharing data with third parties.
databricks-dolly-15k dataset
databricks-dolly-15k contains 15,000 high-quality human-generated prompt / response pairs specifically designed for instruction tuning large language models. Under the licensing terms for databricks-dolly-15k (Creative Commons Attribution-ShareAlike 3.0 Unported License), anyone can use, modify, or extend this dataset for any purpose, including commercial applications.
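To make "prompt / response pairs" a bit more concrete, here is a minimal sketch of how one record might be flattened into a single training string. The field names (`instruction`, `context`, `response`, `category`) follow the published dataset as I understand it; the sample values and the `### ...` template are my own assumptions for illustration, not necessarily the exact format Databricks used.

```python
# Hypothetical record: field names mirror the dataset, values are made up.
record = {
    "instruction": "Summarize the passage in one sentence.",
    "context": "Dolly 2.0 is a 12B parameter language model fine-tuned on databricks-dolly-15k.",
    "response": "Dolly 2.0 is a 12B model tuned on a human-written instruction dataset.",
    "category": "summarization",
}


def format_example(rec: dict) -> str:
    """Turn one prompt/response record into a single training string.

    The section markers are an assumed template, chosen for readability.
    """
    parts = [f"### Instruction:\n{rec['instruction']}"]
    # Many records have no supporting passage, so the context block is optional.
    if rec.get("context"):
        parts.append(f"### Context:\n{rec['context']}")
    parts.append(f"### Response:\n{rec['response']}")
    return "\n\n".join(parts)


print(format_example(record))
```

The same function handles records without a context passage by simply skipping that block.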
So why is Databricks — a firm whose bread and butter is data analytics — open sourcing a text-generating AI model? Philanthropy, says CEO Ali Ghodsi, according to their correspondence with TechCrunch.
To my knowledge (and theirs apparently) this dataset is the first open source, human-generated instruction dataset specifically designed to make large language models exhibit the magical interactivity of ChatGPT.
I think Hugging Face was the “Woodstock of A.I.” for open source, but some companies have the guts to join us at that table.
“We are in favor of more open and transparent large language models (LLMs) in the market in general because we want companies to be able to build, train and own AI-powered chatbot and other productivity apps using their own proprietary data sets,” their CEO said.
All in all I think this is interesting news since OpenAI have become such a covert little secretive camp within Microsoft. Very much centralized and definitely not open.
See Dolly 2.0 on GitHub
Databricks is scoring brownie points with me for this. They say on the GitHub: “Databricks is committed to ensuring that every organization and individual benefits from the transformative power of artificial intelligence. The Dolly model family represents our first steps along this journey, and we’re excited to share this technology with the world.”
How do you get started today?
To download the Dolly 2.0 model weights, simply visit the Databricks Hugging Face page, and visit the Dolly repo on databricks-labs to download the databricks-dolly-15k dataset. And join their webinar to discover how you can harness LLMs for your organization.
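If you would rather load the weights from code than click around, something like the sketch below should work. It assumes the Hugging Face `transformers`, `torch` and `accelerate` packages are installed, and that the model lives under the ID `databricks/dolly-v2-12b` (the ID is not given in the announcement text above, so treat it as my assumption). `load_dolly` is just a helper name I made up.

```python
# Assumed Hugging Face model ID; check the Databricks page before relying on it.
MODEL_ID = "databricks/dolly-v2-12b"


def load_dolly(model_id: str = MODEL_ID):
    """Build a text-generation pipeline for Dolly 2.0.

    Imports are done lazily so that defining this helper does not require
    the heavy dependencies. Calling it downloads tens of gigabytes of
    weights and realistically needs a large GPU.
    """
    import torch
    from transformers import pipeline

    return pipeline(
        model=model_id,
        torch_dtype=torch.bfloat16,  # halves memory compared with float32
        trust_remote_code=True,      # the repo ships its own pipeline code
        device_map="auto",           # needs `accelerate`; spreads layers over available devices
    )
```

Usage would then look like `generate_text = load_dolly()` followed by `generate_text("Explain what Dolly 2.0 is.")`. I have not pinned down the exact shape of the return value here, so print it and inspect it before building on it.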
Databricks also recommends:
Resources
Fine-Tuning Large Language Models with Hugging Face and Deepspeed
Does One Large Model Rule Them All
Self-Instruct: Aligning Language Model with Self Generated Instructions
Training Language Models to Follow Instructions with Human Feedback
So I didn’t know this, but it turns out that “most other ChatGPT-like open source models”, like Databricks’ own first-gen Dolly, make use of datasets that contain outputs from OpenAI, violating OpenAI’s terms of service.
Jeez and I thought ChatGPT was violating our privacy and rights. See this Poll.
Whether open source still matters in this wave of innovation is up for debate.
This dataset was used to fine-tune an open source text-generating model from the Pythia family, provided by the nonprofit research group EleutherAI, to follow instructions in a chatbot-like fashion, which became Dolly 2.0. (GPT-J-6B, also from EleutherAI, was the base for the first-generation Dolly.)
So is this stuff evolving fast? According to VentureBeat, the first Dolly was inspired by Alpaca, another open-source LLM released by Stanford in mid-March. Alpaca, in turn, used the weights from Meta’s LLaMA model that was released in late February. LLaMA was immediately hailed for its superior performance over models such as GPT-3, despite having 10 times fewer parameters.
Clearly the battle between closed and open LLMs is not going away any time soon.
Some believe Stable Diffusion is an example of how going open source can open a can of worms for a company: its business model, lawsuits, etc…
Dolly 2.0 is a 12 billion-parameter language model based on the open-source EleutherAI pythia model family and fine-tuned exclusively on a small, open-source corpus of instruction records (databricks-dolly-15k) generated by Databricks employees. It’s definitely not going to take over the world, but it demonstrates a very interesting internal exercise by Databricks.
Researchers and developers can download the Dolly 2.0 model weights from Databricks starting today and the dataset is also available on GitHub.