Dolly 2.0 was trained on a set of questions and answers that were entirely human-generated and are freely available online.
Dolly 2.0, released by Databricks, is the first open-source instruction-tuned language model. It uses techniques similar to InstructGPT, but with a dataset that is 100% open source and allegedly of higher quality.
Because the model is entirely open source, it is free to use, including for commercial purposes.
According to analysts, enterprises want to use large language models built on open-source foundations such as Dolly for specific, focused use cases.
Until now, training large language models on ChatGPT output has taken place in a legal gray area. Dolly 2.0, the large language model with ChatGPT-like human interactivity that Databricks released just two weeks ago, appears to have found a way around this. Dolly 2.0 differs from other “open source” models in that it is available for commercial purposes without requiring payment for API access or data sharing with outside parties.
According to the company’s official statement, Dolly 2.0 is the world’s first open-source LLM that is both trained on a transparent dataset and tuned to follow instructions. The model, based on the EleutherAI Pythia family, has 12 billion parameters and was fine-tuned exclusively on the open-source databricks-dolly-15k corpus.
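For readers who want to try the model themselves, a minimal sketch of loading Dolly 2.0 from the Hugging Face Hub with the transformers library might look like the following; it assumes a machine with enough GPU memory for a 12-billion-parameter model.

```python
# Minimal sketch: load the published databricks/dolly-v2-12b checkpoint.
# Assumes a GPU (or several) with enough memory for a 12B-parameter model.
import torch
from transformers import pipeline

generate_text = pipeline(
    model="databricks/dolly-v2-12b",
    torch_dtype=torch.bfloat16,  # half precision to reduce memory use
    trust_remote_code=True,      # the repo ships a custom instruction pipeline
    device_map="auto",           # spread layers across available devices
)

print(generate_text("Explain the difference between open and closed LLMs."))
```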
What distinguishes open-source-based LLMs from closed LLMs
In contrast to closed LLMs, open-source-based models can be tailored commercially to a company’s needs. Analysts attribute this to the public nature of the training data.
Closed models such as ChatGPT rely on data that belongs to OpenAI, the for-profit company that created them, and can only be accessed through a paid API.
The phrase ‘open LLMs’ can mean various things. The most obvious and important features are the availability of the source code and the flexibility with which these models can be deployed. Access to training datasets and model weights, as well as a transparent, cooperative style of decision-making, are further dimensions of openness.
Dolly 2.0 is an LLM for which the dataset, the model, the training code, and the model weights are all offered by Databricks as open source. Businesses can use any of them commercially to build a custom LLM. This strategy stands apart from other LLMs, whose individual model-building components have not been open-sourced.
Analysts also noted that the number of parameters used to train the models differs between “closed” and “open” LLMs: closed LLMs typically use far more.
Dolly 2.0 has 12 billion parameters; OpenAI has not disclosed GPT-4’s parameter count, but it is widely believed to be orders of magnitude larger.
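As an illustrative aside, the 12-billion figure can be sanity-checked by loading the published checkpoint and counting its weights. Note that this needs tens of gigabytes of memory; the smaller dolly-v2-3b checkpoint can be substituted for a lighter test.

```python
# Illustrative sanity check of Dolly 2.0's parameter count.
# Loading the 12B checkpoint needs tens of GB of memory; swap in
# "databricks/dolly-v2-3b" for a lighter-weight test.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "databricks/dolly-v2-12b", torch_dtype=torch.bfloat16)
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e9:.1f}B parameters")  # roughly 12B
```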
Exactly how was Dolly 2.0 trained?
The company’s first iteration of Dolly cost just $30 to train, using a dataset created by the Stanford Alpaca team. That dataset was itself built with the OpenAI API.
The dataset used to train Dolly 1.0 therefore contained output from ChatGPT, and as the Stanford team pointed out, OpenAI’s terms of service prevent anyone from using that output to create a model that competes with OpenAI. To avoid the problem and produce a commercially viable model, Databricks developed Dolly 2.0 on a 12-billion-parameter language model from EleutherAI’s Pythia family.
Databricks then refined the model with a new, high-quality, human-generated instruction-following dataset, crowdsourced from 5,000 Databricks employees, the company claims.
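Databricks has open-sourced its actual training code, but as a rough illustration of what this kind of instruction fine-tuning involves, a minimal sketch using the Hugging Face Trainer might look like the following. The small pythia-70m variant and all hyperparameters here are illustrative assumptions, not Databricks’ settings.

```python
# Illustrative sketch of instruction fine-tuning (not Databricks' pipeline):
# fine-tune an EleutherAI Pythia base model on databricks-dolly-15k.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "EleutherAI/pythia-70m"  # tiny stand-in; Dolly 2.0 used pythia-12b
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

def to_text(ex):
    # Fold instruction, optional context, and response into one training string.
    prompt = ex["instruction"]
    if ex["context"]:
        prompt += "\n" + ex["context"]
    return {"text": prompt + "\n" + ex["response"] + tokenizer.eos_token}

dataset = dataset.map(to_text)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True,
                        remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="dolly-sketch",
                           num_train_epochs=1,
                           per_device_train_batch_size=4),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```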
databricks-dolly-15k is the name the company gives to this high-quality set of human-generated prompts and responses. It is governed by the Creative Commons Attribution-ShareAlike 3.0 Unported License.
The dataset can be downloaded from the company’s GitHub page. According to the statement, anyone may use, modify, or extend it for any purpose, including commercial ones.
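The dataset is also mirrored on the Hugging Face Hub, so a small sketch of loading and inspecting it with the datasets library looks like this:

```python
# Load databricks-dolly-15k from the Hugging Face Hub and peek at one record.
from datasets import load_dataset

dolly = load_dataset("databricks/databricks-dolly-15k", split="train")
print(len(dolly))                 # roughly 15,000 prompt/response records
print(dolly[0]["instruction"])    # the human-written prompt
print(dolly[0]["response"])       # the human-written answer
```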
According to Databricks, the model weights can be downloaded from the Databricks Hugging Face page.
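A minimal sketch of fetching those weights locally with the huggingface_hub client, assuming the published repo id databricks/dolly-v2-12b:

```python
# Download the Dolly 2.0 weights to the local Hugging Face cache.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="databricks/dolly-v2-12b")
print(f"Weights downloaded to: {local_dir}")
```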
Small but mighty, Dolly 2.0
Considering the size of its training corpus, Dolly 2.0 “exhibits a surprisingly capable level of instruction-following behavior,” even though it is not a state-of-the-art model. It shows that developing effective AI technologies now takes orders of magnitude less time and money than was previously believed, and the hope is that, as a result of Dolly 2.0, the AI community will begin to collaborate and develop additional solutions.
Limitations of the Dataset
The dataset’s GitHub page acknowledges that it has some potential issues. Some of the prompts and responses were developed using Wikipedia data, which means that any bias found in Wikipedia may be reflected in the final dataset.
Some inconsistencies may also stem from the fact that some of the workers who built the dataset were not native English speakers.
And because of the demographic make-up of the employees who created it, the dataset may contain biases specific to that group.
Despite these potential flaws, Databricks maintains that its dataset is of higher quality.
Furthermore, the company intends Dolly 2.0 to serve as a foundation on which others can develop and innovate even better versions.
Open Source AI Is Better, According to Databricks
Dolly 2.0 lets users own the models they create and better protect their data, since they do not have to share it with a third party.
Additionally, they believe that AI safety should be in the hands of all stakeholders, rather than just the three major corporations.
With open source gaining popularity, it will be interesting to see where this sector stands two years from now.