An explicitly DPO-based technique is one that cites DPO as seed material for its creation.
Frontier labs currently include: OpenAI, DeepMind, Anthropic, and Google. I will modify this description if this changes (e.g. if Meta releases a SOTA LLM).
Public simply means that it has been announced or otherwise discovered that such a DPO-trained LLM has been trained.
Meta just released Llama 3: "Our approach to post-training is a combination of supervised fine-tuning (SFT), rejection sampling, proximal policy optimization (PPO), and direct preference optimization (DPO)".
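For anyone unfamiliar with the last of those, here is a minimal sketch of the DPO objective from Rafailov et al. (2023) in PyTorch. The function and variable names are illustrative, not from Meta's (or anyone's) actual training code:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss (Rafailov et al. 2023).

    Inputs are per-example sequence log-probabilities of the chosen and
    rejected responses under the trained policy and under a frozen
    reference model (typically the SFT checkpoint).
    """
    # Margin by which the policy prefers chosen over rejected,
    # relative to how much the reference model already does.
    margin = (policy_chosen_logps - policy_rejected_logps) \
             - (ref_chosen_logps - ref_rejected_logps)
    # -log(sigmoid(beta * x)) == softplus(-beta * x), numerically stable.
    return F.softplus(-beta * margin).mean()
```

Intuitively, the loss widens the chosen-vs-rejected log-prob gap relative to the frozen reference model, with beta controlling how far the policy may drift from it.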
I think this prediction should resolve to yes.
@StephenMcAleese This market will likely resolve yes. Keep in mind that the largest Llama 3 variant has over 400B params and benchmarks worse than Opus at its current checkpoint. I will wait a few days after the model is widely available to decide whether I classify Meta as a frontier lab or not.
@1832489723645 Really? It says on the website that it has 70B parameters like Llama 2: https://ai.meta.com/blog/meta-llama-3/
@StephenMcAleese The model benchmarking close to Opus at the current checkpoint is the 400B model, which is not yet available for use.
Do you consider IPO (http://arxiv.org/abs/2310.12036) explicitly DPO-based? It is a generalisation of DPO (rough comparison sketched below).
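For context on how it generalises: the ΨPO framing in that paper recovers DPO as one choice of loss, and IPO swaps DPO's logistic loss for a squared loss on the same policy-vs-reference log-ratio margin. A minimal sketch, assuming the same sequence log-prob inputs as a standard DPO setup (names are mine, not from the paper):

```python
import torch

def preference_margin(policy_chosen_logps, policy_rejected_logps,
                      ref_chosen_logps, ref_rejected_logps):
    # The same log-ratio margin that the DPO loss is built on.
    return (policy_chosen_logps - policy_rejected_logps) \
           - (ref_chosen_logps - ref_rejected_logps)

def ipo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps,
             tau: float = 0.1) -> torch.Tensor:
    """IPO (Azar et al. 2023): regress the margin toward the fixed
    target 1/(2*tau) with a squared loss, instead of DPO's
    -log sigmoid(beta * margin)."""
    margin = preference_margin(policy_chosen_logps, policy_rejected_logps,
                               ref_chosen_logps, ref_rejected_logps)
    return ((margin - 1.0 / (2.0 * tau)) ** 2).mean()
```

The squared loss bounds how large the margin is pushed, which is IPO's fix for DPO's tendency to overfit confident preference pairs.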
@HanchiSun I won't resolve based on this because I don't consider HuggingFace a frontier lab, but it's interesting that FOSS is starting to prefer DPO for smaller models.