Will OpenAI announce a multi-modal AI capable of any input-output modality combination by end of 2025? ($1000M subsidy)
38 · Ṁ2417 · Closes Dec 31 · 83% chance

Definitions

  • Modalities: This market considers four key modalities: Image, Audio, Video, Text.

  • Any Input-Output Combination: The AI should be versatile enough to accept any mixture of these modalities as input and produce any mixture as output.

Combination of modalities examples:

The AI model can take single- or multiple-modality inputs and generate single- or multiple-modality outputs.

For example:

  • Input: Text + Image, Output: Video + Sound

  • Input: Audio + Image, Output: Text + Image

  • Input: Text, Output: Video + Audio

  • Input: Video + Audio, Output: Text

Single-to-single generation examples:

The model should also be able to map a single-modality input to a single-modality output, such as:

  • Text -> Image

  • Audio -> Text

  • Image -> Video

  • Image -> Audio

  • Image -> Text

Criteria for Market Close

  • OpenAI must officially announce a model whose capabilities meet these criteria.

  • A staggered or slow release of the model is acceptable (via the API or the UI).

  • OpenAI must allow at least some portion of the general public access to the model.

Market inspiration comes from rumors about "Arrakis" and academic work on Composable Diffusion (https://arxiv.org/pdf/2305.11846.pdf).

What's surprised me is that everyone has bought YES! I've not seen many one-sided markets in the AI category.

If they announce the model but nobody can use it (or only Microsoft or big-$$$ business partners can), does it count for YES? Or does it have to be available to the public, i.e. anyone in the US could sign up and test it?

@Mira I edited the description to clarify this. Thanks!

Video = sequence of image + sound?

i.e. is GPT-4 considered to consume video because it can consume multiple images as input?

Or are you thinking more like "video-specialized autoencoder that maps a time-indexed (image, sound) to the embedding space as a single object"?

@Mira No, GPT-4 isn't considered (under this definition) to consume video because it can handle multiple images as inputs. (Can it, though? I haven't checked.)

Of course it should have the temporal qualities of a video, but I won't comment on how it might be trained to ensure that.

@firstuserhere It's hard to tell since it's not public. Bing currently only keeps a single image in context, but I believe the reason is for cost not capability. AFAIK, images get mapped to the embedding space just like a token, so I would think it does.

I would expect any official video support to use a better representation: They don't have enough GPUs to do it like that, and it would be hard to train. But you don't want someone arguing on a technicality that it supports video because it supports images + sound...

@Mira Yep, not going to count that or other similar technicalities.

As for the training, from the CoDI paper:

Specifically, we start by independently training image, video, audio, and text LDMs.

These diffusion models then efficiently learn to attend across modalities for joint multimodal generation (Section 3.4) by a novel mechanism named “latent alignment”.

and finally:

The final step is to enable cross-attention between diffusion flows in joint generation, i.e., generating two or more modalities simultaneously. This is achieved by adding cross-modal attention sublayers to the UNet.
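To make the quoted mechanism concrete, here is a toy numpy sketch of cross-modal attention between two diffusion flows: latents from one modality form the queries, latents from another form the keys and values, and the result is added residually. All names, shapes, and weights are illustrative assumptions for this sketch, not the actual CoDI implementation (which adds such sublayers inside UNet blocks).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(x, context, Wq, Wk, Wv):
    # Queries come from one modality's latents; keys/values from the other's
    q = x @ Wq
    k = context @ Wk
    v = context @ Wv
    scores = (q @ k.T) / np.sqrt(q.shape[-1])  # scaled dot-product attention
    return softmax(scores) @ v

rng = np.random.default_rng(0)
d = 8
img_latents = rng.standard_normal((16, d))  # stand-in for an image diffusion flow
aud_latents = rng.standard_normal((10, d))  # stand-in for an audio diffusion flow
Wq = rng.standard_normal((d, d))
Wk = rng.standard_normal((d, d))
Wv = rng.standard_normal((d, d))

# Residual update: image latents attend to audio latents during joint generation
updated = img_latents + cross_modal_attention(img_latents, aud_latents, Wq, Wk, Wv)
print(updated.shape)
```

The key point the paper makes is that this attention is the only new machinery needed for joint generation: each modality's diffusion model stays frozen in its own latent space, and the sublayers let the denoising trajectories condition on each other.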
