No Priors Ep.61 | OpenAI's Sora Leaders Aditya Ramesh, Tim Brooks and Bill Peebles

22 May 2024
8,160 views

AI-generated videos are not just leveled-up image generation; they could be a big step forward on the path to AGI. This week on No Priors, the team from Sora is here to discuss OpenAI’s recently announced generative video model, which can take a text prompt and create realistic, visually coherent, high-definition clips up to a minute long.
Sora team leads Aditya Ramesh, Tim Brooks, and Bill Peebles join Elad and Sarah to talk about developing Sora. The generative video model isn’t yet available for public use, but the examples of its work are very impressive. However, the team believes we’re still in the GPT-1 era of AI video models. They are focused on a slow rollout to ensure the model offers real value to users and, more importantly, that every possible safety measure is in place to avoid deepfakes and misinformation. They also discuss what they’re learning from implementing diffusion transformers, why they believe video generation takes us one step closer to AGI, and why entertainment may not be the main use case for this tool in the future.
Show Notes:
0:00 Sora team Introduction
1:05 Simulating the world with Sora
2:25 Building the most valuable consumer product
5:50 Alternative use cases and simulation capabilities
8:41 Diffusion transformers explanation
10:15 Scaling laws for video
13:08 Applying end-to-end deep learning to video
15:30 Tuning the visual aesthetic of Sora
17:08 The road to “desktop Pixar” for everyone
20:12 Safety for visual models
22:34 Limitations of Sora
25:04 Learning from how Sora is learning
29:32 The biggest misconceptions about video models

Comments
  • Really great interview. Thanks to all.

    @jonkraghshow · 27 days ago
  • As a 3D artist, filmmaker and actor, SORA has me super excited. I can't wait to play around with this tech. It's pretty crazy how all these modalities are happening at once--image, video, voice, sound effects, and music. All the pipelines needed to create media. There will be a time, not far off, when we can plug in the prompt, and SORA 5 will create all the needed departments. As the human working with this, I would of course be heavily involved in the iterative generation and direction of each piece of media...and in the end the edit would be mine. I wonder how much 'authorship' a creator will have or be given.

    @Glowbox3D · 27 days ago
    • But prior to commercially utilizing the SORA output, there must be clarity on the source of the training data. It can't be OpenAI pushing it to creators, and the creators saying they trust OpenAI. This is almost the exact same issue as textual generation for fun and brainstorming; fair use, I suppose.

      @boonkiathan · 24 days ago
  • Cool interview, awesome to see a glimpse into the innovation being done to develop these video models

    @erniea5843 · 27 days ago
  • Smart! 😊 Personalisation and aesthetics. Cool. But also PRACTICAL worldbuilding, please. How can this help create quality lifestyles? Happy communities? A convivial society?

    @garsett · 27 days ago
  • Interesting video! It really highlights the potential of using 3D tokens with time as an added dimension :). My experience with diffusion models and video generation didn't show anything quite like Sora's temporal coherence. Looking ahead, I'm excited about the prospects of evolving from polygon rendering to photorealism via image-to-image inference. While I might be biased due to my interest in this rendering, I think incorporating 'possibility' as an additional dimension, as suggested by "imagining higher dimensions", could address issues like the leg switching effects we currently see. Such physics-consistent behavior could potentially be borrowed from game engine scenarios, where, unlike an apple that behaves predictably when dropped, a leg has specific movement constraints (also affected by perspective shifts). It’s a speculative route, but it might be worth exploring if it promises substantial improvements.

    @leslietetteh7292 · 28 days ago
    • Maybe internal 3D modeling should be introduced to solve the issue you mentioned (leg switching, or so-called "entity inconsistency").

      @tianjiancai1118 · 26 days ago
    • @tianjiancai1118 How so? (NB: are you familiar with how diffusion models work? It's just learning to denoise an image, or a cube in this case. I'm just suggesting that it learn to denoise the branching possibilities rather than a cube, so it knows what is not a possibility; suggesting, not guaranteeing, that the idea will work. There are things like ControlNets, though, so if this internal 3D modelling is a valid idea, please share.)

      @leslietetteh7292 · 24 days ago
    • Sorry, to clarify: internal 3D modeling is hard to achieve in a diffusion model (as far as I know). What I mean is somehow a totally new architecture.

      @tianjiancai1118 · 24 days ago
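[Editor's note: the exchange above turns on the idea that a diffusion model "just learns to denoise an image, or a cube." As a toy illustration of that objective (not Sora's code; the schedule, shapes, and names here are all made up), this sketches the forward-noising step that a denoiser would be trained to invert:]

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "video" latent: a small cube of time x height x width values.
x0 = rng.standard_normal((4, 8, 8))

# Linear noise schedule: alpha_bar[t] is the fraction of signal kept at step t.
T = 100
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def add_noise(x0, t):
    """Forward diffusion: mix the clean cube with fresh Gaussian noise at step t."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

# A denoiser's training target is to recover eps (or x0) from xt.
# Here we only verify the schedule: later steps retain less of the signal.
def signal(xt):
    return abs(np.corrcoef(x0.ravel(), xt.ravel())[0, 1])

xt_early, _ = add_noise(x0, 5)
xt_late, _ = add_noise(x0, 95)
print(signal(xt_early), signal(xt_late))  # early-step correlation with x0 is higher
```

[Whether the cube is an image or a spacetime block of video, the objective is the same; the "branching possibilities" proposal above would change what the denoiser is asked to reconstruct, not this noising mechanic.]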
  • Great interview

    @EnigmaCodeCrusher · 27 days ago
  • Compute and data are converging on becoming interchangeable sides of the same coin. Flops are all you need.

    @JustinHalford · 28 days ago
  • I'm definitely following these three talented guys on X. Really great interview, and without a doubt Sora is already making an impact in Hollywood, like Pixar once did during the Steve Jobs era.

    @amritbro · 27 days ago
  • Really, all these amazing things are possible just with transformers; there's not much innovation beyond applying transformers to X and scaling it. The most innovative thing they did was a tokenization method using boxes; the rest is mechanics.

    @AIlysAI · 28 days ago
    • Adding another axis in the form of imaginary numbers improved our ability to model higher dimensional interactions before. That's negative, bordering on bias - if it isn't innovation, then why didn't everyone else do it?

      @leslietetteh7292 · 28 days ago
  • I'm old. These guys look like they just left high school.

    @oiuhwoechwe · 28 days ago
    • Haha, I'm 71. I know exactly what you mean. The average age of the developers of the first Mac was 28 years old. It seems like the average age of the AI community is so young but that gives these super smart people a lot of years to get things straightened out.

      @voncolborn9437 · 27 days ago
    • They almost have. Peebles is just out of university.

      @mosicr · 26 days ago
  • vocal fry contest

    @BadWithNames123 · 24 days ago
  • The Matrix basically

    @phen-themoogle7651 · 27 days ago
  • Our subconscious does a much better job at modeling physics. Your conscious mind imagines the apple falling vaguely; your subconscious mind can learn to juggle several apples without dropping them, so it knows when they will be where.

    @jeffspaulding43 · 28 days ago
    • We perceive possibility (which can be thought of as an extra dimension; the idea is from "imagining extra dimensions"). I would think that if trained on branching "possibilities" it'd have much more consistent physics. But especially with the idea of polygon rendering to photoreal image-to-image inference on the horizon, there's more of a focus on speeding up inference these days (see Meta's amazing work on "Imagine Flash" with Emu). With this sort of temporal consistency, if OpenAI manages to get inference speed up, they could just use a traditional videogame physics engine with photoreal inference laid on top. It'll probably sell a lot, especially if they map electrical signals through the spinal cord to touch input and replicate that. Seeing and touching the real world through VR will be epic, and yeah, probably sell loads. Could train the next gen of AI engineers (think deep-sea or deep-space repair) in a simulation that looks identical to, and behaves identically to, the real world.

      @leslietetteh7292 · 27 days ago
    • Branching possibility increases cost exponentially, so knowing how to (relatively) precisely predict something is also important. Humans certainly learn possibility, and we learn certainty too.

      @tianjiancai1118 · 26 days ago
    • @tianjiancai1118 Certainly. I'm almost sure it'd have a positive effect on modelling what are essentially 4D interactions effectively, but with the sort of inference speed-ups we're seeing now, I'm pretty sure image-to-image inference, polygon rendering to photorealistic, is the way to go for the easy win.

      @leslietetteh7292 · 26 days ago
    • You mentioned an "easy win". I would argue that any generation without understanding its nature can't be precise enough. Inference speed is important, but inference quality is also important to achieve indistinguishable (or so-called no-mistake) results. Though you can speed up inference and offer real-time generation, there are still cases requiring reasonable results.

      @tianjiancai1118 · 26 days ago
    • @tianjiancai1118 "Imagine Flash: Accelerating Emu Diffusion Models with Backward Distillation" is a really good paper by Meta that you should read; it achieves super-fast inference without really compromising on quality. There are some pretty good demos of the quality they're achieving with real-time inference.

      @leslietetteh7292 · 24 days ago
  • Why would they hype Sora up and then not even have a timeline for releasing a product?

    @davidh.65 · 27 days ago
    • Because they are still working on preventing misuse.

      @tianjiancai1118 · 26 days ago