A little guide to building Large Language Models in 2024

May 2, 2024
17,881 views

A little guide through everything you need to know to train a well-performing large language model in 2024.
This is an introductory talk with links to references for further reading.
This is the first video of a 2 part series:
- Video 1 (this video): covering all the concepts needed to train a well-performing LLM in 2024
- Video 2 (next video): hands-on application of all these concepts with code examples
This video is adapted from a talk I gave in 2024 at an AI/ML winter school for graduate students. When I shared the slides online, people kept asking for a recording of the unrecorded class, so I decided to spend a morning recording it to share it more widely alongside the slides.
Link to the slides: docs.google.com/presentation/...
Chapters:
00:00:00 Intro
00:00:59 Workflow for LLMs
Part 1: Training: data
00:01:17 Data preparation - intro and good recent resources on data preparation
00:05:28 A web scale pretraining corpus - goals and challenges
00:11:29 Web scale data sources - focus on recent datasets
00:18:01 Language and quality filtering
00:24:34 Diving in data deduplication
00:27:40 Final data preparation for training
00:31:31 How to evaluate data quality at scale
00:36:29 The datatrove and lighteval libraries
Part 2: Training: modeling
00:38:18 Introduction to modeling techniques for LLM training
00:39:09 When the model is too big: parallelism
00:40:00 Data parallelism
00:41:18 Tensor parallelism
00:44:38 Pipeline parallelism
00:47:00 Sequence parallelism and references on 4D parallelism
00:47:52 Synchronisation: GPU-CPU and GPU-GPU challenges
00:52:14 Flash attention v1 and v2
00:56:23 Stable training recipes
00:59:12 New architectures: Mixture-of-experts
01:03:13 New architectures: Mamba
01:04:49 The nanotron library
Part 3: Fine-tuning: RLHF and alignment
01:06:15 RLHF in 2024
01:08:23 PPO, DPO and REINFORCE
Part 4: Fast inference techniques
01:11:23 Quantization, speculative decoding and compilation: overview and resources
End
01:14:36 Sharing your model, datasets and demo - final words

Comments
  • This is why I love youtube. Getting to hear the thoughts of the CSO of one of the hottest startups around! Thomas, I'll be at the HuggingFace x Mixtral hackathon in Paris next month, hope to see you there!

    @angelogiacco857 (a month ago)
  • Thanks for posting this. Lots of customers have been asking us how they can understand the process of creating LLMs

    @FusionQuill (a month ago)
  • Thank you so much for this extensive overview of the complete pipeline on LLM training and inference.

    @venkateshmurugadas7481 (a month ago)
  • Thank you for this! A very good introduction to the whole LLM training ecosystem for beginners.

    @dheerajnunni8611 (a month ago)
  • Brilliant lecture! Please continue recording and sharing your knowledge; it's an invaluable resource for everyone in this field.

    @shotanatenadze3705 (a month ago)
  • Thank you very much for your effort. Awaiting Video 2.

    @user-fh9cq9oz4m (a month ago)
  • Thank you for sharing this amazing video!

    @jennyliu07 (a month ago)
  • Brilliant lecture! Just so much information and insights! Thanks a lot for this!

    @stalinthomas9850 (a month ago)
  • This was wonderful; spending this much time talking about data preparation is key!

    @ndamulelosbg8887 (16 days ago)
  • Very insightful. Thank you for sharing.

    @anabildea9274 (a month ago)
  • Really insightful 🔥🔥🔥

    @computerauditor (a month ago)
  • Thank you, Thom.

    @theglionking (a month ago)
  • Thanks so much!!! Much appreciated.

    @danberm1755 (14 days ago)
  • This is really helpful! Thank you very much.

    @user-zr2ps3km8m (28 days ago)
  • Thank you for this video

    @minhnguyenbinh609 (a month ago)
  • Gold 🥇🥇🥇

    @1littlecoder (a month ago)
  • Merci beaucoup Thomas!!

    @7alexopoulos (16 days ago)
  • Very interesting, thank you.

    @willsmithorg (16 days ago)
  • Amazing!

    @MLTOKYO (a month ago)
  • Thanks a lot for this. Nanotron is really useful

    @husseinekeita8909 (a month ago)
  • amazing lecture

    @phaZZi6461 (17 days ago)
  • Thank you so much :)

    @Pingu_astrocat21 (28 days ago)
  • Great video! When is the second one coming out?

    @stevechiou5760 (17 days ago)
  • Very nice video

    @linli6838 (a month ago)
  • 🎉❤

    @ojasvisingh786 (a month ago)
  • What has become of the retentive network architecture that was touted as an alternative to transformers? Why have no published LLMs been trained using it?

    @clray123 (16 days ago)
  • slides link? :)

    @lynncherny (a month ago)
  • 33:23, what's his example of the noisier dataset? It sounds like he's saying "Zopalé" or something 😄

    @420_gunna (17 days ago)
    • "The Pile" - it's on the slide...

      @clray123 (16 days ago)
    • @clray123 Hah! Duh -- thank you 😅

      @420_gunna (16 days ago)