Reparameterization Trick - WHY & BUILDING BLOCKS EXPLAINED!

Jan 4, 2022
9,945 views

This tutorial provides an in-depth explanation of challenges and remedies for gradient estimation in neural networks that include random variables.
While the final implementation of the method (called the Reparameterization Trick) is quite simple, it is interesting, and arguably just as important, to understand how and why the method can be applied in the first place.
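To make the idea concrete, here is a minimal, hypothetical PyTorch-style sketch (the function name is mine, not code from the video): instead of sampling z directly from N(mu, sigma^2), which leaves no gradient path to mu and sigma, we sample parameter-free noise and transform it deterministically.

    import torch

    def reparameterized_sample(mu, log_sigma):
        # epsilon comes from a fixed base distribution N(0, I) and carries
        # no learnable parameters, so autograd never has to differentiate
        # "through" the act of sampling itself.
        eps = torch.randn_like(mu)
        sigma = torch.exp(log_sigma)
        # z is a deterministic, differentiable function of (mu, sigma) and
        # the external noise, so gradients flow back to mu and log_sigma.
        return mu + sigma * eps

For comparison, calling .sample() on torch.distributions.Normal(mu, sigma) blocks gradients, while .rsample() applies exactly this reparameterization internally.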
Recommended videos to watch before this one
Evidence Lower Bound
• Evidence Lower Bound (...
3 Big Ideas - Variational AutoEncoder, Latent Variable Model, Amortized Inference
• Variational Autoencode...
KL Divergence
• KL Divergence - CLEARL...
Links to various papers mentioned in the tutorial
Auto-Encoding Variational Bayes
arxiv.org/abs/1312.6114
Doubly Stochastic Variational Bayes for non-Conjugate Inference
proceedings.mlr.press/v32/tit...
Stochastic Backpropagation and Approximate Inference in Deep Generative Models
arxiv.org/abs/1401.4082
Gradient Estimation Using Stochastic Computation Graphs
arxiv.org/abs/1506.05254
A thread with some insights about the name - "The Law Of The Unconscious Statistician"
math.stackexchange.com/questi...
#gradientestimation
#elbo
#variationalautoencoder

Comments
  • I watched your videos about KL, ELBO, and VAE in sequence, and now this one. They helped me a lot to clarify my understanding of Variational Auto-Encoders. Pure gold. Thanks!

    @anselmud, 2 years ago
    • 🙏 ...glad that you found them helpful!

      @KapilSachdeva, 2 years ago
  • Glad that someone finally took the time to decrypt the symbols in the loss function equation!! What a great channel :)

    @mikhaeldito, 2 years ago
    • 🙏

      @KapilSachdeva, 2 years ago
  • Man, these have to be the best ML videos on KZhead. I don't have a degree in Stats and you are absolutely right - the biggest roadblock to understanding is just parsing the notation. The fact that you explain the terms and give concrete examples for them in the context of the neural network is INCREDIBLY helpful. I've watched half a dozen videos on VAEs and this is the one that finally got me to a solid mathematical understanding.

    @sklkd93, 1 year ago
    • 🙏 I don’t have a degree in stats either 😄

      @KapilSachdeva, 1 year ago
    • @KapilSachdeva what was your path to decoding this? I am curious about where you started and how you ended up here. I am sure that's just as interesting as this video.

      @RajanNarasimhan, 2 months ago
  • I knew the concept; now I know the maths. Thanks for the videos, sir.

    @ssshukla26, 2 years ago
    • 🙏

      @KapilSachdeva, 2 years ago
  • Incredible quality of teaching 👌.

    @adamsulak8751, 5 months ago
    • 🙏

      @KapilSachdeva, 5 months ago
  • I just found treasure! This was the clearest explanation I've come across so far... And now I'm going to binge-watch this channel's videos like I do Netflix shows. :D

    @ThePRASANTHof1994, 1 year ago
    • 🙏 …. all tutorials are PG :)

      @KapilSachdeva, 1 year ago
  • This explanation is what I was looking for for many days! Thank you!

    @vslaykovsky, 1 year ago
    • 🙏

      @KapilSachdeva, 1 year ago
  • Wonderful video Kapil. Thanks from the University of Oslo.

    @leif-martinsunde1364, 1 year ago
    • 🙏

      @KapilSachdeva, 1 year ago
  • I have watched so many ML / deep learning videos from so many creators, and you are the best. I feel like I finally understand what's going on. Thank you so much!

    @user-lm7nn2jm3h, 8 months ago
    • 🙏

      @KapilSachdeva, 8 months ago
  • Enjoyed watching your clear explanation of the re-parameterization trick. Well done!

    @chyldstudios, 1 year ago
    • 🙏

      @KapilSachdeva, 1 year ago
  • I hope to also learn your style of delivery from these videos. It's so effective in breaking down the complexity of topics. Looking forward to whatever your next video is.

    @SY-me5rk, 2 years ago
    • 🙏 Thanks.

      @KapilSachdeva, 2 years ago
  • Thanks a lot sir for your excellent explanation. It made me understand the key idea behind the reparameterization trick.

    @mohdaquib9808, 2 years ago
    • 🙏

      @KapilSachdeva, 2 years ago
  • This series was so informative and enjoyable. Absolutely love it! I hope to understand diffusion models much better and get some ideas about extensions.

    @ayushsaraf8421, 8 months ago
    • 🙏

      @KapilSachdeva, 8 months ago
  • I was looking for this. It’s full of essential information. Convention matters, and you clearly explained the differences in this context.

    @prachijadhav9098, 2 years ago
    • 🙏

      @KapilSachdeva, 2 years ago
  • For the quiz at the end: From what I understood, the Encoder network (parametrized by phi) predicts some mu and sigma (based on input X) which then define a normal distribution that the latent variable is sampled from. So I think the answer is 2, "predicts", not "learns".

    @television9233, 2 years ago
    • Your answer is 100% correct 🤗 (see the sketch below)

      @KapilSachdeva, 2 years ago
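To make the quiz answer above concrete, here is a hypothetical sketch (the class name, layer sizes, and head names are mine, not from the video) of an encoder that predicts mu and log sigma from x and then samples z via the reparameterization trick:

    import torch
    import torch.nn as nn

    class Encoder(nn.Module):
        def __init__(self, x_dim=784, h_dim=256, z_dim=16):
            super().__init__()
            self.hidden = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
            # Two heads: the network *predicts* the distribution parameters
            # for each input x; it does not learn one fixed mu/sigma.
            self.mu_head = nn.Linear(h_dim, z_dim)
            self.log_sigma_head = nn.Linear(h_dim, z_dim)

        def forward(self, x):
            h = self.hidden(x)
            mu, log_sigma = self.mu_head(h), self.log_sigma_head(h)
            eps = torch.randn_like(mu)           # noise from N(0, I)
            z = mu + torch.exp(log_sigma) * eps  # reparameterized sample
            return z, mu, log_sigma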
  • Really appreciate it. I enjoyed your teaching style and great explanations! Thank you ❤️❤️

    @alirezamogharabi8733, 1 year ago
    • 🙏

      @KapilSachdeva, 1 year ago
  • Wow~ an amazing tutorial. Thank you!

    @inazuma3gou, 1 year ago
    • 🙏

      @KapilSachdeva, 1 year ago
  • Amazing video for VAE and VI. Could you make a tutorial about Variational Inference in Latent Dirichlet Allocation? Descriptions and explanations of that part of the work are rather rare.

    @longfellowrose1013, 1 year ago
    • 🙏

      @KapilSachdeva, 1 year ago
  • Thank you for this series. It has really helped me understand the theoretical basis of the VAE model. I had a couple of questions: Q1) At 21:30, is dx = d(epsilon) only because we have a linear location-scale transform, or is that a general property of LOTUS? Q2) At 9:00, how are the terms combined to give the joint distribution when the parameters of the distributions are different? We would have the log of the multiplication of the probabilities, but the two thetas are different, right? Sorry if this is a stupid question.

    @atharvajoshi4243, 9 months ago
    • Q1) It has nothing to do with LOTUS; it comes from the linear location-scale transform (see the sketch below). Q2) Theta here represents the parameters of the "joint distribution". Do not think of it as the log of a multiplication of probabilities; rather, think of it as a distribution of two random variables, with theta representing the parameters of that distribution.

      @KapilSachdeva, 9 months ago
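For reference, here is a sketch of the change-of-variables identity behind Q1, in standard notation (which may differ slightly from the symbols used in the video): with the location-scale transform, the densities and differentials match up, so the expectation over z becomes an expectation over the parameter-free noise.

    z = \mu_\phi(x) + \sigma_\phi(x)\,\epsilon, \quad \epsilon \sim \mathcal{N}(0, I)
    \implies q_\phi(z \mid x)\, dz = p(\epsilon)\, d\epsilon
    \implies \mathbb{E}_{q_\phi(z \mid x)}[f(z)] = \mathbb{E}_{p(\epsilon)}\big[ f\big(\mu_\phi(x) + \sigma_\phi(x)\,\epsilon\big) \big]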
  • Incredibly clear, and thank you so much for these videos. Looking forward to more...

    @midhununni951, 6 months ago
    • 🙏

      @KapilSachdeva, 6 months ago
  • As always, I am stunned by your video! May I ask what software you use to produce such videos?

    @ArashSadr, 2 years ago
    • 🙏 Thanks Arash for the kind words. I primarily use PowerPoint; for a very few advanced animations I use manim (github.com/manimCommunity/manim).

      @KapilSachdeva, 2 years ago
  • Thank you sir, clear explanation. I want to ask about the expression p(xi, z): is this the joint probability, or is it the likelihood under z and theta?

    @slemanbisharat6390, 1 year ago
    • Joint probability (see the sketch below).

      @KapilSachdeva, 1 year ago
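For completeness, in the standard VAE notation (assuming the usual generative model, where the prior p(z) is typically a fixed standard normal), the joint factorizes as:

    p_\theta(x_i, z) = p_\theta(x_i \mid z)\, p(z)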
  • Absolutely brilliant! One issue that I have is that the Leibniz integral rule is concerned with the limits of the integral being functions of the variable w.r.t. which we are trying to take the derivative. I don't see how this applies to our case in your video! Isn't the support here just lower and upper bound CONSTANT values with respect to the Phi parameter? In other words, am I wrong in saying that the support is NOT a function of Phi, and thus we should be able to move the derivative inside the integral? I would appreciate your feedback on this. Thanks

    @MLDawn, 8 months ago
    • This is where the notation creates confusion. You should think of phi as a function (a neural network in this case) that you are learning/discovering. See also the sketch below.

      @KapilSachdeva, 8 months ago
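As a supplement, here is a sketch of how this issue is usually framed in the gradient-estimation literature (standard material, and possibly a slightly different angle than the one taken in the video). Even when differentiation under the integral sign is permitted, the result is no longer an expectation under q_phi:

    \nabla_\phi \int f(z)\, q_\phi(z \mid x)\, dz = \int f(z)\, \nabla_\phi q_\phi(z \mid x)\, dz

Since \nabla_\phi q_\phi(z \mid x) is not a probability density, the right-hand side cannot be estimated by simply sampling z ~ q_phi. After reparameterizing with z = g_\phi(\epsilon, x), \epsilon \sim p(\epsilon) (and assuming f itself does not depend on phi), the base distribution is free of phi, so the gradient and the expectation commute and Monte Carlo estimation becomes straightforward:

    \nabla_\phi\, \mathbb{E}_{p(\epsilon)}\big[ f(g_\phi(\epsilon, x)) \big] = \mathbb{E}_{p(\epsilon)}\big[ \nabla_\phi f(g_\phi(\epsilon, x)) \big]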
  • At 6:39, the distribution p_\theta(x|z) cannot have mean mu and stddev sigma, as that mean and stddev live in the latent space (the space of z) while x lives in the input space.

    @somasundaramsankaranarayan4592, 8 days ago
  • Thank you for a detailed explanation. I had one question though. I am not able to understand why we cannot take the derivative with respect to theta inside the integral when the integral is over x (at 20:52). Could you please help me get some insight into this?

    @rubyshrestha5747, 2 years ago
    • Hello Ruby, thanks for your comment and, more importantly, for paying attention. The reason you are confused here is that I have a typo in this example: the dx in this example should have been dtheta. Now that I look back, I am not happy with this simpler example that I tried to use before explaining it for the ELBO. Not only is there a typo, but it can create confusion. I would suggest ignoring this (so-called simpler) example and seeing it directly for the ELBO. Apologies!

      @KapilSachdeva, 2 years ago
  • 19:09 Since the base distribution is free of our parameters, when we backprop and do differentiation, we don't have to differentiate through the unit normal distribution? Is this correct?

    @blasttrash, 1 year ago
    • Correct (see the sketch below). This should also make you ask whether the assumption of the prior being standard normal is a good one. There are variants of the variational autoencoder in which you can also learn/estimate the parameters of the prior distribution.

      @KapilSachdeva, 1 year ago
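A tiny hypothetical demonstration of the point above (PyTorch-style; the variable names are mine, not from the video): the noise tensor carries no learnable parameters, so after backprop only mu and log_sigma receive gradients.

    import torch

    mu = torch.zeros(3, requires_grad=True)
    log_sigma = torch.zeros(3, requires_grad=True)

    # Base distribution N(0, I): parameter-free, nothing to differentiate here.
    eps = torch.randn(3)
    # Deterministic, differentiable transform of the noise.
    z = mu + torch.exp(log_sigma) * eps
    loss = (z ** 2).sum()
    loss.backward()

    print(mu.grad, log_sigma.grad)  # populated: gradients reach the parameters
    print(eps.requires_grad)        # False: no gradient is needed for the noise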
  • I have a question. From the change-of-variables concept, we assign z to be a deterministic function of a sample from the base distribution and the parameters of the target distribution. But when we apply this in the case of the ELBO, we assign z to be a deterministic function of Phi, x and epsilon, where Phi is the parameters of the encoder network and not the parameters of the target distribution p(z|x). Would this not create an inconsistency in the application?

    @spandanbasu5653, 1 year ago
    • The ELBO (the loss function) is used during the "training" of the neural network. During training you are learning the parameters of the encoder (and decoder) networks. Once the networks are trained, q(z|x) will be an approximation of p(z|x). (See the sketch below.)

      @KapilSachdeva, 1 year ago
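To spell out the connection in standard VAE notation (a sketch; the symbols may be grouped differently in the video): phi never parameterizes p(z|x) directly. It parameterizes the approximate posterior through the encoder, and the deterministic transform is built from the encoder's outputs:

    q_\phi(z \mid x) = \mathcal{N}\big(z;\ \mu_\phi(x),\ \mathrm{diag}(\sigma_\phi^2(x))\big),
    \qquad z = g_\phi(\epsilon, x) = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon,
    \quad \epsilon \sim \mathcal{N}(0, I)

Maximizing the ELBO pushes q_\phi(z \mid x) toward p(z \mid x), so the parameters of the true posterior are never needed explicitly.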
  • At 6:35, isn't the output of the decoder the reconstruction of X, not μ and σ?

    @jimmylovesyouall, 1 year ago
    • The output of the decoder could be either of the following: a) a direct prediction of X (the input vector), or b) a prediction of the mu and sigma of the distribution from which X came (see the sketch below). Note that the mu and sigma, if predicted by the decoder, will be those of X and not Z.

      @KapilSachdeva, 1 year ago
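A hypothetical sketch of option (b) (the class name, layer sizes, and head names are mine, not from the video): a decoder whose outputs are the mu and sigma of the distribution over X, with the reconstruction term computed as the Gaussian log-likelihood of X under those predicted parameters.

    import torch
    import torch.nn as nn

    class GaussianDecoder(nn.Module):
        def __init__(self, z_dim=16, h_dim=256, x_dim=784):
            super().__init__()
            self.hidden = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU())
            # These heads predict the parameters of p_theta(x | z), i.e. the
            # distribution X came from -- not the parameters of Z.
            self.mu_x = nn.Linear(h_dim, x_dim)
            self.log_sigma_x = nn.Linear(h_dim, x_dim)

        def forward(self, z, x):
            h = self.hidden(z)
            mu = self.mu_x(h)
            sigma = torch.exp(self.log_sigma_x(h))
            # Reconstruction term: log p_theta(x | z) under the predicted Gaussian.
            return torch.distributions.Normal(mu, sigma).log_prob(x).sum(dim=-1)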
  • GEM

    @omidmahjobian3377, 2 years ago
    • 🙏

      @KapilSachdeva, 2 years ago
  • 8:48 How can the terms be combined if one follows the conventional syntax (sigma denoting the parameters of the density function) and the other the non-conventional syntax (sigma denoting the parameters of the decoder, leading to estimates of the parameters of the density function)? In essence, the sigmas they are referencing are not the same.

    @RAP4EVERMRC96, 1 year ago
    • Assuming that when you mentioned "sigma" you meant "theta": this is yet another example of abuse of notation, so your confusion is normal. Even though I say that theta is the parameters of the decoder network, in this situation think of the network as having predicted mu and sigma (watch the VAE tutorial), and in the symbolic expression, when combining the two terms, we are considering theta to be the set of mu and sigma.

      @KapilSachdeva, 1 year ago
    • @KapilSachdeva Thanks for clearing that up, and yes, I meant theta. I always mix them up.

      @RAP4EVERMRC96, 1 year ago
    • 😊

      @KapilSachdeva, 1 year ago
  • It predicts the parameters of the latent variable.

    @anupgupta3644, 11 months ago
    • Correct. 🙏

      @KapilSachdeva, 11 months ago
    • Thank you sir :)

      @anupgupta3644, 11 months ago
  • 3:00 Shouldn't it be the "negative reconstruction error" instead?

    @vslaykovsky, 1 year ago
    • Since in optimization we minimize, we minimize the negative ELBO, which will result in the negative reconstruction error (see the sketch below).

      @KapilSachdeva, 1 year ago
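For reference, a sketch of the sign bookkeeping in standard notation (the grouping of terms may differ from the slide at 3:00): the ELBO contains the expected reconstruction log-likelihood, and the minimized loss is the negative ELBO, in which that term appears as the (positive) reconstruction error:

    \mathrm{ELBO}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)] - \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big)
    \mathcal{L}(\theta, \phi; x) = -\mathrm{ELBO} = -\mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)] + \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big)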
  • Is there a difference between VAE and GAN?

    @medomed1105, 2 years ago
    • They are two different architectures that share some goals. VAEs were primarily designed to do efficient latent variable model inference (see the previous tutorial for more details on this), but they can be used as generative models. A GAN is a generative architecture whose training regime (loss function, setup, etc.) is very different from a VAE's. For a long time GANs produced much better images, but VAEs have now also caught up in the quality of generated images. Both architectures are somewhat difficult to train, though VAEs are relatively easier. Hope this sheds some light.

      @KapilSachdeva, 2 years ago
    • @KapilSachdeva Thank you very much. If there is a possibility of making a tutorial about GANs, it would be very much appreciated. Thanks again.

      @medomed1105, 2 years ago
    • 🙏

      @KapilSachdeva, 2 years ago