The challenges in Variational Inference (+ visualization)

25 May 2024
9,808 views

VI attempts to find an optimal surrogate posterior by maximizing the Evidence Lower Bound (=ELBO). The surrogate posterior acts as a replacement for the intractable true posterior. Let's look at some details. Here are the notes: github.com/Ceyron/machine-lea...
In this video, we will look at the simple example of the Exponential-Normal model with one latent and one observed variable. Even in this simple example with one-dimensional random variables, the marginal, and therefore also the posterior, is intractable, which motivates the use of Variational Inference.
We are going to compare the probability distributions we have access to (the prior, the likelihood, and the joint) with the ones we do not have access to due to intractable integrals (the marginal and the posterior). This should show that "latent" does not necessarily mean "not computable".
Finally, we will analyze a visualization you can also access here: share.streamlit.io/ceyron/mac...
If you want, you can also check out the corresponding Python code: github.com/ceyron/machine-lea...
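To make this concrete, here is a minimal, self-contained Python sketch of such an Exponential-Normal model (the parameter values below are assumptions for illustration, not necessarily the ones used in the video or repository). The prior, likelihood, and joint can be queried point-wise, while the marginal only appears as an integral that has to be approximated:

import numpy as np
from scipy import stats
from scipy.integrate import quad

RATE = 1.0    # assumed rate of the Exponential prior p(z)
SIGMA = 1.0   # assumed standard deviation of the Gaussian likelihood p(x|z)

def prior(z):
    # p(z): Exponential prior over the latent variable
    return stats.expon(scale=1.0 / RATE).pdf(z)

def likelihood(x, z):
    # p(x|z): Gaussian whose mean is the latent z
    return stats.norm(loc=z, scale=SIGMA).pdf(x)

def joint(x, z):
    # p(x, z) = p(x|z) * p(z): available point-wise
    return likelihood(x, z) * prior(z)

def marginal(x):
    # p(x) = integral over z of p(x, z): no closed form here, so it can only be
    # approximated numerically (feasible in 1D, intractable in general)
    value, _ = quad(lambda z: joint(x, z), 0.0, np.inf)
    return value

x_observed = 2.0                  # hypothetical observed value
print(joint(x_observed, 0.5))     # fine: point-wise query of the joint
print(marginal(x_observed))       # only available as a numerical approximation
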
Timestamps:
00:00 Recap VI and ELBO
00:30 Agenda
00:52 Example: Exponential-Normal model
02:26 (1) We know the prior
04:15 (2) We know the likelihood
05:36 (3) We know the joint
06:34 (1) We do NOT know the marginal
08:15 (2) We do NOT know the (true) posterior
08:53 Why we want the posterior
09:51 Remedy: The surrogate posterior
10:31 Example for the ELBO
10:58 Fix the joint to the data
11:37 Being able to query the joint
12:56 Visualization
14:52 Outro
-------
📝 : Check out the GitHub Repository of the channel, where I upload all the handwritten notes and source-code files (contributions are very welcome): github.com/Ceyron/machine-lea...
📢 : Follow me on LinkedIn or Twitter for updates on the channel and other cool Machine Learning & Simulation stuff: / felix-koehler and / felix_m_koehler
💸 : If you want to support my work on the channel, you can become a Patron here: / mlsim
-------
⚙️ My Gear:
(Below are affiliate links to Amazon. If you decide to purchase the product or something else on Amazon through this link, I earn a small commission.)
- 🎙️ Microphone: Blue Yeti: amzn.to/3NU7OAs
- ⌨️ Logitech TKL Mechanical Keyboard: amzn.to/3JhEtwp
- 🎨 Gaomon Drawing Tablet (similar to a WACOM Tablet, but cheaper, works flawlessly under Linux): amzn.to/37katmf
- 🔌 Laptop Charger: amzn.to/3ja0imP
- 💻 My Laptop (generally I like the Dell XPS series): amzn.to/38xrABL
- 📱 My Phone: Fairphone 4 (I love the sustainability and repairability aspect of it): amzn.to/3Jr4ZmV
If I had to purchase these items again, I would probably change the following:
- 🎙️ Rode NT: amzn.to/3NUIGtw
- 💻 Framework Laptop (I do not get a commission here, but I love the vision of Framework. It will definitely be my next Ultrabook): frame.work
As an Amazon Associate I earn from qualifying purchases.
-------

Comments
  • Amazing video! Great help, thank you for your effort to make such an excellent video!!!

    @clairedaddio346 (a month ago)
    • You're very welcome 🤗 Thanks for the kind comment 😊

      @MachineLearningSimulation (a month ago)
  • Just discovered your channel and I've got to say that I am really impressed by the amount of work you put into it. Looking forward to seeing other great videos like this! (Until then, I have a lot to catch up on.)

    @Louis-ml1zr (2 years ago)
    • Thanks a ton! :) And welcome to the channel!

      @MachineLearningSimulation (2 years ago)
  • I've watched various university course lectures, some papers, and some blog posts for several days and still couldn't understand "what we have", "what we want to find", etc. You explain them concretely and even show a simple example where p(x) is intractable. It instantly made me understand, much appreciated! 🎉

    @kimyongtan3818 (a year ago)
    • You're very welcome 🤗.

      @MachineLearningSimulation (a year ago)
  • Thanks for the video

    @soumyasarkar4100 (2 years ago)
    • You're very welcome :) Thanks a ton for the donation!

      @MachineLearningSimulation (2 years ago)
  • Just found out about this channel and I would like to thank you for your thoughtful work

    @hugogabrielidis8764 (a year ago)
    • Appreciate it :) Thanks a lot.

      @MachineLearningSimulation (a year ago)
  • The best explanation ever. Thumbs up!

    @GH_WH_SIN (10 months ago)
    • Thanks a lot :)

      @MachineLearningSimulation (9 months ago)
  • Very well explained! Earned my sub. Looking forward to more videos!

    @user-yh2qf4ti9w (6 months ago)
    • Awesome, thank you! 😊

      @MachineLearningSimulation (6 months ago)
  • Good explanation as always :)

    @mickolesmana5899 (2 years ago)
    • Glad to hear that! :) Feels good to have some Probabilistic ML content again.

      @MachineLearningSimulation (2 years ago)
  • Great video! I have a question: why don't we just model p(x) with some known distribution, like a Gaussian? Why do we have to compute the integral of p(x,z) w.r.t. z?

    @yccui (3 months ago)
    • Hi, thanks a lot for the kind words and the great questions 😊 Do you have timestamps for the points in the video you refer to? (That helps me recap what I said in detail.) Some more general answers: the p(x) distribution is a consequence of the full joint model. So, if there is a model p(x,z), it implies a certain functional form of p(x) purely by its definition (as the expectation over z). Maybe you mean whether we could propose a surrogate marginal (similar to the surrogate posterior one commonly sees in VI)? That can certainly also be done, but it might be of less practical use. Regarding your second question: p(x) = int_z p(x,z) dz is a fundamental result in probability theory. You can, for instance, check the first chapter of Chris Bishop's "Pattern Recognition and Machine Learning".

      @MachineLearningSimulation (3 months ago)
  • Great video, I really liked the concrete example and the actual computation of integral approximations etc. I also really like how much you distinguish between what we know and what we don't know when defining the different distributions (i.e. *assuming we have a z*, we can plug in and get p(x, z)). On that note, towards the end you talked about p(z, x=D), i.e. the joint over z and x where you've plugged in the observed dataset for x. You showed that this is actually not a valid probability distribution. Can you explain a bit more about why exactly that is the case? Why can't we simply treat the joint p(z, x=D) as the conditional? We are plugging in known data and getting a value representing the probability of the latent. Thanks as always for amazing content, keep it up! :)

    @addisonweatherhead2790 (2 years ago)
    • Hey Addison, thanks for commenting :). I hope that the video was able to resolve some of the open questions from the last video, and thanks a lot for the feedback. Regarding your question: I think the crucial observation is that the joint is proportional to the posterior (according to Bayes' rule). Therefore, it exhibits the same features, e.g., minima or maxima. Hence, we could query it to compare certain Z against each other. For example, p(Z=0.2, X=D) = 0.03 and p(Z=0.3, X=D) = 0.12. This would allow us to say which of the two Z is more probable, which is helpful for Maximum A Posteriori estimates (see the small sketch below). However, we cannot say anything about whether the probability values for both Z are high or low in comparison to the entire space of possible Z. That is what we would need a full distribution for. I hope the most recent video was able to shine some more light on it: kzhead.info/sun/qJh7esh6enaIbK8/bejne.html Please feel free to ask a follow-up question if something remained unclear.

      @MachineLearningSimulation (2 years ago)
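
A minimal Python sketch of the point-wise comparison described in the reply above (the Exponential prior, Gaussian likelihood, parameter values, and observed value are assumptions for illustration):

from scipy import stats

def joint(z, x_obs):
    # p(z, x) = p(z) * p(x|z): Exponential prior times Gaussian likelihood with mean z
    return stats.expon(scale=1.0).pdf(z) * stats.norm(loc=z, scale=1.0).pdf(x_obs)

x_obs = 2.0             # hypothetical observed data point
z_a, z_b = 0.2, 0.3     # two proposed latent values
print(joint(z_a, x_obs), joint(z_b, x_obs))
# The larger of the two values identifies the more probable latent proposal, but
# neither is a normalized posterior density; that would require dividing by the
# intractable marginal p(x=D).
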
  • Thanks!

    @akshaykiranjose2099 (5 months ago)
    • Thanks a lot for the generous donation 🙏😊

      @MachineLearningSimulation (5 months ago)
  • I am still struggling with the concept that in the beginning we already somewhat have the joint P(Z,D), which we can evaluate for values of Z and D to get probabilities, but we do not yet have the conditional P(Z|D). The joint itself already encodes the relationship between Z and D, P(Z,D), no? Why do we want the conditional, which should effectively encode the same thing? (Perhaps I'll rewatch the part "Why we want the posterior" again.)

    @matej6418 (a year ago)
    • You're right. :) The joint encodes this relationship, but only unnormalized. This means that if I propose two latent variables to you, then (given the same observed data) you could compute which of the two has the higher probability. This is also the fact we use to do optimization; it allows us to find MAP estimates. However, you cannot tell the actual (normalized) probability of either of the two Z values.

      @MachineLearningSimulation (a year ago)
  • Thanks for the video. However, I would like to ask: in the visualization, you computed the integral over Z; is this the marginal P(X)? But as you said earlier in the video (7:58), it is intractable to compute. Is there something I'm missing?

    @sfdv1147 (a year ago)
    • Also, what book would you recommend for Probabilistic Graphical Models? I finished Prof. Koller's lectures on Coursera but I find there are too many things left out. I know she also wrote a book on this topic but I find it a bit difficult to read 😅😅😅

      @sfdv1147 (a year ago)
    • You're welcome 😊 thanks for the kind comment. The integral value is just an approximation (I also used the approx sign there). In the streamlit script there are only the final computed values. If I remember correctly, I evaluated the integral from 0 to 10 (so not even to infinity, but I believe there are no fat tails; correct me if I'm wrong) with a composite trapezoid rule, probably with something like 100-1000 evaluation points (see the small sketch below). Your question is very valid, because it still holds true: in general, those integrals (for the marginal) are intractable. In lower dimensions (let's say below 15) you can often resort to numerical quadrature techniques (like Newton-Cotes, Gauss quadrature, or something else). Beyond that, one can only use Monte Carlo techniques, for which one often uses special Markov chain Monte Carlo approaches. This is a very interesting field in itself, because some MCMC techniques like Hamiltonian Monte Carlo link back to differential equations (which are also a major part of the channel's content). I want to create videos on these topics, but the probabilistic topics are a bit on hold at the moment since they are not part of my PhD research. Still, I want to continue with them on the channel at some point in the future. Stay tuned ;)

      @MachineLearningSimulation (11 months ago)
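
A small sketch of the quadrature described in the reply above, assuming the truncated domain [0, 10], 1000 evaluation points, and unit model parameters (the streamlit script may use different values):

import numpy as np
from scipy import stats
from scipy.integrate import trapezoid

def joint(x, z):
    # Exponential prior times Gaussian likelihood with mean z (assumed unit parameters)
    return stats.norm(loc=z, scale=1.0).pdf(x) * stats.expon(scale=1.0).pdf(z)

x_obs = 2.0                               # hypothetical observed value
z_grid = np.linspace(0.0, 10.0, 1000)     # truncated integration domain
marginal_approx = trapezoid(joint(x_obs, z_grid), z_grid)  # composite trapezoid rule
print(marginal_approx)
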
    • As you probably figured out, there are many difficult books on these topics. I can recommend Bishop's "Pattern Recognition and Machine Learning". Generally speaking, I like a more code-focused approach (which I also try to teach in my videos). For instance, the documentation of probabilistic programming languages (like TFP, Stan, PyMC3, Pyro, or Turing.jl in Julia) comes with many nice examples and use cases. I learned a lot just by replicating these myself, guided by the documentation. 😊 Good luck on your learning journey 👍

      @MachineLearningSimulation (11 months ago)
    • @@MachineLearningSimulation Thanks for the answer and book suggestions.

      @sfdv1147 (11 months ago)
  • Superb!

    @todianmishtaku6249 (a year ago)
    • Thanks 😊

      @MachineLearningSimulation (a year ago)
  • Awesome

    @Stealph_Delta_3003 (a year ago)
    • Thanks 😊

      @MachineLearningSimulation (a year ago)
  • Hello, this might be a trivial doubt, but at 12:01 you estimate P(Z, X=D) for one observed datapoint. What if we have more than one datapoint? How will this equation be generalised? Thanks a ton for the video!!

    @ashitabhmisra9123 (a year ago)
    • That depends a bit on the concrete model; one commonly used approach for a dataset is an i.i.d. assumption. Consequently, you would take the product of the probabilities over all entries in the set (see the small sketch below).

      @MachineLearningSimulation (a year ago)
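
A minimal sketch of that i.i.d. generalization, working in log-space so the product of likelihoods becomes a sum (the dataset and model parameters below are made up for illustration):

import numpy as np
from scipy import stats

def log_joint(z, data):
    # log p(z, X=D) = log p(z) + sum_i log p(x_i | z) under the i.i.d. assumption
    log_prior = stats.expon(scale=1.0).logpdf(z)                      # assumed Exponential prior
    log_likelihood = stats.norm(loc=z, scale=1.0).logpdf(data).sum()  # assumed Gaussian likelihood
    return log_prior + log_likelihood

dataset = np.array([1.8, 2.3, 2.1])   # hypothetical observed data points
print(log_joint(0.5, dataset))
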
    • @@MachineLearningSimulation makes sense. Thanks!

      @ashitabhmisra9123 (a year ago)
  • Thanks for your amazing videos about variational inference, they're extremely helpful! I have a question regarding the joint distribution p(z, x). It seems intuitive to assume that the latent variable's distribution p(z) is a given distribution, like a normal distribution, but what if we don't know the likelihood p(x|z)? Is it still possible to do variational inference, and how should I understand this in the example of images and the camera settings? Thanks! 😊

    @tony0731 (a year ago)
    • First of all: thanks for the kind feedback and the nice comment 😊. It's an interesting question, but unfortunately beyond my knowledge. VI as we looked at it here is based on having a joint distribution that can be factored into a Directed Graphical Model. As such, we will (by assumption) always have access to the distribution over the root nodes and the conditional distributions for all other nodes. Ultimately, what is required to perform VI is a (differentiable) implementation of the joint distribution fixed to the data that can be queried for various latent variables. Maybe there is a way to find something like this without the likelihood p(x|z), but I'm unfortunately not aware of one 😅

      @MachineLearningSimulation (a year ago)
  • Really nice! Could you also make something more advanced about sparse Gaussian processes?

    @jiahao2709 (9 months ago)
    • Also a great suggestion; I will come back to it once I revive the probabilistic ML series in the future 😊

      @MachineLearningSimulation (9 months ago)
  • Thanks for a great video. You mentioned that in order to make the connection between x and z in the likelihood function p(x|z), we make z the mean of the Gaussian. As you know, in a Gaussian we have the term (x-z)**2. Now, x and z can have very different dimensions! In that case, how on earth can we take their difference, let alone compute p(x|z)? Thanks

    @MLDawn (8 months ago)
    • You're welcome 🤗 thanks for the kind comment. For this specific example, it works because both the latent z and the observed x are scalar (or 1-dimensional). Generally speaking, there are two cases you could be referring to: either the true dimension of the other variable is different, or the other variable has an additional batch axis (like in a dataset). For the former, yes, there would be an inconsistency, but one would then go back and remodel to make it work. For the latter, you could use plate notation (kzhead.info/sun/dM-qf6uJi5tvl5E/bejne.htmlsi=phRiOjQKpo3-cxOa ) and compute the likelihood as the product of the probabilities over all samples.

      @MachineLearningSimulation (7 months ago)
  • Really good video as always. But just to make sure I understand the variational inference example: say we are doing a dog and cat image classification task, and in the dataset there are 40 percent dog images and 60 percent cat images. Z is the latent variable and X is the image. For the prior, P(Z = dog) = 0.4 and P(Z = cat) = 0.6? P(X|Z) is the likelihood of the data; we won't know the actual probability, but we can approximate it and train an approximator by using the negative log-likelihood or some type of likelihood function? And for variational inference we just want to know P(Z|X)? Is my example and understanding correct? Thanks

    @junhanouyang6593 (2 years ago)
    • Thanks for the comment and the nice feedback :) I really appreciate it. In the case of (simple) probabilistic models, you can always express the functional form of the likelihood P(X|Z) and could therefore also compute the likelihood of your data. You could use the likelihood to train your model, that's correct :). This classifier would also be some form of approximator of the posterior, as you would have a model for the task "given a new image, tell me whether it is a dog or a cat". For VI, we are interested in a full (surrogate) distribution for one data point (or a bunch of them). I think there is a small misconception in your question that is related to the difference between discriminative and generative models. There will be more videos on VI and VAEs in the coming weeks. I hope they can clear this up a bit :). Please also leave a comment under them if they do not fully answer your question.

      @MachineLearningSimulation (2 years ago)
    • @@MachineLearningSimulation Thank you, I think I know where my misunderstanding is. In the generative approach we want to find the probability P(X,Z), while in the discriminative approach we are only interested in P(X|Z)? Since in order to calculate P(X,Z), you also need to know P(X|Z), both approaches can let us know the likelihood of the data, correct?

      @junhanouyang6593 (2 years ago)
    • @@junhanouyang6593 I think there are some more axes to this question. The example of cats and dogs used in this video was in the context of VAEs, where we are not interested in classifying the images with their corresponding labels, but rather in finding latent information within the images. This could be that the images of cats and dogs are distinct in that they show different types of animals. However, it could also be other information, like the lighting situation in which the picture was taken. In a sense, using VAEs on these problems is unsupervised learning. In my point of view, the difference between generative and discriminative models is not necessarily what we are interested in; rather, it is how we model it. And I think here is also the bigger catch in your initial question: for classical classification problems (and also classical regression problems) the input and the output are not latent. Often only the parameters of the model are considered latent, often even without a prior on them (see e.g. linear regression: kzhead.info/sun/ob2Ad9VtsYaimqs/bejne.html ). Let's put it into supervised learning with X being the input and Y being the output. Then, generative models would model P(X, Y), whereas discriminative models would model P(Y|X). In other words, generative models model the joint (and the DGM), whereas discriminative models model the posterior only. I am sure this might not have been the best answer to your question. Maybe check back on some basic aspects of probabilistic modeling, like latent variables: kzhead.info/sun/hrKec8OLqZSEja8/bejne.html

      @MachineLearningSimulation (2 years ago)
  • Maybe this is a stupid question, but is it true that the intractability of the marginal holds only for continuous distributions of Z? For discrete distributions we can always sum P(x,z) over all values of z to get P(x). Does this imply that variational inference is applicable only to continuous distributions?

    @ritupande1420 (a year ago)
    • Hi, thanks for the interesting question. :) Indeed, the video did not say much about problems involving discrete latent random variables. There are two parts to your comment: (1) the claim that you can always sum over all discrete z, and (2) the claim that once you can express the marginal, you can't apply VI anymore. (1): This might seem intriguing, since it will probably always work if the discrete z is one-dimensional. A good example is when z represents a class (like in a (Gaussian) Mixture Model). As you correctly mentioned, even if you have many classes (let's say 1000), you can still sum over them. In contrast, even in 1D you could come up with integrals that are intractable. The problem with discrete variables arises once you have higher-dimensional latent spaces, which is due to the combinatorial complexity. Imagine you have multiple latent attributes that can each take 10 classes. If you have two attributes (a 2D latent space) you have 100 possible combinations to sum over, for three attributes (a 3D latent space) it becomes 1000, etc. (see the small sketch below). In essence, it grows exponentially. Hence, for some smaller discrete latent spaces it might be possible to just sum in order to marginalize, but it quickly becomes infeasible/intractable. On top of that, you would have to do that each time you wanted to query one value of the posterior distribution. With VI, you get a full surrogate you can do whatever you want with (like finding modes, sampling, etc.). (2): I can also understand the thought, especially because I motivated Variational Inference as the remedy to intractable posteriors. However, even if the marginal is tractable, like in Gaussian Mixture Models, you can still use VI. In these cases, it also becomes Expectation Maximization (=EM). Hope that helped :). Let me know if something is unclear.

      @MachineLearningSimulation (a year ago)
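
A tiny illustration of the combinatorial growth mentioned in the reply above (10 classes per latent attribute, as in the example):

# Number of terms in the marginalizing sum for d discrete latent attributes
# with 10 classes each grows as 10**d.
classes_per_attribute = 10
for num_attributes in (1, 2, 3, 10):
    print(num_attributes, classes_per_attribute ** num_attributes)
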
    • @@MachineLearningSimulation Thanks for a very detailed and clear explanation.

      @ritupande1420 (a year ago)
  • Why do we not know the exact value of the posterior, when we know the posterior is proportional to the joint distribution, which can be calculated? Could you give a practical example showing that knowing the exact value of the posterior is needed for an application? Thank you.

    @jason988081 (6 months ago)
    • Thanks for the great question! :) With the posterior being proportional to the joint distribution, we can already find MAP estimates, which is a good start. Analogously, we can also compare two different proposals for latent variates: if one has a higher joint probability (under the same observed data), it also has a higher posterior probability. The problem is that we can only query point-wise! In other words, we do not have a full probability distribution, meaning that we cannot sample from it. With a full distribution it is also easier to assess credible intervals, which is harder to do by sampling via MCMC.

      @MachineLearningSimulation (4 months ago)
  • What I do not understand: when looking at the ELBO, we still compare the surrogate function q(Z) - which is a valid probability distribution - with the unnormalized probability p(Z, X=D), right? But why does this comparison even make sense? To me, it seems like VI is some magic to compare the surrogate q(Z) to an unnormalized probability instead of to the (unavailable) normalized conditional. Is this actually the gist of it?

    @besarpria (2 years ago)
    • It's actually the gist of it. :D I can understand that this might seem a bit magical. The reason, of course, that we are doing this comparison is to find (=train/fit/optimize) the surrogate q(Z) and then do fancy things with it. One crucial observation is that the unnormalized probability p(Z, X=D) (the joint fixed to the data) is proportional to the (hypothetical, but unavailable) posterior. If a function is proportional to another function, their features are identical. Those features could, for instance, be maxima and minima. Hence, we could already use the unnormalized probability to do MAP estimates. Here with VI, we are just going one step further to get a full (surrogate) pdf. The next video (to be released on Friday) should clear this up. There, we will look at this Exponential-Normal model and the derivation in great detail. I hope this helps :) If something is unclear, feel free to leave a follow-up comment.

      @MachineLearningSimulation (2 years ago)
    • @@MachineLearningSimulation Thanks for your clarification, I highly appreciate it. Looking forward to the follow-up video!

      @besarpria (2 years ago)
    • @@MachineLearningSimulation There is one thing I still can't get my head around though: it is often said that L(q) becomes tractable to compute and to maximize for a reasonable family of surrogate distributions Q. However, computing L(q) requires solving an expectation, i.e., computing an integral over the complete latent space. How can I compute this expectation without evaluating p(Z, X=D) for each possible latent vector Z? Is this even possible in general, or do we need a nice closed-form joint distribution for this to work?

      @besarpria (2 years ago)
    • @@besarpria That's a great question. It also took me a while to get my head around it, especially because in classes you might only face simple artificial scenarios in which many things can be solved analytically. However, once you apply it to realistic problems, things become more challenging and you have to "engineer" more often than you might like :D You are right: in many applications the integral corresponding to the ELBO (due to the expectation) is intractable, i.e., it does not have a closed-form antiderivative. You then have to resort to sampling techniques to approximately evaluate it, in the sense of Monte Carlo (see the small sketch below). For me that raised the question: okay, if we have to approximate an integral either way, why can't we just also approximate the marginal by sampling and then normalize the joint to obtain a posterior? The catch is the necessary precision of these approximations. For the marginal, you need quite a high precision, since you want your posterior to be a valid PDF (with the integral = 1 condition). On the other hand, for the ELBO it is usually fine to just use a handful of samples, since it is going to be repeatedly evaluated over the course of the optimization. Even one sample was considered sufficient (take a look at the VAE paper: arxiv.org/pdf/1312.6114.pdf ; right below Eq. (8) the authors note this). I hope that could at least give some information regarding the answer to your question :)

      @MachineLearningSimulation (2 years ago)
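
A minimal sketch of such a Monte-Carlo ELBO estimate for the Exponential-Normal example, with an assumed Gaussian surrogate and a single sample (all parameter values are illustrative, not taken from the video):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def log_joint(z, x_obs):
    # log p(z, x=D) with an assumed Exponential prior and Gaussian likelihood
    return stats.expon(scale=1.0).logpdf(z) + stats.norm(loc=z, scale=1.0).logpdf(x_obs)

def elbo_estimate(q_loc, q_scale, x_obs, num_samples=1):
    # ELBO estimated as (1/S) * sum_s [ log p(z_s, x=D) - log q(z_s) ] with z_s ~ q(z)
    q = stats.norm(loc=q_loc, scale=q_scale)    # Gaussian surrogate q(z)
    z_samples = q.rvs(size=num_samples, random_state=rng)
    return np.mean(log_joint(z_samples, x_obs) - q.logpdf(z_samples))

# Note: a Gaussian surrogate can draw z < 0 where the Exponential prior has zero
# density (log-joint = -inf); a positively supported surrogate would avoid this.
print(elbo_estimate(q_loc=1.5, q_scale=0.5, x_obs=2.0))
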
    • @@MachineLearningSimulation Thank you so much for the extremely detailed answer! :) Very interesting to see that in the end it becomes a question of whether fitting to the joint or to the marginal using numerical integration is cheaper. I will also have a look at the paper, it looks really cool.

      @besarpria (2 years ago)
  • I'm not sure the claim that we can plug some value into a continuous likelihood and get a probability value is correct. The probability of any single value should be zero, because the measure of a single point is zero. Plus, p(x) can be greater than 1, and it's strange to have a probability greater than one; only the integral of p(x) over the domain has to be one. Or am I missing something?

    @alexanderkhokhlov4148 (3 months ago)
    • Hi, thanks for the comment 😊 Do you have a timestamp for when I say this in the video? It's been a while since I uploaded it. I was probably referring to the probability density in that case.

      @MachineLearningSimulation (3 months ago)