How do you minimize a function when you can't take derivatives? CMA-ES and PSO

May 20, 2024
7,376 views

What happens when you want to minimize a function, say the error function used to train a machine learning model, but the function has no derivatives, or they are very hard to calculate? You can use gradient-free optimizers. In this video, I show you two of them:
- CMA-ES (Covariance Matrix Adaptation Evolution Strategy)
- PSO (Particle swarm optimization)
This video is a sequel to "What is Quantum Machine Learning"
• What is Quantum Machin...
and also part of the blog post:
www.zapatacomputing.com/why-g...
Introduction: (0:00)
CMA-ES: (1:23)
PSO: (9:17)
Conclusion: (14:00)
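
For readers who want to see the core idea in code, here is a minimal sketch (in Python, with illustrative parameter choices, not code from the video) of the sample-select-refit loop that CMA-ES builds on; the full algorithm adds evolution paths and step-size adaptation on top of this.

```python
# A minimal sketch of the idea behind CMA-ES (not the full algorithm:
# real CMA-ES also uses evolution paths and step-size adaptation).
# The objective, population size, and other values are illustrative.
import numpy as np

def f(x):                       # example objective: a shifted bowl
    return np.sum((x - np.array([3.0, -2.0]))**2)

rng = np.random.default_rng(0)
mean = np.zeros(2)              # current center of the search distribution
cov = np.eye(2)                 # current covariance (shape of the Gaussian)
n_samples, n_keep = 20, 5       # population size and number of "elite" points

for generation in range(50):
    # 1. Sample candidate solutions from the current Gaussian.
    candidates = rng.multivariate_normal(mean, cov, size=n_samples)
    # 2. Evaluate the objective (no derivatives needed).
    scores = np.array([f(c) for c in candidates])
    # 3. Keep the best candidates.
    elite = candidates[np.argsort(scores)[:n_keep]]
    # 4. Move the mean toward the elite and re-fit the covariance to them,
    #    so the next generation samples where good points were found.
    mean = elite.mean(axis=0)
    cov = np.cov(elite, rowvar=False) + 1e-8 * np.eye(2)

print(mean)   # should approach the minimizer [3, -2]
```

Each generation samples candidates from a Gaussian, keeps the best ones, and re-fits the Gaussian to them; this is the "evolution of candidates" visualized in the video.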

Comments
  • Very clear explanation of these two optimization algorithms. Well done!

    @chyldstudios • 1 year ago
    • Thank you, glad you like it! :)

      @SerranoAcademy • 1 year ago
  • Fantastic video! Can't wait for the third part of the series!

    @luisvasquez5015 • 1 year ago
  • Very good video, very clear and intuitive explanation. Keep it up! Greetings from Bolivia.

    @cesarkadirtorricovillanuev9761 • 1 year ago
  • Thanks, the explanation was crystal clear!

    @imadsaddik • 1 year ago
  • The best explanation I've found, thank you!

    @RK-TKLINK • 4 months ago
  • Luis, yet another awesome video! Thanks to you, I've learned something new today! The step-by-step visualization with the Gaussian evolution of candidates is epic - super helpful and eye-opening! Thank you!

    @Todorkotev • 1 year ago
    • Thank you so much, Boyko, I’m glad you enjoyed it! :)

      @SerranoAcademy • 1 year ago
    • And thank you so much for your contribution! It's very kind.

      @SerranoAcademy • 1 year ago
  • Thanks a lot, very good introductory video! One question on PSO: is the size of the steps fixed, or proportional to (respectively) the current "speed", the distance to the personal best, and the distance to the group best? Thanks

    @elie_ • 10 months ago
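
For reference, in the standard PSO update (the exact variant in the video may differ) the step is proportional to all three quantities mentioned in the question: the current velocity is scaled by an inertia weight, and the pulls toward the personal best and the group best are each scaled by a coefficient and a random factor. A sketch with commonly used but illustrative coefficient values:

```python
# Standard PSO velocity/position update for one particle (a sketch; the
# video's exact variant may differ). w, c1, c2 and the random factors r1, r2
# are the usual hyperparameters, not values taken from the video.
import numpy as np

rng = np.random.default_rng()
w, c1, c2 = 0.7, 1.5, 1.5            # inertia, cognitive, social coefficients

def pso_step(x, v, personal_best, group_best):
    r1, r2 = rng.random(x.shape), rng.random(x.shape)
    v_new = (w * v                               # keep part of the current speed
             + c1 * r1 * (personal_best - x)     # pull toward the particle's best
             + c2 * r2 * (group_best - x))       # pull toward the swarm's best
    return x + v_new, v_new
```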
  • Thanks a lot for this super informative lecture! Could you please make one on Genetic Algorithms? 😅

    @shazajmal9695 • 1 month ago
  • Amazing video, really! I also wish you could explain it with more math and some coding; a second part would be amazing.

    @omarmohy3975 • 1 year ago
  • Do you have a citation for this method? Thanks.

    @centscents • 1 year ago
  • So well explained, that is amazing. I wonder what the downsides can be with such a particle swarm.

    @user-wr4yl7tx3w • 1 year ago
    • Thanks! Great question. The exact same problems can happen with PSO: it can get stuck at a local minimum just like CMA-ES, and the ways to overcome them are the same.

      @SerranoAcademy • 1 year ago
  • Great explanation. Where can I get the math formulas?

    @brunorcabral • 1 year ago
    • Thanks! Here's a huge repository of CMA-ES info, code, tutorials, etc. For PSO I haven't found as much info, so mostly Wikipedia.

      @SerranoAcademy • 1 year ago
  • How fast is it? If I train a neural net (which we know how to compute the gradient of) with CMA-ES or PSO, will it take longer to converge? I would imagine PSO in particular is pretty slow, maybe only slightly better than a Monte Carlo approach. You're basically doing a line-search algorithm without the advantage of knowing that the direction you're moving in is a descent direction. CMA-ES, on the other hand, might be reasonable?

    @floydmaseda • 1 year ago
    • Great question! I haven't used them in neural networks; I imagine that gradient descent is better there. I've used CMA-ES and PSO for quantum neural networks, since derivatives are hard to get there, and I've noticed that CMA-ES tends to work better. Not so much in speed, but in finding minima and not getting stuck. That's where I think the gains are: the randomness in CMA-ES lets it explore more parts of the space than a gradient-based algorithm that only takes small steps. In the end, I think a good combination of gradient-based and gradient-free methods works best.

      @SerranoAcademy • 1 year ago
    • @SerranoAcademy A common strategy when training a model is to reduce the learning rate of gradient descent when the loss is no longer decreasing, to see if we're bouncing around inside a local minimum without descending. I wonder if trying an iteration or two of CMA-ES at these times might sometimes let us jump to nearby local minima that may be deeper but could not be reached with any gradient-based approach. Another use might be during initialization, which is often just random. Maybe running CMA-ES for a few iterations at the beginning of training and picking the best of, say, 5 choices might shoehorn the network into a better minimum than a single initialization point would.

      @floydmaseda • 1 year ago
  • CMA-ES is pretty similar to CEM (the Cross-Entropy Method).

    @ZaCharlemagne • 4 months ago
  • I wish you would make a course on statistics in great detail, or write a book on it.

    @mohammadarafah7757 • 1 year ago
    • Thank you! I'm building a course on that; hopefully it'll be out in the next few months. I'll announce it on the channel when it's ready! :)

      @SerranoAcademy • 1 year ago
    • That's great news. I can help you with practical labs; I am a PhD researcher in generative modelling. @SerranoAcademy

      @mohammadarafah7757 • 1 year ago
  • Thanks. That's just a guess, but I doubt this method would be efficient in higher dimensions, for the following reason. For instance, we had to take 5 points randomly in 2D. Let's take the square root to guess how many points per dimension you need: that's about 2. So with 10 dimensions I would need 2^10 = 1024 points. But I think ML involves many more dimensions; that's basically the number of weights in a multi-layered network. Say 3 layers of fully connected neurons, 100*100 each: about 2^10000 points. 1 gigabyte of RAM is 2^30 bytes, so there's no way to apply this. Am I wrong somewhere? :-)

    @java2379 • 1 year ago
    • The idea of such methods is to optimize functions when gradient descent isn't available because of a lack of differentiability. In the case you mention, the network optimization is done the usual way (SGD, Adam, ...), and what you look at is, say, the loss after a fixed number of epochs for a given set of hyperparameters (learning rate, beta coefficients, etc.), which are far less numerous. Then you reiterate the process, using CMA-ES/PSO solely on those hyperparameters.

      @elie_ • 10 months ago
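
A sketch of the workflow this reply describes, assuming the Python `cma` package's ask/tell interface; `train_and_eval` is a hypothetical helper (here replaced by a stand-in expression), not code from the video:

```python
# Sketch: CMA-ES over a few hyperparameters, while the network itself is
# still trained with a gradient-based optimizer. Assumes the `cma` package
# (pip install cma); `train_and_eval` is a hypothetical stand-in.
import cma

def train_and_eval(params):
    log_lr, beta1 = params
    # In a real setup: build the model, train it with Adam(lr=10**log_lr,
    # betas=(beta1, 0.999)) for a fixed number of epochs, and return the
    # validation loss. Here a simple quadratic stands in for that loss.
    return (log_lr + 3.0) ** 2 + (beta1 - 0.9) ** 2

# Start from log_lr = -2, beta1 = 0.5, with initial step size 0.2.
es = cma.CMAEvolutionStrategy([-2.0, 0.5], 0.2)
while not es.stop():
    candidates = es.ask()          # one generation of hyperparameter sets
    es.tell(candidates, [train_and_eval(c) for c in candidates])

print(es.result.xbest)             # best hyperparameters found
```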
  • Question: you don't have a gradient, so how do you know if CMA-ES has reached a local minimum?

    @zyzhang1130 • 1 year ago
    • Great question! You can notice that after several iterations you keep getting generations that don't improve your minimum, or that improve it only very slightly. Then you assume you're at a local minimum.

      @SerranoAcademy • 1 year ago
    • @SerranoAcademy So it's similar to the convergence analysis of gradient descent. Thank you for your reply 😁

      @zyzhang1130 • 1 year ago
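
The stopping rule discussed in the reply above can be made concrete with a simple "patience" check; the threshold and patience values below are illustrative, not from the video:

```python
# A simple "patience" stopping rule: stop once the best value seen has not
# improved by more than `tol` over the last `patience` generations.
# Call this once per generation on the running list of best values so far.
def should_stop(best_so_far, patience=10, tol=1e-8):
    if len(best_so_far) <= patience:
        return False
    improvement = best_so_far[-patience - 1] - min(best_so_far[-patience:])
    return improvement < tol
```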