"Don't Use Loops, They Are Slow! Do This Instead" | Code Cop

8 Feb 2024
60,196 views

Use code MODULAR and get 20% off the brand new "Getting Started with Modular Monoliths" course on Dometrain: dometrain.com/course/getting-...
Become a Patreon and get special perks: / nickchapsas
Hello, everybody, I'm Nick, and in this episode of Code Cop, I will talk about loop unrolling (or loop unwinding) and explain why I don't think it is something you should even be considering in C#.
Workshops: bit.ly/nickworkshops
Don't forget to comment, like and subscribe :)
Social Media:
Follow me on GitHub: github.com/Elfocrash
Follow me on Twitter: / nickchapsas
Connect on LinkedIn: / nick-chapsas
Keep coding merch: keepcoding.shop
#csharp #dotnet #codecop

Comments
  • To those saying that the compiler will do this anyway: that is wrong (at least in C#). There are very specific cases where unrolling can be done automatically, and it's not a compiler thing but rather a JIT thing.

    @nickchapsas · 3 months ago
    • I mean, I feel like that's splitting hairs. The idea is the same between .NET's JIT and a compiler. But I didn't expect that the JIT did much optimization beyond what the IL had in place. I always figured it was just translating the IL one line at a time to whatever machine code was appropriate.

      @jimread2354 · 3 months ago
    • @jimread2354 If I have this correct, IL, which is the compiled form, uses a stack and not registers. The JIT then replaces the stack with registers depending on the hardware and how many registers there are, among other things. This is because it's quite easy to map from a stack to registers, but very hard to do it the other way, and this way the JIT can use all the registers without the IL having to be compiled for them. So the JIT does quite a lot of the compilation, and loop unrolling is one such thing, since whether it's beneficial can depend a lot on the hardware.

      @davidmartensson273 · 3 months ago
    • Definitely in the JIT or NativeAOT; it never shows up in the compiler-generated IL indeed. Isn't this pretty much the idea behind vectorization though? Which obviously never uses the standard managed heap arrays anyway; it's all efficient stack- and CPU-register-based stuff. I am actually surprised that it even works with normal arrays. Anyway, the .NET team has already done tons of this type of thing in the standard libraries, which also played an important role in all the recent performance optimizations. They test all these things, all so that we normally do not need them in our own C# code, where we have tons of other concerns. Performance concerns in most applications are still almost entirely on the IO side and very limited on the CPU side. It would be great if C# could get more optimizations though. I still hope for a new LINQ implementation built around value types and Spans. In Rust, very abstracted code is all zero cost and sometimes faster than normal loops, because lots of optimizations happen behind the scenes.

      @jongeduard · 3 months ago
    • Semantics. One can say it's called a "just-in-time compiler" in the official docs for a reason: because it does what compilers do, which is turning code (in this case MSIL) into native code. So I guess you are nitpicking here a bit.

      @proosee · 2 months ago
    • If you check the array copy function in the .NET source, you'll see that its loop is unrolled in the source code. They know the compiler isn't doing it, but they don't mention the JIT there. It's possible that that code was written before the JIT could unroll better.

      @Misteribel · 2 months ago
  • Losing performance by not unrolling a loop should be the last item on the to-do list of your entire life.

    @Andrei-gt7pw · 3 months ago
    • Unless you're a demoscene programmer, but in that case you probably aren't using C# lol

      @ar_xiv · 3 months ago
    • If you need that kind of performance and are still using C#, throw out any normal advice; that type of optimization is so rarely needed that it should not show up in any general advice, in my opinion. It would have been interesting to see the performance with the if statements required to handle non-conforming loop sizes; I have a feeling it would be lower. That means this advice comes with a really big problem: if a junior tries it, fails to understand the problem, and just happens to only test with valid sizes (even with unit tests), you might have a bug going into production. Yes, in a bigger team code review should catch this, but there are too many developers out there who have no one to do CR, and too many teams that don't practice good code reviews anyway.

      @davidmartensson273 · 3 months ago
    • In fact, if you need this kind of performance, you wouldn't be looking to unroll the loop; instead you'd be using SIMD directly.

      @JustNrik · 3 months ago
    • @JustNrik You often need to unroll a loop before you can use SIMD instructions.

      @curtis6919 · 3 months ago
    • Disagree, it shouldn't be on your list

      @YungDeiza · 3 months ago
  • I unrolled loops once, that I can recall, in my 50 years of programming. It was satellite data downlink processing where removing the loop allowed the program to complete before the next scan line arrived. This was in assembler, not C or C#. I can't think of any other cases where timing is that tight. Cleaner code is virtually always better.

    @HarshColby · 3 months ago
    • I did once too in a graphics driver where I had to blit a sprite to video memory in the time it took the CRT to do a vertical refresh. I also couldn't make a function call into functions that were in the ROM because the overhead of pushing all the registers on the stack and then popping them off took too many clock cycles. So I had to copy what the ROM was doing directly in my code. Of course, the CPU was an 8088 and at the time memory could be accessed by the CPU or the video hardware, but not at the same time. If you tried to write to video memory while the screen was updating, you got noise on the screen because the video hardware couldn't read the memory to know what to draw. And all of this had to be done in 4K of RAM because that's all the spare memory that was available.

      @davidwilliss5555 · 3 months ago
  • I'm happy you mentioned the bounds check. I was squirming through most of the video as I resisted jumping to the comments section. I would've liked to see a benchmark with the bounds checks included.

    @badwolf01 · 3 months ago
    • Instead of checking bounds, it might be more performant to do a loop from 0 to length - 5 (or whatever the stride is) and have a second normal loop from max(0, length - 5) to length (with i++). Edit: the max is to account for the edge case where length is < 5.

      @guiorgy · 3 months ago
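A minimal C# sketch of the "unrolled main loop plus a trailing remainder loop" idea described above, using a simple sum (the names and the stride of 4 are illustrative, not taken from the video):

```csharp
using System;

// Sums an array with a 4x unrolled main loop and a scalar remainder loop,
// so it works for any length, not just multiples of 4.
static int SumUnrolled(int[] a)
{
    var sum = 0;
    var i = 0;
    var limit = a.Length - (a.Length % 4); // largest multiple of 4 <= Length

    for (; i < limit; i += 4) // main loop: 4 elements per iteration
    {
        sum += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    }

    for (; i < a.Length; i++) // remainder loop: the 0-3 leftover elements
    {
        sum += a[i];
    }

    return sum;
}

var data = new[] { 1, 2, 3, 4, 5, 6, 7 }; // length 7: not a multiple of 4
Console.WriteLine(SumUnrolled(data)); // 28
```

The remainder loop is what the LinkedIn-style advice leaves out; without it, any length that is not a multiple of the stride either skips elements or reads out of bounds.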
    • @guiorgy If you go that route, then just declare i outside of the first loop; then the second loop can just continue from where we were...

      @UrielZyx · 3 months ago
    • I think the point is that you already know the loop bounds.

      @gavinw77 · 3 months ago
    • @gavinw77 No, it's not. Neither in the LinkedIn post nor in Nick's example...

      @UrielZyx · 3 months ago
    • I had the exact same reaction. 😂

      @lhederidder · 3 months ago
  • Compiler Loop Unrolling: Am I a joke to you?

    @rubberduckdebug · 3 months ago
    • @mohammadtoficmohammad3594 When you use C# everywhere, it's not an easy feat to explain to everybody else in your company that you actually need assembler or C++ to get that extra little bit of performance. It would be much more horrible, from a support perspective, to use different languages for different parts of your code. So programmers should always know of some techniques like the aforementioned one, though they may rarely need them.

      @zachemny · 3 months ago
  • This is an old-school technique used primarily in game development, where every cycle counted. The idea is to mitigate CPU branch prediction stalls (plus the overhead of loop management). This still has its uses, sometimes, but other things tend to be much more important these days in highly performant code. You do not need special checks to avoid going "out of bounds" inside the loop. The way this is typically handled is by checking the iteration count before the loop and then, if it is not an even multiple of the unrolling, jumping (i.e. goto) into the middle of the loop at the appropriate instruction so it works out correctly. Thus it incurs this overhead only once for the entire loop sequence vs. every iteration. This sort of thing, when implemented at the assembly/machine-code level, is pretty tight and super efficient on the CPU architectures of the day when this was essential. I had to do a lot of it back in the 386SX 16MHz days in order to have any chance of rendering my 3D scene in what would loosely be described (these days) as real-time. :)

    @thereal_nsxdavid · 3 months ago
    • Except the example doesn't have the one you are talking about?

      @asagiai4965 · 3 months ago
    • Sorry, what "one" are you referring to? Not sure I understand what you mean. @asagiai4965

      @thereal_nsxdavid · 3 months ago
    • In essence, we are really wasting a ton of CPU these days...

      @erikthysell · 3 months ago
    • Can you provide an example for an array with 37 items?

      @andriiyustyk9378 · 3 months ago
    • For those wondering, the initial goto into the middle of the unrolled loop body is called Duff's device.

      @vyrp · 3 months ago
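C# cannot express a true Duff's device (there is no fallthrough jump into the middle of a loop body), but the single up-front branch can be approximated with a switch and goto case. A purely illustrative sketch, not code from the video:

```csharp
using System;

// Peels off the 0-3 remainder elements FIRST via switch fallthrough
// (goto case), so the main loop only ever sees a multiple of 4 elements.
static int SumDuffStyle(int[] a)
{
    var sum = 0;
    var i = 0;

    switch (a.Length % 4)
    {
        case 3: sum += a[i++]; goto case 2;
        case 2: sum += a[i++]; goto case 1;
        case 1: sum += a[i++]; break;
        // remainder 0: nothing to do
    }

    for (; i < a.Length; i += 4) // unrolled main loop
    {
        sum += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    }

    return sum;
}

Console.WriteLine(SumDuffStyle(new[] { 1, 2, 3, 4, 5, 6, 7 })); // 28
```

Unlike the original Duff's device, the remainder is peeled off before the loop rather than jumped into mid-loop, but the effect is the same: one remainder decision for the whole loop instead of a check per iteration.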
  • Usually I agree with you on these topics, but the keyword here is CPU pipelining. Basically, unrolling allows the CPU to process multiple operations in a single cycle instead of waiting for the result of the previous one and stalling each time. Also, I think 5 was a bad number; usually 4 is used, because CPU pipelines can typically process 2-4 elements sequentially at the same time. You also use count +=, which defeats the benefits of unrolling, because the CPU still needs to stall for the previous += call, so I think your benchmark is actually a bad example, tbh. So if you really want to optimize a hot code path where calculations are simple, this can lead to pretty significant performance boosts. I did this for my voxel engine, operating directly on pointers, and used unrolling to satisfy the CPU pipeline; it increased performance by >20%. I would never recommend this for normal loop operations with a few elements, though; this only makes sense if you a) know what you are doing and b) are operating on massive amounts of data. In my use case I have to calculate the bitwise XOR of two arrays and assign it to a third one, each with sizes >1_000_000. It is basically the same use case where you could use SIMD and get similar results.

    @TheOneAndOnlySecrest · 3 months ago
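The dependency-chain point above can be sketched as follows: the unrolled body only helps the pipeline if its statements are independent, so one common trick is to split the single accumulator into several. This is an illustrative sketch (names are made up; the four-accumulator version assumes the array length is a multiple of 4 to stay short):

```csharp
using System;

static int SumSingleAccumulator(int[] a)
{
    var sum = 0;
    for (var i = 0; i < a.Length; i++)
    {
        sum += a[i]; // every += depends on the previous one: one long chain
    }
    return sum;
}

static int SumFourAccumulators(int[] a)
{
    // Four independent chains the CPU can pipeline in parallel.
    // Assumes a.Length is a multiple of 4.
    int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (var i = 0; i < a.Length; i += 4)
    {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return s0 + s1 + s2 + s3;
}

var data = new int[1024];
for (var i = 0; i < data.Length; i++) data[i] = i;
Console.WriteLine(SumFourAccumulators(data) == SumSingleAccumulator(data)); // True
```

Whether this is actually faster depends on the hardware and the JIT; as always, benchmark before keeping it.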
    • Which brings me to the point, would it be better to use SIMD anyways. And shouldn't the JIT optimize this? 🤔

      @guiorgy · 3 months ago
    • As a follow-up, here is an example for the case I mentioned above:

      using BenchmarkDotNet.Attributes;

      namespace LoopUnrolling;

      [SimpleJob]
      public class Benchmark
      {
          private int[] _array1 = Enumerable.Range(0, 10_000_000).ToArray();
          private int[] _array2 = Enumerable.Range(0, 10_000_000).ToArray();
          private int[] _array3 = new int[10_000_000];

          [Benchmark(Baseline = true)]
          public void Simple()
          {
              var length = _array1.Length;
              for (var i = 0; i < length; i++)
              {
                  _array3[i] = _array1[i] ^ _array2[i];
              }
          }

          [Benchmark]
          public void Unrolled()
          {
              var length = _array1.Length;
              for (var i = 0; i < length; i += 4)
              {
                  _array3[i] = _array1[i] ^ _array2[i];
                  _array3[i + 1] = _array1[i + 1] ^ _array2[i + 1];
                  _array3[i + 2] = _array1[i + 2] ^ _array2[i + 2];
                  _array3[i + 3] = _array1[i + 3] ^ _array2[i + 3];
              }
          }
      }

      // | Method   | Mean     | Error     | StdDev    | Ratio |
      // |--------- |---------:|----------:|----------:|------:|
      // | Simple   | 8.958 ms | 0.1254 ms | 0.1173 ms |  1.00 |
      // | Unrolled | 6.431 ms | 0.0196 ms | 0.0153 ms |  0.72 |

      A 28% speed increase for basically zero effort; I would take that all the time if I need to call this often (in a game loop or something).

      @TheOneAndOnlySecrest · 3 months ago
    • @guiorgy In theory yes; if you use raw pointers and align them correctly, SIMD can cut this time by 8. But as we don't have control over array alignment in RAM in C#, it is usually even slower than just doing it this way. The reason is that read and write operations outside of vector boundaries are slower, so it depends on your workload. Also, SIMD code is a LOT harder to write and is dependent on the CPU's capabilities; not all processors support AVX256 or AVX512, so you need multiple code paths to handle all the cases, including a software fallback. The JIT could, and actually will, optimize these calls if you paste them in SharpLab, but it will usually not be faster unless you have full control over memory alignment and sizes.

      @TheOneAndOnlySecrest · 3 months ago
    • In C#, the most efficient safe way to do this would be using ref and Unsafe.Add, something like this:

      using System.Numerics;
      using System.Runtime.CompilerServices;
      using System.Runtime.InteropServices;
      using System.Runtime.Intrinsics;
      using BenchmarkDotNet.Attributes;

      namespace LoopUnrolling;

      [SimpleJob]
      public class Benchmark
      {
          private int[] _array1 = Enumerable.Range(0, 10_000_000).ToArray();
          private int[] _array2 = Enumerable.Range(0, 10_000_000).ToArray();
          private int[] _array3 = new int[10_000_000];

          [Benchmark(Baseline = true)]
          public void Simple()
          {
              for (var i = 0; i < _array1.Length; i++)
              {
                  _array3[i] = _array1[i] ^ _array2[i];
              }
          }

          [Benchmark]
          public void SimpleRef()
          {
              var length = _array1.Length;
              ref var array1 = ref MemoryMarshal.GetArrayDataReference(_array1);
              ref var array2 = ref MemoryMarshal.GetArrayDataReference(_array2);
              ref var array3 = ref MemoryMarshal.GetArrayDataReference(_array3);
              for (var i = 0; i < length; i++)
              {
                  Unsafe.Add(ref array3, i) = Unsafe.Add(ref array1, i) ^ Unsafe.Add(ref array2, i);
              }
          }

          [Benchmark]
          public void Unrolled()
          {
              var length = _array1.Length;
              for (var i = 0; i < length; i += 4)
              {
                  _array3[i] = _array1[i] ^ _array2[i];
                  _array3[i + 1] = _array1[i + 1] ^ _array2[i + 1];
                  _array3[i + 2] = _array1[i + 2] ^ _array2[i + 2];
                  _array3[i + 3] = _array1[i + 3] ^ _array2[i + 3];
              }
          }

          [Benchmark]
          public void UnrolledRef()
          {
              var length = _array1.Length;
              ref var array1 = ref MemoryMarshal.GetArrayDataReference(_array1);
              ref var array2 = ref MemoryMarshal.GetArrayDataReference(_array2);
              ref var array3 = ref MemoryMarshal.GetArrayDataReference(_array3);
              for (var i = 0; i < length; i += 8)
              {
                  Unsafe.Add(ref array3, i) = Unsafe.Add(ref array1, i) ^ Unsafe.Add(ref array2, i);
                  Unsafe.Add(ref array3, i + 1) = Unsafe.Add(ref array1, i + 1) ^ Unsafe.Add(ref array2, i + 1);
                  Unsafe.Add(ref array3, i + 2) = Unsafe.Add(ref array1, i + 2) ^ Unsafe.Add(ref array2, i + 2);
                  Unsafe.Add(ref array3, i + 3) = Unsafe.Add(ref array1, i + 3) ^ Unsafe.Add(ref array2, i + 3);
                  Unsafe.Add(ref array3, i + 4) = Unsafe.Add(ref array1, i + 4) ^ Unsafe.Add(ref array2, i + 4);
                  Unsafe.Add(ref array3, i + 5) = Unsafe.Add(ref array1, i + 5) ^ Unsafe.Add(ref array2, i + 5);
                  Unsafe.Add(ref array3, i + 6) = Unsafe.Add(ref array1, i + 6) ^ Unsafe.Add(ref array2, i + 6);
                  Unsafe.Add(ref array3, i + 7) = Unsafe.Add(ref array1, i + 7) ^ Unsafe.Add(ref array2, i + 7);
              }
          }

          [Benchmark]
          public void Simd()
          {
              var length = _array1.Length / Vector<int>.Count;
              ref var array1 = ref Unsafe.As<int, Vector<int>>(ref MemoryMarshal.GetArrayDataReference(_array1));
              ref var array2 = ref Unsafe.As<int, Vector<int>>(ref MemoryMarshal.GetArrayDataReference(_array2));
              ref var array3 = ref Unsafe.As<int, Vector<int>>(ref MemoryMarshal.GetArrayDataReference(_array3));
              for (var i = 0; i < length; i++)
              {
                  Unsafe.Add(ref array3, i) = Unsafe.Add(ref array1, i) ^ Unsafe.Add(ref array2, i);
              }
          }
      }

      // | Method      | Mean     | Error     | StdDev    | Ratio |
      // |------------ |---------:|----------:|----------:|------:|
      // | Simple      | 9.944 ms | 0.0308 ms | 0.0241 ms |  1.00 |
      // | SimpleRef   | 6.850 ms | 0.0546 ms | 0.0511 ms |  0.69 |
      // | Unrolled    | 6.354 ms | 0.0441 ms | 0.0391 ms |  0.64 |
      // | UnrolledRef | 5.198 ms | 0.0427 ms | 0.0356 ms |  0.52 |
      // | Simd        | 5.241 ms | 0.0363 ms | 0.0303 ms |  0.53 |

      50% faster, but also obviously less pleasing to look at :D SIMD really depends on the array boundary; it could be faster or equally as fast.

      @TheOneAndOnlySecrest · 3 months ago
    • @TheOneAndOnlySecrest "0 effort" - for the sake of completeness, you should add the lines responsible for covering the entire range; Unrolled() does not process the entire range. In rare cases (heavy number crunching) unrolling makes sense, but using this as general advice for any loop is meaningless.

      @piotrc966 · 3 months ago
  • If performance is so critical to your application that you're considering loop unrolling, then maybe C# isn't the best language for the task.

    @andywest5773 · 3 months ago
    • What if performance is not critical in 98% of cases, but in the other 2% it is? Should we really switch to C++ or assembler and complicate support so much more?

      @zachemny · 3 months ago
    • @zachemny Nick's point remains. If you are trying to eke out performance in that 1% use case, there are likely 1,000 things you could do that will give you a better benefit before having to resort to this.

      @drewkillion2812 · 3 months ago
    • @drewkillion2812 Each and every advanced performance optimization will cost you some readability, and loop unrolling is in no way different in that regard. It's one of the cheapest and easiest to implement, in fact. Compare it to AVX intrinsics, for example.

      @zachemny · 3 months ago
    • @zachemny Well, Microsoft, Google, etc. do. But remember, the rule of thumb is that it takes 10 times longer to develop in those faster languages. I understand your point, and that's why huge companies with money to burn will generally turn out faster code.

      @saberint · 3 months ago
    • @drewkillion2812 Loop unrolling is no different from other performance techniques in that it takes some readability from your code and gives you some performance. It's not better or worse. In fact, it is fairly straightforward compared to other tricks, like AVX intrinsics or branchless programming, and doesn't pollute the code much.

      @zachemny · 3 months ago
  • Bonus points: rewrite the code using Array.Sum() (the LINQ method), given you're basically doing a sum here. In .NET 6 that would be a horrible choice, but in .NET 8 the vectorization improvements have done wonders.

    For loop: 3.1 us in .NET 6, 3.4 us in .NET 8
    Unrolled loop (size 4): 3.1 us in .NET 6, 2.4 us in .NET 8
    Unrolled loop (size 8): 2.8 us in .NET 6, 2.4 us in .NET 8
    Array.Sum: 38.8 us in .NET 6, 0.6 us in .NET 8

    So loop unrolling is actually better in .NET 8: I saw a drop from 3.1 to 2.4 us, which is better than a 20% improvement compared to .NET 6, and a 30% improvement compared to the simple for loop, given that basic for loops seem to have gotten a bit slower. Also, unrolling 4 elements at a time is now the same speed as 8 at a time, whereas 8 at a time used to be faster in .NET 6; so the compiler may be compensating for gains that used to require more manual work. However, the vectorization work done in LINQ in .NET 8 has turned it from an order of magnitude worse in .NET 6 (38.8 us) to an order of magnitude better (0.6 us). It's 4x faster than the unrolled loop, and the clear best choice for this particular scenario. These tests were done on a Ryzen 3700X with an array size of 10_000.

    @David-id6jw · 3 months ago
    • Note that doing the *2 calculation shown in the original article eliminates the gains of using LINQ and makes it a bad choice again (about 20 us). Apparently Array.Sum(a => a * 2) doesn't vectorize the extra calculation. Also, just implementing your own version using Vector<T> is much better, putting it on par with the Enumerable.Sum() function, but you can still stay fast with other operations (such as the *2) and not lose much speed at all. I'm getting 0.6 to 0.9 us with a few different variations on the code.

      @David-id6jw · 3 months ago
    • Yeah, after messing with the code and getting rid of some unnecessary stuff, the Vector implementation is at 625 ns for a straight sum, and 635 ns for the *2 sum. If you want to speed things up and don't mind a slight increase in complication, go for the Vector loop instead of loop unrolling.

      @David-id6jw · 3 months ago
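For reference, a sketch of the kind of Vector<T> loop being described, here as a plain sum (assumes .NET 6+ for Vector.Sum; illustrative code, not the commenter's exact implementation):

```csharp
using System;
using System.Numerics;

// Sums an int array using Vector<int>, with a scalar loop for the tail.
static int SumVectorized(int[] a)
{
    var acc = Vector<int>.Zero;
    var width = Vector<int>.Count; // e.g. 8 ints with AVX2
    var i = 0;

    for (; i <= a.Length - width; i += width)
    {
        acc += new Vector<int>(a, i); // one SIMD add per 'width' elements
    }

    var sum = Vector.Sum(acc); // horizontal add across the lanes

    for (; i < a.Length; i++) // scalar remainder
    {
        sum += a[i];
    }

    return sum;
}

var data = new int[10_000];
for (var i = 0; i < data.Length; i++) data[i] = 1;
Console.WriteLine(SumVectorized(data)); // 10000
```

Vector<T> automatically uses the widest SIMD registers the hardware supports, so the same code runs on SSE, AVX2, or no SIMD at all.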
    • @David-id6jw Just sum everything first and then apply the *2? But I agree: when trying to optimize, use built-in methods as much as possible; these have been implemented in native assembly/C, which is always faster than doing it manually in C#. This goes for every interpreted language btw, like PHP, Java, JavaScript, etc.

      @imqqmi · 2 months ago
  • When people try to be smarter than the compiler.... it usually fails.

    @J_i_m_ · 3 months ago
    • This is a standard SIMD way of doing things; go read up on the Vector class and you will get significant performance.

      @FilipCordas · 3 months ago
    • The compiler can't fully unroll a loop. The compiler isn't a magic box.

      @curtis6919 · 3 months ago
    • @curtis6919 Ever watched the assembly that gcc produces? You would be amazed how far a compiler can go...

      @J_i_m_ · 3 months ago
    • @J_i_m_ Yes, and it can't go far enough.

      @curtis6919 · 3 months ago
    • @J_i_m_ True. But that's not the case for C# ;-)

      @igorthelight · 2 months ago
  • Unrolling's greatest advantage is not the reduction of the loop overhead, but the parallelism it allows at the CPU level. Further, loop unrolling sets up your code to be parallelized using SIMD, if you need the performance. However, compilers will likely perform these micro-optimizations better than you do - so always measure!

    @sheep2909 · 3 months ago
    • Agreed; What Nick says is not wrong, just a bit too biased for my taste. Like so often it depends on the use case, and in maybe 98% of them unrolling is not worth it. Yet at the same time, it can be mandatory for time critical loops. Simple example: I need to clear an RGB array with a certain color value asap - or generally spoken, it makes sense for (realtime) image or audio data manipulation.

      @cooperfeld · 3 months ago
    • @cooperfeld Aren't we now in territory where using CUDA might be just as, or even more, efficient?

      @Mipzhap · 3 months ago
    • @@cooperfeld In my opinion, the few edge cases where manual unrolling would be justified are so rare that it should never be a "general" advice. It should be filed under "Dangerous solution for super time critical code, use at own risk" or something ;)

      @davidmartensson273 · 3 months ago
    • I know the ifort (Intel Fortran) compiler is rather excellent at these things - unrolling parts of loops, rearranging loop order where possible and beneficial, etc. Quite interesting what it can accomplish.

      @erikthysell · 3 months ago
    • @Mipzhap You are right; I was thinking of a C#/.NET/Unity environment, which *might* not be the best area to use this ;)

      @cooperfeld · 3 months ago
  • Also Nick Chapsas: fastest way to iterate a list **unsafe Marshal** Edit: wait just saw the code what in the SIMD hell is this

    @asedtf · 3 months ago
  • Well... the example given at 2:19 only works for lengths that are a multiple of 4; otherwise it will require additional checks for each line. And then the compiler will do whatever it wants. I remember, long ago, checking what one of the first versions of the MS C compiler did. The task was to copy an array of 7 characters. The assembly code generated was just three instructions: move dword, move word, move byte. And that was very long ago; since then, I assume compilers and optimizers have made great progress.

    @czajla · 3 months ago
    • Yeah, this is what I came here to mention as well. These unrolling efforts only work if you can either guarantee the multiple or handle it, which, in my opinion, would take more code for the checks and handlers. Otherwise you end up with index-out-of-bounds exceptions that need to be handled. Or you have to ensure the multiple - and what if it doesn't fit? Do you put a switch on the multiples and write 4, 5, 6 unrolling logic blocks/methods, with a default that just loops through one by one? The overhead of handling the unrolling doesn't seem worth the extra headache of this 'trick'.

      @dmstrat · 2 months ago
    • No need for additional branches: you pick an unrolling factor that is a power of two, so that you can ensure the division is fast (no matter what CPU you have). You divide the loop limit beforehand and do the remainder after the loop. That saves you from running into branches each time you run through the loop. That said, modern compilers do exactly that...

      @lokolb · 2 months ago
    • @lokolb You mean you create, let's say, an unrolled loop that does 4 at a time, then at the end you put a loop to handle the up to 3 leftover elements? Replacing 4 with x and 3 with x-1 to do basically any size? And that's more maintainable code?

      @dmstrat · 2 months ago
  • Seeing several comments where people say they never needed to do this... meaning they never needed performance-critical code. Which is true in most cases; it's a last-resort optimization, so 99% of people shouldn't. That doesn't mean it isn't performant - Nick proved that it is. As a few others noted, it's for experts who actually need the low-level optimization and understand the pitfalls. Anecdotes of never using something are not proof of anything other than ignorance.

    @VeNoM0619 · 3 months ago
  • I will not compromise code readability for just a nanosecond of performance! If I tried to read my code ages later, I would have to figure out what I did here and for what reason.

    @user-bf4eu8cb2y · 2 months ago
  • Eventually, one of these videos is just going to have the "advice" up on the screen, with Nick staring directly into the camera with a very annoyed look for a solid 20 seconds, and then he just snaps "WTF" and slaps the camera and the video ends. 😄

    @asteinerd · 3 months ago
  • This is a great tip... for programming on the Commodore 64!!!

    @MichaelBattaglia · 3 months ago
    • 😂 Same Age!

      @Navid7h · 2 months ago
  • 3:20 Actually, the first thing I picked up is that the unrolled version only works if the number of items is a multiple of 4. He doesn't even use a Duff's device to make it compatible with any range...

    @billy65bob · 3 months ago
  • I'd be more concerned with what kind of struct I am looping over. Typically, if I really care about performance - rather than unrolling, which I've never had the need to do because array length bounds are not entirely predictable - I'd do a zero-allocation conversion to ImmutableArray or ReadOnlySpan if I really wanted a standard loop to execute 'faster'. Failing that, I'd have to rethink the language of choice. But it feels like a lot of this falls under 'premature optimization'.

    @andyfitz1992 · 2 months ago
  • Loop unrolling is a good and fairly easy technique when used in appropriate situations, i.e. in hot paths where you really need performance - even in C#. C# can be different and suit different tasks, and the author here suddenly forgot about that and threw a possible performance boost out the window. If you don't need or care about performance, then just skip optimizations and use LINQ instead. If you do, however, then give it a try. It doesn't ruin your code much from a support perspective, because all the similar lines are localized and easily understood. If you really value the DRY principle, however, there is a neat trick to avoid duplication with clever use of struct generics.

    @zachemny · 3 months ago
  • Bruh, I thought their advice would only apply to const collections (where the length is determined at compile time), but they actually extended it to variable lengths. Respect 👏

    @parlor3115 · 3 months ago
  • "Performance" word inventor will forever be haunted

    @dii2 · 2 months ago
  • Hi Nick, I think there is a problem in your benchmarks: in the first run, loop unrolling reduces the execution time by ~3.5%, but in the second run the execution time is reduced by ~40%! I pretty much agree with the fact that one should never use loop unrolling in C#, but if you base your demonstration on numbers, they must be consistent.

    @mareek1443 · 3 months ago
    • I was scrolling through comments just searching for this. At first in the video I completely dismissed the idea because of marginal gains. And then it's almost half. That's actually really solid improvement. 😅

      @kaliCZE · 3 months ago
  • This is (among other things) what an optimizing compiler is for, and a JIT compiler such as the one the CLR uses ought to be able to figure out dynamically whether unrolling will be a win for any given loop. Doing it in your C# just makes the code unreadable. I'd even guess it might prevent some of the optimizations that the compiler might otherwise be able to make.

    @cronintechnology9901 · 3 months ago
  • Your first benchmark has an output data dependency so it would never be vectorized as effectively as the shown example. Unrolling (and adding SIMD which C# now lets you do) helps a ton for dense matrix/vector maths. While marshaling it to unmanaged might be slower.

    @EraYaN · 3 months ago
  • Maybe you should mention Big O notation in cases like this. The time complexity is exactly the same, so you don't really need to bother with it: it is O(n) no matter whether you write it as one loop, an unrolled loop, or even multiple loops iterating several times over the same array and doing part of the calculation each time. You will save much more time fixing stupid things that actually affect complexity than things within the same complexity class.

    @speedy3749 · 2 months ago
  • Good video, thanks! Aside: back in the day we sometimes even optimized for CPU "load shadows", wherein you could reorganize the calculation sequence to minimize register load wait times, so that a calculation on one register could execute "in the shadow" of another that was being loaded. I just googled that term and nothing remotely relevant is even found, lol. I wonder if people still do that at a very low level, like in compilers.

    @rreiter · 3 months ago
  • That advice works only in interpreted languages; in compiled languages the compiler unrolls the loop (if it can, and if it's reasonable) automatically as part of optimization.

    @hydro63 · 3 months ago
  • Yes, this sort of thing is usually done in C++ to squeeze out some more performance, but nowadays even in C++ it is often better to let the optimizing compiler do it for you automatically, when it decides that it makes sense in a particular case.

    @ivanp_personal · 3 months ago
  • Why do you use 5 times for loop unrolling? There is a reason why we unroll 2x, 4x, 8x, 16x times -> CPU cacheline! If you use odd times like 5, you will end up between two cachelines - which hurts your performance - especially with parallel execution and hyper-threading. Also there is actually a valid use-case for unrolled loops, even in C# and its called SIMD! If you use SIMD you don't process one piece of data, you process 4, 8, 16 or even 32 data slots doing bit or math manipulations with one CPU instruction in just a few cycles. To properly work with SIMD, its a good practice to unroll your loops beforehand to 2x, 4x, 8x so that you can much easier translate it to SIMD. This is the only case, i think loop unrolling is useful. If i translate code to SIMD, i always do this first (simulating SIMD) in this way and then simply call Vector.Multiply() or what ever SIMD operation i want to perform. Also you normally dont perform one intruction, you perform multiple of those in sequence and at the very end you most likely need to get it back into your normal memory structures. Also one last note: Every modern CPU uses SIMD instructions for almost every math and bit operation, but it is not used efficiently because values are duplicated between all lanes and only one lane is taken out at the end. Example: Multiplying a single float with a single float is done with SIMD instructions, e.g. AVX-2 with 8 lanes, so 8 times floating operations, for one floating point operation - wasting 7 floats. With AVX-512 you can process up to 16 floating points at once, or 64 bytes/bools. Please consider talking about this in another video, using the Vector namespace its really easy to use SIMD in C# / .NET. Used properly you can drastically improvement performance of computations by 4x, 8x or even higher!

    @F1nalspace@F1nalspace2 ай бұрын
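The Vector-based approach described in the comment above might be sketched roughly as follows. This is only an illustration, not a benchmarked implementation; `SimdSum` is a made-up name, and `Vector<int>.Count` picks the hardware lane width automatically, so the code works whether the CPU has SSE, AVX2 or AVX-512.

```csharp
using System;
using System.Numerics;

int[] data = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
Console.WriteLine(SimdSum(data)); // 55

static int SimdSum(int[] values)
{
    int width = Vector<int>.Count;          // hardware lane count, e.g. 8 with AVX2
    var acc = Vector<int>.Zero;
    int i = 0;
    // Process full vector-width chunks: one SIMD add per chunk.
    for (; i <= values.Length - width; i += width)
        acc += new Vector<int>(values, i);
    // Horizontal reduce: add up the per-lane partial sums.
    int sum = 0;
    for (int lane = 0; lane < width; lane++)
        sum += acc[lane];
    // Scalar tail for the 0..width-1 leftover elements.
    for (; i < values.Length; i++)
        sum += values[i];
    return sum;
}
```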
  • I used this in a C class; there are a lot of things to consider. It can work for operations in unmanaged situations, like live video encoding or live audio effects: situations where the utmost performance is needed.

    @alexandernava9275@alexandernava92753 ай бұрын
    • Also, that example isn't doing it right. You need to check whether the length is divisible by the number of unrolls you do. Also note that the number of unrolls that increases performance depends on what you are doing and what hardware you're on.

      @alexandernava9275@alexandernava92753 ай бұрын
    • Ah, you don't do a check per operation, you just do a check on the length up front. Once you pass the last point where you can unroll, you do a normal loop over the remaining elements.

      @alexandernava9275@alexandernava92753 ай бұрын
  • Me instantly noticing the index out of range bug in the good example 🙈

    @JarleXXX@JarleXXX3 ай бұрын
  • Number of times I've needed to do this over the last 30 years: zero, in all languages (C, C++, C#). The closest I've ever got is converting something that was using a loop to use Vector256... and changing to spans instead of arrays got 90% of the performance improvement. If it wasn't something that profiling showed was on the critical path, it certainly wouldn't have been worth it.

    @MarkRidgwell@MarkRidgwell3 ай бұрын
    • You didn't need to do this, but the technique is still useful. Just not a first level optimization.

      @VeNoM0619@VeNoM06193 ай бұрын
    • Then you've never written performance critical code. This is a stock-standard technique for game development and high-frequency trading.

      @curtis6919@curtis69193 ай бұрын
  • I would say that the loop body in the benchmarks is "empty". It may be worth adding some functionality to the loop to demonstrate the impact of the proposed pseudo-optimization on "heavy" loops. It might also be a good idea (especially for short loops) to handle the situation when the number of elements in the array isn't a multiple of the step.

    @vanstrihaar@vanstrihaar3 ай бұрын
  • It's kinda strange that the unrolled loop has better performance; I thought the C# compiler would optimize loops using SIMD or unroll them. For example, GCC with the -O3 flag will optimize loops using SIMD, so a manually unrolled iteration could even be slower than the normal one (I would have to benchmark that, but looking only at the assembly, the non-unrolled version seems to be compiled more efficiently).

    @scuroguardiano9787@scuroguardiano97873 ай бұрын
  • The biggest (not big, biggest) gain here is fewer branch predictions for the condition checks. It would be nice if we had a way to tell the virtual machine that one outcome of the "if" is far more probable than the other.

    @rafazieba9982@rafazieba99823 ай бұрын
  • 8:25 - that's really easy to resolve without additional checks by iterating up until the length minus the size of the unrolled chunk and then dealing with the last chunk outside of the loop. Not advocating for this, but if you're implementing something performance-critical, it would be an easy thing to address. Adding checks within the loop would be the naive way to deal with this scenario; change the loop condition to [i < len - 4] instead. The problem is that the poster implies this is universally applicable, and this YT video implies the same thing IMHO. For non-hot-paths and non-performance-critical code, definitely don't do this. If you are on a hot path, consider whether it'll make an impact.

    @TheAceInfinity@TheAceInfinity3 ай бұрын
  • For all the naysayers, look at the source for the HashCode class. As for the benchmark the unrolled code is almost 2 times as fast, and that was with 5 ints, which can't be vectorized. 4 or 8 would be better. As for the bounds check, you do that first, use the unrolled loop for the first multiple of the unrolling, then the remaining one at a time. Of course you wouldn't unroll loops as the first step of programming, but once you've narrowed down performance bottlenecks, it's definitely a valid tool. Along with SIMD and refs or pointers of course.

    @phizc@phizc3 ай бұрын
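The pattern described above (bounds handled once up front, an unrolled main loop over the largest multiple of the unroll factor, then a scalar remainder loop) might look like this sketch. `SumUnrolled` is an illustrative name and an unroll factor of 4 is assumed:

```csharp
using System;

int[] data = { 1, 2, 3, 4, 5, 6, 7 };
Console.WriteLine(SumUnrolled(data)); // 28

static int SumUnrolled(int[] values)
{
    int sum = 0;
    int i = 0;
    // Main loop: four elements per iteration, so the loop condition
    // is evaluated once per chunk instead of once per element.
    for (; i <= values.Length - 4; i += 4)
        sum += values[i] + values[i + 1] + values[i + 2] + values[i + 3];
    // Remainder loop: the 0-3 leftover elements, one at a time.
    for (; i < values.Length; i++)
        sum += values[i];
    return sum;
}
```

Note that when the array has fewer than 4 elements the main loop simply runs zero times and the remainder loop handles everything, so no extra bounds checks are needed inside the loop body.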
  • I've unrolled loops in shaders, so it's a useful technique in specific applications where you have an execution time budget. It is also super important to mention that it is highly unlikely to be a bottleneck in a big application, and if it is, it has to be at the bottom of the optimization todo list. That being said, it really hurts that the guy is being called dumb while the numbers in the benchmark show that his code is actually faster. Yes, you will have ugly code (which is almost always the case when you start getting closer to the metal), but I really wish we shipped optimized, faster-working products, so that I don't need to buy a new machine just to run a piece of software faster. You are shipping products, not shipping code, and I'm totally fine with slightly uglier code to gain speed :P

    @happypixls@happypixls2 ай бұрын
  • This feels like something some university tutor would come up with for a stupid programming test.

    @Punkologist@Punkologist3 ай бұрын
  • Would unrolling for a multiple of what you can fit in a vector register help the compiler vectorize your loop?

    @EtienneFortin@EtienneFortin3 ай бұрын
  • You made me fall in love with clean, performant code and like coding again, and made my job so much more fun... but some of the LinkedIn advice is, um, interesting!

    @willemavw@willemavw3 ай бұрын
  • Why did you use unrolling by 5? I'm sure you know about memory alignment, and it's *very important* when you are doing this kind of optimization. Depending on your CPU architecture, use 4 or 8, or a multiple of those. The same is true for page sizes and the like. A good rule of thumb is an outer loop of 64 and an inner loop of 8, with a last loop for the rest.

    @Misteribel@Misteribel2 ай бұрын
  • I think benchmarking the original code would show a much greater difference in favour of unrolling, but due to another change: in the first example Array.Length (a property, which is actually a function in C#) is inside the loop, while in the second example it is "len" (a field, which is a value).

    @user-fr2tk1we7r@user-fr2tk1we7r3 ай бұрын
  • Long-long time ago I wrote my own 3x3 matrix multiplication method, and when I tested it, it was slightly faster without loops, with every multiplication and sum written out manually for every element. Don't know how it is with modern compilers though.

    @Knuckles2761@Knuckles27613 ай бұрын
    • How long ago is "long-long time ago"? If you were in a galaxy far away, it would make sense, but compilers in our universe have been unrolling loops for at least a decade. .NET, when combined with dynamic PGO is even better.

      @colejohnson66@colejohnson663 ай бұрын
    • @@colejohnson66 about 15-18 years ago, and I think it was on Pascal. Never bothered with such micro-optimization since.

      @Knuckles2761@Knuckles27613 ай бұрын
  • This is a performance boost of ~60%, which is quite a lot. For performance-critical areas this is a big increase. However, remember that in this case the operation within an iteration is simple... if the iteration itself is demanding, the overhead from the loop becomes negligible and the operation itself dominates. So doing this is only reasonable for many iterations with a simple action in each one. At that point it raises the question: if performance is so important but the operation is still simplistic, why not delegate it to C or C++ code?

    @brianviktor8212@brianviktor82123 ай бұрын
  • There was room for more benchmarks, also showing the exact benefit an unroll factor of 4 would have, and maybe unrolling by 8.

    @AlFasGD@AlFasGD3 ай бұрын
  • Could using a span cut down the operation time to bring it to be closer/faster than unrolling?

    @jcx-200@jcx-2003 ай бұрын
  • I've never used loop unrolling except to try it out when I was learning programming in the late 90s. It made a lot more sense back then, and even then it didn't make much sense.

    @discerningfreedom4124@discerningfreedom41243 ай бұрын
  • I'm actually surprised how much the performance increased. I was expecting it to be negligible, although still not something worth doing unless you really need those extra few ns, in which case C# probably isn't the right language to use in the first place lol

    @bluesillybeard@bluesillybeard3 ай бұрын
  • If I know that a list contains only one item, is it more efficient to do a .FirstOrDefault or a foreach over it? In my project scaffolding, every layer returns a dto object with a list which can either contain stuff or be empty.

    @AbhinavKulshreshtha@AbhinavKulshreshtha3 ай бұрын
  • That only works if the size is known and divisible by the step size, or at most one step past the array length. Performance boosts can only be seen if the data structure has several hundred thousand, if not millions of, entries. I don't know how C# handles loops, but incrementing i by the step size at the end, then jumping back to the loop header to test the condition, should do the job.

    @philperry6564@philperry65642 ай бұрын
  • When I was doing games dev in Unity, I saved about 1ms by unrolling every single loop in all of the hot code. Maybe it was a quirk of the Unity engine, or maybe my loops were bad. Maintainability wasn't impacted a whole lot; the unrolled code was mostly loops of 3 over x, y and z positions, so very easy to understand at a glance when unrolled.

    @plyr2@plyr23 ай бұрын
  • "If the overhead of the loop is more than what is happening within the loop itself." You're wrong. Loop unrolling doesn't only seek to minimise the impact of the loop itself (which any reasonable compiler will turn into a while loop, anyway). Rather, loop unrolling allows the CPU to process as many array indices at one time, rather than one a time, due to ILP. This occurs because when you unroll a loop, the compiler can see that each loop iteration is not dependent on the previous (something that can't be implicitly determined by the compiler). Furthermore, loop unrolling allows you to see if you can vectorise a solution, allowing you to use SIMD instructions--increasing performance by orders of magnitudes. In your example, you also increment by 5. This, again, is wrong. You always want to increment with powers of 2. FURTHERMORE, your example makes a VERY bad mistake of using a single accumulator (what you call `count`). The point of loop unrolling is to have the number of accumulators for the number of times you've unrolled the loop. If you unroll it 5 times like you did in your example, then you need 5 accumulators (`count0`, `count1`, and so on). Then AFTER the loop is complete you perform the operation *1 more time*, combining the accumulators together--in your case one final addition of all the counts. I benchmarked code in C# 11 which unwinds the loop properly (with NO SIMD vectorisation) and the unwound loop performs 2x for not much more complexity. The compiler--nor the JIT code execution--is not a magic box that outputs the best code possible.

    @curtis6919@curtis69193 ай бұрын
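A rough sketch of the multiple-accumulator idea described above (illustrative only; `SumFourAccumulators` is a made-up name, the unroll factor is 4, and no SIMD is used). Because the four partial sums have no data dependency on each other, the CPU can keep more than one addition in flight per cycle:

```csharp
using System;

int[] data = new int[1000];
for (int n = 0; n < data.Length; n++) data[n] = n + 1;
Console.WriteLine(SumFourAccumulators(data)); // 500500

static int SumFourAccumulators(int[] values)
{
    int s0 = 0, s1 = 0, s2 = 0, s3 = 0;   // independent dependency chains
    int i = 0;
    for (; i <= values.Length - 4; i += 4)
    {
        s0 += values[i];       // these four adds do not depend on
        s1 += values[i + 1];   // each other, so the CPU can issue
        s2 += values[i + 2];   // them in parallel (ILP)
        s3 += values[i + 3];
    }
    int sum = s0 + s1 + s2 + s3;           // combine once, after the loop
    for (; i < values.Length; i++)         // scalar tail
        sum += values[i];
    return sum;
}
```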
  • Simplicity is elegant. Idk why so many people in this field want to make things more complex, chasing efficiency for their current use case while adding unnecessary complexity in the long term. Coding psychology needs to be a course in computer science so newer grads get a more holistic view of programming.

    @yashpatel261@yashpatel2612 ай бұрын
  • Would have loved to see a benchmark where you include some out-of-bounds checks and whatnot. I think it's pretty ugly and I'll probably never use it, but I would have loved to know the result, because speed is speed, so there might be some instances where it's still useful! Great video!

    @cloudenvier2260@cloudenvier22602 ай бұрын
  • What if the work you need to do with the collection is CPU bound, and you spawn as many workers as there are cores on your device and run them concurrently, given that you are doing boundary checks and that the collection is thread safe? Wouldn't that be faster?

    @salehdanbous6783@salehdanbous67833 ай бұрын
  • The compiler actually uses unrolling and also vectorisation. So these mechanisms are needed, just usually not in C# code.

    @user-zk5ym9ut1j@user-zk5ym9ut1j3 ай бұрын
  • What is the benefit of using a loop if I already know the array size? And if I know the size, why increase the index by some x when I can access the elements directly? 😮 If you don't care about the execution sequence, go for parallel loops.

    @Darawsheh@Darawsheh3 ай бұрын
  • Performance increase is best measured not in absolute terms but relative ones. That 100ns improvement is about 3.6%. I agree that isn't spectacular, but in some cases it may still be significant: 3.6% of 1 day is still over 50 minutes. Still, for the vast majority of cases, 3.6% is not worth it. So, good to have seen this put in perspective. Good thing to bring up scaling: if scaling is worse than linear and large numbers are to be expected, unrolling may be a good idea. But that is such a niche case that making it a general rule is totally unwarranted.

    @frankroos1167@frankroos11672 ай бұрын
  • My first step to increase performance would be to remove all the Linq statements. I can only hope they speed it up over time.

    @T___Brown@T___Brown3 ай бұрын
  • Loop unrolling has been a compiler optimization for decades. And modern CPUs apply branch prediction, which can also help to further reduce the overhead due to the condition checking in loops. And modern managed programming language runtimes, like those of Java and .NET, are able to re-arrange code based on performance measurements during execution.

    @michaelschneider603@michaelschneider6033 ай бұрын
  • I got the impression that someone read up on batching and was trying to shoehorn batching into a loop in the post highlighted here. I'll include myself in the group that was cringing at the range-bound exceptions and had to endure that 'nails on a chalkboard' feeling for the rest of the video hehehe

    @travisabrahamson8864@travisabrahamson88643 ай бұрын
  • thanks for the tip

    @riccarrasquilla379@riccarrasquilla3793 ай бұрын
  • Seems to me the original advice would be great for someone who doesn't have to go back and maintain the code spaghetti that gets created. However, if I am going to be maintaining code, readability is paramount. I can't tell you how many times I have had to go back and review code from years prior and spent way too much time trying to figure out the previous coder's *elegant* solution.

    @danervin5138@danervin51382 ай бұрын
  • I usually agree with you on these, but not so much on this one. The performance improvement is clear, and if you're unrolling like this, you probably wouldn't even need bounds checking; if you did, it could easily be done outside the loop, with the remainder handled after. And I do wonder why you used 5 instead of 4?

    @SirBenJamin_@SirBenJamin_3 ай бұрын
  • Hi Nick, big congrats on your channel. I also saw courses from Steve Smith that I would love to subscribe to; I need to find the time to do it... About this clip: why are people so stressed about performance (in C# :-) :-) :-) )? I mean, think of the split-loop technique by Fowler... why is readability so ignored? Quoting someone slightly better than me: it's better to have non-working software that we can understand and fix than working software we cannot understand... Why is that so hard to get? Why do we have to struggle every day with people who don't even know how to deal with an interface?

    @danieleluppi6648@danieleluppi66483 ай бұрын
  • I always go for the KISS strategy :)

    @aurelienpiquet6711@aurelienpiquet67112 ай бұрын
  • @5:23 Your example is different: in his example he writes to 4 separate array cells, which can be parallelized by the CPU or compiler; in your example all lines are interdependent, which excludes them from write optimizations. Also, his example doesn't do any reads at all, which also makes it more amenable to optimizations. But generally I agree loop unrolling should not make much difference, especially in C#.

    @AK-vx4dy@AK-vx4dy3 ай бұрын
  • What chair is that, that you are using, other than an office chair? Brand, type?

    @Assgier@Assgier3 ай бұрын
  • I code on the SNES sometimes and yeah, unrolled loops are useful in some cases because of the SNES's poor processor speed, but in a modern language I think they're totally unnecessary.

    @anonimxwz@anonimxwz3 ай бұрын
    • What was it, 2.9 MHz on the SNES? That makes you think differently about individual operations.

      @sealsharp@sealsharp3 ай бұрын
    • @@sealsharp A single core running between 1.79 MHz and 3.58 MHz, according to Wikipedia

      @fg345ergdfg45@fg345ergdfg453 ай бұрын
  • One year after, Nick: you really don't need to use spans here... 😅

    @antonmartyniuk@antonmartyniuk3 ай бұрын
  • If I really need to increase loop performance, I would rather look into the SIMD extensions and check if those can be utilized on my data set. I tend to prefer readability over performance in most cases though.

    @peryvindhavelsrud590@peryvindhavelsrud5902 ай бұрын
  • How does the optimization work? Does it exploit the CPU executing additions and multiplications in parallel?

    @luboshemala3485@luboshemala34853 ай бұрын
    • I am far from an expert on performance. My best guess is that you are reducing the number of jump-on-condition operations the processor needs to do. These operations can prevent the CPU from looking ahead to fetch and execute the next instructions without interruption, as it does not know where it will continue from until the jump executes, effectively breaking the flow. In this case, at the end of the loop body the CPU does not know whether it needs to exit the loop or jump back to the start until it evaluates the loop condition for the current iteration.

      @mafiamole@mafiamole3 ай бұрын
  • Reminds me of the old argument about using ++i instead of i++ for performance reasons where the performance gain is literally so small it's not worth worrying about in any context

    @senti2306@senti23063 ай бұрын
  • On some old processors, I'd say "go for it". But since the Pentium, there's an optimisation for conditional jumps: the CPU does not wait for the test result and rolls back when the result is not what was expected. It improves loop performance by a lot. As I don't think you'll be able to execute your .NET code on a 486, the advice is really not worth it. If you really are into performance, consider using unsafe and pointers before unrolling. And keep in mind your compiler may do it for you.

    @warny1978@warny19783 ай бұрын
  • I’ve only seen loop unrolling in HLSL and other shader languages funny to see you test it in C#

    @DaddyFrosty@DaddyFrosty3 ай бұрын
  • I think the extra checks for array length inside the loop will nullify this performance boost

    @andriiyustyk9378@andriiyustyk93783 ай бұрын
  • This is the sort of micro-optimisation that is penny wise and pound foolish. Most systems won't even approach the size and performance where this is near the top of the list of things to consider.

    @sasukesarutobi3862@sasukesarutobi38623 ай бұрын
    • Totally agree. Adding 5x more complexity in exchange for a couple hundred microseconds or even milliseconds isn't worth it. What's next? "Don't use the garbage collector, manage memory yourself"? At this point why not just use C/C++?

      @yunietpiloto4425@yunietpiloto44253 ай бұрын
    • ​@@yunietpiloto4425 Because C# is still way easier to grasp than C/C++ and has better tooling, so it does make sense to use it even in these extreme scenarios.

      @diadetediotedio6918@diadetediotedio69183 ай бұрын
    • @@yunietpiloto4425 "5x more complexity" bro it's just addition. Are you really that scared of +?

      @curtis6919@curtis69193 ай бұрын
  • Loop control flow overhead? He means a CMP (compare) and JNE/JE ("jump if [not] equal") instruction? Oh, ok, so we might have an ADD/SUB and a MOV or two, but that's usually it, lol ... to get rid of that, you're losing the dynamic nature of the loop and either get locked into a certain number of iterations or have to write _additional_ control flow constructs anyways, and unrolling 4 or 5 steps is unlikely to result in tangible gains anyhow. You also can bloat your code size ... I've unrolled logic before, but usually in ASM/C/C++ or CUDA ... very rarely in C# code for 3D rendering. Like I might operate on several vertices per pass because it made sense in that context, like handling 3 indices of a triangle buffer or something I know will multiply/exponentially scale out.

    @GameDevNerd@GameDevNerd2 ай бұрын
  • If you want performance in loops, isn't Parallel.For() a better solution?

    @NairuYukoshi@NairuYukoshi3 ай бұрын
  • You wouldn't need the out-of-bounds check inside the loop though. Simply loop while i < length - 4 and handle the remainder of the items outside the loop. And since we're already on such a micro-performance-boost trip anyway, you wouldn't mind those extra lines.

    @W1ese1@W1ese13 ай бұрын
    • Not if the length of the collection is < 4. And this is just the first of the bugs you'll have.

      @parlor3115@parlor31153 ай бұрын
    • @@parlor3115 Works fine even if length is less than 4; you enter the loop 0 times.

      @taemyr@taemyr3 ай бұрын
    • @@taemyr But that's a bug. You should have 1 iteration to run your code.

      @parlor3115@parlor31153 ай бұрын
    • @@parlor3115 No because the only items you deal with are the remaining items that are handled outside the loop.

      @taemyr@taemyr3 ай бұрын
    • @@parlor3115 You're misunderstanding. The code would be something like this: var i = 0; var limit = length - 4; //First loop handles all except for the last 1-3 for(;i

      @UrielZyx@UrielZyx3 ай бұрын
  • I am surprised C# did not optimize this; today's compilers will often do this for us behind the scenes. In my world, a speedup from 45ns to 25ns is justified by some benchmarks for high-iteration code; that is almost twice as fast. We can add a comment "// unrolling code for better performance." But I am not fully on board with "premature optimization is the root of all evil" - that statement is not always true.

    @FunWithBits@FunWithBits3 ай бұрын
  • This is not the first time I've heard of this. The silly idea behind unrolling is that it will run much faster on the metal by halving or quartering the CPU's branch predictions on the conditional jumps inside loop structures, while using local L1 caching inside the loop to optimize access. It's silly, and it doesn't really work that way. If you had any memory-access code inside the loop body, it would wreck what little benefit you got in benchmarks.

    @MilYanXo@MilYanXo2 ай бұрын
  • 7:20 Can someone point me to the documentation on the ! for nulls? I've seen it pop up a couple times but never understood it.

    @C00l-Game-Dev@C00l-Game-Dev2 ай бұрын
  • What about using Parallel.ForEach to separate each iteration in a separate Task?

    @agustinsilvano6@agustinsilvano63 ай бұрын
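For what it's worth, Parallel.For has an overload with per-thread local state that avoids contending on a shared counter. This is only a sketch to illustrate the shape of the API; for a loop body as cheap as a single addition, the partitioning and task overhead would likely outweigh any gain, so it only pays off for heavy per-item work:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

int[] data = new int[100];
for (int n = 0; n < data.Length; n++) data[n] = n + 1;

long total = 0;
// Partition the index range across cores; each worker keeps a local
// subtotal so the threads don't fight over one shared variable.
Parallel.For(0, data.Length,
    () => 0L,                                    // per-thread initial state
    (i, _, local) => local + data[i],            // accumulate locally
    local => Interlocked.Add(ref total, local)); // merge once per thread
Console.WriteLine(total); // 5050
```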
  • Out of curiosity, would this be any faster? 😅 count += _array[i] + _array[i+1] + _array[i+2] + _array[i+3] + _array[i+4];

    @PhaaxGames@PhaaxGames3 ай бұрын
    • It doesn't appear to make any notable difference in either .NET 6 or .NET 8.

      @David-id6jw@David-id6jw3 ай бұрын
  • Loop unrolling is usually done with powers of 2, not 5; I wonder if that'd change things a bit. IIRC, loop unrolling adds a significant boost if you also have multiple accumulators (and of course, with SIMD). And I wonder how all of this compares to just making a native call and doing it in a native language. How does the cost of calling a native function compare to executing it in C#? And yeah, I agree this isn't something to consider if you don't need to squeeze out all of the performance that a machine can give. And at that point, why not just do it in a native language to begin with?

    @sepdronseptadron@sepdronseptadron3 ай бұрын
  • how often do you have to do that because your application was actually too slow?

    @Patterner@Patterner3 ай бұрын
  • The only market where I could see this being useful is games programming, where those nanoseconds matter, and code maintainability is second to performance. In particular, pixel art games that have arrays whose dimensions are a fixed size. That said, I would consider other forms of optimization first, before I tried loop unrolling. But I would not discount it. If you have a loop that can be easily unrolled in a game it is basically free performance. Outside of games; yeah, I wouldn't touch it with a ten-foot pole.

    @renynzea@renynzea3 ай бұрын
  • Perfect!

    @LucasOliveira-sn8ls@LucasOliveira-sn8ls3 ай бұрын
  • This might make more sense if I iterate over a Span and slice it each iteration. Then I can always access indices 0, 1, 2, 3, 4 and get rid of the "+" operations. But honestly I don't like the idea at all.

    @oozierus@oozierus2 ай бұрын
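The span-slicing idea could look roughly like this (a sketch only; `SumBySlicing` is an illustrative name, and whether dropping the index additions actually helps would need benchmarking, since the JIT may produce the same addressing either way):

```csharp
using System;

int[] data = { 1, 2, 3, 4, 5, 6, 7, 8, 9 };
Console.WriteLine(SumBySlicing(data)); // 45

static int SumBySlicing(ReadOnlySpan<int> values)
{
    int sum = 0;
    // Consume the span four elements at a time, re-slicing so the
    // current chunk is always addressed as [0]..[3].
    while (values.Length >= 4)
    {
        sum += values[0] + values[1] + values[2] + values[3];
        values = values.Slice(4);
    }
    foreach (int v in values)   // 0-3 remaining elements
        sum += v;
    return sum;
}
```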
  • Every time I make an API with a complex input parameter I wonder: "Why is the API response 500? What's wrong with the sent request?" Can I find the parameter serialization stack trace with Rin to understand what's wrong with the parameter?

    @varmelot@varmelot3 ай бұрын
  • Loop unrolling used to be much more common when computers were slower, especially for arrays with well-defined sizes, such as screen/video memory, or when operating on 4-byte values on 16-bit (or 8-bit) computers. Most of the benefit there was, in fact, due to checking less often whether you're at the end of the array: a straight local non-conditional jump is very fast. Today, it makes very little sense to do this, perhaps only when the algorithm executed by the loop is extremely simple/fast AND the number of items in the loop is extremely large. In any other situation, the performance benefits gained by unrolling the loop would simply be lost in the background noise.

    @Sanabalis@Sanabalis3 ай бұрын
    • No, loop unrolling has to do with SIMD and brings significant improvements when using the Vector class; you can check the docs for that.

      @FilipCordas@FilipCordas3 ай бұрын
  • Ah yes, my favourite trade-off: performance vs correctness. If unwinding can be safely done, it improves performance, and I bet the compiler does it under the hood anyway.

    @Herio7@Herio73 ай бұрын
  • Write loops in one line so compiler doesn't see it as a loop 😛

    @ivanz6368@ivanz63683 ай бұрын
  • Should have done 4 or 8 in the array, Nick, not 5. Think you missed the point: say, in DSP operations you get massive boosts from SIMD, and there are plenty of use cases where you might want these FMA ops to remain in your .NET codebase.

    @ianknowles@ianknowles2 ай бұрын
  • Benchmarking with the length-check operation included would reduce the performance gain; as you said, they just want to have a daily post.

    @Aqil665@Aqil6653 ай бұрын
  • Now do those guys who first do a LINQ Where operation and then iterate over the resulting collection, rather than just coding an if in the loop.

    @lukewebber5562@lukewebber55623 ай бұрын
  • It looks like an optimization that could be made by some extension or at compilation; it 100% should not be done manually.

    @Razeri@Razeri3 ай бұрын