Wednesday 5 October 2022

Apple M1 is much faster than people think

 Hey,

I experimented the other day with some highly parallelised/threaded algorithms written in Akka and Golang (https://breaking-the-system.blogspot.com/2022/10/how-we-can-learn-parallel-computing.html). I tested them on AMD Threadripper 2950X/3960X CPUs and on the Apple M1 and M1 Max. The results really shocked me. The Apple chips beat the Threadrippers by a long shot, even when the TRs were overclocked to 250W and 350W+. The TRs were among the fastest CPUs money could buy in their time and cost far more than the Apple chips (not counting the insane cooling and electricity they need), and the algorithm is highly parallelised.


In this post I want to experiment further with the Apple processors to find out what the hell is happening here. I want to test "memory parallelism", which basically means how much work the processor keeps doing while waiting for memory, and how many memory accesses the CPU can have in flight in parallel.

I searched the web and found this really popular article (https://lemire.me/blog/2021/01/06/memory-access-on-the-apple-m1-processor/) claiming 26x memory parallelism (to be fair, they wrote "or more"), which did not make much sense to me. The M1 has a low clock frequency and high memory latency, while the TRs have 70-80ns memory latency (according to AIDA64) and a higher frequency.

I ran the code provided by the blog (https://github.com/lemire/testingmlp), and it reported 10.1x parallelism for the TR and around 23x for the M1. This is a big difference, but considering all the other drawbacks it might not explain everything.

I wanted to test it for myself, so I created a pointer-chasing program (a linked list) and ran it on both my 2950X TR and my Apple M1 Max.

The M1 Max took 0.45 seconds, while the 2950X TR took 2.42 seconds.
Now let's calculate some stuff:

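The 2950X has 82ns memory latency (according to AIDA64).
82ns * 300M (the number of hops in the chase) = 24.6 seconds if every access had to wait for the previous one. Dividing by the measured 2.42 seconds gives the TR a memory parallelism of about 10.2 (in line with the 10.1 the blog benchmark reported).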

M1 Max:
100ns * 300M / 0.45 seconds means the M1 Max has a parallelism of about 66, over 6 times that of the x86 TR.
I might be totally wrong here, so please tell me if I am.
I think there are many other things the M1 does drastically differently from x86 that we have yet to understand.
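
The arithmetic above is easy to check in a few lines of Go. This is only a sketch: the helper name `parallelism` is made up here, and the latencies, hop count and timings are the figures quoted in this post.

```go
package main

import "fmt"

// parallelism estimates how many memory accesses overlap on average:
// the time the accesses would take back to back, divided by the measured time.
func parallelism(latencyNs, accesses, elapsedSec float64) float64 {
	serialSec := latencyNs * 1e-9 * accesses
	return serialSec / elapsedSec
}

func main() {
	const hops = 300e6 // 300M hops, as in the post

	tr := parallelism(82, hops, 2.42)  // 2950X: 82ns latency, 2.42 s measured
	m1 := parallelism(100, hops, 0.45) // M1 Max: 100ns latency, 0.45 s measured

	fmt.Printf("2950X:  %.1f\n", tr)
	fmt.Printf("M1 Max: %.1f\n", m1)
	fmt.Printf("ratio:  %.1f\n", m1/tr)
}
```

The estimate is only as good as the latency figure that goes into it, so a different latency measurement for either chip would shift the result proportionally.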





