Not a question, just sharing with the community.
We do our solving locally. We've had Insight Ultimate for about 4 years. We use a network solver model, where the engineers send the jobs to the solver directly. My engineers are still using version 2019.
Over the last year or so I've been testing 2021. Versions used are 2021 and 2021.1, nothing newer. Part of this was to evaluate SCM and how it's set up, and part was to research what kind of solver setup we should upgrade to, as our current ones turn 4 years old this spring.
I did a bunch of testing with different RAM speeds and channels to try and model performance of other theoretical setups, but below I am reporting numbers only for the normal performance configuration on the testing machines.
Unless otherwise noted, all solvers were running Windows 10 Pro.
The test file used for these benchmarks is a 3D mesh with 2.79 million elements, unfilled resin. The analysis was Cool + Fill + Pack + Warp, so the Cool solver is the default BEM solver, which uses only 1 thread.
Raw data is below. My conclusions are:
1) Disk speed is a non-factor as long as it's an SSD. No reason for separate OS and temp file drives.
2) The best mix of CPU single-thread speed and multi-thread throughput wins. If the all-core frequency drops off a cliff as core count rises, performance will suffer; this is really common with e.g. 24-, 32-, and 64-core CPUs.
3) Scaling to more cores is impacted by both RAM latency and bandwidth, with Warp being slightly more sensitive than Fill+Pack.
4) Cool solve time seems very sensitive to RAM Latency.
5) At 16 cores, the impact 8 channel DDR4 RAM vs 4 channel is approximately equivalent to a 10% CPU frequency increase. e.g. an 8 channel setup at 4.0 GHz would perform about the same as a 4 channel at 4.4 GHz. See Threadripper vs Threadripper Pro or Frequency Optimized EPYC for real world chips built like this.
6) Performance scaling past 8 cores is very poor in any configuration tested. In fact for 2 or 4 channel RAM, 90% of the max performance is achieved by 6 cores. For 8 Channel RAM on Zen2, there's only a 20% performance improvement moving from 6 to 16 cores. For Xeon Gold 6244 (6 channel RAM, high frequency 8 core CPU with hyper-threading), there's a 15% improvement moving from 6 cores to 16 Threads (8 cores with hyper-threading).
7) Always disable hyper-threading on any system with 4 or more cores unless you have an edge case (like an 8-core CPU). Note my Xeon 6244 testing was on an in-production server, so I could not reboot it to disable HT.
8) Because of the relationship of RAM bandwidth, latency, and core speed to performance, a server-grade box, which tends to have slower cores and slower RAM (higher latency and less bandwidth per channel), will likely net worse performance than individual solver boxes, where each box is limited to 1 solve at a time.
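To make conclusion 6 concrete, here's a tiny sketch of how I reason about "% of max performance" from measured net solve times. The times below are illustrative placeholders shaped like my 2/4-channel results, not actual measurements:

```python
# Hypothetical net solve times (seconds) vs active cores, shaped to match
# the 2/4-channel scaling described above -- illustrative, not measured data.
solve_times = {1: 60000, 2: 32000, 4: 19000, 6: 15000, 8: 14200, 12: 13800, 16: 13600}

def relative_performance(times):
    """Performance is 1/time; normalize to the fastest configuration."""
    best = min(times.values())
    return {cores: best / t for cores, t in times.items()}

perf = relative_performance(solve_times)
# With numbers like these, 6 cores already delivers over 90% of the
# 16-core performance, which is the shape of the scaling I measured.
```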
All of the above I had by last fall. I've been waiting for workstation/server-quality mainboards for Intel Alder Lake to come out so that I could test it, as I predicted it would be the way to go for a setup with 1 solver box per Insight solver. Supermicro released its W680 chipset board recently, so those test results are in. Our new setup will be 1 solver box with a 12900K per Insight solver we have.
Platform | CPU | Cores | HT enabled/Disabled | RAM type | RAM speed | RAM channels | Cool | Fill+Pack | Warp | Net (sec) | Net (min) | Net (hours) |
Supermicro x299 DIY | Intel i9-7960X (Skylake) | 16 | disabled | DDR4, fancy fast RAM, no ECC | 2666 | 4 | 2615 | 11635 | 2306 | 16556 | 276 | 4.60 |
Asus x470 (enthusiast DIY) | Ryzen 3900x (Zen 2) | 12 | disabled | DDR4, no ECC | 3000 | 2 | 2102 | 12487 | 2123 | 16712 | 279 | 4.64 |
Asus x570 (enthusiast DIY) | Ryzen 5900x (Zen 3) | 12 | disabled | DDR4, fancy fast RAM, no ECC | 3600 | 2 | 1730 | 9623 | 1611 | 12964 | 216 | 3.60 |
Dell R740xd, 2019 VM under Hyper-V 2016 | Xeon Gold 6244 (Cascade Lake) | 8 | enabled on host, confirmed 8 physical cores on one socket used by VM | DDR4, OEM ECC RAM | 2933 | 6 | 2647 | 14416 | 2405 | 19468 | 324 | 5.41 |
Dell R6515, bare metal, Server 2019 | EPYC 7302P (Zen 2) | 16 | disabled | DDR4, OEM ECC RAM | 3200 | 8 | 2791 | 11865 | 1857 | 16513 | 275 | 4.59 |
Supermicro X13SAE-F (W-680) | Intel i9-12900K (Alder Lake) | 8 | HT disabled, E cores disabled | DDR5, 4800MHz, no ECC. Note MB limited to 4400 MHz | 4400 | 2 | 1063 | 7703 | 1293 | 10059 | 168 | 2.79 |
2 additional observations I found noteworthy:
1) Ran the same job on the new Alder Lake solver last night but with SJM / 2019.
The Cool (BEM) solve was 40% faster under the old 2019.5 version. The remainder were within 5-6%. Since Fill + Pack takes most of the net time, the final total delta was 7% faster solve time under 2019. I know one of the changes for 2021.3 is faster solve time so this might have improved since (and this is just 1 data point, so grain of salt and all that).
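As a sanity check on that figure, here's a rough reconstruction using the 2021 Alder Lake times from the table above. The reading of "X% faster" as a time ratio, and the 5.5% midpoint for the other solvers, are my assumptions:

```python
# 2021/2021.1 times on the Alder Lake solver, from the table above (seconds).
cool_2021, fp_2021, warp_2021 = 1063, 7703, 1293

# Assumption: "X% faster" means old_time = new_time / (1 + X/100),
# with 5.5% used as the midpoint of the reported 5-6% for F+P and Warp.
cool_2019 = cool_2021 / 1.40
fp_2019   = fp_2021 / 1.055
warp_2019 = warp_2021 / 1.055

net_2021 = cool_2021 + fp_2021 + warp_2021   # 10059 s
net_2019 = cool_2019 + fp_2019 + warp_2019   # roughly 9286 s
delta = 1 - net_2019 / net_2021              # roughly 7-8% faster under 2019
```

The result lands in the same ballpark as the ~7% I saw, which is about as good as a reconstruction from rounded percentages can get.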
2) When stress testing the new solver build and making sure I understood the various knobs I could turn on the Supermicro mainboard, I verified the new CPU would run flat out at 241W power consumption - the highly publicized power limit for Alder Lake. Example common "stress testing" applications are Prime 95 (which does FFTs) and Cinebench, a video renderer.
Under Moldflow, even at 100% CPU utilization and the same CPU frequency as the above stress test/benchmark tools the highest power draw during Fill+Pack is about 120W. This implies there's a lot of potential performance being left on the table due to how the solvers are coded. Hopefully some opportunity for improvement to help us get more solves computed per day.
I realize it's a bit gauche to reply to one's own thread like this, but I have an interesting update to share:
I was giving a customer a tour this week; he's also a Moldflow Insight user, at a med-device OEM. He commented that he found the cloud solves to be pretty fast - that was all their company used.
So I decided to also run this benchmark solve on the current Autodesk cloud. The cloud uses AWS. Last updated in Feb this year: https://forums.autodesk.com/t5/moldflow-insight-forum/new-faster-cloud-workers-for-moldflow-2021/m-p...
The analysis logs reported the instance as an Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz with 16 threads and 128GB RAM. This matches the AWS r6i.4xlarge instance, which should be a good fit for Insight. That's 8 physical cores with hyper-threading for 16 threads. The 8375C is a 32-core CPU, so up to 3 other similar AWS instances could be running on the same socket. Worst case, I only get access to 25% of the available memory bandwidth on that system, which could hurt performance.
The total solve time was 6.87 hours - slower than any of the systems I tested previously and about 2.5 times slower than the newest Alder Lake solver we have (2.79 hours).
Cost of the solve was 21 Flex credits, which nets out to $63 for this one solve. With a 3-year life span, our solvers cost about $5.50 per day to purchase, build, maintain, and run flat out. Admittedly, electricity is cheap here.
We'll be sticking with local solvers unless we need the scalability of the cloud for some specific project. e.g. it would be a good, but expensive, fit for a bunch of simultaneous DOE runs. Remember that one advantage of the Autodesk cloud, from what I've read, is that it does not consume an Insight license (though one needs an active subscription to use it). So presumably part of that $63 is to cover that additional effective licensing.
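For anyone weighing the same trade-off, the cost arithmetic above can be sketched like this. The per-credit price is just implied by 21 credits costing $63; the one-solve-per-day utilization is my assumption:

```python
CLOUD_CREDITS_PER_SOLVE = 21
USD_PER_CREDIT = 63 / 21       # implied by the $63 total for this solve
LOCAL_COST_PER_DAY = 5.50      # purchase + build + maintenance + power, 3-year life

cloud_cost_per_solve = CLOUD_CREDITS_PER_SOLVE * USD_PER_CREDIT  # $63

# Local cost per solve depends on utilization: a box running even one
# ~2.8 h solve per day costs only ~$5.50 for that solve.
local_cost_per_solve = LOCAL_COST_PER_DAY
cost_ratio = cloud_cost_per_solve / local_cost_per_solve  # roughly 11x
```

The gap shrinks, of course, if a local box sits idle most of the year, which is where the cloud's pay-per-solve model starts to make sense.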
Ryan, this is great information! This impacts all of us users every day, although really only every 3-4 years when we're spec'ing a new box. Thank you for sharing!
An update with some new solver testing results, single solve, automatic threads: cool, F+P, Warp. This is the same benchmark file as my initial post, but now on Insight 2023. These are all local solves, not using a queue/provisioner machine (localhost network only, other than licensing).
Intel Alder Lake (12900K, DDR5): 2.90 hours
AMD Zen 4, 7950X, DDR5: 2.70 hours
Intel Raptor Lake, 13900K, ECC DDR5: 2.60 hours
On general-purpose workloads Raptor Lake benchmarked about 10% faster than Alder Lake, and we see the same here. Zen 4 splits the difference.
As before, gains really taper off once you get to 6 or 8 cores, so if someone wanted a budget solver rig, you could go that way and save on the CPU cost a bit.
Hopefully I get to try the new Sapphire Rapids Intel HEDT chips in the next few months. Could be interesting for stable performance for multiple, simultaneous solves. But we do not yet know how the frequency scales with core loading.
Unfortunately no, I have not made time to try. Looks like a $350 experiment + my time. May be an interesting diversion from my normal work ;).
Related: I'm currently shopping for a VM server for Q2. If I end up going Genoa-X (e.g. the 9184X - 768MB of L3 cache!), which there's probably a 50% chance of, I'll be sure to run some tests before it's out of reach in production. That would be an interesting comparison to the 7800X3D, both flat out and clock-equalised.
The new 7000 Threadrippers are available now too. Have not dug in to see how their boost behaves. No 3D Vcache options there, but lots of RAM bandwidth and for at least some cores, some potential frequency headroom vs 7800X3D.
Intro:
As suggested above, I purchased and tested one of the AMD 3DVcache CPUs. The 7800X3D was chosen as it would give unambiguous results, since all 8 of its cores feature the extra L3 cache. Among the desktop AMD parts with 8 or more cores, it has the most available L3 cache per core.
Comparison is the 7950X, which is 16 cores without the extra L3. It also clocks higher being the “flagship” part.
This “Zen 4” architecture uses “chiplets” with up to 8 cores each. Standard L3 is 32MB per chiplet or 4 MB per core. The 3DVcache stacks another 64MB on each chiplet for 12 MB per core.
Therefore, the total L3 of the 7950X is 64MB (over 16 cores) and the 7800X3D is 96MB (over 8 cores).
This extra L3 cache can help with any process where normally the CPU needs to reach out to RAM. A modest reduction in RAM traffic can net a modest to healthy improvement in net solve time. For some executables, the speed increase can be much larger.
Expectations:
Because of the extra thermal insulation caused by the physical placement of the L3 cache, the 7800X3D is limited in total thermal headroom and therefore clocks. It is also not capable of using the Precision Boost Overdrive tech for a low risk overclock.
I observed max single-threaded speeds of 5050 MHz with the 7800X3D, which is as expected. All-core workloads in the Moldflow Fill+Pack and Warp solvers also ran at 5.0-5.05 GHz on this processor, so it was not cooling limited.
The 7950X clocks up to about 5.8GHz single threaded and 5.5 GHz all core. This is with a conservative use of Precision Boost Overdrive. Both setups shared the same Motherboard, RAM and AIO watercooler – only the CPU was swapped.
As you can see then, the 7800X3D is at a clock speed deficit of 10-15%, depending on how many cores are working on the 7950X.
Therefore, for it to be faster, the extra L3 in the 7800X3D would have to make up for that deficit and then some. This is not improbable, as some other CFD simulations have shown ~30% uplift with extra L3 cache on this Zen 4 architecture.
The 7800X3D is also down 8 cores, but the scaling in Moldflow on these desktop CPUs moving from 8 to 16 cores is minimal, only about 7.5% gain for doubling the core count, and almost all of that in Fill+Pack (the warp solver scales even more poorly on the desktop).
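Putting those expectations into numbers, here's the back-of-envelope parity model I had in mind, using the all-core clocks and the 8-to-16-core scaling figure quoted above:

```python
# Back-of-envelope parity model; inputs are the figures from the posts above.
clock_x3d    = 5.05   # GHz, all-core, 7800X3D (thermally limited, no PBO)
clock_7950x  = 5.5    # GHz, all-core, 7950X with conservative PBO
core_scaling = 1.075  # 7950X net gain moving from 8 to 16 cores in Moldflow

# L3-cache uplift the 7800X3D would need just to match the 16-core 7950X,
# assuming solve time scales inversely with clock at fixed core count:
needed_uplift = (clock_7950x / clock_x3d) * core_scaling  # about 1.17, i.e. ~17%
```

A ~17% required uplift is well under the ~30% seen in some other CFD codes, which is why the experiment seemed worth $350.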
Hyper-threading / SMT disabled for all runs. A reminder that the BEM cool solver is single threaded. Expected run to run variation is about +/- 1%.
Results:
First, straight up, a single solve, same as above posts, using the 2023 solvers. Solve time in seconds:
CPU | Max Threads | Cool (BEM) | Fill+Pack | Warp | Net (sec) | Net (hours) |
7950X | 16 | 1084 | 7423 | 1238 | 9745 | 2.71 |
7950X | 8 | 1084 | 8198 | 1258 | 10540 | 2.93 |
7800X3D | 8 | 1077 | 8310 | 1269 | 10656 | 2.96 |
% as fast: 7800X3D vs 7950X @ 8 cores | | 100.6% | 98.7% | 99.1% | 98.9% | 98.9% |
% as fast: 7800X3D vs 7950X @ 16 cores | | 100.6% | 89.3% | 97.6% | 91.5% | 91.5% |
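For anyone who wants to reproduce the percentage rows, they are just time ratios computed from the raw seconds in the table:

```python
# Solve times in seconds, straight from the table above.
times = {
    "7950X@16": {"cool": 1084, "fp": 7423, "warp": 1238},
    "7950X@8":  {"cool": 1084, "fp": 8198, "warp": 1258},
    "7800X3D":  {"cool": 1077, "fp": 8310, "warp": 1269},
}

def pct_as_fast(ref, test):
    """'test is X% as fast as ref' = 100 * ref_time / test_time, per solver and net."""
    out = {k: 100 * ref[k] / test[k] for k in ref}
    out["net"] = 100 * sum(ref.values()) / sum(test.values())
    return out

vs8  = pct_as_fast(times["7950X@8"],  times["7800X3D"])   # net ~98.9%
vs16 = pct_as_fast(times["7950X@16"], times["7800X3D"])   # net ~91.5%
```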
Results discussion:
Core to Core, the 7800X3D is just as fast as the 7950X (+/- 1%). This means the extra L3 cache is making up for the clock speed deficit, but it’s not able to overcome that 10-15% frequency difference to make net gains on a per core basis.
Therefore, CPU to CPU, the 7950X, with its 16 cores, is about 8.5% faster than the 8 core 7800X3D in net solve time.
Street price of these CPUs is $340 and $490 for a ~$150 delta.
Simultaneous Solves
Another potential consideration is whether the extra L3 helps when doing simultaneous solves. To save space I'll refrain from posting another table, but the testing shows it does not. Both the 7950X and the 7800X3D net about a 50% increase in per-solve time when running 2 solves at the same time.
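Worth noting that a 50% per-solve slowdown still wins on throughput when you have two jobs queued. A quick sketch of that arithmetic (the 2.7 h figure is the approximate single-solve time from the results above):

```python
single_solve_h = 2.7        # approx. one solve on these CPUs, from the results above
concurrent_penalty = 1.5    # two simultaneous solves each take ~50% longer

two_serial_h     = 2 * single_solve_h                   # run back to back: 5.4 h
two_concurrent_h = single_solve_h * concurrent_penalty  # run together: ~4.05 h wall clock
throughput_gain  = two_serial_h / two_concurrent_h      # about 1.33x
```

So running pairs of solves concurrently is still about a third more jobs per day, L3 cache or not.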
Implications for Threadripper and EPYC
There are no “3DVcache” Threadripper parts in the Zen 4 generation (model numbers 7000). All use the standard chiplets. The Pro line has 8 channel RAM and the regular Threadripper has 4 channel, vs the Ryzen desktop at 2 channel.
There are a few "Genoa-X" EPYC parts with absolute gobs of extra L3. On a per-core basis the 16-core 9184X is the most extreme, with 768 MB total L3 for 48 MB per core. These only give up about 8% all-core frequency vs the "frequency optimised" EPYCs with regular amounts of L3, but there is a 20% all-core frequency deficit vs. the 16-core Threadripper 7960X. The EPYC parts have 12 channels of RAM available. Interesting mix of pluses and minuses. Total system costs for these are likely prohibitive, especially for a pre-built.
I do hope to get some testing time on a 9184X that I have on order for server duty. Schedule allowing, I’ll have results to share in 4-6 weeks.
This is the most helpful resource I've found for specifying a new workstation for Moldflow. Unfortunately, I only have access to Advisor, but based on your results I think I'll be going with a 9900X from AMD to avoid the current 14900K long-term stability concerns.