HtoA VDB Volume Render disparity EC2 vs OnPrem

jacobW2QRD · ‎03-28-2022

Hi Arnold Folks,

Hope that you are all well.

We are having a hard time diagnosing substantial differences in render times for certain scenes rendered on AWS EC2 (Thinkbox Deadline) vs. on-prem. Here are the details and the key differences between the two setups.

We've noticed the issue in particular scenes that render pre-cached VDB smoke volumes. We are using Houdini 18.5.596 and HtoA 5.6.2.0.

Rendering via Deadline 10.1.20.2

-OnPrem

Windows10

2 x Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz (28 cores, 56 logical) with 130960MB

Windows 10 (version 10.0, build 19041)

from 0% - 100% of actual ray/pixel rendering.

render done in 4:53.488

-Cloud

Linux

2 x Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz (36 cores, 72 logical) with 140766MB

Amazon Linux 2, Linux kernel 4.14.256-197.484.amzn2.x86_64

from 0% - 100% of actual ray/pixel rendering.

render done in 13:21.473

Excluding file upload time, texture generation, .ass generation, etc., the render on the EC2 cloud instance takes roughly 3x as long (13:21 vs. 4:53). The Thinkbox/Deadline engineers were stumped and suggested we contact Autodesk/Arnold about any potential CPU-related reasons for the slowdown.

From Thinkbox:

"The majority of the slowdown is happening in the light cache processing which just may be more performant on one type of CPU over another. Maybe these two machines aren't as comparable as we think but it's hard to say.

I feel confident that it's a difference in the CPU and just for the VDB renders as you mention you're not seeing a 3x render time across the board, and the EC2 instance has the CPU at 90% on average (100% peak) so that must be the bottleneck as opposed to file read/write."

Thank you,

-Jake

thiago.ize · ‎03-28-2022

Can you post the logs with info level verbosity for the two machines? Make sure it includes all of the stats. That will help give us a preliminary idea of what might be going on.

jacobW2QRD · ‎03-28-2022

Hello,

Thanks for your quick reply.

I've attached the full verbose render logs. I had the HtoA verbosity set to 5 during this render. Unfortunately, it didn't log much during the actual ray/pixel portion.

Thanks,

-Jake

thiago.ize · ‎03-28-2022

The EC2 machine, at least superficially, looks like it should have the upper hand. It seems to have the better CPU, and it's running Linux instead of Windows (Linux tends to be faster).

From the logs I note that the EC2 is running proportionally the same as your local machine. See here how the volume shader is about 50% of render time on both machines:

00:05:03 2186MB | volume shader 2:24.30 (47.29%)  <--onprem
00:13:30 2206MB | volume shader 7:17.15 (53.46%)  <--EC2

And if we look at BVH intersection time, which is a totally different code path and should not be affected by volumes, we once again get similar results where the percentages of the render time are the same while the times are 3x different:

00:05:03  2186MB |  BVH_motion::intersect  0:23.90 ( 7.83%) <-- onprem
00:13:30  2206MB |  BVH_motion::intersect  1:02.79 ( 7.68%) <--EC2

So it's not that a specific part of Arnold is slow, but rather that everything is uniformly slower. This makes me doubt that it's a software problem; it's much more likely that something is wrong with the EC2 instance, either in the hardware or in the OS configuration. Maybe it's running at a lower clock speed, possibly due to thermal problems? Or maybe its memory bandwidth is suboptimal?
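The arithmetic behind the "uniformly slower" observation can be checked with a short sketch. It parses the four stat lines quoted above and prints the EC2/on-prem time ratio per entry. The regular expression and the helper names (`to_seconds`, `parse_stat`, `slowdown`) are assumptions made for illustration, fitted only to the lines shown in this thread; they are not part of Arnold or HtoA.

```python
import re

# Matches the stat portion of a log line such as
#   "... | volume shader 2:24.30 (47.29%)"
# Pattern is an assumption based only on the lines quoted in this thread.
STAT_RE = re.compile(
    r"\|\s*(?P<name>.+?)\s+(?P<time>(?:\d+:)*\d+(?:\.\d+)?)\s+\(\s*(?P<pct>[\d.]+)%\)"
)


def to_seconds(stamp):
    """Convert 'h:mm:ss.ff', 'm:ss.ff' or 'ss.ff' to seconds."""
    seconds = 0.0
    for part in stamp.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def parse_stat(line):
    """Return (entry name, elapsed seconds, percent of render time)."""
    match = STAT_RE.search(line)
    if match is None:
        raise ValueError(f"no stat entry found in: {line!r}")
    return match["name"], to_seconds(match["time"]), float(match["pct"])


def slowdown(onprem_line, ec2_line):
    """Return (entry name, EC2 seconds divided by on-prem seconds)."""
    name_a, secs_a, _ = parse_stat(onprem_line)
    name_b, secs_b, _ = parse_stat(ec2_line)
    if name_a != name_b:
        raise ValueError(f"entries differ: {name_a!r} vs {name_b!r}")
    return name_a, secs_b / secs_a


# The two pairs of log lines quoted in this thread (on-prem first, EC2 second).
PAIRS = [
    ("00:05:03 2186MB | volume shader 2:24.30 (47.29%)  <--onprem",
     "00:13:30 2206MB | volume shader 7:17.15 (53.46%)  <--EC2"),
    ("00:05:03  2186MB |  BVH_motion::intersect  0:23.90 ( 7.83%) <-- onprem",
     "00:13:30  2206MB |  BVH_motion::intersect  1:02.79 ( 7.68%) <--EC2"),
]

if __name__ == "__main__":
    for onprem, ec2 in PAIRS:
        name, ratio = slowdown(onprem, ec2)
        print(f"{name}: EC2 takes {ratio:.2f}x the on-prem time")
```

Both entries come out at roughly 3.0x and 2.6x respectively, in line with the overall render time ratio (13:21.473 vs. 4:53.488, about 2.7x), which is what a machine-wide slowdown would look like.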

Do other Arnold renders perform as expected? Do other applications that are bandwidth- and compute-heavy perform as expected?
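One rough way to test the clock-speed hypothesis on the Linux instance is to sample the per-core frequency while a render is running. The sketch below assumes an x86 Linux system whose `/proc/cpuinfo` exposes `cpu MHz` and `model name` lines; the function names are hypothetical, and on a virtualized instance the reported value may not reflect the physical clock, so treat the result as a hint rather than a measurement.

```python
import os
import re


def parse_cpu_mhz(cpuinfo_text):
    """Return the current clock speed (MHz) reported for each logical CPU."""
    values = re.findall(r"^cpu MHz\s*:\s*([\d.]+)", cpuinfo_text, flags=re.MULTILINE)
    return [float(v) for v in values]


def parse_rated_ghz(cpuinfo_text):
    """Return the rated speed from a 'model name ... @ 3.00GHz' line, if present."""
    match = re.search(
        r"^model name\s*:.*@\s*([\d.]+)GHz", cpuinfo_text, flags=re.MULTILINE
    )
    return float(match.group(1)) if match else None


def summarize(cpuinfo_text):
    """Summarize reported clocks so they can be compared with the rated speed."""
    speeds = parse_cpu_mhz(cpuinfo_text)
    if not speeds:
        return None
    return {
        "logical_cpus": len(speeds),
        "min_mhz": min(speeds),
        "avg_mhz": sum(speeds) / len(speeds),
        "rated_ghz": parse_rated_ghz(cpuinfo_text),
    }


if __name__ == "__main__":
    path = "/proc/cpuinfo"  # Linux only; absent on Windows and macOS
    if os.path.exists(path):
        with open(path) as handle:
            print(summarize(handle.read()))
    else:
        print(f"{path} not available on this system")
```

Running it a few times during the ray/pixel phase, rather than at idle, gives a more useful picture: an average well below the rated speed under full load would support the throttling theory.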

jacobW2QRD · ‎03-28-2022

Thanks Thiago,

I'll relay this back to Thinkbox/Deadline and see what we can do.

-Jake