rendering with Multiple GPUs makes render fail

Message 1 of 15

giandosopaolo
Advocate

So, I am testing a GPU render farm with access to 8 GeForce RTX 3090s. It is pretty amazing.

The problem is, the same scene that I can easily render with a single 3090 doesn't complete the render with 8.

If I select all 8 cards I always get this error at around 10%, which stops the render:

 

// Error: 00:09:42 46420MB ERROR | [gpu] an error happened during rendering. OptiX error is: Memory allocation failed (Details: Function "_rtContextLaunch2D" caught exception: Memory allocation failed)

GPU 0 had 24347MB free before rendering started and 9405MB free when crash occurred
GPU 1 had 24340MB free before rendering started and 9405MB free when crash occurred
GPU 2 had 24301MB free before rendering started and 9423MB free when crash occurred
GPU 3 had 24347MB free before rendering started and 9661MB free when crash occurred
GPU 4 had 24071MB free before rendering started and 175MB free when crash occurred
GPU 5 had 24347MB free before rendering started and 9597MB free when crash occurred
GPU 6 had 23976MB free before rendering started and 9190MB free when crash occurred
GPU 7 had 24347MB free before rendering started and 9597MB free when crash occurred

 

If I reduce the number of cards to the first 5 of the 3090s, it works (and it is obviously faster).

When I go up to 6 I get the same error, at 11%:


// Error: 00:05:54 32757MB ERROR | [gpu] an error happened during rendering. OptiX error is: Memory allocation failed (Details: Function "_rtContextLaunch2D" caught exception: Memory allocation failed)
// GPU 0 had 24347MB free before rendering started and 7217MB free when crash occurred
// GPU 1 had 24320MB free before rendering started and 7166MB free when crash occurred
// GPU 2 had 24301MB free before rendering started and 7235MB free when crash occurred
// GPU 5 had 24347MB free before rendering started and 7281MB free when crash occurred
// GPU 6 had 24115MB free before rendering started and 7036MB free when crash occurred
// GPU 7 had 24347MB free before rendering started and 7281MB free when crash occurred

 

What's going on?

Accepted solutions (1)
Replies (14)
Message 2 of 15

thiago.ize
Autodesk

Looks like GPU 4 ran out of memory. Check to see if something else is running on it consuming memory.

GPU 4 had 24071MB free before rendering started and 175MB free when crash occurred
Message 3 of 15

thiago.ize
Autodesk

One more thing: when you render with few enough GPUs that it works, how much memory does Arnold report being used by the GPUs? How much memory does the OS report being used? I wonder if it is crashing because a big allocation is being made (say 8 GB), which is more than the 7 GB of memory that was available when you used 6 GPUs.

Could it be that as you add more GPUs the amount of memory consumed by Arnold in each GPU goes up?

Message 4 of 15

giandosopaolo
Advocate

.

Message 5 of 15

giandosopaolo
Advocate

Looks like GPU 4 ran out of memory. Check to see if something else is running on it consuming memory.

Nothing else is running, just Maya.

I also thought it was just a GPU 4 problem, but I tried a different server and the same thing happened.

Also, as you can see, I tried 6 of the 8 GPUs, excluding GPU 4, but it still crashed.

Also, I noticed something weird. The maximum number of GPUs that works seems to be 5, but the render time with 5 is 20 minutes, while the render time with 4 GPUs is 13 minutes. How can that be?

With 4 GPUs it renders the 1st frame of the batch render, but on the 2nd frame it still stops at 45%, when the dedicated GPU memory goes above 24 GB (this does not happen in the 1st frame, where it stays around 23 GB).

 

One more thing: when you render with few enough GPUs that it works, how much memory does Arnold report being used by the GPUs? How much memory does the OS report being used?

I noticed that the dedicated GPU memory goes above 24 GB. I fear that is what is causing the crash, @thiago.ize.

But when I render on a single GPU (my personal workstation), the dedicated GPU memory is lower, around 18.5 GB.

 

Could it be that as you add more GPUs the amount of memory consumed by Arnold in each GPU goes up?

This is interesting! Why would it do that though? The documentation says that up to 8 GPUs are supported. Is this memory allocation stacking supposed to happen? 

And most importantly how do I prevent that from happening?

Are there ways to reduce the dedicated GPU memory consumption?

 

Message 6 of 15

thiago.ize
Autodesk

https://docs.arnoldrenderer.com/display/A5AFMUG/Log explains how to get log files. You'll want to set the verbosity level to info in order to get the stats printed out at the end of the render. Do that for a render with a single GPU, then repeat with 2, 3, 4, 5, 6, etc. until it runs out of memory. While doing that, also take a look at an activity monitor or some other tool to see how much memory each GPU is using. Look at the maximum amount.
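
For example, something like this run from Maya's Script Editor should turn on verbose logging to a file. This is just a minimal sketch; I'm writing the defaultArnoldRenderOptions attribute names from memory, so please double-check them against your MtoA version:

import maya.cmds as cmds

# Sketch: enable Arnold info-level logging to a file before rendering.
# The attribute names below are assumptions from memory and may differ per MtoA version.
cmds.setAttr("defaultArnoldRenderOptions.log_verbosity", 2)   # assumed: 2 = info
cmds.setAttr("defaultArnoldRenderOptions.log_to_file", 1)     # assumed attribute
cmds.setAttr("defaultArnoldRenderOptions.log_filename",
             "C:/temp/arnold_render.log", type="string")      # example path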

If things are behaving properly, the memory reported by activity monitor for each GPU should be the same whether you use 1 or 8 GPUs. But there's always a chance there could be a bug we don't know about that causes more memory to be used per GPU.
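
To keep an eye on per-GPU memory while testing, you can poll nvidia-smi from a separate terminal. A minimal sketch, assuming nvidia-smi is on your PATH:

import subprocess
import time

# Sketch: print used/total memory for every GPU once per second.
# Run this alongside the render and note the peak values per GPU.
while True:
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,memory.used,memory.total",
         "--format=csv,noheader"],
        capture_output=True, text=True)
    print(result.stdout.strip())
    print("-" * 40)
    time.sleep(1.0)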

 

As for scaling performance, it is challenging to get 8 GPUs to be 8x faster than 1 GPU.

One of the reasons is that as you add more GPUs, you might reduce the number of PCIe lanes available to each GPU. See https://www.cgdirector.com/guide-to-pcie-lanes/ for details. Another hardware-level issue is that adding more GPUs might mean your power supply cannot power them adequately, which results in the GPUs running slower (it's not just the total watts of your PSU; you also need to make sure no individual rail of your PSU is being overloaded). Finally, if the GPUs overheat, that too will result in them running slower.

Another reason is that Arnold currently has significant communication between GPUs (one reason why PCIe lanes matter). A pair of GPUs connected by NVLink should usually see a 2x speedup, but NVLink doesn't work across more than 2 GPUs, so you'll have a harder time scaling well past two GPUs. We do want to reduce the communication between GPUs, so hopefully future versions of Arnold will scale better in this situation.

Finally, it's possible that one GPU might end up getting a harder-to-render chunk of the image, in which case the render time will be bottlenecked by how long it takes that GPU to finish. This too is an area where we want to make improvements.

 

Message 7 of 15

giandosopaolo
Advocate

When I create a log file, must I select "file" in the options? Does it create a single file or many?

I'm not expecting an 8x speedup, just to be able to batch render without it crashing on frame 2. On my machine I have rendered 10 frames in a row without much problem. On the render farm, with a machine holding 4 3090s, it renders fast, but it aborts midway through the second frame. It's so frustrating.

I am now trying a new strategy:

 

  1. I am downscaling all background textures to 2K.
  2. I set the max texture resolution to 4096 (my displacement textures were 8K).
  3. I merged together a few objects that I could merge.
  4. I increased the displacement adaptive error from 4 to 4.5 and reduced the subdivision iterations from 10 to 9.
  5. I set the render camera to work as the dicing camera for the subdivisions.
  6. I deleted all unused nodes.

Would these strategies help?

--------------------UPDATE----------------------

I deleted a lot of stuff and downscaled a lot of textures. This should technically free up a LOT of VRAM.

I noticed the render now starts well, at a 20-minute render time with a dedicated GPU memory of 20 GB, which stays the same for a few frames. But then the render time quickly climbs to 50 minutes or way more, as the dedicated GPU memory balloons to roughly 23.2/24.0 GB (see maya render log_latest). If I stop the batch render, the same frame renders fine and the render time stays OK for the next few frames, then the problem shows up again.

 

  1. Also, very importantly: since each 8K texture eats 256 MB of VRAM (but a 4K texture only 64 MB), does setting the "Max Texture Resolution" under Manual Device Selection (Local Settings) to 4096 prevent this extra expenditure in a batch render? Or, since the 8K textures are loaded in the project, do they still consume those whopping 256 MB of space?
  2. Do textures which are not currently visible from the camera angle still eat into the VRAM?
  3. Is there a way to use the command line render to render the sequence in "bits"? Would that help?
Message 8 of 15

giandosopaolo
Advocate
Accepted solution

@thiago.ize 

There seems to be a memory overflow in Arnold GPU when rendering more than one frame.

I can see that the same issue I am having is now being reported in this thread too.

 

Since I found a temporary solution, I thought I would share it.

I found out that it is possible to work around it by writing a .bat file and using it to render through the command line (as described in THIS video).

 

We can use the -seq flag to split the render into packages of 3 frames (or whatever your card can take before crashing). In my case it is 3.

This way the memory is reset to normal every 3 frames.

 

So 

render -r arnold -seq "1..3" C:\Users\paolog\Desktop\myScene\Maya\scenes\myScene_01.mb
render -r arnold -seq "4..6" C:\Users\paolog\Desktop\myScene\Maya\scenes\myScene_01.mb
render -r arnold -seq "7..9" C:\Users\paolog\Desktop\myScene\Maya\scenes\myScene_01.mb

.. and so on.
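
To save some retyping, a small Python sketch like this can generate the .bat for you (the frame range and chunk size here are just example values; adjust them to your scene and to whatever your card can take):

# Sketch: write a .bat that renders the sequence in chunks of 3 frames,
# so GPU memory is released between chunks.
scene = r"C:\Users\paolog\Desktop\myScene\Maya\scenes\myScene_01.mb"
start, end, chunk = 1, 30, 3   # example frame range and chunk size

with open("render_chunks.bat", "w") as bat:
    for first in range(start, end + 1, chunk):
        last = min(first + chunk - 1, end)
        bat.write('render -r arnold -seq "{0}..{1}" {2}\n'.format(first, last, scene))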

Bit of a pain in the ass but it is working. I hope it gets fixed soon.

BTW, the Arnold documentation on writing .bat files is a bit lacking.

Message 9 of 15

giandosopaolo
Advocate

Seems like the memory leak has been fixed in the latest Arnold version! 🙂

Thanks Arnold team!

Message 10 of 15

LegoFan
Explorer

Actually NO, this is NOT fixed.

 

I have installed the latest drivers and versions (the latest NVIDIA driver for my RTX 3090 and the latest Arnold version for Maya 2023).

 

Whenever I do a render sequence with GPU rendering, it fills up my memory with every new frame until Maya ultimately crashes because it is out of memory.

 

It adds about 100 to 200 MB of cache per frame, so even though it is a slow process, it will eventually happen.

 

Also, I tried older scenes and it is the same result with GPU rendering.

(I am outputting with the OptiX denoiser in my sequence, but it has happened without denoisers too.)

 

Also, I haven't seen any mention of this in the bug fixes of the latest Arnold updates (the version being 7.1.3.0).

Message 11 of 15

giandosopaolo
Advocate

@LegoFan 

Try my method to work around the issue. It works.

Message 12 of 15

LegoFan
Explorer

Looks like my only option right now...

 

Thanks anyways 🙂 

Message 13 of 15

thiago.ize
Autodesk

Hi LegoFan, it seems like you have a different memory leak than giandosopaolo. Could you please make a new post with an accompanying log file?

Message 14 of 15

LegoFan
Explorer

Hello there,

 

Now... I really don't know what to say anymore...

 

I have completely reinstalled Maya 2023 and Arnold, and even reinstalled the latest driver for my NVIDIA RTX 3090.

All new.

 

You know what? This even happens with an EMPTY scene. No need to add ANYTHING there. It doesn't matter whether it is CPU or GPU rendering.

 

I simply ask Arnold to Render Sequence a certain number of frames and I can watch it filling up my RAM. The bigger my output file size is, the faster it climbs, and we are talking about 100 times the file size. So basically, with full-HD images I can render about 150 frames with my 32 GB of RAM. Then it stays very close to 100% RAM usage, and in my last test it finally shut down after ca. 900 frames, after jumping around between 95 and 99% RAM usage for some time.

 

 

Also, the RAM won't free up after aborting/ending the render sequence. It won't even let me manually flush caches.

Only after closing the scene is the RAM free again.

 

 

 

 

Personally, all that I can remember is that a while back I had a scene where I tried to use batch rendering for the first time (which, in that scene, created watermarks on my materials in the Hypershade). Other than that I cannot remember any change I made recently. I only had a two- or three-week period where I did not use Maya. No changes otherwise.

 

I did not purchase an extra Arnold license, so I can only use Render Sequence, but at this point I am really thinking about changing the render engine completely...

 

 

Message 15 of 15

thiago.ize
Autodesk

LegoFan, please make a new thread with a log file so we can investigate your issue. The log file will give us important info, such as the version of Arnold you're using.
