Re: HPC - a myth?

OmkarJ · ‎11-19-2012

Hello

I was wondering how effective HPC can be in curtailing the running times of the simulation.

What we currently have is a cluster of 4 "boxes", each with 4 physical cores and 16 GB RAM, for a total of 16 physical cores and 64 GB RAM. My own machine also has 4 physical cores and 16 GB RAM.

Here are the benchmark results for the same physics and the same number of iterations:

MyComputer, 4 physical cores: Time ~ x

Cluster with 8 cores (i.e. two boxes of 4 cores each, running in parallel): Time ~ 0.75x

Cluster with 16 cores (i.e. four boxes of 4 cores each, running in parallel): Time ~ 0.5x

So, does an increase in computational resources of 300% reduce the time by just 50%?! I know the relationship is not linear, but I would expect a bit more than that.

Regards

OJ

Royce_adsk · ‎11-19-2012

Hi OJ,

Thanks for sharing those results!

I do have a few questions to ask.

1) Are you using CFD 2013?

2) How are the 'boxes' connected together?

3) What sort of simulation did you run? Can you share the benchmark model that you used?

4) How large was the model, number of elements?

5) Can you share the 'scenarioname.sol' file from each solver folder? This is the solver log file.

Thanks,

Royce

Royce.Abel
Technical Support Manager

OmkarJ · ‎11-20-2012

Sure Royce, I should have been more diligent in mentioning all these in the original post!

#1. Yes, SimCFD 2013, Version 13.1.

#2. Infiniband SDR 4X using RDMA, 10 Gbits/s, latency ~5 microseconds (Yea, makes me sound smart, but just copy-pasted from IT-admin's mail 😉 )

#3. Conical resistance used as surface shell, RNG/ADV5. Sorry, I can't share the model due to IP concerns.

#4. Geometry is a conical filter with a diameter of 30 inches, with piping before and after the filter up to 40*ID. Mesh is typically 2 million cells.

#5. Am not sure if I will be able to, though this may be a generic file.

As a disclaimer, I should mention that given the time pressures, the test was not done for many iterations. I took an average of three iterations for each of these three approaches. The average times for ONE ITERATION for the three approaches, in order, were:

#MyComputer 4 cores : ~ 37 sec

#Cluster 8 cores: ~ 27 sec

#Cluster 16 cores: ~ 19 sec
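
For reference, the per-iteration averages above can be converted into speedup and parallel efficiency relative to the 4-core workstation. This is a minimal Python sketch. The function name and the choice of the 4-core run as the baseline are assumptions for illustration, not something reported by the solver.

```python
def scaling_table(timings, base_cores=4):
    """Return {cores: (speedup, efficiency)} relative to the base_cores run.

    timings maps core count -> average seconds per iteration.
    Efficiency is the speedup divided by the ratio of cores to the baseline.
    """
    base_time = timings[base_cores]
    table = {}
    for cores, seconds in sorted(timings.items()):
        speedup = base_time / seconds
        efficiency = speedup / (cores / base_cores)
        table[cores] = (speedup, efficiency)
    return table


# Average seconds per iteration reported in this post.
measured = {4: 37.0, 8: 27.0, 16: 19.0}

for cores, (speedup, efficiency) in scaling_table(measured).items():
    print(f"{cores:2d} cores: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

With these numbers the 16-core run comes out roughly 1.95 times faster than the workstation, at a little under 50% parallel efficiency.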

Hope this helps?

OJ

Royce_adsk · ‎11-20-2012

As a disclaimer, I should mention that given the time pressures, the test was not done for many iterations. I took an average of three iterations by these three approaches. The average times for ONE ITERATION for the consecutive approaches were:

This could be a key disclaimer. I would prefer to see you run each of these tests for 100 iterations before you take the per iteration average.

-Royce

OmkarJ · ‎11-20-2012

Well, I knew it would go this way. Anyways, will update here with at least 100 iterations when I can.

I have a feeling the trend would still be the same, though the values may change.

Regards

OJ.

apolo_vanderberg · ‎11-20-2012

Omkar,

You don't mention the CPU speeds of the machines. Are all of them the same speed? What are the specs of the machines? If there is 1 that is slower than the others, we will only run as fast as the slowest core being used.

The RAM does not specifically add up in that fashion. Currently, for a single machine running 4 cores we will have 5 solvers (1 Master thread and 4 compute threads). The Master thread will take roughly the same amount of memory as the 4 compute threads combined. In an HPC environment the Master thread must reside on one of the machines. Therefore, if you customized the setup such that the Master thread was on 1 machine and all compute threads were distributed evenly, we'd end up with ~32GB in use, as the Master thread would be limited to the 16GB on a given machine.
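
The memory estimate above can be sketched as a short calculation. This is only an illustration of the rule of thumb as described here; the function name is hypothetical, and the sketch assumes the compute threads fit comfortably on the remaining boxes.

```python
def usable_memory_gb(ram_per_box_gb):
    """Rough usable memory when the Master thread is confined to one box.

    Assumes, per the description above, that the Master thread needs about
    as much memory as all compute threads combined, and that the compute
    threads are spread over the other boxes without hitting their limits.
    """
    master_gb = ram_per_box_gb   # the Master cannot exceed one box's RAM
    compute_gb = master_gb       # compute threads total roughly the same
    return master_gb + compute_gb


print(usable_memory_gb(16))  # 32, not the 64 obtained by summing all boxes
```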

With the way our solver is parallelized today, there is a fair amount of message passing between machines.

10 Gbit/s InfiniBand cards are a great step up from the typical GigE you would have on a workstation. That said, you could potentially see slightly better performance with 20 Gbit/s cards (the ones we have purchased for our support clusters). Stepping up to 20 Gbit/s cards, however, won't make those numbers scale linearly.

Given the details, I would say that the numbers are typical.

Going from 4 to 8 cores typically yields the best performance per dollar with 20-30% runtime savings.

Going to 16 cores has typically been about another 20-30% performance boost.
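
As a rough check, the rule of thumb above can be turned into an expected runtime range. This sketch assumes each doubling of cores saves 20-30% of the previous runtime (compounding), which is one possible reading of the figures; the function name and the 37-second baseline taken from the 4-core timing reported earlier in the thread are used purely for illustration.

```python
def runtime_range(base_seconds, doublings, low=0.20, high=0.30):
    """Return (best, worst) runtime after `doublings` core doublings.

    Assumes each doubling saves between `low` and `high` of the previous
    runtime. Compounding is an assumption, not something stated above.
    """
    best = base_seconds * (1 - high) ** doublings
    worst = base_seconds * (1 - low) ** doublings
    return best, worst


# 37 s per iteration is the 4-core figure reported earlier in the thread.
for cores, doublings in ((8, 1), (16, 2)):
    best, worst = runtime_range(37.0, doublings)
    print(f"{cores} cores: {best:.1f} to {worst:.1f} s per iteration")
```

Under that reading, the reported 27 s (8 cores) and 19 s (16 cores) both fall inside the expected ranges.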

This is why some users configure their hardware as 2 clusters of 2 boxes each, so that jobs can run in parallel with improved runtime.

OmkarJ · ‎11-21-2012

Apollo, this is the processor that resides on every box in the cluster:

http://ark.intel.com/products/52213/Intel-Core-i7-2600-Processor-8M-Cache-up-to-3_80-GHz

Essentially, each box has 4 physical cores at 3.4 GHz and 16 GB RAM.

All boxes have the same configuration. Is there anything else you would need for gauging the specifications?

As I understand from the administrator, one of the boxes is the Master box, which communicates with my workstation and then distributes the work.

Going by your estimate (conservatively), let's assume that there is 20% reduction in times from 4 to 8 cores and 20% further reduction for 8 to 16 cores (total 40% from 4 cores). And let's consider the simulation that takes 10 hours to finish with 4 cores on My Computer. Thus, to finish two such simulations we require:

# My Computer: 10 + 10 = 20 hours (2 jobs CAN'T RUN IN PARALLEL because we are using all 4 cores in the machine)

# Cluster 16 cores: 6 + 6 = 12 hours (2 jobs CAN'T RUN IN PARALLEL because we are using all 16 cores in the cluster)

# Cluster 8 cores: 8 hours (2 jobs CAN RUN IN PARALLEL since we are using only 8 cores for each)

Thus, theoretically, running two jobs simultaneously with 8 cores dedicated to each takes about 33% less time than running two jobs in series on 16 cores!!
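
The comparison above can be sketched in a few lines of Python. The function name is hypothetical, and the sketch assumes that two 8-core jobs running side by side do not slow each other down (for example through shared network traffic) and that a solver license is available for each.

```python
import math


def batch_hours(job_hours, jobs, concurrent_jobs):
    """Wall-clock hours to finish `jobs` identical jobs when
    `concurrent_jobs` of them can run side by side."""
    waves = math.ceil(jobs / concurrent_jobs)
    return waves * job_hours


# Per-job hours taken from the estimate above: 10 h on 4 cores,
# 8 h on 8 cores (20% less), 6 h on 16 cores (40% less).
scenarios = {
    "My Computer, 4 cores, jobs in series": batch_hours(10, 2, 1),
    "Cluster, 16 cores, jobs in series": batch_hours(6, 2, 1),
    "Cluster, 2 x 8 cores, jobs in parallel": batch_hours(8, 2, 2),
}

for name, hours in scenarios.items():
    print(f"{name}: {hours} hours")
```

This gives 20, 12 and 8 hours respectively, so the two-by-8 layout finishes both jobs in about a third less time than the 16-core layout.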

Any thoughts on this hypothesis?

Regards

Omkar.

PS: My observations fall roughly at the midpoint of your estimates of the reduction in times, i.e. 25% reduction each when going from 4 to 8 and from 8 to 16 cores, against your estimate of 20-30%. Guess I was being overly optimistic then.

apolo_vanderberg · ‎11-26-2012

Omkar,

Absolutely. Completely expected at times.

Prior to our use of multi-core machines / clustering the easiest way to get faster performance was to leverage more solvers on more machines.

If you solve a job on 1 machine and have to queue 2 more behind it, you will get an immediate performance increase if you happen to have the available hardware (and solver license) to run the 2nd job on another machine versus waiting for the first to finish.

So, more to your specific question: yes, running two 8-core jobs will outperform one 16-core job with a 2nd job queued.
