Community
CFD Forum
Welcome to Autodesk’s CFD Forums. Share your knowledge, ask questions, and explore popular CFD topics.

Cluster communication for SimCFD 2014

Message 1 of 19
OmkarJ
651 Views, 18 Replies

I have faced a lot of issues with cluster integration in SimCFD 2013. I was wondering whether 2014 brings any improvement in this area. I looked through the "What's new" section in the wiki but couldn't find any mention of it there.

 

OJ

Message 2 of 19
apolo_vanderberg
in reply to: OmkarJ

Omkar,

   Are you referring to the setup procedure for Clusters, or use of the cluster as a remote solver?

 

While we did not highlight anything in the What's New, the setup process should be a little easier for 2014 than it was for 2013.

 

Also, with the evolution of CFD 360 for 2014, there were some back-end improvements to communications in the desktop product when dealing with remote solvers (cluster or not).

 

 

Message 3 of 19
OmkarJ
in reply to: apolo_vanderberg

I use Cluster in our office as a solver (remotely), while the simulation files are on my local machine. I am referring to the following specific expectations:

 

  1. If I run a queue of, say, 5 scenarios, from either the same cfdst or more than one, the Solver should solve them all sequentially and then update the local folders of the cfdst scenarios. I have seen cases where the Solver has run the simulation but the local files are not updated. When I navigate to the HPCAnalyze folder on the cluster, the scenario folder inside the jobxxx folder has res files from the latest iterations, but they are not updated in the local folders on my machine.
  2. Subsequently, when I open the scenario in SimCFD, I should see not only the results of the last iteration mapped to the mesh, but also the result history in the form of plots. I have sometimes seen that, although the results are mapped to the mesh, the plots are not present.

I believe these are reasonable expectations.
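For anyone who wants to check for the first symptom without opening each scenario, here is a minimal sketch that compares the newest file timestamp in a remote job folder against the local scenario folder. The two paths passed in are hypothetical placeholders, and treating "newest file on the remote side is newer than anything local" as "not copied back" is an assumption made for illustration, not something SimCFD itself reports.

```python
from pathlib import Path
from typing import Optional


def newest_mtime(folder: Path) -> Optional[float]:
    """Return the latest modification time of any file under folder, or None if it holds no files."""
    times = [p.stat().st_mtime for p in folder.rglob("*") if p.is_file()]
    return max(times) if times else None


def local_copy_is_stale(remote_job: Path, local_scenario: Path) -> bool:
    """True if the remote job folder holds files newer than anything in the local scenario folder."""
    remote = newest_mtime(remote_job)
    local = newest_mtime(local_scenario)
    if remote is None:
        return False  # nothing on the remote side that could be copied back
    return local is None or local < remote
```

A call might look like `local_copy_is_stale(Path(r"\\headnode\HPCAnalyze\jobxxx"), Path(r"D:\projects\study1"))`, where both paths are made up. Timestamps can be misleading if the copy step preserves or rewrites them differently, so this is only a first-pass check.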

 

Regards

OJ

Message 4 of 19
apolo_vanderberg
in reply to: OmkarJ

Omkar,

 

I can agree with your sentiment, as most would expect that if you solve remotely you should be able to retrieve the results. 2014 has had some work done on communication between the Interface and the Solver, so I would be curious to see whether you experience something similar with the latest version.

 

I'd like to dig in a little further with some of the specifics on this:

 

1) Where are the files on your local machine? Stored on a C: or a mapped / network drive?

2) When you submit the first analysis, when do you switch to the next to submit (what stage is the analysis at)? Or are you using the Solver Manager to submit all the jobs at once?

3) During the run, all files are stored on the remote machine; nothing gets copied back until the end of the analysis.

4) How many intermediate results are you saving? Do you know the total size of the Jobxxx folder on the remote machine? This is the amount of data that would have to get copied back.

5) When you do open the analysis again, what happens? Does it begin to load the data and then stop? What's the progression?

6) Is there anything else you're doing with the local machine while the run is happening remotely (like shutting down the local machine)?

7) If you do not close the interface, do the jobs come back more consistently?

8) How frequently do you see this issue (roughly 1 in 5 jobs? 1 in 10?)?

 

*If you do see this frequently, is there a common/repeatable set of steps that you can do to replicate this?

While I have had the occasional job not come back, it has not been frequent enough or repeatable enough such that I could work with our development team to point it out.

Message 5 of 19
OmkarJ
in reply to: apolo_vanderberg

To be honest, I have seen the same sentiment in many posts, and SimCFD is widely believed to have chronic problems with cluster communication.

 

Here are the clarifications to your questions.

 

1) Where are the files on your local machine? Stored on a C: or a mapped / network drive?

Typically, the D: drive. However, after lodging a case on this, and as suggested by Autodesk experts, we started storing the files on a mapped Cluster drive and opening SimCFD on the cluster using Remote Desktop. This has been a bit more successful; however, it is inconvenient, because this way only one user can work with the Cluster through Remote Desktop. I would prefer to use the Cluster as a remote solver, as it is meant to be!

 

2) When you submit the first analysis, when do you switch to the next to submit (what stage is the analysis at)? Or are you using the Solver Manager to submit all the jobs at once?

This depends on the pipeline of jobs I am working with. The ones that are finished with meshing are set for solving using "Solve". Sometimes, when we need licences for meshing, we stop the cluster jobs for the day, and at the end of the day they are submitted again through individual scenarios using "Solve". I rarely use the Solver Manager.

 

3) During the run, all files are stored on the remote machine; nothing gets copied back until the end of the analysis.

This is straightforward.

 

4) How many intermediate results are you saving? Do you know the total size of the Jobxxx folder on the remote machine? This is the amount of data that would have to get copied back.

Typically after every 500 iterations. How would it be beneficial to know the size of jobxxx folder? I would expect the data it has should be copied to local folders. 

 

5) When you do open the analysis again, what happens? Does it begin to load the data and then stop? What's the progression?

Either of these:

  • The mouse pointer goes to "busy" mode, and after a few seconds I see that neither the plots are updated nor the results mapped on the mesh.
  • Sometimes the results are mapped but I don't see the plots, so I can't judge the convergence.

6) Is there anything else you're doing with the local machine while the run is happening remotely (like shutting down the local machine)?

No, we don't shut down the local machine. But we also can't keep all the cfdst files open because of the limited licences for the interface, so we close the cfdsts if there are too many. Also, in a single design study with multiple scenarios running, only one can be open anyway. During the day, of course, the local machine is being used for other purposes.

 

7) If you do not close the interface do the jobs more consistently come back?

It is rare that the scenario that is open and running will have a problem. But even if the interface is open, only one scenario can be open, so the rest of the scenarios would presumably have the same problems as those in closed cfdsts.

 

8) How frequently do you see this issue (roughly 1 in 5 jobs? 1 in 10?)?

It is as unpredictable as rain in the UK! But quantitatively, 20-30% sounds about right.

 

*If you do see this frequently, is there a common/repeatable set of steps that you can do to replicate this?

I haven't seen any pattern or causality in this, and hence it is difficult to say whether it can be definitively replicated. The best bet may be to generate a queue and wait for it to happen.

 

OJ

Message 6 of 19
apolo_vanderberg
in reply to: OmkarJ

Omkar,

 

Thank you for those answers

Let me be a bit more specific on some of these questions:

 

If you are logging on to the cluster to run locally on the cluster, we want the files on the cluster's headnode.

If you are using the cluster as a remote solver, we want the files on the local hard drive of the local machine, not on the cluster.

Is D: a network path?

 

When sending a job from local machine to Cluster, have you been storing your files on the local machine or the cluster?

 

 

If you are using the cluster as a remote solver, we send the files to the cluster, the analysis solves locally there in the HPCanalyze\JobXXX folder, and when it is done the results get copied back to the local machine.

 

If someone has to mesh you stop all jobs on the Cluster?

How are you doing this?

Can the user not mesh locally vs sending to the cluster?

 

How many Jobs are typically queued up at any given time?

 

Message 7 of 19
OmkarJ
in reply to: apolo_vanderberg

Thanks for the interest. The clarifications are:

 

If you are logging on to the cluster to run locally on the cluster, we want the files on the cluster's headnode

Yes, when we log on to the cluster's headnode, to run using Cluster as remote solver, we keep the cfdst files on a shared location on Cluster's headnode, that is accessible from everywhere. Also, we open the cfdst from this shared network location, not from local location, when we do this. 

 

If you are using the cluster as a remote solver we want the files on the local harddrive of the local machine not on the cluster.

 

Yes, when we submit to the cluster as a remote solver from the local machine, we have the files stored on the local hard drive, and we open the files directly from the local hard drive location.

 

Is D: a network path?

D: is a local path that is not shared.

 

If you are using the cluster as a remote solver we send files to the cluster, the analysis will solve locally in the HPCanalyze\JobXXX folder and then when done gets copied back to the local machine.

Yes, it is straightforward. But the problem lies in its inconsistency, hence this thread.

 

If someone has to mesh you stop all jobs on the Cluster?

Since we have only two Solver licences, if two engineers want a licence for meshing, we can't run the cluster jobs - since mesher also requires the Solver licence. Hence we have to stop jobs on cluster.

 

Can the user not mesh locally vs sending to the cluster?

Yes, we mesh locally, using MyComputer as Solver. The cluster jobs are stopped only to free up a licence, not to use cluster as a solver for meshing.

 

How many Jobs are typically queued up at any given time?

Typical values are:

Minimum: 3

Maximum: 8

 

OJ

Message 8 of 19
apolo_vanderberg
in reply to: OmkarJ

Omkar,

  A few more questions with some of this.

 

Do you see these issues more when you have to stop the cluster so that others can mesh?


     If so, is there any reason you do not leave the cluster running with its 1 license and let the 2 engineers take turns meshing (as meshing shouldn't take that long and would be less troublesome than stopping the whole queue and then restarting it)?

 

It might be useful to keep a mental note of the typical sizes of the JobXXX folders, so that we can see whether there is a threshold where this appears (does it happen more often as the folder size increases? Is there a specific size that starts being problematic?).
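Rather than a mental note, the folder sizes could be logged with a few lines of script. This is a minimal sketch that assumes each job lives in its own subfolder directly under the HPCanalyze directory, as described in this thread; the root path in the usage note is a hypothetical placeholder.

```python
from pathlib import Path
from typing import Dict


def folder_size_bytes(folder: Path) -> int:
    """Total size in bytes of all files under folder, including subfolders."""
    return sum(p.stat().st_size for p in folder.rglob("*") if p.is_file())


def job_sizes_mb(hpc_root: Path) -> Dict[str, float]:
    """Map each immediate subfolder of hpc_root (for example a JobXXX folder) to its size in MB."""
    return {
        sub.name: folder_size_bytes(sub) / 1e6
        for sub in sorted(hpc_root.iterdir())
        if sub.is_dir()
    }
```

Running `job_sizes_mb(Path(r"C:\HPCanalyze"))` (path made up) after each queue and noting which jobs failed to copy back would show whether the failures cluster above some size.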

 

 

Message 9 of 19
OmkarJ
in reply to: apolo_vanderberg

Thanks, here are the clarifications

 

Do you see these issues more when you have to stop the cluster so that others can mesh?

No. We typically don't observe this while we manually stop. The problem is in cluster communication and coordination when it is operating on its own, ie, updating the folders with results after simulation is complete etc. I only raised this issue to 

 

If so, is there any reason you do not leave the cluster running with its 1 license and let the 2 engineers take turns meshing (as meshing shouldn't take that long and would be less troublesome than stopping the whole queue and then restarting it)?

The nature of CFD work dictates that the majority of a CFD engineer's time is spent on geometry cleanup, meshing and model setup. Since we use parametric models for geometry and templates for model setup, these are relatively quick, so meshing is what occupies most of an engineer's time. It is not feasible for engineers to sit idle in turns just to keep the queue unaltered. In fact, I strongly object to meshing occupying a solver licence, considering that meshing and solving the NS equations are two completely separate processes. I do not observe this in any other software, as all come with separate licences for meshing and solving. I am trying to find the right platform to communicate this to Autodesk. I am sorry if I sound rude, but is it just me... or is it true that Autodesk has employed this unfair and unjust tactic, even if it comes from BRN as legacy?

 

It might be useful to keep a mental note of the typical sizes of the JobXXX folders, so that we can see whether there is a threshold where this appears (does it happen more often as the folder size increases? Is there a specific size that starts being problematic?).

I have observed that small and big meshes behave just as randomly as each other.

 

OJ

Message 10 of 19
apolo_vanderberg
in reply to: OmkarJ

Omkar,

 The fact that meshing takes a solver license has been part of CFdesign for many years.
This isn't something that Autodesk introduced when we were acquired.

 

If this is something you'd like to see changed, I would recommend you logging an Enhancement Request on IdeaStation (forum thread sticky post has the link to this), as this will allow you to log what you'd like and allow for other users to vote on existing ideas to help promote their priority.

 

So from what you've stated the bulk of the issues comes from when you do not touch any of the jobs and they finish on their own?

Do some of those jobs sit in a Finished state for a while before the analysis is opened and the data is then copied back to the local machine?

 

 

Message 11 of 19
OmkarJ
in reply to: apolo_vanderberg

 The fact that meshing takes a solver license has been part of CFdesign for many years.
This isn't something that Autodesk introduced when we were acquired.

I understand, and hence I mentioned that this may have been adopted as a legacy practice from BRN's CFDesign. I have no doubt that all CFD engineers who have worked on CFDesign/SimCFD would have sulked over this limitation when they were faced with it. Change for the better is always welcome. I hope you are able to see the point 🙂

 

If this is something you'd like to see changed, I would recommend you logging an Enhancement Request on IdeaStation (forum thread sticky post has the link to this), as this will allow you to log what you'd like and allow for other users to vote on existing ideas to help promote their priority.

I did that as soon as I posted the earlier post, since that seems to be the only platform on which to voice this concern.

 

So from what you've stated the bulk of the issues comes from when you do not touch any of the jobs and they finish on their own?

Mostly, yes. Probably, it may be because only the untouched queue is long. But then, I gave up on causality on this!
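To put a rough number on why long untouched queues would show the problem more often: here is a back-of-envelope sketch using the 20-30% per-job figure and the 3 to 8 job queue lengths mentioned earlier in this thread. It assumes each job fails to copy back independently of the others, which nothing in this thread establishes, so treat the output as illustrative only.

```python
def p_at_least_one_failure(p_fail: float, n_jobs: int) -> float:
    """Chance that at least one of n_jobs fails, assuming independent failures with probability p_fail each."""
    return 1.0 - (1.0 - p_fail) ** n_jobs


# Per-job failure rates and queue lengths quoted in the thread
for p in (0.2, 0.3):
    for n in (3, 8):
        print(f"per-job rate {p:.0%}, queue of {n}: {p_at_least_one_failure(p, n):.0%}")
```

Under that independence assumption, a queue of 3 already has roughly a 49% to 66% chance of at least one job not coming back, and a queue of 8 roughly 83% to 94%, so a long weekend queue would almost always show the problem even if nothing about the queue itself causes it.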

 

Do some of those jobs sit in a Finished state for a while before the analysis is opened and the data is then copied back to the local machine?

This depends on the holidays etc. You can imagine that at a time, only one scenario is open so only one job gets copied back. All the scenarios in same cfdst or other cfdst sit idle until they are opened. Typically, the larger queues get executed only on holidays, ie, weekends and bank holidays etc. Because, on day-to-day basis, I prefer to open the run scenarios and get the results copied, lest - the inconsistency in the cluster integration would start affecting us.

 

OJ

 

 

Message 12 of 19

Hello, I'm having the exact problem that you listed here. Based on how it was written, I'm guessing that there's a solution?
Essentially, I've got runs that appear on the remote solver as having finished after a night of running, but they don't write properly back on my laptop. This only happens overnight when I have disconnected the laptop from the network where the remote solver lives.
Is there a solution? I'm stuck on 2013 for a little while for info.


@apolo_vanderberg wrote:

7) If you do not close the interface do the jobs more consistently come back?



 

Message 13 of 19
Erik.Gifford
in reply to: RobTipples

We've consistently been fighting this and similar issues with remote solving  for the past few revisions of the software.  It usually falls into a couple categories:

1) User kicks the analysis over to the remote PC to solve and closes the interface.  The solution runs and states it is finished, but when the user reconnects, the simulation results don't pull back to the user's PC.  This has happened a couple of times when the user leaves the UI open, but it happens more often when they close the interface and use the simulation monitor tool to watch for when their sim finishes as they do other work.  The same simulation will run and complete when run locally.

 

2) Queuing function fails to hand off to the next simulation.  For example, a stack of 6 simulations is queued (from one user or from a couple of users).  A simulation in the queue will finish, but the next one in the sequence fails to start.  The only solution we've found is to stop and start the solution service, which flushes out the queue and forces everyone to resubmit.

 

We have one user who uses Sim360 for this and rarely runs into a problem; when there is one, it is usually an issue with how the simulation was set up, the geometry or such.  The problem is that our other users have to use the in-house setup, and they continue to have sporadic issues with remote solving, to the point that they sometimes choose to avoid going remote and just run locally.  In our case we're just remote solving to a single PC, not a cluster.  We do it to allow multi-user queuing using a single solver license and to avoid crippling the user's PC as the run completes.


Erik

Message 14 of 19
RobTipples
in reply to: Erik.Gifford

I did some fairly extensive testing to track the exact problem regarding 1) in your example above. I've reported the exact bug to Autodesk so they know about it now. Perhaps it'll be solved in 2015?

 

Specifically, the problem relates to killing the process “SimCFDserver 2014”. This can be either through shutting down windows or by using task manager; both give the exact same result.

 

My team now logs in remotely to the CFD solver machine. This solves the problem. It might improve the queuing too?

Message 15 of 19
OmkarJ
in reply to: RobTipples

Rob,

 

I second your view; this is how we do it as well. Logging in remotely to the headnode of the cluster and then queuing is generally more robust. But we have two licences and of course would like to run two simulations at any time. We generally have several scenarios in one Design Study, but have to queue them serially since you can't run two scenarios in parallel with the same solver (technically you can, by using half the number of cores etc., but it is the least robust approach in my experience). So currently I have to create a copy of the same simulation file locally and queue half of the scenarios locally while the cluster is solving the other half of the queue. So there are generally two simulation files for a project.

 

To mitigate this, I was thinking of experimenting a little.  If I could map my local machine as remote solver for cluster's headnode, then while half the scenarios are queued on the cluster (run through its headnode by remote logging), I can run other scenarios as well. So basically, my local PC (a decent configuration so won't buckle)  will be a remote solver to my cluster's headnode! I know it's crazy. Have you tried this workaround ever?

 

I also welcome Autodesk personnel to comment on this approach. 

Message 16 of 19
RobTipples
in reply to: OmkarJ

A solution so simple?!?

 

More seriously, it's not a method that we could really use here. I'd be interested to hear if you ever get it working though!

Message 17 of 19

Rob,
This is a good example where in your case we had very specific workflows that allowed us to consistently repeat what you were seeing and be able to present this to our development team. It is something that is actively being investigated.
Message 18 of 19
OmkarJ
in reply to: apolo_vanderberg

Apolo,  

 

Any thoughts on the method I mentioned (using local machine as a remote solver for cluster)? Not the most elegant approach, nor do I know if it will work consistently...

 

Thanks

Message 19 of 19
apolo_vanderberg
in reply to: OmkarJ

Omkar,

There shouldn't be anything wrong with this procedure. Obviously, once you remote solve from the Remote machine to your local machine, you will want to avoid shutting down your local machine while the run is occurring, so that you don't kill the solution prematurely.

 

 

Apolo
