Cloud nodes don't progress beyond "submitted" stage

brandon_hazelton
Not applicable

Message 1 of 21

[ FlexSim 23.0.2 ]

After following this walk-through, along with some information from this seemingly old procedure, I have ended up in a state where the experiments that I set to run on distributed cloud nodes reach the "Submitted" stage but never progress beyond it. I have allowed them to wait in that state for up to 10 minutes with no change. Do you have any ideas of what I could do to debug this?

I am using AWS with an EC2 instance that has 2 vCPUs and 4GB RAM and is running Windows Server 2022. Due to the low RAM amount, I entered "1" for the CPUs field in my local FlexSim setup in order to abide by the minimum/recommended requirements called out in Distributed experiments or optimizations.

Additionally, I seem to get this error in the System Console every few tries. Unsure if it is related.

exception: Experimenter: [IP ADDRESS] did not return valid ports
exception: Experimenter: Unable to create child processes on cloud nodes
exception: Experimenter: Could not get jobID
exception: FlexScript exception: MAIN:/project/library/Experimenter>behaviour/eventfunctions/assertScenario c: MODEL:/Tools/Experimenter
exception: FlexScript exception: MAIN:/project/library/Experimenter>behaviour/eventfunctions/assertTask c: MODEL:/Tools/Experimenter
exception: FlexScript exception: Array index out of bounds at MAIN:/project/library/Experimenter>behaviour/eventfunctions/approveTasks c: MODEL:/Tools/Experimenter
exception: treewin__CallUserCallback
ex: CallUserCallback
Replies (20)
Message 2 of 21

ben_wilson5K2EV
Autodesk

Hi @Brandon H,

The exceptions suggest that your main FlexSim instance is not able to successfully connect to the cloud nodes. Do you have the webserver running there? Are you able to hit those nodes' FlexSim webservers with an internet browser over their IP address and your chosen port number?
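If you want to script Ben's reachability check from the local machine, a plain TCP connect is enough to tell whether anything is listening at the node's address. A minimal sketch, where the host and port are placeholders for your node's IP and chosen port:

```python
import socket

def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example: check the FlexSim webserver on a (placeholder) cloud node address.
# print(can_reach("203.0.113.10", 80))
```

Note that a True result only proves something is listening on that port; you would still want to load the page in a browser to confirm it is actually the FlexSim webserver responding.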

Also, the article mentions that your cloud nodes ought to meet FlexSim's recommended requirements - not the minimums. It could be that your nodes don't have enough RAM to run even one replication of your simulation. This depends on your model's exact requirements, of course, but I will note that in 2023 our minimum RAM spec is 8GB.

As a test you could run a replication directly on one of your nodes while watching the Task Manager to make sure that you aren't getting close to your node's hardware limitations. Keep in mind that when running as a cloud node you'll need additional overhead at the end of the model run to store and report back any metrics your model keeps, so you may need several hundred more MB when this node runs your model as a cloud node, depending on what stats you're gathering. See the Experiments and Optimizations heading under Memory for more info about how a replication uses memory.

@Jordan Johnson, do you have any other suggestions or ideas for Brandon?

Message 3 of 21

brandon_hazelton
Not applicable
@Ben Wilson Firstly, thank you for redirecting this as its own question. I will make sure to keep this in mind for future items.


The only checks that I had performed to determine if I was getting a proper connection were opening the webserver in a browser on the node itself and using the "Test Connections" button in the Global Preferences/Environment/Cloud Computing section of FlexSim on my local computer. The fact that the job replications hit the "Submitted" state but did not do anything leads me to believe that the connection was made but perhaps the resources available were not adequate. I was not aware that the new minimum RAM spec was 8GB. I will have to try getting a more suitable instance next time I try this and follow your instructions on monitoring the status of the node.

I will report back any findings after I conduct this test.

Message 4 of 21

jordan_johnsonPM57V
Autodesk

I'm not sure what would cause this. How many instances are you launching? If you are launching more than 4 instances, I could see FlexSim spending more than 10 minutes starting the experiment.

Have you verified that your model can run in Windows Server 2022? I recently had trouble making a container in that version of Windows, so I wonder if something about it doesn't work.

Other than that, I'm not sure what might be wrong. We'd probably need to look at an exact model to be more sure.

As far as the exceptions you are getting, are you starting new instances between each try, or are you trying multiple times with the same instance? The exceptions look like the instance still has a FlexSim child process running and consuming the port, so new experiments can't use that port.

Jordan Johnson
Principal Software Engineer

Message 5 of 21

brandon_hazelton
Not applicable

I am unsure what you mean by "instances" here. In terms of the number of scenarios and replications per run, I had initially started by trying 1 scenario and 5 replications. I have since cut that down to 1 scenario and 1 replication. I found that it required about 300MB of RAM when run on my local computer. I just opted to go for a 4-vCPU, 16GB RAM EC2 instance and found the same result: the job status stayed stuck at "Submitted" for ~10 minutes before I quit it.



Upon further testing, that exception error would show up if I stopped the test and started again immediately without closing out the FlexSim Program on my local computer. I did not touch anything on the cloud node/instance.

I will now try to use an older Windows Server version to see if that helps.

UPDATE - Switching to Windows Server 2016 did not change the outcome.

Message 6 of 21

brandon_hazelton
Not applicable


I have included some screenshots to hopefully help debug what I am doing wrong, along with some information about the tools I used:

AWS Instance:
Config 1 -> Windows Server 2022, 4 vCPUs, 16GB RAM, 30GB Storage
Config 2 -> Windows Server 2016, 4 vCPUs, 16GB RAM, 30GB Storage

Experimenter Settings:
Job 1 -> 1 Scenario with 1 Replication

Added Data:
When the Experimenter is run on my laptop, the specific job seems to require a max of 300MB RAM. On the cloud node/instance, I see a spike that looks to be around 300MB RAM when I first send the job but the RAM drops back down to an idle state immediately afterwards. Something similar happens with the CPU usage. I have waited up to 20 minutes for the "Submitted" status to change to "Running".

experimenter-setup.png


Experimenter and cloud node setup

Image.png

Verification that Webserver can be reached

Image.png

Remote Desktop view of cloud node/instance (moments after the Job was started on the local computer)

Message 7 of 21

jason_lightfootVL7B4
Autodesk

Can you share your webserver Configuration file?

Message 8 of 21

brandon_hazelton
Not applicable
# This file needs to be in the same directory as flexsimserver.bat


General:
    Flexsim Program Directory:      %PROGRAMFILES%\FlexSim 2023\program
    Model Directory:                %DOCUMENTS%\FlexSim 2023 Projects
    Port:                           80
    Reply Timeout (milliseconds):   10000
    Max Instances (of Flexsim):     8
    Max Threads Per Instance:       max
    Ignore Auto Save Files:         yes
Remote Operations (security hazards):
    Model Uploading:                no
    Model Downloading:              no
    Model Deleting:                 no
    Max Upload Size (bytes):        10000000
Jobs:
    Flexsim Data Directory:         %AllUsersProfile%\FlexSim\FlexSim23.0
    Max Job Queue Length:           100
    Max Job Timeout (seconds):      3600
Windows Authentication:
    Use Windows Authentication:     no
    Restrict UserGroup Directories: no
    Active Directory:
        url:                        ldap://dc.domainName.com
        baseDN:                     dc=domainName,dc=com
        username:                   specialUser@domainName.com
        password:                   password
Session:
    Enable:                         no
    Secret:                         flexsim secret
    Max Age (seconds):              3600
Message 9 of 21

jason_lightfootVL7B4
Autodesk

In the server's Task Manager, do you see instances of FlexSim being started?

Message 10 of 21

brandon_hazelton
Not applicable
I do not have one up and running to show right now, but when I push a job from my local machine to the server, I do notice extra sub-tasks of the main FlexSim task being created in Task Manager.
Message 11 of 21

jason_lightfootVL7B4
Autodesk
There should be no 'main' FlexSim task running before you submit a job to the server from your laptop. You can add the Command Line column to Task Manager to see what type of process is being invoked - but it does sound like the launching of child processes for each replication is working.


Message 12 of 21

brandon_hazelton
Not applicable

I ran another test just after I sent that comment, and you are correct. When I run the webserver on the server, the command prompt comes up but no task appears in Task Manager. A task only shows up in Task Manager once I run a job from my local computer.

We have not talked about this yet, but with regard to the security rules (Inbound/Outbound), I have only added Inbound rules for the server as shown in this post. Is there any chance that I need to add Outbound rules, or add rules on my local computer? This is the only thing I have come up with so far on the debugging front.

Message 13 of 21

jordan_johnsonPM57V
Autodesk
Accepted solution

@Brandon H It's not clear to me what's happening. It does seem likely that the connection isn't working correctly. These are the things I have thought of:

  • Be sure you have the latest version of the webserver installed on the EC2 instance.
  • Be sure you have the same version of FlexSim installed on the instance as you are running locally. For example, if you are using 23.0.2 on your local computer, install 23.0.2 on the EC2 instance/image.
  • Be sure that the webserver is running when the instance starts and is listening on port 80

Try testing a basic model:

  • Use a very simple model for testing, perhaps Source-Queue-Processor-Sink.
  • Verify that you can run that experiment normally/locally on your own computer
  • Verify that you can run the model on the EC2 instance. I think you can drag/drop or maybe copy/paste the model file to the EC2 instance. Then open and run it with FlexSim. Since it's a simple model, it should run fine in that environment.
  • Try running the experiment using the EC2 instance.

Try testing your model:

  • Make sure you can run every scenario (at least for a short amount of time) on your local computer.
    • If you use the experimenter, make sure that there are no system console messages in the performance measures.
    • If you run locally, make sure the scenario works.

If the simple model works and yours doesn't, then something about your model is the likely cause. Perhaps your model relies on files that aren't present on the EC2 instance? For example, I don't think reading from an Excel file works unless you've installed Excel on the instance.
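One way to catch the missing-file case is a quick pre-flight check before copying the model over. A small sketch; the file names below are purely hypothetical, so list whatever external files your own model actually reads:

```python
from pathlib import Path

# Hypothetical external files a model might read at runtime.
REQUIRED_FILES = ["data/arrivals.xlsx", "data/routing.csv"]

def missing_files(base_dir: str, names: list) -> list:
    """Return the subset of `names` that does not exist under base_dir."""
    base = Path(base_dir)
    return [n for n in names if not (base / n).is_file()]

# Run from the model's directory on the EC2 instance:
# print(missing_files(".", REQUIRED_FILES))
```

An empty result means every listed file is present; anything it returns is a file the remote replications would fail to find.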

Here is some basic information about how the process works:

  • User runs an experiment that uses Remote CPUs.
  • FlexSim sends a synchronous HTTP request to the remote host
  • The host PC launches a FlexSim instance
  • That instance of FlexSim launches the specified number of child processes
  • Those child processes attempt to bind a listening TCP socket, starting with port 9000, and consuming one socket per child process
    • If they don't have permission to bind TCP listening sockets, then this might fail
    • If some other program has already bound itself to port 9000, this might fail.
  • The FlexSim process that launched the children pings each child process to detect that it is ready. Once all are ready, the main process writes a file called "ports.txt"
    • If your instance doesn't have a hard drive/storage for files, then this will fail
  • Once the ports.txt file is present, the webserver responds to the HTTP request to spawn child instances, telling FlexSim which ports to connect to.
  • The main/local FlexSim then connects to each of the remote FlexSims with a TCP socket.
  • The main/local FlexSim saves the entire main tree into memory.
  • The main/local sends a copy of the main tree to the child processes, which attempt to load the main tree.
    • If your model uses modules that aren't installed on the remote instance, this won't work.
  • After that, the experimenter submits the tasks. They are marked in the database as submitted.
  • The child processes are supposed to work as a team to do tasks as they come up.
    • For each task, the child process sets the parameters, resets the model, and then runs the model.
    • If something goes wrong during this process (setting parameters, resetting, and running) then the child process might vanish, leaving FlexSim waiting for work to happen that will never happen.
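The port-binding step above is easy to reproduce outside FlexSim. The following is a diagnostic sketch (not FlexSim's actual code) that tries to bind consecutive listening sockets starting at 9000, the way the child processes do; if it reports fewer ports than requested, another program already holds one of them or the account isn't permitted to listen:

```python
import socket

def probe_ports(start: int = 9000, count: int = 4) -> list:
    """Try to bind `count` consecutive TCP listening ports beginning at
    `start`; return the list of ports that could be bound."""
    bound, sockets = [], []
    for port in range(start, start + count):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            s.bind(("0.0.0.0", port))
            s.listen(1)
            sockets.append(s)
            bound.append(port)
        except OSError:
            s.close()
    for s in sockets:  # release everything so FlexSim can use the ports
        s.close()
    return bound

# On an idle node, probe_ports(9000, 4) should report all four ports;
# a shorter list points to a port conflict or a permissions problem.
```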
Jordan Johnson
Principal Software Engineer

Message 14 of 21

jeanette_fullmer88DK3
Autodesk

Hi @Brandon H, was Jordan Johnson's answer helpful? If so, please click the "Accept" button at the bottom of their answer. Or if you still have questions, add a comment and we'll continue the conversation.

If we haven't heard back from you within 3 business days we'll auto-accept an answer, but you can always unaccept and comment back to reopen your question.

Message 15 of 21

brandon_hazelton
Not applicable
@Jeanette F Thanks for reminding me to close this out. While Jordan's comment just above did not contain the actual answer to my question, it did guide me down the right debugging paths to learn that it was something to do with my company's network protocols. I have since resolved the issue.


However, since his comment is a "comment" and not an "answer", I cannot seem to mark this question as "answered". The only "answer" is the other response he gave me early on in this question's history.

Please let me know how I should proceed or, if you would, close out this question in whatever way you deem fit based on what I have relayed back here.

Message 16 of 21

preetdesai
Not applicable

Hi @Brandon H. Were you able to solve this problem? I am having a similar issue where a simple model just gets submitted and does not go beyond that.

Message 17 of 21

jason_lightfootVL7B4
Autodesk

You can see from Brandon's reply below that Jordan's answer "...did guide me down the right debugging paths to learn that it was something to do with my company's network protocols. I have since resolved the issue."

Message 18 of 21

brandon_hazelton
Not applicable
@Preet As I said in my latest reply, it was an issue with the network rules set by my company's IT department, since I am a remote employee. I found this out by testing the walkthrough on a spare computer, where a full experiment ran with immediate success. This indicated to me that the walkthrough was indeed correct and that there must be something unique about my work computer.


Based on my experience, I recommend looking into what rules are set on your local computer as well as double-checking your server's rules set to make sure there are no mistakes.

Message 19 of 21

preetdesai
Not applicable
I checked the inbound/outbound rules on EC2 and they were fine. I also checked the firewall on my local PC, and FlexSim is an allowed app. I also checked on the remote desktop; both Node.js and FlexSim are allowed through the firewall. The only thing I can think of is that I have an enterprise license, which I have to run by connecting over VPN to my employer's network. Not sure what to do next!
Message 20 of 21

kavika_faleumuE6HT5
Autodesk
Autodesk
Hello @Preet, I suggest you make a separate post/question so we can look into your specific problem and not lose it in this comment thread. You can link back to this post in your new question so future viewers can have context that you've already tried some of the troubleshooting methods found here.