Re: ServerList bails on first unreachable machine

Anonymous · ‎01-24-2007

We are hitting some problems with the ServerList. When querying the list, there is an exception thrown on the first machine that is inaccessible. We have a network of many machines in two facilities, and the other facility is firewalled from the one we are in. The servers in the second facility come first in the list, and basically any problem (a refused connection, name resolution that does not work etc.) stops the iterator across the list.

The following code fails in 50-60 percent of the cases, and always on different machines in the list:

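    // Unwrap the C++ object held by the Ruby wrapper (Ruby C extension API).
    WireTapServerList* server_list;
    Data_Get_Struct(self, WireTapServerList, server_list);

    int num = 0;
    // getNumNodes() returns false on failure; num is then still 0.
    if (!server_list->getNumNodes(num)) {
        throwup("Problem in wiretap: %s. File: %s: %d", server_list->lastError(), __FILE__, __LINE__);
    }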

Zero stays in num. Is that the expected behavior? Can the problem with unreachable machines be resolved in a more elegant way so that the rest of the list stays accessible?

Additionally, wiretap_server_dump has the same problem. The first time I run it, I get:

review:/Code/wiretap-api/tools/Mac/OSX julik$ ./wiretap_server_dump
WireTap Servers Storage ID Plug-In
-------------------------------------------------------------------------------
---- list of hosts ---

The second time though:

review:/Code/wiretap-api/tools/Mac/OSX julik$ ./wiretap_server_dump
Failed to acquire server list from interface 192.168.171.198 : Can't open connection to host '192.168.171.193': Connection refused

The failure seems to occur at random (I guess it has to do with some internal sorting issue in the C++ code).

Message was edited by: julik_tar

labuted · ‎01-24-2007

Hi,

Server self-discovery is somewhat complex, and it sounds like your network is too. I would first need to know which version of the SDK you are using. We fixed some important bugs in the last release. We'd also need to know the versions of all of your servers ... especially if you have a beta version in house.

First off, the mechanism by which servers are discovered is to send out a multi-cast on all network interfaces to see who responds on the current subnet. The first server to respond is queried for its list of all other servers. Nice and clean ... nothing to configure ... when it works.
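
The flow described above can be sketched as a small simulation. This is only an illustration under stated assumptions: `Reply`, `Discovery`, `QueryList` and `discover` are hypothetical names, not part of the WireTap SDK, and the multicast itself is replaced by a list of replies with made-up latencies.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// One answer to the discovery multicast (hypothetical type).
struct Reply {
    std::string server;  // address of the server that answered
    int latencyMs;       // how long its answer took to arrive
};

// Result of one discovery attempt (hypothetical type).
struct Discovery {
    bool ok;                           // true only if the gateway query succeeded
    std::string gateway;               // first responder, empty if nobody answered
    std::vector<std::string> servers;  // filled only when ok is true
};

// Assumed callback: asks `server` for its list, returns false if unreachable.
using QueryList = std::function<bool(const std::string &server,
                                     std::vector<std::string> &out)>;

// Pick the fastest responder and ask only that one for the full list.
Discovery discover(const std::vector<Reply> &replies, const QueryList &queryList) {
    Discovery d{false, "", {}};
    if (replies.empty()) return d;

    const Reply *first = &replies[0];
    for (const Reply &r : replies) {
        if (r.latencyMs < first->latencyMs) first = &r;
    }
    d.gateway = first->server;

    std::vector<std::string> list;
    if (queryList(first->server, list)) {
        d.ok = true;
        d.servers = list;
    }
    return d;  // a failed query fails the whole attempt; no other server is tried
}
```

Because the gateway is chosen by arrival order, two runs on the same network can pick different servers, so a single misbehaving responder produces failures that look random.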

In your case, the connection refusal is happening when the first server to respond is queried. It's perfectly normal for the server to vary based on a number of factors related to your network and the activity on each server.

It's odd that the WT server running on 192.168.171.198 is responding to a multi-cast, but that you cannot connect to it. I would check to make sure the server is not in a funny state (e.g. zombie process) and that it is properly configured. Try restarting it.

You may also have some mixed versions in your facility. In particular, there is a known issue if you are a beta site and are not using the latest SDK.

As for 'num' being returned as zero, it is common practice not to set return parameters when a function call fails. The WT API calls should all (hopefully) be consistent in this respect.
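
A minimal sketch of that convention, using a hypothetical `FakeList` stand-in rather than the real WireTap class: the function reports failure through its return value and leaves the output parameter alone, so the caller has to test the result before reading `num`.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a list object; this is not the WireTap API.
struct FakeList {
    bool reachable;                  // whether the backing server answers
    std::vector<std::string> nodes;  // node names, meaningful only when reachable
    std::string error;               // set when a call fails

    // Returns false on failure and leaves `num` untouched.
    bool getNumNodes(int &num) {
        if (!reachable) {
            error = "Connection refused";
            return false;
        }
        num = static_cast<int>(nodes.size());
        return true;
    }
};
```

With this pattern a caller that initializes `num` to 0 and ignores the return value cannot tell "zero nodes" from "call failed", which is why the return value has to be checked first.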

Hope this helps,

Dan

Anonymous · ‎01-24-2007

Yes, makes perfect sense - we eliminated 99 percent of the failures by removing the 2007 beta version from one of the burn machines (looks like it's wiretapp'ed as well now). However I am still getting intermittent problems when ServerList is accessing machines in another facility (that is, sometimes server_dump works, sometimes not).

AFAIK, we are running wiretapd's 1.7 on flame 9.513 and friends, and I am using the latest published client API (2007.0).

The question I wanted to ask is - it would be sane to expect the client to set some kind of flag (for instance, return a ServerInfo that responds "dead") but go on with the iteration to let me access the machines which are, well, accessible. Or at least count them. Not bail on the first unreachable host - because otherwise it means that any intermittent crash brings the autodiscovery down.
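
A rough sketch of the behavior being asked for here, with made-up names: `HostStatus`, `Probe`, `probeAll` and `countAlive` are hypothetical and not part of the WireTap API. Each host gets its own status, and a failure is recorded on that host instead of ending the iteration.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical per-host record; the real ServerInfo class may look different.
struct HostStatus {
    std::string address;
    bool alive;
    std::string error;  // empty when alive
};

// Assumed callback: contacts one host, returns false and fills `err` on failure.
using Probe = std::function<bool(const std::string &address, std::string &err)>;

// Visit every host; a failure is recorded on that host instead of ending the loop.
std::vector<HostStatus> probeAll(const std::vector<std::string> &addresses,
                                 const Probe &probe) {
    std::vector<HostStatus> result;
    for (const std::string &address : addresses) {
        std::string err;
        const bool ok = probe(address, err);
        result.push_back(HostStatus{address, ok, ok ? std::string() : err});
    }
    return result;
}

// Count the hosts that answered, ignoring the dead ones.
int countAlive(const std::vector<HostStatus> &hosts) {
    int n = 0;
    for (const HostStatus &h : hosts) {
        if (h.alive) ++n;
    }
    return n;
}
```

The caller can then list or count the reachable machines and still show the error text for the ones marked dead.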

We do not want to use the beta SDK just yet because we primarily develop on OS X and there has been some flux, especially with compilers/linkers used.

>> In your case, the connection refusal is happening when the first server to respond is queried.

This is clear, but I don't understand the motivation of this decision - let the whole ServerList bail out if one of the hosts is not working right (and let it be a host determined by network latency - to boot 🙂 ).

Besides, in this case .198 is my client machine, the one that does the querying.

labuted · ‎01-24-2007

> Yes, makes perfect sense - we eliminated 99 percent of
> the failures by removing the 2007 beta version from one of
> the burn machines (looks like it's wiretapp'ed as well now).

Yes, BB manager is now a WT server. If you upgrade to the WT 2007.1 client, the problem will disappear. We unfortunately had no choice but to introduce this bug to fix another issue in older servers. We only discovered the issue in beta1 of 2007.1.

> However I am still getting intermittent problems when
> ServerList is accessing machines in another facility (that is,
> sometimes server_dump works, sometimes not).

Not sure I understand. Are you saying that you sometimes don't get servers from the other facility? This would likely happen if the other facility is accessible via another network interface. We use a 1 second timeout to discover servers on an interface before we give up. Perhaps the latency to/from your other facility is slightly longer. What does ping say?
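
For illustration only, a bounded wait of this kind can be written with POSIX `poll()`. The one-second figure is the one quoted above; the function name and the use of `poll()` are assumptions of this sketch and say nothing about how the WireTap client actually implements its timeout.

```cpp
#include <cassert>
#include <chrono>
#include <poll.h>
#include <unistd.h>

// Wait until `fd` has data to read or `timeout` elapses.
// Returns true if data is ready, false on timeout or on a poll() error.
bool waitReadable(int fd, std::chrono::milliseconds timeout) {
    struct pollfd p;
    p.fd = fd;
    p.events = POLLIN;
    p.revents = 0;
    const int rc = poll(&p, 1, static_cast<int>(timeout.count()));
    return rc > 0 && (p.revents & POLLIN) != 0;
}
```

A caller would pass the discovery socket and `std::chrono::milliseconds(1000)`. Any reply that takes longer than the window is indistinguishable from no reply at all, which is how a slow link can make servers drop out of the list intermittently.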

The v2007.1 SDK has the ability to target a specific machine to act as a gateway when obtaining a server list on an interface. This would allow you to bypass the multi-cast. I suppose we should add a timeout specification as well. In general, latencies greater than a second are considered fairly long.

> AFAIK, we are running wiretapd's 1.7 on flame 9.513 and
> friends, and I am using the latest published client API
> (2007.0).

Upgrading the servers ($) to 2007 would be helpful as well on a number of fronts.

> The question I wanted to ask is - it would be sane to expect
> the client to set some kind of flag (for instance, return a
> ServerInfo that responds "dead") but go on with the
> iteration to let me access the machines which are, well,
> accessible. Or at least count them. Not bail on the first
> unreachable host - because otherwise it means that any
> intermittent crash brings the autodiscovery down.

It is possible, but we rarely encounter a situation where a server responds to the initial multi-cast without being able to respond to the simple query to follow. We have pushed very hard to make this stuff just work ... with minimal if any configuration.

In general, multi-cast is not considered a completely reliable protocol, for a number of reasons, most of which are out of our control. Your users will sometimes need to manually refresh a server list, and you should also include a text field in your GUI to allow the user to target a specific host. I do agree that we can always improve our discovery system to introduce better failover.

> We do not want to use the beta SDK just yet because we
> primarily develop on OS X and there has been some flux,
> especially with compilers/linkers used.

The beta SDK is VERY stable, and was specifically created to provide MacOS universal libraries which may be of interest to you. Dynamic libraries are also provided to get around compiler mismatches. What problems are of particular concern to you?

> I don't understand the motivation of this decision - let the
> whole ServerList bail out if one of the hosts is not working
> right (and let it be a host determined by network latency -
> to boot 🙂 ).

To add to my answer above, it should be said that we do NOT contact each server on the list ... which would be quite slow. We only contact the first one that responded. Servers that are down will not be involved in the discovery. It is possible but unlikely that this machine cannot respond to the subsequent query ... unless we have a bug on our side ... as we do in this case. ;).

Cheers,

Dan

Anonymous · ‎01-24-2007

> Not sure I understand. Are you saying that you sometimes don't get
> servers from the other facility?

Not quite. The name resolution does not work across facilities - that is, I am in something.local (under which I reliably have tezro-01, burn4 etc.) but the other facility is somethingelse.local, with its own flint, inferno and burns. Wire works, but the conventional DNS routines don't. We are connected by fiber so if I access the machines by IP it works fairly quickly.

> Upgrading the servers ($) to 2007 would be helpful
> as well on a number of fronts.

You meant $$$.$$$ ? 🙂 Well I certainly should talk to my supervisor about that.

> it is possible, but we rarely encounter a situation where a server
> responds to the initial multi-cast without being able to respond to the
> simple query to follow.

Failing name resolution seems to be exactly the edge case you might consider though.

> The beta SDK is VERY stable, and was specifically created to provide
> MacOS universal libraries which may be of interest to you. Dynamic
> libraries are also provided to get around compiler mismatches. What
> problems are of particular concern to you?

We have to provide a working solution that end users compile themselves, and the Ruby build system becomes a little unwieldy if you start to cater for that many linker-compiler-locations combinations. We hope to catch up when beta1 is, well, out of beta.

labuted · ‎01-24-2007

> Not quite. The name resolution does not work across
> facilities - that is, I am in something.local (under which I
> reliably have tezro-01, burn4 etc.) but the other facility is
> somethingelse.local, with its own flint, inferno and burns.
> Wire works, but the conventional DNS routines don't.

I think I get it. See detailed answer below.

> You meant $$$.$$$ ? 🙂 Well I certainly should talk to my
> supervisor about that.

The new subscription model is only $$. 🙂

> Failing name resolution seems to be exactly the edge case
> you might consider though.

Indeed. We have already fixed all of this ... in v2007. 🙂 If you notice, each WT server broadcasts its hostname to a remote WT client ... which causes the DNS issues you describe. In v2007, servers broadcast their display name (usually hostname), a house network IP, and a high-bandwidth IP for media transfers ... thus bypassing the very DNS issues you are encountering.
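
As a sketch of why announcing addresses sidesteps DNS, assume a record shaped roughly like the one below. `ServerAdvert`, its field names, `Traffic` and `addressFor` are all hypothetical; the actual v2007 wire format is not shown in this thread.

```cpp
#include <cassert>
#include <string>

// Hypothetical shape of what a v2007 server announces; field names are made up.
struct ServerAdvert {
    std::string displayName;  // usually the hostname, used only for display
    std::string houseIp;      // address on the house network
    std::string mediaIp;      // address on the high-bandwidth network
};

enum class Traffic { Control, Media };

// Choose the address to connect to without ever resolving displayName.
// Falling back to houseIp when mediaIp is empty is an assumption of this sketch.
std::string addressFor(const ServerAdvert &s, Traffic kind) {
    if (kind == Traffic::Media && !s.mediaIp.empty()) return s.mediaIp;
    return s.houseIp;
}
```

Since the client only ever connects to an IP taken from the announcement, a hostname that does not resolve in the other facility's domain no longer matters.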

> We have to provide a working solution that end users
> compile themselves, and the Ruby build system becomes a
> little unwieldy if you start to cater for that many
> linker-compiler-locations combinations. We hope to catch up
> when beta1 is, well, out of beta.

IMHO, linking against a dynamic library is the safest bet for you. Not sure I see the issue. Aside, we're coming up on beta3, so get ready. 🙂

What you're doing sounds interesting. Are you able/willing to provide more info to our product designers regarding this effort? Would you be willing to discuss confidentially off-line?

Anonymous · ‎01-24-2007

> IMHO, linking against a dynamic library is the safest bet for you.

As soon as you decide where to put it 🙂 Well we do it already anyway but there are these naasty DYLD_SCHTUFF envars and such.

> Not sure I see the issue. Aside, we're coming up on beta3, so get ready.

If the issue with name resolution is resolved in the betas and the first query failing is not causing the ServerList to throw up, then we are happy campers. Would be a pity if you need to do a $erver upgrade$ to benefit from the fix though.

> Would you be willing to discuss confidentially off-line?

The information about the effort (with necessary details) has been filed with you some time ago (through the Sparks program), but feel free to contact me by e-mail if you have any questions.

labuted · ‎01-25-2007

> As soon as you decide where to put it 🙂 Well we do it already anyway but there are these naasty DYLD_SCHTUFF envars and such.

The Mac is pretty crappy about this. There's a better way to avoid the env vars. The latest SDK talks about this.

> If the issue with name resolution is resolved in betas and
> the first query failing is not causing the ServerList to throw
> up then we are happy campers.

The real fix for your DNS issue is in the 2007 servers. A hacky fix (failover to next server) could only be applied in the beta client SDK, if at all.

> Would be a pity if you need to do a $erver upgrade$ to benefit
> from the fix though.

I can't see failing over as being all that helpful to you unless you have only one system with a DNS issue. If we fail over to the next server in the other facility, you'll just have the same problem again. Perhaps there is still something I don't understand.

> The information about the effort (with necessary details)
> has been filed with you some time ago (through the Sparks
> program), but feel free to contact me by e-mail if you have
> any questions.

I'll follow up with the Sparks Manager.

Thanks.
