> Yes, makes perfect sense - we eliminated the 99 percent of
> the failures by removing the 2007 beta version from one of
> the burn machines (looks like it's wiretapp'ed as well now).
Yes, BB manager is now a WT server. If you upgrade to the WT 2007.1 client, the problem will disappear. We unfortunately had no choice but to introduce this bug to fix another issue in older servers. We only discovered the issue in beta1 of 2007.1.
> However I am still getting intermittent problems when
> ServerList is accessing machines in another facility (that is,
> sometimes serverr_dump works, sometimes not).
Not sure I understand. Are you saying that you sometimes don't get servers from the other facility? This would likely happen if the other facility is accessible via another network interface. We use a 1-second timeout to discover servers on an interface before we give up. Perhaps the latency to/from your other facility is slightly longer than that. What does ping say?
The v2007.1 SDK has the ability to target a specific machine to act as a gateway when obtaining a server list on an interface. This would allow you to bypass the multi-cast. I suppose we should add a timeout specification as well. In general, latencies greater than a second are considered fairly long.
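To illustrate what a configurable discovery timeout could look like, here is a rough Python sketch. It is not Wiretap SDK code: `collect_replies` and its parameters are made up for illustration, and it assumes discovery replies arrive as plain UDP datagrams on an already-open socket. The default of 1 second mirrors the timeout mentioned above.

```python
import socket
import time


def collect_replies(sock, timeout_s=1.0, max_packet=1024):
    """Gather (address, payload) replies on `sock` until `timeout_s` elapses.

    Hypothetical helper, not part of any real SDK.
    """
    deadline = time.monotonic() + timeout_s
    replies = []
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # overall discovery window is over
        sock.settimeout(remaining)  # never wait past the deadline
        try:
            payload, addr = sock.recvfrom(max_packet)
        except socket.timeout:
            break  # nothing more arrived in time
        replies.append((addr, payload))
    return replies
```

With something like this, a caller talking to a high-latency facility could pass a larger `timeout_s` instead of relying on a fixed 1-second window.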
> AFAIK, we are running wiretapd's 1.7 on flame 9.513 and
> friends, and I am using the latest published client API
> (2007.0).
Upgrading the servers ($) to 2007 would be helpful as well on a number of fronts.
> The question I wanted to ask is - it would be sane to expect
> the client to set some kind of flag (for instance, return a
> ServerInfo that responds "dead") but go on with the
> iteration to let me access the machines which are, well,
> accessible. Or at least count them. Not bail on the first
> unreachable host - because otherwise it means that any
> intermittent crash brings the autodiscovery down.
It is possible, but we rarely encounter a situation where a server responds to the initial multi-cast without being able to respond to the simple query that follows. We have pushed very hard to make this stuff just work ... with minimal, if any, configuration.
In general, multi-cast is not considered a 100% reliable protocol for a number of reasons, most of which are out of our control. Your users will sometimes need to manually refresh a server list, and you should also include a text field in your GUI to allow the user to target a specific host. I do agree that we can always improve our discovery system to introduce better failover.
> We do not want to use the beta SDK just yet because we
> primarily develop on OS X and there has been some flux,
> especially with compilers/linkers used.
The beta SDK is VERY stable, and was specifically created to provide MacOS universal libraries which may be of interest to you. Dynamic libraries are also provided to get around compiler mismatches. What problems are of particular concern to you?
> I don't understand the motivation of this decision - let the
> whole ServerList bail out if one of the hosts is not working
> right (and let it be a host determined by network latency -
> to boot 🙂 ).
To add to my answer above, it should be said that we do NOT contact each server on the list ... which would be quite slow. We only contact the first one that responded. Servers that are down will not be involved in the discovery. It is possible but unlikely that this machine cannot respond to the subsequent query ... unless we have a bug on our side ... as we do in this case ;)
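For what it's worth, a failover along the lines you are asking for could look roughly like the Python sketch below. This is not how the current SDK behaves and none of these names exist in it: `query_with_failover`, `responders`, and `query` are hypothetical, and the sketch assumes an unreachable host shows up as an `OSError` from the query call. It still contacts only one server in the normal case, but moves on to the next responder instead of bailing when the first one fails.

```python
def query_with_failover(responders, query):
    """Ask responders in order for the server list; skip ones that fail.

    `responders` is the ordered list of hosts that answered discovery.
    `query` is a callable taking a host and returning the server list,
    raising OSError if the host cannot be reached (an assumption here).
    Returns (server_list, dead_hosts).
    """
    dead = []
    for host in responders:
        try:
            return query(host), dead  # first healthy responder wins
        except OSError:
            dead.append(host)  # remember it, then try the next one
    return [], dead  # nobody could answer the follow-up query
```

The caller gets both the list and the hosts that failed, so a client could flag those as "dead" without losing the rest of the discovery.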
Cheers,
Dan