Revit Cloud Worksharing Forum

Revit Cloud Worksharing (C4R) Operations Unavailable - May 14, 2018

Message 1 of 102
Kevin.Short
8527 Views, 101 Replies

The Revit Cloud Worksharing (C4R) service is currently unavailable. We are aware of the problem and are working hard to resolve the issue as quickly as possible. We will post updates as we know more.

 



Kevin Short
Senior Product Owner
Message 81 of 102
Anonymous
in reply to: Anonymous


@Anonymous wrote:

@sasha.crotty @Kevin.Short @KyleBernhardt @AdamPeter @Zsolt.Varga @Viveka_CD @Ian.Ceicys Autodesk team, it's now been 10, yes 10, LONG days since the last major update on the Monday 5/14 outage.

 

When can we expect a summary and not crickets?


You're wasting your time. They've never given a meaningful update that offers any assurance of lessons learned or future preventive actions. Just apologies, resumption notices, etc.

 

Autodesk's back-end systems are a mess. I know it. They know it. They just won't acknowledge it publicly. In private I have had more honest dialog from them, but it's rare. They tend not to retain staff who are true customer advocates, because such people highlight the shortcomings rather than mask them.

Message 82 of 102
sasha.crotty
in reply to: Anonymous

Gentlemen - working on it. The investigation is still ongoing so my previous statement still stands. That said, I am working on an update on the current progress and outcomes so far. Please look for it here soon.

 

Thanks,

Sasha

Sasha Crotty
Director, AEC Design Data

Message 83 of 102
Anonymous
in reply to: sasha.crotty

@sasha.crotty @Kevin.Short Sasha, I hope you had a nice long Memorial Day weekend.

 

When, oh when, can we get an update on the current progress and outcomes so far? You said "soon" on Friday and I was looking for something on the order of hours...not days.

 

We have European and Asian directors who were expecting an update yesterday. 

 

By my count it's been 5 days since your last update and 15 LONG days since the outage, and we've heard barely a peep. Can anyone give us a decent explanation of what happened, what caused the outage, and whether our data is safe?

 

This is NOT a way to run a cloud business.

Message 84 of 102
Anonymous
in reply to: Anonymous

DOWN YET AGAIN

Message 85 of 102
Anonymous
in reply to: Kevin.Short

No service here in the UK either.

 

You have had 3 weeks to investigate the issue, and in that time it has happened again!!

 

Quick to take the money, but not so quick to provide what you have charged for!!

Message 86 of 102
sandip.more
in reply to: Anonymous

Oh, another C4R issue?

 

Can you please give us a timeline? Our team is about to leave for the day and is waiting for the issue to be resolved.

 

Best Regards,

 

Sandip

Message 87 of 102

This is completely unacceptable for a mission-critical service...

Message 88 of 102

Please tell us what kind of problem has occurred and what measures will be taken.

Everyone wants to know.

 

OBAYASHI Corp.

Hiroshi Mori

Message 89 of 102

We're also having issues here. This is getting out of hand.

If I understand correctly, no one can access any C4R cloud model or sync their model?

Message 90 of 102
Anonymous
in reply to: gsmitVHYZS


@gsmitVHYZS wrote:

We're also having issues here. This is getting out of hand.

If I understand correctly, no one can access any C4R cloud model or sync their model?


That is correct

Message 91 of 102
Anonymous
in reply to: gsmitVHYZS

No, and the Desktop Connector app that was suggested doesn't work correctly either. When you open your model through it, you are no longer connected to the cloud-hosted model but to a local copy, which doesn't sync your changes to the main cloud model at all; it just splinters off a clone version.

Message 92 of 102
Anonymous
in reply to: Anonymous


@Anonymous wrote:

@gsmitVHYZS wrote:

We're also having issues here. This is getting out of hand.

If I understand correctly, no one can access any C4R cloud model or sync their model?


That is correct


Scrap that. It's working again.... 

Message 93 of 102
Anonymous
in reply to: Anonymous


@Anonymous wrote:

No, and the Desktop Connector app that was suggested doesn't work correctly either. When you open your model through it, you are no longer connected to the cloud-hosted model but to a local copy, which doesn't sync your changes to the main cloud model at all; it just splinters off a clone version.

But when you reopen the central model, doesn't it tell you to sync your changes from your local file? That's the way I understood it to work.

Message 94 of 102
sandip.more
in reply to: Anonymous

Hi Sasha,

 

One quick question: some of our users got a recovery error and their files were closed. What would be the best way to save these files back to central once the issue is resolved? We know the standard process, but given the frequent BIM 360 failures we want to confirm the procedure for every situation with you and hear your suggestions.

 

Best Regards,

 

Sandip

Message 95 of 102
chris.kershaw
in reply to: Anonymous

I can't believe this, but I've just been in contact with our reseller and they have informed us that Autodesk has NOT put any kind of uptime guarantee in the T&Cs, as most online services do, so they are not in breach of their contract when the service is down! Surely that must change.

Message 96 of 102

Only if people stop using it in protest and don't renew their contracts... like I have done...
Message 97 of 102
sasha.crotty
in reply to: sasha.crotty

Thank you everyone for your patience as we investigated the 5-hour Revit Cloud Worksharing outage on Monday the 14th of May. Let me begin by extending my apologies for any impact the downtime may have had on your business. We understand the importance of this service to your work, and our team continues its efforts to make service disruptions less likely in the future. While our investigation into this incident continues, you are right that it is also time for an update. We have learned a lot over the past few weeks, so here goes.

 

The Outage

[Figure 1: Model operations on May 14th, 2018]

 

[Figure 2: Service node health during the outage]

 

Revit Cloud Worksharing has extensive monitoring and health checks built in, so our teams were notified almost immediately when operations began to fail (Figure 1). Autodesk team members responded within minutes and began to execute standard procedures intended to restore service as quickly as possible. As you can see by the fluctuating number of healthy service nodes in Figure 2 above, multiple attempts were made to restore service during the outage including adding new nodes to the service, running two versions of the service in parallel, and a manual restart. Each time, however, the nodes encountered the same condition and were unable to handle traffic. Service was ultimately restored by performing a cold restart using the prior version of the service. In this case, “prior version” refers to the version of the software, not the data stored in the service.
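
To make the monitoring picture above concrete, here is a minimal sketch of the kind of node health polling such a service might use. The node URLs and helper names are hypothetical illustrations, not Autodesk's actual tooling.

```python
# A minimal sketch of node health polling, for illustration only; the
# endpoints and helper names here are hypothetical, not Autodesk's stack.
import time
import urllib.request

SERVICE_NODES = [
    "https://node-1.example.com/health",  # hypothetical health endpoints
    "https://node-2.example.com/health",
]

def check_health(url, timeout=2.0):
    """Return True if the node answers its health endpoint with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def monitor(interval_seconds=30.0):
    """Poll every node and raise an alert the moment the healthy count drops."""
    while True:
        healthy = sum(check_health(url) for url in SERVICE_NODES)
        if healthy < len(SERVICE_NODES):
            # In a real service this would page the on-call engineer.
            print("ALERT: only {}/{} nodes healthy".format(healthy, len(SERVICE_NODES)))
        time.sleep(interval_seconds)
```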

 

A Note on Our Investigation Process
When an incident occurs, our engineering team conducts a retrospective within 24 hours to discuss potential root causes and identify areas for further analysis. The subsequent iterative process of investigation is aimed at objectively and exhaustively uncovering the technological, procedural, and human factors that created the conditions leading to the incident. Once these are understood, the team identifies corrective and preventive measures. As part of the process, the team is expected to prioritize this investigation and its resultant action items over other work.

 

Root Cause Investigation
After service was restored, the team began the task of identifying the cause. Two primary areas of investigation were identified: a deadlock/race condition triggered by an unexpected sequence of events in multi-threaded code, and an issue with changes made to support improved caching. After review, the changes to support caching were ruled out as a cause. The team then continued the investigation into the deadlock along two paths: reviewing the code to find possible areas of deadlock, and attempting to reproduce the issue in our testing environments. On reviewing the code, the team identified an area that is possibly prone to a deadlock state, and a fix for it has been written. Unfortunately, replicating the failure in the testing environment has proven more difficult, and our team continues with this part of the investigation. As mentioned previously, the investigation and resolution of this issue remains our team's top priority.
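
For readers who want a concrete picture of this failure class, the sketch below shows the classic lock-ordering deadlock in generic multi-threaded code, together with the standard fix. It is purely illustrative and is not the Revit Cloud Worksharing code, which has not been published.

```python
# A generic illustration of a lock-ordering deadlock and its standard fix.
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()

def worker_1():
    with lock_a:        # acquires A first ...
        with lock_b:    # ... then B
            pass        # do work while holding both locks

def worker_2_buggy():
    with lock_b:        # acquires B first ...
        with lock_a:    # ... then A: can deadlock against worker_1 if each
            pass        # thread holds one lock and waits for the other

def worker_2_fixed():
    # Standard fix: every thread acquires locks in one global order,
    # so no cycle of threads waiting on each other can form.
    with lock_a:
        with lock_b:
            pass
```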

 

What’s Changing
We learned a lot about our service and our processes because of this incident and we are applying that knowledge to our standard practices going forward. This includes:

  • Adjusting our practices to perform a cold restart earlier in the case of a full outage. This will help reduce outage times should we encounter a significant event in the future.
  • Improving our health dashboard update process. As noted in this thread, it took a while for the health dashboard to reflect the outage state. While there are good reasons why the dashboard update is manual, we want to make sure that it reflects service health as accurately as possible. Therefore, we are introducing an automated paging system for the customer communications group, which is responsible for updating the health dashboard and forums during planned maintenance and unexpected events. This means we have one team focused on restoring service and a separate group posting updates, ensuring that both parts of the response happen as efficiently as possible.
  • Adjusting the static code analyzer configuration to detect potential deadlock bugs before they ever go into production. Leading up to this incident, it appears that the analyzer’s configuration caused at least one check to not run when expected. We have made the necessary changes to ensure all expected checks run on future code submissions.
  • Enhancing our investments into resiliency and scalability of the service so that service disruptions are less likely in the future. These improvements will help provide a level of availability that you can trust to deliver your projects.

Thanks for taking the time to read this post and thank you for your continued commitment to Autodesk products and services.

 

Sasha Crotty
Director, AEC Design Data

Message 98 of 102

Hi,

 

I appreciate your recovery efforts. I also think it is very good to publish detailed reports like this.

Even when problems occur, sincere communication like this allows us to keep trusting you.

If I may ask for one further improvement: we don't want service disruptions to be "less likely" in the future; we want them to never happen.

 

OBAYASHI Corp.

Hiroshi Mori

 

Message 99 of 102
neilltupman
in reply to: sasha.crotty

Hello Sasha - first, thank you for the detailed response.

 

However, I am still frustrated that it has taken ADSK 3 weeks to post a response/update. As paying customers for a service that has failed and caused significant distress, frustration, loss of revenue, and missed deadlines for numerous users, I would expect more. I realise that you still haven't nailed down the root cause and so cannot publish a full report; I do, however, think that ADSK should have provided regular updates (at a bare minimum, weekly) on progress.

 

There are also some issues raised in this thread that have not yet been addressed; a 'work offline' facility for C4R projects and reimbursement for loss of revenue are the ones that spring to mind. I am, however, pleased that you have chosen to address the updating of the health dashboard; one of the unknowns on May 14th was whether the failure was in our local network infrastructure or yours. Eliminating this uncertainty will be helpful.

 

I do wonder whether this response would have been posted if we hadn't had the additional minor outage on Friday 1st June - I hope it would have been. Please ensure this, and future updates, are not left buried in this thread - can it at least be posted to the history tab of the health dashboard for this service?

 

Cloud-based worksharing is still fairly new and evolving in my company, and I am one of the guys singing its praises, so when significant outages like this happen I look like a fool to those holding the purse-strings, and confidence is massively undermined. It also makes me wonder whether we're making the right decision in moving to cloud-based collaboration at this time - should we wait until the service is more stable?

 

Thanks again for the update.

Message 100 of 102
Chad-Smith
in reply to: sasha.crotty

Thanks Sasha for the detailed reply.

From recent discussions I've had with others there is one more improvement which I (and on behalf of those others) would like to suggest.

Scheduled maintenance is only visible (as far as I know) in two locations: the BIM 360 Docs UI and the Health Dashboard. But neither of these locations is where full-time Revit (Design) users spend their day.

So, the recommendation is to give scheduled maintenance better visibility from within Revit itself - for example, as a prompt on the first load of a Revit session, and/or in the Communication Center.

The intent is not to alarm the entire design team, but to provide complete visibility of these events to those who are impacted for better planning of work around it.
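
Purely to illustrate the suggestion, here is a minimal sketch of the kind of check a session-start prompt could make. The feed URL and JSON fields are invented for this sketch; as far as I know, Autodesk does not publish such an endpoint today.

```python
# Illustrative only: poll a hypothetical scheduled-maintenance feed at
# session start. The URL and JSON fields below are invented for this sketch.
import json
import urllib.request

MAINTENANCE_FEED = "https://status.example.com/maintenance.json"  # hypothetical

def upcoming_maintenance():
    """Return human-readable notices for any scheduled maintenance windows."""
    try:
        with urllib.request.urlopen(MAINTENANCE_FEED, timeout=3) as resp:
            events = json.load(resp)
    except OSError:
        return []  # never block the user just because the feed is unreachable
    return ["{}: {} to {}".format(e["service"], e["start"], e["end"])
            for e in events]

# A session-start prompt would display these notices instead of printing them.
for notice in upcoming_maintenance():
    print("Scheduled maintenance -", notice)
```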

Thanks.
