Revit Cloud Worksharing Forum
Welcome to Autodesk’s Revit Cloud Worksharing Forums. Share your knowledge, ask questions, and explore popular Revit Cloud Worksharing topics.

Revit Cloud Worksharing (C4R) Operations Unavailable - May 14, 2018

SOLVED
Message 1 of 102
Kevin.Short
8529 Views, 101 Replies

The Revit Cloud Worksharing (C4R) service is currently unavailable. We are aware of the problem and are working hard to resolve the issue as quickly as possible. We will post updates as we know more.

 



Kevin Short
Senior Product Owner
Message 61 of 102

Kevin, any word as to the cause of the outage? I'm sure inquiring minds want to know.

 

Mike

Message 62 of 102
Anonymous
in reply to: mmaloneyVHABU

@sasha.crotty @Kevin.Short @Anonymous @AdamPeter @Zsolt.Varga @Viveka_CD @Ian.Ceicys Autodesk team, it's been 2 days since the last major update on Monday's outage.

 

What more information can you share and what is the timeline for your disclosures?

 

You need to be INCREDIBLY thorough and produce a detailed accounting of all of the steps and actions that you spent 5 hours doing.

 

Every hour that passes without a full explanation is an hour where your credibility is further eroded and my stakeholders determine Autodesk can't run cloud services. AIDI

 

Here's what I NEED, and I NEEDED this yesterday:

Why did the outage happen? 

Why was the outage as long as it was?

Is our data in jeopardy? Was this an external hack?

How sure are you that we won't go down for the next 6 months? 

 

Can you answer the 5 WHYs?

 

Here's what your peers post in a timely fashion with their outages, do likewise: 

  1. Clever
  2. Fortnite
  3. British Airways
  4. POSTMORTEM OF SERVICE OUTAGE AT 3.4M CCU
  5. day-one-outage-postmortem
  6. travisci

 

Message 63 of 102

On behalf of Autodesk, let me apologize for the Revit Cloud Worksharing outage experienced by our customers on Monday May 14. We understand how critical this service is to getting your work done. While our team did everything in their power to restore service as quickly as possible, we know that even one minute of downtime is too much. The development team continues to investigate the incident to determine the underlying root cause (initial investigation indicates a race condition) and we will provide a summary of that investigation as soon as it is available. In the meantime, we have reverted the service to an earlier version to ensure that we do not encounter the issue again. As with any incident, there is a lot of opportunity for improvement. Two concrete actions that the team will be taking are improvements to our autorecovery mechanisms and ensuring we minimize downtimes in case of failure. These improvements represent only a part of the continued investment to improve the resiliency of Revit Cloud Worksharing.

 

As a reminder, the health.autodesk.com dashboard provides a real-time view into the status of our services, and you can subscribe to receive alerts of scheduled maintenance or outages.

 

Thank you for your commitment to Autodesk products and services.

Sasha Crotty
Director, AEC Design Data

Message 64 of 102
Anonymous
in reply to: sasha.crotty

Sasha,

 

Perhaps what might help is a better explanation of the current state. I've seen the same thing happen w/IT at various companies. Expectations regarding backup, disaster recovery, and failover typically differ between users and IT staff.

 

What is the status of BIM360 Design (C4R)? Is there a backup? Is there load balancing and failover? Why does the system come down during maintenance? Can't a system simply be isolated, updated, brought back online, and worked through the server pool? Certainly this isn't all running on a single server. Do you have any AI/ML to determine when my projects are being accessed and focus upgrades/patching outside my normal working hours?

 

Netflix operates in a brutal environment w/Chaos Monkey running on their system.

Microsoft's services and systems fail frequently but seem a lot less "global" in outages, and tend to be less impactful to production. 

 

Why aren't Autodesk's systems built with redundancy and resiliency? I'd love to see a high-level architecture map of how Autodesk proposes, now or in the future, to keep customer operations running in light of the inevitable issues that arise.
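
To make the "isolate, update, bring back online" idea concrete, here is a minimal sketch of a rolling update over a server pool. It is a generic illustration only, not a description of how Autodesk's service is built; `Server`, `rolling_update`, and `health_check` are hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    version: str
    in_pool: bool = True  # True while the load balancer routes traffic here


def rolling_update(servers, new_version, health_check):
    """Update one server at a time so the rest of the pool keeps serving.

    Returns the lowest number of servers that were serving at any point.
    """
    min_serving = len(servers)
    for server in servers:
        server.in_pool = False  # isolate: stop routing traffic to this server
        serving = sum(s.in_pool for s in servers)
        min_serving = min(min_serving, serving)

        old_version = server.version
        server.version = new_version  # update the isolated server
        if not health_check(server):
            server.version = old_version  # roll back this server only
            server.in_pool = True
            raise RuntimeError(f"{server.name} failed its health check; update halted")

        server.in_pool = True  # return the updated server to the pool
    return min_serving


# Hypothetical three-node pool; the health check always passes in this sketch.
pool = [Server(f"node-{i}", "1.0") for i in range(3)]
lowest = rolling_update(pool, "1.1", health_check=lambda s: True)
```

With three nodes, at least two remain in the pool throughout the update, which is the property the question above is asking about.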

Message 65 of 102
Anonymous
in reply to: sasha.crotty


@sasha.crotty wrote:

 The development team continues to investigate the incident to determine the underlying root cause (initial investigation indicates a race condition) and we will provide a summary of that investigation as soon as it is available.


@Anonymous A summary isn't enough...not in the slightest. Please be transparent and post the full root cause. If there's a line of code that can be identified, get to that level of specificity.

 

If this was a "race condition" - here are the 5 whys I NEED to know:

1) Why wasn't this escaped defect caught by testing (manual / automated)?

2) Why didn't monitoring pick this up and recover gracefully?

3) Why did it take 5 hours to revert the service?

4) Why did the issue cause such an EPIC failure?

5) Why was the choice to "revert" the service selected as the "fix", and does that revert cause any data loss or performance impact?
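
For anyone unfamiliar with the term, a race condition is a bug where the result depends on the timing of concurrent operations. The sketch below is a generic Python illustration of one common form, a lost update; it says nothing about the actual defect in the Revit Cloud Worksharing service, and every name in it is hypothetical. A barrier is used only to force the bad interleaving so the example behaves the same on every run.

```python
import threading


class Counter:
    def __init__(self):
        self.value = 0
        self.lock = threading.Lock()


def unsafe_increment(counter, barrier):
    current = counter.value      # read the shared value
    barrier.wait()               # force both threads to read before either writes
    counter.value = current + 1  # write based on a stale read: one update is lost


def safe_increment(counter):
    with counter.lock:           # the lock makes read-modify-write atomic
        counter.value += 1


def run_two_threads(worker, *args):
    threads = [threading.Thread(target=worker, args=args) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


unsafe = Counter()
run_two_threads(unsafe_increment, unsafe, threading.Barrier(2))

safe = Counter()
run_two_threads(safe_increment, safe)

print(unsafe.value)  # 1: two increments ran, but one overwrote the other
print(safe.value)    # 2
```

In real systems the interleaving is not forced, so the bug shows up only occasionally, which is part of why such defects can slip past testing.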

Message 66 of 102
MichaelRuehr
in reply to: sasha.crotty

Hi Sasha

I was most probably one of the first people alerted to problems before the service went down completely.

The health.autodesk.com dashboard is kind of OK, but when I looked it showed that ALL was fine.

It took me some time to find a service e-mail link to send an alert. Autodesk has an annoying habit of putting them amongst the small print.

What is missing is a real-time feedback button I can use to alert your engineers.

It may be a generational thing, but I just don't think a forum like this is a proper channel for dealing with outages that cost your clients $10,000+ per hour of downtime.

I am sure you all do and did your best... but it does not look very professional.

 

Message 67 of 102

I estimated about 30 minutes of outage before the health dashboard recognised it. In that 30 minutes, I was trying to determine if the problem was at our end or the cloud server.

I appreciate the health dashboard, but it needs to be quicker.

Message 68 of 102

down again?

Message 69 of 102

Is C4R down? It is not saving files now. Also, when trying to log in, it gives the warning "You are not a member of any projects." The health dashboard is showing normal.

What is happening there? We have deadlines. It's very difficult to manage workload this way.

 

Best Regards,

 

Sandip

Message 70 of 102
Hanez_G
in reply to: Kevin.Short

C4R and BIM360 Hub are down again in Singapore and Australia... where's Kevin???

Message 71 of 102
Anonymous
in reply to: Kevin.Short

Twice in a week?

Come on, Autodesk, this is ridiculous!

Message 72 of 102
audi.capellan
in reply to: Anonymous

Wow. Has anyone thought about solutions yet? 

Message 73 of 102

That's it, Autodesk! I'm already considering pulling my project off the cloud platform and going back to traditional sharing with our counterparts. A service this unreliable is bad for business.

Message 74 of 102
Chad-Smith
in reply to: MichaelRuehr

 

I definitely agree with this

@MichaelRuehr wrote:

What is missing is a real-time feedback button I can use to alert your engineers.

and this, because the Dashboard isn't real-time enough.

@adrian_worboys wrote:

I estimated about 30 minutes of outage before the health dashboard recognised it. In that 30 minutes, I was trying to determine if the problem was at our end or the cloud server.

I appreciate the health dashboard, but it needs to be quicker.

Message 75 of 102
Anonymous
in reply to: Anonymous

Working fine in the UK

Message 76 of 102
neilltupman
in reply to: Anonymous

All working fine as of 10am GMT here in the UK...

Message 77 of 102
Anonymous
in reply to: adrian_worboys


@adrian_worboys wrote:

I estimated about 30 minutes of outage before the health dashboard recognised it. In that 30 minutes, I was trying to determine if the problem was at our end or the cloud server.

I appreciate the health dashboard, but it needs to be quicker.


I wouldn't make the assumption that the health dashboard "Recognizes" anything. I don't know for certain but I believe it's updated manually.

 

About a year and a half ago, I went through the entire history of all products and cataloged the details of the outages and degraded performance. It was very common to see similar (but not identical) details for the same outage with different time stamps. It was also common to see details of an outage start but never details of the end... the next day, the status was just "green".

 

Multiple products that rely on some of the back end services which failed would have different time stamps regarding the outage and pointing to the root common service failure. 

 

It's all a cobbled-together mess, according to some of the insiders I know. Anything architecture or back end related has been a mess at Autodesk for at least a couple of decades. They only care about the customer-facing UIs, and little attention is paid to the back end. They could really take a few lessons from how Microsoft has approached transitioning to the cloud.

 

 

Message 78 of 102

 

Greetings,

 

I have told you repeatedly that we place great emphasis on C4R and think C4R is excellent technology. If you take pride in being engineers, you will never stop C4R again without prior notice.

And please let us enjoy our Saturday night.

OBAYASHI Corp.

 

Message 79 of 102
Anonymous
in reply to: sasha.crotty

@sasha.crotty @Kevin.Short @KyleBernhardt @AdamPeter @Zsolt.Varga @Viveka_CD @Ian.Ceicys Autodesk team, it's been 9, yes 9 LONG days since the last major update on the outage on Monday 5/14.

 

Has a detailed technical summary of the cause of the outage been posted somewhere?

 

If this was a "race condition" - here are the WHYs I NEED answers to:

1) Why wasn't this escaped defect caught by testing (manual / automated)?

2) Why didn't monitoring pick this up and recover gracefully?

3) Why did it take 5 hours to revert the service (surely you had some "reasons" to wait 5 hours)?

4) Is our data in jeopardy? Was this an external service that had the "race" condition, and did this cause data to be exfiltrated by a third party?

5) Why did the issue cause such an EPIC failure in the time it took to recover?

6) Why was the choice to "revert" the service selected as the "fix", and does that revert cause any data loss or performance impact?

 

You have promised repeatedly to be transparent, and continued silence doesn't inspire any confidence. Again, EVERY hour that passes without a full explanation is an hour where your credibility and your commitment to run a reliable service are further eroded and my stakeholders determine Autodesk can't run cloud services. AIDI

 

At this time, in my (humble) estimation, it's been 221 hours since the last update. WHAT in the world are you and the engineers doing?

 

Here's what your peers post in a timely fashion with their outages, do likewise: 

  1. Clever
  2. Fortnite
  3. British Airways
  4. POSTMORTEM OF SERVICE OUTAGE AT 3.4M CCU
  5. day-one-outage-postmortem
  6. travisci
Message 80 of 102
Anonymous
in reply to: Anonymous

@sasha.crotty @Kevin.Short @KyleBernhardt @AdamPeter @Zsolt.Varga @Viveka_CD @Ian.Ceicys Autodesk team, it's now been 10, yes 10 LONG days since the last major update on the outage on Monday 5/14.

 

When can we expect a summary and not crickets?
