Revit Cloud Worksharing Outage & Action Summary for Recent Incidents

sasha.crotty · 10-26-2018

Over the past two weeks Revit Cloud Worksharing has experienced multiple outages. Below is a summary explaining the causes of these incidents and actions we are taking to improve the resiliency of the service going forward.

On Wednesday, October 17, 2018, Revit Cloud Worksharing experienced an issue followed by an outage. Our investigation determined that the two incidents were unrelated, despite the timing. The issue began just after 11 am EDT and lasted approximately 2.5 hours. It affected a small subset of customers who attempted to initiate new models to BIM 360 Document Management; most operations continued to succeed. The issue was resolved by adjusting access limits to core services. Just before 2:30 pm EDT, Revit Cloud Worksharing experienced an outage affecting all operations. The outage was caused by network connectivity problems in the service infrastructure that resulted in connection problems between various parts of Revit Cloud Worksharing. The fix applied to resolve the prior issue was rolled back in case the two incidents were connected. Service restoration was then further delayed by continuing network issues. Full functionality was restored approximately 1.5 hours later.

On Monday, October 22, 2018, shortly before 11 am EDT, the service experienced another outage caused by an issue in an underlying core service. The issue was traced to an Autodesk team running an automated tool for large-scale user migration between Autodesk applications, which greatly increased service load. Revit Cloud Worksharing service was restored via a restart approximately 30 minutes later.

On Wednesday, October 24, 2018, just after 9 am EDT, Revit Cloud Worksharing experienced an additional outage caused by resource exhaustion in the service, triggered by delayed response times in a downstream dependency. Because an update had been deployed on Tuesday, the service was rolled back to an earlier state out of caution. However, after initial investigation we do not believe the updated deployment contributed to the root cause of the incident. Service was restored via a restart approximately 30 minutes after the incident began.

Although the incidents were disruptive and had different root causes, they have given the team new insights into how the service operates under unexpected conditions. As part of the rigorous Incident to Improvement (I2I) process used by our teams, team members must discuss and propose changes to the service that address the issues surfaced by incidents. Improvements identified during each I2I become the team's top priority.

As a result of the investigations into the incidents, and particularly the incident on Wednesday, October 24, we have identified one preexisting bug that may have contributed to one or more of the outages and that has only come to light as the service has grown. A fix for this bug has been submitted and is undergoing testing.

The team's investigation has also reinforced the need for a project that is already on the team's immediate roadmap and is intended to reduce the impact of similar conditions. The project will help the service shed load when core services are slow to respond. Because Revit automatically retries operations when it cannot immediately connect to the server, you will see this as longer operation times, rather than failures, until the core service degradation is resolved. This project is the team's top priority for further development work.
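
For readers curious about the general pattern, the interaction between load shedding and client retries can be sketched as below. This is a minimal illustration only, not Autodesk's implementation: the names `LoadShedder` and `call_with_retry`, the in-flight limit, and the backoff values are all hypothetical.

```python
import threading
import time


class LoadShedder:
    """Hypothetical server-side guard: admit a bounded number of
    concurrent requests and reject the rest instead of queueing them."""

    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def try_handle(self, work):
        # Non-blocking acquire: when every slot is busy (for example,
        # because a dependency is slow), shed the request immediately.
        if not self._slots.acquire(blocking=False):
            return False, None
        try:
            return True, work()
        finally:
            self._slots.release()


def call_with_retry(shedder, work, attempts=5, base_delay=0.01):
    """Hypothetical client side: retry shed requests with exponential
    backoff. Returns (result, attempts_used)."""
    for attempt in range(1, attempts + 1):
        accepted, result = shedder.try_handle(work)
        if accepted:
            return result, attempt
        # Wait longer after each rejection before trying again.
        time.sleep(base_delay * 2 ** (attempt - 1))
    raise TimeoutError("service still shedding load after retries")
```

In this sketch a rejected request costs the service almost nothing, so it can keep serving admitted work while a dependency is slow. The caller only sees a longer total operation time, unless the degradation outlasts all retries.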

These recent outages are not the experience we want you to have with our services. We are doing everything in our power to prevent future disruptions, and we sincerely apologize for any interruptions these incidents may have caused to your work. Thank you for your patience this week. We greatly appreciate your continued use of Autodesk services.

Sasha Crotty
Director, AEC Design Data

Anonymous · 10-26-2018

Thanks for the detailed writeup, Sasha.

Will customers be offered a partial refund for the work interruptions?

ckuhn8DAJL · 10-26-2018

Thanks for clarifying. We have 4 sites with over 200 users and we are still experiencing interruptions even today, 10/26/18. How are we to be compensated for this downtime? There is no way to calculate our exact losses, but when adding up the hours + the services we provide to our clients... it's substantial.
