All,
As many of you actively using the C4R service are aware, we are currently experiencing an unplanned downtime on the service. Our development team is investigating this matter with extreme focus. All hands on deck.
We apologize for the impact this matter has had on your production activities, and will use this thread as the centralized discussion on the matter. The Health Dashboard will remain the master location for service status, but let's discuss here as necessary.
-Kyle
Hi Kyle,
Our team was able to connect back to C4R and sync models starting about half an hour ago. Is the service back up, or is this an anomaly? I just want to make sure so as to not lose any work.
Thanks.
Yes, @anshuman.raje we have been bringing updated service nodes back online over the past hour. Our failure rates are dropping dramatically as a result, and we are close to calling things "back to normal". We're not at a place yet where we can do that, but extremely close.
-Kyle
Not sure if this will help, but here is the log I experienced:
Revit Version: Revit 2015
Time Zone: Mountain Time
Sync attempts by time:
Sync Failed: 3:14 p.m. MT
Sync completed: 3:16 p.m. MT
Sync Failed: 3:25 p.m. MT
Sync Failed: 3:25 p.m. MT
Sync completed: 4:03 p.m. MT
Sync Failed: 4:13 p.m. MT
Sync Failed: 4:13 p.m. MT
Sync Failed: 4:16 p.m. MT
Sync Failed: 4:17 p.m. MT
Sync Failed: 4:23 p.m. MT
Sync Failed: 4:26 p.m. MT
Sync completed: 4:39 p.m. MT
Sync Failed: 4:42 p.m. MT
Sync Failed: 4:45 p.m. MT
Sync Failed: 4:46 p.m. MT
Sync Failed: 4:47 p.m. MT
Sync completed: 5:07 p.m. MT
Sync completed: 5:17 p.m. MT
Sync completed: 5:32 p.m. MT
Working after 4:47 p.m. MT
Tried opening and closing and various other things, but access seemed sporadic.
MK
That timeframe is consistent with the measured analytics on our end.
-Kyle
UPDATE - 8:43pm EST
The C4R service has been restored to its previous operating state. Reliability of cloud worksharing operations, as measured by our service analytics, is back to normal. You and your teams can confidently resume production operations.
What Happened
As many of you are likely aware, our recent C4R update (2015.6 / 2016.1) introduced a regression in element borrowing performance. As you would expect, our US and Shanghai development teams have been working around the clock to identify the root cause. Today's unplanned downtime was a result of those efforts: our attempt to produce more verbose logging of service-side operations resulted in widespread service disruption when it hit our production environment. This behavior was not seen on our non-production deployments of the service. Service was restored when we spun up new service nodes that did not contain the problematic logging, and tore down the bad ones.
What's Next
Now, it's easy to simply say "this won't happen again", but you all deserve a more detailed explanation than that. The team and I will be coming together over the next couple of days to identify the concrete steps we need to take. So long as the details of those steps do not pose a security threat to the service, we'll detail them on this thread. Stay tuned.
We understand that confidence and trust are key factors for any project team considering, or continuing to use, the Cloud Worksharing approach delivered by C4R. It is on us as a company to ensure these factors are reinforced by continued reliability and performance of the service, among many other characteristics. Across the many teams responsible for designing, coding, testing, deploying, documenting, localizing, supporting, and selling the service, we are committed to instilling that trust and confidence. Today the team responsible for the operation and reliability of the service - of which I am part - fell short of that commitment.
We look forward to regaining what trust was lost today, and delivering on the promise of the game-changing collaboration that C4R provides.
Kyle + the C4R Product Team
Thanks for the update Kyle. Should we expect any issues syncing files that failed this afternoon? We have a group of users in one studio that had numerous failures during the outage. We have taken copies of their collaboration caches as a backup but want to know what to expect when they return to work in the morning.
We will have a team meeting at 7:30am MT and plan to prioritize who should sync first based on amount and type of changes to their models. Should we take any other action before the team starts to work again?
Thanks! ~Bruce
@bmccallum_dialog wrote:
Thanks for the update Kyle. Should we expect any issues syncing files that failed this afternoon? We have a group of users in one studio that had numerous failures during the outage. We have taken copies of their collaboration caches as a backup but want to know what to expect when they return to work in the morning.
We will have a team meeting at 7:30am MT and plan to prioritize who should sync first based on amount and type of changes to their models. Should we take any other action before the team starts to work again?
Thanks! ~Bruce
Nope, you should not expect issues. There was code deployed throughout the downtime that is no longer deployed. There is no data or technical thought process within the product team that indicates a need for C4R project teams to do anything but resume work at this point.
-Kyle
Ok, noted. We'll still be watching to be sure; with documents to be issued for a deadline on Friday we are particularly sensitive to any loss of data. ~B
@bmccallum_dialog wrote:
Ok, noted. We'll still be watching to be sure; with documents to be issued for a deadline on Friday we are particularly sensitive to any loss of data. ~B
Certainly let us know if you see instabilities, but as I said previously, we have no reason to recommend modified workflows.
-Kyle
What's your plan if/when the system goes down again?
We can't afford to lose 4 hours of work like we did yesterday.
As I mentioned in my previous post, we are going through some structured retrospective processes at the moment and will be posting the concrete steps on this thread, likely tomorrow.
Independent of the solutions, the key areas for improvement were:
We'll provide some detail on the identified solutions in short order.
-Kyle
All,
After a structured retrospective process, I wanted to communicate the changes we're making internally as a result of this week's unplanned downtime:
As I said in my previous post, it's on us as a product team to regain what trust and confidence was lost this week as a result of our unplanned downtime. All we can do is be transparent about our efforts, stay focused in the right areas, and solicit market feedback to keep that focus in the right place.
-Kyle + the C4R Product Team
One additional update: we have restored the status of the C4R service to "Operational". This was done based on the element borrowing performance trend over the past 2 days of service operation. Check out the data:
You'll notice that the spikes from previous days under load are now completely gone, and we have consistent element borrowing performance, even under load.
-Kyle
Kyle,
We are still experiencing this intermittently across the board on all projects and all disciplines (ARCH, MEP, STRUCTURAL). Any update on a permanent or semi-permanent fix?