All,
As many of you actively using the C4R service are aware, we are currently experiencing an unplanned downtime on the service. Our development team is investigating this matter with extreme focus. All hands on deck.
We apologize for the impact this matter has had on your production activities, and will use this thread as the centralized discussion on the matter. The Health Dashboard will remain the master location for service status, but let's discuss here as necessary.
-Kyle
Hi Kyle,
Our team was able to connect back to C4R and sync models starting about half an hour ago. Is the service back up, or is this an anomaly? I just want to make sure so as to not lose any work.
Thanks.
Yes, @anshuman.raje we have been bringing updated service nodes back online over the past hour. Our failure rates are dropping dramatically as a result, and we are close to calling things "back to normal". We're not at a place yet where we can do that, but extremely close.
-Kyle
Not sure if this will help, but here is the log I experienced:
Revit Version: Revit 2015
Time Zone: Mountain Time
Sync attempts by time:
Sync Failed: 3:14 p.m. MT
Sync completed: 3:16 p.m. MT
Sync Failed: 3:25 p.m. MT
Sync Failed: 3:25 p.m. MT
Sync completed: 4:03 p.m. MT
Sync Failed: 4:13 p.m. MT
Sync Failed: 4:13 p.m. MT
Sync Failed: 4:16 p.m. MT
Sync Failed: 4:17 p.m. MT
Sync Failed: 4:23 p.m. MT
Sync Failed: 4:26 p.m. MT
Sync completed: 4:39 p.m. MT
Sync Failed: 4:42 p.m. MT
Sync Failed: 4:45 p.m. MT
Sync Failed: 4:46 p.m. MT
Sync Failed: 4:47 p.m. MT
Sync completed: 5:07 p.m. MT
Sync completed: 5:17 p.m. MT
Sync completed: 5:32 p.m. MT
Working after 4:47 p.m. MT
Tried opening and closing and various other things, but access seemed sporadic.
MK
That timeframe is consistent with the measured analytics on our end.
-Kyle
UPDATE - 8:43pm EST
The C4R service has been restored to its previous operating state. Reliability of cloud worksharing operations, as measured by our service analytics, is back to normal. You and your teams can confidently resume production operations.
What Happened
As many of you are likely aware, our recent C4R update (2015.6 / 2016.1) introduced a regression in element borrowing performance. As you would expect, our US and Shanghai development teams have been working around the clock to identify the root cause. Today's unplanned downtime was a result of those efforts: our attempt to produce more verbose logging of service-side operations resulted in widespread service disruption when it hit our production environment. This behavior was not seen on our non-production deployments of the service. Service was restored when we spun up new service nodes that did not contain the problematic logging, and tore down the bad ones.
What's Next
Now, it's easy to simply say "this won't happen again", but you all deserve a more detailed explanation than that. The team and I will be coming together over the next couple of days to identify the concrete steps we need to take. So long as the details of those steps do not pose a security threat to the service, we'll detail them on this thread. Stay tuned.
We understand that confidence and trust are key factors for any project team considering, or continuing to use, the Cloud Worksharing approach delivered by C4R. It is on us as a company to ensure these factors are reinforced by continued reliability and performance of the service, among many other characteristics. Across the many teams responsible for designing, coding, testing, deploying, documenting, localizing, supporting, and selling the service, we are committed to instilling that trust and confidence. Today the team responsible for the operation and reliability of the service - of which I am part - fell short of that commitment.
We look forward to regaining what trust was lost today, and delivering on the promise of the game-changing collaboration that C4R provides.
Kyle + the C4R Product Team
Thanks for the update Kyle. Should we expect any issues syncing files that failed this afternoon? We have a group of users in one studio that had numerous failures during the outage. We have taken copies of their collaboration caches as a backup but want to know what to expect when they return to work in the morning.
We will have a team meeting at 7:30am MT and plan to prioritize who should sync first based on amount and type of changes to their models. Should we take any other action before the team starts to work again?
Thanks! ~Bruce
@bmccallum_dialog wrote:
Thanks for the update Kyle. Should we expect any issues syncing files that failed this afternoon? We have a group of users in one studio that had numerous failures during the outage. We have taken copies of their collaboration caches as a backup but want to know what to expect when they return to work in the morning.
We will have a team meeting at 7:30am MT and plan to prioritize who should sync first based on amount and type of changes to their models. Should we take any other action before the team starts to work again?
Thanks! ~Bruce
Nope, you should not expect issues. There was code deployed throughout the downtime that is no longer deployed. There is no data or technical thought process within the product team that indicates a need for C4R project teams to do anything but resume work at this point.
-Kyle
Ok, noted. We'll still be watching to be sure; with documents to be issued for a deadline on Friday we are particularly sensitive to any loss of data. ~B
@bmccallum_dialog wrote:
Ok, noted. We'll still be watching to be sure; with documents to be issued for a deadline on Friday we are particularly sensitive to any loss of data. ~B
Certainly let us know if you see instabilities, but as I said previously, we have no reason to recommend modified workflows.
-Kyle
What's your plan if/when the system goes down again?
We can't afford to lose 4 hours of work like we did yesterday.
As I mentioned in my previous post, we are going through some structured retrospective processes at the moment and will be posting the concrete steps on this thread, likely tomorrow.
Independent of the solutions, the key areas for improvement were:
We'll provide some detail on the identified solutions in short order.
-Kyle
All,
After a structured retrospective process, I wanted to communicate the changes we're making internally as a result of this week's unplanned downtime:
As I said in my previous post, it's on us as a product team to regain what trust and confidence was lost this week as a result of our unplanned downtime. All we can do is be transparent about our efforts, stay focused in the right areas, and solicit market feedback to keep that focus in the right place.
-Kyle + the C4R Product Team
One additional update: we have restored the status of the C4R service to "Operational". This was done based on the element borrowing performance trend over the past 2 days of service operation. Check out the data:
You'll notice that the spikes from previous days under load are now completely gone, and we have consistent element borrowing performance, even under load.
-Kyle
Kyle,
We are still experiencing this intermittently across the board on all projects and all disciplines (ARCH, MEP, STRUCTURAL). Any update on a permanent or semi-permanent fix?