Update - 4-6-15, 10pm EST: We believe all cart issues are now fixed. Servers are back online and backups have been made. You should see normal functionality now. Please let us know if you are having any issues, thanks.
Update - The Object Storage cluster continues to be under heavy load due to replication of objects. SoftLayer has received some advice from SwiftStack on how to tune the cluster to recover faster and is working on implementing those changes. As a result, response times and error rates will increase and there may be availability issues with some stores.
Apr 4, 13:52 CDT
Update - Bigcommerce engineers are applying some new request timeout thresholds to allow Object Storage to take a little longer to fetch an object and hopefully fail less. Object replication is still causing intermittent performance issues and errors retrieving objects but is remaining relatively stable.
Apr 4, 13:03 CDT
Update - SoftLayer are currently observing more load from SoftLayer customer traffic on the object storage platform in Dallas, leading to additional I/O and replication delays that are impacting their ability to stabilise the storage cluster.
SoftLayer are currently implementing rate limiting for write operations across the storage cluster in an attempt to minimise outside writes while the cluster stabilises. This may result in issues across Bigcommerce stores with product imports, and addition/modification of product images - if you experience difficulty saving or updating product images, please try again.
At this time, Bigcommerce engineers are still observing intermittent failures with objects such as template files, product images, and other uploaded content, as well as slow page loads across a number of Bigcommerce stores that do not yet have this content cached. We expect this behaviour to continue for a while.
We will provide another update as soon as we have additional information.
Apr 4, 12:35 CDT
Update - We are continuing to receive updates from SoftLayer regarding the availability and recovery of their object storage service.
From the latest information we have, all storage nodes have been reintroduced into the object storage cluster and are online. As part of the recovery process, a number of storage nodes are currently under high load while the consistency of all objects is checked and data is shuffled around to the correct locations.
Due to the high load placed upon the storage cluster at the moment, SoftLayer continue to advise us that we may see intermittent spikes in load times and page timeouts.
At this point in time, we're continuing to receive information from a number of merchants that their stores are back up and operational. Some stores may still be inaccessible, or missing certain assets such as product images or stylesheets - from what we've observed so far these issues will continue to correct themselves over the coming hours.
Due to the ongoing operations to stabilise the object storage environment, we are going to leave the WebDAV and template editing functionality disabled across Bigcommerce stores. We'll continue to review this decision as the object storage recovers, and as soon as we and SoftLayer are confident the functionality can be enabled again, we will enable it.
We will continue to monitor the recovery process and provide updates as soon as we have additional information.
Apr 4, 10:19 CDT
Update - SoftLayer have just provided us with a detailed update regarding the recovery efforts of their object storage service, which has been impacting the ability to access Bigcommerce stores.
SoftLayer are reporting that since the cluster upgrade which was completed overnight, they are seeing better handling of storage devices, and correct behaviour for full storage nodes and failed devices.
Recovery efforts are still underway to bring the rest of the storage nodes back online. There are approximately 8 storage nodes remaining to come online out of the 52 storage nodes assigned to the cluster. As this process completes, we expect more and more objects to become available to Bigcommerce stores, and these stores to return to normal operations.
SoftLayer have advised us that the load on the object storage cluster is significantly high at the moment due to the replication that the service is undertaking in order to ensure all data is correct, and assigned to the right storage nodes. This process is necessary to get the cluster into a healthy state but may introduce intermittent periods of slowness for Bigcommerce stores that have data which is yet to be repopulated into our cache.
We're continuing to monitor the SoftLayer recovery effort and the impact to Bigcommerce stores and will provide updates as soon as we have them. Almost all functionality is back at premier so you should have no issues with your cart now. Please contact us if you are still experiencing issues, thanks.
Apr 4, 07:53 CDT
Update - Bigcommerce engineers have been very proactive in working with our storage provider, IBM Softlayer, to find solutions. Unfortunately, it takes two parties to come to a solution. In this case, IBM Softlayer intentionally let their Object Storage cluster fall into disrepair and chose not to scale it. This has impacted Bigcommerce, IBM and many other Softlayer customers.
Our engineers placed too much trust in IBM Softlayer and that's on us. However, the catastrophic failures to see metrics and rapidly scale capacity, the decisions to let hard drives sit at 90% utilization for weeks and months, and the cascading failures of an undersized cluster of 52 nodes for the busiest data center in their business speak to IBM Softlayer’s lack of concern for their customers.
We should have pressed more and held their feet to the fire; for that, we are sorry. I'm the head of Technical Operations and this pains me, because of the high uptime and reliability our engineering teams have built in the past year. Unfortunately, our trust in IBM Softlayer was misplaced. They have failed at every level of an operations team, they have failed as a business unit, and they have failed in caring about how their customers are affected.
We care deeply about our customers and have been trying to work around Softlayer’s bad decisions. We are at the point where we feel we need to say, "This isn't us; this isn't how we think about high availability."
We are already planning and working toward how we will move off Softlayer Object Storage and better plan for a single vendor failing, no matter how well regarded it is.
I take this personally. I've crafted highly available solutions at Apple, Digg, Eventbrite and now, Bigcommerce. When the site isn't performing, I get angry. And, like the Hulk, no one wants to see me angry. I am fighting for all of you and will continue to do so.
—Scott Baker, Head of Operations and Site Reliability
Apr 3, 21:51 CDT