Ok.

Not only are we on a bigger set of iron, but we’ve also optimized caching, which should take load off the origin servers. As a point of reference, this afternoon, even after we had moved to the new server set, we were seeing peaks of 50 requests/sec, with ~50% of those resulting in 504s. In the crowdsourced load test we ran tonight, we saw >120 reqs/sec and received no errors. That looked promising.

However, today in the heat of battle (i.e., at 12:00:08) we saw 2017 requests, of which >1400 ended up as 504s. The catch is that if those requests hadn’t 504’d, each one would presumably have triggered requests for the rest of the page, so the real attempted load was actually greater than 2017. We’re not really sure what the theoretical max request rate would have been, but our best guess is around 3,000 reqs/sec.
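For a sense of scale, here is a quick back-of-the-envelope check using only the two figures quoted above. It is a rough sketch, not a measurement: it assumes both numbers cover the same window at 12:00:08, and because the post says ">1400", the 1400 is treated as a lower bound on errors.

```python
# Figures quoted in the post for 12:00:08.
total_requests = 2017
min_errors = 1400  # the post says ">1400", so this is a lower bound

# Share of requests that failed, and the most that could have succeeded.
error_fraction = min_errors / total_requests
max_served = total_requests - min_errors

print(f"at least {error_fraction:.0%} of requests returned 504")
print(f"at most {max_served} requests were served")
```

In other words, roughly seven in ten requests failed at the peak, which is why the follow-on page requests never arrived and the true demand is only a guess.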

As a comparison, this is 50% higher traffic than we saw in 2013 (when we were running on our own iron with pages that were nearly static), and 5000% higher than the rates we saw in 2010 (with the moose cluster).

Ultimately, we’re still not sure the new setup can handle the load, and we’d like more time to chat with our (very responsive) hosting company. Varnish is on the front end now and should be able to handle 10k reqs/sec. But we haven’t been able to test that, or to verify how much is being served from Varnish and how much from the origin servers. Until we have a better handle on where we stand, we’re not comfortable attempting sales tomorrow.
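To make the Varnish-versus-origin question concrete, here is a minimal sketch of the arithmetic involved. It assumes we can read cache hit and miss counts for a given window (Varnish exposes counters of this kind through varnishstat), and it ignores anything that bypasses the cache entirely. The function name and the numbers in the example are hypothetical, purely for illustration.

```python
def origin_requests_per_sec(total_rate: float, hits: int, misses: int) -> float:
    """Estimate how many requests per second fall through to the origin.

    total_rate: requests/sec arriving at the cache (hypothetical input)
    hits, misses: cache hit and miss counts over the same window
    """
    lookups = hits + misses
    if lookups == 0:
        # No lookups recorded, so nothing reached the origin via the cache.
        return 0.0
    return total_rate * misses / lookups


# Hypothetical example: 3,000 reqs/sec at the front end with 9 hits per miss.
estimate = origin_requests_per_sec(3000, hits=900, misses=100)
print(f"roughly {estimate:.0f} reqs/sec would reach the origin")
```

The point of the sketch is only that the origin load depends far more on the miss ratio than on the headline request rate, which is why we want real numbers before committing to a sale.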

We’ll post more as we know more. Thanks to everyone who helped out tonight. And again, we’ll give at least 12 hours’ notice before any sale (probably more). Watch this space.