How to Achieve an SLA

Transcript

To deliver an SLA, to deliver the kind of SLA that we made for eJamming, required us to design the software to meet it from day one. It was not something we could have evolved into; we would have had to rewrite the system. If you've been in this business for more than a few years, you've encountered the need to rewrite a system simply because it couldn't perform anymore.

We have a very large e-commerce client that we're doing a billing system for. They gave us an SLA for how they would provide the data we process, the logistics for tracking packages and billing them, and how frequently the data would come in, so we knew from the beginning what the volume of packages we needed to rate, bill, charge for, and track was going to be. A year into the project, it went from us getting data either daily or weekly throughout the month to getting it within 36 hours before the month ends, and the volume now is four to six times higher than the volume they quoted at first.

Now, we architected the system assuming a certain capacity and performance requirement, and that allowed us to utilise Postgres to perform most of the math computations. In other words, it saved our programmers from having to write all the calculations in Python and do a lot of other complex work, because we're talking about atomic operations, very fast, and we were able to offload that into the database. Well, when they increased our performance and capacity requirements, basically by two orders of magnitude is what it came down to, we started having a very difficult time meeting the SLA for when their work has to be done. This is why understanding SLAs from the beginning is very, very important: had we known then what the true SLA was going to be, we would have architected the system entirely differently.
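
To make that offloading concrete, here is a minimal sketch of what pushing billing math into Postgres can look like. This is not the client's actual code or schema; the packages table, its columns, and the connection string are all hypothetical.

    # Minimal sketch: rate every unbilled package in one set-based, atomic
    # statement inside Postgres, instead of looping over rows in Python.
    # Table, columns, and connection string are hypothetical.
    import psycopg2

    conn = psycopg2.connect("dbname=billing")
    with conn, conn.cursor() as cur:  # the with-block commits the transaction
        cur.execute("""
            UPDATE packages
               SET charge = weight_kg * zone_rate + base_fee
             WHERE billed_at IS NULL
        """)
    conn.close()

A single statement like this keeps the arithmetic next to the data, which is exactly the kind of design that works beautifully at the planned volume and then becomes the pain point when the requirements grow by two orders of magnitude.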

So how do you achieve an SLA? Well, you have to have resiliency. This is what the nines are about: how much downtime can you have, because things will fail. It's not a question of if; it's a question of when and how frequently you will have failures in your system. When Amazon first went to Ireland, around 2011, around when we had our flood here, there was a big outage. All kinds of big companies who rely on Amazon in Europe went down. Our client did not. Our client was in two data centres, because it was our policy, because we knew what we needed to do to achieve an SLA: have no single points of failure. And we consider a data centre a single point of failure. So our client never knew there was an outage on Amazon while the rest of the world was just wiped out. That's resiliency.

CAP theorem. Who's heard of the CAP theorem? If you're building systems in the wild on the internet, you need to be aware of this when you're architecting your system. The CAP theorem basically says there are three attributes of any distributed system that are important, and you may only ever pick two of the three. C is consistency, which means that updates to your data are atomic and guaranteed to be correct. A is availability, meaning that your users are actually able to access the functionality of your system. And P is partition tolerance, which means your system can survive being broken apart, come back together, and still be correct. Remember, this is for distributed systems, which is just about every internet system. I suggest that everybody who's building a system, or thinking about building one, look up the CAP theorem on Wikipedia and figure out which two of those apply to your situation, because that's going to help you decide, if you're going to use a Platform as a Service, which one to use. Certain technologies, like REST versus GraphQL, which someone was asking about earlier today, focus on different aspects of the CAP theorem. So it's not just a question of which one you like; there are fundamental architectural drivers that caused those systems to be designed the way they work. Almost all web systems need to consider the choices that a true RESTful service makes. A true RESTful service is what Roy Fielding defined in his dissertation, and you can read about that; it's a very accessible, easy-to-read dissertation. If you're writing anything that does HTTP requests, anything using the HTTP protocol, you need to know what REST is and understand what motivates it to work the way it does. You have the idempotent commands. If you break REST, you can't have a good SLA: you will take yourself down, and you will do it by design, in your architecture. We see this all the time.
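
To make the idempotency point concrete, here is a minimal sketch in Python. The API, URLs, and payloads are hypothetical; the point is the HTTP semantics, not any particular service.

    # Minimal sketch of REST's idempotent methods. Repeating an idempotent
    # request leaves the server in the same state, so it can be retried
    # blindly after a timeout or dropped connection. URLs are hypothetical.
    import requests

    # PUT sends the full resource state: doing it twice ends in the same
    # state, so a client, proxy, or load balancer may safely retry it.
    requests.put("https://api.example.com/orders/42",
                 json={"status": "shipped"}, timeout=5)
    requests.put("https://api.example.com/orders/42",
                 json={"status": "shipped"}, timeout=5)  # same end state

    # POST is not idempotent: retrying a failed POST can create a duplicate
    # order. Designs that break these semantics cannot be retried safely,
    # which is one way an architecture takes down its own SLA.
    requests.post("https://api.example.com/orders",
                  json={"item": "cable"}, timeout=5)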

Nassim Nicholas Taleb is the writer who wrote The Black Swan and Antifragile. He's really turned statistics on its head. He's taken all the math that the traders and the bankers use to test the resiliency of their systems against bad things happening in the world, and he's fundamentally shown, mathematically, why they're so wrong and so screwed. And it all comes down to the fact that they assume you can manage risk, where the reality is you can only manage exposure.

The other aspect of an SLA is capacity. You need to know what your constant baseline capacity is versus what your burst is. How do you know this? You know this by measuring it. You can never predict it; you'll always be wrong. Every time I build a system, I try to predict where my critical path is, where it will fail capacity-wise, and almost every time, and I've been doing this for 30 years, I'm wrong. The other thing you need to know about your capacity is how your system fails when it hits its capacity. What people don't understand is that a capacity test requires that you break the system: you don't know what your capacity is until you've broken it. So you've got to generate the traffic, the simultaneous users, behaving exactly as they would behave with a web browser. That means you can't just use Siege or JMeter to find out what your system's capacity is, unless you're only measuring an API, because what you really have to do is simulate how a browser behaves against your system. A browser now opens 6 to 12 connections to your server at the same time. JMeter is not going to do that for you; JMeter does one request at a time, in order, very politely. So you're going to think you have a certain capacity with JMeter, then real users are going to come on and knock your system over at one tenth of that. We see this a lot.
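
For a sense of what browser-like load generation involves, here is a minimal sketch using Python's asyncio with aiohttp. The target URL and user counts are hypothetical, and a real capacity test would also fetch the page's actual assets and follow a measured user access model.

    # Minimal sketch of browser-like load: each simulated user opens several
    # connections at once, the way a browser fetches a page's assets, instead
    # of one polite request at a time. Target URL and numbers are made up.
    import asyncio
    import aiohttp

    URL = "https://example.com/"   # hypothetical system under test
    USERS = 100                    # simultaneous simulated users
    CONNS_PER_USER = 6             # a browser opens roughly 6-12 per host

    async def fetch(session):
        async with session.get(URL) as resp:
            await resp.read()

    async def one_user(session):
        # Fire several requests in parallel, as a browser does for assets.
        await asyncio.gather(*(fetch(session) for _ in range(CONNS_PER_USER)))

    async def main():
        # limit=0 removes aiohttp's default connection cap, so every
        # simulated connection is genuinely open at the same time.
        connector = aiohttp.TCPConnector(limit=0)
        async with aiohttp.ClientSession(connector=connector) as session:
            await asyncio.gather(*(one_user(session) for _ in range(USERS)))

    asyncio.run(main())

The essential difference from a single polite request loop is the per-user fan-out: that parallelism is what knocks systems over at a fraction of their JMeter-measured capacity.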

One of our recurring clients, who hires us to do capacity testing every year, is the American Geophysical Union. Basically, if you're an Earth Science person and you want to get your paper published in the biggest place, you go to them. Every year at their conference, before they hired us, their system went down from all the conference users hitting it at the same time, even though their vendor had done a capacity test before the conference and promised it was going to hold up. When we did the capacity test, we knocked the system down over and over and over again. It usually took two or three months of revisions for them to meet their capacity requirements. Since they hired us, they've never gone down during one of their events. This stuff is very, very important.

Performance. When we measure your system, we find out at what point it's going to fail. Depending on the user access model of the system, we're going to say that your capacity is either one third or one half of that number, because your capacity is not the point at which the system fails; you've got to have room in your capacity. Even capacity tests aren't going to completely predict or model the real world of people hitting your system. And the other aspect, of course: if your system can handle a thousand simultaneous users but it takes 30 seconds for the response to come back, that's not a useful capacity, is it? So performance has to be part of it. The reality is, if you want a system to feel instantaneous, you've got to give the user back a result in about 100 milliseconds, which doesn't happen on many web apps, but maybe it should on your mobiles. If you want users to feel that your system is responsive, you've got to respond within 1 second. And if your system takes longer than 10 seconds to respond, users are interrupted in their train of thought and can't even follow the original workflow. Those are the basic rules of thumb. Yet a lot of web people are saying, 'Our site loads in between 4 and 6 seconds, that's pretty good.'
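
As a back-of-the-envelope illustration of that derating rule, here is a tiny worked example; the measured break point of 3,000 users is made up.

    # Worked example: turn a measured break point into a rated capacity using
    # the one-third-to-one-half rule above. The 3,000-user figure is made up.
    def rated_capacity(break_point_users, headroom):
        """Capacity to promise: the measured break point divided by 2 or 3."""
        return break_point_users // headroom

    print(rated_capacity(3000, headroom=3))  # bursty access model -> 1000 users
    print(rated_capacity(3000, headroom=2))  # steady access model -> 1500 users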

Another aspect of this is: where are your servers? Your servers have got to be near your users, and that is one of the problems Content Delivery Networks, CDNs, are trying to solve. It's a problem we have in Southeast Asia, because the region is very underserved for operations and infrastructure. Everybody hosts on AWS in Singapore, so all your Thai users' traffic has to go out through the international gateway here in Thailand, hit Singapore, and come back. And it's bad enough that ISPs here mess with your HTTP traffic, ISPs can really mess things up, so now you've got to make sure you're using an HTTPS connection, which costs you a couple more round trips to establish the secure connection, so that you know the traffic that comes back is actually what the server sent and not something your ISP has tampered with. So you're injecting anything between 70 and 120 milliseconds into every user interaction, even if your system itself is instantaneous, and this adds up real quick, because your typical web page is 40 to 100 requests. You're running 6 to 12 connections simultaneously, so you can divide by that, but that latency is still injected every time you do a round trip. That is why we have physical machines in Thailand. We have two data centres, and each data centre has a completely different, isolated international backbone from the other one. So if something happens, like if CAT got turned off, our customers would not be affected, because traffic would route to our other data centre.
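
Here is that round-trip arithmetic worked through in a short sketch, using mid-range values from the figures above; the exact values picked inside each range are arbitrary.

    # Worked example of the latency math above, using mid-range figures from
    # the talk. The specific values inside each range are arbitrary.
    rtt = 0.095             # 70-120 ms per round trip through the gateway
    requests_per_page = 70  # a typical page makes 40-100 requests
    parallel_conns = 8      # a browser runs 6-12 connections at once
    tls_round_trips = 2     # extra round trips to set up each HTTPS connection

    # Requests ride in parallel "waves", but each wave still pays one round
    # trip, and every connection pays the TLS setup once.
    waves = requests_per_page / parallel_conns
    overhead = waves * rtt + parallel_conns * tls_round_trips * rtt
    print(f"~{overhead:.2f} s of network overhead per page")  # ~2.35 s

Even with a server that answered instantly, this hypothetical page would spend more than two seconds on the network alone.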

The only way you can do all of this is to measure everything. And I will say that's what really separates us: we have been measuring performance and capacity for a very long time. We know how to do this. It is the thing everybody fails at. Most people don't even make the effort; they just hope. They cross their fingers and go, 'Nope, our capacity is fine.' I get this from customers all the time who don't want to pay for a capacity test: 'No, we feel good, we did this and this.' But they never tested the system as a whole.
