Imagine you are starting a brand new application. You’ve already started or, best-case scenario, you have already implemented it and are deploying it into production for the first time.
There are two small things which you probably missed:
- There is a huge gap between an application that is ready for first deployment and an application that is ready for operation in production mode.
- When real users start working with your application, you no longer have time to close this gap and it will hurt you.
We will go through aspects which should be considered when you are starting the architecture of your new application for production. If you are already in production and forgot something, then better late than never. All of the aspects mentioned below seem obvious, but for different reasons they are regularly missed. This is not a comprehensive list, just a few of the most painful. The result of missing one is usually very simple: emergency releases, all-nighters, and overtime.
Let’s Get Started…
Generally, all the topics below are called non-functional requirements, aka NFRs (or quality attributes), or in short, how a system is supposed to be: “Non-functional requirements specify criteria that can be used to judge the operation of a system, rather than specific behaviors”.
Today we will stick to NFRs related to system execution time and the support operations around them. There are a lot of other non-functional requirements around the evolution of systems, testability, usability, security, etc. - we will cover them next time. Tools to support particular NFRs will also not be covered because tool selection depends on many different factors.
Also, I assume that the system covers functional requirements in full.
Topics for today:
- Availability and reliability
It seems that every engineer learned about the importance of performance requirements in kindergarten, but probably forgot it on their first project.
How many of you can tell me how many users your application is planned for? How many operations will they perform? Are you sure? Do you know for sure that your application can handle it? There are a couple of aspects of performance that can be easily measured:
- Response time - how quickly the client gets a response to a request
- Throughput - the number of successfully processed requests per unit of time
For both, you need to know what is expected from the system. Response time concerns not only UI interactions but also background or asynchronous processing, meaning processes which return results to the client later.
Several things significantly influence both aspects. The first, and the one typically forgotten, is data volume. Some people think that they will have a lot of data and bring a cargo ship instead of a cart, which costs later in terms of ownership and support. Others do not think about it at all. As a result, when the application reaches real volumes, you find out that the main dashboard takes one to five minutes to load and that processing one day’s worth of asynchronous messages takes the system two and a half days of nonstop work at 100% CPU load.
For some lucky people, the workload will grow gently and there will be time to change the application. For others, it will be something like a 10x spike in one day and Russian roulette.
Instead of this game, just take a few simple steps:
- Calculate the maximum data volumes of the major entities in the system.
- Generate required data in the system (please don’t say that a senior developer doesn’t know how to create a million random records in the database).
- Check how the system works with this data for major scenarios.
The user count can have a significant impact on data volume, so you also need to know how many users are expected to work with the system, not just now but also in the near future.
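The three steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the `orders` table, its columns, the volume estimate (2,000 orders per day kept for 500 days), and the "top customers" query are all hypothetical stand-ins for your own entities and scenarios, and SQLite is used only because it ships with Python.

```python
import random
import sqlite3
import time


def build_test_db(row_count: int) -> sqlite3.Connection:
    """Create an in-memory database filled with random (hypothetical) orders."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
    )
    rng = random.Random(42)  # fixed seed keeps runs comparable
    rows = (
        (rng.randint(1, 50_000), round(rng.uniform(1.0, 500.0), 2))
        for _ in range(row_count)
    )
    conn.executemany("INSERT INTO orders (customer_id, amount) VALUES (?, ?)", rows)
    conn.commit()
    return conn


def time_dashboard_query(conn: sqlite3.Connection) -> float:
    """Return the seconds taken by a stand-in for a 'major scenario' query."""
    started = time.perf_counter()
    conn.execute(
        "SELECT customer_id, SUM(amount) FROM orders "
        "GROUP BY customer_id ORDER BY SUM(amount) DESC LIMIT 10"
    ).fetchall()
    return time.perf_counter() - started


if __name__ == "__main__":
    # Step 1: estimate the volume (assumed: 2,000 orders/day kept for 500 days).
    expected_rows = 2_000 * 500
    db = build_test_db(expected_rows)      # Step 2: generate the data
    elapsed = time_dashboard_query(db)     # Step 3: check a major scenario
    print(f"{expected_rows} rows, top-10 query took {elapsed:.3f}s")
```

The same idea applies to whatever database you actually run in production; the point is that the check is cheap enough to do before real users arrive.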
Next, you need to know where you will run everything: network, hardware, virtualization. Don’t be put in a position where your development desktop is more powerful than the expected production environment.
And last but not least, for advanced teams: combine the data, a production-like environment, and the expected workload, and run performance tests to make sure that you get the expected response time. Automating these tests helps you make sure that the timings stay right after changes.
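As a rough illustration of such an automated check, the sketch below measures the two aspects discussed earlier, response time (as a 95th percentile) and throughput, and fails when a budget is exceeded. The `fake_request` function and the budget values are hypothetical; a real test would call the system in a production-like environment, usually with many concurrent clients rather than the single sequential caller used here.

```python
import math
import time
from typing import Callable


def measure(operation: Callable[[], object], calls: int) -> tuple[float, float]:
    """Run `operation` sequentially; return (p95 response time in s, calls per second)."""
    latencies = []
    started = time.perf_counter()
    for _ in range(calls):
        t0 = time.perf_counter()
        operation()
        latencies.append(time.perf_counter() - t0)
    total = time.perf_counter() - started
    latencies.sort()
    # Nearest-rank 95th percentile.
    p95 = latencies[math.ceil(0.95 * len(latencies)) - 1]
    return p95, calls / total


def fake_request() -> None:
    """Hypothetical stand-in for a real request to the system under test."""
    time.sleep(0.002)


if __name__ == "__main__":
    p95, throughput = measure(fake_request, calls=200)
    print(f"p95={p95 * 1000:.1f} ms, throughput={throughput:.0f} req/s")
    # Assumed budgets; a CI job fails when a change breaks them.
    assert p95 < 0.050, "p95 response time budget exceeded"
    assert throughput > 50, "throughput below target"
```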
I don’t think I need to mention that if a system isn’t able to handle the expected load, it’s critical to start finding a solution before you have real users.
Availability and Reliability
This part is about making sure the system performs well when needed and managing the risk of failure.
The major question to answer here is when the system should be available. There are two parts to consider: time and level. For complex systems, different parts/components will have different metrics.
The first part is time or “business hours,” that is, when users will operate the system. For example, for an online system this is 24/7. For a system providing functions for a company department, the time frame is the working hours of that department.
The second part is how critical failure would be. I prefer to break this into three levels:
- Critical - If a component is down it will significantly impact overall business (like purchases which the client won’t receive because of this). Usually such an impact can be directly measured in terms of money lost.
- Important - If a component is down for X hours it will impact major business use cases, but without direct money loss (for example, a user won’t have access to full text search but will still be able to use major features).
- Support - No direct impact on business (like a configuration tool or internal user management system).
The grades are empirical, and you may rate a component as critical because its downtime would affect reputation or some other criterion.
The major goal here is to keep the number of critical parts as low as possible. After that, you need to make sure that critical parts are protected and will be available during business hours. For important components, you need to have procedures in place which guarantee that component availability will be restored within X hours (before it starts affecting business).
Next, you need to understand what can go wrong and how it affects your components, especially the critical ones. Some common things that can occur include hardware issues with the network or servers, unavailability of external systems, or an unexpected load on your system. Do not forget that everything that can fail, will fail. The database host will go down, the external web service will respond in 30 seconds instead of 100 ms, and somebody will create a client for your API that makes 1000 additional requests just because of a human mistake in code.
For critical parts you need redundancy (at least two nodes) with data replication to be protected from hardware failures. The failure of a whole datacenter is rare, so the usual practice here is just to have a procedure to restore the system in another datacenter in a reasonable time.
Another thing that is usually missed is what to do if you receive more requests than you are able to process (assuming that the system is already able to handle a normal operational load). Good patterns here include configured automatic scaling for the component and the throttling of requests from a particular client. Throttling strategy can be very different but usually it is something like, “If a client sent more than 1000 requests in the last 5 minutes, send an error instead of request processing.”
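The quoted rule can be sketched as a sliding-window counter per client. This is a minimal single-process illustration, not a production implementation: the class name `SlidingWindowThrottle` and its parameters are made up for the example, rejected requests are not counted against the limit, and a system with several nodes would need a shared store for the counters.

```python
import time
from collections import defaultdict, deque


class SlidingWindowThrottle:
    """Allow at most `limit` requests per client within the last `window_seconds`."""

    def __init__(self, limit: int = 1000, window_seconds: float = 300.0,
                 clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock  # injectable so the behavior can be tested without waiting
        self._hits = defaultdict(deque)  # client id -> timestamps of accepted requests

    def allow(self, client_id: str) -> bool:
        now = self.clock()
        hits = self._hits[client_id]
        # Forget requests that have fallen out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # caller should answer with an error instead of processing
        hits.append(now)
        return True


if __name__ == "__main__":
    throttle = SlidingWindowThrottle(limit=1000, window_seconds=300)
    accepted = sum(throttle.allow("client-42") for _ in range(1005))
    print(f"accepted {accepted} of 1005 requests")
```

In an HTTP API, a rejected request would typically be answered with status 429 (Too Many Requests).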
You also need to be ready for similar throttling behavior from external systems. The CircuitBreaker pattern can be used here to proactively manage connections with external systems, handle downtime instead of random failure on requests, and come back to normal operations when the external system comes back.
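A minimal sketch of the idea follows. The names (`CircuitBreaker`, `CircuitOpenError`) and the thresholds are assumptions for illustration only; mature libraries add per-exception rules, metrics, and thread safety that are left out here.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self.clock() - self._opened_at < self.reset_timeout:
                # Fail fast instead of waiting on a system known to be down.
                raise CircuitOpenError("external system marked as down")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success: back to normal operation.
        self._failures = 0
        self._opened_at = None
        return result
```

Callers wrap each external request, for example `breaker.call(fetch_rates, "EUR")` where `fetch_rates` is whatever function talks to the external system, and treat `CircuitOpenError` as a signal to use a fallback or a cached value.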
So try to be paranoid in terms of possible issues, but only for components that are truly critical, instead of focusing on making things that are used twice per month more reliable.
After the point when your system is able to handle the expected load and the failure of things outside of your code, it is time to think about how to investigate issues in your own production code and be able to fix them easily. I know that you create perfect code without any defects, but sometimes somebody from your team might make a mistake.
The first level of help here is to maintain good logs. You need to have log files with information about all significant operations, interactions with external systems, the start and finish of background operations, and any unexpected behavior like cache misses or internal queue overflows - everything which shows what is going on in the system.
It’s a good idea for each business transaction to have a unique ID and to log it as well. It will help you quickly separate one user’s interactions from another’s, and one click in the UI from another. The user ID and the IDs of major data items are also good candidates to be logged.
One feature which can save a lot of time is audit tracking. Many “defects” turn out to be simply the result of something a user executed or changed.
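One way to attach such IDs to every log line, sketched with Python’s standard `logging` and `contextvars` modules. The logger name `app`, the field names, and the `handle_request` function are hypothetical; the pattern is simply a filter that copies the current transaction and user IDs onto each record.

```python
import contextvars
import logging
import uuid

# Values set once per request and visible to all code handling that request.
transaction_id = contextvars.ContextVar("transaction_id", default="-")
user_id = contextvars.ContextVar("user_id", default="-")


class ContextFilter(logging.Filter):
    """Copy the current transaction and user IDs onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.transaction_id = transaction_id.get()
        record.user_id = user_id.get()
        return True


def configure_logging() -> logging.Logger:
    handler = logging.StreamHandler()
    handler.addFilter(ContextFilter())
    handler.setFormatter(logging.Formatter(
        "%(asctime)s %(levelname)s txn=%(transaction_id)s user=%(user_id)s %(message)s"
    ))
    logger = logging.getLogger("app")
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


def handle_request(logger: logging.Logger, user: str, order_id: int) -> None:
    """Hypothetical request handler: one new transaction ID per business transaction."""
    transaction_id.set(uuid.uuid4().hex)
    user_id.set(user)
    logger.info("order lookup started order_id=%s", order_id)
    logger.info("order lookup finished order_id=%s", order_id)


if __name__ == "__main__":
    handle_request(configure_logging(), user="u-1001", order_id=77)
```

Filtering the log by a single `txn=` value then shows exactly one click, even when many users are active at the same time.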
The next level is to be able to reproduce unexpected behavior with real user data. This needs to be carefully combined with security requirements because you do not want to provoke a leak of user data, but it really helps in complicated scenarios. Also, you never know when you will need this functionality.
One option is to export the required dataset from production into some QA environment and play with the data in QA. You will need to support the export/import of required data between environments.
Another option is to be able to log into the system as a particular user to do something. This option really helps the support team when a user tries to explain what they think is wrong. It allows the reproduction of the issue without any copy of data and in a real environment. You need to be careful here not to expose a person’s personal information to support personnel, and to properly log all activities performed via this feature (with user ID and support person ID). Otherwise you open a huge security hole that can be used by anybody with the appropriate access.
The next advanced level is to have specific support functions in place which let you look inside the running application: for example, dump a user session, look at what information was put into caches, and restart a particular background task, possibly with adjusted parameters.
Monitoring is a crosscutting topic for all the previous ones mentioned. Do you know what is happening with your application right now? How many users do you have? How many operations per second? What is the root cause of a failure?
The first thing you need is a set of metrics for your application. Each aspect that is important from a performance and operational perspective should be monitored, and you need statistics on how each metric changes over time. We are not only talking about infrastructure metrics, which everybody is aware of, but also application-specific metrics. Examples of such metrics include the count of users logged in to the system, the count of requests of a particular type, request response time, and job processing time. With a good metrics solution you can create business-oriented metrics like orders or purchases per day.
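To make “application-specific metrics” concrete, here is a small in-memory sketch of counters and timings. The `Metrics` class and the metric names are invented for the example; in practice you would use a metrics library and ship periodic snapshots to a time-series store so that changes over time are visible.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class Metrics:
    """Tiny in-memory registry for counters and operation timings."""

    def __init__(self):
        self.counters = defaultdict(int)
        self.timings = defaultdict(list)

    def increment(self, name: str, amount: int = 1) -> None:
        self.counters[name] += amount

    @contextmanager
    def timed(self, name: str):
        """Record how long the wrapped block takes, even if it raises."""
        started = time.perf_counter()
        try:
            yield
        finally:
            self.timings[name].append(time.perf_counter() - started)

    def snapshot(self) -> dict:
        """Return current values; call periodically to build a history."""
        report = dict(self.counters)
        for name, samples in self.timings.items():
            report[f"{name}.count"] = len(samples)
            report[f"{name}.avg_seconds"] = sum(samples) / len(samples)
        return report


if __name__ == "__main__":
    metrics = Metrics()
    metrics.increment("users.logged_in")
    metrics.increment("orders.created", 3)  # a business-oriented metric
    with metrics.timed("request.search"):
        time.sleep(0.01)  # stand-in for real request handling
    print(metrics.snapshot())
```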
Next, you need to have alerts configured for your metrics so that the responsible people are notified when something goes wrong. It’s also good to track all exceptions in the system and be notified of them (as a simple option, just configure the log to send an email to the development team distribution list for each exception).
After you have it configured, you need to test the configuration to make sure that the alert really fires when something goes wrong and that people receive it and know how to react. You also need to make sure that people do not receive unnecessary alerts, so that they don’t start to ignore them, because at some point they will ignore something that’s important. One of the worst-case scenarios is one where you have configured monitoring, but nobody reacts to alerts from it.
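One common way to cut down on unnecessary alerts is to fire only after a metric has stayed above its threshold for several samples in a row, and only once per incident. The sketch below assumes this approach; the `AlertRule` name, the threshold, and the sample values are illustrative, and the notification itself (email, pager, chat) is left out.

```python
from dataclasses import dataclass, field


@dataclass
class AlertRule:
    metric: str
    threshold: float
    consecutive: int = 3  # samples in a row above the threshold before firing
    _breaches: int = field(default=0, init=False, repr=False)
    _firing: bool = field(default=False, init=False, repr=False)

    def observe(self, value: float) -> bool:
        """Return True only on the sample where the alert starts firing."""
        if value > self.threshold:
            self._breaches += 1
        else:
            # Metric recovered: reset so that a new incident can alert again.
            self._breaches = 0
            self._firing = False
        if self._breaches >= self.consecutive and not self._firing:
            self._firing = True
            return True
        return False


if __name__ == "__main__":
    rule = AlertRule(metric="request.search.p95_seconds", threshold=2.0)
    for sample in [2.5, 1.0, 2.5, 2.6, 2.7, 2.8]:
        if rule.observe(sample):
            print(f"ALERT: {rule.metric} above {rule.threshold}s (latest {sample}s)")
```

Feeding the rule known sample sequences, as in the loop above, is also a cheap way to test that the alert fires when it should and stays quiet when it should not.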
In summary, just a few simple takeaways:
- Understand what is really expected from the application and have a plan to deliver on it.
- Do not overcomplicate things.
- Keep as close as possible to the real environment.
- Do not trust your experience and intuition - run tests.
- Keep monitoring and alerting on each important metric so that you are aware of a failure before the user tells you.
And do not forget one last thing about IT: there are people who do not do backups and people who have already started doing them. If you haven’t ever been affected by any of the issues above, it means either you are really smart or just lucky. I wish you peaceful sleep tonight.