Software Architectural Pearls: 002

 

Non-functional Requirements - Application Performance - Part 1



Returning to the analogy from the previous articles, where I compare the construction and evolution of an application with the growth and evolution of a human being, a computational system that aims at high performance, serving thousands of requests per minute, is very similar to a high-performance athlete. From childhood, the high-performance athlete is trained to acquire a nutritional, psychological, and muscular balance that, combined with solid resilience, discipline, and focus, lets this individual achieve marks that are simply unattainable to those who were not prepared for that end.

The same occurs with high-performance systems: all of these characteristics that are fundamental for the athlete have counterparts in our computer systems, covering, for example, the following topics:

  • Language or development frameworks used
  • Infrastructure, capacity, and geographic distribution
  • Data storage technologies
  • Security layers needed

These four topics are examples among several others we could cite and evaluate. As software architects, though, we have to meet a whole set of requirements, and many times we will have to sacrifice one side to serve the other, which we commonly refer to as a trade-off. For example, if there is a vital security requirement that all data must travel encrypted, both internally and externally, network performance will be affected by the overhead this requirement creates, which is perfectly normal and acceptable: in this particular case, security takes precedence over performance. From here, we can draw some crucial points, and I advise you to follow them faithfully, regardless of the non-functional requirement being analyzed:

  1. Every survey of non-functional requirements has as its primary objective the elimination of subjectivity. All of them need to be documented, tested, and approved in detail.
  2. No non-functional requirement should be analyzed or validated individually. They must all be validated together, through simulations of the natural use of the application in its different growth scenarios, on infrastructure equivalent to that of production.
  3. The Software or Solutions Architect's central role is to validate the feasibility of such requirements in these different scenarios. If a requirement is not feasible, this needs to be pointed out as soon as possible, as it may even render the development of the product unfeasible or block it. Whenever trade-offs or decisions affect the NFRs, the Software Architect must maintain an architectural decision record containing:
    • the context of the problem;
    • the alternative solutions presented;
    • the decision that was taken.
  4. The record of architectural decisions needs to be formalized and evidenced so that, in the future, such decisions do not become the target of criticism or the source of significant problems.
To start the study, we will establish a fictitious scenario and focus the analysis on its performance characteristics. The initial idea is to create a knowledge base on the various aspects subject to potential bottlenecks. We will take this as the starting point and then evolve the scenario to understand how to achieve the architecture of a large-scale system from it. For this performance conceptualization, we will use a simple overall architecture as a basis: a web server, an application server, and a database server hosted in a public cloud, with low application usage. The figure below illustrates such an architecture.



We find this typical architecture at several clients: a web server that serves the HTML pages with their various images and some REST APIs that work with the JSON format. Those HTML pages and APIs instantiate objects responsible for implementing business rules and for selecting, updating, removing, or inserting data in a structured database server. Such an architecture will help us understand the basic concepts of performance.

Performance – Basic concepts

The performance of an application is a measure of how fast or responsive the system is under a specific workload on a shared infrastructure. Variations in either of these factors (workload or infrastructure) will automatically impact application behavior and performance. Such an impact, when we increase the workload, for example, can degrade the response time or even reach the breaking point, where the response time increases so much that the application collapses.

Application performance can be divided into two other complementary concepts, namely:

  • Processing Time – the time spent by the active software components to fulfill the client's request, from its origin to the delivery of the processed data.
  • Waiting Time – the time a request or response sits idle in our system, including the time spent transporting data packets and the time spent waiting in queues or processing pools.

In other words, an application's "Response Time" or "Latency" is the sum of the processing time and the waiting time, and understanding and measuring these two elements separately is paramount. As an example, let's say a client with the architecture above has, at peak, a login response time of 750 milliseconds for a local workload (users in the same city) of at most 20 simultaneous users. Then he tells you that he will finance a national marketing campaign next month, offering his service with 30 days free as a way to leverage the business, and he wants the application to handle this new load as well as it handles the current one. This will be unfeasible without investments in the current architecture: a campaign of this size, with such an incentive (30 days free), will significantly increase the workload, possibly to 10 times the current load or more, and, in addition, latency will be higher for more distant states due to the additional routers the packets must traverse. Such a scenario needs an architecture that can deal with a higher workload, using technologies that can reduce latency.
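
To make the measurement concrete, here is a minimal sketch, assuming Python's requests library and a hypothetical /login endpoint, of capturing the client-observed response time. Splitting it into waiting time and processing time requires server-side instrumentation (for example, a timing header or the application server's own metrics).

```python
# Hedged sketch: the URL and credentials below are hypothetical.
import time
import requests

start = time.perf_counter()
response = requests.post(
    "https://app.example.com/login",           # hypothetical endpoint
    json={"user": "demo", "password": "demo"},
)
elapsed_ms = (time.perf_counter() - start) * 1000
# This number is the full response time (latency): waiting time + processing time.
print(f"login response time: {elapsed_ms:.0f} ms (HTTP {response.status_code})")
```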

Although these two concepts are, almost always, the most relevant to high-performance systems, they are not the only ones, and, in some instances, they may have less importance than other variables. Let's look at the following example:




This is a typical logging case used in many applications, where an object asynchronously posts a message to the application log. Because it is asynchronous, the expected behavior is that this processing does not influence the application's performance. However, looking at the logging process, we have different variables to analyze, such as the log's data-ingestion capacity and the size of its queue. Suppose the ingestion capacity is incompatible with the data flow generated by the business objects. In that case, this can cause performance problems in the log aggregator or even loss of information when messages exceed the time they are allowed to stay in the queue. Therefore, we must analyze the performance question always observing the characteristics of the technologies we will use to achieve the desired robustness.
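
A minimal sketch of this pattern follows, assuming an in-process bounded queue and a background worker; the queue size and the simulated ingestion delay are illustrative values, not recommendations.

```python
import queue
import threading
import time

log_queue = queue.Queue(maxsize=1000)   # the queue size is a capacity decision

def log_async(message):
    """Called by the business objects; never blocks the request path."""
    try:
        log_queue.put_nowait(message)
    except queue.Full:
        # Ingestion capacity incompatible with the data flow: the message is lost.
        pass

def log_worker():
    while True:
        message = log_queue.get()
        time.sleep(0.01)        # simulates the log aggregator's ingestion time
        print(message)          # in practice: ship to the log aggregator
        log_queue.task_done()

threading.Thread(target=log_worker, daemon=True).start()
log_async("user 42 logged in")  # usage example
```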

How can we detect a performance problem?


Generally, every performance problem is the result of some queueing somewhere in the application architecture, which can occur on the network, in the database input and output, and in the operating system, among other places. Such queueing is caused by inefficient or slow processing, serialized access to resources, or limited resource capacity.

The best analogy to a performance problem is a traffic jam, where thousands of vehicles pile up on a road that does not have the necessary throughput.




For this reason, when analyzing performance, it is advisable to make a real diagnosis of the architecture and eliminate all the problems found. In these diagnostics, it is very common to make a modification and obtain a positive result at the point we changed, yet still not achieve an effective solution, because the processing retention starts to manifest itself elsewhere. In other words, suppose the problem was queueing in the web server, where several requests were jammed. We decide to increase the processing capacity of this server so that it can attend to all requests at this specific point, yet the impact on overall system performance is minimal or imperceptible. What happens is that, in this case, the requests may now queue up in the application server, due to an inefficient algorithm, or in the database server, which now receives a more significant load than before and starts to show poor performance in query execution. In other words, solving a particular point can generate consequences for the rest of the architecture. Therefore, the diagnosis must be executed continuously until we can solve the problem as a whole.

Principles to ensure a high-performance application

In my opinion, we need to pay attention to three principles to be able to deliver a high-performance application:

  1. Efficiency
    • In the use of resources:
      • IO - Memory, Network, and Disk
      • CPU
    • In the application logic
      • Algorithms
      • Queries
    • On Data Storage
      • Data Structure
      • Database Schema
    • Caching
  2. Concurrency
    • Hardware
    • Software
      • Use of asynchronous queues/processes
      • Consistency
  3. Capacity

As we saw in the previous section, performance problems occur because of a queueing of requests in a specific place, and we know why such queueing occurs. Therefore, to solve these problems, we need to deliver an application that is highly efficient across the whole architecture, has capacity adequate to the required load, and applies the concurrency principles effectively. But before we look deeper into these principles, let's define what concurrency is.

Concurrency is defined by the way we process our requests, which can be:

  • Serial - where requests are processed one after the other. In other words, the processing of a new request only begins when the current one has been fully processed and dispatched to another object or queue.
  • Parallel or concurrent - the requests can be processed simultaneously. For this, we need hardware that allows this parallelism, an operating system that manages parallelism well, and programming logic and language that is also prepared for this purpose. The goal of a high-performance application is the maximum use of parallel processing, whether executed on the same processor with multiple cores or executed by multiple computers or containers.
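
A minimal sketch contrasting the two models, assuming an I/O-bound unit of work (handle_request is a hypothetical stand-in for an HTTP call or a query):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def handle_request(i):
    time.sleep(0.1)              # simulated waiting time (I/O)
    return i

requests_to_serve = range(20)

# Serial: each request starts only after the previous one finishes (~2.0 s).
start = time.perf_counter()
serial_results = [handle_request(i) for i in requests_to_serve]
print(f"serial:   {time.perf_counter() - start:.2f} s")

# Parallel: up to 10 requests in flight at once (~0.2 s for the same load).
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as pool:
    parallel_results = list(pool.map(handle_request, requests_to_serve))
print(f"parallel: {time.perf_counter() - start:.2f} s")
```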

At first, we will look at the principles always considering serial processing. After delving deeper into the principles, we will start to get into the parallelism of processes. Then we can better analyze the factors that can affect this parallelism, which are the queues and the cohesion of the application; otherwise, we can enter scenarios where one request blocks another, affecting the application's performance.

Now that we know the causes of performance problems and the principles we must respect to achieve such performance, we can say that we have the following objectives to attack:

  • Decrease the latency between requests and responses.
    • Where latency is measured in time units
    • And it is composed of the Waiting Time and the Processing Time
  • Maximize our Throughput
    • Which is determined by the number of requests served in a given time
    • And is directly dependent on Latency and Capacity
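
As a back-of-the-envelope illustration of how these two objectives relate (ignoring contention and queueing, so an upper bound only):

```python
latency_s = 0.750   # 750 ms per request, as in the login example above

# Serial processing: one request at a time.
print(f"serial throughput:   {1 / latency_s:.1f} req/s")        # ~1.3 req/s

# With 20 requests processed concurrently and the same per-request latency.
workers = 20
print(f"parallel throughput: {workers / latency_s:.1f} req/s")  # ~26.7 req/s
```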

These two objectives will be analyzed in any exercises we do to measure the performance of an application. If we look at our architectural scheme, we will have the following graphical representation:

The first thing to note here is that latency is a measure of the time (usually in milliseconds) spent from the request to the delivery of the processed response to the user. Once a request enters the system, it spends time at different points, both waiting (in processing queues) and executing (logic/programming processing, which translates into CPU cycles, and data ingestion or selection, which implies disk or memory I/O). Thus, when we talk about minimizing latency, we should focus on reducing this waiting time, optimizing our algorithms and data structures, and, in some cases, increasing processing power (capacity).

The second point is maximizing the throughput, which measures how many requests a system can process in a given time. This throughput is directly linked to latency. If we decrease the waiting or processing time, we automatically increase our throughput.  This rate is paramount in the vast majority of processes performed in our systems, but there is one exception where throughput is not the primary objective.

In batch processes, such as the generation of extensive reports, the most important thing is the processing time, since there is no immediate expectation of a response by the user. What matters, in this case, is to optimize and decrease the processing time so that these jobs run within specific windows and impair the company's online applications as little as possible. In some cases, it is advisable to have completely independent instances for them, given the nature of the process and its data needs. Batch processes should work, at most, with data up to D-1, i.e., with historical or cold data, not with live or in-flight data. If this segregation of environments is not possible, we may experience throughput degradation due to the resource consumption of the batch processes.

How to measure a system's performance

The first step to a good system performance measurement process is the planning of these tests, which involves the following definitions:

Test Objective

There are two kinds of performance tests, defined by their objective:

  • Load Testing - focuses on measuring a usage scenario's throughput against a predetermined load of requests. The primary objective is to understand the system's behavior in a scenario faithful to the application's real use.
  • Stress Testing - aims to continuously increase the processing load until we discover the application's breaking point, that is, the maximum load supported within an acceptable response time. The goal is to understand whether the system's capacity is adequate for regular use, seasonal use, and future growth.

It is essential to point out that, unlike functional tests, where an absolute result is expected (OK or FAIL), in load and stress tests the results must be analyzed very carefully, trying to understand each of the peaks across several dimensions such as CPU load, memory usage, network usage, and disk usage, so that with this information you can evaluate the result, outline strategies, and retest the scenario after applying a candidate solution.

Test Scenario Definition

Due to this complexity, the test scenario definitions must emulate real application usage situations. There is no value in testing a single piece of functionality in isolation, for example, the login load; you will never have only users logging in. You need to extract from existing logs the usage percentage of each feature to compose such scenarios, for example:


Test Scenario Description 01 

  • Script A - 25% of the load
    • Step 1: Successful login
    • Step 2: Search for product A
    • Step 3: Add the product to the shopping cart
    • Step 4: Checkout
    • Step 5: Select the address
    • Step 6: Calculate and select a shipping method
    • Step 7: Select the payment method
    • Step 8: Confirm the purchase

  • Script B - 60% of the load
    • Step 1: Successful login
    • Step 2: Search for product A
    • Step 3: Search for product B
    • Step 4: Search for product C

  • Script C - 15% of the load
    • Step 1: Successful login
    • Step 2: Search for product A
    • Step 3: Add product A to the shopping cart
    • Step 4: Continue shopping
    • Step 5: Search for product B
    • Step 6: Add product B to the shopping cart
    • Step 7: Checkout
    • Step 8: Select the address
    • Step 9: Calculate and select the shipping method
    • Step 10: Select the payment method
    • Step 11: Checkout with payment failure
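
As an illustration of how such a scenario could be automated, here is a minimal sketch assuming Locust as the load-testing tool and hypothetical endpoints (/login, /search, /cart, /checkout); the task weights approximate the 25% / 60% / 15% split of Scripts A, B, and C above.

```python
from locust import HttpUser, task, between

class ShopUser(HttpUser):
    host = "https://shop.example.com"   # hypothetical system under test
    wait_time = between(1, 3)           # think time between steps, in seconds

    def on_start(self):
        # Step shared by all scripts: successful login.
        self.client.post("/login", json={"user": "test", "password": "test"})

    @task(25)   # Script A - full purchase flow
    def buy_one_product(self):
        self.client.get("/search", params={"q": "product-a"})
        self.client.post("/cart", json={"sku": "product-a"})
        self.client.post("/checkout")
        # ...remaining steps (address, shipping, payment, confirmation) omitted.

    @task(60)   # Script B - browse only
    def browse(self):
        for q in ("product-a", "product-b", "product-c"):
            self.client.get("/search", params={"q": q})

    @task(15)   # Script C - two products, ending in a payment failure
    def buy_two_products_payment_fails(self):
        for sku in ("product-a", "product-b"):
            self.client.get("/search", params={"q": sku})
            self.client.post("/cart", json={"sku": sku})
        self.client.post("/checkout", json={"payment": "invalid-card"})
```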

Load Test 1 - Scenario 1 
  • Execution Environments: 
    • Data Center in current architecture
    • Data Center in the new architecture 
    • Subsidiary 1 in current architecture 
    • Subsidiary 1 in the new architecture 
  • Initial Load: 15 logged-on users
  • Ramp-up: 15 users every 5 minutes
  • Final Load: 300 users
  • Post-final ramp-up evaluation time: 30 minutes



Load Test 2 - Scenario 1 
  • Execution Environments: 
    • Data Center in current architecture
    • Data Center in the new architecture 
    • Subsidiary 1 in current architecture 
    • Subsidiary 1 in the new architecture 
  • Initial Load: 15 logged-on users
  • Ramp-up: 15 users every 5 minutes
  • Final Load: 500 users
  • Post-final ramp-up evaluation time: 30 minutes

Load Test 3 - Scenario 1 
  • Execution Environments: 
    • Data Center in current architecture
    • Data Center in the new architecture 
    • Subsidiary 1 in current architecture 
    • Subsidiary 1 in the new architecture 
  • Initial Load: 15 logged-on users
  • Ramp-up: 15 users every 5 minutes
  • Final Load: 700 users
  • Post-final ramp-up evaluation time: 30 minutes

Stress Test 1 - Scenario 1 - 3 application servers
  • Execution Environments: 
    • Data Center in current architecture 
    • Data Center in the new architecture 
    • Subsidiary 1 in current architecture 
    • Subsidiary 1 in the new architecture 
  • Initial Load: 300 logged-on users
  • Ramp-up: 50 users every 5 minutes
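
Still assuming Locust as the tool, the ramp-up of Load Test 1 (15 initial users, 15 more every 5 minutes up to 300, then a 30-minute observation window) could be expressed as a custom load shape; this is a sketch under that tooling assumption, not a prescription.

```python
from locust import LoadTestShape

class LoadTest1Shape(LoadTestShape):
    step_users = 15        # users added per ramp-up step
    step_time = 5 * 60     # seconds per step
    max_users = 300        # final load
    hold_time = 30 * 60    # post-ramp-up evaluation time

    def tick(self):
        run_time = self.get_run_time()
        ramp_time = (self.max_users // self.step_users) * self.step_time
        if run_time < ramp_time:
            users = self.step_users * (1 + int(run_time // self.step_time))
            return (min(users, self.max_users), self.step_users)
        if run_time < ramp_time + self.hold_time:
            return (self.max_users, self.step_users)
        return None   # stop the test after the evaluation window
```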


Sample Stress Test Results


Instead of creating random graphics to show the possible ways of reporting results, I will refer to a real case published by a company specialized in stress and load tests, which I think is more suitable to show how complex and extensive those reports are.

An example is from WebPerformance.com, where they have tremendous expertise and a complete report covering every aspect of the tests.

I suggest searching other companies in this area to analyze how they manage those tests.

Test Tooling and Automation


We must always look for an environment similar to the production environment to have more reliability in the test results. Here you must also define the test tool to be used, its geographical distribution, the configuration of test scripts, and the repository of evidence (results) to be later analyzed.

I emphasize the need for geographic distribution because it directly affects the latency of the requests. If we have, for example, a system hosted in a data center in São Paulo (SP) and the client has units in all Brazilian states, the latency for the SP units will undoubtedly be lower than the latency for the country's northern states. Therefore, in cases like this, we recommend distributing the test robots across all regions to get an accurate perception of the latency experienced by these users, data that can influence the measures we will take to improve performance.

Consolidation and test results


Finally, all the data must be consolidated and the results presented for each scenario and execution, showing the gaps, the actions taken, and the improvements obtained with these activities. These tests can be performed at the end of each sprint/delivery to the customer or at specific milestones. Sometimes, such tests may involve a large group of specialists from the solution developer company and from the client's production/infrastructure support teams. Before finalizing, it is worth mentioning one last point:




We must always be aware of what we call tail latencies, represented in the graph above starting at the 90th percentile. These points are where we have the worst performance and, in many cases, they can be the result of errors in the system. However small the percentage may be, as in the case of the 99th percentile, it indicates behavior that may be acceptable in everyday use but can be highly impactful under heavy use of the system, such as during seasonal periods like Black Friday or after a successful marketing campaign. So never dismiss these cases; keep investigating them to ensure robustness during these periods of high usage.
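
As a minimal sketch of how tail latencies are extracted from raw measurements (the sample values below are illustrative, not real test data):

```python
import math

latencies_ms = [120, 135, 140, 150, 160, 180, 210, 480, 950, 2400]  # hypothetical samples

def percentile(values, p):
    """Nearest-rank percentile: the value below which p% of the samples fall."""
    ordered = sorted(values)
    k = max(math.ceil(p / 100 * len(ordered)) - 1, 0)
    return ordered[k]

for p in (50, 90, 99):
    print(f"p{p}: {percentile(latencies_ms, p)} ms")
```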

Serial Processes - Causes and possible approaches for decreasing Latency in the application ecosystem

As said in previous sections, the best way to improve the performance of an application comes from reducing latency at its various bottlenecks. In this section, from the perspective of serial processing, we will evaluate which bottlenecks can cause queueing and harm latency, and the possible approaches to them.

Network Latency


Network latency can be divided into two parts:

  • External: this covers the processing and waiting time between the user's computer and the application infrastructure. As already explained, the greater the geographical distance, the greater the latency. This is because packets must pass through several routers (hops) until they reach our infrastructure, and the TCP connection setup, negotiated over this whole path, has a high overhead.
  • Internal: which covers the waiting and processing times within our infrastructure. In this case, we have more control and conditions for improvement.

The main offender in network latency is the time spent establishing connections: in most applications, we use TCP as the communication protocol due to its reliable packet delivery, and each new connection therefore requires a negotiation called the 3-way handshake, as illustrated in the figure below:



After this negotiation, the hosts are ready to talk. The second figure illustrates the issue of external latency, where packets must pass through several routers to reach the destination.



Given this context, we have some strategies to minimize network latency, such as:

  • If the application is hosted in a public cloud, it is recommended to adopt a CDN (Content Delivery Network), a structure composed of several servers distributed across global regions used to host and serve (cache) preferably static content such as HTML pages and images, among others. CDNs can also host dynamic content, but in this case, you need to pay attention to the lifetime of these objects in the caches and establish purging policies.
  • Caching static data on the client is also a viable option.
  • Persistent connections - as explained earlier, the cost of establishing a connection is high, so keeping the connection open avoids repeating the whole 3-way handshake for every request. This is usually a simple configuration in which you set the keep-alive attribute so that the TCP connection persists for a long time (see the sketch after this list).
  • Data compression - Another effective way to decrease latency is to send data in a compressed form. The trade-off here is that we increase the processing time for compressing and decompressing such data. So one should evaluate the actual effectiveness in the analyzed scenario.
  • Transmitted data format - It is also important to send as little data as possible. Structures such as protocol buffers and JSON are more effective than SOAP and XML.
  • TLS/SSL session cache - Data encryption protocols also generate significant overhead. If possible, keep established sessions cached to avoid any new negotiations.
  • Connection pooling - another important point is to manage the database connection pool well, reusing already established, idle connections rather than opening a new connection for every request. All databases have adequate tools to evaluate and measure this usage and gauge whether it is adequate for the incoming load.
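
A minimal sketch of persistent connections from the client side, assuming Python's requests library and a hypothetical URL; the same idea applies to keep-alive settings on web servers and to database connection pools.

```python
import requests

with requests.Session() as session:        # the session keeps the TCP connection alive
    for product_id in ("a", "b", "c"):
        # All three calls reuse the same connection (and TLS session), avoiding a
        # new 3-way handshake and TLS negotiation for every request.
        response = session.get(f"https://shop.example.com/products/{product_id}")
        print(response.status_code)
```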

Memory Latency


There will be cases where we experience high latency due to memory management problems. Being a limited resource, memory stores processing data from various parts of the machine, from the operating system to the application you are analyzing. Most problems are basically due to two factors. The first is a memory leak, where memory usage skyrockets due to a programming error and the objects stored there are never cleared until more drastic action is taken, such as stopping and restarting the process or even the server. The other widespread problem is the configuration and behavior of garbage collectors, which are used by managed languages such as .NET and Java. In these languages, even when you release an object, you only mark it as eligible for collection; it is actually cleaned up when the garbage collector runs. And generally, that only happens when the runtime detects that the system is running low on memory, triggering a very aggressive execution where many objects are cleaned at once. During this execution, our processing time will undoubtedly suffer.

Another problem arises when processes consume almost all of the RAM, causing the operating system to page memory objects to so-called virtual memory, a space reserved on the hard disk for this function. Whenever this virtual memory is used heavily, the application will perform poorly.

In the case of memory problems, the possible approaches are:

  • Evaluate and optimize the use of the Garbage Collector.
  • Apply the good practice of allocating resources as late as possible and releasing them as soon as possible (see the sketch after this list).
  • Pay attention to how you work with variables in your software
    • Dive deep into how your languages and frameworks use heap and stack memory and how to architect your code to favor minor garbage-collector runs instead of full runs (usually over the entire heap), which can jeopardize overall performance.
  • Divide large processes into smaller processes so that they can have better management of the allocated resources
  • As a last resort, increase the RAM capacity - remembering that in cases of memory leaks, this solution will have no effect because it will only delay the moment when the memory becomes fully allocated.
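
A minimal sketch of the "allocate late, release early" practice, using Python and sqlite3 purely as an illustrative resource; any file handle or pooled connection follows the same pattern.

```python
import contextlib
import sqlite3

def load_orders(path, db_path):
    # The file handle is acquired right before use and released immediately after.
    with open(path, encoding="utf-8") as f:
        lines = f.readlines()

    # The connection is held only while the inserts actually need it.
    with contextlib.closing(sqlite3.connect(db_path)) as conn:
        with conn:   # commits on success, rolls back on error
            conn.executemany(
                "INSERT INTO orders (raw_line) VALUES (?)",
                [(line.strip(),) for line in lines],
            )
    # Both resources are released promptly instead of waiting for the garbage collector.
```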

Disk Latency


Disk latency is the next item we should pay attention to that can affect system performance. This point is critical because it is one of the places where performance can be affected the most, given how a hard disk works. Data written to disks, whether in file systems or database systems, is fragmented by nature. In addition, other requirements force us to adopt reliability strategies such as disk mirroring or data striping with parity, so that, in the event of a disk loss, we can recover the data for use as quickly as possible and with as little loss as possible. Data fragmentation is a significant performance offender, as it forces the hard disk's read head to search for fragments across multiple tracks. On top of that, the redundancy mentioned above can increase processing time, either through the calculation of the parity data used to recompose the contents of a damaged disk or simply because the hardware does not have a disk controller efficient enough to handle mirroring and these other concerns. We have made significant progress in disk performance through SSD (Solid-State Drive) technology, which offers far lower latency than mechanical disks, but at a high cost for large amounts of data.

It is also necessary to pay attention to how the data is used because, in some instances, we can adopt database solutions that accept only new data, not allowing updates or deletion of old data. Such technologies aim precisely at avoiding the fragmentation of this data. Today we have database technologies that prioritize write performance over read performance and others that do the opposite. Therefore, the Software Architect needs to pay attention to how data will be used in each use case in order to choose not just one database (as before) but a set of databases adequate for each case. In later articles, I will address the microservices architecture, which facilitates these approaches.

As countermeasures to decrease disk latency and, as a consequence, increase the performance of our applications, I can suggest the following:

  • Use caches for static and dynamic web content when possible. Here is a tip: manage the TTL (Time to Live) of cached objects well. If you populate the cache with all objects sharing the same TTL, then when that TTL expires you will see a massive wave of expirations at once, which we sometimes call an avalanche: at that moment, all cache lookups are invalidated and redirected to the application server, which has to deliver the content to the user and refresh the cached objects. Therefore, when configuring your cache, add random variation to the TTLs to avoid this avalanche, which will undoubtedly drive performance to the lowest levels and may even leave the system momentarily offline (see the sketch after this list).
  • Always treat logs asynchronously and size your queues well so they can be processed quickly without the risk of exceeding their capacity.
  • Optimize your queries and evaluate their performance on the database, ensuring they are correctly using the data indexes.
  • Also, cache data with little or no volatility to avoid querying it directly to the database.
  • Optimize the Database Schema. Each time a new object is added to a database, it is necessary to re-establish the data optimization plan. 
  • Define hardware that is compatible with the data usage. Prefer controllers that offer high data throughput and manage redundancy strategies in hardware, avoiding implementing these strategies in the OS.
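
A minimal sketch of the randomized-TTL tip from the first item above, assuming redis-py as the cache client; key names and TTL values are illustrative.

```python
import random
import redis

cache = redis.Redis(host="localhost", port=6379)

BASE_TTL = 600   # 10 minutes
JITTER = 120     # up to 2 extra minutes, chosen per key

def cache_product(product_id, payload):
    # Spreading expirations over a window prevents all keys from expiring at once
    # (the "avalanche" described above).
    ttl = BASE_TTL + random.randint(0, JITTER)
    cache.setex(f"product:{product_id}", ttl, payload)
```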

In the next article, we will cover the second part of the non-functional performance requirement, focusing on parallel processing, where we will have different problems and solution approaches.

See you soon,

Paulo Azevedo
