Performance is a criterion every software application should meet, and one every architect should keep in mind when defining non-functional requirements. Tuning a mature and complex application can be really hard, especially because its performance may be influenced by many factors.
The key is to isolate and set aside the external components that cannot be improved much because they are outside the application’s control (e.g. network or middleware latency) and to focus on the components within the developer’s control. After deciding what exactly should be measured, it is important to know how to measure it properly. Measuring performance also has characteristics specific to each programming language, and it requires a deeper understanding of both software and hardware. In this article I will discuss some guidelines that are useful for measuring performance at the Java application level, independent of any external systems.
When measuring performance, there are two major performance manifestations:
- Throughput, which is the number of operations per unit of time
- Response time or latency, which means how long the operations take
Sometimes it is much easier to improve throughput (e.g. by increasing the number of threads and running the algorithm in parallel), but this may worsen latency (e.g. thread context switching adds overhead at the OS level). In other situations the response time cannot be improved any further (e.g. the algorithm takes 1h to run on a single thread). In that case we can focus on increasing the throughput instead.
Before starting to measure, define an acceptable throughput or latency (response time) value to use as a reference. Without a reference it is hard to tell whether the application performance is good enough or has improved.
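To make the two metrics concrete, here is a minimal sketch that derives both from the same timed run. The `work` method is a hypothetical stand-in for the operation under test, and warmup is left out for brevity, so the printed numbers only illustrate the arithmetic and are not a reliable measurement.

```java
import java.util.concurrent.TimeUnit;

final class ThroughputLatencyDemo {

    /** Hypothetical workload standing in for the operation under test. */
    static long work(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += (long) i * i;
        }
        return sum;
    }

    /** Average latency per operation, in nanoseconds. */
    static double averageLatencyNanos(long elapsedNanos, long operations) {
        return (double) elapsedNanos / operations;
    }

    /** Throughput, in operations per second. */
    static double throughputPerSecond(long elapsedNanos, long operations) {
        return operations * (double) TimeUnit.SECONDS.toNanos(1) / elapsedNanos;
    }

    public static void main(String[] args) {
        final long operations = 100_000;
        long sink = 0; // consume the results so the work cannot be discarded

        long start = System.nanoTime();
        for (long i = 0; i < operations; i++) {
            sink += work(100);
        }
        long elapsed = System.nanoTime() - start;

        System.out.printf("avg latency: %.1f ns/op%n", averageLatencyNanos(elapsed, operations));
        System.out.printf("throughput:  %.0f ops/s%n", throughputPerSecond(elapsed, operations));
        System.out.println("(sink = " + sink + ")");
    }
}
```

On a single thread the two values are simply reciprocal. They diverge once several threads run in parallel: throughput can grow while the latency of each individual operation stays the same or gets worse.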
Always include warmup iterations when testing and measuring performance. These iterations should be short and triggered repeatedly, in cycles, at the beginning of each test in order to bring the code into a steady state. A good practice is to have around 10 cycles of 15k iterations each (15k is the default compile threshold of the Just-In-Time C2 Compiler when HotSpot runs with tiered compilation). After this we can reasonably expect the application code to be stable and fully optimized, and we can start measuring.
If the warmup iterations are omitted, the test measures interpreted code rather than compiled code. Compiled code can be around 10 times faster than interpreted code.
If the warmup is not well defined (e.g. it has too few cycles), the measured results may include Just-In-Time Compiler overhead, since the code has not yet reached a steady state.
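The warmup guideline can be sketched roughly as follows. The `compute` method is a hypothetical method under test, and the cycle and iteration counts simply mirror the numbers mentioned above. Whether the code has really reached a steady state depends on the JVM and its flags, so it is worth checking that the per-cycle times printed during warmup stop decreasing.

```java
final class WarmupDemo {

    static final int WARMUP_CYCLES = 10;            // "around 10 cycles"
    static final int ITERATIONS_PER_CYCLE = 15_000; // "15k iterations each"

    /** Hypothetical method under test. */
    static int compute(int x) {
        return Integer.bitCount(x) + (x ^ (x >>> 3));
    }

    /** Runs one short cycle and returns its elapsed time in nanoseconds. */
    static long runCycle(int iterations, int[] sink) {
        long start = System.nanoTime();
        int acc = 0;
        for (int i = 0; i < iterations; i++) {
            acc += compute(i);
        }
        long elapsed = System.nanoTime() - start;
        sink[0] += acc; // keep the result alive so the loop is not removed
        return elapsed;
    }

    public static void main(String[] args) {
        int[] sink = new int[1];

        // Warmup phase: results are printed only to observe the trend.
        for (int c = 0; c < WARMUP_CYCLES; c++) {
            long t = runCycle(ITERATIONS_PER_CYCLE, sink);
            System.out.printf("warmup cycle %2d: %d ns%n", c + 1, t);
        }

        // Measurement phase: only this value is reported as a result.
        long measured = runCycle(ITERATIONS_PER_CYCLE, sink);
        System.out.printf("measured cycle:  %d ns (sink=%d)%n", measured, sink[0]);
    }
}
```

Keeping each cycle short and calling `runCycle` repeatedly, instead of running one very long loop, gives the JIT a chance to compile the method as a whole between calls.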
A few Just-In-Time Compiler optimizations developers should be aware of when starting to measure Java performance:
- Optimizations and de-optimizations of virtual method calls. A virtual method call is optimized and inlined but, after a few invocations, it is de-optimized and re-optimized for another implementation (e.g. a method declared in an interface with multiple implementations). Such behavior normally happens at the beginning of the application's run, when the Compiler makes a lot of assumptions and performs aggressive optimizations and de-optimizations.
- On-Stack-Replacement (OSR). The Java Virtual Machine starts executing the code in Interpreter mode. If an interpreted method contains a long-running loop, at some point the loop becomes hot and the running code is replaced by compiled code before the loop completes. OSR compiled code differs slightly from regular Just-In-Time compiled code (e.g. the code after the loop is not compiled yet, hence the method is not fully compiled). Because OSR is less frequent in real situations, benchmarks should avoid measuring OSR compiled code.
- Loop unrolling and lock coarsening. With loop unrolling, the compiler unrolls loops to reduce the number of branches and their cost. Lock coarsening merges adjacent synchronized blocks so that fewer synchronizations are performed. These optimizations are really powerful. The concern is not that the Just-In-Time Compiler optimizes away the benchmark code, but that it may apply a degree of optimization that would not happen in a real scenario.
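As an illustration of the first two points, the sketch below warms up a virtual call site with the same mix of implementations that the measured phase uses, and keeps the hot loop inside a small method that is invoked many times. `Shape`, `Square` and `Circle` are hypothetical types introduced for this example, and how the JIT actually profiles and inlines the call depends on the JVM version and flags.

```java
final class VirtualCallDemo {

    interface Shape {
        double area();
    }

    static final class Square implements Shape {
        private final double side;
        Square(double side) { this.side = side; }
        @Override public double area() { return side * side; }
    }

    static final class Circle implements Shape {
        private final double radius;
        Circle(double radius) { this.radius = radius; }
        @Override public double area() { return Math.PI * radius * radius; }
    }

    /** Call site under test: s.area() is a virtual call. */
    static double totalArea(Shape[] shapes) {
        double total = 0;
        for (Shape s : shapes) {
            total += s.area();
        }
        return total;
    }

    public static void main(String[] args) {
        // Same mix of receiver types in warmup and measurement. Warming up with
        // Square only and then measuring with Circle could trigger the
        // de-optimization described above in the middle of the measurement.
        Shape[] mixed = { new Square(2), new Circle(1), new Square(3), new Circle(2) };
        double sink = 0;

        for (int i = 0; i < 150_000; i++) {
            sink += totalArea(mixed); // warmup
        }

        long start = System.nanoTime();
        for (int i = 0; i < 150_000; i++) {
            sink += totalArea(mixed); // measured
        }
        long elapsed = System.nanoTime() - start;

        System.out.printf("elapsed: %d ns (sink=%.1f)%n", elapsed, sink);
    }
}
```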
Variety of platforms
Test results gathered from a single platform are not representative enough. Even with a really good benchmark, it is recommended to run it on multiple platforms and to collect and compare the results before drawing any conclusions. Differences between hardware architectures (e.g. Intel, AMD, SPARC) with regard to intrinsics (e.g. compare-and-swap or other hardware concurrency primitives), CPU and memory can make a difference.
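When results from several platforms are compared, it helps to record a short description of the platform next to each result. A minimal sketch using only standard system properties and `Runtime` calls is shown below. The class name and the output format are assumptions made for this example.

```java
final class PlatformInfo {

    /** Builds a one-line description of the platform the benchmark runs on. */
    static String describe() {
        Runtime rt = Runtime.getRuntime();
        return String.format(
                "os=%s %s, arch=%s, cpus=%d, jvm=%s %s, maxHeapMiB=%d",
                System.getProperty("os.name"),
                System.getProperty("os.version"),
                System.getProperty("os.arch"),
                rt.availableProcessors(),
                System.getProperty("java.vm.name"),
                System.getProperty("java.vm.version"),
                rt.maxMemory() / (1024 * 1024));
    }

    public static void main(String[] args) {
        // Print this line at the top of every benchmark report.
        System.out.println(describe());
    }
}
```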
Microbenchmark (component level)
Testing at the component level means focusing on a specific part of the code (e.g. measuring how fast a class method runs) and ignoring everything else. In most cases such performance tests can become useless, because in a microbenchmark the small piece of code (e.g. a class method) is optimized aggressively, in ways that might not occur in a real situation. This principle is sometimes called testing under the microscope.
For example, during a microbenchmark test in the Java HotSpot Virtual Machine there might be optimizations specific to the Just-In-Time C2 Compiler, but in reality these optimizations won’t happen because the application does not reach that phase and the code might run compiled by the Just-In-Time C1 Compiler. When running microbenchmark tests, launch the Java Virtual Machine multiple times and discard the first launches to avoid OS caching effects. Another piece of advice is to run the tests enough times to get statistically relevant results (e.g. 20-30 times).
Microbenchmarking is in general useful for testing standalone components (e.g. a sorting algorithm, or adding/removing elements to/from lists), but not as a way of slicing a big application into small pieces and testing every piece. For big applications I would recommend macrobenchmark testing, since it provides more accurate results.
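One possible way to aggregate the results of repeated launches is sketched below. It assumes that each launch reports a single number (for example a time in milliseconds) and that these numbers are passed as command-line arguments. Discarding the first two launches is an arbitrary choice made for this example, not a fixed rule.

```java
import java.util.Arrays;

final class RunStatistics {

    /** Arithmetic mean of the samples (NaN for an empty array). */
    static double mean(double[] samples) {
        return Arrays.stream(samples).average().orElse(Double.NaN);
    }

    /** Sample standard deviation (n - 1 in the denominator). */
    static double sampleStdDev(double[] samples) {
        if (samples.length < 2) {
            return 0.0;
        }
        double m = mean(samples);
        double sumSq = 0;
        for (double s : samples) {
            sumSq += (s - m) * (s - m);
        }
        return Math.sqrt(sumSq / (samples.length - 1));
    }

    /** Drops the first n samples, e.g. launches affected by OS caching. */
    static double[] discardFirst(double[] samples, int n) {
        return Arrays.copyOfRange(samples, Math.min(n, samples.length), samples.length);
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.out.println("usage: java RunStatistics <result1> <result2> ...");
            return;
        }
        double[] all = Arrays.stream(args).mapToDouble(Double::parseDouble).toArray();
        double[] kept = discardFirst(all, 2); // assumption: ignore the first two launches
        System.out.printf("runs kept: %d, mean: %.2f, std dev: %.2f%n",
                kept.length, mean(kept), sampleStdDev(kept));
    }
}
```

A large standard deviation relative to the mean suggests that more runs are needed, or that something external to the test is disturbing the measurements.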
Macrobenchmark (system level)
In some cases microbenchmarking the application does not help much, because it says nothing about the overall throughput or response time of the application. In such cases we have to focus on macrobenchmarks: write real programs and develop realistic loads and environment configurations in order to measure the performance. The test dataset must be similar to the one used in real cases; otherwise a “fake” dataset will trigger different optimization paths in the code and produce performance measurements that are not realistic. One important caveat is that macrobenchmarks might give an unrealistic treatment of Garbage Collectors. For example, during a macrobenchmark test there might be only Young Generation collections, or extremely few Old Generation collections. In a real application a full Garbage Collector cycle might be triggered every hour or so, and the test misses the latency it adds. Another caveat is that I/O and the database might not be benchmarked well, because in real situations they are shared resources, so the resulting bottlenecks and delays are not captured by the test.
In all cases this testing approach requires a lot of work, but keeping the tests close to real scenarios gives better and more reliable results from an application performance standpoint.
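To check how much Garbage Collector activity a test run actually saw, the collection counters exposed through `java.lang.management` can be read before and after the run. The sketch below assumes a hypothetical `allocate` workload. The collector names that get printed depend on the JVM and on the selected Garbage Collector.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

final class GcActivity {

    /** Returns {total collection count, total collection time in ms}. */
    static long[] snapshot() {
        long count = 0;
        long timeMs = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Both getters return -1 when the value is undefined for a collector.
            count += Math.max(0, gc.getCollectionCount());
            timeMs += Math.max(0, gc.getCollectionTime());
        }
        return new long[] { count, timeMs };
    }

    /** Hypothetical allocation-heavy workload standing in for the real test. */
    static long allocate(int rounds) {
        long checksum = 0;
        for (int i = 0; i < rounds; i++) {
            byte[] chunk = new byte[64 * 1024];
            chunk[i % chunk.length] = 1;
            checksum += chunk[i % chunk.length];
        }
        return checksum;
    }

    public static void main(String[] args) {
        long[] before = snapshot();
        long checksum = allocate(20_000);
        long[] after = snapshot();

        System.out.printf("collections during test: %d, GC time: %d ms (checksum=%d)%n",
                after[0] - before[0], after[1] - before[1], checksum);

        // Per-collector totals show whether only young collections happened.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("  %s: %d collections%n", gc.getName(), gc.getCollectionCount());
        }
    }
}
```

If the old generation collector reports zero collections for the whole run, the measured latency does not include any full collection cost, which is the gap described above.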