Java Support for Large Memory Pages

Siddhartha Singh
Published in Backstage
4 min read · Apr 4, 2017


The performance boost came as a surprise to me until I understood how it works technically.

I learned about this at QCONF from Sergey Kuksenko, who was presenting the talk “Speedup Your Java Apps With Hardware Counters”.

He pointed out that one might get a performance boost of around 20% by enabling large memory pages in the JVM and in the Linux kernel, but I got a boost of around 60%. This happened because my process was far more CPU-intensive than I/O-intensive.

The goal of large page support is to optimise the processor's Translation-Lookaside Buffer (TLB). The TLB is a page translation cache that holds the most recently used virtual-to-physical address translations, and it is a scarce system resource. A TLB miss can be costly, as the processor must then read from the hierarchical page table, which may require multiple memory accesses. With a bigger page size, a single TLB entry can represent a larger memory range. This puts less pressure on the TLB, so memory-intensive applications may perform better.
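To make the "larger memory range" point concrete, here is a small arithmetic sketch. The 64-entry TLB and the 4 kB default page size are assumptions chosen for illustration (real values depend on the CPU), and the class and method names are made up for this example.

```java
class TlbReach {
    // Memory covered by the TLB = number of entries * page size.
    static long reachBytes(int entries, long pageSizeBytes) {
        return entries * pageSizeBytes;
    }

    public static void main(String[] args) {
        int entries = 64;                 // assumed TLB size, illustration only
        long smallPage = 4L * 1024;       // assumed 4 kB default page
        long largePage = 2048L * 1024;    // 2048 kB large page, as reported by Hugepagesize

        // With small pages the TLB covers 256 kB of memory.
        System.out.println("4 kB pages:    " + reachBytes(entries, smallPage) / 1024 + " kB");
        // With large pages the same number of entries covers 128 MB.
        System.out.println("2048 kB pages: " + reachBytes(entries, largePage) / (1024 * 1024) + " MB");
    }
}
```

Under these assumed numbers the same TLB covers 512 times more memory with large pages, which is why fewer translations miss.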

Changes required to run Java code that utilises large pages:

a) Enable the JVM parameter -XX:+UseLargePages

b) Operating system configuration changes to enable large pages:

· In Linux:
Large page support is included in the 2.6 kernel. To check whether your system supports large page memory, try the following:

# cat /proc/meminfo | grep Huge
HugePages_Total: 0
HugePages_Free: 0
Hugepagesize: 2048 kB

If the output shows the three “Huge” variables, then your system can support large page memory, but it needs to be configured. If the command doesn’t print anything, then large page support is not available. To configure the system to use large page memory, log in as root, then:

1. Increase the SHMMAX value. It must be larger than the Java heap size. On a system with 4 GB of physical RAM (or less) the following will make all the memory shareable:

# echo 4294967295 > /proc/sys/kernel/shmmax

2. Specify the number of large pages. In the following example 3 GB of a 4 GB system are reserved for large pages (assuming a large page size of 2048 kB, then 3 GB = 3 x 1024 MB = 3072 MB = 3072 x 1024 kB = 3145728 kB, and 3145728 kB / 2048 kB = 1536):

# echo 1536 > /proc/sys/vm/nr_hugepages

Note that the /proc values reset after a reboot, so you may want to set them persistently (e.g. in rc.local or sysctl.conf).
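The page-count arithmetic from step 2 can be sketched in Java as below. This is only an illustration: the class and method names are hypothetical, and the sample text stands in for the real contents of /proc/meminfo.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HugePageCalc {
    // Extracts the Hugepagesize value (in kB) from /proc/meminfo-style text, or -1 if absent.
    static long hugePageSizeKb(String meminfo) {
        Matcher m = Pattern.compile("(?m)^Hugepagesize:\\s*(\\d+)\\s*kB").matcher(meminfo);
        return m.find() ? Long.parseLong(m.group(1)) : -1;
    }

    // Number of large pages needed to cover reserveKb, rounded up.
    static long pagesNeeded(long reserveKb, long pageSizeKb) {
        return (reserveKb + pageSizeKb - 1) / pageSizeKb;
    }

    public static void main(String[] args) {
        // Sample text mirroring the output shown above; not read from the live system.
        String sample = "HugePages_Total: 0\nHugePages_Free: 0\nHugepagesize: 2048 kB\n";
        long pageKb = hugePageSizeKb(sample);
        long reserveKb = 3L * 1024 * 1024; // 3 GB expressed in kB
        System.out.println(pagesNeeded(reserveKb, pageKb)); // prints 1536
    }
}
```

The printed value is the number to write to /proc/sys/vm/nr_hugepages for a 3 GB reservation with 2048 kB pages.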

Restart your process and test the performance. Hopefully it gives you the boost you are looking for.

Now let's go through the reason behind it.

Let's discuss matrix multiplication. Here is a function:
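After the restart, one quick sanity check is to confirm from inside the process that the flag was actually passed. The sketch below uses the standard RuntimeMXBean; the class and method names are illustrative. It only shows that the flag was requested on the command line, not that the operating system actually granted large pages.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

class LargePagesCheck {
    // True if the given JVM arguments explicitly enable large pages.
    static boolean requestsLargePages(List<String> jvmArgs) {
        return jvmArgs.contains("-XX:+UseLargePages");
    }

    public static void main(String[] args) {
        // Arguments passed to this JVM, e.g. -Xmx2g -XX:+UseLargePages
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("UseLargePages requested: " + requestsLargePages(jvmArgs));
    }
}
```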

Matrix Multiplication
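A minimal sketch of the naive triple-loop multiplication in i-j-k order, assuming square double[][] matrices. The class and method names are illustrative and this is not necessarily the exact code from the talk.

```java
class MatMulIjk {
    // Naive multiplication: c[i][j] = sum over k of a[i][k] * b[k][j].
    static double[][] multiply(double[][] a, double[][] b) {
        int n = a.length;
        double[][] c = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int k = 0; k < n; k++) {
                    // The innermost loop reads b down a column (b[0][j], b[1][j], ...).
                    sum += a[i][k] * b[k][j];
                }
                c[i][j] = sum;
            }
        }
        return c;
    }

    public static void main(String[] args) {
        double[][] a = {{1, 2}, {3, 4}};
        double[][] b = {{5, 6}, {7, 8}};
        // prints [[19.0, 22.0], [43.0, 50.0]]
        System.out.println(java.util.Arrays.deepToString(multiply(a, b)));
    }
}
```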

Let's look at the time taken for matrices of size N. The Strassen method is an optimised matrix multiplication algorithm that does better than O(n³), at roughly O(n^2.81).

Let's now check the CPU cache misses with JMH, using -prof perfnorm, for N=256.

And here is where the L1 load misses happen in the code:

Let's swap the loop order from i-j-k to i-k-j, so that the innermost loop walks along rows of the matrices instead of jumping down a column.

Now the code looks like:
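A minimal sketch of the i-k-j variant, again assuming square double[][] matrices and using illustrative names rather than the talk's exact code. The result is the same as with the i-j-k order; only the memory access pattern changes.

```java
class MatMulIkj {
    // Same product as the naive version, with the j and k loops swapped.
    static double[][] multiply(double[][] a, double[][] b) {
        int n = a.length;
        double[][] c = new double[n][n]; // Java zero-initialises the array
        for (int i = 0; i < n; i++) {
            for (int k = 0; k < n; k++) {
                double aik = a[i][k]; // constant for the whole inner loop
                for (int j = 0; j < n; j++) {
                    // Both b[k][j] and c[i][j] are now read sequentially along a row.
                    c[i][j] += aik * b[k][j];
                }
            }
        }
        return c;
    }

    public static void main(String[] args) {
        double[][] a = {{1, 2}, {3, 4}};
        double[][] b = {{5, 6}, {7, 8}};
        // prints [[19.0, 22.0], [43.0, 50.0]]
        System.out.println(java.util.Arrays.deepToString(multiply(a, b)));
    }
}
```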

Now see the time difference between the i-j-k and i-k-j loops:

That's all for now. Performance problems can be investigated at several levels:

  • System Level — Network, Disk, OS, CPU/Memory
  • JVM Level — GC/Heap, JIT, Classloading
  • Application Level — Algorithms, Synchronization, Threading, API
  • Microarchitecture Level — Caches, Data/Code alignment, CPU Pipeline Stalls

Hope you enjoyed the topic…

@siddhusingh@gmail.com
