Bartek Andrzejczak

Musings on software development

The Cost of Garbage Collection

Have you ever wondered what is the cost that you’re paying when using a garbage collector for cleaning up your mess? How less performant will your app be, if you choose to let virtual machine manage the memory? This month I’ve attended a presentation at Warsaw JUG by Nikita Salnikov-Tarnovski titled Heap, off you go which touched the issue of Java code optimization in the context of excessive garbage collection. The talk showed how garbage collector can “collect” almost 100% of your application execution time. That’s a big deal. Although the example used during the talk was quite specific - allocating 50M objects and then iterating over them - it still makes you think. What inspired me the most, was one of the questions from the audience, that the speaker didn’t know the answer to:

How does it compare to C++?

This blog post is the first of two-part summary of what I’ve learned during my journey through the benchmark land. Here I’ll touch the differences between C++ and Java in case of heap allocation using new operator. The next part, that I’ll publish somewhere next week, will touch the issue of optimized code performance.

The backstory

Nikita in his code used a few different code structure to show their effect on garbage collection:

  • LinkedList
  • ArrayList
  • ByteBuffer
  • Direct ByteBuffer
  • Mapped ByteBuffer (In-file storing)
  • Unsafe

Benchmark with the worst score run on LinkedList which apparently isn’t good for any use case. In this example garbage collector ate almost all execution time. Benchmark with the best score run on Unsafe class, which disabled type checking and boundaries for reading from the buffer. Using native memory access made garbage collection almost entirely obsolete. The effect was astounding. Almost 40 times shorter execution time than in case of LinkedList.

I’ve decided to create a C++ port of Nikita’s benchmark for two structures:

  • std::list which allocates all the memory by itself
  • byte[] which makes you arrange all bytes by yourself

In this post I’ll compare the code for ArrayList and std::list. In the next one we’ll tackle the problem of optimization with Unsafe and byte[].

The code

Java - ArrayList

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
public class ArrayListStockExchange implements StockExchange {

    private List<Trade> trades = new ArrayList<>();

    @Override
    public void order(int ticket, int amount, int price, boolean buy) {
        trades.add(new Trade(ticket, amount, price, buy));
    }

    @Override
    public double dayBalance() {
        double balance = 0;
        for (Trade trade : trades) {
            balance += trade.amount * trade.price * (trade.buy ? 1 : -1);
        }
        return balance;
    }
}

C++ - std::list

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
class ListStockExchange : public StockExchange{

private:
  std::list<Trade *> * trades;

public:
  ListStockExchange() {
      trades = new std::list<Trade *>();
  }

  ~ListStockExchange(){
      auto i = trades->begin();
      while (i != trades->end())
      {
          delete (*i);
          trades->erase(i++);
      }
      delete trades;
  }

  void order(int ticket, int amount, int price, bool buy) {
      trades->push_back(new Trade(ticket,amount,price,buy));
  }

  double dayBalance() {
      double balance = 0;
      for(auto tradesIterator = trades->begin(); tradesIterator != trades->end(); tradesIterator++)
          balance += (*tradesIterator)->amount * (*tradesIterator)->price * ((*tradesIterator)->buy ? 1 : -1);
      return balance;
  }
};

Results

I’ve run the benchmarks and here is what I’ve got:

Metrics Java C++
Iterations 10 10
Min. iteration time [ms] 4030.221 7890.256
Max. iteration time [ms] 9432.062 7941.776
Avg. iteration time [ms] 6251.026 7910.600

To me, it came as a huge surprise. How could Java be faster than C++? I’ve run the code a few more times but the results were more or less the same. I’ve even disabled an on-demand CPU governer in Ubuntu, as recommended by my benchmark library! There must have been some mistake. I’ve even posted a question to StackOverflow! And that was it. It was downvoted, commented and eventually closed. C++ ain’t Java. You can’t just run it and expect JIT compiler to do all the hard work. There is no JIT compiler!

Optimize the compilation

First of all I didn’t use optimization flags during compilation. Optimization flags? WTF? Why isn’t the code optimized every time? Maybe it’s hard to debug optimized code? Maybe there are a few different optimization flags, that work well in different cases? Nevermind. I’ve added the flags (-O3 -flto) and lo and behold, something moved.

Metrics Java C++ C++ optimized
Iterations 10 10 10
Min. iteration time [ms] 4030.221 7890.256 5136.895
Max. iteration time [ms] 9432.062 7941.776 5166.630
Avg. iteration time [ms] 6251.026 7910.600 5155.023

I’ve got a score almost 3 seconds lower. That’s 35%! And it’s just by adding right flags to the compilator invocation. That was new to me as a Java guy. On average it was already over a second faster than Java code.

In this case it’s worth to remember, that both Java and C++ application allocate their memory on the heap using new operator. Some people commenting on Stack Overflow question suggested, that it sucked exactly because on the heap allocation. Pointers, huh? So I’ve rewritten the code and got rid of pointers. To avoid wrong conclusions I’ve erased optimization flags from compilation for now.

Heap is your enemy

Metrics Java C++ C++ optimized C++ off-heap
Iterations 10 10 10 10
Min. iteration time [ms] 4030.221 7890.256 5136.895 6250.862
Max. iteration time [ms] 9432.062 7941.776 5166.630 6286.141
Avg. iteration time [ms] 6251.026 7910.600 5155.023 6267.842

That’s not bad.. It’s very similar to average iteration time for Java, but as you see above, it’s worse than optimized on-heap case. Let’s see how the combination of these cases works.

Metrics Java C++ C++ optimized C++ off-heap C++ combined
Iterations 10 10 10 10 10
Min. iteration time [ms] 4030.221 7890.256 5136.895 6250.862 2656.214
Max. iteration time [ms] 9432.062 7941.776 5166.630 6286.141 2665.393
Avg. iteration time [ms] 6251.026 7910.600 5155.023 6267.842 2662.243

This score is over two times better than off-heap score and almost three times better than unoptimized C++ score. It’s awesome, but don’t forget, that two columns from the right show scores for memory allocation happening off the heap. Considering only on-heap-allocating code, C++ is still better, but the difference isn’t that big. Yeah, on average it’s 18% faster, but in extreme situation (as shown by min. iteration time) the Java code can be over 20% faster. And yes, it can be also almost 50% slower. When I run the code more and more times I’ve found out that average iteration time is more or less constant between the runs. That’s why it’s pretty much the only value I’ve looked at when comparing the results to other benchmarks scores.

Garbage collector predictability

In theory Java memory allocation is very easy and thus very effective. New object is always created on the top of the heap. Heap is always a continuous block of allocated memory. When garbage collector is activated to reclaim some of the unused memory, it goes through the heap looking for dead objects that can be reclaimed, deletes them and moves the heap to keep the continuity. In C++ objects are always allocated in the first free space with enough size to fit them as a whole. Finding such space takes time, which affects the performance. When memory is freed, it’s just deleted leaving a gap. You could say, that with limitless memory, allocation process in Java and C++ would be similar, because new objects would always go on the top of the heap.

Garbage collectors are complicated and sophisticated. They have to analyze objects, their states and dependencies. That’s why their invocations can take so much time. In this ArrayList example, program code was running for only 9.94% of the whole time. It means, that 90% of the time was consumed by garbage collector.

The image above shows the screen from GCViewer - a tool for analyzing garbage collector logs. On the graph it shows the level of memory allocation (red space) and every GC invocation (black rectangles - Full GC runs, grey rectangles - Partial GC runs). The most important value is shown in the Summary section on the right - it’s the throughput. This value tells you, what’s the ratio of your code execution time to application application/benchmark execution time. Throughput of 9.94% means, that of every 10 seconds your application was running for 994ms which leaves astounding 9.006 seconds to the garbage collector.

As you see on the graph, rectangles showing garbage collection are far from regular. It means that the code with high memory consumption and thus the need for excessive garbage collection has much less predictable short-term effectiveness than e.g. C++ code. When using C++, your CPU is running immutable program created from the compilation of your code. The results in perfect conditions should be very similar. In all of my examples the biggest difference between minimal and maximal execution times was around 51ms for the scores around 8000ms. That’s 0.06%! It’s nothing! In Java the garbage collector takes the wheel whenever it’s needed, so the exact time of it’s invocation is unpredictable. It could run for 3 seconds and it could run for 6 seconds. It depends on the situation. That’s why you get the standard deviation around 2 seconds for the average of 6 seconds, which is huge for such a benchmark. Fortunately the average scores are quite trustworthy, even for as little as 10 iterations.

Warm up!

One last thing I wanted to cover in this post are warm up iterations in Java benchmarks. As I said before, C++ code is compiled once and is run always in the same manner. In Java the code is compiled into *.class files and then while running inside the virtual machine, it’s recompiled and changed by the JIT compiler, which means that the first iterations will be worse than next ones. How big is that effect? Let’s look at the iteration times:

Iteration no. Time[ms]
1 33656.406
2 16100.784
3 10734.691
4 13656.410
5 7297.651
6 7732.481
7 5081.496
8 3020.896
9 3126.210
10 5061.670
11 5053.820
12 5032.839
13 7640.066
14 7750.490
15 7518.114

As you see first good results are appearing after 4 initial iterations. As observed during this benchmark, 5 warmup iterations give nice results in this particular example. Between this five iterations you can see that a lot of compilation happens. It’s not like it stops after the fifth one, but there is much less of it going on in the following iterations.

Conclusion

Wrapping up the first part of my journey let me say that it was both hard and fun to write C++ code. It was hard, because I had to be in control all the time. I had to use valgrind to track down memory leaks, which happened a few times during this short period of coding. On the other hand it was fun because I’ve leared a lot about C++ and Java memory management.

The most important thing to write here is the core of this topic. What’s the cost of garbage collection? How much are we loosing when choosing Java over C++. Yeah, we’re loosing some performance and short-term predictability, but we’re getting a program, that will care about our memory instead. Is it a fair trade? Again, it depends.

These times I often hear about a shift in hacking/designing duality. Years ago it was much cheaper to hire a hacker, who would create the most effective solution, that uses the least amount of resources. Resources were expensive. Adding additional 1GB of RAM wasn’t a matter of one click. The code created by hackers was often unreadable, but it worked. You could always hire another hacker to change it later. It was still cheaper than buying hardware. Nowadays resources are cheap as chips. You need more RAM? More CPU power? Just ask the right guy and he’ll add it to your virtual machine in no time. On the other hand developers time is getting more and more expensive. It’s now more efficient to hire good programmers to create programs, that will be easy to understand by other programmers. Resources usually don’t matter this much.

Of course there are some exceptions. Resources sometimes do matter. There are microcontrollers, that have very limited resources. There are real-time systems. There are high throughput systems. Is it better not to use garbage collection in all of these cases? No. It’s better to use the right tool for the specific job.

Repositories

Comments