CppCon 2017: John Lakos “Local (‘Arena’) Memory Allocators (part 2 of 2)”

– All right, and with that we’re going to get started, as promised. So, roadmapping the benchmarks. We wanted to explore each dimension to observe its effects on optimal memory allocation. Our first thought was to create a single benchmark that spans all five dimensions, find the centroid, vary the arguments along each dimension separately, and have just that one single benchmark; creating such a thing not only wasn’t easy, we couldn’t figure out how to do it. So we finally settled on four separate benchmarks. Benchmark one addresses the first two dimensions. So benchmark one. Considerations. We tried not to assume the answers we expected. This is really important. If you know the answer, what are you doing? So we tried to explore things in a very general way, gather all the data, and see what fell out.

We used successive powers of two, which is a really good thing to do. We often just show the exponent, so for example, we say five instead of two to the fifth. Five means 32. It’s just convenient. We contrast results at different sizes by holding the product of two similar parameters, like two temporal parameters or two physical parameters, constant, which means the sum of the exponents is constant. We trade them off. For example, subsystem size versus number of subsystems. If I tell you that the subsystem size times the number of subsystems is 25, in exponent terms, that means that the size of my program is two to the 25th, and then I’ll vary them so that the size of my program stays the same and we see the effect of varying the subsystem size and the number of subsystems together.

So which is better, to have two subsystems that are large, or 32 subsystems that are smaller? Well what’s going on there? Those are the kind of questions we wanna ask, because it’s not absolute time, it’s relative times that are really important for understanding what’s going on. I’m just gonna give you, we chose this architecture, so I’m putting it up here for reference. It’s not the only one we tried. We picked this one because we can’t go through all the data, trust me, but this was representative, we didn’t cherry pick it, it just is. But it’s not special, it just is. – [Man] (unclear speech) – We have a list. If you want to go to our GitHub, Bloomberg GitHub, you can find out all the different– The question was, “Did you do it on–“, what was the name of it? – Embedded. – No, not embedded. We did it on GCC, and clang, and Microsoft Windows, and you can find all that data, it exists, it’s published. Here’s how we compiled all the programs. All experiments used only one core at a time, except for the last one, and that’s for contention.

Now we need to talk about tcmalloc and jemalloc. Some people think that these are better than what is supplied. Now, this was true at one point. Doesn’t seem to be true anymore, at least for the benchmarks that we ran. It is possible that these are still better than the off-the-shelf clang or GCC allocator, but that doesn’t matter, because for these benchmarks these didn’t do as well as the built-ins. And particularly, these are good for multi-threaded applications and that is not what we did mostly, that was only one of our benchmarks. So remember, we’re gonna use the platform’s global allocator. We tried these, they’re not as good, so we didn’t worry about them. Just saying, we didn’t forget them. Here are our four benchmarks, our roadmap. We’re going to look at these four things. Then we’re gonna be done, then we’re gonna hopefully have some time for questions, so I am gonna move along, and I have 58 minutes or something like that left. Short running, build up, use, tear down.

Whoa, that was quick. Go back. This is my symbol. So we’re firing a missile. A few seconds, it’s over. That’s my icon. Considerations. Initially we wanted to investigate allocation density, focusing on the allocation/deallocation costs themselves. We chose a variety of common data structures: int, string, vector, unordered_set. We didn’t want access locality to dominate the results, which means we wrote just the first byte. Remember, locality is an extremely powerful benefit of allocators and we could lose all the individual benefits by focusing on locality. So except for the locality experiment, which is the most involved, everything else tries to minimize it. Later we incorporated variation into the allocated memory sizes, so we have object capacities: vector capacities are reserved up front so a vector has only one allocation at all times, whereas strings are deliberately not one size; they range in size from 33 to 1,000 with a uniform distribution, and so that’s what we’re seeing there. We chose 33 because 32 is our short string optimization limit. If it’s 32 or under, it would be just one size. Here are some simple data structures. Vector of int, vector of string, unordered_set of int and unordered_set of string.
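For readers following along, here is a rough sketch of those four container types written against the C++17 std::pmr vocabulary rather than the Bloomberg bsl/bdlma types the talk actually measures; the 33-to-1,000 string lengths and the 32-byte SSO limit quoted above are the talk's numbers, and std::string's SSO size is implementation-defined.

```cpp
#include <memory_resource>
#include <string>
#include <unordered_set>
#include <vector>

// Allocator-aware versions of the four benchmark data structures.  The
// std::pmr containers take a memory_resource* at construction; the bsl
// containers used in the talk follow the same allocator-passing pattern.
int main()
{
    std::pmr::monotonic_buffer_resource arena;   // one of the local strategies

    std::pmr::vector<int>                     vi(&arena);
    std::pmr::vector<std::pmr::string>        vs(&arena);
    std::pmr::unordered_set<int>              si(&arena);
    std::pmr::unordered_set<std::pmr::string> ss(&arena);

    vi.push_back(42);
    vs.emplace_back("a string long enough to defeat the small-string optimization");
    si.insert(42);
    ss.emplace("another string long enough to force a heap allocation");
}
```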

Do these seem like useful things that you might use in a program? Would we want to actually try to see what allocators do with these guys, just for fun? If anybody didn’t know, this is what a vector of int looks like, schematically. Don’t laugh. Whoops, come on, go back. Why do you do this to me? (groans) One, two. This is a vector of strings, because we have the string objects and then we have the allocated string. The next one is an unordered_set of int. And finally we have an unordered_set of string. I remember where I was when I drew these.

These are not easy to draw and have them look like a kindergartner didn’t draw them. Just saying. So the plan, for each data structure in a thoughtfully chosen set: we’re gonna create a data structure, access it lightly, destroy it, and repeat until the problem size N is reached. Does that make sense? We’re just gonna do it over and over. Build it up, use it, tear it down, see what happens. And the use is extremely light, so we’re focusing on building it up and tearing it down, otherwise we’d be testing the access part. Does that make sense? This is where we’re worried about how much effort it takes to allocate it as opposed to once we’ve allocated it. Oh, darn it. So we chose an overall problem size of N equals two to the 27th, and the container size S is gonna range from two to the eighth to two to the sixteenth. The number of experiment repetitions is N over S.

– [Man 2] (unclear speech) That’s a good question. It’s elements. And remember, these are comparing the same data structure, so what we’re trading off is the container size and the number of experiment repetitions. So we’re not comparing vector of int with vector of string; that’s not important. What’s important is vector of int with vector of int in a different ratio, in a different size. We’re looking at how performance is affected by the size of the container, with and without allocators, or with the different allocation strategies, if you will.
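As a concrete illustration, one cell of this benchmark might look roughly like the sketch below, shown here for vector of int with a monotonic strategy and std::pmr names; the real harness on Bloomberg's GitHub covers all 14 strategies and all four containers, so treat this as a simplified stand-in.

```cpp
#include <cstddef>
#include <memory_resource>
#include <vector>

// Build a container of size S, touch it lightly, tear it down, and
// repeat N/S times so the total work stays constant at N elements.
void benchmarkOneCell(std::size_t S, std::size_t N)
{
    const std::size_t repetitions = N / S;
    for (std::size_t r = 0; r < repetitions; ++r) {
        std::pmr::monotonic_buffer_resource arena;   // one of the 14 strategies
        std::pmr::vector<int> v(&arena);
        v.reserve(S);                                // a single allocation
        for (std::size_t i = 0; i < S; ++i) {
            v.push_back(static_cast<int>(i));
        }
        v.front() = 1;   // "light" access: touch the data without letting
                         // locality dominate the measurement
        // Destructors run here; with "winking out" the container's
        // destructor would be skipped and the arena dropped wholesale.
    }
}
```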

When I say N equals 27 I mean log base two of N equals 27; these are all exponents. So each result is an absolute runtime in seconds. So here’s vector of int. You’re probably gonna say, “I don’t really know what all these black and white numbers are. What does that mean?” We’ve got these 14 strategies, and we’ve got these different sizes, and look at all this wonderful data. This particular example is not very interesting because there’s only one allocation going on, which is vector of int, so this whole slide isn’t really all that interesting.

But it is interesting because we need to do a little better. Let’s see if we can make it a little more readable. Does that help? How about we put some guidelines in? Does anybody remember what happened when Dorothy’s house landed in Oz? The idea is we really would like to see some color in this because this is a heatmap, and now things become a little bit more readable. And if you look at the global, that’s the global allocator, and then we have four different kinds of monotonic allocator. We have them with and without a virtual function interface, and then we have winking and not winking in each case. So if you just look at this, this is all 14 strategies, all color coded, and you don’t have to look too hard to see that the global allocators are darker red.

Red means more expensive. Green means fast. Now this isn’t that interesting a data structure, I said, because there’s only one allocation so we can’t learn too much from this, but we can learn a little more from a vector of string. And a vector of string makes things a little clearer. Do you notice that the global allocator is red, again, both of them? The monotonic is looking pretty good. The multipool in front of the monotonic is not quite as good. The multipool, it’s better than the global allocator by a factor of two, but it’s not as good. So just looking at this, which allocator do you think you’d use if you had a vector of strings and you wanted to build it up really quickly, use it, and get rid of it? Monotonic.

It’s not a hard problem. Let’s just keep going. Here is an unordered_set of int. Oh look, which would you use? Monotonic. How about an unordered_set of string? Guesses? Well what do you know. Okay. Whoops, lost it. Do we have any questions on this? No, yes? What do you think? Yes. – [Man 2] (unclear speech) – Yes, the wink was slower than the not winking. There’s a little bit of error in what’s going on, and also the wink doesn’t do anything so it’s a useless call. So between the two it could be a little bit more expensive because the wink doesn’t really wink anything. – [Man 2] (unclear speech) – No no, what I’m saying is, there’s nothing to be done, so you’re doing something that doesn’t do anything, which is more expensive than not doing anything. Okay, yes. – [Man 3] (unclear speech) – We didn’t need to use ordered sets because the unordered node-based containers get the job done as an example of a node-based container. We could have done ordered sets, which we did, and they’re no different. There’s no additional value. However, we’re not done. The next thing we did is we said well let’s make it a little bit more interesting.

Let’s make a vector of vector of int, vector of vector of string, vector of unordered_set of int, vector of unordered_set of string, and an unordered_set of each of the four. So now we have some more data structures. Composite data structures, they’re much larger, so we picked an intermediate size and stuck with it for pretty arbitrary reasons. This is what we did. Kept the overall size at two to the 27th, with an internal size of two to the seventh. We still varied the outer size from two to the eighth to two to the sixteenth, and this is the stuff. But it really doesn’t matter. The outer container size is this, the inner one is constant, and then those vary and the product is still two to the 27th. You get the idea, right? We’re varying the outer, keeping the middle one fixed, and then… Yes? Okay. Remember this. Here is dataset five. This is vector of vector of int. Now look at it. Which one would you use, and by how much? Monotonic, but do you see the difference in performance? Roughly a factor of five. So let’s look at dataset six. Hold on. There’s a little bit– Go ahead. – [Man 4] (unclear speech) – So your point is we shouldn’t use global, but otherwise we don’t care? – [Man 4] (unclear speech) – So remember vector of vector of int is not as exciting– vector of int is unexciting, vector of vector of int is somewhat exciting, it’s like vector of string.

All right, well let’s keep looking because I think we’re gonna see a trend. Now notice the little virtual function overhead. You can see that using the virtual function interface here causes some overhead, which is to be expected, because while the outer part of the container can inline, the inner one can’t, so there is some overhead. But again, notice you start out with the other one and then when you add in the wink it’s actually faster; the virtual one with the wink is faster than if you didn’t do it at all and baked it into the type. My point is that that’s all noise, but when you look at the global one, omg, who cares. Seven, do you see a pattern? I just want you to look. Notice there’s more virtual function overhead here. Now even with the winking, with the virtual it’s actually, you know, one tick slower. But compare it to the cost of the global. Who cares? This is noise. Yes? – [Man 5] (unclear speech) – It’s the general purpose allocator that comes with the system. – [Man 5] (unclear speech) – It’s part of it.

It’s not all of it. When you see the next experiment you’ll see that it’s none of it. Winking matters. I just want you to notice the left column. And we can go look at this in detail, but the point is, without question, any local allocator is better than that beast, and if you just go with the monotonic– and by the way, just to tell you, the person who actually did this experiment didn’t put the monotonic on the stack but instead put it in some corner of the system.

The monotonic would actually be much faster than is shown here, just full disclosure. This is deliberately not as good as it really would be. Not deliberately, but it clearly is no worse than this and if it were on the stack it would be potentially even better. So, observation. Global allocators are always inferior. Question. What local allocator strategy is best? And the plan is let’s look again but gray out the global allocator so we can get some more texture. I’m just gonna go through this, because you’ll see this is not a good choice. Generally speaking– Whoops, let me back up, sorry.

This is D2, D3, D4. And then we go on to the other ones. D5, D6, D7. D8, D9, D10, D11, and D12. Whoops, sorry. The point is, you can go back and look at it on video, I just wanted to get it all there for people to see, but the bottom line is, use the monotonic allocator and call it a day for these kinds of things, period. We have these dimensional characterizations, and the question mark means that’s what we’re measuring. We’re also measuring the variation. The access locality is low by design. The memory utilization is one. We build it up, we use it, we tear it down. We’re using all the memory at one point. There’s no contention. Takeaway messages. Global allocators are consistently inferior when dealing with systems having a short, focused duration, irrespective of contention. For systems having high utilization, such as those modeled in this benchmark, a monotonic allocator is always superior, especially when applied to a buffer residing on the program stack, similar in effect to alloca in C.
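Here is a hedged sketch of that takeaway in std::pmr terms (the talk's own code uses the bdlma equivalents): the arena's initial buffer lives on the program stack, so allocation is a pointer bump and releasing everything is free when the function returns, which is the alloca-like effect being described.

```cpp
#include <cstddef>
#include <memory_resource>
#include <string>
#include <vector>

void processRecords()
{
    // Local arena backed by an automatic (stack) buffer.  The global
    // allocator is touched only if this buffer overflows.
    std::byte buffer[16 * 1024];
    std::pmr::monotonic_buffer_resource arena(buffer, sizeof buffer);

    std::pmr::vector<std::pmr::string> records(&arena);
    records.reserve(64);
    records.emplace_back("short-lived data: built up, used, torn down");

    // ... use 'records' ...

    // No per-object frees go back to the global allocator; the arena
    // and its stack buffer simply vanish at the end of the scope.
}
```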

If you learn nothing else from this talk, not bad, huh? Questions on this? No heckling yet. This experiment is mine, I invented this one, all by my lonesome. Long-running, time-multiplexed subsystems, and we’re gonna look at access locality. So the considerations are: investigate access locality, observe the effects both physically and temporally, simulate concurrent subsystems, as in time-multiplexed subsystems like AZO. Vary both size and time slices. We want the locality to dominate the results this time. So accessing the data should be in hours, and the build up and tear down should be negligible, e.g. in seconds. So we’re not worried about that. We’re worried only about the time to access the data. This is a completely different thing, this is not what people think about, this is the most important use of allocators that could possibly be.

So, we have a system, and subsystems. Now, what’s a subsystem? We’re gonna call it a linked list, because this is a model. We’re modeling things that really happen. So it looks like this. It’s a vector of list of int. And the creation plan is we’re gonna build up this vector with some number of subsystems, so we’re gonna build it up from scratch, then we’re gonna tear it down, and the result is an initialized data structure. This is the build up and tear down thing. So what happens? Build up the system. We create the vector, and here’s our memory, and then that’s the vector; we just allocated the vector that’s gonna hold the stuff.

And then we’re gonna ignore the cache for now and we’re going to build up the system. So here we just allocated a link. And what happens in memory? Well there’s our link. And then we do another one, and then here’s memory, and we do that. And then we allocate a bunch more, and then in memory, what happens? And then we do one more and you’ll see a pattern. Surprise. Then we go to the next one. And now what happens? Okay. And now what happens? Okay. So you see what’s happening? Now we’re gonna build this whole thing up, and see how it works, see a pattern here? Okay. So that’s what happens. That’s what’s happening in memory. Now the next thing is, we’re gonna access, we’re gonna visit each subsystem in turn.

For each subsystem we’re going to write to each of the int values in order, which means the linked list is a linked list of int. We’re gonna repeat the sequence of writes for a total of I iterations. We’re gonna advance to the next subsystem and repeat the entire thing for a total of R repetitions. The experiment is wall-clock time. No memory is allocated or deallocated at all. Yes sir. – [Man 5] (unclear speech) – Louder. – [Man 5] (unclear speech) – Okay, can I suggest something? First of all, the point is we’re doing this in order.

It’s necessary for the experiment, because in order to see how bad it is when it’s out of order, we have to compare it with in order. I’ll get there. We’re gonna write each of the int values in order. What does that look like? Here’s a subsystem, and watch, see, I’m writing it. Now we don’t know how long, this could go forever because it could be a very large subsystem. Very large. Okay, it’s done. And then we just did one iteration, but we’re gonna repeat that for I iterations. So in this case I is two. Did that just skip over my iterations? That’s terrible. I is two. Go ahead, do– What? What? What? Let’s see this here. I is two. Oh, there we go, it’s on autopilot, that’s what’s happening. What’s gonna happen is this is gonna do– I see. So it’s doing two iterations. I is two so it’s gonna go through this twice. There we go. Then I’m gonna go to the next subsystem.

I forgot, I kind of automated this, I streamlined this so I could get a chance to get a drink. And then it goes to the next one. You see this pattern happening, right? And then… I’m trying to speed it up in time because you guys can get very bored seeing this keep going, right? And then finally we get to here. Now there could be a very large number of subsystems. Right now we’re imagining that each one is just a few links, but there could be very many subsystems. And then eventually, hopefully, it finishes and then we quickly write this guy. So that’s what a repetition is. It’s going through two times, two times, two times. That’s one repetition. Now we’re gonna repeat. So what does that mean? That means we’re gonna make sure that the iterations times the repetitions is a constant, so it’s gonna go through and do this. And once we do that, that’s one repetition. Then we go through. And you’re probably all worried because it says two to the seventh times a large number.

(laughter) Let me know when you get the idea, maybe I can stop it before then. The point is this goes for a long time. It takes a long time to get the data because this is gonna run for a very long time. But I am gonna spare you because you get the idea what a long time is and we have only so much time here. And I know it’s fun to watch this slide. Wall-clock time. Okay, so the pseudo code is we’re gonna create this vector, here it is.

And we’re gonna build up a list. So here it is. And then we’re going to tweak the little guy, so here’s the little guy that we tweaked, and then we’re building it up, so this is for each of those guys. So what does that mean? We’re gonna do it for each one in the series, and then we’re gonna do it this many iterations, which is two, and then we’re gonna repeat it K times– or excuse me, for the K systems we’re going to repeat that R times. There, I got that right. So that’s this. You’ll notice that the product of the blues is going to stay constant and the product of the reds is going to stay constant. M is the memory size, and then A is the accesses.

So the accesses are gonna be held constant and the memory is gonna be held constant. So we have two different constants. And now we will need to look at some stuff. The total size of the problem is the product of M and A. And remember M is the physical memory size and A is the temporal aspect, which is how many instructions– excuse me, iterations times the repetitions, in other words the total accesses. N is 45. That means 32 trillion accesses. So this does take a little time anyway.
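The access phase just described, as a sketch: K subsystems of S elements each, so M is K times S, and A is I times R. The names are illustrative, not the actual harness.

```cpp
#include <cstddef>
#include <list>
#include <vector>

using Subsystem = std::list<int>;        // one "subsystem" in the model
using System    = std::vector<Subsystem>;

// The measured phase: no memory is allocated or deallocated here at
// all; we only write to elements that already exist.
void accessPhase(System& system, std::size_t I, std::size_t R)
{
    for (std::size_t r = 0; r < R; ++r) {          // R repetitions
        for (Subsystem& sub : system) {            // visit each subsystem in turn
            for (std::size_t i = 0; i < I; ++i) {  // I iterations per visit
                for (int& value : sub) {           // write each int in order
                    value = 0;
                }
            }
        }
    }
}
```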

So you get the idea. We’re gonna look at this, and here, what does three, 18 mean? What does two, 22 mean? Well, you come over to this system. We have a large number. What does this column mean? It means two iterations, 128 repetitions. Anything in this column is two iterations and 128 repetitions. The one on the bottom is two to the first, and the seven on the top is two to the seventh. If I move over, it means this. If I move over, it means this. Do you get the idea? So we have a domain that’s a two-dimensional domain. Yes? – [Man 5] (unclear speech) – The question is how big– The question is how does the problem size relate to the cache size. We’re gonna look and see how it relates to it because we don’t know. All we’re doing is we’re creating a huge amount of an experiment and we’re gonna look at the data, and then we’re gonna guess as to what’s happening. That’s what we can do.

– [Man 6] (unclear speech) – M is the number of elements, the elements are integer sized, but we’re talking about the node of the thing, and it’s a linked list node because I’m using a std::list. The linked list node, what is it, it’s got two pointers and it’s got an int. So if this is an eight-byte architecture it’s 24 bytes. Right? Oh. Now the other way: this is eight subsystems, each of size 256K. This one is 16 subsystems of size 128K, and this one is 32 subsystems of size 64K. So you understand what the domain is. The next thing we need to look at is, now if you take a look at that, those are times in seconds. That’s how long it takes to run these things.
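For concreteness, the per-element footprint being computed there is roughly the following (a sketch; the real node layout of any given std::list is implementation-defined):

```cpp
#include <cstdio>

// Approximate layout of a doubly-linked list node holding an int on a
// typical 64-bit platform: two 8-byte pointers plus a 4-byte int,
// padded to 8-byte alignment, i.e. 24 bytes per element.
struct ListNode {
    ListNode* next;
    ListNode* prev;
    int       value;
};

int main()
{
    std::printf("%zu\n", sizeof(ListNode));   // prints 24 on most 64-bit ABIs
}
```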

You’ll notice, by the way, that the number in the upper right corner is bigger than the number in the lower left corner. These are made up numbers, but the idea is the same. So now we’ll take a look at what this is. Here what we’re doing is we’re running this program, and you’ll see that what’s happening as we go through. With things accessed in order, we can fit several subsystems in cache before we have to flush one. Right? That’s good. Gotcha. Okay, so you get the idea. I was able to go to a subsystem, then the next, then the next, then the next, before I had to flush one. But I was doing multiple iterations, so I got two iterations without having to go to main memory, that’s great, right? Yes. – [Man 6] (unclear speech) – It might not be but it makes a nice graphic.

The question is, is the vector really held in the cache? I decided it looks better if it was. Does anybody really care? No, it is a fair point. Seriously, that means you’re actually thinking about what’s going on. So the truth is that the vector might not actually stay in cache, because maybe some of its bits might get thrown out. But as a cartoon– speaking of cartoons. It’s the right idea. What about diffusion? Now I heard fragmentation over here, which is really great. This is my straight man, I have to buy him a drink or something.

But anyway, what we need to do is look at the difference between when memory comes in in order, which is what you’d expect if you had a clean system and just a global allocator acting like a local allocator, and it just gave you the memory in order; then there’s no problem, which is why sometimes it’s hard to realize why allocators are important. So the shuffle plan is going to be to go and mix this stuff up first before we do that.

So I’m just going to show you what’s going to happen. We’re going to pop off and then push, and we’re gonna go through this and shuffle things until they get good and shuffled. So here’s what’s going to happen. I’m going to start, I’m going to ignore the cache, and I’m going to pop that guy. And then I’m just going to move him up, and now I’m going to push him randomly somewhere. By the way, when I pop him, it creates a hole, that’s empty, but I still have a guy that I need to put somewhere.

I want to put him here because that’s the random place that was chosen. And now, notice the color is that blue color. And then we proceed to do this again with the blue guy. Go ahead. There we go. And now that guy has a hole. And now we’re gonna put this somewhere. Now we put him there because that was the random place that was selected. And the idea is we’re shuffling the memory. This is what would happen if we ran the system and started doing this, and what will happen over time because we’re moving, we’re actually doing a move, as opposed to a copy, if you can imagine that. Now move is supposed to be fast, right? Turns out that that’s fud. Moves are not fast, moves are horribly, horribly slow when it comes to access because what you’re doing is you’re moving the things that used to be contiguous away from each other, and so the performance over time is going to degrade, and it’s going to degrade substantially.
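A hedged sketch of the shuffle being animated here: repeatedly splice (move) a node from one randomly chosen subsystem onto another, so that nodes which were allocated together end up interleaved. The names are illustrative, not the benchmark's actual code.

```cpp
#include <cstddef>
#include <list>
#include <random>
#include <vector>

using Subsystem = std::list<int>;
using System    = std::vector<Subsystem>;

// One pass of shuffling.  Each spliced node keeps its address in
// memory, so after enough steps every subsystem's nodes are scattered
// across regions that originally belonged to other subsystems -- the
// "diffusion" described above.
void shufflePass(System& system, std::size_t steps, std::mt19937& rng)
{
    std::uniform_int_distribution<std::size_t> pick(0, system.size() - 1);
    for (std::size_t n = 0; n < steps; ++n) {
        Subsystem& from = system[pick(rng)];
        Subsystem& to   = system[pick(rng)];
        if (!from.empty()) {
            to.splice(to.end(), from, from.begin());   // a move, not a copy
        }
    }
}
```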

And you’re gonna see just how much it degrades when I get done shuffling this thing, which is gonna take at least another three minutes. (laughter) Do you understand what’s happening? This is what happens in a computer when you have a global allocator and it starts messing up. It’s like a clean room when it starts and then it messes up, and then the stuff that used to be nice and tidy goes together, gets messed up, and when things get messed up they get slower.

What do we call that again? – [Man 6] (unclear speech) – Okay, I heard fragmentation. You realize that it is not fragmentation, because we don’t have coalescing allocators, and fragmentation and coalescing allocators go hand in hand. What we have is diffusion. Our memory is diffusing, so we’re getting a homogeneous global memory that has everybody’s subsystem intermixed, like oil and vinegar now become salad dressing instead of oil and vinegar. So if you like your oil and then you like your vinegar, well guess what, you got salad dressing. It’s all clear, right? Do you understand what I’m trying to say? This is what happens when you don’t have local allocators, because the memory is going to get messed up and in the end… Hopefully you’re quickly getting the idea what this is going to be, because I’m almost done. So what happens now… We’ve now shuffled once, we’ve gone through once. Imagine if we shuffled this for a while. What would happen is this would look a little bit more like– hold on, we’ll get there– this. Now it wouldn’t be a pattern like this unless my random number generator were particularly bad, but even this shows the point.

This isn’t even showing up. By the way, this is not the greatest picture. The idea is now we have– we don’t have blocks anymore, we have everything intermixed in a way that doesn’t lend itself to this. Now this is called diffusion. This is not fragmentation. Yes? – [Man 7] (unclear speech) – I have measured it. If you go through the process that I just described once, there’s enough diffusion that it’s dramatic. You go through a second time, 30% more, and you pretty much saturate after that. But that doesn’t matter, because it’s however much it needs to be. This is fragmentation by the way, just to show you the difference. This is where your memory gets scattered all over. You want to allocate a big block, you have more than enough memory to do it, but there’s no place to put it.

This is memory fragmentation. This is really bad memory fragmentation, but it has nothing to do with this talk at all, because this doesn’t happen. What really happens is diffusion, in this particular case. I guess fragmentation could happen if you had a really ridiculous global allocator too, I mean anything can happen, but that’s not the point. How do we fix this? And the way we fix this is– What we’re gonna need is local allocators. So the overall plan, to show you how bad this is, is we’re gonna have after we shuffle, before we shuffle, and create and shuffle, those are three measures.

So we’re gonna do the same thing, but we’re gonna measure after we shuffle, in which case it’s gonna be slow, before we shuffle, which is gonna be fast, and just create and destroy only which is gonna give us our setup costs. So the result is gonna be wall-clock time, and I’m just gonna put this up there. The degradation ratio due to diffusion is going to be the after time minus the setup cost divided by the before time minus the setup cost, but the setup cost is negligible so we’ll ignore it. So what I really want you to think of is degradation is just after divided by before, and by after I mean we create it, we shuffle it, we access it, the way we discussed, or we create it, we access it, we shuffle it. We’re doing the same thing, we’re just measuring it in a different place.
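Or, as a trivial bit of code (the variable names here are mine, not the talk's):

```cpp
// Degradation due to diffusion, as just defined.  The setup time
// (create and destroy only) is measured so it can be subtracted out,
// but in this benchmark it is negligible next to the access times.
double degradation(double afterSeconds, double beforeSeconds, double setupSeconds)
{
    return (afterSeconds - setupSeconds) / (beforeSeconds - setupSeconds);
    // roughly afterSeconds / beforeSeconds when setupSeconds is tiny
}
```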

Now does everybody understand that? If you shuffle it and then access it it’s gonna be worse than if you access it and then shuffle it. How much worse, does anybody get an idea for how much worse? – [Man 8] (unclear speech) – We’re gonna go with a factor of 10? Let’s see. Here’s what happens, and if you look at the shuffling, this is to answer the question. And we’re sticking with a temporal locality of 10, meaning it’s gonna iterate 10 times just for the purpose of this experiment. And since this is the first experiment I did, I didn’t realize that binary was better than decimals so I did it in decimals, so that’s what we have here. I regret it. But anyway, if you look at this, with one shuffle, one complete shuffle you see that if I have a system size of 10 to the six then I had 10 of them.

In that case, in that particular case, just one shuffle gives me an order of 10 degradation. With two shuffles it’s 15.6, and you see it quickly saturates to around 16, the high 15s. So this shuffling is really significant. Factor of 10, factor of 15. That’s what it gets to, because the longer you run the system the more shuffled it gets, and once it’s shuffled it’s like this. I think, pretty much you get the idea. So anyway after one shuffle, after two shuffles. Any questions on this? Shuffling demonstrates the degradation. How are we gonna fix the problem? Well, we don’t want the diffusion to occur, right? We’ll talk about how we fix that in a moment.

Here’s our example again. This is showing you the footprint. Remember, this is the number of subsystems. A lot of subsystems at the bottom. Large subsystems at the top. Do you see that? So let’s take a look. This is what the surface looks like, and it is also a heatmap. The blue means one, the ratio is one. The ratio is the after measurement divided by the before measurement, so if there’s no difference, it’s blue.

If there’s a big difference it’s red. And what you’ll notice is there’s a ridge right around two to the 18th. That’s a magic number that persists. When you have a subsystem size around two to the 18th, the maximum difference occurs. Also when the locality is at the edge, meaning I’m accessing each thing just one time before I move on. That means I have low temporal locality, whereas if I accessed it 10 times, I’d have higher temporal locality.

This way is higher temporal locality, this way is lower temporal locality. And then the physical locality: obviously if I had only one system, there’s just one system, but if I bring it down to 10 subsystems I have more physical locality, and if I have 100, or I should say eight, if I have eight large subsystems, that’s where this peak is very large. And then as I go to 16, and 32, and 64 subsystems it tapers down. Now it’s hard to see exactly how big this is from this graph but it does give an idea. And then there’s this funny thing in the front that I am unable to explain, but it is a very edge case, it’s really bizarre, it’s in this corner.

So if you ignore that one point the idea is you get something like a ridge where when you have right around, as I said, two to the 18th, or when the temporal locality is really low, the difference between shuffled and non-shuffled is maximized. Now, what I’m gonna do is some magic. I’m gonna turn this into an Excel spreadsheet. I do want you to keep in mind if anybody can be hypnotized because that’s what’s gonna happen right now.

I’m gonna turn this into an Excel spreadsheet. Doesn’t it look like an Excel spreadsheet? I think it does. Don’t you think so? Well watch this. Come on. I know you can do it. Make it happen. There we go. This is an Excel spreadsheet kind of thing, and if you look at it you can see the lower left corner, where we have the high physical or high temporal locality, is very green, and if you sort of come out, it’s yellow, and then you hit that band, that red band. And of course when it gets all the way to the top, the ratio of shuffled to unshuffled doesn’t matter, because shuffling doesn’t do anything; it was only one subsystem, so when you shuffle it, it doesn’t do anything, that’s to be expected. So this is just giving you an idea of what’s going on. What’s going on is when you have high physical and high temporal locality, that kind of thing, you’re fine; when you don’t, when you get into these regions, then you need to make it happen, you need to actually do something to not get that crazy factor of slowdown.

Now this is the crazy part that happened, and I don’t know why those numbers down there happened, I really don’t. If we mask them off it doesn’t make any difference here, but it will turn out to later. Now how do we solve this problem? We solve it with local memory allocators, and the way we do that is we put them on each subsystem. Not on each thread, because this is a time-multiplexed system. And when we do that, look what happens. We create walls. And now we’re gonna use one of these allocation strategies. Now look what happens. Look at the ratio with allocators before and after. This shuffling, instead of being a factor of 15, is a factor of 2.5. Do you see that? The local allocators stop you from order-of-magnitude degradation in this system.
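Here is a hedged sketch of "one local allocator per subsystem" in std::pmr terms (the talk uses the bdlma multipool and sequential allocators): each list draws its nodes from its own resource, so neither shuffling nor later growth can interleave one subsystem's nodes with another's.

```cpp
#include <cstddef>
#include <list>
#include <memory>
#include <memory_resource>
#include <vector>

// Each subsystem owns a memory resource and a list that allocates
// exclusively from it -- the "walls" between subsystems.
struct Subsystem {
    std::pmr::unsynchronized_pool_resource arena;   // or monotonic, etc.
    std::pmr::list<int>                    data{&arena};
};

int main()
{
    const std::size_t K = 8;                          // number of subsystems
    std::vector<std::unique_ptr<Subsystem>> system;   // stable addresses
    for (std::size_t k = 0; k < K; ++k) {
        system.push_back(std::make_unique<Subsystem>());
        for (int i = 0; i < 1000; ++i) {
            system.back()->data.push_back(i);   // node comes from its own arena
        }
    }
}
```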

So this is with local allocators, and you can see that they do the least good right over here, but everywhere else, right, this is with local allocators. And the thing we’re gonna wanna look at is the improvement of without to with, that’s what we’re looking at right now. So this is the ratio of without allocators to with allocators, so we’re seeing the benefit, if you will. And the thing to realize here is it’s pretty severe. Look at the multiplier you get in performance from access. Those are real numbers, look at those numbers. They’re serious numbers. We ignore that thing… That’s some serious stuff.

Would everybody– whoops, come back. Come back, come back. There we go. There we go. That’s a big number, that is an order of magnitude. I picked one up there, it’s not the biggest one but it gives you an idea. 15 minutes, I’ve got it. Do you see what I’m saying? This is an example: just access, nothing to do with creating it and destroying it. Just keeping it local and not letting it diffuse. So instead of using splice, use copy. Then your access will be good because you keep your stuff together. Yes. – [Man 9] (unclear speech) – Okay, the question is my list is very low. This graph is an example. We can look at a bigger one. So let’s look at a bigger one to see if we can answer that question. This is a size 25, two to the 25th, so it’s bigger. Notice the shape is the same. The point is… You see we’re doing the same experiment but with a different problem, a much bigger problem.

And what happens when you do the much bigger problem? It just gets a much better result. Look at the difference here. 16 times. So the bigger the problem, the better. Large systems need local allocators when there are pockets of the potential for a subsystem to be dealt with over a time slice. If you don’t do that, you do the same system without the local allocators, the diffusion will make a slowdown of 16 happen in this system.

This is a model. When you have a factor of 16, there’s really not much to talk about. It’s not like, “Well, is it really 16 or is it really 15?” Come on. We’ve done this for every size, for every platform, for everything. This is typical. Hopefully that will… Anyway. For this one, the allocation density is nil compared to what we’re doing. The chunk size– Excuse me, the variation is nil. The access locality, that’s what we’re measuring. The memory utilization is one because we build it up, use it, and tear it down. And there’s no contention. And so the takeaway tips for this, for long-running systems the speed of allocation/deallocation operations themselves may be entirely irrelevant to overall runtime performance. Partitioning memory corresponding to subsystems having physical and temporal locality can nonetheless make an enormous difference in overall runtime performance. Basically everything you need to know about this talk is done now, but we’re gonna look at two more benchmarks much quicker, and then we’ll be done.

I just want to point out those two alone, that’s build it up, use it, tear it down, and long-running systems, is almost everything that we do. Question. Yes. – [Woman] (unclear speech) – Okay the question is should we never use the global allocator. I didn’t say that. You have to decide does performance matter? If it doesn’t, use the global allocator. If it does, go to the flowchart and figure out whether there’s an opportunity to take advantage of allocators. For large systems, inevitably there will be, and if you design the large systems with allocators in mind, 100% there will be.

So retrofitting, who knows what you’ve got. But if you’re designing a system with allocators in mind, absolutely. Emphatically, 100%. And even retrofitting– We know because we’ve retrofitted a lot of what we’ve done with these allocators, and if you took them out right now, our customers would complain bitterly. So we know anecdotally they’re necessary, yes. – [Man 10] (unclear speech) – No, not exactly. Not exactly. The premise is they are already a certain size. I have a subsystem, and it’s this big. If it turns out it gets bigger then I will allocate more memory from the global pool, but it will stay there in perpetuity and that’s good until that subsystem goes away.

Good. Just to repeat, what he’s saying is how do you know how to size it. You don’t have to size it, it’ll size itself, but the point is once it sizes itself, all of the memory stays there and so you are not moving a link, you’re copying a value, and that’s critically important. So let me move on. The third one is just basically what if we have a pump. And this is going back to utilization. And we’re just getting something, putting it back, getting something, putting it back. And I’m gonna run through very quickly because I wanna have time for people to heckle me and I’m getting a little close. So utilization, it’s just another example. It’s not really as exciting. There’s a certain amount of memory that’s allocated, there’s a memory size.

We’re gonna get chunk, chunk, chunk until we hit some size, and then once we get that size we’re gonna just delete allocate, delete allocate, and once we’re done with that we’re gonna delete. I’m just showing right now as an animation because it’s very easy. So here we just filled up, and then we’re going to go through the process of doing this until we use up our total amount of allocations.

I think it’s just easier to see it this way. And then eventually we’re done. And then finally we’ll just get rid of them, and that’s it. That’s our experiment. And the result is in absolute runtimes. The entries in each column are relative to the first one. We’re not gonna do any winking out because there’s no advantage to winking out. And remember that this is our buffered sequential allocator, and this is our multipool allocator. Remember this guy: anything that’s oversized is gonna go straight through. And then these are all powers of two, so we have something. The total allocated memory is two to the 30th bytes and we have varied the memory size and the chunk size sort of the way I’ve shown you here.
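A rough sketch of that churn, written against std::pmr::memory_resource with illustrative parameters; after the working set reaches its target, every allocation is matched by a deallocation, which is exactly why utilization stays low here.

```cpp
#include <cstddef>
#include <memory_resource>
#include <vector>

// Grow to 'workingSet' live chunks of 'chunkSize' bytes, then churn
// (free one, allocate one, ...) until 'totalAllocations' is reached,
// and finally tear everything down.
void churn(std::pmr::memory_resource* alloc,
           std::size_t chunkSize, std::size_t workingSet,
           std::size_t totalAllocations)
{
    std::vector<void*> live;
    live.reserve(workingSet);
    std::size_t done = 0;

    while (live.size() < workingSet && done < totalAllocations) {
        live.push_back(alloc->allocate(chunkSize));            // build up
        ++done;
    }
    while (done < totalAllocations) {                          // steady churn
        alloc->deallocate(live.back(), chunkSize);
        live.back() = alloc->allocate(chunkSize);
        ++done;
    }
    for (void* p : live) alloc->deallocate(p, chunkSize);      // tear down
}
```

A monotonic allocator passed in here would never reuse the freed chunks, which is the bad choice the takeaway below warns about; a pooling allocator happily reuses them as long as the chunk size stays under its threshold.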

What else have we got? These are absolute times. These are relative percentages of the absolute times. What we’re gonna look at is the last three rows because what’s important about those is they are larger than 2048 so they’re going to overflow the multipool. And the other one to note is that there are bad choices to be made here.

So let’s look at those things that could overflow, and of course, if I’m doing something that allocates too much memory and I exhaust all of my available memory because I just built a buffered sequential allocator that’s just too big, my stuff will fail. But the multipool also will become very slow because it’s acting as a no-op in front of the actual allocator. So you’ll see that for sizes above the threshold of 2048 the multipool does worse than the global. So that’s what I want you to see here.
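For reference, the std::pmr pooled resources expose an analogous knob; requests larger than largest_required_pool_block are typically forwarded straight to the upstream (backing) resource, which is the effect showing up in those last three rows. This is a sketch in standard terms, not the talk's bdlma::Multipool.

```cpp
#include <memory_resource>

int main()
{
    std::pmr::pool_options opts;
    opts.largest_required_pool_block = 2048;   // the pooling threshold

    // Requests up to 2048 bytes are served from pools; larger ones are
    // passed through to the upstream resource (here, new/delete).
    std::pmr::unsynchronized_pool_resource pool(
        opts, std::pmr::new_delete_resource());

    void* small = pool.allocate(256);     // pooled
    void* big   = pool.allocate(4096);    // goes straight to the backing allocator
    pool.deallocate(big, 4096);
    pool.deallocate(small, 256);
}
```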

And it’s consistent, and it’s not really surprising. And the point is you have to get stuff right. If you don’t get it right, it won’t be faster. The profile of this one is: yes, allocation density is high, access locality is low, variation is zero. Memory utilization is a question mark, we’re measuring it, sort of, but it’s really pretty low because we don’t have a lot of the memory in at any given time. And the contention is zero. And so the takeaway tips here are: never use a monotonic allocator by itself when the utilization is low and the number of bytes allocated is large, e.g.

In a loop. Beware that the multipool allocator passes its requests to the backing allocator when the size of the chunk being allocated exceeds the maximum pool size, e.g. 2048, which is an implementation detail, but you should know it if you’re trying to optimize for performance, and it’s adjustable. There’s nothing really special about this. We can talk about it afterwards, unless someone has a burning question. Okay, let’s just do the last one. And contention. The bottom line about contention, I’m just gonna talk about it as I put this up so people can read it at their convenience. But the idea is that contention is something that global allocators really work hard to deal with, and they put in all this effort, and if they’re really good they can approximate it, but they can’t come close to having unsynchronized memory access.

So if you’re able to arrange your allocations in such a way that you don’t have to worry about contention, there is no way that a global allocator could compete with that. So I’m just gonna show you again, if you look at this. It really doesn’t matter what the details are so much, you can go back and look at it on GitHub. But the idea is that any of these other values are under 100%. So you see the global allocator, that’s the absolute time, that’s the relative time with the virtual thing, and all of these guys are a factor of three, four better; just a constant factor, but a factor better. Now, the only way that a global allocator could do anything is try to base its decisions on one thread per subsystem. If that’s not how your world is arranged, let’s say you’re using AZO or let’s say you’re using thread pools or whatever, it’s not gonna work, got it? So you’re gonna have to do something different. I’m gonna run through this, but the idea is nothing much changes.
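A hedged sketch of what "arranging your allocations so you don't have to worry about contention" can look like: give each thread (or each subsystem a thread works on) its own unsynchronized local resource, so no allocation ever touches a shared lock. Illustrative code, not the benchmark harness.

```cpp
#include <memory_resource>
#include <thread>
#include <vector>

void worker(int id)
{
    // Unsynchronized, i.e. no locking -- safe because this resource is
    // only ever touched by this one thread.
    std::pmr::unsynchronized_pool_resource local;
    std::pmr::vector<int> scratch(&local);
    for (int i = 0; i < 1000000; ++i) {
        scratch.push_back(i + id);   // never contends with other threads
    }
}

int main()
{
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t) threads.emplace_back(worker, t);
    for (std::thread& t : threads) t.join();
}
```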

This is a bad choice, the monotonic one here again, because this thing is growing, and so the one you would want to use here is either the multipool or the multipool backed by the monotonic. Notice how that one has some advantages at some points. So anyway, that’s the single thread and this is the multiple thread working together. And I’m just gonna run through this quickly. And you can get an idea. But the main point is, choose the right allocator for the situation. Allocation density is high, chunk size variation zero, access locality low, memory utilization nil, and contention zero or high, is really all there was. This didn’t quite follow the same thing. The takeaway tips are make subsystems’ memory explicit using local allocators to obviate affinity in thread pools as well as thread-aware global memory allocators.

The second one is use a composite multipool-plus-monotonic allocation strategy, that’s AS11-14, when utilization is low, especially when variation, V, is high, so long as the sizes of the individual chunks are smaller than the pooling threshold, e.g. 2K bytes. Simple, right? Anyway. You got exposed to it, you see about what’s going on. And I do want to mention just briefly that false sharing is something that can be a problem. If you don’t know what it is I won’t explain it now, but I will tell you that false sharing can cause problems, and diffusion can cause false sharing, and memory allocators can make it possible that no false sharing occurs, because all of the stuff in each subsystem that’s being worked on by a different thread is independent, has independent cache lines. So if you know what I just said, then you know what I just said.

If you don’t know what I just said, no worries. (laughter) We had a problem. We got all the right data, but we actually screwed up and we got the columns wrong. We didn’t know how to explain it, and then there’s this fellow, Graham Bleaney, who was working at our company as an intern, and I asked him, because he was really good, to go and redo the experiments and figure out what we did wrong, and he came up with something he called fragmentability while he was doing everything that we said he should do. The concept of fragmentability is simply: if you have a data structure that is only one piece, its fragmentability is zero. That’s like a vector of int. And if you have a data structure that’s got lots of moving parts, like an unordered_map of unordered_map of string, then its fragmentability is huge, and those are the kinds of things that really beg to have an allocator so that they don’t fly apart. But a single vector of int isn’t gonna fly apart, we don’t need an allocator for that. By the way, he joined this summer, because we couldn’t let him go.

He wasn’t planning to, but anyway that’s… So the fragmentability of a vector of int, by the way, is low. Vector of string is high. And an unordered_set is medium-high. Or excuse me, let me back up, let me not get that wrong. Medium-low, medium-high, and the last one, this one would be higher, because there are more moving parts and therefore they can be distributed all over, and following just one chain might require going and getting a cache line and getting a cache line and so on and so forth. So he presented it, and it was accepted, by the way, on March 5th, 2016 into the standard, along with monotonic and multipool, so this is great stuff. These are the references. This is the outline. And the conclusions: are memory allocators really worth the trouble? Yes. What situations? Well, there are different cases. A, to improve and preserve performance. We talked about some of the examples here. I’m just gonna put them up here because I’m low on time. B, if we want to place objects in a specific kind of memory, and the kinds of memory are really pretty amazing.

And then the third one is measurement, testing, debugging, profiling, and that’s another thing that’s really awesome for memory allocators. So how do we apply them effectively? If you’re in the first category, which is no one really cares, just use a global allocator, but if you are, for example, on a large long-running system, and there’s disproportionate access, then use local allocators. And of course in the other cases use local allocators maybe, but if you can’t make this happen then there’s a problem. The “find a new job” box is somewhat empathetic, because people come into systems and they’re just goo, and you try to retrofit allocators and it doesn’t always work. I’m putting up the next one. In any case, when it’s a small system and you don’t care about things, just go with that, but if you have a subsystem that exhibits high memory utilization, then a monotonic allocator is your friend.

You might want to use a monotonic allocator if it’s a small system. I’m just about done. You might wanna use a local multipool allocator if it’s not that way. And finally, let’s see. I’ll just finish this and say… In every benchmark, pretty much, local allocators are good. For long-running programs, improvements can be an order of magnitude. The overhead of using a virtual function interface in many cases is non-existent. So the final thought is: object-level control over memory allocation is intrinsic to C++, and it must always be so if we hope to maintain the language’s supremacy as the best-performing high-level systems language. Supporting object-specific memory allocation is admittedly an added burden, exacerbated by an initially difficult-to-use model, which is finally being addressed in C++17 by polymorphic memory resources. Any future incarnation of the STL should incorporate the lessons elucidated here. And finally, remember this acronym. And of course fragmentability. So F DVLUC, but don’t mess with the duck. And thank you very much. (applause)
