C++ performance tips and rules of thumb anyone?

When coding, what is a good rule of thumb to keep in mind with respect to performance? There are endless ways to optimize for a specific platform and compiler, but I'm looking for answers that apply equally well (or almost) across compilers and platforms.


A famous quote come to mind:

"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil." (Knuth, Donald. Structured Programming with go to Statements, ACM Journal Computing Surveys, Vol 6, No. 4, Dec. 1974. p.268.)

But maybe you should not pass large data structures by value anyway... :-)

Edit: And maybe also avoid O(N^2) or more complex algorithms...

The number #1 performance tip is to profile your code early and often. There are a lot of general "don't do this" tips but it's really hard to guarantee this will impact the performance of your application. Why? Every application is different. It's easy to say that passing a vector by value is bad if you have a lot of elements but does your program even use a vector (you probably should but ...)?

Profiling is the only way to understand the performance of your application. I've been in way too many situations where people "optimized" the code but didn't ever profile. The "optimizations" turned out to introduce many bugs and not even be a hot spot in the code path. Waste of everyones time.


A couple of people have commented on the "early" part of my answer. I don't think you should be profiling from day 1. However you should also not be waiting till 1 month from ship either.

I usually first profile once I have a couple of definitive end to end scenarios, or in a larger project, a mostly functional component. I take a day or two (usually working with QA) to get together some large scenarios and throw it at the code. This is a great spot check to find obvious performance problems early. Fixing them at this point is a bit easier.

On a typical project I find that I have code meeting this criterias 30%-40% of the way through the project (100% being in customers hands). I loosely classify this time as early.

Someone mentioned function pointers (and why you should rather use if). Well, even better: use functors instead, they get inlined and usually have zero overhead. A functor is a structure (or class, but usually the former) that overloads operator () and instances of which can be used just like an ordinary function:

template <typename T>
struct add {
    operator T ()(T const& a, T const& b) const { return a + b; }

int result = add<int>()(1, 2);

These can be used almost in every context where an ordinary function or function pointer could be used. They usually derive from either std::unary_function or std::binary_function but that's often not necessary (and actually only done to inherit some useful typedefs).

EDIT The explicit <int> type qualification is necessary in the above code. Type inference only works for function calls, not for instance creation. However, it can often be omitted by employing a make helper function. This is done in the STL for pairs:

template <typename T1, typename T2>
pair<T1, T2> make_pair(T1 const& first, T2 const& second) {
    return pair<T1, T2>(first, second);

// Implied types:
pair<int, float> pif = make_pair(1, 1.0f);

Someone mentioned in the comments that functors are sometimes called “functionoids”. Yesish – but not quite. In fact, “functor” is a (somewhat weird) abbreviation for “function object”. A functionoid is conceptually similar but realized by employing virtual functions (although they are sometimes used synonymously). For example, a functionoid could look like this (along with its necessary interface definition):

template <typename T, typename R>
struct UnaryFunctionoid {
    virtual R invoke(T const& value) const = 0;

struct IsEvenFunction : UnaryFunctionoid<int, bool> {
    bool invoke(int const& value) const { return value % 2 == 0; }

// call it, somewhat clumsily:
UnaryFunctionoid const& f = IsEvenFunction();
f.invoke(4); // true

Of course, this loses any performance advantage that a functor has because of its virtual function call. It is therefore used in a different context that actually requires a polymorphic (stateful) runtime function.

The C++ FAQ has more to say on this subject.

  • When possible use if or switch instead of calls through function pointers. Clarification: void doit(int m) { switch(m) { case 1: f1(); break; case 2: f2(); break; } } instead of void doit(void(*m)()) { m(); } can inline the calls.
  • When possible and not harm causing, prefer CRTP to virtual functions
  • When possible, avoid C Strings and use a String class. It will be faster most often. (constant time length "measure", appending amortized constant time, ...)
  • Always pass user defined typed values (apart from where it doesn't make sense. e.g iterators) by reference to const (T const&) instead of copying value.
  • For user defined types, always prefer ++t instead of t++
  • Use const early, often. Most important to improve readability.
  • Try keeping new to a minimum. Always prefer automatic variables (on the stack) if possible
  • Instead of filling arrays yourself, prefer initialization with an empty initializer list like T t[N] = { }; if you want zeros.
  • Use the constructor initializer list as often as possible, especially when initializing user defined typed members.
  • Make use of functors (types with operator() overloaded). They inline better than calls through function pointers.
  • Don't use classes like std::vector or std::string if you have a fixed sized quantity not growing. Use boost::array<T, Size> or a naked array and use it properly.

And indeed, i almost forgot it:

Premature optimization is the root of all evil

Don't bother optimizing until it's needed. To find out if it's needed, profile. Don't guess; have proof.

Also, algorithmic optimizations usually have a greater impact than micro ones. Using A-star instead of brute force pathfinding will be faster, just like how Bresenham circles are better than using sin/cos. There are exceptions to these of course but they are very (very) rare (<0.1%). If you have a good design, changing the algorithm changes only one module in your code. Easy.

Use existing, reviewed code that's been used and re-used. (Example: STL, boost vs rolling your own containers and algos)

Updating due to comments: CORRECTLY use existing, reviewed code that's been used and re-used.

The best thing you can do as far as performance goes is to start with a solid architecture and threading model. Everything else will be built on this, so if your foundation is crummy, your finished product will only be as good as that. Profiling comes a little later, and even later than that come micro-optimizations (in general, these are insignificant, and complicate code more than anything.)

Moral of the story is: Start with an efficient base, build on top of that cognizant of not doing something downright silly and slow, and you should be alright.

Another point: The fastest code is code that doesn't exist. Which means the more robust and feature-full your project needs to be, the slower it will be. Bottom line: Skip the fluff whenever possible, while making sure you still meet the requirements.

Two of the best tips for C++:

Purchase Effective C++, by Scott Meyers.

Then purchase More Effective C++, by Scott Meyers.

Keep your code as clean as possible. Compilers are MARVELOUS nowadays. Then, if you do have a perf problem, profile.

All of this is after having chosen the best algorithms available for your problem.

Here are a few:

  • Effectively leverage Inlining (depending on your platform).

  • Avoid using temporaries as much as possible (and know what they are)

    x = y + z;

    Would be better optimized if written as:



Also, avoid virtual functions and only create objects when you need to use them.

If you're in the mood, check out Efficient C++. I've got a copy at home from back when I was in school.

Wikibooks has some things.

A good thing to do is know the efficiency of what you're using. How fast addition is to multiplication, how fast a vector is compared to a normal array or to the higher scales how certain algorithms compare. This allows you to choose the most efficient tool for a task

Using generic algorithms is a great optimization tip - not in terms of runtime but of coding time. Knowing that you can sort(start, end) and expect the range - be it two pointers or iterators to a database - will be sorted (and what's more, the algorithm used will be runtime efficient too). Generic programming is what makes C++ unique and powerful, and you should always keep it in mind. You shouldn't have to write many algorithms because versions already exist (and are likely as fast or faster than anything you would write). If you have other considerations, then you can specialize algos.

one simple sugestion is to get into the habit of doing ++i, rather than i++. i++ makes a copy and this can be expensive.

Consider using a memory pool.

  1. Always try to think on how your memory looks - for example an array is a consecutive line of memory of the size numOfObjects X sizeof(object). a two dimensional array is n X m X sizeof(object) and each object is in the index of n + m X n and so

    for(int i = 0 ; i < n ; i++){
        for(int j = 0 ; j < m ; j++){
            arr[i,j] = f();

    is much better then (on single process):

    for(int i = 0 ; i < n ; i++){
        for(int j = 0 ; j < m ; j++){
            arr[j,i] = f();

    Because the array is brought into the cache in consecutive chunks the 1st snippet runs on all the cells that are in the cache before fetching the rest while the 2nd snippet would need to fetch new array cells into the cells over and over again

  2. When your application starts to slow use performance benchmark to find the exact bottleneck even a simple GetTickCount calls can be used to determine the time it takes your components to run. On bigger projects use a proper profiler before you start to optimize so you will spend the most optimization effort where it matters.

Don't use grossly inefficient algorithms, turn on the optimization in your compiler, don't optimize anything unless a profiler shows it to be a bottleneck, and when you try to improve things test to see if you've done good or bad. Remember also that library functions have usually been optimized by people better at it than you are.

Pretty much everything else is minor, compared to these.

Basically your biggest performance gains are to be had by algorithmic improvements. This means using the most efficient algorithms and, in turn, the most efficient containers for data items.

Sometimes it is hard to know what the best trade-offs are, but fortunately the designers of the STL had exactly this use case in mind and hence the STL containers are generally flexible enough to allow containers to be mixed and matched in accordance with the demands of the application.

To fully realise the benefit however you need to ensure that you don't expose an internal design choice as part of the interface of your class/module/whatever. None of your clients should depend on your use of a std::vector. At the very least provide a typedef that they (and you) can use, which should allow the vector to be changed to a list (or whatever) depending on your needs.

Similarly make sure that you've got the widest possible choice of debugged algorithms and containers at your disposal. Boost and/or TR1 are prettymuch necessities these days.

From one of the C++ books i referred (Efficient C++ performance Techniques by Bulka and Mayhew), which explicitly talked about C++ performance aspects. One of them was;

while defining constructors..initialize the other constructors also; some thing like;

class x {

x::x(char *str):m_x(str) {} // and not as x::x(char *str) { m_str(str); }

std::string m_x;


The above is some thing which caught my attention and helped me to improve my coding style... this book has more to share on this interesting topic of performance.

Coarse Designs

Keep your class and data structure designs on the coarser side, not granular side, in performance-critical areas. By performance-critical, I mean either measured as such or one where you can definitely anticipate large inputs being processed repeatedly (ex: every single frame). The point of doing that is to leave enough breathing room in the future for any necessary optimizations. Otherwise we might look at bottlenecks which require extensive redesigns/rewrites of central areas of the codebase which necessitate cascading rewrites of all dependencies.

Let's run through a few examples of granular designs I've encountered from other people who ran into hotspots.

Boatload of Strings

// Stores millions of strings.
std::vector<std::string> boatload_of_strings;

Here this is a granular design in the sense that we're storing a full-blown string container/class by the millions of instances. Every single teeny little string, stored by the millions, is represented using a full-blown, variable-sized, memory-managing, std::string container. This ends up being either explosive in memory use, requiring far more memory than necessary, or using more heap allocations than necessary, or a combo of both. With recent small string optimizations, the minimum sizeof(std::string) can be as large as 24 bytes, which is pretty explosive memory use if the majority of our strings are only a few characters long. And if the strings are medium-length, you could still incur a separate heap allocation per string. Either way it translates to a boatload of cache misses. Instead if you make the design coarser like this:

class BoatloadOfStrings
     // Returns the nth string.
     const char* operator[](int n) const
         return buffer.data() + string_start[n];

     // Inserts a string.
     void insert(const char* str)
         buffer.insert(buffer.end(), str, str + strlen(str)+1);

     // Stores all the characters of all null-terminated
     // strings in one giant buffer.
     std::vector<char> buffer;

     // Stores the starting position of each null-terminated
     // string.
     std::vector<size_t> string_start;

... now we end up using way less memory and only face a heap allocation when one of those two vectors exceed capacity (amortized cost). Even if we started with the above solution using a std::vector storing millions of instances of std::string, using this kind of coarser BoatloadOfStrings class design gives us a whole lot more breathing room to optimize than if we had a boatload of dependencies to the former representation.

Boatload of Abstract Pixels

class IPixel
     virtual ~IPixel() {}

     // Abstract pixel operations.

Here if we're designing, say, an image/video processing application that has to loop through pixels repeatedly to do things like implement custom video filters, then the above design is extremely wasteful but also gives us no breathing room to optimize it any further if the whole system depends on such an interface.

Normally the cost of things like dynamic dispatch and virtual pointers is pennies, but pennies become expensive if they're paid millions of times over many times per frame. Paying for the cost of virtual dispatch on a per-pixel basis does become relatively very expensive, and the size of a virtual pointer, in spite of only being a mere 8 bytes on 64-bit systems, quadruples the memory use of a 32-bit RGBA image when we consider its size and the padding it will add to the structure given its 64-bit alignment requirements (64-bit vptr + 32-bit pixel data + 32-bits of padding to align vptr).

Same recommended solution from me in this case. Design at a coarser level. If you need polymorphism, see if you can abstract at the coarser image level:

class IImage
     virtual ~IImage() {}

     // Abstract image operations.

Suddenly the cost of virtual pointers and virtual dispatch become extremely cheap when they're only paid once for an entire image, which may very well consist of millions of pixels as a container of pixels. And now you have room to do things like apply SIMD instrinsics to image operations in parallel loops at a central level without rewriting huge parts of your codebase.

Boatload of Creatures

class Creature
     virtual ~Creature() {}

     // Abstract creature operations.

class Human: public Creature

class Orc: public Creature

class Undead: public Creature

Let's say we have a real-time war simulation which consists of a boatload of abstract creatures with a massive number of units in a single game session, Lord of the Rings-style. The creatures can be removed at any given moment since they can die, e.g. And most of our critical loops applied per frame are actually sequential loops which just do things like move all creatures along a given path.

In this case, the above representation could become a bottleneck with little breathing room to optimize further because the code, using polymorphic base pointers to Creature*, forces us to deal with one creature at a time. On top of this, each creature might have different size and alignment requirements, which implies that they generally cannot be stored contiguously in memory (ex: we might not be able to interleave an orc's data right next to a human's data in memory, even if we want to process an orc followed by a human).

In these cases you might be able to improve performance somewhat by using a separate allocator per creature sub-type, like a separate free list per creature sub-type which allocates the creatures of a particular type in contiguous blocks, and radix sorting your polymorphic base pointers by address for improved locality of reference (spatial locality for neighboring creatures of the same type and temporal locality on vtables, e.g.). However, that's a lot of effort involved to optimize with limited room to go far. Meanwhile if you just use coarser designs like this:

class CreatureHorde
     virtual ~CreatureHorde() {}

     // Abstract creature horde operations.

class HumanHorde: public CreatureHorde

class OrcHorde: public CreatureHorde

class UndeadHorde: public CreatureHorde

... now we're designing at the level of an entire mass horde of creatures, just as we turned our abstract pixel interface into an abstract image interface. Now we have all kinds of room to easily optimize, as our polymorphic base pointers (CreatureHorde*) will point to an entire horde of creatures to process, not one creature at a time. Within each horde, we can potentially have very cache-friendly sequential processing of, say, all data for human units within a horde of humans stored and processed sequentially in a contiguous vector/array.


So anyway, this is my number one advice, and it relates to design more than implementation. When you anticipate big, performance-critical loopy areas of your system, design at a coarser level. Use coarser data structures, coarser class interfaces, and coarser abstractions. Don't represent a boatload of teeny things as separate classes and independent data structures in their own right, or at least not as anything more than an private implementation detail of a coarser design.

After you get the designs sufficiently coarse to give you the breathing room to optimize, then you might look into getting yourself some nice profiling tools, improve your understanding of the memory hierarchy, parallelize your code (which also becomes easier with coarser designs), vectorize it (which also becomes easier with coarser designs), etc.

But the top priority to me when we consider efficiency upfront is design efficiency, since lacking that, we can't even effectively optimize our implementations in hindsight even when we profile and accurately discover our hotspots. And design efficiency typically just boils down to having designs with sufficient breathing room to optimize, and that effectively boils down to just coarser designs that don't try to model the tiniest of objects and data structures.

"premature optimization is the root of all evil" (Knuth, Donald)

It really depends on the type of code you write and it's typical usage.

Whatever action you take to save a few cycles, remember this: Don't try to be smarter than the compiler - measure to verify the gain.

though not addressing the exact issue, some advice is :

always code for interfaces ( when it comes to algorithms ) so that u can smoothly replace them by efficient one (by any req means ).

  • Rule 1: Don't
  • Rule 2: Measure

I agree with the advice on premature optimization. However, there are a few guidelines I like to follow during design because they may be difficult to optimize later:

  • Try to allocate all your objects on start up, minimize the use of new during run time.
  • Design your data structures so that your algorithms have a good shot at being O(1).
  • And as usual, modularize so that you can rip out and replace later. This implies you have a comprehensive suite of unit tests that give you the confidence your new solution is correct.
  • Add performance tests to your unit test suite to make sure you don't inadvertently end up with some O(n*n) code :-)

Select best performing algorithms, use less memory, use less branching, use fast operations, use small number of iterations.

Need Your Help

How to use Scala's immutable data structures as members of a class

scala state immutability mutable

I'm trying to get a hold on FP and the concept of immutable data structures for mutable states. Though i get the idea "locally" like in a function, i don't understand how i would do it in a system...

Keeping track of user usage on a Winform application

sql sql-server winforms

We have a Winform application that requires user to login using their username and password. The software is free but support is not. The support fee will be based on the number of full time equiva...