
Effective C++
Mark Veltzer
mailto:[email protected]

Micro optimizations

Micro optimizations

This presentation is about using the C++ language effectively.

This means that we will not be talking about how long parts of the system that are not your code take.

Always remember that for a lot of systems the real bottleneck is not within your code but in some other place (too much IO, a heavily loaded database, the network, ...).

That is why the types of advice in this presentation are sometimes called micro optimizations.

What we will not be discussing

There are two very important kinds of optimization, arguably more important than micro optimizations, that we will not be discussing:

Algorithmic optimizations.

Architectural optimizations.

These could bring tremendous improvements to your code but have less to do with the technicalities of the C++ language.

In reality implementing them does require knowledge of the C++ language but not at the micro optimization level.

Reasons against micro optimizations

Some people consider micro optimizations an evil.

They claim that since the bottlenecks are not in your code you shouldn't pay attention to these types of optimizations, because:

The ease of implementing micro optimizations gives you the feeling that you improved the software even though the bottleneck is really somewhere else; in fact you just complicated your code with no benefit, which means your situation actually got worse.

They make your code harder to maintain and read.

They require you to learn something which is not important.

Reasons against micro optimizations (cont)

Even if your bottleneck is within your code you are better off putting effort into either algorithmic or architectural (including multi-core) changes.

Although these are harder to implement, the claim is that per unit of time spent they give better returns.

Micro optimizations distract you from doing architectural and algorithmic changes which are more important.

Micro optimizations hide the algorithm or architecture in a sea of details which makes understanding and modifying the software difficult.

Reasons against micro optimizations - less agility

Some claim that micro optimizations make your software less agile.

Because they make each piece of code more complicated, they make the entire software harder to change at the architectural level.

This means that the software development process will be slower to respond to new requirements.

This could have business implications as well as technical ones.

Reasons against micro optimizations - micro optimizations as premature optimization

"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil" - Donald Knuth

There is a connection between micro optimizations and premature optimization because in a lot of cases micro optimizations are done prematurely.

So when first coding your system, Keep It Simple, Stupid - which means avoiding unnecessary micro optimizations.

Reasons for micro optimization

The main point: You need to write your code anyway so you may as well use the right features and get better performance while you are at it.

If the language has two features and you use the one which is giving you better optimization then you gain more efficiency without complicating your code.

Sometimes you are CPU bound and micro optimizations could squeeze more performance out of an existing code base.

Reasons for micro optimization

Sometimes you are willing to live with the downsides of micro optimization (both technical and business).

Sometimes, just sometimes, your design is good and you think that it may stick around for a while and micro optimizations just make it better.

In these cases the loss of agility is not as big a problem as it seems.

This happens in mature products more than in young ones.

Reasons for micro optimization (cont)

Algorithmic and Architectural changes are just too difficult.

Architectural changes sometimes mean huge rewrites, which companies can rarely handle because of organizational issues.

Algorithmic changes require lots of theoretical knowledge and sometimes low level details (turning a lock-based algorithm into a lock-free one requires understanding memory and machine barriers...).

Micro optimizations

If you are coding in C++ then you, perhaps unconsciously, already agree that too much low level optimization is evil.

Otherwise you would be coding in C or assembly. Or maybe designing a dedicated piece of hardware?

Some people are so strongly against micro optimizations that they go for higher level languages like Java/C#/Python/Ruby, which prevent these types of optimizations by design.

Micro optimizations and the compiler

The compiler does lots of optimizations which could be classified as micro optimizations:

In-lining.

Loop unrolling.

Constant folding.

Dead code elimination.

Common subexpression elimination.

Loop invariants.

And many more...
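
As a rough illustration of two of these (constant folding and dead code elimination), consider the sketch below; with optimization enabled a typical compiler reduces the whole function to "return 42;":

    int area() {
        const int width = 6;
        const int height = 7;
        if (width == 0) {         // condition is known to be false at compile time
            return -1;            // dead code: eliminated by the compiler
        }
        return width * height;    // constant folding: computed at compile time
    }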

Micro optimizations and the compiler

This means that the micro optimizations we are talking about here are optimizations that the compiler cannot perform.

In practice, however, it is not always simple to know which optimizations are performed by the compiler.

Plus compilers tend to become better with time.

And you may need to support several compilers that have different feature sets.

Micro optimizations - conclusions

Should you use micro optimizations? Yes.

How many of them? It depends on your product (is it mature, is its design going to change, etc.).

What types of optimizations should you prefer? Ones that don't make your code harder to read and understand and ones that are not already performed by the compiler.

When to avoid them at all? When your product is young and you are not sure about the design at all. Then they are just a waste of your time.

C++ efficiency (comparison to assembly/C)

Measuring efficiency

There are many ways to measure efficiency.

The various methods may give somewhat different results.

They are also not equally easy to use.

They are sometimes in conflict with one another.

For instance: a change may increase worst-case latency while decreasing the average response time.

And all measures depend on the dataset used.

Measuring efficiency

Measurement methods include:

Measuring number of instructions in the assembly code.

Measuring number of instructions issued at run time using CPU counters.

Measuring time elapsed using CPU counters.

Measuring time elapsed using hardware clock.

Measuring time elapsed using OS clock.

Full profiling in software or hardware.

Measuring CPU cycles

Is one way to measure efficiency.

In RISC chips this is more accurate since every instruction is executed in just one cycle.

In CISC chips things are more complicated. One instruction may be over in one clock while others may last many clocks.

And instructions are sometimes halted to wait for external factors: main memory, other CPU atomic operation, etc...

What does measuring CPU cycles mean for assembly programmers?

That even an assembly programmer writing down instructions explicitly cannot know how long his/her code will run.

This is contrary to common opinion that assembly programmers know exactly how long their code runs.

It is true that assembly programmers have a firmer grasp on how long their code runs.

C complications - function call

A function call in C may involve many cycles: the caller pushes the arguments and jumps to the function's address, the inner function gets the arguments from the stack, does the calculation and puts the result back, and then the outer function collects the result.

There are quite a few machine instructions involved in all of these.

C complications - function call (cont)

On the other hand if the inner function is inside a header file and the compiler is an optimizing one it could eliminate the function call altogether.

And if the arguments are constant the compiler may eliminate the calculation too and just hard code the result.

So, a function call may range anywhere between one instruction and hundreds or more even for a simple function.
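
A minimal sketch of both extremes (the function name and header file are made up for illustration); because the body is visible in the header, an optimizing compiler can in-line the call and even hard code the result:

    // add.h - the body is visible at every call site
    inline int add(int a, int b) {
        return a + b;
    }

    // caller.cpp
    #include "add.h"
    int demo() {
        return add(2, 3);   // with optimization this typically becomes "return 5;"
    }                       // without it, it is a full call with stack traffic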

C complications - switch statement

A single switch statement in C can be realized by a compiler in several ways:

A series of compare-and-branch statements.

An array of jump locations.

A perfect hash of jump locations.

Actually good compilers use all of these combined when optimizing.

So how long does a switch statement take? I don't know!
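
As an illustration, a dense switch like the sketch below is a good candidate for a jump table, while sparse or irregular case values may end up as compare-and-branch chains or a hash; which form you actually get is entirely up to the compiler:

    int days_in_month(int month) {   // hypothetical example
        switch (month) {
            case 2:  return 28;
            case 4: case 6: case 9: case 11: return 30;
            default: return 31;
        }
    }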

C complications - registers

In C the programmer may advise the compiler to keep a variable in a register by using the register keyword.

The programmer, however, does not know if the compiler actually manages to store the variable in a register and does not get feedback about it (aside from disassembly, of course).

The difference in performance is significant, as register access and memory access (even cache memory) differ by about an order of magnitude.
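
A small sketch of the hint (note that modern compilers largely ignore it, the keyword is deprecated in later C++ standards, and you get no feedback either way):

    long sum(const int* data, int n) {
        register long total = 0;        // a hint only: the compiler may or may not
        for (int i = 0; i < n; ++i)     // actually keep 'total' in a register
            total += data[i];
        return total;
    }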

C complications - cache lines

When programmers start programming in C they are told nothing about cache lines, cache misses etc.

After a while they are told that heavily accessed structures should be small enough to fit in a cache line, so that they hit the CPU cache and not main memory.

And after a while more they are told about how synchronization between CPUs works.

What does all of this mean for C programmers?

That even when programming in C the programmer only has a vague idea of how his/her program is translated into machine code.

The programmer has to try and wrap his/her head around various low level concepts (registers, switches, caches) in order to understand the speed of his/her program.

So analyzing speed is not simple, even in C.

What about C++, and higher level?

In C++ things get even nastier.

Various basic building blocks like constructor, destructor, virtual method call, exception, copy constructor are even harder to estimate.

So the feel of the programmer for the runtime gets more vague.

And in Java the programmer doesn't even know about machine instructions...:)

Criticisms of C++ performance (low level)

Low level people criticize C++ for being too slow.

As a result they say they'd use assembly or C over C++.

The claim is that C++ does too much behind the programmer's back.

This prevents the programmer from getting a good grasp of what is happening performance wise.

Criticisms of C++ performance (high level)

High level people criticize C++ for being too slow.

They claim that C++ forces the programmer to focus too much on low level details.

This makes the design suffer and become suboptimal.

Which in turn leads to bad performance.

Is C++ performance bad?

No.

It really depends on the programmer.

And on the programmer's knowledge of what C++ does behind the scenes.

If the programmer knows when and what C++ does behind his/her back then he/she knows how to handle the situation.

This is exactly what this presentation is for.

Is C++ performance bad? (cont)

Since C++ is a more complicated language, programmers who do not know its features can abuse it in more creative ways and create systems that perform worse than systems implemented in lower level languages (C, assembly).

But C++ is usually thrown at big problems where development in C is more difficult and requires object oriented approaches. In these cases C++ abuse is less of an issue, since C abuse in large systems is an issue as well.

Is C++ performance bad? (cont)

So C++ sometimes performs slower and sometimes faster than equivalent systems implemented in lower level languages.

If the system you are implementing is a large one then using object oriented and C++ has an advantage over C since it is more agile and will allow you to reach a good design with less rewrites.

Since rewriting is problematic in a corporate environment C++ could actually outperform C in such environments.

So what should we do?

Remember the example of the C function call and the C switch statement?

C++ has many more of these (virtual functions, exceptions, constructors, destructors, templates and more).

We should study the language to understand exactly when and what the language does.

And understand how to use the features effectively.

Comparison of effectiveness issues in C++ vs other languages

Other languages, like Java, also have effectiveness issues.

The difference is at the level at which the programmer needs to peek in order to understand the issues involved.

In Java or other higher level languages effectiveness books, courses and presentations deal with higher level considerations than in C++.

Issues such as CPU cache lines never appear...:)

Some see this as a weakness of C++.

C++ and over design

Over design is a situation where a programmer creates too much infrastructure in a piece of software, much of which is never used.

Over design is problematic because it makes the code larger and more complicated, thus making it harder to understand, maintain and develop further.

Over design also makes your software run slower for no good reason.

Over design is a notorious problem in object oriented languages and is usually inflicted by beginner programmers.

C++ and over design (cont)

Over design rests on the assumption that you must make your software flexible for future requirements now, regardless of whether that flexibility will ever be used.

Experience shows that programmers make their program flexible where it is convenient for them to do so and not where future requirements will demand that flexibility.

This means that over design is rarely a good thing.

C++ and over design (cont)

Mature and young products are different with respect to over design, and young products tend to exhibit this illness more.

C++ has potential for over design.

Since over design means lower performance then C++ has potential for lower performance in that respect.

C has fewer over design issues and assembly even fewer, because they are harder to code in.

C++ and over design (cont)

All you need to do to solve the problem is simply avoid over design.

This is very close to the concept of agile development.

Agile development tries to reconcile the following two statements:

You don't know the requirements of a piece of software in advance.

You want to avoid over design.

C++ and over design (cont)

Instead agile development puts an emphasis on the development process:

Keep code simple.

Keep the programming environment simple.

Make it as easy as possible to change and test the code.

This makes the adjustment to new features easy without over design.

So one solution to the problem of C++ and over design is agile development.

Is efficiency important?

Reasons for efficiency

Even as hardware advances, competition between software vendors remains. Since they use the same hardware, the winner will be the one with the more efficient software.

Communication speed has exceeded CPU speed and so the software is the new bottleneck. Due to faster optical and other communication solutions it will stay that way.

Reasons for efficiency (cont)

CPU speed improvement has slowed down and given way to multi-core/multi-CPU designs.

This is due to various low level physical problems which are not likely to be lifted in the near future.

But multi-core/multi-CPU solutions are based on software that is specifically written to utilize such systems, and so software efficiency is once again the bottleneck.

Reasons for efficiency (cont)

Last but not least is the act of writing software.

A simple error in programming could cause a piece of software to run 100 times slower than it could.

Example: NPTL in Linux made thread creation about 1000 times faster.

Software is still key to efficiency.

What type of efficiency?

What type of efficiency do we mean?

Binary size: we don't care much about that these days (data is much larger than code).

Compile time: Most people don't care about this but maybe you should?

Data size: yes, but C++ data structures tend to be quite efficient and RAM is cheap these days.

Code efficiency: This is the one which most issues relate to.

What type of efficiency do we mean?

Code efficiency is usually split into:

Latency/response time/responsiveness.

Throughput.

These goals are a trade off: increasing one will tend to decrease the other.

You have to decide where you are in balance between these two opposing objectives.

Being somewhere in the middle is fine as long as you define clearly exactly where you are.

The fast path

Fast path: definition

Most software has a lot of code which is used rarely, once, or never.

This includes setup, tear down, error handling and more.

More than 90% of code is like that (sometimes a lot more).

This is the boring code or the slow path.

Fast path is the non boring code that actually gets used most of the time.

One function can include both slow path and fast path code (e.g. it gets used often but has lots of error handling code).

Fast path optimizations

Micro optimizations, and any other optimizations for that matter, only make sense in code that actually runs a lot.

This means that the fast path should receive most of the love and care.

Optimizing code outside of the fast path will not provide better performance and will just complicate your code making it harder to read/understand/modify/adjust to new requirements.

Some basics first

How is a class implemented?

A class is just like a struct with compile time checking of access (private, public, protected).

It is packed like a struct, and you can query or control the offsets of its members (although you rarely have to).

The methods are not stored within the object; they are nowhere to be found in it. The compiler arranges for the right object to be passed, using compile time type checking. Virtual methods are a different matter.

How is a class implemented? (cont)

The minimum size of a class is 1 byte.

This is to make sure that memory really is allocated.

Also, sizeof() can never return 0.

Think of sizeof(array)/sizeof(instance) if sizeof(instance)==0...
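
A quick sketch of the point (the exact value is implementation defined, but it is never 0 on a conforming compiler):

    #include <cstdio>

    class Empty {};                                     // no data members

    int main() {
        printf("%zu\n", sizeof(Empty));                 // typically prints 1, never 0
        Empty arr[10];
        printf("%zu\n", sizeof(arr) / sizeof(arr[0]));  // 10 - the division stays valid
        return 0;
    }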

How is inheritance implemented?

Inherited classes are laid out just like structs, one after the other.

So if B is derived from A then sizeof(B) = sizeof(A) + sizeof(fields of B).

Different compiler vendors can pack A before B or vice versa. This is an ABI issue.

Different compiler vendors can decide on different alignment between the A and the B part of the object. This is an ABI issue.
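
A sketch of the layout rule (the sizes assume a typical 4-byte int and no extra padding, which holds on common ABIs):

    struct A { int a; };        // sizeof(A) == 4 on typical platforms
    struct B : A { int b; };    // A's part is followed by B's field: sizeof(B) == 8

    static_assert(sizeof(B) == sizeof(A) + sizeof(int), "holds on typical ABIs");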

Inheriting from empty classes

An empty class takes 1 byte.

But if you define a class derived from it that adds an integer, you will find that the size of the derived class is 4 and not 5.

This is because the compiler knows that the 1 byte in the empty class really doesn't refer to data.

So inheriting from many empty classes doesn't make your class bigger.
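
A sketch of this empty base optimization compared with plain composition (typical sizes; the exact values are an ABI issue):

    struct Empty {};                    // sizeof(Empty) == 1
    struct Holder { Empty e; int x; };  // composition: typically 8 (1 byte + padding + int)
    struct Derived : Empty { int x; };  // inheritance: typically 4, the empty base adds nothing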

How is multi-inheritance implemented?

If C is derived from A and B then C has its own fields, A's fields and B's fields (unless it is virtual inheritance).

The physical layout of these fields is again an ABI issue that is implemented differently by different compiler vendors.

Virtual methods

If a class has at least one virtual method then the size of the object grows by exactly one pointer.

That pointer points to a structure called a vtable.

There is exactly one of these in RAM for each class.

The table holds the method pointers for all virtual methods in that class.

Virtual methods (cont)

The compiler vendors could have held pointers to other vtables inside a vtable, but most of them choose to create one flattened vtable per class.

The order of the functions within the table is a compiler-dependent ABI issue. It could be order of declaration, alphabetical order, whatever (as long as it is consistent).

Each call to a virtual method goes through this table.

Virtual methods (cont)

You can think of the vtable pointer as the type of the instance; the difference is that classes without virtual methods don't have this type information attached to them.

This is not the same as RTTI which adds info pointed to by the vtables.

Virtual methods (cont)

This means that adding the first virtual method to a class adds a pointer to the size of the instance.

Adding a virtual method to a parent class will add that pointer to all of its descendants.

Virtual methods are one extra dereference slower to call.

And they cannot be in-lined unless the compiler can determine the exact type, which is rarely the case.

Any constructor sets up the pointer to the vtable which makes the constructor slower.
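
A sketch of the size cost alone (assuming a 64-bit platform with 8-byte pointers and a 4-byte int):

    struct Plain       { int x; void f(); };           // typically sizeof == 4
    struct WithVirtual { int x; virtual void f(); };   // typically sizeof == 16:
                                                       // 8-byte vtable pointer + int + padding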

RTTI

The ability of the system to report on the dynamic type of an object and to provide information about that type at runtime (as opposed to at compile time).

When utilized consistently, it can be a powerful tool that eases the work of the programmer in managing resources.

Comes at the cost of a little performance.

Includes typeid and dynamic_cast

RTTI - typeid

typeid - an operator that can be given a type or an expression and returns a std::type_info object which describes the type.

You can use this return value for the name of the type (.name()) or to compare type information from two instances.

According to the manual, the name returned is implementation (ABI) specific.

Furthermore, typeid is only guaranteed to report the dynamic type for instances of classes that have a vtable (meaning at least one virtual method).

RTTI - typeid (cont)

In order to work with typeid you must include <typeinfo>.

typeid is sometimes resolved at compile time and sometimes leaves code that determines the type at runtime.

Most compilers allow you to work without typeid and get a little bit more performance (in terms of code and size of vtables).

In gcc this is the -fno-rtti flag.
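
A small sketch of typeid on a polymorphic hierarchy (the class names are made up); note that type_info objects are compared rather than names:

    #include <typeinfo>
    #include <cstdio>

    struct Base    { virtual ~Base() {} };   // the vtable is what makes typeid dynamic
    struct Derived : Base {};

    void report(const Base& b) {
        if (typeid(b) == typeid(Derived))    // compares std::type_info, not strings
            printf("dynamic type is Derived (%s)\n", typeid(b).name());
    }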

RTTI - dynamic_cast

Allows you to try and cast an object to some type at runtime.

Because it is done at runtime it may fail at runtime and your code should include a check for this.

Works by traversing the vtables of the objects so it is slow and depends on the number of classes in your hierarchy.
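
A sketch of the runtime check that dynamic_cast requires (the hierarchy is hypothetical):

    struct Shape  { virtual ~Shape() {} };
    struct Circle : Shape { double radius = 1.0; };

    double radius_or_zero(Shape* s) {
        // dynamic_cast on a pointer returns nullptr on failure, so always check
        if (Circle* c = dynamic_cast<Circle*>(s))
            return c->radius;
        return 0.0;
    }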

RTTI - should you use it?

Good design usually means not needing to use RTTI which is also good from a performance standpoint.

typeid returns name() as a string and strings are slow to compare so never use that to determine what to do with an instance.

dynamic_cast is slow (as explained) and thus should not be used in the fast path.

Inheritance with virtual methods is pretty fast and should be used to distinguish between types instead.

Virtual destructor

A virtual destructor is required if:

You want to destroy the object using delete.

At the delete point the pointer may point to a derived object.

Different classes in your hierarchy would like to do different things in the destructor.

This is not the same as having virtual methods.

You may have virtual methods and not need a virtual destructor.

You may have no virtual methods and need a virtual destructor.
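
A sketch of the case that does need one, deleting through a base pointer (the class names are made up):

    struct Connection {
        virtual ~Connection() {}          // virtual so the derived destructor runs
    };

    struct TlsConnection : Connection {
        ~TlsConnection() { /* release session state here */ }
    };

    void shutdown(Connection* c) {
        delete c;   // without a virtual destructor in Connection this would skip
                    // ~TlsConnection() and be undefined behavior
    }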

In-lining

In-lining as C++ efficiency

C++ efficiency is based on compile time optimization.

But the compiler cannot optimize what it cannot see.

So a lot of patterns in C++ optimization are based on letting the compiler see more of your code.

This makes the compiler see through a lot of the abstractions that we need as developers and still do only the things that need to get done.

We maintain layered software, which is good for maintenance, while the compiler does the right thing. Good deal.

Problems of in-lining

The problem of in-lining is that you need to let the compiler see more of your code.

This means you have to write more code in header files.

Which makes the compilation longer.

And modifications to the implementation now cause long project recompiles.

This is similar to the problems of templates in C++.

Only use when you really need the performance boost.

Advantages of in-lining

Avoiding yet another function call. A function call is not cheap since it uses the stack (RAM).

Avoiding storing registers into RAM before the function call and collecting them after the call. Since RAM access is slow compared to register access, this is a very good speedup.

Better allocation of registers since the allocation is combined between the inner and outer functions.

Allow the compiler to do dead code elimination based on constants passed to the inner function.

Advantages of in-lining

Allows the compiler to reorder instructions between the outer and inner functions and so utilize the hardware better.

Less code executed and so less instruction cache misses.

Allows the compiler to do loop invariant optimization if the inner function is called from a loop in the outer.

Advantages of in-lining

Allows the compiler to know whether pointers passed to the inner function are really changed by it. If they are not (as in many cases) the compiler can avoid loads and stores to that address. This works even in the absence of correct const declarations.

A function call is an optimization barrier in the outer function and it is now removed.

In-lining - conclusions

In-lining provides many, many advantages to C++ programmers.

Obviously you have to compile with optimization to get those advantages.

And write code in header files.

When you do optimize in-lining occurs even in non header files.

A barrier to in-lining is calling an unknown function which usually happens when calling a virtual function. So try to avoid those in the fast path.

Efficiency topics

Fast construction and destruction

Consider the need to hold a resource which must be released (mutex, file,...)

A common pattern is to put this resource into a class that releases the resource upon destruction.

But if this pattern is used in the fast path then we must make sure that constructors and destructors will not hurt our performance.

Fast construction and destruction

Do not allocate on the heap; do it on the stack.

If you do use the stack then there is no delete with unknown type and so less need for a virtual destructor.

Try to avoid the need for a virtual destructor in order to avoid the need to go through the virtual table and the function call.

This does not mean you cannot use inheritance.

Fast construction and destruction

Avoid virtual functions in the object in order to avoid the extra virtual pointer assignment in the constructor.

In-line the class by putting it into a header file.

This will make sure that the construction and destruction will not trigger a function call which is expensive.

Try to avoid composition in this case as it may create even more work which involves the heap.

Fast construction and destruction

Remember that constructors construct all members that are not explicitly initialized using the member initialization syntax.

And their parents...

And so do destructors.

Try to keep fast classes as simple as possible.
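
Putting these points together, a minimal header-only guard might look like the sketch below; the Mutex type and its lock()/unlock() calls are placeholders for whatever locking primitive you actually use:

    // scoped_lock.h - header-only so construction and destruction can be in-lined;
    // no virtual methods, no heap allocation, intended to live on the stack.
    template <typename Mutex>
    class ScopedLock {
    public:
        explicit ScopedLock(Mutex& m) : m_(m) { m_.lock(); }
        ~ScopedLock() { m_.unlock(); }
        ScopedLock(const ScopedLock&) = delete;             // non-copyable
        ScopedLock& operator=(const ScopedLock&) = delete;
    private:
        Mutex& m_;
    };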

Lazy construction

If some fast paths do not require everything that you do inside the constructor, then avoid doing that work in the constructor.

An example could be the initialization of some composed inner member.

Instead only construct them lazily when needed.

This requires careful coding that makes sure that the member is not used unless initialized.
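
A sketch of one way to do it (the Cache member and Report class are hypothetical; a heap-allocated pointer is used here for simplicity, with the usual trade-off that the lazy path now touches the heap):

    struct Cache {
        Cache() { /* expensive construction work */ }
    };

    class Report {
    public:
        Report() {}                       // cheap: the expensive member is not built here
        ~Report() { delete cache_; }
        Report(const Report&) = delete;   // keep the sketch simple: non-copyable
        Report& operator=(const Report&) = delete;

        const Cache& cache() {
            if (!cache_)                  // built lazily, only on paths that need it
                cache_ = new Cache();
            return *cache_;
        }
    private:
        Cache* cache_ = nullptr;
    };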

Redundant construction

Whenever an object is composed of other objects and its constructor is triggered, the member objects are first created using their default constructors.

Only then is the constructor body run.

This means that if you initialize a member only within your constructor body you are paying for a redundant construction.

Instead, initialize with Class::Class() : attr(args).
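
A sketch of the difference (std::string stands in for any member with a non-trivial constructor):

    #include <string>

    class Slow {
        std::string name_;
    public:
        Slow(const std::string& n) {
            name_ = n;   // name_ was already default-constructed, now assigned: two steps
        }
    };

    class Fast {
        std::string name_;
    public:
        Fast(const std::string& n) : name_(n) {}   // constructed once, directly from n
    };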

Avoid creation of temporary variables

If you have an assignment between two types and an exact assignment operator is missing, it may be fulfilled by constructing a temporary out of the right hand side and then calling the assignment operator.

This may be expensive.

Try to make sure you have the right assignment operators in place to avoid these costly constructions of temporaries.
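
A sketch of the problem and the fix (the Buffer class is hypothetical):

    #include <string>

    class Buffer {                                   // hypothetical string-like class
        std::string data_;
    public:
        Buffer(const char* s) : data_(s) {}          // converting constructor
        Buffer& operator=(const Buffer& other) { data_ = other.data_; return *this; }

        // Exact-match overload: assigning from a C string no longer has to
        // build a temporary Buffer first.
        Buffer& operator=(const char* s) { data_ = s; return *this; }
    };

    void demo(Buffer& b) {
        b = "hello";   // calls operator=(const char*) directly, no temporary Buffer
    }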

Virtual functions cost

The assignment of the vptr.

The extra dereference in method calls.

The inability of the compiler to in-line the function.

The cost of the first two seems small. The cost of the third is unknown and may be large depending on the circumstances.

Virtual functions cost - when should you care?

The inability to in-line is indeed problematic, but remember that the function you did not in-line may still have calls in-lined within its own body.

This means that what you really care about is the call/return cost plus the additional in-lining that would have been possible from the compiler seeing the two pieces of code together.

If your function is long then you may not care about this at all; you should care only if your function needs to be very fast.

Virtual functions cost - solution 1

Avoid the virtual function altogether.

You may not really need virtual on it (are you over designing?)

Write two classes instead of one (do you really want to mix them or are you just using one in any particular run?)

Use a macro to decide which to use in each run?!?

Virtual functions cost - solution 2

Pass your class to the using class as a template parameter.

Templates are more performance friendly than inheritance hierarchies.

They push type resolution from runtime to compile time (which we consider to be free).

If the template class is in-line ready then it will be in-lined in the using class.
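
A sketch of the idea (the Worker and ConsoleLogger names are made up); the collaborator is supplied at compile time, so the call to it needs no vtable and can be in-lined:

    #include <cstdio>

    struct ConsoleLogger {                       // hypothetical policy class
        void info(const char* msg) { printf("%s\n", msg); }
    };

    template <typename Logger>
    class Worker {
    public:
        explicit Worker(Logger& log) : log_(log) {}
        void run() { log_.info("working"); }     // resolved at compile time
    private:
        Logger& log_;
    };

    int main() {
        ConsoleLogger log;
        Worker<ConsoleLogger> w(log);
        w.run();
        return 0;
    }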

References

References

Efficient C++: Performance Programming Techniques - Dov Bulka, David Mayhew.

Effective C++, Third Edition: 55 Specific Ways to Improve Your Programs and Designs - Scott Meyers.

More Effective C++: 35 New Ways to Improve Your Programs and Designs - Scott Meyers.

The C++ FAQ: http://www.parashift.com/c++-faq-lite/