Optimization
Optimization means making a program more efficient in terms of usage of some
computing resource or by any similar metric, commonly aiming for higher
execution speed or lower memory usage (but also e.g. lower power consumption,
lower network bandwidth demand etc.) while preserving how the program
functions externally; this can be done manually (by rewriting parts of your
program) or automatically (typically by the compiler when it's translating
your program). Unlike refactoring, which aims primarily for better readability
of source code, optimization changes the inner behavior of the executed
program to a more optimal one. Apart from optimizing programs/algorithms we
may also more widely talk about optimizing e.g. data structures, file formats,
hardware, protocols and so on.
Manual Optimization
These are optimizations you do yourself by writing better code or
fiddling with how you compile your code.
General Tips'N'Tricks
These are mainly for C, but may be usable in other
languages as well.
- Tell your compiler to actually auto optimize (-O3, -Os flags etc.). Also
check out further compiler flags that may help you turn off unnecessary things
you don't need, AND try out different compilers, some may just produce better
code. If you are brave also check even more aggressive flags like -Ofast and
-Oz, which may be even faster than -O3, but may break your program too.
- Watch out: what's fast on one platform may be slow on
another -- know your platform and compiler. This depends on the
instruction set as well as on compiler, operating
system, library implementation, available hardware, driver implementation and other details. In the end
you always need to test on the specific platform to be sure about how
fast it will run. For example with simple compilers that don't do much
auto optimization you may want to do clever tricks to optimize
manually, which may however in turn confuse smarter compilers that
optimize well but rely on idiomatic code, i.e. optimizing something for
one platform may make it slower on another platform. A good approach is
probably to optimize for the weakest platform you want to support -- if
it runs fast on a weak platform, a "better" platform will most likely
still run it fast (even if not optimally).
- gprof is a utility you can use to
profile your code. You can also program your own profiling but be careful, it's not trivial to do
it well.
- <stdint.h> has fast type nicknames, types such as uint_fast32_t which picks
the fastest type of at least the given width on the given platform.
- Actually measure the performance to see if your
optimizations work or not. Sometimes things behave counterintuitively
and you end up making your program perform worse by trying to optimize
it! Also make sure that you MEASURE THE PERFORMANCE CORRECTLY, many
beginners for example just try to measure the run time of a single simple
function call which doesn't really work, you want to measure
something like a million such function calls in a loop and then
average the time; also make sure the compiler doesn't auto remove your
code if it happens to have no effect (see the sketch below this list).
- Keywords such as inline, static, const and register can help the compiler
optimize well.
- Optimize the bottlenecks! Optimizing in the wrong
place is a complete waste of time. If you're optimizing a part of code
that's taking 1% of your program's run time, you will never speed up
your program by more than that 1% even if you speed up the specific part
by 10000%. Bottlenecks are usually inner-most loops of the main program
loop, you can identify them with profiling. A
typical bottleneck code is for example a shader
that processes millions of pixels per second. Generally initialization
code that runs only once in a long time doesn't need much optimization
-- no one is going to care if a program starts up 1 millisecond faster
(but of course in special cases such as launching many processes this
may start to matter).
- You can almost always trade space (memory usage) for time
(CPU demand) and vice versa and you can also fine-tune this.
You typically gain speed with precomputation (look up tables, more demanding
on memory) and save memory with compression (more demanding on CPU). A look up
table sketch is shown below this list.
- Static things are faster and smaller
than dynamic things. This means that
things that are somehow fixed/unchangeable are better in terms of
performance (and usually also safer and better testable) than things
that are allowed to change during run time --
for example calling a function directly (e.g. myVar = myFunc();) is both
faster and requires fewer instructions than calling a function by pointer
(e.g. myVar = myFuncPointer();): the latter is more flexible but at the price
of performance, so if you don't need flexibility (dynamic
behavior), use static behavior. This also applies to using constants (faster/smaller) vs variables, static vs dynamic typing, normal vs dynamic arrays etc.
- Be smart, use math, for
example simplify expressions using known
formulas, minimize logic circuits etc.
Example: let's say you want to compute the radius of a zero-centered bounding sphere of an N-point point cloud. Naively you might be computing
the Euclidean distance (sqrt(x^2 + y^2 + z^2)) to each point
and taking a maximum of them, however you can just find the maximum of
the squared distances (x^2 + y^2 + z^2) and return a square
root of that maximum. This saves you a computation of N - 1
square roots (see the sketch below this list).
- Fancier algorithms will very often be slower than simpler
ones, even if they are theoretically faster, i.e. don't get too
smart and first try the simple algorithm, greater complexity has to be
justified. This was commented on e.g. by Rob
Pike who said that "fancy algorithms are slow when n is small, and n
is usually small", i.e. if you're sorting an array of 10 items, just use
bubble sort, not quick sort etc.
- Learn about dynamic
programming.
- Avoid branches (ifs) if you can (remember ternary operators, loop conditions etc.
are branches as well). They break prediction in CPU pipelines and
instruction preloading and are often a source of great performance losses.
Don't forget that you can many times compare and use the result of
operations without using any branching (e.g. x = (y == 5) + 1; instead of
x = (y == 5) ? 2 : 1;).
- Use iteration instead of recursion if possible (calling a
function costs something). Know that it is ALWAYS possible to remove
recursive function calls.
- Use good enough approximations instead of completely
accurate calculations, e.g. taxicab distance instead of
Euclidean distance, capsule shape to represent the player's collision
shape rather than the 3D model's mesh etc. With a physics engine instead
of running the simulation at the same FPS as rendering, you may just run
it at half and interpolate between two
physics states at every other frame. Nice examples can also be found in
computer graphics, e.g. some software renderers use perspective-correct
texturing only for large near triangles and cheaper affine texturing for
other triangles, which mostly looks OK.
- Use quick bailout
conditions: many times before performing some expensive
calculation you can quickly check whether it's even worth performing it
and potentially skip it. For example in physics collision detection you may first
quickly check whether the bounding spheres of the bodies collide before
running an expensive precise collision detection -- if the bounding spheres
of the objects don't collide, it is not possible for the bodies to collide
and so we can skip the further collision detection (see the sketch below this list).
- Operations on static data can be accelerated with
accelerating structures (look-up tables
for functions, indices for database lookups,
spatial grids for collision checking, various trees ...).
- Use powers of 2 (1, 2, 4, 8, 16, 32, ...) whenever
possible, this is efficient thanks to computers working in binary. Not only may this help nice utilization and
alignment of memory, but mainly multiplication and division can be
optimized by the compiler to mere bit shifts which is a tremendous
speedup.
- Memory alignment usually helps speed, i.e.
variables at "nice addresses" (usually multiples of the platform's
native integer size) are faster to access, but this may cost some memory
(the gaps between aligned data).
- Write cache-friendly
code (minimize long jumps in memory).
- Compare to zero rather than other
values. There's usually an instruction that just checks the
zero flag which is faster than loading and comparing two arbitrary
numbers. For example in for loops where order of iteration doesn't
matter you may count down rather than up and compare if you're at zero.
{ E.g. under tcc I measured for (int i = 1000; i; --i) to save a bit of time
compared to for (int i = 0; i < 1000; ++i). ~drummyfish }
- Consider bit tricks (but
be aware that idiomatic code may be better
for advanced compilers), hacks for manipulating binary numbers in clever
ways only using very basic operations without which one might naively
write complex inefficient code with loops and branches. Example of a
simple bit trick is checking if a number is a power of two as
!(x & (x - 1)) && x.
- Consider moving computation from run time to compile
time, see preprocessor, macros and metaprogramming. E.g. if you make a
resolution of your game constant (as opposed to a variable), the
compiler will be able to partially precompute expressions with the
display dimensions and so speed up your program (but you won't be able
to dynamically change resolution).
- On some platforms such as ARM the first
arguments to a function may be passed via registers, so
it may be better to have fewer parameters in functions.
- Passing arguments costs something: passing a value
to a function requires a push onto the stack and later its pop, so
minimizing the number of parameters a function has, using global
variables to pass arguments and doing things like passing structs by
pointers rather than by value can help speed. { from Game
Programming Gurus -drummyfish }
- Optimize only when you already have working code and
when you can measure your optimizations. As Donald
Knuth put it: "premature optimization is the root of all evil".
Nevertheless you should get used to simple nobrainer efficient patterns
by default and just write them automatically. Also do one optimization
at a time, don't try to put in more optimizations at once.
- Use your own caches where they
help, for example if you're frequently working with some
database item you better pull it to memory and work with it there, then
write it back once you're done (as opposed to communicating with the DB
there and back).
- Single compilation
unit (one big program without linking) can
help compiler optimize better because it can see the whole code
at once, not just its parts. It will also make your program compile
faster.
- Search literature for algorithms with better complexity class (sorts are a nice example).
- For the sake of simple computers such as embedded platforms avoid floating point as that is often
emulated painfully slowly in software (and also inserts additional code,
making the executable bigger). Use fixed
point (a small sketch is shown below this list), or at least offer it as a fallback.
This also applies to other hardware requirements such as GPU or sound cards: while such hardware accelerates
your program on computers that have the hardware, making use of it may
lead to your program being slower on computers that lack it.
- Consider various overheads in
critical parts of code. Overhead is an extra resource price for
some kind of feature or mechanism you're using, for example a function call is a bit slower than directly
embedding code that's not inside a function, an OOP
program will use more memory than a non-OOP program just because it uses
OOP etc. That's why critical parts of code are often written in assembly -- to avoid overhead of higher level
languages. However, you can (and should) minimize overhead in other ways
also: for example in 3D graphics rendering
a single 3D object has a certain overhead, so instead of rendering a
scene with many separate 3D objects it's better to merge them all into a
single big 3D object.
- Factoring out invariants from loops and early branching can
create a speed up: it's sometimes possible to factor things out
of loops (or even long non-looping code that just repeats some things),
i.e. instead of branching inside the loop create two versions of the
loop and branch in front of them. This is a kind of space-time tradeoff.
Consider e.g. while (a) if (b) func1(); else func2(); -- if
b doesn't change inside the loop, you can rewrite this as
if (b) while (a) func1(); else while (a) func2();. Or in
while (a) b += c * d; if c and d don't
change (are invariant), we can rewrite to
cd = c * d; while (a) b += cd;. And so on.
- Division can be replaced by multiplication by reciprocal, i.e.
x / y = x * (1/y). The point is that multiplication is usually faster than
division. This may not help us when performing a single division by a
variable value (as we still have to divide 1 by y) but it does
help when we need to divide many numbers by the same variable number OR
when we know the divisor at compile time; we save time by precomputing
the reciprocal before a loop or at compile time. Of course this can also
easily be done with fixed point and
integers!
- Consider the difference between logical and bitwise
operators! For example AND and OR boolean functions in C have two variants, one
bitwise (& and |) and one logical (&& and ||) -- they behave a bit
differently but sometimes you may have a choice which one to use, then
consider this: bitwise operators usually translate to only a single fast
(and small) instruction while the logical ones usually translate to a
branch (i.e. multiple instructions with potentially slow jumps), however
logical operators may be faster because they are evaluated as short circuit (e.g. if first operand of
OR is true, second operand is not evaluated at all) while bitwise
operators will evaluate all operands.
- Consider the pros and cons of using indices vs
pointers: When working with arrays you usually have the choice
of using either pointers or indices, each option has advantages and
disadvantages; working with pointers may be faster and produce smaller
code (fewer instructions), but array indices are portable, may be
smaller and safer. E.g. imagine you store your game sprites as a
continuous array of images in RAM and your program internally
precomputes a table that says where each image starts -- here you can
either use pointers (which say directly the memory address of each
image) or indices (which say the offset from the start of the big image
array): using indices may be better here as the table may potentially be
smaller (an index into relatively small array doesn't have to be able to
keep any possible memory address) and the table may even be stored to a
file and just loaded next time (whereas pointers can't because on next
run the memory addresses may be different), however you'll need a few
extra instructions to access any image (adding the index to the array
pointer), which will however most definitely be negligible.
- Reuse variables to save space. A warning about this
one: readability may suffer, mainstreamers will tell you you're going
against "good practice", and some compilers may do this automatically
anyway. Be sure to at least make this clear in your comments. Anyway, on
a lower level and/or with dumber compilers you can just reuse variables
that you used for something else rather than creating a new variable
that takes additional RAM; of course a prerequisite for "merging"
variables is that the variables aren't used at the same time.
- To save memory use compression
techniques. (Needless to say this will however slow down the
code a bit, we're trading space for time here.) Compression doesn't
always have to mean you use a typical compression algorithm such as jpeg or LZ77, you may simply
just throw in a few compression techniques such as run length or word dictionaries into your data
structures. E.g. in Anarch maps are kept small
by consisting of a small dictionary of tile definitions and map cells
referring to this dictionary (which makes the cells much smaller than if
each one held a complete tile definition).
- Prefer preincrement over postincrement (typically
e.g. in a for loop), i.e. rather do ++i than i++ as the latter is a bit more
complex and normally generates more instructions.
- Mental calculation tricks, e.g. multiplying by one
less or more than a power of two is equal to multiplying by power of two
and subtracting/adding once, for example x * 7 = x * 8 - x; the
latter may be faster as a multiplication by power of two (bit shift) and
addition/subtraction may be faster than single multiplication,
especially on some primitive platform without hardware multiplication.
However this needs to be tested on the specific platform. Smart
compilers perform these optimizations automatically, but not every
compiler is high level and smart.
- With more than two branches use switch instead of
ifs (if possible) -- it should be common knowledge but some
newcomers may not know that switch is fundamentally different from if
branches: switch statement generates a jump table that can branch into
one of many case labels in constant time, as opposed to a series of if
statements which keeps checking conditions one by one, however switch
only supports conditions of exact comparison. So prefer using switch
when you have many conditions to check (but know that switch can't
always be used, e.g. for string comparisons). Switch also allows hacks
such as label fall through which may help some optimizations.
- Else should be the less likely branch, try to make
if conditions so that the if branch is the one with higher probability
of being executed -- this can help branch prediction.
- Similarly order if-sequences and switch cases from most
probable: If you have a sequence of ifs such as
if (x) ... else if (y) ... else if (z) ..., make it so that
the most likely condition to hold gets checked first, then the second most
likely etc. The compiler most likely can't know the probabilities of the
conditions so it can't automatically help with this. Do the same with
the switch statement -- even though switch typically gets
compiled to a table of jump addresses, in which case order of the cases
doesn't matter, it may also get compiled in a way similar to the if
sequence (e.g. as part of size optimization if the cases are sparse) and
then it may matter again.
- Variable aliasing: If in a function you are often
accessing a variable through some complex dereference of multiple
pointers, it may help to rather load it to a local variable at the start
of the function and then work with that variable, as dereferencing
pointers costs something. { from Game Programming Gurus
-drummyfish }
- You can save space by "squeezing" variables -- this
is a space-time tradeoff, it's a no brainer but nubs may be unaware of
it -- for example you may store 2 4bit values in a single char variable
(8bit data type), one in the lower 4 bits, one in the higher 4 bits (use bit
shifts etc., see the sketch below this list). So instead of 16
memory-aligned booleans you may create one int and use its
individual bits for each boolean value. This is useful in environments
with extremely limited RAM such as 8bit Arduinos.
- Consider lazy
evaluation (only evaluate what's actually needed).
- You can optimize critical parts of code in assembly, i.e. manually write the
assembly code that takes most of the running time of the program, with
as few and as inexpensive instructions as possible (but beware, popular
compilers are very smart and it's often hard to beat them). But note
that such code loses portability! So ALWAYS
have a C (or whatever language you are using) fallback code for other platforms, use ifdefs to switch to the fallback version on
platforms running on different assembly languages.
- Loop unrolling/splitting/fusion, function inlining
etc.: there are optimizations that are usually done by high
level languages at assembly level (e.g. loop
unrolling physically replaces a loop by repeated commands which gains
speed but also makes the program bigger). However if you're writing in
assembly or have a dumb compiler (or are even writing your own) you may
do these manually, e.g. with macros/templates etc. Sometimes you can
hint a compiler to perform these optimizations, so look this up.
- Parallelism (multithreading, compute shaders, ...) can astronomically
accelerate many programs, it is one of the most effective
techniques of speeding up programs -- we can simply perform several
computations at once and save a lot of time -- but there are a few
notes. Firstly not all problems can be parallelized, some problems are
sequential in nature, even though most problems can probably be
parallelized to some degree. Secondly it is hard to do, opens the door
for many new types of bugs, requires hardware support (software
simulated parallelism can't work here of course) and introduces dependencies; in other words it is huge bloat, we don't recommend parallelization unless a
very, very good reason is given. Optional use of SIMD instructions can be a reasonable midway to going
full parallel computation.
- Optimizing data: it's
important to remember we can optimize both algorithm AND data, for
example in a 3D game we may simplify our 3D models, remove parts of a
level that will never be seen etc. Ordering, grouping, aligning,
reorganizing the data, changing number formats, adding indices and so on
may help us achieve cache friendliness and simpler and/or faster
algorithms. For example a color palette may be
constructed so that certain desired operations are faster; this is seen
e.g. in Anarch where colors are arranged so that
darkening/brightening is done just by decrementing/incrementing the
color index. In raycasting engines it is
common to store images by columns rather than by rows as they will be
drawn by columns -- this simple change of how data is ordered increases
cache friendliness. And so on.
- Specialized hardware (e.g. a GPU)
astronomically accelerates programs, but as with the previous
point, portability and simplicity greatly suffer, your program becomes
bloated and gains dependencies, so always consider carefully whether to use
specialized hardware and offer software fallbacks.
- Optimization comes at a cost -- not counting the
time and energy put in, optimization will also probably make your source
code less readable, more complicated (and so more likely buggy), or
maybe even less portable etc. For this reason you should NOT optimize
everything, only optimize where it is worth it -- as said, optimizing
non-looped code to run 1 millisecond faster is almost always absolutely
useless, it's better to rather have a nicer code.
- Smaller code may also be faster as it allows more
instructions to fit into the cache.
- Do not optimize everything and at any cost: optimization often
makes the code more cryptic, it may bloat it,
bring in more bugs etc. Only optimize if it is worth the reward. { from
Game Programming Gurus -drummyfish }
- ...
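The following is a small sketch (in C, with made up names) of the advice on
measuring performance given above: a simple function is called a million times
in a loop, its results are actually used so the compiler can't just optimize
the work away, and the total time is measured with the standard clock().

    #include <stdio.h>
    #include <time.h>

    int myFunc(int x) { return x * x + 3; } /* whatever we want to measure */

    int main(void)
    {
      volatile int sink = 0; /* volatile prevents optimizing the loop away */
      clock_t t = clock();

      for (int i = 0; i < 1000000; ++i)
        sink += myFunc(i);

      t = clock() - t;
      printf("%f seconds\n", ((double) t) / CLOCKS_PER_SEC);
      return 0;
    }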
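Here is a sketch of the bounding sphere example from the "be smart, use math"
tip (the point type and function name are made up for illustration): only one
square root is computed at the very end instead of one per point.

    #include <math.h>

    typedef struct { float x, y, z; } Point;

    float boundingSphereRadius(const Point *points, int count)
    {
      float maxDistSquared = 0;

      for (int i = 0; i < count; ++i)
      {
        float d = /* squared distance from origin, no sqrt here */
          points[i].x * points[i].x +
          points[i].y * points[i].y +
          points[i].z * points[i].z;

        if (d > maxDistSquared)
          maxDistSquared = d;
      }

      return sqrt(maxDistSquared); /* single sqrt for the whole point cloud */
    }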
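The space-time tradeoff (look up table) tip may look e.g. like this sketch of
an 8bit sine function (names and table format are just assumptions for the
example): we sacrifice 256 bytes of memory and in return any later call is a
single array read, with no floating point at run time.

    #include <math.h>
    #include <stdint.h>

    uint8_t sinTable[256]; /* the 256 bytes we trade for speed */

    void initSinTable(void) /* called once at startup */
    {
      for (int i = 0; i < 256; ++i)
        sinTable[i] = 127.5 + 127.5 * sin(i * 2.0 * 3.14159265 / 256.0);
    }

    /* fast approximate sine: angle 0..255 maps to 0..2*pi, result 0..255 */
    uint8_t fastSin(uint8_t angle)
    {
      return sinTable[angle];
    }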
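A minimal fixed point sketch, assuming a Q16.16 format (i.e. the integer holds
the value multiplied by 65536 -- this is just one common choice, not the only
one):

    #include <stdint.h>

    typedef int32_t fixed; /* holds the value multiplied by 65536 */

    #define FIXED_UNIT 65536 /* representation of 1.0 */

    #define INT_TO_FIXED(x) ((x) * FIXED_UNIT)
    #define FIXED_TO_INT(x) ((x) / FIXED_UNIT)

    fixed fixedMul(fixed a, fixed b)
    {
      return ((int64_t) a * b) / FIXED_UNIT; /* widen to avoid overflow */
    }

    fixed fixedDiv(fixed a, fixed b)
    {
      return ((int64_t) a * FIXED_UNIT) / b;
    }

For example INT_TO_FIXED(3) / 2 represents 1.5, and
fixedMul(INT_TO_FIXED(3), FIXED_UNIT / 2) gives the same value.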
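The quick bailout condition in collision detection might look like the
following sketch (the body type and the expensive precise test are made up
stand-ins):

    typedef struct
    {
      float x, y, z;
      float boundingRadius;
      /* ... the actual shape data would go here ... */
    } Body;

    int collidePrecise(const Body *a, const Body *b)
    {
      return 1; /* stand-in for the expensive precise test */
    }

    int collide(const Body *a, const Body *b)
    {
      float dx = a->x - b->x, dy = a->y - b->y, dz = a->z - b->z;
      float r = a->boundingRadius + b->boundingRadius;

      if (dx * dx + dy * dy + dz * dz > r * r)
        return 0; /* bounding spheres don't touch => bail out cheaply */

      return collidePrecise(a, b); /* only now run the expensive test */
    }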
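Finally a sketch of the variable "squeezing" tip: two 4bit values packed into
a single 8bit variable with plain shifts and masks (nothing here is any
standard API, just ordinary C):

    #include <stdint.h>

    uint8_t pack(uint8_t low, uint8_t high) /* both values must fit in 4 bits */
    {
      return (low & 0x0f) | ((high & 0x0f) << 4);
    }

    uint8_t unpackLow(uint8_t packed)  { return packed & 0x0f; }
    uint8_t unpackHigh(uint8_t packed) { return packed >> 4; }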
When To Actually Optimize?
Nubs often ask this and this can also be a very nontrivial question.
Generally fine, sophisticated optimization should come as one of the
last steps in development, when you actually have a working thing. These
are optimizations requiring significant energy/time to implement -- you
don't want to spend resources on them at a stage when the optimized parts may
well end up being dropped, or when they won't matter because they'll be
outside the bottleneck. However there are two "exceptions".
The highest-level optimization is done as part of the initial design
of the program, before any line of code gets written. This includes the
choice of data structures and mathematical models you're going to be
using, the very foundation around which you'll be building your castle.
This happens in your head at the time you're forming an idea for a
program, e.g. you're choosing between server-client or P2P,
monolithic or micro kernel, raytraced or rasterized graphics etc. These choices
affect greatly the performance of your program but can hardly be changed
once the program is completed, so they need to be made beforehand.
This requires wide knowledge and experience as you work
by intuition.
Another kind of optimization done during development is just
automatically writing good code, i.e. being familiar with specific
patterns and using them without much thought. For example if you're
computing some value inside a loop and this value doesn't change between
iterations, you just automatically put computation of that value
before the loop. Without this you'd simply end up with
shitty code that would have to be rewritten line by line at the end.
Yes, compilers can often do this simple kind of optimization for you,
but you don't want to rely on it.
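A trivial sketch of that pattern (the names are made up): the value that
doesn't change between iterations is computed once before the loop.

    void addScaledOffset(int *values, int count, int scale, int offset)
    {
      int scaledOffset = scale * offset; /* invariant, computed only once */

      for (int i = 0; i < count; ++i)
        values[i] += scaledOffset; /* not values[i] += scale * offset; */
    }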
Automatic Optimization
Automatic optimization is typically performed by the compiler;
usually the programmer has the option to tell the compiler how much and
in what way to optimize (no optimization, mild optimization, aggressive
optimization, optimization for speed, size; check e.g. the man pages of
gcc where you can see how to turn on even specific
types of optimizations). Some compilers perform extremely complex
reasoning to make the code more efficient, the whole area of
optimization is a huge science -- here we'll only take a look at the
very basic techniques. We see optimizations as transformations of the
code that keep the semantics the same but minimize or maximize some
measure (e.g. execution time, memory usage, power usage, network usage
etc.). Automatic optimizations are usually performed on the intermediate
representation (e.g. bytecode) as that's the
ideal way (we only write the optimizer once), however some may be
specific to some concrete instruction set -- these are sometimes called
peephole optimizations and have to be delayed until code
generation.
There also exist dynamic optimization techniques
performed at runtime by the platform running the program (interpreter,
emulator, virtual machine, ...).
The following are some common methods of automatic optimization (also
note that virtually any method from the above mentioned manual
optimizations can be applied if only the compiler can detect the
possibility of applying it):
{ Tip: man pages of gcc or possibly other compilers detail specific
optimizations they perform under the flags that turn them on, so see
these man pages for a similar overview. ~drummyfish }
- Replacing instructions with faster equivalents: we
replace an instruction (or a series of instructions) with another one
that does the same thing but faster (or with fewer instructions etc.).
Typical example is replacing multiplication by power of two with a bit
shift (e.g. x * 8 is the same as x << 3).
- Inlining: a function call may usually (not always
though, consider e.g. recursion) be replaced
by the function code itself inserted in the place of the call (so called
inlining). This is faster but usually makes the code bigger so the
compiler has to somehow judge and decide when it's worth to inline a
function -- this may be affected e.g. by the function size (inlining a
short function won't make the code that much bigger), programmer's hints
(inline keyword, optimize for speed rather than size etc.)
or guesstimating how often the function will be called. A function that is
only called in one place can be safely inlined.
- Loop unrolling: duplicates the body of a loop,
making the code bigger but increasing its speed (a condition check is
saved). E.g. for (int i = 0; i < 3; ++i) func(); may be
replaced with func(); func(); func();. Unrolling may be
full or just partial.
- Lazy evaluation/short
circuit/test reordering: the principles of lazy evaluation
(evaluate function only when we actually need it) and short circuit
evaluation (don't further evaluate functions when it's clear we won't
need them) may be auto inserted into the code to make it more efficient.
Test reordering may lead to first testing simpler things (e.g. equality
comparison) and leaving complex tests (function calls etc.) for
later.
- Algebraic laws, expression evaluation: expressions
may be partially preevaluated and manipulated to stay mathematically
equivalent while becoming easier to evaluate, for example
1 + 3 + 5 * 3 * x / 6 may be transformed to just 4 + 5 * x / 2.
- Removing instructions that cancel out: for example
in Brainfuck the series of instructions +++-- may be shortened to just +.
- Removing instructions that do nothing: generated
code may contain instructions that just do nothing, e.g. NOPs that were
used as placeholders that never got replaced; these can be just
removed.
- Register allocation: most frequently used variables
should be kept in CPU registers for fastest access.
- Removing branches: branches are often expensive due
to not being CPU pipeline friendly, they can sometimes be replaced by a
branch-free code, e.g. if (a == b) c = 1; else c = 0; can
be replaced with c = a == b;.
- Generating lookup tables: if
the optimizer judges some function to be critical in terms of speed, it
may auto generate a lookup table for it, i.e. precompute its values and
so sacrifice some memory for making it run extremely fast.
- Dead code removal: parts of code that aren't used
can be just removed, making the generated program smaller -- this
includes e.g. functions that are present in a library which however aren't used by the specific
program or blocks of code that become unreachable e.g. due to some #define
that makes an if condition always false etc.
- Compression:
compression methods may be applied to make data smaller and optimize for
size (for the price of increased CPU usage).
- Dynamic
recompilation/JIT compilation (typical
for interpreted/emulated programs): these terms seem to not have very
clear definitions but the basic idea is that of compiling the program
late and/or only certain parts of it: we may compile the program as soon as
it gets executed OR keep compiling parts of it as it runs, i.e. where we
are interpreting some kind of bytecode we may for
example be turning parts of it into faster native code. Compiling
parts of the program as it is running has advantages and may in theory
even result in faster running program than that produced by a
traditional compiler because a dynamic compiler has more information
about the program: it can measure which parts of the program take most
computational time and these can be turned into native code, resulting
in significant optimization.
- ...
See Also
Powered by nothing. All content available under CC0 1.0 (public domain). Send comments and corrections to drummyfish at disroot dot org.