Pseudorandom data is data that appears (for example by its statistical properties) to have been generated by a random process despite in fact having been generated by a deterministic (i.e. non-random) process. I.e. it's a kind of "fake" but mostly good enough randomness that arises from chaotic systems -- systems that behave without randomness, by strict rules, but which scramble, shuffle, twist and transform the numbers in such a complicated way that they eliminate obvious patterns and leave the data looking very "random", though the numbers would be scrambled exactly the same way if the process was repeated with the same conditions, i.e. it is possible (if we know how the generator works) to exactly predict which numbers will fall out of a pseudorandom generator. This is in contrast to "true randomness" that (at least to what most physicists probably think) appears in some processes in nature (most notably in quantum physics) and which are absolutely unpredictable, even in theory. Pseudorandomness is typically used to emulate true randomness in computers because for many things (games, graphics, audio, random sampling, ...) it is absolutely sufficient, it is easy to do AND the repeatability of a pseudorandom sequence is actually an advantage to engineers, e.g. in debugging in which we have to replicate bugs we find, or in programs that simply have to behave deterministic (e.g. many network games). True randomness is basically only ever needed for cryptography/security (or possibly for rare applications where we absolutely NEED to ensure lack of any patterns in the data), it is a bit harder to achieve because we need some unbiased source of real-world random data. Pseudorandom generators are so common that in most contexts in programming the word "random" silently actually means just "pseudorandom".
A saying about psedorandom numbers states that "randomness is a task too important to be left to chance".
Pseudorandom numbers are not to be confused with quasirandom numbers (AKA low discrepancy sequences) which are imitating randomness yet in a weaker way (just attempting to be spaced far apart).
Firstly let's mention that we can use look up tables, i.e. embed some high quality random data right into our program and then use that as our random numbers, taking one after another and getting back to start once we run out of them. This is for example how Doom's pseudorandom generator worked. This is easy to do and extremely fast, but will take up some memory and will offer only a quite limited sized sequence (your generator will have a short period), so ponder on the pros and cons for your specific needs. From now on we'll leave this behind and will focus on really GENERATING the pseudorandom values with some algorithm, but look up tables may still be kept in mind (they might even perhaps be somehow combined with the true generators).
There are possibly many ways to approach generating pseudorandom numbers (for example you can just run some chaotic cellular automaton and then convert it to numbers somehow), however we'll be describing the most typical way, used in implementation of programming languages etc.
Pseudorandom generators generate an infinite (but in practice eventually repeating) sequence of numbers from some initial starting number which we call a seed (i.e. it's important to rather think of sequences than of individual "random numbers" -- a number itself can hardly be random). For the same seed number the same sequence will always be generated (of course supposing we use the same generator). If you ask the generator for the next pseudorandom number, it will just generate the next number in the sequence and pass it to you. (In practice programs often use system time as the generator's seed number because that will make the program generate different numbers each time it is run.)
The next number in the sequence is typically only generated from the previous number by performing some chaotic operations with it that transform it into a new number. However this is not entirely true because then the generator would have the following weakness: if it generated number A and then B, then every number A would ALWAYS be followed by number B -- that doesn't look exactly random. For this reason the generator internally keeps a big number with which it operates and it only returns some N bits (for example the highest 16 bits) of that number to you. This way it is possible (from your point of view) to have the same number be followed by different numbers (though this doesn't hold for the generator's internal value of course, but that doesn't bother us, we are only looking at the output sequence).
The number of bits that the generator takes from its internal number and gives you determines the range of random values that you can get. For example if the generator gives you 16 bit numbers, you know the numbers will be in range 0 to 65535. (This can be used to also generate numbers in any range but we'll look at that later.)
Now let's realize another important thing -- if the generator has some internal number, which is the only thing that determines the next number, and if its internal number has some limited size -- let's say for example 32 bits -- then the sequence HAS TO start repeating sometimes because there is a limited number of values the internal number can be in and once we get to the same number, it will have to evolve the same way it evolved before (because we have a deterministic generator and the number is the whole generator's state). Imagine this as a graph: numbers are nodes, the seed is the node we start in, there are finitely many nodes and each one leads to exactly one other node -- going through this graph you inevitably have to end up running in some loop. For this reason we talk about the period of a pseudorandom generator -- this period says after how many values the sequence will start to repeat. In fact it is possible that only last N values of the initial sequence will start to repeat -- again, if you imagine the graph, it is possible that an initial path leads us to some smaller loop in which we then keep cycling. This may depend on the seed, so the whole situation can get a bit messy, but we can resolve this, just hold on.
It's not hard to see that the period of the generator can be at most 2 to the power of the number of bits of the generator's internal value (i.e. the number of possible distinct values the number can be, or the nodes in the graph). We want to achieve this maximum period -- indeed, it is ideal if we can make it as long as possible, but achieving the maximum period will also mean the period won't depend on the initial seed! If you imagine the graph, having a big loop over all the values means that there are no other loops, there's just one long repeating sequence in which each internal value appears exactly once, so no matter where we start (which seed we set), we'll always end up being in the same big loop. In addition to this we ALSO get another awesome thing: the histogram (the count) of all values over the whole period will be absolutely uniform, i.e. every value generated during one period will appear exactly the same number of times (which is what we expect from a completely random, uniform generator) -- this can be seen from the fact that we are returning N bits of some bigger internal number of N + M bits, which will come through each possible value exactly once, so each possible value of N will have to appear and each of these values will have to appear with all possible values of the remaining M bits, which will be the same for all values.
Now let's take a look at specific generators, i.e. types of algorithms to generate the numbers. There are many different kinds, for example Mersenne Twister, middle square etc., however probably the most common type is the linear congruential generator (LCG) -- though for many decades now it's been known this type of generator has some issues (for example less significant bits have shorter and shorter periods, for which we usually want to use a very big internal value and return its highest bits as the result), it will likely suffice for most of your needs, but you have to be careful about choosing the generator's parameters correctly. It works like this: given a current (internal) number x[n] (which is initially set to the seed number), the next number in the sequence is computed as
x[n + 1] = (A * x[n] + C) mod M
where A, C and M are constants -- these are the parameters of the generator. I.e. the next number in the sequence is computed by taking the current number, multiplying it by some constant, adding some constant and then taking modulo (remainder after division) of this by some constant. The catch is in choosing the right constants A, C and M -- it seems there are only a few combinations of these constants that will yield quality sequences. Furthermore we try to make the equation fast to compute, so we often aim to choose M to be some power of two -- if we are for example using 32 bit numbers, then choosing M to be 2^32 is extremely convenient, we simply let the number overflow (the overflow does the same thing as modulo by 2^32) or, at worse, perform a simple logical and to mask out the lowest bits. It would be cool to also have A a power of two because then also the multiplication would be very fast and simple (just a bit shift), BUT this can't really be done because then the multiplication wouldn't actually achieve much randomness, it would just shift the number left, meaning the higher bits of x[n + 1] would really be the same (or at least corelated with) the lower bits of x[n].
So how to choose the A, C and M constants? Let's say we take M to be a power of two, for reasons stated above. Now it is proven (Hull-Dobell theorem) that we'll get the maximum possible period (the thing we absolutely WANT) if exactly all of the following conditions hold: M and C must have the highest common divisor 1 (i.e. they must be coprime) AND A - 1 is divisible by all prime factors of M AND A - 1 is divisible by 4 if M is divisible by 4. So basically we want all of these conditions to always hold -- however even if they hold, we don't necessarily have a statistically good generator yet (to ensure this see the tests below). The value of C isn't very significant, it's enough if the above conditions hold for it, so many just use e.g. 1 or some prime number, but choosing A correctly turns out to be quite hard.
Here is a template for a C linear congruential generator:
T _currentRand; // the internal number
void randomSeed(unsigned int seed)
{
_currentRand = seed;
}
unsigned int random()
{
_currentRand = A * _currentRand + C;
return _currentRand >> S;
}
Here T
is the data type of the internal number (implying the M constant) -- use some fixed width type from stdint.h
here. A
is the multiplicative constant, C
is the additive constant and S
is a shift that implies how many highest bits of the internal number will be taken (and therefore what range of numbers we'll be getting). The following table gives you a few hopefully good values you can just plug into this snippet to get an alright generator:
T |
A |
C |
S |
note |
---|---|---|---|---|
uint32_t |
32310901 | 37 | 16 or 24 | A from L'Ecuyer |
uint32_t |
2891336453 | 123 | 16 or 24 | A from L'Ecuyer |
uint32_t |
2147001325 | 715136305 | 16 | from Entacher 1999 |
uint32_t |
22695477 | 1 | 16 | from Entacher 1999 |
uint64_t |
2862933555777941757 | 12345 | 32 or 48 | A from L'Ecuyer |
uint64_t |
3935559000370003845 | 1719 | 32 or 48 | A from L'Ecuyer |
{ I pulled the above numbers from various sources I found, mentioned in the note, tried to select the ones that were allegedly good, I also quickly tested them myself, checked the period was at maximum at least for the 32 bit generators and lower and ran it through ent which reported good results. ~drummyfish }
Let's also quickly mention another kind of generator as an alternative -- the middle square plus Weyl sequence generator. Middle square generator was one of the first and is very simple, it simply starts with a number (seed), squares it, takes its middle digits as the next number, squares it, takes its middle digits and so on. The issue with this was mainly getting a number 0, at which we get stuck. A 2022 paper by Wydinski seems to have solved this issue by employing so called Weyl sequence -- basically just adding some odd number in each step, though the theory is a bit more complex, the paper goes on to prove a high period of this generator. An issue seems to be with seeding the sequence -- the generator has three internal numbers and they can't just be blindly set to "anything" (the paper gives some hints on how to do this). Here is a 32 bit variant of such generator (the paper gives a 64 bit one):
{ I tried to make a 32 bit version of the generator, tried to choose the _rand3
constant well -- after quickly testing this the values of the generator looked alright, though I just mostly eyeballed the numbers, each bit separately, checked the mean of some 4000 values and the histogram of 1 million values. I also ran this through ent and it gave good results except for the chi square test that reported the rarity of the sequence at something over 5% (i.e. not impossible, but something like 1 in 20 chance). I'm not claiming this version to be statistically good, but it may be a start for implementing something nice, use at own risk. ~drummyfish }
#include <stdint.h>
uint32_t _rand1, _rand2, _rand3 = 0x5e19dbae;
uint16_t random()
{
_rand2 += _rand3;
_rand1 = _rand1 * _rand1 + _rand2;
_rand1 = (_rand1 >> 16) | (_rand1 << 16);
return _rand1;
}
NOTE on the code: the (_rand1 >> 16) | (_rand1 << 16)
operation effectively makes the function return lower 16 bits of the squared number's middle digits, as multiplying _rand1
(32 bit) by itself results in the lower half of a 64 bit result.
Yet another idea might be to use some good hash just on numbers 1, 2, 3, 4 ... The difference here is we are not computing the pseudorandom number from the previous one, but we're computing Nth pseudorandom number directly from N. This will probably be slower. For example: { Again, no big guarantees, but I ran this through ent again and got very good results. ~drummyfish }
#include <stdint.h>
uint32_t _rand = 0;
uint32_t random(void)
{
uint32_t x = _rand;
_rand++;
x = 303484085 * (x ^ (x >> 15));
x = 985455785 * (x ^ (x >> 15));
return x ^ (x >> 15);
}
void randomSeed(uint32_t seed)
{
_rand = seed; // this alone just offsets the sequence
seed = random(); // this is an attempt at fix
}
How to generate a number in certain desired range? As said your generator will be giving you numbers of certain fixed number of bits, usually something like 16 or 32, which means you'll be getting numbers in range 0 to 2^bits - 1. But what if you want to get numbers in some specific range A to B (including both)? To do this you just need to generate a number in range 0 to B - A and then add A to it (e.g. to generate number from 20 to 30 you generate a number from 0 to 10 and add 20). So let's just suppose we want a number in range 0 to N (where N can be B - A). Let's now suppose N is lower than the upper range of our generator, i.e. that we want to get the number into a small range (if this is not the case, we can arbitrarily increase the range of our generator simply by generating more random bits with it, i.e we can join two 16 bit numbers to get a 32 bit number etc.). Now the most common way to get the number in the desired range is by using modulo (N + 1) operation, i.e. in C we simply do something like int random0to100 = random() % 101;
. This easily forces the number we get into the range we want. HOWEVER beware, there is one statistical trap about this called the modulo bias that makes some numbers slightly more likely to be generated than others, i.e. it biases our distribution a little bit. For example imagine our generator gives us numbers from 0 to 15 and we want to turn it into range 0 to 10 using the modulo operator, i.e. we'll be doing mod 11 operation -- there are two ways to get 0 (0 mod 11 and 11 mod 11) but only one way to get 9 (9 mod 11), so number 0 is twice as likely here. In practice this effect isn't so strong and in many situations we don't really mind it, but we have to be aware of this effects for the sake of cases where it may matter. If necessary, the effect can be reduced -- we may for example realize that modulo bias will be eliminated if the upper range of our generator is a multiple of the range into which we'll be converting, so we can just repeatedly generate numbers until it falls under a limit that's a highest multiple of our desired range lower than the true range of the generator.
What if we want floating point/fixed point numbers? Just convert the integer result to that format somehow, for example ((float) random()) / ((float) RANDOM_MAX_VALUE)
will produce a floating point number in range 0 to 1.
How to generate other probability distributions? Up until now we supposed a uniform probability distribution, i.e. the most random kind of generator that has an equal probability of generating any number. But sometimes we want a different distribution, i.e. we may want some numbers to be more likely to be generated than others. For this we normally start with the uniform generator and then convert the number into the new distribution. For that we may make use of the following:
{ If you want to look at some C code that tries to generate good pseudorandom numbers, take a look at gsl (GNU Scientific Library). ~drummyfish }
TODO: some algorithms from the gsl library described at https://www.gnu.org/software/gsl/doc/html/rng.html
This topic alone can be extremely complex, you could probably do a PhD on it, let's just do some overview for mere noobs like us.
Firstly you want your generator to be simply a good program in general, i.e. you want it to be as fast, small and as simple as possible while generating nice sequences. Consider that someone will want to use it to for example generate a white noise for a whole HD screen 60 times per second -- that may be something like 100 million numbers per second! We often judge generators by type and number of operations they need to generate the next number, as that will imply the speed -- as always, you generally want to avoid general division, branching, minimize multiplications and so on. Typically you'll probably want something like one multiplication and a few fast operations like an addition and modulo by power of two. That's basically what the congruential generators do. See also optimization.
However the core of a pseudorandom generator is the quality of the sequence itself, i.e. ensuring it really looks very random, that it has no patterns. To a noob this sounds easy, he thinks that if you just make some random shit with the numbers it's going to look random, he also thinks that the more things you do with the numbers the more random they'll look -- this is FALSE (typically if you just throw in more operations without thinking you'll make the sequence worse). It's not easy at all to make a random looking sequence of numbers! There is a danger in that with some effort you can make a sequence that will look random to you, but once you use the sequence to generate something -- for example some kind of noise -- you'll see there is actually some pattern, for example you'll see some straight lines in the noise etc. For some applications it may be sufficient to have a lower quality generator, but the danger lies in the cases in which you need a good generator and you think you have one while in fact you don't -- typically scientific simulation have to be extremely careful about this. For example if you're doing genetic programming and you need to somehow randomly pick organisms so as to make the evolution fair -- if you have a bad generator, your program won't work because it may secretly be not too random in its choices, it will be unfair -- in this case it will be VERY hard for you to find the cause of why your program doesn't work (in the worse case you'll think it works and will trust its results when in fact the results are bad). To avoid this trap you need to actually test the quality of your sequence (TBH if you want a really good generator you should definitely use something better than the mentioned congruential generator). Let's see some general stuff you can try (considering the base case of generating uniformly distributed numbers):
Powered by nothing. All content available under CC0 1.0 (public domain). Send comments and corrections to drummyfish at disroot dot org.