Scro says he intentionally biased it. What that implies I'm not sure. Maybe it's a sort of cheat to beat up on PractRand and it ain't quite as good as the score suggests.

Re: Intentionally biased.

Suppose you have 4 billion 32 bit random numbers. How many duplicate values should there be? How many of each possible value? Etc. If these are drawn from a good PRNG or TRNG, there's an expected distribution to the answer, and each time you draw that many random numbers from the source you'll get a value from that distribution. If they're drawn from an random number generator with only 32 bits of state though, each time you'll get exactly the same value. For an "unbiased" PRNG, the value you get will be really extreme - every possible 32 bit value will occur exactly once, a ridiculous coincidence if it had come from a TRNG or good PRNG. For a (correctly) biased PRNG of 32 bits of state, the value will look much more reasonable, like it was drawn from the correct distribution - the problem is, although, because there's only 32 bits of state, there's only one possible set of 4 billion numbers to end up with, so you'll always get the same value, not an actual distribution.

Is that actually better? Well... yes, mostly. PractRand will rate it higher, for very good reasons - it looks MUCH more like a plausible result from a good RNG. But it means that many possible 32 bit values never actually occur. Some people feel that having the output sequence be a permutation of all possible 32 bit values is more appropriate than having it look like 4 billion numbers drawn from a good PRNG, despite looking a lot less like good PRNG output that way.

If the PRNG state were bigger, this question wouldn't actually matter. A sequence of 2^128 128 bit random values is functionally impossible to distinguish from a random permutation of 2^128 bit values, so most people don't care anymore at that size. Even at 64 bits, a random permutation vs a random sequence is very hard to tell apart, though it can just barely be done if you use tons of time and memory working on it (PractRand doesn't bother, at the moment).

Previously I would have naively expected a "good" PRNG with 32 bits of state to visit every possible state once in a 4 giga sample space. Clearly this is not what would happen with a true random selection of 32 bits four billion times.

Is that actually better? Well... yes, mostly... Some people feel that...

Looks to me like "randomness" is in the eye of the beholder.

At the moment I'm leaning towards preferring that a PRNG with 32 bits of state does visit every possible output exactly once in 4 giga samples.

In what situation is that detrimental, apart from satisfying statistical tests like PractRand ?

Sure it means that somebody collecting enough samples can tell that you have a PRNG instead of a source of real randomness.

But presumably they can do that if there is a bias added as well.

If XORO32 can be used to produce a high-quality 16-bit result per iteration, then I could enhance the instruction to do TWO iterations, get 16-bit sums from each, and substitute the high-quality 32-bit PRNG result into the next instruction's S value. Meanwhile, the double iteration would be written back to the original D in the XORO32 instruction.

So, do a XORO32 iteration and store the 32-bit state in S. Do another XORO32 iteration, then sum the two states as two parallel 16-bit adds, with the parity jiggery-pokery, to give a 32-bit PRN in D. Is that correct and is there time to do all this?

I'm all for XORO32 doing the summing itself, if possible. We'd need a XORO32 write for the seed and a XORO32 read for the PRN.

It's all in one!

Because the iteration is just a bunch of XOR's, we could do a double iteration at once, grabbing results from each to perform the separate 16-bit adds and parity computations, in order to compute the compound PRNG output for the next instruction's S value. The 'XORO32 D' will write the double-iterated value (which is separate from the PRNG output) back into D, which is just a register. To seed the PRNG, just put a non-0 value into the register being used.

XORO32 D 'iterate D and put 32-bit PRNG result into next instruction's S

Example:

XORO32 rnd 'iterate rnd
MOV outb,0 'write XORO32 PRNG result to outb

Just wondering whether the MOV could be replaced by a "XORO32B" that does seem extra bit scrambling. In other words, XORO32 could do two different things, determined by a spare opcode bit. The same adder could be used in both parts and the new second part would not be compulsory.

XORO32 rnd,0 'iterate rnd
MOV outb,0 'write XORO32 PRNG result to outb
XORO32 rnd,0 'iterate rnd
XORO32 outb,1 'write XORO32 PRNG result after some extra processing to outb

Suppose you have 4 billion 32 bit random numbers. How many duplicate values should there be? How many of each possible value? Etc. If these are drawn from a good PRNG or TRNG, there's an expected distribution to the answer, and each time you draw that many random numbers from the source you'll get a value from that distribution. If they're drawn from an random number generator with only 32 bits of state though, each time you'll get exactly the same value. For an "unbiased" PRNG, the value you get will be really extreme - every possible 32 bit value will occur exactly once, a ridiculous coincidence if it had come from a TRNG or good PRNG. For a (correctly) biased PRNG of 32 bits of state, the value will look much more reasonable, like it was drawn from the correct distribution - the problem is, although, because there's only 32 bits of state, there's only one possible set of 4 billion numbers to end up with, so you'll always get the same value, not an actual distribution.

Is that actually better? Well... yes, mostly. PractRand will rate it higher, for very good reasons - it looks MUCH more like a plausible result from a good RNG. But it means that many possible 32 bit values never actually occur. Some people feel that having the output sequence be a permutation of all possible 32 bit values is more appropriate than having it look like 4 billion numbers drawn from a good PRNG, despite looking a lot less like good PRNG output that way.

If two consecutive xoroshiro32+ outputs are treated as one 32-bit value there will be gaps. Some 16+16 combinations will crop up twice or maybe three times, while others never occur.

Suppose you have 4 billion 32 bit random numbers. How many duplicate values should there be? How many of each possible value? Etc. If these are drawn from a good PRNG or TRNG, there's an expected distribution to the answer, and each time you draw that many random numbers from the source you'll get a value from that distribution. If they're drawn from an random number generator with only 32 bits of state though, each time you'll get exactly the same value. For an "unbiased" PRNG, the value you get will be really extreme - every possible 32 bit value will occur exactly once, a ridiculous coincidence if it had come from a TRNG or good PRNG. For a (correctly) biased PRNG of 32 bits of state, the value will look much more reasonable, like it was drawn from the correct distribution - the problem is, although, because there's only 32 bits of state, there's only one possible set of 4 billion numbers to end up with, so you'll always get the same value, not an actual distribution.

Is that actually better? Well... yes, mostly. PractRand will rate it higher, for very good reasons - it looks MUCH more like a plausible result from a good RNG. But it means that many possible 32 bit values never actually occur. Some people feel that having the output sequence be a permutation of all possible 32 bit values is more appropriate than having it look like 4 billion numbers drawn from a good PRNG, despite looking a lot less like good PRNG output that way.

If two consecutive xoroshiro32+ outputs are treated as one 32-bit value there will be gaps. Some 16+16 combinations will crop up twice or maybe three times, while others never occur.

True. But xoroshiro32plus is (mostly) unbiased when considered as a PRNG that produces 16 bits at a time, rather than 32 bits. There's almost exactly equal numbers of each possible 16 bit value produced over the course of 2^32 16 bit outputs.

Previously I would have naively expected a "good" PRNG with 32 bits of state to visit every possible state once in a 4 giga sample space. Clearly this is not what would happen with a true random selection of 32 bits four billion times.

Is that actually better? Well... yes, mostly... Some people feel that...

Looks to me like "randomness" is in the eye of the beholder.

At the moment I'm leaning towards preferring that a PRNG with 32 bits of state does visit every possible output exactly once in 4 giga samples.

In what situation is that detrimental, apart from satisfying statistical tests like PractRand ?

Sure it means that somebody collecting enough samples can tell that you have a PRNG instead of a source of real randomness.

But presumably they can do that if there is a bias added as well.

When considered from the context of a single run of a program using no more than the cycle length of the PRNG, the biased form is just better. It behaves like a PRNG should, in such a narrow case (assuming it was well written). For instance, duplicated 32 bit values occur at about the right frequency and spacings.

On the other hand, when things are going to be wrong anyway, the unbiased form may have things be wrong in a more intuitive way. Maybe. For instance, if you have a program that uses 2^32 32 bit bit values, and you run that program ten thousand times with different seeds (and thus different starting points in the cycle), the distribution of results will be quite wrong for both cases, but the way it is skewed for the unbiased PRNG might be easier to understand than the way it is skewed for the biased PRNG. Maybe.

Because the iteration is just a bunch of XOR's, we could do a double iteration at once, grabbing results from each to perform the separate 16-bit adds and parity computations, in order to compute the compound PRNG output for the next instruction's S value. The 'XORO32 D' will write the double-iterated value (which is separate from the PRNG output) back into D, which is just a register. To seed the PRNG, just put a non-0 value into the register being used.

XORO32 D 'iterate D and put 32-bit PRNG result into next instruction's S

Example:

XORO32 rnd 'iterate rnd
MOV outb,0 'write XORO32 PRNG result to outb

Another thought on this, simpler than the previous one.

XORO32 stores two 16-bit PRN outputs in the S field of the next instruction and the above code can be re-written as

XORO32 rnd
INS D,PRNs

where INS is an instruction with D and S fields which does not have to be a MOV. Some of the useful instructions after an XORO32 are

MOV
ADD
SUB
XOR
MUL

MUL is D[31:0] := D[15:0] * S[15:0]. Multiplying is a good way of scrambling bits, however the PRN in S[31:16] would be wasted. To avoid this, one of the spare CZ opcode bits could select between single and double iterations, leaving the other for forward/reverse.

EDIT
This is largely about saving one instruction and not so exciting on second reading.

Wow!
I have been following along without understanding most of it.

Thanks scro for the recent post that put it into perspective. Like heater, I presumed that a 32-bit PRNG that produced 1 of each number in a 4GB sample would be the perfect PRNG. But I now see why this isn't really.

However, if we have a truly 4GB where 1 of each number is given, does it really matter for P2?
I guess that is the real question at hand. Are the 4GB of numbers truly random though?

Chip, could a wider number of bits be used and then select just 32 bits from the wider answer?
eg use 40 bits, do the appropriate "xoro40" and select say b38..b7 as the 32 bits?
That might get over the double add which doesn't work in a clock.

If two consecutive xoroshiro32+ outputs are treated as one 32-bit value there will be gaps. Some 16+16 combinations will crop up twice or maybe three times, while others never occur.

True. But xoroshiro32plus is (mostly) unbiased when considered as a PRNG that produces 16 bits at a time, rather than 32 bits. There's almost exactly equal numbers of each possible 16 bit value produced over the course of 2^32 16 bit outputs.

scro, what would be good 16-bit constant to multiply with the xoroshiro32+ 16-bit output to scramble the bits better? Or is it too late after the addition to improve things by multiplying?

What I don't understand about multiplying by a constant is how people handle the result. There are bound to be leading zeroes, so do they shift and truncate it?

100% coverage parity is the only one that is a contender. Anything else is at least 32x sub-par. I don't think there is any way to make up a second bonus bit with this approach.

Yes, parity is a one-trick pony but it's a good trick and I'm glad I found it.

It would be interesting to see how bit 1 on its own scores in tests sometime and whether it is that much worse than bit 2, say.

Wow!
Chip, could a wider number of bits be used and then select just 32 bits from the wider answer?
eg use 40 bits, do the appropriate "xoro40" and select say b38..b7 as the 32 bits?
That might get over the double add which doesn't work in a clock.

xoroshiro40+ needs 40 state bits and they wouldn't fit into one register. Also the output would be only 20 bits.

If two consecutive xoroshiro32+ outputs are treated as one 32-bit value there will be gaps. Some 16+16 combinations will crop up twice or maybe three times, while others never occur.

True. But xoroshiro32plus is (mostly) unbiased when considered as a PRNG that produces 16 bits at a time, rather than 32 bits. There's almost exactly equal numbers of each possible 16 bit value produced over the course of 2^32 16 bit outputs.

scro, what would be good 16-bit constant to multiply with the xoroshiro32+ 16-bit output to scramble the bits better? Or is it too late after the addition to improve things by multiplying?

What I don't understand about multiplying by a constant is how people handle the result. There are bound to be leading zeroes, so do they shift and truncate it?

Um... it's not too late, though it might not be ideal either. We're talking about a 1-to-1 hash of 16 bit values using multiplication? I'd say, generally, for a multiplier constant like that, you want an odd number that, when expressed in binary, has no obvious repeating patterns. Multiplication is very good mixing, but tends to be too left-wards in the flow of information (lower bits of input effect higher bits of output, but not the reverse), so it's often paired with a right-shift and xor (edit: as in x *= K; x ^= x >> 8; (edit2: corrected shift value from 16 to 8, this was a 16 bit multiplication not 32)). Multiplication is not, I think, cost-effective from a hardware perspective though, I think multiplication tends to be kind of expensive compared to the alternatives, unless you just happen to have a multiplication unit sitting around that's not doing anything worthwhile. I think you could probably fit some reasonable PRNGs or even CSPRNGs in a comparable amount of die area to a 16x16 multiplication, and have the result run faster too.

More on running xoroshiro32+ backwards. I'm sure some people would find this feature extremely useful. The ability to go backwards at any time means there would be no need to store any previous states apart from the current one. This could be really handy during long simulations, or in games which could have random starting levels, but I think it's so cool we should do it anyway.

Here's the reverse algorithm , the first piece of Verilog I have ever written:

The XORO32 instruction would need a second operand to select the direction:

XORO32 D,#0 ' forward
XORO32 D,#1 ' reverse

but the new xoro32dir signal could be a constant = 1 to test things work in reverse. To check the output, the following file has states and sums for 2048 single iterations backwards and forwards from state $0000_0001: http://forums.parallax.com/discussion/download/121567/xoro.txt

There is an issue with XORO32 as it is now. The double iteration means that even states are written back to D but odd ones are not. The only way XORO32 can tell you the state immediately after your initial seed is by iterating it 2^31 or 2147483648 times. Combining two 16-bit PRNs into one 32-bit value is great but it halves the period and some users will want only 16 bits, always in the low word, and the full 2^32-1 period.

In a nutshell, we have lost the ability to single-step through the states. However, if we implement forward and reverse, then we could get single-stepping back for very little extra logic. Here is how the modified Verilog code might look:

// XORO32 code that outputs 16-bit or 32-bit PRN values, both forward and reverse
// 16-bit PRN in low word with zeroes in high word and second iteration discarded
// new state written is next one for 16-bit PRN or next but one for 32-bit PRN
// needs new xoro32ctl[1:0] using CZ opcode bits, similar to SETNIB/GETNIB/etc.
// xoroctl[1:0] = 00 for 16-bit PRN, forward XORO16F
// 01 for 32-bit PRN, forward XORO32F
// 10 for 16-bit PRN, reverse XORO16R
// 11 for 32-bit PRN, reverse XORO32R
// wire [1:0] xoroctl = ?
wire [15:0] xoro32z = d[31:16] ^ d[15:0];
wire [31:0] xoro32y = {xoro32z[8:0], xoro32z[15:9], // first forward state
{d[1:0], d[15:2]} ^
{xoro32z[13:0], 2'b0} ^ xoro32z};
wire [15:0] xoro32x = xoro32y[31:16] ^ xoro32y[15:0];
wire [31:0] xoro32 = {xoro32x[8:0], xoro32x[15:9], // second forward state
{xoro32y[1:0], xoro32y[15:2]} ^
{xoro32x[13:0], 2'b0} ^ xoro32x};
wire [15:0] xorr32z = {d[22:16], d[31:23]};
wire [15:0] xorr32zz = d[15:0] ^
{xorr32z[13:0], 2'b0} ^ xorr32z;
wire [31:0] xorr32y = {xorr32z ^ // first reverse state
{xorr32zz[13:0], xorr32zz[15:14]},
{xorr32zz[13:0], xorr32zz[15:14]}};
wire [15:0] xorr32x = {xorr32y[22:16], xorr32y[31:23]}
wire [15:0] xorr32xx = xorr32y[15:0] ^
{xorr32x[13:0], 2'b0} ^ xorr32x;
wire [31:0] xorr32 = {xorr32x ^ // second reverse state
{xorr32xx[13:0], xorr32xx[15:14]},
{xorr32xx[13:0], xorr32xx[15:14]}};
wire xoro32d = xoro32ctl ? // xoro32d = d state result
xorr32
: xorr32y
: xoro32
: xoro32y;
wire [16:0] xoro32a = xoro32ctl[1] ? // first iteration sum
xorr32y[31:16] + xorr32y[15:0]
: xoro32y[31:16] + xoro32y[15:0];
wire [16:0] xoro32b = xoro32ctl ? // second iteration sum
xorr32[31:16] + xorr32[15:0]
: 16'b0
: xoro32[31:16] + xoro32[15:0]
: 16'b0;
wire [31:0] xoro32r = {xoro32b[15:1], ^xoro32b, // xoro32r = prng result
xoro32a[15:1], ^xoro32a};

Please bear in mind I'm on day two as a Verilog coder.

The four different options are selected by the otherwise unused CZ opcode bits, as already done by several other instructions.

These mnenomics make it clear what the operation is. Before anyone objects to using XORO16 for what we have been calling xoroshiro32+, Sebastiano Vigna says its proper name should be xoroshiro[16]32+.

Chip, I've spent hours on this for several reasons (a) because I think XORO32 would be better with a reverse gear, (b) so that you could see this working for yourself, (c) to save you the time and trouble of implementing the reverse algorithm, (d) to bring back single-stepping and (e) to make the P2 the best it can be.

But don't we actually have a 2^32-1 period, already. At 2^31 iterations, don't we step into the 'odd' domain and then go another 2^31-1 iterations before arriving at our original value?

But don't we actually have a 2^32-1 period, already. At 2^31 iterations, don't we step into the 'odd' domain and then go another 2^31-1 iterations before arriving at our original value?

Yes, technically. After 2^31-1, you're repeating the same bytes, but what used to show up in the upper 16 bits will now show up in the lower 16 bits instead, and vise versa, I think. Technically that means your period is a full 2^32-1 calls, and for some purposes that can actually matter. However, if looked at from a byte-stream perspective, the period remains approximately 2^31-1 outputs - and for some purposes that counts for a lot more than the technical period.

Yes, it would repeat the sequence, but with results shifted by 16 bits, so that what used two appear as a unified long, now appears as two words in consecutive longs. So, the words would swap high/low positions, but be spread across two longs.

Scro, what do you think about the utility of having XORO32 be reversible?

I vote for reversible and potatohead likes it too. If anyone else thinks it would be good for XORO32 to run forwards and backwards please say so now.

Is the P2 going to be an ONC18: 0.18 µm CMOS Standard Cell and will it have individual gates or FPGA-like LEs? I'm wondering whether a 4-to-1 mux would be any bigger than a 2-to-1 mux. If the answer is no, or the difference is insignificant, then reversibility gives you single-stepping for free and single-stepping gives you reversibility for free.

Even if you always use 32-bit PRN data, what if you want to have two sequences running, offset by one state? A XORO32 that only does double-stepping cannot tell you the next state unless you go through the whole period. A XORO16 would tell you in one instruction, then you could use that state in XORO32 thereafter.

Without reversibility, states would need to be saved if you ever want to stop and restart further back. Even then, you wouldn't be able to run a simulation, say, backwards, just jump back so far depending on how much RAM is available for storing states and then run it forwards again.

I get the feeling Chip doesn't really believe this thing can run backwards. There's only one way to find out!

Am I missing a point? What possible use is running a PRNG backwards?

Generally in a simulation we have a state S and we evolve a new state from the old one plus some random number(s) like so:

Set intitial state S0
Get random number N0. Calculate new simulation state S1 as a function S1 = f(S0, N0)
Get random number N1. Calculate new simulation state S2 as a function S2 = f(S1, N1)
Get random number N2. Calculate new simulation state S3 as a function S3 = f(S2, N2)
Get random number N3. Calculate new simulation state S4 as a function S4 = f(S3, N3)
...

Now, if we reverse our PRNG we continue as:

Get random number N2. Calculate new simulation state S5 as a function S5 = f(S4, N2)
Get random number N1. Calculate new simulation state S6 as a function S6 = f(S5, N1)
Get random number N0. Calculate new simulation state S7 as a function S7 = f(S6, N0)
...

It's pretty unlikely that our simulation states are going to run backwards just because the PRNG does. S5 will not be S3, S6 will not be S2 and so on.

Once we have stirred up our simulation state with the PRNG it will not magically get un-stirred by running the PRNG backwards.

Yes, it would repeat the sequence, but with results shifted by 16 bits, so that what used two appear as a unified long, now appears as two words in consecutive longs. So, the words would swap high/low positions, but be spread across two longs.

Scro, what do you think about the utility of having XORO32 be reversible?

Mostly pointless. It lets you rewind faster than you could already, but what are you going to use it for? Use cases for backwards PRNG stepping are pretty rare. Plus, since you're limiting it to 32 bits of state, I'm suspecting that the state is mapped to local accessible memory anyway, so I think you'll already have fast state save/restore. 90% of use cases for backwards PRNG stepping can be done about as easily (or better) using fast state save/restore instead.

Can anyone come up with a use case that *isn't* easily & efficiently handled with fast state save/restore instead?

Ya, 32GB score is 8G-Iterations. PractRand is detecting the full period repeating, so it's a perfect score for a 32 bit state with 32 bit output.

The showstopper is that we don't have enough time in a clock cycle to do two 32-bit additions, one after the other.

Chip,
I'm thinking there is the whole two execution clock cycles available. As per the original Xoroshiro128+ source code, the summing can be in parallel with the iterating. EDIT: In fact, Scro even arranges his algorithm the same way.

Can anyone come up with a use case that *isn't* easily & efficiently handled with fast state save/restore instead?

I can't, other than procedural data generation. Like game world, open, roaming. Walking one way, vs walking the opposite way or any other way and seeing the environment present consistently.

State / restore would do this, but would do it at a much higher cost, in that there would need to be state markers in various places to know what the state to return to actually was.

With a forward / reverse, one can just run the same computation and generate routines without regard to state at all, other than the current one.

Running forwards all the time gives you one sequence.
Running backwards all the time gives you another sequence.
Isn't that the real utility? In effect two PRNGs in one.

And you could get this for free, probably, with single-stepping.
I'd be more worried about knowing only half the states if single-stepping is not possible.

I'd be happy if anyone could give me even one case where being able to run a PRNG backwards is useful at all !

potatohead,

With a forward / reverse, one can just run the same computation and generate routines without regard to state at all, other than the current one.

I don't follow your idea Spud.

It can work if your state is only a function of the PRNG value. Then if you reverse the PRNG you reverse the state. But that is never true. As I tried to show in my post above.

Let's think of a really simple simulation. A simulated robot that makes a left or right turn at every iteration and takes one step forwards. The state here is a position on an X, Y plane. The left or right is determined by a PRNG.

If the PRNG produces steps left(L) and right(R), say "LRLLR", then the robot ends up at some new X,Y coordinate.

If we now reverse the PRNG we get "LLRL".

The robot will not end up where it started.

I don't see how you can run a simulation backwards just by using a PRNG does.

So far, I see no use for a reversible PRNG at all.

Ya, 32GB score is 8G-Iterations. PractRand is detecting the full period repeating, so it's a perfect score for a 32 bit state with 32 bit output.

The showstopper is that we don't have enough time in a clock cycle to do two 32-bit additions, one after the other.

Chip,
I'm thinking there is the whole two execution clock cycles available. As per the original Xoroshiro128+ source code, the summing can be in parallel with the iterating. EDIT: In fact, Scro even arranges his algorithm the same way.

But the second add uses the result of the first add, doesn't it?

But the second add uses the result of the first add, doesn't it?

Yep, Prop2 execution is two clocks long, that could execute in the second cycle. Unless, that is, the subsequent S operand replacement occurs at that stage. I presume that'll depend on the addressing modes allowed - If the random number is solely an immediate operand then both execution cycles are available but if the random number is allowed to be a register address then only the first execution cycle is available.

## Comments

17Suppose you have 4 billion 32 bit random numbers. How many duplicate values should there be? How many of each possible value? Etc. If these are drawn from a good PRNG or TRNG, there's an expected distribution to the answer, and each time you draw that many random numbers from the source you'll get a value from that distribution. If they're drawn from an random number generator with only 32 bits of state though, each time you'll get exactly the same value. For an "unbiased" PRNG, the value you get will be really extreme - every possible 32 bit value will occur exactly once, a ridiculous coincidence if it had come from a TRNG or good PRNG. For a (correctly) biased PRNG of 32 bits of state, the value will look much more reasonable, like it was drawn from the correct distribution - the problem is, although, because there's only 32 bits of state, there's only one possible set of 4 billion numbers to end up with, so you'll always get the same value, not an actual distribution.

Is that actually better? Well... yes, mostly. PractRand will rate it higher, for very good reasons - it looks MUCH more like a plausible result from a good RNG. But it means that many possible 32 bit values never actually occur. Some people feel that having the output sequence be a permutation of all possible 32 bit values is more appropriate than having it look like 4 billion numbers drawn from a good PRNG, despite looking a lot less like good PRNG output that way.

If the PRNG state were bigger, this question wouldn't actually matter. A sequence of 2^128 128 bit random values is functionally impossible to distinguish from a random permutation of 2^128 bit values, so most people don't care anymore at that size. Even at 64 bits, a random permutation vs a random sequence is very hard to tell apart, though it can just barely be done if you use tons of time and memory working on it (PractRand doesn't bother, at the moment).

21,233Thanks for that explanation.

Previously I would have naively expected a "good" PRNG with 32 bits of state to visit every possible state once in a 4 giga sample space. Clearly this is not what would happen with a true random selection of 32 bits four billion times. Looks to me like "randomness" is in the eye of the beholder.

At the moment I'm leaning towards preferring that a PRNG with 32 bits of state does visit every possible output exactly once in 4 giga samples.

In what situation is that detrimental, apart from satisfying statistical tests like PractRand ?

Sure it means that somebody collecting enough samples can tell that you have a PRNG instead of a source of real randomness.

But presumably they can do that if there is a bias added as well.

1,763Just wondering whether the MOV could be replaced by a "XORO32B" that does seem extra bit scrambling. In other words, XORO32 could do two different things, determined by a spare opcode bit. The same adder could be used in both parts and the new second part would not be compulsory.

1,763If two consecutive xoroshiro32+ outputs are treated as one 32-bit value there will be gaps. Some 16+16 combinations will crop up twice or maybe three times, while others never occur.

17When considered from the context of a single run of a program using no more than the cycle length of the PRNG, the biased form is just better. It behaves like a PRNG should, in such a narrow case (assuming it was well written). For instance, duplicated 32 bit values occur at about the right frequency and spacings.

On the other hand, when things are going to be wrong anyway, the unbiased form may have things be wrong in a more intuitive way. Maybe. For instance, if you have a program that uses 2^32 32 bit bit values, and you run that program ten thousand times with different seeds (and thus different starting points in the cycle), the distribution of results will be quite wrong for both cases, but the way it is skewed for the unbiased PRNG might be easier to understand than the way it is skewed for the biased PRNG. Maybe.

21,233Given the speed of a Propeller and the applications it might get used for are any of these statistical subtleties worth worrying about?

There is some kind of madness in the search for the perfect PRNG.

1,763Another thought on this, simpler than the previous one.

XORO32 stores two 16-bit PRN outputs in the S field of the next instruction and the above code can be re-written as

where INS is an instruction with D and S fields which

does not have to be a MOV. Some of the useful instructions after an XORO32 areMUL is D[31:0] := D[15:0] * S[15:0]. Multiplying is a good way of scrambling bits, however the PRN in S[31:16] would be wasted. To avoid this, one of the spare CZ opcode bits could select between single and double iterations, leaving the other for forward/reverse.

EDIT

This is largely about saving one instruction and not so exciting on second reading.

13,630It's already that way. Whatever the next instruction is, its S value is the PRNG output. That next instruction can be anything.

1,76317,973I have been following along without understanding most of it.

Thanks scro for the recent post that put it into perspective. Like heater, I presumed that a 32-bit PRNG that produced 1 of each number in a 4GB sample would be the perfect PRNG. But I now see why this isn't really.

However, if we have a truly 4GB where 1 of each number is given, does it really matter for P2?

I guess that is the real question at hand. Are the 4GB of numbers truly random though?

Chip, could a wider number of bits be used and then select just 32 bits from the wider answer?

eg use 40 bits, do the appropriate "xoro40" and select say b38..b7 as the 32 bits?

That might get over the double add which doesn't work in a clock.

1,763scro, what would be good 16-bit constant to multiply with the xoroshiro32+ 16-bit output to scramble the bits better? Or is it too late after the addition to improve things by multiplying?

What I don't understand about multiplying by a constant is how people handle the result. There are bound to be leading zeroes, so do they shift and truncate it?

Melissa O'Neill says a xoroshiro* would be better than xoroshiro+, right at the end of this webpage:

http://www.pcg-random.org/posts/visualizing-the-heart-of-some-prngs.html

1,763Yes, parity is a one-trick pony but it's a good trick and I'm glad I found it.

It would be interesting to see how bit 1 on its own scores in tests sometime and whether it is that much worse than bit 2, say.

1,763xoroshiro40+ needs 40 state bits and they wouldn't fit into one register. Also the output would be only 20 bits.

171,763Here's the reverse algorithm ,

the first piece of Verilog I have ever written:Here's how the combined double iteration forward and reverse code might look:

The XORO32 instruction would need a second operand to select the direction:

but the new xoro32dir signal could be a constant = 1 to test things work in reverse. To check the output, the following file has states and sums for 2048 single iterations backwards and forwards from state $0000_0001: http://forums.parallax.com/discussion/download/121567/xoro.txt

13,630XORO32F

XORO32R

...for forward and reverse.

10,221Very cool, BTW. You guys did a lot with PRNGs here. Can't wait to see what people end up doing with this feature.

1,763In a nutshell, we have lost the ability to single-step through the states. However, if we implement forward and reverse, then we could get single-stepping back for very little extra logic. Here is how the modified Verilog code might look:

Please bear in mind I'm on day two as a Verilog coder.

The four different options are selected by the otherwise unused CZ opcode bits, as already done by several other instructions.

We could expand this to

XORO16F

XORO16R

XORO32F

XORO32R

These mnenomics make it clear what the operation is. Before anyone objects to using XORO16 for what we have been calling xoroshiro32+, Sebastiano Vigna says its proper name should be xoroshiro[16]32+.

Chip, I've spent hours on this for several reasons (a) because I think XORO32 would be better with a reverse gear, (b) so that you could see this working for yourself, (c) to save you the time and trouble of implementing the reverse algorithm, (d) to bring back single-stepping and (e) to make the P2 the best it can be.

13,630But don't we actually have a 2^32-1 period, already. At 2^31 iterations, don't we step into the 'odd' domain and then go another 2^31-1 iterations before arriving at our original value?

1713,630Scro, what do you think about the utility of having XORO32 be reversible?

1,763Is the P2 going to be an ONC18: 0.18 µm CMOS Standard Cell and will it have individual gates or FPGA-like LEs? I'm wondering whether a 4-to-1 mux would be any bigger than a 2-to-1 mux. If the answer is no, or the difference is insignificant, then reversibility gives you single-stepping for free and single-stepping gives you reversibility for free.

Even if you always use 32-bit PRN data, what if you want to have two sequences running, offset by one state? A XORO32 that only does double-stepping cannot tell you the next state unless you go through the whole period. A XORO16 would tell you in one instruction, then you could use that state in XORO32 thereafter.

Without reversibility, states would need to be saved if you ever want to stop and restart further back. Even then, you wouldn't be able to run a simulation, say, backwards, just jump back so far depending on how much RAM is available for storing states and then run it forwards again.

I get the feeling Chip doesn't really believe this thing can run backwards. There's only one way to find out!

21,233Generally in a simulation we have a state S and we evolve a new state from the old one plus some random number(s) like so:

Set intitial state S0

Get random number N0. Calculate new simulation state S1 as a function S1 = f(S0, N0)

Get random number N1. Calculate new simulation state S2 as a function S2 = f(S1, N1)

Get random number N2. Calculate new simulation state S3 as a function S3 = f(S2, N2)

Get random number N3. Calculate new simulation state S4 as a function S4 = f(S3, N3)

...

Now, if we reverse our PRNG we continue as:

Get random number N2. Calculate new simulation state S5 as a function S5 = f(S4, N2)

Get random number N1. Calculate new simulation state S6 as a function S6 = f(S5, N1)

Get random number N0. Calculate new simulation state S7 as a function S7 = f(S6, N0)

...

It's pretty unlikely that our simulation states are going to run backwards just because the PRNG does. S5 will not be S3, S6 will not be S2 and so on.

Once we have stirred up our simulation state with the PRNG it will not magically get un-stirred by running the PRNG backwards.

17Mostly pointless. It lets you rewind faster than you could already, but what are you going to use it for? Use cases for backwards PRNG stepping are pretty rare. Plus, since you're limiting it to 32 bits of state, I'm suspecting that the state is mapped to local accessible memory anyway, so I think you'll already have fast state save/restore. 90% of use cases for backwards PRNG stepping can be done about as easily (or better) using fast state save/restore instead.

Can anyone come up with a use case that *isn't* easily & efficiently handled with fast state save/restore instead?

11,906I'm thinking there is the whole two execution clock cycles available. As per the original Xoroshiro128+ source code, the summing can be in parallel with the iterating. EDIT: In fact, Scro even arranges his algorithm the same way.

10,221I can't, other than procedural data generation. Like game world, open, roaming. Walking one way, vs walking the opposite way or any other way and seeing the environment present consistently.

State / restore would do this, but would do it at a much higher cost, in that there would need to be state markers in various places to know what the state to return to actually was.

With a forward / reverse, one can just run the same computation and generate routines without regard to state at all, other than the current one.

1,763Running backwards all the time gives you another sequence.

Isn't that the real utility? In effect two PRNGs in one.

And you could get this for free, probably, with single-stepping.

I'd be more worried about knowing only half the states if single-stepping is not possible.

21,233potatohead, I don't follow your idea Spud.

It can work if your state is only a function of the PRNG value. Then if you reverse the PRNG you reverse the state. But that is never true. As I tried to show in my post above.

Let's think of a really simple simulation. A simulated robot that makes a left or right turn at every iteration and takes one step forwards. The state here is a position on an X, Y plane. The left or right is determined by a PRNG.

If the PRNG produces steps left(L) and right(R), say "LRLLR", then the robot ends up at some new X,Y coordinate.

If we now reverse the PRNG we get "LLRL".

The robot will not end up where it started.

I don't see how you can run a simulation backwards just by using a PRNG does.

So far, I see no use for a reversible PRNG at all.

1,763But the second add uses the result of the first add, doesn't it?

scro, the three state xors are Xorshift, I believe. Did you invent the hashing function after it and does it have a name?

11,906