by Mayur Patel
Summary
In a previous post, we presented a fast tabular hash for performing Owen Scrambling, which is commonly be used to randomize the base-2 Sobol sequence used extensively in sampling applications. In this post, we revisit that algorithm and propose an improvement which results in richer randomization while maintaining very fast speed.
Proposed Improvement
As a reminder, the original algorithm looks like this:
Original Algorithm: Owen Scrambling in Base-2
Require: static π ππππππππ‘ππππ[16][128], table of 8-bit words
Require: input π πππ, a user-supplied 32-bit random seed
Require: input π€πππ, 32-bit binary value to scramble
1: π₯ ← π πππ × 0π₯6π΄935πΆπ΄5
2: π ← π πππ
3: π€ ← π€πππ
4: πππ‘π ← 32
5: while πππ‘π do
6: πππ‘π ← πππ‘π − 8
7: π ← π ππππππππ‘ππππ[π & 0π₯πΉ][π€ >> 25]
8: π€ ← π€ << 8
9: π₯ ← π₯ ⊕ (π << πππ‘π )
10: end while
11: return π€πππ ⊕ π₯
Require: static π ππππππππ‘ππππ[16][128], table of 8-bit words
Require: input π πππ, a user-supplied 32-bit random seed
Require: input π€πππ, 32-bit binary value to scramble
1: π₯ ← π πππ × 0π₯6π΄935πΆπ΄5
2: π ← π πππ
3: π€ ← π€πππ
4: πππ‘π ← 32
5: while πππ‘π do
6: πππ‘π ← πππ‘π − 8
7: π ← π ππππππππ‘ππππ[π & 0π₯πΉ][π€ >> 25]
8: π€ ← π€ << 8
9: π₯ ← π₯ ⊕ (π << πππ‘π )
10: end while
11: return π€πππ ⊕ π₯
For direction on how the table is generated, please refer to the previous post.
For implementation in software, we can maximize the performance of many CPU pipelines by unwrapping the loop. Below is an optimized implementation.
Original Algorithm: Optimized Owen Scrambling in Base-2
Require: static π ππππππππ‘ππππ[16][128], table of 8-bit words
Require: input π πππ, a user-supplied 32-bit random seed
Require: input π€πππ, 32-bit binary value to scramble
1: π₯ ← π πππ × 0π₯6π΄935πΆπ΄5
2: π ← π πππ
3: π€ ← π€πππ
4: π ← π ππππππππ‘ππππ[π & 0π₯πΉ][π€ >> 25]
5: π€ ← π€ << 8
6: π₯ ← π₯ ⊕ (π << 24)
Require: static π ππππππππ‘ππππ[16][128], table of 8-bit words
Require: input π πππ, a user-supplied 32-bit random seed
Require: input π€πππ, 32-bit binary value to scramble
1: π₯ ← π πππ × 0π₯6π΄935πΆπ΄5
2: π ← π πππ
3: π€ ← π€πππ
4: π ← π ππππππππ‘ππππ[π & 0π₯πΉ][π€ >> 25]
5: π€ ← π€ << 8
6: π₯ ← π₯ ⊕ (π << 24)
7: π ← π ππππππππ‘ππππ[π & 0π₯πΉ][π€ >> 25]
8: π€ ← π€ << 8
9: π₯ ← π₯ ⊕ (π << 16)
8: π€ ← π€ << 8
9: π₯ ← π₯ ⊕ (π << 16)
10: π ← π ππππππππ‘ππππ[π & 0π₯πΉ][π€ >> 25]
11: π€ ← π€ << 8
12: π₯ ← π₯ ⊕ (π << 8)
11: π€ ← π€ << 8
12: π₯ ← π₯ ⊕ (π << 8)
13: π₯ ← π₯ ⊕ π ππππππππ‘ππππ[π & 0π₯πΉ][π€ >> 25]
14: return π€πππ ⊕ π₯
The following diagram depicts an implementation with 4-bit words in a table, used to produce an 8-bit scrambling word. This is in contrast to the algorithm listing which uses tables of 8-bit words to scramble a Sobol word of 32 bits. Nonetheless, it illustrates the relationships between the various data which will be helpful when we later discuss improvements to the algorithm.
Randomization and the most significant bits of the Sobol word contribute to a table lookup which produces the most significant bits of the scrambling word. There is a knock-on effect, because the first table lookup influences subsequent lookups. Notice that bit four and bit zero in the original Sobol word do not contribute to the scrambling word at all.
To improve the algorithm, we'd like all the bits in the Sobol word to contribute to the scrambling word. Furthermore, we'd like to maintain high performance and still be able to scramble Sobol sequences in arbitrary dimensions with a lookup table of 2KiB.
The following diagram depicts one way that this can be achieved.
The most significant bit of the scrambling word is determined entirely randomly, using the seed value to ensure that it is repeatable. The most significant bits of the Sobol word determine the next most significant bits of the scrambling word, through the table lookup. The least significant bit of the last table lookup is unused.
For this to work with a table of 8-bit words scrambling a Sobol word of 32 bits, each tree encoded into the table needs to have 256 entries instead of the previous 128. To preserve the compactness of the table, we reduce the number of encoded trees from 16 to 8. The total size of the lookup table remains 2KiB.
Here is the updated algorithm, in its optimized form:
Improved Algorithm: Optimized Owen Scrambling in Base-2
Require: static π ππππππππ‘ππππ[8][256], table of 8-bit words
Require: input π πππ, a user-supplied 32-bit random seed
Require: input π€πππ, 32-bit binary value to scramble
1: π₯ ← π πππ × 0π₯6π΄935πΆπ΄5
2: π ← π πππ
3: π€ ← π€πππ
4: π ← π ππππππππ‘ππππ[π & 0π₯3][π€ >> 24]
5: π€ ← π€ << 8
6: π₯ ← π₯ ⊕ (π << 23)
Require: static π ππππππππ‘ππππ[8][256], table of 8-bit words
Require: input π πππ, a user-supplied 32-bit random seed
Require: input π€πππ, 32-bit binary value to scramble
1: π₯ ← π πππ × 0π₯6π΄935πΆπ΄5
2: π ← π πππ
3: π€ ← π€πππ
4: π ← π ππππππππ‘ππππ[π & 0π₯3][π€ >> 24]
5: π€ ← π€ << 8
6: π₯ ← π₯ ⊕ (π << 23)
7: π ← π ππππππππ‘ππππ[π & 0π₯3][π€ >> 24]
8: π€ ← π€ << 8
9: π₯ ← π₯ ⊕ (π << 15)
8: π€ ← π€ << 8
9: π₯ ← π₯ ⊕ (π << 15)
10: π ← π ππππππππ‘ππππ[π & 0π₯3][π€ >> 24]
11: π€ ← π€ << 8
12: π₯ ← π₯ ⊕ (π << 7)
11: π€ ← π€ << 8
12: π₯ ← π₯ ⊕ (π << 7)
13: π₯ ← π₯ ⊕ (π ππππππππ‘ππππ[π & 0π₯3][π€ >> 24] >> 1)
14: return π€πππ ⊕ π₯
While some of the constants are changed, the instructions are largely the same. The only exception is line 13, where one new shift operation is introduced to discard the last bit of the last table lookup.
Because of the strong similarity in implementation, we expect the improved algorithm to be close to the computational performance of the original, while improving randomization.
Results
Below, on the first row, we plot 1024 points of the original Sobol sequence. On the second row, on the left is the scrambled version using the original algorithm. On the right, is the sequence scrambled with the improved algorithm.



Aesthetically, the PSD from the improved algorithm seems to have fewer artifacts than the original algorithm, but the differences are not extreme.
To test the efficacy of these point sets, we rendered an ambient occlusion scene, an area-light sampling scene, and a full path tracing scene using different numbers of samples. We also used both stateful and stateless generators because the order of samples from each are shuffled relative to one another between the implementations. The test scenes are depicted below.
For the ambient occlusion test, the samples using the improved algorithm produced a better mean absolute error than the unscrambled Sobol sequence for 73.6% of the tests we conducted. It showed a better MAE than the original scrambling algorithm for 52.7% of the tests. The percent difference between the MAE of the result using the improved algorithm and the unscrambled sequence showed an average of 8.28% and a median value of 7.08%. The percentage difference between the original scrambling algorithm result and that of the improved algorithm showed an average of 5.01% and a median of 4.06%.
Similarly, the area-light sampling render test showed the improved algorithm producing better results 74.0% of the time against the unscrambled sequence and 68.7% of the time against the original scrambling algorithm. The percentage difference of the MAE was an average of 3.64% and a median of 2.69% against the unscrambled sequence. The percentage difference against the original scrambling algorithm was an average of 2.70% and a median of 2.31%.
Interestingly, for the full path tracing renders, the improved algorithm created a result that was better than the unscrambled sequence only 33.8% of the time, and better than the original scrambling algorithm for 47.2% of the tests. The percentage difference in the MAE between the result using the improved scrambling algorithm and the unscrambled sequence showed an average of 0.98% and a median of 0.71%. Against the original scrambling algorithm, it showed an average of 0.88% and a median of 0.64%. So, while the improved algorithm did not produce the distinctly best result for the full path tracing render, the results across all the tests usually differed by only a fraction of one percent. In practice, we expect these particular renders to be indistinguishable to the eye.
In terms of computational efficiency, a sequence generator using the improved algorithm produced samples 5.41% slower than one using the original algorithm for a stateless implementation and 3.86% slower for a stateful implementation. These differences were observed on an Intel Xeon W3670 processor with a clock rate of 3.2GHz. As noted earlier, the most pronounced difference affecting performance in the optimized implementations of these algorithms is that the improved algorithm has one additional bit-wise shift operation.
Conclusion
In this brief post, we presented an improved implementation for scrambling the Sobol sequence using tabular hashing. Through testing, we demonstrated that it produces good quality with good speed. A C++ implementation of the algorithm is available here for generating point sequences in both, a stateful and a stateless manner.