A Story About Information, Entropy, and Efficiently Storing Unit Vectors
Disclaimer: This was written by someone who has never had any formal education in information theory. If entropy were a girl, I would only know her because she sometimes visits the coffee shop I work at.
When it comes to binary file formats, it is common to see different forms of compression used internally. File formats like FBX, for example, have a flag on each of their array nodes to indicate whether the data within is compressed. General-purpose compression methods work well most of the time because humans conveniently group related data together in these cool little things called files. I recently took on rewriting some serialization code to store 3D models more efficiently and, in doing so, have been fighting against the concept of entropy for a week. This story illustrates how the universe fights against the efforts of compression and why it is crucial to benchmark.
Unit Vectors
First, it is essential to understand what a unit vector is and how it gets used. A unit vector is simply a vector whose magnitude is equal to one. Unit vectors show up all over 3D graphics because they are excellent for denoting direction. In a 3D model context, vertices use unit vectors to describe the direction of both their normals and tangents. A three-dimensional vector has a number associated with each of its components (x, y, and z). Unit vectors are commonly described in this notation because it is easy for humans to visualize. Unfortunately, this notation finds its way into 3D file formats like the industry-standard FBX format.
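To put numbers on that, here is a minimal sketch (my own Python, not from any particular file format) of the naive representation, where each component is stored as a 32-bit float:

```python
import struct

def pack_naive(x, y, z):
    # Each component is a 32-bit float, so every normal or tangent
    # costs 12 bytes before any compression is applied.
    return struct.pack("<fff", x, y, z)

print(len(pack_naive(0.0, 0.0, 1.0)))  # 12
```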
Recall that a unit vector requires its length to be precisely one. By definition, the vector already has information encoded in it, which can then be taken advantage of for serialization purposes. One intuitive way to take advantage of this definition is to recognize that every possible unit vector lies on the surface of a unit sphere. Those who believe the earth is round are comfortable using two coordinates (latitude and longitude) to describe any point on a sphere (our planet). Shifting our thinking about unit vectors to be described in this manner allows us to drop one of the three original numbers used to describe the vector. Dropping a single number might not seem meaningful, but a 33% reduction in the serialized data becomes significant when applied at the scale of millions of unit vectors.
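As a rough sketch of the idea (the function names here are my own, and this ignores the precision issues in the disclaimer below), the round trip between three components and two angles looks something like this:

```python
import math

def to_lat_long(x, y, z):
    """Collapse a unit vector to two angles (latitude, longitude)."""
    lat = math.asin(max(-1.0, min(1.0, z)))  # angle up from the equator
    lon = math.atan2(y, x)                   # angle around the pole
    return lat, lon

def from_lat_long(lat, lon):
    """Rebuild all three components from the two stored angles."""
    x = math.cos(lat) * math.cos(lon)
    y = math.cos(lat) * math.sin(lon)
    z = math.sin(lat)
    return x, y, z
```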
Disclaimer: This example was for illustrative purposes. Floating-point limitations make this type of 2D-to-3D mapping unattractive due to the uneven distribution of sphere coordinates. If you are interested, a better distribution exists where the unit vector is instead projected from the sphere onto an octahedron, which you can read more about here.
Information and The Resolution of Data
Given a fixed amount of space to store a unit vector, what representation will allow the most compression to occur? The answer depends on the compression method in question. The DEFLATE algorithm, seen in file formats like ZIP, is a popular choice because it takes advantage of the ever-popular greedy Huffman coding algorithm. At heart, entropy encoding algorithms like Huffman coding create a new representation for the symbols of a given input based on the frequency with which they occur. Symbols that occur more frequently have less entropy associated with them and require fewer bits to represent. The more entropy a given input contains, the more space is required to represent that input. Another way to think about this is that information is entropy: the more information there is in a given input, the higher the entropy present.
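A quick way to see this in action (my own toy example, using Python's zlib rather than any particular archiver): DEFLATE shrinks an input with only a few frequent symbols far more than an input where every byte value is equally likely.

```python
import os
import zlib

uniform_bytes = os.urandom(1_000_000)                         # every byte value equally likely
skewed_bytes = bytes(b % 4 for b in os.urandom(1_000_000))    # only four distinct symbols

print(len(zlib.compress(uniform_bytes)))  # roughly 1,000,000 bytes: nothing to squeeze out
print(len(zlib.compress(skewed_bytes)))   # far smaller: frequent symbols get short codes
```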
Back to the question at hand: if there are only 3 bytes available to store a single unit vector, what technique should be used to ensure a high compression ratio when storing a million random unit vectors? Following our entropy definition, the technique that stores the least amount of information will ensure the highest compression ratio. To better illustrate this point, consider two different methods for packing a unit vector into 24 bits.
In method one, we scale a float to fit within a single byte, where a float value of -1.0 corresponds to a byte value of 0 and a float value of +1.0 corresponds to a byte value of 255. Applying this scaling to each component of our unit vector reduces its final representation to a total of three bytes. Note that this conversion dropped a large amount of information to fit within the three bytes.
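A minimal sketch of method one (the helper names are mine) might look like this:

```python
def pack_method_one(x, y, z):
    def to_byte(f):
        return round((f + 1.0) * 0.5 * 255)   # -1.0 -> 0, +1.0 -> 255
    return bytes([to_byte(x), to_byte(y), to_byte(z)])

def unpack_method_one(data):
    def to_float(b):
        return (b / 255.0) * 2.0 - 1.0        # 0 -> -1.0, 255 -> +1.0
    return tuple(to_float(b) for b in data)
```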
In method two, we take advantage of the definition of a unit vector, specifically the formula for calculating its magnitude:

‖v‖ = √(x² + y² + z²) = 1

Re-arranging that formula lets us solve for the Z component given the X and Y components:

z = ±√(1 − x² − y²)
The ability to solve for the Z component affords us the opportunity to drop it during serialization, which frees up bits for the X and Y components. A single bit still needs to be kept to denote the Z component’s sign (because the square root introduces ambiguity), but the X component now has 12 bits, and the Y component has 11 bits.
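A minimal sketch of method two, assuming the bit layout described above (12 bits for X, 11 bits for Y, and 1 sign bit for the dropped Z), might look like this:

```python
import math

def pack_method_two(x, y, z):
    xq = round((x + 1.0) * 0.5 * 4095)        # 12 bits for X: 0..4095
    yq = round((y + 1.0) * 0.5 * 2047)        # 11 bits for Y: 0..2047
    sign = 1 if z < 0.0 else 0                # 1 bit for the sign of the dropped Z
    packed = (xq << 12) | (yq << 1) | sign    # 24 bits total
    return packed.to_bytes(3, "little")

def unpack_method_two(data):
    packed = int.from_bytes(data, "little")
    x = ((packed >> 12) & 0xFFF) / 4095.0 * 2.0 - 1.0
    y = ((packed >> 1) & 0x7FF) / 2047.0 * 2.0 - 1.0
    z = math.sqrt(max(0.0, 1.0 - x * x - y * y))   # solve for Z from the definition
    if packed & 1:
        z = -z
    return x, y, z
```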
A Benchmark
The second method dedicates more bits to the X and Y components, which effectively increases the unit vector’s resolution and retains more information than the first method. A small program demonstrates this by writing a million random unit vectors to disk using the two methods described above. Running the program generates two equally sized files (3,000,000 bytes each). However, running Windows’ built-in zip compression on each file demonstrates just how much information each one contained.
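Here is a rough reconstruction of that benchmark (my own script, reusing the pack_method_one and pack_method_two sketches from above and substituting Python’s zlib for the Windows zip step):

```python
import random
import zlib

def random_unit_vector(rng):
    # Rejection-sample a point inside the unit ball, then normalize it
    # so the result is uniformly distributed on the sphere's surface.
    while True:
        x, y, z = (rng.uniform(-1.0, 1.0) for _ in range(3))
        length = (x * x + y * y + z * z) ** 0.5
        if 0.0001 < length <= 1.0:
            return x / length, y / length, z / length

rng = random.Random(42)
vectors = [random_unit_vector(rng) for _ in range(1_000_000)]

file_one = b"".join(pack_method_one(*v) for v in vectors)
file_two = b"".join(pack_method_two(*v) for v in vectors)

print(len(file_one), len(file_two))                        # both 3,000,000 bytes
print(len(zlib.compress(file_one)), len(zlib.compress(file_two)))
```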
Crazy, right? The exact same unit vectors were saved to disk during this benchmark. One method, however, ended up containing more information despite taking up the same amount of memory. The data’s resolution determined how much information was present, which in turn affected the final compression ratio.
This realization was both exciting and depressing to me. There is no cute math trick (as seen in method two) I can ever do to save on disk space while increasing resolution. There will always be a sacrifice in terms of memory to maintain the information, and being more efficient with memory means losing information. This turns out to be a fundamental idea in information theory, formalized in Shannon’s source coding theorem.
If you ended up making it to the end of this, thank you. If there was nothing new to learn here, I apologize for wasting your time. If this was new to you and you don’t get the point, go smoke some pot or something.
As always, I appreciate any comments/corrections to my understanding of all of this, and consider following me on Twitter.