Today, we are happy to publish the details of a novel new algorithm, entitled “Memory-optimal Pixel Raster Scan to Pixel Macroblock Stream”. Enjoy!
Memory-Optimal Raster Scan to Macroblocks Stream
Optimal memory footprint implementation.
Background
Many video compression formats (MPEG, H.264, AV1 etc) are based upon discrete processing units known as ‘Macroblocks’. These are small (often 4×4, 8×8 or 16×16), areas of pixels, in which each pixel is composed of a set of colour component amplitudes (usually R-G-B or Y-Cb-Cr). At the core of the compression algorithm, a linear transform (usually a Discrete Cosine Transform ) is applied to each Macroblock, on a per-colour component basis (i.e. a separate DCT is applied to each, for example, 16×16 area of Red, Blue and Green pixels that compose the Macroblock). This converts the Macroblock from being represented by a three, finite sets of amplitudes to three finite sets of frequencies.
For this operation to take place, the pixel data that composes a frame (a stream of frames composes a video) must first be arranged into Macroblocks. Unfortunately, raw pixel source formats (i.e. MIPI, DisplayPort, HDMI etc) all produce a stream of pixel data in a Raster Scan – essentially rows of pixels, beginning from the top left of the frame, scanning horizontally to the right. Once an entire row of pixels has been streamed, the subsequent line is streamed in the same fashion. When the final pixel of that frame (bottom right of the frame) has been streamed, the process begins again at the top left of the subsequent frame.
Thus, a conversion from a Pixel Raster Scan to a Stream of Pixel Macroblocks is an essential component of a video compression format.
Basic Solution
The basic solution involves simply hosting two buffers as part of the raster scan to macroblock conversion, and switching target buffers (of both write/read operations) for each set of ‘Macroblock-Lines’ – i.e. an area of the frame which has a width equal to the full frame width, and a height equal to the Macroblock height. Both the write (raster scan of pixel data in) and read (macroblock arrangement of pixel data out) operations occur in parallel, but almost never (or could never, to simplify the logic at the expense of a slight increase in memory footprint) access the same buffer concurrently. This limitation is, of course, to ensure that the current Macroblock-Line of pixel data that is being read is not overwritten by the subsequent Macroblock-Line of pixel data. The only concurrent access would potentially be while writing the final line of each ‘Macroblock-Line’ unit into the module – for example, the first (i.e. left-most) Macroblock of that unit would be completely written (and thus ready to be read out) after a ‘Macroblock-Row’ (i.e. Macroblock width of pixel data) worth of pixel data of that final line has been written in. A block-diagram of this solution is as follows.
Optimal Solution
The optimal solution involves the application of a customized buffer access pattern, which allows the mapping from raster-scan to macroblock to occur with half the memory footprint of the basic solution. Thus the high-level block diagram of the module simplifies to the following.
The ‘remap algorithm’ in the above diagram is easiest described as follows.
First, in order to describe the remap algorithm as clearly as possible, the smallest discrete unit is no longer a single pixel, but a ‘Macroblock-Row’ – i.e. a single row of pixels within a Macroblock (herein abbreviated to MbR. This is illustrated by the following image, in which the 4×4 Macroblock boundaries and their internal MbR index have been outlined and labelled, with respect to the Macroblock locations within a frame.
Next, consider a frame with a width of twenty pixels (five 4×4 Macroblocks), in which the Macroblock-Rows have been labelled according to their arrival order, as per a Raster-Scan. Note that the arrival order labeling (not the arrival order itself) will reset after each two rows of Macroblocks – which has been done to allow the remap algorithm to be described more easily. This is illustrated in the following image.
Finally, we can illustrate the remap algorithm. The initial phase of the algorithm is to simply write the first Macroblock-Lines worth of MbR’s into a dual-port RAM, as per the below image. This is called the ‘Initial Fill’ stage. Note the highlighted RAM address index, which is the sixth (Frame width in Macroblocks + 1).
From this point onward, the remap algorithm follows a similar sequence of operations (split into ‘iterations’) per Macroblock-Line of MbR’s, the first of which is illustrated by the below image. MbR’s are read out of the dual-port RAM as per the ‘RAM Read Address’ sequence (Read Phase), to produce a Macroblock-Stream of data, as opposed to the incoming Raster-Scan (correlate ‘Macroblock Row Read Order’ with ‘Image 7.’ above). For each MbR that is read out of the RAM, a MbR from the subsequent Macroblock-Line can be written into the RAM. Note that the ‘Read Phase’ illustrates consecutive operations, tracing downwards (as per the blue arrow) while the ‘Write Phase’ is simply the contents of the RAM once the entire Macroblock-Line worth of data has been written into the RAM (following the access pattern detailed by the ‘RAM Read Address column’). Again, it is worth correlating the ‘Write Phase’ image with the second Macroblock-Line in ‘Image 7’ – it is written into the RAM in Raster-Scan order, as required. Also worth noting is the value by which the ‘RAM Read Address’ is incremented (five) – highlighted in the ‘Remap RAM – Initial Fill’ (Image 8) – and the highlighted value in the ‘RAM Read Address’ sequence, which is the again the sixth (Frame width in Macroblocks + 1). That fact that when the ‘RAM Read Address’ is going to be incremented above the total number of MbR’s – 1 (per Macroblock-Line), it instead wraps – i.e. in the below image, instead of incrementing from 15 to 20 (maximum is 19) it wraps to 20-19 = 1.
Subsequent iterations of the algorithm are illustrated below. Note that the ‘Read Phase’ is the current (departing) Macroblock-Line of pixel data, and the ‘Write Phase’ is the subsequent (arriving) Macroblock-Line of pixel data. The MbR index is always the Raster-Scan arrival order, but is reset for every second row of Macroblock-Lines (so reaches a maximum of 40).
Note that the ‘Incrementer’ (i.e. Read/Write access pattern) is always sourced from the value of the RAM Read Address for operation (Frame width in Macroblocks + 1) of the previous Read/Write access pattern.
Obviously, the set of access patterns are finite – illustrations of further iterations would show this, and the process would repeat.
Thus, conversion from Raster-Scan to Macroblock-Stream has been achieved and the optimal solution occupies half the memory footprint of the basic solution.
Further work
Although the algorithm has been (essentially) shown to generalize to any Frame width (that is a multiple of the Macroblock width) via experimental methods (i.e. simulation), it would be nice to prove it mathematically.