Saturday, June 28, 2014

Dissecting the 128-byte raycaster

Lately a Reddit post caught my attention which is about a raycaster demo, animated, with mouse control and texture mapping,  in only 128 bytes. It looks like this:






And this is it in motion:


Impressive, isn't it? Out of curiosity, I took a look at the source code  (http://olivier.poudade.free.fr/src/Wolf128.asm) and spent some time trying to figure out how it works. So now I'd like to share my findings here.

The algorithm

For each pixel, wolf128.asm performs two ray marching, one on X-Z plane, and one on Y-Z plane, to find out the distance to the nearest wall/ceiling/floor.  The ray marching happens on 2D planes and are always axis aligned so there is no vector math needed.



For simplicity, below I'll use GLSL code to show how it works.  Let's consider only the vertical direction first:
out vec4 outputColor;
uniform vec2 iResolution;
void main(void)
{
    // normalize y coordinate to 0-200, in consistent with wolf128's
    // vga 320x200 mode
    vec2 uv = gl_FragCoord.xy / iResolution.xy;
    uint y = uint(uv.y * 200.0);
    // initial depth is 0
    uint z = 0u;
    // the distance to wall(ceiling/floor)
    uint ydist;
    // we'll trace a maximun of 256 steps
    for(int i = 0; i < 256; ++i)
    {
        // the distance to wall is propotional to depth,
        // and propotional to the offset of the pixel from
        // the center of the screen(which is 100, because
        // the height of the screen is 200). In order to be
        // consistent with wolf128, everything uses unsigned
        // numbers, and we keep 8-bit precision for the result.
        ydist = ((y - 100u) * z + 4096u) >> 8;
        // If the distance exceeds a predefined threshold
        if (ydist >= 32u)
            break;
        // next itration, increase depth by 1
        ++z;
    }

    // normalize the depth value to [0,1]
    float c = float(z) / 256.0;
    outputColor = vec4(vec3(c), 1.0);
}
The output of this fragment shader is like this:


Fairly simple isn't it? The only thing worth mention is the +4096 in our distance calculation. The reason we need this is because the nature of unsigned numbers.  Imagine when the Y coordinate is less then 100, for example 99,  now 99-100 will underflow and becomes a very large number, immediately exceeds the threshold.  The answer is to add 4096 to the calculation.  Why 4096? Because we shift result right by 8 bits and  4096 >> 8 = 16, which is in the middle of our threshold (32).  This effectively shifts all results to the positive direction by 16, so <100 coordinates will get results starting from 16 and gradually decrease from there towards 0 until it underflows and  trigger the >=32 condition.  Meanwhile results from positive coordinates will start from 16 and adds up from there until reach 32.

Now let's add horizontal walls:
out vec4 outputColor;
uniform vec2 iResolution;
void main(void)
{
    vec2 uv = gl_FragCoord.xy / iResolution.xy;
    uint x = uint(uv.x * 320.0);
    uint y = uint(uv.y * 200.0);   
    uint z = 0u;
    uint xdist, ydist;
    for(int i = 0; i < 256; ++i)
    {
        // horizontal center is 160
        xdist = ((x - 160u) * z + 4096u) >> 8;
        ydist = ((y - 100u) * z + 4096u) >> 8;
        // break the loop if we hit a wall in either direction
        if ((xdist >=32u) || (ydist >= 32u))
            break;
        ++z;
    }
    float c = float(z) / 256.0;
    outputColor = vec4(vec3(c), 1.0);
}
Almost identical, only added distance tracing for horizontal walls. And the result is no surprising:

The above two shaders merely renders the depth value of a virtual tunnel (or distance field, if you prefer to use that term).  Now let's add some textures.

Given the fact that for each pixel we already have the distance to horizontal and vertical walls and we have the depth value, that's enough information on all 3 dimensions.  A very simple but effective way to generate some texture is to XOR all of them together, like this:
   uint texel = xdist ^ ydist ^ z;
    texel %= 16u; // adjust the period of the pattern to your liking
    float c = float(texel) / 16.0;
    outputColor = vec4(vec3(c), 1.0);

The result is this:

Now we are talking!

The next step is to make some corridors in  this tunnel.  We can do this with the depth information which we already have. The only change we need to make is change the loop exit condition. When we hit a horizontal wall, do not exit immediately, instead, check whether the depth satisfies some sort of conditions, for example turn wall on and off for every other 32 units:
   if ((xdist >=32u && ((z & 32u) != 0u)) || (ydist >= 32u))
        break;
This is what we get from this change:
Last thing, right now the maze looks quite boring, let's add some variation to it.  The way we do it is to add or subtract some random values before checking the distance:
       switch(z / 64u)
        {
        case 0u: xdist -= 10u; break;
        case 1u: xdist -= 40u; break;
        case 2u: xdist -= 20u; break;
        case 3u: xdist -= 30u; break;
        }
Now it looks almost exactly the same as the original wolf128.asm program:


The full fragment shader code that produces the above output is this:
out vec4 outputColor;
uniform vec2 iResolution;
void main(void)
{
   vec2 uv = gl_FragCoord.xy / iResolution.xy;
    uint x = uint(uv.x * 320.0);
    uint y = uint(uv.y * 200.0);
    
    uint z = 0u;
    uint xdist, ydist;
    for(int i = 0; i < 256; ++i)
    {
        xdist = ((x - 160u) * z + 4096u) >> 8;
        ydist = ((y - 100u) * z + 4096u) >> 8;
        switch(z / 64u)
        {
        case 0u: xdist -= 10u; break;
        case 1u: xdist -= 40u; break;
        case 2u: xdist -= 20u; break;
        case 3u: xdist -= 30u; break;
        }
        if ((xdist >=32u && ((z & 32u) != 0u)) || (ydist >= 32u))
            break;
        ++z;
    }

    uint texel = (xdist & 255u) ^ (ydist & 255u) ^ z;
    texel %= 16u;
    float c = float(texel) / 16.0;
    outputColor = vec4(vec3(c), 1.0);
}
 That's all about the algorithm. Next to follow is

Assembly code analysis

In order to minimize the size, wolf128.asm is a DOS .COM file. The advantage of .COM file is that it has zero fat. Every single byte in the file is program code. When executed, DOS(or Windows) loads the code to offset 0x100 of the current code segment, and fill the 256 bytes before it with the so called Program Segment Prefix (or PSP).

Below I'll explain the code line by line.

  ; Call BIOS INT 10h set the display mode to VGA mode 13h,
  ; which is 320x200,256 color. It takes advantage of ah's
  ; initial value being 0, therefore save one instruction to
  ; load ah.
  mov al,13h
  int 10h
  ; bx has initial value 0, and ds is the same as cs,so ds:bx
  ; points to PSP.
  ; es <- 9FFF (PSP+2 is the end of DOS program area,which is 9FFF)
  ; dx is also assigned,but not used。Why load es with 9fff? That's
  ; because the VGA frame buffers starts at A000
  les dx,[bx]
  ; cl <- 255, max loop count
A:mov cl,0ffh
  ; bl <- cl,bl keeps track of depth
B:mov bl,cl
  ; this is equivalent to bl = 255-cl
  not bl
  ; ax <- 0xffee (di has initial value 0xfffe,minus 10h and get 0xffee)
  ; This will be the screen buffer size. Why 0xffee instead of
  ; 320x200=0xfa00? This is because in order to save space, wolf128 always
  ; writes 65535 bytes for each frame. Since our es segment is 9fff, and di
  ; is fffe, after 2 bytes, di will wrap back to 0, which is 9fff:0, and 16
  ; bytes before the VGA frame buffer at a000:0. So that's a total of 18
  ; extra bytes written before the frame buffer: 65535 - 18 = 0xffee
  lea ax,[di-10h]
  ; bp <- 320,screen width
  mov bp,140h
  ; dx <- 0000,cdq is in fact a 32-bit instruction, it sign extends eax to
  ; edx. here it takes advantage of the fact that the high word in eax is
  ; always 0 in 16-bit programs.
  cdq
  ; DX:AX(FFEE) divide by BP(320),get quotient(Y coordinate) in AX
  ; remainder(X coordinate) in DX
  div bp
  ; calling G twice is equvalent to
  ; ax = ((ax - 100) * z + 0x13b0);
  ; dx = ((dx - 100) * z + 0x13b0);
  ; You should already be familiar with this formular. But Why 0x13b0
  ; instead of 0x1000(4096)? And why both are minus 100? Those are all
  ; for reducing program size. 0x13b0 is the first 2 bytes of the
  ; program, which can be loaded by taking advantage of SI's initial
  ; value being 0x100. And if the 2 formulars are written separately
  ; it'll take more instructions. The results are kept in 16-bit
  ; registers AX and DX but later we only use AH and DH, this is like
  ; an 8-bit right shift for free.
  call G
  call G
  ; the program keeps mouse button state at ds:1dh
  test [1dh],7
  ; if no button is pressed then no need to look at mouse position
  jz C
  ; ds:1fh keeps mouse X coordinate. Add to x distance if button is pressed
  add dh,[1fh]
  ; ds:1eh is mouse Y coordinate, subtract from depth
  sub bl,[1eh]
  jmp short E
  ; fs has initial value 0, under DOS address 0x46c stores the current
  ; tick count. Put the tick in DL.
C:mov dl,[fs:46ch]
  ; Subtract an arbitary constant(0xb0,the first byte of program)
  ; It's picked to avoid collision with the walls too frequently
  sub bl,[si]
  ; move to the right every 64 ticks
  test dl,40h
  jnz D
  add dh,dl 
  ; move forward
D:add bl,dl
  ; The shld instruction here effectively shifts depth right by 6 bits
  ; and put it in BP
E:shld bp,bx,10
  ; x distance minus an arbitary offset parameterized by depth>>6 to add
  ; variaty to the scene.
  ; The data used here is the first 4 bytes of PSP, but can be anything.
  sub dh,[bp]
  ; Save x distance in al
  mov al,dh
  ; What these 3 instructions do are very similar to
  ; (xdist >=32u && ((z & 32u) != 0u)) || (ydist >= 32u) 
  and dh,bl
  or dh,ah
  and dh,20h
  ; If condition is not met then --cl and keep marching
  loopz B

  ; XOR al,bl,ah(x,y,z) together to create texture
  xor al,bl
  xor al,ah
  ; This undocumented instructino is a variant of AAM. It splits the
  ; high and low 4 bits of AL and put them in AH and Al. This effectively
  ; performs al % 16.
  db 0d4h,10h
  ; Add 16 to texure. Why 16? because the we want to map the texture
  ; value to the second row of the VGA mode 13h palette (which is
  ; different shades of grays)


  add al,10h
  ; write al to es:di (and di increase by 1)
  stosb

  ; If di is not 0, it means we are not finished with a frame
  ; go back to compute the next pixel
  or di,di
  jnz A

  ; read mouse state
  ; INT 33,3 - Get Mouse Position and Button Status
  ; CX = horizontal (X) position  (0..639)
  ; DX = vertical (Y) position  (0..199)
  ; BX = button status:
  ; |F-8|7|6|5|4|3|2|1|0|  Button Status
  ; |  | | | | | | | `---- left button (1 = pressed)
  ; |  | | | | | | `----- right button (1 = pressed)
  ; `------------------- unused
  mov ax,3
  int 33h
  ; left button down?
  test bl,al
  ; Store button state at 1dh
  mov [1dh],bl
  ; If button is not down, goto F
  jz F
  ; If button is down, store Y coordinate at 1eh
  mov [1eh],dl
  ; X coordinate at 1fh
  mov [1fh],cl

  ; Read keyboard scancode from port 60h
F:in ax,60h
  ; The scancode of ESC is 1, so if scancode - 1 == 0, then quit
  dec ax
  jnz B

  ; This procedure is the formular
  ; ((x - 100) * z + 0x13b0)
G:xchg ax,dx
  sub ax,64h
  imul ax,bx
  add ax,[si]
  ret


This is it, the assembly code explained. Being assembly code it may still be rather difficult to read so below is a clone I wrote in D language. It does exactly the same thing as the assembly code, and has exactly the same code flow and is even annotated with the original assembly code, have fun!

import std.stdio;
import std.datetime;
import std.algorithm;
import std.array;
import std.random;
import derelict.sdl2.sdl;

void render(SDL_Renderer* ren, SDL_Texture* tex)
{
 

   // the VGA frame buffer starts from A000 but wolf128's ES points
 
   // to 9FFF so the first 16 bytes are not displayed.
 
   ubyte[65536] fb;            // 9fff:0000 - 9fff:ffff
 
   // initial values;
 
   ubyte[] data = [0xCD, 0x20, 0xFF, 0x9F]; // the first 4 bytes of PSP
 
   ubyte mouseButton = 0xff;   // ds:1dh
 
   ubyte mouseX  = 0xff;       // ds:1eh
 
   ubyte mouseY  = 0xff;       // ds:1fh
 
   ushort idx = 0xfffe;        // di = 0xfffe on start
 
   // registers
 
   ubyte  stepsLeft;           // cl
 
   ubyte  z;                   // bl
 
   ushort offsetInFrame;       // ax
 
   ushort width;               // bp
 
   ushort y;                   // ax/ah
 
   ushort x;                   // dx/dh
 
   ubyte  texCoord;            // al
 
   ubyte  texel;               // al
 
   uint   temp;                // dx:ax
 
   uint   temp2;               // bp
nextframe:
 
   // A:mov cl,0ffh
 
   stepsLeft = 255;
nextstep:
 
   // B:mov bl,cl
 
   //   not bl
 
   z = 255 - stepsLeft;
 
   offsetInFrame = cast(ushort)(idx - 16); // lea ax,[di-10h]
 
   width = 320;                // mov bp,140h
 
   
 
   temp = offsetInFrame;           // cdq
 
   y = cast(ushort)(temp / width); // div bp
 
   x = cast(ushort)(temp % width);
 
   x = cast(ushort)((x - 100) * z + 0x13b0); // call G
 
   y = cast(ushort)((y - 100) * z + 0x13b0); // call G
 
   x >>= 8;                    // ; x in dh
 
   y >>= 8;                    // ; y in ah
 
   // test [1dh],7
 
   // jz C
 
   if (mouseButton & 7)
 
   {
 
       x += mouseX;            // add dh,[1fh]
 
       z -= mouseY;            // sub bl,[1eh]
 
   }
 
   else
 
   {
 
       // C:mov dl,[fs:46ch]
 
       ubyte ticks = cast(ubyte)(Clock.currSystemTick().msecs/60);
 
       z -= 0xb0;       // sub bl,[si]
 
       // test dl,40h
 
       // jnz D
 
       if ((ticks & 0x40) == 0)
 
           x += ticks;         // add dh,dl
 
       z += ticks;             // D:add bl,dl
 
   }
 
   temp2 = z >> 6;             // shld bp,bx,10
 
   x -= data[temp2];           // sub dh,[bp]
 
   texCoord = cast(ubyte)x;    // mov al,dh
 
   x &= z;                     // and dh,bl
 
   x |= y;                     // or dh,ah
 
   x &= 0x20;                  // and dh,20h
 
   if (stepsLeft-- && x == 0)  // loopz B
 
       goto nextstep;
 
   texel = texCoord ^ z;         // xor al,bl
 
   texel = texel ^ cast(ubyte)y; // xor al,ah
 
   texel %= 16;                  // db 0d4h,10h
 
   texel += 16;                  // add al,10h; map to the 2nd row of VGA palette
 
   fb[idx++] = texel;            // stosb
 
   // or di,di
 
   // jnz A
 
   if (idx != 0)
 
       goto nextframe;
 
   // mov ax,3
 
   // int 33h
 
   SDL_Event event;
 
   if (SDL_PollEvent(&event))
 
   {
 
       // test bl,al
 
       // mov [1dh],bl
 
       mouseButton = (event.type == SDL_MOUSEBUTTONDOWN) ? 2 : 0;
 
       if (mouseButton)
 
       {
 
           mouseX = cast(ubyte)event.button.x; // mov [1eh],dl
 
           mouseY = cast(ubyte)event.button.y; // mov [1fh],cl
 
       }
 
       
 
       // in ax,60h
 
       // dec ax
 
       // jnz B
 
       if (event.type == SDL_KEYUP && event.key.keysym.sym == SDLK_ESCAPE)
 
           return;
 
   }
 
   SDL_UpdateTexture(tex, null, cast(const(void)*)convert(fb[16..(16+320*200)]), 320 * uint.sizeof);
 
   SDL_RenderCopy(ren, tex, null, null);
 
   SDL_RenderPresent(ren);
 
   goto nextframe;
}

// ---------------------------------------------------------------------

uint convertPixel(ubyte p)
{
 
   p = cast(ubyte)((p - 16) * 16);
 
   return p | (p << 8) | (p << 16) | 0xff000000;
}

uint[] convert(const(ubyte)[] buf)
{
 
   return map!convertPixel(buf).array;
}

int main(char[][] args)
{
 
   DerelictSDL2.load();
 
   SDL_Window* win;
 
   SDL_Renderer* ren;
 
   SDL_Init(SDL_INIT_VIDEO);
 
   SDL_CreateWindowAndRenderer(320, 200, 0, &win, &ren);
 
   SDL_Texture* tex = SDL_CreateTexture(ren, SDL_PIXELFORMAT_ARGB8888, SDL_TEXTUREACCESS_STREAMING, 320, 200);
 
   render(ren, tex);
 
   return 0;
}