Does this count as a self-hosting compiler?

Today's goal [edit: this took a couple of days actually] is to write a text-to-binary converter, in x86-64 assembly, as an ELF executable file.

Input and output format

Following the sage advice of Leslie Lamport, let's State the Problem Before Describing the Solution, by describing the input format and the expected output of this program, before we write any code.

An input file is a sequence of zero or more newline-terminated lines.
A line contains either a comment or numbers.
A comment line starts with '#' and ends at the first newline character after that. In between, all bytes are allowed, including NUL.
A numbers line contains a sequence of zero or more numbers, separated by spaces, with possible leading spaces. But no trailing spaces, we're not savages.
A number is a pair of hex digits [0-9A-Fa-f].

input   = line*
line    = (empty | comment | numbers) [\n]
empty   =
comment = [#] [^\n]*
numbers = [ ]* number ([ ]+ number)*
number  = digit digit
digit   = [0-9A-Fa-f]

The output is simply a sequence of bytes, one for each number which appears on a non-comment line, in order.

The input is taken from stdin, and output is sent to stdout. The program exits with status 0 on success, and a nonzero status on failure (in which case some bytes may have been written to stdout).

Finite state machine

The input format is very simple, so we can parse it with a (very) finite state machine. If we encode the states as powers of two (so, 1, 2, 4, 8, 16), then we can test whether the current state is one of an arbitrary set of states with a single instruction, so let's do that. The possible states are:

Beginning of line. This happens at the start of input, and also after every newline. This state accepts the characters [\n# 0-9A-Fa-f] and end-of-file.
Inside a comment. This state accepts all characters, but [\n] is treated specially.
Waiting for a number. This happens after a space. This state accepts the characters [ 0-9A-Fa-f].
Inside a number. This happens after the first digit. This state accepts the characters [0-9A-Fa-f].
After a number. This happens after the second digit. This state accepts the characters [\n ].

Main loop

The main loop of the program is very simple:

Read some input characters into a buffer. If there are none, exit.
Feed each input character into the finite state machine, potentially producing an output character.
Write out the output buffer, then go back to step 1.

Memory layout

We'll be using a fixed size buffer (one page, 4KB) to read the input, to avoid doing a syscall for every single character of input.

Since every byte of input produces at most one byte of output, and no backtracking is necessary, we can just use the same buffer to gather output bytes before doing a syscall to write them out. This means that our memory allocation state is always captured by just a few pointers, which keep the same relative ordering throughout program execution:

Start of the 4KB buffer (constant).
End of buffered output bytes (variable).
Start of buffered input bytes (variable).
End of buffered input bytes (variable).
End of the 4KB buffer (constant).

And if we keep all addresses within the first 2GB of virtual memory (those addresses which can be represented as positive signed 32-bit pointers), we can manipulate our 64-bit pointers using 32-bit registers, which will save us from using a bunch of REX instruction prefixes.

How to read

The simplest way to read bytes from stdin is with the read syscall, but it's important to pay attention to the return value:

return value 0 means end-of-file;
return value -4 means EINTR, aka "try again";
other negative return values mean various errors; and
positive return values mean "this many bytes got read".

In particular, the existence of "try again" means that we need a loop, in addition to all the conditionals for dealing with end-of-file and IO errors.

rdi = 0(stdin)
rsi = (buffer)
rdx = (count)
do:
  rax = 0(read)
  syscall
  cmp rax, -4(eintr)
  while equal
test rax, rax
if negative goto panic
if zero goto end-of-file
(count) in rax
(garbage) in rcx
(garbage) in r11

How to write

Writing is similar to reading, with one extra complication: we want to make sure all output bytes are written before moving on, but the write syscall may not all consume all the bytes we give it before returning. So, we need to do it in a nested loop:

an inner loop to deal with EINTR; and
an outer loop to deal with partial writes.

rdi = 1(stdout)
rsi = (buffer)
rdx = (count)
test rdx, rdx
if nonzero:
  do:
    do:
      rax = 1(write)
      syscall
      cmp rax, -4(eintr)
      while equal
    test rax, rax
    if negative goto panic
    rsi += rax
    rdx -= rax
    while nonzero
(garbage) in rcx
(garbage) in r11

Register assignments

Our entire program state fits within a few bytes, so we have an embarassment of riches when it comes to register assignments. To cut down the choices, let's make up some arbitrary constraints:

If we avoid registers r8 through r15, we can avoid a bunch of REX instruction prefixes, so let's do that.
It's probably best to avoid messing with the stack pointer rsp, in case the kernel or interrupts assume that it's always valid.
rax, rcx, rdx, rsi, rdi, r11 are all used by syscalls, so they're not automatically preserved when we read or write bytes (or exit the program). We'd better avoid them for the state of the finite state machine.
That leaves rbx and rbp completely free, so far. rbp feels like a pointer, so it can hold a pointer to the constants-and-code area, as a base for offsets. rbx can hold the finite state machine state.
And for fun, let's use all of the weirdly specific x86 instructions, like lodsb and stosb and loop. Then it's natural to use:
- rax to hold the byte being processed;
- rdi to point to the end of the write buffer;
- rsi to point to the start of the read buffer; and
- rcx to hold the remaining number of input bytes.

An unexpected abstraction

While writing the finite state machine code, I found that I was repeating the same sequence of instructions a few times, just with different constants. This sequence already contained a bunch of conditional jumps, so I started wondering whether I could move the code around to de-duplicate it. The answer was yes!

The reason is that for three of the possible input characters, namely newline, '#', and space, the handling is the same:

there's a fixed set of states which accept the character;
the following state only depends on the character (after newline we're at the start of a line, after '#' we're inside a comment, and after space we're waiting for a number); and
for states which don't accept the character, we panic.

We can put the set of accepting states in dl and the following state in dh unconditionally, and branch to the "check transition" code if the character matches.

What I like about this abstraction is that I noticed it because of the code repetition, but then after thinking about it I found a conceptual reason.

Code layout

For no particular reason, I initially wrote the code in the following order (with jump offsets missing, to be filled in later):

read input
finite state machine
found digit
second digit
check transition
next iteration
write output
panic
exit

I optimistically wrote all my branches using the short form of the jump instructions, which only accept a jump distance in the range [-128, 127], and of course this didn't work right away. Surprisingly, though, only a few jump targets were too far away, so I was able to reorder the labels to accomodate this:

read input
finite state machine
found digit
panic
exit
check transition
next iteration
write output
second digit

Basically, "panic" and "exit" moved up closer to the center, to be reachable from all the code, and "second digit" moved out of the way, to the end. Almost all of the code blocks actually fall through to the next block, so this was basically the only way to move the blocks around without introducing extra jump instructions. Thankfully all of the jump distances were fine afterwards (including, just barely, the jump from "write output" to "read input", with a distance of -127 bytes!).

Debugging

I "compiled" the source code below using a tiny python3 script, and ran in step by step in gdb on a few small input files. While doing this, I found the following bugs:

I accidentally called syscall 76 (truncate) instead of syscall 60 (exit) at first. Hopefully it didn't do any damage when called with a pointer to a very low address as the path argument?
I miscalculated 1 jump distance.
I wrote the wrong ModRM byte in 3 instructions.
I had the opposite jump condition in 1 place.
I forgot to check the state of the finite state machine before exiting at end of file. Fixing this will require adding more code, so more bytes, somewhere in the middle. Which means that I have to recalculate jump targets and probably switch to a long-form jump somewhere :( I think I'll just fix that in version 2, right now the program seems to compile correct input files correctly, at least.

Version 1 success!

The program has one known bug, but it's able to "compile" its own source code into an identical copy of itself, yay!

$ ./byter.py <byter.bytes >byter && chmod +x byter
$ ./byter <byter.bytes >self_byter && \
>     cmp byter self_byter && \
>     echo 'success!'
success!

The source code

# -------------------------------------------------
# byter.bytes -- a text-to-binary converter
# -------------------------------------------------

# -------------------------------------------------
# ELF header
# -------------------------------------------------
# magic number <DEL>"ELF"
7f 45 4c 46
# 64-bit architecture
             02
# little-endian
                01
# ELF version, the one and only
                   01
# no ELF extensions
                      00
# reserved
                           00 00 00 00  00 00 00 00
# executable file
02 00
# x86-64 architecture
      3e 00
# file version, the one and only
             01 00 00 00
# address of first instruction to execute
                           b0 00 40 00  00 00 00 00
# file offset of the program header table
40 00 00 00  00 00 00 00
# file offset of the section header table (absent)
                           00 00 00 00  00 00 00 00
# architecture-specific flags
00 00 00 00
# size of this ELF header
             40 00
# size of a program header
                   38 00
# count of program headers
                           02 00
# size of a section header
                                 40 00
# count of section headers
                                        00 00
# index of the section name section header (absent)
                                              00 00

# -------------------------------------------------
# 1st program header (code)
# -------------------------------------------------

# loadable segment
01 00 00 00
# permissions: read+execute
             05 00 00 00
# file offset
                           00 00 00 00  00 00 00 00
# memory address
00 00 40 00  00 00 00 00
# reserved
                           00 00 00 00  00 00 00 00
# size in file
5a 01 00 00  00 00 00 00
# size in memory
                           5a 01 00 00  00 00 00 00
# 4KB alignment
00 10 00 00  00 00 00 00

# -------------------------------------------------
# 2nd program header (read/write buffer)
# -------------------------------------------------

# loadable segment
                           01 00 00 00
# permissions: read+write
                                        06 00 00 00
# file offset
00 00 00 00  00 00 00 00
# memory address
                           00 00 41 00  00 00 00 00
# reserved
00 00 00 00  00 00 00 00
# size in file
                           00 00 00 00  00 00 00 00
# size in memory
00 10 00 00  00 00 00 00
# 4KB alignment
                           00 10 00 00  00 00 00 00

# -------------------------------------------------
# code
# -------------------------------------------------

# lea rbp, [rip-06] (start of code, end of headers)
8d 2d fa ff  ff ff
# mov bl, 01 (state: start of line)
                   b3 01

## read input
# xor edi, edi (stdin)
                           33 ff
# mov esi, [rbp-28] (start of buffer)
                                 8b 75  d8
# mov edx, [rbp-10] (size of buffer)
                                           8b 55 f0
# do: mov eax, edi (read syscall)
8b c7
# syscall
      0f 05
# cmp eax, -4 (eintr)
             3d fc ff ff   ff
# je [rip-0b] (retry)
                              74 f5
# test eax, eax
                                    85  c0
# js [rip+46] (panic)
                                           78 46
# jz [rip+49] (exit)
                                                 74
49
# mov ecx, eax (size of input)
   8b c8
# mov edi, esi (start of buffer)
         8b  fe

## finite state machine
# lodsb
                ac

# cmp al, 0a (newline)
                   3c 0a
# mov dl, 13 (states which accept newline)
                           b2 13
# mov dh, 01 (state: start of line)
                                 b6 01
# je [rip+43] (check transition)
                                        74 43

# test bl, 02 (state: inside comment)
                                              f6 c3
02
# jnz [rip+44] (next iteration)
   75 44

# cmp al, 20 (space)
         3c  20
# mov dl, 15 (states which accept space)
                b2 15
# mov dh, 04 (state: waiting for number)
                      b6   04
# je [rip+36] (check transition)
                              74 36

# cmp al, 23 ('#')
                                    3c  23
# mov dl, 01 (states which accept '#')
                                           b2 01
# mov dh, 02 (state: inside comment)
                                                 b6
02
# je [rip+2e] (check transition)
   74 2e

# cmp al, 30 ('0')
         3c  30
# jb [rip+1e] (panic)
                72 1e
# cmp al, 39 ('9')
                      3c   39
# jbe [rip+0c] (found digit)
                              76 0c
# or al, 20 (smash case)
                                    0c  20
# cmp al, 61 ('a')
                                           3c 61
# jb [rip+14] (panic)
                                                 72
14
# cmp al, 66 ('f')
   3c 66
# ja [rip+10] (panic)
         77  10
# add al, 09
                04 09

## found digit
# test bl, 05 (states which accept a first digit)
                      f6   c3 05
# jz [rip+40] (second digit)
                                 74 40
# mov bl, 08 (state: middle of number)
                                        b3 08
# shl al, 4
                                              c0 e0
04
# mov bh, al
   8a f8
# jmp [rip+12] (next iteration)
         eb  12

## panic
# mov edi, 1 (nonzero exit status)
                bf 01 00   00 00
## exit
# mov eax, 3c (exit syscall)
                                 b8 3c  00 00 00
# syscall
                                                 0f
05

## check transition
# test bl, dl (acceptable states)
   84 da
# jz [rip-10] (panic)
         74  f0
# mov bl, dh
                8a de
## next iteration
# loop [rip-54] (finite state machine)
                      e2   ac

## write output
# mov edx, edi (end of output buffer)
                              8b d7
# mov edi, 1 (stdout)
                                    bf  01 00 00 00
# mov esi, [rbp-28] (start of buffer)
8b 75 d8
# sub edx, esi (get size of output buffer)
         2b  d6
# jz [rip-7f] (read input)
                74 81
# do: mov eax, edi (write syscall)
                      8b   c7
# syscall
                              0f 05
# cmp eax, -4 (eintr)
                                    3d  fc ff ff ff
# je [rip-0b] (retry)
74 f5
# test eax, eax
      85 c0
# js [rip-31] (panic)
             78 cf
# add esi, eax
                   03 f0
# sub edx, eax
                           2b d0
# jmp [rip-17] (maybe write more)
                                 eb e9

## second digit
# test bl, 08 (states which accept a second digit)
                                       f6 c3 08
# jz [rip-3c] (panic)
                                                74
c4
# mov bl, 10 (state: end of number)
   b3 10
# and al, 0f
         24  0f
# or al, bh
                0a c7
# stosb
                      aa
# jmp [rip-33] (next iteration)
                           eb cd