Today's goal [edit: this took a couple of days actually] is to write a text-to-binary converter, in x86-64 assembly, as an ELF executable file.
Following the sage advice of Leslie Lamport, let's State the Problem Before Describing the Solution, by describing the input format and the expected output of this program, before we write any code.
[0-9A-Fa-f]
.
input = line*
line = (empty | comment | numbers) [\n]
empty =
comment = [#] [^\n]*
numbers = [ ]* number ([ ]+ number)*
number = digit digit
digit = [0-9A-Fa-f]
The output is simply a sequence of bytes, one for each number which appears on a non-comment line, in order.
The input is taken from stdin
, and output is sent to stdout
. The program exits with status 0 on success, and a nonzero status on failure (in which case some bytes may have been written to stdout
).
The input format is very simple, so we can parse it with a (very) finite state machine. If we encode the states as powers of two (so, 1, 2, 4, 8, 16), then we can test whether the current state is one of an arbitrary set of states with a single instruction, so let's do that. The possible states are:
[\n# 0-9A-Fa-f]
and end-of-file.
[\n]
is treated specially.
[ 0-9A-Fa-f]
.
[0-9A-Fa-f]
.
[\n ]
.
The main loop of the program is very simple:
We'll be using a fixed size buffer (one page, 4KB) to read the input, to avoid doing a syscall for every single character of input.
Since every byte of input produces at most one byte of output, and no backtracking is necessary, we can just use the same buffer to gather output bytes before doing a syscall to write them out. This means that our memory allocation state is always captured by just a few pointers, which keep the same relative ordering throughout program execution:
And if we keep all addresses within the first 2GB of virtual memory (those addresses which can be represented as positive signed 32-bit pointers), we can manipulate our 64-bit pointers using 32-bit registers, which will save us from using a bunch of REX instruction prefixes.
The simplest way to read bytes from stdin
is with the read
syscall, but it's important to pay attention to the return value:
EINTR
, aka "try again";
In particular, the existence of "try again" means that we need a loop, in addition to all the conditionals for dealing with end-of-file and IO errors.
rdi = 0(stdin)
rsi = (buffer)
rdx = (count)
do:
rax = 0(read)
syscall
cmp rax, -4(eintr)
while equal
test rax, rax
if negative goto panic
if zero goto end-of-file
(count) in rax
(garbage) in rcx
(garbage) in r11
Writing is similar to reading, with one extra complication: we want to make sure all output bytes are written before moving on, but the write
syscall may not all consume all the bytes we give it before returning. So, we need to do it in a nested loop:
rdi = 1(stdout)
rsi = (buffer)
rdx = (count)
test rdx, rdx
if nonzero:
do:
do:
rax = 1(write)
syscall
cmp rax, -4(eintr)
while equal
test rax, rax
if negative goto panic
rsi += rax
rdx -= rax
while nonzero
(garbage) in rcx
(garbage) in r11
Our entire program state fits within a few bytes, so we have an embarassment of riches when it comes to register assignments. To cut down the choices, let's make up some arbitrary constraints:
r8
through r15
, we can avoid a bunch of REX instruction prefixes, so let's do that.
rsp
, in case the kernel or interrupts assume that it's always valid.
rax
, rcx
, rdx
, rsi
, rdi
, r11
are all used by syscalls, so they're not automatically preserved when we read or write bytes (or exit the program). We'd better avoid them for the state of the finite state machine.
rbx
and rbp
completely free, so far. rbp
feels like a pointer, so it can hold a pointer to the constants-and-code area, as a base for offsets. rbx
can hold the finite state machine state.
lodsb
and stosb
and loop
. Then it's natural to use:
rax
to hold the byte being processed;
rdi
to point to the end of the write buffer;
rsi
to point to the start of the read buffer; and
rcx
to hold the remaining number of input bytes.
While writing the finite state machine code, I found that I was repeating the same sequence of instructions a few times, just with different constants. This sequence already contained a bunch of conditional jumps, so I started wondering whether I could move the code around to de-duplicate it. The answer was yes!
The reason is that for three of the possible input characters, namely newline, '#', and space, the handling is the same:
We can put the set of accepting states in dl
and the following state in dh
unconditionally, and branch to the "check transition" code if the character matches.
What I like about this abstraction is that I noticed it because of the code repetition, but then after thinking about it I found a conceptual reason.
For no particular reason, I initially wrote the code in the following order (with jump offsets missing, to be filled in later):
I optimistically wrote all my branches using the short form of the jump instructions, which only accept a jump distance in the range [-128, 127], and of course this didn't work right away. Surprisingly, though, only a few jump targets were too far away, so I was able to reorder the labels to accomodate this:
Basically, "panic" and "exit" moved up closer to the center, to be reachable from all the code, and "second digit" moved out of the way, to the end. Almost all of the code blocks actually fall through to the next block, so this was basically the only way to move the blocks around without introducing extra jump instructions. Thankfully all of the jump distances were fine afterwards (including, just barely, the jump from "write output" to "read input", with a distance of -127 bytes!).
I "compiled" the source code below using a tiny python3 script, and ran in step by step in gdb on a few small input files. While doing this, I found the following bugs:
The program has one known bug, but it's able to "compile" its own source code into an identical copy of itself, yay!
$ ./byter.py <byter.bytes >byter && chmod +x byter
$ ./byter <byter.bytes >self_byter && \
> cmp byter self_byter && \
> echo 'success!'
success!
# -------------------------------------------------
# byter.bytes -- a text-to-binary converter
# -------------------------------------------------
# -------------------------------------------------
# ELF header
# -------------------------------------------------
# magic number <DEL>"ELF"
7f 45 4c 46
# 64-bit architecture
02
# little-endian
01
# ELF version, the one and only
01
# no ELF extensions
00
# reserved
00 00 00 00 00 00 00 00
# executable file
02 00
# x86-64 architecture
3e 00
# file version, the one and only
01 00 00 00
# address of first instruction to execute
b0 00 40 00 00 00 00 00
# file offset of the program header table
40 00 00 00 00 00 00 00
# file offset of the section header table (absent)
00 00 00 00 00 00 00 00
# architecture-specific flags
00 00 00 00
# size of this ELF header
40 00
# size of a program header
38 00
# count of program headers
02 00
# size of a section header
40 00
# count of section headers
00 00
# index of the section name section header (absent)
00 00
# -------------------------------------------------
# 1st program header (code)
# -------------------------------------------------
# loadable segment
01 00 00 00
# permissions: read+execute
05 00 00 00
# file offset
00 00 00 00 00 00 00 00
# memory address
00 00 40 00 00 00 00 00
# reserved
00 00 00 00 00 00 00 00
# size in file
5a 01 00 00 00 00 00 00
# size in memory
5a 01 00 00 00 00 00 00
# 4KB alignment
00 10 00 00 00 00 00 00
# -------------------------------------------------
# 2nd program header (read/write buffer)
# -------------------------------------------------
# loadable segment
01 00 00 00
# permissions: read+write
06 00 00 00
# file offset
00 00 00 00 00 00 00 00
# memory address
00 00 41 00 00 00 00 00
# reserved
00 00 00 00 00 00 00 00
# size in file
00 00 00 00 00 00 00 00
# size in memory
00 10 00 00 00 00 00 00
# 4KB alignment
00 10 00 00 00 00 00 00
# -------------------------------------------------
# code
# -------------------------------------------------
# lea rbp, [rip-06] (start of code, end of headers)
8d 2d fa ff ff ff
# mov bl, 01 (state: start of line)
b3 01
## read input
# xor edi, edi (stdin)
33 ff
# mov esi, [rbp-28] (start of buffer)
8b 75 d8
# mov edx, [rbp-10] (size of buffer)
8b 55 f0
# do: mov eax, edi (read syscall)
8b c7
# syscall
0f 05
# cmp eax, -4 (eintr)
3d fc ff ff ff
# je [rip-0b] (retry)
74 f5
# test eax, eax
85 c0
# js [rip+46] (panic)
78 46
# jz [rip+49] (exit)
74
49
# mov ecx, eax (size of input)
8b c8
# mov edi, esi (start of buffer)
8b fe
## finite state machine
# lodsb
ac
# cmp al, 0a (newline)
3c 0a
# mov dl, 13 (states which accept newline)
b2 13
# mov dh, 01 (state: start of line)
b6 01
# je [rip+43] (check transition)
74 43
# test bl, 02 (state: inside comment)
f6 c3
02
# jnz [rip+44] (next iteration)
75 44
# cmp al, 20 (space)
3c 20
# mov dl, 15 (states which accept space)
b2 15
# mov dh, 04 (state: waiting for number)
b6 04
# je [rip+36] (check transition)
74 36
# cmp al, 23 ('#')
3c 23
# mov dl, 01 (states which accept '#')
b2 01
# mov dh, 02 (state: inside comment)
b6
02
# je [rip+2e] (check transition)
74 2e
# cmp al, 30 ('0')
3c 30
# jb [rip+1e] (panic)
72 1e
# cmp al, 39 ('9')
3c 39
# jbe [rip+0c] (found digit)
76 0c
# or al, 20 (smash case)
0c 20
# cmp al, 61 ('a')
3c 61
# jb [rip+14] (panic)
72
14
# cmp al, 66 ('f')
3c 66
# ja [rip+10] (panic)
77 10
# add al, 09
04 09
## found digit
# test bl, 05 (states which accept a first digit)
f6 c3 05
# jz [rip+40] (second digit)
74 40
# mov bl, 08 (state: middle of number)
b3 08
# shl al, 4
c0 e0
04
# mov bh, al
8a f8
# jmp [rip+12] (next iteration)
eb 12
## panic
# mov edi, 1 (nonzero exit status)
bf 01 00 00 00
## exit
# mov eax, 3c (exit syscall)
b8 3c 00 00 00
# syscall
0f
05
## check transition
# test bl, dl (acceptable states)
84 da
# jz [rip-10] (panic)
74 f0
# mov bl, dh
8a de
## next iteration
# loop [rip-54] (finite state machine)
e2 ac
## write output
# mov edx, edi (end of output buffer)
8b d7
# mov edi, 1 (stdout)
bf 01 00 00 00
# mov esi, [rbp-28] (start of buffer)
8b 75 d8
# sub edx, esi (get size of output buffer)
2b d6
# jz [rip-7f] (read input)
74 81
# do: mov eax, edi (write syscall)
8b c7
# syscall
0f 05
# cmp eax, -4 (eintr)
3d fc ff ff ff
# je [rip-0b] (retry)
74 f5
# test eax, eax
85 c0
# js [rip-31] (panic)
78 cf
# add esi, eax
03 f0
# sub edx, eax
2b d0
# jmp [rip-17] (maybe write more)
eb e9
## second digit
# test bl, 08 (states which accept a second digit)
f6 c3 08
# jz [rip-3c] (panic)
74
c4
# mov bl, 10 (state: end of number)
b3 10
# and al, 0f
24 0f
# or al, bh
0a c7
# stosb
aa
# jmp [rip-33] (next iteration)
eb cd