The Teensy Executable Revisited

🚀 Read this trending post from Hacker News 📖

📂 **Category**:

📌 **What You’ll Learn**:

<br /> The Teensy Executable Revisited<br />

(or, “Thunderclouds Gather on the Horizon”)

On a couple of occasions, people have responded to my original essay
with the comment that what I’ve created by the end isn’t really an ELF
executable. Rather, it is a file that the Linux kernel, in its current
incarnation, happens to mistake for an ELF executable.

It’s a fair point. That 45-byte file clearly doesn’t conform to
numerous requirements of the ELF specification. But can you blame me?
How could I have stopped at the point just before I tossed the ELF
specification out the window, knowing what might still be possible?

But to satisfy these purists, and the puritan side in all of us, I’ve
created this sequel.

So. We have an executable that we whittled down to 45 bytes. We now
want to bring it into rigid conformance with published standards,
while still keeping it as small as possible.

The point at which we strayed from straight and narrow path was when
we started fiddling with “unused” fields in the ELF header. So let’s
back up to before that point:

  BITS 32
  
                org     0x08048000
  
  ehdr:                                                 ; Elf32_Ehdr
                db      0x7F, "ELF", 1, 1, 1            ;   e_ident
        times 9 db      0
                dw      2                               ;   e_type
                dw      3                               ;   e_machine
                dd      1                               ;   e_version
                dd      _start                          ;   e_entry
                dd      phdr - $$                       ;   e_phoff
                dd      0                               ;   e_shoff
                dd      0                               ;   e_flags
                dw      ehdrsz                          ;   e_ehsize
                dw      phdrsz                          ;   e_phentsize
                dw      1                               ;   e_phnum
                dw      0                               ;   e_shentsize
                dw      0                               ;   e_shnum
                dw      0                               ;   e_shstrndx
  ehdrsz        equ     $ - ehdr
  
  phdr:                                                 ; Elf32_Phdr
                dd      1                               ;   p_type
                dd      0                               ;   p_offset
                dd      $$                              ;   p_vaddr
                dd      $$                              ;   p_paddr
                dd      filesz                          ;   p_filesz
                dd      filesz                          ;   p_memsz
                dd      5                               ;   p_flags
                dd      0x1000                          ;   p_align
  phdrsz        equ     $ - phdr
  
  _start:
                xor     eax, eax
                inc     eax
                mov     bl, 42
                int     0x80
  
  filesz        equ     $ - $$

This was our ninety-one-byte version. So: are we stuck with this as
our best size? No, not quite. We violated no rules when we overlapped
the ELF header and the program header table by eight bytes. The ELF
specification explicitly permits overlap of different data structures
within the file. So let’s do that here:

  ; tiny.asm
  
  BITS 32
  
                org     0x08048000
  
  ehdr:
                db      0x7F, "ELF", 1, 1, 1    ; e_ident
        times 9 db      0
                dw      2                       ; e_type
                dw      3                       ; e_machine
                dd      1                       ; e_version
                dd      _start                  ; e_entry
                dd      phdr - $$               ; e_phoff
                dd      0                       ; e_shoff
                dd      0                       ; e_flags
                dw      ehdrsz                  ; e_ehsize
                dw      phdrsz                  ; e_phentsize
  phdr:         dd      1                       ; e_phnum       ; p_type
                                                ; e_shentsize
                dd      0                       ; e_shnum       ; p_offset
                                                ; e_shstrndx
  ehdrsz        equ     $ - ehdr
                dd      $$                                      ; p_vaddr
                dd      $$                                      ; p_paddr
                dd      filesz                                  ; p_filesz
                dd      filesz                                  ; p_memsz
                dd      5                                       ; p_flags
                dd      0x1000                                  ; p_align
  phdrsz        equ     $ - phdr
  
  _start:       xor     eax, eax
                inc     eax
                mov     bl, 42
                int     0x80
  
  filesz        equ     $ - $$

That gives us eighty-three bytes. What else can we do? Seems like
there isn’t much. In desperation, we might turn back to the ELF
specification and read it over again, looking for something.

Are there any guarantees anything about the initial register values?
Only for one register: edx. And what is says is that it will contain
either zero, or the address of a final shutdown procedure. So, no
guarantees at all, really. Keep looking.

A-ha: The p_paddr field of the program header table structure! Every
other field of the headers which doesn’t apply to to Intel
architecture, or doesn’t apply to an executable file — or, at least,
not to our executable file — is required by the ELF specification to
be set to zero. But for the p_paddr field, the specification says the
field has unspecified contents. So we have four bytes that we can
play with, after all.

What can we do with them? Use it to hold part of our program,
naturally. Of course, we can’t put the whole program there, so we’ll
need to waste two of the four bytes on a jmp instruction, in order to
get to the rest of it. But that still leaves two bytes that we can
use, and the first instruction of our program is exactly two bytes
long.

  ; tiny.asm
  
  BITS 32
  
                org     0x08048000
  
  ehdr:
                db      0x7F, "ELF", 1, 1, 1    ; e_ident
        times 9 db      0
                dw      2                       ; e_type
                dw      3                       ; e_machine
                dd      1                       ; e_version
                dd      _start                  ; e_entry
                dd      phdr - $$               ; e_phoff
                dd      0                       ; e_shoff
                dd      0                       ; e_flags
                dw      ehdrsz                  ; e_ehsize
                dw      phdrsz                  ; e_phentsize
  phdr:         dd      1                       ; e_phnum       ; p_type
                                                ; e_shentsize
                dd      0                       ; e_shnum       ; p_offset
                                                ; e_shstrndx
  ehdrsz        equ     $ - ehdr
                dd      $$                                      ; p_vaddr
  _start:       xor     eax, eax                                ; p_paddr
                jmp     short part2
                dd      filesz                                  ; p_filesz
                dd      filesz                                  ; p_memsz
                dd      5                                       ; p_flags
                dd      0x1000                                  ; p_align
  phdrsz        equ     $ - phdr
  
  part2:        inc     eax
                mov     bl, 42
                int     0x80
  
  filesz        equ     $ - $$

So. Eighty-one bytes. Is that all?

The next field after the p_paddr field is the p_filesz field. If only
we could overlap the jmp instruction with that, we could squeeze
another instruction in there. But alas, the first byte of that field
is the size of the entire file, which would be an unwise jump to make.
And the remaining bytes are zeros. That approach doesn’t look too
promising.

What about the field before p_paddr? That’s the address the program is
to be loaded at. Well, we already know we don’t have to use the
default value of 0x08048000. We do need to keep the address
page-aligned, at the very least, but we should be able to fit a
two-byte instruction into the top half of the address. However, our
xor won’t work for that. Remember that this is little-endian. The
bytes of our program are currently:

  31C0              xor        eax, eax
  EB10              jmp        short part2
  
  40        part2:  inc        eax
  B32A              mov        bl, 42
  CD80              int        0x80

The xor instruction would leave C0 as the top byte, which has the high
bit set, and Linux doesn’t appreciate us putting our code there. (As a
general rule, in a 32-bit executable, the top half of memory is
reserved for dynamically assigned addresses: the heap, the stack, and
shared-object libraries.)

The “mov bl, 42” instruction, on the other hand, will give us a
perfectly acceptable top address byte. So, we can change the load
address to 0x2AB30000, and with a little rearranging we have:

  ; tiny.asm
  
  BITS 32
  
                org     0x2AB30000
  
  ehdr:
                db      0x7F, "ELF", 1, 1, 1    ; e_ident
        times 9 db      0
                dw      2                       ; e_type
                dw      3                       ; e_machine
                dd      1                       ; e_version
                dd      _start                  ; e_entry
                dd      phdr - $$               ; e_phoff
                dd      0                       ; e_shoff
                dd      0                       ; e_flags
                dw      ehdrsz                  ; e_ehsize
                dw      phdrsz                  ; e_phentsize
  phdr:         dd      1                       ; e_phnum       ; p_type
                                                ; e_shentsize
                dd      0                       ; e_shnum       ; p_offset
                                                ; e_shstrndx
  ehdrsz        equ     $ - ehdr
                dw      0                                       ; p_vaddr
  _start:       mov     bl, 42
                xor     eax, eax                                ; p_paddr
                jmp     short part2
                dd      filesz                                  ; p_filesz
                dd      filesz                                  ; p_memsz
                dd      5                                       ; p_flags
                dd      0x1000                                  ; p_align
  phdrsz        equ     $ - phdr
  
  part2:        inc     eax
                int     0x80
  
  filesz        equ     $ - $$

So we are now at seventy-nine bytes. If we can just find room for one
more byte, for the “inc eax” instruction, we could replace the jump in
the p_addr field with the “int 0x80”, and thus have the entire program
legally inside the headers! Just one more lousy byte …

Perhaps we could just let the last byte of the program spill over into
the next field? That would be the p_filesz field, which specifies the
size of the program segment in the file. The last byte of the program
is 0x80, which is certainly too big to be our file size. That is to
say: Yes, we could lie about the file’s size, and Linux will look the
other way — but we’re supposed to be on the straight and narrow here.

The field after p_filesz, however, is p_memsz. If only these two
fields were reversed, there wouldn’t be a problem. We’ve looked at
this field before: it specifies how much memory to allocate when
loading the program into memory. Setting that field to be larger than
the file size is perfectly fine.

But here is a way that we can use this field: We can jump over
p_filesz and into p_memsz. If we try this, we will find that we can,
in fact, come out ahead:

  ; tiny.asm
  
  BITS 32
  
                org     0x2AB30000
  
  ehdr:
                db      0x7F, "ELF", 1, 1, 1    ; e_ident
        times 9 db      0
                dw      2                       ; e_type
                dw      3                       ; e_machine
                dd      1                       ; e_version
                dd      _start                  ; e_entry
                dd      phdr - $$               ; e_phoff
                dd      0                       ; e_shoff
                dd      0                       ; e_flags
                dw      ehdrsz                  ; e_ehsize
                dw      phdrsz                  ; e_phentsize
  phdr:         dd      1                       ; e_phnum       ; p_type
                                                ; e_shentsize
                dd      0                       ; e_shnum       ; p_offset
                                                ; e_shstrndx
  ehdrsz        equ     $ - ehdr
                dw      0                                       ; p_vaddr
  _start:       mov     bl, 42
                xor     eax, eax                                ; p_paddr
                jmp     short part2
                dd      filesz                                  ; p_filesz
  part2:        inc     eax                                     ; p_memsz
                int     0x80
                db      0
                dd      7                                       ; p_flags
                dd      0x1000                                  ; p_align
  phdrsz        equ     $ - phdr
  
  filesz        equ     $ - $$

And we have arrived at seventy-six bytes!

This version has the disadvantage that the program requests a fair
chunk of memory. 0x0080CD40 bytes, to be precise, or eight megabytes.
(And note that we now have to add the writeable flag to p_flags, to
allow that memory to be zero-initialized.) Of course the memory is
never actually accessed, so it hardly matters, what with virtual
memory and all. All this number really indicates is at what address
Linux will report a segfault instead of trying to allocate a page of
RAM. So there’s no real harm done.

But it turns out there is a way to use a single byte to jump over four
bytes. How? By using a five-byte instruction, of course! Consider a
nice, innocuous instruction like “cmp eax, imm32”. The only effect it
has is to set the flags. Well, that and it advances the instruction
pointer:

  ; tiny.asm
  
  BITS 32
  
                org     0x2AB30000
  
  ehdr:
                db      0x7F, "ELF", 1, 1, 1    ; e_ident
        times 9 db      0
                dw      2                       ; e_type
                dw      3                       ; e_machine
                dd      1                       ; e_version
                dd      _start                  ; e_entry
                dd      phdr - $$               ; e_phoff
                dd      0                       ; e_shoff
                dd      0                       ; e_flags
                dw      ehdrsz                  ; e_ehsize
                dw      phdrsz                  ; e_phentsize
  phdr:         dd      1                       ; e_phnum       ; p_type
                                                ; e_shentsize
                dd      0                       ; e_shnum       ; p_offset
                                                ; e_shstrndx
  ehdrsz        equ     $ - ehdr
                dw      0                                       ; p_vaddr
  _start:       mov     bl, 42
                xor     eax, eax                                ; p_paddr
                inc     eax
                cmp     eax, strict dword filesz                ; p_filesz
                int     0x80                                    ; p_memsz
                dw      0
                dd      7                                       ; p_flags
                dd      0x1000                                  ; p_align
  phdrsz        equ     $ - phdr
  
  filesz        equ     $ - $$

The first byte of the compare instruction is at the last byte of the
p_paddr field, leaving the four bytes of immediate data. Our program
gets a ride across an intact p_filesz value at half the fare, and that
lets us move the increment instruction out of the p_memsz field. This
leaves that field holding a value of 0x000080CD, or a little over 32K
— a much friendlier amount of useless memory to allocate.

This is the final version of our new program. It is 76 bytes long, so
we have established conformance with the ELF specification at a cost
of 31 bytes. Not bad, all things considered. And now we have two
programs for which we can say that every byte has been accounted for.

But have we truly silenced the purists?

Some of them, yes. But a few are still present, tapping an accusing
index finger on our int instruction. Oh, yeah. That. Direct system
calls are not actually part of the documented interface with the
kernel. (Not for the likes of us, anyway.) The fact that system call
no. 1 is exit isn’t guaranteed to stay the same under future
revisions. Who knows? It’s possible that a later kernel will need to
introduce new functionality to the exit function, and so, in order to
maintain binary backwards-compatibility, they’ll introduce a new exit
function with a currently-unused system call number, and deprecate the
old one.

Admittedly, this is extremely unlikely to happen with such a basic
necessity as exit. But it does happen. For example, there are no less
than three system calls that implement the uname function, the first
two having both been deprecated. In order to add more information to
the utsname structure, new system calls had to be introduced,
permitting the older ones to continue to adhere to the original ABI
for backwards-compatibility with older binaries. (If, at some point,
everyone was confident that there were no more older binaries floating
around, those system calls could be removed entirely, and then perhaps
new ones could reuse those numbers. Except of course there are
obnoxious people like me who impede such progress by hard-coding
direct system calls!)

So, okay. If we’re trying to be completely above-board, then we’re
still in the doghouse. And the only way out is to strictly adhere to
the documented ABI. In this case, that means we need to exit by using
_exit. Not the system call, the function. The one that’s supplied by
libc. Which means that (groan) we need to re-introduce all that
dynamic linking overhead that we’ve so gratefully avoided so far.

That means we need to learn exactly how ELF binaries make dynamic
linking happen, so that we can do it ourselves (and in hopefully less
space). That’s not a trivial subject, however, so we’re going to have
to save it for the next essay.

(next)

Tiny

Software

Brian Raiter

💬 **What’s your take?**
Share your thoughts in the comments below!

#️⃣ **#Teensy #Executable #Revisited**

🕒 **Posted on**: 1782275170

🌟 **Want more?** Click here for more info! 🌟

The Teensy Executable Revisited

By

Leave a Reply Cancel reply