How programs get run: ELF binaries [LWN.net]


February 4, 2015

This article was contributed by David Drysdale

The previous article
in this series described the general
mechanisms that the Linux kernel has for executing programs as a result of a user-space call
to execve(). However, the particular format handlers described in that article each deferred the
process of execution to an inner call to search_binary_handler(). That
recursion almost always ends with the invocation of an ELF binary program, which is the subject of this
article.

The ELF format

ELF (Executable and Linkable Format)
is the main binary format in use on modern Linux systems, and support for it is implemented
in the file fs/binfmt_elf.c. It’s also a slightly complicated format
for the kernel to handle:
the main load_elf_binary() function spans over 400 lines,
and the ELF support code is more than four times as big as the
code that supports the old a.out format.

An ELF file for an executable program (rather than a shared library or an object file) must always contain
a program header table near the
start of the file, after the ELF
header; each entry in this table provides information that is needed to run the program.

The kernel only really cares about three types of program header
entries. The first type is the PT_LOAD segment,
which describes areas of the new program’s running memory. This
includes code and data sections that come from the
executable file, together with the size of a BSS section. The BSS will
be filled with zeroes (thus only its length needs to be stored in the executable file). The second entry of
interest is a PT_INTERP entry, which identifies the run-time linker needed to assemble the complete
program; for the time being, we’ll assume a statically linked ELF binary and return to dynamic linking later.
Finally, the kernel also gets a single bit of information from a PT_GNU_STACK entry, if present,
which indicates whether the program’s stack should be made executable or not.
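As a rough user-space illustration of those three entry types, the sketch below scans the program header table of a 64-bit ELF image held in memory. The phdr_summary structure and the summarize_phdrs() name are invented for this example and are not kernel code; the sketch also reports a missing PT_GNU_STACK entry as "not executable", whereas the kernel falls back to an architecture default in that case.

```c
#include <elf.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical summary type, invented for this illustration. */
struct phdr_summary {
    int load_segments;  /* number of PT_LOAD entries seen */
    int has_interp;     /* 1 if a PT_INTERP entry is present */
    int exec_stack;     /* 1 if PT_GNU_STACK requests an executable stack */
};

/*
 * Scan the program header table of a 64-bit ELF image in memory.
 * Returns 0 on success, or -1 if the table does not fit inside the image.
 */
int summarize_phdrs(const unsigned char *image, size_t len,
                    struct phdr_summary *out)
{
    Elf64_Ehdr eh;

    memset(out, 0, sizeof(*out));
    if (len < sizeof(eh))
        return -1;
    memcpy(&eh, image, sizeof(eh));
    if (eh.e_phentsize != sizeof(Elf64_Phdr))
        return -1;
    if (eh.e_phoff > len ||
        (len - eh.e_phoff) / sizeof(Elf64_Phdr) < eh.e_phnum)
        return -1;

    for (size_t i = 0; i < eh.e_phnum; i++) {
        Elf64_Phdr ph;

        memcpy(&ph, image + eh.e_phoff + i * sizeof(ph), sizeof(ph));
        switch (ph.p_type) {
        case PT_LOAD:
            out->load_segments++;
            break;
        case PT_INTERP:
            out->has_interp = 1;
            break;
        case PT_GNU_STACK:
            /* The single bit of interest is the execute permission. */
            out->exec_stack = (ph.p_flags & PF_X) != 0;
            break;
        default:
            break;  /* every other entry type is ignored here */
        }
    }
    return 0;
}
```

Copying each entry out with memcpy() avoids any alignment assumptions about the buffer that holds the image.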

(This article only focuses on what’s needed to load an ELF program, rather than exploring all of the details of
the format. The interested reader can find much more information via the references linked from
Wikipedia’s ELF article or
by exploring real binaries with the
objdump tool.)

Processing ELF binaries

Loading an ELF binary is handled by the load_elf_binary() function, which starts by examining the ELF
header to check that the file in question does indeed look like a supported ELF format.
The handler needs the whole of the ELF program header table, which may or
may not lie entirely within the first 128 bytes read into buf in
linux_binprm,
so it reads the table into some scratch space.
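The kind of sanity check involved can be approximated in user space. The following is a simplified sketch, assuming a 64-bit x86_64 target; the function names are hypothetical, and the kernel's own checks differ in detail.

```c
#include <elf.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Returns 1 if buf starts with what looks like a 64-bit x86_64 ELF
 * program (ET_EXEC or ET_DYN), 0 otherwise. A simplified approximation.
 */
int looks_like_elf64_program(const unsigned char *buf, size_t len)
{
    Elf64_Ehdr eh;

    if (len < sizeof(eh))
        return 0;
    memcpy(&eh, buf, sizeof(eh));
    if (memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0)
        return 0;                       /* missing "\177ELF" magic */
    if (eh.e_ident[EI_CLASS] != ELFCLASS64)
        return 0;                       /* not a 64-bit file */
    if (eh.e_type != ET_EXEC && eh.e_type != ET_DYN)
        return 0;                       /* e.g. a relocatable object */
    if (eh.e_machine != EM_X86_64)
        return 0;                       /* wrong architecture */
    if (eh.e_phentsize != sizeof(Elf64_Phdr))
        return 0;                       /* unexpected entry size */
    return 1;
}

/*
 * Returns 1 if the whole program header table lies within the first
 * prefix_len bytes of the file (128 in the case described above).
 */
int phdrs_within_prefix(const Elf64_Ehdr *eh, size_t prefix_len)
{
    uint64_t table_bytes = (uint64_t)eh->e_phnum * eh->e_phentsize;

    return eh->e_phoff <= prefix_len &&
           table_bytes <= prefix_len - eh->e_phoff;
}
```

With a 64-byte ELF header and 56-byte entries, only a single-entry table fits in a 128-byte prefix, which is why a separate read is needed in practice.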

The code now loops over the program header entries, checking for an
interpreter (PT_INTERP) and whether the program’s stack should
be executable (from the PT_GNU_STACK entry). With this
preparation done, the code needs to initialize those attributes of the new program that are not inherited from the
old program;
the Single
UNIX Specification version 3 (SUSv3) exec
specification describes most of the required behavior
(and table 28-4 of The Linux Programming Interface
gives an excellent summary of the attributes involved).

The process of setting up the new program starts with a call to
flush_old_exec(), which clears up state in the kernel that refers to the
previous program. Any other threads of the old program are killed so the new program
starts with a single thread, and the signal-handling information for the process
is unshared so that it can be safely altered later.
Any pending POSIX timers for the old program are cleared, and the location of
the executable file for the program (visible
at /proc/pid/exe) is updated.
The virtual memory mappings for the old program are released, which also
kills any pending asynchronous I/O operations
and frees any uprobes.
Finally, the personality of the process
is updated to remove any features that could affect security, as previously recorded
in the per_clear field in linux_binprm.
The main handler code also calls the SET_PERSONALITY() macro to
set the thread flags appropriately for a new 64-bit
program.

A corresponding call to setup_new_exec() now sets up
the kernel’s internal state for the new program. This function starts by determining
whether the new program can generate a core dump (or have ptrace() attach to it); this
is disabled by default for setuid or setgid programs. Dumping is
also disabled when the program file isn’t readable under the
current credentials.
A call to __set_task_comm() sets the current
task’s comm field to the basename of the originally invoked filename; this value is used as
a thread name, and is accessible to user space via the PR_GET_NAME and PR_SET_NAME
prctl() operations. A call
to flush_signal_handlers() sets up the signal handlers for the
new program; any signal handler that’s not SIG_IGN gets set to the default SIG_DFL value (so any
ignored signals are inherited by the new program). Finally, a call
to do_close_on_exec() closes all of the old program’s file descriptors
that have the close-on-exec flag set (FD_CLOEXEC, typically requested with O_CLOEXEC when the descriptor is opened); other file descriptors will be inherited by the new program.
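Two of these pieces of state can be inspected from user space before calling execve(). The sketch below uses the real prctl() and fcntl() interfaces; the helper names read_comm() and fd_survives_exec() are made up for this example.

```c
#include <fcntl.h>
#include <sys/prctl.h>

/*
 * Copy the calling thread's comm value into out. The kernel stores at
 * most 16 bytes including the terminating NUL. Returns 0 on success.
 */
int read_comm(char out[16])
{
    return prctl(PR_GET_NAME, (unsigned long)out, 0, 0, 0);
}

/*
 * Returns 1 if fd would stay open across execve(), 0 if its
 * close-on-exec flag is set, and -1 if fd is not a valid descriptor.
 */
int fd_survives_exec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0)
        return -1;
    return (flags & FD_CLOEXEC) ? 0 : 1;
}
```

A descriptor opened without O_CLOEXEC reports 1 here until FD_CLOEXEC is set on it with fcntl(fd, F_SETFD, FD_CLOEXEC).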

The virtual memory for the new program also needs to be set up. To improve security (by helping protect against
stack
overflow attacks), the highest address
for the stack is typically moved downward by a random offset.
An initial call
to setup_arg_pages() then sets up the kernel’s memory tracking
structures, and adjusts for the new location of the stack.
The code loops through all of the PT_LOAD segments in the program file and
maps
them into the process’s address space, setting up the new program’s
memory layout. It then sets up zero-filled pages that
correspond to the program’s BSS segment. Also, additional special pages — such as
the virtual dynamic shared object (vDSO) pages — need to be mapped, which is taken care of by
a call
to arch_setup_additional_pages().
An empty page may
also be mapped
at the zero address in the program’s address space for
backward-compatibility reasons (old SVr4
programs apparently assume that reading from a NULL pointer returns zeros rather
than raising SIGSEGV).
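The arithmetic behind mapping one PT_LOAD segment and its BSS tail can be sketched as follows. This is an illustration only: the segment_layout structure and layout_for_segment() are hypothetical names, the page size is assumed to be a power of two, and the kernel's real mapping code handles more cases (such as a load bias for position-independent executables).

```c
#include <elf.h>
#include <stdint.h>

/* Hypothetical result type, invented for this illustration. */
struct segment_layout {
    uint64_t map_start;  /* page-aligned address where the mapping starts */
    uint64_t file_end;   /* end of file-backed bytes: p_vaddr + p_filesz */
    uint64_t mem_end;    /* end of the segment in memory: p_vaddr + p_memsz */
    uint64_t bss_bytes;  /* zero-filled tail: p_memsz - p_filesz */
};

/*
 * Work out where a PT_LOAD segment lands in memory.
 * Returns 0 on success, -1 if ph is not a sane PT_LOAD entry.
 */
int layout_for_segment(const Elf64_Phdr *ph, uint64_t page_size,
                       struct segment_layout *out)
{
    if (ph->p_type != PT_LOAD || ph->p_memsz < ph->p_filesz)
        return -1;
    out->map_start = ph->p_vaddr & ~(page_size - 1);
    out->file_end = ph->p_vaddr + ph->p_filesz;
    out->mem_end = ph->p_vaddr + ph->p_memsz;
    out->bss_bytes = ph->p_memsz - ph->p_filesz;
    return 0;
}
```

The bytes between file_end and mem_end are the ones that must read as zero, which is why only their length needs to be recorded in the file.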

Next, the credentials for the new program are set up via a call
to install_exec_creds().
This function lets any active Linux Security Module (LSM) know about the
change in credentials (through the
bprm_committing_creds and
bprm_committed_creds LSM hooks), and the inner
commit_creds() function performs the assignment.

The final preparation for running the new program is to set up the rest of its stack (in its new randomized
location), by calling the create_elf_tables() function; this is described in a separate section below.

All of the preparation has now been done, and the new program can be launched.
An earlier article explained how the kernel’s system_call
entry point pushes the user-space CPU registers to the kernel stack before entering the main kernel code, and these
registers are correspondingly restored when the system call completes. The area of the stack that holds the saved
registers is cast to
a pt_regs structure, and the saved user-space CPU registers can thus be overwritten with
suitable values (zeroes) for the start of the new program. The call to the
start_thread() function also sets the saved
instruction pointer to the entry point of the program (or the dynamic linker), and the saved stack pointer to the
current top of the stack (from the p field in linux_binprm). The zero return code from the
handler indicates success, and the execve() syscall returns to user space — but to a completely
different user space, where the process’s memory has been remapped, and the restored registers have values that
start the execution of the new program.
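The effect of that final step can be modeled with a toy structure. The saved_user_regs type below is a hypothetical stand-in, not the kernel's pt_regs (which has more fields and an architecture-specific layout), and model_start_thread() only mimics the behavior described above.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the saved user-space registers. */
struct saved_user_regs {
    uint64_t general[15];  /* general-purpose registers */
    uint64_t ip;           /* saved instruction pointer */
    uint64_t sp;           /* saved stack pointer */
};

/*
 * Model of the step described above: zero every saved register, then
 * point the instruction pointer at the entry point and the stack
 * pointer at the current top of the new program's stack.
 */
void model_start_thread(struct saved_user_regs *regs,
                        uint64_t entry, uint64_t stack_top)
{
    memset(regs, 0, sizeof(*regs));
    regs->ip = entry;
    regs->sp = stack_top;
}
```

When the system call "returns", restoring these saved values is what transfers control to the new program.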

Populating the stack: the auxiliary vector, environment and arguments

The create_elf_tables() function adds more information to the
new program’s stack, below the argument and environment information added by the generic code, as two distinct
chunks. An initial call to arch_align_stack()
rounds down the existing stack position to a 16-byte
boundary, and may also further randomize the stack position downward slightly.
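The rounding step is a simple mask operation. In this sketch the caller supplies the extra random offset; how the kernel chooses that amount is not modeled here, and the function name is invented for the example.

```c
#include <stdint.h>

/*
 * Move the stack position down by random_offset bytes, then round the
 * result down to a 16-byte boundary by clearing the low four bits.
 */
uint64_t align_stack_down(uint64_t sp, uint64_t random_offset)
{
    return (sp - random_offset) & ~(uint64_t)0xf;
}
```

Because the stack grows downward, rounding down never overlaps data that has already been pushed at higher addresses.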

The first collection of information forms the ELF auxiliary vector, a collection of (id, value) pairs that
describe useful information about the program being run and the environment it is running in, communicated
from the kernel to user space. To build this vector, the handler code first needs to push onto the stack
any information that doesn’t fit within a 64-bit value; for x86_64 this is a
platform capability description (the string "x86_64") and 16 bytes of random data
(to help seed user-space random number generators).

Next, the code assembles the (id, value) pairs for the auxiliary
vector in the saved_auxv space within
the mm_struct. An LWN article from Michael Kerrisk describes
the contents of this vector, so here we just mention a few interesting entries:

  • The (architecture-specific) first entry in the vector is the AT_SYSINFO_EHDR value for x86_64; this indicates
    the location of the vDSO page, as referenced in an earlier article.
  • The AT_PLATFORM value is the location of the "x86_64" platform capability description pushed
    earlier.
  • The AT_RANDOM value is the location of the random data pushed earlier.
  • The AT_EXECFN value holds the location of the program filename that was pushed as the very first
    thing on the stack (and whose location was stored in the exec field of linux_binprm), above
    the arguments and environment values.
  • The AT_ENTRY value holds the entry point for the text segment, i.e. where program execution should
    start.
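A running program can read the entries listed above with glibc's getauxval() function, which returns 0 when an entry is absent. The wrapper names in this sketch are invented for the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/auxv.h>

/* Platform string (such as "x86_64"), or NULL if the entry is absent. */
const char *auxv_platform(void)
{
    return (const char *)(uintptr_t)getauxval(AT_PLATFORM);
}

/* Pointer to the 16 random bytes supplied by the kernel, or NULL. */
const unsigned char *auxv_random_bytes(void)
{
    return (const unsigned char *)(uintptr_t)getauxval(AT_RANDOM);
}

/* Filename that was passed to execve(), or NULL. */
const char *auxv_execfn(void)
{
    return (const char *)(uintptr_t)getauxval(AT_EXECFN);
}

/* Entry point of the program itself, or 0 if absent. */
unsigned long auxv_entry(void)
{
    return getauxval(AT_ENTRY);
}
```

On glibc systems, running any dynamically linked program with the LD_SHOW_AUXV=1 environment variable set prints the whole vector without writing any code.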

Once this auxiliary vector is created, the code now assembles the rest of the new program’s stack. The required
space is calculated, and then the entries are inserted from low addresses
to higher ones:

  • The argc argument count is inserted first.
  • An array of argument pointers is inserted next, ending with a
    NULL pointer. This is where main()’s argv will eventually point.
  • An array of environment pointers is inserted next, ending with
    a NULL pointer. This is
    where environ will point.
  • The auxiliary vector is put at the highest address, just below the
    additional values it references.

Taken together, the top of the new program’s address space will have contents like the following example
(this page has a similar example):

    ------------------------------------------------------------- 0x7fff6c845000
     0x7fff6c844ff8: 0x0000000000000000
            _  4fec: './stackdump\0'                      <------+
      env  /   4fe2: 'ENVVAR2=2\0'                               |    <----+
           \_  4fd8: 'ENVVAR1=1\0'                               |   <---+ |
           /   4fd4: 'two\0'                                     |       | |     <----+
     args |    4fd0: 'one\0'                                     |       | |    <---+ |
           \_  4fcb: 'zero\0'                                    |       | |   <--+ | |
               3020: random gap padded to 16B boundary           |       | |      | | |
    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -|       | |      | | |
               3019: 'x86_64\0'                        <-+       |       | |      | | |
     auxv      3009: random data: ed99b6...2adcc7        | <-+   |       | |      | | |
     data      3000: zero padding to align stack         |   |   |       | |      | | |
    . . . . . . . . . . . . . . . . . . . . . . . . . . .|. .|. .|       | |      | | |
               2ff0: AT_NULL(0)=0                        |   |   |       | |      | | |
               2fe0: AT_PLATFORM(15)=0x7fff6c843019    --+   |   |       | |      | | |
               2fd0: AT_EXECFN(31)=0x7fff6c844fec      ------|---+       | |      | | |
               2fc0: AT_RANDOM(25)=0x7fff6c843009      ------+           | |      | | |
      ELF      2fb0: AT_SECURE(23)=0                                     | |      | | |
    auxiliary  2fa0: AT_EGID(14)=1000                                    | |      | | |
     vector:   2f90: AT_GID(13)=1000                                     | |      | | |
    (id,val)   2f80: AT_EUID(12)=1000                                    | |      | | |
      pairs    2f70: AT_UID(11)=1000                                     | |      | | |
               2f60: AT_ENTRY(9)=0x4010c0                                | |      | | |
               2f50: AT_FLAGS(8)=0                                       | |      | | |
               2f40: AT_BASE(7)=0x7ff6c1122000                           | |      | | |
               2f30: AT_PHNUM(5)=9                                       | |      | | |
               2f20: AT_PHENT(4)=56                                      | |      | | |
               2f10: AT_PHDR(3)=0x400040                                 | |      | | |
               2f00: AT_CLKTCK(17)=100                                   | |      | | |
               2ef0: AT_PAGESZ(6)=4096                                   | |      | | |
               2ee0: AT_HWCAP(16)=0xbfebfbff                             | |      | | |
               2ed0: AT_SYSINFO_EHDR(33)=0x7fff6c86b000                  | |      | | |
    . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .        | |      | | |
               2ec8: environ[2]=(nil)                                    | |      | | |
               2ec0: environ[1]=0x7fff6c844fe2         ------------------|-+      | | |
               2eb8: environ[0]=0x7fff6c844fd8         ------------------+        | | |
               2eb0: argv[3]=(nil)                                                | | |
               2ea8: argv[2]=0x7fff6c844fd4            ---------------------------|-|-+
               2ea0: argv[1]=0x7fff6c844fd0            ---------------------------|-+
               2e98: argv[0]=0x7fff6c844fcb            ---------------------------+
     0x7fff6c842e90: argc=3

Note that although there are two randomizations in the stack layout (the position of the top of memory and the
size of the gap between the argument values and the auxiliary
vector), the newly running program can still figure
out where all of the information on the stack is. The SP register tells the program where the top of the stack is
(i.e. the lowest address), and the command-line arguments are arranged upwards in memory from there, with a NULL
pointer to mark where they end. The environment values are found next, again with a NULL pointer to terminate,
and the auxiliary vector is found at the next consecutive addresses, closing with an AT_NULL ID. The
values found within all of this information give the addresses of the argument strings, environment strings, and
auxiliary data values, so no explicit information about the size of the random gap is needed.
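That walk can be written down directly. The sketch below treats the initial stack as an array of machine words starting at the stack pointer; the initial_stack structure and both function names are hypothetical, and real startup code (normally in the C library) does this before main() runs.

```c
#include <elf.h>
#include <stddef.h>

/* Hypothetical view of the initial stack, invented for this example. */
struct initial_stack {
    unsigned long argc;
    const unsigned long *argv;  /* argc pointers, then a NULL */
    const unsigned long *envp;  /* environment pointers, then a NULL */
    const unsigned long *auxv;  /* (id, value) pairs, ending with AT_NULL */
};

/* Locate argv, envp and the auxiliary vector given only the stack pointer. */
void parse_initial_stack(const unsigned long *sp, struct initial_stack *out)
{
    const unsigned long *p;

    out->argc = sp[0];
    out->argv = sp + 1;
    for (p = out->argv; *p != 0; p++)
        ;                       /* skip the argument pointers */
    out->envp = p + 1;
    for (p = out->envp; *p != 0; p++)
        ;                       /* skip the environment pointers */
    out->auxv = p + 1;
}

/* Look up one auxiliary vector entry; returns 1 if found, else 0. */
int auxv_find(const unsigned long *auxv, unsigned long id,
              unsigned long *value)
{
    for (; auxv[0] != AT_NULL; auxv += 2) {
        if (auxv[0] == id) {
            *value = auxv[1];
            return 1;
        }
    }
    return 0;
}
```

Nothing in the walk depends on the absolute addresses, which is why neither randomization gets in the way.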

Dynamically linked programs

So far we’ve assumed the program being executed is statically linked and skipped over steps that would be
triggered by the presence of a PT_INTERP entry in the ELF program header. However, most programs are
dynamically linked, meaning that required shared
libraries have to be located and linked at run-time. This is performed by the runtime linker (typically
something
like /lib64/ld-linux-x86-64.so.2), and
the identity of this linker is specified by the PT_INTERP program header entry.

To cope with a runtime linker, the ELF handler first reads the ELF interpreter file name
into scratch space, then opens the interpreter’s executable file
with open_exec(). The first 128 bytes of that file are read into
the bprm->buf scratch area, replacing the contents of the original program file and
allowing access to the ELF header of the interpreter program — which
must therefore be an ELF binary itself, rather than any other format.
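Extracting the interpreter name from a binary is straightforward to try from user space. This is a minimal sketch assuming a 64-bit ELF file with the host's byte order; read_interp_path() is a hypothetical helper, not a kernel or libc function.

```c
#include <elf.h>
#include <stdio.h>
#include <string.h>

/*
 * Copy the PT_INTERP string of the 64-bit ELF file at path into out.
 * Returns 1 if an interpreter was found, 0 if the file has no
 * PT_INTERP entry (statically linked), and -1 on any error.
 */
int read_interp_path(const char *path, char *out, size_t out_len)
{
    FILE *f = fopen(path, "rb");
    Elf64_Ehdr eh;
    int result = -1;

    if (f == NULL)
        return -1;
    if (out_len == 0 || fread(&eh, sizeof(eh), 1, f) != 1)
        goto done;
    if (memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0 ||
        eh.e_ident[EI_CLASS] != ELFCLASS64 ||
        eh.e_phentsize != sizeof(Elf64_Phdr))
        goto done;

    result = 0;
    for (unsigned i = 0; i < eh.e_phnum; i++) {
        Elf64_Phdr ph;
        long off = (long)(eh.e_phoff + (Elf64_Off)i * sizeof(ph));

        if (fseek(f, off, SEEK_SET) != 0 ||
            fread(&ph, sizeof(ph), 1, f) != 1) {
            result = -1;
            break;
        }
        if (ph.p_type != PT_INTERP)
            continue;
        /* The stored name includes its terminating NUL byte. */
        if (ph.p_filesz < 2 || ph.p_filesz > out_len ||
            fseek(f, (long)ph.p_offset, SEEK_SET) != 0 ||
            fread(out, 1, ph.p_filesz, f) != ph.p_filesz ||
            out[ph.p_filesz - 1] != '\0') {
            result = -1;
            break;
        }
        result = 1;
        break;
    }
done:
    fclose(f);
    return result;
}
```

Called on a typical dynamically linked x86_64 binary, this would be expected to yield a path like the /lib64/ld-linux-x86-64.so.2 example mentioned above.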

After the program code has been loaded into memory as described previously,
the ELF handler also loads the ELF interpreter program into memory
with load_elf_interp(). This process is similar to the process of loading the
original program: the code checks the format information in the ELF header,
reads in the ELF program header, maps all of
the PT_LOAD segments from the file into the new program’s memory, and leaves room
for the interpreter’s BSS segment.

The execution start address for the program is also set
to be the entry point of the
interpreter, rather than that of the program itself. When the execve() system call completes, execution
then begins with the ELF interpreter, which takes care of satisfying the linkage requirements of the program from
user space — finding and loading the shared libraries that the program depends on, and resolving the
program’s undefined symbols to the correct definitions in those libraries. Once this linkage process is done (which
relies on a much deeper understanding of the ELF format than the kernel has), the interpreter
can start the execution of the new program itself, at
the address previously recorded in the AT_ENTRY auxiliary value.

Compatibility with other architectures

As described previously, a modern 64-bit (x86_64) Linux
system can also support running 32-bit binaries of two types: normal 32-bit binaries (x86_32),
and x32 ABI programs (which can make use of additional
x86_64 registers). So how does the kernel support these binaries?

The key file that provides support for these formats
is compat_binfmt_elf.c, which is included in the kernel when the
CONFIG_COMPAT_BINFMT_ELF config option is set. This file
didn’t appear in our earlier list of places that register binary handlers, because the file contains almost no
code of its own. Instead, it includes the main binfmt_elf.c ELF handler
code (using #include), and uses the preprocessor to redirect various internal functions and values to 32-bit compatibility
versions. Other than these changes, the format handler therefore behaves the same as the normal ELF handler
described above.

One set of changes uses 32-bit versions of the structures describing the layout of
the ELF file;
similarly, the appropriate constant values for 32-bit binaries are used, which ensures
that the compatibility handler only claims support for the relevant ELF binary types. In particular,
the elf_check_arch() call is replaced with a
compat_elf_check_arch() version
that checks for either x86_32 or (if configured) x32.
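The distinction between the three kinds of binary is visible in the ELF header itself: x32 programs use the 32-bit ELF class together with the x86_64 machine type. The classifier below is a simplified sketch with invented names; it is not the kernel's elf_check_arch() logic, and the x32_enabled parameter stands in for the kernel configuration choice.

```c
#include <elf.h>

/* Hypothetical labels for the ABIs discussed above. */
enum x86_abi { ABI_UNSUPPORTED, ABI_X86_64, ABI_IA32, ABI_X32 };

/*
 * Classify an x86 ELF file from its identification bytes and machine
 * type. Pass x32_enabled as nonzero if x32 support should be accepted.
 */
enum x86_abi classify_x86_elf(const unsigned char *e_ident,
                              unsigned int e_machine, int x32_enabled)
{
    unsigned char elf_class = e_ident[EI_CLASS];

    if (elf_class == ELFCLASS64 && e_machine == EM_X86_64)
        return ABI_X86_64;          /* native 64-bit program */
    if (elf_class == ELFCLASS32 && e_machine == EM_386)
        return ABI_IA32;            /* normal 32-bit program */
    if (elf_class == ELFCLASS32 && e_machine == EM_X86_64 && x32_enabled)
        return ABI_X32;             /* 32-bit pointers, x86_64 registers */
    return ABI_UNSUPPORTED;
}
```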

The preprocessor changes also redirect some of the inner functionality of the ELF handler code.
The invocation of the SET_PERSONALITY() macro is
redirected
to set_personality_ia32() so that
the relevant thread flags for the 32-bit architecture are set and,
similarly,
the arch_setup_additional_pages() function is replaced with
a version that sets up a 32-bit vDSO.
More significantly, the start_thread() function is
replaced with compat_start_thread(),
which maps
to start_thread_ia32(). This alters
the arguments to the inner start_thread_common() function so that the saved segment registers are
initialized differently than for x86_64 binaries (and
the ELF_PLAT_INIT() macro is also
adjusted to match).

Epilogue

Every program that runs on a Linux system passes through the portal of execve(); as such it’s a key
piece of kernel functionality that’s worth understanding in detail. Although the kernel natively supports scripts
and other machine-code formats, program execution on a modern Linux system eventually involves running an
ELF binary. ELF is a complicated format, but fortunately the kernel can ignore most of that complexity; it
needs to understand just enough ELF to load segments into memory, and to invoke a user-space run-time linker
program to finish the job of assembling a complete running program.



