✨ Discover this must-read post from Hacker News 📖
📂 Category:
💡 Here’s what you’ll learn:
Perhaps the greatest surprise of the last two years was, for me, the realization
that people not only care about compiling C to Rust (for obvious reasons, such
as, ahem, memory safety) – they also care about compiling Rust to C! Wait,
what?
I wrote about this briefly a couple years
ago, but the level of interest for the project, I must say, took me somewhat by
surprise. So let’s talk about compiling Rust to C a little more today.
Rust is making big progress in terms of adoption, and represents a great value
proposition, especially for new code. Both my former
employer and my new employer, like
pretty much everyone else these days, have big projects that are written in pure
Rust or can have Rust components. Even Windows kernel drivers can be written
in
Rust
now. Amazing stuff.
However, if your project is, say, an open-source library that gets compiled on a
wonderfully diverse set of target architectures, OSes, distributions and
toolchains, well, chances are… one of these is not going to support Rust. Think of a
crypto library: there will be people out there with an obscure compiler for a weird
embedded target, and they really want to compile your library, because they’ve
been told not to roll out their own crypto. Or perhaps you have a format library
ridden with memory errors and you want to port it to Rust. Or maybe your company
has an in-house analysis that only runs on C code. Regardless of the scenario,
there will always be that one legacy use-case that prevents you from switching
to Rust until it’s 2035, all those LTS versions (looking at you RHEL) are
finally retired, and you yourself are too close to retirement to even care
anymore.
That is, unless you’re willing to use a Rust to C compiler.
Having a backwards-compat scenario where Rust can be compiled to C serves
several purposes.
- It allows for a gradual transition. The codebase can be ported to Rust,
and refactored / cleaned up / rewritten to use all the nice Rust things (data
types, pattern-matching, polymorphism, memory safety), thus making you and
your developers much, much happier. Meanwhile, the C version co-exists so
that you don’t alienate your userbase. - It only requires maintaining a single version. The Rust code is
authoritative; the C code is derived from it automatically, either on CI, or
at least with a CI job that checks that the two are in sync. - It allows for a census of problematic scenarios. By making the Rust version
the default (and putting the fallback C behind a--write-us-an-emailflag),
there is finally a way to enumerate those mythical users who cannot switch to
Rust just yet.
If that sounds appealing, meet Eurydice.
Eurydice is a compiler from Rust to C that aims to produce readable C code. Of
course, readability is subjective; also, seeing that Rust relies on
whole-program monomorphization, the C code is bound to be more verbose than the
Rust code. But you can judge for yourself: here’s the result of compiling
libcrux to
C.
The output of the test suite is under version control, and there are a lot more
tests
to peruse. See for instance this
bit,
compared to the Rust
original.
Eurydice plugs in directly at the MIR level, using
Charon to avoid reimplementing the
wheel and paying the price of interacting with the guts of rustc. Our
paper on Charon says more about its
architecture.
The advantage of plugging in at the MIR level is that i) we do not have to
interpret syntactic sugar, which means our translation is more faithful to the
Rust semantics, and ii) we have way fewer constructs that need compiling to C. Even then,
it’s no easy feat to translate Rust to C.
There is naturally, the need to perform whole-program monomorphization, over
types and const-generic arguments; the compilation of pattern matches into
tagged unions; recognizing instances of iterators that can be compiled to native
C for-loops. Then, there are more subtle things, such as compiling array
repeat expressions sensibly – zero-initializers when possible, initializer
lists otherwise, unless it generates too much code, in which case for-loops are
preferable. And finally, there are all the rules about visibility, static,
inline, etc. that are very C-specific and depend on how you want to lay out
your C files.
The translation is complicated by the constraint that the generated code
ought to be readable: for instance, we compile Rust structs to
C structs, including
DSTs, by
relying on flexible array
members.
We also
work hard to avoid using the fully-generic tagged union pattern when possible,
instead eliminating the tag when e.g. the Rust enum only has a single case.
Additionally, we rely on Charon to reconstruct control-flow, rather than compile
the MIR CFG to C code ridden
with gotos; again, this is for code quality.
At a low-level, there were many interesting tidbits.
- Because arrays in Rust are values, we wrap them within C structs to give them
value semantics in C, too; concretely,[u32; 8]becomesstruct ⚡. (A previous version of Eurydice would emituint32_t *,
and rely on variousmemcpys to implement value semantics, but this produced
a translation that was not type-generic, and there were plenty of finicky
corner cases. We revamped the compilation scheme recently.) - The notion of
lvaluein C means we need to insert more variable declarations
than in Rust – for instance, you can’t trivially compile&[0u32; 1]without
naming the array. - The fact that the evaluation order is so loosely defined in C means that
intermediary computations need to be stored in intermediary variables to
enforce the evaluation order. - Rust relies on whole-program monomorphization; this means that the C code is
inevitably going to contains multiple copies of the same types and functions,
but for different choices of type and const generic argumnets. This is
currently done with a builtin phase in Eurydice (for historical reasons), but
in the long run, we want to rely on Charon’s support for monomorphization. - There are plenty of peephole optimizations that are required for good code
quality, such as recognizingarray::from_fnand generating sensible code
that initializes the array in-place (instead of relying on the fully-general
compilation scheme for closures), or recognizing instances of theEq
trait that deserve dedicated treatment (such as usingmemcmpfor arrays and
slices of flat data).
A final design choice is that for now, Eurydice may define more behaviors than
Rust – for instance, Rust panics on integer overflow, but Eurydice-compiled
code does not. This is because we assume the input code is verified, and
therefore has been shown to be free of panics. This design choice can be easily
changed, though.
In practice, as soon as you use traits, the C code becomes more voluminous than
the Rust code. We rely on a configuration file mechanism to control the
placement of monomorphized instances of a given function, rather than put
everything in one big C file. This currently requires a lot of manual
intervention to give good results on large projects.
Eurydice starts by compiling the MIR AST obtained out of Charon into
KaRaMeL’s internal AST. This is ~3000
lines of OCaml code, so that’s already pretty involved. A lot of the work
revolves around trait methods and their monomorphization, given Rust’s
expressive trait system.
Then, about 30 nanopasses simplify the KaRaMeL AST until it becomes eligible for
compilation to C. Of those, a handful were originally written for KaRaMeL and
were somewhat reusable; this includes compilation of data types, as well as
monomorphization. The rest was written from scratch for Eurydice, and totals
about ~5000 lines of OCaml code.
A particularly gnarly phase was eliminating MIR’s variable assignments as much
as possible: in MIR, every variable starts out uninitialized at the beginning of
the function; then, in lieu of the variable declaration, we have an assignment
with the initial value. Naturally, having a variable declaration in the right
spot is better for code quality, so an initial phase tries to reconstruct these
assignments. That’s a drawback of using MIR, but we still firmly believe that
sticking to something that has clear semantics is ultimately better.
Fun fact: because there are so many peephole optimizations, I got tired of
maintaining enormous
pattern-matches
that would try to catch every flavor of
Rust iterator that can be compiled to a C for-loop. Instead, a custom OCaml syntax
extension allows writing concrete
syntax
for the internal KaRaMeL language in OCaml patterns. Those magic patterns then get
compiled at compile-time to OCaml AST nodes for an actual OCaml pattern that
matches the (deeply-embedded) syntax of KaRaMeL’s AST. This relies on a ppx
that lexes, parses and compiles the concrete syntax.
Eurydice-generated code expects some hand-written glue that contains macros and
static inline functions; sometimes, it’s simply more convenient to write a
single macro that uses a type, rather than have Eurydice generate N copies of a
polymorphic function that gets specialized each time. A typical example is
compiling the Eq trait for arrays: it’s nicer to emit Eurydice_array_eq(a1, a2,, which macro-expands to
len, t)!(memcmp(a1, a2, len*sizeof(t))), rather than
have N such functions, each containing a for-loop specialized for different
values of t.
Eurydice generates code that is either (C11 and C++20-compatible) or (C++-17
compatible, but not C-compatible). The reason for this is that Rust allows enum
values (e.g. Foo Share your opinion below!) in any expression position. For simplicity,
Eurydice emits a compound initializer (Foo) Share your opinion below!, or a C++20 aggregate that uses designated initializers,
relying on a macro (not shown here) to hide the syntax differences between the
two. But C++17 does not have designated initializers, so there is an option for
Eurydice to emit different code that relies on member pointers to achieve
sensibly the same effect.
Naturally, there are many limitations to this approach. Here are the
main ones that come to mind:
- we cannot guarantee that the layout of objects will be the same in C as in
Rust; conceivably, one could parse the layout information from MIR, then emit
compiler-specific alignment directives to keep the two identical, but this is
not done currently; - the generated code violates strict
aliasing,
because creating a user-defined DST involves casting one pointer type (a
struct containing an array) to another (a struct with a flexible array
member instead); I’m not sure what the best fix is, so for now, please compile your
code with-fno-strict-aliasing; - the code that Eurydice sees is MIR after applying
cfgtweaks; this means
that for code that is intended to be multi-platform, some
tricks need to be applied,
otherwise, Eurydice will only “see” one version of the code (AVX2, or ARM64,
or something else) - because monorphization is so pervasive, the configuration language needs to
express things such as “types that reference__m256i, an AVX2-only type,
need to go into a separate file to be compiled with-mavx2”; this can get
tedious real
fast
but I’m not sure I know how to do better.
There is ongoing work to integrate Eurydice-generated code for both
Microsoft
and
Google’s
respective crypto libraries.
The community grew recently, with wonderful contributions by GitHub users
@ssyram and @lin23299. There are more in the pipeline, and I look forward to
seeing the supported subset of Rust grow even more. Next on the horizon is
support for dyn traits via vtables, and relying on Charon’s monomorphization
to get MIR exactly as the Rust compiler would monomorphize it, intead of relying
on a custom procedure in Eurydice.
An ambitious goal is for the whole standard library of Rust to be extractable
via Eurydice in 2026. This is non-trivial, but I believe this achievement is
within reach. Stay tuned.
People keep asking about the name; because the project shares a large amount of
infrastructure with Aeneas and
Charon, I had to follow the Greek
mythology theme. Specifically, the myth of
Eurydice resonated with me: I thought
I was saved from the hell of generating C code, and was going to go back to the world of the
living, but alas, no.
Tell us your thoughts in comments! {What do you think?|Share your opinion below!|Tell us your thoughts in comments!}
#️⃣ #Eurydice #Rust #compiler
🕒 Posted on 1765077220
