20 Years on AWS and Never Not My Job

I created my first AWS account at 10:31 PM on April 10th, 2006. I had
seen the announcement of Amazon S3 and had been thinking vaguely about
the problem of secure backups — even though I didn’t start
Tarsnap until several months
later — and the idea of an online storage service appealed to me.
The fact that it was a web service made it even more appealing; I had
been building web services since 1998, when I decided that coordinating
a world-record-setting
computation of Pi over HTTP would be easier than doing it over
email.

While I created my AWS account because I was interested in Amazon S3,
that was not in fact immediately available to me: In the early days of
AWS, you had to specifically ask for each new service to be enabled for
your account. My new AWS account did come with two services enabled by
default, though — Amazon Simple Queue Service, which most people
know as “the first AWS service”, and Amazon E-Commerce Service, an API
which allowed Amazon affiliates to access Amazon.com’s product catalogue
— which was the real first AWS service, but which most
people have never heard of and which has been quietly scrubbed from AWS
history.

It didn’t take long before I started complaining about things. By this
point I was the FreeBSD Security Officer, so my first interest with
anything in the cloud was security. AWS requests are signed with API keys
providing both authentication and integrity protection — confirming
not only that the user was authorized, but also that the request hadn’t
been tampered with. There is, however,
no corresponding signature on AWS responses — and at this
time it was still very common to make AWS requests over HTTP rather than
HTTPS, so the possibility of response tampering was very real.
I don’t recall if anyone from Amazon showed any interest when I posted
about this on the (long-disappeared) AWS Developer Forums, but I still
think it would be a good thing to have: With requests going over TLS it
is obviously less critical now, but end-to-end signing is always going
to be better than transport-layer security.
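The asymmetry can be illustrated with a minimal sketch (an illustrative HMAC scheme, not AWS's actual wire format; the key and request strings here are made up):

```python
import hashlib
import hmac

SECRET_KEY = b"example-secret-access-key"  # made-up credential

def sign_request(canonical_request: str) -> str:
    # The client authenticates and integrity-protects the request by
    # sending an HMAC computed with its secret key.
    return hmac.new(SECRET_KEY, canonical_request.encode(),
                    hashlib.sha256).hexdigest()

def verify_request(canonical_request: str, signature: str) -> bool:
    # The server recomputes the HMAC and compares in constant time.
    return hmac.compare_digest(sign_request(canonical_request), signature)

req = "GET /bucket/object"
sig = sign_request(req)
assert verify_request(req, sig)                  # tampered requests are caught
assert not verify_request("GET /bucket/evil", sig)
# ...but nothing analogous protects the response, so over plain HTTP
# a man-in-the-middle could rewrite the reply undetected.
```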

Of course, as soon as Amazon EC2 launched I had a new target: I wanted
to run FreeBSD on it! I reached out to Jeff Barr via his blog and he
put me in touch with people inside Amazon, and in early 2007 I had my
first Amazon NDA. (Funny story: in 2007 Amazon was still using fax
machines — but I didn’t have a fax machine, so my first briefing
was delayed while I snail-mailed a wet-ink signature down to Seattle.)
Among the features I was briefed on was “Custom Kernels”; much like how
AWS Lambda works today, Amazon EC2 launched without any “bring your own
kernel” support. Obviously, to bring FreeBSD support to EC2 I was
going to need to use this functionality, and it launched in November
2007 when Amazon EC2 gained the ability to run Red Hat; soon after that
announcement went out, my FreeBSD account was allowlisted for the
internal “publish Amazon Kernel Images” API.

But I didn’t wait for this functionality to be offered before providing
more feedback about Amazon EC2. In March 2007 I expressed concerns to
an Amazonian about the security of Xen — it was at the time still
quite a new system and Amazon was the first to be deploying it in truly
hostile environments — and encouraged them to hire someone to do
a thorough security audit of the code. When the Amazonian I was speaking
to admitted that they didn’t know who to engage for this, I thought about
the people I had worked with in my time as FreeBSD Security Officer and
recommended Tavis Ormandy to them. Later that year, Tavis was credited
with reporting two vulnerabilities in Xen (CVE-2007-1320 and
CVE-2007-1321); whether there is any connection between those events, I
do not know.

I also mentioned — in fact in one of Jeff Barr’s AWS user meetups
in Second Life — that I wanted a way for an EC2 instance to be
launched with a read-only root disk and a guaranteed state wipe of all
memory on reboot, in order to allow an instance to be “reset”
into a known-good state; my intended use case for this was building
FreeBSD packages, which inherently involves running untrusted (or at
least not-very-trusted) code. The initial response from Amazonians was
a bit confused (why not just mount the filesystem read-only?), but when I
explained that my concern was about defending against attackers who had
local kernel exploits, they understood the use case. I was very excited
when EC2 Instance Attestation launched 18 years later.

I ended 2007 with a blog post which I was told was quite widely read
within Amazon: Amazon,
Web Services, and Sesame Street. In that post, I complained about
the problem of Eventual Consistency and argued for a marginally stronger
model: Eventually Known Consistency, which still takes the “A” route out
of the CAP theorem, but exposes enough internal state that users can
also get “C” in the happy path. Amazon S3 eventually flipped from being
optimized for Availability to being optimized for Consistency (while
still having extremely high Availability), and of course DynamoDB is
famous for giving users the choice between Eventual or Strongly
consistent reads; but I still think the model of Eventually Known
Consistency is the better theoretical model even if it is harder for
users to reason about.

In early 2008, Kip Macy got FreeBSD working on Xen with PAE — while
FreeBSD was one of the first operating systems to run on Xen, it didn’t
support PAE and I was at the time not competent to write such low-level
kernel code, so despite being the driving force behind FreeBSD/EC2
efforts I had to rely on more experienced developers to write the
kernel code at the time. I was perfectly comfortable with userland
code though — so when Amazon sent me internal “AMI tools” code
(necessary for using non-public APIs), I spent a couple of weeks porting
it to run on FreeBSD. Protip: While I’m generally a tools-not-policy
guy, if you find yourself writing Ruby scripts which construct and run
bash scripts, you might want to reconsider your choice of languages.

Unfortunately even once I got FreeBSD packaged up into an AKI (Amazon
Kernel Image) and AMI (Amazon Machine Image)
it wouldn’t boot in EC2; after exchanging dozens of emails with Cape
Town, we determined that this was due to EC2 using Xen 3.0, which had
a bug preventing it from supporting recursive page tables — a
cute optimization that FreeBSD’s VM code used. The problem was fixed
in Xen 3.1, but Xen didn’t have stable ABIs at that point, so upgrading
EC2 to run on Xen 3.1 would have broken existing AMIs; while it was
unfortunate for FreeBSD, Amazon made the obvious choice here by
sticking with Xen 3.0 in order to support existing customers.

In March 2008, I received one of those emails which only really seems
notable in hindsight:

Hi Colin,

This is Matt Garman from the EC2 team at Amazon.  [...]

Matt was inviting me to join the private Alpha of “Elastic Block Storage”
(now generally known as “Elastic Block Store” — I’m not sure if
Matt got the name wrong or if the name changed). While I was excited
about the new functionality, as I explained to Matt the best time to
talk to me about a new service is before building it.
I come from a background of mathematics and theory; I can provide far
more useful feedback on a design document than through alpha-test access.

By April 2008 I had Tarsnap in private beta and I was working on its
accounting code — using Amazon SimpleDB as a storage back-end
to record usage and account balances. This of course meant that I
had to read the API documentation and write code for signing SimpleDB
requests — back then it was necessary, but I still write my own
AWS interface code rather than using any of their SDKs — and a
detail of the signing scheme caught my eye: The canonicalization
scheme had collisions. I didn’t have any contacts on the SimpleDB
team — and Amazon did not at the time have any “report security
issues here” contacts — so on May 1st I sent an email to Jeff
Barr starting with the line “Could you forward this onto someone from
the SimpleDB team?”
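The collision is easy to demonstrate. Under signature version 1, parameters were sorted case-insensitively by key and then concatenated, keys and values alike, with no delimiters; the sketch below (hypothetical key and parameter names) shows two different requests producing the same string-to-sign and thus the same signature:

```python
import hashlib
import hmac

def canonicalize_v1(params: dict) -> str:
    # SigV1: sort parameters case-insensitively by key, then concatenate
    # each key and value with NO delimiters between them.
    return "".join(k + v for k, v in
                   sorted(params.items(), key=lambda kv: kv[0].lower()))

def sign_v1(secret: bytes, params: dict) -> str:
    return hmac.new(secret, canonicalize_v1(params).encode(),
                    hashlib.sha1).hexdigest()

secret = b"example-secret"  # made-up key
a = {"Action": "PutAttributes", "ItemName": "foobar"}
b = {"Action": "PutAttributes", "ItemNamef": "oobar"}  # a different request...
assert canonicalize_v1(a) == canonicalize_v1(b)        # ...same string-to-sign
assert sign_v1(secret, a) == sign_v1(secret, b)        # ...same signature
```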

While the issue wasn’t fixed until December, Amazon did a good job of
handling this — and stayed in contact with me throughout. They
asked me to review their proposed “signature version 2” scheme; fixed
their documentation when I pointed out an ambiguity; corrected what I
euphemistically referred to as a “very weird design decision”; and
allowlisted my account so I could test my code (which I had written
against their documentation) against their API back-end. (I wrote more
about this in my blog post
AWS
signature version 1 is insecure.)

In June 2008 I noticed that NextToken values — returned by
SimpleDB when a query returns too many results and then passed back
to SimpleDB to get more results — were simply base64-encoded
serialized Java objects. This was inherently poor security hygiene:
Cookies like that should be encrypted (to avoid leaking internal
details) and signed (to protect against tampering). I didn’t know
how robust Amazon’s Java object deserializer was, but this
seemed like something which could be a problem (and should have been
fixed regardless, as a poor design decision even if not exploitable),
so I reported it to one of the people I was now in contact with on
the SimpleDB team… and heard nothing back. Six months later, when
a (perhaps more security-minded) engineer I had been working with on
the signing issue said “let me know if you find more security problems;
since we don’t yet have a security response page up, just email me”
I re-reported the same issue and he wrote it up internally. (Even
after this I still never received any response, mind you.)
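A minimal sketch of what a well-behaved opaque token might look like (hypothetical key and state; only the signing half is shown here, since a stdlib-only example cannot do authenticated encryption, which in production would be something like AES-GCM over the payload as well):

```python
import base64
import hashlib
import hmac
import json

KEY = b"server-side-token-key"  # made-up key; never leaves the server

def make_token(state: dict) -> str:
    # Serialize the continuation state and append an HMAC tag so the
    # server can detect tampering when the token comes back.
    payload = json.dumps(state, sort_keys=True).encode()
    tag = hmac.new(KEY, payload, hashlib.sha256).digest()
    return base64.urlsafe_b64encode(payload + tag).decode()

def parse_token(token: str) -> dict:
    raw = base64.urlsafe_b64decode(token.encode())
    payload, tag = raw[:-32], raw[-32:]
    if not hmac.compare_digest(
            hmac.new(KEY, payload, hashlib.sha256).digest(), tag):
        raise ValueError("NextToken failed authentication")
    return json.loads(payload)

tok = make_token({"offset": 250, "query_id": "q-123"})
assert parse_token(tok) == {"offset": 250, "query_id": "q-123"}
```

Crucially, the client sees only an opaque blob: nothing about the server's internals leaks out, and nothing fed back in reaches a deserializer without being authenticated first.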

Later in 2008, after Tarsnap was in public beta (but before
it had much traction) — and after considerable prompting from
Jeff Barr — I considered the possibility of working for Amazon.
I had a phone interview with Al Vermeulen and slightly too late learned
an important lesson: In a 45 minute interview, spending 30 minutes
debating the merits of exceptions with an author of The Elements of
Java Style is probably not the best use of time. I still firmly
believe that I was correct — exceptions are an inherently poor
way of handling errors because they make it easier to write bugs which
won’t be immediately obvious on casual code inspection — but I
also know that
it isn’t necessary to correct everyone
who is wrong.

Finally in November 2008, I drove down to Seattle for an AWS Start-up
Tour event and met Amazonians in person for the first time; for me,
the highlight of the trip was meeting the engineer I had been working
with on the request signing vulnerability. We had a lengthy discussion
about security, and in particular my desire for constrained AWS access
keys: I was concerned about keys granting access to an entire account
and the exposure it would create if they were leaked. I argued for
cryptographically derived keys (e.g. hashing the master secret with
“service=SimpleDB” to get a SimpleDB-only access key) while he
preferred a ruleset-based design, which was more flexible but concerned
me on grounds of complexity. Ultimately, I was entirely unsurprised
when I was invited to join a private beta of IAM in January 2010
— and also somewhat amused when SigV4 launched in 2012 using
derived keys.
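The derivation SigV4 shipped chains HMACs to scope a key by date, region, and service; the sketch below uses a made-up secret, but the chain itself follows the published SigV4 algorithm:

```python
import hashlib
import hmac

def hmac_sha256(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode(), hashlib.sha256).digest()

def sigv4_signing_key(secret: str, date: str,
                      region: str, service: str) -> bytes:
    # Each step narrows the key's scope; leaking the final key exposes
    # only one service, in one region, on one day -- not the account.
    k_date = hmac_sha256(("AWS4" + secret).encode(), date)
    k_region = hmac_sha256(k_date, region)
    k_service = hmac_sha256(k_region, service)
    return hmac_sha256(k_service, "aws4_request")

key = sigv4_signing_key("example-secret", "20120215", "us-east-1", "iam")
```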

For most of 2009 I was busy with growing Tarsnap. The EC2 team set up
some Xen 3.1 hosts for testing and by mid-January I was able to launch
and SSH into FreeBSD; but since EC2 had no concrete plans to upgrade
away from Xen 3.0, the FreeBSD/EC2 project as a whole was still
blocked. I did however notice and report a problem with the EC2
firewall: The default ruleset blocked ICMP, including Destination
Unreachable (Fragmentation Required) messages — thereby breaking
Path MTU Discovery. In December 2009 a manager in EC2 agreed with my
proposed solution (adding a rule to the default ruleset) and wrote
“I’ll let you know as soon as I have an implementation plan in place
and am confident it will happen soon”. This was ultimately fixed in
2012, soon after I
raised the issue publicly.

By the start of 2010, with EC2 still stuck on an ancient version of
Xen, I was starting to despair of ever getting FreeBSD running, so
I turned to the next best option: NetBSD, which famously runs on
anything. It only took me a week — and a few round trip emails
to Cape Town to ask for console logs — to create a NetBSD AMI
which could boot, mount its root filesystem, configure the network,
and launch sshd. While Amazon was a bit wary about me announcing this
publicly — they quite reasonably didn’t want me to say anything
which could be construed as making a promise on their behalf —
they agreed that I could discuss the work with developers outside the
NDA, and the NetBSD team were excited to hear about the progress…
although a bit confused as to why Amazon was still using
paravirtualized Xen rather than HVM.

The lack of HVM continued to be a sore point — especially as I
knew EC2 provided Xen/HVM for Windows instances — but in July
2010 Amazon launched “Cluster Compute” instances which supported HVM
even for “Linux” images. I wasn’t able to boot FreeBSD on these
immediately — while HVM solved the page table problem, there
were still driver issues to address — but this gave me some hope
for progress, so when Matt Garman mentioned they were “thinking about”
making HVM more broadly available I immediately wrote back to encourage
such thoughts; by this point it was clear that PV was a technological
dead end, and I didn’t want Amazon to be stuck on the wrong technology
for any longer than necessary.

The first real breakthrough however came with the launch of the new
t1.micro instance type in September. While it wasn’t
publicly announced at the time, this new instance family ran on
Xen 3.4.2 — which lacked the bug which made it impossible to run
FreeBSD. By mid-November I was able to SSH into a FreeBSD/EC2 t1.micro
instance, and on December 13, 2010,
I announced that FreeBSD was
now available for EC2 t1.micro instances.

Once I’d gotten that far, things suddenly got easier. Amazon now had
customers using FreeBSD — and they wanted more FreeBSD. A
Solutions Architect put me in touch with a FreeBSD user who wanted
support for larger instances, and they paid me for the time it
took to get
FreeBSD working on Cluster Compute instances; then it was pointed
out to me that EC2 didn’t really know which OS we were
running, and I proceeded to make FreeBSD available on all 64-bit
instance types via
defenestration.
Obviously this meant paying the “windows tax” to run FreeBSD —
which Amazon was not very happy about! — but even with the added
cost it filled an essential customer need. (This hack finally ceased
to be necessary in July 2014, when T2 filled out the stable of instance
types which supported running “Linux” on HVM.)

2012 was an exciting year. In April, I had the classic greybeard
experience of debugging a network fault; I found that a significant
proportion of my S3 requests to a particular endpoint were failing
with peculiar errors, including SignatureDoesNotMatch failures. These
error responses from Amazon S3 helpfully contained the StringToSign,
and I could see that these did not match what I was sending to S3.
I had enough failing samples to identify the corruption as a “stuck bit”; so I
pulled out traceroute — this was pre-SRD so my packets were
traversing a consistent path across the datacenter — and then
proceeded to send a few million pings to each host along the path.
The Amazonians on the AWS Developer Forums were somewhat bemused when
I posted to report that a specific router had a hardware failure…
and even more surprised when they were able to confirm the failure
and replace the faulty hardware a few days later.
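A sketch of how a stuck bit announces itself (hypothetical samples; the real StringToSign values are long gone): XOR each string you sent against what the server reports receiving, and look for a bit position that is flipped in every corrupted sample.

```python
def flipped_bits(sent: bytes, received: bytes) -> set[int]:
    # Which bit positions (0-7, within a byte) differ between what we
    # sent and what the server echoed back in its error response?
    return {bit
            for a, b in zip(sent, received)
            for bit in range(8)
            if (a ^ b) & (1 << bit)}

# Made-up corrupted samples: each has bit 5 of one byte flipped
# (in ASCII, flipping bit 5 toggles a letter's case).
samples = [(b"AWSAccessKeyId", b"AWSAccessKeyID"),
           (b"SignatureValue", b"Signaturevalue")]
common = set.intersection(*(flipped_bits(s, r) for s, r in samples))
# A single bit position common to many samples points at hardware.
assert common == {5}
```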

The highlight of 2012 however was the first re:Invent — which was
short of technical content and had a horrible t-shirt-to-suit ratio, but
did give me the opportunity to talk to a number of Amazonians face to
face. On one memorable occasion, after attending an Intel talk about
“virtual machine security” (delivered by a VP who, in response to my
questioning, professed to have no knowledge of “side channel attacks”
or how they could affect virtual machines) I turned up at the EC2 booth
in the expo hall to rant… and by complete accident ended up talking
to a Principal engineer. I talked about
my work
exploiting HyperThreading to steal RSA keys, and explained that,
while the precise exploit I’d found had been patched, I was absolutely
certain there were many more ways that information could leak between
two threads sharing a core. I ended with a strong recommendation:
Based on my expertise in the field I would never run two EC2 instances
in parallel on two threads of the same core. Years later, I was told
that this recommendation was why so many EC2 instance families jumped
straight to two vCPUs (“large”) and skipped the “medium” size.

Time passed. With FreeBSD fundamentally working, I turned to the “nice
to haves”: merging my FreeBSD patches, simplifying the security update
path (including automatically installing updates on first boot), and
resizing the root filesystem on first boot. In April 2015, I finished
integrating the FreeBSD/EC2 AMI build process into the FreeBSD src tree
and handed off image builds to the FreeBSD release engineering team
— moving FreeBSD/EC2 across the symbolic threshold from a “Colin”
project to “official FreeBSD”. I was still the de facto owner of the
platform, mind you — but at least I wasn’t responsible for
running all of the builds.

In October 2016, I took a closer look at IAM Roles for Amazon EC2,
which had launched in mid-2012. The more I thought about it, the more
concerned I got; exposing credentials via the IMDS — an interface
which runs over unauthenticated HTTP and which warned in its documentation
against storing “sensitive data, such as passwords” — seemed like
a recipe for accidental foot-shooting. I wrote a blog post
“EC2’s most
dangerous feature” raising this concern (and others, such as
overly broad IAM policies), but saw no response from Amazon… that is,
not until July 2019, when Capital One was breached by exploiting the
precise risk I had described, resulting in 106 million customers’
information being stolen. In November 2019, I had a phone call
with an Amazon engineer to discuss their plans for addressing the
issue, and two weeks later, IMDSv2 launched — a useful improvement
(especially given the urgency after the Capital One breach) but in my
view just a mitigation of one particular exploit path rather than
addressing the fundamental problem that credentials were being exposed
via an interface which was entirely unsuitable for that purpose.

In May 2019, I was invited to join the
AWS Heroes
program, which recognizes non-Amazonians who make significant
contributions to AWS. (The running joke among Heroes is that a Hero
is someone who works for Amazon but doesn’t get paid by Amazon.)
The program is heavily weighted towards people who help developers
learn how to use AWS (via blog posts, YouTube videos, workshops, et
cetera), so I was something of an outlier; indeed, I was told that
when I was nominated they weren’t quite sure what to make of me, but
since I had been nominated by a Distinguished Engineer and a Senior
Principal Engineer, they felt they couldn’t say no.

In March 2021, EC2 added support for booting x86 instances using UEFI;
a “BootMode” parameter could be specified while registering an image
to declare whether it should be booted using legacy BIOS or modern
UEFI. For FreeBSD this was great news: Switching to UEFI mode
dramatically sped up the boot process — performing loader I/O
in 16-bit mode required bouncing data through a small buffer and cost
us an extra 7 seconds of
boot time. The only problem was that while all x86 instance types
supported legacy BIOS booting, not all instance types supported UEFI —
so I had to decide whether to degrade the experience for a small number
of users to provide a significant speedup to most users. In June,
I requested a
BootMode=polyglot setting which would indicate that the image was
able to boot either way (which, in fact, FreeBSD images already
could) and instruct EC2 to pick the appropriate boot mode based on
the instance. In March 2023, this landed as “BootMode=uefi-preferred”,
which I had to admit was a friendlier, albeit less geeky, name for it.

One of the most important things about the AWS Heroes program is the
briefings Heroes get, especially at the annual “Heroes Summit”. In
August 2023, we had a presentation about Seekable OCI, and looking at
the design I said to myself “hold on, they’re missing something here”:
The speaker made security claims which were true under most
circumstances, but did not hold in one particular use case. I wrote
to the AWS Security team (unlike in 2008, there was now a well-staffed
team with clear instructions on how to get in touch) saying, in part,
“I’m not sure if this is them not understanding about [type of attack]
or if it’s just an issue of confused marketing, but I feel like someone
needs to have a conversation with them”. My sense was that this could
probably be addressed with clear documentation saying “don’t do this
really weird thing which you probably weren’t planning on doing
anyway”, but since I wasn’t particularly familiar with the service I
didn’t want to make assumptions about how it was being used. After a
few email round trips I was assured that the problem had been corrected
internally and that the fix would be merged to the public
GitHub repository soon. I accepted these assurances — over the
years I’ve developed a good relationship with AWS Security people and
trust them to handle such matters — and put it out of my mind.

In December 2023, however, I was talking to some Amazonians at re:Invent
and was reminded of the issue. I hadn’t heard anything further, which
surprised me given that fixing this in code (rather than in documentation)
would be fairly intrusive. I asked them to check up on the issue and
they promised to report back to me in January, but they never did, and
again I stopped thinking about it. The following re:Invent though, in
December 2024, I met a Principal Engineer working on OCI and mentioned
the issue to him — “hey, whatever happened with this issue?”
— but he wasn’t aware of it. In January 2025, I raised it again
with a Security Engineer; he found the original ticket from 2023 and
talked to the team, who pointed at a git commit which they thought
fixed it.

The issue had not, in fact, been fixed: The 2023 commit prevented
the problem from being triggered by accidental data corruption, but
did nothing to prevent a deliberate attack. Once I pointed this out,
things got moving quickly; I had a Zoom call with the engineering team
a few days later, and by the end of February the problematic feature
had been disabled for most customers pending a “major revision”.

The largest change in my 20 years of working with Amazon started
out as something entirely internal to FreeBSD. In September 2020, the
FreeBSD Release Engineering Lead, Glen Barber, asked me if I could take
on the role of Deputy Release Engineer — in other words, Hot
Spare Release Engineer. As the owner of the FreeBSD/EC2 platform, I
had been working with the Release Engineering team for many years, and
Glen felt that I was the ideal candidate: reliable, trusted within the
project, and familiar enough with release engineering processes to take
over if he should happen to “get hit by a bus”. While I made a point
of learning as much as I could about how Glen managed FreeBSD releases,
like most hot spares I never expected to be promoted.

Unfortunately, in late 2022 Glen was hospitalized with pneumonia, and
while he recovered enough to leave the hospital a few months later, it
became clear that the long-term effects of his hospitalization made
it inadvisable for him to continue as release engineer; so on November
17, 2023, Glen decided to step back from the role and I took over as
FreeBSD Release Engineering Lead.
I like to think that I’ve done a good job since then — running
weekly snapshot builds, tightening schedules, establishing a
predictable and more rapid release cadence, and managing four releases
a year — but my volunteer hours weren’t unlimited, and it became
clear that my release engineering commitments were making it impossible
to keep up with EC2 support as well as I would have liked.

In April 2024 I confided in an Amazonian that I was “not really doing
a good job of owning FreeBSD/EC2 right now” and asked if he could find
some funding to support my work, on the theory that at a certain point
time and dollars are fungible. He set to work, and within a couple of
weeks the core details had been sorted out; I received sponsorship
from Amazon via GitHub Sponsors for 10 hours per week for a year
and addressed a
large number of outstanding issues. After a six month hiatus
— most of which I spent working full time, unpaid, on FreeBSD
15.0 release engineering — I’ve now started a second 12-month
term of sponsorship.

While I like to think that I’ve made important contributions to AWS
over the past 20 years, it’s important to note that this is by no means
my work alone. I’ve had to remind Amazonians on occasion that I do not
have direct access to internal AWS systems, but several Amazonians have
stepped in as “remote hands” to file tickets, find internal contacts,
inspect API logs, and obtain technical documentation for me. Even when
people — including very senior engineers — have explicitly
offered to help, I’m conscious of their time and call upon them as little
as I can; but the fact is that I would not have been able to do even a
fraction of what I’ve accomplished without their help.

