(See discussion on Lobsters.)
Collapsing // to / inside an HTTP URL path is not normalization.
The URI syntax permits empty path segments
RFC 3986 defines the path component and the segment grammar
in a way that allows for empty segments.
A double slash is therefore syntactically meaningful.
It represents a zero-length segment between two separators.
3.3. Path
The path component contains data, usually organized in hierarchical
form, that, along with data in the non-hierarchical query component
(Section 3.4), serves to identify a resource within the scope of the
URI’s scheme and naming authority (if any). The path is terminated
by the first question mark (“?”) or number sign (“#”) character, or
by the end of the URI.

If a URI contains an authority component, then the path component
must either be empty or begin with a slash (“/”) character. If a URI
does not contain an authority component, then the path cannot begin
with two slash characters (“//”). In addition, a URI reference
(Section 4.1) may be a relative-path reference, in which case the
first path segment cannot contain a colon (“:”) character. The ABNF
requires five separate rules to disambiguate these cases, only one of
which will match the path substring within a given URI reference. We
use the generic term “path component” to describe the URI substring
matched by the parser to one of these rules.

    path          = path-abempty    ; begins with "/" or is empty
                  / path-absolute   ; begins with "/" but not "//"
                  / path-noscheme   ; begins with a non-colon segment
                  / path-rootless   ; begins with a segment
                  / path-empty      ; zero characters

    path-abempty  = *( "/" segment )
    path-absolute = "/" [ segment-nz *( "/" segment ) ]
    path-noscheme = segment-nz-nc *( "/" segment )
    path-rootless = segment-nz *( "/" segment )
    path-empty    = 0<pchar>

    segment       = *pchar
    segment-nz    = 1*pchar
    segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
                  ; non-zero-length segment without any colon ":"

    pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"

A path consists of a sequence of path segments separated by a slash
(“/”) character. A path is always defined for a URI, though the
defined path may be empty (zero length). Use of the slash character
to indicate hierarchy is only required when a URI will be used as the
context for relative references. For example, the URI
mailto:fred@example.com has a path of “fred@example.com”, whereas
the URI foo://info.example.com?fred has an empty path.

The path segments “.” and “..”, also known as dot-segments, are
defined for relative reference within the path name hierarchy. They
are intended for use at the beginning of a relative-path reference
(Section 4.2) to indicate relative position within the hierarchical
tree of names. This is similar to their role within some operating
systems’ file directory structures to indicate the current directory
and parent directory, respectively. However, unlike in a file
system, these dot-segments are only interpreted within the URI path
hierarchy and are removed as part of the resolution process (Section
5.2).

Aside from dot-segments in hierarchical paths, a path segment is
considered opaque by the generic syntax.
Because segment = *pchar,
the empty string is a valid segment.
Therefore,
path-abempty = *( "/" segment )
allows a slash followed by an empty segment.
Any transformation that collapses // to /
removes a syntactically valid segment
and thus changes the parsed sequence of segments.
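To make the grammar argument concrete, here is a minimal Python sketch that splits a path into its segments the way path-abempty = *( "/" segment ) reads it. The helper name segments is made up for this illustration, and the sketch does not validate pchar; it only shows that each slash introduces one segment, which may be empty.

```python
def segments(path: str) -> list[str]:
    """Split a path matching path-abempty = *( "/" segment ) into segments.

    Hypothetical helper for illustration; it does not check that each
    segment consists only of pchar characters.
    """
    if path == "":
        return []
    if not path.startswith("/"):
        raise ValueError("path-abempty must be empty or begin with '/'")
    # Every "/" introduces exactly one segment, which may be the empty string.
    return path.split("/")[1:]


print(segments("/furweb.git/"))   # ['furweb.git', '']
print(segments("/furweb.git//"))  # ['furweb.git', '', '']
```

The two paths parse to segment sequences of different lengths, so a rewrite from one to the other changes what was parsed, not just how it is spelled.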
HTTP uses RFC 3986 path grammar
HTTP (RFC 9110) uses the RFC 3986 path grammar for request targets.
4.1. URI References
URI references are used to target requests, indicate redirects, and
define relationships.

The definitions of “URI-reference”, “absolute-URI”, “relative-part”,
“authority”, “port”, “host”, “path-abempty”, “segment”, and “query”
are adopted from the URI generic syntax. An “absolute-path” rule is
defined for protocol elements that can contain a non-empty path
component. (This rule differs slightly from the path-abempty rule of
RFC 3986, which allows for an empty path, and path-absolute rule,
which does not allow paths that begin with “//”.) A “partial-URI”
rule is defined for protocol elements that can contain a relative URI
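but not a fragment component.

    URI-reference = <URI-reference, see [URI], Section 4.1>
    absolute-URI  = <absolute-URI, see [URI], Section 4.3>
    relative-part = <relative-part, see [URI], Section 4.2>
    authority     = <authority, see [URI], Section 3.2>
    uri-host      = <host, see [URI], Section 3.2.2>
    port          = <port, see [URI], Section 3.2.3>
    path-abempty  = <path-abempty, see [URI], Section 3.3>
    segment       = <segment, see [URI], Section 3.3>
    query         = <query, see [URI], Section 3.4>

    absolute-path = 1*( "/" segment )
    partial-URI   = relative-part [ "?" query ]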
4.2.1. http URI Scheme
    http-URI = "http" "://" authority path-abempty [ "?" query ]

The origin server for an “http” URI is identified by the authority
component, which includes a host identifier ([URI], Section 3.2.2)
and optional port number ([URI], Section 3.2.3). If the port
subcomponent is empty or not given, TCP port 80 (the reserved port
for WWW services) is the default.

The hierarchical path component and optional query component identify
the target resource within that origin server’s namespace.
Collapsing // alters the sequence of segments
and therefore alters the identifier.
Unless the origin explicitly defines those two identifiers as equivalent,
a generic normalizer has no authority to do so. Only the origin could
munge URIs in its own namespace.
URL normalization rules do not include collapsing //
RFC 3986 is quite explicit about what syntax-based normalization is:
case normalization, percent-encoding normalization, and dot-segment removal.
It does not list any rule that removes empty segments
or collapses multiple slashes.
6.2.2. Syntax-Based Normalization
Implementations may use logic based on the definitions provided by
this specification to reduce the probability of false negatives.
This processing is moderately higher in cost than character-for-
character string comparison. For example, an application using this
approach could reasonably consider the following two URIs equivalent:

    example://a/b/c/%7Bfoo%7D
    eXAMPLE://a/./b/../b/%63/%7bfoo%7d

Web user agents, such as browsers, typically apply this type of URI
normalization when determining whether a cached response is
available. Syntax-based normalization includes such techniques as
case normalization, percent-encoding normalization, and removal of
dot-segments.
Path normalization is quite narrowly specified too:
it is about . and .. in relative references, not empty segments.
6.2.2.3. Path Segment Normalization
The complete path segments “.” and “..” are intended only for use
within relative references (Section 4.1) and are removed as part of
the reference resolution process (Section 5.2). However, some
deployed implementations incorrectly assume that reference resolution
is not necessary when the reference is already a URI and thus fail to
remove dot-segments when they occur in non-relative paths. URI
normalizers should remove dot-segments by applying the
remove_dot_segments algorithm to the path, as described in
Section 5.2.4.
Notice what is not present: there is no rule permitting removal of empty
segments, nor any directive to coalesce repeated separators.
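As a rough check of that reading, the following Python sketch transcribes the remove_dot_segments steps from RFC 3986 Section 5.2.4. It is my own transcription for illustration, not a reference implementation. None of the steps has a case for an empty segment, so // passes through unchanged.

```python
def remove_dot_segments(path: str) -> str:
    """Sketch of RFC 3986 Section 5.2.4, steps 2A to 2E."""
    output: list[str] = []
    while path:
        # 2A: drop a leading "../" or "./"
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        # 2B: replace a leading "/./" (or a lone "/.") with "/"
        elif path.startswith("/./"):
            path = path[2:]
        elif path == "/.":
            path = "/"
        # 2C: replace a leading "/../" (or a lone "/..") with "/" and
        # remove the last segment already moved to the output
        elif path.startswith("/../"):
            path = path[3:]
            if output:
                output.pop()
        elif path == "/..":
            path = "/"
            if output:
                output.pop()
        # 2D: a lone "." or ".." is removed
        elif path in (".", ".."):
            path = ""
        # 2E: move the first segment (with its leading "/", if any) to output
        else:
            next_slash = path.find("/", 1)
            if next_slash == -1:
                output.append(path)
                path = ""
            else:
                output.append(path[:next_slash])
                path = path[next_slash:]
    return "".join(output)


print(remove_dot_segments("/a/b/c/./../../g"))  # /a/g
print(remove_dot_segments("/furweb.git//"))     # /furweb.git//
```

Step 2E treats an empty segment like any other segment: it is copied to the output together with its leading slash.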
HTTP scheme-based normalization still does not collapse //
HTTP adds a few scheme-based normalization rules, and they are still quite
narrow. The only rule that touches the path concerns the empty path component
(not empty segments inside a path):
4.2.3. http(s) Normalization and Comparison
URIs with an “http” or “https” scheme are normalized and compared
according to the methods defined in Section 6 of [URI], using the
defaults described above for each scheme.

HTTP does not require the use of a specific method for determining
equivalence. For example, a cache key might be compared as a simple
string, after syntax-based normalization, or after scheme-based
normalization.

Scheme-based normalization (Section 6.2.3 of [URI]) of “http” and
“https” URIs involves the following additional rules:

1. If the port is equal to the default port for a scheme, the normal
   form is to omit the port subcomponent.
2. When not being used as the target of an OPTIONS request, an empty
   path component is equivalent to an absolute path of “/”, so the
   normal form is to provide a path of “/” instead.
3. The scheme and host are case-insensitive and normally provided in
   lowercase; all other components are compared in a case-sensitive
   manner.
4. Characters other than those in the “reserved” set are equivalent
   to their percent-encoded octets: the normal form is to not encode
   them (see Sections 2.1 and 2.2 of [URI]).
Again, it does not include collapsing // inside the path.
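The first three rules can be sketched in a few lines of Python using urllib.parse. This is an illustration under simplifying assumptions: the function name normalize_http_url and the DEFAULT_PORTS table are mine, the percent-encoding rule and the OPTIONS exception are left out, and userinfo and IPv6 literals are not handled. What matters here is what the function does not do: it never touches slashes inside a non-empty path.

```python
from urllib.parse import urlsplit, urlunsplit

# Default ports for the two schemes covered by the rules above.
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_http_url(url: str) -> str:
    """Apply a subset of http(s) scheme-based normalization (illustrative).

    Omitted on purpose: percent-encoding normalization, the OPTIONS
    exception, userinfo, and IPv6 host literals.
    """
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()  # scheme and host are case-insensitive
    port = parts.port
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host                      # omit the default port
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"               # empty path component becomes "/"
    return urlunsplit((scheme, netloc, path, parts.query, parts.fragment))


print(normalize_http_url("HTTP://Example.COM:80"))
# http://example.com/
print(normalize_http_url("https://git.runxiyu.org:443/furweb.git//"))
# https://git.runxiyu.org/furweb.git//
```

After normalization, the URL ending in // and the URL ending in / still compare unequal.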
Conclusion
- The RFC 3986 path grammar explicitly permits empty segments
  (segment = *pchar).
  Therefore, // in a path is syntactically valid
  and corresponds to an explicit empty segment.
- The generic syntax declares that, aside from dot-segments,
  path segments are opaque.
  Collapsing // changes the segment sequence
  and therefore changes opaque data,
  which is outside what normalization is supposed to do.
- HTTP uses RFC 3986’s path definitions for HTTP(S) URIs
  and states that the hierarchical path component
  identifies the resource within the origin’s
  namespace. That is, the exact path string
  (other than the very limited normalization rules)
  is part of the identifier.
- The normalization rules in RFC 3986 and RFC 9110
  do not authorize collapsing repeated slashes inside the path.
  The only allowed path-related normalizations are
  dot-segment removal (generic URIs) and empty-path-to-/ (HTTP).
Therefore, collapsing // to / in HTTP URL path segments is not correct
normalization. It produces a different, non-equivalent identifier unless the
origin explicitly defines those two paths as equivalent.
So, for example,
https://git.runxiyu.org/furweb.git// is a distinct identifier from
https://git.runxiyu.org/furweb.git/ under the standards’ grammar and
normalization rules, and must not be rewritten by a generic normalizer;
indeed, these two specific URLs serve different content.
/tmp $ git clone https://git.runxiyu.org/furweb.git/
Cloning into 'furweb'...
remote: Not Found
remote:
remote: You might be attempting to perform Git operations on
remote: a hierarchical index rather than a Git repository.
remote: Note that repositories URLs always end with a "//"
remote: sentinel. Perhaps try the following URL instead?
remote:
remote: https://git.runxiyu.org/furweb.git//
remote:
fatal: repository 'https://git.runxiyu.org/furweb.git/' not found
128 /tmp $ git clone https://git.runxiyu.org/furweb.git//
Cloning into 'furweb'...
remote: Enumerating objects: 2005, done.
remote: Counting objects: 100% (2005/2005), done.
remote: Compressing objects: 100% (500/500), done.
remote: Total 2005 (delta 1455), reused 2005 (delta 1455), pack-reused 0
Receiving objects: 100% (2005/2005), 372.87 KiB | 606.00 KiB/s, done.
Resolving deltas: 100% (1455/1455), done.
/tmp $
Why would you want to do that?
Sometimes it’s useful to have a separator between different parts of a path.
For example, let’s take a look at these two:
https://villosa.lindenii.org/villosa//repos/villosa/tree/HEAD//.editorconfig
https://villosa.lindenii.org/villosa/repos/villosa/tree/HEAD/.editorconfig
In places where the URL embeds arbitrary hierarchies, e.g., group paths and Git
refs, it is useful for there to be an explicit sentinel that distinguishes
between different parts of the path. The second URL makes it ambiguous where
the group path ends and where the ref ends (note that Git refs may be file
paths like runxiyu/fix-router that would not have empty segments).
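A minimal sketch of how a router could use such a sentinel, in Python. The function split_on_empty_segments is hypothetical and is not taken from the actual server behind these URLs; it only illustrates that the empty segments mark unambiguous boundaries, and that those boundaries vanish once // is collapsed.

```python
def split_on_empty_segments(path: str) -> list[list[str]]:
    """Group path segments, starting a new group at every empty segment.

    Hypothetical routing helper for illustration only.
    """
    groups: list[list[str]] = [[]]
    for segment in path.split("/")[1:]:
        if segment == "":
            groups.append([])           # sentinel: start the next part
        else:
            groups[-1].append(segment)
    return groups


print(split_on_empty_segments(
    "/villosa//repos/villosa/tree/HEAD//.editorconfig"))
# [['villosa'], ['repos', 'villosa', 'tree', 'HEAD'], ['.editorconfig']]

print(split_on_empty_segments(
    "/villosa/repos/villosa/tree/HEAD/.editorconfig"))
# [['villosa', 'repos', 'villosa', 'tree', 'HEAD', '.editorconfig']]
```

With the sentinel, a multi-segment ref such as runxiyu/fix-router stays inside the middle group; without it, the router would have to guess where the ref stops and the file path starts.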
Wait, are there any implementations that wrongly collapse double-slashes?
- nginx with merge_slashes
- Go’s net/http.ServeMux and path.Clean; note that filepath.Clean is the
  one we’re supposed to use for file paths, and path.Clean is extensively used
  in code adjacent to URL handling.
- I also remember Apache in some configurations exhibiting this behavior, but I
  don’t have citations and can’t verify it myself for now.
🕒 **Posted on**: 1776503798
