How-To: Using PEGs in Janet

🚀 Read this insightful post from Hacker News 📖

📂 Category:

📌 Main takeaway:

Janet is a small, Lisp-like language. Unlike most programming languages, it
offers no support for regular expressions. Instead, Janet supports parser
expression grammars, or PEGs.

A PEG in Janet is usually described by an associative data structure that lists
a series of rules. For each rule, the key is the name of the rule and the
value is a description of the string that the rule will match. What makes PEGs
especially powerful is the ability for rules to refer to other rules (including
recursive references) and for rules to run arbitrary functions.

Let’s see how we can use a PEG to parse a simplified subset of HTML. We’ll use
sequences, choices, captures (both compiled and match-time), replacements,
drops and back-references. It’s going to be fun.

Steps

Step 1. Define :main rule

Janet begins parsing using the :main rule. So let’s start with that:

'

This rule defines a pattern consisting of a sequence (represented by *) of
the rule :tagged and the value -1. This rule will match if the rule
:tagged matches and then the string ends (the value -1 matches if we are at
the end of the string).

Step 2. Define :tagged rule

Now if we try to use this grammar, Janet will complain that the rule :tagged
is not defined so let’s define that next:

'Share your opinion below!

This is pretty straightforward. Our :tagged rule consists of an opening tag,
a value of some kind and a closing tag.

Step 3. Define :open-tag rule

'🔥

We name the capture so that we can use a reference to it in our closing tag
rule. I went with :tag-name but you can choose whatever you like.

Step 4. Define :close-tag rule

Up to this point, we’ve been adding rules in the order that they’re processed.
Let’s deviate from that here and instead define our next rule to match closing
tags:

~Tell us your thoughts in comments!

This rule can be a little difficult to follow. We’re doing a match-time
capture here using the cmt function. This will match if the result of
the provided function (in this case, =) is truthy. If it is falsey, the match
will fail.

The = function is passed the values of any captures as separate arguments. We
have two captures in our pattern: the back-reference to the :tag-name capture
and the value matched by :w+. If the tag names match, the :close-tag
rule will match.

The eagle-eyed amongst you might have noticed that the quoting character at the
very beginning has changed from ' to ~. All but the simplest PEGs will
include references to various functions sequence, capture, etc that are run
by the PEG engine. To prevent these functions being called in our grammar
definition, we need to quote our data structure. However, when it comes to
passing the functions we want cmt to call, we need to pass a reference to
these functions. The solution is quasi-quoting. We quasi-quote the data
structure and then unquote the function symbol.

Step 5. Define :value rule

OK, now we’ll define the :value rule:

~💬

The value in between two tags could be nothing, a tagged value, an untagged
value or a combination of tagged and untagged values. We can match zero or more
occurrences of the pattern using the any function. The + combinator
tries the first pattern (:tagged) and if that fails, tries the second pattern
(:untagged).

Step 6. Define :untagged rule

Speaking of :untagged, this is the last named rule we need:

~What do you think?

This rule matches against one or more characters that are not <.

Step 7. Drop back reference after use

We have a problem at the moment. Our grammar won’t match nested tags because the
:tag-name capture is never being removed from the capture stack. We can solve
this problem using unref:

~

Now when we get to the end of a nested :tagged pattern, we’ll drop the captures.

Step 8. Capture tag names and values

The above is all well and good for checking if a string matches the grammar but
it’s usually much more helpful to return some structured data. Let’s do that:

~What do you think?

There’s quite a bit going on here so let’s look at each rule in turn.

  1. We’ve added a call to replace. This will pass the values captured to the
    function struct. We’ll see why in a second.

  2. Next, in :open-tag, we’re pushing the value :tag onto the capture stack
    using constant. Because the call to constant is part of the sequence in
    this pattern, it only occurs if the entirety of the pattern matches.

  3. We use a similar trick in :value. This time, instead of pushing :tag, we
    push :value.

  4. Still in :value, we use group to collect all of the matches in the (any (+ :tagged :untagged)) pattern and put them inside an array that we push
    onto the capture stack.

  5. Next, we’ve added a call to drop in :close-tag. This ensures that the
    value generated by the function = (i.e. true) is removed from the
    capture stack.

  6. Finally, we need to capture the result of :untagged. We do this with
    capture.

Now the call to struct hopefully makes sense. At the time this is called, the
pattern always returns four captures: (1) :tag, (2) the tag name, (3)
:value and (4) the tag’s value. These get pulled off the capture stack,
turned into a struct and pushed back on.

Bonus Step. Compile the grammar

If we intend to use the grammar repeatedly, we can compile it for a performance
boost:

(peg/compile
  ~{:main (* :tagged -1)
    :tagged (unref (replace (* :open-tag :value :close-tag) ,struct))
    :open-tag (* (constant :tag) "<" (capture :w+ :tag-name) ">")
    :value (* (constant :value) (group (any (+ :tagged :untagged))))
    :close-tag (drop (* "" (cmt (* (backref :tag-name) (capture :w+)) ,=) ">"))
    :untagged (capture (some (if-not "<" 1)))})

Example

Let’s see an example. Assuming this is our code:

(def grammar
  (peg/compile
    ~{:main (* :tagged -1)
      :tagged (unref (replace (* :open-tag :value :close-tag) ,struct))
      :open-tag (* (constant :tag) "<" (capture :w+ :tag-name) ">")
      :value (* (constant :value) (group (any (+ :tagged :untagged))))
      :close-tag (drop (* "" (cmt (* (backref :tag-name) (capture :w+)) ,=) ">"))
      :untagged (capture (some (if-not "<" 1)))}))

Then:

(peg/match grammar "

Hello world!

"
)
# => @[{:tag "p" # :value @[{:tag "em" :value @["Hello"]} # " " # {:tag "strong" :value @["world"]} # "!"]}] (peg/match grammar "Hello world!") # => nil

The second match returns nil because of the unmatched

. Neat!

Wrap-Up

Building support for PEGs directly into the language is one of Janet’s best
decisions. They might take a bit of time to get the hang of, but when you do,
you’ll be able to parse data in a way that’s considerably more powerful than
with regular expressions. ✺

{💬|⚡|🔥} {What do you think?|Share your opinion below!|Tell us your thoughts in comments!}

#️⃣ #HowTo #PEGs #Janet

🕒 Posted on 1760840654

By

Leave a Reply

Your email address will not be published. Required fields are marked *