A Tool That (Almost) Solved My Step Function Woes

I reached the end of my tether at work today on the task I was working on. The nature of the task involved crafting an AWS Step Function with several steps. Each step on the critical path contained some error handling, and several of them contained some cleanup logic, that had to be called by a bunch of other steps. This cleanup sequence is relatively complicated, and I’ve raised a number of PR’s to my colleagues which have come back with requests for change.

I’ll admit that I may have been a little sloppy with this change. But what’s not helping matters is the representation of Step Functions as a finite state machine written in YAML. I wrote about my issue with YAML used as some weird general purpose programming language, and this applies here as well. But a contributing factor is the level at which I’m defining this Step Function.fearing the sunk cost of the work I’ve done so far, I figured I’d just make the adjustments as they came. But now my patients has run out and I figured it was time to try a different approach.

The way to build an AWS Step Function is to define a finite state machine (FSM), which is basically a set of states linked together to form a graph. Doing this manually with a handful of states is relatively simple, but when you start to consider proper error handling and clean-up logic, such as making changes to DynamoDB records that need to be changed or reversed when something goes wrong, it can get quite complicated. The thing is that this is something that computers are able to do for a long time. A Turing complete program is a subset of a finite state machine, so if it’s possible to represent something that is Turing complete in a regular programming language, it should be possible to do likewise for a FSM with a subset of the language.

So I spent some time today trying to do that. My idea was to build a pre-processor which will take a Step Function definition encoded in the form that looks like a regular programming language, and translate it to the YAML FSM structure that AWS actually uses. Spoiler alert: I only got about half way through, but the seeds for something that could be used down the line are there, so it wasn’t a total waste of time.

The Design

There’s really nothing special about this. At this very early stage, the step function is simply written as if it was a regular programming language. The tool will read this, produce a graph representing the FSM, and generate the YAML file. An example of how this language looks is given below:

pass(name = "Pass1")
pass(name = "Pass2")

Here, we have a simple step function with two states, with one that will run after the other. Running it will produce the following YAML file:

Comment: This is a comment that is actually hard coded
StartAt: Pass1
States:
  Pass1:
    Type: Pass
    Next: Pass2
  Pass2:
    Type: Pass
    End: true

In this scenario, we’re using the Pass state, which will simply succeed, so this step function is pretty useless. But it does give a good demonstration as to what the eventual goal of the preprocessor is, which is to automatically do the wiring between the various states so that you don’t have to. If I were to put Pass2 above Pass1, it will update the generated YAML file to reflect that, so that Pass2 will have Next: Pass1, and Pass1 will have End: true.

The usefulness of the tool comes when we start considering failure modes. These can be expressed as normal “try-catch” blocks, a lot like may languages that exist today:

try {
  pass(name = "DoSomething")
} catch {
  pass(name = "SomethingFailed")
} finally {
  pass(name = "DoAlways")
}

This will produce the following Step Function YAML file:

Comment: This is a comment that is actually hard coded
StartAt: DoSomething
States:
  DoSomething:
    Type: Pass
    Catch:
      - ErrorEquals: ["States.ALL"]
        Next: SomethingFailed
    Next: DoAlways
  SomethingFailed:
    Type: Pass
    Next: DoAlways
  DoAlways:
    Type: Pass
    End: true

Once again, this is a relatively simple graph at this stage. But imagine this growing to several nested try-catch blocks, each one with slightly different error handling and cleanup logic. Manually keeping the various states wired correctly would be quite difficult, and it only makes sense to offload it to a tool to do this for us.

The Implementation

A few notes about how the tool itself was build:

It was hacked together in Go, which is my language of choice.
The parser was built using Participle, which is an excellent library for building parsers from annotated structs. If you work with Go and ever considered building a DSL, I would highly recommend this library.
The YAML document is serialised using this YAML library.

The tool was a complete hack job so I don’t want to go too much into the design. But the general approach is the following:

First the “program” is parsed from stdin and translated into an AST.
This AST is then converted into an internal representation of the Step Function. This is effectively the graph, with all the steps linked together via pointers.
The graph is then traversed and converted into a series of structs which, when serialised, will produce the YAML document that can be loaded into AWS.

The tool does not do everything. For example, choices are not supported, and only a few tasks that were specific to my problem were actually built.

So how did it go? Well, OK, I’ll come clean. I got as far as adding support for try-catch statements, but I never actually used it for the task I was working on. The task was nearing it’s completion and even though fixing the issues were annoying, it was less effort than regenerating the whole step function from scratch. But it’s likely that more step functions would be built in the future, most likely in the manner that we’ve been building them now: with error handling and clean-up states. So this tool might actually come in useful yet.