Moving from SSH to SSM for Automation

July 19, 2021

We’ve come a long way since the last time we recalibrated our architecture. We’re now ISO 27001 and SOC 2 compliant!

One of the largest changes was moving from SSH to SSM (AWS Systems Manager), which came with its own benefits and hurdles. For those who are unaware of SSM, it’s a service offered by AWS which removes the need to have port 22 (aka SSH) open on every single instance.

Goodbye jump boxes!

Before SSM

Back in the day, running a command across a cluster of 11 nodes looked like this:

$ ssh <jump box>    # hop onto the jump box first
$ for i in {1..11}; do ssh username@xx.xx.xx.$i "my_process &> something.log & disown"; done    # then fan out to every node

We’d then run two more for loops (each wrapped with a while true; sketched after the list) to:

  1. Check that the process is actually running, aka ps aux | grep my_process
  2. Cycle through every node’s something.log to see how far the process had progressed
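
A minimal sketch of what those two loops might have looked like (the sleep interval and the use of tail are assumptions; the host range and names mirror the example above):

$ # loop 1: is my_process still alive everywhere?
$ while true; do
    for i in {1..11}; do
      ssh username@xx.xx.xx.$i "ps aux | grep my_process | grep -v grep"
    done
    sleep 30
  done

$ # loop 2: how far has each node progressed?
$ while true; do
    for i in {1..11}; do
      ssh username@xx.xx.xx.$i "tail -n 1 something.log"
    done
    sleep 30
  done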

And that was one cluster. Today, we have over a hundred clusters! There’s certainly a need to automate this.

With SSM

SSM gave us visibility. A key requirement was having an audit trail of who executed what on any given node. ~/.bash_history doesn’t cut it, for several reasons:

  • It can be deleted
  • It’s not guaranteed to contain every command
  • It doesn’t carry timestamps by default
  • And most importantly, our root volumes are disposable. Change the architecture, and bam, it’s lost!

Ultimately, we switched over to SSM. In a nutshell, here’s how commands are executed (a rough CLI sketch follows the list):

  1. Use the SSM API to invoke an “SSM document”: this document (see below for a reference) contains the actual command(s) that would be executed on the host, which may be parametrised
  2. Specify the target (instance IDs, tags, etc.)
  3. Set an output S3 bucket (optional) where stderr and stdout from each instance are captured
  4. Specify rate-control options such as maximum concurrency and the error threshold
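
For illustration, such an invocation via the AWS CLI might look something like the following, using the AWS-managed AWS-RunShellScript document (the tag values, bucket name, and limits are placeholders):

$ aws ssm send-command \
    --document-name "AWS-RunShellScript" \
    --targets "Key=tag:Cluster,Values=cluster-42" \
    --parameters 'commands=["my_process &> something.log & disown"]' \
    --output-s3-bucket-name "my-ssm-output-bucket" \
    --max-concurrency "25%" \
    --max-errors "1"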

This is absolutely fantastic! 100% visibility into:

  1. Who executed what and when
  2. The output
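
Both are queryable after the fact; for example, something like the following lists per-instance status and output for a given run (the command ID is a placeholder):

$ aws ssm list-command-invocations \
    --command-id "0f4b6b3c-1111-2222-3333-444455556666" \
    --details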

However, this immediately surfaced one important problem: only the infrastructure team could add new documents (since they’re provisioned via CloudFormation). Any new command to be executed required a change to the CloudFormation templates, which meant it had to be prioritised by the infrastructure team.

So, how did we make a reasonable trade-off that still satisfied everything mentioned above?

The Abstract SSM Document

We decided to create an “abstract” document, which would accept just one parameter: the command itself!

{
  "description": "Run abstract commands on instances",
  "schemaVersion": "2.2",
  "parameters": {
    "commands": {
      "description": "Command(s)",
      "type": "StringList"
    }
  },
  "mainSteps": [
    {
      "inputs": {
        "timeoutSeconds": "86400",
        "runCommand": "{{ commands }}"
      },
      "name": "ExecuteAbstractCommands",
      "action": "aws:runShellScript"
    }
  ]
}
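
Invoking it then just means passing the command itself as a parameter. Assuming the document were registered under a hypothetical name like RunAbstractCommands, an invocation might look like:

$ aws ssm send-command \
    --document-name "RunAbstractCommands" \
    --targets "Key=tag:Cluster,Values=cluster-42" \
    --parameters 'commands=["ps aux | grep my_process"]'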

This bridged the gap between SSM and SSH.

Bringing It All Together

Since we had an SSM document that could accept arbitrary commands as an argument, we used it as a fundamental building block in our automation. All commands that would previously have been run manually could now be executed via the SSM Java API, with fine-grained control.

Conclusion

SSM is a powerful tool. We now open bash prompts to instances via the AWS CLI (as opposed to going through a jump box and then the instance), and we have a complete audit trail. We’ve also used it to automatically migrate our in-house database from one major version to another.
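
That interactive shell is a one-liner; for example (the instance ID is a placeholder, and the Session Manager plugin for the AWS CLI must be installed):

$ aws ssm start-session --target i-0123456789abcdef0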

Have a similar story? Let us know in the comments below!
