Moving from SSH to SSM for Automation
We’ve come a long way since the last time we recalibrated our architecture. We’re now ISO 27001 and SOC 2 compliant!
One of the biggest changes in the way we do things was moving from SSH to SSM (AWS Systems Manager), which came with its own benefits and hurdles. For those who are unfamiliar with it, SSM is an AWS service that removes the need to have port 22 (aka SSH) open on every single instance.
Goodbye jump boxes!
Before SSM
Back in the day, running a copy command on a cluster of 11 nodes looked like this:
$ ssh <jump box>
$ for i in {1..11}; do ssh username@xx.xx.xx.$i "my_process &> something.log & disown"; done
We’d then run two more for loops (each wrapped in a while true) to:
- Check that it’s actually running, i.e. ps aux | grep my_process
- Cycle through every node’s something.log to see how far the process has progressed (roughly sketched below)
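In practice, those monitoring loops looked something like this (a rough sketch rather than the exact script; the node range, log file, and sleep interval are illustrative):

$ while true; do for i in {1..11}; do ssh username@xx.xx.xx.$i "ps aux | grep my_process"; done; sleep 30; done
$ while true; do for i in {1..11}; do echo "node $i:"; ssh username@xx.xx.xx.$i "tail -n 5 something.log"; done; sleep 30; done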
And that was one cluster. Today, we have over a hundred clusters! There’s certainly a need to automate this.
With SSM
SSM gave us visibility. One key requirement was an audit trail of who executed what on any given node. ~/.bash_history doesn’t cut it, for several reasons:
- It can be deleted
- It’s not guaranteed to contain every command
- It does not carry time information
- And most importantly, our root volumes are disposable. Change the architecture, and bam, it’s lost!
Ultimately, we switched over to SSM, and in a nutshell, here’s how commands are executed:
- Use the SSM API to invoke an “SSM document”: this document (see below for a reference) contains the actual command(s) that would be executed on the host, which may be parametrised
- Specify the target (instance IDs, tags, etc.)
- Set an optional output S3 bucket where stderr and stdout from each instance are captured
- Specify other options such as rate throttling, error margin, and concurrency
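For illustration, the AWS CLI equivalent of such an invocation looks roughly like this (the document name, parameter, tag, and bucket are made-up placeholders):

$ aws ssm send-command \
    --document-name "MyTeam-RestartProcess" \
    --targets "Key=tag:cluster,Values=alpha" \
    --parameters '{"processName": ["my_process"]}' \
    --output-s3-bucket-name "my-ssm-output" \
    --max-concurrency "5" \
    --max-errors "1"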
This is absolutely fantastic! 100% visibility into:
- Who executed what and when
- The output
However, this immediately surfaced one important problem: only the infrastructure team could add new documents (since documents are created via CloudFormation). Any new command to be executed would require a change to the CloudFormation templates, which in turn would need to be prioritised by the infrastructure team.
So, how did we make a reasonable trade-off which still satisfies everything mentioned above?
The Abstract SSM Document
We decided to create an “abstract” document, which would accept just one parameter: the command itself!
{ "description": "Run abstract commands on instances", "schemaVersion": "2.2", "parameters": { "commands": { "description": "Command(s)", "type": "StringList" } }, "mainSteps": [ { "inputs": { "timeoutSeconds": "86400", "runCommand": "{{ commands }}" }, "name": "ExecuteAbstractCommands", "action": "aws:runShellScript" } ] }
This bridged the gap between SSM and SSH.
Bringing It All Together
Since we had an SSM document that could accept commands as an argument, we used it as a fundamental building block in our automation. All commands that would’ve been run manually could now be executed via the SSM Java API, with fine-grained control.
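The automation itself goes through the Java SDK, but the shape of the call is easiest to show with the AWS CLI. Assuming the abstract document above is registered under the (hypothetical) name AbstractCommands, the copy command from the Before SSM section becomes something like:

$ aws ssm send-command \
    --document-name "AbstractCommands" \
    --targets "Key=tag:cluster,Values=alpha" \
    --parameters '{"commands": ["my_process &> something.log"]}' \
    --output-s3-bucket-name "my-ssm-output" \
    --max-concurrency "50%"

No jump box, no SSH keys, and both the invocation and its output end up in the audit trail.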
Conclusion
SSM is a powerful tool. We now open bash prompts on instances via the AWS CLI (as opposed to hopping through a jump box and then onto the instance), and we have a complete audit trail. We’ve also used it to automatically migrate our in-house database from one major version to another.
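Opening one of those prompts is a one-liner (the instance ID is a placeholder):

$ aws ssm start-session --target i-0123456789abcdef0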
Have a similar story? Let us know in the comments below!