Replacing Ansible with salt-ssh

2020-10-22 11:07

First, where am I coming from with Ansible?

There is this machine (or "box") that I used to manage using Ansible until recently. I wanted configuration management on that box so that if the disk, the VM, or the entire hosting provider ever went away, I would have a magic button to start a rebuild from nothing, grab a coffee, and have things work the same again. I wanted Ansible for that task because it's fairly easy and approachable, requires nothing but working SSH access from the host system, and is written in Python. Unlike Puppet, Chef, CFEngine and SaltStack, or so I thought.

Over time using Ansible, I noticed that whenever I changed a playbook I faced the same choice: either run the whole playbook and wait for many de-facto no-op tasks, or invest in annotating tasks with tags, save some runtime, and deal with the shortcomings that tags have in Ansible.

Tags in Ansible: What shortcomings?

Tags in Ansible have two problems that bug me. First, you'll need to manually propagate the same tag to all dependency tasks, especially those whose registered variables are referenced in when-conditionals, or else you'll run into undefined-variable issues because the task that was supposed to register that variable has not been executed. So that's something I would have to take good care of, manually.
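
To illustrate the first problem, here is a minimal sketch. The task names, the tag names and the path /usr/bin/docker are made up for illustration; they are not from my actual playbook:

```yaml
- hosts: all
  tasks:
  - name: Check whether the Docker binary exists
    stat:
      path: /usr/bin/docker  # assumed location, for illustration only
    register: docker_binary
    tags:
      - docker
      # This second tag is needed ONLY because the task below reads
      # docker_binary; forget it, and a run limited to the tag "start"
      # skips this task and the variable is never registered.
      - start

  - name: Start Docker
    service:
      name: docker
      state: started
    when: docker_binary.stat.exists
    tags:
      - start
```

Without the duplicated tag on the first task, a run limited to the tag "start" would stop at the second task with an error saying that 'docker_binary' is undefined.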

Secondly, tags and loops do not work well together in Ansible. What I would like to do is use the iteration item as a tag like this:

- hosts: all
  tasks:
  - name: Add Docker users
    user:
      name: "{{ item }}"
      groups: docker
      append: yes
    loop:
      - ssl-reverse-proxy
      - example1-org
      - example2-net
    tags:
      # NOTE: Does not work!
      #       Gets you: ERROR! 'item' is undefined
      - "{{ item }}"

Unfortunately, this gets me ERROR! 'item' is undefined because tags do not support loops like that in Ansible.

I can address this problem by

a) having two verbatim copies of that list,
b) extracting and re-using a variable, or
c) making use of YAML references.

A version using YAML references could look like this:

- hosts: all
  tasks:
  - name: Add Docker users
    user:
      name: "{{ item }}"
      groups: docker
      append: yes
    loop: &users
      - ssl-reverse-proxy
      - example1-org
      - example2-net
    tags: *users

More importantly though, I'll also need to be okay with the whole loop being run if I ask for any of those tags now, which means additional runtime for no value.

Most of the time I didn't want to deal with these shortcomings of tags, so instead I worked on other tasks while the whole playbook was running and got back to it when there were results.

It was hard to accept one other thing though: When I ran the playbook two times in a row, the second run would take about 4 minutes to do nothing but confirm that all the work was already done. Why? Would I have to accept that it was that slow?

When Ansible is slow, how fast can I get it to be?

So I started looking for ways to improve Ansible speed. SSH pipelining, disabling fact gathering, and Mitogen all helped but wouldn't get the runtime below 3 minutes, so I was not very happy. On a side note, Mitogen doesn't support Ansible >=2.10 as of this writing, so that boost in speed would come at the cost of being stuck with Ansible 2.9 for longer, which is not ideal either.
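
For reference, two of these three knobs can be set right in the playbook. This is a minimal sketch rather than my actual configuration, and the ping task is only a placeholder; Mitogen is left out because it needs additional setup outside of the playbook:

```yaml
- hosts: all
  # Skip the implicit "Gathering Facts" task at the start of the play;
  # only safe if no task relies on facts about the host
  gather_facts: false
  vars:
    # Enable SSH pipelining for the hosts of this play,
    # which reduces the number of SSH operations per task
    ansible_ssh_pipelining: true
  tasks:
  - name: Placeholder task
    ping:
```

Both settings can also be made globally in ansible.cfg instead of per play.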

So I accepted 3 minutes as the minimum runtime of that particular playbook at that time, and started wondering about looking elsewhere.

Can Salt be used like Ansible?

Maybe Salt had some way to work without all those minions, masters, daemons, and agents that seemed like a given to me when I last had a bit to do with SaltStack at a previous job a few years ago. To my delight, I did find salt-ssh this time. It is not actually new: salt-ssh was introduced with the release of Salt 0.17.0 on 2013-09-26.

So I was trying to answer the question:

Can I port my existing Ansible playbook to salt-ssh, will it be fun and work well, and will it be faster than 3 minutes for when it doesn't actually need to do anything?

A summary of my existing Ansible playbook

For some context, what is that playbook of mine doing anyway?

For an almost complete high-level summary (if you're interested):

Configure sshd, install an SSH public key, and restart the service as needed
Install Docker from a dedicated repository, have it running and enabled, install docker-compose
Configure firewalld to be friends with Docker
Create a specific Docker network for a Caddy-based SSL reverse proxy to talk to website containers
Configure and activate dnf-automatic so that it updates packages by itself, restarts outdated services and reboots the VM when tracer detects the need to
Adjust the systemd-resolved config to no longer expose LLMNR port 5355 to the world without need
Close port 9090 to the world, previously exposed by the cockpit service
Make sure that ${HOME}/.local/bin is in $PATH for all users
Downgrade cgroup to v1 for Docker by adjusting the kernel command line and re-creating the GRUB config for the change to have actual effect
Install some tools for manual inspections, e.g. htop, tmux and ncdu
Create some bare Git repositories to host off-GitHub website content
Clone some Git repositories containing docker-compose website projects and keep them up to date with upstream
Spin up multiple docker-compose based services and have them do rebuilds and restarts whenever their underlying Git clone has changed
Set the machine hostname

It's not very different from this playbook actually, just a bit bigger.

First steps and pains with salt-ssh

I started making my way through the official Agentless Salt: Get Started Tutorial and got stuck rather quickly. I wanted execution as an unprivileged user but despite obeying the tutorial in detail I ran into errors about not being able to write to /var/cache/ — for good reasons — like these:

# salt-ssh '*' test.ping
[ERROR   ] Unable to render roster file: Traceback (most recent call last):
[..]
PermissionError: [Errno 13] Permission denied: '/var/cache/salt/master/roots/mtime_map'

And while the docs used absolute paths like /home/vagrant/salt-ssh/ everywhere, I wanted relative paths that would work with a Git repository cloned anywhere in the file system hierarchy. Not to mention that log_file needs to be ssh_log_file in the tutorial.

So with all of that figured out after a while, this minimal setup satisfied all of my needs: execution as an unprivileged user, relative paths with the help of root_dir: ., significantly less noisy output through state_output_diff: True, and a place to start adding playbook-like things to. For a bird's eye view:

# tree
.
├── master
├── pillar
│   ├── data.sls
│   └── top.sls
├── roster
├── salt
│   └── setup.sls
└── Saltfile

In more detail, looking into these files:

File Saltfile:

salt-ssh:
  roster_file: ./roster
  config_dir: .
  ssh_log_file: ./log.txt

File master:

root_dir: .

cachedir: ./cachedir

file_roots:
  base:
    - ./salt

pillar_roots:
  base:
    - ./pillar

state_output_diff: True

File roster:

host1:
  host: host1.tld
  user: root

host2:
  host: host2.tld
  user: root

File pillar/top.sls:

base:
  '*':
    - data
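
The tree above also lists pillar/data.sls, which top.sls assigns to all hosts. Its contents are specific to my setup and not shown here. As a purely hypothetical sketch, assuming one wanted to keep the list of Docker users from the Ansible example in pillar data, it could look like this:

```yaml
# pillar/data.sls (hypothetical contents, for illustration only)
docker_users:
  - ssl-reverse-proxy
  - example1-org
  - example2-net
```

A state file could then read that list through Jinja, e.g. via pillar['docker_users'].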

With that as a base I can now port the playbook over in a new file salt/setup.sls.

For example, let's adjust the OpenSSH server config to know my public key (that I'll store at salt/ssh/files/authorized-keys-root.txt), to disable password-based log-in (to protect against brute-force log-in attempts) and be sure that the server makes use of the adjusted configuration:

ssh-daemon:
  # Set SSH public keys for root
  ssh_auth.present:
    - user: root
    - source: salt://ssh/files/authorized-keys-root.txt

  # Disable password-based log-ins to SSH
  file.keyvalue:
    - name: /etc/ssh/sshd_config
    - key: PasswordAuthentication
    - value: "no"
    - separator: " "
    - uncomment: "#"
    - require:
      - ssh_auth: ssh-daemon

  # Reload sshd service to apply changes in configuration
  service.running:
    - name: sshd
    - reload: True
    - watch:
      - file: ssh-daemon

That state file was made with Fedora 32 in mind, by the way.

With that local setup we can now run commands like:

# salt-ssh '*' test.ping

# salt-ssh '*' grains.items

# salt-ssh '*' state.apply setup test=True

# salt-ssh '*' state.apply setup

It took me maybe a day and a half to port the whole playbook to salt-ssh and be confident with the result. What did it get me?

(What I first believed to be a) significant reduction of runtime: Down from 3-4 minutes with Ansible to about 1 minute with salt-ssh… but I'll get to why these numbers are misleading, below
A high-level language leveraging YAML with idempotency in mind, just like with Ansible
Being able to stay agentless: No minions, no masters, just SSH
More flexibility (but also some duty) with regard to state dependencies and order of execution
Being able to use Jinja templating right in the playbook (or "Salt state file") unlike with Ansible
Experience with a new tool to add to my DevOps toolbox
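
To make the points about Jinja and explicit dependencies more concrete, here is a sketch of how the "Add Docker users" loop from the Ansible example above could look in a Salt state file. The state IDs and the explicit group.present state are my assumptions for this illustration, not an excerpt from the ported setup:

```yaml
# salt/setup.sls (illustrative excerpt)

# Assumption: make sure the group exists so that the users can require it
docker-group:
  group.present:
    - name: docker

# Jinja renders this loop into one separate state per user
# before Salt parses the result as YAML
{% for name in ['ssl-reverse-proxy', 'example1-org', 'example2-net'] %}
docker-user-{{ name }}:
  user.present:
    - name: {{ name }}
    - groups:
      - docker
    # Keep existing group memberships, similar to "append: yes" in Ansible
    - remove_groups: False
    - require:
      - group: docker-group
{% endfor %}
```

Because the loop produces one state per user, each user ends up with its own state ID, which is what I had been trying to get from per-item tags in Ansible.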

Only after porting to SaltStack did it become clear that some badly-written parts of the original Ansible playbook were a big contributing factor to its excessive runtime. For instance, the playbook was using module package with a loop…

- hosts: all
  tasks:
  - name: Install distro packages
    package:
      name: "{{ item }}"
      state: present
    loop:
      # NOTE: Bad idea, very slow
      - git
      - htop
      - ncdu

…rather than a list of names:

- hosts: all
  tasks:
  - name: Install distro packages
    package:
      name:
        # NOTE: Better, a lot faster
        - git
        - htop
        - ncdu
      state: present

With as many as 20 packages to check for, this single loop alone contributed heavily to the initial 4-minute runtime with Ansible when there was not actually anything left to do.
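
For comparison, a sketch of the same package list on the Salt side, where pkg.installed takes the whole list through pkgs. The state ID is made up for this example:

```yaml
# salt/setup.sls (illustrative excerpt)
distro-packages:
  pkg.installed:
    - pkgs:
      - git
      - htop
      - ncdu
```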

In a fair comparison with a well-written playbook, Ansible and salt-ssh exhibit close to identical runtime for me now.

Still, after having used both Ansible and SaltStack I think it's fair to say that I consider myself a salt-ssh convert by now.

I do hope that SaltStack gets better at fixing bugs in the future. All the hiccups and limitations I ran into with version 3001.1 were related to features that I'd consider mainstream enough that I shouldn't even have seen them, given the size of the community.

Things I ran into include:

I hope those are not a sign of structural issues with SaltStack. VMware bought SaltStack in September 2020 so I'm hoping that it turns out for the best. I'm happy to help out with pull requests once I'm convinced that I won't be wasting my time.

For more about using salt-ssh to replace Ansible, maybe Duncan Mac-Vicar P.'s article "Using Salt like Ansible" is of interest to you.

That's enough Salt for me today. Did I miss anything? Please let me know.

Best, Sebastian

Hartwork Blog