Gitstafette Server Deployment¶
In this post, we examine the deployment of the Gitstafette server.
We cover the deployment on Google Cloud Platform (GCP) and Amazon Web Services (AWS). After describing the deployment target, we dive into the deployment automation (on AWS) using GitHub Actions.
What is the Gitstafette Server?¶
Earlier this year, I wrote about bringing Webhooks into your Homelab. The Gitstafette Server is the server-side component of the Gitstafette application.
The Gitstafette project is a way to relay webhooks from one service to another through a secure connection.
The first deployment of the Gitstafette server was on Google Cloud Platform (GCP) using Google Cloud Run.
Unfortunately, the way I wanted the server to work ran into the limitations of Cloud Run. Since the primary reason for creating the Gitstafette application is for me to learn certain technologies, I moved the deployment elsewhere rather than work around those limitations.
However, I want to share my experience of deployment on GCP as it might be helpful to others.
After that, we'll discuss the deployment on AWS. The deployment on AWS is done using an EC2 instance and tools such as Packer, Terraform, and GitHub Actions.
Finally, we'll discuss the automation of the deployment using GitHub Actions.
Deployment on GCP¶
In the past, I created an application to generate fair maps for the board game Settlers of Catan (Catan Map Generator, or CMG). At the time, the frontend was deployed on Heroku and the backend on Google Cloud Run.
So, my first instinct was to also deploy the Gitstafette server on Google Cloud Run. The Gitstafette server is a stateless Go application and an excellent candidate for serverless deployment.
It also means freeing myself of the responsibility of managing the server, the endpoint, and, last but not least, the HTTPS certificate, a requirement for receiving webhooks from GitHub.
GCP Cloud Run Challenges¶
To explain the challenges I faced, I need to explain how the Gitstafette server works.
The server has two endpoints:
- Webhook Listener: a RESTful (JSON over HTTP) endpoint for receiving webhooks
- Relay GRPC Stream: a GRPC endpoint for streaming the webhooks to the relay client(s)
For more details, refer to the earlier post on Webhooks into your Homelab.
This results in the following characteristics: the server needs two exposed ports, and one port (the GRPC stream) must handle long-lived connections.
Unfortunately, Cloud Run only supports a single port and does not support long-lived connections.
Cloud Run is based upon Knative, which relies on Envoy Proxy for routing. While we can give Knative some hints, we cannot control the Envoy Proxy configuration.
The two ports issue alone could have been a reason to look for another solution, but the long-lived connection issue was a deal-breaker.
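To make the single-port constraint concrete: if you were to describe a Cloud Run service in Terraform, the container could expose exactly one port, and the request timeout would top out at 60 minutes. The snippet below is a hypothetical sketch using the google provider's v2 resource, not my actual configuration at the time:
resource "google_cloud_run_v2_service" "gitstafette" {
  name     = "gitstafette-server"
  location = "europe-west4" # placeholder region

  template {
    containers {
      image = "ghcr.io/joostvdg/gitstafette/server:0.3.0"

      # Cloud Run exposes exactly one container port,
      # so there is no room for both the HTTP listener and the GRPC stream.
      ports {
        container_port = 1323
      }
    }

    # Even at the maximum of one hour, every request, and thus every GRPC stream, gets cut off.
    timeout = "3600s"
  }
}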
Info
I briefly toyed with the idea of setting up two relay servers.
One for the webhook listener and one for the relay stream.
The relay stream would expose the GRPC stream while connecting "internally" to the webhook listener.
This worked, but it could have been better.
It is, however, the reason the server can also relay webhooks to another server.
Timeout Issues GCP Cloud Run¶
The long-lived connection issue surfaced as a timeout issue: the GRPC stream closes after a fixed period. This period is not configurable and is too short for GRPC streaming, which is precisely one of the things I want to learn about.
After exploring possible causes, I found that the culprit is the default timeout of the Envoy Proxy. This timeout is not configurable either, as the Envoy Proxy configuration is abstracted away by Knative, the technology behind Cloud Run.
I found a workaround by making the streaming duration and timeout configurable in the application, so it can handle different deployment environments, such as Cloud Run, Cloud Functions, or Kubernetes.
Ultimately, I realized I was fighting the platform and moving further and further away from the project's original goal. So, I would need to change either the project or the platform. I chose the latter.
Info
This is why the application lets you configure how long it keeps the stream alive.
This is also why it automatically reconnects when a stream is closed without an explicit error.
So, this was the end of the deployment on GCP.
Deployment on AWS¶
Around this time, three things were happening in the broader ecosystem:
- Heroku announced they were going to deprecate the free tier for hobby projects
- There was more and more support for ARM-based software
- AWS offered a cheap ARM-based instance (t4g.nano) costing about $5 a month
At the time, I was building my homelab out of Raspberry Pis and was interested in ARM-based software. I had also read several blog posts from Honeycomb.io on their ARM adoption.
So, I moved the Gitstafette server to AWS, onto a t4g.nano instance.
Solution Overview¶
Unlike the Cloud Run deployment, I wanted to understand the infrastructure required and take control of it. So, I chose to forego the serverless options, like Lambda or Fargate, and go for an EC2 instance.
Now, an EC2 instance alone is not nearly enough to make the application accessible to the outside world, and definitely not in a secure way.
So we have an EC2 instance within a VPC, with a security group that allows traffic on the ports we need: 443 (HTTPS) and 50051 (GRPC). It also opens port 22 (SSH), but only for my IP address, so I can connect to the instance if needed.
The instance is accessible because it has a public IP address (EIP) and a DNS name (Route 53).
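In Terraform, that security group boils down to three ingress rules. The snippet below is a simplified sketch rather than my exact resource; the name, the VPC reference, and the SSH CIDR are placeholders:
resource "aws_security_group" "aws-linux-sg" {
  name   = "gitstafette-server-sg" # placeholder name
  vpc_id = aws_vpc.main.id         # assumes the VPC is defined or looked up elsewhere

  ingress {
    description = "HTTPS webhook listener (terminated by Envoy)"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  ingress {
    description = "GRPC relay stream"
    from_port   = 50051
    to_port     = 50051
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  ingress {
    description = "SSH, restricted to my own IP"
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["198.51.100.10/32"] # placeholder
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}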
To save costs, I used Spot Instances. It does make the application less reliable, but the Gitstafette server is not a critical application and is mostly stateless. It caches webhooks for some time, but assuming your client is connected most of the time, it should not be a problem.
I use a CloudWatch Alarm to restart the instance in case of trouble (elaborated on later).
I create an AMI for the instance to ensure everything is set up correctly. This AMI is built using HashiCorp Packer.
On this instance, the application runs as a container image via Docker Compose. We'll dive into the Docker Compose setup later.
Challenges¶
As always, there were some challenges to overcome. As these challenges influence the design, let's explore them in more detail.
First Challenge - Public Web Access¶
First, the web entry point must be protected. I'm not a fan of exposing applications directly to the internet.
Not every application is built with security in mind, and I'm not a security expert.
So, I put a reverse proxy in front of the application. This reverse proxy handles HTTPS and forwards the traffic to the application.
There are many options for reverse proxies, but I chose Envoy Proxy, mostly because I explicitly wanted to learn more about Envoy, as it is used by Knative and Istio. My then-employer also used both projects, so it was a good investment.
Second Challenge - Trusted Certificates¶
The second challenge was to have a valid, trusted certificate. The webhooks are sent from GitHub; thus, GitHub needs to trust this certificate. In GitHub's webhook configuration, we cannot add a self-signed certificate. A good choice for automating certificates without cost is Let's Encrypt.
As I planned to use Docker Compose to manage the processes on the VM, I looked for a containerized solution for Let's Encrypt. This is where Certbot comes in. Certbot has a Docker image supporting the Route 53 DNS challenge. Assuming we can safely get the Route 53 credentials to the instance, we can automate the certificate renewal.
Third Challenge - Certificate Storage¶
And this is where we hit the third challenge: Certbot retrieves the certificates, but Envoy then needs to use them.
Envoy expects these certificates at a pre-defined path, and the permissions must be set correctly.
So, I created a small script that copies the certificates to the correct location and sets the permissions. It uses two volumes: one where Certbot stores the certificates and one where Envoy reads them from.
Fourth Challenge - Instance Hangs¶
The fourth challenge was that the instance sometimes hangs. I have yet to find the actual cause, but it seems to happen when the instance, a burstable t4g.nano, consumes more than 2 CPU credits.
When that happens, it becomes totally unresponsive, and the only way to get it back is to restart it. To automate this, I created a CloudWatch Alarm that restarts the instance once it consumes more than 2 CPU credits for a certain period.
Fifth Challenge - Fast Startup¶
The fifth challenge was ensuring a fast startup. The instance is a Spot Instance, so it can be terminated at any time, and due to the hanging issue, the CloudWatch Alarm might restart it.
So, we can expect the instance to be restarted regularly. To keep startup fast, I preload everything during the Packer AMI build: the Docker Compose service is configured to start at boot, and the container images used in the Docker Compose config are pulled during the build, so they do not have to be downloaded when the instance starts.
Sixth Challenge - Handling Secrets¶
Last but not least, we need a way of handling secrets. The Route 53 credentials, the certificates for the GRPC port, and any tokens for the Gitstafette server. I used AWS Secrets Manager for this. The secrets are stored in Secrets Manager and are retrieved by the instance during startup. This means the instance has a policy that allows it to retrieve the secrets from Secrets Manager with a specific prefix.
In Terraform, we load a startup script that retrieves the secrets and writes them to an override .env file, overriding the defaults used by Docker Compose.
Instance Profile and Policies¶
We need to create an instance profile and policies to ensure the instance can retrieve the secrets from Secrets Manager.
We create the policies and attach them to the instance profile.
Here's an example of a policy that reads files from an S3 bucket (in this case, the GRPC TLS certificates):
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "VisualEditor0",
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::XXXX",
"arn:aws:s3:::XXXX/*"
]
}
]
}
And the policy for reading secrets from Secrets Manager:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"secretsmanager:GetSecretValue",
"secretsmanager:DescribeSecret"
],
"Resource": [
"arn:aws:secretsmanager:eu-central-1:XXXX:secret:XXXX/*"
]
}
]
}
We create a role, attach the policies to it, and add the role to the instance profile.
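If you manage this part in Terraform, it could look roughly like the sketch below. The resource names are placeholders, and the aws_iam_policy resources are assumed to wrap the JSON documents shown above:
resource "aws_iam_role" "gitstafette_instance" {
  name = "gitstafette-instance-role" # placeholder
  # Only EC2 may assume this role.
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "ec2.amazonaws.com" }
    }]
  })
}

resource "aws_iam_role_policy_attachment" "read_certificates_bucket" {
  role       = aws_iam_role.gitstafette_instance.name
  policy_arn = aws_iam_policy.read_certificates_bucket.arn # the S3 policy above
}

resource "aws_iam_role_policy_attachment" "read_secrets" {
  role       = aws_iam_role.gitstafette_instance.name
  policy_arn = aws_iam_policy.read_secrets.arn # the Secrets Manager policy above
}

resource "aws_iam_instance_profile" "gitstafette" {
  name = "gitstafette-instance-profile" # placeholder, referenced by the instance later
  role = aws_iam_role.gitstafette_instance.name
}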
For more details on how to create these policies and the permissions required, see the resources below:
Danger
I am not an AWS guru and definitely not a security expert.
So, I do not recommend using my policies.
I've decided not to include all policies to prevent anyone from using them.
Instead, the resources above should give you a good starting point.
Docker Compose Setup¶
The Docker Compose setup is pretty straightforward.
We have the following services:
- Certbot
- Cert Copy
- Envoy
- Gitstafette Server
The Certbot service retrieves the certificates from Let's Encrypt.
The Cert Copy service copies the certificates to the correct location and sets the permissions.
The Envoy service is the reverse proxy that handles the HTTPS traffic.
The Gitstafette Server service is the actual Gitstafette server that listens to webhooks and relays them via a GRPC port.
Certbot Service¶
The Certbot service is a containerized version of Certbot.
It uses the Route 53 DNS challenge to retrieve the certificates.
The certificates are stored in a shared volume with the Cert Copy service.
It has a default environment file and an override environment file. The override is optional but should be created upon instance startup and populated with the current Route 53 credentials.
certbot:
image: certbot/dns-route53:arm64v8-v2.11.0
command: [ "certonly", "-v", "--dns-route53", "-d", "events.gitstafette.joostvdg.net", "--email", "joostvdg@gmail.com", "--keep-until-expiring", "--agree-tos", "--non-interactive" ]
volumes:
- certbot-certificates:/etc/letsencrypt
deploy:
restart_policy:
condition: unless-stopped
delay: 60s
resources:
limits:
cpus: '0.15'
memory: 50M
reservations:
cpus: '0.10'
env_file:
- path: ./default.env
required: true # default
- path: ./override.env
required: false
Assuming we don't need to renew the certificates too often, we don't need to restart the service immediately, so we set the restart delay to 60 seconds.
Cert Copy Service¶
The Cert Copy service is a small container that copies the certificates to the correct location and sets the permissions.
cert-copy:
image: bitnami/minideb:latest
restart: unless-stopped
command: ["./etc/copy_script.sh"]
depends_on:
- certbot
configs:
- source: copy_script
target: /etc/copy_script.sh
volumes:
- certbot-certificates:/etc/certbot/certificates:ro
- envoy-certificates:/etc/envoy/certificates:rw
deploy:
restart_policy:
condition: unless-stopped
delay: 60s
resources:
limits:
cpus: '0.15'
memory: 20M
reservations:
cpus: '0.10'
We don't want the Cert Copy service to run too often, so we let it sleep for an hour after copying the certificates. We set the restart to a delay of 60 seconds, so we're less likely to hammer the instance in case of trouble.
Cert Copy Script
#!/bin/bash
echo "> Reading source location"
echo "-----------------------------------------"
echo "-----------------------------------------"
echo " > GSF Cert Location"
ls -lath /etc/certbot/certificates/live/events.gitstafette.joostvdg.net
echo "-----------------------------------------"
echo "-----------------------------------------"
echo "> Copy GSF Certs to target location"
cp /etc/certbot/certificates/live/events.gitstafette.joostvdg.net/fullchain.pem /etc/envoy/certificates/gsf-fullchain.pem
cp /etc/certbot/certificates/live/events.gitstafette.joostvdg.net/cert.pem /etc/envoy/certificates/gsf-cert.pem
cp /etc/certbot/certificates/live/events.gitstafette.joostvdg.net/privkey.pem /etc/envoy/certificates/gsf-privkey.pem
echo "> Reading target location"
ls -lath /etc/envoy/certificates
echo "> Set Cert permissions"
chmod 0444 /etc/envoy/certificates/gsf-fullchain.pem
chmod 0444 /etc/envoy/certificates/gsf-cert.pem
chmod 0444 /etc/envoy/certificates/gsf-privkey.pem
echo "> Sleeping for 1 hour"
sleep 3600
Envoy Service¶
The Envoy service is the reverse proxy that handles the HTTPS traffic.
envoy:
image: envoyproxy/envoy:v1.31.0
configs:
- source: envoy_proxy
target: /etc/envoy/envoy-proxy.yaml
uid: "103"
gid: "103"
mode: 0440
command: /usr/local/bin/envoy -c /etc/envoy/envoy-proxy.yaml -l debug
deploy:
restart_policy:
condition: unless-stopped
delay: 10s
depends_on:
- cert-copy
- gitstafette-server
volumes:
- type: volume
source: envoy-certificates
target: /etc/envoy/certificates
ports:
- 443:443
- 8081:8081
I won't discuss the Envoy configuration in too much depth, as it is somewhat out of scope for this post. Essentially, we have a listener on port 443, with a filter chain that filters on the domain name.
The filter chain has a transport socket that uses the certificates we copied earlier. The listener forwards the traffic to the Gitstafette Server service.
Envoy Filter Chain Config
- address:
socket_address:
address: 0.0.0.0
port_value: 443
listener_filters:
- name: "envoy.filters.listener.tls_inspector"
typed_config:
"@type": type.googleapis.com/envoy.extensions.filters.listener.tls_inspector.v3.TlsInspector
filter_chains:
- filter_chain_match:
server_names: ["events.gitstafette.joostvdg.net"]
filters:
- name: envoy.filters.network.http_connection_manager
typed_config:
"@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
codec_type: AUTO
stat_prefix: ingress_http
common_http_protocol_options:
idle_timeout: 300s
route_config:
name: local_route
virtual_hosts:
- name: gitstafette-server
domains:
- "*"
routes:
- match:
prefix: "/"
route:
cluster: gitstafette-server
http_filters:
- name: envoy.filters.http.router
typed_config:
"@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
transport_socket:
name: envoy.transport_sockets.tls
typed_config:
"@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.DownstreamTlsContext
common_tls_context:
tls_certificates:
- certificate_chain:
filename: /etc/envoy/certificates/gsf-fullchain.pem
private_key:
filename: /etc/envoy/certificates/gsf-privkey.pem
Gitstafette Server Service¶
The Gitstafette server has some configuration properties.
First and foremost, the repositories it should listen to for webhooks; any webhook for a repository not in this list will be dropped.
It also has the ports it should listen on: the GRPC and HTTP ports. And last but not least, the certificates for the GRPC TLS configuration. These are self-signed certificates, intended only for my own use.
We also have the environment files: the default and the override. The override will contain the OAuth token used to verify the webhooks from GitHub.
gitstafette-server:
image: ghcr.io/joostvdg/gitstafette/server:0.3.0
command: [
"--repositories=537845873,478599060,758715872,763032882,502306743",
"--grpcPort=50051",
"--port=1323",
"--grpcHealthPort=50051",
"--caFileLocation=/run/secrets/ca.cert",
"--certFileLocation=/run/secrets/server.cert",
"--certKeyFileLocation=/run/secrets/server.key"
]
secrets:
- source: certificate
target: server.cert
uid: "103"
gid: "103"
mode: 0440
- source: certificate-key
target: server.key
uid: "103"
gid: "103"
mode: 0440
- source: ca
target: ca.cert
uid: "103"
gid: "103"
mode: 0440
ports:
- "8080:1323"
- "50051:50051"
env_file:
- path: ./default.env
required: true # default
- path: ./override.env
required: false
deploy:
restart_policy:
condition: unless-stopped
delay: 10s
resources:
limits:
memory: 30M
reservations:
cpus: '0.10'
Full Docker Compose File¶
Full Docker Compose File
services:
cert-copy:
image: bitnami/minideb:latest
restart: unless-stopped
command: ["./etc/copy_script.sh"]
depends_on:
- certbot
configs:
- source: copy_script
target: /etc/copy_script.sh
volumes:
- certbot-certificates:/etc/certbot/certificates:ro
- envoy-certificates:/etc/envoy/certificates:rw
deploy:
restart_policy:
condition: unless-stopped
delay: 60s
resources:
limits:
cpus: '0.15'
memory: 20M
reservations:
cpus: '0.10'
certbot:
image: certbot/dns-route53:arm64v8-v2.11.0
command: [ "certonly", "-v", "--dns-route53", "-d", "events.gitstafette.joostvdg.net", "--email", "joostvdg@gmail.com", "--keep-until-expiring", "--agree-tos", "--non-interactive" ]
volumes:
- certbot-certificates:/etc/letsencrypt
deploy:
restart_policy:
condition: unless-stopped
delay: 60s
resources:
limits:
cpus: '0.15'
memory: 50M
reservations:
cpus: '0.10'
env_file:
- path: ./default.env
required: true # default
- path: ./override.env
required: false
envoy:
image: envoyproxy/envoy:v1.31.0
configs:
- source: envoy_proxy
target: /etc/envoy/envoy-proxy.yaml
uid: "103"
gid: "103"
mode: 0440
command: /usr/local/bin/envoy -c /etc/envoy/envoy-proxy.yaml -l debug
deploy:
restart_policy:
condition: unless-stopped
delay: 10s
depends_on:
- cert-copy
- gitstafette-server
volumes:
- type: volume
source: envoy-certificates
target: /etc/envoy/certificates
ports:
- 443:443
- 8081:8081
- 8082:8082
gitstafette-server:
image: ghcr.io/joostvdg/gitstafette/server:0.3.0
command: [
"--repositories=537845873,478599060,758715872,763032882,502306743",
"--grpcPort=50051",
"--port=1323",
"--grpcHealthPort=50051",
"--caFileLocation=/run/secrets/ca.cert",
"--certFileLocation=/run/secrets/server.cert",
"--certKeyFileLocation=/run/secrets/server.key"
]
secrets:
- source: certificate
target: server.cert
uid: "103"
gid: "103"
mode: 0440
- source: certificate-key
target: server.key
uid: "103"
gid: "103"
mode: 0440
- source: ca
target: ca.cert
uid: "103"
gid: "103"
mode: 0440
ports:
- "8080:1323"
- "50051:50051"
env_file:
- path: ./default.env
required: true # default
- path: ./override.env
required: false
deploy:
restart_policy:
condition: unless-stopped
delay: 10s
resources:
limits:
memory: 30M
reservations:
cpus: '0.10'
secrets:
certificate:
file: ./certs/events-aws.pem
certificate-key:
file: ./certs/events-aws-key.pem
ca:
file: ./certs/ca.pem
configs:
envoy_proxy:
file: ./envoy/envoy.yaml
copy_script:
file: ./scripts/copy_certs.sh
volumes:
certbot-certificates:
envoy-certificates:
networks:
gitstafette:
driver: bridge
enable_ipv6: false
Automation with GitHub Actions¶
Now that we have covered the deployment on AWS, let's discuss automating the deployment using GitHub Actions.
Solution Overview¶
The automation consists of the following components:
- AMI Creation via Packer
- Deployment of AWS resources via Terraform
- Orchestrating the deployment via GitHub Actions
AMI Creation¶
I don't believe I'm doing anything special with the AMI creation.
The steps taken are as follows:
- Retrieve the latest Ubuntu ARM64 AMI
- Install all the packages I need (Docker, Docker Compose, btop, AWS CLI)
- Copy the Docker Compose configuration
- Pull the Docker Compose images
- Export the AMI details via a manifest
Below is the complete example of the Packer configuration.
The manifest.json file will contain the new AMI ID, so we can extract it and use it as a variable in the Terraform configuration.
This way, we ensure we always deploy the latest AMI.
Packer Configuration
packer {
required_plugins {
amazon = {
version = ">= 1.1.1"
source = "github.com/hashicorp/amazon"
}
}
}
source "amazon-ebs" "ubuntu" {
ami_name = "${var.ami_prefix}-${local.date}"
instance_type = "t4g.micro"
region = "eu-central-1"
source_ami_filter {
filters = {
name = "ubuntu/images/*ubuntu-*-24.04-arm64-server-*"
root-device-type = "ebs"
virtualization-type = "hvm"
}
most_recent = true
owners = ["099720109477"]
}
ssh_username = "ubuntu"
}
build {
name = "gitstafette"
sources = [
"source.amazon-ebs.ubuntu"
]
provisioner "shell" {
inline = [
"sudo apt-get update",
"sudo apt-get install -y apt-transport-https ca-certificates curl software-properties-common gnupg lsb-release",
"sudo mkdir -m 0755 -p /etc/apt/keyrings",
"curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg",
"echo \"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable\" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null",
"sudo apt-get update",
"sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin",
"sudo systemctl status docker",
"sudo usermod -aG docker ubuntu",
"docker compose version",
"sudo snap install btop",
"sudo snap install aws-cli --classic",
"aws --version",
"sudo apt upgrade -y",
]
}
provisioner "file" {
source = "../docker-compose"
destination = "/home/ubuntu/gitstafette"
}
provisioner "shell" {
inline = [
"cd /home/ubuntu/gitstafette",
"chmod +x /home/ubuntu/gitstafette/scripts/*.sh",
"sudo su - ubuntu -c 'docker compose version'",
"sudo su - ubuntu -c 'docker compose --project-directory=/home/ubuntu/gitstafette --progress=plain pull '",
]
}
post-processor "manifest" {
output = "manifest.json"
strip_path = true
}
}
locals {
date = formatdate("YYYY-MM-DD-hh-mm", timestamp())
}
variable "ami_prefix" {
type = string
default = "gitstafette-server"
}
To build it and retrieve the AMI ID, we can use the following commands:
packer build aws-ubuntu.pkr.hcl
ami_id=$(cat manifest.json | jq -r '.builds[-1].artifact_id' | cut -d':' -f2)
Terraform Deployment¶
For the Terraform deployment, we need the following:
- S3 Bucket for the Terraform state
- VPC
- Subnet
- Internet Gateway
- Security Group
- Instance with instance_market_options (for the Spot request)
- Route 53 Zone
- Route 53 Record Set
- CloudWatch Alarm
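The state bucket from the first item is wired up through a standard S3 backend block; a minimal sketch with placeholder names:
terraform {
  backend "s3" {
    bucket = "my-terraform-state-bucket" # placeholder
    key    = "gitstafette/terraform.tfstate"
    region = "eu-central-1"
  }
}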
I won't go into much detail on most of the Terraform configuration, as it is pretty standard. I will highlight the EC2 Instance and the CloudWatch Alarm.
EC2 Instance¶
Where applicable, I use variables to make the configuration more flexible. Especially with how I've set up the AMI creation, I want to ensure I can easily change the AMI ID.
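The AMI ID is one of those variables. A minimal sketch of the relevant declarations; the real ones may carry descriptions or different defaults:
variable "ami_id" {
  type = string
}

variable "linux_instance_type" {
  type    = string
  default = "t4g.nano"
}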
For the EC2 Instance, I use the instance_market_options block to request a Spot Instance. I changed the instance_interruption_behavior to stop, so the instance is stopped instead of terminated. This way, I can restart the instance and not lose the data on it. This also requires the spot_instance_type to be set to persistent.
resource "aws_instance" "gistafette" {
ami = var.ami_id
instance_type = var.linux_instance_type
subnet_id = data.aws_subnet.selected.id
vpc_security_group_ids = [aws_security_group.aws-linux-sg.id]
associate_public_ip_address = var.linux_associate_public_ip_address
source_dest_check = false
key_name = var.ssh_key_name
iam_instance_profile = var.ec2_role_name
instance_market_options {
market_type = "spot"
spot_options {
instance_interruption_behavior = "stop"
spot_instance_type = "persistent"
}
}
# root disk
root_block_device {
volume_size = var.linux_root_volume_size
volume_type = var.linux_root_volume_type
delete_on_termination = true
encrypted = true
}
user_data = file("${path.module}/startup.sh")
tags = {
Name = "GSF-BE-Prod"
Environment = "production"
}
}
As discussed, I have a startup script that retrieves the secrets and boots the Docker Compose services in an appropriate order.
Note
I'm a bit paranoid, so I've replaced some values with placeholders.
I'm sure that if you want to replicate the setup, you can figure out what to replace them with.
Here's an excerpt of the startup script:
DNS_ACCESS_KEY=$(aws --region myRegion secretsmanager get-secret-value --secret-id "app/myDnsSecret" --query SecretString --output text | jq .KEY_ID)
DNS_ACCESS_SECRET=$(aws --region myRegion secretsmanager get-secret-value --secret-id "app/myDnsSecret" --query SecretString --output text | jq .KEY)
SENTRY_DSN=$(aws --region myRegion secretsmanager get-secret-value --secret-id "app/mySentryDSN" --query SecretString --output text | jq .DSN)
WEBHOOK_OAUTH_TOKEN=$(aws --region myRegion secretsmanager get-secret-value --secret-id "app/myToken" --query SecretString --output text | jq .TOKEN)
echo "Cleaning up .env file..."
rm -f .override.env
echo "Writing secrets to .env file..."
echo "AWS_ACCESS_KEY_ID=$DNS_ACCESS_KEY" > ./override.env
echo "AWS_SECRET_ACCESS_KEY=$DNS_ACCESS_SECRET" >> ./override.env
echo "SENTRY_DSN=$SENTRY_DSN" >> ./override.env
echo "OAUTH_TOKEN=$WEBHOOK_OAUTH_TOKEN" >> ./override.env
# retrieve GRPC TLS certificates from an S3 bucket
aws s3 cp s3://myresourcesbucket/ca.pem ./certs/ca.pem
aws s3 cp s3://myresourcesbucket/events-aws-key.pem ./certs/events-aws-key.pem
aws s3 cp s3://myresourcesbucket/events-aws.pem ./certs/events-aws.pem
echo "Starting Docker Compose components..."
echo "Starting CertBot..."
docker compose up certbot -d
# It takes a while for CertBot to retrieve the certificates
sleep 15
echo "Starting Cert-Copy..."
docker compose up cert-copy -d
docker compose restart cert-copy
echo "Starting GitStafette..."
docker compose up gitstafette-server -d
# Ensure the GitStafette server is started before Envoy
sleep 5
echo "Starting Envoy..."
docker compose up envoy -d
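The other resource I wanted to highlight is the CloudWatch Alarm that restarts the instance when it burns through too many CPU credits. Below is a sketch of what such an alarm can look like; the metric, period, and threshold are my approximation of the behaviour described earlier, not the exact alarm definition:
resource "aws_cloudwatch_metric_alarm" "cpu_credit_usage" {
  alarm_name          = "gsf-be-prod-cpu-credits" # placeholder
  namespace           = "AWS/EC2"
  metric_name         = "CPUCreditUsage" # credits spent per period on burstable instances
  statistic           = "Sum"
  period              = 300
  evaluation_periods  = 2
  threshold           = 2
  comparison_operator = "GreaterThanThreshold"

  dimensions = {
    InstanceId = aws_instance.gistafette.id
  }

  # Built-in EC2 action: reboot the instance, no SNS topic required.
  alarm_actions = ["arn:aws:automate:eu-central-1:ec2:reboot"]
}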
AWS & GitHub Integration¶
To ensure the deployment is automated, we must integrate the Packer AMI creation and Terraform deployment with an automation server or service.
In this case, I chose GitHub Actions. I have always wanted to learn more about using AWS from GitHub Actions with identity federation rather than fixed credentials, so this was an excellent opportunity to learn more about it.
We need to ensure the following:
- GitHub Actions can access the AWS account and change to the appropriate role
- GitHub Actions can run the Packer build and retrieve the AMI ID
- GitHub Actions can manage the Terraform state (S3 bucket)
- GitHub Actions can deploy the Terraform configuration
For this, we need to set up the following:
- Create an OpenID Connect (OIDC) Identity Provider (IdP) in AWS
- Create a role in AWS that trusts the OIDC IdP
- Create policies that allow the role to perform the necessary actions
The first step is well documented by AWS, so I won't go into the details here.
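For reference, the OIDC IdP and the role's trust relationship can be expressed in Terraform roughly as follows. This is a sketch, not necessarily how I created mine, and the repository filter is an assumption you would adjust to your own repository:
resource "aws_iam_openid_connect_provider" "github" {
  url             = "https://token.actions.githubusercontent.com"
  client_id_list  = ["sts.amazonaws.com"]
  thumbprint_list = ["6938fd4d98bab03faadb97b34396831e3780aea1"]
}

resource "aws_iam_role" "github_actions" {
  name = "GitHubAction-Gitstafette" # placeholder
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRoleWithWebIdentity"
      Principal = { Federated = aws_iam_openid_connect_provider.github.arn }
      Condition = {
        StringEquals = { "token.actions.githubusercontent.com:aud" = "sts.amazonaws.com" }
        # Only workflows from this repository may assume the role (assumption).
        StringLike = { "token.actions.githubusercontent.com:sub" = "repo:joostvdg/gitstafette:*" }
      }
    }]
  })
}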
In terms of the policies, we need to ensure the role has the following permissions:
- Assume the role
- Create the EC2 instance (see Hashicorp Packer Developer docs)
- Create the Route 53 record set
- Create the CloudWatch Alarm
- Create an EC2 instance profile
- Read from the app resources S3 bucket
- Read and write to the Terraform state S3 bucket
- Read the secrets from Secrets Manager with a specific prefix
Danger
As stated before, I am not a security expert.
Nor am I an AWS IAM expert.
So, take the following policies with a grain of salt.
Below, I'll show some examples of the policies.
Manage CloudWatch Alarms
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "VisualEditor0",
"Effect": "Allow",
"Action": [
"cloudwatch:PutMetricAlarm",
"cloudwatch:TagResource",
"cloudwatch:DescribeAlarmHistory",
"cloudwatch:UntagResource",
"cloudwatch:EnableAlarmActions",
"cloudwatch:DeleteAlarms",
"cloudwatch:DisableAlarmActions",
"cloudwatch:ListTagsForResource",
"cloudwatch:DescribeAlarms",
"cloudwatch:SetAlarmState",
"cloudwatch:PutCompositeAlarm"
],
"Resource": [
"arn:aws:cloudwatch:eu-central-1:XXXXX:slo/*",
"arn:aws:cloudwatch:eu-central-1:XXXXX:insight-rule/*",
"arn:aws:cloudwatch:eu-central-1:XXXXX:alarm:*"
]
},
{
"Sid": "VisualEditor1",
"Effect": "Allow",
"Action": [
"cloudwatch:GenerateQuery",
"cloudwatch:GetMetricData",
"cloudwatch:DescribeAlarmsForMetric",
"cloudwatch:GetMetricStatistics",
"cloudwatch:GetMetricWidgetImage",
"cloudwatch:ListMetrics",
"cloudwatch:ListServices"
],
"Resource": "*"
}
]
}
Manage Instance Profile
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "VisualEditor0",
"Effect": "Allow",
"Action": [
"iam:CreateInstanceProfile",
"iam:DeleteInstanceProfile",
"iam:ListInstanceProfilesForRole",
"iam:PassRole",
"iam:GetInstanceProfile",
"iam:RemoveRoleFromInstanceProfile",
"iam:AddRoleToInstanceProfile",
"iam:GetRole"
],
"Resource": [
"arn:aws:iam::853805194132:instance-profile/*",
"arn:aws:iam::853805194132:role/*"
]
},
{
"Sid": "VisualEditor1",
"Effect": "Allow",
"Action": "iam:ListInstanceProfiles",
"Resource": "*"
}
]
}
Manage Hosted Zone
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "VisualEditor0",
"Effect": "Allow",
"Action": [
"route53:GetChange",
"route53:ListHostedZones",
"route53:ListHostedZonesByName"
],
"Resource": "*"
},
{
"Sid": "VisualEditor1",
"Effect": "Allow",
"Action": [
"route53:ListTagsForResources",
"route53:GetHostedZone",
"route53:ChangeResourceRecordSets",
"route53:ChangeTagsForResource",
"route53:ListResourceRecordSets",
"route53:ListTagsForResource"
],
"Resource": [
"arn:aws:route53:::hostedzone/XXXXXXXXXXXXXXXX"
]
}
]
}
Manage Terraform State bucket
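What the role needs for the Terraform state comes down to list access on the bucket and get/put/delete on the state objects. A sketch, expressed as a Terraform-managed policy with a placeholder bucket name; add DynamoDB permissions if you use state locking:
resource "aws_iam_policy" "terraform_state" {
  name = "gitstafette-terraform-state" # placeholder
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = ["s3:ListBucket"]
        Resource = "arn:aws:s3:::my-terraform-state-bucket"
      },
      {
        Effect   = "Allow"
        Action   = ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"]
        Resource = "arn:aws:s3:::my-terraform-state-bucket/*"
      }
    ]
  })
}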
GitHub Actions Workflow¶
The GitHub Actions Workflow does the following:
- Checks out the repository
- Sets up AWS credentials
- Uses Hashicorp Packer to build the AMI
- Uses Terraform to deploy the resources
The workflow is not triggered by a push to the repository but by a manual trigger. At this point, I do not need to rebuild the AMI and redeploy the resources automatically.
What I have run into, though, is failing to keep the instance up to date: at some point, I could not update it at all, as the referenced Ubuntu packages were no longer available. The goal is to rebuild and redeploy an updated instance periodically, or whenever I change the Docker Compose configuration.
As a starting point, I reviewed the Hashicorp Developer documentation on GitHub Actions.
They provide GitHub Actions we can use to prepare the environment for Packer and Terraform: for Packer, this is Setup HashiCorp Packer, and for Terraform, this is HashiCorp - Setup Terraform.
We need to set up the AWS credentials the HashiCorp tools need to use. As described before, we use the OIDC IdP and the role that trusts it.
To set up the AWS credentials, we must use the aws-actions/configure-aws-credentials action.
It ensures the credentials the HashiCorp tools need are available in a form that they automatically pick up.
I've split the workflow into two jobs, one for Packer and one for Terraform.
Let's look at the top of the workflow file:
on: workflow_dispatch
env:
AWS_REGION: "eu-central-1"
# Permission can be added at job level or workflow level
permissions:
id-token: write # This is required for requesting the JWT
contents: read # This is required for actions/checkout
We must add some permissions so the workflow can use the OIDC IdP.
Then comes the Packer job. This job ensures the Packer configuration is valid, builds the AMI, and exports the AMI ID to an output.
jobs:
packer-build:
runs-on: ubuntu-latest
name: Run Packer
outputs:
ami_id: ${{ steps.build.outputs.ami_id }}
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Setup `packer`
uses: hashicorp/setup-packer@main
id: setup
with:
version: "latest"
- name: Run `packer init`
id: init
working-directory: ./aws/packer
run: "packer init ./aws-ubuntu.pkr.hcl"
- name: Run `packer validate`
id: validate
working-directory: ./aws/packer
run: "packer validate ./aws-ubuntu.pkr.hcl"
- name: configure aws credentials
uses: aws-actions/configure-aws-credentials@v1.7.0
with:
role-to-assume: arn:aws:iam::XXXXX:role/GitHubAction-XXXX #change to reflect your IAM role’s ARN
role-session-name: GitHub_to_AWS_via_FederatedOIDC
aws-region: ${{ env.AWS_REGION }}
- name: Run `packer build`
id: build
working-directory: ./aws/packer
run: |
packer build ./aws-ubuntu.pkr.hcl
ami_id=$(cat manifest.json | jq -r '.builds[-1].artifact_id' | cut -d':' -f2)
echo "ami_id=${ami_id}" | tee -a $GITHUB_OUTPUT
We can then use that AMI output in the Terraform job:
jobs: # duplicate, but makes the example look better
terraform-build:
runs-on: ubuntu-latest
name: Gitstafette AWS VM Rebuild
needs: packer-build
steps:
- id: tf-checkout
name: Checkout code for TF
uses: actions/checkout@v4
- id: tf-aws-creds
name: Configure AWS credentials for Terraform
uses: aws-actions/configure-aws-credentials@v1.7.0
with:
role-to-assume: arn:aws:iam::XXXXX:role/GitHubAction-XXXX #change to reflect your IAM role’s ARN
role-session-name: GitHub_to_AWS_via_FederatedOIDC
aws-region: ${{ env.AWS_REGION }}
- id: tf-setup
name: Setup Terraform
uses: hashicorp/setup-terraform@v3
- id: tf-init
name: Terraform Init
working-directory: ./aws/terraform
run: terraform init
- id: tf-validate
name: Terraform Validate
working-directory: ./aws/terraform
run: terraform validate -no-color
- id: tf-plan
name: Terraform Plan
continue-on-error: true # because some exit codes are not 0, even if they are just warnings/informative
env:
AMI_ID: ${{ needs.packer-build.outputs.ami_id }}
working-directory: ./aws/terraform
run: |
echo "AMI_ID=${AMI_ID}"
export TF_VAR_ami_id=${AMI_ID}
terraform plan \
-var "ami_id=${AMI_ID}" \
-no-color -out plan.out \
-input=false
- name: Terraform apply
working-directory: ./aws/terraform
run: terraform apply -auto-approve -input=false "plan.out"
I have seen some examples of people creating a PR with the Terraform changes rather than directly applying them.
That would be safer, as you can confirm that the changes are as desired and even run a cost estimate.
For my use case, that is overkill, so I just apply the changes directly.
Keeping An Eye On Things¶
A bit of bonus content.
In addition to the tools that AWS provides to monitor the resources, I also use a few other tools to keep an eye on things.
External Monitoring¶
I use Cronitor for the external-facing components. It is free for up to five monitors; I don't need more.
It runs a health check on the webhook endpoint and monitors both uptime and response time. For alerting, I have a personal Slack workspace where I receive the alerts.
This way, I can also track when the instance is restarted, either because of the Spot instance interruption or for some other reason.
Internal Monitoring¶
For the internal components, I use Sentry.
I've set up the Gitstafette server to send errors to Sentry. This way, I can keep track of any errors that occur in the application.
The application isn't critical, and there really isn't any need to have full-scale observability set up. So Sentry is more than enough for my needs.
In the event of unexpected errors, I get an email notification. The Sentry events capture a lot of information, so I can easily see what went wrong, when, and which user interaction triggered the error.
Conclusion¶
In this post, I've shown how to deploy a Go application with multiple protocols on AWS using Docker Compose and automate the deployment with Packer, Terraform, and GitHub Actions.