Gitstafette Server Deployment¶
In this post, we examine the deployment of the Gitstafette server.
We cover the deployment on Google Cloud Platform (GCP) and Amazon Web Services (AWS). After describing the deployment target, we dive into the deployment automation (on AWS) using GitHub Actions.
What is the Gitstafette Server?¶
Earlier this year, I wrote about bringing Webhooks into your Homelab. The Gitstafette Server is the server-side component of the Gitstafette application.
The Gitstafette project is a way to relay webhooks from one service to another through a secure connection.
The first deployment of the Gitstafette server was on Google Cloud Platform (GCP) using Google Cloud Run.
Unfortunately, the way I wanted the server to work ran into the limitations of Cloud Run. Since the primary reason for creating the Gitstafette application is for me to learn certain technologies, I moved the deployment elsewhere rather than work around those limitations.
However, I want to share my experience of deployment on GCP as it might be helpful to others.
After that, we'll discuss the deployment on AWS. The deployment on AWS is done using an EC2 instance and tools such as Packer, Terraform, and GitHub Actions.
Finally, we'll discuss the automation of the deployment using GitHub Actions.
Deployment on GCP¶
In the past, I created an application to generate fair maps for the board game Settlers of Catan (Catan Map Generator, or CMG). At the time, the frontend was deployed on Heroku and the backend on Google Cloud Run.
So, my first instinct was to also deploy the Gitstafette server on Google Cloud Run. The Gitstafette server is a stateless Go application and an excellent candidate for serverless deployment.
It also means freeing myself of the responsibility of managing the server, the endpoint, and, last but not least, the HTTPS certificate, a requirement for receiving webhooks from GitHub.
GCP Cloud Run Challenges¶
To explain the challenges I faced, I need to explain how the Gitstafette server works.
The server has two endpoints:
- Webhook Listener: a RESTful (JSON over HTTP) endpoint for receiving webhooks
- Relay GRPC Stream: a GRPC endpoint for streaming the webhooks to the relay client(s)
For more details, refer to the earlier post on Webhooks into your Homelab.
This results in the following characteristics: the server needs two exposed ports, and one port (the GRPC stream) must handle long-lived connections.
Unfortunately, Cloud Run only supports a single port and does not support long-lived connections.
Cloud Run is based upon Knative, which relies on Envoy Proxy for routing. While we can give Knative some hints, we cannot control the Envoy Proxy configuration.
The two ports issue alone could have been a reason to look for another solution, but the long-lived connection issue was a deal-breaker.
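To make the single-port constraint concrete: if you were to describe a Cloud Run service in Terraform, the container could expose exactly one port, and the request timeout would top out at 60 minutes. The snippet below is a hypothetical sketch using the google provider's v2 resource, not my actual configuration at the time:
resource "google_cloud_run_v2_service" "gitstafette" {
  name     = "gitstafette-server"
  location = "europe-west4" # placeholder region

  template {
    containers {
      image = "ghcr.io/joostvdg/gitstafette/server:0.3.0"

      # Cloud Run exposes exactly one container port,
      # so there is no room for both the HTTP listener and the GRPC stream.
      ports {
        container_port = 1323
      }
    }

    # Even at the maximum of one hour, every request, and thus every GRPC stream, gets cut off.
    timeout = "3600s"
  }
}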
Info
I briefly toyed with the idea of setting up two relay servers.
One for the webhook listener and one for the relay stream.
The relay stream would expose the GRPC stream while connecting "internally" to the webhook listener.
This worked, but it could have been better.
It is, however, the reason the server can also relay webhooks to another server.
Timeout Issues GCP Cloud Run¶
The long-lived connection issue surfaced as a timeout issue: the GRPC stream closes after a fixed period. This period is not configurable and is too short for GRPC streaming, which is precisely one of the things I want to learn about.
After exploring possible causes, I found that the culprit is the default timeout of the Envoy Proxy. This timeout is not configurable either, as the Envoy Proxy configuration is abstracted away by Knative, the technology behind Cloud Run.
I found a workaround by making the streaming duration and timeout configurable in the application, so it can handle different deployment environments, such as Cloud Run, Cloud Functions, or Kubernetes.
Ultimately, I realized I was fighting the platform and moving further and further away from the project's original goal. So, I would need to change either the project or the platform. I chose the latter.
Info
This is why the application lets you configure how long it keeps the stream alive.
This is also why it automatically reconnects when a stream is closed without an explicit error.
So, this was the end of the deployment on GCP.
Deployment on AWS¶
Around this time, three things were happening in the broader ecosystem:
- Heroku announced they were going to deprecate the free tier for hobby projects
- There was more and more support for ARM-based software
- AWS offered a cheap ARM-based instance (t4g.nano) costing about $5 a month
At the time, I was building my homelab out of Raspberry Pis and was interested in ARM-based software. I had also read several blog posts from Honeycomb.io on their ARM adoption.
So, I moved the Gitstafette server to AWS, onto a t4g.nano instance.
Solution Overview¶
Unlike the Cloud Run deployment, I wanted to understand the infrastructure required and take control of it. So, I chose to forego the serverless options, like Lambda or Fargate, and go for an EC2 instance.
Now, an EC2 instance alone is not nearly enough to make the application accessible to the outside world, and definitely not in a secure way.
So we have an EC2 instance within a VPC, with a security group that allows traffic on the ports we need: 443 (HTTPS) and 50051 (GRPC). It also opens port 22 (SSH), but only for my IP address, so I can connect to the instance if needed.
The instance is accessible because it has a public IP address (EIP) and a DNS name (Route 53).
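In Terraform, that security group boils down to three ingress rules. The snippet below is a simplified sketch rather than my exact resource; the name, the VPC reference, and the SSH CIDR are placeholders:
resource "aws_security_group" "aws-linux-sg" {
  name   = "gitstafette-server-sg" # placeholder name
  vpc_id = aws_vpc.main.id         # assumes the VPC is defined or looked up elsewhere

  ingress {
    description = "HTTPS webhook listener (terminated by Envoy)"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  ingress {
    description = "GRPC relay stream"
    from_port   = 50051
    to_port     = 50051
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  ingress {
    description = "SSH, restricted to my own IP"
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["198.51.100.10/32"] # placeholder
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}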
To save costs, I used Spot Instances. It does make the application less reliable, but the Gitstafette server is not a critical application and is mostly stateless. It caches webhooks for some time, but assuming your client is connected most of the time, it should not be a problem.
I use a CloudWatch Alarm to restart the instance in case of trouble (elaborated on later).
I create an AMI for the instance to ensure everything is set up correctly. This AMI is built using HashiCorp Packer.
On this instance, the application runs as a container image via Docker Compose. We'll dive into the Docker Compose setup later.
Challenges¶
As always, there were some challenges to overcome. As these challenges influence the design, let's explore them in more detail.
First Challenge - Public Web Access¶
First, the web entry point must be protected. I'm not a fan of exposing applications directly to the internet.
Not every application is built with security in mind, and I'm not a security expert.
So, I put a reverse proxy in front of the application. This reverse proxy handles HTTPS and forwards the traffic to the application.
There are many options for reverse proxies, but I chose Envoy Proxy, mostly because I explicitly wanted to learn more about Envoy, as it is used by Knative and Istio. My then-employer also used both projects, so it was a good investment.
Second Challenge - Trusted Certificates¶
The second challenge was to have a valid, trusted certificate. The webhooks are sent from GitHub; thus, GitHub needs to trust this certificate. In GitHub's webhook configuration, we cannot add a self-signed certificate. A good choice for automating certificates without cost is Let's Encrypt.
As I planned to use Docker Compose to manage the processes on the VM, I looked for a containerized solution for Let's Encrypt. This is where Certbot comes in. Certbot has a Docker image supporting the Route 53 DNS challenge. Assuming we can safely get the Route 53 credentials to the instance, we can automate the certificate renewal.
Third Challenge - Certificate Storage¶
And this is where we hit the third challenge: Certbot retrieves the certificates, but Envoy then needs to use them.
Envoy expects these certificates at a pre-defined path, and the permissions must be set correctly.
So, I created a small script that copies the certificates to the correct location and sets the permissions. It uses two volumes: one where Certbot stores the certificates and one where Envoy reads them from.
Fourth Challenge - Instance Hangs¶
The fourth challenge was that the instance sometimes hangs. I have yet to find the actual cause, but it seems to happen when the instance, a burstable t4g.nano, consumes more than 2 CPU credits.
When that happens, it becomes totally unresponsive, and the only way to get it back is to restart it. To automate this, I created a CloudWatch Alarm that restarts the instance once it consumes more than 2 CPU credits for a certain period.
Fifth Challenge - Fast Startup¶
The fifth challenge was ensuring a fast startup. The instance is a Spot Instance, so it can be terminated at any time, and due to the hanging issue, the CloudWatch Alarm might restart it.
So, we can expect the instance to be restarted regularly. To keep startup fast, I preload everything during the Packer AMI build: the Docker Compose service is configured to start at boot, and the container images used in the Docker Compose config are pulled during the build, so they do not have to be downloaded when the instance starts.
Sixth Challenge - Handling Secrets¶
Last but not least, we need a way of handling secrets. The Route 53 credentials, the certificates for the GRPC port, and any tokens for the Gitstafette server. I used AWS Secrets Manager for this. The secrets are stored in Secrets Manager and are retrieved by the instance during startup. This means the instance has a policy that allows it to retrieve the secrets from Secrets Manager with a specific prefix.
In Terraform, we load a startup script that retrieves the secrets and writes them to an override .env file, overriding the defaults used by Docker Compose.
Instance Profile and Policies¶
We need to create an instance profile and policies to ensure the instance can retrieve the secrets from Secrets Manager.
We create the policies and attach them to the instance profile.
Here's an example of a policy that reads files from an S3 bucket (in this case, the GRPC TLS certificates):
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "VisualEditor0",
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::XXXX",
"arn:aws:s3:::XXXX/*"
]
}
]
}
And the policy for reading secrets from Secrets Manager:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"secretsmanager:GetSecretValue",
"secretsmanager:DescribeSecret"
],
"Resource": [
"arn:aws:secretsmanager:eu-central-1:XXXX:secret:XXXX/*"
]
}
]
}
We create a role, attach the policies to it, and add the role to the instance profile.
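If you manage this part in Terraform, it could look roughly like the sketch below. The resource names are placeholders, and the aws_iam_policy resources are assumed to wrap the JSON documents shown above:
resource "aws_iam_role" "gitstafette_instance" {
  name = "gitstafette-instance-role" # placeholder
  # Only EC2 may assume this role.
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "ec2.amazonaws.com" }
    }]
  })
}

resource "aws_iam_role_policy_attachment" "read_certificates_bucket" {
  role       = aws_iam_role.gitstafette_instance.name
  policy_arn = aws_iam_policy.read_certificates_bucket.arn # the S3 policy above
}

resource "aws_iam_role_policy_attachment" "read_secrets" {
  role       = aws_iam_role.gitstafette_instance.name
  policy_arn = aws_iam_policy.read_secrets.arn # the Secrets Manager policy above
}

resource "aws_iam_instance_profile" "gitstafette" {
  name = "gitstafette-instance-profile" # placeholder, referenced by the instance later
  role = aws_iam_role.gitstafette_instance.name
}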
For more details on how to create these policies and the permissions required, see the resources below:
Danger
I am not an AWS guru and definitely not a security expert.
So, I do not recommend using my policies.
I've decided not to include all policies to prevent anyone from using them.
Instead, the resources above should give you a good starting point.
Docker Compose Setup¶
The Docker Compose setup is pretty straightforward.
We have the following services:
- Certbot
- Cert Copy
- Envoy
- Gitstafette Server
The Certbot service retrieves the certificates from Let's Encrypt.
The Cert Copy service copies the certificates to the correct location and sets the permissions.
The Envoy service is the reverse proxy that handles the HTTPS traffic.
The Gitstafette Server service is the actual Gitstafette server that listens to webhooks and relays them via a GRPC port.
Certbot Service¶
The Certbot service is a containerized version of Certbot.
It uses the Route 53 DNS challenge to retrieve the certificates.
The certificates are stored in a shared volume with the Cert Copy service.
It has a default environment file and an override environment file. The override is optional but should be created upon instance startup and populated with the current Route 53 credentials.
certbot:
image: certbot/dns-route53:arm64v8-v2.11.0
command: [ "certonly", "-v", "--dns-route53", "-d", "events.gitstafette.joostvdg.net", "--email", "joostvdg@gmail.com", "--keep-until-expiring", "--agree-tos", "--non-interactive" ]
volumes:
- certbot-certificates:/etc/letsencrypt
deploy:
restart_policy:
condition: unless-stopped
delay: 60s
resources:
limits:
cpus: '0.15'
memory: 50M
reservations:
cpus: '0.10'
env_file:
- path: ./default.env
required: true # default
- path: ./override.env
required: false
Assuming we don't need to renew the certificates too often, we don't need to restart the service immediately, so we set the restart delay to 60 seconds.
Cert Copy Service¶
The Cert Copy service is a small container that copies the certificates to the correct location and sets the permissions.
cert-copy:
image: bitnami/minideb:latest
restart: unless-stopped
command: ["./etc/copy_script.sh"]
depends_on:
- certbot
configs:
- source: copy_script
target: /etc/copy_script.sh
volumes:
- certbot-certificates:/etc/certbot/certificates:ro
- envoy-certificates:/etc/envoy/certificates:rw
deploy:
restart_policy:
condition: unless-stopped
delay: 60s
resources:
limits:
cpus: '0.15'
memory: 20M
reservations:
cpus: '0.10'
We don't want the Cert Copy service to run too often, so we let it sleep for an hour after copying the certificates. We set the restart to a delay of 60 seconds, so we're less likely to hammer the instance in case of trouble.
Cert Copy Script
#!/bin/bash
echo "> Reading source location"
echo "-----------------------------------------"
echo "-----------------------------------------"
echo " > GSF Cert Location"
ls -lath /etc/certbot/certificates/live/events.gitstafette.joostvdg.net
echo "-----------------------------------------"
echo "-----------------------------------------"
echo "> Copy GSF Certs to target location"
cp /etc/certbot/certificates/live/events.gitstafette.joostvdg.net/fullchain.pem /etc/envoy/certificates/gsf-fullchain.pem
cp /etc/certbot/certificates/live/events.gitstafette.joostvdg.net/cert.pem /etc/envoy/certificates/gsf-cert.pem
cp /etc/certbot/certificates/live/events.gitstafette.joostvdg.net/privkey.pem /etc/envoy/certificates/gsf-privkey.pem
echo "> Reading target location"
ls -lath /etc/envoy/certificates
echo "> Set Cert permissions"
chmod 0444 /etc/envoy/certificates/gsf-fullchain.pem
chmod 0444 /etc/envoy/certificates/gsf-cert.pem
chmod 0444 /etc/envoy/certificates/gsf-privkey.pem
echo "> Sleeping for 1 hour"
sleep 3600
Envoy Service¶
The Envoy service is the reverse proxy that handles the HTTPS traffic.
envoy:
image: envoyproxy/envoy:v1.31.0
configs:
- source: envoy_proxy
target: /etc/envoy/envoy-proxy.yaml
uid: "103"
gid: "103"
mode: 0440
command: /usr/local/bin/envoy -c /etc/envoy/envoy-proxy.yaml -l debug
deploy:
restart_policy:
condition: unless-stopped
delay: 10s
depends_on:
- cert-copy
- gitstafette-server
volumes:
- type: volume
source: envoy-certificates
target: /etc/envoy/certificates
ports:
- 443:443
- 8081:8081
I won't discuss the Envoy configuration in too much depth, as it is somewhat out of scope for this post. Essentially, we have a listener on port 443, with a filter chain that filters on the domain name.
The filter chain has a transport socket that uses the certificates we copied earlier. The listener forwards the traffic to the Gitstafette Server service.
Envoy Filter Chain Config
- address:
socket_address:
address: 0.0.0.0
port_value: 443
listener_filters:
- name: "envoy.filters.listener.tls_inspector"
typed_config:
"@type": type.googleapis.com/envoy.extensions.filters.listener.tls_inspector.v3.TlsInspector
filter_chains:
- filter_chain_match:
server_names: ["events.gitstafette.joostvdg.net"]
filters:
- name: envoy.filters.network.http_connection_manager
typed_config:
"@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
codec_type: AUTO
stat_prefix: ingress_http
common_http_protocol_options:
idle_timeout: 300s
route_config:
name: local_route
virtual_hosts:
- name: gitstafette-server
domains:
- "*"
routes:
- match:
prefix: "/"
route:
cluster: gitstafette-server
http_filters:
- name: envoy.filters.http.router
typed_config:
"@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
transport_socket:
name: envoy.transport_sockets.tls
typed_config:
"@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.DownstreamTlsContext
common_tls_context:
tls_certificates:
- certificate_chain:
filename: /etc/envoy/certificates/gsf-fullchain.pem
private_key:
filename: /etc/envoy/certificates/gsf-privkey.pem
Gitstafette Server Service¶
The Gitstafette server has some configuration properties.
First and foremost, the repositories it should listen to for webhooks; any webhook for a repository not in this list will be dropped.
It also has the ports it should listen on: the GRPC and HTTP ports. And last but not least, the certificates for the GRPC TLS configuration. These are self-signed certificates, intended only for my own use.
We also have the environment files: the default and the override. The override will contain the OAuth token used to verify the webhooks from GitHub.
gitstafette-server:
image: ghcr.io/joostvdg/gitstafette/server:0.3.0
command: [
"--repositories=537845873,478599060,758715872,763032882,502306743",
"--grpcPort=50051",
"--port=1323",
"--grpcHealthPort=50051",
"--caFileLocation=/run/secrets/ca.cert",
"--certFileLocation=/run/secrets/server.cert",
"--certKeyFileLocation=/run/secrets/server.key"
]
secrets:
- source: certificate
target: server.cert
uid: "103"
gid: "103"
mode: 0440
- source: certificate-key
target: server.key
uid: "103"
gid: "103"
mode: 0440
- source: ca
target: ca.cert
uid: "103"
gid: "103"
mode: 0440
ports:
- "8080:1323"
- "50051:50051"
env_file:
- path: ./default.env
required: true # default
- path: ./override.env
required: false
deploy:
restart_policy:
condition: unless-stopped
delay: 10s
resources:
limits:
memory: 30M
reservations:
cpus: '0.10'
Full Docker Compose File¶
Full Docker Compose File
services:
cert-copy:
image: bitnami/minideb:latest
restart: unless-stopped
command: ["./etc/copy_script.sh"]
depends_on:
- certbot
configs:
- source: copy_script
target: /etc/copy_script.sh
volumes:
- certbot-certificates:/etc/certbot/certificates:ro
- envoy-certificates:/etc/envoy/certificates:rw
deploy:
restart_policy:
condition: unless-stopped
delay: 60s
resources:
limits:
cpus: '0.15'
memory: 20M
reservations:
cpus: '0.10'
certbot:
image: certbot/dns-route53:arm64v8-v2.11.0
command: [ "certonly", "-v", "--dns-route53", "-d", "events.gitstafette.joostvdg.net", "--email", "joostvdg@gmail.com", "--keep-until-expiring", "--agree-tos", "--non-interactive" ]
volumes:
- certbot-certificates:/etc/letsencrypt
deploy:
restart_policy:
condition: unless-stopped
delay: 60s
resources:
limits:
cpus: '0.15'
memory: 50M
reservations:
cpus: '0.10'
env_file:
- path: ./default.env
required: true # default
- path: ./override.env
required: false
envoy:
image: envoyproxy/envoy:v1.31.0
configs:
- source: envoy_proxy
target: /etc/envoy/envoy-proxy.yaml
uid: "103"
gid: "103"
mode: 0440
command: /usr/local/bin/envoy -c /etc/envoy/envoy-proxy.yaml -l debug
deploy:
restart_policy:
condition: unless-stopped
delay: 10s
depends_on:
- cert-copy
- gitstafette-server
volumes:
- type: volume
source: envoy-certificates
target: /etc/envoy/certificates
ports:
- 443:443
- 8081:8081
- 8082:8082
gitstafette-server:
image: ghcr.io/joostvdg/gitstafette/server:0.3.0
command: [
"--repositories=537845873,478599060,758715872,763032882,502306743",
"--grpcPort=50051",
"--port=1323",
"--grpcHealthPort=50051",
"--caFileLocation=/run/secrets/ca.cert",
"--certFileLocation=/run/secrets/server.cert",
"--certKeyFileLocation=/run/secrets/server.key"
]
secrets:
- source: certificate
target: server.cert
uid: "103"
gid: "103"
mode: 0440
- source: certificate-key
target: server.key
uid: "103"
gid: "103"
mode: 0440
- source: ca
target: ca.cert
uid: "103"
gid: "103"
mode: 0440
ports:
- "8080:1323"
- "50051:50051"
env_file:
- path: ./default.env
required: true # default
- path: ./override.env
required: false
deploy:
restart_policy:
condition: unless-stopped
delay: 10s
resources:
limits:
memory: 30M
reservations:
cpus: '0.10'
secrets:
certificate:
file: ./certs/events-aws.pem
certificate-key:
file: ./certs/events-aws-key.pem
ca:
file: ./certs/ca.pem
configs:
envoy_proxy:
file: ./envoy/envoy.yaml
copy_script:
file: ./scripts/copy_certs.sh
volumes:
certbot-certificates:
envoy-certificates:
networks:
gitstafette:
driver: bridge
enable_ipv6: false
Automation with GitHub Actions¶
Now that we have covered the deployment on AWS, let's discuss automating the deployment using GitHub Actions.
Solution Overview¶
The automation consists of the following components:
- AMI Creation via Packer
- Deployment of AWS resources via Terraform
- Orchestrating the deployment via GitHub Actions
AMI Creation¶
I don't believe I'm doing anything special with the AMI creation.
The steps taken are as follows:
- Retrieve the latest Ubuntu ARM64 AMI
- Install all the packages I need (Docker, Docker Compose, btop, AWS CLI)
- Copy the Docker Compose configuration
- Pull the Docker Compose images
- Export the AMI details via a manifest
Below is the complete example of the Packer configuration.
The manifest.json file will contain the new AMI ID, so we can extract it and use it as a variable in the Terraform configuration.
This way, we ensure we always deploy the latest AMI.
Packer Configuration
packer {
required_plugins {
amazon = {
version = ">= 1.1.1"
source = "github.com/hashicorp/amazon"
}
}
}
source "amazon-ebs" "ubuntu" {
ami_name = "${var.ami_prefix}-${local.date}"
instance_type = "t4g.micro"
region = "eu-central-1"
source_ami_filter {
filters = {
name = "ubuntu/images/*ubuntu-*-24.04-arm64-server-*"
root-device-type = "ebs"
virtualization-type = "hvm"
}
most_recent = true
owners = ["099720109477"]
}
ssh_username = "ubuntu"
}
build {
name = "gitstafette"
sources = [
"source.amazon-ebs.ubuntu"
]
provisioner "shell" {
inline = [
"sudo apt-get update",
"sudo apt-get install -y apt-transport-https ca-certificates curl software-properties-common gnupg lsb-release",
"sudo mkdir -m 0755 -p /etc/apt/keyrings",
"curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg",
"echo \"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable\" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null",
"sudo apt-get update",
"sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin",
"sudo systemctl status docker",
"sudo usermod -aG docker ubuntu",
"docker compose version",
"sudo snap install btop",
"sudo snap install aws-cli --classic",
"aws --version",
"sudo apt upgrade -y",
]
}
provisioner "file" {
source = "../docker-compose"
destination = "/home/ubuntu/gitstafette"
}
provisioner "shell" {
inline = [
"cd /home/ubuntu/gitstafette",
"chmod +x /home/ubuntu/gitstafette/scripts/*.sh",
"sudo su - ubuntu -c 'docker compose version'",
"sudo su - ubuntu -c 'docker compose --project-directory=/home/ubuntu/gitstafette --progress=plain pull '",
]
}
post-processor "manifest" {
output = "manifest.json"
strip_path = true
}
}
locals {
date = formatdate("YYYY-MM-DD-hh-mm", timestamp())
}
variable "ami_prefix" {
type = string
default = "gitstafette-server"
}
To build it and retrieve the AMI ID, we can use the following commands:
packer build aws-ubuntu.pkr.hcl
ami_id=$(cat manifest.json | jq -r '.builds[-1].artifact_id' | cut -d':' -f2)
Terraform Deployment¶
For the Terraform deployment, we need the following:
- S3 Bucket for the Terraform state
- VPC
- Subnet
- Internet Gateway
- Security Group
- Instance with instance_market_options (for the Spot request)
- Route 53 Zone
- Route 53 Record Set
- CloudWatch Alarm
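The state bucket from the first item is wired up through a standard S3 backend block; a minimal sketch with placeholder names:
terraform {
  backend "s3" {
    bucket = "my-terraform-state-bucket" # placeholder
    key    = "gitstafette/terraform.tfstate"
    region = "eu-central-1"
  }
}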
I won't go into much detail on most of the Terraform configuration, as it is pretty standard. I will highlight the EC2 Instance and the CloudWatch Alarm.
EC2 Instance¶
Where applicable, I use variables to make the configuration more flexible. Especially with how I've set up the AMI creation, I want to ensure I can easily change the AMI ID.
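The AMI ID is one of those variables. A minimal sketch of the relevant declarations; the real ones may carry descriptions or different defaults:
variable "ami_id" {
  type = string
}

variable "linux_instance_type" {
  type    = string
  default = "t4g.nano"
}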
For the EC2 Instance, I use the instance_market_options block to request a Spot Instance. I changed the instance_interruption_behavior to stop, so the instance is stopped instead of terminated. This way, I can restart the instance and not lose the data on it. This also requires the spot_instance_type to be set to persistent.
resource "aws_instance" "gistafette" {
ami = var.ami_id
instance_type = var.linux_instance_type
subnet_id = data.aws_subnet.selected.id
vpc_security_group_ids = [aws_security_group.aws-linux-sg.id]
associate_public_ip_address = var.linux_associate_public_ip_address
source_dest_check = false
key_name = var.ssh_key_name
iam_instance_profile = var.ec2_role_name
instance_market_options {
market_type = "spot"
spot_options {
instance_interruption_behavior = "stop"
spot_instance_type = "persistent"
}
}
# root disk
root_block_device {
volume_size = var.linux_root_volume_size
volume_type = var.linux_root_volume_type
delete_on_termination = true
encrypted = true
}
user_data = file("${path.module}/startup.sh")
tags = {
Name = "GSF-BE-Prod"
Environment = "production"
}
}
As discussed, I have a startup script that retrieves the secrets and boots the Docker Compose services in an appropriate order.
Note
I'm a bit paranoid, so I've replaced some values with placeholders.
I'm sure that if you want to replicate the setup, you can figure out what to replace them with.
Here's an excerpt of the startup script:
DNS_ACCESS_KEY=$(aws --region myRegion secretsmanager get-secret-value --secret-id "app/myDnsSecret" --query SecretString --output text | jq .KEY_ID)
DNS_ACCESS_SECRET=$(aws --region myRegion secretsmanager get-secret-value --secret-id "app/myDnsSecret" --query SecretString --output text | jq .KEY)
SENTRY_DSN=$(aws --region myRegion secretsmanager get-secret-value --secret-id "app/mySentryDSN" --query SecretString --output text | jq .DSN)
WEBHOOK_OAUTH_TOKEN=$(aws --region myRegion secretsmanager get-secret-value --secret-id "app/myToken" --query SecretString --output text | jq .TOKEN)
echo "Cleaning up .env file..."
rm -f .override.env
echo "Writing secrets to .env file..."
echo "AWS_ACCESS_KEY_ID=$DNS_ACCESS_KEY" > ./override.env
echo "AWS_SECRET_ACCESS_KEY=$DNS_ACCESS_SECRET" >> ./override.env
echo "SENTRY_DSN=$SENTRY_DSN" >> ./override.env
echo "OAUTH_TOKEN=$WEBHOOK_OAUTH_TOKEN" >> ./override.env
# retrieve GRPC TLS certificates from an S3 bucket
aws s3 cp s3://myresourcesbucket/ca.pem ./certs/ca.pem
aws s3 cp s3://myresourcesbucket/events-aws-key.pem ./certs/events-aws-key.pem
aws s3 cp s3://myresourcesbucket/events-aws.pem ./certs/events-aws.pem
echo "Starting Docker Compose components..."
echo "Starting CertBot..."
docker compose up certbot -d
# It takes a while for CertBot to retrieve the certificates
sleep 15
echo "Starting Cert-Copy..."
docker compose up cert-copy -d
docker compose restart cert-copy
echo "Starting GitStafette..."
docker compose up gitstafette-server -d
# Ensure the GitStafette server is started before Envoy
sleep 5
echo "Starting Envoy..."
docker compose up envoy -d
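The other resource I wanted to highlight is the CloudWatch Alarm that restarts the instance when it burns through too many CPU credits. Below is a sketch of what such an alarm can look like; the metric, period, and threshold are my approximation of the behaviour described earlier, not the exact alarm definition:
resource "aws_cloudwatch_metric_alarm" "cpu_credit_usage" {
  alarm_name          = "gsf-be-prod-cpu-credits" # placeholder
  namespace           = "AWS/EC2"
  metric_name         = "CPUCreditUsage" # credits spent per period on burstable instances
  statistic           = "Sum"
  period              = 300
  evaluation_periods  = 2
  threshold           = 2
  comparison_operator = "GreaterThanThreshold"

  dimensions = {
    InstanceId = aws_instance.gistafette.id
  }

  # Built-in EC2 action: reboot the instance, no SNS topic required.
  alarm_actions = ["arn:aws:automate:eu-central-1:ec2:reboot"]
}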
AWS & GitHub Integration¶
To ensure the deployment is automated, we must integrate the Packer AMI creation and Terraform deployment with an automation server or service.
In this case, I chose GitHub Actions. I have always wanted to learn more about using AWS from GitHub Actions with identity federation rather than fixed credentials, so this was an excellent opportunity to learn more about it.
We need to ensure the following:
- GitHub Actions can access the AWS account and change to the appropriate role
- GitHub Actions can run the Packer build and retrieve the AMI ID
- GitHub Actions can manage the Terraform state (S3 bucket)
- GitHub Actions can deploy the Terraform configuration
For this, we need to set up the following:
- Create an OpenID Connect (OIDC) Identity Provider (IdP) in AWS
- Create a role in AWS that trusts the OIDC IdP
- Create policies that allow the role to perform the necessary actions
The first step is well documented by AWS, so I won't go into the details here.
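For reference, the OIDC IdP and the role's trust relationship can be expressed in Terraform roughly as follows. This is a sketch, not necessarily how I created mine, and the repository filter is an assumption you would adjust to your own repository:
resource "aws_iam_openid_connect_provider" "github" {
  url             = "https://token.actions.githubusercontent.com"
  client_id_list  = ["sts.amazonaws.com"]
  thumbprint_list = ["6938fd4d98bab03faadb97b34396831e3780aea1"]
}

resource "aws_iam_role" "github_actions" {
  name = "GitHubAction-Gitstafette" # placeholder
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRoleWithWebIdentity"
      Principal = { Federated = aws_iam_openid_connect_provider.github.arn }
      Condition = {
        StringEquals = { "token.actions.githubusercontent.com:aud" = "sts.amazonaws.com" }
        # Only workflows from this repository may assume the role (assumption).
        StringLike = { "token.actions.githubusercontent.com:sub" = "repo:joostvdg/gitstafette:*" }
      }
    }]
  })
}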
In terms of the policies, we need to ensure the role has the following permissions:
- Assume the role
- Create the EC2 instance (see Hashicorp Packer Developer docs)
- Create the Route 53 record set
- Create the CloudWatch Alarm
- Create an EC2 instance profile
- Read from the app resources S3 bucket
- Read and write to the Terraform state S3 bucket
- Read the secrets from Secrets Manager with a specific prefix
Danger
As stated before, I am not a security expert.
Nor am I an AWS IAM expert.
So, take the following policies with a grain of salt.
Below, I'll show some examples of the policies.
Manage CloudWatch Alarms
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "VisualEditor0",
"Effect": "Allow",
"Action": [
"cloudwatch:PutMetricAlarm",
"cloudwatch:TagResource",
"cloudwatch:DescribeAlarmHistory",
"cloudwatch:UntagResource",
"cloudwatch:EnableAlarmActions",
"cloudwatch:DeleteAlarms",
"cloudwatch:DisableAlarmActions",
"cloudwatch:ListTagsForResource",
"cloudwatch:DescribeAlarms",
"cloudwatch:SetAlarmState",
"cloudwatch:PutCompositeAlarm"
],
"Resource": [
"arn:aws:cloudwatch:eu-central-1:XXXXX:slo/*",
"arn:aws:cloudwatch:eu-central-1:XXXXX:insight-rule/*",
"arn:aws:cloudwatch:eu-central-1:XXXXX:alarm:*"
]
},
{
"Sid": "VisualEditor1",
"Effect": "Allow",
"Action": [
"cloudwatch:GenerateQuery",
"cloudwatch:GetMetricData",
"cloudwatch:DescribeAlarmsForMetric",
"cloudwatch:GetMetricStatistics",
"cloudwatch:GetMetricWidgetImage",
"cloudwatch:ListMetrics",
"cloudwatch:ListServices"
],
"Resource": "*"
}
]
}
Manage Instance Profile
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "VisualEditor0",
"Effect": "Allow",
"Action": [
"iam:CreateInstanceProfile",
"iam:DeleteInstanceProfile",
"iam:ListInstanceProfilesForRole",
"iam:PassRole",
"iam:GetInstanceProfile",
"iam:RemoveRoleFromInstanceProfile",
"iam:AddRoleToInstanceProfile",
"iam:GetRole"
],
"Resource": [
"arn:aws:iam::853805194132:instance-profile/*",
"arn:aws:iam::853805194132:role/*"
]
},
{
"Sid": "VisualEditor1",
"Effect": "Allow",
"Action": "iam:ListInstanceProfiles",
"Resource": "*"
}
]
}
Manage Hosted Zone
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "VisualEditor0",
"Effect": "Allow",
"Action": [
"route53:GetChange",
"route53:ListHostedZones",
"route53:ListHostedZonesByName"
],
"Resource": "*"
},
{
"Sid": "VisualEditor1",
"Effect": "Allow",
"Action": [
"route53:ListTagsForResources",
"route53:GetHostedZone",
"route53:ChangeResourceRecordSets",
"route53:ChangeTagsForResource",
"route53:ListResourceRecordSets",
"route53:ListTagsForResource"
],
"Resource": [
"arn:aws:route53:::hostedzone/XXXXXXXXXXXXXXXX"
]
}
]
}
Manage Terraform State bucket
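What the role needs for the Terraform state comes down to list access on the bucket and get/put/delete on the state objects. A sketch, expressed as a Terraform-managed policy with a placeholder bucket name; add DynamoDB permissions if you use state locking:
resource "aws_iam_policy" "terraform_state" {
  name = "gitstafette-terraform-state" # placeholder
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = ["s3:ListBucket"]
        Resource = "arn:aws:s3:::my-terraform-state-bucket"
      },
      {
        Effect   = "Allow"
        Action   = ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"]
        Resource = "arn:aws:s3:::my-terraform-state-bucket/*"
      }
    ]
  })
}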
GitHub Actions Workflow¶
The GitHub Actions Workflow does the following:
- Checks out the repository
- Sets up AWS credentials
- Uses Hashicorp Packer to build the AMI
- Uses Terraform to deploy the resources
The workflow is not triggered by a push to the repository but by a manual trigger. At this point, I do not need to rebuild the AMI and redeploy the resources automatically.
What I have run into, though, is failing to keep the instance up to date: at some point, I could not update it at all, as the referenced Ubuntu packages were no longer available. The goal is to rebuild and redeploy an updated instance periodically, or whenever I change the Docker Compose configuration.
As a starting point, I reviewed the Hashicorp Developer documentation on GitHub Actions.
They provide GitHub Actions we can use to prepare the environment for Packer and Terraform: for Packer, this is Setup HashiCorp Packer, and for Terraform, this is HashiCorp - Setup Terraform.
We need to set up the AWS credentials the HashiCorp tools need to use. As described before, we use the OIDC IdP and the role that trusts it.
To set up the AWS credentials, we must use the aws-actions/configure-aws-credentials action.
It ensures the credentials the HashiCorp tools need are available in a form that they automatically pick up.
I've split the workflow into two jobs, one for Packer and one for Terraform.
Let's look at the top of the workflow file:
on: workflow_dispatch
env:
AWS_REGION: "eu-central-1"
# Permission can be added at job level or workflow level
permissions:
id-token: write # This is required for requesting the JWT
contents: read # This is required for actions/checkout
We must add some permissions so the workflow can use the OIDC IdP.
Then comes the Packer job. This job ensures the Packer configuration is valid, builds the AMI, and exports the AMI ID to an output.
jobs:
packer-build:
runs-on: ubuntu-latest
name: Run Packer
outputs:
ami_id: ${{ steps.build.outputs.ami_id }}
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Setup `packer`
uses: hashicorp/setup-packer@main
id: setup
with:
version: "latest"
- name: Run `packer init`
id: init
working-directory: ./aws/packer
run: "packer init ./aws-ubuntu.pkr.hcl"
- name: Run `packer validate`
id: validate
working-directory: ./aws/packer
run: "packer validate ./aws-ubuntu.pkr.hcl"
- name: configure aws credentials
uses: aws-actions/configure-aws-credentials@v1.7.0
with:
role-to-assume: arn:aws:iam::XXXXX:role/GitHubAction-XXXX #change to reflect your IAM role’s ARN
role-session-name: GitHub_to_AWS_via_FederatedOIDC
aws-region: ${{ env.AWS_REGION }}
- name: Run `packer build`
id: build
working-directory: ./aws/packer
run: |
packer build ./aws-ubuntu.pkr.hcl
ami_id=$(cat manifest.json | jq -r '.builds[-1].artifact_id' | cut -d':' -f2)
echo "ami_id=${ami_id}" | tee -a $GITHUB_OUTPUT
We can then use that AMI output in the Terraform job:
jobs: # duplicate, but makes the example look better
terraform-build:
runs-on: ubuntu-latest
name: Gitstafette AWS VM Rebuild
needs: packer-build
steps:
- id: tf-checkout
name: Checkout code for TF
uses: actions/checkout@v4
- id: tf-aws-creds
name: Configure AWS credentials for Terraform
uses: aws-actions/configure-aws-credentials@v1.7.0
with:
role-to-assume: arn:aws:iam::XXXXX:role/GitHubAction-XXXX #change to reflect your IAM role’s ARN
role-session-name: GitHub_to_AWS_via_FederatedOIDC
aws-region: ${{ env.AWS_REGION }}
- id: tf-setup
name: Setup Terraform
uses: hashicorp/setup-terraform@v3
- id: tf-init
name: Terraform Init
working-directory: ./aws/terraform
run: terraform init
- id: tf-validate
name: Terraform Validate
working-directory: ./aws/terraform
run: terraform validate -no-color
- id: tf-plan
name: Terraform Plan
continue-on-error: true # because some exit codes are not 0, even if they are just warnings/informative
env:
AMI_ID: ${{ needs.packer-build.outputs.ami_id }}
working-directory: ./aws/terraform
run: |
echo "AMI_ID=${AMI_ID}"
export TF_VAR_ami_id=${AMI_ID}
terraform plan \
-var "ami_id=${AMI_ID}" \
-no-color -out plan.out \
-input=false
- name: Terraform apply
working-directory: ./aws/terraform
run: terraform apply -auto-approve -input=false "plan.out"
I have seen some examples of people creating a PR with the Terraform changes rather than directly applying them.
That would be safer, as you can confirm that the changes are as desired and even run a cost estimate.
For my use case, that is overkill, so I just apply the changes directly.
Keeping An Eye On Things¶
A bit of bonus content.
In addition to the tools that AWS provides to monitor the resources, I also use a few other tools to keep an eye on things.
External Monitoring¶
I use Cronitor for the external-facing components. It is free for up to five monitors; I don't need more.
It runs a health check on the webhook endpoint and monitors both uptime and response time. For alerting, I have a personal Slack workspace where I receive the alerts.
This way, I can also track when the instance is restarted, either because of the Spot instance interruption or for some other reason.
Internal Monitoring¶
For the internal components, I use Sentry.
I've set up the Gitstafette server to send errors to Sentry. This way, I can keep track of any errors that occur in the application.
The application isn't critical, and there really isn't any need to have full-scale observability set up. So Sentry is more than enough for my needs.
In the event of unexpected errors, I get an email notification. The Sentry events capture a lot of information, so I can easily see what went wrong, when, and which user interaction triggered the error.
Conclusion¶
In this post, I've shown how to deploy a Go application with multiple protocols on AWS using Docker Compose and automate the deployment with Packer, Terraform, and GitHub Actions.