PMM Operational Runbook¶
This runbook provides step-by-step procedures for common operational tasks related to managing your PMM (Percona Monitoring and Management) deployment on AWS.
Table of Contents¶
- Access and Authentication
- Instance Management
- Data Management
- Monitoring and Alerting
- Upgrades and Maintenance
- Performance Tuning
- Troubleshooting
- Emergency Procedures
Access and Authentication¶
Accessing the PMM Web Interface¶
URL: https://pmm.<your-domain>
Example: https://pmm.example.com
Default credentials:
- Username: admin
- Password: stored in AWS Secrets Manager
Retrieving the Admin Password¶
Via AWS Console:
1. Go to AWS Secrets Manager
2. Find the secret named pmm-server-admin-password
3. Click "Retrieve secret value"
4. Copy the password
Via CLI:
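A minimal sketch of the CLI retrieval, assuming the secret keeps the name pmm-server-admin-password used elsewhere in this runbook:

```shell
# Read the secret directly from AWS Secrets Manager
aws secretsmanager get-secret-value \
  --secret-id pmm-server-admin-password \
  --query SecretString \
  --output text

# Or use the ih-secrets wrapper shown in the Appendix
ih-secrets get pmm-server-admin-password
```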
Via Terraform Output (if configured):
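A sketch, assuming the module exposes a sensitive output named pmm_admin_password (adjust to your actual output name):

```shell
# -raw prints the plain value without quotes
terraform output -raw pmm_admin_password
```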
SSH Access to EC2 Instance¶
Prerequisites:
- SSH key configured via the ssh_key_name variable
- admin_cidr_block variable set to allow SSH from your IP
Connect to instance:
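For example (the key path is an assumption; match it to the key you configured via ssh_key_name):

```shell
# Key path and host are placeholders
ssh -i ~/.ssh/<ssh_key_name>.pem ubuntu@<instance-ip>
```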
Changing Admin Password¶
Option 1: Via PMM UI:
1. Log in to PMM as admin
2. Go to Settings → Users
3. Select the admin user → Change password
Option 2: Update Secret in AWS:
# Generate new password
NEW_PASSWORD=$(openssl rand -base64 24)
# Update secret
ih-secrets set pmm-server-admin-password "$NEW_PASSWORD"
# Restart PMM container to pick up new password
ssh ubuntu@$INSTANCE_IP
sudo systemctl restart pmm-server
Instance Management¶
Starting PMM Service¶
ssh ubuntu@<instance-ip>
# Start PMM container
sudo systemctl start pmm-server
# Verify status
sudo systemctl status pmm-server
# Check container logs
sudo docker logs pmm-server --tail 50
Stopping PMM Service¶
Use case: Maintenance windows, troubleshooting
ssh ubuntu@<instance-ip>
# Stop PMM container
sudo systemctl stop pmm-server
# Verify it's stopped
sudo systemctl status pmm-server
sudo docker ps | grep pmm-server # Should show nothing
Restarting PMM Service¶
Use case: Configuration changes, memory issues
ssh ubuntu@<instance-ip>
# Restart PMM
sudo systemctl restart pmm-server
# Monitor restart
sudo journalctl -u pmm-server -f
Checking Instance Health¶
EC2 Status Checks:
aws ec2 describe-instance-status \
--instance-ids <instance-id> \
--query "InstanceStatuses[0].[SystemStatus.Status,InstanceStatus.Status]" \
--output text
System Resources:
ssh ubuntu@<instance-ip>
# CPU usage
top -bn1 | head -20
# Memory usage
free -h
# Disk usage
df -h
# EBS volume status
lsblk
mount | grep /srv
# Docker container status
sudo docker stats pmm-server --no-stream
Rebooting EC2 Instance¶
Use case: Kernel updates, persistent issues
Planned reboot:
# Reboot via AWS Console or CLI
aws ec2 reboot-instances --instance-ids <instance-id>
# Monitor instance state
aws ec2 describe-instance-status --instance-ids <instance-id>
After reboot:
# Verify PMM restarted
ssh ubuntu@<instance-ip>
sudo systemctl status pmm-server
# Check mount points
df -h | grep /srv
# Verify PMM web interface
curl -k https://pmm.example.com/v1/readyz
Data Management¶
Checking Disk Space¶
ssh ubuntu@<instance-ip>
# Overall disk usage
df -h
# Data volume usage
df -h /srv
# Top directories consuming space
sudo du -sh /srv/* | sort -h
# PMM database sizes
sudo du -sh /srv/clickhouse
sudo du -sh /srv/postgres
sudo du -sh /srv/prometheus
Cleaning Up Old Data¶
PMM data retention is managed within PMM settings:
1. Log in to the PMM UI
2. Go to Settings → Advanced Settings
3. Adjust data retention:
   - Metrics retention: default 30 days
   - Query Analytics retention: default 8 days
Manual cleanup (if needed):
ssh ubuntu@<instance-ip>
# Stop PMM
sudo systemctl stop pmm-server
# Clean up old Prometheus data (older than 30 days).
# Caution: deleting individual files from the time-series database
# directory can corrupt it; prefer lowering retention in PMM settings.
sudo find /srv/prometheus -type f -mtime +30 -delete
# Start PMM
sudo systemctl start pmm-server
Expanding Data Volume¶
When to expand:
- Disk usage >80%
- CloudWatch alarm triggered
- Planning to increase retention
Steps:
1. Update the Terraform configuration
2. Apply Terraform
3. Extend the filesystem (no downtime)
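The steps above can be sketched as follows; the variable name ebs_volume_size and the device name are assumptions, and the resize command depends on your filesystem (ext4 shown):

```shell
# 1. In Terraform, raise the data-volume size, e.g. ebs_volume_size = 200
terraform plan     # review: should show an in-place volume modification
terraform apply

# 2. On the instance, grow the filesystem online
ssh ubuntu@<instance-ip>
lsblk                         # confirm the larger volume is visible
sudo resize2fs /dev/xvdf      # ext4; use xfs_growfs /srv for XFS
df -h /srv                    # verify the new capacity
```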
Backup and Restore¶
See BACKUP_RESTORE.md for detailed procedures.
Quick backup:
# Create on-demand snapshot
aws ec2 create-snapshot \
--volume-id <volume-id> \
--description "PMM manual backup $(date +%Y-%m-%d)" \
--tag-specifications 'ResourceType=snapshot,Tags=[{Key=Name,Value=pmm-manual-backup}]'
Monitoring and Alerting¶
Viewing CloudWatch Metrics¶
Via CloudWatch Dashboard (if enabled):
1. Go to CloudWatch → Dashboards
2. Select the pmm-server-monitoring dashboard
3. View real-time metrics
Via AWS CLI:
# CPU utilization
aws cloudwatch get-metric-statistics \
--namespace AWS/EC2 \
--metric-name CPUUtilization \
--dimensions Name=InstanceId,Value=<instance-id> \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 300 \
--statistics Average
# Memory usage (CloudWatch Agent)
aws cloudwatch get-metric-statistics \
--namespace CWAgent \
--metric-name mem_used_percent \
--dimensions Name=InstanceId,Value=<instance-id> \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 300 \
--statistics Average
Checking CloudWatch Alarms¶
List active alarms:
aws cloudwatch describe-alarms \
--alarm-name-prefix pmm-server \
--state-value ALARM \
--output table
View alarm history:
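For example, to see recent state transitions for one alarm:

```shell
aws cloudwatch describe-alarm-history \
  --alarm-name <alarm-name> \
  --history-item-type StateUpdate \
  --max-records 10 \
  --output table
```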
Viewing Logs¶
PMM service logs (systemd):
ssh ubuntu@<instance-ip>
# Real-time logs
sudo journalctl -u pmm-server -f
# Last 100 lines
sudo journalctl -u pmm-server -n 100
# Logs from last hour
sudo journalctl -u pmm-server --since "1 hour ago"
Docker container logs:
# Real-time logs
sudo docker logs pmm-server -f
# Last 100 lines
sudo docker logs pmm-server --tail 100
# Logs with timestamps
sudo docker logs pmm-server --timestamps
CloudWatch Logs (if shipped):
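A sketch, assuming logs are shipped to a log group (the group name here is an assumption; substitute your own):

```shell
# Tail a log group in near-real time
aws logs tail /pmm-server/system --follow --since 1h
```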
Managing SNS Notifications¶
Add email subscribers:
# Get SNS topic ARN
TOPIC_ARN=$(aws sns list-topics \
--query "Topics[?contains(TopicArn, 'pmm-server-alarms')].TopicArn" \
--output text)
# Subscribe new email
aws sns subscribe \
--topic-arn $TOPIC_ARN \
--protocol email \
--notification-endpoint devops@example.com
# Subscriber must confirm via email
List subscribers:
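Using the TOPIC_ARN resolved above:

```shell
aws sns list-subscriptions-by-topic \
  --topic-arn $TOPIC_ARN \
  --query "Subscriptions[].[Protocol,Endpoint,SubscriptionArn]" \
  --output table
```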
Upgrades and Maintenance¶
Upgrading PMM Version¶
Before upgrade:
1. Review the PMM release notes
2. Create a manual backup (see Backup section)
3. Schedule a maintenance window
4. Notify users
Steps:
1. Update the Terraform configuration
2. Apply Terraform (will recreate the container)
3. Verify the upgrade
4. Post-upgrade checks:
- Log in to PMM UI
- Verify dashboards load
- Check database connections
- Review logs for errors
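The upgrade steps can be sketched as follows; the variable name pmm_image_tag is an assumption, so adapt it to your module:

```shell
# 1. Bump the PMM image tag in Terraform, e.g. pmm_image_tag = "2.41.2"
terraform plan     # confirm only the container/image is replaced
terraform apply

# 2. Confirm the server is healthy and running the new image
curl -k https://pmm.example.com/v1/readyz
ssh ubuntu@<instance-ip> \
  "sudo docker inspect pmm-server --format '{{.Config.Image}}'"
```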
Patching EC2 Instance¶
Ubuntu OS patches:
ssh ubuntu@<instance-ip>
# Update package list
sudo apt update
# Check available updates
apt list --upgradable
# Install updates (kernel updates require reboot)
sudo apt upgrade -y
# Reboot if needed
sudo reboot
Automated patching (recommended):
- Enable AWS Systems Manager Patch Manager
- Or use maintenance windows in Terraform
Updating Docker¶
ssh ubuntu@<instance-ip>
# Check current Docker version
docker --version
# Update Docker
sudo apt update
sudo apt install --only-upgrade docker-ce
# Restart Docker daemon
sudo systemctl restart docker
# Restart PMM
sudo systemctl restart pmm-server
Performance Tuning¶
Optimizing ClickHouse Performance¶
Increase ClickHouse memory (if high query load):
ssh ubuntu@<instance-ip>
# Edit PMM container environment
sudo systemctl stop pmm-server
# Modify Docker container settings (example: increase memory)
# Edit systemd service file
sudo nano /etc/systemd/system/pmm-server.service
# Add environment variable
Environment="CLICKHOUSE_MAX_MEMORY=8GB"
# Reload and restart
sudo systemctl daemon-reload
sudo systemctl start pmm-server
Optimizing Query Analytics¶
Reduce retention (if disk space is constrained):
1. PMM UI → Settings → Advanced Settings
2. Query Analytics retention: 8 days → 3 days
Disable slow query log collection (if not needed):
1. PMM UI → database monitoring settings
2. Disable slow query log collection for specific databases
Scaling Instance Type¶
When to scale:
- CPU usage consistently >70%
- Memory usage consistently >85%
- Monitoring 20+ database instances
Steps:
1. Update Terraform
2. Apply (requires instance replacement)
3. Verify:
- Check PMM UI accessibility
- Verify data persistence
- Monitor new instance performance
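The scaling steps can be sketched as follows; the variable name instance_type is an assumption:

```shell
# 1. Raise the instance size in Terraform, e.g. instance_type = "m5.xlarge"
terraform plan     # expect the EC2 instance to be replaced
terraform apply

# 2. Confirm PMM came back with its data intact
curl -k https://pmm.example.com/v1/readyz
ssh ubuntu@<instance-ip> "df -h /srv && sudo docker ps | grep pmm-server"
```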
Optimizing EBS Performance¶
Increase IOPS (for high I/O workloads):
module "pmm" {
# ... other settings ...
ebs_volume_type = "gp3"
ebs_iops = 6000 # Increase from 3000
ebs_throughput = 250 # Increase from 125 MB/s
}
Apply Terraform:
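For example:

```shell
terraform plan     # a gp3 IOPS/throughput change modifies the volume in place
terraform apply
```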
Troubleshooting¶
PMM Container Won't Start¶
Check logs:
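A sketch of the log checks, using the same commands as elsewhere in this runbook:

```shell
ssh ubuntu@<instance-ip>
sudo systemctl status pmm-server
sudo journalctl -u pmm-server -n 100 --no-pager
sudo docker logs pmm-server --tail 100
```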
Common issues:
- Port conflict
- Volume mount failure
- Insufficient memory
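Hedged diagnostics for the three cases above, using generic Linux tooling rather than anything PMM-specific:

```shell
# Port conflict: see what already holds ports 80/443 on the host
sudo ss -tlnp | grep -E ':80 |:443 '

# Volume mount failure: confirm the data volume is mounted at /srv
lsblk
mount | grep /srv || echo "/srv is not mounted"

# Insufficient memory: check free memory and look for recent OOM kills
free -h
sudo dmesg | grep -i 'out of memory'
```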
High Memory Usage¶
Identify memory consumers:
ssh ubuntu@<instance-ip>
# System memory
free -h
# Docker container memory
sudo docker stats pmm-server --no-stream
# Processes within container
sudo docker exec pmm-server ps aux --sort=-%mem | head -20
Mitigation:
1. Restart PMM: sudo systemctl restart pmm-server
2. Reduce retention periods in PMM settings
3. Scale to a larger instance type
Disk Space Running Out¶
Immediate action:
# Identify large directories
sudo du -sh /srv/* | sort -h
# Clean up old snapshots/temp files
sudo find /srv -name "*.tmp" -delete
Long-term solutions:
1. Expand the EBS volume (see Data Management)
2. Reduce data retention in PMM settings
3. Archive old data to S3
Database Connection Failures¶
Check PMM to RDS connectivity:
ssh ubuntu@<instance-ip>
# Test PostgreSQL connection
nc -zv <rds-endpoint> 5432
# Check security group rules
aws ec2 describe-security-groups \
--group-ids <pmm-security-group-id> \
--query "SecurityGroups[0].IpPermissionsEgress"
Verify RDS security group:
# Check if PMM security group is allowed
aws ec2 describe-security-groups \
--group-ids <rds-security-group-id> \
--query "SecurityGroups[0].IpPermissions[?ToPort==\`5432\`]"
ALB Health Check Failures¶
Check target health:
# Get target group ARN
TG_ARN=$(aws elbv2 describe-target-groups \
--query "TargetGroups[?contains(TargetGroupName, 'pmm')].TargetGroupArn" \
--output text)
# Check target health
aws elbv2 describe-target-health --target-group-arn $TG_ARN
Common causes:
1. PMM container not running: check systemd status
2. Health endpoint not responding: curl http://localhost/v1/readyz
3. Security group blocking traffic: verify the ALB can reach the instance on port 80
Fix:
# Restart PMM
sudo systemctl restart pmm-server
# Test health endpoint
curl -v http://localhost/v1/readyz
Emergency Procedures¶
Service Outage Response¶
1. Assess severity:
   - PMM UI inaccessible?
   - Instance down?
   - Data corruption?
2. Check CloudWatch alarms
3. Check instance status
4. Immediate actions:
   - If the instance failed: EC2 auto-recovery should trigger
   - If the service crashed: SSH in and restart PMM
   - If the AZ is down: follow disaster recovery procedures
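The checks and restart above can be sketched with commands already used in this runbook:

```shell
# Check firing alarms
aws cloudwatch describe-alarms \
  --alarm-name-prefix pmm-server \
  --state-value ALARM \
  --output table

# Check instance and system status (include stopped instances too)
aws ec2 describe-instance-status \
  --instance-ids <instance-id> \
  --include-all-instances

# If the service crashed, restart it
ssh ubuntu@<instance-ip> "sudo systemctl restart pmm-server"
```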
Data Corruption Recovery¶
Symptoms:
- PMM UI shows errors
- Dashboards won't load
- Database query failures
Recovery:
1. Stop PMM: sudo systemctl stop pmm-server
2. Check the filesystem: sudo fsck -y /dev/xvdf
3. Restore from backup (see BACKUP_RESTORE.md)
Emergency Rollback¶
Use case: Bad upgrade, configuration change caused issues
Steps:
1. Revert the Terraform changes
2. Or restore from backup (if data changed):
   - See BACKUP_RESTORE.md Scenario 1
   - Use a backup from before the change
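A sketch of the revert path, assuming the Terraform configuration lives in version control:

```shell
# Revert the offending commit in the Terraform repo, then re-apply
git revert <bad-commit-sha>
terraform plan     # confirm the plan returns to the known-good state
terraform apply
```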
Accessing Instance During AWS Console Outage¶
Via AWS CLI (pre-configured):
# Stop instance
aws ec2 stop-instances --instance-ids <instance-id>
# Start instance
aws ec2 start-instances --instance-ids <instance-id>
# Reboot instance
aws ec2 reboot-instances --instance-ids <instance-id>
Via Terraform:
# Emergency destroy and recreate
terraform destroy -target=module.pmm.aws_instance.pmm_server
terraform apply -target=module.pmm.aws_instance.pmm_server
ASG Reconciler Lambda¶
The Lambda reconciler automatically manages pmm-client on Auto Scaling Group instances. It runs every 5 minutes via EventBridge and is only created when monitored_asgs is configured.
Checking Lambda Status¶
# Get Lambda function name
FUNCTION_NAME=$(aws lambda list-functions \
--query "Functions[?contains(FunctionName, 'reconciler')].FunctionName" \
--output text)
# Check recent invocations
aws logs tail /aws/lambda/$FUNCTION_NAME --since 1h
Manually Invoking the Lambda¶
# Invoke and see output
aws lambda invoke \
--function-name $FUNCTION_NAME \
--log-type Tail \
--query 'LogResult' \
--output text /dev/stdout | base64 -d
# Or invoke and get result payload
aws lambda invoke \
--function-name $FUNCTION_NAME \
output.json && cat output.json
Expected output:
Verifying pmm-client on ASG Instances¶
# Check pmm-client status on a specific instance
aws ssm start-session --target <instance-id>
# Inside the instance:
sudo pmm-admin status
sudo pmm-admin list
# Check specific agents
sudo pmm-admin status 2>/dev/null | grep mysqld_exporter
Removing a Stale PMM Service¶
If a service exists in PMM for a terminated instance:
- Via PMM UI: Configuration > Inventory > Services > Delete
- Via API:
# Get admin password
PMM_PASSWORD=$(ih-secrets get <admin-password-secret-arn>)
# List services
curl -s -u admin:$PMM_PASSWORD \
  http://<pmm-private-ip>/v1/management/services | jq .
# Remove a service (force=true removes dependent agents)
curl -s -u admin:$PMM_PASSWORD \
  -X DELETE \
  "http://<pmm-private-ip>/v1/inventory/services/<service-id>?force=true"
Troubleshooting Lambda Failures¶
Lambda timeout (>300s):
- A first-time pmm-client install takes ~60s per instance
- With 3+ instances handled sequentially, runs may approach the 300s limit
- Check CloudWatch logs to see which instance is slow
SSM command failures:
- Verify the instance is InService in the ASG and SSM-managed
- Check that the instance IAM role has SSM permissions
- Ensure /opt/puppetlabs/bin is in PATH (Puppet facts required)
"already exists" errors:
- The Lambda automatically removes stale services and retries
- If the error persists, manually delete the service from the PMM UI
pmm-agent connection timeout:
- Verify the security group allows port 443 from the ASG SG to the PMM instance
- pmm-agent connects directly to the PMM EC2 instance (not via the ALB)
- Check: pmm-admin status should show Connected : true
Regular Maintenance Schedule¶
Daily¶
- Monitor CloudWatch alarms via email
- Check backup job success (AWS Backup console)
Weekly¶
- Review PMM UI performance and dashboards
- Check disk usage trends
- Review CloudWatch metrics
Monthly¶
- Review access logs and user activity
- Check for PMM version updates
- Review and optimize data retention settings
- Test database connections
Quarterly¶
- Test disaster recovery procedures (see BACKUP_RESTORE.md)
- Review and update documentation
- Patch EC2 instance OS
- Review CloudWatch alarm thresholds
Annually¶
- Review architecture for cost optimization
- Evaluate PMM version and plan upgrades
- Audit IAM roles and permissions
- Review backup retention policies
Contacts and Escalation¶
Primary contacts:
- Team Email: devops@example.com
- On-call Rotation: [PagerDuty/Slack channel]
Escalation path:
1. L1: Team on-call engineer
2. L2: Platform engineering lead
3. L3: AWS Support (if AWS infrastructure issue)
4. L4: Percona Support (if PMM software issue)
External resources:
- Percona PMM Documentation
- Percona Community Forums
- AWS Support
Appendix¶
Useful Commands Cheat Sheet¶
# Instance status
aws ec2 describe-instance-status --instance-ids <id>
# Get admin password
ih-secrets get pmm-server-admin-password
# Restart PMM
ssh ubuntu@<ip> "sudo systemctl restart pmm-server"
# Check disk usage
ssh ubuntu@<ip> "df -h /srv"
# View logs
ssh ubuntu@<ip> "sudo journalctl -u pmm-server -f"
# Create manual backup
aws ec2 create-snapshot --volume-id <vol-id> --description "Manual backup"
# Check alarms
aws cloudwatch describe-alarms --alarm-name-prefix pmm-server --state-value ALARM
Terraform Commands Reference¶
# View current state
terraform show
# Plan changes
terraform plan
# Apply changes
terraform apply
# Target specific resource
terraform apply -target=module.pmm.aws_ebs_volume.pmm_data
# Import existing resource
terraform import module.pmm.aws_ebs_volume.pmm_data vol-xxxxx
# View outputs
terraform output
Document Revision History¶
| Date | Version | Changes | Author |
|---|---|---|---|
| 2024-01-15 | 1.0 | Initial runbook creation | DevOps Team |
Last updated: 2024-01-15
Review cycle: Quarterly
Next review: 2024-04-15