PMM Operational Runbook

This runbook provides step-by-step procedures for operating a PMM (Percona Monitoring and Management) deployment on AWS.

Access and Authentication

Accessing the PMM Web Interface

URL: https://pmm.<your-domain>

  • Example: https://pmm.example.com

Default credentials:

  • Username: admin
  • Password: Stored in AWS Secrets Manager

Retrieving the Admin Password

Via AWS Console:

  1. Go to AWS Secrets Manager
  2. Find the secret named pmm-server-admin-password
  3. Click "Retrieve secret value"
  4. Copy the password

Via CLI:

ih-secrets get pmm-server-admin-password

Via Terraform Output (if configured; this returns the ARN of the secret, not the password itself):

terraform output admin_password_secret_arn

SSH Access to EC2 Instance

Prerequisites:

  • SSH key configured via the ssh_key_name variable
  • admin_cidr_block variable set to allow SSH from your IP

Connect to instance:

# Find instance IP
ih-ec2 list --tags

# SSH (from bastion or VPN)
ssh ubuntu@<instance-ip>

Changing Admin Password

Option 1: Via PMM UI:

  1. Log in to PMM as admin
  2. Go to Settings → Users
  3. Select admin user → Change password

Option 2: Update Secret in AWS:

# Generate new password
NEW_PASSWORD=$(openssl rand -base64 24)

# Update secret
ih-secrets set pmm-server-admin-password "$NEW_PASSWORD"

# Restart PMM container to pick up new password
# (run as a single remote command so the block can be pasted as a whole)
ssh ubuntu@"$INSTANCE_IP" "sudo systemctl restart pmm-server"

Instance Management

Starting PMM Service

ssh ubuntu@<instance-ip>

# Start PMM container
sudo systemctl start pmm-server

# Verify status
sudo systemctl status pmm-server

# Check container logs
sudo docker logs pmm-server --tail 50

Stopping PMM Service

Use case: Maintenance windows, troubleshooting

ssh ubuntu@<instance-ip>

# Stop PMM container
sudo systemctl stop pmm-server

# Verify it's stopped
sudo systemctl status pmm-server
sudo docker ps | grep pmm-server  # Should show nothing

Restarting PMM Service

Use case: Configuration changes, memory issues

ssh ubuntu@<instance-ip>

# Restart PMM
sudo systemctl restart pmm-server

# Monitor restart
sudo journalctl -u pmm-server -f

Checking Instance Health

EC2 Status Checks:

aws ec2 describe-instance-status \
  --instance-ids <instance-id> \
  --query "InstanceStatuses[0].[SystemStatus.Status,InstanceStatus.Status]" \
  --output text
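
The query above prints the system status check and the instance status check, and both read "ok" on a healthy instance. A minimal sketch of turning that into a pass/fail check follows. The function name pmm_instance_ok is hypothetical, and the instance ID is passed as an argument.

```shell
# Hypothetical helper: succeeds only when both EC2 status checks report "ok".
# Usage: pmm_instance_ok <instance-id>
pmm_instance_ok() {
  status=$(aws ec2 describe-instance-status \
    --instance-ids "$1" \
    --query "InstanceStatuses[0].[SystemStatus.Status,InstanceStatus.Status]" \
    --output text) || return 2

  # Text output is two whitespace-separated fields, e.g. "ok<TAB>ok".
  # The unquoted expansion splits them into positional parameters.
  set -- $status
  [ "$1" = "ok" ] && [ "$2" = "ok" ]
}
```

Called as pmm_instance_ok <instance-id> && echo healthy, it returns 0 when both checks pass, 2 when the AWS CLI call itself fails, and 1 otherwise (for example while a check still reports "initializing").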

System Resources:

ssh ubuntu@<instance-ip>

# CPU usage
top -bn1 | head -20

# Memory usage
free -h

# Disk usage
df -h

# EBS volume status
lsblk
mount | grep /srv

# Docker container status
sudo docker stats pmm-server --no-stream

Rebooting EC2 Instance

Use case: Kernel updates, persistent issues

Planned reboot:

# Reboot via AWS Console or CLI
aws ec2 reboot-instances --instance-ids <instance-id>

# Monitor instance state
aws ec2 describe-instance-status --instance-ids <instance-id>

After reboot:

# Verify PMM restarted
ssh ubuntu@<instance-ip>
sudo systemctl status pmm-server

# Check mount points
df -h | grep /srv

# Verify PMM web interface
curl -k https://pmm.example.com/v1/readyz
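
PMM can take a while to answer after a reboot, so a single curl may fail even though the service is still starting. A minimal sketch of a polling loop is shown below. The endpoint path and the -k flag are taken from the check above; the function name wait_for_pmm and the default attempt count and delay are assumptions.

```shell
# Hypothetical helper: poll the PMM readiness endpoint until it answers
# with a success status or the attempts run out.
# Usage: wait_for_pmm <base-url> [attempts] [delay-seconds]
wait_for_pmm() {
  url="$1/v1/readyz"
  attempts="${2:-30}"
  delay="${3:-10}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP errors; -k mirrors the check above.
    if curl -fsk -o /dev/null "$url"; then
      echo "PMM ready after $i attempt(s)"
      return 0
    fi
    # Sleep between attempts, but not after the last one.
    [ "$i" -lt "$attempts" ] && sleep "$delay"
    i=$((i + 1))
  done
  echo "PMM not ready after $attempts attempts" >&2
  return 1
}
```

For example, wait_for_pmm https://pmm.example.com 30 10 waits roughly five minutes before giving up.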

Data Management

Checking Disk Space

ssh ubuntu@<instance-ip>

# Overall disk usage
df -h

# Data volume usage
df -h /srv

# Top directories consuming space
sudo du -sh /srv/* | sort -h

# PMM database sizes
sudo du -sh /srv/clickhouse
sudo du -sh /srv/postgres
sudo du -sh /srv/prometheus
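
For scripted checks, the usage figure can be compared against a threshold. The sketch below assumes a default of 80%, matching the expansion trigger listed under "Expanding Data Volume". The function name check_disk_usage is hypothetical.

```shell
# Hypothetical helper: print the used percentage of a mount point and
# fail when it exceeds a threshold (default 80).
# Usage: check_disk_usage [mount-point] [threshold-percent]
check_disk_usage() {
  mount_point="${1:-/srv}"
  threshold="${2:-80}"
  # df -P prints one POSIX-format line per filesystem; field 5 is "NN%".
  used=$(df -P "$mount_point" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }')
  echo "$mount_point is ${used}% full"
  [ "$used" -le "$threshold" ]
}
```

Running check_disk_usage /srv 80 || echo "consider expanding the volume" prints the current usage and the hint only when the threshold is exceeded.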

Cleaning Up Old Data

PMM data retention is managed within PMM settings:

  1. Log in to PMM UI
  2. Go to Settings → Advanced Settings
  3. Adjust data retention:
     • Metrics retention: Default 30 days
     • Query Analytics retention: Default 8 days

Manual cleanup (if needed):

ssh ubuntu@<instance-ip>

# Stop PMM
sudo systemctl stop pmm-server

# Clean up old Prometheus data (older than 30 days)
sudo find /srv/prometheus -type f -mtime +30 -delete

# Start PMM
sudo systemctl start pmm-server

Expanding Data Volume

When to expand:

  • Disk usage >80%
  • CloudWatch alarm triggered
  • Planning to increase retention

Steps:

  1. Update Terraform configuration:

    module "pmm" {
      # ... other settings ...
      ebs_volume_size = 200  # Increase from 100GB
    }
    

  2. Apply Terraform:

    terraform plan   # Verify only volume size changes
    terraform apply
    

  3. Extend filesystem (no downtime):

    ssh ubuntu@<instance-ip>
    
    # Verify new volume size
    lsblk | grep xvdf
    
    # Extend filesystem
    sudo resize2fs /dev/xvdf
    
    # Verify new size
    df -h /srv
    

Backup and Restore

See BACKUP_RESTORE.md for detailed procedures.

Quick backup:

# Create on-demand snapshot
aws ec2 create-snapshot \
  --volume-id <volume-id> \
  --description "PMM manual backup $(date +%Y-%m-%d)" \
  --tag-specifications 'ResourceType=snapshot,Tags=[{Key=Name,Value=pmm-manual-backup}]'
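
create-snapshot returns as soon as the snapshot is started, not when it is finished. A minimal sketch that captures the snapshot ID and waits for completion is shown below. The function name pmm_snapshot is hypothetical, and the tag specification from the command above is omitted for brevity.

```shell
# Hypothetical helper: snapshot a volume, wait for completion, print the ID.
# Usage: pmm_snapshot <volume-id>
pmm_snapshot() {
  snapshot_id=$(aws ec2 create-snapshot \
    --volume-id "$1" \
    --description "PMM manual backup $(date +%Y-%m-%d)" \
    --query SnapshotId \
    --output text) || return 1

  # Blocks until the snapshot reaches the "completed" state.
  aws ec2 wait snapshot-completed --snapshot-ids "$snapshot_id" || return 1
  echo "$snapshot_id"
}
```

The waiter polls for a limited time (about ten minutes with the AWS CLI defaults), so a snapshot of a large volume may need the wait command to be re-run.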

Monitoring and Alerting

Viewing CloudWatch Metrics

Via CloudWatch Dashboard (if enabled):

  1. Go to CloudWatch → Dashboards
  2. Select the pmm-server-monitoring dashboard
  3. View real-time metrics

Via AWS CLI:

# CPU utilization
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=<instance-id> \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 300 \
  --statistics Average

# Memory usage (CloudWatch Agent)
aws cloudwatch get-metric-statistics \
  --namespace CWAgent \
  --metric-name mem_used_percent \
  --dimensions Name=InstanceId,Value=<instance-id> \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 300 \
  --statistics Average

Checking CloudWatch Alarms

List active alarms:

aws cloudwatch describe-alarms \
  --alarm-name-prefix pmm-server \
  --state-value ALARM \
  --output table

View alarm history:

aws cloudwatch describe-alarm-history \
  --alarm-name pmm-server-high-memory \
  --max-records 10

Viewing Logs

PMM service logs (systemd):

ssh ubuntu@<instance-ip>

# Real-time logs
sudo journalctl -u pmm-server -f

# Last 100 lines
sudo journalctl -u pmm-server -n 100

# Logs from last hour
sudo journalctl -u pmm-server --since "1 hour ago"

Docker container logs:

# Real-time logs
sudo docker logs pmm-server -f

# Last 100 lines
sudo docker logs pmm-server --tail 100

# Logs with timestamps
sudo docker logs pmm-server --timestamps

CloudWatch Logs (if shipped):

aws logs tail /aws/ec2/pmm-server --follow

Managing SNS Notifications

Add email subscribers:

# Get SNS topic ARN
TOPIC_ARN=$(aws sns list-topics \
  --query "Topics[?contains(TopicArn, 'pmm-server-alarms')].TopicArn" \
  --output text)

# Subscribe new email
aws sns subscribe \
  --topic-arn $TOPIC_ARN \
  --protocol email \
  --notification-endpoint devops@example.com

# Subscriber must confirm via email

List subscribers:

aws sns list-subscriptions-by-topic --topic-arn $TOPIC_ARN

Upgrades and Maintenance

Upgrading PMM Version

Before upgrade:

  1. Review PMM release notes
  2. Create manual backup (see Backup section)
  3. Schedule maintenance window
  4. Notify users

Steps:

  1. Update Terraform configuration:

    module "pmm" {
      # ... other settings ...
      pmm_version = "3.1"  # Update from "3" to specific version
    }
    

  2. Apply Terraform (will recreate container):

    terraform plan   # Review changes
    terraform apply
    

  3. Verify upgrade:

    ssh ubuntu@<instance-ip>
    
    # Check PMM version
    sudo docker exec pmm-server pmm-admin --version
    
    # Verify service health
    sudo systemctl status pmm-server
    curl -k https://pmm.example.com/v1/readyz
    

  4. Post-upgrade checks:
     • Log in to PMM UI
     • Verify dashboards load
     • Check database connections
     • Review logs for errors

Patching EC2 Instance

Ubuntu OS patches:

ssh ubuntu@<instance-ip>

# Update package list
sudo apt update

# Check available updates
apt list --upgradable

# Install updates (kernel updates require reboot)
sudo apt upgrade -y

# Reboot if needed
sudo reboot

Automated patching (recommended):

  • Enable AWS Systems Manager Patch Manager
  • Or use maintenance windows in Terraform

Updating Docker

ssh ubuntu@<instance-ip>

# Check current Docker version
docker --version

# Update Docker
sudo apt update
sudo apt install --only-upgrade docker-ce

# Restart Docker daemon
sudo systemctl restart docker

# Restart PMM
sudo systemctl restart pmm-server

Performance Tuning

Optimizing ClickHouse Performance

Increase ClickHouse memory (if high query load):

ssh ubuntu@<instance-ip>

# Stop PMM before editing its service definition
sudo systemctl stop pmm-server

# Edit the systemd service file
sudo nano /etc/systemd/system/pmm-server.service

# In the [Service] section of that file, add this line
# (unit-file syntax, not a shell command):
#   Environment="CLICKHOUSE_MAX_MEMORY=8GB"

# Reload and restart
sudo systemctl daemon-reload
sudo systemctl start pmm-server

Optimizing Query Analytics

Reduce retention (if disk space constrained):

  1. PMM UI → Settings → Advanced Settings
  2. Query Analytics retention: 8 days → 3 days

Disable slow query log (if not needed):

  1. PMM UI → Database monitoring settings
  2. Disable slow query log collection for specific databases

Scaling Instance Type

When to scale:

  • CPU usage consistently >70%
  • Memory usage consistently >85%
  • Monitoring 20+ database instances
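
The scaling criteria can be written down as a small check. The sketch below is illustrative only: the function name should_scale is hypothetical, and the inputs are assumed to be whole-number averages over a period long enough to count as "consistently" (for example, a week of CloudWatch data).

```shell
# Hypothetical helper: apply the scaling thresholds from this runbook.
# Usage: should_scale <avg-cpu-percent> <avg-mem-percent> <monitored-db-count>
# Prints the reasons and succeeds if any threshold is crossed.
should_scale() {
  reasons=""
  [ "$1" -gt 70 ] && reasons="$reasons cpu"
  [ "$2" -gt 85 ] && reasons="$reasons memory"
  [ "$3" -ge 20 ] && reasons="$reasons db-count"
  if [ -n "$reasons" ]; then
    echo "scale up:$reasons"
    return 0
  fi
  echo "no scaling needed"
  return 1
}
```

For example, should_scale 75 60 5 prints "scale up: cpu", while should_scale 70 85 19 prints "no scaling needed" because no threshold is exceeded.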

Steps:

  1. Update Terraform:

    module "pmm" {
      # ... other settings ...
      instance_type = "m5.xlarge"  # Scale up from m5.large
    }
    

  2. Apply (requires instance replacement):

    terraform plan   # Note: instance will be replaced
    terraform apply  # Downtime: 5-10 minutes
    

  3. Verify:
     • Check PMM UI accessibility
     • Verify data persistence
     • Monitor new instance performance

Optimizing EBS Performance

Increase IOPS (for high I/O workloads):

module "pmm" {
  # ... other settings ...
  ebs_volume_type = "gp3"
  ebs_iops        = 6000  # Increase from 3000
  ebs_throughput  = 250   # Increase from 125 MB/s
}

Apply Terraform:

terraform apply  # No downtime, volume modified online

Troubleshooting

PMM Container Won't Start

Check logs:

sudo journalctl -u pmm-server -n 100 --no-pager
sudo docker logs pmm-server --tail 100

Common issues:

  1. Port conflict:

    # Check if ports 80/443 are in use
    # (ss ships with Ubuntu by default; \b keeps ports such as 8080 from matching)
    sudo ss -tlnp | grep -E ':(80|443)\b'
    
    # Kill conflicting process if needed
    sudo kill <pid>
    

  2. Volume mount failure:

    # Check if /srv is mounted
    mount | grep /srv
    
    # Remount if needed
    sudo mount /dev/xvdf /srv
    

  3. Insufficient memory:

    # Check available memory
    free -h
    
    # Kill memory-intensive processes or scale instance
    

High Memory Usage

Identify memory consumers:

ssh ubuntu@<instance-ip>

# System memory
free -h

# Docker container memory
sudo docker stats pmm-server --no-stream

# Processes within container
sudo docker exec pmm-server ps aux --sort=-%mem | head -20

Mitigation:

  1. Restart PMM: sudo systemctl restart pmm-server
  2. Reduce retention periods in PMM settings
  3. Scale to larger instance type

Disk Space Running Out

Immediate action:

# Identify large directories
sudo du -sh /srv/* | sort -h

# Clean up leftover temp files (regular files only)
sudo find /srv -type f -name "*.tmp" -delete

Long-term solution:

  1. Expand EBS volume (see Data Management)
  2. Reduce data retention in PMM settings
  3. Archive old data to S3

Database Connection Failures

Check PMM to RDS connectivity:

ssh ubuntu@<instance-ip>

# Test PostgreSQL connection
nc -zv <rds-endpoint> 5432

# Check security group rules
aws ec2 describe-security-groups \
  --group-ids <pmm-security-group-id> \
  --query "SecurityGroups[0].IpPermissionsEgress"

Verify RDS security group:

# Check if PMM security group is allowed
aws ec2 describe-security-groups \
  --group-ids <rds-security-group-id> \
  --query "SecurityGroups[0].IpPermissions[?ToPort==\`5432\`]"

ALB Health Check Failures

Check target health:

# Get target group ARN
TG_ARN=$(aws elbv2 describe-target-groups \
  --query "TargetGroups[?contains(TargetGroupName, 'pmm')].TargetGroupArn" \
  --output text)

# Check target health
aws elbv2 describe-target-health --target-group-arn $TG_ARN

Common causes:

  1. PMM container not running: Check systemd status
  2. Health endpoint not responding: curl http://localhost/v1/readyz
  3. Security group blocking traffic: Verify ALB can reach instance on port 80

Fix:

# Restart PMM
sudo systemctl restart pmm-server

# Test health endpoint
curl -v http://localhost/v1/readyz

Emergency Procedures

Service Outage Response

  1. Assess severity:
     • PMM UI inaccessible?
     • Instance down?
     • Data corruption?

  2. Check CloudWatch alarms:

    aws cloudwatch describe-alarms \
      --alarm-name-prefix pmm-server \
      --state-value ALARM
    

  3. Check instance status:

    aws ec2 describe-instance-status --instance-ids <instance-id>
    

  4. Immediate actions:
     • If instance failed: EC2 auto-recovery should trigger
     • If service crashed: SSH and restart PMM
     • If AZ down: Follow disaster recovery procedures

Data Corruption Recovery

Symptoms:

  • PMM UI shows errors
  • Dashboards won't load
  • Database query failures

Recovery:

  1. Stop PMM: sudo systemctl stop pmm-server
  2. Unmount the data volume (fsck must not run on a mounted filesystem): sudo umount /srv
  3. Check filesystem: sudo fsck -y /dev/xvdf
  4. Restore from backup (see BACKUP_RESTORE.md)

Emergency Rollback

Use case: Bad upgrade, configuration change caused issues

Steps:

  1. Revert Terraform changes:

    git revert <commit-hash>
    terraform apply
    

  2. Or restore from backup (if data changed):
     • See BACKUP_RESTORE.md Scenario 1
     • Use backup from before the change

Accessing Instance During AWS Console Outage

Via AWS CLI (pre-configured):

# Stop instance
aws ec2 stop-instances --instance-ids <instance-id>

# Start instance
aws ec2 start-instances --instance-ids <instance-id>

# Reboot instance
aws ec2 reboot-instances --instance-ids <instance-id>

Via Terraform:

# Emergency destroy and recreate
terraform destroy -target=module.pmm.aws_instance.pmm_server
terraform apply -target=module.pmm.aws_instance.pmm_server

ASG Reconciler Lambda

The Lambda reconciler automatically manages pmm-client on Auto Scaling Group instances. It runs every 5 minutes via EventBridge and is only created when monitored_asgs is configured.

Checking Lambda Status

# Get Lambda function name
FUNCTION_NAME=$(aws lambda list-functions \
  --query "Functions[?contains(FunctionName, 'reconciler')].FunctionName" \
  --output text)

# Check recent invocations
aws logs tail /aws/lambda/$FUNCTION_NAME --since 1h

Manually Invoking the Lambda

# Invoke and see the log output
# (the payload goes to /dev/null so only the base64 log tail reaches the decoder)
aws lambda invoke \
  --function-name "$FUNCTION_NAME" \
  --log-type Tail \
  --query 'LogResult' \
  --output text /dev/null | base64 -d

# Or invoke and get result payload
aws lambda invoke \
  --function-name $FUNCTION_NAME \
  output.json && cat output.json

Expected output:

{"status": "ok", "added": 0, "removed": 0, "errors": []}

Verifying pmm-client on ASG Instances

# Check pmm-client status on a specific instance
aws ssm start-session --target <instance-id>

# Inside the instance:
sudo pmm-admin status
sudo pmm-admin list

# Check specific agents
sudo pmm-admin status 2>/dev/null | grep mysqld_exporter

Removing a Stale PMM Service

If a service exists in PMM for a terminated instance:

  1. Via PMM UI: Configuration > Inventory > Services > Delete
  2. Via API:
    # Get admin password
    PMM_PASSWORD=$(ih-secrets get <admin-password-secret-arn>)
    
    # List services
    curl -s -u "admin:$PMM_PASSWORD" \
      http://<pmm-private-ip>/v1/management/services | jq .
    
    # Remove a service (force=true removes dependent agents)
    curl -s -u "admin:$PMM_PASSWORD" \
      -X DELETE \
      "http://<pmm-private-ip>/v1/inventory/services/<service-id>?force=true"
    

Troubleshooting Lambda Failures

Lambda timeout (>300s):

  • First-time pmm-client install takes ~60s per instance
  • With 3+ instances processed sequentially, a run may approach the 300s limit
  • Check CloudWatch logs to see which instance is slow
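
As a rough worked example of the timeout budget, the sketch below multiplies the number of new instances by the ~60s install time and compares the result with the 300s limit. It ignores SSM dispatch and other per-run overhead, so real runs take longer than the estimate. The function name estimate_reconcile is hypothetical.

```shell
# Hypothetical helper: estimate a reconciler run that installs pmm-client
# sequentially, and compare it with the Lambda timeout.
# Usage: estimate_reconcile <new-instance-count> [seconds-per-install] [timeout]
estimate_reconcile() {
  per_install="${2:-60}"
  timeout="${3:-300}"
  total=$(( $1 * per_install ))
  echo "estimated ${total}s of ${timeout}s"
  # Succeeds only while the estimate stays below the timeout.
  [ "$total" -lt "$timeout" ]
}
```

With the defaults, estimate_reconcile 3 prints "estimated 180s of 300s" and succeeds, while estimate_reconcile 5 fails because five installs alone use the whole budget.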

SSM command failures:

  • Verify instance is InService in ASG and SSM-managed
  • Check instance IAM role has SSM permissions
  • Ensure /opt/puppetlabs/bin is in PATH (Puppet facts required)

"already exists" errors:

  • Lambda automatically removes stale services and retries
  • If the error persists, manually delete the service from PMM UI

pmm-agent connection timeout:

  • Verify security group allows port 443 from ASG SG to PMM instance
  • pmm-agent connects directly to PMM EC2 (not via ALB)
  • Check: pmm-admin status should show Connected : true

Regular Maintenance Schedule

Daily

  • Monitor CloudWatch alarms via email
  • Check backup job success (AWS Backup console)

Weekly

  • Review PMM UI performance and dashboards
  • Check disk usage trends
  • Review CloudWatch metrics

Monthly

  • Review access logs and user activity
  • Check for PMM version updates
  • Review and optimize data retention settings
  • Test database connections

Quarterly

  • Test disaster recovery procedures (see BACKUP_RESTORE.md)
  • Review and update documentation
  • Patch EC2 instance OS
  • Review CloudWatch alarm thresholds

Annually

  • Review architecture for cost optimization
  • Evaluate PMM version and plan upgrades
  • Audit IAM roles and permissions
  • Review backup retention policies

Contacts and Escalation

Primary contacts:

  • Team Email: devops@example.com
  • On-call Rotation: [PagerDuty/Slack channel]

Escalation path:

  1. L1: Team on-call engineer
  2. L2: Platform engineering lead
  3. L3: AWS Support (if AWS infrastructure issue)
  4. L4: Percona Support (if PMM software issue)

External resources:

  • Percona PMM Documentation
  • Percona Community Forums
  • AWS Support

Appendix

Useful Commands Cheat Sheet

# Instance status
aws ec2 describe-instance-status --instance-ids <id>

# Get admin password
ih-secrets get pmm-server-admin-password

# Restart PMM
ssh ubuntu@<ip> "sudo systemctl restart pmm-server"

# Check disk usage
ssh ubuntu@<ip> "df -h /srv"

# View logs
ssh ubuntu@<ip> "sudo journalctl -u pmm-server -f"

# Create manual backup
aws ec2 create-snapshot --volume-id <vol-id> --description "Manual backup"

# Check alarms
aws cloudwatch describe-alarms --alarm-name-prefix pmm-server --state-value ALARM

Terraform Commands Reference

# View current state
terraform show

# Plan changes
terraform plan

# Apply changes
terraform apply

# Target specific resource
terraform apply -target=module.pmm.aws_ebs_volume.pmm_data

# Import existing resource
terraform import module.pmm.aws_ebs_volume.pmm_data vol-xxxxx

# View outputs
terraform output

Document Revision History

Date        Version  Changes                   Author
2024-01-15  1.0      Initial runbook creation  DevOps Team

Last updated: 2024-01-15
Review cycle: Quarterly
Next review: 2024-04-15