Operations¶
Day-to-day cluster management uses infrahouse-toolkit, specifically the ih-elastic command. It is pre-installed on every cluster node by Puppet.
ih-elastic overview¶
| Option | Description |
|---|---|
--debug | Enable debug logging |
--quiet | Suppress info messages, only show warnings and errors |
--username TEXT | Elasticsearch username (default: elastic) |
--password TEXT | Password (auto-read from Puppet facts / Secrets Manager) |
--password-secret TEXT | AWS Secrets Manager secret id with the password |
--es-protocol TEXT | Protocol (default: http) |
--es-host TEXT | Elasticsearch host (default: 127.0.1.1) |
--es-port INTEGER | Port (default: 9200) |
--format [text/json/cbor/yaml/smile] | Output format |
All commands below are run on a cluster node (SSH in first). The tool auto-discovers credentials from Puppet facts or Secrets Manager, so you typically don't need to pass --password.
Check cluster health¶
Example output:
{
"cluster_name": "elastic",
"status": "green",
"number_of_nodes": 6,
"number_of_data_nodes": 3,
"active_primary_shards": 167,
"active_shards": 433,
"relocating_shards": 0,
"initializing_shards": 0,
"unassigned_shards": 0,
"active_shards_percent_as_number": 100.0
}
This is the first thing to run when diagnosing issues. A healthy cluster shows "status": "green" and zero unassigned/relocating shards.
Inspect shards¶
Lists all shards and which nodes they're allocated to. Useful for spotting unbalanced shard distribution or unassigned shards.
Diagnose allocation problems¶
When shards are unassigned, this explains why Elasticsearch can't allocate them. Use ih-elastic cat shards first to find the problematic index and shard ID.
| Option | Description |
|---|---|
--index TEXT | Index name (from ih-elastic cat shards) |
--shard INTEGER | Shard ID |
--primary / --replica | Explain primary or replica shard |
Connecting to the cluster¶
Via ALB (external access)¶
The module creates HTTPS endpoints behind Application Load Balancers. Use the elastic superuser password from Secrets Manager:
# Get the password
ELASTIC_PASSWORD=$(aws secretsmanager get-secret-value \
--secret-id "$(terraform output -raw elastic_secret_id)" \
--query SecretString --output text)
# Query the cluster
curl -u "elastic:${ELASTIC_PASSWORD}" \
"https://my-cluster.example.com/_cluster/health?pretty"
| Endpoint | Points to |
|---|---|
https://{cluster_name}.{zone} | Master nodes (primary) |
https://{cluster_name}-master.{zone} | Master nodes (explicit) |
https://{cluster_name}-data.{zone} | Data nodes |
The ALB terminates TLS (Let's Encrypt certificates). Traffic from the ALB to instances is HTTP on port 9200.
Via SSH (on-node access)¶
SSH into any cluster node and use ih-elastic directly. No password needed -- credentials are auto-discovered from Puppet facts:
Anonymous monitoring access¶
The cluster configures an anonymous_monitor role that allows unauthenticated read access to monitoring endpoints. This enables Prometheus exporters to scrape metrics without credentials.
Node discovery and configuration¶
Nodes discover each other using the discovery-ec2 plugin. Puppet installs it automatically. Discovery is based on EC2 tags:
clustertag matchesvar.cluster_nameenvironmenttag matchesvar.environment
Each node is availability-zone aware -- Elasticsearch uses the zone node attribute for shard allocation awareness, distributing primary and replica shards across AZs.
Key configuration (/etc/elasticsearch/elasticsearch.yml)¶
| Setting | Value |
|---|---|
network.host | _ec2_ (binds to the instance's private IP) |
discovery.seed_providers | ec2 |
cluster.routing.allocation.awareness.attributes | zone |
xpack.security.enabled | true |
xpack.security.transport.ssl.enabled | true (inter-node TLS) |
xpack.security.http.ssl.enabled | false (ALB handles HTTPS) |
xpack.security.audit.enabled | true |
bootstrap.memory_lock | true (when var.memory_lock = true) |
JVM heap sizing¶
Puppet sets the JVM heap to 50% of instance RAM automatically (/etc/elasticsearch/jvm.options.d/heap.options). The other 50% is left for Lucene filesystem cache and the OS. This is the Elasticsearch recommended split.
TLS certificates¶
Inter-node transport (port 9300) is encrypted with TLS. The certificate chain is:
- CA cert/key -- generated by Terraform (
tls_self_signed_cert), stored in Secrets Manager, shared by all nodes. - Node cert/key -- generated on each instance by Puppet using
openssl. The node cert is signed by the CA cert.
Puppet reads the CA cert and key from Secrets Manager and generates a per-node certificate at /etc/elasticsearch/tls/. Certificates are valid for 10 years.
HTTP (port 9200) does not use node-level TLS -- the ALB terminates HTTPS with Let's Encrypt certificates instead.
Node lifecycle¶
The module uses ASG lifecycle hooks to safely add and remove nodes. These commands are called automatically by cloud-init / Puppet, but you can also invoke them manually.
Commission a node¶
Called automatically when a new instance launches. It:
- Waits for shard relocation to finish (up to 48 hours by default).
- Extends the ASG lifecycle heartbeat while waiting.
- Completes the lifecycle hook so the ASG marks the instance as healthy.
| Option | Description |
|---|---|
--wait-until-complete INTEGER | Max wait seconds (default: 172800) |
--complete-lifecycle-action TEXT | Lifecycle hook name to complete |
Decommission a node¶
ih-elastic cluster decommission-node \
--reason "instance refresh" \
--complete-lifecycle-action \
--only-if-terminating
Called automatically when the ASG terminates an instance. It:
- Checks cluster health is green (aborts if not, to prevent data loss).
- Registers a node shutdown with Elasticsearch so shards migrate away.
- Waits for shard migration to complete (up to 1 hour by default).
- Completes the terminating lifecycle hook.
| Option | Description |
|---|---|
--reason TEXT | Why the node is being decommissioned (required) |
--only-if-terminating | Only act if instance is in Terminating:Wait |
--wait-until-complete INTEGER | Max wait seconds (default: 3600) |
--complete-lifecycle-action | Complete the lifecycle hook when done |
Warning
If the cluster status is not green, decommission-node refuses to proceed and cancels the instance refresh. This prevents cascading failures.
Snapshots and backups¶
The module creates an S3 bucket for snapshots and Puppet registers it as a repository named backups. Backups are fully automated via Elasticsearch SLM (Snapshot Lifecycle Management) policies.
Automated backup schedule¶
Puppet configures four SLM policies:
| Policy | Schedule | Retention | Max snapshots |
|---|---|---|---|
hourly-snapshots | Every hour | 48 hours | 48 |
daily-snapshots | Daily at 01:30 UTC | 14 days | 14 |
weekly-snapshots | Monday at 01:30 UTC | 60 days | 8 |
monthly-snapshots | 1st of month at 01:30 UTC | 365 days | 12 |
All snapshots include global state and are stored in the S3 backups repository. Older snapshots are automatically deleted per the retention rules.
List snapshots¶
Example output:
id repository status start_epoch start_time end_epoch end_time duration indices
elastic-2024-02-20_19-19-54.544449 backups SUCCESS 1708456794 19:19:54 1708456796 19:19:56 1.8s 33
elastic-2024-02-20_19-43-51.722634 backups SUCCESS 1708458231 19:43:51 1708458233 19:43:53 1.6s 33
Manual snapshots¶
| Subcommand | Description |
|---|---|
create | Create a snapshot in a repository |
restore | Restore a snapshot |
status | Check snapshot progress |
create-repository | Register a new snapshot repository |
delete-repository | Remove a snapshot repository |
Take a manual snapshot:
Restore from a snapshot:
Change passwords¶
Changes the password for an Elasticsearch user. The new password is auto-generated and stored in Secrets Manager.
Raw API calls¶
For any Elasticsearch API not covered by ih-elastic subcommands:
ih-elastic api GET /_cat/nodes?v
ih-elastic api GET /_cluster/settings?pretty
ih-elastic api PUT /_cluster/settings -d '{"persistent":{"cluster.routing.allocation.enable":"all"}}'
Automatic decommission cron¶
Puppet installs a cron job on every node that runs every 5 minutes:
*/5 * * * * ih-elastic --quiet cluster decommission-node \
--only-if-terminating --reason instance_refresh \
--complete-lifecycle-action --wait-until-complete 172800
This ensures that when the ASG starts terminating an instance (e.g., during an instance refresh), the node gracefully drains its shards before the instance is actually terminated. The --only-if-terminating flag means the cron is a no-op on healthy running nodes.
Prometheus monitoring¶
Each node runs two Prometheus exporters:
- prometheus-node-exporter (port 9100) -- system metrics (CPU, memory, disk, network)
- prometheus-elasticsearch-exporter (port 9114) -- Elasticsearch cluster and node metrics
The module creates a security group (var.monitoring_cidr_block) that allows scraping from your monitoring network. Configure your Prometheus to scrape:
scrape_configs:
- job_name: elasticsearch-nodes
static_configs:
- targets:
- <node-ip>:9100 # node exporter
- <node-ip>:9114 # elasticsearch exporter
The elasticsearch exporter authenticates to Elasticsearch using the elastic superuser password (read from Secrets Manager) and connects over the local loopback interface.
Kibana¶
To add a web UI for your cluster, use the terraform-aws-kibana module. It deploys Kibana on ECS with an ALB, auto-provisioned TLS, and Route53 DNS -- pointing at your Elasticsearch cluster.
module "kibana" {
source = "registry.infrahouse.com/infrahouse/kibana/aws"
version = "2.0.0"
providers = {
aws = aws
aws.dns = aws
}
elasticsearch_cluster_name = "my-cluster"
elasticsearch_url = module.elasticsearch.cluster_master_url
kibana_system_password = module.elasticsearch.kibana_system_password
environment = "production"
zone_id = data.aws_route53_zone.main.zone_id
subnet_ids = module.service-network.subnet_private_ids
key_pair_name = aws_key_pair.my_key.key_name
alarm_emails = ["ops-team@example.com"]
}
After deployment, Kibana is available at https://kibana.{zone}.
Installing infrahouse-toolkit¶
ih-elastic is pre-installed on cluster nodes. To install it elsewhere (e.g., a bastion host):
When running from a remote host, pass connection details: