gke_runwhen-nonprod-sandbox_us-central1_sandbox-cluster-1-cluster Cluster Node Health

1 Troubleshooting Commands

Last updated 9 weeks ago

Contributed by stewartshea



Troubleshooting Commands

Check for Node Restarts in Cluster gke_runwhen-nonprod-sandbox_us-central1_sandbox-cluster-1-cluster

What does it do?

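This Bash script retrieves node-related events from the Kubernetes cluster selected by CONTEXT, limited to the time window given by INTERVAL. It keeps the events that indicate a node starting or stopping (such as preemption, shutdown, drain, termination, removal, or registration), summarizes them by node, and uses node labels to mark each node as a preemptible or spot instance on GCP, AWS, or Azure. It ends with a count of the unique nodes that were started and stopped.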
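The time-window step works because ISO 8601 UTC timestamps in a fixed format sort lexicographically, so awk can compare them as plain strings. The following is a minimal sketch of that comparison; the helper name `filter_window`, the node names, and the timestamps are hypothetical sample data, not output from a real cluster.

```bash
#!/usr/bin/env bash
# Sketch: keep only event lines whose first field (an ISO 8601 UTC timestamp)
# falls inside a window. Timestamps are compared as strings.
# filter_window and all sample data below are hypothetical.

filter_window() {
    # $1 = window start, $2 = window end; event lines are read from stdin.
    awk -v start="$1" -v end="$2" '$1 >= start && $1 <= end'
}

sample_events='2024-01-01T09:55:00Z node-a NodeReady Node node-a status is now: NodeReady
2024-01-01T10:05:00Z node-b Shutdown Shutdown manager detected shutdown event
2024-01-01T10:20:00Z node-c RegisteredNode Node node-c event: Registered Node node-c in Controller'

# Only the node-b line lies between 10:00 and 10:10.
printf '%s\n' "$sample_events" \
    | filter_window "2024-01-01T10:00:00Z" "2024-01-01T10:10:00Z"
```

The string comparison is only reliable when every timestamp uses the same format and time zone, which is why the script generates both window bounds with `date -u`.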

Command
CONTEXT="gke_runwhen-nonprod-sandbox_us-central1_sandbox-cluster-1-cluster" KUBERNETES_DISTRIBUTION_BINARY="kubectl" INTERVAL="10 minutes"  bash -c "$(curl -s https://raw.githubusercontent.com/runwhen-contrib/rw-cli-codecollection/main/codebundles/k8s-cluster-node-health/node_restart_check.sh)" _
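INTERVAL is handed to GNU `date -d` with the word "ago" appended, so any relative phrase that GNU date accepts should work. The sketch below previews the window start for a few values; it assumes GNU coreutils `date` (the BSD/macOS `date` does not accept `-d` in this form), and the helper name `window_start` is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: preview the start of the event window for a given INTERVAL value.
# Requires GNU date. window_start is a hypothetical helper name.

window_start() {
    # $1 = interval phrase, e.g. "10 minutes" or "24 hours"
    date -u -d "$1 ago" +"%Y-%m-%dT%H:%M:%SZ"
}

for interval in "10 minutes" "24 hours" "2 days"; do
    echo "INTERVAL=\"$interval\" -> events since $(window_start "$interval")"
done
```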

Learn more

This multi-line content is auto-generated and used for educational purposes. Copying and pasting the multi-line text might not function as expected.

#!/bin/bash

# Read the context passed as an environment variable
context=$CONTEXT

# Set the time interval for fetching the events (e.g., 24 hours)
interval=$INTERVAL

# Get the current date and time in ISO 8601 format
CURRENT_DATE=$(date -u +"%Y-%m-%dT%H:%M:%SZ")

# Calculate the start date for the specified time interval using GNU date
START_DATE=$(date -u -d "$interval ago" +"%Y-%m-%dT%H:%M:%SZ")

# Fetch all node-related events within the specified time range using Kubernetes kubectl command and output it to a file
kubectl get events -A --context "$context" \
  --field-selector involvedObject.kind=Node \
  --output=jsonpath='{range .items[*]}{.lastTimestamp}{" "}{.involvedObject.name}{" "}{.reason}{" "}{.message}{"\n"}{end}' \
  | awk -v start="$START_DATE" -v end="$CURRENT_DATE" '$1 >= start && $1 <= end' \
  | grep -E "(Preempt|Shutdown|Drain|Termination|Removed|RemovingNode|Deleted|NodeReady|RegisteredNode)" \
  | sort -u | sort -s -k2,2 > node_events.txt  # de-duplicate, then group by node while keeping time order

# Function to check if a node is preemptible/spot based on its labels
check_preemptible_node() {
    node=$1
    # Read the preemptible/spot labels used by GCP (GKE), AWS (EKS), and Azure (AKS),
    # querying the same context the events were fetched from
    is_preemptible=$(kubectl get node "$node" --context "$context" -o jsonpath='{.metadata.labels.cloud\.google\.com/gke-preemptible}' 2>/dev/null)
    is_spot=$(kubectl get node "$node" --context "$context" -o jsonpath='{.metadata.labels.eks\.amazonaws\.com/capacityType}' 2>/dev/null)
    is_azure_spot=$(kubectl get node "$node" --context "$context" -o jsonpath='{.metadata.labels.kubernetes\.azure\.com/scalesetpriority}' 2>/dev/null)

    # Output the result based on the label values; a node that no longer exists
    # returns empty values and falls through to "Unidentified/Unplanned"
    if [[ "$is_preemptible" == "true" ]]; then
        echo "Preemptible (GCP)"
    elif [[ "$is_spot" == "SPOT" ]]; then
        echo "Spot (AWS)"
    elif [[ "$is_azure_spot" == "spot" ]]; then
        echo "Spot (Azure)"
    else
        echo "Unidentified/Unplanned"
    fi
}

# Track unique nodes started and stopped using associative arrays
declare -A nodes_started
declare -A nodes_stopped

# Read the node events from the file and summarize by node
while read -r line; do
    node=$(echo "$line" | awk '{print $2}')
    preempt_status=$(check_preemptible_node "$node")

    # Print a node header whenever the node differs from the previous line's node
    if [[ "$current_node" != "$node" ]]; then
        if [[ -n "$current_node" ]]; then
            echo ""  # Empty line between different nodes for readability
        fi
        echo "Node: $node"
        echo "Type: $preempt_status"
        echo "Activities:"
        current_node="$node"
    fi

    # Determine if the node was started or stopped and store the information in the associative arrays
    if echo "$line" | grep -qE "(NodeReady|RegisteredNode)"; then
        nodes_started["$node"]=1
    elif echo "$line" | grep -qE "(Shutdown|Preempt|Termination|Removed)"; then
        nodes_stopped["$node"]=1
    fi

    # Print the event details for the node
    echo "  - $line"
done < node_events.txt

# Summary of unique nodes started and stopped
unique_nodes_started=${#nodes_started[@]}
unique_nodes_stopped=${#nodes_stopped[@]}
# Sum of the two unique-node counts, not a count of individual event lines
total_node_events=$((unique_nodes_started + unique_nodes_stopped))

# Print the summary of unique nodes started, stopped, and total start/stop events
echo ""
echo "Summary:"
echo "Unique nodes started: $unique_nodes_started"
echo "Unique nodes stopped: $unique_nodes_stopped"
echo "Total start/stop events: $total_node_events"
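The classification and tally logic can be tried without a cluster by feeding it sample data. The sketch below is a simplified stand-in, not the script itself: `classify_node` is a hypothetical helper that takes the three label values as arguments instead of calling kubectl, the events are made up, and matching is done on the reason field only, whereas the script greps the whole event line.

```bash
#!/usr/bin/env bash
# Dry-run sketch of the classification and tally logic; no cluster required.
# classify_node and the sample events are hypothetical stand-ins.

# Decide the node type from three label values passed as arguments:
# $1 = GKE preemptible label, $2 = EKS capacityType, $3 = AKS scalesetpriority.
classify_node() {
    if [[ "$1" == "true" ]]; then
        echo "Preemptible (GCP)"
    elif [[ "$2" == "SPOT" ]]; then
        echo "Spot (AWS)"
    elif [[ "$3" == "spot" ]]; then
        echo "Spot (Azure)"
    else
        echo "Unidentified/Unplanned"
    fi
}

# Associative arrays keyed by node name, so repeated events count once per node.
declare -A nodes_started nodes_stopped

# Each sample line: timestamp, node name, reason (message omitted for brevity).
while read -r _ node reason; do
    case "$reason" in
        NodeReady|RegisteredNode) nodes_started["$node"]=1 ;;
        Shutdown*|Preempt*|Termination*|Removed*) nodes_stopped["$node"]=1 ;;
    esac
done <<'EOF'
2024-01-01T10:01:00Z node-a Shutdown
2024-01-01T10:02:00Z node-a Shutdown
2024-01-01T10:04:00Z node-b RegisteredNode
2024-01-01T10:05:00Z node-b NodeReady
EOF

echo "node-a type: $(classify_node "true" "" "")"
echo "Unique nodes started: ${#nodes_started[@]}"
echo "Unique nodes stopped: ${#nodes_stopped[@]}"
```

With this sample input, two Shutdown events on node-a count as one stopped node, and the RegisteredNode and NodeReady events on node-b count as one started node.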
Helpful Links