Introduction

Monitoring Kubernetes nodes with Prometheus and Grafana is an important step toward running stable clusters in production, including disconnected or air-gapped ones. Kubernetes can reschedule pods when a node fails, but that does not replace the need to understand why the node became unhealthy in the first place. For hardware status monitoring, the recommended approach is to run Prometheus, node-exporter, and Grafana as one monitoring stack. This article explains how to monitor Kubernetes node hardware status with that stack.

The Main Question

If we want to monitor the hardware status of nodes in Kubernetes, the best way is to collect host-level metrics from every Kubernetes node and visualize them together with Kubernetes object status.

The recommended stack is:

  • node-exporter for Linux node hardware and operating system metrics
  • Prometheus for metric scraping, storage, and PromQL queries
  • kube-state-metrics for Kubernetes object state
  • Alertmanager for alert routing and notification
  • Grafana for dashboards and operational visibility

For most Kubernetes environments, the easiest and cleanest deployment method is the kube-prometheus-stack Helm chart. It deploys the main components together and provides a production-oriented starting point.

Why Node Hardware Monitoring Matters

Kubernetes nodes are still physical or virtual machines. Even when the platform abstracts workloads into pods, the real limits are still on the node level.

A node can become unstable because of:

  • High CPU usage
  • Memory pressure
  • Disk pressure
  • Filesystem saturation
  • Slow or failing disks
  • Network packet drops
  • Network interface errors
  • Kernel-level resource exhaustion
  • Node reboot or hardware failure

Kubernetes may show that a node is NotReady, but this is usually a late signal. Good hardware monitoring detects the problem earlier, before scheduling, application availability, or storage performance are affected.

Procedure

Step 1: Create a Monitoring Namespace

Create a dedicated namespace for the monitoring stack.

$ kubectl create namespace monitoring

Step 2: Add the Prometheus Community Helm Repository

Add the Helm repository that contains kube-prometheus-stack.

$ helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
$ helm repo update

Step 3: Create a Values File

To customize the deployment for your Kubernetes environment, use a values.yaml file that overrides a few of the chart defaults.

Create a file named values.yaml.

grafana:
  enabled: true

  ingress:
    enabled: true
    ingressClassName: nginx # or cilium
    hosts:
    - grafana.k8s.co.il
    path: /
    pathType: Prefix

prometheus:
  prometheusSpec:
    retention: 15d
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: fast-storage
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 100Gi

In production, the Grafana password should not be stored directly in a plain values file. It is better to use an existing Kubernetes Secret or an external secret management solution.
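One way to keep the password out of values.yaml is to point the Grafana subchart at a Secret that already exists in the monitoring namespace. The fragment below is a minimal sketch, assuming a Secret named grafana-admin-credentials (a hypothetical name) that holds the keys admin-user and admin-password. Check the key names against the Grafana chart version you deploy.

```yaml
grafana:
  admin:
    # Hypothetical Secret name; create it in the monitoring namespace before installing
    existingSecret: grafana-admin-credentials
    # Keys inside the Secret that hold the admin username and password
    userKey: admin-user
    passwordKey: admin-password
```

With this in place, the admin credentials are read from the referenced Secret rather than from the values file.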

Step 4: Install kube-prometheus-stack

Install the stack into the monitoring namespace.

$ helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  -f values.yaml

This installs the core monitoring components:

  • Prometheus
  • Prometheus Operator
  • Alertmanager
  • Grafana
  • node-exporter
  • kube-state-metrics
  • default dashboards
  • default alert rules

Step 5: Verify the Monitoring Pods

Check that the monitoring pods are running.

$ kubectl get pods -n monitoring

You should see pods similar to:

$ kubectl get pods -n monitoring
NAME                                                        READY   STATUS        RESTARTS   AGE
alertmanager-kube-prometheus-stack-alertmanager-0           2/2     Running       0          7m32s
kube-prometheus-stack-grafana-864d797868-gn5zg              3/3     Running       0          7m41s
kube-prometheus-stack-kube-state-metrics-6b747b7c5f-8f26r   1/1     Running       0          7m41s
kube-prometheus-stack-operator-5d679799b9-64wsq             1/1     Running       0          7m41s
kube-prometheus-stack-prometheus-node-exporter-7cbkx        1/1     Running       0          7m41s
kube-prometheus-stack-prometheus-node-exporter-j2t2r        1/1     Running       0          7m41s

The prometheus-node-exporter component is usually deployed as a DaemonSet so that every node exposes hardware and operating system metrics.

Step 6: Check the Node Exporter DaemonSet

Run:

$ kubectl get daemonset -n monitoring

Look for:

$ kubectl get daemonset -n monitoring
NAME                                             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
kube-prometheus-stack-prometheus-node-exporter   4         4         2       4            2           kubernetes.io/os=linux   8m10s

Then check if it has one pod per node:

$ kubectl get pods -n monitoring -o wide | grep node-exporter

If a node does not have a node-exporter pod, Prometheus will not collect hardware metrics from that node.

Step 7: Verify the Grafana Ingress

Check that the Grafana Ingress was created.

$ kubectl get ingress -n monitoring

Expected output should include an Ingress for Grafana:

NAME                            CLASS   HOSTS               ADDRESS
kube-prometheus-stack-grafana   nginx   grafana.k8s.co.il   192.168.10.50

Step 8: Access Grafana Through Ingress

Open Grafana in the browser:

http://grafana.k8s.co.il

The admin password can be retrieved using:

$ kubectl --namespace monitoring get secrets kube-prometheus-stack-grafana -o jsonpath="{.data.admin-password}" | base64 -d ; echo

After logging in, search for dashboards related to:

  • Node Exporter
  • Nodes
  • Kubernetes compute resources
  • Kubernetes cluster overview

The node-exporter dashboards are usually the most useful for hardware monitoring. They show CPU, memory, filesystem, disk I/O, and network metrics.

What Should Be Monitored

A good Kubernetes node hardware monitoring design should include several layers.

CPU

CPU monitoring should include total usage, idle time, system time, user time, steal time, and load average. In virtualized environments, CPU steal time is especially important because it may show that the VM is waiting for CPU resources from the hypervisor.

Useful Prometheus metrics include:

node_cpu_seconds_total
node_load1
node_load5
node_load15

Example PromQL query for CPU usage:

100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
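Because steal time is worth tracking separately in virtualized environments, it can be precomputed as a recording rule and then graphed in Grafana. The manifest below is a sketch, assuming the Prometheus Operator CRDs installed by kube-prometheus-stack and a Helm release named kube-prometheus-stack. The rule name instance:node_cpu_steal:percent5m and the object name are illustrative choices, not something the chart provides.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-cpu-steal-recording   # hypothetical object name
  namespace: monitoring
  labels:
    # Assumed to match the Helm release name so Prometheus selects this rule
    release: kube-prometheus-stack
spec:
  groups:
  - name: node.cpu.recording
    rules:
    # Average percentage of CPU time stolen by the hypervisor, per node
    - record: "instance:node_cpu_steal:percent5m"
      expr: avg by(instance) (rate(node_cpu_seconds_total{mode="steal"}[5m])) * 100
```

A sustained non-zero value here suggests the VM is competing for physical CPU, which the overall usage query above does not make obvious on its own.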

Memory

Memory monitoring should not only show used memory. Linux uses memory for cache and buffers, so it is better to monitor available memory.

Useful metrics include:

node_memory_MemTotal_bytes
node_memory_MemAvailable_bytes
node_memory_Buffers_bytes
node_memory_Cached_bytes

Example PromQL query for memory usage:

100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))

Disk Capacity

Disk monitoring should include filesystem usage, available space, inode usage, and mount status. This is critical for Kubernetes nodes because full filesystems can affect container runtime, image pulls, logs, and local volumes.

Useful metrics include:

node_filesystem_size_bytes
node_filesystem_avail_bytes
node_filesystem_files
node_filesystem_files_free

Example PromQL query for filesystem usage:

100 * (1 - (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}))
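Inode exhaustion can fill a filesystem even when free space remains, so the same pattern can be applied to the inode metrics listed above. This is a sketch under assumptions: the 85% threshold, the 15 minute window, and the object name are illustrative, and the release label assumes a Helm release named kube-prometheus-stack.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-inode-alerts   # hypothetical object name
  namespace: monitoring
  labels:
    # Assumed to match the Helm release name so Prometheus selects this rule
    release: kube-prometheus-stack
spec:
  groups:
  - name: node.inode.rules
    rules:
    - alert: NodeFilesystemInodesAlmostExhausted
      # Percentage of inodes in use, excluding tmpfs and overlay mounts
      expr: 100 * (1 - (node_filesystem_files_free{fstype!~"tmpfs|overlay"} / node_filesystem_files{fstype!~"tmpfs|overlay"})) > 85
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: "Filesystem inodes almost exhausted"
        description: "Mount {{ $labels.mountpoint }} on {{ $labels.instance }} is above 85% inode usage."
```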

Disk I/O

Disk I/O monitoring helps detect slow disks, overloaded storage, or noisy workloads. This is important for nodes running databases, logging workloads, container registries, or local persistent volumes.

Useful metrics include:

node_disk_read_bytes_total
node_disk_written_bytes_total
node_disk_io_time_seconds_total
node_disk_reads_completed_total
node_disk_writes_completed_total

Example PromQL query for disk utilization:

rate(node_disk_io_time_seconds_total[5m]) * 100
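The utilization query can also back a saturation alert. The following is a sketch, assuming a 90% threshold held for 15 minutes is meaningful for your disks; both numbers and the object name are illustrative and should be tuned against real workloads.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-disk-io-alerts   # hypothetical object name
  namespace: monitoring
  labels:
    # Assumed to match the Helm release name so Prometheus selects this rule
    release: kube-prometheus-stack
spec:
  groups:
  - name: node.diskio.rules
    rules:
    - alert: NodeDiskIOSaturated
      # Percentage of time the device spent doing I/O over the last 5 minutes
      expr: rate(node_disk_io_time_seconds_total[5m]) * 100 > 90
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: "Disk I/O saturation on node"
        description: "Device {{ $labels.device }} on {{ $labels.instance }} has been busy more than 90% of the time for 15 minutes."
```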

Network

Network monitoring should include traffic, packet drops, errors, and interface status. In Kubernetes, network issues may appear as application errors, DNS failures, pod-to-pod communication failures, or storage connectivity problems.

Useful metrics include:

node_network_receive_bytes_total
node_network_transmit_bytes_total
node_network_receive_errs_total
node_network_transmit_errs_total
node_network_receive_drop_total
node_network_transmit_drop_total

Example PromQL query for receive errors:

rate(node_network_receive_errs_total[5m])
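Interface errors in either direction can be turned into alerts in the same way. This sketch assumes that any sustained error rate above zero deserves attention and that loopback and veth interfaces can be ignored; the device filter, thresholds, and object name are assumptions to adjust for your CNI.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-network-alerts   # hypothetical object name
  namespace: monitoring
  labels:
    # Assumed to match the Helm release name so Prometheus selects this rule
    release: kube-prometheus-stack
spec:
  groups:
  - name: node.network.rules
    rules:
    - alert: NodeNetworkReceiveErrors
      # Receive errors per second, skipping loopback and veth devices
      expr: rate(node_network_receive_errs_total{device!~"lo|veth.*"}[5m]) > 0
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Network receive errors on node"
        description: "Interface {{ $labels.device }} on {{ $labels.instance }} is reporting receive errors."
    - alert: NodeNetworkTransmitErrors
      # Transmit errors per second, same device filter
      expr: rate(node_network_transmit_errs_total{device!~"lo|veth.*"}[5m]) > 0
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Network transmit errors on node"
        description: "Interface {{ $labels.device }} on {{ $labels.instance }} is reporting transmit errors."
```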

Node Status

Hardware monitoring should be connected with Kubernetes node status. This is where kube-state-metrics is useful.

Useful metrics include:

kube_node_status_condition
kube_node_info
kube_node_spec_taint

Example query for nodes that are not ready:

kube_node_status_condition{condition="Ready",status="true"} == 0
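The readiness query can be wrapped in an alert rule. This is a sketch, assuming a 5 minute delay is acceptable before notifying and using a hypothetical object name. The chart's default alert rules may already cover a similar condition, so check for overlap before enabling it.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-status-alerts   # hypothetical object name
  namespace: monitoring
  labels:
    # Assumed to match the Helm release name so Prometheus selects this rule
    release: kube-prometheus-stack
spec:
  groups:
  - name: node.status.rules
    rules:
    - alert: NodeNotReady
      # The Ready condition reported by kube-state-metrics is not true
      expr: kube_node_status_condition{condition="Ready",status="true"} == 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Node is not ready"
        description: "Node {{ $labels.node }} has not been Ready for more than 5 minutes."
```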

Grafana Dashboard Examples

The Helm chart ships with built-in dashboards. Here are examples of dashboards for what should be monitored:

Node Monitoring

Network Monitoring

In this example, the longhorn-system namespace is using 90 Mb/s ingress and 80.4 Mb/s egress.

Kubelet

This dashboard covers the entire cluster: in our case, 4 nodes running 81 containers in 62 pods.

CoreDNS

This example shows the CoreDNS dashboard. CoreDNS is the main DNS service in Kubernetes, so it is important to know whether it is functioning properly. The dashboard shows how many requests were made inside the cluster, broken down by type.

API Server

Create Basic Hardware Alerts

A monitoring system is not complete without alerts. Dashboards are good for investigation, but alerts are what drive operational response.

Here’s an example alert for high CPU usage:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
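metadata:
  name: node-hardware-alerts
  namespace: monitoring
  labels:
    # By default the chart's Prometheus only selects rules labeled with the Helm release name
    release: kube-prometheus-stack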
spec:
  groups:
  - name: node.hardware.rules
    rules:
    - alert: NodeHighCPUUsage
      expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "High CPU usage on node"
        description: "Node {{ $labels.instance }} has CPU usage above 85% for more than 10 minutes."

And here is another example alert for low memory:

    - alert: NodeLowAvailableMemory
      expr: 100 * (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 10
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Low available memory on node"
        description: "Node {{ $labels.instance }} has less than 10% available memory."

Lastly, an example alert for high filesystem usage:

    - alert: NodeFilesystemAlmostFull
      expr: 100 * (1 - (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"})) > 85
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: "Filesystem almost full"
        description: "Filesystem on {{ $labels.instance }} is above 85% usage."

Save the rules above in a file named node-hardware-alerts.yaml, then apply it:

$ kubectl apply -f node-hardware-alerts.yaml

Configure Alertmanager

Alertmanager should route alerts to the right team or system. In a production environment, this can be email, Slack, Microsoft Teams, PagerDuty, Opsgenie, or a webhook.

A simple route may look like this:

route:
  receiver: platform-team
  group_by:
  - alertname
  - instance
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
- name: platform-team
  email_configs:
  - to: platform-team@example.com
    from: prometheus@example.com
    smarthost: smtp.example.com:587
    auth_username: prometheus@example.com
    auth_password: example-password

In production, passwords should be managed through Kubernetes Secrets or an external secret management solution.
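With kube-prometheus-stack, the Alertmanager configuration can be supplied through the chart values, and the SMTP password can be read from a mounted Secret instead of being written inline. The fragment below is a sketch under several assumptions: a Secret named alertmanager-smtp (hypothetical) with a key named password exists in the monitoring namespace, and the deployed Alertmanager version supports the auth_password_file option. Verify both against the versions you run.

```yaml
alertmanager:
  alertmanagerSpec:
    # Mounts the Secret under /etc/alertmanager/secrets/alertmanager-smtp/
    secrets:
    - alertmanager-smtp   # hypothetical Secret with a key named "password"
  config:
    route:
      receiver: platform-team
      group_by:
      - alertname
      - instance
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
    receivers:
    # Kept because the chart's default routes may still reference a receiver named "null"
    - name: "null"
    - name: platform-team
      email_configs:
      - to: platform-team@example.com
        from: prometheus@example.com
        smarthost: smtp.example.com:587
        auth_username: prometheus@example.com
        # Read the password from the mounted Secret instead of inlining it
        auth_password_file: /etc/alertmanager/secrets/alertmanager-smtp/password
```

Helm replaces lists rather than merging them, which is why the receivers list here restates every receiver the final configuration needs.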

Production Considerations

Use Persistent Storage for Prometheus

Prometheus should use persistent storage so that metrics are not lost when the Prometheus pod restarts.

Example values:

prometheus:
  prometheusSpec:
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: fast-storage
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 100Gi

Set Retention Policy

Decide how long metrics should be stored locally.

prometheus:
  prometheusSpec:
    retention: 15d

For long-term storage, consider remote write to a long-term metrics platform.
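Remote write is configured on the same prometheusSpec block. This is a minimal sketch: the URL is a placeholder for whatever remote write endpoint your long-term metrics platform exposes, and authentication settings are omitted.

```yaml
prometheus:
  prometheusSpec:
    # Local retention stays short; older data is served by the remote platform
    retention: 15d
    remoteWrite:
    # Placeholder endpoint; replace with your platform's remote write URL
    - url: https://metrics.example.com/api/v1/write
```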

Secure Grafana Access

Grafana should not be exposed without proper access control. Consider the following:

  • HTTPS with a valid certificate
  • SSO integration
  • a strong admin password with regular rotation
  • limited public exposure
  • network restrictions
  • separate read-only users for dashboards

Grafana is an operational interface. It should be treated as part of the platform management layer and not as a public application.
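As a starting point for HTTPS, the Grafana Ingress from the values file can be extended with a TLS section. This sketch assumes a TLS Secret named grafana-tls (a hypothetical name) already exists in the monitoring namespace, for example one issued by your certificate tooling.

```yaml
grafana:
  ingress:
    enabled: true
    ingressClassName: nginx
    hosts:
    - grafana.k8s.co.il
    path: /
    pathType: Prefix
    tls:
    # Hypothetical Secret containing the certificate and key for the host below
    - secretName: grafana-tls
      hosts:
      - grafana.k8s.co.il
```

Once TLS is enabled, Grafana is reached at the https:// address of the same host.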

Avoid Too Many Alerts

Hardware alerts should be useful and actionable. Too many alerts create noise and reduce trust in the monitoring system.

Start with a small set:

  • Node not ready
  • High CPU for a sustained period
  • Low available memory
  • Filesystem almost full
  • Disk I/O saturation
  • Network errors or drops
  • node-exporter target down

Then improve the alert rules based on real incidents.
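The last item in the list, a node-exporter target going down, can be expressed with the built-in up metric. This is a sketch, assuming the node-exporter scrape job is labeled job="node-exporter" in your deployment; confirm the label on the Prometheus targets page. The chart's default rules may already include a generic target-down alert, so check for duplicates.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-exporter-target-alerts   # hypothetical object name
  namespace: monitoring
  labels:
    # Assumed to match the Helm release name so Prometheus selects this rule
    release: kube-prometheus-stack
spec:
  groups:
  - name: node.exporter.rules
    rules:
    - alert: NodeExporterTargetDown
      # up is 0 when Prometheus fails to scrape the target
      expr: up{job="node-exporter"} == 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "node-exporter target is down"
        description: "Prometheus cannot scrape node-exporter on {{ $labels.instance }}, so hardware metrics for that node are missing."
```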

What About Vendor Hardware Sensors?

node-exporter provides strong operating system and hardware-related metrics, but it does not always expose every physical hardware sensor.

For deeper hardware status, such as power supply, fan status, RAID controller state, disk SMART status, or server temperature, additional exporters may be required.

Examples include:

  • IPMI exporter for server management interfaces
  • SMART exporter for disk health
  • vendor-specific exporters for storage or server platforms
  • DCGM exporter for NVIDIA GPU monitoring

For general Kubernetes node monitoring, node-exporter is the base. For physical hardware health, it should be extended with hardware-specific exporters.
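An additional exporter is typically wired into the stack with a ServiceMonitor. The manifest below is a sketch for a hypothetical IPMI exporter: the object name, the Service label app.kubernetes.io/name: ipmi-exporter, and the port name metrics are all assumptions that depend on how the exporter is actually deployed.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ipmi-exporter   # hypothetical object name
  namespace: monitoring
  labels:
    # Assumed to match the Helm release name so Prometheus selects this monitor
    release: kube-prometheus-stack
spec:
  selector:
    matchLabels:
      # Hypothetical label on the exporter's Service
      app.kubernetes.io/name: ipmi-exporter
  endpoints:
  # Hypothetical named port on the Service that serves /metrics
  - port: metrics
    interval: 60s
```

Hardware sensors change slowly, so a longer scrape interval than the node-exporter default is usually acceptable.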

Summary

Kubernetes node monitoring with Prometheus and Grafana should be implemented as a complete monitoring stack, not only as a dashboard. Use Prometheus with node-exporter for hardware and operating system metrics, kube-state-metrics for Kubernetes object status, Alertmanager for alert routing, and Grafana for visualization. This approach gives your infrastructure teams early visibility into CPU, memory, disk, filesystem, network, and node readiness, before problems reach the workloads. At Octopus Computer Solutions, this type of Kubernetes monitoring design is part of building reliable, observable, and production-ready container platforms for our customers.