Simple Server Health Monitor

Python script that SSHs into servers and checks if anything's broken. Sends email alerts when thresholds are exceeded. Basic monitoring without enterprise overhead.

What It Monitors

Connects to servers via SSH and checks:

Disk usage
CPU load
Memory usage

Any metric at or above its threshold triggers an email alert.

Implementation

The script uses Paramiko for SSH connections, runs standard Unix commands on each server, and parses the output:

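def check_disk(self, host: str, username: str, key_path: str) -> Dict:
    """Check disk usage of the root filesystem as a percentage."""
    # -P (POSIX format) keeps each filesystem on a single line, so NR==2 is
    # reliable even with long device names; column 5 is the use percentage.
    cmd = "df -P / | awk 'NR==2 {print $5}'"
    result = self.ssh_execute(host, username, key_path, cmd)
    usage = int(result.strip().rstrip('%'))
    
    return {
        'metric': 'disk',
        'value': usage,
        'unit': '%',
        'status': 'CRITICAL' if usage >= self.thresholds['disk'] else 'OK'
    }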
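
The ssh_execute helper that the checks call is not shown here. A minimal sketch of what it might look like with Paramiko, written as a standalone function for brevity. The timeout parameter, the error handling, and the choice to trust only hosts already in known_hosts are assumptions, not the script's actual code:

```python
import os


def ssh_execute(host: str, username: str, key_path: str, cmd: str,
                timeout: float = 10.0) -> str:
    """Run cmd on host over SSH and return its stripped stdout."""
    # Imported lazily so this module still loads where Paramiko is absent.
    import paramiko

    client = paramiko.SSHClient()
    # Trust only hosts already recorded in known_hosts (the first manual
    # ssh or ssh-copy-id run records them); unknown hosts are rejected.
    client.load_system_host_keys()
    try:
        client.connect(
            hostname=host,
            username=username,
            # Expand "~" explicitly rather than relying on the library.
            key_filename=os.path.expanduser(key_path),
            timeout=timeout,
        )
        _stdin, stdout, stderr = client.exec_command(cmd, timeout=timeout)
        output = stdout.read().decode().strip()
        errors = stderr.read().decode().strip()
        exit_code = stdout.channel.recv_exit_status()
    finally:
        client.close()

    if exit_code != 0:
        raise RuntimeError(f"{cmd!r} failed on {host}: {errors}")
    return output
```

Opening one connection per command is simple but slow. With three checks per server, reusing a single client for all of them would be a reasonable next step.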

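CPU check uses the 1-minute load average relative to the number of cores, as a rough proxy for CPU utilization. Single-core system at 1.0 load = 100%. Four-core system at 1.0 load = 25% of total capacity. On Linux the load average also counts tasks blocked on I/O, so the value can exceed 100% and can be high while the CPU itself is idle: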

def check_cpu(self, host: str, username: str, key_path: str) -> Dict:
    cmd = "echo $(cat /proc/loadavg | awk '{print $1}') $(nproc)"
    result = self.ssh_execute(host, username, key_path, cmd)
    load, cores = result.split()
    cpu_percent = (float(load) / int(cores)) * 100
    
    return {
        'metric': 'cpu',
        'value': round(cpu_percent, 2),
        'unit': '%',
        'status': 'CRITICAL' if cpu_percent >= self.thresholds['cpu'] else 'OK'
    }
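
The arithmetic can be checked locally without a server. A small sketch, where load_to_percent is a hypothetical helper that takes the same "load cores" string the remote command prints:

```python
def load_to_percent(result: str) -> float:
    """Convert '<load> <cores>' (the remote command's output) to a percentage."""
    load, cores = result.split()
    return round(float(load) / int(cores) * 100, 2)


print(load_to_percent("1.00 1"))  # 1.0 / 1 * 100 = 100.0
print(load_to_percent("1.00 4"))  # 1.0 / 4 * 100 = 25.0
print(load_to_percent("6.00 4"))  # 6.0 / 4 * 100 = 150.0, more work than cores
```

With the sample threshold of 80, a four-core server alerts once its 1-minute load reaches 3.2.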

Memory check parses free output, dividing the used column by the total column on the Mem line:

def check_memory(self, host: str, username: str, key_path: str) -> Dict:
    cmd = "free | awk 'NR==2 {printf \"%.0f\", ($3/$2)*100}'"
    result = self.ssh_execute(host, username, key_path, cmd)
    usage = int(result)
    
    return {
        'metric': 'memory',
        'value': usage,
        'unit': '%',
        'status': 'CRITICAL' if usage >= self.thresholds['memory'] else 'OK'
    }
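
What free reports as used has varied between procps versions, so the figure above may or may not include cache. One alternative is to base the percentage on the available column, which estimates how much memory can be claimed without swapping. A sketch of that variant, parsing the output in Python instead of awk. The sample output is made up for illustration, and it assumes a version of free that prints an available column:

```python
SAMPLE_FREE = """\
              total        used        free      shared  buff/cache   available
Mem:        8000000     3000000     1000000      100000     4000000     4400000
Swap:       2000000           0     2000000
"""


def memory_percent(free_output: str) -> float:
    """Percentage of memory that is not available, from `free` output."""
    lines = free_output.strip().splitlines()
    columns = lines[0].split()      # header names: total, used, free, ...
    values = lines[1].split()[1:]   # Mem line without the "Mem:" label
    mem = dict(zip(columns, map(int, values)))
    unavailable = mem["total"] - mem["available"]
    return round(100 * unavailable / mem["total"], 2)


print(memory_percent(SAMPLE_FREE))  # (8000000 - 4400000) / 8000000 = 45.0
```

For the same sample, used divided by total gives 37.5%, so the two definitions can land on different sides of a threshold.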

Configuration

JSON config file defines servers, thresholds, email settings:

{
  "servers": [
    {
      "name": "Web Server",
      "host": "192.168.1.100",
      "username": "admin",
      "key_path": "~/.ssh/id_ed25519"
    }
  ],
  "thresholds": {
    "disk": 85,
    "cpu": 80,
    "memory": 85
  },
  "email": {
    "smtp_server": "smtp.gmail.com",
    "smtp_port": 587,
    "from": "alerts@domain.com",
    "to": "you@domain.com",
    "username": "alerts@domain.com",
    "password": "your-app-password"
  }
}

Thresholds in percentages. A value at or above its threshold triggers an alert: disk at 85%, CPU load at 80% of core capacity, memory at 85%.
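
A typo in the config otherwise surfaces as a KeyError halfway through a run. A sketch of an up-front check, assuming the key names from the sample config above. validate_config and load_config are hypothetical helpers, not part of the script as shown:

```python
import json

REQUIRED_SERVER_KEYS = {"name", "host", "username", "key_path"}
REQUIRED_THRESHOLDS = {"disk", "cpu", "memory"}


def validate_config(config: dict) -> dict:
    """Raise ValueError on a malformed config; return it unchanged otherwise."""
    if not config.get("servers"):
        raise ValueError("config needs at least one server")
    for server in config["servers"]:
        missing = REQUIRED_SERVER_KEYS - server.keys()
        if missing:
            raise ValueError(f"server entry missing keys: {sorted(missing)}")

    thresholds = config.get("thresholds", {})
    for name in sorted(REQUIRED_THRESHOLDS):
        value = thresholds.get(name)
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)) or value <= 0:
            raise ValueError(f"threshold {name!r} must be a positive number")
    return config


def load_config(path: str) -> dict:
    """Read the JSON file at path and validate it before use."""
    with open(path, encoding="utf-8") as handle:
        return validate_config(json.load(handle))
```

The CPU threshold is deliberately not capped at 100, since load relative to cores can exceed it.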

Email Alerts

When threshold exceeded, script sends SMTP email with details:

import smtplib
from datetime import datetime
from email.mime.text import MIMEText
from typing import Dict, List

def send_alert(self, server_name: str, issues: List[Dict]):
    # Assumption: the parsed JSON config is stored on self.config
    email_config = self.config['email']
    subject = f"[ALERT] System Health Issues on {server_name}"
    body = f"Critical issues detected on {server_name} at {datetime.now()}:\n\n"
    
    for issue in issues:
        body += f"- {issue['metric'].upper()}: {issue['value']}{issue.get('unit', '')}\n"
    
    msg = MIMEText(body)
    msg['Subject'] = subject
    msg['From'] = email_config['from']
    msg['To'] = email_config['to']
    
    # Port 587 expects a plain connection upgraded with STARTTLS before login
    with smtplib.SMTP(email_config['smtp_server'], email_config['smtp_port']) as server:
        server.starttls()
        server.login(email_config['username'], email_config['password'])
        server.send_message(msg)

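Gmail requires an app password for SMTP logins like this, and app passwords are only available once 2-Step Verification is enabled on the account. Don't use your main password. Generate the app password in the Google Account security settings.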

Setup Process

Install Paramiko:

pip install paramiko

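Set up SSH key authentication to avoid password prompts. The key_path in the config must point at the private key generated here (~/.ssh/id_ed25519 by default):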

ssh-keygen -t ed25519
ssh-copy-id user@server

Test the script manually:

python3 health_checker.py config.json

Output shows check results:

=== System Health Check - 2025-10-07 ===

Checking Web Server (192.168.1.100)...
  ✓ disk: 45%
  ✓ cpu: 23.5%
  ✗ memory: 87%

Alert sent to you@domain.com
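
The report lines follow directly from the dictionaries the checks return. A sketch of how they could be rendered and filtered, where format_result and critical_issues are hypothetical helper names:

```python
from typing import Dict, List


def format_result(result: Dict) -> str:
    """Render one check result as an indented report line."""
    mark = "✓" if result["status"] == "OK" else "✗"
    return f"  {mark} {result['metric']}: {result['value']}{result.get('unit', '')}"


def critical_issues(results: List[Dict]) -> List[Dict]:
    """Keep only the results that should go into an alert email."""
    return [r for r in results if r["status"] == "CRITICAL"]


results = [
    {"metric": "disk", "value": 45, "unit": "%", "status": "OK"},
    {"metric": "cpu", "value": 23.5, "unit": "%", "status": "OK"},
    {"metric": "memory", "value": 87, "unit": "%", "status": "CRITICAL"},
]
for result in results:
    print(format_result(result))
```

An alert is only worth sending when critical_issues(results) is non-empty. Here it contains just the memory entry.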

Automation

Add to cron for hourly checks:

crontab -e
# Add this line:
0 * * * * /usr/bin/python3 /path/to/health_checker.py /path/to/config.json

Runs every hour on the hour. Adjust schedule as needed: use */15 * * * * for every 15 minutes, or 0 0,12 * * * for twice daily (midnight and noon).

What This Is Good For

Small deployments with 2-10 servers. Situations where enterprise monitoring tools are overkill or too expensive. Quick visibility into system health without dashboard complexity.

What This Isn't

Not suitable for large-scale infrastructure. No metrics history. No dashboards. No complex alerting logic. No anomaly detection.

For production environments with dozens of servers, use Prometheus with Grafana, or similar. This script is for situations where you need basic alerting and don't need the overhead.

Security Considerations

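SSH keys need an empty passphrase for unattended runs, so protect the key file itself: keep it readable only by the user that runs the script (chmod 600). Where possible, also limit what the key can do on the server, for example with the from= and restrict options in authorized_keys.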

Email credentials in the config file are a risk. Consider using environment variables or a secrets manager instead.
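
One way to keep the password out of the file is to prefer an environment variable and fall back to the config only if it is unset. A sketch, where the variable name HEALTH_SMTP_PASSWORD and the helper resolve_smtp_password are assumptions made for illustration:

```python
import os
from typing import Mapping, Optional


def resolve_smtp_password(email_config: Mapping[str, str],
                          env: Optional[Mapping[str, str]] = None) -> str:
    """Return the SMTP password, preferring the environment over the config."""
    env = os.environ if env is None else env
    # HEALTH_SMTP_PASSWORD is a made-up name; pick whatever fits your setup.
    password = env.get("HEALTH_SMTP_PASSWORD") or email_config.get("password")
    if not password:
        raise RuntimeError("no SMTP password in environment or config")
    return password
```

Cron jobs do not inherit your login shell's environment, so the variable has to be set somewhere cron can see it, such as a wrapper script with restricted permissions.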

Script runs commands with the privileges of the SSH user. Don't use root. Create a dedicated monitoring user with limited permissions.

Extensions

Add more checks: failed login attempts, running processes, service status, network connectivity.

Log results to file for trend analysis. Current implementation only alerts, doesn't store history.
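
A sketch of the simplest form of history: append one JSON object per check to a log file, which tools like jq or pandas can read later. The function name, file layout, and field names are assumptions:

```python
import json
from datetime import datetime, timezone
from typing import Dict, List


def append_results(path: str, server_name: str, results: List[Dict]) -> None:
    """Append one JSON line per check result to the file at path."""
    timestamp = datetime.now(timezone.utc).isoformat()
    with open(path, "a", encoding="utf-8") as handle:
        for result in results:
            record = {"time": timestamp, "server": server_name, **result}
            handle.write(json.dumps(record) + "\n")
```

Calling append_results("health.jsonl", "Web Server", results) after each run builds up the trend data. Rotate the file with logrotate or similar so it doesn't grow without bound.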

Integrate with Slack or Discord webhooks instead of email for faster notification.
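
Slack incoming webhooks accept a JSON body with a text field, so the standard library is enough. A sketch that mirrors the email body format, with build_slack_request and send_slack_alert as hypothetical helper names and the webhook URL supplied by you:

```python
import json
import urllib.request
from typing import Dict, List


def build_slack_request(webhook_url: str, server_name: str,
                        issues: List[Dict]) -> urllib.request.Request:
    """Build (but do not send) the POST request for a Slack incoming webhook."""
    lines = [f"[ALERT] System Health Issues on {server_name}"]
    lines += [f"- {i['metric'].upper()}: {i['value']}{i.get('unit', '')}"
              for i in issues]
    payload = json.dumps({"text": "\n".join(lines)}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_slack_alert(webhook_url: str, server_name: str,
                     issues: List[Dict], timeout: float = 10.0) -> int:
    """Send the alert and return the HTTP status code."""
    request = build_slack_request(webhook_url, server_name, issues)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Discord webhooks work the same way but expect the message under a content key instead of text. Treat the webhook URL like a password, since anyone who has it can post to the channel.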

Add retry logic for transient connection failures. Current implementation fails immediately on SSH timeout.
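
A sketch of such retry logic as a generic wrapper with exponential backoff. The name with_retries, the defaults, and the choice of which exceptions to retry are all assumptions:

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_retries(func: Callable[[], T], attempts: int = 3, base_delay: float = 1.0,
                 retry_on: Tuple[Type[BaseException], ...] = (OSError,),
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Call func until it succeeds or attempts run out, doubling the delay."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...
    raise ValueError("attempts must be at least 1")
```

A check would then call something like with_retries(lambda: self.ssh_execute(host, username, key_path, cmd)). Socket timeouts are OSError subclasses and are covered by the default. Paramiko's SSHException is not, so add it to retry_on if protocol-level failures should be retried too.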

Full code available on request. Modify for your infrastructure.