Log Analysis in a Nutshell: What I Learned

November 3, 2025

Spent the last few weeks learning log analysis for cybersecurity. Turns out reading logs is way more useful than I expected. This is what I figured out.

What logs are

Logs are timestamped records of events in a system. Web servers log requests, apps log errors, firewalls log blocked connections. These pile up in text files somewhere on a system. Format varies but most have a timestamp, source, severity level, and message about what happened.
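
A typical entry looks something like this (a made-up line, but the shape is what matters: timestamp, source, severity, message):

2025-11-03 14:25:01 UTC api-server[2314] ERROR Database connection timed out after 30s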

Why logs matter

When something breaks you need to know what happened, and logs show you. They tell you what a server was doing before it died, where and how an attacker got into a system or what they broke, and which queries were taking forever when an app crashed.

With logs you have actual evidence of what went wrong and when; without them you're essentially guessing. That alone makes everything easier.

Types of logs

System logs track OS events like service starts and user logins. On Linux they're in /var/log/syslog or /var/log/messages. Windows has Event Viewer.

Application logs show what apps are doing. Web servers log HTTP requests, databases log queries, custom apps log whatever the devs thought to include.

Security logs track auth attempts, privilege changes, policy violations. Failed SSH logins and sudo commands show up here.

Network logs come from firewalls and routers. Connection attempts, blocked traffic, detected threats.
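
Quickest way to get oriented on a Linux box is to list what's there and tail whatever looks relevant (the paths below are Debian/Ubuntu defaults; other distros differ):

ls /var/log/
sudo tail -n 20 /var/log/syslog
sudo tail -n 20 /var/log/auth.log
sudo tail -f /var/log/apache2/access.log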

Basic analysis

Grep is how you search logs. Looking for failed logins means grep for "Failed password" in auth logs. Looking for errors means grep for "error" in app logs.

grep "Failed password" /var/log/auth.log
grep -i "error" /var/log/apache2/error.log

Count occurrences to find patterns. 1000 failed logins from one IP = brute force attack. 50 DB timeouts in an hour = performance problem.

grep "Failed password" /var/log/auth.log | wc -l

Filter by time when investigating specific incidents. Problem at 2:30pm? Check logs from 2:25 to 2:35. Use awk or sed to extract timestamps and filter date ranges.
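
One way to do the time filter with awk, assuming the traditional syslog timestamp where the time is the third field (plain string comparison works because the times are zero-padded):

awk '$3 >= "14:25:00" && $3 <= "14:35:00"' /var/log/syslog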

Tools

For local stuff, grep, awk, sed, and cut handle most tasks. They're fast and they work everywhere.
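
For example, counting HTTP status codes in an Apache access log is one pipeline (field 9 is the status code in the default combined log format):

awk '{print $9}' /var/log/apache2/access.log | sort | uniq -c | sort -rn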

For multiple systems use centralized logging. ELK Stack (Elasticsearch, Logstash, Kibana), Splunk, or Graylog collect logs from all your servers into one searchable place. Beats SSHing into 50 boxes to grep 50 different files.

SIEM systems like Splunk ES or QRadar add correlation and alerting on top. They automatically flag suspicious patterns, like privilege escalation followed by data exfiltration.

What to look for

Repeated errors aren't random. If the same error shows up constantly something's misconfigured or broken.

Time patterns show usage peaks, scheduled jobs, attack windows. Traffic spikes at 9am when people log in. Backups run at 2am. Scanners probe at 3am.
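
A rough way to see the time pattern is to bucket events by hour. For failed SSH logins in a traditional-format auth.log (field 3 is the HH:MM:SS timestamp; this lumps all days in the file together), something like:

grep "Failed password" /var/log/auth.log | awk '{print $3}' | cut -d: -f1 | sort | uniq -c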

Anomalies stick out. One failed login is normal, 100 isn't. One user downloading 10GB at midnight is suspicious.

Correlate across log sources. Failed DB connection in app logs matches "too many connections" in DB logs. Outbound traffic spike matches new process in system logs.

Common issues

Logs fill disks fast. Set up log rotation or you'll run out of space. Logrotate on Linux compresses old logs and deletes ancient ones automatically.
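
A minimal logrotate sketch, dropped into /etc/logrotate.d/ (the app path is hypothetical):

# keep four weeks of compressed history for the app logs
/var/log/myapp/*.log {
    weekly
    rotate 4
    compress
    missingok
    notifempty
}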

Too little logging = no troubleshooting data. Too much = noise. Find the right balance.

Time sync matters. Servers in different timezones with different clocks make correlation impossible. Run NTP.
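
On a systemd box, checking and turning on time sync is two commands:

timedatectl status
sudo timedatectl set-ntp true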

Attackers delete logs. Use remote logging and write-once storage to prevent tampering.
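
With rsyslog, shipping a copy of everything to a central host is one line in /etc/rsyslog.conf (the hostname is a placeholder; @@ means TCP, a single @ would be UDP):

*.* @@loghost.example.com:514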

My workflow

Start with what broke, when, and which systems. Identify relevant log files. Filter by time window around the incident. Search for keywords related to the problem. Follow the chain of events to find root cause. Document it so you remember next time.

Example

Web app returns 502 errors and users can't log in. Web server logs show timeouts connecting to the app server. App logs show "DB connection pool exhausted". DB logs show 150 active connections at max capacity. The problem is too many connections; the fix is to raise the connection limit or figure out why the app isn't releasing them.
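
The actual digging for that one is just more grep, with paths that depend on your stack (these assume nginx, a hypothetical app log, and MySQL):

grep " 502 " /var/log/nginx/access.log | wc -l
grep -i "connection pool exhausted" /var/log/myapp/app.log
grep -i "too many connections" /var/log/mysql/error.log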

Without logs this is just "website down" with no clear fix.

What I got from this

Log analysis is reading evidence to understand system behavior. It's the difference between guessing and knowing. Learn grep and regex and you'll be in good shape. Understand the log formats for the systems you work with. Keep clocks synced. If you're not checking logs, you don't know what your systems are actually doing.