Good point. I only bring up the kernel because that's the change to the system that happened between "works" and "crashes." I did the reboot manually.
For monitoring, I haven't gotten Nagios to work, so I'm just running a simple w >> loadlog from cron every 5 minutes.
Code:
10:15:01 up 4 days, 11:58, 1 user, load average: 0.69, 0.44, 0.30
Fri Jul 29 10:20:01 PDT 2011
10:20:01 up 4 days, 12:03, 1 user, load average: 0.35, 0.32, 0.28
Fri Jul 29 10:25:01 PDT 2011
10:25:01 up 4 days, 12:08, 0 users, load average: 1.04, 0.77, 0.48
Fri Jul 29 10:30:01 PDT 2011
10:30:01 up 4 days, 12:13, 0 users, load average: 0.99, 0.92, 0.62
Fri Jul 29 10:35:01 PDT 2011
10:35:01 up 4 days, 12:18, 0 users, load average: 0.80, 0.93, 0.70
Fri Jul 29 10:40:01 PDT 2011
10:40:01 up 4 days, 12:23, 0 users, load average: 1.09, 0.96, 0.77
Fri Jul 29 10:45:01 PDT 2011
10:45:01 up 4 days, 12:28, 0 users, load average: 1.39, 1.12, 0.86
Fri Jul 29 10:50:01 PDT 2011
10:50:01 up 4 days, 12:33, 0 users, load average: 1.01, 1.06, 0.91
Fri Jul 29 10:55:01 PDT 2011
10:55:01 up 4 days, 12:38, 0 users, load average: 1.08, 1.06, 0.94
Fri Jul 29 11:00:01 PDT 2011
11:00:01 up 4 days, 12:43, 0 users, load average: 0.99, 1.02, 0.94
Fri Jul 29 11:05:02 PDT 2011
11:05:02 up 4 days, 12:48, 0 users, load average: 1.17, 1.18, 1.02
Fri Jul 29 11:10:01 PDT 2011
11:10:01 up 4 days, 12:53, 0 users, load average: 0.41, 0.84, 0.93
Fri Jul 29 11:35:01 PDT 2011
11:35:01 up 5 min, 1 user, load average: 1.06, 1.06, 0.51
Fri Jul 29 11:40:01 PDT 2011
11:40:01 up 10 min, 1 user, load average: 1.29, 1.25, 0.75
Fri Jul 29 11:45:01 PDT 2011
11:45:01 up 15 min, 1 user, load average: 1.12, 1.26, 0.89
Fri Jul 29 11:50:01 PDT 2011
11:50:01 up 20 min, 1 user, load average: 0.83, 1.07, 0.90
Fri Jul 29 11:55:01 PDT 2011
This brings me to another thought. With the previous kernel, all else being equal, the load stayed marginal; with the new kernel it's regularly going above 1 and staying there. There's a RAID 1 that runs md3_sync and something else for some time after reboot. The last round of crashes I had was tied to the RAID, with the 99-raid-check job running every Sunday. Pulling that out of cron.weekly made the crashes mostly go away, but it wasn't until the previous kernel that the system totally stabilized.
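To put numbers on the "above 1 and staying there" impression, something like the following rough sketch could summarize the load log per boot. It assumes the log looks like the excerpt above (uptime-style lines mixed with date lines); the function names are made up for illustration, and it groups samples by boot (a drop in uptime), not by kernel, so which boot ran which kernel still has to be matched up by hand.

```python
import re
from statistics import mean

# Matches uptime-style lines such as:
# "10:15:01 up 4 days, 11:58, 1 user, load average: 0.69, 0.44, 0.30"
LINE_RE = re.compile(
    r"^\d{1,2}:\d{2}:\d{2} up (?P<up>.+?),\s+\d+ users?,\s+"
    r"load average: (?P<l1>[\d.]+), (?P<l5>[\d.]+), (?P<l15>[\d.]+)"
)


def uptime_minutes(text):
    """Convert '4 days, 11:58', '5 min' or '1:05' to total minutes."""
    minutes = 0
    m = re.search(r"(\d+) days?", text)
    if m:
        minutes += int(m.group(1)) * 24 * 60
    m = re.search(r"(\d+):(\d+)", text)
    if m:
        minutes += int(m.group(1)) * 60 + int(m.group(2))
    m = re.search(r"(\d+) min", text)
    if m:
        minutes += int(m.group(1))
    return minutes


def load_by_boot(lines):
    """Group 1-minute load samples by boot; a drop in uptime starts a new boot."""
    boots = []
    prev_up = None
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip date lines and anything else that is not a sample
        up = uptime_minutes(m.group("up"))
        if prev_up is None or up < prev_up:
            boots.append([])
        boots[-1].append(float(m.group("l1")))
        prev_up = up
    return boots


def summarize(boots):
    """Per boot: sample count, mean 1-minute load, and samples above 1.0."""
    return [
        {"samples": len(b), "mean": mean(b), "above_1": sum(x > 1.0 for x in b)}
        for b in boots
    ]


if __name__ == "__main__":
    # "loadlog" is the file name used in the cron job above.
    with open("loadlog") as fh:
        for i, row in enumerate(summarize(load_by_boot(fh)), start=1):
            print(f"boot {i}: {row}")
```

With only a few samples per boot this is just a sanity check, but over several days of 5-minute samples it should show whether the higher load really follows the kernel or only the RAID resync window after a reboot.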