New Kernel Less Stable than Last Kernel

premierhosting · Fri Jul 29, 2011 2:37 pm

#uname -a
Linux xxxxxxxxxxxxxxx 2.6.32.43-6.art.x86_64 #1 SMP Thu Jul 14 14:14:48 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux

This kernel is crashing about once per week, but not on the same day each week.

When this was the kernel:

Code: Select all

title CentOS (2.6.32.41-4.art.x86_64)
        root (hd0,0)
        kernel /boot/vmlinuz-2.6.32.41-4.art.x86_64 ro root=/dev/md1 console=tty

We had some awesome stability.

This is running on a 1&1 dedicated server with lots of RAM:

Code: Select all

#free
             total       used       free     shared    buffers     cached
Mem:       4020996    1579784    2441212          0      47996     332496
-/+ buffers/cache:    1199292    2821704
Swap:      3919840          0    3919840

Is there any way to get back to a stable kernel, or should I just downgrade?

mikeshinn · Fri Jul 29, 2011 4:40 pm

When you say crashing, could you be more specific? What kind of kernel error are you getting?

And to select which kernel to boot into, please see this page:

https://www.atomicorp.com/wiki/index.ph ... el_to_boot
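As a rough sketch of what choosing a boot kernel involves: GRUB legacy picks the entry named by `default=N`, where N counts the `title` lines from zero. The helper below only lists the entries with their index. The function name is made up for illustration, and the path /boot/grub/grub.conf is an assumption based on the CentOS stanza quoted earlier in this thread.

```shell
# Print each GRUB legacy "title" entry with its zero-based index.
# The default path is an assumption; pass another file as $1 if yours differs.
list_grub_titles() {
    grub_conf="${1:-/boot/grub/grub.conf}"
    awk '/^[[:space:]]*title/ { printf "%d: %s\n", n, $0; n++ }' "$grub_conf"
}

# Usage: list_grub_titles
# Then set default=<index> in grub.conf to the entry you want to boot.
```

This does not change anything on disk; it only helps find the index of the older kernel before editing `default=` by hand.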

premierhosting · Sat Jul 30, 2011 2:20 pm

Hi mikeshinn,

Is there a particular log file you'd like to see? Crash means it's hung and the only thing I see on console is the latest PAX error. I can boot into the old kernel, but I'm sure there are updates for a reason.

mikeshinn · Sat Jul 30, 2011 3:11 pm

Is there a particular log file you'd like to see?

Yes, the log file you want to look at is /var/log/messages. If a Linux kernel crashes or has an error, it will log the reasons there (unless you had a major hardware problem that prevented this).
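One possible way to narrow that file down is to filter for the strings that typically accompany a real kernel fault. This is only a sketch: the function name is hypothetical and the pattern list is a starting point rather than a complete set of kernel error messages.

```shell
# Show kernel lines in a syslog file that usually indicate a real fault.
# The pattern list is an assumption and is not exhaustive.
kernel_faults() {
    msg_log="${1:-/var/log/messages}"
    grep ' kernel: ' "$msg_log" \
        | grep -E 'Oops|BUG:|Kernel panic|Call Trace|soft lockup|Out of memory'
}

# Usage: kernel_faults                      # scans /var/log/messages
#        kernel_faults /var/log/messages.1  # scans a rotated log
```

No output means none of these patterns were logged, which is not proof that the kernel was healthy; a hard hang may never reach the disk.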

Crash means it's hung and the only thing I see on console is the latest PAX error. I can boot into the old kernel, but I'm sure there are updates for a reason.

Indeed, the latest kernel contains bug fixes from upstream (the mainline kernel), but not any security fixes. So running the previous ASL kernel won't make your system more vulnerable. If there is a security issue with our kernels, we will always post a notice in Announcements in the forums urging everyone to upgrade (and that's not the case for the current kernel; it's just boring old mainline bug fixes and new hardware support).

Now when you say "hung", do you mean the system crashed or you just couldn't reach the server? It's possible you just got shunned and couldn't log in. I'd also check the active response logs which are here:

/var/ossec/logs/active-responses.log

And grep for your IP to see if you got blocked.
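For example, something along these lines would do it. The function name is made up, and 203.0.113.7 in the usage line is a placeholder address, not one from this thread.

```shell
# Print active-response log lines that mention the given IP address.
# -F treats the IP as a literal string (dots are not wildcards);
# -w avoids matching a longer address such as 203.0.113.70.
was_blocked() {
    ip_addr="$1"
    ar_log="${2:-/var/ossec/logs/active-responses.log}"
    grep -F -w -- "$ip_addr" "$ar_log"
}

# Usage: was_blocked 203.0.113.7
```

If nothing is printed, that address does not appear in the log, so a shun is unlikely to be the reason you could not connect.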

premierhosting · Sat Jul 30, 2011 5:24 pm

Hi mikeshinn,

I'm familiar with getting shunned, it wasn't that. I have monitors on the server for Ping, HTTP, etc. and the monitors flagged because the server wasn't accessible. Hung, crashed, whatever you want to call it. I always attempt a login from an intermediary server when this happens to make sure it isn't just me or the monitors being shunned.

I received my first monitor response at 11:19 on July 29th. Here is the /var/log/messages around it.

Code: Select all

Jul 29 11:01:16 server1 kernel: PAX: bytes at PC: c3 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Jul 29 11:01:16 server1 kernel: PAX: bytes at SP-8: 000066780fe4b4d8 0000000000400ca0 0000000000000000 0000000000000000 0000000000000000 0000000000000000 00007fffd729ed20 0000000000400eac 0000000041ea9940 0000000000000000 000066780fe5abc0
Jul 29 11:01:16 server1 kernel: PAX: execution attempt in: <anonymous mapping>, 7fff2ccf6000-7fff2cd0b000 7ffffffea000
Jul 29 11:01:16 server1 kernel: PAX: terminating task: /usr/libexec/paxtest/mprotstack(mprotstack):6461, uid/euid: 0/0, PC: 00007fff2cd0ac10, SP: 00007fff2cd0ac08
Jul 29 11:01:16 server1 kernel: PAX: bytes at PC: c3 ad d0 2c ff 7f 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Jul 29 11:01:16 server1 kernel: PAX: bytes at SP-8: 00007fff2cd0ad20 0000000000400abd 00007fff2cd0adc3 0000000000000000 0000000000000000 0000000000400a7c 000000004250d940 0000000000000000 0000647439062bc0 00006474388f0994 0000000000400820
Jul 29 11:01:16 server1 kernel: PAX: execution attempt in: /usr/libexec/paxtest/shlibtest2.so, 6927fe938000-6927fe93a000 00000000
Jul 29 11:01:16 server1 kernel: PAX: terminating task: /usr/libexec/paxtest/shlibbss(shlibbss):6686, uid/euid: 0/0, PC: 00006927fe9397e0, SP: 00007fffae39ebf8
Jul 29 11:01:16 server1 kernel: PAX: bytes at PC: c3 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Jul 29 11:01:16 server1 kernel: PAX: bytes at SP-8: 00006927feb3c7e0 0000000000400e36 0000000000000000 0000000000000000 0000000000000000 0000000000000000 00007fffae39ed20 0000000000400d6c 0000000040dba940 0000000000000000 00006927fef75bc0
Jul 29 11:01:16 server1 kernel: PAX: execution attempt in: /usr/libexec/paxtest/shlibtest2.so, 70802d40f000-70802d411000 00000000
Jul 29 11:01:16 server1 kernel: PAX: terminating task: /usr/libexec/paxtest/shlibdata(shlibdata):6693, uid/euid: 0/0, PC: 000070802d40f7c0, SP: 00007fffd3367bf8
Jul 29 11:01:17 server1 kernel: PAX: bytes at PC: c3 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Jul 29 11:01:17 server1 kernel: PAX: bytes at SP-8: 000070802d6127c0 0000000000400e36 0000000000000000 0000000000000000 0000000000000000 0000000000000000 00007fffd3367d20 0000000000400d6c 00000000419d7940 0000000000000000 000070802da4cbc0
Jul 29 11:01:29 server1 freshclam[7799]: ClamAV update process started at Fri Jul 29 11:01:29 2011
Jul 29 11:01:29 server1 freshclam[7799]: main.cvd is up to date (version: 53, sigs: 846214, f-level: 53, builder: sven)
Jul 29 11:01:29 server1 freshclam[7799]: daily.cld is up to date (version: 13376, sigs: 165628, f-level: 60, builder: ccordes)
Jul 29 11:01:30 server1 freshclam[7799]: Downloading safebrowsing-31162.cdiff [100%]
Jul 29 11:01:31 server1 freshclam[7799]: Downloading safebrowsing-31163.cdiff [100%]
Jul 29 11:01:32 server1 freshclam[7799]: safebrowsing.cld updated (version: 31163, sigs: 513482, f-level: 60, builder: google)
Jul 29 11:01:32 server1 freshclam[7799]: bytecode.cld is up to date (version: 144, sigs: 41, f-level: 60, builder: edwin)
Jul 29 11:01:33 server1 freshclam[7799]: Database updated (1525365 signatures) from db.us.clamav.net (IP: 69.163.100.14)
Jul 29 11:01:33 server1 freshclam[7799]: Clamd successfully notified about the update.
Jul 29 11:01:34 server1 clamd[10355]: Reading databases from /var/clamav
Jul 29 11:01:47 server1 clamd[10355]: Database correctly reloaded (1937811 signatures)
Jul 29 11:02:51 server1 xinetd[3097]: START: smtp pid=8191 from=221.236.5.22
Jul 29 11:02:57 server1 xinetd[3097]: EXIT: smtp status=0 pid=8191 duration=6(sec)
Jul 29 11:09:42 server1 xinetd[3097]: START: smtp pid=9753 from=8.19.36.102
Jul 29 11:09:43 server1 xinetd[3097]: EXIT: smtp status=0 pid=9753 duration=1(sec)
Jul 29 11:11:11 server1 xinetd[3097]: START: smtp pid=9931 from=75.180.132.123
Jul 29 11:11:12 server1 xinetd[3097]: START: submission pid=9938 from=213.236.208.19
Jul 29 11:11:12 server1 xinetd[3097]: EXIT: submission status=1 pid=9938 duration=0(sec)
Jul 29 11:11:17 server1 xinetd[3097]: EXIT: smtp status=0 pid=9931 duration=6(sec)
Jul 29 11:11:35 server1 xinetd[3097]: START: smtp pid=9946 from=63.149.233.245
Jul 29 11:30:06 server1 syslogd 1.4.1: restart.
Jul 29 11:30:06 server1 kernel: klogd 1.4.1, log source = /proc/kmsg started.
Jul 29 11:30:06 server1 kernel: Initializing cgroup subsys cpuset
Jul 29 11:30:06 server1 kernel: Initializing cgroup subsys cpu
Jul 29 11:30:06 server1 kernel: Linux version 2.6.32.43-6.art.x86_64 (mockbuild@archelon.atomicorp.com) (gcc version 4.3.2 20081105 (Red Hat 4.3.2-7) (GCC) ) #1 SMP Thu Jul 14 14:14:48 EDT 2011

mikeshinn · Sat Jul 30, 2011 5:54 pm

Thank you for the reply.

I have monitors on the server for Ping, HTTP, etc. and the monitors flagged because the server wasn't accessible. Hung, crashed, whatever you want to call it.

Does that happen with some regularity? And do you have any local monitors on the system, such as load monitors (CPU usage, memory, I/O, etc.)? And if so, what was the system doing at the time this occurred? What process(es) were using resources?

I don't see anything in your kernel logs that would indicate a bug in the kernel, unfortunately, so let's see what other data there is that might help isolate this.

Jul 29 11:01:16 server1 kernel: PAX: bytes at PC: c3 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Jul 29 11:01:16 server1 kernel: PAX: bytes at SP-8: 000066780fe4b4d8 0000000000400ca0 0000000000000000 0000000000000000 0000000000000000 0000000000000000 00007fffd729ed20 0000000000400eac 0000000041ea9940 0000000000000000 000066780fe5abc0
Jul 29 11:01:16 server1 kernel: PAX: execution attempt in: <anonymous mapping>, 7fff2ccf6000-7fff2cd0b000 7ffffffea000
Jul 29 11:01:16 server1 kernel: PAX: terminating task: /usr/libexec/paxtest/mprotstack(mprotstack):6461, uid/euid: 0/0, PC: 00007fff2cd0ac10, SP: 00007fff2cd0ac08
Jul 29 11:01:16 server1 kernel: PAX: bytes at PC: c3 ad d0 2c ff 7f 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Jul 29 11:01:16 server1 kernel: PAX: bytes at SP-8: 00007fff2cd0ad20 0000000000400abd 00007fff2cd0adc3 0000000000000000 0000000000000000 0000000000400a7c 000000004250d940 0000000000000000 0000647439062bc0 00006474388f0994 0000000000400820
Jul 29 11:01:16 server1 kernel: PAX: execution attempt in: /usr/libexec/paxtest/shlibtest2.so, 6927fe938000-6927fe93a000 00000000
Jul 29 11:01:16 server1 kernel: PAX: terminating task: /usr/libexec/paxtest/shlibbss(shlibbss):6686, uid/euid: 0/0, PC: 00006927fe9397e0, SP: 00007fffae39ebf8
Jul 29 11:01:16 server1 kernel: PAX: bytes at PC: c3 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Jul 29 11:01:16 server1 kernel: PAX: bytes at SP-8: 00006927feb3c7e0 0000000000400e36 0000000000000000 0000000000000000 0000000000000000 0000000000000000 00007fffae39ed20 0000000000400d6c 0000000040dba940 0000000000000000 00006927fef75bc0
Jul 29 11:01:16 server1 kernel: PAX: execution attempt in: /usr/libexec/paxtest/shlibtest2.so, 70802d40f000-70802d411000 00000000
Jul 29 11:01:16 server1 kernel: PAX: terminating task: /usr/libexec/paxtest/shlibdata(shlibdata):6693, uid/euid: 0/0, PC: 000070802d40f7c0, SP: 00007fffd3367bf8
Jul 29 11:01:17 server1 kernel: PAX: bytes at PC: c3 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Jul 29 11:01:17 server1 kernel: PAX: bytes at SP-8: 000070802d6127c0 0000000000400e36 0000000000000000 0000000000000000 0000000000000000 0000000000000000 00007fffd3367d20 0000000000400d6c 00000000419d7940 0000000000000000 000070802da4cbc0

The PAX messages, as you may know, are harmless:

https://www.atomicorp.com/wiki/index.ph ... st_mean.3F

Jul 29 11:01:29 server1 freshclam[7799]: ClamAV update process started at Fri Jul 29 11:01:29 2011
Jul 29 11:01:29 server1 freshclam[7799]: main.cvd is up to date (version: 53, sigs: 846214, f-level: 53, builder: sven)
Jul 29 11:01:29 server1 freshclam[7799]: daily.cld is up to date (version: 13376, sigs: 165628, f-level: 60, builder: ccordes)
Jul 29 11:01:30 server1 freshclam[7799]: Downloading safebrowsing-31162.cdiff [100%]
Jul 29 11:01:31 server1 freshclam[7799]: Downloading safebrowsing-31163.cdiff [100%]
Jul 29 11:01:32 server1 freshclam[7799]: safebrowsing.cld updated (version: 31163, sigs: 513482, f-level: 60, builder: google)
Jul 29 11:01:32 server1 freshclam[7799]: bytecode.cld is up to date (version: 144, sigs: 41, f-level: 60, builder: edwin)
Jul 29 11:01:33 server1 freshclam[7799]: Database updated (1525365 signatures) from db.us.clamav.net (IP: 69.163.100.14)
Jul 29 11:01:33 server1 freshclam[7799]: Clamd successfully notified about the update.
Jul 29 11:01:34 server1 clamd[10355]: Reading databases from /var/clamav
Jul 29 11:01:47 server1 clamd[10355]: Database correctly reloaded (1937811 signatures)
Jul 29 11:02:51 server1 xinetd[3097]: START: smtp pid=8191 from=221.236.5.22
Jul 29 11:02:57 server1 xinetd[3097]: EXIT: smtp status=0 pid=8191 duration=6(sec)
Jul 29 11:09:42 server1 xinetd[3097]: START: smtp pid=9753 from=8.19.36.102
Jul 29 11:09:43 server1 xinetd[3097]: EXIT: smtp status=0 pid=9753 duration=1(sec)
Jul 29 11:11:11 server1 xinetd[3097]: START: smtp pid=9931 from=75.180.132.123
Jul 29 11:11:12 server1 xinetd[3097]: START: submission pid=9938 from=213.236.208.19
Jul 29 11:11:12 server1 xinetd[3097]: EXIT: submission status=1 pid=9938 duration=0(sec)
Jul 29 11:11:17 server1 xinetd[3097]: EXIT: smtp status=0 pid=9931 duration=6(sec)
Jul 29 11:11:35 server1 xinetd[3097]: START: smtp pid=9946 from=63.149.233.245

This all looks normal, no errors from the kernel.

Jul 29 11:30:06 server1 syslogd 1.4.1: restart.
Jul 29 11:30:06 server1 kernel: klogd 1.4.1, log source = /proc/kmsg started.
Jul 29 11:30:06 server1 kernel: Initializing cgroup subsys cpuset
Jul 29 11:30:06 server1 kernel: Initializing cgroup subsys cpu
Jul 29 11:30:06 server1 kernel: Linux version 2.6.32.43-6.art.x86_64 (mockbuild@archelon.atomicorp.com) (gcc version 4.3.2 20081105 (Red Hat 4.3.2-7) (GCC) ) #1 SMP Thu Jul 14 14:14:48 EDT 2011

Did you reboot the system, or did this occur on its own?

premierhosting · Sat Jul 30, 2011 6:12 pm

Good point, I just bring up the kernel because that's the change to the system that happened between "works" and "crashes." I did the reboot manually.

For monitoring, I haven't gotten Nagios to work, so I'm just doing a simple w >> loadlog from cron every 5 minutes.

Code: Select all

 10:15:01 up 4 days, 11:58,  1 user,  load average: 0.69, 0.44, 0.30
Fri Jul 29 10:20:01 PDT 2011
 10:20:01 up 4 days, 12:03,  1 user,  load average: 0.35, 0.32, 0.28
Fri Jul 29 10:25:01 PDT 2011
 10:25:01 up 4 days, 12:08,  0 users,  load average: 1.04, 0.77, 0.48
Fri Jul 29 10:30:01 PDT 2011
 10:30:01 up 4 days, 12:13,  0 users,  load average: 0.99, 0.92, 0.62
Fri Jul 29 10:35:01 PDT 2011
 10:35:01 up 4 days, 12:18,  0 users,  load average: 0.80, 0.93, 0.70
Fri Jul 29 10:40:01 PDT 2011
 10:40:01 up 4 days, 12:23,  0 users,  load average: 1.09, 0.96, 0.77
Fri Jul 29 10:45:01 PDT 2011
 10:45:01 up 4 days, 12:28,  0 users,  load average: 1.39, 1.12, 0.86
Fri Jul 29 10:50:01 PDT 2011
 10:50:01 up 4 days, 12:33,  0 users,  load average: 1.01, 1.06, 0.91
Fri Jul 29 10:55:01 PDT 2011
 10:55:01 up 4 days, 12:38,  0 users,  load average: 1.08, 1.06, 0.94
Fri Jul 29 11:00:01 PDT 2011
 11:00:01 up 4 days, 12:43,  0 users,  load average: 0.99, 1.02, 0.94
Fri Jul 29 11:05:02 PDT 2011
 11:05:02 up 4 days, 12:48,  0 users,  load average: 1.17, 1.18, 1.02
Fri Jul 29 11:10:01 PDT 2011
 11:10:01 up 4 days, 12:53,  0 users,  load average: 0.41, 0.84, 0.93
Fri Jul 29 11:35:01 PDT 2011
 11:35:01 up 5 min,  1 user,  load average: 1.06, 1.06, 0.51
Fri Jul 29 11:40:01 PDT 2011
 11:40:01 up 10 min,  1 user,  load average: 1.29, 1.25, 0.75
Fri Jul 29 11:45:01 PDT 2011
 11:45:01 up 15 min,  1 user,  load average: 1.12, 1.26, 0.89
Fri Jul 29 11:50:01 PDT 2011
 11:50:01 up 20 min,  1 user,  load average: 0.83, 1.07, 0.90
Fri Jul 29 11:55:01 PDT 2011

This brings me to another thought. With the previous kernel, all else being equal, the load stayed marginal; with the new kernel it's regularly going above 1 and staying there. There's a RAID 1 using md3_sync and something else for some time after reboot. The last round of crashes I had was tied to the RAID, with the 99-raid-check process running every Sunday. Pulling that out of cron.weekly made the crashes mostly go away, but it wasn't until the previous kernel that the system totally stabilized.
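To see whether an md resync or check is actually in progress at a given moment, one option is to read /proc/mdstat. A minimal sketch, with a hypothetical function name, that keeps only the array header lines and any progress lines:

```shell
# Show md array headers plus any resync/recovery/check progress lines.
# Pass a different file as $1 to inspect a saved copy of mdstat.
md_sync_status() {
    mdstat_file="${1:-/proc/mdstat}"
    grep -E '^md[0-9]+ :|resync|recovery|check' "$mdstat_file"
}

# Usage: md_sync_status
```

If only the "mdN : ..." header lines come back, no array is resyncing at that moment.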

mikeshinn · Sun Jul 31, 2011 1:32 pm

Regarding load, what process(es) are using more CPU, I/O, etc. on your system?

premierhosting · Sun Jul 31, 2011 2:27 pm

ossec-security and md3_raid rsync are the major users of resources when I top. As far as "right before the crash" I have no idea. What would you suggest to gather that data?

mikeshinn · Sun Jul 31, 2011 3:56 pm

ossec-security and md3_raid rsync are the major users of resources when I top. As far as "right before the crash" I have no idea. What would you suggest to gather that data?

Do you mean ossec-syscheckd? If so, and if you are running ASL 3, then you may want to look at this new feature we added into ASL:

https://www.atomicorp.com/wiki/index.ph ... lot_of_CPU

ossec-syscheckd can now tell you exactly what changed inside a file, and will report it. We also added in a new default directory to watch your PHP, JS and HTML files for your web sites. This will make the system do more work, and you may not want to do that. That article explains how to tune the system for your needs.

As for md3_raid, is it using a lot of I/O and/or CPU time? That might be a totally different issue.
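On gathering data from right before a hang: one possible approach is to extend the existing loadlog idea so each sample also records memory and the busiest processes. The sketch below is an assumption-laden example, not an ASL feature. The function name and the /var/log/loadlog path are made up, and `--sort` is specific to the procps version of `ps` found on Linux.

```shell
# Append a timestamped snapshot of load, free memory, and the top CPU
# consumers to a log file. Function name and default path are hypothetical.
snapshot() {
    snap_log="${1:-/var/log/loadlog}"
    {
        date
        cat /proc/loadavg
        grep -E '^(MemFree|SwapFree):' /proc/meminfo
        # Header line plus the ten processes using the most CPU
        ps -eo pid,user,pcpu,pmem,rss,stat,comm --sort=-pcpu | head -n 11
        echo
    } >> "$snap_log"
}

# Usage: put this in a script that calls "snapshot" and run it from cron,
# for example once a minute, so the last entry is close to the hang.
```

After the next hang, the final few entries in that file show what was using CPU and memory shortly before the server stopped responding.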

premierhosting · Sun Jul 31, 2011 6:18 pm

Will check out those notes.

Here's a top right now with a load of 2.74 after doing an asl -u and having this happen:

Code: Select all

Checking for updates..
  ASL version is current: 3.0.2                            [OK]
  APPINV rules are current: 201107281511                   [OK]
  CLAMAV rules are current: 201107281103                   [OK]
  GEOMAP rules are current: 201107291042                   [OK]
  Updating MODSEC to 201107311626: updated                 [OK]
Stopping httpd:                                            [FAILED]
(98)Address already in use: make_sock: could not bind to address [::]:80
(98)Address already in use: make_sock: could not bind to address 0.0.0.0:80
no listening sockets available, shutting down
Unable to open logs
Starting httpd:                                            [FAILED]
  OSSEC rules are current: 201107301424                    [OK]

Code: Select all

15593 apache    20   0  583m 271m 4304 S 65.0  6.9   0:01.40 httpd                                                        
15476 apache    20   0  587m 223m 6076 S 16.5  5.7   0:01.21 httpd                                                        
15477 apache    20   0  596m 233m 5048 R 11.9  5.9   0:00.49 httpd                                                        
 9260 root      20   0 21980  16m  580 D  5.5  0.4   3:18.65 ossec-syscheckd                                              
15479 apache    20   0  599m 235m 4580 S  2.7  6.0   0:01.11 httpd                                                        
 3230 mysql     20   0  310m  50m 4808 S  1.8  1.3  40:08.28 mysqld                                                       
15475 apache    20   0  608m 243m 6240 S  1.8  6.2   0:01.04 httpd                                                        
15483 apache    20   0  598m 234m 6060 S  1.8  6.0   0:00.84 httpd                                                        
15484 apache    20   0  546m 235m 4176 S  1.8  6.0   0:00.37 httpd                                                        
15592 apache    20   0  523m 210m 1940 S  1.8  5.4   0:00.07 httpd                                                        
15523 apache    20   0  544m 233m 5880 S  0.9  6.0   0:00.36 httpd

mikeshinn · Sun Jul 31, 2011 9:52 pm

(98)Address already in use: make_sock: could not bind to address [::]:80
(98)Address already in use: make_sock: could not bind to address 0.0.0.0:80
no listening sockets available, shutting down

https://www.atomicorp.com/wiki/index.ph ... 5B::.5D:80

premierhosting · Tue Aug 02, 2011 2:25 pm

Yep, got that stuff.

premierhosting · Tue Aug 09, 2011 9:19 pm

Still having irregular crashes without any reporting. Disabled the vhosts directory scanning.

I'm concerned that with 4 GB of RAM on this system, these heavy httpd processes might just be loading up:

Code: Select all

 3644 apache    20   0  604m 241m 7164 S  2.4  6.1   0:49.65 httpd            
 3649 apache    20   0  608m 245m 7204 S  1.6  6.3   0:47.74 httpd            
 4321 apache    20   0  627m 263m 6996 S  1.6  6.7   0:39.77 httpd            
 4329 apache    20   0  610m 247m 7096 S  1.6  6.3   0:42.83 httpd            
 4331 apache    20   0  610m 246m 7172 S  1.6  6.3   0:42.79 httpd

That's only 16 httpd processes to fill up all the RAM.
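The arithmetic behind that estimate, as a small sketch. It takes the "total" figure from the free output earlier in the thread (in kB) and a typical RES value from top (in MB). The function name is made up, and the result is only approximate: RES includes shared pages, and other processes need memory too.

```shell
# How many processes of a given resident size fit in total memory?
procs_to_fill() {
    total_kb="$1"   # "total" from free, in kB
    res_mb="$2"     # RES per process from top, in MB
    echo $(( total_kb / (res_mb * 1024) ))
}

# Usage: procs_to_fill 4020996 245    # prints 16
```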

Here's some top output of the md3 stuff:

Code: Select all

  125 root      20   0     0    0    0 S  1.6  0.0   1:13.01 md3_raid1        
  126 root      20   0     0    0    0 R  1.6  0.0   2:34.19 md3_resync

Any ideas? Would you recommend trying to pare down the httpd size by removing some of the mod_security checks?

scott · Sat Aug 13, 2011 5:43 pm

Try getting rid of any mod_rewrite or .htaccess files first. Those tend to use up way more resources than anything else.
