99-raid-check

Posted: Mon Mar 07, 2011 11:26 pm
by premierhosting
Hi Guys,

99-raid-check runs in the cron.weekly at 4:22 am on Sunday. It's for syncing a RAID 1 software array.

Perhaps this has something to do with my server slowing to a crawl and hanging weekly, usually late Sunday night / early Monday morning.

Apparently this is new in RHEL 5.5 or 5.4 (and hence CentOS 5.5). Any suggestions on this, either for debugging the issue or for alleviating it? Could one safely discard this RAID check, at least for one week to test the theory? I'm running this on:

Code:

uname -a
Linux hostname 2.6.32.28-1.art.x86_64 #1 SMP Mon Feb 14 11:06:49 EST 2011 x86_64 x86_64 x86_64 GNU/Linux

The concern I have is that it's a glitch between the kernel and the RAID stuff. The server provider (1and1) basically says nothing is wrong with it, so it must be configuration. Once-per-week crashes are not cool. :)

Any ideas at all would be much appreciated.

Re: 99-raid-check

Posted: Tue Mar 08, 2011 12:47 am
by scott
That does seem suspicious. How are your RAIDs put together? We have a bunch of servers at 1and1 as well (rebuilt with AOOI) and I've never had anything like that happen. That script forces a sync event on the array if the array is idle, which isn't terribly risky. It makes me wonder if the array is actually damaged or something.

Re: 99-raid-check

Posted: Tue Mar 08, 2011 1:47 am
by premierhosting
"how are your raids put together?"
I'm not sure how to answer this question. What are you looking for? How would I tell if the array is damaged? It's reporting clean on the mdadm check. When I brought this online, I didn't use AOOI; the normal route seemed to work, and the AOOI thing looked more risky.
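
One rough way to answer the "is the array damaged?" question is to look at the member-status field in /proc/mdstat, where a healthy two-disk mirror shows [UU] and an underscore (for example [U_]) marks a missing or failed member. The sketch below is only an illustration: the function name degraded_arrays is made up here, and it assumes the usual mdstat layout where the status field follows the "mdN :" line.

```shell
#!/bin/sh
# Sketch only: list md arrays whose status field contains an underscore.
# degraded_arrays is a hypothetical helper name, not part of mdadm.
# $1 is an mdstat-format file; it defaults to /proc/mdstat.
degraded_arrays() {
    awk '
        # Remember the array name from lines such as "md0 : active raid1 ..."
        /^md[0-9]+ :/ { name = $1 }
        # A status field like [U_] or [_U] means a member is missing
        /\[[U_]*_[U_]*\]/ { print name }
    ' "${1:-/proc/mdstat}"
}
```

Running degraded_arrays with no argument reads the live /proc/mdstat and prints nothing when every array has all members present. After a check has completed, /sys/block/md0/md/mismatch_cnt (with md0 swapped for the real array name) is also worth reading; a clean status field alone does not prove the disks themselves are healthy.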

Re: 99-raid-check

Posted: Tue Mar 08, 2011 5:02 am
by BruceLee
First I would check whether it's this raid-check script, either by disabling it for testing or by letting it run on another day.
Otherwise you start digging for a misconfigured RAID or a bad hard drive even if there is none.

Re: 99-raid-check

Posted: Thu Mar 10, 2011 4:22 pm
by premierhosting
I manually ran the script and watched it finish; then, about 12 hours after that, the server hung again. Since it's been starting at 4:33 am on Sundays and probably finishing 4-5 hours later, that would put the late Sunday night hang about the same amount of time after the check ends. Is there any logic to that being causal? I disabled the script from cron.weekly, so we'll see how the results go this weekend.
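
For anyone wanting to run the same one-week test, one possible way to take a script out of the weekly run without deleting it is to clear its execute bits, since run-parts only runs executable files. This is a sketch of that approach, not necessarily how it was done above; the two function names are hypothetical, and the 755 mode is an assumption based on the permissions of the other cron.weekly scripts.

```shell
#!/bin/sh
# Sketch only: toggle a cron script by its execute bits.
# Both function names are made up for illustration.

# Stop run-parts from picking the script up, keeping the file in place.
disable_cron_script() {
    chmod a-x "$1"
}

# Restore the usual rwxr-xr-x mode (assumed original permissions).
enable_cron_script() {
    chmod 755 "$1"
}
```

For example, disable_cron_script /etc/cron.weekly/99-raid-check before the weekend and enable_cron_script on the same path afterwards.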

Re: 99-raid-check

Posted: Thu Mar 10, 2011 6:09 pm
by BruceLee
It could still be some other script, or maybe a combination of two. Check all cronjobs and logs.
Do you monitor CPU, RAM, swap usage, etc. with nagios, mrtg or something similar? It's always helpful to have some graphs for comparison.

Re: 99-raid-check

Posted: Thu Mar 10, 2011 6:21 pm
by premierhosting
I haven't succeeded at getting nagios or mrtg going. Will need to try again. I've been doing an uptime >> logfile every 5 minutes on a cron to see if there is a load spike before the hang, and there is nothing way off. Sometimes it shows a load a little over 1, but that's not a big deal.
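
A minimal sketch of that kind of load logging, for comparison. It reads /proc/loadavg directly so that each line carries a full date, which makes it easier to line entries up with the time of a hang. The function name log_load and the log path in the comment are assumptions for illustration only.

```shell
#!/bin/sh
# Sketch only: append a timestamped load-average line to a log file.
# log_load is a hypothetical helper name.
# $1 is the log file; $2 is the load source (defaults to /proc/loadavg).
log_load() {
    logfile=$1
    src=${2:-/proc/loadavg}
    # The first three fields are the 1-, 5- and 15-minute load averages.
    read -r one five fifteen rest < "$src"
    printf '%s %s %s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" \
        "$one" "$five" "$fifteen" >> "$logfile"
}

# Example crontab entry (script path and log path are placeholders):
# */5 * * * * /path/to/log_load.sh /var/log/load.log
```

If the machine locks up hard, the last few lines written before the gap show whether the load was climbing beforehand or the hang came without warning.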

I've pored over logfiles, and have nothing useful to show for it. Cronjobs brought up this possibility. The other weekly crons are:
-rwxr-xr-x 1 root root 380 Mar 27 2007 0anacron
-rwxr-xr-x 1 root root 146 Oct 29 05:02 50plesk-weekly
-rwxr-xr-x 1 root root 251 Sep 20 10:05 asl-webapp-inventory
-rwxr-xr-x 1 root root 414 Jan 6 2007 makewhatis.cron
The only other thing I do by cron is a mysql_backup shell script on each mysql table. This allows a quick recovery if data goes bad for the customers. These are staggered every 15 minutes starting about midnight, once per day.

Re: 99-raid-check

Posted: Fri Mar 11, 2011 4:59 am
by BruceLee
It does not have to be a weekly script. A combination, or something else entirely, is still possible.
Maybe asl-webapp-inventory, but I don't think so. It's a little bit resource consuming too, but it only runs weekly if you set it to weekly through ASL.
Check the last item in the ASL GUI Configuration.
You will have to dig deeper, excluding scripts step by step, to track it down to something. Everything else is just guessing.

Re: 99-raid-check

Posted: Wed Mar 23, 2011 5:58 pm
by premierhosting
Up for a happy 15 days after disabling the 99-raid-check.

Ideas?

Re: 99-raid-check

Posted: Tue May 31, 2011 7:13 pm
by Troy McClure
I seem to be having a problem with 99-raid-check too. Maybe it is an issue with newer 1and1 servers, because I just got mine set up and have this problem every time I try to run it. It completely hangs my server and I have to call them and get them to reboot it. The load on the server goes through the roof before it completely locks and I can't do anything. I did use AOOI to install mine and have used both the asl kernel and the standard CentOS kernel with the same result.

Re: 99-raid-check

Posted: Tue May 31, 2011 11:37 pm
by premierhosting
Troy - I've been fine just by disabling it. I feel like I'm missing out on something though.

Re: 99-raid-check

Posted: Wed Jun 01, 2011 10:18 am
by Troy McClure
Yeah, I just disabled it too. I would like to find out why this happens on the newer 1and1 server that I have and not on the older one, though. On my older box it does cause the load to jump, but nothing too bad, and nothing like what I see on the newer box. On my newer box it will run for about 10 minutes, then it stops responding and I have to reboot it. I have been able to see the load right before it stops responding completely, and the CPU load is at 30.
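
Since the load climbs while the check is running, one thing that could be tried before disabling the check entirely is capping the md resync speed through /proc/sys/dev/raid/speed_limit_max, which takes a value in KB/s. Whether this would avoid the hang described here is untested; the sketch below only shows the mechanics, and the function name set_resync_limit and the example value are assumptions.

```shell
#!/bin/sh
# Sketch only: cap the md resync/check speed.
# set_resync_limit is a hypothetical helper name.
# $1 is the limit in KB/s; $2 is the tunable to write
# (defaults to /proc/sys/dev/raid/speed_limit_max, needs root).
set_resync_limit() {
    limit=$1
    file=${2:-/proc/sys/dev/raid/speed_limit_max}
    # Refuse anything that is not a plain non-negative integer.
    case $limit in
        ''|*[!0-9]*)
            echo "limit must be a whole number of KB/s" >&2
            return 1
            ;;
    esac
    echo "$limit" > "$file"
}
```

For example, set_resync_limit 10000 would cap the check at roughly 10 MB/s until the value is changed again or the machine reboots; the right number depends on the disks and on how much I/O the server needs for its normal work.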

Re: 99-raid-check

Posted: Wed Jun 01, 2011 1:32 pm
by premierhosting
Anyone have the raid-check cron running that *is not* crashing their server?

Re: 99-raid-check

Posted: Wed Jun 01, 2011 1:37 pm
by BruceLee
Yes, I do (also on a 1and1 server).

Re: 99-raid-check

Posted: Wed Jun 01, 2011 1:48 pm
by premierhosting
Hi BruceLee,

Can you post your output of this:

[root@server1]# uname -a
Linux xxxx.xxxxx.xxx 2.6.32.28-1.art.x86_64 #1 SMP Mon Feb 14 11:06:49 EST 2011 x86_64 x86_64 x86_64 GNU/Linux