So I was helping a customer with a performance problem with Apache. Each time we asked for information about the latest state of the system, the customer had tried more and more new things to fix it, and the issues kept growing. We'd started on a simple issue with Apache, then they started to have weird core dumps, then the customer couldn't log in, then things started to get slow, all before we could even get on the box.
It turns out that in the mad, frenzied attempt on the customer's part to solve the problem, the customer created more and more problems for themselves. They kept changing things and recompiling components from source to address a perceived problem, and the list just grew and grew. They added proxies in front of Apache and changed PHP settings, until really big unrelated problems started to present themselves. The system was a mess, and so far gone it had to be rekicked from scratch. The final source of the problem? No local DNS server.
That one simple thing caused the slowdowns, and even prevented the customer from being able to log in (they had a tcpwrapper rule set up for their hostname, and the DNS server they pointed to didn't always respond to queries). The core dumps were being caused by an application compiled against the wrong library (apparently --nodeps was used too).
The moral of the story: Slow down.
Before you start making changes, slow down and answer some simple questions first:
1) What is the problem?
Reduce your variables. It's never as complicated as it seems. Start eliminating things that are not the problem. Don't get caught up in trying to solve more than one problem at a time. Focus on a single problem and eliminate the variables that are complicating it. For example, if you compiled something yourself, does this happen when you use the binary supplied by the vendor? If it does, then you know it's not how you compiled it, but if it doesn't, you now know what the problem is.
Scott always likes to remind our developers about "one-offs". If you customize something, good luck. Try to stay within the lines. One technique I use to reduce variables is to keep a bunch of virtual machines set up with whatever OSes I'm using in an untouched state. Just fresh clean installs, no customizations. If I want to test something, I clone the image and then try out the problem on the clone. If I can't reproduce it, then I know it's something that's different about that machine. All I have to do now is isolate what's different.
And KVM is free after all, so go nuts, folks! This is a cheap, quick and free way to test and debug!
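One way to isolate what's different is to compare the package list of a clean clone against the problem box. This is only a sketch: it assumes an RPM-based system and that you have already saved the two lists yourself, and the file names and the pkg_diff function are made up for illustration.

```shell
#!/bin/sh
# Sketch: list packages that differ between a clean clone and the problem box.
# Assumes each list was produced beforehand with something like
#   rpm -qa | sort > clean.txt      (on the untouched clone)
#   rpm -qa | sort > broken.txt     (on the problem machine)
# Both file names are hypothetical.

# pkg_diff CLEAN_LIST BROKEN_LIST
# Prints "only-clean: NAME" or "only-broken: NAME" for each package
# that appears in one list but not the other.
pkg_diff() {
    clean_sorted=$(mktemp) || return 1
    broken_sorted=$(mktemp) || return 1
    # comm needs sorted input, so sort again in case the lists were not.
    sort -u "$1" > "$clean_sorted"
    sort -u "$2" > "$broken_sorted"
    comm -23 "$clean_sorted" "$broken_sorted" | sed 's/^/only-clean: /'
    comm -13 "$clean_sorted" "$broken_sorted" | sed 's/^/only-broken: /'
    rm -f "$clean_sorted" "$broken_sorted"
}
```

Running pkg_diff clean.txt broken.txt gives you a short list of suspects instead of a whole system to stare at. The same idea works on config files with plain diff.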
2) Can you reproduce it?
If you do the exact same thing, does it happen again? If you do that from a different computer, does it still happen? Can someone else reproduce it? My old engineering instructor had a simple saying: Never check your own work. Ask for a second opinion; it could be a local problem that only affects you.
Make sure you understand what conditions cause this (and what conditions don't). For example, if you can't log in, what happens if you try from another machine? And if you can't consistently reproduce this based on the information collected, then start over; that's not the problem.
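If you want a number instead of a feeling for "does it happen again", a small loop helps. This is a minimal sketch; the repro function is something I made up for illustration, and the command you hand it is up to you.

```shell
#!/bin/sh
# repro N COMMAND [ARGS...]
# Runs COMMAND N times and reports how many runs failed (non-zero exit).
# "repro" is a hypothetical helper name, not a standard tool.
repro() {
    runs=$1
    shift
    failures=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        # Discard the command's output; only the exit status matters here.
        "$@" >/dev/null 2>&1 || failures=$((failures + 1))
        i=$((i + 1))
    done
    echo "$failures of $runs runs failed"
}
```

For example, repro 10 ssh -o BatchMode=yes user@host true (with user@host standing in for your own login) tells you whether the login fails every time or only some of the time. Run it from a second machine too and compare.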
3) What do your logs and process trace tools tell you?
If the logs don't tell you what's going on, then it's time to do some debugging. strace is your friend if the system isn't logging what you need to know. Ask the application itself what it's doing. Don't guess; a good process trace will tell you exactly what's going on. All you need to do is:
strace -f -p pid_of_process
And watch the magic happen.
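On a busy process the magic scrolls by pretty fast, so it can help to write the trace to a file and then pull out only the calls that took a long time. The sketch below uses standard strace options (-f, -tt, -T, -o, -p); the function names and the threshold are my own choices for illustration.

```shell
#!/bin/sh
# trace_to_file PID OUTFILE
# Attach to a running process (and its children) and log syscalls to a file.
#   -f   follow child processes
#   -tt  timestamp each line with microseconds
#   -T   append the time spent in each syscall, e.g. <0.000123>
#   -o   write the trace to OUTFILE instead of stderr
trace_to_file() {
    strace -f -tt -T -o "$2" -p "$1"
}

# slow_calls THRESHOLD_SECONDS < TRACE
# Reads a trace made with -T on stdin and prints the lines whose
# trailing <seconds> value is greater than the threshold.
slow_calls() {
    awk -v limit="$1" '
        match($0, /<[0-9]+\.[0-9]+>$/) {
            # Strip the angle brackets and force a numeric comparison.
            secs = substr($0, RSTART + 1, RLENGTH - 2) + 0
            if (secs > limit) print
        }'
}
```

Something like trace_to_file pid_of_process /tmp/trace.out, then slow_calls 1 < /tmp/trace.out, shows only the calls that blocked for more than a second. A process stuck waiting on an unresponsive server tends to stand out right away.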
4) Collect information about the problem before you start making changes. Don't fix anything without knowing what the cause is first.
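A cheap way to do that is to save a snapshot of the system's state before you touch anything. This is just a sketch: the snapshot function and the things it records are my own picks, so add whatever matters for your problem.

```shell
#!/bin/sh
# snapshot DIR
# Saves a few basic facts about the system into DIR before anything is changed.
# What gets recorded here is an illustration, not a complete list.
snapshot() {
    dir=$1
    mkdir -p "$dir" || return 1
    date > "$dir/date.txt" 2>&1 || true
    uname -a > "$dir/uname.txt" 2>&1 || true
    ps aux > "$dir/ps.txt" 2>&1 || true
    df -h > "$dir/df.txt" 2>&1 || true
    # Keep a copy of the resolver config, if there is one.
    if [ -f /etc/resolv.conf ]; then
        cp /etc/resolv.conf "$dir/resolv.conf" || true
    fi
    return 0
}
```

Call it with a fresh directory each time, for example snapshot "/root/snapshots/$(date +%Y%m%d-%H%M%S)" (the path is hypothetical). If a change makes things worse, you can diff the new snapshot against the old one instead of trying to remember what the box looked like.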
Follow these and you'll save yourself not only a ton of time, but a ton of trouble by not having to rebuild your system.