Summer is not a good time for my blogging it seems. I’m training for a 42 km bike ride at the Trio Triathlon in Deil, the Netherlands. It’s quarter triathlon relay. I always make sure I don’t have to swim (brrrr…) or run (not my thing either). Almost every afternoon before dawn I try to squeeze in a training round of about 20 km and 50 km on Sundays. So after work it’s cycling first, then dinner (I’m pretty serious about cooking, so no microwaved slush here) and then… whoops, look at the time. Time to sleep. And repeat the next day.
The ride is August 24th and it is an amateur thing. After everyone is finished most athletes treat themselves on beer or soda and we discuss the race. Who did great? How many times did the real quarter triathletes lap you?
At work I’m investigating some really elusive I/O performance issues. I’ve noticed that the log writer halts for 10+ seconds once in a blue moon. It’s not really easily repeatable and the issues are occurring on different databases and clusters from time to time. On the database you see spikes of “log file sync” waits and iostat shows 100% utilization. Fortunately I have collegues in different disciplines, like Linux experts and storage administrators, that really are helping us find this problem. I think we’re all learning from the experience.
It’s especially great that our storage guys think with us and really try to see what they can find, because I know many Oracle DBA’s have different experiences with storage experts. Responses usually seem to fall in the “I don’t see a problem” category, or “the storage system isn’t busy” category (“but why then do my db file sequential reads average on 15ms??”). Usually it is because they don’t have the tools to see the problem. But that doesn’t mean there isn’t one in the storage area.
I hope that by the time we’ve solved this one, I can shed some light in this blog on how to approach such a problem. The least I can say now is:
- If I/O related wait events pop up in AWR or Statspack, check your average wait time. Better yet, check the wait event histograms, because you’ll probably see a precentage of really fast wait times (1 ms and less), and possibly a percentage much slower ones. The fast ones likely come from cache. The slow ones are possibly the ones that are really read from disk.
- If you have a lot of average waits on db file sequential reads in the 8-16 ms range or worse, what you are getting from the actual disks is pretty slow.
- If your database is on Linux, check the %util on iostat. Is it 100%? Then you are likely waiting on something outside the database server. You can use OS Watcher to keep history of iostat output. Pretty damn handy. I know the newer Windows verions have a decent Performance Monitor, so you might check that.
We have already found out that two totally different databases are interfering with each other, where this should not happen on this storage system. The story continues.