Yesterday I got a question by mail to investigate a performance issue. A batchjob that started at 22:08, ran too long. I logged in on the database with Enterprise Manager and investigated the issue with the Oracle Diagnostics Pack. When I looked at the history of performance data, I could see a sharp peak between 22:00 and 23:00. After that not much was happening. There was one query with a response time that was significantly larger than any other.
I decided to use it’s SQL_ID as input for my sql_sql_id.sql script, so I would have all interesting data before the query would be flushed from the shared pool.
With that output I also had a history of response times of that query. It became immediately clear the query had a problem. In the days before the query had run 100+ times in an hour, everytime in 1-5 seconds. But for yesterday it was shown that there were zero executions and the response time was about 3500 seconds.
Table statistics to the rescue
I looked at the execution plan and there was a full table scan. I used the query in my table_stats_history.sql script to look at the history of statistics of the table that the full table scan was done on. And I got about this output:
OWNER OBJECT_NAME ROWCNT AVGRLN SAMPLESIZE ANALYZETIME ---------- ------------- ------- ------- ---------- ------------ APP_OWNER BATCH_TABLE 2279280 229 2279280 16-DEC-2012 APP_OWNER BATCH_TABLE 2192410 228 2192410 07-FEB-2013 APP_OWNER BATCH_TABLE 2283440 228 2283440 08-FEB-2013 APP_OWNER BATCH_TABLE 32120 228 32120 12-FEB-2013 APP_OWNER BATCH_TABLE 2315790 229 2315 13-FEB-2013 APP_OWNER BATCH_TABLE 33780 229 33780 14-FEB-2013
I had never seen this before. The amount of rows varied heavily and the sample size went from 100% to 1% and back again. What was this supposed to mean?
Then I remembered the time the batchjob started. This was exactly at the same time as Oracle’s maintenance window was open for, amongst others, the gathering of statistics. Could it be, that the batch job emptied and filled this table every night? The answer was almost certainly yes.
Case closed.. not!
Feeling very confident, I send a brilliant report that explained everything. “Case closed”, I thought.
Except.. I soon got a response by mail making clear that the case was anything but closed. “The batch job is still running”, it told. And by now it was 16:00.
I looked at the Top Activity screen in Enterprise Manager with realtime data. It was a desert out there. Only a handful sessions were running and the usernames of these sessions were familiar: SYS, DBSNMP and my own account. There was only one explanation now. This performance problem was not happening in this database. Maybe in the application, maybe it was a different database, anywhere but here. Which was what later turned out to be the case.
The moral of the story
There’s a lesson in this: it’s too easy to attack performance problems with a technical approach, after hearing or reading half a word. Chances are you are running, like I did, in the wrong direction. Take your time to do a proper analysis of the situation. Call people by phone or talk to them in person. Yes, for this job you need communication skills. Don’t you just hate it when performance cases end up with that advice?