A while ago a colleague DBA asked me to help him with a database he was having problems with. The database was generating a lot of archived redo logs, and since there was also a standby database, these archives had to be shipped to the standby. At some point we were looking at the Top Activity screen in Enterprise Manager and saw wild highs and lows in CPU, disk I/O, cluster waits and other waits. That was strange, because the application was hardly doing anything at all.
Number one suspect was this query, which was run by the Oracle agent:
select 1 from "SYS"."AQ$_SYS$SERVICE_METRICS_TAB_L" where msgid = :1 and subscriber# = :2 and name = :3 and address# = :4
Number two was this one, started by the Q000 process:
select q_name, state, delay, expiration, rowid, msgid, dequeue_msgid , chain_no, local_order_no, enq_time, enq_tid, step_no, priority , exception_qschema, exception_queue, retry_count, corrid , time_manager_info, sender_name, sender_address, sender_protocol from "SYS"."SYS$SERVICE_METRICS_TAB" where msgid = :1
There was clearly a theme here. Of course I ran my sql_sql_id_html.sql script for both queries. Let’s look at the first query first. To begin with, I noticed the large number of child cursors (about 90). That doesn’t have to be a problem, but it usually indicates there is one.
From the bind data I found that different child cursors were looking up different values.
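If you want to check the child-cursor count and the captured binds yourself, something along these lines works (a sketch using the standard dynamic performance views; note that V$SQL_BIND_CAPTURE only samples bind values periodically, so it will not show every execution):

```sql
-- Count child cursors for a given SQL_ID; a high count often hints
-- at a cursor-sharing problem rather than a healthy statement.
SELECT sql_id, COUNT(*) AS child_cursors
FROM   v$sql
WHERE  sql_id = '&sql_id'
GROUP  BY sql_id;

-- Peek at the bind values the different child cursors were using.
SELECT child_number, name, position, value_string
FROM   v$sql_bind_capture
WHERE  sql_id = '&sql_id'
ORDER  BY child_number, position;
```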
The execution plan was a flat-out full table scan on AQ$_SYS$SERVICE_METRICS_TAB_L, with a cost of 1524. An execution plan from AWR showed that the old cost of the same plan used to be only 2.
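To compare the current plan with the older ones, DBMS_XPLAN can pull both from the shared pool and from AWR (a sketch, assuming the statement is still cached and you are licensed to query AWR):

```sql
-- Current plan(s) from the shared pool.
SELECT * FROM TABLE(dbms_xplan.display_cursor('&sql_id', NULL, 'TYPICAL'));

-- All historical plans AWR has captured for the same SQL_ID;
-- this is where an older, much cheaper cost can show up.
SELECT * FROM TABLE(dbms_xplan.display_awr('&sql_id'));
```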
The statistics history of SYS.AQ$_SYS$SERVICE_METRICS_TAB_L was very interesting. On October 20th and 21st the row count was 0. On October 22nd it suddenly was 552132, and the row count had been rising ever since.
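That row-count history comes from the optimizer statistics retention. A hedged sketch of how to query it (WRI$_OPTSTAT_TAB_HISTORY is an undocumented SYS-owned table, so column names and behaviour may differ between versions; query it read-only):

```sql
-- Saved statistics versions for the queue log table, oldest first.
-- ROWCNT is the row count recorded when the stats were gathered.
SELECT savtime, rowcnt, blkcnt, analyzetime
FROM   sys.wri$_optstat_tab_history
WHERE  obj# = (SELECT object_id
               FROM   dba_objects
               WHERE  owner = 'SYS'
               AND    object_name = 'AQ$_SYS$SERVICE_METRICS_TAB_L'
               AND    object_type = 'TABLE')
ORDER  BY savtime;
```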
By the time I ran the SQL performance report, the row count was 881475. I only had AWR response-time history for a week, and during that time the response time didn’t rise all that much, apart from some peaks.
The second query, on SYS.SYS$SERVICE_METRICS_TAB, did a unique index scan, but it ran a lot.
So what to do about a query that is basically internal SQL? First I decided to count the rows in the table AQ$_SYS$SERVICE_METRICS_TAB_L and then try to find out whether this was “normal”. By now the number of records was 906225; SYS.SYS$SERVICE_METRICS_TAB had about the same number. Then I checked the number of rows in these tables in other databases, just to get a feel for what was normal. The values ranged from 0 to 4. I also read our dossier on the database and application and found no information about the use of Advanced Queuing or anything else that could indicate a special situation.
There was an Oracle Support document (1228783.1) about SYS.SYS$SERVICE_METRICS_TAB growing continuously. It wasn’t exactly the same issue, but it had some handy queries for the investigation. I don’t remember how, but at a certain point I stumbled upon DocID 11162862.1 (ORA-01405: Fetched Column Value Is Null For An Advanced Queue).
Suddenly I remembered having seen that error somewhere before. It was when I ran into out-of-memory issues with the CJQ0 process. Those messages kept coming, and a colleague told me that killing the CJQ0 process would be a workaround. It would be safe, because Oracle would automatically restart this internal process. From that moment on, the out-of-memory messages were gone and… replaced by ORA-01405: fetched column value is NULL messages. I searched our logging and found that this had happened on this very database.
I checked the alert log and sure enough, the date those ORA-01405 messages started to appear was the date that AQ$_SYS$SERVICE_METRICS_TAB_L started to grow. Oracle Support DocID 1162862.1 had the solution. In short, I could do this:
1. Stop the queue of SYS$SERVICE_METRICS with:
BEGIN dbms_aqadm.stop_queue(queue_name => 'SYS.SYS$SERVICE_METRICS'); END;
2. Delete unnecessary/corrupted entries in AQ$_SYS$SERVICE_METRICS_TAB_L.
3. Purge the PROCESSED messages.
4. Restart the queue.
(There was more. Please check the DocID.)
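The steps above look roughly like this (a sketch run as SYS, under the assumption that the DocID’s criteria apply to your situation; always verify against the DocID itself before touching AQ internals):

```sql
-- 1. Stop the queue.
BEGIN
  dbms_aqadm.stop_queue(queue_name => 'SYS.SYS$SERVICE_METRICS');
END;
/

-- 2. Remove the orphaned entries from the dequeue log table.
--    (The DocID gives the exact delete criteria; shown here as a
--    plain delete for illustration only.)
DELETE FROM sys.aq$_sys$service_metrics_tab_l;
COMMIT;

-- 3. Purge the PROCESSED messages from the queue table.
DECLARE
  po dbms_aqadm.aq$_purge_options_t;
BEGIN
  po.block := FALSE;
  dbms_aqadm.purge_queue_table(
    queue_table     => 'SYS.SYS$SERVICE_METRICS_TAB',
    purge_condition => q'[qtview.msg_state = 'PROCESSED']',
    purge_options   => po);
END;
/

-- 4. Restart the queue.
BEGIN
  dbms_aqadm.start_queue(queue_name => 'SYS.SYS$SERVICE_METRICS');
END;
/
```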
I tried to stop the queue, waited and waited, and got a message that a deadlock was detected. That was weird. I tried a number of times but didn’t get through. Then I decided to execute steps 2 and 3 anyway. I expected a lot of locking issues, but I got through without much trouble. From then on, the number of active sessions gradually dropped to lower values. I couldn’t clean up SYS.SYS$SERVICE_METRICS_TAB, but later on it cleaned itself up. From then on the Top Activity screen showed what you’d expect from a rarely used database.
And in the end, the rate of log switches dropped from 60-100 per hour to 4 per hour. And I guess I’ve learned not to kill defenseless background processes.