SQL*Net message from client on sqlplus?

Here is a very weird issue I encountered recently. I was working on a test to find out whether different values of parallel parameters would result in better performance for a data warehouse setup. I had noticed that several important SQL statements worked better when running parallel. So the project architect came up with 15 long running queries that might be tuned. What was important was not just the improved performance of the queries, but mainly to see how the instance/database would handle so many parallel queries running at the same time.

I created a SQL script for each query and a “settings script” that is run from each SQL script, so I could easily change settings for all SQL scripts. I also created a Linux shell script that runs the 15 SQL scripts simultaneously in the background, so they would all run together. I changed the list of all the columns in the select clause in the queries to a select count(*), so I wouldn’t have to wait on the output.

I want to discuss the changes I made to the parallellism parameters in another post, but here are the ones I changed:

parallel_degree_policy=TRUE
parallel_max_servers=500
parallel_servers_target=384
parallel_degree_limit=64

The version of the queries with count(*) in the select clause were lightning fast. The last query to finish, did so in 6 1/2 minutes.  At least two times as fast as the old situation, where parallel_max_servers was 30, parallel_servers_target was 8 and parallel_degree_limit was 4. When setting parallel_force_local=false so both RAC instances were used, the slowest query was even faster: 4 min 7 sec. So great, right?

Then I did a test with the original queries with all columns in the select clause, but I made sure my sqlplus run was silent, all possible output was turned off (termout off for example) and ran the test again. The top activity screen in Enterprise Manager now looked like this:

ash_sqlnetwaits

What you see is that there is some activity (CPU in green and I/O in blue) and then a lot of Scheduler class waits (light green) and then another bit of activity. These Scheduler class waits were “resmgr:pq queued” waits and I can explain those .. somewhat. You get those waits when your session is waiting for enough parallel servers (PX servers) to be available to run a query in parallel. This is the kind of behaviour you get when you set parallel_degree_policy=TRUE. If not enough parallel servers are available, your query has to wait in a queue. If parallel_degree_policy=FALSE your query will be run without delay, but in serial and I found out this lead to a lot of snapshot too old situations.

When I zoomed in on the running queries, I saw that – at first – they used a lot of resources, but after a couple of minutes doing that, it would do almost nothing on the database. If anything was happening, I couldn’t see it. In the SQL Monitoring screen however, the queries still had the status “running”. And they were running. They just didn’t seem to be doing anything. What was going on? Was this a freaky kind of behaviour because of the parameters I had set? Because the thing was: they would keep their parallel servers occupied. Hence the many resmgr:pq queued waits.

I traced a session that ran one of the queries, after it’s “real” activity. And this is what I would see:

FETCH #47399982626568:c=0,e=25,p=0,cr=0,cu=0,mis=0,r=15,dep=0,og=1,plh=2861747495,tim=1400508471933147
WAIT #47399982626568: nam='SQL*Net message from client' ela= 3635 driver id=1650815232 #bytes=1 p3=0 obj#=82504 tim=1400508471936796
WAIT #47399982626568: nam='SQL*Net message to client' ela= 1 driver id=1650815232 #bytes=1 p3=0 obj#=82504 tim=1400508471936816
FETCH #47399982626568:c=0,e=32,p=0,cr=0,cu=0,mis=0,r=15,dep=0,og=1,plh=2861747495,tim=1400508471936844
WAIT #47399982626568: nam='SQL*Net message from client' ela= 3635 driver id=1650815232 #bytes=1 p3=0 obj#=82504 tim=1400508471940494
WAIT #47399982626568: nam='SQL*Net message to client' ela= 0 driver id=1650815232 #bytes=1 p3=0 obj#=82504 tim=1400508471940512
FETCH #47399982626568:c=0,e=26,p=0,cr=0,cu=0,mis=0,r=15,dep=0,og=1,plh=2861747495,tim=1400508471940534
WAIT #47399982626568: nam='SQL*Net message from client' ela= 3644 driver id=1650815232 #bytes=1 p3=0 obj#=82504 tim=1400508471944196
WAIT #47399982626568: nam='SQL*Net message to client' ela= 0 driver id=1650815232 #bytes=1 p3=0 obj#=82504 tim=1400508471944214

And so on and on and on. Lots of waits on SQL*Net message from client. So that meant: waiting on stuff outside the database after every fetch. Network maybe, latency? No! I was logged in without SQL*Net. And when I logged in from another server with SQL*Net, things just got worse.

I just didn’t know why this was happening, until one point when I decided to run top:

top - 14:24:11 up 34 days, 25 min, 1 user, load average: 5.31, 8.45, 8.55
Tasks: 209 total, 6 running, 203 sleeping, 0 stopped, 0 zombie
Cpu(s): 84.9%us, 1.1%sy, 0.0%ni, 13.2%id, 0.1%wa, 0.0%hi, 0.7%si, 0.1%st
Mem: 67108864k total, 20744472k used, 46364392k free, 452708k buffers
Swap: 3145720k total, 581388k used, 2564332k free, 16513620k cached

PID   USER    PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
31042 fictive 16 0 79736 14m 10m R 82.3 0.0 6:44.52 sqlplus
31043 fictive 15 0 79600 14m 10m R 74.7 0.0 25:47.69 sqlplus
31060 fictive 15 0 79600 14m 10m S 50.8 0.0 2:47.34 sqlplus
31067 fictive 15 0 79600 14m 10m R 49.1 0.0 2:33.49 sqlplus
31059 fictive 16 0 79600 13m 10m R 45.8 0.0 2:20.46 sqlplus
31086 fictive 15 0 79600 14m 10m R 39.5 0.0 2:13.50 sqlplus

These queries were retrieving so much data that sqlplus was working overtime to process that data. This was the same amount of data that the real application, Informatica’s PowerCenter (ETL tool), had to process. And now I could see why a couple of months back the number of processors on the PowerCenter server had to be extended and why before CPU usage maxed out to 100% so often on the server.

It was madness, really, because after the data would be retrieved by PowerCenter, it would have to be inserted in another table in the same database. Row by row even, so another added extreme inefficiency. This is probably the reason why Informatica has an option called Push-Down Optimalization. From what I’ve heard it translates data streams in INSERT .. AS SELECT kind of database commands and does everything on the database. It requires an extra license, but considering the costs of the inefficiency of the standard mode of working in PowerCenter, it seems to me money well spent.

About Marcel-Jan Krijgsman

Marcel-Jan is de PR-functionaris van de Werkgroep Maan en Planeten. Hij verzorgt ook het nieuws op de Facebook pagina en deze blog.
This entry was posted in Oracle performance tuning and tagged , , , , , , , , . Bookmark the permalink.

Leave a comment