Introduction to Parallel Execution


Transcript of Introduction to Parallel Execution

  • Tuning & Tracing Parallel Execution (An Introduction)

    Doug Burns([email protected])

  • Agenda: Introduction, Parallel Architecture, Configuration, Dictionary Views, Tracing and Wait Events, Conclusion

  • Introduction: the Parallel Query Option was introduced in 7.1; it is now called Parallel Execution

    Parallel Execution splits a single large task into multiple smaller tasks which are handled by separate processes running concurrently. It applies to:
    - Full Table Scans
    - Partition Scans
    - Sorts
    - Index Creation
    - And others

  • Introduction: A little history

    So why did so few sites implement PQO?
    - Lack of understanding
    - Leads to horrible early experiences
    - Community's resistance to change
    - Not useful in all environments
    - Needs time and effort applied to the initial design!

    Isn't Oracle's Instance architecture parallel anyway?

  • Introduction: Non-Parallel Architecture?

  • Parallel Architecture

  • Parallel Architecture

    Non-Parallel

    Parallel (Degree 2)

  • Parallel Architecture: The Degree of Parallelism (DOP) refers to the number of discrete threads of work

    The default DOP for an Instance is calculated as cpu_count * parallel_threads_per_cpu. This value is used if I don't specify a DOP in a hint or table definition.

    The maximum number of PX slaves is DOP * 2, plus the Query Coordinator. But this is per Data Flow Operation, and the slaves will be re-used.
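As a quick sanity check, the instance default DOP described above can be derived directly from the two parameters (a sketch, assuming you have SELECT access to v$parameter; column names are as in recent versions):

```sql
-- Hedged sketch: compute cpu_count * parallel_threads_per_cpu,
-- the instance default DOP mentioned in the slide above.
SELECT MAX(DECODE(name, 'cpu_count', TO_NUMBER(value))) *
       MAX(DECODE(name, 'parallel_threads_per_cpu', TO_NUMBER(value)))
         AS default_dop
FROM   v$parameter
WHERE  name IN ('cpu_count', 'parallel_threads_per_cpu');
```

On the single-CPU laptop used later in this presentation, this should come out as 1 * 2 = 2.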

  • Parallel Architecture

    Inter-process communication is through message buffers (also known as table queues). These can be stored in the shared pool or the large pool.

  • Parallel Architecture

    This slide intentionally left blank

  • Parallel Architecture: Methods of invoking Parallel Execution

    Table / Index Level:
    ALTER TABLE emp PARALLEL (DEGREE 2);

    Optimizer Hints:
    SELECT /*+ PARALLEL(emp) */ * FROM emp;

    Statement Level:
    ALTER INDEX emp_idx_1 REBUILD PARALLEL 8;

    Note: Using Parallel Execution implies that you will be using the Cost-based Optimiser. As usual, appropriate statistics are vital.
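After setting parallelism at the object level, it's worth verifying what actually stuck. A sketch, using the standard USER_TABLES view and the EMP table from the slide:

```sql
-- Check the DEGREE setting on a table after ALTER TABLE ... PARALLEL.
-- DEGREE is reported as a string, e.g. '2', '8' or 'DEFAULT'.
SELECT table_name, degree
FROM   user_tables
WHERE  table_name = 'EMP';
```

A value of 'DEFAULT' means the instance default DOP applies, as described earlier.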

  • Configuration

  • Configuration: parallel_automatic_tuning

    First introduced in Oracle 8i. This is the first parameter you should set - to TRUE. An alternative point of view: don't use it! Deprecated in 10g, where the default is FALSE, but much of the same functionality is implemented anyway.
    - Ensures that message queues are stored in the Large Pool rather than the Shared Pool
    - Modifies the values of other parameters
    As well as the 10g default values, the following sections show the values when parallel_automatic_tuning is set to TRUE on previous versions.

  • Configuration: parallel_adaptive_multi_user

    First introduced in Oracle 8. Default value: FALSE (TRUE in 10g). Automatic Tuning default: TRUE. Designed for using PX for online usage. As workload increases, new statements will have their degree of parallelism down-graded.

    "This provides the best of both worlds and what users expect from a system. They know that when it is busy, it will run slower." - Effective Oracle by Design, Tom Kyte

  • Configuration: parallel_max_servers

    Default: cpu_count * parallel_threads_per_cpu * 2 (if using automatic PGA management) * 5, e.g. 1 CPU * 2 * 2 * 5 = 20 on my laptop. This is the maximum number of parallel execution slaves available for all sessions in this instance. Watch out for the processes trap!

    parallel_min_servers

    Default: 0. You may choose to increase this if PX usage is constant, to reduce the overhead of starting and stopping slave processes. More on this subject in tomorrow's presentation.

  • Configuration: parallel_execution_message_size

    Default value: 2148 bytes. Automatic Tuning default: 4Kb. This is the maximum size of a message buffer. It may be worth increasing to 8Kb, depending on wait event analysis. However, small increases in message size could lead to large increases in large pool memory requirements - remember that DOP-squared relationship between slaves, and multiple sessions.

  • Configuration: Metalink Note 201799.1 contains full details and guidance for setting all relevant parameters. Ensure that standard parameters are also set appropriately:
    - large_pool_size: modified by parallel_automatic_tuning; calculation in the Data Warehousing Guide; can be monitored using v$sgastat
    - processes: modified by parallel_automatic_tuning
    - sort_area_size: for best results use automatic PGA management; be aware of _smm_px_max_size
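The large pool consumption mentioned above can be watched with a simple query - a sketch against v$sgastat (component names within the pool vary by version, so inspect the full list rather than filtering on one name):

```sql
-- Monitor large pool usage; with PX message queues in the large
-- pool, look for the PX-related entries among the biggest consumers.
SELECT pool, name, bytes
FROM   v$sgastat
WHERE  pool = 'large pool'
ORDER  BY bytes DESC;
```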

  • Dictionary Views

  • Dictionary Views: Parallel-specific Dictionary Views

    SELECT table_name FROM dict
    WHERE table_name LIKE 'V%PQ%' OR table_name LIKE 'V%PX%';

    TABLE_NAME
    ------------------------------
    V$PQ_SESSTAT
    V$PQ_SYSSTAT
    V$PQ_SLAVE
    V$PQ_TQSTAT
    V$PX_BUFFER_ADVICE
    V$PX_SESSION
    V$PX_SESSTAT
    V$PX_PROCESS
    V$PX_PROCESS_SYSSTAT

    Also GV$PQ_SESSTAT and GV$PQ_TQSTAT with INST_ID

  • Dictionary Views: v$pq_sesstat

    Provides statistics relating to the current session. Useful for verifying that a specific query is using parallel execution as expected.

    SELECT * FROM v$pq_sesstat;

    STATISTIC                      LAST_QUERY SESSION_TOTAL
    ------------------------------ ---------- -------------
    Queries Parallelized                    1             1
    DML Parallelized                        0             0
    DDL Parallelized                        0             0
    DFO Trees                               1             1
    Server Threads                          3             0
    Allocation Height                       3             0
    Allocation Width                        1             0
    Local Msgs Sent                       217           217
    Distr Msgs Sent                         0             0
    Local Msgs Recv'd                     217           217
    Distr Msgs Recv'd                       0             0

  • Dictionary Views: v$pq_sysstat

    The instance-level overview. Various values, including information to help set parallel_min_servers and parallel_max_servers. v$px_process_sysstat contains similar information.

    SELECT * FROM v$pq_sysstat WHERE statistic LIKE 'Servers%';

    STATISTIC                           VALUE
    ------------------------------ ----------
    Servers Busy                            0
    Servers Idle                            0
    Servers Highwater                       3
    Server Sessions                         3
    Servers Started                         3
    Servers Shutdown                        3
    Servers Cleaned Up                      0

  • Dictionary Views: v$pq_slave

    Gives information on the activity of individual PX slaves. v$px_process contains similar information.

    SELECT slave_name, status, sessions, msgs_sent_total, msgs_rcvd_total
    FROM v$pq_slave;

    SLAV STAT   SESSIONS MSGS_SENT_TOTAL MSGS_RCVD_TOTAL
    ---- ---- ---------- --------------- ---------------
    P000 BUSY          3             465             508
    P001 BUSY          3             356             290
    P002 BUSY          3             153              78
    P003 BUSY          3             108              63
    P004 IDLE          2             249              97
    P005 IDLE          2             246              97
    P006 IDLE          2             239              95
    P007 IDLE          2             249              96

  • Dictionary Views: v$pq_tqstat

    Shows the communication relationship between slaves. Must be executed from a session that's been using parallel operations - it refers to this session only. Example 1: Attendance Table (25,481 rows)

    break on dfo_number on tq_id

    SELECT /*+ PARALLEL (attendance, 4) */ *
    FROM attendance;

    SELECT dfo_number, tq_id, server_type, process, num_rows, bytes
    FROM v$pq_tqstat
    ORDER BY dfo_number DESC, tq_id, server_type DESC, process;

    DFO_NUMBER      TQ_ID SERVER_TYP PROCESS      NUM_ROWS      BYTES
    ---------- ---------- ---------- ---------- ---------- ----------
             1          0 Producer   P000             6605     114616
                          Producer   P001             6102     105653
                          Producer   P002             6251     110311
                          Producer   P003             6523     113032
                          Consumer   QC              25481     443612

  • Dictionary Views: Example 2 - with a sort operation

    SELECT /*+ PARALLEL (attendance, 4) */ *
    FROM attendance
    ORDER BY amount_paid;

    DFO_NUMBER      TQ_ID SERVER_TYP PROCESS      NUM_ROWS      BYTES
    ---------- ---------- ---------- ---------- ---------- ----------
             1          0 Ranger     QC                372      13322
                          Producer   P004             5744     100069
                          Producer   P005             6304     110167
                          Producer   P006             6303     109696
                          Producer   P007             7130     124060
                          Consumer   P000            15351     261380
                          Consumer   P001            10129     182281
                          Consumer   P002                0        103
                          Consumer   P003                1        120
                        1 Producer   P000            15351     261317
                          Producer   P001            10129     182238
                          Producer   P002                0         20
                          Producer   P003                1         37
                          Consumer   QC              25481     443612

  • Dictionary Views: So why the unbalanced slaves? Check the list of distinct values in amount_paid.

    SELECT amount_paid, COUNT(*)
    FROM attendance
    GROUP BY amount_paid
    ORDER BY amount_paid
    /

    AMOUNT_PAID   COUNT(*)
    ----------- ----------
            200          1
            850          1
            900          1
           1000          7
           1150          1
           1200      15340
           1995      10129
           4000          1

  • Dictionary Views: v$px_session and v$px_sesstat

    Query to show slaves and physical reads

    break on qcsid on server_set

    SELECT stat.qcsid, stat.server_set, stat.server#, nam.name, stat.value
    FROM v$px_sesstat stat, v$statname nam
    WHERE stat.statistic# = nam.statistic#
    AND nam.name = 'physical reads'
    ORDER BY 1,2,3;

         QCSID SERVER_SET    SERVER# NAME                      VALUE
    ---------- ---------- ---------- -------------------- ----------
           145          1          1 physical reads                0
                                   2 physical reads                0
                                   3 physical reads                0
                         2          1 physical reads               63
                                   2 physical reads               56
                                   3 physical reads               61
                                     physical reads             4792

  • Dictionary Views: v$px_process

    Shows parallel execution slave processes, status and session information

    SELECT * FROM v$px_process;

    SERV STATUS           PID SPID                SID    SERIAL#
    ---- --------- ---------- ------------ ---------- ----------
    P001 IN USE            18 7680                144         17
    P004 IN USE            20 7972                146         11
    P005 IN USE            21 8040                148         25
    P000 IN USE            16 7628                150         16
    P006 IN USE            24 8100                151         66
    P003 IN USE            19 7896                152         30
    P007 AVAILABLE         25 5804
    P002 AVAILABLE         12 6772

  • Dictionary Views: Monitoring the SQL being executed by slaves

    set pages 0
    column sql_text format a60
    select p.server_name, sql.sql_text
    from v$px_process p, v$sql sql, v$session s
    WHERE p.sid = s.sid AND p.serial# = s.serial#
    AND s.sql_address = sql.address AND s.sql_hash_value = sql.hash_value
    /

    9i Results:

    P001 SELECT A1.C0 C0,A1.C1 C1,A1.C2 C2,A1.C3 C3,A1.C4 C4,A1.C5 C5, A1.C6 C6,A1.C7 C7 FROM :Q3000 A1 ORDER BY A1.C0

    10g Results:

    P001 SELECT /*+ PARALLEL (attendance, 2) */ * FROM attendance ORDER BY amount_paid

  • Dictionary Views: Additional information in standard Dictionary Views, e.g. v$sysstat

    SELECT name, value FROM v$sysstat WHERE name LIKE 'PX%';

    NAME                                            VALUE
    ---------------------------------------------- ----------
    PX local messages sent                               4895
    PX local messages recv'd                             4892
    PX remote messages sent                                 0
    PX remote messages recv'd                               0

  • Dictionary Views: Monitoring the adaptive multi-user algorithm

    We need to be able to check whether operations are being downgraded and by how much. Downgraded to serial could be a particular problem!

    SELECT name, value FROM v$sysstat WHERE name LIKE 'Parallel%';

    NAME                                                             VALUE
    ---------------------------------------------------------------- ----------
    Parallel operations not downgraded                                   546353
    Parallel operations downgraded to serial                                432
    Parallel operations downgraded 75 to 99 pct                             790
    Parallel operations downgraded 50 to 75 pct                            1454
    Parallel operations downgraded 25 to 50 pct                            7654
    Parallel operations downgraded 1 to 25 pct                            11873

    Monitoring the adaptive multi-user algorithm. We need to be able to check whether operations are being downgraded and by how much. Downgraded to serial could be a particular problem!

    SELECT name, value FROM v$sysstat WHERE name LIKE 'Parallel%';

    NAME                                                             VALUE
    ---------------------------------------------------------------- ----------
    Parallel operations not downgraded                                   546353
    P*ssed-off users                                                        432
    Parallel operations downgraded 75 to 99 pct                             790
    Parallel operations downgraded 50 to 75 pct                            1454
    Parallel operations downgraded 25 to 50 pct                            7654
    Parallel operations downgraded 1 to 25 pct                            11873

  • Dictionary Views: Statspack

    Example Report (Excerpt), during an overnight batch operation - mainly Bitmap Index creation. Slightly difficult to read:

    Parallel operations downgraded 1                 0
    Parallel operations downgraded 25                0
    Parallel operations downgraded 50                7
    Parallel operations downgraded 75               38
    Parallel operations downgraded to                1
    Parallel operations not downgrade               22

    With one stream downgraded to serial, the rest of the schedule may depend on this one job.

  • Tracing and Wait Events

  • Tracing and Wait Events

    Tracing Parallel Execution operations is more complicated than standard tracing. There is one trace file per slave (as well as the query coordinator) - potentially 5 trace files even with a DOP of 2. They may be in background_dump_dest or user_dump_dest (usually background_dump_dest).

    "The remaining task is to identify and analyze all of the relevant trace files. This task is usually simple..." - Optimizing Oracle Performance, Millsap and Holt

  • Tracing and Wait Events: Much simpler in 10g

    Use trcsess to generate a consolidated trace file for QC and all slaves

    exec dbms_session.set_identifier('PX_TEST');

    REM tracefile_identifier is optional, but might make things easier for you
    alter session set tracefile_identifier='PX_TEST';
    exec dbms_monitor.client_id_trace_enable('PX_TEST');

    REM DO WORK

    exec dbms_monitor.client_id_trace_disable('PX_TEST');

    GENERATE THE CONSOLIDATED TRACE FILE AND THEN RUN IT THROUGH TKPROF

    trcsess output=/ora/admin/TEST1020/udump/PX_TEST.trc clientid=PX_TEST /ora/admin/TEST1020/udump/*px_test*.trc /ora/admin/TEST1020/bdump/*.trc

    tkprof /ora/admin/TEST1020/udump/DOUG.trc /ora/admin/TEST1020/udump/DOUG.out

  • Tracing and Wait Events: This is what one of the slaves looks like

    C:\oracle\product\10.2.0\admin\ORCL\udump>cd ../bdump
    C:\oracle\product\10.2.0\admin\ORCL\bdump>more orcl_p000_2748.trc

    *** SERVICE NAME:(SYS$USERS) 2006-03-07 10:57:29.812
    *** CLIENT ID:(PX_TEST) 2006-03-07 10:57:29.812
    *** SESSION ID:(151.24) 2006-03-07 10:57:29.812
    WAIT #0: nam='PX Deq: Msg Fragment' ela= 13547 sleeptime/senderid=268566527 passes=1 p3=0 obj#=-1 tim=3408202924
    =====================
    PARSING IN CURSOR #1 len=60 dep=1 uid=70 oct=3 lid=70 tim=3408244715 hv=1220056081 ad='6cc64000'
    select /*+ parallel(test_tab3, 2) */ count(*)
    from test_tab3
    END OF STMT

  • Tracing and Wait Events

    Many more wait events, and more time spent waiting - the various processes need to communicate with each other. Metalink Note 191103.1 lists the wait events related to Parallel Execution. But be careful of what "Idle" means...
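As a starting point for real-time diagnosis, one hedged sketch for seeing which PX-related events sessions are waiting on right now (event names vary by version, hence the broad LIKE filter):

```sql
-- Current waits on Parallel Execution related events, instance-wide.
-- Cross-check against the idle-event caveat above before reacting.
SELECT event, COUNT(*) AS sessions_waiting
FROM   v$session_wait
WHERE  event LIKE 'PX%'
GROUP  BY event
ORDER  BY sessions_waiting DESC;
```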

  • Tracing and Wait Events: Events indicating consumers or QC are waiting for data from producers

    PX Deq: Execute Reply
    PX Deq: Table Q Normal

    Although considered idle events, if these waits are excessive, it could indicate a problem in the performance of the slaves

    Investigate the slave trace files

  • Tracing and Wait Events: Events indicating producers are quicker than consumers (or QC)

    PX qref latch

    Try increasing parallel_execution_message_size as this might reduce the communications overhead

    Although it could make things worse if the consumer is just taking time to process the incoming data.
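If wait analysis does point at message size, the change itself is small. A sketch (the parameter is not dynamically modifiable, so with an spfile it needs SCOPE=SPFILE and an instance restart to take effect):

```sql
-- Illustrative example: raise the message buffer size to 8Kb,
-- the value suggested in the Configuration section above.
ALTER SYSTEM SET parallel_execution_message_size = 8192 SCOPE = SPFILE;
```

Remember the earlier warning: even small increases here multiply across the DOP-squared slave connections and every concurrent parallel session, so check large pool headroom first.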

  • Tracing and Wait Events: Messaging Events

    PX Deq Credit: need buffer
    PX Deq Credit: send blkd

    Although there may be many waits, the time spent should not be a problem.

    If it is, perhaps you have an extremely busy server that is struggling to cope. Reduce the DOP? Increase parallel_execution_message_size? Don't use PX?

  • Tracing and Wait Events: Query Coordinator waiting for the slaves to parse their SQL statements

    PX Deq: Parse Reply

    If there are any significant waits for this event, this may indicate you have shared pool resource issues.

    Or you've encountered a bug!

  • Tracing and Wait Events: Partial Message Event

    PX Deq: Msg Fragment

    May be eliminated or improved by increasing parallel_execution_message_size

    Not an issue on recent tests

  • Tracing and Wait Events: Example - Excerpt from an overnight Statspack Report

    Event                      Waits   Timeouts  Time (s)   (ms)  /txn
    direct path read       2,249,666          0   115,813     51  25.5
    PX Deq: Execute Reply    553,797     22,006    75,910    137   6.3
    PX qref latch             77,461     39,676    42,257    546   0.9
    library cache pin         27,877     10,404    31,422   1127   0.3
    db file scattered read 1,048,135          0    25,144     24  11.9

    Direct Path Reads: Sort I/O, Read-ahead, PX Slave I/O. The average wait time - SAN!

  • Tracing and Wait Events

    Event                      Waits   Timeouts  Time (s)   (ms)  /txn
    direct path read       2,249,666          0   115,813     51  25.5
    PX Deq: Execute Reply    553,797     22,006    75,910    137   6.3
    PX qref latch             77,461     39,676    42,257    546   0.9
    library cache pin         27,877     10,404    31,422   1127   0.3
    db file scattered read 1,048,135          0    25,144     24  11.9

    PX Deq: Execute Reply - an idle event: the QC waiting for a response from slaves; some waiting is inevitable. PX qref latch - largely down to the extreme use of Parallel Execution; practically unavoidable, but perhaps we could increase parallel_execution_message_size? Library cache pin? Need to look at the trace files...

  • Conclusion

  • Conclusion: Plan / Test / Implement

    You're asking for trouble if you don't! Hardware: PX is designed to suck the server dry, and trying to squeeze a quart into a pint pot will make things slow down due to contention. Tune the SQL first - all the old rules apply, and the biggest improvements come from doing less unnecessary work in the first place. Even if PX does make things go quickly enough, it's going to use a lot more resources doing so.

  • Conclusion: Don't use it for small, fast tasks

    They won't go much quicker. They might go slower. They will use more resources.

    Don't use it for online work. Not unless it's a handful of users, with a predictable maximum number of concurrent activities, who understand the implications and won't go crazy when something takes four times as long as normal! It gives a false initial perception of high performance and isn't scalable. "Okay, Tom, set parallel_adaptive_multi_user to TRUE..."

  • Conclusion: The slower your I/O sub-system, the more benefit you are likely to see from PX. But shouldn't you fix the underlying problem? More on this in the next presentation...

    Consider whether PX is the correct parallel solution for overnight batch operations: a single stream of parallel jobs, or parallel streams of single-threaded jobs? Unfortunately you'll probably have to do some work to prove your ideas!

  • Tuning & Tracing Parallel Execution (An Introduction)

    Doug Burns ([email protected]) (oracledoug.blogspot.com) (doug.burns.tripod.com)

    Good morning. How are those hangovers coming along, then? Today I'm going to talk to you about Oracle's Parallel Execution capabilities and hopefully give you a few performance issues to take away and think about in the course of your day-to-day work. Let's have a look at an outline...

    Before we begin looking at the technical aspects, I'll spend a little time introducing myself and then introducing Parallel Execution, some of the history behind it, and asking why I haven't come across it more often at the various sites I've worked at.

    Next we'll take a look at the architecture used and how it integrates into Oracle's basic instance architecture. I'll also give you a very brief overview of how you start to use parallel operations in your application.

    I'll move on to look at how you enable parallel execution at the server level and some of the configuration issues that you need to consider to maximise the benefits.

    But for the main part of the presentation, I want to concentrate on how we can analyse the performance of parallel operations in various ways to help us identify problems. First I'll go through some of the many dictionary views that are handy for performing real-time diagnosis of what's going on.

    Next I'll look at how trace files are generated for the parallel execution processes, which adds some minor complications to the standard wait analysis techniques that you're probably used to. There are also many additional wait events to consider, which you might see in your Statspack reports, for example.

    Rather than just throwing technical information at you, I'm keen to introduce a little common sense into the equation when we reach the conclusion.

    The Parallel Query Option was introduced in version 7.1, but I've chosen to focus on 10gR2 for this presentation and try to highlight some of the differences in earlier versions. Over time, the name has been changed to Parallel Execution to reflect the fact that it also covers INSERTs, UPDATEs, index creation and so on. Put simply, parallel execution attempts to take a single large task that runs for a long time and break the work into smaller tasks, each of which will be processed by separate slave processes that are able to run concurrently. The intention is to maximise the use of the multiple CPUs that most Oracle servers will have installed. As you can see, there are a range of tasks that can be split up in this way, and I'm sure many of the DBAs in the audience will have used parallel index creation, but I'm going to be concentrating on queries today...

    Let's have a brief history lesson. The first time I heard of Oracle's Parallel Query Option was in 1993, when my boss returned from the IOUG conference (I think it was in Florida that year) hailing a demonstration of PQO that he'd attended. If my memory serves me well, Oracle had configured an nCube MPP server with dozens of processors, and when they enabled parallelism on a query they were able to demonstrate a near-linear improvement in performance (i.e. it ran dozens of times more quickly). He was an experienced database guy and I still remember his comment that this could be as big a step forward as B-tree indexing. Interestingly, having been excited by my boss's great expectations, I heard nothing about parallel operations for another 2 or 3 years, working at various sites as a contractor. [Put second bullet point up]. So why did so few sites implement the parallel query option?

    [First sub-bullet] I worked at a site where one of the DBAs decided we should give it a try to improve the performance of our overnight batch schedule. [Second sub-bullet] The results were disastrous, so it was switched off again because no one had the time to investigate why. Much of the time, that's the reality of life as a working DBA, as I'm sure many of you will be all too aware! Periodically I would hear about someone trying this wonderful feature and that it had disastrous effects, so it was never used as much as Oracle must have hoped.

    [Third sub-bullet] This is just a personal theory of mine, but I'm sure many of you will know what I mean by the Oracle community's occasional (and sometimes justified) resistance to change. I'm thinking here about RMAN, for example. Oracle introduce a feature, it doesn't seem to work too well, and the slightly cynical dinosaurs of the Oracle community (me included!) will discount the feature.

    [Fourth sub-bullet] The reality is of course that parallel execution simply doesn't suit many Oracle environments. So if you don't really need it for the type of systems you're working on - perhaps they're pretty well-tuned OLTP systems - then the design effort [fifth sub-bullet] and the potential risks aren't worth it. Although I have to say that it's pretty easy to take a swing at it and give it a test. So it's amazing how few of the sites I've worked at ever bothered to use Parallel Execution. Don't get me wrong, some sites were using it, particularly for data warehouses, and there were some papers out there by Jonathan Lewis and a few others, but many DBAs had never implemented it in earnest, and that was one of the original reasons I wrote the paper.

    As servers started to go faster and table sizes grew, we started to use it for maintenance operations such as index creation, loading data and so on, but I was still searching for that query that would go dozens of times more quickly. In fact, when I used to have people on courses talk to me about parallelism, I would always ask this rhetorical question. [CLICK] Isn't Oracle's architecture parallel anyway?

    We don't need a real architecture diagram here. Frankly, if you haven't seen one of those already, you should read the Oracle Concepts manual when you get home, because it's well covered in there. But I just wanted to highlight how many processes or threads are involved in any running Oracle instance. One of the foundations of the architecture is multiple processes running in parallel that use multi-CPU servers effectively. In a typical Production OLTP system there will be a large number of concurrent sessions. In most configurations, each will have their own dedicated server process to communicate with the instance - unless of course you're using the Multi-threaded Server model, but that's another story completely. When all of these server connection processes are considered, the reality is likely to be that you are using your available CPU resource quite effectively, particularly when you consider the various background processes as well!

    However, what about those situations when you have one resource-intensive job to run, whether it be a long-running report query, a large data load or perhaps an index creation? Maybe you're the only user of the system, so there would only be one of those dedicated server processes running on your expensive server. That's where parallel execution can prove extremely useful.

    So let's have a quick look at the basic parallel execution architecture, for those of you who haven't seen this before...

    I don't intend to spend too much time on this area because it's all extremely well documented in the Using Parallel Execution chapter of the Data Warehousing Guide, but I wanted to give you just enough of how PX works to make the rest of the presentation understandable.

    First, let's look at the default Non-Parallel architecture. [CLICK] This should be very familiar. The User Process (on the client or server) submits a SELECT statement that requires a full table scan of the EMP table, and the Dedicated Server Process is responsible for retrieving the results and returning them to the User Process.

    Let's look at how things change when we enable Parallel Execution. [CLICK] This time, the server is going to process the query in parallel, because of the optimiser hint. When the server sees that the requested Degree of Parallelism (DOP) for the emp table is two, the dedicated server process becomes the Query Coordinator. It makes a request for two Parallel Execution Slave processes and, if it's able to acquire them, it will divide all of the blocks that it would have had to scan in the emp table into two equal ranges. Each slave process is responsible for retrieving its own range of blocks from the table. As the data is retrieved, it will be returned to the query coordinator which will, in turn, return the data to the user process.

    Because the PX slaves are separate processes (or threads in a Windows environment), the operating system is able to schedule them and provide timely CPU resource in the same way that it would schedule individual user sessions. In fact, each PX slave is just like a normal dedicated server connection process in many ways, so it's like setting two normal user sessions to work on one problem. Of course, those users need to behave in a coordinated manner, so they're not like real users at all! [PAUSE]

    So that's an introduction to the basic architecture; let's look at the various ways you can ask Oracle to process your task in parallel.

    Bold bits are optional - see how timing goes on run-through. The first important concept you need to understand is the degree of parallelism. This just refers to the number of discrete threads of work. So in the example I've just shown you, the degree of parallelism or DOP was two. Clearly, higher DOPs will mean that more PX slaves are assigned to your task, so we'd hope the task may complete more quickly, but use more resources for all of the additional processes and inter-process messages.

    The default DOP for an Instance is calculated as the initialisation parameters cpu_count times parallel_threads_per_cpu, and that's the value that will be used if I don't explicitly set the DOP in a hint or table definition. [DEMO??? I could show them it here]

    So we might think that the maximum number of processes used by our parallel task would be equal to the DOP. In fact, the real number is DOP times 2, plus the query coordinator, and I'll show you the reasons behind this in a slide or two. We need to be careful here, though, because that maximum is per Data Flow Operation or DFO, and the slave processes might be reused at various stages of query execution. Let's take a look at the output of one of the tests I was running for the other paper. This is a query that performs a Hash Join and a Sort using a DOP of two. You can see that I've forced that with the hints.

    Remember I said that the maximum number of processes used for a query would be 2 times the DOP, + 1 for the query coordinator? Well this diagram shows you why. The key thing here is that our SELECT statement includes an ORDER BY on a non-indexed column, NAME, so Oracle is going to need to perform a sort operation. So heres how it does that.The first thing to notice is that theres no parallel hint in the query. I took that out deliberately to show you that, with a DEGREE setting of 2 on the table, theres no need for the hint in the query any more. So the DOP is 2. In fact, if I was running on a single CPU machine and the DEGREE setting on the table was DEFAULT, I would also get a DOP of two.Oracle decides that, as theres a SORT operation as well as the full table scan, it will allocate 2 sets of PX slaves, each containing two processes. The first set will scan half of the table each and the second set will each sort half of the results returned by the first set and then return them to the query coordinator. Each set of slaves has a different way of having its work divided. We already know that the table scanning slaves are given a range of blocks to trawl through and it seems sensible that the sort slaves should be given a range of values to sort. The ranges are decided by the Query Coordinators, acting as a Ranger in this case that tries to divide the workload evenly.However, because we cant guarantee the physical location in the table for the name BURNS, it means that both PX SLAVE 1 and PX SLAVE 2 are likely to read rows containing the name BURNS. So they both need to have a way of passing those rows to PX SLAVE 3.The way that rows are passed is through inter-process message buffers, also known as table queues that are stored in either the Shared Pool or the Large pool. 
But, as you can see, each slave in each set needs to be able to communicate with each slave in the other set, so this crisscross of connections will grow rapidly as we use higher degrees of parallelism.And just to complicate matters further, when I say query, Im really talking about a query block, so a sub-query may have two sets of slaves as well. Thanks to Jonathan Lewis for suggesting I make that point next time I did this presentation.The truth is that this diagram is a bit of a simplification as well see later when we look at some of the dictionary views.Ah, one of the oldest IT jokes around, a variation on a page I saw in IBM manuals around the time I got my first business computing job. So why have I left this slide blank? In fact, lets use this nifty new remote control and do the job properly. When I gave this presentation the first couple of times, I stood in front of a few hundred people and boldly stated some half-remembered rubbish about the DOP being dictated by the number of partitions scanned if the query was against partitioned tables. Until both Tom Kyte and Jonathan Lewis pointed out that it was completely untrue. But never fear, if you really want to see the original half-remembered rubbish, I made sure that you could by committing it to print in Select magazine, which some of you might receive. I keep a few copies in my house, just to remind me!Oh, and theres another angle to this story. Jonathan attended the very first time I gave this presentation, came up to talk to me about it afterwards and didnt mention my error at all. Which I think proves that either hes a very gentle man (he could have pointed it out in front of everyone as I was talking, after all) or that this presentation is somewhat more boring than I imagined and hed nodded off to sleep half way through. 
Ill leave you to decide which is true.I didnt want to spend too much time in this presentation discussing how to develop parallel applications because I wanted to concentrate on the server issues, but in the associated paper, there are a number of references to papers and books that discuss development issues. But we need to know how to switch it on, right?The first method is to alter your tables and indexes so that all operations against those schema objects could be performed in parallel. Youll see here were setting the DOP using the parallel(degree) syntax. Now any statement against the emp table is a candidate for parallelisation. Of course, its just a candidate the cost-based optimiser might decide that it wouldnt make sense to scan a small emp table in parallel. What do you think the DOP on the table will be if I omit the DEGREE part in brackets? Yep, whatever the default DOP is for the instance.Alternatively, we can use the PARALLEL optimizer hint. Ive left the DOP out in this example, to show you how youd accept the default DOP.Note that if you use a parallel hint like this then you are implicitly using the cost based optimiser, even if only for this statement, so that means you need to make sure that you have appropriate statistics on your emp table.And finally, there are statements which have a parallel clause, for example this INDEX REBULD statement. Again, we can leave the DEGREE off, and let Oracle decide.So lets have a think about which is the best approach. The down-side to setting the DOP at the object level is that weve lost control over precisely which statements are going to use parallel execution, so we might decide to use optimizer hints instead. However, we may not have access to all of the source code to embed hints, want to use the stored outline facility or even want to use hints for that matter. 
For example, at a previous site we were using Oracle Discoverer and we found that it was easier to set the degree of parallelism at the object level and let Oracle decide whether the statements that were submitted should be run in parallel, based on the object settings. This also allowed us to have a PL/SQL package that changed the DOP on relevant objects to a value that suited either overnight batch or online day processing. I'm pretty sure that this is Tom's preference: if you want to use PX, set the tables to parallel, let Oracle decide the default DOP and leave it to the server.

If you do use this approach, though, you need to be careful to do it right. We had a severe problem at my work a couple of weeks ago, because we use scripts to set all of the objects in a schema to parallel with the default DOP for certain occasional maintenance activities, then another script switches them back to noparallel afterwards. Now, several months ago, this script had failed part-way through and so it didn't disable the parallelism on some of the objects, but no-one had noticed. That's problem one, but it obviously hadn't impacted anything too much, because no-one had noticed. Then we upgraded the database from 8.1.7 to 9.2 and some batch jobs started to run much more slowly. So that's problem two. When we investigated, they were using PX. They hadn't been before, but I suggested this might be because the cost of the parallel option in 9i was cheaper than in 8i, so Oracle had plumped for PX where it hadn't before. When the objects were set back to noparallel, everything settled down again. That's just something for you to think about: be *very* careful that you understand the implications of having the wrong settings at the object level.

In order to enable parallel execution you'll need to configure your instance appropriately, so let's have a look at what's involved...
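To make the object-level approach concrete, here is a hedged sketch of the sort of scripts I'm describing. The loop over user_tables is illustrative only; real scripts would need logging and error handling so that a part-failure can't silently leave objects parallel, which is exactly the problem we hit :-

```sql
-- Switch every table in the current schema to the default DOP
-- (illustrative sketch only - add logging/error handling in real scripts)
BEGIN
  FOR t IN (SELECT table_name FROM user_tables) LOOP
    EXECUTE IMMEDIATE 'ALTER TABLE ' || t.table_name || ' PARALLEL';
  END LOOP;
END;
/

-- ... occasional maintenance activities run here ...

-- And back to serial afterwards
BEGIN
  FOR t IN (SELECT table_name FROM user_tables) LOOP
    EXECUTE IMMEDIATE 'ALTER TABLE ' || t.table_name || ' NOPARALLEL';
  END LOOP;
END;
/

-- Verify nothing was left behind. DEGREE is a padded
-- VARCHAR2 column in the dictionary, hence the TRIM.
SELECT table_name, degree
FROM   user_tables
WHERE  TRIM(degree) <> '1';
```

A verification query like the last one, run after the switch-back, would have caught our problem months earlier.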

    Well, for as long as I can remember, Oracle have steadily been working towards growing their sales in the small server space. Oracle has a reputation for being difficult to work with, and so many users are happy to stick with Other Database Engines. When I see Oracle introduce an automatic tuning facility for some aspect of the server, for example automatic Dictionary Cache tuning in version 7 or System Managed Undo in version 9, it indicates to me that many of their customers have found it difficult to tune that facility manually or simply don't have the time. And when things start to run slowly or fall over, where the people in this room might find that extremely interesting, others would find it simply annoying. At some point Oracle realised that this was the case for Parallel Execution, so they introduced the parallel_automatic_tuning parameter in Oracle 8i. To make life easy for yourself, this should be the first parameter you set to TRUE.

    HOWEVER... In fact, Oracle have deprecated parallel_automatic_tuning in 10g. Their thinking behind this is that it's largely self-tuning now anyway and the default settings of the parameters reflect much of the previous functionality of parallel_automatic_tuning. Mmmmmm...

    Automatic Tuning makes sure that the message buffers are placed in the large pool rather than the shared pool. This is a more appropriate place because it's not that sensible to manage them using the LRU algorithms that Oracle would use to manage the shared pool. These buffers are of a fixed size and they're almost definitely going to be reused. It also modifies a number of other parameters to what Oracle decided were more sensible defaults, and we'll see some of those effects in the next few slides. If you do set this parameter to TRUE, one of the most important settings it makes is this next one.

    Introduced in Oracle 8, this parameter was the biggest improvement to parallel execution configuration since PX was introduced, but using it has its own consequences that you need to consider. The default is FALSE on 9i, but when you set parallel_automatic_tuning to TRUE, it also enables parallel_adaptive_multi_user. The default in 10g is TRUE, so that would be a significant change in any upgrade process if you haven't used parallel_adaptive_multi_user previously.

    Imagine a situation where perhaps we have an Oracle Discoverer report running against a Data Warehouse that takes 8 minutes to run. So we modify the DOP setting on the relevant tables to see if using PX will improve performance. We find that a DOP of 4 gives us a run-time of 90 seconds and the users are extremely happy. However, to achieve this, the report is using a total of nine server processes. We decide to stress test the change to make sure that it's going to work, don't we? Or maybe we just decide to release this to production because it's such a fantastic improvement! The only difference is going to be whether we have a disastrous test, which might be bearable, or dozens of unhappy users, which is usually unbearable. The problem is that we've just multiplied the user population by nine and, whilst this worked fantastically well with just one report running, it won't scale to large user populations. The likelihood is that it won't take long before Oracle manages to suck the very last millisecond of CPU away, and the effect on the overall server performance will be very noticeable!

    So if we set this parameter to TRUE, as the workload on the server increases, new statements will have their degree of parallelism degraded so that the server doesn't grind to a halt. Which has to be a good thing, right? Well, let me give you two alternative views. In Tom Kyte's Effective Oracle by Design, he says the following :- This provides the best of both worlds and what users expect from a system. 
    They know that when it is busy, it will run slower. Now, before I go any further, Tom Kyte is a man that I admire very much for all the work he's done for the Oracle community and for his considerable technical and communication skills. It would be very difficult for me to find anything he has to say about PX that I'd disagree with. All I'm interested in here is different opinions and perspectives, not technical detail. Ask yourself this question: do my users expect the same report to run 4 or 8 times more slowly depending on what else is going on on the server? I'm not talking about 90 seconds versus 100 seconds, more like 90 seconds against 8 minutes, at unpredictable moments from the point of view of the user. In my opinion, the statement doesn't really reflect what a lot of users are like at all. The one thing they don't want is unpredictable performance. In fact, I just finished working at a site where the managers were very particular about the fact that they wanted a reasonable but, more importantly, consistent level of performance. They'd been badly burnt by parallel execution on an earlier release.

    Of course, the performance will only vary because you're asking the server to do more than it's capable of, so I think Oracle's solution is very sensible and I recommend it. But you must remember the implications and be able to articulate them to your users! Let me share a completely wild example with you. For this story to have its full effect, you need to understand that Oracle decides what DOP a query will use when it's submitted and then uses that DOP for the lifetime of the query. After I gave this presentation for the first time, someone came up to me afterwards and said I know what you mean about that parallel_adaptive_multi_user algorithm. 
    It got so bad for us that we suggested to the users that, if it was running too slowly, they cancel their query and resubmit it!

    As Oracle uses PX for user requests, it needs to allocate PX slaves, and it does this from a pool of slaves. These two parameters allow you to control the size of the pool and are very straightforward in use. The most difficult thing is to decide on the maximum number of slaves that you think is sensible for your server. I've seen people running dozens or hundreds of slaves on a 6 CPU server. Clearly that means that each CPU could be trying to cope with 10-20 or more active slave processes, and this probably isn't a good idea. This raises a question that people often ask: is there any point in using parallel execution on a single-CPU server? The initial obvious answer would appear to be no; however, if your disk subsystem is extremely slow, it may be that a number of slaves per CPU is beneficial, because most of your processes are spending most of their time waiting on disk I/O rather than actually doing anything! However, that needs to be balanced against the extra work that the operating system is going to have to do managing the run queue. The most important thing is to perform some initial stress testing and to monitor CPU and disk usage and the server's run queue carefully! I'll be talking about parallel_max_servers a lot more tomorrow.
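For reference, setting the pool sizes themselves is just a couple of parameters. A minimal sketch; the values here are purely illustrative, and SCOPE=SPFILE assumes a 9i-or-later instance started from an spfile, with the change picked up at the next restart :-

```sql
-- Keep a handful of slaves permanently available to avoid
-- constant process startup/shutdown...
ALTER SYSTEM SET parallel_min_servers = 4 SCOPE=SPFILE;

-- ...but cap the total pool at something the CPUs can actually sustain
ALTER SYSTEM SET parallel_max_servers = 16 SCOPE=SPFILE;
```

The right ceiling depends entirely on your CPU count and workload, which is why the stress testing mentioned above matters more than any rule of thumb.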

    This parameter controls the size of the buffers used to pass messages between the various slaves and the query coordinator. If a message is larger than this size, perhaps because you're processing extremely long rows, then it will be passed in multiple pieces, which may have a slight impact on performance. Tellingly, parallel_automatic_tuning increases the size from the default of 2Kb to 4Kb, so this is probably a useful starting point, but it may be worth increasing it to 8Kb or even larger. Bear in mind, though, that increasing this value will also increase the amount of memory used in the Large or Shared Pool, so you should check the sizing calculations in the documentation and increase the relevant parameter appropriately.
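If you do decide to raise it, the change itself is trivial. A sketch, with 4Kb chosen only as an example; it's a static parameter, so it takes effect at the next restart :-

```sql
-- Check the current setting first
SELECT name, value
FROM   v$parameter
WHERE  name = 'parallel_execution_message_size';

-- Static parameter: recorded in the spfile, picked up at the next restart
ALTER SYSTEM SET parallel_execution_message_size = 4096 SCOPE=SPFILE;
```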

    In addition to the parallel_ parameters, you should also think about the effect that all of the additional PX slaves will have on your server. For example, each is going to require a process and a session, and each is going to be using a sub-task SQL statement which will need to exist in the Shared SQL area. Then we need to think about all of the additional sort areas! The documentation is very good in this area, though, so I'll refer you to that.

    The easiest approach to getting some high-level information on what's going on with PX is through the dictionary views. Even on the extensive tests I've been running recently, where I was more interested in operating system statistics and event tracing, these were pretty useful in confirming that everything was working as planned.

    In a common single-instance environment, these begin with either V$PQ or V$PX, reflecting the change in Oracle's terminology over time. Typically, the V$PX_ views are the more recent, and Oracle change the views that are available reasonably frequently, so it's always worth using the query below to find out what views are available on the version that you're using...
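The query I have in mind is just a simple probe of v$fixed_table, along these lines :-

```sql
-- List the parallel execution views available in this version
SELECT name
FROM   v$fixed_table
WHERE  name LIKE 'V$PQ%'
OR     name LIKE 'V$PX%'
ORDER  BY name;
```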

    v$pq_sesstat provides statistics relating to the current session. It's one way of verifying that a specific query is using parallel execution as expected, but I'll show you a better way later.
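A quick sketch of the sort of check I mean: run your statement, then immediately, in the same session :-

```sql
-- Statistics for the last query and for the session as a whole.
-- A non-zero 'Queries Parallelized' in LAST_QUERY confirms PX was used.
SELECT statistic, last_query, session_total
FROM   v$pq_sesstat;
```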

    This view is useful for getting an instance-wide overview of how PX slaves are being used and is particularly helpful in determining possible changes to parallel_max_servers and parallel_min_servers. For example, if Servers Started and Servers Shutdown were constantly changing, maybe it would be worth increasing parallel_min_servers to reduce this activity. However, I saw a reference the other week on Metalink that suggested that having PX slaves sitting around for long periods of time could lead to shared pool problems. V$PX_PROCESS_SYSSTAT contains similar information.
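A minimal query to watch those figures might look like this :-

```sql
-- Instance-wide slave pool activity since startup
SELECT statistic, value
FROM   v$pq_sysstat
WHERE  statistic LIKE 'Servers%';
```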

    This is an interesting view. V$PQ_TQSTAT shows you table queue statistics for the current session, and you must have used parallel execution in the current session for this view to be accessible, because it's giving you information about what's happened in this session. It's a little bit like the autotrace facility in that respect. I like the way that it shows the relationships between the slaves and the query coordinator very effectively. For example, after running this query against the 25,481 row attendance table :-

    SELECT /*+ PARALLEL(attendance, 4) */ *
    FROM attendance;

    The contents of V$PQ_TQSTAT look like this :- We can see here that four slave processes have been used acting as row Producers, each processing approximately 25% of the rows, which are all consumed by the QC to return the results to the user. Note that the ORDER BY clause was courtesy of Jonathan Lewis and takes advantage of the happy fact that, if you order by descending Server Type, the listing shows activities in the order that they happen. We'll see this best on the next slide. Whereas for the following query with the ORDER BY clause, we'll see something more like this. There are a few new things going on here :-
    - The QC acts as a Ranger, which works out the range of values that each PX slave should be responsible for sorting
    - P004, P005, P006 and P007 are scanning 25% of the blocks each
    - P000, P001, P002 and P003 act as Consumers of the rows in Table Queue 0 that are being produced by P004-P007 and perform the sorting activity
    - They also act as Producers of the final sorted results, for the QC to consume
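For reference, the query behind those listings is a straightforward select from the view; the descending Server Type trick is just part of the ORDER BY (Ranger sorts after Producer, which sorts after Consumer, so DESC gives roughly execution order) :-

```sql
-- Table queue activity for this session's last parallel statement
SELECT dfo_number, tq_id, server_type, process, num_rows, bytes
FROM   v$pq_tqstat
ORDER  BY dfo_number, tq_id, server_type DESC, process;
```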

    What is a little worrying from a performance point of view is that P000 and P001 seem to be doing a lot more work than P002 and P003, which means that they will run for longer and we're not getting the full benefit of a degree 4 parallel sort. It's a good idea to look at the range of values contained in the sort column.

    So this example, which was based on real data I was playing around with at home, happens to have very skewed data, so there's only so much benefit to be had from parallelising this sort operation.

    One final word on v$pq_tqstat. When I was experimenting with subqueries and multiple DFOs a while ago, I noticed that a particular query against this view didn't contain a row for the query co-ordinator! Baffled by this, I asked Jonathan what he thought, and his reply was that the contents of this view couldn't always be trusted, particularly for more complex examples. But, I have to say, I haven't come across another example since then.

    V$PX_SESSION is a bit like V$SESSTAT but also includes information about which QC and which Slave Set each session belongs to, which allows us to see a given statistic (e.g. Physical Reads) for all steps of an operation, because we can tie the slaves together using the SID of their Query Coordinator.
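A sketch of the kind of query I mean, tying each active slave's statistics back to its coordinator; the statistic name here is just an example :-

```sql
-- Physical reads for every active PX session, grouped under its QC
SELECT ses.qcsid, ses.server_group, ses.server_set, ses.sid, stat.value
FROM   v$px_session ses, v$sesstat stat, v$statname nam
WHERE  stat.sid        = ses.sid
AND    stat.statistic# = nam.statistic#
AND    nam.name        = 'physical reads'
ORDER  BY ses.qcsid, ses.server_group, ses.server_set, ses.sid;
```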

    Because v$px_process gives us session information for all of the slaves, we can use it to build a query that shows what SQL each slave is executing. However, the results are very different on 10g than on previous versions, including 9i. This is an example of a more general change in 10g. When tracing or monitoring the PX slaves, the originating SQL statement is returned, rather than a block range query as shown earlier in this document. I think this makes it much easier to see at a glance what a particular long-running slave is really doing, rather than having to tie it back to the QC as on previous versions. However, it does mean that you can no longer see the underlying block range query that a particular slave is executing.
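The sort of query I'm talking about joins v$px_process through v$session to v$sql; a hedged sketch :-

```sql
-- What is each PX slave currently executing?
SELECT pro.server_name, ses.sid, ses.serial#, sq.sql_text
FROM   v$px_process pro, v$session ses, v$sql sq
WHERE  pro.sid            = ses.sid
AND    pro.serial#        = ses.serial#
AND    ses.sql_address    = sq.address
AND    ses.sql_hash_value = sq.hash_value;
```

On 9i this shows the slaves' block range queries; on 10g it shows the originating statement, as described above.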

    There is some additional information in the standard dictionary views, such as V$SYSSTAT, and I'll come back to this again later.

    Now I want to revisit some of the information that's in V$SYSSTAT. If you are using the Parallel Adaptive Multi-User algorithm, it's vital that you are able to check whether any particular operations have been severely downgraded because the server is too busy. Again, I've heard a couple of different points of view. Some people think that as soon as users hit bad performance, they'll moan like crazy. I've certainly come across that, but I've also come across users who won't say a thing and will just decide that the system is rubbish, so I think you have to actively check that performance is acceptable to your users, through testing, discussion and post-implementation monitoring. Obviously, we're not too concerned with operations that haven't been downgraded at all, but personally I'd quite like to see the Parallel operations downgraded to serial statistic renamed to [CLICK]

    Because the information in V$SYSSTAT is shown in a Statspack report, it also means you can monitor downgraded operations over a period of time. For example, this is a section of a report taken during our overnight batch operations whilst bitmap indexes were being recreated in parallel. We also had separate streams of index creation scripts, so we really had two levels of parallelism, if you like. One thing about the downgrade statistics is that they're truncated in a Statspack report, which is a minor irritation that you soon get used to. But what's interesting about this report is that one operation in the measurement window has been downgraded to serial. Now, if all of these jobs need to complete before the next part of the schedule can begin, you can see this would be a big problem, because all of our tuning efforts are going to be limited by this one stream. So that's the type of thing you should keep an eye out for. 
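Outside of a Statspack report, the same figures are available directly from the view; something like :-

```sql
-- How often have parallel operations been downgraded since startup?
SELECT name, value
FROM   v$sysstat
WHERE  name LIKE 'Parallel operations%';
```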
    Another way of looking at this would be to say that, by implementing Parallel Execution, you may improve performance but are even more likely to introduce a bottleneck. But sometimes dictionary views and Statspack reports are too high-level to give us the detailed information that we need, so sometimes we'll need to trace sessions. Even if you're used to analysing response time by generating trace files, using parallel execution makes things a little more complicated. You'll also need to be aware that there are a number of additional wait events because of the inter-process communication.

    Tracing Parallel Execution operations is more complicated than standard tracing. For starters, you'll find one trace file per slave (as well as one for the query coordinator), which means that potentially there will be 5 trace files even with a DOP of 2. The trace files probably won't be in user_dump_dest either, but in background_dump_dest. Although it's straightforward once you know about it, the background_dump_dest business is virtually guaranteed to catch you out if you're familiar with generating trace files but have never used parallel execution. You won't find too much specific information about tracing Parallel Execution either, because it's based on exactly the same principles as standard tracing, with the few differences mentioned above. For example, in the largely excellent Optimizing Oracle Performance, there is only a small section about Parallel Execution, with the following statement :-

    The remaining task is to identify and analyze all of the relevant trace files. This task is usually simple...

    The first sentence is certainly true. I'm not so sure about the second, though, particularly if you use PX during an overnight batch schedule that consists of hundreds of jobs! However, this becomes a lot easier when you move to 10g...

    There are several steps that I used when setting up tracing for the recent tests I've been doing. First, set the session identifier. Then set the tracefile identifier so that the string DOUG will be inserted into the filenames, and then switch on tracing. Then do whatever work I'm interested in tracing, then switch it off. Now I can use the trcsess utility and pass in the name of the tracefile output I want, the client id I'm interested in, then the trace file for the query coordinator, which will be in udump, and all of the trace files in bdump, which will include the trace files for the PX slaves. That will generate a consolidated trace file containing all of the information for the session, and I can pass that through tkprof if I want. Of course, the Hotsos Profiler or OraSRP are likely to make things much easier. 
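Spelled out as a script, the steps look something like this. This assumes 10g, where DBMS_MONITOR and trcsess arrived, and DOUG is obviously just my example identifier :-

```sql
-- Tag the session so its trace files can be found and consolidated later
EXEC DBMS_SESSION.SET_IDENTIFIER('DOUG')
ALTER SESSION SET tracefile_identifier = 'DOUG';

-- Switch on extended tracing (including wait events) for that identifier
EXEC DBMS_MONITOR.CLIENT_ID_TRACE_ENABLE(client_id => 'DOUG', waits => TRUE, binds => FALSE)

-- ... run the parallel workload of interest ...

EXEC DBMS_MONITOR.CLIENT_ID_TRACE_DISABLE(client_id => 'DOUG')

-- Then, at the operating system level, consolidate the QC trace (udump)
-- and the slave traces (bdump) and run the result through tkprof :-
--   trcsess output=doug.trc clientid=DOUG udump/*.trc bdump/*.trc
--   tkprof doug.trc doug.prf
```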
    Another difference when you use Parallel Execution is that you'll see many more wait events and much more time spent waiting, simply because the processes need to synchronise and communicate with each other. There's a good Metalink note that lists the various wait events, but you need to be careful not to ignore what Oracle term idle events, as I hope to show you.

    Here are two of the events that you'll see a lot of: PX Deq: Execute Reply and PX Deq: Table Q Normal. Essentially, the query coordinator or one of the slaves is waiting for data to come back from a slave. Now, because this is just a natural part of how PX works, Oracle considers these to be idle events. However, just because we need to wait for the slaves, it doesn't mean that the length of time we spend waiting is irrelevant. Later on, I'll show you an example of a situation where one of these waits turned out to be one of the key identifiers of a performance problem.

    I've found that PX qref latch is one of the events that a system can spend a lot of time waiting on when using Parallel Execution extensively (as you can see from the earlier Statspack example). Oracle suggest that you could try increasing parallel_execution_message_size, as this might reduce the communications overhead, but this could make things worse if the consumer is just taking time to process the incoming data.

    Although you will see a lot of waits on these synchronisation events - the slaves and QC need to communicate with each other - the time spent should not be a problem. If it is, perhaps you have an extremely busy server that is struggling to cope, and reducing the Degree of Parallelism and parallel_max_servers would be the best approach.

    Long waits on this event would tend to indicate problems with the Shared Pool, as the slaves are being delayed while trying to parse their individual SQL statements. (Indeed, this was the event I would have expected to see as a result of the bug I was talking about earlier, but the library cache pin waits were appearing in the Execute phase of the PX slaves' work.) Again, the best approach is to examine the trace files of the PX slaves and track down the problem there.

    More on this one in a moment...

    This event indicates that parallel_execution_message_size may be too small. Maybe the rows that are being passed between the processes are particularly long and the messages are being broken up into multiple fragments. It's worth experimenting with message size increases to reduce or eliminate the impact of this.

    Let me show you how parallel wait events can be less than obvious. This is a section from a Statspack report covering a long overnight batch schedule. It covers 8 hours (really!), which is longer than you would normally want to analyse, but my intention here is to give you a very high-level overview. The first event that we can see is Direct Path Read, which you may not see very often if you don't use parallel execution. Direct Path Reads are caused by one of the following :-
    - Sort I/O
    - Parallel Execution Slaves
    - Read-ahead

    The next event, PX Deq: Execute Reply, is considered by Oracle to be an idle event, so you might choose to ignore it.

    Don't even think about implementing Parallel Execution unless you are prepared to invest some time in initial testing, followed by ongoing performance monitoring. If you don't, you might one day hit performance problems, either server-wide or on an individual user session, that you'd never believe (until it happens to you). Parallel Execution is designed to utilise hardware as heavily as possible. If you are running on a single-CPU server with two hard disk drives and 512Mb RAM, don't expect significant performance improvements just because you switch PX on. The more CPUs, disk drives, controllers and RAM you have installed on your server, the better the results are going to be.

    Although you may be able to use Parallel Execution to make an inefficient SQL statement run many times faster, that would be incredibly stupid. It's essential that you tune the SQL first. In the end, doing more work than you should be, but more quickly, is still doing more work than you should be! To put it another way, don't use PX as a dressing for a poorly designed application. Reduce the workload to the minimum needed to achieve the task and then start using the server facilities to make it run as quickly as possible. Seems obvious, doesn't it? Using PX for a query that runs in a few seconds is pointless. You're just going to use more resources on the server for very little improvement in the run time of the query. It might well run more slowly! If you try to use PX to benefit a large number of users performing online queries, you may eventually bring the server to its knees. Well, maybe not if you use the Adaptive Multi-User algorithm, but then it's essential that both you and, more importantly, your users understand that response time is going to be very variable when the machine gets busy.

    The slower your I/O sub-system, the more benefit you are likely to see from PX - but shouldn't you fix the real problem?

    Consider whether PX is the correct parallel solution for overnight batch operations. It may be that you can achieve better performance using multiple streams of jobs, each single-threaded, or maybe you would be better off with one stream of jobs which uses PX. It depends on your application, so the only sure way to find out is to try the different approaches.

    45 minutes isn't long for this subject, but I'm happy to take any questions just now or, if you'd prefer, you can contact me via the email address shown here. You can pick up this presentation or a PDF version of the paper behind it at the link shown. I appreciate you taking the time to listen. Thank you.
