Those who are familiar with the Informix database have probably used a utility named "onstat" to monitor the database and do some cool tricks by looking at the innards of the Informix engine. The db2pd tool is very similar to that, and there are a lot of cool things you can do with it, some of which are externally documented and some of which are undocumented ( the reason being that, if misused, they can be catastrophic for the DB2 engine ). Since my blog is more "performance" biased, I'll discuss a few cool things you can do with db2pd to figure out a performance bottleneck. Let me say one thing at the outset: if you are looking for some kind of nicely formatted report, then this might not be the tool of choice for you. It provides a quick insight into the shared memory of the DB2 engine, the outputs can be a bit cryptic, and it takes a bit of time to get used to them. But I can tell you, once you get used to it, there's no peer to this tool .. :-). There is one more great thing about this tool ( perhaps the greatest ): when your engine is hung and no connections are going through, this is the ONLY tool which can still work and get you the state of the engine at the time of the hang. Now that the preamble is out of the way .. let me get down to the nitty-gritty.
The following scenarios are based on my experience working in support. These are pretty much the most common "perf" scenarios customers typically face, and I'll show you how you can get to the root cause just by using the db2pd tool.
1) High CPU situation -- A typical scenario is CPU spikes or high CPU after migrating to a new release, upgrading hardware, etc.
My suggestion would be to get a vmstat profile first to see which kind of CPU ticks is being accumulated more, i.e. %usr or %sys. If you are used to vmstat output and would like to get a similar output on Windows systems running DB2, you can use the db2pd -vmstat option.
a) If %sys CPU is high
This means that, for whatever reason, more system calls are being made and the system is not doing useful work ( i.e. executing user/application code ) but is spending its time running system calls. Typically, I deem it a problem when %sys is 1.5 - 2x or more higher than %usr. So the trick is to find out which function calls are consuming those ticks. If DB2 is driving it, it can be one of 2 things:
i) I/O calls or Network calls OR
ii) Latch calls
Since the I/O and network system calls happen outside the "application", use OS tools to profile them: curt on AIX; oprofile or perf on Linux ( good examples in: http://www.brendangregg.com/perf.html and https://perf.wiki.kernel.org/index.php/Tutorial ); or, if you are an advanced user and want to be adventurous, try systemtap ( it is like dTrace or probevue ) and write a program to profile the system calls. I did not recommend strace on Linux because tracing a multi-threaded process like DB2's db2sysc can seriously impact its performance; the other tools are much lower-impact choices.
For latch calls, you can use the command:
db2pd -latches -rep 1 50 -file latches.log > /dev/null 2>&1 -- this runs the command 50 times, 1 second apart, writing the output to latches.log. A typical output will look like:
Database Member 0 -- Active -- Up 3 days 17:08:40 -- Date 2016-07-18-15.12.55.033746
Latches:
Address Holder Waiter Filename LOC LatchType HoldCount
0x0780000001340478 1029 0 ../include/sqle_workload_disp.h 1391 SQLO_LT_sqeWLDispatcher__m_tunerLatch 1
When there is a problem, i.e. latch calls are causing the high %sys, you should see values in the "Waiter" column. The numbers in the "Holder" and "Waiter" columns are EDU ids, which you can correlate to application handles using:
a) db2pd -db <dbname> -agents ( find the EDU id and read off the corresponding AppHandl ) OR
b) db2pd -agents|awk -v "edu=13880" '/^0x/ { if( $4 == edu ) print "Apphandle: "$2; }'
To ease your work, there are 2 scripts attached:
1) latchToQueryIdentifier.pl -- finds out which query is driving the latch contention.
2) parseDb2pdLatch.pl -- parses the db2pd -latches output to summarize the latch contention.
Please note, these scripts are provided as-is without any support.
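If you do not want to use the scripts, a quick one-liner along these lines can summarize which latch types have waiters ( a rough sketch, assuming the column layout shown in the sample above, i.e. Waiter in column 3 and LatchType in column 6 ):
# count latch waiters per latch type across all samples in latches.log
awk '/^0x/ { if( $3 != 0 ) waiters[$6]++ } END { for( l in waiters ) print waiters[l], l }' latches.log | sort -rn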
b) If %usr CPU is high
This means that user code is driving the CPU high. Once you have identified the top CPU consuming process ( from top, ps aux or topas ) and narrowed it down to the db2sysc process, you can run the following db2pd commands to figure out which EDU / agent is consuming the most CPU and then find out which query is driving the high CPU usage.
1) To identify the top CPU consuming EDUs ( DB2 v9.7 FP5 and higher ):
db2pd -db
2) Collect the application information to find out the query driving the problem.
db2pd -db
Both commands end with an '&' so that they run in the background at the same time, as you will need to correlate the two outputs to find the query or queries. You may need to run the above multiple times if there are several queries driving the high CPU. The next step, of course, is to run the offending query through Query Tuner if you are using DSM, or to get an explain plan and see what can be tuned to make the high CPU consumption go away. Typically, large hash joins are CPU intensive, but there can be other reasons too.
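One way to set this up looks something like the following ( a sketch only; the 10 second interval, the output file names and the database name SAMPLE are assumptions to be adapted to your system ):
# per-EDU CPU usage deltas, sampled in the background
db2pd -edus interval=10 > edus.out 2>&1 &
# application / statement information collected at the same time
db2pd -db SAMPLE -apinfo > apinfo.out 2>&1 &
wait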
2) High number of concurrent queries slowing down the database performance
Sometimes a query with a bad plan is running at very high concurrency on the database, which makes everything run slow. If the query is run in isolation it finishes very quickly, but when run in large numbers simultaneously it causes problems. You can find the offending query using:
db2pd -db <dbname> -active
A sample output looks like:
Database Member 0 -- Database SAMPLE -- Active -- Up 3 days 18:42:10 -- Date 2016-07-18-16.46.34.408112
Active Statement List:
Address            AppHandl [nod-index] UOW-ID StmtID AnchID StmtUID EffISO EffLockTOut EffDegree EntryTime           StartTime           LastRefTime
0x0700000049604C80 6881     [000-06881] 1      1      377    1       1      -1          0         Mon Jul 18 16:46:28 Mon Jul 18 16:46:28 Mon Jul 18 16:46:34
Key things to look at are:
i) Apphandle
ii) Anch ID
iii) StmtUID
iv) Start Time
v) Last Ref Time
Note: To identify the query running with the highest concurrency, you can run db2pd -db <dbname> -active repeatedly and count how often each AnchID / StmtUID pair shows up.
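A rough way to count that from the -active output ( a sketch; it assumes AnchID and StmtUID are the 6th and 7th columns, as in the sample above ):
db2pd -db SAMPLE -active | awk '/^0x/ { print $6, $7 }' | sort | uniq -c | sort -rn | head -5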
The Anch ID and StmtUID combination uniquely identifies the query in the cache. You can then run the following to get the statement.
db2pd -db sample -dyn|awk -v "anchid=377" -v "stmtid=1" '/Dynamic SQL Statements:/,/Dynamic SQL Environments:/ { if( match($0,"^0x") && ( $2 == anchid ) && ( $3 == stmtid ) ) print $0; }'
You can also use the db2pd -active output to find long running queries ( the watchThreshold script I provided in one of my earlier blogs uses the same mechanism ) by diffing the "StartTime" of a statement against the time shown in the header line of the db2pd output.
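A minimal sketch of that idea, which just prints the snapshot time from the header next to each active statement's handle and start time so the long runners can be eyeballed ( the field positions are assumed from the sample output above and may shift between releases ):
db2pd -db SAMPLE -active | awk '/^Database Member/ { print "Now:", $NF } /^0x/ { print "AppHandl="$2, "AnchID="$6, "StmtUID="$7, "StartTime="$15" "$16" "$17" "$18 }'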
3) Find out lock waits in the database
The command to figure out the lock wait situation is:
db2pd -db <dbname> -wlocks
A sample output is:
Database Member 0 -- Database SAMPLE -- Active -- Up 0 days 00:02:28 -- Date 2016-07-18-17.16.41.521224
Locks being waited on :
AppHandl [nod-index] TranHdl Lockname                   Type    Mode Conv Sts CoorEDU AppName AuthID  AppID
30       [000-00030] 420     0002000F000000000000000452 RowLock ..X       G   3086    db2bp   RSARKAR *LOCAL.rsarkar.160718211437
32       [000-00032] 397     0002000F000000000000000452 RowLock ..U       W   15166   db2bp   RSARKAR *LOCAL.rsarkar.160718211532
Apphandle 32 is waiting on Apphandle 30: it is trying to get a row lock in "U" mode on a row on which Apphandle 30 is holding an "X" lock. To also see which table the lock belongs to, use the detail option:
$ db2pd -db sample -wlocks detail
Database Member 0 -- Database SAMPLE -- Active -- Up 0 days 00:03:46 -- Date 2016-07-18-17.17.59.830245
Locks being waited on :
AppHandl [nod-index] TranHdl Lockname                   Type    Mode Conv Sts CoorEDU AppName AuthID  AppID                       TableNm SchemaNm AppNode
30       [000-00030] 420     0002000F000000000000000452 RowLock ..X       G   3086    db2bp   RSARKAR *LOCAL.rsarkar.160718211437 STAFF   RSARKAR  db2debug
32       [000-00032] 397     0002000F000000000000000452 RowLock ..U       W   15166   db2bp   RSARKAR *LOCAL.rsarkar.160718211532 STAFF   RSARKAR  db2debug
So, it is on table STAFF. To find out the queries run by each application handle, I can use the following commands:
db2pd -db sample -apinfo 30
db2pd -db sample -apinfo 32
db2pd -db sample -apinfo 30 results in:
Application :
Address : 0x0780000001DE0080
AppHandl [nod-index] : 30 [000-00030]
TranHdl : 420
Application PID : 56755468
Application Node Name : db2debug
IP Address: n/a
Connection Start Time : (1468876459)Mon Jul 18 17:14:19 2016
Client User ID : rsarkar
System Auth ID : RSARKAR
Coordinator EDU ID : 3086
Coordinator Member : 0
Registered Agents : 1
Active Agents : 1
Locks timeout value : NotSet
Locks Escalation : No
Workload ID : 1
Workload Occurrence ID : 1
Trusted Context : n/a
Connection Trust Type : non trusted
Role Inherited : n/a
Application Status : UOW-Waiting
Application Name : db2bp
Application ID : *LOCAL.rsarkar.160718211437
ClientUserID : n/a
ClientWrkstnName : n/a
ClientApplName : n/a
ClientAccntng : n/a
CollectActData: N
CollectActPartition: C
SectionActuals:
UOW start time : 07/18/2016 17:15:15.911638
UOW stop time :
Application handle 30 is in UOW-Waiting status; however, as can be seen, it has an open UOW which started at 17:15:15.
Application :
Address : 0x0780000002FA0080
AppHandl [nod-index] : 32 [000-00032]
TranHdl : 397
Application PID : 23074544
Application Node Name : db2debug
IP Address: n/a
Connection Start Time : (1468876532)Mon Jul 18 17:15:32 2016
Client User ID : rsarkar
System Auth ID : RSARKAR
Coordinator EDU ID : 15166
Coordinator Member : 0
Registered Agents : 1
Active Agents : 1
Locks timeout value : NotSet
Locks Escalation : No
Workload ID : 1
Workload Occurrence ID : 2
Trusted Context : n/a
Connection Trust Type : non trusted
Role Inherited : n/a
Application Status : Lock-wait
Application Name : db2bp
Application ID : *LOCAL.rsarkar.160718211532
ClientUserID : n/a
ClientWrkstnName : n/a
ClientApplName : n/a
ClientAccntng : n/a
CollectActData: N
CollectActPartition: C
SectionActuals:
UOW start time : 07/18/2016 17:16:31.350722
UOW stop time :
List of current activities :
Activity ID : 1
UOW-ID : 1
Package schema : NULLID
Package name : SQLC2K26
Package Version :
Consistency Token : AAAAAfAd
Section number : 203
Statement number : 1
Isolation : CS
Effective degree : 0
Actual degree : 1
Source ID : 0
Cursor ID : 0
Nesting level : 0
Invocation ID : 0
Package cache ID : 0x0000000102910000
Anchor ID : 657
Statement UID : 1
SQL Type : Dynamic
Statement Type : DML, Insert/Update/Delete
Statement : update staff set years = 10 where id = '200'
Entry time : 07/18/2016 17:16:31.351389
Local start time : 07/18/2016 17:16:31.351391
Last reference time : 07/18/2016 17:16:31.351389
Application handle 32 was running an update statement on STAFF for a specific ID and is in Lock-wait status. So an educated guess can be made here that application handle 30 must have run some DML statement which holds an X lock on the row but has not yet committed, which is why the lock wait is happening. If that user has walked away from the workstation, then the best alternative is to do a "force application" on application handle 30 to roll back its transaction and let application handle 32 proceed.
Note: if you want to decode the lock name to find out which table it belongs to, you can run:
db2 "SELECT SUBSTR(NAME,1,20) AS NAME,SUBSTR(VALUE,1,50) AS VALUE FROM TABLE( MON_FORMAT_LOCK_NAME('0002000F000000000000000452')) as LOCK" OR
The tablespace id and table id are embedded in the lock name: the first 2 bytes store the tablespace id and the next 2 bytes store the table id. You can then issue the command:
db2pd -db <dbname> -tcbstat <tbspaceid> <tableid>
$ db2pd -db sample -tcbstat 2 15
Database Member 0 -- Database SAMPLE -- Active -- Up 0 days 00:14:51 -- Date 2016-07-18-17.29.04.517807
TCB Table Information:
TCB Table Stats:
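If you prefer to decode the lock name by hand, the tablespace id and table id are just the first two 2-byte ( 4 hex digit ) fields. A small bash sketch, using the lock name from the example above:
lockname=0002000F000000000000000452
printf "tablespace id: %d, table id: %d\n" 0x${lockname:0:4} 0x${lockname:4:4}      # prints 2 and 15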
4) Miscellaneous things you can do with db2pd
i) Find out how much log space is being used by each transaction -- db2pd -db <dbname> -transactions
ii) Find out the transaction throughput -- script over repeated db2pd outputs and diff the counters to derive per-second values.
iii) Find out the query throughput -- same approach, diff repeated outputs to get per-second values.
**Note: With the availability of db2 "call monreport.dbsummary(60)", the throughput values ( query and transactions ) are much easier to find than by scripting over db2pd outputs.
iv) db2pd -everything -- collects ALL information about the instance and is useful for support when they are trying to figure out what was going on in the instance during a problem. The db2fodc -perf/-hang tools collect this automatically.
v)
db2pd -db
vi) db2pd -db
vii) db2pd -stack all -- takes stacks of ALL the agents in the engine. If you want the stack of a specific EDU, just append the EDU id to the command. By default it dumps the stack files ( *stack.txt ) in DIAGPATH, but if you want to change that, you can use:
db2pd -stack all dumpdir=<directory>
Stacks are VERY VERY important in hang-like scenarios, as they tell the support engineers in which code path an agent is stuck. They are useful in slow perf issues too, but they need to be used judiciously as taking them does have an impact.
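When I do need stacks for a slow-performance investigation, a simple collection loop like this is usually enough ( a sketch; the dump directory is just an example ):
mkdir -p /tmp/db2stacks
for i in 1 2 3
do
    db2pd -stack all dumpdir=/tmp/db2stacks
    sleep 120
done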
There are many more commands ( db2pd -h shows you a help screen ) which the support engineers use to debug any kind of issue in the engine. It would need a whole book to cover all of them in detail .. :-)
In conclusion, I would say this tool is not for someone who wants GUI based outputs or wants to resolve issues with a few clicks of a mouse ( there are tools such as DSM and Optim which do that very well ), but it is for someone who is more of an awk/sed/grep person and is not afraid to write quick one-line scripts to parse the data. I'm personally biased towards this tool as I am an ex-Informixer :-) and am used to typical character based output and doing greps/awk/sed etc.
I hope this not-so-brief primer on db2pd was useful for you. Please feel free to comment.
Any well written code should have some kind of tracing facility to allow an "inside" view of its workings, so that when problems happen the issue can be looked into while it is happening. You must have come across OS tracing tools such as dTrace ( Solaris ), systemTap ( Linux ) and probeVue ( AIX ), or tcpdump ( Linux ) and iptrace ( AIX ) for network tracing. On similar lines, DB2's tracing facility provides a view of the inner workings of the engine, and it can be used to find bottlenecks within the engine.
However, it needs to be handled with care ( as is the case with any tracing tool ): too much of it can cause huge performance degradation, and a wrong option can make a system so slow that it looks like a hang. I am going to discuss the optimal usage of the DB2 tracing facility for different kinds of scenarios, and also some of the options you should or should not use in different situations. The tool which controls DB2's tracing facility is known as db2trc.
Let me start with the different options available with the db2trc facility ( DB2 v10.1 FP5+ or 10.5FP3+ ).
STAGE #1 - COLLECTION
[facility]
STAGE #2 - PARSING
For more information add the "-u" option to any of the above commands
You can get the above commands by just typing in "db2trc" from the command line.
Typically, when you turn on db2trc, the trace records are put in a cyclic buffer in memory. The size of the buffer can be found using the following command ( if the instance is up ):
$> db2pd -memsets | egrep "Address|Trace"
You can change the buffer size by setting the registry variable BEFORE bringing up the instance:
db2set DB2TRC_DEF_BUFFSIZE=512m ( you can go up to a 1GB maximum; it accepts the size in bytes or with suffixes such as 'm' or 'g' ).
You can also write the trace records to a file with the -f option. However, this has to be used with caution and is recommended only for error / crash situations which happen very quickly, e.g. right after the engine is started. The default recommendation is NEVER to use it unless specifically asked to by a support engineer.
Ok, now that the preamble is out of the way, let me get down to the nitty-gritty and show how db2trc can be used for different types of situations.
A typical situation is when everything is running slow. This happens very commonly right after some upgrade, but it may happen otherwise too.
Suggested command#1 to use:
db2trc on -l ( small L ) 512m -t
sleep 30
db2trc dmp trc.dmp.1
sleep 30
db2trc dmp trc.dmp.2
sleep 30
db2trc dmp trc.dmp.3
db2trc off
After this, use these commands to format the trace dump files.
db2trc perfrep trc.dmp trc.perfrep -sort timeelapsed -order desc -g
This command will generate a "profile" trace for each of the threads working within the engine. The output will be something like:
Node : 0 , PID : 7771 , TID : 46912849832256
nCalls TotalElapsed AvgElapsed TotalSpent AvgSpent FunctionName
1 0.000512000 0.000512000 0.000003000 0.000003000 sqlriProcessSplitRowData
1 0.000509000 0.000509000 0.000001000 0.000001000 sqljsDrdaAsCBPutLongDataQueryCheck
1 0.000503000 0.000503000 0.000004000 0.000004000 sqlriWriteCsoIntoBuffer
1 0.000499000 0.000499000 0.000008000 0.000008000 sqlrimso
1 0.000491000 0.000491000 0.000001000 0.000001000 sqlriglf
1 0.000490000 0.000490000 0.000003000 0.000003000 sqldFetchLFD
1 0.000485000 0.000485000 0.000002000 0.000002000 sqldx_lob_read
1 0.000480000 0.000480000 0.000005000 0.000005000 sqldxReadLob
1 0.000474000 0.000474000 0.000003000 0.000003000 sqldxReadLobData
1 0.000471000 0.000471000 0.000001000 0.000001000 sqldx_diskread
1 0.000470000 0.000470000 0.000002000 0.000002000 sqlbDirectRead
1 0.000468000 0.000468000 0.000009000 0.000009000 sqlbDMSDirectRead
1 0.000328000 0.000328000 0.000028000 0.000028000 sqlbDMSMapAndRead
1 0.000295000 0.000295000 0.000002000 0.000002000 sqlbReadBlocks
1 0.000293000 0.000293000 0.000293000 0.000293000 sqloReadBlocks
2 0.000179000 0.000089500 0.000006000 0.000003000 sqlbDMScheckObjAlloc
2 0.000158000 0.000079000 0.000013000 0.000006500 sqlbIsExtentAllocated
The way to interpret the above is to look at the TotalElapsed and TotalSpent column values. TotalElapsed is the elapsed time between entering and leaving the function; TotalSpent is the time spent inside the function itself. For example, sqlriProcessSplitRowData shows a TotalElapsed of 0.000512 seconds and a TotalSpent of 0.000003 seconds, which means the function was entered at time t1 and exited 0.000512 seconds later, but it actually spent only 0.000003 seconds inside the function itself; the rest of the time was spent in the functions it called. Since the command ordered the output in descending order, keep looking for a function where TotalElapsed =~ TotalSpent. In the above example it is the sqloReadBlocks function, which, as its name suggests, is an I/O function. So, of the 0.000512 seconds, 0.000293 seconds was spent in the I/O call. This is how you can figure out a bottleneck from this output. If it's an obvious problem ( refer to my SSL Perf blog ) it will surely bubble up here.
Suggested command#2
db2trc on -perfcount -t
sleep 180
db2trc dmp trc.perfcount
db2trc off
db2trc perffmt trc.perfcount trc.perffmt
This command counts how many times each function is called and how much time is spent in each function. It doesn't have the per-thread granularity which the perfrep report provides, but it creates a nice profile of the whole engine. Typically, I use the following command to parse the trc.perffmt output:
sort -t"(" -k2nr trc.perffmt
The output is like:
1204129 (1294 sec, 118794000 nanosec) csmFetch
If some call is a bottleneck, you will see it right around the top. Do not be fooled, however, by the typical waiting functions, such as the functions waiting for the client to send something ( sqljrRecv being one of them ). They will be at the top on a large instance with lots of threads if most of the threads are idle.
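To keep such idle-wait functions from cluttering the top of the list, I sometimes filter them out ( a sketch; extend the egrep pattern with whatever wait functions dominate on your system ):
sort -t"(" -k2nr trc.perffmt | egrep -v "sqljrRecv|sqloWaitEDUWaitPost|sqloCSemP" | head -20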
What happens if you have a DPF system with the same issue ?
The trace memory segment is SHARED by ALL the local members on the server, so you don't have to turn it on for each member separately. Usually, if you have a large DPF installation ( say > 50 nodes ), my rule of thumb is to take trace data on the co-ordinator node + 2 data nodes. If you do need to turn on tracing on ALL the members, you can run the following command from any ONE node:
db2trc on -l 512m -t -member all
If you want to dump the trace, you can also issue the command from any one node; the restriction is that the dump directory MUST be a directory which can be written to by ALL the nodes.
db2trc dmp trc.dmp -sdir <shared directory>
It will automatically suffix the dump file name with the hostname of the server from which it is being dumped.
If you don't have the -sdir option available, you can do something like:
tstamp=`date "+%Y%m%d_%H%M%S"`
rah "||db2trc dmp trc.dmp.$tstamp.\`hostname -s\`"
The rest of the suggestions on perfcount and perfrep still apply to DPF too. My suggestion would be to use both methods ( I personally favor the perfrep option, though it's available only in the later releases ) to find the obvious bottlenecks in an overall slowdown situation.
When a particular query ( rather than the whole system ) is running slow, you can trace just that query. The identification of the application handle which is running the query is critical here, as you don't want to impact any other queries running on the database. You can identify the application handle via:
a) If you have just connected and you are going to run the query within that same connection:
db2 "select mon_get_application_handle() from sysibm.dual" OR
db2 "select application_id() from sysibm.dual"
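If the query is coming in from a different connection ( e.g. an application server ), one way to look up its application handle is through the MON table functions ( a sketch using MON_GET_CONNECTION; filter on whatever client attribute identifies your application ):
db2 "select application_handle, application_name, client_applname from table(mon_get_connection(cast(null as bigint),-1)) as c"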
Suggested command to use ( one or the other ):
db2trc on -l 512m -t -apphdl <application handle> OR
db2trc on -l 512m -t -appid <application id>
sleep 60
db2trc dmp trc.dmp.1
sleep 60
db2trc dmp trc.dmp.2
db2trc off
db2trc perfrep trc.dmp.1 trc.perfrep.1
You can always verify the db2trc command using:
db2trc info
The output will be like:
Marker : @TRACE@
Buffer size : 33554432 bytes (32 MB)
Using the same methodology I described above for finding a bottleneck in a perfrep report, you can do a similar analysis for just this application handle. You can extend this to:
a) trace a slow load process.
i) Identify the application handle running the load job using the command db2pd -db
ii) Start an apphandle trace ( db2trc on -l 512m -t -apphdl <application handle> )
b) trace a slow backup / restore
i) Identify the application handle doing the backup / restore using the command db2pd -edus | grep db2med
The apphandle is embedded in eduname.
ii) Start an apphandle trace ( db2trc on -l 512m -t -apphdl <application handle> )
In cases where using the above commands causes a performance degradation, you will have to use masks to reduce the trace footprint. Unfortunately, these have to be supplied by support, as coming up with a good mask requires some source code knowledge and access. I am sharing a few common masks which we use to debug problems; you can use them if the problem description matches your scenario.
1) I/O performance
db2trc on -l 512m -t -Madd sqloReadBlocks -Madd sqloWriteBlocks
sleep 60
db2trc dmp trc.dmp
db2trc off
db2trc flw trc.dmp trc.flw -t
Run the attached script to parse the output. The syntax is:
findTimesByOperationsFromTrace.pl -operation "Read I/O" -file trc.flw -func "sqloReadBlocks enter" -func "sqloReadBlocks exit"
The output will be like:
**Disclaimer Note: This script is provided as-is without any support.
2) Debug slow stored procedures.
db2trc on -l 512m -t -Mfile trcMaskForStoredProcedures
sleep 60
db2trc dmp trc.dmp
db2trc off
db2trc perfrep trc.dmp trc.perfrep
add sqljcReceive.entry
This mask file came out of a debugging effort for a very tricky stored procedure problem.
As can be seen, setting masks requires knowing the problem and where in the source code the potential problem could be, which is why most of the time it is advisable to contact support to help you set up a good trace mask that has the least impact on your system.
Other useful db2trc options
1) If you are dealing with millisecond-level issues, e.g. trying to match a WAS log entry to a DB2 trace entry to figure out which connection was servicing the WAS connection, you can use the -wc ( wallclock ) option when formatting the db2trc output to get wall clock times. It then becomes much easier to correlate the entries in the DB2 trace to the application logs.
db2trc flw trc.dmp trc.flw -t -wc
2) If you want to trace to a file ( -f option ) but would like to limit the size of the file, you can use:
db2trc on -l ( small L ) 2g -f trc.dmp OR
db2trc on -i 2g -f trc.dmp
The first command will limit the file size to 2GB: once it hits the limit it will wrap around to the front and keep writing. The second command will stop writing to the file after it hits 2GB.
3) If you know the return code from the db2diag.log for a specific issue, you can stop the trace when that error is hit by using the command:
db2trc on -l 512m -t -rc <return code>
4) If you want to investigate BLU performance issues, you can turn on BLU performance trace
db2trc on -l 512m -t -Madd CDE_PERF_TRACE
Useful scripts
1) If you have the whole trace but want to focus on the times taken by a particular operation, e.g. insert row or I/O, you can use the script findTimesByOperationsFromTrace.pl
Syntax to use:
findTimesByOperationsFromTrace.pl -operation "Read I/O" -file trc.flw -func "sqloReadBlocks enter" -func "sqloReadBlocks exit" or
findTimesByOperationsFromTrace.pl -operation "RowInsert" -file trc.flw -func "sqldRowInsert enter" -func "sqldRowInsert exit" etc.
2) If the version you are using does not have the -perfrep option, then you can use the analyzetrace script on a db2trc flw or fmt output to generate a profile output. The output looks like
Pid Lvl FuncName TTime(ms) HTime(ms) LTime(ms) AvgTime(ms) NCalls ERecnumHTime
Pid - Process id ( thread id ) of the traced process.
Lvl - The level of indentation of the function call, i.e. if it is 0 it is the outermost; if it is 1 it was probably called from the function at level 0.
TTime - Total time elapsed in the function
HTime - Highest time amongst ALL the calls to the function
LTime - Least time amongst ALL the calls to the function
AvgTime - Average time
NCalls - Number of calls to the function
ERecnumHTime - Entry record number of the highest-time call to this function. If you search for this record number in the flow output you will find that call.
Things you should not do with db2trc, especially on a production system
1) Never use the -f option to start tracing to a file. It can pretty much bring the whole system to a grinding halt if the system is large or busy. You should use it ONLY if asked specifically by support; there are exception situations, like a crash right after starting up the engine, where without the -f option support cannot really move forward. Never ever use this option for performance data collection: it has its own overhead and it skews the times so much that the whole trace becomes useless for debugging the performance issue.
2) On DPF installations, do not attempt to start db2trc on ALL members. It can cause tremendous slowness. Usually, if you do have to use db2trc on DPF systems, follow the rule of thumb of one co-ordinator node and 2 data nodes. There will always be exceptions to the rule, but use it on ALL members only with guidance from support.
Hope you find this entry useful and good enough for you to start using db2trc .. :-)
Please leave a comment if there are any questions or feedback
I recently had to work on a peculiar performance issue where a customer was testing a workload with and without SSL encryption and was seeing huge differences: for example, a query which runs in 1 second would take 20 seconds to complete when SSL was turned on. Since the solution was fairly simple, I thought I might as well put it out in a blog for anyone else who hits a similar issue.
I collected the following data for both types of workload run:
1) db2trc on -t -perfcount
2) db2trc on -i 512m -t ( for 30 seconds ) -- 3 times, 3 mins apart.
3) db2pd -stack all -- 3 times, 3 mins apart
When I looked into the perfcount output ( db2trc perffmt trc.perf trc.perffmt; cat trc.perffmt | sort -t"(" -k2nr | more ), I found that two function calls were standing out: sqlccSSLSocketRead and sqlccSSLSocketWrite.
When I looked at the other DB2 functions which do actual work, they were in the sub-millisecond range, which indicated that whatever slowness was happening was all within these 2 functions. The full trace gave a similar picture ( I ran db2trc perfrep trc.dmp trc.perf -g ). When I looked at the flow output ( db2trc flw trc.dmp trc.flw -t ) I found that there was a 90ms gap in one of the calls, and it had just called a callback function to do the ssl_read. A snippet of the db2trc output:
511916 0.293736869 | | | sqlcctcprecv data [probe 1]
The stacks immediately gave the clue to the root cause. A lot of agents were in a stack whose top frame was:
_global_lock_common
i.e. underneath the GSKit SSL read code the agents were serializing on the AIX malloc lock. The fix was to switch the instance to AIX's multi-heap bucket malloc:
export MALLOCOPTIONS=buckets,multiheap:4
After this setting, the performance was back to normal both with and without SSL.
Hope this helps anyone else who encounters this issue.
Please leave a comment if there are any questions or feedback
Recently, while working on a couple of issues related to upgrades to DB2 v10.5 FP7, I found that the root cause was the same in both, so I thought I might as well put this in a blog so that people are aware of it when they search for it on their favorite search engine.
Problem statement
-----------------------------
Typically, you will see an overall slowdown of the whole database after upgrading. The typical application profile will be:
1) Fairly high query throughput
2) High number of explicit or implicit rollbacks happening.
How can you figure out if you are hitting the problem
------------------------------------------------------------------------
1) Take stacks ( db2pd -stack all )
2) Go to the DIAGPATH and do a quick grep
ls *stack.txt | wc -l
grep -ic sqlrrbck_dps *stack.txt
grep -ic pdLogEvRecEx *stack.txt
3) If you see a large percentage of stacks showing these functions, you know you are hitting the problem.
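A quick way to turn those counts into a percentage ( a sketch; this variant counts the files that contain the function rather than the number of matches ):
total=$(ls *stack.txt | wc -l)
hits=$(grep -l sqlrrbck_dps *stack.txt | wc -l)
echo "$hits out of $total stacks show sqlrrbck_dps"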
Another way to figure it out is:
1) Take a trace
2) db2trc on -l 512m -t ; sleep 30; db2trc dmp trc.dmp; db2trc off
3) db2trc perfrep trc.dmp trc.perf -g -sort timeelapsed -order desc
4) The output should look like:
nCalls TotalElapsed AvgElapsed TotalSpent AvgSpent FunctionName
**As a side note: Elapsed time -- Total time elapsed in this function, Spent time -- Actual time spent in this function
What to do now that you have confirmed you are hitting the problem
----------------------------------------------------------------------------------------------
Issue the following commands:
db2pd -db sample -dmpevrec comp=SQLRA name=sqlraMED mask=0
However, please keep in mind that this change will NOT survive a recycle of the engine or a deactivation of the database. You will have to run it again after a recycle or after the database is activated again.
APAR IT14357 has been opened to track this issue.
Please leave a comment if there are any questions or feedback
Starting with DB2 v9.7, if you want to understand how a backup or restore is working ( especially if it's slow ), there is a feature which can be enabled by setting the registry variable DB2_BAR_STATS=YES ( recycle required ); it will print a statistics table in the db2diag.log after a backup completes.
Starting with v10.1 it is enabled by default, and after every backup / restore completes it will print the results in the db2diag.log.
The decode of each of the columns is:
BM#
- the number assigned to an individual Buffer Manipulator. BMs READ data from the database's tablespaces during a backup and place it into buffers, or WRITE data back out to the database from the buffers during a restore.
MC#
- the number assigned to an individual Media Controller. MC's WRITE buffers out to the target location on backup and READ buffer on restore.
Total
- The total amount of time spent by the process in seconds.
I/O
- The amount of time spent either reading or writing data. For the BM's this represents time reading data from tablespace, and filling the buffer. For MC it's time spent reading from buffer and sending it to the target destination.
MsgQ
- This is the amount of time we spend waiting to get a buffer. For BM's it's how long is spent waiting to get an empty buffer for filling. For MC's it's time spent waiting to get a full buffer in order to write out.
Wait Q
- Amount of time spent waiting on directives from the agent overseeing the whole backup.
Buffers
- The number of Buffers Processed by a particular BM or MC. A BM filled X number of buffers. An MC wrote out X number of buffers.
Mbytes
- The amount of data handled by a particular BM or MC in Mbytes. NOTE that the sum of all MC's written bytes is the amount of total data sent.
So, from this data you can easily tell where the backup or restore is spending its time, i.e. in the reading portion or the writing portion, and then focus on that part to find out what is wrong.
For example, look at the I/O times of the BMs and MCs to see which one is spending more time in I/O, and also look at the MsgQ values to find out if there is an inordinate amount of time spent waiting on buffers, which means the reader or writer is slowing down and the buffers are not being freed quickly enough.
This is a pretty useful statistic for understanding the inner workings of backup / restore, and it can be used to work out where the bottleneck is.
Here is a link to a blog written by Dale McInnis having a much more detailed explanation of the above output.
Hope you find this blog useful
I work on various performance issues ( single node, DPF, pureScale ) and I have found that most DBAs are still using the older snapshot way of collecting information. I don't have anything against snapshots, but after using the new MON infrastructure to tackle various perf issues and also to monitor normal activity, I thought maybe I should use my blog to encourage the use of this infrastructure. I don't think I can cover everything in one entry, so I will probably spread it across a few entries and dwell on how to use it for specific issues.
In this entry I will try and lay out what is out there and what I use commonly.
One thing I should point out upfront is that during hang situations ( unable to connect, etc. ) this is not a very useful tool and should not be used; db2pd is the tool of choice in hang conditions.
So, starting from some basics.
a) The MON infrastructure has been available since DB2 v9.7 GA and has been refined and enhanced ( and is continuously being enhanced ) in later versions.
b) It provides an easy SQL interface to DB2's real-time, in-memory monitoring data.
c) It has time-spent metrics and object metrics which can quickly narrow down the bottleneck for the user.
d) It is controlled by 3 DB config params ( MON_REQ_METRICS, MON_ACT_METRICS, MON_OBJ_METRICS ). Setting them to BASE is usually good enough.
e) It uses much less latching ( internal locking ) to get the information, thus making it easy to use very frequently on really busy systems without any side effects.
Personally, I've used it very effectively on extremely busy systems with literally no side effects and was able to resolve a lot of issues with it. I would like to share my experiences here so that others can benefit as well from this great infrastructure which is built into the engine itself.
Now, that the intro is out of the way, let's get our hands dirty .. :-)
Let's start with a hypothetical scenario, say:
You have upgraded from v9.5 to v9.7 or v10.1, all your benchmark queries are running slower and overall performance is struggling. You have checked the usual suspects, i.e. vmstat, iostat, db snapshots, but you are not able to really pinpoint an obvious issue.
In this case, how can the new MON interface help you ??
It can actually help you a lot !!
Let's start with a monreport query .. you must be thinking .. huh! what's that? This is something the DB2 developers created to help customers by providing a quick overview of the whole database engine using some pre-packaged MON queries. There is very good information about these in the Information Center at this link.
There are different kinds of reports in this module, but for a generic perf issue where there is overall slowness, I like to start with the dbsummary monreport. The syntax to run it is:
db2 connect to <dbname>
db2 "call monreport.dbsummary(60)" > dbsummary.txt
db2 terminate
The parameter passed tells dbsummary to run for 60 seconds and provide an overall view of the internals of the DB2 engine as to what happened during those 60 seconds.
This report gives you everything i.e.
a) Transaction throughput
b) Wait times within the engine << This is the part I am most interested in for perf issues
c) Section processing times -- in other words, apart from the waits, how much time does it take to execute SQLs in the engine
d) Bufferpool hit ratio
e) Package cache hit ratios
f) SQL throughput
Small snippet of the report:
So, in the above output it is clearly showing that the log flush ( i.e. commit ) is slow, which is driving most of the wait time within the engine. Maybe moving to a faster log disk can help here, or maybe grouping the commits.
You can also calculate the package cache hit ratio from this data. If you see it is very low, then increasing the package cache size might help.
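For example, a rough package cache hit ratio can also be pulled straight out of the MON data ( a sketch using MON_GET_WORKLOAD; request metrics must be enabled ):
db2 "select sum(pkg_cache_lookups) as lookups, sum(pkg_cache_inserts) as inserts, dec(1.0 - (sum(pkg_cache_inserts) * 1.0 / nullif(sum(pkg_cache_lookups),0)), 5, 4) as hit_ratio from table(mon_get_workload('',-2)) as w"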
As can be seen, the dbsummary report is a great starting point for a generic ( aka unknown ) performance problem, from which you can narrow down accordingly. You can use the other monreport procedures to help you do that as well; I use currentsql() and pkgcache() liberally to find any SQL statements which require attention.
I will continue further on specific situation .. keep an eye on this blog for further updates.
In my daily work, I get escalations from customers whose backups have suddenly started to run slow. Recently I was working on such a case where the customer was an SAP shop: they had been moving some tables out of a tablespace, and all of a sudden their backups started to take about 3x their normal time.
In debugging a backup issue, the best data to collect is generally db2trc, as that captures everything for all the threads involved in the backup and tells us exactly what is happening.
Generally, for any backup the following threads are created:
db2agent -- the main co-ordinator agent
db2bm -- Buffer Manipulator threads, which read tablespace data into buffers and, once a buffer is full, hand it over to a db2med thread
db2med -- Media threads, which take a full buffer and write it to wherever the backup destination is.
So, when a backup is slow, the time can go in two places:
1) db2bm taking time to read the data into the buffer, OR
2) db2med taking time to write a full buffer to the destination, which makes the db2bm wait for an empty buffer.
The read throughput and write throughput can be easily diagnosed using iostat tool.
An easy way to identify all the threads associated with the backup is to do the following:
db2pd -edus | grep db2med
You will get the name of the thread as:
db2med.123456.1
123456 is the edu id of the main co-ordinator agent.
You can now get the application handle by doing:
db2pd -age | grep 123456
The reason I wanted to get the application handle was to pass it to the db2trc command:
db2trc on -l 512m -apphdl <apphandle>
and get a trace of all the backup threads involved.
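Putting the two steps together ( a sketch; it assumes the coordinator EDU id is the second dot-separated field of the db2med EDU name, as in the example above, and reuses the -agents parsing shown earlier ):
cedu=$(db2pd -edus | grep db2med | head -1 | sed 's/.*db2med\.\([0-9]*\)\..*/\1/')
db2pd -agents | awk -v edu="$cedu" '/^0x/ { if( $4 == edu ) print "Backup apphandle: "$2 }'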
In this particular customer case, I collected the following:
1) db2trc on -l 512m -apphdl <apphandle>
2) db2pd -stack app=<apphandle>
Once I got the trace, I found that in fact both db2bm and db2med were idle, which was a bit puzzling, as at least one of them has to be doing something to move the backup forward. So I went back to the customer and asked for a longer trace ( the previous one was only 30 secs; I asked them to take it for around 2 mins ). The second trace was able to capture db2bm intermittently getting some data from a tablespace, and I found that the tablespace was a fairly large one. Also, from the stacks I observed that there was only 1 db2bm thread active, and most of the time it was waiting on the latch SQLO_LT_SQLB_POOL_CB__readLatch, as other agents were holding it to read in some data from time to time. The reason was that this was an ONLINE backup, the tablespace contained close to 12k tables, and other agents were reading data from tables in it during the online backup.
The way ONLINE backup works is that it tries to read contiguous chunks of data from the tablespace in one go ( while holding the latch in X mode ) to fill up the buffer, but obviously that was not happening here, as the backup was found to be waiting most of the time behind the other agents, which were holding this latch in S mode.
This info gave me a clue and pointed me to the possibility of a lot of fragmentation in the tablespace: there were not enough contiguous pages to fill up the buffer in one go, so the backup had to grab the X latch on the tablespace more frequently and do more read operations than if the tablespace were relatively fragmentation free.
I looked up the db2pd -tablespaces information and parsed it ( script attached ) to find that there was indeed huge fragmentation in the tablespace. The parse script showed:
As can be seen, the tablespace had a huge difference between Used Pages and HWM pages. After talking to the customer, it turned out that tables had indeed been moved out of this tablespace, and also that the tablespace was created in v9.1 ( the customer was currently on v9.7 ).
Since the tablespace was created in v9.1, the new v9.7 feature to reduce the HWM dynamically ( ALTER TABLESPACE <tbspname> ... REDUCE ) could not be used on it. After the customer reduced the fragmentation, the backup times went back to normal and everything was running fine.
I am providing two scripts here:
1) To automate data collection for slow backups ( if you open a PMR with this data, that would be awesome !! ).
2) To check the fragmentation of tablespaces ( you can verify the fragmentation using db2dart as well ).
Also, the following query can quickly identify whether a tablespace has RECLAIMABLE STORAGE enabled or not ( by default it is enabled if the tablespace was created in v9.7 ).
db2 "select tbsp_name, RECLAIMABLE_SPACE_ENABLED FROM TABLE(MON_GET_TABLESPACE('',-2)) AS t"
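On v9.7 and later you can also eyeball the fragmentation straight from MON_GET_TABLESPACE ( a sketch; a large gap between the page high water mark and the used pages is the symptom described above ):
db2 "select substr(tbsp_name,1,20) as tbsp_name, tbsp_used_pages, tbsp_page_top, tbsp_page_top - tbsp_used_pages as gap from table(mon_get_tablespace('',-2)) as t order by gap desc"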
NOTE: I should mention here that this problem would not have surfaced if it had been an offline backup. It is specific to online backup only.
Hope you find this post useful in debugging slow backup issues yourself.
Recently I had been working on a lot of ISAS / DPF systems where a very common complaint is the SQL1229 error encountered by the applications. In a few cases it was severe enough that a recycle was necessary.
I wanted to dwell a bit on this, as it can be a mystery why it happens and it can consume a lot of debugging effort by the DBA. I know it is not a real performance issue per se, but I've seen customers being affected by it a lot and not really knowing what to do. I'm hoping that with the information in this article they can take a few steps and at least collect the "right data" for the root cause analysis.
To start with, what does this error really mean ( in the context of DPF )? It essentially means that the engine was trying to do something and was not able to get a reply back in time from the remote side, which it deems a "system error", and thus it asks for the transaction to be rolled back and re-submitted by the application.
There can be many reasons for this error: in some cases DB2 is a victim of underlying issues in the hardware or network, and in some cases it can be a DB2 issue itself.
The most common error seen in the db2diag.log just before a SQL1229 error received by the application is:
followed by
2013-09-12-21.06.56.238933-420 E5032877E381 LEVEL: Error
The point to note is that it is the FCM receive thread ( db2fcmr ) which got this error, and an ETIMEDOUT error can happen for two reasons:
1) Connection timed out, OR
2) Active transmission timed out.
Generally, the reason is invariably "Active transmission timed out".
Now, based on this error, DB2 has to act on it and assume that the remote node is not responding. So what it does is a "node recovery" for the remote node on the local node, and it sends a message to the remote node to do the same. In the above snippet, Node#0 got the ETIMEDOUT from Node#17, so it will do a "node recovery" for Node#17 on Node#0 and then send a message across to Node#17 to do a node recovery for Node#0. Node recovery isn't a "full blown" recovery as we know it; it essentially recycles the socket connections to the node and tries to re-establish the connection using new sockets ( if interested, you can observe this phenomenon in netstat -an outputs ). There's one quirk, however: if it so happens that the catalog node gets the error, the remote node, in order to complete the node recovery, will have to deactivate / re-activate the database. Usually the node recovery happens very fast ( in millisecs ), but in the cases where it doesn't, one of the following can hold true:
1) It is not able to send the message across to the remote node quickly enough to do the node recovery on the remote node ( network issue ), OR
2) Something is happening on the remote node which makes it unable to process the message quickly enough ( maybe workload, low memory, CPU etc. ), OR
3) It is the exception case where it has to deactivate/activate the db in order to complete node recovery and is stuck in the deactivate phase.
There can be another cause for SQL1229, where a message such as the following appears in the db2diag.log:
2008-07-28-09.31.23.006067-240 I22590117A418 LEVEL: Error
This essentially means that DB2 had been trying to contact the remote node for nearly 500 seconds but was unable to do so. However, if you notice, there is no OS error this time around, which indicates that the network packets are probably going through but the other side is not able to process the messages sent by this node quickly enough.
Ok, I have explained what can cause the SQL1229 issues, but what can really be done to address them ? Unfortunately, there's no simple solution, as there are a lot of moving parts and the problem can be anywhere. The best you can do is to collect the "right data at the right time" to help the support engineers quickly address the problem you are having. If you see that your instance is hung right after the SQL1229 errors, chances are that you have encountered the case where a "deactivate/activate" is necessary to complete the node recovery. In that case, issue a "db2 force applications all" from the co-ordinator node. There's a high chance that the deactivate ( which was stuck ) completes, the node recovery then completes, and the engine is back online. If that doesn't help either, unfortunately the quickest way out is a "recycle" of the instance, but please make sure you collect the data for root cause analysis.
What can you collect:
For the first condition ( where the OS reports ETIMEDOUT or ECONNRESET ) it is next to impossible to anticipate the error, but you can be prepared for a recurrence. Unfortunately, since it is the OS which reports the error, you have to have a TCP trace turned on continuously and stop it when the message appears in the db2diag.log. You will have to get the respective TCP trace command for your OS ( for AIX it's iptrace, for Linux it's tcpdump ), and you can limit the TCP trace to just the FCM ports. In order to stop it at the right moment, you can use my tool "watchThreshold.pl" ( I had uploaded it for one of my earlier posts ) with the -watchFile and -command options to issue the stop-TCP-trace command when it encounters the message in the db2diag.log file.
If the instance is "hung", then I would suggest taking stacks on the "hung" node if possible ( if you can identify from the db2diag.log which node is stuck in the "deactivate" phase ); otherwise, I would suggest taking the following data:
rah "db2pd -alldbp -stack all"
sleep 180
rah "db2pd -alldbp -stack all"
This will collect the info of the "stuck" thread and provide clues as to why its stuck in deactivate.
If you get the timeout message as pasted above in the db2diag.log, you can set the following registry variable, which will trigger data collection automatically on future occurrences:
db2set DB2_FCM_SETTINGS=FCM_PERCENT_OF_MAX_TIMEOUT:50
This will trigger data collection at 50% of the timeout value, at 75% of the timeout value and at 87% of the timeout value.
This is available from DB2 v9.7 FP5 onwards. It helps to find out what was happening in DB2 when the timeout happened ( was it too busy ? If so, why ? etc. ).
The db2diag.log is very helpful in stitching together the scenario when the issue happened, so it is also an essential part of the puzzle.
I hope this information will help you understand the SQL1229 error and what can you do to help do the RCA.
I deal with PMR escalations in my daily job, and I can't count how many times the most common performance problem I see is that the system has slowed down and a CPU spike was observed during the slowdown. In such cases, data collection at the right time, i.e. while the CPU is spiked up, is the most important thing. If you miss the data collection during the spike but still go ahead and collect, it is not going to help much, as we would not know what drove the CPU up.
Now, this is where things get a bit complicated: the data collection depends on what "kind" of CPU spike is observed, i.e. is it a %usr CPU spike or a %sys CPU spike, and the data to collect is different in each case. When %usr CPU spikes up, it means the system is spending time executing application code, which in most cases is DB2; when %sys CPU spikes up, it means the system is spending time in kernel code and is not able to execute application code.
I generally follow a simple principle: when %sys CPU is high, I want OS internal data to see which call is driving up the CPU, and I also collect some DB2 information, i.e. stacks ( db2pd -stack all ) and db2pd -latches ( to see if there is any latch contention ). For OS data, I generally collect:
AIX:
1) OS trace -- perfpmr.sh -x trace.sh 30 ( or separately start the trace command as user root ).
Solaris:
1) dtrace script to see which system call is being called the most ( you can get a lot of samples using Google )
Linux:
1) systemTap script to find out which system call is being called the most OR
2) strace -c of the db2sysc pid ( only if we are fairly certain if DB2 is driving it ).
If, say, DB2 is driving the %sys CPU high, 99% of the time it is latch contention within DB2 code. So we have to find out which code path is driving that contention up. The stacks help pinpoint that, and the db2pd -latches output also helps in that respect.
From the OS trace ( AIX ), a curt report tells which kind of system call is driving up the %sys CPU, and if it is latch contention, then the splat report from the same trace can show which address has the highest contention, which can be matched against the db2pd -latches output.
I know it gets a bit complicated ... :-) but the main thing is to get the "right data at the right time", otherwise the data collection is useless and won't help get to the root cause of the issue.
If %usr is high, I generally concentrate on what kind of queries were being run on the DB. With the new MON interface of v9.7 it becomes easy: you can use the monreport module to get the highest CPU consumers fairly quickly, or you can write your own queries against the MON_GET_PKG_CACHE_STMT table function and the other MON interfaces to find the top CPU consumers. If you are not on v9.7, then it becomes a bit tougher; there are some sysibmadm routines such as top_running_sql which can be used, and you can use the application snapshot and dynamic SQL snapshot to narrow down those queries and see if tuning them helps reduce the CPU consumption.
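As an example, a quick query to surface the top CPU consumers from the package cache ( a sketch for v9.7+ with request metrics enabled; 'D' restricts it to dynamic SQL ):
db2 "select total_cpu_time, num_executions, substr(stmt_text,1,80) as stmt from table(mon_get_pkg_cache_stmt('D',null,null,-2)) as p order by total_cpu_time desc fetch first 10 rows only"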
This was a quick note on what to do when the CPU spikes up .. hope you find it helpful .. :-)
There are two more options also available, which are useful to detect CLP or upgrade issues.
An example usage of this would be: say a customer is encountering perf issues on an ISAS system with 32 nodes spread across 4 physical servers, and notices that this is accompanied by CPU spikes where, for example, the usage goes to 90%. The -cpu option can be used to trigger data capture a little before it hits 90%, so that DB2 support gets some good data on what was happening on the system when it was spiking up. We can deploy this script on all the physical servers by invoking it from just one physical server using the following command:
e.g. in this case db2pd -db sample -tcbstat 2 15
Address            TbspaceID TableID PartID MasterTbs MasterTab TableName SchemaNm ObjClass DataSize LfSize LobSize XMLSize
0x0700000048F553D8 2         15      n/a    2         15        STAFF     RSARKAR  Perm     1        0      0       0
Address            TableName SchemaNm Scans UDI RTSUDI PgReorgs NoChgUpdts Reads FscrUpdates Inserts Updates Deletes OvFlReads OvFlCrtes PgDictsCrt CCLogReads StoreBytes BytesSaved
0x0700000048F553D8 STAFF     RSARKAR  3     35  35     0        0          70    0           0       35      0       0         0         0          0          -          -
DB2 tracing tips and tricks
Usage: db2trc [facility]
db2 - DB2 instance (default)
das - DB2 Administration Server instance
cf - CF Server
cfcli - CF Client
change - Change trace options
clear - Clear the trace buffer
dump - Generate trace dump file
info - Information
off - Disable the trace facility
on - Enable the trace facility
stop - Stop tracing
Usage: db2trc
flow - Generate control flow diagram
format - Format (includes CLI/ODBC driver traces)
formattedFlow - Generate global flow diagram
info - Information
perffmt - Format a performance trace
perfrep - Format a performance report
Name  Address            Id       Size(Kb) Key        DBP Type Unrsv(Kb) Used(Kb) HWM(Kb) Cmt(Kb) Uncmt(Kb) CmtRt(Kb) DcmtRt(Kb) HoldRt(Kb) Sngl
Trace 0x0000000000000000 14057472 33355    0x45EE9E74 0   -1   0         33355    0       33355   0         0         0          0          0
Overall performance is slow
1204594 (1288 sec, 694681000 nanosec) sqlak_callbDrdaOutput
2815676 (1225 sec, 149336000 nanosec) csmGetCursorBuf
560574 (1223 sec, 174019000 nanosec) csmDriveFetch
560525 (1222 sec, 257363000 nanosec) sqljrDrdaArFetch
560486 (1200 sec, 247509000 nanosec) sqljrRecv
2414805 (1185 sec, 162567000 nanosec) clientboDDFMode1Conversions
2410385 (1183 sec, 317071000 nanosec) clientboWriteBytes
560425 (1173 sec, 235236000 nanosec) sqljsParse
560158 (1171 sec, 596516000 nanosec) sqljsParseRdbAccessed
559503 (1170 sec, 921528000 nanosec) sqljs_ddm_cntqry
560041 (1170 sec, 81641000 nanosec) sqlrr_fetch
560021 (1158 sec, 474592000 nanosec) sqlrr_process_fetch_request
1765631 (1135 sec, 986047000 nanosec) sqlribnoCbDrda
2953145 (1080 sec, 720990000 nanosec) sqljsDrdaAsCBPutLongDataQueryCheck
3490741 (1068 sec, 659227000 nanosec) sqlriWriteCsoIntoBuffer
3490211 (1067 sec, 508322000 nanosec) sqlrimso
3485339 (1054 sec, 156969000 nanosec) sqlriglf
3486049 (1051 sec, 759578000 nanosec) sqldFetchLFD
3485674 (1047 sec, 24618000 nanosec) sqldx_lob_read
3484393 (1039 sec, 146770000 nanosec) sqldxReadLob
3486801 (1031 sec, 887696000 nanosec) sqldxReadLobData
3483545 (1010 sec, 669271000 nanosec) sqlbDirectRead
3483541 (1009 sec, 188149000 nanosec) sqlbDMSDirectRead
3485667 (1007 sec, 485205000 nanosec) sqldx_diskread
3485702 (957 sec, 532408000 nanosec) sqlbDMSMapAndRead
3479676 (933 sec, 916670000 nanosec) sqlbReadBlocks
3470448 (926 sec, 439074000 nanosec) sqloReadBlocks
8182 (907 sec, 85306000 nanosec) sqloCSemP
8180 (906 sec, 780554000 nanosec) sqlbpfRemoveFromQ
A particular query is running slow
Trace version : 7.0
Platform : Linux/X8664
Build level : s151221
maxBufferSize : 33554432 bytes (32 MB)
auxBufferSize : 0 bytes (0 MB)
allocationCount : 2
DB2TRCD pid : 0
Trace destination :
numSuspended : 0
Trace starting time : 2016-05-24-19.02.51.280774-240
Allow buffer to wrap : yes
Mask : *.*.*.*.*
Timestamps : enabled
PID.TID mask : all
Fixed data mask #1 : all
Fixed data mask #2 : all
Max system errors : infinite
Treat this rc as sys err: none
Member mask : none
Application handle mask : 272 << This will show you which apphandles it's tracing
Application ID mask : none
What if you find turning on db2trc is affecting your system performance adversely
Read I/O times between 1 and 5: 661 ( 30.1826% ) Total time(ms) = 928.960 ( 2.3804% )
Read I/O times between 5 and 9: 282 ( 12.8767% ) Total time(ms) = 2113.160 ( 5.4148% )
Read I/O times between 10 and 14: 249 ( 11.3699% ) Total time(ms) = 3093.450 ( 7.9266% )
Read I/O times between 15 and 19: 239 ( 10.9132% ) Total time(ms) = 4131.900 ( 10.5876% )
Read I/O times between 20 and 24: 182 ( 8.3105% ) Total time(ms) = 4082.750 ( 10.4616% )
Read I/O times between 25 and 49: 470 ( 21.4612% ) Total time(ms) = 16063.380 ( 41.1607% )
Read I/O times between 50 and 74: 71 ( 3.2420% ) Total time(ms) = 4231.670 ( 10.8432% )
Read I/O times between 75 and 99: 13 ( 0.5936% ) Total time(ms) = 1143.200 ( 2.9293% )
Read I/O times between 100 and 199: 20 ( 0.9132% ) Total time(ms) = 2628.130 ( 6.7343% )
Read I/O times between 200 and 299: 3 ( 0.1370% ) Total time(ms) = 609.390 ( 1.5615% )
Total number of Reads: 2190, Total time spent in Read I/O: 39025.990 (ms), Average time per Read I/O: 17.820 (ms)
add DB2.SQLJC.sqljcReceive.1178
add sqljcReceive.exit
add sqljsParse.entry
add sqljsParse.exit
add sqlrr_execute.entry
add sqlrr_execute.exit
add sqlrr_activity_exe_start.entry
add sqlrr_activity_exe_start.exit
add sqlriSectInvoke.entry
add sqlriSectInvoke.exit
add sqljcSend.entry
add DB2.SQLJC.sqljcSend.1177
add sqljcSend.exit
add PVM::run.entry
add PVM::run.exit
add sqlra_get_section.entry
add sqlra_get_section.exit
add sqlra_load_pkg.entry
add sqlra_load_pkg.exit
add sqlra_get_var.entry
add sqlra_get_var.exit
add sqlnn_cmpl.entry
add sqlnn_cmpl.exit
add DB2.SQLR.sqlrr_rds_common_pre1.200
add sqlra_search_for_pkg.entry
add sqlra_search_for_pkg.exit
add sqlra_open_pkg.entry
add sqlra_open_pkg.exit
add sqlra_pkg_lock.entry
add sqlra_pkg_lock.exit
add sqlra_sqlC_get_sibling.entry
add sqlra_sqlC_get_sibling.exit
add sqlra_sqlC_get_stmt.entry
add sqlra_sqlC_get_stmt.exit
add sqlricall.entry
add sqlricall.exit
----------  ---  ------------------------------  ----------  ----------  ----------  -----------  ------  ------------
7771(Tid = 46913080518976, Node = 0)
            0    sqloWaitEDUWaitPost               3999.644     999.952     999.875      999.911       4          1751
            0    ossGetAvgCPULoad                     0.432       0.126       0.044        0.086        5          2337
            0    sqloGetEnvInternal                   0.015       0.005       0.000        0.002       10             1
            0    sqlogmt                              0.009       0.003       0.001        0.002        5             3
            0    sqeApplication::UpdatePITNodeI       0.006       0.002       0.001        0.001        5          1138
            0    pdUpdateBaselinesIfNecessary         0.005       0.001       0.001        0.001        5            12
            0    sqleRollupGetCachedCEAL              0.002       0.001       0.000        0.000        5           607
SSL performance issue on AIX
2957916 (11690 sec, 206149111 nanosec) sqlccSSLSocketRead
1473046 (4602 sec, 347173824 nanosec) sqlccSSLSocketWrite
511920 0.293737443 | | | | tcprecv entry [eduid 28338 eduname db2agent]
511937 0.293747148 | | | | | sqlogetdatetime entry [eduid 28338 eduname db2agent]
511942 0.293751369 | | | | | sqlogetdatetime exit
516231 0.295472654 | | | | | sqlogetdatetime entry [eduid 28338 eduname db2agent]
516249 0.295477816 | | | | | sqlogetdatetime exit
516256 0.295479347 | | | | | sqlccSSLSocketRead entry [eduid 28338 eduname db2agent]
516264 0.295480910 | | | | | | sqlccSSLCallbackRead entry [eduid 28338 eduname db2agent]
516267 0.295481160 | | | | | | sqlccSSLCallbackRead data [probe 991]
516275 0.295484500 | | | | | | sqlccSSLCallbackRead exit [rc = 5]
516279 0.295486451 | | | | | | sqlccSSLCallbackRead entry [eduid 28338 eduname db2agent]
516281 0.295486660 | | | | | | sqlccSSLCallbackRead data [probe 991]
516288 0.295489146 | | | | | | sqlccSSLCallbackRead exit [rc = 0x00000020 = 32]
817092 0.385796982 | | | | | sqlccSSLSocketRead exit >> gets out of this call after 90ms.
817104 0.385798310 | | | | tcprecv exit
malloc_y
malloc_common@AF104_87
_Fancy_malloc__FUl
init_pres__FUl
__vn__FUl
ReadCompressedMsg_BlockCipher__13SSLV3ProtocolFi
ReadCompressedMsg__13SSLV3ProtocolFi
ReadMsg__13SSLV3ProtocolFi
SSL_Read__13SSLV3ProtocolFPviT2
StateMachine__13SSLV3ProtocolFiPvT1Pb
Receive__13SSLV3ProtocolFPvi
Receive__18SSLProtocolManagerFPvi
gsk_secure_soc_read
sqlccSSLSocketRead__FP21SQLCC_TCPCONNHANDLE_TP12SQLCC_COND_TUsUiPciPi
tcprecv__FP17SQLCC_COMHANDLE_TiPcT2UsT5P21SQLCC_TCPCONNHANDLE_TP12SQLCC_COND_TUiP
export MALLOCBUCKETS=number_of_buckets:128,bucket_sizing_factor:64,blocks_per_bucket:1024
db2set DB2ENVLIST="MALLOCTYPE MALLOCOPTIONS"
db2stop
db2start
Upgrade to DB2 v10.5 FP7 woes
2 0.035750000 0.017875000 0.000010000 0.000005000 sqljsParse
4 0.035736000 0.008934000 0.000003000 0.000000750 sqljsParseRdbAccessed
1 0.035261000 0.035261000 0.000001000 0.000001000 sqljs_ddm_rdbrllbck
1 0.035259000 0.035259000 0.000002000 0.000002000 sqlrr_rollback
1 0.035176000 0.035176000 0.000004000 0.000004000 sqlrrbck
1 0.035152000 0.035152000 0.000003000 0.000003000 sqlrrbck_dps
3 0.035129000 0.011709667 0.000005000 0.000001667 pdLogEvRecEx
1 0.035121000 0.035121000 0.035112000 0.035112000 pdLogInternal << The elapsed time and spent time are equal for this function, indicating this is the bottleneck
db2pd -db sample -dmpevrec comp=SQLRA name=sqlraLOW mask=0
Backup statistics
DB2 Performance -- Leveraging the MON interface
Debug Slow Online Backup in DB2
Mystery behind SQL1229 error on DPF systems
PID : 5038184 TID : 1286 PROC : db2sysc 46
INSTANCE: edwinp01 NODE : 046
EDUID : 1286 EDUNAME: db2fcms 46
FUNCTION: DB2 UDB, fast comm manager, sqkfNetworkServices::detectNodeFailure, probe:15
DATA #1 :
Detected failure for node 41 - time elapsed: 501; max timeout: 500; link state: 4
What to do when CPU spikes up
Range Partitioned tables and long compile times
How to track and collect data for a long running query
Migration to v97 from v95 -- Experiences
Threshold data capture for DB2 Performance issues
watchThreshold.pl [options]
-idle
-user
-sys
-qrun
-qblock
-qoperator
-cpuoperator
-condition
-avm
-po
-cachesize
-database
-string
-matchtype [1|2] ( 1 = match one string at a time ( or condition ), 2 = match all strings provided ( and condition ). Default: 1 )
-connecttype
This is to reduce the false positives as much as possible. This is applicable for the -vmstat and -pkgcache options only. ( Default: 5 seconds )
-command
-before
-scriptname