Those who are familiar with the Informix database have probably used a utility named "onstat" to monitor the database and do some cool tricks by looking at the innards of the Informix engine. The db2pd tool is very similar to that, and there are a lot of cool things you can do with it, some of which are externally documented and some of which are undocumented ( the reason being that, if misused, they can be catastrophic for the DB2 engine ). Since my blog is more "performance" biased, I'll discuss a few cool things you can do with db2pd to figure out a performance bottleneck. Let me say one thing at the outset: if you are looking for some kind of nicely formatted report, then this might not be the tool of choice for you. It provides a quick insight into the shared memory of the DB2 engine, the outputs can be a bit cryptic, and it takes a bit of time to get used to them. But I can tell you, once you get used to it, there's no peer to this tool .. :-). There is one more great thing about this tool ( perhaps the greatest ): when your engine is hung and no connections are going through, this is the ONLY tool which can still work and get you the state of the engine at the time of the hang. Now that the preamble is out of the way .. let me get down to the nitty-gritty.
The following scenarios are based on my experience working in support. These are pretty much the most common "perf" scenarios customers typically face, and I'll show you how you can get to the root cause just by using the db2pd tool.
1) High CPU situation -- A typical scenario is CPU spikes or high CPU after migrating to a new release, upgrading hardware, etc.
My suggestion would be to get a vmstat profile first to see which kind of CPU ticks is being accumulated more, i.e. %usr or %sys. If you are used to vmstat output and would like to get a similar output on Windows systems running DB2, you can use the db2pd -vmstat option.
a) If %sys CPU is high
This means that, for whatever reason, more system calls are being made and the system is not doing useful work ( i.e. executing user/application code ) but is spending its time running system calls. Typically, I deem it a problem when %sys is 1.5 - 2x or more higher than %usr. So the trick is to find out which function calls are consuming those ticks. If DB2 is driving it, it can be one of 2 things:
i) I/O calls or Network calls OR
ii) Latch calls
Since the I/O and network system calls happen outside the "application", use OS tools to profile them: curt on AIX; oprofile or perf on Linux ( good examples in: http://www.brendangregg.com/perf.html and https://perf.wiki.kernel.org/index.php/Tutorial ); or, if you are an advanced user and want to be adventurous, try systemtap ( it is like dTrace or probevue ) and write a program to profile the system calls. I did not recommend strace on Linux because tracing a multi-threaded process like DB2's db2sysc can seriously impact its performance; the other tools are much lower-impact choices.
For latch calls, you can use the command:
db2pd -latches -rep 1 50 -file latches.log > /dev/null 2>&1 -- this runs the command 50 times, 1 second apart, writing the output to latches.log. A typical output will look like:
Database Member 0 -- Active -- Up 3 days 17:08:40 -- Date 2016-07-18-15.12.55.033746
Latches:
Address Holder Waiter Filename LOC LatchType HoldCount
0x0780000001340478 1029 0 ../include/sqle_workload_disp.h 1391 SQLO_LT_sqeWLDispatcher__m_tunerLatch 1
When there is a problem, i.e. latch calls are causing the high %sys, you should see values in the "Waiter" column. The numbers in the "Holder" and "Waiter" columns are EDU ids, which you can correlate to application handles using:
a) db2pd -db <dbname> -agents ( find the EDU id and read off the corresponding AppHandl ) OR
b) db2pd -agents|awk -v "edu=13880" '/^0x/ { if( $4 == edu ) print "Apphandle: "$2; }'
To ease your work, there are 2 scripts attached:
1) latchToQueryIdentifier.pl -- finds out which query is driving the latch contention.
2) parseDb2pdLatch.pl -- parses the db2pd -latches output to summarize the latch contention.
Please note, these scripts are provided as-is without any support.
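If you do not want to use the scripts, a quick one-liner along these lines can summarize which latch types have waiters ( a rough sketch, assuming the column layout shown in the sample above, i.e. Waiter in column 3 and LatchType in column 6 ):
# count latch waiters per latch type across all samples in latches.log
awk '/^0x/ { if( $3 != 0 ) waiters[$6]++ } END { for( l in waiters ) print waiters[l], l }' latches.log | sort -rn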
b) If %usr CPU is high
This means that user code is driving the CPU high. Once you have identified the top CPU consuming process ( from top, ps aux or topas ) and narrowed it down to the db2sysc process, you can run the following db2pd commands to figure out which EDU / agent is consuming the most CPU and then find out which query is driving the high CPU usage.
1) To identify the top CPU consuming EDUs ( DB2 v9.7 FP5 and higher ):
db2pd -db
2) Collect the application information to find out the query driving the problem.
db2pd -db
Both commands end with an '&' so that they run in the background at the same time, as you will need to correlate the two outputs to find the query or queries. You may need to run the above multiple times if there are several queries driving the high CPU. The next step, of course, is to run the offending query through Query Tuner if you are using DSM, or to get an explain plan and see what can be tuned to make the high CPU consumption go away. Typically, large hash joins are CPU intensive, but there can be other reasons too.
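One way to set this up looks something like the following ( a sketch only; the 10 second interval, the output file names and the database name SAMPLE are assumptions to be adapted to your system ):
# per-EDU CPU usage deltas, sampled in the background
db2pd -edus interval=10 > edus.out 2>&1 &
# application / statement information collected at the same time
db2pd -db SAMPLE -apinfo > apinfo.out 2>&1 &
wait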
2) High number of concurrent queries slowing down the database performance
Sometimes a query with a bad plan is running at very high concurrency on the database, which makes everything run slow. If the query is run in isolation it finishes very quickly, but when run in large numbers simultaneously it causes problems. You can find the offending query using:
db2pd -db <dbname> -active
A sample output looks like:
Database Member 0 -- Database SAMPLE -- Active -- Up 3 days 18:42:10 -- Date 2016-07-18-16.46.34.408112
Active Statement List:
Address            AppHandl [nod-index] UOW-ID StmtID AnchID StmtUID EffISO EffLockTOut EffDegree EntryTime           StartTime           LastRefTime
0x0700000049604C80 6881     [000-06881] 1      1      377    1       1      -1          0         Mon Jul 18 16:46:28 Mon Jul 18 16:46:28 Mon Jul 18 16:46:34
Key things to look at are:
i) Apphandle
ii) Anch ID
iii) StmtUID
iv) Start Time
v) Last Ref Time
Note: To identify the query running with the highest concurrency, you can run db2pd -db <dbname> -active repeatedly and count how often each AnchID / StmtUID pair shows up.
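A rough way to count that from the -active output ( a sketch; it assumes AnchID and StmtUID are the 6th and 7th columns, as in the sample above ):
db2pd -db SAMPLE -active | awk '/^0x/ { print $6, $7 }' | sort | uniq -c | sort -rn | head -5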
The Anch ID and StmtUID combination uniquely identifies the query in the cache. You can then run the following to get the statement.
db2pd -db sample -dyn|awk -v "anchid=377" -v "stmtid=1" '/Dynamic SQL Statements:/,/Dynamic SQL Environments:/ { if( match($0,"^0x") && ( $2 == anchid ) && ( $3 == stmtid ) ) print $0; }'
You can also use the db2pd -active output to find long running queries ( the watchThreshold script I provided in one of my earlier blogs uses the same mechanism ) by diffing the "StartTime" of a statement against the time shown in the header line of the db2pd output.
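A minimal sketch of that idea, which just prints the snapshot time from the header next to each active statement's handle and start time so the long runners can be eyeballed ( the field positions are assumed from the sample output above and may shift between releases ):
db2pd -db SAMPLE -active | awk '/^Database Member/ { print "Now:", $NF } /^0x/ { print "AppHandl="$2, "AnchID="$6, "StmtUID="$7, "StartTime="$15" "$16" "$17" "$18 }'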
3) Find out lock waits in the database
The command to figure out the lock wait situation is:
db2pd -db <dbname> -wlocks
A sample output is:
Database Member 0 -- Database SAMPLE -- Active -- Up 0 days 00:02:28 -- Date 2016-07-18-17.16.41.521224
Locks being waited on :
AppHandl [nod-index] TranHdl Lockname                   Type    Mode Conv Sts CoorEDU AppName AuthID  AppID
30       [000-00030] 420     0002000F000000000000000452 RowLock ..X       G   3086    db2bp   RSARKAR *LOCAL.rsarkar.160718211437
32       [000-00032] 397     0002000F000000000000000452 RowLock ..U       W   15166   db2bp   RSARKAR *LOCAL.rsarkar.160718211532
Apphandle 32 is waiting on Apphandle 30: it is trying to get a row lock in "U" mode on a row on which Apphandle 30 is holding an "X" lock. To also see which table the lock belongs to, use the detail option:
$ db2pd -db sample -wlocks detail
Database Member 0 -- Database SAMPLE -- Active -- Up 0 days 00:03:46 -- Date 2016-07-18-17.17.59.830245
Locks being waited on :
AppHandl [nod-index] TranHdl Lockname                   Type    Mode Conv Sts CoorEDU AppName AuthID  AppID                       TableNm SchemaNm AppNode
30       [000-00030] 420     0002000F000000000000000452 RowLock ..X       G   3086    db2bp   RSARKAR *LOCAL.rsarkar.160718211437 STAFF   RSARKAR  db2debug
32       [000-00032] 397     0002000F000000000000000452 RowLock ..U       W   15166   db2bp   RSARKAR *LOCAL.rsarkar.160718211532 STAFF   RSARKAR  db2debug
So, it is on table STAFF. To find out the queries run by each application handle, I can use the following commands:
db2pd -db sample -apinfo 30
db2pd -db sample -apinfo 32
db2pd -db sample -apinfo 30 results in:
Application :
Address : 0x0780000001DE0080
AppHandl [nod-index] : 30 [000-00030]
TranHdl : 420
Application PID : 56755468
Application Node Name : db2debug
IP Address: n/a
Connection Start Time : (1468876459)Mon Jul 18 17:14:19 2016
Client User ID : rsarkar
System Auth ID : RSARKAR
Coordinator EDU ID : 3086
Coordinator Member : 0
Registered Agents : 1
Active Agents : 1
Locks timeout value : NotSet
Locks Escalation : No
Workload ID : 1
Workload Occurrence ID : 1
Trusted Context : n/a
Connection Trust Type : non trusted
Role Inherited : n/a
Application Status : UOW-Waiting
Application Name : db2bp
Application ID : *LOCAL.rsarkar.160718211437
ClientUserID : n/a
ClientWrkstnName : n/a
ClientApplName : n/a
ClientAccntng : n/a
CollectActData: N
CollectActPartition: C
SectionActuals:
UOW start time : 07/18/2016 17:15:15.911638
UOW stop time :
Application handle 30 is in UOW-Waiting status; however, as can be seen, it has an open UOW which started at 17:15:15.
Application :
Address : 0x0780000002FA0080
AppHandl [nod-index] : 32 [000-00032]
TranHdl : 397
Application PID : 23074544
Application Node Name : db2debug
IP Address: n/a
Connection Start Time : (1468876532)Mon Jul 18 17:15:32 2016
Client User ID : rsarkar
System Auth ID : RSARKAR
Coordinator EDU ID : 15166
Coordinator Member : 0
Registered Agents : 1
Active Agents : 1
Locks timeout value : NotSet
Locks Escalation : No
Workload ID : 1
Workload Occurrence ID : 2
Trusted Context : n/a
Connection Trust Type : non trusted
Role Inherited : n/a
Application Status : Lock-wait
Application Name : db2bp
Application ID : *LOCAL.rsarkar.160718211532
ClientUserID : n/a
ClientWrkstnName : n/a
ClientApplName : n/a
ClientAccntng : n/a
CollectActData: N
CollectActPartition: C
SectionActuals:
UOW start time : 07/18/2016 17:16:31.350722
UOW stop time :
List of current activities :
Activity ID : 1
UOW-ID : 1
Package schema : NULLID
Package name : SQLC2K26
Package Version :
Consistency Token : AAAAAfAd
Section number : 203
Statement number : 1
Isolation : CS
Effective degree : 0
Actual degree : 1
Source ID : 0
Cursor ID : 0
Nesting level : 0
Invocation ID : 0
Package cache ID : 0x0000000102910000
Anchor ID : 657
Statement UID : 1
SQL Type : Dynamic
Statement Type : DML, Insert/Update/Delete
Statement : update staff set years = 10 where id = '200'
Entry time : 07/18/2016 17:16:31.351389
Local start time : 07/18/2016 17:16:31.351391
Last reference time : 07/18/2016 17:16:31.351389
Application handle 32 was running an update statement on STAFF for a specific ID and is in Lock-wait status. So an educated guess can be made here that application handle 30 must have run some DML statement which holds an X lock on the row but has not yet committed, which is why the lock wait is happening. If that user has walked away from the workstation, then the best alternative is to do a "force application" on application handle 30 to roll back its transaction and let application handle 32 proceed.
Note: if you want to decode the lock name to find out which table it belongs to, you can run:
db2 "SELECT SUBSTR(NAME,1,20) AS NAME,SUBSTR(VALUE,1,50) AS VALUE FROM TABLE( MON_FORMAT_LOCK_NAME('0002000F000000000000000452')) as LOCK" OR
The tablespace id and table id are embedded in the lock name: the first 2 bytes store the tablespace id and the next 2 bytes store the table id. You can then issue the command:
db2pd -db <dbname> -tcbstat <tbspaceid> <tableid>
$ db2pd -db sample -tcbstat 2 15
Database Member 0 -- Database SAMPLE -- Active -- Up 0 days 00:14:51 -- Date 2016-07-18-17.29.04.517807
TCB Table Information:
TCB Table Stats:
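If you prefer to decode the lock name by hand, the tablespace id and table id are just the first two 2-byte ( 4 hex digit ) fields. A small bash sketch, using the lock name from the example above:
lockname=0002000F000000000000000452
printf "tablespace id: %d, table id: %d\n" 0x${lockname:0:4} 0x${lockname:4:4}      # prints 2 and 15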
4) Miscellaneous things you can do with db2pd
i) Find out how much log space is being used by each transaction -- db2pd -db <dbname> -transactions
ii) Find out the transaction throughput -- script over repeated db2pd outputs and diff the counters to derive per-second values.
iii) Find out the query throughput -- same approach, diff repeated outputs to get per-second values.
**Note: With the availability of db2 "call monreport.dbsummary(60)", the throughput values ( query and transactions ) are much easier to find than by scripting over db2pd outputs.
iv) db2pd -everything -- collects ALL information about the instance and is useful for support when they are trying to figure out what was going on in the instance during a problem. The db2fodc -perf/-hang tools collect this automatically.
v)
db2pd -db
vi) db2pd -db
vii) db2pd -stack all -- takes stacks of ALL the agents in the engine. If you want the stack of a specific EDU, just append the EDU id to the command. By default it dumps the stack files ( *stack.txt ) in DIAGPATH, but if you want to change that, you can use:
db2pd -stack all dumpdir=<directory>
Stacks are VERY VERY important in hang-like scenarios, as they tell the support engineers in which code path an agent is stuck. They are useful in slow perf issues too, but they need to be used judiciously as taking them does have an impact.
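When I do need stacks for a slow-performance investigation, a simple collection loop like this is usually enough ( a sketch; the dump directory is just an example ):
mkdir -p /tmp/db2stacks
for i in 1 2 3
do
    db2pd -stack all dumpdir=/tmp/db2stacks
    sleep 120
done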
There are many more commands ( db2pd -h shows you a help screen ) which the support engineers use to debug any kind of issue in the engine. It would need a whole book to cover all of them in detail .. :-)
In conclusion, I would say this tool is not for someone who wants GUI based outputs or wants to resolve issues with a few clicks of a mouse ( there are tools such as DSM and Optim which do that very well ), but it is for someone who is more of an awk/sed/grep person and is not afraid to write quick one-line scripts to parse the data. I'm personally biased towards this tool as I am an ex-Informixer :-) and am used to typical character based output and doing greps/awk/sed etc.
I hope this not-so-brief primer on db2pd was useful for you. Please feel free to comment.
Any well written code should have some kind of tracing facility to allow an "inside" view of its workings, so that when problems happen the issue can be looked into while it is happening. You must have come across OS tracing tools such as dTrace ( Solaris ), systemTap ( Linux ) and probeVue ( AIX ), or tcpdump ( Linux ) and iptrace ( AIX ) for network tracing. On similar lines, DB2's tracing facility provides a view of the inner workings of the engine, and it can be used to find bottlenecks within the engine.
However, it needs to be handled with care ( as is the case with any tracing tool ): too much of it can cause huge performance degradation, and a wrong option can make a system so slow that it looks like a hang. I am going to discuss the optimal usage of the DB2 tracing facility for different kinds of scenarios, and also some of the options you should or should not use in different situations. The tool which controls DB2's tracing facility is known as db2trc.
Let me start with the different options available with the db2trc facility ( DB2 v10.1 FP5+ or 10.5FP3+ ).
STAGE #1 - COLLECTION
[facility]
STAGE #2 - PARSING
For more information add the "-u" option to any of the above commands
You can get the above commands by just typing in "db2trc" from the command line.
Typically, when you turn on db2trc, the trace records are put in a cyclic buffer in memory. The size of the buffer can be found using the following command ( if the instance is up ):
$> db2pd -memsets | egrep "Address|Trace"
You can change the buffer size by setting the registry variable BEFORE bringing up the instance:
db2set DB2TRC_DEF_BUFFSIZE=512m ( you can go up to a 1GB maximum; it accepts the size in bytes or with suffixes such as 'm' or 'g' ).
You can also write the trace records to a file with the -f option. However, this has to be used with caution and is recommended only for error / crash situations which happen very quickly, e.g. right after the engine is started. The default recommendation is NEVER to use it unless specifically asked to by a support engineer.
Ok, now that the preamble is out of the way, let me get down to the nitty-gritty and show how db2trc can be used for different types of situations.
A typical situation is when everything is running slow. This happens very commonly right after some upgrade, but it may happen otherwise too.
Suggested command#1 to use:
db2trc on -l ( small L ) 512m -t
sleep 30
db2trc dmp trc.dmp.1
sleep 30
db2trc dmp trc.dmp.2
sleep 30
db2trc dmp trc.dmp.3
db2trc off
After this, use these commands to format the trace dump files.
db2trc perfrep trc.dmp trc.perfrep -sort timeelapsed -order desc -g
This command will generate a "profile" trace for each of the threads working within the engine. The output will be something like:
Node : 0 , PID : 7771 , TID : 46912849832256
nCalls TotalElapsed AvgElapsed TotalSpent AvgSpent FunctionName
1 0.000512000 0.000512000 0.000003000 0.000003000 sqlriProcessSplitRowData
1 0.000509000 0.000509000 0.000001000 0.000001000 sqljsDrdaAsCBPutLongDataQueryCheck
1 0.000503000 0.000503000 0.000004000 0.000004000 sqlriWriteCsoIntoBuffer
1 0.000499000 0.000499000 0.000008000 0.000008000 sqlrimso
1 0.000491000 0.000491000 0.000001000 0.000001000 sqlriglf
1 0.000490000 0.000490000 0.000003000 0.000003000 sqldFetchLFD
1 0.000485000 0.000485000 0.000002000 0.000002000 sqldx_lob_read
1 0.000480000 0.000480000 0.000005000 0.000005000 sqldxReadLob
1 0.000474000 0.000474000 0.000003000 0.000003000 sqldxReadLobData
1 0.000471000 0.000471000 0.000001000 0.000001000 sqldx_diskread
1 0.000470000 0.000470000 0.000002000 0.000002000 sqlbDirectRead
1 0.000468000 0.000468000 0.000009000 0.000009000 sqlbDMSDirectRead
1 0.000328000 0.000328000 0.000028000 0.000028000 sqlbDMSMapAndRead
1 0.000295000 0.000295000 0.000002000 0.000002000 sqlbReadBlocks
1 0.000293000 0.000293000 0.000293000 0.000293000 sqloReadBlocks
2 0.000179000 0.000089500 0.000006000 0.000003000 sqlbDMScheckObjAlloc
2 0.000158000 0.000079000 0.000013000 0.000006500 sqlbIsExtentAllocated
The way to interpret the above is to look at the TotalElapsed and TotalSpent column values. TotalElapsed is the elapsed time between entering and leaving the function; TotalSpent is the time spent inside the function itself. For example, sqlriProcessSplitRowData shows a TotalElapsed of 0.000512 seconds and a TotalSpent of 0.000003 seconds, which means the function was entered at time t1 and exited 0.000512 seconds later, but it actually spent only 0.000003 seconds inside the function itself; the rest of the time was spent in the functions it called. Since the command ordered the output in descending order, keep looking for a function where TotalElapsed =~ TotalSpent. In the above example it is the sqloReadBlocks function, which, as its name suggests, is an I/O function. So, of the 0.000512 seconds, 0.000293 seconds was spent in the I/O call. This is how you can figure out a bottleneck from this output. If it's an obvious problem ( refer to my SSL Perf blog ) it will surely bubble up here.
Suggested command#2
db2trc on -perfcount -t
sleep 180
db2trc dmp trc.perfcount
db2trc off
db2trc perffmt trc.perfcount trc.perffmt
This command counts how many times each function is called and how much time is spent in each function. It doesn't have the per-thread granularity which the perfrep report provides, but it creates a nice profile of the whole engine. Typically, I use the following command to parse the trc.perffmt output:
sort -t"(" -k2nr trc.perffmt
The output is like:
1204129 (1294 sec, 118794000 nanosec) csmFetch
If some call is a bottleneck, you will see it right around the top. Do not be fooled, however, by the typical waiting functions, such as the functions waiting for the client to send something ( sqljrRecv being one of them ). They will be at the top on a large instance with lots of threads if most of the threads are idle.
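To keep such idle-wait functions from cluttering the top of the list, I sometimes filter them out ( a sketch; extend the egrep pattern with whatever wait functions dominate on your system ):
sort -t"(" -k2nr trc.perffmt | egrep -v "sqljrRecv|sqloWaitEDUWaitPost|sqloCSemP" | head -20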
What happens if you have a DPF system with the same issue ?
The trace memory segment is SHARED by ALL the local members on the server, so you don't have to turn it on for each member separately. Usually, if you have a large DPF installation ( say > 50 nodes ), my rule of thumb is to take trace data on the co-ordinator node + 2 data nodes. If you do need to turn on tracing on ALL the members, you can run the following command from any ONE node:
db2trc on -l 512m -t -member all
If you want to dump the trace, you can also issue the command from any one node; the restriction is that the dump directory MUST be a directory which can be written to by ALL the nodes.
db2trc dmp trc.dmp -sdir <shared directory>
It will automatically suffix the dump file name with the hostname of the server from which it is being dumped.
If you don't have the -sdir option available, you can do something like:
tstamp=`date "+%Y%m%d_%H%M%S"`
rah "||db2trc dmp trc.dmp.$tstamp.\`hostname -s\`"
The rest of the suggestions on perfcount and perfrep still apply to DPF too. My suggestion would be to use both methods ( I personally favor the perfrep option, though it's available only in the later releases ) to find the obvious bottlenecks in an overall slowdown situation.
When a particular query ( rather than the whole system ) is running slow, you can trace just that query. The identification of the application handle which is running the query is critical here, as you don't want to impact any other queries running on the database. You can identify the application handle via:
a) If you have just connected and you are going to run the query within that same connection:
db2 "select mon_get_application_handle() from sysibm.dual" OR
db2 "select application_id() from sysibm.dual"
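If the query is coming in from a different connection ( e.g. an application server ), one way to look up its application handle is through the MON table functions ( a sketch using MON_GET_CONNECTION; filter on whatever client attribute identifies your application ):
db2 "select application_handle, application_name, client_applname from table(mon_get_connection(cast(null as bigint),-1)) as c"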
Suggested command to use ( one or the other ):
db2trc on -l 512m -t -apphdl <application handle> OR
db2trc on -l 512m -t -appid <application id>
sleep 60
db2trc dmp trc.dmp.1
sleep 60
db2trc dmp trc.dmp.2
db2trc off
db2trc perfrep trc.dmp.1 trc.perfrep.1
You can always verify the db2trc command using:
db2trc info
The output will be like:
Marker : @TRACE@
Buffer size : 33554432 bytes (32 MB)
Using the same methodology I described above for finding a bottleneck in a perfrep report, you can do a similar analysis for just this application handle. You can extend this to:
a) trace a slow load process.
i) Identify the application handle running the load job using the command db2pd -db
ii) Start an apphandle trace ( db2trc on -l 512m -t -apphdl <application handle> )
b) trace a slow backup / restore
i) Identify the application handle doing the backup / restore using the command db2pd -edus | grep db2med
The apphandle is embedded in eduname.
ii) Start an apphandle trace ( db2trc on -l 512m -t -apphdl <application handle> )
In cases where using the above commands causes a performance degradation, you will have to use masks to reduce the trace footprint. Unfortunately, these have to be supplied by support, as coming up with a good mask requires some source code knowledge and access. I am sharing a few common masks which we use to debug problems; you can use them if the problem description matches your scenario.
1) I/O performance
db2trc on -l 512m -t -Madd sqloReadBlocks -Madd sqloWriteBlocks
sleep 60
db2trc dmp trc.dmp
db2trc off
db2trc flw trc.dmp trc.flw -t
Run the attached script to parse the output. The syntax is:
findTimesByOperationsFromTrace.pl -operation "Read I/O" -file trc.flw -func "sqloReadBlocks enter" -func "sqloReadBlocks exit"
The output will be like:
**Disclaimer Note: This script is provided as-is without any support.
2) Debug slow stored procedures.
db2trc on -l 512m -t -Mfile trcMaskForStoredProcedures
sleep 60
db2trc dmp trc.dmp
db2trc off
db2trc perfrep trc.dmp trc.perfrep
add sqljcReceive.entry
This mask file came out of a debugging effort for a very tricky stored procedure problem.
As can be seen, setting masks requires knowing the problem and where in the source code the potential problem could be, which is why most of the time it is advisable to contact support to help you set up a good trace mask that has the least impact on your system.
Other useful db2trc options
1) If you are dealing with millisecond-level issues, e.g. trying to match a WAS log entry to a DB2 trace entry to figure out which connection was servicing the WAS connection, you can use the -wc ( wallclock ) option when formatting the db2trc output to get wall clock times. It then becomes much easier to correlate the entries in the DB2 trace to the application logs.
db2trc flw trc.dmp trc.flw -t -wc
2) If you want to trace to a file ( -f option ) but would like to limit the size of the file, you can use:
db2trc on -l ( small L ) 2g -f trc.dmp OR
db2trc on -i 2g -f trc.dmp
The first command will limit the file size to 2GB: once it hits the limit it will wrap around to the front and keep writing. The second command will stop writing to the file after it hits 2GB.
3) If you know the return code from the db2diag.log for a specific issue, you can stop the trace when that error is hit by using the command:
db2trc on -l 512m -t -rc <return code>
4) If you want to investigate BLU performance issues, you can turn on BLU performance trace
db2trc on -l 512m -t -Madd CDE_PERF_TRACE
Useful scripts
1) If you have the whole trace but want to focus on the times taken by a particular operation, e.g. insert row or I/O, you can use the script findTimesByOperationsFromTrace.pl
Syntax to use:
findTimesByOperationsFromTrace.pl -operation "Read I/O" -file trc.flw -func "sqloReadBlocks enter" -func "sqloReadBlocks exit" or
findTimesByOperationsFromTrace.pl -operation "RowInsert" -file trc.flw -func "sqldRowInsert enter" -func "sqldRowInsert exit" etc.
2) If the version you are using does not have the -perfrep option, then you can use the analyzetrace script on a db2trc flw or fmt output to generate a profile output. The output looks like
Pid Lvl FuncName TTime(ms) HTime(ms) LTime(ms) AvgTime(ms) NCalls ERecnumHTime
Pid - Process id ( thread id ) of the traced process.
Lvl - The level of indentation of the function call, i.e. if it is 0 it is the outermost; if it is 1 it was probably called from the function at level 0.
TTime - Total time elapsed in the function
HTime - Highest time amongst ALL the calls to the function
LTime - Least time amongst ALL the calls to the function
AvgTime - Average time
NCalls - Number of calls to the function
ERecnumHTime - Entry record number of the highest-time call to this function. If you search for this record number in the flow output you will find that call.
Things you should not do with db2trc, especially on a production system
1) Never use the -f option to start tracing to a file. It can pretty much bring the whole system to a grinding halt if the system is large or busy. You should use it ONLY if asked specifically by support; there are exception situations, like a crash right after starting up the engine, where without the -f option support cannot really move forward. Never ever use this option for performance data collection: it has its own overhead and it skews the times so much that the whole trace becomes useless for debugging the performance issue.
2) On DPF installations, do not attempt to start db2trc on ALL members. It can cause tremendous slowness. Usually, if you do have to use db2trc on DPF systems, follow the rule of thumb of one co-ordinator node and 2 data nodes. There will always be exceptions to the rule, but use it on ALL members only with guidance from support.
Hope you find this entry useful and good enough for you to start using db2trc .. :-)
Please leave a comment if there are any questions or feedback
I recently had to work on a peculiar performance issue where a customer was testing a workload with and without SSL encryption and was seeing huge differences: for example, a query which runs in 1 second would take 20 seconds to complete when SSL was turned on. Since the solution was fairly simple, I thought I might as well put it out in a blog for anyone else who hits a similar issue.
I collected the following data for both types of workload run:
1) db2trc on -t -perfcount
2) db2trc on -i 512m -t ( for 30 seconds ) -- 3 times, 3 mins apart.
3) db2pd -stack all -- 3 times, 3 mins apart
When I looked into the perfcount output ( db2trc perffmt trc.perf trc.perffmt; cat trc.perffmt | sort -t"(" -k2nr | more ), I found that two function calls were standing out: sqlccSSLSocketRead and sqlccSSLSocketWrite.
When I looked at the other DB2 functions which do actual work, they were in the sub-millisecond range, which indicated that whatever slowness was happening was all within these 2 functions. The full trace gave a similar picture ( I ran db2trc perfrep trc.dmp trc.perf -g ). When I looked at the flow output ( db2trc flw trc.dmp trc.flw -t ) I found that there was a 90ms gap in one of the calls, and it had just called a callback function to do the ssl_read. A snippet of the db2trc output:
511916 0.293736869 | | | sqlcctcprecv data [probe 1]
The stacks immediately gave the clue to the root cause. A lot of agents were in a stack whose top frame was:
_global_lock_common
i.e. underneath the GSKit SSL read code the agents were serializing on the AIX malloc lock. The fix was to switch the instance to AIX's multi-heap bucket malloc:
export MALLOCOPTIONS=buckets,multiheap:4
After this setting, the performance was back to normal both with and without SSL.
Hope this helps anyone else who encounters this issue.
Please leave a comment if there are any questions or feedback
Recently, while working on a couple of issues related to upgrades to DB2 v10.5 FP7, I found that the root cause was the same in both, so I thought I might as well put this in a blog so that people are aware of it when they search for it on their favorite search engine.
Problem statement
-----------------------------
Typically, you will see an overall slowdown of the whole database after upgrading. The typical application profile will be:
1) Fairly high query throughput
2) High number of explicit or implicit rollbacks happening.
How can you figure out if you are hitting the problem
------------------------------------------------------------------------
1) Take stacks ( db2pd -stack all )
2) Go to the DIAGPATH and do a quick grep
ls *stack.txt | wc -l
grep -ic sqlrrbck_dps *stack.txt
grep -ic pdLogEvRecEx *stack.txt
3) If you see a large percentage of stacks showing these functions, you know you are hitting the problem.
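A quick way to turn those counts into a percentage ( a sketch; this variant counts the files that contain the function rather than the number of matches ):
total=$(ls *stack.txt | wc -l)
hits=$(grep -l sqlrrbck_dps *stack.txt | wc -l)
echo "$hits out of $total stacks show sqlrrbck_dps"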
Another way to figure it out is:
1) Take a trace
2) db2trc on -l 512m -t ; sleep 30; db2trc dmp trc.dmp; db2trc off
3) db2trc perfrep trc.dmp trc.perf -g -sort timeelapsed -order desc
4) The output should look like:
nCalls TotalElapsed AvgElapsed TotalSpent AvgSpent FunctionName
**As a side note: Elapsed time -- Total time elapsed in this function, Spent time -- Actual time spent in this function
What to do now that you have confirmed you are hitting the problem
----------------------------------------------------------------------------------------------
Issue the following commands:
db2pd -db sample -dmpevrec comp=SQLRA name=sqlraMED mask=0
However, please keep in mind that this change will NOT survive a recycle of the engine or a deactivation of the database. You will have to run it again after a recycle or after the database is activated again.
APAR IT14357 has been opened to track this issue.
Please leave a comment if there are any questions or feedback
Starting with DB2 v9.7, if you want to understand how a backup or restore is working ( especially if it's slow ), there is a feature which can be enabled by setting the registry variable DB2_BAR_STATS=YES ( recycle required ); it will print a statistics table in the db2diag.log after a backup completes.
Starting with v10.1 it is enabled by default, and after every backup / restore completes it will print the results in the db2diag.log.
The decode of each of the columns is:
BM#
- the number assigned to an individual Buffer Manipulator. BMs READ data from the database's tablespaces during a backup and place it into buffers, or WRITE data back out to the database from the buffers during a restore.
MC#
- the number assigned to an individual Media Controller. MC's WRITE buffers out to the target location on backup and READ buffer on restore.
Total
- The total amount of time spent by the process in seconds.
I/O
- The amount of time spent either reading or writing data. For the BM's this represents time reading data from tablespace, and filling the buffer. For MC it's time spent reading from buffer and sending it to the target destination.
MsgQ
- This is the amount of time we spend waiting to get a buffer. For BM's it's how long is spent waiting to get an empty buffer for filling. For MC's it's time spent waiting to get a full buffer in order to write out.
Wait Q
- Amount of time spent waiting on directives from the agent overseeing the whole backup.
Buffers
- The number of Buffers Processed by a particular BM or MC. A BM filled X number of buffers. An MC wrote out X number of buffers.
Mbytes
- The amount of data handled by a particular BM or MC in Mbytes. NOTE that the sum of all MC's written bytes is the amount of total data sent.
So, from this data you can easily tell where the backup or restore is spending its time, i.e. in the reading portion or the writing portion, and then focus on that part to find out what is wrong.
For example, look at the I/O times of the BMs and MCs to see which one is spending more time in I/O, and also look at the MsgQ values to find out if there is an inordinate amount of time spent waiting on buffers, which means the reader or writer is slowing down and the buffers are not being freed quickly enough.
This is a pretty useful statistic for understanding the inner workings of backup / restore, and it can be used to work out where the bottleneck is.
Here is a link to a blog written by Dale McInnis having a much more detailed explanation of the above output.
Hope you find this blog useful
I work on various performance issues ( single node, DPF, pureScale ) and I have found that most DBAs are still using the older snapshot way of collecting information. I don't have anything against snapshots, but after using the new MON infrastructure to tackle various perf issues and also to monitor normal activity, I thought maybe I should use my blog to encourage the use of this infrastructure. I don't think I can cover everything in one entry, so I will probably spread it across a few entries and dwell on how to use it for specific issues.
In this entry I will try and lay out what is out there and what I use commonly.
One thing I should point out upfront is that during hang situations ( unable to connect, etc. ) this is not a very useful tool and should not be used; db2pd is the tool of choice in hang conditions.
So, starting from some basics.
a) The MON infrastructure has been available since DB2 v9.7 GA and has been refined and enhanced ( and is continuously being enhanced ) in later versions.
b) It provides an easy SQL interface to DB2's real-time, in-memory monitoring data.
c) It has time-spent metrics and object metrics which can quickly narrow down the bottleneck for the user.
d) It is controlled by 3 DB config params ( MON_REQ_METRICS, MON_ACT_METRICS, MON_OBJ_METRICS ). Setting them to BASE is usually good enough.
e) It uses much less latching ( internal locking ) to get the information, thus making it easy to use very frequently on really busy systems without any side effects.
Personally, I've used it very effectively on extremely busy systems with literally no side effects and was able to resolve a lot of issues with it. I would like to share my experiences here so that others can benefit as well from this great infrastructure which is built into the engine itself.
Now, that the intro is out of the way, let's get our hands dirty .. :-)
Let's start with a hypothetical scenario, say:
You have upgraded from v9.5 to v9.7 or v10.1, all your benchmark queries are running slower and overall performance is struggling. You have checked the usual suspects, i.e. vmstat, iostat, db snapshots, but you are not able to really pinpoint an obvious issue.
In this case, how can the new MON interface help you ??
It can actually help you a lot !!
Let's start with a monreport query .. you must be thinking .. huh! what's that? This is something the DB2 developers created to help customers by providing a quick overview of the whole database engine using some pre-packaged MON queries. There is very good information about these in the Information Center at this link.
There are different kinds of reports in this module, but for a generic perf issue where there is overall slowness, I like to start with the dbsummary monreport. The syntax to run it is:
db2 connect to <dbname>
db2 "call monreport.dbsummary(60)" > dbsummary.txt
db2 terminate
The parameter passed tells dbsummary to run for 60 seconds and provide an overall view of the internals of the DB2 engine as to what happened during those 60 seconds.
This report gives you everything i.e.
a) Transaction throughput
b) Wait times within the engine << This is the part I am most interested in for perf issues
c) Section processing times -- in other words, apart from the waits, how much time does it take to execute SQLs in the engine
d) Bufferpool hit ratio
e) Package cache hit ratios
f) SQL throughput
Small snippet of the report:
So, in the above output it is clearly showing that the log flush ( i.e. commit ) is slow, which is driving most of the wait time within the engine. Maybe moving to a faster log disk can help here, or maybe grouping the commits.
You can also calculate the package cache hit ratio from this data. If you see it is very low, then increasing the package cache size might help.
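For example, a rough package cache hit ratio can also be pulled straight out of the MON data ( a sketch using MON_GET_WORKLOAD; request metrics must be enabled ):
db2 "select sum(pkg_cache_lookups) as lookups, sum(pkg_cache_inserts) as inserts, dec(1.0 - (sum(pkg_cache_inserts) * 1.0 / nullif(sum(pkg_cache_lookups),0)), 5, 4) as hit_ratio from table(mon_get_workload('',-2)) as w"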
As can be seen, the dbsummary report is a great starting point for a generic ( aka unknown ) performance problem, from which you can narrow down accordingly. You can use the other monreport procedures to help you do that as well; I use currentsql() and pkgcache() liberally to find any SQL statements which require attention.
I will continue further on specific situation .. keep an eye on this blog for further updates.
In my daily work, I get escalations from customers whose backups have suddenly started to run slow. Recently I was working on such a case where the customer was an SAP shop: they had been moving some tables out of a tablespace, and all of a sudden their backups started to take about 3x their normal time.
In debugging a backup issue, the best data to collect is generally db2trc, as that captures everything for all the threads involved in the backup and tells us exactly what is happening.
Generally, for any backup the following threads are created:
db2agent -- the main co-ordinator agent
db2bm -- Buffer Manipulator threads, which read tablespace data into buffers and, once a buffer is full, hand it over to a db2med thread
db2med -- Media threads, which take a full buffer and write it to wherever the backup destination is.
So, when a backup is slow, the time can go in two places:
1) db2bm taking time to read the data into the buffer, OR
2) db2med taking time to write a full buffer to the destination, which makes the db2bm wait for an empty buffer.
The read throughput and write throughput can be easily diagnosed using iostat tool.
An easy way to identify all the threads associated with the backup is to do the following:
db2pd -edus | grep db2med
You will get the name of the thread as:
db2med.123456.1
123456 is the edu id of the main co-ordinator agent.
You can now get the application handle by doing:
db2pd -age | grep 123456
The reason I wanted to get the application handle was to pass it to the db2trc command:
db2trc on -l 512m -apphdl <apphandle>
and get a trace of all the backup threads involved.
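Putting the two steps together ( a sketch; it assumes the coordinator EDU id is the second dot-separated field of the db2med EDU name, as in the example above, and reuses the -agents parsing shown earlier ):
cedu=$(db2pd -edus | grep db2med | head -1 | sed 's/.*db2med\.\([0-9]*\)\..*/\1/')
db2pd -agents | awk -v edu="$cedu" '/^0x/ { if( $4 == edu ) print "Backup apphandle: "$2 }'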
In this particular customer case, I collected the following:
1) db2trc on -l 512m -apphdl <apphandle>
2) db2pd -stack app=<apphandle>
Once I got the trace, I found that in fact both db2bm and db2med were idle, which was a bit puzzling, as at least one of them has to be doing something to move the backup forward. So I went back to the customer and asked for a longer trace ( the previous one was only 30 secs; I asked them to take it for around 2 mins ). The second trace was able to capture db2bm intermittently getting some data from a tablespace, and I found that the tablespace was a fairly large one. Also, from the stacks I observed that there was only 1 db2bm thread active, and most of the time it was waiting on the latch SQLO_LT_SQLB_POOL_CB__readLatch, as other agents were holding it to read in some data from time to time. The reason was that this was an ONLINE backup, the tablespace contained close to 12k tables, and other agents were reading data from tables in it during the online backup.
The way ONLINE backup works is that it tries to read contiguous chunks of data from the tablespace in one go ( while holding the latch in X mode ) to fill up the buffer, but obviously that was not happening here, as the backup was found to be waiting most of the time behind the other agents, which were holding this latch in S mode.
This info gave me a clue and pointed me to the possibility of a lot of fragmentation in the tablespace: there were not enough contiguous pages to fill up the buffer in one go, so the backup had to grab the X latch on the tablespace more frequently and do more read operations than if the tablespace were relatively fragmentation free.
I looked up the db2pd -tablespaces information and parsed it ( script attached ) to find that there was indeed huge fragmentation in the tablespace. The parse script showed:
As can be seen, the tablespace had a huge difference between Used Pages and HWM pages. After talking to the customer, it turned out that tables had indeed been moved out of this tablespace, and also that the tablespace was created in v9.1 ( the customer was currently on v9.7 ).
Since the tablespace was created in v9.1, the new v9.7 feature to reduce the HWM dynamically ( ALTER TABLESPACE <tbspname> ... REDUCE ) could not be used on it. After the customer reduced the fragmentation, the backup times went back to normal and everything was running fine.
I am providing two scripts here:
1) To automate data collection for slow backups ( if you open a PMR with this data, that would be awesome !! ).
2) To check the fragmentation of tablespaces ( you can verify the fragmentation using db2dart as well ).
Also, the following query can quickly identify whether a tablespace has RECLAIMABLE STORAGE enabled or not ( by default it is enabled if the tablespace was created in v9.7 ).
db2 "select tbsp_name, RECLAIMABLE_SPACE_ENABLED FROM TABLE(MON_GET_TABLESPACE('',-2)) AS t"
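On v9.7 and later you can also eyeball the fragmentation straight from MON_GET_TABLESPACE ( a sketch; a large gap between the page high water mark and the used pages is the symptom described above ):
db2 "select substr(tbsp_name,1,20) as tbsp_name, tbsp_used_pages, tbsp_page_top, tbsp_page_top - tbsp_used_pages as gap from table(mon_get_tablespace('',-2)) as t order by gap desc"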
NOTE: I should mention here that this problem would not have surfaced if it had been an offline backup. It is specific to online backup only.
Hope you find this post useful in debugging slow backup issues yourself.
Recently I had been working on a lot of ISAS / DPF systems where a very common complaint is the SQL1229 error encountered by the applications. In a few cases it was severe enough that a recycle was necessary.
I wanted to dwell a bit on this, as it can be a mystery why it happens and it can consume a lot of debugging effort by the DBA. I know it is not a real performance issue per se, but I've seen customers being affected by it a lot and not really knowing what to do. I'm hoping that with the information in this article they can take a few steps and at least collect the "right data" for the root cause analysis.
To start with, what does this error really mean ( in the context of DPF )? It essentially means that the engine was trying to do something and was not able to get a reply back in time from the remote side, which it deems a "system error", and thus it asks for the transaction to be rolled back and re-submitted by the application.
There can be many reasons for this error: in some cases DB2 is a victim of underlying issues in the hardware or network, and in some cases it can be a DB2 issue itself.
The most common error seen in the db2diag.log just before a SQL1229 error received by the application is:
followed by
2013-09-12-21.06.56.238933-420 E5032877E381 LEVEL: Error
The point to note is that it is the FCM receive thread ( db2fcmr ) which got this error, and an ETIMEDOUT error can happen for two reasons:
1) Connection timed out, OR
2) Active transmission timed out.
Generally, the reason is invariably "Active transmission timed out".
Now, based on this error, DB2 has to act on it and assume that the remote node is not responding. So what it does is a "node recovery" for the remote node on the local node, and it sends a message to the remote node to do the same. In the above snippet, Node#0 got the ETIMEDOUT from Node#17, so it will do a "node recovery" for Node#17 on Node#0 and then send a message across to Node#17 to do a node recovery for Node#0. Node recovery isn't a "full blown" recovery as we know it; it essentially recycles the socket connections to the node and tries to re-establish the connection using new sockets ( if interested, you can observe this phenomenon in netstat -an outputs ). There's one quirk, however: if it so happens that the catalog node gets the error, the remote node, in order to complete the node recovery, will have to deactivate / re-activate the database. Usually the node recovery happens very fast ( in millisecs ), but in the cases where it doesn't, one of the following can hold true:
1) It is not able to send the message across to the remote node quickly enough to do the node recovery on the remote node ( network issue ), OR
2) Something is happening on the remote node which makes it unable to process the message quickly enough ( maybe workload, low memory, CPU etc. ), OR
3) It is the exception case where it has to deactivate/activate the db in order to complete node recovery and is stuck in the deactivate phase.
There can be another cause for SQL1229, where a message such as the following appears in the db2diag.log:
2008-07-28-09.31.23.006067-240 I22590117A418 LEVEL: Error
This essentially means that DB2 had been trying to contact the remote node for nearly 500 seconds but was unable to do so. However, if you notice, there is no OS error this time around, which indicates that the network packets are probably going through but the other side is not able to process the messages sent by this node quickly enough.
Ok, I have explained what can cause the SQL1229 issues, but what can really be done to address them ? Unfortunately, there's no simple solution, as there are a lot of moving parts and the problem can be anywhere. The best you can do is to collect the "right data at the right time" to help the support engineers quickly address the problem you are having. If you see that your instance is hung right after the SQL1229 errors, chances are that you have encountered the case where a "deactivate/activate" is necessary to complete the node recovery. In that case, issue a "db2 force applications all" from the co-ordinator node. There's a high chance that the deactivate ( which was stuck ) completes, the node recovery then completes, and the engine is back online. If that doesn't help either, unfortunately the quickest way out is a "recycle" of the instance, but please make sure you collect the data for root cause analysis.
What can you collect:
For the first condition ( where the OS reports ETIMEDOUT or ECONNRESET ) it is next to impossible to anticipate the error, but you can be prepared for a recurrence. Unfortunately, since it is the OS which reports the error, you have to have a TCP trace turned on continuously and stop it when the message appears in the db2diag.log. You will have to get the respective TCP trace command for your OS ( for AIX it's iptrace, for Linux it's tcpdump ), and you can limit the TCP trace to just the FCM ports. In order to stop it at the right moment, you can use my tool "watchThreshold.pl" ( I had uploaded it for one of my earlier posts ) with the -watchFile and -command options to issue the stop-TCP-trace command when it encounters the message in the db2diag.log file.
If the instance is "hung", then I would suggest taking stacks on the "hung" node if possible ( if you can identify from the db2diag.log which node is stuck in the "deactivate" phase ); otherwise, I would suggest taking the following data:
rah "db2pd -alldbp -stack all"
sleep 180
rah "db2pd -alldbp -stack all"
This will collect the info of the "stuck" thread and provide clues as to why its stuck in deactivate.
If you get the timeout message as pasted above in the db2diag.log, you can set the following registry variable, which will trigger data collection automatically on future occurrences:
db2set DB2_FCM_SETTINGS=FCM_PERCENT_OF_MAX_TIMEOUT:50
This will trigger data collection at 50% of the timeout value, at 75% of the timeout value and at 87% of the timeout value.
This is available from DB2 v9.7 FP5 onwards. It helps to find out what was happening in DB2 when the timeout happened ( was it too busy ? If so, why ? etc. ).
The db2diag.log is very helpful in stitching together the scenario when the issue happened, so it is also an essential part of the puzzle.
I hope this information will help you understand the SQL1229 error and what can you do to help do the RCA.
I deal with PMR escalations in my daily job, and I can't count how many times the most common performance problem I see is that the system has slowed down and a CPU spike was observed during the slowdown. In such cases, data collection at the right time, i.e. while the CPU is spiked up, is the most important thing. If you miss the data collection during the spike but still go ahead and collect, it is not going to help much, as we would not know what drove the CPU up.
Now, this is where things get a bit complicated: the data collection depends on what "kind" of CPU spike is observed, i.e. is it a %usr CPU spike or a %sys CPU spike, and the data to collect is different in each case. When %usr CPU spikes up, it means the system is spending time executing application code, which in most cases is DB2; when %sys CPU spikes up, it means the system is spending time in kernel code and is not able to execute application code.
I generally follow a simple principle: when %sys CPU is high, I want OS internal data to see which call is driving up the CPU, and I also collect some DB2 information, i.e. stacks ( db2pd -stack all ) and db2pd -latches ( to see if there is any latch contention ). For OS data, I generally collect:
AIX:
1) OS trace -- perfpmr.sh -x trace.sh 30 ( or separately start the trace command as user root ).
Solaris:
1) dtrace script to see which system call is being called the most ( you can get a lot of samples using Google )
Linux:
1) systemTap script to find out which system call is being called the most OR
2) strace -c of the db2sysc pid ( only if we are fairly certain if DB2 is driving it ).
If, say, DB2 is driving the %sys CPU high, 99% of the time it is latch contention within DB2 code. So we have to find out which code path is driving that contention up. The stacks help pinpoint that, and the db2pd -latches output also helps in that respect.
From the OS trace ( AIX ), a curt report tells which kind of system call is driving up the %sys CPU, and if it is latch contention, then the splat report from the same trace can show which address has the highest contention, which can be matched against the db2pd -latches output.
I know it gets a bit complicated ... :-) but the main thing is to get the "right data at the right time", otherwise the data collection is useless and won't help get to the root cause of the issue.
If %usr is high, I generally concentrate on what kind of queries were being run on the DB. With the new MON interface of v9.7 it becomes easy: you can use the monreport module to get the highest CPU consumers fairly quickly, or you can write your own queries against the MON_GET_PKG_CACHE_STMT table function and the other MON interfaces to find the top CPU consumers. If you are not on v9.7, then it becomes a bit tougher; there are some sysibmadm routines such as top_running_sql which can be used, and you can use the application snapshot and dynamic SQL snapshot to narrow down those queries and see if tuning them helps reduce the CPU consumption.
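As an example, a quick query to surface the top CPU consumers from the package cache ( a sketch for v9.7+ with request metrics enabled; 'D' restricts it to dynamic SQL ):
db2 "select total_cpu_time, num_executions, substr(stmt_text,1,80) as stmt from table(mon_get_pkg_cache_stmt('D',null,null,-2)) as p order by total_cpu_time desc fetch first 10 rows only"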
This was a quick note on what to do when the CPU spikes up .. hope you find it helpful .. :-)
There are two more options also available, which are useful to detect CLP or upgrade issues.
An example usage of this would be: say a customer is encountering perf issues on an ISAS system with 32 nodes spread across 4 physical servers, and notices that this is accompanied by CPU spikes where, for example, the usage goes to 90%. The -cpu option can be used to trigger data capture a little before it hits 90%, so that DB2 support gets some good data on what was happening on the system when it was spiking up. We can deploy this script on all the physical servers by invoking it from just one physical server using the following command:
e.g. in this case db2pd -db sample -tcbstat 2 15
Address            TbspaceID TableID PartID MasterTbs MasterTab TableName SchemaNm ObjClass DataSize LfSize LobSize XMLSize
0x0700000048F553D8 2         15      n/a    2         15        STAFF     RSARKAR  Perm     1        0      0       0
Address            TableName SchemaNm Scans UDI RTSUDI PgReorgs NoChgUpdts Reads FscrUpdates Inserts Updates Deletes OvFlReads OvFlCrtes PgDictsCrt CCLogReads StoreBytes BytesSaved
0x0700000048F553D8 STAFF     RSARKAR  3     35  35     0        0          70    0           0       35      0       0         0         0          0          -          -
DB2 tracing tips and tricks
Usage: db2trc [facility]
db2 - DB2 instance (default)
das - DB2 Administration Server instance
cf - CF Server
cfcli - CF Client
change - Change trace options
clear - Clear the trace buffer
dump - Generate trace dump file
info - Information
off - Disable the trace facility
on - Enable the trace facility
stop - Stop tracing
Usage: db2trc
flow - Generate control flow diagram
format - Format (includes CLI/ODBC driver traces)
formattedFlow - Generate global flow diagram
info - Information
perffmt - Format a performance trace
perfrep - Format a performance report
Name  Address            Id       Size(Kb) Key        DBP Type Unrsv(Kb) Used(Kb) HWM(Kb) Cmt(Kb) Uncmt(Kb) CmtRt(Kb) DcmtRt(Kb) HoldRt(Kb) Sngl
Trace 0x0000000000000000 14057472 33355    0x45EE9E74 0   -1   0         33355    0       33355   0         0         0          0          0
Overall performance is slow
1204594 (1288 sec, 694681000 nanosec) sqlak_callbDrdaOutput
2815676 (1225 sec, 149336000 nanosec) csmGetCursorBuf
560574 (1223 sec, 174019000 nanosec) csmDriveFetch
560525 (1222 sec, 257363000 nanosec) sqljrDrdaArFetch
560486 (1200 sec, 247509000 nanosec) sqljrRecv
2414805 (1185 sec, 162567000 nanosec) clientboDDFMode1Conversions
2410385 (1183 sec, 317071000 nanosec) clientboWriteBytes
560425 (1173 sec, 235236000 nanosec) sqljsParse
560158 (1171 sec, 596516000 nanosec) sqljsParseRdbAccessed
559503 (1170 sec, 921528000 nanosec) sqljs_ddm_cntqry
560041 (1170 sec, 81641000 nanosec) sqlrr_fetch
560021 (1158 sec, 474592000 nanosec) sqlrr_process_fetch_request
1765631 (1135 sec, 986047000 nanosec) sqlribnoCbDrda
2953145 (1080 sec, 720990000 nanosec) sqljsDrdaAsCBPutLongDataQueryCheck
3490741 (1068 sec, 659227000 nanosec) sqlriWriteCsoIntoBuffer
3490211 (1067 sec, 508322000 nanosec) sqlrimso
3485339 (1054 sec, 156969000 nanosec) sqlriglf
3486049 (1051 sec, 759578000 nanosec) sqldFetchLFD
3485674 (1047 sec, 24618000 nanosec) sqldx_lob_read
3484393 (1039 sec, 146770000 nanosec) sqldxReadLob
3486801 (1031 sec, 887696000 nanosec) sqldxReadLobData
3483545 (1010 sec, 669271000 nanosec) sqlbDirectRead
3483541 (1009 sec, 188149000 nanosec) sqlbDMSDirectRead
3485667 (1007 sec, 485205000 nanosec) sqldx_diskread
3485702 (957 sec, 532408000 nanosec) sqlbDMSMapAndRead
3479676 (933 sec, 916670000 nanosec) sqlbReadBlocks
3470448 (926 sec, 439074000 nanosec) sqloReadBlocks
8182 (907 sec, 85306000 nanosec) sqloCSemP
8180 (906 sec, 780554000 nanosec) sqlbpfRemoveFromQ
A particular query is running slow
Trace version : 7.0
Platform : Linux/X8664
Build level : s151221
maxBufferSize : 33554432 bytes (32 MB)
auxBufferSize : 0 bytes (0 MB)
allocationCount : 2
DB2TRCD pid : 0
Trace destination :
numSuspended : 0
Trace starting time : 2016-05-24-19.02.51.280774-240
Allow buffer to wrap : yes
Mask : *.*.*.*.*
Timestamps : enabled
PID.TID mask : all
Fixed data mask #1 : all
Fixed data mask #2 : all
Max system errors : infinite
Treat this rc as sys err: none
Member mask : none
Application handle mask : 272 << This will show you which apphandles it's tracing
Application ID mask : none
What if you find turning on db2trc is affecting your system performance adversely
Read I/O times between 1 and 5: 661 ( 30.1826% ) Total time(ms) = 928.960 ( 2.3804% )
Read I/O times between 5 and 9: 282 ( 12.8767% ) Total time(ms) = 2113.160 ( 5.4148% )
Read I/O times between 10 and 14: 249 ( 11.3699% ) Total time(ms) = 3093.450 ( 7.9266% )
Read I/O times between 15 and 19: 239 ( 10.9132% ) Total time(ms) = 4131.900 ( 10.5876% )
Read I/O times between 20 and 24: 182 ( 8.3105% ) Total time(ms) = 4082.750 ( 10.4616% )
Read I/O times between 25 and 49: 470 ( 21.4612% ) Total time(ms) = 16063.380 ( 41.1607% )
Read I/O times between 50 and 74: 71 ( 3.2420% ) Total time(ms) = 4231.670 ( 10.8432% )
Read I/O times between 75 and 99: 13 ( 0.5936% ) Total time(ms) = 1143.200 ( 2.9293% )
Read I/O times between 100 and 199: 20 ( 0.9132% ) Total time(ms) = 2628.130 ( 6.7343% )
Read I/O times between 200 and 299: 3 ( 0.1370% ) Total time(ms) = 609.390 ( 1.5615% )
Total number of Reads: 2190, Total time spent in Read I/O: 39025.990 (ms), Average time per Read I/O: 17.820 (ms)
add DB2.SQLJC.sqljcReceive.1178
add sqljcReceive.exit
add sqljsParse.entry
add sqljsParse.exit
add sqlrr_execute.entry
add sqlrr_execute.exit
add sqlrr_activity_exe_start.entry
add sqlrr_activity_exe_start.exit
add sqlriSectInvoke.entry
add sqlriSectInvoke.exit
add sqljcSend.entry
add DB2.SQLJC.sqljcSend.1177
add sqljcSend.exit
add PVM::run.entry
add PVM::run.exit
add sqlra_get_section.entry
add sqlra_get_section.exit
add sqlra_load_pkg.entry
add sqlra_load_pkg.exit
add sqlra_get_var.entry
add sqlra_get_var.exit
add sqlnn_cmpl.entry
add sqlnn_cmpl.exit
add DB2.SQLR.sqlrr_rds_common_pre1.200
add sqlra_search_for_pkg.entry
add sqlra_search_for_pkg.exit
add sqlra_open_pkg.entry
add sqlra_open_pkg.exit
add sqlra_pkg_lock.entry
add sqlra_pkg_lock.exit
add sqlra_sqlC_get_sibling.entry
add sqlra_sqlC_get_sibling.exit
add sqlra_sqlC_get_stmt.entry
add sqlra_sqlC_get_stmt.exit
add sqlricall.entry
add sqlricall.exit
----------  ---  ------------------------------  ----------  ----------  ----------  -----------  ------  ------------
7771(Tid = 46913080518976, Node = 0)
            0    sqloWaitEDUWaitPost               3999.644     999.952     999.875      999.911       4          1751
            0    ossGetAvgCPULoad                     0.432       0.126       0.044        0.086        5          2337
            0    sqloGetEnvInternal                   0.015       0.005       0.000        0.002       10             1
            0    sqlogmt                              0.009       0.003       0.001        0.002        5             3
            0    sqeApplication::UpdatePITNodeI       0.006       0.002       0.001        0.001        5          1138
            0    pdUpdateBaselinesIfNecessary         0.005       0.001       0.001        0.001        5            12
            0    sqleRollupGetCachedCEAL              0.002       0.001       0.000        0.000        5           607
SSL performance issue on AIX
2957916 (11690 sec, 206149111 nanosec) sqlccSSLSocketRead
1473046 (4602 sec, 347173824 nanosec) sqlccSSLSocketWrite
511920 0.293737443 | | | | tcprecv entry [eduid 28338 eduname db2agent]
511937 0.293747148 | | | | | sqlogetdatetime entry [eduid 28338 eduname db2agent]
511942 0.293751369 | | | | | sqlogetdatetime exit
516231 0.295472654 | | | | | sqlogetdatetime entry [eduid 28338 eduname db2agent]
516249 0.295477816 | | | | | sqlogetdatetime exit
516256 0.295479347 | | | | | sqlccSSLSocketRead entry [eduid 28338 eduname db2agent]
516264 0.295480910 | | | | | | sqlccSSLCallbackRead entry [eduid 28338 eduname db2agent]
516267 0.295481160 | | | | | | sqlccSSLCallbackRead data [probe 991]
516275 0.295484500 | | | | | | sqlccSSLCallbackRead exit [rc = 5]
516279 0.295486451 | | | | | | sqlccSSLCallbackRead entry [eduid 28338 eduname db2agent]
516281 0.295486660 | | | | | | sqlccSSLCallbackRead data [probe 991]
516288 0.295489146 | | | | | | sqlccSSLCallbackRead exit [rc = 0x00000020 = 32]
817092 0.385796982 | | | | | sqlccSSLSocketRead exit >> gets out of this call after 90ms.
817104 0.385798310 | | | | tcprecv exit
malloc_y
malloc_common@AF104_87
_Fancy_malloc__FUl
init_pres__FUl
__vn__FUl
ReadCompressedMsg_BlockCipher__13SSLV3ProtocolFi
ReadCompressedMsg__13SSLV3ProtocolFi
ReadMsg__13SSLV3ProtocolFi
SSL_Read__13SSLV3ProtocolFPviT2
StateMachine__13SSLV3ProtocolFiPvT1Pb
Receive__13SSLV3ProtocolFPvi
Receive__18SSLProtocolManagerFPvi
gsk_secure_soc_read
sqlccSSLSocketRead__FP21SQLCC_TCPCONNHANDLE_TP12SQLCC_COND_TUsUiPciPi
tcprecv__FP17SQLCC_COMHANDLE_TiPcT2UsT5P21SQLCC_TCPCONNHANDLE_TP12SQLCC_COND_TUiP
export MALLOCBUCKETS=number_of_buckets:128,bucket_sizing_factor:64,blocks_per_bucket:1024
db2set DB2ENVLIST="MALLOCTYPE MALLOCOPTIONS"
db2stop
db2start
Upgrade to DB2 v10.5 FP7 woes
2 0.035750000 0.017875000 0.000010000 0.000005000 sqljsParse
4 0.035736000 0.008934000 0.000003000 0.000000750 sqljsParseRdbAccessed
1 0.035261000 0.035261000 0.000001000 0.000001000 sqljs_ddm_rdbrllbck
1 0.035259000 0.035259000 0.000002000 0.000002000 sqlrr_rollback
1 0.035176000 0.035176000 0.000004000 0.000004000 sqlrrbck
1 0.035152000 0.035152000 0.000003000 0.000003000 sqlrrbck_dps
3 0.035129000 0.011709667 0.000005000 0.000001667 pdLogEvRecEx
1 0.035121000 0.035121000 0.035112000 0.035112000 pdLogInternal << The elapsed time and spent time are equal for this function, indicating this is the bottleneck
db2pd -db sample -dmpevrec comp=SQLRA name=sqlraLOW mask=0
Backup statistics
DB2 Performance -- Leveraging the MON interface
Debug Slow Online Backup in DB2
Mystery behind SQL1229 error on DPF systems
PID : 5038184 TID : 1286 PROC : db2sysc 46
INSTANCE: edwinp01 NODE : 046
EDUID : 1286 EDUNAME: db2fcms 46
FUNCTION: DB2 UDB, fast comm manager, sqkfNetworkServices::detectNodeFailure, probe:15
DATA #1 :
Detected failure for node 41 - time elapsed: 501; max timeout: 500; link state: 4
What to do when CPU spikes up
Range Partitioned tables and long compile times
How to track and collect data for a long running query
Migration to v97 from v95 -- Experiences
Threshold data capture for DB2 Performance issues
watchThreshold.pl [options]
-idle
-user
-sys
-qrun
-qblock
-qoperator
-cpuoperator
-condition
-avm
-po
-cachesize
-database
-string
-matchtype [1|2] ( 1 = match one string at a time ( or condition ), 2 = match all strings provided ( and condition ). Default: 1 )
-connecttype
This is to reduce the false positives as much as possible. This is applicable for the -vmstat and -pkgcache options only. ( Default: 5 seconds )
-command
-before
-scriptname