[ID 976714.1]
Modified: 16-OCT-2011     Type: HOWTO     Status: MODERATED
In this Document
Goal
Solution
Issue Description
Diagnostics
Interpretation
References
This document is being delivered to you via Oracle Support's Rapid Visibility (RaV) process and therefore has not been subject to an independent technical review.
Applies to:
Oracle Server - Enterprise Edition - Version: 10.2.0.1 to 10.2.0.5.0 - Release: 10.2 to 10.2
Information in this document applies to any platform.
***Checked for relevance on 03-Aug-2011***
Goal
Explain the steps necessary and the diagnostics to collect for a hang situation with messages such as "PMON failed to acquire latch, see PMON dump"
in the alert log.
Solution
Issue Description
This issue occurs because the PMON process is unable to get a resource for a fixed period of time. This can be a serious issue and can hang the database, since PMON will be unable to perform its normal maintenance tasks; if the blocker is not freed, database activity will halt.
Diagnostics
In these cases it is necessary to collect the following diagnostics:
- Current RDA information. This provides the alert log, which is useful to identify the time of the messages and to determine whether there were any other errors at that time which might explain why PMON is unable to continue. An up-to-date RDA also provides a great deal of additional information about the database configuration and performance metrics that may give useful background to the problem.
- Systemstate and hanganalyze dumps taken at the time of the problem (see the sketch after this list).
- AWR reports from immediately before, during and after the problem.
- PMON trace file from the time of the problem.
See the following note on Metalink for the full collection procedure:
Note:452358.1 Database Hangs: What to collect for support.
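As a minimal sketch only, the systemstate and hanganalyze dumps can be taken with oradebug from a SYSDBA session; the exact levels and repetition to use are described in Note:452358.1, and the commands below simply illustrate the idea:
$ sqlplus -prelim / as sysdba        (-prelim allows attaching even if the instance appears hung)
SQL> oradebug setmypid
SQL> oradebug unlimit
SQL> oradebug hanganalyze 3
SQL> oradebug dump systemstate 266
(wait about a minute, then repeat both dumps so that process movement, or the lack of it, can be compared)
SQL> oradebug hanganalyze 3
SQL> oradebug dump systemstate 266
SQL> oradebug tracefile_name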
Interpretation
- PMON Trace
This file should be used as the key starting point for the analysis of the problem. It shows which resource PMON is unable to get and what is likely to be holding it. When used in conjunction with systemstates taken at the time of the problem, the cause of the problem can be determined and the reason for the holder keeping the resource for so long established.
Example trace:
*** SESSION ID:(115.1) 2009-12-09 06:37:09.015
PMON unable to acquire latch 38ef52390 Child library cache level=5 child#=29
Location from where latch is held: kgldtld: 2child:
Context saved from call: 3305698714
state=busy, wlstate=free
waiters [orapid (seconds since: put on list, posted, alive check)]:
49 (174, 1260337028, 174)
29 (174, 1260337028, 174)
27 (174, 1260337028, 174)
70 (171, 1260337028, 171)
9 (134, 1260337028, 134)
waiter count=5
gotten 1905490 times wait, failed first 80 sleeps 0
gotten 18325 times nowait, failed: 4
possible holder pid = 26 spid=2024
In this example, the problem is that a "Child library cache" latch cannot be taken by the PMON process. The possible holder process is shown as:
possible holder pid = 26 spid=2024
With an accompanying systemstate, this process can be found and its activity monitored.
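If the instance is still responsive enough to accept queries, the holder reported in the trace can also be cross-checked against the dynamic performance views. The pid values (26 and 37) below are simply the ones from this example and would differ on another system:
SQL> -- list current latch holders; LADDR can be matched against the latch address in the PMON trace
SQL> select pid, sid, laddr, name from v$latchholder;
SQL> -- map the Oracle pids named in the trace to their OS processes
SQL> select pid, spid, program from v$process where pid in (26, 37);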
The second section of the PMON trace dumps part of the process state for the holding process, including a short stack:
----------------------------------------
SO: 443803fb8, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=26, calls cur/top: 4439f7178/443845450, flag: (0) -
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 549 0 4
last post received-location: kslpsr
last process to post me: 403c01830 1 6
last post sent: 0 0 24
last post sent-location: ksasnd
last process posted by me: 403c01830 1 6
(latch info) wait_event=0 bits=20
Location from where call was made: kgldtld: 2child:
Context saved from call: 3305698714
waiting for 38ef524d0 Child library cache level=5 child#=27
Location from where latch is held: kglpin:
Context saved from call: 0
state=busy, wlstate=free
waiters [orapid (seconds since: put on list, posted, alive check)]:
58 (174, 1260337028, 174)
26 (174, 1260337028, 174)
46 (174, 1260337028, 174)
waiter count=3
gotten 1098798 times wait, failed first 71 sleeps 13
gotten 19775 times nowait, failed: 918
possible holder pid = 37 spid=16089
on wait list for 38ef524d0
holding (efd=16) 38ef52390 Child library cache level=5 child#=29
Location from where latch is held: kgldtld: 2child:
Context saved from call: 3305698714
state=busy, wlstate=free
waiters [orapid (seconds since: put on list, posted, alive check)]:
49 (174, 1260337028, 174)
29 (174, 1260337028, 174)
27 (174, 1260337028, 174)
70 (171, 1260337028, 171)
9 (134, 1260337028, 134)
waiter count=5
Process Group: DEFAULT, pseudo proc: 390f99190
O/S info: user: oracle, term: UNKNOWN, ospid: 2024
OSD pid info: Unix process pid: 2024, image: oracle@ipaddress
Short stack dump:
ksdxfstk()+36<-ksdxcb()+2452<-sspuser()+176<-sigacthandler()+44
<-__systemcall()+52<-semop()+24<-sskgpwwait()+500<-kslges()+1188
<-kgldtld()+692<-kqllod()+3392<-kglobld()+992<-kglobpn()+1560
<-kglpim()+296<-kglpin()+892<-kglgob()+404<-kktget()+780
<-kxtiget()+1144<-kkmdrvend()+92<-kkmdrv()+80<-opiSem()+2136
<-opiDeferredSem()+404<-opitca()+560<-kksFullTypeCheck()+8
<-rpiswu2()+500<-kksSetBindType()+4608<-kksfbc()+5780
<-opiexe()+2404<-kpoal8()+1912<-opiodr()+1548<-ttcpip()+1284
<-opitsk()+1432<-opiino()+1128<-opiodr()+1548<-opidrv()+896
<-sou2o()+80<-opimai_real()+124<-main()+152<-_start()+380
From this process information the holder is holding:
holding (efd=16) 38ef52390 Child library cache level=5 child#=29
(this is what is blocking PMON). However, this session (pid 26) is itself waiting for:
waiting for 38ef524d0 Child library cache level=5 child#=27
and the possible holder of this child#=27 latch is:
possible holder pid = 37 spid=16089
This means that the holder that is blocking PMON is being blocked by another session (pid 37).
Thus systemstates are required to see what this process is doing, since the PMON trace does not contain all the information required to drill down to the root cause.
Note: If the holder blocking PMON (pid 26) had been active on CPU, then the PMON trace alone might have been sufficient to diagnose the issue. However, in order to determine whether there is any process movement, multiple systemstates are required, so it is always prudent to collect them so as to have all the required information.
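To follow the chain to the root blocker, the systemstate can be searched for the process state object of the ultimate holder, or, if queries are still possible, the session attached to that OS process can be examined. The trace file name below is a placeholder, and the pid/spid values (37, 16089) reuse this example:
$ grep -n "(process) Oracle pid=37" systemstate_trace.trc
SQL> select s.sid, s.serial#, s.sql_id, s.event, s.seconds_in_wait
       from v$session s, v$process p
      where s.paddr = p.addr
        and p.spid = '16089';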
Additionally, you may be able to determine some more information just from the PMON trace. The fact that PMON has detected a long wait and flagged it with a trace and an alert log entry is a good indication that something abnormal is occurring and that there is a problem of some sort with the blocker. In this case PMON is waiting for a 'Child library cache' latch, so it follows that the activity involved relates to parsing or to storing a cursor (or similar) in the library cache area of the shared pool. The holder is holding this resource, so it is worth asking whether anything was going on at the time of the problem that could cause a parse or similar activity to take a long time. This sort of analysis may assist with finding a potential cause.
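Assuming AWR data is available for the period in question, one illustrative way to look for unusual library cache activity around the time of the problem is to compare the cumulative library cache statistics across the snapshots surrounding it; the query below is only a sketch of that idea:
SQL> -- the counters are cumulative, so compare how they change between
SQL> -- the snapshots immediately before and after the hang
SQL> select sn.snap_id, sn.end_interval_time, lc.namespace,
            lc.gets, lc.pins, lc.reloads, lc.invalidations
       from dba_hist_librarycache lc, dba_hist_snapshot sn
      where lc.snap_id = sn.snap_id
        and lc.dbid = sn.dbid
        and lc.instance_number = sn.instance_number
        and lc.namespace in ('SQL AREA', 'TABLE/PROCEDURE')
      order by sn.snap_id, lc.namespace;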
- Known Bugs:
- Note:468740.1 "Pmon Failed To Acquire Latch" Messages in Alert Log - Database Hung
- Note:4632780.8 Bug 4632780 - PMON "failed to acquire latch" during shutdown
- Note:8502963.8 Bug 8502963 - PMON fails to acquire latch and publish connection load balancing
References
NOTE:452358.1 - How to Collect Diagnostics for Database Hanging Issues
NOTE:278316.1 - Troubleshooting: "WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK!"