Tuning CPU 100% in Oracle 11g rac-20220215
导读:一个检验DBA通过AWR,ASH,SQL,执行计划,hang sql,10046事件等等工具串联起来解决生产故障的case。随着分析深入了解fixed table原理,绕过BUG解决问题,并解决了偶尔可执行sql语句的优化思路。隆重推荐细读。
场景:FIXED TABLE(X$KSOLSFTS)数据量大,触发的Oracle Bug,引起的cpu高负载(100%)。因业务系统不能立即应用Patch,故采取重启instance方式来重新填充FIXED TABLE(X$KSOLSFTS)数据,进而减少FIXED TABLE(X$KSOLSFTS)数据优化案例。
1.问题描述
2月14日夜里00点左右。接到业务同事的优化请求。数仓系统中发生cpu负载100%情况,导致正常的业务跑批无法完成。 发生100%cpu高负载是00:00:00开始,00:15:00处理,通过kill session(超过24小时的session的 foglight监控processes)减缓了cpu高负载的状况,勉强把跑批任务完成。具体原因上班继续分析。 2月14日上班期间继续分析这个cpu高负载问题,这个问题还没有根本解决。其实,2月9日已经开始关注这个问题了, 但是误以为是偶发现象,没有立即做优化处理,导致2月14的故障。深刻教训,dba发现问题立即解决,误等,切记误等!!
2.问题分析
2.1 环境
Oracle 11.2.0.4.0 rac on Aix6.1
2.2 事后分析:由面(os,awr)到点(ash,sql,rows)
2.2.1 从面分析(OS,AWR)
OS层面:从2月13晚10点左右开始高cpu负载,2月14下午14左右彻底解决(2月14日12点左右重启instance)。 历史数据来源foglight监控。 # OS层面分析的思路是:提供一个大方向上的影响范围(时间,设备)。同时咨询业务运维人员此期间未做任何变更。 从cpu高负载的时间段分析AWR。 AWR层面:从AWR分析,发现CPU高负载和OS层看到的CPU高负载相吻合。并且可以找出可疑的SQL。 # 2月14日0点至1点这个1个小时内节点2,从AWR中分析出DB负载为189%,且OS级CPU负载接近98.9% Host Name Platform CPUs Cores Sockets Memory(GB) ---------------- -------------------------------- ---- ----- ------- ---------- bmcdb2 AIX-Based Systems (64-bit) 32 8 124.50 Snap Id Snap Time Sessions Curs/Sess Instances --------- ------------------- -------- --------- --------- Begin Snap: 68748 14-Feb-22 00:00:07 162 .9 2 End Snap: 68749 14-Feb-22 01:00:26 300 .9 2 Elapsed: 60.33 (mins) DB Time: 3,662.68 (mins) # 从awr查看db负载情况。负载=(DB Time/Elapsed*CPUs)*100%=189%。注:不包含DB后台进程等待事件 Load Profile Per Second Per Transaction Per Exec Per Call ~~~~~~~~~~~~~~~ --------------- --------------- --------- --------- DB Time(s): 60.7 380.2 9.75 7.01 DB CPU(s): 7.8 49.0 1.26 0.90 Redo size (bytes): 572,447.6 3,585,022.2 Logical read (blocks): 28,460.0 178,234.1 Block changes: 787.2 4,929.8 Physical read (blocks): 5,351.1 33,511.6 Physical write (blocks): 548.2 3,433.3 Read IO requests: 56.5 353.6 Write IO requests: 40.6 254.4 Read IO (MB): 41.8 261.8 Write IO (MB): 4.3 26.8 RAC GC blocks received: 17.7 110.7 RAC GC blocks served: 5.9 37.0 User calls: 8.7 54.2 Parses (SQL): 3.5 22.2 Hard parses (SQL): 0.1 0.3 SQL Work Area (MB): 0.9 5.5 Logons: 0.9 5.3 Executes (SQL): 6.2 39.0 Rollbacks: 0.1 0.4 Transactions: 0.2 Host CPU ~~~~~~~~ Load Average CPUs Cores Sockets Begin End %User %System %WIO %Idle ----- ----- ------- --------- --------- --------- --------- --------- --------- 32 8 52.47 52.44 98.9 1.1 0.0 0.0 Instance CPU ~~~~~~~~~~~~ % of total CPU for Instance: 24.6 % of busy CPU for Instance: 24.6 %DB time waiting for CPU - Resource Mgr: 28.7 <=cpu使用与RM竞争严重 # 理论上:%Idle=0较低,% of busy CPU=24.6较高情况下,存在cpu瓶颈。这个cpu瓶颈也与OS层监控展示的cpu瓶颈相吻合。 # 理论上:%DB time waiting for CPU - Resource Mgr 产生等待值时, 表示resource manager限制某个用户和会话使用CPU,而产生的等待。 一般会产生resmgr:cpu quantum等待事件,如果产生该等待事件需要和RSRC_MGR的值结合起来判断。 # %DB time waiting for CPU - Resource Mgr 一般解决办法是调整RM的限制,但有时候RM限制是正确的, 只是session使用cpu不合理。我们这个案例就是这样的。 # resmgr:cpu quantum 等待事件并不是很多,数据库也未出现挂起的情况。所以这里分析RM受限是正确的, 那产生%DB time waiting for CPU - Resource Mgr等待,只能推测是cpu使用不合理了。 接下来我们看看cpu资源严重的sql排名第一,如下: # 分析awr找到一个sql_id为70ddwj7363cd7的语句比较可疑。这个语句每执行1次大约5566秒,收集1个小时的awr才3600秒。 可以说明这个session连续执行了1个多小时都未执行成功,一直使用cpu,所以才有这个session1个小时内竟然执行5566秒。 CPU CPU per Elapsed Time (s) Executions Exec (s) %Total Time (s) %CPU %IO SQL Id ---------- ------------ ---------- ------ ---------- ------ ------ ------------- 22,264.3 4 5,566.09 78.5 159,729.5 13.9 .0 70ddwj7363cd7 Module: Oracle SQL Performance Investigator SELECT * FROM ( SELECT * FROM ( SELECT obj#, statistic _name, value FROM V$SEGSTAT WHERE statistic_name IN ('logical reads', 'physical reads', 'db block chan ges', 'physical writes') ) PIVOT (SUM (value) FOR statistic_ 结论: 从os,awr的分析结果sql_id=70ddwj7363cd7消耗cpu严重,并在一个小时内长时间执行不结束,一直使用cpu。 接下来我们细化分析(ASH,SQL,ROWS)一下SQL是否需要优化。
2.2.2 从点分析(ASH,SQL,ROWS)
故障发生时间是00:00:00--00:15:00。awr和addm默认1小时快照1次,15分钟内肯定没有快照。这种情况使用ash进行精准分析 # 通过ash分析同样发现了问题sql_id为70ddwj7363cd7的语句,等待所占比例为77.2%。这个结果和AWR分析等结果再次一次吻合。 Top SQL with Top Events DB/Inst: bmcDB/bmcdb2 (Feb 14 00:00 to 00:15) Sampled # SQL ID Planhash of Executions % Activity ----------------------- -------------------- -------------------- -------------- Event % Event Top Row Source % RwSrc ------------------------------ ------- --------------------------------- ------- 70ddwj7363cd7 1450034333 52 97.41 CPU + Wait for CPU 77.20 FIXED TABLE - FULL 63.45 SELECT * FROM ( SELECT * FROM ( SELECT obj#, statistic _name, value FROM V$SEGSTAT WHERE statistic_name IN ('logical reads', 'physical reads', 'db block chan ges', 'physical writes') ) PIVOT (SUM (value) FOR statistic_ # cpu等待的事件session,竟然是system,值得深思。 Top Sessions DB/Inst: bmcDB/bmcdb2 (Feb 14 00:00 to 00:15) -> '# Samples Active' shows the number of ASH samples in which the session was found waiting for that particular event. The percentage shown in this column is calculated with respect to wall clock time and not total database activity. -> 'XIDs' shows the number of distinct transaction IDs sampled in ASH when the session was waiting for that particular event -> For sessions running Parallel Queries, this section will NOT aggregate the PQ slave activity into the session issuing the PQ. Refer to the 'Top Sessions running PQs' section for such statistics. Sid, Serial# % Activity Event % Event --------------- ---------- ------------------------------ ---------- User Program # Samples Active XIDs -------------------- ------------------------------ ------------------ -------- 17,21977 1.89 CPU + Wait for CPU 1.41 SYSTEM JDBC Thin Client 67/90 [ 74%] 0 29,10017 1.89 CPU + Wait for CPU 1.47 SYSTEM JDBC Thin Client 70/90 [ 78%] 0 56,46405 1.89 CPU + Wait for CPU 1.52 SYSTEM JDBC Thin Client 72/90 [ 80%] 0 121, 8429 1.89 CPU + Wait for CPU 1.52 SYSTEM JDBC Thin Client 72/90 [ 80%] 0 130,20913 1.89 CPU + Wait for CPU 1.56 SYSTEM JDBC Thin Client 74/90 [ 82%] 0 # 最后,发现foglight监控使用的是system用户 Activity Over Time DB/Inst: bmcDB/bmcdb2 (Feb 14 00:00 to 00:15) -> Analysis period is divided into smaller time slots -> Top 3 events are reported in each of those slots -> 'Slot Count' shows the number of ASH samples in that slot -> 'Event Count' shows the number of ASH samples waiting for that event in that slot -> '% Event' is 'Event Count' over all ASH samples in the analysis period Slot Event Slot Time (Duration) Count Event Count % Event -------------------- -------- ------------------------------ -------- ------- 00:00:00 (5.0 min) 1,575 CPU + Wait for CPU 1,258 26.48 resmgr:cpu quantum 313 6.59 Streams AQ: qmn coordinator wa 1 0.02 00:05:00 (5.0 min) 1,574 CPU + Wait for CPU 1,247 26.25 resmgr:cpu quantum 324 6.82 control file sequential read 3 0.06 00:10:00 (5.0 min) 1,601 CPU + Wait for CPU 1,240 26.11 resmgr:cpu quantum 356 7.49 control file sequential read 4 0.08 -------------------------------------------------------------
2.2.3 问题原因:SQL语句引起的CPU高负载
问题SQL_ID='70ddwj7363cd7'的sql语句,由foglight程序发起,system执行,导致cpu高负载问题。接下来我们来看看罪恶元凶这个sql语句长什么样子。 # 从AWR分析得知SQL_ID='70ddwj7363cd7',cpu使用率极高 # 从ASH分析得知SQL_ID='70ddwj7363cd7',Event比例极高 select SQL_TEXT from DBA_HIST_SQLTEXT where SQL_ID='70ddwj7363cd7'; # 使用DBA_HIST_SQLTEXT视图通过sql_id可以查询到详细的sql,sql语句如下: SELECT * FROM ( SELECT * FROM ( SELECT obj#, statistic_name, value FROM V$SEGSTAT WHERE statistic_name IN ('logical reads', 'physical reads', 'db block changes', 'physical writes') ) PIVOT (SUM (value) FOR statistic_name IN ('logical reads', 'physical reads', 'db block changes', 'physical writes') ) ORDER BY 2 DESC ) WHERE rownum <= :1 # 这个语句显然不是一个业务语句,所以处理它,可以大胆一些。
2.2.4 优化问题SQL,查看真实执行计划及解决办法
首先,查看真实执行计划,可使用一次sql语句后使用dbms_xplan.display_cursor查看执行计划, 可惜QL(SQL_ID='70ddwj7363cd7)无法执行成功,一直hang住。 其次,未执行完成或者一直hang的SQL,可以使用10046事件来分析执行计划。 未执行完成或一直hang的SQL使用sql trace工具在session级别分析其执行计划 # 具体方法步骤请参考:附表中《未执行完成的SQL语句如何查看执行计划呢?》 # 分析得出trc查看,并tkprof工具格式化后,如下 ========================================== SELECT * FROM ( SELECT * FROM ( SELECT obj#, statistic_name, value FROM V$SEGSTAT WHERE statistic_name IN ('logical reads', 'physical reads', 'db block changes', 'physical writes') ) PIVOT (SUM (value) FOR statistic_name IN ('logical reads', 'physical reads', 'db block changes', 'physical writes') ) ORDER BY 2 DESC ) WHERE rownum <= 10 call count cpu elapsed disk query current rows ------- ------ -------- ---------- ---------- ---------- ---------- ---------- Parse 1 0.00 0.00 0 0 0 0 Execute 1 0.00 0.00 0 0 0 0 Fetch 1 200.03 328.22 0 0 0 0 ------- ------ -------- ---------- ---------- ---------- ---------- ---------- total 3 200.03 328.23 0 0 0 0 Misses in library cache during parse: 1 Optimizer mode: ALL_ROWS Parsing user id: SYS Number of plan statistics captured: 1 Rows (1st) Rows (avg) Rows (max) Row Source Operation ---------- ---------- ---------- --------------------------------------------------- 0 0 0 COUNT STOPKEY (cr=0 pr=0 pw=0 time=62 us) 0 0 0 VIEW (cr=0 pr=0 pw=0 time=61 us cost=313 size=485355 card=7467) 0 0 0 SORT ORDER BY STOPKEY (cr=0 pr=0 pw=0 time=60 us cost=313 size=246411 card=7467) 0 0 0 HASH GROUP BY PIVOT (cr=0 pr=0 pw=0 time=43 us cost=313 size=246411 card=7467) 81563632 81563632 81563632 FIXED TABLE FULL X$KSOLSFTS (cr=0 pr=0 pw=0 time=762017301 us cost=293 size=3310527 card=100319) Elapsed times include waiting on following events: Event waited on Times Max. Wait Total Waited ---------------------------------------- Waited ---------- ------------ SQL*Net message to client 1 0.00 0.00 SQL*Net break/reset to client 1 0.00 0.00 SQL*Net message from client 1 12.02 12.02 # 通过10046事件分析得知,这条语句执行全部消耗在FIXED TABLE FULL X$KSOLSFTS , 并且严重消耗CPU=200/328=60% ========================================== # X$KSOLSFTS是一个fixed table,作用是动态分析instance资源的fixed table, 数据均保存内存中,理应很快返回结果,为什么这个case这么慢呢,还消耗cpu这么高。 从这个不正常的角度分析,查询了一下mos,发现是一个bug。 # Slow Performance When Querying V$SEGSTAT or GV$SEGSTAT Views (Doc ID 1532624.1) 解决办法: 第一:Apply the patch for Bug 21050285 or 第二:As a workaround, restarting the problem node alleviates this issue since fixed tables are populated from the time when the instance is started. # 因为我短时间无法升级patch 21050285,所以我选择了重启instance。 总结:到此其实这个问题已经解决,我们按照mos指定方法,找个时间窗口重启instance尝试解决它就可以了。 但是,抱着身为高级dba装B的态度,想了解以下原理内容,故就有了下面《3.发散思维》
3.发散思维
作为一个想进步的DBA,不能简单解决问题就结束了。拿着这个case锻炼一下自己其他方面的知识。接下来我们分析一下执行计划。
3.1 简易执行计划
从awr分析到需要深入分析的sql语句后,最查看执行计划的方式就是通过dbms_xplan.display_awr来查看。 # 这种方法优缺点:执行计划不一定是最佳的,但是最快速了解问题sql大概信息的方法。 3.2 会介绍更准确分析执行计划的方法。 select * from table(dbms_xplan.display_awr('&sql_id')); select * from table(dbms_xplan.display_awr('70ddwj7363cd7')); PLAN_TABLE_OUTPUT -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- SQL_ID 70ddwj7363cd7 -------------------- SELECT obj#, WHEREV$SEGSTAT 'physicalreads', 'physical writes'), ('logical_name IN 'db block changes', WHERE rownum <= :1 Plan hash value: 1450034333 -------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | -------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | | | 2 (100)| | | 1 | COUNT STOPKEY | | | | | | | 2 | VIEW | | 1 | 65 | 2 (100)| 00:00:01 | | 3 | SORT ORDER BY STOPKEY| | 1 | 86 | 2 (100)| 00:00:01 | | 4 | HASH GROUP BY PIVOT | | 1 | 86 | 2 (100)| 00:00:01 | | 5 | FIXED TABLE FULL | X$KSOLSFTS | 1 | 86 | 0 (0)| | -------------------------------------------------------------------------------------- 23 rows selected. # 这个执行计划,估计不是最佳的。如果是最佳的,也不至于1个小时也不出结果 # 执行计划 Plan hash value: 1450034333;很多时候不同的sql_id是可以有同一个执行计划的。 # 这里的SQL与DBA_HIST_SQLTEXT里查询的SQL并不相同。 # 既然认为执行计划不是最佳,那我们如何判断它不是最佳的执行计划呢? # X$KSOLSFTS表实际行数是632764,可统计信息是283998。显然是不准确的。统计信息不准确的情况下, 很难得到最佳的执行计划。所以,认为执行计划不是最佳的,有优化的空间 SQL> select t.TABLE_NAME,t.NUM_ROWS,t.BLOCKS,t.LAST_ANALYZED from dba_tab_statistics t where table_name ='X$KSOLSFTS'; TABLE_NAME NUM_ROWS BLOCKS LAST_ANALYZED ------------------------------ ---------- ---------- ------------------- X$KSOLSFTS 283998 2017-09-05 14:03:06 实际行数 SQL> select count(*) from X$KSOLSFTS; COUNT(*) ---------- 632764 # 往往通过dbms_xplan.display_awr查看执行计划不是最准确的, 接下来我们使用dbms_xplan.display_cursor查看真实的执行计划。
3.2 真实执行计划
3.2.1 如何查看SQL的真实执行计划?
========================================== step1:设置session分析级别 alter session set statistics_level=ALL; step2:执行业务sql 用10来代替变量 step3:为了样式,设置linesize set linesize 200 pagesize 300; step4:查询真实执行计划 select * from table(dbms_xplan.display_cursor(null, null, 'iostats last')); ========================================== SQL> select * from table(dbms_xplan.display_cursor(null, null, 'iostats last')); PLAN_TABLE_OUTPUT -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- SQL_ID 6u1rq6czpwhgy, child number 0 ------------------------------------- SELECT * FROM ( SELECT * FROM ( SELECT obj#, statistic_name, value FROM V$SEGSTAT WHERE statistic_name IN ('logical reads', 'physical reads', 'db block changes', 'physical writes') ) PIVOT (SUM (value) FOR statistic_name IN ('logical reads', 'physical reads', 'db block changes', 'physical writes') ) ORDER BY 2 DESC ) WHERE rownum <= 10 Plan hash value: 1450034333 -------------------------------------------------------------------------------------- | Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | -------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | | 10 |00:00:01.24 | |* 1 | COUNT STOPKEY | | 1 | | 10 |00:00:01.24 | | 2 | VIEW | | 1 | 4583 | 10 |00:00:01.24 | |* 3 | SORT ORDER BY STOPKEY| | 1 | 4583 | 10 |00:00:01.24 | | 4 | HASH GROUP BY PIVOT | | 1 | 4583 | 7467 |00:00:01.23 | |* 5 | FIXED TABLE FULL | X$KSOLSFTS | 1 | 44024 | 115K|00:00:00.97 | -------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 1 - filter(ROWNUM<=10) 3 - filter(ROWNUM<=10) 5 - filter((INTERNAL_FUNCTION("FTS_STATNAM") AND "FTS_INTE"=0 AND "INST_ID"=USERENV('INSTANCE'))) 31 rows selected. # E-Rows 44024 与 A-Rows 115K 差距较大,执行计划显然不是最佳的。之前我们也分析X$KSOLSFTS统计信息不准确,我们来收集一下X$KSOLSFTS统计的信息。
3.2.2 如何收集fixed tables统计信息?
========================================== # 备注:业务低峰期进行收集。 收集fixed tables统计信息 execute dbms_stats.gather_fixed_objects_stats; 删除fixed tables统计信息 execute dbms_stats.delete_fixed_objects_stats; ========================================== 收集fixed tables统计信息后,X$KSOLSFTS统计的信息也得到了更新。 想着这回执行计划可能会准确把?但现实狠狠再次抽我一下。 SQL> select t.TABLE_NAME,t.NUM_ROWS,t.BLOCKS,t.LAST_ANALYZED from dba_tab_statistics t where table_name like '%KSOLSFTS%'; TABLE_NAME NUM_ROWS BLOCKS LAST_ANALYZED ------------------------------ ---------- ---------- ------------------- X$KSOLSFTS 632280 2022-02-10 11:26:24 再次查看执行计划,还是不准确的 SQL> select * from table(dbms_xplan.display_cursor(null, null, 'iostats last')); PLAN_TABLE_OUTPUT -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- SQL_ID 6u1rq6czpwhgy, child number 0 ------------------------------------- SELECT * FROM ( SELECT * FROM ( SELECT obj#, statistic_name, value FROM V$SEGSTAT WHERE statistic_name IN ('logical reads', 'physical reads', 'db block changes', 'physical writes') ) PIVOT (SUM (value) FOR statistic_name IN ('logical reads', 'physical reads', 'db block changes', 'physical writes') ) ORDER BY 2 DESC ) WHERE rownum <= 10 Plan hash value: 1450034333 -------------------------------------------------------------------------------------- | Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | -------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | | 10 |00:00:00.51 | |* 1 | COUNT STOPKEY | | 1 | | 10 |00:00:00.51 | | 2 | VIEW | | 1 | 4583 | 10 |00:00:00.51 | |* 3 | SORT ORDER BY STOPKEY| | 1 | 4583 | 10 |00:00:00.51 | | 4 | HASH GROUP BY PIVOT | | 1 | 4583 | 7459 |00:00:00.50 | |* 5 | FIXED TABLE FULL | X$KSOLSFTS | 1 | 44024 | 114K|00:00:00.39 | -------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 1 - filter(ROWNUM<=10) 3 - filter(ROWNUM<=10) 5 - filter((INTERNAL_FUNCTION("FTS_STATNAM") AND "FTS_INTE"=0 AND "INST_ID"=USERENV('INSTANCE'))) 31 rows selected. # 虽然X$KSOLSFTS统计信息更新了,但是执行计划还是不准确,E-Rows是44024行,我们实际却要访问114K行。 # 但是实际使用的时间是00:00:00.51,已经很快了。为什么会出现cpu等待呢? 问题:其实这条语句在一个小时内执行内有时可以完成,有时无法完成;导致我们无法系统的分析他。 答案:我们必须全局的考虑他,比如查看每个小时cpu执行情况,cpu使用情况等等,这样就招呼出来dba_hist_sqlstat视图来帮助我们分析。
3.3.3 通过dba_hist_sqlstat视图来分析sql执行计划消耗资源的变化,选择资源消耗更少的执行计划
SQL_ID='70ddwj7363cd7'偶尔能执行成功,偶尔不行,这种情况我们只能看历史情况综合分析它了。 通过dba_hist_sqlstat视图来看看这个sql语句的执行计划在每个小时内消耗cpu的资源分析 # 7天内,查询SQL_ID的sql每个执行计划消耗的资源情况 # 收集fixed tables统计信息后,显然是有效果的,但是没想到后续又出现了cpu高等待 set lines 150 pages 999; col p_user for 99999 ; col execs for 9999999 ; select a.INSTANCE_NUMBER inst_id,to_char(b.END_INTERVAL_TIME,'yyyymmdd hh24:mi:ss') time,plan_hash_value,buffer_gets_delta/executions_delta get_exec, disk_reads_delta/executions_delta read_exec,cpu_time_delta/executions_delta/1000 cpu_exec_ms ,elapsed_time_delta/executions_delta/1000 elaps_exec_ms ,parsing_schema_id p_user,ROWS_PROCESSED_delta/executions_delta rows_exec,EXECUTIONS_DELTA execs from dba_hist_sqlstat a, dba_hist_snapshot b where a.sql_id='&1' and a.snap_id = b.snap_id and a.instance_number = b.instance_number and b.END_INTERVAL_TIME between sysdate - 7 and sysdate and executions_delta>0 and a.INSTANCE_NUMBER ='2' order by 2,1; INST_ID TIME PLAN_HASH_VALUE GET_EXEC READ_EXEC CPU_EXEC_MS ELAPS_EXEC_MS P_USER ROWS_EXEC EXECS ---------- ----------------- --------------- ---------- ---------- ----------- ------------- ------ ---------- -------- 2 20220207 18:00:13 1450034333 0 0 3245284.24 22792974.5 5 0 6 =>省略一些 2 20220209 12:00:42 0 0 18276.144 30567.4557 5 0 60 2 20220209 13:00:23 0 0 18589.8178 30220.1374 5 0 60 2 20220209 14:00:32 0 0 18809.3314 30660.602 5 0 60 2 20220209 15:00:16 0 0 719547.728 1170843.71 5 0 6 =>计划工作是1个小时60次,其实只能完成6次,这个时候其实问题已经存在了 2 20220209 16:00:42 0 0 2599794.88 4483118.75 5 0 4 2 20220209 17:00:02 0 0 3711172.17 7159658.05 5 0 4 2 20220209 18:00:15 0 0 329147.826 662295.523 5 0 41 2 20220209 19:00:24 0 0 1653292.58 2709949.97 5 0 4 =>省略一些,这个时候偶尔可以,偶尔不可以 2 20220210 10:00:04 0 0 314395.289 1281711.87 5 7459.88333 60 2 20220210 11:00:15 0 0 317873.686 1263796.6 5 7465.53333 60 2 20220210 12:00:21 0 0 322069.2 1253976.16 5 7461.5 20 =>11:30左右开始收集fixed tables统计信息,12:50左右收集完成。14:00开启监控sql(V$SEGSTAT) 2 20220210 14:00:25 0 0 289.353 590.408256 5 7456.16279 43 2 20220210 15:00:22 1.43333333 .083333333 287.013433 465.832983 5 7453.6 60 2 20220210 16:00:15 0 0 285.755183 463.890183 5 7450.21667 60 2 20220210 17:00:22 1450034333 0 0 288.7832 589.889667 5 7450 60 2 20220210 18:00:17 1450034333 0 0 286.847317 464.948483 5 7450.91667 60 2 20220210 19:00:18 1450034333 0 0 286.966 464.918083 5 7450.01667 60 2 20220210 20:00:11 1450034333 0 0 286.781417 465.000033 5 7449.96667 60 2 20220210 21:00:06 1450034333 0 0 287.152033 465.723383 5 7450 60 2 20220210 22:00:07 1450034333 0 0 287.14685 465.364817 5 7450 60 2 20220210 23:00:04 1450034333 0 0 309.7571 503.2724 5 7450 60 2 20220211 00:00:49 1450034333 0 0 309.959393 503.812279 5 7449.18033 61 2 20220211 05:00:41 1450034333 0 0 225638.216 372509.22 5 0 18 =>再次出现cpu高等待,后面就是偶尔高,偶尔低不稳定 2 20220211 06:00:24 1450034333 0 0 102763.108 166902.224 5 0 60 2 20220211 07:00:18 1450034333 0 0 102958.522 167239.497 5 0 60 2 20220211 08:00:57 1450034333 0 0 103486.041 168101.32 5 0 61 2 20220211 09:00:11 1450034333 0 0 96940.2201 158655.225 5 0 59 2 20220211 10:00:04 1450034333 0 0 100547.878 164694.348 5 0 60 2 20220211 11:00:02 1450034333 0 0 104021.177 169745.013 5 0 60 2 20220211 12:00:11 1450034333 0 0 104604.996 170262.021 5 0 60 # 分析到此,几乎黔驴技穷了。fixed tables统计信息准确后,我们的执行计划还是不准的。看样这个BUG只有重启rac的instance来减少X$KSOLSFTS的数据量来解决这个问题了。
3.3.4 rac重启instance后,fixed tables数据减少,查询速度变快,执行计划成本未改变(未收集统计信息)
# 重启instance后,X$KSOLSFTS在重启instance时被内存中的C程序结构重新填充了数据, 进而X$KSOLSFTS数据量减少了。 节点1 SQL> select count(*) from X$KSOLSFTS; COUNT(*) ---------- 36102 节点2 SQL> select count(*) from X$KSOLSFTS; COUNT(*) ---------- 22352 # instance重启后,X$KSOLSFTS实际行数从632764下降至3万多 # 重启之后,sql执行计划的改变 节点2 SQL> select * from table(dbms_xplan.display_cursor(null, null, 'iostats last')); PLAN_TABLE_OUTPUT -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- SQL_ID 6u1rq6czpwhgy, child number 1 ------------------------------------- SELECT * FROM ( SELECT * FROM ( SELECT obj#, statistic_name, value FROM V$SEGSTAT WHERE statistic_name IN ('logical reads', 'physical reads', 'db block changes', 'physical writes') ) PIVOT (SUM (value) FOR statistic_name IN ('logical reads', 'physical reads', 'db block changes', 'physical writes') ) ORDER BY 2 DESC ) WHERE rownum <= 10 Plan hash value: 1450034333 -------------------------------------------------------------------------------------- | Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | -------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | | 10 |00:00:00.02 | |* 1 | COUNT STOPKEY | | 1 | | 10 |00:00:00.02 | | 2 | VIEW | | 1 | 7467 | 10 |00:00:00.02 | |* 3 | SORT ORDER BY STOPKEY| | 1 | 7467 | 10 |00:00:00.02 | | 4 | HASH GROUP BY PIVOT | | 1 | 7467 | 920 |00:00:00.02 | |* 5 | FIXED TABLE FULL | X$KSOLSFTS | 1 | 100K| 4064 |00:00:00.01 | -------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 1 - filter(ROWNUM<=10) 3 - filter(ROWNUM<=10) 5 - filter((INTERNAL_FUNCTION("FTS_STATNAM") AND "FTS_INTE"=0 AND "INST_ID"=USERENV('INSTANCE'))) 31 rows selected. # 这里可以看出来,这个执行计划还是有问题的。e-rows与a-rows差距太大100k与4064。 为什么e-rows是100k呢,因为我们的统计信息不准确
3.3.5 收集统计信息后得到最佳执行计划
我们继续收集一下fixed tables的统计信息,看看执行计划是否有改变 节点2收集统计信息后 SQL> execute dbms_stats.gather_fixed_objects_stats; PL/SQL procedure successfully completed. # 刚重启instance后,fixed table数据量少,收集速度也很快。 节点2统计信息分析结果查看 SQL> select t.TABLE_NAME,t.NUM_ROWS,t.BLOCKS,t.LAST_ANALYZED from dba_tab_statistics t where table_name like '%KSOLSFTS%'; TABLE_NAME NUM_ROWS BLOCKS LAST_ANALYZED ------------------------------ ---------- ---------- ------------------- X$KSOLSFTS 37246 2022-02-14 13:17:57 节点2收集统计信息后,sql真实执行计划如下 SQL> select * from table(dbms_xplan.display_cursor(null, null, 'iostats last')); PLAN_TABLE_OUTPUT -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- SQL_ID g5ktn014rjsyw, child number 0 ------------------------------------- SELECT * FROM ( SELECT * FROM ( SELECT obj#, statistic_name, value FROM V$SEGSTAT WHERE statistic_name IN ('logical reads', 'physical reads', 'db block changes', 'physical writes') ) PIVOT (SUM (value) FOR statistic_name IN ('logical reads', 'physical reads', 'db block changes', 'physical writes') ) ORDER BY 2 DESC ) WHERE rownum <= 10 Plan hash value: 1450034333 -------------------------------------------------------------------------------------- | Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | -------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | | 10 |00:00:00.02 | |* 1 | COUNT STOPKEY | | 1 | | 10 |00:00:00.02 | | 2 | VIEW | | 1 | 1507 | 10 |00:00:00.02 | |* 3 | SORT ORDER BY STOPKEY| | 1 | 1507 | 10 |00:00:00.02 | | 4 | HASH GROUP BY PIVOT | | 1 | 1507 | 936 |00:00:00.02 | |* 5 | FIXED TABLE FULL | X$KSOLSFTS | 1 | 5011 | 3096 |00:00:00.01 | -------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 1 - filter(ROWNUM<=10) 3 - filter(ROWNUM<=10) 5 - filter((INTERNAL_FUNCTION("FTS_STATNAM") AND "FTS_INTE"=0 AND "INST_ID"=USERENV('INSTANCE'))) 31 rows selected. # 当前执行计划终于准确了,E-Rows | A-Rows差距相近
3.3.6 使用dba_hist_sqlstat统计一下我们优化后的效果
rac重启instance之后,持续观察一下cpu资源每个小时内消耗cpu的资源分析 #7天内,查询SQL_ID的sql每个执行计划消耗的资源情况 set lines 150 pages 999; col p_user for 99999 ; col execs for 9999999 ; select a.INSTANCE_NUMBER inst_id,to_char(b.END_INTERVAL_TIME,'yyyymmdd hh24:mi:ss') time,plan_hash_value,buffer_gets_delta/executions_delta get_exec, disk_reads_delta/executions_delta read_exec,cpu_time_delta/executions_delta/1000 cpu_exec_ms ,elapsed_time_delta/executions_delta/1000 elaps_exec_ms ,parsing_schema_id p_user,ROWS_PROCESSED_delta/executions_delta rows_exec,EXECUTIONS_DELTA execs from dba_hist_sqlstat a, dba_hist_snapshot b where a.sql_id='&1' and a.snap_id = b.snap_id and a.instance_number = b.instance_number and b.END_INTERVAL_TIME between sysdate - 7 and sysdate and executions_delta>0 and a.INSTANCE_NUMBER ='2' order by 2,1; Enter value for 1: 70ddwj7363cd7 old 3: where a.sql_id='&1' new 3: where a.sql_id='70ddwj7363cd7' INST_ID TIME PLAN_HASH_VALUE GET_EXEC READ_EXEC CPU_EXEC_MS ELAPS_EXEC_MS P_USER ROWS_EXEC EXECS ---------- ----------------- --------------- ---------- ---------- ----------- ------------- ------ ---------- -------- ............ =》省略一些 2 20220213 19:00:27 1450034333 0 0 5734704.21 22100839.5 5 0 4 2 20220213 20:00:51 1450034333 0 0 6204963.99 26358999.8 5 0 4 2 20220213 21:00:28 1450034333 0 0 6270982.43 29882859 5 0 4 2 20220213 22:00:05 1450034333 0 0 6402684.53 33756085.6 5 0 4 2 20220213 23:00:13 1450034333 0 0 5777364.85 34102730.6 5 0 4 2 20220214 00:00:07 1450034333 0 0 5759844.62 36741886.9 5 0 4 2 20220214 01:00:26 1450034333 0 0 5566086.21 39932383.6 5 0 4 2 20220214 02:00:17 1450034333 0 0 663313.873 4898792.2 5 0 33 2 20220214 03:00:38 1450034333 0 0 1218136.33 9638080.7 5 0 18 2 20220214 04:00:12 1450034333 0 0 21725865.7 182932616 5 0 1 =>20220214 12:30:00 实例重启之后,X$KSOLSFTS表行数减少。问题得到真正解决 2 20220214 14:00:14 1450034333 1.75 .055555556 13.5259444 21.8915833 5 1002.05556 36 2 20220214 15:00:10 1450034333 0 0 13.9752 22.5881333 5 1032.55 60 2 20220214 16:00:09 1450034333 0 0 14.4937667 23.4365333 5 1089.48333 60 2 20220214 17:00:36 1450034333 0 0 14.9452 24.17055 5 1126.1 60 2 20220214 18:00:35 1450034333 0 0 15.6892167 25.32605 5 1196.4 60 2 20220214 19:00:45 1450034333 0 0 15.7695082 25.4163115 5 1199.45902 61 2 20220214 20:00:24 1450034333 0 0 16.0380508 25.8679492 5 1224.94915 59 2 20220214 21:00:22 1450034333 0 0 16.1035167 25.9960667 5 1226 60 2 20220214 22:00:13 1450034333 0 0 16.10595 26.0308333 5 1226 60 2 20220215 00:00:00 1450034333 0 0 25.5067333 41.3598667 5 2001.21667 60 2 20220215 05:00:18 1450034333 0 0 36.4353 59.0761833 5 2941.05 60 2 20220215 06:00:12 1450034333 0 0 36.5330167 59.0673833 5 2948.21667 60 2 20220215 07:00:16 1450034333 0 0 36.44085 58.8934333 5 2948 60 2 20220215 08:00:24 1450034333 0 0 36.6509833 59.2170667 5 2949.53333 60 2 20220215 09:00:17 1450034333 0 0 37.3499 60.3852333 5 3035.36667 60 2 20220215 10:00:41 1450034333 0 0 38.42875 62.13975 5 3115.76667 60 2 20220215 11:00:03 1450034333 0 0 38.7289833 62.6527667 5 3124.53333 60 2 20220215 12:00:17 1450034333 0 0 38.4324167 62.15165 5 3124.08333 60 2 20220215 13:00:27 1450034333 0 0 38.6223833 62.4524333 5 3125 60 2 20220215 14:00:47 1450034333 0 0 38.5679672 62.5711311 5 3125.59016 61 2 20220215 15:00:07 1450034333 0 0 38.5895085 62.5896271 5 3126.22034 59 到此问题得到根本解决。 第一,执行计划准确,E-Rows | A-Rows差距接近 第二,使用dba_hist_sqlstat查看sql使用cpu资源也趋近稳定合理。
4.附表:
4.1 如何查看统计信息维护窗口?
col WINDOW_NAME for a15 col REPEAT_INTERVAL for a60 col DURATION for a30 set linesize 120 SELECT t1.window_name, t1.repeat_interval, t1.duration FROM dba_scheduler_windows t1, dba_scheduler_wingroup_members t2 WHERE t1.window_name = t2.window_name AND t2.window_group_name IN ('MAINTENANCE_WINDOW_GROUP', 'BSLN_MAINTAIN_STATS_SCHED'); ***************************************************************** WINDOW_NAME REPEAT_INTERVAL DURATION --------------- ------------------------------------------------------------ ------------------------------ MONDAY_WINDOW freq=daily;byday=MON;byhour=22;byminute=0; bysecond=0 +000 04:00:00 TUESDAY_WINDOW freq=daily;byday=TUE;byhour=22;byminute=0; bysecond=0 +000 04:00:00 WEDNESDAY_WINDOW freq=daily;byday=WED;byhour=22;byminute=0; bysecond=0 +000 04:00:00 THURSDAY_WINDOW freq=daily;byday=THU;byhour=22;byminute=0; bysecond=0 +000 04:00:00 FRIDAY_WINDOW freq=daily;byday=FRI;byhour=22;byminute=0; bysecond=0 +000 04:00:00 SATURDAY_WINDOW freq=daily;byday=SAT;byhour=22;byminute=0; bysecond=0 +000 04:00:00 SUNDAY_WINDOW freq=daily;byday=SUN;byhour=22;byminute=0; bysecond=0 +000 04:00:00 7 rows selected.
4.2 未执行完成的SQL语句如何查看执行计划呢?
使用sql trace工具在session级别对未执行完的SQL,进行执行计划分析 # 使用sql trace 查看未执行完的SQL执行计划的方法: ========================================== connect xxxx/yyyy alter session set tracefile_identifier='EVENT10046'; alter session set max_dump_file_size = UNLIMITED; alter session set timed_statistics=true; alter session set events '10046 trace name context forever, level 12'; # level 12 可以对变量进行追踪 -- 执行对象 SQL -- sql_id=70ddwj7363cd7是一个绑定变量的SQL,所以暂定把变量=10带入SQL来优化 alter session set events '10046 trace name context off'; select * from v$diag_info where name like 'Default Trace File%'; exit ==========================================
5.总结:
5.1 灵活应用手中的工具: 若分析一个快照周期内数据库状况,可以使用AWR;若要分析未完成快照的数据库状况,可使用ASH。 5.2 instance永远不重启,也许不是一个正确的选择? 长时间未重启的instance要考虑fixed table的增长对数据库影响, 本案例instance就是1年多未重启。X$KSOLSFTS由3w增加到63多w触发了BUG。 5.3 未执行完成的SQL如何查看执行计划? 10046事件来分析。 5.4 怎么分析SQL执行偶尔成功,偶尔失败的资源状况呢? dba_hist_sqlstat和dba_hist_snapshot组合是一个利器。衡定一个时间内分析这个SQL资源状况最为标准。