作者:程柳润,爱可生 DBA 团队成员,OceanBase 数据库技术爱好者。
问题描述
业务反馈一个存储过程执行慢,需要排查原因。由于存储过程过长且涉及多张表,所以主要分享排查思路。
分析过程
1. 查找该存储过程的 trace_id
每次执行存储过程后会有两条记录,两条记录 SQL 的请求类型不同,先产生的记录是预编译的记录(执行耗时很小),另一个是实际执行的记录。
obclient [oceanbase]> select usec_to_time(request_time),svr_ip,trace_id,query_sql,elapsed_time from gv$sql_audit where query_sql like 'call TEST_CHENG%' order by request_time;+----------------------------+----------------+-----------------------------------+-------------------+--------------+--------------+| usec_to_time(request_time) | svr_ip | trace_id | query_sql | elapsed_time | request_type |+----------------------------+----------------+-----------------------------------+-------------------+--------------+--------------+| 2023-12-16 18:20:31.741933 | xx.xxx.xxx.xxx | YB420ABA3D1F-00060A149FDC35FD-0-0 | call TEST_CHENG() | 73 | 5|| 2023-12-16 18:20:31.742838 | xx.xxx.xxx.xxx | YB420ABA3D1F-00060A149CFDF57C-0-0 | call TEST_CHENG() | 2362075 | 6|+----------------------------+----------------+-----------------------------------+-------------------+--------------+--------------+2 rows in set (0.614 sec)
2. 分析存储中较慢的 SQL
根据上一步中执行较慢的
trace_id 查看该存储过程中执行较慢的 SQL。发现存储过程中的几个
UPDATE 耗时较长。另外,
IS_HIT_PLAN 都为 0,表示没有命中
plan cache。然而其
GET_PLAN_TIME(获取执行计划时间)也为 0,是不正常的,这一点会在后面解释。
obclient [oceanbase]> select usec_to_time(request_time),query_sql,elapsed_time,execute_time,AFFECTED_ROWS,EVENT,TOTAL_WAIT_TIME_MICRO,WAIT_TIME_MICRO,TOTAL_WAITS,GET_PLAN_TIME,QUEUE_TIME,IS_HIT_PLAN,plan_id,svr_ip from gv$sql_audit where trace_id='YB420ABA3D1F-00060A149CFDF57C-0-0' order by elapsed_time desc limit 6\G*************************** 1. row ***************************usec_to_time(request_time): 2023-12-16 18:20:31.742838 query_sql: call TEST_CHENG() elapsed_time: 2362075 execute_time: 2362015...... GET_PLAN_TIME: 21 QUEUE_TIME: 15 IS_HIT_PLAN: 0......*************************** 2. row ***************************usec_to_time(request_time): 2023-12-16 18:20:32.359304 query_sql: update "CC"."TEST1" "T" set "T"."CITY" = '1',"T"."COUNTY" = (select elapsed_time: 630735 execute_time: 630735...... GET_PLAN_TIME: 0 QUEUE_TIME: 0 IS_HIT_PLAN: 0......*************************** 3. row ***************************usec_to_time(request_time): 2023-12-16 18:20:32.999169 query_sql: update "CC"."TEST1" "T" set "T"."CITY" = '0',"T"."COUNTY" = (select elapsed_time: 400111 execute_time: 400111...... GET_PLAN_TIME: 0 QUEUE_TIME: 0 IS_HIT_PLAN: 0......*************************** 4. row ***************************usec_to_time(request_time): 2023-12-16 18:20:33.731726 query_sql: update "CC"."TEST1" "T" set "T"."CITY" = '0',"T"."ROAD" = (select elapsed_time: 335918 execute_time: 335918...... GET_PLAN_TIME: 0 QUEUE_TIME: 0 IS_HIT_PLAN: 0......*************************** 5. row ***************************usec_to_time(request_time): 2023-12-16 18:20:33.402608 query_sql: update "CC"."TEST1" "T" set "T"."CITY" = '0',"T"."ROAD" = (select elapsed_time: 319150 execute_time: 319150...... GET_PLAN_TIME: 0 QUEUE_TIME: 0 IS_HIT_PLAN: 0......*************************** 6. row ***************************usec_to_time(request_time): 2023-12-16 18:20:32.086302 query_sql: truncate Table TEST1 elapsed_time: 142128 execute_time: 142128...... GET_PLAN_TIME: 0 QUEUE_TIME: 0 IS_HIT_PLAN: 0......6 rows in set (5.342 sec)
3. 查看存储过程执行情况
发现调用该存储过程之后,对
__all_tenant_dependency 等系统视图有增删改查的操作(篇幅有限,这里只截取了一部分)。这是由于编译 PL 对象会对系统视图做一些操作,
gv$sql_audit 会记录下来,可以用
tenant_id=业务租户 ID 来过滤掉这些记录。
另外,在
UPDATE 前有
TRUNCATE TEST1 的操作。该操作会导致两个问题:
- 整个存储过程的 plan cache 无法命中,每次都要重新编译。
- 存储过程中这张表相关的 SQL 的 plan cache 也会失效,每次执行也需要重新获取执行计划。
obclient [oceanbase]> select substr(query_sql,1,60),usec_to_time(request_time),elapsed_time,svr_ip from gv$sql_audit where trace_id='YB420ABA3D1F-00060A149CFDF57C-0-0' order by request_time;+---------------------------------------------------+----------------------------+--------------+----------------+| substr(query_sql,1,60) | usec_to_time(request_time) | elapsed_time | svr_ip |+---------------------------------------------------+----------------------------+--------------+----------------+| call TEST_CHENG(); | 2023-12-16 18:50:30.086159 | 2514812 | xx.xxx.xxx.xxx || START TRANSACTION | 2023-10-07 18:50:30.472741 | 194 | xx.xxx.xxx.xxx || delete FROM __all_tenant_dependency WHERE dep_obj_| 2023-10-07 18:50:30.473589 | 1570 | xx.xxx.xxx.xxx || INSERT INTO __all_tenant_dependency (tenant_id, de| 2023-10-07 18:50:30.476450 | 243 | xx.xxx.xxx.xxx ||...... ||...... || insert into "CC"."TEST5"( "LOG_DATE", "LOG_LEVEL",| 2023-12-16 18:50:30.504996 | 393 | xx.xxx.xxx.xxx || truncate Table TEST1 | 2023-12-16 18:50:30.505998 | 569828 | xx.xxx.xxx.xxx || select max("T"."CREATE_TIME") from "CC"."TEST6 | 2023-12-16 18:50:31.176795 | 295 | xx.xxx.xxx.xxx || select max("T"."CREATE_TIME") from "CC"."TEST6 | 2023-12-16 18:50:31.177108 | 180 | xx.xxx.xxx.xxx || insert into "CC"."TEST1"( "ID", "NUMBER | 2023-12-16 18:50:31.178344 | 8039 | xx.xxx.xxx.xxx || SELECT * FROM __all_tenant_meta_table WHERE tenant| 2023-12-16 18:50:31.184392 | 328 | xx.xxx.xxx.xxx || update "CC"."TEST1" "T" set "T"."ID" = | 2023-12-16 18:50:31.187246 | 4696 | xx.xxx.xxx.xxx || SELECT partition_id, object_type, row_cnt as row_c| 2023-12-16 18:50:31.189366 | 181 | xx.xxx.xxx.xxx || SELECT tenant_id, table_id, partition_id, column_i| 2023-12-16 18:50:31.190145 | 224 | xx.xxx.xxx.xxx || SELECT tenant_id, table_id, partition_id, column_i| 2023-12-16 18:50:31.190774 | 131 | xx.xxx.xxx.xxx || update "CC"."TEST1" "T" set "T"."CITY" = | 2023-12-16 18:50:31.192182 | 4577 | xx.xxx.xxx.xxx || update "CC"."TEST1" "T" set "T"."CITY" = | 2023-12-16 18:50:31.197606 | 3896 | xx.xxx.xxx.xxx || update "CC"."TEST1" "T" set "T"."CITY" = | 2023-12-16 18:50:31.202269 | 347793 | xx.xxx.xxx.xxx || update "CC"."TEST1" "T" set "T"."CITY" = | 2023-12-16 18:50:31.559553 | 338883 | xx.xxx.xxx.xxx || update "CC"."TEST1" "T" set "T"."CITY" = | 2023-12-16 18:50:31.901602 | 311209 | xx.xxx.xxx.xxx || update "CC"."TEST1" "T" set "T"."CITY" = | 2023-12-16 18:50:32.218374 | 6085 | xx.xxx.xxx.xxx || update "CC"."TEST1" "T" set "T"."CITY" = | 2023-12-16 18:50:32.225213 | 338044 | xx.xxx.xxx.xxx || update "CC"."TEST1" "T" set "T"."ORGAN_CODE" = | 2023-12-16 18:50:32.566794 | 15760 | xx.xxx.xxx.xxx || insert into "CC"."TEST8"( "MANAGER_ID", "APART_ID | 2023-12-16 18:50:32.593934 | 5915 | xx.xxx.xxx.xxx |+---------------------------------------------------+----------------------------+--------------+----------------+154 rows in set (4.125 sec)- 手工执行慢的几个update,发现命中 plan cache 时只需要几毫秒,而没有命中 plan cache 时需要300多毫秒。
手工执行时会命中执行计划
obclient [oceanbase]> select usec_to_time(request_time),query_sql,elapsed_time,execute_time,AFFECTED_ROWS,EVENT,TOTAL_WAIT_TIME_MICRO,WAIT_TIME_MICRO,TOTAL_WAITS,GET_PLAN_TIME,QUEUE_TIME,IS_HIT_PLAN,plan_id,svr_ip from gv$sql_audit where query_sql like 'update "CC"."TEST1" "T" set "T"."CITY"%' and GET_PLAN_TIME>0 order by request_time\Gusec_to_time(request_time): 2023-12-16 19:12:30.063271 query_sql: update "CC"."TEST1" "T" set "T"."CITY" = '1',"T"."COUNTY" = (select elapsed_time: 8810 execute_time: 8810...... GET_PLAN_TIME: 0 QUEUE_TIME: 0 IS_HIT_PLAN: 1......
手工执行update前先做truncate操作
不会命中执行计划,此时
GET_PLAN_TIME 为 372ms。
obclient [oceanbase]> select usec_to_time(request_time),query_sql,elapsed_time,execute_time,AFFECTED_ROWS,EVENT,TOTAL_WAIT_TIME_MICRO,WAIT_TIME_MICRO,TOTAL_WAITS,GET_PLAN_TIME,QUEUE_TIME,IS_HIT_PLAN,plan_id,svr_ip from gv$sql_audit where query_sql like 'update "CC"."TEST1" "T" set "T"."CITY"%' and GET_PLAN_TIME>0 order by request_time\G*************************** 1. row ***************************usec_to_time(request_time): 2023-12-16 19:21:53.069019 query_sql: update "CC"."TEST1" "T" set "T"."CITY" = '1',"T"."COUNTY" = (select elapsed_time: 378148 execute_time: 6036...... GET_PLAN_TIME: 372055 QUEUE_TIME: 28 IS_HIT_PLAN: 0......
注意:存储过程里的 SQL,即使没有命中 plan cache,在 gv$sql_audit 中还是会显示 GET_PLAN_TIME: 0,都体现在 execute_time。正常来说(比如手工执行的 SQL),如果没有命中 plan cache,GET_PLAN_TIME 值会比较大,并且不会计算到 execute_time 里。
结论
TRUNCATE 导致
UPDATE 无法命中
plan cache,所以存储过程慢。
解决方案
将存储过程里的
TRUNCATE 改成了
DELETE。