RHEL 6.6: IPC Send timeout/node eviction etc. with high packet reassembles failure (Doc ID 2008933.1)


In this Document

Symptoms

Cause

Solution

References

APPLIES TO:

Oracle Database - Enterprise Edition - Version 11.2.0.1 and later
Generic Linux

SYMPTOMS

On Red Hat Enterprise Linux, or Oracle Linux running the Red Hat compatible kernel, after an upgrade to 6.6 the database/node fails with messages such as:

Fri May 01 03:05:48 2015
IPC Send timeout detected. Receiver ospid 28660 [oracle@xxxxx (LMS0)]
Fri May 01 03:05:48 2015
Errors in file /xddv1covd/oracle/diag/rdbms/xrcovd/XRCOVD3/trace/XRCOVD3_lms0_28660.trc:
IPC Send timeout detected. Receiver ospid 28670 [oracle@xxxxx (LMS1)]
Fri May 01 03:05:53 2015
Errors in file /xddv1covd/oracle/diag/rdbms/xrcovd/XRCOVD3/trace/XRCOVD3_lms1_28670.trc:
Fri May 01 03:06:00 2015
IPC Send timeout detected. Receiver ospid 31414 [oracle@xxxxx (PZ98)]
Fri May 01 03:06:00 2015
Errors in file /xddv1covd/oracle/diag/rdbms/xrcovd/XRCOVD3/trace/XRCOVD3_pz98_31414.trc:
Fri May 01 03:06:13 2015
IPC Send timeout detected. Receiver ospid 1835 [oracle@xxxxx (PZ97)]
Fri May 01 03:06:13 2015
Errors in file /xddv1covd/oracle/diag/rdbms/xrcovd/XRCOVD3/trace/XRCOVD3_pz97_1835.trc:
Fri May 01 03:06:43 2015
Fri May 01 03:06:43 2015
Received an instance abort message from instance 1Received an instance abort message from instance 1

Please check instance 1 alert and LMON trace files for detail.Please check instance 1 alert and LMON trace files for detail.

LMS0 (ospid: 28660): terminating the instance due to error 481

Fri May 01 03:06:43 2015

System state dump requested by (instance=3, osid=28660 (LMS0)), summary=[abnormal instance termination].
System State dumped to trace file /xddv1covd/oracle/diag/rdbms/xrcovd/XRCOVD3/trace/XRCOVD3_diag_28625.trc

While this is happening, "netstat -s" shows a large jump in the "packet reassembles failed" counter:

==>> before the issue, the following number is more or less stable or increasing slowly
6817 packet reassembles failed
....
==>> in 30 minutes it increased by 50
6867 packet reassembles failed
==>> now the issue is happening and in 10 seconds it increased by 7533 - 6867 = 666
7533 packet reassembles failed
==>> in another 10 seconds it increased by 9630 - 7533 = 2097
9630 packet reassembles failed
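
The comparison above can be reproduced by sampling the counter at intervals. The following is only a minimal sketch: the function names reasm_fails and reasm_delta are hypothetical helpers, not part of any Oracle or Red Hat tooling, and the parsing assumes the "netstat -s" line has the form shown in the excerpt above.

```shell
# Extract the "packet reassembles failed" counter from `netstat -s`
# output supplied on stdin. Prints 0 if the line is absent.
# (reasm_fails is a hypothetical helper name.)
reasm_fails() {
    awk '/packet reassembles failed/ { print $1; found = 1; exit }
         END { if (!found) print 0 }'
}

# Print the increase between two samples: reasm_delta BEFORE AFTER
# (reasm_delta is a hypothetical helper name.)
reasm_delta() {
    echo $(( $2 - $1 ))
}

# Example use, 10 seconds apart (left commented out so that sourcing
# this file has no side effects):
#   before=$(netstat -s | reasm_fails)
#   sleep 10
#   after=$(netstat -s | reasm_fails)
#   echo "increase in 10s: $(reasm_delta "$before" "$after")"
```

With the figures from the excerpt, reasm_delta 6867 7533 gives 666 and reasm_delta 7533 9630 gives 2097. An increase of hundreds or thousands in 10 seconds stands out against the earlier rate of 50 in 30 minutes.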

Other symptoms could be:

1. node eviction

2. after an instance/node eviction, the instance/node cannot rejoin the cluster until the node where "packet reassembles failed" is increasing has been rebooted

CAUSE

RHEL 6.6 includes a few ipfrag fixes and increased the default ipfrag_*_thresh values:

cat /proc/sys/net/ipv4/ipfrag_low_thresh

3145728

cat /proc/sys/net/ipv4/ipfrag_high_thresh

4194304
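
The values printed by these files are byte counts, so the defaults above correspond to 3 MB and 4 MB. As a rough sketch, the check can be wrapped in a small helper. The names bytes_to_mib and show_ipfrag_thresh are hypothetical, and the optional directory argument exists only so the function can be tried against a copy of the files.

```shell
# Convert a byte count to whole mebibytes (integer division).
# (bytes_to_mib is a hypothetical helper name.)
bytes_to_mib() {
    echo $(( $1 / 1048576 ))
}

# Print the ipfrag thresholds found in a directory, defaulting to
# /proc/sys/net/ipv4. (show_ipfrag_thresh is a hypothetical helper name.)
show_ipfrag_thresh() {
    dir=${1:-/proc/sys/net/ipv4}
    for name in ipfrag_low_thresh ipfrag_high_thresh; do
        if [ -r "$dir/$name" ]; then
            bytes=$(cat "$dir/$name")
            echo "$name = $bytes bytes ($(bytes_to_mib "$bytes") MiB)"
        fi
    done
}
```

Calling show_ipfrag_thresh with no argument reads the live kernel settings.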

However, the issue still occurs. For Oracle Linux running the Red Hat compatible kernel, it is being tracked in:

BUG 21036841 - LCOV5/7/17 SERVER CRASHED AFTER PATCH UPGRADE AND KERNEL UPGRADE

SOLUTION

The issue is not fixed at the time of this writing. The temporary workaround is to enable jumbo frames

or

increase the following kernel parameters to the values shown:

net.ipv4.ipfrag_high_thresh = 16M
net.ipv4.ipfrag_low_thresh = 15M

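The values above are given in MB. The kernel parameters themselves are set in bytes, so 16M corresponds to 16777216 and 15M to 15728640.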
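
Because the kernel takes these thresholds in bytes, the 16M and 15M figures need converting before they are applied. The sketch below assumes they mean 16 and 15 megabytes of 1048576 bytes each, which matches the units of the defaults shown in the CAUSE section. The helper names mib_to_bytes and ipfrag_sysctl_lines are hypothetical, and the interface name in the jumbo frame comment is a placeholder.

```shell
# Convert whole mebibytes to bytes.
# (mib_to_bytes is a hypothetical helper name.)
mib_to_bytes() {
    echo $(( $1 * 1024 * 1024 ))
}

# Print sysctl.conf-style lines for the workaround values.
# (ipfrag_sysctl_lines is a hypothetical helper name.)
ipfrag_sysctl_lines() {
    echo "net.ipv4.ipfrag_high_thresh = $(mib_to_bytes 16)"
    echo "net.ipv4.ipfrag_low_thresh = $(mib_to_bytes 15)"
}

# To apply at runtime as root (not executed here):
#   sysctl -w net.ipv4.ipfrag_high_thresh=16777216
#   sysctl -w net.ipv4.ipfrag_low_thresh=15728640
# To persist across reboots, append the output of ipfrag_sysctl_lines
# to /etc/sysctl.conf and run: sysctl -p
#
# The alternative workaround, jumbo frames, is typically an MTU of 9000
# and must be supported end to end by the network, for example:
#   ip link set dev <interface> mtu 9000    # <interface> is a placeholder
```

Setting the high threshold first avoids a moment where the low threshold exceeds the high one.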

Here is one reason not to use RHEL 6.6.
