环境说明:
OS:Redhat 7.5 DB: Oracle 11.2.0.4.0 RAC 节点1:10.1.1.103 节点2:10.1.1.101
/etc/hosts配置:
cat /etc/hosts ...... 10.1.1.103 chen-cjc-01 10.1.1.101 chen-cjc-02 10.1.1.102 chen-cjc-01-vip 10.1.1.103 chen-cjc-02-vip 25.5.255.13 chen-cjc-01-pri 25.5.255.14 chen-cjc-02-pri 10.1.1.105 chen-scan
问题:
节点1,2内存使用率较高,需要进行内存扩容。
解决方案:
停机扩容内存。 1.停应用服务。 2.停库。 3.停集群。 4.停服务器扩容内存。 5.启动集群。 6.启动数据库。 7.应用修改连接数据库地址,改为连接scan ip。 8.启动应用服务。 9.验证。
实施过程中遇到的问题:
1.由于历史原因,应用没有连接数据库scan ip,只连接了节点1 vip 10.1.1.102,失去了高可用功能,需要停应用服务后进行数据库扩容。 2.Oracle ASM共享存储在NAS上,经检查发现没有设置NAS开机自动挂载,需要开机后手动挂载,需要提前将Oracle crs自动启动服务关闭,启动服务器后,先手动挂载NAS,在手动启动CRS。 3.Oracle 11.2.0.4.0 RAC安装在Redhat 7及以上版本,无法直接启动ohas服务,需要打补丁或手动添加ohas服务。 4.启动CRS时,后台日志提示无法找到voting files,导致集群无法启动,根据ocssd日志,优化NAS挂载参数。
具体问题如下:
1.停节点2数据库实例时,速度较慢,耗时9分钟。
主要原因,服务器性能较差,建议下次停实例可以考虑先提前手动中止会话,在停实例。
停止实例日志如下:
Thu Feb 10 18:42:11 2022 Shutting down instance (immediate) Stopping background process SMCO Shutting down instance: further logons disabled Stopping background process QMNC Stopping background process MMNL Stopping background process MMON License high water mark = 23 Thu Feb 10 18:42:38 2022 Reconfiguration started (old inc 4, new inc 6) List of instances: 2 (myinst: 2) Global Resource Directory frozen * dead instance detected - domain 0 invalid = TRUE Communication channels reestablished Master broadcasted resource hash value bitmaps Non-local Process blocks cleaned out Thu Feb 10 18:42:38 2022 LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived Thu Feb 10 18:42:38 2022 LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived Set master node info Submitted all remote-enqueue requests Dwn-cvts replayed, VALBLKs dubious All grantable enqueues granted Post SMON to start 1st pass IR Thu Feb 10 18:42:38 2022 Instance recovery: looking for dead threads Instance recovery: lock domain invalid but no dead threads Submitted all GCS remote-cache requests Post SMON to start 1st pass IR Fix write in gcs resources Reconfiguration complete Thu Feb 10 18:43:38 2022 Decreasing number of real time LMS from 2 to 0 Thu Feb 10 18:47:14 2022 Active call for process 5123 user 'oracle' program 'oracle@chen-cjc-02 (TNS V1-V3)' Active call for process 30720 user 'oracle' program 'oracle@chen-cjc-02 (TNS V1-V3)' Active call for process 1908 user 'oracle' program 'oracle@chen-cjc-02 (TNS V1-V3)' Active call for process 27349 user 'oracle' program 'oracle@chen-cjc-02 (TNS V1-V3)' Active call for process 1753 user 'oracle' program 'oracle@chen-cjc-02 (TNS V1-V3)' Active call for process 6241 user 'oracle' program 'oracle@chen-cjc-02 (TNS V1-V3)' Active call for process 22201 user 'oracle' program 'oracle@chen-cjc-02 (TNS V1-V3)' Active call for process 21133 user 'oracle' program 'oracle@chen-cjc-02 (TNS V1-V3)' SHUTDOWN: waiting for active calls to complete. Thu Feb 10 18:52:13 2022 License high water mark = 23 USER (ospid: 23318): terminating the instance Instance terminated by USER, pid = 23318
2.crs无法启动。
启动服务器后,系统工程师挂载NAS,DBA启动crs服务时,前台命令挂起,长时间无响应。
###crsctl start crs;
检查后台集群日志、crs日志、ohas日志、ocss日志均没有任何输出。
反复中止、启动crs后,过一段时间,前台报错CRS-3124:
root@chen-cjc-02:/oracle/db/grid/product/11.2.0/bin#./crsctl start crs CRS-4124: Oracle High Availability Services startup failed. CRS-4000: Command Start failed, or completed with errors.
问题原因:
Oracle 11.2.0.4.0 RAC安装在Redhat 7及以上版本,无法直接启动ohas服务,需要打补丁或手动添加ohas服务。
处理过程:
创建服务ohas.service的服务文件并赋予权限
su - root touch /usr/lib/systemd/system/ohas.service chmod 777 /usr/lib/systemd/system/ohas.service
往ohas.service服务文件添加启动ohasd的相关信息
vi /usr/lib/systemd/system/ohas.service [Unit] Description=Oracle High Availability Services After=syslog.target [Service] ExecStart=/etc/init.d/init.ohasd run >/dev/null 2>&1 Type=simple Restart=always [Install] WantedBy=multi-user.target
重新加载守护进程
systemctl daemon-reload
设置守护进程自动启动
###systemctl enable ohas.service
手工启动ohas服务
systemctl start ohas.service systemctl status ohas.service
3.CRS-1714:Unable to discover any voting files
手动添加并启动ohas服务器,启动crs前台没有报错,很快返回结果。
###crsctl start crs
检查后台集群日志,集群启动失败。
2022-02-10 19:37:03.771: [cssd(12250)]CRS-1714:Unable to discover any voting files, retrying discovery in 15 seconds; Details at (:CSSNM00070:) in /oracle/db/grid/product/11.2.0/log/chen-cjc-02/cssd/ocssd.log 2022-02-10 19:37:18.783: [cssd(12250)]CRS-1714:Unable to discover any voting files, retrying discovery in 15 seconds; Details at (:CSSNM00070:) in /oracle/db/grid/product/11.2.0/log/chen-cjc-02/cssd/ocssd.log
继续检查ocssd.log日志,提示NFS配置不正确,rsize和wsize值不正确。
2022-02-10 19:43:19.113: [ SKGFD][3605935872]running stat on disk:/oradata/vote3 2022-02-10 19:43:19.113: [ SKGFD][3605935872]WARNING:NFS file system /oradata mounted with incorrect options(rw,relatime,vers=3,rsize=16384,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=25.17.1.2,mountvers=3,mountport=2050,mountproto=tcp,local_lock=none,addr=25.17.1.2) 2022-02-10 19:43:19.113: [ SKGFD][3605935872]WARNING:Expected NFS mount options: rsize>=32768,wsize>=32768,hard,
继续检查nfs挂载参数
检查fstab里记录的nas挂载参数
cat /etc/fstab #mount -o rw,bg,hard,nointr,nolock,tcp,nfsvers=3,timeo=600,rsize=32768,wsize=32768,actimeo=0 25.17.1.2:/CJC_db_oradata_01_nfs /oradata
检查当前挂载参数
mount |grep oradata 25.17.1.2:/CJC_db_oradata_01_nfs on /oradata type nfs (rw,relatime,vers=3,rsize=16384,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=25.17.1.2,mountvers=3,mountport=2050,mountproto=tcp,local_lock=none,addr=25.17.1.2)
检查历史挂载命令
root@chen-cjc-01:/oracle/db/grid/product/11.2.0/log/chen-cjc-01#history |grep mount ...... 1194 2020-09-16-10:05:37 [root]umount /oradata 1195 2020-09-16-10:05:44 [root]mount -o rw,bg,hard,nointr,nolock,tcp,nfsvers=3,timeo=600,rsize=32768,wsize=32768,actimeo=0 25.17.1.2:/CJC_db_oradata_01_nfs /oradata
手动重新挂载
两节点分别停止crs服务
crsctl stop crs ps -ef|grep d.bin
两节点分别重新挂载
###umount /oradata ###mount -o rw,bg,hard,nointr,nolock,tcp,nfsvers=3,timeo=600,rsize=32768,wsize=32768,actimeo=0 25.17.1.2:/CJC_db_oradata_01_nfs /oradata
重新挂载后,crs和数据库实例可以正常启动。
验证:
检查数据库五个IP地址是否可以正常连接数据库。
###sqlplus CJC/***@10.1.1.102:15212/CJC ###sqlplus CJC/***@10.1.1.103:15212/CJC ###sqlplus CJC/***@10.1.1.105:15212/CJC ###sqlplus CJC/***@10.1.1.101:15212/CJC ###sqlplus CJC/***@10.1.1.104:15212/CJC
待应用服务修改完IP地址后,检查数据库连接情况,确保节点1,2都有业务连接。
SQL> select inst_id,username,count(*) from gv$session group by inst_id,username order by 1,2; INST_ID USERNAME COUNT(*) ---------- ------------------------------ ---------- 1 CJC 42 1 MONITOR 3 1 SYS 5 1 41 2 CJC 42 2 SYS 3 2 39 7 rows selected.
#####chenjuchao 20220212#####