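The voting disk (votedisk) is critical to RAC, whether on 10g Clusterware or 11g Grid Infrastructure (GI). It acts as the quorum disk: when a node in the cluster fails and drops off the network, the voting disk is used to decide whether to evict that node so that the rest of the cluster keeps running normally. If the voting disk is damaged, the cluster services cannot start, no cluster resources can be loaded, and the cluster stops working. The voting disk therefore needs to be backed up regularly. In 11g (as of 11gR2), the voting disk and OCR are placed in an ASM disk group by default, so they need less special attention. 10g Clusterware cannot store them in an ASM disk group and can only use raw devices, so pay particular attention to the voting disk and back it up regularly, for example:
Backing up and restoring the voting disk with dd:
Backup: dd if=/dev/raw/raw3 of=/tmp/votedisk.bak
Restore: dd if=/tmp/votedisk.bak of=/dev/raw/raw3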
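The backup command above can be wrapped in a small function that also verifies the copy. This is only a sketch: the function name backup_votedisk is hypothetical, and the checksum comparison assumes CRS is stopped on all nodes while it runs, because a voting disk that is still being written to would not match its copy.

```shell
#!/bin/sh
# backup_votedisk SRC DEST
# Copy SRC (e.g. a raw device) to DEST with dd, then compare checksums.
# Assumption: CRS is stopped, so SRC does not change during the copy.
backup_votedisk() {
    src=$1
    dest=$2
    # Take the backup exactly as in the article, with an explicit block size.
    dd if="$src" of="$dest" bs=4k 2>/dev/null || return 1
    # Read the source through dd again so raw devices are read in aligned blocks.
    src_sum=$(dd if="$src" bs=4k 2>/dev/null | cksum)
    dest_sum=$(cksum < "$dest")
    if [ "$src_sum" != "$dest_sum" ]; then
        echo "verification failed for $dest" >&2
        return 1
    fi
    echo "backup OK: $dest"
}

# Example call (paths taken from the article):
#   backup_votedisk /dev/raw/raw3 /tmp/votedisk.bak
```

Storing the backup somewhere other than /tmp is advisable, since /tmp is often cleaned on reboot.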
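If, unfortunately, no backup was ever taken and no mirror was configured, the only option when the voting disk is damaged is to rebuild CRS. The process is demonstrated below:
-- Stop CRS, then corrupt the voting disk, which here is /dev/raw/raw3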
[root@rac1 ~]# dd if=/dev/zero of=/dev/raw/raw3 bs=4096 count=12800
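After CRS is restarted, it reports that it cannot start. The ocssd.log log file contains entries showing that the disk is corrupt.
PS: 10g Clusterware logs are located under $ORA_CRS_HOME/log/<hostname>/...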
[ CSSD]2015-01-16 09:37:38.327 >USER:
Oracle Database 10g CSS Release 10.2.0.1.0 Production Copyright 1996, 2094
Oracle. All rights reserved.
[ CSSD]2015-01-16 09:37:38.327 >USER: CSS daemon log for node rac1, number 1, in cluster cluster
[ clsdmt]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=rac1DBG_CSSD))
[ CSSD]2015-01-16 09:37:38.332 [3059615952] >TRACE: clssscmain: local-only set to false
[ CSSD]2015-01-16 09:37:38.344 [3059615952] >TRACE: clssnmReadNodeInfo: added node 1 (rac1) to cluster
[ CSSD]2015-01-16 09:37:38.352 [3059615952] >TRACE: clssnmReadNodeInfo: added node 2 (rac2) to cluster
[ CSSD]2015-01-16 09:37:38.356 [3032808336] >TRACE: clssnm_skgxnmon: skgxn init failed, rc 1
[ CSSD]2015-01-16 09:37:38.356 [3059615952] >TRACE: clssnm_skgxnonline: Using vacuous skgxn monitor
[ CSSD]2015-01-16 09:37:38.362 [3059615952] >TRACE: clssnmDiskStateChange: state from 1 to 2 disk (0//dev/raw/raw3)
[ CSSD]2015-01-16 09:37:40.381 [3032808336] >TRACE: clssnmvDiskOpen: corrupt kill block on disk (0x09!=0x636c73536b696c4c)
[ CSSD]2015-01-16 09:37:40.381 [3032808336] >TRACE: clssnmDiskStateChange: state from 2 to 3 disk (0//dev/raw/raw3)
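To spot this condition quickly, the log can be searched for the "clssnmvDiskOpen: corrupt" message shown in the excerpt. The sketch below is an assumption-laden helper, not an Oracle tool: the function name find_votedisk_corruption is hypothetical, and it only matches the message text seen in this particular log.

```shell
#!/bin/sh
# find_votedisk_corruption LOGFILE
# Print CSSD log lines that report a corrupt voting disk.
# Returns 0 if at least one such line is found, 1 otherwise (like grep).
find_votedisk_corruption() {
    log=$1
    # -F treats the pattern as a fixed string, not a regular expression.
    if grep -F 'clssnmvDiskOpen: corrupt' "$log"; then
        return 0
    fi
    echo "no voting disk corruption reported in $log" >&2
    return 1
}

# Example call: pass the path to ocssd.log on the affected node.
#   find_votedisk_corruption /path/to/ocssd.log
```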
Rebuilding CRS is simple; it only takes running two scripts:
1.$ORA_CRS_HOME/install/rootdelete.sh
2.$ORA_CRS_HOME/install/rootdeinstall.sh
Node 1:
[root@rac1 install]# ./rootdelete.sh
Shutting down
Oracle Cluster Ready Services (CRS):
Stopping resources.
Error while stopping resources. Possible cause: CRSD is down.
Stopping CSSD.
Unable to communicate with the CSS daemon.
Shutdown has begun. The daemons should exit soon.
Checking to see if
Oracle CRS stack is down...
Oracle CRS stack is not running.
Removing script for
Oracle Cluster Ready services
Updating ocr file for downgrade
Cleaning up SCR settings in '/etc/oracle/scls_scr'
[root@rac1 install]# ./rootdeinstall.sh
Removing contents from OCR device
2560+0 records in
2560+0 records out
10485760 bytes (10 MB) copied, 0.590608 seconds, 17.8 MB/s
Node 2:
[root@rac2 install]# ./rootdelete.sh
Shutting down
Oracle Cluster Ready Services (CRS):
OCR initialization failed with invalid format: PROC-22: The OCR backend has an invalid format
Shutdown has begun. The daemons should exit soon.
Checking to see if
Oracle CRS stack is down...
Oracle CRS stack is not running.
Removing script for
Oracle Cluster Ready services
Updating ocr file for downgrade
Cleaning up SCR settings in '/etc/oracle/scls_scr'
[root@rac2 install]# ./rootdeinstall.sh
Removing contents from OCR device
2560+0 records in
2560+0 records out
10485760 bytes (10 MB) copied, 0.627909 seconds, 16.7 MB/s
[root@rac2 install]# dd if=/dev/zero of=/dev/raw/raw3 bs=4096 count=128000
dd: writing `/dev/raw/raw3': No space left on device
25601+0 records in
25600+0 records out
104857600 bytes (105 MB) copied, 5.40456 seconds, 19.4 MB/s
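Then run $ORA_CRS_HOME/root.sh again on the two nodes, one after the other. The software itself does not need to be reinstalled with OUI.
If the scripts fail to clean up successfully, then to allow CRS to be reinstalled cleanly you can manually delete the following files and directories: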
rm /etc/oracle/*
rm -f /etc/init.d/init.cssd
rm -f /etc/init.d/init.crs
rm -f /etc/init.d/init.crsd
rm -f /etc/init.d/init.evmd
rm -f /etc/rc2.d/K96init.crs
rm -f /etc/rc2.d/S96init.crs
rm -f /etc/rc3.d/K96init.crs
rm -f /etc/rc3.d/S96init.crs
rm -f /etc/rc5.d/K96init.crs
rm -f /etc/rc5.d/S96init.crs
rm -Rf /etc/oracle/scls_scr
rm -f /etc/inittab.crs
cp /etc/inittab.orig /etc/inittab
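Before deleting anything by hand, it can help to see which of these files are actually present. The following is a minimal dry-run sketch; the function name list_crs_leftovers and its optional ROOT prefix argument are hypothetical and exist only so the check can be tried against a scratch directory. It only lists paths and deletes nothing.

```shell
#!/bin/sh
# list_crs_leftovers [ROOT]
# Print which of the CRS files listed above still exist.
# ROOT is an optional path prefix (hypothetical, for testing); default is the real /.
list_crs_leftovers() {
    root=${1:-}
    for f in \
        /etc/init.d/init.cssd /etc/init.d/init.crs \
        /etc/init.d/init.crsd /etc/init.d/init.evmd \
        /etc/rc2.d/K96init.crs /etc/rc2.d/S96init.crs \
        /etc/rc3.d/K96init.crs /etc/rc3.d/S96init.crs \
        /etc/rc5.d/K96init.crs /etc/rc5.d/S96init.crs \
        /etc/oracle/scls_scr /etc/inittab.crs
    do
        # -e matches files and directories; -L also catches dangling symlinks.
        if [ -e "$root$f" ] || [ -L "$root$f" ]; then
            echo "$root$f"
        fi
    done
}

# Example call on a real node (run as root):
#   list_crs_leftovers
```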
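Summary:
Normally the OCR and voting disk are configured with multiple mirrored copies for redundancy and, when they live on raw devices, are also backed up separately with dd, so they are rarely damaged or lost. If corruption does occur with no backup available, rebuilding CRS is the only way to resolve the problem. It is the DBA's last resort.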