A Brief Analysis of a Failure During HP Two-Node Cluster Configuration

2015-11-27

In real-world work, to preserve a customer's application environment configuration as fully as possible, engineers often use techniques that maximize environment retention, such as system-disk cloning or full-system backup and restore. This article analyzes one failure that can arise from this practice, for reference.

I. Symptom

When configuring the HP cluster, first run the cluster environment check:

# cmquerycl -n hlrdb1 -n hlrdb2 -v -C /etc/cmcluster/clhlrdb.ascii

Begin checking the nodes...

Warning: Unable to determine local domain name for hlrdb1

Looking for other clusters ... Done

Gathering configuration information ..

Gathering storage information ..

Found 10 devices on node hlrdb1

Found 16 devices on node hlrdb2

Analysis of 26 devices should take approximately 4 seconds

 0%----10%----20%----30%----40%----50%----60%----70%----80%----90%----100%

 Found 3 volume groups on node hlrdb1

Found 3 volume groups on node hlrdb2

 Analysis of 6 volume groups should take approximately 1 seconds

 0%----10%----20%----30%----40%----50%----60%----70%----80%----90%----100%

 .....

Gathering Network Configuration ....... Done

Note: Disks were discovered which are not in use by either LVM or VxVM.

       Use pvcreate(1M) to initialize a disk for LVM or,

      use vxdiskadm(1M) to initialize a disk for VxVM.

Warning: Volume group /dev/vg00 is configured differently on node hlrdb1 than on

  node hlrdb2

Error: Volume group /dev/vg00 on node hlrdb2 does not appear to have a physical

volume corresponding to /dev/dsk/c2t1d0 on node hlrdb1 (24142709971117077841).

Warning: Volume group /dev/vg00 is configured differently on node hlrdb2 than on

  node hlrdb1

Error: Volume group /dev/vg00 on node hlrdb1 does not appear to have a physical

volume corresponding to /dev/dsk/c2t1d0 on node hlrdb2 (24142709991117033934).

Warning: The volume group /dev/vg00 is activated on more than one node:

hlrdb1

hlrdb2

Warning: Volume groups should not be activated on more than one node.

Use vgchange to de-activate a volume group on a node.

Failed to gather configuration information.

From the output above, the cluster check reports that vg00 is activated on both nodes at the same time; the check therefore fails and configuration cannot continue.

II. Analysis and Root Cause

Searching the relevant technical documentation yields the following information:

When a volume group is created, it is given a unique VGID - a merger of the server's machine ID (uname -i) and the timestamp of the VG creation date. To save time, administrators may have used dd or copyutil to clone vg00 onto another server's disks. Unfortunately, this also copies the same VGID to the new server.

MC/ServiceGuard utilizes LVM structures such as Volume Group ID (VGID) and Physical Volume ID (PVID) to determine which VGs are shared (common to both servers). If each server's vg00 PVID and VGID are the same, cmquerycl (ServiceGuard) considers them to be the same VG. Subsequent alterations of vg00 LVM structures (such as adding a disk) are interpreted by cmquerycl as unresolvable LVM differences between servers - terminating the command.

Roughly, the above means: the cluster uses the VGID and PVID to determine whether a VG is a shared disk. When the disks holding each node's vg00 are clones of one another, the identical VGIDs cause this problem.
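The check's reasoning can be sketched in plain POSIX shell. The variable names and the VGID value below are hypothetical, used only to illustrate the comparison cmquerycl effectively performs; real VGIDs live in the LVM headers on disk:

```shell
#!/bin/sh
# Hypothetical VGIDs for vg00 on each node (machine ID merged with the
# VG creation timestamp). After cloning node A's system disk to node B,
# both nodes end up with the same value.
vgid_node_a="2414270997-1117077841"
vgid_node_b="2414270997-1117077841"

# cmquerycl-style reasoning: identical VGIDs mean "same (shared) VG".
if [ "$vgid_node_a" = "$vgid_node_b" ]; then
    echo "vg00 treated as SHARED - check fails, root VG cannot be shared"
else
    echo "vg00 treated as two distinct local VGs - check passes"
fi
```

This is why the check fails even though each node's vg00 is physically a separate local disk: the identity comparison is done on the VGID, not on the hardware path.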

III. Resolution

The fix is as follows:

0. Run lvlnboot -v /dev/vg00 and save the output

1. Boot into LVM maintenance mode

   #shutdown -ry 0

   ISL>hpux -lm

2. Export vg00 and save its configuration

   #vgexport -v -m vg00.map /dev/vg00

3. Change the VGID; if the VG is mirrored, specify both disks at the same time

   #vgchgid -f /dev/rdsk/c2t0d0 /dev/rdsk/c2t1d0

Enter 'y' when prompted.

4. Re-import the VG

   #mkdir /dev/vg00

   #mknod /dev/vg00/group c 64 0x000000

   #vgimport -v -m vg00.map /dev/vg00 /dev/rdsk/c2t0d0 /dev/rdsk/c2t1d0

5. Restore the LVM boot configuration

   #vgchange -a y vg00

   #lvlnboot -r /dev/vg00/lvol3

   #lvlnboot -s /dev/vg00/lvol2

   #lvlnboot -d /dev/vg00/lvol2

   #lvlnboot -b /dev/vg00/lvol1

   #lvlnboot -R /dev/vg00

Run lvlnboot -v /dev/vg00 again and compare with the output saved in step 0; the two should be identical.
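A simple way to do that comparison is to capture both outputs to files and diff them. The file names below are illustrative, and placeholder content stands in for the real lvlnboot output so the control flow can be shown outside HP-UX:

```shell
#!/bin/sh
# On a real system the two files would be produced by:
#   lvlnboot -v /dev/vg00 > /tmp/lvlnboot.before   (step 0)
#   lvlnboot -v /dev/vg00 > /tmp/lvlnboot.after    (after step 5)
# Placeholder content is used here instead.
printf 'Boot Definitions for Volume Group /dev/vg00\n' > /tmp/lvlnboot.before
printf 'Boot Definitions for Volume Group /dev/vg00\n' > /tmp/lvlnboot.after

if diff /tmp/lvlnboot.before /tmp/lvlnboot.after > /dev/null; then
    echo "lvlnboot output unchanged - boot configuration restored correctly"
else
    echo "lvlnboot output differs - re-check the lvlnboot commands above"
fi
```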

6. Fix the default LVM boot command

   #mount /usr

   #lifcp /dev/rdsk/c2t0d0:AUTO -

   #lifcp /dev/rdsk/c2t1d0:AUTO -

If the output of either of the two commands above is not "hpux", run the following command against the affected disk to fix it:

   #mkboot -a "hpux" /dev/rdsk/c2t1d0
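The check-and-fix logic of step 6 can be sketched as follows. Since lifcp and mkboot exist only on HP-UX, lifcp is stubbed out here with a hypothetical wrong AUTO string, and the mkboot invocation is printed rather than executed; on a real system you would call the actual commands:

```shell
#!/bin/sh
# Stub for 'lifcp /dev/rdsk/cXtYdZ:AUTO -', which prints the LIF AUTO file.
# Simulates an AUTO string that is not the expected "hpux".
lifcp_auto() {
    echo "boot vmunix"    # hypothetical wrong content
}

for disk in /dev/rdsk/c2t0d0 /dev/rdsk/c2t1d0; do
    auto=$(lifcp_auto "$disk")
    if [ "$auto" != "hpux" ]; then
        # Real command on HP-UX: mkboot -a "hpux" "$disk"
        # Here we only print the command that would be run.
        echo "mkboot -a \"hpux\" $disk"
    fi
done
```

Checking both disks matters because the VG is mirrored: either boot disk may be used at the next boot, so both AUTO files must contain the correct boot string.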

7. Reboot

   #shutdown -ry 0

IV. Summary and Lessons Learned

The root cause is that node B's system was cloned from node A, so both systems' vg00 carried the same VGID. Every volume group has a unique VGID, and the cluster identifies shared volume groups by checking whether two VGs have the same VGID. With identical vg00 VGIDs on both systems, the cluster check identified vg00 as a shared volume group, which violates the cluster's design, so the check failed and configuration could not continue. Changing the volume group's VGID in maintenance mode resolves the problem.
