Gathering HACMP Info

Often, when working with a cluster, you might want to rebuild it from scratch, rather than take the time to figure out what is broken. Here are some commands to gather basic info for AIX and email it to yourself. Obviously, change the email address at the end.

(
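# Everything runs in one subshell so stdout and stderr can be captured together and mailed at the end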
echo '#########################' 
echo '#########################' OS Level
echo '#########################' 
oslevel -s
echo '#########################' 
echo '#########################' HA Level
echo '#########################' 
halevel -s
echo '#########################' 
echo '#########################' System Info
echo '#########################' 
lsattr -El sys0
echo '#########################' 
echo '#########################' Cluster Exports
echo '#########################' 
cat /usr/es/sbin/cluster/etc/exports
echo '#########################' 
echo '#########################' System Exports
echo '#########################' 
cat /etc/exports
echo '#########################' 
echo '#########################' Physical Volumes
echo '#########################' 
lspv -u
echo '#########################' 
echo '#########################' Cluster ID
echo '#########################' 
/usr/es/sbin/cluster/utilities/cllsclstr
echo '#########################' 
echo '#########################' Cluster Heartbeat
echo '#########################' 
lscluster -d
echo '#########################' 
echo '#########################' Cluster Status
echo '#########################' 
/usr/es/sbin/cluster/utilities/cllscompstat
echo '#########################' 
echo '#########################' Cluster Dump
echo '#########################' 
/usr/es/sbin/cluster/utilities/cldump
echo '#########################' 
echo '#########################' Cluster Services
echo '#########################' 
/usr/es/sbin/cluster/utilities/cllsserv
echo '#########################' 
echo '#########################' Cluster App Monitors
echo '#########################' 
/usr/es/sbin/cluster/utilities/cllsappmon
echo '#########################' 
echo '#########################' Cluster Resource Group Variables
echo '#########################' 
for i in `/usr/es/sbin/cluster/utilities/cllsgrp` ; do echo '###################' $i ; /usr/es/sbin/cluster/utilities/cllsres -g $i ; done
echo '#########################' 
echo '#########################' Cluster Resource Group Details
echo '#########################' 
for i in `/usr/es/sbin/cluster/utilities/cllsgrp` ; do echo '###################' $i ; /usr/es/sbin/cluster/utilities/clshowres -g $i ; done
echo '#########################' 
echo '#########################' Cluster Interfaces
echo '#########################' 
/usr/es/sbin/cluster/utilities/cllsif
echo '#########################' 
echo '#########################' Network Interfaces
echo '#########################' 
ifconfig -a
echo '#########################' 
echo '#########################' Rhosts
echo '#########################' 
cat /.rhosts
echo '#########################' 
echo '#########################' root rhosts
echo '#########################' 
cat /root/.rhosts
echo '#########################' 
echo '#########################' cluster rhosts
echo '#########################' 
cat /etc/cluster/rhosts
echo '#########################' 
echo '#########################' New cluster rhosts
echo '#########################' 
cat /usr/es/sbin/cluster/etc/rhosts
echo '#########################' 
echo '#########################' Net monitor IPs
echo '#########################' 
cat /usr/es/sbin/cluster/netmon.cf
echo '#########################' 
echo '#########################' File Collections
echo '#########################' 
odmget HACMPfilecollection
echo '#########################' 
echo '#########################' Collection Files
echo '#########################' 
odmget HACMPfcfile
echo '#########################' 
echo '#########################' Free Major Numbers
echo '#########################' 
lvlstmajor
echo '#########################' 
echo '#########################' Example commands for VG Imports
echo '#########################' 
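# Builds one importvg command per non-rootvg/caavg VG, using the VG's current
# major number (getlvodm -d) and the PVID of one of its member disks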
for VG in `lsvg |egrep -v 'rootvg|caavg'`; do 
  echo `getlvodm -d $VG` `lspv | grep $VG | tr -s [:space:] | sort -k 2 | head -1` \
  | awk '{print "importvg -V" , $1 , "-y " , $4 , " " , $3 ; } ; ' ; done | sort
echo '#########################' 
echo '#########################' Volume Groups
echo '#########################' 
lsvg
echo '#########################' 
echo '#########################' Volume Group Details
echo '#########################' 
lsvg | xargs -n1 lsvg
echo '#########################' 
echo '#########################' Logical Volumes
echo '#########################' 
lsvg | xargs -n1 lsvg -l
echo '#########################' 
echo '#########################' Logical Volume Details
echo '#########################' 
lsvg | xargs -n1 lsvg -l | grep / | cut -f 1 -d \  | xargs -n1 lslv
echo '#########################' 
echo '#########################' Filesystems
echo '#########################' 
df -Pg
echo '#########################' 
echo '#########################' Mounts
echo '#########################' 
mount
echo '#########################' 
echo '#########################' Tunables from last boot
echo '#########################' 
cat /etc/tunables/lastboot
echo '#########################' 
echo '#########################' Device settings
echo '#########################' 
for i in `lsdev | egrep '^en|hdisk|fcs|fscsi' | cut -f1 -d\  ` ; do echo '#####################' $i ; lsattr -El $i ; done | egrep -v 'False$'
echo '#########################' 
echo '#########################' Crontab entries
echo '#########################' 
crontab -l
echo '#########################' 
echo '#########################' snmp config
echo '#########################' 
cat /etc/snmpdv3.conf
echo '#########################' END END END
) 2>&1 | mail -vs `hostname` jdavis@omnitech.net
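
If you need this more than once, save the block above as a script and run it on each node, e.g. (path and name are just examples):

chmod +x /usr/local/bin/gather_ha_info.ksh
/usr/local/bin/gather_ha_info.ksh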


AIX and PowerHA levels

Research shows these dates for AIX:
https://www.ibm.com/support/pages/aix-support-lifecycle-information
It’s generally 26 weeks, plus or minus, from the initial YYWW date. Once a TL/SP APAR releases, the YYWW code is updated (see the sketch after the list for decoding it on a live box).

  • 7300-01-01-2246 2022-12-02 (Next 2023Q2)
  • 7200-05-05-2246 2022-12-02 (Next 2023Q2)
  • 7100-05-10-2220 2022-09-09 (Next 2023Q1)
  • 6100-09-12-1846 2018-11-16 (EoL CSP)
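
To decode the YYWW build week on a live box, something like this works (a minimal sketch, assuming the usual VRMF-TL-SP-YYWW output from oslevel -s):

OSLVL=$(oslevel -s)       # e.g. 7200-05-05-2246
YYWW=${OSLVL##*-}         # 2246
echo "AIX $OSLVL was built in week ${YYWW#??} of 20${YYWW%??}"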

My AIX selection process would be:

  • AIX 7.3.1.1 from 2022 week 46 is what I have in my repo.  Another TL should be coming out 2023 Q1.  None of my customers run this, but you want this for POWER10.  Your NIM server should be at the latest of all of these versions as well.
  • AIX 7.2.5.5 from 2022 week 46 is what I have in my repo.  Another TL should be coming out 2023 Q1.  You probably want this for POWER7 and up.
  • AIX 7.1.5.10 from 2022 week 20 is what I have in my repo.  I think the CSP is 2023Q1.  Supports AIX 5.2 and 5.3 WPARs.  Not much reason to use this now other than some specific apps that are OK with OpenSSL, OpenSSH, and Java updates, but not kernel updates.
  • AIX 6.1.9.12 is where I stopped tracking.  No real need for AIX 6 anywhere.  Either you’re stuck on 5.3, or you came up to 7.1 (or ideally 7.2).  6.1.9.9 was needed for application compatibility on POWER9.
  • Anything POWER6 or older should really be upgraded, with p710 through p740 or s81x/s82x as replacements (cost-wise).  POWER8 is EoS 2024-10-31.  POWER7 was EoS in 2019.
  • AIX 5.3.12.9 + U866665 on POWER8 is end-stage.  AIX 5.3 was EoS in 2012, but some people still run it now.  POWER8 is EoS 2024-10-31.  POWER7 was 2019.
  • AIX 5.3 PTF U866665.bff (bos.mp64.5.3.12.10.U) enables POWER8.  AIX must be at 5.3.12.9, and must be patched before moving to POWER8.  The POWER8 must be on 840 firmware or later, and VIO must be 2.2.4.10 or later.  Migration is by LPM, NIM, or mksysb.  Downloading requires an active extended support agreement for AIX on the POWER8 systems to be on file.
  • AIX older – You should not be running anything older.  AIX 5.1 was all CHRP, and AIX 4.x was all PCI.  AIX 3.x was MicroChannel.  AIX 2.1 had some PS/2 systems.  Outside of a museum, on an isolated network (or no network because CYLONS!), just have this recycled.

My PowerHA (HA/CMP) selection process would be:
https://www.ibm.com/support/pages/powerha-aix-version-compatibility-matrix 

  • 7.2.7 Base is what I last grabbed.  OK for AIX 7.1.5, 7.2.5, 7.3.0 and later patch levels.
  • 7.1.3 SP09 was the end-stage for this.  OK for AIX 6.1.9.11, 7.1.3.9, 7.2.0.6, and later patch levels.  No AIX 5, and no AIX 7.3.
  • 6.1 SP15 was end-stage for this, and supported AIX 5.3.9, 6.1.2.1, 7.1.0, and later patch levels.  No AIX 5.1, 7.2, or 7.3.

Code sources:

UPDATE 2023-03-07:

  • Refreshed all of the info above to current.  If you’re on AIX 5.2, HA/CMP 5.x, or VIO 1.x, that’s really disappointing.
  • VIO should be 3.1.4.10.  Always go current whenever possible.  If not, 2.2.6.65 is your target.  If you have any VIO 1.x, upgrade.  Period.
  • System firmware, adapter microcode, disk microcode, tape microcode, and library firmware should all be latest available.
  • Storage array firmware should be latest LTS patch level, excluding any .0 versions.  I still don’t trust Data Reduction Pools.
  • ADSM/ITSM/TSM/Spectrum Protect/Storage Protect – These should typically be the final version supporting your OS/App combo.  I’m sweet on 8.1.17, though there are still major issues with deduplication pools when your database is over 3TB.  Containers and extents cannot be purged.  I have not seen SP 9.1, but I assume it will be extremely similar to SP 8.1.18 other than some minor rebadging.  Not sure.  I got dropped from the beta program because I didn’t have cycles to test their new code, and they didn’t have any interest in steering features.  They pick what they want to pick, and you’ll like it. 
  • IBM is phasing out Spectrum Protect Plus (Catalogic DPX), but might still be keeping Catalogic ECX (Copy Data Management).  In Storage Defender, IBM has picked up Cohesity DataProtect because of the cloud / DRaaS bits.  These all integrate with DS8000/Flash Systems for data immutability / vaulting / ransomware protection, and they want you to buy Rapid7 for the AI/Logic behind it.  I know regular ISP’s Operations Center anomaly detection is unusable due to its lack of adaptability/logic.  It just alerts on everything every week when you run weekly fulls, etc.

I don’t really track IBMi OS (OS400), zOS, zVM, etc.  IBM Storage should be rebranding this year though, but there is still no NFS/CIFS hardware.  At best, IBM sells a GPFS cluster with Ceph and some StorWize FS7200s.


AIX and PowerHA versions 2017-06

This changes periodically, but for today, here is what I would do.

My PowerHA selection process would be:
* 7.1.3 SP06 if I needed to deploy quickly, because I have build docs for that.
* 7.1.4 doesn’t exist, but if it came out before deployment, I would consider it: whichever was the newer release, the latest 7.1.3 SP or the latest 7.1.4 SP.
* 7.2.0 SP03 if they wanted longer support, but had time for me to work up the new procedures during the install.
* 7.2.1 SP01 if SP01 came out before I deployed, and had chosen 7.2.0 prior. 7.2.1.0 base is available, but that’s from Dec 2016, and 7.2.0.3 is from May 2017. Newer by date is better.

My AIX selection process would be:
* Any NIM server would be AIX 7.2, latest TLSP.
* Any application support limits would win, down to AIX 6.1 at the lowest, plus the latest TLSP.
* For POWER9, I would push 7.2, latest TLSP.
* For POWER8, I would push 7.1 or later, latest TLSP.
* For POWER7, I would push 6.1 or later, latest TLSP.
* For POWER6 or older, or AIX 5.3 or older, I would push strongly against due to support and parts limitations.

Code sources:
* I would make sure to install yum from ezinstall, and deploy GNU tar and rsync:
http://public.dhe.ibm.com/aix/freeSoftware/aixtoolbox/ezinstall/ppc/
* I would update openssh from the IBM Web Download expansion:
https://www-01.ibm.com/marketing/iwm/iwm/web/reg/pick.do?source=aixbp&lang=en_US
* If there is any exposure to the public net, or it is a high-sensitivity system, I would check AIX security patches also:
http://public.dhe.ibm.com/aix/efixes/security/?C=M;O=D
ftp://ftp.software.ibm.com/aix/efixes/security/
* I would get the latest service pack for both AIX and PowerHA from Fix Central:
https://www-945.ibm.com/support/fixcentral/
* Base media, if I were certain the customer was entitled, but didn’t want to wait for them to provide media, Partnerworld SWAC:
https://www-304.ibm.com/partnerworld/partnertools/eorderweb/ordersw.do
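
Before pulling fixes, a quick inventory of what the target box already has (a sketch; adjust the fileset filters to taste):

oslevel -s                              # AIX TL/SP level
halevel -s                              # PowerHA level
lslpp -L | egrep -i 'openssh|openssl'   # current ssh/ssl filesets
emgr -l                                 # interim fixes already installed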

Reference: PowerHA to AIX Support Matrix:
http://www-03.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/TD101347


PowerHA holds my disks

I did some testing and needed to document command syntaxen, even though I was not successful.
node01 / node02 – cannot remove EMC disks
apps are stopped

The fuser command will not detect processes that have mmap regions where that associated file descriptor has since been closed.

lsof | grep hdisk   ### nothing
fuser -fx /dev/hdisk2 ### nothing
fuser -d /dev/hdisk2 ### nothing
sudo filemon -O all -o 2.trc ; sleep 10 ; sudo trcstop   ### only shows the hottest 2 disks

### Not being able to remove disks after removing them from HA is related to this defect:
http://www-01.ibm.com/support/docview.wss?uid=isg1IV65140
/usr/es/sbin/cluster/events/utils/cl_vg_fence_term -c vgname

In PowerHA 7.1.3, with the shared VG varied off, and the
disk in closed state, rmdev may fail and return a
busy error, e.g.:

# rmdev -dl hdisk2
Method error (/usr/lib/methods/ucfgdevice):
0514-062 Cannot perform the requested function because
         the specified device is busy.

# cl_set_vg_fence_height
Usage: cl_set_vg_fence_height [-c]  [rw|ro|na|ff]

JDSD NOTE: The levels are:
* rw = readwrite
* ro = read only
* na = no access
* ff = fail access
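
Per the defect above, the intended workaround order would be something like this (a sketch only; as the rest of this section shows, it did not actually solve it here):

sudo /usr/es/sbin/cluster/cspoc/cl_getdisk hdisk2                   # check the current fence height
sudo /usr/es/sbin/cluster/events/utils/cl_vg_fence_term -c vgname   # drop the fence group for the varied-off VG
sudo rmdev -dl hdisk2                                               # then retry the remove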

jdsd@node01  /home/jdsd
$ sudo ls -laF /usr/es/sbin/cluster/events/utils/cl*fence*
-rwxr--r--    1 root     system        12832 Nov  7 2013  /usr/es/sbin/cluster/events/utils/cl_fence_vg*
-rwxr--r--    1 root     system        15624 Nov  7 2013  /usr/es/sbin/cluster/events/utils/cl_set_vg_fence_height*
-r-x------    1 root     system         5739 Nov  7 2013  /usr/es/sbin/cluster/events/utils/cl_ssa_fence*
-rwxr--r--    1 root     system        22508 Nov  7 2013  /usr/es/sbin/cluster/events/utils/cl_vg_fence_init*
-rwxr--r--    1 root     system         4035 Feb 26 2015  /usr/es/sbin/cluster/events/utils/cl_vg_fence_redo*
-rwxr--r--    1 root     system        15179 Oct 21 2014  /usr/es/sbin/cluster/events/utils/cl_vg_fence_term*


jdsd@node01  /home/jdsd
$ sudo ls -laF /usr/es/sbin/cluster/events/cspoc/cl*disk*
-r-x------    1 root     system       109726 Feb 26 2015  /usr/es/sbin/cluster/cspoc/cl_diskreplace*
-rwxr-xr-x    1 root     system        20669 Nov  7 2013  /usr/es/sbin/cluster/cspoc/cl_getdisk*
-r-x------    1 root     system       105962 Feb 26 2015  /usr/es/sbin/cluster/cspoc/cl_lsreplacementdisks*
-r-x------    1 root     system       103433 Feb 26 2015  /usr/es/sbin/cluster/cspoc/cl_lsrgvgdisks*
-rwxr-xr-x    1 root     system        12259 Feb 26 2015  /usr/es/sbin/cluster/cspoc/cl_pviddisklist*
-rwxr-xr-x    1 root     system         4929 Nov  7 2013  /usr/es/sbin/cluster/cspoc/cl_vg_non_dhb_disks*


jdsd@node01  /home/jdsd
$ sudo /usr/es/sbin/cluster/cspoc/cl_lsrgvgdisks
#Volume Group   hdisk    PVID             Cluster Node
#---------------------------------------------------------------------
caavg_private   hdisk38  00deadbeefcaff53 node01                        node01,node02 
datavg          hdisk22  00deadbeefca8643 node02                        node01,node02 demo_rg
datavg          hdisk23  00deadbeefca86f9 node02                        node01,node02 demo_rg
datavg          hdisk24  00deadbeefca8752 node02                        node01,node02 demo_rg
datavg          hdisk25  00deadbeefca87ac node02                        node01,node02 demo_rg
datavg          hdisk26  00deadbeefca880e node02                        node01,node02 demo_rg
datavg          hdisk27  00deadbeefca886c node02                        node01,node02 demo_rg
datavg          hdisk28  00deadbeefca88d7 node02                        node01,node02 demo_rg
datavg          hdisk29  00deadbeefca8965 node02                        node01,node02 demo_rg
datavg          hdisk30  00deadbeefca89c5 node02                        node01,node02 demo_rg
datavg          hdisk31  00deadbeefca8a52 node02                        node01,node02 demo_rg
datavg          hdisk32  00deadbeefca8ad2 node02                        node01,node02 demo_rg
datavg          hdisk33  00deadbeefca8b50 node02                        node01,node02 demo_rg
datavg          hdisk34  00deadbeefca8c26 node02                        node01,node02 demo_rg
datavg          hdisk35  00deadbeefca8c9a node02                        node01,node02 demo_rg
datavg          hdisk36  00deadbeefca8cf7 node02                        node01,node02 demo_rg
journalvg       hdisk37  00deadbeefca8d53 node02                        node01,node02 demo_rg


jdsd@node01  /home/jdsd
$ sudo /usr/es/sbin/cluster/cspoc/cl_getdisk hdisk2
Disk name:                      hdisk2
Disk UUID:                      1edeadbeefcafe04 b512d9e3b580fb13
Fence Group UUID:               0000000000000000 0000000000000000 - Not in a Fence Group
Disk device major/minor number: 18, 2
Fence height:                   2 (Read/Only)
Reserve mode:                   0 (No Reserve)
Disk Type:                      0x01 (Local access only)
Disk State:                     32785

It’s a concurrent VG, so updates made on node2 show up on node1.

From node 2

sudo extendvg journalvg hdisk2 hdisk3 hdisk4 hdisk5 hdisk6 hdisk7 hdisk8 hdisk9 hdisk10 hdisk11 hdisk12
sudo /usr/es/sbin/cluster/cspoc/cl_getdisk hdisk2
sudo /usr/es/sbin/cluster/cspoc/cl_getdisk hdisk37
# Shows RW

From node 1

sudo /usr/es/sbin/cluster/cspoc/cl_getdisk hdisk2
sudo /usr/es/sbin/cluster/cspoc/cl_getdisk hdisk37
# Shows RW

From node1

sudo /usr/es/sbin/cluster/events/utils/cl_set_vg_fence_height -c journalvg rw
sudo /usr/es/sbin/cluster/cspoc/cl_getdisk hdisk2
# Shows RW

From node2

sudo reducevg journalvg hdisk2 hdisk3 hdisk4 hdisk5 hdisk6 hdisk7 hdisk8 hdisk9 hdisk10 hdisk11 hdisk12
sudo /usr/es/sbin/cluster/cspoc/cl_getdisk hdisk2
# Shows RO

### OK, try again
From node 1

sudo mkvg -y dummyvg hdisk2 hdisk3 hdisk4 hdisk5 hdisk6 hdisk7 hdisk8 hdisk9 hdisk10 hdisk11 hdisk12
sudo varyoffvg dummyvg

From node 2

sudo importvg  -y dummyvg hdisk2
sudo /usr/es/sbin/cluster/events/utils/cl_set_vg_fence_height -c dummyvg rw
sudo /usr/es/sbin/cluster/cspoc/cl_getdisk hdisk2
### Still RO
sudo /usr/es/sbin/cluster/events/utils/cl_vg_fence_term -c dummyvg
sudo /usr/es/sbin/cluster/cspoc/cl_getdisk hdisk2
### Still RO
sudo varyoffvg dummyvg
sudo rmdev -Rl hdisk2

Both nodes

sudo exportvg dummyvg
sudo importvg -c -y dummyvg hdisk2
sudo /usr/es/sbin/cluster/cspoc/cl_getdisk hdisk2
### Still RO
sudo /usr/es/sbin/cluster/events/utils/cl_set_vg_fence_height -c dummyvg rw
sudo /usr/es/sbin/cluster/events/utils/cl_vg_fence_init -c dummyvg rw hdisk2
cl_vg_fence_init[279]: sfwAddFenceGroup(dummyvg, 1, hdisk2): No such device
sudo chvg -c dummyvg
sudo varyonvg -n -c -A -O dummyvg
sudo /usr/es/sbin/cluster/cspoc/cl_getdisk hdisk2
sudo /usr/es/sbin/cluster/cspoc/cl_getdisk hdisk3
### Still RO
sudo varyoffvg dummyvg

From Node 2
sudo rmdev -Rl hdisk2
Method error (/etc/methods/ucfgdevice):
        0514-062 Cannot perform the requested function because the
                 specified device is busy.

sudo /usr/es/sbin/cluster/events/utils/cl_vg_fence_redo -c dummyvg rw hdisk2
 /usr/es/sbin/cluster/events/utils/cl_vg_fence_redo: line 109: cl_vg_fence_init: not found
 cl_vg_fence_redo: Volume group dummyvg fence height could not be set to read/write

This is related to this defect, though we are on a later version:
http://www-01.ibm.com/support/docview.wss?uid=isg1IV52444

sudo su -
export PATH=$PATH:/usr/es/sbin/cluster/utilities:/usr/es/sbin/cluster/events/utils/:/usr/es/sbin/cluster/cspoc/:/usr/es/sbin/cluster/sbin:/usr/es/sbin/cluster
/usr/es/sbin/cluster/events/utils/cl_vg_fence_redo -c dummyvg rw hdisk2
 cl_vg_fence_init[279]: sfwAddFenceGroup(dummyvg, 11, hdisk2, hdisk3, hdisk4, hdisk5, hdisk6, hdisk7, hdisk8, hdisk9, hdisk10, hdisk11, hdisk12): No such device
 cl_vg_fence_redo: Volume group dummyvg fence height could not be set to read/write#
cd /dev
/usr/es/sbin/cluster/events/utils/cl_vg_fence_redo -c dummyvg rw hdisk2
 cl_vg_fence_init[279]: sfwAddFenceGroup(dummyvg, 11, hdisk2, hdisk3, hdisk4, hdisk5, hdisk6, hdisk7, hdisk8, hdisk9, hdisk10, hdisk11, hdisk12): No such device
 cl_vg_fence_redo: Volume group dummyvg fence height could not be set to read/write#

SIGH!

I give up. We will probably have to reboot.


PowerHA Quickbuild

Because the Facebook Notes editor has zero formatting functionality in the new version.

####################################
### POWERHA QUICKBUILD - SANITIZED
####################################
This is a list of all the commands I'm using to build this cluster.
It's been sanitized of any customer information.


####################################
### Cleanup
####################################
clrmclstr
rmcluster -n MYCLUSTER
y | rmcluster -r hdisk2
rmdev -Rdl cluster0
/usr/sbin/rsct/bin/cthagsctrl -z
/usr/sbin/rsct/bin/cthagsctrl -d
echo "cthags 12348/udp" >> /etc/services
/usr/sbin/rsct/bin/cthagsctrl -a
/usr/sbin/rsct/bin/cthagsctrl -s
stopsrc -s clcomd ; sleep 2 ; startsrc -s clcomd
rm /var/hacmp/adm/* /var/hacmp/log/* /var/hacmp/clverify/* /usr/es/sbin/cluster/etc/auto_versync.pid
no -po nonlocsrcroute=1
no -po ipsrcrouterecv=1
shutdown -Fr now


####################################
### System config
####################################
# oslevel -s
7100-04-01-1543

# halevel -s
7.1.3 SP4

# emgr -P
PACKAGE INSTALLER LABEL
======================================================== =========== ==========
openssl.base installp 101a_fix
bos.net.tcp.client installp IV79944s1a
openssh.base.server installp IV80743m9a
openssh.base.client installp IV80743m9a
bos.net.tcp.client installp IV80191s1a
bos.rte.control installp IV80586s1a

# cat /etc/hosts
127.0.0.1 localhost
10.0.0.1 gateway
10.0.0.10 mycluster MYCLUSTER
10.0.0.11 node1
10.0.0.12 node2


####################################
### Cluster communication
####################################
echo node1 > /etc/cluster/rhosts
echo node2 >> /etc/cluster/rhosts
cat /etc/cluster/rhosts > /usr/es/sbin/cluster/etc/rhosts
echo 10.0.0.11 >> /usr/es/sbin/cluster/etc/rhosts
echo 10.0.0.12 >> /usr/es/sbin/cluster/etc/rhosts
echo 10.0.0.1 > /usr/es/sbin/cluster/netmon.cf
stopsrc -s clcomd ; sleep 2 ; startsrc -s clcomd
sleep 10
cl_rsh -n node1 date
cl_rsh -n node2 date

####################################
### Basic cluster build
####################################
export CLUSTER=MYCLUSTER
export NODES="node2 node1"
export HBPVID=deadbeefcafe1234
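# HBPVID is the PVID of the shared repository disk (take it from lspv; it must be visible on both nodes)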
clmgr add cluster ${CLUSTER} NODES="$NODES"
clmgr modify cluster $CLUSTER REPOSITORY=$HBPVID HEARTBEAT_TYPE=unicast
cldare -rt
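
# Sanity check that CAA built the cluster and claimed the repository disk
# before going further (same commands as the verify step near the end):
/usr/es/sbin/cluster/utilities/cllsclstr
lscluster -m
lscluster -d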


####################################
### Add the service address
####################################
/usr/es/sbin/cluster/utilities/claddnode -Tservice -Bmycluster -wnet_ether_01 # -zignore
cllsif
cldare -rt


####################################
### file collections
####################################
clfilecollection -o coll -c Configuration_Files -'' -'AIX and HACMP config files' yes yes
clfilecollection -o coll -c HACMP_Files -'' -'HACMP Resource Group Files' yes yes
clfilecollection -o time -c 10
clfilecollection -o coll -a User_Files 'System user config' yes yes
clfilecollection -o file -a User_Files /etc/passwd
clfilecollection -o file -a User_Files /etc/group
clfilecollection -o file -a User_Files /etc/security/passwd
clfilecollection -o file -a User_Files /etc/security/limits
clfilecollection -o file -a User_Files /.profile
clfilecollection -o file -a User_Files /etc/environment
clfilecollection -o file -a User_Files /etc/profile
clfilecollection -o file -a User_Files /etc/exports
clfilecollection -o file -a User_Files /etc/sudoers
clfilecollection -o file -a User_Files /etc/qconfig
clfilecollection -o file -l Configuration_Files
clfilecollection -o file -l HACMP_Files
clfilecollection -o file -l User_Files


####################################
### mail events
####################################
/usr/es/sbin/cluster/utilities/claddcustom -t event -n'mail_event' \
-I'mail out when event occurs' -v'/usr/local/cluster/mail_event'
for EVENT in `cat /usr/local/cluster/mail_event.list`; do
/usr/es/sbin/cluster/utilities/clchevent -O"$EVENT" \
-s /usr/es/sbin/cluster/events/$EVENT -b mail_event -c 0
done
/usr/es/sbin/cluster/utilities/clacdNM -MA -nLVM_IO_FAIL -p0 -lLVM_IO_FAIL -m/usr/local/cluster/LVM_IO_FAIL
/usr/es/sbin/cluster/utilities/claddserv -s'my_app' \
-b'/usr/local/cluster/APP_start.ksh' -e'/usr/local/cluster/APP_stop.ksh'
/usr/es/sbin/cluster/utilities/claddserv -s'my_dsmc' \
-b'/usr/local/cluster/DSMC_start.ksh' -e'/usr/local/cluster/DSMC_stop.ksh'
cllsserv


####################################
### Resource group
####################################
/usr/es/sbin/cluster/utilities/claddgrp -g 'myclster_rg' -n 'node2 node1' -S 'OFAN' -O 'FNPN' -B 'FBHPN'
cllsgrp


####################################
### Resources
####################################
/usr/es/sbin/cluster/utilities/claddres -g 'myclster_rg' SERVICE_LABEL='myclster' \
APPLICATIONS='my_app my_dsmc' VOLUME_GROUP='prdappvg prdvg prdjrnvg' \
FORCED_VARYON='false' VG_AUTO_IMPORT='false' FILESYSTEM= FSCHECK_TOOL='fsck' \
RECOVERY_METHOD='sequential' PPRC_REP_RESOURCE='' FS_BEFORE_IPADDR='false' \
EXPORT_FILESYSTEM='' ERCMF_REP_RESOURCE='' MOUNT_FILESYSTEM='' \
NFS_NETWORK='' SHARED_TAPE_RESOURCES='' DISK='' AIX_FAST_CONNECT_SERVICES='' \
COMMUNICATION_LINKS='' MISC_DATA='' WPAR_NAME='' GMD_REP_RESOURCE='' SVCPPRC_REP_RESOURCE=''
cllsres
cllsres -g myclster_rg


####################################
### Application monitor
####################################
/usr/es/sbin/cluster/utilities/claddappmon MONITOR_TYPE=process name=my_dsmc_mon \
RESOURCE_TO_MONITOR=my_dsmc INVOCATION='longrunning' PROCESSES='dsm.opt.cluster' \
PROCESS_OWNER=root STABILIZATION_INTERVAL='60' RESTART_COUNT='3' FAILURE_ACTION='notify' \
INSTANCE_COUNT=1 RESTART_INTERVAL=360 NOTIFY_METHOD='/usr/local/cluster/mail_event' \
CLEANUP_METHOD='/usr/local/cluster/DSMC_stop.ksh' \
RESTART_METHOD='/usr/local/cluster/DSMC_start.ksh'
/usr/es/sbin/cluster/utilities/claddappmon name=my_app_mon \
RESOURCE_TO_MONITOR=my_app INVOCATION='both' MONITOR_TYPE=user \
STABILIZATION_INTERVAL=120 MONITOR_INTERVAL=120 \
RESTART_COUNT=3 RESTART_INTERVAL=800 FAILURE_ACTION=fallover \
NOTIFY_METHOD=/usr/local/cluster/mail_event FAILURE_ACTION='notify' \
CLEANUP_METHOD='/usr/local/cluster/APP_stop.ksh' \
RESTART_METHOD='/usr/local/cluster/APP_start.ksh' \
MONITOR_METHOD=/usr/local/cluster/APP_check.ksh HUNG_MONITOR_SIGNAL=9
cllsappmon
cllsappmon my_app_mon
cllsappmon my_dsmc_mon


####################################
### Sync all the changes
####################################
cldare -rt -C interactive


####################################
### Verify both nodes see it fine
####################################
cllsclstr
lscluster -m


####################################
### Start the cluster
####################################
smitty clstart


This is where it complains that hags is not up.
Rebooting does not bring up hags.
Manually starting it works, but it will die after 20 minutes or so.
Very little logging.
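
For what it’s worth, the checks I would run before giving up on cthags (a sketch; the log path is an assumption based on the standard RSCT layout):

lssrc -s cthags                        # subsystem state
grep cthags /etc/services              # confirm the port entry from the cleanup step is still there
errpt -a | more                        # look for GS / ConfigRM entries
ls /var/ct/*/log/cthags* 2>/dev/null   # RSCT group services logs, if present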

EVERY time I try to mess with HA, it’s broken. It’s always something different. Such a pain. Truly, I don’t know why people do not just use their own scripts.


cl_rsh fails

PROBLEM: On some migrations, we found the rpdomain would not stay running on one node.
The cluster was up, and SEEMED to operate normally, but errpt got CONFIGRM stop/start messages every minute.

lsrpdomain would show Offline, or “Pending online”.

lsrpnode would show:
2610-412 A Resource Manager terminated while attempting to enumerate resources for this command.
2610-408 Resource selection could not be performed.
2610-412 A Resource Manager terminated while attempting to enumerate resources for this command.
2610-408 Resource selection could not be performed.

On the other node, lsrpnode only showed itself, and lsrpdomain showed Online.

“cl_rsh node1 date” worked from both nodes
“cl_rsh node2 date” worked only from node2.
/etc/hosts, cllsif, hostname, /etc/cluster/rhosts… everything was spotless.
clcomd was running, even after refresh.
Same subnet, and ports were not filtered.
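
For reference, the checks behind “spotless” above, roughly (a sketch; the clcomd port is looked up rather than assumed):

lssrc -s clcomd                                           # daemon up on both nodes
cat /etc/cluster/rhosts /usr/es/sbin/cluster/etc/rhosts   # both rhosts files agree on both nodes
/usr/es/sbin/cluster/utilities/cllsif                     # interface labels match /etc/hosts and hostname
host node1 ; host node2                                   # name resolution works both ways
PORT=$(grep -iw clcomd /etc/services | head -1 | awk '{print $2}' | cut -d/ -f1)
[ -n "$PORT" ] && netstat -an | grep -w "$PORT"           # clcomd listening and not filtered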

Importing a snapshot said:
Warning: unable to verify inbound clcomd communication from
node "node1" to the local node, "node2".

I applied PowerHA 7.1.3 SP4, and no fix. I think this is a problem with clmigcheck or mkcluster in AIX.

SOLUTION
I saved a snapshot, blew away the cluster, and imported the snapshot.
/usr/es/sbin/cluster/utilities/clsnapshot -c -i -nmysnapshot -d "Snapshot before clrmcluster"
clstop -g -N
stopsrc -g cluster
clrmclstr
rmcluster -r hdisk10
# one node's SSHd died here.
rmdev -dl cluster0
cfgmgr
cl_rsh works all the way around now.
/usr/es/sbin/cluster/utilities/clsnapshot -a -n'mysnapshot' -f'false'
cllsclstr ; lscluster -m ; lsrpdomain ; lsrpnode

Works fine all around, before and after reboot.
Cluster starts normally.

Error Reference
---------------------------------------------------------------------------
LABEL: CONFIGRM_STOPPED_ST
IDENTIFIER: 447D3237

Date/Time: Tue Nov 24 04:18:36 EST 2015
Sequence Number: 42614
Class: O
Type: INFO
WPAR: Global
Resource Name: ConfigRM

Description
IBM.ConfigRM daemon has been stopped.

Probable Causes
The RSCT Configuration Manager daemon(IBM.ConfigRMd) has been stopped.

User Causes
The stopsrc -s IBM.ConfigRM command has been executed.

Recommended Actions
Confirm that the daemon should be stopped. Normally, this daemon should
not be stopped explicitly by the user.

Detail Data
DETECTING MODULE
RSCT,ConfigRMDaemon.C,1.25.1.1,219
ERROR ID

REFERENCE CODE

---------------------------------------------------------------------------
LABEL: CONFIGRM_MESSAGE_ST
IDENTIFIER: F475ABC7

Date/Time: Tue Nov 24 04:18:32 EST 2015
Sequence Number: 42613
Class: O
Type: INFO
WPAR: Global
Resource Name: ConfigRM

Detail Data
DETECTING MODULE
RSCT,ConfigRMGroup.C,1.337.1.1,6951

DIAGNOSTIC EXPLANATION
get_adapter_info_by_addr(192.168.0.12) FAILED rc=28
---------------------------------------------------------------------------
LABEL: CONFIGRM_MESSAGE_ST
IDENTIFIER: F475ABC7

Date/Time: Tue Nov 24 04:18:32 EST 2015
Sequence Number: 42612
Class: O
Type: INFO
WPAR: Global
Resource Name: ConfigRM

Detail Data
DETECTING MODULE
RSCT,ConfigRMGroup.C,1.337.1.1,6951

DIAGNOSTIC EXPLANATION
get_adapter_info_by_addr(192.168.0.12) FAILED rc=28
---------------------------------------------------------------------------
LABEL: CONFIGRM_MESSAGE_ST
IDENTIFIER: F475ABC7

Date/Time: Tue Nov 24 04:18:32 EST 2015
Sequence Number: 42611
Class: O
Type: INFO
WPAR: Global
Resource Name: ConfigRM

Detail Data
DETECTING MODULE
RSCT,ConfigRMGroup.C,1.337.1.1,6951

DIAGNOSTIC EXPLANATION
get_adapter_info_by_addr(10.0.0.12) FAILED rc=28
---------------------------------------------------------------------------
LABEL: CONFIGRM_MESSAGE_ST
IDENTIFIER: F475ABC7

Date/Time: Tue Nov 24 04:18:32 EST 2015
Sequence Number: 42610
Class: O
Type: INFO
WPAR: Global
Resource Name: ConfigRM

Detail Data
DETECTING MODULE
RSCT,ConfigRMGroup.C,1.337.1.1,6951

DIAGNOSTIC EXPLANATION
get_adapter_info_by_addr(192.168.0.11) FAILED rc=28
---------------------------------------------------------------------------
LABEL: CONFIGRM_MESSAGE_ST
IDENTIFIER: F475ABC7

Date/Time: Tue Nov 24 04:18:32 EST 2015
Sequence Number: 42609
Class: O
Type: INFO
WPAR: Global
Resource Name: ConfigRM

Detail Data
DETECTING MODULE
RSCT,ConfigRMGroup.C,1.337.1.1,6951

DIAGNOSTIC EXPLANATION
get_adapter_info_by_addr(192.168.0.11) FAILED rc=28
---------------------------------------------------------------------------
LABEL: CONFIGRM_MESSAGE_ST
IDENTIFIER: F475ABC7

Date/Time: Tue Nov 24 04:18:32 EST 2015
Sequence Number: 42608
Class: O
Type: INFO
WPAR: Global
Resource Name: ConfigRM

Detail Data
DETECTING MODULE
RSCT,ConfigRMGroup.C,1.337.1.1,6951

DIAGNOSTIC EXPLANATION
get_adapter_info_by_addr(10.0.0.11) FAILED rc=28
---------------------------------------------------------------------------
LABEL: CONFIGRM_PENDINGQUO
IDENTIFIER: A098BF90

Date/Time: Tue Nov 24 04:18:32 EST 2015
Sequence Number: 42607
Class: S
Type: PERM
WPAR: Global
Resource Name: ConfigRM

Description
The operational quorum state of the active peer domain has changed to PENDING_QUORUM.
This state usually indicates that exactly half of the nodes that are defined in the
peer domain are online. In this state cluster resources cannot be recovered although
none will be stopped explicitly.

Failure Causes
One or more nodes in the active peer domain have failed.
One or more nodes in the active peer domain have been taken offline by the user.
A network failure is disrupted communication between the cluster nodes.

Recommended Actions
Ensure that more than half of the nodes of the domain are online.
Ensure that the network that is used for communication between the nodes is functioning correctly.
Ensure that the active tie breaker device is operational and if it set to
'Operator' then resolve the tie situation by granting ownership to one of
the active sub-domains.

Detail Data
DETECTING MODULE
RSCT,PeerDomain.C,1.99.30.8,19713

---------------------------------------------------------------------------
LABEL: STORAGERM_STARTED_S
IDENTIFIER: EDFF8E9B

Date/Time: Tue Nov 24 04:17:53 EST 2015
Sequence Number: 42606
Node Id: node1
Class: O
Type: INFO
WPAR: Global
Resource Name: StorageRM

Detail Data
DETECTING MODULE
RSCT,IBM.StorageRMd.C,1.49,147

---------------------------------------------------------------------------
LABEL: CONFIGRM_ONLINE_ST
IDENTIFIER: 3B16518D

Date/Time: Tue Nov 24 04:17:52 EST 2015
Sequence Number: 42605
Node Id: node1
Class: S
Type: INFO
WPAR: Global
Resource Name: ConfigRM

Detail Data
DETECTING MODULE
RSCT,PeerDomain.C,1.99.30.8,24950

Peer Domain Name
mycluster