cl_rsh fails

PROBLEM: On some migrates, we found the rpdomain would not stay running on one node.
The cluster was up, and SEEMED to operate normally, but errpt got CONFIGRM stop/start messages every minute.

lsrpdomain would show Offline, or “Pending online”.

lsrpnode would show:
2610-412 A Resource Manager terminated while attempting to enumerate resources for this command.
2610-408 Resource selection could not be performed.
2610-412 A Resource Manager terminated while attempting to enumerate resources for this command.
2610-408 Resource selection could not be performed.

On the other node, lsrpnode only showed itself, and lsrpdomain showed Online.

“cl_rsh node1 date” worked from both nodes
“cl_rsh node2 date” worked only from node2.
/etc/hosts, cllsif, hostname, /etc/cluster/rhosts… everything was spotless.
clcomd was running, even after refresh.
Same subnet, and ports were not filtered.

Importing a snapshot said:
Warning: unable to verify inbound clcomd communication from
node "node1" to the local node, "node2".

I applied PowerHA 7.1.3 SP4, and no fix. I think this is a problem with clmigcheck or mkcluster in AIX.

SOLUTION
I saved a snapshot, blew away the cluster, and imported the snapshot.
/usr/es/sbin/cluster/utilities/clsnapshot -c -i -nmysnapshot -d "Snapshot before clrmcluster"
clstop -g -N
stopsrc -g cluster
clrmclstr
rmcluster -r hdisk10
# one node's SSHd died here.
rmdev -dl cluster0
cfgmgr
cl_rsh works all the way around now.
/usr/es/sbin/cluster/utilities/clsnapshot -a -n'mysnapshot' -f'false'
cllsclstr ; lscluster -m ; lsrpdomain ; lsrpnode

works fine all around, before and after reboot.
Cluster starts normally.

Error Reference
---------------------------------------------------------------------------
LABEL: CONFIGRM_STOPPED_ST
IDENTIFIER: 447D3237

Date/Time: Tue Nov 24 04:18:36 EST 2015
Sequence Number: 42614
Class: O
Type: INFO
WPAR: Global
Resource Name: ConfigRM

Description
IBM.ConfigRM daemon has been stopped.

Probable Causes
The RSCT Configuration Manager daemon(IBM.ConfigRMd) has been stopped.

User Causes
The stopsrc -s IBM.ConfigRM command has been executed.

Recommended Actions
Confirm that the daemon should be stopped. Normally, this daemon should
not be stopped explicitly by the user.

Detail Data
DETECTING MODULE
RSCT,ConfigRMDaemon.C,1.25.1.1,219
ERROR ID

REFERENCE CODE

---------------------------------------------------------------------------
LABEL: CONFIGRM_MESSAGE_ST
IDENTIFIER: F475ABC7

Date/Time: Tue Nov 24 04:18:32 EST 2015
Sequence Number: 42613
Class: O
Type: INFO
WPAR: Global
Resource Name: ConfigRM

Detail Data
DETECTING MODULE
RSCT,ConfigRMGroup.C,1.337.1.1,6951

DIAGNOSTIC EXPLANATION
get_adapter_info_by_addr(192.168.0.12) FAILED rc=28
---------------------------------------------------------------------------
LABEL: CONFIGRM_MESSAGE_ST
IDENTIFIER: F475ABC7

Date/Time: Tue Nov 24 04:18:32 EST 2015
Sequence Number: 42612
Class: O
Type: INFO
WPAR: Global
Resource Name: ConfigRM

Detail Data
DETECTING MODULE
RSCT,ConfigRMGroup.C,1.337.1.1,6951

DIAGNOSTIC EXPLANATION
get_adapter_info_by_addr(192.168.0.12) FAILED rc=28
---------------------------------------------------------------------------
LABEL: CONFIGRM_MESSAGE_ST
IDENTIFIER: F475ABC7

Date/Time: Tue Nov 24 04:18:32 EST 2015
Sequence Number: 42611
Class: O
Type: INFO
WPAR: Global
Resource Name: ConfigRM

Detail Data
DETECTING MODULE
RSCT,ConfigRMGroup.C,1.337.1.1,6951

DIAGNOSTIC EXPLANATION
get_adapter_info_by_addr(10.0.0.12) FAILED rc=28
---------------------------------------------------------------------------
LABEL: CONFIGRM_MESSAGE_ST
IDENTIFIER: F475ABC7

Date/Time: Tue Nov 24 04:18:32 EST 2015
Sequence Number: 42610
Class: O
Type: INFO
WPAR: Global
Resource Name: ConfigRM

Detail Data
DETECTING MODULE
RSCT,ConfigRMGroup.C,1.337.1.1,6951

DIAGNOSTIC EXPLANATION
get_adapter_info_by_addr(192.168.0.11) FAILED rc=28
---------------------------------------------------------------------------
LABEL: CONFIGRM_MESSAGE_ST
IDENTIFIER: F475ABC7

Date/Time: Tue Nov 24 04:18:32 EST 2015
Sequence Number: 42609
Class: O
Type: INFO
WPAR: Global
Resource Name: ConfigRM

Detail Data
DETECTING MODULE
RSCT,ConfigRMGroup.C,1.337.1.1,6951

DIAGNOSTIC EXPLANATION
get_adapter_info_by_addr(192.168.0.11) FAILED rc=28
---------------------------------------------------------------------------
LABEL: CONFIGRM_MESSAGE_ST
IDENTIFIER: F475ABC7

Date/Time: Tue Nov 24 04:18:32 EST 2015
Sequence Number: 42608
Class: O
Type: INFO
WPAR: Global
Resource Name: ConfigRM

Detail Data
DETECTING MODULE
RSCT,ConfigRMGroup.C,1.337.1.1,6951

DIAGNOSTIC EXPLANATION
get_adapter_info_by_addr(10.0.0.11) FAILED rc=28
---------------------------------------------------------------------------
LABEL: CONFIGRM_PENDINGQUO
IDENTIFIER: A098BF90

Date/Time: Tue Nov 24 04:18:32 EST 2015
Sequence Number: 42607
Class: S
Type: PERM
WPAR: Global
Resource Name: ConfigRM

Description
The operational quorum state of the active peer domain has changed to PENDING_QUORUM.
This state usually indicates that exactly half of the nodes that are defined in the
peer domain are online. In this state cluster resources cannot be recovered although
none will be stopped explicitly.

Failure Causes
One or more nodes in the active peer domain have failed.
One or more nodes in the active peer domain have been taken offline by the user.
A network failure is disrupted communication between the cluster nodes.

Recommended Actions
Ensure that more than half of the nodes of the domain are online.
Ensure that the network that is used for communication between the nodes is functioning correctly.
Ensure that the active tie breaker device is operational and if it set to
'Operator' then resolve the tie situation by granting ownership to one of
the active sub-domains.

Detail Data
DETECTING MODULE
RSCT,PeerDomain.C,1.99.30.8,19713

---------------------------------------------------------------------------
LABEL: STORAGERM_STARTED_S
IDENTIFIER: EDFF8E9B

Date/Time: Tue Nov 24 04:17:53 EST 2015
Sequence Number: 42606
Node Id: node1
Class: O
Type: INFO
WPAR: Global
Resource Name: StorageRM

Detail Data
DETECTING MODULE
RSCT,IBM.StorageRMd.C,1.49,147

---------------------------------------------------------------------------
LABEL: CONFIGRM_ONLINE_ST
IDENTIFIER: 3B16518D

Date/Time: Tue Nov 24 04:17:52 EST 2015
Sequence Number: 42605
Node Id: node1
Class: S
Type: INFO
WPAR: Global
Resource Name: ConfigRM

Detail Data
DETECTING MODULE
RSCT,PeerDomain.C,1.99.30.8,24950

Peer Domain Name
mycluster


TSM 7.1 config

In the past, I set up TSM.PWD as root, but this seems to not be what I needed.

I’m posting because the error messages and IBM docs don’t cover this.

tsmdbmgr.log shows:
ANS2119I An invalid replication server address return code rc value = 2 was received from the server.

TSM Activity log shows:
ANR2983E Database backup terminated due to environment or setup issue related to DSMI_DIR – DB2 sqlcode -2033 sqlerrmc 168. (SESSION: 1, PROCESS: 9)

db2diag.log shows:

2014-02-26-13.54.12.425089-360 E415619A371 LEVEL: Error
PID : 15138852 TID : 1 PROC : db2vend
INSTANCE: tsminst1 NODE : 000
HOSTNAME: tsmserver
EDUID : 1
FUNCTION: DB2 UDB, database utilities, sqluvint, probe:321
DATA #1 : TSM RC, PD_DB2_TYPE_TSM_RC, 4 bytes
TSM RC=0x000000A8=168 — see TSM API Reference for meaning.

EDUID : 38753 EDUNAME: db2med.35926.0 (TSMDB1) 0
FUNCTION: DB2 UDB, database utilities, sqluMapVend2MediaRCWithLog, probe:656
DATA #1 : String, 134 bytes
Vendor error: rc = 11 returned from function sqluvint.
Return_code structure from vendor library /tsm/tsminst1/sqllib/adsm/libtsm.a:

DATA #2 : Hexdump, 48 bytes
0x0A00030462F0C4D0 : 0000 00A8 3332 3120 3136 3800 0000 0000 ….321 168…..
0x0A00030462F0C4E0 : 0000 0000 0000 0000 0000 0000 0000 0000 …………….
0x0A00030462F0C4F0 : 0000 0000 0000 0000 0000 0000 0000 0000 …………….

EDUID : 38753 EDUNAME: db2med.35926.0 (TSMDB1) 0
FUNCTION: DB2 UDB, database utilities, sqluMapVend2MediaRCWithLog, probe:696
MESSAGE : Error in vendor support code at line: 321 rc: 168

RC 168 per dsmrc.h means:
#define DSM_RC_NO_PASS_FILE 168 /* password file needed and user is
not root */

Verified everything required for this:
• passworddir points to the right directory
• DSMI_DIR points to the right directory
• dsmtca runs okay
• dsmapipw runs okay

Verified hostname info was correct

dsmffdc.log shows:
[ FFDC_GENERAL_SERVER_ERROR ]: (rdbdb.c:4200) GetOtherLogsUsageInfo failed, rc=2813, archLogDir = /tsm/arch.

Checked, and the log directory inside dsmserv.opt was typoed as /tsm/arch instead of /tsm/arc as was used to create the instance and as exists on the filesystems.

Updated dsmserv.opt and restarted tsm server. No change other than fixing Q LOG

SOLUTION:
The TSM.PWD file must be owned by the instance user, not by root.
Make sure to run the dsmapipw as the instance user, or chown the file after.