IT42905 TSM / Spectrum Protect SSL cert expiration

TSM 7.1.8 / Spectrum Protect 8.1.2 and later create SSL certs with a 10 year expiration.

IBM reference: https://www.ibm.com/support/pages/apar/IT42905

The fix is:

Delete the instance keystore (cert.kdb)
Set all clients, admins, servers to SESSIONSECURITY=TRANSITIONAL
HALT the server and restart it – this will make a NEW key.
FORCESYNC for server connections.
Delete the client keystore (dsmcert.idx)
Restart the client and make sure it connects and accepts the new key.
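The server-side commands for the steps above can be generated rather than typed per server. This is a minimal sketch, not a tested procedure: gen_transitional_macro and gen_forcesync_macro are hypothetical helper names, the server names are whatever you pass in, and the output is only text to review and paste into dsmadmc.

```shell
#!/bin/sh
# Hypothetical helper: print the admin commands to run BEFORE the HALT.
# Arguments are the names of the defined servers (server-to-server links).
gen_transitional_macro() {
    # Let every node and admin renegotiate against the new certificate.
    echo "UPDATE NODE * SESSIONSECURITY=TRANSITIONAL"
    echo "UPDATE ADMIN * SESSIONSECURITY=TRANSITIONAL"
    for srv in "$@"; do
        echo "UPDATE SERVER $srv SESSIONSECURITY=TRANSITIONAL"
    done
}

# Hypothetical helper: print the commands to run AFTER the restart.
gen_forcesync_macro() {
    for srv in "$@"; do
        echo "UPDATE SERVER $srv FORCESYNC=YES"
    done
}

# Example (server names are placeholders):
#   gen_transitional_macro TSM2 TSM3
#   gen_forcesync_macro TSM2 TSM3
```

Keeping the two lists separate mirrors the order above: TRANSITIONAL first, restart, then FORCESYNC.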

If you don’t wipe the client keystore:

ANS1695E The certificate is not valid.
ANS8023E Unable to establish session with server.
ANS8002I Highest return code was -370.

From the actlog/SERVER_CONSOLE

ANR8599W The connection with someserver:port failed due to an untrusted server certificate. An attempt to reconnect and establish certificate trust might follow.

IBM is considering automating this.

There is no automation yet as of 8.1.18.0 in 2023-03, and once the key expires, you’re stuck doing it manually.
You may be able to use DEFINE CLIENTACTION to delete the keystores on the clients if you use dsmcad.


Spectrum Protect: Failed to prepare update packages.

The OpCenter GUI reports “Failed to prepare update packages”, and the update button is greyed out.
We have never been able to use OpCenter client updates.

We’ve gone through the steps in the 8.1.7 release notes
https://www.ibm.com/support/pages/ibm-spectrum-protect-version-817000-fix-pack-readme-files
* Command routing works from hub to spoke and spoke to hub.
* Every server has a FILE class, and half have a directory class pool.
* Every server has an HLA and LLA set.
* Most of our clients use cad with passwordaccess=generate

We’ve gone through the dependencies in opcenter help, which is mostly the same.

We have tried deleting the DEPLOY nodes and letting them re-replicate.

Protect: TSM1>q act begind=-1 endd=today msgno=3759
02/07/23 08:00:31 ANR3759E An error occurred during the replication of client update packages from node IBM_DEPLOY_CLIENT_UNX to the monitored server, TSM2. The return code is 18. (PROCESS: 1612)
02/07/23 08:00:31 ANR3759E An error occurred during the replication of client update packages from node IBM_DEPLOY_CLIENT_WIN to the monitored server, TSM2. The return code is 18. (PROCESS: 1613)
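With several spokes it helps to boil the ANR3759E lines down to which spoke and which deploy node failed. A minimal sketch, assuming each message arrives on a single line as shown above; parse_anr3759e is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: read activity-log lines on stdin and print
# "<spoke> <deploy node> <return code>" for each ANR3759E message.
parse_anr3759e() {
    sed -n 's/.*ANR3759E .* from node \([A-Za-z0-9_]*\) to the monitored server, \([A-Za-z0-9_-]*\)\. The return code is \([0-9]*\)\..*/\2 \1 \3/p'
}

# Example:
#   parse_anr3759e < actlog.txt | sort -u
```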

Reset the remote clients that are failing

TSM2: DEL FI IBM_DEPLOY_CLIENT_UNX *
TSM2: DEL FI IBM_DEPLOY_CLIENT_WIN *

TSM2: remove replnode IBM_DEPLOY_CLIENT_UNX server=TSM1
TSM2: remove replnode IBM_DEPLOY_CLIENT_WIN server=TSM1

TSM2: remove node IBM_DEPLOY_CLIENT_UNX
TSM2: remove node IBM_DEPLOY_CLIENT_WIN
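If more than one spoke is failing, the same reset can be generated per spoke. A minimal sketch that only prints the routed commands used above; gen_deploy_reset is a hypothetical helper name, and nothing is sent to a server.

```shell
#!/bin/sh
# Hypothetical helper: print the reset commands for one spoke.
# $1 = spoke server name, $2 = hub server name, remaining args = deploy nodes.
# Note: the file space deletes must finish before "remove node" will succeed.
gen_deploy_reset() {
    spoke=$1; hub=$2; shift 2
    for node in "$@"; do
        echo "$spoke: DEL FI $node *"
        echo "$spoke: remove replnode $node server=$hub"
        echo "$spoke: remove node $node"
    done
}

# Example:
#   gen_deploy_reset TSM2 TSM1 IBM_DEPLOY_CLIENT_UNX IBM_DEPLOY_CLIENT_WIN
```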

Reset deploypkgmgr – not really needed

SET DEPLOYPKGMGR off
SET DEPLOYREPOSITORY /sp/software/octemp/downloads/
SET DEPLOYMAXPKGS 4
SET DEPLOYPKGMGR on

What helped: the underlying hidden command that does the sync.

refresh pkg clean=no startnow=yes

Once this processes, the button turned blue.

Asked IBM to address this because if there are multiple spokes, failure of one spoke should not block the others.


TSM/ISP Recovering from “SKIP UPGRADING THIS INSTANCE”

REFS:
https://www.ibm.com/support/pages/manually-upgrading-ibm-spectrum-protect-server-instances
https://www.ibm.com/support/pages/anr0187e-failure-during-server-startup
http://issen007.blogspot.com/2017/05/manual-upgrade-ibm-spectrum-protect-71x.html

###################################################
### Stop the instance completely
su - tsminst1

### This may not work if your environment or links are bad.
db2 list db directory

### If db2sysc is still running
ps -ef | grep db2
db2stop
db2stop force
db2 terminate ### Kill off db2bp fragments

### Kill anything else that is left over
ps -ef | grep db2

### Remove IPC
ipcrm -a
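To confirm the instance is really down before dropping it, the ps output can be filtered for leftover Db2 processes. A minimal sketch, assuming `ps -ef` column order (user first, PID second); leftover_db2_pids is a hypothetical helper name and the process-name list is not exhaustive.

```shell
#!/bin/sh
# Hypothetical helper: read `ps -ef` output on stdin and print the PIDs of
# leftover Db2 processes owned by the instance user.
# $1 = instance user, for example tsminst1.
leftover_db2_pids() {
    awk -v user="$1" \
        '$1 == user && $0 ~ /db2(sysc|bp|acd|fmp|vend|wdog|ckpwd)/ { print $2 }'
}

# Example:
#   ps -ef | leftover_db2_pids tsminst1
```

An empty result means nothing matching is still running under that user.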


###################################################
### Clean up remainders
su - root
/opt/tivoli/tsm/db2/instance/db2ilist
/opt/tivoli/tsm/db2/instance/db2idrop tsminst1

### Verify nothing left
/opt/tivoli/tsm/db2/instance/db2ilist


###################################################
### Redefine the instance
su - root
#/opt/tivoli/tsm/db2/instance/db2icrt -a server -u tsminst1 tsminst1
/opt/tivoli/tsm/db2/instance/db2icrt -u tsminst1 tsminst1

DBI1446I The db2icrt command is running.
DB2 installation is being initialized.
Total number of tasks to be performed: 4
Total estimated time for all tasks to be performed: 309 second(s)

Task #1 start
Description: Setting default global profile registry variables
Estimated time 1 second(s)
Task #1 end

Task #2 start
Description: Initializing instance list
Estimated time 5 second(s)
Task #2 end

Task #3 start
Description: Configuring DB2 instances
Estimated time 300 second(s)
Task #3 end

Task #4 start
Description: Updating global profile registry
Estimated time 3 second(s)
Task #4 end

The execution completed successfully.
For more information see the DB2 installation log at "/tmp/db2icrt.log.21176".
DBI1070I Program db2icrt completed successfully.


###################################################
### Set up Db2 environment variables
# NOTE: userprofile and db2profile get reset after db2icrt
su - tsminst1
/opt/tivoli/tsm/db2/adm/db2set -i tsminst1 "DB2_SKIPINSERTED=ON"
/opt/tivoli/tsm/db2/adm/db2set -i tsminst1 "DB2_KEEPTABLELOCK=ON"
/opt/tivoli/tsm/db2/adm/db2set -i tsminst1 "DB2_EVALUNCOMMITTED=ON"
/opt/tivoli/tsm/db2/adm/db2set -i tsminst1 "DB2_SKIPDELETED=ON"
/opt/tivoli/tsm/db2/adm/db2set -i tsminst1 "DB2CODEPAGE=819"
/opt/tivoli/tsm/db2/adm/db2set -i tsminst1 "DB2_PARALLEL_IO=*"

# Quote EOF so the variables expand at login, not while writing the file.
cat <<'EOF' >>${HOME}/sqllib/userprofile
export LD_LIBRARY_PATH=${HOME}/sqllib/lib64/gskit:${HOME}/sqllib/lib32:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/opt/tivoli/tsm/server/bin/dbbkapi:/opt/ibm/lib:/opt/ibm/lib64:/usr/lib64:${HOME}/sqllib/lib64
export PATH=$PATH:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/tivoli/tsm/server/bin64
export PATH=$PATH:/opt/tivoli/tsm/server/bin:/usr/tivoli/tsm/server/bin64:/usr/tivoli/tsm/server/bin
export PATH=$PATH:/opt/tivoli/tsm/client/ba/bin64:/opt/tivoli/tsm/client/ba/bin:/usr/tivoli/tsm/client/ba/bin64
export PATH=$PATH:/usr/tivoli/tsm/client/ba/bin:/usr/tivoli/tsm/client/api/bin64:/usr/tivoli/tsm/client/api/bin
export PATH=$PATH:/opt/tivoli/tsm/client/api/bin64:/opt/tivoli/tsm/client/api/bin:/opt/tivoli/tsm/db2/bin
export PATH=$PATH:${HOME}/sqllib/bin:${HOME}/sqllib/adm:${HOME}/sqllib/misc

DSMI_CONFIG=${HOME}/tsmdbmgr.opt
DSMI_DIR=/opt/tivoli/tsm/server/bin/dbbkapi
DSMI_LOG=${HOME}
export DSMI_CONFIG DSMI_DIR DSMI_LOG 
EOF

cat <<'EOF' >>${HOME}/.profile
. ${HOME}/sqllib/db2profile
. ${HOME}/sqllib/userprofile

alias ll='ls -laF --color=auto'
set -o vi
EOF

. ./.profile


###################################################
### Catalog the DB to make sure it is okay.
db2start

### Find the TSMDB1 instance(s).
DBDIR=$(find /home /sp /tsm -name sqldbdir -exec strings {} \; 2>/dev/null | grep inst | cut -c 2-99 | sort | uniq)
echo $DBDIR

### Register the instance(s).
for i in $DBDIR ; do db2 catalog db TSMDB1 on $i ; done
# SQL6028N Catalog database failed because database "tsminst1" was not found in the local database directory.

### List the instances.
db2 list db directory


###################################################
### Upgrade the DB2 system catalog tables
#db2 upgrade db tsminst1
#SQL1013N The database alias name or database name "TSMINST1" could not be found. SQLSTATE=42705
db2 upgrade db TSMDB1
SQL1103W The UPGRADE DATABASE command was completed successfully.

### Stop DB2 to make sure it flushes everything.
db2stop
SQL1064N DB2STOP processing was successful.


###################################################
### Upgrade the TSM database schema
/opt/tivoli/tsm/server/bin/dsmserv upgradedb
ANR7800I DSMSERV generated at 18:03:03 on Nov 19 2021.

IBM Spectrum Protect for AIX
Version 8, Release 1, Level 13.000

Licensed Materials - Property of IBM

(C) Copyright IBM Corporation 1990, 2021.
All rights reserved.
U.S. Government Users Restricted Rights - Use, duplication or disclosure
restricted by GSA ADP Schedule Contract with IBM Corporation.

ANR7801I Subsystem process ID is 64684398.
ANR7811I Using instance directory /tsm/tsminst1/
ANR3339I Default Label in key database is TSM Server SelfSigned SHA Key.
ANR4726I The ICC support module has been loaded.
ANR0990I Server restart-recovery in progress.

###################################################
### Make sure the server accepts workload

# start the server normally (rc script, systemd, or inittab line run from an at-job).

### Run these from dsmadmc
REG LIC FILE=*ee.lic
enable ses all

 


Spectrum Protect (TSM) Operations Center on Ubuntu LTS

Per IBM, the Spectrum Protect server is supported on Ubuntu LTS 14.04, 16.04, 18.04, and 20.04.

https://www.ibm.com/support/pages/overview-ibm-spectrum-protect-supported-operating-systems

However, Operations Center (web GUI) is not supported on Ubuntu, only RHEL and SLES.

https://www.ibm.com/support/pages/ibm-spectrum-protect-operations-center-software-and-hardware-requirements

./install.sh -c
Validating package prerequisites...
=====> IBM Installation Manager> Update> Prerequisites
Validation results:
* [ERROR] IBM Spectrum Protect Operations Center 8.1.12000.20210326_0723 contains validation errors.
1. ERROR: The operating system on which you are installing the product is not supported. For more information, see http://www.ibm.com/support/docview.wss?uid=swg21243309.

Enter the number of the error or warning message above to view more details.

To skip the OS and platform checks, and convert the ERROR into a WARNING:

./install.sh -c -vmargs "-DBYPASS_TSM_REQ_CHECKS=true"
Validation results:
* [WARNING] IBM Spectrum Protect Operations Center 8.1.12000.20210326_0723 contains validation warning.
1. WARNING: The operating system on which you are installing the product is not supported. For more information, see http://www.ibm.com/support/docview.wss?uid=swg21243309.

Enter the number of the error or warning message above to view more details.

I recommend using this bypass ONLY to install or update Operations Center, then exiting and running the installer again normally so the other filesets still get validated.
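To keep the bypass from becoming a habit, the flag can be applied only when the host is not one of the supported distributions. A minimal sketch, assuming the os-release ID values rhel and sles identify the supported platforms; needs_bypass is a hypothetical helper name and this is not how the installer itself decides.

```shell
#!/bin/sh
# Hypothetical helper: decide whether the bypass flag is needed.
# $1 = path to an os-release file (defaults to /etc/os-release).
# Returns 0 (true) when the distribution ID is not rhel or sles.
needs_bypass() {
    id=$(sed -n 's/^ID=//p' "${1:-/etc/os-release}" | tr -d '"')
    case "$id" in
        rhel|sles) return 1 ;;
        *)         return 0 ;;
    esac
}

# Example:
#   if needs_bypass; then
#       ./install.sh -c -vmargs "-DBYPASS_TSM_REQ_CHECKS=true"
#   else
#       ./install.sh -c
#   fi
```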


ANR1812E DELETE FILESPACE VMFULL failed because replication

ERROR:

ANR1812E DELETE FILESPACE VMFULL failed because replication

DESCRIPTION:

Decommissioned VMs fail to auto-delete during expiration because replication is running at the same time. In an ideal world, there would be enough system resources to perform DB Backup in 2 hours, expiration in 2 hours, and replication in 4-8 hours. In this environment, replication overlaps a lot of other processes, and can get in the way.

ANR1812E DELETE FILESPACE VMFULL-SOMENODENAME for node failed deletion because of a replication in progress. (SESSION: 123456)
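When several decommissioned VMs are stuck, the file space names can be pulled out of the activity log in one pass. A minimal sketch, assuming each message arrives on a single line; blocked_filespaces is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: read activity-log lines on stdin and print the
# file space name from each ANR1812E message, once per file space.
blocked_filespaces() {
    sed -n 's/.*ANR1812E DELETE FILESPACE \([^ ]*\) for node.*/\1/p' | sort -u
}

# Example:
#   blocked_filespaces < actlog.txt
```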

 

WORKAROUND:

Identify the server

 

Cancel replication
CANCEL REPLICATION

 

Identify the filespace
VMFULL-SOMENODENAME in the example

 

Find the node that owns the filespace.
Protect: TSM>q occ * *VMFULL-SOMENODENAME *
NODE_NAME       Type   FILESPACE_NAME         FSID   Files   Phys MB   Logical MB
VM_DATACENTER   Bkup   \VMFULL-SOMENODENAME      4   53084         -    6,782,908

 

Delete the filespace on both local and replica:
DELETE FI VM_DATACENTER '\VMFULL-SOMENODENAME'
TSM2: DELETE FI VM_DATACENTER '\VMFULL-SOMENODENAME'

 

Monitor Progress until complete
Protect: TSM>q occ * *VMFULL-SOMENODENAME *
NODE_NAME       Type   FILESPACE_NAME         FSID   Files   Phys MB   Logical MB
VM_DATACENTER   Bkup   \VMFULL-SOMENODENAME      4   50848         -    6,469,955

Protect: TSM>q act search=ANR1812E
03/07/21   23:14:23      ANR2017I Administrator ADMIN issued command: QUERY ACTLOG search=ANR1812E  (SESSION: 438180)

Protect: TSM>q proc
Process      Process Description          Job Id     Process Status
--------     --------------------     ----------     -------------------------------------------------
     395     DELETE FILESPACE                        Deleting file space \VMFULL-SOMENODENAME
                                                      (fsId=4) (which can include backup and archive
                                                      data) for node VM_DATACENTER: 0 objects deleted,
                                                      0 objects retained, and 0 objects skipped.

Protect: TSM>TSM2: q proc
ANR1699I Resolved TSM2 to 1 server(s) - issuing command Q PROC against server(s).
ANR1687I Output for command 'Q PROC' issued against server TSM2 follows:
Process      Process Description          Job Id     Process Status
--------     --------------------     ----------     -------------------------------------------------
   5,756     DELETE FILESPACE                        Deleting file space \VMFULL-SOMENODENAME
                                                      (fsId=4) (which can include backup and archive
                                                      data) for node VM_DATACENTER: 0 objects deleted,
                                                      0 objects retained, and 0 objects skipped.
ANR1688I Output for command 'Q PROC' issued against server TSM2 completed.
ANR1694I Server TSM2 received the request to process command 'Q PROC'.
ANR1697I Command 'Q PROC' processed by 1 server(s):  1 successful, 0 with warnings, and 0 with errors.

 

CAUSE:

Replicate Node, a normal operation, creates locks on any filespace to be processed.

The long-term resolution would be to have enough system resources to not have to overlap daily operations processes.

The benchmark set by IBM for this would be the ability to complete BACKUP DB in 2 hours.  This environment takes 8-12 hours for most servers.



ANR2568E Request for node (node) to start schedule (name) at (date) is denied.

ERROR:

ANR2568E Request for node (node) to start schedule (name) at (date) is denied.

 

DESCRIPTION:

This happens when two or more schedulers are connecting as the same node.  One scheduler starts work on the schedule, and the others are denied.

 

 

WORKAROUND:

Check the client node for two or more “dsmcad” and “dsmc sched” processes with the same (or no) config file listed.

Kill the oldest duplicates.

 

If no duplicates are on the client, then search the activity log to see if this client is connecting with multiple IP addresses or hostnames.

If so, find out which host should not be running the scheduler, and kill the scheduler processes on that host.

 

This may require coordination with the UNIX team in cases of cluster failovers.

This may require investigation of start scripts in cases where the same client chronically has duplicates.

 

CAUSE:

Typically, a human will restart a scheduler, but fail to kill the original.

Sometimes, a start/stop script on a host fails to stop the prior instance.

In some cases, multiple start scripts fire on system boot.

 



NDMP TOC failure – datamover type incorrect

ERROR:
ANR4950E The server is unable to retrieve NDMP file history information while building table of contents for node NASNODE01, file space /SVM_NASNODE01_VIRTUALFS. NDMP node ID is 90156245149. Table of contents creation fails.

CAUSE:
One possible cause of this can be if the datamover was defined with the wrong scope (TYPE).  
TYPE can be NAS, NASVSERVER, or NASCLUSTER.  NAS is for node context.  NASVSERVER is for SVM context.  NASCLUSTER is for the whole cluster context.

NOTE: There are other possible causes, such as corrupt inodes; however, this one bit me and was not clearly documented anywhere else.

CORRECTION:
You cannot UPDATE DATAMOVER TYPE=blah, but you can simply DELETE DATAMOVER and DEFINE DATAMOVER to fix.

DELETE DATAMOVER NASNODE01
DEFINE datamover NASNODE01 type=nascluster dataformat=netappdump hla=192.168.128.1 user=NDMPADMIN password=PASSWORDHERE
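Since the TYPE is the part that is easy to get wrong, it is worth checking it before pasting the commands. A minimal sketch that only prints the DELETE/DEFINE pair shown above; gen_datamover_redefine is a hypothetical helper name, and the password stays a placeholder to fill in by hand.

```shell
#!/bin/sh
# Hypothetical helper: print the DELETE/DEFINE pair for a data mover.
# $1 = data mover name, $2 = type, $3 = address (HLA), $4 = NDMP user.
gen_datamover_redefine() {
    type=$(echo "$2" | tr '[:lower:]' '[:upper:]')
    case "$type" in
        NAS|NASVSERVER|NASCLUSTER) ;;
        *) echo "unknown data mover type: $2" >&2; return 1 ;;
    esac
    echo "DELETE DATAMOVER $1"
    echo "DEFINE DATAMOVER $1 TYPE=$type DATAFORMAT=NETAPPDUMP HLA=$3 USER=$4 PASSWORD=PASSWORDHERE"
}

# Example:
#   gen_datamover_redefine NASNODE01 nascluster 192.168.128.1 NDMPADMIN
```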

TRACING INFO:

trace disable
trace enable spi spid toc
trace begin /tmp/server.trc

Once tracing is enabled, start another backup of the affected volume (/SVM_SBNAS01_OU_ABOD in this case). When the backup completes or fails, issue the following commands to stop tracing:

trace flush
trace end
trace disable
QUERY ACTLOG

grep NDMP dsmffdc.log

NASNODE01::> node run -node SBNAS01-01
Type 'exit' or 'Ctrl-D' to return to the CLI
NASNODE01> rdfile /etc/log/backup


TSM SP Remove ReplServer

PROBLEM:
Every 5.5 minutes, this shows up in the actlog

08/13/20 08:05:25 ANR1663E Open Server: Server OLDSERVER not defined
08/13/20 08:05:25 ANR1651E Server information for OLDSERVER is not available.
08/13/20 08:05:25 ANR4377E Session failure, target server OLDSERVER is not defined on the source server.
08/13/20 08:05:25 ANR1663E Open Server: Server OLDSERVER not defined
08/13/20 08:05:25 ANR1651E Server information for OLDSERVER is not available.
08/13/20 08:05:25 ANR4377E Session failure, target server OLDSERVER is not defined on the source server.
08/13/20 08:05:26 ANR1663E Open Server: Server OLDSERVER not defined
08/13/20 08:05:26 ANR1651E Server information for OLDSERVER is not available.
08/13/20 08:05:26 ANR4377E Session failure, target server OLDSERVER is not defined on the source server.
08/13/20 08:05:28 ANR1663E Open Server: Server OLDSERVER not defined
08/13/20 08:05:28 ANR1651E Server information for OLDSERVER is not available.
08/13/20 08:05:28 ANR4377E Session failure, target server OLDSERVER is not defined on the source server.

SOLUTION:
QUERY REPLSERVER shows the GUID
REMOVE REPLSERVER (GUID) to cause the errors to stop.
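Before removing anything, it is worth confirming which stale server names are actually behind the noise. A minimal sketch, assuming each message arrives on a single line; undefined_repl_targets is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: read activity-log lines on stdin and print each
# target server named in an ANR4377E message, with how often it appears.
undefined_repl_targets() {
    sed -n 's/.*ANR4377E Session failure, target server \([^ ]*\) is not defined.*/\1/p' |
        sort | uniq -c | awk '{ print $2, $1 }'
}

# Example:
#   undefined_repl_targets < actlog.txt
```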


Replacing TSM / ISP server

I ran into an issue where a primary TSM 7.1.8 server was broken, and it was easier to just move all of the clients over to the secondary 8.1.5 server. These used the new TLS encryption, and I kept running into issues.

    —————

First, I physically shut down the old server, and updated the new server to use the IP as an alias by editing /etc/network/interfaces. (Ubuntu 16 LTS)

    —————

Various errors included:
ANR8599W The connection with host address:host port failed due to an untrusted server certificate. An attempt to reconnect and establish certificate trust might follow.

ANR2284S The server master encryption key has changed. Passwords protected with the previous master encryption key are not available.

I had to make the server trust itself again with:
dsmcert -add -server TSM -file /home/tsminst1/cert256.arm

    —————

This error:
ANR0456W Session rejected for server DT – the server name at 192.168.1.99, 1500 does not match.

I removed the server:
REMOVE SERVER DT

    —————

These errors:
ANR1651E Server information for DT is not available.
ANR4377E Session failure, target server DT is not defined on the source server.
ANR3151E Configuration refresh failed with configuration manager DT.

I disabled replication config
REMOVE REPLNODE *
Q REPLSERVER
REMOVE REPLSERVER {GUID}
DEL SUBSCRIPTION DEFAULT_PROFILE

    —————

These errors:
ANS1695E The certificate is not valid.
ANS1592E Failed to initialize SSL protocol.
ANS8023E Unable to establish session with server.
ANR3335W Unable to distribute certificate to for session .
ANR8599W The connection with 192.168.1.2:40250 failed due to an untrusted server certificate. An attempt to reconnect and establish certificate trust might follow.

On the clients, I had to remove /opt/tivoli/tsm/client/ba/bin/dsmcert.* and C:\Program Files\Tivoli\TSM\baclient\dsmcert.* per http://andrewjtobiason.com/index.php/2018/08/16/resolving-ssl-errors/

However, I also had to restore dsmcert / dsmcert.exe, which I clobbered in the process.
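A safer way to clear the certificate store is to list what the glob matches first and skip the utility itself. A minimal sketch, assuming the only thing to protect is a file ending in .exe (a plain dsmcert binary has no extension and is not matched by dsmcert.*); list_client_keystore is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: list the certificate store files in a client directory
# without matching the dsmcert utility itself.
# $1 = client directory, for example /opt/tivoli/tsm/client/ba/bin
list_client_keystore() {
    for f in "$1"/dsmcert.*; do
        [ -e "$f" ] || continue       # glob matched nothing
        case "$f" in
            *.exe) continue ;;        # dsmcert.exe is the utility, keep it
        esac
        echo "$f"
    done
}

# Example: review the list before deleting anything.
#   list_client_keystore /opt/tivoli/tsm/client/ba/bin
```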

    —————

And on the server, allow keys to be swapped again:
UPD NODE * SESSIONSECURITY=TRANSITIONAL
UPD ADMIN * SESSIONSECURITY=TRANSITIONAL

Swap keys on client:
dsmadmc

    —————

NOTE: I also removed /etc/adsm/* during troubleshooting, but that was not needed. That just led to me having to re-enter passwords again. Simply deleting the cert database corrected the problem on other clients.

    —————

I tried to order this in dependency order. I was sort of all over the place when I did it, and might have missed something. I just could not exactly get all of the right info from any one documentation source.


AIX 7.2.3.1 breaks GSKit 8.0.50.89

AIX 7.2.3 breaks GSKit8, up through GP29 (8.0.50.89).

This affects TSM/Spectrum Protect, Content Manager, Tivoli Directory Server, WebSphere, DB2, Informix, IBM HTTP Server, etc.

Before the reboot, everything still works, which implies the change is in the kernel.

We found it with TSM on AIX 7200-03-01-1838 and Spectrum Protect server 8.1.6.0.

Application crash and DBX follow below.

ANR7800I DSMSERV generated at 12:17:13 on Sep 11 2018.
IBM Spectrum Protect for AIX
Version 8, Release 1, Level 6.000
Licensed Materials - Property of IBM
(C) Copyright IBM Corporation 1990, 2018.
All rights reserved.
U.S. Government Users Restricted Rights - Use, duplication or disclosure
restricted by GSA ADP Schedule Contract with IBM Corporation.

ANR7801I Subsystem process ID is 10944920.
ANR0900I Processing options file /home/tsminst1/dsmserv.opt.
ANR7811I Using instance directory /home/tsminst1.
Illegal instruction(coredump)

# dbx /opt/tivoli/tsm/server/bin/dsmserv core.10944896.28165312
Type 'help' for help.
[using memory image in core.10944896.28165312]
reading symbolic information ...warning: no source compiled with -g

Illegal instruction (illegal opcode) in . at 0x0 ($t1)
warning: Unable to access address 0x0 from core

(dbx) where
.() at 0x0
gsk_src_create__FPPvPv(??, ??) at 0x9000000015b6d88
__ct__8GSKMutexFv(??) at 0x9000000018d664c
__ct__20GSKPasswordEncryptorFv(??) at 0x9000000018cb248
__ct__7gsk_envFv(??) at 0x900000000aaa6b0
GskEnvironmentOpen__FPPvb(??, ??) at 0x900000000ab14c4
gsk_environment_open(??) at 0x900000000ab277c
IPRA.$CheckGSKVersion() at 0x100eecf68
tlsInit() at 0x100eecd70
main(??, ??) at 0x10000112c

(dbx) th
thread state-k wchan state-u k-tid mode held scope function

$t1 run running 41877977 k no sys
$t2 run blocked 21234465 u no sys _cond_wait_global
$t3 run running 24380103 u no sys waitpid