Spectrum Protect (TSM) Operations Center on Ubuntu LTS

Per IBM, the Spectrum Protect server is supported on Ubuntu LTS 14.04, 16.04, 18.04, and 20.04.

https://www.ibm.com/support/pages/overview-ibm-spectrum-protect-supported-operating-systems

However, Operations Center (web GUI) is not supported on Ubuntu, only RHEL and SLES.

https://www.ibm.com/support/pages/ibm-spectrum-protect-operations-center-software-and-hardware-requirements
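
A quick pre-check can tell you whether the installer is going to object before you launch it. This is only a sketch: it assumes the distribution identifies itself through the ID field of /etc/os-release (rhel and sles are the values I would expect on the supported platforms), and both function names are mine, not IBM's.

```shell
# Hypothetical helper: succeed when an os-release ID is one of the platforms
# Operations Center lists as supported (RHEL or SLES only).
oc_supported_os() {
    case "$1" in
        rhel|sles) return 0 ;;
        *)         return 1 ;;
    esac
}

# Print the ID field from an os-release style file (default /etc/os-release).
os_release_id() {
    # Strip the key and any surrounding double quotes.
    sed -n 's/^ID=//p' "${1:-/etc/os-release}" | tr -d '"'
}
```

Call it as: oc_supported_os "$(os_release_id)" || echo "expect a validation error"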

On Ubuntu, the installer fails its prerequisite validation:

./install.sh -c
Validating package prerequisites...
=====> IBM Installation Manager> Update> Prerequisites
Validation results:
* [ERROR] IBM Spectrum Protect Operations Center 8.1.12000.20210326_0723 contains validation errors.
1. ERROR: The operating system on which you are installing the product is not supported. For more information, see http://www.ibm.com/support/docview.wss?uid=swg21243309.

Enter the number of the error or warning message above to view more details.

To skip the OS and platform checks, and downgrade the ERROR to a WARNING:

./install.sh -c -vmargs "-DBYPASS_TSM_REQ_CHECKS=true"
Validation results:
* [WARNING] IBM Spectrum Protect Operations Center 8.1.12000.20210326_0723 contains validation warning.
1. WARNING: The operating system on which you are installing the product is not supported. For more information, see http://www.ibm.com/support/docview.wss?uid=swg21243309.

Enter the number of the error or warning message above to view more details.

I recommend using this bypass ONLY to install or update Operations Center. Then exit and run the installer again normally to make sure the other filesets validate okay.


ANR1812E DELETE FILESPACE VMFULL failed because replication

ERROR:

ANR1812E DELETE FILESPACE VMFULL failed because replication

DESCRIPTION:

Decommissioned VMs fail to auto-delete during expiration while replication is running. In an ideal world, there would be enough system resources to perform DB Backup in 2 hours, expiration in 2 hours, and replication in 4-8 hours. In this environment, replication overlaps a lot of other processes, and can get in the way.

ANR1812E DELETE FILESPACE VMFULL-SOMENODENAME for node failed deletion because of a replication in progress. (SESSION: 123456)

 

WORKAROUND:

Identify the server

 

Cancel replication
CANCEL REPLICATION

 

Identify the filespace
VMFULL-SOMENODENAME in the example

 

Find the node that owns the filespace.
Protect: TSM>q occ * *VMFULL-SOMENODENAME *
NODE_NAME       Type   FILESPACE_NAME         FSID   Files   Phys MB   Logical MB
VM_DATACENTER   Bkup   \VMFULL-SOMENODENAME      4   53084         -    6,782,908

 

Delete the filespace on both local and replica:
DELETE FI VM_DATACENTER '\VMFULL-SOMENODENAME'
TSM2: DELETE FI VM_DATACENTER '\VMFULL-SOMENODENAME'

 

Monitor Progress until complete
Protect: TSM>q occ * *VMFULL-SOMENODENAME *
NODE_NAME       Type   FILESPACE_NAME         FSID   Files   Phys MB   Logical MB
VM_DATACENTER   Bkup   \VMFULL-SOMENODENAME      4   50848         -    6,469,955

Protect: TSM>q act search=ANR1812E
03/07/21   23:14:23      ANR2017I Administrator ADMIN issued command: QUERY ACTLOG search=ANR1812E  (SESSION: 438180)

Protect: TSM>q proc
Process      Process Description          Job Id     Process Status                                   
--------     --------------------     ----------     -------------------------------------------------
     395     DELETE FILESPACE                        Deleting file space \VMFULL-SOMENODENAME
                                                      (fsId=4) (which can include backup and archive
                                                      data) for node VM_DATACENTER    : 0 objects deleted,
                                                      0 objects retained, and 0 objects skipped.

Protect: TSM>TSM2: q proc
ANR1699I Resolved TSM2 to 1 server(s) – issuing command Q PROC against server(s).
ANR1687I Output for command ‘Q PROC’ issued against server TSM2 follows:
Process      Process Description          Job Id     Process Status                                   
--------     --------------------     ----------     -------------------------------------------------
   5,756     DELETE FILESPACE                        Deleting file space \VMFULL-SOMENODENAME
                                                      (fsId=4) (which can include backup and archive
                                                      data) for node VM_DATACENTER: 0 objects deleted,
                                                      0 objects retained, and 0 objects skipped.
ANR1688I Output for command ‘Q PROC’ issued against server TSM2 completed.
ANR1694I Server TSM2 received the request to process command ‘Q PROC’.
ANR1697I Command ‘Q PROC’ processed by 1 server(s):  1 successful, 0 with warnings, and 0 with errors.
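
If you would rather script the wait than keep re-typing Q PROC, something like this works as a sketch. It only looks at text on stdin, so it makes no assumptions about how you log in; feed it the Q PROC output from each server. The function name is made up, and it assumes the status text starts with "Deleting file space" as in the output above.

```shell
# Hypothetical helper: read Q PROC output on stdin and succeed (exit 0)
# while a delete of the given filespace is still listed.
delete_fs_running() {
    # -F matches the string literally, so the backslash in the
    # filespace name needs no escaping.
    grep -F "Deleting file space $1" >/dev/null
}
```

A polling loop could then look like: while dsmadmc -id=ADMIN -password=PASSWORDHERE -dataonly=yes q proc | delete_fs_running '\VMFULL-SOMENODENAME'; do sleep 300; done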

 

CAUSE:

Replicate Node, a normal operation, creates locks on any filespace to be processed.

The long-term resolution would be to have enough system resources to not have to overlap daily operations processes.

The benchmark set by IBM for this would be the ability to complete BACKUP DB in 2 hours.  This environment takes 8-12 hours for most servers.



ANR2568E Request for node (node) to start schedule (name) at (date) is denied.

ERROR:

ANR2568E Request for node (node) to start schedule (name) at (date) is denied.

 

DESCRIPTION:

This happens when two or more schedulers are connecting as the same node.  One node starts work on a schedule, and the others are denied.

 

 

WORKAROUND:

Check the client node for two or more “dsmcad” and “dsmc sched” processes with the same (or no) config file listed.

Kill the oldest duplicates.
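
To spot the duplicates quickly, a small filter over ps output helps. This is a sketch, not a polished tool: the function name is invented, and it assumes each input line is a PID followed by the full command line (which is what ps -eo pid,args prints).

```shell
# Hypothetical helper: read "PID COMMAND..." lines on stdin and print every
# scheduler command line that occurs more than once, prefixed by the number
# of copies found.
dup_schedulers() {
    awk '
        # Match the client acceptor daemon or a "dsmc sched" process.
        /dsmcad/ || /dsmc .*sched/ {
            $1 = ""            # drop the PID so identical commands compare equal
            sub(/^ +/, "")     # remove the gap the PID left behind
            count[$0]++
        }
        END {
            for (cmd in count)
                if (count[cmd] > 1)
                    print count[cmd], cmd
        }
    '
}
```

Run it as: ps -eo pid,args | dup_schedulers. Any line it prints is a command that is running more than once; ps -eo pid,etime,args then shows which copy is the oldest.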

 

If no duplicates are on the client, then search the activity log to see if this client is connecting with multiple IP addresses or hostnames.

If so, find out which host should not be running the scheduler, and kill the scheduler processes on that host.
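
The activity log search can be boiled down to a list of distinct addresses per node. A sketch, with loud assumptions: the function name is mine, and it assumes session starts are logged as ANR0406I lines that end in "(Tcp/Ip address(port))". Check a sample line from your own actlog before trusting it.

```shell
# Hypothetical helper: read activity log text on stdin and print each distinct
# address that started a session for the given node (case-sensitive).
# ASSUMPTION: session starts look like
#   ANR0406I Session 12 started for node NODE1 (Linux x86-64) (Tcp/Ip 10.0.0.5(41234)).
node_addresses() {
    # The trailing space keeps NODE1 from also matching NODE10.
    grep -F "started for node $1 " |
        sed -n 's/.*(Tcp\/Ip \([^()]*\)([0-9]*)).*/\1/p' |
        sort -u
}
```

More than one line of output for a single node means more than one host (or interface) is connecting under that node name.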

 

This may require coordination with the UNIX team in cases of cluster failovers.

This may require investigation of start scripts in cases where the same client chronically has duplicates.

 

CAUSE:

Typically, a human will restart a scheduler, but fail to kill the original.

Sometimes, a start/stop script on a host fails to stop the prior instance.

In some cases, multiple start scripts fire on system boot.

 



NDMP TOC failure – datamover type incorrect

ERROR:
ANR4950E The server is unable to retrieve NDMP file history information while building table of contents for node NASNODE01, file space /SVM_NASNODE01_VIRTUALFS. NDMP node ID is 90156245149. Table of contents creation fails.

CAUSE:
One possible cause is a datamover that was defined with the wrong scope (TYPE).
TYPE can be NAS, NASVSERVER, or NASCLUSTER.  NAS is for node context.  NASVSERVER is for SVM context.  NASCLUSTER is for the whole cluster context.

NOTE: There are other possible causes, such as corrupt inodes, or other issues; however, this one bit me and was not clearly defined anywhere else.

CORRECTION:
You cannot UPDATE DATAMOVER TYPE=blah, but you can simply DELETE DATAMOVER and DEFINE DATAMOVER to fix.

DELETE DATAMOVER NASNODE01
DEFINE datamover NASNODE01 type=nascluster dataformat=netappdump hla=192.168.128.1 user=NDMPADMIN password=PASSWORDHERE
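
To keep the three TYPE values straight, here is a small sketch that maps the scope you are backing up at to the TYPE keyword and prints the delete/define pair. The function names and the scope words (node, svm, cluster) are my own shorthand; the commands are printed, not executed, and the password is left as a placeholder.

```shell
# Hypothetical helper: map an NDMP scope to the DEFINE DATAMOVER TYPE value.
datamover_type() {
    case "$1" in
        node)    echo NAS ;;
        svm)     echo NASVSERVER ;;
        cluster) echo NASCLUSTER ;;
        *)       echo "unknown scope: $1" >&2; return 1 ;;
    esac
}

# Print (not run) the commands that redefine a datamover with the right TYPE.
# Arguments: datamover name, scope, address, NDMP user.
datamover_redefine() {
    dm_type=$(datamover_type "$2") || return 1
    echo "DELETE DATAMOVER $1"
    echo "DEFINE DATAMOVER $1 type=$dm_type dataformat=netappdump hla=$3 user=$4 password=PASSWORDHERE"
}
```

For example, datamover_redefine NASNODE01 svm 192.168.128.1 NDMPADMIN prints the pair with type=NASVSERVER, ready to review and paste into dsmadmc.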

TRACING INFO:

trace disable
trace enable spi spid toc
trace begin /tmp/server.trc

Once tracing is enabled, start another backup of the affected volume (/SVM_SBNAS01_OU_ABOD in this case). When the backup completes or fails, issue the following commands to stop tracing:

trace flush
trace end
trace disable
QUERY ACTLOG

grep NDMP dsmffdc.log

NASNODE01::> node run -node SBNAS01-01
Type ‘exit’ or ‘Ctrl-D’ to return to the CLI
NASNODE01> rdfile /etc/log/backup


TSM SP Remove ReplServer

PROBLEM:
Every 5.5 minutes, this shows up in the actlog

08/13/20 08:05:25 ANR1663E Open Server: Server OLDSERVER not defined
08/13/20 08:05:25 ANR1651E Server information for OLDSERVER is not available.
08/13/20 08:05:25 ANR4377E Session failure, target server OLDSERVER is not defined on the source server.
08/13/20 08:05:25 ANR1663E Open Server: Server OLDSERVER not defined
08/13/20 08:05:25 ANR1651E Server information for OLDSERVER is not available.
08/13/20 08:05:25 ANR4377E Session failure, target server OLDSERVER is not defined on the source server.
08/13/20 08:05:26 ANR1663E Open Server: Server OLDSERVER not defined
08/13/20 08:05:26 ANR1651E Server information for OLDSERVER is not available.
08/13/20 08:05:26 ANR4377E Session failure, target server OLDSERVER is not defined on the source server.
08/13/20 08:05:28 ANR1663E Open Server: Server OLDSERVER not defined
08/13/20 08:05:28 ANR1651E Server information for OLDSERVER is not available.
08/13/20 08:05:28 ANR4377E Session failure, target server OLDSERVER is not defined on the source server.

SOLUTION:
QUERY REPLSERVER shows the GUID
REMOVE REPLSERVER (GUID) to cause the errors to stop.
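
To confirm which stale target is causing the noise (and that the errors stop afterwards), the ANR4377E lines can be tallied per server name. A sketch only: the function name is mine, and it relies on the message wording shown in the actlog excerpt above.

```shell
# Hypothetical helper: read activity log text on stdin and print each server
# name that ANR4377E reports as undefined, followed by the number of hits.
undefined_repl_targets() {
    sed -n 's/.*ANR4377E Session failure, target server \([^ ]*\) is not defined.*/\1/p' |
        sort | uniq -c | awk '{ print $2, $1 }'
}
```

Pipe a saved Q ACTLOG search=ANR4377E listing into it before and after the REMOVE REPLSERVER; the second run should print nothing for new timestamps.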


AIX 7.2.3.1 breaks GSKit 8.0.50.89

AIX 7.2.3 breaks GSKit8, up through GP29 (8.0.50.89).

This affects TSM/Spectrum Protect, Content Manager, Tivoli Directory Server, WebSphere, DB2, Informix, IBM HTTP Server, etc.

Before the reboot, everything still works, which implies the change is in the kernel.

We found it on AIX 7200-03-01-1838 with Spectrum Protect (TSM) server 8.1.6.0.
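
A pre-reboot check along these lines would have saved us some pain. It is a sketch under stated assumptions: the function names are mine, it flags only the 7200-03 technology level because that is all we confirmed, and it treats every GSKit level up to 8.0.50.89 as affected. You supply the two strings yourself (oslevel -s for AIX; lslpp or gsk8ver_64 for GSKit, for example).

```shell
# Hypothetical helper: succeed when dotted version $1 <= dotted version $2.
ver_le() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        n = split(a, x, "[.]"); m = split(b, y, "[.]")
        # Compare field by field as numbers; missing fields count as 0.
        for (i = 1; i <= (n > m ? n : m); i++) {
            if (x[i] + 0 < y[i] + 0) exit 0
            if (x[i] + 0 > y[i] + 0) exit 1
        }
        exit 0
    }'
}

# Hypothetical helper: succeed when the combination matches what bit us.
#   $1: AIX level as printed by "oslevel -s", e.g. 7200-03-01-1838
#   $2: GSKit version, e.g. 8.0.50.89
gskit_at_risk() {
    case "$1" in
        7200-03-*) ver_le "$2" 8.0.50.89 ;;   # ASSUMPTION: only TL3 is flagged
        *)         return 1 ;;
    esac
}
```

For example: gskit_at_risk "$(oslevel -s)" 8.0.50.89 && echo "update GSKit before rebooting"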

Application crash and DBX follow below.

ANR7800I DSMSERV generated at 12:17:13 on Sep 11 2018.
IBM Spectrum Protect for AIX
Version 8, Release 1, Level 6.000
Licensed Materials - Property of IBM
(C) Copyright IBM Corporation 1990, 2018.
All rights reserved.
U.S. Government Users Restricted Rights - Use, duplication or disclosure
restricted by GSA ADP Schedule Contract with IBM Corporation.

ANR7801I Subsystem process ID is 10944920.
ANR0900I Processing options file /home/tsminst1/dsmserv.opt.
ANR7811I Using instance directory /home/tsminst1.
Illegal instruction(coredump)

# dbx /opt/tivoli/tsm/server/bin/dsmserv core.10944896.28165312
Type 'help' for help.
[using memory image in core.10944896.28165312]
reading symbolic information ...warning: no source compiled with -g

Illegal instruction (illegal opcode) in . at 0x0 ($t1)
warning: Unable to access address 0x0 from core

(dbx) where
.() at 0x0
gsk_src_create__FPPvPv(??, ??) at 0x9000000015b6d88
__ct__8GSKMutexFv(??) at 0x9000000018d664c
__ct__20GSKPasswordEncryptorFv(??) at 0x9000000018cb248
__ct__7gsk_envFv(??) at 0x900000000aaa6b0
GskEnvironmentOpen__FPPvb(??, ??) at 0x900000000ab14c4
gsk_environment_open(??) at 0x900000000ab277c
IPRA.$CheckGSKVersion() at 0x100eecf68
tlsInit() at 0x100eecd70
main(??, ??) at 0x10000112c

(dbx) th
thread state-k wchan state-u k-tid mode held scope function

$t1 run running 41877977 k no sys
$t2 run blocked 21234465 u no sys _cond_wait_global
$t3 run running 24380103 u no sys waitpid


Spectrum Protect / TSM systemd autostart


cat <<'EOF' >/etc/systemd/system/db2fmcd.service
[Unit]
Description=DB2V111

[Service]
ExecStart=/opt/tivoli/tsm/db2/bin/db2fmcd
Restart=always
KillMode=process
KillSignal=SIGHUP

[Install]
WantedBy=default.target
EOF
systemctl enable db2fmcd.service
systemctl start db2fmcd.service

cp -p /opt/tivoli/tsm/server/bin/dsmserv.rc /etc/init.d/tsminst1
cat <<'EOF' >/etc/systemd/system/tsminst1.service
[Unit]
Description=tsminst1
Requires=db2fmcd.service
After=db2fmcd.service

[Service]
Type=forking
ExecStart=/etc/init.d/tsminst1 start
ExecReload=/etc/init.d/tsminst1 reload
ExecStop=/etc/init.d/tsminst1 stop
StandardOutput=journal

[Install]
WantedBy=multi-user.target
EOF
systemctl enable tsminst1.service
systemctl start tsminst1.service
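
One systemd detail worth checking in units like these: Requires= pulls the other unit in but does not order startup; ordering comes from After=. The lint below is a sketch (the function name is mine) that lists any unit named in Requires= without a matching After= in the same file. It assumes simple one-line Requires=/After= entries.

```shell
# Hypothetical lint: print every unit listed in Requires= that is not also
# listed in After= within the given unit file.
requires_without_after() {
    awk -F= '
        $1 == "Requires" { n = split($2, r, " "); for (i = 1; i <= n; i++) req[r[i]] = 1 }
        $1 == "After"    { n = split($2, a, " "); for (i = 1; i <= n; i++) aft[a[i]] = 1 }
        END { for (u in req) if (!(u in aft)) print u }
    ' "$1"
}
```

Run it as: requires_without_after /etc/systemd/system/tsminst1.service. No output means every required unit is also ordered before this one.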

ln -s /opt/tivoli/tsm/client/ba/bin/rc.dsmcad /etc/init.d/dsmcad
cat <<'EOF' >/etc/systemd/system/dsmcad.service
[Unit]
Description=dsmcad

[Service]
Type=forking
ExecStart=/etc/init.d/dsmcad start
ExecReload=/etc/init.d/dsmcad reload
ExecStop=/etc/init.d/dsmcad stop
StandardOutput=journal

[Install]
WantedBy=multi-user.target
EOF
systemctl enable dsmcad.service
systemctl start dsmcad.service


Spectrum Protect – container vulnerability

We ran into an issue where a level-zero operator became root, and cleaned up some TSM dedupe-pool containers so he’d stop getting full filesystem alerts.

Things exposed:

How does someone that green get full, unmonitored root access?
* They gave false information about timestamps when defending their actions.
* Their senior tech lead was content to merely advise them not to move or delete files without contacting the app owner.
* Imagine if this had been a customer facing database server!

In ISP/TSM, once extents are marked damaged, a new backup of that extent will replace it.
* Good TDP4VT CTL files and other incrementals will send missing files.
* TDP for VMWare full backups fail if the control file backup is damaged.
* Damaged extents do not mark files as damaged or missing.

Replicate Node will back-propagate damaged files.
* Damaged extents do not mark files as damaged or missing.

Also, in case you missed that:
* Damaged extents do not mark files as damaged or missing.

For real, IBM says:
* Damaged extents do not mark files as damaged or missing.
* “That might cause a whole bunch of duplicates to be ingested and processed.”

IBM’s option is to use REPAIR STGPOOL.
* Requires a prior PROTECT STGPOOL (similar to BACKUP STGPOOL and RESTORE STGPOOL).
* PROTECT STGPOOL can go to a container copy on tape, a container copy on FILE, or a container primary on the replica target server.
* PROTECT STGPOOL cannot go to a cloud pool
* STGRULE TIERING only processes files, not PROTECT extents.
* PROTECT STGPOOL cannot go to a cloud pool that way either.
* There is NO WAY to use cloud storage pool to protect a container pool from damage.
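
For the record, the protect-then-repair sequence looks roughly like the sketch below. The function names and the pool name in the example are placeholders of mine; the commands are printed rather than run, and you should check the parameters against the command reference for your server level before using them.

```shell
# Shared check: the protected copy is either on the target replication
# server ("replserver") or in a local container-copy pool ("local").
valid_location() {
    case "$1" in
        replserver|local) return 0 ;;
        *) echo "location must be replserver or local" >&2; return 1 ;;
    esac
}

# Routine protection, to be scheduled BEFORE any damage happens.
# $1 = container pool name, $2 = replserver or local
protect_cmd() {
    valid_location "$2" || return 1
    echo "PROTECT STGPOOL $1 TYPE=$2"
}

# After damage: list what is damaged, then repair from the protected copy.
repair_cmds() {
    valid_location "$2" || return 1
    echo "QUERY DAMAGED $1 TYPE=STATUS"
    echo "REPAIR STGPOOL $1 SRCLOCATION=$2"
}
```

For example, repair_cmds DEDUPEPOOL replserver prints the two commands to paste into dsmadmc. None of this helps if PROTECT STGPOOL never ran, which is the whole point of the list above.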

EXCEPTION: Damaged extents can be replaced by REPLICATE NODE into a pool.
* You can DISABLE SES, and reverse the replication config.
* Replicate node that way will perform a FULL READ of the source pool.

There is a Request For Enhancement from November, 2017 for TYPE=CLOUD POOLTYPE=COPY.
* That would be a major code effort, but would solve this major hole.
* That has not gotten a blink from product engineering.
* Not even an “under review”, nor “No Way”, nor “maybe sometime”.

Alternatives for PROTECT into CLOUD might be:
* Don’t use cloud. Double the amount of local disk space, and replicate to another datacenter.
* Use NFS (We would need to build a beefy VM, and configure KRB5 at both ends, so we could do NFSv4 encrypted).
* Use CIFS (the host is on AIX, which does not support CIFS v3. Linux conversion up front before we had bulk data was given a big NO.)
* Use azfusefs (Again, it’s not Linux)

Anyway, maybe in 2019 this can be resolved, but this is the sort of thing that really REALLY was poorly documented, and did not get the time and resources to be tested in advance. This is the sort of thing that angers everyone at every level.

REFERENCE (NFS mount options): hard,intr,nfsvers=4,tcp,rsize=1048576,wsize=1048576,bg,noatime


ANR3114E LDAP error 81. Failure to connect to the LDAP server

This used to be on IBM’s website, but it disappeared.  It is referenced all over the net, and needed to still exist. I only found it in the wayback machine, so I’m adding another copy to the internet.

2013 SOURCE: www-01.ibm.com/support/docview.wss?uid=swg21656339

Problem(Abstract)

When the SET LDAPUSER command is used, the connection can fail with:

ANR3114E LDAP error 81 (Can’t contact LDAP server)

Cause

The user common name (CN) in the SET LDAPUSER command contains a space or the ldapurl option is incorrectly specified.

Diagnosing the problem

Collect a trace of the Tivoli Storage Manager Server using the following trace classes:
session verbdetail ldap ldapcache unicode

More information about tracing the server can be found here: Enabling a trace for the server or storage agent

The following errors are reported within the trace:

11:02:04.127 [44][output.c][7531][PutConsoleMsg]:ANR2017I Administrator ADMIN issued command: SET LDAPPASSWORD ?***? ~
11:02:04.171 [44][ldapintr.c][548][ldapInit]:Entry: ldapUserNew =      CN=tsm user,OU=TSM,DC=ds,DC=example,DC=com
11:02:04.173 [44][ldapintr.c][5851][LdapHandleErrorEx]:Entry: LdapOpenSession(ldapintr.c:2340) ldapFunc = ldap_start_tls_s_np, ldapRc = 81, ld = 0000000001B0CAB0
11:02:04.174 [44][ldapintr.c][5867][LdapHandleErrorEx]:ldap_start_tls_s_np returned LDAP code 81(Can't contact LDAP server), LDAP Server message ((null)), and possible GSKIT SSL/TLS error 0(Success)
11:02:04.174 [44][output.c][7531][PutConsoleMsg]:ANR3114E LDAP error 81 (Can't contact LDAP server) occurred during ldap_start_tls_s_np.~
11:02:04.174 [44][ldapintr.c][6079][LdapHandleErrorEx]:Exit: rc = 2339, LdapOpenSession(ldapintr.c:2340), ldapFunc = ldap_start_tls_s_np, ldapRc = 81, ld = 0000000001B0CAB0
11:02:04.174 [44][ldapintr.c][1580][ldapCloseSession]:Entry: sessP = 0000000009B99CD0
11:02:04.175 [44][ldapintr.c][3159][LdapFreeSess]:Entry: sessP = 0000000009B99CD0
11:02:04.175 [44][ldapintr.c][2449][LdapOpenSession]:Exit: rc = 2339, ldapHandleP = 000000000AFDE740, bindDn =                              (CN=tsm user,OU=TSM,DC=ds,DC=example,DC=com)
11:02:04.175 [44][output.c][7531][PutConsoleMsg]:ANR3103E Failure occurred while initializing LDAP directory services.~
11:02:04.175 [44][ldapintr.c][856][ldapInit]:Exit: rc = 2339
11:02:04.175 [44][output.c][7531][PutConsoleMsg]:ANR2732E Unable to communicate with the external LDAP directory server.~

Resolving the problem

  • In the trace provided, the common name (CN) contains a space. (CN=tsm user,OU=TSM,DC=ds,DC=example,DC=com)

    Remove the space in the common name when using the SET LDAPUSER command. For example:

    SET LDAPUSER "CN=tsmuser,OU=TSM,DC=ds,DC=example,DC=com"

  • Use an LDAP connection utility such as ldp.exe to ensure the ldapurl option is correct and the LDAP server is accepting connections

    <ldapurl> port 636, check the box for SSL

    Verify there are no errors in the output
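
Not part of the original technote: a small sketch for checking a bind DN before handing it to SET LDAPUSER. The function name is mine. It looks only at the first CN= component, since that is the part the technote calls out.

```shell
# Hypothetical helper: succeed when the CN component of a distinguished name
# contains a space, the condition the technote blames for LDAP error 81.
cn_has_space() {
    # Pull out the leading CN=... component, up to the next comma.
    cn=$(printf '%s\n' "$1" | sed -n 's/^[[:space:]]*[Cc][Nn]=\([^,]*\).*/\1/p')
    case "$cn" in
        *" "*) return 0 ;;
        *)     return 1 ;;
    esac
}
```

For example: cn_has_space "CN=tsm user,OU=TSM,DC=ds,DC=example,DC=com" && echo "rename the account or pick another bind user"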


Protect initial install

This is happiness…

tsminst1@tsm:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04.3 LTS
Release: 16.04
Codename: xenial

/bin/bash# for i in /dev/sd? ; do smartctl -a "$i" ; done | grep 'Device Model'
Device Model: Samsung SSD 850 EVO 250GB
Device Model: WDC WD30EFRX-68EUZN0
Device Model: Samsung SSD 850 EVO 250GB
Device Model: WDC WD30EFRX-68EUZN0
Device Model: WDC WD30EFRX-68EUZN0

tsminst1@tsm:~$ dsmserv format dbdir=/tsm/db01,/tsm/db02,/tsm/db03,/tsm/db04,/tsm/db05,/tsm/db06,/tsm/db07,/tsm/db08 \
> activelogsize=8192 activelogdirectory=/tsm/log archlogdirectory=/tsm/logarch

ANR7800I DSMSERV generated at 11:32:48 on Sep 19 2017.

IBM Spectrum Protect for Linux/x86_64
Version 8, Release 1, Level 3.000

Licensed Materials - Property of IBM

(C) Copyright IBM Corporation 1990, 2017.
All rights reserved.
U.S. Government Users Restricted Rights - Use, duplication or disclosure
restricted by GSA ADP Schedule Contract with IBM Corporation.

ANR7801I Subsystem process ID is 29286.
ANR0900I Processing options file /home/tsminst1/dsmserv.opt.
ANR0010W Unable to open message catalog for language en_US.UTF-8. The default language message catalog will be used.
ANR7814I Using instance directory /home/tsminst1.
ANR3339I Default Label in key data base is TSM Server SelfSigned SHA Key.
ANR4726I The ICC support module has been loaded.
ANR0152I Database manager successfully started.
ANR2976I Offline DB backup for database TSMDB1 started.
ANR2974I Offline DB backup for database TSMDB1 completed successfully.
ANR0992I Server’s database formatting complete.
ANR0369I Stopping the database manager because of a server shutdown.