Spectrum Protect (TSM) Operations Center on Ubuntu LTS

Per IBM, the Spectrum Protect server is supported on Ubuntu LTS 14.04, 16.04, 18.04, and 20.04.

https://www.ibm.com/support/pages/overview-ibm-spectrum-protect-supported-operating-systems

However, Operations Center (web GUI) is not supported on Ubuntu, only RHEL and SLES.

https://www.ibm.com/support/pages/ibm-spectrum-protect-operations-center-software-and-hardware-requirements

./install.sh -c
Validating package prerequisites...
=====> IBM Installation Manager> Update> Prerequisites
Validation results:
* [ERROR] IBM Spectrum Protect Operations Center 8.1.12000.20210326_0723 contains validation errors.
1. ERROR: The operating system on which you are installing the product is not supported. For more information, see http://www.ibm.com/support/docview.wss?uid=swg21243309.

Enter the number of the error or warning message above to view more details.

To skip the OS and platform checks, and convert the ERROR into a WARNING:

./install.sh -c -vmargs "-DBYPASS_TSM_REQ_CHECKS=true"
Validation results:
* [WARNING] IBM Spectrum Protect Operations Center 8.1.12000.20210326_0723 contains validation warning.
1. WARNING: The operating system on which you are installing the product is not supported. For more information, see http://www.ibm.com/support/docview.wss?uid=swg21243309.

Enter the number of the error or warning message above to view more details.

I recommend installing or updating ONLY Operations Center with this bypass, then exiting and running the installer again normally to make sure the other filesets validate okay.
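
A minimal sketch of how the second, normal pass could be checked from a script, assuming the installer's console output was saved to a file. The function name is hypothetical; only the [ERROR] and [WARNING] markers come from the output shown above.

```shell
#!/bin/sh
# Classify captured Installation Manager validation output.
# Prints "error", "warning", or "clean" based on the [ERROR] / [WARNING]
# markers seen in the validation results. Hypothetical helper, not an IBM tool.
validation_status() {
    if grep -q '\[ERROR\]' "$1"; then
        echo error
    elif grep -q '\[WARNING\]' "$1"; then
        echo warning
    else
        echo clean
    fi
}
```

For example, `validation_status /tmp/oc_install.log` (the path is a placeholder) should print "warning" for the bypassed run and, ideally, "clean" for the normal run afterwards.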


ANR1812E DELETE FILESPACE VMFULL failed because replication

ERROR:

ANR1812E DELETE FILESPACE VMFULL failed because replication

DESCRIPTION:

Decommissioned VMs fail to auto-delete during expiration because replication is running. In an ideal world, there would be enough system resources to perform DB Backup in 2 hours, expiration in 2 hours, and replication in 4-8 hours. In this environment, replication overlaps a lot of other processes, and can get in the way.

ANR1812E DELETE FILESPACE VMFULL-SOMENODENAME for node failed deletion because of a replication in progress. (SESSION: 123456)

 

WORKAROUND:

Identify the server

 

Cancel replication
CANCEL REPLICATION

 

Identify the filespace
VMFULL-SOMENODENAME in the example

 

Find the node that owns the filespace.
Protect: TSM>q occ * *VMFULL-SOMENODENAME *
NODE_NAME       Type     FILESPACE_NAME          FSID   Files   Phys MB   Logical MB
VM_DATACENTER   Bkup     \VMFULL-SOMENODENAME       4   53084         -    6,782,908

 

Delete the filespace on both local and replica:
DELETE FI VM_DATACENTER '\VMFULL-SOMENODENAME'
TSM2: DELETE FI VM_DATACENTER '\VMFULL-SOMENODENAME'

 

Monitor Progress until complete
Protect: TSM>q occ * *VMFULL-SOMENODENAME *
NODE_NAME       Type     FILESPACE_NAME          FSID   Files   Phys MB   Logical MB
VM_DATACENTER   Bkup     \VMFULL-SOMENODENAME       4   50848         -    6,469,955

Protect: TSM>q act search=ANR1812E
03/07/21   23:14:23      ANR2017I Administrator ADMIN issued command: QUERY ACTLOG search=ANR1812E  (SESSION: 438180)

Protect: TSM>q proc
Process      Process Description      Job Id         Process Status
--------     --------------------     ----------     -------------------------------------------------
     395     DELETE FILESPACE                        Deleting file space \VMFULL-SOMENODENAME
                                                      (fsId=4) (which can include backup and archive
                                                      data) for node VM_DATACENTER    : 0 objects deleted,
                                                      0 objects retained, and 0 objects skipped.

Protect: TSM>TSM2: q proc
ANR1699I Resolved TSM2 to 1 server(s) - issuing command Q PROC against server(s).
ANR1687I Output for command 'Q PROC' issued against server TSM2 follows:
Process      Process Description      Job Id         Process Status
--------     --------------------     ----------     -------------------------------------------------
   5,756     DELETE FILESPACE                        Deleting file space \VMFULL-SOMENODENAME
                                                      (fsId=4) (which can include backup and archive
                                                      data) for node VM_DATACENTER: 0 objects deleted,
                                                      0 objects retained, and 0 objects skipped.
ANR1688I Output for command 'Q PROC' issued against server TSM2 completed.
ANR1694I Server TSM2 received the request to process command 'Q PROC'.
ANR1697I Command 'Q PROC' processed by 1 server(s):  1 successful, 0 with warnings, and 0 with errors.
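
The workaround steps above can be sketched as a small generator that prints the admin commands for one node and filespace, so they can be reviewed before being pasted into an administrative session. This is only an illustration: the function name is hypothetical, and nothing here talks to a server.

```shell
#!/bin/sh
# Print the admin commands from the workaround for one node/filespace pair.
# $1 = owning node, $2 = filespace name, $3 = replica server name.
# Hypothetical helper: it only prints text; nothing is sent to a server.
filespace_cleanup_commands() {
    node=$1 fs=$2 replica=$3
    printf 'CANCEL REPLICATION\n'
    printf "DELETE FILESPACE %s '%s'\n" "$node" "$fs"
    printf "%s: DELETE FILESPACE %s '%s'\n" "$replica" "$node" "$fs"
    printf 'QUERY PROCESS\n'
    printf '%s: QUERY PROCESS\n' "$replica"
}

# Values taken from the example above.
filespace_cleanup_commands VM_DATACENTER '\VMFULL-SOMENODENAME' TSM2
```

Printing rather than executing keeps a human in the loop, which seems prudent for a command that deletes backup data on two servers.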

 

CAUSE:

Replicate Node, a normal operation, creates locks on any filespace to be processed.

The long-term resolution would be to have enough system resources to not have to overlap daily operations processes.

The benchmark set by IBM for this would be the ability to complete BACKUP DB in 2 hours.  This environment takes 8-12 hours for most servers.


ANR2568E Request for node (node) to start schedule (name) at (date) is denied.

ERROR:

ANR2568E Request for node (node) to start schedule (name) at (date) is denied.

 

DESCRIPTION:

This happens when two or more schedulers are connecting as the same node.  One node starts work on a schedule, and the others are denied.

 

 

WORKAROUND:

Check the client node for two or more “dsmcad” and “dsmc sched” processes with the same (or no) config file listed.

Kill the oldest duplicates.

 

If no duplicates are on the client, then search the activity log to see if this client is connecting with multiple IP addresses or hostnames.

If so, find out which host should not be running the scheduler, and kill the scheduler processes on that host.

 

This may require coordination with the UNIX team in cases of cluster failovers.

This may require investigation of start scripts in cases where the same client chronically has duplicates.
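
A minimal sketch of the duplicate check described above, assuming a ps that supports `-eo pid,args`. The function name is hypothetical; it only reports identical scheduler command lines and never kills anything.

```shell
#!/bin/sh
# Read "PID ARGS..." lines (as produced by: ps -eo pid,args) on stdin and
# report scheduler command lines that appear more than once.
# Hypothetical helper; it only reports and never kills anything.
find_duplicate_schedulers() {
    awk '
        # Keep only dsmcad and "dsmc sched" processes.
        /dsmcad/ || /dsmc[ \t]+sched/ {
            pid = $1
            $1 = ""                 # drop the PID, keep the command line
            sub(/^[ \t]+/, "")
            count[$0]++
            pids[$0] = pids[$0] " " pid
        }
        END {
            for (cmd in count)
                if (count[cmd] > 1)
                    printf "%d copies:%s -> %s\n", count[cmd], pids[cmd], cmd
        }
    '
}
```

Typical use would be `ps -eo pid,args | find_duplicate_schedulers`. Schedulers started with different option files are treated as distinct, which matches the "same (or no) config file" rule above.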

 

CAUSE:

Typically, a human will restart a scheduler, but fail to kill the original.

Sometimes, a start/stop script on a host fails to stop the prior instance.

In some cases, multiple start scripts fire on system boot.

 


lilo slow boot map

Happily switched my boot back to SATA mirrors, and was able to reenable LILO COMPACT mode.

This means instead of 13,000 reads per boot file, it’s more like 50. Not only is booting a few seconds faster, but more importantly, updating the lilo boot map after installing a new kernel takes 10 seconds instead of 5 minutes.


SMB/CIFS 3 on AIX

Mounting should be vaguely similar to the SMB1 mounting you had before.

Download and install SMB Client 3, and “Network Authentication Service” (aka kerberos 5) from here:

https://www-01.ibm.com/marketing/iwm/iwm/web/pickUrxNew.do?source=aixbp

Ensure your Windows 201x server has SMB v3 enabled.

You want a service account in AD to use for your SMB3 mounts on AIX.

 

Notes about options:

encryption should be desired and secure_negotiate should be desired.
signing should be enabled
pver should be 3.0.2
The kerberos realm specified in the “wrkgrp” option must be in all UPPERCASE if your domain is in uppercase.
The username provided for mounting is used for all read/write permissions/access.  
UID and GID default to root.system, but you can specify others.
fmode is the inverse of umask, and what the files’ permissions look like across the whole share.  Default is 755.
port can be 139 (ipv4) or 445 (ipv4 or ipv6).  Default is 445.
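
As a worked example of "fmode is the inverse of umask", the value can be computed as the bitwise complement of a umask, limited to the nine permission bits. This is a sketch of the arithmetic only; the function name is hypothetical.

```shell
#!/bin/sh
# Derive an fmode value as the bitwise complement of a umask, limited to
# the nine permission bits. The argument and the result are octal.
fmode_from_umask() {
    printf '%03o\n' $(( 0777 & ~0$1 ))
}

fmode_from_umask 022   # prints 755
fmode_from_umask 027   # prints 750
```

So the default fmode of 755 corresponds to the familiar umask of 022.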

 

/etc/filesystems format:

/mnt:
     dev = /corpshare
     vfs = smbc
     mount = true
     options = "wrkgrp=CORP.DOMAIN,signing=enabled,pver=3.0.2,encryption=desired,secure_negotiate=desired"
     nodename = win2016server.corp.domain/sambauser

 

Command line example

mount -v smbc -n win2016server.corp.domain/sambauser/Passw0rd! \
-o "wrkgrp=CORP.DOMAIN,port=445,signing=required,encryption=required,\
secure_negotiate=desired,pver=auto" /corpshare /mnt

 

Store the samba credentials

mksmbcred -s win2016server.corp.domain -u sambauser [-p password]

See also lssmbcred, chsmbcred, and rmsmbcred.

 

Reference 2021:

https://www.ibm.com/docs/en/aix/7.2?topic=protocol-server-message-block-smb-client-file-system 


GPSD / NTPD / Debian 10 Buster

I think I finally have my GPS NTP server tweaked. Average deviation yesterday was 0.33ms.
 
Just threw that last adjustment in, and we’ll see tomorrow how it aligns (eg, am I just -0.33 now, or am I close to 0.03 off?)
 
Without the time1 offset, it was getting silently ignored. Obscure.
 
Config is simple once you understand it, but for me, the understanding part was tough.
 
Quick-Reference:
apt update && apt install gpsd ntp ntpdate
ntpdate time.nist.gov
 
Plug in the VK* or uBlox style GPS receiver and put it in a window.
dmesg | grep tty
cat <<'EOF' >> /etc/default/gpsd
START_DAEMON="true"
USBAUTO="false"
DEVICES="/dev/ttyACM0"
EOF
systemctl disable gpsd.socket
systemctl enable gpsd.service
systemctl restart gpsd.service
 
cat <<'EOF' > /etc/ntp.conf
### Public servers and permissions
pool time.nist.gov burst minpoll 5 maxpoll 5
pool us.pool.ntp.org burst minpoll 5 maxpoll 5
pool pool.ntp.org burst minpoll 5 maxpoll 5
server ntp01.frontier.com burst minpoll 5 maxpoll 5
server ntp02.frontier.com burst minpoll 5 maxpoll 5
restrict source notrap nomodify noquery
restrict default kod limited nomodify notrap nopeer noquery
restrict -6 default kod limited nomodify notrap nopeer noquery
restrict 127.0.0.1
restrict -6 ::1
 
### Stats needed for accuracy
driftfile /var/lib/ntp/ntp.drift
leapfile /usr/share/zoneinfo/leap-seconds.list
statsdir /var/log/ntpstats/
statistics loopstats peerstats clockstats
filegen loopstats file loopstats type day enable
filegen peerstats file peerstats type day enable
filegen clockstats file clockstats type day enable
 
### GPS time service (PPS does not work on my device)
server 127.127.28.0 minpoll 3 maxpoll 4
fudge 127.127.28.0 time1 0.0445 refid GPS
server 127.127.28.1 prefer minpoll 3 maxpoll 4
fudge 127.127.28.1 refid PPS
broadcast 192.168.1.255 minpoll 4 maxpoll 4
EOF
systemctl enable ntp
systemctl restart ntp
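
A rough way to reproduce the "average deviation" figure from the loopstats files enabled above. This is a sketch: it assumes the standard ntpd loopstats layout, where the third field of each record is the clock offset in seconds, and the function name is hypothetical.

```shell
#!/bin/sh
# Mean absolute clock offset, in milliseconds, from an ntpd loopstats file.
# Field 3 of each loopstats record is the offset in seconds.
mean_abs_offset_ms() {
    awk '
        NF >= 3 {
            off = $3 + 0
            if (off < 0) off = -off
            sum += off
            n++
        }
        END {
            if (n > 0) printf "%.3f\n", (sum / n) * 1000
        }
    ' "$1"
}
```

Point it at one of the daily files under /var/log/ntpstats/ to get a single number for that day. Using the absolute value hides the sign, so check a few raw records as well when deciding which way to move time1.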

Supermicro BMC firmware

Setting up the new replica server, I kept running into problems during the initial firmware update.  All IPMI settings and hardware data were inaccessible after the BMC firmware update, although the system otherwise worked as expected.  This condition persisted whether I went into DEL setup, the F11 boot menu, or the UEFI shell.

It was not that the sensor data was “not present”.  It was that the list of sensors was missing.  Also, the system MAC address and BIOS version info were missing.  The FRU data was empty, as well as the Hardware Information.  All of the IPMI settings were blank, and could not be set.  The diagnostic data page gave “File not found”.  The iKVM was inaccessible, and the system could not be put into maintenance mode, powered off, powered on, or reset from the IPMI interface.  All of the system logs were blank and inaccessible.  Support was not super helpful, but they were responsive.  Supermicro is one of the top tier system makers.  They OEM for IBM, but that equipment is not quite as touchy.

The solution is to be very finicky about the BMC firmware update.

Get the right version for your system here:
https://www.supermicro.com/support/resources/bios_ipmi.php

My system used this code:
https://www.supermicro.com/Bios/softfiles/12085/X11SDVN_BIOS_1_3a_IPMI_1_31_03.zip
Manufacturer Name: Supermicro
Product PartNum: SYS-E300-9D
Chassis Part Number: CSE-E300
Board Product Name: X11SDV-4C-TLN2F
BIOS Vendor: American Megatrends Inc.
Processor: Intel(R) Xeon(R) D-2123IT CPU @ 2.20GHz

Update the code with extra patience
Use the AwUpdate utility to update the IPMI/BMC firmware
.\AwUpdate -f ......\WS_X11AST2500_131_03.bin -i lan -h 192.168.1.210 -u ADMIN -p ADMIN
NOTE: This can be on some other system as long as you can connect TCP between the two.
NOTE: No -r, and in the web UI, we would uncheck all of the “preserve settings” options.

Let all 5 parts (0 through 4) complete
Wait for the “New firmware is updating” to complete
Wait for the system to reboot.

Monitor the console
Wait for a longer version of IPMI Initialization
Wait for a longer than usual DXE — ACPI Initialization
Wait for the red LED to come on
Wait at least 5 more minutes (try 10)

At this point, you should see that it responds to F11 or DEL, but stays hung.
Press CTRL-ALT-DEL to reboot, and everything should be populated and working.

The Unit IDentity LED may be stuck red.
You cannot clear the UID red state any way other than pulling the power cord.
Let it drain for 30 seconds, and plug back in.

After this, everything works, AND the UID LED setting in the IPMI web interface will switch from blink blue to off.


OVM CPU Pinning

If you clone or recover an Oracle VM guest, and the source used CPU pinning (Hard Partitioning), the target may not work.  The error is entirely non-intuitive, and I could not find it on the interwebs, so here is a sanitized version.

OVMAPI_5001E Job: 1416254413024/QueuedVmStartDbImpl_1416254413023/OVMJOB_1500J Start/resume vm: PRODVM, on server: PRODSERVER, failed. 
Job Failure Event: 1416254413902/Server Async Command Failed/OVMEVT_00C014D_001 Async command failed on server: PRODSERVER. 
Object: PRODVM, PID: 15431, Server error: 
Command: ['xm', 'create', '/OVS/Repositories/000dead000beef00cafe0421cab55bad/VirtualMachines/000dead000beef00cafef207cabdbbad/vm.cfg'] failed (1): 
stderr: Error: (22, 'Invalid argument') 
stdout: Using config file "/OVS/Repositories/000dead000beef00cafe0421cab55bad/VirtualMachines/000dead000beef00cafef207cabdbbad/vm.cfg". , 
on server: PRODSERVER, associated with object: 000dead000beef00cafef207cabdbbad [Thu Apr 15 00:12:19 EDT 2021]

 

You can remove the "cpus = '#-#'" line from vm.cfg to reset this.
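
A minimal sketch of that edit, assuming the pinning is a single line beginning with "cpus". The function name is hypothetical. It prints the cleaned config instead of editing in place, so the original can be kept as a backup.

```shell
#!/bin/sh
# Print a vm.cfg without its CPU pinning line. Only a line that starts
# with "cpus" is dropped, so "vcpus = ..." is left alone.
strip_cpu_pinning() {
    sed '/^[[:space:]]*cpus[[:space:]]*=/d' "$1"
}
```

One way to apply it is `cp vm.cfg vm.cfg.bak && strip_cpu_pinning vm.cfg.bak > vm.cfg`, run from the VM's directory in the repository.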

Useful commands for inspecting OVM hard partitioning include:

xm info

xm list

xenpm get-cpu-topology

xm vcpu-list

# cd /u01/app/oracle/ovm-manager-3/ovm_utils
# ./ovm_vmcontrol -u admin -p YourPassword -h ovm-manager -v my-first-vm -c vcpuset -s 0-7
Oracle VM VM Control utility 0.6.3.
Connected.
Command : vcpuset
Pinning virtual CPUs
Pinning of virtual CPUs to physical threads  '0-7' 'my-first-vm' completed.

After that, vcpu-list will show VM names in column 1 for dedicated CPUs.

NDMP TOC failure – datamover type incorrect

ERROR:
ANR4950E The server is unable to retrieve NDMP file history information while building table of contents for node NASNODE01, file space /SVM_NASNODE01_VIRTUALFS. NDMP node ID is 90156245149. Table of contents creation fails.

CAUSE:
One possible cause of this can be if the datamover was defined with the wrong scope (TYPE).  
TYPE can be NAS, NASVSERVER, or NASCLUSTER.  NAS is for node context.  NASVSERVER is for SVM context.  NASCLUSTER is for the whole cluster context.

NOTE: There are other possible causes, such as corrupt inodes; however, this one bit me and was not clearly defined anywhere else.

CORRECTION:
You cannot UPDATE DATAMOVER TYPE=blah, but you can simply DELETE DATAMOVER and DEFINE DATAMOVER to fix.

DELETE DATAMOVER NASNODE01
DEFINE datamover NASNODE01 type=nascluster dataformat=netappdump hla=192.168.128.1 user=NDMPADMIN password=PASSWORDHERE
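
A small sketch that guards against a mistyped TYPE before the datamover is redefined. The function name is hypothetical, the parameter spellings mirror the DEFINE command above, and the password is left as a placeholder. It only prints the commands.

```shell
#!/bin/sh
# Print DELETE/DEFINE DATAMOVER commands for a corrected TYPE.
# $1 = datamover name, $2 = type, $3 = address, $4 = NDMP user.
# Hypothetical helper: it only prints text, and leaves the password as a
# placeholder to be filled in by hand.
redefine_datamover_commands() {
    case $2 in
        nas|nasvserver|nascluster) ;;
        *) echo "unknown datamover type: $2" >&2; return 1 ;;
    esac
    printf 'DELETE DATAMOVER %s\n' "$1"
    printf 'DEFINE DATAMOVER %s type=%s dataformat=netappdump hla=%s user=%s password=PASSWORDHERE\n' \
        "$1" "$2" "$3" "$4"
}
```

For example, `redefine_datamover_commands NASNODE01 nascluster 192.168.128.1 NDMPADMIN` prints the same two commands as above.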

TRACING INFO:

trace disable
trace enable spi spid toc
trace begin /tmp/server.trc

Once tracing has been enabled, initiate another backup of the affected volume (/SVM_SBNAS01_OU_ABOD in this case).  When the backup completes or fails, issue the following commands to disable tracing:

trace flush
trace end
trace disable
QUERY ACTLOG

grep NDMP dsmffdc.log

NASNODE01::> node run -node SBNAS01-01
Type 'exit' or 'Ctrl-D' to return to the CLI
NASNODE01> rdfile /etc/log/backup