TSM SP Remove ReplServer

PROBLEM:
Every 5.5 minutes, this shows up in the actlog

08/13/20 08:05:25 ANR1663E Open Server: Server OLDSERVER not defined
08/13/20 08:05:25 ANR1651E Server information for OLDSERVER is not available.
08/13/20 08:05:25 ANR4377E Session failure, target server OLDSERVER is not defined on the source server.
08/13/20 08:05:25 ANR1663E Open Server: Server OLDSERVER not defined
08/13/20 08:05:25 ANR1651E Server information for OLDSERVER is not available.
08/13/20 08:05:25 ANR4377E Session failure, target server OLDSERVER is not defined on the source server.
08/13/20 08:05:26 ANR1663E Open Server: Server OLDSERVER not defined
08/13/20 08:05:26 ANR1651E Server information for OLDSERVER is not available.
08/13/20 08:05:26 ANR4377E Session failure, target server OLDSERVER is not defined on the source server.
08/13/20 08:05:28 ANR1663E Open Server: Server OLDSERVER not defined
08/13/20 08:05:28 ANR1651E Server information for OLDSERVER is not available.
08/13/20 08:05:28 ANR4377E Session failure, target server OLDSERVER is not defined on the source server.

SOLUTION:
QUERY REPLSERVER shows the GUID of the stale server.
REMOVE REPLSERVER (with that GUID) makes the errors stop.
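
From an administrative command line, it is just these two commands (use whatever GUID QUERY REPLSERVER reports for OLDSERVER):

QUERY REPLSERVER
REMOVE REPLSERVER <GUID from the query output>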


SVC, StorWize, FlashSystem, Spectrum Virtualize – replace a drive

I always forget, so here’s a reminder…

When you replace a drive on one of these, mdisk arrays do not auto-rebuild.

If the GUI fix procedures disappear, never show up, or otherwise fail to include the replacement drive as a new member of the mdisk array, you can do it manually.

First, look for the candidate or spare drive you want to use:
lsdrive

Then, make sure that drive ID is a candidate:
chdrive -use candidate 72

Then, find the missing member:
lsarraymember mdisk1

Then, set the new drive to use that missing member ID:
charraymember -member 31 -newdrive 72 mdisk1

You can watch the progress of the rebuild:
lsarraymemberprogress mdisk1
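
Putting it all together, with the IDs used in the example above (substitute your own drive ID, member ID, and array name):

lsdrive                                          # find the replacement drive and check its use column
chdrive -use candidate 72                        # make drive 72 eligible to join an array
lsarraymember mdisk1                             # note which member ID has no drive behind it
charraymember -member 31 -newdrive 72 mdisk1     # put drive 72 into the empty member slot 31
lsarraymemberprogress mdisk1                     # watch the rebuild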


Gathering HACMP Info

Often, when working with a cluster, you might want to rebuild it from scratch rather than take the time to figure out what is broken. Here are some commands to gather basic AIX and cluster info and email it to yourself. Obviously, change the email address at the end.

(
echo '#########################' 
echo '#########################' OS Level
echo '#########################' 
oslevel -s
echo '#########################' 
echo '#########################' HA Level
echo '#########################' 
halevel -s
echo '#########################' 
echo '#########################' System Info
echo '#########################' 
lsattr -El sys0
echo '#########################' 
echo '#########################' Cluster Exports
echo '#########################' 
cat /usr/es/sbin/cluster/etc/exports
echo '#########################' 
echo '#########################' System Exports
echo '#########################' 
cat /etc/exports
echo '#########################' 
echo '#########################' Physical Volumes
echo '#########################' 
lspv -u
echo '#########################' 
echo '#########################' Cluster ID
echo '#########################' 
/usr/es/sbin/cluster/utilities/cllsclstr
echo '#########################' 
echo '#########################' Cluster Heartbeat
echo '#########################' 
lscluster -d
echo '#########################' 
echo '#########################' Cluster Status
echo '#########################' 
/usr/es/sbin/cluster/utilities/cllscompstat
echo '#########################' 
echo '#########################' Cluster Dump
echo '#########################' 
/usr/es/sbin/cluster/utilities/cldump
echo '#########################' 
echo '#########################' Cluster Services
echo '#########################' 
/usr/es/sbin/cluster/utilities/cllsserv
echo '#########################' 
echo '#########################' Cluster App Monitors
echo '#########################' 
/usr/es/sbin/cluster/utilities/cllsappmon
echo '#########################' 
echo '#########################' Cluster Resource Group Variables
echo '#########################' 
for i in `/usr/es/sbin/cluster/utilities/cllsgrp` ; do echo '###################' $i ; /usr/es/sbin/cluster/utilities/cllsres -g $i ; done
echo '#########################' 
echo '#########################' Cluster Resource Group Details
echo '#########################' 
for i in `/usr/es/sbin/cluster/utilities/cllsgrp` ; do echo '###################' $i ; /usr/es/sbin/cluster/utilities/clshowres -g $i ; done
echo '#########################' 
echo '#########################' Cluster Interfaces
echo '#########################' 
/usr/es/sbin/cluster/utilities/cllsif
echo '#########################' 
echo '#########################' Network Interfaces
echo '#########################' 
ifconfig -a
echo '#########################' 
echo '#########################' Rhosts
echo '#########################' 
cat /.rhosts
echo '#########################' 
echo '#########################' root rhosts
echo '#########################' 
cat /root/.rhosts
echo '#########################' 
echo '#########################' cluster rhosts
echo '#########################' 
cat /etc/cluster/rhosts
echo '#########################' 
echo '#########################' New cluster rhosts
echo '#########################' 
cat /usr/es/sbin/cluster/etc/rhosts
echo '#########################' 
echo '#########################' Net monitor IPs
echo '#########################' 
cat /usr/es/sbin/cluster/netmon.cf
echo '#########################' 
echo '#########################' File Collections
echo '#########################' 
odmget HACMPfilecollection
echo '#########################' 
echo '#########################' Collection Files
echo '#########################' 
odmget HACMPfcfile
echo '#########################' 
echo '#########################' Free Major Numbers
echo '#########################' 
lvlstmajor
echo '#########################' 
echo '#########################' Example commands for VG Imports
echo '#########################' 
for VG in `lsvg |egrep -v 'rootvg|caavg'`; do 
  echo `getlvodm -d $VG` `lspv | grep $VG | tr -s [:space:] | sort -k 2 | head -1` \
  | awk '{print "importvg -V" , $1 , "-y " , $4 , " " , $3 ; } ; ' ; done | sort
echo '#########################' 
echo '#########################' Volume Groups
echo '#########################' 
lsvg
echo '#########################' 
echo '#########################' Volume Group Details
echo '#########################' 
lsvg | xargs -n1 lsvg
echo '#########################' 
echo '#########################' Logical Volumes
echo '#########################' 
lsvg | xargs -n1 lsvg -l
echo '#########################' 
echo '#########################' Logical Volume Details
echo '#########################' 
lsvg | xargs -n1 lsvg -l | grep / | cut -f 1 -d \  | xargs -n1 lslv
echo '#########################' 
echo '#########################' Filesystems
echo '#########################' 
df -Pg
echo '#########################' 
echo '#########################' Mounts
echo '#########################' 
mount
echo '#########################' 
echo '#########################' Tunables from last boot
echo '#########################' 
cat /etc/tunables/lastboot
echo '#########################' 
echo '#########################' Device settings
echo '#########################' 
for i in `lsdev | egrep '^en|hdisk|fcs|fscsi' | cut -f1 -d\  ` ; do echo '#####################' $i ; lsattr -El $i ; done | egrep -v 'False$'
echo '#########################' 
echo '#########################' Crontab entries
echo '#########################' 
crontab -l
echo '#########################' 
echo '#########################' snmp config
echo '#########################' 
cat /etc/snmpdv3.conf
echo '#########################' END END END
) 2>&1 | mail -vs `hostname` jdavis@omnitech.net


IBM SVC / Storwize DRP – Still a no-go

IBM released “Data Reduction Pools” for their SAN Volume Controller family, which includes the Storwize and FlashSystem products, back in early 2018. The internal flag for it is “deduplication”, but it can be used for thin, compressed, or deduplicated volumes. You can put thick volumes in them too, but that is more of a “block off this space” thing.

DRP is a “log structured device”, meaning it is random read, but always sequential write. When you enable DRP, it inserts into the high cache (front-end cache). Because of this, it adds 3-4ms of latency right off the bat. Some of the higher end controllers can reduce that to 2-3ms, but it never goes away. Also, this affects ALL storage pools (mdisk groups), not just the ones built with DRP enabled. IBM product engineering says this is not a defect. It’s just the way it is, because the data has to be twiddled on the CPU.

In late 2018, I found one of their code defects, which has been resolved; however, that defect highlighted a troubling design choice. DRP is a whole-cluster thing, not a whole I/O group thing, nor a whole pool thing. That means, if you trigger a defect that kills DRP for one pool, it will kill DRP for every pool. If you’re in a hyperswap or stretched cluster, and something happens (future defect maybe) to take one site’s DRP offline while the nodes are still live, then that could take out BOTH sites.

This year, I found out that their sizing partner, “Intellimagic”, does not understand DRP. When a “Disk Magic” report comes out, it can recommend DRP. It cannot tell how much cache affects things, so it just recommends maximum cache. It cannot tell how much DRP affects things, so it cannot recommend against it for latency sensitive workloads. The IBM technical sales team is trying to work around this by simply recommending the next model up when DRP is to be used.

However, that is not enough. DRP garbage collection is VERY heavy handed. There is no way to throttle it. It’s a log structured array, so when you re-write a block on a LUN, the old block gets invalidated, then the old chunk gets read, then new data is packed in, then it’s written to the end of the array. This is sequential for each major section of the pool, and it has additional latency added by the code. We found that a normal workload of about 25% writes would be suppressed by 80% in DRP. Latency during normal operations would fluctuate from 6 to 18ms, and that’s as seen by the client. Any major I/O operations on the client (big rewrites, or big block freeing), or any vdisk (LUN) deletions, would cause latency to fluctuate from 16-85ms (completely unusable).

Once DRP is enabled, it affects the whole system. You cannot turn it off for a pool; you have to delete that pool. You cannot migrate a vdisk out of or into a DRP; you have to mirror to a new pool, then delete or split the old copy. The funny thing is, that’s just what a migrate does at all but the lowest levels. It’s been two years; that really should have been folded into the standard migration commands by now.
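
For reference, the mirror-and-split dance looks something like this on the CLI (pool and vdisk names are made up; check the copy IDs with lsvdiskcopy before removing anything):

addvdiskcopy -mdiskgrp NEW_POOL vdisk_prod01     # add a second copy in the non-DRP pool
lsvdisksyncprogress vdisk_prod01                 # wait for the new copy to reach 100 percent
lsvdiskcopy vdisk_prod01                         # confirm which copy ID sits in the DRP pool
rmvdiskcopy -copy 0 vdisk_prod01                 # drop the DRP copy (assuming it is copy 0)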

Once you have everything split, and all of the hosts are running on the new, non-DRP enabled pool, that is not enough. All of the performance problems will still be there. It is only once you delete the old pool that performance returns to normal. Since vdisk deletions take several hours each, and running more than 4 at a time causes severe, deadly latency, it’s best to just verify nothing is in use and force-delete the whole DRP-enabled pool. Risky, but otherwise, you could spend weeks deleting the old LUNs by hand due to how slow they clear.
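
A sketch of that last step (pool name is made up; be absolutely sure nothing in the pool is still mapped to a host before the final command):

lsvdisk -filtervalue mdisk_grp_name=DRP_POOL     # see what still lives in the pool
lsvdiskhostmap some_leftover_vdisk               # check any stragglers for host mappings
rmmdiskgrp -force DRP_POOL                       # deletes the pool and every vdisk in it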

The sales materials, technical whitebooks, and the gritty redbooks all recommend DRP as the best thing ever. Staff and books shy away from highlighting any of these limitations. Other products handle deduplication without this sort of negative impact, but here, even with no deduplication in use, just the OPTION of using dedupe makes the system unusable.

Maybe there will be a “fix” or a “rewrite”, but I’ve been burned hard, with severe customer satisfaction problems, over two years. This negatively impacts trust, and this is not the only IBM storage product causing me trust issues.

I expect that this is the fallout from Ginni Rometty’s command, “Don’t try to protect the past”. IBM’s vision is stated as “Cloud, Data, and Watson”. It will be interesting to see exactly how that translates into a future structure. IBM has been divesting a lot of pieces lately. Rometty ran services, and that’s what she knows as Chair and CEO. Anything she sees as a commodity or low margin will be spun off. (IBM seems to need at least a 6:1 revenue-to-cost ratio in order to operate. There are a lot of layers.) I’m starting to wonder if Storage and POWER will be sold off to Lenovo. I see hints, but no roadmaps, nothing concrete.


AIXPCM vs SDDPCM

AIX geeks: when converting from SDDPCM to AIXPCM, after you uninstall the drivers you also need to uninstall the host attachment fileset. On 2% of our migrations, mksysb/alt_disk clones would fail to find the boot disk (LED 554). These are the filesets to remove:

devices.fcp.disk.ibm.mpio.rte devices.sddpcm.*

Note that from SVC 7.6.1, AIX 6.1.8, AIX 7.1.3, and AIX 7.2, you MAY switch to AIXPCM. On POWER9 and later, you MUST switch to AIXPCM.

The rm script is from storage development, but the manage_disk_drivers command is from AIX dev. Either is okay, but the AIX one does not require opening a PMR.
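
Roughly, the conversion looks like this (a sketch, not a checklist; the sddpcm fileset name below assumes AIX 7.1, so check with lslpp first):

lslpp -l | grep -i sddpcm                        # confirm which sddpcm filesets are installed
installp -u devices.sddpcm.71.rte devices.fcp.disk.ibm.mpio.rte    # remove the driver and the host attachment
manage_disk_drivers -l                           # show current driver selections
manage_disk_drivers -d IBMSVC -o AIX_AAPCM       # point SVC/Storwize LUNs at the default AIX PCM
shutdown -Fr                                     # reboot for the change to take effect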

Best reference:

https://www.ibm.com/developerworks/community/blogs/cgaix/entry/One_Path_Control_Module_to_Rule_Them_All


IBM PCIe3 Ethernet Decoder

Some modern Ethernet adapter decoder wheel information.

PCIe2 4-Port (10GbE SFP+ & 1GbE RJ45) Adapter
PCIe2 4-Port Adapter (10GbE SFP+) (e4148a1614109304)
PCIe2 4-Port Adapter (1GbE RJ45) (e4148a1614109404)
FRU Number.................. 00E2715
Customer Card ID Number..... 2CC3
Part Number................. 00E2719
Feature Code/Marketing ID... EN0S or EN0T
Make/Model: Broadcom NetXtreme II 10Gb
Code Name: Shiner

PCIe3 10GbE RoCE Converged Host Bus Adapter (b31507101410eb04)
RoCE Converged Network Adapter
2-Port 10GbE RoCE SR PCIe 3.0
Part Number................. 00RX871
FRU Number.................. 00RX875
Feature Code/Marketing ID... EC2M / EC2N / EC37 / EC38 / EL3X / EL40
Customer Card ID Number..... 57BE
Make/Model: Mellanox ConnectX-3 Pro / Gen 3.0 x8 10GbE
Code Name: Core

PCIe3 4-Port 10GbE SR Adapter (df1020e21410e304)
PCIe3 4-port 10GbE SR Adapter, NIC PF
Part Number................. 00ND462 or 00ND464 or 00ND468 or 00RX863
Feature Code/Marketing ID... EN15 or EN16
Customer Card ID Number..... 2CE3
Make/Model: Emulex OneConnect NIC (Lancer) (rev 30)

I believe the 4-port 1gbit PCI cards are Intel.

Emulex, Brocade, and Broadcom are all owned by the same company as of 2016. Currently known as “Broadcom, Inc.”, the parent was formerly Avago, formerly Agilent, and formerly the semiconductor division of HP.


AIX 7.2.3.1 breaks GSKit 8.0.50.89

AIX 7.2.3 breaks GSKit8, up through GP29 (8.0.50.89).

This affects TSM/Spectrum Protect, Content Manager, Tivoli Directory Server, WebSphere, DB2, Informix, IBM HTTP Server, etc.

Everything still works until reboot, which implies the change is in the kernel.

We found it with TSM on AIX 7200-03-01-1838 and Spectrum Protect server 8.1.6.0.
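
A quick way to check whether a box is in the affected combination before it gets rebooted (GSKit fileset names vary slightly, hence the grep):

oslevel -s                   # we hit it on 7200-03-01-1838
lslpp -L | grep -i gsk       # GSKit8 up through 8.0.50.89 (GP29) is affected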

Application crash and DBX follow below.

ANR7800I DSMSERV generated at 12:17:13 on Sep 11 2018.
IBM Spectrum Protect for AIX
Version 8, Release 1, Level 6.000
Licensed Materials - Property of IBM
(C) Copyright IBM Corporation 1990, 2018.
All rights reserved.
U.S. Government Users Restricted Rights - Use, duplication or disclosure
restricted by GSA ADP Schedule Contract with IBM Corporation.

ANR7801I Subsystem process ID is 10944920.
ANR0900I Processing options file /home/tsminst1/dsmserv.opt.
ANR7811I Using instance directory /home/tsminst1.
Illegal instruction(coredump)

# dbx /opt/tivoli/tsm/server/bin/dsmserv core.10944896.28165312
Type 'help' for help.
[using memory image in core.10944896.28165312]
reading symbolic information ...warning: no source compiled with -g

Illegal instruction (illegal opcode) in . at 0x0 ($t1)
warning: Unable to access address 0x0 from core

(dbx) where
.() at 0x0
gsk_src_create__FPPvPv(??, ??) at 0x9000000015b6d88
__ct__8GSKMutexFv(??) at 0x9000000018d664c
__ct__20GSKPasswordEncryptorFv(??) at 0x9000000018cb248
__ct__7gsk_envFv(??) at 0x900000000aaa6b0
GskEnvironmentOpen__FPPvb(??, ??) at 0x900000000ab14c4
gsk_environment_open(??) at 0x900000000ab277c
IPRA.$CheckGSKVersion() at 0x100eecf68
tlsInit() at 0x100eecd70
main(??, ??) at 0x10000112c

(dbx) th
thread state-k wchan state-u k-tid mode held scope function

$t1 run running 41877977 k no sys
$t2 run blocked 21234465 u no sys _cond_wait_global
$t3 run running 24380103 u no sys waitpid


dsmserv fails to start if LDAP is inaccessible

IBM, and the white books, say this is working as designed.
• If LDAP dies, dsmserv stays up without it.
• If LDAP dies, and dsmserv restarts, it refuses to come up.
• ANR3103E https://www.ibm.com/support/knowledgecenter/SSEQVQ_8.1.5/srv.msgs/b_msgs_server.pdf
• Workaround is to remove LDAPURL from dsmserv.opt (see the snippet below), or wait for LDAP to become accessible.
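
A minimal sketch of the workaround in dsmserv.opt (comment lines start with an asterisk; the URL and DNs are made-up examples):

* LDAPURL ldaps://ldap.example.com:636/OU=TSM,DC=example,DC=com
* (commented out until the LDAP server is reachable again; restore the line and restart dsmserv afterward)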

On a multi-homed server, any links serving LDAP should be fault tolerant.

Any server using LDAP should have fault tolerant LDAP servers.

Here’s where to vote on getting the start-up limitation changed.
http://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=121985


ANR3114E LDAP error 81. Failure to connect to the LDAP server

This used to be on IBM’s website, but it disappeared. It is referenced all over the net and needs to still exist. I only found it in the Wayback Machine, so I’m adding another copy to the internet.

2013 SOURCE: www-01.ibm.com/support/docview.wss?uid=swg21656339

Problem(Abstract)

When the SET LDAPUSER command is used, the connection can fail with:

ANR3114E LDAP error 81 (Can’t contact LDAP server)

Cause

The user common name (CN) in the SET LDAPUSER command contains a space or the ldapurl option is incorrectly specified.

Diagnosing the problem

Collect a trace of the Tivoli Storage Manager Server using the following trace classes:
session verbdetail ldap ldapcache unicode

More information about tracing the server can be found here: Enabling a trace for the server or storage agent
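
If you just want to flip that on from an administrative session, it goes roughly like this (output file name is only an example):

trace enable session verbdetail ldap ldapcache unicode
trace begin /tmp/ldap_trace.out
(reproduce the SET LDAPUSER / SET LDAPPASSWORD failure)
trace end
trace disable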

The following errors are reported within the trace:

11:02:04.127 [44][output.c][7531][PutConsoleMsg]:ANR2017I Administrator ADMIN issued command: SET LDAPPASSWORD ?***? ~
11:02:04.171 [44][ldapintr.c][548][ldapInit]:Entry: ldapUserNew =      CN=tsm user,OU=TSM,DC=ds,DC=example,DC=com
11:02:04.173 [44][ldapintr.c][5851][LdapHandleErrorEx]:Entry: LdapOpenSession(ldapintr.c:2340) ldapFunc = ldap_start_tls_s_np, ldapRc = 81, ld = 0000000001B0CAB0
11:02:04.174 [44][ldapintr.c][5867][LdapHandleErrorEx]:ldap_start_tls_s_np returned LDAP code 81(Can't contact LDAP server), LDAP Server message ((null)), and possible GSKIT SSL/TLS error 0(Success)
11:02:04.174 [44][output.c][7531][PutConsoleMsg]:ANR3114E LDAP error 81 (Can't contact LDAP server) occurred during ldap_start_tls_s_np.~
11:02:04.174 [44][ldapintr.c][6079][LdapHandleErrorEx]:Exit: rc = 2339, LdapOpenSession(ldapintr.c:2340), ldapFunc = ldap_start_tls_s_np, ldapRc = 81, ld = 0000000001B0CAB0
11:02:04.174 [44][ldapintr.c][1580][ldapCloseSession]:Entry: sessP = 0000000009B99CD0
11:02:04.175 [44][ldapintr.c][3159][LdapFreeSess]:Entry: sessP = 0000000009B99CD0
11:02:04.175 [44][ldapintr.c][2449][LdapOpenSession]:Exit: rc = 2339, ldapHandleP = 000000000AFDE740, bindDn =                              (CN=tsm user,OU=TSM,DC=ds,DC=example,DC=com)
11:02:04.175 [44][output.c][7531][PutConsoleMsg]:ANR3103E Failure occurred while initializing LDAP directory services.~
11:02:04.175 [44][ldapintr.c][856][ldapInit]:Exit: rc = 2339
11:02:04.175 [44][output.c][7531][PutConsoleMsg]:ANR2732E Unable to communicate with the external LDAP directory server.~

Resolving the problem

  • In the trace provided, the common name (CN) contains a space. (CN=tsm user,OU=TSM,DC=ds,DC=example,DC=com)

    Remove the space in the common name when using the SET LDAPUSER command. For example:

    SET LDAPUSER "CN=tsmuser,OU=TSM,DC=ds,DC=example,DC=com"

  • Use an LDAP connection utility such as ldp.exe to ensure the ldapurl option is correct and the LDAP server is accepting connections (an ldapsearch equivalent is sketched after this list).

    <ldapurl> port 636, check the box for SSL

    Verify there are no errors in the output
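
If you would rather test from the TSM server itself than from a Windows box with ldp.exe, a plain ldapsearch does the same sanity check, assuming the OpenLDAP client tools are installed (hostname and DNs are placeholders):

ldapsearch -H ldaps://ldap.example.com:636 -x \
  -D "CN=tsmuser,OU=TSM,DC=ds,DC=example,DC=com" -W \
  -b "DC=ds,DC=example,DC=com" -s base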


Protect initial install

This is happiness…

tsminst1@tsm:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04.3 LTS
Release: 16.04
Codename: xenial

/bin/bash# for i in /dev/sd? ; do smartctl -a $i ; done | grep 'Device Model'
Device Model: Samsung SSD 850 EVO 250GB
Device Model: WDC WD30EFRX-68EUZN0
Device Model: Samsung SSD 850 EVO 250GB
Device Model: WDC WD30EFRX-68EUZN0
Device Model: WDC WD30EFRX-68EUZN0

tsminst1@tsm:~$ dsmserv format dbdir=/tsm/db01,/tsm/db02,/tsm/db03,/tsm/db04,/tsm/db05,/tsm/db06,/tsm/db07,/tsm/db08 \
> activelogsize=8192 activelogdirectory=/tsm/log archlogdirectory=/tsm/logarch

ANR7800I DSMSERV generated at 11:32:48 on Sep 19 2017.

IBM Spectrum Protect for Linux/x86_64
Version 8, Release 1, Level 3.000

Licensed Materials - Property of IBM

(C) Copyright IBM Corporation 1990, 2017.
All rights reserved.
U.S. Government Users Restricted Rights - Use, duplication or disclosure
restricted by GSA ADP Schedule Contract with IBM Corporation.

ANR7801I Subsystem process ID is 29286.
ANR0900I Processing options file /home/tsminst1/dsmserv.opt.
ANR0010W Unable to open message catalog for language en_US.UTF-8. The default language message catalog will be used.
ANR7814I Using instance directory /home/tsminst1.
ANR3339I Default Label in key data base is TSM Server SelfSigned SHA Key.
ANR4726I The ICC support module has been loaded.
ANR0152I Database manager successfully started.
ANR2976I Offline DB backup for database TSMDB1 started.
ANR2974I Offline DB backup for database TSMDB1 completed successfully.
ANR0992I Server’s database formatting complete.
ANR0369I Stopping the database manager because of a server shutdown.