AIX 7.2.3.1 breaks GSKit 8.0.50.89

AIX 7.2.3 breaks GSKit8, up through GP29 (8.0.50.89).

This affects TSM/Spectrum Protect, Content Manager, Tivoli Directory Server, WebSphere, DB2, Informix, IBM HTTP Server, etc.

Everything still works until the reboot, which suggests the change is in the kernel.

We found it on a TSM server running AIX 7200-03-01-1838 and Spectrum Protect server 8.1.6.0.
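For screening other hosts before they get rebooted, here is a minimal sketch in plain sh. It assumes (my reading of the above, not an IBM statement) that any 7200-03 service pack combined with any GSKit 8 build at or below 8.0.50.89 is the bad pairing. The function names are made up for illustration, and version fields are assumed to be purely numeric.

```shell
#!/bin/sh
# Sketch only: flag the AIX level and GSKit builds named in this post.

# version_le A B -> exit 0 if dotted version A <= B, compared field by field.
version_le() {
    a=$1 b=$2
    while [ -n "$a" ] || [ -n "$b" ]; do
        fa=${a%%.*}; fb=${b%%.*}
        # Drop the field just read; clear the string when it was the last one.
        [ "$a" = "$fa" ] && a= || a=${a#*.}
        [ "$b" = "$fb" ] && b= || b=${b#*.}
        fa=${fa:-0}; fb=${fb:-0}
        [ "$fa" -lt "$fb" ] && return 0
        [ "$fa" -gt "$fb" ] && return 1
    done
    return 0
}

# is_affected_oslevel LEVEL -> exit 0 if LEVEL (in `oslevel -s` format) is AIX 7.2 TL3.
is_affected_oslevel() {
    case "$1" in
        7200-03-*) return 0 ;;
        *)         return 1 ;;
    esac
}

# is_affected_gskit VERSION -> exit 0 if VERSION is a GSKit 8 build <= 8.0.50.89.
is_affected_gskit() {
    case "$1" in
        8.*) version_le "$1" 8.0.50.89 ;;
        *)   return 1 ;;
    esac
}
```

On a live host the inputs would come from `oslevel -s` and from the GSKit8 fileset level that `lslpp -L` reports, for example `is_affected_oslevel "$(oslevel -s)" && echo "hold the reboot"`.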

The application crash and dbx output follow.

ANR7800I DSMSERV generated at 12:17:13 on Sep 11 2018.
IBM Spectrum Protect for AIX
Version 8, Release 1, Level 6.000
Licensed Materials - Property of IBM
(C) Copyright IBM Corporation 1990, 2018.
All rights reserved.
U.S. Government Users Restricted Rights - Use, duplication or disclosure
restricted by GSA ADP Schedule Contract with IBM Corporation.

ANR7801I Subsystem process ID is 10944920.
ANR0900I Processing options file /home/tsminst1/dsmserv.opt.
ANR7811I Using instance directory /home/tsminst1.
Illegal instruction(coredump)

# dbx /opt/tivoli/tsm/server/bin/dsmserv core.10944896.28165312
Type 'help' for help.
[using memory image in core.10944896.28165312]
reading symbolic information ...warning: no source compiled with -g

Illegal instruction (illegal opcode) in . at 0x0 ($t1)
warning: Unable to access address 0x0 from core

(dbx) where
.() at 0x0
gsk_src_create__FPPvPv(??, ??) at 0x9000000015b6d88
__ct__8GSKMutexFv(??) at 0x9000000018d664c
__ct__20GSKPasswordEncryptorFv(??) at 0x9000000018cb248
__ct__7gsk_envFv(??) at 0x900000000aaa6b0
GskEnvironmentOpen__FPPvb(??, ??) at 0x900000000ab14c4
gsk_environment_open(??) at 0x900000000ab277c
IPRA.$CheckGSKVersion() at 0x100eecf68
tlsInit() at 0x100eecd70
main(??, ??) at 0x10000112c

(dbx) th
thread state-k wchan state-u k-tid mode held scope function

$t1 run running 41877977 k no sys
$t2 run blocked 21234465 u no sys _cond_wait_global
$t3 run running 24380103 u no sys waitpid


Spectrum Protect / TSM systemd autostart


cat <<'EOF' >/etc/systemd/system/db2fmcd.service
[Unit]
Description=DB2V111

[Service]
ExecStart=/opt/tivoli/tsm/db2/bin/db2fmcd
Restart=always
KillMode=process
KillSignal=SIGHUP

[Install]
WantedBy=default.target
EOF
systemctl enable db2fmcd.service
systemctl start db2fmcd.service

cp -p /opt/tivoli/tsm/server/bin/dsmserv.rc /etc/init.d/tsminst1
cat <<'EOF' >/etc/systemd/system/tsminst1.service
[Unit]
Description=tsminst1
Requires=db2fmcd.service

[Service]
Type=forking
ExecStart=/etc/init.d/tsminst1 start
ExecReload=/etc/init.d/tsminst1 reload
ExecStop=/etc/init.d/tsminst1 stop
StandardOutput=journal

[Install]
WantedBy=multi-user.target
EOF
systemctl enable tsminst1.service
systemctl start tsminst1.service

ln -s /opt/tivoli/tsm/client/ba/bin/rc.dsmcad /etc/init.d/dsmcad
cat <<'EOF' >/etc/systemd/system/dsmcad.service
[Unit]
Description=dsmcad

[Service]
Type=forking
ExecStart=/etc/init.d/dsmcad start
ExecReload=/etc/init.d/dsmcad reload
ExecStop=/etc/init.d/dsmcad stop
StandardOutput=journal

[Install]
WantedBy=multi-user.target
EOF
systemctl enable dsmcad.service
systemctl start dsmcad.service
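
One thing worth noting about the tsminst1 unit: in systemd, Requires= pulls the other unit in but does not order the two, so DB2 and the server can be started in parallel. A rough sketch of a drop-in that adds After= follows. The helper name, the drop-in file name (10-after.conf), and the optional third argument are my own inventions for illustration.

```shell
#!/bin/sh
# write_after_dropin UNIT DEPENDENCY [UNIT_DIR]
# Writes UNIT_DIR/UNIT.d/10-after.conf so UNIT is ordered after DEPENDENCY.
# UNIT_DIR defaults to /etc/systemd/system; pass a scratch directory to try it safely.
write_after_dropin() {
    unit=$1 dep=$2 dir=${3:-/etc/systemd/system}
    if [ -z "$unit" ] || [ -z "$dep" ]; then
        echo "usage: write_after_dropin UNIT DEPENDENCY [UNIT_DIR]" >&2
        return 1
    fi
    mkdir -p "$dir/$unit.d" || return 1
    # Unquoted heredoc on purpose: $dep must expand into the file.
    cat <<EOF >"$dir/$unit.d/10-after.conf"
[Unit]
After=$dep
EOF
}
```

Typical use would be `write_after_dropin tsminst1.service db2fmcd.service && systemctl daemon-reload`. A drop-in keeps the ordering fix separate from the unit file itself, so rewriting the unit later does not lose it.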


Spectrum Protect – container vulnerability

We ran into an issue where a level-zero operator became root, and cleaned up some TSM dedupe-pool containers so he’d stop getting full filesystem alerts.

Things exposed:

How does someone that green get full, unmonitored root access?
* They gave false information about timestamps while defending themselves.
* Their senior tech lead was content to merely advise that they not move or delete files without contacting the app owner.
* Imagine if this had been a customer facing database server!

In ISP/TSM, once extents are marked damaged, a new backup of that extent will replace it.
* Good TDP4VT CTL files and other incrementals will send missing files.
* TDP for VMware full backups fail if the control file backup is damaged.
* Damaged extents do not mark files as damaged or missing.

Replicate Node will back-propagate damaged files.
* Damaged extents do not mark files as damaged or missing.

Also, in case you missed that:
* Damaged extents do not mark files as damaged or missing.

For real, IBM says:
* Damaged extents do not mark files as damaged or missing.
* “That might cause a whole bunch of duplicates to be ingested and processed.”

IBM’s option is to use REPAIR STGPOOL.
* Requires a prior PROTECT STGPOOL (similar to BACKUP STGPOOL and RESTORE STGPOOL).
* PROTECT STGPOOL can go to a container copy on tape, a container copy on FILE, or a container primary on the replica target server.
* PROTECT STGPOOL cannot go to a cloud pool.
* STGRULE TIERING only processes files, not PROTECT extents.
* PROTECT STGPOOL cannot go to a cloud pool that way either.
* There is NO WAY to use cloud storage pool to protect a container pool from damage.
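
As a rough sketch of the order of operations, the helpers below only print the admin commands so they can be reviewed before being pasted into dsmadmc. The function names and the pool name are hypothetical, and I am assuming the QUERY DAMAGED command and the SRCLOCATION parameter are spelled as I remember them; check them against the command reference for your server level.

```shell
#!/bin/sh
# Sketch only: prints Spectrum Protect admin commands, runs nothing.

# print_protect_cmd POOL -> the command that has to run BEFORE any damage occurs.
print_protect_cmd() {
    [ -n "$1" ] || { echo "usage: print_protect_cmd POOL" >&2; return 1; }
    printf '%s\n' "PROTECT STGPOOL $1"
}

# print_repair_cmds POOL [REPLSERVER|LOCAL] -> commands for after damage is found.
# REPLSERVER: repair from the replica target server.
# LOCAL: repair from a container-copy pool on the same server.
print_repair_cmds() {
    pool=$1 src=${2:-REPLSERVER}
    [ -n "$pool" ] || { echo "usage: print_repair_cmds POOL [REPLSERVER|LOCAL]" >&2; return 1; }
    case "$src" in
        REPLSERVER|LOCAL) ;;
        *) echo "SRCLOCATION must be REPLSERVER or LOCAL" >&2; return 1 ;;
    esac
    printf '%s\n' \
        "QUERY DAMAGED $pool" \
        "REPAIR STGPOOL $pool SRCLOCATION=$src"
}
```

For example, `print_repair_cmds DEDUPEPOOL LOCAL` prints the query and the repair for a pool protected to a local container copy. Printing instead of executing is deliberate, since a repair against the wrong source is not something to script blindly.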

EXCEPTION: Damaged extents can be replaced by REPLICATE NODE into a pool.
* You can DISABLE SES, and reverse the replication config.
* Replicate node that way will perform a FULL READ of the source pool.

There is a Request For Enhancement from November 2017 for TYPE=CLOUD POOLTYPE=COPY.
* That would be a major code effort, but would solve this major hole.
* That has not gotten a blink from product engineering.
* Not even an “under review”, nor “No Way”, nor “maybe sometime”.

Alternatives for PROTECT into CLOUD might be:
* Don’t use cloud. Double the amount of local disk space, and replicate to another datacenter.
* Use NFS (We would need to build a beefy VM, and configure KRB5 at both ends, so we could do NFSv4 encrypted).
* Use CIFS (the host is on AIX, which does not support CIFS v3. Linux conversion up front before we had bulk data was given a big NO.)
* Use azfusefs (Again, it’s not Linux)

Anyway, maybe in 2019 this can be resolved, but this is the sort of thing that really REALLY was poorly documented, and did not get the time and resources to be tested in advance. This is the sort of thing that angers everyone at every level.

REFERENCE (NFS mount options): hard,intr,nfsvers=4,tcp,rsize=1048576,wsize=1048576,bg,noatime


New data protection

Upgrading TSM server from Q9650 Core 2 Quad 3.0GHz, 8GB DDR2 on Win 2008R2.

New system is HP Z600, two-socket, 6-core 2.66GHz Xeon X5650 and 48GB of RAM. Wattage is the same per socket, but two sockets now. 3x the cores, 4x the performance.

SSDs for DB and Log are also moving to 850 EVO from Corsair M100. I’ll set up a container pool to replace the dedupe file class, and put that on 3x 3TB RAID5 instead of 2x RAID1.

OS will be Ubuntu 16.04.2 LTS. I’d like to just use Debian 9.1, but Debian and long-term-support seem to not be synonymous. I’d hate to run a patch update and have everything break, then fight with debian testing repo to try to get it all back to normal. Plus, I have no Ubuntu boxes, only Debian. It’ll give me a chance to see what operational differences I run into.

Old TSM is 6.4. New will be “Spectrum Protect” 8.1.3. Yes, the billions spent to rebrand to the same name as Charter Cable’s rebrand really seems like money well spent.

Anyway, since I lost the offsite replication provider for the dedupe file pool, and it was having trouble keeping up anyway, this will let me change to server-side encryption and object storage. We’ll see which provider wins out on price once everything is rededuped properly.

If the fan noise is not too bad, maybe this platform can be considered for a low-cost upgrade to the kids’ game machines. Though, these are heavy, with 2 big handles on the top.

Also, really, something new enough to have USB3 on the motherboard is probably better. I have some laptops picked out, but that’s re-buying every component, including ones that are presently decent. *sigh*