TSM 7.1.1 DB Backups

The dsm.sys now goes in:
/opt/tivoli/tsm/server/bin/dbbkapi

You also need to remove any PASSWORDACCESS option from that dsm.sys stanza.

The docs also omit the nodename $$TSMDBMGR$$, but you still need that.

DOCUMENTATION
http://www-01.ibm.com/support/knowledgecenter/SSGSG7_7.1.0/com.ibm.itsm.tshoot.doc/t_pdg_wrongdsmienvarble.html?lang=en

REFERENCES:
# cat ~tsminst1/sqllib/userprofile
export PATH=$PATH:/opt/tivoli/tsm/server/bin64/:/opt/tivoli/tsm/server/bin
export PATH=$PATH:/usr/tivoli/tsm/server/bin64:/usr/tivoli/tsm/server/bin
export PATH=$PATH:/usr/tivoli/tsm/client/ba/bin64:/usr/tivoli/tsm/client/ba/bin
export PATH=$PATH:/opt/tivoli/tsm/client/ba/bin64:/opt/tivoli/tsm/client/ba/bin
export PATH=$PATH:/usr/tivoli/tsm/client/api/bin64:/usr/tivoli/tsm/client/api/bin
export PATH=$PATH:/opt/tivoli/tsm/client/api/bin64:/opt/tivoli/tsm/client/api/bin
export DSMI_CONFIG=/tsm/tsminst1/tsmdbmgr.opt
export DSMI_DIR=/opt/tivoli/tsm/server/bin/dbbkapi
export DSMI_LOG=/tsm/tsminst1

# cat /tsm/tsminst1/tsmdbmgr.opt
SERVERNAME TSMDBMGR_TSMINST1

# cat /opt/tivoli/tsm/server/bin/dbbkapi/dsm.sys
servername TSMDBMGR_TSMINST1
COMMMethod tcpip
tcpserveraddr localhost
errorlogname /tsm/tsminst1/tsmdbmgr.log
nodename $$TSMDBMGR$$
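
Before testing a BACKUP DB, a quick sanity check of the API environment looks something like this (a sketch; the paths and instance name follow the example above and may differ on your system, and it assumes the instance .profile sources sqllib/userprofile as usual):

# su - tsminst1
$ echo $DSMI_CONFIG $DSMI_DIR $DSMI_LOG
$ ls -l $DSMI_DIR/dsm.sys
$ grep -i passwordaccess $DSMI_DIR/dsm.sys      (should return nothing)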

From Q ACTLOG
12/08/14 17:03:32 ANR4626I Database backup will use 4 streams for processing with the number originally requested 4. (SESSION: 26880, PROCESS: 55)
12/08/14 17:03:33 ANR2984E Database backup terminated due to environment or setup issue related to DSMI_CONFIG - DB2 sqlcode -2033 sqlerrmc 406 . (SESSION: 26880, PROCESS: 55)

From db2dump/db2diag.0.log

2014-12-08-17.39.41.193231-420 E14508748A369 LEVEL: Error
PID : 8061086 TID : 1 PROC : db2vend
INSTANCE: tsminst1 NODE : 000
HOSTNAME: tsmserver
EDUID : 1
FUNCTION: DB2 UDB, database utilities, sqluvint, probe:321
DATA #1 : TSM RC, PD_DB2_TYPE_TSM_RC, 4 bytes
TSM RC=0x000007F1=2033 -- see TSM API Reference for meaning.

2014-12-08-17.39.41.193930-420 I14509118A891 LEVEL: Error
PID : 7078032 TID : 52135 PROC : db2sysc 0
INSTANCE: tsminst1 NODE : 000 DB : TSMDB1
APPHDL : 0-2145 APPID: *LOCAL.tsminst1.141209004840
AUTHID : TSMINST1 HOSTNAME: tsmserver
EDUID : 52135 EDUNAME: db2med.47992.0 (TSMDB1) 0
FUNCTION: DB2 UDB, database utilities, sqluMapVend2MediaRCWithLog, probe:656
DATA #1 : String, 134 bytes
Vendor error: rc = 11 returned from function sqluvint.
Return_code structure from vendor library /tsm/tsminst1/sqllib/adsm/libtsm.a:

DATA #2 : Hexdump, 48 bytes
0x0A0003050DD8BB10 : 0000 07F1 3332 3120 3230 3333 0000 0000 ....321 2033....
0x0A0003050DD8BB20 : 0000 0000 0000 0000 0000 0000 0000 0000 ................
0x0A0003050DD8BB30 : 0000 0000 0000 0000 0000 0000 0000 0000 ................

2014-12-08-17.39.41.194374-420 I14510010A519 LEVEL: Error
PID : 7078032 TID : 52135 PROC : db2sysc 0
INSTANCE: tsminst1 NODE : 000 DB : TSMDB1
APPHDL : 0-2145 APPID: *LOCAL.tsminst1.141209004840
AUTHID : TSMINST1 HOSTNAME: tsmserver
EDUID : 52135 EDUNAME: db2med.47992.0 (TSMDB1) 0
FUNCTION: DB2 UDB, database utilities, sqluMapVend2MediaRCWithLog, probe:696
MESSAGE : Error in vendor support code at line: 321 rc: 2033
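
Once the dsm.sys, nodename, and DSMI_* settings above are corrected, re-running the database backup from an administrative client is the simplest end-to-end test (a sketch; the admin ID and device class name are placeholders for whatever your server uses):

dsmadmc -id=admin -password=xxxxx "backup db devclass=DBBACKDEV type=full"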


After DR of a TSM server, do you need to restore the primary storage pool from the copy pool?

It depends on whether it's small files or not. I normally have a small-file pool, which holds the DIRMC, VMCTLMC, and TOCDESTINATION data. The offsite copy for this is kept reclaimed down to one tape, and I try to restore that primary pool first.

Large-file, TDP, VM, and image full backups can be restored directly from tape with no problem.

For small files, block level incrementals, or instant restore, it depends on the number of tapes, number of drives and number of concurrent restores.

One restore with 6+ tape drives and it’s not much of an issue.

Ten restores with 6 tape drives could be an issue.

This is where properly setting your RTO/RPO and tiers in advance matters.

Tier-0 would be your TSM server, the DIRCOPY/DIRPOOL, ESX hosts, switches, arrays, etc. You’d restore or rebuild these directly before anything else.

Tier-1 would be the things you can restore within the first couple of days. It would be in one collocation group or one storage pool. You could restore those first, direct from tape. It should only be a small number of systems, such as a NIM server, HR/Payroll, and inventory/medical tracking. Primary systems only.

Tier-2 systems would be what can be restored within a week or two. These might be restored direct from tape, or from a pool restored to disk first. They would be any other important systems that your company can run without, but which are a serious pain to do without for long. Generally, this would be under 50 systems. You wouldn't usually restore all of these at a normal DR test, but maybe a different subset at each DR test, just to make sure you *can* restore them.

Tier-3 might be the systems that you don't restore for a few months. Some might be rebuilt from production clones, and some might just need a source tree restored for critical system development. This covers Dev, QA, HA, and anything else that you can operate without but which should eventually be taken care of. You probably won't ever restore these at a DR site; rather, you'd wait until you fail back to new production. If they were unrecoverable, they could be rebuilt or decommissioned without business risk.


TSM dedup BACKUP STGPOOL performance

BACKUP STGPOOL for dedupe runs about 6x slower than direct tape to tape.
Why?

1) First, the database has a huge number of random reads for dedupe rehydration.
Tack on any Dedup Deletion activity (SHOW DEDUPDELETEINFO) and anything else that’s competing for DB IOPS.
FIX: Put the database on SSD or RAM backed storage.
NOTE: SSD stats are usually lies. Sustained performance is 4,500-12,000 IOPS each, divided by 2 for RAID-1/10, or by about 3.5 for RAID-5/6 (for example, eight SSDs at 6,000 sustained IOPS each in RAID-10 work out to roughly 24,000 IOPS, not the datasheet number).
FIX: increase server memory and provide more for DB2 bufferpools.
NOTE: This might require manually changing bufferpools, limiting filesystem cache, etc. (see the sketch after this list).
FIX: Large amounts of cache for the database containers
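
To see what the bufferpools look like before changing anything, a quick check from the instance user might be the following (a sketch; an NPAGES value of -2 means the pool is sized automatically by STMM):

db2 connect to tsmdb1
db2 "select BPNAME, NPAGES, PAGESIZE from SYSCAT.BUFFERPOOLS"
db2 get db cfg for tsmdb1 | grep -i self_tuning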

2) Next, the file class, while sequential, still has a large number of random read IOPS.
TSM Server has no read ahead for this. It reads the chunks in order, rather than requesting a huge buffer full of chunks.
As such, streaming speed will be limited by DB latency, file-class latency, and actual read IO times (the iostat sketch after this list is one way to measure these).
FIX: Reduce the latency for your file class
FIX: Reduce the latency for your database
FIX: Don’t do anything else during BACKUP STGPOOL.
FIX: Run your EXPIRE INVENTORY and IDENTIFY DUPLICATE after, not before.
FIX: Submit a Design Change Request (DCR) for larger chunk read cache to be used for BACKUP STGPOOL.
FIX: Submit a Design Change Request (DCR) for larger tape write buffer.
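
One way to see whether the database or the file-class disks are the limiting factor is to sample I/O service times while the BACKUP STGPOOL is running; on AIX that might look like this (a sketch; pick out the hdisks that back your DB and file-class filesystems):

# iostat -D 5 3

High average read service times on those hdisks translate directly into a lower streaming rate, since the chunks are read one at a time.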

3) Last, tape buffer underruns can kill performance.
If the write buffer empties, then the tape will stop.
Before it begins again, the tape has to be repositioned backward.
For LTO drives, usually the minimum write speed is 50MB/sec.
Anything less, and you have latency and tape life consumed by “shoe shining”.
FIX: Fix/improve issues 1 and 2 above.
FIX: Submit a design change request to allow TSM to interleave more threads onto the same tape at once.
FIX: Use tape drives with lower minimum speeds to prevent underruns
FIX: Don’t use tape. Use virtual tape, another dedupe disk pool, or a replica target TSM server.

4) Check TSM server instrumentation.
This will show you where your time is spent, and what to upgrade next.
INSTRUMENTATION BEGIN
BACKUP STGPOOL DEDUP COPYPOOL
(wait several minutes while the backup runs)
INSTRUMENTATION END FILE=/tsm/instrumentation.out


TSM and NDMP

NDMP backups into a TSM storage pool will not be deduplicated.
If you set ENABLENASDEDUPE YES, that only affects NetApp backups.
IBM doesn’t make the NDMP code, so they don’t support deduplication of anything but NetApp.
That means neither IBM's v7000 Unified backups, nor backups from any other NDMP device, get deduplicated.

As such, go ahead and send your NDMP backups to a DISK pool or direct to tape.
Sending to your dedupe pool will just clog things up.
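
For example, a plain random-access pool with no deduplication for the NDMP data might be defined like this (a sketch; the pool name, volume path, and size are hypothetical), and then referenced as the destination in the NAS domain's backup copy group:

define stgpool NDMPDISK disk
define volume NDMPDISK /tsm/ndmpdisk/vol01.dsm formatsize=51200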


DB2 10.5.0.1 negative colcard

This is a defect in DB2 10.5 FP1.
The defect does not exist in DB2 9.7 FP6.
This problem affects TSM 7.1.0.0 customers with billions of extents (over 30TB deduplicated).

In TSM Server 7.1.0.0 on AIX (unknown whether this is limited to AIX),
when RUNSTATS collects statistics on BF_AGGREGATED_BITFILES,
and there are more than maxint unique values for BFID,
then COLCARD may become negative.

A negative column cardinality will effectively disable the index for queries against it,
which will lead to slowdowns and lock escalations within TSM.
This will present as a growing dedupdelete queue, slow expire, slow BACKUP STGPOOL, and slow client backups.

This is not exactly a clean maxint wraparound, as maxint minus colcard was higher than the number of rows by about 20%.

You can check for this by logging in to your instance user, and running:

db2 connect to tsmdb1
db2 set schema tsmdb1
db2 'select TABNAME,COLNAME,COLCARD from SYSSTAT.COLUMNS where COLCARD<-1'

The output should say “0 record(s) selected.”
If it lists a negative value for any table, then that table's index will be compromised.

There is no fix for TSM Server 7.1, as no patches are available.
TSM 7.1.1 will release with DB2 10.5 FP3, which will not include a fix for this problem.
As of 2014-08-01, the problem has not been isolated yet.

The workaround is to update column cardinality to a reasonable value.
It doesn’t need to be exact. An example command might be:

db2 connect to tsmdb1
db2 set schema tsmdb1
db2 "UPDATE SYSSTAT.COLUMNS SET COLCARD=3300000000 WHERE COLNAME='BFID' AND TABNAME='BF_AGGREGATED_BITFILES' AND TABSCHEMA='TSMDB1'"

There is no APAR for this, and no hits on Google for “DB2 ‘negative column cardinality'”.
This seems slightly related to: http://www-01.ibm.com/support/docview.wss?uid=swg1IC99408

NOTE: DO NOT INSTALL A DB2 FIXPACK SEPARATELY. The DB2 bundled with TSM is very slightly different, and standard DB2 fixpacks are not supported. If you do it anyway, you may run into command or schema problems. Even if it works, you may not be able to upgrade TSM afterward without a BACKUP DB, uninstall, reinstall, and RESTORE DB, at best.

If you have a large dedupe database, your options include:
* Stay at TSM 6.x
* Monitor for negative column cardinality
* Wait for an APAR and efix from IBM.
* Wait for TSM 7.1.1.1 or TSM 7.2.0 in 2015 (or whatever versions will contain fixes).


TSM 7.1 config

In the past, I set up TSM.PWD as root, but that turned out not to be what was needed.

I’m posting because the error messages and IBM docs don’t cover this.

tsmdbmgr.log shows:
ANS2119I An invalid replication server address return code rc value = 2 was received from the server.

TSM Activity log shows:
ANR2983E Database backup terminated due to environment or setup issue related to DSMI_DIR - DB2 sqlcode -2033 sqlerrmc 168. (SESSION: 1, PROCESS: 9)

db2diag.log shows:

2014-02-26-13.54.12.425089-360 E415619A371 LEVEL: Error
PID : 15138852 TID : 1 PROC : db2vend
INSTANCE: tsminst1 NODE : 000
HOSTNAME: tsmserver
EDUID : 1
FUNCTION: DB2 UDB, database utilities, sqluvint, probe:321
DATA #1 : TSM RC, PD_DB2_TYPE_TSM_RC, 4 bytes
TSM RC=0x000000A8=168 -- see TSM API Reference for meaning.

EDUID : 38753 EDUNAME: db2med.35926.0 (TSMDB1) 0
FUNCTION: DB2 UDB, database utilities, sqluMapVend2MediaRCWithLog, probe:656
DATA #1 : String, 134 bytes
Vendor error: rc = 11 returned from function sqluvint.
Return_code structure from vendor library /tsm/tsminst1/sqllib/adsm/libtsm.a:

DATA #2 : Hexdump, 48 bytes
0x0A00030462F0C4D0 : 0000 00A8 3332 3120 3136 3800 0000 0000 ....321 168.....
0x0A00030462F0C4E0 : 0000 0000 0000 0000 0000 0000 0000 0000 ................
0x0A00030462F0C4F0 : 0000 0000 0000 0000 0000 0000 0000 0000 ................

EDUID : 38753 EDUNAME: db2med.35926.0 (TSMDB1) 0
FUNCTION: DB2 UDB, database utilities, sqluMapVend2MediaRCWithLog, probe:696
MESSAGE : Error in vendor support code at line: 321 rc: 168

RC 168 per dsmrc.h means:
#define DSM_RC_NO_PASS_FILE 168 /* password file needed and user is not root */

Verified everything required for this:
• passworddir points to the right directory
• DSMI_DIR points to the right directory
• dsmtca runs okay
• dsmapipw runs okay

Verified hostname info was correct

dsmffdc.log shows:
[ FFDC_GENERAL_SERVER_ERROR ]: (rdbdb.c:4200) GetOtherLogsUsageInfo failed, rc=2813, archLogDir = /tsm/arch.

Checked, and the archive log directory in dsmserv.opt was typoed as /tsm/arch instead of /tsm/arc, which is what was used to create the instance and what exists on the filesystems.

Updated dsmserv.opt and restarted the TSM server. No change, other than fixing the Q LOG output.

SOLUTION:
The TSM.PWD file must be owned by the instance user, not by root.
Make sure to run dsmapipw as the instance user, or chown the file afterward.
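
For example (a sketch; the TSM.PWD location depends on your passworddir / DSMI_DIR, so the path below is an assumption based on this layout):

# chown tsminst1 /opt/tivoli/tsm/server/bin/dbbkapi/TSM.PWD
# ls -l /opt/tivoli/tsm/server/bin/dbbkapi/TSM.PWD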


TSM file class design issue

If you have 6 filesystems backing a sequential-access FILE storage pool, and you remove one of those filesystems from the device class, TSM cannot calculate free space properly.

Instead of looking at the free space of the remaining filesystems, it takes the total space of the remaining filesystems and subtracts the space of all volumes in that device class.

Since there may still be old volumes in the “removed” directory, it considers the device class 100% full if everything currently existing cannot fit into the remaining directories.

Note that removing a directory from a device class does not invalidate the existing volumes in that directory. So long as the directory is still accessible, the volumes will be usable.

This is a problem when you want to shrink a filesystem without migrating 100% off of it, because the only way to tell TSM not to allocate new volumes in a directory is to remove that directory from the device class.
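
For illustration, removing a directory from a FILE device class looks something like this (a sketch; the device class name and paths are hypothetical, and the DIRECTORY parameter must re-list every directory you want to keep):

update devclass BIGFILEDEV directory=/tsm/fs1,/tsm/fs2,/tsm/fs3,/tsm/fs4,/tsm/fs5
query devclass BIGFILEDEV format=detailed

Volumes that still live under the dropped /tsm/fs6 remain usable as long as the path stays mounted, but the free-space calculation now compares all existing volumes against only the five remaining directories.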