NDMP TOC failure – datamover type incorrect

ERROR:
ANR4950E The server is unable to retrieve NDMP file history information while building table of contents for node NASNODE01, file space /SVM_NASNODE01_VIRTUALFS. NDMP node ID is 90156245149. Table of contents creation fails.

CAUSE:
One possible cause of this can be if the datamover was defined with the wrong scope (TYPE).  
TYPE can be NAS, NASVSERVER, or NASCLUSTER.  NAS is for node context.  VSERVER is for SVM ccontext.  CLUSTER is for the whole cluster context.

NOTE: There are other possible causes, such as corrupt inodes, or other issues; however, this one bit me and was not clearly define anywhere else.

CORRECTION:
You cannot UPDATE DATAMOVER TYPE=blah, but you can simply DELETE DATAMOVER and DEFINE DATAMOVER to fix.

DELETE DATAMOVER NASNODE01
DEFINE datamover NASNODE01 type=nascluster dataformat=netappdump hla=192.168.128.1 user=NDMPADMIN password=PASSWORDHERE

TRACING INFO:

trace disable
trace enable spi spid toc
trace begin /tmp/server.trc

Once tracing has been enabled, I would then like for you to initiate another backup of the /SVM_SBNAS01_OU_ABOD volume. When the backup completes/fails, you can then issue the following commands to disable tracing:

trace flush
trace end
trace disable
QUERY ACTLOG

grep NDMP dsmffdc.log

NASNODE01::> node run -node SBNAS01-01
Type ‘exit’ or ‘Ctrl-D’ to return to the CLI
NASNODE01> rdfile /etc/log/backup


Protect initial install

This is happiness…

tsminst1@tsm:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04.3 LTS
Release: 16.04
Codename: xenial

/bin/bash# for i in /dev/sd? ; do smartctl -a $i ; done | grep ‘Device Model’
Device Model: Samsung SSD 850 EVO 250GB
Device Model: WDC WD30EFRX-68EUZN0
Device Model: Samsung SSD 850 EVO 250GB
Device Model: WDC WD30EFRX-68EUZN0
Device Model: WDC WD30EFRX-68EUZN0

tsminst1@tsm:~$ dsmserv format dbdir=/tsm/db01,/tsm/db02,/tsm/db03,/tsm/db04,/tsm/db05,/tsm/db06,/tsm/db07,/tsm/db08 \
> activelogsize=8192 activelogdirectory=/tsm/log archlogdirectory=/tsm/logarch

ANR7800I DSMSERV generated at 11:32:48 on Sep 19 2017.

IBM Spectrum Protect for Linux/x86_64
Version 8, Release 1, Level 3.000

Licensed Materials – Property of IBM

(C) Copyright IBM Corporation 1990, 2017.
All rights reserved.
U.S. Government Users Restricted Rights – Use, duplication or disclosure
restricted by GSA ADP Schedule Contract with IBM Corporation.

ANR7801I Subsystem process ID is 29286.
ANR0900I Processing options file /home/tsminst1/dsmserv.opt.
ANR0010W Unable to open message catalog for language en_US.UTF-8. The default language message catalog will be used.
ANR7814I Using instance directory /home/tsminst1.
ANR3339I Default Label in key data base is TSM Server SelfSigned SHA Key.
ANR4726I The ICC support module has been loaded.
ANR0152I Database manager successfully started.
ANR2976I Offline DB backup for database TSMDB1 started.
ANR2974I Offline DB backup for database TSMDB1 completed successfully.
ANR0992I Server’s database formatting complete.
ANR0369I Stopping the database manager because of a server shutdown.


New data protection

Upgrading TSM server from Q9650 Core 2 Quad 3.0GHz, 8GB DDR2 on Win 2008R2.

New system is HP Z600, two-socket, 6-core 2.66GHz Xeon X5650 and 48GB of RAM. Wattage is the same per socket, but two sockets now. 3x the cores, 4x the performance.

SSDs for DB and Log are also moving to EVO 850 from Corsair M100. I’ll set up a container pool to replace the dedupe file class, and put that on 3x 3TB RAID5 instead of 2x RAID1.

OS will be Ubuntu 16.04.2 LTS. I’d like to just use Debian 9.1, but Debian and long-term-support seem to not be synonymous. I’d hate to run a patch update and have everything break, then fight with debian testing repo to try to get it all back to normal. Plus, I have no Ubuntu boxes, only Debian. It’ll give me a chance to see what operational differences I run into.

Old TSM is 6.4. New will be “Spectrum Protect” 8.1.3. Yes, the billions spent to rebrand to the same name as Charter Cable’s rebrand really seems like money well spent.

Anyway, Since I lost the offsite replication provider for the dedupe file pool, and it was having trouble keeping up anyway, this will let me change to server-side encryption, and object storage. We’ll see which provider wins out on price once everything is rededuped properly.

If the fan noise is not too bad, maybe this platform can be considered for a low-cost upgrade to the kids’ game machines. Though, these are heavy, with 2 big handles on the top.

Also, really, something new enough to have USB3 on the motherboard is probably better. I have some laptops picked out, but that’s re-buying every component, including ones that are presently decent. *sigh*


TSM dedup BACKUP STGPOOL performance

BACKUP STGPOOL for dedupe runs about 6x slower than direct tape to tape.
Why?

1) First, the database has a huge number of random reads for dedupe rehydration.
Tack on any Dedup Deletion activity (SHOW DEDUPDELETEINFO) and anything else that’s competing for DB IOPS.
FIX: Put the database on SSD or RAM backed storage.
NOTE: SSD stats are usually lies. Sustained performance is 4500-12,000 IOPS each, divided by 2 for RAID-1/10, or by 3.5 for RAID-5/6)
FIX: increase server memory and provide more for DB2 bufferpools.
NOTE: This might require manually changing bufferpools, limiting filesystem cache, etc.
FIX: Large amounts of cache for the database containers

2) Next, the file class, while sequential, still has a large number of random read IOPS.
TSM Server has no read ahead for this. It reads the chunks in order, rather than requesting a huge buffer full of chunks.
As such, streaming speed will be limited by DB latency, file-class latency, and actual read IO times.
FIX: Reduce the latency for your file class
FIX: Reduce the latency for your database
FIX: Don’t do anything else during BACKUP STGPOOL.
FIX: Run your EXPIRE INVENTORY and IDENTIFY DUPLICATE after, not before.
FIX: Submit a Design Change Request (DCR) for larger chunk read cache to be used for BACKUP STGPOOL.
FIX: Submit a Design Change Request (DCR) for larger tape write buffer.

3) Last, tape buffer underruns can kill performance.
If the write buffer empties, then the tape will stop.
Before it begins again, the tape has to be repositioned backward.
For LTO drives, usually the minimum write speed is 50MB/sec.
Anything less, and you have latency and tape life consumed by “shoe shining”.
FIX: Fix/improve issues 1 and 2 above.
FIX: Submit a design change request to allow TSM to interleave more threads onto the same tape at once.
FIX: Use tape drives with lower minimum speeds to prevent underruns
FIX: Don’t use tape. Use virtual tape, another dedupe disk pool, or a replica target TSM server.

4) Check TSM server instrumentation.
This will show you where your time is spent, and what to upgrade next.
INSTRUMENTATION BEGIN
BACKUP STGPOOL DEDUP COPYPOOL
wait several minutes
INSTRUMENTATION END FILE=/tsm/instrumentation.out