ANR1812E DELETE FILESPACE VMFULL failed because replication

ERROR:

ANR1812E DELETE FILESPACE VMFULL failed because replication

DESCRIPTION:

Decommissioned VMs fail to auto-delete during expiration because replication is running. In an ideal world, there would be enough system resources to perform DB Backup in 2 hours, expiration in 2 hours, and replication in 4-8 hours. In this environment, replication overlaps many other processes and gets in the way.

ANR1812E DELETE FILESPACE VMFULL-SOMENODENAME for node failed deletion because of a replication in progress. (SESSION: 123456)

 

WORKAROUND:

Identify the server

 

Cancel replication
CANCEL REPLICATION

 

Identify the filespace
VMFULL-SOMENODENAME in the example

 

Find the node that owns the filespace.
Protect: TSM>q occ * *VMFULL-SOMENODENAME *
NODE_NAME       Type   FILESPACE_NAME         FSID   Files   Phys MB   Logical MB
VM_DATACENTER   Bkup   \VMFULL-SOMENODENAME      4   53084         -    6,782,908

 

Delete the filespace on both local and replica:
DELETE FI VM_DATACENTER '\VMFULL-SOMENODENAME'
TSM2: DELETE FI VM_DATACENTER '\VMFULL-SOMENODENAME'

 

Monitor Progress until complete
Protect: TSM>q occ * *VMFULL-SOMENODENAME *
NODE_NAME       Type   FILESPACE_NAME         FSID   Files   Phys MB   Logical MB
VM_DATACENTER   Bkup   \VMFULL-SOMENODENAME      4   50848         -    6,469,955

Protect: TSM>q act search=ANR1812E
03/07/21   23:14:23      ANR2017I Administrator ADMIN issued command: QUERY ACTLOG search=ANR1812E  (SESSION: 438180)

Protect: TSM>q proc
Process      Process Description          Job Id     Process Status                                   
--------     --------------------     ----------     -------------------------------------------------
     395     DELETE FILESPACE                        Deleting file space \VMFULL-SOMENODENAME
                                                      (fsId=4) (which can include backup and archive
                                                      data) for node VM_DATACENTER    : 0 objects deleted,
                                                      0 objects retained, and 0 objects skipped.

Protect: TSM>TSM2: q proc
ANR1699I Resolved TSM2 to 1 server(s) - issuing command Q PROC against server(s).
ANR1687I Output for command 'Q PROC' issued against server TSM2 follows:
Process      Process Description          Job Id     Process Status                                   
--------     --------------------     ----------     -------------------------------------------------
   5,756     DELETE FILESPACE                        Deleting file space \VMFULL-SOMENODENAME
                                                      (fsId=4) (which can include backup and archive
                                                      data) for node VM_DATACENTER: 0 objects deleted,
                                                      0 objects retained, and 0 objects skipped.
ANR1688I Output for command 'Q PROC' issued against server TSM2 completed.
ANR1694I Server TSM2 received the request to process command 'Q PROC'.
ANR1697I Command 'Q PROC' processed by 1 server(s):  1 successful, 0 with warnings, and 0 with errors.

 

CAUSE:

Replicate Node, a normal operation, creates locks on any filespace to be processed.

The long-term resolution would be to have enough system resources to not have to overlap daily operations processes.

The benchmark set by IBM for this would be the ability to complete BACKUP DB in 2 hours.  This environment takes 8-12 hours for most servers.
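
When many decommissioned VMs are affected, the "identify the filespace" step can be scripted. The following is a minimal sketch, assuming the activity log has been saved as text lines and that the message wording matches the sanitized example in this post; the function name and the sample list are made up for illustration.

```python
import re

# Assumption: the message text matches the sanitized ANR1812E example above.
FILESPACE_RE = re.compile(r"ANR1812E DELETE FILESPACE (\S+) for node")


def blocked_filespaces(actlog_lines):
    """Return unique file space names found in ANR1812E lines, in order seen."""
    seen = []
    for line in actlog_lines:
        match = FILESPACE_RE.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


# Sample lines copied from this post; a real activity log will have more.
sample = [
    "ANR1812E DELETE FILESPACE VMFULL-SOMENODENAME for node failed deletion "
    "because of a replication in progress. (SESSION: 123456)",
    "03/07/21   23:14:23      ANR2017I Administrator ADMIN issued command: "
    "QUERY ACTLOG search=ANR1812E  (SESSION: 438180)",
]

print(blocked_filespaces(sample))
```

The ANR2017I line is ignored because it only echoes the search string; each name returned still has to be matched to its owning node with q occ before deleting.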



ANR2568E Request for node (node) to start schedule (name) at (date) is denied.

ERROR:

ANR2568E Request for node (node) to start schedule (name) at (date) is denied.

 

DESCRIPTION:

This happens when two or more schedulers are connecting as the same node.  One node starts work on a schedule, and the others are denied.

 

 

WORKAROUND:

Check the client node for two or more “dsmcad” and “dsmc sched” processes with the same (or no) config file listed.

Kill the oldest duplicates.

 

If no duplicates are on the client, then search the activity log to see if this client is connecting with multiple IP addresses or hostnames.

If so, find out which host should not be running the scheduler, and stop the scheduler processes on that host.

 

This may require coordination with the UNIX team in cases of cluster failovers.

This may require investigation of start scripts in cases where the same client chronically has duplicates.
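
The duplicate check in this workaround can be sketched in a few lines. This is only an illustration, assuming the process list was captured as "PID, then full command line" (for example with ps -eo pid,args); the function name, the sample PIDs, and the paths in the sample are hypothetical.

```python
from collections import defaultdict


def duplicate_schedulers(ps_output):
    """Group dsmcad / 'dsmc sched' processes by command line; keep duplicates."""
    groups = defaultdict(list)
    for line in ps_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # header line, blank line, or unexpected format
        pid = int(parts[0])
        words = parts[1].split()
        exe = words[0].rsplit("/", 1)[-1]
        is_cad = exe == "dsmcad"
        is_sched = exe == "dsmc" and any(
            w.lower() in ("sched", "schedule") for w in words[1:]
        )
        if is_cad or is_sched:
            # Same (or no) config file means an identical command line.
            groups[" ".join(words)].append(pid)
    return {cmd: pids for cmd, pids in groups.items() if len(pids) > 1}


# Hypothetical capture; the paths and PIDs are placeholders.
sample = """\
  PID COMMAND
 1234 /usr/bin/dsmcad -optfile=/path/to/dsm.opt
 5678 /usr/bin/dsmcad -optfile=/path/to/dsm.opt
 9012 dsmc sched
 3456 /usr/sbin/sshd
"""

print(duplicate_schedulers(sample))
```

The sketch only reports the duplicates. Deciding which instance is the oldest, and killing it, is left to the operator.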

 

CAUSE:

Typically, a human will restart a scheduler, but fail to kill the original.

Sometimes, a start/stop script on a host fails to stop the prior instance.

In some cases, multiple start scripts fire on system boot.

 



SMB/CIFS 3 on AIX

Mounting should be vaguely similar to the SMB1 mounting you had before.

Download and install SMB Client 3 and “Network Authentication Service” (aka Kerberos 5) from here:

https://www-01.ibm.com/marketing/iwm/iwm/web/pickUrxNew.do?source=aixbp

Ensure your Windows 201x server has SMB v3 enabled.

You want a service account in AD to use for your SMB3 mounts on AIX.

 

Notes about options:

encryption should be desired and secure_negotiate should be desired.
signing should be enabled
pver should be 3.0.2
The Kerberos realm specified in the “wrkgrp” option must be in all UPPERCASE if your domain is in uppercase.
The username provided for mounting is used for all read/write permissions/access.  
UID and GID default to root.system, but you can specify others.
fmode is the inverse of umask, and what the files’ permissions look like across the whole share.  Default is 755.
port can be 139 (ipv4) or 445 (ipv4 or ipv6).  Default is 445.
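
The fmode and umask relationship is plain bit arithmetic. A small sketch, assuming "inverse" here means the bitwise complement of the umask within the nine permission bits; the helper name is made up.

```python
def fmode_from_umask(umask):
    """Return the permission bits left after masking 0o777 with a umask."""
    return 0o777 & ~umask


# A umask of 022 corresponds to the default fmode of 755.
print(oct(fmode_from_umask(0o022)))  # 0o755
```

For example, if you would normally use a umask of 027, the matching fmode is 750.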

 

/etc/filesystems format:

/mnt:
     dev = /corpshare
     vfs = smbc
     mount = true
     options = "wrkgrp=CORP.DOMAIN,signing=enabled,pver=3.0.2,encryption=desired,secure_negotiate=desired"
     nodename = win2016server.corp.domain/sambauser

 

Command line example

mount -v smbc -n win2016server.corp.domain/sambauser/Passw0rd! \
-o "wrkgrp=CORP.DOMAIN,port=445,signing=required,encryption=required,secure_negotiate=desired,pver=auto" \
/corpshare /mnt

 

Store the samba credentials

mksmbcred -s win2016server.corp.domain -u sambauser [-p password]

See also lssmbcred, chsmbcred, and rmsmbcred.

 

Reference 2021:

https://www.ibm.com/docs/en/aix/7.2?topic=protocol-server-message-block-smb-client-file-system 


Supermicro BMC firmware

Setting up the new replica server, I kept running into problems during the initial firmware update.  After the BMC firmware update, all IPMI settings and hardware data were inaccessible, although the system otherwise worked as expected.  This condition persisted whether I went to DEL setup, the F11 boot menu, or the UEFI shell.

It was not that the sensor data was “not present”.  It was that the list of sensors was missing.  The system MAC address and BIOS version info were also missing.  The FRU data was empty, as was the Hardware Information.  All of the IPMI settings were blank, and could not be set.  The diagnostic data page gave “File not found”.  The iKVM was inaccessible, and the system could not be put into maintenance mode, powered off, powered on, or reset from the IPMI interface.  All of the system logs were blank and inaccessible.  Support was not super helpful, but they were responsive.  Supermicro is one of the top tier system makers.  They OEM for IBM, but that equipment is not quite as touchy.

The solution is to be very finicky about the BMC firmware update.

Get the right version for your system here:
https://www.supermicro.com/support/resources/bios_ipmi.php

My system used this code:
https://www.supermicro.com/Bios/softfiles/12085/X11SDVN_BIOS_1_3a_IPMI_1_31_03.zip
Manufacturer Name: Supermicro
Product PartNum: SYS-E300-9D
Chassis Part Number: CSE-E300
Board Product Name: X11SDV-4C-TLN2F
BIOS Vendor: American Megatrends Inc.
Processor: Intel(R) Xeon(R) D-2123IT CPU @ 2.20GHz

Update the code with extra patience
Use the AwUpdate utility to update the IPMI/BMC firmware
.\AwUpdate -f ......\WS_X11AST2500_131_03.bin -i lan -h 192.168.1.210 -u ADMIN -p ADMIN
NOTE: This can be on some other system as long as you can connect TCP between the two.
NOTE: Do not use -r.  The equivalent in the web UI is to uncheck all of the “preserve settings” options.

Let all 5 parts (0 through 4) complete
Wait for the “New firmware is updating” to complete
Wait for the system to reboot.

Monitor the console
Wait for a longer version of IPMI Initialization
Wait for a longer than usual DXE — ACPI Initialization
Wait for the red LED to come on
Wait at least 5 more minutes (try 10)

At this point, the system responds to F11 or DEL, but then stays hung.
Press CTRL-ALT-DEL to reboot, and everything should be populated and working.

The Unit IDentity LED may be stuck red.
You cannot clear the UID red state any way other than pulling the power cord.
Let it drain for 30 seconds, and plug back in.

After this, everything works, AND the UID LED setting in the IPMI web interface will switch from blink blue to off.


OVM CPU Pinning

If you clone or recover an Oracle VM guest, and the source used CPU pinning (Hard Partitioning), the target may not work.  The error is entirely non-intuitive, and I could not find it on the interwebs, so here is a sanitized version.

OVMAPI_5001E Job: 1416254413024/QueuedVmStartDbImpl_1416254413023/OVMJOB_1500J Start/resume vm: PRODVM, on server: PRODSERVER, failed. 
Job Failure Event: 1416254413902/Server Async Command Failed/OVMEVT_00C014D_001 Async command failed on server: PRODSERVER. 
Object: PRODVM, PID: 15431, Server error: 
Command: ['xm', 'create', '/OVS/Repositories/000dead000beef00cafe0421cab55bad/VirtualMachines/000dead000beef00cafef207cabdbbad/vm.cfg'] failed (1): 
stderr: Error: (22, 'Invalid argument') 
stdout: Using config file "/OVS/Repositories/000dead000beef00cafe0421cab55bad/VirtualMachines/000dead000beef00cafef207cabdbbad/vm.cfg". , 
on server: PRODSERVER, associated with object: 000dead000beef00cafef207cabdbbad [Thu Apr 15 00:12:19 EDT 2021]

 

You can remove the "cpus = '#-#'" line from vm.cfg to reset this.
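
If you have several guests to fix, the edit can be scripted. A minimal sketch, assuming vm.cfg holds one setting per line in the "key = value" form shown above; the function name and the sample config text are made up, and you should back up the real file before rewriting it.

```python
import re

# Matches a "cpus = ..." line but not "vcpus = ...".
CPUS_LINE = re.compile(r"^\s*cpus\s*=")


def strip_cpu_pinning(cfg_text):
    """Return the vm.cfg text with any 'cpus = ...' pinning line removed."""
    kept = [
        line
        for line in cfg_text.splitlines(keepends=True)
        if not CPUS_LINE.match(line)
    ]
    return "".join(kept)


# Hypothetical config fragment for illustration only.
sample = "vcpus = 8\ncpus = '0-7'\nmemory = 16384\n"

print(strip_cpu_pinning(sample), end="")
```

The vcpus line is deliberately left alone; only the pinning line is dropped.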

Useful commands for inspecting OVM hard partitioning include:

xm info

xm list

xenpm get-cpu-topology

xm vcpu-list

# cd /u01/app/oracle/ovm-manager-3/ovm_utils
# ./ovm_vmcontrol -u admin -p YourPassword -h ovm-manager -v my-first-vm -c vcpuset -s 0-7
Oracle VM VM Control utility 0.6.3.
Connected.
Command : vcpuset
Pinning virtual CPUs
Pinning of virtual CPUs to physical threads  '0-7' 'my-first-vm' completed.

After that, vcpu-list will show VM names in column 1 for dedicated CPUs.

NDMP TOC failure – datamover type incorrect

ERROR:
ANR4950E The server is unable to retrieve NDMP file history information while building table of contents for node NASNODE01, file space /SVM_NASNODE01_VIRTUALFS. NDMP node ID is 90156245149. Table of contents creation fails.

CAUSE:
One possible cause of this can be if the datamover was defined with the wrong scope (TYPE).  
TYPE can be NAS, NASVSERVER, or NASCLUSTER.  NAS is for node context.  NASVSERVER is for SVM context.  NASCLUSTER is for the whole cluster context.

NOTE: There are other possible causes, such as corrupt inodes, or other issues; however, this one bit me and was not clearly defined anywhere else.

CORRECTION:
You cannot UPDATE DATAMOVER TYPE=blah, but you can simply DELETE DATAMOVER and DEFINE DATAMOVER to fix.

DELETE DATAMOVER NASNODE01
DEFINE datamover NASNODE01 type=nascluster dataformat=netappdump hla=192.168.128.1 user=NDMPADMIN password=PASSWORDHERE

TRACING INFO:

trace disable
trace enable spi spid toc
trace begin /tmp/server.trc

Once tracing has been enabled, initiate another backup of the affected volume (/SVM_SBNAS01_OU_ABOD in the original support request). When the backup completes or fails, issue the following commands to disable tracing:

trace flush
trace end
trace disable
QUERY ACTLOG

grep NDMP dsmffdc.log

NASNODE01::> node run -node SBNAS01-01
Type 'exit' or 'Ctrl-D' to return to the CLI
NASNODE01> rdfile /etc/log/backup


reducevg very slow

This is an APAR, but really it’s a description. Reducevg sends the equivalent of TRIM commands, but on a storage array, this is writing nulls. On a big LUN, or with a busy array, this can take a long time. If you do not need to worry about this, then you can disable that space reclaim.

ioo -o dk_lbp_enabled=0

Here is the IBM doc about it.

 

IJ23045: REDUCEVG UNCLEAR ON DELAY WHEN WAITING FOR INFLIGHT RECLAIM REQ APPLIES TO AIX 7100-05

 

A fix is available

APAR status

  • Closed as program error.

Error description

  • reducevg may be unclear, why there is some delay
    when waiting on inflight reclaim requests.
    

Local fix

  • Disable space reclamation by running:
    ioo -o dk_lbp_enabled=0
    

Problem summary

  • reducevg may be unclear, why there is some delay
    when waiting on inflight reclaim requests.
    

Problem conclusion

  • reducevg displays message incase there are space reclamation
    IOs inflight to indicate reducevg may take some time to
    complete.

TSM SP Remove ReplServer

PROBLEM:
Every 5.5 minutes, this shows up in the actlog

08/13/20 08:05:25 ANR1663E Open Server: Server OLDSERVER not defined
08/13/20 08:05:25 ANR1651E Server information for OLDSERVER is not available.
08/13/20 08:05:25 ANR4377E Session failure, target server OLDSERVER is not defined on the source server.
08/13/20 08:05:25 ANR1663E Open Server: Server OLDSERVER not defined
08/13/20 08:05:25 ANR1651E Server information for OLDSERVER is not available.
08/13/20 08:05:25 ANR4377E Session failure, target server OLDSERVER is not defined on the source server.
08/13/20 08:05:26 ANR1663E Open Server: Server OLDSERVER not defined
08/13/20 08:05:26 ANR1651E Server information for OLDSERVER is not available.
08/13/20 08:05:26 ANR4377E Session failure, target server OLDSERVER is not defined on the source server.
08/13/20 08:05:28 ANR1663E Open Server: Server OLDSERVER not defined
08/13/20 08:05:28 ANR1651E Server information for OLDSERVER is not available.
08/13/20 08:05:28 ANR4377E Session failure, target server OLDSERVER is not defined on the source server.

SOLUTION:
QUERY REPLSERVER shows the GUID
REMOVE REPLSERVER (GUID) to cause the errors to stop.


SVC, StorWize, FlashSystem, Spectrum Virtualize – replace a drive

When you replace a drive on one of these, mdisk arrays do not auto-rebuild.

If the GUI fix procedures go away or never show up, or the replacement drive is otherwise not included as a new drive in the mdisk, you can do this manually.

 

First, look for the candidate or spare drive you want to use.

lsdrive | grep -v member

 

Then, make sure that drive ID is a candidate:

chdrive -use candidate 72

 

Then, find the missing member:

lsarraymember mdisk1 | grep -v exact

 

Then, set the new drive to use that missing member ID:

charraymember -member 31 -newdrive 72 mdisk1

 

You can watch the progress of the rebuild:

lsarraymemberprogress mdisk1