AIX 7.2.3.1 breaks GSKit 8.0.50.89

AIX 7.2.3 breaks GSKit8, up through GP29 (8.0.50.89).

This affects TSM/Spectrum Protect, Content Manager, Tivoli Directory Server, WebSphere, DB2, Informix, IBM HTTP Server, etc.

Before the reboot, everything still works, which implies the change is in the kernel.

We found it on TSM: AIX 7200-03-01-1838 with Spectrum Protect server 8.1.6.0.
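
To tell whether a host falls in the range described here, you can compare its levels against the ones named above. This is only a sketch: the function names are made up, and it assumes the affected range is exactly "AIX 7200-03 with GSKit8 at or below 8.0.50.89", which is all this post establishes. On a real system the two inputs would come from something like oslevel -s and lslpp -L 'GSKit8*'.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of AIX or GSKit.

# ver_le A B: succeed if dotted version A <= B (numeric compare per field).
ver_le() {
    a=$1 b=$2
    while [ -n "$a" ] || [ -n "$b" ]; do
        fa=${a%%.*}; fb=${b%%.*}
        [ "${fa:-0}" -lt "${fb:-0}" ] && return 0
        [ "${fa:-0}" -gt "${fb:-0}" ] && return 1
        # drop the field that was just compared
        case $a in *.*) a=${a#*.} ;; *) a= ;; esac
        case $b in *.*) b=${b#*.} ;; *) b= ;; esac
    done
    return 0
}

# gskit_at_risk OSLEVEL GSKVER: succeed if the pair matches the range
# described in this post (assumption: 7200-03-*, GSKit8 <= 8.0.50.89).
gskit_at_risk() {
    case $1 in
        7200-03-*) ver_le "$2" 8.0.50.89 ;;
        *) return 1 ;;
    esac
}

if gskit_at_risk 7200-03-01-1838 8.0.50.89; then
    echo "in the affected range"
else
    echo "not in the affected range"
fi
```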

Application crash and DBX follow below.

ANR7800I DSMSERV generated at 12:17:13 on Sep 11 2018.
IBM Spectrum Protect for AIX
Version 8, Release 1, Level 6.000
Licensed Materials - Property of IBM
(C) Copyright IBM Corporation 1990, 2018.
All rights reserved.
U.S. Government Users Restricted Rights - Use, duplication or disclosure
restricted by GSA ADP Schedule Contract with IBM Corporation.

ANR7801I Subsystem process ID is 10944920.
ANR0900I Processing options file /home/tsminst1/dsmserv.opt.
ANR7811I Using instance directory /home/tsminst1.
Illegal instruction(coredump)

# dbx /opt/tivoli/tsm/server/bin/dsmserv core.10944896.28165312
Type 'help' for help.
[using memory image in core.10944896.28165312]
reading symbolic information ...warning: no source compiled with -g

Illegal instruction (illegal opcode) in . at 0x0 ($t1)
warning: Unable to access address 0x0 from core

(dbx) where
.() at 0x0
gsk_src_create__FPPvPv(??, ??) at 0x9000000015b6d88
__ct__8GSKMutexFv(??) at 0x9000000018d664c
__ct__20GSKPasswordEncryptorFv(??) at 0x9000000018cb248
__ct__7gsk_envFv(??) at 0x900000000aaa6b0
GskEnvironmentOpen__FPPvb(??, ??) at 0x900000000ab14c4
gsk_environment_open(??) at 0x900000000ab277c
IPRA.$CheckGSKVersion() at 0x100eecf68
tlsInit() at 0x100eecd70
main(??, ??) at 0x10000112c

(dbx) th
thread state-k wchan state-u k-tid mode held scope function

$t1 run running 41877977 k no sys
$t2 run blocked 21234465 u no sys _cond_wait_global
$t3 run running 24380103 u no sys waitpid


AIX ramdisks

Long ago (think 1999ish), I wrote a techdoc on how to put JFS on a ramdisk on AIX. We called them FAXES, because we would fax them to people, and this was before FAQ was a common acronym. At some point, I put it into the TechDoc system when that came out, because there was a push to use the system.

I lost the original text, but the techdoc lived on. It was rewritten after I left Big Blue. You can see their better version here:
http://www-01.ibm.com/support/docview.wss?uid=isg3T1010722

I don’t want to copy their doc, because they can be testy about such things. Heck, they can be testy when they plagiarize my docs. The key reference is syntax, which I’ll summarize here. You can also just look up the manpages on mkramdisk, mkfs,

Make a pinned-memory ramdisk: mkramdisk $size
The size is in 512-byte blocks unless you add an M or G suffix. The default uses pinned RAM, which is required for JFS or JFS2.

Make an un-pinned ramdisk: mkramdisk -u $size
This is okay for raw devices, maybe UDFS, but not for JFS. JFS has latency/access requirements, and at least mkfs knows to throw an error if you try to format an un-pinned ramdisk.

For JFS, run mkfs on /dev/ramdisk0 as normal, then mount with -o nointegrity.

For JFS2, use -o log=INLINE on both the format and the mount.

You can, of course, format UDF as well: udfcreate -f3 -d/dev/ramdisk0 ; mount -vudfs /dev/ramdisk0 /RAMDISK
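
Pulling the JFS2 steps above together, a minimal sketch might look like this. The run wrapper, the DRYRUN variable, the size, and the /RAMDISK mount point are placeholders I added, and the script assumes mkramdisk hands out ramdisk0. By default it only prints the commands. The flags are the ones described in this post, so check them against the manpages on your level before running anything for real.

```shell
#!/bin/sh
# Sketch: JFS2 on a pinned ramdisk, using the options described above.
# DRYRUN=1 (the default) prints each command instead of running it.
DRYRUN=${DRYRUN:-1}

run() {
    if [ "$DRYRUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# make_jfs2_ramdisk SIZE DEVICE MOUNTPOINT  (all three are placeholders)
make_jfs2_ramdisk() {
    size=$1 dev=$2 mnt=$3
    run mkramdisk "$size"                          # pinned by default
    run mkfs -V jfs2 -o log=INLINE "$dev"          # inline log on the format
    run mkdir -p "$mnt"
    run mount -V jfs2 -o log=INLINE "$dev" "$mnt"  # and on the mount
}

make_jfs2_ramdisk 40000 /dev/ramdisk0 /RAMDISK
```

Set DRYRUN=0 to actually run the commands. The ramdisk and everything on it still evaporates on reboot.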

You could probably run a mksysb to the ramdisk. I don’t know if it would be raw, or if it would be UDFS. That might be useful for high speed testing, but of course, the ramdisk evaporates on reboot. You could dd the ramdisk out to some other media.


AIX 7.2 crash removing adapters from etherchannel

If I remove the first main adapter, and re-add it, then I can add/remove either adapter or IP interface after that.

If I remove the second main adapter, and re-add it, then I cannot remove the first, and dropping the IP interface crashes.

So, assuming adapter_names=ent2,ent6

This works everywhere:
/usr/lib/methods/ethchan_config -d ent17 ent2
/usr/lib/methods/ethchan_config -a ent17 ent2
/usr/lib/methods/ethchan_config -d ent17 ent6
/usr/lib/methods/ethchan_config -a ent17 ent6
/usr/sbin/rmdev -Rl en17
/usr/sbin/mkdev -l en17
/usr/sbin/cfgmgr
# Can do any combination of the above after remove/readd first adapter in advance.

And this crashes everywhere:
/usr/lib/methods/ethchan_config -d ent17 ent6
/usr/lib/methods/ethchan_config -a ent17 ent6
# crashed here on one server
/usr/lib/methods/ethchan_config -d ent17 ent2
ethchan_config: 0950-021 Unable to delete adapter ent2 from the
EtherChannel because it could not be found, errno = 2
/usr/sbin/rmdev -Rl en17

# crash here on several others

Crash analysis follows:

(96)> stat
SYSTEM_CONFIGURATION:
CHRP_SMP_PCI POWER_PC POWER_8 machine with 160 available CPU(s) (64-bit
registers)

SYSTEM STATUS:
sysname... AIX
nodename.. testnode001
release... 2
version... 7
build date Mar 2 2018
build time 13:02:46
label..... 1809C_72H
machine... 00DEADBEEF00
nid....... FBCAFE4C
time of crash: Wed May 9 04:45:59 2018
age of system: 25 day, 10 hr., 54 min., 41 sec.
xmalloc debug: enabled
FRRs active... 0
FRRs started.. 0

CRASH INFORMATION:
CPU 96 CSA F00000002FF47600 at time of crash, error code for LEDs:
30000000
pvthread+1A0E00 STACK:
[00009324].unlock_enable_mem+000018 ()
[06058D54]shientdd:entcore_disable_tx_timeout_timers@AF123_105+000074
(??, ??)
[060592E8]shientdd:entcore_suspend_nic+000028 (??, ??)
[0605FB20]shientdd:entcore_suspend+0001E0 (??, ??, ??)
[06129A68]shientdd:entcore_close_common+000668 (??)
[0612A0B0]shientdd:entcore_close+000490 (??)
[060103CC]shientdd:shi2ent_close+00000C (??)
[F1000000C04911C0]ethchandd:ethchan_close+0001A0 (??)
[00014D70].hkey_legacy_gate+00004C ()
[0057A914]ns_free+000074 (??)
[00014F50].kernel_add_gate_cstack+000030 ()
[069E503C]if_en:en_ioctl+0002DC (??, ??, ??)
[0057126C]if_detach+0001CC (??)
[0056E1DC]ifioctl+00081C (F00000002FF473D0, 8020696680206966,
00000000066EB8A0)
[005EA764]soo_ioctl+0005C4 (??, ??, ??)
[007A4754]common_ioctl+000114 (??, ??, ??, ??)
[00003930]syscall+000228 ()
[kdb_get_virtual_memory] no real storage @ 2FF22358
[D011C92C]D011C92C ()
[kdb_read_mem] no real storage @ FFFFFFFFFFF5D60

(96)> status | grep -v wait
CPU INTR TID TSLOT PID PSLOT PROC_NAME
96 20E03BF 6670 380324 3128 ifconfig

(96)> vmlog
Most recent VMM errorlog entry
Error id = DSI_PROC
Exception DSISR/ISISR = 000000000A000000
Exception srval = 00007FFFFFFFD080
Exception virt addr = 0000000000000004
Exception value = 00000086 EXCEPT_PROT

0x86:
Protection exception. An attempt was made to write to a protected
address in memory

(96)> th -n ifconfig
SLOT NAME STATE TID PRI RQ CPUID CL WCHAN
pvthread+1A0E00 6670*ifconfig RUN 20E03BF 03E 96 0
shientdd:.entcore_disable_tx_timeout_timers AF123_105+000074
bla < .unlock_enable>
.
2390 ! SUNLOCK(TX_QUEUE_SLOCK, tx_pri);
.

---- NDD INFO ----( F1000B003952B410)----
name............. ent6 alias............ en6
ndd_next......... 0000000000000000
ndd_flags........ 00610812
(BROADCAST!NOECHO!64BIT!CHECKSUM_OFFLOAD)
ndd_2_flags...... 00000930
(IPV6_LARGESEND!IPV6_CHECKSUM_OFFLOAD!LARGE_RECEIVE!ECHAN_ELEM)

(96)> print entcore_acs_t F1000B00393F0000
struct entcore_acs_t
struct entcore_tx_queue_t
< ...>
struct entcore_ras_cb_t *ffdc_ras_cb = 0xF1000B0039537D40;
struct entcore_tx_atomics_t *atomics = 0x0000000000000000;
struct mbuf *overflow_queue = 0x0000000000000000;
struct mbuf *overflow_queue_tail = 0x0000000000000000;
uint64_t ofq_cnt = 0x0000000000000000;
struct entcore_lock_info_t *p_lock_info = 0x0000000000000000;
void *p_acs = 0xF1000B00393F0000; NULL so DSI

(96)> dd F1000B00393F78D0
F1000B00393F78D0: 0000000000000000 < - p_lock_info

(96)> xm F1000B00393F78D0
Page Information:
heap_vaddr = F1000B0000000000
P_allocrange (range of 2 or more allocated full pages)
page........... 00003937 start.. F1000B00393F0000 page_cnt....... 0017
allocated_size. 00170000 pd_size........ 00010000 pinned......... yes
XMDBG: ALLOC_RECORD

Allocation Record:
F1000B00E4306600: addr......... F1000B00393F0000 allocated pinned
F1000B00E4306600: req_size..... 1458712 act_size..... 1507328
F1000B00E4306600: tid.......... 033F0187 comm......... cfgshien
XMDBG: ALLOC_RECORD
Trace during xmalloc() on CPU 00
0604FCB0(.entcore_allocate_acs+000310)
060129C4(.entcore_config_state_machine+
0601A884(.entcore_perform_init+0000A4)

Free History:
105D 40.955808 SHIENTDD GEN: L3 Close__B d1=F1000B00393F0000
105D 40.955808 SHIENTDD GEN: L3 CloseC_B d1=F1000B00393F0000
105D 40.955809 SHIENTDD GEN: L3 HwClos_B d1=F1000B00393F0000
105D 40.955810 SHIENTDD GEN: L3 HwClos_B -HW| d1=0000000000000000
105D 40.955810 SHIENTDD GEN: L3 HwClos10 -HW| d1=0000000000000000
105D 40.955810 SHIENTDD GEN: L3 HwClos_E -HW| d1=0000000000000000
105D 40.955811 SHIENTDD GEN: L3 HwClos_E d1=0000000000000000

< ...>

105D 41.039269 SHIENTDD GEN: L3 CloseC_E d1=F1000B00393F0000
105D 41.039269 SHIENTDD GEN: L3 Close__E d1=0000000000000000
105D 41.039273 SHIENTDD GEN: L3 Close__B d1=F1000B00393F0000

another close ? >>

105D 41.039273 SHIENTDD GEN: L3 CloseC_B d1=F1000B00393F0000
105D 41.039274 SHIENTDD GEN: L3 HwClos_B d1=F1000B00393F0000
105D 41.039275 SHIENTDD GEN: L3 HwClos_B -HW| d1=0000000000000000
105D 41.039275 SHIENTDD GEN: L3 HwClos10 -HW| d1=0000000000000000
105D 41.039276 SHIENTDD GEN: L3 HwClos_E -HW| d1=0000000000000000
105D 41.039276 SHIENTDD GEN: L3 HwClos_E d1=0000000000000000
105D 41.039276 SHIENTDD GEN: L3 Suspnd_B d1=F1000B00393F0000
105D 41.039279 SHIENTDD GEN: L3 MctSyn_B d1=F1000B00393F0000
105D 41.039281 SHIENTDD GEN: L3 MctSyn_E d1=0000000000000000
END

It seems that two closes happened, which would have led to a double free and the crash.
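
The repeated Close__B can be picked out of a saved trace excerpt mechanically. This is a rough sketch that assumes the column layout shown above (event name in field 6, d1= in field 7). count_closes is a name I made up. A count above one only says the same address was closed more than once in the window, not why.

```shell
#!/bin/sh
# Sketch: count Close__B entries per d1 address in a saved trace excerpt
# and print only the addresses that were closed more than once.
count_closes() {
    awk '$6 == "Close__B" { n[$7]++ }
         END { for (a in n) if (n[a] > 1) print a, n[a] }' "$@"
}

# Example with three lines taken from the trace above.
count_closes <<'EOF'
105D 40.955808 SHIENTDD GEN: L3 Close__B d1=F1000B00393F0000
105D 41.039269 SHIENTDD GEN: L3 CloseC_E d1=F1000B00393F0000
105D 41.039273 SHIENTDD GEN: L3 Close__B d1=F1000B00393F0000
EOF
```

With those three lines it prints the address followed by a count of 2.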

The debug efix was tested for 2 weeks on 24 systems. The problem was resolved, and the patch was stable.

APAR IJ06720 was generated, and a public efix will be released for it.


AIX JFS2 autoresize

computersarefun put in a request for AIX to auto-grow/shrink filesystems.
Ref: https://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=114789

This seems more like a monitoring thing than an operating system thing.
Also, handling this as a thin LUN is probably better where possible.
Here is an example script.

Potential improvements:
* Notifications on exceptions
* Config file to track different settings per filesystem
* Also check iused / ifree to handle tiny-files
* Run as a daemon vs from cron.
* Explicit lists of filesystems, or include/exclude lists

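#!/bin/ksh
###########
# Run this from cron every minute to automatically resize JFS2 filesystems
# Incorrect limits could cause size flapping for small filesystems.
# We skip things we cannot reduce.
# Needs ksh: "cmd | read vars" only sets variables in the current shell there.

MINFREEPCT=10   # grow when free space drops below this percent
MAXFREEPCT=70   # shrink when free space rises above this percent
MINFREEPPS=10   # do not grow unless the VG has more free PPs than this

LVLIST=`mount | grep jfs2 | grep /dev/ | awk '{print $1;}' | cut -f 3 -d /`
for lv in $LVLIST ; do
  # Anchor the match so hd1 does not also match hd10
  df -gv 2>/dev/null | grep "^/dev/$lv " | read device size used free pct iused ifree ipct mountpoint || continue
  # df -g prints decimals, so let awk work out the percent free
  FREEPCT=`echo $free $size | awk '$2 > 0 {printf "%d", $1 * 100 / $2}'`
  [[ -n $FREEPCT ]] || continue
  VG=`lslv $lv 2>/dev/null | grep "VOLUME GROUP:" | awk '{print $6;}'`
  PPSIZE=`lsvg $VG 2>/dev/null | grep 'PP SIZE' | awk '{print $6;}'`   # megabytes
  [[ $PPSIZE -gt 0 ]] || continue
  #
  if [[ $FREEPCT -lt $MINFREEPCT ]] ; then
    FREEPPS=`lsvg $VG | grep 'FREE PPs' | awk '{print $6;}'`
    # size=+1 asks for one more 512-byte block, which is rounded up to a full PP
    [[ $FREEPPS -gt $MINFREEPPS ]] && chfs -a size=+1 $mountpoint
    continue
  fi
  #
  # PP SIZE is reported in megabytes, so give chfs the M suffix
  [[ $FREEPCT -gt $MAXFREEPCT ]] && chfs -a size=-${PPSIZE}M $mountpoint
  #
done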
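
The comment about size flapping is easier to see with the grow/shrink decision pulled out into a function of its own. This is a sketch, not a replacement for the script: resize_action and its arguments are names I made up, and sizes are whole megabytes. The shrink guard (only shrink if the filesystem would still be above the minimum afterwards) is an extra assumption that the script above does not make.

```shell
#!/bin/sh
# resize_action SIZE_MB FREE_MB PPSIZE_MB MINFREEPCT MAXFREEPCT
# Prints grow, shrink, or none. Hypothetical helper for illustration.
resize_action() {
    size=$1 free=$2 pp=$3 minpct=$4 maxpct=$5
    [ "$size" -gt 0 ] || { echo none; return; }
    freepct=$(( free * 100 / size ))
    if [ "$freepct" -lt "$minpct" ]; then
        echo grow
    elif [ "$freepct" -gt "$maxpct" ] && [ "$size" -gt "$pp" ]; then
        # Percent free after giving back one PP. If that would fall
        # below the grow threshold, shrinking now just causes a flap.
        after=$(( (free - pp) * 100 / (size - pp) ))
        if [ "$after" -ge "$minpct" ]; then echo shrink; else echo none; fi
    else
        echo none
    fi
}

resize_action 1024 50 128 10 70    # 4% free
resize_action 1024 900 128 10 70   # 87% free, 86% after a shrink
resize_action 180 130 128 10 70    # 72% free, but only 3% after a shrink
resize_action 1024 512 128 10 70   # 50% free
```

The third call is the flapping case: without the guard, the filesystem would shrink, land under the minimum, and grow again a minute later.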


IBM Download Director is a beast

I’m sure this will all change in a week, but until then, here is a reference for how to uninstall Download Director, or forcibly reinstall it.

There was no support, and no google help, no IBM search help, etc. After all the usual things, I went to a system without an existing DD installation.

You can force-reinstall Download Director from here:
https://www-03.ibm.com/isc/esd/dswdown/dldirector/installation_en.html

You can manually run DD here, but I don’t know how to feed it packages:
https://www14.software.ibm.com/dldirector/IBMDownloadDirectorApp.jnlp

There is info on how to uninstall DD here:
https://www-03.ibm.com/isc/esd/dswdown/dldirector/uninstall_en.html

I’m sure these URLs will change in the next forced web redesign, but for now, this should help for people with broken DD installs.

Reinstall info is obscured in convoluted JavaScript, but here’s the uninstall information:

Windows
How to uninstall

  • Open a new cmd window, paste the following command and hit enter:
  • reg DELETE HKCU\Software\Classes\ibmddp /f && rmdir %HOMEPATH%\AppData\Local\IBM\DD /S /Q
  • You should see a “The operation completed successfully.” message.

How to verify if Download Director is installed

  • Open a new cmd window, paste the following command and hit enter:
  • (reg query HKCU\Software\Classes\ibmddp 1> NUL 2>&1 && IF EXIST %HOMEPATH%\AppData\Local\IBM\DD\DownloadDirectorLauncher.exe (echo DD Installed) else (echo DD not installed)) || echo DD not installed
  • You should see either “DD installed” or “DD not installed”.

Linux
How to uninstall

  • Open a new terminal window, paste the following command and hit enter:
  • xdg-mime uninstall ~/.local/share/applications/ibm-downloaddirector.desktop && rm -rf ~/.local/share/applications/ibm-downloaddirector.desktop ~/.config/download-director/
  • If no errors are displayed, the operation completed successfully.

How to verify if Download Director is installed

  • Open a new terminal window, paste the following command and hit enter:
  • [[ -f ~/.local/share/applications/ibm-downloaddirector.desktop || -f ~/.config/download-director/DownloadDirectorLauncher.sh ]] && echo "DD installed" || echo "DD not installed"
  • You should see either “DD installed” or “DD not installed”.

Mac
How to uninstall

  • Open the “Terminal” app, paste the following command and hit enter:
  • rm -rf ~/Applications/DownloadDirectorLauncher.app/
  • If no errors are displayed, the operation completed successfully.

How to verify if Download Director is installed

  • Open the “Terminal” app, paste the following command and hit enter:
  • [[ -d ~/Applications/DownloadDirectorLauncher.app/ ]] && echo "DD installed" || echo "DD not installed"
  • You should see either “DD installed” or “DD not installed”.
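
The Linux and Mac checks above can be folded into one function. This is a sketch, assuming the paths quoted above are still what the installer uses. dd_installed is my own name, and the home directory is passed in so it can be pointed at a scratch directory.

```shell
#!/bin/sh
# dd_installed HOME_DIR: succeed if any Download Director artifact
# listed above exists under HOME_DIR (Linux or Mac layout).
dd_installed() {
    h=$1
    [ -f "$h/.local/share/applications/ibm-downloaddirector.desktop" ] ||
    [ -f "$h/.config/download-director/DownloadDirectorLauncher.sh" ] ||
    [ -d "$h/Applications/DownloadDirectorLauncher.app" ]
}

if dd_installed "$HOME"; then
    echo "DD installed"
else
    echo "DD not installed"
fi
```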

Why I wrote this up:
I find myself stuck with IBM due to the value of legacy skills vs transitioning to newer skills.
Periodically, IBM makes changes to their webpage, or code download system.
Often, these leave things inconsistent (claims that HTTP can be used, but it’s no longer available).
Worse, forced tools will stop working, and the IBM solution is to wipe your entire browser config and start over.

IBM has decided it’s better to force people to use Download Director instead of any standard protocol.
IBM’s mantra is “It worked for me in the lab, so if it doesn’t work for you, tough patooties.”
There is no escalation to people who make decisions. This has been an ongoing issue for a decade.
No one cares, except a few of the ubertechs supporting things, but they have no sway.

I’ve been using HTTP for a while, but they pulled that, so I had to use DD.
This time, DD gave me an error that JavaWS could not be started.
So I uninstalled all Java, reinstalled the newest, and DD said I had no Java installed.

There were no google hits to help, no IBM pages to help, and IBM search is useless as always.
Of the pages I found, none of them had contact forms, because that costs money.
There is no uninstall tool for Download Director.
There is no Browser Extension, no OS uninstall tool.
Removing the AppData folder does not help.

I went to a clean system, and wrote down all that I could find during a new code download attempt.
There is actually a webpage for this, but it is not indexed anywhere. That’s linked above.
That’s what this post is about.

Note that this is not acceptable in any way, and is one of the many reasons people are leaving IBM for open standards.
It’s not about “The Cloud”. It’s about IBM having so many layers between the decision-makers and the workers that they are out of touch. They have no idea how to be a tech business anymore, and are run by people who are content to gut the reputation of IBM so as to report a short-term improvement in gross profit. Zero interest in the long term.


HOWTO: AIX support for R/W filesystem on USBMS

JFS2 Unsupported
Putting JFS2 on non-LVM block devices has been working for a long time. I wrote up how to put JFS2 on a ramdisk back at AIX 4.3.3. I lost the techdoc from back then, but IBM has a newer re-write dated 2008 here: http://www-01.ibm.com/support/docview.wss?uid=isg3T1010722

JFS2 requires the underlying system to tell it if something goes away, or for it to stay there as long as the filesystem is mounted. LVM does this for disk, and the ramdisk drivers do this as well (mostly because if the ramdisk fails, likely the system has failed). The key there is that with JFS2, the ramdisk pages are pinned.

In January of 2010, I wrote up “HOWTO: JFS2 on USB device on AIX 5.3.11.1”, including performance on USB1 and USB2 ports. Everything seems fine and dandy, even mount on boot, except it’s not supported by AIX Development.

JFS2 Problems
The problem for USB Mass Storage Devices is that the device can just go away unexpectedly. If a disk goes into deep sleep, or resets because of a loose connection, the JFS drivers do not get notified. So, they take writes, and JFS2 saves them up until it’s time to flush. It goes to flush, and the I/O channel is gone. Sometimes, this is just loss of everything in cache. Sometimes, config methods hang until reboot. Other times, the system crashes.

Because of that, we still cannot put LVM on a USB Mass Storage Device. This would take changes to notification of device availability, perhaps changes to the sync daemon, etc. Who knows, but there’s not been enough push from paying customers to make it a priority for AIX Development. Until that happens, don’t expect formal support for JFS2 on these devices.

UDF is the solution
AIX development supports read/write and even booting from USB Mass Storage Devices, but only with UDFS. The purpose is for writing a mksysb (system boot) image, or tar/cpio files, etc, and exists because of the RDX USB Internal Dock sold with some systems.
https://www.ibm.com/support/knowledgecenter/en/ssw_aix_61/com.ibm.aix.files/usbms_fileref.htm

Boot support is provided as well: REF: http://www-01.ibm.com/support/docview.wss?uid=isg1IZ66737

More info on RDX USB Internal Dock. https://www.ibm.com/support/knowledgecenter/POWER7/p7hdt/fc1103.htm

However, RDX is just a hot-swap USB to SATA drive bay. Any current USB drive should work fine (USB3 is preferred due to performance).

HOWTO: Create, Read, and Write UDF on AIX

Create a bootable filesystem
mksysb -eXpi /dev/usbms0

Create an empty filesystem
udfcreate -d /dev/usbms0

Create a UDF 2.01 filesystem
udfcreate -f3 -d/dev/usbms0

NOTE: UDF 2.01 supports a real-time filesystem. It’s still UDF, so don’t try to put a database, or a million files on there.

Access read/write
mount -vudfs /dev/usbms0 /USBDRIVE

NOTE: The mksysb is a SPOT, plus a mksysb image, so adding files to the UDF will not make the restore huge.
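
As a sketch of the read/write workflow above, the following wraps the same commands in a dry-run script. run, DRYRUN, and the file being copied are placeholders I added. The device and mount point are the ones used above. Unmounting before the drive is pulled is ordinary caution, given that these devices can go away without notice.

```shell
#!/bin/sh
# Sketch: format a USB mass storage device as UDF, copy a file, unmount.
# DRYRUN=1 (the default) only prints the commands.
DRYRUN=${DRYRUN:-1}

run() {
    if [ "$DRYRUN" = 1 ]; then echo "+ $*"; else "$@"; fi
}

# udf_copy DEVICE MOUNTPOINT FILE  (hypothetical helper)
udf_copy() {
    dev=$1 mnt=$2 file=$3
    run udfcreate -d "$dev"           # empty UDF filesystem, wipes the device
    run mkdir -p "$mnt"
    run mount -v udfs "$dev" "$mnt"   # read/write mount
    run cp "$file" "$mnt"/
    run umount "$mnt"                 # unmount before the device is pulled
}

udf_copy /dev/usbms0 /USBDRIVE /tmp/example.tar
```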

USB Adapters on AIX
Add-in USB3 XHCI adapter for POWER8 is:
* CCIN 58F9 – PCIE2 4-port USB3 adapter
* FC EC45 and FRU 00E2932 for Low Profile
* FC EC46 and FRU 00E2934 for full height.
* driver is 4c1041821410b204 internal or 4c10418214109e04 PCIe

Add-in USB2 EHCI adapter for POWER7 is:
* CCIN 57D1 – PCI-E 4-port USB2 adapter
* driver is 33103500 integrated or 3310e000 PCIe
* FC 2728 or FRU 46K7394

Add-in USB2 EHCI adapter for POWER6/POWER5 is:
* CCIN 28EF – PCI 2-port USB2 adapter
* FC 2738 or FRU 80P2994
* Belkin F5U219 – exact same card without the sticker.
* driver is 99172604 internal or 99172704 PCI

Original USB1 OHCI /UHCI adapter for POWER5 and earlier was
* driver 22106474 on blades or c1110358 PCI
* This device is not really available anymore.


AIX and PowerHA levels

Research shows these dates for AIX:
https://www.ibm.com/support/pages/aix-support-lifecycle-information
It’s generally 26 weeks, plus or minus, from the initial YYWW date. Once a TLSP’s APARs release, the YYWW code is updated.

  • 7300-01-01-2246 2022-12-02 (Next 2023Q2)
  • 7200-05-05-2246 2022-12-02 (Next 2023Q2)
  • 7100-05-10-2220 2022-09-09 (Next 2023Q1)
  • 6100-09-12-1846 2018-11-16 (EoL CSP)
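
Each level string above packs four fields: release, technology level, service pack, and the YYWW build date. Here is a small sketch for splitting one apart, assuming the four-field form that oslevel -s prints. parse_oslevel is a name I made up.

```shell
#!/bin/sh
# parse_oslevel LEVEL: print "AIX V.R.TL.SP, built 20YY week WW".
# Assumes the VRMF-TL-SP-YYWW form, e.g. 7200-05-05-2246.
parse_oslevel() {
    old_ifs=$IFS
    IFS=-
    set -- $1                      # split the level on dashes
    IFS=$old_ifs
    base=$1 tl=$2 sp=$3 yyww=$4
    v=${base%???}                  # 7200 -> 7
    r=${base#?}; r=${r%??}         # 7200 -> 2
    yy=${yyww%??}; ww=${yyww#??}   # 2246 -> 22 and 46
    # ${x#0} strips one leading zero, so 05 prints as 5
    echo "AIX $v.$r.${tl#0}.${sp#0}, built 20$yy week $ww"
}

parse_oslevel 7200-05-05-2246
parse_oslevel 7100-05-10-2220
```

The two calls print "AIX 7.2.5.5, built 2022 week 46" and "AIX 7.1.5.10, built 2022 week 20".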

My AIX selection process would be:

  • AIX 7.3.1.1 from 2022 week 46 is what I have in my repo.  Another TL should be coming out 2023 Q1.  None of my customers run this, but you want this for POWER10.  NIM should be latest of all versions as well.
  • AIX 7.2.5.5 from 2022 week 46 is what I have in my repo.  Another TL should be coming out 2023 Q1.  You probably want this for POWER7 and up.
  • AIX 7.1.5.10 from 2022 week 20 is what I have in my repo.  I think the CSP is 2023Q1.  Supports AIX 5.2 and 5.3 WPARs.  Not much other reason to use this now other than some specific apps that are OK with OpenSSL, OpenSSH, and Java updates, but not kernel updates.
  • AIX 6.1.9.12 is where I stopped tracking.  No real need for AIX 6 anywhere.  Either you’re stuck on 5.3, or you came up to 7.1 (or ideally 7.2).  6.1.9.9 was needed for application compatibility on POWER9.
  • Anything POWER6 or older should really be upgraded, with p710 to p740 or s81x/s82x as replacements (cost).  POWER8 is EoS 2024-10-31.  POWER7 is EoS 2019.
  • AIX 5.3.12.9 + U866665 on POWER8 is end-stage.  AIX 5.3 was EoS in 2012, but some people still run it now.  Power8 is EoS 2024-10-31.  Power7 was 2019.
  • AIX 5.3 PTF U866665.bff (bos.mp64.5.3.12.10.U) enables POWER8. AIX must be 5.3.12.9. Must be patched before moving to p8. p8 must be 840 firmware or later. VIO must be 2.2.4.10 or later.  Migration is by LPM, NIM, or mksysb.  Requires an active extended support agreement for AIX on p8 systems on file to download.
  • AIX older – You should not be running anything older.  AIX 5.1 was all CHRP, and AIX 4.x was all PCI.  AIX 3.x was MicroChannel.  AIX 2.1 had some PS/2 systems.  Outside of a museum, on an isolated network (or no network because CYLONS!), just have this recycled.

My PowerHA (HA/CMP) selection process would be:
https://www.ibm.com/support/pages/powerha-aix-version-compatibility-matrix 

  • 7.2.7 Base is what I last grabbed.  OK for AIX 7.1.5, 7.2.5, 7.3.0 and later patch levels.
  • 7.1.3 SP09 was the end-stage for this.  OK for AIX 6.1.9.11, 7.1.3.9, 7.2.0.6, and later patch levels.  No AIX 5, and no AIX 7.3.
  • 6.1 SP15 was end-stage for this, and supported AIX 5.3.9, 6.1.2.1, 7.1.0, and later patch levels.  No AIX 5.1, 7.2, or 7.3.

Code sources:

UPDATE 2023-03-07:

  • Refreshed all of the info above to current.  If you’re on AIX 5.2, HA/CMP 5.x, or VIO 1.x, that’s really disappointing.
  • VIO should be 3.1.4.10.  Always go current whenever possible.  If not, 2.2.6.65 is your target.  If you have any VIO 1.x, upgrade.  Period.
  • System firmware, adapter microcode, disk microcode, tape microcode, and library firmware should all be latest available.
  • Storage array firmware should be latest LTS patch level, excluding any .0 versions.  I still don’t trust Data Reduction Pools.
  • ADSM/ITSM/TSM/Spectrum Protect/Storage Protect – These should typically be the final version supporting your OS/App combo.  I’m sweet on 8.1.17, though there are still major issues with deduplication pools when your database is over 3TB.  Containers and extents cannot be purged.  I have not seen SP 9.1, but I assume it will be extremely similar to SP 8.1.18 other than some minor rebadging.  Not sure.  I got dropped from the beta program because I didn’t have cycles to test their new code, and they didn’t have any interest in steering features.  They pick what they want to pick, and you’ll like it. 
  • IBM is phasing out Spectrum Protect Plus (Catalogic DPX), but might still be keeping Catalogic ECX (Copy Data Management).  In Storage Defender, IBM has picked up Cohesity DataProtect because of the cloud / DRaaS bits.  These all integrate with DS8000/Flash Systems for data immutability / vaulting / ransomware protection, and they want you to buy Rapid7 for the AI/Logic behind it.  I know regular ISP’s operations center anomaly detection is unusable due to its lack of adaptability/logic.  It just says everything is alerted every week when you run weekly fulls, etc.

I don’t really track IBMi OS (OS400), zOS, zVM, etc.  Storage should be rebranding this year though, but still no NFS/CIFS hardware.  At best, IBM sells a GPFS cluster with Ceph and with some StorWize FS7200s.


AIX and PowerHA versions 2017-06

This changes periodically, but for today, here is what I would do.

My PowerHA selection process would be:
* 7.1.3 SP06 if I needed to deploy quickly, because I have build docs for that.
* 7.1.4 doesn’t exist, but if it came out before deployment, I would consider it. Whichever was a newer release, latest 7.1.3 SP, or latest 7.1.4 SP.
* 7.2.0 SP03 if they wanted longer support, but had time for me to work up the new procedures during the install.
* 7.2.1 SP01 if SP01 came out before I deployed, and had chosen 7.2.0 prior. 7.2.1.0 base is available, but that’s from Dec 2016, and 7.2.0.3 is from May 2017. Newer by date is better.

My AIX selection process would be:
* Any NIM server would be AIX 7.2, latest TLSP.
* Any application support limits would win down to AIX 6.1, plus latest TLSP.
* For POWER9, I would push 7.2, latest TLSP.
* For POWER8, I would push 7.1 or later. — latest TLSP
* For POWER7, I would push 6.1 or later. — latest TLSP
* For POWER6 or older, or AIX 5.3 or older, I would push strongly against due to support and parts limitations.

Code sources:
* I would make sure to install yum from ezinstall, and deploy GNU tar and rsync:
http://public.dhe.ibm.com/aix/freeSoftware/aixtoolbox/ezinstall/ppc/
* I would update openssh from the IBM Web Download expansion:
https://www-01.ibm.com/marketing/iwm/iwm/web/reg/pick.do?source=aixbp&lang=en_US
* If any exposure to the public net, or a high-sensitivity system, I would check AIX security patches also.
http://public.dhe.ibm.com/aix/efixes/security/?C=M;O=D
ftp://ftp.software.ibm.com/aix/efixes/security/
* I would get the latest service pack for both AIX and PowerHA from Fix Central:
https://www-945.ibm.com/support/fixcentral/
* Base media, if I were certain the customer was entitled, but didn’t want to wait for them to provide media, Partnerworld SWAC:
https://www-304.ibm.com/partnerworld/partnertools/eorderweb/ordersw.do

Reference: PowerHA to AIX Support Matrix:
http://www-03.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/TD101347


PowerHA holds my disks

I did some testing and needed to document command syntaxen, even though I was not successful.
node01 / node02 – cannot remove EMC disks
apps are stopped

The fuser command will not detect processes that have mmap regions where that associated file descriptor has since been closed.

lsof | grep hdisk   ### nothing
fuser -fx /dev/hdisk2 ### nothing
fuser -d /dev/hdisk2 ### nothing
sudo filemon -O all -o 2.trc ; sleep 10 ; sudo trcstop   ### only shows hottest 2 disks

### Cannot remove disks after removing them from HA; this is related to this defect.
http://www-01.ibm.com/support/docview.wss?uid=isg1IV65140
/usr/es/sbin/cluster/events/utils/cl_vg_fence_term -c vgname

In PowerHA 7.1.3, with the shared VG varied off, and the
disk in closed state, rmdev may fail and return a
busy error, eg:

# rmdev -dl hdisk2
Method error (/usr/lib/methods/ucfgdevice):
0514-062 Cannot perform the requested function because
         the specified device is busy.
.

# cl_set_vg_fence_height
Usage: cl_set_vg_fence_height [-c]  [rw|ro|na|ff]

JDSD NOTE: The levels are:
* rw = readwrite
* ro = read only
* na = no access
* ff = fail access

jdsd@node01  /home/jdsd
$ sudo ls -laF /usr/es/sbin/cluster/events/utils/cl*fence*
-rwxr--r--    1 root     system        12832 Nov  7 2013  /usr/es/sbin/cluster/events/utils/cl_fence_vg*
-rwxr--r--    1 root     system        15624 Nov  7 2013  /usr/es/sbin/cluster/events/utils/cl_set_vg_fence_height*
-r-x------    1 root     system         5739 Nov  7 2013  /usr/es/sbin/cluster/events/utils/cl_ssa_fence*
-rwxr--r--    1 root     system        22508 Nov  7 2013  /usr/es/sbin/cluster/events/utils/cl_vg_fence_init*
-rwxr--r--    1 root     system         4035 Feb 26 2015  /usr/es/sbin/cluster/events/utils/cl_vg_fence_redo*
-rwxr--r--    1 root     system        15179 Oct 21 2014  /usr/es/sbin/cluster/events/utils/cl_vg_fence_term*


jdsd@node01  /home/jdsd
$ sudo ls -laF /usr/es/sbin/cluster/cspoc/cl*disk*
-r-x------    1 root     system       109726 Feb 26 2015  /usr/es/sbin/cluster/cspoc/cl_diskreplace*
-rwxr-xr-x    1 root     system        20669 Nov  7 2013  /usr/es/sbin/cluster/cspoc/cl_getdisk*
-r-x------    1 root     system       105962 Feb 26 2015  /usr/es/sbin/cluster/cspoc/cl_lsreplacementdisks*
-r-x------    1 root     system       103433 Feb 26 2015  /usr/es/sbin/cluster/cspoc/cl_lsrgvgdisks*
-rwxr-xr-x    1 root     system        12259 Feb 26 2015  /usr/es/sbin/cluster/cspoc/cl_pviddisklist*
-rwxr-xr-x    1 root     system         4929 Nov  7 2013  /usr/es/sbin/cluster/cspoc/cl_vg_non_dhb_disks*


jdsd@node01  /home/jdsd
$ sudo /usr/es/sbin/cluster/cspoc/cl_lsrgvgdisks
#Volume Group   hdisk    PVID             Cluster Node
#---------------------------------------------------------------------
caavg_private   hdisk38  00deadbeefcaff53 node01                        node01,node02 
datavg          hdisk22  00deadbeefca8643 node02                        node01,node02 demo_rg
datavg          hdisk23  00deadbeefca86f9 node02                        node01,node02 demo_rg
datavg          hdisk24  00deadbeefca8752 node02                        node01,node02 demo_rg
datavg          hdisk25  00deadbeefca87ac node02                        node01,node02 demo_rg
datavg          hdisk26  00deadbeefca880e node02                        node01,node02 demo_rg
datavg          hdisk27  00deadbeefca886c node02                        node01,node02 demo_rg
datavg          hdisk28  00deadbeefca88d7 node02                        node01,node02 demo_rg
datavg          hdisk29  00deadbeefca8965 node02                        node01,node02 demo_rg
datavg          hdisk30  00deadbeefca89c5 node02                        node01,node02 demo_rg
datavg          hdisk31  00deadbeefca8a52 node02                        node01,node02 demo_rg
datavg          hdisk32  00deadbeefca8ad2 node02                        node01,node02 demo_rg
datavg          hdisk33  00deadbeefca8b50 node02                        node01,node02 demo_rg
datavg          hdisk34  00deadbeefca8c26 node02                        node01,node02 demo_rg
datavg          hdisk35  00deadbeefca8c9a node02                        node01,node02 demo_rg
datavg          hdisk36  00deadbeefca8cf7 node02                        node01,node02 demo_rg
journalvg       hdisk37  00deadbeefca8d53 node02                        node01,node02 demo_rg


jdsd@node01  /home/jdsd
$ sudo /usr/es/sbin/cluster/cspoc/cl_getdisk hdisk2
Disk name:                      hdisk2
Disk UUID:                      1edeadbeefcafe04 b512d9e3b580fb13
Fence Group UUID:               0000000000000000 0000000000000000 - Not in a Fence Group
Disk device major/minor number: 18, 2
Fence height:                   2 (Read/Only)
Reserve mode:                   0 (No Reserve)
Disk Type:                      0x01 (Local access only)
Disk State:                     32785
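
Since the fence height is the field that matters in the tests below, here is a small sketch for pulling it out of cl_getdisk output. It assumes the output format captured above. fence_height is a name I made up.

```shell
#!/bin/sh
# fence_height: read cl_getdisk output on stdin and print the text
# after "Fence height:". Prints nothing if the line is missing.
fence_height() {
    awk '/^Fence height:/ { sub(/^Fence height:[ \t]*/, ""); print; exit }'
}

# Example using lines from the output captured above.
fence_height <<'EOF'
Disk name:                      hdisk2
Fence height:                   2 (Read/Only)
Reserve mode:                   0 (No Reserve)
EOF
```

On a cluster node the input would come from the real command, for example: sudo /usr/es/sbin/cluster/cspoc/cl_getdisk hdisk2 | fence_height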

Concurrent vg, so updating on node2 shows up on node1.

From node 2

sudo extendvg journalvg hdisk2 hdisk3 hdisk4 hdisk5 hdisk6 hdisk7 hdisk8 hdisk9 hdisk10 hdisk11 hdisk12
sudo /usr/es/sbin/cluster/cspoc/cl_getdisk hdisk2
sudo /usr/es/sbin/cluster/cspoc/cl_getdisk hdisk37
# Shows RW

From node 1

sudo /usr/es/sbin/cluster/cspoc/cl_getdisk hdisk2
sudo /usr/es/sbin/cluster/cspoc/cl_getdisk hdisk37
# Shows RW

From node1

sudo /usr/es/sbin/cluster/events/utils/cl_set_vg_fence_height -c journalvg rw
sudo /usr/es/sbin/cluster/cspoc/cl_getdisk hdisk2
# Shows RW

From node2

sudo reducevg journalvg hdisk2 hdisk3 hdisk4 hdisk5 hdisk6 hdisk7 hdisk8 hdisk9 hdisk10 hdisk11 hdisk12
sudo /usr/es/sbin/cluster/cspoc/cl_getdisk hdisk2
# Shows RO

### OK, try again
From node 1

sudo mkvg -y dummyvg hdisk2 hdisk3 hdisk4 hdisk5 hdisk6 hdisk7 hdisk8 hdisk9 hdisk10 hdisk11 hdisk12
sudo varyoffvg dummyvg

From node 2

sudo importvg  -y dummyvg hdisk2
sudo /usr/es/sbin/cluster/events/utils/cl_set_vg_fence_height -c dummyvg rw
sudo /usr/es/sbin/cluster/cspoc/cl_getdisk hdisk2
### Still RO
sudo /usr/es/sbin/cluster/events/utils/cl_vg_fence_term -c dummyvg
sudo /usr/es/sbin/cluster/cspoc/cl_getdisk hdisk2
### Still RO
sudo varyoffvg dummyvg
sudo rmdev -Rl hdisk2

Both nodes

sudo exportvg dummyvg
sudo importvg -c -y dummyvg hdisk2
sudo /usr/es/sbin/cluster/cspoc/cl_getdisk hdisk2
### Still RO
sudo /usr/es/sbin/cluster/events/utils/cl_set_vg_fence_height -c dummyvg rw
sudo /usr/es/sbin/cluster/events/utils/cl_vg_fence_init -c dummyvg rw hdisk2
cl_vg_fence_init[279]: sfwAddFenceGroup(dummyvg, 1, hdisk2): No such device
sudo chvg -c dummyvg
sudo varyonvg -n -c -A -O dummyvg
sudo /usr/es/sbin/cluster/cspoc/cl_getdisk hdisk2
sudo /usr/es/sbin/cluster/cspoc/cl_getdisk hdisk3
### Still RO
sudo varyoffvg dummyvg

From node 2
sudo rmdev -Rl hdisk2
Method error (/etc/methods/ucfgdevice):
        0514-062 Cannot perform the requested function because the
                 specified device is busy.

sudo /usr/es/sbin/cluster/events/utils/cl_vg_fence_redo -c dummyvg rw hdisk2
 /usr/es/sbin/cluster/events/utils/cl_vg_fence_redo: line 109: cl_vg_fence_init: not found
 cl_vg_fence_redo: Volume group dummyvg fence height could not be set to read/write

This is related to the following defect, but on a later version:
http://www-01.ibm.com/support/docview.wss?uid=isg1IV52444

sudo su -
export PATH=$PATH:/usr/es/sbin/cluster/utilities:/usr/es/sbin/cluster/events/utils/:/usr/es/sbin/cluster/cspoc/:/usr/es/sbin/cluster/sbin:/usr/es/sbin/cluster
/usr/es/sbin/cluster/events/utils/cl_vg_fence_redo -c dummyvg rw hdisk2
 cl_vg_fence_init[279]: sfwAddFenceGroup(dummyvg, 11, hdisk2, hdisk3, hdisk4, hdisk5, hdisk6, hdisk7, hdisk8, hdisk9, hdisk10, hdisk11, hdisk12): No such device
 cl_vg_fence_redo: Volume group dummyvg fence height could not be set to read/write#
cd /dev
/usr/es/sbin/cluster/events/utils/cl_vg_fence_redo -c dummyvg rw hdisk2
 cl_vg_fence_init[279]: sfwAddFenceGroup(dummyvg, 11, hdisk2, hdisk3, hdisk4, hdisk5, hdisk6, hdisk7, hdisk8, hdisk9, hdisk10, hdisk11, hdisk12): No such device
 cl_vg_fence_redo: Volume group dummyvg fence height could not be set to read/write#

SIGH!

I give up. We will probably have to reboot.


PPC64 Linux on Intel

QEMU on Windows will run ppc64 and ppc64le emulation.
It emulates the same machine that PowerKVM on an S812L would provide.
It’s kind of slow because there is no KVM module, AND Intel vs PPC,
AND emulator mode is single-core/proc/thread.

You can get Windows installer here:
https://qemu.weilnetz.de/

You really want ANSI/VT100 escape codes in your “cmd.exe” as well:
https://github.com/adoxa/ansicon

To build a blank disk:
qemu-img create -f qcow2 qemu-disk-ppc64.img 32G

You can boot with this:
set SDL_STDIO_REDIRECT=NO
qemu-system-ppc64 -M type=pseries -m 1G,slots=4,maxmem=8G
-cpu POWER8E -smp 1 -vga none -nographic
-netdev user,id=net0 -device spapr-vlan,netdev=net0
-device spapr-vscsi -device scsi-hd,drive=drive0
-drive id=drive0,if=none,file=qemu-disk-ppc64.img
-cdrom D:\Downloads\debian-testing-ppc64el-DVD-1.iso

The QEMU part is all one line. The cdrom image is up to you. I like Debian.
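
If you run QEMU from a POSIX shell instead of cmd.exe (a Linux host, or something like MSYS2 on Windows; that is an assumption, the notes above are for cmd.exe), the same command can be split over lines with backslashes. A rough sketch, where the function name run_qemu_ppc64 and the QEMU override variable are made up for this example and the options are copied from the command above:

```shell
# Sketch: the same QEMU invocation wrapped in a POSIX shell function.
# QEMU is a hypothetical override for the binary name (QEMU=echo gives a dry run).
run_qemu_ppc64() {
    disk=$1   # qcow2 image made with qemu-img create
    iso=$2    # installer ISO of your choice
    "${QEMU:-qemu-system-ppc64}" \
        -M type=pseries -m 1G,slots=4,maxmem=8G \
        -cpu POWER8E -smp 1 -vga none -nographic \
        -netdev user,id=net0 -device spapr-vlan,netdev=net0 \
        -device spapr-vscsi -device scsi-hd,drive=drive0 \
        -drive "id=drive0,if=none,file=$disk" \
        -cdrom "$iso"
}
```

Usage would be something like: run_qemu_ppc64 qemu-disk-ppc64.img debian-testing-ppc64el-DVD-1.iso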

Other Notes:
If you have issues with the cursor keys, use ctrl-i for TAB, and ctrl-n and ctrl-p for next/previous.

Emulation mode is flaky with more than one core.

There is a QEMU AIX build on PERZL.ORG which would be faster, especially for ppc64 BigEndian.

PowerKVM is just PPC Linux, QEMU, KVM, and LIBVIRT. KVM is just a kernel module for speed-up. LIBVIRT is just a GUI and CLI tool to build VM definitions. QEMU is the emulator. It works best on POWER8, with the hypervisor disabled (OPAL mode).

QEMU still does not have enough RTAS and NVRAM to boot AIX. AIX hangs during “Starting AIX”, and Diags just says it’s an unsupported machine type. There is a little bit of dev for this, but not much.