IBM Download Director is a beast

I’m sure this will all change in a week, but until then, here is a reference for how to uninstall Download Director, or forcibly reinstall it.

There was no support, no Google help, no IBM search help, etc. After all the usual things, I went to a system without an existing DD installation.

You can force-reinstall Download Director from here:
https://www-03.ibm.com/isc/esd/dswdown/dldirector/installation_en.html

You can manually run DD here, but I don’t know how to feed it packages:
https://www14.software.ibm.com/dldirector/IBMDownloadDirectorApp.jnlp

There is info on how to uninstall DD here:
https://www-03.ibm.com/isc/esd/dswdown/dldirector/uninstall_en.html

I’m sure these URLs will change in the next forced web redesign, but for now, this should help people with broken DD installs.

Reinstall info is obscured in convoluted JavaScript, but here’s the uninstall information:

Windows
How to uninstall

  • Open a new cmd window, paste the following command and hit enter:
  • reg DELETE HKCU\Software\Classes\ibmddp /f && rmdir %HOMEPATH%\AppData\Local\IBM\DD /S /Q
  • You should see a “The operation completed successfully.” message.

How to verify if Download Director is installed

  • Open a new cmd window, paste the following command and hit enter:
  • (reg query HKCU\Software\Classes\ibmddp 1> NUL 2>&1 && IF EXIST %HOMEPATH%\AppData\Local\IBM\DD\DownloadDirectorLauncher.exe (echo DD Installed) else (echo DD not installed)) || echo DD not installed
  • You should see either “DD Installed” or “DD not installed”.

Linux
How to uninstall

  • Open a new terminal window, paste the following command and hit enter:
  • xdg-mime uninstall ~/.local/share/applications/ibm-downloaddirector.desktop && rm -rf ~/.local/share/applications/ibm-downloaddirector.desktop ~/.config/download-director/
  • If no errors are displayed, the operation completed successfully.

How to verify if Download Director is installed

  • Open a new terminal window, paste the following command and hit enter:
  • [[ -f ~/.local/share/applications/ibm-downloaddirector.desktop || -f ~/.config/download-director/DownloadDirectorLauncher.sh ]] && echo "DD installed" || echo "DD not installed"
  • You should see either “DD installed” or “DD not installed”.

Mac
How to uninstall

  • Open the “Terminal” app, paste the following command and hit enter:
  • rm -rf ~/Applications/DownloadDirectorLauncher.app/
  • If no errors are displayed, the operation completed successfully.

How to verify if Download Director is installed

  • Open the “Terminal” app, paste the following command and hit enter:
  • [[ -d ~/Applications/DownloadDirectorLauncher.app/ ]] && echo "DD installed" || echo "DD not installed"
  • You should see either “DD installed” or “DD not installed”.
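
The Linux and Mac checks above can be rolled into one function. This is only a sketch: dd_status is a name I made up, not an IBM tool, and the optional home-directory argument exists only so it can be pointed at a test directory.

```shell
# Hypothetical helper: dd_status is not an IBM tool, just a wrapper around
# the file checks listed above for Linux and Mac.
dd_status() {
    home_dir="${1:-$HOME}"    # optional argument: home directory to inspect
    if [ -f "$home_dir/.local/share/applications/ibm-downloaddirector.desktop" ] ||
       [ -f "$home_dir/.config/download-director/DownloadDirectorLauncher.sh" ] ||
       [ -d "$home_dir/Applications/DownloadDirectorLauncher.app" ]; then
        echo "DD installed"
    else
        echo "DD not installed"
    fi
}

dd_status
```

It only looks at files, so it is safe to run before and after an uninstall to confirm the cleanup worked.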

Why I wrote this up:
I find myself stuck with IBM due to the value of legacy skills vs transitioning to newer skills.
Periodically, IBM makes changes to their webpage, or code download system.
Often, these leave things inconsistent (claims that HTTP can be used, but it’s no longer available).
Worse, forced tools will stop working, and the IBM solution is to wipe your entire browser config and start over.

IBM has decided it’s better to force people to use Download Director instead of any standard protocol.
IBM’s mantra is “It worked for me in the lab, so if it doesn’t work for you, tough patooties.”
There is no escalation to people who make decisions. This has been an ongoing issue for a decade.
No one cares, except a few of the ubertechs supporting things, but they have no sway.

I’ve been using HTTP for a while, but they pulled that, so I had to use DD.
This time, DD gave me an error that JavaWS could not be started.
So I uninstalled all Java, reinstalled the newest, and DD said I had no Java installed.

There were no google hits to help, no IBM pages to help, and IBM search is useless as always.
Of the pages I found, none of them had contact forms, because that costs money.
There is no uninstall tool for Download Director.
There is no Browser Extension, no OS uninstall tool.
Removing the AppData folder does not help.

I went to a clean system, and wrote down all that I could find during a new code download attempt.
There is actually a webpage for this, but it is not indexed anywhere. That’s linked above.
That’s what this post is about.

Note that this is not acceptable in any way, and is one of the many reasons people are leaving IBM for open standards.
It’s not about “The Cloud”. It’s about IBM having so many layers between the decision-makers and the workers that they are out of touch. They have no idea how to be a tech business anymore, and are run by people who are content to gut the reputation of IBM so as to report a short-term improvement in gross profit. Zero interest in the long term.


HOWTO: AIX support for R/W filesystem on USBMS

JFS2 Unsupported
Putting JFS2 on non-LVM block devices has been working for a long time. I​ wrote up how to put JFS2 on a ramdisk back at AIX 4.3.3. I lost the techdoc from back then, but IBM has a newer re-write dated 2008 here: http://www-01.ibm.com/support/docview.wss?uid=isg3T1010722

JFS2 requires the underlying system to tell it if something goes away, or for it to stay there as long as the filesystem is mounted. LVM does this for disk, and the ramdisk drivers do this as well (mostly because if the ramdisk fails, likely the system has failed). The key there is that with JFS2, the ramdisk pages are pinned.

I wrote this up, including performance on USB1 and USB2 ports, in January of 2010: HOWTO: JFS2 on USB device on AIX 5.3.11.1. Everything is fine and dandy, even mount on boot, except it’s not supported by AIX Development.

JFS2 Problems
The problem for USB Mass Storage Devices is that the device can just go away unexpectedly. If a disk goes into deep sleep, or resets because of a loose connection, the JFS2 drivers do not get notified. So they keep accepting writes, and JFS2 saves them up until it’s time to flush. It goes to flush, and the I/O channel is gone. Sometimes, this is just loss of everything in cache. If it’s an important file, then the system crashes.

Because of that, we still cannot put LVM on a USB Mass Storage Device. This would take changes to notification of device availability, perhaps changes to the sync daemon, etc. Who knows, but there’s not been enough push from paying customers to make it a priority for AIX Development. Until that happens, don’t expect formal support for JFS2 on these devices.

UDF is the solution
AIX development supports read/write and even booting from USB Mass Storage Devices, but only with UDFS. The feature is meant for writing a mksysb (system boot) image, tar/cpio files, and so on, and it exists because of the RDX USB Internal Dock sold with some systems.
https://www.ibm.com/support/knowledgecenter/en/ssw_aix_61/com.ibm.aix.files/usbms_fileref.htm

Boot support is provided as well: REF: http://www-01.ibm.com/support/docview.wss?uid=isg1IZ66737

More info on RDX USB Internal Dock. https://www.ibm.com/support/knowledgecenter/POWER7/p7hdt/fc1103.htm

RDX is just a hot-swap USB to SATA drive bay. Any current USB drive should work fine (USB3 is preferred due to performance).

HOWTO: Create, Read, and Write UDF on AIX

Create bootable filesystem

  mksysb -eXpi /dev/usbms0

Create empty filesystem

  udfcreate -d /dev/usbms0

Create UDF 2.01 filesystem

  udfcreate -f3 -d/dev/usbms0

NOTE: UDF 2.01 supports a real-time filesystem. It’s still UDF, so don’t try to put a database, or a million files on there.

Access read/write

  mount -vudfs /dev/usbms0 /USBDRIVE

NOTE: The mksysb is a SPOT, plus a mksysb image, so adding files to the UDF will not make the restore huge.

USB Adapters on AIX
Add-in USB3 XHCI adapter for POWER8 is:

  • CCIN 58F9 – PCIE2 4-port USB3 adapter
  • FC EC45 and FRU 00E2932 for Low Profile
  • FC EC46 and FRU 00E2934 for full height.
  • driver is 4c1041821410b204 internal or 4c10418214109e04 PCIe

Add-in USB2 EHCI adapter for POWER7 is:

  • CCIN 57D1 – PCI-E 4-port USB2 adapter
  • driver is 33103500 integrated or 3310e000 PCIe
  • FC 2728 or FRU 46K7394

Add-in USB2 EHCI adapter for POWER6/POWER5 is:

  • CCIN 28EF – PCI 2-port USB2 adapter
  • FC 2738 or FRU 80P2994
  • Belkin F5U219 – exact same card without the sticker.
  • driver is 99172604 internal or 99172704 PCI

Original USB1 OHCI/UHCI adapter for POWER5 and earlier was:

  • driver 22106474 on blades or c1110358 PCI
  • This device is not really available anymore.

AIX and PowerHA levels

Research shows these dates for AIX:

  • AIX 7.2.1.3 should come out around October, 2017 (est Week 46)
  http://www-01.ibm.com/support/docview.wss?uid=isg1IV95390   ### 7200-01-03-1720
  • AIX 7.1.4.5 should come out around October, 2017 (est Week 46)
  http://www-01.ibm.com/support/docview.wss?uid=isg1IV95393   ### 7100-04-05-1720
  • AIX 7.1.5.0 may come out around January, 2018 (est Week 5); however, it may be cancelled.
  http://www-01.ibm.com/support/docview.wss?uid=isg1IV86307   ### 7100-05-00-1731

It’s generally 26 weeks, plus or minus, from the initial YYWW date. Once a TLSP APAR is released, the YYWW code is updated.
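
The 26-week rule of thumb is easy to work out by hand, but here is a rough sketch of the arithmetic. The function name est_release is made up, and it assumes every year has 52 weeks, so treat the result as a ballpark.

```shell
# Hypothetical helper: estimate a release week by adding 26 weeks to the
# YYWW suffix of a level string such as 7200-01-03-1720.
est_release() {
    yyww="${1##*-}"              # 1720 from 7200-01-03-1720
    yy="${yyww%??}"              # first two digits: year
    ww="${yyww#??}"              # last two digits: week
    yy=$(( ${yy#0} ))            # drop a leading zero so 08 is not read as octal
    ww=$(( ${ww#0} + 26 ))
    if [ "$ww" -gt 52 ]; then    # assumption: 52-week years
        ww=$(( ww - 52 ))
        yy=$(( yy + 1 ))
    fi
    printf '20%02d week %d\n' "$yy" "$ww"
}

est_release 7200-01-03-1720    # prints: 2017 week 46
est_release 7100-05-00-1731    # prints: 2018 week 5
```

Those two results line up with the Week 46 and Week 5 estimates in the list above.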

My PowerHA selection process would be:

  • 7.1.3 SP06 if I needed to deploy quickly, because I have build docs for that. However, it may be withdrawn from marketing in 2018.
  • 7.2.0 SP03 if they wanted longer support, but had time for me to work up the new procedures during the install.
  • 7.2.1 SP01 when it comes out, but not 7.2.1 base.

My AIX selection process would be:

  • 7.2.1.2 for any NIM server or POWER9. Next updates should be Oct 2017.
  • 7.1.4.4 or later for customer preference. Next updates should be Oct 7.1.4.5 and Jan 7.1.5.0.
  • 6.1.9.9 Minimum level for application compatibility. This is the final TLSP.
  • For anything POWER6 or older, I push hard for p710 to p740 or s81x/s82x as replacements (cost).
  • For anything AIX 5.3 or older, I push hard for app testing on newer AIX (EoS).
    • PTF U866665.bff (bos.mp64.5.3.12.10.U) enables POWER8. AIX must be 5.3.12.9. Must be patched before p8 (install nim or mksysb). p8 must be 840 firmware. VIO must be 2.2.4.10 or later.
    • PTF U866665 requires an active extended support agreement AND p8 systems on file. No free access for biz partners.

Code sources:

  • rpm.rte and yum ezinstall, then deploy tar, wget, and rsync:
  http://public.dhe.ibm.com/aix/freeSoftware/aixtoolbox/ezinstall/ppc/
  • openssh from the IBM Web Download expansion:
  https://www-01.ibm.com/marketing/iwm/iwm/web/reg/pick.do?source=aixbp&lang=en_US
  • AIX security patches for any DMZ hosts
  http://public.dhe.ibm.com/aix/efixes/security/?C=M;O=D
  ftp://ftp.software.ibm.com/aix/efixes/security/
  • Base media, if I were certain the customer was entitled, but didn’t want to wait for them to provide media, Partnerworld SWAC:
   https://www-304.ibm.com/partnerworld/partnertools/eorderweb/ordersw.do
  • Latest service pack for AIX from Fix Central:
  https://www-945.ibm.com/support/fixcentral/
  https://www-945.ibm.com/support/fixcentral/aix/selectFixes?release=7.2&function=release
  https://www-945.ibm.com/support/fixcentral/aix/selectFixes?release=7.1&function=release
  • Latest service pack for PowerHA from Fix Central:
  https://www-945.ibm.com/support/fixcentral/swg/selectFixes?parent=Cluster%20software&product=ibm/Other+software/PowerHAClusterManager&release=7.2.0&platform=All&function=all
  https://www-945.ibm.com/support/fixcentral/swg/selectFixes?parent=Cluster%20software&product=ibm/Other+software/PowerHAClusterManager&release=7.1.3&platform=All&function=all

Reference: PowerHA to AIX Support Matrix:

   http://www-03.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/TD101347


AIX and PowerHA versions 2017-06

This changes periodically, but for today, here is what I would do.

My PowerHA selection process would be:

  • 7.1.3 SP06 if I needed to deploy quickly, because I have build docs for that.
  • 7.1.4 doesn’t exist, but if it came out before deployment, I would consider it. Whichever was a newer release, latest 7.1.3 SP, or latest 7.1.4 SP.
  • 7.2.0 SP03 if they wanted longer support, but had time for me to work up the new procedures during the install.
  • 7.2.1 SP01 if SP01 came out before I deployed, and had chosen 7.2.0 prior. 7.2.1.0 base is available, but that’s from Dec 2016, and 7.2.0.3 is from May 2017. Newer by date is better.

My AIX selection process would be:

  • Any NIM server would be AIX 7.2, latest TLSP.
  • Any application support limits would win down to AIX 6.1, plus latest TLSP.
  • For POWER9, I would push 7.2, latest TLSP.
  • For POWER8, I would push 7.1 or later. — latest TLSP
  • For POWER7, I would push 6.1 or later. — latest TLSP
  • For POWER6 or older, or AIX 5.3 or older, I would push strongly against due to support and parts limitations.

Code sources:

  • I would make sure to install yum from ezinstall, and deploy GNU tar and rsync:
  http://public.dhe.ibm.com/aix/freeSoftware/aixtoolbox/ezinstall/ppc/
  • I would update openssh from the IBM Web Download expansion:
  https://www-01.ibm.com/marketing/iwm/iwm/web/reg/pick.do?source=aixbp&lang=en_US
  • If any exposure to the public net, or a high-sensitivity system, I would check AIX security patches also.
  http://public.dhe.ibm.com/aix/efixes/security/?C=M;O=D
  ftp://ftp.software.ibm.com/aix/efixes/security/
  • I would get the latest service pack for both AIX and PowerHA from Fix Central:
  https://www-945.ibm.com/support/fixcentral/
  • Base media, if I were certain the customer was entitled, but didn’t want to wait for them to provide media, Partnerworld SWAC:
   https://www-304.ibm.com/partnerworld/partnertools/eorderweb/ordersw.do

Reference: PowerHA to AIX Support Matrix:

   http://www-03.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/TD101347

Bad Subnet Kills DHCPD

One, single bad IP in DHCPD config will kill the entire config file. :(

On an EdgeRouter, and probably anything with Ubiquiti, and maybe anything using the same config style (Brocade and others have the same command set)….

If you add a static reservation outside of the DHCP server’s subnet,
as in, if you typo one octet, or decide to do another subnet just because,
your DHCP server will be offline after reboot. No errors, just silently not serving.

It can be outside of the start/stop range, and that’s fine.

Really, this should give you a warning from the webUI, or it should just say “OKAY, We’ll let you hand out stupid IP addresses.” I mean, what if I wanted this to be my DHCP server, but I had a different router and subnet on the same segment?

From command line, you’ll see the error though:

admin@gw1# commit
[ service dhcp-server ]
Static DHCP lease IP '192.169.1.79' under mapping 'CustomerLaptop'
under shared network name 'LAN' is outside of the DHCP lease network '192.168.1.0/24'.
DHCP server configuration commit aborted due to error(s).
[edit]

unpacking .deb

Reminder to self:
Debian packages are stored in library archive format.
http://www.tldp.org/HOWTO/Debian-Binary-Package-Building-HOWTO/x60.html
https://www.debian.org/doc/debian-policy/ap-pkg-binarypkg.html

ar -xv file.deb
This returns three files, in this specific order:
debian-binary # A small text file. Always “2.0\n” for now.
control.tar.gz # control, md5sums, and pre/post scripts
data.tar.gz # All of the filesystem bits that get deployed

Note also that data.tar can be .xz format as well.

There are dpkg-build tools for this, but all of this can be done manually for more control if desired.
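
For the manual route, something like this would do the whole unpack in one go. It is a sketch: unpack_deb and the control/ and data/ directory layout are my own choices, and it assumes a tar that auto-detects compression on read (GNU tar and bsdtar do), which covers both the .gz and .xz cases.

```shell
# Hypothetical helper: unpack a .deb into DEST/control and DEST/data
# using only ar and tar.
# Usage: unpack_deb file.deb [dest-dir]
unpack_deb() {
    deb=$(cd "$(dirname "$1")" && pwd)/$(basename "$1")   # absolute path
    dest="${2:-./deb-extract}"
    mkdir -p "$dest/control" "$dest/data" || return 1
    ( cd "$dest" && ar -x "$deb" ) || return 1            # drops the three members into dest
    for f in "$dest"/control.tar.*; do
        tar -xf "$f" -C "$dest/control" || return 1       # control, md5sums, scripts
    done
    for f in "$dest"/data.tar.*; do
        tar -xf "$f" -C "$dest/data" || return 1          # the filesystem payload
    done
}
```

After it runs, the maintainer scripts can be read in dest/control before anything gets installed.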


oslevel wrong

I always forget instfix and oslevel -rl….
tags: aix oslevel incorrect backlevel wrong upgrade update

When these things show nothing:
lppchk -v
oslevel -sl `oslevel -sq 2>/dev/null | head -1`

and your bos.rte.install, and bos.mp64, show the correct level compared to:
https://www-304.ibm.com/support/docview.wss?uid=isg1fileset2063572681

You should see the correct level here as well:
oslevel -sq | head

Check these other two things.
oslevel -r -l `oslevel -rq 2>/dev/null | sed -n '1p'`
and
instfix -icqk 6100-09-06-1543 | grep ":-:"
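
The grep for ":-:" works because the colon-separated instfix output carries a status field, where "-" means down level. As far as I recall, the fields are keyword, fileset, required level, installed level, status, and abstract. Assuming that layout, a small filter makes the result easier to read. The function name and the sample lines below are made up for illustration.

```shell
# Hypothetical filter: read instfix -ic style lines on stdin and list the
# filesets whose status field (5th column) is "-", meaning down level.
backlevel_filesets() {
    awk -F: '$5 == "-" { printf "%s: required %s, installed %s\n", $2, $3, $4 }'
}

# Made-up sample lines in the instfix -ic layout, for illustration only.
backlevel_filesets <<'EOF'
6100-09-06-1543:bos.mp64:6.1.9.100:6.1.9.100:=:example abstract
6100-09-06-1543:bos.rte.install:6.1.9.100:6.1.9.45:-:example abstract
EOF
```

On a real system the input would come from instfix -icqk piped into the function.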


PowerHA Quickbuild

Because Facebook notes editor has zero formatting functionality in the new version.

####################################
### POWERHA QUICKBUILD - SANITIZED
####################################
This is a list of all the commands I'm using to build this cluster.
It's been sanitized of any customer information.


####################################
### Cleanup
####################################
clrmclstr
rmcluster -n MYCLUSTER
y | rmcluster -r hdisk2
rmdev -Rdl cluster0
/usr/sbin/rsct/bin/cthagsctrl -z
/usr/sbin/rsct/bin/cthagsctrl -d
echo "cthags 12348/udp" >> /etc/services
/usr/sbin/rsct/bin/cthagsctrl -a
/usr/sbin/rsct/bin/cthagsctrl -s
stopsrc -s clcomd ; sleep 2 ; startsrc -s clcomd
rm /var/hacmp/adm/* /var/hacmp/log/* /var/hacmp/clverify/* /usr/es/sbin/cluster/etc/auto_versync.pid
no -po nonlocsrcroute=1
no -po ipsrcrouterecv=1
shutdown -Fr now


####################################
### System config
####################################
# oslevel -s
7100-04-01-1543

# halevel -s
7.1.3 SP4

# emgr -P
PACKAGE INSTALLER LABEL
======================================================== =========== ==========
openssl.base installp 101a_fix
bos.net.tcp.client installp IV79944s1a
openssh.base.server installp IV80743m9a
openssh.base.client installp IV80743m9a
bos.net.tcp.client installp IV80191s1a
bos.rte.control installp IV80586s1a

# cat /etc/hosts
127.0.0.1 localhost
10.0.0.1 gateway
10.0.0.10 mycluster MYCLUSTER
10.0.0.11 node1
10.0.0.12 node2


####################################
### Cluster communication
####################################
echo node1 > /etc/cluster/rhosts
echo node2 >> /etc/cluster/rhosts
cat /etc/cluster/rhosts > /usr/es/sbin/cluster/etc/rhosts
echo 10.0.0.11 >> /usr/es/sbin/cluster/etc/rhosts
echo 10.0.0.12 >> /usr/es/sbin/cluster/etc/rhosts
echo 10.0.0.1 > /usr/es/sbin/cluster/netmon.cf
stopsrc -s clcomd ; sleep 2 ; startsrc -s clcomd
sleep 10
cl_rsh -n node1 date
cl_rsh -n node2 date

####################################
### Basic cluster build
####################################
export CLUSTER=MYCLUSTER
export NODES="node2 node1"
export HBPVID=deadbeefcafe1234
clmgr add cluster ${CLUSTER} NODES="$NODES"
clmgr modify cluster $CLUSTER REPOSITORY=$HBPVID HEARTBEAT_TYPE=unicast
cldare -rt


####################################
### Add the service address
####################################
/usr/es/sbin/cluster/utilities/claddnode -Tservice -Bmycluster -wnet_ether_01 # -zignore
cllsif
cldare -rt


####################################
### file collections
####################################
clfilecollection -o coll -c Configuration_Files -'' -'AIX and HACMP config files' yes yes
clfilecollection -o coll -c HACMP_Files -'' -'HACMP Resource Group Files' yes yes
clfilecollection -o time -c 10
clfilecollection -o coll -a User_Files 'System user config' yes yes
clfilecollection -o file -a User_Files /etc/passwd
clfilecollection -o file -a User_Files /etc/group
clfilecollection -o file -a User_Files /etc/security/passwd
clfilecollection -o file -a User_Files /etc/security/limits
clfilecollection -o file -a User_Files /.profile
clfilecollection -o file -a User_Files /etc/environment
clfilecollection -o file -a User_Files /etc/profile
clfilecollection -o file -a User_Files /etc/exports
clfilecollection -o file -a User_Files /etc/sudoers
clfilecollection -o file -a User_Files /etc/qconfig
clfilecollection -o file -l Configuration_Files
clfilecollection -o file -l HACMP_Files
clfilecollection -o file -l User_Files
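
The repeated "add file" lines above could come from a loop. This sketch only prints the commands, so they can be reviewed before piping them to a shell on a cluster node. The function name is made up, and the clfilecollection arguments are copied from the lines above.

```shell
# Hypothetical dry-run helper: print one clfilecollection command per file
# for the User_Files collection instead of running anything.
user_files_cmds() {
    for f in "$@"; do
        printf 'clfilecollection -o file -a User_Files %s\n' "$f"
    done
}

user_files_cmds /etc/passwd /etc/group /etc/security/passwd
```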


####################################
### mail events
####################################
/usr/es/sbin/cluster/utilities/claddcustom -t event -n'mail_event' \
-I'mail out when event occurs' -v'/usr/local/cluster/mail_event'
for EVENT in `cat /usr/local/cluster/mail_event.list`; do
/usr/es/sbin/cluster/utilities/clchevent -O"$EVENT" \
-s /usr/es/sbin/cluster/events/$EVENT -b mail_event -c 0
done
/usr/es/sbin/cluster/utilities/clacdNM -MA -nLVM_IO_FAIL -p0 -lLVM_IO_FAIL -m/usr/local/cluster/LVM_IO_FAIL
/usr/es/sbin/cluster/utilities/claddserv -s'my_app' \
-b'/usr/local/cluster/APP_start.ksh' -e'/usr/local/cluster/APP_stop.ksh'
/usr/es/sbin/cluster/utilities/claddserv -s'my_dsmc' \
-b'/usr/local/cluster/DSMC_start.ksh' -e'/usr/local/cluster/DSMC_stop.ksh'
cllsserv


####################################
### Resource group
####################################
/usr/es/sbin/cluster/utilities/claddgrp -g 'myclster_rg' -n 'node2 node1' -S 'OFAN' -O 'FNPN' -B 'FBHPN'
cllsgrp


####################################
### Resources
####################################
/usr/es/sbin/cluster/utilities/claddres -g 'myclster_rg' SERVICE_LABEL='mycluster' \
APPLICATIONS='my_app my_dsmc' VOLUME_GROUP='prdappvg prdvg prdjrnvg' \
FORCED_VARYON='false' VG_AUTO_IMPORT='false' FILESYSTEM= FSCHECK_TOOL='fsck' \
RECOVERY_METHOD='sequential' PPRC_REP_RESOURCE='' FS_BEFORE_IPADDR='false' \
EXPORT_FILESYSTEM='' ERCMF_REP_RESOURCE='' MOUNT_FILESYSTEM='' \
NFS_NETWORK='' SHARED_TAPE_RESOURCES='' DISK='' AIX_FAST_CONNECT_SERVICES='' \
COMMUNICATION_LINKS='' MISC_DATA='' WPAR_NAME='' GMD_REP_RESOURCE='' SVCPPRC_REP_RESOURCE=''
cllsres
cllsres -g myclster_rg


####################################
### Application monitor
####################################
/usr/es/sbin/cluster/utilities/claddappmon MONITOR_TYPE=process name=my_dsmc_mon \
RESOURCE_TO_MONITOR=my_dsmc INVOCATION='longrunning' PROCESSES='dsm.opt.cluster' \
PROCESS_OWNER=root STABILIZATION_INTERVAL='60' RESTART_COUNT='3' FAILURE_ACTION='notify' \
INSTANCE_COUNT=1 RESTART_INTERVAL=360 NOTIFY_METHOD='/usr/local/cluster/mail_event' \
CLEANUP_METHOD='/usr/local/cluster/DSMC_stop.ksh' \
RESTART_METHOD='/usr/local/cluster/DSMC_start.ksh'
/usr/es/sbin/cluster/utilities/claddappmon name=my_app_mon \
RESOURCE_TO_MONITOR=my_app INVOCATION='both' MONITOR_TYPE=user \
STABILIZATION_INTERVAL=120 MONITOR_INTERVAL=120 \
RESTART_COUNT=3 RESTART_INTERVAL=800 FAILURE_ACTION=fallover \
NOTIFY_METHOD=/usr/local/cluster/mail_event FAILURE_ACTION='notify' \
CLEANUP_METHOD='/usr/local/cluster/APP_stop.ksh' \
RESTART_METHOD='/usr/local/cluster/APP_start.ksh' \
MONITOR_METHOD=/usr/local/cluster/APP_check.ksh HUNG_MONITOR_SIGNAL=9
cllsappmon
cllsappmon my_app_mon
cllsappmon my_dsmc_mon


####################################
### Sync all the changes
####################################
cldare -rt -C interactive


####################################
### Verify both nodes see it fine
####################################
cllsclstr
lscluster -m


####################################
### Start the cluster
####################################
smitty clstart

This is where it complains that hags is not up.
Rebooting does not bring up hags.
Start it manually, and it will die after 20 minutes or so.
Very little logging.

EVERY time I try to mess with HA, it’s broken. It’s always something different. Such a pain. Truly, I don’t know why people do not just use their own scripts.


Owncloud filled /var/lib/mysql!

I installed owncloud, and set it to indexing a pile of files I wanted easier access to.

Well, /var filled, and the DB stopped. :o

I was on Debian Jessie (stable), and needed some updates to continue.

### Expand /var since I'm not ready to move /var/lib/mysql to its own filesystem
lvextend -L 16G /dev/rootvg/hd9
resize2fs /dev/rootvg/hd9
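
A check like this from cron would have warned me before /var filled. It is a sketch: the function names and the 90% default threshold are my own picks, and it relies on the POSIX df -P output format, where the fifth column is the capacity percentage.

```shell
# Hypothetical helpers for a disk-space warning.
# Print the used percentage (number only) of the filesystem holding path $1.
fs_used_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Warn when usage is at or above the threshold in $2 (default 90).
fs_check() {
    pct=$(fs_used_pct "$1")
    if [ "$pct" -ge "${2:-90}" ]; then
        echo "WARNING: $1 is ${pct}% full"
    else
        echo "OK: $1 is ${pct}% full"
    fi
}

fs_check / 90
```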


### Stop services using mysql
/etc/init.d/apache2 stop


### Dump all databases
mysqldump --all-databases --opt --routines --complete-insert -uroot -p | gzip -9 > /storage/test/mysqldump.2016-03-03.gz
-- Warning: Skipping the data of table mysql.event. Specify the --events option explicitly.


### Drop all databases except mysql and information_schema
tar -czvf /storage/test/mysql_var_minus_innodb.tgz [dm-z]*
mysql -u root -p
mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| mysql              |
| owncloud           |
| performance_schema |
| phpmyadmin         |
| roundcube          |
| test               |
+--------------------+
7 rows in set (0.00 sec)

mysql> drop database owncloud;
mysql> drop database performance_schema;
mysql> drop database phpmyadmin;
mysql> drop database roundcube;
mysql> drop database test;
mysql> SET GLOBAL innodb_fast_shutdown = 0;
mysql> exit

### Or for the brave
mysql -e "SELECT DISTINCT CONCAT ('DROP DATABASE ',TABLE_SCHEMA,' ;') FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA <> 'mysql' AND TABLE_SCHEMA <> 'information_schema';" | tail -n+2 | mysql -u root -p
mysql -e "SELECT table_name, table_schema, engine FROM information_schema.tables WHERE engine = 'InnoDB';"
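
### A slightly less brave variant: print the DROP statements and read them first.
### Sketch only. drop_statements is a made-up name, and it keeps the same two
### schemas as the query above (mysql and information_schema).

```shell
# Hypothetical helper: read database names on stdin, one per line, and print
# a DROP DATABASE statement for each one that is not a protected schema.
drop_statements() {
    while read -r db; do
        case "$db" in
            mysql|information_schema|"") continue ;;
        esac
        printf 'DROP DATABASE `%s`;\n' "$db"
    done
}

# Demo with the database list shown earlier; nothing is executed against MySQL.
printf '%s\n' information_schema mysql owncloud performance_schema phpmyadmin roundcube test |
    drop_statements
```

The names could come from mysql -N -e 'SHOW DATABASES' on a live server, and the reviewed output can then be piped into mysql -u root -p.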


### Stop mysql
/etc/init.d/mysql stop

### Remove the InnoDB files
rm /var/lib/mysql/ib*


### changed from jessie to stretch to get MySQL 5.6
### Not quite ready for MariaDB 1x
vi /etc/apt/sources.list
# Standard repo
deb http://ftp.us.debian.org/debian stretch main contrib non-free
deb-src http://ftp.us.debian.org/debian stretch main contrib non-free

### Volatile
deb http://ftp.debian.org/debian/ stretch-updates main contrib non-free
deb-src http://ftp.debian.org/debian/ stretch-updates main contrib non-free

### Debian Backports
deb http://http.debian.net/debian stretch-backports main

### security updates
deb http://security.debian.org/ stretch/updates main contrib non-free
deb-src http://security.debian.org/ stretch/updates main contrib non-free


####################################
apt-get update
apt-get install mysql-server-5.6
apt-get install mysql-server-5.6  ## going from jessie to stretch, so it was a little tweaky


### Increased log and memory size for mysql from defaults (log 25% of buffer pool)
### Changed to barracuda (supports compressed tables)
### Changed to one file per table for various reasons.
vi /etc/mysql/my.cnf
[mysqld]
# * InnoDB
# InnoDB is enabled by default with a 10MB datafile in /var/lib/mysql/.
# Read the manual for more InnoDB related options. There are many!
innodb_file_per_table = ON
innodb_file_format = barracuda
innodb_flush_method=O_DIRECT
innodb_log_file_size=256M
innodb_buffer_pool_size=1G
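
The "log 25% of buffer pool" rule of thumb is where the 256M comes from. A tiny sketch of the arithmetic, with a made-up function name and sizes in MB:

```shell
# Hypothetical helper: redo log size as 25% of the buffer pool, in MB.
innodb_log_mb() {
    pool_mb="$1"
    echo $(( pool_mb / 4 ))
}

innodb_log_mb 1024    # prints 256, matching innodb_log_file_size=256M for a 1G pool
```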


#####################################
### it recreates the IB files on start
/etc/init.d/mysql start


### Make sure barracuda is set for real
mysql -u root -p
mysql> set global innodb_file_format = 'Barracuda';
mysql> exit


### Import the dump
gunzip < /storage/test/mysqldump.2016-03-03.gz | mysql -u root -p


###########################################################################
###########################################################################
### Repair a problem with MySQL installer / conversion / upgrade
### See http://bugs.mysql.com/bug.php?id=67179
/* 
  temporary fix for problem with windows installer for MySQL 5.6.10 on Windows 7 machines.
  I did the procedure on a clean installed MySql, and it worked for me, at least it stopped
  lines of innodb errors in the log and the use of transient innodb tables. So, do it at
  your own risk..
  
  1. drop these tables from "use mysql":
     innodb_index_stats
     innodb_table_stats
	 slave_master_info
     slave_relay_log_info
     slave_worker_info
	 
  2. delete all .frm & .ibd of the tables above.
  
  3. run this file to recreate the tables above (source five-tables.sql).
  
  4. restart mysqld.
  
  Cheers, 
  CNL
*/

CREATE TABLE `innodb_index_stats` (
  `database_name` varchar(64) COLLATE utf8_bin NOT NULL,
  `table_name` varchar(64) COLLATE utf8_bin NOT NULL,
  `index_name` varchar(64) COLLATE utf8_bin NOT NULL,
  `last_update` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `stat_name` varchar(64) COLLATE utf8_bin NOT NULL,
  `stat_value` bigint(20) unsigned NOT NULL,
  `sample_size` bigint(20) unsigned DEFAULT NULL,
  `stat_description` varchar(1024) COLLATE utf8_bin NOT NULL,
  PRIMARY KEY (`database_name`,`table_name`,`index_name`,`stat_name`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin STATS_PERSISTENT=0;

CREATE TABLE `innodb_table_stats` (
  `database_name` varchar(64) COLLATE utf8_bin NOT NULL,
  `table_name` varchar(64) COLLATE utf8_bin NOT NULL,
  `last_update` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `n_rows` bigint(20) unsigned NOT NULL,
  `clustered_index_size` bigint(20) unsigned NOT NULL,
  `sum_of_other_index_sizes` bigint(20) unsigned NOT NULL,
  PRIMARY KEY (`database_name`,`table_name`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin STATS_PERSISTENT=0;

CREATE TABLE `slave_master_info` (
  `Number_of_lines` int(10) unsigned NOT NULL COMMENT 'Number of lines in the file.',
  `Master_log_name` text CHARACTER SET utf8 COLLATE utf8_bin NOT NULL COMMENT 'The name of the master binary log currently being read from the master.',
  `Master_log_pos` bigint(20) unsigned NOT NULL COMMENT 'The master log position of the last read event.',
  `Host` char(64) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL DEFAULT '' COMMENT 'The host name of the master.',
  `User_name` text CHARACTER SET utf8 COLLATE utf8_bin COMMENT 'The user name used to connect to the master.',
  `User_password` text CHARACTER SET utf8 COLLATE utf8_bin COMMENT 'The password used to connect to the master.',
  `Port` int(10) unsigned NOT NULL COMMENT 'The network port used to connect to the master.',
  `Connect_retry` int(10) unsigned NOT NULL COMMENT 'The period (in seconds) that the slave will wait before trying to reconnect to the master.',
  `Enabled_ssl` tinyint(1) NOT NULL COMMENT 'Indicates whether the server supports SSL connections.',
  `Ssl_ca` text CHARACTER SET utf8 COLLATE utf8_bin COMMENT 'The file used for the Certificate Authority (CA) certificate.',
  `Ssl_capath` text CHARACTER SET utf8 COLLATE utf8_bin COMMENT 'The path to the Certificate Authority (CA) certificates.',
  `Ssl_cert` text CHARACTER SET utf8 COLLATE utf8_bin COMMENT 'The name of the SSL certificate file.',
  `Ssl_cipher` text CHARACTER SET utf8 COLLATE utf8_bin COMMENT 'The name of the cipher in use for the SSL connection.',
  `Ssl_key` text CHARACTER SET utf8 COLLATE utf8_bin COMMENT 'The name of the SSL key file.',
  `Ssl_verify_server_cert` tinyint(1) NOT NULL COMMENT 'Whether to verify the server certificate.',
  `Heartbeat` float NOT NULL,
  `Bind` text CHARACTER SET utf8 COLLATE utf8_bin COMMENT 'Displays which interface is employed when connecting to the MySQL server',
  `Ignored_server_ids` text CHARACTER SET utf8 COLLATE utf8_bin COMMENT 'The number of server IDs to be ignored, followed by the actual server IDs',
  `Uuid` text CHARACTER SET utf8 COLLATE utf8_bin COMMENT 'The master server uuid.',
  `Retry_count` bigint(20) unsigned NOT NULL COMMENT 'Number of reconnect attempts, to the master, before giving up.',
  `Ssl_crl` text CHARACTER SET utf8 COLLATE utf8_bin COMMENT 'The file used for the Certificate Revocation List (CRL)',
  `Ssl_crlpath` text CHARACTER SET utf8 COLLATE utf8_bin COMMENT 'The path used for Certificate Revocation List (CRL) files',
  `Enabled_auto_position` tinyint(1) NOT NULL COMMENT 'Indicates whether GTIDs will be used to retrieve events from the master.',
  PRIMARY KEY (`Host`,`Port`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 STATS_PERSISTENT=0 COMMENT='Master Information';

CREATE TABLE `slave_relay_log_info` (
  `Number_of_lines` int(10) unsigned NOT NULL COMMENT 'Number of lines in the file or rows in the table. Used to version table definitions.',
  `Relay_log_name` text CHARACTER SET utf8 COLLATE utf8_bin NOT NULL COMMENT 'The name of the current relay log file.',
  `Relay_log_pos` bigint(20) unsigned NOT NULL COMMENT 'The relay log position of the last executed event.',
  `Master_log_name` text CHARACTER SET utf8 COLLATE utf8_bin NOT NULL COMMENT 'The name of the master binary log file from which the events in the relay log file were read.',
  `Master_log_pos` bigint(20) unsigned NOT NULL COMMENT 'The master log position of the last executed event.',
  `Sql_delay` int(11) NOT NULL COMMENT 'The number of seconds that the slave must lag behind the master.',
  `Number_of_workers` int(10) unsigned NOT NULL,
  `Id` int(10) unsigned NOT NULL COMMENT 'Internal Id that uniquely identifies this record.',
  PRIMARY KEY (`Id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 STATS_PERSISTENT=0 COMMENT='Relay Log Information';

CREATE TABLE `slave_worker_info` (
  `Id` int(10) unsigned NOT NULL,
  `Relay_log_name` text CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `Relay_log_pos` bigint(20) unsigned NOT NULL,
  `Master_log_name` text CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `Master_log_pos` bigint(20) unsigned NOT NULL,
  `Checkpoint_relay_log_name` text CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `Checkpoint_relay_log_pos` bigint(20) unsigned NOT NULL,
  `Checkpoint_master_log_name` text CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `Checkpoint_master_log_pos` bigint(20) unsigned NOT NULL,
  `Checkpoint_seqno` int(10) unsigned NOT NULL,
  `Checkpoint_group_size` int(10) unsigned NOT NULL,
  `Checkpoint_group_bitmap` blob NOT NULL,
  PRIMARY KEY (`Id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 STATS_PERSISTENT=0 COMMENT='Worker Information';
###########################################################################
###########################################################################
###########################################################################


### Regenerate performance_schema
mysql_upgrade --force -u root -p


### Make sure tables are okay
mysqlcheck -p


### Grow mysql temporary space to prevent:
#### ERROR 1034 (HY000): Incorrect key file for table 'oc_filecache'; try to repair it
lvextend -L 16G /dev/rootvg/hd1
resize2fs /dev/rootvg/hd1
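
Before kicking off the ALTER or OPTIMIZE steps that follow, it may be worth confirming that the grown filesystem really has the room. This is a minimal sketch, not part of the original procedure: the helper name `has_free_kb` is made up here, and which directory MySQL uses for temporary files is an assumption you would need to check on your own box.

```shell
#!/bin/sh
# has_free_kb DIR MIN_KB
# Succeeds when the filesystem holding DIR has at least MIN_KB kilobytes
# available. "df -Pk" gives the stable POSIX column layout:
# Filesystem 1024-blocks Used Available Capacity Mounted-on
has_free_kb() {
    dir=$1
    min_kb=$2
    # Row 2 is the filesystem that holds DIR; column 4 is the free space in KB.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ -n "$avail_kb" ] && [ "$avail_kb" -ge "$min_kb" ]
}
```

Usage might look like `has_free_kb /tmp 8388608 || echo "less than 8 GiB free"`, where both the path and the threshold are placeholders.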


### Set to compressed tables
# Gzipped, the dump is 319MB; deployed, this one table is 6GB, and the data is read-mostly.
mysql -u root -p
mysql> alter table owncloud.oc_filecache ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;
mysql> exit


### Clean up free space
mysql -u root -p
mysql> OPTIMIZE TABLE owncloud.oc_filecache;
mysql> exit


#####################################
### fix roundcube since it was unhappy with some of the updates
apt-get install roundcube;


### Cleanup some old stuff amplified by partial updates
apt-get autoremove


### Reboot since we had a new dbus installed, and apache2 is still down
shutdown -fr now

bos.rte.security broken

In a couple of instances, I’ve found bos.rte.* filesets broken during upgrade, perhaps with the root part missing.

It’s always a pain, and I always forget how to fix it.

The problem is that the AIX base media does not include base install images for these. They are S (single) updates instead of I (install) images. This is because, during install, a bff called “bos” is laid down first, and that includes 10-20 core filesets, /usr, /, and all the core stuff. It’s basically a prototype mksysb. Sort of.

Anyway, in rare instances, when there is a known defect, IBM will release a fileset as a patch through support/ztrans to get you fixed. If you don’t have time to wait, or if you are a business partner working with a customer who hasn’t yet approved your use of their support, then you might have to fix it yourself.

<b># install_all_updates -cYd /export/lppsource/AIX_7.1.4.1_TLSP</b>
+-----------------------------------------------------------------------------+
                   BUILDDATE Verification ...
+-----------------------------------------------------------------------------+
Verifying build dates...
0503-466 installp: The build date requisite check failed for fileset     bos.rte.security.
Installed fileset build date is 1415.  Selected fileset does not have a build date, but one is required.
installp: Installation failed due to BUILDDATE requisite failure.

install_all_updates: Checking for recommended maintenance level 7100-04.
install_all_updates: Executing /usr/bin/oslevel -rf, Result = 7100-03
install_all_updates: ATTENTION, the system recommended maintenance level
does not correspond to the highest level known to install_all_updates.
For more details, execute /usr/bin/oslevel -rl 7100-04.

install_all_updates: Log file is /var/adm/ras/install_all_updates.log
install_all_updates: Result = FAILURE


<b># installp -acXYd /export/lppsource/AIX_7.1.4.0_Base/installp/ppc bos.rte.security</b>
+-----------------------------------------------------------------------------+
                    Pre-installation Verification...
+-----------------------------------------------------------------------------+
Verifying selections...done
Verifying requisites...Verifying requisites...done
Results...

FAILURES
--------
  Filesets listed in this section failed pre-installation verification
  and will not be installed.

  Requisite Failures
  ------------------
  SELECTED FILESETS:  The following is a list of filesets that you asked to
  install.  They cannot be installed until all of their requisite filesets
  are also installed.  See subsequent lists for details of requisites.

    bos.rte.security 7.1.4.0                  # Base Security Function

  CONFLICTING REQUISITES:  The following filesets are required by one or
  more of the selected filesets listed above.  There are other versions of
  these filesets which are already installed (or which were selected to be
  installed during this installation session).  A base level fileset cannot
  be installed automatically as another fileset's requisite when a different
  version of the requisite is already installed.  You must explicitly select
  the base level requisite for installation.

    bos.64bit 7.1.4.0                         # Base Operating System 64 bit...
    bos.acct 7.1.4.0                          # Accounting Services
    bos.adt.include 7.1.4.0                   # Base Application Development...
    bos.mp64 7.1.4.0                          # Base Operating System 64-bit...
    bos.perf.libperfstat 7.1.4.0              # Performance Statistics Libra...
    bos.perf.perfstat 7.1.4.0                 # Performance Statistics Inter...
    bos.perf.proctools 7.1.4.0                # Proc Filesystem Tools
    bos.perf.tools 7.1.4.0                    # Base Performance Tools
    bos.pmapi.pmsvcs 7.1.4.0                  # Performance Monitor API Kern...
    bos.wpars 7.1.4.0                         # AIX Workload Partitions
    mcr.rte 7.1.4.0                           # Metacluster Checkpoint and R...
    perfagent.tools 7.1.4.0                   # Local Performance Analysis &...

  MISCELLANEOUS FAILING REQUISITES:  The following filesets are requisites
  of one or more of the selected filesets listed above.  Various problems
  associated with these requisites are preventing the selected filesets
  from installing.  See the "Requisite Failure Key" for failure reasons and
  possible recovery hints.

    < bos.rte.security 7.1.3.30               # Base Security Function

  Requisite Failure Key:
  "<" superseded fileset that is applied on the "usr" part which must
      also be applied on the "root" part for consistency.  Select this
      fileset explicitly or use the option to automatically include
      requisite software (-g flag).

  AVAILABLE REQUISITES:  The following filesets are requisites of one or
  more of the selected filesets listed above.  They are available on
  the installation media.  To install these requisites with the selected
  filesets, specify the option to automatically install requisite
  software (-g flag).

    bos.rte.control 7.1.4.0                   # System Control Commands
    bos.rte.libc 7.1.4.0                      # libc Library

  << End of Failure Section >>

+-----------------------------------------------------------------------------+
                   BUILDDATE Verification ...
+-----------------------------------------------------------------------------+
Verifying build dates...done
FILESET STATISTICS
------------------
    1  Selected to be installed, of which:
        1  FAILED pre-installation verification
  ----
    0  Total to be installed


Pre-installation Failure/Warning Summary
----------------------------------------
Name                      Level           Pre-installation Failure/Warning
-------------------------------------------------------------------------------
bos.rte.security          7.1.4.0         Requisite failure

<b># installp -acXYFd /export/lppsource/AIX_7.1.4.0_Base/installp/ppc bos.rte.security</b>
+-----------------------------------------------------------------------------+
                    Pre-installation Verification...
+-----------------------------------------------------------------------------+
Verifying selections...

Pre-installation Failure/Warning Summary
----------------------------------------
0503-500 installp:  After completion of pre-installation processing,
        there were no installable base level filesets found on the
        installation media.  Note that use of the force install option
        (-F flag) will cause installp to consider only base level filesets
        (fileset updates will be ignored).  No installation has occurred.

So then I installed these, thinking maybe….

  bos.64bit 7.1.4.0                           # Base Operating System 64 bit...
  bos.acct 7.1.4.0                            # Accounting Services
  bos.adt.include 7.1.4.0                     # Base Application Development...
  bos.mp64 7.1.4.0                            # Base Operating System 64-bit...
  bos.perf.libperfstat 7.1.4.0                # Performance Statistics Libra...
  bos.perf.perfstat 7.1.4.0                   # Performance Statistics Inter...
  bos.perf.proctools 7.1.4.0                  # Proc Filesystem Tools
  bos.perf.tools 7.1.4.0                      # Base Performance Tools
  bos.pmapi.pmsvcs 7.1.4.0                    # Performance Monitor API Kern...
  bos.wpars 7.1.4.0                           # AIX Workload Partitions
  mcr.rte 7.1.4.0                             # Metacluster Checkpoint and R...
  perfagent.tools 7.1.4.0                     # Local Performance Analysis &...
  bos.rte.control 7.1.4.0                   # System Control Commands

But no joy. bos.rte.libc and bos.rte.security depend on each other, and it still fails with the top errors.

<b># installp -acXYd /export/lppsource/AIX_7.1.4.0_Base/installp/ppc bos.rte.security</b>
+-----------------------------------------------------------------------------+
                    Pre-installation Verification...
+-----------------------------------------------------------------------------+
Verifying selections...done
Verifying requisites...Verifying requisites...done
Results...

FAILURES
--------
  Filesets listed in this section failed pre-installation verification
  and will not be installed.

  Requisite Failures
  ------------------
  SELECTED FILESETS:  The following is a list of filesets that you asked to
  install.  They cannot be installed until all of their requisite filesets
  are also installed.  See subsequent lists for details of requisites.

    bos.rte.security 7.1.4.0                  # Base Security Function

  MISCELLANEOUS FAILING REQUISITES:  The following filesets are requisites
  of one or more of the selected filesets listed above.  Various problems
  associated with these requisites are preventing the selected filesets
  from installing.  See the "Requisite Failure Key" for failure reasons and
  possible recovery hints.

    < bos.rte.security 7.1.3.30               # Base Security Function

  Requisite Failure Key:
  "<" superseded fileset that is applied on the "usr" part which must
      also be applied on the "root" part for consistency.  Select this
      fileset explicitly or use the option to automatically include
      requisite software (-g flag).

  AVAILABLE REQUISITES:  The following filesets are requisites of one or
  more of the selected filesets listed above.  They are available on
  the installation media.  To install these requisites with the selected
  filesets, specify the option to automatically install requisite
  software (-g flag).

    bos.rte.libc 7.1.4.0                      # libc Library

  << End of Failure Section >>

+-----------------------------------------------------------------------------+
                   BUILDDATE Verification ...
+-----------------------------------------------------------------------------+
Verifying build dates...done
FILESET STATISTICS
------------------
    1  Selected to be installed, of which:
        1  FAILED pre-installation verification
  ----
    0  Total to be installed


Pre-installation Failure/Warning Summary
----------------------------------------
Name                      Level           Pre-installation Failure/Warning
-------------------------------------------------------------------------------
bos.rte.security          7.1.4.0         Requisite failure



<b># installp -acXYd /export/lppsource/AIX_7.1.4.0_Base/installp/ppc bos.rte.libc</b>
+-----------------------------------------------------------------------------+
                    Pre-installation Verification...
+-----------------------------------------------------------------------------+
Verifying selections...done
Verifying requisites...Verifying requisites...done
Results...

FAILURES
--------
  Filesets listed in this section failed pre-installation verification
  and will not be installed.

  Requisite Failures
  ------------------
  SELECTED FILESETS:  The following is a list of filesets that you asked to
  install.  They cannot be installed until all of their requisite filesets
  are also installed.  See subsequent lists for details of requisites.

    bos.rte.libc 7.1.4.0                      # libc Library

  MISCELLANEOUS FAILING REQUISITES:  The following filesets are requisites
  of one or more of the selected filesets listed above.  Various problems
  associated with these requisites are preventing the selected filesets
  from installing.  See the "Requisite Failure Key" for failure reasons and
  possible recovery hints.

    < bos.rte.security 7.1.3.30               # Base Security Function

  Requisite Failure Key:
  "<" superseded fileset that is applied on the "usr" part which must
      also be applied on the "root" part for consistency.  Select this
      fileset explicitly or use the option to automatically include
      requisite software (-g flag).

  AVAILABLE REQUISITES:  The following filesets are requisites of one or
  more of the selected filesets listed above.  They are available on
  the installation media.  To install these requisites with the selected
  filesets, specify the option to automatically install requisite
  software (-g flag).

    bos.rte.security 7.1.4.0                  # Base Security Function

  << End of Failure Section >>

+-----------------------------------------------------------------------------+
                   BUILDDATE Verification ...
+-----------------------------------------------------------------------------+
Verifying build dates...done
FILESET STATISTICS
------------------
    1  Selected to be installed, of which:
        1  FAILED pre-installation verification
  ----
    0  Total to be installed


Pre-installation Failure/Warning Summary
----------------------------------------
Name                      Level           Pre-installation Failure/Warning
-------------------------------------------------------------------------------
bos.rte.libc              7.1.4.0         Requisite failure


<b># installp -acXYFd /export/lppsource/AIX_7.1.4.0_Base/installp/ppc bos.rte.libc</b>
+-----------------------------------------------------------------------------+
                    Pre-installation Verification...
+-----------------------------------------------------------------------------+
Verifying selections...

Pre-installation Failure/Warning Summary
----------------------------------------
0503-500 installp:  After completion of pre-installation processing,
        there were no installable base level filesets found on the
        installation media.  Note that use of the force install option
        (-F flag) will cause installp to consider only base level filesets
        (fileset updates will be ignored).  No installation has occurred.



<b># installp -acXYd /export/lppsource/AIX_7.1.4.0_Base/installp/ppc bos.rte.security bos.rte.libc</b>
+-----------------------------------------------------------------------------+
                    Pre-installation Verification...
+-----------------------------------------------------------------------------+
Verifying selections...done
Verifying requisites...Verifying requisites...done
Results...

SUCCESSES
---------
  Filesets listed in this section passed pre-installation verification
  and will be installed.

  Selected Filesets
  -----------------
  bos.rte.libc 7.1.4.0                        # libc Library
  bos.rte.security 7.1.4.0                    # Base Security Function

  Requisites
  ----------
  (being installed automatically;  required by filesets listed above)
  bos.rte.security 7.1.3.30                   # Base Security Function

  << End of Success Section >>

+-----------------------------------------------------------------------------+
                   BUILDDATE Verification ...
+-----------------------------------------------------------------------------+
Verifying build dates...
0503-466 installp: The build date requisite check failed for fileset     bos.rte.security.
Installed fileset build date is 1415.  Selected fileset does not have a build date, but one is required.
installp: Installation failed due to BUILDDATE requisite failure.




<b># installp -C</b>
0503-439 installp:  No filesets were found in the Software Vital
        Product Database that could be cleaned up.



<b># installp -c all</b>
+-----------------------------------------------------------------------------+
                        Pre-commit Verification...
+-----------------------------------------------------------------------------+
Verifying selections...done
Verifying requisites...done
Results...

WARNINGS
--------
  Problems described in this section are not likely to be the source of any
  immediate or serious failures, but further actions may be necessary or
  desired.

  Nothing to Commit
  -----------------
  There is nothing in the APPLIED state that needs to be committed.

  << End of Warning Section >>



<b># lslpp -h bos.rte.security</b>
  Fileset         Level     Action       Status       Date         Time
  ----------------------------------------------------------------------------
Path: /usr/lib/objrepos
  bos.rte.security
                  7.1.3.0   COMMIT       COMPLETE     07/25/14     09:44:45
                 7.1.3.15   COMMIT       COMPLETE     11/20/14     11:25:13
                 7.1.3.30   COMMIT       COMPLETE     12/08/14     02:47:41

Path: /etc/objrepos
  bos.rte.security
                  7.1.3.0   COMMIT       COMPLETE     07/25/14     09:44:45
                 7.1.3.15   COMMIT       COMPLETE     11/20/14     11:25:14



<b># installp -rBXJw bos.rte.security</b>
+-----------------------------------------------------------------------------+
                        Pre-reject Verification...
+-----------------------------------------------------------------------------+
Verifying selections...done
Verifying requisites...done
Results...

WARNINGS
--------
  Problems described in this section are not likely to be the source of any
  immediate or serious failures, but further actions may be necessary or
  desired.

  Not Rejectable
  --------------
  No software could be found installed on the system that could be rejected
  for the following requests:

    bos.rte.security

  (Possible reasons for failure:  1. the selected software has been
   committed, i.e., cannot be rejected, 2. the selected software is not
   installed, 3. the pre-reject script failed, or 4. a typographical
   error was made.)

  << End of Warning Section >>

FILESET STATISTICS
------------------
    1  Selected to be rejected, of which:
        1  FAILED pre-reject verification
  ----
    0  Total to be rejected


Pre-installation Failure/Warning Summary
----------------------------------------
Name                      Level           Pre-installation Failure/Warning
-------------------------------------------------------------------------------
bos.rte.security                          Failed pre-rejection check



<b># lppchk -vm3</b>
lppchk:  The following filesets need to be installed or corrected to bring
         the system to a consistent state:

  bos.rte.security 7.1.3.30               (usr: COMMITTED, root: not installed)


<b># installp -acXYFd /export/lppsource/AIX_7.1.4.0_Base/installp/ppc bos.rte.security bos.rte.libc</b>
+-----------------------------------------------------------------------------+
                    Pre-installation Verification...
+-----------------------------------------------------------------------------+
Verifying selections...

Pre-installation Failure/Warning Summary
----------------------------------------
0503-500 installp:  After completion of pre-installation processing,
        there were no installable base level filesets found on the
        installation media.  Note that use of the force install option
        (-F flag) will cause installp to consider only base level filesets
        (fileset updates will be ignored).  No installation has occurred.


The solution was ODM surgery.

First, I took a mksysb and copied it to somewhere safe (another server with NIM installed).

Then, I looked into ODM, and found /etc/objrepos/product was missing the entry for this version.
You might be able to copy from /usr/lib/objrepos, but I copied from a valid clone of this system.

# export ODMDIR=/etc/objrepos
# ssh goodserver odmget -q lpp_name=bos.rte.security product | odmadd

Then, I needed to add the history line, which was identical between root and usr:

# odmget -q name=bos.rte.security lpp     (note the lpp_id)
# ODMDIR=/usr/lib/objrepos odmget -q lpp_id=39 history | ODMDIR=/etc/objrepos odmadd

The “inventory” ODM class is keyed by lpp_id as well, but it already had a long list of files. I did not touch any of that.
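
The lpp_id lookup can be scripted instead of read off the screen. The sketch below assumes odmget prints stanzas with one `field = value` pair per line; `lpp_id_from_stanza` is a hypothetical helper name, and it only reads text, so nothing here writes to the ODM.

```shell
#!/bin/sh
# lpp_id_from_stanza: read "odmget ... lpp" output on stdin and print the
# value of the first lpp_id field found.
lpp_id_from_stanza() {
    awk '$1 == "lpp_id" && $2 == "=" { print $3; exit }'
}

# On AIX this might be used as (not run here):
#   id=$(ODMDIR=/etc/objrepos odmget -q name=bos.rte.security lpp | lpp_id_from_stanza)
```

Saving the odmget output to a file and reviewing it before piping anything into odmadd is a cheap extra safety step on top of the mksysb.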

Now, install_all_updates from my TLSP worked fine.


How to show respect when bestowing honors…

It’s great to announce milestones when employees reach a certain number of years of service. However, if you’re going to do this verbally, it’s important to find out from the person, or their manager, how to pronounce their name.

It’s not acceptable for a CEO or other executive to claim they are honoring someone, but to say “I’m sorry I don’t know how to pronounce these.” If some of the names are really too tough, it’s fine to send out a list via email, and maybe a temporary blurb on the company page. Even having someone else read the list who can pronounce names is acceptable.

Also, if your company is a conglomerate, it’s not okay for the executive to announce only people in the business unit that promoted them, when it’s a call for the entire company. The list really needs to be complete for the audience selected. It is entirely acceptable to thank only a specific unit when only addressing that unit. It’s entirely acceptable to put a list up somewhere and ask people to review it, as long as they are given access and time.

Further, communication really needs to be targeted. If you have several business units, do not spam XYZ with things only related to PDQ, and vice versa. Technical people for one product do not need, and do not want, sales information for other, mostly unrelated products. By the same token, sales people do not want, nor do they need, in-depth details about technical matters.

Lastly, when concerns about respect are brought up, it’s important to directly address them. Do not put them off to a later date, or assume they are okay. Put the issue on a list, and put follow up dates on your calendar. Make sure you understand the issue, and that it’s been resolved. Usually, it’s simply a communication error, or sometimes it’s a cultural difference.

Remember, honor and respect are key components. These little things are the pillars of any company. If their expression is hollow or incomplete, then what does that say about the foundation of your business?


cl_rsh fails

PROBLEM: On some migrations, we found the rpdomain would not stay running on one node.
The cluster was up, and SEEMED to operate normally, but errpt got CONFIGRM stop/start messages every minute.

lsrpdomain would show Offline, or “Pending online”.

lsrpnode would show:
2610-412 A Resource Manager terminated while attempting to enumerate resources for this command.
2610-408 Resource selection could not be performed.
2610-412 A Resource Manager terminated while attempting to enumerate resources for this command.
2610-408 Resource selection could not be performed.

On the other node, lsrpnode only showed itself, and lsrpdomain showed Online.

“cl_rsh node1 date” worked from both nodes
“cl_rsh node2 date” worked only from node2.
/etc/hosts, cllsif, hostname, /etc/cluster/rhosts… everything was spotless.
clcomd was running, even after refresh.
Same subnet, and ports were not filtered.
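
The node-by-node cl_rsh test above can be wrapped in a small loop and run on each node in turn. This is only a sketch: `check_remote_shell` is a made-up helper, and the default cl_rsh path is an assumption based on where the other PowerHA utilities in this post live.

```shell
#!/bin/sh
# Remote-shell command to test. The default path is an assumption;
# override RSH_CMD if cl_rsh lives elsewhere on your install.
RSH_CMD=${RSH_CMD:-/usr/es/sbin/cluster/utilities/cl_rsh}

# check_remote_shell NODE...
# Runs "date" on every named node through $RSH_CMD and prints one OK/FAIL
# line per node. Returns non-zero if any node failed.
check_remote_shell() {
    rc=0
    for node in "$@"; do
        if "$RSH_CMD" "$node" date >/dev/null 2>&1; then
            echo "OK   $node"
        else
            echo "FAIL $node"
            rc=1
        fi
    done
    return "$rc"
}
```

Running `check_remote_shell node1 node2` on both nodes gives the same picture as the manual test, with a non-zero exit status that a script can act on.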

Importing a snapshot said:
Warning: unable to verify inbound clcomd communication from
        node "node1" to the local node, "node2".

I applied PowerHA 7.1.3 SP4, but that did not fix it. I think this is a problem with clmigcheck or mkcluster in AIX.

SOLUTION
I saved a snapshot, blew away the cluster, and imported the snapshot.
/usr/es/sbin/cluster/utilities/clsnapshot -c -i -nmysnapshot -d "Snapshot before clrmcluster"
clstop -g -N
stopsrc -g cluster
clrmclstr
rmcluster -r hdisk10

(One node's sshd died at this point.)

rmdev -dl cluster0
cfgmgr
cl_rsh works all the way around now.
/usr/es/sbin/cluster/utilities/clsnapshot -a -n'mysnapshot' -f'false'
cllsclstr ; lscluster -m ; lsrpdomain ; lsrpnode

works fine all around, before and after reboot.
Cluster starts normally.

Error Reference
---------------------------------------------------------------------------
LABEL: CONFIGRM_STOPPED_ST
IDENTIFIER: 447D3237

Date/Time: Tue Nov 24 04:18:36 EST 2015
Sequence Number: 42614
Class: O
Type: INFO
WPAR: Global
Resource Name: ConfigRM

Description
IBM.ConfigRM daemon has been stopped.

Probable Causes
The RSCT Configuration Manager daemon(IBM.ConfigRMd) has been stopped.

User Causes
The stopsrc -s IBM.ConfigRM command has been executed.

Recommended Actions
Confirm that the daemon should be stopped. Normally, this daemon should
not be stopped explicitly by the user.

Detail Data
DETECTING MODULE
RSCT,ConfigRMDaemon.C,1.25.1.1,219
ERROR ID

REFERENCE CODE

---------------------------------------------------------------------------
LABEL: CONFIGRM_MESSAGE_ST
IDENTIFIER: F475ABC7

Date/Time: Tue Nov 24 04:18:32 EST 2015
Sequence Number: 42613
Class: O
Type: INFO
WPAR: Global
Resource Name: ConfigRM

Detail Data
DETECTING MODULE
RSCT,ConfigRMGroup.C,1.337.1.1,6951

DIAGNOSTIC EXPLANATION
get_adapter_info_by_addr(192.168.0.12) FAILED rc=28
---------------------------------------------------------------------------
LABEL: CONFIGRM_MESSAGE_ST
IDENTIFIER: F475ABC7

Date/Time: Tue Nov 24 04:18:32 EST 2015
Sequence Number: 42612
Class: O
Type: INFO
WPAR: Global
Resource Name: ConfigRM

Detail Data
DETECTING MODULE
RSCT,ConfigRMGroup.C,1.337.1.1,6951

DIAGNOSTIC EXPLANATION
get_adapter_info_by_addr(192.168.0.12) FAILED rc=28
---------------------------------------------------------------------------
LABEL: CONFIGRM_MESSAGE_ST
IDENTIFIER: F475ABC7

Date/Time: Tue Nov 24 04:18:32 EST 2015
Sequence Number: 42611
Class: O
Type: INFO
WPAR: Global
Resource Name: ConfigRM

Detail Data
DETECTING MODULE
RSCT,ConfigRMGroup.C,1.337.1.1,6951

DIAGNOSTIC EXPLANATION
get_adapter_info_by_addr(10.0.0.12) FAILED rc=28
---------------------------------------------------------------------------
LABEL: CONFIGRM_MESSAGE_ST
IDENTIFIER: F475ABC7

Date/Time: Tue Nov 24 04:18:32 EST 2015
Sequence Number: 42610
Class: O
Type: INFO
WPAR: Global
Resource Name: ConfigRM

Detail Data
DETECTING MODULE
RSCT,ConfigRMGroup.C,1.337.1.1,6951

DIAGNOSTIC EXPLANATION
get_adapter_info_by_addr(192.168.0.11) FAILED rc=28
---------------------------------------------------------------------------
LABEL: CONFIGRM_MESSAGE_ST
IDENTIFIER: F475ABC7

Date/Time: Tue Nov 24 04:18:32 EST 2015
Sequence Number: 42609
Class: O
Type: INFO
WPAR: Global
Resource Name: ConfigRM

Detail Data
DETECTING MODULE
RSCT,ConfigRMGroup.C,1.337.1.1,6951

DIAGNOSTIC EXPLANATION
get_adapter_info_by_addr(192.168.0.11) FAILED rc=28
---------------------------------------------------------------------------
LABEL: CONFIGRM_MESSAGE_ST
IDENTIFIER: F475ABC7

Date/Time: Tue Nov 24 04:18:32 EST 2015
Sequence Number: 42608
Class: O
Type: INFO
WPAR: Global
Resource Name: ConfigRM

Detail Data
DETECTING MODULE
RSCT,ConfigRMGroup.C,1.337.1.1,6951

DIAGNOSTIC EXPLANATION
get_adapter_info_by_addr(10.0.0.11) FAILED rc=28
---------------------------------------------------------------------------
LABEL: CONFIGRM_PENDINGQUO
IDENTIFIER: A098BF90

Date/Time: Tue Nov 24 04:18:32 EST 2015
Sequence Number: 42607
Class: S
Type: PERM
WPAR: Global
Resource Name: ConfigRM

Description
The operational quorum state of the active peer domain has changed to PENDING_QUORUM.
This state usually indicates that exactly half of the nodes that are defined in the
peer domain are online. In this state cluster resources cannot be recovered although
none will be stopped explicitly.

Failure Causes
One or more nodes in the active peer domain have failed.
One or more nodes in the active peer domain have been taken offline by the user.
A network failure is disrupted communication between the cluster nodes.

Recommended Actions
Ensure that more than half of the nodes of the domain are online.
Ensure that the network that is used for communication between the nodes is functioning correctly.
Ensure that the active tie breaker device is operational and if it set to
'Operator' then resolve the tie situation by granting ownership to one of
the active sub-domains.

Detail Data
DETECTING MODULE
RSCT,PeerDomain.C,1.99.30.8,19713

---------------------------------------------------------------------------
LABEL: STORAGERM_STARTED_S
IDENTIFIER: EDFF8E9B

Date/Time: Tue Nov 24 04:17:53 EST 2015
Sequence Number: 42606
Node Id: node1
Class: O
Type: INFO
WPAR: Global
Resource Name: StorageRM

Detail Data
DETECTING MODULE
RSCT,IBM.StorageRMd.C,1.49,147

---------------------------------------------------------------------------
LABEL: CONFIGRM_ONLINE_ST
IDENTIFIER: 3B16518D

Date/Time: Tue Nov 24 04:17:52 EST 2015
Sequence Number: 42605
Node Id: node1
Class: S
Type: INFO
WPAR: Global
Resource Name: ConfigRM

Detail Data
DETECTING MODULE
RSCT,PeerDomain.C,1.99.30.8,24950

Peer Domain Name
mycluster
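
With entries arriving every minute, a summary is easier to read than the raw report. A small sketch, assuming `errpt -a` style text like the entries above is fed on stdin; both function names are invented for this example.

```shell
#!/bin/sh
# errpt_label_counts: print "count label" for every LABEL seen,
# most frequent first.
errpt_label_counts() {
    awk '$1 == "LABEL:" { n[$2]++ } END { for (l in n) print n[l], l }' |
        sort -k1,1nr -k2,2
}

# errpt_failed_addrs: print each distinct address that appears in a
# "get_adapter_info_by_addr(ADDR) FAILED" diagnostic line.
errpt_failed_addrs() {
    sed -n 's/^get_adapter_info_by_addr(\([^)]*\)) FAILED.*/\1/p' | sort -u
}

# On AIX this might be used as (not run here):
#   errpt -a | errpt_label_counts
#   errpt -a | errpt_failed_addrs
```

In the log above, the failing addresses cover the interfaces of both nodes, which fits a node that cannot see the domain's adapter configuration at all, as opposed to one bad interface.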


clmigcheck bypass hitachi

With Hitachi disks, clmigcheck always says there are no matching disks.
This may be due to spaces in the udid, or other problems with the awk line in list_common_free_disks.
The easiest fix is to:
A) Verify you have absolutely picked the right disk. Set a PVID and remove/re-add it to make sure.
B) Comment out the “mv” line for the RESULT files.
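
To show why embedded spaces could trip an awk-based match: awk splits on whitespace by default, so a udid containing a space lands in more than one field. This sketch is illustrative only. The input layout, the sample value in the usage note, and the function names are invented here, not taken from list_common_free_disks.

```shell
#!/bin/sh
# Input lines are assumed to look like "<disk> <udid>", where the udid
# may itself contain spaces.

# Naive: take field 2 only, which truncates a udid at its first space.
udid_naive() {
    awk '{ print $2 }'
}

# Safer: drop the first field and keep the rest of the line intact.
udid_whole() {
    sed 's/^[^ ]* //'
}
```

For a made-up line such as `hdisk10 AAAA BBBB`, the naive version yields `AAAA` while the second keeps `AAAA BBBB`, so two nodes comparing truncated values could fail to match or match the wrong disk.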



AIX 4k sectors

SUMMARY: For AIX, it’s best to stick with 512/528B format devices for internal disk.

DETAILS:
Raw, SAN, and ARRAY are 512b LBA (block or sector size). iSeries are 520. T10 RAID are 528. 4224b pdisks become 4k hdisks only.

For iSeries, vSCSI will remap 512 and 4096 from current PCIe3 SAS-RAID adapters to 520b sectors for hdisk VTDs.

All SAN LUNs, and most enterprise DDMs are 512/520/528b, even when presented through remote PV servers (GLVM/HAGEO) or Network Shared Disks (NSDs/GPFS).

There is no block-size translation for AIX, regardless of whether it’s SSP, LV or PV backed. In other words, VGs, including an SSP, can only have one block size device inside.

Support for 4k disks:
AIX Announced 2013-11-16. SAS controller support 2014-06-06. AIX support 7.1.3.0 and 6.1.9.0. VIO support 2.2.3.3. iSeries support 7.1.9 and 7.2.1.

All of the POWER8 supported 4k parts have 512/528b counterparts:
http://www-01.ibm.com/support/knowledgecenter/P8DEA/p8ecs/p8ecs_drive_parts.htm

4k disks are okay for:

  • Systems that will not have other storage attached.
  • iSeries – 520b sector translation works on 4k PCIe3 SAS.
  • Linux – MDADM translates, and LVM can adjust data alignment.
  • Windows 7/2008 dynamic arrays (not on POWER)
  • Solaris ZFS ashift=12 can also mix (not on POWER).

References:

Sector sizes:

  • Standard: 512 raw, 528 RAID, 520 iSeries
  • Advanced: 4096 raw, 4224 RAID, 4160 iSeries.
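
Each advanced size is exactly eight of the matching standard sectors, which a few lines of Python confirm. This is only a sketch: the dictionary and function names are made up for illustration, and treating the bytes beyond 512/4096 as per-sector overhead is my reading of the numbers, not something the drive documentation spells out.

```python
# Sector sizes in bytes, copied from the list above.
STANDARD = {"raw": 512, "RAID": 528, "iSeries": 520}
ADVANCED = {"raw": 4096, "RAID": 4224, "iSeries": 4160}


def overhead(sector_bytes, data_bytes):
    """Bytes per sector beyond the plain data payload."""
    return sector_bytes - data_bytes


for fmt, size in STANDARD.items():
    # Each advanced sector is eight standard sectors of the same format.
    assert ADVANCED[fmt] == 8 * size
    print(fmt, size, ADVANCED[fmt],
          overhead(size, 512), overhead(ADVANCED[fmt], 4096))
```

So a 528-byte RAID sector carries 16 extra bytes, and its 4224-byte counterpart carries 128.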

extendvg error
0516-1980 /usr/sbin/extendvg: Block size of all disks in the volume group must be the same. Cannot mix disks with different block sizes.

extendvg manpage:
Note: You cannot mix physical volume (PV) of 4 KB block size with PV blocks of other sizes. The block size of all PVs in the volume group must be the same.
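
The rule is easy to check before running extendvg if you already know the block sizes. A minimal sketch, assuming you have collected the sizes by hand into a dictionary; the function name is hypothetical, and the disk names and sizes are the ones from my failed extendvg run.

```python
def mixed_block_sizes(vg_block_size, candidates):
    """Return the candidate disks whose block size differs from the VG's."""
    return {name: size for name, size in candidates.items()
            if size != vg_block_size}


# The VG was built on 4096-byte disks; both new disks are 512-byte.
rejected = mixed_block_sizes(4096, {"hdisk4": 512, "hdisk5": 512})
print(rejected)
```

Anything in the returned dictionary is a disk that extendvg will refuse with 0516-1980.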

Considerations for SSD


While using SSDs, consider the following specifications:

  • Intermixing of SSDs and HDDs within the same disk array is not allowed. A disk array must contain all SSDs or all HDDs.
  • It is important to properly plan for hot-spare devices when using arrays of SSDs. An SSD hot-spare device is used to replace a failed device in an SSD disk array and an HDD hot spare is used for an HDD disk array.
  • Although SSDs can be used in a RAID 0 disk array, it is preferred that SSDs be protected by RAID levels 5, 6, 10, 5T2, 6T2, or 10T2.
  • See Installing and configuring Solid-state drives to identify specific configuration and placement requirements related to the SSD devices.
  • Some adapters, known as RAID and SSD adapters, contain SSDs, which are integrated on the adapter. See the PCIe SAS RAID card comparison table for features and additional information for your specific adapter type.
  • SSDs are supported only when formatted to a RAID block size and used as part of a RAID array.

During the controller boot process, any 528 bytes per sector (not 4224 bytes per sector) SSD array candidate attached to a PCIe or PCIe2 SAS RAID Controller that is not already part of a disk array is automatically created as a single-drive RAID 0 disk array.

  • The RAID 0 disk array can be migrated to a RAID 10 disk array by using the technique described in Migrating an existing disk array to a new RAID level.
  • The automatically created RAID 0 disk array can be deleted (see Deleting a disk array) and a new SSD disk array can be created with a different level of RAID protection (see Creating a disk array).

Debug output looks like this:

# export LVMT_VERBOSE=9
# export LVMGS_VERBOSE=9
# export LVMT_OUT=stdout
# extendvg datavg hdisk4 hdisk5

[S 8519692 13697088 11/19/15-15:24:14:052 mkvg.c 424] extendvg datavg hdisk4 hdisk5
[7 8519692 0:000 mkvg.c 981] addto_pv_list: pv name=>hdisk4< =
[7 8519692 0:000 mkvg.c 981] addto_pv_list: pv name=>hdisk5< =
[7 8519692 0:000 comutl.c 1010] lvm_getvginfo: start
[7 8519692 0:000 configutl.c 1115] lvm_config: call to hd_cfg, cmd=147
[7 8519692 0:001 comutl.c 1203] lvm_getvginfo: end, rc=0
[7 8519692 0:001 utilities.c 1826] lvm_cfglock: name=datavg, pid=8519692, flags=0x5
[7 8519692 0:001 configutl.c 1115] lvm_config: call to hd_cfg, cmd=139
[7 8519692 0:001 utilities.c 2007] lvm_cfglock_query: name=datavg, cfglock_state=-261, owner_pid=0, owner_ppid=0
[7 8519692 0:001 configutl.c 1115] lvm_config: call to hd_cfg, cmd=139
[7 8519692 0:001 utilities.c 1826] lvm_cfglock: name=hdisk4, pid=8519692, flags=0x1
[7 8519692 0:001 configutl.c 1115] lvm_config: call to hd_cfg, cmd=139
[7 8519692 0:001 utilities.c 2007] lvm_cfglock_query: name=hdisk4, cfglock_state=-261, owner_pid=0, owner_ppid=0
[7 8519692 0:001 configutl.c 1115] lvm_config: call to hd_cfg, cmd=139
[7 8519692 0:001 utilodm.c 1288] lvmdb_chkpvcfg(), rc=0
[7 8519692 0:001 comutl.c 1557] lvm_pvio: start
[7 8519692 0:001 comutl.c 1565] lvm_pvio: loading
[7 8519692 0:001 comutl.c 1570] lvm_pvio: finished loading
[7 8519692 0:042 comutl.c 1426] lvm_thread_pvio: pv->name=hdisk4 hex blocks = 000000004dd00000 = 000000004dd00000
[7 8519692 0:042 comutl.c 1615] lvm_pvio: end, rc=0
[7 8519692 0:042 mkvg.c 1710] validate_pvs: PV type(DD_SCDISK/DD_SCRWOPT)
[2 8519692 0:042 mkvg.c 1718] validate_pvs: Mixed blk sizes! Other disks have 4096 block size. hdisk4 (512)
0516-1980 extendvg: Block size of all disks in the volume group must be the same.
       Cannot mix disks with different block sizes.

[7 8519692 0:042 utilities.c 1826] lvm_cfglock: name=hdisk5, pid=8519692, flags=0x1
[7 8519692 0:042 configutl.c 1115] lvm_config: call to hd_cfg, cmd=139
[7 8519692 0:042 utilities.c 2007] lvm_cfglock_query: name=hdisk5, cfglock_state=-261, owner_pid=0, owner_ppid=0
[7 8519692 0:042 configutl.c 1115] lvm_config: call to hd_cfg, cmd=139
[7 8519692 0:042 utilodm.c 1288] lvmdb_chkpvcfg(), rc=0
[7 8519692 0:042 comutl.c 1557] lvm_pvio: start
[7 8519692 0:042 comutl.c 1565] lvm_pvio: loading
[7 8519692 0:042 comutl.c 1570] lvm_pvio: finished loading
[1 8519692 0:043 comutl.c 1358] lvm_thread_pvio: FAIL: invalid pv, name=hdisk4
[7 8519692 0:083 comutl.c 1426] lvm_thread_pvio: pv->name=hdisk5 hex blocks = 000000004dd00000 = 000000004dd00000
[7 8519692 0:083 comutl.c 1324] lvm_valid_pvs_remain: inv pv name=hdisk4, pv status=0, errno=0
[7 8519692 0:083 comutl.c 1615] lvm_pvio: end, rc=0
[7 8519692 0:083 mkvg.c 1710] validate_pvs: PV type(DD_SCDISK/DD_SCRWOPT)
[2 8519692 0:083 mkvg.c 1718] validate_pvs: Mixed blk sizes! Other disks have 4096 block size. hdisk5 (512)
0516-1980 extendvg: Block size of all disks in the volume group must be the same.
       Cannot mix disks with different block sizes.

[7 8519692 0:083 mkvg.c 2244] num_invalid_pvs: name=hdisk4, status=0
[7 8519692 0:083 mkvg.c 2244] num_invalid_pvs: name=hdisk5, status=0
[1 8519692 0:083 mkvg.c 1887] validate_pvs: FAIL: Invalid PVs!, num_invalid_pvs failed, rc=2
[1 8519692 0:083 mkvg.c 740] main: FAIL: validate_pvs failed
[7 8519692 0:083 mkvg.c 161] cleanup_exit(), signal or line=741, cmd_progress=1
[1 8519692 0:083 mkvg.c 177] cleanup_exit: FAIL: pv_failures=2
0516-792 extendvg: Unable to extend volume group.
[7 8519692 0:083 utilities.c 1826] lvm_cfglock: name=hdisk4, pid=8519692, flags=0x2
[7 8519692 0:083 configutl.c 1115] lvm_config: call to hd_cfg, cmd=139
[7 8519692 0:083 utilities.c 1826] lvm_cfglock: name=hdisk5, pid=8519692, flags=0x2
[7 8519692 0:083 configutl.c 1115] lvm_config: call to hd_cfg, cmd=139
[7 8519692 0:083 utilities.c 1826] lvm_cfglock: name=data, pid=8519692, flags=0x2
[7 8519692 0:083 configutl.c 1115] lvm_config: call to hd_cfg, cmd=139
[E 8519692 0:110 mkvg.c 277] extendvg: exited with rc=1

# unset LVMT_OUT
# unset LVMGS_VERBOSE
# unset LVMT_VERBOSE
# exit


Compressed Dovecot Maildir on Debian

I just saved a few gigs with this. Figured I need to document this or I’ll never remember. :)

Add this into /etc/dovecot/conf.d/10*

# Enable zlib plugin globally for reading/writing:
mail_plugins = $mail_plugins zlib
# Enable these only if you want compression while saving:
plugin {
 zlib_save_level = 6 # 1..9; default is 6
 zlib_save = gz # or bz2, xz or lz4
}

Add this into /etc/dovecot/conf.d/20*

protocol imap {
  mail_plugins = zlib
}
protocol pop3 {
  mail_plugins = zlib
}

Remove extra spaces and leftover courier garbage

rename 's/\ /_/g' /home/jdavis/Maildir/.[a-zA-Z]*
rename 's/\__/_/g' /home/jdavis/Maildir/.[a-zA-Z]*
rename 's/\_\./\./g' /home/jdavis/Maildir/.[a-zA-Z]*
rm -r /home/jdavis/Maildir/courier*
rm -r /home/jdavis/Maildir/.[a-zA-Z]*/courier*

Create the script to compress all maildir files

#!/bin/sh
# Compress every message under a Maildir, one "cur" directory at a time.
compress_maildir () {
cd "$1" || exit 1
DIRS=`find . -maxdepth 2 -type d -name cur`
for dir in $DIRS; do
       echo $dir
       cd $dir || exit 1
       # messages with a size field (,S=) and no Z (compressed) flag yet
       FILES=`find . -type f -name "*,S=*" -not -regex ".*:2,.*Z.*"`
       #compress all files into ../tmp
       for FILE in $FILES; do
               NEWFILE=../tmp/${FILE}
               #echo bzip $FILE $NEWFILE
               if ! bzip2 -9 $FILE -c > $NEWFILE; then
                       echo compressing failed
                       exit 1
               fi
               #reset mtime
               if ! touch -r $FILE $NEWFILE; then
                       echo setting time failed
                       exit 1
               fi
       done
       echo Locking $dir/..
       if PID=`/usr/lib/dovecot/maildirlock .. 120`; then
               #locking successful, moving compressed files
               for FILE in $FILES; do
                       NEWFILE=../tmp/${FILE}
                       if [ -s $FILE ] && [ -s $NEWFILE ]; then
                               echo mv $FILE $NEWFILE
                               # keep the uncompressed original in /tmp as a backup
                               mv $FILE /tmp
                               # the trailing Z flag marks the message as compressed
                               mv $NEWFILE ${FILE}Z
                       else
                               echo mv failed
                               exit 1
                       fi
               done
               kill $PID
       else
               echo lock failed
               exit 1
       fi
       cd - >/dev/null
done
}

# Run the function on the Maildir given as the first argument.
compress_maildir "$1"
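
The find line in the script is the part I always have to re-read: it picks files whose name carries a size field (,S=) and whose flags after ":2," do not yet include Z. Here is the same test as a small Python sketch, handy for dry-running against a list of names. The function name and the sample filenames are made up for illustration.

```python
import re

# Same tests the find command applies: a size field in the name,
# and no Z among the flags after ":2,".
HAS_SIZE = re.compile(r",S=")
HAS_Z_FLAG = re.compile(r":2,.*Z")


def needs_compression(filename):
    """True if the script would pick this message up for compression."""
    return bool(HAS_SIZE.search(filename)) and not HAS_Z_FLAG.search(filename)


# Hypothetical Maildir filenames.
for name in ["1445259283.M1P2.mail,S=1000,W=1020:2,S",
             "1445259283.M1P2.mail,S=1000,W=1020:2,SZ"]:
    print(name, needs_compression(name))
```

Running the script twice is therefore harmless: anything already flagged Z is skipped.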

Actually RUN the script to compress all maildir files

./compress_maildir /home/jdavis/Maildir/



VIO server hangs

To be updated with resolution at some point.
This is the second time a secondary VIO server has hung with a UIO_WRITE in the kernel log.
The VIO servers have only been up 55 days.
Number R12 hung about a week and a half ago, but no dump was collected.
Number R22 hung this morning, and a dump was collected.
I couldn’t find anything juicy (see below), but I did find that E11 had lost its second internal boot disk.
I was able to reset that with chpv, but I’m wondering if there’s something going on with the SAS controllers.

Also, these have intermittent network hangs, and sometimes vtmenu times out.
I’m wondering if there’s some sort of power issue with the site.

---------------------------------------------------------------------------
LABEL:          DUMP_STATS
IDENTIFIER:     67145A39

Date/Time:       Mon Oct 19 09:19:49 EDT 2015
Sequence Number: 367
Class:           S
Type:            UNKN
WPAR:            Global
Resource Name:   SYSDUMP

Description
SYSTEM DUMP

Probable Causes
UNEXPECTED SYSTEM HALT

User Causes
SYSTEM DUMP REQUESTED BY USER

        Recommended Actions
        PERFORM PROBLEM DETERMINATION PROCEDURES

Failure Causes
UNEXPECTED SYSTEM HALT

        Recommended Actions
        PERFORM PROBLEM DETERMINATION PROCEDURES

Detail Data
DUMP DEVICE
/dev/lg_dumplv1
DUMP SIZE
            1108637696
TIME
Mon Oct 19 08:54:43 2015
DUMP TYPE (1 = PRIMARY, 2 = SECONDARY)
           1
DUMP STATUS
           0
ERROR CODE
0000 0000 0000 0000
DUMP INTEGRITY
after uncompressing
FILE NAME

PROCESSOR ID
           0
---------------------------------------------------------------------------
LABEL:          MINIDUMP_LOG
IDENTIFIER:     F48137AC

Date/Time:       Mon Oct 19 09:19:15 EDT 2015
Sequence Number: 366
Class:           O
Type:            UNKN
WPAR:            Global
Resource Name:   minidump

Description
COMPRESSED MINIMAL DUMP

Probable Causes
System dumped. Minimal Dump collected in Non-Volatile Memory.

        Recommended Actions
        PERFORM PROBLEM DETERMINATION PROCEDURES

Detail Data
Minidump Data:
4D33 0D4B 2D17 0060 0027 0027 0032 0048 0000 0000 4214 7800 0000 0000 DF18 67B6
0000 0003 0001 5624 ED78 05E5 36A8 6575 DDDD 0002 0004 0000 000A 000D 5624 E813
0000 0000 000F 0000 2F64 6576 2F6C 675F 6475 6D70 6C76 3165 6345 A4C0 9001 2000
0024 5648 A09B 00A0 2189 8610 234A 9C18 8180 2A14 53B2 4C81 22E5 C990 8910 DDBC
91D3 260C 9B86 2249 9A6C 0865 6300 2408 0040 9060 825F A802 2C18 F2A4 0914 0001
4614 2848 11C0 837F 18E7 FDEB 078A A280 5F62 1055 1C49 7224 BC82 080A 0202 F0E3
573E 0030 1894 5CD9 F225 8C92 382B 76A8 B120 1E00 0500 C02D 4053 3200 49A3 0565
C459 6584 E20E 4DCA 7062 C448 AC8C 8DA2 3909 0A30 C682 4449 9F0F 1440 0090 018F
4725 B52A 0A50 42CB 9F47 AEB0 BEFE 2C00 8401 08B5 C6EE CEA4 8800 920C 9229 0B7A
1544 0427 8601 70E4 BC19 53C6 0D1D 3979 56C4 B859 1104 2958 010A 1800 7032 2B59
0033 7E1A 9177 0857 891F 2B01 8018 0B00 884F 82C0 00E9 8053 3103 68D1 A44D 9B25
8800 825E 8A0B 80A2 F04C 9CA2 EDD0 0A46 977E 112A F56A 8264 881F C75D 1AC6 9FD4
BD29 573C 882A 7541 BF23 178C 3CE0 8B9E CFCD 800E 5401 52FB 76F2 DC3C 72A4 46F1
BB3D F2D1 8C92 0D7B 43B1 0388 0404 5007 4C0D 1560 D701 0407 1034 4480 0579 D38A
2FEF 50E4 2084 123E 1861 8316 5648 2186 1B12 34E1 851E 66B8 6071 2496 6862 4102
2010 D144 245D 94D1 461D A108 9248 2712 D481 389C A194 D307 7B8D 7540 5A25 F148
D24A 633D 00C8 0E81 F804 9450 6C94 34D0 4848 11E4 C216 6E60 E014 5452 4D55 D555
4779 B615 135E 9104 9658 0068 B09B 564A E4A2 9612 B8B8 85A2 314C C405 A592 7431
A09D 56C6 B0B7 174A E08D 1418 415E 1152 5841 F581 30C0 7608 25B6 5863 8F45 36D9
4896 61A6 598E 15B1 165E 8D04 6900 400D D4E5 0208 09CE B1A6 2741 04C8 D08E A599
6EDA 2920 7BA0 E4DB A5DA 3C89 2949 9A72 FADE 21B9 0CF2 49A8 D011 576B A7A3 4C81
1D04 7716 C49D 7704 F509 DC78 E59D 67D4 001B 9C99 AAAD CAE5 42CA 00F3 D597 5C3C
D480 4102 22D8 4540 9B82 0C86 D821 001F 6A08 22BA 1972 B86E BAEE AA0B C088 B3D6
3BEB 002A 4A54 928B FF68 C491 4733 1637 4008 2575 06C0 9F34 B914 0038 320D C992
C2E0 D4BB 534F 3F05 B5C0 B823 1D9A D50B F1C0 40CD 9551 9534 D53F 5B62 95D5 566C
8439 D298 6345 7B56 2F6B F2E2 269E 6CC8 B9F2 5C75 8D8A 2718 3821 DBD7 5F7E 5214
A8CA 906A 6C6C A28A 31E6 1864 9215 ABDA 6599 6D66 7041 A82E 8BE9 0600 DCD0 6929
4FF0 DA1A 45A5 9E5A 11D6 5ADF 9A4B 2992 B85A 1F41 E269 632F 4964 779A 8A06 5E47
E759 DC66 17F3 CEB0 4E03 706C 45CA 1624 5E45 E499 0780 0101 0022 401B 3613 8477
B5C6 E091 2D45 DB76 CBC4 2BE1 624C AFB9 EFB6 CBB9 BCF1 86FE 79BC 9BBF 6D3A 7104
E4CB E248 FCFA 1BA3 5601 1717 2544 0519 8CF0 C336 353C 52C2 3649 CC93 9216 D351
52B9 38CD 90C2 09F6 809C 2555 5699 9CD3 568C 104D 11CB 0070 20ED 5AC1 AC09 CCCC
0040 DFF8 F438 33F0 D05B C856 E473 E03A 0A06 8020 D253 6428 A2A8 28AA 74A3 4D43
0AF5 A453 1354 B5E0 B372 0040 0ED4 B985 15FE E135 D714 2436 B3A9 88FF 0078 AB5B
5CE1 0F6A 0B8F 704E 5790 0506 100D 1FA8 1B71 2CD8 C034 608E 37C4 22C9 DF28 823E
000C 8E22 853B CF40 1270 0FA2 5084 83CA B9C5 1F4E 31B9 8254 0E0C AA08 44E6 0A52
3A78 8DEE 87EC 3A97 0F83 D839 218A 8878 144C E248 0AA0 BA7D 4DC5 7500 0B09 8966
1725 DBA9 0477 9CD1 9DF9 B098 3F13 4D0C 7842 9116 4556 4791 D925 2008 B4B0 84F2
44A6 A5E6 75E9 64C6 6045 FB08 42BD 0E5C 4F09 C558 1331 B8B7 1556 7CAF 2074 AA0B
CFDE F2BA DDF1 A584 0713 DAFA 0645 92F7 5184 3BF2 6314 D31E 5599 FB49 ED2C C450
CB0F EC80 09E2 7400 003B C0CD 10CA A18B 69CC 8735 3118 0912 F987 A94F 86F2 3DA3
DC05 6A08 F29C AF51 A407 71E0 46F9 4A78 C282 E092 132A 0484 029E A133 00B8 5294
E5F0 45D7 0842 9F92 1C13 96E5 18C6 39BC 9384 BBB8 B022 6444 D11B 0BF2 CCD1 8C92
1804 2408 06C6 000C BF89 865B 6158 01A8 7823 2E1E 126F 88F0 F41C 1141 07C4 7852
A874 4ACC 2741 0CD0 C416 3D11 4651 A411 71A8 58BB F499 6864 231B 4931 0B32 3B63
FEAE 6242 E143 4936 03A5 47DE 8016 1803 2596 D8C8 3C2E 5514 8ECC 9823 00A8 E781
3B26 634D C8E0 A331 98F1 4788 0492 0119 ED9E 31CA B927 DE20 F24F 0008 1423 8B06
BF48 2ECD 514E 8B54 D428 8526 6468 D20E 34CD 8A07 00D0 0351 9AE3 0346 6226 2A55
5992 5E9A 68A9 4D85 E553 BF50 C08A E052 9780 039A D56E 1907 601E EE50 1310 47B1
B0EA D410 C8A1 8623 61AB 564D 5009 6A5A 9324 1455 CB36 0922 576F 9AE3 0449 48CD
38CB 9980 7352 230C 6A20 C20E C945 CF79 8ACE B1F5 94A7 3D23 844F 7D2A F100 FD64
DD3F FF25 2329 1267 6065 2C28 6FAE 5893 A068 9122 BC0B 8AEF 28B6 A405 94AF 2009
FAA8 416A F00C 43AC 9124 232B D95E 9D64 0C76 8894 7A1F B863 33D6 C40C 95B2 A3A5
0178 69F9 B612 B19A 1EEC A68A 1414 4E1C 79B4 F825 4D92 40B5 9FA4 2E89 2225 30E3
A8E8 200E 8F7E E054 4264 50AA 0449 6545 56C9 B659 8DB7 BC5C A5A5 A8BC 9ACB 5D8A
9595 6435 AB01 B035 811A ACED BD5A 3544 17E0 5A11 00FB 7511 2FB0 2B00 AE89 4292
C82A B414 31B0 3C86 608E 4584 1300 8335 67E5 C290 8A59 1EAC 9D8C 7DEC 641B 3B62
111F D1B2 2846 4066 2BD2 3A80 7656 A09E 2128 41AC 4810 DE05 000E A725 ADC2 62EA
C587 B616 C67C D96B 0248 4005 4EDC 7624 B975 A36C 7122 8063 30E0 B716 2153 08EE
188D 3541 838F 4E6E E948 C307 09B5 1C23 27E7 BBAF 4173 BAC8 E91A CD20 485B D44F
EB57 C9ED 12B5 7B4A 8086 26EF B03F 9210 6C08 4ED5 C707 0180 82A9 AEB7 AAB3 BAB3
53F7 E180 AE92 15AC 2414 B309 47F2 CBF3 04A8 025A E824 4504 ADD5 7EEC 8A99 6B2B
08A5 BD79 0EC4 2898 C163 D6E6 9201 B0E9 099F 8300 D711 2739 356C 046E 8921 0517
4600 88E7 F54E C9DA DA88 B82E A2AE E555 5914 9F2E 012B A648 8B39 0B3B CF0E 14C2
---------------------------------------------------------------------------
LABEL:          SYS_RESET
IDENTIFIER:     1104AA28

Date/Time:       Mon Oct 19 09:19:15 EDT 2015
Sequence Number: 365
Class:           S
Type:            TEMP
WPAR:            Global
Resource Name:   SYSPROC

Description
SYSTEM RESET INTERRUPT RECEIVED

Probable Causes
SYSTEM RESET INTERRUPT

Detail Data
KEY MODE SWITCH POSITION AT BOOT TIME
normal
KEY MODE SWITCH POSITION CURRENTLY
normal
---------------------------------------------------------------------------
LABEL:          ERRLOG_ON
IDENTIFIER:     9DBCFDEE

Date/Time:       Mon Oct 19 09:20:41 EDT 2015
Sequence Number: 364
Class:           O
Type:            TEMP
WPAR:            Global
Resource Name:   errdemon

Description
ERROR LOGGING TURNED ON

Probable Causes
ERRDEMON STARTED AUTOMATICALLY

User Causes
/USR/LIB/ERRDEMON COMMAND

        Recommended Actions
        NONE

---------------------------------------------------------------------------
LABEL:          CONSOLE
IDENTIFIER:     7F88E76D

Date/Time:       Mon Oct 19 08:54:09 EDT 2015
Sequence Number: 363
Class:           S
Type:            PERM
WPAR:            Global
Resource Name:   console

Description
SOFTWARE PROGRAM ERROR

Probable Causes
SOFTWARE PROGRAM

Failure Causes
SOFTWARE PROGRAM

        Recommended Actions
        REVIEW DETAILED DATA

Detail Data
USER'S PROCESS ID:
               7536648
DETECTING MODULE
conwrite
FAILING MODULE
UIO_WRITE
RETURN CODE
           6
ERROR CODE
           0

We gathered a snap -ac for IBM, and while waiting, I did a quick look in the dump.

cd /tmp/ibmsupt/dump
chfs -a size=+4G /tmp
uncompress unix.Z
dmpuncompress dump.BZ
kdb dump unix
kdb dump unix
dump mapped from @ 700000000000000 to @ 7000000df1867b6
           START              END <name>
0000000000001000 0000000004150000 start+000FD8
F00000002FF47600 F00000002FFDF9C8 __ublock+000000
000000002FF22FF4 000000002FF22FF8 environ+000000
000000002FF22FF8 000000002FF22FFC errno+000000
F1000F0A00000000 F1000F0A10000000 pvproc+000000
F1000F0A10000000 F1000F0A18000000 pvthread+000000
Dump analysis on CHRP_SMP_PCI POWER_PC POWER_7 machine with 16 available CPU(s)  (64-bit registers)
Processing symbol table...
.......................done
read vscsi_scsi_ptrs OK, ptr = 0x0
vmcKdb_anchor_p=0x0000000000000000
vmc kdb command extension, 64 bit version, is loaded.  Commands are:
vmc - load extension and show help text
vmcu - unload extension
vmcd - VMC dump anchor, adapter, connections
vmcfa - VMC fetch anchor from symbol table
vmcsa address - VMC set anchor
vmcdb - VMC dump connection buffers
vmcdm - VMC dump connection messages
vmcdq - VMC dump queue
vmct directoryname - VMC Internal Adapter trace
vmctbm directoryname - VMC buffer and message trace
vmcKdb_anchor_p=0x0000000000000000

### The time the crash was forced
(0)> dw time
time+000000: 00000000 5624E813 F1000A00 20295000  ....V$...... )P.
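
The value 5624E813 in that output is a Unix timestamp in hex. A quick Python sketch to decode it, assuming the word is plain seconds since the epoch; the helper name is made up.

```python
from datetime import datetime, timezone


def kdb_time(hex_word):
    """Interpret a hex word from kdb as seconds since the Unix epoch (UTC)."""
    return datetime.fromtimestamp(int(hex_word, 16), tz=timezone.utc)


print(kdb_time("5624E813").isoformat())  # 2015-10-19T12:54:43+00:00
```

12:54:43 UTC is 08:54:43 EDT, which lines up with the dump TIME field in the errpt entry.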

### Basic stats on the system
(0)> stat
SYSTEM_CONFIGURATION:
CHRP_SMP_PCI POWER_PC POWER_7 machine with 16 available CPU(s)  (64-bit registers)

SYSTEM STATUS:
sysname... AIX
nodename.. viopr22
release... 1
version... 6
build date May  4 2015
build time 12:52:42
label..... 1516D_61d
machine... REDACTED
nid....... REDACTED
time of crash: Mon Oct 19 08:54:43 2015
age of system: 55 day, 6 hr., 19 min., 15 sec.
xmalloc debug: enabled
FRRs active... 0
FRRs started.. 0

### Process table
(0)> status
CPU INTR      TID  TSLOT     PID  PSLOT  PROC_NAME
  0          20005      2   20004      2  wait
  1         190033     25   F001E     15  wait
  2         1A0035     26  100020     16  wait
  3         1B0037     27  110022     17  wait
  4         1C0039     28  120024     18  wait
  5         1D003B     29  130026     19  wait
  6         1E003D     30  140028     20  wait
  7         1F003F     31  15002A     21  wait
  8         210043     33  16002C     22  wait
  9         220045     34  17002E     23  wait
 10         230047     35  180030     24  wait
 11         240049     36  190032     25  wait
 12         25004B     37  1A0034     26  wait
 13         26004D     38  1B0036     27  wait
 14         27004F     39  1C0038     28  wait
 15         280051     40  1D003A     29  wait
 16-31   Disabled

### Stack trace
(0)> set 18
 18 trace_back_lookup         true

(0)> where
pvthread+000200 STACK:
[0009BFA8].h_cede+000014 ()
[0007BEF0]waitproc+000510 ()
[0020A4B0]procentry+000010 (??, ??, ??, ??)
[kdb_read_mem] no real storage @ FFFFFFFFFFF8C90

### Error Report entries still in ram
(0)> errpt
ERRORS NOT READ BY ERRDEMON (ORDERED CHRONOLOGICALLY):

Error Record:
erec_flags ..............        1
erec_len ................       60
erec_timestamp .......... 5624E813
erec_rec_len ............       3C
erec_cid ................        0
erec_dupcount ...........        0
erec_duptime1 ........... 5624E811
erec_duptime2 ........... 5624E813
erec_rec.error_id ....... 7F88E76D
erec_rec.resource_name .. console
00000000 00730008 636F6E77 72697465  .....s..conwrite
00325549 4F5F5752 49544500 00000006  .2UIO_WRITE.....
00000000 00000000                     ........

Error Record:
erec_flags ..............        1
erec_len ................       48
erec_timestamp .......... 5624E813
erec_rec_len ............       24
erec_cid ................        0
erec_dupcount ...........        0
erec_duptime1 ...........        0
erec_duptime2 ...........        0
erec_rec.error_id ....... 1104AA28
erec_rec.resource_name .. SYSPROC
6E6F726D 616C0000 6E6F726D 616C0000  normal..normal..
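
The hex words in these records are just ASCII, and kdb prints the decoded text on the right. If you only have the hex, this Python sketch does the same decoding; the function name is hypothetical.

```python
def decode_words(words):
    """Turn space-separated hex words into text, with NUL and other
    unprintable bytes shown as dots, the way kdb prints them."""
    raw = bytes.fromhex(words.replace(" ", ""))
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)


print(decode_words("636F6E77 72697465"))
print(decode_words("6E6F726D 616C0000 6E6F726D 616C0000"))
```

The first call gives the detecting module (conwrite) and the second gives the two key mode switch positions from the SYS_RESET entry.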

### VMM Error entries still in ram
(0)> dw vmmerrlog 9
vmmerrlog+000000: 00000000 53595356 4D4D2000 00000000  ....SYSVMM .....
vmmerrlog+000010: 00000000 00000000 00000000 00000000  ................
vmmerrlog+000020: 00000000                                   ....

### Program errors in memory
(0)> dw prog_log 8
expected symbol or address

### Memory status - notice bad pages is also 4GB.
### I think this is memory_max, because free pgsp blocks is high.
(0)> vmker

VMM Kernel Data:
        (use [-dr | -seg | -lrul | -psize | -pvl | -skey | -ras] for specific info)

eye catch         (eyec)       : 766D6B6572564D4D
total page frames (nrpages)    : 00200000
bad page frames   (badpages)   : 00100000
good page frames  (goodpages)  : 00100000
ipl page frames   (iplpages)   : 00180000
total pgsp blks   (numpsblks)  : 00100000
free pgsp blks    (psfreeblks) : 000E5C42
rsvd pgsp blks    (psrsvdblks) : 00001000
max file pageout  (maxpout)    : 00002001
min file pageout  (minpout)    : 00001000
repage table size (rptsize)    : 00010000
next free in rpt  (rptfree)    : 00000000
repage decay rate (rpdecay)    : 0000005A
global repage cnt (sysrepage)  : 00000000
swhashmask        (swhashmask) : 000FFFFF
cachealign        (cachealign) : 00001000
overflows         (overflows)  : 004627C2
reloads           (reloads)    : 0056DDCC
alias hash mask   (ahashmask)  : 00007FFF
max pgs to delete (pd_npages)  : 00001000
vrld xlate hits   (vrldhits)   : 00000001
vrld xlate misses (vrldmisses) : 0000079F
pgsp bufst waits (psbufwaitcnt): 0078C9C6
fsys bufst waits (fsbufwaitcnt): 000008B4
rsys bufst waits(rfsbufwaitcnt): 00000490
xpager bufst waits(xpagerbufwaitcnt): 00000636
phys_mem(s)      (phys_mem[0]) : 00280000
phys_mem(s)      (phys_mem[1]) : FFFFFFFF
phys_mem(s)      (phys_mem[2]) : 00000000
THRPGIO buf wait     (_waitcnt)  : 00000000
THRPGIO partial cnt (_partialcnt): 00000000
THRPGIO full cnt    (_fullcnt)   : 00000000
num lgpg's added    (nlgpgadded) : 00000000
num lgpg's free'd   (nlgpgfreed) : 00000000
# frd lgp prepal (nlgpgfreedini) : 00000000
num cow mappings    (cow_pages)) : FFFFFFFFFFFF21E6
num cow page-ins    (cow_pgins)) : 066ADE1A
nosib pg-copies (npgcopies_nosib): 00025331
mmap alias reload (mmap_areload) : 00000000
mmap soft alias r (mmap_areload2): 00000000
AME exp. mem size (ame_mem_npgs) : 00000000
AME max  mem sz (ame_maxmem_npgs): 00000000
AME mem exp factor  (ame_factor) : 00000000
AME sys mem view(ame_sys_memview): 01
klock pf rsvdblks(klk_pfrb_pct): 000001F4
LSA ESID alloctor      (lsa_esid_alloc): 0000
LSA 1tb sh thresh     (lsa_sh_alias_th): 000C
LSA 1tb unsh thresh (lsa_unsh_alias_th): 0100
INVALID_HANDLE        (inval_vmh): FFFFF080
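
To put the frame counters in human terms, here is a small Python sketch. It assumes the counters are 4 KB page frames, which is my assumption, though it does agree with the roughly 4GB that memstat reports. The names are made up for illustration.

```python
PAGE_BYTES = 4096  # assumption: these counters are in 4 KB page frames


def frames_to_gib(hex_frames):
    """Convert a hex frame count from vmker into GiB."""
    return int(hex_frames, 16) * PAGE_BYTES / 2**30


for label, value in [("nrpages", "00200000"), ("badpages", "00100000"),
                     ("goodpages", "00100000"), ("iplpages", "00180000")]:
    print(label, frames_to_gib(value))
```

That works out to 8 GiB total, 4 GiB good, 4 GiB "bad", and 6 GiB at IPL.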

### Dynamic reconfig says we've had memory removed.
(0)> vmker -dr

VMM DR Related Data:

max page frames.......... 000000200000  frames on ipl............ 000000180000
current frames........... 000000100000  # bad frames............. 000000100000
DR mem adds.................. 00000001  DR mem removes............... 00000017
DR rsvd mem adds............. 00000000  DR rsvd mem remove........... 00000000
DR lmb reaff ................ 00000000  DR lmb reaff failed.......... 00000000
DR miss reloads ena.......... 00000002  DR miss reloads dis.......... 00000006
DR mig refcntmiss............ 00000000  DR migrate trans............. 00000000
DR mark    trans............. 00000000  DR v_look migr miss.......... 00000000
DR total migrates............ 000F1F30
DR fixlmb migrates........... 00000010  DR serv migrates............. 0000173E
DR lwmig DMA mapper.......... 00000000
MPSS broken migs............. 000006F8  MPSS brk mig errs............ 00000000
MPSS chunk migs.............. 000007CC  MPSS chunk migerrs........... 00000000
DR vmpool adds............... 00000000  DR vmpool removes............ 00000000
current maxvmpool............ 00000001
DR lpgvmp adds............... 00000000  DR lpgvmp remsoves........... 00000000
DR mempool adds.............. 00000000  DR mempool removes........... 00000000
DR memory moves.............. 00000000  DR memp rebal calls.......... 00000011
DR memp transients........... 00000000
Calls to alloclmb............ 00000000  Calls to freelmb............. 00000000
num lgpg's added............. 00000000  num lgpg's free'd............ 00000000

### We've had 6 failed page creates.  Is this important?
(0)> vmker -pvl
pvlist overflows             (pvl_ovflows)  : 00002CC5 (00000005 per group)
failed page create           (pvl_grow_fail): 00000006
successful page create       (pvl_grow_succ): 00000007
failed page create (hard)    (pvl_hard_fail): 00000000
successful page create (hard)(pvl_hard_succ): 00000000
successful page free         (pvl_shrink)   : 00000000
skipped grows because no PAL (pvl_nopal)    : 00000025
# entries per group on boot  (pvl_bootavgpg): 00000008
PVLIST kproc thread id       (pvl_tid)      : 00080011
Start of PVLIST array        (pvl_first)    : F200800020000000
Current end of PVLIST array  (pvl_last)     : F200800020200000
Maximum PVLIST eaddr + 1                    : F200800024000000
Current number of PVLIST entries            : 00020000
Max number of PVLIST entries (pvl_maxels)   : 00400000
Average length of free list  (pvl_avgfree)  : 00000000
eaddr to use for RMLMB fail  (pvl_pinaddr)  : F10013A650000000
PVLIST lock                  (pvl_lock)     : 00000000

### Memory shows we have low free, high pinned.
(0)> memstat

Pageable Memory Status

Total pageable frames:    00000F74B0    3.9GB   -----
   4K pageable frames:    0000013DB0  317.7MB     8.0% total pageable
  64K pageable frames:    000000E370    3.6GB    91.9% total pageable

Total free frames:        0000001636   22.2MB     0.5% total pageable
   4K free frames:        0000000746    7.3MB     2.2% 4K pageable
  64K free frames:        00000000EF   14.9MB     0.4% 64K pageable

Total nrsvd frames:       0000000000    0.0MB     0.0% total pageable
   4K nrsvd frames:       0000000000    0.0MB     0.0% 4K pageable

Total comp frames:        00000F51DA    3.8GB    99.1% total pageable

Total perm frames:        0000000B40   11.3MB     0.2% total lruable
   4K perm frames:        0000000B40   11.3MB     3.6% 4K lruable

Total lruable frames:     00000F5880    3.8GB   -----
   4K lruable frames:     0000013810  312.1MB     7.9% total lruable
  64K lruable frames:     000000E207    3.5GB    92.0% total lruable

Total pinned frames:      00000C47FF    3.1GB    79.4% total pageable
   4K pinned frames:      000000FE4F  254.3MB    80.0% 4K pageable
  64K pinned frames:      000000B49B    2.8GB    79.4% 64K pageable

Total pinnable remaining: 0000001568   21.4MB     0.5% total pageable
   4K pinnable remaining: FFFFFFFFFFFFFFD8    0.0TB     0.0% 4K pageable
  64K pinnable remaining: 0000000159   21.6MB     0.5% 64K pageable

!!! 4K free frames less than minfree.
!!! Total perm frames below minperm.
*** 4K perm frames within 5% of minperm.
!!! 4K pinned frames within 5% of maxpin.
!!! 64K pinned frames within 5% of maxpin.
!!! 4K free frames less than psm_minfree_thresh.
*** 64K free frames between psm_maxfree_thresh and psm_minfree thresh.
!!! 4K page size above psm_maxpin limit.
!!! 64K page size above psm_maxpin limit.

### There's nothing waiting on paging.
(0)> th -w WMEM

(0)> th -w WPGIN

(0)> th -w WPGOUT

(0)> th -w WFREEF

### No pending I/Os
(0)> pdt *
               SLOT   NEXTIO           DEVICE  DMSRVAL    IOCNT    OLDIO </name><name>

vmmary_pdt+000000 0000 FFFFFFFF 8000000A00000002 00000000 00000000 00000000 paging
vmmary_pdt+007400 0080 FFFFFFFF 02BE5D40 00000000 00000000 00000000 remote
vmmary_pdt+0074E8 0081 FFFFFFFF 8000000A00000009 00000000 00000000 00000000 local client
vmmary_pdt+0075D0 0082 FFFFFFFF 8000000A00000008 83802E080 00000000 00000000 local client
vmmary_pdt+0076B8 0083 FFFFFFFF 8000000A00000005 00000000 00000000 00000000 local client
vmmary_pdt+0077A0 0084 FFFFFFFF 8000000A00000006 00000000 00000000 00000000 local client
vmmary_pdt+007888 0085 FFFFFFFF 8000000A00000007 00000000 00000000 00000000 local client
vmmary_pdt+007970 0086 FFFFFFFF 8000000A0000000B 00000000 00000000 00000000 local client
vmmary_pdt+007A58 0087 FFFFFFFF 8000000A0000000A 00000000 00000000 00000000 local client
vmmary_pdt+007B40 0088 FFFFFFFF 8000000A0000000C 00000000 00000000 00000000 local client
vmmary_pdt+007C28 0089 FFFFFFFF 8000000A00000003 00000000 00000000 00000000 local client
vmmary_pdt+007D10 008A FFFFFFFF 8000002D00000002 00000000 00000000 00000000 local client

### No locks
(0)> lq
                    BUCKET HEAD            COUNT

(0)> dla
 No deadlock found
</name>

Wmarow’s IOPS Calculator

Marek Wołynko has discontinued the wmarow iops calculator, and pulled down the page.

I use this all the time, though I normally search on “wmarrow iops”, but whatever.

Anyway, archive.org still has this, and just to be safe, I downloaded that and saved it for my own use.

I probably won’t update it, but I did clean it up just a tiny bit (so much social media and ad tracking removed!).

http://omnitech.net/iops/

If my htaccess is ever broken, I also run it at home on the same path, but none of YOU need to know that.
Either you know the server, or you don’t.


Problem Solving

If you want to understand how something happened, then say that.
If you want a solution to a problem, then say that.
If you want to complain and blame people, then say that.

Do not mix these up, because listeners have a choice.

How to understand problems:

  • Identify expectations
  • Identify deviations from expectations
  • Identify causes of deviations
  • Do not reject information you do not understand
  • Obtain more information about things you do not understand

How to solve problems:

  • Understand the problems first
  • Set new or renewed expectations
  • Identify requirements to meet expectations
  • Take action to meet expectations

How to complain and blame people:

  • Ask for understanding
  • Refuse to accept information that would help you understand problems.
  • Interrupt and tell people how bad they are for the things you do not understand.
  • Assume everyone is not telling you the truth.
  • Accuse people of being incompetent.
  • Notice people walk away and/or hang up the phone.

LED 0088 on NIM install & Upgrade

RE: LED 0088 on NIM install & Upgrade
When performing a NIM restore + upgrade at the same time, you have to run no-prompt.
You do this by editing the bosinst.data file from the mksysb,
then you define and allocate that as a NIM resource.

One of the things you have to do is provide the target disks.
If the target disks do not exist, such as only one LUN when you thought there were two, the LPAR will HANG with LED 0088.

You might see something like this on the console:
Cannot run a 64-bit program until the 64-bit

       environment has been configured. See the system administrator.

eval /usr/lpp/bosinst/bi_io -c < /dev/console

                                    Erasing Disks
       Please wait…
       Approximate     Elapsed time
    % tasks complete   (in minutes)

0042-008 NIMstate: Request denied – Method_req

Or you might see
MnM_Restore_NIM_Required_Files…

but nothing more. It will just hang there indefinitely.

Give it 30 mins before you call it hung, but you might want to do an RTE install so you can go to the menus and verify you have the right disks listed.
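
A quick pre-check can catch the missing-disk case before you netboot. This is only a sketch, assuming you can still run lspv on the client beforehand; the function names and the paths in the usage comment are made up for illustration. It pulls the HDISKNAME values out of a bosinst.data file and reports any that are not in a list of disks that really exist.

```shell
# List the disk names requested by the target_disk_data stanzas of a
# bosinst.data file, one per line. Empty HDISKNAME values are skipped.
bosinst_target_disks() {
    # $1 = path to bosinst.data
    awk -F= '/^[ \t]*HDISKNAME[ \t]*=/ {
        gsub(/[ \t]/, "", $2)      # strip blanks around the value
        if ($2 != "") print $2
    }' "$1"
}

# Print every requested disk that is missing from a list of existing disks.
# $1 = bosinst.data path, $2 = file with one existing disk name per line
missing_target_disks() {
    for d in $(bosinst_target_disks "$1"); do
        grep -qx "$d" "$2" || echo "$d"
    done
}

# Hypothetical usage on the client before the install:
#   lspv | awk '{print $1}' > /tmp/disks.txt
#   missing_target_disks /export/nim/bosinst.data /tmp/disks.txt
```

If it prints anything, fix the bosinst.data (or the LUN mapping) before starting the no-prompt install.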


VIO rmrep fails

root@vio2 /
# ioscli mkrep -sp rootvg -size 1G
All or parts of the Virtual Media Repository exist.
Run the rmrep command to cleanup the repository

root@vio2 /
# ioscli rmrep -f
An Error occured when attempting to remove the repository

root@vio2 /
# exportfs -ua

root@vio2 /
# ioscli rmrep -f
Virtual Media Repository Removed
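
If this comes up often, the retry can be wrapped in a function. This is a sketch of the session above and nothing more: I only know that unexporting NFS let rmrep finish in this one case, and the function name is made up. Note that exportfs -ua unexports everything on the VIO server, so check what is exported first.

```shell
# Try to remove the Virtual Media Repository. If that fails, unexport all
# NFS exports (as in the session above) and try exactly once more.
rmrep_with_unexport() {
    ioscli rmrep -f && return 0
    echo "rmrep failed; unexporting NFS exports and retrying" >&2
    exportfs -ua
    ioscli rmrep -f
}

# Hypothetical usage, as root on the VIO server:
#   rmrep_with_unexport && ioscli mkrep -sp rootvg -size 1G
```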


Sometimes, AIX migrations are a pain…

*************************************************************
Mksysb\Migration Enabled
*************************************************************

Running: Init_Target_Disks...
Running: Other_Initialization...
Running: Fill_Target_Stanzas...
Running: Check_Other_Stanzas...
Could not load program ls:
Cannot run a 64-bit program until the 64-bit
environment has been configured. See the system administrator.

This is because the default in 32-bit systems is to NOT load the 64-bit kernel. It is just another line to override in the bosinst.data file.

So, the total list so far is as follows. The CONTROL_FLOW stanza must be set like this:

     MKSYSB_MIGRATION_DEVICE = network
     INSTALL_METHOD = migrate
     PROMPT = no
     EXISTING_SYSTEM_OVERWRITE = yes
     RECOVER_DEVICES = no
     INSTALL_64BIT_KERNEL = yes

target_disk_data should have one entry for each of hdisk0 and hdisk1

target_disk_data:
     PVID =
     PHYSICAL_LOCATION =
     CONNECTION =
     LOCATION =
     SIZE_MB =
     HDISKNAME = hdisk0
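
To double check a bosinst.data before defining it as a NIM resource, something like this can compare it against the list above. It is a rough sketch with a hypothetical function name; it matches keys anywhere in the file and does not confirm they sit inside the control_flow stanza.

```shell
# Print every required setting that is missing from, or set differently in,
# the given bosinst.data file. Prints nothing if all six match.
check_control_flow() {
    # $1 = path to bosinst.data
    for want in \
        "MKSYSB_MIGRATION_DEVICE=network" \
        "INSTALL_METHOD=migrate" \
        "PROMPT=no" \
        "EXISTING_SYSTEM_OVERWRITE=yes" \
        "RECOVER_DEVICES=no" \
        "INSTALL_64BIT_KERNEL=yes"
    do
        key=${want%%=*}
        val=${want#*=}
        # Take the first "KEY = value" line for this key, with blanks removed.
        have=$(awk -F= -v k="$key" '{
            name = $1; gsub(/[ \t]/, "", name)
            if (name == k) { v = $2; gsub(/[ \t]/, "", v); print v; exit }
        }' "$1")
        [ "$have" = "$val" ] || echo "$key: want '$val', have '$have'"
    done
}

# Hypothetical usage:
#   check_control_flow /export/nim/bosinst.data
```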

MIMIX for AIX – Misc Troubleshooting

Because there is NOTHING on the web about this.
PRODUCT: Recover Now / Double Take / MIMIX / EchoStream

Vision Solutions bought Double-Take. Double-Take wrote Recover Now, which is called MIMIX on AS-400. The replication tools underneath are called “EchoStream”.

NOTE: Documentation is hard to find, but here is a shortened URL form of the Windows docs: http://omnitech.net/u/rn35docs

Most functions can be managed from the web GUI:
http://127.0.0.1:8410/ui/portal
Obviously, put your correct IP here if you are not on the same host.

Install Licenses

Stop the license manager on PRIMARY:
stopsrc -cs scrt_lca-1

Stop the license manager on BACKUP:
stopsrc -cs scrt_aba-1

Copy the new license files:
scp -rp NIMSERVER:/export/Vision/license.perm/*_`hostname`_ES_node_license.properties /usr/scrt/run/node_license.properties

Start the license manager on PRIMARY:
startsrc -s scrt_lca-1

Start the license manager on BACKUP:
startsrc -s scrt_aba-1

Define initial contexts

/usr/scrt/bin/rtdr -C PRIMARYID -F BACKUPID setup
PRIMARYID is usually 1, and BACKUPID is usually 101.

Query RN Contexts

Context 1 is Primary. DR shows BACKUP for this, and prod shows PRODUCTION for this.
Context 101 is Recovery. DR shows PRODUCTION for this, and prod shows BACKUP for this.

root@BACKUPNODE
# /usr/scrt/bin/sccfgd_getctxs
HOSTID HEXNUMBER
IPADDRESS MULTIPLELINES
BACKUP 1
PRODUCTION 101

root@PRODUCTIONNODE
# /usr/scrt/bin/sccfgd_getctxs
HOSTID DIFFERENTHEXNUMBER
IPADDRESS MULTIPLELINES
PRODUCTION 1
BACKUP 101
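
If you need the role in a script, the two role lines can be picked out with awk. This assumes the output really looks like the samples above (role name, then context ID), which I have only seen on these two nodes; ctx_role is a made-up name.

```shell
# Read sccfgd_getctxs-style output on stdin and print this node's role
# (PRODUCTION or BACKUP) for the given context ID. Prints nothing if the
# context is not listed.
ctx_role() {
    # $1 = context ID, e.g. 1 or 101
    awk -v id="$1" '($1 == "PRODUCTION" || $1 == "BACKUP") && $2 == id { print $1 }'
}

# Hypothetical usage:
#   /usr/scrt/bin/sccfgd_getctxs | ctx_role 1
```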

Uninstall EchoStream

/usr/scrt/bin/scsetup -R -C1
/usr/scrt/bin/sclist -DD -C1
odmdelete -o SCCuAt
odmdelete -o SCCuObj
odmdelete -o SCCuRel

NORMAL OPERATIONS

### EchoStream start
/usr/scrt/bin/rtstart -C1

### RN Check to see if kernel module is loaded
/usr/scrt/bin/scconfig -sC1

### RN Check if services are online
lssrc -a | grep scrt
scrt_lca-1 is the sender
scrt_aba-101 is the receiver
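
To flag anything that is not running, filter the lssrc output on its last column. A sketch, assuming the usual lssrc -a layout where the subsystem name is the first field and the status is the last; scrt_not_active is a made-up name.

```shell
# Read "lssrc -a" output on stdin and print the name of every scrt*
# subsystem whose status (last column) is anything other than "active".
scrt_not_active() {
    awk '$1 ~ /^scrt/ && $NF != "active" { print $1 }'
}

# Hypothetical usage:
#   lssrc -a | scrt_not_active
```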

### Protected filesystem mount
NOTE: This is usually handled by rtstart.
/usr/scrt/bin/rtmnt -C1

### Protected filesystem umount
NOTE: This is usually handled by rtstop.
/usr/scrt/bin/rtumnt -C1

### EchoStream sync, stop, and unload service
/usr/scrt/bin/rtstop -SC1

### EchoStream stop & unload service
/usr/scrt/bin/rtstop -C1

### EchoStream stop & unload kernel extension
/usr/scrt/bin/rtstop -FC1

### Check dirty blocks in state map
This will show how many blocks need to be sync’d for the recovery group:
/usr/scrt/bin/scconfig -PC1

### RN List buffer utilization
NOTE: When the local buffer overflows, it just reverts to state-map tracking without point-in-time recovery.
/usr/scrt/bin/esmon 1

### Shutdown all contexts
NOTE: This can be added to /etc/rc.shutdown, or in cluster start/stop scripts.
/usr/scrt/bin/rn_shutdown

FAILOVER PROCEDURES

Much is missing here. This is what I could find on the internet.
You can also do this from the WebUI.

### Fail back to Primary Server
/usr/scrt/bin/rtdr -qC 1 failback

### Failover to Recovery Server
/usr/scrt/bin/rtdr -qC 101 failback

### Make clone of filesystem
/usr/scrt/bin/scrt_ra -C1 -X

### Release clone of filesystem
/usr/scrt/bin/scrt_ra -C1 -W -L /dev/dbfs01lv

MANUAL OPERATIONS

### RN Primary Manual start
In troubleshooting and testing, these commands can start Recover Now manually:
/opt/visionsolutions/http/vsisvr/httpsvr/bin/strvsisvr
varyonvg rnvspvg
/usr/bin/startsrc -s scrt_scconfigd
/usr/scrt/bin/rtstart -C1

### Start without mount and fsck
/usr/scrt/bin/rtstart -C1 -M

### RN Primary Manual stop
In troubleshooting and testing, these commands will stop Recover Now manually:

# Unmount the protected filesystems
# NOTE: $log was never defined in these notes; the default path here is an assumption.
log=${log:-/tmp/rn_stop.log}
/usr/scrt/bin/rtumnt -DC1 | tee -a "$log"

# Kill processes if the filesystem is still mounted.
for fs in $(/usr/scrt/bin/sclist -C1 -f) ; do
    if mount | grep -q "$fs" ; then
        fuser -kxuc "$fs"
    fi
done

# Try rtumnt again due to some timing issues observed.
sleep 3
/usr/scrt/bin/rtumnt -DC1

# Sync outstanding LFCs to DR server
/usr/sbin/sync
/usr/scrt/bin/scconfig -SC1

# Stop RecoverNow
/usr/scrt/bin/rtstop -FkC1

Recover Now Reset State Map

This will cause the entire recovery group to be resync’d as if new, clearing any rollback points.

First, manually stop all resources as listed above, then bring the context online:
varyonvg rnvspvg
/usr/scrt/bin/scconfig -MC1

### RTDR Resync
# Remote: resync prod, run from DR
/usr/scrt/bin/sccfgd_cmd -H PRODNODE -T "1 resync"

# Local on DR
/usr/scrt/bin/rtdr -qC101 resync

### Mount the filesystems on Primary
/usr/scrt/bin/rtmnt -C1

### Mount the filesystems on Recovery
/usr/scrt/bin/rtmnt -C101

### Unmount filesystems
/usr/scrt/bin/rtumnt -C1 # or -C 101

Recover Now Release Stuck Config

For errors such as:
scsmutil: log anchor cksum mismatch
ERROR: Failed to load EchoStream Production Server Drivers
ERROR: Drivers not loaded… Will not mount into an unprotected state

Clear the error:
/usr/scrt/bin/scsetup -MC1
/usr/scrt/bin/scconfig -uC1

Then you can use rtstart as normal.

FIX HOSTID CHANGED

### Start Recover Now
/opt/visionsolutions/http/vsisvr/httpsvr/bin/strvsisvr
varyonvg rnvspvg
/usr/bin/startsrc -s scrt_scconfigd
/usr/scrt/bin/rtstart -C1
Context not properly defined on this system

# /usr/scrt/bin/sccfgd_getctxs
HOSTID (new hostid)
IPADDRESS (multiple lines)
No context for production or backup listed

# /usr/scrt/bin/rtdr -C 1 -F 101 setup
/usr/scrt/bin/rtdr[14]: test: argument expected
rtdr: Configuration error -
rtdr: Primary Context ID <1> is not enabled.
rtdr: The Primary Context ID <1> must be enabled
rtdr: when creating a Failover Context ID.

### Shutdown the context
# /usr/scrt/bin/scsetup -MC1
scsetup: AET_TMO_NOVOTE: Setup failed.
scsetup: Detail: On wrong host.

# /usr/scrt/bin/scconfig -uC1
scconfig: AET_TMO_NOVOTE: Unexpected error
scconfig: Detail: On wrong host.

# cat /usr/scrt/run/node_license.properties
## begin signed data
#DoW Mon DD HH:MM:SS CDT YYYY
vision.license.customer=Company_name_with_underscores
vision.license.productname=EchoStream
vision.license.expirydatemig=YYYY-MM-DD HH\:MM\:SS
vision.license.machineid=0123456789abcdefghijLMNOPQR\=
vision.license.hostname=hostname

### Vision support is via:
RecoverNow/GeoCluster AIX, Replicate1 24×7 CustomerCare Technical Support:
U.S. and Canada: (800) 337-8214
International: +1 (949) 724-5465
CustomerCare Support Email: support@visionsolutions.com

After hours will just page out, but not make a ticket.
Email will have a ticket created within a few minutes.

### Test startup
# /usr/scrt/bin/scsetup
scsetup: AET_TMO_NOVOTE: Setup failed.
scsetup: Detail: On wrong host.

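### Set path properly
# /etc/environment only takes plain NAME=value lines (no export, no $PATH expansion), so use /etc/profile.
cat <<'EOF' >> /etc/profile
export PATH=/usr/scrt/bin:$PATH
EOF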

### Collect reference info from “production” node and “backup” node.
/usr/scrt/bin/scconfig -v
/usr/scrt/bin/scconfig -q
/usr/scrt/bin/rtattr -C1 -a HostId
/usr/scrt/bin/rtattr -C101 -a HostId
/usr/scrt/bin/rthostid

### Update hostid for changed production node
HOSTID=`rthostid`
/usr/scrt/bin/rtattr -C1 -a HostId -o production -v $HOSTID
/usr/scrt/bin/rtattr -C101 -a HostId -o backup -v $HOSTID
ssh BACKUPNODE /usr/scrt/bin/rtattr -C1 -a HostId -o production -v $HOSTID

### Re-collect all of the same reference data as above.

### Reconfigure the repository
scconfig -sC1
ssh BACKUPNODE /usr/scrt/bin/rtdr -C1 -F101 setup
/usr/scrt/bin/rtdr -C1 -F101 setup

### Restart everything
/usr/scrt/bin/rtstart -C1 && startsrc -s scrt_scconfigd
until df -k /databasedir 2>/dev/null >/dev/null ; do date ; sleep 10 ; done
/opt/visionsolutions/http/vsisvr/httpsvr/bin/strvsisvr 2>/dev/null
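
The until loop above will wait forever if the filesystem never shows up. A capped version might look like this; the function name, try count, and delay are my own choices, and it uses the same df test as the original loop.

```shell
# Wait until "df -k" succeeds for a path, checking up to a maximum number
# of times. Returns 0 once df succeeds, 1 if it never does.
wait_for_fs() {
    # $1 = path, $2 = max tries, $3 = seconds to sleep between tries
    tries=0
    until df -k "$1" >/dev/null 2>&1; do
        tries=$((tries + 1))
        if [ "$tries" -ge "$2" ]; then
            return 1
        fi
        date
        sleep "$3"
    done
    return 0
}

# Hypothetical usage: give it 10 minutes, then complain.
#   wait_for_fs /databasedir 60 10 || echo "/databasedir never appeared" >&2
```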


My brain

Like to figure things out.  Am never fast, but in single contexts can maintain large amount of state.

Team of resources to help with workload and bouncing ideas, but I mostly like to do my own thing, or dole out independent chunks to others.  

Have high expectations, but TRY to be fair in balancing and shuffling bits to the right people.

Hate training people, and cannot write training.  Can answer questions, research, demonstrate, etc.

Particularly good at finding problems, especially with workflow or tech procedure gaps, but also in unexpected setup and use cases.

Am fine repeating complex tasks until the procedures are refined, but loathe to do repeated simple tasks.

Have decades of experience with AIX system recovery, virtualization, IBM storage, Debian Linux, mdadm, etc.  Pretty crippled without google or my own build docs to deal with syntax.

Can code in BASH/ksh a bit, and have been proficient in PERL and Object Pascal.  Am not an efficient programmer, so I let most of it fade.

Run Windows desktop (98/2k/xp/7).  Not too content with Mac, Linux or AIX desktop, but can make do.

Too little workload, and I will explore the intranet, or find some OSS toy, or maybe become a short term expert in something random (pilot, soap, cycling so far).

Too much workload, and I shut down.  Priorities shift, and my work output drops.  Worst with high context shifting, or consistent lack of respect (false justifications, or overt hostility).

Like a lot of flexibility in my schedule.  Some travel is okay. Work from home is great.

Not religious, and not atheist.  Am my own thing.  Happy to talk politics, religion, etc unless logic is walled out.

Verbose, but I try to simplify emails when I have time.  Often have to talk through something iteratively to figure out what to do.  Mental filtering is some strange, magical thing.

Sometimes have no idea what emotional or other context exists.  Can iterate through and try to define, but not always intuit.

blah blah



Raspberry Pi Quick TTY Setup

This was how I set up my Pi 2 without using the HDMI/USB console. I have a wifi adapter in one USB port, and a Prolific USB to TTL serial adapter for the console.

      1. Prepare the installer

Download NOOBS from http://www.raspberrypi.org/downloads/
If the zip file is corrupt, then pull down with torrent. Mine had one corrupt block from the webserver.
Format your TF/uSD card as FAT32 (not exFAT, ext*, nor NTFS)
Unpack NOOBS.zip into your TF/uSD card
Remove all but Raspbian from the “os” dir
Edit recovery.cmdline to add “silentinstall”
Edit flavours.json to have only Raspbian, not the Scratch version.
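
The recovery.cmdline edit can be scripted if you build cards often. A minimal sketch, assuming the card is mounted and recovery.cmdline is a single line of kernel arguments; add_silentinstall and the mount path in the usage comment are hypothetical.

```shell
# Append "silentinstall" to a NOOBS recovery.cmdline unless it is already
# there. Running it twice leaves the file unchanged the second time.
add_silentinstall() {
    # $1 = path to recovery.cmdline on the mounted card
    grep -qw silentinstall "$1" && return 0
    # Read the single line of arguments, then rewrite the file with the
    # extra argument appended.
    line=$(head -n 1 "$1")
    printf '%s silentinstall\n' "$line" > "$1"
}

# Hypothetical usage:
#   add_silentinstall /media/sdcard/recovery.cmdline
```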

      2. Connect the TTY console with power

Connect the Prolific USB to TTL serial adapter
Red pin (5V) to pin 2 (furthest from USB, closest to edge)
Skip pin 4. Black is pin 6, white is pin 8, green is pin 10
DO NOT USE THE USB POWER PORT. We are powering through the red pin. Powering from both at the same time will kill the board. You CAN unplug the red pin so as to allow more amps through the power micro-USB port.

      3. Power and autoboot

Plug the adapter into a USB port and connect to your comm port (putty, hyperterm, whatever) at 115200,8,n,1
It takes 4 seconds to say “Recovery Console”, then two for unpacking, and about 20 mins for complete install. Green LED (drive light) should blink.
The Pi will install, then reboot

      4. Basic config

Login as pi / raspberry, then sudo to root
Walk through raspi-config, then “FINISH”
sudo to root and set root password.

      5. Set up network

Plug in your wifi adapter, OR the ethernet cable, or both.
Edit /etc/wpa_supplicant/wpa_supplicant.conf to include:
network={
   ssid="The_ESSID_from_earlier"
   psk="Your_wifi_password"
}
ifdown wlan0 ; ifup wlan0
Wait 20 seconds for reconnect
cp -p /etc/network/interfaces /etc/network/interfaces.bak
cat <<"EOF" > /etc/network/interfaces
auto lo
iface lo inet loopback
iface eth0 inet dhcp

allow-hotplug wlan0
iface wlan0 inet dhcp
wpa-conf /etc/wpa_supplicant/wpa_supplicant.conf

#wpa-roam /etc/wpa_supplicant/wpa_supplicant.conf

iface default inet dhcp
EOF

      6. Change the primary console to be tty instead of GUI

cp -p /boot/cmdline.txt /boot/cmdline.txt.bak
cat <<"EOF" > /boot/cmdline.txt
dwc_otg.lpm_enable=0 console=tty1 console=ttyAMA0,115200 root=/dev/mmcblk0p6 rootfstype=ext4 elevator=deadline rootwait
EOF

      7. Install “resize” for console

apt-get update
apt-get install xterm
resize
cat <<"EOF" >> /etc/profile
alias ll='ls -laF'
resize
EOF

      8. Set time and certificates (required for firmware update)

apt-get install ntpdate ca-certificates
cat <<"EOF" >> /etc/ntp.conf
server us.pool.ntp.org
server ntp.ubuntu.com
EOF
/etc/init.d/ntp stop
ntpdate us.pool.ntp.org ntp.ubuntu.com
/etc/init.d/ntp start
tzselect

      9. Update firmware

apt-get install rpi-update
rpi-update
reboot

      10. Update base OS

sudo apt-get upgrade

      11. Other references

http://raspberrypi.stackexchange.com/questions/15192/installing-raspbian-from-noobs-without-display
http://www.raspberrypi.org/forums/viewtopic.php?t=83372
http://www.raspberrypi.org/forums/viewtopic.php?f=63&t=88064
http://www.raspberrypi.org/documentation/configuration/config-txt.md
https://github.com/raspberrypi/noobs
https://learn.sparkfun.com/tutorials/setting-up-raspbian-and-doom/setup-raspbian
http://elinux.org/R-Pi_Troubleshooting
http://weworkweplay.com/play/automatically-connect-a-raspberry-pi-to-a-wifi-network/
http://raspi.tv/2012/making-a-reset-switch-for-your-rev-2-raspberry-pi
http://www.raspberrypi-spy.co.uk/2014/11/enabling-the-i2c-interface-on-the-raspberry-pi/
https://pidome.wordpress.com/
http://elinux.org/RPi_Serial_Connection
http://www.raspberrypi.org/raspberry-pi-2-on-sale/
http://www.raspberrypi.org/products/raspberry-pi-2-model-b/

      12. Power test

2015-02-20 03:55 90% or better 5x-18650 power brick, base install, boot then idle with rtl8188cu wifi-N and SSH connected.
2015-02-20 10:49 75% battery left
2015-02-20 15:47 25% still last LED, getting dimmer
2015-02-20 19:55 Blinking final LED
2015-02-20 20:59 last syslog entry before going down.
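
For the record, the arithmetic on those timestamps: 03:55 to 20:59 is 1024 minutes, or about 17 hours. A small sketch to do the same math for same-day 24-hour times (the function names are made up):

```shell
# Convert a 24-hour HH:MM time to minutes since midnight.
to_minutes() {
    h=${1%%:*}
    m=${1##*:}
    # Drop one leading zero so 08 and 09 are not parsed as octal.
    h=${h#0}; h=${h:-0}
    m=${m#0}; m=${m:-0}
    echo $(( h * 60 + m ))
}

# Minutes from a start time to an end time on the same day.
minutes_between() {
    # $1 = start HH:MM, $2 = end HH:MM
    echo $(( $(to_minutes "$2") - $(to_minutes "$1") ))
}

# minutes_between 03:55 20:59    # prints 1024
```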