errpt disk errors

SC_DISK_PCM_ERR1 Subsystem Component Failure

The storage subsystem has returned an error indicating that some component (hardware or software) of the storage subsystem has failed. The detailed sense data identifies the failing component and the recovery action that is required. Failing hardware components should also be shown in the Storage Manager software, so the placement of these errors in the error log is advisory and is an aid for your technical-support representative.

SC_DISK_PCM_ERR2 Array Active Controller Switch

The active controller for one or more hdisks associated with the storage subsystem has changed. This is in response to some direct action by the AIX host (failover or autorecovery). This message is associated with either a set of failure conditions causing a failover or, after a successful failover, with the recovery of paths to the preferred controller on hdisks with the autorecovery attribute set to yes.

SC_DISK_PCM_ERR3 Array Controller Switch Failure

An attempt to switch active controllers has failed. This leaves one or more hdisks with no working path to a controller. The AIX MPIO PCM retries the switch several times in an attempt to find a working path to a controller.

SC_DISK_PCM_ERR4 Array Configuration Changed

The active controller for an hdisk has changed, usually due to an action not initiated by this host. This might be another host initiating failover or recovery, for shared LUNs, a redistribute operation from the Storage Manager software, a change to the preferred path in the Storage Manager software, a controller being taken offline, or any other action that causes the active controller ownership to change.

SC_DISK_PCM_ERR5 Array Cache Battery Drained

The storage subsystem cache battery has drained. Any data remaining in the cache is dumped, and it is vulnerable to loss until the dump completes. Caching is not normally allowed with drained batteries unless the administrator takes action to enable it within the Storage Manager software.

SC_DISK_PCM_ERR6 Array Cache Battery Charge Is Low

The storage subsystem cache batteries are low and need to be charged or replaced.

SC_DISK_PCM_ERR7 Cache Mirroring Disabled

Cache mirroring is disabled on the affected hdisks. Normally, any cached write data is kept within the cache of both controllers so that if either controller fails there is still a good copy of the data. This is a warning message stating that loss of a single controller will result in data loss.

SC_DISK_PCM_ERR8 Path Has Failed

The I/O path to a controller has failed or gone offline.

SC_DISK_PCM_ERR9 Path Has Recovered

The I/O path to a controller has resumed and is back online.

SC_DISK_PCM_ERR10 Array Drive Failure

A physical drive in the storage array has failed and should be replaced.

SC_DISK_PCM_ERR11 Reservation Conflict

A PCM operation has failed due to a reservation conflict. This error is not currently issued.

SC_DISK_PCM_ERR12 Snapshot™ Volume’s Repository Is Full

The snapshot volume repository is full. Write actions to the snapshot volume will fail until the repository problems are fixed.

SC_DISK_PCM_ERR13 Snapshot Op Stopped By Administrator

The administrator has halted a snapshot operation.

SC_DISK_PCM_ERR14 Snapshot Repository Metadata Error

The storage subsystem has reported that there is a problem with snapshot metadata.

SC_DISK_PCM_ERR15 Illegal I/O – Remote Volume Mirroring

The I/O is directed to an illegal target that is part of a remote volume mirroring pair (the target volume rather than the source volume).

SC_DISK_PCM_ERR16 Snapshot Operation Not Allowed

A snapshot operation that is not allowed has been attempted.

SC_DISK_PCM_ERR17 Snapshot Volume’s Repository Is Full

The snapshot volume repository is full. Write actions to the snapshot volume will fail until the repository problems are fixed.

SC_DISK_PCM_ERR18 Write Protected

The hdisk is write-protected. This can happen if a snapshot volume repository is full.

SC_DISK_PCM_ERR19 Single Controller Restarted

The I/O to a single-controller storage subsystem is resumed.

SC_DISK_PCM_ERR20 Single Controller Restart Failure

The I/O to a single-controller storage subsystem is not resumed. The AIX MPIO PCM will continue to attempt to restart the I/O to the storage subsystem.
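
When triaging, it can help to tally which of these labels actually show up in the log. Here is a minimal sketch, assuming you have saved the output of "errpt -a" to text, where each record carries a "LABEL:" line. The function name and the sample text are made up for illustration, and only a few of the titles above are copied into the table.

```python
import re
from collections import Counter

# Titles copied from the list above (subset only, extend as needed).
PCM_TITLES = {
    "SC_DISK_PCM_ERR1": "Subsystem Component Failure",
    "SC_DISK_PCM_ERR2": "Array Active Controller Switch",
    "SC_DISK_PCM_ERR8": "Path Has Failed",
    "SC_DISK_PCM_ERR9": "Path Has Recovered",
    "SC_DISK_PCM_ERR10": "Array Drive Failure",
}

# Matches only the PCM labels; other labels in the log are ignored.
LABEL_RE = re.compile(r"^LABEL:\s+(SC_DISK_PCM_ERR\d+)\s*$", re.MULTILINE)


def tally_pcm_labels(errpt_text):
    """Count SC_DISK_PCM_ERR* labels found in `errpt -a` style text."""
    return Counter(LABEL_RE.findall(errpt_text))


# Hypothetical sample, trimmed down to the LABEL lines.
sample = """\
LABEL:          SC_DISK_PCM_ERR8
LABEL:          SC_DISK_PCM_ERR9
LABEL:          SC_DISK_PCM_ERR8
LABEL:          DUMP_STATS
"""

for label, count in sorted(tally_pcm_labels(sample).items()):
    print(count, label, PCM_TITLES.get(label, "(see list above)"))
```

A path that fails (ERR8) more often than it recovers (ERR9) is usually the first thing worth chasing.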


AIX types of ethernet interfaces

AIX presents the same network hardware under several different device types.  This is because AIX predates the time when everyone had RJ45 ethernet ports.

HBA represents a high-function PCI adapter that supports multiple protocols, and which can sometimes be configured to provide ENT devices.  Primary candidates are “Integrated Virtual Ethernet” on POWER5 and POWER6 servers, as well as RoCE adapters.  RoCE is “RDMA over Converged Ethernet”, with RDMA being “Remote Direct Memory Access”.  Basically, these are InfiniBand-style adapters which can use ethernet at the link layer.

ENT represents the “physical port”, though that is not always the case.  I’ll explain more later.  There is one of these for every Ethernet port visible to the operating system.

EN represents the “Ethernet II” protocol device for IP communication.  This is the standard today, also known as “DIX Ethernet”, named after DEC, Intel, and Xerox.  This is where you will normally put your IP address.  There is one of these for every ENT device.

ET represents an IEEE 802.3 protocol device.  This would have been used in the days of Novell networking, or with the SNA protocol.  Almost no one uses this anymore, but I’m sure there’s an AIX 3.2.5U2 microchannel server running with this somewhere in the bottom of an old government facility, with coaxial cables and barrel terminators.  Really, I don’t know why this is still needed on anything produced in the last 20 years.  There is one of these for every ENT device.

INET is for config options that affect the entire TCP/IP stack, such as persistent routes, the hostname, and whether you are bypassing ODM for config of your network (rare).  There is only one of these per system, and it is always inet0 unless someone gets cheeky.

There are other ways to get IP devices, such as IP over Fibre Channel, IP over Infiniband, IP over ATM, over FDDI, over serial or parallel, etc.  These are less common, so I’m not going into them here.

Generally, you may have a stack like this:

ent0    physical ethernet port
ent1    physical ethernet port
ent2    Etherchannel (Static, or LACP bond created out of both of the above)
ent3    Virtual Ethernet (Connects to a virtual, firmware-only switch)
ent4    Shared Ethernet (VIO server only, a software bridge between a virtual and a physical adapter)
ent5    VLAN (an additional VLAN port configured off of any of the above)
en0     IP interface – unused because ENT0 is a member of the Etherchannel ENT2
en1     IP interface – also unused for the same reason
en2     IP Interface – also unused, because this is the backing device for the SEA
en3     IP interface – Also unused because this is a backing device for the SEA.
en4     IP interface hanging off of ENT4 – this can be skipped, and a virtual ethernet used
en5     IP interface hanging off of ENT5 – this can be skipped, and a virtual ethernet used

Each device type has its own set of parameters.  You can use “lsattr -El $device”, “netstat -in”, and “entstat -d $device” to get the details.  Note that entstat should be run against the top device in the stack, not the bottom one.  Start with the device where the IP address is assigned, and it will show the subdevices, virtual connections, etc.
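
To keep the prefixes straight when reading a device listing, here is a small sketch that sorts names into the device types described above. It assumes every name is a prefix followed by an instance number, and it pairs each ENT with its same-numbered EN and ET, as in the example stack. The function names are mine, not anything AIX provides.

```python
import re

# Prefix meanings as described in the text above.
DEVICE_KINDS = {
    "hba": "multi-protocol adapter",
    "ent": "Ethernet port (adapter device)",
    "en": "Ethernet II IP interface",
    "et": "IEEE 802.3 IP interface",
    "inet": "TCP/IP stack settings",
}

# A full match is required, so "ent0" can never be read as "en" plus "t0".
NAME_RE = re.compile(r"(inet|hba|ent|en|et)(\d+)")


def classify(device):
    """Return (kind, instance number) for a device name, or None if unknown."""
    match = NAME_RE.fullmatch(device)
    if match is None:
        return None
    prefix, instance = match.groups()
    return DEVICE_KINDS[prefix], int(instance)


def interfaces_for(ent_device):
    """Return the EN and ET interface names that belong to an ENT device."""
    match = re.fullmatch(r"ent(\d+)", ent_device)
    if match is None:
        raise ValueError("not an ent device: " + ent_device)
    number = match.group(1)
    return "en" + number, "et" + number


for name in ["ent2", "en2", "et2", "inet0", "hba0", "fcs0"]:
    print(name, classify(name))
print(interfaces_for("ent4"))
```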


High Level VIO/Client build

This is off the cuff, and is not a technical walkthrough. This is enough for you to teach yourself assuming you have a system to hack on.

IBM’s POWER8 docs are missing almost everything. I don’t understand how they can call them docs at all. They want you to use some really picky tools that are cumbersome and not flexible in all the right ways.

The IBM POWER7 docs are close, but are missing the SR-IOV info. Your best bet is to skim through them, and stop when you find the bits you want (concepts, config).

The high level gist of building a VIO environment is as follows:

  • Configure to HMC
  • Clear managed system profile data
  • Build a couple VIO servers:
    • 6GB RAM, 3 virtual procs, 0.3 processing units (entitled capacity), 255 CPU weight
    • At least one storage and one network adapter
    • You can use SR-IOV to share an ethernet adapter from firmware if needed
    • One virtual ethernet trunk for each separate physical network.  Assign VLANs here
    • One virtual ethernet non-trunk for each VLAN you want an IP address on (ideal, but you can also hang IPs and VLANs directly from AIX)
    • One virtual SCSI server adapter for each client LPAR that will need virtual CDROM, Virtual Tape, or legacy Virtual SCSI disk (higher CPU load).
    • One virtual fibre adapter for each client port (usually two per client on each VIO server, but can be anywhere from 1 to 8)
  • Upload the VIO base media into the HMC media repository
  • Install the VIO server from the HMC
  • SSH into the HMC, and use vtmenu to rebuild the VIO networking
    • Remove all en, et, ent, hba devices, then cfgmgr
    • mkvdev -lnagg for any etherchannel bonded pairs needed for the Shared Ethernet Adapter(s)
    • mkvdev -sea  to build any shared ethernet adapters (ethernet bridge from virtual switch to physical port)
    • mkvdev -lnagg for any etherchannel bonded pairs needed for local IP communication
    • mkvdev -vlan for any additional VLANs hanging directly off an SEA rather than through a virtual ethernet client adapter
    • mktcpip to configure your primary interface, gateway, etc
    • Add any extra IP addresses.
  • Build your Client LPARs
    • Memory and CPU as desired
    • Virtual ethernet just picks the switch and VLAN that you need.  If this does not exist on any VIO trunk adapters, then you need to fix that.
    • Virtual SCSI client adapter
      • this needs the VIO server partition ID, and the VIO server slot number added to it for the firmware connection.
      • The VIO server virtual SCSI adapter needs the same mapping back to the client LPAR id and slot.
      • There may be some GUI improvements to add this all for you, but it’s been decades of garbage for so long that I just do it all manually.
    • Virtual Fibre adapter – This maps back and forth to the VIO server virtual fibre similar to how VSCSI did.
  • SSH into the VIO server
    • Make virtual optical devices attached to the “vhost” (virtual SCSI) adapters if needed
    • Use vfcmap to map the “vfchost” adapters to real “fcs” ports.  This requires the ports to be NPIV capable (8 Gbit or newer) and logged into an NPIV capable switch (check with lsnports).
  • Zone any LUNs
    • lsnportlogin can give you the WWNs for the clients, or you can get it from the client profile data manually
    • You can use OpenFirmware’s “ioinfo” to light up a port to force it to log in to the switch.
    • If the LPAR is down, you can use “chnportlogin” from the HMC to log in all ports for that client.
    • You can also zone directly to the VIO server, and “mkvdev” to map them as vscsi disks (higher CPU load on VIO server, and kind of a pain in the rump).
    • Note that LPM requires any VSCSI LUNs to be mapped to all VIO servers in advance.
    • Note that LPM requires any NPIV LUNs to be mapped to the secondary WWNs in advance
  • SSH into the VIO server
    • Make sure lsmap and lsmap -npiv show whatever mapping is required
    • Make sure loadopt has mounted any ISO images as virtual CDROMs if needed
    • You can also just mask an alt_disk_install LUN from a source host.
    • You can also use NIM to do a network install
  • Activate the LPAR profile.
    • If you did not open a vterm from SSH into the HMC, then you can do it from the activate GUI.
    • You can use SMS to pick your boot device
    • Install or boot as desired
    • Reconfigure your network as normal
      • smitty tcpip or “chdev -l en0” and “chdev -l inet0” with appropriate flags
      • Tune everything as desired.
      • If it was a Linux install, then that has its own config options.
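
The virtual SCSI pairing in the list above, where each side has to name the other side's partition ID and slot, is the step I see go wrong most often. Here is a minimal sketch of that check. The class, the field names, and all of the IDs and slot numbers are hypothetical; in real life the values come from the LPAR profiles on the HMC.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VirtualAdapter:
    lpar_id: int         # partition that owns this adapter
    slot: int            # virtual slot number on that partition
    remote_lpar_id: int  # partition it expects to connect to
    remote_slot: int     # slot it expects on the remote partition


def is_paired(a, b):
    """True when each adapter names the other's partition ID and slot."""
    return (
        (a.remote_lpar_id, a.remote_slot) == (b.lpar_id, b.slot)
        and (b.remote_lpar_id, b.remote_slot) == (a.lpar_id, a.slot)
    )


# Hypothetical IDs and slots, for illustration only.
vios_side = VirtualAdapter(lpar_id=1, slot=20, remote_lpar_id=5, remote_slot=3)
client_side = VirtualAdapter(lpar_id=5, slot=3, remote_lpar_id=1, remote_slot=20)
stale_client = VirtualAdapter(lpar_id=5, slot=3, remote_lpar_id=1, remote_slot=21)

print(is_paired(vios_side, client_side))
print(is_paired(vios_side, stale_client))
```

The same two-way check applies to the virtual fibre adapters, since they map back and forth the same way.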

SR-IOV can be used instead of Shared Ethernet above. 

It allows you to share a single PCI NIC or single ethernet port between LPARs.  It uses less CPU on the VIO server, and has lower latency for your LPARs.  It’s sort of the Next Generation of network virtualization, though there are some restrictions in its use.  It’s best to review all of the info and decide up front, and it is worth your time to do so.  If you want to use an SEA on SR-IOV, you still only have one VIO server per port, but you can have different ports on different VIO servers.  When sharing among all clients and VIO server without SEA, understand that the percentage capacity is a minimum guaranteed, not a cap.  Leave it low unless you have some critical workload that needs to crowd out anyone else.  Searching for “SR-IOV vNIC vio howto” turns up some of the best current write-ups.

CLI and Automation

If you want to build a whole bunch of VIO clients and servers at once, it may be worth the effort to do it from the HMC CLI.  It gets really complicated, but once you have it set up, you can adjust and rebuild things quickly.  This also lets you manually specify WWNs for your LPARs in case there are collisions, or if you are rebuilding and need to keep the same numbers.

The VIO server can be installed with alt_disk_copy, or from NIM, or from physical CD, or from the HMC.  The CLI version is called “installios” and you MUST specify the MAC address of the boot adapter for it to work properly. Without CLI options, installios will prompt you for all of the info.

 


VIO server hangs

To be updated with resolution at some point.
This is the second time a secondary VIO server has hung with a UIO_WRITE in the kernel log
The VIO servers have only been up 55 days.
Number R12 hung about a week and a half ago, but no dump was collected.
Number R22 hung this morning, and a dump was collected.
I couldn’t find anything juicy (see below), but I did find that E11 had lost its second internal boot disk.
I was able to reset that with chpv, but I’m wondering if there’s something going on with the SAS controllers.

Also, these have network hangs intermittently, and sometimes vtmenu times out.
I’m wondering if there’s some sort of power issue with the site.

---------------------------------------------------------------------------
LABEL:          DUMP_STATS
IDENTIFIER:     67145A39

Date/Time:       Mon Oct 19 09:19:49 EDT 2015
Sequence Number: 367
Class:           S
Type:            UNKN
WPAR:            Global
Resource Name:   SYSDUMP

Description
SYSTEM DUMP

Probable Causes
UNEXPECTED SYSTEM HALT

User Causes
SYSTEM DUMP REQUESTED BY USER

        Recommended Actions
        PERFORM PROBLEM DETERMINATION PROCEDURES

Failure Causes
UNEXPECTED SYSTEM HALT

        Recommended Actions
        PERFORM PROBLEM DETERMINATION PROCEDURES

Detail Data
DUMP DEVICE
/dev/lg_dumplv1
DUMP SIZE
            1108637696
TIME
Mon Oct 19 08:54:43 2015
DUMP TYPE (1 = PRIMARY, 2 = SECONDARY)
           1
DUMP STATUS
           0
ERROR CODE
0000 0000 0000 0000
DUMP INTEGRITY
after uncompressing
FILE NAME

PROCESSOR ID
           0
---------------------------------------------------------------------------
LABEL:          MINIDUMP_LOG
IDENTIFIER:     F48137AC

Date/Time:       Mon Oct 19 09:19:15 EDT 2015
Sequence Number: 366
Class:           O
Type:            UNKN
WPAR:            Global
Resource Name:   minidump

Description
COMPRESSED MINIMAL DUMP

Probable Causes
System dumped. Minimal Dump collected in Non-Volatile Memory.

        Recommended Actions
        PERFORM PROBLEM DETERMINATION PROCEDURES

Detail Data
Minidump Data:
4D33 0D4B 2D17 0060 0027 0027 0032 0048 0000 0000 4214 7800 0000 0000 DF18 67B6
0000 0003 0001 5624 ED78 05E5 36A8 6575 DDDD 0002 0004 0000 000A 000D 5624 E813
0000 0000 000F 0000 2F64 6576 2F6C 675F 6475 6D70 6C76 3165 6345 A4C0 9001 2000
0024 5648 A09B 00A0 2189 8610 234A 9C18 8180 2A14 53B2 4C81 22E5 C990 8910 DDBC
91D3 260C 9B86 2249 9A6C 0865 6300 2408 0040 9060 825F A802 2C18 F2A4 0914 0001
4614 2848 11C0 837F 18E7 FDEB 078A A280 5F62 1055 1C49 7224 BC82 080A 0202 F0E3
573E 0030 1894 5CD9 F225 8C92 382B 76A8 B120 1E00 0500 C02D 4053 3200 49A3 0565
C459 6584 E20E 4DCA 7062 C448 AC8C 8DA2 3909 0A30 C682 4449 9F0F 1440 0090 018F
4725 B52A 0A50 42CB 9F47 AEB0 BEFE 2C00 8401 08B5 C6EE CEA4 8800 920C 9229 0B7A
1544 0427 8601 70E4 BC19 53C6 0D1D 3979 56C4 B859 1104 2958 010A 1800 7032 2B59
0033 7E1A 9177 0857 891F 2B01 8018 0B00 884F 82C0 00E9 8053 3103 68D1 A44D 9B25
8800 825E 8A0B 80A2 F04C 9CA2 EDD0 0A46 977E 112A F56A 8264 881F C75D 1AC6 9FD4
BD29 573C 882A 7541 BF23 178C 3CE0 8B9E CFCD 800E 5401 52FB 76F2 DC3C 72A4 46F1
BB3D F2D1 8C92 0D7B 43B1 0388 0404 5007 4C0D 1560 D701 0407 1034 4480 0579 D38A
2FEF 50E4 2084 123E 1861 8316 5648 2186 1B12 34E1 851E 66B8 6071 2496 6862 4102
2010 D144 245D 94D1 461D A108 9248 2712 D481 389C A194 D307 7B8D 7540 5A25 F148
D24A 633D 00C8 0E81 F804 9450 6C94 34D0 4848 11E4 C216 6E60 E014 5452 4D55 D555
4779 B615 135E 9104 9658 0068 B09B 564A E4A2 9612 B8B8 85A2 314C C405 A592 7431
A09D 56C6 B0B7 174A E08D 1418 415E 1152 5841 F581 30C0 7608 25B6 5863 8F45 36D9
4896 61A6 598E 15B1 165E 8D04 6900 400D D4E5 0208 09CE B1A6 2741 04C8 D08E A599
6EDA 2920 7BA0 E4DB A5DA 3C89 2949 9A72 FADE 21B9 0CF2 49A8 D011 576B A7A3 4C81
1D04 7716 C49D 7704 F509 DC78 E59D 67D4 001B 9C99 AAAD CAE5 42CA 00F3 D597 5C3C
D480 4102 22D8 4540 9B82 0C86 D821 001F 6A08 22BA 1972 B86E BAEE AA0B C088 B3D6
3BEB 002A 4A54 928B FF68 C491 4733 1637 4008 2575 06C0 9F34 B914 0038 320D C992
C2E0 D4BB 534F 3F05 B5C0 B823 1D9A D50B F1C0 40CD 9551 9534 D53F 5B62 95D5 566C
8439 D298 6345 7B56 2F6B F2E2 269E 6CC8 B9F2 5C75 8D8A 2718 3821 DBD7 5F7E 5214
A8CA 906A 6C6C A28A 31E6 1864 9215 ABDA 6599 6D66 7041 A82E 8BE9 0600 DCD0 6929
4FF0 DA1A 45A5 9E5A 11D6 5ADF 9A4B 2992 B85A 1F41 E269 632F 4964 779A 8A06 5E47
E759 DC66 17F3 CEB0 4E03 706C 45CA 1624 5E45 E499 0780 0101 0022 401B 3613 8477
B5C6 E091 2D45 DB76 CBC4 2BE1 624C AFB9 EFB6 CBB9 BCF1 86FE 79BC 9BBF 6D3A 7104
E4CB E248 FCFA 1BA3 5601 1717 2544 0519 8CF0 C336 353C 52C2 3649 CC93 9216 D351
52B9 38CD 90C2 09F6 809C 2555 5699 9CD3 568C 104D 11CB 0070 20ED 5AC1 AC09 CCCC
0040 DFF8 F438 33F0 D05B C856 E473 E03A 0A06 8020 D253 6428 A2A8 28AA 74A3 4D43
0AF5 A453 1354 B5E0 B372 0040 0ED4 B985 15FE E135 D714 2436 B3A9 88FF 0078 AB5B
5CE1 0F6A 0B8F 704E 5790 0506 100D 1FA8 1B71 2CD8 C034 608E 37C4 22C9 DF28 823E
000C 8E22 853B CF40 1270 0FA2 5084 83CA B9C5 1F4E 31B9 8254 0E0C AA08 44E6 0A52
3A78 8DEE 87EC 3A97 0F83 D839 218A 8878 144C E248 0AA0 BA7D 4DC5 7500 0B09 8966
1725 DBA9 0477 9CD1 9DF9 B098 3F13 4D0C 7842 9116 4556 4791 D925 2008 B4B0 84F2
44A6 A5E6 75E9 64C6 6045 FB08 42BD 0E5C 4F09 C558 1331 B8B7 1556 7CAF 2074 AA0B
CFDE F2BA DDF1 A584 0713 DAFA 0645 92F7 5184 3BF2 6314 D31E 5599 FB49 ED2C C450
CB0F EC80 09E2 7400 003B C0CD 10CA A18B 69CC 8735 3118 0912 F987 A94F 86F2 3DA3
DC05 6A08 F29C AF51 A407 71E0 46F9 4A78 C282 E092 132A 0484 029E A133 00B8 5294
E5F0 45D7 0842 9F92 1C13 96E5 18C6 39BC 9384 BBB8 B022 6444 D11B 0BF2 CCD1 8C92
1804 2408 06C6 000C BF89 865B 6158 01A8 7823 2E1E 126F 88F0 F41C 1141 07C4 7852
A874 4ACC 2741 0CD0 C416 3D11 4651 A411 71A8 58BB F499 6864 231B 4931 0B32 3B63
FEAE 6242 E143 4936 03A5 47DE 8016 1803 2596 D8C8 3C2E 5514 8ECC 9823 00A8 E781
3B26 634D C8E0 A331 98F1 4788 0492 0119 ED9E 31CA B927 DE20 F24F 0008 1423 8B06
BF48 2ECD 514E 8B54 D428 8526 6468 D20E 34CD 8A07 00D0 0351 9AE3 0346 6226 2A55
5992 5E9A 68A9 4D85 E553 BF50 C08A E052 9780 039A D56E 1907 601E EE50 1310 47B1
B0EA D410 C8A1 8623 61AB 564D 5009 6A5A 9324 1455 CB36 0922 576F 9AE3 0449 48CD
38CB 9980 7352 230C 6A20 C20E C945 CF79 8ACE B1F5 94A7 3D23 844F 7D2A F100 FD64
DD3F FF25 2329 1267 6065 2C28 6FAE 5893 A068 9122 BC0B 8AEF 28B6 A405 94AF 2009
FAA8 416A F00C 43AC 9124 232B D95E 9D64 0C76 8894 7A1F B863 33D6 C40C 95B2 A3A5
0178 69F9 B612 B19A 1EEC A68A 1414 4E1C 79B4 F825 4D92 40B5 9FA4 2E89 2225 30E3
A8E8 200E 8F7E E054 4264 50AA 0449 6545 56C9 B659 8DB7 BC5C A5A5 A8BC 9ACB 5D8A
9595 6435 AB01 B035 811A ACED BD5A 3544 17E0 5A11 00FB 7511 2FB0 2B00 AE89 4292
C82A B414 31B0 3C86 608E 4584 1300 8335 67E5 C290 8A59 1EAC 9D8C 7DEC 641B 3B62
111F D1B2 2846 4066 2BD2 3A80 7656 A09E 2128 41AC 4810 DE05 000E A725 ADC2 62EA
C587 B616 C67C D96B 0248 4005 4EDC 7624 B975 A36C 7122 8063 30E0 B716 2153 08EE
188D 3541 838F 4E6E E948 C307 09B5 1C23 27E7 BBAF 4173 BAC8 E91A CD20 485B D44F
EB57 C9ED 12B5 7B4A 8086 26EF B03F 9210 6C08 4ED5 C707 0180 82A9 AEB7 AAB3 BAB3
53F7 E180 AE92 15AC 2414 B309 47F2 CBF3 04A8 025A E824 4504 ADD5 7EEC 8A99 6B2B
08A5 BD79 0EC4 2898 C163 D6E6 9201 B0E9 099F 8300 D711 2739 356C 046E 8921 0517
4600 88E7 F54E C9DA DA88 B82E A2AE E555 5914 9F2E 012B A648 8B39 0B3B CF0E 14C2
---------------------------------------------------------------------------
LABEL:          SYS_RESET
IDENTIFIER:     1104AA28

Date/Time:       Mon Oct 19 09:19:15 EDT 2015
Sequence Number: 365
Class:           S
Type:            TEMP
WPAR:            Global
Resource Name:   SYSPROC

Description
SYSTEM RESET INTERRUPT RECEIVED

Probable Causes
SYSTEM RESET INTERRUPT

Detail Data
KEY MODE SWITCH POSITION AT BOOT TIME
normal
KEY MODE SWITCH POSITION CURRENTLY
normal
---------------------------------------------------------------------------
LABEL:          ERRLOG_ON
IDENTIFIER:     9DBCFDEE

Date/Time:       Mon Oct 19 09:20:41 EDT 2015
Sequence Number: 364
Class:           O
Type:            TEMP
WPAR:            Global
Resource Name:   errdemon

Description
ERROR LOGGING TURNED ON

Probable Causes
ERRDEMON STARTED AUTOMATICALLY

User Causes
/USR/LIB/ERRDEMON COMMAND

        Recommended Actions
        NONE

---------------------------------------------------------------------------
LABEL:          CONSOLE
IDENTIFIER:     7F88E76D

Date/Time:       Mon Oct 19 08:54:09 EDT 2015
Sequence Number: 363
Class:           S
Type:            PERM
WPAR:            Global
Resource Name:   console

Description
SOFTWARE PROGRAM ERROR

Probable Causes
SOFTWARE PROGRAM

Failure Causes
SOFTWARE PROGRAM

        Recommended Actions
        REVIEW DETAILED DATA

Detail Data
USER'S PROCESS ID:
               7536648
DETECTING MODULE
conwrite
FAILING MODULE
UIO_WRITE
RETURN CODE
           6
ERROR CODE
           0
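
With this many records it is easy to lose the order of events. Here is a minimal sketch that pulls the header fields out of "errpt -a" style text and sorts the records by sequence number. It assumes records are separated by a line of dashes, as in the paste above. The function name is mine, and the sample is just the headers of three of the entries above, trimmed.

```python
import re

SEPARATOR = re.compile(r"^-{20,}\s*$", re.MULTILINE)
FIELD = re.compile(
    r"^(LABEL|IDENTIFIER|Date/Time|Sequence Number|Resource Name):\s+(.*\S)\s*$",
    re.MULTILINE,
)


def parse_errpt(text):
    """Return one dict per record, ordered by ascending sequence number."""
    records = []
    for chunk in SEPARATOR.split(text):
        fields = dict(FIELD.findall(chunk))
        # Skip empty chunks and anything without the fields we sort on.
        if "LABEL" in fields and "Sequence Number" in fields:
            fields["Sequence Number"] = int(fields["Sequence Number"])
            records.append(fields)
    return sorted(records, key=lambda record: record["Sequence Number"])


sample = """\
---------------------------------------------------------------------------
LABEL:          DUMP_STATS
IDENTIFIER:     67145A39

Date/Time:       Mon Oct 19 09:19:49 EDT 2015
Sequence Number: 367
Resource Name:   SYSDUMP
---------------------------------------------------------------------------
LABEL:          SYS_RESET
IDENTIFIER:     1104AA28

Date/Time:       Mon Oct 19 09:19:15 EDT 2015
Sequence Number: 365
Resource Name:   SYSPROC
---------------------------------------------------------------------------
LABEL:          CONSOLE
IDENTIFIER:     7F88E76D

Date/Time:       Mon Oct 19 08:54:09 EDT 2015
Sequence Number: 363
Resource Name:   console
"""

for record in parse_errpt(sample):
    print(record["Sequence Number"], record["Date/Time"], record["LABEL"])
```

Sorted this way, the CONSOLE error from conwrite comes first, followed by the system reset and then the dump statistics.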

We gathered a snap -ac for IBM, and while waiting, I did a quick look in the dump.

cd /tmp/ibmsupt/dump
chfs -a size=+4G /tmp
uncompress unix.Z
dmpuncompress dump.BZ
kdb dump unix
kdb dump unix
dump mapped from @ 700000000000000 to @ 7000000df1867b6
           START              END 
0000000000001000 0000000004150000 start+000FD8
F00000002FF47600 F00000002FFDF9C8 __ublock+000000
000000002FF22FF4 000000002FF22FF8 environ+000000
000000002FF22FF8 000000002FF22FFC errno+000000
F1000F0A00000000 F1000F0A10000000 pvproc+000000
F1000F0A10000000 F1000F0A18000000 pvthread+000000
Dump analysis on CHRP_SMP_PCI POWER_PC POWER_7 machine with 16 available CPU(s)  (64-bit registers)
Processing symbol table...
.......................done
read vscsi_scsi_ptrs OK, ptr = 0x0
vmcKdb_anchor_p=0x0000000000000000
vmc kdb command extension, 64 bit version, is loaded.  Commands are:
vmc - load extension and show help text
vmcu - unload extension
vmcd - VMC dump anchor, adapter, connections
vmcfa - VMC fetch anchor from symbol table
vmcsa address - VMC set anchor
vmcdb - VMC dump connection buffers
vmcdm - VMC dump connection messages
vmcdq - VMC dump queue
vmct directoryname - VMC Internal Adapter trace
vmctbm directoryname - VMC buffer and message trace
vmcKdb_anchor_p=0x0000000000000000

### The time the crash was forced
(0)> dw time
time+000000: 00000000 5624E813 F1000A00 20295000  ....V$...... )P.

### Basic stats on the system
(0)> stat
SYSTEM_CONFIGURATION:
CHRP_SMP_PCI POWER_PC POWER_7 machine with 16 available CPU(s)  (64-bit registers)

SYSTEM STATUS:
sysname... AIX
nodename.. viopr22
release... 1
version... 6
build date May  4 2015
build time 12:52:42
label..... 1516D_61d
machine... REDACTED
nid....... REDACTED
time of crash: Mon Oct 19 08:54:43 2015
age of system: 55 day, 6 hr., 19 min., 15 sec.
xmalloc debug: enabled
FRRs active... 0
FRRs started.. 0

### Process table
(0)> status
CPU INTR      TID  TSLOT     PID  PSLOT  PROC_NAME
  0          20005      2   20004      2  wait
  1         190033     25   F001E     15  wait
  2         1A0035     26  100020     16  wait
  3         1B0037     27  110022     17  wait
  4         1C0039     28  120024     18  wait
  5         1D003B     29  130026     19  wait
  6         1E003D     30  140028     20  wait
  7         1F003F     31  15002A     21  wait
  8         210043     33  16002C     22  wait
  9         220045     34  17002E     23  wait
 10         230047     35  180030     24  wait
 11         240049     36  190032     25  wait
 12         25004B     37  1A0034     26  wait
 13         26004D     38  1B0036     27  wait
 14         27004F     39  1C0038     28  wait
 15         280051     40  1D003A     29  wait
 16-31   Disabled

### Stack trace
(0)> set 18
 18 trace_back_lookup         true

(0)> where
pvthread+000200 STACK:
[0009BFA8].h_cede+000014 ()
[0007BEF0]waitproc+000510 ()
[0020A4B0]procentry+000010 (??, ??, ??, ??)
[kdb_read_mem] no real storage @ FFFFFFFFFFF8C90

### Error Report entries still in ram
(0)> errpt
ERRORS NOT READ BY ERRDEMON (ORDERED CHRONOLOGICALLY):

Error Record:
erec_flags ..............        1
erec_len ................       60
erec_timestamp .......... 5624E813
erec_rec_len ............       3C
erec_cid ................        0
erec_dupcount ...........        0
erec_duptime1 ........... 5624E811
erec_duptime2 ........... 5624E813
erec_rec.error_id ....... 7F88E76D
erec_rec.resource_name .. console
00000000 00730008 636F6E77 72697465  .....s..conwrite
00325549 4F5F5752 49544500 00000006  .2UIO_WRITE.....
00000000 00000000                     ........

Error Record:
erec_flags ..............        1
erec_len ................       48
erec_timestamp .......... 5624E813
erec_rec_len ............       24
erec_cid ................        0
erec_dupcount ...........        0
erec_duptime1 ...........        0
erec_duptime2 ...........        0
erec_rec.error_id ....... 1104AA28
erec_rec.resource_name .. SYSPROC
6E6F726D 616C0000 6E6F726D 616C0000  normal..normal..

### VMM Error entries still in ram
(0)> dw vmmerrlog 9
vmmerrlog+000000: 00000000 53595356 4D4D2000 00000000  ....SYSVMM .....
vmmerrlog+000010: 00000000 00000000 00000000 00000000  ................
vmmerrlog+000020: 00000000                                   ....

### Program errors in memory
(0)> dw prog_log 8
expected symbol or address

### Memory status - notice bad pages is also 4GB.
### I think this is memory_max, because free pgsp blocks is high.
(0)> vmker

VMM Kernel Data:
        (use [-dr | -seg | -lrul | -psize | -pvl | -skey | -ras] for specific info)

eye catch         (eyec)       : 766D6B6572564D4D
total page frames (nrpages)    : 00200000
bad page frames   (badpages)   : 00100000
good page frames  (goodpages)  : 00100000
ipl page frames   (iplpages)   : 00180000
total pgsp blks   (numpsblks)  : 00100000
free pgsp blks    (psfreeblks) : 000E5C42
rsvd pgsp blks    (psrsvdblks) : 00001000
max file pageout  (maxpout)    : 00002001
min file pageout  (minpout)    : 00001000
repage table size (rptsize)    : 00010000
next free in rpt  (rptfree)    : 00000000
repage decay rate (rpdecay)    : 0000005A
global repage cnt (sysrepage)  : 00000000
swhashmask        (swhashmask) : 000FFFFF
cachealign        (cachealign) : 00001000
overflows         (overflows)  : 004627C2
reloads           (reloads)    : 0056DDCC
alias hash mask   (ahashmask)  : 00007FFF
max pgs to delete (pd_npages)  : 00001000
vrld xlate hits   (vrldhits)   : 00000001
vrld xlate misses (vrldmisses) : 0000079F
pgsp bufst waits (psbufwaitcnt): 0078C9C6
fsys bufst waits (fsbufwaitcnt): 000008B4
rsys bufst waits(rfsbufwaitcnt): 00000490
xpager bufst waits(xpagerbufwaitcnt): 00000636
phys_mem(s)      (phys_mem[0]) : 00280000
phys_mem(s)      (phys_mem[1]) : FFFFFFFF
phys_mem(s)      (phys_mem[2]) : 00000000
THRPGIO buf wait     (_waitcnt)  : 00000000
THRPGIO partial cnt (_partialcnt): 00000000
THRPGIO full cnt    (_fullcnt)   : 00000000
num lgpg's added     (nlgpgadded) : 00000000
num lgpg's free'd     (nlgpgfreed) : 00000000
# frd lgp prepal (nlgpgfreedini) : 00000000
num cow mappings    (cow_pages)) : FFFFFFFFFFFF21E6
num cow page-ins    (cow_pgins)) : 066ADE1A
nosib pg-copies (npgcopies_nosib): 00025331
mmap alias reload (mmap_areload) : 00000000
mmap soft alias r (mmap_areload2): 00000000
AME exp. mem size (ame_mem_npgs) : 00000000
AME max  mem sz (ame_maxmem_npgs): 00000000
AME mem exp factor  (ame_factor) : 00000000
AME sys mem view(ame_sys_memview): 01
klock pf rsvdblks(klk_pfrb_pct): 000001F4
LSA ESID alloctor      (lsa_esid_alloc): 0000
LSA 1tb sh thresh     (lsa_sh_alias_th): 000C
LSA 1tb unsh thresh (lsa_unsh_alias_th): 0100
INVALID_HANDLE        (inval_vmh): FFFFF080

### Dynamic reconfig says we've had memory removed.
(0)> vmker -dr

VMM DR Related Data:

max page frames.......... 000000200000  frames on ipl............ 000000180000
current frames........... 000000100000  # bad frames............. 000000100000
DR mem adds.................. 00000001  DR mem removes............... 00000017
DR rsvd mem adds............. 00000000  DR rsvd mem remove........... 00000000
DR lmb reaff ................ 00000000  DR lmb reaff failed.......... 00000000
DR miss reloads ena.......... 00000002  DR miss reloads dis.......... 00000006
DR mig refcntmiss............ 00000000  DR migrate trans............. 00000000
DR mark    trans............. 00000000  DR v_look migr miss.......... 00000000
DR total migrates............ 000F1F30
DR fixlmb migrates........... 00000010  DR serv migrates............. 0000173E
DR lwmig DMA mapper.......... 00000000
MPSS broken migs............. 000006F8  MPSS brk mig errs............ 00000000
MPSS chunk migs.............. 000007CC  MPSS chunk migerrs........... 00000000
DR vmpool adds............... 00000000  DR vmpool removes............ 00000000
current maxvmpool............ 00000001
DR lpgvmp adds............... 00000000  DR lpgvmp remsoves........... 00000000
DR mempool adds.............. 00000000  DR mempool removes........... 00000000
DR memory moves.............. 00000000  DR memp rebal calls.......... 00000011
DR memp transients........... 00000000
Calls to alloclmb............ 00000000  Calls to freelmb............. 00000000
num lgpg's added.............. 00000000  num lgpg's free'd.............. 00000000

### We've had 6 failed page creates.  Is this important?
(0)> vmker -pvl
pvlist overflows             (pvl_ovflows)  : 00002CC5 (00000005 per group)
failed page create           (pvl_grow_fail): 00000006
successful page create       (pvl_grow_succ): 00000007
failed page create (hard)    (pvl_hard_fail): 00000000
successful page create (hard)(pvl_hard_succ): 00000000
successful page free         (pvl_shrink)   : 00000000
skipped grows because no PAL (pvl_nopal)    : 00000025
# entries per group on boot  (pvl_bootavgpg): 00000008
PVLIST kproc thread id       (pvl_tid)      : 00080011
Start of PVLIST array        (pvl_first)    : F200800020000000
Current end of PVLIST array  (pvl_last)     : F200800020200000
Maximum PVLIST eaddr + 1                    : F200800024000000
Current number of PVLIST entries            : 00020000
Max number of PVLIST entries (pvl_maxels)   : 00400000
Average length of free list  (pvl_avgfree)  : 00000000
eaddr to use for RMLMB fail  (pvl_pinaddr)  : F10013A650000000
PVLIST lock                  (pvl_lock)     : 00000000

### Memory shows we have low free, high pinned.
(0)> memstat

Pageable Memory Status

Total pageable frames:    00000F74B0    3.9GB   -----
   4K pageable frames:    0000013DB0  317.7MB     8.0% total pageable
  64K pageable frames:    000000E370    3.6GB    91.9% total pageable

Total free frames:        0000001636   22.2MB     0.5% total pageable
   4K free frames:        0000000746    7.3MB     2.2% 4K pageable
  64K free frames:        00000000EF   14.9MB     0.4% 64K pageable

Total nrsvd frames:       0000000000    0.0MB     0.0% total pageable
   4K nrsvd frames:       0000000000    0.0MB     0.0% 4K pageable

Total comp frames:        00000F51DA    3.8GB    99.1% total pageable

Total perm frames:        0000000B40   11.3MB     0.2% total lruable
   4K perm frames:        0000000B40   11.3MB     3.6% 4K lruable

Total lruable frames:     00000F5880    3.8GB   -----
   4K lruable frames:     0000013810  312.1MB     7.9% total lruable
  64K lruable frames:     000000E207    3.5GB    92.0% total lruable

Total pinned frames:      00000C47FF    3.1GB    79.4% total pageable
   4K pinned frames:      000000FE4F  254.3MB    80.0% 4K pageable
  64K pinned frames:      000000B49B    2.8GB    79.4% 64K pageable

Total pinnable remaining: 0000001568   21.4MB     0.5% total pageable
   4K pinnable remaining: FFFFFFFFFFFFFFD8    0.0TB     0.0% 4K pageable
  64K pinnable remaining: 0000000159   21.6MB     0.5% 64K pageable

!!! 4K free frames less than minfree.
!!! Total perm frames below minperm.
*** 4K perm frames within 5% of minperm.
!!! 4K pinned frames within 5% of maxpin.
!!! 64K pinned frames within 5% of maxpin.
!!! 4K free frames less than psm_minfree_thresh.
*** 64K free frames between psm_maxfree_thresh and psm_minfree thresh.
!!! 4K page size above psm_maxpin limit.
!!! 64K page size above psm_maxpin limit.

### There's nothing waiting on paging.
(0)> th -w WMEM

(0)> th -w WPGIN

(0)> th -w WPGOUT

(0)> th -w WFREEF

### No pending I/Os
(0)> pdt *
               SLOT   NEXTIO           DEVICE  DMSRVAL    IOCNT    OLDIO 

vmmary_pdt+000000 0000 FFFFFFFF 8000000A00000002 00000000 00000000 00000000 paging
vmmary_pdt+007400 0080 FFFFFFFF 02BE5D40 00000000 00000000 00000000 remote
vmmary_pdt+0074E8 0081 FFFFFFFF 8000000A00000009 00000000 00000000 00000000 local client
vmmary_pdt+0075D0 0082 FFFFFFFF 8000000A00000008 83802E080 00000000 00000000 local client
vmmary_pdt+0076B8 0083 FFFFFFFF 8000000A00000005 00000000 00000000 00000000 local client
vmmary_pdt+0077A0 0084 FFFFFFFF 8000000A00000006 00000000 00000000 00000000 local client
vmmary_pdt+007888 0085 FFFFFFFF 8000000A00000007 00000000 00000000 00000000 local client
vmmary_pdt+007970 0086 FFFFFFFF 8000000A0000000B 00000000 00000000 00000000 local client
vmmary_pdt+007A58 0087 FFFFFFFF 8000000A0000000A 00000000 00000000 00000000 local client
vmmary_pdt+007B40 0088 FFFFFFFF 8000000A0000000C 00000000 00000000 00000000 local client
vmmary_pdt+007C28 0089 FFFFFFFF 8000000A00000003 00000000 00000000 00000000 local client
vmmary_pdt+007D10 008A FFFFFFFF 8000002D00000002 00000000 00000000 00000000 local client

### No locks
(0)> lq
                    BUCKET HEAD            COUNT

(0)> dla
 No deadlock found
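
Most of the numbers in the kdb session above are hex, so here is a small sketch for turning them into something readable. It converts the "dw time" word and the erec timestamps to a date, and the vmker frame counts to GiB. It assumes the timestamps are Unix epoch seconds and that the frame counts are in 4 KB frames; the helper names are mine.

```python
from datetime import datetime, timezone

PAGE_SIZE = 4096  # assumption: vmker frame counts are 4 KB frames


def kdb_time(hex_seconds):
    """Convert a hex Unix timestamp, as printed by kdb, to a UTC datetime."""
    return datetime.fromtimestamp(int(hex_seconds, 16), tz=timezone.utc)


def frames_to_gib(hex_frames, page_size=PAGE_SIZE):
    """Convert a hex frame count to GiB."""
    return int(hex_frames, 16) * page_size / 2**30


# Values copied from the session above.
print("time     ", kdb_time("5624E813").isoformat())
print("duptime1 ", kdb_time("5624E811").isoformat())

for name, frames in [
    ("nrpages", "00200000"),
    ("badpages", "00100000"),
    ("goodpages", "00100000"),
    ("iplpages", "00180000"),
]:
    print(name.ljust(9), frames_to_gib(frames), "GiB")
```

The timestamp works out to 12:54:43 UTC, which is the 08:54:43 EDT crash time that stat reports. The frame counts come out to 8 GiB total, 6 GiB at IPL, and 4 GiB each for good and bad frames, which lines up with the 4GB of bad pages noted above.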