If I remove the first main adapter, and re-add it, then I can add/remove either adapter or IP interface after that.
If I remove the second main adapter, and re-add it, then I cannot remove the first, and dropping the IP interface crashes.
So, assuming adapter_names=ent2,ent6
This works everywhere:
/usr/lib/methods/ethchan_config -d ent17 ent2
/usr/lib/methods/ethchan_config -a ent17 ent2
/usr/lib/methods/ethchan_config -d ent17 ent6
/usr/lib/methods/ethchan_config -a ent17 ent6
/usr/sbin/rmdev -Rl en17
/usr/sbin/mkdev -l en17
/usr/sbin/cfgmgr
# Can do any combination of the above after remove/readd first adapter in advance.
And this crashes everywhere:
/usr/lib/methods/ethchan_config -d ent17 ent6
/usr/lib/methods/ethchan_config -a ent17 ent6
# crashed here on one server
/usr/lib/methods/ethchan_config -d ent17 ent2
ethchan_config: 0950-021 Unable to delete adapter ent2 from the
EtherChannel because it could not be found, errno = 2
/usr/sbin/rmdev -Rl en17
# crash here on several others
Crash analysis follows:
(96)> stat
SYSTEM_CONFIGURATION:
CHRP_SMP_PCI POWER_PC POWER_8 machine with 160 available CPU(s) (64-bit
registers)
SYSTEM STATUS:
sysname... AIX
nodename.. testnode001
release... 2
version... 7
build date Mar 2 2018
build time 13:02:46
label..... 1809C_72H
machine... 00DEADBEEF00
nid....... FBCAFE4C
time of crash: Wed May 9 04:45:59 2018
age of system: 25 day, 10 hr., 54 min., 41 sec.
xmalloc debug: enabled
FRRs active... 0
FRRs started.. 0
CRASH INFORMATION:
CPU 96 CSA F00000002FF47600 at time of crash, error code for LEDs:
30000000
pvthread+1A0E00 STACK:
[00009324].unlock_enable_mem+000018 ()
[06058D54]shientdd:entcore_disable_tx_timeout_timers@AF123_105+000074
(??, ??)
[060592E8]shientdd:entcore_suspend_nic+000028 (??, ??)
[0605FB20]shientdd:entcore_suspend+0001E0 (??, ??, ??)
[06129A68]shientdd:entcore_close_common+000668 (??)
[0612A0B0]shientdd:entcore_close+000490 (??)
[060103CC]shientdd:shi2ent_close+00000C (??)
[F1000000C04911C0]ethchandd:ethchan_close+0001A0 (??)
[00014D70].hkey_legacy_gate+00004C ()
[0057A914]ns_free+000074 (??)
[00014F50].kernel_add_gate_cstack+000030 ()
[069E503C]if_en:en_ioctl+0002DC (??, ??, ??)
[0057126C]if_detach+0001CC (??)
[0056E1DC]ifioctl+00081C (F00000002FF473D0, 8020696680206966,
00000000066EB8A0)
[005EA764]soo_ioctl+0005C4 (??, ??, ??)
[007A4754]common_ioctl+000114 (??, ??, ??, ??)
[00003930]syscall+000228 ()
[kdb_get_virtual_memory] no real storage @ 2FF22358
[D011C92C]D011C92C ()
[kdb_read_mem] no real storage @ FFFFFFFFFFF5D60
(96)> status | grep -v wait
CPU INTR TID TSLOT PID PSLOT PROC_NAME
96 20E03BF 6670 380324 3128 ifconfig
(96)> vmlog
Most recent VMM errorlog entry
Error id = DSI_PROC
Exception DSISR/ISISR = 000000000A000000
Exception srval = 00007FFFFFFFD080
Exception virt addr = 0000000000000004
Exception value = 00000086 EXCEPT_PROT
0x86:
Protection exception. An attempt was made to write to a protected
address in memory
(96)> th -n ifconfig
SLOT NAME STATE TID PRI RQ CPUID CL WCHAN
pvthread+1A0E00 6670*ifconfig RUN 20E03BF 03E 96 0
shientdd:.entcore_disable_tx_timeout_timers AF123_105+000074
bla < .unlock_enable>
.
2390 ! SUNLOCK(TX_QUEUE_SLOCK, tx_pri);
.
---- NDD INFO ----( F1000B003952B410)----
name............. ent6 alias............ en6
ndd_next......... 0000000000000000
ndd_flags........ 00610812
(BROADCAST!NOECHO!64BIT!CHECKSUM_OFFLOAD)
ndd_2_flags...... 00000930
(IPV6_LARGESEND!IPV6_CHECKSUM_OFFLOAD!LARGE_RECEIVE!ECHAN_ELEM)
(96)> print entcore_acs_t F1000B00393F0000
struct entcore_acs_t
struct entcore_tx_queue_t
< ...>
struct entcore_ras_cb_t *ffdc_ras_cb = 0xF1000B0039537D40;
struct entcore_tx_atomics_t *atomics = 0x0000000000000000;
struct mbuf *overflow_queue = 0x0000000000000000;
struct mbuf *overflow_queue_tail = 0x0000000000000000;
uint64_t ofq_cnt = 0x0000000000000000;
struct entcore_lock_info_t *p_lock_info = 0x0000000000000000;
void *p_acs = 0xF1000B00393F0000; NULL so DSI
(96)> dd F1000B00393F78D0
F1000B00393F78D0: 0000000000000000 < - p_lock_info
(96)> xm F1000B00393F78D0
Page Information:
heap_vaddr = F1000B0000000000
P_allocrange (range of 2 or more allocated full pages)
page........... 00003937 start.. F1000B00393F0000 page_cnt....... 0017
allocated_size. 00170000 pd_size........ 00010000 pinned......... yes
XMDBG: ALLOC_RECORD
Allocation Record:
F1000B00E4306600: addr......... F1000B00393F0000 allocated pinned
F1000B00E4306600: req_size..... 1458712 act_size..... 1507328
F1000B00E4306600: tid.......... 033F0187 comm......... cfgshien
XMDBG: ALLOC_RECORD
Trace during xmalloc() on CPU 00
0604FCB0(.entcore_allocate_acs+000310)
060129C4(.entcore_config_state_machine+
0601A884(.entcore_perform_init+0000A4)
Free History:
105D 40.955808 SHIENTDD GEN: L3 Close__B d1=F1000B00393F0000
105D 40.955808 SHIENTDD GEN: L3 CloseC_B d1=F1000B00393F0000
105D 40.955809 SHIENTDD GEN: L3 HwClos_B d1=F1000B00393F0000
105D 40.955810 SHIENTDD GEN: L3 HwClos_B -HW| d1=0000000000000000
105D 40.955810 SHIENTDD GEN: L3 HwClos10 -HW| d1=0000000000000000
105D 40.955810 SHIENTDD GEN: L3 HwClos_E -HW| d1=0000000000000000
105D 40.955811 SHIENTDD GEN: L3 HwClos_E d1=0000000000000000
< ...>
105D 41.039269 SHIENTDD GEN: L3 CloseC_E d1=F1000B00393F0000
105D 41.039269 SHIENTDD GEN: L3 Close__E d1=0000000000000000
105D 41.039273 SHIENTDD GEN: L3 Close__B d1=F1000B00393F0000
another close ? >>
105D 41.039273 SHIENTDD GEN: L3 CloseC_B d1=F1000B00393F0000
105D 41.039274 SHIENTDD GEN: L3 HwClos_B d1=F1000B00393F0000
105D 41.039275 SHIENTDD GEN: L3 HwClos_B -HW| d1=0000000000000000
105D 41.039275 SHIENTDD GEN: L3 HwClos10 -HW| d1=0000000000000000
105D 41.039276 SHIENTDD GEN: L3 HwClos_E -HW| d1=0000000000000000
105D 41.039276 SHIENTDD GEN: L3 HwClos_E d1=0000000000000000
105D 41.039276 SHIENTDD GEN: L3 Suspnd_B d1=F1000B00393F0000
105D 41.039279 SHIENTDD GEN: L3 MctSyn_B d1=F1000B00393F0000
105D 41.039281 SHIENTDD GEN: L3 MctSyn_E d1=0000000000000000
END
It seems that 2 closes happened, which would have leaded to a double free, and the crash.
Debug efix was tested for 2 weeks on 24 systems and problem was resolved, patch was stabl.
APAR IJ06720 was generated, and a public efix will be released for that./