tsm server status

I ordered the new backup server on October 27.
Initial setup gave app crashes intermittently, so was not ready to make it live yet.
I ran BOINC on it for a day, and at one point, all tasks died at once.

Syslog showed EDAC errors starting 11 days after I got the system, calling out CPU#1Channel#2_DIMM#0

This matches CPU1, DIMM1 on the board (ie, DIMMs are ordered backwards in Linux from printed labels).

I swapped all of CPU1 DIMMS with CPU0 DIMMs to troubleshoot.

Problem went away. 99% chance this was just a slightly loose DIMM from shipping.

Aside from that, the system has been awesome. I’ve run DB2, Spectrum Protect, and BOINC on here. For BOINC, the fans stay on low at 66% and 50% on a warm day, and 66%/66% on a cool day.

TLDR – remember to re-seat your DIMMs after shipping. System is stable otherwise.

Here are logs and system queries:

Nov 7 15:00:43 tsm kernel: [929582.997825] EDAC MC1: 1 CE error on CPU#1Channel#2_DIMM#0 (channel:2 slot:0 page:0x0 offset:0x0 grain:8 syndrome:0x0)
...
Nov 14 19:59:05 tsm kernel: [1552272.728748] EDAC MC1: 7112 CE error on CPU#1Channel#2_DIMM#0 (channel:2 slot:0 page:0x0 offset:0x0 grain:8 syndrome:0x0)

/bin/bash# ll -d /sys/devices/system/edac/mc/mc1/dimm*
drwxr-xr-x 3 root root 0 Nov 14 20:07 /sys/devices/system/edac/mc/mc1/dimm0/
drwxr-xr-x 3 root root 0 Nov 14 20:07 /sys/devices/system/edac/mc/mc1/dimm3/
drwxr-xr-x 3 root root 0 Nov 14 20:07 /sys/devices/system/edac/mc/mc1/dimm6/

/bin/bash# cat /sys/devices/system/edac/mc/mc1/dimm6/dimm_label
CPU#1Channel#2_DIMM#0

/bin/bash# cat /sys/devices/system/edac/mc/mc1/dimm6/dimm_location
channel 2 slot 0

/bin/bash# cat /sys/devices/system/edac/mc/mc1/dimm6/dimm_mem_type
Registered-DDR3

/bin/bash# cat /sys/devices/system/edac/mc/mc1/dimm6/size
8192

/bin/bash# cat /sys/devices/system/edac/mc/mc1/mc_name
i7 core #1

/bin/bash# cat /sys/devices/system/edac/mc/mc1/ce_count
1197602807

/bin/bash# cat /sys/devices/system/edac/mc/mc0/mc_name
i7 core #0

/bin/bash# cat /sys/devices/system/edac/mc/mc0/ce_count
0

/bin/bash# uptime
20:15:26 up 17 days, 23:28, 2 users, load average: 0.01, 0.40, 2.64

Power off and back on, and now BIOS shows:

209-Memory warning condition (WARN_DQS_TEST) detected slot CPU1 DIMM1
209-Memory warning condition (WARN_DQS_TEST) detected slot CPU1 DIMM1
209-Memory warning condition (rd dq dqs) detected slot CPU1 DIMM1
203-Memory module failed self-test and failing rank was disabled slot CPU1 DIMM1

The following configuration options were automatically updated:
Memory:40960 MB

Using ESD precautions, I moved all DIMMs from CPU1 bank to CPU0 bank.
All errors went away.

Loose DIMM. False alarm.