tsm server status

I ordered the new backup server on October 27.
The initial setup had intermittent application crashes, so it was not ready to go live yet.
I ran BOINC on it for a day, and at one point all tasks died at once.

Syslog showed EDAC errors starting 11 days after I got the system, calling out CPU#1Channel#2_DIMM#0.

This matches CPU1, DIMM1 on the board (i.e., Linux numbers the DIMMs in the reverse order from the printed labels).

To troubleshoot, I swapped all of the CPU1 DIMMs with the CPU0 DIMMs.

Problem went away. 99% chance this was just a slightly loose DIMM from shipping.

Aside from that, the system has been awesome. I’ve run DB2, Spectrum Protect, and BOINC on it. Under BOINC, the fans stay on low: 66%/50% on a warm day and 66%/66% on a cool day.

TLDR – remember to re-seat your DIMMs after shipping. System is stable otherwise.

Here are logs and system queries:

Nov 7 15:00:43 tsm kernel: [929582.997825] EDAC MC1: 1 CE error on CPU#1Channel#2_DIMM#0 (channel:2 slot:0 page:0x0 offset:0x0 grain:8 syndrome:0x0)
...
Nov 14 19:59:05 tsm kernel: [1552272.728748] EDAC MC1: 7112 CE error on CPU#1Channel#2_DIMM#0 (channel:2 slot:0 page:0x0 offset:0x0 grain:8 syndrome:0x0)
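
A rough sketch for totaling corrected errors per DIMM from log lines like the two above. It assumes the message layout stays "EDAC MCn: <count> CE error on <label>". The function name edac_ce_summary and the /var/log/syslog path in the usage comment are placeholders of mine, not something taken from this system.

```shell
#!/bin/sh
# Sum corrected-error (CE) counts per DIMM label from kernel log lines on stdin.
# Assumption: lines look like
#   ... EDAC MC1: 7112 CE error on CPU#1Channel#2_DIMM#0 (channel:2 ...)
# edac_ce_summary is a hypothetical helper name.
edac_ce_summary() {
    awk '
        /EDAC MC[0-9]+: [0-9]+ CE error on / {
            # Locate the "CE error on" words; the count sits just before
            # them and the DIMM label just after.
            for (i = 2; i + 3 <= NF; i++) {
                if ($i == "CE" && $(i + 1) == "error" && $(i + 2) == "on") {
                    total[$(i + 3)] += $(i - 1)
                    break
                }
            }
        }
        END {
            for (label in total) printf "%s %.0f\n", label, total[label]
        }
    '
}

# Example use (log path is an assumption; adjust for your distro):
#   edac_ce_summary < /var/log/syslog
```

If only one DIMM label shows up in the output, that points at a single module or slot rather than the whole memory controller.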

/bin/bash# ll -d /sys/devices/system/edac/mc/mc1/dimm*
drwxr-xr-x 3 root root 0 Nov 14 20:07 /sys/devices/system/edac/mc/mc1/dimm0/
drwxr-xr-x 3 root root 0 Nov 14 20:07 /sys/devices/system/edac/mc/mc1/dimm3/
drwxr-xr-x 3 root root 0 Nov 14 20:07 /sys/devices/system/edac/mc/mc1/dimm6/

/bin/bash# cat /sys/devices/system/edac/mc/mc1/dimm6/dimm_label
CPU#1Channel#2_DIMM#0

/bin/bash# cat /sys/devices/system/edac/mc/mc1/dimm6/dimm_location
channel 2 slot 0

/bin/bash# cat /sys/devices/system/edac/mc/mc1/dimm6/dimm_mem_type
Registered-DDR3

/bin/bash# cat /sys/devices/system/edac/mc/mc1/dimm6/size
8192

/bin/bash# cat /sys/devices/system/edac/mc/mc1/mc_name
i7 core #1

/bin/bash# cat /sys/devices/system/edac/mc/mc1/ce_count
1197602807

/bin/bash# cat /sys/devices/system/edac/mc/mc0/mc_name
i7 core #0

/bin/bash# cat /sys/devices/system/edac/mc/mc0/ce_count
0

/bin/bash# uptime
20:15:26 up 17 days, 23:28, 2 users, load average: 0.01, 0.40, 2.64
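
The one-off sysfs queries above can be looped over every DIMM. This is only a sketch: it reads the same files shown above (dimm_label, dimm_location, size, and the controller's ce_count), and the function name edac_list_dimms and its optional root-directory argument are my own additions.

```shell
#!/bin/sh
# List every DIMM under an EDAC sysfs tree with its label, location, size,
# and the corrected-error count of its memory controller.
# edac_list_dimms is a hypothetical helper; $1 optionally overrides the root
# directory (handy for running against a copied tree).
edac_list_dimms() {
    root=${1:-/sys/devices/system/edac/mc}
    for dimm in "$root"/mc*/dimm*; do
        [ -d "$dimm" ] || continue          # skip if the glob matched nothing
        mc_dir=$(dirname "$dimm")
        label=$(cat "$dimm/dimm_label" 2>/dev/null)
        location=$(cat "$dimm/dimm_location" 2>/dev/null)
        size=$(cat "$dimm/size" 2>/dev/null)
        mc_ce=$(cat "$mc_dir/ce_count" 2>/dev/null)
        printf '%s/%s label=%s location="%s" size=%s mc_ce_count=%s\n' \
            "$(basename "$mc_dir")" "$(basename "$dimm")" \
            "$label" "$location" "$size" "$mc_ce"
    done
}

# Example use:
#   edac_list_dimms
```

Missing files are printed as empty fields instead of stopping the loop, since not every kernel or EDAC driver exposes the same set of attributes.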

After powering off and back on, the BIOS showed:

209-Memory warning condition (WARN_DQS_TEST) detected slot CPU1 DIMM1
209-Memory warning condition (WARN_DQS_TEST) detected slot CPU1 DIMM1
209-Memory warning condition (rd dq dqs) detected slot CPU1 DIMM1
203-Memory module failed self-test and failing rank was disabled slot CPU1 DIMM1

The following configuration options were automatically updated:
Memory:40960 MB

Using ESD precautions, I moved all DIMMs from CPU1 bank to CPU0 bank.
All errors went away.

Loose DIMM. False alarm.


You remembered 81% of the information in the Memory Test.

But research shows there's a lot you can do to improve your memory. And if you do, it can help you function in more ways than you'd think. There are 6 main types of memory, which help us interpret and store different types of information. You scored highest in object memory.

That kind of memory allows you to visualize how an object will fit in, or move through space, and where it will ultimately end up. This skill is particularly useful when you're playing sports or packing a lot of objects into a small space. With your strength in this area, you're probably able to visualize where an in-flight ball will land and are likely quite good at completing jigsaw puzzles.

Visual: 8 (avg 7.6)
Numeric: 8 (avg 8)
Spatial: 8 (avg 5.9)
Object Oriented: 10 (avg 8.7)
Reading Comprehension: 8 (avg 8)
Delayed Recall: 8 (avg 7)