mdadm: reshape to fewer, larger devices

I could not find anyone who was confident that an mdadm array can be reshaped onto a smaller number of larger devices.  Plenty of recent posts said you cannot do this.  I made it happen, and the biggest concern is making sure you provide enough space on the new devices.  There are some safety warnings that help with this.  I did have to resize my new partitions a couple of times during the process.

I did this because my rootvg needed to move to NVMe, and I only had room for 4 devices, vs the 5 on SATA.  The OS I used was Debian 10 Buster, but this should work on any vaguely contemporary GNU/Linux distribution.

There are always risks with reshaping arrays and LVM, so I recommend you back up your data.
There are always risks with reshaping arrays and LVM, so I recommend you back up your data.
There are always risks with reshaping arrays and LVM, so I recommend you back up your data.
There are always risks with reshaping arrays and LVM, so I recommend you back up your data.
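If you want a quick belt-and-suspenders copy before touching anything, this is the idea; a minimal sketch, and the /storage destination paths are assumptions about where you have spare room off the arrays being touched:

### Sketch: copy the root filesystem and /boot somewhere that is not on the
### arrays being reshaped.
rsync -aHAXSP --one-file-system / /storage/backup-root/
rsync -aHAXSP /boot/ /storage/backup-boot/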

First, build the new NVMe partitions
I have p1 for /boot (not UEFI yet, and I’m on LILO still, so unused right now).
I have p2 for rootvg, and p3 for ssddatavg

parted /dev/nvme0n1
mklabel gpt
y
mkpart boot ext4 4096s 300MB
set 1 raid on
set 1 boot on
mkpart root 300MB 80GB
set 2 raid on
### When I had to resize p2 later, I removed p3, grew p2, and re-created p3:
rm 3
resizepart 2 80G
mkpart datassd 80G 100%
set 3 raid on
print
quit

Repeat for the other devices so they match.
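One way to do that without retyping everything in parted, assuming sgdisk from the gdisk package is installed, is a sketch like this:

### Sketch: replicate the GPT from nvme0n1 onto the other three disks, then
### randomize the GUIDs so they are not byte-for-byte clones.
for d in /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 ; do
  sgdisk -R "$d" /dev/nvme0n1
  sgdisk -G "$d"
done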
My devices looked like this after:

Model: INTEL SSDPEKNW020T8 (nvme)
Disk /dev/nvme3n1: 2048GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:

Number  Start   End     Size    File system  Name     Flags
 1      2097kB  300MB   298MB                boot     boot, esp
 2      300MB   80.0GB  79.7GB               root     raid
 3      80.0GB  2048GB  1968GB               datassd  raid

Clear superblocks if needed
If you are retrying after 37 attempts, these commands may come in handy:

### wipe superblock
for i in /dev/nvme?n1p1 ; do mdadm --zero-superblock $i ; done

### Wipe FS
for i in /dev/nvme?n1p1 ; do dd bs=256k count=4k if=/dev/zero of=$i ; done

Rebuild /boot – high level
This is incomplete, because I have not yet switched my host to UEFI mode.  The reference I used is good, but also incomplete.

### Make new array and filesystem
mdadm --create --verbose /dev/md3 --level=1 --raid-devices=4 /dev/nvme*p1
mkfs.ext4 /dev/md3
mount /dev/md3 /mnt
rsync -avSP /boot/ /mnt/

### Install GRUB2
mkdir /boot/grub

apt update
apt-get install grub2
### From dpkg-reconfigure: kopt=nvme_core.default_ps_max_latency_us=0

### Make the basic config
[root@ns1: /root]

/bin/bash# grub-mkconfig -o /boot/grub/grub.cfg
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-4.19.0-10-amd64
Found initrd image: /boot/initrd.img-4.19.0-10-amd64
Found linux image: /boot/vmlinuz-4.19.0-5-amd64
Found initrd image: /boot/initrd.img-4.19.0-5-amd64
done

### Install the bootloader
[root@ns1: /root]

/bin/bash# grub-install /dev/md3
Installing for i386-pc platform.
grub-install: warning: File system `ext2' doesn't support embedding.
grub-install: error: embedding is not possible, but this is required for cross-disk install.

[root@ns1: /root]
/bin/bash# grub-install /dev/nvme0n1
Installing for i386-pc platform.
grub-install: warning: this GPT partition label contains no BIOS Boot Partition; embedding won't be possible.
grub-install: error: embedding is not possible, but this is required for RAID and LVM install.

I need to convert to UEFI before installing the bootloader will work.  I also rsync'd my old /boot into the new array, but that is moot until this is corrected.
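For the record, an alternative that should clear the second error without converting to UEFI is to give each disk a tiny BIOS Boot Partition in the gap before p1.  This is only a sketch of that route, not what I did; the partition number 4 and the free 1MiB gap at sectors 2048-4095 are assumptions based on my layout above.

### Sketch: carve a 1MiB BIOS boot partition on each NVMe disk so i386-pc
### grub-install has somewhere to embed its core image, then retry.
for d in /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 ; do
  parted -s "$d" mkpart biosgrub 2048s 4095s
  parted -s "$d" set 4 bios_grub on
  grub-install "$d"
done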

Swap out my SATA members with SSD

The original members were 37GB each, and the new ones are 77GB.  It was time to go bigger anyway; when I tried to merely match the old usable size, I kept coming up a few gigs short (5×37 versus 4×57, both RAID6).
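The rough RAID6 arithmetic behind that, as a sketch (usable space is (devices - 2) x member size, ignoring the superblock and bitmap overhead that kept eating my margin):

### old array: 5 x 37GB members, RAID6 -> 3 data members
echo "old usable:   $((3 * 37)) GB"
### matching attempt: 4 x 57GB members, RAID6 -> 2 data members
echo "4x57 usable:  $((2 * 57)) GB"
### what I ended up with: 4 x 77GB members
echo "4x77 usable:  $((2 * 77)) GB"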

The goal is to fail a drive, remove a drive, then add a larger SSD replacement. After the last drive is removed, we reshape the array while it is degraded, because we don’t have a 5th device to add.

### Replace the first device
mdadm -f /dev/md1 /dev/sda2

mdadm -r /dev/md1 /dev/sda2
mdadm --add /dev/md1 /dev/nvme0n1p2

### wait until it's done rebuilding
#mdadm --wait /dev/md1
while grep re /proc/mdstat ; do sleep 20 ; date ; done
mdadm -f /dev/md1 /dev/sdb2
mdadm -r /dev/md1 /dev/sdb2
mdadm --add /dev/md1 /dev/nvme1n1p2

### wait until it's done rebuilding
#mdadm --wait /dev/md1
sleep 1 ; while grep re /proc/mdstat ; do sleep 20 ; date ; done
mdadm -f /dev/md1 /dev/sdc2
mdadm -r /dev/md1 /dev/sdc2
mdadm --add /dev/md1 /dev/nvme2n1p2

### wait until it's done rebuilding
#sleep 1 ; while grep re /proc/mdstat ; do sleep 20 ; date ; done
mdadm --wait /dev/md1
mdadm -f /dev/md1 /dev/sdd2
mdadm -r /dev/md1 /dev/sdd2
mdadm --add /dev/md1 /dev/nvme3n1p2

### Remove last smaller device
#sleep 1 ; while grep re /proc/mdstat ; do sleep 20 ; date ; done
mdadm --wait /dev/md1
mdadm -f /dev/md1 /dev/sde2
mdadm -r /dev/md1 /dev/sde2

Reshape the array

Check that the array size mdadm will shrink to is still larger than the space LVM is actually using in the PV.

[root@ns1: /root]
/bin/bash# mdadm --grow /dev/md1 --raid-devices=4 --backup-file=/storage/backup
mdadm: this change will reduce the size of the array.
use --grow --array-size first to truncate array.
e.g. mdadm --grow /dev/md1 --array-size 155663872

[root@ns1: /root]
/bin/bash# pvs /dev/md1
PV        VG     Fmt  Attr PSize   PFree
/dev/md1  rootvg lvm2 a--  102.50g 8.75g

If you come up short, you can shrink a PV a little, but often there are used extents scattered around.  There is no defrag for LVM, so you would have to migrate extents manually.  I was too lazy to do that, and instead grew the array (and eventually the PV) from 103GB to 155GB.  I kind of need the space anyway.

# pvresize --setphysicalvolumesize 102G /dev/md1
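For reference, the manual extent migration I skipped would look roughly like this; a sketch only, and the extent range is hypothetical (pvdisplay --maps shows which segments actually sit at the top end of the PV):

### Sketch: find the LV segments living in the highest extents, then push them
### lower in the PV (or onto another PV in the VG) so the tail can be freed.
pvdisplay --maps /dev/md1
pvmove --alloc anywhere /dev/md1:26000-26239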

Final reshape here

Now that I know the size mdadm wants to use, I use exactly that (or smaller, but still larger than the PV size currently set).

mdadm --grow /dev/md1 --array-size 155663872
mdadm --grow /dev/md1 --raid-devices=4 --backup-file=/storage/backup1
sleep 1 ; while grep re /proc/mdstat ; do sleep 20 ; date ; done

One of the drives was stuck as a spare.

This is not guaranteed to happen, but it does happen sometimes.  It is just an annoyance, and one of the many reasons RAID6 is much better than RAID5: among other things, errors can be identified and corrected more reliably.  Just use RAID6 for 4 drives and up.  I promise, it's worth it.  3 drives can be RAID5, or RAID10 on Linux, but it's not ideal.  If you have a random-write-intensive workload, RAID10 can save some IOPS, at the expense of more drives to protect larger arrays, and inferior protection.  (E.g., it is possible to lose 2 drives in a 6-drive RAID10 and still lose data, if they both hold copies of the same blocks.)

[root@ns1: /root]
/bin/bash# mdadm /dev/md1 --remove faulty

[root@ns1: /root]
/bin/bash# mdadm --detail /dev/md1
/dev/md1:
State : active, degraded

Number  Major  Minor  RaidDevice  State
   0     259     15        0      active sync   /dev/nvme2n1p2
   1     259     17        1      active sync   /dev/nvme3n1p2
   2     259     11        2      active sync   /dev/nvme0n1p2
   -       0      0        3      removed

   4     259     13        -      spare         /dev/nvme1n1p2

Remove and re-add the spare

The fix was easy.  I just removed and re-added the drive that was stuck as a spare.

[root@ns1: /root]
/bin/bash# mdadm /dev/md1 --remove /dev/nvme1n1p2
mdadm: hot removed /dev/nvme1n1p2 from /dev/md1

[root@ns1: /root]
/bin/bash# mdadm /dev/md1 --add /dev/nvme1n1p2
mdadm: hot added /dev/nvme1n1p2

Check status on rebuilding
[root@ns1: /root]
/bin/bash# mdadm --detail /dev/md1
/dev/md1:
State : active, degraded, recovering

Number  Major  Minor  RaidDevice  State
   0     259     15        0      active sync        /dev/nvme2n1p2
   1     259     17        1      active sync        /dev/nvme3n1p2
   2     259     11        2      active sync        /dev/nvme0n1p2
   4     259     13        3      spare rebuilding   /dev/nvme1n1p2

Alternatively, the rebuild might have been frozen; check sync_action and nudge it:
cat /sys/block/md1/md/sync_action
frozen
echo idle > /sys/block/md1/md/sync_action
echo recover > /sys/block/md1/md/sync_action

Grow to any extra space

Once it is done recovering and/or resyncing, you can grow into any additional space.  Since we used the value above to set the size smaller, we do not have to do this.  Note that when resizing up, it is technically possible to overrun the bitmap.  This example drops the bitmap during the resize; that is a risk you'll have to weigh.  A power outage during a restructure without a bitmap could make for a bad day.

mdadm --grow /dev/md1 --bitmap none
mdadm --grow /dev/md1 --size max
mdadm --wait /dev/md1
mdadm --grow /dev/md1 --bitmap internal

Expand LVM to use the new space

[root@ns1: /root]
/bin/bash# pvresize /dev/md1

[root@ns1: /root]
/bin/bash# pvs
PV        VG     Fmt  Attr PSize    PFree
/dev/md1  rootvg lvm2 a--  <148.38g 54.62g
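With the PV resized, the new space gets used the normal way.  A sketch, where the LV name "root" and the 20G figure are just placeholders for whatever you actually want to grow:

### Sketch: extend an LV into the new free space and resize its filesystem in
### the same step.
lvextend -r -L +20G /dev/rootvg/root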

Other Notes 1:

I also dropped/readded a drive with pending reallocation sectors.  That is entirely unrelated to the reshaping above, but I’ll dump the log here for my own reference.

### See the errors
/bin/bash# smartctl -a /dev/sda
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-10-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Western Digital Red
Device Model: WDC WD30EFRX-68EUZN0
Firmware Version: 82.00A82
User Capacity: 3,000,592,982,016 bytes [3.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm

Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140   Pre-fail Always      -       1
196 Reallocated_Event_Count 0x0032   199   199   000   Old_age  Always      -       1
197 Current_Pending_Sector  0x0032   200   200   000   Old_age  Always      -       2
198 Offline_Uncorrectable   0x0030   200   200   000   Old_age  Offline     -       0

### See what arrays use this disk
[root@ns1: /root]
/bin/bash# cat /proc/mdstat | grep -p sda
md0 : active raid1 sda1[4] sdd1[1] sde1[3] sdc1[2] sdb1[0]
271296 blocks [5/5] [UUUUU]
bitmap: 0/1 pages [0KB], 65536KB chunk

md2 : active raid6 sda3[3] sdb3[2] sdd3[4] sdc3[1] sde3[0]
8682399744 blocks level 6, 512k chunk, algorithm 2 [5/5] [UUUUU]
bitmap: 0/11 pages [0KB], 131072KB chunk

### Remove and re-add so the rebuild rewrites the pending sectors
[root@ns1: /root]
/bin/bash# mdadm /dev/md0 --fail /dev/sda1
mdadm: set /dev/sda1 faulty in /dev/md0

[root@ns1: /root]
/bin/bash# mdadm /dev/md0 --remove /dev/sda1
mdadm: hot removed /dev/sda1 from /dev/md0

[root@ns1: /root]
/bin/bash# mdadm /dev/md0 --add /dev/sda1
mdadm: hot added /dev/sda1

[root@ns1: /root]
/bin/bash# cat /proc/mdstat | grep -p sda
md0 : active raid1 sda1[5] sdd1[1] sde1[3] sdc1[2] sdb1[0]
271296 blocks [5/4] [UUUU_]
[=================>...] recovery = 87.3% (237440/271296) finish=0.0min speed=118720K/sec
bitmap: 0/1 pages [0KB], 65536KB chunk

md2 : active raid6 sda3[3] sdb3[2] sdd3[4] sdc3[1] sde3[0]
8682399744 blocks level 6, 512k chunk, algorithm 2 [5/5] [UUUUU]
bitmap: 0/11 pages [0KB], 131072KB chunk

### Remove/Readd the bigger array member
[root@ns1: /root]
/bin/bash# mdadm /dev/md2 --fail /dev/sda3
mdadm: set /dev/sda3 faulty in /dev/md2

[root@ns1: /root]
/bin/bash# mdadm /dev/md2 --remove /dev/sda3
mdadm: hot removed /dev/sda3 from /dev/md2

[root@ns1: /root]
/bin/bash# mdadm /dev/md2 --add /dev/sda3
mdadm: hot added /dev/sda3

Other Notes 2:

I also made a new array on partition 3.  That is entirely unrelated to the reshaping above, but I’ll dump the log here for my own reference.

[root@ns1: /root]
/bin/bash# mdadm /dev/md4 --create -l 6 -n 4 /dev/nvme?n1p3
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md4 started.

[root@ns1: /root]
/bin/bash# pvcreate /dev/md4
Physical volume "/dev/md4" successfully created.

[root@ns1: /root]
/bin/bash# vgcreate ssdvg /dev/md4 -Ay -Zn
Volume group "ssdvg" successfully created

[root@ns1: /root]
/bin/bash# vgs
VG     #PV #LV #SN Attr   VSize    VFree
datavg   1   7   0 wz--n- <8.09t   704.12g
rootvg   1   7   0 wz--n- <148.38g 54.62g
ssdvg    1   0   0 wz--n- 3.58t    3.58t

[root@ns1: /root]
/bin/bash# cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid10]
md4 : active raid6 nvme3n1p3[3] nvme2n1p3[2] nvme1n1p3[1] nvme0n1p3[0]
3844282368 blocks super 1.2 level 6, 512k chunk, algorithm 2 [4/4] [UUUU]
[>....................] resync = 1.0% (20680300/1922141184) finish=216.5min speed=146336K/sec
bitmap: 15/15 pages [60KB], 65536KB chunk

md3 : active raid1 nvme3n1p1[3] nvme2n1p1[2] nvme1n1p1[1] nvme0n1p1[0]
289792 blocks super 1.2 [4/4] [UUUU]

md0 : active raid1 sda1[4] sdd1[1] sde1[3] sdc1[2] sdb1[0]
271296 blocks [5/5] [UUUUU]
bitmap: 0/1 pages [0KB], 65536KB chunk

md1 : active raid6 nvme1n1p2[3] nvme3n1p2[1] nvme2n1p2[0] nvme0n1p2[2]
155663872 blocks level 6, 256k chunk, algorithm 2 [4/4] [UUUU]
bitmap: 1/1 pages [4KB], 65536KB chunk

md2 : active raid6 sda3[5] sdb3[2] sdd3[4] sdc3[1] sde3[0]
8682399744 blocks level 6, 512k chunk, algorithm 2 [5/4] [UUU_U]
[=>...................] recovery = 6.6% (191733452/2894133248) finish=349.9min speed=128713K/sec
bitmap: 0/11 pages [0KB], 131072KB chunk

unused devices: <none>