Introduction
This page describes how AlexisHuxley changes disks in his own systems; it is probably not of general interest.
A broken disk can substantially affect performance; e.g. torchio became very sluggish and logged errors like these to the dmesg buffer:
[2680955.844731] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[2680955.844734] ata3.00: BMDMA stat 0x24
[2680955.844737] ata3.00: failed command: READ DMA
[2680955.844742] ata3.00: cmd c8/00:08:00:77:14/00:00:00:00:00/e4 tag 0 dma 4096 in
[2680955.844743] res 51/40:00:00:77:14/40:00:03:00:00/e4 Emask 0x9 (media error)
[2680955.844746] ata3.00: status: { DRDY ERR }
[2680955.844747] ata3.00: error: { UNC }
[2680955.869870] ata3.00: configured for UDMA/133
[2680955.869889] sd 2:0:0:0: [sda] Unhandled sense code
[2680955.869891] sd 2:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[2680955.869895] sd 2:0:0:0: [sda] Sense Key : Medium Error [current] [descriptor]
[2680955.869899] Descriptor sense data with sense descriptors (in hex):
[2680955.869901] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
[2680955.869910] 04 14 77 00
[2680955.869914] sd 2:0:0:0: [sda] Add. Sense: Unrecovered read error - auto reallocate failed
[2680955.869918] sd 2:0:0:0: [sda] CDB: Read(10): 28 00 04 14 77 00 00 00 08 00
[2680955.869927] end_request: I/O error, dev sda, sector 68450048
[2680955.869945] ata3: EH complete
[2680955.882291] raid1:md2: read error corrected (8 sectors at 44031248 on sda3)
[2680955.882315] __ratelimit: 1 callbacks suppressed
[2680955.882317] raid1: sdb3: redirecting sector 44029200 to another mirror
[2680955.882801] raid1: sdb3: redirecting sector 44029208 to another mirror
[2680956.078875] raid1: sdb3: redirecting sector 44029216 to another mirror
...
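In a long log like the one above, the affected device is easiest to spot in the "I/O error, dev ..." lines. The following is a minimal sketch that pulls out just those device names; failing_devs is a hypothetical helper name, and the pattern assumes the kernel words the message as shown in the log above (other kernel versions may phrase it differently).

```shell
# Print each block device named in an "I/O error, dev X" kernel message,
# one per line, without duplicates. Reads log text on standard input.
# (failing_devs is an illustrative name, not an existing tool.)
failing_devs() {
    sed -n 's/.*I\/O error, dev \([a-z0-9]*\),.*/\1/p' | sort -u
}

# Example use on a live system (reading dmesg may need root):
#   dmesg | failing_devs
```

This only narrows down the OS device name; it says nothing about which physical disk that is.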
Procedure
- Determine the OS device name for the disk with the problem; e.g. by running dmesg.
- Determine which physical disk has the problem; e.g. by running dd if=/dev/sda of=/dev/null to illuminate a disk LED or by tracing SATA cables (sda will be on the lowest-numbered connected SATA channel).
- Determine what RAID devices are on the disk by examining /proc/mdstat.
- Mark the half of any RAID which is on the broken disk as failed; e.g.
mdadm --manage /dev/md2 --fail /dev/sda3
mdadm --manage /dev/md1 --fail /dev/sda2
mdadm --manage /dev/md0 --fail /dev/sda1
(After this is done, the sluggish behaviour should stop.)
- If there are any DRBD devices in primary role on top of the RAIDs then migrate primary role (this may require migrating services running on the DRBDs); for me this means running:
pasta-vm-manager move local-running-vms remote-server
- Shut down the system and remove the bad disk.
- If the bad disk was /dev/sda then swap the disk connections so that what was /dev/sdb becomes /dev/sda. (Booting is harder when the disk on the lowest SATA channel is not bootable.)
- Insert the new disk and power on.
- Wait for any DRBD synchronisation to complete (we're about to start another synchronisation, and having two running in parallel is very slow).
- Use the same tool (parted or fdisk) as was used on the old disk to partition the new disk identically (start and end points, types, bootable flag).
- Add the new partitions into the RAIDs; e.g.
mdadm --manage /dev/md0 --add /dev/sdb1
mdadm --manage /dev/md1 --add /dev/sdb2
mdadm --manage /dev/md2 --add /dev/sdb3
- Wait for synchronisation of the root filesystem to complete (grub-pc.postinst will need to know the positions of certain files on the disk and that is not decided until the synchronisation is complete).
- Write the GRUB MBR to the new disk (and the old one) by running:
/var/lib/dpkg/info/grub-pc.postinst configure
- Wait for synchronisation to complete.
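Two steps in the procedure above rely on reading /proc/mdstat: finding which RAIDs have a member on a given disk, and waiting for synchronisation to finish. The following is a rough sketch of both, not a tested tool. arrays_on_disk and sync_in_progress are hypothetical helper names, and the parsing assumes the usual mdstat layout ("md2 : active raid1 sda3[0] sdb3[1]", with a "recovery =" or "resync =" progress line while rebuilding) and sdX-style disk names.

```shell
# Print the md arrays that have a member partition on the given disk.
# Reads mdstat-formatted text on standard input; $1 is a disk such as "sda".
# (Illustrative helper; assumes sdX-style names, so not nvme0n1p1 etc.)
arrays_on_disk() {
    awk -v disk="$1" '
        $2 == ":" {
            # Member devices look like sda3[0] or sda3[0](F).
            for (i = 3; i <= NF; i++)
                if ($i ~ ("^" disk "[0-9]+\\[")) { print $1; break }
        }'
}

# Succeed (exit 0) if the mdstat text on standard input shows a resync or
# recovery that is running, delayed or pending.
sync_in_progress() {
    grep -Eq '(resync|recovery) *='
}

# Example use on a live system:
#   arrays_on_disk sda < /proc/mdstat
#   while sync_in_progress < /proc/mdstat; do sleep 60; done
```

For a single array, mdadm --wait /dev/md0 should serve the same purpose as the polling loop.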
