How we Recovered from a Failed Disk

Gentlent's practical approach to dealing with a disk failure in our server setup. This post covers the steps we took to identify, resolve, and upgrade our system to prevent future issues, all without affecting our customers.

By Tom Klein · ~4 min read
 

During a routine server check a few days ago, we noticed an unsettling pattern: one of our disks had been ejected from the RAID array for the second time in a month. It became clear that this disk was failing, which had left the server's small RAID array in a degraded state.

The potential for data loss or downtime in such situations is a concern for any IT team. However, we've always prioritized data integrity and system reliability. Thanks to our regular, secure backup protocols and real-time replication for core databases, we were prepared. This approach ensured that even with the server at risk, our operations could continue without interruption, and more importantly, without jeopardizing any customer data.

Upon recognizing the issue, we didn't waste any time. We quickly procured additional SSDs and set about upgrading the RAID arrays on our machines. The upgrade process was smooth for the second server, which we upgraded just in case, but we hit a snag with the first one: its boot partition was on the failing disk.

The Fix

Addressing this issue required a hands-on approach. We went on site, replaced the problematic disk, and reconfigured the RAID array. This process took a few hours, but by the end of it, the server was back up and running as if nothing had happened.

When we identified the failing disk, our immediate focus was to ensure the integrity of our RAID array and to restore full functionality. Here's a brief overview of the technical steps we took:

1. Identifying the Issue

First, we used mdadm to examine the status of our RAID arrays:

sudo mdadm --detail /dev/md0

This command helped us confirm which disk was failing. When we tried to re-add it to the software RAID array, we saw write speeds drop significantly in real time.
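
Depending on the disk's state, the failing member may also need to be explicitly marked as failed and removed from the array before the replacement goes in. A minimal sketch, assuming the failing partition is /dev/sdZ1 (a placeholder device name, not from our setup):

    # Mark the failing partition as faulty within the array
    sudo mdadm --manage /dev/md0 --fail /dev/sdZ1
    # Remove the faulty member so the slot is free for the replacement
    sudo mdadm --manage /dev/md0 --remove /dev/sdZ1
    # Verify the array no longer lists the removed device
    sudo mdadm --detail /dev/md0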

2. Booting from Live Image - The Ubuntu Way

Our first hurdle was gaining access to the server's file system without booting from the compromised disk. We achieved this with a Live Ubuntu Server ISO, a process that is itself pretty straightforward:

  1. Prepare Live Media: We downloaded the Ubuntu Server ISO and created a bootable USB drive.
  2. Boot into Live Environment: We inserted the USB drive and restarted the server. During the boot process, we selected the USB drive as the boot device.
  3. Enter Live Session Shell: Once the Ubuntu Server live installer loaded, we clicked the "Help" button in the top right corner of the screen and selected "Enter shell" to access a terminal without actually installing the image.
  4. Mount the Necessary Filesystems: We used mount and chroot, following a standard chroot guide, to access the server's filesystem. This allowed us to make changes to the server's configuration and RAID array. Those commands might look like this (assuming the root filesystem is already mounted at /mnt, as sketched below):
    for i in /dev /dev/pts /proc /sys /run; do sudo mount -B $i /mnt$i; done
    sudo chroot /mnt
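
For completeness: the bind mounts above assume the server's root filesystem is already mounted at /mnt. A rough sketch of that preceding step, assuming the root filesystem lives on the RAID device /dev/md0:

    # Assemble any existing arrays the live environment hasn't activated yet
    sudo mdadm --assemble --scan
    # Mount the root filesystem so the bind mounts and chroot have a target
    sudo mount /dev/md0 /mnt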

3. Preparing the New Disk

With access to a shell, we proceeded to prepare the new disk for integration into the RAID array:

  1. Identify the New Disk: We used lsblk to list all block devices and identify the new disk.
  2. Partition the New Disk: Using fdisk on the new disk (/dev/sdX), we created a new partition table and partitions mirroring those of the existing RAID disk(s).
    sudo fdisk /dev/sdX
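
Creating the partitions by hand with fdisk works, but the partition table can also be copied from the healthy disk in one go. A hedged sketch using sgdisk (from the gdisk package), assuming a GPT layout with /dev/sdY as the surviving disk and /dev/sdX as the new one:

    # Replicate the GPT partition table from the healthy disk onto the new disk
    sudo sgdisk --replicate=/dev/sdX /dev/sdY
    # Give the new disk fresh GUIDs so the two disks don't clash
    sudo sgdisk --randomize-guids /dev/sdX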

4. Integrating the Disk into the RAID Array

With the disk partitioned, the next step was to integrate it into the RAID array:

  1. Add the New Partition to the RAID: We used mdadm to add the new partition to the existing RAID array.
    sudo mdadm --manage /dev/md0 --add /dev/sdX1
  2. Monitor the RAID Rebuild: We kept an eye on the rebuild process to ensure it was progressing without issues.
    cat /proc/mdstat
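
For longer rebuilds it helps to keep a live view on the progress. A small sketch of how that monitoring might look, using only standard mdadm tooling:

    # Refresh the rebuild progress every five seconds
    watch -n 5 cat /proc/mdstat
    # Or ask mdadm directly for the array state and rebuild status
    sudo mdadm --detail /dev/md0 | grep -E 'State|Rebuild'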

5. Addressing the Boot Partition

The absence of the boot partition on the surviving disk was a critical issue we needed to resolve:

  1. Create a New EFI Partition: We used fdisk to create a new EFI system partition on the surviving disk.
    sudo fdisk /dev/sdY
  2. Format the EFI Partition: Next, we formatted the new EFI partition as FAT32.
    sudo mkfs.vfat -F 32 /dev/sdY1
  3. Mount the EFI Partition: We mounted the new EFI partition to /mnt/efi.
    sudo mount /dev/sdY1 /mnt/efi
  4. Reinstall GRUB: We reinstalled GRUB to the EFI partition to restore boot functionality.
    sudo grub-install --target=x86_64-efi --efi-directory=/mnt/efi --bootloader-id=Ubuntu
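
If this is run from inside the chroot prepared earlier, the GRUB configuration usually needs to be regenerated afterwards as well. A brief sketch of those follow-up commands, assuming an Ubuntu system inside the chroot:

    # Regenerate grub.cfg so it reflects the current disks and kernels
    sudo update-grub
    # Optionally check that the firmware now has a matching EFI boot entry
    efibootmgr -v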

6. Updating the fstab

The final step was to ensure the system could automatically mount the new EFI partition at boot:

  1. Find the UUID of the New EFI Partition: We used blkid to get the UUID.
    blkid /dev/sdY1
  2. Edit /etc/fstab: We added a new line for the EFI partition using the UUID obtained from blkid.
    UUID=<new-efi-partition-uuid> /boot/efi vfat umask=0077 0 1
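
Before rebooting, the new entry can be sanity-checked without leaving the shell. A minimal sketch, assuming /etc/fstab already contains the line above with the real UUID filled in:

    # Mount everything listed in fstab that isn't mounted yet; errors surface here
    sudo mount -a
    # Confirm the EFI partition ended up on the expected mount point
    findmnt /boot/efi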

7. Verifying the Recovery

After completing these steps, we rebooted the server to verify that the recovery was successful. The system booted normally, and all RAID arrays were functioning as expected.
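
For reference, a short sketch of the checks we'd run after such a reboot to confirm the arrays are healthy:

    # All arrays should show up as active with no missing members
    cat /proc/mdstat
    # The detailed view should report a clean, non-degraded state
    sudo mdadm --detail /dev/md0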

Seamless Service

Throughout this ordeal, our main concern was to maintain service continuity for our customers. Thanks to our preemptive measures and quick response, we managed to do just that. No customer data was put at risk, and our services remained online and fully operational.

In facing this challenge, we were reminded of the importance of regular system checks, reliable backup strategies, and the ability to respond swiftly to unforeseen issues. It's these practices that help us keep our promise of reliable service to our customers.

