Affecting Other
-
allegiance.wa-us.vmho.st
-
06/11/2025 18:29
- 06/17/2025 15:43
-
Last Updated 06/13/2025 04:31
LATEST UPDATE (06/13/2025 @ 12AM EST)
An email has been sent to all impacted by this issue so that they can restore their service.
We're all very tired, as I am sure you are as well. The timeline of events is as follows:
- The host node "Allegiance" in our Washington, USA location was taken offline for a general RAM swap/upgrade. This node was provisioned with only 128GB of RAM, making it half of what the specs call for in our current production fleet. The upgrade was fast and non-problematic. This should have been a 15-20 minute maintenance window, and this is what we had planned for.
- It would appear that 1 of the 2 NVMe drives in the RAID-1 configuration has died between original deployment and this incident.This node was originally deployed in March of 2025, with it passing all of our standard benchmarking / health checks before being put into production use.
- It is assumed that the 2nd drive in the array suffered a similar fate, but the data was still intact as long as it was powered on, but failed to re-initilize on reboot after the RAM upgrade.
- Datarecovery was attempts failed. Obviously this was our first and highest priority after realizing a quick maintenance task has resulted in something more severe. Recovery tools (ex: HDD Raw Copy, DMDE, TestDisk, ddrescue) were unable to read or image either disk.
- The decision was made to reinstall the hypervisor, however, unfortunately, in a troubleshooting step related to a concern about disk firmware and larger RAM totals, a datacenter tech had removed all but a single stick of RAM. This delayed our reprovisioning of the host node by 12+ hours as we waited for DC staff to reinstall the RAM they removed during this troubleshooting step.
- Finally, on 06/12/2025 at about 11PM we had completed the reinstall of the hardware node.
- An email was dispatched to impacted users on how to restore their VM.
While this was incredibly sudden and unexpected, we have enhanced our hardware level health and performance monitoring. We've always kept a close eye on things like network metrics, CPU temps, fan speeds, I/O usage, and have alerting systems in place for concerning metrics. However where we did fail was not having a solid system in place for monitoring things like RAID arrays and disk health, which we are now implementing across the board. It's possible that having had a better monitoring system for drive and RAID health could have given us early warning that could have made this preventable.
In the coming day or so, all impacted users will be given one free month of service. We will manually adjust your renewal date for billing and no action or request is required by you for this to be completed.
In the 4+ years we've been in business we've never experienced a data loss event like this. Please review your own data recovery and disaster plans as this, while rare, could occur with any provider. Also, if you are not already doing so, please consider using the FREE and INCLUDED offsite backup option we provide every VM customer.
Thank you for your patience and understanding.
IncogNET
This node/hypervisor is currently being reinstalled. There was a long delay in delivery of a working system to us which has lengthened the downtime of this server. Reinstallation is about to begin. Impacted users will be notified via email with instructions on how to proceed once this task is completed.
Unfortunately, the situation is more serious and what was supposed to be a simple, quick window of maintenance has turned into something severe.
The VM host node, "Allegiance" was originally provisioned with only 128GB of RAM, whereas our other hostnodes are provisioned with 256GB of RAM. With all other server specs for new deployments being 1:1 matches for consistency, we decided to take this server offline to perform a quick maintenance to upgrade the RAM so that this server could operate to the same capacity as others. The server, otherwise, would be under utilized and only capable of serving half it's potential.
After the maintenance was completed, the server failed to boot. Upon review, it appears to have been an issue with the NVMe RAID array (2 disk, RAID-1). What we believe has happened is that sometime between initial deployment of the hardware node in March of 2025 and recent time is that one of the two NVMe drives died. All components were tested and working as expected upon deployment. It's believed that the second drive was beginning to suffer a similar fate, where the data was intact and there but failed to re-initiatlize on boot.
Data restoration is currently being attempted, but I've been given a low estimate of recovery by the techs working on this.
Since we've been in business, we've never had a data loss event like this, so this is a first for us. We're working hard on recovery, but failing that, we will extend all services by a month (You get a free month) after reprovisioning your account. If you created manual snapshots, these are stored offsite and should be recoverable.
Updates will follow here.
The "Allegiance" node in our Liberty Lake, Washington location is undergoing hardware maintenance to replace RAM in the machine. This should not take more than 30 minutes.
EDIT: Datacenter staff promptly replaced the RAM, but the server is failing to boot properly. Diagnosing this new, unexpected and unplanned issue now.