What happened to HY two weeks ago?

Hello,

Today we would like to explain what happened two weeks ago and led to a one-week HY downtime. (The same applies to ht-flagtools and cupmanager, because they are hosted on the HY server!)

During the morning hours of Thursday, the 22nd of January, we noticed some problems with our server. We saw two options: 1) kill a lot of processes and try to get rid of the problem that way, or 2) simply restart the server. Option 2) sounds easier, but since we had not restarted the server for a while, we were a bit afraid that problems could occur during the reboot and cause a short downtime (30 to 60 minutes). So we tried killing a lot of processes first. Pretty quickly we came to the conclusion that this might take longer than restarting the server and might not fix the problem either. Therefore we decided to restart the server.
As we still had not received any response from our server ten minutes after the reboot, we already feared that we had run into a smaller problem. We waited an additional 30 minutes to be sure that no file system checks were running, but there was no response at all. Then we rebooted into a session that let us see the screen of the server. You have to know that our server is located in a datacenter in the Nuremberg area, so we do not have physical access to it. We saw that the boot loader (grub) did not load the operating system because it could not find the filesystems on our hard drives. We rebooted into a rescue system and reinstalled grub, but it still did not work. For an unknown reason grub could not find the filesystems or any software RAID devices, even though we had loaded all necessary drivers properly. Since the boot loader of the rescue system found both, we decided to install that boot loader on our system, still without success. We also tried downgrading and upgrading to other versions of grub, again without success.

Since we were running out of ideas, we started working on a plan B. Plan B was to rent a new server and replace the old one. It is neither a fast nor a cheap solution, but it is one that works. After several unsuccessful attempts to get grub working, we rented a new server and started installing the operating system and all server software. We took the opportunity to update all software to the latest versions (which we had planned to do on our old server soon anyway). This slowed down the whole process, but it gives us more stability and security in the future. Each day after our regular work (and sometimes during it) we worked many hours to install and set up everything, resulting in a massive lack of sleep.

At the beginning of the following week our server was configured and we felt that we could be back online soon, but then we observed some problems with the new server software, which we were able to fix by Wednesday. On Wednesday evening we did a last test of all software (with some HY users) and it seemed to run well. We decided to release on Friday evening, because on Thursday we would not have been able to watch the server (or even stop it) if a problem occurred. Since we were aware that a lot of people would log in during the first hours after HY came back, I decided to announce that the server would be back online during the weekend. This lowered the number of users in the first hours on Friday and allowed us to watch out for problems and fix them before everyone put load on the new server configuration and reported bugs to us.

When we put the server online, 50 users were online within minutes. After one hour the first user posted on the HT forums that we were back online, and some hours later we announced it on Twitter as well. Everything worked well, but we are absolutely sure that it was right not to announce the exact time we would be back online, because in case of an error it might have ended badly 😉 .

 

We are very thankful for all the kind tweets we received during the downtime! A few words about our backup strategy, which might calm down the less kind tweeters a bit: we run a full backup of the HY site and database (the same applies to ht-flagtools and cupmanager) each week and an incremental one every day.
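
As a rough illustration of how a weekly-full, daily-incremental schedule works, here is a minimal Python sketch. The choice of Sunday for the full backup and the function names are assumptions made for the example, not the actual HY configuration.

```python
from datetime import date, timedelta

# Assumption for illustration: the weekly full backup runs on Sunday.
# date.weekday() numbers Monday as 0 and Sunday as 6.
FULL_BACKUP_WEEKDAY = 6


def backup_type(day):
    """Return which kind of backup is taken on the given day."""
    return "full" if day.weekday() == FULL_BACKUP_WEEKDAY else "incremental"


def restore_chain(target):
    """Return the backup dates needed to restore the state of `target`:
    the latest full backup on or before `target`, followed by every
    daily incremental backup up to and including `target`."""
    days_since_full = (target.weekday() - FULL_BACKUP_WEEKDAY) % 7
    last_full = target - timedelta(days=days_since_full)
    return [last_full + timedelta(days=i) for i in range(days_since_full + 1)]


if __name__ == "__main__":
    for day in restore_chain(date(2015, 1, 22)):
        print(day.isoformat(), backup_type(day))
```

With such a schedule, at most one day of data is at risk, and a restore needs the last full backup plus up to six incremental ones.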

We are also thankful to everyone who has donated money to us – thank you!

Special thanks to former HT manager exciler, who has helped us a lot!

/Markus (Mackshot)
