From Misc and sysadmin team
UPDATE: Servers are now back to life and testers are back to the final ISOs!
As some people may have seen, we suffered a severe power outage yesterday, around 00:05 CEST, in one of our hosting datacenters.
It seems that an electrical problem at the Lost Oasis server room in Marseille took down 4 of our servers (valstar, alamut, jonund and ecosse), as well as friteuse_tmp, the virtual machine running on alamut. It also impacted all the zarb.org servers, which still provide support for some of our services (www, mailing lists, secondary DNS, SMTP, etc.).
Perenoel, one of the great Lost Oasis guys, went to the building during the night to take care of the issue, and the servers got power again around 00:20 CEST. Lost Oasis people worked until 4 o’clock in the morning to fix all the servers.
Now all but 2 servers, Valstar and Jonund, are back online.
Jonund is just a build node; we have another one and we are in freeze, so we can cope with the failure without much trouble.
Valstar is the main SVN and LDAP server, so almost everything depends on it. Impacted services:
- LDAP
  – Identity: no access (no account creation)
  – forum, bugzilla, transifex: mostly read-only access; no one can log in, but people who are already logged in are still OK
  – most @mageia.org aliases (emails are still in queue on zarb)
  – shell access (rabbit, champagne)
  – some Sympa lists (@ml.mageia.org), mostly the board one
- SVN
- buildsystem (no scheduler, no mirror for builders)
- automated administration of all servers (no puppetmaster)
The rest (website, blog, xymon, mailing lists, svnweb) should be OK. We are still looking into it. Lost Oasis told us they would look at our server in the afternoon; we will keep you informed of any changes with a mail on our list.
Sysadmins will also be looking at making the infrastructure more resilient to such problems (for example, a second LDAP server would have solved most issues, and this is already planned).
If you have any questions, please ask on the sysadmin mailing list or on the #mageia-sysadm IRC channel on Freenode, where we will be happy to answer you.
Update (13:10 CEST): all systems are back, up and operational now. \o/
valstar and jonund are now back online
Thank you all for the hard work and speedy response. It’s a great relief to hear that the situation is under control and all is well. It gave us all some nervous moments (hours?).
Looking forward to the new stable release with great anticipation.
Thanks again
Mandriva faced more catastrophic failures (even if most of the time they were not made public, such as a server breaking during ISO creation, a RAID array failing, etc.) and survived. We have a rather experienced team of system administrators taking care of everything, and we have great contact with our hosters.
So we are kinda prepared for the worst kind of failure (even if we are still slowly working toward a more resilient system, as said on the sysadmin ml).
The final release is a few hours from now. My fresh DVD is lying on a table and I can’t wait to give my PC some fresh ‘Magic’! 🙂