Downtime

Outage – Investigation in Progress

Feb 25 at 06:00pm EET
Affected services
Skylink - Netherlands

Updated
Mar 20 at 11:01pm EET

Thank you for your continued patience as we work to resolve the ongoing outage. We apologise for the lack of recent updates; we have been fully engaged in addressing the complex technical challenges we have faced. We are pleased to report that significant progress has been made and the majority of the issues have been resolved.

We are currently conducting final tests and working closely with hardware vendors to address the remaining issues in our network infrastructure. We expect the rest of the environment to be fully operational very soon. As always, our priority is to restore all services.

Please be assured that our team is working around the clock to ensure a speedy resolution. Once the outage is fully resolved, we will contact affected customers to offer compensation for any inconvenience caused. We appreciate your continued patience as we work to restore full service.

We will continue to update this status page as more information becomes available. Thank you for your patience as we work through this process.

Updated
Mar 10 at 11:58pm EET

Thank you for your patience as we work to resolve the ongoing outage. We've made significant progress since our last update and would like to share our latest developments with you. Many fixes have been applied and the majority of the equipment is now working properly. Our current focus is on the remaining issues, where we're conducting thorough diagnostics to identify the root cause of the connectivity issues, and working with hardware vendors to address firmware and compatibility issues between the replacement router and our top-of-rack switches.

While we've made significant progress, we've encountered additional configuration challenges that are taking longer than expected. These include optimising network settings to eliminate intermittent connectivity and ensuring that all hardware is compatible with our unique network setup. Please know that we're dedicating all available resources to overcoming these obstacles quickly.

This outage is the result of a challenging combination of hardware and software issues. We know how important your services are, and our first priority is to restore them all; as we work through these technical issues, we are prioritising server recovery. Once all services are restored, we will contact affected customers to offer compensation for the inconvenience caused by this outage. We are committed to resolving this issue and appreciate your patience as we focus on solutions. We'll continue to update our status page as more information becomes available.

Updated
Mar 06 at 09:01pm EET

It's been a nightmare and we know it's upset you. We're really sorry for all the headaches it's caused. We promised to give you the full story and we're sticking to that. We want to set out exactly what went wrong, what we've done to fix it, and why it's taken so damn long. We're not trying to hide anything. We want to be straight with you about how complicated this has been and let you know that we're absolutely determined to get things back to normal.

Here is what happened. On 25 February at 18:00 EET, our monitoring systems detected a critical failure: the core router in our Netherlands data centre, responsible for routing traffic between servers, had completely failed. Think of it as a catastrophic loss of network fabric, rendering all connected services globally inaccessible.

We immediately activated our incident response protocol. Initial troubleshooting included firmware updates and configuration adjustments, and we even brought in external network specialists. We systematically tested power supplies and attempted component bypasses. Ultimately, we determined that the failure was terminal and that a full router replacement was required to restore service and minimise downtime.

A replacement core router was ordered and arrived on 2 March. This unit required a firmware update to match our network architecture, in particular our routing protocols and security configurations. Following the successful application and reconfiguration of the network, initial tests showed a partial recovery. Some servers regained connectivity, but others remained offline, indicating an incomplete recovery.

The recovery process was complicated by several technical challenges. A key issue has been firmware compatibility. The replacement router encountered compatibility issues with several top-of-rack (ToR) switches, which are essential for server-to-core-router connectivity. These switches were experiencing intermittent connectivity or complete loss of traffic due to outdated firmware that didn't support the routing protocols and features (VLAN tagging, jumbo frame handling) implemented on the new router. We are manually updating the firmware on each affected switch and reconfiguring settings to ensure compatibility, a meticulous process designed to avoid further disruption to operational network segments.
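For illustration only, here is a simplified Python sketch of how such a compatibility cross-check could be scripted; the device names, firmware versions, VLAN IDs and thresholds in it are placeholders rather than our actual configuration:

from dataclasses import dataclass

# Assumed requirements; real values depend on the new router's configuration.
REQUIRED_FIRMWARE = (7, 1)     # hypothetical minimum firmware supporting the new routing features
REQUIRED_MTU = 9000            # jumbo frames, as referenced above
REQUIRED_VLANS = {10, 20, 30}  # placeholder VLAN IDs

@dataclass
class TorSwitch:
    name: str
    firmware: tuple   # (major, minor)
    mtu: int
    tagged_vlans: set

def compatibility_issues(switch: TorSwitch) -> list:
    """Return human-readable reasons this switch may drop or degrade traffic."""
    issues = []
    if switch.firmware < REQUIRED_FIRMWARE:
        issues.append(f"firmware {switch.firmware} is older than required {REQUIRED_FIRMWARE}")
    if switch.mtu < REQUIRED_MTU:
        issues.append(f"MTU {switch.mtu} is too small for jumbo frames ({REQUIRED_MTU})")
    missing = REQUIRED_VLANS - switch.tagged_vlans
    if missing:
        issues.append(f"missing tagged VLANs: {sorted(missing)}")
    return issues

if __name__ == "__main__":
    inventory = [  # entirely hypothetical switches
        TorSwitch("tor-ams-01", (6, 48), 1500, {10, 20}),
        TorSwitch("tor-ams-02", (7, 2), 9000, {10, 20, 30}),
    ]
    for sw in inventory:
        problems = compatibility_issues(sw)
        print(f"{sw.name}: {'OK' if not problems else '; '.join(problems)}")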

Another major challenge has been delays from our third-party provider, which manages our IP address ranges and peering agreements. When the core router was replaced, they had to update their routing tables and configurations. However, delays and misconfigured routes have resulted in traffic being misrouted or dropped. These errors have exacerbated our internal challenges, although their impact is partially masked by our ongoing internal fixes. We are in constant communication with the provider, urging them to expedite updates and fixes. We are also implementing temporary routing policy adjustments to mitigate these external delays.
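As a rough illustration of how route propagation can be sanity-checked from outside the affected network, a script along these lines traces the path towards test addresses and reports whether traffic reaches them via an expected upstream hop; the addresses and hop pattern below are placeholders, not our real addressing plan:

import subprocess

# Hypothetical test addresses inside the affected ranges.
TEST_TARGETS = ["203.0.113.10", "198.51.100.25"]
# Hypothetical marker we would expect to see once upstream routing is correct.
EXPECTED_HOP_PREFIX = "192.0.2."

def trace(target: str, max_hops: int = 15) -> list:
    """Run a numeric traceroute and return hop addresses ('*' for timeouts)."""
    out = subprocess.run(
        ["traceroute", "-n", "-m", str(max_hops), target],
        capture_output=True, text=True, timeout=120,
    ).stdout
    hops = []
    for line in out.splitlines()[1:]:          # first line is the header
        fields = line.split()
        hops.append(fields[1] if len(fields) > 1 else "*")
    return hops

if __name__ == "__main__":
    for target in TEST_TARGETS:
        hops = trace(target)
        via_expected = any(h.startswith(EXPECTED_HOP_PREFIX) for h in hops)
        reached = bool(hops) and hops[-1] == target
        print(f"{target}: reached={reached}, via_expected_upstream={via_expected}")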

Further complicating recovery are hardware and software incompatibilities. Even after applying patches and configurations, some services remain unavailable. We suspect a deeper hardware/software incompatibility between the new router chipset and older switch models. This could be due to differences in the way the router handles network functions such as ARP table management, MTU settings or NIC interactions. MikroTik, the manufacturer of the router, has suggested that the problem lies with our unique network configuration rather than a hardware fault. However, our thorough checks, including NIC driver updates and server interface resets, have not fully resolved the issue. We are conducting detailed diagnostics, including packet capture analysis and stress testing, to determine the root cause. This may require additional patches from MikroTik or further adjustments to our network topology.
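As one simplified example of the kind of probe such diagnostics rely on (a sketch only, assuming a Linux host with the iputils ping utility; the target address and payload sizes are placeholders), a check like this shows whether larger, do-not-fragment packets survive the path to a server, which helps separate MTU and jumbo-frame mismatches from other faults:

import subprocess

TARGET = "203.0.113.10"          # hypothetical affected server
PAYLOADS = [1472, 4000, 8972]    # 1472/8972 bytes correspond to 1500/9000-byte frames minus headers

def ping_df(target: str, payload: int) -> bool:
    """Return True if one do-not-fragment ping of this payload size succeeds."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", "-M", "do", "-s", str(payload), target],
        capture_output=True, text=True,
    )
    return result.returncode == 0

if __name__ == "__main__":
    for size in PAYLOADS:
        ok = ping_df(TARGET, size)
        print(f"payload {size} bytes: {'passes' if ok else 'blocked or dropped (possible MTU mismatch)'}")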

The extended outage has also led to secondary issues that have hampered recovery. We've encountered ARP table overflows due to prolonged network instability, network jitter during test phases, and server NICs requiring manual resets or driver updates after extended downtime. Solving one problem has sometimes revealed new ones, creating an iterative troubleshooting cycle. We are stabilising the network by fine-tuning configurations, clearing outdated ARP entries and ensuring that all hardware components are operating within expected parameters.
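To make the ARP housekeeping above concrete, here is an illustrative sketch (assuming a Linux host with the iproute2 "ip" utility) that only lists neighbour-table entries left in a broken state, so they can be reviewed before anything is flushed:

import subprocess

BAD_STATES = {"FAILED", "INCOMPLETE"}

def stale_neighbours() -> list:
    """Return neighbour-table lines whose state suggests a stale or broken entry."""
    out = subprocess.run(["ip", "neigh", "show"], capture_output=True, text=True).stdout
    return [line for line in out.splitlines() if line.split() and line.split()[-1] in BAD_STATES]

if __name__ == "__main__":
    entries = stale_neighbours()
    if not entries:
        print("neighbour table looks clean")
    for entry in entries:
        # Review before clearing; flushing per interface would be: ip neigh flush dev <interface>
        print("stale:", entry)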

As of 6 March, we have successfully restored 40 services, and efforts are ongoing to bring the remaining affected services back online. Our priorities include completing firmware updates and configuration adjustments on all ToR switches, coordinating with the third party provider for timely and accurate routing updates, and resolving remaining hardware-software incompatibilities through continued investigation and vendor collaboration.

We understand the importance of your data. Even for customers considering refunds, we are prioritising server restoration to ensure data access and retrieval. We will facilitate this process before finalising any account changes. Once the outage is fully resolved, we will proactively contact affected customers to discuss compensation for the disruption.

This outage is a confluence of hardware failure, software incompatibilities and external coordination challenges. We are working diligently to overcome these obstacles and appreciate your patience as we navigate this complex situation. We will continue to provide updates on this status page as new developments arise.

Updated
Mar 06 at 09:58am EET

We keep working on restoring the remaining services. Later today, we will post a detailed explanation of the issue on this status page (here).

Updated
Mar 04 at 11:42pm EET

We've restored 40 services so far and are continuing to work on the remaining affected services. We sincerely apologise for the prolonged outage. This is a complex issue that requires third-party intervention and additional support. As we have been working around the clock to resolve this issue, we will be taking a short break until morning before resuming our efforts.

In the next day or two, we will provide a detailed explanation of the issue - what exactly happened and how - on our status page (here).

Once the matter is fully resolved, we will reach out to affected customers regarding compensation.

Updated
Mar 04 at 10:01am EET

We are continuing to restore the affected services. So far, 35 services have been successfully restored.

Updated
Mar 03 at 02:35pm EET

We have partially restored affected services, which means that some are now back online and accessible. However, certain services remain unavailable, and we are currently investigating the issue. (30 services restored)

Once the matter is completely resolved, we will contact the affected customers regarding compensation.

Updated
Mar 03 at 11:24am EET

We have partially restored affected services, which means that some are now back online and accessible. However, certain services remain unavailable, and we are currently investigating the issue.

Once the matter is completely resolved, we will contact the affected customers regarding compensation.

Updated
Mar 02 at 06:00pm EET

The replacement part is set to arrive in about an hour, and we’ll begin the replacement process immediately.

We expect to have everything back up and running within approximately 12-16 hours.

Updated
Feb 27 at 10:25am EET

We have received an update from the delivery company stating that the replacement parts are expected to arrive on 02.03. We are still actively investigating alternative/faster solutions that could temporarily resolve the situation.

Updated
Feb 26 at 12:20am EET

We attempted to implement a fix, however, it failed to resolve the issue. We have ordered replacement parts, which should arrive between 28.02 and 03.03.

In the interim, we will persist with our investigation and might rectify the issue before the replacement is delivered.

Updated
Feb 25 at 08:45pm EET

We're investigating the issue and believe we might have found a root cause, but we need to test it to confirm.

Created
Feb 25 at 06:00pm EET

We are currently investigating an outage.