
How to DoS yourself

How to DoS

You may have noticed it over the last weekend: we had several issues with our network stability.

On Friday we started upgrading the software on several core routers, so at first we thought the necessary reboots had caused the problems. But the interruptions did not stop after the update was complete; they even got worse.

One reason was a series of messages like this:

vaih-core-1 dc-pfe: PANIC PANIC PANIC PANIC PANIC PANIC

or this

vaih-core-2 fpc0 SCHED: Thread 24 (PIC Periodic) ran for 1914 ms without yielding

This caused lags, unresponsive and even unreachable devices. We use routing protocols (OSPF and BGP) to manage our traffic; they require constant communication between the devices so that each device can select the optimal path to send packets. The unresponsive devices therefore triggered routing changes, as the backup links took over.

As soon as the backup links took over, the affected devices became reachable again and the routing changed back to the primary path. This repeated between 3 and 20 times and lasted between 30 seconds and 20 minutes.
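The resulting flapping can be pictured with a very small route-selection rule (a purely conceptual Python sketch; real OSPF/BGP convergence is of course far more involved and the path names are made up):

# Conceptual sketch of the flapping: traffic uses the primary path while the
# neighbour on it responds, and falls back to the backup path while it does not.

def best_path(primary_alive: bool) -> str:
    return "primary" if primary_alive else "backup"

# Overload -> core unresponsive -> backup takes over -> load drops ->
# core recovers -> routing switches back -> the cycle starts again.
for primary_alive in [True, False, True, False, True]:
    print(best_path(primary_alive))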

Finding the Root Cause

The logs showed the activation of the devices' DDoS protection measures, repeating messages like the ones above, and dying OSPF and BGP connections.

The sporadic nature of this problem made it complicated to debug: you had to be present the whole time to notice the problem when it started, and to run several commands to collect log files and debug information before it was over.

At first we thought there was a configuration difference between our two NAT servers, as the traffic was handled by one device when we started the upgrade and by the other afterwards. But we could rule out this theory shortly after, since the other device showed the same behaviour.

Another possible cause could have been the connection between the routers and the NAT servers, but we were able to verify that this setup was correct as well.

Directly after the incident started, it was more or less obvious to us that the DDoS protection was just a symptom, not the cause. It was the same as with the Juniper bug. So we followed the same investigation pattern: looking at the packets sent through the NAT when the lags started.

The result was a lot of packets to event.selfnet.de, the public IPv4 NAT address for the wireless network we use at public events. Around 2.2% of the packets in the 3-second sample we took were directed to this IP. It was a mixture of UDP traffic and a lot of NTP packets, which caused some concern. The UDP traffic could be explained by some previously unknown, but still active, access point. But nobody should be using this IP for NTP updates.
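For illustration, a similar breakdown of a short capture could be done roughly like this (a Python sketch using scapy, which is not necessarily the tooling we used; the capture file name and the address are placeholders):

# Sketch: how much of a short capture is directed at a single NAT address,
# and which UDP ports are involved (123 = NTP).
# "sample.pcap" and TARGET are placeholders, not our real data.
from collections import Counter
from scapy.all import rdpcap, IP, UDP

TARGET = "192.0.2.1"                  # documentation address standing in for event.selfnet.de
packets = rdpcap("sample.pcap")

to_target = [p for p in packets if p.haslayer(IP) and p[IP].dst == TARGET]
print(f"{len(to_target) / len(packets):.1%} of the sampled packets go to {TARGET}")

ports = Counter(p[UDP].dport for p in to_target if p.haslayer(UDP))
print(ports.most_common(5))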

So we looked at the sources of the packets. Some came from a German vServer provider, but the NTP packets originated mainly from China Telecom… weird, why would anyone go around the globe to ask us for an NTP update…

Solving the mystery

So now we wanted to verify that this was legitimate traffic and found out: a simple ping to the public IPv4 address returned the error message "TTL expired", which should not happen in a local network…

Further lookups showed IP packets bouncing between our core routers (in the internal network) and the edge routers (in the external networks).

The edge routers sent the packets to our NAT (where they should go), but the NAT did not "translate" them. We would expect those packets to be translated to the address of the host that opened the connection, and since there was no open connection, the packets should have been dropped. Instead, the NAT forwarded the packets without translation. So an IPv4 packet arrived in our core network with a public IPv4 address as its destination. Since we only use private address space inside the network, the routers have no route for it, so the packet is forwarded in the direction of the default route, which is: to the internet. The packet is thus sent back through the NAT again. This time it isn't expected to be translated, because it didn't originate inside our network and already has a public source address (from China Telecom). So our edge router receives the packet, and we start over. These packets go back and forth, tens or maybe even hundreds of times, until their time-to-live expires.
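To illustrate the loop, here is a minimal Python sketch (purely illustrative, not our router or NAT software; the starting TTL of 255 and the one-decrement-per-hop behaviour are assumptions):

# Minimal sketch of the loop described above: a single external packet
# bounces between edge router and core until its TTL reaches zero.
# Assumption: it arrives with a TTL of 255 and every hop decrements it by one.

def bounce_until_ttl_expires(initial_ttl: int = 255) -> int:
    """Count how often the packet crosses the links between edge and core."""
    ttl = initial_ttl
    crossings = 0
    while ttl > 0:
        # Core: no route for the public destination -> default route back out.
        # Edge: the prefix belongs to our NAT pool -> send it right back in.
        ttl -= 1
        crossings += 1
    return crossings

print(bounce_until_ttl_expires())   # 255 link crossings caused by one external packet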

To make matters worse, all of this happened at 40 Gbit/s between the devices, while the packets still needed to be processed by the cores' routing engines. So the "DDoS Violations" were correct: we DDoSed ourselves with 40 Gbit/s…

After this, the pattern was clear to us: all IPs in our pool for internal SNAT (like the "Event" wireless network) were routed in a loop, because the NAT simply forwarded packets to them even when there was no valid state from an internal IP.

So one packet from an external address was multiplied into (up to) 255 transits between the devices at 40 Gbit/s. Recall our earlier observation: we saw 4000 packets in 3 seconds. Multiplied by 255, this translates to roughly a million packets in our internal network… No wonder the DDoS protection kicked in to protect the routing engines of the devices and they became unreachable.
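As a quick back-of-the-envelope check (the 4000 packets, the 3-second sample and the factor of 255 come straight from the numbers above; the rest is plain arithmetic):

# Back-of-the-envelope check of the amplification described above.
external_packets = 4000      # packets to event.selfnet.de in the 3-second sample
sample_seconds = 3
max_transits = 255           # worst case: one transit per TTL decrement

internal_packets = external_packets * max_transits
print(internal_packets)                    # 1020000 -> roughly a million packets
print(internal_packets // sample_seconds)  # 340000 packets per second on the internal links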

But when the routing engine stopped responding, the BGP and OSPF sessions died. This started a domino effect: routing switched to the backup paths, other routers were impacted, and when the first device became responsive again (since the load suddenly disappeared), the routing switched back to the primary connections - and triggered the whole loop again.

The Fix

Finally we knew what we had to do: packets that the NAT cannot translate have to be dropped at the end of NAT processing instead of being forwarded.
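Conceptually, the corrected behaviour looks roughly like this (a simplified Python sketch, not our actual NAT configuration; the state table, the pool check and the addresses are placeholders reduced to the bare minimum):

# Sketch of the corrected decision for packets arriving from the outside.
# SNAT_POOL and STATE_TABLE are stand-ins for the real NAT state.

SNAT_POOL = {"192.0.2.1"}   # documentation address standing in for our public SNAT pool
STATE_TABLE = {}            # (public address, port) -> internal host, filled by outbound traffic

def handle_inbound(dst_addr: str, dst_port: int) -> str:
    state = STATE_TABLE.get((dst_addr, dst_port))
    if state is not None:
        return f"translate to {state}"   # existing connection -> rewrite to the internal host
    if dst_addr in SNAT_POOL:
        return "drop"                    # the fix: no state -> drop, never forward untranslated
    return "forward"                     # unrelated traffic is handled as before

print(handle_inbound("192.0.2.1", 123))  # "drop" instead of looping through the network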

The fix was applied yesterday around 9 p.m., and since then we haven't detected another larger disruption.

As for the ultimate cause, we can only speculate:

  • Was it the new software? Is there a command or setting that is now interpreted differently?
  • Did the reboot of our devices change something?
  • Did somebody try to scan our network for vulnerabilities?
  • …?

For now we cannot say which is more likely. We can only say that our NAT was not configured correctly.

And the moral of the story? Do not trust your own NAT if you already have a history!


Thanks to all of our members who helped in any way to solve this problem. We also apologize to all users of our network, as it wasn't quite stable while we were identifying and debugging the bug.

Do you want to help take care of such problems in the future? For our network to run smoothly there are a lot of different things to do: managing the equipment (including buying new hardware and arranging service), taking care of the servers for the network and additional services, debugging problems like this one, or connecting new dormitories to our network.

If you want to volunteer, it doesn't matter whether you are a pro or a beginner: Selfnet offers the opportunity to learn everything required. Whether you are interested in technical stuff, programming, public relations, project management or anything else, we would be glad to welcome you to our team! Just visit our support hours (once the office hours are open again).

The Selfnet-Team