Network Debugging: Rogue Router Advertisements
Over the last few months we observed odd behaviour on some of our servers: they could no longer reach the internet over IPv6[1]. They could still reach other devices within the same layer 2 network, but not much else.
[1] The current standard Internet Protocol, specified in 1998 (RFC 2460) and running alongside the much older IPv4.
On digging further we discovered that the routing table contained an additional default route via the link-local IPv6 address fe80::0123:45ff:fe67:89ab.
$ ip -6 route
2001:db8::/64 dev eth0 proto kernel metric 256
fe80::/64 dev eth0 proto kernel metric 256
default via 2001:db8::254 dev eth0 metric 1024
default via fe80::0123:45ff:fe67:89ab dev eth0 proto kernel metric 1024 expires 1717sec hoplimit 64
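One way to spot such a route in a script is to look for default routes that carry an expiry timer, since manually configured routes normally do not expire. The following is a minimal Python sketch rather than a finished tool. The function name `default_routes` and the `SAMPLE` constant are illustrative only, and the parser assumes the `default via <gateway> ...` line layout shown in the output above.

```python
from typing import List, Tuple

# Sample copied from the `ip -6 route` output above.
SAMPLE = """\
2001:db8::/64 dev eth0 proto kernel metric 256
fe80::/64 dev eth0 proto kernel metric 256
default via 2001:db8::254 dev eth0 metric 1024
default via fe80::0123:45ff:fe67:89ab dev eth0 proto kernel metric 1024 expires 1717sec hoplimit 64
"""


def default_routes(ip_route_output: str) -> List[Tuple[str, bool]]:
    """Return (gateway, has_expiry) for every 'default via ...' line."""
    routes = []
    for line in ip_route_output.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[:2] == ["default", "via"]:
            # An 'expires' field hints that the route was learned dynamically.
            routes.append((fields[2], "expires" in fields))
    return routes


if __name__ == "__main__":
    for gateway, has_expiry in default_routes(SAMPLE):
        note = "expires, probably learned from an RA" if has_expiry else "no expiry"
        print(f"{gateway}: {note}")
```

More than one default route, with one of them expiring, is the pattern we ran into here.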
At first we just removed the additional default route from one of the servers.
$ sudo ip -6 route del default via fe80::0123:45ff:fe67:89ab dev eth0
But this only worked for about 2 minutes. After that the additional default route appeared again.
Since the IPv6 address contained the 'ff:fe' pattern, we knew that it was an autoconfigured (EUI-64) address derived from the MAC address of the device.
A quick search for the vendor prefix of that MAC address told us the brand of the device.
Because we knew the MAC address, we could also find the port the device was connected to by looking at the switch the servers were attached to.
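Recovering the MAC address from an autoconfigured link-local address can be sketched in a few lines of Python. This is a minimal sketch that assumes the interface identifier was built with the modified EUI-64 scheme (the 'ff:fe' in the middle); the function name `mac_from_link_local` is made up for this example. The scheme inserts ff:fe between the two halves of the MAC address and flips the universal/local bit (0x02) of the first byte, so the function reverses both steps.

```python
import ipaddress


def mac_from_link_local(address: str) -> str:
    """Recover the MAC address from a modified EUI-64 IPv6 address."""
    # The interface identifier is the lower 64 bits of the address.
    iid = ipaddress.IPv6Address(address).packed[8:]
    if iid[3:5] != b"\xff\xfe":
        raise ValueError("interface identifier is not modified EUI-64")
    # Flip the universal/local bit back and drop the inserted ff:fe bytes.
    mac = bytes([iid[0] ^ 0x02]) + iid[1:3] + iid[5:8]
    return ":".join(f"{byte:02x}" for byte in mac)


if __name__ == "__main__":
    print(mac_from_link_local("fe80::0123:45ff:fe67:89ab"))  # 03:23:45:67:89:ab
```

The first three bytes of the result are the vendor prefix that can be looked up, and the full address is what to search for on the switch.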
Digging deeper we suspected router advertisements as the culprit (because the bad default route reappeared after a short time).
Running tcpdump on all three of our virtualisation servers confirmed that the problem existed in both server networks (Allmandring and Heilmannstraße) and was in fact caused by two devices (one in each server room).
$ sudo tcpdump -vvvv -ttt -i br-server icmp6 and 'ip6[40] = 134'
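For readers wondering what the filter does: the fixed IPv6 header is 40 bytes long, so for a packet without extension headers the byte at offset 40 is the ICMPv6 type field, and type 134 is a Router Advertisement. The Python sketch below roughly mimics that check on a raw IPv6 packet (without the Ethernet header). The names `is_router_advertisement` and `fake_packet` are invented for this illustration, and the dummy packet is not a valid RA, it only has the bytes the filter looks at.

```python
IPV6_HEADER_LEN = 40        # fixed IPv6 header size in bytes
NEXT_HEADER_ICMPV6 = 58     # Next Header value for ICMPv6
ICMPV6_ROUTER_ADVERT = 134  # ICMPv6 type of a Router Advertisement


def is_router_advertisement(ip6_packet: bytes) -> bool:
    """Roughly mimic `icmp6 and ip6[40] = 134` on a raw IPv6 packet."""
    if len(ip6_packet) <= IPV6_HEADER_LEN:
        return False
    is_ipv6 = ip6_packet[0] >> 4 == 6                  # version nibble
    is_icmpv6 = ip6_packet[6] == NEXT_HEADER_ICMPV6    # Next Header field
    is_ra = ip6_packet[IPV6_HEADER_LEN] == ICMPV6_ROUTER_ADVERT
    return is_ipv6 and is_icmpv6 and is_ra


def fake_packet(icmp_type: int) -> bytes:
    """Build a dummy packet: 40-byte IPv6 header plus a 4-byte ICMPv6 stub."""
    header = bytes([0x60, 0, 0, 0, 0, 4, NEXT_HEADER_ICMPV6, 255]) + bytes(32)
    return header + bytes([icmp_type, 0, 0, 0])


if __name__ == "__main__":
    print(is_router_advertisement(fake_packet(134)))  # Router Advertisement
    print(is_router_advertisement(fake_packet(135)))  # Neighbor Solicitation
```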
The devices causing the problem were our Cisco ASAs. These are VPN/firewall/... appliances. We only use the VPN functionality, and all functionality not related to VPN has been deactivated on both devices.
But since most organisations that operate this kind of device use it primarily as a firewall appliance, router advertisements are (we suspect by default) turned on for the main network interface in order to route all traffic through the device. (In our case this did not work because the firewall/gateway functionality was deactivated entirely, so the packets sent by the servers were simply dropped.)
As a temporary fix we disabled the auto-learning of default routes from router advertisements (and removed the bad default routes on the affected servers).
$ sudo sysctl net.ipv6.conf.all.accept_ra=0
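Since the value that applies to a given interface is the per-interface one, it can be worth reading the setting back for every interface after changing it. The following Python sketch does that by reading the `accept_ra` files under `/proc/sys/net/ipv6/conf`. The function name `accept_ra_by_interface` is made up for this example, and the directory is a parameter only so that the function can be tried against a test directory.

```python
from pathlib import Path
from typing import Dict


def accept_ra_by_interface(conf_dir: str = "/proc/sys/net/ipv6/conf") -> Dict[str, int]:
    """Return the accept_ra value for every entry below conf_dir."""
    values = {}
    for entry in sorted(Path(conf_dir).iterdir()):
        setting = entry / "accept_ra"
        if setting.is_file():
            values[entry.name] = int(setting.read_text().strip())
    return values


if __name__ == "__main__":
    for name, value in accept_ra_by_interface().items():
        print(f"{name}: accept_ra={value}")
```

Any interface that still reports a non-zero value would keep accepting router advertisements.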
Deactivating router advertisements on the two ASA devices solved the problem without reconfiguring every server and without having to keep this in mind for future deployments. We are now looking into implementing RA Guard. :-)
PS: Instead of staring at the manpage, we shamelessly copied the tcpdump arguments from https://gist.github.com/hgn/383308615d8c96551afa.
Update 2017-10-31: In August 2017 we overhauled our virtualisation servers so that they now basically act as routers.
In this setup there is no single network bridge on the virtualisation server. Instead, every virtual machine has its own network interface on the VM host, and the host runs a BGP daemon, an NDP proxy and an ARP proxy.
Once a VM is booted up, the interface on the host system becomes active and the BGP daemon announces the IP addresses of the VM to the network.
This not only mitigates the problem (because rogue router advertisements are not proxied to the VMs) but also makes it possible to live-migrate VMs between the virtualisation hosts even without having "one unified" server network.
This saves a lot of money that would otherwise be spent on routers capable of Layer 2 VPN / EVPN, while VMs can still be live-migrated. There are also cases where you do not want to, or simply cannot, aggregate or renumber old subnets, so an L2VPN would not help anyway.
We plan to describe what we did in detail later in another blog article.