Post mortem of outages September 14th to 16th
Since Saturday we have experienced recurring issues with the routing in our network. This post lays out the timeline and the solution to the problem we encountered.
Saturday, September the 14th
Around 2 pm we moved some connections from one device to another in preparation for our move this week. This move is necessary because the building with our main server room will be gutted and rebuilt, starting next month and lasting until next year. To keep the main network from shutting down during this period, we will relocate part of our setup to another building. Shortly after these connections were changed, we started to experience minor issues: DNS stopped working, although the network underneath it still did. According to several externally operated monitoring devices, the LAN was down for around 2.5 hours (Stuttgart, Esslingen, Ludwigsburg). Internally it was another story, as all Selfnet services were no longer reachable and, for example, the wireless network was not working.
In the course of our debugging we rebooted several core components, which did not change anything. Reverting the connections we had changed did not solve the problem either. During this time we saw several error messages on our core switches indicating a routing problem, but we could not see where those problems originated.
In the following hours we searched for the problem, and around 6 pm we identified a possible origin: as soon as one of our three core switches (vaih-core-3) was attached to the network, the problems started to show up. After we isolated this core and relied on the existing redundancies to handle the load, the network became stable again, and we called it a day, as we were mentally exhausted at this point.
The wireless network only recovered a few hours later: the outage had broken the replication of the database system responsible for device identification, which caused our firewall to drop all packets going through the wireless network. Once the database replica was restored, the wireless network was functional again.
Later that evening we checked the logs and could not find anything significant there. We decided to update the core in question to a newer software version just to be sure, but this did not change anything.
Sunday, September the 15th
During the day we tried different configuration changes on the affected core, none of which resulted in any improvement. Later we tried to bring the device up bit by bit to narrow down what caused the problem. It turned out that the basic network functionality with OSPF (which we use to dynamically propagate the networks between devices) was fine; the problems only appeared once we enabled BGP (which we use to dynamically propagate the Service-IPs in our network). It did not matter which neighbour we allowed; the problems showed up as soon as a session was established. So we hoped we could run the device with only its basic network functionality for now, so as not to block the move the following week.
Monday, September the 16th
Around noon we started to populate the affected core switch with links again to test the theory from Sunday. We soon realized this would not work: at that point the core only had a default route towards its connected neighbours, so all traffic (internet and internal services) followed the same route. But if we reconnected the NAT/secondary uplink, the default route would point somewhere other than where the internal services were reachable. We would have had internet access, but no DNS or DHCP.
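To make the problem concrete, here is a minimal sketch of a longest-prefix-match lookup on a device that only holds a default route towards the uplink. The addresses and the next-hop name are invented for illustration and are not our actual addressing; the sketch only shows why internet traffic and traffic to an internal Service-IP would end up at the same place.

```python
import ipaddress

# Hypothetical forwarding table: only a default route, pointing at the
# NAT/secondary uplink. With BGP disabled there is no more-specific
# route for the internal Service-IPs.
routes = {
    ipaddress.ip_network("0.0.0.0/0"): "nat-uplink",
}

def lookup(destination):
    """Longest-prefix match over the table above."""
    dest = ipaddress.ip_address(destination)
    matches = [net for net in routes if dest in net]
    best = max(matches, key=lambda net: net.prefixlen)
    return routes[best]

print(lookup("198.51.100.10"))  # some internet host       -> 'nat-uplink' (fine)
print(lookup("10.0.0.53"))      # internal DNS Service-IP  -> 'nat-uplink' (wrong: DNS unreachable)
```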
During the evening we tried to find out why BGP was the problem. First we swapped the hardware with a spare device to see whether it was a hardware or a software problem. As the spare device showed the same behaviour, it had to be a problem with our configuration.
So we started to whittle the problem down piece by piece. The BGP session itself seemed to work; problems only appeared when routes were imported into the device. As the next step we imported only some of the routes instead of all at once, but again problems showed up after some time. The more routes we imported, the sooner they appeared, but they appeared in every case, even with a single route.
As we were somewhat out of ideas at this point, we stepped back and looked at the configuration as a whole. We noticed an OSPF export policy that looked odd: it consisted of a single statement with no conditions that accepted everything. Judging by its name, it was a leftover from some years back. After we tightened this policy and enabled BGP again, it... worked without problems? So we started connecting different systems again, and since we have not seen any problems yet, this seems to have been the issue.
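For readers unfamiliar with routing policies, the effect of such a policy can be pictured as a filter over candidate routes. The following sketch is only an illustration of the idea, not the actual device configuration (which is vendor CLI); the route attributes and prefixes are made up.

```python
# The faulty policy: one unconditional statement that accepts everything,
# so every route on the box, including Service-IP routes learned via BGP,
# gets exported into OSPF.
def faulty_export_policy(route):
    return True

# A restricted policy: only export what OSPF is actually meant to carry,
# e.g. directly connected or static infrastructure prefixes.
def restricted_export_policy(route):
    return route["protocol"] in ("direct", "static")

routes = [
    {"prefix": "10.0.0.0/16",   "protocol": "direct"},  # infrastructure network
    {"prefix": "192.0.2.53/32", "protocol": "bgp"},     # Service-IP learned via BGP
]

print([r["prefix"] for r in routes if faulty_export_policy(r)])      # both prefixes leak into OSPF
print([r["prefix"] for r in routes if restricted_export_policy(r)])  # only the infrastructure prefix
```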
So the underlying fault was the following: vaih-core-3 received the routes to our services via BGP and, due to the faulty policy statement, exported them into OSPF again. As OSPF routes are preferred over BGP routes on a device, the neighbouring cores saw those updates via OSPF and started sending all traffic for our services to the affected vaih-core-3. Vaih-core-3 in turn tried to forward that traffic to the next hop it had learned via BGP, which was reached through the very device it had just received the traffic from, so it sent the traffic right back. The other core looked up its table again and... sent it back once more. After a while this clogged up the processors of all devices involved, and everything started to break down.
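The resulting loop can be reproduced in a few lines. The sketch below uses an invented neighbouring core name (vaih-core-1), made-up next hops, and the common default protocol preferences (OSPF preferred over BGP); it illustrates the mechanism, not our actual routing tables.

```python
PREFERENCE = {"OSPF": 10, "BGP": 170}  # lower wins; typical vendor defaults, assumed here

# Candidate routes for one Service-IP after the leak (next hops simplified):
tables = {
    # the neighbouring core prefers the leaked OSPF route pointing at vaih-core-3
    # over its own BGP route towards the real service host
    "vaih-core-1": [("OSPF", "vaih-core-3"), ("BGP", "service-host")],
    # vaih-core-3 still forwards according to its BGP route, whose next hop
    # is reached back through vaih-core-1
    "vaih-core-3": [("BGP", "vaih-core-1")],
}

def next_hop(router):
    protocol, hop = min(tables[router], key=lambda r: PREFERENCE[r[0]])
    return hop

hop, ttl = "vaih-core-1", 6
while hop in tables and ttl:
    print(f"{hop} -> {next_hop(hop)}")
    hop, ttl = next_hop(hop), ttl - 1
# vaih-core-1 -> vaih-core-3 -> vaih-core-1 -> ...: the packet ping-pongs
# between the two cores until its TTL expires, and doing this for all
# service traffic eventually overloads the CPUs of the devices involved.
```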
The remaining question is: why did the change on Saturday trigger the problem? The underlying configuration fault had been present for years and did not show up until now...