Post mortem of outages 2024-09-14 to 16

Since Saturday we have experienced recurring issues with the routing in our network. This post lays out the timeline and the solution to the problem we encountered.

Saturday, September the 14th

At around 2 pm we moved some connections from one device to another in preparation for our move this week. The move is needed because the building housing our main server room will be gutted and rebuilt, starting next month and lasting into next year. To prevent the main network from shutting down during this period, we will move part of our setup to another building. Shortly after these connections were changed, we started to see minor issues: DNS was not working, while the underlying network still was. According to several externally operated monitoring devices, the LAN was down for around 2.5 hours (Stuttgart, Esslingen, Ludwigsburg). Internally it was another story: none of the Selfnet services were reachable any more and, for example, the wireless network was not working. In the course of debugging we rebooted several core components, which did not change anything. Reverting the connections we had changed did not solve the problem either. During this time we saw several error messages on our core switches indicating a routing problem, but we could not see where those problems originated.

In the following hours we searched for the cause, and at around 6 pm we identified a possible origin. As soon as one of our three core switches (vaih-core-3) was attached to the network, the problems started to show up. After we isolated this core and relied on the existing redundancy to handle the load, the network became stable again, and we called it a day as we were mentally exhausted at that point.

The wireless network only recovered a few hours later, because the network outage had broken the replication of the database system responsible for device identification. As a result, our firewall dropped all packets going through the wireless network. After the database replica was recovered, the wireless network was functional again.

Later that evening we checked the logs and could not find any major problems there. We decided to update the core in question to a newer software version just to be sure, but this did not change anything.

Sunday, September the 15th

During the day we tried different config changes on the affected core, none of which brought any improvement. Later, we tried bringing the device up piece by piece to find out what caused the problem. It turned out that the basic network functionality with OSPF (which we use to dynamically propagate the networks between devices) was fine, and problems only started when we enabled BGP (which we use to dynamically propagate the service IPs in our network). It did not matter which node we allowed to connect; the problems started as soon as a connection was established. So we hoped we could use only the basic network functionality of this device for now, so as not to block the move the following week.

Monday, September the 16th

Around noon we started to populate the affected core switch with links again to test the theory from Sunday. We soon realized it would not work. At this point the core only had a default route to its connected neighbours, so all traffic (internet and internal services) would take the same route. But if we connected the NAT/secondary uplink again, the default route would come from a different point than the one through which the internal services were reachable. So we would have Internet, but no DNS or DHCP.
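The effect described above can be illustrated with a small longest-prefix-match lookup. This is only a sketch under assumptions: the addresses are documentation prefixes, and the next-hop names `neighbour-core` and `nat-uplink` are made up for illustration, not taken from our configuration.

```python
import ipaddress

def lookup(table, destination):
    """Return the next hop of the most specific prefix matching destination."""
    addr = ipaddress.ip_address(destination)
    matches = [(net, hop) for net, hop in table if addr in net]
    if not matches:
        return None
    # Longest prefix wins, as in ordinary IP forwarding.
    return max(matches, key=lambda m: m[0].prefixlen)[1]

# Hypothetical addresses: an internal DNS service and some host on the Internet.
DNS = "192.0.2.53"
INTERNET = "203.0.113.10"

default = ipaddress.ip_network("0.0.0.0/0")

# Without BGP the core only knows a default route via its neighbours.
default_via_core = [(default, "neighbour-core")]

# With the uplink attached, the default route points at the uplink instead.
default_via_uplink = [(default, "nat-uplink")]

# What BGP would normally add: a specific route for the service IP.
with_service_route = default_via_uplink + [
    (ipaddress.ip_network("192.0.2.53/32"), "neighbour-core"),
]

for name, table in [
    ("default via core", default_via_core),
    ("default via uplink", default_via_uplink),
    ("uplink + service route", with_service_route),
]:
    print(name, "| DNS ->", lookup(table, DNS), "| Internet ->", lookup(table, INTERNET))
```

With only a default route, service traffic follows the Internet traffic wherever the default points, which is why the uplink could not simply be reattached without the specific service routes.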

During the evening, we tried to find out why BGP was the problem. First we swapped the hardware for a spare device, just to see whether it was a hardware or a software problem. As the spare device produced the same problem, it had to be a problem with our configuration.

So we started to whittle down the problem piece by piece. The BGP connection itself seemed to work. Problems only showed up when routes were imported into the device. As the next step, we imported only a subset of the routes instead of all at once. But again the problems started after some time. If we imported more IPs, they started sooner than if we imported only one route - but they started in every case.

As we were somewhat clueless at this point, we stepped back and looked at the configuration as a whole. That is when we noticed an OSPF export policy which looked funny. It had only a single statement, without conditions, accepting everything. Judging by the policy name, it was a leftover from some years back. After we restricted this policy some more and enabled BGP it... - worked without problems? So we started connecting different systems again, and as we don't see any problems yet, it seems to have been the issue.
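To make the difference between the two policies concrete, here is a minimal sketch in Python rather than in device configuration. Everything in it is an assumption for illustration: the `Route` type, the prefixes, and in particular the restricted policy, which simply accepts directly connected networks only. The post does not state what the real restriction looks like.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Route:
    prefix: str
    protocol: str  # how this device learned the route, e.g. "direct" or "bgp"

def accept_everything(route):
    """Leftover-style policy: one statement, no conditions, always accept."""
    return True

def accept_direct_only(route):
    """Hypothetical restricted policy: only export directly connected networks."""
    return route.protocol == "direct"

def export_to_ospf(routing_table, policy):
    """Return the prefixes the device would announce into OSPF under a policy."""
    return [r.prefix for r in routing_table if policy(r)]

# Made-up routing table: one connected network, one service IP learned via BGP.
routing_table = [
    Route("198.51.100.0/24", "direct"),
    Route("192.0.2.53/32", "bgp"),
]

print(export_to_ospf(routing_table, accept_everything))
print(export_to_ospf(routing_table, accept_direct_only))
```

Under the unconditional policy the BGP-learned service route leaks into OSPF; under the restricted one it stays out.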

So the underlying fault was the following: Vaih-Core-3 received the routes to our services via BGP and, due to the faulty policy statement, exported them into OSPF again. As OSPF is preferred over BGP on a device, the neighbouring cores saw those updates via OSPF and started to send all traffic for our services to the affected Vaih-Core-3. Vaih-Core-3 itself tried to forward the traffic to the IP it had learned via BGP - which was the device it had just received the traffic from - and so sent it back. The other core again looked up its table and... - sent it right back. After a while, this clogged up the processors of all devices involved and everything started to break down.
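The ping-pong described above can be reproduced in a toy model. This is a simplified sketch, not our real topology: `core-a` stands in for a neighbouring core with the service directly behind it, `core-3` for the affected device, and the preference numbers are illustrative assumptions where the lower value wins.

```python
from dataclasses import dataclass

# Lower value wins. The numbers are illustrative assumptions.
PREFERENCE = {"ospf": 150, "bgp": 170}

@dataclass(frozen=True)
class Route:
    protocol: str
    next_hop: str

def best_route(candidates):
    """Pick the route whose protocol has the lowest preference value."""
    return min(candidates, key=lambda r: PREFERENCE[r.protocol])

def trace(tables, start, max_hops=8):
    """Follow next hops towards the service until delivery or the hop limit."""
    path = [start]
    node = start
    for _ in range(max_hops):
        hop = best_route(tables[node]).next_hop
        path.append(hop)
        if hop == "service":
            return path, True
        node = hop
    return path, False

# Healthy state: both cores know the service route via BGP only.
healthy = {
    "core-a": [Route("bgp", "service")],
    "core-3": [Route("bgp", "core-a")],
}

# Faulty state: core-3 re-exports the service route into OSPF, so core-a
# also sees an OSPF route via core-3 and prefers it over its BGP route.
faulty = {
    "core-a": [Route("bgp", "service"), Route("ospf", "core-3")],
    "core-3": [Route("bgp", "core-a")],
}

print(trace(healthy, "core-a"))
print(trace(faulty, "core-a"))
```

In the faulty state the trace never reaches the service and only stops because of the hop limit, which plays the role of the TTL here.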

The remaining question is: why did the change on Saturday trigger the problem? The underlying configuration problem had been present for years and did not show up until now...

posted at 00:00 by michaelm · Blog · Netzwerk Technologie

Expansion of the Selfnet network in downtown Stuttgart (Part I)

When Selfnet was founded more than 17 years ago, this was only possible thanks to a large number of guarantees, donations, loans and personal support. Back then, probably nobody would have thought that this association of students would one day become such a large undertaking.

By now, those students have joined various technology companies. Thanks to the generations of students that followed, Selfnet has become one of the largest student networks in Germany. Selfnet is still funded solely by donations, membership fees and the volunteer work of its members, without any outside influence. This allows us to realize things in our network, both technically and in terms of ideals, that you would never find in an industrial environment.

This "we want all students to have a good network connection" mentality, which persists to this day, has made a lot possible. In recent years, several large dormitories in downtown Stuttgart joined our network, among them Heilmannstraße 3-7 (190 rooms) and Rosensteinstraße 1-5 (346 rooms), both of which could also be given full Wi-Fi coverage at the same time. After long discussions with the board and the general assembly, as well as negotiations with the Studierendenwerk Stuttgart, we can now take a further step.

A few weeks ago we signed a contract with the Studierendenwerk that allows us to connect further dormitories in the city center to our high-performance network in the coming months. The dormitories in question are:

Neckarstraße 132
Shared flats in the Studierendenhotel (Neckarstr. 172)
Rieckestraße
Kernerstraße
Landhausstraße
Anna-Herrigel-Haus (after the core refurbishment in February 2018)

The switchover of the dormitories is expected to take place between September and October. Unfortunately, due to various factors, an exact date for each dormitory cannot be fixed. Residents will of course be notified again before the work begins.

The dormitories will be connected to our existing network via fiber, so that in future the network speed per room can be raised to 1 Gbit/s even in these relatively small dormitories. Because of the relatively high investment costs (around €200,000 in total), we cannot set up Wi-Fi coverage there for the time being, but we will of course try to do so in the coming years.

In the future we also want to try to connect the remaining dormitories in Stuttgart. Whether, when and how quickly this happens also depends on whether we get support from residents on site. As our projects are, as always, carried out by volunteers, we are also looking for people on site who would like to help with planning and implementation. If you want to help us bring your ideas to life in your own or other dormitories: just drop by at the end of one of our office hours and let us show you everything, with no obligation at all. We are always looking for new and motivated members. It does not matter whether you are enthusiastic about technology or not. Curiosity is the only important prerequisite.

posted at 00:00 by michaelm · Blog · network club fiber routing

Selfnet Blog

Sep 17, 2024