zerosleeps

Since 2010

Dead (but not dead) Ubiquiti UniFi Security Gateway

This is the story of my USG, which has failed in the weirdest way. The summary, which is repeated at the end of this post, is that after years of flawless operation it suddenly stopped communicating with the rest of my LAN via one particular port, but it can still send data on that port, has other ports which function perfectly, and the problem is definitely not software-related.

This isn’t a how-to or anything, because I can’t remember every command I typed or setting I changed, but I’m putting the general gist out there because it might help some poor sod someday. If you’re hoping for a solution, stop reading, but if you know anything about network equipment, please send help. I sent out a couple of cries for assistance about this on Ubiquiti’s community forums and on reddit, but crickets.

Basic troubleshooting

So one afternoon we suddenly lost internet connectivity. First thing I did was jump onto my UniFi Controller to see where the problem lay, and it told me that the USG had failed to adopt. Since the device was previously working fine, that status was UniFi saying “it’s disappeared from your network”.

I can honestly say that in three years of running Ubiquiti equipment I’ve never had to reboot or reset anything. In fact, it was because I got so fed up with flaky crappy all-in-one consumer “routers” that I went all-in on Ubiquiti in the first place. Anyway, that’s where I started - pull power from the USG, count to 10, plug it back in.

All indications were good: all the status lights and Ethernet link lights illuminated, flashed, and changed colour as they should, but the Controller still couldn’t find the USG.

It’s worth pointing out that as well as routing traffic between WAN and LAN and performing firewall duties the USG also provides DHCP services, so LAN traffic was starting to fail as well: new clients couldn’t join the network and existing DHCP leases couldn’t be renewed. Fabulous.

I tried a few reboots of the entire network stack with no change, so I dug out a paper-clip and did a hardware reset of the USG. Again, all the lights did the correct thing, and the USG appeared ready to adopt, except… it still didn’t appear to be connected to the network. Sure it was physically connected - OSI layers 1 and 2 looked good - but nothing above that. After a reset of that nature the USG sets it’s LAN port to 192.168.1.1/24 and runs a DHCP server, so at this point it should have been dishing out DHCP leases and routing traffic, even without adoption.

Corrupt storage?

I then ended up on a path to nowhere. I discovered that the USG boots from a USB drive tucked away inside it. There were lots of grumbles online about these internal drives failing. I was reasonably confident that wasn’t my problem, because my USG was booting. Well the status lights certainly indicated it was booting, and if the boot drive had failed it would have been more obvious, but with no way of communicating with the device via Ethernet I didn’t know for sure. So I replaced the internal drive, farted about with dd and .img files and blah blah blah. Didn’t make the blindest bit of difference so I’m not going to dwell on this part of story. Sure did learn a lot about USB drive initialisation times and Octeon boot commands though…

There’s nothing wrong with this thing

I bought a rollover cable from eBay. The USG has a console port, and with no packet data it was my only hope of making progress. The cable arrived, and immediately revealed that the USG was indeed properly booting, all services were running, and it correctly reported eth1 link status. It was also connected to the WAN (the USG is a regular DHCP client on that side of the network) and able to communicate with the outside world.

What the hell? At this point I’m starting to suspect a bad port, despite positive link status. But that can’t be a thing that happens, can it? Let’s see if we can confirm that:

As well as console, WAN, and LAN ports, the USG has a third port which is labelled “WAN 2 / LAN 2”, and it shows up in software as just another Ethernet interface. I mucked about for a while trying to work out how to configure this port to assume the duties of eth1. To Ubiquiti’s credit, once you understand a few of EdgeOS’s basic commands it’s actually pretty enjoyable to view the device’s configuration settings and change them.

To my surprise I managed to set up LAN 2 and disable LAN 1, and it worked. Connecting a device to LAN 2 immediately resulted in a DHCP offer, and all traffic flowed as it should. As it did before the start of this story. With the same attached devices and cables!

So there we are: bad port. Well, yes, I’m 90% sure about that, but there’s more…

(By this stage I’d already bought a new USG, and there was no way I was going to put the broken USG back into service. I can’t trust it, and I’m pretty sure the UniFi Controller will fight me at every step of the way if I try to use LAN2 instead of LAN1.)

Can speak but not hear

While I was poking about via console during all of the above, I discovered that I could capture traffic flowing through the USG’s ports using a command like show interfaces ethernet eth1 capture, and I could see some activity being logged. Surely if the port was dead nothing would work, or it would fail in some other obvious way: no link light, no traffic, or errors in a log perhaps, but as far as I could tell the USG was reporting all systems green. Remember I mentioned the OS was even correctly reporting link status in addition to this trickle of data I was now seeing.

I cracked out Wireshark, and it’s at this point that I gave up. This makes no sense:

I could see that upon connecting my MacBook to the USG’s LAN 1 port, DHCP discover requests were being sent from my Mac, but the USG’s capturing tool didn’t show them ever being received. But what the USG did capture were it’s own outgoing UniFi discovery requests and they were being received by my Mac. Here’s a screenshot from Wireshark (I’ve filtered out some irrelevant junk caused by my Mac sending out mDNS and ARP probes):

And here’s the corresponding USG capture:

ubnt@ubnt:~$ show interfaces ethernet eth1 capture
Capturing traffic on eth1 ...
10:27:21.461396 IP 192.168.1.1.51602 > 255.255.255.255.10001: UDP, length 145
10:27:31.682842 IP 192.168.1.1.51456 > 255.255.255.255.10001: UDP, length 145
10:27:41.915718 IP 192.168.1.1.59045 > 255.255.255.255.10001: UDP, length 145
10:27:52.156490 IP 192.168.1.1.41723 > 255.255.255.255.10001: UDP, length 145
10:28:02.381654 IP 192.168.1.1.47105 > 255.255.255.255.10001: UDP, length 145
10:28:12.599347 IP 192.168.1.1.41440 > 255.255.255.255.10001: UDP, length 145
10:28:22.819994 IP 192.168.1.1.54372 > 255.255.255.255.10001: UDP, length 145

The purple lines from the screenshot are my Mac’s DHCP requests, which never show up in the USG’s capture, but the 7 yellow lines match up perfectly with the 7 generated and captured by the USG.

So, to summarise, I have an Ethernet device with factory-supplied software and settings, no software related issues, warnings, or errors, which suddenly stopped receiving packet data on one port, but can still negotiate a link and send data on that port, and has other ports which function perfectly.

🤯

Please, if anyone reading this has any idea how this kind of failure is possible, get in touch.