At about 7pm yesterday, one of my servers in Cheltenham stopped responding to the once-per-minute availability check. And I was unable to log on to that server from here. But if I first logged in to another server in Cheltenham, I could reach the failed server from there.
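A once-per-minute check like that can be as simple as a TCP reachability probe. Here's a minimal sketch in Python; the server names in the usage comment are hypothetical, and my actual check may well work differently:

```python
import socket

def is_reachable(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refusal, timeout, unreachable network, DNS failure.
        return False

# Hypothetical usage, run once a minute from cron or similar:
# for server in ("server1.example.com", "server2.example.com"):
#     if not is_reachable(server):
#         print(f"ALERT: {server} is not responding")
```

A probe like this would have flagged exactly the failure described above: the routing was dead, so the connection attempt times out and the check reports the server as down.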
Which meant that it was some sort of routing problem. But what caused it, and how should I deal with it?
First, I rebooted the server, because that's easy to do and often fixes problems. That didn't help. Then I tried resetting the routing table, but that didn't help either. Eventually, I had the idea of running a traceroute to the server: it bombed out before it even reached my firewall, which meant that the problem was outside my control. But I had a customer-facing server not working!
I reported the problem to the hosting company so they could start working on it. Meanwhile, I used the NAT (network address translation) capability of the firewall to map the server to a different external IP address. That also meant editing the DNS zone files and re-propagating them, so that external computers could find the server at its new address.
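Re-pointing the DNS amounts to changing the server's A record and bumping the zone's serial number so that secondary nameservers pick up the change. A minimal BIND-style zone fragment, with hypothetical names and documentation-range addresses standing in for the real ones:

```
$TTL 300                ; short TTL so the change propagates quickly
@    IN  SOA ns1.example.com. hostmaster.example.com. (
         2024060502     ; serial - must be incremented on every edit
         3600           ; refresh
         900            ; retry
         604800         ; expire
         300 )          ; negative-caching TTL
www  IN  A   203.0.113.25   ; was 203.0.113.10 before the NAT re-mapping
```

If the serial isn't incremented, the secondaries never transfer the updated zone, and the outside world keeps resolving the dead address.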
And that all worked.
Today, I found out what the problem was. The hosting company monitors the packet volume to each IP address, and that IP address was getting a lot. I mean, a *LOT*. They interpreted that as some sort of attack and killed the routing to the server. Actually, it wasn't an attack; it was a rather greedy customer.
I can understand why they killed the routing, but (as I told them today) they should also have notified me that they'd done so. That would have saved me several hours of working out what the problem was, and I'd guess that some people running servers wouldn't have been able to work it out at all.
Everything is OK now. And they've told me that they'll disable the route-killer for my IP addresses.