Network troubleshooting conundrum - next steps?

Murph7355

Original Poster:

37,714 posts

256 months

Monday 25th January 2021
outnumbered said:
That is an IPv6 address, but it's more likely you're getting dual stack (IPv4 and IPv6) rather than IPv6 only. IPv6 connectivity is typically preferred if it's present, which is why you see that at whatsmyip.
It wasn't initially getting an IPv4 address but the guys at EE gave me an alternate APN to try and it gave one. Unfortunately the situation remains the same as, I suspect, the IPv6 one is being preferred as you note.

The 3 sim I have does not give any IPv6 address at all.

Nor does the JL Broadband ADSL.

So I'm pretty sure this is where the trouble is :)

I'm not at all familiar with IPv6 addressing or routing, so am starting from a low bar on that front. I'm wondering if there's something I can change on my LAN side network to make it understand IPv6? Though would have thought the Huawei router would handle that translation??

bhstewie said:
Can you get a ten quid SIM from some other network?

Feels like something fky by the carrier.
My 3 sim works fine as above...would like to get the EE one working ideally as it has 4x better download (moot if I can't get the bugger to work I know!) and costs about the same with the discounts I get.

There is also the "don't want to be defeated by a white plastic box" syndrome :)

Bikerjon said:
What model 4G router is it? Some have a WAN 2 connection that you can hook up a DSL modem to and then have the 4G router do the failover? I realise this removes the USG from the equation, but it also makes a simpler setup while still allowing failover.
It's a Huawei B818.

I did take the USG out of the equation. The problem I then had is that not all devices were registering correctly with the B818 as a router. I'm not exactly sure why, but that was causing bigger issues at home than the slower internet connection. So I switched it back.

I suspect it may have been something to do with all devices on the network already having IP addresses/lease times and a new DHCP server coming in to handle things.

I may give this another go this weekend (during the week is not the best time to potentially take everything down :) ). Not wedded to the USG, but it seemed like a decent solution to what I needed to do.

All input appreciated.


theboss

6,913 posts

219 months

Monday 25th January 2021
I run a similar setup through a USG with a pair of consumer routers on each WAN port in turn routing to a fixed wireless provider on one and Voda LTE on the other. The USG acquires DHCP leases on both WAN interfaces so there is double NAT taking place on the wireless link and triple (taking CGNAT into account) on LTE.

I haven’t had any major problems but found this configuration more stable than when I either terminated the Wireless PPPoE connection on the USG or set the LTE router to bridge/pass through mode.

Are you sure the EE SIM is intended to support tethering / multiple devices?

Murph7355

Original Poster:

37,714 posts

256 months

Monday 25th January 2021
theboss said:
I run a similar setup through a USG with a pair of consumer routers on each WAN port in turn routing to a fixed wireless provider on one and Voda LTE on the other. The USG acquires DHCP leases on both WAN interfaces so there is double NAT taking place on the wireless link and triple (taking CGNAT into account) on LTE.

I haven’t had any major problems but found this configuration more stable than when I either terminated the Wireless PPPoE connection on the USG or set the LTE router to bridge/pass through mode.

Are you sure the EE SIM is intended to support tethering / multiple devices?
So it sounds like your set up is the same as mine.

Does your Voda sim give you an IPv6 address as its main address?

The EE sim can be used for tethering/multiple devices (the only constraint is a 1TB/month fair usage "limit"). It actually works if I take the USG out of the equation and plug the 4G router straight into the network. When I tried that config, however, a number of devices wouldn't connect to the network at all (which I have a feeling was down to some sort of DHCP conflict).

I probably need to check that config again this weekend (the connections are in heavy use during the week which limits the buggering about I can do) to make sure I wasn't imagining it. And then possibly see if I can make the general network DHCP/routing piece work for me on the Huawei and ditch the USG.

theboss

6,913 posts

219 months

Monday 25th January 2021
Murph7355 said:
So it sounds like your set up is the same as mine.

Does your Voda sim give you an IPv6 address as its main address?

The EE sim can be used for tethering/multiple devices (the only constraint is a 1TB/month fair usage "limit"). It actually works if I take the USG out of the equation and plug the 4G router straight into the network. When I tried that config, however, a number of devices wouldn't connect to the network at all (which I have a feeling was down to some sort of DHCP conflict).

I probably need to check that config again this weekend (the connections are in heavy use during the week which limits the buggering about I can do) to make sure I wasn't imagining it. And then possibly see if I can make the general network DHCP/routing piece work for me on the Huawei and ditch the USG.
The WAN interface in my LTE router gets a private IPv4 address on Voda's network behind their CGNAT gateway.

Whether I connect a client directly to the LTE modem or via the USG, I'm able to route traffic over the link.
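To tell which kind of address a WAN interface has been handed, a quick check along these lines can help. It is only a sketch using Python's standard ipaddress module: the function name classify and the labels it returns are made up for illustration, and the ranges are the RFC 1918 private blocks plus the 100.64.0.0/10 shared range that carriers commonly use for CGNAT.

```python
import ipaddress

# RFC 1918 private ranges, plus the RFC 6598 shared range often used for CGNAT.
RFC1918 = [
    ipaddress.ip_network(n)
    for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]
CGNAT = ipaddress.ip_network("100.64.0.0/10")


def classify(addr):
    """Label an address (hypothetical helper; the labels are illustrative only)."""
    ip = ipaddress.ip_address(addr)
    if ip.version != 4:
        return "ipv6"
    if any(ip in net for net in RFC1918):
        return "rfc1918-private"
    if ip in CGNAT:
        return "cgnat-shared"
    return "other"
```

For example, classify("192.168.8.2") gives "rfc1918-private", which is what you would expect on a USG WAN port sitting behind a consumer LTE router. An address of either private kind on the LTE router's own WAN side suggests the carrier is doing NAT upstream.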

I use a Netgear M1 which is a handy sized travel router with a good battery length etc. and I've designed the network so I can whip that out and take it away with me (thus avoiding paying for an extra SIM) without affecting any services at home, providing the Wireless ISP connection stays up. Normally I split outbound connections 50/50 using the weighted LB option on the USG.

How have you configured the USG WAN interface connecting to the Huawei? I would disable IPv6 entirely and try configuring a static IP if you've been trying with DHCP assigned, or vice versa. Also is the Draytek connected during this testing, or disconnected? Is it configured for Failover only, as a backup, or weighted LB? I would disconnect it until you can get the EE connection working as expected via USG.

It should be possible to get this to work and there's value in having the USG doing the dual WAN routing if you're invested in the wider Unifi stack.

Edited by theboss on Monday 25th January 17:30

Murph7355

Original Poster:

37,714 posts

256 months

Monday 25th January 2021
theboss said:
...

How have you configured the USG WAN interface connecting to the Huawei? I would disable IPv6 entirely and try configuring a static IP if you've been trying with DHCP assigned, or vice versa.

It should be possible to get this to work and there's value in having the USG doing the dual WAN routing if you're invested in the wider Unifi stack.
Originally via DHCP.

Thought the same as you, so gave it a static IP in the right reserved range. No dice still. Also bound the IP to the gateway's WAN MAC. No dice.

I am invested in the Unifi stack (PoE switch, APs etc)...which is part of the reason I'm keen to get this to work.

The supplier of the USG are looking into it too - unfortunately we're all on the "it should work" path at the moment.

The USG did throw a wobbler not long after first install (without the EE WAN interface on it) whereby it disconnected, then endlessly cycled through restarting and disconnected. No idea what caused it. I've asked the vendor if they think it may have a fault.

The USG evidently isn't happy with the IP traffic coming from the router with the EE sim in. Hopefully it won't come down to ditching EE or the USG, but that's feeling like the option that might present itself.

theboss

6,913 posts

219 months

Monday 25th January 2021
Sorry only just went through the traceroute and what you said about not showing the LTE router hop on the EE SIM is very telling.

It sounds to me like the USG is downing the LTE path. When you enable a second WAN, it ascertains path availability by pinging the host configured for latency monitoring. By default this is ping.ubnt.com (or similar), but you can set it to a host of your choosing.

With the EE SIM in the router, can you SSH into the USG and run the following commands:

show load-balance status
show load-balance watchdog

This is what I get when both of mine are up - you are looking for an UNREACHABLE which indicates that the USG has administratively disabled the link and thus removed it from its routing table.

eth0
status: Running
pings: 999
fails: 5
run fails: 0/3
route drops: 1
ping gateway: 1.0.0.1 - REACHABLE
last route drop : Mon Jan 25 15:26:12 2021
last route recover: Mon Jan 25 15:29:07 2021

eth2
status: Running
pings: 10268
fails: 23
run fails: 0/3
route drops: 0
ping gateway: 1.0.0.1 - REACHABLE

Chozza

808 posts

152 months

Monday 25th January 2021
Slightly off-topic.

Don't use 8.8.8.8 as a destination for your tracert.

It is an "anycast" address, which works similarly to a CDN, in that the address can be advertised in multiple places. So the server you reach at 8.8.8.8 from one device is not necessarily in the same network location as the one you reach from another network.

It can give you misleading results (yours looks like a more local issue...)

(RFC 4786 for the geeks)


Murph7355

Original Poster:

37,714 posts

256 months

Monday 25th January 2021
theboss said:
Sorry only just went through the traceroute and what you said about not showing the LTE router hop on the EE SIM is very telling.

It sounds to me like the USG is downing the LTE path. When you enable a second WAN, it ascertains path availability by pinging the host configured for latency monitoring. By default this is ping.ubnt.com (or similar), but you can set it to a host of your choosing.

With the EE SIM in the router, can you SSH into the USG and run the following commands:

show load-balance status
show load-balance watchdog

This is what I get when both of mine are up - you are looking for an UNREACHABLE which indicates that the USG has administratively disabled the link and thus removed it from its routing table.

eth0
status: Running
pings: 999
fails: 5
run fails: 0/3
route drops: 1
ping gateway: 1.0.0.1 - REACHABLE
last route drop : Mon Jan 25 15:26:12 2021
last route recover: Mon Jan 25 15:29:07 2021

eth2
status: Running
pings: 10268
fails: 23
run fails: 0/3
route drops: 0
ping gateway: 1.0.0.1 - REACHABLE
With the 3 sim in:

eth0
status: Running
pings: 2703
fails: 14
run fails: 0/3
route drops: 17
ping gateway: ping.ubnt.com - REACHABLE
last route drop : Mon Jan 25 14:19:18 2021
last route recover: Mon Jan 25 14:19:42 2021

eth2
status: Running
failover-only mode
pings: 7051
fails: 3
run fails: 0/3
route drops: 4
ping gateway: ping.ubnt.com - REACHABLE
last route drop : Mon Jan 25 02:09:59 2021
last route recover: Mon Jan 25 02:10:40 2021

With the EE sim in:

eth0
status: Waiting on recovery (0/3)
pings: 2718
fails: 17
run fails: 5/3
route drops: 18
ping gateway: ping.ubnt.com - DOWN
last route drop : Mon Jan 25 21:57:27 2021
last route recover: Mon Jan 25 14:19:42 2021

eth2
status: Running
failover-only mode
pings: 7243
fails: 4
run fails: 0/3
route drops: 4
ping gateway: ping.ubnt.com - REACHABLE
last route drop : Mon Jan 25 02:09:59 2021
last route recover: Mon Jan 25 02:10:40 2021

and

interface : eth0
carrier : up
status : inactive
gateway : 192.168.8.1
route table : 201
weight : 0%
flows
WAN Out : 142000
WAN In : 49
Local Out : 378

interface : eth2
carrier : up
status : active
gateway : 192.168.1.1
route table : 202
weight : 100%
flows
WAN Out : 33949
WAN In : 0
Local Out : 118
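The watchdog output above can also be checked mechanically rather than by eye. The following is a rough sketch, not anything the USG provides: the function names parse_watchdog and down_links are invented here, and the sample text is simply copied from the EE output in this post.

```python
import re

# Sample text copied from the "With the EE sim in" watchdog output.
SAMPLE = """\
eth0
status: Waiting on recovery (0/3)
pings: 2718
fails: 17
run fails: 5/3
route drops: 18
ping gateway: ping.ubnt.com - DOWN
last route drop : Mon Jan 25 21:57:27 2021

eth2
status: Running
failover-only mode
pings: 7243
fails: 4
run fails: 0/3
route drops: 4
ping gateway: ping.ubnt.com - REACHABLE
"""


def parse_watchdog(text):
    """Return {interface: {field: value}} from pasted watchdog output."""
    links = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if re.fullmatch(r"(eth|pppoe)\d+", line):
            # A bare interface name starts a new section.
            current = links.setdefault(line, {})
        elif current is not None and ":" in line:
            # Split on the first colon only, so timestamps stay intact.
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return links


def down_links(links):
    """List interfaces whose monitored ping target is reported as DOWN."""
    return [
        name
        for name, fields in links.items()
        if fields.get("ping gateway", "").endswith("DOWN")
    ]
```

Calling down_links(parse_watchdog(SAMPLE)) picks out eth0, which matches the "inactive" status and 0% weight shown for eth0 in the load-balance status output.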

theboss

6,913 posts

219 months

Monday 25th January 2021
Looks like pings to Ubiquiti’s host are failing intermittently over EE, which is causing the route availability to flap up and down.

I have had exactly the same with my WISP (not Voda).

I would confirm this by plugging a laptop into the EE modem and setting a continual ping against ping.ubnt.com, observing the responses, which we predict will fail frequently.

You’d then want to know if this is down to actual packet loss on a particular route to Ubiquiti’s server or whether EE are just dropping ICMP for anything. So try some others, e.g. 8.8.8.8 for Google DNS.

You are looking to find a host you can reliably ping on this connection as that’s what the USG needs to do in order to monitor each link.
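If it helps, the comparison can be scripted from the laptop. This is a minimal sketch under a few assumptions: it shells out to the system ping with Linux-style flags (-c for count, -W for timeout in seconds; Windows and macOS differ), and the function names ping_once, loss_rate and rank_hosts are made up for illustration.

```python
import subprocess


def ping_once(host, timeout_s=2):
    """Send one ICMP echo via the system ping (Linux-style flags assumed)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def loss_rate(host, attempts=20, probe=ping_once):
    """Fraction of probes that got no reply (0.0 = perfect, 1.0 = all lost)."""
    failures = sum(0 if probe(host) else 1 for _ in range(attempts))
    return failures / attempts


def rank_hosts(hosts, attempts=20, probe=ping_once):
    """Return (host, loss) pairs, most reliable first."""
    results = [(h, loss_rate(h, attempts, probe)) for h in hosts]
    return sorted(results, key=lambda pair: pair[1])
```

Something like rank_hosts(["ping.ubnt.com", "8.8.8.8", "1.0.0.1"]) run while plugged straight into the EE router would show whether every target loses pings or only the Ubiquiti one. The probe argument exists so the ranking logic can be tried without touching the network.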

Murph7355

Original Poster:

37,714 posts

256 months

Tuesday 26th January 2021
theboss said:
Looks like pings to Ubiquiti’s host are failing intermittently over EE, which is causing the route availability to flap up and down.

I have had exactly the same with my WISP (not Voda).

I would confirm this by plugging a laptop into the EE modem and setting a continual ping against ping.ubnt.com, observing the responses, which we predict will fail frequently.

You’d then want to know if this is down to actual packet loss on a particular route to Ubiquiti’s server or whether EE are just dropping ICMP for anything. So try some others, e.g. 8.8.8.8 for Google DNS.

You are looking to find a host you can reliably ping on this connection as that’s what the USG needs to do in order to monitor each link.
Ahhhh. So the link could well actually be working, but because the USG works this out by using ping responses, if the ping responses fail then it drops the link and cuts to failover?

Presumably if the failover link isn't there it still doesn't bother routing the traffic?

theboss

6,913 posts

219 months

Tuesday 26th January 2021
Murph7355 said:
Ahhhh. So the link could well actually be working, but because the USG works this out by using ping responses, if the ping responses fail then it drops the link and cuts to failover?

Presumably if the failover link isn't there it still doesn't bother routing the traffic?
Yes, the link can be online and passing traffic, but if the USG's route monitor fails because pings are getting dropped above a certain threshold then the link will be deemed unusable by USG and dropped from outbound routing.

That's why you aren't seeing 192.168.8.1 appear as your next hop on the traceroute, and it's reflected in the info you posted: eth0 (WAN1) is shown as inactive from a routing perspective because a certain number of pings to ping.ubnt.com had been dropped.

You can change the settings on the USG i.e. designate a different host to ping, as well as the interval and pass/fail thresholds. This stuff isn't exposed in the GUI though meaning you have to do so via SSH and then place the settings into a config json stored on the controller and delivered to the USG every time it gets provisioned, otherwise you revert to the GUI settings on restart.

I don't know exactly what it does when WAN2 is unplugged, but I have a feeling this route monitoring behaviour is always active when you have a WAN2 / second internet connection defined.

Try removing the Internet connection for your VDSL in the Unifi GUI altogether, which should revert the config to a fixed interface without the load-balancing and route monitoring aspects. Then try pinging some hosts which you could monitor, from the USG itself in an SSH session. You want to establish a continuous ping which rarely fails.

I found the default ping.ubnt.com wasn't very good and I was having the same problem you are, and found advice from others online suggesting that monitoring an IP rather than an FQDN worked better, by taking name resolution out of the equation. I have alternated between several of the public DNS providers which reply to pings - you can see in the config I posted above I was using 1.0.0.1 as my route monitor, which is Cloudflare's secondary anycast IP.

Try this config to change the IP monitored and also relax the thresholds slightly. Reboot the USG to reinstate the original config.

configure
set load-balance group wan_failover interface eth0 route-test type ping target 1.0.0.1 [or whatever IP you choose]
set load-balance group wan_failover interface eth0 route-test initial-delay 10
set load-balance group wan_failover interface eth0 route-test interval 10
set load-balance group wan_failover interface eth0 route-test count success 2
set load-balance group wan_failover interface eth0 route-test count failure 4
save;commit;exit
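For the "config json stored on the controller" step: as far as I know the file is normally called config.gateway.json, and its nesting follows the words of each set command. The sketch below builds that fragment in Python. Treat the exact layout as an assumption and compare it against the gateway's own running config before relying on it; the function name route_test_fragment is made up here, while the group name wan_failover and the values come from the commands above.

```python
import json


def route_test_fragment(interface, target, initial_delay=10, interval=10,
                        success=2, failure=4, group="wan_failover"):
    """Build a JSON-ready dict mirroring the 'set load-balance ...' commands.

    Assumption: each word of a set command becomes one nesting level and
    leaf values are written as strings.
    """
    return {
        "load-balance": {
            "group": {
                group: {
                    "interface": {
                        interface: {
                            "route-test": {
                                "type": {"ping": {"target": target}},
                                "initial-delay": str(initial_delay),
                                "interval": str(interval),
                                "count": {
                                    "success": str(success),
                                    "failure": str(failure),
                                },
                            }
                        }
                    }
                }
            }
        }
    }


def to_json(fragment):
    """Serialise the fragment with indentation so it is easy to review."""
    return json.dumps(fragment, indent=2)
```

Printing to_json(route_test_fragment("eth0", "1.0.0.1")) gives text that can be reviewed and merged by hand into the site's file on the controller. Malformed JSON there can reportedly leave the gateway stuck provisioning, so it is worth validating before forcing a provision.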




Edited by theboss on Tuesday 26th January 10:14

Murph7355

Original Poster:

37,714 posts

256 months

Tuesday 26th January 2021
theboss said:
Yes, the link can be online and passing traffic, but if the USG's route monitor fails because pings are getting dropped above a certain threshold then the link will be deemed unusable by USG and dropped from outbound routing.

That's why you aren't seeing 192.168.8.1 appear as your next hop on the traceroute, and it's reflected in the info you posted: eth0 (WAN1) is shown as inactive from a routing perspective because a certain number of pings to ping.ubnt.com had been dropped.

You can change the settings on the USG i.e. designate a different host to ping, as well as the interval and pass/fail thresholds. This stuff isn't exposed in the GUI though meaning you have to do so via SSH and then place the settings into a config json stored on the controller and delivered to the USG every time it gets provisioned, otherwise you revert to the GUI settings on restart.

I don't know exactly what it does when WAN2 is unplugged, but I have a feeling this route monitoring behaviour is always active when you have a WAN2 / second internet connection defined.

Try removing the Internet connection for your VDSL in the Unifi GUI altogether, which should revert the config to a fixed interface without the load-balancing and route monitoring aspects. Then try pinging some hosts which you could monitor, from the USG itself in an SSH session. You want to establish a continuous ping which rarely fails.

I found the default ping.ubnt.com wasn't very good and I was having the same problem you are, and found advice from others online suggesting that monitoring an IP rather than an FQDN worked better, by taking name resolution out of the equation. I have alternated between several of the public DNS providers which reply to pings - you can see in the config I posted above I was using 1.0.0.1 as my route monitor, which is Cloudflare's secondary anycast IP.

Try this config to change the IP monitored and also relax the thresholds slightly. Reboot the USG to reinstate the original config.

configure
set load-balance group wan_failover interface eth0 route-test type ping target 1.0.0.1 [or whatever IP you choose]
set load-balance group wan_failover interface eth0 route-test initial-delay 10
set load-balance group wan_failover interface eth0 route-test interval 10
set load-balance group wan_failover interface eth0 route-test count success 2
set load-balance group wan_failover interface eth0 route-test count failure 4
save;commit;exit
Struck out again unfortunately:

eth0
status: Running
pings: 7
fails: 7
run fails: 7/10
route drops: 0
ping gateway: 1.0.0.1 - REACHABLE

I'm not entirely convinced "reachable" actually means reachable... it just counted up through the tries until it hit the tenth and downed the connection.

eth0
status: Waiting on recovery (0/2)
pings: 10
fails: 10
run fails: 10/10
route drops: 1
ping gateway: 1.0.0.1 - DOWN
last route drop : Tue Jan 26 22:23:31 2021

At which point the failover took over

eth2
status: Running
failover-only mode
pings: 0
fails: 0
run fails: 0/3
route drops: 1
ping gateway: ping.ubnt.com - REACHABLE
last route drop : Tue Jan 26 22:23:21 2021
last route recover: Tue Jan 26 22:24:02 2021


I thought this might have it licked :(

(The config also seems to be persisting...which is annoying as I think I've managed to add an interface somehow :) ).

The vendor's support guys are confused on this one too. They have a line in to Ubiquiti so will see what they can find.

theboss

6,913 posts

219 months

Tuesday 26th January 2021
Looks to me like EE isn’t passing ping reliably - can you plug a device into the LTE modem directly bypassing the USG, and see if you can ping these same IPs from there?

That config won’t persist if you reboot the USG - it will re-provision your config as configured in the controller.

It also looks like you have set test count failure to 10 which means 10 consecutive pings need to fail before it deems the host unreachable and drops the link.
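As a rough model of how those counts behave, here is a small simulation. It is a simplification based only on what the watchdog output in this thread shows (a run of consecutive failures drops the route, a run of consecutive successes brings it back), not the USG's actual code. The function name simulate is made up for illustration.

```python
def simulate(results, failure=3, success=3):
    """Replay ping results (True = reply received) against the thresholds.

    Returns a list of (ping index, "drop" or "recover") events.
    Assumption: only consecutive runs count, and any success resets
    the failure run while the link is up (and vice versa while down).
    """
    up = True
    run_fails = 0
    run_ok = 0
    events = []
    for i, ok in enumerate(results):
        if up:
            run_fails = 0 if ok else run_fails + 1
            if run_fails >= failure:
                up = False
                run_ok = 0
                events.append((i, "drop"))
        else:
            run_ok = run_ok + 1 if ok else 0
            if run_ok >= success:
                up = True
                run_fails = 0
                events.append((i, "recover"))
    return events
```

With failure=10, ten straight losses are needed before the drop, so simulate([False] * 10, failure=10, success=2) reports a single drop on the tenth ping, much like the counter climbing to 10/10 in the output posted above. A single reply anywhere in that run would reset the count under this model.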

Edited by theboss on Tuesday 26th January 23:03

Murph7355

Original Poster:

37,714 posts

256 months

Tuesday 26th January 2021
Looking to see how I managed to get a pppoe interface on the thing, I ran an interfaces status command and got this:

Interface    IP Address                                 S/L    Description
---------    ----------                                 ---    -----------
eth0         192.168.8.2/24                             u/u    WAN
             2a01:4c8:1070:5b77:f0c4:2f9e:55e3:4/128
eth1         10.0.1.1/24                                u/u    LAN
eth2         192.168.1.12/24                            u/u    WAN2
lo           127.0.0.1/8                                u/u
             ::1/128


Which I *think* is showing that the eth0 interface (4G router connection) is getting an IPv6 address from EE (the first bits of the address are the same as the public IP address EE gives the router).

I believe the /128 part might be the prefix size, and seem to recall reading somewhere that EE only uses /64...
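The prefix length can be sanity-checked with Python's standard ipaddress module. This is just a sketch: the helper names summarise and enclosing_64 are made up, and the address is the one from the interface listing above.

```python
import ipaddress


def summarise(cidr):
    """Return (IP version, prefix length, number of addresses covered)."""
    iface = ipaddress.ip_interface(cidr)
    return iface.version, iface.network.prefixlen, iface.network.num_addresses


def enclosing_64(cidr):
    """Return the /64 network that contains an IPv6 interface address."""
    iface = ipaddress.ip_interface(cidr)
    if iface.version != 6:
        raise ValueError("only meaningful for IPv6")
    # strict=False lets the host bits be masked off instead of raising.
    return ipaddress.ip_network(f"{iface.ip}/64", strict=False)
```

summarise("2a01:4c8:1070:5b77:f0c4:2f9e:55e3:4/128") shows a /128 covers exactly one address, i.e. it describes a single host rather than a subnet. enclosing_64 on the same address gives 2a01:4c8:1070:5b77::/64, which can be compared with the first four groups of the public address the router reports.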

Murph7355

Original Poster:

37,714 posts

256 months

Wednesday 27th January 2021
theboss said:
Looks to me like EE isn’t passing ping reliably - can you plug a device into the LTE modem directly bypassing the USG, and see if you can ping these same IPs from there?

That config won’t persist if you reboot the USG - it will re-provision your config as configured in the controller.

It also looks like you have set test count failure to 10 which means 10 consecutive pings need to fail before it deems the host unreachable and drops the link.
(I set it to 10 to give me a chance to see what was occurring :) Also, the USG has been rebooted a few times, reprovisioning from the controller, and this config is persisting.)

I tried the ping earlier whilst directly attached to the router and it worked (avg response 30ms).

The IPv6 address is the result of me setting IPv6 to DHCPv6 in one of my previous looks. Set it to disabled and it shows nothing now (as expected...and also with no change to the situation).

(Thanks again for the input btw).

camel_landy

4,898 posts

183 months

Wednesday 27th January 2021
IIRC - EE use IPv6 for the end points on their LTE network.

M

Murph7355

Original Poster:

37,714 posts

256 months

Wednesday 27th January 2021
camel_landy said:
IIRC - EE use IPv6 for the end points on their LTE network.

M
They do - and the APN I was using didn't even offer an IPv4 address (whatsmyipaddress shows it "not detected").

They gave me another APN to use which does allocate an IPv4 address, but the IPv6 one still looks to be the primary.

The vendor support guys (who have been great thus far. As have EE's technical team) asked me to drop them the "show configuration" output...it's a pretty big file but scanning through it I noticed this:

ipv6-name WANv6_IN {
    default-action drop
    description "packets from internet to intranet"
    rule 3001 {
        action accept
        description "allow established/related sessions"
        state {
            established enable
            invalid disable
            new disable
            related enable
        }

As far as I'm aware I have the USG firewall off. But the above looks like a rule of some sort that drops IPv6 packets, which, if it were happening, would explain my situation! I haven't read the above in the context of the rest of the file yet, but will get to that today (assuming the vendor/Ubiquiti don't come back first).
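One way to read that excerpt is as an ordinary stateful inbound policy: replies to connections started from inside are accepted, and anything unsolicited falls through to the default drop. The toy model below only covers the lines shown (rule 3001 and the default action); the rest of the real ruleset is not in the excerpt, and the class names Rule and Ruleset are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    action: str
    states: frozenset  # connection states this rule matches


@dataclass
class Ruleset:
    default_action: str
    rules: list = field(default_factory=list)

    def decide(self, state):
        """Return the action of the first matching rule, else the default."""
        for rule in self.rules:
            if state in rule.states:
                return rule.action
        return self.default_action


# Only what the excerpt shows: rule 3001 accepts established/related,
# and everything else hits "default-action drop".
WANV6_IN = Ruleset(
    default_action="drop",
    rules=[Rule("accept", frozenset({"established", "related"}))],
)
```

Under this reading WANV6_IN.decide("established") is "accept" and WANV6_IN.decide("new") is "drop", so replies to traffic that originated behind the USG would still get through. If that holds for the full config, this rule on its own would probably not explain the dropped pings, though the rest of the file would need checking to be sure.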

theboss

6,913 posts

219 months

Wednesday 27th January 2021
I think you're burdening yourself with the added complexity of IPv6 unnecessarily. They may well assign an IPv6 address to the LTE interface on your router, and use some transition technology on their side, but you needn't concern yourself with it. They aren't mandating that customers run IPv6 on their own network. Essentially your USG should be an IPv4-only DHCP client on the LAN side of the LTE router. Disable IPv6 on the USG WAN1 interface and, if possible, disable IPv6 on the LAN interface of the Huawei as well. Save that particular challenge for another day :)

So are you saying you can connect an IPv4 client to the Huawei and ping away to your heart's content without problems?

LordGrover

33,539 posts

212 months

Wednesday 27th January 2021
Have you renewed the DHCP leases on the clients when you make a change?

Murph7355

Original Poster:

37,714 posts

256 months

Wednesday 27th January 2021
theboss said:
I think you're burdening yourself with the added complexity of IPv6 unnecessarily. They may well assign an IPv6 address to the LTE interface on your router, and use some transition technology on their side, but you needn't concern yourself with it. They aren't mandating that customers run IPv6 on their own network. Essentially your USG should be an IPv4-only DHCP client on the LAN side of the LTE router. Disable IPv6 on the USG WAN1 interface and, if possible, disable IPv6 on the LAN interface of the Huawei as well. Save that particular challenge for another day :)

So are you saying you can connect an IPv4 client to the Huawei and ping away to your heart's content without problems?
It's highly likely I'm barking up a wrong tree (a whole feckin forest of them I suspect) :D

IPv6 on the USG WAN interface is now disabled again. Same issue (I only enabled it previously to check it and forgot it was still set).

If I connect my PC directly to the Huawei, I get internet connectivity and can ping away.

If I connect the Huawei direct to my switch, bypassing the USG (and removing it from the equation totally), then devices on the network can also connect to the network - the last time I tried this I did have some devices that couldn't connect to anything, which I put down to some DHCP oddities (but may not have been).

Unfortunately I don't see any way of disabling IPv6 on the router's LAN interface. The USG was picking up a valid IPv4 address from the 4G router via DHCP (I've also had it connected via a static IP too).

Neither of the other WAN connections I have (3 sim on the 4G router and the PlusNet based ADSL) show an IPv6 address (both show "not detected" in whatsmyip).

LordGrover said:
Have you renewed the DHCP leases on the clients when you make a change?
In the config I'm running the only real "client" in that respect (ie client of the 4G router) is the gateway. All other devices on the network get their details from the USG.