Multiple Home Internet Connections, part 2
This continues from part 1, which went over my choices for redundant internet. This part goes into how I've configured the redundancy. I'm using openwrt software on my routers, so my solution will involve openwrt-specific features.
Simplest Possible Setup
After getting two internet providers, the easiest configuration is to just have two different routers with two different WiFi networks with two different SSIDs. This makes some things harder, like controlling your Roku over the local network or running local servers. But you can switch between connections by connecting to the other SSID. To make things even easier, Android and Windows will scan for other WiFi networks if they detect your current connection has no internet access. Other things like Smart TVs and IoT devices might not be able to fail over automatically or even accept multiple SSIDs in their configuration.
This has some tradeoffs, but is extremely easy to do.
More Complicated Setup
If I want to control the failover between internet connections in a central point and just use one WiFi network, then I need to bring both internet connections to one router and then configure it for failover. I can still have two WiFi networks, one primary and one for debugging and experimenting. But I only need to configure my devices for the primary WiFi network.
Network Diagram & Physical Connections
The first consideration is: how can I physically connect things together?
The AT&T GPON connection is terminated on an ONT in the utility room, the Spectrum cable is in the wiring panel in the master closet, and the T-Mobile signal is strongest in the office.
The wiring panel in the master closet was originally designed for coax and telephone lines (RJ11). But thankfully, cat5e cables were used to run the telephone lines, so I replaced the ends with RJ45, and now they run Ethernet. Because of this, the router in the master closet has a direct connection to most other rooms in the house. The exception to this is the office, which did not have any phone lines. It is closest to the utility closet, so I ran an Ethernet cable directly between them.
Logical Connections
The next thing to consider is configuring vlans and routing.
Because the AT&T connection is my primary connection, I'll make the router it connects to the primary router on my network. That router will handle DHCP, DNS, and failover between connections. It's not directly connected to any of the other internet connections, so I'll use vlans to connect the primary router to the other providers.
The two Ethernet physical links between office.lan (T-Mobile), turris.lan (AT&T), and closet.lan (Spectrum) use vlan tagging to provide multiple virtual Ethernet links between them. I'm using the following vlans:
Vlan ID | Name | DHCP Server |
---|---|---|
1 | Lan (default) | turris.lan |
2 | Internet Monitoring | turris.lan |
3 | Internet of Things | turris.lan |
4 | T-Mobile (lan) | T-Mobile router |
5 | Spectrum (wan) | Spectrum |
6 | Spectrum (lan) | closet.lan |
I'm going to ignore the T-Mobile connection for the rest of this post, because I'm getting rid of it.
For the Spectrum connection, I have a choice of where to put the logical connection. Do I carry the wan on its own vlan over to my main router (turris.lan)? Or, do I put the DHCP client on the router it is physically connected to (closet.lan)?
Carrying the wan vlan over to the main router has the benefit of consistency - both connections are in the same spot. This makes basic failover easier to configure, but I want more flexibility and redundancy.
With the DHCP client on the physically connected router, I can easily set up a network that only uses the Spectrum connection. It's also not dependent on the primary connection in case of hardware failure. I'll need to set up routing so I'm not using double NAT when the connection fails over, which is easy.
AT&T connection setup
Since the provided AT&T router is missing a lot of options, I'm not using it. I extracted the X509 certificates from it and set up wpa_supplicant because AT&T requires 802.1X auth in my area. I'm still using the provided Nokia ONT to convert between 1000Base-T and GPON.
Once my router is authenticated, I can use a pretty simple network config:
# AT&T requires vlan 0 tagging on the wan interface
config device 'dev_wan'
option name 'eth2.0'
# The 802.1X x509 cert contains a mac address, use that
option macaddr 'AA:11:AA:11:AA:11'
# use the untagged network interface to reach the ONT's web interface
config interface 'ont'
option device 'eth2'
option proto 'static'
option netmask '255.255.255.0'
option delegate '0'
option ipaddr '192.168.1.200'
option defaultroute '0'
# The 802.1X x509 cert contains a mac address, use that
option macaddr 'AA:11:AA:11:AA:11'
config interface 'wan'
option proto 'dhcp'
option peerdns '0'
list dns '8.8.8.8'
list dns '8.8.4.4'
option device 'eth2.0'
config interface 'wan6'
option proto 'dhcpv6'
option reqaddress 'try'
option peerdns '0'
option noserverunicast '1'
# I copied the DHCPv6 clientid from the AT&T router
option clientid '00000aa11aa11aa'
option reqprefix '60'
option device '@wan'
# more on these three options below
option ip6assign '64'
list ip6class 'wan6'
option ip6hint '7'
The DHCPv6 setup is configured to assign an IPv6 /64 from the prefix delegation to the wan interface. This is because AT&T blocks the "infrastructure" IPv6 address that is automatically assigned in order to protect its network. So when my router wants to automatically update its firmware, it fails. In addition to assigning a subnet from the prefix delegation, I also have to remove the blocked address from the interface with a DHCP script, /etc/odhcp6c.user:
# This script is sourced by odhcp6c's dhcpv6.script at every DHCPv6 event.
# AT&T blocks all external traffic to the 2001:506: address
ATT_NETWORK_SUBNET=`ip -6 addr show dev eth2.0 | grep "inet6 2001:506:" | awk '{print $2}'`
if [ -n "$ATT_NETWORK_SUBNET" ]; then
ip addr del "$ATT_NETWORK_SUBNET" dev eth2.0
fi
This gets the AT&T connection up and running.
Spectrum connection setup
On the physically connected router, I set up the usual wan + wan6 connections with default settings. In order to allow IPv6 prefix sub-delegation, I assigned a /63 to the LAN instead of a /64. odhcpd will then accept prefix delegation requests on the LAN interface and hand out the second /64 in the /63 to the turris. I was not able to get prefix delegation working for subnets larger than a /64.
From that screenshot, you can see I changed the default openwrt domain on closet.lan from "lan" to "lan2" so I could forward all queries about .lan from closet.lan to turris.lan. That way, clients connected to the Spectrum specific WiFi (vlan 6) can still reach devices on the main network by their usual hostname.
On turris.lan, I set up vlan interfaces named spectrum / spectrum6 on the appropriate vlan. I didn't want turris.lan to allocate the prefix delegation subnet to any of its lan interfaces (vlans 1-3), so I'm using an ip6class filter of wan6 on them so it only allocates from the AT&T connection.
In order for closet.lan and turris.lan to exchange traffic for their given vlans, I set up static routes between them. This way, a client on vlan 6 can send traffic to vlan 1 without NAT. With both IPv4 and IPv6 routes in place on both sides, local traffic can stay local.
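One way to check that local traffic really takes the static routes is to ask the kernel which device it would use for a destination. The following is a minimal sketch that pulls the device name out of `ip route get` output. The helper name `route_dev` and the addresses and device in the sample line are hypothetical, chosen only for illustration.

```sh
#!/bin/sh
# route_dev: read `ip route get <addr>` output on stdin and print the
# outgoing device (the word that follows "dev").
route_dev() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "dev") { print $(i + 1); exit } }'
}

# Sample line in the format `ip route get` prints (made-up addresses/device).
echo "192.168.6.20 via 192.168.5.2 dev eth0.5 src 192.168.5.1 uid 0" | route_dev
```

On a router this would be used as `ip route get 192.168.6.20 | route_dev`, with a real destination address substituted.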
Remote traffic, however, needs NAT. For IPv4 traffic going out the Spectrum connection, NAT happens on closet.lan. I'm using NETMAP to handle IPv6 traffic that tries to go out the Spectrum connection with an AT&T subnet. I started with the openwrt documentation, but since turris.lan runs fw3 instead of fw4, I needed to change the script from nft to ip6tables:
LAN_IF="lan"
WAN_IF="spectrum6"
. /lib/functions/network.sh
network_flush_cache
network_get_device LAN_DEV "${LAN_IF}"
eval $(ifstatus "$LAN_IF" | jsonfilter -F / -e "LAN_PFX=@['ipv6-prefix-assignment'][0]['address', 'mask']")
network_get_prefix6 NAT66_PFX "${WAN_IF}"
network_get_device WAN_DEV "${WAN_IF}"
#nft add rule inet fw4 srcnat oifname "${WAN_DEV}" snat ip6 prefix to ip6 saddr map { "${LAN_PFX}" : "${NAT66_PFX}" }
#nft add rule inet fw4 srcnat oifname "${LAN_DEV}" snat ip6 prefix to ip6 saddr map { "${NAT66_PFX}" : "${LAN_PFX}" }
ip6tables -t nat -A POSTROUTING -s "$LAN_PFX" -o "$WAN_DEV" -j NETMAP --to "$NAT66_PFX"
ip6tables -t nat -A PREROUTING -d "$NAT66_PFX" -i "$WAN_DEV" -j NETMAP --to "$LAN_PFX"
This sets up a direct 1:1 mapping from one /64 to another. Because I'm applying it in both directions, incoming traffic works as well as outgoing.
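To illustrate what a 1:1 /64 mapping does to a single address, here is a rough sketch: the upper 64 bits (the first four groups) are swapped for the other prefix, and the lower 64 bits are left alone. The function name `netmap64` and the 2001:db8: documentation addresses are hypothetical. The sketch also assumes the address is written with its first four groups spelled out, which is not true of every valid IPv6 notation.

```sh
#!/bin/sh
# netmap64 ADDR NEW_PREFIX
# ADDR:       IPv6 address whose first four groups are written explicitly
# NEW_PREFIX: replacement /64 as four groups, e.g. "2001:db8:b:7"
netmap64() {
    addr="$1"
    new_prefix="$2"
    # Drop the first four colon-separated groups, keep everything after them.
    host_part=$(printf '%s\n' "$addr" | cut -d: -f5-)
    printf '%s:%s\n' "$new_prefix" "$host_part"
}

netmap64 "2001:db8:a:1::10" "2001:db8:b:7"
# prints 2001:db8:b:7::10
```

Applying the same swap in the other direction on incoming packets is what lets replies, and new inbound connections, find the original host.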
On turris.lan, I put the Spectrum connection on its own routing table by adding an entry in /etc/iproute2/rt_tables, and then assigning the spectrum / spectrum6 interfaces to use it.
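As a sketch of that step, the helper below registers a named table in an rt_tables-style file without adding duplicates. The function name `add_rt_table` and the table id 100 are hypothetical. The commented `uci` lines assume the interfaces are bound to the table with openwrt's `ip4table` / `ip6table` interface options, which is one way to do it and not necessarily the exact config used here.

```sh
#!/bin/sh
# add_rt_table FILE ID NAME
# Append "ID<tab>NAME" to FILE unless NAME is already registered there.
add_rt_table() {
    file="$1"
    id="$2"
    name="$3"
    if ! grep -qE "^[0-9]+[[:space:]]+${name}\$" "$file" 2>/dev/null; then
        printf '%s\t%s\n' "$id" "$name" >> "$file"
    fi
}

# On the router this might be used as (table id 100 is a made-up example):
#   add_rt_table /etc/iproute2/rt_tables 100 spectrum
#   uci set network.spectrum.ip4table='spectrum'
#   uci set network.spectrum6.ip6table='spectrum'
#   uci commit network && /etc/init.d/network reload
```

Running the helper twice leaves a single entry, so it is safe to call from a startup script.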
Monitoring (network config)
In order to monitor these connections, I need a way to force packets out one connection or the other. To force packets a certain way, I can either use policy routing or I can use virtual systems. With policy routing, I can force specific source IPs to always use one connection and never fail over. This is easy to do in IPv4, but is more complicated in IPv6 due to the possibility of my network address changing. I'll use a very lightweight virtual system - a Linux network namespace. While policy routing can choose between multiple routing tables with the same interfaces, a network namespace has completely different network interfaces from the main system. Processes inherit their network namespace from their parent, and they only have access to the network interfaces on their network namespace. Network namespaces only segment the network resources, everything else is shared (ex: files, process list, filesystem mounts, users, etc). This makes it very lightweight in terms of resources used.
On my monitoring server, I'll start out with an AT&T namespace (the default) and a Spectrum namespace. The AT&T namespace is handled through the normal network configuration system, in this case NetworkManager. For the Spectrum namespace, I wrote a shell script to create it and configure it:
#!/bin/sh
# Net Namespace
ip netns add spectrum
# create gretap link device
ip link add name spectrum type gretap local 10.1.1.10 remote 10.1.1.14
ip link set spectrum netns spectrum up
# create veth pair to talk between namespaces
ip link add spectrum-v type veth peer name sandfish-v
ip addr add 192.168.1.9/29 dev spectrum-v
ip link set spectrum-v up
ip link set sandfish-v up netns spectrum
# configure namespace addresses & routes
ip netns exec spectrum ip link set lo up
ip netns exec spectrum ip addr add 192.168.1.10/29 dev sandfish-v
ip netns exec spectrum dhclient spectrum
ip netns exec spectrum dhclient -6 spectrum
This creates three network interfaces in the Spectrum netns:
- gretap type interface, "spectrum"
- veth type interface pair "sandfish-v" (Spectrum netns) / "spectrum-v" (AT&T netns)
- loopback interface "lo"
The loopback interface is the bare minimum needed for a network stack.
The gretap interface has a matching one in the openwrt router closet.lan. This provides a tunnel for ethernet traffic because sw1.lan and sw2.lan on my network are unmanaged switches and can't deal with vlans. Even with a modest 775MHz MIPS processor, closet.lan is able to do >150 Mbit over gretap.
The veth interfaces provide a way for the Spectrum netns to talk to services running inside the AT&T netns.
A simple systemd service config is used to run this script on startup:
[Unit]
Description=Network configuration for spectrum netns
After=network.target
[Service]
Type=oneshot
ExecStart=/root/bin/spectrum-netns
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target
Next, I wanted a way to log in to the netns without using sudo ip netns exec spectrum sudo -u dan bash
Systemd has the ability to run services under a given netns with the configuration NetworkNamespacePath=/run/netns/spectrum
So I can run sshd in the Spectrum netns with:
[Unit]
Description=OpenSSH server daemon - spectrum
Documentation=man:sshd(8) man:sshd_config(5)
After=network.target sshd-keygen.target
Wants=sshd-keygen.target
# Migration for Fedora 38 change to remove group ownership for standard host keys
# See https://fedoraproject.org/wiki/Changes/SSHKeySignSuidBit
Wants=ssh-host-keys-migration.service
[Service]
NetworkNamespacePath=/run/netns/spectrum
Type=notify
EnvironmentFile=-/etc/sysconfig/sshd
ExecStart=/usr/sbin/sshd -D $OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
Restart=on-failure
RestartSec=42s
[Install]
WantedBy=multi-user.target
After that, I just need to make a DNS entry for my monitoring server's spectrum IPs and I can ssh into either netns based on the hostname.
Monitoring (smokeping)
I'm using smokeping to monitor both my local network and my internet connections.
To have smokeping monitor both connections, I have one copy running on the AT&T namespace as the "controller" and one running on the Spectrum namespace as the "worker".
Because I didn't find smokeping examples of having multiple workers, here's how I configured them when I had three connections:
*** Slaves ***
secrets=/etc/smokeping/smokeping_secrets
+5g.lan
display_name=5G
color=00ffff
+spectrum.lan
display_name=Spectrum
color=ff00ff
I'm using fping with the probe FPing4 for IPv4 pings, and FPing6 for IPv6:
*** Probes ***
+ FPing
binary = /usr/sbin/fping
++ FPing4
protocol = 4
++ FPing6
protocol = 6
Then, I gave it targets to run on all three connections:
+ Internet
menu = Internet
title = Internet
++ vps3
probe = FPing4
menu = vps3
title = vps3
host = vps3.drown.org
slaves = 5g.lan spectrum.lan
++ vps3-6
probe = FPing6
menu = vps3-6
title = vps3-6
host = vps3.drown.org
slaves = 5g.lan spectrum.lan
The worker processes were started by systemd in their proper netns. They use the veth network interfaces to reach their controller.
[Unit]
Description=Latency Logging and Graphing System, spectrum
After=syslog.target network-online.target spectrum-netns.service
[Service]
NetworkNamespacePath=/run/netns/spectrum
ExecStart=/usr/sbin/smokeping --nodaemon --master-url=http://192.168.1.9/smokeping/sm.cgi --cache-dir=/var/smokeping/cache --shared-secret=/var/smokeping/secret-spectrum.txt --slave-name=spectrum.lan
# Execute pre script as root
PermissionsStartOnly=true
# Fix owner of files and dirs
ExecStartPre=/usr/libexec/smokeping-fix-ownership
ExecReload=/bin/kill -HUP $MAINPID
User=smokeping
Group=apache
Restart=always
RestartSec=1
Nice=19
IOSchedulingClass=3
PrivateTmp=true
CapabilityBoundingSet=CAP_CHOWN CAP_SETGID CAP_SETUID CAP_DAC_OVERRIDE CAP_KILL CAP_NET_ADMIN CAP_NET_BROADCAST CAP_NET_RAW CAP_SYS_CHROOT
ReadOnlyDirectories=/etc
ReadOnlyDirectories=/usr
[Install]
WantedBy=multi-user.target
When changing the smokeping config, you not only have to restart the controller process, but you also have to kill any fcgi processes for the new config to take effect.
Failover/mwan3 config
This section goes over automatic failover between connections. I'm using openwrt's mwan3 for that goal. The first thing to decide is what traffic I want to send through each connection. The high level choices are: both active or one active, one standby.
For both active, this would assign an internet connection for every TCP/UDP/ICMP flow.
- Benefit: more bandwidth
- Drawback: there are many services that get upset when your IP changes quickly
- Drawback: when there's an issue on the internet, it takes more work to figure out which connection is having the problem
For one active, one standby, all traffic would use one internet connection till it was marked "down"
- Benefit: acts more like a "regular" home network
- Benefit: easier to debug
- Drawback: your standby connection is doing nothing 99% of the time
Because my "backbone" connections are all gigabit, I'm not going to be able to use more bandwidth. I've chosen the active/standby setup.
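To make the active/standby choice concrete, here is a rough sketch that prints mwan3 member, policy, and rule stanzas for one preferred and one standby interface. In mwan3, the member with the lower metric is used while it is up, and the higher metric only takes over when the lower one is marked down. The interface names wan and spectrum come from the network config above. The function name and the member, policy, and rule names are made up, and this is not necessarily the config used here. The per-interface tracking sections (`config interface` with `track_ip` and friends) are left out.

```sh
#!/bin/sh
# mwan3_failover PRIMARY STANDBY
# Print /etc/config/mwan3 stanzas for an IPv4 active/standby policy.
mwan3_failover() {
    primary="$1"
    standby="$2"
    cat <<EOF
config member '${primary}_m1'
    option interface '${primary}'
    option metric '1'
    option weight '1'

config member '${standby}_m2'
    option interface '${standby}'
    option metric '2'
    option weight '1'

config policy 'failover'
    list use_member '${primary}_m1'
    list use_member '${standby}_m2'
    option last_resort 'unreachable'

config rule 'default_v4'
    option dest_ip '0.0.0.0/0'
    option family 'ipv4'
    option use_policy 'failover'
EOF
}

mwan3_failover wan spectrum
```

Giving both members the same metric instead would turn this into the both-active setup, with traffic split according to the weights.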
For IPv6, I could put subnets from each internet connection on my main network for an active/active setup. For active/standby, I'll only put the AT&T IPv6 subnet on the main network, and I've set up NETMAP NAT (see the "Spectrum connection setup" section above) for the Spectrum connection instead. This makes sure any IPv6 traffic being sent over the standby connection will have the right subnet.
While testing failover, I noticed that openwrt immediately removed the IPv6 subnet when the link was lost. Windows immediately removes the IPv6 address, while Android and Linux both keep it around. I decided to manually assign the AT&T IPv6 subnet to the lan interface so it would stay configured when the AT&T connection went down.
Testing Failover
I left an IPv4 ping running to a cloud server while I unplugged the AT&T connection:
64 bytes from 96.126.122.39: icmp_seq=19 ttl=51 time=9.87 ms
64 bytes from 96.126.122.39: icmp_seq=20 ttl=51 time=10.1 ms
[packetloss]
64 bytes from 96.126.122.39: icmp_seq=22 ttl=52 time=22.8 ms
64 bytes from 96.126.122.39: icmp_seq=23 ttl=52 time=23.4 ms
mwan3 detected the link loss immediately and all traffic shifted within 1 second. You can see the latency changed as well as the ttl. Leaving YouTube playing a video, there was no indication that anything had changed. When plugging the connection back in, the ping continued going over the Spectrum connection until I stopped it and started it again. TCP and UDP "connections" were also sticky to the provider they were made on. So things like mosh (which uses UDP) stick to Spectrum after failover until I pause them for long enough to fall out of the state table.
mwan3 also has the option to force traffic out specific interfaces (using policy routing). This also uses the state table, so if I force all ssh out Spectrum, open an ssh connection, and then remove the policy routing rule, the ssh session continues to use the Spectrum provider until it is closed.
For IPv6, I had to make sure that "source-based default routing" was not enabled in openwrt. As I mentioned in "Failover/mwan3 config", I had to statically assign the AT&T subnet to the lan interface so it would stick around when the AT&T connection went down. I also set up IPv6 1:1 NAT subnet mapping (described in "Spectrum connection setup") so the AT&T subnet would be mapped to a Spectrum subnet when it went out the Spectrum interface.
Leaving an IPv6 ping running to a cloud server while I unplugged the AT&T connection:
64 bytes from vps3.drown.org: icmp_seq=17 ttl=54 time=26.3 ms
64 bytes from vps3.drown.org: icmp_seq=18 ttl=54 time=19.8 ms
64 bytes from vps3.drown.org: icmp_seq=19 ttl=54 time=77.1 ms
[packetloss]
64 bytes from vps3.drown.org: icmp_seq=31 ttl=51 time=35.8 ms
64 bytes from vps3.drown.org: icmp_seq=32 ttl=51 time=41.0 ms
64 bytes from vps3.drown.org: icmp_seq=33 ttl=51 time=38.1 ms
64 bytes from vps3.drown.org: icmp_seq=34 ttl=51 time=38.2 ms
64 bytes from vps3.drown.org: icmp_seq=35 ttl=51 time=39.7 ms
This took much longer (12 seconds instead of 1 second), but it also failed over. This ping was from a wifi client, so the latency and jitter are much higher. You can tell the failover happened through the ttl changing.
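The failover time can be read off a ping log by finding the largest jump in icmp_seq. The following is a minimal sketch, assuming ping ran at its default one-second interval so that a jump of N in sequence numbers is roughly N seconds between replies. The function name `max_seq_gap` is hypothetical, and the sample lines are taken from the IPv6 output above.

```sh
#!/bin/sh
# max_seq_gap: read ping output on stdin and print the largest jump
# between consecutive icmp_seq values (1 means no replies were missed).
max_seq_gap() {
    awk '
        match($0, /icmp_seq=[0-9]+/) {
            # "icmp_seq=" is 9 characters long; the number follows it
            seq = substr($0, RSTART + 9, RLENGTH - 9) + 0
            if (seen && seq - prev > gap) gap = seq - prev
            prev = seq
            seen = 1
        }
        END { print gap + 0 }
    '
}

max_seq_gap <<'EOF'
64 bytes from vps3.drown.org: icmp_seq=18 ttl=54 time=19.8 ms
64 bytes from vps3.drown.org: icmp_seq=19 ttl=54 time=77.1 ms
[packetloss]
64 bytes from vps3.drown.org: icmp_seq=31 ttl=51 time=35.8 ms
64 bytes from vps3.drown.org: icmp_seq=32 ttl=51 time=41.0 ms
EOF
# prints 12
```

Feeding it the IPv4 log from earlier gives 2, which corresponds to a single lost reply.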
Notes
When debugging mwan3, you can use the following commands:
mwan3 status
- shows interface status, the policies, and firewall rules
ip -4 rule / ip -6 rule
- shows the policy routing rules (ex: from all fwmark 0x400/0x3f00 lookup 4)
ip -4 route show table 1 / ip -6 route show table 4
- shows the routing tables that the policy routing rules are referring to