DMVPN Tunnel Goes Down at Midnight

Info

Network information was modified to protect the privacy of the organization.

Summary

Users observed that their devices on the DMVPN hub-and-spoke topology were intermittently disconnected every day around midnight. The issue was critical enough that all of the IT resources were pulled together to troubleshoot it. The tunnel did not go down at any specific time; the only timeframe given to me was that the issue would start after midnight and last until about one in the morning, after which the tunnel and topology would be stable.

Originally, the suspicion was a misconfiguration somewhere in the DMVPN topology, so the networking team rebuilt the tunnel and changed its configuration. Another theory was an MTU issue along the forwarding hops of the tunnel, but that would have been difficult to troubleshoot and pinpoint.

I took a different approach to understanding and remediating the issue: I looked for patterns and correlations among the devices that belong to the topology. I met with the users who originally observed the issue, whose monitoring devices on the DMVPN network would drop around midnight. They had noticed that one site would alert on/off status at midnight before the whole network went down. I checked the logs at that site and, sure enough, the physical interface was flapping roughly 50-100 times around midnight. Each flap also brought down the SVI (Vlan404), since the device on that port is the only one locally connected to the router on that network. Other spoke sites in the topology did not have this issue.

I met with my seniors and boss and recommended that we shut down the port before midnight so that it could not flap Vlan404 and take down the tunnel endpoint at the site. Once I shut down the port, the DMVPN tunnel was stable and the users did not experience any issues. Once we found the issue, it was generally accepted that the cause was the flood of EIGRP advertisements sent from the spoke site to the hub as Vlan404 flapped, which brought down the network. By default, a DMVPN tunnel has low throughput, so the flood of EIGRP advertisements overloaded the link and caused the tunnel to go down.

After the issue was discovered, the CCIE consultant recommended configuring "no autostate" on every Vlan404 so that the SVI's up/up status is no longer tied to an L2 port. I wrote two scripts to address and prevent further issues from this flapping activity. The first script SSHes into all field routers and configures the "no autostate" command on those that have Vlan404 in their configuration. The second script SSHes into all identified field routers and checks their logs for up/down activity on the physical interface associated with the device in Vlan404.
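
The original scripts are not reproduced here. As a rough illustration of the decision the first script has to make, the Python sketch below takes a router's running configuration as text and returns the commands to push, or nothing if the router has no Vlan404 or already has "no autostate" set. The function name autostate_commands and the sample configuration are hypothetical, and the SSH transport is deliberately left out.

```python
from typing import List


def autostate_commands(running_config: str, vlan_id: int = 404) -> List[str]:
    """Return the config lines needed to disable autostate on the SVI.

    Hypothetical helper: returns [] when the SVI is absent or when
    "no autostate" is already present under it.
    """
    header = f"interface Vlan{vlan_id}"
    in_svi = False
    found = False
    for line in running_config.splitlines():
        if not line.startswith(" "):
            # A non-indented line starts a new top-level section.
            in_svi = line.strip() == header
            found = found or in_svi
        elif in_svi and line.strip() == "no autostate":
            return []  # already configured, nothing to push
    return [header, "no autostate"] if found else []


# Made-up configuration fragment, for illustration only.
sample_config = (
    "interface Vlan404\n"
    " ip address 10.0.0.1 255.255.255.0\n"
    "!\n"
    "interface Tunnel0\n"
    " ip mtu 1400\n"
)
print(autostate_commands(sample_config))
```

Keeping this check separate from the SSH session makes it easy to dry-run against saved configurations before touching any field router.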

Log from the DMVPN Hub Router:

HUB-DMVPN-ROUTER#sh logg | in down
Apr 20 00:35:42.232 HST: %DUAL-5-NBRCHANGE: EIGRP-IPv4 300: Neighbor 192.168.0.1 (Tunnel0) is down: retry limit exceeded
Apr 20 00:54:11.478 HST: %DUAL-5-NBRCHANGE: EIGRP-IPv4 300: Neighbor 192.168.0.1 (Tunnel0) is down: retry limit exceeded
Apr 20 02:12:21.083 HST: %DUAL-5-NBRCHANGE: EIGRP-IPv4 300: Neighbor 192.168.0.1 (Tunnel0) is down: retry limit exceeded
Apr 20 16:58:53.009 HST: %DUAL-5-NBRCHANGE: EIGRP-IPv4 300: Neighbor 192.168.0.1 (Tunnel0) is down: retry limit exceeded
Apr 21 00:24:45.641 HST: %DUAL-5-NBRCHANGE: EIGRP-IPv4 300: Neighbor 192.168.0.1 (Tunnel0) is down: Peer Termination received
Apr 21 00:39:56.899 HST: %DUAL-5-NBRCHANGE: EIGRP-IPv4 300: Neighbor 192.168.0.1 (Tunnel0) is down: retry limit exceeded
Apr 21 09:37:10.359 HST: %DUAL-5-NBRCHANGE: EIGRP-IPv4 300: Neighbor 192.168.0.1 (Tunnel0) is down: retry limit exceeded
Apr 21 12:24:55.277 HST: %DUAL-5-NBRCHANGE: EIGRP-IPv4 300: Neighbor 192.168.0.1 (Tunnel0) is down: retry limit exceeded
Apr 21 23:54:46.737 HST: %DUAL-5-NBRCHANGE: EIGRP-IPv4 300: Neighbor 192.168.0.1 (Tunnel0) is down: retry limit exceeded
Apr 22 05:31:24.436 HST: %DUAL-5-NBRCHANGE: EIGRP-IPv4 300: Neighbor 192.168.0.1 (Tunnel0) is down: retry limit exceeded
Apr 23 00:17:26.171 HST: %DUAL-5-NBRCHANGE: EIGRP-IPv4 300: Neighbor 192.168.0.1 (Tunnel0) is down: retry limit exceeded
Apr 23 00:51:17.711 HST: %DUAL-5-NBRCHANGE: EIGRP-IPv4 300: Neighbor 192.168.0.1 (Tunnel0) is down: retry limit exceeded
Apr 24 00:31:57.000 HST: %DUAL-5-NBRCHANGE: EIGRP-IPv4 300: Neighbor 192.168.0.1 (Tunnel0) is down: retry limit exceeded
Apr 24 00:44:41.722 HST: %DUAL-5-NBRCHANGE: EIGRP-IPv4 300: Neighbor 192.168.0.1 (Tunnel0) is down: retry limit exceeded
Apr 25 00:06:23.409 HST: %DUAL-5-NBRCHANGE: EIGRP-IPv4 300: Neighbor 192.168.0.1 (Tunnel0) is down: retry limit exceeded
Apr 25 00:13:43.385 HST: %DUAL-5-NBRCHANGE: EIGRP-IPv4 300: Neighbor 192.168.0.1 (Tunnel0) is down: retry limit exceeded
Apr 25 00:27:43.496 HST: %DUAL-5-NBRCHANGE: EIGRP-IPv4 300: Neighbor 192.168.0.1 (Tunnel0) is down: retry limit exceeded
Apr 25 00:42:58.027 HST: %DUAL-5-NBRCHANGE: EIGRP-IPv4 300: Neighbor 192.168.0.1 (Tunnel0) is down: retry limit exceeded
Apr 25 00:47:32.839 HST: %DUAL-5-NBRCHANGE: EIGRP-IPv4 300: Neighbor 192.168.0.1 (Tunnel0) is down: retry limit exceeded
Apr 26 00:12:01.344 HST: %DUAL-5-NBRCHANGE: EIGRP-IPv4 300: Neighbor 192.168.0.1 (Tunnel0) is down: retry limit exceeded
Apr 26 00:16:38.427 HST: %DUAL-5-NBRCHANGE: EIGRP-IPv4 300: Neighbor 192.168.0.1 (Tunnel0) is down: retry limit exceeded
Apr 26 00:26:00.985 HST: %DUAL-5-NBRCHANGE: EIGRP-IPv4 300: Neighbor 192.168.0.1 (Tunnel0) is down: retry limit exceeded
Apr 26 00:32:01.188 HST: %DUAL-5-NBRCHANGE: EIGRP-IPv4 300: Neighbor 192.168.0.1 (Tunnel0) is down: retry limit exceeded
Apr 26 00:40:40.054 HST: %DUAL-5-NBRCHANGE: EIGRP-IPv4 300: Neighbor 192.168.0.1 (Tunnel0) is down: retry limit exceeded
Apr 26 09:01:05.893 HST: %DUAL-5-NBRCHANGE: EIGRP-IPv4 300: Neighbor 192.168.0.1 (Tunnel0) is down: retry limit exceeded
Apr 26 09:35:14.868 HST: %DUAL-5-NBRCHANGE: EIGRP-IPv4 300: Neighbor 192.168.0.1 (Tunnel0) is down: retry limit exceeded
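
In the hub log above, 18 of the 26 neighbor-down entries fall in the 00:00 hour. A minimal Python sketch for surfacing that kind of clustering is shown below. It assumes the timestamp layout seen in this log; the function name neighbor_down_by_hour is hypothetical, and the sample uses the first four lines of the log.

```python
import re
from collections import Counter

# Matches the hub's EIGRP neighbor-down syslog lines and captures the hour.
NBR_DOWN = re.compile(
    r"^\w{3}\s+\d+\s+(\d{2}):\d{2}:\d{2}\.\d+\s+\S+\s+%DUAL-5-NBRCHANGE:.*is down"
)


def neighbor_down_by_hour(log_lines):
    """Count EIGRP neighbor-down events per hour of day (0-23)."""
    hours = Counter()
    for line in log_lines:
        match = NBR_DOWN.match(line.strip())
        if match:
            hours[int(match.group(1))] += 1
    return hours


sample = [
    "Apr 20 00:35:42.232 HST: %DUAL-5-NBRCHANGE: EIGRP-IPv4 300: Neighbor 192.168.0.1 (Tunnel0) is down: retry limit exceeded",
    "Apr 20 00:54:11.478 HST: %DUAL-5-NBRCHANGE: EIGRP-IPv4 300: Neighbor 192.168.0.1 (Tunnel0) is down: retry limit exceeded",
    "Apr 20 02:12:21.083 HST: %DUAL-5-NBRCHANGE: EIGRP-IPv4 300: Neighbor 192.168.0.1 (Tunnel0) is down: retry limit exceeded",
    "Apr 20 16:58:53.009 HST: %DUAL-5-NBRCHANGE: EIGRP-IPv4 300: Neighbor 192.168.0.1 (Tunnel0) is down: retry limit exceeded",
]
for hour, count in sorted(neighbor_down_by_hour(sample).items()):
    print(f"{hour:02d}:00  {count}")
```
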

Log from the DMVPN Spoke Router:

Apr 20 00:01:38.452 Hawaii: %LINK-3-UPDOWN: Interface GigabitEthernet0/0/1, changed state to down
Apr 20 00:01:38.456 Hawaii: %LINK-3-UPDOWN: Interface Vlan404, changed state to down
Apr 20 00:01:40.600 Hawaii: %LINK-3-UPDOWN: Interface GigabitEthernet0/0/1, changed state to up
Apr 20 00:01:40.606 Hawaii: %LINK-3-UPDOWN: Interface Vlan404, changed state to up
Apr 20 00:02:18.531 Hawaii: %LINK-3-UPDOWN: Interface GigabitEthernet0/0/1, changed state to down
Apr 20 00:02:18.537 Hawaii: %LINK-3-UPDOWN: Interface Vlan404, changed state to down
Apr 20 00:02:20.679 Hawaii: %LINK-3-UPDOWN: Interface GigabitEthernet0/0/1, changed state to up
Apr 20 00:02:20.685 Hawaii: %LINK-3-UPDOWN: Interface Vlan404, changed state to up
Apr 20 00:04:59.731 Hawaii: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/0/1, changed state to down
Apr 20 00:04:59.732 Hawaii: %LINEPROTO-5-UPDOWN: Line protocol on Interface Vlan404, changed state to down
Apr 20 00:05:01.024 Hawaii: %LINK-3-UPDOWN: Interface GigabitEthernet0/0/1, changed state to down
Apr 20 00:05:01.030 Hawaii: %LINK-3-UPDOWN: Interface Vlan404, changed state to down
Apr 20 00:05:03.723 Hawaii: %LINK-3-UPDOWN: Interface GigabitEthernet0/0/1, changed state to up
Apr 20 00:05:03.729 Hawaii: %LINK-3-UPDOWN: Interface Vlan404, changed state to up
Apr 20 00:05:04.724 Hawaii: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/0/1, changed state to up
Apr 20 00:05:04.728 Hawaii: %LINEPROTO-5-UPDOWN: Line protocol on Interface Vlan404, changed state to up
Apr 20 00:05:39.820 Hawaii: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/0/1, changed state to down
Apr 20 00:05:39.820 Hawaii: %LINEPROTO-5-UPDOWN: Line protocol on Interface Vlan404, changed state to down
Apr 20 00:05:41.052 Hawaii: %LINK-3-UPDOWN: Interface GigabitEthernet0/0/1, changed state to down
Apr 20 00:05:41.058 Hawaii: %LINK-3-UPDOWN: Interface Vlan404, changed state to down
Apr 20 00:05:43.765 Hawaii: %LINK-3-UPDOWN: Interface GigabitEthernet0/0/1, changed state to up
Apr 20 00:05:43.771 Hawaii: %LINK-3-UPDOWN: Interface Vlan404, changed state to up

[Some log messages are omitted...]

Apr 20 00:58:48.073 Hawaii: %LINEPROTO-5-UPDOWN: Line protocol on Interface Vlan404, changed state to up
Apr 20 00:59:24.372 Hawaii: %LINK-3-UPDOWN: Interface GigabitEthernet0/0/1, changed state to down
Apr 20 00:59:24.376 Hawaii: %LINK-3-UPDOWN: Interface Vlan404, changed state to down
Apr 20 00:59:27.071 Hawaii: %LINK-3-UPDOWN: Interface GigabitEthernet0/0/1, changed state to up
Apr 20 00:59:27.077 Hawaii: %LINK-3-UPDOWN: Interface Vlan404, changed state to up
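
The check that the second script performs on logs like the spoke log above can be sketched in Python as follows. This is not the original script: the names count_link_downs and flapping_interfaces and the threshold value are assumptions made for illustration, and the input is assumed to be log text already collected from the router.

```python
import re
from collections import Counter

# Matches physical/SVI link state changes; LINEPROTO messages are ignored.
LINK_EVENT = re.compile(
    r"%LINK-3-UPDOWN: Interface ([^,\s]+), changed state to (up|down)"
)


def count_link_downs(log_lines):
    """Count how many times each interface transitioned to down."""
    downs = Counter()
    for line in log_lines:
        match = LINK_EVENT.search(line)
        if match and match.group(2) == "down":
            downs[match.group(1)] += 1
    return downs


def flapping_interfaces(downs, threshold):
    """Return interfaces whose down count meets the (assumed) threshold."""
    return sorted(name for name, count in downs.items() if count >= threshold)


sample = [
    "Apr 20 00:01:38.452 Hawaii: %LINK-3-UPDOWN: Interface GigabitEthernet0/0/1, changed state to down",
    "Apr 20 00:01:38.456 Hawaii: %LINK-3-UPDOWN: Interface Vlan404, changed state to down",
    "Apr 20 00:01:40.600 Hawaii: %LINK-3-UPDOWN: Interface GigabitEthernet0/0/1, changed state to up",
    "Apr 20 00:01:40.606 Hawaii: %LINK-3-UPDOWN: Interface Vlan404, changed state to up",
    "Apr 20 00:02:18.531 Hawaii: %LINK-3-UPDOWN: Interface GigabitEthernet0/0/1, changed state to down",
    "Apr 20 00:02:18.537 Hawaii: %LINK-3-UPDOWN: Interface Vlan404, changed state to down",
]
downs = count_link_downs(sample)
print(flapping_interfaces(downs, threshold=2))
```

Counting only transitions to down avoids double-counting each flap, since every flap in the log appears as a down event followed by an up event.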