Network outage randomly - Need help to investigate

Started by unam, November 29, 2019, 02:54:34 PM

Hi,

I have used OPNsense for a few years now and I really like it!

I run it as a virtual machine on Proxmox (KVM) with 2 vCPUs, 2 GB of RAM, and a 10 GB disk.

This VM has 4 virtual interfaces, each with a dedicated MAC address routed on the hosting provider's network (OVH).
These interfaces are dedicated to HAProxy, which delivers web services, and to 3 OpenVPN servers.

On the LAN side, I have multiple VLANs on the same interface. Each VLAN is a /30 subnet containing one virtual server and one OPNsense IP address that acts as its gateway.

It had been running without a reboot for the last 4 months. Then, last week, our services suddenly became unavailable and we had to stop and restart the firewall.

Today we had another outage. I tried rebooting the virtual machine directly, without success: our services came back for about 10 seconds, then the firewall stopped responding.

To troubleshoot, I checked the ARP table and found that every local IP had the same MAC address.
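The ARP check described above could be scripted so it is quick to repeat during the next outage. This is only a minimal sketch: it assumes the FreeBSD-style `arp -an` line format shown in the comment, and the function name `shared_macs` and the sample addresses are made up for illustration. Entries that belong to the firewall itself can legitimately share a MAC, so a hit is a hint, not proof of a fault.

```python
import re
from collections import defaultdict

# Assumed line format (FreeBSD-style `arp -an`), for example:
#   ? (10.0.0.2) at 52:54:00:aa:bb:01 on vtnet1 expires in 1193 seconds [ethernet]
ARP_LINE = re.compile(
    r"\((?P<ip>[0-9.]+)\) at (?P<mac>(?:[0-9a-f]{2}:){5}[0-9a-f]{2})",
    re.IGNORECASE,
)


def shared_macs(arp_output):
    """Return {mac: [ips]} for every MAC address claimed by more than one IP."""
    by_mac = defaultdict(list)
    for line in arp_output.splitlines():
        match = ARP_LINE.search(line)
        if match:  # "(incomplete)" entries do not match and are skipped
            by_mac[match.group("mac").lower()].append(match.group("ip"))
    return {mac: ips for mac, ips in by_mac.items() if len(ips) > 1}


# Hypothetical sample data, not taken from the affected firewall.
SAMPLE = """\
? (10.0.0.2) at 52:54:00:aa:bb:01 on vtnet1 expires in 1193 seconds [ethernet]
? (10.0.0.6) at 52:54:00:aa:bb:01 on vtnet1 expires in 1100 seconds [ethernet]
? (10.0.0.10) at 52:54:00:aa:bb:03 on vtnet1 expires in 900 seconds [ethernet]
? (10.0.0.14) at (incomplete) on vtnet1 expired [ethernet]
"""

for mac, ips in shared_macs(SAMPLE).items():
    print(f"{mac} is shared by {', '.join(ips)}")
```

On the firewall itself, the text passed to `shared_macs` would be the captured output of `arp -an` instead of the sample string.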

I then stopped the VM and started it again (cold boot) and, miraculously, everything seems to be working again. I checked the ARP table and every local IP now has its own MAC address.

I think the ARP table was full and everything was being dropped. The reboot did not flush the table, maybe because the table is reloaded directly on reboot?

Does anyone have any kind of solution, line of investigation, or anything else to suggest? I do not really know how to troubleshoot this problem quickly before it happens again.

Thanks for your reply.

Regards,

I don't think this is related to OPNsense.

You would need someone to debug Proxmox, your switch environment and the VMs.

If you have support, try asking the Proxmox guys; they do a very good job.
Twitter: banym
Mastodon: banym@bsd.network
Blog: https://www.banym.de

Yup, on your advice I just checked my ovs-vswitchd.log history and found that last week I got:


2019-11-21T20:52:15.544Z|00788|netdev_linux|WARN|veth104i0: removing policing failed: No such device
2019-11-21T20:52:15.544Z|00789|ofproto|WARN|vmbr1: cannot get STP status on nonexistent port 33
2019-11-21T20:52:15.544Z|00790|ofproto|WARN|vmbr1: cannot get RSTP status on nonexistent port 33
2019-11-21T20:52:15.546Z|00791|bridge|WARN|could not open network device veth104i0 (No such device)
2019-11-21T20:52:20.155Z|00792|bridge|WARN|could not open network device veth103i0 (No such device)
2019-11-21T20:53:20.214Z|00793|bridge|WARN|could not open network device veth102i0 (No such device)
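To see at a glance which devices these warnings refer to, the log could be summarised with a short script. This is a rough sketch that assumes the pipe-separated layout visible in the lines above (timestamp, sequence number, module, level, message); the function name `missing_devices` is hypothetical, and the sample is just the three "could not open network device" lines from this log.

```python
from collections import Counter

PREFIX = "could not open network device "


def missing_devices(log_text):
    """Count 'could not open network device' warnings per device name."""
    counts = Counter()
    for line in log_text.splitlines():
        # Assumed layout: timestamp|sequence|module|level|message
        parts = line.split("|", 4)
        if len(parts) != 5 or parts[3] != "WARN":
            continue
        message = parts[4]
        if message.startswith(PREFIX):
            # The device name is the first word after the fixed prefix.
            device = message[len(PREFIX):].split(" ", 1)[0]
            counts[device] += 1
    return counts


LOG = """\
2019-11-21T20:52:15.546Z|00791|bridge|WARN|could not open network device veth104i0 (No such device)
2019-11-21T20:52:20.155Z|00792|bridge|WARN|could not open network device veth103i0 (No such device)
2019-11-21T20:53:20.214Z|00793|bridge|WARN|could not open network device veth102i0 (No such device)
"""

for device, count in sorted(missing_devices(LOG).items()):
    print(device, count)
```

Run against the full log file, the same function would show whether the warnings keep hitting the same few interfaces around the time of each outage.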


I will keep investigating in that direction.

Thanks for your quick reply!

Regards,

You're welcome.
If it turns out to be related to OPNsense, it would be nice to keep us updated here.

Good luck with your debugging.