OPNsense Forum

Archive => 18.7 Legacy Series => Topic started by: eugenmayer on December 08, 2018, 07:02:24 pm

Title: Suddenly (every week) unresponsive opnsense box - cold reboot needed
Post by: eugenmayer on December 08, 2018, 07:02:24 pm
Hello,

I have been fighting this for some time now and I am really out of ideas.

Setup
- I have about 5 KVM-based OPNsense boxes, 1 AWS box and 2 apu2c2 boxes. (18.7.8)
- Those 5 KVM boxes are basically identical, running: DHCP, Unbound, OpenVPN, Tinc, HAproxy, ACME (18.7.8)
- 1 AWS is running DHCP, Unbound, OpenVPN, HAproxy, ACME, webproxy (18.1 latest)

Problem
2 of those 5 KVM boxes keep stalling on Saturday every single week (for 5 or more weeks now). Right now it is always the same boxes; it used to be random among those 5.

The AWS box seems to stall every week too, also on Saturday.

What I mean by "stall":
It seems some traffic is still passing through the OPNsense box: NAT appears to still work, as do existing stateful connections. However, the boxes behind OPNsense can no longer access the WAN (outbound issue?).

Also, I cannot log in via SSH or the terminal. In both cases I can enter the user name, but then instead of asking for the password it just "hangs" there.

What I deduced
Several weeks ago, after I noticed that the auto-upgrade did not work and the boxes were stuck at 18.7.4, I upgraded them to 18.7.7 (then .8). Since then it is always the same boxes that get stuck. I suspected the upgrade was the cause, so I deactivated the upgrade cron tasks - but this week no update was available, and still those 2 boxes and the AWS box stalled.

I also suspected that the KVM boxes "stall" during Proxmox backups. I disabled the backups, but that did not help either. Also, since the AWS box is not backed up that way at all, I expect that was not the right assumption anyway.

Also, both 18.1 and 18.7 boxes are affected by this, hosted on totally different hypervisors (AWS / KVM on Proxmox).

While the KVM boxes all have very similar duties, the AWS box is rather different and still affected.


Help
Could anyone help me get to the bottom of this? This is becoming a real blocker for me, in the sense that I might also consider migrating away if I cannot solve this at some point.

If I can provide any logs, or can have the boxes log additional things while they are stalled, let me know. Maybe some RRD graphs could be interesting, or whatever else helps. Thanks!
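
As one idea for "logging additional things", here is a minimal Python sketch of a heartbeat logger I could leave running on a box, so the last line written before a stall shows roughly when it started and what the load looked like. The log path, the interval and the function names are my own assumptions and not anything OPNsense ships; it only uses the standard library.

```python
import os
import time
from datetime import datetime


def heartbeat_line(now=None):
    """Build one log line: ISO timestamp plus the 1/5/15 minute load averages."""
    now = now or datetime.now()
    load1, load5, load15 = os.getloadavg()
    stamp = now.isoformat(timespec="seconds")
    return f"{stamp} load={load1:.2f},{load5:.2f},{load15:.2f}"


def run(path="/var/log/heartbeat.log", interval=60):
    """Append a heartbeat line every `interval` seconds (hypothetical path)."""
    while True:
        with open(path, "a") as fh:
            fh.write(heartbeat_line() + "\n")
            fh.flush()
            # Force the line to disk so it survives a cold reboot.
            os.fsync(fh.fileno())
        time.sleep(interval)
```

Calling `run()` starts the loop. After the next cold reboot, the last timestamp in the file would narrow down when the box stopped responding.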
Title: Re: Suddenly (every week) unresponsive opnsense box - cold reboot needed
Post by: eugenmayer on December 08, 2018, 07:20:35 pm
Maybe some stats: I run an external uptime tracker, so I have at least some timings of when the boxes go down - at least complete ones for the AWS box (see screenshot).

It seems like the "every 2 weeks" is not perfectly right for the AWS box; it seems it was OK for e.g. nearly 1 month, then crashed.

The pattern for the KVM boxes is 100% predictable though: every week, on Saturday.
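
To double-check the weekday pattern instead of eyeballing the graph, something like this small Python sketch could be run over the outage start times exported from the uptime tracker. The timestamps below are made-up placeholders, not real data from my tracker, and `weekday_histogram` is just a name I picked.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def weekday_histogram(timestamps):
    """Count outages per weekday from ISO 8601 timestamp strings."""
    return Counter(
        WEEKDAYS[datetime.fromisoformat(ts).weekday()] for ts in timestamps
    )


# Hypothetical outage start times (placeholders, NOT real tracker data).
outages = [
    "2018-11-10T03:12:00",
    "2018-11-17T03:05:00",
    "2018-11-24T02:58:00",
    "2018-12-01T03:20:00",
    "2018-12-08T03:01:00",
]

print(weekday_histogram(outages))
```

If nearly all counts land on one weekday, that points to something scheduled weekly rather than a random crash.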

---

Also something interesting: out of those 5 KVM boxes, only 2 run HAproxy - the 2 that are crashing. I migrated away from HAproxy on the other 3, and it seems like this might be the reason they stopped crashing.

The AWS box has HAproxy too - also crashing.

---

Could this be HAproxy related, or maybe something with the ACME plugin, which runs a companion there? I am not sure and do not want to mislead anyone, but it seems like an interesting pattern.

 - When does the ACME task usually run? (The ones in cron are rather daily.)
 - Are there any HAproxy related tasks?