OPNsense Forum

Archive => 22.7 Legacy Series => Topic started by: Untoasted9563 on November 15, 2022, 11:33:56 am

Title: Weird intermittent HAproxy failure
Post by: Untoasted9563 on November 15, 2022, 11:33:56 am
Hi all,

I am quite new to OPNsense, but Im confident I made the right choice, and it has everything (and more) I need. I run a Truenas server in my network, with several services that I (plus friends and family) need to access remotely.

I set up HAproxy according to this how-to: https://forum.opnsense.org/index.php?topic=23339.0 , which is absolutely awesome and almost hassle free :)

It works mostly flawlessly, but I have this weird effect, where some of my services are just not reachable. If I check in back later, they respond again.
- I see the pass action in the firewall, I see also that HAproxy did its thing in the logs
- It does not matter, if accessed from within the network, or from externally
- If I access the service directly by IP, it works without issues or delays.
- Its always Nextcloud, Sonarr or Radarr that don't respond (not always at the same time), but Plex, Transmission and Bitwarden always work.

I attached the redacted HAproxy conf and the HAproxy logs (a working and a failing one). I hope anyone can help me or point me towards the right direction, I really struggle on how to debug this.

Thanks in advance, cheers
Title: Re: Weird intermittent HAproxy failure
Post by: Untoasted9563 on November 17, 2022, 12:10:13 pm
I stand corrected. It can also happen to Bitwarden. Only Plex and Transmission seem perfectly stable.
Title: Re: Weird intermittent HAproxy failure
Post by: Fright on November 18, 2022, 03:30:46 pm
Hi
enabling Detailed Logging on frontends is a good step to debug imho )
Title: Re: Weird intermittent HAproxy failure
Post by: Untoasted9563 on November 18, 2022, 05:20:58 pm
Hey, thanks for the suggestion. I've activated it and will check the logs and give an update soon.
Title: Re: Weird intermittent HAproxy failure
Post by: Untoasted9563 on November 19, 2022, 10:57:07 pm
So, i made some tests. In the logs below, there are five connection attempts, one of them (radarr at 20:30) was successful and instantly showed me the login page.

The rest of them (trying sonarr) kind of just timed out. The "0_SNI_frontend SSL-backend/SSL_server" lines appear about 30 seconds after the connection attempt, so they basically mark the timeout of the session which is noticeable in the client browser.
And two minutes after the attempt, the GET request is logged. This is a very long time. What causes this delay? What do all the numbers mean in this detailed log?  e.g. 0/90064/-1/-1/120071 503 0


Firewall log:
Code: [Select]
WAN 2022-11-19T20:31:32 <<mobileIP>>:15612 <<homeIP>>:443 tcp allow HTTP/S traffic to HAproxy
WAN 2022-11-19T20:31:00 <<mobileIP>>:15611 <<homeIP>>:443 tcp allow HTTP/S traffic to HAproxy
WAN 2022-11-19T20:30:01 <<mobileIP>>:15610 <<homeIP>>:443 tcp allow HTTP/S traffic to HAproxy
WAN 2022-11-19T20:29:23 <<mobileIP>>:15609 <<homeIP>>:443 tcp allow HTTP/S traffic to HAproxy
WAN 2022-11-19T20:28:33 <<mobileIP>>:15608 <<homeIP>>:443 tcp allow HTTP/S traffic to HAproxy


HAproxy log:
Code: [Select]
2022-11-19T20:33:32 Informational haproxy <<mobileIP>>:15612 [19/Nov/2022:20:31:32.083] 1_HTTPS_frontend~ sonarr-backend/sonarr-real 0/90064/-1/-1/120071 503 0 - - sC-- 5/3/0/0/3 0/0 "GET https://sonarr.mydomain.dedyn.io/ HTTP/2.0"
2022-11-19T20:33:00 Informational haproxy <<mobileIP>>:15611 [19/Nov/2022:20:31:00.479] 1_HTTPS_frontend~ sonarr-backend/sonarr-real 0/90091/-1/-1/120133 503 0 - - sC-- 6/4/1/1/3 0/0 "GET https://sonarr.mydomain.dedyn.io/ HTTP/2.0"
2022-11-19T20:32:02 Informational haproxy <<mobileIP>>:15612 [19/Nov/2022:20:31:31.947] 0_SNI_frontend SSL-backend/SSL_server 1/0/30178 543 cD 7/3/2/2/0 0/0
2022-11-19T20:31:30 Informational haproxy <<mobileIP>>:15611 [19/Nov/2022:20:31:00.332] 0_SNI_frontend SSL-backend/SSL_server 1/0/30203 5164 cD 6/3/2/2/0 0/0
2022-11-19T20:31:23 Informational haproxy <<mobileIP>>:15609 [19/Nov/2022:20:29:23.273] 1_HTTPS_frontend~ sonarr-backend/sonarr-real 0/90109/-1/-1/120112 503 0 - - sC-- 7/4/1/1/3 0/0 "GET https://sonarr.mydomain.dedyn.io/ HTTP/2.0"
2022-11-19T20:30:34 Informational haproxy <<mobileIP>>:15610 [19/Nov/2022:20:30:01.462] 0_SNI_frontend SSL-backend/SSL_server 1/0/32996 1978141 CD 7/3/2/2/0 0/0
2022-11-19T20:30:34 Informational haproxy <<mobileIP>>:15608 [19/Nov/2022:20:28:34.048] 1_HTTPS_frontend~ sonarr-backend/sonarr-real 0/90132/-1/-1/120138 503 0 - - sC-- 8/5/1/1/3 0/0 "GET https://sonarr.mydomain.dedyn.io/ HTTP/2.0"
2022-11-19T20:30:02 Informational haproxy <<mobileIP>>:15610 [19/Nov/2022:20:30:02.446] 1_HTTPS_frontend~ radarr-backend/radarr-real 0/0/0/4/8 200 9781 - - ---- 8/5/0/0/0 0/0 "GET https://radarr.mydomain.dedyn.io/radarr/Content/Images/Icons/apple-touch-icon.png?h=QMW7w5FhldXR090sP7kDlQ HTTP/2.0"
2022-11-19T20:30:02 Informational haproxy <<mobileIP>>:15610 [19/Nov/2022:20:30:02.436] 1_HTTPS_frontend~ radarr-backend/radarr-real 0/0/0/7/7 200 2899 - - ---- 8/5/0/0/0 0/0 "GET https://radarr.mydomain.dedyn.io/radarr/Content/Images/Icons/favicon-32x32.png?h=bUxyGU7P9+rZb2XbRYUb6Q HTTP/2.0"
2022-11-19T20:30:02 Informational haproxy <<mobileIP>>:15610 [19/Nov/2022:20:30:02.233] 1_HTTPS_frontend~ radarr-backend/radarr-real 0/0/0/13/29 200 63671 - - ---- 8/5/0/0/0 0/0 "GET https://radarr.mydomain.dedyn.io/radarr/Content/Fonts/Roboto-Regular.woff2?v=1.3.0 HTTP/2.0"
2022-11-19T20:30:02 Informational haproxy <<mobileIP>>:15610 [19/Nov/2022:20:30:02.015] 1_HTTPS_frontend~ radarr-backend/radarr-real 0/0/7/8/15 200 582 - - ---- 8/5/0/0/0 0/0 "GET https://radarr.mydomain.dedyn.io/radarr/Content/Fonts/fonts.css?h=mBv1x1/yHnvphLclzeQIBA HTTP/2.0"
2022-11-19T20:30:02 Informational haproxy <<mobileIP>>:15610 [19/Nov/2022:20:30:02.015] 1_HTTPS_frontend~ radarr-backend/radarr-real 0/0/6/7/13 200 1574 - - ---- 8/5/1/1/0 0/0 "GET https://radarr.mydomain.dedyn.io/radarr/Content/Images/logo.svg?h=UK+1ymD8NGZtp+MiKdEQRg HTTP/2.0"
2022-11-19T20:30:02 Informational haproxy <<mobileIP>>:15610 [19/Nov/2022:20:30:02.006] 1_HTTPS_frontend~ radarr-backend/radarr-real 0/0/0/8/17 200 38162 - - ---- 8/5/2/2/0 0/0 "GET https://radarr.mydomain.dedyn.io/radarr/Content/styles.css?h=mZHtITCRtmV9tzv6R5l22A HTTP/2.0"
2022-11-19T20:30:01 Informational haproxy <<mobileIP>>:15610 [19/Nov/2022:20:30:01.785] 1_HTTPS_frontend~ radarr-backend/radarr-real 0/0/0/5/6 200 2880 - - ---- 8/5/0/0/0 0/0 "GET https://radarr.mydomain.dedyn.io/radarr/login?ReturnUrl=%2F HTTP/2.0"
2022-11-19T20:30:01 Informational haproxy <<mobileIP>>:15607 [19/Nov/2022:20:28:01.705] 1_HTTPS_frontend~ sonarr-backend/sonarr-real 0/90076/-1/-1/120079 503 0 - - sC-- 9/6/2/2/3 0/0 "GET https://sonarr.mydomain.dedyn.io/ HTTP/2.0"
2022-11-19T20:30:01 Informational haproxy <<mobileIP>>:15610 [19/Nov/2022:20:30:01.699] 1_HTTPS_frontend~ radarr-backend/radarr-real 0/0/0/5/5 302 164 - - ---- 9/6/0/0/0 0/0 "GET https://radarr.mydomain.dedyn.io/login?ReturnUrl=%2F HTTP/2.0"
2022-11-19T20:30:01 Informational haproxy <<mobileIP>>:15610 [19/Nov/2022:20:30:01.616] 1_HTTPS_frontend~ radarr-backend/radarr-real 0/0/7/9/16 302 157 - - ---- 9/6/0/0/0 0/0 "GET https://radarr.mydomain.dedyn.io/ HTTP/2.0"
2022-11-19T20:29:53 Informational haproxy <<mobileIP>>:15609 [19/Nov/2022:20:29:23.167] 0_SNI_frontend SSL-backend/SSL_server 1/0/30150 543 sD 8/3/2/2/0 0/0
2022-11-19T20:29:04 Informational haproxy <<mobileIP>>:15608 [19/Nov/2022:20:28:33.913] 0_SNI_frontend SSL-backend/SSL_server 1/0/30196 543 cD 7/3/2/2/0 0/0



I attached also a log when sonarr was working again (2hrs later).
Title: Re: Weird intermittent HAproxy failure
Post by: mag on November 19, 2022, 11:30:27 pm
The 1st set of numbers (0/90064/-1/-1/120071) is timing information (TR/Tw/Tc/Tr/Ta).
The 2nd set of numbers (5/3/0/0/3) are connection counters (actconn/feconn/beconn/srv_conn/retries).

What's also important is the termination state just before the 2nd set of numbers (sC--) which would explain why the connection was closed.

sC-- means:

s: the server-side timeout expired while waiting for the server to send or receive data.
C: the proxy was waiting for the CONNECTION to establish on the server. The server might at most have noticed a connection attempt.

aka
sC: The "timeout connect" stroke before a connection to the server could complete. When this happens in HTTP mode, the status code is likely a 503 or 504 here.

More information (and description of each of those fields):
https://www.haproxy.com/blog/introduction-to-haproxy-logging/
http://docs.haproxy.org/2.4/configuration.html#8.2.3
http://docs.haproxy.org/2.4/configuration.html#8.4
http://docs.haproxy.org/2.4/configuration.html#8.5
Title: Re: Weird intermittent HAproxy failure
Post by: Untoasted9563 on December 07, 2022, 09:25:31 am
Thanks for the information @mag, this was already very helpful, even though it didn't lead me to the solution yet.

I still don't have stable access to my services, however, I was able to narrow it down by some extent.

It seems like less of a HAproxy problem, but probably more like a general networking problem. New facts:
- when a server is not responding, OPNsense can also not ping it (DMZ rules wide open for testing purposes)
- as soon as the server pings OPNsense, the OPNsense pings to the server also start to work immediately, and a couple of moments later, HAproxy manages to redirect to this server.
- even with my DMZ rules in place, the ping from server to OPNsense is not successful but still kickstarts the connection from HAproxy to the server.

Googling this issue didn't really give me anything that seams reasonable in this case. Any smart people here that are not puzzled by this??
Title: Re: Weird intermittent HAproxy failure --- or is it an ARP issue?
Post by: Untoasted9563 on December 09, 2022, 05:06:39 pm
I thought I had it.

I found this post on server fault, where someone describes exactly my issue and this was a suggestion:
Quote
Does sound like a arp problem, if its long spanned vlan over a vpn or similar, you may need to add static arp entries, with an ether address of the local switch.
https://serverfault.com/questions/521928/cannot-ping-b-from-a-until-b-pings-a


My setup looks as follows:

EVILINTERNET    ----   Modem   ----   OpnSense    ----    Wifi AP   <--wifibridge-->   Wifi AP   ----  Managed Switch (Netgear GS108Ev2)  ---- Server

I can imagine, that the wifi-bridge has a similar effect as the described long spanned vlan over vpn. So I tried the static ARP entries, as described in this post:
https://forum.opnsense.org/index.php?topic=7273.0

I created /etc/rc.conf on OPNsense with the IPs and MAC adresses of the web servers and then restarting the static arp service. But no change, it still happens, that I cannot reach or ping some servers.

Then I saw, that the serverfault post stated, that the switch MAC adress should be entered. But still nothing.

But no matter what, when displaying arp -a, I see all these expiration times on my web servers, whereas only IPs from OPNsense itself seem to be permanent. What should be expected when setting static ARP entries? I never used this before, as anyone can guess, I am not a professional.

These are the two attempts I made in the rc.conf (above individual MAC adresses of all servers, below the managed switch):
Code: [Select]
# setting MAC of each web server
#static_arp_pairs="nc vw son rad trans plex"
#static_arp_nc="192.168.DMZ.13 MAC:df"
[...]
#static_arp_plex="192.168.DMZ.14 MAC:ab"


# setting MAC of managed switch in front of web servers
static_arp_pairs="nc vw son rad trans plex"
static_arp_nc="192.168.DMZ.13 MAC:9c"
[...]
static_arp_plex="192.168.DMZ.14 MAC:9c"



Shortened output of arp -a:
Code: [Select]
root@opnsense:~ % arp -a
? (192.168.DMZ.13) at <<MAC>>:df on vlan0.DMZ expires in 898 seconds [vlan]
? (192.168.DMZ.14) at <<MAC>>:ab on vlan0.DMZ expires in 847 seconds [vlan]
[...]
? (192.168.DMZ.1)  at <<MAC>>:e8 on vlan0.DMZ permanent [vlan]
[...]
<<ISP>> (<<GLOBALIP>>.1) at <<MAC>> on igc0 expires in 1200 seconds [ethernet]
<<IP@ISP>> (<<GLOBALIP>>.180) at <<MAC>> on igc0 permanent [ethernet]

Do you guys think the ARP thing sounds realistic and should be more investigated, or is this unlikely and a dead end?
Title: Re: Weird intermittent HAproxy failure
Post by: Untoasted9563 on January 03, 2023, 11:12:45 am
Update (for completeness sake):

In the meantime I got the hint to check the static arp's in the arp table. It didn't work according to the above approach with the rc.conf file (entries still expiring) but did work, when using the option "static arp" in the dhcp server settings in OPNsense GUI.

Next idea was using tcpdump to inspect the traffic, and tcpdump -e to check, if the correct mac adresses are used. As far as I could tell, OPNsense does resolve the MAC adresses correctly.

I was still unsure of which MAC adress to use in the arp table, since also behind the WiFi bridge it is a certain chain of MAC adresses that I could use for specifying the services adress:

Code: [Select]
                    | TrueNAS internally:
ManagedSwitch   ---  physical NIC   ---   Bridge   ---   VNET epair interface
I used the last one, since it is the only one linked to the IP adress.

Anyhow, inspired by Plex, which is broadcasting all the time and works for that reason, I found a workaround, that I can live with for the time being. All services drop a little "hello there" towards opnsense:
Code: [Select]
* * * * * ping -c 1 192.168.DMZ.1
Not elegant, but does the trick (not even a firewall rule needed for this, even a blocked ping makes OPNsense realize the presence of the VMs). Hopefully, later down the road I will be able to eliminate the WiFi bridge and maybe also this weird issue.

Cheers