HAproxy 503 errors since upgrade to 21.7.5

Started by opn_minded, November 12, 2021, 07:18:35 PM

Dear all,

I upgraded from 21.7.3 to 21.7.5 today. Since then HAProxy returns nothing but 503s when I try to access my server from outside the network. I haven't made any changes whatsoever, just the upgrade.

Is there anything new/known that might cause that?

I saw that 21.7.5 would include os-haproxy 3.7[6] (21.7.4 included 3.6) and the latest haproxy for my former version (21.7.3) was 3.5 introduced in 21.7.2.

I looked through the changelog at https://github.com/opnsense/plugins/blob/stable/21.7/net/haproxy/pkg-descr, but haven't found anything in particular.

Many thanks!

hi
I think it's worth looking at the haproxy logs (backend unavailable? error connecting to backend? certificate error? something else?)

Hi Fright, many thanks for taking the time to reply to my issue!

I had a look at the haproxy-logs (both GUI and CLOG), but there's "nothing" that would indicate the root cause.

I've restarted haproxy at least 5 times now and also rebooted OPNsense twice (once after the upgrade and once more just to rule that out).

My logging is set to debug (settings > settings > logging > filter syslog level).

This is one log-entry (tried to reach my server from outside):
2021-11-13T15:50:07 haproxy[18223] Connect from 1.2.3.4:29968 to 127.0.0.1:8443 (PublicService_HTTPS/HTTP)

.. nothing else besides that. I've also run an ACME challenge to see if that works - yes, it's just fine.

The reason I know that haproxy throws 503s is that I see the error message in the browser when I try to reach my server; telegraf (which is monitoring haproxy) also shows the 503 error count increasing.

It must have something to do with haproxy, because if I bypass it via NAT-port forward, everything is fine.

imho it is worth raising logging on a 'public service': Edit public service (enable 'Advanced mode') -> Logging Options -> enable "Raise Log Level" and "Detailed Logging". Apply and try to connect again.
maybe something interesting will appear in the log

hi fright, here we go:
1.2.3.4:17799 [14/Nov/2021:12:01:26.338] PublicService_HTTPS~ BackendPool_Default/RealServer_Default 0/3030/-1/-1/3036 503 222 - - SC-- 1/1/0/0/3 0/0 "GET https://server.com/start HTTP/2.0"

Perhaps something messed up in stepping over 21.7.4/3.6...

Maybe try:

  • Performing a health audit at System: Firmware > Run an audit
  • Carefully check HAProxy configuration items in GUI
  • Carefully check HAProxy configuration items in the actual config file
  • Running HAProxy in check mode in the foreground at the console (with -c)
  • Running HAProxy in the foreground at the console (without daemon option, i.e. do not use -D) perhaps with verbose (-V), maybe debug (-d)

QuoteBackendPool_Default/RealServer_Default 0/3030/-1/-1/3036 503 222 - - SC-- 1/1/0/0/3 0/0
SC - "The server explicitly refused the TCP connection"
the backend (or something between OPNsense and the backend) refuses (or fails) the connection
Quote"GET https://server.com/start HTTP/2.0"
can you try to connect to the backend from the OPNsense shell with "openssl s_client" (don't forget -servername if you use SNI) and share the results?
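
For anyone reading along, the quoted log line can be taken apart field by field. Below is a minimal Python sketch, assuming the line starts at the client address (no syslog prefix) and follows HAProxy's default HTTP log layout, where the timers are TR/Tw/Tc/Tr/Ta. The function name `decode_http_log` is made up for this illustration; it is not part of HAProxy or OPNsense.

```python
# Timer names as used in HAProxy's default HTTP log format.
TIMER_NAMES = ("TR", "Tw", "Tc", "Tr", "Ta")


def decode_http_log(line):
    """Split one HAProxy HTTP log line into the fields discussed above.

    Assumption: the line begins with the client address, i.e. any
    syslog prefix (timestamp, "haproxy[pid]") has already been removed.
    """
    parts = line.split()
    backend, server = parts[3].split("/")
    timers = dict(zip(TIMER_NAMES, (int(v) for v in parts[4].split("/"))))
    return {
        "backend": backend,
        "server": server,
        "timers": timers,                         # -1: that phase never completed
        "status": int(parts[5]),
        "state": parts[9],                        # termination state, e.g. "SC--"
        "retries": int(parts[10].split("/")[4]),  # last of the five counters
    }


line = ('1.2.3.4:17799 [14/Nov/2021:12:01:26.338] PublicService_HTTPS~ '
        'BackendPool_Default/RealServer_Default 0/3030/-1/-1/3036 503 222 '
        '- - SC-- 1/1/0/0/3 0/0 "GET https://server.com/start HTTP/2.0"')
info = decode_http_log(line)
print(info["state"], info["timers"]["Tc"], info["retries"])  # SC-- -1 3
```

Tc is -1 and the state starts with SC, so the connection to the real server was never established. That fits the refused or failed connection reading, and the last counter shows 3 retries before HAProxy gave up.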


Possibly a change in OPNsense DNS behaviour...? Doing anything funky, maybe with plugins?

Do you have any Resolver Options configured in your Backend Pool?

hi guys, trying my best to sum up!

@benyamin:
QuotePerforming a health audit at System: Firmware > Run an audit
***GOT REQUEST TO AUDIT HEALTH***
Currently running OPNsense 21.7.5 (amd64/OpenSSL) at Sun Nov 14 18:12:47 CET 2021
>>> Check installed kernel version
Version 21.7.5 is correct.
>>> Check for missing or altered kernel files
No problems detected.
>>> Check installed base version
Version 21.7.5 is correct.
>>> Check for missing or altered base files
No problems detected.
>>> Check for missing package dependencies
Checking all packages: .......... done
>>> Check for missing or altered package files
Checking all packages: .......... done
>>> Check for core packages consistency
Core package "opnsense" has 66 dependencies to check.
Checking packages: .................................................................... done
***DONE***


QuoteCarefully check HAProxy configuration items in GUI / Carefully check HAProxy configuration items in the actual config file
I've pulled a 21.7.5 backup and compared it to the last working 21.7.3 - it's all the same. Compared it via VS Code.

QuoteRunning HAProxy in check mode in the foreground at the console (with -c)
haproxy -f /usr/local/etc/haproxy.conf -c
[WARNING] 317/181810 (10164) : parsing [/usr/local/etc/haproxy.conf:64] : a 'http-request' rule placed after a 'use_backend' rule will still be processed before.
[WARNING] 317/181810 (10164) : parsing [/usr/local/etc/haproxy.conf:89] : a 'http-request' rule placed after a 'use_backend' rule will still be processed before.
Warnings were found.


.. but please take note that I had those warnings before.

QuoteRunning HAProxy in the foreground at the console (without daemon option, i.e. do not use -D) perhaps with verbose (-V), maybe debug (-d)

this is what haproxy -f /usr/local/etc/haproxy.conf -d returned:
00000003:PublicService_HTTPS.accept(0007)=000a from [1.2.3.4:47425] ALPN=h2
00000003:PublicService_HTTPS.clireq[000a:ffffffff]: GET https://<server.com>/favicon.ico HTTP/2.0
00000003:PublicService_HTTPS.clihdr[000a:ffffffff]: host: <server.com>
00000003:PublicService_HTTPS.clihdr[000a:ffffffff]: sec-ch-ua: "Microsoft Edge";v="95", "Chromium";v="95", ";Not A Brand";v="99"
00000003:PublicService_HTTPS.clihdr[000a:ffffffff]: dnt: 1
00000003:PublicService_HTTPS.clihdr[000a:ffffffff]: sec-ch-ua-mobile: ?1
00000003:PublicService_HTTPS.clihdr[000a:ffffffff]: user-agent: Mozilla/5.0 (Linux; Android 11; LE2123) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Mobile Safari/537.36 EdgA/95.0.1020.48
00000003:PublicService_HTTPS.clihdr[000a:ffffffff]: sec-ch-ua-platform: "Android"
00000003:PublicService_HTTPS.clihdr[000a:ffffffff]: accept: image/webp,image/apng,image/svg+xml,image/*,*/*;q=0.8
00000003:PublicService_HTTPS.clihdr[000a:ffffffff]: sec-fetch-site: same-origin
00000003:PublicService_HTTPS.clihdr[000a:ffffffff]: sec-fetch-mode: no-cors
00000003:PublicService_HTTPS.clihdr[000a:ffffffff]: sec-fetch-dest: image
00000003:PublicService_HTTPS.clihdr[000a:ffffffff]: referer: https://<server.com>/
00000003:PublicService_HTTPS.clihdr[000a:ffffffff]: accept-encoding: gzip, deflate, br
00000003:PublicService_HTTPS.clihdr[000a:ffffffff]: accept-language: de-AT,de;q=0.9,en-US;q=0.8,en;q=0.7,de-DE;q=0.6,en-GB;q=0.5
00000003:PublicService_HTTPS.clihdr[000a:ffffffff]: cookie: __Host-nc_sameSiteCookielax=true; __Host-nc_sameSiteCookiestrict=true; oc_sessionPassphrase=<REDACTED>; <REDACTED>=<REDACTED>
fd[0xb] OpenSSL error[0x1416f086] tls_process_server_certificate: certificate verify failed
00000004:GLOBAL.accept(0004)=000b from [unix:1] ALPN=<none>
00000004:GLOBAL.srvcls[adfd:ffffffff]
00000004:GLOBAL.clicls[adfd:ffffffff]
00000004:GLOBAL.closed[adfd:ffffffff]
fd[0xb] OpenSSL error[0x1416f086] tls_process_server_certificate: certificate verify failed
fd[0xb] OpenSSL error[0x1416f086] tls_process_server_certificate: certificate verify failed
fd[0xb] OpenSSL error[0x1416f086] tls_process_server_certificate: certificate verify failed
00000003:BackendPool_Nextcloud.clicls[000a:000b]
00000003:BackendPool_Nextcloud.closed[000a:000b]
00000005:GLOBAL.accept(0004)=000b from [unix:1] ALPN=<none>
00000005:GLOBAL.srvcls[adfd:ffffffff]
00000005:GLOBAL.clicls[adfd:ffffffff]
00000005:GLOBAL.closed[adfd:ffffffff]

Configuration file is valid

.. so that's a first indication that it has something to do with the certificates as @Fright mentioned in his post. Does that mean it's coming from the backend-server or from haproxy?

@Fright

Quotecan you try to connect to backend from opnsense shell with "openssl s_client" (dont forget -servername if you use SNI) and share results?
openssl s_client <REDACTED>
5513582366720:error:0200203D:system library:connect:Connection refused:/usr/src/crypto/openssl/crypto/bio/b_sock2.c:110:
5513582366720:error:2008A067:BIO routines:BIO_connect:connect error:/usr/src/crypto/openssl/crypto/bio/b_sock2.c:111:
connect:errno=61


.. also an indication that something goes wrong with my certificates.
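
One note on the output above: errno=61 is ECONNREFUSED on FreeBSD, so the TCP connection was refused before any certificate was exchanged. That is a different failure from the "certificate verify failed" lines in the debug output, and it may point at the address or port used for the test rather than at the certificates. A rough way to tell the two cases apart is sketched below in Python. `probe_backend` and its parameters are hypothetical names for this example, not an OPNsense or HAProxy tool.

```python
import socket
import ssl


def probe_backend(host, port, server_name, cafile=None, timeout=5.0):
    """Return 'refused', 'verify-failed' or 'ok' for a TLS backend.

    cafile: optional PEM file to verify against; when omitted, the
    system default trust store is used. Other errors (timeouts,
    protocol errors) are deliberately left to propagate.
    """
    context = ssl.create_default_context(cafile=cafile)
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            # server_hostname sets SNI and the name to check in the certificate.
            with context.wrap_socket(raw, server_hostname=server_name):
                return "ok"
    except ConnectionRefusedError:
        return "refused"          # TCP level, no TLS handshake happened
    except ssl.SSLCertVerificationError:
        return "verify-failed"    # TCP fine, certificate chain rejected
```

Calling something like `probe_backend("192.0.2.10", 443, "cloud.example.com", cafile="/path/to/ca.pem")` from a host that can reach the backend would show which case applies; the address, name and path here are placeholders, not values from this thread.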

@benyamin (#2):
QuoteDoing anything funky, maybe with plugins?
No, not at all. The following are installed;

  • os-acme-client
  • os-dyndns
  • os-haproxy
  • os-intrusion-detection-content-snort-vrt
  • os-mdns-repeater
  • os-nextcloud-backup
  • os-telegraf
  • os-theme-vicuna
  • os-udpbroadcastrelay
  • os-upnp
  • os-wol

By the way, the server I'm talking about is a Nextcloud instance, and I just found out that os-nextcloud-backup also stopped working on 11-11-2021 (the day I upgraded to 21.7.5).

QuoteDo you have any Resolver Options configured in your Backend Pool?
Nope, this field is empty.

Apart from the many things you suggested to do (many, many thanks for your time at that point), I rolled haproxy back to the 21.7.3 version (opnsense-revert -r 21.7.3 os-haproxy), but the behaviour is the same.

Is it possible servers in your backend pool still have bad LE CA certs configured in their chains?

im also starting to suspect "LE" problems (possibly a "long chain issue" again). which CA is specified in the "SSL Verify CA" field of the Real Server settings?

November 15, 2021, 07:09:14 AM #11 Last Edit: November 15, 2021, 07:29:55 AM by opn_minded
Hi there!

OK, so from a CA perspective, the corresponding real server was always authenticating against the R3 CA of LE.

The option "SSL Verify CA" has the following options;

  • (STAGING) Artificial Apricot R3 (Let's Encrypt)
  • my self-signed CA
  • my self-signed (intermediate) CA
  • R3 (ACME Client)
  • R3 (Let's Encrypt)

I've tried every combination - still getting 503s. If I remove "Verify SSL Certificate" from the config, it works.

Verify SSL Certificate = true / SSL Verify CA = "all" => 503s
Verify SSL Certificate = true / SSL Verify CA = "nothing" => 503s
Verify SSL Certificate = false / SSL Verify CA = "nothing" => OK

2021-11-15T06:56:55 haproxy[30508] <REDACTED>:42259 [15/Nov/2021:06:56:55.895] PublicService_HTTPS~ BackendPool_Nextcloud/RealServer_Nextcloud 0/0/0/51/51 200 4953 - - ---- 1/1/0/0/0 0/0 "GET https://<REDACTED>/core/js/oc.js?v=8007c44e HTTP/2.0"
2021-11-15T06:56:55 haproxy[30508] <REDACTED>:42259 [15/Nov/2021:06:56:55.624] PublicService_HTTPS~ BackendPool_Nextcloud/RealServer_Nextcloud 0/0/10/126/136 200 5348 - - ---- 1/1/0/0/0 0/0 "GET https://<REDACTED>/login HTTP/2.0"


I've run a test with https://www.ssllabs.com - everything is either green or OK.

(switching back to Verify SSL Certificate = true)

This is what a cURL returns (from opnSense-shell), using the "R3 (ACME Client)" CA:
curl -v https://<REDACTED> --cacert <PATH-TO-R3-CERT>.crt
*   Trying <REDACTED>:443...
* Connected to <REDACTED> (<REDACTED>) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*  CAfile: <PATH-TO-R3-CERT>.crt
*  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES256-GCM-SHA384
* ALPN, server accepted to use h2
* Server certificate:
*  subject: CN=<REDACTED>
*  start date: Nov 12 17:06:56 2021 GMT
*  expire date: Feb 10 17:06:55 2022 GMT
*  subjectAltName: host "<REDACTED>" matched cert's "<REDACTED>"
*  issuer: C=US; O=Let's Encrypt; CN=R3
*  SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x55e35be79560)
> GET / HTTP/2
> Host: <REDACTED>
> user-agent: curl/7.74.0
> accept: */*
>
* Connection state changed (MAX_CONCURRENT_STREAMS == 128)!
< HTTP/2 302
< server: nginx/1.21.4
< date: Mon, 15 Nov 2021 06:17:46 GMT
< content-type: text/html; charset=UTF-8
< location: https://<REDACTED>/login
< expires: Thu, 19 Nov 1981 08:52:00 GMT
< cache-control: no-store, no-cache, must-revalidate
< pragma: no-cache
< set-cookie: <REDACTED>; path=/; secure; HttpOnly; SameSite=Lax
< set-cookie: <REDACTED>; path=/; secure; HttpOnly; SameSite=Lax
< content-security-policy: default-src 'self'; script-src 'self' '<REDACTED>'; style-src 'self' 'unsafe-inline'; frame-src *; img-src * data: blob:; font-src 'self' data:; media-src *; connect-src *; object-src 'none'; base-uri 'self';
< set-cookie: __Host-nc_sameSiteCookielax=true; path=/; httponly;secure; expires=Fri, 31-Dec-2100 23:59:59 GMT; SameSite=lax
< set-cookie: __Host-nc_sameSiteCookiestrict=true; path=/; httponly;secure; expires=Fri, 31-Dec-2100 23:59:59 GMT; SameSite=strict
< strict-transport-security: max-age=15768000; includeSubDomains; preload
< referrer-policy: no-referrer
< x-content-type-options: nosniff
< x-download-options: noopen
< x-frame-options: SAMEORIGIN
< x-permitted-cross-domain-policies: none
< x-robots-tag: none
< x-xss-protection: 1; mode=block
<
* Connection #0 to host <REDACTED> left intact


NGINX (on the server) is using the same LE certificate as haproxy (that's why I run the cURL with the --cacert option), because I have an unbound-override in place that would redirect (internal) requests to the server if a device is within my network.

November 15, 2021, 07:40:00 AM #12 Last Edit: November 15, 2021, 07:44:00 AM by Fright
I think one of your R3 CAs ("R3 (Let's Encrypt)", I think) is expired and can be safely removed from System: Trust: Authorities.
the other one ("R3 (ACME Client)" i think) includes cross-signed intermediate CA. this (depending on how the application builds the chains) can cause the end of the chain to be the expired root (but users of old Android devices should be happy).
can you try to add ISRG Root https://letsencrypt.org/certs/isrgrootx1.pem to the System: Trust: Authorities and select it as a "SSL Verify CA" in backend settings?

other option is to cut cross-signed CA cert from R3 to get rid of long chain (https://github.com/opnsense/core/issues/5257#issuecomment-933668219)

since technically certificates have to be checked against the roots, the first option is more correct. but adding an ISRG root will most likely cause services on OPNsense (using LE certificates) to start including the root certificate in the chain (technically, this is not prohibited, but some sites like ssllabs will pay attention to this)
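
To make the "long chain" point a bit more concrete, here is a toy model in Python. It is only a sketch of the idea, not how OpenSSL or HAProxy actually build certificate paths: the `Cert` tuple, the `chain_ok` function and the simplified rule (follow issuer links through the configured CA file until an unexpired self-signed certificate is reached) are all assumptions made for illustration. The expiry dates are rounded to the day and are believed to match the published Let's Encrypt chain, but they are worth double-checking.

```python
from datetime import date
from typing import NamedTuple


class Cert(NamedTuple):
    subject: str
    issuer: str
    not_after: date


# Illustrative entries only (dates rounded to the day).
R3 = Cert("R3", "ISRG Root X1", date(2025, 9, 15))
X1_CROSS = Cert("ISRG Root X1", "DST Root CA X3", date(2024, 9, 30))
X1_SELF = Cert("ISRG Root X1", "ISRG Root X1", date(2035, 6, 4))
DST_X3 = Cert("DST Root CA X3", "DST Root CA X3", date(2021, 9, 30))


def chain_ok(leaf_issuer, ca_file, today):
    """Follow issuer links through ca_file; succeed only when an
    unexpired self-signed certificate (a trust anchor) is reached."""
    current, seen = leaf_issuer, set()
    while current not in seen:
        seen.add(current)
        usable = [c for c in ca_file
                  if c.subject == current and c.not_after >= today]
        if not usable:
            return False          # issuer missing or expired
        if any(c.issuer == c.subject for c in usable):
            return True           # reached a valid self-signed root
        current = usable[0].issuer
    return False                  # loop in the issuer links


today = date(2021, 11, 15)
print(chain_ok("R3", [R3, X1_CROSS, DST_X3], today))           # False
print(chain_ok("R3", [R3, X1_CROSS, DST_X3, X1_SELF], today))  # True
```

In this simplified picture the cross-signed path dead-ends at the expired root, while adding the self-signed ISRG Root X1 gives the walk a valid anchor one step earlier.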

Hi there,

R3 (Let's Encrypt) (as you expected) is valid
from Wed, 07 Oct 2020 21:21:40 +0200
until Wed, 29 Sep 2021 21:21:40 +0200
.. and has 0 signed

(STAGING) Artificial Apricot R3 (Let's Encrypt) is valid
from Fri, 04 Sep 2020 02:00:00 +0200
until Mon, 15 Sep 2025 18:00:00 +0200
.. and has 0 signed

R3 (ACME Client) is valid
from Fri, 04 Sep 2020 02:00:00 +0200
until Mon, 15 Sep 2025 18:00:00 +0200
.. and has 1 signed

I have, as you suggested, imported https://letsencrypt.org/certs/isrgrootx1.pem with the alias "LE - ISRG Root X1" (it shows up as "self-signed"). Immediately afterwards the existing R3 (ACME Client) showed issuer = LE - ISRG Root X1.

Back in haproxy > settings > real server >
Verify SSL Certificate = true
SSL Verify CA = LE - ISRG Root X1 (+ R3 (ACME Client))

Applied the changes. It works - many, many thanks for that :)!

For documentation, I've re-run a ssllabs test - it now informs about "Incorrect order, Contains anchor" in section "Additional Certificates (if supplied)", but I assume this is not that important.

QuoteR3 (Let's Encrypt) (as you expected) is valid
is expired ;) Since expired certificates are no longer imported into the trusted store, this should be safe, but I would remove it anyway.
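
The validity windows posted above make this easy to confirm. A small Python sketch, using the dates exactly as listed in the thread (with their +0200 offset) and an arbitrary check time on the day of these posts; the helper name `valid_at` and the chosen check time are assumptions for illustration:

```python
from datetime import datetime, timedelta, timezone

TZ = timezone(timedelta(hours=2))  # the +0200 offset shown in the listing

# (not before, not after) as posted in the thread.
AUTHORITIES = {
    "R3 (Let's Encrypt)": (
        datetime(2020, 10, 7, 21, 21, 40, tzinfo=TZ),
        datetime(2021, 9, 29, 21, 21, 40, tzinfo=TZ),
    ),
    "(STAGING) Artificial Apricot R3 (Let's Encrypt)": (
        datetime(2020, 9, 4, 2, 0, 0, tzinfo=TZ),
        datetime(2025, 9, 15, 18, 0, 0, tzinfo=TZ),
    ),
    "R3 (ACME Client)": (
        datetime(2020, 9, 4, 2, 0, 0, tzinfo=TZ),
        datetime(2025, 9, 15, 18, 0, 0, tzinfo=TZ),
    ),
}


def valid_at(window, moment):
    """True if moment lies inside the (not_before, not_after) window."""
    not_before, not_after = window
    return not_before <= moment <= not_after


# Assumption: any time on 15 Nov 2021 will do for the comparison.
check_time = datetime(2021, 11, 15, 12, 0, 0, tzinfo=timezone.utc)

for name, window in AUTHORITIES.items():
    print(name, "->", "valid" if valid_at(window, check_time) else "EXPIRED")
```

Only "R3 (Let's Encrypt)" falls outside its window; the other two entries run until September 2025.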

QuoteSSL Verify CA = LE - ISRG Root X1 (+ R3 (ACME Client))
imho R3 is not required here. selecting ISRG Root should be enough

QuoteApplied the changes. It works
glad it works )

QuoteContains anchor
side effect of including root certificates of services chains in the System: Trust: Authorities
imho we really need to ask @frankie to add preferred chain selection to the acme plugin. although I do share his concerns about choosing between manual input (and possible errors) and preset values (and complication of maintenance)