Is HAProxy on OPNSense stable/reliable?

Started by gctwnl, January 26, 2023, 11:32:09 PM

I have been running HAProxy on my OPNSense 22.10 business edition for a while now. Sadly, I have to conclude that it doesn't increase availability: after a few days HAProxy stops passing traffic on port 587, and this has now happened 3 times in one week. HAProxy just becomes a black hole when that happens. Stopping and starting HAProxy solves that, so just to be sure I have now created a cron job to restart the router once a day (which is ugly).

Is there anyone who recognises this and knows what to do about it? Or how to find out what goes wrong?

My internal postfix/dovecot servers listen haproxy-aware (expecting the PROXY protocol) on 991 (postfix/postscreen), 990 (postfix/submission), and 994 (dovecot/imaps), and they listen non-haproxy-aware on the official ports (25, 587, 993).
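One way to find out when the black hole starts, rather than restarting on a timer, is to probe the frontend and record the moment it stops greeting. The sketch below is only an illustration: `smtp_banner_ok` is a made-up helper name, and it assumes the service behind the port sends a normal SMTP "220" greeting. The generous default timeout is there because, as far as I know, postscreen can hold back its greeting for a few seconds.

```python
import socket


def smtp_banner_ok(host: str, port: int, timeout: float = 15.0) -> bool:
    """Return True if a TCP connect succeeds and the peer greets with '220'.

    Hypothetical helper for illustration; not part of HAProxy or OPNsense.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            banner = sock.recv(512)
    except OSError:
        # Refused, unreachable, or timed out: all count as "not healthy".
        return False
    # A "black hole" usually shows up as an accepted connection with no
    # greeting, which ends in a timeout above or an empty read here.
    return banner.startswith(b"220")
```

Calling `smtp_banner_ok("192.168.2.2", 587)` every minute from a machine on the client side, and logging the first `False`, would at least narrow down when the frontend stops answering.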

haproxy.conf:
#
# Automatically generated configuration.
# Do not edit this file manually.
#

global
    uid                         80
    gid                         80
    chroot                      /var/haproxy
    daemon
    stats                       socket /var/run/haproxy.socket group proxy mode 775 level admin
    nbproc                      1
    nbthread                    1
    hard-stop-after             60s
    no strict-limits
    tune.ssl.default-dh-param   2048
    spread-checks               2
    tune.bufsize                16384
    tune.lua.maxmem             0
    log                         /var/run/log local0 info
    lua-prepend-path            /tmp/haproxy/lua/?.lua

defaults
    log     global
    option redispatch -1
    timeout client 30s
    timeout connect 30s
    timeout check 10s
    timeout server 30s
    retries 3
    default-server init-addr last,libc

# autogenerated entries for ACLs


# autogenerated entries for config in backends/frontends

# autogenerated entries for stats




# Frontend: smtpd-loadbalancing (Port 25 Load Balancing)
frontend smtpd-loadbalancing
    bind 192.168.2.2:25 name 192.168.2.2:25
    mode tcp
    default_backend mail.rna.nl.991
    # tuning options
    timeout client 30s

    # logging options

# Frontend: submission-loadbalancing (Port 587 Load Balancing)
frontend submission-loadbalancing
    bind 192.168.2.2:587 name 192.168.2.2:587
    mode tcp
    default_backend mail.rna.nl.991
    # tuning options
    timeout client 30s

    # logging options

# Frontend: imaps-loadbalancing (Port 993 Load Balancing)
frontend imaps-loadbalancing
    bind 192.168.2.2:993 name 192.168.2.2:993
    mode tcp
    default_backend mail.rna.nl.994
    # tuning options
    timeout client 30s

    # logging options

# Backend: mail.rna.nl.991 (postfix haproxy postscreen pool)
backend mail.rna.nl.991
    option log-health-checks
    # health check: port991-health-monitor
    mode tcp
    balance roundrobin

    # tuning options
    timeout connect 30s
    timeout check 10s
    timeout server 30s
    server albus-991 192.168.2.66:991 check inter 300s port 991  send-proxy
    server snape-991 192.168.2.125:991 check inter 300s port 991  send-proxy

# Backend: mail.rna.nl.990 (postfix haproxy submssion pool)
backend mail.rna.nl.990
    option log-health-checks
    # health check: port991-health-monitor
    mode tcp
    balance roundrobin

    # tuning options
    timeout connect 30s
    timeout check 10s
    timeout server 30s
    server albus-990 192.168.2.66:990 check inter 300s port 991  send-proxy
    server snape-990 192.168.2.125:990 check inter 300s port 991  send-proxy

# Backend: mail.rna.nl.994 (postfix haproxy imaps pool)
backend mail.rna.nl.994
    option log-health-checks
    # health check: port991-health-monitor
    mode tcp
    balance roundrobin

    # tuning options
    timeout connect 30s
    timeout check 10s
    timeout server 30s
    server albus-994 192.168.2.66:994 check inter 300s port 991  send-proxy
    server snape-994 192.168.2.125:994 check inter 300s port 991  send-proxy
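Reading the config above, two things stand out. The submission frontend (port 587) has `default_backend mail.rna.nl.991` rather than `mail.rna.nl.990`, and the servers in the 990 and 994 backends are health-checked on port 991. Whether that is intended is not stated in the thread. A minimal sketch for spotting this kind of mismatch automatically follows; the function names are invented for illustration, and the parser only understands the handful of directives used here.

```python
SECTION_KEYWORDS = {"global", "defaults", "frontend", "backend", "listen"}


def iter_directives(text):
    """Yield (section_type, section_name, words) for each directive line."""
    kind, name = None, None
    for raw in text.splitlines():
        words = raw.split("#", 1)[0].split()  # drop comments, then tokenize
        if not words:
            continue
        if words[0] in SECTION_KEYWORDS:
            kind, name = words[0], (words[1] if len(words) > 1 else "")
            continue
        yield kind, name, words


def frontend_routes(text):
    """Map each frontend name to (bind port, default backend)."""
    routes = {}
    for kind, name, words in iter_directives(text):
        if kind != "frontend":
            continue
        port, backend = routes.get(name, (None, None))
        if words[0] == "bind" and len(words) > 1:
            port = words[1].rsplit(":", 1)[-1]
        elif words[0] == "default_backend" and len(words) > 1:
            backend = words[1]
        routes[name] = (port, backend)
    return routes


def check_port_mismatches(text):
    """List (backend/server, traffic port, check port) where the two differ."""
    found = []
    for kind, name, words in iter_directives(text):
        if kind != "backend" or words[0] != "server" or len(words) < 3:
            continue
        traffic_port = words[2].rsplit(":", 1)[-1]
        options = words[3:]
        if "port" in options[:-1]:
            check_port = options[options.index("port") + 1]
            if check_port != traffic_port:
                found.append((f"{name}/{words[1]}", traffic_port, check_port))
    return found
```

Run against the posted config, `check_port_mismatches` would list the four servers in the 990 and 994 backends, since their health checks only prove that port 991 is alive.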

It sounds a bit like an upstream issue with the HAProxy software itself.

But: do you need HAProxy? The business version also has a proxy plugin based on Apache if it fits your use cases:

https://docs.opnsense.org/vendor/deciso/opnwaf.html


Cheers,
Franco

I would start by enabling Detailed Logging and looking in the logs, perhaps )
It would also be interesting to understand the network configuration (are the frontends and backends all in 192.168.2?)


Quote from: franco on January 27, 2023, 09:25:40 AM
It sounds a bit like an upstream issue with the HAProxy software itself.

But: do you need HAProxy? The business version also has a proxy plugin based on Apache if it fits your use cases:

https://docs.opnsense.org/vendor/deciso/opnwaf.html


Cheers,
Franco
I would guess that for load balancing SMTP etc., a WAF would not be the right choice. I would suspect Apache to be HTTP(S)-oriented. But I haven't looked.

Quote from: Fright on January 27, 2023, 03:43:42 PM
I would start by enabling Detailed Logging and looking in the logs, perhaps )
It would also be interesting to understand the network configuration (are the frontends and backends all in 192.168.2?)
What is noticeable in the logs is what is not there. I should see 3 health checks every 5 minutes or so, but this is what I see:
2023-01-27T05:11:12 Notice haproxy Health check for server mail.rna.nl.991/albus-991 failed, reason: Layer4 connection problem, info: "General socket error (Network is down)", check duration: 0ms, status: 0/2 DOWN.
2023-01-26T22:38:19 Notice haproxy Health check for server mail.rna.nl.994/snape-994 succeeded, reason: Layer4 check passed, check duration: 0ms, status: 3/3 UP.
2023-01-26T22:37:29 Notice haproxy Health check for server mail.rna.nl.994/albus-994 succeeded, reason: Layer4 check passed, check duration: 0ms, status: 3/3 UP.
2023-01-26T22:36:39 Notice haproxy Health check for server mail.rna.nl.990/snape-990 succeeded, reason: Layer4 check passed, check duration: 0ms, status: 3/3 UP.
2023-01-26T22:35:49 Notice haproxy Health check for server mail.rna.nl.990/albus-990 succeeded, reason: Layer4 check passed, check duration: 0ms, status: 3/3 UP.
2023-01-26T22:34:59 Notice haproxy Health check for server mail.rna.nl.991/snape-991 succeeded, reason: Layer4 check passed, check duration: 0ms, status: 3/3 UP.
2023-01-26T22:34:09 Notice haproxy Health check for server mail.rna.nl.991/albus-991 succeeded, reason: Layer4 check passed, check duration: 2ms, status: 3/3 UP.
2023-01-25T00:57:00 Notice haproxy Health check for server mail.rna.nl.994/snape-994 succeeded, reason: Layer4 check passed, check duration: 0ms, status: 3/3 UP.
2023-01-25T00:56:10 Notice haproxy Health check for server mail.rna.nl.994/albus-994 succeeded, reason: Layer4 check passed, check duration: 0ms, status: 3/3 UP.
2023-01-25T00:55:20 Notice haproxy Health check for server mail.rna.nl.990/snape-990 succeeded, reason: Layer4 check passed, check duration: 0ms, status: 3/3 UP.
2023-01-25T00:54:30 Notice haproxy Health check for server mail.rna.nl.990/albus-990 succeeded, reason: Layer4 check passed, check duration: 0ms, status: 3/3 UP.
2023-01-25T00:53:40 Notice haproxy Health check for server mail.rna.nl.991/snape-991 succeeded, reason: Layer4 check passed, check duration: 0ms, status: 3/3 UP.
2023-01-25T00:52:50 Notice haproxy Health check for server mail.rna.nl.991/albus-991 succeeded, reason: Layer4 check passed, check duration: 2ms, status: 3/3 UP.

This is weird. For one, I do not see health checks after Jan 25 00:57 but I only became aware of a user not being able to contact port 587 around 20:15 on Jan 26. And I am positive that in the meantime port 587 has worked (I sent mail myself).
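To make the quiet stretches in a log like the one above easier to see, the timestamps can be extracted and compared. This is only a sketch: the regular expression is written against the line format shown in this post, and `parse_checks` and `find_gaps` are invented names. One caveat: as far as I know, `option log-health-checks` logs changes in check status rather than every individual check, so long gaps may be normal and are only a starting point.

```python
import re
from datetime import datetime, timedelta

# Matches lines such as:
# 2023-01-26T22:38:19 Notice haproxy Health check for server <px>/<srv> succeeded, ...
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}) \S+ haproxy "
    r"Health check for server (?P<server>\S+) (?P<result>succeeded|failed)"
)


def parse_checks(lines):
    """Return (timestamp, server, result) tuples, oldest first."""
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            ts = datetime.fromisoformat(m["ts"])
            events.append((ts, m["server"], m["result"]))
    return sorted(events)


def find_gaps(events, max_gap):
    """Return (last_seen, next_seen) pairs separated by more than max_gap."""
    gaps = []
    for earlier, later in zip(events, events[1:]):
        if later[0] - earlier[0] > max_gap:
            gaps.append((earlier[0], later[0]))
    return gaps
```

With `max_gap=timedelta(minutes=10)`, the excerpt above yields two gaps: one from Jan 25 00:57:00 to Jan 26 22:34:09, and one from Jan 26 22:38:19 to Jan 27 05:11:12.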

(The failed check at 05:11 is because I have now programmed my router to reboot every night at 05:10.)


I guess I do not understand the question about the network. What do you need to know other than what is in my original post?

Quote: I only became aware of a user not being able to contact port 587 around 20:15 on Jan 26. And I am positive that in the meantime port 587 has worked (I sent mail myself)
If this is somehow related to syslog-ng communication, then these checks should probably still be visible in the postfix logs, so I would look there.
But I am also talking about raising the logging level in the frontend settings ("option log-separate-errors", "option tcplog": Edit Public Service -> advanced mode -> Raise Log Level and Detailed Logging checkboxes).
Maybe this will help you understand what is going on.
About the network settings: if you are sure that routes cannot be involved, then do not bother )
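Alongside the detailed logging, the stats socket declared in the posted config (`/var/run/haproxy.socket`, level admin) can be asked directly what HAProxy thinks of each server at the moment port 587 goes dark. A rough sketch, assuming the script runs on the firewall with permission to open that socket; the function names are made up, and columns are looked up by header name rather than position because the exact column set varies between HAProxy versions.

```python
import csv
import io
import socket


def query_stats_socket(path: str, command: str = "show stat") -> str:
    """Send one command to the HAProxy stats socket and return the reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.settimeout(5.0)
        sock.connect(path)
        sock.sendall(command.encode() + b"\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # HAProxy closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode()


def server_states(stat_csv: str) -> dict:
    """Map 'backend/server' to (status, check_status) from 'show stat' CSV."""
    # The header line starts with "# ", which is stripped before parsing.
    reader = csv.DictReader(io.StringIO(stat_csv.lstrip("# ")))
    states = {}
    for row in reader:
        if row["svname"] in ("FRONTEND", "BACKEND"):
            continue  # keep only the individual server rows
        key = f'{row["pxname"]}/{row["svname"]}'
        states[key] = (row["status"], row.get("check_status") or "")
    return states
```

For example, `server_states(query_stats_socket("/var/run/haproxy.socket"))` taken once while mail flows and once during an outage would show whether HAProxy has marked the servers down or still believes they are up.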