Nginx Reverse Proxy doesn't detect upstream hosts as down

Started by m2e, February 21, 2024, 09:06:46 AM

In reference to https://stackoverflow.com/questions/77522129/nginx-does-not-detect-an-upstream-server-as-down

I have an upstream (generated by the OPNsense Nginx plugin):

upstream upstream9dbd5491033b477e84564ebe3e516c0b {
        server aa.bb.cc.d1:443 weight=1 max_conns=10000 max_fails=3 fail_timeout=10;
        server aa.bb.cc.d2:443 weight=1 max_conns=10000 max_fails=3 fail_timeout=10;
        server aa.bb.cc.d3:443 weight=1 max_conns=10000 max_fails=3 fail_timeout=10;
}


and host aa.bb.cc.d3 is down. But Nginx does not detect the host as down unless I add the down flag to it.
See screenshot below. The red line shows a host that is shut down (powered off) but that Nginx still treats as up.

I expect Nginx to not forward any requests to the server anymore. But unfortunately, it still does (there is a significant performance change when I "down" the server manually).

Also, the statistics view in OPNsense says that server aa.bb.cc.d3 is up.

The documentation [1] is quite clear, except for the following points:

Quote: What is considered an unsuccessful attempt is defined by the proxy_next_upstream, fastcgi_next_upstream, uwsgi_next_upstream, scgi_next_upstream, memcached_next_upstream, and grpc_next_upstream directives.

Well, I have not set proxy_next_upstream [2], so its default applies, which is error timeout. The error case is described as:

Quote: an error occurred while establishing a connection with the server, passing a request to it, or reading the response header

But the default of proxy_next_upstream_timeout is 0:

Quote: Limits the time during which a request can be passed to the next server. The 0 value turns off this limitation.

Do these default values disable that feature completely, or what else could be the reason that Nginx still treats a server as up when it is not reachable at all?
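
For reference, here is a sketch of the defaults in question written out explicitly, based on the documented defaults of the proxy module. The `location /` block is only an assumed placement and the plugin-generated config may look different. Going by the quoted documentation, a `proxy_next_upstream_timeout` of 0 only removes the time limit on passing a request to the next server, so it should not by itself switch off failure counting. The default `proxy_connect_timeout` of 60 seconds is included because a connection attempt to a powered-off host may simply hang until that timeout instead of failing quickly.

```nginx
location / {
    # Assumed placement; the upstream name is the one generated above.
    proxy_pass https://upstream9dbd5491033b477e84564ebe3e516c0b;

    # Documented defaults, written out explicitly:
    proxy_next_upstream error timeout;  # what counts as an unsuccessful attempt
    proxy_next_upstream_tries 0;        # 0 = no limit on the number of tries
    proxy_next_upstream_timeout 0;      # 0 = no time limit for passing to the next server
    proxy_connect_timeout 60s;          # how long a connect attempt may take before it fails
}
```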

References:

[1] https://nginx.org/en/docs/http/ngx_http_upstream_module.html
[2] https://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_next_upstream

It looks like the feature to "mark a host down automatically after n retries" is not a basic feature and may only be available in the commercial health check module: https://nginx.org/en/docs/http/ngx_http_upstream_hc_module.html

The only way to get this feature to work is to reduce `max_fails` and `fail_timeout` and let `proxy_next_upstream` do the job.
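
A minimal sketch of that approach, with illustrative values that are assumptions and not tested recommendations. The `location /` block and the 2 second `proxy_connect_timeout` are hypothetical additions; the upstream name and addresses are the ones from the first post. The fragment belongs inside the `http` and `server` contexts of the generated config.

```nginx
upstream upstream9dbd5491033b477e84564ebe3e516c0b {
    # Reduced values (assumed): a single failed attempt within 5 seconds
    # makes nginx skip the server for the next 5 seconds.
    server aa.bb.cc.d1:443 weight=1 max_conns=10000 max_fails=1 fail_timeout=5;
    server aa.bb.cc.d2:443 weight=1 max_conns=10000 max_fails=1 fail_timeout=5;
    server aa.bb.cc.d3:443 weight=1 max_conns=10000 max_fails=1 fail_timeout=5;
}

location / {
    proxy_pass https://upstream9dbd5491033b477e84564ebe3e516c0b;

    # Retry the request on another server after a connection error or timeout.
    proxy_next_upstream error timeout;

    # Assumed value: give up on an unreachable host after 2 seconds
    # instead of the default 60 seconds.
    proxy_connect_timeout 2s;
}
```

With a short connect timeout, a request that lands on the dead host should be handed to the next server quickly, and that failed attempt counts toward `max_fails`.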

Quote: The only way to get this feature to work is to reduce `max_fails` and `fail_timeout` and let `proxy_next_upstream` do the job.
Quote: max_fails=3 fail_timeout=10;
Hm. What if you just increase the `fail_timeout` value in this case?
Say `max_fails=1 fail_timeout=60;`
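
Applied to the upstream from the first post, that suggestion would look roughly like the sketch below. Only `max_fails` and `fail_timeout` are changed; whether the OPNsense plugin exposes both fields in its GUI is not confirmed here.

```nginx
upstream upstream9dbd5491033b477e84564ebe3e516c0b {
    # One failed attempt within 60 seconds marks a server unavailable
    # for the following 60 seconds.
    server aa.bb.cc.d1:443 weight=1 max_conns=10000 max_fails=1 fail_timeout=60;
    server aa.bb.cc.d2:443 weight=1 max_conns=10000 max_fails=1 fail_timeout=60;
    server aa.bb.cc.d3:443 weight=1 max_conns=10000 max_fails=1 fail_timeout=60;
}
```

The trade-off is that a server that recovers early still stays out of rotation until its `fail_timeout` period has passed, while a dead server is retried with a live request less often.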