Elastic server being flooded by thousands of connections from Zenarmor

Started by BillyJoePiano, December 30, 2022, 08:07:50 PM

I recently installed Zenarmor on my home OPNsense router.  I also installed Elasticsearch on my DMZ server, which runs FreeBSD.  Elasticsearch and Kibana are in their own jail, served on the loopback interface only, and they are proxied behind an NGINX server that handles the TLS termination.
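
For anyone following along, a quick sanity check of that setup looks something like this (the hostname below is just a placeholder for my proxied domain, and 9200 is the Elasticsearch default HTTP port; adjust if yours differs):

# Inside the Elasticsearch jail: hit ES directly on the loopback
curl -s http://127.0.0.1:9200/

# From my desktop: go through the NGINX reverse proxy, which terminates TLS
curl -s https://elastic.example.com/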

I have no issue connecting to either of these proxied services (Elasticsearch and Kibana) from my desktop computer.  However, it seems that the router is flooding the server with TCP requests on the Elasticsearch port, on the order of thousands of open TCP sockets at a time:
sockstat | grep <router ip> | wc -l
...run inside the NGINX jail, this is currently showing 1786 connections.  This is crippling NGINX's ability to serve the normal websites it serves to the public internet.
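
To see which local port those connections are actually hitting, something like this works (<router ip> is a placeholder again):

# Group the router's connections by the local address:port they target
sockstat -4 -c | grep '<router ip>' | awk '{print $6}' | sort | uniq -c | sort -rn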

I have tried reconfiguring NGINX to eliminate "keepalive", in case that was the problem, but it seems to have had no impact.
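
For what it's worth, this is roughly how I checked which keepalive-related directives are actually in effect (nginx -T dumps the full parsed configuration):

# List every keepalive_timeout / upstream keepalive directive NGINX sees
nginx -T | grep -in keepalive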

It is not clear what is causing this, because Zenarmor is not indicating any sort of error condition or connection issue.  But maybe it is having these problems and just isn't indicating them in the GUI?

Hi @BillyJoePiano,

This is not expected. On my home firewall, I see around 15-30 active connections.

Are they all in ESTABLISHED state?

Thanks for asking.  I checked netstat in the NGINX jail, and it looks like very few are in ESTABLISHED state.  Most are CLOSE_WAIT or TIME_WAIT.
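
For reference, the breakdown I ran inside the jail was something like this (<router ip> is a placeholder):

# Count the router's connections by TCP state
netstat -an -p tcp | grep '<router ip>' | awk '{print $6}' | sort | uniq -c | sort -rn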

Also, I guess that Zenarmor is indicating a (possibly related) error condition after all.  The reports page shows errors for every report ("An error occurred while report is being loaded"), with the error message "Query timeout expired!"

I am starting to suspect there may be an issue with how the router is accessing the Elastic server, relating to DNS resolution of the Elastic domain and port forwarding... I am going to try reconfiguring that and see if it solves the problem.
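
The checks I'm planning are something like this (hostname and port are placeholders for my actual setup; drill and nc should both be available on OPNsense/FreeBSD):

# From the router: confirm the Elastic domain resolves to the address I expect
drill elastic.example.com

# ...and that the proxy port is reachable from the router
nc -z -w 5 elastic.example.com 443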

Got it. It's ok to have some sockets lingering in TIME_WAIT / CLOSE_WAIT. TIME_WAIT sockets consume minimal resources and are purged by the OS network stack after their timeout; CLOSE_WAIT, on the other hand, just means the local application hasn't closed its end of the socket yet.

This might be the culprit. If there's a network problem between the router & ES server, connections might be stuck in a stale state.

The problem is that there are ~5-10 new connections being created per second.  So even if the old ones time out after a while, these build up pretty quickly into the thousands.

After further investigation, I don't think it is a network issue between the router and the ES server.  Everything seems to be getting there and back fine.  When I look at some TCP streams in Wireshark, there is definitely back-and-forth between the router and the server, so it wouldn't seem to be an issue on OSI layers 1-4.  I'm starting to suspect the problem is in the application layer... but I'm not getting much info from the Elasticsearch logs.
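
For reference, the capture I was looking at in Wireshark came from something like this (the interface name and router IP are placeholders, and 443 assumes the proxy listens on the standard HTTPS port):

# On the DMZ server: capture the router's traffic to the reverse proxy
tcpdump -i em0 -n -s 0 -w /tmp/router-to-proxy.pcap 'host <router ip> and tcp port 443'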

I think I've identified at least part of the problem.

ModSecurity is enabled by default on all of my NGINX server blocks, and it was generating 400 HTTP status codes for nearly all of the POST requests, so they weren't even being forwarded to the loopback listener.  I disabled ModSecurity for the Elastic reverse proxy, and the traffic seems to be flowing more normally now when watching Wireshark on the loopback interface (where I can see everything in plaintext).
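
This is roughly how I spotted the 400s, for anyone hitting the same thing (the log path and the default "combined" log format are assumptions; the status code is field 9 in that format):

# Count requests that got a 400, grouped by method and URI
awk '$9 == 400 {print $6, $7}' /var/log/nginx/access.log | sort | uniq -c | sort -rn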

I'm still getting error messages in the Zenarmor dashboard, though I can now see a few stats there.  I'm not sure what would still be causing this.  I may need to examine the NGINX logs more closely, in case there are other issues with the proxying.

Small update:

Solving the above issue revealed an additional problem with my custom configuration of the NGINX jail, having to do with read/write permissions.  That was relatively easy to fix, and based on the NGINX logs everything seems to be running smoothly now (only 200 status codes).

However, the Zenarmor dashboard is still showing the same error messages described above.

Solved the dashboard report errors... it was another issue related to read/write permissions in the NGINX jail. NGINX needs to create temporary files for proxying and POSTing, and I had to make those directories writable.
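
For the record, the fix looked roughly like this (the paths below are just examples — check nginx -V for the *_temp_path values your build actually uses; www is the default NGINX user on FreeBSD):

# Find where this nginx build puts its temp files (client bodies, proxy buffering, etc.)
nginx -V 2>&1 | tr ' ' '\n' | grep temp-path

# Permission problems show up in the error log like this
grep -i 'permission denied' /var/log/nginx/error.log

# Give the worker user ownership of the temp directories (path is an example)
chown -R www:www /var/tmp/nginx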