OPNsense Forum

Archive => 18.1 Legacy Series => Topic started by: Webxorcist on May 02, 2018, 12:55:04 pm

Title: Timeout Server stroke with HAPROXY
Post by: Webxorcist on May 02, 2018, 12:55:04 pm
Hi all,

I am running HAPROXY 2.6 on OPNsense 18.1.6-amd64. I have several web servers on an internal network that are reached through HAPROXY. This used to work fine, but for several weeks now I have been getting timeouts on .js and/or .css files. The files on which the timeouts occur keep changing. I also don't know exactly when the problems started, because it is not happening all the time.

Now when I go to a website hosted on my web servers, it sometimes takes up to 30-35 seconds for the site to load. With the browser dev tools or websites like pingdom.com you can see a connection error on these .js and/or .css files (again, it keeps changing).

In the HAPROXY log file the requests for these files are marked with the termination state sH. According to the documentation:

sH     The "timeout server" stroke before the server could return its
          response headers. This is the most common anomaly, indicating too
          long transactions, probably caused by server or database saturation.
          The immediate workaround consists in increasing the "timeout server"
          setting, but it is important to keep in mind that the user experience
          will suffer from these long response times. The only long term
          solution is to fix the application.
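
To see how often sH actually shows up, and on which requests, the log can be tallied by termination state. Below is a minimal Python sketch, assuming the log uses HAProxy's default HTTP format ("option httplog"); a custom log-format would need a different pattern. The names LINE_RE, SAMPLE and tally_states are made up for this example, and the two sample lines are invented for illustration, not taken from the real log.

```python
import re
from collections import Counter

# Matches an "option httplog" line from the timers onward:
#   timers status bytes req_cookie resp_cookie termination_state ... "request"
# Assumption: default HTTP log format, not a custom log-format.
LINE_RE = re.compile(
    r' (?P<timers>[-+]?\d+(?:/[-+]?\d+){4}) '  # five timers, e.g. 0/0/1/-1/30002
    r'(?P<status>-?\d+) '                      # HTTP status code
    r'\+?\d+ '                                 # bytes read
    r'\S+ \S+ '                                # captured request/response cookies
    r'(?P<state>\S{4}) '                       # termination state, e.g. sH--
    r'.*"(?P<request>[^"]*)"\s*$'              # quoted request line at the end
)

# Invented sample lines, for illustration only.
SAMPLE = [
    'haproxy[1234]: 203.0.113.7:51234 [02/May/2018:12:40:01.123] www~ web/web1 '
    '0/0/1/-1/30002 504 194 - - sH-- 3/3/1/1/0 0/0 '
    '"GET /wp-includes/js/jquery/jquery.js HTTP/1.1"',
    'haproxy[1234]: 203.0.113.7:51236 [02/May/2018:12:40:01.130] www~ web/web1 '
    '0/0/1/12/14 200 5120 - - ---- 3/3/0/0/0 0/0 '
    '"GET /style.css HTTP/1.1"',
]


def tally_states(lines):
    """Count the first two termination-state characters and collect sH requests."""
    counts = Counter()
    timed_out = []
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # not an HTTP log line in the assumed format
        state = match.group("state")
        counts[state[:2]] += 1
        if state.startswith("sH"):
            # The last timer is the total session time in milliseconds.
            total_ms = int(match.group("timers").split("/")[-1])
            timed_out.append((match.group("request"), total_ms))
    return counts, timed_out


counts, timed_out = tally_states(SAMPLE)
```

Feeding the real log lines to tally_states instead of SAMPLE would show whether the sH entries cluster on particular files or backends, and whether their total time sits right at the configured "timeout server" value.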


When I access the websites from the internal network I can't reproduce the problem. From the OPNsense console I can also curl the failed files over and over again without the problem occurring. Yet when I come in from the internet, from any kind of machine (Windows, Linux, iOS; Firefox, Chrome, Safari, Edge/IE), the problem easily occurs in 1 out of 3 tries.
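
The "1 out of 3 tries" observation can be turned into numbers by fetching one of the affected files repeatedly from an outside machine and timing each attempt. This is a rough Python sketch using only the standard library; the function names, the 5 second threshold and the URL in the note below are placeholders chosen for the example.

```python
import time
import urllib.error
import urllib.request


def fetch_once(url, timeout=40.0):
    """Return (seconds, outcome); outcome is an HTTP status or an error class name."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()
            outcome = resp.status
    except urllib.error.HTTPError as exc:
        outcome = exc.code  # e.g. 504 when the proxy gives up on the server
    except OSError as exc:
        outcome = type(exc).__name__  # connection errors and timeouts
    return time.monotonic() - start, outcome


def summarize(samples, slow_after=5.0):
    """Count attempts that were slow or did not return HTTP 200."""
    bad = [s for s in samples if s[0] >= slow_after or s[1] != 200]
    worst = max((s[0] for s in samples), default=0.0)
    return {"total": len(samples), "bad": len(bad), "worst_seconds": worst}


def run(url, attempts=10):
    """Fetch the same URL several times and summarize the results."""
    return summarize([fetch_once(url) for _ in range(attempts)])
```

Calling run("https://example.com/wp-includes/js/jquery/jquery.js") (a placeholder URL) once from the internet and once from the internal network gives two summaries that can be compared directly.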

I don't fully understand how to read the documentation above. Saturation in the database seems unlikely since the files don't need database access and aren't called from a database entry. The webserver seems fine and the problem doesn't occur accessing the sites internally. The long term solution is to fix the application? What application?

Also, sometimes when this problem occurs (again, not all of the time), the haproxy service on the OPNsense machine uses 100% CPU for as long as the site takes to load.

These are the only CPU spikes on a machine that has nothing to do all day. The sites have a very low visitor rate.

Hardware:
lscpu output:

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               94
Model name:          Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
Stepping:            3
CPU MHz:             3033.631
CPU max MHz:         4000,0000
CPU min MHz:         800,0000
BogoMIPS:            6816.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            8192K
NUMA node0 CPU(s):   0-7
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves ibpb ibrs stibp dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp

RAM: 32GB
Hard Drive: 2 x 4TB 6Gb/s 7200RPM


OPNsense:
Virtual
1 CPU (Skylake)
2 GiB RAM
CPU utilization: stable between 10-15%
Memory usage: stable at 17%
Disk usage: stable at 7%
Plugins: haproxy, acme-client (letsencrypt)
Number of websites configured in HAPROXY: 12

Web servers (3):
Virtual
OpenSUSE Tumbleweed
Apache 2.4
1 CPU
1 - 4 GiB RAM
CPU utilization: lower than 1%

Database server (1):
Virtual
OpenSUSE Tumbleweed
MariaDB
1 CPU
2 GiB RAM
CPU utilization: lower than 1%

Websites software:
Wordpress
iTop
Moodle
ownCloud (with (usually) 2 clients that poll the server every few seconds over HTTPS)
Plain HTML

The problem occurs on all websites except the plain HTML which has no js or css files.

So far I was just using this for personal servers, but I am planning on renting server space once I have everything fully automated on the back-end. And then this problem came along.

I have no idea where to look now for a solution. I could add another CPU to OPNsense but nothing really indicates this is the problem.

When I shut down all servers except one web server, the database server and OPNsense, the problem remains. So definitely no overcommitting.

I'd like to understand the problem before I add more hardware servers.

Any ideas anyone?
Title: Re: Timeout Server stroke with HAPROXY
Post by: Webxorcist on May 03, 2018, 01:45:36 pm
 == deprecated ==
Title: Re: Timeout Server stroke with HAPROXY
Post by: Webxorcist on May 04, 2018, 09:34:37 am
I found the problem. The e1000 driver on the LAN side has I/O errors. I guess I can't fix it, since I read the FreeBSD Virtio driver has a lot of problems.
Title: Re: Timeout Server stroke with HAPROXY
Post by: Webxorcist on May 04, 2018, 10:54:18 am
:( Changed all NICs to virtio and the I/O errors are gone. Now packets are being dropped and the main problem remains.
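
For comparing error and drop counters before and after a driver change like this, the per-interface numbers can be pulled out of "netstat -i" on the OPNsense (FreeBSD) console. A small Python sketch follows, assuming the usual FreeBSD column layout ending in Ipkts, Ierrs, Idrop, Opkts, Oerrs, Coll; the sample output, interface names and counter values are invented for illustration.

```python
# Invented sample of FreeBSD "netstat -i" output, for illustration only.
SAMPLE_NETSTAT = """\
Name    Mtu Network       Address              Ipkts Ierrs Idrop    Opkts Oerrs  Coll
em0    1500 <Link#1>      00:00:5e:00:53:01   120345   412     0    98012     0     0
em0       - 192.0.2.0/24  192.0.2.1           118000     -     -    97000     -     -
vtnet1 1500 <Link#2>      00:00:5e:00:53:02    50312     0    37    61200     0     0
"""


def interface_errors(netstat_output):
    """Map interface name -> (Ierrs, Idrop, Oerrs), using the <Link#N> rows only."""
    result = {}
    for line in netstat_output.splitlines():
        fields = line.split()
        # Link-layer rows carry the hardware counters; address rows show "-".
        if len(fields) < 9 or not fields[2].startswith("<Link#"):
            continue
        # Read from the right so a missing Address column does not shift fields.
        _ipkts, ierrs, idrop, _opkts, oerrs, _coll = fields[-6:]
        try:
            result[fields[0]] = (int(ierrs), int(idrop), int(oerrs))
        except ValueError:
            continue  # counters not numeric on this row
    return result


errors = interface_errors(SAMPLE_NETSTAT)
```

Running this on two snapshots taken a few minutes apart shows whether the input errors or drops are still growing on a given interface.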
Title: Re: Timeout Server stroke with HAPROXY
Post by: Webxorcist on May 04, 2018, 04:47:27 pm
I am totally lost. The problem is gone!

While I was testing stuff, I desperately accepted today's update to 18.1.7.
The problem remained... But I noticed my pull-down menus were empty in HAPROXY. In this thread: https://forum.opnsense.org/index.php?topic=8603.0 I read that there seemed to be a problem with this release and that we should revert to 18.1.6.

As soon as I did this, the problem went away.

So YAAAY

But also, I now have some trust issues with OPNsense.
Are settings also kept in a database? Not only in a conf file on disk?

Why would reverting solve my problem, when the problem also existed while I was running the very version I reverted to?

Anyone?
Title: Re: Timeout Server stroke with HAPROXY
Post by: loredo on May 04, 2018, 05:10:33 pm
Sounds like you forgot to disable all hardware-related acceleration on a virtual machine.
You should follow these instructions:

https://docs.opnsense.org/manual/virtuals.html
Title: Re: Timeout Server stroke with HAPROXY
Post by: Webxorcist on May 04, 2018, 06:24:41 pm
Quote from: loredo on May 04, 2018, 05:10:33 pm
Sounds like you forgot to disable all hardware-related acceleration on a virtual machine.
You should follow these instructions:

https://docs.opnsense.org/manual/virtuals.html

Hi,

thank you for replying. I didn't miss those settings actually ;-)

It also wouldn't explain why the problem was gone after today's rollback to 18.1.6.