Messages - BillyJoePiano

#1
22.7 Legacy Series / Re: Intermittent Kernel Panics
July 03, 2023, 08:27:07 AM
I have another update:

I was looking at more detailed logs from a recent crash, and about 20 hours before the crash a sensei/zenarmor process, 'ipdrstreamer.py' (Python), suddenly started consuming 100% CPU (presumably of one core) and stayed in that high-CPU state for the entire ~20 hours before the crash.  Prior to that it was showing low CPU consumption like the other sensei Python processes... digging into this more.
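
For anyone following along, a simple way to keep an eye on a single process like that is something along these lines (a minimal sketch; the PID and log path are just placeholders, not from my actual setup):

# Sample one process's CPU, runtime, and resident memory every 10 seconds
PID=8723                                  # placeholder PID for ipdrstreamer.py
while true; do
    date "+%Y-%m-%d %H:%M:%S"
    ps -o pid,%cpu,time,rss,command -p "$PID"
    sleep 10
done >> /var/log/ipdrstreamer-cpu.log     # placeholder path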
#2
22.7 Legacy Series / Re: Intermittent Kernel Panics
June 17, 2023, 09:25:32 PM
UPDATE:

I wrote a script to log the outputs of several commands every 2 seconds, cycling over a 2-minute period -- top, ps aux, dmesg, and pftop.

Doing this, I was able to identify a handful of Python processes which started 30 seconds before the logging went dark, presumably due to the kernel panic.  The top command also shows a possible memory leak.

It's interesting to note that there were 15 logs (spanning 30 seconds) in the rotation that were totally empty, meaning the script started writing to the file but there was no output.  Note that the write to each log file is a backgrounded subshell that also echoes a timestamp variable from the outer shell, and even that timestamp wasn't written to those 15 empty logs.
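
For anyone curious, the script is essentially along these lines (a simplified sketch rather than the exact script; the log directory and slot naming are placeholders):

#!/bin/sh
# Rotate through 60 log slots: 2-second interval x 60 slots = 2-minute cycle
LOGDIR=/var/log/panic-debug    # placeholder path
SLOT=0
while true; do
    TIMESTAMP=$(date "+%Y-%m-%d %H:%M:%S")
    # Each write is a backgrounded subshell; the timestamp comes from the outer shell
    (
        echo "=== $TIMESTAMP ==="
        echo "--- top ---";    top -b -d 1
        echo "--- ps aux ---"; ps aux
        echo "--- dmesg ---";  dmesg
        echo "--- pftop ---";  pftop -b -d 1   # batch mode; adjust flags for your pftop version
    ) > "$LOGDIR/slot-$SLOT.log" &
    SLOT=$(( (SLOT + 1) % 60 ))
    sleep 2
done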


New Python processes that showed up 30 seconds before the logging went dark.  These only appear in a single 'ps aux' log:
root                        40568  10.8  0.8   43020  31564  -  S    07:08       0:01.21 /usr/local/bin/python3 /usr/local/opnsense/scripts/filter/update_tables.py (python3.9)
root                        42497   2.9  0.4   27288  16448  -  S    07:08       0:00.33 /usr/local/sensei/py_venv/bin/python3 /usr/local/opnsense/scripts/OPNsense/Sensei/periodicals.py (python3.9)
root                        42708   0.0  0.3   23736  12396  -  S    07:08       0:00.20 /usr/local/bin/python3 /usr/local/sbin/configctl sensei userenrich (python3.9)
root                        42969   0.0  0.5   32280  22052  -  R    07:08       0:00.57 /usr/local/sensei/py_venv/bin/python3 /usr/local/opnsense/scripts/OPNsense/Sensei/userenrich.py (python3.9)
root                        48158   0.0  0.1   43020   3668  -  R    07:08       0:00.00 /sbin/pfctl -t __opt15_network -T show (python3.9),


These Python processes show up consistently in 'ps aux':
root                         8723   1.5  1.8  114752  74968  -  S<   Tue10      57:24.50 /usr/local/sensei/py_venv/bin/python3 /usr/local/sensei//scripts/datastore/ipdrstreamer.py (python3.9)
root                        27669   0.0  0.3   23736  12016  -  S    Tue10       0:37.58 /usr/local/bin/python3 /usr/local/sbin/configctl -e -t 0.5 system event config_changed (python3.9)
root                        30075   0.0  0.3   23972  12100  -  S    Tue10       0:34.03 /usr/local/bin/python3 /usr/local/opnsense/scripts/syslog/lockout_handler (python3.9)
root                        33269   0.0  0.6   36256  23788  -  Is   Wed17       0:02.44 /usr/local/bin/python3 /usr/local/opnsense/service/configd.py (python3.9)
root                        34926   0.0  0.8   66484  33788  -  I    Wed17       1:32.06 /usr/local/bin/python3 /usr/local/opnsense/service/configd.py console (python3.9)
root                        76595   0.0  0.3   23736  12092  -  S    Tue19       0:32.45 /usr/local/bin/python3 /usr/local/sbin/configctl -e -t 0.5 system event config_changed (python3.9)
root                        76891   0.0  0.3   23972  12176  -  S    Tue19       0:29.67 /usr/local/bin/python3 /usr/local/opnsense/scripts/syslog/lockout_handler (python3.9)


Here is the last 'top' output from before the logging cut out.  Take note of the amount of free memory:
last pid: 89561;  load averages:  0.43,  0.43,  0.39  up 2+20:18:50    07:08:31
79 processes:  1 running, 78 sleeping
CPU:  3.3% user,  0.0% nice,  2.5% system,  0.0% interrupt, 94.1% idle
Mem: 106M Active, 2334M Inact, 1054M Wired, 384M Buf, 381M Free

  PID USERNAME    THR PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
8723 root          7  20  -20   112M    73M select   0  57:25   0.68% python3.
8118 root         12  20  -20  2214M   571M nanslp   1  36:16   0.00% eastpect
8160 root         12  20  -20  2129M   482M nanslp   2  19:33   0.00% eastpect
53587 root          1  52  -20    14M  4356K wait     0   9:33   0.00% bash
48656 root          2  20  -20   845M   189M nanslp   0   4:33   0.00% eastpect
28722 root          1  52    0    13M  3008K wait     0   1:54   0.00% sh
34926 root          2  27    0    65M    33M accept   0   1:32   0.00% python3.
69189 unbound       4  20    0   103M    40M kqread   1   1:22   0.00% unbound
92515 root          1  20    0    13M  2348K kqread   2   1:16   0.00% rtsold
93925 root          1  20    0    13M  2428K select   0   1:12   0.00% rtsold
72490 root          1  20    0    12M  2324K select   3   1:06   0.00% radvd
76399 root          3  20    0    40M    12M kqread   1   0:46   0.00% syslog-n
27669 root          1  20    0    23M    12M select   3   0:38   0.00% python3.
30075 root          1  20    0    23M    12M select   3   0:34   0.00% python3.
95965 root          1  20    0    21M  6552K select   3   0:34   0.00% ntpd
76595 root          1  20    0    23M    12M select   3   0:32   0.00% python3.
76891 root          1  20    0    23M    12M select   2   0:30   0.00% python3.
50611 root          1  20    0    13M  2628K bpf      0   0:25   0.00% filterlo


Here is the normal 'top' output:
last pid: 42538;  load averages:  0.36,  0.27,  0.32  up 0+15:43:46    14:21:45
80 processes:  2 running, 78 sleeping
CPU:  2.0% user,  0.0% nice,  2.2% system,  0.0% interrupt, 95.8% idle
Mem: 101M Active, 684M Inact, 962M Wired, 355M Buf, 2083M Free

  PID USERNAME    THR PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
70937 root          6  20  -20    69M    44M select   0   3:36   0.10% python3.9
88652 root          1  52  -20    14M  4376K wait     3   0:46   0.10% bash
69398 root         12  20  -20  2005M   318M nanslp   2   2:43   0.00% eastpect
68511 root         12  20  -20  2001M   287M nanslp   1   1:53   0.00% eastpect
98717 root          2  20    0   103M    53M accept   1   1:30   0.00% python3.9
5306 root          1  52    0    60M    39M piperd   2   0:43   0.00% php
51212 root          2  20  -20   851M   195M nanslp   0   0:40   0.00% eastpect
6804 root          1  52    0    13M  2832K wait     3   0:19   0.00% sh
81801 root          1  20    0    13M  2352K kqread   0   0:16   0.00% rtsold
75363 root          1  20    0    25M    15M select   2   0:14   0.00% python3.9
85012 root          1  20    0    13M  2284K select   2   0:14   0.00% rtsold
31667 root          3  20    0    42M    12M kqread   0   0:13   0.00% syslog-ng
33698 root          1  20    0    12M  2324K select   2   0:13   0.00% radvd
47169 root          1  52    0    56M    37M accept   3   0:11   0.00% php-cgi
49850 root          1  20    0    23M    12M select   1   0:08   0.00% python3.9
52451 root          1  20    0    23M    12M select   0   0:07   0.00% python3.9
21484 root          1  20    0    21M  6552K select   3   0:07   0.00% ntpd
90459 root          1  52    0    54M    35M accept   0   0:07   0.00% php-cgi


My guess is that something in one of the Python processes is causing a memory leak?  I'm not sure how the fan/cooling is related to this, but perhaps the CPU running faster contributes to the problem?
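
To make the trend visible, the memory summary line can be pulled out of each rotated log and watched over time (paths follow the placeholder naming from the script sketch above):

# Print each log's name and its 'Mem:' line, oldest log first
for f in $(ls -tr /var/log/panic-debug/slot-*.log); do
    printf '%s  ' "$f"
    grep -m1 '^Mem:' "$f"
done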
#3
22.7 Legacy Series / Intermittent Kernel Panics
June 17, 2023, 08:44:44 AM
I have a Protectli Vault 4-port running OPNsense 22.7.11_1, along with ZenArmor and mDNS repeater.  ZenArmor uses an Elasticsearch database hosted on my DMZ server, to which it is directly connected (point-to-point).

About 3 or 4 months ago, the router began randomly rebooting, about once a day or so... sometimes it would be fine for several days, but then it could happen multiple times in a day, including in the middle of the night when there was next to no activity.  I never got around to diagnosing the cause, because it wasn't a show-stopper and I had more pressing things to deal with.  When it happened, our internet would be down for about 1-2 minutes, and then it would be back again.

Strangely, the problem went away completely for about a month after I unplugged a small external fan that was blowing air across the Protectli's passive cooling fins.  I should mention that this fan had been cooling it long before the crash/reboot issues started happening.  But sure enough, when I plugged the fan back in, the problem started again.  The fan brings the CPU temps down about 20 degrees F, from the 115-120 F range to about 95-100 F (for those more accustomed to Celsius, that's about 46-49 C vs. 35-38 C), which seems like a good thing.  But for whatever reason, the router randomly crashes and reboots as a result.

I know this makes it easy to blame the hardware.  But here's the thing... now that I'm digging into the problem, I'm finding that these reboots are consistently due to a kernel panic in the Python 3.9 process, possibly related to ZenArmor.  I am still collecting more data, but here are three kernel panic messages from three separate crashes/reboots, pulled from the kernel message buffer with the dmesg command.
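
For anyone wanting to grab the same info, the relevant chunk can be pulled out after each reboot roughly like this (a sketch; the grep context window may need adjusting, and /var/crash is only populated if kernel crash dumps are configured):

# Grab the panic message and backtrace from the kernel message buffer
dmesg -a | grep -A 40 'Fatal trap'

# If crash dumps are enabled, a core and summary files may also land here
ls -l /var/crash/

The three panic messages follow.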

Fatal trap 9: general protection fault while in kernel mode
cpuid = 3; apic id = 06
instruction pointer = 0x20:0xffffffff8114c9a8
stack pointer         = 0x28:0xfffffe00a022cc20
frame pointer         = 0x28:0xfffffe00a022cd60
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 48339 (python3.9)
trap number = 9
panic: general protection fault
cpuid = 3
time = 1686917344
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00a022ca40
vpanic() at vpanic+0x17f/frame 0xfffffe00a022ca90
panic() at panic+0x43/frame 0xfffffe00a022caf0
trap_fatal() at trap_fatal+0x385/frame 0xfffffe00a022cb50
calltrap() at calltrap+0x8/frame 0xfffffe00a022cb50
--- trap 0x9, rip = 0xffffffff8114c9a8, rsp = 0xfffffe00a022cc20, rbp = 0xfffffe00a022cd60 ---
pmap_remove_pages() at pmap_remove_pages+0x4d8/frame 0xfffffe00a022cd60
vmspace_exit() at vmspace_exit+0x7f/frame 0xfffffe00a022cd90
exit1() at exit1+0x57f/frame 0xfffffe00a022cdf0
sys_sys_exit() at sys_sys_exit+0xd/frame 0xfffffe00a022ce00
amd64_syscall() at amd64_syscall+0x10c/frame 0xfffffe00a022cf30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe00a022cf30
--- syscall (1, FreeBSD ELF64, sys_sys_exit), rip = 0x80078c0da, rsp = 0x7fffffffeb18, rbp = 0x7fffffffeb30 ---
KDB: enter: panic


Fatal trap 12: page fault while in kernel mode
cpuid = 2; apic id = 04
fault virtual address = 0x141172067
fault code = supervisor write data, page not present
instruction pointer = 0x20:0xffffffff8114d864
stack pointer         = 0x0:0xfffffe00a02c5bb0
frame pointer         = 0x0:0xfffffe00a02c5be0
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 34353 (python3.9)
trap number = 12
panic: page fault
cpuid = 1
time = 1686966901
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00a02c5970
vpanic() at vpanic+0x17f/frame 0xfffffe00a02c59c0
panic() at panic+0x43/frame 0xfffffe00a02c5a20
trap_fatal() at trap_fatal+0x385/frame 0xfffffe00a02c5a80
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00a02c5ae0
calltrap() at calltrap+0x8/frame 0xfffffe00a02c5ae0
--- trap 0xc, rip = 0xffffffff8114d864, rsp = 0xfffffe00a02c5bb0, rbp = 0xfffffe00a02c5be0 ---
pmap_is_prefaultable() at pmap_is_prefaultable+0x164/frame 0xfffffe00a02c5be0
vm_fault_prefault() at vm_fault_prefault+0x112/frame 0xfffffe00a02c5c50
vm_fault() at vm_fault+0x120c/frame 0xfffffe00a02c5d70
vm_fault_trap() at vm_fault_trap+0x6d/frame 0xfffffe00a02c5dc0
trap_pfault() at trap_pfault+0x1f3/frame 0xfffffe00a02c5e20
trap() at trap+0x40a/frame 0xfffffe00a02c5f30
calltrap() at calltrap+0x8/frame 0xfffffe00a02c5f30
--- trap 0xc, rip = 0x8003cbf7d, rsp = 0x7fffffffd3f0, rbp = 0x7fffffffd410 ---
KDB: enter: panic


Fatal trap 9: general protection fault while in kernel mode
cpuid = 0; apic id = 00
instruction pointer = 0x20:0xffffffff811491ac
stack pointer         = 0x0:0xfffffe00b5a92b90
frame pointer         = 0x0:0xfffffe00b5a92b90
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 31806 (python3.9)
trap number = 9
panic: general protection fault
cpuid = 1
time = 1686973021
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00b5a929b0
vpanic() at vpanic+0x17f/frame 0xfffffe00b5a92a00
panic() at panic+0x43/frame 0xfffffe00b5a92a60
trap_fatal() at trap_fatal+0x385/frame 0xfffffe00b5a92ac0
calltrap() at calltrap+0x8/frame 0xfffffe00b5a92ac0
--- trap 0x9, rip = 0xffffffff811491ac, rsp = 0xfffffe00b5a92b90, rbp = 0xfffffe00b5a92b90 ---
pmap_pvh_remove() at pmap_pvh_remove+0x4c/frame 0xfffffe00b5a92b90
pmap_enter() at pmap_enter+0xd59/frame 0xfffffe00b5a92c50
vm_fault() at vm_fault+0x1016/frame 0xfffffe00b5a92d70
vm_fault_trap() at vm_fault_trap+0x6d/frame 0xfffffe00b5a92dc0
trap_pfault() at trap_pfault+0x1f3/frame 0xfffffe00b5a92e20
trap() at trap+0x40a/frame 0xfffffe00b5a92f30
calltrap() at calltrap+0x8/frame 0xfffffe00b5a92f30
--- trap 0xc, rip = 0x80049fb58, rsp = 0x7fffffffd2f0, rbp = 0x7fffffffd450 ---
KDB: enter: panic
#4
Quote from: jlab on January 14, 2023, 04:45:27 PM
OP,

Can you explain what you are actually trying to do and achieve ?

Why not turn on Client isolation and leave it at that ? Why do you have so many SSID's too ?

I can't just turn on blanket client isolation because I need hosts on some of the Wifi networks to be able to communicate with other hosts on the same network.

My goal is selective host isolation on certain SSID networks.  Specifically, I have an SSID for IoT devices which I treat as a "high risk" network, and I would like absolute host isolation between all IoT devices on that network.  Additionally, there are some other (non-IoT) devices which occasionally log onto that network for debugging and admin purposes (e.g. my desktop workstation), and I need THAT device to have access to the other devices on the IoT network.

In short, I would like the firewall rules to dictate which hosts are "isolated" and which can communicate with others on their own subnet.  The solution would obviously involve turning on blanket isolation at the AP, but then having the router determine which communications are allowed and re-forward the allowed traffic (albeit back out the same interface, but with a different layer-2 destination address).

While thinking through this problem, I realized that I could achieve something very similar to what I proposed previously by putting each host on its own point-to-point subnet with the router using virtual IPs, and using the firewall rules there to determine what is allowed.  It would just be normal layer-3 routing in that case.  The issue in that scenario is that there needs to be a separate point-to-point network for each host (probably a /30 to allow for a broadcast address, meaning each host takes up 4 IP addresses...).  If all the hosts are known in advance this is very doable, but if I want to dynamically allocate host addresses where they can still communicate with each other, I run into a problem.
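
To make the address math concrete, here's a rough sketch of how /30 point-to-point blocks would tile a subnet (the 192.168.60.0/24 range is just an illustration, not my actual addressing):

#!/bin/sh
# Carve a /24 into /30 blocks: in each block the router's virtual IP takes the
# first host address and the client gets the second
i=0
while [ $i -lt 256 ]; do
    echo "192.168.60.$i/30  router=.$((i + 1))  client=.$((i + 2))  broadcast=.$((i + 3))"
    i=$((i + 4))
done
# 256 / 4 = 64 blocks, so at most 64 isolated hosts fit in a single /24 this way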
#5
I recently obtained a TP-Link TL-WA1201 Wifi Access point.  It is connected to one of the 'opt' ports on my Protectli Vault with OPNsense installed.  The Wifi AP is configured with 4 different SSIDs, each on a different VLAN, corresponding to a VLAN virtual interface on the OPNsense router.  DHCP service is disabled on the AP, and handled by the router.  Each VLAN is a separate subnet.

I'm trying to achieve selective host isolation depending on the subnet and individual hosts, which would be dictated by firewall rules.  The Access Point itself is all-or-nothing with the isolation setting... I can't even enable it on individual SSIDs (which is really my goal here...).  If I do enable isolation, then there is no way for any client on any SSID to access another wifi client even on the same SSID/VLAN... or at least via the AP itself.

HOWEVER, I'm wondering if it is possible for the router to take over this task and act as a (sort-of?) "layer-2 router/switch" by responding to the intra-subnet ARP requests with its own MAC address.  I did try using ARP proxy in the Virtual IP settings, but this was causing problems with DHCP and conflicting MAC addresses.  The clients were sending DHCP refusal packets as soon as they saw the ARP conflict.  I'd need the router to abstain from sending these ARP packets back to the client that it is spoofing, and only respond with an ARP to a client looking for another client on the same subnet.  In other words, if 192.168.1.5 is looking for 192.168.1.6, it should spoof 192.168.1.6 with its own MAC to 192.168.1.5, but NOT send this ARP to 192.168.1.6 (or via a layer-2/ARP broadcast), because that would confuse the latter client.

Once the router has all the clients pointing to itself for their intra-subnet traffic, the router would then be responsible for determining whether a packet is allowed (or not) based on its firewall rules, and if allowed, retransmitting it using the actual MAC/IP combo of the destination host.

I realize it's also possible this behavior would confuse the Access Point itself, since it is supposed to be the layer-2 switch for all the wifi clients.  But shouldn't it be irrelevant since it just ignores the layer-3 address, and only "routes" based on layer 2?  I'm not sure??

Perhaps what I'm suggesting is impossible to achieve with the devices I'm working with.  Any thoughts or suggestions are appreciated!
#6
Solved the dashboard report errors... it was another issue related to read/write permissions in the NGINX jail.  NGINX needed to create temporary files for proxying and POST requests, so I had to make those directories writable.
#7
Small update:

After solving the above issue, it became apparent that there was an additional issue with my custom-made configuration of the NGINX jail, also having to do with read/write permissions.  That was relatively easy to fix, and based on the NGINX logs it would seem everything is running smoothly now (only 200 status codes).

However, the Zenarmor dashboard is still showing the same error messages screenshotted above.
#8
I think I've identified at least part of the problem.

ModSecurity is enabled by default on all of my NGINX server blocks, and this was generating 400 HTTP status codes for nearly all of the POST requests, so they weren't even being forwarded to the loopback listener.  I disabled ModSecurity for the Elastic reverse proxy, and the traffic seems to be flowing more normally now when watching Wireshark on the loopback interface (where I can see everything in plaintext).
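
For reference, a quick way to see the 400s stand out in the access log (field positions assume NGINX's default combined log format, and the log path is just a typical location, not necessarily mine):

# Count POST requests by HTTP status code
awk '$6 == "\"POST" { print $9 }' /var/log/nginx/access.log | sort | uniq -c | sort -rn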

I'm still getting error messages in the Zenarmor dashboard, though I can now see a few stats there.  I'm not sure what would still be causing this.  I may need to examine the NGINX logs more closely, in case there are other issues with the proxying.
#9
The problem is that there are ~5-10 new connections being created per second.  So even if the old ones time out after a while, these build up pretty quickly into the thousands.

After further investigation, I don't think it is a network issue between the router and the ES server.  Everything seems to be getting there and back fine.  When I look at some TCP streams in Wireshark, there is definitely back-and-forth between the router and the server, so it wouldn't seem to be an issue on OSI layers 1-4.  I'm starting to suspect the problem is in the application layer... but I'm not getting much info from the Elasticsearch logs.
#10
Thanks for asking.  I checked netstat in the NGINX jail, and it looks like very few connections are in the ESTABLISHED state.  Most are CLOSE_WAIT or TIME_WAIT.
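
For anyone chasing something similar, a quick way to tally them inside the jail (9200 is just the default Elasticsearch port as an example; substitute whatever port the proxy actually listens on):

# Count TCP connections on the Elasticsearch port, grouped by state
netstat -an -p tcp | grep '\.9200' | awk '{print $6}' | sort | uniq -c | sort -rn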

Also, it looks like ZenArmor is indicating a (possibly related) error condition.  The reports page shows all errors ("An error occurred while report is being loaded"), with the error message "Query timeout expired!"

I am starting to suspect there may be an issue with how the router is accessing the elastic server, relating to DNS of the elastic domain, and port forwarding... I am going to try reconfiguring that and see if it solves the problem.
#11
I recently installed Zenarmor on my home OPNsense router.  I also installed Elasticsearch in my DMZ server which is running FreeBSD.  Elasticsearch and Kibana are in their own jail, served on the loopback interface only, and they are proxied behind an NGINX server that handles the TLS/SSL.

I have no issue connecting to either of these proxied services (Elasticsearch and Kibana) from my desktop computer.  However, it seems that the router is flooding the server with TCP requests on the Elasticsearch port, on the order of thousands of open TCP sockets at a time:
sockstat | grep <router ip> | wc -l
...run inside the NGINX jail is currently showing 1786 connections.  This is crippling NGINX's ability to serve the normal websites it hosts for the public internet.

I have tried reconfiguring NGINX to eliminate "keepalive", in case that was the problem, but it seems to have no impact.

It is not clear what is causing this, because Zenarmor is not indicating any sort of error condition or connection issues.  But maybe it is having these problems and just not showing them in the GUI?
#12
22.7 Legacy Series / Re: IPFW not listed in service
October 03, 2022, 03:21:41 PM
I think ipfw is a kernel module in FreeBSD.  You might want to confirm that it is loaded with kldstat.  You may also want to double-check that everything described at this link is in place: https://docs.freebsd.org/doc/6.1-RELEASE/usr/share/doc/handbook/firewalls-ipfw.html
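
Something along these lines should confirm the module is present (a quick sketch):

# Check whether the ipfw kernel module is loaded, and load it if it isn't
kldstat | grep ipfw || kldload ipfw

# Once loaded, the firewall's enable state can be checked via sysctl
sysctl net.inet.ip.fw.enable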
#13
22.7 Legacy Series / Re: Unable to Ping VLAN Gateway
October 03, 2022, 08:22:58 AM
I know when I've had VLAN issues it has been down to the configuration of the switch, especially with regard to which VLAN untagged packets are assigned to on each interface.  If you forget to change that setting when you change the VLAN for an interface, you might have issues.

Not sure whether that is helpful or not...
#14
I am having difficulty getting outbound NAT working with a static IP address configuration.  It worked fine using the default automatic configuration when the WAN had a dynamic DHCP assignment (behind another router facing the public internet), but as soon as I made OPNsense the public-facing router with a static IP assignment, the NAT stopped working.

I am watching a live packet capture of the WAN/outside interface (in Wireshark, via an SSH pipe), and it seems OPNsense is trying to route the inside packets to the ISP using their internal addresses rather than NATing the source.  However, the router itself is still sending a little traffic using the correct address of the WAN/outside interface, mostly DNS and NTP.
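
In case it helps, the usual way to do a live capture like this is something along these lines (the router address and interface name are just examples, not my actual values):

# Stream a capture of the router's WAN interface into a local Wireshark instance
ssh root@192.168.1.1 'tcpdump -i igb0 -U -s 0 -w -' | wireshark -k -i -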

I'm including some screenshots of the current configuration, attached.  I did try manually setting the IPv4 Upstream Gateway in the WAN Interface configuration, but this didn't have any effect.

#15
Thanks for your help on this.

Perhaps I'm misunderstanding or didn't explain the situation correctly.  Are you saying that the meanings of "in" and "out" are reversed in OPNsense, from what I'm used to in other contexts?

For example, if my LAN client computer makes a web request, I think of this as being an "in" at the LAN interface, and an "out" at the WAN/outside interface.

To be clear -- I do have an "allow in and out" rule for the LAN interface (again... it seems it needed to be in the floating rules), but the one I'm concerned about is on the WAN interface, where I would need to "allow in", which is like opening the door wide open, when I only want statefully established responses allowed in.