Messages - zzyzx

#1
24.1, 24.4 Legacy Series / Custom collectd
March 18, 2024, 05:34:56 PM
Is /usr/local/etc/collectd.conf the correct place to add custom collectd functionality?

I've added a script to collect disk and CPU temperature information, but it is no longer working. It stopped working under 23.7, so it's not related to the latest update. My config addition:

<Plugin exec>
    Exec "daemon:daemon" "/home/opnuser/collectd/temps.sh"
</Plugin>


The script's output format should be fine, since it worked at one point and hasn't changed since. However, I can't tell whether collectd is running the script and the data isn't being collected, or whether the script simply isn't running. All other default collectd metrics are collected into InfluxDB fine, so it seems the script isn't running.
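For comparison, here is a minimal sketch of what an exec-plugin script needs to emit. The sysctl name and the type instance are assumptions for this hardware; collectd only cares that stdout carries PUTVAL lines:

```shell
#!/bin/sh
# Hypothetical one-shot version of a temps script for collectd's exec
# plugin. collectd sets COLLECTD_HOSTNAME and COLLECTD_INTERVAL in the
# environment; the real script would loop with "sleep $INTERVAL".
HOST="${COLLECTD_HOSTNAME:-localhost}"
INTERVAL="${COLLECTD_INTERVAL:-10}"

emit_cpu_temp() {
  # FreeBSD coretemp prints e.g. "45.0C"; strip the unit. If the sysctl
  # is missing (wrong name, module not loaded), report "U" (undefined).
  temp=$(sysctl -n dev.cpu.0.temperature 2>/dev/null | tr -d 'C')
  echo "PUTVAL ${HOST}/exec-temps/temperature-cpu0 interval=${INTERVAL} N:${temp:-U}"
}

emit_cpu_temp
```

If InfluxDB shows no exec-* series at all, running the script by hand as the daemon user (something like `su -m daemon -c /home/opnuser/collectd/temps.sh`) quickly separates a permissions problem from collectd not launching it.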
#2
I've resolved my initial wireguard problems. Adding my experience to what seems like a variety of issues. Without any hard evidence, mine seems to have been related to old config information that I was able to clear out.

I upgraded to 24.1_3 from 23.7 and immediately experienced wireguard problems. No connections worked, no handshake. My wireguard logs showed this entry whenever I restarted the service:
/usr/local/opnsense/scripts/Wireguard/wg-service-control.php: The command '/sbin/route -q -n add -'inet' '/' -interface 'wg1'' returned exit code '68', the output was 'route: bad address:'

My steps to resolve, some may be related, some probably not:
- Deleted and rebuilt my wg instance from scratch. This moved the interface from wg1 to wg0. No change, same log entries.

- Realized I needed to reassign the new wg0 interface in Interfaces --> Assignments. The error log entries above went away and changed to:
2024-03-17T21:57:03-07:00 Notice wireguard wireguard instance main (wg0) started
2024-03-17T21:57:03-07:00 Notice wireguard /usr/local/opnsense/scripts/Wireguard/wg-service-control.php: ROUTING: entering configure using 'opt7'
2024-03-17T21:57:03-07:00 Notice wireguard wireguard instance main (wg0) can not reconfigure without stopping it first.


- Rebuilt all peer entries from scratch. No change. Wireguard port connections were allowed through the firewall and the handshake occurred, but no traffic passed, LAN or outside.

- I've got all DNS running through PiHole and noticed all DNS traffic was being denied through the wireguard interface despite being allowed in the interface rules.

- I temporarily allowed all traffic through the interface and traffic started flowing, including through all the earlier rules. I turned the allow-all rule back off, and everything continues to work.
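For what it's worth, the "bad address" / exit code 68 failure above is exactly what you get when an empty network string is spliced into the route command. A hedged sketch of the failure mode (function and variable names are mine, not from wg-service-control.php):

```shell
# An unassigned or stale instance effectively hands the route builder an
# empty network, so the command degenerates to "route add -inet '' ..."
# and route(8) exits 68 with "bad address".
build_route_cmd() {
  net="$1"; ifname="$2"
  case "$net" in
    */*) ;;  # expect CIDR notation, e.g. 10.2.0.0/24
    *) echo "refusing route add: empty/invalid network '$net'" >&2; return 1 ;;
  esac
  echo "/sbin/route -q -n add -inet '$net' -interface '$ifname'"
}
```

Reassigning the interface presumably let the script pick up a valid tunnel network again, which would match the error disappearing after that step.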

Based on the above, it seems like some conflicting/bad config info got cleared out. My setup is similar to CJ's, which he described earlier.

Quote from: CJ on March 09, 2024, 03:38:47 PM
Add me to the no problems with wireguard list.  I'm on 24.1.2_1.

(1) Do you use DNS entries as endpoint addresses?

I use a dynamic DNS entry for the server endpoint.

(2) Do you use tunnel addresses on your instances?

I have a /24 tunnel address set on my server instance and a /32 on my client.

(3) Do you have allowed IPs on your peers?

I have my clients configured as peers on the server instance and 0.0.0.0/0 for my client allowed peers.

(4) Do you have the instances assigned as interfaces?

I have my server instance assigned as an interface.

(5) If yes for (4) do you have an IPv4/IPv6 mode set in the interface?

Both IPv4 and IPv6 are set to None on my interface.  Also, I don't use IPv6 for my dynamic DNS entry.

(6) If yes for (4) do you have VIPs assigned to these interfaces?

N/A

Hope this helps, and I'm happy to try and provide more info for comparison/troubleshooting.
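For anyone comparing setups, the layout described in the quote maps onto a minimal config pair roughly like this (keys, ports, and addresses are placeholders, not values from the thread):

```ini
# Server instance: /24 tunnel network, one /32 AllowedIPs entry per peer
[Interface]
Address = 10.2.0.1/24
ListenPort = 51820
PrivateKey = <server-private-key>

[Peer]
PublicKey = <client-public-key>
AllowedIPs = 10.2.0.2/32

# Client: /32 tunnel address, 0.0.0.0/0 routes everything through the tunnel
[Interface]
Address = 10.2.0.2/32
PrivateKey = <client-private-key>

[Peer]
PublicKey = <server-public-key>
Endpoint = vpn.example.dyndns.org:51820
AllowedIPs = 0.0.0.0/0
```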
#3
Today the GUI was very slow, practically unresponsive. When the dashboard finally loaded, all the tables were empty. Logging in via SSH was no problem.

Before reboot this was in the system log. Do these entries indicate a problem?

<13>1 2023-06-26T16:07:54-07:00 thechekt.lunas.lan dhclient 43547 - [meta sequenceId="1"] Creating resolv.conf
<11>1 2023-06-26T21:03:00-07:00 thechekt.lunas.lan configctl 79000 - [meta sequenceId="1"] error in configd communication  Traceback (most recent call last):   File "/usr/local/sbin/configctl", line 66, in exec_config_cmd     line = sock.recv(65536).decode() socket.timeout: timed out
<11>1 2023-06-26T22:02:00-07:00 thechekt.lunas.lan configctl 51892 - [meta sequenceId="1"] error in configd communication  Traceback (most recent call last):   File "/usr/local/sbin/configctl", line 66, in exec_config_cmd     line = sock.recv(65536).decode() socket.timeout: timed out
<11>1 2023-06-26T22:03:00-07:00 thechekt.lunas.lan configctl 36621 - [meta sequenceId="1"] error in configd communication  Traceback (most recent call last):   File "/usr/local/sbin/configctl", line 66, in exec_config_cmd     line = sock.recv(65536).decode() socket.timeout: timed out
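Those tracebacks mean configctl gave up waiting on the configd backend socket, which fits a wedged configd (and an unresponsive GUI, since the web UI goes through the same daemon). A hedged first check, assuming the usual OPNsense socket path:

```shell
# configctl talks to configd over a unix socket; if the daemon is hung
# or dead, every call times out like the tracebacks above.
configd_reachable() {
  # Returns 0 only if the control socket exists and is a socket.
  [ -S "$1" ]
}

if ! configd_reachable /var/run/configd.socket; then
  echo "configd socket missing; backend may need a restart"
  # e.g. service configd restart   (on OPNsense)
fi
```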


Earlier entries were the usual repeats:

pid 29620 (python3.9), jid 0, uid 0: exited on signal 11 (core dumped)

Load average from top seemed OK. Temps are often on the high side at 50-55°C, but not crazy.


last pid: 82800;  load averages:  0.38,  0.38,  0.31                                                                                                up 0+19:21:47  22:22:23
49 processes:  1 running, 48 sleeping
CPU:  0.8% user,  0.0% nice,  2.4% system,  0.0% interrupt, 96.8% idle
Mem: 87M Active, 367M Inact, 622M Wired, 40K Buf, 6624M Free
ARC: 263M Total, 65M MFU, 162M MRU, 280K Anon, 2329K Header, 33M Other
     183M Compressed, 527M Uncompressed, 2.88:1 Ratio
Swap: 8192M Total, 8192M Free
#4
Thanks for the responses.

I agree, the ZFS filesystem issues are likely a symptom of another underlying issue. I'm swapping out hardware this weekend and will run some RAM tests to see if any culprits are highlighted.

One thing I'm considering: these lockups happen most frequently when wireguard is in heavier use. Hard to test, but I'll report back if something more conclusive surfaces.
#5
crash report!
#6
More info from the most recent lockup. Same symptoms: the firewall becomes unresponsive and the hardware is very hot. A hard reset often results in a kernel panic on reboot:
Solaris(panic): zfs: removing nonexistent segment from range tree (offset=4a7172000 size=1000)

although I think this is a result of the hard reset and not the root cause of the initial lockup.
#7
Hardware is a fitlet2 with a Celeron J3455 quad-core, 8GB RAM, and a 105GB SSD.

No SMART errors listed. The only strangeness I could see in dmesg/system logs was this error, repeated multiple times:
pid 29620 (python3.9), jid 0, uid 0: exited on signal 11 (core dumped)

Which logs can I provide to help diagnose?

Thanks for the help.
#8
Since updating to the 23 series, maybe just coincidentally, my firewall frequently locks up (three times in the past month) and becomes unresponsive. When it does, the hardware gets much hotter, so the CPU seems to be chewing on something.

When I hard-reset it, it sometimes recovers normally, but I've had to reinstall/restore twice now due to kernel panics, which I assume come from the reset. What is the best way to diagnose the cause?
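For a first pass at lockups like this, the usual post-crash snapshot looks something like the checklist below. It is printed rather than executed here; smartctl needs the smartmontools package, and the device/sensor names and log paths are assumptions that vary by hardware and OPNsense version:

```shell
# Hypothetical post-lockup checklist for a FreeBSD-based firewall.
diag_checklist() {
  cat <<'EOF'
smartctl -a /dev/ada0          # SMART health and error log for the SSD
sysctl dev.cpu.0.temperature   # coretemp reading, given the heat symptom
zpool status -v                # ZFS errors left over from the hard reset
tail -n 200 /var/log/messages  # kernel messages around the lockup
EOF
}
diag_checklist
```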

Thanks.