OPNsense Forum

Archive => 19.1 Legacy Series => Topic started by: sporkman on February 10, 2019, 03:08:20 am

Title: Internet outage, all hell breaks loose
Post by: sporkman on February 10, 2019, 03:08:20 am
Just updated to 19.1.1 last night and it seemed to work well.

A few hours ago, I started getting txts that my fios line was down, and sure enough, there appeared to be no internet access (a fairly rare occurrence with FTTH outside of maintenance hours, TBH). So my first thought was that, like 18.7, opnsense had panicked or something, or I'd hit some new bug in 19.1.

The web interface worked, but only to a limited extent: the dashboard showed some info, but actually toggling things (enabling/disabling the fios interface) or renewing the DHCP lease got no response. I ssh'd in and ran 'dmesg' and it was just full of "[zone: pf states] PF states limit reached" messages. Digging a bit more with 'pfctl -ss', I saw that it was basically all outbound DNS requests, presumably from unbound.
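
For anyone hitting the same "PF states limit reached" flood, here is a rough sketch of how the 'pfctl -ss' output could be boiled down to a count per destination port. The helper name states_by_dst_port is made up for this example, and it assumes the usual IPv4 "addr:port -> addr:port" line layout; inbound ("<-") states and IPv6 entries are simply skipped.

```shell
# Hypothetical helper: count pf states per protocol and destination port.
# Reads `pfctl -ss` style lines on stdin (assumed IPv4 layout).
states_by_dst_port() {
    awk '
        {
            # Find the "->" marker; the next field is the destination addr:port.
            for (i = 1; i < NF; i++) {
                if ($i == "->") {
                    n = split($(i + 1), part, ":")
                    # $2 is the protocol column (tcp, udp, ...).
                    count[$2 "/" part[n]]++
                    break
                }
            }
        }
        END { for (k in count) print count[k], k }
    ' | sort -rn
}

# Usage on the firewall (needs root):
#   pfctl -ss | states_by_dst_port | head
```

If DNS really is the culprit, udp/53 (or tcp/853 for DNS over TLS) should sit at the top of the list by a wide margin.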

I killed unbound, but couldn't remember how to manually kill states (and couldn't google it!). So then I just checked my fios interface and I think I confirmed an outage by noting that tcpdump was showing me absolutely nothing (or the interface was locked up?).
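
Since killing states by hand came up: a minimal sketch of the relevant pfctl calls, wrapped in small shell functions. The function names and the PFCTL override variable are invented for this example; the pfctl flags themselves (-si, -sm, -k, -F states) are from the pfctl man page. Treat it as a starting point, not a tested recovery procedure.

```shell
# PFCTL can be overridden (e.g. PFCTL=echo) to preview the commands
# instead of running them. This variable is an assumption of the sketch.
PFCTL="${PFCTL:-pfctl}"

# Show filter statistics, including the current state table entry count.
pf_state_info() { "$PFCTL" -si; }

# Show the configured hard limits, including the "states" limit.
pf_state_limits() { "$PFCTL" -sm; }

# Kill all states from any source to a single destination host.
pf_kill_states_to() { "$PFCTL" -k 0.0.0.0/0 -k "$1"; }

# Flush the entire state table (this drops every open connection).
pf_flush_states() { "$PFCTL" -F states; }
```

For example, pf_kill_states_to 1.1.1.1 would only clear states pointing at that resolver, while pf_flush_states empties the whole table, including the ssh session you are typing in.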

I attempted the "restart all services" in hopes of getting the full GUI back, but it hung on restarting cron, so I had to "kill -9" poor cron in the shell. Things were still odd. It appears php-fpm and configd/python were just dying:

Code:
pid 64855 (python2.7), uid 0: exited on signal 10 (core dumped)
pid 78374 (python2.7), uid 0: exited on signal 10 (core dumped)
[HBSD SEGVGUARD] [python2.7 (78374)] Suspension expired.
 -> pid: 78374 ppid: 78269 p_pax: 0xa50<SEGVGUARD,ASLR,NOSHLIBRANDOM,NODISALLOWMAP32BIT>
[HBSD SEGVGUARD] [/usr/local/bin/python2.7 (35385)] Suspension expired.
 -> pid: 35385 ppid: 34118 p_pax: 0xa50<SEGVGUARD,ASLR,NOSHLIBRANDOM,NODISALLOWMAP32BIT>
pid 43398 (sleep), uid 0: exited on signal 10
[HBSD SEGVGUARD] [/bin/sleep (81756)] Suspension expired.
 -> pid: 81756 ppid: 81407 p_pax: 0xa50<SEGVGUARD,ASLR,NOSHLIBRANDOM,NODISALLOWMAP32BIT>
[HBSD SEGVGUARD] [/usr/local/bin/php (46831)] Suspension expired.
 -> pid: 46831 ppid: 38217 p_pax: 0xa50<SEGVGUARD,ASLR,NOSHLIBRANDOM,NODISALLOWMAP32BIT>
pid 8347 (python2.7), uid 0: exited on signal 10 (core dumped)
ovpns1: link state changed to DOWN
pid 8749 (python2.7), uid 0: exited on signal 11 (core dumped)
[HBSD SEGVGUARD] [python2.7 (8749)] Suspension expired.
 -> pid: 8749 ppid: 38443 p_pax: 0xa50<SEGVGUARD,ASLR,NOSHLIBRANDOM,NODISALLOWMAP32BIT>
ovpns1: link state changed to UP
[HBSD SEGVGUARD] [/usr/local/bin/python2.7 (48064)] Suspension expired.
 -> pid: 48064 ppid: 47524 p_pax: 0xa50<SEGVGUARD,ASLR,NOSHLIBRANDOM,NODISALLOWMAP32BIT>
pid 23052 (awk), uid 0: exited on signal 10 (core dumped)
[HBSD SEGVGUARD] [/usr/bin/awk (74876)] Suspension expired.

No idea what that's about other than it must be some HardenedBSD feature that's giving that extra info. I have way more of that logged if it's of interest.
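
With that much of it logged, a tally per program and signal is easier to read than the raw lines. Below is a small sketch that does this with awk; crash_summary is a made-up name, and it only looks at the "exited on signal" lines in the format shown above. On FreeBSD, signal 10 is SIGBUS and signal 11 is SIGSEGV, so the sketch labels those two and leaves any other number as-is.

```shell
# Hypothetical helper: count "exited on signal" events per program and signal.
# Reads kernel log text on stdin, for example: dmesg | crash_summary
crash_summary() {
    awk '
        /exited on signal/ {
            # Example line:
            #   pid 64855 (python2.7), uid 0: exited on signal 10 (core dumped)
            prog = $3
            gsub(/[(),]/, "", prog)
            sig = ""
            for (i = 1; i < NF; i++)
                if ($i == "signal") sig = $(i + 1)
            # FreeBSD signal numbers: 10 is SIGBUS, 11 is SIGSEGV.
            if (sig == 10)      name = "SIGBUS"
            else if (sig == 11) name = "SIGSEGV"
            else                name = "signal " sig
            count[prog " " name]++
        }
        END { for (k in count) print count[k], k }
    ' | sort -rn
}
```

Run against the excerpt above, the top line is "3 python2.7 SIGBUS". Several unrelated programs (python2.7, sleep, awk) dying on the same signal would point away from a bug in any one of them.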

Finally gave up and rebooted. Internet came back, with no more unusual state table bloating and, so far, no dying php/python.

Not sure if this was an outage triggering all this chaos or if the chaos happened and rebooting was the thing that resolved the outage. The totally blank tcpdump gave me pause.  I got a new IP when rebooting, which is unusual for fios, but common after maintenance and outages sooo???

unbound filling the state table is kind of odd too - not sure why it would just keep firing off new queries when no answers are received.

Anyone want more info?
Title: Re: Internet outage, all hell breaks loose
Post by: newsense on February 10, 2019, 03:22:53 am
I started seeing [HBSD SEGVGUARD] for Unbound when 18.7.10 came out, but I'm unsure what's causing it. On one machine reinstalling Unbound helped, but all the others are still seeing it randomly at various intervals, both on VMs and APU hardware.

The configuration is identical on all of them: no system-defined DNS, and only 1.1.1.1 and 9.9.9.9 over TLS defined in Options, following the pfsense blog post from last year when 1.1.1.1 was announced. So I'm a bit unsure where the issue actually is, as I couldn't see anything in the logs even with increased verbosity.

The python segvguard is new to me however, although all the other bits match for the Unbound error:
Quote
Suspension expired and SEGVGUARD,ASLR,NOSHLIBRANDOM,NODISALLOWMAP32BIT


That being said, the problems with Unbound started a while back, so there's no direct correlation between your Internet outage and anything else.
Title: Re: Internet outage, all hell breaks loose
Post by: sporkman on February 10, 2019, 06:54:45 pm
Ugh, this is killing me.

Kernel panic this morning:

Code:
Fatal trap 9: general protection fault while in kernel mode
cpuid = 0; apic id = 00
instruction pointer = 0x20:0xffffffff80f51d8c
stack pointer         = 0x0:0xfffffe011a4758d0
frame pointer         = 0x0:0xfffffe011a4758d0
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 80737 (python2.7)
version.txt: FreeBSD 11.2-RELEASE-p8-HBSD  31af16db12b(stable/19.1)

I submitted using the reporter, but I feel like this 11.2 HBSD migration might be a little bit half-baked...

For the devs, is there anything I can supply? Is there anything hardware-wise that's not friendly to opnsense?  This box has one bge and two re cards. It's an old Dell optiplex core2duo. I guess I could burn a memtest86 CD and make sure I don't have RAM issues...

I figure I'll give this a week or so and if the panic thing is fairly regular it's back to the firewall that shall not be named (barf).
Title: Re: Internet outage, all hell breaks loose
Post by: sporkman on February 11, 2019, 05:07:58 am
And again - this time something to do with my LAN interface:

Code:
Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address   = 0x0
fault code              = supervisor write data, page not present
instruction pointer     = 0x20:0xffffffff8248a028
stack pointer           = 0x28:0xfffffe00efb90f10
frame pointer           = 0x28:0xfffffe00efb91390
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 0 (bge0 taskq) <<--

Turned off Suricata this morning (it was only in IDS mode, not IPS), and I'm trying to turn off netflow now (it's unconfigured in the GUI, but there's still a python process running that has something to do with processing that data).

Of note, before this the logfile has lots of interface up/down events on the ovpns1 (openvpn server interface).
Title: Re: Internet outage, all hell breaks loose
Post by: sporkman on March 01, 2019, 07:24:23 am
No panics since turning off netflow and the IDS.

Any devs have interest in this or no?
Title: Re: Internet outage, all hell breaks loose
Post by: sporkman on March 25, 2019, 02:01:12 am
Spoke too soon, got another one tonight:

Code:
Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address = 0x65
fault code = supervisor write data, page not present
instruction pointer = 0x20:0xffffffff810bb549
stack pointer         = 0x28:0xfffffe011a1e8890
frame pointer         = 0x28:0xfffffe011a1e88e0
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 67259 (php)

I sent a report in with the automated thing.

Is there interest in debugging this or no? There are things I dislike about pfsense, but the basic "it doesn't panic" feature I do enjoy.
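
To line up the three panics in this thread side by side, something like the following sketch could pull the trap number, trap description, and "current process" out of each dump. panic_summary is a hypothetical name, and it assumes the "Fatal trap N: ... while in kernel mode" and "current process = ..." lines look like the ones pasted above.

```shell
# Hypothetical helper: print one line per panic with the trap number,
# trap description, and the process that was running at the time.
# Reads one or more panic dumps on stdin.
panic_summary() {
    awk '
        /^Fatal trap/ {
            trap = $3
            sub(/:$/, "", trap)
            desc = $0
            sub(/^Fatal trap [0-9]+: /, "", desc)
            sub(/ while in kernel mode$/, "", desc)
        }
        /^current process/ {
            proc = $0
            sub(/^current process[ \t]*=[ \t]*/, "", proc)
            # Drop the trailing "<<--" marker some dumps carry.
            sub(/[ \t]*<<--.*$/, "", proc)
            print "trap " trap " (" desc "), process: " proc
        }
    '
}
```

Fed the three dumps from this thread, it reports python2.7, the bge0 taskq, and php as the current process respectively, so the panics are not tied to one program.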