Potential Bug - Captive Portal causing 100% IOPS

Started by dpsguard, August 12, 2023, 05:27:03 PM

Hi @AdSchellevis and @franco,

I would like to request your help in reviewing this to see whether it is a bug that can be resolved with a workaround or a patch.

Thanks so much.


The process /usr/local/bin/python3 /usr/local/opnsense/scripts/OPNsense/CaptivePortal/cp-background-process.py causes 100% IO usage approximately every 2 seconds. We debugged it with truss and found the following:

fstatat(AT_FDCWD,"/var/captiveportal/captiveportal.sqlite-journal",0x820fab8a8,0x0) ERR#2 'No such file or directory'
pread(5,"\0\0\0\M-n\0\0\0\f\0\0\0\0\0\0\0"...,16,0x18) = 16 (0x10)
fstatat(AT_FDCWD,"/var/captiveportal/captiveportal.sqlite-wal",0x820fab868,0x0) ERR#2 'No such file or directory'
fstat(5,{ mode=-rw-r----- ,inode=66717,size=49152,blksize=49152 }) = 0 (0x0)
fcntl(5,F_SETLK,0x820fab930)                     = 0 (0x0)
fstatat(AT_FDCWD,"/var/captiveportal/captiveportal.sqlite",{ mode=-rw-r----- ,inode=66717,size=49152,blksize=49152 },0x0) = 0 (0x0)
fstat(5,{ mode=-rw-r----- ,inode=66717,size=49152,blksize=49152 }) = 0 (0x0)
fstatat(AT_FDCWD,"/var/captiveportal/captiveportal.sqlite",{ mode=-rw-r----- ,inode=66717,size=49152,blksize=49152 },0x0) = 0 (0x0)
close(5)                                         = 0 (0x0)
pipe2(0x820faba18,O_CLOEXEC)                     = 0 (0x0)
pipe2(0x820faba18,O_CLOEXEC)                     = 0 (0x0)
fstat(5,{ mode=p--------- ,inode=102063,size=0,blksize=4096 }) = 0 (0x0)
ioctl(5,TIOCGETA,0x820faba20)                    ERR#25 'Inappropriate ioctl for device'
lseek(5,0x0,SEEK_CUR)                            ERR#29 'Illegal seek'
fstat(7,{ mode=p--------- ,inode=102065,size=0,blksize=4096 }) = 0 (0x0)
ioctl(7,TIOCGETA,0x820faba20)                    ERR#25 'Inappropriate ioctl for device'
lseek(7,0x0,SEEK_CUR)                            ERR#29 'Illegal seek'
pipe2(0x820fab928,O_CLOEXEC)                     = 0 (0x0)
fork()                                           = 40657 (0x9ed1)
close(10)                                        = 0 (0x0)
close(8)                                         = 0 (0x0)
close(6)                                         = 0 (0x0)
read(9,0x827427460,50000)                        = 0 (0x0)
close(9)                                         = 0 (0x0)
poll({ 5/POLLIN 7/POLLIN },2,-1)                 = 2 (0x2)
read(5,0x8273ac0e0,32768)                        = 0 (0x0)
close(5)                                         = 0 (0x0)
read(7,0x8273acde0,32768)                        = 0 (0x0)
close(7)                                         = 0 (0x0)
wait4(40657,{ EXITED,val=0 },0x0,0x0)            = 40657 (0x9ed1)
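
For what it's worth, the fstatat() calls on the -journal and -wal files are what SQLite does whenever it opens the database, so the trace looks consistent with a timer loop that re-opens /var/captiveportal/captiveportal.sqlite on each pass and then forks a helper process. Below is a minimal sketch of that kind of loop; the interval, query, table name and the forked command are my assumptions for illustration, not the actual cp-background-process.py code.

# Hypothetical sketch only -- the real cp-background-process.py may differ.
import sqlite3
import subprocess
import time

DB_PATH = "/var/captiveportal/captiveportal.sqlite"

while True:
    # Opening the database makes SQLite stat the -journal/-wal sidecar
    # files, which matches the fstatat() lines in the truss output above.
    con = sqlite3.connect(DB_PATH)
    try:
        # table/column names are made up for the example
        rows = con.execute("SELECT sessionid FROM sessions").fetchall()
    finally:
        con.close()  # corresponds to the close(5) in the trace

    # The pipe2()/fork()/wait4() sequence suggests a child process is
    # spawned on every pass; the command here is purely hypothetical.
    subprocess.run(["/sbin/pfctl", "-s", "info"], capture_output=True)

    time.sleep(2)  # roughly matches the ~2 second cadence reported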

I updated to OPNsense 23.7.1_3-amd64 today and started having this exact same problem. I even added more RAM, and it still consumes all available resources.

@ameschke, your issue seems to be different from mine. In my case it is a disk IO issue: usage shows 100% every 2 seconds when I watch it with "top -m io" or "systat -iostat 1".

My understanding might be flawed, though: the 100% may simply indicate that this is the only process doing IO at that moment, and hence it shows 100%; it does not necessarily mean that all available disk IO is being consumed.

Are you also using the Captive Portal for guests? If so, you could try disabling the portal to see whether CPU and RAM go back to normal; that would point to an issue with the Captive Portal in the new version.

Even after stopping, disabling or removing the captive portal, the problem continued. Since this only started after updating to the latest version, and I still had the server in hand (I was finishing the configuration to deploy it to the client in the next few days), I made a backup of OPNsense and Zenarmor and reformatted the box. I installed version 23.7, downloaded the available updates and restored the backup. Everything has been operating normally so far.

Quote from: dpsguard on August 13, 2023, 06:36:46 AM


My understanding might be flawed, though: the 100% may simply indicate that this is the only process doing IO at that moment, and hence it shows 100%; it does not necessarily mean that all available disk IO is being consumed.


You'd need a pretty busy CP to max out the IOPS there, and most likely there'd be other points of failure long before you reach that limit.

Are you sure the storage is healthy there?

Thanks @newsense for looking into this. Yes, the storage is all good: two new SSDs in a ZFS software mirror, and zpool status did not show any errors the few times I checked.

Now, when I checked "top -m io" earlier, I had the GUI session logged out and the tab closed. At that point it showed either 0% across all processes or, every couple of seconds, 100% for the python cp background process. I then opened the GUI and could see that php-cgi also consumes IOPS while python shows less, but the total of the two remains 100%. So either my understanding is flawed, or it has something to do with how top reports IO on FreeBSD.
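
As far as I understand it, the percentage column in "top -m io" is each process's share of all IO counted during that sampling interval, not a measure of disk saturation, which would explain why the shares always add up to 100%. A quick illustration with made-up numbers:

# Illustrative arithmetic only; the operation counts are made up.
ops = {"python3": 12, "php-cgi": 36}   # IO operations observed in one interval
total = sum(ops.values())
for name, count in ops.items():
    print(f"{name}: {100 * count / total:.0f}%")
# -> python3: 25%, php-cgi: 75%. With python3 as the only active process,
#    its share would read 100% even for a handful of operations.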