TLS issues when trying to access specific hosts

Started by anton_bakker, May 06, 2026, 10:21:20 AM

When accessing certain URLs I get a timeout due to TLS issues. These occurred after updating to 26.1.7. I have isolated the issue to OPNsense, as described in the issue report below. Short story:
- times out: curl -sv --connect-timeout 10 "https://s3.eu-west-1.amazonaws.com"
- succeeds: curl -sv --connect-timeout 10 "https://sts.eu-west-1.amazonaws.com"


curl -sv --connect-timeout 10 "https://s3.eu-west-1.amazonaws.com"
* Host s3.eu-west-1.amazonaws.com:443 was resolved.
* IPv6: (none)
* IPv4: 3.5.64.226, 3.5.71.68, 52.92.19.112, 3.5.67.193, 3.5.74.107, 3.5.64.223, 3.5.64.231, 3.5.67.218
*   Trying 3.5.64.226:443...
* Connected to s3.eu-west-1.amazonaws.com (3.5.64.226) port 443
* ALPN: curl offers h2,http/1.1
* (304) (OUT), TLS handshake, Client hello (1):
*  CAfile: /etc/ssl/cert.pem
*  CApath: none
* SSL connection timeout
* Closing connection


curl -sv --connect-timeout 10 "https://sts.eu-west-1.amazonaws.com"
* Host sts.eu-west-1.amazonaws.com:443 was resolved.
* IPv6: (none)
* IPv4: 3.253.222.165
*   Trying 3.253.222.165:443...
* Connected to sts.eu-west-1.amazonaws.com (3.253.222.165) port 443
* ALPN: curl offers h2,http/1.1
* (304) (OUT), TLS handshake, Client hello (1):
*  CAfile: /etc/ssl/cert.pem
*  CApath: none
* (304) (IN), TLS handshake, Server hello (2):
* (304) (OUT), TLS handshake, Client hello (1):
* (304) (IN), TLS handshake, Server hello (2):
* (304) (IN), TLS handshake, Unknown (8):
* (304) (IN), TLS handshake, Certificate (11):
* (304) (IN), TLS handshake, CERT verify (15):
* (304) (IN), TLS handshake, Finished (20):
* (304) (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / AEAD-AES128-GCM-SHA256 / [blank] / UNDEF
* ALPN: server accepted http/1.1
* Server certificate:
*  subject: CN=sts.eu-west-1.amazonaws.com
*  start date: Nov  5 00:00:00 2025 GMT
*  expire date: Aug 10 23:59:59 2026 GMT
*  subjectAltName: host "sts.eu-west-1.amazonaws.com" matched cert's "sts.eu-west-1.amazonaws.com"
*  issuer: C=US; O=Amazon; CN=Amazon RSA 2048 M04
*  SSL certificate verify ok.
* using HTTP/1.x
> GET / HTTP/1.1
> Host: sts.eu-west-1.amazonaws.com
> User-Agent: curl/8.7.1
> Accept: */*
>
* Request completely sent off
< HTTP/1.1 302 Found
< x-amzn-RequestId: bf0d8a02-5d9e-41f0-99fa-38438b552a55
< Location: https://aws.amazon.com/iam
< X-Amz-Sts-Extended-Request-Id: MTpldS13ZXN0LTE6UzoxNzc4MDU0NTE2ODgyOlI6aDVoWXhZaVk=
< Content-Length: 0
< Date: Wed, 06 May 2026 08:01:56 GMT
<
* Connection #0 to host sts.eu-west-1.amazonaws.com left intact


There are no rules that would block this traffic.

Here are the findings

# OPNsense Bug Report: pf NAT corrupts TCP SYN window field on forwarded packets

## Summary

OPNsense 26.1.7_3 corrupts the TCP window field on ALL forwarded/NAT'd SYN packets. The original window value (65535) is placed in the sequence number field, the window is zeroed, and TCP options are stripped. This causes TLS handshake failures to services that strictly honor `win 0` (notably AWS S3).

## Environment

| Component | Value |
|-----------|-------|
| OPNsense version | 26.1.7_3 (Witty Woodpecker) |
| FreeBSD | 14.3-RELEASE-p12 |
| Architecture | amd64 |
| Kernel | stable/26.1-n272089-81f87c4d694c SMP |
| NICs | Intel igb (igb0–igb5), native netmap support |
| Hardware | 8-core, 32GB RAM |
| WAN interface | igb0 (82.174.137.64/23, gateway 82.174.136.1) |
| LAN interface | igb1 (172.16.16.254/24) |
| NAT | Standard outbound NAT (LAN → WAN) |
| Suricata | Stopped (not running) |
| Zenarmor | Completely removed (uninstalled) |
| Previous working version | OPNsense 26.1.6 (last confirmed working: 2026-04-26) |
| Broken since | 2026-05-01 (update to 26.1.7) |

## Steps to Reproduce

1. Install OPNsense 26.1.7 with standard outbound NAT configuration.
2. From any LAN client, initiate a new TCP connection to any external host on port 443:
   ```
   curl --connect-timeout 10 https://s3.eu-west-1.amazonaws.com/
   ```
3. Capture the SYN packet on the WAN interface:
   ```
   tcpdump -i igb0 -nn 'tcp[tcpflags] & tcp-syn != 0 and src host <WAN_IP> and dst port 443' -c 5
   ```

## Expected Behavior

SYN packet on WAN interface after NAT should have:
- Random sequence number (from client TCP stack)
- `win 65535` (or client's advertised receive window)
- Full TCP options (MSS, window scale, SACK permitted, timestamps)

Example of correct SYN (as seen on LAN side before NAT):
```
172.16.16.44:57087 > 3.5.74.57:443: Flags [SEW], seq 1439618638, win 65535,
  options [mss 1460,nop,wscale 6,nop,nop,TS val 941018787 ecr 0,sackOK,eol], length 0
```

## Actual Behavior

SYN packet on WAN interface after NAT has:
- `seq 65535` (the original window value placed in the sequence number field)
- `win 0` (window zeroed)
- No TCP options (stripped)
- Packet length 40 bytes (bare minimum TCP/IP header)

Captured 2026-05-06 09:44 UTC+2 on igb0 (WAN):
```
09:44:14.110592 IP 82.174.137.64.2450 > 1.1.1.1.443: Flags [S], seq 65535, win 0, length 0
09:44:14.113966 IP 82.174.137.64.37716 > 104.20.23.154.443: Flags [S], seq 65535, win 0, length 0
09:44:14.113975 IP 82.174.137.64.55075 > 44.199.179.5.443: Flags [S], seq 65535, win 0, length 0
09:44:14.148608 IP 82.174.137.64.1394 > 16.15.228.9.443: Flags [S], seq 65535, win 0, length 0
09:44:14.149591 IP 82.174.137.64.39837 > 52.218.62.139.443: Flags [S], seq 65535, win 0, length 0
09:44:14.420920 IP 82.174.137.64.46954 > 72.145.163.206.443: Flags [S], seq 65535, win 0, length 0
09:44:14.702825 IP 82.174.137.64.6982 > 3.78.205.28.443: Flags [S], seq 65535, win 0, length 0
09:44:19.815275 IP 82.174.137.64.12773 > 35.204.155.255.443: Flags [S], seq 65535, win 0, length 0
```
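
To put a number on how many captured SYNs show this pattern, the text output of `tcpdump` can be piped through a small filter. This is a minimal sketch, assuming one packet per line as `tcpdump -nn` prints it; `count_bare_syns` is a hypothetical helper name, not part of OPNsense or tcpdump. It only looks at the `win 0` and missing `options [...]` parts of each line.

```shell
#!/bin/sh
# Hypothetical helper: read tcpdump text lines on stdin and count those that
# advertise a zero window and carry no TCP options list.
count_bare_syns() {
  awk '
    # A line counts as corrupted when it has "win 0," and no "options [".
    / win 0,/ && !/options \[/ { bad++ }
    END { printf "%d of %d lines look corrupted\n", bad + 0, NR }
  '
}
```

Possible usage on the firewall, with the interface name taken from this report: `tcpdump -i igb0 -nn -c 20 'tcp[tcpflags] & tcp-syn != 0 and dst port 443' | count_bare_syns`.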

Note: ALL destinations are affected (Cloudflare, AWS, GitHub, httpbin, etc.), not just S3. The `seq` field always contains `65535` (the original window value) and `win` is always `0`.

## Impact

**Severity: High**

- ALL new outbound TCP connections from LAN clients have corrupted SYN packets.
- Most internet services appear to work because they ignore `win 0` in SYN packets and use their own default receive window.
- Services that strictly honor the advertised zero window (AWS S3) deadlock: S3 responds with `win 0`, neither side can send data, TLS handshake times out.
- Locally-originated traffic from the firewall itself is NOT affected (bypasses pf forwarding path).
- Existing connections (HTTP/2 keep-alive, QUIC/UDP) are unaffected.

### Confirmed affected services

| Service | Result | Reason |
|---------|--------|--------|
| AWS S3 (all regions) | TLS timeout | Strictly honors win=0 |
| All other HTTPS services | Appear to work | Ignore win=0 in SYN |

### Confirmed working (comparison)

| Test | Result |
|------|--------|
| Same S3 URL from firewall itself (`curl` on OPNsense) | Works immediately (HTTP 405) |
| Same S3 URL from LAN client via mobile hotspot | Works immediately |
| DNS resolution from LAN client | Works |
| TCP connect (SYN/ACK) from LAN client | Works (ACK packets are NOT corrupted) |
| TLS to non-S3 services from LAN client | Works (servers ignore win=0) |
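
The comparison above can be repeated with one small loop, run once from a LAN client and once on the firewall itself. This is a sketch under assumptions: `probe_tls` is a hypothetical name, and it relies on curl exiting with code 28 when an operation times out, which should be checked against the installed curl version.

```shell
#!/bin/sh
# Hypothetical helper: try an HTTPS connection to each host and report
# whether it completed, timed out, or failed for another reason.
probe_tls() {
  for host in "$@"; do
    # Discard the response body; only the exit status matters here.
    curl -s -o /dev/null --connect-timeout 10 "https://${host}/"
    rc=$?
    case $rc in
      0)  echo "${host}: ok" ;;
      28) echo "${host}: timeout" ;;          # curl's "operation timed out" code
      *)  echo "${host}: curl exit ${rc}" ;;
    esac
  done
}
```

Possible usage: `probe_tls s3.eu-west-1.amazonaws.com sts.eu-west-1.amazonaws.com`. If the report is accurate, the first host times out from a LAN client and both succeed on the firewall.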

## Root Cause Analysis

The corruption pattern — window value moved to sequence number, window zeroed, options stripped — indicates a buffer offset error in pf's NAT rewrite path for forwarded packets. Only the initial SYN is affected; subsequent packets (ACK, data) pass through correctly:

```
# ACK packet (NOT corrupted):
82.174.137.64:51454 > 3.5.67.235:443: Flags [.], ack 1, win 65535, length 0
```

This suggests the bug is in the code path that creates the initial NAT state entry from the SYN packet, not in the ongoing state-based forwarding.
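
To capture only packets that show this exact pattern, the pcap filter can test the header fields directly instead of relying on the decoded text. The sketch below is an illustration, not a confirmed diagnostic: `corrupt_syn_filter` is a hypothetical helper, and the byte offsets follow the standard TCP header layout (sequence number at bytes 4-7, data offset in the high nibble of byte 12, window at bytes 14-15). The `tcp[...]` accessors apply to IPv4 only.

```shell
#!/bin/sh
# Hypothetical helper: print a pcap filter for IPv4 SYNs from the given
# source address with seq = 65535, window = 0 and a 20-byte TCP header
# (data offset 5, so no options).
corrupt_syn_filter() {
  wan_ip=$1
  printf '%s\n' "src host ${wan_ip} and tcp[tcpflags] & tcp-syn != 0 and tcp[4:4] = 65535 and tcp[14:2] = 0 and tcp[12] & 0xf0 = 0x50"
}
```

Possible usage on the firewall: `tcpdump -i igb0 -nn -c 5 "$(corrupt_syn_filter <WAN_IP>)"`. A capture that stays empty while LAN clients open new connections would suggest the pattern is not present on that interface.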

## Workarounds Attempted

| Workaround | Result |
|-----------|--------|
| Stop Suricata IPS | Corruption persists |
| Complete Zenarmor removal (`pkg delete os-sensei*`) | Corruption persists |
| Flush pf states (`pfctl -Fs`) | No effect |
| Reboot (multiple times) | Corruption persists across reboots |
| Update to 26.1.7_3 | No effect |
| Downgrade via `opnsense-revert` | Only 26.1.7_2 available in repo |

## Reproduction Commands

```bash
# On OPNsense — capture outbound SYNs on WAN:
tcpdump -i igb0 -nn 'tcp[tcpflags] & tcp-syn != 0 and src host 82.174.137.64 and dst port 443' -c 5

# From any LAN client — trigger new TCP connections:
curl --connect-timeout 10 https://s3.eu-west-1.amazonaws.com/
curl --connect-timeout 10 https://example.com/

# From OPNsense itself (works — proves forwarding path is the issue):
curl -skI https://s3.eu-west-1.amazonaws.com/
# Returns HTTP 405 immediately

# Compare LAN-side (before NAT) vs WAN-side (after NAT):
# LAN (igb1): normal SYN with seq=random, win=65535, full options
# WAN (igb0): corrupted SYN with seq=65535, win=0, no options
```
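
The LAN-versus-WAN comparison is easiest to read when both sides are captured at the same time. The following is a rough sketch, assuming `igb1` and `igb0` are the LAN and WAN interfaces as in this report; `capture_both_sides`, the output directory, and the file names are hypothetical. Both captures block until the requested number of SYNs has been seen, so new connections have to be triggered from a LAN client while it runs.

```shell
#!/bin/sh
# Hypothetical helper: record SYNs to port 443 on LAN and WAN simultaneously
# so the same connection attempts can be compared before and after NAT.
capture_both_sides() {
  out_dir=${1:-/tmp}   # where to write the two text captures
  count=${2:-5}        # SYNs to record per interface
  filter='tcp[tcpflags] & tcp-syn != 0 and dst port 443'

  tcpdump -i igb1 -nn -c "$count" "$filter" > "$out_dir/syn_lan.txt" 2>/dev/null &
  lan_pid=$!
  tcpdump -i igb0 -nn -c "$count" "$filter" > "$out_dir/syn_wan.txt" 2>/dev/null &
  wan_pid=$!

  # Wait for both background captures to finish.
  wait "$lan_pid"
  wait "$wan_pid"
  echo "LAN side: $out_dir/syn_lan.txt"
  echo "WAN side: $out_dir/syn_wan.txt"
}
```

Possible usage: run `capture_both_sides /tmp 5` on OPNsense, start the `curl` commands from a LAN client, then compare the `seq`, `win` and `options` fields in the two files.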

## Timeline

| Date | Event |
|------|-------|
| 2026-04-26 | Last confirmed working state (OPNsense 26.1.6) |
| 2026-05-01 | Updated to OPNsense 26.1.7 — corruption begins |
| 2026-05-03 | Issue discovered (S3 TLS failures) |
| 2026-05-04 | Zenarmor fully removed; corruption persists, which rules out Zenarmor and points to pf or the kernel |
| 2026-05-06 | Fresh verification — bug still present on 26.1.7_3 |

## Additional Context

- The corruption affects ALL forwarded SYN packets regardless of destination, but is only *observable* as a failure with AWS S3 because S3 is one of the few services that strictly enforces the zero window.
- The ECE/CWR flags present in the original SYN (`Flags [SEW]`) are also stripped, leaving only SYN (`Flags [S]`) on the WAN side.
- Intel igb NICs have native netmap support. Zenarmor previously used netmap routed mode on these interfaces, but the corruption persists after complete Zenarmor removal and reboot.
- No relevant changes to pf rules or NAT configuration were made between the working (26.1.6) and broken (26.1.7) states.
- The `opnsense-revert` tool does not have 26.1.6 packages available, making downgrade impossible without a full reinstall from ISO.