Connection tracked to death

I installed updated packages on my Linux firewall/router this morning and rebooted it.  When it came back up my Asterisk PBX was not able to register with my SIP provider.  I ran tcpdump on the WAN side to see what was going on.  I saw that the forwarded packets still had the internal address of my PBX on them.  In other words, masquerading was not occurring as it should. Based on prior experience I blamed the new kernel that came in with the batch of updates.  I rebooted with the prior kernel.  The problem persisted.  I tried removing nf_nat_sip.  Didn’t help.  I googled.  No luck.  I added debugging rules to various chains in the nat and filter tables.  The debugging rules in the filter table showed hits.  The debugging rules in the PREROUTING chain of the nat table did not show hits.  What the hell?

I googled more.  Eventually I landed on this image which shows how a packet flows through Netfilter:

There’s text in the middle of the image that stood out in my mind: “nat” table only consulted for “NEW” connections.

Hmm.  Interesting.  Connection tracking handles NAT operations.  This is something that I used to know, but I hadn’t dealt with this stuff in ages, so I was fumbling around the same way I might have done 10 years ago.  I did remember that the difference between a new and a non-new connection is that a non-new connection has an entry in the connection tracking table.  So I looked in /proc/net/nf_conntrack and found a hit for the SIP “connection”.  The “connection” is regularly refreshed (i.e., the entry is kept alive) because Asterisk is aggressively trying to register.  I wondered what would happen if I killed the existing connection tracking entry.  I shut down Asterisk so that it would stop sending packets.  I also installed the conntrack-tools package so that I could manipulate the table.  I removed the existing conntrack entry (conntrack -D -s pbx), then started Asterisk.  It immediately registered successfully. Woot!
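For next time, here is a rough Python sketch of how one could spot an entry like that without eyeballing the table. It assumes the usual /proc/net/nf_conntrack layout, where each line carries the original tuple first and the reply tuple second; with masquerading in effect, the reply's dst should be the router's WAN address rather than the internal host. The two sample lines are made up for illustration (documentation addresses, not my real entries, which I did not save), and the function names are hypothetical.

```python
import re

# Matches the src=/dst= address fields of one conntrack entry line.
_ADDR = re.compile(r"\b(src|dst)=(\S+)")


def tuples(line):
    """Return (original, reply) address dicts from one conntrack entry line.

    Assumes the original-direction tuple appears before the reply tuple.
    """
    fields = _ADDR.findall(line)
    if len(fields) < 4:
        raise ValueError("expected an original and a reply tuple")
    return dict(fields[:2]), dict(fields[2:4])


def is_source_natted(line):
    """True if replies are addressed to something other than the original source."""
    orig, reply = tuples(line)
    return reply["dst"] != orig["src"]


def unnatted_entries(lines, internal_src):
    """Yield entries from internal_src whose source address was never rewritten."""
    for line in lines:
        orig, _ = tuples(line)
        if orig["src"] == internal_src and not is_source_natted(line):
            yield line


# Illustrative entries only: 192.168.1.10 stands in for the PBX,
# 198.51.100.7 for the router's WAN address, 203.0.113.5 for the SIP provider.
NATTED = ("ipv4     2 udp      17 170 src=192.168.1.10 dst=203.0.113.5 "
          "sport=5060 dport=5060 src=203.0.113.5 dst=198.51.100.7 "
          "sport=5060 dport=5060 [ASSURED] mark=0 zone=0 use=2")
STALE = ("ipv4     2 udp      17 28 src=192.168.1.10 dst=203.0.113.5 "
         "sport=5060 dport=5060 [UNREPLIED] src=203.0.113.5 dst=192.168.1.10 "
         "sport=5060 dport=5060 mark=0 zone=0 use=2")
```

On a live router one would feed it the lines of /proc/net/nf_conntrack (as root) instead of the samples, and any line it yields for the PBX is a candidate for conntrack -D.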

So, again, what the hell?  I can’t tolerate an unreliable router.  Unfortunately I didn’t save the transcripts of the debugging session (particularly useful to have would have been the old conntrack entry) but here’s what I think happened:

  • When the router starts up, connection tracking is one of the first modules to come online.  It immediately starts tracking connections for packets that it sees, regardless of whether they’re actually being forwarded successfully.
  • Then the firewall and nat rules are loaded from /etc/sysconfig/iptables.

So, if a SIP registration packet is seen after connection tracking is live, but before my nat rule is in place, then that and all subsequent SIP packets are part of a connection that does not undergo NAT.  If I had left Asterisk running and/or had not deleted the existing conntrack entry, it never would have started working properly unless I had rebooted the router and got lucky with the timing of packets.
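The race I'm describing can be sketched as a toy model in Python. This is not how the kernel implements anything; it only encodes the one behavior from the diagram (the nat decision is made once, for the first packet of a flow, and then remembered in the conntrack entry). The class, its method names, and the addresses are all made up for illustration, and the model ignores entry timeouts, which in my case never fired because Asterisk kept retrying.

```python
class ToyRouter:
    """Toy model: the nat decision is taken once per flow, then cached."""

    def __init__(self, wan_addr):
        self.wan_addr = wan_addr
        self.masquerade_loaded = False  # has the nat rule been loaded yet?
        self.conntrack = {}             # flow key -> source address used on the wire

    def forward(self, src, dst, dport):
        """Return the source address the packet leaves the WAN side with."""
        key = (src, dst, dport)
        if key not in self.conntrack:
            # NEW connection: this is the only time the nat rules are consulted.
            rewritten = self.wan_addr if self.masquerade_loaded else src
            self.conntrack[key] = rewritten
        # Every later packet of the flow reuses the cached decision.
        return self.conntrack[key]

    def delete_entries(self, src):
        """Rough analogue of deleting conntrack entries by source address."""
        self.conntrack = {k: v for k, v in self.conntrack.items() if k[0] != src}


PBX, PROVIDER, WAN = "192.168.1.10", "203.0.113.5", "198.51.100.7"
router = ToyRouter(WAN)

# Packet seen after conntrack is live but before the nat rule is loaded.
early = router.forward(PBX, PROVIDER, 5060)

# The nat rule arrives, but the existing entry keeps its cached decision.
router.masquerade_loaded = True
stuck = router.forward(PBX, PROVIDER, 5060)

# Deleting the entry makes the next packet NEW again, so the rule applies.
router.delete_entries(PBX)
fixed = router.forward(PBX, PROVIDER, 5060)
```

In this model early and stuck both carry the PBX's internal address, and only fixed carries the WAN address, which matches what I saw in tcpdump before and after deleting the entry.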

One of these days it would be nice to reproduce this in a controlled environment and verify the guess.  That said, I’m pretty confident of the general conclusion.