January 28, 2009
On Monday night, the power supply unit (PSU) in the server that hosts our mail server failed at around 2200 GMT. We don’t have physical access to the server out of hours, so I wasn’t able to replace it until about 1045 the next day. As a result, our main email server was down for nearly 13 hours.
We didn’t have a backup MX because:
- It usually can’t check whether recipients are valid or not, and therefore must accept mail that it can’t deliver;
- It usually doesn’t have as good antispam checks as the primary, because it’s a hassle to keep it updated;
- Spammers usually abuse backup MXes to send more spam, including Joe Jobs.
I thought that this was OK because people who send us mail also have mail servers with queues, which should hold the mail until our server comes back up. It’s normal for mail servers to go down sometimes and this should not cause mail to be lost or returned.
However, we had a report that one of our users did not receive a mail addressed to them, and was told by the sender that it had bounced. I saw the bounce message and suspected Exchange, so I decided to check how long Exchange holds messages before bouncing them. Turns out it’s only five hours by default. Most mail servers hold mail for far longer, for example five days, sending a warning message back to the sender after one day.
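As a quick sanity check on those numbers, here is a small Python sketch. The calendar dates are placeholders (only the times of day come from this post), and `bounces` is a hypothetical helper that considers the worst case of a message queued just as the outage began, ignoring retry schedules.

```python
from datetime import datetime, timedelta

# Times of day are from the post (GMT); the dates are placeholders
# that only exist so the two timestamps can be subtracted.
went_down = datetime(2009, 1, 26, 22, 0)
came_back = datetime(2009, 1, 27, 10, 45)
outage = came_back - went_down


def bounces(outage, queue_lifetime):
    """Worst case: a message queued at the start of the outage is
    bounced if the sender gives up before the outage ends."""
    return queue_lifetime < outage


print(outage)                              # length of the outage
print(bounces(outage, timedelta(hours=5)))  # sender that gives up after five hours
print(bounces(outage, timedelta(days=5)))   # sender that keeps trying for five days
```

A sender that gives up after five hours bounces the message, while one that keeps trying for five days rides out the outage.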
Bounced messages reflect badly on us. Apart from making our main mail server more reliable 🙂 we need a backup MX to accept mail when the master is down.
However, I still want to minimise the spam problem that this will cause. Therefore I configured our backup MX to accept mail only when the master is down. Otherwise it defers the mail, which tells the sender to try again later, at which point it should be able to deliver to the master.
How did I achieve this magic? With a little Exim configuration that took me a day and that I’m quite proud of. I set up a new virtual machine which just has Exim on it, nothing else. I configured it as an Internet host, and to relay for our most important domains. Then I created /etc/exim4/exim4.conf.localmacros with the following contents:
    CHECK_RCPT_LOCAL_ACL_FILE=/etc/exim4/exim4.acl.conf
    callout_positive_expire = 5m
This allows us to create a file called /etc/exim4/exim4.acl.conf which contains additional ACL (access control list) conditions. The other change, callout_positive_expire, I’ll describe in a minute.
I created /etc/exim4/exim4.acl.conf with the following contents:
    # if we know that the primary MX rejects this address, we should too
    deny
      ! verify = recipient/callout=30s,defer_ok
      message = Rejected by primary MX

    # detect whether the callout is failing, without causing it to
    # defer the message. only a warn verb can do this.
    warn
      set acl_m_callout_deferred = true
      verify = recipient/callout=30s
      set acl_m_callout_deferred = false

    # if the callout did not fail, and the primary mail server is not
    # refusing mail for this address, then it's accepting it, so tell
    # our client to try again later
    defer
      ! condition = $acl_m_callout_deferred
      message = The primary MX is working, please use it

    # callout is failing, main server must be failing,
    # accept everything
    accept
      message = Accepting mail on behalf of primary MX
The first clause, which has a deny verb, does a callout to the recipient. A callout is an Exim feature which makes a test SMTP connection and starts the process of sending a mail, checking that the recipient would be accepted. This is designed to catch and block emails that the main server would reject. Our backup server has no idea what addresses are valid in our domains; only the primary knows that.
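To make the idea of a callout concrete, here is a rough Python approximation using the standard `smtplib` module. This is not how Exim implements it. The names `classify` and `callout` are hypothetical, and mapping 2xx to accept, 5xx to reject and everything else to defer is a simplification.

```python
import smtplib


def classify(code):
    """Map an SMTP reply code to a callout outcome (simplified)."""
    if 200 <= code < 300:
        return "accept"
    if 500 <= code < 600:
        return "reject"
    return "defer"  # 4xx, or anything unexpected


def callout(host, recipient, timeout=30.0, port=25):
    """Start an SMTP transaction, ask about one recipient, then quit
    without sending any message data."""
    try:
        with smtplib.SMTP(host, port, timeout=timeout) as smtp:
            smtp.ehlo_or_helo_if_needed()
            code, _reply = smtp.mail("")  # empty sender: MAIL FROM:<>
            if classify(code) != "accept":
                return "defer"  # simplification: cannot get as far as RCPT
            code, _reply = smtp.rcpt(recipient)
            return classify(code)
        # leaving the "with" block sends QUIT
    except (OSError, smtplib.SMTPException):
        # connection refused, timeout, dropped connection, and so on
        return "defer"
```

The important point is the third outcome: a callout can accept, reject, or fail to give an answer at all, and the rest of the ACL depends on telling those three apart.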
The callout response is cached for the default two hours if it returns a negative result (the recipient does not exist on the master) or for five minutes (see callout_positive_expire above) if the address does exist. We use the defer_ok option here so that if we fail to contact the master, we don’t defer the mail immediately, but instead assume that the address is OK and therefore continue to the next clause.
The second clause of the ACL, which has a warn verb, is what took me so long to work out. Normally, if a condition in a statement returns a result of defer, which means that it failed, the server will defer the whole message (tell the sender to come back later). In almost all cases this is the right thing to do, but it’s the exact opposite of what we want here. We want to accept mail if the callout is failing, not defer it, otherwise our backup MX is useless (it stops accepting mail if the primary goes down).
Because this is such an unusual thing to do, there is no configurable option for it in Exim. The only workaround that I found is that there is exactly one way to avoid a deferring condition causing the message to be deferred: a warn verb. The documentation for the warn verb says:
If any condition on a warn statement cannot be completed (that is, there is some sort of defer), the log line specified by log_message is not written… After a defer, no further conditions or modifiers in the warn statement are processed. The incident is logged, and the ACL continues to be processed, from the next statement onwards.
So what we do is:
- Set the local variable acl_m_callout_deferred to true;
- Try the callout. If it defers (cannot contact the primary server) then we stop processing the rest of the conditions in the warn statement, as described above;
- If we get to this point, we know that the callout did not defer, so we set acl_m_callout_deferred to false.
The third clause of the ACL, which has a defer verb, simply checks the variable that we set above. If we get this far then the primary server is not rejecting this address; and if it’s not deferring either, then it must be accepting mail for the address. In that case, we defer the message, telling our SMTP client to try again later, at which point it will hopefully succeed in delivering directly to the primary.
Callout result caching becomes a problem here. Suppose a callout had verified that a particular address existed, and that positive result was cached for the default 24 hours. If the master then went down, the backup MX would keep trusting the cached result and defer mail to that address for up to 24 hours, even though the master was unreachable. This is why we set callout_positive_expire to 5 minutes in the localmacros file above.
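The effect of the cache lifetime can be illustrated with a toy model in Python. It only models a single cached positive result with a fixed lifetime, and the function name and the example timings are made up for illustration.

```python
MINUTE = 60
HOUR = 60 * MINUTE


def cached_result_hides_outage(positive_ttl_s, verified_at_s, mail_at_s):
    """True if a positive callout result recorded at verified_at_s is
    still in the cache when mail arrives at mail_at_s, so the backup MX
    does not notice that the primary has gone down in the meantime."""
    return mail_at_s - verified_at_s < positive_ttl_s


# Assumed scenario: the address was verified at t=0, the primary died
# shortly afterwards, and new mail for that address arrives an hour later.
verified_at = 0
mail_at = 1 * HOUR

print(cached_result_hides_outage(24 * HOUR, verified_at, mail_at))   # default lifetime
print(cached_result_hides_outage(5 * MINUTE, verified_at, mail_at))  # 5m lifetime
```

With the 24 hour default the stale entry is still trusted and the mail is deferred. With a five minute lifetime the entry has expired, so a fresh callout is made and its failure is noticed.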
The fourth clause of the ACL, which has an accept verb, is even simpler. It accepts everything that was not denied or deferred earlier. We can only get this far if the master is neither accepting nor rejecting mail for that address, in other words if the callout itself is failing.
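The four clauses can be summarised as a small decision function. This is a Python model of the logic only, not something Exim runs. The function name is hypothetical and the callout outcome is passed in as one of three assumed strings.

```python
def backup_mx_response(callout_result):
    """Return (verb, message) for a recipient, given what the callout to
    the primary returned: "accept", "reject" or "defer"."""
    if callout_result not in ("accept", "reject", "defer"):
        raise ValueError(f"unknown callout result: {callout_result!r}")

    # Clause 1 (deny): with defer_ok a failed callout counts as a pass,
    # so only a definite rejection by the primary denies the mail.
    if callout_result == "reject":
        return ("deny", "Rejected by primary MX")

    # Clause 2 (warn): set the flag, try the callout again, and clear
    # the flag only if the callout completes.
    callout_deferred = True
    if callout_result == "accept":
        callout_deferred = False

    # Clause 3 (defer): the primary is up and accepts this address,
    # so ask the sender to retry and deliver there.
    if not callout_deferred:
        return ("defer", "The primary MX is working, please use it")

    # Clause 4 (accept): the callout is failing, so take the mail.
    return ("accept", "Accepting mail on behalf of primary MX")


for result in ("accept", "reject", "defer"):
    print(result, "->", backup_mx_response(result)[0])
```

Each possible answer from the primary maps to exactly one response from the backup: accept becomes defer, reject becomes deny, and a failed callout becomes accept.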
So far the configuration appears to work fine and has blocked 14 spam attempts (abusing the backup MX) in 14 hours.