I just found the following unusual message in my Exim logs:

2009-06-27 21:14:58 host name alias list truncated for 69.10.169.230

I guessed that this meant that the host had a long list of reverse name mappings (IP to name). Curious as to why, I did a DNS lookup on that IP:

chris@top ~ $ host 69.10.169.230 | wc -l
86

chris@top ~ $ host 69.10.169.230 | head -5
;; Truncated, retrying in TCP mode.
230.169.10.69.in-addr.arpa domain name pointer heavenlydonut.com.
230.169.10.69.in-addr.arpa domain name pointer pitrivertribe.org.
230.169.10.69.in-addr.arpa domain name pointer shastawebmail.com.
230.169.10.69.in-addr.arpa domain name pointer vidalvineyard.com.

So, the host has 85 names (one of the 86 lines is just the “Truncated, retrying in TCP mode” notice), right? And they all look like spam domains to me.

This looks like someone is trying hard to get around SMTP HELO verification, by providing a valid domain with forward and reverse lookups that map to their own IP. But they tried a bit too hard, because that’s a LONG list of domains. Nobody does that in the real world, I think.

So I decided to block mail from anyone with more than four reverse DNS entries. I have no idea what the collateral damage will be. I’m going to keep an eye on it.

Luckily, Exim makes this very easy:

defer
        set acl_c_ptr_count = ${reduce {${lookup dnsdb{>: \
                ptr=$sender_host_address}}} {0} {${eval:$value+1}}}
        condition = ${if >{$acl_c_ptr_count}{4}}
        message = Too many PTR records ($acl_c_ptr_count)

This counts the number of entries in the PTR list, stores it in a connection-scoped ACL variable, and tests whether that number is greater than four. If so, Exim defers the message (tells the sender to come back later). Deferring rather than rejecting gives me a chance to fix the rule, and still receive the message, if I discover that it's catching valid email.
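Before wiring the expansion into an ACL, it can be tested from the command line with exim -be, substituting a literal IP for $sender_host_address (which isn't set outside an SMTP session). Assuming your Exim build includes the dnsdb lookup, something like this should simply print the PTR count for that address:

  exim -be '${reduce {${lookup dnsdb{>: ptr=69.10.169.230}}} {0} {${eval:$value+1}}}'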

The code for counting the entries in a list is pretty ugly. I don’t suppose anyone wants to implement a “count” operation for lists in Exim?
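A hypothetical listcount operator would reduce the counting line to something like the following (this is just a sketch of the operator I have in mind, not something my Exim understands):

        set acl_c_ptr_count = ${listcount:${lookup dnsdb{>: ptr=$sender_host_address}}}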

Offline Wikipedia part 2

December 1, 2008

Having decided on a local MediaWiki installation, I started working through the import process. I noticed a few things that may help others.
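For reference, the import pipeline in question is roughly the following (assuming the Java importer is the usual mwdumper tool; the dump file name, database name and credentials are placeholders):

  java -jar mwdumper.jar --format=sql:1.5 enwiki-pages-articles.xml.bz2 | mysql -u wikiuser -p wikidb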

If one forgets to increase MySQL’s max_allowed_packet, the import breaks somewhere in the middle (around 3 million records), but the Java process keeps producing progress information, so it’s not at all obvious that the import has failed. One sign is that the import’s rate of progress, as reported by the import tool in pages per second, suddenly speeds up by a factor of 5-10. You may wish to look out for this and abort the import if it happens, and to monitor the import with mysqladmin processlist to make sure MySQL is still doing real work.
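For example (the file location and the 128M figure are only illustrative; pick whatever comfortably exceeds the largest article text), the packet limit and the monitoring look something like:

  # in my.cnf, under [mysqld] (restart mysqld before starting the import):
  max_allowed_packet = 128M

  # while the import runs, check that MySQL is still executing INSERTs:
  mysqladmin -u root -p processlist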

Installing the MediaWiki ParserFunctions extension solves most of the problems with random program code appearing in articles.
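Installing it (at least for the MediaWiki versions current at the time) means unpacking the extension into the extensions/ directory and adding one require line to LocalSettings.php; roughly, from the MediaWiki root:

  # after extracting the extension into extensions/ParserFunctions/:
  echo 'require_once( "$IP/extensions/ParserFunctions/ParserFunctions.php" );' >> LocalSettings.php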

The import will tend to slow down very badly over time. For example, on one system it started at a rate of 160 pages/second and dropped to 18 over a three-day period. At that rate it would have taken around 5-6 days to import all 7.5 million pages. Using MySQL’s ALTER TABLE ... DISABLE KEYS did not help much; what did help was restructuring the tables to remove all the indexes. You can even do this while the import is running (I did). The SQL commands are:

  • ALTER TABLE page MODIFY COLUMN page_id INT(10) UNSIGNED NOT NULL, DROP PRIMARY KEY, DROP INDEX name_title, DROP INDEX page_random, DROP INDEX page_len;
  • ALTER TABLE revision MODIFY COLUMN rev_id INT(10) UNSIGNED NOT NULL, DROP PRIMARY KEY, DROP INDEX rev_id, DROP INDEX rev_timestamp, DROP INDEX page_timestamp, DROP INDEX user_timestamp, DROP INDEX usertext_timestamp;
  • ALTER TABLE text MODIFY COLUMN old_id INT(10) UNSIGNED NOT NULL, DROP PRIMARY KEY;

The following SQL commands should restore the indexes after the import is complete. If you don’t do this, the MediaWiki site will be very slow in operation.

  • ALTER TABLE page MODIFY COLUMN page_id INT(10) UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY, ADD UNIQUE KEY name_title (page_namespace,page_title), ADD KEY page_random (page_random), ADD KEY page_len (page_len);
  • ALTER TABLE revision MODIFY COLUMN rev_id int(10) UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY, ADD UNIQUE KEY rev_id (rev_id), ADD KEY rev_timestamp (rev_timestamp),
    ADD KEY page_timestamp (rev_page,rev_timestamp), ADD KEY user_timestamp (rev_user,rev_timestamp), ADD KEY usertext_timestamp (rev_user_text,rev_timestamp);
  • ALTER TABLE text MODIFY COLUMN old_id int(10) UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY;
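A quick sanity check after restoring them is to list the indexes on each table and make sure they match the stock MediaWiki schema:

  SHOW INDEX FROM page;
  SHOW INDEX FROM revision;
  SHOW INDEX FROM text;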

With these changes I was able to achieve import speeds around fifty times faster, or 1000 pages per second, which should make it possible to import the entire Wikipedia in about 2 hours.