I just found the following unusual message in my Exim logs:

2009-06-27 21:14:58 host name alias list truncated for 69.10.169.230

I guessed that this meant that the host had a long list of reverse name mappings (IP to name). Curious as to why, I did a DNS lookup on that IP:

chris@top ~ $ host 69.10.169.230 | wc -l
86

chris@top ~ $ host 69.10.169.230 | head -5
;; Truncated, retrying in TCP mode.
230.169.10.69.in-addr.arpa domain name pointer heavenlydonut.com.
230.169.10.69.in-addr.arpa domain name pointer pitrivertribe.org.
230.169.10.69.in-addr.arpa domain name pointer shastawebmail.com.
230.169.10.69.in-addr.arpa domain name pointer vidalvineyard.com.

So, the host has 86 names, right? And they all look like spam domains to me.

This looks like someone is trying hard to get around SMTP HELO verification, by providing a valid domain with forward and reverse lookups that map to their own IP. But they tried a bit too hard, because that’s a LONG list of domains. Nobody does that in the real world, I think.

So I decided to block mail from anyone with more than four reverse DNS entries. I have no idea what the collateral damage will be. I’m going to keep an eye on it.

Luckily, Exim makes this very easy:

defer
        set acl_c_ptr_count = ${reduce {${lookup dnsdb{>: \
                ptr=$sender_host_address}}} {0} {${eval:$value+1}}}
        condition = ${if >{$acl_c_ptr_count}{4}}
        message = Too many PTR records ($acl_c_ptr_count)

This counts the number of entries in the PTR list, assigns it to a local variable, and tests whether that number is greater than four. If so, it defers the message (tells the sender to come back later). This gives me a chance to fix it if I discover that it’s rejecting valid email, and still get the message.
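
For anyone who doesn’t read Exim’s string expansion language, here is a rough Python sketch of what the counting expression does. It only illustrates the idea and is not how Exim implements it. The function names and the limit parameter are made up for this example, and it assumes the lookup result arrives as a colon-separated list (which is what the >: separator asks for).

```python
def count_list_items(ptr_list: str, separator: str = ":") -> int:
    """Count items in a separator-delimited list, like the reduce idiom.

    An empty string is treated as an empty list (zero items).
    """
    if not ptr_list:
        return 0
    count = 0
    for _item in ptr_list.split(separator):
        count += 1  # plays the role of {${eval:$value+1}} starting from {0}
    return count


def too_many_ptr_records(ptr_list: str, limit: int = 4) -> bool:
    """Return True when the host has more PTR names than the limit allows."""
    return count_list_items(ptr_list) > limit


# Hypothetical input: the first four names from the lookup shown earlier.
names = "heavenlydonut.com:pitrivertribe.org:shastawebmail.com:vidalvineyard.com"
print(count_list_items(names))      # 4
print(too_many_ptr_records(names))  # False, because 4 is not greater than 4
```

Note that the test is strictly "greater than", so a host with exactly four names is still allowed through.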

The code to count the number of entries in a list is pretty ugly. I don’t suppose anyone wants to implement a “count” operation to count the number of items in a list in Exim?

I usually use Linux firewalls for traffic shaping, because the power of the traffic control (tc) system exceeds FreeBSD’s dummynet in most ways.

Dummynet can be used to create arbitrary delays and packet loss, which is very useful for simulating poor connections, but not for sharing bandwidth and prioritising packets between different traffic classes on a real traffic shaper.

However, I’ve just been testing PF (the new standard packet filter) and ALTQ (the alternative queueing system) on FreeBSD, and I’m impressed by the capabilities. It does annoy me that ALTQ is not enabled in the default kernel, so you have to compile your own kernel. I used the following commands:

cd /boot
cp -p kernel GENERIC # backup the current kernel
cd /usr/src/sys/i386/conf
cp GENERIC ~/ALTQ
ln -s ~/ALTQ .
vi ALTQ

and added the following lines to my new kernel configuration file, which I called ALTQ:

options ALTQ
options ALTQ_RED
options ALTQ_RIO
options ALTQ_HFSC
options ALTQ_PRIQ

and then compiled and installed the new kernel:

cd /usr/src
make buildkernel KERNCONF=ALTQ
make installkernel KERNCONF=ALTQ

and then reboot to load the new kernel. After that, we need to create a pf configuration. I prefer HFSC over CBQ queueing, because:

  • HFSC is guaranteed accurate, whereas CBQ is approximate
  • CBQ requires you to guess the average packet size and its accuracy depends entirely on this
  • HFSC has service curves which allow you to deliver small files quickly and drop the priority of large connections (e.g. file downloads) with great ease.

I prefer PF+ALTQ over linux TC because:

  • PF and ALTQ are fully integrated and configured using the same file, whereas TC has its own (very hard to use) classifier. I normally use the iptables CLASSIFY target to classify traffic instead, but this is not integrated.
  • TC is very hard to use generally. The authors seem more concerned with functionality than usability.
  • ALTQ has named queues which helps usability enormously compared to TC’s hex numbered classes.
  • ALTQ gives very low delay when the interface is not 100% saturated, which seems impossible to achieve with TC.

Here is a sample configuration of PF+ALTQ that I used for testing on a transparent bridging firewall (bridge0 connecting em0 and em1):

altq on em1 hfsc bandwidth 1Mb queue { ftp, ssh, icmp, other }
queue ftp bandwidth 30% priority 0 hfsc (upperlimit 99%)
queue ssh bandwidth 30% priority 2 hfsc (upperlimit 99%)
queue icmp bandwidth 10% priority 2 hfsc (upperlimit 99%)
queue other bandwidth 30% priority 1 hfsc (default upperlimit 99%)
pass out quick on bridge0 inet proto tcp from any port 21 to any queue ftp
pass out quick on bridge0 inet proto tcp from any port 22 to any queue ssh
pass out quick on bridge0 inet proto icmp from any to any queue icmp
pass out quick on bridge0 all

We are only queueing on em1 here, which is the downstream, so we are only limiting downloads. We deliberately limit them to 1 Mbps for testing. The limit should always be lower than your actual download bandwidth, to ensure that the queue is on the FreeBSD firewall and not any other device.

We create four named queues under the root queue, which is implicitly named root_em1. We reserve 30% of the bandwidth each for FTP, SSH and other traffic, and 10% for ICMP. Any class can exceed its reserved bandwidth, up to its upperlimit. The upperlimit defaults to 100%, which would let one class cause delays to traffic in the other classes, so we override it to 99%.
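
As a rough check of the numbers, this Python sketch works out the reserved rate and the ceiling for each queue. It assumes that pf reads 1Mb as 1,000,000 bits per second and that the percentages are taken from the parent queue’s bandwidth. The function name is made up for this example.

```python
from typing import Dict, Tuple


def queue_rates(link_bps: int, shares: Dict[str, int],
                upperlimit_pct: int = 99) -> Dict[str, Tuple[int, int]]:
    """Return {queue: (reserved bit/s, ceiling bit/s)} for percentage shares."""
    rates = {}
    for name, pct in shares.items():
        reserved = link_bps * pct // 100            # guaranteed share
        ceiling = link_bps * upperlimit_pct // 100  # most it may borrow up to
        rates[name] = (reserved, ceiling)
    return rates


# Shares copied from the sample configuration above.
shares = {"ftp": 30, "ssh": 30, "icmp": 10, "other": 30}
rates = queue_rates(1_000_000, shares)
for name, (reserved, ceiling) in rates.items():
    print(f"{name}: reserved {reserved} bit/s, ceiling {ceiling} bit/s")
```

The reservations add up to exactly 100% of the link, while every queue shares the same 99% ceiling.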

Note that even though we create the queues on the em1 device, we must filter packets on bridge0, as otherwise our traffic does not match our pf rules.

Update: I found some more information about traffic shaping and advanced usage of HFSC, including realtime guaranteed classes for VoIP.

As seen on Slashdot:

Adobe uses a proprietary encrypted communications system between their Flash player and their Media Server product. This is intended to ensure that only people who pay for Flash Media Server can stream Flash movies, and only official clients can access them.

In other words, it’s a copy protection (DRM) scam. It’s completely antithetical to the goals of running a free software desktop or serving content to users using free software. However, despite Adobe’s claims, it doesn’t actually provide any security except through the obscurity of the protocol and some short secret keys.

lkcl claims to have created an open source, clean-room implementation of this protocol, called RTMPE, and published it on Sourceforge. Despite promising in January to open RTMP, Adobe wants to protect their revenue stream, so they sent a DMCA takedown notice to Sourceforge, who complied by censoring the project.

If you value your freedom to publish and receive Flash videos using free software, help us fight Adobe and embarrass SourceForge by nominating rtmpdump for “Best Project for Multimedia” in the SourceForge Community Choice awards.

If you just want to download it, here are some handy links now that it’s been censored by SourceForge: LKCL sehe.nl megashare.com mininova.org sumotorrent.com fulldls.com btjunkie.org mybittorrent.com demonoid.com mininova/TOR.

Live CDs are great. In particular, they’re a great way to try out software, knowing that the chances of damaging the host system are minimal and you can throw away the entire system if you want to.

Sometimes you want to use a live CD environment without a CD. CDs are slow, get lost and scratched, and require a CD drive. If you’re going to use live environments a lot, you’d probably prefer to boot them over the network from a machine with a hard disk and a cache.

Luckily, Ubuntu’s live CD includes all the necessary support to do this easily, if you know how to use it. Unfortunately, it’s not really documented as far as I can tell. Please correct me if I’m wrong about this.

I managed to make the live CD boot over the network on a PXE client using the following steps.

  • set your DHCP server up to hand off to a TFTP server. For example, add the following lines to your subnet definition in /etc/dhcp3/dhcpd.conf:
    next-server 10.0.156.34;
    filename "pxelinux.0";
    
  • get a copy of pxelinux.0 from the pxelinux package and put it in the tftproot of your TFTP server.
  • copy the casper directory off the CD and put it into your tftproot as well.
  • get an NFS server on your network to loopback-mount the Desktop ISO (e.g. ubuntu-8.04.2-desktop-i386.iso) and export the mount directory through NFS. Let’s say your NFS server is 1.2.3.4 and the ISO is mounted at /var/nfs/ubuntu/live. Edit /etc/exports on the server and export the mount directory to the world by adding the following line:
    /var/nfs/ubuntu/live *(ro,all_squash,no_subtree_check)
    
  • put the following section into your tftproot/pxelinux.cfg/default file:
    DEFAULT live-804
    LABEL live-804
      kernel casper/vmlinuz
      append file=/cdrom/preseed/ubuntu.seed boot=casper initrd=casper/initrd.gz netboot=nfs nfsroot=1.2.3.4:/var/nfs/ubuntu/live quiet splash --
    
  • test that the PXE client boots into the live CD environment
  • if it doesn’t, remove the “quiet splash” from the end of the “append” line and boot it again, to see where it gets stuck.
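
If you need more than one of these entries, the boot stanza can be generated rather than typed. This is a minimal Python sketch under some assumptions: the function and its parameters are invented for this example, and the default kernel and initrd paths assume that both files sit in the casper directory copied into the tftproot, so pass whatever paths match your own layout.

```python
def pxelinux_entry(label, nfs_server, nfs_path,
                   kernel="casper/vmlinuz", initrd="casper/initrd.gz",
                   verbose=False):
    """Build one pxelinux.cfg stanza for NFS-booting a casper live image."""
    options = [
        "file=/cdrom/preseed/ubuntu.seed",
        "boot=casper",
        f"initrd={initrd}",
        "netboot=nfs",
        f"nfsroot={nfs_server}:{nfs_path}",
    ]
    if not verbose:
        # Leave these out when debugging, to see where the boot gets stuck.
        options += ["quiet", "splash"]
    options.append("--")
    return "\n".join([
        f"LABEL {label}",
        f"  kernel {kernel}",
        f"  append {' '.join(options)}",
    ])


entry = pxelinux_entry("live-804", "1.2.3.4", "/var/nfs/ubuntu/live")
print(entry)
```

Calling it with verbose=True gives the debugging variant without “quiet splash”.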

I hope this helps someone, and that NFS-booting a live environment will be properly documented (better than this!) one day.

(Also filed on Ubuntu bug 296089.)

Fouad Bajwa writes of an unusual deal between the Pakistani government and Microsoft, on the s-asia-it mailing list:

To all members of the IT Industry & Technical Community,

Everyone is well aware that global financial recession has hit even the Tech Giants where companies like Microsoft and Intel have being saying goodbye to thousands of their employees. The situation doesn’t seem to be getting better but interestingly our Pakistani National ICT R&D Fund is thinking about helping Microsoft in Pakistan and we from the industry feel that it is sad that instead of supporting local Hi-Tech Start-ups and struggling IT Entrepreneurs [they are]  funding the usual “Non-Useful” activities like conferences [and] so-called accelerator programs for Pakistan…

To be fair, they have funded a number of open source projects, and funding for conferences and other networking activities is always in short supply for those without a significant marketing budget.

I have come to know through my friends in the IT Industry that the National ICT R&D Fund has signed an MoU with Microsoft to fund the Microsoft Developers Conference and something called an “Innovators Accelerator Program”. The funds haven’t been disbursed yet but it definitely annoys me and many of my friends in the IT industry that our government should fund Microsoft initiatives which is already a global giant. I have heard that around 5 million rupees [about USD 60,000] or thereabouts for the innovation accelerator program which will involve Microsoft training, entrepreneurship training and connecting with Microsoft partners and similar amounts related.

I also find it strange that Pakistan would choose to invest money in Microsoft at this time, despite their obvious experience and competence with open source. Others come to the Fund’s defence, saying:

ICT R&D Fund is one of the few institutions in the country that are doing an excellent job… [it] is the role of a funding agency to encourage collaborations for promoting research cultures and provide help in bringing the best minds closer.

But nobody has denied that the Fund has signed an MoU with Microsoft, or argued for its benefit to Pakistan. Fouad also writes:

When will our national institutions support its people, the vulnerable, not the already empowered? Why doesn’t it support the local entrepreneurs, the ones that don’t have large companies or university backings? Why does it have liabilities to include universities whereas it knows what the state of R&D in universities has been except for a few handful? Why doesn’t it include this money for Social Enterprise and created a NATIONAL INCUBATION AND ACCELERATION CENTRE where people like me or you or anyone can walk in and build their ideas and companies?

Ashiq Anjum replies that “No funding agency can build incubators for industry, probably this is outside of their scope.” But the Fund’s stated goal is “To transform Pakistan’s economy into a knowledge based economy by promoting efficient, sustainable and effective ICT initiatives through synergic development of industrial and academic resources.”

It sounds entirely reasonable on this basis for them to assist university graduates in gaining skills that are useful in the knowledge industry, and in setting up their own companies in the knowledge industry. Indeed, another stated goal is to “make Pakistan an attractive destination for service oriented and research and development related outsourced jobs.”

We can establish centres like http://www.socialinnovation.ca/ and help local entrepreneurs in business development and social innovation with the same amount of money[.] That helps and benefits our people and companies directly as well as innovate for local and international markets.

I agree that all countries should support local development, training and entrepreneurship as much as possible.

zdnet.com reports that ‘In an effort to improve Web users’ compatibility experience, Microsoft added a new, user-selectable Compatibility List to the Release Candidate test version of IE 8 that the company released in January… Microsoft describes the list — Version 1.0 of which includes 2,400 sites that don’t render properly in IE 8 (in other words, an “incompatibility list”) – as a tool designed to “make sure IE8 customers have a great experience with highly trafficked sites that have not yet fully accomodated IE8’s better implementation of web standards.”‘

(read more from the horse’s business end at http://blogs.msdn.com/ie/archive/2009/02/16/just-the-facts-recap-of-compatibility-view.aspx)

I think this is interesting. On the one hand Microsoft has finally (finally!) decided to bite the bullet and fix some of the bugs in IE that cause web developers so much pain. In my experience, supporting IE’s buggy CSS takes about as much effort as developing the CSS for Firefox in the first place.

Microsoft has always used the excuse before that users would view sites that rendered badly in a new standards-compliant IE and blame IE for the problems. This is an understandable, if self-serving excuse. Perhaps with IE’s market share below 70%, they feel that they can no longer get away with it on the basis of user base alone.

On the other hand, the list has some very interesting entries, apart from nearly every Chinese website in existence:

  • amazon.com
  • blogger.com
  • ebay.com
  • facebook.com
  • google.com
  • live.com
  • microsoft.com
  • msn.com
  • myspace.com
  • wikipedia.org
  • yahoo.com
  • youtube.com

I can’t think of a high-profile site that’s not on the list. I think Microsoft has asked a million monkeys to beta-test IE8 and they’re hitting the error report button randomly.

Otherwise, I can only assume that IE8 doesn’t support any websites at all. Perhaps this is the EU-competition-commission version of IE8 that they were testing?

(thanks to PC The Great at lugm.org for the heads-up)

Open source in Government

February 17, 2009

The Register has an interesting article about various open source vendors’ latest attempt to legislate their way into the healthcare system, and why it’s doomed to fail.

I found it well-written and convincing right up to the last paragraph but one:

If open source is going to make any real headway in the government, there needs to be an incentive to choose it, not a rule. Time and again, this is where the open source community falls short: Quality code isn’t enough of an incentive. You can put the best engineering in the world into your product, but if you don’t know how to market, your project will rot in the source repository.

Uhh, non sequitur? Needs to be an incentive to choose it => needs better marketing? Where’s the incentive in marketing? Surely the incentive should be that it’s a better product or that it saves money or time, not that it has flashing lights all over it?

Backup Mail Exchangers

January 28, 2009

On Monday night, the power supply unit (PSU) in the server that hosts our mail server failed at around 2200 GMT. We don’t have physical access to the server out of hours, so I wasn’t able to replace it until about 1045 the next day. Our main email server was down for nearly 13 hours.

We didn’t have a backup MX because:

  • It usually can’t check whether recipients are valid or not, and therefore must accept mail that it can’t deliver;
  • It usually doesn’t have as good antispam checks as the primary, because it’s a hassle to keep it updated;
  • Spammers usually abuse backup MXes to send more spam, including Joe Jobs.

I thought that this was OK because people who send us mail also have mail servers with queues, which should hold the mail until our server comes back up. It’s normal for mail servers to go down sometimes and this should not cause mail to be lost or returned.

However, we had a report that one of our users did not receive a mail addressed to them, and was told by the sender that it had bounced. I saw the bounce message and suspected Exchange, so I decided to check how long Exchange holds messages before bouncing them. Turns out it’s only five hours by default. Most mail servers hold mail for far longer, for example five days, sending a warning message back to the sender after one day.

Bouncing messages looks bad on us. Apart from making our main mail server more reliable :) we need a backup MX to accept mail when the master is down.

However, I still want to minimise the spam problem that this will cause. Therefore I configured our backup MX to accept mail only when the master is down. Otherwise it defers the mail, which tells the sender to try again, at which point it should reach the master.

How did I achieve this magic? With a little Exim configuration that took me a day and that I’m quite proud of. I set up a new virtual machine which just has Exim on it, nothing else. I configured it as an Internet host, and to relay for our most important domains. Then I created /etc/exim4/exim4.conf.localmacros with the following contents:

CHECK_RCPT_LOCAL_ACL_FILE=/etc/exim4/exim4.acl.conf
callout_positive_expire = 5m

This allows us to create a file called /etc/exim4/exim4.acl.conf which contains additional ACL (access control list) conditions. The other change, callout_positive_expire, I’ll describe in a minute.

I created /etc/exim4/exim4.acl.conf with the following contents:

# if we know that the primary MX rejects this address, we should too
deny
        ! verify = recipient/callout=30s,defer_ok
        message = Rejected by primary MX

# detect whether the callout is failing, without causing it to
# defer the message. only a warn verb can do this.
warn
        set acl_m_callout_deferred = true
        verify = recipient/callout=30s
        set acl_m_callout_deferred = false

# if the callout did not fail, and the primary mail server is not
# refusing  mail for this address, then it's accepting it, so tell
# our client to try again later
defer
        ! condition = $acl_m_callout_deferred
        message = The primary MX is working, please use it

# callout is failing, main server must be failing,
# accept everything
accept
        message = Accepting mail on behalf of primary MX

The first clause, which has a deny verb, does a callout to the recipient. A callout is an Exim feature which makes a test SMTP connection and starts the process of sending a mail, checking that the recipient would be accepted. This is designed to catch and block emails that the main server would reject. Our backup server has no idea what addresses are valid in our domains; only the primary knows that.

The callout response is cached for the default two hours if it returns a negative result (the recipient does not exist on the master) or five minutes (see callout_positive_expire above) if the address does exist. We use a defer_ok condition here so that if we fail to contact the master, we don’t defer the mail immediately, but instead assume that the address is OK and therefore continue to the next clause.

The second clause of the ACL, which has a warn verb, is what took me so long to work out. Normally, if a condition in a statement returns a result of defer, which means that it failed, the server will defer the whole message (tell the sender to come back later). In almost all cases this is the right thing to do, but it’s the exact opposite of what we want here. We want to accept mail if the callout is failing, not defer it, otherwise our backup MX is useless (it stops accepting mail if the primary goes down).

Because this is such an unusual thing to do, there is no configurable option for it in Exim. The only workaround that I found is that there is exactly one way to avoid a deferring condition causing the message to be deferred: a warn verb. The documentation for the warn verb says:

If any condition on a warn statement cannot be completed (that is, there is some sort of defer), the log line specified by log_message is not written… After a defer, no further conditions or modifiers in the warn statement are processed. The incident is logged, and the ACL continues to be processed, from the next statement onwards.

So what we do is:

  1. Set the local variable acl_m_callout_deferred to true;
  2. Try the callout. If it defers (cannot contact the primary server) then we stop processing the rest of the conditions in the warn statement, as described above;
  3. If we get to this point, we know that the callout did not defer, so we set acl_m_callout_deferred to false.

The third clause of the ACL, which has a defer verb, simply checks the variable that we set above. If we get this far then the primary server is not rejecting this address; and if it’s not deferring either, then it must be accepting mail for the address. In that case, we defer the message, telling our SMTP client to try again later, at which point it will hopefully succeed in delivering directly to the primary.

Callout result caching becomes a problem here. If a previous callout had verified that a particular address exists, and that result were cached for the default 24 hours, then the backup MX would keep deferring mail to that address for up to 24 hours, even if the master went down in the meantime. This is why we reduced the positive callout cache time to 5 minutes earlier.

The fourth clause of the ACL, which has an accept verb, is even simpler. It accepts everything that was not denied or deferred earlier. We can only get this far if the master is not accepting or rejecting mail for that address.
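
To summarise the four clauses, here is a small Python model of the decision that the ACL makes for each recipient. It is only a sketch of the logic and not anything that Exim runs. The Callout enum and the function name are invented for this example, and result caching is ignored.

```python
from enum import Enum


class Callout(Enum):
    ACCEPTED = "accepted"  # primary would accept this recipient
    REJECTED = "rejected"  # primary would refuse this recipient
    DEFERRED = "deferred"  # primary unreachable or temporarily failing


def backup_mx_decision(callout: Callout) -> str:
    """Return the verb that ends ACL processing for one recipient."""
    # Clause 1 (deny): only a definite rejection counts, because defer_ok
    # lets a failed callout pass as if the address were fine.
    if callout is Callout.REJECTED:
        return "deny"

    # Clause 2 (warn): the flag starts out true and is only cleared if the
    # callout completes, mirroring how warn abandons its remaining
    # modifiers after a defer.
    callout_deferred = True
    if callout is not Callout.DEFERRED:
        callout_deferred = False

    # Clause 3 (defer): the primary is up and accepting, so send the
    # client back to it.
    if not callout_deferred:
        return "defer"

    # Clause 4 (accept): the primary is down, so queue the mail here.
    return "accept"


for outcome in Callout:
    print(outcome.value, "->", backup_mx_decision(outcome))
```

The three possible callout outcomes map onto three different verbs, which is the whole point of the configuration.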

So far the configuration appears to work fine and has blocked 14 spam attempts (abusing the backup MX) in 14 hours.

Offline Wikipedia part 2

December 1, 2008

Having decided on a local MediaWiki installation, I started working through the import process. I noticed a few things that may help others.

If you forget to increase MySQL’s max_allowed_packet setting, the import breaks somewhere in the middle (around 3 million records), but the Java process keeps printing progress information, so it’s not at all clear that the import has failed. One sign is that the rate of progress reported by the import tool, in pages per second, suddenly speeds up by a factor of 5-10. Look out for this and abort the import if it happens, and monitor the import with mysqladmin processlist to ensure that it’s still doing things.
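
If you would rather not watch the output by hand, the “sudden speed-up” check could be automated along these lines. This is a hypothetical Python sketch: the function name, the window size and the factor of 5 are my assumptions (the factor comes from the 5-10 times jump described above), and collecting the pages-per-second samples from the import tool is left out.

```python
def import_probably_failed(rates, factor=5.0, window=3):
    """Guess whether the import has silently failed.

    rates is a list of pages-per-second samples, oldest first. The latest
    sample is compared with the average of the `window` samples before it.
    """
    if len(rates) <= window:
        return False  # not enough history to judge yet
    baseline = sum(rates[-window - 1:-1]) / window
    return baseline > 0 and rates[-1] >= factor * baseline


print(import_probably_failed([150, 140, 130, 125]))  # False: a normal slowdown
print(import_probably_failed([150, 140, 130, 900]))  # True: suspicious jump
```

A positive result is only a hint to check mysqladmin processlist, not proof that the import has died.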

Installing the MediaWiki ParserFunctions extension solves most of the problems with random program code appearing in articles.

The import will tend to slow down very badly over time. For example, on one system it started at a rate of 160 pages/second and dropped to 18 over a three-day period. At this rate, it would have taken around 5-6 days to import all 7.5 million pages. Using MySQL’s ALTER TABLE ... DISABLE KEYS did not help much; what did help was restructuring the tables to remove all the indexes. You can even do this while the import is running (I did). The SQL commands are:

  • ALTER TABLE page MODIFY COLUMN page_id INT(10) UNSIGNED NOT NULL, DROP PRIMARY KEY, DROP INDEX name_title, DROP INDEX page_random, DROP INDEX page_len;
  • ALTER TABLE revision MODIFY COLUMN rev_id INT(10) UNSIGNED NOT NULL, DROP PRIMARY KEY, DROP INDEX rev_id, DROP INDEX rev_timestamp, DROP INDEX page_timestamp, DROP INDEX user_timestamp, DROP INDEX usertext_timestamp;
  • ALTER TABLE text MODIFY COLUMN old_id INT(10) UNSIGNED NOT NULL, DROP PRIMARY KEY;

The following SQL commands should restore the indexes after the import is complete. If you don’t do this, the MediaWiki site will be very slow in operation.

  • ALTER TABLE page MODIFY COLUMN page_id INT(10) UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY, ADD UNIQUE KEY name_title (page_namespace,page_title), ADD KEY page_random (page_random), ADD KEY page_len (page_len);
  • ALTER TABLE revision MODIFY COLUMN rev_id INT(10) UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY, ADD UNIQUE KEY rev_id (rev_id), ADD KEY rev_timestamp (rev_timestamp), ADD KEY page_timestamp (rev_page,rev_timestamp), ADD KEY user_timestamp (rev_user,rev_timestamp), ADD KEY usertext_timestamp (rev_user_text,rev_timestamp);
  • ALTER TABLE text MODIFY COLUMN old_id INT(10) UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY;

With these changes I was able to achieve import speeds around fifty times faster, or 1000 pages per second, which should make it possible to import the entire Wikipedia in about 2 hours.
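
The arithmetic behind those estimates is easy to check. This short Python sketch uses the page count and rates quoted in this post; the function name is invented for the example. The slow figure assumes the rate stays at 18 pages per second, so it is a lower bound, because the rate was still falling.

```python
def import_hours(pages, pages_per_second):
    """Hours needed to import `pages` at a constant rate."""
    return pages / pages_per_second / 3600


slow = import_hours(7_500_000, 18)
fast = import_hours(7_500_000, 1000)
print(f"{slow / 24:.1f} days at 18 pages/s")  # 4.8 days
print(f"{fast:.1f} hours at 1000 pages/s")    # 2.1 hours
print(f"speed-up: about {1000 / 18:.0f}x")    # about 56x
```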

Offline Wikipedia

November 21, 2008

I’m working on making Wikipedia, the (in)famous free encyclopaedia, available offline, for a project in a school in rural Zambia where Internet access will be slow, expensive and unreliable.

What I’m looking for is:

  • Completely offline operation
  • Runs on Linux
  • Reasonable selection of content from English Wikipedia, preferably with some images
  • Looks and feels like the Wikipedia website (e.g. accessed through a browser)
  • Keyword search like the Wikipedia website

Tools that have built-in search engines usually require that you download a pages and articles dump file from Wikipedia (about 3 GB download) and then generate a search index, which can take from half an hour to five days.

For an open source project that seems ideally suited to being used offline, and considering the amount of interest, there are surprisingly few options (already developed). They also took me a long time to find, so I’m collating the information here in the hope that it will help others. Here are my impressions of the solutions that I’ve tried so far, gathered from various sources including makeuseof.com.

The One True Wikipedia, for comparison

MediaWiki (the Wikipedia wiki software) can be downloaded and installed on a computer configured as an AMP server (Apache, MySQL, PHP). You can then import a Wikipedia database dump and use the wiki offline. This is quite a complex process, and importing takes a long time, about 4 hours for the articles themselves (on a 3 GHz P4). Apparently it takes days to build the search index (I’m testing this at the moment). This method does not include any images, as the image dump is apparently 75 GB and no longer appears to be available. It also displays some odd template codes in the text (shown in red below) which may confuse users.

Mediawiki local installation

Wikipedia Selection for Schools is a static website, created by Wikimedia and SOS Children’s Villages, with a hand-chosen and checked selection of articles from the main Wikipedia, and images, that fit on a DVD or 3 GB of disk space. It’s available for free download using BitTorrent, which is rather slow. Although it looks like Wikipedia, it’s a static website, so while it’s easy to install, it has no search feature. It also has only 5,500 articles compared to the 2 million in Wikipedia itself (about 0.3%). Another review is on the Speed of Creativity Blog. Older versions are available here. (thanks BBC)

Wikipedia Selection for Schools

Zipedia is a Firefox plugin which loads and indexes a Wikipedia dump file. It requires a different dump file, containing the latest metadata (8 GB) instead of the usual one (3 GB). You can then access Wikipedia offline in your browser by going to a URL such as wikipedia://wiki. It does not support images, and the search feature only searches article titles, not their contents. You can pass the indexed data between users as a Zip file to save time and bandwidth, and you may be able to share this file between multiple users on a computer or a network. (thanks Ghacks.net)

WikiTaxi is a free Windows application which also loads and indexes Wikipedia dump files. It has its own user interface, which displays Wikipedia formatting properly (e.g. tables). It looks very nice, but it’s a shame that it doesn’t run on Linux.

WikiTaxi screenshot (wikitaxi.org)

Moulin Wiki is a project to develop open source offline distributions of Wikipedia content, based on the Kiwix browser. They claim that their 150 MB Arabic version contains an impressive 70,000 articles, and that their 1.5 GB French version contains the entire French Wikipedia, more than 700,000 articles. Unfortunately they have not yet released an English version.

Kiwix itself can be used to read a downloaded dump file, thereby giving access to the whole English Wikipedia via the 3 GB download. It runs on Linux only (as far as I know) and the user interface is a customised version of the Firefox browser. Unfortunately I could not get it to build on Ubuntu Hardy due to an incompatible change in Xulrunner. (Kiwix developers told me that a new version would be released before the end of November 2008, but I wasn’t able to test it yet).

Kiwix (and probably MoulinWiki)

Wikipedia Dump Reader is a KDE application which browses Wikipedia dump files. It generates an index on the first run, which took 5 hours on a 3 GHz P4, and you can’t use it until it’s finished. It doesn’t require extracting or uncompressing the dump file, so it’s efficient on disk space, and you can copy or share the index between computers. The display is in plain text, so it looks nothing like Wikipedia, and it includes some odd system codes in the output which could confuse users.

Wikipedia Dump Reader

Thanassis Tsiodras has created a set of scripts to extract Wikipedia article titles from the compressed dump, index them, parse and display them with a search engine. It’s a clever hack, but the user interface is quite rough, it doesn’t always work, and it requires about twice the dump file size in additional data. It was a pain to figure out how to use it and get it working. It looks nothing like Wikipedia, although it does look better than the Dump Reader above.

Thanassis Tsiodras' Fast Wiki with Search

Pocket Wikipedia is designed for PDAs, but apparently runs on Linux and Windows as well. The interface looks a bit rough, and I haven’t tested the keyword search yet. It doesn’t say exactly how many articles it contains, but my guess is that it’s about 3% of Wikipedia. Unfortunately it’s closed source, and as it comes from Romania, I don’t trust it enough to run it. (thanks makeuseof.com)

Pocket Wikipedia on Linux (makeuseof.com)

Wikislice allows users to download part of Wikipedia and view it using the free Webaroo client. Unfortunately this client appears only to work on Windows. (thanks makeuseof.com)

WikiSlice (makeuseof.com)

Encyclopodia puts the open source project on an iPod, but I want to use it on Linux.

Encyclopodia

It appears that if you need search and Linux compatibility, then running a real Wikipedia (MediaWiki) server is probably the best option, despite the time taken.