Offline Wikipedia

November 21, 2008

I’m working on making Wikipedia, the (in)famous free encyclopaedia, available offline, for a project in a school in rural Zambia where Internet access will be slow, expensive and unreliable.

What I’m looking for is:

  • Completely offline operation
  • Runs on Linux
  • Reasonable selection of content from English Wikipedia, preferably with some images
  • Looks and feels like the Wikipedia website (e.g. accessed through a browser)
  • Keyword search like the Wikipedia website

Tools that have built-in search engines usually require that you download a pages and articles dump file from Wikipedia (about 3 GB download) and then generate a search index, which can take from half an hour to five days.
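
To give an idea of what these tools are doing under the hood, here is a minimal sketch in Python (with the dump filename purely as an example) that streams the compressed dump and builds a crude title-only index in SQLite. Indexing full article text, which is what the tools below actually do, is the part that takes the hours or days.

```python
#!/usr/bin/env python3
# Minimal sketch: build a keyword index of article titles straight from a
# compressed pages-articles dump, without unpacking it to disk first.
# The dump filename is only an example; use whatever file you downloaded.
import bz2
import sqlite3
import xml.etree.ElementTree as etree

DUMP = "enwiki-pages-articles.xml.bz2"  # the ~3 GB download

db = sqlite3.connect("titles.sqlite")
db.execute("CREATE TABLE IF NOT EXISTS page (title TEXT)")

with bz2.open(DUMP, "rb") as stream:
    for _event, elem in etree.iterparse(stream):
        # Tag names carry an XML namespace that varies between dump
        # versions, so match on the local part only.
        if elem.tag.endswith("}title"):
            db.execute("INSERT INTO page VALUES (?)", (elem.text,))
        elif elem.tag.endswith("}page"):
            elem.clear()  # discard each finished page to keep memory low

db.commit()

# A crude title-only search, nothing like Wikipedia's full-text search:
for (title,) in db.execute("SELECT title FROM page WHERE title LIKE ?",
                           ("%Zambia%",)):
    print(title)
```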

For an open source project that seems ideally suited to offline use, and considering the amount of interest, there are surprisingly few ready-made options. They also took me a long time to find, so I’m collating the information here in the hope that it will help others. Here are my impressions of the solutions that I’ve tried so far, gathered from various sources including makeuseof.com.

The One True Wikipedia, for comparison

MediaWiki (the Wikipedia wiki software) can be downloaded and installed on a computer configured as an AMP server (Apache, MySQL, PHP). You can then import a Wikipedia database dump and use the wiki offline. This is quite a complex process, and importing takes a long time: about 4 hours for the articles themselves (on a 3 GHz P4). Apparently it takes days to build the search index (I’m testing this at the moment). This method does not include any images, as the image dump is apparently 75 GB and no longer appears to be available, and it displays some odd template codes in the text (shown in red below), which may confuse users.

Mediawiki local installation
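
As a rough illustration of that import process, here is a minimal Python sketch that drives MediaWiki’s stock maintenance scripts. The install path and dump filename are assumptions for a typical setup, and the final full-text index rebuild is the step that apparently takes days.

```python
#!/usr/bin/env python3
# Rough sketch of the offline import, using MediaWiki's own maintenance
# scripts.  The install path and dump filename below are assumptions;
# adjust them to match your setup.
import bz2
import shutil
import subprocess

MEDIAWIKI = "/var/www/mediawiki"        # assumed MediaWiki install directory
DUMP = "enwiki-pages-articles.xml.bz2"  # the ~3 GB pages-articles dump

def run_maintenance(script):
    """Run one of MediaWiki's stock PHP maintenance scripts."""
    subprocess.run(["php", f"{MEDIAWIKI}/maintenance/{script}"], check=True)

# 1. Stream the decompressed XML into importDump.php, which reads from
#    standard input (this was the roughly 4 hour step on a 3 GHz P4).
with bz2.open(DUMP, "rb") as xml_stream:
    importer = subprocess.Popen(
        ["php", f"{MEDIAWIKI}/maintenance/importDump.php"],
        stdin=subprocess.PIPE)
    shutil.copyfileobj(xml_stream, importer.stdin)
    importer.stdin.close()
    importer.wait()

# 2. Rebuild the derived tables, then the full-text search index;
#    the search index rebuild is the part that can take days.
run_maintenance("rebuildrecentchanges.php")
run_maintenance("rebuildtextindex.php")
```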

Wikipedia Selection for Schools is a static website, created by Wikimedia and SOS Children’s Villages, with a hand-chosen and checked selection of articles from the main Wikipedia, and images, that fit on a DVD or 3 GB of disk space. It’s available for free download using BitTorrent, which is rather slow. Although it looks like Wikipedia, it’s a static website, so while it’s easy to install, it has no search feature. It also has only 5,500 articles compared to the 2 million in Wikipedia itself (about 0.25%). Another review is on the Speed of Creativity Blog. Older versions are available here. (thanks BBC)

Wikipedia Selection for Schools

Zipedia is a Firefox plugin which loads and indexes a Wikipedia dump file. It requires a different dump file, containing the latest metadata (8 GB) instead of the usual one (3 GB). You can then access Wikipedia offline in your browser by going to a URL such as wikipedia://wiki. It does not support images, and the search feature only searches article titles, not their contents. You can pass the indexed data between users as a Zip file to save time and bandwidth, and you may be able to share this file between multiple users on a computer or a network. (thanks Ghacks.net)

WikiTaxi is a free Windows application which also loads and indexes Wikipedia dump files. It has its own user interface, which displays Wikipedia formatting properly (e.g. tables). It looks very nice, but it’s a shame that it doesn’t run on Linux.

WikiTaxi screenshot (wikitaxi.org)

Moulin Wiki is a project to develop open source offline distributions of Wikipedia content, based on the Kiwix browser. They claim that their 150 MB Arabic version contains an impressive 70,000 articles, and that their 1.5 GB French version contains the entire French Wikipedia, more than 700,000 articles. Unfortunately they have not yet released an English version.

Kiwix itself can be used to read a downloaded dump file, thereby giving access to the whole English Wikipedia via the 3 GB download. It runs on Linux only (as far as I know) and the user interface is a customised version of the Firefox browser. Unfortunately I could not get it to build on Ubuntu Hardy due to an incompatible change in Xulrunner. (The Kiwix developers told me that a new version would be released before the end of November 2008, but I haven’t been able to test it yet.)

Kiwix (and probably MoulinWiki)

Wikipedia Dump Reader is a KDE application which browses Wikipedia dump files. It generates an index on the first run, which took 5 hours on a 3 GHz P4, and you can’t use it until it’s finished. It doesn’t require extracting or uncompressing the dump file, so it’s efficient on disk space, and you can copy or share the index between computers. The display is in plain text, so it looks nothing like Wikipedia, and it includes some odd system codes in the output which could confuse users.

Wikipedia Dump Reader

Thanassis Tsiodras has created a set of scripts to extract Wikipedia article titles from the compressed dump, index them, and parse and display articles with a search engine. It’s a clever hack, but the user interface is quite rough, it doesn’t always work, it requires about twice the dump file size in additional data, and it was a pain to figure out how to get it working. It looks nothing like Wikipedia, although it is better than the Dump Reader above.

Thanassis Tsiodras' Fast Wiki with Search

Pocket Wikipedia is designed for PDAs, but apparently runs on Linux and Windows as well. The interface looks a bit rough, and I haven’t tested the keyword search yet. It doesn’t say exactly how many articles it contains, but my guess is that it’s about 3% of Wikipedia. Unfortunately it’s closed source, and as it comes from Romania, I don’t trust it enough to run it. (thanks makeuseof.com)

Pocket Wikipedia on Linux (makeuseof.com)

Wikislice allows users to download part of Wikipedia and view it using the free Webaroo client. Unfortunately this client appears to work only on Windows. (thanks makeuseof.com)

WikiSlice (makeuseof.com)

Encyclopodia is an open source project that puts Wikipedia on an iPod, but I want to use it on Linux.

Encyclopodia

It appears that if you need search and Linux compatibility, then running a real Wikipedia (MediaWiki) server is probably the best option, despite the time taken.

One of the reasons that using a Content Management System (or CMS) is so powerful is that it allows people with little to no knowledge of building web pages to set up and start publishing their very own site with hardly any fuss. Back in the day when I first learned how to design basic websites, it was through grubbing around finding quick and dirty fixes to HTML and CSS in order to get roughly where I wanted to be. Anyone starting out today could create a much funkier site in far less time and with far, far less effort.

Essentially, a CMS opens up online publishing to those with very limited web literacy, allowing them to get cracking on their content. That’s very empowering.

But there’s a catch

However, one of the repercussions of this very welcome deskilling is that we are left in the hands of the CMS providers. Without any understanding of how to create your own sleek, bandwidth-friendly site, you’re often left with the clunky scripts, unnecessarily large graphics and general baggage that comes with your template design.

Aptivate created a set of Guidelines to help people understand that, whilst in the developed world it’s tempting to believe that bandwidth is infinite, if you don’t clear out the junk (which you may not even notice) from your websites you may unintentionally be preventing those in developing nations from using your site at all.

Unfortunately, those who are using a CMS to create their websites are far less likely to have the know-how or the confidence to get into the bowels of their site in order to make something usable for those on low bandwidth. This means that, for this particular audience, we need to provide solutions for them, rather than attempting to train the whole world up from scratch on how to create a low-bandwidth CMS template.

I’ve long argued that a useful addendum to our guidelines would be a set of good-looking and functional low-bandwidth templates for the main CMS providers. Providing a way to strip down Drupal, Blogger, WordPress, et al. would be phenomenally useful for those without the skills to do this for themselves.

Effectively this would be a low-bandwidth website in a box, which even current users of these CMSs could transfer over to without much fuss. Currently the best alternative I’ve found for my regular (Blogger) blog, The Daily (Maybe), is to provide a link to a loband-provided version, which is certainly faster but is a bit of a hassle and doesn’t allow me any real flexibility over layout.

Maneno: a blogging platform for Africa

So imagine my joy when I came across Maneno last week. A CMS blogging platform designed specifically with low bandwidth in mind and provided from servers in Africa, cutting down on slow internal connections. As the blurb says “Maneno strives to provide a communication and development platform for Sub-Saharan Africa.”

It looks good and provides all the functionality you need in a decent website. The online feedback I’ve seen so far has been universally positive, particularly around download times, which can massively increase the expense of browsing the net in the very places where this service needs to be as cheap as possible. That is really important. In the words of blogger White African, “The site absolutely flies.”

Although Maneno is still in a beta version it works like a dream and looks very impressive. It seems just the ticket if you are setting up a new site with little knowledge of design and want to ensure potential readers in Africa actually get the opportunity to read what you have to say.

I apologise for not writing another article sooner. I’d like to respond to some of the comments on my previous article, IT in the field, by some of the people I mentioned, Jeff Allen and Jon Thompson. I’ll include their comments and my responses inline.

Jon writes:

1) Almost no one uses PDA’s. Mobiles – yes. (See my post on Nokia Data Gathering.)

I would argue that I’m coming from a different position: not what’s currently in the field and being used, but what could be. I think that Jon is referring to a similar kind of future possibility in his follow-on article. But I’m also thinking in terms of what an organisation (NGO or government) can do to avoid these problems in the first place, by changing the way that they deploy technology.

I accept completely that PDAs are not commonly used in the field at the moment. However, I’d suggest that organisations take a much closer look at deploying PDAs (and mobile phones) instead of general-purpose PCs in current and future projects, as it would avoid most of the problems that Jeff reports and that we are discussing here.

2) Is anyone buying and deploying Inveneo? Who is Aleutia? What is their penetration. Odds are Jeff’s guy in Congo will never know about either.

I’m not intending to put the onus on Jeff’s guy in Congo, or any other end user, to select appropriate technology for their circumstances. I don’t think they have the technical skills or buying power to do so. The choice was already made for them by the organisation that supplied a PC running Windows without antivirus updates. I believe that this was a bad choice, driven by the factors that I gave in my previous article, and that different choices could be made in future to avoid such problems.

3) Webmail will stay the same until the cows come home. Sure, someone can write tricky code but will the guy in Congo ever see it? Probably not.

Actually, I have seen comments asking why Google doesn’t use their own Gears library for their own webmail service, particularly in the context of recent Gmail service outages. If they did decide to do so, the benefits should translate immediately to users with low-bandwidth or intermittent connections, in the Congo as elsewhere.

4) Mac OSX does not exist in the rest of the world and barely in the Balkans, some of which are even slated for EU membership.

I don’t see that as a strong argument against deploying alternatives such as Mac OS X in new systems. We have a problem: poor maintainability of traditional IT systems in the field. We have to find solutions to those problems, which may involve deploying different technology. The same argument applies to Ubuntu.

5) So you do support Ubuntu adoption as an alternative? I think you are getting my point. You could just install Ubuntu and forget about the AVG argument.

But the end user will not deploy Ubuntu. The providers of the equipment will have to do it for them. That’s my point, that the organisations have to change, not the end users.

Vertical programs that distribute hardware almost never follow-up so anything they put in the field is usually toast within a few months.

That’s precisely my point in the last article, even if I rambled a bit in the process of getting there. But I think it would be great to have hard numbers that we could present as evidence to the organisations to get them to recognise the problems in what they’re doing, and to change their ways.

For example, let us just say that something causes the OS to hiccup so the well meaning local IT guy steps in and recommends reinstalling the OS. Of course the owner doesn’t have one so the IT guy offers one of his own. Bootleg install. Story over.

That is certainly true for general-purpose computers, but even a minimal amount of BIOS lock-down can prevent the reinstallation of the software, or make it significantly harder. There’s also the question of why the man in the Congo is turning to his “local IT guy” for technical support in the first place. Why isn’t the organisation that provided the computer providing support for it as well?

9) Forget about NGO’s as they are not the problem. Remember, this guy worked for the Congolese Gov’t so he gets whatever trickles down (a few odd machines from international agencies)

These “international agencies” are precisely who I meant by NGOs. Perhaps I should have included IOs as well, but all this jargon is going to get very confusing to anyone who’s not an expert in the field. Is there any reason not to lump NGOs, IOs and government programmes all into the same basket, as “organisations”?

His support team? The local DVD vendor and the well meaning IT tech. Therefore, educate the local health official, the DVD vendor and the local IT tech…

And why are they his only (or primary) means of support? MSF sent Jeff out to deal with a problem that wasn’t even to do with any equipment that they provided (as far as we know), but it was stopping the man in the Congo from interacting with MSF.

Had MSF not sent Jeff to the Congo, you, me and Jeff would be none the wiser.

Indeed, but had the original organisation not sent out the wrong equipment (a Windows computer with no support and no antivirus updates), MSF would not have had to waste a lot of valuable time, money and resources on sending Jeff out in the first place. Also, had MSF provided their own, more suitable equipment for the man in the Congo to use, Jeff would not have had to make an on-site visit either.

I’d like to respond to Jeff’s comments as well, but this article is already getting rather long and ranty, so I’ll leave that for another day.