Offline Wikipedia part 2

December 1, 2008

Having decided on a local MediaWiki installation, I started working through the import process. I noticed a few things that may help others.

If one forgets to increase the MySQL max_packet_size, then the import breaks somewhere in the middle (around 3 million records) but the Java process keeps producing progress information, so it’s not at all clear that the import has failed. One sign is that the import process rate of progress, as reported by the import tool in pages per second, suddenly speeds up by a factor of 5-10. You may wish to look out for this and abort the import if it happens, and to monitor the import process with mysqladmin processlist to ensure that it’s still doing things.

Installing the MediaWiki ParserFunctions extension solves most of the problems with random program code appearing in articles.

The import will tend to slow down very badly over time. For example, on one system it started at a rate of 160 pages/second and dropped to 18 over a three-day period. At this rate, it would have taken around 5-6 days to import all 7.5 million pages. Using the MySQL disable keys command did not help much, but what did was to restructure the tables to remove all the indexes. You can even do this while the import is running (I did). The SQL commands are:

  • ALTER TABLE page MODIFY COLUMN page_id INT(10) UNSIGNED NOT NULL, DROP PRIMARY KEY, DROP INDEX name_title, DROP INDEX page_random, DROP INDEX page_len;
  • ALTER TABLE revision MODIFY COLUMN rev_id INT(10) UNSIGNED NOT NULL, DROP PRIMARY KEY, DROP INDEX rev_id, DROP INDEX rev_timestamp, DROP INDEX page_timestamp, DROP INDEX user_timestamp, DROP INDEX usertext_timestamp;
  • ALTER TABLE text MODIFY COLUMN old_id INT(10) UNSIGNED NOT NULL, DROP PRIMARY KEY;

The following SQL commands should restore the indexes after the import is complete. If you don’t do this, the MediaWiki site will be very slow in operation.

  • ALTER TABLE page MODIFY COLUMN page_id INT(10) UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY, ADD UNIQUE KEY name_title (page_namespace,page_title), ADD KEY page_random (page_random), ADD KEY page_len (page_len);
  • ALTER TABLE revision MODIFY COLUMN rev_id int(10) UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY, ADD UNIQUE KEY rev_id (rev_id), ADD KEY rev_timestamp (rev_timestamp),
    ADD KEY page_timestamp (rev_page,rev_timestamp), ADD KEY user_timestamp (rev_user,rev_timestamp), ADD KEY usertext_timestamp (rev_user_text,rev_timestamp);
  • ALTER TABLE text MODIFY COLUMN old_id int(10) UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY;

With these changes I was able to achieve import speeds around fifty times faster, or 1000 pages per second, which should make it possible to import the entire Wikipedia in about 2 hours.

Advertisements