Wikipedia offline edition - again

Update: There is an easier way available now, although without images. See Wikipedia offline edition - reloaded for details.


I'm still trying to somehow pack the Wikipedia onto a DVD. Still with no luck.

But today I finally managed to copy all the files from my work computer, where I had stored them until now, to my personal notebook, where processing that pile of files should be much faster.

Since all of this is still a dump of Wikipedia from somewhere around July or October of last year, I was thinking about how I could update it. The real problem is that Wikipedia has grown significantly beyond the size of a single DVD-R and has had quite a few layout improvements, like the side tables that are especially visible on scientific pages.

Also, the wiki2static.pl Perl script is quite outdated, which introduces further problems.

Well, what are my options?

  • Download a ready-to-burn image file. Con: Only the German edition is available.
  • Fix up wiki2static.pl. Con: I'd need to somehow reverse-engineer the Wikipedia software itself.
  • Run a web spider across en.wikipedia.org. Con: I'd have to carefully fine-tune the spider's settings so it doesn't fetch pages I don't want (discussions etc.). Also, I don't want to stress Wikipedia's bandwidth that much.

All are possible, but none is really acceptable for me. But the last one was sexy, so I racked my brain a bit more and came up with a fourth possibility:

  1. Download the Wikipedia database, which contains the data for all pages
  2. Download MediaWiki, the Wikipedia engine
  3. Download Uniform Server, a small-footprint all-in-one web server package
  4. Set up Uniform Server, launch the MediaWiki platform and import the whole Wikipedia database

At this point, I should have a locally running instance of Wikipedia that lacks only the media files, because they are not included in the database. But theoretically it should be possible, through MediaWiki itself or an Apache URL rewrite, to redirect all media links to the real files.
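To make the redirect idea a bit more concrete, here is a minimal sketch of the mapping such a rewrite would have to perform. It assumes the remote server uses MediaWiki's hashed upload layout (the first one and two hex digits of the MD5 of the file name as directory names). `REMOTE_BASE` and `remote_media_url` are hypothetical names for illustration, and the base URL is only a placeholder.

```python
import hashlib
from urllib.parse import quote

# Placeholder base URL (assumption); replace with the real upload location.
REMOTE_BASE = "http://example.org/upload"


def remote_media_url(filename: str, base: str = REMOTE_BASE) -> str:
    """Map a media file name to its assumed remote URL.

    Assumes the hashed upload layout: <base>/<h>/<hh>/<name>, where h and hh
    are the first one and two hex digits of the MD5 of the normalized name.
    """
    # Normalize the way page titles are stored: underscores, first letter upper-case.
    name = filename.strip().replace(" ", "_")
    name = name[:1].upper() + name[1:]
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return f"{base}/{digest[0]}/{digest[:2]}/{quote(name)}"


if __name__ == "__main__":
    print(remote_media_url("example image.jpg"))
```

Whether the live site really follows this layout would have to be checked before relying on it; the point is only that the target URL can be computed from the file name alone, which is what makes a rewrite rule feasible.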

Then I start up a web spider and crawl my own Wikipedia instance. The advantage is that I can adjust the layout and contents of the pages just as I like.
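The filtering a spider would need to skip discussion pages and the like could be sketched as follows. This assumes the local instance serves articles under `/wiki/Title` and that edit and history views use query strings; `ARTICLE_ROOT`, `EXCLUDED_NAMESPACES` and `should_crawl` are hypothetical names, and the namespace list is an illustrative selection rather than a complete one.

```python
from urllib.parse import unquote, urlsplit

# Assumed article path of the local instance (hypothetical).
ARTICLE_ROOT = "/wiki/"

# Illustrative selection of non-article namespaces, lower-cased.
EXCLUDED_NAMESPACES = {"user", "wikipedia", "special", "template", "mediawiki"}


def should_crawl(url: str) -> bool:
    """Return True if the URL looks like a plain article page."""
    parts = urlsplit(url)
    # Edit, history and similar views carry a query string; skip them.
    if parts.query:
        return False
    path = unquote(parts.path)
    if not path.startswith(ARTICLE_ROOT):
        return False
    title = path[len(ARTICLE_ROOT):]
    if not title:
        return False
    namespace, sep, _rest = title.partition(":")
    if sep:
        ns = namespace.lower()
        # Skip every talk namespace ("Talk", "User_talk", ...) and the listed ones.
        if ns.endswith("talk") or ns in EXCLUDED_NAMESPACES:
            return False
    return True


if __name__ == "__main__":
    for candidate in ("http://localhost/wiki/Albert_Einstein",
                      "http://localhost/wiki/Talk:Albert_Einstein"):
        print(candidate, should_crawl(candidate))
```

Article titles that merely contain a colon are still crawled, because only known namespace prefixes are rejected.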

The final problem will be the sheer size of everything. I was thinking of down-sampling all of the images to a slightly lower quality, which should free up some more space.

I really have to think that over a bit more.
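A rough way to see how far the images would have to shrink is to measure the crawled files and compare them against the disc capacity. The sketch below assumes a nominal single-layer capacity of 4.7 GB (decimal) and that pages and images sit in separate directories; `directory_size` and `required_image_ratio` are hypothetical helper names.

```python
import os

# Nominal single-layer DVD capacity in bytes (assumption; usable space is slightly different).
DVD_CAPACITY_BYTES = 4_700_000_000


def directory_size(root: str) -> int:
    """Sum the sizes of all regular files below root, ignoring symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def required_image_ratio(text_bytes: int, image_bytes: int,
                         capacity: int = DVD_CAPACITY_BYTES):
    """Return the factor the image data must shrink to in order to fit.

    1.0 means everything already fits; None means the pages alone are too big.
    """
    remaining = capacity - text_bytes
    if remaining <= 0:
        return None
    if image_bytes <= remaining:
        return 1.0
    return remaining / image_bytes


if __name__ == "__main__":
    # Made-up sizes for illustration only.
    print(required_image_ratio(1_000_000_000, 7_400_000_000))
```

With real numbers from `directory_size`, the ratio tells how aggressive the down-sampling would need to be, or whether it cannot work at all.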

Comments

Nice work. I have been thinking about getting Wikipedia on my laptop for some time now, but I ran into the problem of no ready-made tools to do it for me.

I was thinking of getting a locally running copy of MediaWiki, and I thought about crawling. Never did connect the two :D.

I wish you the best of luck, and I will check back on your site to see what's going on.

P.S. What do you think of also applying the same technique to Wiktionary? The Wiktionary DB is a bit less than 100MB.

Meanwhile I have tried out several things regarding an offline Wikipedia; see my latest post for my experiences.

About Wiktionary... Well, I had stumbled over the name but never actually looked at what it is all about. I did so after seeing your comment, and it looks nice too.

It would surely be an option after I have mastered the problems with Wikipedia itself.

Thanks for the hint!

Hi, my name is Olajide Olaolorun and I am the Project Manager of The Uniform Server. I would like to thank you for the mention :D and also say that when you do get Wikipedia on DVD, please send me a link so I can add it to our site/wiki.

Thanks again :D

Oops... oh my, I am so sorry. I didn't know it went through. I was getting an error message and sometimes the page didn't go anywhere, so I thought it was not working...

I am so sorry...

No problem Olajide, I just deleted the excess entries.

Sadly it turned out that this approach was almost a dead end, see my post http://www.freepgs.com/kosi2801/2005/05/06/coupla_stuff.html

It's not because of Uniform Server, whose setup went through without any problems or warnings; it's just that MediaWiki's internal structure is not very friendly to read-only media.

But let me tell you that I'm still using Uniform Server as a development platform whenever I want to try out minor web-service stuff I stumble across on the net here and there. It has almost everything I need. Great work, thanks.