Wikipedia offline edition - again
Update: There is now an easier way available, although without images. See Wikipedia offline edition - reloaded for details.
I'm still trying to somehow pack the Wikipedia onto a DVD. Still with no luck.
But today I finally managed to copy all the files from my work computer, where I had stored them until now, to my personal notebook, where processing that pile of files should be much faster.
Since all of this is still a dump of Wikipedia from somewhere around July or October last year, I was thinking about how I could update it. The real problem is that Wikipedia has grown well beyond the capacity of a single DVD-R and has had quite a few layout improvements, like those side tables that are especially visible on scientific pages.
Also, the wiki2static.pl Perl script is quite outdated, which introduces further problems.
Well, what are my options?
- Download a ready-to-burn image file. Con: Only the German edition.
- Fix up wiki2static.pl. Con: I'd need to somehow reverse-engineer the Wikipedia-software itself.
- Run a web spider across en.wikipedia.org. Con: I'd have to carefully fine-tune the spider's settings to not get pages I don't want (discussions etc.). Also, I don't want to stress Wikipedia's bandwidth that much.
All possible but none really acceptable for me. But the last one was sexy so I stressed my mind a bit more and came up with a fourth possibility:
- Download the Wikipedia database, containing all page data
- Download MediaWiki, the Wikipedia engine
- Download Uniform Server, a small-footprint all-in-one webserver package
- Set up the Uniform Server, launch the MediaWiki platform and import the whole Wikipedia database
At this point, I should have a locally running instance of Wikipedia, lacking only the media files because they are not included in the database. But theoretically it should be possible, through MediaWiki or Apache URL rewriting, to redirect all media links to the real files.
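To picture the redirect idea: MediaWiki by default stores uploads in hashed subdirectories derived from the MD5 of the file name, so a media link could in principle be mapped to its remote location. The following is only a rough sketch, assuming the files live under `https://upload.wikimedia.org/wikipedia/commons` (not every file does) and that the default hashed layout is in use. The function name `remote_media_url` is made up for illustration.

```python
import hashlib
from urllib.parse import quote

# Assumed base URL of the real media files; adjust for the wiki in question.
REMOTE_BASE = "https://upload.wikimedia.org/wikipedia/commons"

def remote_media_url(filename, base=REMOTE_BASE):
    """Map a media file name to its URL under MediaWiki's hashed upload layout."""
    # MediaWiki stores names with underscores instead of spaces
    # and, by default, an upper-cased first letter.
    name = filename.strip().replace(" ", "_")
    name = name[:1].upper() + name[1:]
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    # Directory layout: <first hex digit>/<first two hex digits>/<name>
    return f"{base}/{digest[0]}/{digest[:2]}/{quote(name)}"
```

The same mapping could be expressed as a rewrite rule in the webserver instead; the Python version just makes the path structure easy to see.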
Then I start up a web spider and crawl my own Wikipedia instance. The advantage is that I can adjust the layout and contents of the pages just as I like.
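The spider still needs a rule for which pages to fetch, so that discussion pages and the like stay out. A minimal sketch of such a filter, assuming the local instance serves articles under `http://localhost/wiki/`. Both the host and the path prefix are assumptions, and the list of skipped namespaces is an illustrative subset, not a complete one.

```python
from urllib.parse import urlsplit, unquote

# Namespace prefixes to skip; an illustrative subset, not a complete list.
SKIP_PREFIXES = ("Talk:", "User:", "User_talk:", "Special:", "Wikipedia_talk:")

def should_crawl(url, host="localhost"):
    """Return True if the URL looks like a plain article page on the local wiki."""
    parts = urlsplit(url)
    if parts.hostname != host:
        return False  # never leave the local instance
    if parts.query:
        return False  # skip edit/history views such as ?action=edit
    prefix = "/wiki/"  # assumed article path of the local install
    if not parts.path.startswith(prefix):
        return False
    title = unquote(parts.path[len(prefix):])
    return bool(title) and not title.startswith(SKIP_PREFIXES)
```

With this, `should_crawl("http://localhost/wiki/Perl")` is accepted, while `http://localhost/wiki/Talk:Perl` and anything on en.wikipedia.org are rejected.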
The final problem will be the sheer size of everything. I was thinking of down-sampling all of the images to a slightly lower quality, which should free up some more space.
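How much the images would have to shrink is simple arithmetic once the sizes are known. A small sketch, assuming the nominal 4.7 GB capacity of a single-layer DVD-R. The function name is hypothetical and the byte counts in the usage note are invented placeholders, not measurements of the real dump.

```python
DVD_R_BYTES = 4_700_000_000  # nominal single-layer DVD-R capacity

def image_scale_needed(text_bytes, image_bytes, capacity=DVD_R_BYTES):
    """Fraction of the current image size that still fits next to the text.

    Returns a value in (0, 1]; raises ValueError if the text alone is too big.
    """
    room = capacity - text_bytes
    if room <= 0:
        raise ValueError("text alone exceeds the disc capacity")
    if image_bytes <= room:
        return 1.0  # everything fits without down-sampling
    return room / image_bytes
```

For example, with a made-up 2,000,000,000 bytes of pages and 5,400,000,000 bytes of images, the function returns 0.5, meaning the images would have to be squeezed to half their current total size.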
I really have to think that over a bit more.