Coup'la stuff

Whew, I haven't had time to post anything for quite a while. Nevertheless, some interesting stuff has happened in the meantime.

First, I downloaded the German Wikipedia DVD I mentioned in my previous post and had a look at it. Quite nice work, guys. The only thing I miss is that the included Digibib viewer doesn't automatically open the Wikipedia database, so I always have to choose "Run from CD/DVD" manually. The cause could also be on my side: I installed Digibib from the mounted .iso file of Wikipedia, which had one drive letter, uninstalled it, burned the image to DVD and reinstalled Digibib from another drive letter. As far as I remember, when I just had the image mounted it DID actually start up Wikipedia by itself, but my memory could be betraying me.

Second, I tinkered with several things for my own offline Wikipedia and gained some more insight. I managed to get a local MediaWiki instance running with a rather recent database dump. But then I quickly noticed that my idea of crawling a local instance isn't that trivial. For example, Wikipedia now seems to return a full HTML page even if I just want to view a single image. It also seems to create some sort of thumbnail on the originating page, which is IMHO just wasted space for an offline edition. Another point is that if I crawled that local instance, the spider would probably create the same directory structure it encounters on Wikipedia: everything in one directory. More than 500,000 entries in a single directory are not acceptable for me because it would take aeons to open that directory in a file browser. There are also other obstacles like page layout and links to interactive pages (configurations), but I guess these could be avoided with some skin hacking. All in all, I have suspended that idea for now and gone back to the wiki2static alternative.
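One way around the everything-in-one-directory problem would be to spread the pages over nested subdirectories derived from a hash of the article title. The following is only a rough sketch of that idea, not something the spider or wiki2static does: the function name sharded_path, the two-level layout and the choice of SHA-256 are all my own assumptions for illustration.

```python
import hashlib
from pathlib import PurePosixPath


def sharded_path(title: str, levels: int = 2, ext: str = ".html") -> str:
    """Map an article title to a nested relative path.

    Hypothetical helper: each of the first `levels` hex digits of the
    title's hash becomes one directory level, so two levels give
    16 * 16 = 256 buckets.
    """
    digest = hashlib.sha256(title.encode("utf-8")).hexdigest()
    # One hex character per directory level.
    parts = [digest[i] for i in range(levels)]
    # A slash in a title would otherwise create an extra directory level.
    safe_name = title.replace("/", "_")
    return str(PurePosixPath(*parts, safe_name + ext))
```

With two levels, 500,000 articles would end up at roughly 2,000 entries per directory on average, which a file browser should cope with much better. The catch is that every link inside the pages would have to be rewritten to the new paths.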

On that side I had successes as well as failures. I fiddled around a bit with last year's dump, and I think that for that specific dump I have made good progress on some problems:

  • Size: I found a rather comfortable way to free up some space. With the batch conversion mode of IrfanView it is possible to reduce the quality of almost all image files so that the difference is hardly visible but loads of space are saved. The advanced settings of the batch conversion even allow keeping the whole directory structure of the media files intact. My particular settings were to resave all JPG-like files at 75% quality and remove all EXIF and other optional information. I also had all PNG files saved with maximum compression. Other optimizations were surely possible, but the two above already gave me more than 1 gigabyte of space, enough to fit all content on a single DVD now.
  • Burning tool: Since Nero turned out to be completely useless for such a vast number of files, I looked for alternatives. I found one in K3b for Linux, which is at least able to import all Wikipedia files into a project. For that step alone I had to create a gigantic (1.8 GB) swap file for Linux, which was about 80% used after the import into K3b. I didn't actually try to burn, because I tried to navigate inside the project, which caused massive swapping. I killed K3b after about an hour or so. Next time I won't try navigating but will burn right after the import.
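To check whether a converted tree would fit on a disc before throwing it at a burning tool, a small script can count the files and add up their sizes. This is just a sketch under my own assumptions: the names tree_stats and fits_on_dvd are made up, and the capacity constant is the nominal 4.7 GB of a single-layer DVD, ignoring any file system overhead on the disc.

```python
import os
from typing import Tuple

# Assumed nominal capacity of a single-layer DVD (4.7 GB, decimal).
DVD_CAPACITY_BYTES = 4_700_000_000


def tree_stats(root: str) -> Tuple[int, int]:
    """Return (number of files, total size in bytes) below `root`."""
    file_count = 0
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.islink(full):
                continue  # skip symlinks so nothing is counted twice
            file_count += 1
            total_bytes += os.path.getsize(full)
    return file_count, total_bytes


def fits_on_dvd(total_bytes: int, capacity: int = DVD_CAPACITY_BYTES) -> bool:
    """Rough check only: real discs lose some space to the file system."""
    return total_bytes <= capacity
```

Calling tree_stats on the media directory before and after the batch conversion would also show how much space the conversion actually saved.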

Well, all these problems will have to be re-evaluated with the recent version of Wikipedia, because there is now much more content, articles as well as media files, to be considered.

Third, I have looked into CrystalSpace development again and dug through some API docs, tutorials and articles. Slowly I'm getting a bit more comfortable with the whole thing. Perhaps there will be more on that from me in the future. I'm quite confident, because this CrystalSpace stuff has been in my head for years now.

