Since last Friday I've been trying to pull down an HTML dump of Wikipedia.
The initial download took almost the whole weekend; I left it running on a work computer. On Monday I found a nice local edition of Wikipedia on my HDD.
Just a few glitches:
- the Main Page detection no longer worked, because the file format seems to have changed since the script was written
- the DB dump contains many stray "\n"s, which confuse the script in several ways (especially link and list detection)
Currently I'm trying to come up with a regular expression to clean up those stray "\n"s.
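The script itself isn't shown here, but as a rough illustration of the idea, here is a minimal Python sketch (the function name and the assumption that a stray "\n" can land inside `[[...]]` link markup are mine, not from the actual dump):

```python
import re

def fix_broken_links(text: str) -> str:
    """Join a lone newline that was inserted inside [[...]] link markup.

    Hypothetical repair: match an opening "[[" whose closing "]]" is
    interrupted by a single "\n", and replace that newline with a space
    so link detection sees the markup on one line again.
    """
    return re.sub(r'(\[\[[^\]\n]*)\n([^\]\n]*\]\])', r'\1 \2', text)

sample = "See [[Main\nPage]] for details."
print(fix_broken_links(sample))  # → See [[Main Page]] for details.
```

The same approach could be extended to list markers, though the real dump may need several passes for the different kinds of breakage.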
Everything else seems to work fine; the final dump, including redirects, currently weighs in at 4.71 GB, just right to be burned onto a DVD :)