Wget: A Way to Copy Entire Web Sites
Now, I know when I mention the downloading of Web sites, some of you squirm with anxiety about following copyright law. So, I’ll begin by saying please keep licenses and copyright laws in mind when playing with any tool that eases the downloading of Web sites.
If any of you have attended a general talk by Gary Price* within the last year, you might have heard him mention Wget. It’s a tool that allows someone to make a copy of a Web site for offline viewing. I’ve been trying to figure out how to do just that for a few Web sites that I either own the rights to or have permission to copy. After exploring other options that weren’t working so well for one of the sites, I tried Wget. It seems to have worked OK, though I haven’t thoroughly tested what it put on my machine. With Fink’s help, it installed beautifully. (I admit running into some problems when I tried to compile it myself. {Which C? Where’d it go?})
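For anyone who wants to try the same thing, here is a rough sketch of how a Wget mirror run might be scripted from Python. The options are standard Wget flags, but the URL, the output folder, and the function names are placeholders I made up for illustration, and this is not necessarily the exact set of options I ran.

```python
import shutil
import subprocess


def build_mirror_command(url, dest_dir, wait_seconds=1):
    """Return a wget argument list for making an offline copy of a site."""
    return [
        "wget",
        "--mirror",                        # recursive download with timestamping
        "--convert-links",                 # rewrite links so pages work offline
        "--page-requisites",               # also fetch images, stylesheets, etc.
        "--no-parent",                     # never climb above the starting URL
        f"--wait={wait_seconds}",          # pause between requests, to be polite
        f"--directory-prefix={dest_dir}",  # where the copy is saved
        url,
    ]


def mirror_site(url, dest_dir):
    """Run wget and return its exit code (non-zero if any request failed)."""
    if shutil.which("wget") is None:
        raise RuntimeError("wget is not installed or not on PATH")
    return subprocess.run(build_mirror_command(url, dest_dir)).returncode


if __name__ == "__main__":
    # Hypothetical URL and folder: substitute a site you have permission to copy.
    print(" ".join(build_mirror_command("https://example.org/blog/", "offline-copy")))
```

Building the command as a list and printing it first makes it easy to check what will be fetched before starting a long download.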
Some of us have pondered how to give people offline copies of our weblogs for extended reading or even just to make backups. Wget is one way to do those tasks. I tried to make a copy of the scratchpad with it, but I ran out of time. (That’s not to say Wget is slow. The scratchpad is rather large {it’s up to item 7000+ now thanks to comment spam} and time was short.) Wget saves Web pages as HTML files, so one advantage it has over Manila’s backup is giving me files from which I actually might be able to extract the content. The built-in backup mechanism for this software creates a file only another Manila administrator can install on a Manila platform, which is not particularly helpful for many of us bloggers.
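To show what I mean by extracting the content, here is a minimal sketch that pulls the readable text out of saved HTML files using only Python’s standard library. It assumes the copy lives in a folder of files ending in .html (Wget only guarantees that if you ask it to add the extension), and the class and function names are my own inventions, not part of Wget or Manila.

```python
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the text of one HTML page, one text fragment per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def extract_mirror(mirror_dir):
    """Map each saved .html file under mirror_dir to its plain text."""
    return {
        str(path): html_to_text(path.read_text(encoding="utf-8", errors="replace"))
        for path in Path(mirror_dir).rglob("*.html")
    }
```

This keeps navigation and sidebar text along with the posts, so a real export would still need some site-specific trimming.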
It’s too bad I wasn’t using Wget before Frassle crashed. I would love to have a copy of one of the blogs on my machine to show to people. Frassle keeps coming up in conversation. I’d find it very useful to be able to demo it. Maybe I should mention that to Shimon …
Anyway …
Here’s another instance where I’m really glad someone like Gary willingly travels the world to educate those of us who can’t keep up with everything. Great suggestion, Gary*!
*Correction: Per Gary’s comment below that he didn’t talk about Wget, I’ve removed his name from the appropriate parts of this post. I could have sworn I learned about it in two of his presentations during 2006, but I must be mistaken. I can’t remember who told me about it.





December 28th, 2006 at 1:03 pm
J-
Thanks for the very kind words about my work. However, I need to make a small correction. I do not talk about Wget in my presentations.
The tool I do talk about is also open source and a volunteer project. It has never failed me and installs in seconds. I’ve installed it on several systems and never had a problem.
HTTrack (3.41-BETA-13)
http://www.httrack.com
A new beta version came out in November.
Also, in the last 18 months or so, the HTTrack team has released ProxyTrack.
“ProxyTrack, is a standalone project aimed to help web archivists to easily build caches based on websites downloaded by httrack.”
http://www.httrack.com/proxytrack/
cheers,
gary
December 28th, 2006 at 11:59 pm
Hhhmmm … that’s odd. I have it in my notes from one of your talks. Maybe someone mentioned it to me at one of them and I added it there. Thanks for the clarification! My apologies!