wget usage tutorial

Examples


The examples are divided into three sections for clarity. The first section is a tutorial for beginners. The second section explains some of the more complex program features. The third section contains advice for mirror administrators, as well as even more complex features (that some would call perverted).

Simple Usage

  • Say you want to download a URL. Just type:
    wget http://fly.cc.fer.hr/
    
    The response will be something like:
    --13:30:45--  http://fly.cc.fer.hr:80/
               => `index.html'
    Connecting to fly.cc.fer.hr:80... connected!
    HTTP request sent, fetching headers... done.
    Length: 1,749 [text/html]
    
        0K -> .
    
    13:30:46 (68.32K/s) - `index.html' saved [1749/1749]
    
  • But what will happen if the connection is slow and the file is lengthy? The connection will probably fail before the whole file is retrieved, perhaps more than once. In that case, Wget will keep trying to get the file until it either retrieves all of it or exceeds the default number of retries (which is 20). It is easy to change the number of tries to 45, to ensure that the whole file arrives safely:
    wget --tries=45 http://fly.cc.fer.hr/jpg/flyweb.jpg
    
  • Now let's leave Wget to work in the background and write its progress to the log file `log'. It is tiring to type `--tries', so we shall use `-t'.
    wget -t 45 -o log http://fly.cc.fer.hr/jpg/flyweb.jpg &
    
    The ampersand at the end of the line makes sure that Wget runs in the background. To remove the limit on the number of retries, use `-t inf'.
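    For instance, to keep retrying indefinitely while still logging to `log', the same command becomes:
    wget -t inf -o log http://fly.cc.fer.hr/jpg/flyweb.jpg &
    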
  • Using FTP is just as simple; Wget takes care of the login and password.
    $ wget ftp://gnjilux.cc.fer.hr/welcome.msg
    --23:35:55--  ftp://gnjilux.cc.fer.hr:21/welcome.msg
               => `welcome.msg'
    Connecting to gnjilux.cc.fer.hr:21... connected!
    Logging in as anonymous ... Logged in!
    ==> TYPE I ... done.  ==> CWD not needed.
    ==> PORT ... done.    ==> RETR welcome.msg ... done.
    Length: 1,340 (unauthoritative)
     
        0K -> .
     
    23:35:56 (37.39K/s) - `welcome.msg' saved [1340]
    
  • If you specify a directory, Wget will retrieve the directory listing, parse it and convert it to HTML. Try:
    wget ftp://prep.ai.mit.edu/pub/gnu/
    lynx index.html
    

Advanced Usage

  • You would like to read the list of URLs from a file? Not a problem:
    wget -i file
    
    If you specify `-' as the file name, the URLs will be read from standard input.
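    For example, you can pipe a plain list of URLs straight into Wget; the file name `url-list.txt' here is only a placeholder:
    cat url-list.txt | wget -i -
    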
  • Create a mirror image of the GNU WWW site (with the same directory structure as the original) with only one try per document, saving the log of the activities to `gnulog':
    wget -r -t1 http://www.gnu.ai.mit.edu/ -o gnulog
    
  • Retrieve the first layer of Yahoo links:
    wget -r -l1 http://www.yahoo.com/
    
  • Retrieve the index.html of `www.lycos.com', showing the original server headers:
    wget -S http://www.lycos.com/
    
  • Save the server headers together with the file, so you can inspect them later:
    wget -s http://www.lycos.com/
    more index.html
    
  • Retrieve the first two levels of `wuarchive.wustl.edu', saving them to /tmp.
    wget -P/tmp -l2 ftp://wuarchive.wustl.edu/
    
  • You want to download all the GIFs from an HTTP directory. `wget http://host/dir/*.gif' doesn't work, since HTTP retrieval does not support globbing. In that case, use:
    wget -r -l1 --no-parent -A.gif http://host/dir/
    
    It is a bit of a kludge, but it works. `-r -l1' means to retrieve recursively (See section Recursive Retrieval), with a maximum depth of 1. `--no-parent' means that references to the parent directory are ignored (See section Directory-Based Limits), and `-A.gif' means to download only the GIF files. `-A "*.gif"' would have worked too.
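    The equivalent invocation with the quoted wildcard would be:
    wget -r -l1 --no-parent -A "*.gif" http://host/dir/
    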
  • Suppose you were in the middle of downloading when Wget was interrupted. Now you do not want to clobber the files already present. The command would be:
    wget -nc -r http://www.gnu.ai.mit.edu/
    
  • If you want to encode your own username and password into an HTTP or FTP URL, use the appropriate URL syntax (See section URL Format).
    wget ftp://hniksic:mypassword@jagor.srce.hr/.emacs
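
    The HTTP form uses the same syntax; the host and path below are only placeholders:
    wget http://myname:mypassword@host/path/file.html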
    
  • If you do not like the default retrieval visualization (1K dots with 10 dots per cluster and 50 dots per line), you can customize it through dot settings (See section Wgetrc Commands). For example, many people like the "binary" style of retrieval, with 8K dots and 512K lines:
    wget --dot-style=binary ftp://prep.ai.mit.edu/pub/gnu/README
    
    You can experiment with other styles, like:
    wget --dot-style=mega ftp://ftp.xemacs.org/pub/xemacs/xemacs-20.4/xemacs-20.4.tar.gz
    wget --dot-style=micro http://fly.cc.fer.hr/
    
    To make these settings permanent, put them in your `.wgetrc', as described before (See section Sample Wgetrc).
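    A minimal `.wgetrc' sketch, assuming the `dot_style' and `tries' commands described in the Wgetrc Commands section, might look like this:
    # Use the "binary" retrieval display by default.
    dot_style = binary
    # Retry each document up to 45 times.
    tries = 45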

Guru Usage


  • If you wish Wget to keep a mirror of a page (or FTP subdirectories), use `--mirror' (`-m'), which is shorthand for `-r -N'. You can put Wget in the crontab file, asking it to recheck a site each Sunday:
    crontab
    0 0 * * 0 wget --mirror ftp://ftp.xemacs.org/pub/xemacs/ -o /home/me/weeklog
    
  • You may wish to do the same for someone's home page. But you do not want to download all those images; you are only interested in the HTML.
    wget --mirror -A.html http://www.w3.org/
    
  • But what about mirroring the hosts networkologically close to you? It seems so awfully slow because of all that DNS resolving. Just use `-D' (See section Domain Acceptance).
    wget -rN -Dsrce.hr http://www.srce.hr/
    
    Now Wget will correctly find out that `regoc.srce.hr' is the same as `www.srce.hr', but will not even take into consideration the link to `www.mit.edu'.
  • You have a presentation and would like the dumb absolute links to be converted to relative ones? Use `-k':
    wget -k -r URL
    
  • You would like the output documents to go to standard output instead of to files? OK, but Wget will automatically shut up (turn on `--quiet') to prevent mixing of Wget output and the retrieved documents.
    wget -O - http://jagor.srce.hr/ http://www.srce.hr/
    
    You can also combine the two options and make weird pipelines to retrieve the documents from remote hotlists:
    wget -O - http://cool.list.com/ | wget --force-html -i -
