curl vs. wget

While developing scripts to handle downloads, I tested curl and wget (including GNU wget2). For scripting purposes, I’ve found curl to be superior because reliable information is easier to extract from it as the download proceeds.

curl

The trick to extracting download information from curl is to redirect standard error to a file:

curl -f -o /path/to/output/file URL 2>/path/to/log/file

In order to output progress information, curl overwrites the existing line of data using carriage return characters. For processing purposes, these are easily replaced with newlines:

cat /path/to/log/file | tr '\r' '\n'

Now any field can be extracted from a progress line using awk. For example, the most recent download speed can be obtained using:

cat /path/to/log/file | tr '\r' '\n' | tail -n 1 | awk '{print $12}'
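
A plain tail -n 1 can also pick up a non-progress line, such as an error message that curl prints when -f fails. The sketch below is one way to guard against that, assuming curl's classic 12-column progress meter; the function name curl_current_speed is hypothetical and not part of curl.

```shell
#!/bin/sh
# Hypothetical helper: print the most recent "Current Speed" value from a
# curl progress log (standard error captured with 2>/path/to/log/file).
# Assumption: the meter has 12 columns and data rows start with a numeric
# percentage, which is what separates them from the header and from errors.
curl_current_speed() {
    log=$1
    # Convert carriage returns to newlines, remember the 12th field of every
    # complete progress row, and print the last one seen (if any).
    tr '\r' '\n' < "$log" |
        awk 'NF == 12 && $1 ~ /^[0-9]+$/ { speed = $12 }
             END { if (speed != "") print speed }'
}
```

It would be called as curl_current_speed /path/to/log/file, and it prints nothing if no progress row has been logged yet.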

wget

With wget, processing the output is a bit trickier. The download must use the dot progress format, which is the default when logging to a file and can also be requested explicitly with --progress=dot. The output can then be logged to a file with the -o option:

wget -o /path/to/log/file -O /path/to/output/file URL

To get to something useful with awk, we need to do quite a bit of cleanup on the output:

cat /path/to/log/file | grep '%' | tail -n 1 | sed 's/[ \.]* / /g'

The above command looks for the status update lines (which contain the % sign), grabs the last of those lines, and then collapses every run of dots and spaces into a single space, so that field numbers are constant for awk extraction. To get the download speed, one might use:

cat /path/to/log/file | grep '%' | tail -n 1 | sed 's/[ \.]* / /g' | awk '{print $3}'

However, this command breaks down at the last progress update, when 100% of the download is reached. At this point, wget reports the average download speed, followed by an = sign, followed by the total time, like so:

55.3M=34s

Extracting the speed reliably therefore requires a more complex awk script:

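# Final update: "<size> 100% <speed>=<time>" leaves three fields.
(NF == 3) {
    percent = $2
    # Split a value such as "55.3M=34s" on "=" and keep the average speed.
    count = split($3, pieces, "=")
    if (count == 2) {
        print pieces[1]
    }
}

# Intermediate update: "<size> <percent> <speed> <eta>" leaves four fields.
(NF == 4) {
    print $3
}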
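
The cleanup pipeline and the two awk rules can be combined into a single shell function. This is only a sketch: the name wget_latest_speed is hypothetical, and it assumes a dot-format log written by wget -o for a download whose total size is known (otherwise no % lines appear).

```shell
#!/bin/sh
# Hypothetical helper: print the latest speed found in a wget "dot" log.
wget_latest_speed() {
    log=$1
    # Keep the last status line, collapse runs of dots and spaces into a
    # single space, then choose the speed field based on the field count.
    grep '%' "$log" | tail -n 1 | sed 's/[ .]* / /g' | awk '
        # Final row: the third field looks like <speed>=<time>.
        NF == 3 { if (split($3, pieces, "=") == 2) print pieces[1] }
        # Intermediate row: <size> <percent> <speed> <eta>.
        NF == 4 { print $3 }
    '
}
```

Calling wget_latest_speed /path/to/log/file during the download prints the speed of the most recent row; after completion it prints the average speed from the 100% row.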

Problematically, wget prints these speeds with a bare K/M/G prefix and no unit. By default they are bytes per second, like curl’s, and they switch to bits per second only when --report-speed=bits is given, so a script has to know how wget was invoked to interpret them. Furthermore, wget extrapolates the latest download speed from very little data and doesn’t average it over any significant time period until the download has finished. Therefore, if the kernel happens to deliver a coalesced group of packets all at once, the resulting chunk of data might cause wget to report a speed far in excess of the actual physical layer speed. This output isn’t particularly useful when trying to use wget as the backend for a download manager.
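
Both tools abbreviate speeds with a magnitude suffix, so a download manager has to expand them to plain numbers before comparing or averaging them. The sketch below does that under the assumption that the suffixes are 1024-based, as the curl man page documents for its progress meter; whether this holds for a given wget invocation should be checked. The name speed_to_number is hypothetical, and the function only rescales the value: it does not convert between bits and bytes.

```shell
#!/bin/sh
# Hypothetical helper: expand a suffixed speed such as "10.5M" to a plain
# number. Assumption: suffixes k/K, M, G and T are powers of 1024.
speed_to_number() {
    printf '%s\n' "$1" | awk '
        {
            value = $1 + 0                            # leading numeric part
            suffix = toupper(substr($1, length($1)))  # last character
            mult = 1
            if (suffix == "K") mult = 1024
            else if (suffix == "M") mult = 1024 ^ 2
            else if (suffix == "G") mult = 1024 ^ 3
            else if (suffix == "T") mult = 1024 ^ 4
            printf "%.0f\n", value * mult
        }'
}
```

A value without a suffix is passed through unchanged, so the function can be applied to every speed extracted from either log.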

wget2

As of the time of this writing, I have tested wget2 version 2.0.0, as shipped on Slackware Linux 15.0. The output from this tool is an absolute dumpster fire.

First, original wget-style “dot” output doesn’t appear to have been implemented in this version. Thus, only the “bar” progress is available, and it has to be forcibly turned on like this:

wget2 --progress=bar --force-progress -O /path/to/output/file URL > /path/to/log/file

To make the bar fancy, the authors used terminal control sequences to move the cursor and draw the bar. Consequently, there is really only one line of output in the log file. Coercing this output into something useful requires quite a bit of work, and I’m not even sure this code will work in all cases for all downloads:

cat /path/to/log/file | sed -e 's/[^[:print:]]//g' -e 's~B/s~B/s\n~g' | sed 's/[^[:space:]]* //' | \
    grep '^ ' | tail -n 1 | sed 's/[=>]//g' | sed -e 's/\[//' -e 's/\]//'

Here, I’m stripping all the non-printing characters and then injecting a newline after each B/s string. That string ends the speed, which itself appears to be the last piece of data output per update. There is still quite a bit of junk left in the output, but we can at least get the status updates by looking for lines that are indented (starting with a space). After a bit more cleanup, the speed can be extracted from the $3 field of the result.

Conclusions

curl supports a much larger number of protocols than wget/wget2, its output is much easier to parse, and it supports SOCKS proxies (where wget/wget2 only support HTTP(S) proxies). I therefore plan to use curl for scripting purposes. On a desktop or server system, I do not see the benefit of using wget.

There are still a few corner cases where wget might be useful, however. In particular, both BusyBox and recent versions of Toybox include wget implementations. On embedded devices with limited resources, these might still be preferable to curl. However, I should note that their progress output might be different, and I have not tested the above code against the BusyBox/Toybox implementations of wget or of the other standard command-line tools used here.