wget
Used to download files, for example:

# Download the home page HTML
wget http://ayqy.net
# Download multiple files
wget http://www.example.com http://ayqy.net

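In the example above, the address without www returns a 301. wget follows the redirect automatically, downloads index.html, and saves it to the current directory. By default the local file name matches the remote one; if a file with that name already exists, a suffix is appended automatically (for example index.html.1).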

Two URL formats are supported:

# http
http://host[:port]/directory/file
# ftp
ftp://host[:port]/directory/file
# With username and password authentication
http://user:password@host/path
ftp://user:password@host/path
# Or
wget --user=user --password=password URL

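The output file name is set with the -O option:

# Write to a file
wget http://ayqy.net -O page.html
# - means standard output
wget http://ayqy.net -O -

Note: it must be an uppercase O. Lowercase -o writes progress and error messages to the given log file instead. If the specified file already exists, it is overwritten.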

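Other common options:

# POST
wget --post-data 'a=1&b=2' http://www.example.com
# Or
wget --post-file post-body.txt http://www.example.com
# Resume an interrupted download
wget -c http://www.example.com
# Try at most 3 times on error
wget -t 3 http://www.example.com
# Limit the download rate to 1k, to avoid saturating the bandwidth
wget --limit-rate 1k http://www.example.com
# Limit the total download size, to avoid using too much disk space
wget -Q 1m http://www.example.com http://www.example.com

P.S. Limiting the total download size depends on the server providing Content-Length; without it the limit cannot be applied.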

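In addition, wget has very powerful crawling capabilities:

# Recursively crawl all pages, downloading them one by one
wget --mirror http://www.ayqy.net
# Limit the depth to 1 level; must be used together with the -r recursive option
wget -r -l 1 http://www.ayqy.net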

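It can also update incrementally, downloading only new files (ones that do not exist locally, or whose last-modified time is newer):

# -N compares timestamps for incremental updates and downloads only new files
wget -N http://node.ayqy.net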

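If the file on the server has not changed, the next run skips the download and prints:

Server file no newer than local file `index.html' -- not retrieving.

P.S. Of course, incremental updating depends on the server providing Last-Modified. Without it, incremental updating is impossible, and wget falls back to downloading and overwriting.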

P.S. For more information about wget, see the GNU Wget 1.18 Manual

curl
More powerful than wget: besides downloading files, it can send requests (GET/POST/PUT/DELETE/HEAD and so on), set request headers, and more. It supports HTTP, HTTPS, FTP and other protocols, as well as cookies, UA, authentication, and so on.

It is often used to test RESTful APIs:

# Create
curl -X POST http://localhost:9108/user/ayqy
# Delete
curl -X DELETE http://localhost:9108/user/ayqy
# Update
curl -X PUT http://localhost:9108/user/ayqy/cc
# Read
curl -X GET http://localhost:9108/user/ayqy

Submitting a form via POST:

# Simulate a form submission
curl -d 'a=1&b=2' --trace-ascii /dev/stdout http://www.example.com
# Request headers and request body
=> Send header, 148 bytes (0x94)
0000: POST / HTTP/1.1
0011: Host: www.example.com
0028: User-Agent: curl/7.43.0
0041: Accept: */*
004e: Content-Length: 7
0061: Content-Type: application/x-www-form-urlencoded
0092:
=> Send data, 7 bytes (0x7)
0000: a=1&b=2

-d is short for --data (--data-ascii is an alias for it). The other three forms are --data-raw, --data-binary, and --data-urlencode; of these, --data-urlencode URL-encodes the parameter value.

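--trace-ascii prints the request/response headers and bodies. Alternatively, inspect the request with a proxy tool:

# Use -x or --proxy to go through the proxy; otherwise the proxy tool cannot capture the request
curl -d 'a=1&b=2' -x http://127.0.0.1:8888 http://www.example.com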

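It can also download files like wget, except that by default it writes to standard output rather than to a file:

# Print the response body directly
curl http://ayqy.net

This returns a simple 301 page. curl does not follow redirects automatically (unless -L/--location is given), which makes it handy for tracing redirects (of course, capturing packets is simpler and more direct).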

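Downloading a file can be done with output redirection or with the -o option:

# Write to a file; progress information is shown by default
curl http://ayqy.net > 301.html
# Or
curl http://ayqy.net -o 301.html
# Use the file name from the URL
curl http://ayqy.net/index.html -O
# Fails when the URL contains no file name
curl http://ayqy.net -O
# Silent download, no progress information
curl http://ayqy.net --silent -o 301.html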
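The three output modes can be compared without touching the network. The following is a minimal sketch, assuming curl was built with file:// support (the usual case); the helper name demo_curl_output, the temporary directory, and the sample index.html are invented for the demo.

```shell
#!/bin/bash
# Sketch: curl's three output modes, tried against a local file:// URL so
# that no network is needed. The helper name, the temporary directory, and
# index.html are made up for the demo.
demo_curl_output() (
    work=$(mktemp -d) || exit 1
    mkdir -p "$work/src" "$work/out"
    printf '<h1>hello</h1>\n' > "$work/src/index.html"
    url="file://$work/src/index.html"

    cd "$work/out" || exit 1
    curl --silent "$url"               # no option: body goes to stdout
    curl --silent "$url" -o copy.html  # -o: you choose the file name
    curl --silent "$url" -O            # -O: file name taken from the URL
    ls                                 # lists copy.html and index.html
    rm -rf "$work"
)

# Only run the demo where curl is installed
if command -v curl >/dev/null 2>&1; then
    demo_curl_output
fi
```

The function body runs in a subshell, so the cd does not leak into the calling shell.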

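An interesting command:

# Install nvm with curl
curl -o- https://raw.githubusercontent.com/creationix/nvm/v0.33.1/install.sh | bash

The value of the -o option is -, which sends the output to standard output; the pipe then hands it to bash for execution. The whole line fetches an online bash script and runs it.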

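The wget equivalent is similar:

# Install nvm with wget
wget -qO- https://raw.githubusercontent.com/creationix/nvm/v0.33.1/install.sh | bash

The -q option silences wget so the output stays clean, and -O - sends the download to standard output, which is then handed to bash.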

curl's strength lies in its ability to modify request header fields:

# Set the Referer field
curl --referer http://ayqy.net http://node.ayqy.net
# Set a cookie
curl -v --cookie 'isVisted=true' http://localhost:9103
# Or use -H to set any header field
curl -v -H 'Cookie: isVisted=true' http://localhost:9103
curl -v -H 'Cookie: isVisted=true' -H 'Referer: http://a.com' http://localhost:9103
# Write the returned cookies to a file
curl http://localhost:9103 -c cookie.txt
# Set the UA
curl -v -A 'hello, i am android' 'http://localhost:9105'

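Other features and options:

# Show a progress bar
curl http://ayqy.net --progress-bar -o 301.html
# Resume a download
# Specify the offset manually: skip 15 bytes, so the DOCTYPE declaration is skipped
curl http://node.ayqy.net -C 15
# Compute the offset automatically (similar to wget -c)
curl http://node.ayqy.net -C -
# Limit the download rate (the limit also applies when writing to standard output)
curl http://www.ayqy.net > ayqy.html --limit-rate 1k
# Limit the maximum file size, in bytes
curl http://node.ayqy.net --max-filesize 100
# Username and password authentication
curl -v -u username:password http://example.com
# Print only the response headers
# The www host returns far fewer fields
curl -I http://node.ayqy.net
curl -I http://www.ayqy.net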

Batch-downloading images
curl makes simple tasks like this easy:

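#!/bin/bash
# Batch-download images

# Check the number of arguments
if [ $# -ne 3 ]; then
    echo 'Usage: -d <dir> <url>'
    exit 1
fi

# Parse the arguments
for i in {1..3}; do
    case $1 in
    -d) shift; dir=$1; shift;;
     *) url=${url:-$1}; shift;;
    esac
done

# Extract the base URL (scheme and host)
baseurl=$(echo "$url" | grep -Eo 'https?://[a-z.]+')

# Fetch the source, filter the img tags, extract src
tmpFile="/tmp/img_url_$$.tmp"
curl "$url" --silent \
    | grep -Eo '<img\s.*src="[^"]+"[^>]*>' \
    | sed 's/.*src="\([^"]*\)".*/\1/g' \
    > "$tmpFile"
echo "save image urls to $tmpFile"

# Convert root-relative paths to absolute URLs, keeping the leading slash
# (sed -i '' is the BSD/macOS form; GNU sed takes -i without the empty argument)
sed -i '' "s;^/;$baseurl/;g" "$tmpFile"

# Create the directory
mkdir -p "$dir"
cd "$dir" || exit 1

# Download the images
while read -r imgUrl; do
    filename=${imgUrl##*/}
    curl "$imgUrl" --silent > "$filename"
    echo "save to $dir/$filename"
done < "$tmpFile"
echo 'done'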
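The extraction step of the script can be tried offline. The following is a rough sketch, not the script itself: the inline HTML sample and the base URL http://www.example.com are assumptions, and the img pattern is a slightly stricter variant (`[^>]*` instead of `\s.*`) so that a match cannot run past the end of a tag.

```shell
#!/bin/bash
# Sketch: the extraction pipeline run on an inline HTML sample instead of
# a live page. The sample markup and base URL are assumptions.
html='<div><img class="a" src="/img/cat.png" alt="cat"></div>
<p><img src="http://cdn.example.com/dog.jpg"></p>'
baseurl='http://www.example.com'

# Keep only the <img> tags, pull out src, then prefix root-relative paths
urls=$(printf '%s\n' "$html" \
    | grep -Eo '<img[^>]*src="[^"]+"[^>]*>' \
    | sed 's/.*src="\([^"]*\)".*/\1/' \
    | sed "s;^/;$baseurl/;")

# ${var##*/} strips the longest prefix ending in "/", leaving the file name
for u in $urls; do
    echo "$u -> ${u##*/}"
done
```

The sed expressions use ; as the delimiter so that the slashes in the URL need no escaping.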

Run the script above to fetch the images from pengfu.com:

./imgdl.sh http://www.pengfu.com -d imgs
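The core part is straightforward: fetch the source, find the img tags, extract src, and loop over the downloads. The argument-parsing part uses a small trick:

# Parse the arguments
for i in {1..3}; do
    case $1 in
    -d) shift; dir=$1; shift;;
     *) url=${url:-$1}; shift;;
    esac
done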

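Here the shift command pops the first of the command arguments ($1...n), just like the shift method on arrays in other languages: it removes the first element and moves the rest forward, so the loop only ever needs to inspect $1. The case statement matches argument names and values, reading one and removing one, always from the front. For example, for a key-value pair such as -d <dir>, the first shift drops -d, then <dir> is read, and finally <dir> itself is shifted away so that the next pass reads the argument that follows.

The benefit of parsing arguments this way is that their order is not fixed: a key and its value must stay together, but otherwise the arguments can appear in any order.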
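The same read-one, drop-one pattern also works with a loop that runs until no arguments remain, which avoids hard-coding three passes. The following is a minimal sketch under that assumption; parse_args is a hypothetical helper name, while dir and url match the variables in the script.

```shell
#!/bin/bash
# Sketch: read-one, drop-one argument parsing that loops until the
# argument list is empty. parse_args is a hypothetical helper name.
parse_args() {
    dir='' url=''
    while [ $# -gt 0 ]; do
        case $1 in
            -d) shift                    # drop -d
                dir=${1:-}               # read its value (empty if missing)
                if [ $# -gt 0 ]; then shift; fi ;;  # drop the value
             *) url=${url:-$1}           # the first bare argument wins
                shift ;;
        esac
    done
}

# Either order yields the same result
parse_args http://www.pengfu.com -d imgs
echo "dir=$dir url=$url"
parse_args -d imgs http://www.pengfu.com
echo "dir=$dir url=$url"
```

Both calls print dir=imgs url=http://www.pengfu.com.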

Here ${url:-$1} means: if the variable url is set and non-empty, use the value of url; otherwise use the value of $1. This feature is called parameter expansion:

${parameter:-word}

If parameter is unset or empty, the value of word is used; otherwise the value of parameter is used.

${parameter:=word}

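Used to set a default value. If parameter is unset or empty, the value of word is assigned to parameter. Positional parameters (such as $1, $2, ..., $n) and special parameters cannot be assigned this way.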

${parameter:?word}

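Used to catch unset or empty variables. If parameter is unset or empty, the expansion of word is written to standard error (in the form parameter: word; if word is omitted, the message is parameter null or not set), and a non-interactive shell exits the script. If parameter is set and non-empty, its value is used.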

${parameter:+word}

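Used to test whether a variable is set. If parameter is unset or empty, the result is empty; otherwise the value of word is used.

There are also four variants without the colon. They test only whether parameter is unset, so an empty value counts as set.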
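The forms above can be checked side by side. The following is a small sketch; the variable names name, empty, and missing are hypothetical and chosen only for the demo.

```shell
#!/bin/bash
# Sketch: the four colon forms, plus one non-colon form for contrast.
# The variable names (name, empty, missing) are hypothetical.
unset name missing
empty=''

a=${name:-guest}      # name is unset, so the fallback is used; name stays unset
: "${name:=guest}"    # assigns the default to name as a side effect
b=${name:+is-set}     # name is now non-empty, so word is substituted
c=${empty:-fallback}  # colon form: an empty value counts as missing
d=${empty-fallback}   # no colon: an empty value counts as set

echo "a=$a name=$name b=$b c=$c d=[$d]"

# :? aborts a non-interactive shell, so try it inside a subshell
( : "${missing:?must be set}" ) 2>/dev/null || echo "missing rejected"
```

The leading : is the shell's no-op command; it is there only so that the expansion is evaluated without running its result as a command.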

P.S. For more information about parameter expansion, see Bash Reference Manual: Shell Parameter Expansion
