Building a product-information crawler in Java (crawling Dianping)

Date: 2014-09-22 15:47:02

Many companies need crawlers to collect product information. The usual development model looks like this:

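for i = 1; i <= maxPage; i++
     listUrl = productListUrl + "?page=" + i     (i is the page number)
     listPage = fetch(listUrl)
     productLinks = extractProductLinks(listPage)
     for link in productLinks:
          productPage = fetch(link)
          extract(productPage)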
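
For concreteness, the double loop can be written in plain Java roughly as follows. This is only a sketch under stated assumptions: `NaiveLoopCrawler`, `extractLinks` and `crawl` are hypothetical names, `fetch` is a function supplied by the caller (for example a wrapper around an HTTP client), and the regex-based link extraction is a stand-in for a real HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NaiveLoopCrawler {

    // Illustrative pattern only: captures the href value of anchor tags.
    private static final Pattern HREF =
            Pattern.compile("<a\\s+[^>]*href=\"([^\"]+)\"");

    /** Extracts every href value found in the HTML of a list page. */
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    /**
     * Walks list pages 1..maxPage and fetches every product page they link to.
     * fetch is an assumption: any function that maps a URL to its HTML.
     * Returns the raw product pages in the order they were fetched.
     */
    static List<String> crawl(String listUrl, int maxPage,
                              Function<String, String> fetch) {
        List<String> productPages = new ArrayList<>();
        for (int i = 1; i <= maxPage; i++) {
            String listPage = fetch.apply(listUrl + "?page=" + i);
            for (String link : extractLinks(listPage)) {
                productPages.add(fetch.apply(link));
            }
        }
        return productPages;
    }
}
```

Everything here runs on one thread and keeps no record of which URLs succeeded, so a single failing `fetch` call aborts the whole run.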

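This model looks simple, but it has the following problems:

1) The crawler has no thread pool support.

2) There is no checkpoint (resume) mechanism.

3) No crawl state is stored. When crawling e-commerce sites, the server often refuses connections (because of too many requests), which leaves some pages unfetched. Without a record of crawl state, the crawler has to crawl everything again to obtain the complete data.

4) When the extraction logic is complex, the code becomes hard to read (there is no fixed framework).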
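
To give a feel for what problems 1) to 3) involve when solved by hand, here is a minimal sketch of a fetcher that uses a thread pool and records a status per URL, so a second run only retries what is not yet fetched. It is not how WebCollector is implemented. `StatusTrackingFetcher`, `Status` and `fetch` are hypothetical names, and the status map lives in memory only (a real crawler would persist it to disk to survive a restart).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

class StatusTrackingFetcher {

    enum Status { UNFETCHED, FETCHED, FAILED }

    // URL -> crawl status. In-memory only in this sketch.
    private final Map<String, Status> status = new ConcurrentHashMap<>();

    /** Registers URLs; URLs that are already known keep their status. */
    void addUrls(List<String> urls) {
        for (String url : urls) {
            status.putIfAbsent(url, Status.UNFETCHED);
        }
    }

    /** Fetches every URL that is not yet FETCHED, using a fixed thread pool. */
    void run(int threads, Function<String, String> fetch) {
        List<String> pending = new ArrayList<>();
        for (Map.Entry<String, Status> e : status.entrySet()) {
            if (e.getValue() != Status.FETCHED) {
                pending.add(e.getKey());
            }
        }
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String url : pending) {
            pool.submit(() -> {
                try {
                    fetch.apply(url);
                    status.put(url, Status.FETCHED);
                } catch (RuntimeException ex) {
                    // e.g. the server refused the connection; remember it for a retry
                    status.put(url, Status.FAILED);
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.HOURS);
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
        }
    }

    Status statusOf(String url) {
        return status.get(url);
    }
}
```

Calling `run` a second time after some URLs failed fetches only those URLs again, which is the behavior the double-loop model lacks.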

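When solving these problems, many companies do not choose crawler frameworks such as Nutch or crawler4j, because those crawlers are built on breadth-first traversal, and the task above, although a simple double loop, is not a breadth-first traversal. In fact, the double loop can be converted into breadth-first traversals: when a breadth-first traversal has a depth of 1, it is equivalent to crawling a list of URLs (the seed list). Each loop in the task above is exactly such a URL-list crawl. Because the pseudocode is a double loop, it can be split into two breadth-first traversals.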
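
The claim that a depth-1 breadth-first traversal is just a crawl of the seed list can be checked with a small sketch. `DepthLimitedBfs` and `linksOf` are hypothetical names; `linksOf` stands in for "fetch the page and extract its links", and the code is a generic illustration rather than the internals of any particular framework.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

class DepthLimitedBfs {

    /**
     * Breadth-first traversal limited to the given number of layers.
     * Layer 1 is the seed list; layer n+1 holds the links found on layer n.
     * Returns the URLs in the order they were visited.
     */
    static List<String> crawl(List<String> seeds, int depth,
                              Function<String, List<String>> linksOf) {
        Set<String> visited = new LinkedHashSet<>();
        List<String> layer = new ArrayList<>(seeds);
        for (int d = 1; d <= depth && !layer.isEmpty(); d++) {
            List<String> next = new ArrayList<>();
            for (String url : layer) {
                if (!visited.add(url)) {
                    continue; // already visited in this or an earlier layer
                }
                next.addAll(linksOf.apply(url));
            }
            layer = next;
        }
        return new ArrayList<>(visited);
    }
}
```

With `depth` set to 1 the outer loop runs once, so only the seeds are visited and the links collected from them are never followed.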

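We design two breadth-first crawlers, LinkCrawler and ProductCrawler:
1) LinkCrawler traverses the product list pages, extracts the URL of each product detail page, and injects the extracted URLs into ProductCrawler.
2) ProductCrawler uses the URLs injected by LinkCrawler as seeds, crawls them, and extracts data from each product detail page.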

Taking the WebCollector crawler framework as an example, here is a sample that crawls Dianping group-buying deals:

import java.io.File;
import java.io.IOException;

import java.util.regex.Pattern;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import cn.edu.hfut.dmic.webcollector.crawler.BreadthCrawler;
import cn.edu.hfut.dmic.webcollector.generator.Injector;
import cn.edu.hfut.dmic.webcollector.model.Page;
import cn.edu.hfut.dmic.webcollector.util.Config;
import cn.edu.hfut.dmic.webcollector.util.FileUtils;

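/**
 * Demo crawler for Dianping group-buying deals. Many precise-extraction crawlers
 * do not use a plain breadth-first traversal; they work in two steps instead:
 * 1. Loop over the product list pages and extract the URL of each product detail page.
 * 2. Extract data from each product detail page.
 * Most crawlers only support breadth-first traversal, so many people write these
 * loops by hand and lose the thread pool, exception handling and resume support
 * that a crawler framework provides.
 *
 * In fact, the extraction task above can be split into two breadth-first traversals.
 * When a breadth-first traversal has a depth of 1, it is equivalent to crawling a
 * list of URLs (the seed list).
 * We design two breadth-first crawlers, LinkCrawler and ProductCrawler:
 * 1) LinkCrawler traverses the product list pages, extracts the URL of each product
 *    detail page, and injects the extracted URLs into ProductCrawler.
 * 2) ProductCrawler uses the URLs injected by LinkCrawler as seeds, crawls them,
 *    and extracts data from each product detail page.
 *
 * @author hu
 */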
public class DazhongDemo {

    public static class LinkCrawler extends BreadthCrawler {

        Injector injector;

        public LinkCrawler(String linkPath, String productPath) {
            setCrawlPath(linkPath);

            /* Injector that injects seeds into the ProductCrawler crawler */
            injector = new Injector(productPath);
            /* LinkCrawler traverses the product list pages; i is the page number */
            for (int i = 1; i < 3; i++) {
                addSeed("http://t.dianping.com/list/hefei-category_1?pageno=" + i);
            }
            addRegex(".*");
        }

        @Override
        public void visit(Page page) {
            Document doc = page.getDoc();
            Elements links = doc.select("li[class^=floor]>a[track]");
            for (Element link : links) {
                /* href is the product detail page URL extracted from the product list page */
                String href = link.attr("abs:href");
                System.out.println(href);
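                synchronized (injector) {
                    try {
                        /* Inject the product detail page URL into ProductCrawler as a seed */
                        injector.inject(href, true);
                    } catch (IOException ex) {
                        /* Report the failure instead of silently dropping the URL */
                        System.err.println("Failed to inject " + href + ": " + ex.getMessage());
                    }
                }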
            }

        }

        /* With Config.topN = 0, a breadth-first traversal of depth 1 is equivalent to iterating over the seed list */
        public void start() throws IOException {
            start(1);
        }
    }

    public static class ProductCrawler extends BreadthCrawler {

        public ProductCrawler(String productPath) {
            setCrawlPath(productPath);
            addRegex(".*");
            setResumable(true);
            setThreads(5);
        }

        @Override
        public void visit(Page page) {
            /* Check whether the page is a product detail page; this check can be omitted in this program */
            if (!Pattern.matches("http://t\\.dianping\\.com/deal/[0-9]+", page.getUrl())) {
                return;
            }
            Document doc = page.getDoc();
            /* first() returns null when a selector matches nothing, so the lines below assume the expected page layout */
            String name = doc.select("h1.title").first().text();
            String price = doc.select("span.price-display").first().text();
            String origin_price = doc.select("span.price-original").first().text();
            String validateDate = doc.select("div.validate-date").first().text();
            System.out.println(name + "  " + price + "/" + origin_price + validateDate);
        }

        /* With Config.topN = 0, a breadth-first traversal of depth 1 is equivalent to iterating over the seed list */
        public void start() throws IOException {
            start(1);
        }
    }

    public static void main(String[] args) throws IOException {

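        /*
         Config.topN is the upper limit on the number of links the crawler keeps
         during link analysis. This program only needs to traverse the seed URL
         list and does not follow links any further, so it is set to 0.
         */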
        Config.topN = 0;
        /*
         Each crawler depends on a folder in which it stores and maintains its
         crawl information. There are two crawlers here, so two crawl folders
         are needed.
         */
        String linkPath = "crawl_link";
        String productPath = "crawl_product";

        File productDir = new File(productPath);
        if (productDir.exists()) {
            FileUtils.deleteDir(productDir);
        }

        LinkCrawler linkCrawler = new LinkCrawler(linkPath, productPath);
        linkCrawler.start();

        ProductCrawler productCrawler = new ProductCrawler(productPath);
        productCrawler.start();

    }

}
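
The URL check in `ProductCrawler.visit` relies on `Pattern.matches`, which succeeds only when the entire string matches the pattern. The standalone sketch below isolates that check so it can be tried without the crawler. `DealUrlFilter` and `isDealPage` are hypothetical names, and the dots are escaped here so that they match only a literal ".".

```java
import java.util.regex.Pattern;

class DealUrlFilter {

    // Compiled once and reused for every URL.
    private static final Pattern DEAL_URL =
            Pattern.compile("http://t\\.dianping\\.com/deal/[0-9]+");

    /** Returns true only if the whole URL has the product detail page form. */
    static boolean isDealPage(String url) {
        return DEAL_URL.matcher(url).matches();
    }

    public static void main(String[] args) {
        // A detail page URL: prints true
        System.out.println(isDealPage("http://t.dianping.com/deal/123456"));
        // A list page URL: prints false
        System.out.println(isDealPage("http://t.dianping.com/list/hefei-category_1?pageno=1"));
    }
}
```

Because the match covers the whole string, a detail URL that carries a query string would also be rejected; whether that matters depends on the links the list pages actually contain.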
