Learning Python (2): Baidu Crawler 0.1

Date: 2014-04-28 17:37:30

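Following a crawler example I found online, I first built a demo, and I now have a basic grasp of the workflow: creating a crawler project, extracting data, analyzing it, and saving it.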

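My requirement is to crawl a specified Baidu Tieba forum, check the posts against a keyword list, and extract the links of the posts that contain a keyword so they can be used for alerting later.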
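
The matching step in this requirement can be sketched in plain Python, independent of Scrapy. This is only an illustration: the function name `filter_by_keywords` and the sample posts are made up here, and it assumes titles and keywords are already text strings (Python 3).

```python
def filter_by_keywords(posts, keywords):
    """Return the (title, link) pairs whose title contains any keyword."""
    matched = []
    for title, link in posts:
        # any() stops at the first matching keyword, so each post is added once
        if any(keyword in title for keyword in keywords):
            matched.append((title, link))
    return matched


# Hypothetical sample data, not taken from a real forum page
posts = [
    ("Server outage reported", "/p/1001"),
    ("Weekend meetup", "/p/1002"),
    ("Login failure after update", "/p/1003"),
]
keywords = ["outage", "failure"]
hits = filter_by_keywords(posts, keywords)
```

Here `hits` holds only the posts whose titles mention one of the keywords; these are the entries that would feed the later alerting step.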

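Based on this requirement, the work breaks down into the following steps:

First: create a crawler project with Scrapy;

Second: create the TieBaSpider spider;

Third: create the external keyword file dictionary.txt and the forum address configuration file url.txt.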

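For the first step, see the online example.

From the second step on, write the spider, and also create the item class and the pipeline.

Spider code: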

# -*- coding:gbk -*-
from scrapy.spider import Spider
from scrapy.selector import Selector
from tutorial.items import TieBa

import codecs
import re

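class TieBaSpider(Spider):
    name = "tieba"
    allowed_domains = ["tieba.baidu.com"]  # restrict the crawl to this domain
    start_urls = []
    dictionarys = []

    def __init__(self, *args, **kwargs):
        super(TieBaSpider, self).__init__(*args, **kwargs)
        # use per-instance lists so the class-level lists are not mutated
        self.dictionarys = []
        self.start_urls = []
        self.readFile(self.dictionarys, 'dictionary.txt')
        self.readFile(self.start_urls, 'url.txt')

    def readFile(self, file, fileName):
        tempFile = codecs.open(fileName, 'r')
        for temp in tempFile.readlines():
            # strip the trailing line ending and skip blank lines
            temp = temp.rstrip('\r\n')
            if temp:
                file.append(temp)
        tempFile.close()

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//div[@class="threadlist_text threadlist_title j_th_tit  notStarList "]')
        items = []

        for site in sites:
            title = self.change(site.xpath('a/@title').extract())
            childLink = self.change(site.xpath('a/@href').extract())
            # skip entries that have no title or no link
            if title is None or childLink is None:
                continue

            item = TieBa()
            item['title'] = title
            item['childLink'] = "http://tieba.baidu.com" + childLink

            for key in self.dictionarys:
                # keywords are read as bytes; decode them before comparing with the unicode title
                if item['title'].find(key.decode('gbk')) != -1:
                    items.append(item)
                    break  # add each item once even if several keywords match

        return items

    def change(self, field):
        # return the first extracted value, or None if the list is empty
        for temp in field:
            return temp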

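name specifies the spider's name.

allowed_domains restricts the domains the spider may crawl.

The __init__ method loads the contents of the external keyword file dictionary.txt and the forum address file url.txt into the spider.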

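When reading a file with multiple lines, Python does not strip the line endings automatically, so the code has to handle it. (I read that passing 'r' as the second argument to open would solve this, but I could not get it to work; 'r' only selects read mode and leaves the trailing newline on each line.)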
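
One common way to drop the line endings is `str.rstrip`. The following is a minimal sketch assuming Python 3; the helper name `read_lines` and the temporary sample file are made up for illustration and are not part of the original project.

```python
import os
import tempfile


def read_lines(path, encoding="utf-8"):
    """Read a text file and return its non-empty lines without line endings."""
    with open(path, "r", encoding=encoding) as handle:
        # rstrip("\r\n") removes "\n" and "\r\n" endings but keeps other whitespace
        lines = [line.rstrip("\r\n") for line in handle]
    # drop blank lines so an empty keyword never matches every title
    return [line for line in lines if line]


# Write a throwaway file so the sketch can run on its own
fd, sample_path = tempfile.mkstemp(suffix=".txt")
with open(fd, "w", encoding="utf-8", newline="") as handle:
    handle.write("alpha\r\nbeta\n\ngamma")
keywords = read_lines(sample_path)
os.remove(sample_path)
```

Skipping blank lines matters here because an empty string is found in every title, so a stray empty line in the keyword file would flag every post.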

parse is Scrapy's default callback. For each URL in start_urls, Scrapy creates a Request and, by default, calls parse with the resulting response.

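For the string search, the keywords read from the file are byte strings, and on Windows Python falls back to the ascii codec when comparing them with the extracted titles, so they have to be decoded first. I chose gbk here because Baidu Tieba's HTML is encoded in gbk; note that this only works if dictionary.txt itself is saved in gbk.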
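
The decode-then-compare step can be shown in isolation. This is a small sketch assuming Python 3, where bytes and text are separate types; the title and keyword values are hypothetical.

```python
# Text as a selector would return it (hypothetical value)
title = "百度贴吧爬虫测试"

# Bytes as they would come from a keyword file saved in GBK
raw_keyword = "爬虫".encode("gbk")

# Decode with the codec the file was saved in, then compare text with text
keyword = raw_keyword.decode("gbk")
found = keyword in title
```

The `in` operator performs the same substring check as `find(...) != -1` and reads a little more directly.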

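The change method turns the list returned by extract() into a single string by returning its first element. I am not sure whether there is a better way, so I am using this for now.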
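
A slightly more explicit alternative is to index the list and supply a default for the empty case. The helper below is a sketch with a made-up name, `first_or_default`; later Scrapy releases also offer `extract_first()` on selector lists, which serves the same purpose.

```python
def first_or_default(values, default=None):
    """Return the first element of a sequence, or default when it is empty."""
    return values[0] if values else default


title = first_or_default(["Example title"])
missing = first_or_default([], default="")
```

Passing an explicit default such as an empty string avoids a None value being concatenated with the URL prefix later on.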

Once the spider is defined, Scrapy automatically passes each item it returns to the configured pipeline, which writes it to a file.

Create the item class in items.py.

Item code:

from scrapy.item import Item, Field

class TieBa(Item):
    title = Field()
    childLink = Field()
    childValue = Field()

Create the pipeline in pipelines.py.

Pipeline:

import codecs
import json

class JsonWithEncodingPipeline(object):
    def __init__(self):
        self.file = codecs.open('tieba.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # ensure_ascii=False keeps Chinese text readable instead of \uXXXX escapes
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    def close_spider(self, spider):
        # Scrapy calls close_spider on each pipeline when the spider finishes
        self.file.close()
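
One way to keep Chinese text readable in the output file is the `ensure_ascii=False` option of `json.dumps`. The following is a minimal sketch assuming Python 3; the helper name `to_json_line` and the sample record (including its URL) are hypothetical, and an in-memory buffer stands in for tieba.json.

```python
import io
import json


def to_json_line(record):
    """Serialize one record as a single JSON line, keeping non-ASCII text as-is."""
    return json.dumps(record, ensure_ascii=False, sort_keys=True) + "\n"


# Hypothetical item contents, for illustration only
record = {"title": "百度贴吧", "childLink": "http://tieba.baidu.com/p/1"}

buffer = io.StringIO()  # stands in for the output file
buffer.write(to_json_line(record))
line = buffer.getvalue()
```

Each line stays valid JSON on its own, so the file can be read back later one line at a time with `json.loads`.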

Configure the pipeline in settings.py.

Pipeline configuration:

ITEM_PIPELINES = {
    'tutorial.pipelines.JsonWithEncodingPipeline': 300,
}
