Scrapy beginner tutorial: using rules

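LinkExtractor: a link extractor that pulls links out of a response object; the extracted links are then scheduled for crawling.

Main parameters:

allow: URLs that match the regular expression (or list of regular expressions) given here are extracted. If it is empty, every link matches.

deny: URLs that match this regular expression (or list of regular expressions) are never extracted. It takes precedence over allow.

allow_domains: the domains whose links will be extracted.

deny_domains: the domains whose links will never be extracted.

restrict_xpaths: an XPath expression (or list of expressions) selecting the regions of the page that links are taken from. It works together with allow to filter links.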

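callback (a parameter of Rule, not of LinkExtractor): the function named here is called with the response fetched from each extracted link.

Note: avoid using parse as the callback. CrawlSpider uses the parse method to implement its own logic, so if you override parse the CrawlSpider will stop working.

follow (also a parameter of Rule): whether links extracted from a response by this rule should themselves be followed. When callback is None, follow defaults to True; otherwise it defaults to False.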

Writing format (1)

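from scrapy.linkextractors import LinkExtractor  # replaces the deprecated SgmlLinkExtractor
from scrapy.spiders import Rule

rules = [
    # Extract the "next article" link and follow it. Without the
    # restrict_xpaths limit, every link on the page that matches
    # allow would be crawled.
    Rule(LinkExtractor(allow=r'/u2323243432/article/details',
                       restrict_xpaths='//li[@class="next_article"]'),
         follow=True),
    # Extract the links that match allow and process each page with
    # parse_item, without following the links found on those pages.
    Rule(LinkExtractor(allow=r'/u2323243432/article/details'),
         callback='parse_item',
         follow=False),
]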

Writing format (2)

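# LinkExtractor (from scrapy.linkextractors) replaces the deprecated SgmlLinkExtractor
rules = [
    # A single rule that both processes and follows the "next article" link
    Rule(LinkExtractor(allow=r'/u2323243432/article/details',
                       restrict_xpaths='//li[@class="next_article"]'),
         callback='parse_item',
         follow=True),
]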

If you repost this article, please credit the source and its URL: http://www.divcss5.com/html/h60178.shtml