Many complex pages are populated by JavaScript, so the body returned by a plain request differs from what you see in the browser. This is where Splash comes in: it exposes a lightweight HTTP API. You hand it a URL and it returns the rendered page content, and that's all there is to it.
1. Install Splash
Docker has to be installed first.
docker pull registry.docker-cn.com/scrapinghub/splash                     # pull the Splash image via the China mirror
docker tag registry.docker-cn.com/scrapinghub/splash scrapinghub/splash   # retag so the short name works locally
docker run -p 8050:8050 scrapinghub/splash                                # start a Splash instance on port 8050
Docker can now pull through a mirror hosted in China, so prefixing the image name with registry.docker-cn.com speeds up the download (the docker tag step above lets you keep using the short name afterwards).
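Once the container is up, you can sanity-check it by calling Splash's render.html endpoint directly from Python. A quick sketch, assuming Splash is listening on localhost:8050 and using baidu.com as a stand-in page:

import requests

# Ask Splash to fetch the page, run its JavaScript for 1 second,
# and hand back the rendered HTML instead of the raw server response
resp = requests.get('http://localhost:8050/render.html',
                    params={'url': 'http://www.baidu.com', 'wait': 1})
print(resp.text[:200])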
You also need to install scrapy-splash:
sudo pip3 install scrapy-splash
2. Configure settings
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
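Briefly, per the scrapy-splash README: SPLASH_URL points at the container started in step 1; the two Splash downloader middlewares forward requests to Splash and keep cookies consistent across them, while HttpCompressionMiddleware is re-prioritized so decompression happens at the right point; SplashDeduplicateArgsMiddleware avoids storing duplicate Splash arguments in memory; and the dupe filter and cache storage make request fingerprints take the Splash arguments into account.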
3. Use it in the spider
import scrapy
import logging
import re
import datetime
from tutorial import settings
from scrapy_splash import SplashRequest


class JanDanSpider(scrapy.Spider):
    name = "jandan"

    def start_requests(self):
        # start_url = 'http://jandan.net/ooxx'
        start_url = 'http://www.baidu.com'
        headers = {
            'Connection': 'keep-alive',
            'User-Agent': ('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                           '(KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36'),
        }
        # Render through Splash, giving JavaScript 1 second to run
        yield SplashRequest(url=start_url, callback=self.parse,
                            headers=headers, args={'wait': 1.0})

    def parse(self, response):
        # -------------------------- body image urls --------------------------
        image_urls = response.css('img::attr("src")').extract()
        new_image_urls = []
        for url in image_urls:
            # turn protocol-relative '//host/img.jpg' into 'http://host/img.jpg'
            new_image_urls.append('http://' + url[2:])
The key point is the SplashRequest with a wait argument; once the callback fires, the response behaves like a normal Scrapy response and the rest of the crawl proceeds as usual.
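For reference, SplashRequest is a convenience wrapper: with the middlewares from step 2 enabled, the scrapy-splash README documents an equivalent form using a plain scrapy.Request carrying a 'splash' meta key. A minimal sketch inside the same spider:

    def start_requests(self):
        # SplashMiddleware picks up the 'splash' meta key and reroutes this
        # request through the SPLASH_URL endpoint configured in settings
        yield scrapy.Request('http://www.baidu.com', callback=self.parse,
                             meta={'splash': {'args': {'wait': 1.0}}})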
[…] The spider code itself stays unchanged; the main work is a downloader middleware that returns the page_source fetched by our Chrome driver, so parse() receives the rendered content. This is another way to crawl JavaScript-rendered pages, although once it is running, performance feels worse than Splash. […]
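For completeness, a minimal sketch of that Selenium-based middleware, assuming chromedriver is installed and on PATH; the class name and settings path below are my own, not from the original post:

from scrapy.http import HtmlResponse
from selenium import webdriver


class ChromeDriverMiddleware(object):
    """Downloader middleware that fetches pages with headless Chrome
    and returns the rendered page_source to the spider."""

    def __init__(self):
        options = webdriver.ChromeOptions()
        options.add_argument('--headless')
        self.driver = webdriver.Chrome(options=options)

    def process_request(self, request, spider):
        self.driver.get(request.url)
        # Returning a Response here short-circuits the normal download,
        # so parse() gets the JavaScript-rendered HTML
        return HtmlResponse(url=request.url,
                            body=self.driver.page_source,
                            encoding='utf-8',
                            request=request)

Enable it with something like DOWNLOADER_MIDDLEWARES = {'tutorial.middlewares.ChromeDriverMiddleware': 543} (path hypothetical), and remember to call self.driver.quit() when the spider closes.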