1. First, write a special Item
```python
import scrapy


class CSDNImgItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()
```
Note that these field names are fixed: image_urls must be a list of image URLs, and images is where the download results get recorded; you don't need to set it yourself.
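For reference, with Scrapy's stock ImagesPipeline the images field is filled in automatically once the downloads finish, one dict per downloaded file (the custom pipeline in step 3 overrides item_completed and leaves it empty). The values below are purely illustrative:

```python
# Illustrative only: what item['images'] can look like after the default
# ImagesPipeline finishes (URL, path and checksum values are made up).
example_images_value = [
    {
        'url': 'https://example.com/img/123456.png',
        'path': 'full/0a79c461a4062ac383dc4fade7bc09f1384a3910.jpg',
        'checksum': '2b00042f7481c7b056c4b410d28f33cf',
    },
]
```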
2. Yield the item
```python
image_urls = response.css('#cnblogs_post_body img::attr("src")').extract()
if len(image_urls) > 0:
    imageItem = CSDNImgItem()
    imageItem['image_urls'] = image_urls
    yield imageItem
```
That is all it takes; just remember that image_urls has to be a list.
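For context, here is a minimal spider sketch showing where that snippet would sit inside parse. The spider name, start URL and the tutorial.items import path are assumptions; only the image-extraction lines come from the post above:

```python
import scrapy

from tutorial.items import CSDNImgItem  # assumed module path for the item from step 1


class BlogImageSpider(scrapy.Spider):
    name = 'blog_images'                                # hypothetical spider name
    start_urls = ['https://www.cnblogs.com/p/1.html']   # hypothetical page to crawl

    def parse(self, response):
        # Same extraction as above: collect every <img src> in the post body.
        image_urls = response.css('#cnblogs_post_body img::attr("src")').extract()
        if len(image_urls) > 0:
            imageItem = CSDNImgItem()
            imageItem['image_urls'] = image_urls
            yield imageItem
```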
3. The image download pipeline
```python
import datetime

from scrapy import Request
from scrapy.pipelines.images import ImagesPipeline


class CSDNImgPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Only items that carry image_urls trigger downloads.
        if 'image_urls' in item.keys():
            for image_url in item['image_urls']:
                # Keep the last path segment (the original file name) in meta.
                last_name = image_url[image_url.rfind('/') + 1:]
                yield Request(image_url, meta={'name': last_name})

    def file_path(self, request, response=None, info=None):
        # Save files as <YYYYMMDD>/<name>.jpg
        today = datetime.datetime.now().strftime('%Y%m%d')
        if '.' in request.meta['name']:
            name = request.meta['name'][0:request.meta['name'].rindex('.')]
        else:
            name = request.meta['name']
        return "%s/%s.jpg" % (today, name)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        # if not image_paths:
        #     raise DropItem("Item contains no images")
        return item
```
I don't drop items that contain no images here, because they may be other text-only items. I also override file_path so that downloaded files are named like 20180412/123456.jpg, i.e. a date folder plus the original file name.
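As a quick sanity check of that naming scheme, this standalone snippet (with a made-up file name) reproduces the folder/file logic used in file_path:

```python
import datetime

# Made-up input name; mirrors the logic in file_path above.
name = '123456.png'
today = datetime.datetime.now().strftime('%Y%m%d')        # e.g. '20180412'
stem = name[:name.rindex('.')] if '.' in name else name   # strip the extension
print("%s/%s.jpg" % (today, stem))                         # -> '20180412/123456.jpg'
```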
4. Image download settings
```python
ITEM_PIPELINES = {
    'tutorial.pipelines.CSDNImgPipeline': 400,
}
IMAGES_STORE = '/Users/walle/PycharmProjects/imgSave/img'
```
Just register the pipeline and set the directory where images are stored.
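If you need more control, Scrapy's images pipeline also honours a few optional settings; the values here are example choices, not ones used in this post:

```python
# Optional extras (example values) supported by Scrapy's images pipeline.
IMAGES_EXPIRES = 90        # skip re-downloading images fetched within the last 90 days
IMAGES_MIN_HEIGHT = 110    # silently drop images shorter than 110 px
IMAGES_MIN_WIDTH = 110     # ... or narrower than 110 px
```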