from scrapy_redis.dupefilter import RFPDupeFilter


class CustomFilter(RFPDupeFilter):

    def request_seen(self, request):
        """Returns True if request was already seen.

        Parameters
        ----------
        request : scrapy.http.Request

        Returns
        -------
        bool

        """
        # Never treat this URL as a duplicate, so it is always scheduled.
        if 'https://segmentfault.com/stop-robot' in request.url:
            return False
        fp = self.request_fingerprint(request)
        # This returns the number of values added, zero if already exists.
        added = self.server.sadd(self.key, fp)
        return added == 0
Here I wrote a custom filter that inherits from the one in scrapy-redis, because I need one URL, https://segmentfault.com/stop-robot, to never be filtered out as a duplicate.
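To see the effect without a Redis server, here is a minimal sketch of the same idea. It is not the scrapy-redis implementation: the class name InMemoryFilter, the SHA-1 fingerprint, and the Python set standing in for the Redis set are all assumptions made for illustration.

```python
import hashlib

# The URL that must never be filtered (taken from the answer above).
EXEMPT_URL = "https://segmentfault.com/stop-robot"


class InMemoryFilter:
    """Toy dupefilter (hypothetical): a Python set plays the role of the Redis set."""

    def __init__(self):
        self.seen = set()

    def request_seen(self, url):
        # Exempt URLs are always reported as "not seen" and are never recorded.
        if EXEMPT_URL in url:
            return False
        # Assumed fingerprint: a SHA-1 of the URL, only for this sketch.
        fp = hashlib.sha1(url.encode("utf-8")).hexdigest()
        if fp in self.seen:
            return True
        self.seen.add(fp)
        return False


f = InMemoryFilter()
results = [
    f.request_seen("https://segmentfault.com/q/1"),  # first visit
    f.request_seen("https://segmentfault.com/q/1"),  # duplicate
    f.request_seen(EXEMPT_URL),                      # exempt
    f.request_seen(EXEMPT_URL),                      # still exempt
]
print(results)  # [False, True, False, False]
```

The exempt check comes before the fingerprint is stored, so that URL never enters the set and can be requested any number of times.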
settings.py
DUPEFILTER_CLASS = 'tutorial.CustomFilter.CustomFilter'
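Note that my project is named tutorial, so the path above points to the class CustomFilter in the module tutorial.CustomFilter. Change the prefix to match your own project name.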
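A dotted path like this is split into a module part and a class name. The sketch below shows roughly how such a string can be resolved; load_by_path is a hypothetical helper written for this example, and Scrapy's own loader may differ in its details. A standard-library class is used so the snippet runs anywhere.

```python
from importlib import import_module


def load_by_path(path):
    """Hypothetical helper: resolve 'package.module.Name' to the named object."""
    # Everything before the last dot is the module, the rest is the attribute.
    module_path, _, name = path.rpartition(".")
    module = import_module(module_path)
    return getattr(module, name)


# 'tutorial.CustomFilter.CustomFilter' would resolve the same way inside
# the project; here a standard-library class stands in for it.
cls = load_by_path("collections.OrderedDict")
print(cls.__name__)  # OrderedDict
```

If the module part cannot be imported (for example, the project name is wrong), the import fails, which is why the prefix has to match your project.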