pyspider

A Powerful Spider (Web Crawler) System in Python. Try It Now!

  • Write scripts in Python with a powerful API
  • Powerful WebUI with script editor, task monitor, project manager and result viewer
  • MySQL, MongoDB or SQLite as the database backend
  • JavaScript pages supported! (see the sketch after the sample code below)
  • Task priority, retry, periodic crawling, and re-crawl by age or by marks on the index page (such as update time)
  • Distributed architecture

Sample Code:

from libs.base_handler import *

class Handler(BaseHandler):
    '''
    this is a sample handler
    '''
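    # run on_start once every 24 hours to (re)seed the crawl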
    @every(minutes=24*60, seconds=0)
    def on_start(self):
        self.crawl('http://scrapy.org/', callback=self.index_page)

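    # age = 10 days: a page crawled within the last 10 days is considered
    # fresh and will not be re-crawled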
    @config(age=10*24*60*60)
    def index_page(self, response):
        for each in response.doc('a[href^="http://"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

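    # the returned dict is the result of the task; it is picked up by the
    # result worker and shown in the WebUI result viewer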
    def detail_page(self, response):
        return {
                "url": response.url,
                "title": response.doc('title').text(),
                }
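
The "JavaScript pages supported!" feature is provided by the PhantomJS-based fetcher. Below is a minimal, hypothetical sketch of what such a handler could look like; it assumes a fetch_type='js' option on self.crawl, so check the project documentation for the exact parameter name before relying on it:

from libs.base_handler import *

class JSHandler(BaseHandler):
    '''
    sketch: crawling a JavaScript-rendered page
    '''
    def on_start(self):
        # fetch_type='js' (assumed option name) asks the PhantomJS fetcher
        # to render the page before passing it to the callback
        self.crawl('http://example.com/', callback=self.js_page, fetch_type='js')

    def js_page(self, response):
        # response.doc is the rendered DOM, so elements inserted by
        # JavaScript can be queried like static ones
        return {
                "url": response.url,
                "title": response.doc('title').text(),
                }

Everything else (scheduling, retry, result handling) works the same way as in the sample above.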

[demo screenshot]

Installation

On Ubuntu: apt-get install python python-dev python-distribute python-pip libcurl4-openssl-dev libxml2-dev libxslt1-dev python-lxml. The Python dependencies listed in requirements.txt can then be installed with pip, and run.py starts the components together with the WebUI.

Or run it with Docker.

Documents

Contribute

License

Licensed under the Apache License, Version 2.0