
pyspider

A Powerful Spider (Web Crawler) System in Python. TRY IT NOW!

  • Write scripts in Python
  • Powerful WebUI with script editor, task monitor, project manager and result viewer
  • MySQL, MongoDB, Redis, SQLite and PostgreSQL (via SQLAlchemy) as database backends
  • RabbitMQ, Beanstalk, Redis and Kombu as message queues
  • Task priority, retry, periodical crawling, recrawl by age, etc.
  • Distributed architecture, JavaScript page crawling, Python 2 & 3 support, etc.

Documentation: http://docs.pyspider.org/
Tutorial: http://docs.pyspider.org/en/latest/tutorial/

Sample Code

from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {
    }

    @every(minutes=24 * 60)  # run on_start once every 24 hours
    def on_start(self):
        self.crawl('http://scrapy.org/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)  # treat a fetched page as fresh for 10 days
    def index_page(self, response):
        # follow every absolute link found on the page
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        # the returned dict is captured as this task's result
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }
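To illustrate what the selector in index_page does: `response.doc('a[href^="http"]')` picks out anchor tags whose href begins with "http", i.e. absolute links. The following standalone sketch (an assumption for illustration only, using the standard library's html.parser rather than pyspider's response object) mimics that selection:

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect href values of <a> tags that start with "http",
    roughly mimicking the CSS selector a[href^="http"]."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href", "")
            if href.startswith("http"):
                self.links.append(href)


page = '<a href="http://scrapy.org/">Scrapy</a> <a href="/about">About</a>'
parser = LinkExtractor()
parser.feed(page)
print(parser.links)  # only the absolute link is kept
```

In the real handler, pyspider's PyQuery-backed response.doc performs this selection, and each matched link is fed back to self.crawl as a new task.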

Demo

Installation

Quickstart: http://docs.pyspider.org/en/latest/Quickstart/

Contribute

TODO

v0.4.0

  • local mode, load script from file.
  • works as a framework (all components running in one process, no threads)
  • redis
  • shell mode like scrapy shell
  • a visual scraping interface like portia

more

License

Licensed under the Apache License, Version 2.0