
pyspider

A Powerful Spider (Web Crawler) System in Python. TRY IT NOW!

Tutorial: http://docs.pyspider.org/en/latest/tutorial/
Documentation: http://docs.pyspider.org/
Release notes: https://github.com/binux/pyspider/releases

Sample Code

from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    # Options placed here apply to every request made by this project.
    crawl_config = {
    }

    @every(minutes=24 * 60)
    def on_start(self):
        # Entry point: scheduled to run once every 24 hours.
        self.crawl('http://scrapy.org/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        # Pages handled by this callback are considered fresh for
        # 10 days and will not be re-fetched within that window.
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        # The returned dict is captured as the result of the task.
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }
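
self.crawl also accepts per-request options as keyword arguments. A minimal sketch, assuming the documented fetch_type='js' option (which requires a PhantomJS fetcher to be running) for pages that build their links with JavaScript:

    def on_start(self):
        # Ask the fetcher to render the page with PhantomJS before
        # parsing; a plain fetch would miss JavaScript-generated links.
        self.crawl('http://example.com/', callback=self.index_page,
                   fetch_type='js')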

Demo: http://demo.pyspider.org/

Installation

Quickstart: http://docs.pyspider.org/en/latest/Quickstart/
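
A minimal quickstart sketch, assuming pip and the all-in-one mode described in the docs (running pyspider with no subcommand starts all components in one process and serves the WebUI on port 5000):

    pip install pyspider
    pyspider

Then open http://localhost:5000/ in your browser to create and run projects.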

Contribute

TODO

v0.4.0

  • local mode, load script from file.
  • works as a framework (all components running in one process, no threads)
  • redis
  • shell mode like scrapy shell
  • a visual scraping interface like portia

more

  • edit script with vim via WebDAV

License

Licensed under the Apache License, Version 2.0