
pyspider

A Powerful Spider (Web Crawler) System in Python.

Tutorial: http://docs.pyspider.org/en/latest/tutorial/
Documentation: http://docs.pyspider.org/
Release notes: https://github.com/binux/pyspider/releases

Sample Code

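from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    # Project-wide defaults applied to every self.crawl() call (none set here).
    crawl_config = {
    }

    # Run on_start once a day (24 * 60 minutes).
    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://scrapy.org/', callback=self.index_page)

    # Treat a fetched page as still valid for 10 days (age is in seconds).
    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        # Follow every link whose href starts with "http".
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        # The returned dict is the result recorded for this page.
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }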
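
The callback chain in the sample (on_start, then index_page, then detail_page) can be illustrated without pyspider at all. The following is a minimal sketch using only the Python standard library, not pyspider's implementation: FAKE_PAGES, LinkAndTitleParser and run are hypothetical names made up for this illustration, pages come from an in-memory dict instead of HTTP, and the simple "seen" set is an assumed stand-in for task de-duplication.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "site"; stands in for real HTTP fetches.
FAKE_PAGES = {
    "http://example.com/": (
        "<title>Home</title>"
        '<a href="http://example.com/a">A</a>'
        '<a href="/relative">skipped</a>'
    ),
    "http://example.com/a": "<title>Page A</title>",
}


class LinkAndTitleParser(HTMLParser):
    """Collect the <title> text and every href starting with "http",
    roughly what the selectors 'title' and 'a[href^="http"]' pick out."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href") or ""
            if href.startswith("http"):
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def run(start_url):
    """Mimic on_start -> index_page -> detail_page with a simple task queue."""
    queue = deque([(start_url, "index_page")])
    seen = {start_url}  # assumed de-duplication, for this sketch only
    results = []
    while queue:
        url, callback = queue.popleft()
        parser = LinkAndTitleParser()
        parser.feed(FAKE_PAGES.get(url, ""))
        if callback == "index_page":
            # Schedule each absolute link for the detail callback.
            for href in parser.links:
                if href not in seen:
                    seen.add(href)
                    queue.append((href, "detail_page"))
        else:
            # detail_page: emit one result per page.
            results.append({"url": url, "title": parser.title})
    return results


if __name__ == "__main__":
    print(run("http://example.com/"))
```

Only the absolute link is followed; the relative href does not start with "http", so it is skipped, just as it would not match the attribute selector in the sample.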

Demo

Installation

Quickstart: http://docs.pyspider.org/en/latest/Quickstart/

Contribute

TODO

v0.4.0

  • A visual scraping interface like Portia

License

Licensed under the Apache License, Version 2.0