pyspider

A Powerful Spider (Web Crawler) System in Python. Try It Now!

  • Write scripts in Python with a powerful API
  • Powerful WebUI with script editor, task monitor, project manager, and result viewer
  • MySQL, MongoDB, and SQLite as database backends
  • JavaScript pages supported (see the sketch after this list)
  • Task priority, retry, periodic crawling, and recrawl by age or by marks in the index page (such as update time)
  • Distributed architecture
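
For the JavaScript support above, pages can be rendered with PhantomJS before they reach your callback. A minimal sketch, assuming a PhantomJS fetcher is running alongside the crawler (JSHandler and the URL are placeholders):

from pyspider.libs.base_handler import *

class JSHandler(BaseHandler):
    def on_start(self):
        # fetch_type='js' asks the PhantomJS fetcher to render the page
        # before the response is handed to the callback
        self.crawl('http://example.com/', fetch_type='js',
                   callback=self.index_page)

    def index_page(self, response):
        # response.doc is a PyQuery object over the rendered HTML
        return {"title": response.doc('title').text()}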

Sample Code:

from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    '''
    this is a sample handler
    '''
    @every(minutes=24*60, seconds=0)
    def on_start(self):
        # entry point: scheduled to run once a day
        self.crawl('http://scrapy.org/', callback=self.index_page)

    @config(age=10*24*60*60)
    def index_page(self, response):
        # follow every absolute http:// link found on the page
        for each in response.doc('a[href^="http://"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        # the returned dict is captured as the result for this task
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }
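
Here @every(minutes=24*60) schedules on_start once a day, and @config(age=10*24*60*60) treats a crawled page as valid for 10 days, so it is not re-crawled before that age expires. The dict returned by detail_page is stored as the task's result; to post-process results yourself, you can override on_result on the handler. A minimal sketch (save_result is a hypothetical helper, not part of pyspider):

class Handler(BaseHandler):
    # ... crawling methods as above ...

    def on_result(self, result):
        # called with the return value of each callback; result may be
        # None for tasks that produce no output
        if result:
            save_result(result)  # hypothetical: persist to your own store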

[demo screenshot]

Installation

  • Python 2.6/2.7 (Windows is not currently supported)
  • pip install --allow-all-external -r requirements.txt
  • ./run.py, then visit http://localhost:5000/

On Ubuntu, install the build dependencies first: apt-get install python python-dev python-distribute python-pip libcurl4-openssl-dev libxml2-dev libxslt1-dev python-lxml

Running with Docker

Documents

Contribute

License

Licensed under the Apache License, Version 2.0