
Combine multiple requests into one structured item.

Project description

Structured spider

Scrape structured data by building a tree of requests for each Item.

USAGE

Install structure_spider

dev@ubuntu:~$ pip install structure-spider

Create a project

dev@ubuntu:~$ structure-spider create project -n myapp
New structure-spider project 'myapp', using template directory '/home/dev/.pyenv/versions/3.6.0/lib/python3.6/site-packages/structor/templates/project', created in:
    /home/dev/myapp

You can start the spider with:
    cd myapp
    custom-redis-server -ll INFO -lf
    scrapy crawl douban

Start the simple Redis server. To use a full Redis installation instead, comment out CUSTOM_REDIS=True in settings.py.

dev@ubuntu:~$ custom-redis-server -ll INFO -lf

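Generate a custom spider and item

`create spider` generates a spider that is ready to use. Give the spider name first (`-n` in the command below), then list the fields to scrape and their rules, joining each field and rule with `=`. A rule can be a regular expression, an XPath expression, or a CSS selector.

To add more complex rules or to clean the data, see the wiki.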
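To make the regular-expression rule type concrete, here is a minimal sketch using Python's standard `re` module. The pattern is the one from the `product_id` rule in the command below (inside double quotes the shell turns `\\.` into `\.`). The helper `apply_regex_rule` and the sample URL are hypothetical, and this is not necessarily how structure_spider applies rules internally.

```python
import re

# Pattern from the product_id rule, as the program receives it after shell unquoting.
PRODUCT_ID_RULE = r"/(\d+)\.htm"


def apply_regex_rule(rule, text):
    """Hypothetical helper: return the first captured group, or None if no match."""
    match = re.search(rule, text)
    return match.group(1) if match else None


# Made-up URL for illustration only; it is not taken from the target site.
sample_url = "https://example.com/jobs/123456.htm"
product_id = apply_regex_rule(PRODUCT_ID_RULE, sample_url)
print(product_id)
```

The capture group `(\d+)` picks out the run of digits that sits between a slash and `.htm`, so the extracted value is a string rather than an integer.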

dev@ubuntu:~$ cd myapp/myapp/
dev@ubuntu:~/myapp/myapp$ ls
items  settings.py  spiders
dev@ubuntu:~/myapp/myapp$ structure-spider create spider -n zhaopin "product_id=/(\d+)\\.htm" "job=//h1/text()" "salary=//a/../../strong/text()" 'city=//ul[@class="terminal-ul clearfix"]//strong/a/text()' 'education=//span[contains(text(), "学历")]/following-sibling::strong/text()' "company=h2 > a" -ip '//td[@class="zwmc"]/div/a[1]/@href' -pp '//li[@class="pagesDown-pos"]/a/@href'
ZhaopinSpdier and ZhaopinItem have been created.
dev@ubuntu:~/myapp/myapp$
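Several rules in the command above contain `=` themselves (for example `[@class="terminal-ul clearfix"]`), so a `field=rule` argument can only be split safely at the first `=`. The sketch below illustrates that idea under the assumption that everything before the first `=` is the field name. `parse_field_rules` is a hypothetical helper, not the tool's actual parser.

```python
def parse_field_rules(args):
    """Hypothetical helper: split each "field=rule" argument at the first "=" only."""
    rules = {}
    for arg in args:
        field, sep, rule = arg.partition("=")
        if not sep or not field:
            raise ValueError(f"expected field=rule, got {arg!r}")
        rules[field] = rule
    return rules


# A subset of the arguments from the command above, as the program would
# receive them after the shell has removed the outer quotes.
args = [
    r"product_id=/(\d+)\.htm",
    "job=//h1/text()",
    'city=//ul[@class="terminal-ul clearfix"]//strong/a/text()',
    "company=h2 > a",
]

rules = parse_field_rules(args)
for field, rule in rules.items():
    print(f"{field}: {rule}")
```

Using `str.partition` rather than `str.split("=")` keeps the `=` inside the XPath predicate intact.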

Reference: Using structure_spider to scrape structured data by combining multiple requests

Start the spider

dev@ubuntu:~/myapp/myapp$ scrapy crawl zhaopin

Submit a task

dev@ubuntu:~/myapp$ structure-spider feed -s zhaopin -u "https://sou.zhaopin.com/jobs/searchresult.ashx?jl=%E6%B5%8E%E5%8D%97&kw=%E9%94%80%E5%94%AE&sm=0&p=1" -c zhaopin --custom # --custom means the simple Redis server is in use
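The seed URL passed with `-u` carries percent-encoded UTF-8 query parameters. As a small sketch, Python's standard `urllib.parse` can decode them and rebuild the same URL for other searches. Reading `jl` as the city, `kw` as the search keyword, and `p` as the page number is an assumption based on the decoded values, and `build_seed_url` is a hypothetical helper.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Seed URL copied from the feed command above.
SEED_URL = (
    "https://sou.zhaopin.com/jobs/searchresult.ashx"
    "?jl=%E6%B5%8E%E5%8D%97&kw=%E9%94%80%E5%94%AE&sm=0&p=1"
)

# Decode the query string; parse_qs maps each key to a list of values.
params = parse_qs(urlsplit(SEED_URL).query)
print(params["jl"][0], params["kw"][0])


def build_seed_url(city, keyword, page=1):
    """Hypothetical helper: rebuild a search URL with the same parameter layout."""
    # Parameter meanings (jl = city, kw = keyword, p = page) are assumed.
    query = urlencode({"jl": city, "kw": keyword, "sm": 0, "p": page})
    return "https://sou.zhaopin.com/jobs/searchresult.ashx?" + query


print(build_seed_url("济南", "销售") == SEED_URL)
```

`urlencode` percent-encodes non-ASCII text as UTF-8 by default, which matches the encoding used in the seed URL.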

Check task status

dev@ubuntu:~/myapp$ structure-spider check zhaopin --custom

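More resources:

[structure_spider weekly exercise]: Download Baidu MP3s with one command

Generate a customized spider with one click: scrape wherever you point!

Advanced Scrapy: a detailed look at ItemCollector, a tool for building an Item from multiple requests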


Source distribution: structure_spider-1.3.3.tar.gz (35.0 kB)

  • Tags: Source
  • Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.32.1 CPython/3.7.3
  • SHA256: 15b5b515d224fc2b476d49b7600bf2df953a39baeaddad2fc788dc797ea15185
  • MD5: 768fe2e9670e176209360b69930f634e
  • BLAKE2b-256: 7e3acd37352e6a2702e0569e506650b55cad525defe12b3098586e79d2c732dd
