Crawler for importing data from a filesystem directory into Solr

Project description

Introduction

bg.crawler is a command-line frontend for feeding a tree of files (a directory) into a Solr server for indexing

Usage

Command line options:

blackmoon:~/src/bg.crawler> bin/solr-crawler --help
usage: solr-crawler [-h] [--solr-url SOLR_URL] [--max-depth MAX_DEPTH]
                    [--batch-size BATCH_SIZE] [--tag TAG] [--clear-all]
                    [--clear-tag SOLR_CLEAR_TAG] [--verbose] [--no-type-check]
                    <directory>

Commandline parser

positional arguments:
  <directory>           Directory to be crawled

optional arguments:
  -h, --help            show this help message and exit
  --solr-url SOLR_URL   SOLR server URL
  --max-depth MAX_DEPTH
                        maximum folder depth
  --batch-size BATCH_SIZE
                        Solr batch size
  --tag TAG             Solr import tag
  --clear-all           Clear the Solr indexes before crawling
  --clear-tag SOLR_CLEAR_TAG
                        Remove all items from the Solr index tagged with the
                        given tag
  --verbose             Verbose logging
  --no-type-check       Do not apply extension filter while crawling

  • --solr-url defines the URL of the Solr server

  • --max-depth limits the crawler to a given folder depth

  • --batch-size inserts N documents per batch before sending a commit to Solr (default behavior: every single add to the Solr index is committed)

  • --tag tags the imported document(s) with a string (this can be useful when importing different document sources into Solr while retaining the option to filter by tag at query time, as shown in the example below)

  • --clear-all clears the complete Solr index before running the import

  • --clear-tag removes all documents with the given tag before running the import

  • --verbose enables extensive logging

  • --no-type-check disables the extension-based type check and passes all file types to Solr instead
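
A typical invocation might look like the following; the Solr URL, the tag name and the directory are placeholders and have to be adapted to the actual setup:

blackmoon:~/src/bg.crawler> bin/solr-crawler \
    --solr-url http://localhost:8983/solr \
    --batch-size 100 \
    --max-depth 5 \
    --tag manuals \
    /path/to/documents

For a re-import, documents from an earlier run can be purged first through --clear-tag (assuming the same tag 'manuals' was used before):

blackmoon:~/src/bg.crawler> bin/solr-crawler \
    --solr-url http://localhost:8983/solr \
    --clear-tag manuals \
    --tag manuals \
    /path/to/documents

At query time the tag can then serve as a filter query. A minimal sketch against Solr's standard select handler, assuming the crawler stores the tag in a field named tag (the actual field name depends on the Solr schema in use):

blackmoon:~/src/bg.crawler> curl 'http://localhost:8983/solr/select?q=*:*&fq=tag:manuals'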

Licence

bg.crawler is published under the GNU General Public License v2 (GPL 2)

Credits

bg.crawler is sponsored by BG Phoenics

Author

Written by

ZOPYX Ltd.
c/o Andreas Jung
Charlottenstr. 37/1
D-72070 Tuebingen
Germany
www.zopyx.com

Contributors

Changelog

0.1 (2011-11-11)

  • initial release [ajung]
