zlodziej-crawler

About The Project

A small web scraper for collecting and processing offers from the website olx.pl.

Built With

Getting Started

Prerequisites

Poetry is used to manage project dependencies. You can install it with:

pip install poetry

Installation

  • Clone the repo
git clone https://gitlab.com/mwozniak11121/zlodziej-crawler-public.git
  • Spawn poetry shell
poetry shell
  • Install dependencies and package
poetry install

Or, if you prefer to install the package with pip:

pip install zlodziej-crawler

Usage

The only script made available is steal, which prompts for the URL of an offer category, e.g. olx.pl/nieruchomosci/mieszkania/wynajem/wroclaw/,
and then scrapes, processes, and saves the offers it finds. (Results are saved in the directory cwd / results.)
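
As a rough illustration of the two details mentioned above (a category URL typed without a scheme, and a results directory under the current working directory), here is a minimal sketch. The helper names normalize_category_url and results_dir are hypothetical and are not taken from the package; the host check is also an assumption about what the script might validate.

```python
from pathlib import Path
from urllib.parse import urlparse


def normalize_category_url(raw: str) -> str:
    """Hypothetical helper: add a scheme if missing and check the host."""
    raw = raw.strip()
    if "://" not in raw:
        # The README example is given without a scheme, so add one.
        raw = "https://" + raw
    parsed = urlparse(raw)
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[len("www."):]
    if host != "olx.pl":
        raise ValueError(f"not an olx.pl URL: {raw!r}")
    return f"{parsed.scheme}://{parsed.netloc}{parsed.path}"


def results_dir(base: Path) -> Path:
    """Hypothetical helper: the 'cwd / results' location from the README."""
    return base / "results"


if __name__ == "__main__":
    print(normalize_category_url("olx.pl/nieruchomosci/mieszkania/wynajem/wroclaw/"))
    print(results_dir(Path.cwd()))
```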

Example output for RentOffer looks like this:

Extending Project

The project is meant to be easily extended by adding new Pydantic models to zlodziej_crawler/models.py.
BaseOffer serves as a generic model for all offer types that are not specifically processed.
RentOffer and its parent class BaseOffer look like this:

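from datetime import datetime
from typing import Dict, Optional, Union

from pydantic import BaseModel, HttpUrl, PositiveFloat, PositiveInt

# Website, OfferType and BuildingType are defined elsewhere in the project (not shown here).


class BaseOffer(BaseModel):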
    url: HttpUrl
    offer_name: str
    description: str
    id: PositiveInt
    time_offer_added: datetime
    views: PositiveInt
    location: str
    price: Union[PositiveInt, str]
    website: Optional[Website] = None
    unused_data: Optional[Dict] = None


class RentOffer(BaseOffer):
    rent: PositiveInt
    area: float

    number_of_rooms: Optional[str] = None
    offer_type: Optional[OfferType] = OfferType.UNKNOWN
    floor: Optional[str] = None
    building_type: Optional[BuildingType] = BuildingType.UNKNOWN
    furnished: Optional[bool] = None

    total_price: Optional[int] = None
    price_per_m: Optional[PositiveFloat] = None
    total_price_per_m: Optional[PositiveFloat] = None
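
The three optional price fields at the end of RentOffer look like derived values. The sketch below shows one way they could be computed; it assumes that total_price is price plus rent and that the per-metre values divide by area, which the README does not state. The DerivedPrices dataclass and derive_prices function are hypothetical and use only the standard library, not the project's Pydantic models.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class DerivedPrices:
    # Mirrors the three optional fields at the end of RentOffer.
    total_price: Optional[int] = None
    price_per_m: Optional[float] = None
    total_price_per_m: Optional[float] = None


def derive_prices(price: Union[int, str], rent: int, area: float) -> DerivedPrices:
    """Assumed semantics: total_price = price + rent, per-metre = value / area."""
    # price may be free text (it is typed Union[PositiveInt, str] in BaseOffer);
    # in that case, or for a non-positive area, nothing can be derived.
    if isinstance(price, bool) or not isinstance(price, int) or area <= 0:
        return DerivedPrices()
    total = price + rent
    return DerivedPrices(
        total_price=total,
        price_per_m=round(price / area, 2),
        total_price_per_m=round(total / area, 2),
    )


if __name__ == "__main__":
    print(derive_prices(2000, 500, 50.0))
```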

The project can be extended by adding matching classes for other categories at olx.pl.
Adding a new OfferType requires:

  • Parsing functions in zlodziej_crawler/olx/offers_extraction/NEW_OFFER.py
  • Factory function in OLXParserFactory (zlodziej_crawler/olx/parser_factory.py)
  • Matching offer category url in OLXParserFactory.get_parser (zlodziej_crawler/olx/parser_factory.py)
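
The three steps above can be pictured with a small, self-contained sketch of the general pattern: parsing functions, a registry of factory functions, and a lookup that matches the category URL. This is not the actual OLXParserFactory code; every name here (parse_rent_offer, parse_base_offer, PARSERS_BY_CATEGORY, get_parser) is hypothetical, and the fallback to a generic parser is an assumption based on BaseOffer being described as the generic offer.

```python
from typing import Callable, Dict

# A parser takes raw scraped fields and returns processed fields.
Parser = Callable[[dict], dict]


def parse_base_offer(raw: dict) -> dict:
    """Generic fallback for categories without a dedicated parser."""
    return {"kind": "BaseOffer", **raw}


def parse_rent_offer(raw: dict) -> dict:
    """Stand-in for the parsing functions of one specific category."""
    return {"kind": "RentOffer", **raw}


# Category URL fragment -> parser. Supporting a new offer type would mean
# adding one parsing function and one entry here.
PARSERS_BY_CATEGORY: Dict[str, Parser] = {
    "/nieruchomosci/mieszkania/wynajem/": parse_rent_offer,
}


def get_parser(category_url: str) -> Parser:
    """Return the first parser whose URL fragment occurs in category_url."""
    for fragment, parser in PARSERS_BY_CATEGORY.items():
        if fragment in category_url:
            return parser
    return parse_base_offer


if __name__ == "__main__":
    parser = get_parser("olx.pl/nieruchomosci/mieszkania/wynajem/wroclaw/")
    print(parser({"offer_name": "example"}))
```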

Currently, any information the scraper finds in the titlebox-details section that is not yet processed is saved as unused_data.
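
A minimal sketch of how such a split between processed fields and unused_data might work is shown below. The label-to-field mapping and the split_details helper are hypothetical; the labels are illustrative guesses, not values taken from the project or guaranteed to match olx.pl.

```python
from typing import Dict, Tuple

# Hypothetical mapping from detail labels on the page to model field names.
KNOWN_LABELS: Dict[str, str] = {
    "Powierzchnia": "area",
    "Liczba pokoi": "number_of_rooms",
    "Poziom": "floor",
}


def split_details(details: Dict[str, str]) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Separate recognised details from everything else.

    Returns (known, unused): known is keyed by model field name, while
    unused keeps the original labels so no scraped information is lost.
    """
    known: Dict[str, str] = {}
    unused: Dict[str, str] = {}
    for label, value in details.items():
        field = KNOWN_LABELS.get(label)
        if field is None:
            unused[label] = value
        else:
            known[field] = value
    return known, unused


if __name__ == "__main__":
    print(split_details({"Powierzchnia": "45 m2", "Parking": "Tak"}))
```

The unused dictionary could then be passed as unused_data when constructing the offer, so that fields can later be promoted to proper model attributes.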
