Fastest URL parser in the world

Fastest domain extractor library, written in C++ with Python bindings.

The first complete library for parsing URLs from C++, Python and the command line.

About The Project

Features

  • Supports multiple languages and environments: Python, C++ and the shell
  • Intuitive interface that is identical in C++ and Python
  • Provides two separate classes, Url and Host, to keep code clean
  • Supports the public_suffix_list for known multi-part suffixes such as "ac.ir"
  • Supports unknown suffixes such as "google.comm" (it detects "comm" as the suffix)
  • Updates the public_suffix_list automatically before each build and deploy
  • Host properties:
    • subdomain
    • domain
    • domain_name
    • suffix
  • Url properties:
    • protocol
    • userinfo
    • host (and all the host properties)
    • port
    • path
    • query
    • params
    • fragment
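
To make the Host properties concrete, here is a minimal pure-Python sketch of suffix-based splitting. It is not the library's implementation: the tiny `KNOWN_SUFFIXES` set and the `split_host` helper are hypothetical, and it assumes that `domain` is the label just left of the suffix and that `domain_name` is that label joined with the suffix.

```python
# Hypothetical, tiny stand-in for the public suffix list (the real list is far larger).
KNOWN_SUFFIXES = {"com", "ir", "ac.ir", "uk", "co.uk"}


def split_host(host: str) -> dict:
    """Split a host into subdomain, domain, domain_name and suffix (illustrative only)."""
    labels = host.lower().strip(".").split(".")

    # Fallback for unknown suffixes: treat the last label as the suffix.
    suffix_len = 1
    # Try candidates from longest to shortest so that "ac.ir" wins over "ir".
    for start in range(len(labels)):
        if ".".join(labels[start:]) in KNOWN_SUFFIXES:
            suffix_len = len(labels) - start
            break

    suffix = ".".join(labels[-suffix_len:])
    # The host may consist of a suffix only, in which case there is no domain.
    domain = labels[-suffix_len - 1] if len(labels) > suffix_len else ""
    subdomain = ".".join(labels[: -suffix_len - 1]) if domain else ""
    domain_name = f"{domain}.{suffix}" if domain else ""
    return {
        "subdomain": subdomain,
        "domain": domain,
        "domain_name": domain_name,
        "suffix": suffix,
    }


if __name__ == "__main__":
    print(split_host("ee.aut.ac.ir"))
    print(split_host("google.comm"))
```

With this sketch, "ee.aut.ac.ir" splits on the known suffix "ac.ir", while "google.comm" falls back to treating "comm" as the suffix.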

Setup

C++:

Build steps:

git clone https://github.com/mohammadraziei/liburlparser
cd liburlparser
mkdir -p build; cd build
cmake ..
# Build the project:
make
# [Optional] run tests:
make test
# [Optional] make documents:
make docs
# [Optional] Run examples:
./example
# Install:
sudo make install

Python and Command Line:

Note that Python >= 3.8 is required.

Installation

pip install liburlparser

Or

pip install git+https://github.com/mohammadraziei/liburlparser

Or

git clone https://github.com/mohammadraziei/liburlparser
pip install ./liburlparser

Usage

Command Line

python -m liburlparser --help # show the help section
python -m liburlparser --version # show the version
python -m liburlparser --url "https://mail.google.com/about" | jq # returns JSON
python -m liburlparser --host "mail.google.com" | jq # returns JSON
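
Since the commands above are piped into jq, their output can presumably also be consumed from a script. The following is a sketch only: the helper names `build_command` and `run_cli` are hypothetical, it uses just the `--url` and `--host` flags shown above, and it assumes the command prints a single JSON document to stdout.

```python
import json
import subprocess
import sys
from typing import Any, List


def build_command(kind: str, value: str) -> List[str]:
    """Build the argv for `python -m liburlparser --url/--host VALUE`."""
    if kind not in ("url", "host"):
        raise ValueError("kind must be 'url' or 'host'")
    # sys.executable keeps the call inside the current Python environment.
    return [sys.executable, "-m", "liburlparser", f"--{kind}", value]


def run_cli(kind: str, value: str) -> Any:
    """Run the CLI and decode its stdout (assumed to be one JSON document)."""
    completed = subprocess.run(
        build_command(kind, value),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(completed.stdout)
```

Calling `run_cli("host", "mail.google.com")` requires liburlparser to be installed in the same environment.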

Python

liburlparser is designed to be intuitive to use.

Every class has a help section:

import liburlparser
help(liburlparser)
print(liburlparser.__version__)

from liburlparser import Url, Host
help(Url)
help(Host)

Parse a URL and a host:

from liburlparser import Url, Host
## parse a URL:
url = Url("https://ee.aut.ac.ir/#id") # parses every part of the URL
print(url, url.suffix, url.domain, url.fragment, url.host, url.to_dict(), url.to_json())
## parse a host:
host = url.host # ee.aut.ac.ir
# or
host = Host("ee.aut.ac.ir")
# or
host = Host.from_url("https://ee.aut.ac.ir/#id") # the fastest way to parse the host out of a URL
# each of these returns a Host object with the host part already parsed
print(host, host.domain, host.suffix, host.to_dict(), host.to_json())
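
For readers coming from the standard library, most of the Url properties have close counterparts in `urllib.parse`. The sketch below is only a comparison aid, not liburlparser code: `url_parts` is a hypothetical helper, and liburlparser's own definitions (for example of `params`) may differ from the standard library's.

```python
from urllib.parse import urlparse


def url_parts(url: str) -> dict:
    """Return standard-library equivalents of the Url property names (illustrative)."""
    parsed = urlparse(url)
    # Everything before the last '@' in the authority is the userinfo, if present.
    userinfo = parsed.netloc.rpartition("@")[0]
    return {
        "protocol": parsed.scheme,
        "userinfo": userinfo,
        "host": parsed.hostname or "",
        "port": parsed.port,  # None when the URL has no explicit port
        "path": parsed.path,
        "params": parsed.params,  # ';'-separated parameters of the last path segment
        "query": parsed.query,
        "fragment": parsed.fragment,
    }


if __name__ == "__main__":
    print(url_parts("https://ee.aut.ac.ir/#id"))
```

Note that the standard library stops at the host string; splitting it into subdomain, domain and suffix is the part that needs a public suffix list.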

There are also some helper APIs that give better performance for small tasks:

# if you only need the host of a URL as a string, without any further parsing
host_str = Url.extract_host("https://ee.aut.ac.ir/about") # very fast
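
To illustrate why pulling out the host as a plain string can be cheaper than a full parse, here is a pure-Python sketch based on string slicing. It is not the library's algorithm: `extract_host_str` is a hypothetical name, and the sketch ignores IPv6 literals and other edge cases.

```python
def extract_host_str(url: str) -> str:
    """Return the host portion of a URL using only string slicing (illustrative)."""
    # Drop the scheme if there is one.
    _, sep, rest = url.partition("://")
    if not sep:
        rest = url

    # The authority ends at the first '/', '?' or '#'.
    end = len(rest)
    for ch in "/?#":
        idx = rest.find(ch)
        if idx != -1:
            end = min(end, idx)
    authority = rest[:end]

    # Strip optional userinfo ("user:pw@") and port (":8080").
    host = authority.rpartition("@")[2]
    return host.partition(":")[0]


if __name__ == "__main__":
    print(extract_host_str("https://ee.aut.ac.ir/about"))
```

No suffix lookup or component objects are created here, which is the kind of work a string-only helper can skip.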

If you are a fan of pydomainextractor, liburlparser offers a similar interface:

import pydomainextractor
extractor = pydomainextractor.DomainExtractor()
extractor.extract("ee.aut.ac.ir") # from host
extractor.extract_from_url("https://ee.aut.ac.ir/about") # from url

# alternatively you can use:
from liburlparser import Host
Host.extract("ee.aut.ac.ir") # from host
Host.extract_from_url("https://ee.aut.ac.ir/about") # from url
# as you can see, the API is the same

C++

There are some examples in the examples folder.

#include "liburlparser"
...
/// for parsing a URL
TLD::Url url("https://ee.aut.ac.ir/about");
std::string domain = url.domain(); // likewise for subdomain, port, params, ...
/// for parsing a host
TLD::Host host("ee.aut.ac.ir");
// or
TLD::Host host2 = url.host();
// or
TLD::Host host3 = TLD::Host::fromUrl("https://ee.aut.ac.ir/about");

As you can see, every method available in Python can be used just as easily in C++.

Performance

Extract From Host

Tests were run on a file containing 10 million random domains from various top-level domains (Mar. 13th, 2022)

Library            Function                     Time
liburlparser       liburlparser.Host            1.12s
PyDomainExtractor  pydomainextractor.extract    1.50s
publicsuffix2      publicsuffix2.get_sld        9.92s
tldextract         __call__                     29.23s
tld                tld.parse_tld                34.48s

Extract From URL

The test was conducted on a file containing 1 million random URLs (Mar. 13th, 2022)

Library            Function                            Time
liburlparser       liburlparser.Host.from_url          2.10s
PyDomainExtractor  pydomainextractor.extract_from_url  2.24s
publicsuffix2      publicsuffix2.get_sld               10.84s
tldextract         __call__                            36.04s
tld                tld.parse_tld                       57.87s
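
Measurements of this kind could be reproduced with a small timing harness. The sketch below is an assumption about how such a benchmark might be structured, not the script used for the numbers above; `time_extractor` is a hypothetical helper that accepts any callable, for example `Host.extract` or `Host.from_url`.

```python
import time
from typing import Callable, Iterable


def time_extractor(
    extract: Callable[[str], object],
    inputs: Iterable[str],
    repeat: int = 3,
) -> float:
    """Return the best wall-clock time (seconds) for calling `extract` on every input."""
    data = list(inputs)  # materialize once so file reading is not part of the timing
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        for item in data:
            extract(item)
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    # Stand-in workload; replace str.lower with the extractor under test.
    hosts = ["ee.aut.ac.ir", "mail.google.com"] * 1000
    print(f"{time_extractor(str.lower, hosts):.4f}s")
```

Taking the best of several runs reduces noise from other processes; absolute times will still depend on the machine and the input file.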

License

Distributed under the MIT License. See LICENSE for more information.

Contact

Project Link: https://github.com/mohammadraziei/liburlparser
