Fastest URL parser in the world

Fastest domain extractor library, written in C++ with Python bindings.

The first complete library for parsing URLs from C++, Python, and the command line.

About The Project

liburlparser is a powerful domain extractor library written in C++ with Python bindings. It provides efficient URL parsing capabilities for both C++ and Python, making it a valuable tool for projects that involve working with web addresses.

Features

Here are some key features of liburlparser:

  1. Multiple Language Support:

    • liburlparser can be used in multiple programming languages, including Python, C++, and Shell.
    • It offers an intuitive interface that remains consistent across both C++ and Python.
  2. Clean Code Design:

    • The library provides two separate classes: Url and Host.
    • This separation allows for cleaner and more organized code when dealing with URLs.
  3. Public Suffix List Support:

    • liburlparser recognizes known multi-label suffixes (e.g., "ac.ir") using the public_suffix_list.
    • It also handles unknown suffixes (e.g., "comm" in "google.comm").
  4. Automatic Public Suffix List Updates:

    • Before each build and deployment, liburlparser updates the public_suffix_list automatically.
  5. Host Properties:

    • The Host class includes properties such as subdomain, domain, domain name, and suffix.
  6. URL Properties:

    • The Url class provides properties like protocol, userinfo, host (and all host properties), port, path, query parameters, and fragment.
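
To make the suffix handling described above concrete, here is a minimal, standard-library-only sketch of splitting a host with a longest-suffix match. It is not liburlparser's implementation: `KNOWN_SUFFIXES` is a tiny hard-coded stand-in for the public suffix list, `split_host` is a hypothetical helper, and the dictionary keys are illustrative names that need not map one-to-one to the Host properties.

```python
# Illustrative only: a tiny hard-coded set stands in for the public suffix
# list, and split_host is a hypothetical helper, not part of liburlparser.
KNOWN_SUFFIXES = {"com", "ir", "ac.ir", "uk", "co.uk"}


def split_host(host: str) -> dict:
    """Split a host into subdomain, name, suffix and registered domain."""
    labels = host.lower().strip(".").split(".")

    # Try the longest candidate suffix first, so "ac.ir" wins over "ir".
    suffix_len = 1  # unknown suffixes fall back to the last label
    for i in range(len(labels)):
        if ".".join(labels[i:]) in KNOWN_SUFFIXES:
            suffix_len = len(labels) - i
            break

    suffix = ".".join(labels[-suffix_len:])
    rest = labels[:-suffix_len]          # everything left of the suffix
    name = rest[-1] if rest else ""      # label directly before the suffix
    subdomain = ".".join(rest[:-1])      # whatever remains on the left
    registered = f"{name}.{suffix}" if name else suffix
    return {
        "subdomain": subdomain,
        "name": name,
        "suffix": suffix,
        "registered_domain": registered,
    }


known = split_host("ee.aut.ac.ir")   # multi-label suffix from the set
unknown = split_host("google.comm")  # suffix not in the set
print(known)
print(unknown)
```

Trying the longest candidate first is what lets a multi-label suffix such as "ac.ir" take precedence over the shorter "ir", while a host whose suffix is not in the set still gets a usable split.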

Usage

Command Line

python -m liburlparser --help # show help section
python -m liburlparser --version # show version
python -m liburlparser --url "https://mail.google.com/about" | jq # return as json
python -m liburlparser --host "mail.google.com" | jq # return as json

Python

liburlparser is designed to be intuitive to use.

Every class has a help section:

import liburlparser
help(liburlparser)
print(liburlparser.__version__)

from liburlparser import Url, Host
help(Url)
help(Host)

Parse a URL and a host:

from liburlparser import Url, Host
## parse url:
url = Url("https://ee.aut.ac.ir/#id") # parse all parts of the url
print(url, url.suffix, url.domain, url.fragment, url.host, url.to_dict(), url.to_json())
## parse host
host = url.host # ee.aut.ac.ir
# or
host = Host("ee.aut.ac.ir")
# or
host = Host.from_url("https://ee.aut.ac.ir/#id") # the fastest way to parse the host from a url
# each of these returns a Host object with the host part of the url already parsed
print(host, host.domain, host.suffix, host.to_dict(), host.to_json())
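
For readers who want to see what the individual URL parts look like without installing the package, the following sketch splits a URL with the standard library's `urllib.parse`. It is a comparison aid only, not liburlparser code: `url_parts` is a hypothetical helper, the keys simply reuse the part names listed under Features, and `urlsplit` knows nothing about public suffixes.

```python
from urllib.parse import parse_qs, urlsplit


def url_parts(url: str) -> dict:
    """Break a URL into named parts using only the standard library."""
    parts = urlsplit(url)
    # netloc looks like "user:pw@host:port"; userinfo is the text before "@".
    userinfo = parts.netloc.rpartition("@")[0]
    return {
        "protocol": parts.scheme,
        "userinfo": userinfo,
        "host": parts.hostname or "",
        "port": parts.port,              # None when the URL has no port
        "path": parts.path,
        "query": parse_qs(parts.query),  # dict of name -> list of values
        "fragment": parts.fragment,
    }


simple = url_parts("https://ee.aut.ac.ir/#id")
full = url_parts("https://user:pw@ee.aut.ac.ir:8443/about?lang=en#id")
print(simple)
print(full)
```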

There are also a few helper APIs that give better performance for small tasks:

# if you only need the host of a url as a string, without any further parsing
host_str = Url.extract_host("https://ee.aut.ac.ir/about") # very fast

If you are a fan of pydomainextractor, liburlparser offers a similar interface:

import pydomainextractor
extractor = pydomainextractor.DomainExtractor()
extractor.extract("ee.aut.ac.ir") # from host
extractor.extract_from_url("https://ee.aut.ac.ir/about") # from url

# alternatively you can use:
from liburlparser import Host
Host.extract("ee.aut.ac.ir") # from host
Host.extract_from_url("https://ee.aut.ac.ir/about") # from url
# as you can see, the api is the same

C++

There are some examples in the examples folder:

#include "liburlparser"
...
/// for parsing url
TLD::Url url("https://ee.aut.ac.ir/about");
std::string domain = url.domain(); // also for subdomain, port, params, ...
/// for parsing host (three alternatives, named differently so they compile in one scope)
TLD::Host host("ee.aut.ac.ir");
// or, from an already parsed Url
TLD::Host hostFromUrl = url.host();
// or, directly from a url string
TLD::Host hostFromString = TLD::Host::fromUrl("https://ee.aut.ac.ir/about");

As you can see, every method available in Python can be used just as easily in C++.

Installation

C++:

build steps:

git clone https://github.com/mohammadraziei/liburlparser
cd liburlparser
mkdir -p build; cd build
cmake ..
# Build the project:
make
# [Optional] run tests:
make test
# [Optional] make documents:
make docs
# [Optional] Run examples:
./example
# Make install
sudo make install

Python and Command Line:

Note that Python >= 3.8 is required.

Installation

pip by pypi
pip install liburlparser

If you want to use psl.update to update the public suffix list, you must install the online version:

pip install "liburlparser[online]"

Or

pip by git
pip install git+https://github.com/mohammadraziei/liburlparser

Or

manually
git clone https://github.com/mohammadraziei/liburlparser
pip install ./liburlparser

Performance

Extract From Host

Tests were run on a file containing 10 million random domains from various top-level domains (Mar. 13th, 2022)

Library            Function                    Time
liburlparser       liburlparser.Host           1.12s
PyDomainExtractor  pydomainextractor.extract   1.50s
publicsuffix2      publicsuffix2.get_sld       9.92s
tldextract         __call__                    29.23s
tld                tld.parse_tld               34.48s

Extract From URL

The test was conducted on a file containing 1 million random URLs (Mar. 13th, 2022)

Library            Function                             Time
liburlparser       liburlparser.Host.from_url           2.10s
PyDomainExtractor  pydomainextractor.extract_from_url   2.24s
publicsuffix2      publicsuffix2.get_sld                10.84s
tldextract         __call__                             36.04s
tld                tld.parse_tld                        57.87s
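
The benchmark script itself is not shown here. As a rough sketch of how timings like these could be measured, the snippet below times a callable over a list of inputs with `time.perf_counter`. Everything in it is an assumption for illustration: `time_extractor` is a hypothetical helper, `last_label` is a trivial stand-in for a real extractor, and the generated hosts are synthetic rather than the file used for the figures above.

```python
import time


def time_extractor(extract, inputs, repeats=3):
    """Return the best wall-clock time in seconds over several full passes."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for item in inputs:
            extract(item)
        best = min(best, time.perf_counter() - start)
    return best


def last_label(host):
    """Stand-in extractor: returns only the final label of the host."""
    return host.rsplit(".", 1)[-1]


# Synthetic input; a real run would load hosts or URLs from a file.
hosts = [f"sub{i}.example{i % 100}.com" for i in range(10_000)]
elapsed = time_extractor(last_label, hosts)
print(f"{len(hosts)} hosts in {elapsed:.4f}s")
```

To compare libraries, each library's extraction function would be passed as `extract` over the same input list. Taking the best of several passes reduces noise from other processes on the machine.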

License

Distributed under the MIT License. See LICENSE for more information.

Contact

Project Link: https://github.com/mohammadraziei/liburlparser

