A lightweight, extensible, high-level, universal code parser built on top of tree-sitter

tree-hugger

Mine source code repositories at scale, easily. Tree-hugger is a lightweight, high-level library that provides Pythonic APIs to mine recursively through GitHub repositories. Tree-hugger is built on top of tree-sitter.

Covered languages:

  • Python
  • PHP
  • Java
  • JavaScript
  • C++ (coming)

System Requirement: Python 3.6

Contents

  1. Installation

  2. Setup

  3. Hello world example

  4. API reference

  5. Extending tree-hugger

  6. Roadmap


Installation

From pip:

pip install tree-hugger

From Source:

git clone https://github.com/autosoft-dev/tree-hugger.git

cd tree-hugger

pip install -e .

The installation process has been tested on macOS Mojave. We have a separate Docker binding for compiling the libraries on Linux, and this library will soon be integrated into it as well.

You may need to install libgit2. On macOS, just use brew install libgit2.

Setup

Building the .so files

Please note that building the libraries has been tested on macOS Mojave with Apple LLVM version 10.0.1 (clang-1001.0.46.4).

Please check out our Linux specific instructions here

Once this library is installed, it gives you a command line utility to download and compile tree-sitter .so files with ease. For example:

create_libs python

Here is the full usage guide for the command:

usage: create_libs [-h] [-c] [-l LIB_NAME] langs [langs ...]

positional arguments:
  langs                 Give the name of languages for tree-sitter (php,
                        python, go ...)

optional arguments:
  -h, --help            show this help message and exit
  -c, --copy-to-workspace
                        Shall we copy the created libs to the present dir?
                        (default: False)
  -l LIB_NAME, --lib-name LIB_NAME
                        The name of the generated .so file

Environment variables

You can set the TS_LIB_PATH environment variable to the tree-sitter library path, and the library will then use it automatically. Alternatively, you can pass the path when creating any Parser object.
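The two options above can be sketched as a small lookup. This is only an illustration: resolve_ts_lib_path is a hypothetical helper and not part of tree-hugger, the path tslibs/python.so is an assumed location, and the precedence shown (an explicit path wins over the environment variable) is an assumption rather than documented behaviour.

```python
import os
from typing import Optional


def resolve_ts_lib_path(explicit_path: Optional[str] = None) -> str:
    """Hypothetical helper: pick the tree-sitter .so path.

    Assumed precedence: an explicitly passed path first, then the
    TS_LIB_PATH environment variable mentioned in this README.
    """
    if explicit_path:
        return explicit_path
    env_path = os.environ.get("TS_LIB_PATH")
    if env_path:
        return env_path
    raise RuntimeError(
        "No tree-sitter library path: set TS_LIB_PATH or pass a path explicitly"
    )


# Set the variable for the current process only.
# "tslibs/python.so" is an assumed location, not one mandated by the library.
os.environ["TS_LIB_PATH"] = "tslibs/python.so"
print(resolve_ts_lib_path())                   # falls back to the environment variable
print(resolve_ts_lib_path("other/python.so"))  # explicit path is used as given
```

In a shell session the same variable would normally be exported before starting Python, so that every Parser object picks it up.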

Hello world example

  1. Generate the libraries: run the command above to generate the libraries.

    In our setup we use the -c flag to copy the generated tree-sitter .so file to our workspace. Once it is copied, we place it under a directory called tslibs (which is listed in .gitignore).

    ⚠ If you are using Linux, you will need to use our tree-sitter-docker image and manually copy the final .so file.

  2. Set up the environment variable (optional): assuming that you have the necessary environment variable set up, the following lines of code will create a PythonParser object:

from tree_hugger.core import PythonParser

pp = PythonParser()

You can then pass in any Python file that you want to analyze, like so:

pp.parse_file("tests/assets/file_with_different_functions.py")
Out[3]: True

parse_file returns True on success.

You are then free to use the methods exposed by that particular Parser object. For example:

pp.get_all_function_names()
Out[4]:
['first_child',
 'second_child',
 'say_whee',
 'wrapper',
 'my_decorator',
 'parent']

OR

pp.get_all_function_documentations()
Out[5]:
{'parent': '"""This is the parent function\n    \n    There are other lines in the doc string\n    This is the third line\n\n    And this is the fourth\n    """',
 'first_child': "'''\n        This is first child\n        '''",
 'second_child': '"""\n        This is second child\n        """',
 'my_decorator': '"""\n    Outer decorator function\n    """',
 'say_whee': '"""\n    Hellooooooooo\n\n    This is a function with decorators\n    """'}

(Notice that the last call only returns the functions that have a documentation comment.)
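To apply the same calls to a whole checked-out repository, one possible approach is to walk the directory tree and parse each file in turn. The sketch below is not part of tree-hugger: collect_function_names is a hypothetical helper, and it relies only on the two methods shown above (parse_file and get_all_function_names), so any object offering them can be passed in.

```python
import os
from typing import Dict, List


def collect_function_names(root_dir: str, parser) -> Dict[str, List[str]]:
    """Walk root_dir recursively and map each parsable .py file to its function names.

    `parser` is any object offering parse_file(path) -> bool and
    get_all_function_names() -> list, as PythonParser does in the session above.
    """
    results: Dict[str, List[str]] = {}
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for filename in sorted(filenames):
            if not filename.endswith(".py"):
                continue
            path = os.path.join(dirpath, filename)
            # parse_file returns True on success; skip files that fail to parse.
            if parser.parse_file(path):
                results[path] = parser.get_all_function_names()
    return results


# Usage (requires tree-hugger and a compiled .so file; the path is a placeholder):
# from tree_hugger.core import PythonParser
# print(collect_function_names("path/to/cloned/repo", PythonParser()))
```

Reusing a single parser object for every file keeps the sketch short; whether that is preferable to creating one parser per file is a design choice this README does not settle.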

API reference

Python
  • Functions: all_function_names, all_function_docstrings, all_function_names_and_params, all_function_bodies
  • Methods: all_class_methods, all_class_method_docstrings
  • Classes: all_class_names, all_class_docstrings

PHP
  • Functions: all_function_names, all_function_names_and_params, all_function_bodies
  • Methods: all_class_methods
  • Classes: all_class_names

Java
  • Methods: all_class_methods, all_method_names_and_params, all_method_bodies
  • Classes: all_class_names

JavaScript
  • Functions: all_function_names, all_function_names_and_params, all_function_bodies
  • Methods: all_class_methods
  • Classes: all_class_names

Extending tree-hugger

Extending tree-hugger to other languages, or adding functionality for the languages already provided, is easy.

  1. Adding languages:

Support for new languages is added by writing a parser class that inherits from the BaseParser class. The only mandatory argument that a parser class must pass to the parent is the language, as a lower-case string such as python. Each parser class must also accept, in its constructor, the path to the tree-sitter library (the .so file used to parse the code) and the path to the queries yaml file.

The BaseParser class does a few things:

  • Loading and preparing the .so file for the language you specified.
  • Loading, preparing and parsing the query yaml file (for the queries, we internally use an extended UserDict class).
  • Providing an API, BaseParser.parse_file, to parse a file and prepare it for querying.

It also gives you another API, _run_query_and_get_captures (most likely not to be exposed outside the class), which lets you run any query and returns the matched results (if any) from the parsed tree.

We use those APIs once parse_file has been called and the file has been parsed.
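The constructor contract described above could be sketched as follows. To keep the snippet runnable without compiled libraries, StandInBaseParser is a hypothetical stand-in for tree-hugger's BaseParser and not the real class. Its constructor argument names (language, library_loc, query_file_path), the GoParser class, the all_function_names query for Go, and the return shape of _run_query_and_get_captures are all assumptions, so check the real base class before copying the pattern.

```python
from typing import List, Optional, Tuple


class StandInBaseParser:
    """Hypothetical stand-in for tree-hugger's BaseParser (not the real class)."""

    def __init__(self, language: str, library_loc: Optional[str] = None,
                 query_file_path: Optional[str] = None):
        self.language = language                # lower-case name, e.g. "python"
        self.library_loc = library_loc          # path to the tree-sitter .so file
        self.query_file_path = query_file_path  # path to the queries yaml file
        self.parsed = False

    def parse_file(self, file_path: str) -> bool:
        # The real class parses the file with tree-sitter; here we only record the call.
        self.parsed = True
        return True

    def _run_query_and_get_captures(self, query_name: str) -> List[Tuple[str, str]]:
        # Assumed shape: (matched text, capture name) pairs. Canned data for the sketch.
        return [("main", "function.def"), ("helper", "function.def")]


class GoParser(StandInBaseParser):
    """Sketch of a new language parser: pass the language string to the parent."""

    def __init__(self, library_loc: Optional[str] = None,
                 query_file_path: Optional[str] = None):
        super().__init__("go", library_loc, query_file_path)

    def get_all_function_names(self) -> List[str]:
        # Queries are only meaningful once a file has been parsed.
        if not self.parsed:
            raise RuntimeError("call parse_file() before running queries")
        captures = self._run_query_and_get_captures("all_function_names")
        return [text for text, name in captures if name == "function.def"]


gp = GoParser(library_loc="tslibs/go.so", query_file_path="queries.yml")
gp.parse_file("main.go")
print(gp.get_all_function_names())  # ['main', 'helper']
```

The public method wraps the private query runner, which mirrors the split described above between parse_file and _run_query_and_get_captures.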

  2. Adding queries:

Queries run on source code are s-expressions. They are listed in a queries.yml file for each parser class, so tree-hugger lets you write the queries for each parsed language in a yaml file.

Query structure: the name of the query followed by the query itself, written as an s-expression. Example:

all_function_docstrings:
        "
        (
            function_definition
            name: (identifier) @function.def
            body: (block(expression_statement(string))) @function.docstring
        )
        "

You have to follow yaml grammar while writing these queries. You can read more about writing these queries in the tree-sitter documentation.
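Before adding a query to the yaml file, it can help to run a quick sanity check on the s-expression. The sketch below uses only the standard library and the example query shown above; parens_balanced and capture_names are hypothetical helpers, not part of tree-hugger or tree-sitter, and the check is deliberately naive (it would miscount parentheses that appear inside quoted string literals).

```python
import re
from typing import List

# The all_function_docstrings query from the example above.
QUERY = """
(
    function_definition
    name: (identifier) @function.def
    body: (block(expression_statement(string))) @function.docstring
)
"""


def parens_balanced(query: str) -> bool:
    """Return True if every '(' is closed by a later ')'."""
    depth = 0
    for ch in query:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ')' appeared before its '('
                return False
    return depth == 0


def capture_names(query: str) -> List[str]:
    """List the @capture names in the order they appear in the query."""
    return re.findall(r"@([\w.]+)", query)


print(parens_balanced(QUERY))  # True
print(capture_names(QUERY))    # ['function.def', 'function.docstring']
```

The capture names are what a parser method would presumably use to tell the matched nodes apart, so listing them is a cheap way to spot a misspelled capture.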

Some example queries that you will find in the yaml file (and their corresponding APIs in the PythonParser class):

* all_function_names => get_all_function_names()

* all_function_docstrings => get_all_function_documentations()

* all_class_methods => get_all_class_method_names()

Roadmap

  • Documentation: a tutorial on writing queries

  • Write *Parser classes for other languages

Languages Status-Finished Author
Python Shubhadeep
PHP Clément
Java Clément
JavaScript Clément
C++ Clément
