
refy

A scientific papers recommendation tool.

Overview

refy leverages Natural Language Processing (NLP) machine learning tools to find new papers that might be relevant given the ones you've already read.

There are a few software tools out there that facilitate the exploration of scientific literature, including:

  • meta.org, which allows users to set up feeds that identify newly published papers that seem relevant given a set of keywords
  • inciteful and scite.ai, which let you explore the network of citations around a given paper of interest
  • connected papers, which lets you visualize a graph representation of papers related to a given paper of interest

Most currently available software is limited in two key ways:

  1. Tools like meta.org rely on keywords, but keywords (e.g. computational neuroscience, Parkinson's disease) are often overly general. As a result, you have to sift through a lot of irrelevant literature before you find something interesting.
  2. Other tools like connected papers only work with one input paper at a time: you give them the title of a paper you've read and they give you suggestions. This is limiting: software that can analyse all the papers you've read has much more information to draw on to find new papers that closely match your interests.

This is what refy is for: refy analyzes the abstracts of several papers of yours and matches them against a database of almost ONE MILLION paper abstracts. By using many input papers at once, refy has a lot more information at its disposal, which (hopefully) means it can recommend relevant papers more effectively. More details about the database used by refy can be found at the bottom of this document.

Disclaimer: The database used by refy is focused on neuroscience papers and preprints published in the last 30 years. If you are interested in older papers or work in a different field, please read the instructions below about how to adjust the database to your needs.

Usage

Installation

If you have an environment with python >= 3.6, you can install refy with:

pip install refy

You can check if everything went okay with:

refy example

which should print an example recommendations report (like the one shown in the figure at the top of this document).

Usage

refy provides a Command Line Interface (CLI) to expose its functionality directly in your terminal:

Note: the first time you use refy it will have to download several files (which you can see here) with data it needs to work. This should only take a few minutes and will take up about 3GB of your hard disk space.

You can use refy in three modes:

  1. In query mode you can find papers relevant to a given input string (e.g. locomotion mouse brainstem)
  2. In suggest mode you give refy a .bib BibTeX file with metadata about as many publications as you want. refy will use this information to find papers relevant across all of your input papers.
  3. daily suggests relevant papers from those released on biorxiv.org in the last 24 hours (see below)

For query mode you use the command refy query STRING; for suggest mode you use refy suggest PATH.

In all cases you can use optional arguments:

    -N INTEGER            number of recommendations to show  [default: 10]
    -since INTEGER        Only keep papers published after SINCE
    -to INTEGER           Only keep papers published before TO
    -save-path, --s TEXT  Save suggestions to file
    -debug, --d           set debug mode ON/OFF  [default: False]

For example:

refy query "locomotion control brainstem" --N 100 --since 2015 --to 2018 --s refs.csv

This will show 100 suggested papers published between 2015 and 2018 related to locomotion control and save the results to refs.csv.

Note: suggest mode is much more powerful and hence the recommended way of finding new literature; however, query lets you quickly look up new papers without having to create a .bib file.

Note: in suggest mode, your .bib file must include the papers' abstracts. Only papers with abstracts will be used for the analysis. Your .bib file entries should look like this:

@ARTICLE{Claudi2020-tb,
  title    = "Brainrender. A python based software for visualisation of
              neuroanatomical and morphological data",
  author   = "Claudi, Federico and Tyson, Adam L and Branco, Tiago",
  abstract = "Abstract Here we present brainrender, an open source python
              package for rendering three-dimensional neuroanatomical data
              aligned to the Allen Mouse Atlas. Brainrender can be used to
              explore, visualise and compare data from publicly available
              datasets (e.g. from the Mouse Light project from Janelia) as well
              as data generated within individual laboratories. Brainrender
              facilitates the exploration of neuroanatomical data with
              three-dimensional renderings, aiding the design and
              interpretation of experiments and the dissemination of anatomical
              findings. Additionally, brainrender can also be used to generate
              high-quality, publication-ready, figures for scientific
              publications.",
  journal  = "Cold Spring Harbor Laboratory",
  pages    = "2020.02.23.961748",
  month    =  feb,
  year     =  2020,
  language = "en",
  doi      = "10.1101/2020.02.23.961748"
}
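
If you want to check which of your .bib entries actually carry an abstract before running suggest, a minimal sketch along these lines would do it. It uses the third-party bibtexparser package and is not part of refy; the file name is just a placeholder.

import bibtexparser

# Load the .bib file and count the entries that have an abstract field,
# mirroring the filtering refy applies to your library.
with open("my_library.bib") as f:
    library = bibtexparser.load(f)

with_abstract = [entry for entry in library.entries if entry.get("abstract")]
print(f"{len(with_abstract)}/{len(library.entries)} entries have an abstract")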

hint: if you use reference managers like Zotero or Paperpile you can easily export BibTeX data about your papers

refy has two kinds of output:

  1. It prints to the terminal a list of N recommended papers, sorted by their recommendation score. It also shows keywords that appear in your input papers and the authors that come up most frequently in your recommendations. See the figure at the top of this document for reference.
  2. Optionally, refy can save the list of recommended papers to a .csv file so that you can explore them at your leisure.

**hint:** the DOIs on the right of the report are working links (if your terminal supports links)!

Daily

Hard to keep up with all the preprints that come out all the time? Fear not, refy's daily function is here to help. In your terminal, call refy daily mybib.bib and refy will recommend papers from those that came out on biorxiv.org in the last 24 hours. You can use the -N argument to specify how many suggestions you'd like to see.

Scripting

You can of course access all of refy's functionality from normal Python scripts. For instance:

from refy import suggest

rec = suggest('my_library.bib')  # recommendations based on every paper in the .bib file
print(rec.suggestions)           # the suggested papers

Under the hood

This section explains how refy works. If you just want to use refy and don't care about what happens under the hood then feel free to skip this.

refy uses NLP algorithms to estimate semantic similarity across papers based on the content of their abstracts. In particular, it uses Doc2Vec, an adaptation of Word2Vec. Word2Vec learns embeddings of words such that semantically similar words are closer together in the embedding space than semantically dissimilar ones; Doc2Vec extends Word2Vec to learn vector embeddings of entire documents.

The Doc2Vec model used here is trained on the entire corpus of almost one million papers. When it comes to finding recommendations, refy uses Doc2Vec to create a vector representation of each of your papers and finds the N closest vectors in the database, which, hopefully, correspond to papers similar to yours.

This operation is repeated for each paper in your .bib file, and the recommendations are then pooled and scored: the papers that score highly for the largest number of input papers end up as the most strongly recommended ones.
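
To make this concrete, here is a minimal sketch of this kind of Doc2Vec-based matching and pooling, written with gensim's Doc2Vec API. It is not refy's actual implementation: the model file name, the closest_papers helper, the example abstracts and the scoring scheme are illustrative assumptions.

from collections import Counter

from gensim.models.doc2vec import Doc2Vec
from gensim.utils import simple_preprocess

# Assumption: a Doc2Vec model pre-trained on the database abstracts,
# with each database paper tagged by its own ID.
model = Doc2Vec.load("database_doc2vec.model")

def closest_papers(abstract, n=100):
    """Embed one abstract and return the n most similar database papers."""
    vector = model.infer_vector(simple_preprocess(abstract))
    return model.dv.most_similar([vector], topn=n)  # [(paper_id, similarity), ...]

# Abstracts of the papers in your .bib file (however you extracted them)
my_abstracts = ["first abstract ...", "second abstract ..."]

# Pool suggestions across all input abstracts: papers that score highly
# for many of your papers end up with the highest overall score.
scores = Counter()
for abstract in my_abstracts:
    for paper_id, similarity in closest_papers(abstract):
        scores[paper_id] += similarity

recommendations = scores.most_common(10)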

Database

refy uses a curated database with metadata for about one million papers to recommend literature for you. The data come from two sources: biorxiv.org (for preprints) and semantic scholar's Open Corpus.

Data from these two vast databases (several million publications) are filtered to selectively keep neuroscience and ML papers written in English and published in the last 30 years. This selection is necessary to keep the compute and memory requirements within reasonable bounds.
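
As an illustration of the kind of filtering involved, the sketch below reads one compressed chunk of the semantic scholar corpus and keeps recent neuroscience papers that have an abstract. The file name and the record fields (year, paperAbstract, fieldsOfStudy) are assumptions about the corpus layout; refy's own filtering is configured through refy/settings.py (see the steps below).

import gzip
import json

def keep(record, min_year=1990):
    """Keep recent papers that have an abstract and are tagged as neuroscience."""
    return (
        (record.get("year") or 0) >= min_year
        and bool(record.get("paperAbstract"))
        and "Neuroscience" in (record.get("fieldsOfStudy") or [])
    )

selected = []
with gzip.open("s2-corpus-000.gz", "rt") as f:
    for line in f:  # one JSON record per line
        record = json.loads(line)
        if keep(record):
            selected.append(record)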

If you wish to create your custom database, these are the steps you'll need to follow:

  1. download the compressed data from semantic scholar's Open Corpus and save them in a folder. Note that the entire database is >100GB in size even when compressed, so this might take a while
  2. clone this repository with git clone https://github.com/FedeClaudi/refy.git
  3. in the shell, cd refy
  4. within the cloned repository, edit refy/settings.py. It contains a few settings that apply to the creation of the database (e.g. selecting papers based on year of publication). Set these to values that match your needs
  5. install the edited version of refy with pip install . -U
  6. create the edited database with refy update_database FOLDER, where FOLDER is the path to the folder where you saved the data downloaded from semantic scholar

Once the database update has finished (it should take <5 hours), you can re-train the Doc2Vec model to fit it to your database with refy train (use refy train --help to see the available options). This step will likely require ~12 hours, but the duration depends on the specs of your machine and the size of the database you've created.
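
For intuition, re-training a Doc2Vec model on a corpus of abstracts looks roughly like the gensim sketch below. The parameters and the abstracts dictionary are illustrative; refy train uses its own settings and data pipeline.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess

# Illustrative input: {paper_id: abstract_text} built from your updated database
abstracts = {"paper_1": "first abstract ...", "paper_2": "second abstract ..."}

# One TaggedDocument per paper, tagged with its ID so it can be looked up later
corpus = [
    TaggedDocument(words=simple_preprocess(text), tags=[paper_id])
    for paper_id, text in abstracts.items()
]

model = Doc2Vec(vector_size=100, min_count=2, epochs=40, workers=4)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)
model.save("database_doc2vec.model")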

