
TrollHunter

TrollHunter is a Twitter crawler and news website indexer. It aims to find troll farmers and fake news on Twitter.

It is composed of three parts:

  • a Twint API to extract information about a tweet or a user
  • a News Indexer, which indexes all the articles of a website and extracts their keywords
  • analysis of the tweets and news

Installation

Docker

TrollHunter requires several services to run:

  • ELK (Elasticsearch, Logstash, Kibana)
  • InfluxDB & Grafana
  • RabbitMQ

You can either launch them individually, if you already have them set up, or use our docker-compose.yml:

  • Install Docker
  • Run docker-compose up -d

Set the required values in the .env file, then export the .env variables:

export $(cat .env | sed 's/#.*//g' | xargs)

You can either run

pip3 install TrollHunter

or clone the project and run

pip3 install -r requirements.txt

Twint API

News Indexer

The second main part of the project is the news crawler and indexer.

For this, we use the sitemap XML files of news websites to crawl all the articles. From a sitemap file, we extract the sitemap and url tags.

The sitemap tag is a link to a child sitemap XML file for a specific category of articles on the website.

The url tag represents a single article or news item on the website.

The root URL of a sitemap is stored in a PostgreSQL database along with the website's trust level (Oriented, Verified, Fake News, ...) and its headers. The headers are the tags we want to extract from each url tag, which contain details about the article (title, keywords, publication date, ...).

The headers are the list of fields used in the ElasticSearch index pattern.

While crawling sitemaps, we insert new child sitemaps into the database with their last modification date, or update that date for the ones already in the database. The last modification date is used to crawl only the sitemaps that have changed since the last crawl.
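
A minimal sketch of this step, using only the standard library; the sample XML, function name, and cutoff logic are illustrative assumptions, not the project's actual code:

```python
# Parse a sitemap index and keep only the child sitemaps whose <lastmod>
# is newer than the previous crawl, mirroring the "crawl only what
# changed" idea described above.
import xml.etree.ElementTree as ET
from datetime import datetime

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

SITEMAP_INDEX = """<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-politics.xml</loc>
           <lastmod>2020-04-02</lastmod></sitemap>
  <sitemap><loc>https://example.com/sitemap-sports.xml</loc>
           <lastmod>2020-01-15</lastmod></sitemap>
</sitemapindex>"""

def changed_sitemaps(xml_text, last_crawl):
    """Return (loc, lastmod) pairs for child sitemaps newer than last_crawl."""
    root = ET.fromstring(xml_text)
    out = []
    for sm in root.findall("sm:sitemap", NS):
        loc = sm.findtext("sm:loc", namespaces=NS)
        lastmod = datetime.fromisoformat(sm.findtext("sm:lastmod", namespaces=NS))
        if lastmod > last_crawl:
            out.append((loc, lastmod))
    return out

print(changed_sitemaps(SITEMAP_INDEX, datetime(2020, 3, 1)))
```

A real crawler would fetch each returned loc, recurse into nested sitemap indexes, and persist the new lastmod values back to the database.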

The data extracted from the url tags is built into a dataframe and then sent to ElasticSearch, for further use with the requests in the Twint API.
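
The project builds a pandas DataFrame for this step; as a dependency-free illustration of the same idea, the sketch below turns extracted rows into an ElasticSearch _bulk NDJSON payload. The index name and field names are assumptions:

```python
# Build the newline-delimited JSON body expected by the ElasticSearch
# _bulk endpoint: one action line plus one document line per article.
import json

articles = [
    {"url": "https://example.com/a1", "title": "Sample headline",
     "keywords": "politics,election", "pub_date": "2020-04-02"},
    {"url": "https://example.com/a2", "title": "Another story",
     "keywords": "", "pub_date": "2020-04-03"},
]

def to_bulk_payload(rows, index="news"):
    """Serialize rows as an NDJSON _bulk body, using the URL as document id."""
    lines = []
    for row in rows:
        lines.append(json.dumps({"index": {"_index": index, "_id": row["url"]}}))
        lines.append(json.dumps(row))
    return "\n".join(lines) + "\n"

payload = to_bulk_payload(articles)
print(payload)
```

Using the article URL as the document id makes re-indexing idempotent: crawling the same sitemap twice updates documents instead of duplicating them.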

At the same time, some sitemaps don't provide keywords for their articles. Hence, we retrieve the entries without keywords from ElasticSearch, download the content of each article, extract the keywords with NLP, and finally update the entries in ElasticSearch.
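
The project's keyword extraction relies on NLP tooling; as a rough stand-in, this sketch scores words by frequency after dropping stopwords. The stopword list and the top_n cutoff are arbitrary choices for illustration:

```python
# Naive keyword extraction: tokenize, drop stopwords and very short
# tokens, and return the most frequent remaining words.
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "on", "is", "are", "say"}

def extract_keywords(text, top_n=5):
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(top_n)]

article = ("The election results are in, and the election officials "
           "say turnout in the election broke records.")
print(extract_keywords(article))
```

A production pipeline would instead use an NLP library with stemming and phrase detection, but the update step is the same: write the extracted keywords back to the ElasticSearch entry.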

Run

For the crawler/indexer:

from TrollHunter.news_crawler import scheduler_news

scheduler_news(time_interval)

For updating keywords:

from TrollHunter.news_crawler import scheduler_keywords

scheduler_keywords(time_interval, max_entry)

Alternatively, see the main usage with Docker.
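
Both helpers take a time interval, which suggests a run-then-sleep loop. A hypothetical sketch of such a scheduler (the function name and the max_runs cap are assumptions added so the loop can terminate):

```python
# Interval scheduler in the spirit of scheduler_news / scheduler_keywords:
# run the task, sleep for the interval, repeat up to an optional cap.
import time

def run_every(task, time_interval, max_runs=None):
    """Call task() every time_interval seconds; stop after max_runs if set."""
    runs = 0
    while max_runs is None or runs < max_runs:
        task()
        runs += 1
        time.sleep(time_interval)
    return runs

calls = []
run_every(lambda: calls.append(1), time_interval=0.01, max_runs=3)
print(len(calls))
```

With max_runs left as None the loop runs forever, which matches how a crawler scheduler would typically be deployed inside a container.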


