Commandline tools for training Fathom rulesets

Project description

This is the commandline trainer for Fathom, a supervised-learning system for recognizing parts of web pages. This package also includes other tools for ruleset development, such as fathom-extract, fathom-pick, and fathom-test. See the Fathom documentation for details on the trainer.

Version History

3.4.1
  • Add confusion matrices to fathom-train and fathom-test readouts.

  • Catch JS syntax errors and other compile-time errors, and report them in fathom-train and fathom-test.

  • Catch errors due to the absence of prerequisite commands like npm.

  • Catch and nicely report HTTP server errors during autovectorization rather than just spewing tracebacks. Add --delay option to fathom-train and fathom-test to work around them.

  • Don’t spit out nan for precision or F1 when we don’t get any samples right.
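The nan in that last item typically comes from a zero denominator: with no true positives and no false positives, precision is 0/0. A minimal sketch of one way to guard against it follows. The function name is hypothetical, and reporting 0.0 for an undefined ratio is an assumption; the notes above do not say what the trainer prints instead.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Return (precision, recall, F1), using 0.0 where a ratio is undefined.

    Assumption: 0.0 is the fallback value. The real trainer may report
    something else for the undefined case.
    """
    predicted_positives = true_positives + false_positives
    actual_positives = true_positives + false_negatives
    # Guard each division so an empty denominator never produces nan.
    precision = true_positives / predicted_positives if predicted_positives else 0.0
    recall = true_positives / actual_positives if actual_positives else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1
```

Calling it with no correct samples, for example precision_recall_f1(0, 0, 5), yields zeros rather than nan.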

3.4
  • Make vectorization automatic. This largely obsoletes fathom-list and fathom-serve. We also remove the need to have 3 terminal tabs open, running yarn watch, yarn browser, and fathom-serve. We remove the error-prone hardlinking of the ruleset into FathomFox, which breaks when git changes to a new branch with a changed ruleset file. We eliminate the possibility of forgetting to revectorize after changing a ruleset or samples. And finally, we pave the way to dramatically simplify our teaching and documentation.

    We tried to hew to the CLI design of the previous version of the trainer to keep things familiar. Basically, where you used to pass in a vector file, now feel free to pass in a directory of samples instead. If you do, you’ll also need to pass in your ruleset file and the trainee ID so we can turn the samples into vectors behind the scenes. You can also keep passing in vector files manually if you want more control in some niche situation, like if you’re trying to reproduce results from an old branch.

    Aggressive caching is in place to remove every possible impediment to using auto-vectorization. We store hashes of the ruleset and samples so we can tell when revectorizing is necessary. We also cache a built copy of FathomFox (embedded in the Python package) so we don’t need to run npm or yarn or hit the network again until you upgrade to a new version of the Fathom CLI tools.

  • Add an --exclude option to the trainer to help with feature ablation.

  • Fix an issue where the trainer would read vectors as non-UTF-8 on Windows.

  • In the trainer output, make tag excerpts that contain wide Unicode chars fit in their columns.

  • Don’t show tag excerpts in fathom-test by default.

  • Add application/x-javascript and application/font-sfnt to fathom-extract’s list of known MIME types.

  • fathom-list, though no longer needed in most cases, is now always recursive. It has also learned to ignore resources directories.

  • fathom-unzip is gone.
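The hash-based caching described for auto-vectorization can be sketched roughly as follows. This is an illustration of the general technique, not the package's actual code: the function names, the choice of SHA-256, and the single-file cache layout are all assumptions.

```python
import hashlib
from pathlib import Path


def fingerprint(paths):
    """Return a SHA-256 hex digest covering the names and contents of paths."""
    digest = hashlib.sha256()
    # Sort so the result does not depend on the order files were listed in.
    for path in sorted(Path(p) for p in paths):
        digest.update(str(path).encode('utf-8'))
        digest.update(path.read_bytes())
    return digest.hexdigest()


def needs_revectorizing(ruleset_path, sample_paths, cache_file):
    """Return True if the ruleset or samples changed since the last call.

    Hypothetical helper: it also records the new fingerprint in cache_file,
    so a second call with unchanged inputs returns False.
    """
    current = fingerprint([ruleset_path, *sample_paths])
    cache_file = Path(cache_file)
    if cache_file.exists() and cache_file.read_text(encoding='utf-8') == current:
        return False
    cache_file.write_text(current, encoding='utf-8')
    return True
```

A production version would probably store the fingerprint only after vectorization succeeds, so that a failed run is retried next time.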

3.3
  • Add to the trainer a readout of the average time per candidate tag examined.

  • Replace trainer’s per-page metrics, which were increasingly incoherent in Fathom 3, with per-tag ones. Per-page results were most useful back before Fathom could emit confidences. Now, most problems are concerned with per-tag accuracy, and problems that innately concern the page as a whole model it by scoring the <html> tag. Thus, we swap out the old per-page report for a per-tag one. This is a superset of the per-page report.

  • Add a confidence-threshold customization option to fathom-train.

3.2
  • Add fathom-test tool for computing test-corpus accuracies.

  • Add fathom-extract to break down frozen pages into small enough pieces to check into GitHub.

  • Add fathom-serve to dodge the CORS errors that otherwise happen when loading extracted pages.

  • Add a test harness for the Python code.

  • Add confidence intervals for false positives and false negatives in trainer.

  • Add precision and recall numbers to trainer.

  • Add optional positive-sample weighting in trainer, for trading off between precision and recall.

  • Add experimental support for deeper neural networks in trainer.

  • Add recognition-time speed metrics to trainer.
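To illustrate how positive-sample weighting can trade precision against recall, here is a minimal sketch of a weighted binary cross-entropy loss in plain Python. The function and its pos_weight parameter are hypothetical, and the notes above do not state which loss the trainer actually uses; this only shows the general idea.

```python
import math


def weighted_bce(confidences, labels, pos_weight=1.0):
    """Mean binary cross-entropy with extra weight on positive samples.

    Assumption: labels are 0 or 1 and confidences are probabilities.
    A pos_weight above 1 makes missed positives cost more, which tends
    to favor recall; a value below 1 tends to favor precision.
    """
    total = 0.0
    for p, y in zip(confidences, labels):
        # Clamp to avoid log(0) on fully confident predictions.
        p = min(max(p, 1e-12), 1 - 1e-12)
        total -= pos_weight * y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(labels)
```

With pos_weight=2.0, a positive sample scored at 0.5 contributes twice the loss of a negative sample scored at 0.5, so the optimizer is pushed harder to raise confidence on positives.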

3.1
  • Add fathom-list tool.

  • Further optimize trainer: about 17x faster for a 60-sample corpus, with superlinear improvements for larger ones.

3.0
  • Move to Fathom repo.

  • Add fathom-unzip and fathom-pick.

  • Switch to the Adam optimizer, which is significantly more turn-key, to the point where it doesn’t need its learning-rate decay set manually.

  • Tolerate pages for which no candidate nodes were collected.

  • Add 95% CI for per-page training accuracy.

  • Add validation-guided early stopping.

  • Revise per-page accuracy calculation and display.

  • Shuffle training samples before training.

  • Add false-positive and false-negative numbers to per-tag metrics.
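Validation-guided early stopping, mentioned in the list above, generally means watching the loss on a held-out validation set and keeping the weights from the epoch where it was lowest. The sketch below shows that selection logic on a precomputed list of losses. The function name and the patience parameter are assumptions for illustration, not the trainer's actual interface.

```python
def early_stopping_epoch(validation_losses, patience=3):
    """Return the index of the epoch whose weights should be kept.

    Stops scanning once `patience` consecutive epochs have passed
    without a new best validation loss.
    """
    best_epoch = 0
    best_loss = float('inf')
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: give up.
            break
    return best_epoch
```

In a real training loop the losses would be computed one epoch at a time, and the loop itself would break at the same point instead of scanning a finished list.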

3.0a1
  • First release, intended for use with Fathom 3.0 or later.

Download files

Download the file for your platform.

Source Distribution

fathom-web-3.4.1.tar.gz (317.9 kB)

Uploaded Source

Built Distribution

fathom_web-3.4.1-py2.py3-none-any.whl (325.9 kB)

Uploaded Python 2, Python 3

File details

Details for the file fathom-web-3.4.1.tar.gz.

File metadata

  • Download URL: fathom-web-3.4.1.tar.gz
  • Size: 317.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.45.0 CPython/3.7.3

File hashes

Hashes for fathom-web-3.4.1.tar.gz
Algorithm    Hash digest
SHA256       71db007ec0bb7694cd8a78432f23a44e863da2f9b181a2a65e59be88f0fb2ca4
MD5          057bf08dcf06ef620ec2de48b33d9e73
BLAKE2b-256  6464230eb2a907d47f5efaf5a1c855442a38b9074913bdd4532edb96c345f661
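
A downloaded file can be checked against the SHA256 digest listed above with Python's standard hashlib module. This is a minimal sketch; the function names are hypothetical, and the expected digest is simply the value copied from the table above.

```python
import hashlib

# SHA256 digest for fathom-web-3.4.1.tar.gz, copied from the table above.
EXPECTED_SHA256 = "71db007ec0bb7694cd8a78432f23a44e863da2f9b181a2a65e59be88f0fb2ca4"


def sha256_of_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of the file at path."""
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path, expected=EXPECTED_SHA256):
    """Return True if the file's SHA-256 digest equals the expected one."""
    return sha256_of_file(path) == expected.lower()
```

For example, matches_digest('fathom-web-3.4.1.tar.gz') should return True for an intact copy of the source distribution.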

File details

Details for the file fathom_web-3.4.1-py2.py3-none-any.whl.

File metadata

  • Download URL: fathom_web-3.4.1-py2.py3-none-any.whl
  • Size: 325.9 kB
  • Tags: Python 2, Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.45.0 CPython/3.7.3

File hashes

Hashes for fathom_web-3.4.1-py2.py3-none-any.whl
Algorithm    Hash digest
SHA256       c3d079bd1570ed59df5fa1fdc91279c3ff0bb7813cd187ac32421b80865e1cba
MD5          9de1993f8fc9715efc02746f752b74a1
BLAKE2b-256  53dd5db88c34c7cb677919dcf43a323af91817b3841a82f3977e2f62bd2828f9
