A Python library to push web resources into public web archives
Archive Now (archivenow)
=============================
A Tool To Push Web Resources Into Web Archives
----------------------------------------------
Archive Now (**archivenow**) is currently configured to push resources into four public web archives. You can add more archives by writing a new archive handler (e.g., ia_handler.py) and placing it inside the folder "handlers".
As explained below, this library can be used through:
- CLI
- A Web Service
- A Docker Container
- Python
Installing
----------
The latest release of **archivenow** can be installed using pip:
.. code-block:: bash
$ pip install archivenow
The latest development version containing changes not yet released can be installed from source:
.. code-block:: bash
$ git clone git@github.com:maturban/archivenow.git
$ cd archivenow
$ pip install -r requirements.txt
$ pip install ./
CLI Usage
---------
Usage information for **archivenow** can be displayed with the `-h` or `--help` flag:
.. code-block:: bash
$ archivenow -h
usage: archivenow.py [-h] [--cc] [--cc_api_key [CC_API_KEY]] [--ia] [--is]
[--wc] [-v] [--all] [--server] [--host [HOST]]
[--port [PORT]]
[URI]
positional arguments:
URI URI of a web resource
optional arguments:
-h, --help show this help message and exit
--cc Use The Perma.cc Archive
--cc_api_key [CC_API_KEY]
An API KEY is required by The Perma.cc Archive
--ia Use The Internet Archive
--is Use The Archive Today
--wc Use The WebCite Archive
-v, --version Report the version of archivenow
--all Use all possible archives
--server Run archiveNow as a Web Service
--host [HOST] A server address
--port [PORT] A port number to run a Web Service
Examples
--------
- **Example 1**
To save the web page (www.foxnews.com) in the Internet Archive:
.. code-block:: bash
$ archivenow --ia www.foxnews.com
['https://web.archive.org/web/20170209135625/http://www.foxnews.com']
- **Example 2**
By default, if no optional arguments are provided, the web page (e.g., www.foxnews.com) will be saved in the Internet Archive:
.. code-block:: bash
$ archivenow www.foxnews.com
['https://web.archive.org/web/20170215164835/http://www.foxnews.com']
- **Example 3**
To save the web page (www.foxnews.com) in the Internet Archive (archive.org) and The Archive Today (archive.is):
.. code-block:: bash
$ archivenow --ia --is www.foxnews.com
['https://web.archive.org/web/20170209140345/http://www.foxnews.com', 'http://archive.is/fPVyc']
- **Example 4**
To save the web page (www.foxnews.com) in all configured web archives:
.. code-block:: bash
$ archivenow --all www.foxnews.com --cc_api_key $YOUR-Perma-cc-API-KEY
['https://perma.cc/8YYC-C7RM','https://web.archive.org/web/20170220074919/http://www.foxnews.com','http://archive.is/jy8B0','http://www.webcitation.org/6o9IKD9FP']
Server
------
You can run **archivenow** as a web service and optionally specify the server address and/or the port number (e.g., --host localhost --port 11111).
.. code-block:: bash
$ archivenow --server
2017-02-09 14:20:33
Running on http://0.0.0.0:12345
(Press CTRL+C to quit)
- **Example 5**
To save the web page (www.foxnews.com) in The Internet Archive through the web service:
.. code-block:: bash
$ curl -i http://0.0.0.0:12345/ia/www.foxnews.com
HTTP/1.0 200 OK
Content-Type: application/json
Content-Length: 95
Server: Werkzeug/0.11.15 Python/2.7.10
Date: Thu, 09 Feb 2017 14:29:23 GMT
{
"results": [
"https://web.archive.org/web/20170209142922/http://www.foxnews.com"
]
}
- **Example 6**
To save the web page (www.foxnews.com) in all configured archives through the web service:
.. code-block:: bash
$ curl -i http://0.0.0.0:12345/all/www.foxnews.com
HTTP/1.0 200 OK
Content-Type: application/json
Content-Length: 172
Server: Werkzeug/0.11.15 Python/2.7.10
Date: Thu, 09 Feb 2017 14:33:47 GMT
{
"results": [
"https://web.archive.org/web/20170209143327/http://www.foxnews.com",
"http://archive.is/H2Yfg",
"http://www.webcitation.org/6o9Jubykh",
"Error (The Perma.cc Archive): An API KEY is required"
]
}
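The ``results`` array mixes archived URIs with error strings, as the Perma.cc entry above shows. A minimal sketch of separating the two on the client side, assuming error entries always begin with ``Error`` as they do in this response (the helper name ``split_results`` is hypothetical and not part of archivenow):

```python
import json


def split_results(body):
    """Split a web-service response body into (archived URIs, error messages)."""
    results = json.loads(body).get("results", [])
    # Assumption: error entries start with "Error", as in the response above.
    archived = [r for r in results if not r.startswith("Error")]
    errors = [r for r in results if r.startswith("Error")]
    return archived, errors


body = """{"results": [
    "https://web.archive.org/web/20170209143327/http://www.foxnews.com",
    "http://archive.is/H2Yfg",
    "Error (The Perma.cc Archive): An API KEY is required"
]}"""

archived, errors = split_results(body)
print(archived)
print(errors)
```

Note that the service answers ``200 OK`` even when one archive fails, so the status code alone does not indicate whether every push succeeded.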
- **Example 7**
Because Perma.cc requires an API key, the HTTP request should be as follows:
.. code-block:: bash
$ curl -i http://0.0.0.0:12345/all/www.foxnews.com?cc_api_key=$YOUR-Perma-cc-API-KEY
Or, to use only Perma.cc:
.. code-block:: bash
$ curl -i http://0.0.0.0:12345/cc/www.foxnews.com?cc_api_key=$YOUR-Perma-cc-API-KEY
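The request URLs in Examples 5 to 7 all follow the pattern ``http://<host>:<port>/<archive>/<URI>``, with an optional ``cc_api_key`` query parameter. A small sketch of building such a URL from Python, assuming that pattern holds for every archive identifier (``service_url`` is a hypothetical helper, and the default host and port are the ones printed by ``archivenow --server`` above):

```python
from urllib.parse import urlencode


def service_url(uri, archive="ia", host="0.0.0.0", port=12345, cc_api_key=None):
    """Build a request URL of the form http://host:port/<archive>/<uri>."""
    url = "http://{}:{}/{}/{}".format(host, port, archive, uri)
    if cc_api_key is not None:
        # The Perma.cc key travels as a query parameter, as in Example 7.
        url += "?" + urlencode({"cc_api_key": cc_api_key})
    return url


print(service_url("www.foxnews.com"))
print(service_url("www.foxnews.com", "all", cc_api_key="MY-KEY"))
```

The resulting string can be passed to any HTTP client; the sketch itself sends no request.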
Running as a Docker Container
-----------------------------
.. code-block:: bash
$ docker pull maturban/archivenow
Different ways to run archivenow:
.. code-block:: bash
$ docker run -it --rm maturban/archivenow -h
$ docker run -p 80:12345 -it --rm maturban/archivenow --server
$ docker run -p 80:11111 -it --rm maturban/archivenow --server --port 11111
$ docker run -it --rm maturban/archivenow --ia http://www.cnn.com
Python Usage
------------
.. code-block:: python
>>> from archivenow import archivenow
- **Example 8**
To save the web page (www.foxnews.com) in The WebCite Archive:
.. code-block:: python
>>> archivenow.push("www.foxnews.com","wc")
['http://www.webcitation.org/6o9LTiDz3']
- **Example 9**
To save the web page (www.foxnews.com) in all configured archives:
.. code-block:: python
>>> archivenow.push("www.foxnews.com","all")
['https://web.archive.org/web/20170209145930/http://www.foxnews.com','http://archive.is/oAjuM','http://www.webcitation.org/6o9LcQoVV','Error (The Perma.cc Archive): An API KEY is required']
- **Example 10**
To save the web page (www.foxnews.com) in Perma.cc:
.. code-block:: python
>>> archivenow.push("www.foxnews.com","cc","cc_api_key=$YOUR-Perma-cc-API-KEY")
['https://perma.cc/8YYC-C7RM']
- **Example 11**
To start the server from Python, do the following. The host and port number can be passed as arguments (e.g., start(port=1111, host='localhost')):
.. code-block:: python
>>> archivenow.start()
2017-02-09 15:02:37
Running on http://0.0.0.0:12345
(Press CTRL+C to quit)
Configuring a new archive or removing an existing one
-----------------------------------------------------
Adding a new archive is as simple as adding a handler file to the folder "handlers". For example, to add a new archive named "My Archive", create a file "ma_handler.py" and store it in the folder "handlers". The "ma" will be the archive identifier, so to push a web page (e.g., www.cnn.com) to this archive from Python code, write ``archivenow.push("www.cnn.com","ma")``. In the file "ma_handler.py", the name of the class must be "MA_handler". This class must have at least one function called "push", which takes one argument. It might be helpful to see how the other "\*_handler.py" files are organized.
Removing an archive can be done in one of the following ways:
- Removing the archive handler file from the folder "handlers"
- Renaming the archive handler file to a name that does not end with "_handler.py"
- Setting the variable "enabled" to "False" inside the handler file
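The handler contract described in this section can be sketched as follows. This is only an illustration under stated assumptions: the class name, the ``push`` function, and the ``enabled`` variable come from the description above, while storing ``enabled`` as an instance attribute, the ``name`` attribute, and the ``myarchive.example`` address are hypothetical. The shipped "\*_handler.py" files remain the authoritative reference.

```python
# ma_handler.py -- hypothetical handler for an archive called "My Archive".


class MA_handler(object):
    """Handler whose file name prefix "ma" becomes the archive identifier."""

    def __init__(self):
        # Assumption: "enabled" lives here as an attribute; setting it to
        # False would switch the archive off without deleting the file.
        self.enabled = True
        self.name = "My Archive"  # hypothetical display name

    def push(self, uri_org):
        # A real handler would submit uri_org to the archive and return the
        # URI of the new capture. This placeholder only builds a fake one.
        return "http://myarchive.example/" + uri_org


handler = MA_handler()
print(handler.push("www.cnn.com"))
```

Once such a file is placed in the folder "handlers", the archive would be addressed by the identifier "ma".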
Notes
-----
The Internet Archive (IA) sets a time gap of at least two minutes between creating different copies of the 'same' resource. For example, if you send a request to the IA to capture (www.cnn.com) at 10:00pm, the IA will create a new copy (let's call it C1) of the CNN homepage. The IA will then return C1 for all requests to archive the CNN homepage received before 10:02pm. The Archive Today sets this time gap to five minutes.
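Because a repeated request inside the time gap only returns the existing copy, a caller may prefer to skip it. A minimal client-side sketch, assuming the two-minute and five-minute gaps quoted above (``PushThrottle`` and ``MIN_GAP`` are hypothetical names, not part of archivenow):

```python
import time

# Minimum gaps in seconds between new copies, taken from the note above.
MIN_GAP = {"ia": 2 * 60, "is": 5 * 60}


class PushThrottle(object):
    """Remember when each (archive, URI) pair was last pushed."""

    def __init__(self, clock=time.time):
        self._clock = clock  # injectable so the logic can be tested
        self._last = {}      # (archive, uri) -> time of the last push

    def should_push(self, archive, uri):
        """Return True if a push now would fall outside the archive's gap."""
        now = self._clock()
        key = (archive, uri)
        last = self._last.get(key)
        if last is not None and now - last < MIN_GAP.get(archive, 0):
            return False  # the archive would return the existing copy
        self._last[key] = now
        return True


throttle = PushThrottle()
print(throttle.should_push("ia", "www.cnn.com"))  # first request
print(throttle.should_push("ia", "www.cnn.com"))  # immediate repeat
```

A caller would check ``should_push`` before calling ``archivenow.push`` with the same archive identifier and URI.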