some extensions for bleach
Project description
bleach_extras is a package of unofficial "extras" and utilities designed for use with the bleach library.
The first utility is TagTreeFilter, which is used by clean_strip_content and cleaner_factory__strip_content.
TagTreeFilter, clean_strip_content, cleaner_factory__strip_content
clean_strip_content is a counterpart to bleach.clean; the only intended difference
is that it can strip a tag's entire content tree, not just the
tag node itself. cleaner_factory__strip_content is a factory function that creates
configured bleach.Cleaner instances.
bleach has a strip flag that toggles how "unsafe" tags are handled:
strip=False renders the tags as escaped HTML entities, as in this replacement:
- foo.<div>1<script>alert("ur komputer hs VIRUS! Giv me ur BITCOIN in 24 hours! Wallet is: abdefg!");</script>2</div>.bar
+ foo.<div>1&lt;script&gt;alert("ur komputer hs VIRUS! Giv me ur BITCOIN in 24 hours! Wallet is: abdefg!");&lt;/script&gt;2</div>.bar
strip=True removes the tags but leaves their content in place as plain text:
- foo.<div>1<script>alert("ur komputer hs VIRUS! Giv me ur BITCOIN in 24 hours! Wallet is: abdefg!");</script>2</div>.bar
+ foo.<div>1alert("ur komputer hs VIRUS! Giv me ur BITCOIN in 24 hours! Wallet is: abdefg!");2</div>.bar
Many users of bleach want to remove both the tag and contents of unsafe tags for a variety of reasons, such as:
- escaping the tags makes the text safe, but unreadable
- leaving the tags' content in place without the tags hurts readability and comprehension
- leaving the tags' content in place still gives a malicious user a fallback payload that is displayed to readers
clean_strip_content is a function that mimics bleach.clean, with a key difference in how it works:
- tags destined for content stripping are fed into a Cleaner instance as allowed
- the tags are stripped during the filter process via TagTreeFilter
The expected transformation looks like this:
- foo.<div>1<script>alert("ur komputer hs VIRUS! Giv me ur BITCOIN in 24 hours! Wallet is: abdefg!");</script>2</div>.bar
+ foo.<div>12</div>.bar
Look at that! All of the evil payload is gone, including the bitcoin wallet address that f---- spammers tried to slip through.
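The three behaviours described above (escape, strip the tag, strip the tag and its content) can be compared side by side in a dependency-free sketch. This is not bleach and not how bleach_extras is implemented: SketchCleaner, sketch_clean and the mode parameter are hypothetical names built on the standard library's html.parser, and the toy drops attributes and assumes well-formed input without void elements inside a stripped subtree.

```python
from html import escape
from html.parser import HTMLParser


class SketchCleaner(HTMLParser):
    """Toy cleaner for illustration only; hypothetical, not part of bleach."""

    def __init__(self, allowed, mode):
        super().__init__(convert_charrefs=True)
        self.allowed = set(allowed)
        self.mode = mode  # "escape", "strip", or "strip_content" (assumed names)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a subtree being dropped

    def handle_starttag(self, tag, attrs):
        if self.skip_depth:
            self.skip_depth += 1  # nested tag inside a dropped subtree
        elif tag in self.allowed:
            self.out.append("<%s>" % tag)  # attributes are ignored in this toy
        elif self.mode == "escape":
            self.out.append(escape("<%s>" % tag))
        elif self.mode == "strip_content":
            self.skip_depth = 1  # start dropping the whole subtree
        # mode == "strip": drop the tag itself, keep its children

    def handle_endtag(self, tag):
        if self.skip_depth:
            self.skip_depth -= 1
        elif tag in self.allowed:
            self.out.append("</%s>" % tag)
        elif self.mode == "escape":
            self.out.append(escape("</%s>" % tag))

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def sketch_clean(text, allowed=("div",), mode="escape"):
    parser = SketchCleaner(allowed, mode)
    parser.feed(text)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)


SAMPLE = 'foo.<div>1<script>alert("x");</script>2</div>.bar'
print(sketch_clean(SAMPLE, mode="escape"))
# foo.<div>1&lt;script&gt;alert("x");&lt;/script&gt;2</div>.bar
print(sketch_clean(SAMPLE, mode="strip"))
# foo.<div>1alert("x");2</div>.bar
print(sketch_clean(SAMPLE, mode="strip_content"))
# foo.<div>12</div>.bar
```

The depth counter is the only state needed to drop a subtree: everything between the opening tag and its matching close is skipped.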
Why do this filtering with bleach and not something else?
Parsing and tokenizing HTML is expensive. Filtering outside of bleach would mean parsing and tokenizing each HTML fragment at least twice.
bleach's design encodes or strips "unsafe" tags during parsing/tokenizing, before the plugin filtering process starts. To filter the tags out correctly, they must be allowed while the DOM tree is generated and then removed during the filter step. This trips up a lot of people; offering it in a public library with a test suite that can grow is ideal.
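The ordering constraint described above can be sketched with a plain token stream. The token dictionaries below are modeled loosely on the start tag, end tag, and character tokens that html5lib-style filters consume; that shape is an assumption for illustration, and strip_content_filter and render are hypothetical helpers, not TagTreeFilter's actual code.

```python
from html import escape


def strip_content_filter(tokens, tags_strip_content=("script", "style")):
    """Drop every listed tag together with its whole subtree (sketch)."""
    depth = 0  # > 0 while inside a subtree that is being dropped
    for token in tokens:
        kind = token["type"]
        if depth:
            if kind == "StartTag":
                depth += 1
            elif kind == "EndTag":
                depth -= 1
            continue
        if kind == "StartTag" and token["name"] in tags_strip_content:
            depth = 1
            continue
        yield token


def render(tokens):
    """Minimal serializer: text is escaped, tags are written as-is."""
    parts = []
    for token in tokens:
        if token["type"] == "StartTag":
            parts.append("<%s>" % token["name"])
        elif token["type"] == "EndTag":
            parts.append("</%s>" % token["name"])
        elif token["type"] == "Characters":
            parts.append(escape(token["data"], quote=False))
    return "".join(parts)


def chars(data):
    return {"type": "Characters", "data": data}


def start(name):
    return {"type": "StartTag", "name": name}


def end(name):
    return {"type": "EndTag", "name": name}


# Case 1: <script> was allowed while the tree was built, so the filter sees real tags.
allowed_at_parse = [chars("foo."), start("div"), chars("1"), start("script"),
                    chars("alert(1)"), end("script"), chars("2"), end("div"), chars(".bar")]

# Case 2: <script> was already turned into text before filtering; nothing left to match.
escaped_at_parse = [chars("foo."), start("div"), chars("1"), chars("<script>"),
                    chars("alert(1)"), chars("</script>"), chars("2"), end("div"), chars(".bar")]

kept = render(strip_content_filter(allowed_at_parse))
too_late = render(strip_content_filter(escaped_at_parse))
print(kept)      # foo.<div>12</div>.bar
print(too_late)  # foo.<div>1&lt;script&gt;alert(1)&lt;/script&gt;2</div>.bar
```

In the second case the filter has no tag tokens to act on, which is why the tags have to pass the earlier step as allowed.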
Example:
import bleach
import bleach_extras

dangerous = """foo.<div>1<script>alert("ur komputer hs VIRUS! Giv me ur BITCOIN in 24 hours! Wallet is: abdefg!");</script>2</div>.bar"""
print(bleach.clean(dangerous, tags=['div', ], strip=False))
# foo.<div>1&lt;script&gt;alert("ur komputer hs VIRUS! Giv me ur BITCOIN in 24 hours! Wallet is: abdefg!");&lt;/script&gt;2</div>.bar
print(bleach.clean(dangerous, tags=['div', ], strip=True))
# foo.<div>1alert("ur komputer hs VIRUS! Giv me ur BITCOIN in 24 hours! Wallet is: abdefg!");2</div>.bar
print(bleach_extras.clean_strip_content(dangerous, tags=['div'], ))
# foo.<div>12</div>.bar
cleaner = bleach_extras.cleaner_factory__strip_content(tags=['div'],)
print(cleaner.clean(dangerous))
# foo.<div>12</div>.bar
print(bleach_extras.clean_strip_content(dangerous, tags=['div', ], strip=True, ))
# foo.<div>12</div>.bar
Custom replacement of stripped nodes
Maybe you need to replace the evil content with a warning. This "extra" has you covered!
dangerous2 = """foo.<div>1<script>alert("ur komputer hs VIRUS! Giv me ur BITCOIN in 24 hours! Wallet is: abdefg!");<iframe>iiffrraammee</iframe></script>2</div>.bar"""
class IFrameFilter2(bleach_extras.TagTreeFilter):
    # tags whose entire subtree is removed
    tags_strip_content = ('script', 'style', 'iframe')
    # text inserted in place of each removed subtree
    tag_replace_string = "<unsafe garbage/>"

print(bleach_extras.clean_strip_content(dangerous2, tags=['div', ], filters=[IFrameFilter2, ]))
# foo.<div>1&lt;unsafe garbage/&gt;2</div>.bar
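The replacement shows up escaped in the output above, which is consistent with the string being inserted as text rather than as markup. A minimal sketch of that idea follows, assuming a simplified token stream; replace_subtrees and render are hypothetical helpers, only the names tags_strip_content and tag_replace_string come from the example above, and this is not the library's implementation.

```python
from html import escape


def replace_subtrees(tokens, tags_strip_content, tag_replace_string=None):
    """Drop listed subtrees; optionally emit one text token in their place."""
    depth = 0
    for token in tokens:
        if depth:
            # Track nesting so the matching end tag closes the dropped subtree.
            depth += {"StartTag": 1, "EndTag": -1}.get(token["type"], 0)
            continue
        if token["type"] == "StartTag" and token["name"] in tags_strip_content:
            depth = 1
            if tag_replace_string is not None:
                # Inserted as character data, so the serializer will escape it.
                yield {"type": "Characters", "data": tag_replace_string}
            continue
        yield token


def render(tokens):
    parts = []
    for token in tokens:
        if token["type"] == "StartTag":
            parts.append("<%s>" % token["name"])
        elif token["type"] == "EndTag":
            parts.append("</%s>" % token["name"])
        else:  # Characters
            parts.append(escape(token["data"], quote=False))
    return "".join(parts)


tokens = [
    {"type": "Characters", "data": "foo."},
    {"type": "StartTag", "name": "div"},
    {"type": "Characters", "data": "1"},
    {"type": "StartTag", "name": "script"},
    {"type": "Characters", "data": "alert(1)"},
    {"type": "EndTag", "name": "script"},
    {"type": "Characters", "data": "2"},
    {"type": "EndTag", "name": "div"},
    {"type": "Characters", "data": ".bar"},
]

replaced = render(replace_subtrees(tokens, ("script", "style", "iframe"),
                                   tag_replace_string="<unsafe garbage/>"))
print(replaced)  # foo.<div>1&lt;unsafe garbage/&gt;2</div>.bar
```

Only one replacement is emitted per top-level stripped subtree, because anything nested inside it is skipped before it can match.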
File details
Details for the file bleach_extras-0.1.1.tar.gz.
File metadata
- Download URL: bleach_extras-0.1.1.tar.gz
- Upload date:
- Size: 5.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/47.1.0 requests-toolbelt/0.9.1 tqdm/4.50.2 CPython/3.8.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6c71906ce0673ac4d5f2f5f28d089a8a1f5a9452dfe136ac4d57ed99307f0064 |
| MD5 | f5d243a211552f6ccc72bc5bc9c8254e |
| BLAKE2b-256 | ff113b983e218a037feaf2d9c21f8b8269875338333efa7722c979242468500b |