How to detect with Python whether a string contains HTML code?
How can I detect whether a string contains HTML (it can be HTML4, HTML5, or just fragments of HTML within text)? I do not need the HTML version, only whether the string is plain text or contains HTML. The text is typically multiline and may include empty lines.
Update:
example inputs:
html:
<head><title>I'm title</title></head>
Hello, <b>world</b>
non-html:
<ht fldf d><
<html><head> head <body></body> html
4 solutions
#1
19
You can use an HTML parser, like BeautifulSoup. Note that it tries its best to parse the input, even broken HTML; how lenient it is depends on the underlying parser:
>>> from bs4 import BeautifulSoup
>>> html = """<html>
... <head><title>I'm title</title></head>
... </html>"""
>>> non_html = "This is not an html"
>>> bool(BeautifulSoup(html, "html.parser").find())
True
>>> bool(BeautifulSoup(non_html, "html.parser").find())
False
This tries to find any HTML element inside the string. If one is found, the result is True.
Another example with an HTML fragment:
>>> html = "Hello, <b>world</b>"
>>> bool(BeautifulSoup(html, "html.parser").find())
True
Alternatively, you can use lxml.html:
>>> import lxml.html
>>> html = 'Hello, <b>world</b>'
>>> non_html = "<ht fldf d><"
>>> lxml.html.fromstring(html).find('.//*') is not None
True
>>> lxml.html.fromstring(non_html).find('.//*') is not None
False
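If installing a third-party parser is not an option, a similar check can be sketched with the standard library's html.parser module. This is a different technique from the two above, not a drop-in equivalent. The names KNOWN_TAGS, VOID_TAGS, TagTracker, and contains_html, as well as the strict parameter, are hypothetical and exist only for this illustration. The tag whitelist is a small assumed sample, not a complete list of HTML elements. In strict mode the sketch also requires every opened tag to be closed in order, which is an assumption about what the question means by "non-html" for its unclosed example.

```python
from html.parser import HTMLParser

# Assumption: a small illustrative whitelist, not a complete list of HTML tags.
KNOWN_TAGS = {
    "html", "head", "title", "body", "div", "span", "p", "a", "b", "i",
    "em", "strong", "ul", "ol", "li", "table", "tr", "td", "th",
    "h1", "h2", "h3", "br", "hr", "img",
}
# Elements that never take a closing tag.
VOID_TAGS = {"br", "hr", "img"}


class TagTracker(HTMLParser):
    """Records the start tags seen and whether opened tags were closed in order."""

    def __init__(self):
        super().__init__()
        self.seen = []         # every start tag name (HTMLParser lowercases them)
        self.open_tags = []    # stack of tags still waiting for a closing tag
        self.mismatch = False  # set when a closing tag does not match the stack

    def handle_starttag(self, tag, attrs):
        self.seen.append(tag)
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: record it, nothing to balance.
        self.seen.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            self.mismatch = True


def contains_html(text, strict=True):
    """Return True if text has at least one known tag (and, if strict, balanced tags)."""
    tracker = TagTracker()
    tracker.feed(text)
    tracker.close()
    has_known_tag = any(tag in KNOWN_TAGS for tag in tracker.seen)
    if not strict:
        return has_known_tag
    balanced = not tracker.open_tags and not tracker.mismatch
    return has_known_tag and balanced


print(contains_html("<head><title>I'm title</title></head>\nHello, <b>world</b>"))
print(contains_html("<ht fldf d><"))
print(contains_html("<html><head> head <body></body> html"))
```

With strict=True, both non-html inputs from the question are rejected: the first has no known tag, and the second leaves html and head unclosed. The trade-off is that valid HTML relying on optional closing tags (for example an unclosed p or li) is also rejected; passing strict=False drops the balance requirement.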