How can I detect whether a string contains HTML (it can be HTML4, HTML5, or just fragments of HTML within text)? I do not need to know the HTML version, only whether the string is plain text or contains HTML. The text is typically multiline and may include empty lines.

Update:

Example inputs:

html:

<head><title>I'm title</title></head>
Hello, <b>world</b>

non-html:

<ht fldf d><
<html><head> head <body></body> html

4 solutions

#1

You can use an HTML parser such as BeautifulSoup. Note that it tries its best to parse HTML, even broken HTML, and how lenient it is depends on the underlying parser:

>>> from bs4 import BeautifulSoup
>>> html = """<html>
... <head><title>I'm title</title></head>
... </html>"""
>>> non_html = "This is not an html"
>>> bool(BeautifulSoup(html, "html.parser").find())
True
>>> bool(BeautifulSoup(non_html, "html.parser").find())
False

This basically tries to find any HTML element inside the string. If one is found, the result is True.

Another example with an HTML fragment:

>>> html = "Hello, <b>world</b>"
>>> bool(BeautifulSoup(html, "html.parser").find())
True

Alternatively, you can use lxml.html:

>>> import lxml.html
>>> html = 'Hello, <b>world</b>'
>>> non_html = "<ht fldf d><"
>>> lxml.html.fromstring(html).find('.//*') is not None
True
>>> lxml.html.fromstring(non_html).find('.//*') is not None
False
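
If installing a third-party parser is not an option, the same "look for any element" idea can be sketched with html.parser from the standard library. This is only a rough sketch, not part of the approaches above: the names contains_html, _TagFinder and KNOWN_TAGS are made up for illustration, and the whitelist of tag names is an assumption you would extend for your own data. Restricting the check to known tag names is what lets it reject junk such as "<ht fldf d><", which a lenient parser would otherwise accept as an element.

```python
from html.parser import HTMLParser

# Hypothetical whitelist for this sketch; extend it with the tags you expect.
KNOWN_TAGS = {
    "html", "head", "title", "body", "p", "b", "i", "a", "div", "span",
    "br", "ul", "ol", "li", "table", "tr", "td", "h1", "h2", "h3", "img",
}


class _TagFinder(HTMLParser):
    """Remembers whether at least one known start tag was seen."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names before calling this method.
        if tag in KNOWN_TAGS:
            self.found = True


def contains_html(text):
    """Return True if text contains at least one start tag from KNOWN_TAGS."""
    finder = _TagFinder()
    finder.feed(text)
    finder.close()  # flush any buffered trailing input such as a lone "<"
    return finder.found


print(contains_html("Hello, <b>world</b>"))
print(contains_html("This is not an html"))
print(contains_html("<ht fldf d><"))
```

Note that this check only asks whether a recognizable tag is present, not whether the markup is well formed, so the second non-html example from the question (which contains real html, head and body tags) would still count as HTML here.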
