Beautiful Soup is a Python library for extracting data from HTML and XML files. Simply put, it parses an HTML document into a tree structure so that the attributes of any specified tag can be retrieved easily. In this respect it is similar to lxml.
Beautiful Soup installation
Beautiful Soup 3 is no longer being developed, so Beautiful Soup 4 is recommended for current projects. It can be installed with pip:
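pip install beautifulsoup4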
Beautiful Soup’s parsers
If you just want to parse an HTML document, simply create a BeautifulSoup object from the document. Beautiful Soup automatically chooses a parser, but you can also specify one explicitly. The first parameter to BeautifulSoup is the document string or file handle to be parsed, and the second parameter tells it how to parse the document. If the second parameter is omitted, Beautiful Soup selects a parser based on the libraries currently installed, in the following order of preference: lxml, html5lib, then the Python standard library.
The second parameter can steer this choice in two ways:
- by document type: currently "html", "xml", and "html5" are supported
- by parser name: currently "lxml", "html5lib", and "html.parser" (the standard library parser) are supported
If the specified parser is not installed, Beautiful Soup automatically falls back to another one. Note that only lxml currently supports parsing XML documents, so if lxml is not installed an XML document cannot be parsed, no matter which parser you specify.
To install a parser:
- lxml: pip install lxml (installation on Windows can occasionally cause problems)
- html5lib: pip install html5lib
Here it is recommended to use lxml as the parser because it is the most efficient.
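A minimal sketch of specifying the parser explicitly (the HTML strings are only illustrations):

```python
from bs4 import BeautifulSoup

html = "<html><body><p>Hello</p></body></html>"

soup_auto = BeautifulSoup(html)                 # parser chosen automatically, with a warning
soup_lxml = BeautifulSoup(html, "lxml")         # HTML parsed with lxml (recommended)
soup_std  = BeautifulSoup(html, "html.parser")  # Python standard library parser
soup_xml  = BeautifulSoup("<doc><item>1</item></doc>", "xml")  # XML parsing requires lxml
```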
Use of Beautiful Soup
Creating Beautiful Soup objects
The second parameter of the BeautifulSoup constructor is the document parser. If it is not passed, BeautifulSoup chooses the most appropriate parser on its own but prints a warning. The object can also be initialized from a file handle: for example, save the HTML source code locally as reo.html in the same directory and then pass an open handle to that file.
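A minimal sketch of both ways of creating the object; reo.html is the file name used above and is assumed to exist locally, and lxml is assumed to be installed:

```python
from bs4 import BeautifulSoup

# From a string
soup = BeautifulSoup("<html><body><p>Hello</p></body></html>", "lxml")

# From a file handle (assumes ./reo.html exists)
with open("reo.html", encoding="utf-8") as fp:
    soup_from_file = BeautifulSoup(fp, "lxml")
```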
Beautiful Soup turns a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.
Tag
What is a Tag? In layman’s terms, it’s a tag in HTML. Here’s how to use Beautiful Soup to easily get Tags.
We can get any of these tags simply by writing soup followed by the tag name, for example soup.title or soup.a. Note that this returns only the first matching tag in the whole document; querying all matching tags is introduced later.
A Tag has two important properties: name and attrs.
- name: the soup object itself is special and its name is [document]; for every other tag, the value is the tag's own name.
- attrs: a Tag can have any number of attributes, which are handled like a dictionary of keys and values, and each value can be read directly.
Tag attributes can be added, modified, and deleted (e.g. del soup.b['class']), just as with a dictionary. If an attribute can hold several values (such as class), a list of values is returned.
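A short sketch of name and attrs, using a throwaway snippet:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="bold text" id="b1">Hi</b>', "html.parser")
tag = soup.b

print(soup.name)      # '[document]'
print(tag.name)       # 'b'
print(tag.attrs)      # {'class': ['bold', 'text'], 'id': 'b1'}
print(tag["class"])   # ['bold', 'text'] -- class is multi-valued, so a list is returned

tag["id"] = "first"   # modify an attribute
del tag["class"]      # delete an attribute
print(tag)            # <b id="first">Hi</b>
```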
NavigableString
Now that we have the tag, the question arises: what if we want the text inside it? That is simple: just use .string. The string inside a Tag is a NavigableString object.
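For example (a sketch with illustrative markup):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p class='story'>Once upon a time</p>", "html.parser")

print(soup.p.string)         # Once upon a time
print(type(soup.p.string))   # <class 'bs4.element.NavigableString'>

plain = str(soup.p.string)   # convert to an ordinary string before using it elsewhere
```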
If you want to use the string outside of Beautiful Soup, it is best to convert it to an ordinary Unicode string first, e.g. unicode(tag.string) in Python 2 or str(tag.string) in Python 3. A Tag can contain other tags or strings, but a NavigableString cannot contain anything, so it does not support .contents, .string, or find(), and it supports only part of the APIs for traversing and searching the document tree.
Comment
The Comment object is a special type of NavigableString. When it is printed, the output does not include the comment markers, which can cause unexpected trouble for text processing if we do not handle it carefully.
Consider an <a> tag whose content is actually a comment. If we use .string to output its content, we find that the comment markers have been stripped, which may cause unnecessary trouble. If we also print the value's type, we find that it is a Comment, so it is best to check the type before using the value: first determine whether it is a Comment, and only then perform other operations such as printing it.
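A minimal sketch of that check, with an illustrative snippet whose <a> tag contains only a comment:

```python
from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup('<a class="sister"><!-- Elsie --></a>', "html.parser")
content = soup.a.string

print(content)               #  Elsie   (the <!-- --> markers are stripped)
print(type(content))         # <class 'bs4.element.Comment'>

# Check the type before treating the value as ordinary text
if not isinstance(content, Comment):
    print(content)
```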
Traversing the document tree
A BeautifulSoup object can be regarded as a tree with many nodes. Relative to its own position, a node has child nodes, a parent node, and sibling nodes.
child nodes
A Tag may contain other Tags as well as strings, and these are all children of that Tag. A NavigableString has no children.
Direct child nodes
A tag's .contents attribute returns the tag's direct children as a list, while .children does not return a list but a generator that can be iterated over to get the same direct children.
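A quick sketch of the two attributes, using throwaway markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<body><p>one</p><p>two</p></body>", "html.parser")
body = soup.body

print(body.contents)         # [<p>one</p>, <p>two</p>] -- a list of direct children

for child in body.children:  # a generator over the same direct children
    print(child)
```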
Getting a specific tag was covered above: pointing at a tag name as an attribute only returns the first tag with that name in the document. To get all of them you need the methods for searching the document tree, such as find_all(), shown briefly below.
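For example:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p><a href="/a">one</a><a href="/b">two</a></p>', "html.parser")

print(soup.a)                # only the first <a> tag
print(soup.find_all("a"))    # every <a> tag in the document
```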
To summarize: a tag's .contents property returns all direct children as a list, and its .children generator iterates over the same direct children. Both only cover direct children; .descendants iterates over all of a tag's descendants. If a tag has exactly one NavigableString child, .string retrieves it; if it contains several strings, .strings iterates over them, and .stripped_strings does the same while removing extra whitespace and blank lines.
All child nodes
The .contents and .children properties contain only the direct children of a tag, whereas .descendants recursively iterates over all of its descendants. As with .children, we need to loop over it to get the contents.
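For example:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p><b>text</b></p></div>", "html.parser")

for node in soup.div.descendants:   # <p>, <b>, and the string inside, recursively
    print(node)
```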
If a tag has only one child and that child is a NavigableString, the tag can use .string to get it. If a tag has only one child tag, .string also works and returns the same result as calling .string on that unique child. In plain terms: if there is no tag inside a tag, .string returns its text; if there is exactly one tag inside, .string returns the innermost content. If the tag contains more than one child, .string cannot decide which child it should refer to and returns None.
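A minimal sketch of these cases, using throwaway markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<head><title>The story</title></head>", "html.parser")
print(soup.title.string)   # The story  -- a single string child
print(soup.head.string)    # The story  -- head has only one child (title), so it delegates

soup2 = BeautifulSoup("<p>one<b>two</b></p>", "html.parser")
print(soup2.p.string)      # None -- more than one child, so .string cannot decide
```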
To get more than one string you need to iterate with .strings, as in the sketch below.
The output of .strings may contain many spaces or blank lines; use .stripped_strings to remove the extra whitespace.
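A sketch covering both iterators:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div>\n  <p> one </p>\n  <p> two </p>\n</div>", "html.parser")

print(list(soup.strings))           # includes the whitespace-only strings between tags
print(list(soup.stripped_strings))  # ['one', 'two'] -- whitespace stripped, blanks dropped
```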
parent node
All Parent Nodes
All parent nodes of an element can be obtained recursively through the element's .parents attribute; iterating over it walks up through every enclosing tag to the top of the document.
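For example:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><p><b>text</b></p></body></html>", "html.parser")

for parent in soup.b.parents:
    print(parent.name)   # p, body, html, [document]
```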
sibling node
The .next_sibling attribute returns the node's next sibling and .previous_sibling returns the previous one; if the sibling does not exist, None is returned.
Note: in a real document the .next_sibling or .previous_sibling of a tag is often a string, because whitespace or a newline between tags is also treated as a node, so the result may simply be whitespace or a newline.
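A short sketch (in markup with newlines between tags, the sibling would instead be a whitespace string):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>one</b><i>two</i></p>", "html.parser")

print(soup.b.next_sibling)       # <i>two</i>
print(soup.b.previous_sibling)   # None -- <b> is the first child, so there is no sibling
```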
All Siblings
The .next_siblings and .previous_siblings attributes let you iterate over all of the current node's following or preceding siblings.
The .next_element and .previous_element attributes differ from .next_sibling and .previous_sibling in that they are not restricted to siblings: they step to the next or previous parsed object among all nodes, regardless of hierarchy.
all preceding and following nodes
The .next_elements and .previous_elements iterators let you move forward or backward through the document content in the order it was parsed.
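A sketch contrasting the sibling iterators with the element iterators:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>one</b><i>two</i><u>three</u></p>", "html.parser")

print(list(soup.b.next_siblings))        # [<i>two</i>, <u>three</u>]
print(soup.b.next_element)               # 'one' -- the string inside <b>, not a sibling
print(list(soup.b.next_elements)[:3])    # 'one', <i>two</i>, 'two'
```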
The above is the basic usage of traversing the document tree.
search the document tree
find_all( name , attrs , recursive , text , **kwargs )
The find_all() method searches all tag descendants of the current tag and returns those that satisfy the filter conditions.
name argument
The name parameter finds all tags whose name matches; string objects are ignored automatically.
A. Passing a string
The simplest filter is a string: pass a string to a search method and Beautiful Soup looks for tags whose name matches it exactly, for example find_all('b') to find all <b> tags in the document.
B. Passing regular expressions
If you pass a regular expression, Beautiful Soup matches tag names against it with the expression's match() method; for example, a pattern such as ^b finds every tag whose name starts with b, which means both the <body> and <b> tags.
C. Passing a list
If you pass a list, Beautiful Soup returns everything that matches any element of the list; for example, passing ['a', 'b'] finds all <a> tags and <b> tags in the document. The three filters are shown together in the sketch below.
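A minimal sketch of the string, regular-expression, and list filters, using throwaway markup:

```python
import re
from bs4 import BeautifulSoup

html = "<html><body><b>bold</b><a href='/x'>link</a></body></html>"
soup = BeautifulSoup(html, "html.parser")

print(soup.find_all("b"))               # exact string: all <b> tags
print(soup.find_all(re.compile("^b")))  # regex: tag names starting with b -> <body>, <b>
print(soup.find_all(["a", "b"]))        # list: all <a> and <b> tags
```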
D. Passing True
True matches any tag: find_all(True) returns every tag in the document but none of the string nodes.
E. Passing a method
If none of the other filters fits, you can define a function that takes a single tag as its only argument and returns True if the tag matches or False if it does not. The following function checks the current tag and returns True if it has a class attribute but no id attribute.
Passing this function as an argument to find_all() then returns all the matching <p> tags; both the True filter and the function filter are shown in the sketch below.
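A minimal sketch of both filters, using throwaway markup:

```python
from bs4 import BeautifulSoup

html = '<p class="title"><b id="b1">hi</b></p><p class="story">there</p>'
soup = BeautifulSoup(html, "html.parser")

print(soup.find_all(True))   # every tag in the document, but no string nodes

def has_class_but_no_id(tag):
    """Return True for tags that define class but not id."""
    return tag.has_attr("class") and not tag.has_attr("id")

print(soup.find_all(has_class_but_no_id))   # both <p> tags, but not <b id="b1">
```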
keyword parameter
Note: if a keyword argument's name is not one of the built-in parameter names, the search treats it as a filter on the tag attribute with that name.
For example, if an href argument is passed, Beautiful Soup filters against each tag's href attribute.
Several attributes of a tag can be filtered at the same time by passing several keyword arguments.
Here we want to filter by class, but class is a Python keyword, so what do we do? Just append an underscore and write class_ instead; this and the other keyword filters are shown in the sketch below.
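A short sketch of keyword filtering (the markup and URL are only illustrative):

```python
import re
from bs4 import BeautifulSoup

html = '<a id="link1" class="sister" href="http://example.com/elsie">Elsie</a>'
soup = BeautifulSoup(html, "html.parser")

print(soup.find_all(id="link1"))                            # filter on the id attribute
print(soup.find_all(href=re.compile("elsie")))              # regex against href
print(soup.find_all(href=re.compile("elsie"), id="link1"))  # several attributes at once
print(soup.find_all("a", class_="sister"))                  # class_ because class is a keyword
```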
Some tag attributes cannot be used as keyword arguments in a search, for example the data-* attributes of HTML5.
However, tags with such attributes can still be searched by passing a dictionary to the attrs parameter of find_all().
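For example:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div data-foo="value">bar</div>', "html.parser")

# soup.find_all(data-foo="value") would be a SyntaxError, so use attrs instead
print(soup.find_all(attrs={"data-foo": "value"}))   # [<div data-foo="value">bar</div>]
```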
text parameter
The text parameter searches for string content in the document. Like name, it accepts a string, a regular expression, a list, or True.
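A short sketch (newer Beautiful Soup releases also accept string= for the same purpose):

```python
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Elsie</p><p>Lacie</p><p>Tillie</p>", "html.parser")

print(soup.find_all(text="Elsie"))             # ['Elsie']
print(soup.find_all(text=["Elsie", "Lacie"]))  # ['Elsie', 'Lacie']
print(soup.find_all(text=re.compile("ie$")))   # all strings ending in 'ie'
```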
limit parameter
find_all() returns the entire result set, which can be slow if the document tree is large. If we do not need all the results, the limit parameter caps the number of results returned; much like the LIMIT keyword in SQL, the search stops as soon as the limit is reached.
For example, the document tree may contain three tags that match the search, but only two results are returned because we limited the count.
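A minimal sketch of that behavior:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<a>1</a><a>2</a><a>3</a>", "html.parser")

print(soup.find_all("a", limit=2))   # stops after two results: [<a>1</a>, <a>2</a>]
```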
recursive parameter
When find_all() is called on a tag, Beautiful Soup examines all of its descendants. If you only want to search the tag's direct children, pass recursive=False.
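For example:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><head><title>t</title></head></html>", "html.parser")

print(soup.html.find_all("title"))                   # [<title>t</title>] -- searches descendants
print(soup.html.find_all("title", recursive=False))  # [] -- title is not a direct child of html
```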
find( name , attrs , recursive , text , **kwargs )
The find() method is almost equivalent to find_all(limit=1): the only difference is that find_all() returns a list containing the single result, while find() returns the matching object directly (or None when nothing matches).
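For example:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>one</p><p>two</p>", "html.parser")

print(soup.find_all("p", limit=1))   # [<p>one</p>] -- a list
print(soup.find("p"))                # <p>one</p>   -- the tag itself
print(soup.find("table"))            # None         -- no match
```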
In addition to find() and find_all() there are a number of search methods.
- find_parent()
- find_next_sibling()
- find_previous_sibling()
Each of the three methods above also has a plural form ending in 's' (e.g. find_parents()) that returns all matches instead of the first one.
- find_next()
- find_previous()
- find_all_next()
- find_all_previous()
find_parents() and find_parent()
find_all() and find() only search among the descendants of the current node (its children, grandchildren, and so on). find_parents() and find_parent() search the current node's parent nodes instead; the filters work in exactly the same way as for ordinary tag searches.
find_next_siblings() and find_next_sibling()
These two methods use the .next_siblings attribute to iterate over the siblings parsed after the current tag: find_next_siblings() returns all the later siblings that match, and find_next_sibling() returns only the first matching later sibling.
find_previous_siblings() and find_previous_sibling()
These two methods use the .previous_siblings attribute to iterate over the siblings parsed before the current tag: find_previous_siblings() returns all the matching preceding siblings, and find_previous_sibling() returns only the first one.
find_all_next() and find_next()
These two methods use the .next_elements attribute to iterate over the tags and strings that come after the current tag: find_all_next() returns all matching nodes, and find_next() returns only the first.
find_all_previous() and find_previous()
These two methods use the .previous_elements attribute to iterate over the tags and strings that come before the current node: find_all_previous() returns all matching nodes, and find_previous() returns only the first.
Note: The usage of the above methods is exactly the same as find_all(), and the principles are similar, so we will not repeat them here.
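A combined sketch of a few of these directional methods, using throwaway markup:

```python
from bs4 import BeautifulSoup

html = "<div><p id='first'>one</p><p>two</p><p>three</p></div>"
soup = BeautifulSoup(html, "html.parser")
first = soup.find("p", id="first")

print(first.find_parent("div"))        # the enclosing <div>
print(first.find_next_sibling("p"))    # <p>two</p>
print(first.find_next_siblings("p"))   # [<p>two</p>, <p>three</p>]
print(first.find_all_next(text=True))  # ['one', 'two', 'three'] -- strings parsed after this tag
```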
CSS selector
When writing CSS, a tag name is written as-is, a class name is prefixed with a dot, and an id is prefixed with #. We can filter elements with the same syntax using soup.select(), which returns a list (a combined sketch is given at the end of this section).
Find by tag name
Find by class name
Find by id name
combination search
A combined search mixes tag names, class names, and id names just as in a CSS file. For example, to find an element with id link1 inside a p tag, separate the two selectors with a space: p #link1.
Direct child tag search
attribute lookup
A lookup can also include attributes, which are written in square brackets. Note that the attribute and the tag belong to the same node, so there must be no space between them, otherwise nothing will be matched.
Attributes can likewise be combined with the selectors above: parts that refer to different nodes are separated by spaces, while parts that refer to the same node are written without a space.
All of these select() calls return a list. You can iterate over it and call get_text() on each element to get its text; get_text(strip=True) removes the whitespace before and after the text.
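A combined sketch of the selector styles above (the markup and URLs are only illustrative):

```python
from bs4 import BeautifulSoup

html = """
<p class="title"><b>The title</b></p>
<p class="story">Once upon a time
  <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
  <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.select("p"))                  # by tag name
print(soup.select(".sister"))            # by class name
print(soup.select("#link1"))             # by id
print(soup.select("p #link1"))           # combined: an element with id link1 inside a <p>
print(soup.select("p > a"))              # direct children only
print(soup.select('a[href="http://example.com/elsie"]'))    # by attribute, no space before [
print(soup.select('p a[href="http://example.com/elsie"]'))  # attribute combined with a tag

for tag in soup.select("a"):
    print(tag.get_text(strip=True))      # text content with surrounding whitespace removed
```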
Encoding issues
Beautiful Soup automatically converts input documents to Unicode and, by default, encodes output documents as UTF-8. The BeautifulSoup object's .original_encoding property shows the result of the automatic encoding detection. This detection is slow and occasionally wrong; installing the chardet library improves it. You can also pass the from_encoding parameter when creating the BeautifulSoup object to tell it the document's encoding directly. Sometimes, during transcoding, special characters are replaced with the Unicode replacement character; you can detect this through the BeautifulSoup object's .contains_replacement_characters property, which is True when such replacements have occurred.
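A minimal sketch (the GB2312 byte string is only an illustration, and the detected encoding depends on which detection libraries are installed):

```python
from bs4 import BeautifulSoup

markup = "<h1>你好</h1>".encode("gb2312")   # a GB2312-encoded byte string

soup = BeautifulSoup(markup, "html.parser")
print(soup.original_encoding)                # whatever encoding Beautiful Soup guessed

# If the guess is slow or wrong, state the encoding yourself
soup = BeautifulSoup(markup, "html.parser", from_encoding="gb2312")
print(soup.h1.string)                        # 你好
print(soup.contains_replacement_characters)  # True if some bytes could not be converted
```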