Beautiful Soup is a Python library for extracting data from HTML and XML files. Simply put, it parses an HTML document into a tree structure so that the attributes of any specified tag can be retrieved easily. In this respect it is similar to lxml.
Beautiful Soup installation
Beautiful Soup 3 is no longer being developed, so Beautiful Soup 4 is recommended for current projects. It can be installed with pip:
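pip install beautifulsoup4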
Beautiful Soup’s parsers
If you just want to parse an HTML document, simply create a BeautifulSoup object from the document. Beautiful Soup automatically chooses a parser, but you can also specify one explicitly. The first parameter to BeautifulSoup is the document string or file handle to be parsed, and the second parameter tells it how to parse the document. If the second parameter is omitted, Beautiful Soup selects a parser based on the libraries currently installed, in the following order of preference: lxml, html5lib, then the Python standard library.
The second parameter can steer this choice in two ways:
- by document type: currently "html", "xml", and "html5" are supported
- by parser name: currently "lxml", "html5lib", and "html.parser" (the standard library parser) are supported
If the specified parser is not installed, Beautiful Soup automatically falls back to another one. Note that only lxml currently supports parsing XML documents, so if lxml is not installed an XML document cannot be parsed, no matter which parser you specify.
To install a parser:
- lxml: pip install lxml (installation on Windows can occasionally cause problems)
- html5lib: pip install html5lib
Here it is recommended to use lxml as the parser because it is the most efficient.
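A minimal sketch of specifying the parser explicitly (the HTML strings are only illustrations):

```python
from bs4 import BeautifulSoup

html = "<html><body><p>Hello</p></body></html>"

soup_auto = BeautifulSoup(html)                 # parser chosen automatically, with a warning
soup_lxml = BeautifulSoup(html, "lxml")         # HTML parsed with lxml (recommended)
soup_std  = BeautifulSoup(html, "html.parser")  # Python standard library parser
soup_xml  = BeautifulSoup("<doc><item>1</item></doc>", "xml")  # XML parsing requires lxml
```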
Use of Beautiful Soup
Creating Beautiful Soup objects
The second parameter of the BeautifulSoup constructor is the document parser. If it is not passed, BeautifulSoup chooses the most appropriate parser on its own but prints a warning. The object can also be initialized from a file handle: for example, save the HTML source code locally as reo.html in the same directory and then pass an open handle to that file.
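A minimal sketch of both ways of creating the object; reo.html is the file name used above and is assumed to exist locally, and lxml is assumed to be installed:

```python
from bs4 import BeautifulSoup

# From a string
soup = BeautifulSoup("<html><body><p>Hello</p></body></html>", "lxml")

# From a file handle (assumes ./reo.html exists)
with open("reo.html", encoding="utf-8") as fp:
    soup_from_file = BeautifulSoup(fp, "lxml")
```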
Beautiful Soup turns a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.
Tag
What is a Tag? In layman’s terms, it’s a tag in HTML. Here’s how to use Beautiful Soup to easily get Tags.
We can get any of these tags simply by writing soup followed by the tag name, for example soup.title or soup.a. Note that this returns only the first matching tag in the whole document; querying all matching tags is introduced later.
A Tag has two important properties: name and attrs.
- name: the soup object itself is special and its name is [document]; for every other tag, the value is the tag's own name.
- attrs: a Tag can have any number of attributes, which are handled like a dictionary of keys and values, and each value can be read directly.
Tag attributes can be added, modified, and deleted (e.g. del soup.b['class']), just as with a dictionary. If an attribute can hold several values (such as class), a list of values is returned.
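A short sketch of name and attrs, using a throwaway snippet:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="bold text" id="b1">Hi</b>', "html.parser")
tag = soup.b

print(soup.name)      # '[document]'
print(tag.name)       # 'b'
print(tag.attrs)      # {'class': ['bold', 'text'], 'id': 'b1'}
print(tag["class"])   # ['bold', 'text'] -- class is multi-valued, so a list is returned

tag["id"] = "first"   # modify an attribute
del tag["class"]      # delete an attribute
print(tag)            # <b id="first">Hi</b>
```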
NavigableString
Now that we have the tag, the question arises: what if we want the text inside it? That is simple: just use .string. The string inside a Tag is a NavigableString object.
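For example (a sketch with illustrative markup):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p class='story'>Once upon a time</p>", "html.parser")

print(soup.p.string)         # Once upon a time
print(type(soup.p.string))   # <class 'bs4.element.NavigableString'>

plain = str(soup.p.string)   # convert to an ordinary string before using it elsewhere
```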
If you want to use the string outside of Beautiful Soup, it is best to convert it to an ordinary Unicode string first, e.g. unicode(tag.string) in Python 2 or str(tag.string) in Python 3. A Tag can contain other tags or strings, but a NavigableString cannot contain anything, so it does not support .contents, .string, or find(), and it supports only part of the APIs for traversing and searching the document tree.
Comment
The Comment object is a special type of NavigableString. When it is printed, the output does not include the comment markers, which can cause unexpected trouble for text processing if we do not handle it carefully.
Consider an <a> tag whose content is actually a comment. If we use .string to output its content, we find that the comment markers have been stripped, which may cause unnecessary trouble. If we also print the value's type, we find that it is a Comment, so it is best to check the type before using the value: first determine whether it is a Comment, and only then perform other operations such as printing it.
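A minimal sketch of that check, with an illustrative snippet whose <a> tag contains only a comment:

```python
from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup('<a class="sister"><!-- Elsie --></a>', "html.parser")
content = soup.a.string

print(content)               #  Elsie   (the <!-- --> markers are stripped)
print(type(content))         # <class 'bs4.element.Comment'>

# Check the type before treating the value as ordinary text
if not isinstance(content, Comment):
    print(content)
```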
Traversing the document tree
A BeautifulSoup object can be regarded as a tree with many nodes. Relative to its own position, a node has child nodes, a parent node, and sibling nodes.
child nodes
A Tag may contain other Tags as well as strings, and these are all children of that Tag. A NavigableString has no children.
Direct child nodes
A tag's .contents attribute returns the tag's direct children as a list, while .children does not return a list but a generator that can be iterated over to get the same direct children.
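A quick sketch of the two attributes, using throwaway markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<body><p>one</p><p>two</p></body>", "html.parser")
body = soup.body

print(body.contents)         # [<p>one</p>, <p>two</p>] -- a list of direct children

for child in body.children:  # a generator over the same direct children
    print(child)
```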
Getting a specific tag was covered above: pointing at a tag name as an attribute only returns the first tag with that name in the document. To get all of them you need the methods for searching the document tree, such as find_all(), shown briefly below.
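For example:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p><a href="/a">one</a><a href="/b">two</a></p>', "html.parser")

print(soup.a)                # only the first <a> tag
print(soup.find_all("a"))    # every <a> tag in the document
```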
To summarize: a tag's .contents property returns all direct children as a list, and its .children generator iterates over the same direct children. Both only cover direct children; .descendants iterates over all of a tag's descendants. If a tag has exactly one NavigableString child, .string retrieves it; if it contains several strings, .strings iterates over them, and .stripped_strings does the same while removing extra whitespace and blank lines.
All child nodes
The .contents and .children properties contain only the direct children of a tag, whereas .descendants recursively iterates over all of its descendants. As with .children, we need to loop over it to get the contents.
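For example:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p><b>text</b></p></div>", "html.parser")

for node in soup.div.descendants:   # <p>, <b>, and the string inside, recursively
    print(node)
```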
If a tag has only one child and that child is a NavigableString, the tag can use .string to get it. If a tag has only one child tag, .string also works and returns the same result as calling .string on that unique child. In plain terms: if there is no tag inside a tag, .string returns its text; if there is exactly one tag inside, .string returns the innermost content. If the tag contains more than one child, .string cannot decide which child it should refer to and returns None.
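A minimal sketch of these cases, using throwaway markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<head><title>The story</title></head>", "html.parser")
print(soup.title.string)   # The story  -- a single string child
print(soup.head.string)    # The story  -- head has only one child (title), so it delegates

soup2 = BeautifulSoup("<p>one<b>two</b></p>", "html.parser")
print(soup2.p.string)      # None -- more than one child, so .string cannot decide
```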
To get more than one string you need to iterate with .strings, as in the sketch below.
The output of .strings may contain many spaces or blank lines; use .stripped_strings to remove the extra whitespace.
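A sketch covering both iterators:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div>\n  <p> one </p>\n  <p> two </p>\n</div>", "html.parser")

print(list(soup.strings))           # includes the whitespace-only strings between tags
print(list(soup.stripped_strings))  # ['one', 'two'] -- whitespace stripped, blanks dropped
```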
parent node
All Parent Nodes
All parent nodes of an element can be obtained recursively through the element's .parents attribute; iterating over it walks up through every enclosing tag to the top of the document.
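For example:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><p><b>text</b></p></body></html>", "html.parser")

for parent in soup.b.parents:
    print(parent.name)   # p, body, html, [document]
```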
sibling node
The .next_sibling attribute returns the node's next sibling and .previous_sibling returns the previous one; if the sibling does not exist, None is returned.
Note: in a real document the .next_sibling or .previous_sibling of a tag is often a string, because whitespace or a newline between tags is also treated as a node, so the result may simply be whitespace or a newline.
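A short sketch (in markup with newlines between tags, the sibling would instead be a whitespace string):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>one</b><i>two</i></p>", "html.parser")

print(soup.b.next_sibling)       # <i>two</i>
print(soup.b.previous_sibling)   # None -- <b> is the first child, so there is no sibling
```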
All Siblings
The .next_siblings and .previous_siblings attributes let you iterate over all of the current node's following or preceding siblings.
The .next_element and .previous_element attributes differ from .next_sibling and .previous_sibling in that they are not restricted to siblings: they step to the next or previous parsed object among all nodes, regardless of hierarchy.
all preceding and following nodes
The .next_elements and .previous_elements iterators let you move forward or backward through the document content in the order it was parsed.
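A sketch contrasting the sibling iterators with the element iterators:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>one</b><i>two</i><u>three</u></p>", "html.parser")

print(list(soup.b.next_siblings))        # [<i>two</i>, <u>three</u>]
print(soup.b.next_element)               # 'one' -- the string inside <b>, not a sibling
print(list(soup.b.next_elements)[:3])    # 'one', <i>two</i>, 'two'
```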
The above is the basic usage of traversing the document tree.
search the document tree
find_all( name , attrs , recursive , text , **kwargs )
The find_all() method searches all tag descendants of the current tag and returns those that satisfy the filter conditions.
name argument
The name parameter finds all tags whose name matches; string objects are ignored automatically.
A. Passing a string
The simplest filter is a string: pass a string to a search method and Beautiful Soup looks for tags whose name matches it exactly, for example find_all('b') to find all <b> tags in the document.
B. Passing regular expressions
If you pass a regular expression, Beautiful Soup matches tag names against it with the expression's match() method; for example, a pattern such as ^b finds every tag whose name starts with b, which means both the <body> and <b> tags.
C. Passing a list
If you pass a list, Beautiful Soup returns everything that matches any element of the list; for example, passing ['a', 'b'] finds all <a> tags and <b> tags in the document. The three filters are shown together in the sketch below.
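A minimal sketch of the string, regular-expression, and list filters, using throwaway markup:

```python
import re
from bs4 import BeautifulSoup

html = "<html><body><b>bold</b><a href='/x'>link</a></body></html>"
soup = BeautifulSoup(html, "html.parser")

print(soup.find_all("b"))               # exact string: all <b> tags
print(soup.find_all(re.compile("^b")))  # regex: tag names starting with b -> <body>, <b>
print(soup.find_all(["a", "b"]))        # list: all <a> and <b> tags
```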
D. Passing True
True matches any tag: find_all(True) returns every tag in the document but none of the string nodes.
E. Passing a method
If none of the other filters fits, you can define a function that takes a single tag as its only argument and returns True if the tag matches or False if it does not. The following function checks the current tag and returns True if it has a class attribute but no id attribute.
Passing this function as an argument to find_all() then returns all the matching <p> tags; both the True filter and the function filter are shown in the sketch below.
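A minimal sketch of both filters, using throwaway markup:

```python
from bs4 import BeautifulSoup

html = '<p class="title"><b id="b1">hi</b></p><p class="story">there</p>'
soup = BeautifulSoup(html, "html.parser")

print(soup.find_all(True))   # every tag in the document, but no string nodes

def has_class_but_no_id(tag):
    """Return True for tags that define class but not id."""
    return tag.has_attr("class") and not tag.has_attr("id")

print(soup.find_all(has_class_but_no_id))   # both <p> tags, but not <b id="b1">
```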
keyword parameter
Note: if a keyword argument's name is not one of the built-in parameter names, the search treats it as a filter on the tag attribute with that name.
For example, if an href argument is passed, Beautiful Soup filters against each tag's href attribute.
Several attributes of a tag can be filtered at the same time by passing several keyword arguments.
Here we want to filter by class, but class is a Python keyword, so what do we do? Just append an underscore and write class_ instead; this and the other keyword filters are shown in the sketch below.
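A short sketch of keyword filtering (the markup and URL are only illustrative):

```python
import re
from bs4 import BeautifulSoup

html = '<a id="link1" class="sister" href="http://example.com/elsie">Elsie</a>'
soup = BeautifulSoup(html, "html.parser")

print(soup.find_all(id="link1"))                            # filter on the id attribute
print(soup.find_all(href=re.compile("elsie")))              # regex against href
print(soup.find_all(href=re.compile("elsie"), id="link1"))  # several attributes at once
print(soup.find_all("a", class_="sister"))                  # class_ because class is a keyword
```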
Some tag attributes cannot be used as keyword arguments in a search, for example the data-* attributes of HTML5.
However, tags with such attributes can still be searched by passing a dictionary to the attrs parameter of find_all().
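For example:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div data-foo="value">bar</div>', "html.parser")

# soup.find_all(data-foo="value") would be a SyntaxError, so use attrs instead
print(soup.find_all(attrs={"data-foo": "value"}))   # [<div data-foo="value">bar</div>]
```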
text parameter
The text parameter searches for string content in the document. Like name, it accepts a string, a regular expression, a list, or True.
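A short sketch (newer Beautiful Soup releases also accept string= for the same purpose):

```python
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Elsie</p><p>Lacie</p><p>Tillie</p>", "html.parser")

print(soup.find_all(text="Elsie"))             # ['Elsie']
print(soup.find_all(text=["Elsie", "Lacie"]))  # ['Elsie', 'Lacie']
print(soup.find_all(text=re.compile("ie$")))   # all strings ending in 'ie'
```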
limit parameter
find_all() returns the entire result set, which can be slow if the document tree is large. If we do not need all the results, the limit parameter caps the number of results returned; much like the LIMIT keyword in SQL, the search stops as soon as the limit is reached.
For example, the document tree may contain three tags that match the search, but only two results are returned because we limited the count.
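A minimal sketch of that behavior:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<a>1</a><a>2</a><a>3</a>", "html.parser")

print(soup.find_all("a", limit=2))   # stops after two results: [<a>1</a>, <a>2</a>]
```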
recursive parameter
When find_all() is called on a tag, Beautiful Soup examines all of its descendants. If you only want to search the tag's direct children, pass recursive=False.
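For example:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><head><title>t</title></head></html>", "html.parser")

print(soup.html.find_all("title"))                   # [<title>t</title>] -- searches descendants
print(soup.html.find_all("title", recursive=False))  # [] -- title is not a direct child of html
```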
find( name , attrs , recursive , text , **kwargs )
The find() method is almost equivalent to find_all(limit=1): the only difference is that find_all() returns a list containing the single result, while find() returns the matching object directly (or None when nothing matches).
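For example:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>one</p><p>two</p>", "html.parser")

print(soup.find_all("p", limit=1))   # [<p>one</p>] -- a list
print(soup.find("p"))                # <p>one</p>   -- the tag itself
print(soup.find("table"))            # None         -- no match
```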
In addition to find() and find_all() there are a number of search methods.
- find_parent()
- find_next_sibling()
- find_previous_sibling()
Each of the three methods above also has a plural form ending in 's' (e.g. find_parents()) that returns all matches instead of the first one.
- find_next()
- find_previous()
- find_all_next()
- find_all_previous()
find_parents() and find_parent()
find_all() and find() only search among the descendants of the current node (its children, grandchildren, and so on). find_parents() and find_parent() search the current node's parent nodes instead; the filters work in exactly the same way as for ordinary tag searches.
find_next_siblings() and find_next_sibling()
These two methods use the .next_siblings attribute to iterate over the siblings parsed after the current tag: find_next_siblings() returns all the later siblings that match, and find_next_sibling() returns only the first matching later sibling.
find_previous_siblings() and find_previous_sibling()
These two methods use the .previous_siblings attribute to iterate over the siblings parsed before the current tag: find_previous_siblings() returns all the matching preceding siblings, and find_previous_sibling() returns only the first one.
find_all_next() and find_next()
These two methods use the .next_elements attribute to iterate over the tags and strings that come after the current tag: find_all_next() returns all matching nodes, and find_next() returns only the first.
find_all_previous() and find_previous()
These two methods use the .previous_elements attribute to iterate over the tags and strings that come before the current node: find_all_previous() returns all matching nodes, and find_previous() returns only the first.
Note: The usage of the above methods is exactly the same as find_all(), and the principles are similar, so we will not repeat them here.
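A combined sketch of a few of these directional methods, using throwaway markup:

```python
from bs4 import BeautifulSoup

html = "<div><p id='first'>one</p><p>two</p><p>three</p></div>"
soup = BeautifulSoup(html, "html.parser")
first = soup.find("p", id="first")

print(first.find_parent("div"))        # the enclosing <div>
print(first.find_next_sibling("p"))    # <p>two</p>
print(first.find_next_siblings("p"))   # [<p>two</p>, <p>three</p>]
print(first.find_all_next(text=True))  # ['one', 'two', 'three'] -- strings parsed after this tag
```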
CSS selector
When writing CSS, a tag name is written as-is, a class name is prefixed with a dot, and an id is prefixed with #. We can filter elements with the same syntax using soup.select(), which returns a list (a combined sketch is given at the end of this section).
Find by tag name
Find by class name
Find by id name
combination search
A combined search mixes tag names, class names, and id names just as in a CSS file. For example, to find an element with id link1 inside a p tag, separate the two selectors with a space: p #link1.
Direct child tag search
attribute lookup
A lookup can also include attributes, which are written in square brackets. Note that the attribute and the tag belong to the same node, so there must be no space between them, otherwise nothing will be matched.
Attributes can likewise be combined with the selectors above: parts that refer to different nodes are separated by spaces, while parts that refer to the same node are written without a space.
All of these select() calls return a list. You can iterate over it and call get_text() on each element to get its text; get_text(strip=True) removes the whitespace before and after the text.
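A combined sketch of the selector styles above (the markup and URLs are only illustrative):

```python
from bs4 import BeautifulSoup

html = """
<p class="title"><b>The title</b></p>
<p class="story">Once upon a time
  <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
  <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.select("p"))                  # by tag name
print(soup.select(".sister"))            # by class name
print(soup.select("#link1"))             # by id
print(soup.select("p #link1"))           # combined: an element with id link1 inside a <p>
print(soup.select("p > a"))              # direct children only
print(soup.select('a[href="http://example.com/elsie"]'))    # by attribute, no space before [
print(soup.select('p a[href="http://example.com/elsie"]'))  # attribute combined with a tag

for tag in soup.select("a"):
    print(tag.get_text(strip=True))      # text content with surrounding whitespace removed
```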
Encoding issues
Beautiful Soup automatically converts input documents to Unicode and, by default, encodes output documents as UTF-8. The BeautifulSoup object's .original_encoding property shows the result of the automatic encoding detection. This detection is slow and occasionally wrong; installing the chardet library improves it. You can also pass the from_encoding parameter when creating the BeautifulSoup object to tell it the document's encoding directly. Sometimes, during transcoding, special characters are replaced with the Unicode replacement character; you can detect this through the BeautifulSoup object's .contains_replacement_characters property, which is True when such replacements have occurred.
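A minimal sketch (the GB2312 byte string is only an illustration, and the detected encoding depends on which detection libraries are installed):

```python
from bs4 import BeautifulSoup

markup = "<h1>你好</h1>".encode("gb2312")   # a GB2312-encoded byte string

soup = BeautifulSoup(markup, "html.parser")
print(soup.original_encoding)                # whatever encoding Beautiful Soup guessed

# If the guess is slow or wrong, state the encoding yourself
soup = BeautifulSoup(markup, "html.parser", from_encoding="gb2312")
print(soup.h1.string)                        # 你好
print(soup.contains_replacement_characters)  # True if some bytes could not be converted
```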