Beautiful Soup documentation tutorial: Beautiful Soup is a Python library for pulling data out of HTML and XML files. Beautiful Soup 3 has been replaced by Beautiful Soup 4, which works on both Python 2.x and Python 3.x. It is faster and has more features when used with third-party parsers such as lxml and html5lib. The library provides idiomatic ways of navigating, searching, and modifying the parse tree, which saves programmers hours or days of work. This documentation covers Beautiful Soup version 4.8, with examples written for Python 2.7 and 3.2.
We can start using Beautiful Soup after installing it. First, import the library in a Python script, then pass the markup of a document to the BeautifulSoup constructor to create a soup object. Beautiful Soup does not fetch web pages for you; you have to do that yourself.
Filtering:-
These filters are used throughout the search API. They can match on a tag's name, on its attributes, on the text of a string, or on some combination of these.
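How such a filter is matched against a tag name can be sketched in plain Python 3 (an illustrative toy, not Beautiful Soup's actual implementation; the helper name matches is made up for this sketch):

```python
import re

# Illustrative sketch of how a Beautiful Soup-style filter could be
# matched against a tag name. The real library implements this logic
# internally; "matches" is a made-up helper for demonstration only.
def matches(filter_, name):
    if isinstance(filter_, str):         # plain string: exact match
        return filter_ == name
    if isinstance(filter_, re.Pattern):  # regular expression: search()
        return filter_.search(name) is not None
    if isinstance(filter_, list):        # list: any member matches
        return name in filter_
    if callable(filter_):                # function: ask it directly
        return bool(filter_(name))
    return False

print(matches("b", "b"))                  # True
print(matches(re.compile("^b"), "body"))  # True
print(matches(["a", "p"], "p"))           # True
```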
Regular expression:-
If you pass in a regular expression object, Beautiful Soup will filter against that regular expression using its search() method. This code finds all the tags whose names start with the letter "b":

import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)

This code finds all the tags whose names contain the letter "t":

for tag in soup.find_all(re.compile("t")):
    print(tag.name)
To extract data from tags we can also use PyQuery, which can grab both the actual text contents and the HTML contents.
Example:-
from pyquery import PyQuery
import urllib2
response = urllib2.urlopen('http://en.wikipedia.org/wiki/Python_(programming_language)')
html = response.read()
pq = PyQuery(html)
tag = pq('div#toc')
print tag.text()
print tag.html()
The constructor takes an XML or HTML document in the form of a string, parses it, and creates a corresponding data structure in memory. If you give Beautiful Soup a perfectly-formed document, the parsed data structure looks just like the original document. But if the document is not well-formed, Beautiful Soup uses heuristics to figure out a reasonable structure for the data.
Some tags can be nested (<BLOCKQUOTE>) and some cannot (<P>). Table and list tags have a natural nesting order: for instance, <TD> tags go inside <TR> tags, not the other way around. The contents of a <SCRIPT> tag should not be parsed as HTML. A <META> tag may specify an encoding for the document.
from BeautifulSoup import BeautifulSoup
html = "<html><p>Para 1<p>Para 2<blockquote>Quote 1<blockquote>Quote 2"
soup = BeautifulSoup(html)
print soup.prettify()
<html>
<p>
Para 1
</p>
<p>
Para 2
<blockquote>
Quote 1
<blockquote>
Quote 2
</blockquote>
</blockquote>
</p>
</html>
from BeautifulSoup import BeautifulSoup
html = """
<html>
<form>
<table>
<td><input name="input1">Row 1 cell 1
<tr><td>Row 2 cell 1
</form>
<td>Row 2 cell 2<br>This</br> sure is a long cell
</body>
</html>"""
Beautiful Soup handles this messy document as follows:
print BeautifulSoup(html).prettify()
<html>
<form>
<table>
<td>
<input name="input1" />
Row 1 cell 1
</td>
<td>
Row 2 cell 1
</td>
</tr>
</table>
</form>
<td>
Row 2 cell 2
<br/>
This
sure is a long cell
</td>
</html>
Beautiful Soup decides to close the <TABLE> tag when it closes the <FORM> tag, even though the document's author probably intended the <FORM> tag to extend to the end of the table. Even so, Beautiful Soup parses the invalid document and gives you access to all the data. The BeautifulSoup class is full of web-browser-like heuristics for divining the intent of HTML authors. But XML does not have a fixed tag set, so those heuristics do not apply, and BeautifulSoup does not do XML well. The BeautifulStoneSoup class is used to parse XML documents:
from BeautifulSoup import BeautifulStoneSoup
xml = "<doc><tag1>Contents 1<tag2>Contents 2<tag1>Contents 3"
soup = BeautifulStoneSoup(xml)
print soup.prettify()
<doc>
<tag1>
Contents 1
<tag2>
Contents 2
</tag2>
</tag1>
<tag1>
Contents 3
</tag1>
</doc>
One common shortcoming of BeautifulStoneSoup is that it does not know about self-closing tags. HTML has a fixed set of self-closing tags, but with XML it depends on what the DTD says. You can tell BeautifulStoneSoup that certain tags are self-closing by passing in their names as the selfClosingTags argument to the constructor:
from BeautifulSoup import BeautifulStoneSoup
xml = "<tag>Text 1<selfclosing>Text 2"
print BeautifulStoneSoup(xml).prettify()
<tag>
Text 1
<selfclosing>
Text 2
</selfclosing>
</tag>
print BeautifulStoneSoup(xml, selfClosingTags=['selfclosing']).prettify()
<tag>
Text 1
<selfclosing/>
Text 2
</tag>
By the time your document is parsed, it has been transformed into Unicode; Beautiful Soup stores only Unicode strings in its data structures.
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup("Hello")
soup.contents[0]
u'Hello'
soup.originalEncoding
Beautiful Soup uses a class called UnicodeDammit to detect the encodings of the documents you give it and to convert them to Unicode, no matter what. If you need to do this for other documents, you can use UnicodeDammit by itself. You can also pass an encoding explicitly as the fromEncoding argument to the soup constructor. If Beautiful Soup later finds a different encoding declared within the document, it parses the document again from the beginning using that encoding. Beautiful Soup will guess right whenever it can make a guess at all.
soup = BeautifulSoup(euc_jp)
soup.originalEncoding
str(soup)
soup = BeautifulSoup(euc_jp, fromEncoding="euc-jp")
soup.originalEncoding
str(soup)
soup.__str__('euc-jp') == euc_jp
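The core trial-decoding idea behind this kind of detection can be sketched with the standard library alone (a toy under stated assumptions: a fixed candidate list, no <META>-tag sniffing, and a made-up function name guess_decode; the real UnicodeDammit is far more thorough):

```python
# Toy encoding detection by trial decoding, loosely in the spirit of
# UnicodeDammit. "guess_decode" and the candidate list are assumptions
# made for this sketch, not part of any library API.
def guess_decode(data, candidates=("utf-8", "euc-jp", "windows-1252")):
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte, so it always "succeeds" as a last resort
    return data.decode("latin-1"), "latin-1"

text, enc = guess_decode("Sacr\xe9 bleu!".encode("utf-8"))
print(enc)  # utf-8
text, enc = guess_decode(b"\xc6\xfc\xcb\xdc\xb8\xec")  # EUC-JP bytes
print(enc)  # euc-jp
```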
Microsoft smart quotes and other Windows-specific characters would otherwise be destroyed or mangled. Rather than transforming those characters into their Unicode equivalents, Beautiful Soup transforms them into HTML or XML entities, depending on the smartQuotesTo argument:
from BeautifulSoup import BeautifulSoup, BeautifulStoneSoup
text = "Deploy the \x91SMART QUOTES\x92!"
str(BeautifulSoup(text))
str(BeautifulStoneSoup(text))
str(BeautifulSoup(text, smartQuotesTo="xml"))
BeautifulSoup(text, smartQuotesTo=None).contents[0]
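For reference, the \x91 and \x92 bytes above are the windows-1252 encodings of the curly single quotes; plain Python can perform the Unicode conversion that smartQuotesTo=None leaves in place (a standard-library illustration, independent of Beautiful Soup):

```python
# The Windows "smart quote" bytes decoded with the standard library.
raw = b"Deploy the \x91SMART QUOTES\x92!"
text = raw.decode("windows-1252")
print(hex(ord(text[11])))  # 0x2018 (left single quotation mark)
print(hex(ord(text[-2])))  # 0x2019 (right single quotation mark)
```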
You can turn a Beautiful Soup document into a string with the str function, or with the prettify or renderContents methods. You can also use the unicode function to get the document as a Unicode string. The prettify method adds strategic newlines and spacing to make the structure of the document obvious. It also strips out text nodes that contain only whitespace, which can change the meaning of an XML document. The str and unicode functions do not strip out text nodes that contain only whitespace, and they do not add any whitespace between nodes either.
from BeautifulSoup import BeautifulSoup
doc = "<html><h1>Heading</h1><p>Text"
soup = BeautifulSoup(doc)
str(soup)
soup.renderContents()
soup.__str__()
unicode(soup)
soup.prettify()
print soup.prettify()
str and renderContents give different results when used on a tag within a document: str prints a tag and its contents, while renderContents prints only the contents:
heading = soup.h1
str(heading)
heading.renderContents()
from BeautifulSoup import BeautifulSoup
doc = "Sacr\xe9 bleu!"
soup = BeautifulSoup(doc)
str(soup)
soup.__str__("ISO-8859-1")
soup.__str__("UTF-16")
soup.__str__("EUC-JP")
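The per-encoding byte strings these calls produce can be previewed with plain Python string encoding (standard library only; this does not involve Beautiful Soup itself):

```python
# Encoding the same text several ways with the standard library,
# mirroring what soup.__str__(encoding) does for a whole document.
doc = "Sacr\xe9 bleu!"
print(doc.encode("iso-8859-1"))   # b'Sacr\xe9 bleu!'
print(doc.encode("utf-8"))        # b'Sacr\xc3\xa9 bleu!'
print(len(doc.encode("utf-16")))  # 24: a 2-byte BOM + 2 bytes per character
```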
Here Beautiful Soup loads an HTML document whose <META> tag declares an encoding, then prints it back out:
from BeautifulSoup import BeautifulSoup
doc = """<html>
<meta http-equiv="Content-type" content="text/html; charset=ISO-Latin-1" >
Sacr\xe9 bleu!
</html>"""
print BeautifulSoup(doc).prettify()
from BeautifulSoup import BeautifulStoneSoup
doc = """<?xml version="1.0" encoding="ISO-Latin-1"?>Sacr\xe9 bleu!"""
print BeautifulStoneSoup(doc).prettify()
A parser object is a deeply-nested, well-connected data structure that corresponds to the structure of an XML or HTML document. The parser object contains two other types of objects: Tag objects, which correspond to tags like the <TITLE> tag and the <B> tags, and NavigableString objects, which correspond to strings like "Page title" and "This is paragraph". Subclasses of NavigableString, such as Comment, represent special kinds of strings:
from BeautifulSoup import BeautifulSoup
import re
hello = "Hello! <!--I've got to be nice to get what I want. -->"
commentSoup = BeautifulSoup(hello)
comment = commentSoup.find(text=re.compile("nice"))
comment.__class__
comment
comment.previousSibling
str(comment)
print commentSoup
Tag and NavigableString objects have a lot of useful members, most of which are covered below.
firstPTag, secondPTag = soup.findAll('p')
firstPTag['id']
secondPTag['id']
Tag objects have all of the members listed below; NavigableString objects have all of them except for contents and string.
The parent of the <HEAD> Tag is the <HTML> Tag. The nextSibling of the <HEAD> Tag is the <BODY> Tag, because the <BODY> Tag is the next thing directly beneath the <HTML> Tag. The nextSibling of the <BODY> Tag is None, because there is nothing else directly beneath the <HTML> Tag.
soup.head.nextSibling.name
soup.body.nextSibling == None
The previousSibling of the <BODY> Tag is the <HEAD> Tag, and the previousSibling of the <HEAD> Tag is None:
soup.body.previousSibling.name
soup.head.previousSibling == None
The nextSibling of the first <P> Tag is the second <P> Tag, and the NavigableString at the start of a <P> Tag has no previousSibling, so it is None:
soup.p.nextSibling
secondBTag = soup.findAll('b')[1]
secondBTag.previousSibling
secondBTag.previousSibling.previousSibling == None
You can iterate over the contents of a Tag by treating the Tag itself as a list. To see how many child nodes a Tag has, you can call len(tag) instead of len(tag.contents):
for i in soup.body:
    print i
len(soup.body)
len(soup.body.contents)
It is easy to navigate the parse tree by acting as though the name of the tag you want is a member of a parser or Tag object. For instance, soup.head gives us the <HEAD> Tag in the document:
soup.head
In general, calling mytag.foo returns the first child of mytag that happens to be a <FOO> Tag. If there aren't any <FOO> Tags beneath mytag, then mytag.foo returns None.
Use this to traverse the parse tree very quickly:
soup.head.title
soup.body.p.b.string
You can also use this to quickly jump to a certain part of a parse tree.
If you're not worried about <TITLE> tags in weird places outside of the <HEAD> tag, you can just use soup.title to get an HTML document's title.
soup.title.string
soup.p jumps to the first <P> tag inside the document, wherever it is. soup.table.tr.td jumps to the first column of the first row of the first table in the document.
from BeautifulSoup import BeautifulStoneSoup
xml = '<person name="Bob"><parent rel="mother" name="Alice">'
xmlSoup = BeautifulStoneSoup(xml)
xmlSoup.person.parent
xmlSoup.person.parentTag
Beautiful Soup provides many methods that traverse the parse tree, gathering the Tags and NavigableStrings that match criteria you specify.
from BeautifulSoup import BeautifulSoup
doc = ['<html><head><title>Page title</title></head>',
'<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
'<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
'</html>']
soup = BeautifulSoup(''.join(doc))
print soup.prettify()
If you call the parser object or a Tag like a function, you can pass in all of findAll's arguments, and it's the same as calling findAll.
soup(text=lambda x: len(x) < 12)
soup.body ('p', limit=1)
The find method is like findAll, except that instead of finding all the matching objects, it finds only the first one. It is like imposing a limit of 1 on the result set and then extracting the single result from the array:
soup.findAll('p', limit=1)
soup.find('p')
soup.find('nosuchtag') == None
If you rip an element out of its parent's contents by hand, the rest of the document may still hold references to the thing you ripped out. Beautiful Soup therefore offers several methods that let you modify the parse tree while maintaining its internal consistency.
You can use dictionary assignment to modify the attribute values of Tag objects.
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup('<b id="2">Argh! </b>')
print soup
b = soup.b
b['id'] = 10
print soup
b['id'] = "ten"
print soup
b['id'] = 'one "million"'
print soup
You can also delete attribute values, and add new ones:
del(b['id'])
print soup
b['class'] = "extra bold and brassy!"
print soup
You can rip an element out of the tree entirely with the extract method. This code removes all the comments from a document:
from BeautifulSoup import BeautifulSoup, Comment
soup = BeautifulSoup("""1<!--The loneliest number-->
<a>2<!--Can be as bad as one--><b>3""")
comments = soup.findAll(text=lambda text: isinstance(text, Comment))
[comment.extract() for comment in comments]
print soup
This code removes a subtree from a document:
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup("<a1></a1><a><b>Amazing content<c><d></a><a2></a2>")
soup.a1.nextSibling
soup.a2.previousSibling
subtree = soup.a
subtree.extract()
print soup
soup.a1.nextSibling
soup.a2.previousSibling
The extract method turns one parse tree into two disjoint trees. The navigation members are changed so that it looks like the trees had never been together:
soup.a1.nextSibling
soup.a2.previousSibling
subtree.previousSibling == None
subtree.parent == None
The replaceWith method extracts one page element and replaces it with a different one. The new element can be a Tag or a NavigableString.
If you pass a plain old string into replaceWith, it gets turned into a NavigableString. The navigation members are changed as though the document had been parsed that way in the first place.
Example:-
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup("<b>Argh! </b>")
soup.find(text="Argh!").replaceWith("Hooray!")
print soup
newText = soup.find (text="Hooray!")
newText.previous
newText.previous.next
newText.parent
soup.b.contents
Here is a more complex example that replaces one tag with another:
from BeautifulSoup import BeautifulSoup, Tag
soup = BeautifulSoup("<b>Argh! <a>Foo</a></b><i>Blah! </i>")
tag = Tag(soup, "newTag", [("id", 1)])
tag.insert(0, "Hooray!")
soup.a.replaceWith (tag)
print soup
from BeautifulSoup import BeautifulSoup
text = "<html>There's <b>no</b> business like <b>show</b> business</html>"
soup = BeautifulSoup(text)
no, show = soup.findAll ('b')
show.replaceWith (no)
print soup
The Tag class and the parser classes support a method called insert. It works just like a Python list's insert method: it takes an index into the tag's contents member and sticks a new element in that slot.
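The list analogy is exact enough to show with a plain Python list standing in for a tag's contents (illustration only; no Beautiful Soup involved):

```python
# A tag's contents member behaves like this list under insert.
contents = ["Hello, ", "world"]
contents.insert(1, "brave new ")
print(contents)  # ['Hello, ', 'brave new ', 'world']
```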
It was used in the previous section to replace a tag in the document with a brand new tag. You can also use insert to build up a parse tree from scratch:
from BeautifulSoup import BeautifulSoup, Tag, NavigableString
soup = BeautifulSoup ()
tag1 = Tag(soup, "myTag")
tag2 = Tag (soup, "myOtherTag")
tag3 = Tag (soup, "myThirdTag")
soup.insert(0, tag1)
tag1.insert(0, tag2)
tag1.insert(1, tag3)
print soup
text = NavigableString ("Hello!")
tag3.insert(0, text)
print soup
An element can occur in only one place in one parse tree. If you give insert an element that is already connected to a soup object, it gets disconnected (with extract) before it gets connected elsewhere. In this example, inserting the NavigableString into tag2 first removes it from tag3:
tag2.insert(0, text)
print soup
Because an element has only one parent and one nextSibling, it can be in only one place at a time.