Beautiful Soup documentation tutorial: Beautiful Soup is a Python library for pulling data out of HTML and XML files. Beautiful Soup 3 has been replaced by Beautiful Soup 4, which works on both Python 2.x and Python 3.x. It is faster and has more features when used with third-party parsers such as lxml and html5lib. The library provides idiomatic ways of navigating, searching, and modifying the parse tree, which saves programmers hours or days of work. This documentation covers Beautiful Soup version 4.8, with examples written for Python 2.7 and 3.2.
We can start using Beautiful Soup after installing it. First, import the library in a Python script, then pass the markup of a document to the BeautifulSoup constructor to create a soup object. Beautiful Soup does not fetch web pages for you; you have to do that yourself.
Filtering:-
These filters are used throughout the search API. They can match on a tag's name, on its attributes, on the text of a string, or on some combination of these.
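How such a filter is matched against a tag name can be sketched in plain Python 3 (an illustrative toy, not Beautiful Soup's actual implementation; the helper name matches is made up for this sketch):

```python
import re

# Illustrative sketch of how a Beautiful Soup-style filter could be
# matched against a tag name. The real library implements this logic
# internally; "matches" is a made-up helper for demonstration only.
def matches(filter_, name):
    if isinstance(filter_, str):         # plain string: exact match
        return filter_ == name
    if isinstance(filter_, re.Pattern):  # regular expression: search()
        return filter_.search(name) is not None
    if isinstance(filter_, list):        # list: any member matches
        return name in filter_
    if callable(filter_):                # function: ask it directly
        return bool(filter_(name))
    return False

print(matches("b", "b"))                  # True
print(matches(re.compile("^b"), "body"))  # True
print(matches(["a", "p"], "p"))           # True
```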
Regular expression:-
If you pass in a regular expression object, Beautiful Soup will filter against that regular expression using its search() method. This code finds all the tags whose names start with the letter "b":

import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)

This code finds all the tags whose names contain the letter "t":

for tag in soup.find_all(re.compile("t")):
    print(tag.name)
To extract data from tags we can also use PyQuery, which can grab both the actual text contents and the HTML contents.
Example:-
from pyquery import PyQuery
import urllib2
response = urllib2.urlopen('http://en.wikipedia.org/wiki/Python_(programming_language)')
html = response.read()
pq = PyQuery(html)
tag = pq('div#toc')
print tag.text()
print tag.html()
The constructor takes an XML or HTML document in the form of a string, parses it, and creates a corresponding data structure in memory. If you give Beautiful Soup a perfectly-formed document, the parsed data structure looks just like the original document. But if the document is not well-formed, Beautiful Soup uses heuristics to figure out a reasonable structure for the data.
Some tags can be nested (<BLOCKQUOTE>) and some cannot (<P>). Table and list tags have a natural nesting order: for instance, <TD> tags go inside <TR> tags, not the other way around. The contents of a <SCRIPT> tag should not be parsed as HTML. A <META> tag may specify an encoding for the document.
from BeautifulSoup import BeautifulSoup
html = "<html><p>Para 1<p>Para 2<blockquote>Quote 1<blockquote>Quote 2"
soup = BeautifulSoup(html)
print soup.prettify()
<html>
<p>
Para 1
</p>
<p>
Para 2
<blockquote>
Quote 1
<blockquote>
Quote 2
</blockquote>
</blockquote>
</p>
</html>
from BeautifulSoup import BeautifulSoup
html = """
<html>
<form>
<table>
<td><input name="input1">Row 1 cell 1
<tr><td>Row 2 cell 1
</form>
<td>Row 2 cell 2<br>This</br> sure is a long cell
</body>
</html>"""
Beautiful Soup handles this messy document as follows:
print BeautifulSoup(html).prettify()
<html>
<form>
<table>
<td>
<input name="input1" />
Row 1 cell 1
</td>
<td>
Row 2 cell 1
</td>
</tr>
</table>
</form>
<td>
Row 2 cell 2
<br/>
This
sure is a long cell
</td>
</html>
Beautiful Soup decides to close the <TABLE> tag when it closes the <FORM> tag, even though the document's author probably intended the <FORM> tag to extend to the end of the table. Even so, Beautiful Soup parses the invalid document and gives you access to all the data. The BeautifulSoup class is full of web-browser-like heuristics for divining the intent of HTML authors. But XML does not have a fixed tag set, so those heuristics do not apply, and BeautifulSoup does not do XML well. The BeautifulStoneSoup class is used to parse XML documents:
from BeautifulSoup import BeautifulStoneSoup
xml = "<doc><tag1>Contents 1<tag2>Contents 2<tag1>Contents 3"
soup = BeautifulStoneSoup(xml)
print soup.prettify()
<doc>
<tag1>
Contents 1
<tag2>
Contents 2
</tag2>
</tag1>
<tag1>
Contents 3
</tag1>
</doc>
One common shortcoming of BeautifulStoneSoup is that it does not know about self-closing tags. HTML has a fixed set of self-closing tags, but with XML it depends on what the DTD says. You can tell BeautifulStoneSoup that certain tags are self-closing by passing in their names as the selfClosingTags argument to the constructor:
from BeautifulSoup import BeautifulStoneSoup
xml = "<tag>Text 1<selfclosing>Text 2"
print BeautifulStoneSoup(xml).prettify()
<tag>
Text 1
<selfclosing>
Text 2
</selfclosing>
</tag>
print BeautifulStoneSoup(xml, selfClosingTags=['selfclosing']).prettify()
<tag>
Text 1
<selfclosing/>
Text 2
</tag>
By the time your document is parsed, it has been transformed into Unicode; Beautiful Soup stores only Unicode strings in its data structures.
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup("Hello")
soup.contents[0]
u'Hello'
soup.originalEncoding
Beautiful Soup uses a class called UnicodeDammit to detect the encodings of the documents you give it and to convert them to Unicode, no matter what. If you need to do this for other documents, you can use UnicodeDammit by itself. You can also pass an encoding explicitly as the fromEncoding argument to the soup constructor. If Beautiful Soup later finds a different encoding declared within the document, it parses the document again from the beginning using that encoding. Beautiful Soup will guess right whenever it can make a guess at all.
soup = BeautifulSoup(euc_jp)
soup.originalEncoding
str(soup)
soup = BeautifulSoup(euc_jp, fromEncoding="euc-jp")
soup.originalEncoding
str(soup)
soup.__str__('euc-jp') == euc_jp
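The core trial-decoding idea behind this kind of detection can be sketched with the standard library alone (a toy under stated assumptions: a fixed candidate list, no <META>-tag sniffing, and a made-up function name guess_decode; the real UnicodeDammit is far more thorough):

```python
# Toy encoding detection by trial decoding, loosely in the spirit of
# UnicodeDammit. "guess_decode" and the candidate list are assumptions
# made for this sketch, not part of any library API.
def guess_decode(data, candidates=("utf-8", "euc-jp", "windows-1252")):
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte, so it always "succeeds" as a last resort
    return data.decode("latin-1"), "latin-1"

text, enc = guess_decode("Sacr\xe9 bleu!".encode("utf-8"))
print(enc)  # utf-8
text, enc = guess_decode(b"\xc6\xfc\xcb\xdc\xb8\xec")  # EUC-JP bytes
print(enc)  # euc-jp
```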
Microsoft smart quotes and other Windows-specific characters would otherwise be destroyed or mangled. Rather than transforming those characters into their Unicode equivalents, Beautiful Soup transforms them into HTML or XML entities, depending on the smartQuotesTo argument:
from BeautifulSoup import BeautifulSoup, BeautifulStoneSoup
text = "Deploy the \x91SMART QUOTES\x92!"
str(BeautifulSoup(text))
str(BeautifulStoneSoup(text))
str(BeautifulSoup(text, smartQuotesTo="xml"))
BeautifulSoup(text, smartQuotesTo=None).contents[0]
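For reference, the \x91 and \x92 bytes above are the windows-1252 encodings of the curly single quotes; plain Python can perform the Unicode conversion that smartQuotesTo=None leaves in place (a standard-library illustration, independent of Beautiful Soup):

```python
# The Windows "smart quote" bytes decoded with the standard library.
raw = b"Deploy the \x91SMART QUOTES\x92!"
text = raw.decode("windows-1252")
print(hex(ord(text[11])))  # 0x2018 (left single quotation mark)
print(hex(ord(text[-2])))  # 0x2019 (right single quotation mark)
```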
You can turn a Beautiful Soup document into a string with the str function, or with the prettify or renderContents methods. You can also use the unicode function to get the document as a Unicode string. The prettify method adds strategic newlines and spacing to make the structure of the document obvious. It also strips out text nodes that contain only whitespace, which can change the meaning of an XML document. The str and unicode functions do not strip out text nodes that contain only whitespace, and they do not add any whitespace between nodes either.
from BeautifulSoup import BeautifulSoup
doc = "<html><h1>Heading</h1><p>Text"
soup = BeautifulSoup(doc)
str(soup)
soup.renderContents()
soup.__str__()
unicode(soup)
soup.prettify()
print soup.prettify()
str and renderContents give different results when used on a tag within a document: str prints a tag and its contents, while renderContents prints only the contents:
heading = soup.h1
str(heading)
heading.renderContents()
from BeautifulSoup import BeautifulSoup
doc = "Sacr\xe9 bleu!"
soup = BeautifulSoup(doc)
str(soup)
soup.__str__("ISO-8859-1")
soup.__str__("UTF-16")
soup.__str__("EUC-JP")
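The per-encoding byte strings these calls produce can be previewed with plain Python string encoding (standard library only; this does not involve Beautiful Soup itself):

```python
# Encoding the same text several ways with the standard library,
# mirroring what soup.__str__(encoding) does for a whole document.
doc = "Sacr\xe9 bleu!"
print(doc.encode("iso-8859-1"))   # b'Sacr\xe9 bleu!'
print(doc.encode("utf-8"))        # b'Sacr\xc3\xa9 bleu!'
print(len(doc.encode("utf-16")))  # 24: a 2-byte BOM + 2 bytes per character
```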
Here Beautiful Soup loads an HTML document whose <META> tag declares an encoding, then prints it back out:
from BeautifulSoup import BeautifulSoup
doc = """<html>
<meta http-equiv="Content-type" content="text/html; charset=ISO-Latin-1" >
Sacr\xe9 bleu!
</html>"""
print BeautifulSoup(doc).prettify()
from BeautifulSoup import BeautifulStoneSoup
doc = """<?xml version="1.0" encoding="ISO-Latin-1"?>Sacr\xe9 bleu!"""
print BeautifulStoneSoup(doc).prettify()
A parser object is a deeply-nested, well-connected data structure that corresponds to the structure of an XML or HTML document. The parser object contains two other types of objects: Tag objects, which correspond to tags like the <TITLE> tag and the <B> tags, and NavigableString objects, which correspond to strings like "Page title" and "This is paragraph". Subclasses of NavigableString, such as Comment, represent special kinds of strings:
from BeautifulSoup import BeautifulSoup
import re
hello = "Hello! <!--I've got to be nice to get what I want. -->"
commentSoup = BeautifulSoup(hello)
comment = commentSoup.find(text=re.compile("nice"))
comment.__class__
comment
comment.previousSibling
str(comment)
print commentSoup
Tag and NavigableString objects have a lot of useful members, most of which are covered below.
firstPTag, secondPTag = soup.findAll('p')
firstPTag['id']
secondPTag['id']
Tag objects have all of the members listed below; NavigableString objects have all of them except for contents and string.
The parent of the <HEAD> Tag is the <HTML> Tag. The nextSibling of the <HEAD> Tag is the <BODY> Tag, because the <BODY> Tag is the next thing directly beneath the <HTML> Tag. The nextSibling of the <BODY> Tag is None, because there is nothing else directly beneath the <HTML> Tag.
soup.head.nextSibling.name
soup.body.nextSibling == None
The previousSibling of the <BODY> Tag is the <HEAD> Tag, and the previousSibling of the <HEAD> Tag is None:
soup.body.previousSibling.name
soup.head.previousSibling == None
The nextSibling of the first <P> Tag is the second <P> Tag, and the NavigableString at the start of a <P> Tag has no previousSibling, so it is None:
soup.p.nextSibling
secondBTag = soup.findAll('b')[1]
secondBTag.previousSibling
secondBTag.previousSibling.previousSibling == None
You can iterate over the contents of a Tag by treating the Tag itself as a list. To see how many child nodes a Tag has, you can call len(tag) instead of len(tag.contents):
for i in soup.body:
    print i
len(soup.body)
len(soup.body.contents)
It is easy to navigate the parse tree by acting as though the name of the tag you want is a member of a parser or Tag object. For instance, soup.head gives us the <HEAD> Tag in the document:
soup.head
In general, calling mytag.foo returns the first child of mytag that happens to be a <FOO> Tag. If there aren't any <FOO> Tags beneath mytag, then mytag.foo returns None.
Use this to traverse the parse tree very quickly:
soup.head.title
soup.body.p.b.string
You can also use this to quickly jump to a certain part of a parse tree.
If you're not worried about <TITLE> tags in weird places outside of the <HEAD> tag, you can just use soup.title to get an HTML document's title.
soup.title.string
soup.p jumps to the first <P> tag inside the document, wherever it is. soup.table.tr.td jumps to the first column of the first row of the first table in the document.
from BeautifulSoup import BeautifulStoneSoup
xml = '<person name="Bob"><parent rel="mother" name="Alice">'
xmlSoup = BeautifulStoneSoup(xml)
xmlSoup.person.parent
xmlSoup.person.parentTag
Beautiful Soup provides many methods that traverse the parse tree, gathering the Tags and NavigableStrings that match criteria you specify.
from BeautifulSoup import BeautifulSoup
doc = ['<html><head><title>Page title</title></head>',
'<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
'<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
'</html>']
soup = BeautifulSoup(''.join(doc))
print soup.prettify()
If you call the parser object or a Tag like a function, you can pass in all of findAll's arguments, and it's the same as calling findAll.
soup(text=lambda x: len(x) < 12)
soup.body ('p', limit=1)
The find method is like findAll, except that instead of finding all the matching objects, it finds only the first one. It is like imposing a limit of 1 on the result set and then extracting the single result from the array:
soup.findAll('p', limit=1)
soup.find('p')
soup.find('nosuchtag') == None
If you rip an element out of its parent's contents by hand, the rest of the document may still hold references to the thing you ripped out. Beautiful Soup therefore offers several methods that let you modify the parse tree while maintaining its internal consistency.
You can use dictionary assignment to modify the attribute values of Tag objects.
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup('<b id="2">Argh! </b>')
print soup
b = soup.b
b['id'] = 10
print soup
b['id'] = "ten"
print soup
b['id'] = 'one "million"'
print soup
You can also delete attribute values, and add new ones:
del(b['id'])
print soup
b['class'] = "extra bold and brassy!"
print soup
You can rip an element out of the tree entirely with the extract method. This code removes all the comments from a document:
from BeautifulSoup import BeautifulSoup, Comment
soup = BeautifulSoup("""1<!--The loneliest number-->
<a>2<!--Can be as bad as one--><b>3""")
comments = soup.findAll(text=lambda text: isinstance(text, Comment))
[comment.extract() for comment in comments]
print soup
This code removes a subtree from a document:
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup("<a1></a1><a><b>Amazing content<c><d></a><a2></a2>")
soup.a1.nextSibling
soup.a2.previousSibling
subtree = soup.a
subtree.extract()
print soup
soup.a1.nextSibling
soup.a2.previousSibling
The extract method turns one parse tree into two disjoint trees. The navigation members are changed so that it looks like the trees had never been together:
soup.a1.nextSibling
soup.a2.previousSibling
subtree.previousSibling == None
subtree.parent == None
The replaceWith method extracts one page element and replaces it with a different one. The new element can be a Tag or a NavigableString.
If you pass a plain old string into replaceWith, it gets turned into a NavigableString. The navigation members are changed as though the document had been parsed that way in the first place.
Example:-
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup("<b>Argh! </b>")
soup.find(text="Argh!").replaceWith("Hooray!")
print soup
newText = soup.find (text="Hooray!")
newText.previous
newText.previous.next
newText.parent
soup.b.contents
Here is a more complex example that replaces one tag with another:
from BeautifulSoup import BeautifulSoup, Tag
soup = BeautifulSoup("<b>Argh! <a>Foo</a></b><i>Blah! </i>")
tag = Tag(soup, "newTag", [("id", 1)])
tag.insert(0, "Hooray!")
soup.a.replaceWith (tag)
print soup
from BeautifulSoup import BeautifulSoup
text = "<html>There's <b>no</b> business like <b>show</b> business</html>"
soup = BeautifulSoup(text)
no, show = soup.findAll ('b')
show.replaceWith (no)
print soup
The Tag class and the parser classes support a method called insert. It works just like a Python list's insert method: it takes an index into the tag's contents member and sticks a new element in that slot.
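The list analogy is exact enough to show with a plain Python list standing in for a tag's contents (illustration only; no Beautiful Soup involved):

```python
# A tag's contents member behaves like this list under insert.
contents = ["Hello, ", "world"]
contents.insert(1, "brave new ")
print(contents)  # ['Hello, ', 'brave new ', 'world']
```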
It was used in the previous section to replace a tag in the document with a brand new tag. You can also use insert to build up a parse tree from scratch:
from BeautifulSoup import BeautifulSoup, Tag, NavigableString
soup = BeautifulSoup ()
tag1 = Tag(soup, "myTag")
tag2 = Tag (soup, "myOtherTag")
tag3 = Tag (soup, "myThirdTag")
soup.insert(0, tag1)
tag1.insert(0, tag2)
tag1.insert(1, tag3)
print soup
text = NavigableString ("Hello!")
tag3.insert(0, text)
print soup
An element can occur in only one place in one parse tree. If you give insert an element that is already connected to a soup object, it gets disconnected (with extract) before it gets connected elsewhere. In this example, inserting the NavigableString into tag2 first removes it from tag3:
tag2.insert(0, text)
print soup
Because an element has only one parent and one nextSibling, it can be in only one place at a time.