Scraping images with Python and BeautifulSoup tutorial: Using Scrapy we can fetch images from the internet and feed them as input to PyTesseract, although there are some cases it cannot handle. Scraping data from websites is not as complicated as it used to be; scraped data is used for many purposes, and many tools are available for the job. Of the tools available, Scrapy together with PyTesseract is one of the best combinations we can work with: PyTesseract returns the text content of an image as plain text data, which we can then use directly.
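As a rough illustration of that last step, a minimal sketch of handing a downloaded image to PyTesseract might look like the following (the file name downloaded.png is only a placeholder, and this assumes the Tesseract binary plus the pytesseract and Pillow packages are installed):

import pytesseract
from PIL import Image

# Open a previously downloaded image (placeholder file name) and run OCR on it.
image = Image.open('downloaded.png')
text = pytesseract.image_to_string(image)
print(text)  # the text content of the image, as plain text data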
Scraping a large amount of data from a website is a difficult job to do manually.
Web scraping is the process of extracting information from websites, and it involves downloading web pages.
BeautifulSoup is a parsing library that lets us extract data from HTML and XML documents.
It detects the document encoding and gracefully handles documents with special characters.
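For example, a minimal sketch of parsing a small HTML snippet with BeautifulSoup (the markup here is made up for illustration):

from bs4 import BeautifulSoup

html = '<html><body><img src="/cat.jpg" alt="a cat"><img src="/dog.jpg"></body></html>'
soup = BeautifulSoup(html, 'html.parser')

# Extract the src attribute of every <img> tag in the document.
for img in soup.find_all('img'):
    print(img.get('src'))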
For example, submitting a form with urllib and urllib2 (Python 2) looks like this:

import urllib, urllib2

req = urllib2.Request('http://example.com/form/submit/url',
                      data=urllib.urlencode({'field1': 'value', 'field2': 'value', 'field3': 'value'}),
                      headers={'User-Agent': 'Mozilla something',
                               'Cookie': 'name=value; name2=value2'})
response = urllib2.urlopen(req)
Python can be used to download images, video, text, and audio from the web.
The script below defines two helpers, download_baidu(word) and download_google(word), which download images for a given keyword from Baidu and Google image search respectively.
import re
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse
import os

def download_baidu(keyword):
    url = 'https://image.baidu.com/search?tn=baiduimage&ie=utf-8&word=' + keyword + '&ct=201326592&v=flip'
    result = requests.get(url)
    html = result.text
    # Image URLs are embedded in the page as "objURL":"..." entries.
    pic_url = re.findall('"objURL":"(.*?)",', html, re.S)
    i = 0
    for each in pic_url:
        print(each)
        try:
            pic = requests.get(each, timeout=10)
        except requests.exceptions.ConnectionError:
            print('exception')
            continue
        # Save each image as pictures<keyword>_<index>.jpg in the current directory.
        string = 'pictures' + keyword + '_' + str(i) + '.jpg'
        fp = open(string, 'wb')
        fp.write(pic.content)
        fp.close()
        i += 1

def download_google(word):
    url = 'https://www.google.com/search?q=' + word + '&client=opera&hs=cTQ&source=lnms&tbm=isch&sa=X&ved=0ahUKEwig3Lox4PZKAhWGFywKHZyZAAgQ_AUIBygB&biw=1920&bih=982'
    page = requests.get(url).text
    soup = BeautifulSoup(page, 'html.parser')
    # Download every <img> tag's src with wget.
    for raw_img in soup.find_all('img'):
        link = raw_img.get('src')
        os.system("wget " + link)

if __name__ == '__main__':
    word = input("Input key word:")
    download_baidu(word)
The next walkthrough covers a separate script that reads image-page links from a CSV file passed on the command line. Import the libraries needed to run the code, importing BeautifulSoup under the alias bs.
The requests library is used to fetch content from a given link, and urllib.request is another package that helps in opening and reading URLs.
argparse lets us parse the arguments passed on the command line.
os provides functions for interacting with the file system; every package here except BeautifulSoup is part of the Python standard library.
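The script itself is not reproduced in this walkthrough, so the fragments that follow are only sketches of what the described steps might look like. The import block, assuming the bs alias mentioned above, could be:

import argparse
import os
import requests
import urllib.request
from bs4 import BeautifulSoup as bs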
Lines 8–12:
Initialize the argument parser and parse the filename argument.
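Continuing that sketch, the filename argument might be parsed like this (the argument name file is an assumption, not taken from the original script):

# Parse the CSV file name passed on the command line.
parser = argparse.ArgumentParser(description='Download images listed in a CSV file')
parser.add_argument('file', help='path to the CSV file of image page links')
args = parser.parse_args()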
Lines 14–21:
os.getcwd() returns the path to the current working directory.
Split the .csv extension off the file name and join the result with the current working directory to form the output directory the images will be saved in.
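Continuing the sketch, with a makedirs call added so the directory exists before we write into it:

# e.g. "links.csv" -> an output directory named "links" under the current directory
base_name = os.path.splitext(os.path.basename(args.file))[0]
output_dir = os.path.join(os.getcwd(), base_name)
os.makedirs(output_dir, exist_ok=True)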
Lines 23–25:
Open the CSV file with open(), read its contents, and split them on the CSV delimiter; links will then hold a list of links to image display pages.
Lines 27–28:
We take the length of links and print it, since that is the number of images that will be downloaded.
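A sketch of those two steps, assuming a comma delimiter and the args.file name from the earlier fragment:

with open(args.file, 'r') as f:
    # One flat list of image page links, split on the CSV delimiter.
    links = f.read().split(',')

print('Found {} image page links to download.'.format(len(links)))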
Lines 30–34:
We create a function to accept an image URL and download it.
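A sketch of such a helper; the name download_image comes from the walkthrough below, while the index parameter and the file-name pattern are assumptions:

def download_image(image_url, index):
    # Fetch the image bytes and write them into the output directory built earlier.
    response = requests.get(image_url)
    file_path = os.path.join(output_dir, 'image_{}.jpg'.format(index))
    with open(file_path, 'wb') as img_file:
        img_file.write(response.content)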
Lines 36–39:
We loop over each hyperlink href in the list of image display links and fetch that URL with the get method of the requests library.
Lines 40–41:
soup.find_all('meta', attrs={"name": "twitter:image"}) looks for all meta tags with that name attribute.
Line 41:
Then we use the string replace method to fix up the image link and call the download_image function to download the image.
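Putting lines 36–41 together, the loop might look like the following sketch; the string replace step that fixes up the link is omitted here because its arguments are not given in the walkthrough:

for index, page_link in enumerate(links):
    # Fetch the image display page and parse it.
    page = requests.get(page_link).text
    soup = bs(page, 'html.parser')

    # Each matching meta tag holds an image URL in its content attribute.
    for meta in soup.find_all('meta', attrs={'name': 'twitter:image'}):
        image_url = meta.get('content')
        download_image(image_url, index)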
The image_scraper package depends on requests, setproctitle, and pythreadpool, which can be downloaded and installed.
import image_scraper
image_scraper.scrape_images(URL)
| Option | Description |
| -h, --help | Print help |
| -s, --save-dir <path> | Name of the folder to save the images in |
| -m, --max-images <number> | Maximum number of images to be scraped |
| --formats <formats> | Specify the image formats to be scraped |
| --dump-urls | Print the URLs of the images |
Scrape all the images on a page:
$ image-scraper ananth.co.in/test.html
Scrape at most 2 images:
$ image-scraper -m 2 ananth.co.in/test.html
Scrape GIFs and download them to the folder ./mygifs:
$ image-scraper -s mygifs ananth.co.in/test.html --formats gif