Scraping vaccine images from Dantri
===================================
Site analysis
-------------
The site I used to work with to get the articles, download and scrape images about vaccine is::
https://dantri.com.vn/vaccine.tag
The href provided in ```` can't directly link to the articles, some ```` have none of href attribute and some ```` have href start with http. For example:
.. code-block:: html
For the images and captions in the articles, we will get the ``src``---image source and ``alt`` tag.
Scraping Explaination
---------------------
Define the site as :
.. code-block:: python
INDEX = 'https://dantri.com.vn/vaccine.tag'
The scraper will be focused on three main functions ``download()``, ``scrape_images()`` and ``articles()``.
articles()
^^^^^^^^^^
.. code-block:: python
def articles(links):
"""Search for URLs contains 'vacxin' in the given link."""
for a in links:
href = a.get('href')
if href is None: continue
if href.startswith('http'):
url = href
else:
url = 'https://dantri.com.vn' + href
if url.endswith('.htm') and 'vac' in url:
yield url
```` tags try to get the href attribute of each ```` tag. Since some ```` don't have an href attribute, we will ignore if href returns None. To make href a recognized url, we add ``http://dantri.com.vn`` in advance href. If href start with ``http``, we add url as href else we add url in advance href .Finally, to get vaccine-relevant articles, we just get the end of the url with ``.htm`` and contains ``vac``.
download()
^^^^^^^^^^
.. code-block:: python
async def download(caption, url, dest, client):
"""Save the given image with caption if it's about vaccine."""
name, ext = splitext(basename(url))
directory = dest / name
await directory.mkdir(parents=True, exist_ok=True)
try:
fi = await client.get(url)
except ConnectTimeout:
return
async with await open_file(directory/f'image{ext}', 'wb') as fo:
async for chunk in fi.aiter_bytes(): await fo.write(chunk)
await (directory/'caption').write_text(caption, encoding='utf-8')
print(caption)
This code will download the image from ``src`` and the caption from ``alt``. First I use ``name, ext = splitext(basename(url))`` to split url to find image name and extension. Then each image and its caption is then put in the same folder.
scrape_images()
^^^^^^^^^^^^^^^
.. code-block:: python
async def scrape_images(url, dest, client, nursery):
"""Download vaccine images from the given Dantri article."""
try:
article = await client.get(url)
except ConnectError:
print(url)
return
for img in parse_html5(article.text).iterfind('.//img'):
caption, url = img.get('alt'), img.get('src')
if caption is None: continue
if 'vac' in caption.lower() or 'vắc' in caption.lower():
nursery.start_soon(download, caption, url, dest, client)
First, I try to get url of article from client, except Connection is error then i show the url. The appropriate urls are then fetched and parsed in order to find all the ``
`` tags available as *vac* and *vắc*.