Scraping vaccine images from VnExpress

Site analysis

In order to scrape images from VnExpress, we first need to analyze the site. The news site is kind enough to provide a portal solely for vaccine-related news:

https://vnexpress.net/suc-khoe/vaccine

Taking a peek into the page’s source, it is easy to notice that articles are linked via an HTML tag of the following format.

<a href="https://vnexpress.net/{normalized title}.html"
   class="thumb thumb-5x3"
   title="{title}">

Looking into the article’s source, we can see the content image tags like so:

<img itemprop="contentUrl"
     intrinsicsize="{intrinsic size}"
     alt="{caption}"
     class="lazy"
     src="{encoded source}"
     data-src="{image URL}">

Now that we have all the needed information, let’s cook up a scraper!

Scraper construction

Since the task is I/O intensive and we are using Python, it is natural to employ an asynchronous input/output framework. We pick Trio for its ease of use, and HTTPX, which supports Trio, as the HTTP client. The client and Trio nursery are prepared as follows:

from httpx import AsyncClient
from trio import open_nursery

async with AsyncClient() as client, open_nursery() as nursery:
    ...

The vaccine portal page

INDEX = 'https://vnexpress.net/suc-khoe/vaccine'

is then fetched as simply as

index = await client.get(INDEX)

Next, we need to parse the page; we use html5lib for this. For convenience, we define a wrapper with namespaceHTMLElements disabled by default:

from functools import partial
from html5lib import parse

parse_html5 = partial(parse, namespaceHTMLElements=False)
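Here partial simply pre-binds the keyword argument, so every call to parse_html5 behaves as if namespaceHTMLElements=False had been passed explicitly. A quick stdlib-only illustration, using a stand-in parse rather than the real html5lib one:

```python
from functools import partial

def parse(doc, namespaceHTMLElements=True):
    """Stand-in for html5lib.parse, returning the flag it was given."""
    return namespaceHTMLElements

parse_html5 = partial(parse, namespaceHTMLElements=False)
print(parse('<p>hi</p>'), parse_html5('<p>hi</p>'))
# → True False
```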

All the a tags in the document can then be found using

parse_html5(index.text).iterfind('.//a')
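With html5lib’s default etree treebuilder, parse_html5 returns a standard xml.etree.ElementTree element, so the .//a path follows the usual ElementPath semantics: it matches a elements at any depth, not just direct children. A minimal stdlib sketch of the same lookup over a toy document:

```python
from xml.etree.ElementTree import fromstring

# A toy document standing in for the parsed portal page.
doc = fromstring(
    '<html><body>'
    '<article><a href="https://vnexpress.net/a.html">A</a></article>'
    '<a href="https://vnexpress.net/b.html">B</a>'
    '</body></html>'
)
# './/a' finds <a> tags at any nesting depth, in document order.
print([a.get('href') for a in doc.iterfind('.//a')])
# → ['https://vnexpress.net/a.html', 'https://vnexpress.net/b.html']
```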

Now we need to extract only the URLs pointing to vaccine articles. As discussed earlier, these end with .html and likely contain the word vaccine:

from urllib.parse import urldefrag

def articles(links):
    """Return URLs to vaccine articles from the given links."""
    for a in links:
        url, fragment = urldefrag(a.get('href'))
        if url.endswith('.html') and 'vaccine' in url: yield url
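Since parsed elements and plain dicts both expose a .get method, the generator can be exercised without any network access. Repeating its definition so the snippet runs on its own (the links below are made-up examples, not real VnExpress URLs):

```python
from urllib.parse import urldefrag

def articles(links):
    """Return URLs to vaccine articles from the given links."""
    for a in links:
        url, fragment = urldefrag(a.get('href'))
        if url.endswith('.html') and 'vaccine' in url: yield url

# Dicts stand in for parsed <a> elements: both provide .get('href').
links = [
    {'href': 'https://vnexpress.net/tiem-vaccine-cho-tre.html#box_comment'},
    {'href': 'https://vnexpress.net/suc-khoe/vaccine'},     # the portal itself
    {'href': 'https://vnexpress.net/thoi-su/mua-lu.html'},  # not about vaccine
]
print(list(articles(links)))
# → ['https://vnexpress.net/tiem-vaccine-cho-tre.html']
```

Note how urldefrag also strips the #box_comment fragment before the checks are applied.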

We then use the nursery to fetch each of these articles in a concurrent task

nursery.start_soon(scrape_images, url, dest, client, nursery)

and look for the content images

async def scrape_images(url, dest, client, nursery):
    """Download vaccine images from the given VnExpress article."""
    article = await client.get(url)
    for img in parse_html5(article.text).iterfind('.//img'):
        caption, url = img.get('alt'), img.get('data-src')
        if caption is None or 'vaccine' not in caption.lower(): continue
        # VnExpress gives different HTML depending on the client.
        if url is None: url = img.get('src')
        if url.endswith('logo.svg'): continue
        nursery.start_soon(download, caption, url, dest, client)
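The tag filtering above is easy to check in isolation. Below, pick_image is a hypothetical pure helper (not part of the scraper itself) mirroring the same rules, again with dicts standing in for parsed img tags:

```python
def pick_image(img):
    """Return (caption, URL) for a vaccine content image, or None.

    Hypothetical helper mirroring the rules in scrape_images.
    """
    caption, url = img.get('alt'), img.get('data-src')
    if caption is None or 'vaccine' not in caption.lower():
        return None
    if url is None:  # some responses carry the URL in src instead
        url = img.get('src')
    if url.endswith('logo.svg'):
        return None
    return caption, url

print(pick_image({'alt': 'Tiêm vaccine', 'data-src': 'https://example.com/a.jpg'}))
# → ('Tiêm vaccine', 'https://example.com/a.jpg')
print(pick_image({'alt': 'Vaccine logo', 'src': 'https://example.com/logo.svg'}))
# → None
```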

The async function download takes care of the rest of the work, namely streaming each image and writing it, together with its caption, under the given destination (a trio.Path):

from os.path import basename, splitext
from urllib.parse import urlparse
from trio import open_file

async def download(caption, url, dest, client):
    """Save the given image and its caption under the destination."""
    name, ext = splitext(basename(urlparse(url).path))
    directory = dest / name
    await directory.mkdir(parents=True, exist_ok=True)

    # Stream the image body to disk chunk by chunk.
    async with client.stream('GET', url) as fi:
        async with await open_file(directory/f'image{ext}', 'wb') as fo:
            async for chunk in fi.aiter_bytes(): await fo.write(chunk)
    await (directory/'caption').write_text(caption)
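As a quick check of the naming scheme (the URL below is a made-up example, not a real VnExpress CDN path), note that urlparse keeps only the path component, so a trailing query string never leaks into the file name:

```python
from os.path import basename, splitext
from urllib.parse import urlparse

url = 'https://example.com/2021/08/tiem-vaccine-1.jpg?w=680'
# urlparse(url).path is '/2021/08/tiem-vaccine-1.jpg'; the query is dropped.
name, ext = splitext(basename(urlparse(url).path))
print(name, ext)
# → tiem-vaccine-1 .jpg
```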