Scraping vaccine images from VnExpress
Site analysis
In order to scrape images from VnExpress, we first need to analyze the site. The news site is kind enough to provide a portal solely for vaccine-related news:
https://vnexpress.net/suc-khoe/vaccine
Taking a peek into the page's source, it is easy to notice that articles are pointed to via an HTML tag of the following format:
<a href="https://vnexpress.net/{normalized title}.html"
class="thumb thumb-5x3"
title="{title}">
Looking into the article’s source, we can see the content image tags like so:
<img itemprop="contentUrl"
intrinsicsize="{intrinsic size}"
alt="{caption}"
class="lazy"
src="{encoded source}"
data-src="{image URL}">
Now that we have all the needed information, let's cook up a scraper!
Scraper construction
Since the task is I/O intensive and we are using Python, it is natural to employ an asynchronous input/output framework. We pick Trio for its ease of use, and thus use HTTPX as the HTTP client. The client and Trio nursery are prepared as follows:
from httpx import AsyncClient
from trio import open_nursery
async with AsyncClient() as client, open_nursery() as nursery:
    ...
The vaccine portal page
INDEX = 'https://vnexpress.net/suc-khoe/vaccine'
is then fetched as simply as
index = await client.get(INDEX)
Next, we need to parse the page, and we use html5lib for this.
For convenience, we define a wrapper with namespaceHTMLElements
disabled by default:
from functools import partial
from html5lib import parse
parse_html5 = partial(parse, namespaceHTMLElements=False)
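Why disable namespaceHTMLElements? By default, html5lib places every element in the XHTML namespace, so a plain `.//a` path would match nothing. A quick check on a tiny made-up document:

```python
from functools import partial

from html5lib import parse

parse_html5 = partial(parse, namespaceHTMLElements=False)

# Without namespaceHTMLElements=False, the tag would be
# '{http://www.w3.org/1999/xhtml}a' and './/a' would find nothing.
tree = parse_html5(
    '<html><body><a href="https://vnexpress.net/x.html">x</a></body></html>')
links = [a.get('href') for a in tree.iterfind('.//a')]
```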
All the a tags in the document can then be found using
parse_html5(index.text).iterfind('.//a')
Now we need to extract only the URLs of vaccine-related articles.
As discussed earlier, these end with .html and typically contain vaccine:
from urllib.parse import urldefrag
def articles(links):
    """Return URLs to vaccine articles from the given links."""
    for a in links:
        url, fragment = urldefrag(a.get('href'))
        if url.endswith('.html') and 'vaccine' in url:
            yield url
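As a quick offline sanity check of this filter, we can feed it hand-built elements. The URLs below are made up, and the `link` helper is ours for illustration, not part of the scraper:

```python
from urllib.parse import urldefrag
from xml.etree.ElementTree import Element

def articles(links):
    """Return URLs to vaccine articles from the given links."""
    for a in links:
        url, fragment = urldefrag(a.get('href'))
        if url.endswith('.html') and 'vaccine' in url:
            yield url

def link(href):
    """Build a bare <a> element carrying only an href."""
    a = Element('a')
    a.set('href', href)
    return a

urls = list(articles(map(link, [
    'https://vnexpress.net/vaccine-abc.html#box_comment',  # kept, fragment dropped
    'https://vnexpress.net/suc-khoe/vaccine',              # portal page, no .html
    'https://vnexpress.net/other-news.html',               # no 'vaccine' in the URL
])))
```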
We then use the nursery to fetch each of these articles in a concurrent task

for url in articles(parse_html5(index.text).iterfind('.//a')):
    nursery.start_soon(scrape_images, url, dest, client, nursery)
and look for the content images
async def scrape_images(url, dest, client, nursery):
    """Download vaccine images from the given VnExpress article."""
    article = await client.get(url)
    for img in parse_html5(article.text).iterfind('.//img'):
        caption, url = img.get('alt'), img.get('data-src')
        if caption is None or 'vaccine' not in caption.lower():
            continue
        # VnExpress gives different HTML depending on the client.
        if url is None:
            url = img.get('src')
        if url.endswith('logo.svg'):
            continue
        nursery.start_soon(download, caption, url, dest, client)
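The filtering in scrape_images can be exercised offline. The snippet and URLs below are made up, and `content_images` is our stand-alone restatement of the loop body for testing (no client or nursery needed):

```python
from xml.etree.ElementTree import fromstring

SNIPPET = '''<html><body>
<img alt="Tiem vaccine Covid-19" class="lazy"
     src="data:image/gif;base64,R0lGOD"
     data-src="https://example.com/photo.jpg"/>
<img alt="Vaccine news logo" src="https://example.com/logo.svg"/>
</body></html>'''

def content_images(root):
    """Yield (caption, URL) pairs for vaccine content images."""
    for img in root.iterfind('.//img'):
        caption, url = img.get('alt'), img.get('data-src')
        if caption is None or 'vaccine' not in caption.lower():
            continue
        # Fall back to src when data-src is absent.
        if url is None:
            url = img.get('src')
        # Skip the site logo.
        if url.endswith('logo.svg'):
            continue
        yield caption, url

found = list(content_images(fromstring(SNIPPET)))
```

Only the first image survives: the second has a vaccine-related caption in this contrived example but resolves to the logo URL, so it is dropped.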
The async function download takes care of the rest of the work,
namely fetching the image and writing it and its caption to the specified location:
from os.path import basename, splitext
from urllib.parse import urlparse

from trio import open_file

async def download(caption, url, dest, client):
    """Save the given image and its caption under dest."""
    name, ext = splitext(basename(urlparse(url).path))
    directory = dest / name
    await directory.mkdir(parents=True, exist_ok=True)
    # Stream the response body to avoid holding the image in memory.
    async with client.stream('GET', url) as fi:
        async with await open_file(directory/f'image{ext}', 'wb') as fo:
            async for chunk in fi.aiter_bytes():
                await fo.write(chunk)
    await (directory/'caption').write_text(caption)
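Note how the destination name is derived: urlparse drops any query string before basename and splitext split off the stem and extension, so each image lands in a directory named after the file's stem. With a hypothetical CDN-style URL:

```python
from os.path import basename, splitext
from urllib.parse import urlparse

# Hypothetical image URL carrying a resize query string.
url = 'https://example.com/2021/07/tiem-vaccine.jpg?w=680'
name, ext = splitext(basename(urlparse(url).path))
# name == 'tiem-vaccine', ext == '.jpg'
```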