Scraping vaccine images from Dantri

Site analysis

The articles and vaccine images are scraped from the following Dantri tag page:

https://dantri.com.vn/vaccine.tag

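The href values of the <a> tags on that page cannot all be used directly as article links: some <a> tags have no href attribute at all, some hrefs are absolute (they start with http), and the rest are relative to the site root. Inside an article, each image is marked up like this: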

<figure class="image align-center" contenteditable="false">
<img title="Bill Gates dự đoán tới hết năm 2022 dịch Covid-19 mới chấm dứt - 1" src="https://icdn.dantri.com.vn/thumb_w/640/2020/08/11/covid-1597127036692.jpg"
     alt="Bill Gates dự đoán tới hết năm 2022 dịch Covid-19 mới chấm dứt - 1" data-width="800" data-height="450" data-original="https://icdn.dantri.com.vn/2020/08/11/covid-1597127036692.jpg" data-photo-id="1034257" />

For the images and captions in the articles, we take two attributes of each <img> tag: src (the image source) and alt (used as the caption).
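To make the src/alt extraction concrete, here is a minimal sketch that pulls both attributes out of a shortened copy of the markup above. It uses the standard library's html.parser as a stand-in for the parser used by the scraper itself; the ImageCollector class and the SNIPPET constant are illustrative names, not part of the original code.

```python
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Collect (alt, src) pairs from every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        if tag == 'img':
            attributes = dict(attrs)
            self.images.append((attributes.get('alt'), attributes.get('src')))


# Shortened copy of the markup shown above.
SNIPPET = '''
<figure class="image align-center" contenteditable="false">
<img src="https://icdn.dantri.com.vn/thumb_w/640/2020/08/11/covid-1597127036692.jpg"
     alt="Bill Gates dự đoán tới hết năm 2022 dịch Covid-19 mới chấm dứt - 1" />
</figure>
'''

collector = ImageCollector()
collector.feed(SNIPPET)
collector.close()
for alt, src in collector.images:
    print(alt, src, sep='\n')
```

An attribute that is missing from a tag comes back as None, which is why the scraper has to check for None before using a caption.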

Scraping Explanation

Define the index page as:

INDEX = 'https://dantri.com.vn/vaccine.tag'

The scraper is built around three main functions: download(), scrape_images() and articles().

articles()

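def articles(links):
    """Yield the URLs of vaccine articles found among the given <a> elements."""
    for a in links:
        href = a.get('href')
        # Some <a> tags have no href attribute at all.
        if href is None:
            continue
        # Absolute links are used as-is; relative ones get the site prefix.
        if href.startswith('http'):
            url = href
        else:
            url = 'https://dantri.com.vn' + href
        # Articles end with .htm; vaccine-related ones contain 'vac'.
        if url.endswith('.htm') and 'vac' in url:
            yield url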

articles() takes the <a> tags of a page and reads the href attribute of each one. Some <a> tags have no href, so we skip those for which get('href') returns None. To turn every href into a full URL, we use it as-is if it already starts with http, and otherwise prepend https://dantri.com.vn to it. Finally, to keep only vaccine-related articles, we yield just the URLs that end with .htm and contain vac.
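The same normalization and filtering can be sketched with urllib.parse.urljoin, which leaves absolute hrefs unchanged and resolves relative ones against a base URL. This is an alternative to the string concatenation above, not the original implementation; the article_urls name, the BASE constant and all hrefs below are hypothetical, and plain dicts stand in for parsed <a> elements because they offer the same get() method.

```python
from urllib.parse import urljoin

BASE = 'https://dantri.com.vn'


def article_urls(links):
    """Yield absolute .htm URLs containing 'vac' from <a>-like objects."""
    for a in links:
        href = a.get('href')
        if href is None:
            continue
        # urljoin keeps absolute hrefs unchanged and resolves relative ones.
        url = urljoin(BASE, href)
        if url.endswith('.htm') and 'vac' in url:
            yield url


# Hypothetical hrefs; plain dicts stand in for parsed <a> elements.
links = [
    {},                                                    # no href
    {'href': '/suc-khoe/vacxin-example.htm'},              # relative
    {'href': 'https://dantri.com.vn/the-gioi/vaccine-example.htm'},
    {'href': '/the-thao/example.htm'},                     # not about vaccines
    {'href': '/vaccine.tag'},                              # not an article
]
found = list(article_urls(links))
print(found)
```

Only the second and third entries survive: the first has no href, the fourth lacks vac, and the tag page does not end with .htm.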

download()

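# Imports are not shown in the original; these are assumed from the calls used.
from os.path import basename, splitext

from httpx import ConnectTimeout
from trio import open_file


async def download(caption, url, dest, client):
    """Save the given image and its caption in a folder named after the image."""
    name, ext = splitext(basename(url))
    directory = dest / name
    await directory.mkdir(parents=True, exist_ok=True)

    try:
        fi = await client.get(url)
    except ConnectTimeout:
        # Give up on this image if the connection times out.
        return
    async with await open_file(directory/f'image{ext}', 'wb') as fo:
        async for chunk in fi.aiter_bytes():
            await fo.write(chunk)
    await (directory/'caption').write_text(caption, encoding='utf-8')
    print(caption)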

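This function downloads the image from its src URL and saves the caption taken from alt. First, name, ext = splitext(basename(url)) splits the file name at the end of the URL into the image name and its extension. The image and its caption are then written to the same folder, which is named after the image: the image as image plus the extension, and the caption as a UTF-8 text file called caption. If the connection times out, the image is skipped.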
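The path handling can be tried on its own, without any network access. The sketch below applies the same splitext(basename(url)) step to the image URL from the site analysis; the target_paths helper and the images destination folder are hypothetical names introduced only for this illustration.

```python
from os.path import basename, splitext
from pathlib import PurePosixPath


def target_paths(url, dest):
    """Return the (image, caption) paths that would be used for the given URL."""
    # 'covid-1597127036692.jpg' -> ('covid-1597127036692', '.jpg')
    name, ext = splitext(basename(url))
    directory = PurePosixPath(dest) / name
    return directory / f'image{ext}', directory / 'caption'


URL = 'https://icdn.dantri.com.vn/thumb_w/640/2020/08/11/covid-1597127036692.jpg'
image_path, caption_path = target_paths(URL, 'images')
print(image_path)    # images/covid-1597127036692/image.jpg
print(caption_path)  # images/covid-1597127036692/caption
```

Because the folder is named after the image file alone, two images with the same file name in different URL directories would end up in the same folder.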

scrape_images()

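# Import assumed from the exception name; the original does not show it.
from httpx import ConnectError


async def scrape_images(url, dest, client, nursery):
    """Download vaccine images from the given Dantri article."""
    try:
        article = await client.get(url)
    except ConnectError:
        # Report the article that could not be fetched, then skip it.
        print(url)
        return
    for img in parse_html5(article.text).iterfind('.//img'):
        caption, src = img.get('alt'), img.get('src')
        # Skip images lacking a caption or a source URL.
        if caption is None or src is None:
            continue
        # Keep only images whose caption mentions vaccines.
        if 'vac' in caption.lower() or 'vắc' in caption.lower():
            nursery.start_soon(download, caption, src, dest, client)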

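First, the article at url is fetched through client; if a connection error occurs, the URL is printed and the article is skipped. Otherwise the page is parsed and every <img> tag is examined: images without an alt attribute are ignored, and a download is started in the nursery for each image whose caption contains vac or vắc, ignoring case.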
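The caption test can be isolated as a small predicate. The sketch below also normalizes the text to Unicode NFC first, which the code above does not do; this is an added assumption, motivated by the fact that ắ can be encoded either as one code point or as a plus two combining marks, and the two forms do not compare equal as plain strings. The is_vaccine_caption name and the first two sample captions are hypothetical.

```python
import unicodedata

# Keywords are normalized too, so the comparison is always NFC against NFC.
KEYWORDS = tuple(unicodedata.normalize('NFC', k) for k in ('vac', 'vắc'))


def is_vaccine_caption(caption):
    """Return True if the caption mentions vaccines, ignoring case."""
    if caption is None:
        return False
    text = unicodedata.normalize('NFC', caption).lower()
    return any(keyword in text for keyword in KEYWORDS)


captions = [
    'Vắc xin Covid-19',                                 # hypothetical caption
    unicodedata.normalize('NFD', 'Vắc xin Covid-19'),   # same text, decomposed
    'Bill Gates dự đoán tới hết năm 2022 dịch Covid-19 mới chấm dứt - 1',
    None,                                               # <img> without alt
]
results = [is_vaccine_caption(c) for c in captions]
print(results)
```

The Bill Gates caption from the site analysis contains neither keyword, so that image would not be downloaded even though it appears in a vaccine-tagged article.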