Scraping vaccine images from thanhnien

Site analysis

We scrape images and their captions about vaccines from the vaccine search page on thanhnien.vn:

https://thanhnien.vn/vaccine

For each image in an article, we read two attributes of the <img> tag: src (the image source) and alt (the image caption).
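As a rough illustration of reading those two attributes, here is a minimal sketch that uses only the standard library's html.parser instead of the HTML5 parser the scraper itself relies on. The class name ImgCollector and the sample markup are hypothetical and are not taken from thanhnien.vn.

```python
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Collect (alt, src) pairs from every <img> tag in a document."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != 'img':
            return
        attributes = dict(attrs)
        # Missing attributes come back as None, as with Element.get().
        self.images.append((attributes.get('alt'), attributes.get('src')))


# Hypothetical markup, not copied from the real site.
SAMPLE = '<p><img src="https://example.com/a.jpg" alt="Tiêm vắc xin"></p>'

collector = ImgCollector()
collector.feed(SAMPLE)
print(collector.images)
```

An image without an alt attribute yields None for the caption, which is why the scraper has to check for None before filtering on the caption text.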

Scraping explanation

The starting point is the search results page for vaccine on thanhnien.vn.

INDEX = 'https://thanhnien.vn/vaccine'

The scraper is built around three main functions, articles(), scrape_images() and download(), plus one function named thanhnien() that main.py calls to run them.

articles()

def articles(links):
    """Yield article URLs that end with '.html' and contain 'vac'."""
    for a in links:
        href = a.get('href')
        if href is None:
            continue
        url = 'http://thanhnien.vn/' + href
        if url.endswith('.html') and 'vac' in url:
            yield url

Among the links found on the search page, we keep only URLs that end with .html and contain vac.
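The articles() function builds each URL by plain string concatenation, which assumes every href is relative to the site root. As a hedged alternative, the sketch below applies the same filter but resolves links with urllib.parse.urljoin, so that leading slashes and absolute links are handled as well. The name article_urls, the BASE constant, and the sample hrefs are made up for illustration.

```python
from urllib.parse import urljoin

BASE = 'https://thanhnien.vn/'  # assumed site root


def article_urls(hrefs, base=BASE):
    """Yield absolute URLs that end with '.html' and contain 'vac'."""
    for href in hrefs:
        if href is None:
            continue
        url = urljoin(base, href)
        if url.endswith('.html') and 'vac' in url:
            yield url


# Hypothetical hrefs, as they might appear on a results page.
SAMPLE_HREFS = [
    None,
    '/vac-xin-moi.html',
    'the-thao.html',
    'https://thanhnien.vn/vaccine-covid.html',
    '/vaccine/',
]
found = list(article_urls(SAMPLE_HREFS))
print(found)
```

Only the second and fourth entries pass: the others are missing, do not mention vac, or do not end with .html.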

scrape_images()

async def scrape_images(url, dest, client, nursery):
    """Find vaccine-related images in an article and schedule their downloads."""
    article = await client.get(url)
    for img in parse_html5(article.text).iterfind('.//img'):
        caption, src = img.get('alt'), img.get('data-src')
        # Keep only images whose caption mentions 'vắc' (vắc xin) or 'vac' (vaccine).
        if caption is None:
            continue
        lowered = caption.lower()
        if 'vắc' not in lowered and 'vac' not in lowered:
            continue
        # Fall back to src when data-src is absent.
        if src is None:
            src = img.get('src')
        if src is None or src.endswith('logo.svg'):
            continue
        nursery.start_soon(download, caption, src, dest, client)

The website is in Vietnamese, so we match both vac and vắc as keywords to cover the two spellings, vaccine and vắc xin.
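One caveat with matching vắc: the same letter can be encoded either as a single precomposed code point or as a base letter followed by combining marks, and a plain substring test only matches when both sides use the same form. The sketch below is one possible way to make the caption check robust by normalizing to NFC first. The helper name mentions_vaccine is hypothetical, and whether thanhnien.vn ever serves decomposed text is an assumption that has not been verified.

```python
from unicodedata import normalize

# Normalize the keywords too, so the comparison does not depend on
# how this source file happens to be encoded.
KEYWORDS = tuple(normalize('NFC', word) for word in ('vắc', 'vac'))


def mentions_vaccine(caption):
    """Return True if the caption contains 'vắc' or 'vac', ignoring case."""
    if caption is None:
        return False
    text = normalize('NFC', caption).lower()
    return any(word in text for word in KEYWORDS)


print(mentions_vaccine('Tiêm vắc xin cho trẻ em'))
print(mentions_vaccine('Vaccine Covid-19'))
print(mentions_vaccine('Bóng đá'))
```

The first two captions match and the third does not. A caption stored in decomposed (NFD) form also matches, because it is recomposed before the comparison.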

download()

async def download(caption, url, dest, client):
    """Save the image and its caption in a directory named after the image file."""
    name, ext = splitext(basename(urlparse(url).path))
    directory = dest / name
    await directory.mkdir(parents=True, exist_ok=True)
    try:
        fi = await client.get(url)
    except ConnectTimeout:
        # Skip images that cannot be fetched in time.
        return
    async with await open_file(directory/f'image{ext}', 'wb') as fo:
        async for chunk in fi.aiter_bytes():
            await fo.write(chunk)
    await (directory/'caption').write_text(caption, encoding='utf-8')
    print(caption)

The download function fetches the image from its source URL and takes the caption from the alt text. Each image and its caption are stored together in their own folder under dest, named after the image file: the image is saved as image (with its original extension) and the caption as caption.
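To make the resulting layout concrete, the sketch below reproduces only the path computation from download(), without any network or file access. The helper name target_paths, the example URL, and the destination folder name thanhnien are assumptions made for illustration. It uses posixpath and PurePosixPath so the result is the same on every platform.

```python
from pathlib import PurePosixPath
from posixpath import basename, splitext
from urllib.parse import urlparse


def target_paths(url, dest):
    """Return the (image, caption) paths that one image URL maps to."""
    # The query string is dropped because only the URL path is used.
    name, ext = splitext(basename(urlparse(url).path))
    directory = PurePosixPath(dest) / name
    return directory / f'image{ext}', directory / 'caption'


# Hypothetical image URL and destination folder.
image, caption = target_paths(
    'https://example.com/photos/abc123.jpg?w=600', 'thanhnien')
print(image)    # thanhnien/abc123/image.jpg
print(caption)  # thanhnien/abc123/caption
```

Because the folder name comes from the image file name alone, two images that share a file name would end up in the same folder, and the later download would overwrite the earlier one.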