Scraping vaccine images from tuoitre¶
Site analysis¶
The site we are going to work with to get the vaccine-related articles from tuoitre.vn is:
https://tuoitre.vn/vaccine.html
Unlike VnExpress, the href provided in an <a> tag does not directly link to an article; moreover, some <a>
tags have no href attribute at all. For example:
<a href="/vac-xin-covid-19-cua-astrazeneca-bi-che-bai-do-gia-mem-20201127105727613.htm"
title="Vắc xin COVID-19 của AstraZeneca bị chê bai do... giá mềm?"
class="img212x132 pos-rlt" data-displayinslide="0">
<a id="refresh-captcha">Lấy mã mới</a>
For the images and their captions in the articles, we will extract the src (image source) and
alt (image caption) attributes of the <img> tags.
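A main article image might look roughly like this (a hypothetical illustration, not an actual tag copied from the site):

<img type="photo" src="https://cdn.tuoitre.vn/…/vaccine.jpg" alt="caption text in Vietnamese">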
Scraping explanation¶
The index page to be scraped is defined as:
INDEX = 'https://tuoitre.vn/vaccine.html'
This page is then fetched and parsed in order to find all the available <a> tags.
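A minimal sketch of this step, assuming httpx as the async HTTP client and html5lib for parsing; parse_html5 below is one plausible definition of the helper used later, and fetch_links is a hypothetical name, not part of the original:

import html5lib
import httpx

def parse_html5(text):
    # Build an ElementTree without XHTML namespaces so that simple
    # paths like './/a' and './/img' work with iterfind().
    return html5lib.parse(text, namespaceHTMLElements=False)

async def fetch_links(client):
    # Download the index page and return an iterator over its <a> tags.
    index = await client.get(INDEX)
    return parse_html5(index.text).iterfind('.//a')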
The scraper then revolves around three main functions: articles(), scrape_images() and download().
articles()¶
def articles(links):
    """Search the given <a> tags for article URLs containing 'vac'."""
    for a in links:
        href = a.get('href')
        if href is None:  # some <a> tags carry no href attribute
            continue
        url = 'https://tuoitre.vn' + href
        # Keep only article pages related to vaccines.
        if url.endswith('.htm') and 'vac' in url:
            yield url
From the <a> tags found, we try to get the href attribute of each one. Since some <a> tags have no
href attribute, we skip those whose href is None. To turn the href into a full URL, we prepend
https://tuoitre.vn to it. Finally, in order to keep only the articles related to vaccines, we yield
only the URLs that end with .htm and contain vac.
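As a quick sanity check, one could print a few of the yielded URLs (islice here and the links iterator from the earlier sketch are assumptions):

from itertools import islice

# Show the first five vaccine article URLs found on the index page.
for url in islice(articles(links), 5):
    print(url)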
scrape_images()¶
async def scrape_images(url, dest, client, nursery):
    """Search the article for <img> tags and schedule their download."""
    article = await client.get(url)
    for img in parse_html5(article.text).iterfind('.//img'):
        if img.get('type') == 'photo':  # main article images only
            nursery.start_soon(download, img, dest, client)
The matching URLs are then fetched and parsed in order to find all the available <img> tags.
We notice that the main images of the articles all have type="photo", so we only keep the
<img> tags that satisfy this condition, without having to check the contents of the captions.
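One plausible way to connect articles() to scrape_images() is to start one task per article inside a trio nursery (scrape_all is a hypothetical wrapper, not part of the original):

import trio

async def scrape_all(links, dest, client):
    # Start one scraping task per article URL; the nursery exits only
    # after every scrape_images and download task it spawned finishes.
    async with trio.open_nursery() as nursery:
        for url in articles(links):
            nursery.start_soon(scrape_images, url, dest, client, nursery)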
download()¶
from os.path import basename, splitext
from urllib.parse import urlparse

from httpx import ConnectTimeout
from trio import open_file

async def download(img, dest, client):
    """Save the image and its caption from the scraped article."""
    caption, url = img.get('alt') or '', img.get('src')
    if url is None:  # skip images without a source
        return
    name, ext = splitext(basename(urlparse(url).path))
    directory = dest / name
    await directory.mkdir(parents=True, exist_ok=True)
    try:
        fi = await client.get(url)
    except ConnectTimeout:
        return
    async with await open_file(directory/f'image{ext}', 'wb') as fo:
        async for chunk in fi.aiter_bytes():
            await fo.write(chunk)
    await (directory/'caption').write_text(caption, encoding='utf-8')
    print(caption)
The last function, download(), does the remaining work: it fetches the image from src and takes
the caption from alt. Each image and its caption are then saved in the same folder, in files
named image (keeping the original extension) and caption respectively.
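Putting the pieces together, the entry point might look like the sketch below (main, the images destination folder, and the reuse of the fetch_links and scrape_all helpers sketched above are assumptions, not part of the original):

import httpx
import trio

async def main():
    dest = trio.Path('images')  # destination folder is an assumption
    async with httpx.AsyncClient() as client:
        links = await fetch_links(client)
        await scrape_all(links, dest, client)

if __name__ == '__main__':
    trio.run(main)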