Scraping vaccine images from VnExpress
======================================

Site analysis
-------------

In order to scrape images from VnExpress, we first need to analyze the
site.  The news site is kind enough to provide a portal solely for
vaccine-related news::

   https://vnexpress.net/suc-khoe/vaccine

Taking a peek into the page's source, it is easy to notice that articles
are pointed to via HTML anchor tags of roughly the following format:

.. code-block:: html

   <a href="{url}">{title}</a>

Looking into an article's source, we can see that each content image
carries its caption in the ``alt`` attribute and its address in ``src``
or ``data-src``, roughly like so:

.. code-block:: html

   <img src="{url}" data-src="{url}" alt="{caption}">

Now that we have all the needed information, let's cook up a scraper!

Scraper construction
--------------------

Since the task is I/O intensive and we are using Python, it is natural
to employ an asynchronous input/output framework.  We pick Trio_ for its
ease of use, and thus HTTPX_, which supports Trio, as the HTTP client.
The client and Trio nursery are prepared as follows:

.. code-block:: python

   from httpx import AsyncClient
   from trio import open_nursery

   async with AsyncClient() as client, open_nursery() as nursery:
       ...

The vaccine portal page

.. code-block:: python

   INDEX = 'https://vnexpress.net/suc-khoe/vaccine'

is then fetched as simply as

.. code-block:: python

   index = await client.get(INDEX)

Next, we need to parse the page; we use html5lib_ for this.
For convenience, we define a wrapper with ``namespaceHTMLElements``
disabled by default:

.. code-block:: python

   from functools import partial

   from html5lib import parse

   parse_html5 = partial(parse, namespaceHTMLElements=False)

All the ``a`` tags in the document can then be found using

.. code-block:: python

   parse_html5(index.text).iterfind('.//a')

Now we need to extract only the URLs pointing to articles about
vaccines.  As discussed earlier, these end with ``.html`` and typically
contain ``vaccine``:

.. code-block:: python

   from urllib.parse import urldefrag

   def articles(links):
       """Return URLs to vaccine articles from the given links."""
       for a in links:
           url, fragment = urldefrag(a.get('href'))
           if url.endswith('.html') and 'vaccine' in url:
               yield url

We then use ``nursery`` to fetch each of these articles in a concurrent task

.. code-block:: python

   nursery.start_soon(scrape_images, url, dest, client, nursery)

and look for the content images

.. code-block:: python

   async def scrape_images(url, dest, client, nursery):
       """Download vaccine images from the given VnExpress article."""
       article = await client.get(url)
       for img in parse_html5(article.text).iterfind('.//img'):
           caption, url = img.get('alt'), img.get('data-src')
           if caption is None or 'vaccine' not in caption.lower():
               continue
           # VnExpress gives different HTML depending on the client.
           if url is None:
               url = img.get('src')
           # Skip decorative images such as the site's logo.
           if url is None or url.endswith('logo.svg'):
               continue
           nursery.start_soon(download, caption, url, dest, client)

The ``async`` function ``download`` takes care of the rest of the work,
namely fetching each image and writing it, together with its caption,
to the specified location:

.. code-block:: python

   from os.path import basename, splitext
   from urllib.parse import urlparse

   from trio import open_file

   async def download(caption, url, dest, client):
       """Save the given image and its caption under dest."""
       name, ext = splitext(basename(urlparse(url).path))
       directory = dest / name
       await directory.mkdir(parents=True, exist_ok=True)
       # Stream the image to disk chunk by chunk.
       async with client.stream('GET', url) as response:
           async with await open_file(directory/f'image{ext}', 'wb') as fo:
               async for chunk in response.aiter_bytes():
                   await fo.write(chunk)
       await (directory/'caption').write_text(caption)
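For reference, below is a minimal sketch of one way the pieces above
might be wired together into a runnable script.  The ``main`` coroutine,
the command-line handling, and the choice of ``trio.Path`` for ``dest``
are illustrative assumptions rather than part of the snippets above,
while ``INDEX``, ``parse_html5``, ``articles`` and ``scrape_images``
refer to the definitions from the previous sections:

.. code-block:: python

   from sys import argv

   from httpx import AsyncClient
   from trio import Path, open_nursery, run

   async def main(dest):
       """Scrape vaccine images into the directory dest (a trio.Path)."""
       async with AsyncClient() as client, open_nursery() as nursery:
           index = await client.get(INDEX)
           links = parse_html5(index.text).iterfind('.//a')
           # Spawn one task per vaccine article; the nursery waits for
           # every scrape_images and download task to finish before
           # the HTTP client is closed.
           for url in articles(links):
               nursery.start_soon(scrape_images, url, dest, client, nursery)

   if __name__ == '__main__':
       run(main, Path(argv[1]))  # e.g. python scraper.py images

Under this sketch, each image would end up in its own subdirectory of
``dest``, named after the image file, next to a plain-text ``caption``
file.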
.. _Trio: https://trio.readthedocs.io
.. _HTTPX: https://www.python-httpx.org
.. _html5lib: https://html5lib.readthedocs.io