Scraping vaccine images from tuoitre
====================================
Site analysis
-------------
The site that we are going to work with to get articles about vaccines from tuoitre.vn is::

   https://tuoitre.vn/vaccine.html

Unlike VnExpress, the ``href`` provided in an ``<a>`` tag cannot directly link to the
article; moreover, some ``<a>`` tags have no ``href`` attribute at all. For example:

.. code-block:: html

   <a>Lấy mã mới</a>

For the images and their captions in the articles, we will try to get the ``src``
(image source) and ``alt`` (image caption) attributes of each ``<img>`` tag.
Scraping explanation
--------------------
The index page that will be scraped is defined as:

.. code-block:: python

   INDEX = 'https://tuoitre.vn/vaccine.html'

This page is then fetched and parsed in order to find all the available ``<a>`` tags.
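The fetch-and-parse step can be sketched without the scraper's own helpers (the real code uses an async HTTP client and ``parse_html5``, neither shown here). This hypothetical ``collect_links()`` uses the standard library's ``html.parser`` on an already-downloaded page instead:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the attributes of every <a> tag in an HTML document."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            # attrs is a list of (name, value) pairs; an <a> without
            # attributes yields an empty dict here.
            self.links.append(dict(attrs))

def collect_links(html):
    parser = LinkCollector()
    parser.feed(html)
    return parser.links

# A tiny stand-in for the real index page:
page = '<a href="/vac-xin-moi.htm">news</a> <a>Lấy mã mới</a>'
links = collect_links(page)
```

Note how the second anchor, like the ``Lấy mã mới`` example above, has no ``href`` and therefore ``links[1].get('href')`` is ``None``.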
The scraper then focuses on three main functions: ``articles()``, ``scrape_images()``
and ``download()``.
articles()
^^^^^^^^^^

.. code-block:: python

   def articles(links):
       """Search for URLs containing 'vac' among the given links."""
       for a in links:
           href = a.get('href')
           if href is None: continue
           url = 'http://tuoitre.vn' + href
           if url.endswith('.htm') and 'vac' in url: yield url
From the list of ``<a>`` tags we try to get the ``href`` attribute of each one. Since some
``<a>`` tags have no ``href`` attribute, we skip those whose ``href`` is ``None``. To make
the ``href`` a recognizable URL, we prepend ``http://tuoitre.vn`` to it. Finally, in order
to get only the articles related to vaccines, we keep just the URLs that end
with ``.htm`` and contain ``vac``.
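As a quick check of this filtering logic, ``articles()`` can be exercised with stand-in tag objects that only implement the ``get()`` method it relies on (the ``FakeTag`` class below is hypothetical, not part of the scraper):

```python
def articles(links):
    """Search for URLs containing 'vac' among the given links."""
    for a in links:
        href = a.get('href')
        if href is None: continue
        url = 'http://tuoitre.vn' + href
        if url.endswith('.htm') and 'vac' in url: yield url

class FakeTag:
    """Minimal stand-in exposing only the get() method used by articles()."""
    def __init__(self, href=None):
        self.href = href

    def get(self, name):
        return self.href if name == 'href' else None

tags = [FakeTag('/tiem-vac-xin.htm'),  # kept: ends with .htm, contains 'vac'
        FakeTag(),                     # skipped: no href attribute
        FakeTag('/vaccine.html')]      # skipped: does not end with .htm
urls = list(articles(tags))
# urls -> ['http://tuoitre.vn/tiem-vac-xin.htm']
```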
scrape_images()
^^^^^^^^^^^^^^^

.. code-block:: python

   async def scrape_images(url, dest, client, nursery):
       """Search for img in the articles in order to download the images."""
       article = await client.get(url)
       for img in parse_html5(article.text).iterfind('.//img'):
           if img.get('type') == 'photo':
               nursery.start_soon(download, img, dest, client)
The matching URLs are then fetched and parsed in order to find all the available
``<img>`` tags. We notice that the main images of the articles all have ``type="photo"``,
so we only take the ``<img>`` tags satisfying that condition, without having to check
the captions' contents.
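The ``type="photo"`` filter can be illustrated with the standard library's ``xml.etree.ElementTree`` standing in for ``parse_html5`` (an assumption for the sketch, since the scraper's own parser is not shown; the HTML snippet is likewise invented):

```python
import xml.etree.ElementTree as ET

# Hypothetical article fragment: one main photo and one decorative image.
snippet = """<div>
  <img type="photo" src="/img/vac.jpg" alt="Tiêm vắc xin"/>
  <img src="/img/logo.png" alt="logo"/>
</div>"""

root = ET.fromstring(snippet)
# Same traversal and condition as scrape_images():
photos = [img for img in root.iterfind('.//img')
          if img.get('type') == 'photo']
```

Only the first ``<img>`` survives the filter, so the logo is never downloaded.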
download()
^^^^^^^^^^

.. code-block:: python

   async def download(img, dest, client):
       """Save the images with their captions of the searched articles."""
       caption, url = img.get('alt'), img.get('src')
       name, ext = splitext(basename(urlparse(url).path))
       directory = dest / name
       await directory.mkdir(parents=True, exist_ok=True)
       try:
           fi = await client.get(url)
       except ConnectTimeout:
           return
       async with await open_file(directory/f'image{ext}', 'wb') as fo:
           async for chunk in fi.aiter_bytes(): await fo.write(chunk)
       await (directory/'caption').write_text(caption, encoding='utf-8')
       print(caption)
The last one is the ``download()`` function, which does all the remaining work. It
downloads the image from ``src`` and the caption from ``alt``. Each image and its caption
are then put in the same folder and named ``image`` (keeping the original extension) and
``caption`` respectively.
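The folder name itself comes from the image URL: ``urlparse`` extracts the path, ``basename`` drops the directories, and ``splitext`` separates the extension. A worked example (the URL is hypothetical):

```python
from os.path import basename, splitext
from urllib.parse import urlparse

url = 'https://tuoitre.vn/photos/2021/tiem-vac-xin-1.jpg'  # hypothetical image URL
name, ext = splitext(basename(urlparse(url).path))
# name -> 'tiem-vac-xin-1', ext -> '.jpg'
```

The image is then saved as ``tiem-vac-xin-1/image.jpg`` with its caption in ``tiem-vac-xin-1/caption``.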