Scraping vaccine images from tuoitre ==================================== Site analysis ------------- The site that we are going to work with to get the articles about vaccine from tuoitre.vn is:: https://tuoitre.vn/vaccine.html Different form VnExpress, the href provided in ```` can't directly link to the articles, moreover, some ```` have no href attribute. For example: .. code-block:: html Lấy mã mới For the images and their captions in the articles, we will try to get the ``src`` ---image source and ``alt`` ---image caption of the ```` tag. Scraping explanation -------------------- The site that will be worked with is defined as: .. code-block:: python INDEX = 'https://tuoitre.vn/vaccine.html' Then it is fetched and parsed in order to find all the available ```` tag. The scraper then focuses on 3 main functions articles(), scrape_image() and download(). articles() ^^^^^^^^^^ .. code-block:: python def articles(links): """Search for URLs contains 'vacxin' in the given link.""" for a in links: href = a.get('href') if href is None: continue url = 'http://tuoitre.vn' + href if url.endswith('.htm') and 'vac' in url: yield url From the ```` tags from we try to get the href attribute of each ````. Since some ```` have no href attribute, we will skip if the href returns None. To make the href become a recognized url, we add ``http://tuoitre.vn`` before the href. Finally, in order to get the appropriate articles related to vaccine, we only get the url end with ``.htm`` and contains ``vac``. scrape_image() ^^^^^^^^^^^^^^ .. code-block:: python async def scrape_images(url, dest, client, nursery): """Search for img in the articles in order to download the images.""" article = await client.get(url) for img in parse_html5(article.text).iterfind('.//img'): if img.get('type') == 'photo': nursery.start_soon(download, img, dest, client) The appropriate urls are then fetched and parsed in order to find all the ```` tags available. We notice that the main images of the articles all have ``type="photo"`` so we only have get the ```` satisfied the condition without having to check the captions' contents. download() ^^^^^^^^^^ .. code-block:: python async def download(img, dest, client): """Save the images with theirs captions of the searched articles.""" caption, url = img.get('alt'), img.get('src') name, ext = splitext(basename(urlparse(url).path)) directory = dest / name await directory.mkdir(parents=True, exist_ok=True) try: fi = await client.get(url) except ConnectTimeout: return async with await open_file(directory/f'image{ext}', 'wb') as fo: async for chunk in fi.aiter_bytes(): await fo.write(chunk) await (directory/'caption').write_text(caption, encoding='utf-8') print(caption) Last one is the download() function. It will do all the work remaining. It will download the image from ``src`` and the caption from ``alt``. Each image and its caption is then put in the same folder and named "image", "caption" respectively.