Scraping vaccine images from thanhnien ====================================== Site analysis ------------- We scrape image and caption about vaccine from searched page from thanhnien.vn: .. code-block:: html https://thanhnien.vn/vaccine For the images and their captions in the articles, we will try to get the ``src``---image source and ``alt``---image caption of the ```` tag. Scraping explanation -------------------- The site taken is site for searched for vaccine in thanhnien.vn. .. code-block:: python INDEX = 'https://thannien.vn/vaccine' The scraper then focuses on 3 main functions articles(), scrape_image() and download() and one function name thanhnien to run the function from main.py. articles() ^^^^^^^^^^ .. code-block:: python def articles(links): """Find URLs contains 'vacxin' in the given link.""" for a in links: href = a.get('href') if href is None: continue url = 'http://thanhnien.vn/' + href if url.endswith('.html') and 'vac' in url: yield url From searched articles we focus only on url ended with html and vac in the url. scrape_image() ^^^^^^^^^^^^^^ .. code-block:: python async def scrape_images(url, dest, client, nursery): """Search for img in the articles in order to download the images.""" article = await client.get(url) for img in parse_html5(article.text).iterfind('.//img'): caption, url = img.get('alt'), img.get('data-src') if caption is None or ('vắc') not in caption.lower(): if caption is None or ('vac') not in caption.lower(): continue if url is None: url = img.get('src') if url.endswith('logo.svg'): continue nursery.start_soon(download, caption, url, dest, client) The website used in vietnamese languages so we use both key words as *vac* and *vắc* for either *vaccine* and *vắc xin*. download() ^^^^^^^^^^ .. code-block:: python async def download(caption, url, dest, client): """The image and caption saved if contain information about vaccine""" name, ext = splitext(basename(urlparse(url).path)) directory = dest / name await directory.mkdir(parents=True, exist_ok=True) try: fi = await client.get(url) except ConnectTimeout: return async with await open_file(directory/f'image{ext}', 'wb') as fo: async for chunk in fi.aiter_bytes(): await fo.write(chunk) await (directory/'caption').write_text(caption, encoding='utf-8') print(caption) The download function will download the image from ``src`` and the caption from ``alt``. Each image and its caption is list on in folder and named ``image``, ``caption`` respectively inside a folder named by the name of the website.