Audhi Aprilliant
Data Scientist. Tech Writer. Statistics, Data Analytics, and Computer Science Enthusiast

Apache Airflow as Job Orchestration for Web Scraping of Covid-19


Overview

Airflow is a platform to programmatically author, schedule, and monitor workflows. We use Airflow to author workflows as directed acyclic graphs (DAGs) of tasks, and the Airflow scheduler executes those tasks on an array of workers while following the specified dependencies. In short, it helps us automate our scripts. Meanwhile, the COVID‑19 pandemic, also known as the coronavirus pandemic, is an ongoing global pandemic of coronavirus disease 2019 (COVID‑19), caused by severe acute respiratory syndrome coronavirus 2 (SARS‑CoV‑2). The outbreak was first identified in Wuhan, China, in December 2019. We need to monitor the data for Indonesia daily, and Kompas news is one of the platforms that updates the data on its dashboard here.
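To make the idea of a DAG concrete, here is a minimal sketch of a daily scraping workflow. The DAG id, task id, schedule, and function body are illustrative placeholders rather than the exact code used in this project, and the import path assumes Airflow 1.x.

```python
# Minimal sketch of a daily DAG (names, schedule, and logic are placeholders).
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # Airflow 1.x import path


def scrape_covid19_data():
    """Placeholder for the scraping logic outlined in the prerequisites below."""
    pass


default_args = {
    "owner": "airflow",
    "start_date": datetime(2020, 3, 1),
    "retries": 1,
}

with DAG(
    dag_id="covid19_scraper",        # hypothetical DAG id
    default_args=default_args,
    schedule_interval="@daily",      # the scheduler triggers the task once a day
    catchup=False,
) as dag:
    scrape_task = PythonOperator(
        task_id="scrape_covid19",
        python_callable=scrape_covid19_data,
    )
```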

Prerequisites

Before going any further, please read about and set up the following tools properly:

  1. Install Apache Airflow (read here)
  2. Install the module dependencies (a short scraping sketch follows this list)
    • requests for web scraping
    • bs4 for web scraping with BeautifulSoup
    • pandas for data manipulation
    • re for regular expressions
    • os for file management
    • datetime for handling dates and times
    • json for reading JSON files
  3. Telegram, whose chat ID will be saved in the config directory later (see step 5)
  4. Email (an example airflow.cfg snippet follows this list)
    • An app password consisting of 16 characters (read here)
    • Set up the airflow.cfg file to work with our email account
  5. Set up files and directories in Airflow (a directory layout sketch follows this list)
    • Save the DAG Python file in the dags directory
    • Save the Telegram chat ID in the config directory
    • Create the data/covid19 directory in Airflow to store summary_covid19.txt and daily_update_covid.csv. Please click here to look up the details of the recommended directory layout.
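As a reference for item 2, this is a rough sketch of how requests and bs4 work together for scraping. The URL and the parsing step are assumptions for illustration, not the actual Kompas dashboard structure.

```python
# Rough scraping sketch; the URL and parsing logic are illustrative assumptions.
import requests
from bs4 import BeautifulSoup

URL = "https://www.kompas.com/covid-19"  # hypothetical dashboard address

response = requests.get(URL, timeout=30)
response.raise_for_status()              # fail loudly if the request did not succeed

soup = BeautifulSoup(response.text, "html.parser")
print(soup.title.get_text(strip=True))   # confirm the page was fetched and parsed
```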
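For item 4, the [smtp] block of airflow.cfg looks roughly like this. The host and port below assume Gmail, and every value is a placeholder to replace with your own account and 16-character app password.

```
[smtp]
smtp_host = smtp.gmail.com
smtp_starttls = True
smtp_ssl = False
smtp_user = your_email@gmail.com
smtp_password = your_16_character_app_password
smtp_port = 587
smtp_mail_from = your_email@gmail.com
```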
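For item 5, the resulting Airflow home directory looks something like the sketch below. The DAG and config file names are hypothetical; the two output files are the ones mentioned above.

```
airflow/
├── dags/
│   └── dag_covid19.py            # hypothetical DAG file name
├── config/
│   └── telegram_chat_id.json     # hypothetical file holding the Telegram chat ID
└── data/
    └── covid19/
        ├── summary_covid19.txt
        └── daily_update_covid.csv
```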

Full article

You can read the full article on Medium
