Automation System of Covid-19 Data
COVID-19 is a disease caused by a new strain of coronavirus. 'CO' stands for corona, 'VI' for virus, and 'D' for disease. Formerly, this disease was referred to as '2019 novel coronavirus' or '2019-nCoV'.
Summary
In Indonesia, daily COVID-19 data is limited and has to be collected day by day before any analysis can be done. This program updates the data automatically from a trusted source: Kompas, one of the largest news portals in Indonesia.
Prerequisites
- Python 3.x, of course
- Good internet connection is recommended
- Several Python modules (os, re and datetime ship with Python; the others are installed from requirements.txt):
  - pandas for data manipulation
  - bs4 (Beautiful Soup) for pulling data out of HTML and XML files
  - os for interacting with the operating system
  - re for regular expression matching, similar to the operations found in Perl
  - datetime for manipulating dates and times
  - requests for sending HTTP/1.1 requests
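As a rough illustration of how the standard-library pieces fit together, the sketch below pulls figures out of a headline-like string with re and stamps them with datetime. The sample text, the pattern, and the function name extract_counts are assumptions made for illustration only; they are not taken from the actual script or from real Kompas pages. In the real program the text would come from a page fetched with requests and parsed with bs4.

```python
import re
from datetime import date

# Hypothetical sample text with made-up numbers; real Kompas wording will differ.
SAMPLE_TEXT = "Confirmed: 1.234 | Recovered: 56 | Deaths: 7"


def extract_counts(text):
    """Return {label: int} for every 'Label: number' pair found in text."""
    counts = {}
    for label, number in re.findall(r"(\w+):\s*([\d.]+)", text):
        # Indonesian sources write thousands with dots (1.234), so drop them.
        counts[label.lower()] = int(number.replace(".", ""))
    return counts


# Build one row keyed by today's date, ready to be appended to a table.
row = {"date": date.today().isoformat(), **extract_counts(SAMPLE_TEXT)}
print(row)
```

A dictionary like row can be passed straight to pandas (for example as one element of a list of records) once pandas is installed.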
Steps
To run the program, follow these steps:
- Clone this repo
- Open your terminal
- Download the module dependencies by typing
pip install -r requirements.txt
- Type
python3 'Web Scraping Covid-19 Kompas News.py'
- Finally, the data will be in your directory
Output
There are two possible outcomes:
- The program captures the up-to-date data, and the output shows the newly added record.
- Unluckily, the program runs too early, before the day's data has been published.
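One way a script could tell these two cases apart is to compare the date of the newest scraped record with today's date and only save the record when they match. The sketch below assumes this approach; the function name save_if_current, the record layout, and the CSV format are hypothetical and may differ from what the actual script does.

```python
import csv
import os
from datetime import date


def save_if_current(record, csv_path, today=None):
    """Append record to csv_path only if it is dated today.

    Returns True when the row was written and False when the record is
    not dated today (the "too early" case).
    """
    today = today or date.today()
    if record["date"] != today.isoformat():
        return False

    # Write the header only when the file is being created.
    write_header = not os.path.exists(csv_path)
    with open(csv_path, "a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(record))
        if write_header:
            writer.writeheader()
        writer.writerow(record)
    return True
```

Returning a boolean instead of raising an error keeps a scheduled run quiet on days when the source has not been updated yet.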
Automation
We can also automate the program with the crontab scheduler in Linux. Follow the steps below to configure crontab:
- Type
crontab -e
in your terminal to add a new cron job
- Specify the schedule. First, I suggest you look here for the details of the schedule format and some examples
- Open a new terminal and find the location of Python 3 by typing
whereis python3
It is usually installed at /usr/bin/python3
- Go back to the first terminal and type
45 16 * * * cd /your path of web scraping script/ && /usr/bin/python3 'Web Scraping Covid-19 Kompas News.py' >> test.out
If the command above is a little confusing, here is what each part means:
  - 45 16 * * * is the schedule. Crontab uses the machine's local time instead of UTC, so the program runs at 16:45 every day of every month
  - /your path of web scraping script/ must be the directory where you keep the Python script. In my case, it is ‘home/covid19 data’. If the path contains spaces, wrap it in quotes
  - /usr/bin/python3 is the path of the Python 3 interpreter
  - >> test.out means that the output is appended to the file test.out, which is created if it does not exist and serves as a log
- Finally, save the crontab configuration
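To make the five schedule fields concrete, here is a minimal sketch that checks whether a given moment matches an expression such as 45 16 * * *. The function names are hypothetical, and only * and plain integers are handled; real cron also accepts ranges, lists and step values, and treats the day-of-month and day-of-week fields specially when both are restricted.

```python
from datetime import datetime


def cron_field_matches(field, value):
    """Match one cron field that is either '*' or a single integer."""
    return field == "*" or int(field) == value


def cron_matches(expression, moment):
    """Return True if moment falls on the schedule given by expression.

    Field order: minute, hour, day of month, month, day of week.
    """
    minute, hour, day, month, weekday = expression.split()
    # cron counts Sunday as 0; isoweekday() returns 1 (Monday) to 7 (Sunday).
    cron_weekday = moment.isoweekday() % 7
    return (
        cron_field_matches(minute, moment.minute)
        and cron_field_matches(hour, moment.hour)
        and cron_field_matches(day, moment.day)
        and cron_field_matches(month, moment.month)
        and cron_field_matches(weekday, cron_weekday)
    )


# 16:45 local time on any day matches; one minute later does not.
print(cron_matches("45 16 * * *", datetime(2020, 4, 1, 16, 45)))
print(cron_matches("45 16 * * *", datetime(2020, 4, 1, 16, 46)))
```

Reading the expression left to right this way is a quick check before saving a new entry with crontab -e.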
Head over to my GitHub repository!