Scraping Service
This setup is intended to be temporary and flexible. You can add new scrapers by registering them in the scraper_config.py file.
A scraper can be a simple Python sourcing script that loads data from an S3 bucket, or a more complex one that scrapes websites.
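As a sketch of the simple end of that spectrum, a sourcing script might do no more than download a prepared CSV from S3. The bucket name, key, and function name below are placeholders, not actual SSARE resources:

```python
# Minimal sketch of an S3 sourcing scraper. Bucket, key, and function
# name are illustrative assumptions, not part of SSARE.
import io

import boto3
import pandas as pd


def scrape() -> pd.DataFrame:
    """Download a pre-collected CSV of articles from S3."""
    s3 = boto3.client("s3")
    obj = s3.get_object(Bucket="example-articles-bucket", Key="exports/articles.csv")
    return pd.read_csv(io.BytesIO(obj["Body"].read()))
```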
The scraper service executes a flow of tasks to collect and process content. Here’s how it works:
1. Initialize Scraping Flow
- The service receives scraping flags (e.g., “cnn”, “reuters”); each flag is essentially a job that needs to be executed (sketched below).
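A minimal sketch of that idea; the transport (HTTP endpoint, queue, etc.) and the job schema are assumptions left abstract here:

```python
# Hedged sketch: flags treated as pending jobs. The actual transport and
# job schema in SSARE are not shown in this section.
flags = ["cnn", "reuters"]
jobs = [{"flag": flag, "status": "pending"} for flag in flags]
```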
2. Execute Source Scripts
For each source flag, the service executes the script registered in:
SSARE/scraper_service/scrapers/scraper_config.py
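The exact schema of scraper_config.py is not reproduced here; a plausible minimal shape is a mapping from source flags to scraper scripts:

```python
# Hypothetical sketch of scraper_config.py: a mapping from source flags
# to the scripts that implement them. The real schema may differ.
SCRAPERS = {
    "cnn": "scrapers/cnn.py",
    "reuters": "scrapers/reuters.py",
}
```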
For example, this script returns CNN articles:
SSARE/scraper_service/scrapers/cnn.py
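The actual cnn.py is not reproduced here either; a hedged sketch of a scraper in that spirit, with an assumed URL, selector, output columns, and output path, might look like:

```python
# Illustrative sketch only: the URL, CSS selector, column names, and
# output path are assumptions, not the contents of the real cnn.py.
import pandas as pd
import requests
from bs4 import BeautifulSoup


def scrape_cnn() -> pd.DataFrame:
    """Collect article links from the CNN front page and write them to CSV."""
    resp = requests.get("https://www.cnn.com", timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    rows = [
        {"url": a["href"], "title": a.get_text(strip=True)}
        for a in soup.select("a[href]")
        if a["href"].startswith("https://www.cnn.com/")
    ]
    df = pd.DataFrame(rows)
    df.to_csv("cnn_articles.csv", index=False)  # consumed by the next step
    return df
```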
Time-based (scheduled) scraping is not yet implemented. As it stands, all flags are always set, and every scraper runs on each pass; our Orchestration.py script is responsible for this.
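A sketch of that “run everything, every time” behavior, reusing the hypothetical flag-to-script mapping from above; the real Orchestration.py is not reproduced here:

```python
# Hedged sketch of the current behavior: no scheduling, so every
# configured flag is treated as due on each run.
import subprocess

SCRAPERS = {"cnn": "scrapers/cnn.py", "reuters": "scrapers/reuters.py"}


def run_all_scrapers() -> None:
    for flag, script in SCRAPERS.items():
        subprocess.run(["python", script], check=True)
```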
3. Process Results
For each successful scrape:
- Load the CSV data into a pandas DataFrame
- Convert the records to Content models, as sketched below:
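A hedged sketch of that conversion; the real Content model is defined elsewhere in SSARE, and the fields shown here (url, title) are assumptions:

```python
# Sketch only: field names are assumed; the actual Content model in SSARE
# may define a different schema.
import pandas as pd
from pydantic import BaseModel


class Content(BaseModel):
    url: str
    title: str


df = pd.read_csv("cnn_articles.csv")  # CSV produced by a scraper run
contents = [Content(**record) for record in df.to_dict(orient="records")]
```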