This setup is intended to be temporary and flexible. You can insert new scrapers by adding them to the scraper_config.py file. They can be simple Python sourcing scripts that load data from an S3 bucket, or more complex ones that scrape websites.
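For instance, adding a hypothetical "reuters" source would just mean registering another entry next to the existing "cnn" one in scraper_config.py (the reuters script path here is a placeholder):

# Hypothetical addition to scraper_config.py: the "reuters" entry and its
# path are placeholders; the "cnn" entry is the one shown later in this section.
config = {
    "scrapers": {
        "cnn": {
            "location": "scrapers/cnn.py",
            "last_run": "2015-01-01 00:00:00"
        },
        "reuters": {
            "location": "scrapers/reuters.py",
            "last_run": "2015-01-01 00:00:00"
        }
    }
}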

The scraper service executes a flow of tasks to collect and process content. Here’s how it works:

Initialize Scraping Flow

  1. The service receives scraping flags (e.g., “cnn”, “reuters”); each flag is essentially a job that needs to be executed, as sketched below.
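A minimal sketch of that idea, assuming the config dict can be imported from scraper_config.py (the build_jobs helper is hypothetical):

# Hypothetical sketch: flags are job names that resolve to entries in
# scraper_config.py's config dict (shown in the next snippet).
from scraper_config import config

def build_jobs(flags):
    """Map each known flag to the scraper script it should run."""
    jobs = []
    for flag in flags:
        scraper = config["scrapers"].get(flag)
        if scraper is None:
            continue  # no scraper configured for this flag
        jobs.append({"flag": flag, "script": scraper["location"]})
    return jobs

jobs = build_jobs(["cnn", "reuters"])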

Execute Source Scripts

For each source flag:

SSARE/scraper_service/scrapers/scraper_config.py
# Load source configuration
config = {
    "scrapers": {
        "cnn": {
            "location": "scrapers/cnn.py",
            "last_run": "2015-01-01 00:00:00"
        }
    }
}
# Execute corresponding scraper script
# Script outputs to CSV: url | headline | paragraphs
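The execution mechanism itself is not shown above; a minimal sketch, assuming each source script runs as its own Python process and writes its CSV itself (the run_scraper helper is hypothetical):

# Hypothetical sketch: run the configured script for a flag and rely on it to
# write its url | headline | paragraphs CSV into the shared data directory.
import subprocess

def run_scraper(flag, config):
    script = config["scrapers"][flag]["location"]  # e.g. "scrapers/cnn.py"
    subprocess.run(["python", script], check=True)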

For example, this script returns CNN articles:

SSARE/scraper_service/scrapers/cnn.py
async def scrape_cnn_articles(session):
    ....
    # Fetch all article tasks concurrently and collect them into a DataFrame
    articles = await asyncio.gather(*tasks)
    return pd.DataFrame(articles, columns=['url', 'headline', 'paragraphs'])

async def main():
    async with aiohttp.ClientSession() as session:
        df = await scrape_cnn_articles(session)
        # Write the results to the shared dataframes directory as CSV
        os.makedirs('/app/scrapers/data/dataframes', exist_ok=True)
        df.to_csv('/app/scrapers/data/dataframes/cnn_articles.csv', index=False)
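A simpler "sourcing" script of the kind mentioned at the top of this section, one that pulls a prepared file from an S3 bucket instead of scraping a site, only needs to follow the same CSV contract. A rough sketch, with the bucket, key, and output file name as hypothetical placeholders:

import os

import boto3
import pandas as pd

def main():
    s3 = boto3.client("s3")
    # Placeholder bucket and key; a real script would point at its own dump.
    obj = s3.get_object(Bucket="news-dumps", Key="some_source.csv")
    df = pd.read_csv(obj["Body"])[['url', 'headline', 'paragraphs']]

    os.makedirs('/app/scrapers/data/dataframes', exist_ok=True)
    df.to_csv('/app/scrapers/data/dataframes/some_source_articles.csv', index=False)

if __name__ == "__main__":
    main()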

Time-based scheduling of regular scraping is not yet implemented. As it stands, all flags are always set and every scraper runs on every pass; our Orchestration.py script is responsible for triggering them.
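A minimal sketch of that behavior, assuming Orchestration.py iterates over the config dict and reuses something like the run_scraper helper sketched earlier (both helpers are hypothetical):

# Hypothetical sketch: no schedule yet, so every configured flag is treated
# as due and run on every pass.
from scraper_config import config

def due_flags():
    # The "last_run" timestamps would drive time-based scheduling later;
    # for now every scraper is always due.
    return list(config["scrapers"].keys())

def run_all():
    for flag in due_flags():
        run_scraper(flag, config)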

Process Results

For each successful scrape:

  1. Load the CSV data into a pandas DataFrame
  2. Convert records to Content models (a combined sketch follows the snippet below):
Content(
    url=content_data['url'],
    title=content_data['headline'],
    text_content=content_data['paragraphs'],
    source=flag,
    content_type="article",
    insertion_date=datetime.utcnow()
)
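Putting those two steps together, a minimal sketch, assuming the Content model can be imported from the service's models module (the import path and helper name are hypothetical):

import pandas as pd
from datetime import datetime
from core.models import Content  # hypothetical import path for the model above

def process_scrape_results(flag, csv_path):
    """Load a scraper's CSV and convert each row to a Content model."""
    df = pd.read_csv(csv_path)
    contents = []
    for content_data in df.to_dict(orient="records"):
        contents.append(Content(
            url=content_data['url'],
            title=content_data['headline'],
            text_content=content_data['paragraphs'],
            source=flag,
            content_type="article",
            insertion_date=datetime.utcnow()
        ))
    return contents

contents = process_scrape_results("cnn", "/app/scrapers/data/dataframes/cnn_articles.csv")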