Introduction

SSARE is an open-source service designed to orchestrate the collection, processing, and analysis of news articles, with a focus on political intelligence.

SSARE stands for Semantic Search Article Recommendation Engine.

Key Features:

Scraping

Ingest data from arbitrary sourcing scripts

Vector Processing

Convert articles into vector representations

Entity Recognition

Identify entities like locations, persons, and organizations

Geocoding

Convert recognized locations to geographical coordinates

Storage

Store articles and metadata in SQL and vector databases

Querying

Provide endpoints for semantic search and recommendations

LLM Classification

Use natural language models to organize, label, and rate data

Local LLM Support

Integrate with Ollama for on-premise LLM processing

Structured Output

Leverage Instructor for generating structured data from LLMs

Orchestration

Manage and schedule tasks using Prefect, a workflow orchestration and observability tool. Flow runs can be monitored from the SSARE dashboard.

Quick Start

  1. Get SSARE up and running in minutes:
```bash
git clone https://github.com/open-politics/SSARE.git
cd SSARE
mv .env.example .env
docker-compose up --build
```

  2. Access the dashboard at `http://localhost:8089/`

Architecture

Our data engine consists of several pipelines packaged as individual Docker services. Each acts like a serverless function or ETL step, adding new features on top of the base document format.

This modular approach allows for the seamless integration of new features and data sources.

Architecture overview: Prefect handles orchestration and observation across the services, databases, and flows.

Ingestion Process

Collect Articles

Scraper scripts collect articles from various sources. You can adjust the sources with custom sourcing scripts; each script just needs to return `url`, `headline`, `paragraphs`, and `source` in a CSV file.

This flexibility allows you to integrate various data sources into the SSARE pipeline.
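A minimal sourcing script might look like the sketch below. The function name and sample values are illustrative; only the four output columns are what SSARE expects:

```python
import pandas as pd


def scrape_articles() -> pd.DataFrame:
    """Illustrative sourcing script: return one row per article with the
    four columns the pipeline expects (url, headline, paragraphs, source)."""
    rows = [
        {
            "url": "https://example.com/article-1",
            "headline": "Example headline",
            "paragraphs": "First paragraph. Second paragraph.",
            "source": "example.com",
        },
    ]
    return pd.DataFrame(rows, columns=["url", "headline", "paragraphs", "source"])
```

Because the contract is just "return these columns", the same shape works whether the data comes from a live scrape, an RSS feed, or a bulk file load.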

Process Articles

Articles are processed to add new features. Each of the following steps is a microservice, and all of them work in roughly the same way:

Vector Representation

Convert article text into numerical vectors for semantic analysis.
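As a self-contained sketch of the idea, the toy embedder below hashes tokens into a fixed-size bag-of-words vector. A real deployment uses a trained embedding model, but the interface (text in, normalized vector out, compared by cosine similarity) is the same:

```python
import hashlib
import math


def embed(text: str, dim: int = 64) -> list:
    """Toy embedding: hash each token into a fixed-size bag-of-words
    vector, then L2-normalize it. A stand-in for a trained model."""
    vec = [0.0] * dim
    for token in text.lower().split():
        idx = int(hashlib.md5(token.encode()).hexdigest(), 16) % dim
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a: list, b: list) -> float:
    """Cosine similarity of two normalized vectors is just a dot product."""
    return sum(x * y for x, y in zip(a, b))
```

Texts that share words score higher than unrelated ones, which is the behavior semantic search builds on.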

Entity Extraction

Identify and extract named entities such as persons, organizations, and locations.
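To illustrate the output shape of this step, here is a toy stand-in that treats runs of capitalized words as candidate entities. The production service uses a trained NER model, but it produces the same kind of result, a list of entity strings per article:

```python
import re


def extract_entities(text: str) -> list:
    """Toy stand-in for the NER service: runs of capitalized words
    become candidate entities. A trained model replaces this in practice."""
    pattern = r"\b(?:[A-Z][a-z]+)(?:\s+[A-Z][a-z]+)*\b"
    return re.findall(pattern, text)
```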

Geocoding

Convert extracted location entities to geographical coordinates.
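Conceptually the geocoding step is a mapping from location names to coordinates. The sketch below uses a hardcoded lookup table as an assumption-laden stand-in; the real service would call out to an actual geocoder:

```python
from typing import Optional, Tuple

# Toy lookup table standing in for a real geocoder.
KNOWN_LOCATIONS = {
    "Berlin": (52.5200, 13.4050),
    "Paris": (48.8566, 2.3522),
}


def geocode(location: str) -> Optional[Tuple[float, float]]:
    """Return (latitude, longitude) for a recognized location, else None."""
    return KNOWN_LOCATIONS.get(location)
```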

Classification

LLMs are used to classify each article into the categories you want. With Instructor you can choose any dimension you wish to analyse your data by: `str`, `int`, `List[str]`, and most other regular types supported by Pydantic models can be used to define your classification schema. This flexibility allows you to tailor the classification to your specific needs and extract structured information from the articles efficiently.
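A classification schema for Instructor is an ordinary Pydantic model. The field names and dimensions below are illustrative, not a schema shipped with SSARE:

```python
from typing import List

from pydantic import BaseModel, Field


class ArticleClassification(BaseModel):
    """Illustrative classification schema; choose whatever dimensions you need."""

    category: str = Field(description="e.g. 'diplomacy', 'elections', 'conflict'")
    relevance: int = Field(description="Political relevance from 1 (low) to 10 (high)")
    topics: List[str] = Field(description="Free-form topic labels")
```

Instructor passes a model like this as the response model for an LLM call, so the output comes back as validated, typed data instead of free text.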

You can use local LLMs with Ollama through SSARE; just set LOCAL_LLM=True in the .env.

This lets you build custom intelligence systems that extract specific metrics from articles, tailored to your analysis needs.

Store Data

Processed data is stored in PostgreSQL and vector databases.

Access

API endpoints allow for querying and retrieval of processed data. Here’s a detailed breakdown of the API endpoint for retrieving articles:

Get Articles

This endpoint allows you to retrieve articles with various filtering options. It supports both text-based and semantic search, and can filter articles based on the presence of embeddings, geocoding, entities, and classifications.

The response will include detailed information about each article, including its content, associated entities, tags, and classification (if available).
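As a sketch of how a client might assemble such a query: the endpoint path and parameter names below are illustrative assumptions, not SSARE's actual API; check the running service's API docs for the real schema.

```python
from urllib.parse import urlencode


def build_articles_query(base_url: str, search: str = None,
                         has_geocoding: bool = False, limit: int = 20) -> str:
    """Assemble a query URL for an article-retrieval endpoint.
    Path and parameter names here are illustrative only."""
    params = {"limit": limit}
    if search:
        params["search"] = search
    if has_geocoding:
        params["has_geocoding"] = "true"
    return f"{base_url}/articles?{urlencode(params)}"
```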

Adding Sources

1. Create a Scraping/Sourcing Script

To add a new source to SSARE, create a scraping script in the scraper_service/scrapers folder. It doesn't strictly need to be a scraping script; it can also be a function that loads data, e.g. from an S3 storage bucket.

2. Define the Output

Ensure your script outputs a DataFrame with the following columns:

| url | headline | paragraphs | source |
| --- | --- | --- | --- |

3. Integrate with SSARE

Add your new scraper to the SSARE pipeline for automatic processing.

Use Cases

Entity Ranking

Retrieve and rank entities from your processed articles:
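A minimal sketch of the idea, assuming each retrieved article carries an `entities` list as described in the ingestion steps above:

```python
from collections import Counter


def rank_entities(articles: list) -> list:
    """Rank entities by how many articles mention them.
    Each article is a dict with an 'entities' list."""
    counts = Counter()
    for article in articles:
        # Use a set so one article counts each entity at most once.
        counts.update(set(article.get("entities", [])))
    return counts.most_common()
```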

GeoJSON Generation

Create GeoJSON features from the locations in your data:
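For example, geocoded locations can be turned into a standard GeoJSON FeatureCollection like this (the input field names `name`/`lat`/`lon` are illustrative):

```python
def to_geojson(locations: list) -> dict:
    """Build a GeoJSON FeatureCollection from geocoded locations.
    Each input item: {"name": ..., "lat": ..., "lon": ...}.
    Note GeoJSON coordinate order is [longitude, latitude]."""
    return {
        "type": "FeatureCollection",
        "features": [
            {
                "type": "Feature",
                "geometry": {"type": "Point",
                             "coordinates": [loc["lon"], loc["lat"]]},
                "properties": {"name": loc["name"]},
            }
            for loc in locations
        ],
    }
```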

Roadmap

Custom Embedding Models

Support for user-defined embedding models

Enhanced Geocoding

Improve accuracy and coverage of location data

Kubernetes Orchestration

Scalable deployment with Kubernetes

Expanded Scraper Support

We look forward to creating “flavors” of information spaces, which will require flexible and modular scraping.

Knowledge Graphs

Implement knowledge graph capabilities for enhanced data relationships

GraphRAG

Integrate Graph Retrieval-Augmented Generation for improved context understanding

Custom Information Spaces

Our initial focus is on international politics and global news. We aim to expand this to individual information spaces for more granular coverage. We’re also working towards multi-region and multi-language support.

Contributing

We welcome contributions from developers, data scientists, and political enthusiasts. To contribute:

  1. Fork the repository
  2. Create a new branch for your feature
  3. Commit your changes
  4. Open a pull request

For major changes, please open an issue first to discuss what you would like to change.

License

SSARE is distributed under the MIT License. See the LICENSE file in the repository for full details.