Data & Our Engine
SSARE provides a ready-to-use infrastructure for systematically ingesting, processing, and hosting data to meet your analytical needs. Our open-source service offers a robust data backbone for political intelligence and news analysis.
Introduction
SSARE is an open-source service designed to orchestrate the collection, processing, and analysis of news articles, with a focus on political intelligence.
SSARE stands for Semantic Search Article Recommendation Engine.
Key Features:
- Scraping: Ingest data from arbitrary sourcing scripts
- Vector Processing: Convert articles into vector representations
- Entity Recognition: Identify entities such as locations, persons, and organizations
- Geocoding: Convert recognized locations into geographical coordinates
- Storage: Store articles and metadata in SQL and vector databases
- Querying: Provide endpoints for semantic search and recommendations
- LLM Classification: Use language models to organize, label, and rate data
- Local LLM Support: Integrate with Ollama for on-premise LLM processing
- Structured Output: Leverage Instructor to generate structured data from LLMs
- Orchestration: Manage and schedule tasks efficiently with Prefect, a workflow orchestration and observability tool; flows can be monitored in the SSARE dashboard
Quick Start
- Get SSARE up and running in minutes:
Access the dashboard at
Architecture
Our data engine consists of several pipelines, each packaged in its own Docker container. Each pipeline acts like a serverless ETL workflow, engineering new features on top of the base document format.
This modular approach allows for the seamless integration of new features and data sources.
Orchestration & Observation
Prefect
Services
Data Model Update
We have moved from an “Article”-based to a “Content”-based model. This includes images and is a first step towards multi-modality in our infrastructure. Please check out the Data Structure page for more information.
Databases
Flows
Ingestion Process
Collect Articles
Scraper scripts collect articles from various sources. Sources can be adjusted with custom sourcing scripts; just make them return url, headline, paragraphs, and source in a CSV file.
This flexibility allows you to integrate various data sources into the SSARE pipeline.
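As a minimal sketch of that output, assuming pandas and using placeholder values throughout:

```python
# Minimal sketch of the expected scraper output: one row per article with
# the four required columns, written to CSV with pandas. All values are
# placeholders for illustration.
import pandas as pd

articles_df = pd.DataFrame(
    [
        {
            "url": "https://example.com/article-1",
            "headline": "Example headline",
            "paragraphs": "First paragraph. Second paragraph.",
            "source": "example.com",
        }
    ]
)
articles_df.to_csv("articles.csv", index=False)
```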
Process Articles
Articles are processed to add new features. Each of the following steps is handled by a microservice, and all of them work in roughly the same way:
Vector Representation
Convert article text into numerical vectors for semantic analysis.
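As an illustrative sketch (the sentence-transformers library and model named here are assumptions, not necessarily what SSARE ships with):

```python
# Illustrative embedding sketch using sentence-transformers; SSARE's actual
# embedding model and service interface may differ.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice
texts = ["Parliament passed the new budget on Tuesday."]
embeddings = model.encode(texts)  # numpy array, shape (1, 384) for this model
print(embeddings.shape)
```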
Entity Extraction
Identify and extract named entities such as persons, organizations, and locations.
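For example, a minimal NER sketch with spaCy (the library and model choice are assumptions; SSARE's entity service may use something else):

```python
# Illustrative named-entity recognition sketch using spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed model
doc = nlp("Olaf Scholz met EU officials in Brussels on Monday.")
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)  # e.g. [('Olaf Scholz', 'PERSON'), ('EU', 'ORG'), ('Brussels', 'GPE'), ...]
```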
Geocoding
Convert extracted location entities to geographical coordinates.
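A minimal sketch using geopy's Nominatim geocoder (an assumption for illustration; SSARE's internal geocoder may differ):

```python
# Illustrative geocoding sketch: resolve a location string to coordinates.
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="ssare-example")  # Nominatim requires a user agent
location = geolocator.geocode("Brussels, Belgium")
if location is not None:
    print(location.latitude, location.longitude)  # roughly 50.85, 4.35
```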
Classification
LLMs classify each article into the categories you want to use. With Instructor you can choose any dimension by which you wish to analyse your data: str, int, List[str], and most other regular types supported by Pydantic models can be used to define your classification schema. This flexibility lets you tailor the classification to your specific needs and extract structured information from articles efficiently.
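For illustration, here is a hypothetical schema and Instructor call; the model name, prompt, and field names are assumptions, not SSARE's defaults:

```python
# Hypothetical classification sketch with Instructor and Pydantic.
# Requires an OPENAI_API_KEY in the environment; field names and model
# choice are illustrative assumptions.
from typing import List

import instructor
from openai import OpenAI
from pydantic import BaseModel

class ArticleClassification(BaseModel):
    topic: str                    # free-text label
    conflict_intensity: int       # e.g. a 0-10 rating
    mentioned_policies: List[str]

client = instructor.from_openai(OpenAI())

article_text = "Parliament passed the new budget on Tuesday after heated debate."

classification = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model
    response_model=ArticleClassification,
    messages=[{"role": "user", "content": f"Classify this article: {article_text}"}],
)
print(classification.topic, classification.conflict_intensity)
```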
Examples like this demonstrate how you can create custom intelligence systems that extract specific metrics from articles, tailored to your analysis needs.
Store Data
Processed data is stored in PostgreSQL and vector databases.
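As a rough sketch of this pattern, assuming a pgvector-backed table named contents (the table and column names are hypothetical; see the Data Structure page for the real schema):

```python
# Illustrative sketch of storing and querying embeddings in PostgreSQL with
# the pgvector extension via psycopg2. Connection details, table, and
# vector dimension are assumptions.
import psycopg2

conn = psycopg2.connect("dbname=ssare user=postgres host=localhost")
with conn, conn.cursor() as cur:
    # Store an article with its embedding (vector in pgvector's text format).
    cur.execute(
        "INSERT INTO contents (url, headline, embedding) VALUES (%s, %s, %s)",
        ("https://example.com/article-1", "Example headline", "[0.1, 0.2, 0.3]"),
    )
    # Nearest-neighbour search with pgvector's <-> distance operator.
    cur.execute(
        "SELECT url, headline FROM contents ORDER BY embedding <-> %s::vector LIMIT 5",
        ("[0.1, 0.2, 0.3]",),
    )
    print(cur.fetchall())
```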
Access
API endpoints allow for querying and retrieval of processed data. Here’s a detailed breakdown of the API endpoint for retrieving articles:
Get Articles
This endpoint allows you to retrieve articles with various filtering options. It supports both text-based and semantic search, and can filter articles based on the presence of embeddings, geocoding, entities, and classifications.
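A hypothetical request might look like the following; the endpoint path and parameter names are illustrative assumptions and should be checked against the API reference:

```python
# Hypothetical request sketch against a local SSARE deployment.
import requests

response = requests.get(
    "http://localhost:8000/articles",   # assumed deployment URL and path
    params={
        "search_query": "climate negotiations",
        "search_type": "semantic",      # or a text-based search
        "has_geocoding": True,          # filter to geocoded articles
        "limit": 10,
    },
    timeout=30,
)
articles = response.json()
```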
The response will include detailed information about each article, including its content, associated entities, tags, and classification (if available).
Adding Sources
Create a Scraping/Sourcing Script
To add a new source to SSARE, create a scraping script in the scraper_service/scrapers folder. It doesn't necessarily need to be a scraping script; it can also be a function that loads data, e.g. from an S3 storage bucket.
Define Output
Ensure your script outputs a DataFrame with the following columns:
| url | headline | paragraphs | source |
| --- | -------- | ---------- | ------ |
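A hypothetical sourcing script might look like this; the site, CSS selectors, and function name are made up for illustration, and only the four output columns are required by SSARE:

```python
# Hypothetical sourcing script sketch, e.g. scraper_service/scrapers/example_source.py.
import pandas as pd
import requests
from bs4 import BeautifulSoup

def scrape_example_source() -> pd.DataFrame:
    """Collect articles from a made-up news site and return the SSARE columns."""
    url = "https://example-news.com/politics"
    response = requests.get(url, timeout=30)
    soup = BeautifulSoup(response.text, "html.parser")

    rows = []
    for link in soup.select("a.article-link"):  # selector is site-specific
        rows.append(
            {
                "url": link["href"],
                "headline": link.get_text(strip=True),
                "paragraphs": "",  # fetch and parse the article body here
                "source": "example-news.com",
            }
        )
    return pd.DataFrame(rows, columns=["url", "headline", "paragraphs", "source"])
```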
Integrate with SSARE
Add your new scraper to the SSARE pipeline for automatic processing.
Use Cases
Entity Ranking
Retrieve and rank entities from your processed articles:
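For example, reusing the articles list from the endpoint sketch above (the shape of the entities field is an assumption):

```python
# Hypothetical sketch: rank entities by how often they appear across
# retrieved articles.
from collections import Counter

entity_counts = Counter(
    entity["name"]
    for article in articles                 # response from the articles endpoint
    for entity in article.get("entities", [])
)
for name, count in entity_counts.most_common(10):
    print(name, count)
```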
GeoJSON Generation
Create GeoJSON features from the locations in your data:
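A sketch of this, assuming each article carries geocoded locations with latitude and longitude fields (the field names are assumptions; the GeoJSON structure itself is standard):

```python
# Hypothetical sketch: build a GeoJSON FeatureCollection from geocoded locations.
import json

features = [
    {
        "type": "Feature",
        "geometry": {
            "type": "Point",
            "coordinates": [loc["longitude"], loc["latitude"]],  # GeoJSON order is lon, lat
        },
        "properties": {"name": loc["name"], "article_url": article["url"]},
    }
    for article in articles
    for loc in article.get("locations", [])
]
geojson = {"type": "FeatureCollection", "features": features}
print(json.dumps(geojson, indent=2))
```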
Roadmap
Custom Embedding Models
Support for user-defined embedding models
Enhanced Geocoding
Improve accuracy and coverage of location data
Kubernetes Orchestration
Scalable deployment with Kubernetes
Expanded Scraper Support
We look forward to creating “flavors” of information spaces, which will require flexible and modular scraping.
Knowledge Graphs
Implement knowledge graph capabilities for enhanced data relationships
GraphRAG
Integrate Graph Retrieval-Augmented Generation for improved context understanding
Custom Information Spaces
Our initial focus is on international politics and global news. We aim to expand this to individual information spaces for more granular coverage. We’re also working towards multi-region and multi-language support.
Contributing
We welcome contributions from developers, data scientists, and political enthusiasts. To contribute:
- Fork the repository
- Create a new branch for your feature
- Commit your changes
- Open a pull request
For major changes, please open an issue first to discuss what you would like to change.
License
SSARE is distributed under the MIT License. See the LICENSE file in the repository for full details.