completed
Distributed Twitter Analytics Platform
Real-time distributed analytics engine for streaming social data with scalable processing and query support.
Go
gRPC
Redis
Apache Kafka
ClickHouse
Docker
System Evidence
- high throughput workload planning
- Go, gRPC, Redis implementation surface
- 85% latency-focused optimization target
- Traffic Target
- 10,000 req/sec
- Latency Gain
- 85% lower
- Reliability
- 99.9% uptime
- Daily Data
- 2.5TB
Overview
A distributed analytics platform designed to ingest, process, and query Twitter streaming data in real time. The system handles the full data lifecycle — from ingestion through Kafka consumers, real-time processing via Go workers, to analytical queries through ClickHouse.
Problem
Social stream analytics needs high write throughput, low-latency query paths, and recovery behavior that keeps duplicated events from corrupting downstream analysis.
Architecture
The platform follows an event-driven microservices architecture:
- Ingestion Layer: Kafka consumers processing Twitter firehose data with at-least-once delivery guarantees
- Processing Layer: Go-based worker pool with gRPC inter-service communication
- Storage Layer: ClickHouse for analytical queries, Redis for hot data caching
- Query Layer: Low-latency query API with automatic query optimization
Engineering Tradeoffs
- Used Kafka as the durability boundary between ingestion and processing
- Kept hot aggregations in Redis while ClickHouse handled analytical scans
- Designed workers around idempotent processing so retries would not inflate metrics
Key Achievements
- Designed for sustained real-time query traffic
- Reduced end-to-end latency compared to batch processing
- Supports large-volume daily data processing
- Uses idempotent processing to reduce data loss risk
Validation Focus
- Measure consumer lag and worker throughput during stream bursts
- Compare cached query paths against ClickHouse analytical queries
- Test replay scenarios to confirm duplicate handling remains stable