completed

Distributed Twitter Analytics Platform

Real-time distributed analytics engine for streaming social data with scalable processing and query support.

gRPC

Redis

Apache Kafka

ClickHouse

Docker

View source

System Evidence

high throughput workload planning
Go, gRPC, Redis implementation surface
85% latency-focused optimization target

Traffic Target: 10,000 req/sec
Latency Gain: 85% lower
Reliability: 99.9% uptime
Daily Data: 2.5TB

Overview

A distributed analytics platform designed to ingest, process, and query Twitter streaming data in real time. The system handles the full data lifecycle — from ingestion through Kafka consumers, real-time processing via Go workers, to analytical queries through ClickHouse.

Problem

Social stream analytics needs high write throughput, low-latency query paths, and recovery behavior that keeps duplicated events from corrupting downstream analysis.

Architecture

The platform follows an event-driven microservices architecture:

Ingestion Layer: Kafka consumers processing Twitter firehose data with at-least-once delivery guarantees
Processing Layer: Go-based worker pool with gRPC inter-service communication
Storage Layer: ClickHouse for analytical queries, Redis for hot data caching
Query Layer: Low-latency query API with automatic query optimization

Engineering Tradeoffs

Used Kafka as the durability boundary between ingestion and processing
Kept hot aggregations in Redis while ClickHouse handled analytical scans
Designed workers around idempotent processing so retries would not inflate metrics

Key Achievements

Designed for sustained real-time query traffic
Reduced end-to-end latency compared to batch processing
Supports large-volume daily data processing
Uses idempotent processing to reduce data loss risk

Validation Focus

Measure consumer lag and worker throughput during stream bursts
Compare cached query paths against ClickHouse analytical queries
Test replay scenarios to confirm duplicate handling remains stable