Clippy

Clippy is a Ruby on Rails web application that processes video and audio files: it extracts the audio, transcribes it with OpenAI’s Whisper API, and generates summaries with entity extraction. It is designed primarily for meeting recordings but can handle any spoken audio content. Processing runs as a staged pipeline in background jobs with real-time progress updates, and an interactive web interface shows transcripts with synchronized media playback.

Technology Stack

Backend:

  • Ruby 3.3.0 with Rails 7.1.3
  • Litestack for SQLite support in production
  • Puma web server with Thruster for SSL/HTTP2

Frontend:

  • Hotwire (Turbo + Stimulus) for reactive UI
  • Bootstrap 5 for styling and components
  • Sass for CSS preprocessing
  • Importmap for JavaScript module management

AI/Processing:

  • OpenAI API (Whisper for transcription, GPT for summarization)
  • FFmpeg for audio extraction and video processing
  • Image processing with libvips

Infrastructure:

  • Docker containerization
  • Active Storage for file management
  • ActionCable for real-time updates
  • Self-hostable with local file storage

Development Timeline

March 2024 - Initial Development:

  • 03/25: Core functionality added (uploads, transcripts, summaries, segments)
  • 03/25-26: Audio segments and language support added
  • 03/26: Added additional summary fields

April 2024 - Feature Expansion:

  • 04/01: Enhanced audio segments with text formatting
  • 04/06: Added processing timestamps
  • 04/10: Implemented clips functionality
  • 04/16: Added processing stage tracking

This project was developed over roughly three weeks (March 25 to April 16, 2024), with rapid iteration and frequent feature additions.

Architecture Patterns

Pipeline Pattern: The application implements a processing pipeline with defined stages (pending → started → extracting_audio → transcribing → collating → summarising → complete) managed through state machines and background jobs.
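
The stage sequence above can be sketched in plain Ruby. This is a minimal illustration, not the application's actual code: the stage names come from the text, while the StageTracker class, its methods, and the progress calculation are hypothetical.

```ruby
# Hypothetical sketch: only the stage names are taken from the text.
class StageTracker
  STAGES = %i[pending started extracting_audio transcribing
              collating summarising complete].freeze

  class InvalidTransition < StandardError; end

  attr_reader :stage

  def initialize(stage = :pending)
    raise ArgumentError, "unknown stage: #{stage}" unless STAGES.include?(stage)

    @stage = stage
  end

  # Move to the next stage in order; raise if the pipeline has finished.
  def advance!
    raise InvalidTransition, "already complete" if complete?

    @stage = STAGES[STAGES.index(@stage) + 1]
  end

  def complete?
    @stage == :complete
  end

  # Fraction of stage transitions completed so far (0.0 to 1.0).
  def progress
    STAGES.index(@stage).fdiv(STAGES.size - 1)
  end
end

tracker = StageTracker.new
tracker.advance! until tracker.complete?
```

Keeping the stages in one ordered list means a job can only move an upload forward, and a progress indicator can be derived from the current position.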

Job Queue Pattern: Uses Rails’ ActiveJob for asynchronous processing, allowing the web interface to remain responsive during long-running transcription and summarization tasks.
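
To illustrate why enqueueing keeps the caller responsive, here is a minimal stand-in for a job backend built only on Ruby's standard library. It is not ActiveJob and not the application's code; TinyJobQueue and its method names are hypothetical, chosen to echo the perform_later idea.

```ruby
# Stand-in for a background job backend, using only the standard library.
class TinyJobQueue
  def initialize
    @queue = Queue.new
    @worker = Thread.new do
      # Queue#pop blocks until a job arrives; nil is the shutdown signal.
      while (job = @queue.pop)
        job.call
      end
    end
  end

  # Returns immediately; the block runs later on the worker thread.
  def perform_later(&job)
    @queue << job
    self
  end

  # Signal shutdown and wait for already-queued jobs to finish.
  def shutdown
    @queue << nil
    @worker.join
  end
end

results = []
jobs = TinyJobQueue.new
jobs.perform_later { results << :extract_audio }
jobs.perform_later { results << :transcribe }
jobs.shutdown
```

With a single worker thread, jobs run in the order they were enqueued, which is the property a staged pipeline relies on when each job enqueues the next.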

Observer Pattern: Uses the broadcasts_refreshes model hook from Turbo (turbo-rails) to push real-time UI updates whenever a processing stage changes.

Repository Pattern: The models (Upload, Transcript, AudioSegment, etc.) are cleanly separated, each with well-defined relationships and responsibilities.

Service Object Pattern: Processing logic is encapsulated in specialized job classes (ExtractAudioJob, TranscribeAudioSegmentJob, etc.) rather than being embedded in models.

Real-time Synchronization: Uses Stimulus controllers for client-side media synchronization with transcript segments, enabling click-to-seek functionality and automatic scrolling.
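
The core of that synchronization is finding the segment that covers the current playback time. In the application this happens client-side in a Stimulus controller; the sketch below shows only the lookup logic, written in Ruby for consistency. The Segment struct and segment_at are hypothetical names, and the code assumes segments are sorted by start time and do not overlap.

```ruby
# Hypothetical segment shape; times are in seconds.
Segment = Struct.new(:start_time, :end_time, :text)

# Return the segment being spoken at `time`, or nil during a gap.
# Assumes `segments` is sorted by start_time with no overlaps.
def segment_at(segments, time)
  # Binary search for the first segment that ends after `time`.
  candidate = segments.bsearch { |s| s.end_time > time }
  candidate if candidate && candidate.start_time <= time
end

segments = [
  Segment.new(0.0, 2.5, "Welcome, everyone."),
  Segment.new(2.5, 6.0, "Let's review the agenda."),
  Segment.new(8.0, 10.0, "First item.")
]
current = segment_at(segments, 3.2)
```

Click-to-seek is the reverse direction: set the player's position to the clicked segment's start time.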

Challenges Faced

File Processing Complexity: The application accepts both video and audio files, which requires FFmpeg integration, and it splits large files into chunks to stay within OpenAI’s 25 MB upload limit.
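
A rough sketch of the chunking arithmetic, under stated assumptions: the extracted audio has a constant bitrate, "25 MB" is treated as 25 MiB, and 5% headroom is reserved for container overhead. The function names and the headroom value are illustrative and are not taken from the application.

```ruby
MAX_UPLOAD_BYTES = 25 * 1024 * 1024 # assumption: 25 MB treated as 25 MiB

# Longest chunk, in whole seconds, that fits under the limit
# at a constant bitrate (kilobits per second).
def max_chunk_seconds(bitrate_kbps, limit_bytes: MAX_UPLOAD_BYTES, headroom: 0.95)
  bytes_per_second = bitrate_kbps * 1000 / 8.0
  (limit_bytes * headroom / bytes_per_second).floor
end

# Split a duration into [start, length] pairs, each at most chunk_seconds long.
def chunk_ranges(duration_seconds, chunk_seconds)
  (0...duration_seconds).step(chunk_seconds).map do |start|
    [start, [chunk_seconds, duration_seconds - start].min]
  end
end

chunk_length = max_chunk_seconds(64)      # 64 kbps audio
ranges = chunk_ranges(7000, chunk_length) # a recording of 7000 seconds
```

Each [start, length] pair maps naturally onto FFmpeg's -ss (start offset) and -t (duration) options when cutting the audio.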

Asynchronous Processing: The multi-stage pipeline needs proper error handling and consistent state management across background jobs.

Real-time UI Updates: Live progress updates and synchronized media playback have to work without page refreshes, which is handled with ActionCable and Stimulus.

Cost Management: Processing costs can be significant (~$0.30/hour of audio), so the application is built to support local processing and other OpenAI-compatible services.
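
As a back-of-the-envelope check, the figure above can be turned into a per-recording estimate. The rate constant simply restates the ~$0.30/hour number from the text; actual provider pricing may differ, and the function name is hypothetical.

```ruby
# Assumption: rate taken from the ~$0.30/hour figure quoted in the text.
ASSUMED_USD_PER_AUDIO_HOUR = 0.30

# Estimated cost in USD for a recording of the given length, rounded to cents.
def estimated_cost_usd(duration_seconds, usd_per_hour: ASSUMED_USD_PER_AUDIO_HOUR)
  (duration_seconds / 3600.0 * usd_per_hour).round(2)
end

weekly_meeting = estimated_cost_usd(90 * 60) # a 90-minute recording
```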

Security Considerations: The application is intentionally unauthenticated, requiring deployment behind a proxy (Nginx/Caddy) for access control in production environments. I deploy this application behind Cloudflare Access.

Configuration Management: API keys and credentials support multiple sources, including environment variables, Rails encrypted credentials, and local files, to aid in self-hosting.
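
One way to express such a lookup chain is a list of sources tried in order, returning the first non-blank value. This is a sketch under assumptions: the precedence order, the OPENAI_API_KEY variable name, the credentials key, and the key file name are all hypothetical, and a plain hash stands in for Rails encrypted credentials.

```ruby
# Return the first non-blank value from an ordered list of sources.
# Each source is a callable, so later ones are only consulted if needed.
def first_configured(*sources)
  sources.each do |source|
    value = source.call
    return value.strip if value && !value.strip.empty?
  end
  nil
end

# Hypothetical order: environment, then credentials, then a local file.
def openai_api_key(env: ENV, credentials: {}, key_file: "openai_api_key.txt")
  first_configured(
    -> { env["OPENAI_API_KEY"] },
    -> { credentials[:openai_api_key] },
    -> { File.read(key_file) if File.exist?(key_file) }
  )
end

key = openai_api_key(env: { "OPENAI_API_KEY" => "sk-example" })
```

Passing the sources in as arguments keeps the lookup easy to test and makes the precedence explicit for anyone self-hosting.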