To Do #155


Pipecat Pipeline Integration & Gemini Service

Added by Harikrishnan Murugan 3 months ago. Updated 2 months ago.

Status: Closed
Priority: High
Target version: -
Start date: 10/09/2025
Due date: 10/17/2025
% Done: 100%
Estimated time:
Prioritization: P0

Description

Build a Pipecat pipeline with GeminiMultimodalLiveLLMService for real-time audio communication. Implement the SmallWebRTC transport for peer-to-peer audio streaming, and create a weather assistant demo to verify that the complete setup works correctly. This establishes the technical foundation that can later be adapted to the AI Communication Coach by changing only the system prompt and tools configuration.

IMPORTANT NOTE: Use SmallWebRTCTransport specifically for this project. Do NOT use DailyTransport, FastAPIWebSocketTransport, or any other transport implementations.

Model to Use: gemini-2.5-flash-native-audio-preview-09-2025

Technical Approach

  • Configure GeminiMultimodalLiveLLMService with model: gemini-2.5-flash-native-audio-preview-09-2025
  • Implement SmallWebRTCTransport (NOT DailyTransport or FastAPIWebSocketTransport)
  • Build Pipecat pipeline (audio input → processing → Gemini → audio output)
  • Set up session management (start/stop/reconnect)
  • Add error handling and connection recovery
  • Create weather assistant demo with function calling
  • Implement weather API tool for real-time tool calls
  • Create minimal React frontend using Pipecat React SDK
  • Implement weather UI display component for tool results
  • Test complete pipeline with voice commands
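The service and transport setup above can be sketched roughly as follows. This is a configuration sketch, not a drop-in implementation: Pipecat import paths and parameter names vary between releases, the GEMINI_API_KEY environment variable name is an assumption, and the get_weather handler is defined elsewhere.

```python
import os

# Sketch only -- Pipecat import paths change between releases.
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.services.gemini_multimodal_live.gemini import GeminiMultimodalLiveLLMService
from pipecat.transports.base_transport import TransportParams
from pipecat.transports.network.small_webrtc import SmallWebRTCTransport


def build_transport(webrtc_connection):
    # SmallWebRTCTransport only -- per the task note, no DailyTransport
    # or FastAPIWebSocketTransport.
    return SmallWebRTCTransport(
        webrtc_connection=webrtc_connection,
        params=TransportParams(
            audio_in_enabled=True,
            audio_out_enabled=True,
            vad_analyzer=SileroVADAnalyzer(),  # Silero voice activity detection
        ),
    )


def build_llm(get_weather_handler, tools_schema):
    # Native audio in/out: no separate STT/TTS services in the pipeline.
    llm = GeminiMultimodalLiveLLMService(
        api_key=os.environ["GEMINI_API_KEY"],  # env var name is an assumption
        model="gemini-2.5-flash-native-audio-preview-09-2025",
        system_instruction="You are Aria, a friendly weather assistant.",
        tools=tools_schema,  # function declaration for get_weather
    )
    llm.register_function("get_weather", get_weather_handler)
    return llm
```

Adapting this to the Communication Coach later means swapping only `system_instruction` and the tools configuration, as the task description notes.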

Acceptance Criteria

  • Pipecat pipeline runs successfully
  • GeminiMultimodalLiveLLMService configured with gemini-2.5-flash-native-audio-preview-09-2025 model
  • SmallWebRTCTransport implemented for audio streaming (not other transports)
  • Audio input/output streaming works via WebRTC
  • Error handling covers disconnections
  • Session lifecycle managed properly
  • Weather assistant responds to voice queries
  • Tool calling works (get_weather function)
  • Frontend displays weather UI when tool is called
  • Complete setup can be adapted by changing system prompt and tools config
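A minimal sketch of what the get_weather tool might return to the LLM: a small helper that reduces an OpenWeatherMap-style current-weather payload to the fields the assistant needs for a spoken answer. The helper name and the exact returned fields are illustrative assumptions, not the repo's code.

```python
def summarize_weather(payload: dict) -> dict:
    """Reduce an OpenWeatherMap-style response to speech-friendly fields.

    Illustrative sketch -- field choices are an assumption, not the repo's code.
    """
    return {
        "city": payload["name"],
        "temperature_c": round(payload["main"]["temp"]),
        "conditions": payload["weather"][0]["description"],
    }


# Payload shape follows OpenWeatherMap's current-weather response.
sample = {
    "name": "London",
    "main": {"temp": 15.2},
    "weather": [{"description": "partly cloudy"}],
}
print(summarize_weather(sample))
```

The handler registered for get_weather would fetch the live payload from the API, pass it through a reducer like this, and hand the dict back to Gemini to phrase naturally.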

Resources/References

Update #1

Updated by Harikrishnan Murugan 3 months ago

  • Description updated (diff)
Update #2

Updated by Ishita Singh Faujdar 2 months ago

The system now provides:

Voice Interaction:

  • Real-time voice conversation with AI assistant
  • Automatic greeting when you connect
  • Natural language understanding and response
  • Weather information for any city worldwide

Technical Features:

  • Gemini 2.5 Flash with native audio I/O (no separate STT/TTS)
  • WebRTC peer-to-peer audio streaming (low latency)
  • Professional voice activity detection (Silero VAD)
  • Function calling (AI can fetch real weather data from APIs)
  • Multi-user session support
  • Production-grade logging

User Experience:

  • Real-time audio level visualization
  • Clear conversation state indicators
  • Automatic speech detection (knows when you're done speaking)
  • Silence detection with helpful prompts
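The silence-detection-with-prompts behavior can be approximated with a small asyncio watchdog. This is an illustrative sketch under stated assumptions -- the class, method, and callback names here are invented, not taken from the shipped code:

```python
import asyncio


class SilenceWatchdog:
    """Fires on_silence if no speech arrives within `timeout` seconds.

    Illustrative sketch only -- names are invented, not from the repo.
    """

    def __init__(self, timeout: float, on_silence):
        self.timeout = timeout
        self.on_silence = on_silence
        self._timer = None

    def speech_detected(self):
        """Call on every VAD speech event: restarts the countdown."""
        if self._timer is not None:
            self._timer.cancel()
        self._timer = asyncio.create_task(self._countdown())

    async def _countdown(self):
        try:
            await asyncio.sleep(self.timeout)
            await self.on_silence()  # e.g. bot says "Are you still there?"
        except asyncio.CancelledError:
            pass  # user spoke again before the timeout
```

Each Silero VAD speech event resets the timer; if the user stays quiet past the timeout, the assistant can issue a helpful prompt instead of sitting in silence.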

Documentation

Complete documentation is available in docs/:

  • Task completion report
  • Installation guide
  • Architecture documentation
  • All fixes documented with code examples
  • Guide for switching to the Communication Coach

Note: The Zip file is attached below. Switch to the weather-app branch.
To run

  1. Backend:

python -m venv venv
venv\Scripts\activate  # Windows
pip install -r requirements.txt
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000

  2. Configure:

Create a .env file with your Gemini API key and OpenWeatherMap API key (do this before starting uvicorn).

  3. Frontend:

cd frontend
npm install
npm run dev

Open http://localhost:3000
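If you want to see what the .env step amounts to, here is a minimal loader sketch (python-dotenv does this more robustly). The variable names GEMINI_API_KEY and OPENWEATHER_API_KEY are assumptions -- use whatever names app.main actually reads:

```python
import os


def load_env(text: str) -> dict:
    """Parse KEY=VALUE lines from a .env file, skipping blanks and # comments.

    Minimal sketch; the real app likely uses python-dotenv. The key names
    below (GEMINI_API_KEY, OPENWEATHER_API_KEY) are assumptions.
    """
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env


def apply_env(env: dict) -> None:
    """Export parsed values so the backend can read them via os.environ."""
    os.environ.update(env)
```

Example: `load_env("GEMINI_API_KEY=abc123\n")` yields `{"GEMINI_API_KEY": "abc123"}`, which `apply_env` then exports for the server process.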

Expected Experience

  • Click "Start Voice Session" button
  • Allow microphone access when prompted
  • Hear AI greet you: "Hello! I'm Aria, your weather assistant. I can help you check the weather in any city around the world..."
  • Speak: "What's the weather in London?" or "How's the weather in Delhi right now?"
  • See your microphone level at the bottom (green bar = good audio level)
  • Wait ~2-3 seconds after speaking (AI detects you're done)
  • Hear AI respond with current weather: "Currently in London, it's 15 degrees Celsius with partly cloudy skies..."
  • Continue conversation naturally - ask about other cities!

Visual Feedback:

  • Screen shows "🔊 AI is responding..." when AI speaks
  • "✅ Listening - Speak now" when waiting for you
  • "🎤 Recording your voice..." when you speak
  • Audio level bar turns green when you're loud enough

Pipeline Flow

User Speech → Microphone → WebRTC → SmallWebRTC Transport
    ↓
Voice Activity Detection (Silero VAD)
    ↓
Context Aggregator (tracks user messages)
    ↓
Gemini Multimodal Live LLM
  - Understands speech
  - Calls functions if needed (get_weather)
  - Generates natural response
  - Converts to speech
    ↓
Transport Output → WebRTC → Speaker
    ↓
Context Aggregator (stores AI responses)
    ↓
User Hears AI Response 🔊
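The flow above maps onto a Pipecat Pipeline roughly like this. Treat it as a sketch: import paths vary across Pipecat versions, and `transport` and `llm` are assumed to come from the service/transport setup described in this ticket.

```python
# Sketch only -- Pipecat import paths change between releases.
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext

# `transport` (SmallWebRTC) and `llm` (Gemini Multimodal Live) are assumed
# to be built elsewhere, as outlined in the Technical Approach.
context = OpenAILLMContext(
    [{"role": "user", "content": "Greet the user as Aria, the weather assistant."}]
)
context_aggregator = llm.create_context_aggregator(context)

# Processor order mirrors the diagram above; VAD runs inside the transport.
pipeline = Pipeline([
    transport.input(),               # mic audio in via SmallWebRTC
    context_aggregator.user(),       # tracks user messages
    llm,                             # Gemini: speech in, tool calls, speech out
    transport.output(),              # audio out via WebRTC to the speaker
    context_aggregator.assistant(),  # stores AI responses
])

task = PipelineTask(pipeline)
# await PipelineRunner().run(task)   # inside the session's async entry point
```

Because Gemini handles speech natively, no separate STT or TTS processors appear between the transport and the LLM.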
Update #3

Updated by Harikrishnan Murugan 2 months ago

  • Status changed from Pending to Closed
  • % Done changed from 0 to 100