To Do #156

Speech Analysis POC with Gemini

Added by Harikrishnan Murugan 3 months ago. Updated 2 months ago.

Status: Closed
Priority: High
Assignee:
Target version: -
Start date: 10/07/2025
Due date: 10/17/2025
% Done: 100%
Estimated time:
Prioritization: P0

Description

Task Description

Conduct a proof of concept to test whether Gemini 2.5 Flash (or Flash Lite) can analyze audio input and produce a detailed speech assessment comparable to SpeechSuper's unscripted speech evaluation. Try the Fluently app (https://getfluently.app/) first to see how speech analysis works in practice and what output quality and structure are expected.

Technical Approach

  • Use Gemini's multimodal capability via the google-genai Python package to process audio input directly (audio length: 5 seconds to 10 minutes)
  • Create a test suite with 5-10 diverse audio samples (different Indian English accents and proficiency levels, varying lengths)
  • Implement prompt engineering to analyze speech for:
    • Fluency and Coherence: Speech length, speaking rate, pause fillers, long pauses
    • Pronunciation: Identify ONLY mispronounced words with accuracy scoring (these words will be converted to audio for correct pronunciation playback)
    • Grammar: Error detection with correction suggestions
    • Vocabulary: Word choice assessment with better vocabulary suggestions
  • Design the JSON output structure based on SpeechSuper's unscripted speech assessment format (reference: https://www.speechsuper.com/)
  • Generate scores out of 10 with one decimal place (e.g., 4.5, 6.5, 8.0, 9.5) and a detailed breakdown
  • Test both Gemini 2.5 Flash and Flash Lite models
  • Document response time and output quality
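The approach above could be sketched roughly as follows. This is a minimal illustration, not the implementation used in the POC: the prompt wording, the helper names (analyze_with_gemini, timed_run), and the model ID strings are assumptions, and the google-genai calls should be checked against the current SDK documentation before use.

```python
import json
import time
from typing import Callable

# Assumed model IDs for the two variants under test; verify against the model list.
MODELS = ["gemini-2.5-flash", "gemini-2.5-flash-lite"]

# Hypothetical prompt; the real prompt would be refined during prompt engineering.
PROMPT = (
    "Analyze the attached speech recording. Return JSON with scores out of 10 "
    "(one decimal place) for fluency, pronunciation, grammar and vocabulary, "
    "plus speech length, speaking rate, pause fillers, long pauses, "
    "ONLY the mispronounced words with accuracy scores, grammar errors with "
    "corrections, and better vocabulary suggestions."
)


def analyze_with_gemini(audio_path: str, model: str) -> str:
    """Send one audio file to Gemini and return the raw JSON text."""
    # Imported lazily so timed_run can be exercised without the SDK installed.
    from google import genai
    from google.genai import types

    client = genai.Client()  # assumes the API key is set in the environment
    audio = client.files.upload(file=audio_path)
    response = client.models.generate_content(
        model=model,
        contents=[PROMPT, audio],
        config=types.GenerateContentConfig(response_mime_type="application/json"),
    )
    return response.text


def timed_run(analyze: Callable[[str, str], str], audio_path: str, model: str) -> dict:
    """Run one analysis and record the wall-clock response time."""
    start = time.perf_counter()
    raw = analyze(audio_path, model)
    elapsed = time.perf_counter() - start
    return {
        "model": model,
        "audio": audio_path,
        "seconds": round(elapsed, 3),
        "result": json.loads(raw),  # raises if the model did not return valid JSON
    }
```

Calling timed_run(analyze_with_gemini, path, model) for each model in MODELS and each sample in the test suite would give one timing record per combination, which covers the response-time documentation step.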

Acceptance Criteria

  • Audio input successfully processed by Gemini using the google-genai package (5 seconds to 10 minutes range)
  • JSON output includes:
    • Overall scores for: Fluency, Pronunciation, Grammar, Vocabulary (out of 10 with decimals)
    • Detailed metrics (speech length, speaking rate, pause fillers, etc.)
    • Mispronounced words only with accuracy scores (for audio playback generation)
    • Grammar errors with correction suggestions
    • Vocabulary assessment with better word suggestions
  • Test suite covers 5-10 diverse audio samples with varying lengths
  • Both Gemini 2.5 Flash and Flash Lite tested
  • Response time documented for different audio lengths
  • Output quality assessment documented
  • JSON structure clearly defined and matches SpeechSuper format
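To make the JSON criteria above checkable, a small validator could be run over each model response. The sketch below is hypothetical: the field names and sample values are illustrative placeholders, not the actual SpeechSuper format, and would need to be aligned with the structure finally agreed for this task.

```python
import json

SCORE_KEYS = ("fluency", "pronunciation", "grammar", "vocabulary")
LIST_KEYS = ("mispronounced_words", "grammar_errors", "vocabulary_suggestions")

# Hypothetical example report; every field name and value is illustrative only.
EXAMPLE_REPORT = json.loads("""
{
  "scores": {"fluency": 6.5, "pronunciation": 7.0, "grammar": 5.5, "vocabulary": 6.0},
  "metrics": {"speech_length_seconds": 42.0, "speaking_rate_wpm": 118,
              "pause_fillers": 4, "long_pauses": 1},
  "mispronounced_words": [{"word": "vegetable", "accuracy": 4.5}],
  "grammar_errors": [{"error": "He go to school", "correction": "He goes to school"}],
  "vocabulary_suggestions": [{"word": "good", "suggestion": "beneficial"}]
}
""")


def validate_report(report: dict) -> list:
    """Return a list of problems; an empty list means the report passes."""
    problems = []

    # Each overall score must be a number in the 0-10 range.
    scores = report.get("scores", {})
    for key in SCORE_KEYS:
        value = scores.get(key)
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not 0 <= value <= 10:
            problems.append(f"scores.{key} must be a number from 0 to 10")

    # Detailed metrics are expected as a JSON object.
    if not isinstance(report.get("metrics"), dict):
        problems.append("metrics must be an object")

    # The three detail sections are expected as lists (possibly empty).
    for key in LIST_KEYS:
        if not isinstance(report.get(key), list):
            problems.append(f"{key} must be a list")

    # Mispronounced words need an accuracy score for playback generation.
    words = report.get("mispronounced_words")
    if isinstance(words, list):
        for entry in words:
            if not isinstance(entry, dict) or "word" not in entry or "accuracy" not in entry:
                problems.append("each mispronounced word needs 'word' and 'accuracy'")

    return problems
```

Running validate_report on every response in the test suite gives a quick pass/fail signal per sample and per model, separate from the manual output quality assessment.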

Resources/References

Actions #1

Updated by Harikrishnan Murugan 2 months ago

  • Status changed from Pending to Closed
  • % Done changed from 0 to 100