To Do #156
closed
Speech Analysis POC with Gemini
Start date:
10/07/2025
Due date:
10/17/2025
% Done:
100%
Estimated time:
Prioritization:
P0
Description
Task Description
Conduct a proof of concept to test whether Gemini 2.5 Flash (or Flash Lite) can analyze audio input and produce detailed speech assessment output comparable to SpeechSuper's unscripted speech evaluation. Before starting, try the Fluently App (https://getfluently.app/) to see how speech analysis works in practice and to get a sense of the expected output quality and structure.
Technical Approach
- Use Gemini's multimodal capability via the google-genai Python package to process audio input directly (audio length: 5 seconds to 10 minutes)
- Create a test suite with 5-10 diverse audio samples (different Indian English accents and proficiency levels, varying lengths)
- Implement prompt engineering to analyze speech for:
  - Fluency and Coherence: Speech length, speaking rate, pause fillers, long pauses
  - Pronunciation: Identify ONLY mispronounced words with accuracy scoring (these words will be converted to audio for correct pronunciation playback)
  - Grammar: Error detection with correction suggestions
  - Vocabulary: Word choice assessment with better vocabulary suggestions
- Design JSON output structure based on SpeechSuper format (reference: https://www.speechsuper.com/ unscripted speech assessment)
- Generate decimal scores out of 10 (e.g., 4.5, 6.5, 8.0, 9.5), each with a detailed breakdown
- Test both Gemini 2.5 Flash and Flash Lite models
- Document response time and output quality
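The approach above could be sketched roughly as follows. This is a minimal sketch, not a confirmed implementation: it assumes the google-genai package is installed, that an API key is available in the environment, and that the model IDs are `gemini-2.5-flash` and `gemini-2.5-flash-lite`. The names `build_prompt` and `analyze_audio` and the prompt wording are hypothetical.

```python
import time

# Assumed model IDs for "Gemini 2.5 Flash" and "Flash Lite"; verify before use.
MODELS = ("gemini-2.5-flash", "gemini-2.5-flash-lite")

# The four assessment areas named in the task.
CRITERIA = ("fluency", "pronunciation", "grammar", "vocabulary")


def build_prompt() -> str:
    """Build the analysis prompt (hypothetical wording, to be tuned in the POC)."""
    areas = ", ".join(CRITERIA)
    return (
        "You are a speech assessment engine. Listen to the attached audio and "
        f"score the speaker on {areas}, each out of 10 with decimals. "
        "List ONLY mispronounced words with an accuracy score, list grammar "
        "errors with corrections, and suggest better vocabulary. "
        "Respond with JSON only."
    )


def analyze_audio(path: str, model: str, client=None) -> tuple[str, float]:
    """Send one audio file to one model; return (raw JSON text, seconds elapsed)."""
    if client is None:
        # Imported lazily so the rest of the module works without the SDK.
        from google import genai  # pip install google-genai

        client = genai.Client()  # picks up the API key from the environment

    start = time.perf_counter()
    # Upload through the Files API, then pass the file handle next to the prompt.
    uploaded = client.files.upload(file=path)
    response = client.models.generate_content(
        model=model,
        contents=[build_prompt(), uploaded],
        config={"response_mime_type": "application/json"},
    )
    # Elapsed time covers both the upload and the generation call.
    elapsed = time.perf_counter() - start
    return response.text, elapsed
```

Calling `analyze_audio("sample.wav", m)` for each `m` in `MODELS` (with `sample.wav` standing in for a real test file) would give one response and one timing per model, which can feed the response-time documentation.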
Acceptance Criteria
- Audio input successfully processed by Gemini using the google-genai package (5 seconds to 10 minutes range)
- JSON output includes:
  - Overall scores for: Fluency, Pronunciation, Grammar, Vocabulary (out of 10 with decimals)
  - Detailed metrics (speech length, speaking rate, pause fillers, etc.)
  - Mispronounced words only with accuracy scores (for audio playback generation)
  - Grammar errors with correction suggestions
  - Vocabulary assessment with better word suggestions
- Test suite covers 5-10 diverse audio samples with varying lengths
- Both Gemini 2.5 Flash and Flash Lite tested
- Response time documented for different audio lengths
- Output quality assessment documented
- JSON structure clearly defined and matches SpeechSuper format
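One way to check the JSON-related criteria above is a small validator run over each model response. The sketch below is illustrative only: the field names (`scores`, `metrics`, `mispronounced_words`, `grammar_errors`, `vocabulary_suggestions`) are assumptions, not SpeechSuper's actual format, and the values in `SAMPLE` are made up to show the shape, not real results.

```python
import json

SCORE_KEYS = ("fluency", "pronunciation", "grammar", "vocabulary")
LIST_KEYS = ("mispronounced_words", "grammar_errors", "vocabulary_suggestions")

# Hypothetical report shape with placeholder values (not measured output).
SAMPLE = {
    "scores": {"fluency": 6.5, "pronunciation": 8.0, "grammar": 4.5, "vocabulary": 9.5},
    "metrics": {
        "speech_length_sec": 42.0,
        "speaking_rate_wpm": 118,
        "pause_fillers": 3,
        "long_pauses": 1,
    },
    "mispronounced_words": [{"word": "vegetable", "accuracy": 5.5}],
    "grammar_errors": [{"error": "He go to school", "correction": "He goes to school"}],
    "vocabulary_suggestions": [{"word": "good", "suggestion": "excellent"}],
}


def validate_report(raw: str) -> list[str]:
    """Return a list of problems found in a raw JSON report; empty means it passes."""
    try:
        report = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(report, dict):
        return ["top-level value must be a JSON object"]

    problems = []
    scores = report.get("scores")
    if not isinstance(scores, dict):
        scores = {}
    for key in SCORE_KEYS:
        value = scores.get(key)
        # bool is a subclass of int in Python, so reject it explicitly.
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not 0 <= value <= 10:
            problems.append(f"scores.{key} must be a number from 0 to 10")

    if not isinstance(report.get("metrics"), dict):
        problems.append("metrics must be an object")
    for key in LIST_KEYS:
        if not isinstance(report.get(key), list):
            problems.append(f"{key} must be a list")
    return problems
```

For example, `validate_report(json.dumps(SAMPLE))` returns an empty list, while a response that is not parseable JSON returns a single "not valid JSON" problem. Once the real structure is settled against the SpeechSuper format, the key names here would need to be updated to match.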
Resources/References
- Fluently App (try for understanding): https://getfluently.app/
- SpeechSuper Unscripted Speech Assessment: https://www.speechsuper.com/
- Google Gemini Python SDK: https://github.com/googleapis/python-genai