Knowledge processing pipeline
What happens between clicking Upload and the document being searchable by the AI.
Once processing completes, each file's status flips to Ready with a chunk count next to it — that's the indexed unit the AI will retrieve from during sessions.

The five stages
Upload → Parse → Chunk → Embed → Index
Each stage runs server-side. Most documents complete in 30-90 seconds; large videos take 5-15 minutes.
1. Upload
Files land in your tenant's S3 storage via signed-URL upload from the Dashboard. The maximum file size is 250 MB. See Uploading knowledge for accepted formats.
2. Parse
Text + structural metadata is extracted from each file:
| Source | Parser | Notes |
|---|---|---|
| PDF (native text) | PyMuPDF | Preserves page numbers, headings, table structure |
| PDF (scanned) | Vision-AI OCR | Slower; fallback when no selectable text |
| Word .docx | Mammoth + heading detection | Style hierarchy → markdown headings |
| CSV | Row-based | One chunk per row by default |
| Images | Vision-AI description | Generates a detailed text description of the visual |
| Video | Audio transcription + key-frame analysis | Both feed into chunks |
Parsed content is stored as structured markdown with metadata comments (<!-- page: 3 -->, <!-- section: "Combustion analysis" -->) so the AI can cite back to the source.
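To illustrate how those metadata comments can be tied back to the text they precede, here is a minimal sketch. It assumes one comment per line in the form shown above; the function name `split_with_metadata` and the regular expression are illustrative, not the pipeline's actual parser.

```python
import re

# Matches comments such as <!-- page: 3 --> or <!-- section: "Combustion analysis" -->
META_COMMENT = re.compile(r'<!--\s*(\w+)\s*:\s*"?(.*?)"?\s*-->')

def split_with_metadata(markdown: str):
    """Yield (metadata, text) pairs; each text line gets the latest metadata seen so far."""
    current = {}
    for line in markdown.splitlines():
        stripped = line.strip()
        match = META_COMMENT.fullmatch(stripped)
        if match:
            key, value = match.groups()
            # Store numeric values (page numbers) as integers, everything else as text.
            current[key] = int(value) if value.isdigit() else value
        elif stripped:
            yield dict(current), stripped

doc = (
    "<!-- page: 3 -->\n"
    '<!-- section: "Combustion analysis" -->\n'
    "Flue gas readings should be taken at steady state.\n"
)
for metadata, text in split_with_metadata(doc):
    print(metadata, text)
```

Because each text line carries a copy of the current page and section, a citation can be built from whichever line a retrieved passage came from.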
3. Chunk
Parsed markdown is split into chunks of ~500-800 tokens with a 50-token overlap. The chunker is structure-aware:
- Never breaks mid-sentence
- Never splits a table across chunks
- Headings + their first paragraph stay together
- Bullet lists stay whole if under chunk size
Chunks inherit their parent document's metadata, including collection name + folder.
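The chunking rules above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the production chunker: tokens are approximated by whitespace-separated words, blocks are separated by blank lines (so a table or list with no blank line inside it is never split), and overlap is carried as a whole trailing block rather than an exact token window. The function names are hypothetical.

```python
def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words stand in for model tokens.
    return len(text.split())

def split_blocks(markdown: str) -> list[str]:
    """Split on blank lines, then glue each lone heading to the block that follows it."""
    blocks = [b.strip() for b in markdown.split("\n\n") if b.strip()]
    merged: list[str] = []
    for block in blocks:
        previous_is_lone_heading = (
            bool(merged) and merged[-1].startswith("#") and "\n" not in merged[-1]
        )
        if previous_is_lone_heading:
            merged[-1] = merged[-1] + "\n\n" + block
        else:
            merged.append(block)
    return merged

def chunk(markdown: str, max_tokens: int = 800, overlap_tokens: int = 50) -> list[str]:
    """Pack whole blocks into chunks; a block larger than max_tokens is emitted unsplit."""
    chunks: list[str] = []
    current: list[str] = []
    for block in split_blocks(markdown):
        if current and count_tokens("\n\n".join(current + [block])) > max_tokens:
            chunks.append("\n\n".join(current))
            # Repeat the last block as overlap only if it is short and still fits.
            tail = current[-1]
            fits = (
                count_tokens(tail) <= overlap_tokens
                and count_tokens(tail + " " + block) <= max_tokens
            )
            current = [tail] if fits else []
        current.append(block)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Working on whole blocks is what keeps tables and short lists intact; the trade-off is that chunk sizes vary more than with a fixed token window.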
4. Embed
Each chunk is embedded with gemini-embedding-001 (768-dim vectors). Embeddings are stored in pgvector inside your tenant's Postgres schema. Embeddings are generated once at upload time, not per query.
For multi-language content, embeddings work across languages — Gemini's embedding model is multilingual, so an English query can surface relevant chunks from Welsh or Polish content if the meaning is close.
5. Index
Final stage: the chunks are written to pgvector with an HNSW index for fast cosine-similarity lookup. The HNSW index is built incrementally — you don't lose query capability while new documents are being processed. Existing chunks stay queryable throughout.
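For readers who want to picture the storage layer, the sketch below shows what a pgvector table with an HNSW cosine index could look like. The table name `knowledge_chunks`, its columns, and the psycopg-style named placeholders are assumptions made for illustration; the actual schema is not documented here. The `hnsw` index method, the `vector_cosine_ops` operator class, and the `<=>` cosine-distance operator are standard pgvector features.

```python
# Hypothetical schema: table and column names are illustrative only.
CREATE_TABLE = """
CREATE TABLE knowledge_chunks (
    id bigserial PRIMARY KEY,
    document_id bigint NOT NULL,
    content text NOT NULL,
    embedding vector(768)
);
"""

# HNSW index using cosine distance, matching the similarity measure used at query time.
CREATE_INDEX = """
CREATE INDEX ON knowledge_chunks
USING hnsw (embedding vector_cosine_ops);
"""

# <=> is pgvector's cosine-distance operator; similarity is 1 minus distance.
TOP_K_QUERY = """
SELECT id, content, 1 - (embedding <=> %(query)s::vector) AS similarity
FROM knowledge_chunks
ORDER BY embedding <=> %(query)s::vector
LIMIT %(k)s;
"""

def to_vector_literal(values) -> str:
    """Format a sequence of numbers as a pgvector text literal such as [0.1,2.0]."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"
```

With a driver such as psycopg, `TOP_K_QUERY` would be executed with parameters like `{"query": to_vector_literal(query_embedding), "k": 5}`.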
Re-processing
Sometimes you need to re-process — you updated the source document, or want to take advantage of an improved parser:
Knowledge → row → ⋯ → Re-process. Re-processing is incremental — only changed chunks get re-embedded.
If a session is showing wrong answers from a known-good document, first check the document's Last processed timestamp. If it predates your most recent upload, re-process.
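One common way to re-embed only changed chunks is to compare content hashes between the stored chunks and the freshly chunked document. The sketch below assumes that approach; the source does not state how changes are detected, and `chunk_hash` and `plan_reprocess` are hypothetical names.

```python
import hashlib

def chunk_hash(text: str) -> str:
    """Stable fingerprint of a chunk's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_reprocess(stored_hashes: set[str], new_chunks: list[str]):
    """Return (chunks that need embedding, hashes of stored chunks that are now stale)."""
    new_by_hash = {chunk_hash(c): c for c in new_chunks}
    to_embed = [c for h, c in new_by_hash.items() if h not in stored_hashes]
    to_delete = stored_hashes - set(new_by_hash)
    return to_embed, to_delete

# Example: only the edited middle chunk is re-embedded.
stored = {chunk_hash(c) for c in ["alpha", "beta", "gamma"]}
to_embed, to_delete = plan_reprocess(stored, ["alpha", "beta edited", "gamma"])
print(to_embed)        # ['beta edited']
print(len(to_delete))  # 1
```

Unchanged chunks keep their existing embeddings, so the cost of a re-process scales with the size of the edit rather than the size of the document.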
Query time
When the AI calls search_knowledge during a session:
- The engineer's question is embedded with the same gemini-embedding-001 model.
- pgvector returns the top chunks by cosine similarity.
- Top chunks are passed back to the AI as context.
- The AI cites back using the source's page + section metadata.
Total query time at the engineer's end: 200-400ms.
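The ranking step can be pictured with a small pure-Python sketch. In the real pipeline pgvector does this work inside Postgres; the code below only illustrates cosine-similarity ranking. The chunk dictionary shape (`text`, `page`, `section`, `embedding`) is an assumption, and the two-dimensional vectors are toy stand-ins for 768-dimensional embeddings.

```python
import math

def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_chunks(query_vec, chunks, k=3):
    """Return the k chunks whose embeddings are most similar to the query vector."""
    ranked = sorted(
        chunks,
        key=lambda c: cosine_similarity(query_vec, c["embedding"]),
        reverse=True,
    )
    return ranked[:k]

chunks = [
    {"text": "Flue gas limits", "page": 3, "section": "Combustion analysis", "embedding": [1.0, 0.0]},
    {"text": "Warranty terms", "page": 9, "section": "Legal", "embedding": [0.0, 1.0]},
    {"text": "Analyser setup", "page": 4, "section": "Combustion analysis", "embedding": [1.0, 1.0]},
]
for hit in top_chunks([1.0, 0.1], chunks, k=2):
    print(f'{hit["text"]} (page {hit["page"]}, {hit["section"]})')
```

The page and section carried on each hit are what allow the answer to cite its source.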
What can go wrong
| Symptom | Likely cause | Fix |
|---|---|---|
| Doc stuck on Processing for hours | Parse failure on the source file | Re-process; if it still fails, the source is likely corrupted — re-export from your original tool |
| Doc shows Ready but AI can't find content | Cached manifest is stale, or the file name is too generic | Wait for the next session (the manifest rebuilds on session start); rename the file to something specific |
| Chunks include lots of header/footer noise | Source PDF has running headers + page numbers per page | Re-export the source without page numbering, or accept the noise (it doesn't usually affect retrieval) |
| Video upload taking very long | Audio transcription stage | Normal: a typical 250 MB video takes 5-15 minutes; you can leave the tab while it processes |
Where to next
- Uploading knowledge — what to upload
- Knowledge collections — organising for AI signal
- How the AI uses knowledge — what happens at query time