Skills & Knowledge · Last updated 29 June 2026 · 4 min read

Knowledge processing pipeline

What happens between clicking Upload and the document being searchable by the AI.

Once processing completes, each file's status flips to Ready with a chunk count next to it. Chunks are the indexed units the AI retrieves from during sessions.

Knowledge topic detail — file in Ready state showing chunk count (26 chunks) and metadata

The five stages

Upload → Parse → Chunk → Embed → Index

Each stage runs server-side. Most documents complete in 30-90 seconds; large videos take 5-15 minutes.

1. Upload

Files land in your account's S3 storage via signed-URL upload from the Dashboard. Maximum file size 250 MB. See Uploading knowledge for accepted formats.

2. Parse

Text and structural metadata are extracted from each file:

Source	Parser	Notes
PDF (native text)	PyMuPDF	Preserves page numbers, headings, table structure
PDF (scanned)	Vision-AI OCR	Slower; fallback when no selectable text
Word `.docx`	Mammoth + heading detection	Style hierarchy → markdown headings
CSV	Row-based	One chunk per row by default
Images	Vision-AI description	Generates a detailed text description of the visual
Video	Audio transcription + key-frame analysis	Both feed into chunks

Parsed content is stored as structured markdown with metadata comments so the AI can cite back to the source.
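To make the "one chunk per row" behaviour for CSV files concrete, here is a minimal sketch. The function name, the `column: value` text layout, and the dictionary keys are illustrative assumptions; the actual row parser is not documented here.

```python
import csv
import io

def csv_to_chunks(csv_text, source_name):
    """Turn each CSV data row into one chunk of 'column: value' lines.

    Hypothetical helper: the layout and key names are assumptions.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    chunks = []
    for row_number, row in enumerate(reader, start=1):
        # Repeat the header names in every chunk so each row is self-describing.
        body = "\n".join(f"{column}: {value}" for column, value in row.items())
        chunks.append({"source": source_name, "row": row_number, "text": body})
    return chunks

sample = "part,torque_nm\nM8 bolt,25\nM10 bolt,49\n"
chunks = csv_to_chunks(sample, "torque-table.csv")
```

Repeating the column names inside each chunk is one way to keep a single row meaningful when it is retrieved without the rest of the table.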

3. Chunk

Parsed markdown is split into chunks of ~500-800 tokens with a 50-token overlap. The chunker is structure-aware:

Never breaks mid-sentence
Never splits a table across chunks
Headings + their first paragraph stay together
Bullet lists stay whole if under chunk size

Chunks inherit their parent document's metadata, including topic name + folder.
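The sentence-boundary and overlap rules above can be sketched as follows. This is a simplified illustration, not the production chunker: it counts whitespace-separated words instead of model tokens, ignores tables, headings, and lists, and all function names are hypothetical.

```python
import re

def count_tokens(text):
    # Assumption: whitespace-separated words stand in for model tokens.
    return len(text.split())

def chunk_sentences(text, max_tokens=800, overlap_tokens=50):
    """Pack whole sentences into chunks, carrying trailing sentences forward as overlap."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], []
    for sentence in sentences:
        if current and count_tokens(" ".join(current + [sentence])) > max_tokens:
            chunks.append(" ".join(current))
            # Seed the next chunk with whole trailing sentences, up to the overlap budget.
            carried, budget = [], overlap_tokens
            for previous in reversed(current):
                cost = count_tokens(previous)
                if cost > budget:
                    break
                carried.insert(0, previous)
                budget -= cost
            current = carried
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks

demo_text = "One two three. Four five six. Seven eight nine. Ten eleven twelve."
demo_chunks = chunk_sentences(demo_text, max_tokens=6, overlap_tokens=3)
```

With the tiny limits in the demo, each chunk holds two sentences and starts with the last sentence of the previous chunk. Because the overlap is made of whole sentences, no chunk ever begins or ends mid-sentence.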

4. Embed

Each chunk is embedded with gemini-embedding-001 (768-dim vectors). Embeddings are stored in pgvector inside your account's Postgres schema. Embeddings are generated once at upload time, not per query.

For multi-language content, embeddings work across languages — Gemini's embedding model is multilingual, so an English query can surface relevant chunks from Welsh or Polish content if the meaning is close.

5. Index

Final stage: the chunks are written to pgvector with an HNSW index for fast cosine-similarity lookup. The index is built incrementally, so existing chunks stay queryable while new documents are being processed.
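For readers who want to picture the storage layer, the sketch below shows what a 768-dimension pgvector column with an HNSW cosine index could look like. The table, column, and index names are hypothetical, and the placeholders assume a psycopg-style driver; only the SQL syntax itself (`vector(768)`, `USING hnsw`, `vector_cosine_ops`, the `<=>` cosine-distance operator) is standard pgvector.

```python
# Hypothetical schema: the names are illustrative, not the product's real tables.
CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS knowledge_chunks (
    id        bigserial PRIMARY KEY,
    content   text NOT NULL,
    embedding vector(768)
);
"""

# HNSW index using cosine distance as the similarity measure.
CREATE_INDEX_SQL = """
CREATE INDEX IF NOT EXISTS knowledge_chunks_embedding_idx
    ON knowledge_chunks USING hnsw (embedding vector_cosine_ops);
"""

# <=> is pgvector's cosine-distance operator; smaller means more similar.
TOP_K_SQL = """
SELECT id, content
FROM knowledge_chunks
ORDER BY embedding <=> %(query)s::vector
LIMIT %(k)s;
"""

def to_vector_literal(values):
    """Format a list of numbers as pgvector's text input, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"

params = {"query": to_vector_literal([0.1, 0.2, 0.3]), "k": 5}
```

Passing the query vector as a text literal and casting it with `::vector` avoids needing a driver-specific type adapter in this sketch.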

Re-processing

Sometimes you need to re-process — you updated the source document, or want to take advantage of an improved parser:

Knowledge → row → ⋯ → Re-process. Re-processing is incremental — only changed chunks get re-embedded.
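One common way to implement "only changed chunks get re-embedded" is to compare content hashes. The sketch below assumes that approach; the page does not state how changes are actually detected, and the function names are hypothetical.

```python
import hashlib

def chunk_hash(text):
    """Stable fingerprint of a chunk's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_reprocess(old_hashes, new_chunks):
    """Compare stored chunk hashes with freshly produced chunk texts.

    Returns (to_embed, to_keep, to_delete):
      to_embed  - chunk texts that are new or changed and need an embedding
      to_keep   - hashes whose existing embeddings can be reused
      to_delete - hashes of chunks that no longer exist in the document
    """
    new_by_hash = {chunk_hash(text): text for text in new_chunks}
    to_embed = [text for h, text in new_by_hash.items() if h not in old_hashes]
    to_keep = [h for h in new_by_hash if h in old_hashes]
    to_delete = [h for h in old_hashes if h not in new_by_hash]
    return to_embed, to_keep, to_delete

stored = {chunk_hash("Check the valve."), chunk_hash("Old torque value.")}
fresh = ["Check the valve.", "New torque value."]
to_embed, to_keep, to_delete = plan_reprocess(stored, fresh)
```

In the example, only "New torque value." would be sent for embedding; the unchanged chunk keeps its existing vector and the outdated one is removed.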

Tip

If a session is showing wrong answers from a known-good document, first check the document's Last processed timestamp. If it predates your most recent upload, re-process.

Query time

When the AI calls search_knowledge during a session:

The engineer's question is embedded with the same gemini-embedding-001 model.
pgvector returns the top chunks by cosine similarity.
Top chunks are passed back to the AI as context.
The AI cites back using the source's page + section metadata.

Total query time at the engineer's end: 200-400ms.
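The ranking step can be illustrated with a small exact search in plain Python. This is a conceptual sketch only: the real lookup runs inside pgvector against an HNSW index (an approximate search), the vectors here are three-dimensional toys rather than 768-dimension embeddings, and the function and key names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_chunks(query_vector, indexed_chunks, k=3):
    """Score every chunk against the query and return the k best, with citation metadata."""
    scored = [(cosine_similarity(query_vector, c["embedding"]), c) for c in indexed_chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [{"score": s, "text": c["text"], "page": c["page"]} for s, c in scored[:k]]

indexed = [
    {"text": "Reset procedure", "page": 4, "embedding": [1.0, 0.0, 0.0]},
    {"text": "Warranty terms", "page": 9, "embedding": [0.0, 1.0, 0.0]},
    {"text": "Reset after power loss", "page": 12, "embedding": [1.0, 1.0, 0.0]},
]
results = top_chunks([1.0, 0.0, 0.0], indexed, k=2)
```

Each result carries its page number alongside the text, which is the kind of metadata the AI needs in order to cite the source.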

What can go wrong

Symptom	Likely cause	Fix
Doc stuck on Processing for hours	Parse failure on the source file	Re-process; if it still fails, the source is likely corrupted — re-export from your original tool
Doc shows Ready but AI can't find content	Stale cached manifest, or a generic file name	Wait for the next session (the manifest rebuilds on session start); rename the file to something specific
Chunks include lots of header/footer noise	Source PDF has running headers + page numbers per page	Re-export the source without page numbering, or accept the noise (it doesn't usually affect retrieval)
Video upload taking very long	Audio transcription stage	Normal: a typical 250 MB video takes 5-15 minutes, and you can leave the tab while it runs

Where to next

Uploading knowledge — what to upload
Knowledge topics — organising for AI signal
How the AI uses knowledge — what happens at query time