Skills & Knowledge · Last updated 18 May 2026 · 4 min read

Knowledge processing pipeline

What happens between clicking Upload and the document being searchable by the AI.

Once processing completes, each file's status flips to Ready with a chunk count next to it — that's the indexed unit the AI will retrieve from during sessions.

Knowledge collection detail — file in Ready state showing chunk count (26 chunks) and metadata

The five stages

Upload → Parse → Chunk → Embed → Index

Each stage runs server-side. Most documents complete in 30-90 seconds; large videos take 5-15 minutes.

1. Upload

Files land in your tenant's S3 storage via signed-URL upload from the Dashboard. Maximum file size 250 MB. See Uploading knowledge for accepted formats.
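
If you are preparing a batch of files, a quick local size check can save a failed upload. The snippet below is a minimal sketch, not part of the product: the function names are hypothetical, and it assumes "250 MB" means decimal megabytes (the more conservative reading).

```python
import os

# Assumption: "250 MB" is read as decimal megabytes (250,000,000 bytes).
MAX_UPLOAD_BYTES = 250 * 1000 * 1000


def within_upload_limit(size_bytes: int, limit: int = MAX_UPLOAD_BYTES) -> bool:
    """Return True when a file of this size fits under the upload limit."""
    return size_bytes <= limit


def files_over_limit(paths: list[str]) -> list[str]:
    """Return the paths whose size on disk exceeds the upload limit."""
    return [p for p in paths if not within_upload_limit(os.path.getsize(p))]
```

Anything returned by files_over_limit would need to be split or compressed before uploading.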

2. Parse

Text and structural metadata are extracted from each file:

| Source | Parser | Notes |
| --- | --- | --- |
| PDF (native text) | PyMuPDF | Preserves page numbers, headings, table structure |
| PDF (scanned) | Vision-AI OCR | Slower; fallback when no selectable text |
| Word .docx | Mammoth + heading detection | Style hierarchy → markdown headings |
| CSV | Row-based | One chunk per row by default |
| Images | Vision-AI description | Generates a detailed text description of the visual |
| Video | Audio transcription + key-frame analysis | Both feed into chunks |

Parsed content is stored as structured markdown with metadata comments (<!-- page: 3 -->, <!-- section: "Combustion analysis" -->) so the AI can cite back to the source.
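
To make the metadata comments concrete, here is a minimal sketch of how a reader of the parsed markdown could attach the most recent page and section to each line of content. The regular expression and the tag_lines helper are illustrative assumptions based on the two comment examples above, not the pipeline's actual parser.

```python
import re

# Matches comments such as <!-- page: 3 --> or <!-- section: "Combustion analysis" -->
META_COMMENT = re.compile(r'<!--\s*(\w+):\s*"?(.*?)"?\s*-->')


def tag_lines(markdown: str) -> list[dict]:
    """Attach the most recently seen metadata values to each content line."""
    current: dict[str, str] = {}
    tagged = []
    for line in markdown.splitlines():
        stripped = line.strip()
        match = META_COMMENT.fullmatch(stripped)
        if match:
            # A metadata comment updates the running context and is not content.
            key, value = match.groups()
            current[key] = value
            continue
        if stripped:
            tagged.append({"text": stripped, **current})
    return tagged


sample = '<!-- page: 3 -->\n<!-- section: "Combustion analysis" -->\nExample paragraph.'
print(tag_lines(sample))
```

Each returned entry carries the page and section that were in effect when the line appeared, which is the information a citation needs.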

3. Chunk

Parsed markdown is split into chunks of ~500-800 tokens with a 50-token overlap. The chunker is structure-aware:

  • Never breaks mid-sentence
  • Never splits a table across chunks
  • Headings + their first paragraph stay together
  • Bullet lists stay whole if under chunk size

Chunks inherit their parent document's metadata, including collection name + folder.
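
The packing idea can be sketched as follows. This is a simplified illustration, not the production chunker: it treats blank-line-separated blocks (a paragraph, a table, a list) as indivisible, approximates tokens with whitespace-separated words, and carries a plain word-based tail forward as overlap. It does not implement the heading or sentence rules listed above, and the function names are hypothetical.

```python
def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words stand in for model tokens.
    return len(text.split())


def chunk_blocks(markdown: str, max_tokens: int = 800, overlap_tokens: int = 50) -> list[str]:
    """Pack blank-line-separated blocks into chunks without splitting a block."""
    blocks = [b.strip() for b in markdown.split("\n\n") if b.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for block in blocks:
        size = sum(count_tokens(b) for b in current) + count_tokens(block)
        if current and size > max_tokens:
            # Close the current chunk before it grows past the limit.
            finished = "\n\n".join(current)
            chunks.append(finished)
            # Carry the last few words forward so neighbouring chunks overlap.
            words = finished.split()
            tail = " ".join(words[-overlap_tokens:]) if overlap_tokens > 0 else ""
            current = [tail] if tail else []
        current.append(block)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

A block larger than max_tokens is emitted whole rather than split, which mirrors the "never splits a table" rule at the cost of an occasional oversized chunk.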

4. Embed

Each chunk is embedded with gemini-embedding-001 (768-dim vectors). Embeddings are stored in pgvector inside your tenant's Postgres schema. Embeddings are generated once at upload time, not per query.

For multi-language content, embeddings work across languages — Gemini's embedding model is multilingual, so an English query can surface relevant chunks from Welsh or Polish content if the meaning is close.

5. Index

Final stage: the chunks are written to pgvector with an HNSW index for fast cosine-similarity lookup. The HNSW index is built incrementally — you don't lose query capability while new documents are being processed. Existing chunks stay queryable throughout.

Re-processing

Sometimes you need to re-process — you updated the source document, or want to take advantage of an improved parser:

Knowledge → the file's row → ⋯ menu → Re-process. Re-processing is incremental: only changed chunks get re-embedded.
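
One common way to decide which chunks count as "changed" is to compare content hashes between the previous run and the new one. The sketch below illustrates that idea only; how the pipeline actually detects changes is not documented here, so the hashing approach and the function names are assumptions.

```python
import hashlib


def chunk_hash(text: str) -> str:
    """Fingerprint a chunk by its exact text content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def chunks_to_embed(old_chunks: list[str], new_chunks: list[str]) -> list[str]:
    """Return the new chunks whose content did not appear in the previous run."""
    known = {chunk_hash(c) for c in old_chunks}
    return [c for c in new_chunks if chunk_hash(c) not in known]


previous = ["intro text", "old procedure", "appendix"]
updated = ["intro text", "revised procedure", "appendix"]
print(chunks_to_embed(previous, updated))  # ['revised procedure']
```

Under this scheme, unchanged chunks keep their existing embeddings and only the revised text is sent to the embedding model again.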

Tip

If a session is showing wrong answers from a known-good document, first check the document's Last processed timestamp. If it predates your most recent upload, re-process.

Query time

When the AI calls search_knowledge during a session:

  1. The engineer's question is embedded with the same gemini-embedding-001 model.
  2. pgvector returns the top chunks by cosine similarity.
  3. Top chunks are passed back to the AI as context.
  4. The AI cites back using the source's page + section metadata.

Total query time at the engineer's end: 200-400ms.
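
Steps 1 to 3 amount to a nearest-neighbour search by cosine similarity. The sketch below shows the ranking in plain Python with toy two-dimensional vectors; in the real pipeline the vectors are 768-dimensional and pgvector performs the lookup through the HNSW index. The function names and the value of k are illustrative assumptions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_chunks(query_vec: list[float], chunks: list[tuple[str, list[float]]], k: int = 3) -> list[str]:
    """Return the ids of the k chunks most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), chunk_id) for chunk_id, vec in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk_id for _, chunk_id in scored[:k]]


# Toy embeddings: "a" points the same way as the query, "b" is orthogonal to it.
store = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
print(top_chunks([1.0, 0.0], store, k=2))  # ['a', 'c']
```

Because cosine similarity ignores vector length, a short chunk and a long chunk with the same meaning can score equally well against the same question.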

What can go wrong

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| Doc stuck on Processing for hours | Parse failure on the source file | Re-process; if it still fails, the source is likely corrupted, so re-export it from your original tool |
| Doc shows Ready but AI can't find content | Cached manifest is stale, or the file name is too generic | Wait for the next session (the manifest rebuilds on session start); rename the file to something specific |
| Chunks include lots of header/footer noise | Source PDF has running headers and page numbers on every page | Re-export the source without page numbering, or accept the noise (it doesn't usually affect retrieval) |
| Video upload taking very long | Audio transcription stage | Normal: a 250 MB video typically takes 5-15 minutes; you can leave the tab |
