For the user, this is a nightmare. They search the corporate website for a solution, but the answer lies in the technical documentation portal on a subdomain or on a subsidiary’s service website.
For a long time, federated search was the standard workaround to mask this problem. But in the age of artificial intelligence and large language models (LLMs), this approach is reaching its limits. A genuine, semantic AI search that understands and answers questions across platforms requires a completely different architecture: a centralised, harmonised vector system. A genuine digital brain for the entire organisation.
The architectural dilemma: federated vs. centralised
To understand why the data foundation is the decisive lever, we need to compare the two technological approaches.
Federated search: the illusion of unity
Traditional federated search is essentially a digital postman. When a user enters a search term, the system sends the query in parallel to Site A (e.g. TYPO3 Corporate), Site B (e.g. the e-commerce shop) and Site C (e.g. the document archive). It waits until all systems have returned their own lists of results, and then attempts to somehow group these results together by relevance in the front end.
The problem is that each system assesses relevance differently, so the merged scores are not comparable. Furthermore, the search remains rigidly keyword-based. If the user searches for “How do I maintain the pump”, but the manual refers to “maintenance intervals for the vacuum unit”, the federated search finds nothing, even though the information exists. AI-powered answer generation (retrieval-augmented generation, RAG) is practically out of reach on this basis, as there is no LLM in the background capable of meaningfully cross-referencing and summarising the various sources.
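The keyword problem can be reproduced in a few lines. The following is a minimal sketch, not a real federated search product: the three in-memory "sites", their documents and the function names are assumptions made up for illustration, and each site is reduced to a naive substring match.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-site indexes; in practice each would be a separate search API.
SITES = {
    "corporate": ["Company history and locations"],
    "shop": ["Order spare parts for the pump online"],
    "docs": ["Maintenance intervals for the vacuum unit"],
}

def keyword_search(site, query):
    """Return (site, document) pairs whose text contains every query term as a substring."""
    terms = query.lower().split()
    return [(site, doc) for doc in SITES[site] if all(t in doc.lower() for t in terms)]

def federated_search(query):
    """Send the query to all sites in parallel and concatenate the result lists."""
    with ThreadPoolExecutor() as pool:
        result_lists = pool.map(lambda site: keyword_search(site, query), SITES)
    # No shared relevance score exists, so the lists are simply appended in site order.
    return [hit for hits in result_lists for hit in hits]

hits_for_pump = federated_search("pump")
hits_for_question = federated_search("maintain pump")
```

With this toy data, the single term "pump" finds the shop page, while "maintain pump" returns an empty list: the manual entry about the vacuum unit is never matched because it shares no keyword with the query.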
Centralised AI vector search: the common foundation
In a modern AI search system, content from all multisites flows through a central pipeline into a single, high-performance vector database (such as Qdrant or Milvus). There, texts are stored not as words, but as mathematical vectors that represent the meaning of the content.
The advantage is that it makes no difference which platform the content is on or what wording has been used, because matching is based on meaning rather than on exact terms. Furthermore, this central index provides the foundation for Retrieval-Augmented Generation (RAG): the system selects the best-matching text excerpts across all platforms (for example, the top three) and passes them to the LLM, which uses them to generate a single, precise answer for the user.
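To illustrate the retrieval step, here is a minimal sketch of a top-k similarity search over one shared index. It does not use Qdrant or Milvus; the three-dimensional vectors are invented by hand and merely stand in for the output of a real embedding model, and all names are assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equally long vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Toy "embeddings", invented for illustration only.
INDEX = [
    {"site": "docs", "text": "Maintenance intervals for the vacuum unit", "vector": [0.9, 0.1, 0.0]},
    {"site": "shop", "text": "Order spare parts online", "vector": [0.3, 0.8, 0.1]},
    {"site": "corporate", "text": "Company history and locations", "vector": [0.0, 0.1, 0.9]},
]

def top_k(query_vector, k=3):
    """Rank all chunks from all sites by similarity and return the k best."""
    ranked = sorted(INDEX, key=lambda c: cosine(query_vector, c["vector"]), reverse=True)
    return ranked[:k]

# Pretend embedding of "How do I maintain the pump" (assumed values).
query = [0.8, 0.2, 0.0]
best = top_k(query, k=2)
```

Because every chunk lives in the same vector space, the ranking is comparable across sites. In a RAG setup, the texts in `best` would then be placed into the prompt of the LLM.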
Technical implementation: Preparing the database for the central ‘brain’
The switch to a centralised AI search is not purely a front-end project, but a data-structure project. To ensure that the vector database functions accurately and does not mix up the content from the various multisites, the data preparation (ingestion pipeline) must go through three core steps:
Semantic cross-site chunking
Texts must be broken down into small, digestible sections (chunks) before they can be converted into vectors. With multisites, rigidly truncating text after a fixed number of characters is disastrous. If Site A is a technical manual and Site B is a marketing blog, the contexts become mixed up when paragraphs are split in the middle of a sentence.
The approach: HTML-structure-based chunking.
The pipeline must intelligently analyse the heading hierarchies (H1 to H3) of the respective CMS instances. A chunk must never cross the boundary of a logical section. Each fragment must remain a self-contained unit of meaning so that the AI can later understand the exact context.
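One possible way to sketch heading-based chunking uses only Python's standard `html.parser` module. The class and function names are assumptions, and the sketch deliberately ignores details a production pipeline would need, such as splitting very long sections or handling CMS-specific markup.

```python
from html.parser import HTMLParser

class HeadingChunker(HTMLParser):
    """Start a new chunk at every h1-h3 so that no chunk spans two sections."""

    HEADINGS = {"h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.chunks = []          # finished chunks: {"heading": ..., "text": ...}
        self._heading = ""        # heading of the section being collected
        self._parts = []          # body text fragments of that section
        self._in_heading = False

    def flush_section(self):
        """Store the section collected so far and reset the buffers."""
        text = " ".join(self._parts).strip()
        if self._heading or text:
            self.chunks.append({"heading": self._heading, "text": text})
        self._heading, self._parts = "", []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self.flush_section()  # a new heading closes the previous section
            self._in_heading = True

    def handle_endtag(self, tag):
        if tag in self.HEADINGS:
            self._in_heading = False

    def handle_data(self, data):
        data = data.strip()
        if not data:
            return
        if self._in_heading:
            self._heading = (self._heading + " " + data).strip()
        else:
            self._parts.append(data)

def chunk_html(html):
    """Split an HTML fragment into one chunk per h1-h3 section."""
    parser = HeadingChunker()
    parser.feed(html)
    parser.close()
    parser.flush_section()        # store the last open section
    return parser.chunks

page = "<h1>Pump</h1><p>Intro.</p><h2>Maintenance</h2><p>Check seals.</p><p>Every 6 months.</p>"
chunks = chunk_html(page)
```

Here `chunks` holds two entries, one for "Pump" and one for "Maintenance". Keeping the heading alongside the text means each fragment carries its own section label into the later steps.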
The global metadata schema (the common denominator)
To ensure that the central index knows where a piece of information comes from, in which language it is available and who is actually authorised to view it, a standardised metadata object must be attached to each text section prior to vectorisation. This includes:
- source_site: Which domain does the text come from?
- content_type: Is it legal information, a product teaser or a set of instructions?
- security_access_group: Is the content public or visible only to logged-in B2B customers?
Why this is critical: When a user searches the public corporate website, the system uses this metadata for what is known as ‘metadata pre-filtering’. It filters out chunks from restricted areas before any processing power is spent on semantic matching. Restricted content therefore never reaches the answer generation step for unauthorised users, which supports GDPR compliance and data security whilst also improving performance.
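The metadata object and the pre-filter can be sketched as follows. The three field names come from the list above; the `language` field name, the example values such as "public" and "b2b_customers", and the function name are assumptions for illustration. Vector databases usually offer their own filter mechanisms, which this plain-Python sketch does not attempt to reproduce.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:
    text: str
    source_site: str            # which domain the text comes from
    content_type: str           # e.g. legal information, product teaser, instructions
    security_access_group: str  # who may see the chunk
    language: str               # assumed field name for the content language

def pre_filter(chunks, allowed_groups):
    """Keep only chunks the current user may see, before any vector comparison."""
    return [c for c in chunks if c.security_access_group in allowed_groups]

# Hypothetical example data.
all_chunks = [
    Chunk("Our pumps at a glance.", "www.example.com", "product_teaser", "public", "en"),
    Chunk("Dealer price list 2024.", "b2b.example.com", "pricing", "b2b_customers", "en"),
]

anonymous_view = pre_filter(all_chunks, {"public"})
customer_view = pre_filter(all_chunks, {"public", "b2b_customers"})
```

An anonymous visitor only ever sees the first chunk as a candidate, while a logged-in B2B customer sees both. The semantic ranking would then run on the filtered list only.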
Context enrichment through synthetic metadata
A major problem in the multisite landscape is the loss of context due to isolation. If the body text on a subpage for Brand X simply states: “The system is compatible with 12V”, the central vector index will not know which system is meant once the sentence has been taken out of the context of its website.
The solution: before a chunk is embedded, it is automatically enriched with context by a lean, specialised AI model. The model invisibly adds control data to the text (e.g.: “This text refers to the maintenance of the Alpha vacuum pump by Brand X”). Only this enriched text is finally vectorised.
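The enrichment step can be sketched like this. The AI model is replaced by a simple template function so that the example runs on its own; `template_context`, `enrich_chunk` and the dictionary keys are hypothetical names, and a real pipeline would pass in a call to its own model instead.

```python
def template_context(page_title, brand):
    """Stand-in for the enrichment model: builds the context line from page metadata."""
    return f"This text refers to {page_title} by {brand}."

def enrich_chunk(chunk_text, page_title, brand, generate_context=template_context):
    """Return the text shown to users and the enriched text that gets embedded."""
    context = generate_context(page_title, brand)
    return {
        "display_text": chunk_text,                    # unchanged, shown in results
        "embedding_text": f"{context}\n{chunk_text}",  # only this version is vectorised
    }

enriched = enrich_chunk(
    "The system is compatible with 12V.",
    page_title="the maintenance of the Alpha vacuum pump",
    brand="Brand X",
)
```

Keeping `display_text` and `embedding_text` separate is one way to make the added context invisible to the reader while still influencing where the chunk lands in the vector space.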
Conclusion: Data quality determines the intelligence of the search
The move away from traditional, sluggish federated search towards a central AI vector index finally breaks down the data silos of multi-site environments. As a result, companies not only gain a search field that understands, across all platforms, what the user actually means, but they also lay the fundamental groundwork for future, company-wide AI assistants.
The key to success lies not primarily in selecting the latest large language model, but in the clean, standardised processing of data streams in the background. Those who master their data foundation break down silos and create genuine digital added value.