Under the hood: full-text and vectorial indexation
While genAI models are extremely proficient at retrieving and inferring information from high volumes of data and documents, their success still depends on how all that information is indexed.
This can happen in several ways, for example:
- Full-text indexation is how traditional search engines query documents and pages: by creating a searchable catalog of all the words they contain. It’s the perfect approach for precise, keyword-specific searches.
- Vectorial indexation converts text into numerical vectors using embedding models (OpenAI Ada, Google BERT, Mistral Embeddings, etc.). Each word or phrase is a point in a multidimensional space, and the distance and direction between these points represent their semantic relationships. This approach is particularly effective when semantic meaning and intent are more important than specific keywords – for example in natural language queries.
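The difference between the two approaches can be sketched in a few lines of Python. The three-dimensional vectors below are made-up toy values for illustration only, not the output of any real embedding model; in practice, embedding models return vectors with hundreds or thousands of dimensions.

```python
import math

# Toy "embeddings" (invented values for illustration; a real embedding
# model would return high-dimensional vectors for each word or phrase).
embeddings = {
    "car":        [0.90, 0.10, 0.00],
    "automobile": [0.85, 0.15, 0.05],
    "banana":     [0.00, 0.20, 0.90],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: closer to 1.0 means
    the vectors point in a more similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# A full-text index only matches exact tokens, so a search for "car"
# will never surface a document that only mentions "automobile":
print("car" == "automobile")                                          # False

# A vector index compares positions in embedding space instead, so
# semantically related words end up close together:
print(cosine_similarity(embeddings["car"], embeddings["automobile"]))  # high
print(cosine_similarity(embeddings["car"], embeddings["banana"]))      # low
```

This is why vector search handles natural language queries well: the query and the documents are compared by semantic proximity rather than by shared keywords.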
“Vectorial indexation is crucial for genAI models and chatbots to work,” Wouter continues. “It allows large language models (LLMs) like GPT to capture nuances, context, and even user sentiment. This enables a more accurate and contextually relevant understanding of the prompt.”
Vectorial indexation is a complex task that requires a deep understanding of the data, the objectives, and the LLMs involved. “Since different models have different strengths and weaknesses, they may produce different results as well. So, for vectorial indexation, you need more than just basic knowledge.”
Democratizing NLP capabilities
GenAI is taking over tasks that were traditionally addressed by NLP. “Large, pretrained models like GPT-3.5 and GPT-4 can understand and generate natural language at an unprecedented level of sophistication,” says Wouter. “As a result, they can now handle challenges that were once the unique domain of NLP, including translating, writing, summarizing, generating code, and more. Plus, genAI has made these capabilities accessible to a wider, non-technical audience as well.”