A credit union system is using generative AI to enhance customer support. They are developing an intelligent chatbot designed to provide accurate and timely responses to user inquiries about their services, policies, and other relevant information. The chatbot utilizes a Retrieval-Augmented Generation (RAG) system. This tool navigates through a dataset of public website documents, retrieves the most relevant information, and generates coherent responses based on the retrieved data.
The model behind the RAG system is then fine-tuned on a large, publicly available finance dataset containing 52K instruction-following examples on financial topics. The resulting chatbot is a valuable customer-support tool: it gives users the information they need, improving customer satisfaction and reducing the workload on the customer support team.
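To make the architecture concrete, here is a minimal sketch of such a retrieval-and-generation chain, assuming LangChain's Vertex AI integrations; the document list, model names, and sample question are illustrative placeholders, not the production configuration.

```python
from langchain_google_vertexai import ChatVertexAI, VertexAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA

# Public website documents to ground the chatbot (illustrative placeholders).
website_documents = [
    "Members can open a savings account online with a valid ID and proof of address.",
    "Personal loan requests are reviewed within two business days.",
]

# Index the documents so the most relevant passages can be retrieved per question.
embeddings = VertexAIEmbeddings(model_name="textembedding-gecko")
vectorstore = FAISS.from_texts(website_documents, embeddings)

# Gemini Pro generates the final answer from the retrieved passages.
llm = ChatVertexAI(model_name="gemini-pro", temperature=0.2)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever(search_kwargs={"k": 2}),
)

print(qa_chain.invoke({"query": "How long does a loan review take?"}))
```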
Model selection
For this challenge, we chose to work with Gemini Pro, Google's latest Large Language Model (LLM), which handles a wide range of tasks and offers a long context window suited to more complex inputs. It provides a comprehensive set of features and capabilities that meet our specific needs, and its tight integration with Google Cloud lets developers enhance AI-powered assistance through easy-to-implement customizations.
Our selection criteria focused on finding a capable LLM that can handle question-and-answer tasks and respond in both Portuguese and English. We also needed the model to be easy to fine-tune and scale, so it can keep up with our evolving requirements.
Prompt enrichment/model tuning
The dataset chosen for this task was the rich Alpaca LoRA dataset. It was selected because it provides a diverse set of data sources, combining structured financial data with conversational examples. This lets the model train on a variety of inputs, enhancing its ability to generalize and perform well in different contexts.
The dataset was then split into training and validation sets, following a standard 80%-20% split. The model tuning process runs as a Vertex AI supervised tuning job, which lets us monitor training and its metrics through the platform. The supervised tuning job handles device usage and distribution automatically based on the model and dataset size, eliminating infrastructure concerns.
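A minimal sketch of this setup is shown below, assuming the Vertex AI SDK's supervised tuning (sft) module; the project, bucket paths, source model ID, and the record-loading and JSONL-writing helpers are illustrative assumptions rather than the exact job configuration.

```python
import random

import vertexai
from vertexai.tuning import sft

# Load the 52K instruction-following records, already converted to the JSONL
# schema expected by the tuning service (hypothetical helper functions).
examples = load_alpaca_finance_records()
random.seed(42)
random.shuffle(examples)

# Standard 80%-20% split between training and validation sets.
split = int(0.8 * len(examples))
write_jsonl("gs://my-bucket/train.jsonl", examples[:split])
write_jsonl("gs://my-bucket/valid.jsonl", examples[split:])

# Launch the supervised tuning job; Vertex AI handles device usage and
# distribution automatically based on the model and dataset size.
vertexai.init(project="my-project", location="us-central1")
tuning_job = sft.train(
    source_model="gemini-1.0-pro-002",
    train_dataset="gs://my-bucket/train.jsonl",
    validation_dataset="gs://my-bucket/valid.jsonl",
    epochs=3,
)
print(tuning_job.resource_name)  # track progress in the Vertex AI console
```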
The supervised tuning job computes a loss metric that evaluates the accuracy of the responses against the dataset outputs. Although the specific metric used for this computation isn't documented, it is likely cross-entropy, the standard loss for Low-Rank Adaptation (LoRA) and other current state-of-the-art fine-tuning approaches.
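For reference, token-level cross-entropy over a target response of length T takes the familiar form (the notation below is ours, not taken from the Vertex AI documentation):

$$
\mathcal{L}(\theta) \;=\; -\frac{1}{T}\sum_{t=1}^{T} \log p_\theta\bigl(y_t \mid y_{<t},\, x\bigr)
$$

where $x$ is the prompt, $y_t$ is the $t$-th reference token, and $\theta$ denotes the tunable (for example, LoRA adapter) parameters.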
By applying prompt enrichment techniques, we can enhance the model's task comprehension, reduce ambiguity, and achieve more accurate and relevant results. These techniques significantly improve the quality of output from language models, resulting in more informed and meaningful responses.
Figure 1: Prompt enrichment elements used in this demo
1. Prompt templates
Using a template ensures consistent and structured prompts. By clearly defining a format with placeholders, the AI can generate responses that are both relevant and contextually appropriate.
2. Role and task specification
By clearly defining the AI’s role and specific task, we help set a precise context. This ensures that the AI understands its purpose and the nature of the responses it should generate, leading to more accurate and contextually appropriate outputs.
3. Source attribution
By instructing the AI to always include the source of its information, the prompt ensures that responses are not only informative but also verifiable, thereby increasing user trust. All three elements are combined in the sketch after this list.
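Putting the three elements together, a prompt might look like the sketch below, written with LangChain's PromptTemplate; the exact wording is illustrative rather than the template used in the demo.

```python
from langchain_core.prompts import PromptTemplate

# Role and task specification, plus the source-attribution instruction, are
# baked into the template; {context} and {question} are filled in at run time.
template = """You are a customer-support assistant for a credit union.
Answer the user's question using only the context below, replying in the
same language as the question (Portuguese or English).
Always cite the source document of any information you use.

Context:
{context}

Question: {question}

Answer:"""

prompt = PromptTemplate(input_variables=["context", "question"], template=template)
print(prompt.format(context="Loan reviews take two business days. (source: loans FAQ)",
                    question="How long does a loan review take?"))
```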
Model evaluation
Before a model can go into production, we need to ensure that it achieves a certain level of quality. This can be achieved in several ways.
To filter out potentially harmful or misleading outputs, we use Vertex AI's content filtering through LangChain's library. Below, you will find the "safety_settings" that block harmful content such as harassment, hate speech, and dangerous content. These safety settings are then applied to the chain.
Figure 2: Demonstration of safety filters
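For readers following along in code, such settings might look like the sketch below, assuming LangChain's Vertex AI chat integration and the harm categories exposed by the Vertex AI SDK; the thresholds shown are illustrative choices, not the exact values used in the demo.

```python
from langchain_google_vertexai import ChatVertexAI
from vertexai.generative_models import HarmBlockThreshold, HarmCategory

# Block content flagged as medium risk or above in each harm category.
safety_settings = {
    HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
    HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
    HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
    HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
}

# The settings are attached to the chat model, which is then used in the RAG chain.
llm = ChatVertexAI(model_name="gemini-pro", safety_settings=safety_settings)
```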
Another way to ensure quality content is through human oversight using Reinforcement Learning from Human Feedback (RLHF). When we considered adding this method to the fine-tuning process of our model, we discovered that the Gemini models are not yet supported for this type of tuning, as seen below:
Figure 3: Supported models for RLHF
We then had to decide between using the Gemini Pro model or a T5 model as the foundation of our AI agent. Given the reliability of Gemini Pro and the fact that it already comes aligned with RLHF, we believed that sticking with our original choice would deliver better results.
Finally, we evaluated the RAG system on a manually created test dataset. This was necessary because our fine-tuning dataset did not contain questions specific to the company's services, so we built a custom dataset and scored the system on coherence, fluency, and safety metrics (a sketch of this evaluation follows the metric definitions below).
The coherence metric evaluates the model's ability to provide a coherent response.
The fluency metric assesses the model's mastery of language.
The safety metric measures whether the response contains any unsafe text.
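A sketch of how such an evaluation might be launched is shown below, assuming the EvalTask interface from the Vertex AI SDK's evaluation module; the single-row dataset, column names, and sample answer are illustrative only.

```python
import pandas as pd
from vertexai.preview.evaluation import EvalTask

# A tiny stand-in for the manually created test dataset: each row pairs a
# company-specific question with the RAG system's generated response.
eval_dataset = pd.DataFrame(
    {
        "prompt": ["What documents do I need to open an account?"],
        "response": ["You need a valid ID and proof of address (source: membership FAQ)."],
    }
)

# Score the responses on the three metrics used in this evaluation.
eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=["coherence", "fluency", "safety"],
)
result = eval_task.evaluate()
print(result.summary_metrics)
```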
Figure 4: Evaluation metrics for untuned LLM
Figure 5: Evaluation metrics of the tuned LLM
Our fine-tuned LLM outperformed the untuned version on coherence and fluency, according to the metrics presented. In addition, the RAG system delivered consistently coherent answers and achieved a perfect safety score.
Author
Davy Costa
Experienced professional in Software Development, Data Engineering and Analysis, Cloud Infrastructure, Machine Learning, and Generative AI.