
Core Features and Functionalities

LLM Fine-Tuning

Feature Description

LLM Fine-Tuning allows users to customize open-source models with proprietary data, enhancing their performance for specific use cases.

Fine-tuning techniques

Step-by-step Usage

a. Access the Data Source Manager:

  • Navigate to the "LLM Fine-Tuning" module from the left panel.
  • The Data Source Manager displays a list of available data sources.

b. Connect a New Data Source:

  • Click the "CONNECT DATA SOURCE" button in the top-right corner.
  • Follow the prompts to upload or connect your dataset.

c. Create a New Fine-Tuning Job:

  • Go to the "Jobs" tab in the top menu.
  • Click the "NEW JOB" button.
  • Provide a job name, select a model (e.g., microsoft/phi-2, google/gemma-2b), and choose a fine-tuning technique (e.g., LoRA).
  • Fine-Tuning Techniques Explained:
    • Full Fine-Tuning: Updates all model parameters for maximum customization; the most comprehensive but resource-intensive approach. Best for complete model behavior changes, abundant computing resources, and production-critical models. Resource usage: High (40+ GB VRAM).
    • LoRA (Low-Rank Adaptation): Trains only small additional parameters while keeping the original model frozen, reducing training time by 90%+. Best for limited computing resources, quick iterations, and domain-specific adaptations. Resource usage: Low (2-8 GB VRAM).
    • QLoRA (Quantized LoRA): Combines LoRA with 4-bit quantization for ultra-low memory usage. Best for very limited resources, edge deployment, and cost-sensitive projects. Resource usage: Very Low (4-6 GB VRAM).
    • AdaLoRA: An adaptive version of LoRA that automatically allocates parameters based on importance. Best for optimal efficiency, complex tasks, and cases where you are unsure about rank settings. Resource usage: Low (4-8 GB VRAM).
    • Prefix Tuning: Adds trainable tokens to the beginning of prompts without modifying model weights. Best for prompt-engineering tasks, switching between multiple tasks, and preserving the base model. Resource usage: Very Low (2-4 GB VRAM).
    • P-Tuning v2: Enhanced prefix tuning that adds trainable parameters to all model layers. Better than Prefix Tuning for smaller models; suited to NLU tasks and structured-data tasks. Resource usage: Low (4-6 GB VRAM).
    • Supervised Fine-Tuning (SFT): Traditional supervised learning on labeled examples. Best for clear input-output pairs, classification/regression tasks, and well-defined objectives. Resource usage: Medium (8-16 GB VRAM).
    • Instruction Fine-Tuning: Trains models to follow natural language instructions and commands. Best for chatbots and assistants, task automation, and user-facing applications. Resource usage: Medium (8-16 GB VRAM).
    • DPO (Direct Preference Optimization): Trains models on human preference data without a separate reward model. Best for alignment with human values, reducing harmful outputs, and optimizing overall quality rather than specific answers. Resource usage: Medium (12-20 GB VRAM).

Quick Decision Guide:

  • New to fine-tuning? → Start with LoRA
  • Building a chatbot? → Use Instruction Fine-Tuning
  • Limited GPU memory? → Choose QLoRA or Prefix Tuning
  • Need human-like responses? → Consider DPO
  • Want automatic optimization? → Try AdaLoRA
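For example, starting with LoRA as the guide suggests, the resulting job is conceptually similar to the following Hugging Face peft sketch. This is a minimal illustration, not the platform's internal code; the model name comes from the example in step c.

```python
# Minimal LoRA setup with Hugging Face transformers + peft.
# The platform configures the equivalent of this for you through the job form.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")

lora_config = LoraConfig(
    r=8,                                   # rank of the low-rank update matrices
    lora_alpha=16,                         # scaling factor applied to the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```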

d. Configure Job Parameters:

  • Set training parameters such as learning rate, batch size, number of epochs, etc.
  • Select the appropriate compute resources for your job.
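These fields correspond to standard training hyperparameters. For orientation, the same settings expressed as Hugging Face TrainingArguments might look like the sketch below; the values are illustrative starting points, not recommendations for your model or dataset.

```python
# Illustrative hyperparameter values only; tune them for your own job.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./phi2-lora-qa",        # where checkpoints are written
    learning_rate=2e-4,                 # optimizer step size
    per_device_train_batch_size=4,      # batch size per GPU
    gradient_accumulation_steps=4,      # effective batch size = 4 * 4
    num_train_epochs=3,                 # passes over the training set
    logging_steps=10,
)
```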

e. Monitor Job Progress:

Once the job is started, you can track its progress in the comprehensive monitoring dashboard:

Run Details Panel (Left)

  • Run Name: Your job identifier (e.g., "QA 1")
  • Status: Current state (FINISHED, RUNNING, FAILED)
  • User: Account running the job
  • Start/End Time: Timestamps for job execution
  • Source: Dataset being used for fine-tuning
  • Artifact URL: Location of saved model checkpoints

System & Hardware Panel (Center)

  • Platform: Infrastructure being used (e.g., Linux)
  • GPU Cores: Number of GPU cores allocated
  • System Memory: RAM usage in GB
  • GPU Details: Specific GPU model and specifications
  • NVIDIA Metrics: Real-time GPU utilization including:
    • Temperature (°C)
    • Memory usage (MB)
    • GPU utilization percentage

Metrics Panel (Right)

  • total_params: Total number of model parameters
  • runtime_seconds: Elapsed training time
  • trainable_params: Number of parameters being updated
  • final_loss: Final training loss value
  • train_steps_per_second: Training throughput
  • train_loss: Current training loss (monitor for decrease)
  • train_runtime: Total training duration
  • train_samples_per_second: Data processing speed

Parameters Panel (Bottom)

  • Displays all configured training parameters for reference
  • Includes hyperparameters like learning rate, batch size, epochs
  • Shows model-specific settings and optimization configurations
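The run view above (status, artifact URL, metrics, and parameters panels) follows MLflow's run layout. If your deployment exposes an MLflow tracking server, which is an assumption to confirm with your administrator, the same run data can also be read programmatically:

```python
# Assumes the monitoring backend is an MLflow tracking server;
# the URI and run ID below are placeholders.
from mlflow.tracking import MlflowClient

client = MlflowClient(tracking_uri="http://your-llmops-host:5000")
run = client.get_run("your-run-id")

print(run.info.status)                        # e.g. RUNNING, FINISHED, FAILED
print(run.data.metrics.get("train_loss"))     # latest logged training loss
print(run.data.metrics.get("trainable_params"))
print(run.data.params)                        # learning rate, batch size, epochs, ...
```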

Key Metrics to Watch:

  • Status = RUNNING: Job is actively training
  • train_loss decreasing: Model is learning effectively (good!)
  • GPU Memory < 95%: Healthy resource utilization
  • Temperature < 80°C: GPU operating within safe limits
  • train_steps_per_second stable: Consistent training speed

🚨 Warning Signs:

  • Status = FAILED: Check logs for error details
  • train_loss increasing: Learning rate may be too high
  • GPU Memory at 100%: Consider reducing batch size

The Jobs list also shows each job's current status at a glance.

Model Compression

Feature Description

Model Compression reduces the size of pretrained models while maintaining performance, enabling faster deployment and lower infrastructure costs.

Step-by-step Usage

a. Initiate Model Compression:

  • Navigate to the "LLM Optimizer" module from the left panel.
  • Click "NEW JOB" to start a new compression task.

b. Select Dataset:

  • Choose from available datasets in various formats (e.g., Q&A, Chat).

c. Choose Model:

  • Select from a range of models such as LLaMA-2, Mistral, Phi-2, etc.

d. Select Fine-Tuning Technique:

  • Options include Full Fine-Tuning, LoRA, QLoRA, AdaLoRA, and more.
  • Compression & Quantization Techniques Explained (these are the options you will choose from in step g):
    • Dynamic Quantization: Quantizes weights at runtime and keeps activations in float. Size reduction: ~25-40%; speed gain: 2-3x. Best for CPU deployment and variable batch sizes.
    • Static Quantization: Pre-calibrates and quantizes both weights and activations. Size reduction: ~50-75%; speed gain: 3-4x. Best for fixed input distributions and edge devices.
    • GPTQ (Group-wise Post-Training Quantization): Advanced post-training quantization using group-wise optimization. Size reduction: ~65-75%; speed gain: 3-4x. Best for large language models and GPU inference.
    • AWQ (Activation-aware Weight Quantization): Preserves important weights based on activation patterns. Size reduction: ~60-70%; speed gain: 3-4x. Best for maintaining high accuracy in production models.
    • SmoothQuant: Smooths activation outliers for better quantization. Size reduction: ~50-60%; speed gain: 2-3x. Best for models with activation spikes, such as transformers.
    • QLoRA: Quantized LoRA fine-tuning approach. Size reduction: ~65-75%; speed gain: 2-3x. Best for fine-tuning with limited memory.
    • BitsAndBytes: 8-bit and 4-bit quantization optimized for CUDA. Size reduction: ~70-75%; speed gain: 2-4x. Best for NVIDIA GPU deployment.
    • ONNX Runtime: Cross-platform quantization for ONNX models. Size reduction: ~50-60%; speed gain: 2-3x. Best for multi-platform deployment.
    • TensorRT: NVIDIA's high-performance inference optimization. Size reduction: ~60-70%; speed gain: 4-8x. Best for NVIDIA GPU production servers.
    • TFLite: Mobile and edge device optimization. Size reduction: ~60-75%; speed gain: 3-5x. Best for Android/iOS deployment.
    • CoreML: Apple ecosystem optimization. Size reduction: ~50-65%; speed gain: 3-4x. Best for iOS/macOS applications.
    • Fake Quantization (Simulated QAT): Simulates quantization during training. Size reduction: ~40-50%; speed gain: 2x. Best for training-aware optimization.
    • GGUF Format: Efficient format for quantized models, compatible with llama.cpp. Size reduction: ~60-75%; speed gain: 2-4x. Best for local deployment.
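For orientation, the simplest entry in the table, dynamic quantization, is available directly in PyTorch. The sketch below shows what the technique does; the platform runs your chosen technique for you, so this is illustration only.

```python
# Dynamic quantization in PyTorch: Linear weights are stored in int8 and
# dequantized on the fly, while activations stay in float.  Runs on CPU.
import torch
from torch import nn

model = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 10))

quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

print(quantized)  # Linear layers are replaced by DynamicQuantizedLinear
```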

e. Set Training Parameters:

  • Configure parameters like learning rate, batch size, number of epochs, etc.

f. Choose Compute Resources:

  • Select from various compute options (AWS, GCP, Azure, etc.) based on your requirements and budget.

g. Select Quantization Technique:

  • Choose from options like Dynamic Quantization, Static Quantization, GPTQ, etc.
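As a concrete example, selecting a BitsAndBytes-style 4-bit setting corresponds conceptually to the following Hugging Face transformers configuration. This is a sketch for orientation, not the platform's internal code, and the model name is illustrative.

```python
# Load a model with 4-bit BitsAndBytes quantization (NF4 + double quantization),
# the setup QLoRA builds on.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    quantization_config=bnb_config,
    device_map="auto",
)
```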

h. AI Recommendation:

  • Optionally, use the AI Recommendation feature, which analyzes your specific configuration and suggests optimal compression settings:

How It Works:

    • Model Analysis: Examines your selected model's architecture, size, and complexity
    • Dataset Profiling: Analyzes your dataset characteristics and distribution
    • Hardware Matching: Considers your target deployment environment
    • Performance Targets: Balances size reduction with accuracy preservation

Tips & Notes

  • The AI Recommendation feature can help you choose the best compression settings based on your model and data.
  • Different compute options are suitable for various techniques and model sizes. Consider the cost and performance trade-offs when selecting.
  • Experiment with different quantization techniques to find the best balance between model size reduction and performance preservation.

App Design

Feature Description

App Design is a no-code/low-code platform that allows users to build, customize, and deploy LLM-powered and Agentic workflows using an intuitive drag-and-drop interface. It enables the creation of complex AI applications without extensive coding knowledge.

Step-by-step Usage

a. Access App Design:

  • Navigate to the "App Design" module from the left panel.

b. Create a New Workflow:

  • Click on the "New Workflow" or "+" button to start a new project.

c. Design Your Workflow:

  • Use the drag-and-drop interface to add components to your canvas.
  • Components may include:
    • LLM models
    • Prompt templates
    • Data sources
    • API integrations
    • Custom functions
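Custom functions are ordinary Python functions wired into the canvas. The registration mechanism is specific to the platform, so the following is a purely hypothetical example of the kind of function you might add:

```python
# Hypothetical custom-function component.  The function name, signature, and
# how it gets registered on the canvas are illustrative, not the module's API.
import re

def extract_order_id(text: str) -> dict:
    """Pull an order ID of the form ORD-12345 out of free text."""
    match = re.search(r"ORD-\d+", text)
    return {"order_id": match.group(0) if match else None}

# In a workflow, the input would come from the upstream node's text output and
# the returned dict would feed the next node.
print(extract_order_id("Customer asks about ORD-48213 shipping status"))
```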

d. Configure Components:

  • Click on each component to set its parameters and properties.
  • Connect components by drawing lines between their inputs and outputs.

e. Set Up Input/Output:

  • Define the input format for your application (e.g., text input, file upload).
  • Configure the desired output format (e.g., text response, generated image).

f. Test Your Workflow:

  • Use the built-in testing panel to run your workflow with sample inputs.
  • Debug and refine your design as needed.

g. Deploy Your Application:

  • Once satisfied with your workflow, click the "Deploy" button.
  • Choose deployment options (e.g., API endpoint, web interface, chat widget).
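Once a workflow is deployed as an API endpoint, client code can call it over HTTP. The sketch below is purely illustrative: the URL path, authentication header, and payload schema are assumptions, so use the values shown on your deployment's details page.

```python
# Hypothetical client call to a workflow deployed as an API endpoint.
# URL, auth header, and payload schema are placeholders.
import requests

response = requests.post(
    "https://your-llmops-host/api/workflows/<workflow-id>/run",
    headers={"Authorization": "Bearer <your-api-key>"},
    json={"input": "Summarize the attached quarterly report."},
    timeout=60,
)
response.raise_for_status()
print(response.json())
```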

Tips & Notes

  • Utilize pre-built templates and components to accelerate your development process.
  • Leverage the visual nature of App Design to create complex, multi-step AI workflows without writing code.
  • Experiment with different component combinations to achieve your desired functionality.
  • Use the version control feature to manage different iterations of your workflow.
  • Take advantage of the built-in monitoring and analytics to track your application's performance and usage.

Key Features of App Design

  • Drag-and-Drop Interface: Easily create workflows by dragging and connecting components.
  • Wide Range of Components: Access a variety of LLM models, data processing tools, and integrations.
  • Real-time Preview: Test your workflow at any stage of development.
  • Custom Function Support: Integrate your own Python functions for specialized tasks.
  • Collaborative Editing: Work with team members on the same workflow simultaneously.
  • Version Control: Keep track of changes and revert to previous versions if needed.
  • One-Click Deployment: Quickly deploy your application to production environments.

Use Cases

  • Chatbots and Conversational AI
  • Document Analysis and Summarization
  • Content Generation and SEO Optimization
  • Data Extraction and Processing Pipelines
  • Sentiment Analysis and Customer Feedback Processing
  • Automated Report Generation
  • Multi-modal AI Applications (text, image, audio)

App Design in LLMOps provides a powerful yet accessible way to create sophisticated AI applications, bridging the gap between complex LLM capabilities and practical, deployable solutions for various business needs.

Monitoring

Feature Description

The Monitoring feature in LLMOps provides comprehensive tracking and analysis of LLM usage, performance, and compliance. It offers real-time insights into system activity, token usage, request trends, and key metrics across all LLM features.

Interface Overview

a. LLM Tracing:

  • Displays detailed information about each LLM interaction, including prompts, responses, token usage, and response times.
  • Allows for easy tracking and auditing of LLM requests and responses.

b. Validation Flow:

  • A customizable sequence of checks to ensure LLM outputs are safe, relevant, and high-quality.
  • Users can drag and drop various validators to create a custom validation pipeline.

c. Monitoring Dashboard:

  • Provides visual representations of key metrics and trends.
  • Includes graphs for daily request trends, total requests by feature, and average requests by feature.
  • Offers detailed breakdowns of token usage for different LLM applications.

Key Components

a. LLM Tracing Table:

  • Columns: ID, Prompt, Response, Input Tokens, Output Tokens, Response Time, Date, and custom metrics.
  • Allows for detailed analysis of individual LLM interactions.
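Because each traced request records its input and output token counts, per-request cost can be estimated offline. Below is a minimal sketch with purely illustrative per-token prices; actual provider rates will differ.

```python
# Illustrative cost estimate from the tracing table's token counts.
# The per-token prices below are placeholders, not real provider rates.
PRICE_PER_INPUT_TOKEN = 0.50 / 1_000_000   # e.g. $0.50 per 1M input tokens
PRICE_PER_OUTPUT_TOKEN = 1.50 / 1_000_000  # e.g. $1.50 per 1M output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated cost of a single traced request."""
    return (input_tokens * PRICE_PER_INPUT_TOKEN
            + output_tokens * PRICE_PER_OUTPUT_TOKEN)

# Example row: 1,200 input tokens and 350 output tokens.
print(f"${request_cost(1200, 350):.6f}")
```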

b. Validation Flow:

  • Includes validators such as Bias Check, Gibberish Text, Detect PII, Guardrails PII, and Toxic Language.
  • Additional validators available: Detect Jailbreak, Llama Guard, Wiki Provenance, Sensitive Topic, Unusual Prompt, Saliency Check, and Restrict to Topic.

c. Monitoring Dashboard:

  • Daily Request Trends by Feature: Line graph showing usage patterns over time.
  • Total Requests by Feature: Pie chart illustrating the distribution of requests across different LLM applications.
  • Average Requests by Feature: Pie chart showing the average usage of each feature.
  • Feature-specific graphs: Bar charts and line graphs for detailed analysis of token usage and request patterns for each LLM application (e.g., Smart Chat, Brew Content, Data Dive, Text to Image).

How to Use

a. Accessing the Monitoring Feature:

  • Navigate to the "Monitoring" section from the left sidebar.

b. Analyzing LLM Tracing:

  • Review the LLM Tracing table to inspect individual requests and responses.
  • Use the table to identify patterns, issues, or anomalies in LLM interactions.

c. Configuring the Validation Flow:

  • Access the Validation Flow section.
  • Drag and drop desired validators into the flow.
  • Arrange validators in the desired order to create a custom validation pipeline.
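For intuition, the sketch below shows the chained-validator idea in plain Python. It is illustrative only: the platform's validators are configured through the drag-and-drop UI, and the checks here are toy stand-ins.

```python
# Illustrative only: a plain-Python sketch of chaining validators in order.
import re
from typing import Callable, List

def detect_ssn(text: str) -> str:
    """Toy PII check: reject anything that looks like a US SSN."""
    if re.search(r"\b\d{3}-\d{2}-\d{4}\b", text):
        raise ValueError("PII detected in LLM output")
    return text

def max_length(limit: int) -> Callable[[str], str]:
    def check(text: str) -> str:
        if len(text) > limit:
            raise ValueError(f"Output exceeds {limit} characters")
        return text
    return check

def run_validation_flow(output: str, validators: List[Callable[[str], str]]) -> str:
    # Validators run in the order they are arranged, just like in the UI.
    for validate in validators:
        output = validate(output)
    return output

llm_output = "Your order ships on Tuesday."
print(run_validation_flow(llm_output, [detect_ssn, max_length(2000)]))
```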

d. Using the Monitoring Dashboard:

  • Set the date range using the "From" and "To" date pickers at the top of the dashboard.
  • Analyze trends and patterns in the various charts and graphs.
  • Use the feature-specific graphs to dive deeper into usage patterns for individual LLM applications.

Key Features and Benefits

  • Comprehensive Tracking: Monitor all aspects of LLM usage, from individual requests to system-wide trends.
  • Customizable Validation: Ensure LLM outputs meet specific quality and safety standards with a flexible validation pipeline.
  • Real-time Insights: Get up-to-date information on system performance and usage patterns.
  • Token Usage Analysis: Track and optimize token consumption across different LLM applications.
  • Performance Metrics: Monitor response times and other key performance indicators.
  • Compliance and Governance: Use the validation flow and detailed tracing to maintain compliance with organizational policies and regulations.

Tips & Notes

  • Regularly review the Monitoring Dashboard to identify usage trends and optimize resource allocation.
  • Use the Validation Flow to implement and enforce organizational policies on LLM output quality and safety.
  • Leverage the detailed LLM Tracing data for debugging, auditing, and improving LLM applications.
  • Pay attention to token usage patterns to manage costs and improve efficiency.
  • Use the date range selector to analyze performance and usage over specific time periods.

The Monitoring feature in LLMOps provides a powerful set of tools for managing, optimizing, and governing LLM usage within your organization. By leveraging these capabilities, you can ensure the safe, efficient, and effective use of LLM technology across all your applications.