Core Features and Functionalities
LLM Fine-Tuning
Feature Description
LLM Fine-Tuning allows users to customize open-source models with proprietary data, enhancing their performance for specific use cases.
Step-by-step Usage
a. Access the Data Source Manager:
- Navigate to the "LLM Fine-Tuning" module from the left panel.
- The Data Source Manager displays a list of available data sources.
b. Connect a New Data Source:
- Click the "CONNECT DATA SOURCE" button in the top-right corner.
- Follow the prompts to upload or connect your dataset.
c. Create a New Fine-Tuning Job:
- Go to the "Jobs" tab in the top menu.
- Click the "NEW JOB" button.
- Provide a job name, select a model (e.g., microsoft/phi-2, google/gemma-2b), and choose a fine-tuning technique (e.g., LoRA).
- Fine-Tuning Techniques Explained:
| Technique | Description | Best For | Resource Usage |
| --- | --- | --- | --- |
| Full Fine-Tuning | Updates all model parameters for maximum customization. Most comprehensive but resource-intensive approach | • Complete model behavior change • Abundant computing resources • Production-critical models | High (40+ GB VRAM) |
| LoRA (Low-Rank Adaptation) | Trains only small additional parameters while keeping the original model frozen. Reduces training time by 90%+ | • Limited computing resources • Quick iterations • Domain-specific adaptations | Low (2-8 GB VRAM) |
| QLoRA (Quantized LoRA) | Combines LoRA with 4-bit quantization for ultra-low memory usage | • Very limited resources • Edge deployment • Cost-sensitive projects | Very Low (4-6 GB VRAM) |
| AdaLoRA | Adaptive version of LoRA that automatically allocates parameters based on importance | • Optimal efficiency • Complex tasks • When unsure about rank settings | Low (4-8 GB VRAM) |
| Prefix Tuning | Adds trainable tokens to the beginning of prompts without modifying model weights | • Prompt engineering tasks • Multiple task switching • Preserving base model | Very Low (2-4 GB VRAM) |
| P-Tuning v2 | Enhanced prefix tuning that adds trainable parameters to all model layers | • Better than Prefix Tuning for smaller models • NLU tasks • Structured data tasks | Low (4-6 GB VRAM) |
| Supervised Fine-Tuning (SFT) | Traditional supervised learning on labeled examples | • Clear input-output pairs • Classification/regression tasks • Well-defined objectives | Medium (8-16 GB VRAM) |
| Instruction Fine-Tuning | Trains models to follow natural language instructions and commands | • Chatbots and assistants • Task automation • User-facing applications | Medium (8-16 GB VRAM) |
| DPO (Direct Preference Optimization) | Trains models based on human preferences without reward modeling | • Alignment with human values • Reducing harmful outputs • Quality over specific answers | Medium (12-20 GB VRAM) |
Quick Decision Guide:
- New to fine-tuning? → Start with LoRA
- Building a chatbot? → Use Instruction Fine-Tuning
- Limited GPU memory? → Choose QLoRA or Prefix Tuning
- Need human-like responses? → Consider DPO
- Want automatic optimization? → Try AdaLoRA
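For orientation, the snippet below is a minimal sketch of what a LoRA setup looks like under the hood using the Hugging Face peft library. The platform configures this for you when you select LoRA in the job wizard; the rank, alpha, and target modules shown are illustrative values, not the platform's exact settings.

```python
# Minimal LoRA sketch with Hugging Face transformers + peft (illustrative only;
# the platform sets this up for you when you pick LoRA in the job wizard).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")  # model chosen in step c

lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank update matrices
    lora_alpha=32,                        # scaling factor applied to the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumption: attention projections; adjust per model
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total params are trainable
```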
d. Configure Job Parameters:
- Set training parameters such as learning rate, batch size, number of epochs, etc.
- Select the appropriate compute resources for your job.
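The form fields in this step correspond to standard trainer hyperparameters. The sketch below shows how they might map onto Hugging Face TrainingArguments; the values are placeholders to illustrate the trade-offs, not recommended settings.

```python
# Illustrative mapping of the job form fields onto standard trainer settings.
# Values are placeholders; tune them for your dataset and hardware.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./phi2-lora-run",      # where checkpoints/artifacts are written
    learning_rate=2e-4,                # common starting point for LoRA-style jobs
    per_device_train_batch_size=4,     # lower this if GPU memory is tight
    gradient_accumulation_steps=4,     # effective batch size = 4 * 4 = 16
    num_train_epochs=3,
    logging_steps=10,                  # how often train_loss is reported
    save_strategy="epoch",
)
```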
e. Monitor Job Progress:
Once the job is started, you can track its progress in the comprehensive monitoring dashboard:
Run Details Panel (Left)
- Run Name: Your job identifier (e.g., "QA 1")
- Status: Current state (FINISHED, RUNNING, FAILED)
- User: Account running the job
- Start/End Time: Timestamps for job execution
- Source: Dataset being used for fine-tuning
- Artifact URL: Location of saved model checkpoints
System & Hardware Panel (Center)
- Platform: Infrastructure being used (e.g., Linux)
- GPU Cores: Number of GPU cores allocated
- System Memory: RAM usage in GB
- GPU Details: Specific GPU model and specifications
- NVIDIA Metrics: Real-time GPU utilization including:
- Temperature (°C)
- Memory usage (MB)
- GPU utilization percentage
Metrics Panel (Right)
- total_params: Total number of model parameters
- runtime_seconds: Elapsed training time
- trainable_params: Number of parameters being updated
- final_loss: Final training loss value
- train_steps_per_second: Training throughput
- train_loss: Current training loss (monitor for decrease)
- train_runtime: Total training duration
- train_samples_per_second: Data processing speed
Parameters Panel (Bottom)
- Displays all configured training parameters for reference
- Includes hyperparameters like learning rate, batch size, epochs
- Shows model-specific settings and optimization configurations
Key Metrics to Watch:
- Status = RUNNING: Job is actively training
- train_loss decreasing: Model is learning effectively (good!)
- GPU Memory < 95%: Healthy resource utilization
- Temperature < 80°C: GPU operating within safe limits
- train_steps_per_second stable: Consistent training speed
🚨 Warning Signs:
- Status = FAILED: Check logs for error details
- train_loss increasing: Learning rate may be too high
- GPU Memory at 100%: Consider reducing batch size
You can also track the job's overall status at any time from the Jobs list.
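If you want to check the same GPU health signals from a terminal on the training host, the sketch below uses the pynvml bindings (assuming a single-GPU job on GPU 0) and mirrors the temperature and memory thresholds listed above.

```python
# Quick local check of the GPU health signals the dashboard reports
# (temperature < 80 °C, memory < 95%). Requires the pynvml / nvidia-ml-py package.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # assumption: GPU 0 runs the job

temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
util = pynvml.nvmlDeviceGetUtilizationRates(handle)

mem_pct = 100 * mem.used / mem.total
print(f"temp={temp} C  mem={mem_pct:.0f}%  gpu_util={util.gpu}%")

if temp >= 80 or mem_pct >= 95:
    print("Warning: GPU is near the limits listed above; consider a smaller batch size.")

pynvml.nvmlShutdown()
```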
Model Compression
Feature Description
Model Compression reduces the size of pretrained models while maintaining performance, enabling faster deployment and lower infrastructure costs.
Step-by-step Usage
a. Initiate Model Compression:
- Navigate to the "LLM Optimizer" module from the left panel.
- Click "NEW JOB" to start a new compression task.
b. Select Dataset:
- Choose from available datasets in various formats (e.g., Q&A, Chat).
c. Choose Model:
- Select from a range of models such as LLaMA-2, Mistral, Phi-2, etc.
d. Select Fine-Tuning Technique:
- Options include Full Fine-Tuning, LoRA, QLoRA, AdaLoRA, and more.
- Compression & Quantization Techniques Explained:
| Technique | Description | Size Reduction | Speed Gain | Best For |
| --- | --- | --- | --- | --- |
| Dynamic Quantization | Quantizes weights at runtime, keeps activations in float | ~25-40% | 2-3x | CPU deployment, variable batch sizes |
| Static Quantization | Pre-calibrates and quantizes both weights and activations | ~50-75% | 3-4x | Fixed input distributions, edge devices |
| GPTQ (Group-wise PTQ) | Advanced post-training quantization using group-wise optimization | ~65-75% | 3-4x | Large language models, GPU inference |
| AWQ (Activation-aware) | Preserves important weights based on activation patterns | ~60-70% | 3-4x | Maintaining high accuracy, production models |
| SmoothQuant | Smooths activation outliers for better quantization | ~50-60% | 2-3x | Models with activation spikes, transformers |
| QLoRA | Quantized LoRA fine-tuning approach | ~65-75% | 2-3x | Fine-tuning with limited memory |
| BitsAndBytes | 8-bit and 4-bit quantization optimized for CUDA | ~70-75% | 2-4x | NVIDIA GPU deployment |
| ONNX Runtime | Cross-platform quantization for ONNX models | ~50-60% | 2-3x | Multi-platform deployment |
| TensorRT | NVIDIA's high-performance inference optimization | ~60-70% | 4-8x | NVIDIA GPU production servers |
| TFLite | Mobile and edge device optimization | ~60-75% | 3-5x | Android/iOS deployment |
| CoreML | Apple ecosystem optimization | ~50-65% | 3-4x | iOS/macOS applications |
| Fake Quantization (Simulated QAT) | Simulates quantization during training | ~40-50% | 2x | Training-aware optimization |
| GGUF Format | Efficient format for quantized models | ~60-75% | 2-4x | Local deployment, llama.cpp compatible |
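To make the first row of the table concrete, here is a minimal sketch of dynamic quantization with PyTorch. It illustrates the technique itself, not the platform's internal implementation, and assumes a linear-heavy model with a CPU deployment target.

```python
# Dynamic quantization sketch (first row of the table): weights are stored
# in int8 and activations stay in float, quantized on the fly at runtime.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")  # example model from step c
model.eval()

quantized = torch.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},   # quantize the Linear layers, where most of the weights live
    dtype=torch.qint8,
)

# Rough size of the original float32 parameters (int8 weights are ~4x smaller per layer,
# before accounting for buffers and packing overhead).
orig_mb = sum(p.numel() * p.element_size() for p in model.parameters()) / 1e6
print(f"original parameter size ≈ {orig_mb:.0f} MB")
```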
e. Set Training Parameters:
- Configure parameters like learning rate, batch size, number of epochs, etc.
f. Choose Compute Resources:
- Select from various compute options (AWS, GCP, Azure, etc.) based on your requirements and budget.
g. Select Quantization Technique:
- Choose from options like Dynamic Quantization, Static Quantization, GPTQ, etc.
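As an illustration of the BitsAndBytes option, the sketch below loads a model in 4-bit through the transformers quantization_config interface (assuming an NVIDIA GPU); the model name and dtype are examples, not mandated settings.

```python
# 4-bit loading sketch for the BitsAndBytes technique (NVIDIA GPUs only).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # normal-float 4-bit, a common default
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute dtype used for matmuls
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",   # example model from step c
    quantization_config=bnb_config,
    device_map="auto",
)
```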
h. AI Recommendation:
- Optionally, use the AI Recommendation feature to get suggestions for optimal settings.
- It analyzes your specific configuration to suggest optimal compression settings:
How It Works:
- Model Analysis: Examines your selected model's architecture, size, and complexity
- Dataset Profiling: Analyzes your dataset characteristics and distribution
- Hardware Matching: Considers your target deployment environment
- Performance Targets: Balances size reduction with accuracy preservation
Tips & Notes
- The AI Recommendation feature can help you choose the best compression settings based on your model and data.
- Different compute options are suitable for various techniques and model sizes. Consider the cost and performance trade-offs when selecting.
- Experiment with different quantization techniques to find the best balance between model size reduction and performance preservation.
App Design
Feature Description
App Design is a no-code/low-code platform that allows users to build, customize, and deploy LLM-powered and Agentic workflows using an intuitive drag-and-drop interface. It enables the creation of complex AI applications without extensive coding knowledge.
Step-by-step Usage
a. Access App Design:
- Navigate to the "App Design" module from the left panel.
b. Create a New Workflow:
- Click on the "New Workflow" or "+" button to start a new project.
c. Design Your Workflow:
- Use the drag-and-drop interface to add components to your canvas.
- Components may include:
- LLM models
- Prompt templates
- Data sources
- API integrations
- Custom functions
d. Configure Components:
- Click on each component to set its parameters and properties.
- Connect components by drawing lines between their inputs and outputs.
e. Set Up Input/Output:
- Define the input format for your application (e.g., text input, file upload).
- Configure the desired output format (e.g., text response, generated image).
f. Test Your Workflow:
- Use the built-in testing panel to run your workflow with sample inputs.
- Debug and refine your design as needed.
g. Deploy Your Application:
- Once satisfied with your workflow, click the "Deploy" button.
- Choose deployment options (e.g., API endpoint, web interface, chat widget).
Tips & Notes
- Utilize pre-built templates and components to accelerate your development process.
- Leverage the visual nature of App Design to create complex, multi-step AI workflows without writing code.
- Experiment with different component combinations to achieve your desired functionality.
- Use the version control feature to manage different iterations of your workflow.
- Take advantage of the built-in monitoring and analytics to track your application's performance and usage.
Key Features of App Design
- Drag-and-Drop Interface: Easily create workflows by dragging and connecting components.
- Wide Range of Components: Access a variety of LLM models, data processing tools, and integrations.
- Real-time Preview: Test your workflow at any stage of development.
- Custom Function Support: Integrate your own Python functions for specialized tasks (a hypothetical sketch follows this list).
- Collaborative Editing: Work with team members on the same workflow simultaneously.
- Version Control: Keep track of changes and revert to previous versions if needed.
- One-Click Deployment: Quickly deploy your application to production environments.
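The exact function signature App Design expects is not documented in this guide, so the following is a hypothetical sketch of a custom function component: it takes an upstream component's text output and returns structured data for the next component in the workflow.

```python
# Hypothetical custom-function component. The exact signature App Design expects
# is not specified here; a function that maps one component's output to the next
# component's input is the general shape.
import re

def extract_order_ids(llm_response: str) -> dict:
    """Pull order IDs (e.g. 'ORD-12345') out of an LLM response so a downstream
    API-integration component can look them up."""
    order_ids = re.findall(r"ORD-\d+", llm_response)
    return {"order_ids": order_ids, "count": len(order_ids)}
```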
Use Cases
- Chatbots and Conversational AI
- Document Analysis and Summarization
- Content Generation and SEO Optimization
- Data Extraction and Processing Pipelines
- Sentiment Analysis and Customer Feedback Processing
- Automated Report Generation
- Multi-modal AI Applications (text, image, audio)
App Design in LLMOps provides a powerful yet accessible way to create sophisticated AI applications, bridging the gap between complex LLM capabilities and practical, deployable solutions for various business needs.
Monitoring
Feature Description
The Monitoring feature in LLMOps provides comprehensive tracking and analysis of LLM usage, performance, and compliance. It offers real-time insights into system activity, token usage, request trends, and key metrics across all LLM features.
Interface Overview
a. LLM Tracing:
- Displays detailed information about each LLM interaction, including prompts, responses, token usage, and response times.
- Allows for easy tracking and auditing of LLM requests and responses.
b. Validation Flow:
- A customizable sequence of checks to ensure LLM outputs are safe, relevant, and high-quality.
- Users can drag and drop various validators to create a custom validation pipeline.
c. Monitoring Dashboard:
- Provides visual representations of key metrics and trends.
- Includes graphs for daily request trends, total requests by feature, and average requests by feature.
- Offers detailed breakdowns of token usage for different LLM applications.
Key Components
a. LLM Tracing Table:
- Columns: ID, Prompt, Response, Input Tokens, Output Tokens, Response Time, Date, and custom metrics.
- Allows for detailed analysis of individual LLM interactions.
b. Validation Flow:
- Includes validators such as Bias Check, Gibberish Text, Detect PII, Guardrails PII, and Toxic Language.
- Additional validators available: Detect Jailbreak, Llama Guard, Wiki Provenance, Sensitive Topic, Unusual Prompt, Saliency Check, and Restrict to Topic (a conceptual pipeline sketch follows this list).
c. Monitoring Dashboard:
- Daily Request Trends by Feature: Line graph showing usage patterns over time.
- Total Requests by Feature: Pie chart illustrating the distribution of requests across different LLM applications.
- Average Requests by Feature: Pie chart showing the average usage of each feature.
- Feature-specific graphs: Bar charts and line graphs for detailed analysis of token usage and request patterns for each LLM application (e.g., Smart Chat, Brew Content, Data Dive, Text to Image).
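Conceptually, the Validation Flow behaves like an ordered chain of checks, each of which passes the output or reports a reason for rejection. The sketch below is a simplified stand-in for the built-in validators (the PII and toxicity logic is deliberately naive), not the platform's implementation.

```python
# Conceptual sketch of a validation pipeline: each validator inspects an LLM
# output and either passes it or reports a reason. The validator logic below is
# a simplified stand-in for the built-in validators listed above.
import re
from typing import Callable, List, Tuple

Validator = Callable[[str], Tuple[bool, str]]  # returns (passed, reason)

def detect_pii(text: str) -> Tuple[bool, str]:
    has_email = bool(re.search(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b", text))
    return (not has_email, "email address found" if has_email else "ok")

def toxic_language(text: str) -> Tuple[bool, str]:
    blocked = {"hate", "kill"}  # placeholder word list, not a real toxicity model
    hit = next((w for w in blocked if w in text.lower()), None)
    return (hit is None, f"blocked term: {hit}" if hit else "ok")

def run_validation_flow(output: str, validators: List[Validator]) -> List[str]:
    """Run validators in the configured order; return reasons for any failures."""
    failures = []
    for validate in validators:
        passed, reason = validate(output)
        if not passed:
            failures.append(reason)
    return failures

print(run_validation_flow("Contact me at jane@example.com", [detect_pii, toxic_language]))
# -> ['email address found']
```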
How to Use
a. Accessing the Monitoring Feature:
- Navigate to the "Monitoring" section from the left sidebar.
b. Analyzing LLM Tracing:
- Review the LLM Tracing table to inspect individual requests and responses.
- Use the table to identify patterns, issues, or anomalies in LLM interactions.
c. Configuring the Validation Flow:
- Access the Validation Flow section.
- Drag and drop desired validators into the flow.
- Arrange validators in the desired order to create a custom validation pipeline.
d. Using the Monitoring Dashboard:
- Set the date range using the "From" and "To" date pickers at the top of the dashboard.
- Analyze trends and patterns in the various charts and graphs.
- Use the feature-specific graphs to dive deeper into usage patterns for individual LLM applications.
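If you can export the LLM Tracing table (for example as a CSV with the columns listed under Key Components, plus a Feature column, which is an assumption here), the dashboard aggregations can also be reproduced offline. A short pandas sketch:

```python
# Offline analysis sketch: assumes an export of the LLM Tracing table to CSV
# with the columns listed above plus a "Feature" column (an assumption here).
import pandas as pd

df = pd.read_csv("llm_tracing_export.csv", parse_dates=["Date"])  # hypothetical export file

# Daily request trend by feature (mirrors the dashboard line graph)
daily = df.groupby([df["Date"].dt.date, "Feature"]).size().unstack(fill_value=0)

# Token usage per feature (input + output), useful for cost tracking
tokens = (df.assign(total_tokens=df["Input Tokens"] + df["Output Tokens"])
            .groupby("Feature")["total_tokens"].sum()
            .sort_values(ascending=False))

print(daily.tail())
print(tokens)
```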
Key Features and Benefits
- Comprehensive Tracking: Monitor all aspects of LLM usage, from individual requests to system-wide trends.
- Customizable Validation: Ensure LLM outputs meet specific quality and safety standards with a flexible validation pipeline.
- Real-time Insights: Get up-to-date information on system performance and usage patterns.
- Token Usage Analysis: Track and optimize token consumption across different LLM applications.
- Performance Metrics: Monitor response times and other key performance indicators.
- Compliance and Governance: Use the validation flow and detailed tracing to maintain compliance with organizational policies and regulations.
Tips & Notes
- Regularly review the Monitoring Dashboard to identify usage trends and optimize resource allocation.
- Use the Validation Flow to implement and enforce organizational policies on LLM output quality and safety.
- Leverage the detailed LLM Tracing data for debugging, auditing, and improving LLM applications.
- Pay attention to token usage patterns to manage costs and improve efficiency.
- Use the date range selector to analyze performance and usage over specific time periods.
The Monitoring feature in LLMOps provides a powerful set of tools for managing, optimizing, and governing LLM usage within your organization. By leveraging these capabilities, you can ensure the safe, efficient, and effective use of LLM technology across all your applications.