Core Features and Functionalities
LLM Fine-Tuning
Feature Description
LLM Fine-Tuning allows users to customize open-source models with proprietary data, enhancing their performance for specific use cases.
Step-by-step Usage
a. Access the Data Source Manager:
- Navigate to the "LLM Fine-Tuning" module from the left panel.
- The Data Source Manager displays a list of available data sources.
b. Connect a New Data Source:
- Click the "CONNECT DATA SOURCE" button in the top-right corner.
- Follow the prompts to upload or connect your dataset.
c. Create a New Fine-Tuning Job:
- Go to the "Jobs" tab in the top menu.
- Click the "NEW JOB" button.
- Provide a job name, select a model (e.g., microsoft/phi-2, google/gemma-2b), and choose a fine-tuning technique (e.g., LoRA).
- Fine-Tuning Techniques Explained:
| Technique | Description | Best For | Resource Usage |
| --- | --- | --- | --- |
| Full Fine-Tuning | Updates all model parameters for maximum customization. Most comprehensive but resource-intensive approach | • Complete model behavior change • Abundant computing resources • Production-critical models | High (40+ GB VRAM) |
| LoRA (Low-Rank Adaptation) | Trains only small additional parameters while keeping the original model frozen. Reduces training time by 90%+ | • Limited computing resources • Quick iterations • Domain-specific adaptations | Low (2-8 GB VRAM) |
| QLoRA (Quantized LoRA) | Combines LoRA with 4-bit quantization for ultra-low memory usage | • Very limited resources • Edge deployment • Cost-sensitive projects | Very Low (4-6 GB VRAM) |
| AdaLoRA | Adaptive version of LoRA that automatically allocates parameters based on importance | • Optimal efficiency • Complex tasks • When unsure about rank settings | Low (4-8 GB VRAM) |
| Prefix Tuning | Adds trainable tokens to the beginning of prompts without modifying model weights | • Prompt engineering tasks • Multiple task switching • Preserving base model | Very Low (2-4 GB VRAM) |
| P-Tuning v2 | Enhanced prefix tuning that adds trainable parameters to all model layers | • Better than Prefix Tuning for smaller models • NLU tasks • Structured data tasks | Low (4-6 GB VRAM) |
| Supervised Fine-Tuning (SFT) | Traditional supervised learning on labeled examples | • Clear input-output pairs • Classification/regression tasks • Well-defined objectives | Medium (8-16 GB VRAM) |
| Instruction Fine-Tuning | Trains models to follow natural language instructions and commands | • Chatbots and assistants • Task automation • User-facing applications | Medium (8-16 GB VRAM) |
| DPO (Direct Preference Optimization) | Trains models based on human preferences without reward modeling | • Alignment with human values • Reducing harmful outputs • Quality over specific answers | Medium (12-20 GB VRAM) |
Quick Decision Guide:
- New to fine-tuning? → Start with LoRA
- Building a chatbot? → Use Instruction Fine-Tuning
- Limited GPU memory? → Choose QLoRA or Prefix Tuning
- Need human-like responses? → Consider DPO
- Want automatic optimization? → Try AdaLoRA
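For orientation, the snippet below is a minimal sketch of what a LoRA setup looks like under the hood using the Hugging Face peft library. The platform configures this for you when you select LoRA in the job wizard; the rank, alpha, and target modules shown are illustrative values, not the platform's exact settings.

```python
# Minimal LoRA sketch with Hugging Face transformers + peft (illustrative only;
# the platform sets this up for you when you pick LoRA in the job wizard).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")  # model chosen in step c

lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank update matrices
    lora_alpha=32,                        # scaling factor applied to the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumption: attention projections; adjust per model
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total params are trainable
```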
d. Configure Job Parameters:
- Set training parameters such as learning rate, batch size, number of epochs, etc.
- Select the appropriate compute resources for your job.
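The form fields in this step correspond to standard trainer hyperparameters. The sketch below shows how they might map onto Hugging Face TrainingArguments; the values are placeholders to illustrate the trade-offs, not recommended settings.

```python
# Illustrative mapping of the job form fields onto standard trainer settings.
# Values are placeholders; tune them for your dataset and hardware.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./phi2-lora-run",      # where checkpoints/artifacts are written
    learning_rate=2e-4,                # common starting point for LoRA-style jobs
    per_device_train_batch_size=4,     # lower this if GPU memory is tight
    gradient_accumulation_steps=4,     # effective batch size = 4 * 4 = 16
    num_train_epochs=3,
    logging_steps=10,                  # how often train_loss is reported
    save_strategy="epoch",
)
```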
e. Monitor Job Progress:
Once the job is started, you can track its progress in the comprehensive monitoring dashboard:
Run Details Panel (Left)
- Run Name: Your job identifier (e.g., "QA 1")
- Status: Current state (FINISHED, RUNNING, FAILED)
- User: Account running the job
- Start/End Time: Timestamps for job execution
- Source: Dataset being used for fine-tuning
- Artifact URL: Location of saved model checkpoints
System & Hardware Panel (Center)
- Platform: Infrastructure being used (e.g., Linux)
- GPU Cores: Number of GPU cores allocated
- System Memory: RAM usage in GB
- GPU Details: Specific GPU model and specifications
- NVIDIA Metrics: Real-time GPU utilization including:
- Temperature (°C)
- Memory usage (MB)
- GPU utilization percentage
Metrics Panel (Right)
- total_params: Total number of model parameters
- runtime_seconds: Elapsed training time
- trainable_params: Number of parameters being updated
- final_loss: Final training loss value
- train_steps_per_second: Training throughput
- train_loss: Current training loss (monitor for decrease)
- train_runtime: Total training duration
- train_samples_per_second: Data processing speed
Parameters Panel (Bottom)
- Displays all configured training parameters for reference
- Includes hyperparameters like learning rate, batch size, epochs
- Shows model-specific settings and optimization configurations
Key Metrics to Watch:
- Status = RUNNING: Job is actively training
- train_loss decreasing: Model is learning effectively (good!)
- GPU Memory < 95%: Healthy resource utilization
- Temperature < 80°C: GPU operating within safe limits
- train_steps_per_second stable: Consistent training speed
🚨 Warning Signs:
- Status = FAILED: Check logs for error details
- train_loss increasing: Learning rate may be too high
- GPU Memory at 100%: Consider reducing batch size
You can also track the job's overall status at any time from the Jobs list.
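If you want to check the same GPU health signals from a terminal on the training host, the sketch below uses the pynvml bindings (assuming a single-GPU job on GPU 0) and mirrors the temperature and memory thresholds listed above.

```python
# Quick local check of the GPU health signals the dashboard reports
# (temperature < 80 °C, memory < 95%). Requires the pynvml / nvidia-ml-py package.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # assumption: GPU 0 runs the job

temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
util = pynvml.nvmlDeviceGetUtilizationRates(handle)

mem_pct = 100 * mem.used / mem.total
print(f"temp={temp} C  mem={mem_pct:.0f}%  gpu_util={util.gpu}%")

if temp >= 80 or mem_pct >= 95:
    print("Warning: GPU is near the limits listed above; consider a smaller batch size.")

pynvml.nvmlShutdown()
```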
Model Compression
Feature Description
Model Compression reduces the size of pretrained models while maintaining performance, enabling faster deployment and lower infrastructure costs.
Step-by-step Usage
a. Initiate Model Compression:
- Navigate to the "LLM Optimizer" module from the left panel.
- Click "NEW JOB" to start a new compression task.
b. Select Dataset:
- Choose from available datasets in various formats (e.g., Q&A, Chat).
c. Choose Model:
- Select from a range of models such as LLaMA-2, Mistral, Phi-2, etc.
d. Select Fine-Tuning Technique:
- Options include Full Fine-Tuning, LoRA, QLoRA, AdaLoRA, and more.
- Compression & Quantization Techniques Explained:
| Technique | Description | Size Reduction | Speed Gain | Best For |
| --- | --- | --- | --- | --- |
| Dynamic Quantization | Quantizes weights at runtime, keeps activations in float | ~25-40% | 2-3x | CPU deployment, variable batch sizes |
| Static Quantization | Pre-calibrates and quantizes both weights and activations | ~50-75% | 3-4x | Fixed input distributions, edge devices |
| GPTQ (Group-wise PTQ) | Advanced post-training quantization using group-wise optimization | ~65-75% | 3-4x | Large language models, GPU inference |
| AWQ (Activation-aware) | Preserves important weights based on activation patterns | ~60-70% | 3-4x | Maintaining high accuracy, production models |
| SmoothQuant | Smooths activation outliers for better quantization | ~50-60% | 2-3x | Models with activation spikes, transformers |
| QLoRA | Quantized LoRA fine-tuning approach | ~65-75% | 2-3x | Fine-tuning with limited memory |
| BitsAndBytes | 8-bit and 4-bit quantization optimized for CUDA | ~70-75% | 2-4x | NVIDIA GPU deployment |
| ONNX Runtime | Cross-platform quantization for ONNX models | ~50-60% | 2-3x | Multi-platform deployment |
| TensorRT | NVIDIA's high-performance inference optimization | ~60-70% | 4-8x | NVIDIA GPU production servers |
| TFLite | Mobile and edge device optimization | ~60-75% | 3-5x | Android/iOS deployment |
| CoreML | Apple ecosystem optimization | ~50-65% | 3-4x | iOS/macOS applications |
| Fake Quantization (Simulated QAT) | Simulates quantization during training | ~40-50% | 2x | Training-aware optimization |
| GGUF Format | Efficient format for quantized models | ~60-75% | 2-4x | Local deployment, llama.cpp compatible |
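To make the first row of the table concrete, here is a minimal sketch of dynamic quantization with PyTorch. It illustrates the technique itself, not the platform's internal implementation, and assumes a linear-heavy model with a CPU deployment target.

```python
# Dynamic quantization sketch (first row of the table): weights are stored
# in int8 and activations stay in float, quantized on the fly at runtime.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")  # example model from step c
model.eval()

quantized = torch.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},   # quantize the Linear layers, where most of the weights live
    dtype=torch.qint8,
)

# Rough size of the original float32 parameters (int8 weights are ~4x smaller per layer,
# before accounting for buffers and packing overhead).
orig_mb = sum(p.numel() * p.element_size() for p in model.parameters()) / 1e6
print(f"original parameter size ≈ {orig_mb:.0f} MB")
```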
e. Set Training Parameters:
- Configure parameters like learning rate, batch size, number of epochs, etc.
f. Choose Compute Resources:
- Select from various compute options (AWS, GCP, Azure, etc.) based on your requirements and budget.
g. Select Quantization Technique:
- Choose from options like Dynamic Quantization, Static Quantization, GPTQ, etc.
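As an illustration of the BitsAndBytes option, the sketch below loads a model in 4-bit through the transformers quantization_config interface (assuming an NVIDIA GPU); the model name and dtype are examples, not mandated settings.

```python
# 4-bit loading sketch for the BitsAndBytes technique (NVIDIA GPUs only).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # normal-float 4-bit, a common default
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute dtype used for matmuls
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",   # example model from step c
    quantization_config=bnb_config,
    device_map="auto",
)
```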
h. AI Recommendation:
- Optionally, use the AI Recommendation feature to get suggestions for optimal settings.
- It analyzes your specific configuration to suggest optimal compression settings:
How It Works:
- Model Analysis: Examines your selected model's architecture, size, and complexity
- Dataset Profiling: Analyzes your dataset characteristics and distribution
- Hardware Matching: Considers your target deployment environment
- Performance Targets: Balances size reduction with accuracy preservation
Tips & Notes
- The AI Recommendation feature can help you choose the best compression settings based on your model and data.
- Different compute options are suitable for various techniques and model sizes. Consider the cost and performance trade-offs when selecting.
- Experiment with different quantization techniques to find the best balance between model size reduction and performance preservation.
App Design
Feature Description
App Design is a no-code/low-code platform that allows users to build, customize, and deploy LLM-powered and Agentic workflows using an intuitive drag-and-drop interface. It enables the creation of complex AI applications without extensive coding knowledge.
Step-by-step Usage
a. Access App Design:
- Navigate to the "App Design" module from the left panel.
b. Create a New Workflow:
- Click on the "New Workflow" or "+" button to start a new project.
c. Design Your Workflow:
- Use the drag-and-drop interface to add components to your canvas.
- Components may include:
- LLM models
- Prompt templates
- Data sources
- API integrations
- Custom functions
d. Configure Components:
- Click on each component to set its parameters and properties.
- Connect components by drawing lines between their inputs and outputs.
e. Set Up Input/Output:
- Define the input format for your application (e.g., text input, file upload).
- Configure the desired output format (e.g., text response, generated image).
f. Test Your Workflow:
- Use the built-in testing panel to run your workflow with sample inputs.
- Debug and refine your design as needed.
g. Deploy Your Application:
- Once satisfied with your workflow, click the "Deploy" button.
- Choose deployment options (e.g., API endpoint, web interface, chat widget).
Tips & Notes
- Utilize pre-built templates and components to accelerate your development process.
- Leverage the visual nature of App Design to create complex, multi-step AI workflows without writing code.
- Experiment with different component combinations to achieve your desired functionality.
- Use the version control feature to manage different iterations of your workflow.
- Take advantage of the built-in monitoring and analytics to track your application's performance and usage.
Key Features of App Design
- Drag-and-Drop Interface: Easily create workflows by dragging and connecting components.
- Wide Range of Components: Access a variety of LLM models, data processing tools, and integrations.
- Real-time Preview: Test your workflow at any stage of development.
- Custom Function Support: Integrate your own Python functions for specialized tasks (a hypothetical sketch follows this list).
- Collaborative Editing: Work with team members on the same workflow simultaneously.
- Version Control: Keep track of changes and revert to previous versions if needed.
- One-Click Deployment: Quickly deploy your application to production environments.
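The exact function signature App Design expects is not documented in this guide, so the following is a hypothetical sketch of a custom function component: it takes an upstream component's text output and returns structured data for the next component in the workflow.

```python
# Hypothetical custom-function component. The exact signature App Design expects
# is not specified here; a function that maps one component's output to the next
# component's input is the general shape.
import re

def extract_order_ids(llm_response: str) -> dict:
    """Pull order IDs (e.g. 'ORD-12345') out of an LLM response so a downstream
    API-integration component can look them up."""
    order_ids = re.findall(r"ORD-\d+", llm_response)
    return {"order_ids": order_ids, "count": len(order_ids)}
```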
Use Cases
- Chatbots and Conversational AI
- Document Analysis and Summarization
- Content Generation and SEO Optimization
- Data Extraction and Processing Pipelines
- Sentiment Analysis and Customer Feedback Processing
- Automated Report Generation
- Multi-modal AI Applications (text, image, audio)
App Design in LLMOps provides a powerful yet accessible way to create sophisticated AI applications, bridging the gap between complex LLM capabilities and practical, deployable solutions for various business needs.
Monitoring
Feature Description
The Monitoring feature in LLMOps provides comprehensive tracking and analysis of LLM usage, performance, and compliance. It offers real-time insights into system activity, token usage, request trends, and key metrics across all LLM features.
Interface Overview
a. LLM Tracing:
- Displays detailed information about each LLM interaction, including prompts, responses, token usage, and response times.
- Allows for easy tracking and auditing of LLM requests and responses.
b. Validation Flow:
- A customizable sequence of checks to ensure LLM outputs are safe, relevant, and high-quality.
- Users can drag and drop various validators to create a custom validation pipeline.
c. Monitoring Dashboard:
- Provides visual representations of key metrics and trends.
- Includes graphs for daily request trends, total requests by feature, and average requests by feature.
- Offers detailed breakdowns of token usage for different LLM applications.
Key Components
a. LLM Tracing Table:
- Columns: ID, Prompt, Response, Input Tokens, Output Tokens, Response Time, Date, and custom metrics.
- Allows for detailed analysis of individual LLM interactions.
b. Validation Flow:
- Includes validators such as Bias Check, Gibberish Text, Detect PII, Guardrails PII, and Toxic Language.
- Additional validators available: Detect Jailbreak, Llama Guard, Wiki Provenance, Sensitive Topic, Unusual Prompt, Saliency Check, and Restrict to Topic (a conceptual pipeline sketch follows this list).
c. Monitoring Dashboard:
- Daily Request Trends by Feature: Line graph showing usage patterns over time.
- Total Requests by Feature: Pie chart illustrating the distribution of requests across different LLM applications.
- Average Requests by Feature: Pie chart showing the average usage of each feature.
- Feature-specific graphs: Bar charts and line graphs for detailed analysis of token usage and request patterns for each LLM application (e.g., Smart Chat, Brew Content, Data Dive, Text to Image).
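Conceptually, the Validation Flow behaves like an ordered chain of checks, each of which passes the output or reports a reason for rejection. The sketch below is a simplified stand-in for the built-in validators (the PII and toxicity logic is deliberately naive), not the platform's implementation.

```python
# Conceptual sketch of a validation pipeline: each validator inspects an LLM
# output and either passes it or reports a reason. The validator logic below is
# a simplified stand-in for the built-in validators listed above.
import re
from typing import Callable, List, Tuple

Validator = Callable[[str], Tuple[bool, str]]  # returns (passed, reason)

def detect_pii(text: str) -> Tuple[bool, str]:
    has_email = bool(re.search(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b", text))
    return (not has_email, "email address found" if has_email else "ok")

def toxic_language(text: str) -> Tuple[bool, str]:
    blocked = {"hate", "kill"}  # placeholder word list, not a real toxicity model
    hit = next((w for w in blocked if w in text.lower()), None)
    return (hit is None, f"blocked term: {hit}" if hit else "ok")

def run_validation_flow(output: str, validators: List[Validator]) -> List[str]:
    """Run validators in the configured order; return reasons for any failures."""
    failures = []
    for validate in validators:
        passed, reason = validate(output)
        if not passed:
            failures.append(reason)
    return failures

print(run_validation_flow("Contact me at jane@example.com", [detect_pii, toxic_language]))
# -> ['email address found']
```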
How to Use
a. Accessing the Monitoring Feature:
- Navigate to the "Monitoring" section from the left sidebar.
b. Analyzing LLM Tracing:
- Review the LLM Tracing table to inspect individual requests and responses.
- Use the table to identify patterns, issues, or anomalies in LLM interactions.
c. Configuring the Validation Flow:
- Access the Validation Flow section.
- Drag and drop desired validators into the flow.
- Arrange validators in the desired order to create a custom validation pipeline.
d. Using the Monitoring Dashboard:
- Set the date range using the "From" and "To" date pickers at the top of the dashboard.
- Analyze trends and patterns in the various charts and graphs.
- Use the feature-specific graphs to dive deeper into usage patterns for individual LLM applications.
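If you can export the LLM Tracing table (for example as a CSV with the columns listed under Key Components, plus a Feature column, which is an assumption here), the dashboard aggregations can also be reproduced offline. A short pandas sketch:

```python
# Offline analysis sketch: assumes an export of the LLM Tracing table to CSV
# with the columns listed above plus a "Feature" column (an assumption here).
import pandas as pd

df = pd.read_csv("llm_tracing_export.csv", parse_dates=["Date"])  # hypothetical export file

# Daily request trend by feature (mirrors the dashboard line graph)
daily = df.groupby([df["Date"].dt.date, "Feature"]).size().unstack(fill_value=0)

# Token usage per feature (input + output), useful for cost tracking
tokens = (df.assign(total_tokens=df["Input Tokens"] + df["Output Tokens"])
            .groupby("Feature")["total_tokens"].sum()
            .sort_values(ascending=False))

print(daily.tail())
print(tokens)
```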
Key Features and Benefits
- Comprehensive Tracking: Monitor all aspects of LLM usage, from individual requests to system-wide trends.
- Customizable Validation: Ensure LLM outputs meet specific quality and safety standards with a flexible validation pipeline.
- Real-time Insights: Get up-to-date information on system performance and usage patterns.
- Token Usage Analysis: Track and optimize token consumption across different LLM applications.
- Performance Metrics: Monitor response times and other key performance indicators.
- Compliance and Governance: Use the validation flow and detailed tracing to maintain compliance with organizational policies and regulations.
Tips & Notes
- Regularly review the Monitoring Dashboard to identify usage trends and optimize resource allocation.
- Use the Validation Flow to implement and enforce organizational policies on LLM output quality and safety.
- Leverage the detailed LLM Tracing data for debugging, auditing, and improving LLM applications.
- Pay attention to token usage patterns to manage costs and improve efficiency.
- Use the date range selector to analyze performance and usage over specific time periods.
The Monitoring feature in LLMOps provides a powerful set of tools for managing, optimizing, and governing LLM usage within your organization. By leveraging these capabilities, you can ensure the safe, efficient, and effective use of LLM technology across all your applications.