Local LLM Dev Platform

Sovereign AI infrastructure: fine-tune, serve, and optimize open-weight models entirely on your own hardware.

As open-weight AI models become increasingly capable, the question of how to run, adapt, and operationalize them on local hardware has become practically important — for sovereignty, for data sensitivity, and for applications where inference cannot cross a network boundary. This research project built a complete local AI infrastructure platform to investigate these questions from first principles.

The platform serves two related purposes: a fine-tuning laboratory for adapting open-weight models to specific domains using QLoRA at minimal hardware footprint, and a production-grade inference server that routes between multiple fine-tuned adapters simultaneously via vLLM's OpenAI-compatible API. Both components run on consumer hardware: two NVIDIA RTX 3080s in a Windows WSL2 environment.

The most substantive finding was the AI-guided optimization loop — a system where a fine-tuned local model analyzes the results of prior experiments and proposes parameter configurations for subsequent ones. The model acts as a domain-specific optimizer with knowledge baked into its weights, outperforming random and grid search at equivalent iteration counts.

Architecture

One platform, two layers

Fine-tuning runs through LLaMA-Factory with Unsloth's kernel optimizations — reducing memory requirements enough to fit useful 7B–14B parameter models on 10GB-VRAM consumer cards using 4-bit quantization. Multi-GPU training uses DeepSpeed ZeRO-2/3 configurations to distribute optimizer state across both cards without a managed cluster.

Training datasets are built through a naturalization pipeline: raw structured data is converted into the natural language format models reason over most effectively. A numerical record becomes a prose sentence — the format the model was pretrained to understand, and the format it will see at inference time.

The inference layer is vLLM, configured to serve multiple LoRA adapters simultaneously from a single loaded base model. One server, multiple specialized variants — domain-specific expertise available via a single OpenAI-compatible endpoint that any existing tooling can consume without modification. DuckDB handles columnar analytics over experiment histories, providing millisecond aggregate queries across thousands of recorded runs.

Technical Stack

vLLM Multi-adapter inference server
LLaMA-Factory + Unsloth QLoRA fine-tuning pipeline
PyTorch + DeepSpeed Multi-GPU training (2× RTX 3080)
PEFT + bitsandbytes LoRA adapters and 4-bit quant
DuckDB Columnar experiment analytics
Qwen 2.5 / Llama 3.1 (7B–14B) Base model family
TensorBoard + W&B Training monitoring

4-bit QLoRA quantization — 7B models on 10GB VRAM

N×1 Adapters per base model in vLLM

Research Areas

What was built and investigated

Domain Fine-Tuning

QLoRA training runs across four domains: IT support, academic tutoring, financial analysis, and code generation — each producing a distinct LoRA adapter loadable without restarting the base model.

AI-Guided Optimization

A two-phase optimization loop: Latin Hypercube Sampling explores the parameter space broadly in the first 20% of iterations, then a fine-tuned model guides exploitation — analyzing Sharpe ratio, drawdown, and win rate to propose the next configuration.

Multi-Adapter Serving

vLLM configured to serve multiple LoRA adapters from a single base model instance — exposing each via a unique model ID on an OpenAI-compatible endpoint, with no client-side changes required.

Data Naturalization

A pipeline converting raw structured records into natural language training examples — the format models reason over most effectively at inference time, improving task-specific accuracy measurably over structured prompt injection alone.

Columnar Experiment Analytics

DuckDB stores every training run, backtest result, and optimization iteration. Fast aggregate queries surface relationships across thousands of experiments — which configurations cluster, which metrics correlate, where the model's suggestions were wrong.

CLI Orchestration

A Typer-based CLI covers server lifecycle management, adapter loading and routing, dataset preparation, training invocation, and backtest orchestration — the full research workflow accessible from a single interface.

Ongoing Research

Sovereign AI is an infrastructure question.

The findings from this platform inform how we build AI-powered tools across the portfolio — particularly where data cannot leave a device or a controlled environment. The research continues.

← Back to Portfolio