Ollama

Run LLMs locally with no API keys

Local AI inference, full privacy

Run powerful LLMs locally on your machine. No API keys, no cloud, no costs - just privacy and control.


Setup

1. Install Ollama

First, install Ollama on your machine:

# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.ai/install.sh | sh

# Windows: Download from https://ollama.ai

2. Pull a Model

# Start Ollama server
ollama serve

# Pull a model (in another terminal)
ollama pull llama3.1        # Best for chat & tools (~4.7GB)
ollama pull llava           # For vision support (~4.7GB)
ollama pull qwen2.5:1.5b    # Lightweight option (~986MB)

3. Install Packages

npm install @yourgpt/copilot-sdk @yourgpt/llm-sdk

No additional provider SDK is required: the Ollama integration uses native fetch, so there is no openai package or other vendor SDK to install.
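Because Ollama exposes a plain HTTP API, native fetch is all the integration needs. A minimal sketch of the kind of request made under the hood (the endpoint and payload shape follow Ollama's public /api/chat API; buildChatRequest is an illustrative helper, not part of the SDK):

```typescript
// Shape of a message in Ollama's native /api/chat payload.
interface ChatMessage {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

// Build the JSON body for Ollama's /api/chat endpoint.
// Illustrative helper -- not an SDK export.
function buildChatRequest(model: string, messages: ChatMessage[]) {
  return { model, messages, stream: true };
}

const body = buildChatRequest('llama3.1', [
  { role: 'user', content: 'Hello!' },
]);

// Plain fetch is enough -- no provider SDK:
// await fetch('http://localhost:11434/api/chat', {
//   method: 'POST',
//   headers: { 'Content-Type': 'application/json' },
//   body: JSON.stringify(body),
// });
```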

4. Usage

import { createOllama } from '@yourgpt/llm-sdk/ollama';

const ollama = createOllama();
const model = ollama('llama3.1');

for await (const event of model.stream({
  messages: [{ id: '1', role: 'user', content: 'Hello!' }],
})) {
  if (event.type === 'message:delta') {
    process.stdout.write(event.content);
  }
}

5. Streaming (API Route)

app/api/chat/route.ts
import { createOllama } from '@yourgpt/llm-sdk/ollama';

const ollama = createOllama();

export async function POST(req: Request) {
  const { messages } = await req.json();

  const model = ollama('llama3.1');

  const stream = new ReadableStream({
    async start(controller) {
      for await (const event of model.stream({
        messages,
        system: 'You are a helpful assistant.',
      })) {
        if (event.type === 'message:delta') {
          controller.enqueue(new TextEncoder().encode(event.content));
        }
      }
      controller.close();
    },
  });

  return new Response(stream, {
    headers: { 'Content-Type': 'text/plain' },
  });
}

Available Models

Model            Vision  Tools  Context  Size
llama3.1         -       Yes    128k     ~4.7GB
llama3.2-vision  Yes     -      128k     ~4.7GB
llava            Yes     -      4k       ~4.7GB
mistral          -       Yes    8k       ~4.1GB
mixtral          -       Yes    32k      ~26GB
qwen2.5:1.5b     -       Yes    32k      ~986MB
deepseek         -       -      16k      ~4GB
codellama        -       -      16k      ~3.8GB
// Use any Ollama model
ollama('llama3.1')           // General purpose, tool support
ollama('llava')              // Vision capable
ollama('mistral')            // Fast, good for coding
ollama('qwen2.5:1.5b')       // Lightweight, great for testing

Ollama-Specific Options

Ollama supports unique configuration options for fine-tuning model behavior:

import { createOllama } from '@yourgpt/llm-sdk/ollama';

const ollama = createOllama({
  baseUrl: 'http://localhost:11434',  // Custom server URL
  options: {
    // Context & Performance
    num_ctx: 8192,           // Context window size
    num_batch: 512,          // Batch size for processing
    num_gpu: 1,              // Number of GPUs to use

    // Sampling
    temperature: 0.7,        // Creativity (0.0-2.0)
    top_p: 0.9,              // Nucleus sampling
    top_k: 40,               // Top-k sampling

    // Repetition Control
    repeat_penalty: 1.1,     // Penalize repetition
    repeat_last_n: 64,       // Look back window

    // Advanced
    mirostat: 0,             // Mirostat sampling (0, 1, or 2)
    mirostat_eta: 0.1,       // Mirostat learning rate
    mirostat_tau: 5.0,       // Mirostat target entropy
    seed: 42,                // For reproducible outputs
  },
});

Tool Calling

Ollama supports tool calling with compatible models (llama3.1, mistral, qwen2):

import { createOllama } from '@yourgpt/llm-sdk/ollama';

const ollama = createOllama();
const model = ollama('llama3.1');

for await (const event of model.stream({
  messages: [{ id: '1', role: 'user', content: "What's the weather in San Francisco?" }],
  actions: [
    {
      name: 'get_weather',
      description: 'Get current weather for a location',
      parameters: {
        city: { type: 'string', required: true, description: 'City name' },
      },
      handler: async ({ city }) => {
        return { temperature: 65, condition: 'foggy', city };
      },
    },
  ],
})) {
  if (event.type === 'action:start') {
    console.log(`\nTool called: ${event.name}`);
  }
  if (event.type === 'action:result') {
    console.log(`Result: ${JSON.stringify(event.result)}`);
  }
  if (event.type === 'message:delta') {
    process.stdout.write(event.content);
  }
}

Not all models support tool calling. Use llama3.1, mistral, or qwen2 for best results.


Vision

Use vision-capable models like LLaVA to analyze images:

import { createOllama } from '@yourgpt/llm-sdk/ollama';
import { readFileSync } from 'fs';

const ollama = createOllama();
const model = ollama('llava');

// Read image and convert to base64
const imageBuffer = readFileSync('./image.png');
const base64Image = imageBuffer.toString('base64');

for await (const event of model.stream({
  messages: [
    {
      id: '1',
      role: 'user',
      content: 'What do you see in this image?',
      metadata: {
        attachments: [
          {
            type: 'image',
            data: base64Image,
            mimeType: 'image/png',
          },
        ],
      },
    },
  ],
})) {
  if (event.type === 'message:delta') {
    process.stdout.write(event.content);
  }
}

With Copilot UI

Use with the Copilot React components:

app/providers.tsx
'use client';

import { CopilotProvider } from '@yourgpt/copilot-sdk/react';

export function Providers({ children }: { children: React.ReactNode }) {
  return (
    <CopilotProvider runtimeUrl="/api/chat">
      {children}
    </CopilotProvider>
  );
}
app/api/chat/route.ts
import { createOllama } from '@yourgpt/llm-sdk/ollama';

const ollama = createOllama();

export async function POST(req: Request) {
  const { messages } = await req.json();
  const model = ollama('llama3.1');

  const stream = new ReadableStream({
    async start(controller) {
      for await (const event of model.stream({ messages })) {
        if (event.type === 'message:delta') {
          controller.enqueue(new TextEncoder().encode(event.content));
        }
      }
      controller.close();
    },
  });

  return new Response(stream, {
    headers: { 'Content-Type': 'text/plain' },
  });
}

Why Ollama?

Benefit         Description
Privacy         All data stays on your machine - nothing sent to external servers
No API Keys     No billing, no rate limits, no account required
Offline         Works completely offline once models are downloaded
No Costs        Run unlimited inferences without paying per token
Fast Iteration  No network latency for local development
Customizable    Fine-tune with Ollama modelfiles

Ollama is perfect for development, testing, privacy-sensitive applications, and air-gapped environments.


Environment Variables

Variable         Default                 Description
OLLAMA_BASE_URL  http://localhost:11434  Ollama server URL
OLLAMA_MODEL     llama3.1                Default model
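A sketch of how these defaults might be resolved in code. The variable names and fallbacks match the table above; the helper itself is illustrative, not the SDK's actual internals:

```typescript
// Resolve Ollama settings from the environment, falling back to the
// documented defaults. Illustrative helper only.
function resolveOllamaConfig(
  env: Record<string, string | undefined> = process.env,
) {
  return {
    baseUrl: env.OLLAMA_BASE_URL ?? 'http://localhost:11434',
    model: env.OLLAMA_MODEL ?? 'llama3.1',
  };
}

// With no overrides set, the documented defaults apply:
const cfg = resolveOllamaConfig({});
console.log(cfg.baseUrl); // http://localhost:11434
console.log(cfg.model);   // llama3.1
```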

Troubleshooting

Cannot connect to Ollama

# Make sure Ollama is running
ollama serve

# Check if it's responding
curl http://localhost:11434/api/tags
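The same check can be done from code before sending requests. A small sketch probing Ollama's /api/tags endpoint with fetch (the helper name is ours; the endpoint is Ollama's standard model-list route):

```typescript
// Return true if an Ollama server answers at the given base URL.
// Illustrative health check -- not an SDK export.
async function ollamaIsUp(
  baseUrl = 'http://localhost:11434',
): Promise<boolean> {
  try {
    // Short timeout so a missing server fails fast.
    const res = await fetch(`${baseUrl}/api/tags`, {
      signal: AbortSignal.timeout(2000),
    });
    return res.ok;
  } catch {
    // Connection refused or timed out -- server not reachable.
    return false;
  }
}
```

Calling this at startup lets you surface a clear "start Ollama first" error instead of a raw fetch failure.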

Model not found

# Pull the required model
ollama pull llama3.1
ollama pull llava  # for vision

Tool calling not working

Not all models support tool calling. Supported models:

  • llama3.1
  • mistral
  • qwen2

Models like codellama and gemma2 don't support tools.
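If your app lets users pick a model, a guard like the following avoids silent tool-call failures. This is an illustrative helper; the allow-list is taken from the compatibility notes above:

```typescript
// Base names of models known to handle tool calling (see list above).
const TOOL_CAPABLE = ['llama3.1', 'mistral', 'qwen2'];

// Illustrative guard: match on the base name so tagged variants
// like 'qwen2.5:1.5b' still qualify.
function supportsTools(model: string): boolean {
  return TOOL_CAPABLE.some((base) => model.startsWith(base));
}

console.log(supportsTools('llama3.1'));  // true
console.log(supportsTools('codellama')); // false
```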

