Ollama

Quick reference for Ollama — run large language models locally with CLI commands, REST API, Modelfiles, vision, tool calling, and OpenAI-compatible endpoints.

ollama, llm, ai, local-ai, cli, api, machine-learning

Installation & Setup

Install Ollama on macOS, Linux, or Windows, and start the server.

Install Ollama

Install Ollama on your platform using the official installer or package manager.

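Typical install commands; the script and Docker invocations follow the official instructions, while the Homebrew and winget package names are assumptions:

```shell
# Linux: official install script
curl -fsSL https://ollama.com/install.sh | sh

# macOS: Homebrew, or download the app from https://ollama.com/download
brew install ollama

# Windows: winget (package ID assumed), or use the installer from the site
winget install Ollama.Ollama

# Docker with NVIDIA GPU passthrough
docker run -d --gpus=all -v ollama:/root/.ollama \
  -p 11434:11434 --name ollama ollama/ollama

# Start the server manually if it is not running as a service
ollama serve
```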
💡 Ollama runs as a background service automatically after installation
⚡ The Docker image supports GPU passthrough with --gpus=all for NVIDIA
📌 Default API server runs on http://localhost:11434
🟢 After install, run "ollama pull llama3.2" to get your first model

Environment Configuration

Configure Ollama behavior with environment variables.

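Example settings; the values and paths shown are illustrative:

```shell
# Expose the API on the local network (default binds 127.0.0.1:11434)
export OLLAMA_HOST=0.0.0.0

# Keep models loaded for 1 hour instead of the default 5 minutes
export OLLAMA_KEEP_ALIVE=1h

# Store model files on another drive (path is an example)
export OLLAMA_MODELS=/mnt/data/ollama-models
```

On Linux with systemd, set these in the service unit via `sudo systemctl edit ollama.service` rather than in your shell profile.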
💡 Ollama runs as a system service automatically — no need to manually start it
⚡ Set OLLAMA_HOST=0.0.0.0 to expose the API on your local network
📌 OLLAMA_KEEP_ALIVE controls how long models stay loaded in memory (default 5m)
🟢 Use OLLAMA_MODELS to store models on a different drive if disk space is limited

Running Models

Run and interact with models from the command line.

Run a Model

Start an interactive chat session with a model.

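Common invocations; the file name is a placeholder:

```shell
# Interactive chat; downloads the model on first run
ollama run llama3.2

# Pick a specific size via tag
ollama run llama3.2:1b

# One-shot, non-interactive prompt
ollama run llama3.2 "Summarize Hamlet in two sentences."

# Pipe file contents into the prompt
cat main.go | ollama run llama3.2 "Review this code"
```

Inside an interactive session, commands like /set, /show, /clear, and /bye are available.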
💡 Models are automatically downloaded on first run if not already pulled
⚡ Use the tag (e.g., :1b, :8b, :27b) to pick a specific model size
📌 Interactive mode supports multi-line input with triple quotes (""")
🟢 Vision models like gemma3 can accept image file paths directly

Model Management

Pull, list, copy, and remove models.

Pull & List Models

Download models from the registry and view installed models.

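For example:

```shell
ollama pull llama3.2        # :latest is implied
ollama pull gemma3:27b      # specific size tag
ollama list                 # all installed models
ollama ps                   # models currently loaded in memory
ollama show llama3.2        # parameters, context length, license
```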
💡 Use "ollama ps" to see which models are currently loaded in memory
⚡ Models are stored in ~/.ollama/models by default (configurable via OLLAMA_MODELS)
📌 The :latest tag is used by default if no tag is specified
🟢 Browse available models at https://ollama.com/library

Copy & Remove Models

Create model aliases and delete models from local storage.

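For example:

```shell
ollama cp llama3.2 gpt-3.5-turbo   # lightweight alias, no file duplication
ollama rm gemma3:27b               # delete local layers, frees disk space
```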
💡 Copying a model creates a lightweight alias — it does not duplicate the model files
⚡ Use cp to alias models to OpenAI-compatible names (e.g., gpt-3.5-turbo)
📌 Removing a model frees up disk space by deleting its layers
🟢 You can always re-pull a removed model from the registry later

Modelfile

Create custom models with system prompts, parameters, and adapters using Modelfiles.

Create a Custom Model

Define a Modelfile with a base model, system prompt, and parameters, then build it.

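A minimal sketch; the model name "terse-llama" and the system prompt are arbitrary examples:

```shell
cat > Modelfile <<'EOF'
FROM llama3.2
SYSTEM """You are a terse assistant that answers in one sentence."""
PARAMETER temperature 0.3
PARAMETER num_ctx 8192
EOF

ollama create terse-llama -f Modelfile
ollama run terse-llama
```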
💡 FROM is the only required instruction — everything else is optional
⚡ Use PARAMETER to tune behavior: temperature, num_ctx, top_p, repeat_penalty
📌 MESSAGE pre-seeds the conversation to set the model's behavior through examples
🟢 View any model's Modelfile with "ollama show --modelfile <model>"

Modelfile Parameters

Reference for common model parameters and their effects.

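The common parameters in Modelfile syntax; the values shown are illustrative:

```shell
PARAMETER temperature 0.7      # 0 = deterministic, 1.0+ = more creative
PARAMETER num_ctx 4096         # context window size in tokens
PARAMETER top_p 0.9            # nucleus sampling cutoff
PARAMETER repeat_penalty 1.1   # penalize repeated tokens
PARAMETER seed 42              # same seed + prompt = same output
PARAMETER stop "<|user|>"      # stop generating at this sequence
```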
💡 temperature 0 gives deterministic output; 1.0+ increases creativity
⚡ Increase num_ctx for longer conversations but be aware of memory limits
📌 Use seed for reproducible responses — same seed + prompt = same output
🟢 stop sequences tell the model when to stop generating text

REST API

Interact with models programmatically using the Ollama REST API.

Generate Completions

Generate text completions using the /api/generate endpoint.

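For example, a non-streaming request with per-request options:

```shell
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?",
  "stream": false,
  "options": { "temperature": 0.2, "num_ctx": 4096 }
}'
```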
💡 Set "stream": false to get the complete response in a single JSON object
⚡ The "context" field from a response can be passed back to maintain conversation state
📌 Use "options" to override model parameters like temperature and num_ctx per request
🟢 Default streaming returns newline-delimited JSON objects as tokens are generated

Chat Completions

Use the /api/chat endpoint for multi-turn conversations.

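For example:

```shell
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "stream": false,
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "What is the capital of France?" }
  ]
}'
```

Append the returned assistant message to the messages array for the next turn.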
💡 The chat endpoint manages conversation history via the messages array
⚡ Roles are "system", "user", and "assistant" — same as OpenAI format
📌 Set "format": "json" to force the model to respond with valid JSON
🟢 Include the full message history in each request for context continuity

Embeddings

Generate vector embeddings for text using embedding models.

Generate Embeddings

Use the /api/embed endpoint to create text embeddings for RAG and similarity search.

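For example, a batched request:

```shell
curl http://localhost:11434/api/embed -d '{
  "model": "all-minilm",
  "input": ["Why is the sky blue?", "Why is grass green?"]
}'
```

The response contains an "embeddings" array with one vector per input.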
💡 Use /api/embed (not /api/embeddings) for the latest API with batch support
⚡ all-minilm is fast and lightweight (23MB) — great for getting started
📌 Embeddings are used for RAG, semantic search, and similarity comparisons
🟢 Batch multiple inputs in a single request for better performance

Vision & Multimodal

Use vision models to analyze images alongside text prompts.

Image Analysis

Send images to vision-capable models via CLI or API.

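For example; photo.jpg is a placeholder, and note that base64 flags differ between GNU and macOS:

```shell
# CLI: reference the image path directly in the prompt
ollama run gemma3 "Describe this image: ./photo.jpg"

# API: base64-encode the image into the "images" array
# (use `base64 -i photo.jpg` on macOS)
curl http://localhost:11434/api/generate -d '{
  "model": "gemma3",
  "prompt": "What is in this picture?",
  "stream": false,
  "images": ["'"$(base64 -w0 photo.jpg)"'"]
}'
```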
💡 Vision models include gemma3, llava, and moondream — pull one to get started
⚡ The CLI handles image encoding automatically — just pass the file path
📌 API requires base64-encoded images in the "images" array; SDKs accept file paths
🟢 You can combine multiple images in a single request for comparison tasks

Structured Output

Force models to respond with structured JSON or schema-based output.

JSON Mode & Schema

Use the format parameter to get structured JSON responses.

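For example, JSON mode and a schema-constrained request:

```shell
# JSON mode
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "stream": false,
  "format": "json",
  "messages": [{ "role": "user", "content": "List three primary colors as JSON." }]
}'

# Exact structure via a JSON Schema passed as "format"
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "stream": false,
  "format": {
    "type": "object",
    "properties": {
      "name": { "type": "string" },
      "age": { "type": "integer" }
    },
    "required": ["name", "age"]
  },
  "messages": [{ "role": "user", "content": "Describe a person named Alice, age 30, as JSON." }]
}'
```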
💡 "format": "json" ensures the model always returns valid JSON
⚡ Pass a JSON Schema object as "format" to enforce an exact output structure
📌 Include format instructions in your prompt too — the model responds better with both
🟢 Schema-based output works best with larger models (7B+) for complex structures

Tool Calling

Let models call external functions and tools to augment their capabilities.

Function Calling with Tools

Define tools that models can invoke, then process the tool calls.

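A sketch of the two-step flow with the Python SDK (0.4+); get_weather is a hypothetical tool:

```python
import ollama

def get_weather(city: str) -> str:
    """Return the current weather for a city."""
    # Hypothetical tool: swap in a real weather API call.
    return f"Sunny, 22°C in {city}"

messages = [{"role": "user", "content": "What's the weather in Paris?"}]

# Step 1: the SDK derives a tool schema from the type hints and docstring
response = ollama.chat(model="llama3.2", messages=messages, tools=[get_weather])

# Step 2: run each requested tool and return results as role "tool" messages
messages.append(response.message)
for call in response.message.tool_calls or []:
    result = get_weather(**call.function.arguments)
    messages.append({"role": "tool", "name": call.function.name, "content": result})

# The model now forms its final answer from the tool results
final = ollama.chat(model="llama3.2", messages=messages)
print(final.message.content)
```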
💡 Python SDK auto-generates tool schemas from type hints and docstrings
⚡ Tool calling uses a two-step flow: model requests tools, you execute and return results
📌 Models like qwen3, llama3.2, and command-r support tool calling
🟢 Add role: "tool" messages with results so the model can form its final answer

OpenAI Compatibility

Use the OpenAI-compatible API to drop in Ollama as a local replacement.

OpenAI SDK Drop-in

Point the OpenAI SDK at Ollama for local inference with no code changes.

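For example:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1/",
    api_key="ollama",  # required by the SDK, ignored by Ollama
)

response = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Say hello in one word."}],
)
print(response.choices[0].message.content)
```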
💡 Just change base_url to localhost:11434/v1/ — the API key is required but ignored
⚡ Supports /v1/chat/completions, /v1/completions, /v1/embeddings, and /v1/models
📌 Use "ollama cp" to alias models to OpenAI names for zero-code-change migration
🟢 Great for testing apps locally before switching to cloud API providers

Official SDKs

Use the official Python and JavaScript libraries for a native Ollama experience.

Python SDK

Install and use the official Ollama Python library.

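A quick tour, assuming the current ollama-python (0.4+) response objects:

```python
# pip install ollama
import ollama

# Chat (synchronous)
resp = ollama.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
)
print(resp.message.content)

# Streaming: iterate over chunks as tokens arrive
stream = ollama.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Tell me a joke."}],
    stream=True,
)
for chunk in stream:
    print(chunk.message.content, end="", flush=True)

# Other calls mirror the REST API
ollama.generate(model="llama3.2", prompt="Write a haiku about autumn.")
ollama.embed(model="all-minilm", input=["hello", "world"])
ollama.list()
```

An AsyncClient with the same surface is available for asyncio code.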
💡 The Python SDK provides both sync and async clients (AsyncClient)
⚡ Pass stream=True for token-by-token streaming in real time
📌 Functions like chat(), generate(), embed(), list(), pull() mirror the REST API
🟢 Type hints are included — full autocomplete support in modern editors

JavaScript SDK

Install and use the official Ollama JavaScript/TypeScript library.

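For example, in a Node.js ESM module:

```typescript
// npm install ollama
import ollama from 'ollama'

const response = await ollama.chat({
  model: 'llama3.2',
  messages: [{ role: 'user', content: 'Why is the sky blue?' }],
})
console.log(response.message.content)

// Streaming: consume chunks with an async iterator
const stream = await ollama.chat({
  model: 'llama3.2',
  messages: [{ role: 'user', content: 'Tell me a joke.' }],
  stream: true,
})
for await (const chunk of stream) {
  process.stdout.write(chunk.message.content)
}
```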
💡 The JS SDK works in Node.js and supports ESM and CommonJS imports
⚡ Streaming uses async iterators — use "for await...of" to process chunks
📌 Full TypeScript support with typed responses out of the box
🟢 Use ollama.pull() with stream: true to show download progress

GPU & Performance

Configure GPU acceleration and optimize model performance.

GPU Configuration

Configure GPU layers, parallel requests, and memory management.

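For example:

```shell
# Faster attention kernels on supported hardware
export OLLAMA_FLASH_ATTENTION=1

# See which models are loaded and whether they sit on GPU or CPU
ollama ps

# Preload a model (empty prompt) to cut first-response latency
curl http://localhost:11434/api/generate -d '{"model": "llama3.2"}'

# Unload immediately and free GPU memory
curl http://localhost:11434/api/generate -d '{"model": "llama3.2", "keep_alive": 0}'

# Or unload from the CLI
ollama stop llama3.2
```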
💡 Ollama automatically detects and uses available GPUs (NVIDIA, AMD, Apple Silicon)
⚡ Use OLLAMA_FLASH_ATTENTION=1 for faster inference on supported hardware
📌 Send keep_alive: 0 to immediately unload a model and free GPU memory
🟢 Preload models by sending an empty prompt — great for reducing first-response latency