What is WebLLM
By Jafar Rezaei
4 min read
WebLLM runs large language models directly in the browser, making them accessible fully client side! As AI technologies evolve, WebLLM is expected to become an important part of future applications.

WebLLM: Running LLMs in the Browser
Large Language Models (LLMs) have changed the game for natural language processing (NLP), making it much easier to build chatbots, code generation tools, and more. As models become smaller and faster, a lot of new opportunities open up. Traditionally, LLMs had to be served from powerful cloud-based GPU servers, but recent progress has made it possible to run them in a local, client-side browser. In this article, we will cover how WebLLM helps serve a model fully on the client side.
What is WebLLM?
WebLLM is an approach implemented by the MLC-AI team that allows LLMs to run fully locally within a browser using WebAssembly (WASM), WebGPU, and other modern web technologies. When WebLLM is used, it first downloads the chosen model and stores it locally in Cache Storage. From that moment on, it can be used fully offline. Downloading and running a model locally is a bit of extra work, but with growing internet speeds and device capabilities it is becoming an increasingly practical option.
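Because the downloaded weights live in the browser's Cache Storage, you may want to check whether a model is already cached before initializing the engine. The snippet below is a minimal sketch, assuming the hasModelInCache and deleteModelAllInfoInCache helpers exported by @mlc-ai/web-llm behave as in the library's cache-usage example:

import * as webllm from "@mlc-ai/web-llm";

const modelId = "Llama-3.1-8B-q4f32_1-MLC";

// true if the model's weights are already in Cache Storage, so loading will be fast
const cached = await webllm.hasModelInCache(modelId);
console.log(cached ? "Model already cached" : "Model will be downloaded first");

// To free disk space later, the cached artifacts can be removed again:
// await webllm.deleteModelAllInfoInCache(modelId);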
How WebLLM Works
The technology behind WebLLM allows it to work fully in the browser without any server-side infrastructure. But how does it work?
- Use of WebAssembly (WASM) and WebGPU: WebLLM compiles the model computations into WebAssembly modules for efficient execution and uses WebGPU to accelerate them on the device's GPU.
- Running on the client side: Unlike traditional cloud-based LLMs, WebLLM works fully locally in the browser, so it does not send data to remote servers, which ensures privacy and security.
- Optimized model size: WebLLM uses quantized and optimized versions of models so they run faster and more efficiently in a browser environment (see the snippet after this list).
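To get a feel for which quantized builds ship with WebLLM and roughly how much memory they need, you can inspect the library's built-in model list. This is a small sketch, assuming the prebuiltAppConfig export and its model_list fields (model_id, vram_required_MB, low_resource_required) match your installed version:

import { prebuiltAppConfig } from "@mlc-ai/web-llm";

// Print every prebuilt, quantized model variant with a rough VRAM estimate.
for (const m of prebuiltAppConfig.model_list) {
  const vram = m.vram_required_MB ? `~${m.vram_required_MB} MB VRAM` : "VRAM unknown";
  const lowResource = m.low_resource_required ? " (low-resource friendly)" : "";
  console.log(`${m.model_id}: ${vram}${lowResource}`);
}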
Cloud vs. In-Browser LLMs
Comparing cloud-hosted LLMs with in-browser LLMs is a good way to understand the trade-offs. Running LLMs in the browser is not ideal for every project, but it is a real advantage for many of them.
| Feature | LLM served from the cloud | WebLLM (in-browser) |
|---|---|---|
| Offline support | Limited (requires internet) | Can run offline once loaded |
| Performance | Faster (dedicated hardware) | Slower (limited by browser capabilities) |
| Privacy | Limited, since data is sent to and from a server | Fully private (runs locally) |
| Installation | Requires dedicated servers (GPUs, memory) | Open the website and download the model |
| Portability | Limited to specific OS/hardware | Cross-platform (any modern browser) |
| Latency | Lower latency (powerful hardware) | Higher latency (browser execution overhead) |
Keep in mind that WebLLM requires the model to be downloaded and stored locally on the client, so it is important to consider the size of the model and the amount of data that needs to be downloaded.
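Before triggering a download that can run into gigabytes, it can help to check how much storage the browser grants your origin. The sketch below uses the standard StorageManager API; the 5 GB threshold is only an illustrative assumption, not a WebLLM requirement:

// Rough pre-flight check before downloading a large model.
if (navigator.storage?.estimate) {
  const { usage = 0, quota = 0 } = await navigator.storage.estimate();
  const freeGB = (quota - usage) / 1024 ** 3;
  console.log(`About ${freeGB.toFixed(1)} GB of origin storage available`);
  if (freeGB < 5) {
    console.warn("Storage may be tight; consider a smaller quantized model.");
  }
}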
How Can I Implement It in My Website?
WebLLM is published as an npm package (@mlc-ai/web-llm) and can be installed easily:
npm i @mlc-ai/web-llm
Then import the module in your code and use it. It can also be dynamically imported:
const webllm = await import("https://esm.run/@mlc-ai/web-llm");
Create MLCEngine
Most operations in WebLLM are done through the MLCEngine. Here is a sample code snippet that creates your LLM engine:
import * as webllm from "@mlc-ai/web-llm";

const selectedModel = "Llama-3.1-8B-q4f32_1-MLC";
const engine: webllm.MLCEngineInterface = await webllm.CreateMLCEngine(
  selectedModel,
  {
    initProgressCallback: (initProgress) => {
      console.log(initProgress);
    },
    logLevel: "INFO",
  },
);
As soon as the engine is loaded and ready, you can start calling its APIs. Note that engine creation is asynchronous, so you need to wait until it finishes loading the model before using it.
const messages = [
  { role: "system", content: "You are a helpful AI assistant." },
  { role: "user", content: "Hello!" },
];

const reply = await engine.chat.completions.create({
  messages,
});

console.log(reply.choices[0].message);
console.log(reply.usage);
It also supports streaming, which can be enabled by passing the stream: true property to the create method.
const chunks = await engine.chat.completions.create({
  messages,
  temperature: 1,
  stream: true, // <-- Enable streaming
  stream_options: { include_usage: true },
});

let reply = "";
for await (const chunk of chunks) {
  reply += chunk.choices[0]?.delta.content || "";
  console.log(reply);
  if (chunk.usage) {
    console.log(chunk.usage); // only the last chunk has usage
  }
}
Since these operations run on the browser's main thread, they can block the UI and hurt your application's performance. Instead of using the engine directly, it is recommended to run it in a separate thread (a worker). In the browser environment we have Web Workers and Service Workers, and fortunately WebLLM provides wrappers for both, so you can use them in much the same way as the main-thread engine.
import * as webllm from "@mlc-ai/web-llm";

// Assumes a service worker running WebLLM's ServiceWorkerMLCEngineHandler
// has already been registered for this page.
const engine = new webllm.ServiceWorkerMLCEngine();
await engine.reload("Llama-3-8B-Instruct-q4f16_1-MLC");

async function main() {
  const stream = await engine.chat.completions.create({
    messages: [{ role: "user", content: "Hello!" }],
    stream: true,
  });
  for await (const chunk of stream) {
    updateUI(chunk.choices[0]?.delta?.content || ""); // update the UI with each chunk
  }
}

main();
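The Web Worker variant follows the same pattern: a small worker script hosts the actual engine, and the page talks to a lightweight proxy. This is a minimal sketch, assuming the WebWorkerMLCEngineHandler and CreateWebWorkerMLCEngine exports and a bundler that understands new URL(..., import.meta.url) worker imports:

// worker.ts — runs inside the Web Worker and hosts the engine
import { WebWorkerMLCEngineHandler } from "@mlc-ai/web-llm";

const handler = new WebWorkerMLCEngineHandler();
self.onmessage = (msg: MessageEvent) => {
  handler.onmessage(msg);
};

// main.ts — the page only creates a proxy that forwards calls to the worker
import { CreateWebWorkerMLCEngine } from "@mlc-ai/web-llm";

const engine = await CreateWebWorkerMLCEngine(
  new Worker(new URL("./worker.ts", import.meta.url), { type: "module" }),
  "Llama-3-8B-Instruct-q4f16_1-MLC",
);
// engine.chat.completions.create(...) now works exactly as in the earlier examples.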
If you have worked with the OpenAI library before, the code structure will look familiar; the API was intentionally kept similar, so it feels like interacting with a server by sending and receiving JSON.
It also supports media inputs such as image URLs for vision-capable models (see the sketch below). There are also examples of how to use WebLLM in different projects and frameworks in the mlc-ai team's web-llm repository, under the examples folder.
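As a hypothetical illustration of the OpenAI-style message format, an image can be passed as part of the message content. This sketch assumes a vision-capable model build has been loaded into the engine (the model ID in the comment is only an example) and that WebLLM accepts image_url content parts as in its vision examples:

// Assumes a vision-capable build has been loaded,
// e.g. engine.reload("Phi-3.5-vision-instruct-q4f16_1-MLC") — example ID only.
const visionReply = await engine.chat.completions.create({
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "What is in this image?" },
        { type: "image_url", image_url: { url: "https://example.com/cat.png" } },
      ],
    },
  ],
});
console.log(visionReply.choices[0].message.content);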
Try it out

The MLC-AI team has built https://chat.webllm.ai/, a website that allows you to download and try a wide range of LLMs locally in the browser without any installation or configuration. It gives a quick overview of how WebLLM works and what it can do.