What is UTF-8? The Encoding That Powers the Modern Web

Over 98% of all web pages use UTF-8. It's the encoding that lets you read this text, send emoji in messages, and display Chinese, Arabic, and Hindi on the same page. Here's how it works.

What is UTF-8?

UTF-8 (Unicode Transformation Format — 8-bit) is a variable-width character encoding that can represent every character in the Unicode standard. It encodes each character using 1 to 4 bytes, making it both space-efficient for English text and capable of representing every writing system on earth.

UTF-8 was designed in 1992 by Ken Thompson and Rob Pike. Thompson co-created Unix, and both later co-designed the Go programming language. The story goes that they sketched the encoding on a placemat in a New Jersey diner and implemented it overnight.

Today, UTF-8 is the dominant encoding on the web. According to W3Techs, over 98% of all websites use UTF-8, and it's the default encoding for HTML5, JSON, YAML, TOML, and most modern programming languages.

Unicode vs UTF-8: What's the Difference?

People often confuse Unicode and UTF-8, but they serve different roles:

  • Unicode is a character set — a giant table that assigns a unique number (called a code point) to every character. For example, "A" is U+0041, "€" is U+20AC, and "😀" is U+1F600.
  • UTF-8 is an encoding — a way to convert those code point numbers into actual bytes that computers can store and transmit.

Think of it this way: Unicode is the dictionary that lists every character with its number, while UTF-8 is the delivery method that packs those numbers into bytes.

There are other Unicode encodings too (UTF-16, UTF-32), but UTF-8 is by far the most popular because of its efficiency and backward compatibility with ASCII.
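
A short Python sketch can make the split between code points and bytes concrete. `ord` gives the Unicode code point, and `str.encode` turns it into bytes under a chosen encoding. The helper name `describe` is invented for this illustration, the big-endian codec variants are used only so that no byte order mark is prepended, and `bytes.hex(" ")` assumes Python 3.8 or newer.

```python
def describe(char: str) -> dict:
    """Return the code point and the encoded bytes (as hex) for one character."""
    return {
        "code_point": f"U+{ord(char):04X}",          # the Unicode number
        "utf-8": char.encode("utf-8").hex(" "),      # 1 to 4 bytes
        "utf-16": char.encode("utf-16-be").hex(" "),  # 2 or 4 bytes
        "utf-32": char.encode("utf-32-be").hex(" "),  # always 4 bytes
    }


for ch in ("A", "€", "😀"):
    print(ch, describe(ch))
```

The code point stays the same in every row; only the byte representation changes with the encoding.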

How UTF-8 Encoding Works

UTF-8 uses a clever variable-width scheme. The number of bytes depends on the code point value:

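| Code Point Range   | Bytes | Byte Pattern                        | Example    |
|--------------------|-------|-------------------------------------|------------|
| U+0000 – U+007F    | 1     | 0xxxxxxx                            | A, z, 5, ! |
| U+0080 – U+07FF    | 2     | 110xxxxx 10xxxxxx                   | é, ñ, ü, Σ |
| U+0800 – U+FFFF    | 3     | 1110xxxx 10xxxxxx 10xxxxxx          | 中, €, ह, ✓ |
| U+10000 – U+10FFFF | 4     | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | 😀, 🎉, 𝕳  |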
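
The ranges in the table map directly onto a few comparisons. The function below is an illustrative sketch (the name `utf8_length` does not come from any library). It also rejects surrogate code points (U+D800 to U+DFFF), which UTF-8 does not encode.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 needs for a given Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {code_point:#x}")
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4


# Cross-check against Python's built-in encoder.
for ch in "Aé€😀":
    print(ch, utf8_length(ord(ch)), len(ch.encode("utf-8")))
```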

The key design insight: the leading bits of each byte tell you exactly what you're looking at:

  • Starts with 0 → single-byte character (ASCII)
  • Starts with 110 → first byte of a 2-byte character
  • Starts with 1110 → first byte of a 3-byte character
  • Starts with 11110 → first byte of a 4-byte character
  • Starts with 10 → continuation byte (not the start of a character)

This makes UTF-8 self-synchronizing — if you start reading in the middle of a stream, you can always find the next character boundary by scanning for a byte that doesn't start with 10.
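
As a rough sketch of that resynchronization step, the code below classifies a byte by its leading bits and skips continuation bytes until it reaches the start of a character. The names `classify` and `next_boundary` are hypothetical helpers written for this article, not library functions.

```python
def classify(byte: int) -> str:
    """Label a byte by its leading bits."""
    if (byte & 0b1000_0000) == 0b0000_0000:
        return "single"        # 0xxxxxxx
    if (byte & 0b1100_0000) == 0b1000_0000:
        return "continuation"  # 10xxxxxx
    if (byte & 0b1110_0000) == 0b1100_0000:
        return "lead-2"        # 110xxxxx
    if (byte & 0b1111_0000) == 0b1110_0000:
        return "lead-3"        # 1110xxxx
    if (byte & 0b1111_1000) == 0b1111_0000:
        return "lead-4"        # 11110xxx
    return "invalid"           # 11111xxx never appears in UTF-8


def next_boundary(data: bytes, start: int) -> int:
    """Return the index of the first non-continuation byte at or after start."""
    i = start
    while i < len(data) and classify(data[i]) == "continuation":
        i += 1
    return i


data = "a€b".encode("utf-8")   # 61 E2 82 AC 62
print(next_boundary(data, 2))  # 4: index 2 is inside "€", the next character starts at 4
```

At most three continuation bytes can follow a lead byte, so the scan never has to look far.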

Step-by-Step Encoding Examples

1-byte: "A" (U+0041)

Code point: U+0041 = 65 = 1000001 in binary
Range: U+0000–U+007F → 1 byte
Pattern: 0xxxxxxx

Fill in: 0 1000001
         ↓
Byte:    01000001 = 0x41

"A" in UTF-8 = 0x41 (identical to ASCII!)

2-byte: "é" (U+00E9)

Code point: U+00E9 = 233 = 11101001 in binary
Range: U+0080–U+07FF → 2 bytes
Pattern: 110xxxxx 10xxxxxx

Split bits: 00011  101001
Fill in:    110 00011  10 101001
            ↓          ↓
Bytes:      0xC3       0xA9

"é" in UTF-8 = 0xC3 0xA9

3-byte: "€" (U+20AC)

Code point: U+20AC = 8364 = 10000010101100 in binary
Range: U+0800–U+FFFF → 3 bytes
Pattern: 1110xxxx 10xxxxxx 10xxxxxx

Split bits: 0010  000010  101100
Fill in:    1110 0010  10 000010  10 101100
            ↓          ↓          ↓
Bytes:      0xE2       0x82       0xAC

"€" in UTF-8 = 0xE2 0x82 0xAC

4-byte: "😀" (U+1F600)

Code point: U+1F600 = 128512 = 11111011000000000 in binary
Range: U+10000–U+10FFFF → 4 bytes
Pattern: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Split bits: 000  011111  011000  000000
Fill in:    11110 000  10 011111  10 011000  10 000000
            ↓          ↓          ↓          ↓
Bytes:      0xF0       0x9F       0x98       0x80

"😀" in UTF-8 = 0xF0 0x9F 0x98 0x80

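The four worked examples follow the same recipe: pick the pattern for the range, then shift and mask the code point into the `x` slots. Below is a minimal hand-written encoder under that recipe. The name `encode_utf8` is made up for this sketch; in real code you would simply call `str.encode("utf-8")`, which is used here only to cross-check the result.

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode one Unicode code point into UTF-8 by shifting and masking bits."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {code_point:#x}")
    cont = lambda bits: 0b1000_0000 | (bits & 0b11_1111)  # build a 10xxxxxx byte
    if code_point <= 0x7F:
        return bytes([code_point])                          # 0xxxxxxx
    if code_point <= 0x7FF:
        return bytes([0b1100_0000 | (code_point >> 6),      # 110xxxxx
                      cont(code_point)])
    if code_point <= 0xFFFF:
        return bytes([0b1110_0000 | (code_point >> 12),     # 1110xxxx
                      cont(code_point >> 6),
                      cont(code_point)])
    return bytes([0b1111_0000 | (code_point >> 18),         # 11110xxx
                  cont(code_point >> 12),
                  cont(code_point >> 6),
                  cont(code_point)])


for ch in "Aé€😀":
    manual = encode_utf8(ord(ch))
    print(ch, manual.hex(" "), manual == ch.encode("utf-8"))
```
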
Why UTF-8 Won

UTF-8 became the dominant encoding for several compelling reasons:

  • Backward compatible with ASCII — Any ASCII text is already valid UTF-8. This meant billions of existing documents worked without conversion.
  • Space efficient — English text (the bulk of early web content) uses just 1 byte per character, same as ASCII. Other scripts use only what they need.
  • No byte-order issues — Unlike UTF-16 and UTF-32, UTF-8 doesn't need a BOM (Byte Order Mark) because byte order is unambiguous.
  • Self-synchronizing — You can jump to any byte and find the next character boundary. This makes random access and error recovery reliable.
  • No null bytes in text — The only way to get a 0x00 byte is the actual NUL character, which means C-style string functions work correctly.
  • Universal — It can encode every Unicode character, from basic Latin to emoji to ancient scripts.
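
Two of these properties, ASCII compatibility and the absence of stray null bytes, are easy to check from Python. The snippet is only a spot check on a couple of sample strings chosen for illustration, not a proof.

```python
ascii_text = "Hello, world!"
# ASCII text produces exactly the same bytes under both encodings.
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

sample = "naïve café 日本語 🎉"
# No 0x00 byte appears unless the text itself contains U+0000.
utf8_has_nul = 0 in sample.encode("utf-8")
# UTF-16, by contrast, pads ASCII characters with zero bytes.
utf16_has_nul = 0 in "Hi".encode("utf-16-le")
# Every byte of a non-ASCII character has its high bit set,
# so it can never be mistaken for an ASCII byte.
high_bit_only = all(
    b >= 0x80 for ch in sample if ord(ch) > 0x7F for b in ch.encode("utf-8")
)

print(same_bytes, utf8_has_nul, utf16_has_nul, high_bit_only)
```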

UTF-8 vs UTF-16 vs UTF-32

Unicode has three main encodings. Here's how they compare:

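| Feature          | UTF-8                   | UTF-16                    | UTF-32              |
|------------------|-------------------------|---------------------------|---------------------|
| Bytes per char   | 1–4                     | 2 or 4                    | 4 (always)          |
| ASCII compatible | Yes                     | No                        | No                  |
| Byte order issue | No                      | Yes (needs BOM)           | Yes (needs BOM)     |
| "Hello" size     | 5 bytes                 | 10 bytes                  | 20 bytes            |
| Used in          | Web, Linux, macOS, JSON | Windows, Java, JavaScript | Internal processing |
| Web usage        | 98%+                    | ~0.01%                    | Virtually 0%        |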

UTF-16 is notable because JavaScript and Java use it internally for strings. This is why JavaScript's .length returns 2 for a single emoji such as 😀: code points above U+FFFF require a UTF-16 "surrogate pair" (two 16-bit units).
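
The size differences can be measured from Python. In this sketch the helper names `sizes` and `utf16_units` are illustrative, and the little-endian codec names are used so that no byte order mark is added to the counts.

```python
def sizes(text: str) -> dict:
    """Count code points and encoded bytes for the three Unicode encodings."""
    return {
        "code_points": len(text),
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


def utf16_units(text: str) -> int:
    """Number of 16-bit code units, which is what JavaScript's .length reports."""
    return len(text.encode("utf-16-le")) // 2


print(sizes("Hello"))     # {'code_points': 5, 'utf-8': 5, 'utf-16': 10, 'utf-32': 20}
print(utf16_units("😀"))  # 2, one surrogate pair
```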

UTF-8 in Code

JavaScript

// Encode string to UTF-8 bytes
const encoder = new TextEncoder();
const bytes = encoder.encode("Hello €");
console.log(bytes);
// Uint8Array [72, 101, 108, 108, 111, 32, 226, 130, 172]
// "€" takes 3 bytes: 0xE2, 0x82, 0xAC

// Decode UTF-8 bytes back to string
const decoder = new TextDecoder("utf-8");
const text = decoder.decode(bytes);
console.log(text);  // "Hello €"

// ⚠️ String length vs byte length
"Hello".length;  // 5 (5 bytes in UTF-8)
"café".length;   // 4 characters, but 5 bytes in UTF-8
"😀".length;     // 2 (JS uses UTF-16 internally!)

Python

# Encode string to UTF-8 bytes
text = "Hello €"
utf8_bytes = text.encode("utf-8")
print(utf8_bytes)   # b'Hello \xe2\x82\xac'
print(len(utf8_bytes))  # 9 bytes

# Decode UTF-8 bytes to string
decoded = utf8_bytes.decode("utf-8")
print(decoded)  # Hello €

# Character vs byte length
len("café")                    # 4 characters
len("café".encode("utf-8"))    # 5 bytes (é = 2 bytes)

HTML

<!-- Always declare UTF-8 in your HTML -->
<meta charset="UTF-8">

<!-- This should be the FIRST element inside <head> -->
<!-- Place it before <title> or any other content -->

Common UTF-8 Problems

Most encoding bugs happen when text is read using the wrong encoding:

❌ Mojibake (garbled text)

When UTF-8 text is read as Latin-1 (ISO-8859-1) or Windows-1252, you get garbage like Ã© instead of é, or â‚¬ instead of €.

Fix: Ensure both the writer and reader agree on UTF-8 encoding.
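
Mojibake is easy to reproduce, and sometimes to undo, in Python. The sketch below assumes the text was garbled exactly once and that no bytes were lost along the way; the variable names are illustrative.

```python
original = "é"

# Writer used UTF-8, reader wrongly assumed Latin-1: each byte becomes its own character.
garbled = original.encode("utf-8").decode("latin-1")
print(garbled)  # Ã©

# Reversing the wrong step recovers the text, provided nothing else touched it.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)  # é

# The same mistake with Windows-1252 as the wrong encoding.
euro_garbled = "€".encode("utf-8").decode("cp1252")
print(euro_garbled)  # â‚¬
```

Repair of this kind is a last resort; the real fix is still to make both sides agree on UTF-8.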

❌ Replacement characters (�)

The � character (U+FFFD) appears when a decoder encounters bytes that aren't valid UTF-8.

Fix: Check if the source file was actually saved as UTF-8, not some other encoding.
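
The following sketch shows where a replacement character comes from: bytes saved as Latin-1 are decoded as UTF-8. Python's strict decoder raises an error, while `errors="replace"` substitutes U+FFFD. The sample word and variable names are chosen only for illustration.

```python
latin1_bytes = "café".encode("latin-1")  # b'caf\xe9', not valid UTF-8

try:
    latin1_bytes.decode("utf-8")          # strict mode is the default
    strict_ok = True
except UnicodeDecodeError as exc:
    strict_ok = False
    print("strict decode failed:", exc.reason)

# Lenient mode keeps going and marks the bad byte with U+FFFD.
lenient = latin1_bytes.decode("utf-8", errors="replace")
print(lenient)  # caf�
```

Failing loudly in strict mode is usually preferable, because a replacement character means the original byte is already lost.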

❌ BOM causing issues

Some editors (notably older versions of Windows Notepad) add a Byte Order Mark (EF BB BF) at the start of UTF-8 files. This can break JSON parsers, shell scripts, and PHP files.

Fix: Save files as "UTF-8 without BOM" or use a BOM-aware tool.
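
In Python, one BOM-aware option is the standard `utf-8-sig` codec, which drops a leading BOM if one is present and otherwise behaves like `utf-8`. The payload below is a made-up example.

```python
import codecs
import json

raw = codecs.BOM_UTF8 + b'{"ok": true}'  # file contents with a BOM (EF BB BF)

with_bom = raw.decode("utf-8")           # the BOM survives as U+FEFF
without_bom = raw.decode("utf-8-sig")    # the BOM is stripped

try:
    json.loads(with_bom)
    bom_parsed = True
except json.JSONDecodeError:
    bom_parsed = False                   # the json module rejects a leading U+FEFF in a str

parsed = json.loads(without_bom)
print(bom_parsed, parsed)  # False {'ok': True}
```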

❌ String length ≠ byte length

A 10-character string might be 10 bytes (ASCII) or 40 bytes (emoji). Database VARCHAR(10) columns may truncate multi-byte characters.

Fix: Use character-based limits, not byte-based limits. In MySQL, use utf8mb4 instead of utf8.
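
When a byte limit is unavoidable (a fixed-size column or a protocol field, for example), the cut should at least land on a character boundary. Here is a minimal sketch; `truncate_utf8` is a hypothetical helper, and it relies on the input being a valid Python `str` so that the only undecodable bytes are the ones left dangling by the cut.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Trim text so its UTF-8 form fits in max_bytes without splitting a character."""
    clipped = text.encode("utf-8")[:max_bytes]
    # A cut in the middle of a multi-byte sequence leaves an incomplete tail;
    # errors="ignore" drops that tail instead of raising.
    return clipped.decode("utf-8", errors="ignore")


print(truncate_utf8("café", 4))    # caf (the 2-byte é does not fit)
print(truncate_utf8("😀😀", 5))    # 😀
print(truncate_utf8("Hello", 10))  # Hello
```

This keeps code points intact, but it can still separate a base character from a following combining mark or split a multi-code-point emoji sequence.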

Best Practices

Follow these rules to avoid encoding headaches:

  • Always use UTF-8 — Unless you have a specific reason not to, UTF-8 should be your default for everything.
  • Declare the encoding — Add <meta charset="UTF-8"> in HTML, set Content-Type: text/html; charset=utf-8 in HTTP headers.
  • Save files as UTF-8 — Configure your editor to save all files as UTF-8 (without BOM).
  • Use utf8mb4 in MySQL — MySQL's utf8 (an alias for utf8mb3) stores at most 3 bytes per character, so it cannot hold 4-byte characters such as emoji. Use utf8mb4 for full Unicode support.
  • Distinguish characters from bytes — When counting, truncating, or splitting strings, work with characters/code points, not raw bytes.
  • Test with non-ASCII text — Use strings like "naïve café résumé 日本語 🎉" in your test cases to catch encoding bugs early.
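
Several of these rules can be combined into a small round-trip check. The sketch below writes the sample string to a temporary file and reads it back, always passing `encoding="utf-8"` explicitly instead of relying on the platform default. The function name `round_trip` and the file name are placeholders.

```python
import tempfile
from pathlib import Path

SAMPLE = "naïve café résumé 日本語 🎉"


def round_trip(text: str) -> str:
    """Write text to a temporary file as UTF-8 and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "sample.txt"
        path.write_text(text, encoding="utf-8")   # explicit encoding on write
        return path.read_text(encoding="utf-8")   # and on read


print(round_trip(SAMPLE) == SAMPLE)  # True
```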

References

  1. Yergeau, F. (2003). UTF-8, a transformation format of ISO 10646. RFC 3629, IETF. https://datatracker.ietf.org/doc/html/rfc3629
  2. Pike, R. & Thompson, K. (2003). Hello World, or Καλημέρα κόσμε, or こんにちは 世界. UTF-8 history. https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt
  3. The Unicode Consortium. The Unicode Standard. https://www.unicode.org/standard/standard.html
  4. Mozilla Developer Network. TextEncoder — Web APIs. https://developer.mozilla.org/en-US/docs/Web/API/TextEncoder
  5. W3Techs. Usage statistics of character encodings for websites. https://w3techs.com/technologies/overview/character_encoding