What is UTF-8? The Encoding That Powers the Modern Web
Over 98% of all web pages use UTF-8. It's the encoding that lets you read this text, send emoji in messages, and display Chinese, Arabic, and Hindi on the same page. Here's how it works.
What is UTF-8?
UTF-8 (Unicode Transformation Format — 8-bit) is a variable-width character encoding that can represent every character in the Unicode standard. It encodes each character using 1 to 4 bytes, making it both space-efficient for English text and capable of representing every writing system on earth.
UTF-8 was designed in 1992 by Ken Thompson and Rob Pike at Bell Labs. Thompson co-created Unix, and both later co-designed the Go programming language. The story goes that they sketched the encoding on a placemat in a New Jersey diner and implemented it overnight.
Today, UTF-8 is the dominant encoding on the web. According to W3Techs, over 98% of all websites use UTF-8, and it's the default encoding for HTML5, JSON, YAML, TOML, and most modern programming languages.
Unicode vs UTF-8: What's the Difference?
People often confuse Unicode and UTF-8, but they serve different roles:
- Unicode is a character set — a giant table that assigns a unique number (called a code point) to every character. For example, "A" is U+0041, "€" is U+20AC, and "😀" is U+1F600.
- UTF-8 is an encoding — a way to convert those code point numbers into actual bytes that computers can store and transmit.
Think of it this way: Unicode is the dictionary that lists every character with its number, while UTF-8 is the delivery method that packs those numbers into bytes.
There are other Unicode encodings too (UTF-16, UTF-32), but UTF-8 is by far the most popular because of its efficiency and backward compatibility with ASCII.
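To make the distinction concrete, here is a small Python sketch. It assumes nothing beyond the standard library; `describe` is a hypothetical helper name. `ord` returns the Unicode code point, while `str.encode` produces the bytes for one particular encoding.

```python
def describe(ch: str) -> dict:
    """Return the code point of one character and its bytes in three encodings."""
    return {
        "code_point": f"U+{ord(ch):04X}",           # the Unicode number
        "utf-8": ch.encode("utf-8").hex(" "),        # 1 to 4 bytes
        "utf-16": ch.encode("utf-16-be").hex(" "),   # 2 or 4 bytes, big-endian, no BOM
        "utf-32": ch.encode("utf-32-be").hex(" "),   # always 4 bytes, no BOM
    }


for ch in ["A", "€", "😀"]:
    print(ch, describe(ch))
```

The code point is the same in every row of the output; only the byte sequences differ from one encoding to the next.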
How UTF-8 Encoding Works
UTF-8 uses a clever variable-width scheme. The number of bytes depends on the code point value:
| Code Point Range | Bytes | Byte Pattern | Example |
|---|---|---|---|
| U+0000 – U+007F | 1 | 0xxxxxxx | A, z, 5, ! |
| U+0080 – U+07FF | 2 | 110xxxxx 10xxxxxx | é, ñ, ü, Σ |
| U+0800 – U+FFFF | 3 | 1110xxxx 10xxxxxx 10xxxxxx | 中, €, ह, ✓ |
| U+10000 – U+10FFFF | 4 | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | 😀, 🎉, 𝕳 |
The key design insight: the leading bits of each byte tell you exactly what you're looking at:
- Starts with 0 → single-byte character (ASCII)
- Starts with 110 → first byte of a 2-byte character
- Starts with 1110 → first byte of a 3-byte character
- Starts with 11110 → first byte of a 4-byte character
- Starts with 10 → continuation byte (not the start of a character)
This makes UTF-8 self-synchronizing — if you start reading in the middle of a stream, you can always find the next character boundary by scanning for a byte that doesn't start with 10.
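The byte patterns in the table above can be sketched as a hand-written encoder. This is an illustration only: `utf8_encode` is a hypothetical name, and real code should simply call `str.encode("utf-8")`. The surrogate check is not covered in the text above; it reflects the rule in RFC 3629 that U+D800 to U+DFFF are not encodable.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode a single Unicode code point to UTF-8 using bit operations."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        # Assumption from RFC 3629: surrogates are not valid in UTF-8.
        raise ValueError("surrogate code points cannot be encoded")

    if code_point <= 0x7F:        # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:       # 110xxxxx 10xxxxxx
        return bytes([
            0b11000000 | (code_point >> 6),
            0b10000000 | (code_point & 0b111111),
        ])
    if code_point <= 0xFFFF:      # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0b11100000 | (code_point >> 12),
            0b10000000 | ((code_point >> 6) & 0b111111),
            0b10000000 | (code_point & 0b111111),
        ])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0b11110000 | (code_point >> 18),
        0b10000000 | ((code_point >> 12) & 0b111111),
        0b10000000 | ((code_point >> 6) & 0b111111),
        0b10000000 | (code_point & 0b111111),
    ])


# Compare the sketch with Python's built-in encoder.
for ch in "Aé€😀":
    assert utf8_encode(ord(ch)) == ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {utf8_encode(ord(ch)).hex(' ')}")
```

Each branch shifts the code point right to pick out a 6-bit group, masks it, and ORs in the marker bits for that byte position.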
Step-by-Step Encoding Examples
1-byte: "A" (U+0041)
Code point: U+0041 = 65 = 1000001 in binary
Range: U+0000–U+007F → 1 byte
Pattern: 0xxxxxxx
Fill in: 0 1000001
↓
Byte: 01000001 = 0x41
"A" in UTF-8 = 0x41 (identical to ASCII!)
2-byte: "é" (U+00E9)
Code point: U+00E9 = 233 = 11101001 in binary
Range: U+0080–U+07FF → 2 bytes
Pattern: 110xxxxx 10xxxxxx
Split bits: 00011 101001
Fill in: 110 00011 10 101001
↓ ↓
Bytes: 0xC3 0xA9
"é" in UTF-8 = 0xC3 0xA9
3-byte: "€" (U+20AC)
Code point: U+20AC = 8364 = 10000010101100 in binary
Range: U+0800–U+FFFF → 3 bytes
Pattern: 1110xxxx 10xxxxxx 10xxxxxx
Split bits: 0010 000010 101100
Fill in: 1110 0010 10 000010 10 101100
↓ ↓ ↓
Bytes: 0xE2 0x82 0xAC
"€" in UTF-8 = 0xE2 0x82 0xAC
4-byte: "😀" (U+1F600)
Code point: U+1F600 = 128512 = 11111011000000000 in binary
Range: U+10000–U+10FFFF → 4 bytes
Pattern: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
Split bits: 000 011111 011000 000000
Fill in: 11110 000 10 011111 10 011000 10 000000
↓ ↓ ↓ ↓
Bytes: 0xF0 0x9F 0x98 0x80
"😀" in UTF-8 = 0xF0 0x9F 0x98 0x80
Why UTF-8 Won
UTF-8 became the dominant encoding for several compelling reasons:
- Backward compatible with ASCII — Any ASCII text is already valid UTF-8. This meant billions of existing documents worked without conversion.
- Space efficient — English text (the bulk of early web content) uses just 1 byte per character, same as ASCII. Other scripts use only what they need.
- No byte-order issues — Unlike UTF-16 and UTF-32, UTF-8 doesn't need a BOM (Byte Order Mark) because byte order is unambiguous.
- Self-synchronizing — You can jump to any byte and find the next character boundary. This makes random access and error recovery reliable.
- No null bytes in text — The only way to get a 0x00 byte is the actual NUL character, which means C-style string functions work correctly.
- Universal — It can encode every Unicode character, from basic Latin to emoji to ancient scripts.
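The self-synchronizing property from the list above is easy to sketch in Python. `next_boundary` is a hypothetical helper, and it assumes the input is otherwise valid UTF-8. It skips continuation bytes (those of the form 10xxxxxx) until it reaches a byte that can start a character.

```python
def next_boundary(data: bytes, pos: int) -> int:
    """Return the index of the first character start at or after pos."""
    while pos < len(data) and (data[pos] & 0b11000000) == 0b10000000:
        pos += 1  # continuation byte, keep scanning
    return pos


data = "a€b".encode("utf-8")       # bytes: 61 e2 82 ac 62
start = next_boundary(data, 2)      # index 2 is in the middle of "€"
print(start, data[start:].decode("utf-8"))
```

Because a character is at most 4 bytes long, the scan never has to skip more than 3 bytes in valid input.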
UTF-8 vs UTF-16 vs UTF-32
Unicode has three main encodings. Here's how they compare:
| Feature | UTF-8 | UTF-16 | UTF-32 |
|---|---|---|---|
| Bytes per char | 1–4 | 2 or 4 | 4 (always) |
| ASCII compatible | Yes | No | No |
| Byte order issue | No | Yes (BOM or explicit LE/BE label) | Yes (BOM or explicit LE/BE label) |
| "Hello" size | 5 bytes | 10 bytes | 20 bytes |
| Used in | Web, Linux, macOS, JSON | Windows, Java, JavaScript | Internal processing |
| Web usage | 98%+ | ~0.01% | Virtually 0% |
UTF-16 is notable because JavaScript and Java use it internally for strings. This is why JavaScript's .length returns 2 for a single emoji such as 😀: any character above U+FFFF needs a UTF-16 "surrogate pair" (two 16-bit code units), and .length counts code units rather than characters.
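The size differences can be checked from Python, which counts code points rather than UTF-16 code units. In this sketch `utf16_units` is a hypothetical helper that mimics what JavaScript's .length reports. The "-le" codec variants are used so that no BOM is added to the byte counts.

```python
def utf16_units(text: str) -> int:
    """Count 16-bit code units, the quantity JavaScript's .length reports."""
    return len(text.encode("utf-16-le")) // 2


for sample in ["Hello", "café", "😀"]:
    print(
        sample,
        "code points:", len(sample),
        "UTF-16 units:", utf16_units(sample),
        "UTF-8 bytes:", len(sample.encode("utf-8")),
        "UTF-32 bytes:", len(sample.encode("utf-32-le")),
    )
```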
UTF-8 in Code
JavaScript
// Encode string to UTF-8 bytes
const encoder = new TextEncoder();
const bytes = encoder.encode("Hello €");
console.log(bytes);
// Uint8Array [72, 101, 108, 108, 111, 32, 226, 130, 172]
// "€" takes 3 bytes: 0xE2, 0x82, 0xAC
// Decode UTF-8 bytes back to string
const decoder = new TextDecoder("utf-8");
const text = decoder.decode(bytes);
console.log(text); // "Hello €"
// ⚠️ String length vs byte length
"Hello".length; // 5 (5 bytes in UTF-8)
"café".length; // 4 characters, but 5 bytes in UTF-8
"😀".length; // 2 (JS uses UTF-16 internally!)
Python
# Encode string to UTF-8 bytes
text = "Hello €"
utf8_bytes = text.encode("utf-8")
print(utf8_bytes) # b'Hello \xe2\x82\xac'
print(len(utf8_bytes)) # 9 bytes
# Decode UTF-8 bytes to string
decoded = utf8_bytes.decode("utf-8")
print(decoded) # "Hello €"
# Character vs byte length
len("café") # 4 characters
len("café".encode("utf-8")) # 5 bytes (é = 2 bytes)
HTML
<!-- Always declare UTF-8 in your HTML -->
<meta charset="UTF-8">
<!-- This should be the FIRST element inside <head> -->
<!-- Place it before <title> or any other content -->
Common UTF-8 Problems
Most encoding bugs happen when text is read using the wrong encoding:
❌ Mojibake (garbled text)
When UTF-8 text is read as Latin-1 (ISO-8859-1) or Windows-1252, each byte of a multi-byte character is shown as a separate character, so you get garbage like Ã© instead of é, or â‚¬ instead of €.
Fix: Ensure both the writer and reader agree on UTF-8 encoding.
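Mojibake is easy to reproduce in Python, which can help when diagnosing it. The helper names `misread_as` and `repair` are hypothetical. The repair step is only a sketch and works only when the wrong decode lost no information.

```python
def misread_as(text: str, wrong_encoding: str) -> str:
    """Encode text as UTF-8, then decode the bytes with the wrong encoding."""
    return text.encode("utf-8").decode(wrong_encoding)


def repair(garbled: str, wrong_encoding: str) -> str:
    """Undo the wrong decode, assuming every byte survived the round trip."""
    return garbled.encode(wrong_encoding).decode("utf-8")


garbled_e = misread_as("é", "latin-1")     # two characters: U+00C3 U+00A9
garbled_euro = misread_as("€", "cp1252")   # three characters
print(garbled_e, garbled_euro)
print(repair(garbled_e, "latin-1"), repair(garbled_euro, "cp1252"))
```

The length of the garbled string matches the UTF-8 byte count of the original character, which is a useful clue when you see this pattern in real data.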
❌ Replacement characters (�)
The replacement character � (U+FFFD) appears when a decoder encounters bytes that aren't valid UTF-8.
Fix: Check if the source file was actually saved as UTF-8, not some other encoding.
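A quick way to check this in Python is to decode strictly and see whether it fails. In this sketch `is_valid_utf8` and `decode_lenient` are hypothetical helper names. The sample bytes are "café" saved as Latin-1, where é is the single byte 0xE9.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 without errors."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def decode_lenient(data: bytes) -> str:
    """Decode as UTF-8, substituting U+FFFD for invalid bytes."""
    return data.decode("utf-8", errors="replace")


latin1_bytes = "café".encode("latin-1")    # b'caf\xe9'
print(is_valid_utf8(latin1_bytes))          # the lone 0xE9 is not valid UTF-8
print(decode_lenient(latin1_bytes))         # shows the replacement character
print(is_valid_utf8("café".encode("utf-8")))
```

Lenient decoding hides the problem rather than fixing it, so a strict decode is usually the better choice at input boundaries.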
❌ BOM causing issues
Some editors (notably older versions of Windows Notepad) add a Byte Order Mark (the bytes EF BB BF) at the start of UTF-8 files. This can break JSON parsers, shell scripts, and PHP files.
Fix: Save files as "UTF-8 without BOM" or use a BOM-aware tool.
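In Python, one BOM-aware option is the standard "utf-8-sig" codec, which strips a leading BOM if present and otherwise behaves like plain UTF-8. The sketch below uses a made-up JSON payload, and `decode_without_bom` is a hypothetical helper name.

```python
import codecs
import json


def decode_without_bom(data: bytes) -> str:
    """Decode UTF-8 bytes, dropping a leading BOM if there is one."""
    return data.decode("utf-8-sig")


with_bom = codecs.BOM_UTF8 + b'{"ok": true}'   # EF BB BF followed by JSON

# Plain UTF-8 decoding keeps the BOM as U+FEFF, which json.loads rejects.
try:
    json.loads(with_bom.decode("utf-8"))
except ValueError as exc:
    print("parse failed:", exc)

print(json.loads(decode_without_bom(with_bom)))
```

The same codec name works with `open(path, encoding="utf-8-sig")` when reading files.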
❌ String length ≠ byte length
A 10-character string might be 10 bytes (ASCII) or 40 bytes (emoji). Wherever a limit is counted in bytes, as VARCHAR(10) is in some database configurations, multi-byte text can be rejected or cut off in the middle of a character.
Fix: Use character-based limits, not byte-based limits. In MySQL, use utf8mb4 instead of utf8.
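When a byte limit cannot be avoided, text can at least be cut on a character boundary. This is a minimal sketch, with `truncate_utf8` as a hypothetical helper name. It works on code points, so it can still separate a base letter from a combining mark or split a multi-part emoji.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Cut text so its UTF-8 encoding fits in max_bytes without splitting a character."""
    if max_bytes < 0:
        raise ValueError("max_bytes must be non-negative")
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # The slice may end inside a multi-byte sequence; errors="ignore"
    # drops that incomplete tail instead of raising.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


print(len("😀" * 10), len(("😀" * 10).encode("utf-8")))   # characters vs bytes
print(truncate_utf8("café", 4))                            # é would need bytes 4 and 5
print(truncate_utf8("😀😀", 5))
```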
Best Practices
Follow these rules to avoid encoding headaches:
- Always use UTF-8 — Unless you have a specific reason not to, UTF-8 should be your default for everything.
- Declare the encoding — Add <meta charset="UTF-8"> in HTML, and set Content-Type: text/html; charset=utf-8 in HTTP headers.
- Save files as UTF-8 — Configure your editor to save all files as UTF-8 (without BOM).
- Use utf8mb4 in MySQL — MySQL's utf8 stores at most 3 bytes per character. Use utf8mb4 for full Unicode support (including emoji).
- Distinguish characters from bytes — When counting, truncating, or splitting strings, work with characters/code points, not raw bytes.
- Test with non-ASCII text — Use strings like "naïve café résumé 日本語 🎉" in your test cases to catch encoding bugs early.
References
- Yergeau, F. (2003). UTF-8, a transformation format of ISO 10646. RFC 3629, IETF. https://datatracker.ietf.org/doc/html/rfc3629
- Pike, R. & Thompson, K. (2003). Hello World, or Καλημέρα κόσμε, or こんにちは 世界. UTF-8 history. https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt
- The Unicode Consortium. The Unicode Standard. https://www.unicode.org/standard/standard.html
- Mozilla Developer Network. TextEncoder — Web APIs. https://developer.mozilla.org/en-US/docs/Web/API/TextEncoder
- W3Techs. Usage statistics of character encodings for websites. https://w3techs.com/technologies/overview/character_encoding