January 17, 2026

What is UTF-8? The Encoding That Powers the Modern Web

UTF-8 is the encoding that lets one page contain English, Thai, Chinese, Arabic, mathematical symbols, and emoji without switching formats. Here is how it works and why it became the default for the modern web.

What is UTF-8?
Unicode vs UTF-8: What is the Difference?
How UTF-8 Encoding Works
Step-by-Step Encoding Examples
Why UTF-8 Won
UTF-8 vs UTF-16 vs UTF-32
UTF-8 in Code
Common UTF-8 Problems
Best Practices

What is UTF-8?

UTF-8, short for Unicode Transformation Format - 8-bit, is a variable-width character encoding that can represent every character in the Unicode standard.

It stores each character using 1 to 4 bytes. Basic ASCII characters such as A, z, 5, and ! use one byte, while accented letters, Thai characters, Chinese characters, and emoji use more bytes as needed.
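
To make the 1-to-4-byte range concrete, here is a minimal Python sketch that encodes one sample character from each length class and counts the bytes. The sample characters are chosen only for illustration and are written as escapes so the exact code points are unambiguous.

```python
# Illustrative samples: A, e with acute, Thai ko kai, grinning face.
samples = ["A", "\u00e9", "\u0e01", "\U0001F600"]

# Map each character to the number of bytes its UTF-8 encoding uses.
byte_counts = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, count in byte_counts.items():
    print(f"U+{ord(ch):04X} -> {count} byte(s)")
```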

UTF-8 was designed in 1992 by Ken Thompson and Rob Pike. Today it is the dominant text encoding for websites, APIs, source code, databases, and configuration files.

Unicode vs UTF-8: What is the Difference?

People often mix up Unicode and UTF-8, but they solve different parts of the same problem:

Unicode is a character set: a huge catalog that assigns a unique code point to each character. For example, A is U+0041 and the Euro sign is U+20AC.
UTF-8 is an encoding: a practical way to turn Unicode code points into bytes that computers can store, send, and read.

A useful analogy: Unicode is the dictionary of characters and numbers, while UTF-8 is the delivery format that packs those numbers into bytes. Other Unicode encodings exist, such as UTF-16 and UTF-32, but UTF-8 became the web standard because it is compact and backward compatible with ASCII.
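
The split between code point and encoding can be seen in a few lines of Python. This is only a sketch: it takes the Euro sign from the example above and shows that the code point stays the same while the bytes depend on the encoding chosen. The big-endian variants are used here so that Python does not prepend a BOM.

```python
ch = "\u20ac"  # Euro sign, U+20AC

code_point = ord(ch)            # the Unicode number, independent of any encoding
utf8 = ch.encode("utf-8")       # 3 bytes
utf16 = ch.encode("utf-16-be")  # 2 bytes, big-endian, no BOM
utf32 = ch.encode("utf-32-be")  # 4 bytes, big-endian, no BOM

print(f"U+{code_point:04X}")
print("UTF-8: ", utf8.hex(" "))
print("UTF-16:", utf16.hex(" "))
print("UTF-32:", utf32.hex(" "))
```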

How UTF-8 Encoding Works

UTF-8 uses a variable-width byte pattern. The number of bytes depends on the Unicode code point value:

Code Point Range	Bytes	Byte Pattern	Example
U+0000 - U+007F	1	`0xxxxxxx`	A, z, 5, !
U+0080 - U+07FF	2	`110xxxxx 10xxxxxx`	é, ñ, ü
U+0800 - U+FFFF	3	`1110xxxx 10xxxxxx 10xxxxxx`	ก, €, 中, ✓
U+10000 - U+10FFFF	4	`11110xxx 10xxxxxx 10xxxxxx 10xxxxxx`	emoji and rare historic scripts
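
The ranges in the table translate into a handful of comparisons. The helper below is a sketch, and `utf8_length` is a hypothetical name rather than a standard library function. For simplicity it does not reject the surrogate range U+D800 to U+DFFF, which UTF-8 cannot encode.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 needs for a code point (illustrative helper)."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("outside the Unicode range")
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4


for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    print(f"U+{cp:04X}: {utf8_length(cp)} byte(s)")
```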

The leading bits of each byte tell a decoder what kind of byte it is:

Starts with 0: a single-byte ASCII character.
Starts with 110: first byte of a 2-byte character.
Starts with 1110: first byte of a 3-byte character.
Starts with 11110: first byte of a 4-byte character.
Starts with 10: continuation byte, not the start of a character.
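
The leading-bit rules can be expressed as simple shifts. The function below is a sketch with a hypothetical name, `classify_byte`, and it labels a single byte by its prefix.

```python
def classify_byte(b: int) -> str:
    """Describe a byte by its leading bits (illustrative helper)."""
    if (b >> 7) == 0b0:
        return "ascii"
    if (b >> 6) == 0b10:
        return "continuation"
    if (b >> 5) == 0b110:
        return "lead-2"
    if (b >> 4) == 0b1110:
        return "lead-3"
    if (b >> 3) == 0b11110:
        return "lead-4"
    return "invalid"


# "A", e with acute, Euro sign: 1-byte, 2-byte, and 3-byte characters.
kinds = [classify_byte(b) for b in "A\u00e9\u20ac".encode("utf-8")]
print(kinds)
```

This checks only the prefix. It does not reject bytes such as 0xC0 or 0xF5, which fit a pattern but never appear in valid UTF-8.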

This design makes UTF-8 self-synchronizing. If a program starts reading in the middle of a byte stream, it can scan forward until it finds a byte that does not start with 10.
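
As a rough illustration of that recovery step, the sketch below skips continuation bytes until it reaches the start of the next character. The name `next_boundary` is hypothetical, and the sample data is made up for the example.

```python
def next_boundary(data: bytes, pos: int) -> int:
    """Return the first index >= pos that is not a continuation byte."""
    while pos < len(data) and (data[pos] & 0b1100_0000) == 0b1000_0000:
        pos += 1
    return pos


data = "a\u20acb".encode("utf-8")  # bytes: 61 | E2 82 AC | 62
start = next_boundary(data, 2)     # index 2 lands in the middle of the Euro sign
tail = data[start:].decode("utf-8")
print(start, tail)
```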

Step-by-Step Encoding Examples

1-byte: "A" (U+0041)

Code point: U+0041 = 65 = 1000001 in binary
Range: U+0000-U+007F -> 1 byte
Pattern: 0xxxxxxx

Fill in: 0 1000001
Byte:    01000001 = 0x41

"A" in UTF-8 = 0x41 (identical to ASCII)

2-byte: accented e (U+00E9)

Code point: U+00E9 = 233 = 11101001 in binary
Range: U+0080-U+07FF -> 2 bytes
Pattern: 110xxxxx 10xxxxxx

Split bits: 00011  101001
Fill in:    11000011 10101001
Bytes:      0xC3     0xA9

"e with acute" in UTF-8 = 0xC3 0xA9

3-byte: Euro sign (U+20AC)

Code point: U+20AC = 8364 = 10000010101100 in binary
Range: U+0800-U+FFFF -> 3 bytes
Pattern: 1110xxxx 10xxxxxx 10xxxxxx

Split bits: 0010  000010  101100
Fill in:    11100010 10000010 10101100
Bytes:      0xE2     0x82     0xAC

"Euro sign" in UTF-8 = 0xE2 0x82 0xAC

4-byte: grinning face emoji (U+1F600)

Code point: U+1F600 = 128512 = 11111011000000000 in binary
Range: U+10000-U+10FFFF -> 4 bytes
Pattern: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Split bits: 000  011111  011000  000000
Fill in:    11110000 10011111 10011000 10000000
Bytes:      0xF0     0x9F     0x98     0x80

"grinning face emoji" in UTF-8 = 0xF0 0x9F 0x98 0x80

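The four walkthroughs follow the same recipe: pick the pattern for the range, then fill in the code point's bits six at a time from the right. Here is a hand-written encoder along those lines, checked against Python's built-in codec. It is a teaching sketch with a hypothetical function name, not a replacement for `str.encode`.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one code point by hand using the UTF-8 byte patterns."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if cp <= 0x7F:
        return bytes([cp])                                   # 0xxxxxxx
    if cp <= 0x7FF:
        return bytes([0b1100_0000 | (cp >> 6),               # 110xxxxx
                      0b1000_0000 | (cp & 0b11_1111)])       # 10xxxxxx
    if cp <= 0xFFFF:
        return bytes([0b1110_0000 | (cp >> 12),              # 1110xxxx
                      0b1000_0000 | ((cp >> 6) & 0b11_1111),
                      0b1000_0000 | (cp & 0b11_1111)])
    return bytes([0b1111_0000 | (cp >> 18),                  # 11110xxx
                  0b1000_0000 | ((cp >> 12) & 0b11_1111),
                  0b1000_0000 | ((cp >> 6) & 0b11_1111),
                  0b1000_0000 | (cp & 0b11_1111)])


for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    manual = encode_code_point(cp)
    assert manual == chr(cp).encode("utf-8")
    print(f"U+{cp:04X} -> {manual.hex(' ')}")
```
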
Why UTF-8 Won

UTF-8 became the dominant encoding for several practical reasons:

Backward compatible with ASCII - existing ASCII documents are already valid UTF-8.
Space efficient - English text uses one byte per character, while other scripts use only the bytes they need.
No byte-order issues - UTF-8 is a sequence of single bytes, so it has no big-endian or little-endian variants, unlike UTF-16 and UTF-32.
Self-synchronizing - programs can recover character boundaries after errors or random seeking.
Works well with C-style strings - the byte 0x00 appears only when encoding U+0000, so ordinary text never contains an embedded null terminator.
Universal - it can represent every Unicode character, from basic Latin to Thai text to emoji.
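
Several of these properties are easy to check for yourself. The sketch below uses arbitrary sample strings to look at ASCII compatibility, the absence of low bytes inside multi-byte sequences, and the byte-order difference that UTF-16 has and UTF-8 does not.

```python
ascii_bytes = "Hello, world!".encode("ascii")

# ASCII bytes decode unchanged as UTF-8.
same_text = ascii_bytes.decode("utf-8")

# Every byte of a multi-byte sequence is 0x80 or higher, so no stray
# NUL (0x00) or ASCII look-alike byte can appear inside one.
multibyte = "\u00e9\u20ac\U0001F600".encode("utf-8")
all_high = all(b >= 0x80 for b in multibyte)

# UTF-16 has two byte orders; UTF-8 has only one form.
utf16_le = "A".encode("utf-16-le")  # b'A\x00'
utf16_be = "A".encode("utf-16-be")  # b'\x00A'

print(same_text, all_high, utf16_le, utf16_be)
```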

UTF-8 vs UTF-16 vs UTF-32

Unicode has three main encodings. Their tradeoffs look like this:

Feature	UTF-8	UTF-16	UTF-32
Bytes per code point	1-4	2 or 4	4 always
ASCII compatible	Yes	No	No
Byte order issue	No	Yes, may need BOM	Yes, may need BOM
"Hello" size	5 bytes	10 bytes	20 bytes
Common use	Web, Linux, macOS, JSON	Windows APIs, Java, JavaScript internals	Internal processing
Web usage	Dominant	Very rare	Virtually none

UTF-16 is still important because JavaScript and Java expose strings as sequences of 16-bit code units. A character outside the Basic Multilingual Plane, such as most emoji, takes two code units (a surrogate pair), so string length and visible character count can differ.
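
Python counts code points rather than UTF-16 code units, which makes it a convenient place to compare the different "lengths" of one string. In this sketch the UTF-16 figure is derived by encoding and dividing by two, and it should match what JavaScript's `.length` would report for the same text.

```python
text = "hi\U0001F600"  # two ASCII letters plus U+1F600

code_points = len(text)                           # Python counts code points
utf16_units = len(text.encode("utf-16-le")) // 2  # the emoji needs a surrogate pair
utf8_bytes = len(text.encode("utf-8"))            # 1 + 1 + 4

print(code_points, utf16_units, utf8_bytes)
```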

UTF-8 in Code

Most modern languages and browsers include built-in UTF-8 tools:

JavaScript

// Encode string to UTF-8 bytes
const encoder = new TextEncoder();
const bytes = encoder.encode("Hello €");
console.log(bytes);
// Uint8Array [72, 101, 108, 108, 111, 32, 226, 130, 172]

// Decode UTF-8 bytes back to string
const decoder = new TextDecoder("utf-8");
const text = decoder.decode(bytes);
console.log(text);  // "Hello €"

// String length (.length counts UTF-16 code units) is not always byte length
"Hello".length;  // 5
new TextEncoder().encode("Hello").length;  // 5
new TextEncoder().encode("cafe").length;   // 4
new TextEncoder().encode("café").length;   // 5

Python

# Encode string to UTF-8 bytes
text = "Hello €"
utf8_bytes = text.encode("utf-8")
print(utf8_bytes)       # b'Hello \xe2\x82\xac'
print(len(utf8_bytes))  # 9 bytes

# Decode UTF-8 bytes to string
decoded = utf8_bytes.decode("utf-8")
print(decoded)  # "Hello €"

# Character vs byte length
len("café")                    # 4 characters
len("café".encode("utf-8"))    # 5 bytes

HTML

<!-- Always declare UTF-8 in HTML -->
<meta charset="UTF-8">

<!-- Put it early inside <head>: the declaration must fall within the first 1024 bytes of the document -->

Common UTF-8 Problems

Most encoding bugs happen when bytes are read using the wrong encoding or when code confuses bytes with characters:

Mojibake (garbled text)

When UTF-8 bytes are decoded as Latin-1 or Windows-1252, each multi-byte character turns into two or more unrelated characters; for example, é shows up as Ã©. Fix it by making sure the writer and reader both use UTF-8.
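
Mojibake is easy to reproduce on purpose, which helps when diagnosing it. The sketch below decodes UTF-8 bytes with the wrong codec and then reverses the mistake. The repair step assumes no bytes were lost or altered along the way, which is not always true for real data.

```python
original = "caf\u00e9"

# Wrong: UTF-8 bytes read as Windows-1252.
garbled = original.encode("utf-8").decode("cp1252")

# Undo the mistake by reversing the wrong decode, then decoding correctly.
repaired = garbled.encode("cp1252").decode("utf-8")

print(garbled, "->", repaired)
```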

Replacement characters

The replacement character U+FFFD appears when a decoder finds bytes that are not valid UTF-8, usually because the data was written in another encoding or cut off in the middle of a character. Check the original file encoding and avoid double-decoding.
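
The following sketch shows where U+FFFD comes from in Python. Latin-1 bytes are fed to a UTF-8 decoder, first with `errors="replace"` and then with the default strict mode, which raises an exception instead of substituting.

```python
latin1_bytes = "caf\u00e9".encode("latin-1")  # b'caf\xe9', not valid UTF-8

# errors="replace" substitutes U+FFFD for the undecodable byte.
lossy = latin1_bytes.decode("utf-8", errors="replace")

# The default (errors="strict") raises instead of guessing.
try:
    latin1_bytes.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(repr(lossy), strict_failed)
```

Strict decoding is usually the safer default, because a replaced character cannot be recovered later.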

BOM surprises

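Some tools add a UTF-8 Byte Order Mark (the bytes EF BB BF) at the start of files. UTF-8 has no byte order to mark, so the BOM serves only as an encoding signature, and it can break JSON, shell scripts, or strict parsers. Save as UTF-8 without BOM when possible.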
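
When you cannot control how a file was saved, one option in Python is the `utf-8-sig` codec, which strips a leading BOM if present. The payload below is made up for the example.

```python
import codecs
import json

payload = codecs.BOM_UTF8 + b'{"ok": true}'  # EF BB BF followed by JSON

# Plain "utf-8" keeps the BOM as U+FEFF at the start of the string.
with_bom = payload.decode("utf-8")

# The json module refuses text that starts with a BOM.
try:
    json.loads(with_bom)
    bom_rejected = False
except json.JSONDecodeError:
    bom_rejected = True

# "utf-8-sig" strips a leading BOM if one is present.
clean = payload.decode("utf-8-sig")
data = json.loads(clean)

print(bom_rejected, data)
```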

Character length vs byte length

A string of 10 code points takes anywhere from 10 to 40 bytes in UTF-8. Use character-aware limits for UI and byte-aware limits for storage and network constraints.
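
For byte-aware limits, the cut has to land on a character boundary. One way to do that in Python is sketched below. The helper name `truncate_utf8` is hypothetical, and it relies on `errors="ignore"` to drop the incomplete sequence left at the cut point.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Cut text to at most max_bytes of UTF-8 without splitting a character."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # The input is valid UTF-8, so the only invalid part after slicing
    # is a partial character at the end, which "ignore" discards.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


print(truncate_utf8("caf\u00e9", 4))      # the 2-byte character does not fit
print(truncate_utf8("a\U0001F600", 3))    # the 4-byte emoji does not fit
```

This respects code point boundaries only. A visible character built from several code points, such as a letter plus a combining mark, can still be split.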

Best Practices

Follow these habits to avoid encoding problems:

Use UTF-8 by default unless you have a specific legacy requirement.
Declare the encoding with <meta charset="UTF-8"> in HTML and charset=utf-8 in HTTP headers.
Save source files as UTF-8, preferably without BOM for code and data files.
Use utf8mb4 in MySQL; the older utf8 (utf8mb3) character set stores at most 3 bytes per character and cannot hold emoji or other 4-byte characters.
Do not split raw UTF-8 bytes blindly; split by characters, code points, or grapheme clusters when user-visible text matters.
Test with multilingual text so bugs appear before users paste real-world names, addresses, and messages.
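
In application code, the first habit mostly comes down to naming the encoding explicitly instead of relying on a platform default. Here is a small Python sketch that round-trips multilingual text through a temporary file. The file name is a placeholder.

```python
import tempfile
from pathlib import Path

text = "na\u00efve \u0e01 \U0001F600"  # Latin with diaeresis, Thai, emoji

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.txt"  # placeholder file name
    # Pass encoding="utf-8" on both sides so the platform default never applies.
    path.write_text(text, encoding="utf-8")
    round_trip = path.read_text(encoding="utf-8")
    size_on_disk = path.stat().st_size

print(round_trip == text, len(text), size_on_disk)
```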

References

Yergeau, F. (2003). UTF-8, a transformation format of ISO 10646. RFC 3629, IETF. https://datatracker.ietf.org/doc/html/rfc3629
Pike, R. & Thompson, K. (2003). UTF-8 history. https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt
The Unicode Consortium. The Unicode Standard. https://www.unicode.org/standard/standard.html
Mozilla Developer Network. TextEncoder - Web APIs. https://developer.mozilla.org/en-US/docs/Web/API/TextEncoder
W3Techs. Usage statistics of character encodings for websites. https://w3techs.com/technologies/overview/character_encoding
